Common contracted forms are missing from the English stop word list #22

Closed
DavidNemeskey opened this issue Feb 10, 2015 · 7 comments

Comments

@DavidNemeskey

While the list contains s and t (most likely because they can occur after an apostrophe in forms such as dog's and can't), other common forms are missing:

  • d as in she'd,
  • ll as in we'll,
  • m as in I'm,
  • o as in o'clock,
  • re as in you're,
  • ve as in they've,
  • y as in y'all.

Also missing are the parts of these contractions that fall to the left of the apostrophe, e.g. ain (but don is there).

Of course, the absence of these forms could be justified on the grounds that if the tokenizer does not split at apostrophes, they will never occur in the tokenized text. However, that is a strong assumption, especially given that NLTK's own Punkt tokenizer, for instance, does split at apostrophes. Also, some contractions are already handled (don't, can't, the possessive s), so it makes little sense to leave out the rest.
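
To illustrate the problem, here is a minimal sketch. It does not call NLTK: `tokenize` is a hypothetical stand-in that splits text into runs of word characters and runs of punctuation, and `STOP_WORDS` is a small made-up subset of a stop word list that, like the current one, has s, t and don but not d or ll.

```python
import re

# Hypothetical stand-in tokenizer (an assumption, not NLTK's implementation):
# runs of word characters and runs of punctuation become separate tokens,
# so an apostrophe always splits a contraction.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

# Illustrative subset only: contains s, t and don, but not d or ll.
STOP_WORDS = {"i", "we", "she", "but", "s", "t", "don", "not", "the"}

def content_tokens(text):
    """Keep alphabetic tokens that are not in the stop word subset."""
    return [tok for tok in tokenize(text)
            if tok.isalpha() and tok not in STOP_WORDS]

print(content_tokens("She'd say we'll go, but I don't."))
# ['d', 'say', 'll', 'go']
```

The fragments of don't are filtered out, while the d of she'd and the ll of we'll survive as spurious content words.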

@DavidNemeskey
Author

This issue can be solved by appending the following list of words to the English stop word list:

d
ll
m
o
re
ve
y
ain
aren
couldn
didn
doesn
hadn
hasn
haven
isn
ma
mightn
mustn
needn
shan
shouldn
wasn
weren
won
wouldn

Unfortunately, I don't know how to contribute data changes to this project; opening a PR for a zip file feels a bit strange.
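
Until the data file is updated, the same effect could be had on the user side. The following is only a sketch: `extend_stop_words` is a hypothetical helper, and `base` is assumed to be whatever list the user already loads (with NLTK, for example, the result of `stopwords.words("english")`).

```python
# Additions proposed in this issue: fragments to the right and to the left
# of the apostrophe.
EXTRA_STOP_WORDS = [
    "d", "ll", "m", "o", "re", "ve", "y",
    "ain", "aren", "couldn", "didn", "doesn", "hadn", "hasn", "haven",
    "isn", "ma", "mightn", "mustn", "needn", "shan", "shouldn",
    "wasn", "weren", "won", "wouldn",
]

def extend_stop_words(base, extra=EXTRA_STOP_WORDS):
    """Return base followed by every entry of extra not already present."""
    seen = set(base)
    result = list(base)
    for word in extra:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

# Example with a tiny stand-in for the existing list:
current = ["s", "t", "don", "d"]
print(len(extend_stop_words(current)))  # 29: 4 existing + 25 new entries
```

The helper preserves the order of the existing list and skips duplicates, so it is safe to apply even after some of the forms have been added upstream.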

@stevenbird self-assigned this Feb 28, 2016
@stevenbird added this to the 3.2 milestone Feb 28, 2016
@stevenbird
Member

Thanks @DavidNemeskey, and sorry for the long delay.

@aellenhicks

Why is 'ma' on the list? I tried searching for contractions with 'ma' and only came up with 'ma'am'.

@tsolakghukasyan

@aellenhicks also in "gran'ma", "Im'ma", "I'ma", I think.

@aellenhicks

Thanks!

@tenstriker

Why would "won" be part of the English stop word list? Splitting "won't" into "won" and "t" seems like the wrong approach.

@DavidNemeskey
Author

@tenstriker I completely agree, "won" is a meaningful word; I should not have added it to the list.

Maybe instead of a stop word list, an n-gram-based detection would be better, but I don't know if NLTK has that.
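
One possible reading of that idea, as a rough sketch only: look at token trigrams of the shape (word, apostrophe, fragment) and drop them when they rejoin into a known contraction, so that a standalone "won" is kept. The names `CONTRACTIONS` and `drop_contractions` are hypothetical, the contraction set is a small made-up sample, and the regex tokenizer is a stand-in rather than an NLTK call.

```python
import re

# Stand-in tokenizer (an assumption): word runs and punctuation runs.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

# Small illustrative sample of full contractions to treat as stop words.
CONTRACTIONS = {"won't", "don't", "can't", "i'm", "she'd", "we'll"}

def drop_contractions(text):
    """Remove (word, ', fragment) trigrams that form a known contraction."""
    tokens = TOKEN_RE.findall(text.lower())
    kept = []
    i = 0
    while i < len(tokens):
        trigram = tokens[i:i + 3]
        if (len(trigram) == 3 and trigram[1] == "'"
                and "".join(trigram) in CONTRACTIONS):
            i += 3  # skip the whole contraction
            continue
        kept.append(tokens[i])
        i += 1
    return kept

print(drop_contractions("We won the game, but they won't play again."))
# ['we', 'won', 'the', 'game', ',', 'but', 'they', 'play', 'again', '.']
```

Here the verb "won" survives while the pieces of "won't" are removed together, which a plain per-token stop word list cannot do.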