Add multiple language stopwords with customizable stop word paths #40

Merged
parkr merged 4 commits into jekyll:master on Sep 2, 2015

Conversation

kreynolds
Contributor

This adds support for stop words in multiple languages, as well as the ability to prepend a custom stopword path. I personally have a much larger English stopword list than the one that ships with this library, but I wanted to write everything in a completely backwards-compatible way.
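A minimal sketch of how a prepended custom path could take precedence over the bundled lists. The module name `StopwordLookup`, the constant `SEARCH_PATH`, both helper methods, and the one-file-per-language-code layout are assumptions made for illustration; this is not the code in the PR.

```ruby
require 'set'

# Hypothetical illustration; not the library's actual implementation.
module StopwordLookup
  # Directories searched in order; earlier entries win.
  SEARCH_PATH = []

  module_function

  # Return the first existing file named after the language code, or nil.
  def find_stopword_file(language)
    SEARCH_PATH.each do |dir|
      candidate = File.join(dir, language)
      return candidate if File.file?(candidate)
    end
    nil
  end

  # Load the words for a language as a Set; empty if no file is found.
  def load_stopwords(language)
    file = find_stopword_file(language)
    file ? Set.new(File.read(file).split) : Set.new
  end
end
```

With this layout, calling `StopwordLookup::SEARCH_PATH.unshift(custom_dir)` would make a custom `en` file shadow the bundled one, while languages without a custom file would still fall back to the bundled directory.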

@@ -6,8 +6,12 @@

  module ClassifierReborn
    module Hasher
+     @stopwords_path = [File.expand_path(File.dirname(__FILE__) + '/../../../data/stopwords')]
Member


make a const?

Contributor Author


Sure, consts are fun.

@parkr
Member

parkr commented Sep 1, 2015

This is really awesome! How much does this weigh?

Will wait for @Ch4s3's input.

@kreynolds
Contributor Author

So when you say 'weigh', you mean does this slow things down any?

@parkr
Member

parkr commented Sep 2, 2015

@kreynolds Yes, but also the byte size increase in downloading the gem.

@kreynolds
Contributor Author

It's not any slower; it just takes up slightly more memory if you are classifying multiple languages in the same runtime. There are 25 KB of stopwords across all of the languages put together. It's probably worth noting that after this patch I have another set of patches to improve performance, particularly around the Hasher (a 300% speedup, give or take).
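To illustrate the memory point, here is a rough sketch of per-language lazy caching, where a language's list is only held in memory once it has been requested. The class `StopwordCache`, its methods, and the loader block (which stands in for reading a stopword file) are hypothetical and are not taken from the patch.

```ruby
require 'set'

# Hypothetical illustration; the loader block stands in for reading a file.
class StopwordCache
  attr_reader :loads

  def initialize(&loader)
    @loads = 0
    # The default block runs only on a cache miss and stores the result.
    @cache = Hash.new do |hash, language|
      @loads += 1
      hash[language] = Set.new(loader.call(language))
    end
  end

  # True if the word is in the (lazily loaded) list for the language.
  def stopword?(language, word)
    @cache[language].include?(word)
  end

  # Languages whose lists are currently resident in memory.
  def loaded_languages
    @cache.keys
  end
end

cache = StopwordCache.new { |lang| lang == 'en' ? %w[a the of] : %w[le la de] }
cache.stopword?('en', 'the')   # first call loads the English list
cache.stopword?('en', 'ruby')  # later calls reuse the cached Set
cache.loaded_languages         # only the languages requested so far
```

Under this assumption, a process that only ever classifies English text would never pay for the other languages' lists, and lookups after the first are plain Set membership checks.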

@@ -18,22 +18,22 @@ def without_punctuation(str)

  # Return a Hash of strings => ints. Each word in the string is stemmed,
  # interned, and indexes to its frequency in the document.
- def word_hash(str)
-   word_hash = clean_word_hash(str)
+ def word_hash(str, language='en')
Member


Nit: we follow the GitHub Ruby Styleguide for new code. It states:

Use spaces around the = operator when assigning default values to method parameters.

Would you mind updating your changes to match this?
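For illustration, the two spellings side by side. The method names here are made up and both methods behave identically; only the spacing around the default value differs.

```ruby
# Without spaces around the default (the form flagged in the nit).
def tag_compact(word, language='en')
  "#{language}:#{word}"
end

# With spaces around the default, as the quoted style rule asks.
def tag_spaced(word, language = 'en')
  "#{language}:#{word}"
end
```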

@kreynolds
Contributor Author

Done, and I removed Indonesian, which was empty.

@parkr
Member

parkr commented Sep 2, 2015

LGTM. @Ch4s3? Please merge and update the history if you think it's good to merge.

@Ch4s3
Member

Ch4s3 commented Sep 2, 2015

Looks great. I'll merge it in as soon as I have time to update the history.

parkr added a commit that referenced this pull request Sep 2, 2015
@parkr parkr merged commit be41227 into jekyll:master Sep 2, 2015
parkr added a commit that referenced this pull request Sep 2, 2015
@parkr
Member

parkr commented Sep 2, 2015

👍


3 participants