Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory and Redis backend support #84

Merged
merged 14 commits into from
Jan 5, 2017
Merged

Memory and Redis backend support #84

merged 14 commits into from
Jan 5, 2017

Conversation

ibnesayeed
Copy link
Contributor

@ibnesayeed ibnesayeed commented Dec 22, 2016

This PR is to support #81

TODO

  • Refactor Bayes classifier code to extract all the persistent states and data structures away
  • Conditional initialization of the backends (Memory or Redis)
  • Meaningful proxy read and write method names to allow uniform backend API
  • Change in data structure for better generalization
  • Implementation of the Memory backend
  • Passing existing Bayes Memory tests
  • Implementation of the Redis backend
  • Writing Bayes tests for Redis backend
  • DRY test cases for both the backends
  • Writing tests for both backend classes
  • Perform integration test
  • Add Redis in Travis CI
  • Update README
  • Code review

Copy link
Member

@parkr parkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start! Left a few comments to align with the Jekyll coding style and some helpful Ruby tips. 😄

@@ -0,0 +1,72 @@
# Author: Sawood Alam <@ibnesayeed>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any author comments in our code – your authorship is present in the Git history and here on GitHub. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be gone. I usually don't add such things, but I have seen many other files here that have Author, License, and some other metadata associated in comments, so I thought it was the style of this code base.

end

def total_words
@total_words
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these, you can use the built-in attr_reader.

Replace this method with

attr_reader :total_words

It's convenient short-hand for

def total_words
  @total_words
end

@total_words += diff
end

def total_trainings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use attr_reader here as well. And they can be combined:

attr_reader :total_words, :total_trainings

Just a comma-separated list of symbols, where the symbol is the name of the variable and the name of the method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am aware of attr_reader, but I thought in other backends they wont be attributes. So for the sake of uniformity, I used a more usual method definition. However, I can certainly change it in the Memory backend in the more concise form.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be unlikely that any future backend wouldn't have these available as attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ch4s3, in the Redis backend they are not plain in-memory attributes. They are stored in the Redis instance. Keeping them in memory as plain attributes would require to traverse through all the records and when initialized from an existing dataset. This was the whole point of using another backend to keep the counters and other state information in the backend store while keeping the configurations and other similar things in the memory.

@categories[category] ||= Hash.new(0)
end

def category_keys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be categories? They keys of the @categories Hash correspond to the categories that the user has input, so the user might like to write categories to get the list of categories that have data.

Copy link
Contributor Author

@ibnesayeed ibnesayeed Dec 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the bayes.rb file, there is a method named categories that returns a more natural list of category names. However, in the data structure, those categories are converted to symbols (see CategoryNamer class: name.to_s.tr('_', ' ').capitalize.intern). I want to keep the transformations and logic away from the storage (backend) class. So, the category_keys method returns the form that was stored in the data structure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

options = { language: 'en',
auto_categorize: false,
enable_threshold: false,
threshold: 0.0,
enable_stemmer: true
enable_stemmer: true,
redis_con_str: ''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best option here is to use dependency injection. Thus the user does:

memory_backed_classifier = ClassifierReborn::Bayes.new
redis_backed_classifier = ClassifierReborn::Bayes.new(:backend => ClassifierReborn::BayesRedisBackend.new("localhost", "6987"))

So they pass in a :backend key which is an instance of the relevant backend they wish to use. Then they can configure Redis and we don't have to worry about any backend configurations in this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked this approach, thanks! Initially, I thought about letting users directly create instance of the Redis class and pass that as for connection, but that would be too much abstraction. I will go ahead and make the Memory backend as default and let the user pass an instance of the backend to allow override.

@ibnesayeed
Copy link
Contributor Author

@parkr, I have taken care of all your comments. Please have a look at the new changes and suggest for more changes if necessary. Meanwhile, I can work on the Redis backend.

@ibnesayeed
Copy link
Contributor Author

All tests are passing in both Memory and Redis backends. If a Redis server is not running on the same machine with default configs then Redis backend tests are omitted.

Currently, I have just copied the BayesianTest class to BayesianRedisTest, updated the setup method, and added cleanup method (because Redis is persistent). I was thinking of somehow not repeating the test cases and run all the test cases of BayesianTest twice with different configurations, but can't see a clean way of doing this. I also tried inheriting BayesianRedisTest from BayesianTest and only overriding the setup and cleanup methods while expecting that all the test_* methods will be inherited and run, but it looks like that's not how tests run. Any other approach to DRY up the test cases?

@ibnesayeed
Copy link
Contributor Author

ibnesayeed commented Dec 23, 2016

Here is an article about DRYing up the test cases so that they can be used as modules, but it is for Minitest. We want similar capability in our test environment. Still looking for my options.

@ibnesayeed
Copy link
Contributor Author

I was able to put all test cases in a module and include that in both the test classes. Yay!

Should the name BayesianTest be changed to BayesianMemoryTest in accordance to the BayesianRedisTest (and corresponding file name)?

@Ch4s3
Copy link
Member

Ch4s3 commented Dec 28, 2016

I'd probably leave the file name the same to preserve history, unless you need to change the contents of the tests dramatically.

@ibnesayeed
Copy link
Contributor Author

@Ch4s3, did you get a chance to review the code changes yet?

@ibnesayeed
Copy link
Contributor Author

@parkr and @Ch4s3 if the code so far looks good then I can go ahead and write unit tests for the backend classes and update the README so that we can proceed with merging it before it gets stale.

@ibnesayeed ibnesayeed changed the title WIP: Memory and Redis backend support Memory and Redis backend support Jan 2, 2017
@ibnesayeed
Copy link
Contributor Author

  • Added necessary unit tests for the newly added backend classes.
  • Merged recent changes in the master branch and made all the necessary changes to work with the new testing library.
  • Updated the README to illustrate the usage of the Redis backend.

It is no more a work in progress from my end. Please review the code and let me know if some changes are needed.

@@ -55,19 +59,20 @@ def train(category, text)
category = CategoryNamer.prepare_name(category)

# Add the category dynamically or raise an error
unless @categories.key?(category)
unless category_keys.include?(category)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep .key? include? and key? are functionally the same, unless I'm missing something, and I think key is more clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@categories was a nested hash which has key? method. However, category_keys is an array of just the keys of that hash (which was abstracted to support Redis backend) and arrays don't have key? method defined. Am I missing something?

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 2, 2017

This generally looks very good @ibnesayeed! Its a big change, so I need to put a bit more time into looking it over. I'd also like to make sure @parkr is on board because I don't want to cause any upstream problems for Jekyll.

@ibnesayeed
Copy link
Contributor Author

@Ch4s3 and @parkr do we have a real or pseudo-real (fairly good in size) training set that we can use to train and ask for scores of certain queries without this PR, then we can repeat the same with this PR in both Memory and Redis backends. If the scores come out to be exactly the same then we will have more confidence that nothing was changed while refactoring inline memory operations into a separate backend. I think I am essentially talking about an integration test.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 5, 2017

@ibnesayeed no, but we should. I usually pull down some wikipedia articles to demo the gem, but that might not be enough.

@ibnesayeed
Copy link
Contributor Author

@Ch4s3 I have tried to test it on a public data set about SMS spam classification. I am yet to match the results, but here is what I did.

Cases:

  • Pre-refactoring version
  • Post-refactoring Memory backend
  • Post-refactoring Redis backend

For each of these cases:

  • Trained 5474 instances of Ham and Spam classes
  • Generated scores of 100 untrained instances
  • Untrained 2000 instances from the already trained model
  • Generated scores of 100 untrained instances

Please find the test scripts, data, and results at https://gist.github.com/ibnesayeed/73680bb26c4ea2dc1c55651d126e57fb

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 5, 2017

I'll take a look as soon as possible.

@ibnesayeed
Copy link
Contributor Author

ibnesayeed commented Jan 5, 2017

Now that I am back to my desk and had a look at the produced output, they are exactly the same. The gist I posted before had some wrong data initially as I was not flushing the Redis database, so after successive runs, scores based on Redis backend were changing due to persistence of earlier tests. Now I have updated the gist.

I am now very confident that the implementation was safely refactored and extended without changing the existing behavior. Coding style and other things can be reviewed obviously.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 5, 2017

Awesome! Maybe we should consider using that data for integration testing.

@Ch4s3 Ch4s3 merged commit 92112f5 into jekyll:master Jan 5, 2017
@ibnesayeed
Copy link
Contributor Author

@Ch4s3: Awesome! Maybe we should consider using that data for integration testing.

That was going to be my next question, if you would be interested in getting that test data added in the repo with some test cases and more tests can be written based on that data later.

How should we put a test case for now? Should we just do what I did in that external script, then compare the results of the two backends or should we compare them against a static expected output?

@ibnesayeed
Copy link
Contributor Author

@Ch4s3: I'd probably leave the file name the same to preserve history, unless you need to change the contents of the tests dramatically.

Actually, now that there are more files in the classifier-reborn/test/bayes directory with an emerging pattern, I think it now makes more sense to rename the file and the class to make it in sync with others. Otherwise it will be confusing for new contributors.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 7, 2017

I agree

@ibnesayeed
Copy link
Contributor Author

I agree

PR #102 should do.

Ch4s3 pushed a commit that referenced this pull request Jan 7, 2017
* Disabled Redis disc persistence and refactored integration test, fixes #95

* Changed test class and file name as per #84
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants