Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't train or untrain empty word hashes #132

Merged
merged 4 commits into from
Jan 17, 2017
Merged

don't train or untrain empty word hashes #132

merged 4 commits into from
Jan 17, 2017

Conversation

Ch4s3
Copy link
Member

@Ch4s3 Ch4s3 commented Jan 17, 2017

No description provided.

@@ -73,7 +73,9 @@ def train(category, text)

@backend.update_category_training_count(category, 1)
@backend.update_total_trainings(1)
Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
return if word_hash.length == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make it the very first thing to happen in the method. This current change is essentially changing nothing significantly. update_category_training_count and update_total_trainings are still being called, hence causing undesired stats changes. Also, if the @auto_categorize is set to true, it might add a new category without any trainings in it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to preempt add_category or just update_category_training_count and update_total_trainings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make it the very first thing in the method, even before calling CategoryNamer. This way, if the training has to be skipped, it does not make any changes in the data-structure state in any circumstances.

@@ -91,7 +93,9 @@ def untrain(category, text)
category = CategoryNamer.prepare_name(category)
@backend.update_category_training_count(category, -1)
@backend.update_total_trainings(-1)
Hasher.word_hash(text, @language, @enable_stemmer).each do |word, count|
word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
return if word_hash.length == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this up in the method as well.

@ibnesayeed
Copy link
Contributor

Also, using word_hash.empty? instead of word_hash.length == 0 would be more elegant in my opinion.

@ibnesayeed
Copy link
Contributor

Additionally, the classifications method should return a score hash with Infinity value for each category if the word_hash is empty like this:

def classifications(text)
  score = {}
  word_hash = Hasher.word_hash(text, @language, @enable_stemmer)
  if word_hash.empty
    category_keys.each do |category|
      score[category.to_s] = Float::INFINITY
    end
    return score
  end
  #... remaining code
end

@ibnesayeed
Copy link
Contributor

I had these changes in a local branch on my code base, but since you are taking care of it, I will just discard that.

@Ch4s3
Copy link
Member Author

Ch4s3 commented Jan 17, 2017

So I made these changes, but it breaks this test .

10.times do |a_number|
    assert_equal 'Digit', b.classify(a_number.to_s)
end

But, I'm not sure it even makes sense to classify a single digit that way.

# now add prior probability for the category
s = @backend.category_has_trainings?(category) ? @backend.category_training_count(category) : 0.1
score[category.to_s] += Math.log(s / @backend.total_trainings.to_f)
score
Copy link
Contributor

@ibnesayeed ibnesayeed Jan 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you did not used return keyword in the if section of the word_hash.empty? clause and wrapped the remaining code in the else clause, I would encourage to remove implicit return of score from both the clauses and return it outside the conditional as the last line.

Alternatively, explicit return in the special case and not wrapping rest of the code in else clause would reduce a level of nesting, while gaining the same effect.

@ibnesayeed
Copy link
Contributor

ibnesayeed commented Jan 17, 2017

So I made these changes, but it breaks this test .

This is happening because how the tokenization is implemented currently. If the length of the token is not greater than 2 letters then it is skipped and not added in the word_hash. I think it was a meaningless test because there was only one category present and the system was returning that category with some score no matter what was passed to classify. Those categories were actually not being trained with anything.

In the new implementation of the classification method, with empty strings getting Infinity score, and a threshold is set, the classify method would return nil.

next unless word.length > 2 && !STOPWORDS[language].include?(word)

In conclusion, we need to change the test so that it trains for numbers above 100. This should make the test pass. Then we need to think if we want the limit of length where the token is nothing but a number such as 2, 20, or .5. This makes it more important to be able to pass a custom tokenizer (#131) for situations like this.

@Ch4s3
Copy link
Member Author

Ch4s3 commented Jan 17, 2017

I changed the test to use words, which I think better reflects the intent, and I tweaked the threshold. But in general I agree that it was a bad test, we seem to have a lot of those.

Yeah, let's prioritize #131, it will be pretty useful going forward.

@Ch4s3
Copy link
Member Author

Ch4s3 commented Jan 17, 2017

@ibnesayeed any insight into why bayesian_integration_test.rb:35 fails on newer rubies with this PR?

The results of classification_scores(@memory_classifier) and classification_scores(@redis_classifier) look nearly identical but the hashes are different. I'm guessing this is a floating point math issue, but I'm not sure.

@ibnesayeed
Copy link
Contributor

I changed the test to use words, which I think better reflects the intent...

You might want to make it something like 100.upto(110) do ....

@Ch4s3
Copy link
Member Author

Ch4s3 commented Jan 17, 2017

@ibnesayeed thoughts on the failing test?

@ibnesayeed
Copy link
Contributor

@ibnesayeed any insight into why bayesian_integration_test.rb:35 fails on newer rubies with this PR?

I am trying to look into it. If it was failing for all Rubies, I would have though there might be some edge condition on how numbers are stored in Redis.

@Ch4s3
Copy link
Member Author

Ch4s3 commented Jan 17, 2017

It might be an issue in the Redis Ruby gem

@ibnesayeed
Copy link
Contributor

I think the issue might be on another level. Because even the expected value changes of each run. That expected value is calculated by generating a hash digest of the array of the results and scores concatenated. If the array is somehow changing the order or the digest calculation is not reliable to test the equality then we might need to make some changes there.

@ibnesayeed
Copy link
Contributor

I think I got the culprit, just need to dig deeper and see how it goes.

@Ch4s3
Copy link
Member Author

Ch4s3 commented Jan 17, 2017

@ibnesayeed good catch

@ibnesayeed
Copy link
Contributor

The issue is related to how classify_with_score method returns the value.

def classify_with_score(text)
  (classifications(text).sort_by { |a| -a[1] })[0]
end

This returns the first value with the highest score. However, if more than one categories have the same score, how to resolve the conflict? In this particular test case, there is an instance in the dataset on like 4499, there is a record ham Ok which is being used as test. Due to the only-stopword string, both teh categories are assigned Infinity value, and the method might return one or the other as the chosen one.

@ibnesayeed
Copy link
Contributor

If the test uses a threshold, the situation will be little different as discussed in #123.

@ibnesayeed
Copy link
Contributor

However, should have a fix for this situation without making any changes in the library code, only in the test. I will let you know those changes in a bit.

@Ch4s3
Copy link
Member Author

Ch4s3 commented Jan 17, 2017

awesome, thanks!

@ibnesayeed
Copy link
Contributor

Please cherry pick ibnesayeed@ed116ea

@ibnesayeed
Copy link
Contributor

I was in rush and could not make proper PR on time. PR #133 should take care of the error.

@ibnesayeed
Copy link
Contributor

It was interesting to find out that the record that was nothing but a stopword was paced at line 4499. We are using records 1-4000 for training and 4001-5000 for testing and it happened to be the second last record on the test set. This was coincidental, but a good one that made the test more reliable.

@Ch4s3
Copy link
Member Author

Ch4s3 commented Jan 17, 2017

yeah, that was a good catch

Copy link
Contributor

@ibnesayeed ibnesayeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Ch4s3 Ch4s3 merged commit 8bc8d55 into master Jan 17, 2017
@Ch4s3 Ch4s3 deleted the check_blank branch January 17, 2017 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants