How should the classifier behave against empty strings? #130

ibnesayeed · 2017-01-15T04:06:04Z

Currently the Bayes classifier accepts an empty string for training, untraining, and classification. Strings that contain nothing but stopwords behave the same way. This means we are changing the training counts even though no real training is happening.

I think we should check the length of word_hash and, if it is zero, skip the work in the train and untrain methods. If the same is the case when the classify method is called, it should return nil, as the score for each category should be Infinity for empty strings.
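A minimal sketch of the proposal, for illustration only. ToyBayes, its stopword list, and its word_hash method are hypothetical stand-ins, not the library's actual code, and the scoring is a simple count overlap rather than real Bayes scoring.

```ruby
# Toy stand-in for a Bayes classifier; not the library's implementation.
class ToyBayes
  STOPWORDS = %w[to be or not the a].freeze # hypothetical stopword list

  attr_reader :total_trainings

  def initialize(*categories)
    @categories = categories.map { |c| [c, Hash.new(0)] }.to_h
    @total_trainings = 0
  end

  # Hypothetical hasher: lowercase words minus stopwords, mapped to counts.
  def word_hash(text)
    words = text.downcase.scan(/[a-z]+/) - STOPWORDS
    words.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
  end

  # Returns false and changes nothing when there is nothing to learn from.
  def train(category, text)
    hash = word_hash(text)
    return false if hash.empty?

    hash.each { |word, count| @categories[category][word] += count }
    @total_trainings += 1
    true
  end

  # Returns nil when the text yields no usable tokens.
  def classify(text)
    hash = word_hash(text)
    return nil if hash.empty?

    best = @categories.max_by do |_, counts|
      hash.sum { |word, n| counts[word] * n }
    end
    best.first
  end
end
```

With this guard, training on "" or on a stopword-only string leaves total_trainings untouched, and classify returns nil for the same inputs.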

I found this out while I was working on #125.

I can make a PR if this sounds like a sensible option.


ibnesayeed · 2017-01-15T04:20:27Z

Another weirdness could happen when untrain is called more often than train for a category. Some counts will be negative. category_has_trainings? returns false in this case, but the actual counts are still negative; they are not frozen at zero. Additionally, total_trainings can also go negative. The scores will be messed up in either case.
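One way to keep counts from going negative, sketched under assumptions: CategoryCounts and its fields are hypothetical names for illustration, and the policy shown (refuse to untrain when nothing has been trained, and floor word counts at zero) is only one possible choice.

```ruby
# Hypothetical per-category counters; names are illustrative only.
class CategoryCounts
  attr_reader :word_counts, :trainings

  def initialize
    @word_counts = Hash.new(0)
    @trainings = 0
  end

  def train(word_hash)
    word_hash.each { |word, count| @word_counts[word] += count }
    @trainings += 1
  end

  # Subtract counts, but never let a word count or the training
  # count drop below zero.
  def untrain(word_hash)
    return false if @trainings.zero?

    word_hash.each do |word, count|
      remaining = @word_counts[word] - count
      if remaining > 0
        @word_counts[word] = remaining
      else
        @word_counts.delete(word) # floor at zero by dropping the entry
      end
    end
    @trainings -= 1
    true
  end
end
```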

Ch4s3 · 2017-01-17T15:36:31Z

Yeah, this has been sort of a longstanding issue. I'm not sure what the best way to handle empty strings would be.

ibnesayeed · 2017-01-17T15:54:38Z

I have given my recommendation already in the second paragraph of the first post. Without this being resolved, writing tests for #129 would be tricky.

Ch4s3 · 2017-01-17T15:59:37Z

Ahh sorry, I think I skimmed this one on my phone. Yeah, a length check is probably the right way to go. I wonder if we should log anything out if a user does that.

ibnesayeed · 2017-01-17T16:06:10Z

Logging is fine, but it should be at the info level, not a warning or an error, because stopword filtering can leave the input empty without the user intending or knowing it.
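A small sketch of what info-level logging could look like with Ruby's standard Logger. The method name train_with_logging and the message text are hypothetical; nothing here reflects how the library logs, if it logs at all.

```ruby
require 'logger'
require 'stringio'

# Hypothetical wrapper: report a skipped training at INFO level only.
def train_with_logging(word_hash, logger)
  if word_hash.empty?
    logger.info('Skipped training: no tokens left after filtering')
    return false
  end
  # ... the actual training would happen here ...
  true
end

io = StringIO.new
logger = Logger.new(io)
train_with_logging({}, logger)
puts io.string # one line tagged INFO
```

Because the message is logged at the info level, a logger configured with level Logger::WARN stays silent for these skipped inputs.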

Ch4s3 · 2017-01-17T16:06:14Z

It occurs to me that text.length == 0 fails for input like " ".

We need a blank?() method

ibnesayeed · 2017-01-17T16:09:07Z

It occurs to me that text.length == 0 fails for input like " "

In what context? In plain Ruby, " ".length returns 1, as expected.

Ch4s3 · 2017-01-17T16:12:46Z

That's what I mean. The string " " is effectively empty, but has a length greater than 0.

Now, this is only an issue if we check for string emptiness before calling the Hasher to get the word hash. That probably makes sense, since in most cases you wouldn't want to execute any of that code on an empty string. It still leaves the question of what to do about text that contains only stopwords, where word_hash comes back empty.
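To make the whitespace case concrete: plain Ruby has String#empty? but no String#blank? (that method comes from ActiveSupport). The helper below, blank_text?, is a hypothetical name for a dependency-free equivalent.

```ruby
# Hypothetical helper; uses the non-mutating strip so the caller's
# string is left unchanged.
def blank_text?(text)
  text.nil? || text.strip.empty?
end

puts ' '.length        # 1
puts ' '.empty?        # false
puts blank_text?(' ')  # true
```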

ibnesayeed · 2017-01-17T16:21:00Z

I would just check the length of word_hash, which would be empty whether the input was a blank string "", something containing only non-letters such as " " or "...? ", or only stopwords such as "to be or not to be". In all of these cases the returned word_hash would be empty.
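The claim can be checked with a toy hasher. TOY_STOPWORDS and toy_word_hash are hypothetical and only approximate what a real hasher does (letters only, lowercased, stopwords removed).

```ruby
# Hypothetical stopword list and hasher, for illustration only.
TOY_STOPWORDS = %w[to be or not].freeze

def toy_word_hash(text)
  words = text.downcase.scan(/[a-z]+/).reject { |w| TOY_STOPWORDS.include?(w) }
  words.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

['', ' ', '...? ', 'to be or not to be'].each do |input|
  puts "#{input.inspect} -> empty? #{toy_word_hash(input).empty?}"
end
```

Under these assumptions all four inputs produce an empty hash, so a single emptiness check on the result covers every case.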

Ch4s3 · 2017-01-17T16:23:16Z

Yeah, I guess I was just thinking of a case where you're iteratively training over a large CSV or something, and it hits a bunch of blank data and wastes time trying to hash that stuff. But your solution is a lot simpler, so let's do that. I may also add a simple return if text.length == 0 check to the train and untrain methods.

ibnesayeed · 2017-01-17T16:25:06Z

We can check the emptiness of the supplied string before calling the Hasher, to avoid an unnecessary trip through the stack of methods in Hasher, but this would be a micro-optimization. In many practical applications empty strings won't be passed, so an emptiness check added just to avoid the Hasher call may well cost more overall, since it runs on every non-empty input too.

Ch4s3 · 2017-01-17T16:28:07Z

Does #132 solve this in a way that facilitates testing on #125?

ibnesayeed · 2017-01-17T16:30:41Z

Also, such decisions should be made in Hasher (if at all), not in the main classifier methods like train, untrain, or classify. The rationale is that if we add support for custom tokenizers, this early decision making will get in its way. A custom tokenizer might want to preserve leading and/or trailing spaces (as is the case with letter-based n-grams), while an emptiness check would call the .strip! method, which changes the essence of the token. So it is best to leave it to the tokenizer to decide what a legitimate token is.
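To illustrate the n-gram point, here is a hypothetical character n-gram tokenizer. char_ngrams is not part of the library; it only shows how stripping the input before tokenizing changes the tokens.

```ruby
# Hypothetical tokenizer: overlapping character n-grams, whitespace preserved.
def char_ngrams(text, n = 3)
  return [] if text.length < n

  (0..text.length - n).map { |i| text[i, n] }
end

p char_ngrams(' ab ')        # [" ab", "ab "]
p char_ngrams(' ab '.strip)  # []
```

The leading and trailing spaces carry word-boundary information for this kind of tokenizer, so stripping in train, untrain, or classify would silently discard tokens it considers valid.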

ibnesayeed · 2017-01-17T17:01:41Z

@Ch4s3: Does #132 solve this in a way that facilitates testing on #125?

Added some inline reviews and more comments in the PR.

Ch4s3 · 2017-01-17T20:18:23Z

closed by #132

* Ability to add custom stopwords at classifier initialization
* Downcased custom test stopwords
* Documented and improved custom stopwords handling
* Added test cases for custom stopwords and empty trainings, #125 and #130
* Added documentation for auto-categorization and custom stopwords

Ch4s3 added the discussion label Jan 17, 2017

ibnesayeed mentioned this issue Jan 17, 2017

Ability to add custom stopwords at classifier initialization #129

Merged

Ch4s3 closed this as completed Jan 17, 2017

ibnesayeed mentioned this issue Jan 18, 2017

Return the status of the training/untraining when run #137

Merged
