Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How should the classifier behave against empty strings? #130

Closed
ibnesayeed opened this issue Jan 15, 2017 · 15 comments
Closed

How should the classifier behave against empty strings? #130

ibnesayeed opened this issue Jan 15, 2017 · 15 comments

Comments

@ibnesayeed
Copy link
Contributor

Currently the Bayes classifier allows passing empty string for all training, untraining, and classification. Also, the strings that have nothing, but stopwords behave the same way. This means, we are essentially messing up with training count while no real training is happening.

I think, we should check the length of word_hash and if it is zero then we should just skip the training and untraing methods. If the same is the case when classify method is called, then it should return nil as the score for each category should be Infinity for empty strings.

I found this out while I was working on #125.

I can make a PR if this sounds a sensible option to do.

@ibnesayeed
Copy link
Contributor Author

Another weirdness could happen when untrain is called more often than train for a category. Some counts will be negative. category_has_trainings? in this case returns false, but actual counts are still negative and they are not frozen at zero. Additionally, the total_trainings can also go negative. The scores will me messed up in either case.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 17, 2017

Yeah, this has been sort of a longstanding issue. I'm not sure what the best way to handle empty string would be.

@ibnesayeed
Copy link
Contributor Author

I have given my recommendation already in the second paragraph of the first post. Without this being resolved, writing tests for #129 would be tricky.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 17, 2017

Ahh sorry, I think I skimmed this one on my phone. Yeah, a length check is probably the right way to go. I wonder if we should log anything out if a user does that.

@ibnesayeed
Copy link
Contributor Author

Logging is fine, but it should be an information level logging, not a warning or error. Because due to stopword filtering it might get empty unintentionally/unknowingly.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 17, 2017

It occurs to me that text.length == 0 fails for input like " ".

We need a blank?() method

@ibnesayeed
Copy link
Contributor Author

It occurs to me that text.length == 0 fails for input like " "

In what context? In general Ruby code " ".length => 1 would be the case.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 17, 2017

That's what I mean. the string " " is empty, but has a length greater than 0.

Now, this is only an issue if we check for string emptiness before calling the Hasher to get the word hash, which probably makes sense in most cases you wouldn't want to execute any of that code on an empty string. This still leaves the issue of what to do about text that only has stopwords and the word_hash returns empty.

@ibnesayeed
Copy link
Contributor Author

I would just check the length of word_hash which would be empty whether a blank string "" was called, something containing only non-letters was called such as " " or "...? " etc., or only stopwords were passed such as to be or not to be. In all the cases the returned word_hash` would be empty.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 17, 2017

Yeah, I guess I was just thinking of a case where you're iteratively training over a large csv or something and it hits a bunch of blank data and wastes time trying to hash that stuff. But your solution is a lot simpler, so let's do that. I may also add a simple return if text.length == 0 check to the train and untrain methods as well.

@ibnesayeed
Copy link
Contributor Author

We can check the emptiness of the supplied string before calling the Hasher, to avoid unnecessary call to the stack of methods in Hasher, but this would be a micro-optimization, because in many practical applications, empty strings wont be passed, so checking for emptiness just to avoid Hasher call will perhaps cost more when accumulated for all the number of attempts that are not empty.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 17, 2017

Does #132 solve this in a way that facilitates testing on #125?

@ibnesayeed
Copy link
Contributor Author

Also, making such decisions should be done in the done in Hasher (if at all), not in the main classifier methods like train, untrain, or classify. The rationale here is that if we allow support for custom tokenizer, then this early decision making will come into it's way. A custom tokenizer might want to preserve the leading and/or trailing spaces (so is the case with letter based n-grams), while emptiness check would call .strip! method that will change the essence of the token. So, it is best to leave it to the tokeniser to decide what a legitimate token is.

@ibnesayeed
Copy link
Contributor Author

@Ch4s3: Does #132 solve this in a way that facilitates testing on #125?

Added some inline reviews and more comments in the PR.

@Ch4s3
Copy link
Member

Ch4s3 commented Jan 17, 2017

closed by #132

@Ch4s3 Ch4s3 closed this as completed Jan 17, 2017
Ch4s3 pushed a commit that referenced this issue Jan 18, 2017
* Abbility to add custom stopwords at classifier initialization

* Downcased custom test stopwords

* Documented and improved custom stopwords handling

* Added test cases for custom stopwords and empty trainings, #125 and #130

* Added documentation for auto-categorization and custom stopwords
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants