Step 2: First look into the data #3
Comments
I checked what was wrong with the tf-idf function: I had just overlooked that it generates a matrix. By using … we can get the highest tf-idf scores, which gives us the following: … The issue with the counting during the demo was caused by a missing default value, so the CLI overwrote the other default value.
Thanks for the update, Lennart. About the number issue: are you removing the numerical characters before running the tf-idf? I believe this would be easier, as we treat the input before using it in any processing, right before/after the stopword removal. I'm still wondering if we should use TfidfVectorizer instead of TfidfTransformer. The former is usually used when the input is the raw documents, the latter when you already have a count matrix. Also, with the former, several tasks can be automated with a parameter flag (e.g. stopword removal, n-grams, max features, minimum document frequency, token regex, etc.).
The CountVectorizer I use has the parameters you mentioned and creates the matrix the TfidfTransformer needs. I can check whether the results would be the same. The parameters are also the issue for the stopwords, as I can only pass a list of stopwords, which sklearn will remove. I think I can smuggle the number removal into the tokenization, so numbers will also be removed. Then we would also do the stopword removal ourselves, because we have to check for numbers with a function and can't pass a list of all numbers to remove. Maybe I also missed something and you can also pass a function.
Sorry for the delay, Lennart.
I removed some numbers by casting them to a float and checking if it works, for now. Tokens like …
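The example tokens in this comment were lost in the scrape, but the float-cast check described here can be sketched as follows, with a few edge cases that show where it behaves unexpectedly (the sample tokens are my own, not from the thread):

```python
# Sketch: the "cast to float" number check, with edge cases.
def is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False

print(is_number("42"))     # True
print(is_number("3.5"))    # True
print(is_number("1e5"))    # True: scientific notation also casts
print(is_number("1,000"))  # False: the comma prevents the cast
print(is_number("nan"))    # True: the word "nan" casts to float too
```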
First, we should take a look at the data we have by analysing keywords and using tf-idf.