natural.Tfidf.listTerms works incorrectly for custom-generated tokens (those passed as array to addDocument(...)) #634

Closed
senatet opened this issue Jan 4, 2022 · 2 comments


senatet commented Jan 4, 2022

natural.TfIdf.addDocument accepts either a string or an array of pre-tokenized terms. When a document is added as an array of tokens, listTerms still runs the tokenizer on each individual token when computing the tfidf score. A token that the tokenizer splits further (such as 'google.com') no longer matches anything in the document, so its tfidf score is 0 even though its tf and idf scores are > 0.

(natural version: ^5.1.11)
An example:

> var natural = require('natural')
> var tfidf = new natural.TfIdf()
> tfidf.addDocument(['domain', 'google.com'])
> tfidf.listTerms(0)
[
  {
    term: 'domain',
    tf: 1,
    idf: 0.3068528194400547,
    tfidf: 0.3068528194400547
  },
  { term: 'google.com', tf: 1, idf: 0.3068528194400547, tfidf: 0 }
]

The second term should have a tfidf score of 0.306... (1 * 0.3068...), but it is 0.

The fix is simple: in the listTerms(...) function, pass the term as an array in the tfidf call, i.e. change tfidf: _this.tfidf(term, d) to
tfidf: _this.tfidf([term], d) (line 174 here: https://github.com/NaturalNode/natural/blob/master/lib/natural/tfidf/tfidf.js ).

Thanks.
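
To illustrate why passing a string versus an array makes the difference, here is a minimal sketch that does not use natural itself. The tokenize, buildDocument, tf, idf and tfidf helpers are hypothetical stand-ins written for this example. It assumes the default tokenizer splits on non-alphanumeric characters and that idf is computed as 1 + ln(N / (1 + df)), which is consistent with the 0.3068... value in the output above.

```javascript
// Simplified stand-in for a word tokenizer (assumption: it splits on
// non-alphanumeric characters, so 'google.com' becomes ['google', 'com']).
function tokenize(text) {
  return text.split(/[^A-Za-z0-9_]+/).filter(Boolean);
}

// Hypothetical document representation: a Map of term -> count,
// built from an already tokenized array without re-tokenizing.
function buildDocument(tokens) {
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
  return counts;
}

function tf(term, doc) {
  return doc.get(term) || 0;
}

// Assumed idf formula: 1 + ln(N / (1 + number of documents containing term)).
function idf(term, docs) {
  const df = docs.filter((doc) => doc.has(term)).length;
  return 1 + Math.log(docs.length / (1 + df));
}

// A string argument is tokenized first; an array argument is used as-is.
function tfidf(terms, docIndex, docs) {
  const list = Array.isArray(terms) ? terms : tokenize(terms);
  return list.reduce((sum, t) => sum + tf(t, docs[docIndex]) * idf(t, docs), 0);
}

const docs = [buildDocument(['domain', 'google.com'])];

// String form: 'google.com' is split into 'google' and 'com',
// neither of which is a term in the document.
const buggy = tfidf('google.com', 0, docs);

// Array form: the term is looked up unchanged.
const fixed = tfidf(['google.com'], 0, docs);

console.log(buggy); // 0
console.log(fixed);
```

In this sketch the array form yields tf * idf for 'google.com', which matches the value the report expects from listTerms.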

@DSchmidlin

I ran into the same problem. As a workaround, I set a custom tokenizer that returns its input unchanged as a single token:

tfidf.setTokenizer( { tokenize(x) { return [x] } });
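
One caveat worth checking before relying on this workaround is sketched below. The wordTokenizer here is a hypothetical stand-in, not natural's own tokenizer, and the sketch assumes the tokenizer set with setTokenizer is also the one applied to documents added as plain strings. Under that assumption, a string document would be stored as one single token, so the workaround seems best suited to cases where every document is added as an array.

```javascript
// Stand-in for a default word tokenizer (assumption: splits on non-word characters).
const wordTokenizer = {
  tokenize: (s) => s.split(/\W+/).filter(Boolean),
};

// The workaround tokenizer from the comment above: the input is one token.
const identityTokenizer = {
  tokenize(x) { return [x]; },
};

// A pre-tokenized term survives intact only with the identity tokenizer.
const queryDefault = wordTokenizer.tokenize('google.com');
const queryIdentity = identityTokenizer.tokenize('google.com');

// Side effect: a whole sentence is also kept as one token.
const sentenceIdentity = identityTokenizer.tokenize('check the domain google.com');

console.log(queryDefault);            // [ 'google', 'com' ]
console.log(queryIdentity);           // [ 'google.com' ]
console.log(sentenceIdentity.length); // 1
```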

Hugo-ter-Doest added a commit that referenced this issue Jul 2, 2024
@Hugo-ter-Doest
Collaborator

Solved in #748
