natural.Tfidf.listTerms works incorrectly for custom-generated tokens (those passed as array to addDocument(...)) #634

senatet · 2022-01-04T20:58:03Z

natural.TfIdf.addDocument accepts either a string or an array of tokens. When a document is added as an array of tokens, listTerms still runs the tokenizer on each individual token when computing the tfidf score. For a token that the tokenizer splits (such as 'google.com'), this results in a tfidf score of 0, even though the tf and idf scores are > 0.

(natural version: ^5.1.11)
An example:

> var natural = require('natural')
> var tfidf = new natural.TfIdf()
> tfidf.addDocument(['domain', 'google.com'])
> tfidf.listTerms(0)
[
  {
    term: 'domain',
    tf: 1,
    idf: 0.3068528194400547,
    tfidf: 0.3068528194400547
  },
  { term: 'google.com', tf: 1, idf: 0.3068528194400547, tfidf: 0 }
]

The second term should have a tfidf score of 0.3068... (1 * 0.3068...), but it is 0.
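
As a quick sanity check on the numbers above, the reported idf is consistent with the formula 1 + ln(N / (1 + df)) for a corpus of one document that contains the term. The formula and the variable names below are assumptions made for illustration and should be checked against the library source for your version:

```javascript
// Assumed idf formula: 1 + ln(N / (1 + df)). This is an assumption,
// not something confirmed by the issue text itself.
const numDocs = 1;       // one document in the corpus
const docsWithTerm = 1;  // the term occurs in that one document
const idf = 1 + Math.log(numDocs / (1 + docsWithTerm));

const tf = 1;                     // the term occurs once in the document
const expectedTfidf = tf * idf;   // what listTerms should report

console.log(idf, expectedTfidf);
```

Under that assumption, both terms should report the same non-zero tfidf, since both have tf = 1 and the same idf.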

The fix is simple: in the listTerms(...) function, pass the term to the tfidf call as an array. Change
tfidf: _this.tfidf(term, d)
to
tfidf: _this.tfidf([term], d)
(line 174 here: https://github.com/NaturalNode/natural/blob/master/lib/natural/tfidf/tfidf.js ).
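
The effect of that one-line change can be sketched with a small stand-in class. This is not natural's actual source: ToyTfIdf, wordTokenize, and the wrapTerm flag are hypothetical names, the tokenizer is assumed to split on non-word characters, and the corpus is assumed to be a single document with the tokens 'domain' and 'google.com'.

```javascript
// Simplified stand-in for the behaviour described in this issue.
// NOT natural's real implementation; all names here are hypothetical.
function wordTokenize(text) {
  // Assumed tokenizer: lowercases and splits on non-word characters.
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

class ToyTfIdf {
  constructor() {
    this.documents = [];
  }

  // Arrays are stored as-is; strings are tokenized first.
  addDocument(doc) {
    const tokens = Array.isArray(doc) ? doc : wordTokenize(doc);
    const counts = new Map();
    for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
    this.documents.push(counts);
  }

  tf(term, d) {
    return this.documents[d].get(term) || 0;
  }

  idf(term) {
    const df = this.documents.filter((doc) => doc.has(term)).length;
    return 1 + Math.log(this.documents.length / (1 + df));
  }

  // A string argument is tokenized; an array is used unchanged.
  tfidf(terms, d) {
    const list = Array.isArray(terms) ? terms : wordTokenize(terms);
    return list.reduce((sum, t) => sum + this.tf(t, d) * this.idf(t), 0);
  }

  // wrapTerm = false mimics the reported behaviour (term passed as a string);
  // wrapTerm = true mimics the proposed fix (term passed as [term]).
  listTerms(d, wrapTerm) {
    return [...this.documents[d].keys()].map((term) => ({
      term,
      tf: this.tf(term, d),
      idf: this.idf(term),
      tfidf: this.tfidf(wrapTerm ? [term] : term, d),
    }));
  }
}

const toy = new ToyTfIdf();
toy.addDocument(['domain', 'google.com']);
const buggy = toy.listTerms(0, false);
const fixed = toy.listTerms(0, true);
console.log(buggy, fixed);
```

In the unwrapped case, 'google.com' is re-tokenized into 'google' and 'com', neither of which is a key in the stored document, so every tf lookup returns 0 and the sum is 0. Wrapping the term in an array skips the tokenizer, so the lookup uses the original token.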

Thanks.


DSchmidlin · 2023-04-06T22:49:10Z

I ran into the same problem. The workaround I used was to set a custom tokenizer that does this work.

tfidf.setTokenizer( { tokenize(x) { return [x] } });

Repairs #634

Hugo-ter-Doest · 2024-07-02T08:03:16Z

Solved in #748

Hugo-ter-Doest added a commit that referenced this issue Jul 2, 2024

Repaired issue #634

da68353

Hugo-ter-Doest added a commit that referenced this issue Jul 2, 2024

Issue#634 (#748)

9c96754

Repairs #634

Hugo-ter-Doest closed this as completed Jul 2, 2024
