I am running some text mining, and for one specific word ("the") I get negative idf values. The word "the" is present in every document of my sample, so according to the idf definition, idf("the") should be zero. Here is my workflow:
I narrowed the problem down to an incorrect count of the total number of documents (issues) in the sample.
The number of documents is 1305, which you can verify with:
print(nrow(data_df %>% distinct(issue)))
However, inside bind_tf_idf() the total number of documents is computed from the result of tapply() (that is, from grouping), and that count is 1304. You can check it with (extracted from bind_tf_idf()):
Because of this incorrect count, the idf of "the" comes out negative.
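To make the arithmetic concrete, here is a minimal sketch. It assumes idf is the natural log of (total documents / documents containing the term); the helper name `idf` is only illustrative and is not taken from the package.

```r
# Assumed definition: idf = ln(total documents / documents containing the term).
idf <- function(n_docs, n_docs_with_term) log(n_docs / n_docs_with_term)

# "the" appears in all 1305 documents, so the expected idf is log(1) = 0.
idf_expected <- idf(1305, 1305)

# If the total is undercounted as 1304, the ratio drops below 1
# and the logarithm becomes slightly negative.
idf_observed <- idf(1304, 1305)

print(idf_expected)      # 0
print(idf_observed < 0)  # TRUE
```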
Can you confirm the issue? Am I missing something? Why does tapply() "skip" a group? Why is the number of documents in bind_tf_idf() computed via tapply() instead of, for example, using distinct()?
... and obviously this happened because there was an NA in the issue column of the data frame: tapply() ignored it when grouping, but if you count the rows after applying distinct(), the NA is still there. Pardon.
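For anyone hitting the same thing, the mismatch can be reproduced on a toy data frame. This is only a sketch: the data and the column names `issue`, `word`, and `n` are made up for illustration, and base R's unique() stands in for distinct() because both keep NA as a value of its own.

```r
# Toy data: three real issues plus one row whose issue id is missing.
toy <- data.frame(
  issue = c("a", "a", "b", "c", NA),
  word  = c("the", "union", "the", "the", "the"),
  n     = c(3, 1, 2, 4, 1),
  stringsAsFactors = FALSE
)

# tapply() turns the grouping vector into a factor, which drops NA,
# so the row with the missing issue id does not form a group.
doc_totals <- tapply(toy$n, toy$issue, sum)
n_docs_tapply <- length(doc_totals)

# unique() keeps NA as a distinct value, so it counts one more "document".
n_docs_unique <- length(unique(toy$issue))

# Dropping rows with a missing document id makes the two counts agree.
clean <- toy[!is.na(toy$issue), ]
n_docs_clean <- length(unique(clean$issue))

# Counts are 3, 4 and 3 respectively.
print(c(tapply = n_docs_tapply, unique = n_docs_unique, clean = n_docs_clean))
```

Filtering out rows with a missing document id before computing tf-idf is one way to keep the two counts consistent.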
Here you can find the csv data:
times_ocr=80_100&date=1945-01-01_2010-12-31&query=european_union&category=News.csv