tidytext idf negative values due to wrong counting of number of documents #234

StefanoRapisarda · 2023-04-22T22:57:28Z

Hi guys,

I am running some text mining, and for one specific word ("the") I get negative idf values. The word "the" is present in every document in my sample, so by the definition of idf, idf("the") should be zero. Here is my workflow:

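# Packages providing %>%, count(), read_delim(), cols(), col_date() and unnest_tokens()
library(dplyr)
library(readr)
library(tidytext)

data_file_name <- 'times_ocr=80_100&date=1945-01-01_2010-12-31&query=_european_union_&category=News.csv'
data_df <- read_delim(data_file_name, delim = ";", escape_double = FALSE, col_types = cols(`date-pub` = col_date(format = "%B %d, %Y")), trim_ws = TRUE)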

issue_words <- data_df %>%
  unnest_tokens(word, content) %>%
  count(issue, word, sort = TRUE)

issue_tf_idf <- issue_words %>%
  bind_tf_idf(word, issue, n)

issue_tf_idf %>%
  arrange(tf_idf)

I narrowed the problem down to a miscount of the total number of documents (issues) in the sample.
The number of documents is 1305, which you can verify with:

print(nrow(data_df %>% distinct(issue)))

However, inside bind_tf_idf() the total number of documents is computed from the result of tapply() (that is, from grouping), and that count is 1304. You can check it with this line, extracted from bind_tf_idf():

print(length(tapply(issue_words$n, issue_words$issue, sum)))

Because of this miscount, the idf of "the" comes out negative.
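To make the arithmetic concrete, here is a minimal sketch. It assumes idf is the natural log of (number of documents counted) divided by (number of documents containing the term), and that "the" is counted 1305 times while the document total is 1304. The helper idf() is hypothetical and only illustrates the ratio; it is not the tidytext code.

```r
# Hypothetical helper: idf as log(documents counted / documents containing the term)
idf <- function(n_docs, n_docs_with_term) log(n_docs / n_docs_with_term)

# Term present in every one of 1305 documents: the ratio is 1, so idf is 0
idf_expected <- idf(1305, 1305)

# Document total undercounted by one: the ratio drops below 1, so idf is slightly negative
idf_observed <- idf(1304, 1305)

print(idf_expected)
print(idf_observed)
```

Under these assumptions, a shortfall of a single document is enough to push the idf of a term that appears everywhere just below zero.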

Can you confirm the issue? Am I missing something? Why does tapply() "skip" a group? Why is the number of documents in bind_tf_idf() computed via tapply() instead of, for example, using distinct()?

Here is the CSV data:

times_ocr=80_100&date=1945-01-01_2010-12-31&query=european_union&category=News.csv


StefanoRapisarda · 2023-04-22T23:48:24Z

... and obviously this happened because there was an NA in the issue column of the data frame. tapply() ignored it when grouping, but if you count the rows after applying distinct(), the NA is still there. Pardon.
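The behaviour described above can be reproduced on a toy table with base R only. This is a sketch, not the tidytext implementation: the data are made up, unique() stands in for distinct(), and counting rows per word as the idf denominator is an assumption.

```r
# Toy counts: one row per (issue, word); the last row has a missing issue id
issue_words <- data.frame(
  issue = c("a", "b", NA),
  word  = c("the", "the", "the"),
  n     = c(3, 5, 2),
  stringsAsFactors = FALSE
)

# tapply() turns the grouping vector into a factor, which drops NA by default
n_docs_tapply <- length(tapply(issue_words$n, issue_words$issue, sum))

# unique() keeps NA as a value of its own, so it reports one more document
n_docs_unique <- length(unique(issue_words$issue))

# Rows containing "the", including the row whose issue is NA (assumed denominator)
n_rows_the <- sum(issue_words$word == "the")
idf_the <- log(n_docs_tapply / n_rows_the)

# Dropping rows with a missing issue before counting makes both counts agree
clean <- issue_words[!is.na(issue_words$issue), ]
idf_the_clean <- log(
  length(tapply(clean$n, clean$issue, sum)) / sum(clean$word == "the")
)

print(c(tapply = n_docs_tapply, unique = n_docs_unique))
print(c(with_na = idf_the, without_na = idf_the_clean))
```

In the original workflow, the equivalent step would be to filter out rows where issue is NA before calling count() and bind_tf_idf().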

github-actions · 2023-05-07T00:09:56Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

StefanoRapisarda closed this as completed Apr 22, 2023

github-actions bot locked and limited conversation to collaborators May 7, 2023
