tidytext idf negative values due to wrong counting of number of documents #234

Closed
StefanoRapisarda opened this issue Apr 22, 2023 · 2 comments

@StefanoRapisarda

Hi guys,

I am running some text mining, and for one specific word ("the") I get negative idf values. The word "the" is present in every document of my sample, so according to the definition of idf, idf("the") should be zero. Here is my workflow:

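library(readr)     # read_delim(), cols(), col_date()
library(dplyr)     # %>%, count(), arrange(), distinct()
library(tidytext)  # unnest_tokens(), bind_tf_idf()

data_file_name <- 'times_ocr=80_100&date=1945-01-01_2010-12-31&query=_european_union_&category=News.csv'
data_df <- read_delim(data_file_name, delim = ";", escape_double = FALSE,
                      col_types = cols(`date-pub` = col_date(format = "%B %d, %Y")),
                      trim_ws = TRUE)

# One row per (issue, word) pair, with its count n
issue_words <- data_df %>%
  unnest_tokens(word, content) %>%
  count(issue, word, sort = TRUE)

# Add tf, idf and tf_idf columns, treating each issue as a document
issue_tf_idf <- issue_words %>%
  bind_tf_idf(word, issue, n)

# Smallest tf_idf values first
issue_tf_idf %>%
  arrange(tf_idf)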

I narrowed the problem down to a miscount of the total number of documents (issues) in the sample.
The number of documents is 1305, which you can check with:

print(nrow(data_df %>% distinct(issue)))

However, inside bind_tf_idf() the total number of documents is computed from the result of tapply() (that is, from grouping), and that count is 1304. You can check it with the following line, extracted from bind_tf_idf():

print(length(tapply(issue_words$n, issue_words$issue, sum)))

Because of this miscount, the idf of "the" comes out negative.
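To make the arithmetic concrete, here is a minimal sketch using only the two counts reported above. It assumes idf is the natural log of (documents counted / documents containing the term), which is the usual definition and matches the behavior described here. The variable names are made up for illustration and are not taken from the package.

```r
# Counts reported in this issue
n_docs_counted  <- 1304  # groups returned by tapply()
n_docs_with_the <- 1305  # documents containing "the" (all of them, per distinct())

# Assumed definition: idf = ln(documents counted / documents containing the term)
idf_the      <- log(n_docs_counted / n_docs_with_the)   # ratio below 1, so negative
idf_expected <- log(n_docs_with_the / n_docs_with_the)  # ratio of 1, so zero

print(idf_the)
print(idf_expected)
```

With a consistent document count the ratio is exactly 1 and the idf is zero; being off by a single document is enough to push it slightly below zero.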

Can you confirm the issue? Am I missing something? Why does tapply() "skip" a group? Why is the number of documents in bind_tf_idf() computed via tapply() instead of, for example, using distinct()?

Here you can find the csv data.

times_ocr=80_100&date=1945-01-01_2010-12-31&query=european_union&category=News.csv

@StefanoRapisarda
Author

... and obviously this happened because there was an NA in the issue column of the data frame. tapply() ignored it when grouping, but the NA is still there when you count the rows after applying distinct(). My apologies.
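For anyone landing here with the same symptom, the mismatch can be reproduced in base R on a toy data frame. This is a sketch under assumptions: the three-row data below is invented as a stand-in for issue_words, and the idf line uses the usual ln(documents / documents containing the term) definition rather than the package's exact code.

```r
# Toy stand-in for issue_words: three documents, one with a missing id
issue_words <- data.frame(
  issue = c("a", "b", NA),
  word  = c("the", "the", "the"),
  n     = c(3, 2, 4)
)

# Counting distinct values keeps NA as a value of its own
n_distinct_docs <- length(unique(issue_words$issue))

# tapply() turns the grouping vector into a factor, which drops NA,
# so the NA document never becomes a group
n_tapply_groups <- length(tapply(issue_words$n, issue_words$issue, sum))

# "the" occurs in 3 rows but only 2 documents are counted: ratio below 1
idf_the <- log(n_tapply_groups / sum(issue_words$word == "the"))

# One possible workaround: drop rows with a missing document id first
clean <- issue_words[!is.na(issue_words$issue), ]
idf_clean <- log(length(tapply(clean$n, clean$issue, sum)) /
                 sum(clean$word == "the"))

print(c(distinct = n_distinct_docs, tapply = n_tapply_groups))
print(c(idf_the = idf_the, idf_clean = idf_clean))
```

In the real workflow the equivalent step would be to filter out rows where issue is NA (or give them a proper id) before calling bind_tf_idf().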

@github-actions

github-actions bot commented May 7, 2023

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 7, 2023