should not set lowercase argument of tokenizers::tokenize_words #175

Closed
randy3k opened this issue Jun 3, 2020 · 6 comments

@randy3k

randy3k commented Jun 3, 2020

In https://github.com/juliasilge/tidytext/blob/master/R/unnest_tokens.R#L148,
lowercase is set to FALSE for tokenizers::tokenize_words.

This breaks the case handling of the stopwords argument in tokenizers::tokenize_words: with the tokens left in their original case, stop words no longer match capitalized words. For example,

library(tidyverse)
library(tidytext)
d <- data.frame(text = "The apple")
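# Note: "The" is not dropped, because unnest_tokens() calls the tokenizer
# with lowercase = FALSE, so the stop word "the" never matches it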
d %>% unnest_tokens(word, text, stopwords = c("the"))
#>      word
#> 1     the
#> 1.1 apple
@EmilHvitfeldt
Contributor

This is the expected behavior. The stopwords argument is case sensitive; that way you can distinguish between removing "the" and "The".

library(tidytext)
library(tidyverse)
d <- tibble(text = c("The apple and the pear"))

d %>% unnest_tokens(word, text, stopwords = c("the"))
#> # A tibble: 4 x 1
#>   word 
#>   <chr>
#> 1 the  
#> 2 apple
#> 3 and  
#> 4 pear

d %>% unnest_tokens(word, text, stopwords = c("The"))
#> # A tibble: 4 x 1
#>   word 
#>   <chr>
#> 1 apple
#> 2 and  
#> 3 the  
#> 4 pear

If you want to remove stop words after the tokens have been converted to lowercase, you can follow unnest_tokens() with an anti_join() against your stop word list.

stop_word_list <- tibble(word = "the")

d %>% unnest_tokens(word, text) %>%
  anti_join(stop_word_list, by = "word")
#> # A tibble: 3 x 1
#>   word 
#>   <chr>
#> 1 apple
#> 2 and  
#> 3 pear

@randy3k
Author

randy3k commented Jun 3, 2020

Actually, the stopwords argument is not really case sensitive.

tokenizers::tokenize_words("The apple and the pear", stopwords = "the")
#> [[1]]
#> [1] "apple" "and"   "pear"

It is only case sensitive here because tidytext has forced lowercase to FALSE.

By the way, I wanted to specify stopwords via tokenizers::tokenize_words directly because it is more efficient than anti_join.
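
Not from the original comment: a rough sketch of how one might check that efficiency claim, assuming the bench package is available (the repetition count is arbitrary):

library(tidytext)
library(dplyr)
library(bench)

d <- tibble(text = rep("The apple and the pear", 10000))
stop_word_list <- tibble(word = "the")

bench::mark(
  tokenizer = d %>% unnest_tokens(word, text, stopwords = "the"),
  anti_join = d %>% unnest_tokens(word, text) %>%
    anti_join(stop_word_list, by = "word"),
  check = FALSE  # the two approaches currently return different rows
)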

Is there any reason you need to set lowercase to FALSE?

tidytext/R/unnest_tokens.R

Lines 147 to 149 in f644ef1

)) {
tokenfunc <- function(col, ...) tf(col, lowercase = FALSE, ...)
} else {

Perhaps it should be

tokenfunc <- function(col, ...) tf(col, lowercase = to_lower, ...)

@EmilHvitfeldt
Contributor

EmilHvitfeldt commented Jun 3, 2020

You are right, my bad 😄

At a cursory glance, I don't see any reason why that wouldn't be possible.

@juliasilge
Owner

This is a really great point @randy3k; thanks so much for surfacing this. 🙌

You can install from GitHub to get the new version that will work this way:

library(tidytext)
library(tidyverse)

d <- tibble(text = c("The apple and the pear"))

d %>% 
  unnest_tokens(word, text, stopwords = c("the"))
#> # A tibble: 3 x 1
#>   word 
#>   <chr>
#> 1 apple
#> 2 and  
#> 3 pear

Created on 2020-06-10 by the reprex package (v0.3.0.9001)
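
For reference, a typical way to install the development version from GitHub, assuming the remotes package is available:

remotes::install_github("juliasilge/tidytext")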

@MaximilianKrauss

MaximilianKrauss commented Jun 11, 2020

Wow, amazing! I ran into this very problem literally half an hour ago, when I passed stopwords = c(tidytext::stop_words$words, "custom_stop_word") to tidytext::unnest_tokens() and was puzzled why many of the stop words were removed, but not all of them. So I debugged the code, found the hard-coded lowercase = FALSE causing the capitalized versions of stop words to be missed, and was about to open an issue about it.
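
A minimal sketch of such a call with the fix applied, assuming the development version from GitHub and using the word column of tidytext::stop_words (the output shown is what one would expect, not a verbatim reprex):

library(tidytext)
library(dplyr)

d <- tibble(text = "The apple and the pear")

# Tokens are now lowercased before the stop word filter, so "The", "the",
# and "and" are all dropped.
d %>%
  unnest_tokens(word, text,
                stopwords = c(stop_words$word, "custom_stop_word"))
#> # A tibble: 2 x 1
#>   word 
#>   <chr>
#> 1 apple
#> 2 pear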

I am glad, though, that I looked for related issues first and found that @randy3k had raised this just 9 days ago, so I can install the fix from GitHub right away. :) Remarkable coincidence! Thank you, too, @juliasilge! 🙏

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 24, 2022