should not set `lowercase` argument of `tokenizers::tokenize_words` #175
This is the expected behavior. The `stopwords` argument is matched against the tokens before they are converted to lowercase, so it is case sensitive:

```r
library(tidytext)
library(tidyverse)

d <- tibble(text = c("The apple and the pear"))

d %>% unnest_tokens(word, text, stopwords = c("the"))
#> # A tibble: 4 x 1
#>   word
#>   <chr>
#> 1 the
#> 2 apple
#> 3 and
#> 4 pear

d %>% unnest_tokens(word, text, stopwords = c("The"))
#> # A tibble: 4 x 1
#>   word
#>   <chr>
#> 1 apple
#> 2 and
#> 3 the
#> 4 pear
```

If you want to remove stopwords after the tokens have been converted to lowercase, then you can follow this approach:

```r
stop_word_list <- tibble(word = "the")

d %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_word_list, by = "word")
#> # A tibble: 3 x 1
#>   word
#>   <chr>
#> 1 apple
#> 2 and
#> 3 pear
```
Actually, `tokenizers::tokenize_words` itself matches stopwords case insensitively:

```r
tokenizers::tokenize_words("The apple and the pear", stopwords = "the")
#> [[1]]
#> [1] "apple" "and"   "pear"
```

It is now case sensitive in `unnest_tokens` because `lowercase` is set to `FALSE` there (unnest_tokens.R, lines 147 to 149 in f644ef1). Is there any reason you need to set `lowercase = FALSE`? Perhaps it should be left at its default, so that stopwords are matched after lowercasing.
You are right, my bad 😄 At a cursory glance, I don't see any reason why that wouldn't be possible.
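The ordering issue under discussion can be sketched in base R (a simplified illustration, not tidytext's actual implementation; `setdiff` stands in for the tokenizer's stopword filtering):

```r
text <- "The apple and the pear"
tokens <- strsplit(text, " ")[[1]]

# Behavior before the fix: stopwords are matched against the
# original-case tokens, so "The" slips through and is only
# lowercased afterwards.
before_fix <- tolower(setdiff(tokens, "the"))
before_fix
#> [1] "the"   "apple" "and"   "pear"

# Behavior after the fix: lowercase first, then match stopwords,
# so both "The" and "the" are removed.
after_fix <- setdiff(tolower(tokens), "the")
after_fix
#> [1] "apple" "and"   "pear"
```

This reproduces the difference between the two reprexes in the thread without depending on either package version.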
This is a really great point @randy3k; thanks so much for surfacing this. 🙌 You can install from GitHub to get the new version that will work this way:

```r
library(tidytext)
library(tidyverse)

d <- tibble(text = c("The apple and the pear"))

d %>%
  unnest_tokens(word, text, stopwords = c("the"))
#> # A tibble: 3 x 1
#>   word
#>   <chr>
#> 1 apple
#> 2 and
#> 3 pear
```

Created on 2020-06-10 by the reprex package (v0.3.0.9001)
Wow, amazing! I ran into this very problem literally half an hour ago. I am glad, though, that I looked for related issues first, found out that @randy3k came up with this just 9 days ago, and can install the fix from GitHub right away. :) Remarkable coincidence! Thank you, too, @juliasilge! 🙏
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
In https://github.com/juliasilge/tidytext/blob/master/R/unnest_tokens.R#L148, `lowercase` is set to `FALSE` for `tokenizers::tokenize_words`. This breaks the case handling of the `stopwords` argument in `tokenizers::tokenize_words`; see the example above.