should not set `lowercase` argument of `tokenizers::tokenize_words` #175
This is the expected behavior. The `stopwords` argument is matched against the tokens before they are converted to lowercase, so it is case sensitive:

```r
library(tidytext)
library(tidyverse)

d <- tibble(text = c("The apple and the pear"))

d %>% unnest_tokens(word, text, stopwords = c("the"))
#> # A tibble: 4 x 1
#>   word
#>   <chr>
#> 1 the
#> 2 apple
#> 3 and
#> 4 pear

d %>% unnest_tokens(word, text, stopwords = c("The"))
#> # A tibble: 4 x 1
#>   word
#>   <chr>
#> 1 apple
#> 2 and
#> 3 the
#> 4 pear
```

If you want to remove stopwords after the tokens have been converted to lowercase, then you can follow this approach:

```r
stop_word_list <- tibble(word = "the")

d %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_word_list, by = "word")
#> # A tibble: 3 x 1
#>   word
#>   <chr>
#> 1 apple
#> 2 and
#> 3 pear
```
Actually, `tokenizers::tokenize_words` itself matches stopwords case insensitively:

```r
tokenizers::tokenize_words("The apple and the pear", stopwords = "the")
#> [[1]]
#> [1] "apple" "and"   "pear"
```

It is now case sensitive in `unnest_tokens` because `lowercase` is set to `FALSE` there (unnest_tokens.R, lines 147 to 149 in f644ef1). Is there any reason you need to set `lowercase = FALSE`? Perhaps it should be left at its default, so that stopwords are matched after lowercasing.
You are right, my bad 😄 At a cursory glance, I don't see any reason why that wouldn't be possible.
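The ordering issue under discussion can be sketched in base R (a simplified illustration, not tidytext's actual implementation; `setdiff` stands in for the tokenizer's stopword filtering):

```r
text <- "The apple and the pear"
tokens <- strsplit(text, " ")[[1]]

# Behavior before the fix: stopwords are matched against the
# original-case tokens, so "The" slips through and is only
# lowercased afterwards.
before_fix <- tolower(setdiff(tokens, "the"))
before_fix
#> [1] "the"   "apple" "and"   "pear"

# Behavior after the fix: lowercase first, then match stopwords,
# so both "The" and "the" are removed.
after_fix <- setdiff(tolower(tokens), "the")
after_fix
#> [1] "apple" "and"   "pear"
```

This reproduces the difference between the two reprexes in the thread without depending on either package version.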
This is a really great point @randy3k; thanks so much for surfacing this. 🙌 You can install from GitHub to get the new version that will work this way:

```r
library(tidytext)
library(tidyverse)

d <- tibble(text = c("The apple and the pear"))

d %>%
  unnest_tokens(word, text, stopwords = c("the"))
#> # A tibble: 3 x 1
#>   word
#>   <chr>
#> 1 apple
#> 2 and
#> 3 pear
```

Created on 2020-06-10 by the reprex package (v0.3.0.9001)
Wow, amazing! I ran into this very problem literally half an hour ago. I am glad, though, that I looked for related issues first, found out that @randy3k came up with this just 9 days ago, and can install the fix from GitHub right away. :) Remarkable coincidence! Thank you, too, @juliasilge! 🙏
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
In https://github.com/juliasilge/tidytext/blob/master/R/unnest_tokens.R#L148, `lowercase` is set to `FALSE` for `tokenizers::tokenize_words`. This breaks the case handling of the `stopwords` argument in `tokenizers::tokenize_words`; see the example above.