Check out tokenizing on ngram and skip ngram #7
These are in tokenizers but we haven't tested them out or anything. I have had one question from a user (potential user?) about this so far.

Comments
May just need to set up tests? Looks like it is largely working.
This both works and is really, really interesting to use, especially with ggraph. I'm going to write a vignette for it once I have a chance.

library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)
library(gutenbergr)

# King James Bible
kjv <- gutenberg_download(10)

kjv_words <- kjv %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stop_words$word)
kjv_words
#> Source: local data frame [12,514 x 2]
#>
#>      word     n
#>     (chr) (int)
#> 1    lord  7830
#> 2    thou  5474
#> 3     thy  4600
#> 4     god  4445
#> 5      ye  3983
#> 6    thee  3827
#> 7       1  2783
#> 8       2  2721
#> 9  israel  2565
#> 10      3  2560
#> ..    ...   ...
kjv_2grams <- kjv %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE) %>%
  separate(ngram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !str_detect(word1, "^\\d+"),
         !str_detect(word2, "^\\d+"),
         n > 25)
kjv_2grams
#> Source: local data frame [125 x 3]
#>
#>    word1  word2     n
#>    (chr)  (chr) (int)
#> 1   thou  shalt  1160
#> 2   thou   hast   723
#> 3   lord    god   521
#> 4    thy    god   322
#> 5   thou    art   308
#> 6   lord    thy   302
#> 7   lord   hath   270
#> 8  shalt   thou   238
#> 9  jesus christ   174
#> 10   god   hath   170
#> ..   ...    ...   ...
vertices <- kjv_words %>%
  filter(word %in% kjv_2grams$word1 | word %in% kjv_2grams$word2)

library(ggraph)
library(igraph)

graph_from_data_frame(kjv_2grams, vertices = vertices) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), arrow = arrow(length = unit(.15, "inches"))) +
  geom_node_point(aes(size = n), color = "lightblue") +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  scale_edge_alpha_continuous(trans = "log10") +
  scale_size_continuous(range = c(1, 10)) +
  theme_void()
I wrote two tests for unnest_tokens, one for ngram and one for skip_ngram, in commit 986570c.
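For readers following along, a minimal sketch of what a test along these lines could look like (illustrative only, with a made-up toy sentence; not the actual test from that commit):

library(testthat)
library(tidytext)
library(dplyr)

test_that("tokenizing by ngram works", {
  d <- tibble(txt = "I am the very model of a modern major general")

  # 10 words should yield 9 overlapping bigrams, lowercased by default
  d_bigrams <- d %>% unnest_tokens(ngram, txt, token = "ngrams", n = 2)

  expect_equal(nrow(d_bigrams), 9)
  expect_equal(d_bigrams$ngram[1], "i am")
})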
And then I fixed those (broken) tests, with commit 66c532b. 😁
I added ngrams and skip_ngrams to the documentation and examples for unnest_tokens in commit 02f79bc.
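For illustration, here is a toy skip-gram call through the same interface (the sentence is made up; n is the n-gram size and k the maximum skip, per the tokenizers defaults):

library(dplyr)
library(tidytext)

d <- tibble(txt = "the quick brown fox jumps")

# n = 2, k = 1: bigrams that may skip up to one intervening word,
# e.g. "the quick" but also "the brown"
d %>%
  unnest_tokens(skipgram, txt, token = "skip_ngrams", n = 2, k = 1)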
@dgrtwo I agree that this is so interesting and I love the analysis you have here! But can we put it in a vignette?
@dgrtwo I think I am going to close this unless you have an objection. Did you see my last thoughts on doing a vignette with this?
Sorry for the delay. I think it's still a good idea to have a tidytext vignette on n-grams:

So is it OK if we keep it open for now?
Ah, I did not know that about reciprocal SUGGESTS! Sounds good. Let's keep this open and plan for a vignette. I think that other than the vignette, the package is good to go for ngram and skip_ngram. (Assuming my tests are extensive enough.)
Cool! BTW, check out the latest commit: I set up tidiers for LDA from topicmodels, along with a vignette (work in progress, but it shows how cool tidy topic models can be).
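Roughly, what those tidiers enable (a sketch using the AssociatedPress data that ships with topicmodels; not necessarily the exact vignette code):

library(topicmodels)
library(tidytext)
library(dplyr)

data("AssociatedPress")

# fit a two-topic LDA model on the AP document-term matrix
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# tidy() turns the model into one row per topic-term pair,
# with the per-topic word probability in the beta column
ap_topics <- tidy(ap_lda, matrix = "beta")

ap_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(topic, desc(beta))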
I just looked at the topic modeling vignette. Wow, that is cool.
Hi, I have a suggestion about this feature. Usually I want to process each word first, e.g. stemming or filtering stop words, so I want to create n-grams and skip-grams from the unnested state. quanteda::skipgrams seems to be a useful function for doing so.

generate_ngrams <- function(tbl, group_col, token_col, n, skip) {
  loadNamespace("dplyr")
  loadNamespace("tidyr")
  loadNamespace("quanteda")

  # group by the document column, then build n-grams/skip-grams within each group
  grouped <- dplyr::group_by_(tbl, group_col)
  indices <- attr(grouped, "indices")
  labels <- attr(grouped, "labels")

  labels[[token_col]] <- lapply(indices, function(index) {
    # the grouping indices are zero-based, hence index + 1
    quanteda::skipgrams(as.character(tbl[index + 1, token_col]), n = n, skip = skip)
  })

  # back to one token (n-gram) per row
  tidyr::unnest_(labels, token_col)
}
library(testthat)

test_that("generate_ngrams", {
  # 100 random terms drawn from term1..term20
  term <- paste("term", as.vector(sapply(seq(10), function(x) {
    sample(seq(20), 10)
  })), sep = "")

  # data frame with 10 documents and 10 random terms for each
  tidy_test_df <- data.frame(
    document = as.vector(t(replicate(10, paste("doc", seq(10), sep = "")))),
    term = term)

  result <- generate_ngrams(tidy_test_df, "document", "term", n = 1:2, skip = 0)

  # 10 unigrams + 9 bigrams per document, times 10 documents
  expect_equal(nrow(result), 190)
})
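To show why building n-grams from the unnested state helps, a sketch that stems each word with SnowballC before calling the generate_ngrams() helper above (the toy data frame is hypothetical):

library(dplyr)
library(SnowballC)

tidy_df <- data.frame(
  document = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc2"),
  term = c("running", "quickly", "home", "runs", "quick", "homes"),
  stringsAsFactors = FALSE)

stemmed_bigrams <- tidy_df %>%
  mutate(term = wordStem(term)) %>%  # stem each word before building n-grams
  generate_ngrams("document", "term", n = 2, skip = 0)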
In commit 21f7e5f I changed