
Check out tokenizing on ngram and skip ngram #7

Closed

juliasilge opened this issue Apr 29, 2016 · 14 comments

Comments

@juliasilge
Owner

juliasilge commented Apr 29, 2016

These are in tokenizers but we haven't tested them out or anything. I have had one question from a user (potential user?) about this so far.
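
For reference, here is a minimal sketch of calling those tokenizers functions directly (tokenize_ngrams() and tokenize_skip_ngrams() are the exported names; the input text and argument values are just illustrative):

library(tokenizers)

txt <- "the quick brown fox jumps over the lazy dog"

# contiguous bigrams
tokenize_ngrams(txt, n = 2)

# skip bigrams: word pairs up to k = 1 positions apart
tokenize_skip_ngrams(txt, n = 2, k = 1)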

@juliasilge
Owner Author

May just need to set up tests? Looks like it is largely working.

@dgrtwo
Collaborator

dgrtwo commented Apr 29, 2016

This both works and is really, really interesting to use, especially combined with separate(), since it gives a format that works well with ggraph.

I'm going to write a vignette for it once I have a chance.

library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)
library(gutenbergr)

# King James Bible
kjv <- gutenberg_download(10)

kjv_words <- kjv %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stop_words$word)

kjv_words
#> Source: local data frame [12,514 x 2]
#> 
#>      word     n
#>     (chr) (int)
#> 1    lord  7830
#> 2    thou  5474
#> 3     thy  4600
#> 4     god  4445
#> 5      ye  3983
#> 6    thee  3827
#> 7       1  2783
#> 8       2  2721
#> 9  israel  2565
#> 10      3  2560
#> ..    ...   ...

kjv_2grams <- kjv %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE) %>%
  separate(ngram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !str_detect(word1, "^\\d+"),
         !str_detect(word2, "^\\d+"),
         n > 25)

kjv_2grams
#> Source: local data frame [125 x 3]
#> 
#>    word1  word2     n
#>    (chr)  (chr) (int)
#> 1   thou  shalt  1160
#> 2   thou   hast   723
#> 3   lord    god   521
#> 4    thy    god   322
#> 5   thou    art   308
#> 6   lord    thy   302
#> 7   lord   hath   270
#> 8  shalt   thou   238
#> 9  jesus christ   174
#> 10   god   hath   170
#> ..   ...    ...   ...

vertices <- kjv_words %>%
  filter(word %in% kjv_2grams$word1 | word %in% kjv_2grams$word2)

library(ggraph)
library(igraph)

graph_from_data_frame(kjv_2grams, vertices = vertices) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), arrow = arrow(length = unit(.15, "inches"))) +
  geom_node_point(aes(size = n), color = "lightblue") +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  scale_edge_alpha_continuous(trans = "log10") +
  scale_size_continuous(range = c(1, 10)) +
  theme_void()

@juliasilge
Owner Author

I wrote two tests for unnest_tokens, one for ngram and one for skip_ngram, in commit 986570c.
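
In case it helps anyone reading along, a hypothetical sketch of the shape of those tests (the real ones are in 986570c; this toy data frame and expectation are illustrative only):

library(testthat)
library(tidytext)

d <- data.frame(txt = "because i could not stop for death",
                stringsAsFactors = FALSE)

test_that("tokenizing by ngrams works", {
  out <- unnest_tokens(d, ngram, txt, token = "ngrams", n = 2)
  expect_equal(out$ngram[1], "because i")
})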

@juliasilge
Owner Author

And then I fixed those (broken) tests, with commit 66c532b. 😁

@juliasilge
Owner Author

I added ngrams and skip_ngrams to the documentation and examples for unnest_tokens in commit 02f79bc.
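
For anyone following along, the documented calls look roughly like this (toy input; n and k are passed through to the tokenizers functions):

library(tidytext)

d <- data.frame(txt = "the quick brown fox jumps", stringsAsFactors = FALSE)

# contiguous bigrams
unnest_tokens(d, ngram, txt, token = "ngrams", n = 2)

# skip ngrams
unnest_tokens(d, skipgram, txt, token = "skip_ngrams", n = 2, k = 1)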

@juliasilge
Owner Author

@dgrtwo I agree that this is so interesting and I love the analysis you have here! But can we put it in a vignette? gutenbergr reverse-suggests tidytext; doesn't that mean we can't import gutenbergr to use it in a vignette? And ggraph is only on GitHub right now anyway. Would it be better to put this analysis in a blog post, or in gutenbergr somewhere, in a vignette or README over there?

@juliasilge
Owner Author

@dgrtwo I think I am going to close this unless you have an objection. Did you see my last thoughts on doing a vignette with this?

@dgrtwo
Collaborator

dgrtwo commented May 10, 2016

Sorry for the delay. I think it's still a good idea to have a tidytext vignette on n-grams:

  1. Reciprocal SUGGESTS are permitted by CRAN, though reciprocal DEPENDS/IMPORTS are not. In any case I could imagine future tidytext vignettes making use of downloaded books so it's a SUGGESTS I'd like to add.
  2. We don't have any vignette examples of n-gram tokenization yet (though thanks for adding it to docs+examples!)
  3. I think part of the value of the tidytext package is that its vignettes can serve as a collection of tidy text mining examples. In contrast I'm trying to keep gutenbergr as a utility.
  4. I'll leave ggraph out of the vignette until it is added to CRAN. I'm about to bug Thomas about that :)

So is it OK if we keep it open for now?
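
To make point 1 concrete, the relevant DESCRIPTION excerpts would look something like this (field contents are illustrative; only the reciprocal Suggests fields matter here):

# tidytext/DESCRIPTION (excerpt)
Suggests:
    gutenbergr

# gutenbergr/DESCRIPTION (excerpt)
Suggests:
    tidytext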

@juliasilge
Owner Author

Ah, I did not know that about reciprocal SUGGESTS! Sounds good. Let's keep this open and plan for a vignette. I think that other than the vignette, the package is good to go for ngram and skip_ngram. (Assuming my tests are extensive enough.)

@dgrtwo
Collaborator

dgrtwo commented May 10, 2016

Cool!

BTW, check out the latest commit: I set up tidiers for LDA from topicmodels, along with a vignette (work in progress, but it shows how cool tidy topic models can be).
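
A minimal sketch of what that looks like (assuming the tidy() verb for topicmodels LDA objects from that commit; the AssociatedPress data ships with topicmodels):

library(topicmodels)
library(tidytext)

data("AssociatedPress", package = "topicmodels")

# a deliberately tiny model, just for illustration
ap_lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1234))

# one row per topic-term pair, with per-topic word probabilities
tidy(ap_lda, matrix = "beta")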

@juliasilge
Owner Author

I just looked at the topic modeling vignette. Wow, that is cool.

@yosuke-yasuda

Hi, I have a suggestion about this feature.

Usually I want to process each word first, for example stemming or filtering stop words, so I want to create n-grams and skip-grams from the unnested state.

quanteda::skipgrams seems to be a useful function for this. How about implementing something like the following, which creates n-grams and skip-grams from unnested tokens?

generate_ngrams <- function(tbl, group_col, token_col, n, skip){
  loadNamespace("dplyr")
  loadNamespace("tidyr")
  loadNamespace("quanteda")
  grouped <- dplyr::group_by_(tbl, group_col)

  # dplyr stores zero-based row indices for each group, hence index + 1
  indices <- attr(grouped, "indices")
  labels <- attr(grouped, "labels")

  # build the (skip) n-grams from each group's tokens
  labels[[token_col]] <- lapply(indices, function(index){
    quanteda::skipgrams(as.character(tbl[index + 1, token_col]), n = n, skip = skip)
  })

  # one row per generated n-gram
  tidyr::unnest_(labels, token_col)
}

library(testthat)

test_that("generate_ngrams", {
  # 50 random terms drawn from term1..term20, recycled across documents
  term <- paste("term", as.vector(sapply(seq(10), function(x){
    sample(seq(20), 5)
  })), sep = "")
  # data frame with 10 documents and 10 terms each
  tidy_test_df <- data.frame(
    document = as.vector(t(replicate(10, paste("doc", seq(10), sep = "")))),
    term = term)

  # n = 1:2 yields 10 unigrams + 9 bigrams per document: 19 * 10 = 190 rows
  result <- generate_ngrams(tidy_test_df, "document", "term", n = 1:2, skip = 0)
  expect_equal(nrow(result), 190)
})

@juliasilge
Owner Author

In commit 21f7e5f I changed unnest_tokens so that it collapses the input text before unnesting for ngrams and skip_ngrams. These tokenizers were missing from the list of options that trigger collapsing, but they should have been there. I also updated the relevant tests. I think I caught everything related to this edit, but let me know if I missed something!
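
To illustrate the intended behavior with a toy example (not the actual test from the commit):

library(tidytext)

d <- data.frame(txt = c("hello world", "goodbye world"),
                stringsAsFactors = FALSE)

# with the input collapsed before tokenizing, the bigram "world goodbye"
# (spanning the two rows) now shows up alongside the within-row bigrams
unnest_tokens(d, bigram, txt, token = "ngrams", n = 2)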

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2022