
Check out tokenizing on ngram and skip ngram #7

Closed

juliasilge opened this issue Apr 29, 2016 · 14 comments

Comments

@juliasilge
Owner

juliasilge commented Apr 29, 2016

These are in tokenizers but we haven't tested them out or anything. I have had one question from a user (potential user?) about this so far.
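
For reference, here is a minimal sketch of calling those tokenizers functions directly (tokenize_ngrams() and tokenize_skip_ngrams() are the exported names; the input text and argument values are just illustrative):

library(tokenizers)

txt <- "the quick brown fox jumps over the lazy dog"

# contiguous bigrams
tokenize_ngrams(txt, n = 2)

# skip bigrams: word pairs up to k = 1 positions apart
tokenize_skip_ngrams(txt, n = 2, k = 1)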

@juliasilge
Owner Author

May just need to set up tests? Looks like it is largely working.

@dgrtwo
Collaborator

dgrtwo commented Apr 29, 2016

This both works and is really, really interesting to use, especially combined with separate(), since it gives a format that works well with ggraph.

I'm going to write a vignette for it once I have a chance.

library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)
library(gutenbergr)

# King James Bible
kjv <- gutenberg_download(10)

kjv_words <- kjv %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stop_words$word)

kjv_words
#> Source: local data frame [12,514 x 2]
#> 
#>      word     n
#>     (chr) (int)
#> 1    lord  7830
#> 2    thou  5474
#> 3     thy  4600
#> 4     god  4445
#> 5      ye  3983
#> 6    thee  3827
#> 7       1  2783
#> 8       2  2721
#> 9  israel  2565
#> 10      3  2560
#> ..    ...   ...

kjv_2grams <- kjv %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  count(ngram, sort = TRUE) %>%
  separate(ngram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !str_detect(word1, "^\\d+"),
         !str_detect(word2, "^\\d+"),
         n > 25)

kjv_2grams
#> Source: local data frame [125 x 3]
#> 
#>    word1  word2     n
#>    (chr)  (chr) (int)
#> 1   thou  shalt  1160
#> 2   thou   hast   723
#> 3   lord    god   521
#> 4    thy    god   322
#> 5   thou    art   308
#> 6   lord    thy   302
#> 7   lord   hath   270
#> 8  shalt   thou   238
#> 9  jesus christ   174
#> 10   god   hath   170
#> ..   ...    ...   ...

vertices <- kjv_words %>%
  filter(word %in% kjv_2grams$word1 | word %in% kjv_2grams$word2)

library(ggraph)
library(igraph)

graph_from_data_frame(kjv_2grams, vertices = vertices) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), arrow = arrow(length = unit(.15, "inches"))) +
  geom_node_point(aes(size = n), color = "lightblue") +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  scale_edge_alpha_continuous(trans = "log10") +
  scale_size_continuous(range = c(1, 10)) +
  theme_void()

@juliasilge
Owner Author

I wrote two tests for unnest_tokens, one for ngram and one for skip_ngram, in commit 986570c.
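
In case it helps anyone reading along, a hypothetical sketch of the shape of those tests (the real ones are in 986570c; this toy data frame and expectation are illustrative only):

library(testthat)
library(tidytext)

d <- data.frame(txt = "because i could not stop for death",
                stringsAsFactors = FALSE)

test_that("tokenizing by ngrams works", {
  out <- unnest_tokens(d, ngram, txt, token = "ngrams", n = 2)
  expect_equal(out$ngram[1], "because i")
})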

@juliasilge
Owner Author

And then I fixed those (broken) tests, with commit 66c532b. 😁

@juliasilge
Owner Author

I added ngrams and skip_ngrams to the documentation and examples for unnest_tokens in commit 02f79bc.
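
For anyone following along, the documented calls look roughly like this (toy input; n and k are passed through to the tokenizers functions):

library(tidytext)

d <- data.frame(txt = "the quick brown fox jumps", stringsAsFactors = FALSE)

# contiguous bigrams
unnest_tokens(d, ngram, txt, token = "ngrams", n = 2)

# skip ngrams
unnest_tokens(d, skipgram, txt, token = "skip_ngrams", n = 2, k = 1)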

@juliasilge
Owner Author

@dgrtwo I agree that this is so interesting and I love the analysis you have here! But can we put it in a vignette? gutenbergr reverse-suggests tidytext; doesn't that mean we can't import gutenbergr to use it in a vignette? And ggraph is only on GitHub right now anyway. Would it be better to put this analysis in a blog post, or in gutenbergr somewhere, in a vignette or README over there?

@juliasilge
Owner Author

@dgrtwo I think I am going to close this unless you have an objection. Did you see my last thoughts on doing a vignette with this?

@dgrtwo
Collaborator

dgrtwo commented May 10, 2016

Sorry for the delay. I think it's still a good idea to have a tidytext vignette on n-grams:

  1. Reciprocal SUGGESTS are permitted by CRAN, though reciprocal DEPENDS/IMPORTS are not. In any case I could imagine future tidytext vignettes making use of downloaded books so it's a SUGGESTS I'd like to add.
  2. We don't have any vignette examples of n-gram tokenization yet (though thanks for adding it to docs+examples!)
  3. I think part of the value of the tidytext package is that its vignettes can serve as a collection of tidy text mining examples. In contrast I'm trying to keep gutenbergr as a utility.
  4. I'll leave ggraph out of the vignette until it is added to CRAN. I'm about to bug Thomas about that :)

So is it OK if we keep it open for now?
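
To make point 1 concrete, the relevant DESCRIPTION excerpts would look something like this (field contents are illustrative; only the reciprocal Suggests fields matter here):

# tidytext/DESCRIPTION (excerpt)
Suggests:
    gutenbergr

# gutenbergr/DESCRIPTION (excerpt)
Suggests:
    tidytext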

@juliasilge
Owner Author

Ah, I did not know that about reciprocal SUGGESTS! Sounds good. Let's keep this open and plan for a vignette. I think that other than the vignette, the package is good to go for ngram and skip_ngram. (Assuming my tests are extensive enough.)

@dgrtwo
Collaborator

dgrtwo commented May 10, 2016

Cool!

BTW, check out the latest commit: I set up tidiers for LDA from topicmodels, along with a vignette (work in progress, but it shows how cool tidy topic models can be).
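
A minimal sketch of what that looks like (assuming the tidy() verb for topicmodels LDA objects from that commit; the AssociatedPress data ships with topicmodels):

library(topicmodels)
library(tidytext)

data("AssociatedPress", package = "topicmodels")

# a deliberately tiny model, just for illustration
ap_lda <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1234))

# one row per topic-term pair, with per-topic word probabilities
tidy(ap_lda, matrix = "beta")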

@juliasilge
Owner Author

I just looked at the topic modeling vignette. Wow, that is cool.

@yosuke-yasuda

Hi, I have a suggestion about this feature.

Usually I want to process each word first, for example stemming or filtering stop words, so I want to create n-grams and skip-grams from the unnested state.

quanteda::skipgrams seems to be a useful function for this. How about implementing something like the following, which creates n-grams and skip-grams from unnested tokens?

generate_ngrams <- function(tbl, group_col, token_col, n, skip){
  loadNamespace("dplyr")
  loadNamespace("tidyr")
  loadNamespace("quanteda")
  grouped <- dplyr::group_by_(tbl, group_col)

  # dplyr stores zero-based row indices for each group, hence index + 1
  indices <- attr(grouped, "indices")
  labels <- attr(grouped, "labels")

  # build the (skip) n-grams from each group's tokens
  labels[[token_col]] <- lapply(indices, function(index){
    quanteda::skipgrams(as.character(tbl[index + 1, token_col]), n = n, skip = skip)
  })

  # one row per generated n-gram
  tidyr::unnest_(labels, token_col)
}

library(testthat)

test_that("generate_ngrams", {
  # 50 random terms drawn from term1..term20, recycled across documents
  term <- paste("term", as.vector(sapply(seq(10), function(x){
    sample(seq(20), 5)
  })), sep = "")
  # data frame with 10 documents and 10 terms each
  tidy_test_df <- data.frame(
    document = as.vector(t(replicate(10, paste("doc", seq(10), sep = "")))),
    term = term)

  # n = 1:2 yields 10 unigrams + 9 bigrams per document: 19 * 10 = 190 rows
  result <- generate_ngrams(tidy_test_df, "document", "term", n = 1:2, skip = 0)
  expect_equal(nrow(result), 190)
})

@juliasilge
Owner Author

In commit 21f7e5f I changed unnest_tokens so that it collapses the input text before unnesting for ngrams and skip_ngrams. These tokenizers were missing from the list of options that trigger collapsing, but they should have been there. I also updated the relevant tests. I think I caught everything related to this edit, but let me know if I missed something!
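
To illustrate the intended behavior with a toy example (not the actual test from the commit):

library(tidytext)

d <- data.frame(txt = c("hello world", "goodbye world"),
                stringsAsFactors = FALSE)

# with the input collapsed before tokenizing, the bigram "world goodbye"
# (spanning the two rows) now shows up alongside the within-row bigrams
unnest_tokens(d, bigram, txt, token = "ngrams", n = 2)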

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2022