
Custom tokenizer in unnest_tokens #10

Closed
yosuke-yasuda opened this issue May 16, 2016 · 11 comments

@yosuke-yasuda

Hi, thanks for developing this helpful package!

unnest_tokens seems to use tokenizers from the "tokenizers" package, but how about making it possible to use a tokenizer of the user's choice?

I'm from Japan, and tokenization for Japanese is totally different, so I want to use a custom tokenizer.

@juliasilge
Owner

@Double-y Thanks for your interest in the package! To clarify: are you saying there are existing tokenizers for Japanese that would be nice to offer as an option, in the way the tokenizers package is used, or that unnest_tokens should be structured more generally so that users can supply their own tokenizers?

@dgrtwo
Collaborator

dgrtwo commented May 17, 2016

@juliasilge What if we allowed users to pass a function (or a string like we already do) to the token argument? I could set that up!

dgrtwo pushed a commit that referenced this issue May 17, 2016
Added a collapse argument that makes the choice of whether to combine lines before tokenizing explicit.

See #10
@dgrtwo
Collaborator

dgrtwo commented May 17, 2016

Here's one implementation (along with test cases). You can pass a custom function to token = (and any extra arguments to ...). Please try it out @juliasilge and @Double-y!
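A minimal sketch of how that interface might be used (the comma_tokenizer function and toy data below are made up for illustration; the assumption is that token = accepts any function mapping a character vector to a list of token vectors):

library(dplyr)
library(tidytext)

d <- data_frame(txt = c("apple,banana", "cherry,date"))

# hypothetical custom tokenizer: splits each string on commas and
# returns a list of character vectors, one element per input string
comma_tokenizer <- function(x) strsplit(x, ",", fixed = TRUE)

d %>%
  unnest_tokens(word, txt, token = comma_tokenizer)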

@juliasilge
Owner

@dgrtwo This looks really good on an initial work-through, and I like the way the user interacts with the function here, the flexibility, etc. Really nice! @Double-y, let us know what you think as you try it out.

@yosuke-yasuda
Author

Thanks for considering this feature @juliasilge, @dgrtwo! Yes, I know of a tokenizer for Japanese called RMeCab. I'll try it on 338cc6f

@yosuke-yasuda
Author

Yeah, it works :)

This might be a very Japanese-specific matter, but text is tokenized with part-of-speech detection at the same time for efficiency. In the current unnest_tokens framework, though, I can't preserve the part-of-speech information, right? The output of the tokenizer looks like this:

[screenshot, 2016-05-17: tokenizer output pairing each word with its part of speech]

If you have a good idea for preserving this information as well, that would be amazing.

@juliasilge
Owner

That's great that it works! Thanks for bringing this up; I think this improves flexibility/usability for the package.

In English, we have been doing part-of-speech detection separately (after unnesting) using a join with a data set that's in the package; so far this has been kept separate from the unnest_tokens function because we have been trying to work within tidy data principles. Let us think about that.
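For reference, that join-after-unnest workflow looks roughly like this (pos_lexicon is a hypothetical lookup table with word and pos columns, standing in for the data set shipped with the package):

# hypothetical lookup table: one row per (word, part of speech) pair
pos_lexicon <- dplyr::data_frame(
  word = c("run", "run", "quickly"),
  pos  = c("Noun", "Verb", "Adverb")
)

d %>%
  unnest_tokens(word, text) %>%
  dplyr::left_join(pos_lexicon, by = "word")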

@yosuke-yasuda
Author

Yeah, as for English (and probably most alphabetic languages too), separating those steps seems better.

This need is very specific to Japanese, so I don't think it has to be implemented in this package either.

I just wanted to know if there's a good way to do it. Don't worry too much about it.

Thank you!

@dgrtwo
Collaborator

dgrtwo commented May 17, 2016

One suggestion for keeping parts of speech (I agree it unfortunately wouldn't fit in the tidytext package, since it's a very specific need) is to use unnest manually. I haven't been able to get RMeCab to work on my computer, so this is a rough guess I haven't tested:

# RMeCabC() tokenizes one string at a time and returns a list of
# named character vectors, where the names are the parts of speech
RMeCabWrapper <- function(texts, ...) {
  lapply(texts, function(txt) {
    words <- unlist(RMeCab::RMeCabC(txt, ...))
    dplyr::data_frame(pos = names(words), word = unname(words))
  })
}

d %>%
  dplyr::mutate(tokens = RMeCabWrapper(text)) %>%
  tidyr::unnest(tokens)

This (or something similar to it) should be able to create two columns: pos (with the parts of speech) and word (with the words).

@yosuke-yasuda
Author

@dgrtwo Thank you for the suggestion! It looks good. I'll try it.

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2022