
Custom tokenizer in unnest_tokens #10

Closed
yosuke-yasuda opened this issue May 16, 2016 · 11 comments

@yosuke-yasuda

Hi, thanks for developing this helpful package!

unnest_tokens seems to use tokenizers from the "tokenizers" package, but how about making it possible to use a tokenizer of the user's choice?

I'm from Japan, and tokenization for Japanese is totally different, so I want to use a custom tokenizer.

@juliasilge
Owner

@Double-y Thanks for your interest in the package! To clarify: are you saying there are existing tokenizers for Japanese that would be nice to offer as an option, in the way the tokenizers package is used, or that unnest_tokens should be structured more generally so that users can supply their own tokenizers?

@dgrtwo
Collaborator

dgrtwo commented May 17, 2016

@juliasilge What if we allowed users to pass a function (or a string like we already do) to the token argument? I could set that up!

dgrtwo pushed a commit that referenced this issue May 17, 2016
Added a collapse argument that makes the choice of whether to combine lines before tokenizing explicit.

See #10
@dgrtwo
Collaborator

dgrtwo commented May 17, 2016

Here's one implementation (along with test cases). You can pass a custom function to token = (and any extra arguments to ...). Please try it out @juliasilge and @Double-y!
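A minimal sketch of how that interface might be used (the comma_tokenizer function and toy data below are made up for illustration; the assumption is that token = accepts any function mapping a character vector to a list of token vectors):

library(dplyr)
library(tidytext)

d <- data_frame(txt = c("apple,banana", "cherry,date"))

# hypothetical custom tokenizer: splits each string on commas and
# returns a list of character vectors, one element per input string
comma_tokenizer <- function(x) strsplit(x, ",", fixed = TRUE)

d %>%
  unnest_tokens(word, txt, token = comma_tokenizer)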

@juliasilge
Owner

@dgrtwo This looks really good on an initial work-through, and I like the way the user interacts with the function here, the flexibility, etc. Really nice! @Double-y, let us know what you think as you try it out.

@yosuke-yasuda
Author

Thanks for considering this feature @juliasilge, @dgrtwo! Yes, I know of a tokenizer for Japanese called RMeCab. I'll try it on 338cc6f

@yosuke-yasuda
Author

Yeah, it works :)

This might be a very Japanese-specific matter, but text is tokenized with part-of-speech detection at the same time for efficiency. In the current unnest_tokens framework, though, I can't preserve the part-of-speech information, right? The output of the tokenizer looks like this:

[screenshot, 2016-05-17: tokenizer output pairing each word with its part of speech]

If you have a good idea for preserving this information as well, that would be amazing.

@juliasilge
Owner

That's great that it works! Thanks for bringing this up; I think this improves flexibility/usability for the package.

In English, we have been doing part-of-speech detection separately (after unnesting) using a join with a data set that's in the package; so far this has been kept separate from the unnest_tokens function because we have been trying to work within tidy data principles. Let us think about that.
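For reference, that join-after-unnest workflow looks roughly like this (pos_lexicon is a hypothetical lookup table with word and pos columns, standing in for the data set shipped with the package):

# hypothetical lookup table: one row per (word, part of speech) pair
pos_lexicon <- dplyr::data_frame(
  word = c("run", "run", "quickly"),
  pos  = c("Noun", "Verb", "Adverb")
)

d %>%
  unnest_tokens(word, text) %>%
  dplyr::left_join(pos_lexicon, by = "word")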

@yosuke-yasuda
Author

Yeah, as for English (and probably most alphabetic languages too), separating those steps seems better.

This need is very specific to Japanese, so I don't think it has to be implemented in this package either.

I just wanted to know if there's a good way to do it. Don't worry too much about it.

Thank you!

@dgrtwo
Collaborator

dgrtwo commented May 17, 2016

One suggestion for keeping parts of speech (I agree it unfortunately wouldn't fit in the tidytext package, since it's a very specific need) is to use unnest manually. I haven't been able to get RMeCab to work on my computer, so this is a rough guess I haven't tested:

# RMeCabC() tokenizes one string at a time and returns a list of
# named character vectors, where the names are the parts of speech
RMeCabWrapper <- function(texts, ...) {
  lapply(texts, function(txt) {
    words <- unlist(RMeCab::RMeCabC(txt, ...))
    dplyr::data_frame(pos = names(words), word = unname(words))
  })
}

d %>%
  dplyr::mutate(tokens = RMeCabWrapper(text)) %>%
  tidyr::unnest(tokens)

This (or something similar to it) should be able to create two columns: pos (with the parts of speech) and word (with the words).

@yosuke-yasuda
Author

@dgrtwo Thank you for the suggestion! It looks good. I'll try it.

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2022