Custom tokenizer in unnest_tokens #10
@Double-y Thanks for your interest in the package! To clarify, are you saying there are existing tokenizers for Japanese that you think it would be nice to implement as an option in the way that the tokenizers package is used, or that you think it would be good to structure unnest_tokens in a more general way so that users can supply their own tokenizers?
@juliasilge What if we allowed users to pass a function (or a string, like we already do) to the …
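A minimal sketch of what this could look like for users. The snippet above is truncated, so the `token` argument name and the exact contract are assumptions on my part: a user-supplied tokenizer is assumed to take a character vector and return a list of character vectors, one element per input document, matching the shape that functions from the tokenizers package return.

```r
# Sketch only: `token = <function>` is an assumed interface, not confirmed
# by the (truncated) comment above.
library(dplyr)
library(tidytext)

# A toy custom tokenizer: character vector in, list of character vectors out
# (one list element per input document).
tokenize_on_semicolons <- function(x) strsplit(x, ";\\s*")

d <- tibble(txt = c("alpha; beta; gamma", "delta; epsilon"))
unnest_tokens(d, term, txt, token = tokenize_on_semicolons)
# One row per token, in a `term` column, with other columns of `d` preserved.
```

The appeal of this design is that any tokenizer following the same list-of-character-vectors contract, including a Japanese one, could be dropped in without changes to unnest_tokens itself.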
Added a collapse argument that makes explicit the choice of whether to combine lines before tokenizing. See #10
Here's one implementation (along with test cases). You can pass a custom function to …
Thanks for considering this feature, @juliasilge and @dgrtwo! Yes, I know of a tokenizer for Japanese called RMeCab. I'll try it on 338cc6f
Yeah, it works :) This might be a very specific matter for Japanese, but the text is tokenised with part-of-speech detection at the same time, for efficiency. In the current unnest_tokens framework, though, I can't preserve the part-of-speech information, right? The output of the tokenizer looks like this. If you have a good idea for preserving this information as well, that would be amazing.
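The example output referred to above is missing from this copy of the thread. As a hedged illustration of the shape being described, RMeCab's RMeCabC() returns, as I understand it, a list of single-element character vectors whose names are part-of-speech tags (in Japanese), so flattening the list into a single token column discards those names:

```r
# Not run; illustration of my understanding of RMeCabC() output, which may
# differ in detail from the output originally pasted in this thread.
library(RMeCab)

res <- RMeCabC("すもももももももものうち")
# res is roughly: list(名詞 = "すもも", 助詞 = "も", 名詞 = "もも", ...)

unlist(res)         # named character vector: values are tokens
names(unlist(res))  # names are the part-of-speech tags that a plain
                    # token column would lose
```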
That's great that it works! Thanks for bringing this up; I think this improves the flexibility and usability of the package. In English, we have been doing part-of-speech detection separately (after unnesting) using a join with a data set that's included in the package; so far this has been kept separate from the unnest_tokens function because we have been trying to stay within tidy data principles. Let me/us think about that.
Yeah, for English (and probably for most alphabetic languages too), separating those steps seems better. That demand is very specific to Japanese, so I don't think it needs to be implemented in this package either. I just wanted to know if there's a good way to do it. Don't worry too much about it. Thank you!
One suggestion for keeping parts of speech (I agree it unfortunately wouldn't fit in the tidytext package, since it's a very specific need) is to use unnest manually. I haven't been able to get RMeCab to work on my computer, so this is a rough guess I haven't tested:
This (or something similar to it) should be able to create two columns:
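The code block for this suggestion is missing from this copy of the thread. What follows is a hedged reconstruction of the kind of approach described, not dgrtwo's actual snippet; the column names `word` and `pos` are my placeholders, and it assumes (as sketched earlier in the thread) that RMeCabC() returns named vectors whose names are part-of-speech tags.

```r
# A guess at the manual-unnest approach, untested against RMeCab itself.
library(dplyr)
library(tidyr)
library(purrr)
# library(RMeCab)  # assumed: unlist(RMeCabC(x)) gives a named token vector

d <- tibble(doc = 1:2, text = c("これはペンです", "私は学生です"))

d %>%
  mutate(tokens = map(text, ~ unlist(RMeCabC(.x)))) %>%  # tokenize each document
  mutate(word = map(tokens, unname),                     # the tokens themselves
         pos  = map(tokens, names)) %>%                  # their POS tags
  select(-text, -tokens) %>%
  unnest(c(word, pos))
# One row per token, with the part-of-speech tag kept alongside it.
```

The key idea is to keep the tokenizer's full output in list-columns and unnest both columns together, so each token stays paired with its part-of-speech tag in tidy one-token-per-row form.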
@dgrtwo Thank you for the suggestion! It looks good. I'll try it. |
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
Hi, thanks for developing this helpful package!
unnest_tokens seems to use tokenizers from the "tokenizers" package, but how about making it possible to use a tokenizer of the user's choice?
I'm from Japan, and tokenizers for Japanese are totally different, so I want to use a custom tokenizer.