Add support for Faroese #37

Open
sigmundv opened this issue Nov 2, 2023 · 2 comments

Comments

@sigmundv

sigmundv commented Nov 2, 2023

Hello, I was wondering how I would go about adding support for more languages. I can see that the key is to have training data, but how do I generate the required JSON file? Thank you in advance for making this package!

@neurosnap
Owner

Hi! Thanks so much for opening this issue, much appreciated.

I didn't perform any of the training for this library myself; I used the pre-trained models that already ship with NLTK: https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip

To add support for Faroese, you would need to train a model on Faroese text with NLTK's PunktTrainer and convert the result to JSON. Then we could add support for it inside this library.

The PunktTrainer can be found here: https://github.com/nltk/nltk/blob/e2d368e00ef806121aaa39f6e5f90d9f8243631b/nltk/tokenize/punkt.py#L636
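
A minimal sketch of those two steps (train, then export to JSON), under some loud assumptions: the helper names `train_punkt_params` and `params_to_json` are made up for illustration, and the JSON key names below simply mirror the attribute names on NLTK's `PunktParameters`. The schema this library actually expects is not stated in this thread, so compare the output against one of the existing bundled JSON files before relying on it.

```python
import json

from nltk.tokenize.punkt import PunktTrainer


def train_punkt_params(text):
    """Train Punkt on raw text and return the learned PunktParameters."""
    trainer = PunktTrainer()
    # finalize=False lets more text be fed in with further train() calls
    # before the final statistics are computed.
    trainer.train(text, finalize=False)
    trainer.finalize_training()
    return trainer.get_params()


def params_to_json(params):
    """Serialize the learned parameters to a JSON string.

    ASSUMPTION: the key names are hypothetical; they copy the
    PunktParameters attribute names, not a confirmed schema.
    """
    data = {
        "abbrev_types": sorted(params.abbrev_types),
        # collocations is a set of (word, word) tuples
        "collocations": [list(pair) for pair in sorted(params.collocations)],
        "sent_starters": sorted(params.sent_starters),
        # ortho_context maps a word type to an integer bit mask
        "ortho_context": dict(params.ortho_context),
    }
    # ensure_ascii=False keeps non-ASCII letters such as ð and ø readable.
    return json.dumps(data, ensure_ascii=False, indent=2)
```

In practice `text` would be a large plain-text Faroese corpus read from disk with UTF-8 encoding; Punkt is unsupervised, so the text does not need to be annotated, but a small sample will yield few or no abbreviations.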

I hope that helps!

@sigmundv
Author

sigmundv commented Nov 7, 2023

That's perfect, I'll look into the PunktTrainer in NLTK.
