Corpora used to train Punkt Segmenter in German #3253

Open
AugustinErnoult opened this issue May 13, 2024 · 1 comment
Comments

@AugustinErnoult

Hi,
Thank you very much for your amazing work.
I used NLTK to segment a German text. I see that this language is available, and the sentence tokenizer gives quite good results with the default training. However, I would like to know which corpora it has been trained on, and I can't find any German text in this list (or I missed it).
Do you know where I can find an explicit list of the texts used?
Thank you

@AugustinErnoult

I found it in "nltk-data/punkt/tokenizer/README". For German it seems to be only the corpus initially used by Strunk and Kiss.
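
For anyone else looking for that file, here is a rough sketch of how it might be located in a local NLTK data directory. It uses only the standard library. The helper name `find_punkt_readmes` and the default directories listed at the bottom are my own assumptions, not part of NLTK; with NLTK installed, `nltk.data.path` should give the directories actually searched on your machine.

```python
from pathlib import Path
from typing import Iterable, List


def find_punkt_readmes(data_dirs: Iterable[str]) -> List[Path]:
    """Return README files found below any directory whose name starts with 'punkt'.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    found = []
    for data_dir in data_dirs:
        root = Path(data_dir)
        if not root.is_dir():
            continue  # skip search paths that do not exist on this machine
        for candidate in root.rglob("README*"):
            # Directory components between the search root and the file itself.
            parent_parts = candidate.relative_to(root).parts[:-1]
            if candidate.is_file() and any(p.startswith("punkt") for p in parent_parts):
                found.append(candidate)
    return sorted(found)


if __name__ == "__main__":
    # Assumed common locations for NLTK data; adjust to your installation.
    default_dirs = [
        str(Path.home() / "nltk_data"),
        "/usr/share/nltk_data",
        "/usr/local/share/nltk_data",
    ]
    for readme in find_punkt_readmes(default_dirs):
        print(readme)
        print(readme.read_text(encoding="utf-8", errors="replace"))
```

If nothing is printed, the Punkt data is probably stored elsewhere or is still zipped, in which case the README would need to be read from the archive instead.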
