Corpora used to train Punkt Segmenter in German #3253

AugustinErnoult · 2024-05-13T14:53:37Z

Hi,
Thank you very much for your amazing work.
I used NLTK to segment a German text. I see that this language is available and that the sentence tokenizer gives quite good results with the default training. However, I would like to know which corpora it has been trained on, and I can't find any German text in this list (or I missed it).
Do you know where I can find an explicit list of the texts used?
Thank you

AugustinErnoult · 2024-05-13T16:37:03Z

I found it in "nltk-data/punkt/tokenizer/README". For German it seems to be only the corpus initially used by Strunk and Kiss.
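
For anyone else looking for that README, here is a minimal sketch for locating it in a local NLTK data directory. It uses only the standard library and makes no NLTK calls. The function name `find_punkt_readmes` and the `~/nltk_data` default location are assumptions for illustration; the directory layout can differ between installations, so the search matches any README that sits somewhere below a directory named `punkt`.

```python
from pathlib import Path


def find_punkt_readmes(data_root):
    """Return README files found anywhere below a 'punkt' directory in data_root.

    Hypothetical helper, not part of NLTK.
    """
    root = Path(data_root).expanduser()
    if not root.is_dir():
        return []
    hits = []
    for path in root.rglob("README*"):
        # Keep only files that have a directory named 'punkt' among their parents.
        if path.is_file() and "punkt" in path.relative_to(root).parts[:-1]:
            hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # "~/nltk_data" is an assumed location; adjust it to your installation.
    for readme in find_punkt_readmes("~/nltk_data"):
        print(readme)
```

Opening the printed file should show the per-language corpus notes mentioned above.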
