Indonesian NLP resources
- Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
- Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
- PANL10N POS tagging. This corpus has 39K sentences and 900K word tokens.
- IDN tagged corpus. This corpus contains 10K sentences and 250K word tokens. The POS tags are annotated manually.
- Indonesian Treebank. This corpus contains 1K parsed sentences. (constituency parsing)
- UD Indonesian. This corpus is provided by Universal Dependencies. Training, development, and test splits are already provided. (dependency parsing)
- PANL10N EN-ID news parallel corpus. This corpus contains sentences from news articles in four categories: economy (6K sentences), international (6K sentences), science (6K sentences), and sport (4K sentences).
- PANL10N Indonesian translation of Penn treebank. This corpus contains an Indonesian translation of the Penn Treebank, 24K sentences in total.
- IndoSum. A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources. It has both abstractive summaries and extractive labels.
- TITML-IDN speech corpus. The corpus has 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced. The corpus is free for academic/non-commercial use, but interested parties should make a formal request via email to the institution. The procedure is listed here.
- frankydotid/Indonesian-Speech-Recognition. A small corpus of 50 utterances by a single male speaker.