indonesian-NLP-resources

NLP data for Bahasa Indonesia (last updated 20 Sep 2020)

Sentences Dataset

  1. leipzig Indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
  2. wn-msa.sourceforge.net Wordnet Bahasa
  3. Quran Indonesian Quran translations (id.muntakhab, id.jalalayn, id.indonesian)
  4. Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
  5. Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
  6. corpus-frog-storytelling spoken storytelling text
  7. TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
  8. Opus Opus NLPL
  9. Sealang Sealang dataset

Word reference (kemdikbud) link

  1. Base entries (Entri Dasar): 50,668 (45.02%)
  2. Derived words (Kata Turunan): 26,835 (23.85%)
  3. Compound words (Gabungan Kata): 31,492 (27.98%)
  4. Proverbs (Peribahasa): 2,054 (1.83%)
  5. Figurative expressions (Kiasan): 269 (0.24%)
  6. Idioms (Ungkapan): 1,131 (1.00%)
  7. Variants (Varian): 89 (0.08%)
  8. Total entries (Entri Total): 112,538 (100.00%)
  9. Total senses (Makna Total): 131,533
  10. Total examples (Contoh Total): 30,010
  11. Total categories (Kategori Total): 234
  12. Senses per entry (Makna Per Entri): 1.169
  13. Examples per sense (Contoh Per Makna): 0.228
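
The percentages and the two ratios in this list follow directly from the raw counts. As a quick consistency check, here is a minimal Python sketch that recomputes them. The counts are copied from the list above; the English dictionary keys are labels chosen here for illustration, not names used by the source.

```python
# Entry counts copied from the statistics listed above.
# The English keys are illustrative labels, not official names.
entry_counts = {
    "base entries": 50668,
    "derived words": 26835,
    "compound words": 31492,
    "proverbs": 2054,
    "figurative expressions": 269,
    "idioms": 1131,
    "variants": 89,
}
total_senses = 131533
total_examples = 30010

# The total entry count is the sum of all entry types.
total_entries = sum(entry_counts.values())

# Share of each entry type, as a percentage of all entries.
shares = {name: 100 * n / total_entries for name, n in entry_counts.items()}

senses_per_entry = total_senses / total_entries
examples_per_sense = total_examples / total_senses

for name, share in shares.items():
    print(f"{name}: {entry_counts[name]} ({share:.2f}%)")
print(f"total entries: {total_entries}")
print(f"senses per entry: {senses_per_entry:.3f}")
print(f"examples per sense: {examples_per_sense:.3f}")
```

The recomputed total and ratios agree with the published figures, which suggests the list was transcribed without error.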

Words dataset (PUEBI word types)

  1. word class => nouns (18647), verbs (39070) = 57717 words
  2. word type => root words (41409), derivative words (24913), compound words, figures of speech, proverbs, expressions = 66322 words
  3. Word root => source#1.1 : sastrawi 29932 words ; source#1.2 : sastrawi 30342 words ; source#2 : SentiStrengthID 27979 words ; source#3 : serangkai 30342 words
  4. Word spaCy : id
  5. word : serangkai
  6. Word name : random-name
  7. Word Indo name : genderprediction
  8. Word Wiktionary : word id
  9. word compound =>
  10. Word Acronyms =>
  11. Word Negative =>
  12. Word Positive =>
  13. Word Slang =>
  14. Stopwords =>
  15. Emoticon =>
  16. Named Entity =>
    • source#1 : [Place] country
    • source#1 : [Place] Wilayah-Administratif-Indonesia (provinces, villages, districts, regencies)
    • source#2 : [Place] Indonesia-Postal-Code (provinces, cities, subdistricts, urbans)
    • source#3 : [Place] indonesian-region
    • source#3 : [Person] gender prediction
    • source#4 : [Person] random name
    • source#5 : [Person] title of name
    • source#6 : [Person] degree
    • source#7 : [Org] institution
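
Word lists grouped by entity type, like the [Place], [Person], and [Org] sources above, are commonly used as gazetteers. The following is a minimal sketch of a greedy longest-match lookup, not the method of any repository listed here. The tiny `GAZETTEERS` sets, the `MAX_NGRAM` limit, and the `tag_entities` helper are all hypothetical; real entries would be loaded from the linked files.

```python
import re

# Hypothetical miniature gazetteers for illustration only. In practice each
# set would be loaded from one of the word lists above (lowercased entries).
GAZETTEERS = {
    "Place": {"indonesia", "jakarta", "jawa barat"},
    "Person": {"budi", "siti"},
    "Org": {"universitas indonesia"},
}

MAX_NGRAM = 3  # longest phrase, in tokens, that we try to match


def tag_entities(text):
    """Return (phrase, label) pairs found by greedy longest-match lookup."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so that "universitas indonesia"
        # is preferred over the shorter entry "indonesia".
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            label = next(
                (lbl for lbl, entries in GAZETTEERS.items() if phrase in entries),
                None,
            )
            if label is not None:
                match = (phrase, label, n)
                break
        if match:
            found.append((match[0], match[1]))
            i += match[2]  # skip past the matched phrase
        else:
            i += 1
    return found


print(tag_entities("Budi kuliah di Universitas Indonesia, lalu pindah ke Jawa Barat."))
```

A pure lookup like this cannot resolve ambiguous names, so it is best treated as a baseline or as a feature source for a trained tagger.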

Tagged dataset

  1. NER =>
    • source#1 : yohanesgultom/nlp-experiments 1700 sentences
    • source#2 : yusufsyaifudin/indonesia-ner 1835 sentences
  2. POS-TAG
    • POS-TAG : famrashel/idn-tagged-corpus
    • POS-TAG : pebbie/pebahasa ~600 sentences
    • POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
  3. Sentiment =>
    • source#1 : 1506 sentences ;
    • source#2 : Sentiment words with strength agusmakmun/SentiStrengthID 1573 (range: -5 to 5) ;
    • source#3 : Sentiment with weight fajri91/InSet -> separate word lists weighted by strength (range: -5 to 5); 6610 negative words and 3619 positive words
  4. panl10n Pan Localization
  5. Acronyms : ramaprakoso/analisis-sentimen 4085 words
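
The weighted sentiment lexicons above assign each word a strength from -5 to 5. One simple way to use such a list is to sum the weights of the words found in a sentence. The sketch below shows that idea only; it is not the scoring procedure of SentiStrengthID or InSet. The `LEXICON` entries and their weights are made up for illustration, and `sentiment_score` and `sentiment_label` are hypothetical helper names.

```python
import re

# Hypothetical entries and weights for illustration only. Real weights
# would be read from the lexicon files in the repositories listed above.
LEXICON = {
    "bagus": 4,    # "good"
    "senang": 3,   # "happy"
    "buruk": -4,   # "bad"
    "kecewa": -3,  # "disappointed"
}


def sentiment_score(text, lexicon=LEXICON):
    """Sum the weights of all lexicon words found in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def sentiment_label(score):
    """Map a summed score to a coarse polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ["Pelayanannya bagus, saya senang", "Filmnya buruk dan saya kecewa"]:
    score = sentiment_score(sentence)
    print(sentence, "->", score, sentiment_label(score))
```

Summing weights ignores negation and affixed word forms, so stemming the tokens first and handling negators would be natural next steps.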

Parallel corpus Eng-Ind

  1. parallel-corpora-en-id
  2. Indonesian-English-Bilingual-Corpus
  3. TALPCo
  4. opus
  5. Multi-Wiki

Sentence Analyzer

  1. MALINDO_Morph
  2. morphind
  3. INDRA
  4. pujangga : An interface for InaNLP and Deeplearning4j's Word2Vec for Indonesian (Bahasa Indonesia) in the form of REST API.
  5. id-multi-label-hate-speech-and-abusive-language-detection : A dataset for multi-label hate speech and abusive language detection on Indonesian Twitter.
  6. kawat : A Word Analogy Task Dataset for Indonesian

Crawler Data

  1. Crawler Indonesian news portal