A collection of Bahasa Malaysia (Malay) Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.
Library | Description | Programming Languages | License | Author & Link |
---|---|---|---|---|
Malaya | Natural-Language-Toolkit for Bahasa Malaysia | iPython | MIT License (MIT) | DevconX |
Library | Description | Programming Languages | License | Author & Link |
---|---|---|---|---|
polyglot | Polyglot is a natural language pipeline that supports massive multilingual applications such as Transliteration, NER, Sentiment Analysis, Morphological Analysis | Python | GPLv3 | aboSamoor |
API | Description | Programming Languages | License | Guide & Link |
---|---|---|---|---|
Malay NLP | Frequency Based and Max-ent POS Taggers | | | Malay NLP Blog |
Library | Description | Programming Languages | License | Author & Link |
---|---|---|---|---|
hltdi-morphology | Mirror Repository for ParaMorfo, HornMorpho, AntiMorfo, and MorfoMelayu | | | LowResourceLanguages |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
MALINDO_Morph | Morphological dictionary for Malay / Indonesian | | English-Malay, English-Indonesian | CC BY-NC-SA 4.0 TH | english |
TALPCo | The TUFS Asian Language Parallel Corpus | | Japanese -> Malay | Creative Commons Attribution 4.0 International (CC BY 4.0) | matbahasa |
Open Parallel Corpus | OPUS is a growing collection of translated texts from the web. | | Malay <-> Many languages | Modified BSD License | OPUS |
Pre-trained Model | Description | Size | Dimensions | License | Link |
---|---|---|---|---|---|
fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
wordvectors | Pre-trained word vectors of 30+ languages | 173MB | 100 | MIT License | Kyubyong |
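
As a rough illustration of how pre-trained vectors like those above might be consumed, here is a minimal sketch that parses the plain-text layout used by fastText's text downloads (a header line with the vocabulary size and dimension, then one word and its values per line). The helper names `load_vec` and `cosine`, and the tiny in-memory sample, are made up for this example; other projects in the table may ship a different file format, so check each one before reusing this.

```python
import io
import math


def load_vec(stream, limit=None):
    """Parse word vectors from a text stream in the fastText text layout.

    Assumed layout: the first line is "<vocab_size> <dim>", and every
    following line is "<word> <v1> ... <v_dim>", separated by spaces.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # optionally stop early to keep memory use small
    return dim, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up 2-dimensional sample data, standing in for a real downloaded file.
sample = io.StringIO("3 2\nmakan 1.0 0.0\nminum 0.8 0.6\nkereta 0.0 1.0\n")
dim, vectors = load_vec(sample)
similarity = cosine(vectors["makan"], vectors["minum"])
```

With a real download, the `io.StringIO` sample would be replaced by something like `open(path, encoding="utf-8")`, where `path` points at the text file you fetched; passing `limit` helps when only the most frequent words are needed.
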
Malay is currently a low-resource language with few NLP resources available. Because of its close resemblance to Bahasa Indonesia, resources built for Bahasa Indonesia may also be worth trying. A good place to start is: https://github.com/keyreply/Bahasa-Indo-NLP-Dataset