Malaya is a Natural-Language-Toolkit library for Bahasa Malaysia, powered by deep learning and TensorFlow.
Proper documentation is available at https://malaya.readthedocs.io/
CPU version
$ pip install malaya
GPU version
$ pip install malaya-gpu
Only Python 3.6 and above, and TensorFlow 1.10 and above (but not TensorFlow 2.0), are supported.
Emotion Analysis
Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
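Each of the transfer-learning modules in this list (emotion, entities, part-of-speech, dependency, relevancy, sentiment, subjectivity and toxicity) is loaded the same way. A minimal sketch, assuming the malaya.emotion.transformer loader and its predict method; exact signatures vary between releases, so check the documentation:

import malaya

# Load a transfer-learning emotion classifier; 'bert' is one of the supported
# model names (e.g. 'bert', 'tiny-bert', 'albert', 'tiny-albert', 'xlnet', 'alxlnet').
model = malaya.emotion.transformer(model = 'bert')

# Classify a Malay sentence. Depending on the Malaya release, predict accepts
# a single string or a list of strings.
print(model.predict('Saya sangat gembira hari ini'))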
Entities Recognition
Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
Language Detection
Using fastText and a sparse deep learning model to classify Malay (formal and social media), Indonesian (formal and social media), Rojak language and Manglish.
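A minimal sketch of the fastText detector; the loader name and predict signature follow the documentation but may differ between releases:

import malaya

# Load the fastText language-detection model (a sparse deep model is also available).
fast_text = malaya.language_detection.fasttext()

# Classify a sentence as malay, indonesian, rojak, manglish or other.
print(fast_text.predict('awak ni comel sangat lah'))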
Normalizer
Using local Malaysian NLP research combined with Transformer models to normalize any Bahasa text.
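A hypothetical sketch of the normalizer, assuming a probability-based spelling corrector is passed into malaya.normalize.normalizer; the exact loader names may differ between releases:

import malaya

# Build a spelling corrector, wrap it in the normalizer, then normalize a
# noisy social-media style sentence.
corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector)
print(normalizer.normalize('xjdi ke esok, mkn dulu'))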
Num2Word
Convert from numbers to cardinal or ordinal representation.
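A short sketch, assuming the to_cardinal and to_ordinal helpers described in the documentation; the expected outputs in the comments are illustrative:

import malaya

# Convert integers to Malay cardinal and ordinal words.
print(malaya.num2word.to_cardinal(123))  # expected: 'seratus dua puluh tiga'
print(malaya.num2word.to_ordinal(11))    # expected: 'kesebelas'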
Part-of-Speech Recognition
Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
Dependency Parsing
Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
Relevancy Analysis
Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
Sentiment Analysis
Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
Spell Correction
Using local Malaysian NLP research combined with Transformer models to auto-correct any Bahasa word.
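A minimal sketch using the probability-based corrector; transformer-based correctors are also documented, and exact names may differ between releases:

import malaya

# Load the probability (edit-distance) spelling corrector and fix a misspelled word.
corrector = malaya.spell.probability()
print(corrector.correct('mknn'))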
Stemmer
Use a state-of-the-art BPE LSTM Seq2Seq model with attention to do Bahasa stemming.
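A minimal sketch, assuming the deep_model loader for the LSTM Seq2Seq stemmer; verify the exact names against the documentation:

import malaya

# Load the BPE LSTM Seq2Seq (with attention) stemmer and stem a sentence.
stemmer = malaya.stem.deep_model()
print(stemmer.stem('saya sangat menyukainya'))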
Subjectivity Analysis
Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
Similarity
Use deep Encoder, Doc2Vec, BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa and ALXLNET-base-bahasa to build deep semantic similarity models.
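A minimal sketch of a transformer-based similarity model; the predict signature (a pair of strings) is an assumption, so verify it against the documentation:

import malaya

# Load a transformer similarity model and score how similar two sentences are.
model = malaya.similarity.transformer(model = 'bert')
print(model.predict('tempat itu sangat cantik', 'lokasi tu indah sangat'))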
Summarization
Using BERT, XLNET, ALBERT, skip-thought, LDA, LSA and Doc2Vec to give precise unsupervised summarization, with TextRank as the scoring algorithm.
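A hypothetical sketch of unsupervised extractive summarization; the lsa loader and its arguments are assumptions based on the documentation, not a definitive interface:

import malaya

# Summarize a long Bahasa article with the unsupervised LSA summarizer,
# keeping the top 3 sentences.
article = 'Kerajaan hari ini mengumumkan beberapa langkah baru ...'  # any long text
print(malaya.summarize.lsa([article], top_k = 3))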
Topic Modelling
Provide Attention, LDA2Vec, LDA, NMF and LSA interfaces for easy topic modelling with topic visualization.
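A hypothetical sketch of LDA topic modelling; the module and argument names are assumptions and may differ between releases:

import malaya

# Fit an LDA topic model on a tiny corpus and print the top words per topic.
corpus = [
    'kerajaan umumkan bajet baru untuk rakyat',
    'pasukan bola sepak negara menang perlawanan',
    'harga minyak dunia naik semula minggu ini',
]
lda = malaya.topic_model.lda(corpus, n_topics = 2)
print(lda.top_topics(5))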
Toxicity Analysis
Transfer learning on BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa, ALXLNET-base-bahasa.
Word2Vec
Provide pretrained Bahasa Wikipedia and Bahasa news Word2Vec embeddings, with an easy interface and visualization.
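A hypothetical sketch of loading the pretrained Wikipedia vectors and querying nearest neighbours; the loader and method names are assumptions and vary between releases, so treat this only as an outline:

import malaya

# Load the pretrained Wikipedia embeddings and wrap them in the word-vector interface.
embedded = malaya.word2vec.load_wiki()
word_vector = malaya.word2vec.word2vec(embedded['nce_weights'], embedded['dictionary'])

# Find the 5 closest words to 'kerajaan' in the embedding space.
print(word_vector.n_closest(word = 'kerajaan', num_closest = 5))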
Transformer
Provide an easy interface to load BERT-base-bahasa, Tiny-BERT-bahasa, Albert-base-bahasa, Albert-tiny-bahasa, XLNET-base-bahasa and ALXLNET-base-bahasa.
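A minimal sketch of loading a pretrained encoder and vectorizing sentences; the vectorize method name is an assumption taken from the documentation:

import malaya

# Load the pretrained BERT-base-bahasa encoder.
model = malaya.transformer.load(model = 'bert')

# Produce fixed-size vector representations for downstream use.
vectors = model.vectorize(['Saya suka makan nasi lemak', 'Cuaca hari ini panas'])
print(vectors.shape)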
If you use our software for research, please cite:
@misc{Malaya,
  author = {Husein, Zolkepli},
  title = {Malaya: Natural-Language-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huseinzol05/malaya}}
}
Thanks to Im Big, LigBlou, Mesolitica and KeyReply for sponsoring AWS, Google Cloud and private cloud resources to train Malaya models.
Thank you for contributing to this library; it really helps a lot. Feel free to contact me with any suggestions, or to contribute in other forms; we accept everything, not just code!