Summary

Aksara is an Indonesian NLP tool that conforms to the Universal Dependencies (UD) v2 annotation guidelines. Aksara can perform six tasks:

Word segmentation (tokenization)
Lemmatization
POS tagging
Morphological features tagging
Morphological analysis
Dependency parsing

The output is in the CoNLL-U format.
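
Each token line in CoNLL-U has ten tab-separated fields: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, and MISC. As a rough illustration of how such a line can be read in Python, here is a minimal sketch; the `Token` class and `parse_token_line` function are illustrative names and are not part of Aksara's API.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """The ten CoNLL-U columns, kept as raw strings."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_token_line(line: str) -> Token:
    """Split one tab-separated CoNLL-U token line into its ten fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return Token(*fields)


# Token line taken from the formal-text example in the Usage section.
line = "4\tdipasok\tpasok\tVERB\t_\tVoice=Pass\t0\troot\t_\tMorf=di+pasok<VERB>_VERB"
token = parse_token_line(line)
print(token.form, token.lemma, token.upos)  # dipasok pasok VERB
```

Keeping every column as a string avoids guessing at types: the ID column, for example, is not always an integer, since multiword tokens use ranges such as 3-4.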

Installation

Install Foma.

a. Linux
1. apt-get install foma-bin

   Make sure you have permission to install packages, or use sudo.
b. Windows
1. Get the precompiled foma binary from foma-zip.
2. Unzip the precompiled foma binary.
3. Add the path of the win32 folder (from the precompiled foma zip) to the PATH environment variable.
c. macOS
1. brew install foma

[OPTIONAL] It is strongly recommended to use a virtual environment (see venv for how to create a Python virtual environment) to avoid dependency conflicts.

Use the package manager pip to install the Aksara library.
```
pip install aksara
```
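
Because Foma is installed separately from the pip package, it can help to confirm that it is reachable on PATH before running Aksara. The sketch below uses only the Python standard library. It assumes the Foma installation provides an executable named `foma`, and `missing_executables` is a hypothetical helper, not part of Aksara.

```python
import shutil


def missing_executables(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumption: the Foma install provides an executable called "foma".
missing = missing_executables(["foma"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Foma is available.")
```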

Usage

You can use Aksara from the command line or in a Python program.

Python Library Usage

Aksara can be used as a Python library. See our docs for more information.

Command Line Usage

Example of processing formal Indonesian text:

foo@bar:$ python3 -m aksara -s "Pengeluaran baru ini dipasok oleh rekening bank gemuk Clinton."
# sent_id = 1
# text = Pengeluaran baru ini dipasok oleh rekening bank gemuk Clinton.
1	Pengeluaran	keluar	NOUN	_	Number=Sing	4	nsubj	_	Morf=peN+keluar<VERB>+an_NOUN
2	baru	baru	ADJ	_	_	1	amod	_	Morf=baru<ADJ>_ADJ
3	ini	ini	DET	_	PronType=Dem	1	det	_	Morf=ini<DET>_DET
4	dipasok	pasok	VERB	_	Voice=Pass	0	root	_	Morf=di+pasok<VERB>_VERB
5	oleh	oleh	ADP	_	_	6	case	_	Morf=oleh<ADP>_ADP
6	rekening	rekening	NOUN	_	Number=Sing	4	obl	_	Morf=rekening<NOUN>_NOUN
7	bank	bank	NOUN	_	Number=Sing	6	nmod	_	Morf=bank<NOUN>_NOUN
8	gemuk	gemuk	ADJ	_	_	9	amod	_	Morf=gemuk<ADJ>_ADJ
9	Clinton	Clinton	PROPN	_	_	6	appos	_	Morf=Clinton<PROPN>_PROPN
10	.	.	PUNCT	_	_	4	punct	_	Morf=.<PUNCT>_PUNCT
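
Columns 7 and 8 (HEAD and DEPREL) encode the dependency tree: each token points to the ID of its head, and the root token has head 0. As an illustration, the sketch below rebuilds that tree from the output above. The `rows` list is copied by hand from the example (ID, FORM, HEAD, DEPREL), and `children_by_head` is a hypothetical helper name.

```python
from collections import defaultdict

# (ID, FORM, HEAD, DEPREL) copied from the example output above.
rows = [
    (1, "Pengeluaran", 4, "nsubj"),
    (2, "baru", 1, "amod"),
    (3, "ini", 1, "det"),
    (4, "dipasok", 0, "root"),
    (5, "oleh", 6, "case"),
    (6, "rekening", 4, "obl"),
    (7, "bank", 6, "nmod"),
    (8, "gemuk", 9, "amod"),
    (9, "Clinton", 6, "appos"),
    (10, ".", 4, "punct"),
]


def children_by_head(rows):
    """Map each head ID to the list of (form, deprel) pairs attached to it."""
    children = defaultdict(list)
    for _token_id, form, head, deprel in rows:
        children[head].append((form, deprel))
    return dict(children)


tree = children_by_head(rows)
root_form = tree[0][0][0]  # the single token whose head is 0
print("root:", root_form)
for form, deprel in tree[4]:  # direct dependents of "dipasok" (ID 4)
    print(f"  {deprel}: {form}")
```

For this sentence the root is dipasok, and its direct dependents are Pengeluaran (nsubj), rekening (obl), and the final period (punct).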

Example of processing informal Indonesian text:

foo@bar:$ python3 -m aksara -s "Sering ngikutin gayanya lg nyanyi." --informal
# sent_id = 1
# text = Sering ngikutin gayanya lg nyanyi.
1	Sering	sering	ADV	_	_	2	advmod	_	Morf=sering<ADV>_ADV
2	ngikutin	ikut	VERB	_	Polite=Infm|Voice=Act	0	root	_	Morf=NGE+ikut<VERB>+in_VERB
3-4	gayanya	_	_	_	_	_	_	_	_
3	gaya	gaya	NOUN	_	Number=Sing	2	obj	_	Morf=gaya<NOUN>_NOUN
4	nya	nya	PRON	_	Number=Sing|Person=3|Poss=Yes|PronType=Prs	3	nmod	_	Morf=nya<PRON>_PRON
5	lg	lagi	ADV	_	Abbr=Yes|Polite=Infm	6	advmod	_	Morf=lagi<ADV>_ADV
6	nyanyi	nyanyi	VERB	_	Polite=Infm	2	ccomp	_	Morf=nyanyi<VERB>_VERB|SpaceAfter=No
7	.	.	PUNCT	_	_	6	punct	_	Morf=.<PUNCT>_PUNCT
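
In this output, the line with ID 3-4 is a multiword token: the surface form gayanya spans the two syntactic words gaya and nya, and its remaining columns are left as underscores. Code that consumes the output usually needs to treat such range lines separately. A minimal sketch, with hypothetical helper names that are not part of Aksara:

```python
def is_multiword_range(token_id: str) -> bool:
    """True for range IDs such as "3-4", which mark a multiword token."""
    return "-" in token_id


def parse_feats(feats: str) -> dict:
    """Turn a FEATS string such as "Polite=Infm|Voice=Act" into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Three lines taken from the informal-text example above.
lines = [
    "2\tngikutin\tikut\tVERB\t_\tPolite=Infm|Voice=Act\t0\troot\t_\tMorf=NGE+ikut<VERB>+in_VERB",
    "3-4\tgayanya\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tgaya\tgaya\tNOUN\t_\tNumber=Sing\t2\tobj\t_\tMorf=gaya<NOUN>_NOUN",
]

words = []
for line in lines:
    cols = line.split("\t")
    if is_multiword_range(cols[0]):
        continue  # skip the surface-form line; its parts follow as normal tokens
    words.append((cols[1], cols[2], parse_feats(cols[5])))

print(words)
```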

Example of reading input from a text file and writing the output to a file:

foo@bar:$ python3 -m aksara -f "input_example.txt" --output "output_example.conllu" --informal
Processing inputs...
100%|██████████████████████████████████████████████████| 5/5 [00:32<00:00,  6.45s/it]
foo@bar:$

Documentation

Aksara as a Python Library

Aksara's documentation can be built locally.
1. Clone our repository.
2. Install the required dependencies (requirements.txt and doc_requirements.txt).
```
foo@bar: pip install -r requirements.txt
foo@bar: pip install -r doc_requirements.txt
```
3. Run make.bat in the docs folder.
```
foo@bar: cd docs
foo@bar: make html
```
The HTML version of our documentation will be generated in the docs/build/html folder. Open index.html in your favorite browser.

Command Line Usage

Use -s [SENTENCES] or --string [SENTENCES] to analyze a sentence.
Use -f [FILE] or --file [FILE] to analyze multiple sentences in a file.
Use --output [FILE] to select a file for the output. Otherwise, the output will be displayed in the standard output.
Use --lemma option to get only the output of lemmatization task.
Use --postag option to get only the output of POS tagging task.
Use --informal option to use the informal word handler.
Use --model [MODEL_NAME] option to choose which machine-learning model the dependency parser uses. The following model names are available:
- FR_GSD-ID_CSUI (default)
- FR_GSD-ID_GSD
- IT_ISDT-ID_CSUI
- IT_ISDT-ID_GSD
- EN_GUM-ID_CSUI
- EN_GUM-ID_GSD
- SL_SSJ-ID_CSUI
- SL_SSJ-ID_GSD
Please use option -h or --help for further documentation.
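
When calling the command-line interface from a script, the flags above can be assembled programmatically. The following sketch builds the argument list for `python -m aksara` using only the flags and model names listed in this section. `build_aksara_command` is a hypothetical helper, not part of Aksara, and the final `subprocess.run` call is left commented out because it requires Aksara and Foma to be installed.

```python
import sys

# Model names as listed in this section.
MODELS = {
    "FR_GSD-ID_CSUI", "FR_GSD-ID_GSD",
    "IT_ISDT-ID_CSUI", "IT_ISDT-ID_GSD",
    "EN_GUM-ID_CSUI", "EN_GUM-ID_GSD",
    "SL_SSJ-ID_CSUI", "SL_SSJ-ID_GSD",
}


def build_aksara_command(sentence, informal=False, model=None, output=None):
    """Build the argv list for analyzing one sentence with the Aksara CLI."""
    if model is not None and model not in MODELS:
        raise ValueError(f"unknown model name: {model}")
    command = [sys.executable, "-m", "aksara", "-s", sentence]
    if informal:
        command.append("--informal")
    if model is not None:
        command += ["--model", model]
    if output is not None:
        command += ["--output", output]
    return command


command = build_aksara_command(
    "Sering ngikutin gayanya lg nyanyi.", informal=True, model="FR_GSD-ID_GSD"
)
print(command[1:])
# To actually run it (requires Aksara and Foma to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the sentence itself contains quotes or spaces.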

Acknowledgments

Aksara conforms to the annotation guidelines for the Indonesian dependency treebank proposed by Alfina et al. (2019) and Alfina et al. (2020).
Aksara v1.0 was built by M. Yudistira Hanifmuti and Ika Alfina as the research project for Yudistira's undergraduate thesis at the Faculty of Computer Science, Universitas Indonesia, in 2020.
Aksara v1.1 was built by Muhammad Ridho Ananda and Ika Alfina as the research project for Ridho's undergraduate thesis at the Faculty of Computer Science, Universitas Indonesia, in 2021. Aksara v1.1 uses a hybrid POS tagging method that combines Aksara's rule-based tagger with a Hidden Markov Model (HMM) for disambiguation.
Aksara v1.2 was built by I Made Krisna Dwitama, Muhammad Salman Al Farisi, Ika Alfina, and Arawinda Dinakaramani as the research project for Krisna's and Salman's undergraduate theses at the Faculty of Computer Science, Universitas Indonesia, in 2022. Aksara v1.2 extends the morphological analyzer in Aksara so that it can process informal Indonesian text.
Aksara v1.3 was built by Andhika Yusup Maulana, Ika Alfina, and Kurniawati Azizah as the research project for Yusup's undergraduate thesis at the Faculty of Computer Science, Universitas Indonesia, in August 2022. Aksara v1.3 introduces a machine-learning-based dependency parser to fill the 7th and 8th columns (HEAD and DEPREL), which were previously left empty.
Aksara v1.4 was built by Erica Harlin, Ika Alfina, and Arlisa Yuliawati as the research project for Erica's undergraduate thesis at the Faculty of Computer Science, Universitas Indonesia, in June 2023. Aksara v1.4 handles noisy text normalization for informal Indonesian words.
Aksara v1.5 was built by Ainul Malik Zaidan Ismail, Debra Mazaya, Farel Muhammad Daffa, Judah Ariesaka Magaini, Rizki Kurniawan, and Michelle Shie as the group project for their PPL (Proyek Perangkat Lunak - Software Project) class at the Faculty of Computer Science, Universitas Indonesia, in June 2023. They built the Python library for Aksara.

References

Andhika Yusup Maulana, Ika Alfina, and Kurniawati Azizah. "Building Indonesian Dependency Parser Using Cross-lingual Transfer Learning". In Proceedings of the 2022 International Conference on Asian Language Processing (IALP).
I Made Krisna Dwitama, Muhammad Salman Al Farisi, Ika Alfina, and Arawinda Dinakaramani. "Building Morphological Analyzer for Informal Text in Indonesian". In Proceedings of the ICACSIS 2022 (online).
M. Ridho Ananda, M. Yudistira Hanifmuti, and Ika Alfina. "A Hybrid of Rule-based and HMM-based POS Taggers for Indonesian". In Proceedings of the 2021 International Conference on Asian Language Processing (IALP).
M. Yudistira Hanifmuti and Ika Alfina. "Aksara: An Indonesian Morphological Analyzer that Conforms to the UD v2 Annotation Guidelines". In Proceedings of the 2020 International Conference on Asian Language Processing (IALP) in Kuala Lumpur, Malaysia, 4-6 December 2020.
Ika Alfina, Daniel Zeman, Arawinda Dinakaramani, Indra Budi, and Heru Suhartanto. "Selecting the UD v2 Morphological Features for Indonesian Dependency Treebank". In Proceedings of the 2020 International Conference on Asian Language Processing (IALP) in Kuala Lumpur, Malaysia, 4-6 December 2020.
Ika Alfina, Arawinda Dinakaramani, Mohamad Ivan Fanany, and Heru Suhartanto. "A Gold Standard Dependency Treebank for Indonesian". In Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation (PACLIC) 2019 in Hakodate, Japan, 13-15 September 2019.

Changelog

2023-06-27 v1.5
- built Aksara as a Python library
2023-06-19 v1.4
- added noisy text normalization
2022-10-21 v1.3
- added new flag --model [MODEL_NAME]
- added dependency parser
- integrated existing flow with dependency parsing task
2022-08-30 v1.2
- added informal lexicon, morphotactic rules, and morphophonemic rules
- added feature Polite=Infm
- fixed bugs
2021-08-07 v1.1
- added the disambiguation for POS tag, lemma, and morphological features
- updated lexicon
- removed features: Subcat, NumForm, AdpType, VerbType
- added feature NumType
- removed feature values: Degree=Pos
- fixed bugs
2020-10-27 v1.0
- Initial release.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Contact

ika.alfina [at] cs.ui.ac.id
