Hausa Part-of-Speech Tagging in a Low-Resource Scenario

This project is for my Bachelor thesis at the University of Potsdam for which I investigated automatic POS Tagging in Hausa by assuming that no annotated data is available. Instead, I utilized parallel sentences in Englisch, French, Arabic and German to induce word classes.

Overview

Parallel sentences for the Quran can be found in the directory parallel. Each file has alignment information in the directory aligned and tags for the source language in the directory tagged.
Alignments and source tags are used to project tags onto the Hausa sentences. These files can be found in the directory projected. On the data in projected I train different tagging models. I evaluate them on the test data in the directory test_data. Predictions for each tagger can also be found in this directory under predictions. All the figure can be found in the directory figures.

Notebooks

For various aspects of this work, I made use of Jupyter notebooks in Google Colab. They can be found in the directory notebooks:

In word_alignment_hausa.ipynb, parallel sentences are aligned with fast_align and SimAlign
In tag_english.ipynb, tag_multilingual.ipynb and tag_arabic.ipynb, POS tagging for the source languages is done
In bi-lstm-trainer.ipynb, a BI-LSTM model with a CRF layer is trained.

See the individual notebooks for instructions on how to use them.

Annotation Projection

After word alignment and tagging of the source languages, the tags of the source language are projected onto the Hausa sentences. This is can be reproduced with project_tags.py. It takes the following arguments:

langs: These are the source languages that should be used. The following languages are available ar (Arabic), en (English), de (German) and fr (French). If multiple source languages should be used, seperate them with a comma.
align: This is the alignment method that should be used. Can either be SimAlign or fast_align
type: This is the type of alignment. For SimAlign, this can either be inter, itermax or mwmf. For fast_align, this can be forward, reverse or sym
out: Directory where the results should be stored.

For example: project_tags.py ar,en SimAlign itermax results/

Train Tagger

In the files baseline.py and hmm, the Unigram and Hidden Markov Models are trained and predictions for the test set are written to files.

Evaluation

for accuracy values, execute python evaluation/accuracy.py
for confusion matrices, execute python evaluation/confusion_matrix.py
for performance on out-of-vocabulary words, execute python evaluation/oov.py
for performance on ambiguous words, execute python evaluation/ambig.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hausa Part-of-Speech Tagging in a Low-Resource Scenario

Overview

Notebooks

Annotation Projection

Train Tagger

Evaluation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
aligned		aligned
evaluation		evaluation
figures		figures
notebooks		notebooks
parallel		parallel
projected		projected
tagged		tagged
test_data		test_data
.gitignore		.gitignore
README.md		README.md
baseline.py		baseline.py
hmm.py		hmm.py
project_tags.py		project_tags.py
utils.py		utils.py

katjakon/hausa-pos

Folders and files

Latest commit

History

Repository files navigation

Hausa Part-of-Speech Tagging in a Low-Resource Scenario

Overview

Notebooks

Annotation Projection

Train Tagger

Evaluation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages