This project accompanies my Bachelor's thesis at the University of Potsdam, in which I investigated automatic POS tagging for Hausa under the assumption that no annotated data is available. Instead, I used parallel sentences in English, French, Arabic and German to induce word classes.
Parallel sentences for the Quran can be found in the directory `parallel`. Each file has alignment information in the directory `aligned` and tags for the source language in the directory `tagged`.
Alignments and source tags are used to project tags onto the Hausa sentences. The resulting files can be found in the directory `projected`. I train different tagging models on the data in `projected`.
I evaluate them on the test data in the directory `test_data`. Predictions for each tagger can also be found in this directory under `predictions`.
All figures can be found in the directory `figures`.
For various aspects of this work, I made use of Jupyter notebooks in Google Colab. They can be found in the directory `notebooks`:
- In `word_alignment_hausa.ipynb`, parallel sentences are aligned with `fast_align` and `SimAlign`.
- In `tag_english.ipynb`, `tag_multilingual.ipynb` and `tag_arabic.ipynb`, POS tagging for the source languages is done.
- In `bi-lstm-trainer.ipynb`, a Bi-LSTM model with a CRF layer is trained.
See the individual notebooks for instructions on how to use them.
After word alignment and tagging of the source languages, the tags of the source language are projected onto the Hausa sentences.
This can be reproduced with `project_tags.py`. It takes the following arguments:
- `langs`: The source languages to use. The following languages are available: `ar` (Arabic), `en` (English), `de` (German) and `fr` (French). To use multiple source languages, separate them with a comma.
- `align`: The alignment method to use. Can be either `SimAlign` or `fast_align`.
- `type`: The type of alignment. For `SimAlign`, this can be `inter`, `itermax` or `mwmf`. For `fast_align`, this can be `forward`, `reverse` or `sym`.
- `out`: Directory where the results should be stored.
For example:
project_tags.py ar,en SimAlign itermax results/
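The projection step can be illustrated with a minimal sketch. This is not the code in `project_tags.py`; it assumes alignments are stored as Pharaoh-style `i-j` index pairs (the format `fast_align` writes), that unaligned Hausa tokens receive a placeholder tag, and that multiple source languages are combined by majority vote. The function names and the `UNK` placeholder are hypothetical.

```python
from collections import Counter


def parse_alignment(line):
    """Parse a Pharaoh-style line such as '1-0 2-1' into (source, target) index pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]


def project_tags(source_tags, alignment, target_length, unknown="UNK"):
    """Copy each source tag to the target tokens it is aligned with."""
    candidates = [[] for _ in range(target_length)]
    for src_idx, tgt_idx in alignment:
        # Skip alignment points that fall outside either sentence.
        if src_idx < len(source_tags) and tgt_idx < target_length:
            candidates[tgt_idx].append(source_tags[src_idx])
    # A target token aligned to several source words takes the most frequent tag;
    # an unaligned token keeps the placeholder (assumption of this sketch).
    return [Counter(c).most_common(1)[0][0] if c else unknown for c in candidates]


def combine_projections(projections, unknown="UNK"):
    """Majority vote over the projections from several source languages."""
    combined = []
    for tags in zip(*projections):
        votes = Counter(t for t in tags if t != unknown)
        combined.append(votes.most_common(1)[0][0] if votes else unknown)
    return combined


# Two hypothetical source sentences aligned to the same three-token Hausa sentence.
from_en = project_tags(["DET", "NOUN", "VERB"], parse_alignment("1-0 2-1"), 3)
from_ar = project_tags(["VERB", "NOUN", "ADV"], parse_alignment("1-0 0-1 2-2"), 3)
print(combine_projections([from_en, from_ar]))
```

Here the third token is unaligned in the first projection, so its tag comes from the second source language alone.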
In the files `baseline.py` and `hmm`, the unigram and Hidden Markov Models are trained, and predictions for the test set are written to files.
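As a rough illustration of what a unigram baseline does, the sketch below assigns each word its most frequent tag in the training data. It is not taken from `baseline.py`: lowercasing the words and falling back to the overall most frequent tag for unseen words are assumptions of this sketch, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Learn the most frequent tag per word from sentences of (word, tag) pairs."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_counts[word.lower()][tag] += 1
            tag_counts[tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in word_counts.items()}
    # Unseen words fall back to the most frequent tag overall (assumption).
    default_tag = tag_counts.most_common(1)[0][0]
    return model, default_tag


def tag_unigram(words, model, default_tag):
    """Tag each word independently of its context."""
    return [model.get(word.lower(), default_tag) for word in words]


# Placeholder tokens stand in for projected Hausa training data.
train = [
    [("a", "DET"), ("b", "NOUN")],
    [("c", "NOUN"), ("b", "VERB"), ("b", "NOUN")],
]
model, default_tag = train_unigram(train)
print(tag_unigram(["a", "b", "z"], model, default_tag))
```

Because each word is tagged in isolation, this baseline cannot resolve ambiguous words by context, which is what the Hidden Markov Model adds.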
- For accuracy values, execute `python evaluation/accuracy.py`
- For confusion matrices, execute `python evaluation/confusion_matrix.py`
- For performance on out-of-vocabulary words, execute `python evaluation/oov.py`
- For performance on ambiguous words, execute `python evaluation/ambig.py`
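The four measures can be sketched in a few lines. This is not the code in the `evaluation` scripts; it assumes that out-of-vocabulary words are those absent from the training data and that ambiguous words are those seen with more than one tag in the training data. All names and the toy data are hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of tokens whose predicted tag matches the gold tag."""
    pairs = list(zip(gold, pred))
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0


def subset_accuracy(words, gold, pred, keep):
    """Accuracy restricted to the tokens for which keep(word) is true."""
    selected = [(g, p) for w, g, p in zip(words, gold, pred) if keep(w)]
    return sum(g == p for g, p in selected) / len(selected) if selected else 0.0


def confusion_counts(gold, pred):
    """Count how often each (gold tag, predicted tag) pair occurs."""
    return Counter(zip(gold, pred))


# Toy test data with placeholder tokens.
words = ["a", "b", "c", "d"]
gold = ["N", "V", "N", "D"]
pred = ["N", "N", "N", "V"]
train_vocab = {"a", "b"}   # words seen during training (assumed definition of in-vocabulary)
ambiguous = {"b"}          # words seen with more than one tag during training

overall = accuracy(gold, pred)
oov = subset_accuracy(words, gold, pred, lambda w: w not in train_vocab)
ambig = subset_accuracy(words, gold, pred, lambda w: w in ambiguous)
confusion = confusion_counts(gold, pred)
print(overall, oov, ambig, confusion[("V", "N")])
```

Passing the word filter as a function keeps the out-of-vocabulary and ambiguous-word measures as two uses of the same helper.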