Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification
This repository contains the code and data for the experiments in "Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification" (EMNLP 2018).

Install Dependencies

Download pre-trained word embeddings

Clone the indic-word2vec-embeddings repository into the data/raw directory:

git clone https://kelseyball@bitbucket.org/kelseyball/indic-word2vec-embeddings.git data/raw/indic-word2vec-embeddings

The experiment commands below expect the embeddings at data/raw/indic-word2vec-embeddings/.

Data prep

python prep.py

This script pre-transliterates the Hindi training data and word embeddings from Devanagari into Latin script. The generated files are placed in the data/clean directory and are used for the baseline experiment.
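To illustrate what pre-transliteration means here, the sketch below maps Devanagari characters to Latin approximations. This is not the repository's prep.py (which handles the full script and embedding files); the mapping is a toy subset shown only to demonstrate the idea.

```python
# Toy Devanagari -> Latin character mapping (illustrative subset only;
# a real transliterator covers the whole block and handles inherent vowels).
DEV_TO_LATIN = {
    "न": "n", "म": "m", "स": "s", "त": "t",
    "े": "e", "ा": "a", "्": "",  # virama (vowel killer) maps to nothing
}

def transliterate(text: str) -> str:
    """Replace each Devanagari character with a Latin approximation,
    leaving characters outside the toy mapping unchanged."""
    return "".join(DEV_TO_LATIN.get(ch, ch) for ch in text)

# Inherent vowels are not recovered by this toy mapping, so the output
# is a rough romanization rather than a full one.
print(transliterate("नमस्ते"))  # prints "nmste"
```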

Experiments

The baseline and experimental models are implemented in baseline.py and our-model.py, respectively. The experiments listed below and their results are described in greater detail in the paper.

  • Baseline
python baseline.py --htrain data/clean/hi_roman-ud-train.conllu  --etrain data/raw/en-ud-train.conllu --cdev data/clean/TWEETS-dev-v2-unsup-dist.conll --hi-embds data/clean/Hindi_roman.vec --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20
  • Our model (multi-lingual)
python our-model.py --htrain data/raw/hi-ud-train.conllu  --etrain data/raw/en-ud-train.conllu --cdev data/clean/TWEETS-dev-v2-unsup-dist.conll --hi-embds data/raw/indic-word2vec-embeddings/Hindi_utf.vec --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20
  • Our model (forced language choice)
python our-model.py --htrain data/raw/hi-ud-train.conllu  --etrain data/raw/en-ud-train.conllu --cdev data/clean/TWEETS-dev-v2-unsup-dist.conll --hi-embds data/raw/indic-word2vec-embeddings/Hindi_utf.vec  --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20 --use-ltags
  • Our model (languages weighted by HMM)
python our-model.py --htrain data/raw/hi-ud-train.conllu --etrain data/raw/en-ud-train.conllu --cdev data/clean/TWEETS-dev-v2-unsup-dist.conll --hi-embds data/raw/indic-word2vec-embeddings/Hindi_utf.vec  --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20
  • Our model (oracle language choice)
python our-model.py --htrain data/raw/hi-ud-train.conllu  --etrain data/raw/en-ud-train.conllu --cdev data/raw/TWEETS-dev-v2.conll --hi-embds data/raw/indic-word2vec-embeddings/Hindi_utf.vec  --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20 --use-ltags
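The --htrain and --etrain files above are Universal Dependencies treebanks in CoNLL-U format. As a reference for that format (not the repository's own loader, which may differ), here is a minimal sketch of reading (word, UPOS tag) pairs from CoNLL-U text:

```python
# Minimal CoNLL-U reader: yields one sentence at a time as a list of
# (FORM, UPOS) tuples. Illustrative only; skips comments, multiword
# token ranges (IDs like "1-2"), and empty nodes (IDs like "3.1").
def read_conllu(lines):
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                     # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):         # comment / metadata line
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue                     # multiword-token range or empty node
        sentence.append((cols[1], cols[3]))  # columns: FORM, UPOS

    if sentence:                         # file may lack a trailing blank line
        yield sentence

sample = "\n".join([
    "# text = Hello world",
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_",
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_",
    "",
])
for sent in read_conllu(sample.splitlines()):
    print(sent)  # prints [('Hello', 'INTJ'), ('world', 'NOUN')]
```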