This repository contains the code and data for the experiments in "Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification" (EMNLP 2018).

Install Dependencies

Download pre-trained word embeddings

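Clone the indic-word2vec-embeddings repository into the data/raw directory, so that the embeddings end up in data/raw/indic-word2vec-embeddings (the path the experiment commands expect).

git clone https://kelseyball@bitbucket.org/kelseyball/indic-word2vec-embeddings.git data/raw/indic-word2vec-embeddings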

Data prep

python prep.py

This script pre-transliterates the Hindi training data and word embeddings into Latin script. The generated files are placed in the data/clean directory and are used for the baseline experiment.
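
As a rough illustration of what pre-transliterating the training data involves, the sketch below rewrites the FORM column of CoNLL-U lines with a caller-supplied transliteration function. It is not taken from prep.py: the names transliterate_conllu, toy_transliterate, and TOY_TABLE are hypothetical, and the three-character table is a toy stand-in for a real Devanagari-to-Latin transliterator.

```python
from typing import Callable, Iterable, Iterator

# Toy Devanagari-to-Latin table, for illustration only (an assumption, not
# the mapping used by prep.py). A real transliterator must also handle
# vowel signs, conjuncts, and the inherent vowel.
TOY_TABLE = {"क": "ka", "म": "ma", "ल": "la"}


def toy_transliterate(word: str) -> str:
    """Map each character through TOY_TABLE; unknown characters pass through."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in word)


def transliterate_conllu(
    lines: Iterable[str], translit: Callable[[str], str]
) -> Iterator[str]:
    """Yield CoNLL-U lines with the FORM column (index 1) transliterated."""
    for line in lines:
        stripped = line.rstrip("\n")
        # Comment lines and blank sentence separators are passed through as-is.
        if not stripped or stripped.startswith("#"):
            yield stripped
            continue
        cols = stripped.split("\t")
        # A token line in CoNLL-U has exactly 10 tab-separated columns.
        if len(cols) == 10:
            cols[1] = translit(cols[1])
        yield "\t".join(cols)
```

Only the FORM column is rewritten here. Whether the actual script also touches the LEMMA column is not stated, so the sketch leaves it alone.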

Experiments

The baseline and experimental models are in baseline.py and our-model.py, respectively. The experiments listed below, and their results, are described in greater detail in the paper.

  • Baseline
python baseline.py --htrain data/clean/hi_roman-ud-train.conllu  --etrain data/raw/en-ud-train.conllu --cdev data/clean/TWEETS-dev-v2-unsup-dist.conll --hi-embds data/clean/Hindi_roman.vec --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20
  • Our model (multi-lingual)
python our-model.py --htrain data/raw/hi-ud-train.conllu  --etrain data/raw/en-ud-train.conllu --cdev data/clean/TWEETS-dev-v2-unsup-dist.conll --hi-embds data/raw/indic-word2vec-embeddings/Hindi_utf.vec --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20
  • Our model (forced language choice)
python our-model.py --htrain data/raw/hi-ud-train.conllu  --etrain data/raw/en-ud-train.conllu --cdev data/clean/TWEETS-dev-v2-unsup-dist.conll --hi-embds data/raw/indic-word2vec-embeddings/Hindi_utf.vec  --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20 --use-ltags
  • Our model (languages weighted by HMM)
python our-model.py --htrain data/raw/hi-ud-train.conllu --etrain data/raw/en-ud-train.conllu --cdev data/clean/TWEETS-dev-v2-unsup-dist.conll --hi-embds data/raw/indic-word2vec-embeddings/Hindi_utf.vec  --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20
  • Our model (oracle language choice)
python our-model.py --htrain data/raw/hi-ud-train.conllu  --etrain data/raw/en-ud-train.conllu --cdev data/raw/TWEETS-dev-v2.conll --hi-embds data/raw/indic-word2vec-embeddings/Hindi_utf.vec  --en-embds data/raw/indic-word2vec-embeddings/English.vec --hi-limit 50000 --en-limit 50000 --iter 20 --use-ltags
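
The commands above pass --hi-limit and --en-limit alongside the embedding files. Assuming these flags cap how many embedding vectors are read, and assuming the .vec files use the common word2vec text layout (an optional "count dim" header, then one word and its values per line), a minimal loader might look like the sketch below. The function name load_vec is hypothetical and is not part of this repository.

```python
from typing import Dict, Iterable, List, Optional


def load_vec(
    lines: Iterable[str], limit: Optional[int] = None
) -> Dict[str, List[float]]:
    """Read word2vec text-format vectors, keeping at most `limit` words."""
    vectors: Dict[str, List[float]] = {}
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        # Assumption: a two-field first line is the "<count> <dim>" header.
        if i == 0 and len(parts) == 2:
            continue
        # Stop once the cap is reached; earlier lines take priority.
        if limit is not None and len(vectors) >= limit:
            break
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors
```

To read from disk, pass an open file handle, for example `with open(path, encoding="utf-8") as f: vecs = load_vec(f, limit=50000)`. Embedding files are often sorted by word frequency, in which case a cap keeps the most frequent words, but that ordering is not confirmed for these files.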
