mixture models for creole genesis
Python R
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
data
scripts
.gitignore
LICENSE
README

README

*About

  Yugo Murawaki. 2016. Statistical Modeling of Creole Genesis. In
  Proceedings of the 2016 Conference of the North American Chapter of
  the Association for Computational Linguistics: Human Language
  Technologies (NAACL-HLT 2016), pp. 1329-1339.



*Requirements

-Python2
  - scikit-learn
  - numpy
  - matplotlib
- R (for missing data imputation)
  - missMDA package
- WALS data
  - git clone 'comp-typology' and follow the instructions in MEMO

  **TODO** remove dependencies on 'comp-typology'



*Data setup

1. Download
  % cd data/creole
  % wget http://zenodo.org/record/11135/files/apics-v2013.zip


2. Unzip apics-v2013.zip to obtain data.zip

  % unzip apics-v2013.zip clld-apics-1a306df/data.zip

3. Unzip data.zip to obtain CSV files

  % mv clld-apics-1a306df apics
  % cd apics
  % unzip data.zip


4. Convert various CSV files into a JSON file

  % cd scripts/format_apics
  % python apics.py ../../data/creole/apics ../../data/creole/apics_langs.json ../../data/creole/apics_features.json


5. Merge APiCS langs with WALS langs
   (see MEMO for how to preprocess WALS data)

  % python wals_apics.py ../../data/wals/langs_aug.json ../../data/wals/flist_aug.json ../../data/creole/apics_langs.json ../../data/creole/merged_full.json ../../data/creole/flist.json

  Notes:
  - wals_value_number may not be defined in valueset
    - An APiCS feature can take multiple values (freq < 100)
    - Sometimes no corresponding value is defined in WALS
  - We drop languages classified as "Sign Languages" or "Creoles and Pidgings",
    but some contact languages (e.g., Afrikaans) remain.


6. Filter out languages by feature coverage

  % python ../format_wals/lthres.py --lthres=0.3 ../../data/creole/merged_full.json ../../data/creole/flist.json ../../data/creole/merged.json



7. Missing value imputation

  % python mv/json2tsv.py ../data/creole/merged.json ../data/creole/flist.json ../data/creole/merged.tsv
  % R --vanilla -f mv/impute_mca.r --args ../data/creole/merged.tsv ../data/creole/merged.filled.tsv
  % python mv/tsv2json.py ../data/creole/merged.json ../data/creole/merged.filled.tsv ../data/creole/flist.json ../data/creole/merged.filled.json



8 Add categorical vectors

  % python ../format_wals/catvect.py ../../data/creole/merged.filled.json ../../data/creole/flist.json ../../data/creole/merged.filled.cat.json



9. Add binarized vects

  % python ../format_wals/binarize.py ../../data/creole/merged.filled.cat.json ../../data/creole/flist.json ../../data/creole/merged.filled.catbin.json ../../data/creole/flist_bin.json


10. Remove pidgin languages

  % python format_apics/remove_pidgins.py ../data/creole/merged.filled.catbin.json ../data/creole/apics_features.json ../data/creole/nonpidgins.filled.catbin.json



*Binary classification (creoles vs. non-creoles) and PCA plots

1. Mark IDs for cross-validation

  % python format_wals/mark_cv.py --seed=20 --cv=5 ../data/creole/nonpidgins.filled.catbin.json ../data/creole/nonpidgins.filled.catbin.cv5.json 


2. Perform SVM classificatoin

  % python svm.py --seed=20 --cv=5 ../data/creole/nonpidgins.filled.catbin.cv5.json 2>&1 | tee ../data/creole/nonpidgins.filled.catbin.cv5.svm.log


3. PCA plots

  **TODO**: replace hard-coded vars with command-line arguments

  % python format_apics/pca.py --plot_type=0 --output ../data/creole/pca12.pdf ../data/creole/nonpidgins.filled.catbin.json
  % python format_apics/pca.py --plot_type=0 --output ../data/creole/pca12C.pdf ../data/creole/nonpidgins.filled.catbin.json
  % python format_apics/pca.py --plot_type=0 --output ../data/creole/pca12N.pdf ../data/creole/nonpidgins.filled.catbin.json



*Mixture models

1. MONO model

  % python mixture_discrete.py --seed=20 --iter=10000 --dump=../data/creole/mono.dump.json ../data/creole/merged.filled.catbin.json ../data/creole/flist.json ../data/creole/sources.tsv 2>&1 >../data/creole/mono.assign.json | tee ../data/creole/mono.log

  % python format_apics/simplex.py --mtype=mono --output ../data/creole/mono.lang.pdf ../data/creole/mono.assign.json

  **TODO** remove hard-coded vars

  % python format_apics/assign_stat.py ../data/creole/mono.assign.json


2. FACT model

  % python mixture_discrete.py --seed=20 --iter=10000 --dump ../data/creole/discrete2_2.dump.json ../data/creole/merged.filled.catbin.json ../data/creole/flist.json ../data/creole/sources.tsv 2>&1 >../data/creole/discrete2.assign.json | tee ../data/creole/discrete2.log

  % python format_apics/simplex.py -mtype=fact --type=theta --output ../data/creole/discrete2.theta.pdf ../data/creole/discrete2.assign.json
  % python format_apics/simplex.py -mtype=fact --type=lang --output ../data/creole/discrete2.lang.pdf ../data/creole/discrete2.assign.json
  % python format_apics/simplex.py -mtype=fact --type=lang --output ../data/creole/discrete2.lang.pdf ../data/creole/discrete2.assign.json

  % python format_apics/assign_stat.py ../data/creole/discrete2.assign.json

  **TODO** remove hard-coded vars

  % python format_apics/assign_stat_fact.py --type=feature ../data/creole/discrete2.assign.json
  % python format_apics/assign_stat_fact.py --type=lang ../data/creole/discrete2.assign.json

  **TODO** remove hard-coded vars

  % python format_apics/feature_stat.py --type=lang ../data/creole/discrete2.assign.json ../data/creole/merged.filled.catbin.json ../data/creole/flist.json