This is a PyTorch implementation of the paper:

Unsupervised Learning of Syntactic Structure with Invertible Neural Projections
Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick
EMNLP 2018

The code performs unsupervised structure learning on language, specifically learning Markov structure and dependency structure.

Please contact the authors if you have any questions.


  • Python 3
  • PyTorch >=0.4
  • scikit-learn (for tagging task only)
  • NLTK (for parsing task only)


We provide the pre-trained word vector file used in the paper and a small subset of the Penn Treebank data for testing the tagging code. This subset contains a 10% sample of the Penn Treebank and is publicly available in the NLTK corpus. The full Penn Treebank dataset requires an LDC license.

To download the sample data, run:


The downloaded data is located in sample_data.

Both tasks use a simplified CoNLL format as data input, which contains four columns:

ID Token Tag Head

At training time only Token is used. Tag is used to evaluate the tagging task, and Head is the dependency head index, used to evaluate the parsing task. Pre-trained word vectors are required because they serve as the observations in our generative model. The input word2vec map should be a pickled Python dict object.
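As a rough illustration of the input format described above, the following sketch parses four-column lines and round-trips a toy word-vector dict through pickle. It is not the repository's loader: the blank line between sentences, whitespace-separated columns, a Head of 0 for the root, and plain lists as vectors are all assumptions made for this example, and `read_conll` is a hypothetical helper name.

```python
import io
import pickle


def read_conll(lines):
    """Parse simplified CoNLL lines (ID Token Tag Head) into sentences.

    Assumption: columns are whitespace-separated and sentences are
    separated by a blank line.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        idx, token, tag, head = line.split()
        current.append(
            {"id": int(idx), "token": token, "tag": tag, "head": int(head)}
        )
    if current:
        sentences.append(current)
    return sentences


# Toy data; a Head of 0 marking the root is an assumption of this sketch.
sample = """1 The DT 2
2 dog NN 3
3 barks VBZ 0

1 Hello UH 0
"""
sentences = read_conll(sample.splitlines())

# Toy word-vector map: a plain dict from token to a vector
# (lists are used here only to keep the example dependency-free).
word_vec = {
    "The": [0.1, 0.2],
    "dog": [0.3, 0.4],
    "barks": [0.5, 0.6],
    "Hello": [0.7, 0.8],
}

# Round-trip through pickle, using an in-memory buffer instead of a file.
buffer = io.BytesIO()
pickle.dump(word_vec, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)
```

In practice the dict would be written to a file and that path passed via --word_vec; the vector type expected by the training scripts should be checked against the provided word vector file.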

We also provide a script to preprocess the full Penn Treebank dataset for parsing (e.g. converting parse trees and removing punctuation). The wsj directory should look like:

+-- 00
|   +-- wsj_0001.mrg
|   +-- ...
+-- 01
+-- ...


python --ptbdir /path/to/wsj

This command generates train/test files in ptb_parse_data. Note that the generated data files contain gold POS tags in the Tag column, so they are not the files we used in the paper, where the tags are induced by the Markov model.

TODO: Simplify the pipeline to generate train/test files without gold POS tags for parsing, in order to reproduce the parsing results.

Markov Structure for Tagging


Train a Gaussian HMM baseline:

python --model gaussian --train_file /path/to/train --word_vec /path/to/word_vec_file

By default we evaluate on the training data (this is not cheating in the unsupervised learning setting); a different test dataset can be specified with the --test_file option. Training uses a GPU when one is available and falls back to CPU otherwise, but running on CPU can be extremely slow. Full configuration options can be found in the training script. After training, the trained model is saved in dump_models/markov/.

Unsupervised learning is usually very sensitive to initialization. For this task we run multiple random restarts and pick the one with the highest training data likelihood, as described in the paper. It is generally sufficient to run 10 random restarts. When running multiple random restarts, it is necessary to specify the --jobid or --taskid options to avoid overwriting models.

After training the Gaussian HMM, train a projection model with a Markov prior:

python \
        --model nice \
        --lr 0.01 \
        --train_file /path/to/train \
        --word_vec /path/to/word_vec_file \
        --load_gaussian /path/to/gaussian_model 

Initializing the prior with the pre-trained Gaussian baseline makes training much more stable. By default, 4 coupling layers are used in the NICE projection.
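To make the idea of stacked coupling layers concrete, here is a minimal sketch of additive coupling in the style of NICE, written in plain Python. It is not the repository's implementation: `shift` is a hypothetical stand-in for the learned coupling network, the half-swap between layers is one common choice among several, and the vector dimension is assumed to be even. The sketch only demonstrates that the stack is exactly invertible.

```python
def shift(half_vector):
    # Hypothetical stand-in for the learned coupling network.
    # Any function of the untouched half keeps the layer invertible.
    return [2.0 * v + 1.0 for v in half_vector]


def coupling_forward(x):
    """Additive coupling: keep the first half, shift the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    y2 = [a + b for a, b in zip(x2, shift(x1))]
    return x1 + y2


def coupling_inverse(y):
    """Undo coupling_forward by subtracting the same shift."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    x2 = [a - b for a, b in zip(y2, shift(y1))]
    return y1 + x2


def swap_halves(x):
    # Swapping halves between layers lets every dimension be transformed.
    # For an even-length vector this operation is its own inverse.
    half = len(x) // 2
    return x[half:] + x[:half]


def project(x, num_layers=4):
    """Apply num_layers coupling layers, swapping halves after each."""
    for _ in range(num_layers):
        x = swap_halves(coupling_forward(x))
    return x


def unproject(y, num_layers=4):
    """Invert project by undoing each layer in reverse order."""
    for _ in range(num_layers):
        y = coupling_inverse(swap_halves(y))
    return y


embedding = [0.5, -1.0, 2.0, 3.0]
projected = project(embedding)
recovered = unproject(projected)
```

Because each layer only adds a function of the untouched half, its Jacobian determinant is 1, which is what keeps the likelihood tractable in this family of models.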


On the provided subset of Penn Treebank that contains 3914 sentences, the Gaussian HMM is able to achieve ~76.5% M1 accuracy and ~0.692 VM score, and the projection model (4 layers) achieves ~79.2% M1 accuracy and ~0.718 VM score.
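For readers unfamiliar with the metrics, the sketch below shows one common way to compute many-to-one accuracy, which is what M1 is assumed to mean here: each induced state is mapped to the gold tag it co-occurs with most often, and accuracy is measured under that mapping. The function name and toy data are hypothetical, and the repository's own evaluation code may differ in details.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(gold_tags, induced_states):
    """Map each induced state to its most frequent gold tag, then score."""
    counts = defaultdict(Counter)
    for gold, state in zip(gold_tags, induced_states):
        counts[state][gold] += 1
    # For each state, the tokens carrying its majority gold tag count as correct.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(gold_tags)


# Toy example: state 2 is split between VBZ and NN, so one token is wrong.
gold = ["DT", "NN", "VBZ", "DT", "NN", "NN"]
pred = [0, 1, 2, 0, 1, 2]
accuracy = many_to_one_accuracy(gold, pred)
```

If the VM score refers to V-measure, scikit-learn provides `sklearn.metrics.v_measure_score`, which takes the gold labels and the predicted cluster labels.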


After training, prediction can be performed with:

python --model nice --train_file /path/to/tag_file --tag_from /path/to/pretrained_model

Here --train_file specifies the file to be tagged. The output file is written to the current directory.

DMV Structure for Parsing


First train a vanilla DMV model with Viterbi EM (this only runs on CPU):

python --train_file /path/to/train_data --test_file /path/to/test_data

The trained model is saved in dump_models/dmv/viterbi_dmv.pickle. The implementation of this basic DMV training is partially based on this repo.

Then use the pre-trained DMV to initialize the syntax model in the flow/Gaussian model:

python \
        --model nice \
        --train_file /path/to/train_data \
        --test_file /path/to/test_data \
        --word_vec /path/to/word_vec_file \
        --load_viterbi_dmv dump_models/dmv/viterbi_dmv.pickle

The script trains a Gaussian baseline when --model is set to gaussian. Training uses a GPU when one is available and falls back to CPU otherwise. The trained model is saved in dump_models/dmv/.


The awesome nlp_commons package (for preprocessing the Penn Treebank) in this repo was originally developed by Franco M. Luque and can be found in this repo.


    title = {Unsupervised Learning of Syntactic Structure with Invertible Neural Projections},
    author = {Junxian He and Graham Neubig and Taylor Berg-Kirkpatrick},
    booktitle = {Proceedings of EMNLP},
    year = {2018}

