Skip to content

It is a multi-source (many languages) trainable dependency parser extended by BIST-parser.

Notifications You must be signed in to change notification settings

jujbob/multilingual-bist-parser

Repository files navigation

What is the Multilingual-bist-parser?

It is a multi-source (many languages) trainable dependency parser extended by BIST-parser. If you want to know basic background of this parser, please refer to Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. (We strongly recommend you to check the paper from Eliyahu Kiperwasser 2016).

This parser was used for CoNLL 2017 shared task. we ranked at 5th among 33teams. For further information, please check A System for Multilingual Dependency Parsing based on Bidirectional LSTM Feature Representations

This parser can train models both monolingual and multilingual way, but here we give you an example shell file (sme.sh) that is just for multilingual approach as a beginning step. Note that!! Even if you train a model in a monolingual way, the output model would be different between original BIST-parser in terms of performance and usability because there are some differences.

  • Multilingualism: It takes language-hot encoding and multilingual word embedding as additional features.
  • Having a Unique ROOT: It has just one ROOT for addressing each sentence. ref:here
  • Training parsers with Universal Dependency (CoNLL-X format).
  • Several options for training different scenarios (use of XPOS and UPOS, concatenation of word embeddings or not, multilingual training ... so on)

Dependencies

  • Or boost-1.61.0 ~ 1.65.0: you can run it but need to fix self.trainer.update_epoch() -->> self.trainer.update() for a file "bist-parser/bmstparser/src/mstlstm.py" ref
  • eigen hg clone https://bitbucket.org/eigen/eigen

Causion: based on version of Boost and Dynet, you may get different performance!!

How to setup?

Dowonload the source code

    git clone git@github.com:jujbob/multilingual-bist-parser.git

setup Dynet

    sudo pip install cython 
    sudo apt-get install libboost-all-dev
    sudo pip install numpy
    cd dynet-base
    cd dynet
    mkdir build
    cd build
    sudo cmake .. -DEIGEN3_INCLUDE_DIR=../../eigen -DPYTHON=`which python`
    sudo make -j 2 
    cd python
    sudo python setup.py build
    sudo python setup.py install --user

On Macintosh we have succesfully installed Dynet and eigen following these instructions with Dynet version 2.0 and Boost version 1.65.1.

How to train the parser?

An Example for training North Sami and Finnish

    cd $PARSERHOME
    mkdir external_embeddings
    cd external_embeddings
    wget https://mycore.core-cloud.net/index.php/s/gbHTEsIllrbQVKy/download  # Download a multilingual word embedding
    bzip2 -d download
    mv download.out model_sme_fin_no.vec
    
    cd ..   # move to $PARSERHOME
    vi train_bmst_multi_sme.sh   # change PRJ_DIR path for your home directory
     - PRJ_DIR=/home/ktlim/parser/multilingual-bist-parser
    vi training_file_list.txt    # change corpus path for your home directory
     - /home/ktlim/parser/multilingual-bist-parser/corpus/ud-treebanks-v2.0/UD_Finnish/fi-ud-train.conllu|||fi
     - /home/ktlim/parser/multilingual-bist-parser/corpus/ud-treebanks-v2.0/UD_North_Sami/sme-ud-train_200.conllu|||sme
     
    ./train_bmst_multi_sme.sh 

Train your own models

You can train your own model by editing "train_bmst_multi_sme.sh" shell, and if you want to train a new combination of multilingual model, you may have multilingual word embedding and then also add related languages and codes to "language_vec.csv"

Citation

we appreciate citing the following:

Paper for 2017 CoNLL shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

@article{lim2017system,
  title={A System for Multilingual Dependency Parsing based on Bidirectional LSTM Feature Representations},
  author={Lim, KyungTae and Poibeau, Thierry},
  journal={Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  volume={501},
  pages={63--70},
  year={2017}
}

Contact

If you have any questions or suggestions, please send an email jujbob@gmail.com

About

It is a multi-source (many languages) trainable dependency parser extended by BIST-parser.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published