COMBO

COMBO is a jointly trained neural tagger, lemmatizer and dependency parser implemented in Python 3 using the Keras framework. It took part in the CoNLL 2018 Universal Dependencies shared task and ranked 3rd/4th in the official evaluation.

Paper

COMBO is described in the paper: Semi-Supervised Neural System for Tagging, Parsing and Lematization.

Usage

Training your own model:

python main.py --mode autotrain --train train_data.conllu --valid valid_data.conllu --embed external_embedding.txt --model model_name.pkl --force_trees

Making predictions:

python main.py --mode predict --test test_data.conllu --pred output_path.conllu --model model_name.pkl
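The prediction command writes its output in CoNLL-U format, so it can be inspected with a few lines of Python. The following is a minimal sketch, not part of COMBO: the `read_conllu` helper is a hypothetical name, and it assumes the standard 10-column tab-separated CoNLL-U layout, skipping comment lines, multiword-token ranges and empty nodes.

```python
import io


def read_conllu(lines):
    """Yield sentences as lists of (id, form, lemma, upos, head, deprel) tuples.

    Hypothetical helper for illustration; it is not provided by COMBO.
    """
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        head = int(cols[6]) if cols[6] != "_" else None
        sentence.append((int(cols[0]), cols[1], cols[2], cols[3], head, cols[7]))
    if sentence:
        # Handle a file that does not end with a blank line.
        yield sentence


# A tiny in-memory example standing in for a predicted file.
sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
sentences = list(read_conllu(io.StringIO(sample)))
for token in sentences[0]:
    print(token)
```

To read real predictions, pass an open file handle instead of the in-memory sample, for example the file written to the path given with --pred.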

Trained models

Models trained on Universal Dependencies (UD) treebanks:

Language	Treebank	LAS	MLAS	BLEX	Model size
Afrikaans	af_afribooms	84.72	72.91	74.98	377 MB
Ancient Greek	grc_perseus	74.20	53.30	54.29	101 MB
Ancient Greek	grc_proiel	76.45	59.95	67.47	101 MB
Arabic	ar_padt	71.95	62.75	64.38	737 MB
Armenian	hy_armtdp	28.15	5.02	11.25	738 MB
Basque	eu_bdt	83.12	68.82	77.96	737 MB
Bulgarian	bg_btb	89.36	81.10	79.98	738 MB
Buryat	bxr_bdt	15.16	1.09	1.92	90 MB
Catalan	ca_ancora	90.54	83.11	85.20	737 MB
Chinese	zh_gsd	63.92	53.48	57.84	744 MB
Croatian	hr_set	86.32	71.12	79.74	737 MB
Czech	cs_cac	90.72	83.27	86.69	740 MB
Czech	cs_fictree	91.83	84.23	87.81	740 MB
Czech	cs_pdt	90.34	84.04	86.96	740 MB
Danish	da_ddt	83.43	74.22	77.58	737 MB
Dutch	nl_alpino	87.15	74.93	77.06	737 MB
Dutch	nl_lassysmall	84.27	72.65	75.44	737 MB
English	en_ewt	82.31	73.33	76.52	737 MB
English	en_gum	82.82	73.24	73.57	737 MB
English	en_lines	80.33	72.25	74.01	737 MB
Estonian	et_edt	83.46	75.79	72.07	738 MB
Finnish	fi_ftb	86.89	78.42	81.06	739 MB
Finnish	fi_tdt	85.93	78.65	72.39	739 MB
French	fr_gsd	85.42	77.08	79.72	738 MB
French	fr_sequoia	88.99	81.48	84.67	738 MB
French	fr_spoken	74.31	63.43	65.34	738 MB
Galician	gl_ctg	81.17	68.15	73.60	736 MB
Galician	gl_treegal	73.21	52.88	62.86	736 MB
German	de_gsd	77.43	54.28	68.59	738 MB
Gothic	got_proiel	65.87	50.81	59.30	48 MB
Greek	el_gdt	88.49	76.15	78.57	738 MB
Hebrew	he_htb	63.69	50.26	53.58	737 MB
Hindi	hi_hdtb	91.43	76.23	86.29	593 MB
Hungarian	hu_szeged	79.47	66.09	72.51	737 MB
Indonesian	id_gsd	78.40	67.30	75.10	737 MB
Irish	ga_idt	69.24	37.31	47.32	206 MB
Italian	it_isdt	91.03	83.18	84.76	737 MB
Italian	it_postwita	73.99	61.14	62.98	737 MB
Japanese	ja_gsd	73.69	57.82	60.62	743 MB
Kazakh	kk_ktb	22.38	4.40	7.86	738 MB
Korean	ko_gsd	80.66	74.49	66.13	741 MB
Korean	ko_kaist	84.88	76.92	72.40	743 MB
Kurmanji	kmr_mg	21.95	2.26	5.01	45 MB
Latin	la_ittb	85.54	79.84	83.51	526 MB
Latin	la_perseus	68.07	49.77	52.75	526 MB
Latin	la_proiel	70.08	56.82	64.94	526 MB
Latvian	lv_lvtb	80.71	66.22	71.80	637 MB
North Sámi	sme_giella	57.16	39.66	45.03	47 MB
Norwegian	no_bokmaal	89.33	79.51	84.68	737 MB
Norwegian	no_nynorsk	88.36	79.32	82.89	737 MB
Norwegian	no_nynorsklia	68.26	57.51	60.98	737 MB
Old Church Slavonic	cu_proiel	71.14	56.52	66.04	48 MB
Old French	fro_srcmf	84.81	76.75	81.20	52 MB
Persian	fa_seraji	86.14	80.30	76.29	737 MB
Polish	pl_lfg	94.62	86.44	89.31	737 MB
Polish	pl_sz	91.38	80.45	85.59	737 MB
Polish	poleval2018	86.11	76.18	79.86	115 MB
Portuguese	pt_bosque	87.57	74.31	80.31	737 MB
Romanian	ro_rrt	85.31	76.84	79.54	737 MB
Russian	ru_syntagrus	91.10	85.37	87.16	741 MB
Russian	ru_taiga	74.24	61.59	64.36	741 MB
Serbian	sr_set	87.27	73.79	79.92	738 MB
Slovak	sk_snk	83.76	63.97	75.34	54 MB
Slovenian	sl_ssj	85.72	75.07	81.11	737 MB
Slovenian	sl_sst	58.12	45.93	50.94	737 MB
Spanish	es_ancora	89.68	82.60	84.51	737 MB
Swedish	sv_lines	81.97	66.26	77.01	737 MB
Swedish	sv_talbanken	85.89	77.68	80.74	737 MB
Turkish	tr_imst	63.54	52.51	58.89	737 MB
Ukrainian	uk_iu	84.71	69.88	77.97	738 MB
Upper Sorbian	hsb_ufal	21.30	1.45	4.53	139 MB
Urdu	ur_udtb	81.53	55.70	72.49	485 MB
Uyghur	ug_udt	63.10	40.71	52.76	165 MB
Vietnamese	vi_vtb	42.53	35.11	38.47	736 MB
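For readers unfamiliar with the metrics in the table: LAS (labeled attachment score) is the share of words whose predicted head and dependency relation both match the gold annotation; MLAS and BLEX additionally take morphology and lemmas into account. The sketch below illustrates LAS only and is a simplification, not the official evaluation script: it assumes gold and predicted tokens are already aligned one to one, and it compares relations with subtypes stripped (an assumption made here for illustration). The `las` function name is hypothetical.

```python
def las(gold, pred):
    """Return the labeled attachment score in percent.

    gold and pred are equal-length lists of (head, deprel) pairs,
    one pair per word. Hypothetical helper, not part of COMBO.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of words")
    if not gold:
        return 0.0
    correct = 0
    for (gold_head, gold_rel), (pred_head, pred_rel) in zip(gold, pred):
        # Assumption: relation subtypes are ignored, so "nmod:poss" matches "nmod".
        same_rel = gold_rel.split(":")[0] == pred_rel.split(":")[0]
        if gold_head == pred_head and same_rel:
            correct += 1
    return 100.0 * correct / len(gold)


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod:poss")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "nmod")]
score = las(gold, pred)
print(score)  # three of four words are correct: 75.0
```

The scores in the table come from the shared task evaluation, which also has to align system tokens with gold tokens when parsing from raw text, so numbers from this simplified function are not directly comparable.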

License

CC BY-NC-SA 4.0

Citation

@InProceedings{rybak-wrblewska:2018:K18-2,
  author    = {Rybak, Piotr  and  Wr{\'{o}}blewska, Alina},
  title     = {Semi-Supervised Neural System for Tagging, Parsing and Lematization},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {45--54},
  url       = {http://www.aclweb.org/anthology/K18-2004}
}
