Unsupervised German Compound Splitter

A compound splitter for German based on the semantic regularities in the vector space of word embeddings. For more information, see the accompanying presentation or our paper (listed under Citation).

Basic usage

To use this tool with standard settings, do the following:

$ wget https://raw.githubusercontent.com/jodaiber/semantic_compound_splitting/master/decompound_dict.py https://raw.githubusercontent.com/jodaiber/semantic_compound_splitting/master/models/de.dict
$ python decompound_dict.py de.dict < your_file
Verhandlungs Ablauf

The file your_file should contain tokenized sentences.
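
As a rough illustration of preparing such input, the sketch below separates words from punctuation with a regular expression and joins the tokens with single spaces. The helper name tokenize and the regular expression are assumptions made for this example and are not part of this repository; the README does not prescribe a tokenizer, so any standard tokenizer can be used instead.

```python
import re

# Hypothetical helper, not part of this repository: a very simple
# regex tokenizer that separates words from punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence):
    """Return the tokens of a sentence joined by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(sentence))


print(tokenize("Der Verhandlungsablauf war lang."))
# prints: Der Verhandlungsablauf war lang .
```

The printed lines can be redirected to a file and passed to decompound_dict.py as your_file.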

Options:

--drop_fugenlaute If this flag is set, Fugenlaute (infixes such as -s, -es) are dropped from the final words.

$ python decompound_dict.py de.dict --drop_fugenlaute < your_file
Verhandlung Ablauf
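
To make the effect of this flag concrete, here is a minimal sketch of dropping a linking element from an already split part. It is not the code used by decompound_dict.py: the function drop_linking_element, the short list of linking elements, and the vocabulary lookup are all assumptions chosen for illustration.

```python
# Illustrative subset of German linking elements (Fugenlaute), longest first.
# Assumption: the actual splitter may handle a different set.
LINKING_ELEMENTS = ("es", "en", "s", "n")


def drop_linking_element(part, vocabulary):
    """Strip a trailing linking element if the remaining stem is a known word.

    `vocabulary` is a hypothetical set of known words; checking against it
    avoids damaging words that merely end in -s or -n (e.g. "Haus").
    """
    if part in vocabulary:
        return part
    for element in LINKING_ELEMENTS:
        stem = part[: -len(element)]
        if part.endswith(element) and stem in vocabulary:
            return stem
    return part


vocabulary = {"Verhandlung", "Ablauf", "Haus"}
parts = ["Verhandlungs", "Ablauf"]
print(" ".join(drop_linking_element(p, vocabulary) for p in parts))
# prints: Verhandlung Ablauf
```

The vocabulary check is a design choice of this sketch: without it, blindly removing a final -s would also truncate words such as "Haus".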

--lowercase Lowercase all words.

--restore_case True/False Restore the case of the parts of the compound (words will take the case of the original word). Default: True

--ignore_case Ignore case; all input words should be lowercase.
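
The behaviour described for --restore_case can be pictured with the following sketch. It reflects one possible reading of "words will take the case of the original word", namely that every part is capitalized when the original compound starts with an uppercase letter. The function restore_case is hypothetical and is not taken from decompound_dict.py.

```python
def restore_case(original, parts):
    """Give each part the capitalization of the original word.

    Assumption: "case of the original word" means that parts are
    capitalized when the original starts with an uppercase letter
    and lowercased otherwise.
    """
    if original[:1].isupper():
        return [part[:1].upper() + part[1:] for part in parts]
    return [part.lower() for part in parts]


print(" ".join(restore_case("Verhandlungsablauf", ["verhandlungs", "ablauf"])))
# prints: Verhandlungs Ablauf
```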

Advanced usage

Citation

If you use this splitter in your work, please cite:

@inproceedings{daiber2015compoundsplitting,
  title={Splitting Compounds by Semantic Analogy},
  author={Daiber, Joachim and Quiroz, Lautaro and Wechsler, Roger and Frank, Stella},
  booktitle={Proceedings of the 1st Deep Machine Translation Workshop},
  editor={Jan Hajič and António Branco},
  pages={20--28},
  year={2015},
  isbn = {978-80-904571-7-1},
  publisher={Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics},
  url={http://jodaiber.github.io/doc/compound_analogy.pdf}
}

Contributors

  • Roger Wechsler, University of Amsterdam
  • Lautaro Quiroz, University of Amsterdam
  • Joachim Daiber, ILLC, University of Amsterdam

License

Apache 2.0