This package contains code for generating semantic translingual document representations.
The package implements the following methodologies:
- Latent Semantic Indexing
- Latent Dirichlet Allocation
- (Orthonormal) Explicit Topic Analysis
Building the package requires:
- G++ (>= 4.4.5) or a similar C++11 compiler
- Make
- ARPACK
- GNU Scientific Library
- CMake
To build, run:
cmake .
make
Input files are tokenized text with one document per line. Each line of output is a sparse vector of the following form:
i1 i2 ... in ||| v1 v2 ... vn
where element i1 of the vector has value v1, element i2 has value v2, and so on; omitted elements are zero.
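The sparse format above can be read back with a few lines of Python. This is a minimal sketch, not part of the package: the function name `parse_sparse_vector` is hypothetical, and whether indices start at 0 or 1 is not stated here, so the sketch keeps them exactly as written in the file.

```python
def parse_sparse_vector(line):
    """Parse 'i1 i2 ... in ||| v1 v2 ... vn' into an {index: value} dict."""
    # Everything before '|||' is the index list, everything after is the values.
    index_part, _, value_part = line.partition("|||")
    indices = [int(tok) for tok in index_part.split()]
    values = [float(tok) for tok in value_part.split()]
    if len(indices) != len(values):
        raise ValueError("number of indices and values differ")
    return dict(zip(indices, values))


# Example line with made-up numbers (not real output of the tools).
vec = parse_sparse_vector("0 3 7 ||| 0.5 1.25 2.0")

# Omitted elements are zero, so look them up with a default.
missing = vec.get(5, 0.0)
```

Storing the vector as a dictionary keeps the "omitted elements are zero" convention explicit: any index not present is read with a default of 0.0.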
bin/lsi train-corpus test-corpus-src test-corpus-trg K vectors-src vectors-trg
Applies the Latent Semantic Indexing method.
- train-corpus: The training corpus, where each document consists of both languages appended (i.e., by using scripts/join.py)
- test-corpus-src: The source language evaluation corpus
- test-corpus-trg: The target language evaluation corpus
- K: The number of topics to calculate in the representation
- vectors-src: The output file to write the source language representations to
- vectors-trg: The output file to write the target language representations to
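To illustrate what a training corpus with "both languages appended" might look like, the sketch below joins two line-aligned corpora document by document. It is an assumption about the general idea only and is not the contents of scripts/join.py, which may differ (for example in how it separates or marks the two languages); the function name `join_corpora` is hypothetical.

```python
def join_corpora(src_lines, trg_lines):
    """Append each target-language document to its aligned source document."""
    src_docs = [line.strip() for line in src_lines]
    trg_docs = [line.strip() for line in trg_lines]
    # One document per line, so both corpora need the same number of lines.
    if len(src_docs) != len(trg_docs):
        raise ValueError("corpora must contain the same number of documents")
    return [f"{s} {t}".strip() for s, t in zip(src_docs, trg_docs)]


# Toy, made-up documents for illustration only.
joined = join_corpora(
    ["the house\n", "a cat\n"],
    ["das haus\n", "eine katze\n"],
)
```

Each resulting line is one bilingual document, which matches the one-document-per-line input format.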
bin/lda train-corpus-src train-corpus-trg test-corpus-src test-corpus-trg K N vectors-src vectors-trg
Applies the Polylingual Latent Dirichlet Allocation method.
- train-corpus-src: The source language training corpus
- train-corpus-trg: The target language training corpus
- test-corpus-src: The source language evaluation corpus
- test-corpus-trg: The target language evaluation corpus
- K: The number of topics to calculate
- N: The number of iterations of Gibbs sampling to perform
- vectors-src: The output file to write the source language representations to
- vectors-trg: The output file to write the target language representations to
bin/oneta train-corpus test-corpus kernel-size vectors [-sqnorm]
Applies the (Orthonormal) Explicit Topic Analysis method.
- train-corpus: The training corpus (note that this command should be run once per language)
- test-corpus: The test corpus
- kernel-size: The degree of complexity in the calculation. With kernel-size = 0, no orthonormalization is performed; with kernel-size equal to the number of lines (documents) in the training corpus, full orthonormalization is performed.
- vectors: The output file to write the translingual representations to
- -sqnorm: If set, applies term frequency normalization (note that this flag must appear at the end of the argument list)
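Because the command is run once per language and `-sqnorm` has to come last, it can help to build the argument lists programmatically. The following is a rough sketch: the helper `oneta_command`, the file names, and the kernel size of 500 are all hypothetical placeholders, and only the argument order is taken from the usage line above.

```python
def oneta_command(train, test, kernel_size, vectors, sqnorm=False):
    """Build the argument list for one bin/oneta run."""
    args = ["bin/oneta", train, test, str(kernel_size), vectors]
    if sqnorm:
        # The flag must be the final argument.
        args.append("-sqnorm")
    return args


# One run per language; file names and kernel size are placeholders.
commands = [
    oneta_command("train.en", "test.en", 500, "vectors.en", sqnorm=True),
    oneta_command("train.de", "test.de", 500, "vectors.de", sqnorm=True),
]

# To actually execute the runs (requires the built binaries):
#     import subprocess
#     for cmd in commands:
#         subprocess.run(cmd, check=True)
```

The two output files can then be passed to the comparison step as the two sets of representations.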
bin/mate-finding topics1 topics2
Compares two sets of translingual representations. The files are assumed to correspond line by line.
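The idea of comparing line-aligned representations can be sketched as follows: for each source vector, find the most similar target vector and check whether it sits on the same line. The similarity measure and the scores that bin/mate-finding actually reports are not described here, so cosine similarity and top-1 accuracy are assumptions of this sketch, as are the function names. Vectors are given as `{index: value}` dictionaries.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top1_accuracy(src_vectors, trg_vectors):
    """Fraction of source documents whose nearest target is on the same line."""
    if len(src_vectors) != len(trg_vectors):
        raise ValueError("the two sets must correspond line by line")
    if not src_vectors:
        return 0.0
    hits = 0
    for i, src in enumerate(src_vectors):
        # Index of the most similar target vector (first one wins on ties).
        best = max(range(len(trg_vectors)),
                   key=lambda j: cosine(src, trg_vectors[j]))
        if best == i:
            hits += 1
    return hits / len(src_vectors)


# Toy, made-up vectors for illustration only.
src = [{0: 1.0}, {1: 1.0}]
trg = [{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}]
accuracy = top1_accuracy(src, trg)
```

This brute-force search compares every source vector with every target vector, which is fine for small evaluation sets but quadratic in the number of documents.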
J.P. McCrae, P. Cimiano & R. Klinger (2013). Orthonormal explicit topic analysis for cross-lingual document matching. In Proceedings of EMNLP 2013.
The data used in this paper is in here