Paraphrastic Sentence Compression

This is a course research project for NLP investigating the use of deep-linking bilingual phrase alignments and cross-domain parallel corpora for improving paraphrastic sentence compression results.

Preparing Data

Obtaining Parallel Sentences

The parallel corpora used are:

The Bible corpora need to be pre-processed out of their initial XML format, which can be done with the following command:

python <xmlfile> <outputfile>
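As a rough illustration of what this pre-processing step involves, the sketch below flattens an XML document into one sentence per line. It is not the repository's script: the function name `extract_segments` and the element name `seg` are assumptions made for the example, and the real corpus markup may differ.

```python
import xml.etree.ElementTree as ET


def extract_segments(xml_text, tag="seg"):
    """Return the whitespace-normalized text of every <tag> element, in document order.

    The tag name "seg" is an assumption about the corpus markup, not a documented fact.
    Empty segments are kept as empty strings so that line i still corresponds to
    segment i, which matters when pairing files line by line.
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for elem in root.iter(tag):
        # itertext() also collects text nested inside child elements.
        text = "".join(elem.itertext())
        sentences.append(" ".join(text.split()))
    return sentences


# Hypothetical sample input, for illustration only.
sample = """<corpus>
  <div>
    <seg id="1">First   sentence of the sample.</seg>
    <seg id="2">Second sentence
      of the sample.</seg>
  </div>
</corpus>"""

for line in extract_segments(sample):
    print(line)
```

Writing the returned list to the output file, one entry per line, gives the plain-text form the later steps expect.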


To tokenize sentences from parallel corpora, a tokenizer script is provided in the root directory.

Some extra dependencies are required to run the script.

  • Install the nltk library with sudo pip install nltk.
  • Run python in the terminal and launch the NLTK downloader (typically by importing nltk and calling nltk.download()). This opens an installation window from which you can install the necessary punkt tokenizer models.

Finally, you can run the tokenizer script.

python <input filename 1> <output filename 1> ...

You may wish to normalize your sentences by lowercasing them.

cat <input> | perl lowercase.perl > <output filename>
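If Perl is not available, the same normalization can be approximated in Python. This is a minimal sketch that assumes lowercase.perl does nothing beyond lowercasing each line; the function name `lowercase_lines` is made up for the example.

```python
def lowercase_lines(lines):
    """Yield each input line lowercased, leaving line endings untouched."""
    for line in lines:
        # str.lower() is Unicode-aware, so accented and non-Latin letters are handled.
        yield line.lower()


# Illustration on an in-memory list; any iterable of lines works.
for out in lowercase_lines(["The House\n", "Das Haus\n"]):
    print(out, end="")
```

To use it as a filter, pass an open input file (or sys.stdin) as `lines` and write the results to the output file.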

Paraphrase Extraction

Word Alignment

The unsupervised Berkeley Aligner is provided for language-agnostic word alignment. Alignment may take several hours and a large amount of memory, so it is recommended to submit the Condor jobs provided. Make sure to submit them from within the berkeleyaligner directory, and modify the absolute paths accordingly.

Phrase Alignment

An unoptimized implementation of the word-alignment-based phrase extraction technique from Koehn 2003 is provided; you can also substitute another implementation found online. It outputs two maps from Lang1 phrases to arrays of aligned Lang2 phrases.

python <srcfile> <trgfile> <word alignment file> <srcdict filename> <trgdict filename>
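For readers unfamiliar with the technique, the sketch below shows the core idea of alignment-consistent phrase extraction for a single sentence pair. It is a simplified illustration, not the repository's implementation: it assumes alignments are given as space-separated `i-j` index pairs (source index, target index), it extracts only the tightest target span for each source span (no extension over unaligned words), and the names `parse_alignment`, `extract_phrases`, and `max_len` are made up for the example.

```python
from collections import defaultdict


def parse_alignment(line):
    """Parse 'i-j' pairs such as '0-0 1-2' into a set of (src_idx, trg_idx) tuples."""
    return {tuple(int(k) for k in pair.split("-")) for pair in line.split()}


def extract_phrases(src, trg, alignment, max_len=4):
    """Map each source phrase to the target phrases consistent with the alignment.

    A phrase pair is consistent when no word inside the target span is aligned
    to a word outside the source span.
    """
    table = defaultdict(set)
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject the pair if the target span is also linked to outside words.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for s, t in alignment):
                continue
            src_phrase = " ".join(src[s_start:s_end + 1])
            trg_phrase = " ".join(trg[t_start:t_end + 1])
            table[src_phrase].add(trg_phrase)
    return {phrase: sorted(targets) for phrase, targets in table.items()}


# Toy sentence pair, for illustration only.
src = "the green house".split()
trg = "das grüne haus".split()
phrases = extract_phrases(src, trg, parse_alignment("0-0 1-1 2-2"))
print(phrases["green house"])
```

Running the same function with source and target swapped (and the alignment pairs reversed) gives the map in the opposite direction; either result can be serialized with `json.dump`.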

Extracting Paraphrases

To extract paraphrases, a universal script is provided for basic, parallel, and deep-linking paraphrase acquisition over the generated phrase-aligned dictionaries. These dictionaries are the JSON-formatted files produced in the previous step.

Depth 1: Bilingual single-corpus or cross-domain paraphrase extraction.
  e.g. python 1 data/phrases.txt <en-de> <de-en>

Depth 2+: Deep-linking paraphrase extraction across multiple languages.
  e.g. python 2 data/phrases.txt <en-de> <de-sp> <sp-en>

Parallel: Basic single-corpus extraction using multiple corpora.
  e.g. python parallel data/phrases.txt <en-de> <de-en> <en-fr> <fr-en> ...
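The three modes above can be pictured as variations on pivoting through phrase dictionaries. The sketch below is one plausible reading, not the script's actual logic: it assumes each dictionary maps a phrase to a list of aligned phrases, that a chain of dictionaries starts and ends in the same language, and that "parallel" means taking the union of results from several independent chains. The function names are hypothetical.

```python
def pivot_paraphrases(chain):
    """Follow each phrase through a chain of dictionaries back to its own language.

    chain is a list such as [en_de, de_en] or [en_de, de_sp, sp_en].
    Every phrase reached at the end, other than the starting phrase itself,
    is treated as a candidate paraphrase.
    """
    paraphrases = {}
    for phrase in chain[0]:
        frontier = {phrase}
        for table in chain:
            # Advance every phrase in the frontier by one translation step.
            frontier = {nxt for cur in frontier for nxt in table.get(cur, [])}
        frontier.discard(phrase)
        if frontier:
            paraphrases[phrase] = sorted(frontier)
    return paraphrases


def parallel_paraphrases(chains):
    """Union the paraphrases found independently along several chains."""
    merged = {}
    for chain in chains:
        for phrase, found in pivot_paraphrases(chain).items():
            merged.setdefault(phrase, set()).update(found)
    return {phrase: sorted(found) for phrase, found in merged.items()}


# Toy dictionaries, for illustration only.
en_de = {"under control": ["unter kontrolle"], "in check": ["unter kontrolle"]}
de_en = {"unter kontrolle": ["under control", "in check"]}
print(pivot_paraphrases([en_de, de_en]))
```

Longer chains reach more candidates but also accumulate alignment noise at every hop, which is one reason a separate ranking step is useful.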

Ranking Paraphrases

To rank paraphrases, an implementation of WordNet-based distributional similarity was used (gangeli:sim). You can run similarity tests with the following commands:

javac -cp sim/dist/sim-release.jar
java -cp sim/dist/sim-release.jar:. -Dwordnet.database.dir=sim/etc/WordNet-3.1/dict -mx3g ParaphraseRanker <paraphrase file>


Paraphrastic Sentence Compression using Deep-Link Bilingual Phrase Alignments.





