Paraphrastic Sentence Compression using Deep-Link Bilingual Phrase Alignments.

Paraphrastic Sentence Compression

This is a course research project in NLP investigating the use of deep-linking bilingual phrase alignments and cross-domain parallel corpora to improve paraphrastic sentence compression.

Preparing Data

Obtaining Parallel Sentences

The parallel corpora used are:

The Bible corpus needs to be converted out of its initial XML format, which can be done with the following command:

python bible_parser.py <xmlfile> <outputfile>
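As a rough sketch of what a parser like bible_parser.py might do, the snippet below pulls verse text out of an XML corpus and writes one sentence per line. The `<seg>` element name is an assumption for illustration; the real corpus schema may use different tags.

```python
import sys
import xml.etree.ElementTree as ET

def extract_verses(xml_text):
    """Return the text of every <seg> element, one verse per list entry.
    The <seg> tag is an assumed element name, not the confirmed schema."""
    root = ET.fromstring(xml_text)
    return [(seg.text or "").strip() for seg in root.iter("seg")]

if __name__ == "__main__" and len(sys.argv) == 3:
    xml_file, out_file = sys.argv[1], sys.argv[2]
    with open(xml_file, encoding="utf-8") as f:
        verses = extract_verses(f.read())
    with open(out_file, "w", encoding="utf-8") as f:
        f.write("\n".join(verses) + "\n")
```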

Tokenizing

To tokenize sentences from parallel corpora, a tokenizer script is provided in the root directory.

Some extra dependencies are required to run the script.

  • Install the nltk library with sudo pip install nltk.
  • Run python in the terminal and call nltk.download(). This opens an installation dialog, from which you can download the necessary punkt tokenizer models.

Finally, you can run the tokenizer script.

python tokenizer.py <input filename 1> <output filename 1> ...
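The shape of such a tokenizer pass can be sketched as below: read each input file line by line, split lines into tokens, and write the space-joined tokens back out. The regex here is a crude stand-in for nltk.word_tokenize, which the actual tokenizer.py is expected to use.

```python
import re

# Crude stand-in for nltk.word_tokenize: runs of word characters,
# or single non-space punctuation characters.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(line):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(line)

def tokenize_file(in_path, out_path):
    """Tokenize every line of in_path, writing space-joined tokens."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(tokenize(line)) + "\n")
```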

You may wish to normalize your sentences by lowercasing them.

cat <input> | perl lowercase.perl > <output filename>

Paraphrase Extraction

Word Alignment

The unsupervised Berkeley Aligner is provided for language-agnostic word alignment. Alignment may take several hours and a large amount of memory, so it is recommended to submit the provided Condor jobs. Make sure to submit them from within the berkeleyaligner directory, and modify the absolute paths accordingly.
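The Berkeley Aligner emits alignments in the common Pharaoh-style format: one line per sentence pair, made of whitespace-separated `i-j` links between source and target word positions. A minimal parser for that format, assuming this output convention, looks like:

```python
def parse_alignment_line(line):
    """Parse one Pharaoh-format alignment line like '0-0 1-2 2-1'
    into a list of (source_index, target_index) pairs."""
    pairs = []
    for link in line.split():
        i, j = link.split("-")
        pairs.append((int(i), int(j)))
    return pairs
```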

Phrase Alignment

An unoptimized implementation of the word-alignment-based phrase extraction technique from Koehn et al. (2003) is provided, which outputs two maps from Lang1 phrases to arrays of aligned Lang2 phrases.

python phrase_aligner.py <srcfile> <trgfile> <word alignment file> <srcdict filename> <trgdict filename>
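The core of the Koehn et al. (2003) technique is extracting every phrase pair that is consistent with the word alignment: a source span and the target span covering its links form a pair only if no link from inside the target span points outside the source span. A minimal sketch (omitting the extension over unaligned boundary words that the full algorithm performs):

```python
def extract_phrase_pairs(src, trg, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment, in the
    style of Koehn et al. (2003). `src` and `trg` are token lists;
    `alignment` is a set of (src_index, trg_index) links."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            tps = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # Consistency check: no link from inside the target span
            # may point outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]),
                          " ".join(trg[j1:j2 + 1])))
    return pairs
```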

Extracting Paraphrases

To extract paraphrases, a universal script is provided for basic, parallel, and deep-linking paraphrase acquisition over the generated phrase-aligned dictionaries. These dictionaries are the JSON-formatted files produced in the previous step.

Depth 1: Bilingual single-corpus or cross-domain paraphrase extraction.
  e.g. python paraphraser.py 1 data/phrases.txt <en-de> <de-en>

Depth 2+: Deep-linking paraphrase extraction across multiple languages.
  e.g. python paraphraser.py 2 data/phrases.txt <en-de> <de-sp> <sp-en>

Parallel: Basic single-corpus extraction using multiple corpora.
  e.g. python paraphraser.py parallel data/phrases.txt <en-de> <de-en> <en-fr> <fr-en> ...
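The depth-1 case above is the classic pivot technique (Bannard and Callison-Burch, 2005): two Lang1 phrases are paraphrases when both align to a common Lang2 phrase. A minimal sketch over dictionaries shaped like the JSON files from the previous step (phrase → list of aligned phrases):

```python
from collections import defaultdict

def pivot_paraphrases(src2trg, trg2src):
    """Depth-1 pivot paraphrasing: e1 paraphrases e2 when both align
    to a common foreign phrase f. Returns phrase -> set of paraphrases."""
    paraphrases = defaultdict(set)
    for e1, foreign in src2trg.items():
        for f in foreign:
            for e2 in trg2src.get(f, []):
                if e2 != e1:
                    paraphrases[e1].add(e2)
    return paraphrases
```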

Ranking Paraphrases

To rank paraphrases, an implementation of WordNet-based distributional similarity is provided (gangeli:sim). You can run similarity tests with the following commands:

javac -cp sim/dist/sim-release.jar ParaphraseRanker.java
java -cp sim/dist/sim-release.jar:. -Dwordnet.database.dir=sim/etc/WordNet-3.1/dict -mx3g ParaphraseRanker <paraphrase file>