Re-implementation of Johnson et al. (2007)'s phrasetable filtering strategy.

This implementation relies on Joy Zhang's SALM Suffix Array toolkit. It is
available here:

  http://projectile.is.cs.cmu.edu/research/public/tools/salm/salm.htm

--Chris Dyer <redpony@umd.edu>

BUILD INSTRUCTIONS
---------------------------------

1. Download and build SALM.

2. make SALMDIR=/path/to/SALM


USAGE INSTRUCTIONS
---------------------------------

1. Using SALM/Bin/Linux/Index/IndexSA.O32, create two suffix array
   indexes: one for the source side and one for the target side of your
   training bitext.

2. cat phrase-table.txt | ./filter-pt -e TARG.suffix -f SOURCE.suffix \
    -l <FILTER-VALUE>

   FILTER-VALUE is the -log(p-value) threshold described in Johnson et al.
     (2007).  It may be 'a+e', 'a-e' (the paper's alpha+epsilon and
     alpha-epsilon settings), or a positive real value.

3. Run ./filter-pt with no options to see more use cases.
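
The threshold semantics in step 2 can be illustrated with a minimal Python
sketch.  It is not taken from filter-pt.cpp.  It assumes, following Johnson
et al. (2007), that a phrase pair is scored with the -log of a one-sided
Fisher exact test p-value computed from the source count C(s), the target
count C(t), the co-occurrence count C(s,t) and the number of sentence pairs
N, and that alpha = log(N).  The function names, the eps value and the ">="
comparison are hypothetical choices made for illustration only.

```python
import math


def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def neg_log_p_value(c_st, c_s, c_t, n):
    """-log of the one-sided Fisher exact p-value for one phrase pair.

    Sums the hypergeometric probabilities of seeing c_st or more
    co-occurrences, given the marginal counts c_s and c_t in n sentence pairs.
    """
    if not (0 <= c_st <= min(c_s, c_t) <= max(c_s, c_t) <= n):
        raise ValueError("inconsistent counts")
    log_terms = []
    for k in range(c_st, min(c_s, c_t) + 1):
        if c_t - k > n - c_s:
            continue  # this contingency table cannot occur
        log_terms.append(
            log_binom(c_s, k) + log_binom(n - c_s, c_t - k) - log_binom(n, c_t)
        )
    # log-sum-exp keeps very small probabilities from underflowing to zero
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


def resolve_threshold(value, n, eps=1e-4):
    """Map 'a+e', 'a-e' or a numeric string to a threshold (eps is assumed)."""
    alpha = math.log(n)
    if value == "a+e":
        return alpha + eps
    if value == "a-e":
        return alpha - eps
    return float(value)


def keep(c_st, c_s, c_t, n, value):
    """True if the pair survives filtering (assumed rule: score >= threshold)."""
    return neg_log_p_value(c_st, c_s, c_t, n) >= resolve_threshold(value, n)
```

Under these assumptions, a pair whose source phrase, target phrase and
co-occurrence each appear exactly once has a p-value of 1/N, so its score is
exactly alpha: 'a+e' discards such pairs and 'a-e' keeps them.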


REFERENCES
---------------------------------

H. Johnson, J. Martin, G. Foster and R. Kuhn. (2007) Improving Translation
  Quality by Discarding Most of the Phrasetable. In Proceedings of the 2007
  Joint Conference on Empirical Methods in Natural Language Processing and
  Computational Natural Language Learning (EMNLP-CoNLL), pp. 967-975.