mjpost edited this page Sep 7, 2011 · 20 revisions

Projects

  • clean up the chart code

    • ✓ control pruning with a cell-level cube pruning pop limit instead of the various thresholds
    • ✓ clean up code in CubePruneCombiner.java
      • ✓ cell signatures on CubePruneState should be something quicker to compute than a string
      • ✓ more could be done on this; in particular, it hasn't been implemented for beam and threshold pruning
    • Hash cells instead of creating a complete 2d grid in Chart.java
  • clean up joshua's logging output

  • simplify joshua invocation

  • OOVs: better handling of OOV words.

    • Allow users to specify the behavior for OOVs (pass through as-is, delete, transliterate)
    • Integrate a transliteration module into the code
    • Include instructions on how to give OOVs a good non-terminal
  • Integrate Thrax into the Joshua codebase

    • If possible, re-use portions of the code that should be shared, like a Rule class.
  • Implement PRO for parameter tuning

    • Make it modular so that any evaluation metric can be used, similar to our Z-MERT implementation
  • Data distributed with code

    • Look over the example folders and see if any of them are worth keeping
    • Include some good sample data with the distribution that people can use to run the system on initially
  • Change LICENSE to BSD

    • Fix across all of the files
  • General code clean-up

    • Fold together redundant code
      • Should the lattice package be at joshua.lattice or somewhere else?
      • Where should oracle be located? Not the top-level, presumably.
    • Better high-level organization of code
      • Rename the packages to have functional names? Decode, Tune, Preprocess?
  • Subsampling

    • Experiment with the subsampler to make sure it doesn't change translation performance too much
    • Reuse the joshua.corpus classes instead of having redundant ones for the subsampler.
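
For the "hash cells instead of creating a complete 2d grid" item under the chart clean-up, one possible shape is sketched below. This is only an illustration under assumptions: `SparseChart`, its methods, and the generic cell type `C` are hypothetical names, not classes from Chart.java. Cells are created lazily and keyed by a packed numeric span, which is cheaper to hash and compare than a string.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: chart cells live in a hash map keyed by span (i, j),
 * so only spans that are actually filled take up memory.
 * Not the actual Chart.java implementation.
 */
public class SparseChart<C> {
    private final int sentenceLength;
    private final Map<Long, C> cells = new HashMap<>();

    public SparseChart(int sentenceLength) {
        this.sentenceLength = sentenceLength;
    }

    // Pack the span endpoints into a single long: i in the high 32 bits, j in the low 32 bits.
    private static long key(int i, int j) {
        return ((long) i << 32) | (j & 0xFFFFFFFFL);
    }

    // A valid span satisfies 0 <= i < j <= sentenceLength.
    private void checkSpan(int i, int j) {
        if (i < 0 || i >= j || j > sentenceLength) {
            throw new IllegalArgumentException("invalid span (" + i + ", " + j + ")");
        }
    }

    /** Returns the cell for span (i, j), or null if that span has not been filled. */
    public C get(int i, int j) {
        checkSpan(i, j);
        return cells.get(key(i, j));
    }

    /** Stores (or replaces) the cell for span (i, j). */
    public void put(int i, int j, C cell) {
        checkSpan(i, j);
        cells.put(key(i, j), cell);
    }

    /** Number of spans that currently hold a cell. */
    public int size() {
        return cells.size();
    }
}
```

A full grid allocates a slot for every (i, j) pair up front; with a map, the cost scales with the number of spans that survive pruning. Whether this is a net win for Joshua would need to be measured.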

Completed

  • ✓ Fix multithreaded Joshua (sentences should be placed in a queue that threads pop and deposit somewhere; deposits would then be assembled sequentially) --- currently on the fix_threads branch

  • ✓ Clean up the input handling routines (HackishSegmentParser, SAXSegmentParser, PlainSegmentParser)

  • ✓ Configuration parameters should be overridable from the command line. This is especially true of runtime related parameters such as the number of threads.

    • ✓ Rudimentary support has been added for -threads...
    • ✓ ...but it should be rewritten in a more general fashion: (1) load the configuration file, then (2) process command line arguments and let anything be overridden.
  • ✓ Fix KenLM integration

    • ✓ KenLM typically scores between 0.5 and 1.0 BLEU points lower than SRILM using the same model
    • ✓ fix the vocabulary mapping
    • ✓ use a proper UNK
  • ✓ get rid of SRILM

  • ✓ Delete packages that are no longer used

    • ✓ all of the suffix-array-based grammar extraction code
    • ✓ aligner
    • ✓ prefix_tree
    • ✓ bloomfilter_lm
    • ✓ distributed_lm
    • buildin_lm (keeping around)
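
The multithreading item above describes sentences being placed in a queue that threads pop from, with results deposited somewhere and then assembled sequentially. A minimal sketch of that pattern follows. It is not the code on the fix_threads branch: `OrderedParallelDecoder`, `translateAll`, and the `decode` function parameter are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

/** Hypothetical sketch of queue-based decoding with in-order output. */
public class OrderedParallelDecoder {

    /**
     * Decodes every sentence using numThreads worker threads.
     * The returned list is in the same order as the input list.
     */
    public static List<String> translateAll(List<String> sentences, int numThreads,
                                            Function<String, String> decode) {
        // Queue of sentence indices that the worker threads pop from.
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < sentences.size(); i++) {
            queue.add(i);
        }

        // Deposit area: each worker writes only to the slot of the sentence it decoded.
        String[] results = new String[sentences.size()];

        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            Thread worker = new Thread(() -> {
                Integer index;
                while ((index = queue.poll()) != null) {
                    results[index] = decode.apply(sentences.get(index));
                }
            });
            workers.add(worker);
            worker.start();
        }

        // Wait for all workers; join() also makes their writes visible to this thread.
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("interrupted while waiting for workers", e);
            }
        }

        // Assemble sequentially: slot order is input order.
        return new ArrayList<>(Arrays.asList(results));
    }
}
```

Because each result is stored at the index of its source sentence, output order does not depend on which thread finishes first. This sketch waits for all sentences before returning; a streaming variant would emit each result as soon as all earlier slots are filled.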