Depth-Bounded PCFG Induction
This is the repository for the paper Unsupervised Grammar Induction with Depth-bounded PCFG, which appears in Transactions of the Association for Computational Linguistics. A large part of the code is based on another system called UHHMM, so some scripts may still have older names.
- Python 3.4+
- CUDA 8.0+
- Numpy & Scipy
Steps to set up
There are a few steps to complete before the system is runnable. The first step is to compile the GPU source files. The second step is to compile the Cython files. Finally, you need to set up a config file.
1. Compile the CUDA C scripts
There are a couple of compile scripts in the
./gpusrc/ folder for reference. Since you will most
likely want to run the system on a supercomputer with many GPUs, the
file compile_osc.sh best shows how to compile the CUDA scripts. Replace the paths in that file for
the CUDA libraries, Python and Numpy with your own paths. After you have set up the paths, you only need to run
source ./gpusrc/compile_osc.sh for
nvcc to compile all the scripts.
2. Compile Cython scripts
After compiling the CUDA C scripts, you should be able to compile the Cython scripts by running
python3 setup.py build_ext.
3. Set up a config file
There is a sample config file in the
./config/ folder. The config file has two parts,
io and params. The settings are explained below. Please see the sample file for the format in which
the parameters should be written.
io.input_file: the path to the input
io.output_dir: the folder into which all the outputs will be written.
io.dict_file: the path to the input
params.random_restarts: the number of random restarts the sampler will run and evaluate before
starting a chain.
params.num_samples: the number of iterations the sampler will run.
params.startabp: the number of A/B/P categories given to the sampler, which is equivalent to
K in the paper.
params.init_alpha: the value for the hyperparameter for the symmetric Dirichlet prior, which is
equivalent to beta in the paper.
params.cpu_workers: the number of workers on CPUs. The CPU workers only do model compilation.
params.gpu_workers: the number of workers on GPUs. The GPU workers do both model compilation and sampling.
params.depth: the maximum depth limit for the sampler.
gpu: the flag to use GPU or not.
gpu_batch_size: the size of a batch used on the GPU.
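To make the settings above concrete, here is a minimal Python sketch that assembles them into a two-section file. It assumes an INI-style layout readable by Python's configparser, which this README does not confirm; the sample file in ./config/ remains the authoritative reference for the format. The function names, file paths and most numeric values are hypothetical placeholders, and placing gpu and gpu_batch_size under params is also an assumption.

```python
import configparser
import io


def build_config(input_file, output_dir, dict_file):
    """Assemble the io and params settings (INI layout is an assumption)."""
    config = configparser.ConfigParser()
    config["io"] = {
        "input_file": input_file,
        "output_dir": output_dir,
        "dict_file": dict_file,
    }
    config["params"] = {
        "random_restarts": "1",   # placeholder value
        "num_samples": "1000",    # placeholder value
        "startabp": "15",         # K; the value this README reports for the paper
        "init_alpha": "0.2",      # placeholder value
        "cpu_workers": "0",
        "gpu_workers": "1",
        "depth": "2",             # the value this README reports for the paper
        "gpu": "1",               # assumed to live under params
        "gpu_batch_size": "32",   # placeholder value, assumed to live under params
    }
    return config


def config_to_text(config):
    """Serialize the config so it can be inspected or written to disk."""
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(config_to_text(build_config("data/train.ints.txt",
                                      "outputs/run1",
                                      "data/train.dict")))
```

Compare the printed text against the sample config file before using it, and adjust section or key names wherever they differ.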
- You can run
make xxx.ints.txt to convert a space-delimited, one-sentence-per-line file into a
dict file used by the system.
- params.startabp and params.depth control the size of the compiled model. The largest values one can reasonably try are 15 and 2 respectively, which is what is used in the paper. With larger values, you risk running out of memory on the GPU.
- params.cpu_workers and params.gpu_workers can both be set to 0, which is usually what you want in order to run on a supercomputer or a cluster. In this case, the master process will write a
masterConfig.txt file into the root directory of the package, and you can start an arbitrary number of workers by running
python scripts/workers.py ..
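As an illustration of the conversion mentioned in the first note, the sketch below maps a space-delimited, one-sentence-per-line corpus to integer ids plus a word dictionary. It is not the repository's implementation: the function name and the id scheme (ids starting at 1, assigned in order of first appearance) are assumptions, and the actual file formats are whatever the make target produces.

```python
def index_corpus(lines):
    """Map each whitespace-separated token to an integer id.

    Returns the corpus as lists of ids together with the word-to-id mapping.
    """
    word_to_id = {}
    indexed = []
    for line in lines:
        ids = []
        for word in line.split():
            if word not in word_to_id:
                # Hypothetical scheme: ids start at 1, in order of first appearance.
                word_to_id[word] = len(word_to_id) + 1
            ids.append(word_to_id[word])
        indexed.append(ids)
    return indexed, word_to_id


if __name__ == "__main__":
    sentences = ["the dog barks", "the cat"]
    indexed, vocab = index_corpus(sentences)
    for ids in indexed:
        print(" ".join(str(i) for i in ids))
```

Keeping the mapping separate from the indexed corpus is what allows sampled output to be translated back into words afterwards.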
Run the system
Now the hard work is over, and the system is runnable. To run the system, just run
python scripts/uhhmm-trainer.py config/config.file. This launches the master process. The outputs
will be written into the specified output directory. If you have specified a nonzero number of GPU workers
in the config file, the system will start running right away. If you have specified 0 GPU workers,
please follow Note 3 to start some GPU workers for sampling.
Because most runs take longer than a couple of days, you may want to continue a stopped
run. To do that, pass the output directory instead of the config file to the trainer script, as in
python scripts/uhhmm-trainer.py outputs/output_dir, and the master process will
start and pick up where it stopped. Workers are started in the same way as for a new run.
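For job scripts on a cluster, it can help to wrap both cases in one helper. The sketch below builds the trainer command for either a new run (config file) or a resumed run (output directory). The helper name is hypothetical, and treating an existing directory as "resume" is an assumption of this wrapper only; how the trainer itself tells the two cases apart is not described here.

```python
import os

TRAINER = "scripts/uhhmm-trainer.py"


def trainer_command(target, python="python"):
    """Return (mode, argv) for launching the master process.

    `target` is either a config file (new run) or an existing output
    directory (resumed run); the trainer receives it as its only argument.
    """
    # Wrapper-level assumption: an existing directory means a resumed run.
    mode = "resume" if os.path.isdir(target) else "new"
    return mode, [python, TRAINER, target]


if __name__ == "__main__":
    mode, argv = trainer_command("config/config.file")
    print(mode, " ".join(argv))
```

The returned list can be passed to subprocess.run(argv, check=True) from the root directory of the package.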
The important output files written to the output directory include a file that
records log-likelihoods and right-branching tendency scores at every iteration. The linetrees
files are the sampled trees for the training corpus in bracketed format at every iteration.
They are ready for normal unlabeled bracketing evaluation when punctuation is removed from these