Source Code Language Model (SCLM)
We built a language model for source code in order to find buggy code. The intuition is that snippets of code with high entropy according to the language model are likely to be buggy. See the project slides for more information.
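To make the entropy intuition concrete, here is a minimal sketch that scores snippets by their average surprise under a language model. It is not the project's model: a character bigram model with add-one smoothing stands in for the RNN, and the names `train_bigram_model` and `cross_entropy` are hypothetical, chosen only for this illustration.

```python
import math
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count character bigrams in a training corpus (stand-in for the RNN)."""
    counts = defaultdict(Counter)
    for prev, cur in zip(corpus, corpus[1:]):
        counts[prev][cur] += 1
    return counts, set(corpus)


def cross_entropy(snippet, counts, vocab):
    """Return the average bits per character transition of `snippet`.

    Add-one smoothing gives unseen transitions a small non-zero probability,
    so unfamiliar code gets a high but finite score.
    """
    transitions = list(zip(snippet, snippet[1:]))
    if not transitions:
        raise ValueError("snippet must contain at least two characters")
    vocab_size = len(vocab) + 1  # one extra slot for characters never seen
    total_bits = 0.0
    for prev, cur in transitions:
        seen = counts.get(prev, Counter())
        prob = (seen[cur] + 1.0) / (sum(seen.values()) + vocab_size)
        total_bits -= math.log(prob, 2)
    return total_bits / len(transitions)
```

Trained on a corpus of ordinary loops, this scorer gives a familiar line a lower value than a line with swapped characters or unusual operators. In the project, the trained RNN would play the role of the bigram counts.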
This code was modified from https://github.com/jcjohnson/torch-rnn
You'll need to install the header files for Python 2.7 and the HDF5 library. On Ubuntu you should be able to install like this:
sudo apt-get -y install python2.7-dev
sudo apt-get install libhdf5-dev
The preprocessing script is written in Python 2.7; its dependencies are listed in the file requirements.txt.
You can install these dependencies in a virtual environment like this:
virtualenv .env                  # Create the virtual environment
source .env/bin/activate         # Activate the virtual environment
pip install -r requirements.txt  # Install Python dependencies
# Work for a while ...
deactivate                       # Exit the virtual environment
After installing torch, you can install / update these packages by running the following:
# Install most things using luarocks
luarocks install torch
luarocks install nn
luarocks install optim
luarocks install lua-cjson

# We need to install torch-hdf5 from GitHub
git clone https://github.com/deepmind/torch-hdf5
cd torch-hdf5
luarocks make hdf5-0-0.rockspec
CUDA support (Optional)
To enable GPU acceleration with CUDA, you'll need to install CUDA 6.5 or higher and the Lua packages cutorch and cunn.
You can install / update them by running:
luarocks install cutorch
luarocks install cunn
To train a model and use it to generate new text, you'll need to follow three simple steps:
Step 1: Preprocess the data
You can use any text file for training models. Before training, you'll need to preprocess the data using the script scripts/preprocess.py; this will generate an HDF5 file and JSON file containing a preprocessed version of the data.
If you have training data stored in my_data.txt, you can run the script like this:
python scripts/preprocess.py \
  --input_txt my_data.txt \
  --output_h5 my_data.h5 \
  --output_json my_data.json
This will produce the files my_data.h5 and my_data.json, which will be passed to the training script.
There are a few more flags you can use to configure preprocessing; read about them here
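As a rough picture of what character-level preprocessing involves, the sketch below builds a vocabulary, encodes the text as integers, and splits it into train/val/test portions. This is an assumption about the general approach, not a copy of scripts/preprocess.py: the function names, the 1-based indices (chosen because Lua arrays start at 1), and the split fractions are all hypothetical, and the HDF5 output is skipped to keep the example dependency-free.

```python
import json


def build_vocab(text):
    """Map each distinct character to a 1-based integer index (assumed layout)."""
    token_to_idx = {}
    for ch in sorted(set(text)):
        token_to_idx[ch] = len(token_to_idx) + 1
    return token_to_idx


def encode(text, token_to_idx):
    """Replace every character with its integer index."""
    return [token_to_idx[ch] for ch in text]


def split(encoded, val_frac=0.1, test_frac=0.1):
    """Cut the encoded sequence into contiguous train/val/test parts."""
    n_val = int(len(encoded) * val_frac)
    n_test = int(len(encoded) * test_frac)
    n_train = len(encoded) - n_val - n_test
    return {
        "train": encoded[:n_train],
        "val": encoded[n_train:n_train + n_val],
        "test": encoded[n_train + n_val:],
    }


def vocab_to_json(token_to_idx):
    """Serialize the vocabulary in both directions so samples can be decoded."""
    idx_to_token = {str(idx): ch for ch, idx in token_to_idx.items()}
    return json.dumps({"token_to_idx": token_to_idx, "idx_to_token": idx_to_token})
```

In this picture the integer arrays would go into the HDF5 file and the vocabulary into the JSON file, which is why the training script needs both.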
Step 2: Train the model
After preprocessing the data, you'll need to train the model using the train.lua script. This will be the slowest step.
You can run the training script like this:
th train.lua -input_h5 my_data.h5 -input_json my_data.json
This will read the data stored in my_data.h5 and my_data.json, run for a while, and save checkpoints to files with names like cv/checkpoint_10000.t7.
You can change the RNN model type, hidden state size, and number of RNN layers like this:
th train.lua -input_h5 my_data.h5 -input_json my_data.json -model_type rnn -num_layers 3 -rnn_size 256
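If you want to try several configurations, it can help to build the command from a script. The helper below is a hypothetical convenience, not part of this repository; it uses only the flags shown above and returns an argument list that could be passed to `subprocess.call`.

```python
def build_train_command(input_h5, input_json,
                        model_type=None, num_layers=None, rnn_size=None):
    """Assemble the argument list for `th train.lua`.

    Options left as None are omitted, so train.lua falls back to its defaults.
    """
    cmd = ["th", "train.lua", "-input_h5", input_h5, "-input_json", input_json]
    options = [
        ("-model_type", model_type),
        ("-num_layers", num_layers),
        ("-rnn_size", rnn_size),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

For example, looping over a few values of `rnn_size` and calling `build_train_command("my_data.h5", "my_data.json", rnn_size=size)` yields one command per configuration.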
By default this will run in GPU mode using CUDA; to run in CPU-only mode, add the flag
To run with OpenCL, add the flag
There are many more flags you can use to configure training; read about them here.
Step 3: Sample from the model
After training a model, you can generate new text by sampling from it using the script sample.lua. Run it like this:
th sample.lua -checkpoint cv/checkpoint_10000.t7 -length 2000
This will load the trained checkpoint cv/checkpoint_10000.t7 from the previous step, sample 2000 characters from it, and print the results to the console.
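Conceptually, sampling means repeatedly asking the model for a distribution over the next character and drawing from it. The sketch below illustrates that loop in Python under stated assumptions: `next_char_probs` is a hypothetical callback standing in for the trained network, and nothing here mirrors the internals of sample.lua.

```python
import random


def sample_index(probs, rng):
    """Draw one index from a discrete distribution given as a list of probabilities."""
    threshold = rng.random()
    cumulative = 0.0
    for idx, p in enumerate(probs):
        cumulative += p
        if threshold < cumulative:
            return idx
    return len(probs) - 1  # guard against floating-point round-off


def sample_text(next_char_probs, vocab, start, length, seed=0):
    """Generate `length` characters, one at a time, beginning with `start`.

    `next_char_probs(prefix)` is a hypothetical stand-in for the model: it takes
    the text generated so far and returns one probability per entry in `vocab`.
    """
    rng = random.Random(seed)  # fixed seed makes runs repeatable
    out = start
    while len(out) < length:
        probs = next_char_probs(out)
        out += vocab[sample_index(probs, rng)]
    return out
```

A real RNN would carry a hidden state between steps instead of re-reading the whole prefix; the callback form is used here only to keep the loop easy to follow.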
By default the sampling script will run in GPU mode using CUDA; to run in CPU-only mode add the flag
-gpu -1 and
to run in OpenCL mode add the flag
There are more flags you can use to configure sampling; read about them here.