Hierarchical Encoder Decoder for Dialog Modelling
Creating A Dataset
The script convert-text2dict.py can be used to generate model datasets based on text files with dialogues. It is assumed that each dialogue consists of three turns: A-B-A.
Prepare your dataset as a text file with one dialogue (one triple) per line. There must be exactly three utterances in each dialogue, separated by the tab symbol, and there must not be any tab symbols elsewhere in the file. The dialogues are assumed to be tokenized. If you have validation and test sets, they must satisfy the same requirements.
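To make the format concrete, here is a minimal sketch that writes a toy dataset file and checks the A-B-A triple format. The file name and utterances are made up for illustration:

```python
# Sketch: write a toy dialogue file and verify the A-B-A triple format.
# The file name "Training.txt" and the utterances are hypothetical.
triples = [
    ("hello how are you", "i am fine thanks", "glad to hear it"),
    ("what time is it", "it is noon", "thank you"),
]

with open("Training.txt", "w") as f:
    for a, b, c in triples:
        f.write("\t".join([a, b, c]) + "\n")

# Each line must contain exactly three tab-separated, tokenized utterances.
with open("Training.txt") as f:
    for line in f:
        utterances = line.rstrip("\n").split("\t")
        assert len(utterances) == 3, "each dialogue must be an A-B-A triple"
```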
Once you're ready, you can create the model dataset files by running:
python convert-text2dict.py <training_file> --cutoff <vocabulary_size> Training
python convert-text2dict.py <validation_file> --dict=Training.dict.pkl Validation
python convert-text2dict.py <test_file> --dict=Training.dict.pkl Test
where <training_file> is the training file, and <vocabulary_size> is the number of tokens that you want to train on (all other tokens will be converted to the unknown-token symbol).
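The cutoff keeps only the most frequent tokens and maps the rest to the unknown-token symbol. A minimal sketch of the idea (not the actual convert-text2dict.py implementation; the corpus and the "<unk>" marker here are illustrative):

```python
from collections import Counter

# Toy corpus; in practice the tokens come from the training file.
corpus = "the cat sat on the mat the dog sat".split()

vocabulary_size = 3  # corresponds to the --cutoff argument

# Keep the vocabulary_size most frequent tokens; everything else
# is replaced by the unknown-token symbol.
counts = Counter(corpus)
vocab = {tok for tok, _ in counts.most_common(vocabulary_size)}
encoded = [tok if tok in vocab else "<unk>" for tok in corpus]

print(encoded)
```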
Training The Model
If you have Theano with GPU installed (bleeding edge version), you can train the model as follows:
- Clone the Github repository
- Create new "Output" and "Data" directories inside it.
- Unpack your dataset files into the "Data" directory.
- Create a new prototype inside state.py (look at prototype_moviedic or prototype_test as examples)
- From the terminal, cd into the code directory and run:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python train.py --prototype <prototype_name> &> Model_Output.txt
For a 13M word dataset, such as MovieTriples, this takes about 1-2 days to reach convergence.
To test the model afterwards, you can run:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python evaluate.py --exclude-sos --plot-graphs Output/<model_name> --document_ids Data/Test_Shuffled_Dataset_Labels.txt &> Model_Evaluation.txt
where <model_name> is the name automatically generated by train.py.
If your GPU runs out of memory, you can reduce the bs (batch size) parameter inside state.py, but training will be slower. You can also experiment with the other parameters inside state.py.
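A prototype is a function in state.py that returns a dictionary of hyperparameters. A hedged sketch of lowering the batch size: prototype_test and the bs key are mentioned above, but the stand-in implementation and the other keys here are purely illustrative.

```python
# Sketch of a custom prototype in the style of state.py.
# prototype_test and the 'bs' (batch size) key come from the README;
# the remaining keys and values are hypothetical.

def prototype_test():
    # Stand-in for the repository's prototype_test; the real one
    # defines many more hyperparameters.
    return {"bs": 80, "lr": 0.0002, "prefix": "testmodel_"}

def prototype_small_gpu():
    # Start from an existing prototype and shrink the batch size
    # so training fits on a GPU with less memory.
    state = prototype_test()
    state["bs"] = 20
    return state

print(prototype_small_gpu()["bs"])  # 20
```

You would then train with `--prototype prototype_small_gpu` in place of `<prototype_name>`.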