Source Code for VHDA

This is the source code for Variational Hierarchical Dialog Autoencoder (VHDA), a hierarchical and recurrent VAE model for generating task-oriented dialog corpora (arXiv:2001.08604).

Preparation

A fresh Python environment (preferably using Anaconda3) is recommended. The following is the configuration of our original research environment.

  • OS: Ubuntu 18.04
  • CPU: i9-9900kf
  • RAM: 64GB
  • GPU: NVIDIA RTX 2080 Ti / CUDA 10.0 / cuDNN 7.x
  • Python 3.7.3

Git Clone

Clone this repository recursively so that its submodules are fetched as well.

git clone https://github.com/kaniblu/vhda --recursive

This repository contains FastText as a submodule.
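
If the repository was cloned without the --recursive flag, the FastText submodule can still be fetched afterwards with a standard git command:

git submodule update --init --recursive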

Install Python Packages

Install the Python packages from requirements.txt.

pip install -r requirements.txt

Install Tokenizer (Choose One)

Either install CoreNLP by following the instructions below, or install spaCy and download the basic English model.

Choice 1: Install Docker and CoreNLP

Docker is required to run the CoreNLP image. Assuming that the Docker daemon is running, run the following command to start a CoreNLP container instance.

docker run --name corenlp -d -p 9000:9000 vzhong/corenlp-server

This is needed to tokenize utterances using CoreNLP.
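
As a quick sanity check (not part of this repository's scripts), the server can be queried directly over HTTP on the port mapped above:

curl --data 'hello world' 'http://localhost:9000/?properties={"annotators":"tokenize","outputFormat":"json"}'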

Choice 2: Install SpaCy

After installing the SpaCy library from requirements.txt, run python -m spacy download en to download the basic English model.
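
To verify the installation, a quick tokenization check can be run (a minimal sketch, assuming the spaCy version pinned in requirements.txt still supports the en shortcut):

python -c "import spacy; nlp = spacy.load('en'); print([t.text for t in nlp('hello world')])"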

Prepare Datasets

Run python -m datasets.prepare to prepare (i.e., preprocess and convert the format of) the datasets.

If you installed CoreNLP in the previous step, run with the --tokenizer corenlp option.
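
For example:

python -m datasets.prepare --tokenizer corenlp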

Requires an internet connection to download all raw data.

Preprocessed datasets will be stored under data/.

Prepare Word Embeddings / FastText

Due to their size, the word embeddings required for our experiments are not included in this repository.

Download the latest FastText English model from the official website. Make sure that the path to the FastText model is updated in the YAML configurations under dst/internal/configs/ (these are the configurations for the dialog state trackers).

For other embeddings (GloVe and character embeddings), a python package will take care of downloading them.

Also, make sure that the fasttext binaries under third_party/fasttext are up to date. This repository was packaged with compiled binaries, but they might not work in a new environment; run make clean && make to recompile.
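
For example, assuming the common-crawl English model (cc.en.300.bin) is the one to use (the paper's exact embedding file may differ):

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
gunzip cc.en.300.bin.gz
cd third_party/fasttext && make clean && make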

Running

These are the main entry points:

  • train
  • train_eval
  • train_eval_gda
  • gda
  • gda_helper
  • generate
  • interpolate
  • interpolate_helper
  • evaluate
  • evaluate_helper
  • datasets.prepare
  • dst.internal.run
  • dst.internal.evaluate
  • dst.internal.multirun

We list all the basic commands required for replication below, but some options may need to be tweaked depending on the environment.

Training Generator

To train a toy VHDA, run the following. This trains a small VHDA generator using a toy dataset and saves the results under out/. Use this to check whether everything is in place.

python -m train [--gpu 0]

To replicate the full VHDA as described in the paper, run the following.

for dataset in woz
do 
    python -m train_eval \
        --save-dir out/$dataset \
        --model-path configs/vhda-dhier.yml \
        --data-dir data/$dataset/json/preprocessed \
        --epochs 10000 --batch-size 32 \
        --l2norm 0.000001 --validate-every 5 \
        --valid-batch-size 32 --gradient-clip 2.0 \
        --early-stop --early-stop-criterion "~nll" \
        --early-stop-patience 30 \
        --kl-mode "kl-mi" \
        --kld-schedule "[(0,0.0),(250000,1.0)]"
done

This trains the model for the WoZ2.0 dataset and stores the results under out/woz. Change woz to other datasets if needed.

Training State Tracker

This code comes with a fast implementation of the dialog state tracker (DST).

Run the following to verify that the DST runs correctly.

python -m dst.internal.run

To run a GCE+ baseline on WoZ2.0 without data augmentation, run the following.

for dataset in woz
do 
    python -m dst.internal.multirun \
        --data-dir data/$dataset/json/preprocessed \
        --model-path dst/internal/configs/gce-ft-wed0.2.yml \
        --epochs 200 --batch-size 100 --gradient-clip 2.0 \
        --early-stop --runs 10
done

This produces multiple runs of the baseline, and the final results are aggregated in out/summary.json.
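
Once the runs finish, the aggregated results can be inspected with a generic JSON pretty-print:

python -c "import json, pprint; pprint.pprint(json.load(open('out/summary.json')))"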

This assumes that FastText is properly installed. If not, refer to the Preparation section above.

Training Tracker with GDA

To train trackers with GDA enabled, use gda_helper:

for dataset in woz
do 
    python -m gda_helper \
        --dataset $dataset \
        --exp-dir <path-to-gen> \
        --dst-model gce-ft-wed0.2 \
        --scale 1.0 1.0 1.0 1.0 1.0 \
        --gen-runs 3 \
        --dst-runs 3 \
        --epochs 200 \
        --save-dir <save-dir> \
        --multiplier 1.0 \
        --batch-size 100 
done

Specify the directory of the trained generator with the --exp-dir option and the directory for saving the results with the --save-dir option.

The above command generates samples from a pretrained generator (VHDA) and trains a new set of trackers on the augmented dataset, using the gce-ft-wed0.2 configuration of the tracker model.

The aggregated results of the DST runs will be stored under <save-dir>/dst-summary.json.

Notes

  • On WoZ2.0, training takes about 20 minutes for VHDA and about 10 minutes for GCE+.
  • Available datasets can be checked under data/ (after running data preparation).
    • Make soft links to the MultiWoZ-Hotel and MultiWoZ-Rest datasets if needed (these datasets are created under the data/multiwoz/domains/ directory); see the sketch after this list.
  • Available generator models are listed under configs/ and available tracker models are listed under dst/internal/configs/.
  • Generation samples can be checked under the results of gda_helper.
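
A minimal sketch of the soft links mentioned above (the <domain> placeholder stands for whichever domain directories the preparation step creates; the link name is an assumption):

ln -s multiwoz/domains/<domain> data/multiwoz-<domain>

Note that the link target is relative to data/, where the link itself resides.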

Citation

Please cite the main paper as follows.

@article{yoo2020variational,
    title={Variational Hierarchical Dialog Autoencoder for Dialogue State Tracking Data Augmentation},
    author={Yoo, Kang Min and Lee, Hanbit and Dernoncourt, Franck and Bui, Trung and Chang, Walter and Lee, Sang-goo},
    journal={arXiv preprint arXiv:2001.08604},
    year={2020}
}
