# Demo of data conversion for CRAFT pre-training

This notebook demonstrates how to convert data from a ConvoKit Corpus into the input format used for CRAFT pre-training. Pre-training typically involves large amounts of data, so for efficiency the pre-training script does not read directly from a ConvoKit Corpus. Instead, you first run this notebook to reformat the conversational data into a more compact JSON lines format.

In [1]:
import json
import os
from convokit import Corpus, download

As a simple toy example, we will convert ConvoKit's version of the famous Switchboard Corpus. If you want to use this notebook for your own training data, simply change the following cell to load your desired Corpus.

In [2]:
corpus = Corpus(filename=download("switchboard-corpus"))

Downloading switchboard-corpus to /home/jonathan/.convokit/downloads/switchboard-corpus
Downloading switchboard-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/switchboard-corpus/switchboard-corpus.zip (5.8MB)... Done


## Output format

The pre-training script reads conversational data in a JSON lines format. Each line is a *dialog*, i.e., a linear chain of replies. A dialog is represented as a list of dicts (JSON objects), one per comment/utterance, ordered so that each utterance is a reply to the one right before it. Each utterance dict is formatted as follows:
```
{"text": "<utterance text here>"}
```
The reason the utterances are dicts and not strings is so that we can support incorporating utterance metadata in the future, e.g., for some future extension of CRAFT.
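To make the format concrete, here is a minimal sketch that serializes a single dialog the same way the output file stores it. The two utterance texts are invented for illustration and do not come from Switchboard.

```python
import json

# Hypothetical two-utterance dialog: each utterance replies to the one before it.
dialog = [
    {"text": "Do you have any pets?"},
    {"text": "Yes, two cats."},
]

# One dialog becomes exactly one line of the output file.
line = json.dumps(dialog)
print(line)

# Parsing the line recovers the original list of dicts.
assert json.loads(line) == dialog
```

Note that a conversation with branching replies yields several such lines, one per reply chain.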

The following loop converts ConvoKit Conversations into this format and writes the resulting JSON lines to disk.

In [3]:
corpus_name = "switchboard" # or set your own custom corpus name
output_dir = os.path.join("nn_input_data", corpus_name)
os.makedirs(output_dir, exist_ok=True)  # no error if the directory already exists
with open(os.path.join(output_dir, "train_processed_dialogs.txt"), "w") as fp:
    for convo in corpus.iter_conversations():
        # use Conversation.get_root_to_leaf_paths() to get linear reply chains from the conversation
        for dialog in convo.get_root_to_leaf_paths():
            dialog_json = [{'text': utt.text} for utt in dialog]
            fp.write(json.dumps(dialog_json))
            fp.write('\n')
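
As a quick sanity check, you may want to read the file back and confirm that every line parses as a list of utterance dicts. The sketch below is one way to do that; `load_dialogs` is a hypothetical helper written for this notebook, not part of ConvoKit or the CRAFT pre-training script.

```python
import json

def load_dialogs(path):
    """Read a JSON lines file and return a list of dialogs.

    Each dialog is a list of utterance dicts, e.g. [{"text": "..."}, ...].
    """
    dialogs = []
    with open(path, "r") as fp:
        for line in fp:
            line = line.strip()
            if line:  # skip blank lines, if any
                dialogs.append(json.loads(line))
    return dialogs
```

For example, `dialogs = load_dialogs(os.path.join("nn_input_data", corpus_name, "train_processed_dialogs.txt"))` followed by `print(len(dialogs), dialogs[0][:2])` shows how many dialogs were written and the first two utterances of the first one.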