This notebook demonstrates some of ConvoKit's functionality for preprocessing text and storing the results. In particular, it shows examples of:

* A `TextProcessor` base class that maps per-utterance attributes to per-utterance outputs;
* A `TextParser` class that does dependency parsing;
* Selective and decoupled data storage and loading;
* Per-utterance calls to a transformer;
* Pipelining transformers. 

## Preliminaries: loading an existing corpus.

To start, we load a clean version of a corpus. For speed we will use a 200-utterance subset of the tennis corpus.

In [1]:
import os

In [2]:
import convokit
from convokit import download

In [3]:
# OPTION 1: DOWNLOAD CORPUS 
# UNCOMMENT THESE LINES TO DOWNLOAD CORPUS
# DATA_DIR = '<YOUR DIRECTORY>'
# ROOT_DIR = download('tennis-corpus', data_dir=DATA_DIR)

# OPTION 2: READ PREVIOUSLY-DOWNLOADED CORPUS FROM DISK
# UNCOMMENT THIS LINE AND REPLACE WITH THE DIRECTORY WHERE THE TENNIS-CORPUS IS LOCATED
# ROOT_DIR = '<YOUR DIRECTORY>'

corpus = convokit.Corpus(ROOT_DIR, utterance_end_index=199)

In [4]:
corpus.print_summary_stats()

Number of Speakers: 9
Number of Utterances: 200
Number of Conversations: 100


In [5]:
# SET YOUR OWN OUTPUT DIRECTORY HERE. 
OUT_DIR = '<YOUR DIRECTORY>'

Here's an example of an utterance from this corpus (questions asked to tennis players after matches, and the answers they give):

In [6]:
test_utt_id = '1681_14.a'
utt = corpus.get_utterance(test_utt_id)

In [7]:
utt.text

"Yeah, but many friends went with me, Japanese guy. So I wasn't -- I wasn't like homesick. But now sometimes I get homesick."

Right now, `utt.meta` contains the following fields:

In [8]:
utt.meta

{'is_answer': True, 'is_question': False, 'pair_idx': '1681_14'}

## The TextProcessor class

Many of our transformers are per-utterance mappings of one attribute of an utterance to another. To facilitate these calls, we use a `TextProcessor` class that inherits from `Transformer`. 

`TextProcessor` is initialized with the following arguments:

* `proc_fn`: the mapping function. Supports one of two function signatures: `proc_fn(input)` and `proc_fn(input, auxiliary_info)`. 
* `input_field`: the attribute of the utterance that `proc_fn` will take as input. If set to `None`, defaults to reading `utt.text`.
* `output_field`: the name of the attribute that the output of `proc_fn` will be written to. 
* `aux_input`: any auxiliary input that `proc_fn` needs (e.g., a pre-loaded model); passed in as a dict.
* `input_filter`: a boolean function of signature `input_filter(utterance, aux_input)`, where `aux_input` is again passed as a dict. If this returns `False` then the particular utterance will be skipped; by default it will always return `True`.

Both `input_field` and `output_field` support multiple items -- that is, `proc_fn` could take in multiple attributes of an utterance and output multiple attributes. We show how this works in the advanced usage section below.

"Attribute" is a deliberately generic term. `TextProcessor` could produce "features" as we may conventionally think of them (e.g., wordcount, politeness strategies). It can also be used to pre-process text, i.e., generate alternate representations of the text. 

In [9]:
from convokit.text_processing import TextProcessor
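
To make the roles of these arguments concrete, here is a minimal pure-Python sketch of a per-utterance mapping. It is not ConvoKit's implementation: `ToyUtterance` and `apply_processor` are hypothetical stand-ins, and passing the auxiliary dict only when it is non-empty is an assumption about how the two `proc_fn` signatures might be told apart.

```python
from dataclasses import dataclass, field


@dataclass
class ToyUtterance:
    # Hypothetical stand-in for an utterance: raw text plus a metadata dict.
    text: str
    meta: dict = field(default_factory=dict)


def apply_processor(utts, proc_fn, output_field, input_field=None, aux_input=None):
    """Read one attribute of each utterance, apply proc_fn, write the result back."""
    aux_input = aux_input or {}
    for utt in utts:
        # With no input_field, fall back to the raw text.
        value = utt.text if input_field is None else utt.meta[input_field]
        # Assumption: the auxiliary dict is passed only when it is non-empty.
        result = proc_fn(value, aux_input) if aux_input else proc_fn(value)
        utt.meta[output_field] = result
    return utts


utts = [ToyUtterance("I played -- a tennis match.")]

# One-argument signature: proc_fn(input)
apply_processor(utts, lambda t: t.replace(" -- ", " "), "clean_text")

# Two-argument signature: proc_fn(input, auxiliary_info)
apply_processor(
    utts,
    lambda t, aux: [w for w in t.split() if w not in aux["stopwords"]],
    "content_words",
    input_field="clean_text",
    aux_input={"stopwords": {"I", "a"}},
)
print(utts[0].meta)
```

The second call reads from the field written by the first, which is the same pattern the rest of this notebook relies on.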

### simple example: cleaning the text

As a simple example, suppose we want to remove hyphens "`--`" from the text as a preprocessing step. To use `TextProcessor` to do this for us, we'd define the following as a `proc_fn`:

In [10]:
def preprocess_text(text):
    text = text.replace(' -- ', ' ')
    return text

Below, we initialize `prep`, a `TextProcessor` object that will run `preprocess_text` on each utterance.

When we call `prep.transform()`, the following will occur:

* Because we didn't specify an input field, `prep` will pass `utterance.text` into `preprocess_text`
* It will write the output -- the text minus the hyphens -- to a field called `clean_text` that will be stored in the utterance meta and that can be accessed as `utt.meta['clean_text']` or `utt.get_info('clean_text')`

In [11]:
prep = TextProcessor(proc_fn=preprocess_text, output_field='clean_text')
corpus = prep.transform(corpus)

And as desired, we now have a new field attached to `utt`.

In [12]:
utt.get_info('clean_text')

"Yeah, but many friends went with me, Japanese guy. So I wasn't I wasn't like homesick. But now sometimes I get homesick."

## Parsing text with the TextParser class

One common utterance-level thing we want to do is parse the text. In practice, in increasing order of (computational) difficulty, this typically entails:

* proper tokenizing of words and sentences;
* POS-tagging;
* dependency-parsing. 

As such, we provide a `TextParser` class that inherits from `TextProcessor` to do all of this, taking in the following arguments:

* `output_field`: defaults to `'parsed'`
* `input_field`
* `mode`: whether we want to go through all of the above steps (which may be expensive) or stop mid-way through. Supports the following options: `'tokenize'`, `'tag'`, `'parse'` (the default).

Under the surface, `TextParser` actually uses two separate models: a `spacy` object that does word tokenization, tagging and parsing _per sentence_, and `nltk`'s sentence tokenizer. The rationale is:

* `spacy` doesn't support sentence tokenization without dependency-parsing, and we often want sentence tokenization without having to go through the effort of parsing.
* We want to be consistent (as much as possible, given changes to spacy and nltk) in the tokenizations we produce, between runs where we don't want parsing and runs where we do.

If we've pre-loaded these models, we can pass them into the constructor too, as:

* `spacy_nlp`
* `sent_tokenizer`

In [13]:
from convokit.text_processing import TextParser

In [14]:
parser = TextParser(input_field='clean_text', verbosity=50)

In [15]:
corpus = parser.transform(corpus)

050/200 utterances processed
100/200 utterances processed
150/200 utterances processed
200/200 utterances processed


### parse output

A parse produced by `TextParser` is serialized in text form. It is a list consisting of sentences, where each sentence is a dict with

* `toks`: a list of tokens (i.e., words) in the sentence;
* `rt`: the index of the root of the dependency tree (i.e., `sentence['toks'][sentence['rt']]` gives the root)

Each token, in turn, contains the following:

* `tok`: the text of the token;
* `tag`: the tag;
* `up`: the index of the parent of the token in the dependency tree (no entry for the root);
* `dn`: the indices of the children of the token;
* `dep`: the dependency of the edge between the token and its parent.
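
As a sketch of how this structure can be navigated, the snippet below uses the keys as they appear in the printed output (`rt`, `toks`, `up`, `dn`). The sentence is the parse of "I played a tennis match." shown later in this notebook; the helper functions `root_token`, `children` and `path_to_root` are hypothetical and not part of ConvoKit.

```python
# Parse of "I played a tennis match.", copied from output shown in this notebook.
sentence = {
    'rt': 1,
    'toks': [
        {'dep': 'nsubj', 'dn': [], 'tag': 'PRP', 'tok': 'I', 'up': 1},
        {'dep': 'ROOT', 'dn': [0, 4, 5], 'tag': 'VBD', 'tok': 'played'},
        {'dep': 'det', 'dn': [], 'tag': 'DT', 'tok': 'a', 'up': 4},
        {'dep': 'compound', 'dn': [], 'tag': 'NN', 'tok': 'tennis', 'up': 4},
        {'dep': 'dobj', 'dn': [2, 3], 'tag': 'NN', 'tok': 'match', 'up': 1},
        {'dep': 'punct', 'dn': [], 'tag': '.', 'tok': '.', 'up': 1},
    ],
}


def root_token(sent):
    """Return the token dict at the root of the dependency tree."""
    return sent['toks'][sent['rt']]


def children(sent, idx):
    """Return the text of each child of the token at position idx."""
    return [sent['toks'][i]['tok'] for i in sent['toks'][idx]['dn']]


def path_to_root(sent, idx):
    """Follow 'up' links from idx to the root, which has no 'up' entry."""
    path = [sent['toks'][idx]['tok']]
    while 'up' in sent['toks'][idx]:
        idx = sent['toks'][idx]['up']
        path.append(sent['toks'][idx]['tok'])
    return path


print(root_token(sentence)['tok'])   # played
print(children(sentence, 1))         # ['I', 'match', '.']
print(path_to_root(sentence, 3))     # ['tennis', 'match', 'played']
```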

In [16]:
test_parse = utt.get_info('parsed')

In [17]:
test_parse[0]

{'rt': 5,
 'toks': [{'dep': 'intj', 'dn': [], 'tag': 'UH', 'tok': 'Yeah', 'up': 5},
  {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
  {'dep': 'cc', 'dn': [], 'tag': 'CC', 'tok': 'but', 'up': 5},
  {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'many', 'up': 4},
  {'dep': 'nsubj', 'dn': [3], 'tag': 'NNS', 'tok': 'friends', 'up': 5},
  {'dep': 'ROOT',
   'dn': [0, 1, 2, 4, 6, 8, 10, 11],
   'tag': 'VBD',
   'tok': 'went'},
  {'dep': 'prep', 'dn': [7], 'tag': 'IN', 'tok': 'with', 'up': 5},
  {'dep': 'pobj', 'dn': [], 'tag': 'PRP', 'tok': 'me', 'up': 6},
  {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
  {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'Japanese', 'up': 10},
  {'dep': 'npadvmod', 'dn': [9], 'tag': 'NN', 'tok': 'guy', 'up': 5},
  {'dep': 'punct', 'dn': [], 'tag': '.', 'tok': '.', 'up': 5}]}

If we didn't want to go through the trouble of dependency-parsing (which could be expensive) we could initialize `TextParser` with `mode='tag'`, which only POS-tags tokens:

In [18]:
texttagger = TextParser(output_field='tagged', input_field='clean_text', mode='tag')
corpus = texttagger.transform(corpus)

In [19]:
utt.get_info('tagged')[0]

{'toks': [{'tag': 'UH', 'tok': 'Yeah'},
  {'tag': ',', 'tok': ','},
  {'tag': 'CC', 'tok': 'but'},
  {'tag': 'JJ', 'tok': 'many'},
  {'tag': 'NNS', 'tok': 'friends'},
  {'tag': 'VBD', 'tok': 'went'},
  {'tag': 'IN', 'tok': 'with'},
  {'tag': 'PRP', 'tok': 'me'},
  {'tag': ',', 'tok': ','},
  {'tag': 'JJ', 'tok': 'Japanese'},
  {'tag': 'NN', 'tok': 'guy'},
  {'tag': '.', 'tok': '.'}]}

## Storing and loading corpora

We've now computed a bunch of utterance-level attributes. 

In [20]:
list(utt.meta.keys())

['is_answer', 'is_question', 'pair_idx', 'clean_text', 'parsed', 'tagged']

By default, calling `corpus.dump` will write all of these attributes to disk, within the file that stores utterances; later calling `corpus.load` will load all of these attributes back into a new corpus. For big objects like parses, this incurs a high computational burden (especially if in a later use case you might not even need to look at parses).

To avoid this, `corpus.dump` takes an optional argument `fields_to_skip`, a dict mapping each object type (`'utterance'`, `'conversation'`, `'speaker'`, `'corpus'`) to a list of fields that we do not want to write to disk. 

The following call will write the corpus to disk, without any of the preprocessing output we generated above:

In [21]:
corpus.dump(os.path.basename(OUT_DIR), base_path=os.path.dirname(OUT_DIR), 
            fields_to_skip={'utterance': ['parsed','tagged','clean_text']})
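
The effect of `fields_to_skip` on a single utterance's metadata can be pictured with the following sketch. `filter_meta` is a hypothetical helper, not a ConvoKit function, and the values of the skipped fields are placeholders.

```python
def filter_meta(meta, obj_type, fields_to_skip):
    """Return a copy of meta without the fields listed for this object type."""
    skip = set(fields_to_skip.get(obj_type, []))
    return {key: value for key, value in meta.items() if key not in skip}


# Metadata of the test utterance; the last three values are placeholders.
meta = {
    'is_answer': True, 'is_question': False, 'pair_idx': '1681_14',
    'clean_text': 'placeholder', 'parsed': ['placeholder'], 'tagged': ['placeholder'],
}

kept = filter_meta(meta, 'utterance',
                   {'utterance': ['parsed', 'tagged', 'clean_text']})
print(list(kept))  # ['is_answer', 'is_question', 'pair_idx']
```

Object types that are not listed in the dict keep all of their fields.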

For attributes we want to keep around, but that we don't want to read and write to disk in a big batch with all the other corpus data, `corpus.dump_info` will dump fields of a Corpus object into separate files. This takes the following arguments as input:

* `obj_type`: which type of Corpus object you're dealing with.
* `fields`: a list of the fields to write. 
* `dir_name`: which directory to write to; by default will write to the directory you read the corpus from.

This function will write each field in `fields` to a separate file called `info.<field>.jsonl` where each line of the file is a json-serialized dict: `{"id": <ID of object>, "value": <object.get_info(field)>}`. 
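
The file layout described above can be reproduced with the standard library alone. The sketch below writes and reads back one `info.<field>.jsonl` file; `write_info_file` and `read_info_file` are hypothetical helpers that mimic the described format, and the stored value is a shortened, illustrative one.

```python
import json
import os
import tempfile


def write_info_file(dir_name, field, id_to_value):
    """Write one JSON object per line: {"id": ..., "value": ...}."""
    path = os.path.join(dir_name, f"info.{field}.jsonl")
    with open(path, "w") as f:
        for obj_id, value in id_to_value.items():
            f.write(json.dumps({"id": obj_id, "value": value}) + "\n")
    return path


def read_info_file(dir_name, field):
    """Read the file back into a dict mapping object id to value."""
    path = os.path.join(dir_name, f"info.{field}.jsonl")
    with open(path) as f:
        return {entry["id"]: entry["value"] for entry in map(json.loads, f)}


# Shortened, illustrative value for one utterance.
tagged = {"1681_14.a": [{"toks": [{"tag": "UH", "tok": "Yeah"}]}]}

with tempfile.TemporaryDirectory() as tmp:
    file_name = os.path.basename(write_info_file(tmp, "tagged", tagged))
    restored = read_info_file(tmp, "tagged")

print(file_name, restored == tagged)
```

Because each line is an independent JSON object keyed by id, a field can be loaded later without touching the rest of the corpus data.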

In [22]:
corpus.dump_info('utterance',['parsed','tagged'], dir_name = OUT_DIR)

As expected, we now have the following files in the output directory:

In [23]:
ls $OUT_DIR

conversations.json  index.json         info.tagged.jsonl  utterances.jsonl
corpus.json         info.parsed.jsonl  speakers.json


If we now initialize a new corpus by reading from this directory:

In [24]:
new_corpus = convokit.Corpus(OUT_DIR)

In [25]:
new_utt = new_corpus.get_utterance(test_utt_id)

We see that things that we've omitted in the `corpus.dump` call will not be read.

In [26]:
new_utt.meta.keys()

KeysView({'is_answer': True, 'is_question': False, 'pair_idx': '1681_14'})

As a counterpart to `corpus.dump_info` we can also load auxiliary information on-demand. Here, this call will look for `info.<field>.jsonl` in the directory of `new_corpus` (or an optionally-specified `dir_name`) and attach the value specified in each line of the file to the utterance with the associated id:

In [27]:
new_corpus.load_info('utterance',['parsed'])

In [28]:
new_utt.get_info('parsed')

[{'rt': 5,
  'toks': [{'dep': 'intj', 'dn': [], 'tag': 'UH', 'tok': 'Yeah', 'up': 5},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
   {'dep': 'cc', 'dn': [], 'tag': 'CC', 'tok': 'but', 'up': 5},
   {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'many', 'up': 4},
   {'dep': 'nsubj', 'dn': [3], 'tag': 'NNS', 'tok': 'friends', 'up': 5},
   {'dep': 'ROOT',
    'dn': [0, 1, 2, 4, 6, 8, 10, 11],
    'tag': 'VBD',
    'tok': 'went'},
   {'dep': 'prep', 'dn': [7], 'tag': 'IN', 'tok': 'with', 'up': 5},
   {'dep': 'pobj', 'dn': [], 'tag': 'PRP', 'tok': 'me', 'up': 6},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
   {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'Japanese', 'up': 10},
   {'dep': 'npadvmod', 'dn': [9], 'tag': 'NN', 'tok': 'guy', 'up': 5},
   {'dep': 'punct', 'dn': [], 'tag': '.', 'tok': '.', 'up': 5}]},
 {'rt': 2,
  'toks': [{'dep': 'advmod', 'dn': [], 'tag': 'RB', 'tok': 'So', 'up': 2},
   {'dep': 'nsubj', 'dn': [], 'tag': 'PRP', 'tok': 'I', 'u

## Per-utterance calls

`TextProcessor` objects also support calls per-utterance via `TextProcessor.transform_utterance()`. These calls take in raw strings as well as utterances, and will return an utterance:



In [29]:
test_str = "I played -- a tennis match."

In [30]:
prep.transform_utterance(test_str)

Utterance({'obj_type': 'utterance', '_owner': None, 'meta': {'clean_text': 'I played a tennis match.'}, '_id': None, 'speaker': None, 'root': None, 'reply_to': None, 'timestamp': None, 'text': 'I played -- a tennis match.'})

In [31]:
from convokit.model import Utterance
adhoc_utt = Utterance(text=test_str)

In [32]:
adhoc_utt = prep.transform_utterance(adhoc_utt)

In [33]:
adhoc_utt.get_info('clean_text')

'I played a tennis match.'
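
One way to picture how a single call can accept either a raw string or an utterance is the sketch below. `ToyUtterance` and `transform_one` are hypothetical stand-ins; in particular, updating an existing utterance in place and returning the same object is an assumption of this sketch, not something this notebook confirms about ConvoKit.

```python
class ToyUtterance:
    """Hypothetical stand-in for an utterance: raw text plus a metadata dict."""

    def __init__(self, text, meta=None):
        self.text = text
        self.meta = {} if meta is None else meta


def preprocess_text(text):
    return text.replace(' -- ', ' ')


def transform_one(obj, proc_fn, output_field):
    # A raw string is first wrapped in a fresh utterance.
    utt = ToyUtterance(obj) if isinstance(obj, str) else obj
    utt.meta[output_field] = proc_fn(utt.text)
    return utt


from_str = transform_one("I played -- a tennis match.", preprocess_text, 'clean_text')

existing = ToyUtterance("I played -- a tennis match.")
from_utt = transform_one(existing, preprocess_text, 'clean_text')

print(from_str.meta['clean_text'])
print(from_utt is existing)
```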

## Pipelines

Finally, we can string together multiple transformers, and hence `TextProcessors`, into a pipeline, using a `ConvokitPipeline` object. This is analogous to (and in fact inherits from) scikit-learn's `Pipeline` class. 

In [34]:
from convokit.convokitPipeline import ConvokitPipeline

As an example, suppose we want to both clean the text and parse it. We can chain the required steps by initializing `ConvokitPipeline` with a list of steps, each represented as a tuple of `(<step name>, initialized transformer-like object)`:

* `'prep'`, our de-hyphenator
* `'parse'`, our parser


In [35]:
parse_pipe = ConvokitPipeline([('prep', TextProcessor(preprocess_text, 'clean_text_pipe')),
                ('parse', TextParser('parsed_pipe', input_field='clean_text_pipe',
                                    verbosity=50))])
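
Conceptually, chaining means that each step writes a field that a later step can read. The pure-Python sketch below illustrates only that idea; `run_pipeline` and its four-element step tuples are hypothetical, and whitespace splitting stands in for the real parser.

```python
def run_pipeline(steps, fields):
    """Apply (name, input_field, output_field, fn) steps in order to one utterance's fields."""
    for name, input_field, output_field, fn in steps:
        fields[output_field] = fn(fields[input_field])
    return fields


steps = [
    # Step 1: remove the hyphens, writing to 'clean_text_pipe'.
    ('prep', 'text', 'clean_text_pipe', lambda t: t.replace(' -- ', ' ')),
    # Step 2: read what step 1 wrote. Splitting is a stand-in for parsing.
    ('parse', 'clean_text_pipe', 'parsed_pipe', lambda t: t.split()),
]

result = run_pipeline(steps, {'text': 'I played -- a tennis match.'})
print(result['parsed_pipe'])  # ['I', 'played', 'a', 'tennis', 'match.']
```

The order of the steps matters: swapping them would make the second step look for a field that has not been written yet.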

In [36]:
corpus = parse_pipe.transform(corpus)

050/200 utterances processed
100/200 utterances processed
150/200 utterances processed
200/200 utterances processed


In [37]:
utt.get_info('parsed_pipe')

[{'rt': 5,
  'toks': [{'dep': 'intj', 'dn': [], 'tag': 'UH', 'tok': 'Yeah', 'up': 5},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
   {'dep': 'cc', 'dn': [], 'tag': 'CC', 'tok': 'but', 'up': 5},
   {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'many', 'up': 4},
   {'dep': 'nsubj', 'dn': [3], 'tag': 'NNS', 'tok': 'friends', 'up': 5},
   {'dep': 'ROOT',
    'dn': [0, 1, 2, 4, 6, 8, 10, 11],
    'tag': 'VBD',
    'tok': 'went'},
   {'dep': 'prep', 'dn': [7], 'tag': 'IN', 'tok': 'with', 'up': 5},
   {'dep': 'pobj', 'dn': [], 'tag': 'PRP', 'tok': 'me', 'up': 6},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 5},
   {'dep': 'amod', 'dn': [], 'tag': 'JJ', 'tok': 'Japanese', 'up': 10},
   {'dep': 'npadvmod', 'dn': [9], 'tag': 'NN', 'tok': 'guy', 'up': 5},
   {'dep': 'punct', 'dn': [], 'tag': '.', 'tok': '.', 'up': 5}]},
 {'rt': 2,
  'toks': [{'dep': 'advmod', 'dn': [], 'tag': 'RB', 'tok': 'So', 'up': 2},
   {'dep': 'nsubj', 'dn': [], 'tag': 'PRP', 'tok': 'I', 'u

As promised, the pipeline also works to transform utterances.

In [38]:
test_utt = parse_pipe.transform_utterance(test_str)

In [39]:
test_utt.get_info('parsed_pipe')

[{'rt': 1,
  'toks': [{'dep': 'nsubj', 'dn': [], 'tag': 'PRP', 'tok': 'I', 'up': 1},
   {'dep': 'ROOT', 'dn': [0, 4, 5], 'tag': 'VBD', 'tok': 'played'},
   {'dep': 'det', 'dn': [], 'tag': 'DT', 'tok': 'a', 'up': 4},
   {'dep': 'compound', 'dn': [], 'tag': 'NN', 'tok': 'tennis', 'up': 4},
   {'dep': 'dobj', 'dn': [2, 3], 'tag': 'NN', 'tok': 'match', 'up': 1},
   {'dep': 'punct', 'dn': [], 'tag': '.', 'tok': '.', 'up': 1}]}]

### Some advanced usage: playing around with parameters

The point of the following is to demonstrate more elaborate calls to `TextProcessor`. As an example, we will count words in an utterance.

First, we'll initialize a `TextProcessor` that does wordcounts (i.e., `len(x.split())`) on just the raw text (`utt.text`), writing output to field `wc_raw`.

In [40]:
wc_raw = TextProcessor(proc_fn=lambda x: len(x.split()), output_field='wc_raw')
corpus = wc_raw.transform(corpus)

In [41]:
utt.get_info('wc_raw')

23

If we instead wanted to wordcount our preprocessed text, with the hyphens removed, we can specify `input_field='clean_text'` -- as such, the `TextProcessor` will read from `utt.get_info('clean_text')` instead. 

In [42]:
wc = TextProcessor(proc_fn=lambda x: len(x.split()), output_field='wc', input_field='clean_text')
corpus = wc.transform(corpus)

Here we see that we are no longer counting the extra hyphen.

In [43]:
utt.get_info('wc')

22

Likewise, we can count characters:

In [44]:
chars = TextProcessor(proc_fn=lambda x: len(x), output_field='ch', input_field='clean_text')
corpus = chars.transform(corpus)

In [45]:
utt.get_info('ch')

120
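
The three counts above can be checked directly on the test utterance's text with plain Python, without ConvoKit. The standalone `--` is surrounded by spaces, so whitespace splitting counts it as a word in the raw text.

```python
raw = ("Yeah, but many friends went with me, Japanese guy. "
       "So I wasn't -- I wasn't like homesick. But now sometimes I get homesick.")
clean = raw.replace(' -- ', ' ')

wc_raw = len(raw.split())   # word count on the raw text
wc = len(clean.split())     # word count with the hyphens removed
ch = len(clean)             # character count of the cleaned text

print('--' in raw.split())  # True: the hyphens form their own token
print(wc_raw, wc, ch)       # 23 22 120
```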

Suppose that for some reason we now wanted to calculate:

* characters per word
* words per character (the reciprocal)

This requires:

* a `TextProcessor` that takes in multiple input fields, `'ch'` and `'wc'`;
* and that writes to multiple output fields, `'char_per_word'` and `'word_per_char'`.

Here's how the resultant object, `char_per_word`, handles this:

* in `transform()`, we pass `proc_fn` a dict mapping input field name to value, e.g., `{'wc': 22, 'ch': 120}`
* `proc_fn` will be written to return a tuple, where each element of that tuple corresponds to each element of the list we've passed to `output_field`, e.g., 

```
out0, out1 = proc_fn(input)
utt.set_info('char_per_word', out0)
utt.set_info('word_per_char', out1)
```
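
The same dict-in, tuple-out convention can be sketched on a plain metadata dict. `apply_multi` is a hypothetical helper used only to illustrate the mapping between field names and values; it is not how ConvoKit is implemented.

```python
def apply_multi(meta, proc_fn, input_fields, output_fields):
    """Pass several fields to proc_fn as a dict; unpack its tuple into output fields."""
    inputs = {name: meta[name] for name in input_fields}
    outputs = proc_fn(inputs)
    # Each element of the returned tuple is matched to an output field by position.
    for name, value in zip(output_fields, outputs):
        meta[name] = value
    return meta


meta = {'wc': 22, 'ch': 120}
apply_multi(
    meta,
    lambda x: (x['ch'] / x['wc'], x['wc'] / x['ch']),
    input_fields=['ch', 'wc'],
    output_fields=['char_per_word', 'word_per_char'],
)
print(meta['char_per_word'], meta['word_per_char'])
```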

In [46]:
char_per_word = TextProcessor(proc_fn=lambda x: (x['ch']/x['wc'], x['wc']/x['ch']), 
                              output_field=['char_per_word', 'word_per_char'], input_field=['ch','wc'])
corpus = char_per_word.transform(corpus)

In [47]:
utt.get_info('char_per_word')

5.454545454545454

In [48]:
utt.get_info('word_per_char')

0.18333333333333332

### Some advanced usage: input filters

Just for the sake of demonstration, suppose we wished to save some computation time and only parse the questions in a corpus. We can do this by specifying `input_filter` (which, as described above, takes an `Utterance` object and the auxiliary input as arguments). 

In [49]:
def is_question(utt, aux=None):
    # aux is unused here; it is accepted to match input_filter(utterance, aux_input).
    return utt.meta['is_question']

In [50]:
qparser = TextParser(output_field='qparsed', input_field='clean_text', input_filter=is_question, verbosity=50)

In [51]:
corpus = qparser.transform(corpus)

050/200 utterances processed
100/200 utterances processed
150/200 utterances processed
200/200 utterances processed
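
The skipping behaviour of an input filter can be sketched as follows. The utterances are plain dicts, `apply_filtered` is a hypothetical helper, and the string written to `qparsed` is a placeholder rather than a real parse.

```python
utts = [
    {'id': '1681_14.q', 'meta': {'is_question': True}},
    {'id': '1681_14.a', 'meta': {'is_question': False}},
]


def is_question(utt, aux=None):
    return utt['meta']['is_question']


def apply_filtered(utts, proc_fn, output_field, input_filter, aux_input=None):
    """Run proc_fn only on utterances for which input_filter returns True."""
    for utt in utts:
        if not input_filter(utt, aux_input):
            continue  # skipped utterances never get the output field
        utt['meta'][output_field] = proc_fn(utt)
    return utts


# Placeholder "parse": just a tagged copy of the id.
apply_filtered(utts, lambda u: 'parsed:' + u['id'], 'qparsed', is_question)

print(utts[0]['meta'].get('qparsed'))  # parsed:1681_14.q
print(utts[1]['meta'].get('qparsed'))  # None
```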


Since our test utterance is not a question, `qparser.transform()` will skip over it, and hence the utterance won't have the 'qparsed' attribute (and `get_info` returns `None`):

In [52]:
utt.get_info('qparsed')

However, if we take an utterance that's a question, we see that it is indeed parsed:

In [53]:
q_utt_id = '1681_14.q'
q_utt = corpus.get_utterance(q_utt_id)
q_utt.text

'How hard was it for you when, 13 years, left your parents, left Japan to go to the States. Was it a big step for you?'

In [54]:
q_utt.get_info('qparsed')

[{'rt': 2,
  'toks': [{'dep': 'advmod', 'dn': [], 'tag': 'WRB', 'tok': 'How', 'up': 1},
   {'dep': 'acomp', 'dn': [0], 'tag': 'RB', 'tok': 'hard', 'up': 2},
   {'dep': 'ROOT', 'dn': [1, 3, 4, 9, 10, 11, 22], 'tag': 'VBD', 'tok': 'was'},
   {'dep': 'nsubj', 'dn': [], 'tag': 'PRP', 'tok': 'it', 'up': 2},
   {'dep': 'prep', 'dn': [5], 'tag': 'IN', 'tok': 'for', 'up': 2},
   {'dep': 'pobj', 'dn': [], 'tag': 'PRP', 'tok': 'you', 'up': 4},
   {'dep': 'advmod', 'dn': [7], 'tag': 'WRB', 'tok': 'when', 'up': 9},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 6},
   {'dep': 'nummod', 'dn': [], 'tag': 'CD', 'tok': '13', 'up': 9},
   {'dep': 'npadvmod', 'dn': [6, 8], 'tag': 'NNS', 'tok': 'years', 'up': 2},
   {'dep': 'punct', 'dn': [], 'tag': ',', 'tok': ',', 'up': 2},
   {'dep': 'dep', 'dn': [13, 14, 15], 'tag': 'VBD', 'tok': 'left', 'up': 2},
   {'dep': 'poss', 'dn': [], 'tag': 'PRP$', 'tok': 'your', 'up': 13},
   {'dep': 'dobj', 'dn': [12], 'tag': 'NNS', 'tok': 'parents', 'up': 11}