# Inferring prompt types

This notebook demos two transformers, which broadly aim at producing abstract representations of an utterance in terms of its phrasing and its rhetorical intent: 

* The `PhrasingMotifs` transformer extracts representations of utterances in terms of how they are phrased;
* The `PromptTypes` transformer computes latent representations of utterances in terms of their rhetorical intention -- the _responses_ they aim at prompting -- and assigns utterances to different (automatically-inferred) types of intentions.

It also demos some additional transformers used in preprocessing steps.



Together, these transformers implement the methodology detailed in the [paper](http://www.cs.cornell.edu/~cristian/Asking_too_much.html):

```
Asking Too Much? The Rhetorical Role of Questions in Political Discourse 
Justine Zhang, Arthur Spirling, Cristian Danescu-Niculescu-Mizil
Proceedings of EMNLP 2017
```

ConvoKit also includes an end-to-end implementation, `PromptTypesWrapper`, that runs the transformers one after another, and handles the particular pre-processing steps found in the paper. See [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/prompt-types/prompt-type-wrapper-demo.ipynb) for a demonstration of this end-to-end transformer.

This is a really clear example of a method which reflects both good (we think) ideas and somewhat ad-hoc implementation decisions. As such, there are lots of options and potential variations to consider (beyond the deeper question of what phrasings and intentions even are) -- I'll detail these as I go along.

Note that due to small methodological tweaks and changes in the random seed, the particular output of the transformers as presently implemented may not totally match the output from the paper, but the broad types of questions returned are comparable.

## Preliminaries

First we load the corpus. We will examine a dataset of questions from question periods that take place in the British House of Commons (also detailed in the paper). 

In [1]:
import convokit
from convokit import download

We'll load the corpus, plus some pre-computed dependency parses (see [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/text-processing/text_preprocessing_demo.ipynb) for a demonstration of how to get these parses on your own; for this dataset they should be included with our release).

In [None]:
# OPTION 1: DOWNLOAD CORPUS 
# UNCOMMENT THESE LINES TO DOWNLOAD CORPUS
# DATA_DIR = '<YOUR DIRECTORY>'
# ROOT_DIR = download('parliament-corpus', data_dir=DATA_DIR)

# OPTION 2: READ PREVIOUSLY-DOWNLOADED CORPUS FROM DISK
# UNCOMMENT THIS LINE AND REPLACE WITH THE DIRECTORY WHERE THE PARLIAMENT-CORPUS IS LOCATED
# ROOT_DIR = '<YOUR DIRECTORY>'

corpus = convokit.Corpus(ROOT_DIR)
corpus.load_info('utterance',['parsed'])

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
VERBOSITY = 10000

Our specific goal, which we'll use ConvoKit to accomplish, is to produce an abstract representation of questions asked by members of parliament, in terms of:

* how they are phrased: what phrasing, or lexico-syntactic "motif", does a question have? 
* their rhetorical intention: what's the intent of the asker -- which we take to mean the response the asker aims to prompt? 

In other words, what are the different types of questions people ask in parliament?

Here's an example of an utterance:

In [5]:
test_utt_id = '1997-01-27a.4.0'
utt = corpus.get_utterance(test_utt_id)

In [6]:
utt.text

"Does my right hon Friend agree that last week 's statement about a replacement royal yacht has been widely welcomed ? Does he agree also that , ideally , Britannia should become the centrepiece of the millennium project in Portsmouth harbour , spanning Gosport and Portsmouth ? I am sure that that idea would prove very popular . As to plans for a new yacht , does my right hon Friend share my distaste for the Opposition 's tactics ? They had every opportunity to express their grudging and negative attitude during the past two years when the project was under discussion ."

To state our goals more precisely:

* For each _sentence_ that has a question (here, the first, second, and fourth sentences), we want to come up with a representation of the sentence's phrasing. Intuitively, for instance, the first two sentences could both be thought of as instances of "Does X agree that Y?" -- whether Y is about a yacht or a harbour. 
* For each utterance, we want to come up with a representation of the utterance's rhetorical intent. Intuitively, all the questions could be construed as asking if the answerer is in agreement with the asker -- whether they "agree" with the opinion or "share" the opinion. We might think of this as being an example of an "agreeing" type of question.

Intuitively, if we want to get at this higher level of abstraction, we have to look beyond the particular n-grams: it doesn't seem plausible that there is a meaningful type of question about yachts (unless our specific context is the parliamentary subcommittee on yachts). 



## Preprocessing step: Arcs

One place to start is to look at the structural "skeleton" of a sentence -- i.e., its dependency parse. Thus, we are first going to represent questions in terms of their dependency parses by extracting all the parent-to-child token edges, or "arcs". We will use the `TextToArcs` class to do this:

In [7]:
from convokit.text_processing import TextToArcs

`get_arcs` is a transformer (actually a `TextProcessor`) that will read the dependency parse of an utterance and write the resultant arcs to a field called `'arcs'`:

(demo continues after long output)

In [10]:
get_arcs = TextToArcs('arcs', verbosity=VERBOSITY)
corpus = get_arcs.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

(demo continued)

`'arcs'` is a list where each element corresponds to a sentence in the utterance. Each sentence is represented in terms of its arcs, in a space-separated string. 

Each arc, in turn, can be read as follows:

* `x_y` means that `x` is the parent and `y` is the child token (e.g., `agree_does` = `agree --> does`)
* `x_*` means that `x` is a token with at least one descendant, which we do not resolve (this is roughly like bigrams backing off to unigrams)
* `x>y` means that `x` and `y` are the first two tokens in the sentence (the decision here was that how the sentence starts is a signal of "phrasing structure" on par with the dependency tree structure)
* `x>*` means that `x` is the first token in the sentence. 
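To make this notation concrete, here is a minimal sketch in plain Python that builds the same kinds of arc strings from a hand-written toy parse. It is not ConvoKit's implementation: `toy_arcs` and its `tokens`/`parents` inputs are hypothetical names used only for illustration, and the sketch assumes (going by the sample output shown for our example utterance) that an `x_*` arc is emitted for every token.

```python
def toy_arcs(tokens, parents, use_start=True):
    """Build a space-separated arc string for one sentence.

    tokens  -- lowercased words of the sentence (toy input)
    parents -- parents[i] is the index of token i's parent, or None for the root
    """
    arcs = set()
    for i, tok in enumerate(tokens):
        # backed-off arc: the token on its own (assumed to apply to every token)
        arcs.add(f"{tok}_*")
        if parents[i] is not None:
            # parent_child arc
            arcs.add(f"{tokens[parents[i]]}_{tok}")
    if use_start and tokens:
        # arcs describing how the sentence starts
        arcs.add(f"{tokens[0]}>*")
        if len(tokens) > 1:
            arcs.add(f"{tokens[0]}>{tokens[1]}")
    return " ".join(sorted(arcs))


# "does he agree": 'agree' is the root, 'does' and 'he' are its children
sentence_arcs = toy_arcs(["does", "he", "agree"], [2, 2, None])
print(sentence_arcs)
```

Each of the arcs this prints also appears in the arcs ConvoKit reports for the second sentence of the example utterance, which begins "Does he agree also ...".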

In [10]:
utt.retrieve_meta('arcs')

["'s_* a_* about_* about_yacht agree_* agree_does agree_hon agree_welcomed been_* does>* does>my does_* friend_* has_* hon_* hon_friend hon_my hon_right last_* my_* replacement_* right_* royal_* statement_* statement_about statement_week that_* week_'s week_* week_last welcomed_* welcomed_been welcomed_has welcomed_statement welcomed_that welcomed_widely widely_* yacht_* yacht_a yacht_replacement yacht_royal",
 'agree_* agree_also agree_become agree_does agree_he also_* become_* become_britannia become_centrepiece become_ideally become_should become_spanning become_that britannia_* centrepiece_* centrepiece_of centrepiece_the does>* does>he does_* gosport_* harbour_* harbour_portsmouth he_* ideally_* in_* in_harbour millennium_* of_* of_project portsmouth_* project_* project_in project_millennium project_the should_* spanning_* spanning_gosport that_* the_*',
 'am_* am_i am_sure i>* i_* idea_* idea_that popular_* popular_very prove_* prove_idea prove_popular prove_that prove_would sure

### Further preprocessing: cleaned-up arcs

At this point, while we've got the methodology to start making sense of the dependency tree, we arguably haven't progressed beyond producing fancy bigram representations of sentences. One problem is perhaps that the default arc extraction is a bit too permissive -- it gives us _all_ of the arcs. We might not want this for a few reasons:

* We only want to learn about question phrasings; we don't actually care about non-question sentences.
* The structure of a question might be best encapsulated by the arcs that go out of the _root_ of the tree; as you get further down we might end up with less structural and more content-specific representations.
* Likewise, the particular _nouns_ used (e.g., `yacht`) might not be good descriptions of the more abstract phrasing pattern.

All of these points are debatable, and the resultant modules I'll show below hopefully allow you to play around with them. Taking these points as given for now, though, we'll do the following.

In [8]:
from convokit.phrasing_motifs import CensorNouns, QuestionSentences
from convokit.convokitPipeline import ConvokitPipeline

We will actually create a pipeline to extract the arcs we want. This pipeline has the following components, in order:

* `CensorNouns`: a transformer that removes all the nouns and pronouns from a dependency parse. This transformer also collapses constructions like `What time [is it]` into `What [is it]`.
* `TextToArcs`: calling the arc extractor from above with an extra parameter, `root_only=True`, which will only extract arcs attached to the root (in addition to the first two tokens, though this is also tunable via the `use_start` parameter).
* `QuestionSentences`: a transformer that, given utterance fields consisting of a list of sentences, removes all the sentences which do not contain question marks, keeping only the questions. Here, we want to focus on questions that are asked as part of the procedure of question period, not questions that a minister who is playing the role of answerer raises in their response. Thus, we pass an extra parameter `input_filter=question_filter`, telling it to ignore utterances which aren't labeled as (official) questions in the corpus. 
    * (you may wonder how this transformer can tell whether a sentence has a question mark in it, given that the output of `TextToArcs` doesn't have any punctuation. Under the hood, `QuestionSentences` looks at the dependency parse of the sentence and checks whether the last token is a question mark.)
    * `QuestionSentences` also omits any sentences which don't begin in capital letters. To turn this off, pass parameter `use_caps=False`.
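The sentence-level filtering in the last step can be sketched in a few lines of plain Python. This is only an illustration of the idea (keep the arcs of sentences that are questions, optionally requiring a capitalized start): `keep_question_sentences` is a hypothetical helper, and it inspects raw sentence text, whereas ConvoKit reads the dependency parse.

```python
def keep_question_sentences(sentences, arcs_per_sentence, use_caps=True):
    """Return the arc strings of sentences that look like questions.

    sentences         -- raw text of each sentence (toy input)
    arcs_per_sentence -- arc string for each sentence, in the same order
    use_caps          -- if True, also require a capitalized first character
    """
    kept = []
    for text, arcs in zip(sentences, arcs_per_sentence):
        text = text.strip()
        is_question = text.endswith("?")
        starts_capitalized = text[:1].isupper()
        if is_question and (starts_capitalized or not use_caps):
            kept.append(arcs)
    return kept


toy_sentences = ["Does he agree ?", "I am sure .", "does he agree ?"]
toy_sentence_arcs = ["agree_* agree_does does>*", "am_* i>*", "agree_* agree_does does>*"]

strict = keep_question_sentences(toy_sentences, toy_sentence_arcs)
lenient = keep_question_sentences(toy_sentences, toy_sentence_arcs, use_caps=False)
print(strict)
print(lenient)
```

With `use_caps=True` only the first toy sentence survives; relaxing it also keeps the lowercase question.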

In [9]:
def question_filter(utt):
    return utt.retrieve_meta('is_question')

In [10]:
q_arc_pipe = ConvokitPipeline([
    ('censor_nouns', CensorNouns('parsed_censored', verbosity=VERBOSITY)),
    ('shallow_arcs', TextToArcs('arcs_censored', input_field='parsed_censored', 
                               root_only=True, verbosity=VERBOSITY)),
    ('question_sentence_filter', QuestionSentences('question_arcs', input_field='arcs_censored',
                                         input_filter=question_filter, verbosity=VERBOSITY))
])

The pipeline should accordingly annotate each utterance with arcs for questions only, in a field called `question_arcs`.
(demo continues after long output)

In [11]:
corpus = q_arc_pipe.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

(demo continues)

This pipeline results in a more minimalistic representation of utterances, in terms of just the arcs at the root of dependency trees, just the questions, and no nouns:

In [12]:
utt.retrieve_meta('question_arcs')

['agree_* agree_does agree_welcomed does>*',
 'agree_* agree_also agree_become agree_does does>*',
 'as>* as>to share_* share_does']
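As a rough illustration of why this output is so much shorter than the full arc list, the sketch below combines the two reductions on a toy parse: it skips noun and pronoun tokens and keeps only arcs leaving the root. Everything here is an assumption made for the example: `toy_root_arcs`, the coarse tag set in `NOUN_TAGS`, and the way start-of-sentence arcs are dropped when they involve a censored token. ConvoKit's `CensorNouns` and `TextToArcs` may handle these cases differently.

```python
NOUN_TAGS = {"NOUN", "PROPN", "PRON"}  # assumed coarse tags treated as "nouns" in this toy


def toy_root_arcs(tokens, tags, parents):
    """Root-only arcs for one sentence, ignoring noun/pronoun tokens.

    tokens  -- lowercased words
    tags    -- coarse part-of-speech tag per token
    parents -- parent index per token, None for the root
    """
    root = parents.index(None)
    arcs = {f"{tokens[root]}_*"}
    for i, tok in enumerate(tokens):
        # keep only children of the root that are not nouns/pronouns
        if parents[i] == root and tags[i] not in NOUN_TAGS:
            arcs.add(f"{tokens[root]}_{tok}")
    # start-of-sentence arcs, skipping censored tokens (an assumption of this sketch)
    if tags[0] not in NOUN_TAGS:
        arcs.add(f"{tokens[0]}>*")
        if len(tokens) > 1 and tags[1] not in NOUN_TAGS:
            arcs.add(f"{tokens[0]}>{tokens[1]}")
    return " ".join(sorted(arcs))


root_arcs = toy_root_arcs(["does", "he", "agree"], ["AUX", "PRON", "VERB"], [2, 2, None])
print(root_arcs)
```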

Here's another example:

In [13]:
test_utt_id_1 = '2015-06-09c.1041.5'
utt1 = corpus.get_utterance(test_utt_id_1)

In [14]:
utt1.text

'Given what the Foreign Secretary has said about the importance of the Iran discussions on the nuclear agreement , what is he doing to ensure greater clarity about the baselines , the extent of the inspection regime and the consequences of infringement ? Given that the agreement will allow advanced centrifuge , the infringements might arrive a little earlier than anticipated .'

In [15]:
utt1.retrieve_meta('question_arcs')

['doing_* doing_ensure doing_given doing_is doing_what given>* given>what']

## Phrasing motifs

Finally, to arrive at our representation of phrasings, we can go one level of abstraction further. In short, some of these arcs feel less fully-specified than others. While `agree_does` sounds like it hints at a coherent question, `doing_is` seems like it's not meaningful until you consider that it occurs in the same sentence as `doing_ensure` (i.e., "_what is the Government doing to ensure...?_").

Our intuition is to think of phrasings as frequently-cooccurring sets of multiple arcs. To extract these frequent arc-sets (which may remind you of the data mining idea of extracting frequent itemsets) we will use the `PhrasingMotifs` class.

In [16]:
from convokit.phrasing_motifs import PhrasingMotifs

In [17]:
pm_model = PhrasingMotifs('motifs','question_arcs',min_support=100,fit_filter=question_filter,
                          verbosity=VERBOSITY)

Here, `pm_model` will:

* extract all sets of arcs, as read from the `question_arcs` field, which occur at least 100 times in a corpus (`min_support=100`). These frequently-occurring arc sets will constitute the set, or "vocabulary", of phrasings.
* write the resultant output -- the phrasings that an utterance contains -- to a field called `motifs`. 

On the latter point, `pm_model` will only transform (i.e., label phrasings for) utterances which are questions, i.e., `question_filter(utterance) = True`. That is, in both the train and transform steps, we totally ignore non-questions.
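For readers unfamiliar with frequent itemsets, the following brute-force sketch shows what "sets of arcs that occur at least `min_support` times" means on a tiny toy input. It is not how `PhrasingMotifs` is implemented (the real model works in passes over a large corpus); `frequent_arc_sets` and its `max_size` cap are hypothetical names chosen for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_arc_sets(sentences, min_support=2, max_size=3):
    """Count every arc subset (up to max_size arcs) and keep the frequent ones.

    sentences -- list of space-separated arc strings, one per sentence
    """
    counts = Counter()
    for arcs in sentences:
        items = sorted(set(arcs.split()))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


toy_question_arcs = [
    "agree_* agree_does does>*",
    "agree_* agree_does does>*",
    "agree_* agree_will will>*",
]
toy_motifs = frequent_arc_sets(toy_question_arcs, min_support=2)
for motif, count in sorted(toy_motifs.items(), key=lambda kv: -kv[1]):
    print(motif, count)
```

Arc sets involving `agree_will` occur only once in the toy data, so they fall below the support threshold and are not part of the resulting vocabulary.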

Note that the phrasings learned by `pm_model` are therefore _corpus-specific_ -- different corpora may have different frequently-occurring sets, resulting in different vocabularies of phrasings. For instance, you wouldn't expect people in the British House of Commons to ask questions that sound like questions asked to tennis players. In this respect, think of `PhrasingMotifs` like models from scikit-learn (e.g., `LogisticRegression`) -- it is fit to a particular dataset:

(demo continues after long output)

In [18]:
pm_model.fit(corpus)

counting frequent itemsets for 325339 sets
	first pass: counting itemsets up to and including 5 items large
	first pass: 10000/325339 sets processed
	first pass: 20000/325339 sets processed
	first pass: 30000/325339 sets processed
	first pass: 40000/325339 sets processed
	first pass: 50000/325339 sets processed
	first pass: 60000/325339 sets processed
	first pass: 70000/325339 sets processed
	first pass: 80000/325339 sets processed
	first pass: 90000/325339 sets processed
	first pass: 100000/325339 sets processed
	first pass: 110000/325339 sets processed
	first pass: 120000/325339 sets processed
	first pass: 130000/325339 sets processed
	first pass: 140000/325339 sets processed
	first pass: 150000/325339 sets processed
	first pass: 160000/325339 sets processed
	first pass: 170000/325339 sets processed
	first pass: 180000/325339 sets processed
	first pass: 190000/325339 sets processed
	first pass: 200000/325339 sets processed
	first pass: 210000/325339 sets processed
	first pass: 220000

(demo continues)

Here are the most common phrasings and how often they occur in the data (in # of sentences). Note that `('*',)` denotes the null phrasing -- i.e., it encapsulates sentences with _any_ root word. 

In [19]:
pm_model.print_top_phrasings(25)

('*',) 325339
('will>*',) 67920
('does>*',) 59959
('is_*',) 57904
('is>*',) 45238
('is>*', 'is_*') 42850
('agree_*',) 36086
('agree_does',) 33686
('agree_*', 'agree_does') 33686
('agree_*', 'does>*') 30010
('agree_does', 'does>*') 29985
('agree_*', 'agree_does', 'does>*') 29985
('is_aware',) 22049
('is_*', 'is_aware') 22049
('is>*', 'is_aware') 20704
('is>*', 'is_*', 'is_aware') 20704
('what>*',) 20518
('is_not',) 15977
('is_*', 'is_not') 15977
('is>*', 'is_not') 13408
('is>*', 'is_*', 'is_not') 13408
('accept_*',) 11867
('agree_is',) 11059
('agree_*', 'agree_is') 11059
('agree_does', 'agree_is') 10613


Having "trained", or fitted our model, we can then use it to annotate each (question) utterance in the corpus with the phrasings this utterance contains, in a field called `motifs`.

(demo continues after long output)

In [19]:
corpus = pm_model.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

(demo continues)

One thing to note here is that each sentence can, and probably will, embody multiple phrasings. For instance, two sentences with phrasings `agree_does` and `agree_will` will both also have phrasing `agree_*`. Intuitively, more finely-specified phrasings (e.g., `agree_does`) more closely specify the phrasing embodied by a sentence (we could imagine "Does X agree..." and "Will X agree..." being very different, but perhaps also more similar to each other than to "Can you explain..."). 

We want to keep track of both the complete set of phrasings and the most finely-specified phrasing you can have for each utterance. Therefore, `PhrasingMotifs` actually annotates utterances with _two_ fields.

`motifs` lists all the phrasings (arcs in a phrasing motif are separated by two underscores, `'__'`):

In [20]:
utt.retrieve_meta('motifs')

['agree_* agree_*__does>* does>*',
 'agree_* agree_*__agree_also agree_*__does>* does>*',
 'as>* share_* share_*__share_does']

and `motifs__sink` lists the most finely specified _sink phrasings_ (they are "sinks" in the sense that if you think of phrasings as a directed graph where A-->B when B is a more finely-specified version of A, these sinks have no child phrasings which are contained in the utterance):

In [21]:
utt.retrieve_meta('motifs__sink')

['agree_*__does>*', 'agree_*__agree_also', 'as>* share_*__share_does']
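One way to build intuition for sinks is a subset-based approximation: treat each motif as a set of arcs, and call a motif a sink if no other motif in the same sentence strictly contains it. The sketch below is an assumption-laden toy (`sink_motifs` is a hypothetical helper); ConvoKit keeps its own graph of relations between motifs, so its sinks need not coincide with this approximation for every sentence.

```python
def sink_motifs(motifs):
    """Return motifs that are not strictly contained in another motif of the sentence.

    motifs -- list of motif strings, arcs within a motif separated by '__'
    """
    motif_sets = [frozenset(m.split("__")) for m in motifs]
    sinks = []
    for motif, items in zip(motifs, motif_sets):
        # 'items < other' is a proper-subset test on frozensets
        has_finer_version = any(items < other for other in motif_sets)
        if not has_finer_version:
            sinks.append(motif)
    return sinks


first_sentence = ["agree_*", "agree_*__does>*", "does>*"]
third_sentence = ["as>*", "share_*", "share_*__share_does"]
print(sink_motifs(first_sentence))
print(sink_motifs(third_sentence))
```

On the first and third sentences of the example utterance, this approximation happens to agree with the `motifs__sink` output shown above.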

### model persistence

We can save `pm_model` to disk and later reload it, thus caching the trained model (i.e., the motifs in a corpus and the internal representation of these motifs). Here, we save the model to a `pm_model` subfolder in the corpus directory via `dump_model()`:

In [22]:
import os

In [29]:
pm_model.dump_model(os.path.join(ROOT_DIR, 'pm_model'))

writing itemset counts
writing downlinks
writing itemset to ids
writing meta information


This subfolder then stores the motifs, as well as relations between the motifs that facilitate transforming new utterances.

In [30]:
pm_model_dir = os.path.join(ROOT_DIR, 'pm_model')
!ls $pm_model_dir

downlinks.json	itemset_counts.json  itemset_to_ids.json  meta.json


Suppose we later initialize a new `PhrasingMotifs` model, `new_pm_model`.

In [31]:
new_pm_model = PhrasingMotifs('motifs_new','question_arcs',min_support=100,fit_filter=question_filter,
                          verbosity=VERBOSITY)

Calling `load_model()` then reloads the stored model from our earlier run into this new model:

In [32]:
new_pm_model.load_model(os.path.join(ROOT_DIR, 'pm_model'))

reading itemset counts
reading downlinks
reading itemset to ids
reading meta information


Just to check that we've loaded the same thing that we previously saved, we'll get the motifs in our test utterance using `new_pm_model`:

In [33]:
utt = new_pm_model.transform_utterance(utt)

This is the output from the original run:

In [34]:
utt.retrieve_meta('motifs__sink')

['agree_*__does>*', 'agree_*__agree_also', 'as>* share_*__share_does']

And we see the new output matches.

In [35]:
utt.retrieve_meta('motifs_new__sink')

['agree_*__does>*', 'agree_*__agree_also', 'as>* share_*__share_does']

### example variation: not removing the nouns

**note** this takes a while to run, and is somewhat of an extension -- you can safely skip these cells.

There are other ways to use `PhrasingMotifs` that might be more or less suited to your own application. For instance, you may wonder what happens if we do not remove the nouns (as we did with `CensorNouns` above). To try this out, we can create an alternate pipeline that uses `TextToArcs` to generate root arcs (setting argument `root_only=True`) on the original parses, not the noun-censored ones.

In [36]:
q_arc_pipe_full = ConvokitPipeline([
    ('shallow_arcs_full', TextToArcs('root_arcs', input_field='parsed', 
                               root_only=True, verbosity=VERBOSITY)),
    ('question_sentence_filter', QuestionSentences('question_arcs_full', input_field='root_arcs',
                                         input_filter=question_filter, verbosity=VERBOSITY)),

])
corpus = q_arc_pipe_full.transform(corpus)


10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

We can then train a new `PhrasingMotifs` model that finds phrasings with the nouns still included.

(demo continues after long output)

In [37]:
noun_pm_model = PhrasingMotifs('motifs_full','question_arcs_full',min_support=100,
                               fit_filter=question_filter, 
                          verbosity=VERBOSITY)
noun_pm_model.fit(corpus)

counting frequent itemsets for 325339 sets
	first pass: counting itemsets up to and including 5 items large
	first pass: 10000/325339 sets processed
	first pass: 20000/325339 sets processed
	first pass: 30000/325339 sets processed
	first pass: 40000/325339 sets processed
	first pass: 50000/325339 sets processed
	first pass: 60000/325339 sets processed
	first pass: 70000/325339 sets processed
	first pass: 80000/325339 sets processed
	first pass: 90000/325339 sets processed
	first pass: 100000/325339 sets processed
	first pass: 110000/325339 sets processed
	first pass: 120000/325339 sets processed
	first pass: 130000/325339 sets processed
	first pass: 140000/325339 sets processed
	first pass: 150000/325339 sets processed
	first pass: 160000/325339 sets processed
	first pass: 170000/325339 sets processed
	first pass: 180000/325339 sets processed
	first pass: 190000/325339 sets processed
	first pass: 200000/325339 sets processed
	first pass: 210000/325339 sets processed
	first pass: 220000

(demo continues)

The most common phrasings, of course, won't be very topic-specific (unless people talk about yachts very very frequently in parliament). However, we do see that phrasings now reflect the pronoun used (which may be troublesome if we believe that "Does _he_ agree" and "Does _she_ agree" are getting at similar things).

In [38]:
noun_pm_model.print_top_phrasings(25)

('*',) 325339
('will>*',) 70226
('does>*',) 61032
('is_*',) 57964
('is>*',) 45268
('is>*', 'is_*') 42850
('agree_*',) 36109
('agree_does',) 33705
('agree_*', 'agree_does') 33705
('agree_*', 'does>*') 30013
('agree_does', 'does>*') 29988
('agree_*', 'agree_does', 'does>*') 29988
('will>the',) 26218
('will>*', 'will>the') 26218
('will>he',) 23049
('will>*', 'will>he') 23049
('is_aware',) 22063
('is_*', 'is_aware') 22063
('does>the',) 20932
('does>*', 'does>the') 20932
('what>*',) 20791
('is>*', 'is_aware') 20707
('is>*', 'is_*', 'is_aware') 20707
('does>he',) 16417
('does>*', 'does>he') 16417


Here are the sink phrasings for our example utterance from earlier, comparing against the noun-less run:

In [39]:
utt = noun_pm_model.transform_utterance(utt)

In [40]:
utt.retrieve_meta('motifs__sink')

['agree_*__does>*', 'agree_*__agree_also', 'as>* share_*__share_does']

In [41]:
utt.retrieve_meta('motifs_full__sink')

['agree_*__agree_hon',
 'agree_*__agree_also__agree_he',
 'as>* share_*__share_hon']

We see that we get this extra "hon" -- which actually stands for "honourable [member]" -- an artefact of parliamentary etiquette. 

For our particular dataset, removing nouns has the benefit of removing most of these etiquette-related words. However, you may also imagine cases where nouns actually carry a lot of useful information about rhetorical intent (including in this domain -- one could argue that asking about a person versus asking about a department is a strong signal of trying to get at different things, for instance). As such, noun-removal is something that you may want to play around with, and/or try to improve upon. 

## PromptTypes

As we intuited above, "do you agree" and "do you share my opinion" are both getting at similar intentions. However, extracting these phrasings alone won't allow us to make this association. Rather, our strategy will be to produce vector representations of them which encode this similarity. Clustering these representations then gives us different "types of question".

Our key intuition here is that questions with similar intentions will tend to be answered in similar ways. Thus, "do you agree" and "do you share" may both often be answered with "yes, I agree"; if tomorrow I asked a new question of this ilk ("do you agree that we should invest in planes, instead of yachts"), I might be expecting a similar sort of answer. 
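A toy version of this intuition: describe each question phrasing by the response arcs that follow it, and compare phrasings through those response profiles. The sketch below uses raw counts and cosine similarity on made-up question-answer pairs; it is a simplification for illustration only (the fitting log further down indicates that the actual model uses tf-idf weighting and an SVD), and `response_profiles` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

# (question phrasing, arcs of the response it received) -- invented toy data
toy_pairs = [
    ("agree_does", "agree_* agree_i"),
    ("share_does", "agree_* agree_i"),
    ("agree_does", "agree_* right_is"),
    ("made_what", "made_* made_have"),
]


def response_profiles(pairs):
    """Map each question phrasing to a count vector over response arcs."""
    profiles = defaultdict(Counter)
    for motif, response_arcs in pairs:
        profiles[motif].update(response_arcs.split())
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


profiles = response_profiles(toy_pairs)
sim_agree_share = cosine(profiles["agree_does"], profiles["share_does"])
sim_agree_made = cosine(profiles["agree_does"], profiles["made_what"])
print(round(sim_agree_share, 3), sim_agree_made)
```

In this toy data, `agree_does` and `share_does` look alike because they draw similar responses, even though the two phrasings share no arcs themselves.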

For a full explanation of this idea, and how we operationalized it, you can read our paper. In ConvoKit, we implement this methodology of producing vector representations and clusterings via the `PromptTypes` transformer:

In [23]:
from convokit.prompt_types import PromptTypes

`PromptTypes` will train a model -- a low-dimensional embedding, along with a k-means clustering -- by using question-answer pairs as input. 
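Once such an embedding and a set of cluster centres exist, assigning a type amounts to finding the nearest centre. The following is a minimal sketch of that final step with invented two-dimensional vectors; `nearest_type` is a hypothetical helper, and the use of plain Euclidean distance is an assumption of this example rather than a statement about ConvoKit's internals.

```python
import math


def nearest_type(vector, centroids):
    """Return (index of closest centroid, list of distances to all centroids)."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances)), distances


toy_centroids = [[1.0, 0.0], [0.0, 1.0]]  # two invented "types"
toy_vector = [0.9, 0.1]                   # invented embedding of one question

type_id, dists = nearest_type(toy_vector, toy_centroids)
print(type_id, [round(d, 3) for d in dists])
```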

In [24]:
def question_filter(utt):
    return utt.retrieve_meta('is_question')
def response_filter(utt):
    return (not utt.retrieve_meta('is_question')) and (utt.reply_to is not None)

We initialize `pt` with the following arguments:

* `n_types=8`: we want to infer 8 types of questions.
* `prompt_field='motifs'`: we want to encode questions in terms of the phrasing motifs we extracted above. Thus, `pt` will produce representations of these motifs (rather than, e.g., the raw tokens in a question).
* `reference_field='arcs_censored'`: we will encode responses in terms of the noun-less arcs we extracted above (in practice, this appears to work better than using phrasings of responses as well, perhaps because responses are noisier)
* `prompt_transform_field='motifs__sink'`: while we want to come up with a representation of _all_ phrasing motifs, when we produce a vector representation of a _particular_ utterance we want to use the most finely-specified phrasing.

There are some other arguments you can set, which are listed in the docstring. 



In [25]:
pt = PromptTypes(n_types=8, prompt_field='motifs', reference_field='arcs_censored', 
                 prompt_transform_field='motifs__sink',
                 output_field='prompt_types',
    random_state=1000, verbosity=1)

We can fit `pt` to the corpus -- that is, learn the associations between question phrasings and response dependency arcs that allow us to produce our vector representations, as well as a clustering of these representations that gives us our different question types. To focus on questions, we will use the following filters to select a subset of the corpus as training data:

* `prompt_selector=question_filter` and `reference_selector=response_filter`: To tell the transformer what counts as a question and an answer, we will pass `fit` the above filters (i.e., boolean functions). Note that in a less questions-heavy dataset, we could omit these filters and hence infer types of "prompts" beyond questions.
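To picture what these two filters select, the sketch below pairs toy utterances (plain dictionaries, not ConvoKit objects) into prompt-response pairs: a response is kept if it passes the reference filter and replies to an utterance that passes the prompt filter. `pair_prompts_with_responses` and the dictionary keys are hypothetical and only meant to illustrate the selection logic.

```python
def pair_prompts_with_responses(utterances, prompt_selector, reference_selector):
    """Return (prompt id, response id) pairs selected by the two filters."""
    by_id = {u["id"]: u for u in utterances}
    pairs = []
    for u in utterances:
        if not reference_selector(u):
            continue
        parent = by_id.get(u["reply_to"])
        if parent is not None and prompt_selector(parent):
            pairs.append((parent["id"], u["id"]))
    return pairs


def toy_question_filter(u):
    return u["is_question"]


def toy_response_filter(u):
    return (not u["is_question"]) and (u["reply_to"] is not None)


toy_utterances = [
    {"id": "q1", "reply_to": None, "is_question": True},
    {"id": "a1", "reply_to": "q1", "is_question": False},
    {"id": "a2", "reply_to": "a1", "is_question": False},  # replies to a non-question
]
toy_training_pairs = pair_prompts_with_responses(
    toy_utterances, toy_question_filter, toy_response_filter
)
print(toy_training_pairs)
```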

In [26]:
pt.fit(corpus, prompt_selector=question_filter, reference_selector=response_filter)

fitting 195441 input pairs
fitting reference tfidf model
fitting prompt tfidf model
fitting svd model
fitting 8 prompt types


Calling `summarize()` as below will print the question phrasings, response arcs, and prototypical questions and responses that are associated with each inferred type of question. We will examine some of these types more closely by way of examples, below.

(long output follows)

In [29]:
pt.summarize(corpus=corpus, k=15)

TYPE 0
top prompt:
                                     0         1         2         3  \
made_*                        0.642670  1.260725  1.112202  1.119975   
made_*__made_in               0.686119  1.166733  1.092458  1.101651   
in>*__tell_*                  0.686683  1.330053  1.175609  1.265570   
made_*__made_to               0.697633  1.386197  1.226340  1.180139   
made_*__made_what             0.698968  1.247663  1.124691  0.959685   
happen_*__happen_will         0.701071  1.231780  1.202306  1.119589   
made_*__made_been             0.709813  1.263380  1.122509  1.178115   
made_*__what>*                0.716440  1.247333  1.148910  0.997612   
give_*__give_on               0.720038  1.212984  1.041744  1.057502   
include_*                     0.720358  1.198579  1.072563  1.225554   
made_*__made_been__made_what  0.722511  1.225216  1.105467  1.051967   
made_*__made_has              0.725824  1.304950  1.122941  1.108813   
give_*                        0.728294  1.168


2012-09-13a.405.5 I thank the right hon and learned Lady for her kind words and look forward to continuing to work with her on these issues and those of women and equality . The right hon and learned Lady is absolutely right that there are issues within Leveson that have clear read - across to the report that was released yesterday . However , at this time I want to ensure that we continue to focus first and foremost on the importance of getting it right for the families involved . We will examine the report in great detail to ensure that any necessary actions are taken so that we do not have the same scandalous situation again .
['learned_* learned_for look_* look_forward look_to thank_*', '', 'however>* want_* want_at want_ensure want_however', 'examine_* examine_ensure examine_will']

2006-12-04b.7.1 I am grateful to the hon Gentleman for raising that issue . I am not aware of the letter to the Home Secretary , but I will look into that , and if it is pertinent for me to visit West

(demo continues)

When this trained model is used to transform a corpus, it will output several representations or features associated with each utterance. 

Detail: The fields corresponding to representations and features computed with the model are each titled `prompt_types__<feature name>`. The prefix, `prompt_types`, can be modified by changing the `output_field` argument of the constructor.
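As a rough illustration of this naming scheme, the sketch below composes the field names that appear in this notebook. The helper `field_name` is hypothetical (it is not part of ConvoKit), and reading the `.8` suffix as a general `.<number of types>` suffix is an assumption based on the names seen here.

```python
def field_name(output_field, feature, n_types=None):
    """Hypothetical helper: build '<output_field>__<feature>[.<n_types>]'."""
    name = f"{output_field}__{feature}"
    # Type-specific features in this notebook carry a '.<n_types>' suffix
    # (assumed to track the number of types the model was fitted with).
    if n_types is not None:
        name += f".{n_types}"
    return name

repr_field = field_name("prompt_types", "prompt_repr")
dists_field = field_name("prompt_types", "prompt_dists", n_types=8)
type_field = field_name("prompt_types", "prompt_type", n_types=8)
print(repr_field, dists_field, type_field)
```

Changing `output_field` would only change the prefix, so the same features could live side by side under different prefixes.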

In [29]:
utt = pt.transform_utterance(utt)

First, a vector representation encapsulating the utterance's rhetorical intent (in short, an embedding of the utterance based on the responses associated with questions containing its constituent phrasings):

In [30]:
utt.retrieve_meta('prompt_types__prompt_repr')

[-0.17103395551495287,
 0.030694092789899603,
 -0.14371185586935595,
 0.10998245525877463,
 -0.31508472326375,
 -0.03187113204172867,
 -0.22291774431496747,
 -0.1278562931647348,
 0.17717804384550123,
 0.02097518862685271,
 -0.3543799065246014,
 -0.23905016478526944,
 -0.0635970446676691,
 -0.19447723846509896,
 -0.05206238289580816,
 -0.033106993095678466,
 -0.4151244327411294,
 -0.060491493289427684,
 -0.11375878457482796,
 -0.017597837784700098,
 -0.046578984088077695,
 -0.5431360277316315,
 0.12980649779704173,
 -0.08504893017823376]

Second, the distance between the vector of that utterance and the centroid of each cluster, or type of question.

In [31]:
utt.retrieve_meta('prompt_types__prompt_dists.8')

[1.130855626510634,
 0.39130608715180415,
 0.9490040025393338,
 1.1140869968500255,
 0.7542719064025534,
 1.1279773340447152,
 0.8453197995402353,
 1.1400944717972439]

Third, the particular type of question this utterance is, i.e., the centroid that its vector representation is closest to, as well as its distance to that centroid (roughly, how well it fits that question type):

In [36]:
utt.retrieve_meta('prompt_types__prompt_type_dist.8')

0.3913060871518036
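The type assignment described above amounts to taking the smallest entry of the distance vector. A minimal sketch in plain Python, using the distances printed earlier (the helper `closest_type` is hypothetical and only illustrates the idea, not ConvoKit's internals):

```python
# Distances from the example utterance to the 8 type centroids,
# copied from the `prompt_types__prompt_dists.8` output above.
dists = [1.130855626510634, 0.39130608715180415, 0.9490040025393338,
         1.1140869968500255, 0.7542719064025534, 1.1279773340447152,
         0.8453197995402353, 1.1400944717972439]

def closest_type(dists):
    """Hypothetical helper: return (index of nearest centroid, distance to it)."""
    type_id = min(range(len(dists)), key=dists.__getitem__)
    return type_id, dists[type_id]

type_id, type_dist = closest_type(dists)
print(type_id, type_dist)
```

The smallest distance sits at index 1, which agrees with the `TYPE 1.0` summary for this utterance.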

Here, we see that our running example is of a question type exemplified by phrasings like `does [the Minister] agree...` -- we may characterize the entire cluster as encapsulating "agreeing" questions which are perhaps asked helpfully to bolster the answerer's reputation.

(long output)

In [31]:
pt.summarize(corpus,type_ids=utt.retrieve_meta('prompt_types__prompt_type.8'), k=15)

TYPE 1.0
top prompt:
                                                0         1         2  \
agree_*__agree_is                        1.178031  0.392994  1.060446   
agree_*__agree_be__does>*                1.135692  0.400345  1.017279   
agree_*__agree_is__does>*                1.175818  0.400844  1.064866   
agree_*__agree_be                        1.130732  0.401616  1.019973   
agree_*__agree_have                      1.150848  0.443389  1.058488   
agree_*__agree_are                       1.190365  0.449033  1.090853   
agree_*__agree_does__agree_have__does>*  1.135804  0.457290  1.089983   
agree_*__agree_are__agree_does__does>*   1.191144  0.460577  1.096827   
agree_*__agree_also                      1.176007  0.469945  1.133404   
continue_*__will>*                       1.144928  0.474175  1.038749   
agree_*__as>*                            1.124752  0.501247  0.949284   
agree_*__agree_does__as>*                1.147453  0.507481  0.971820   
agree_*__agree_welcome        

(demo continues)

We can transform the other utterances in the corpus as such:

In [27]:
corpus = pt.transform(corpus)

Having transformed the corpus, we can first see what the prompt types of other utterances are.

This utterance is of a type that's perhaps more information-seeking, querying for an update ("what steps is the Government taking", "what are they doing to ensure", etc.):

In [33]:
utt1.text

'Given what the Foreign Secretary has said about the importance of the Iran discussions on the nuclear agreement , what is he doing to ensure greater clarity about the baselines , the extent of the inspection regime and the consequences of infringement ? Given that the agreement will allow advanced centrifuge , the infringements might arrive a little earlier than anticipated .'

In [34]:
utt1.retrieve_meta('motifs__sink')

['doing_*__doing_ensure__doing_is doing_*__doing_is__doing_what given>*']

In [35]:
utt1.retrieve_meta('prompt_types__prompt_type.8')

3.0

(long output)

In [34]:
pt.summarize(corpus,type_ids=utt1.retrieve_meta('prompt_types__prompt_type.8'), k=15)

TYPE 3.0
top prompt:
                                     0         1         2         3  \
doing_*__what>*               1.184430  1.160051  1.178999  0.488050   
doing_*                       1.191700  1.142959  1.177319  0.501816   
taking_*__taking_is__what>*   1.123601  1.159411  1.190865  0.510735   
doing_*__doing_is__what>*     1.191761  1.212333  1.201727  0.529937   
take_*__take_what             1.146526  1.089185  1.011824  0.533417   
will>*__work_*__work_with     1.050524  1.017746  0.970711  0.534472   
taking_*__taking_are          1.084673  1.191355  1.218779  0.537272   
taking_*                      1.130205  1.188690  1.230882  0.538174   
do_*__do_can__do_what         1.180091  1.113888  1.030961  0.540809   
do_*__do_what                 1.165401  1.098882  1.067617  0.541254   
doing_*__doing_is             1.200928  1.207163  1.219990  0.542058   
taking_*__what>*              1.081665  1.195913  1.223669  0.544559   
doing_*__doing_are__what>*    1.168104  1.0

(demo continues)

This utterance, on the other hand, is a lot more aggressive -- perhaps _accusatory_, with the aim of putting the answerer on the spot ("will the secretary admit that the policy is a failure?"):

In [29]:
utt2 = corpus.get_utterance('1987-03-04a.857.5')

In [64]:
utt2.text

'Will the Secretary of State stop giving us what is called in the pop record industry a remix of alibis , excuses and gimmicks ? Will he admit that the number of homes built to rent last year by local authorities was the lowest in 62 years , that the housing investment programme net of capital receipts was the lowest in real terms since HIPs were invented and that , even during the past three years the number of repair and improvement grants , which would bring some private homes back into use , have dropped by 100,000 ? Does not the right hon Gentleman understand that , if the private owner and the local authority are starved of resources , we are left with lengthy queues , homelessness and all the other scandals of poor housing that exist today ?'

In [37]:
utt2.retrieve_meta('motifs__sink')

['stop_*__stop_will__will>*',
 'admit_*__admit_will__will>*',
 'does>*__does>not does>*__understand_*']

In [38]:
utt2.retrieve_meta('prompt_types__prompt_type.8')

7.0

(long output)

In [39]:
pt.summarize(corpus,type_ids=utt2.retrieve_meta('prompt_types__prompt_type.8'), k=15)

TYPE 7.0
top prompt:
                                                     0         1         2  \
why>*                                         1.104096  1.277495  1.306421   
explain_*                                     1.074950  1.264843  1.250738   
admit_*                                       1.193359  1.260325  1.288746   
admit_*__will>*                               1.222762  1.234878  1.309631   
justify_*                                     1.194117  1.271824  1.319768   
explain_*__explain_will                       1.080877  1.182869  1.291523   
is>*__is_*__is_true                           1.162309  1.139200  1.311678   
is_*__why>*                                   1.175892  1.168984  1.278892   
does>*__realise_*__realise_does__realise_not  1.152017  1.244742  1.338369   
admit_*__admit_will__will>*                   1.231131  1.266068  1.307017   
explain_*__will>*                             1.092443  1.146115  1.287732   
is_*__is_true                              

(demo continues)

The transform step also annotates _responses_ to questions in question period with the prompt type that the utterance is the most appropriate response to, according to our model.

For instance, consider the following response (which actually responds to the utterance in the first example):

In [48]:
response_utt = corpus.get_utterance('1997-01-27a.4.1')

In [49]:
corpus.get_utterance(response_utt.reply_to).text

"Does my right hon Friend agree that last week 's statement about a replacement royal yacht has been widely welcomed ? Does he agree also that , ideally , Britannia should become the centrepiece of the millennium project in Portsmouth harbour , spanning Gosport and Portsmouth ? I am sure that that idea would prove very popular . As to plans for a new yacht , does my right hon Friend share my distaste for the Opposition 's tactics ? They had every opportunity to express their grudging and negative attitude during the past two years when the project was under discussion ."

In [50]:
response_utt.text

"I am grateful to my hon Friend for reminding us that the royal naval ship to replace Britannia should be commissioned in 2002 , which is the golden jubilee of Her Majesty the Queen . I hope that the new ship will play an important role in those celebrations . As to the Opposition 's attitude , we have witnessed their small - mindedness and their misunderstanding not only of the role of Her Majesty but of the promotion of the best interests of the United Kingdom economy abroad ."

The response has the following type:

In [52]:
response_utt.retrieve_meta('prompt_types__reference_type.8')

1.0

corresponding to questions which, like the first example, are relatively friendly and agreeable. In other words, out of all the prompt types, the response would be the most appropriate reply to these agreeable questions.

(Detail: the term `reference_type` refers to the fact that the set of question responses are used as "references" to structure the space of questions, per the methodology detailed in the paper. We keep the terminology deliberately generic, as opposed to calling it a "response type", to suggest that other data could serve as references -- for instance, reversing the role of questions and answers in the method. This possibility is something we'll explore in future work.)

### vector representations

As mentioned above, `PromptTypes` produces a few vector representations of utterances. For efficiency, rather than storing these representations attached to the utterance (as values in `utterance.meta`), we store them in corpus-wide matrices. (See [this demo](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/vectors/vector_demo.ipynb) for more details on using vectors in ConvoKit.)

To view (a subset of) these matrices, we'll call `corpus.get_vectors(matrix name, [utterance ids])`, which allows us to access the vectors corresponding to all utterances in the list of `[utterance ids]`.
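To make the idea of a corpus-wide matrix concrete, here is a small sketch of a matrix stored once alongside an ID-to-row index, with rows fetched by utterance ID. The `VectorStore` class is a hypothetical stand-in, not ConvoKit's implementation; the IDs are the three example utterances from this notebook, and the rows are truncated to the first two distance columns for brevity.

```python
class VectorStore:
    """Hypothetical stand-in for a corpus-wide matrix keyed by utterance ID."""

    def __init__(self, ids, rows):
        self.rows = rows
        # Map each utterance ID to its row position in the matrix.
        self.id_to_row = {utt_id: i for i, utt_id in enumerate(ids)}

    def get(self, ids):
        # Return the rows for the requested IDs, in the requested order.
        return [self.rows[self.id_to_row[utt_id]] for utt_id in ids]

store = VectorStore(
    ids=["1997-01-27a.4.0", "2015-06-09c.1041.5", "1987-03-04a.857.5"],
    rows=[[1.130856, 0.391306], [1.176341, 1.13651], [1.150327, 1.173718]],
)
subset = store.get(["1987-03-04a.857.5", "1997-01-27a.4.0"])
print(subset)
```

Keeping one matrix plus an index avoids attaching a separate list of floats to every utterance, which is the efficiency argument made above.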

Each row of the matrix `prompt_types__prompt_repr` corresponds to the vector representation of the utterance's latent intent. The first row should be exactly the latent representation we printed above, for the first example question:
(here I've returned the matrix as a dataframe to highlight that each row corresponds to an utterance indicated by the ID in the index)

In [65]:
corpus.get_vectors('prompt_types__prompt_repr', ids=[utt.id, utt1.id, utt2.id], 
                   as_dataframe=True).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
1997-01-27a.4.0,-0.171034,0.030694,-0.143712,0.109982,-0.315085,-0.031871,-0.222918,-0.127856,0.177178,0.020975,...,-0.052062,-0.033107,-0.415124,-0.060491,-0.113759,-0.017598,-0.046579,-0.543136,0.129806,-0.085049
2015-06-09c.1041.5,0.77873,-0.048167,-0.072003,0.030372,-0.138309,-0.028404,-0.088819,0.130886,-0.013354,0.182926,...,-0.061019,0.031302,-0.245416,-0.149492,-0.019458,-0.132673,-0.004026,0.053251,-0.213131,0.234169
1987-03-04a.857.5,-0.364659,-0.074532,0.142105,0.332828,-0.238983,-0.165824,-0.306348,0.225581,0.082435,-0.116489,...,0.061312,0.11207,0.225955,0.002027,-0.014692,0.41398,0.114056,0.228563,-0.143224,0.058481


Each row of the matrix `prompt_types__prompt_dists.8` lists the distances between each question (as its vector representation) and each type centroid. The first row should be exactly the distances corresponding to the first example question, as we printed above.

In [30]:
corpus.get_vectors('prompt_types__prompt_dists.8', ids=[utt.id, utt1.id, utt2.id], 
                   as_dataframe=True)

Unnamed: 0,type_0_dist,type_1_dist,type_2_dist,type_3_dist,type_4_dist,type_5_dist,type_6_dist,type_7_dist
1997-01-27a.4.0,1.130856,0.391306,0.949004,1.114087,0.754272,1.127977,0.84532,1.140094
2015-06-09c.1041.5,1.176341,1.13651,1.088411,0.490026,1.334271,1.098959,1.241002,1.271116
1987-03-04a.857.5,1.150327,1.173718,1.263869,1.368205,0.802161,0.858921,1.159636,0.543815


Note that these vectors could be used as features in a prediction task: [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/conversations-gone-awry/Conversations_Gone_Awry_Prediction.ipynb) has an example for predicting derailed conversations, using the distance between utterances and type centroids (`prompt_dists.#`); using the latent representations (`prompt_repr`) is another good option.

To save all of these representations to disk, we can call the following:



In [31]:
corpus.dump_vectors('prompt_types__prompt_repr')


In [32]:
corpus.dump_vectors('prompt_types__prompt_dists.8')


These vector representations can later be re-loaded:

In [35]:
new_corpus = convokit.Corpus(ROOT_DIR, preload_vectors=['prompt_types__prompt_repr',
                                                       'prompt_types__prompt_dists.8'])


In [77]:
new_corpus.get_vectors('prompt_types__prompt_repr', ids=[utt.id, utt1.id, utt2.id],
                      as_dataframe=True)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
1997-01-27a.4.0,-0.171034,0.030694,-0.143712,0.109982,-0.315085,-0.031871,-0.222918,-0.127856,0.177178,0.020975,...,-0.052062,-0.033107,-0.415124,-0.060491,-0.113759,-0.017598,-0.046579,-0.543136,0.129806,-0.085049
2015-06-09c.1041.5,0.77873,-0.048167,-0.072003,0.030372,-0.138309,-0.028404,-0.088819,0.130886,-0.013354,0.182926,...,-0.061019,0.031302,-0.245416,-0.149492,-0.019458,-0.132673,-0.004026,0.053251,-0.213131,0.234169
1987-03-04a.857.5,-0.364659,-0.074532,0.142105,0.332828,-0.238983,-0.165824,-0.306348,0.225581,0.082435,-0.116489,...,0.061312,0.11207,0.225955,0.002027,-0.014692,0.41398,0.114056,0.228563,-0.143224,0.058481


In [36]:
new_corpus.get_vectors('prompt_types__prompt_dists.8', ids=[utt.id, utt1.id, utt2.id],
                      as_dataframe=True)

Unnamed: 0,type_0_dist,type_1_dist,type_2_dist,type_3_dist,type_4_dist,type_5_dist,type_6_dist,type_7_dist
1997-01-27a.4.0,1.130856,0.391306,0.949004,1.114087,0.754272,1.127977,0.84532,1.140094
2015-06-09c.1041.5,1.176341,1.13651,1.088411,0.490026,1.334271,1.098959,1.241002,1.271116
1987-03-04a.857.5,1.150327,1.173718,1.263869,1.368205,0.802161,0.858921,1.159636,0.543815


### a few caveats and potential modifications

One thorn in our side is that the model occasionally gets caught up on very generic motifs, e.g., `'is>*'`, and as such will fit many questions to the type containing `'is>*'` instead of going with a better signal; various optional parameters detailed in the documentation may provide partial solutions to this. Another caveat is that while this model allows us to group lexically divergent phrasings (e.g., "will the Minister admit" and "does the Minister not realise" both serve to accuse the Minister), we are ultimately relying on our domain having a sufficient amount of lexical regularity (e.g., the institutional norms of how people talk in parliament) -- we might need to be cleverer when dealing with noisier settings where this regularity isn't guaranteed (like social media data).
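One generic way to blunt the effect of overly common motifs is to drop features by document frequency before fitting. The sketch below is a plain-Python illustration of that idea on made-up toy data; it is not ConvoKit's implementation. The lower bound mirrors the `prompt__tfidf_min_df` argument used in this notebook, while the upper bound (`max_df_frac`) is an assumption introduced here for illustration.

```python
from collections import Counter

def filter_by_doc_freq(docs, min_df=1, max_df_frac=1.0):
    """Keep features seen in >= min_df docs and in <= max_df_frac of all docs."""
    # Count each feature once per document.
    df = Counter(feat for doc in docs for feat in set(doc))
    n_docs = len(docs)
    keep = {f for f, c in df.items() if c >= min_df and c / n_docs <= max_df_frac}
    return [[f for f in doc if f in keep] for doc in docs]

# Toy "questions", each a list of motifs (invented for this example).
docs = [
    ["is>*", "admit_*__will>*"],
    ["is>*", "agree_*__does>*"],
    ["is>*", "agree_*__does>*"],
    ["is>*", "doing_*__what>*"],
]
filtered = filter_by_doc_freq(docs, min_df=1, max_df_frac=0.75)
print(filtered)
```

Here `'is>*'` occurs in every toy question, so it exceeds the 75% cap and is removed, leaving the more specific motifs to drive the representation.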

Finally, as a data-specific note, one of the types may be a result of the parser assuming that "Will the learned Gentleman please answer my question?" has "learned" as the root verb -- an artefact of parliamentary discourse we haven't handled. You may wish to play around with this by modifying how the data is preprocessed.

### model persistence

We can save our trained model `pt` to disk for later use:

In [40]:
import os

In [48]:
pt.dump_model(os.path.join(ROOT_DIR, 'pt_model'))

dumping embedding model
dumping training embeddings
dumping type model 8


In broad strokes, what's written to disk is:

* TF-IDF models that store the distribution of phrasings and arcs in the training data;
* SVD models that allow us to map raw phrasing/arc counts to vector representations;
* a KMeans model to cluster vector representations.

In [49]:
pt_model_dir = os.path.join(ROOT_DIR, 'pt_model')
!ls $pt_model_dir

km_model.8.joblib	   svd_model.joblib	   train_ref_ids.npy
prompt_df.8.tsv		   train_prompt_df.8.tsv   train_ref_vects.npy
prompt_tfidf_model.joblib  train_prompt_ids.npy    U_prompt.npy
ref_df.8.tsv		   train_prompt_vects.npy  U_ref.npy
ref_tfidf_model.joblib	   train_ref_df.8.tsv


Initializing a new `PromptTypes` model and loading our saved model then allows us to use it again:

In [50]:
new_pt = PromptTypes(prompt_field='motifs', reference_field='arcs_censored', 
                 prompt_transform_field='motifs__sink',
                 output_field='prompt_types_new', prompt__tfidf_min_df=100,
                 reference__tfidf_min_df=100, 
    random_state=1000, verbosity=1)

In [51]:
new_pt.load_model(pt_model_dir)

loading embedding model
loading training embeddings
loading type model 8


In [52]:
utt = new_pt.transform_utterance(utt)

In [53]:
utt.retrieve_meta('prompt_types_new__prompt_type.8')

1.0

## examples of potential variations

### trying other numbers of prompt types:

Calling `refit_types(n)` will retrain the clustering component of the `PromptTypes` model to infer a different number of types. Suppose we only wanted 4 types of questions:

In [41]:
pt.refit_types(4)

fitting 4 prompt types


(long output)

In [42]:
pt.summarize(corpus, type_key=4, k=15)

TYPE 0
top prompt:
                                   0         1         2         3  type_id
give_*__will>*              0.655711  1.044415  1.071511  1.029346      0.0
give_*__give_will           0.658192  1.014454  1.078425  1.061438      0.0
give_*                      0.673571  0.990091  1.022852  1.040259      0.0
make_*                      0.684921  1.048797  1.156398  0.804960      0.0
in>*                        0.692132  0.950216  1.197650  0.760962      0.0
ask_*                       0.696715  1.045955  1.109069  0.983805      0.0
raise_*                     0.705024  1.153468  1.039631  0.994258      0.0
press_*                     0.705496  1.134460  1.047380  0.893885      0.0
have_*                      0.706000  0.876789  1.027710  0.973250      0.0
give_*__give_to__give_will  0.707316  1.122148  1.194305  0.997719      0.0
may>*__press_*              0.708647  1.115044  1.082358  1.012375      0.0
be_*                        0.714519  0.913248  1.190963  0.722537   

top prompts:
1999-11-25a.748.10 I thank my hon and learned Friend for that answer . Yes , I am aware that the CPS in Staffordshire is good at dealing with complaints , but would it be fair to say that at the national level its attitude has been a trifle defensive ? Does he agree that complaints should be viewed positively , as part of the management information that any organisation needs to improve its service ? I hasten to add that I would say the same about people who make other comments , including compliments . Does my hon and learned Friend agree that the CPS should change its attitude towards making use of the information that comes from the public in those forms ?
['am_* be_*__be_would', 'agree_*__does>*', 'learned_*__learned_agree']

1989-04-06a.327.1 Does my right hon Friend agree that the best form of tax relief is straight reductions in tax rates ? As Opposition Members seem to have grasped that fact and , in their increasingly desperate scramble for office , are making the

(demo continues)

### trying other input formats

We may also experiment with different representations of the input text -- for instance, in lieu of using phrasing motifs we may instead pass questions into the model as just the raw arcs, similar to the responses. This may help with datasets where the phrasing motifs are relatively sparse (due to size or noise/linguistic variability). To do this, we change the `prompt_field` argument (and, correspondingly, `prompt_transform_field`):
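Both input formats share the same surface shape: a list with one string per sentence, each string holding space-separated features. As a sketch of what that means for a downstream model, the snippet below flattens such lists into bags of counts, using a `motifs__sink` value and an arcs list copied from outputs earlier in this notebook. The helper `to_feature_counts` is hypothetical, and whether `PromptTypes` tokenizes its input exactly this way internally is an assumption.

```python
from collections import Counter

def to_feature_counts(sentences):
    """Flatten per-sentence, space-separated feature strings into one bag of counts."""
    return Counter(feat for sent in sentences for feat in sent.split())

# Phrasing motifs of the "accusatory" example question.
motifs = ['stop_*__stop_will__will>*',
          'admit_*__admit_will__will>*',
          'does>*__does>not does>*__understand_*']

# Raw (censored) arcs of one response shown in the summarize() output.
arcs = ['learned_* learned_for look_* look_forward look_to thank_*',
        '',
        'however>* want_* want_at want_ensure want_however',
        'examine_* examine_ensure examine_will']

motif_counts = to_feature_counts(motifs)
arc_counts = to_feature_counts(arcs)
print(len(motif_counts), len(arc_counts))
```

Raw arcs yield many more, and more generic, features per utterance than motifs do, which is why they can help when motifs are sparse.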

In [55]:
pt_arcs = PromptTypes(prompt_field='arcs_censored', reference_field='arcs_censored', 
                 prompt_transform_field='arcs_censored',
                 output_field='prompt_types_arcs', prompt__tfidf_min_df=100,
                 reference__tfidf_min_df=100, n_types=8,
    random_state=1000, verbosity=1)

In [56]:
pt_arcs.fit(corpus)

fitting 214798 input pairs
fitting ref tfidf model
fitting prompt tfidf model
fitting svd model
fitting 8 prompt types


(long output)

In [57]:
pt_arcs.summarize(corpus, k=10)

TYPE 0
top prompt:
                  0         1         2         3         4         5  \
be_would   0.530219  0.963077  1.226676  0.938349  0.917846  0.772022   
be_not     0.531131  0.893339  1.250256  0.938005  1.015710  0.823600   
asked_for  0.567070  1.004391  1.152797  1.014580  1.030091  0.796486   
would>*    0.568361  0.898698  1.225740  0.971188  0.936750  0.737356   
hope_*     0.568801  1.141948  1.105715  1.029404  0.762265  0.735186   
be_*       0.572172  1.049550  1.080383  0.870872  0.767274  0.797001   
will_*     0.574922  1.111423  1.269132  0.945342  0.945557  0.869572   
bearing>*  0.581783  0.884632  1.250685  0.913127  0.963242  0.881540   
take_will  0.589269  1.074388  1.162336  1.000842  0.827029  0.747008   
in>*       0.590970  0.874668  1.180957  0.886673  0.916052  0.806717   

                  6         7  type_id  
be_would   0.884726  1.143401      0.0  
be_not     0.879968  1.132878      0.0  
asked_for  0.907388  1.025336      0.0  
would>*    0.

2011-12-08a.382.3 The Minister will be aware that those who work with children and vulnerable adults can play a vital role in their protection . What is he doing to ensure that new employees , who often see problems with established bad practice , are protected if they decide to become whistleblowers ?
['be_* be_aware be_will', 'doing_* doing_ensure doing_is doing_what what>* what>is']

2013-06-12a.321.7 The Secretary of State will be aware that developing countries lose more than £ 160 billion each year through tax avoidance , more than one and a half times what they receive in aid . What is she doing to ensure that we get country - by - country reporting so that we see how much those multinationals are taking from developing countries ?
['be_* be_aware be_will', 'doing_* doing_ensure doing_is doing_what what>* what>is']

2012-07-16d.663.6 Many SMEs in Northern Ireland are involved merely on the periphery of large MOD contracts . What steps are the Government taking to ensure that the

1997-12-17a.319.3 I thank my right hon Friend for that answer . Is not the problem that little progress was made in this area under the previous Government ? We had to depend on local initiatives to make the running in the electronic delivery of services . Does my right hon Friend agree with me that the Government have a lot to learn from what has already been achieved at local level ?
['thank_* thank_for', 'is>* is_* is_not', 'had_* had_depend', 'agree_* agree_does agree_have agree_with does>*']

2003-05-06.512.6 I am sure that the Minister shares my delight at seeing Iraqi opposition leaders elected to Mosul city council last weekend , but does he agree that the Iraq crisis will not be over until the weapons of mass destruction , about which the war was fought , are found and secured ? Does he further agree that , should the intelligence that the Government received before the war be shown to be wanting in that respect , a fundamental review of our intelligence service will be requir

(demo continues)

### going beyond root arcs

If we initialize the `TextToArcs` transformer with `root_only=False`, we will use arcs beyond those attached to the root of the dependency parse. This may produce neater output, especially in domains where utterances are less well-structured (see [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/conversations-gone-awry/Conversations_Gone_Awry_Prediction.ipynb) for a demo of this on Wikipedia talk page data).