This notebook provides examples of how to use ConvoKit to analyze the behavior of particular speakers within conversations. In other words, we will be dealing with attributes at the (speaker, conversation) level.
Attributes at this granularity include linguistic diversity, described in the following paper: http://www.cs.cornell.edu/~cristian/Finding_your_voice__linguistic_development.html
They can be used to perform longitudinal analyses of speaker behaviors across multiple conversations.

Since we cannot publicly release the dataset of counseling conversations used in that paper, we will use the ChangeMyView subreddit as a test case---as such, this notebook is mostly to demonstrate how the functionality works, rather than to suggest any substantive scientific claims about longitudinal behavior change.

In [1]:
import convokit
from convokit import Corpus
from convokit import download
from convokit.text_processing import TextParser

# Setup

Imports and loading corpora:

In [2]:
# OPTION 1: DOWNLOAD CORPUS
# UNCOMMENT THESE LINES TO DOWNLOAD CORPUS
# DATA_DIR = '<YOUR DIRECTORY>'
# ROOT_DIR = convokit.download('subreddit-changemyview', data_dir=DATA_DIR)

# OPTION 2: READ PREVIOUSLY-DOWNLOADED CORPUS FROM DISK
# UNCOMMENT THIS LINE AND REPLACE WITH THE DIRECTORY WHERE THE CORPUS IS LOCATED
# ROOT_DIR = '<YOUR DIRECTORY>'

corpus = Corpus(ROOT_DIR)

In [3]:
corpus.print_summary_stats()

Number of Speakers: 217100
Number of Utterances: 5017556
Number of Conversations: 117492


To start, we will set up a data structure mapping each speaker to their conversations, and each utterance they contributed in the conversation.

To do this we call the `organize_speaker_convo_history` function, which annotates each `Speaker` in a corpus with a dict of conversations --> the speaker's utterances in that conversation, and the timestamp of their first utterance (i.e., when they "entered" the conversation).

Note that we can specify what counts as participating in a conversation. Here, we omit posts and focus only on comments, so that a speaker doesn't count as participating if they only submitted the root post. We also skip utterances from deleted accounts and bots.

In [4]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot','AutoModerator']
def utterance_is_valid(utterance):
    return (utterance.id != utterance.root) and (utterance.speaker.id not in SPEAKER_BLACKLIST)

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)

An example of what this function call gives us:

In [6]:
corpus.get_speaker('ThatBelligerentSloth').meta['n_convos']

1039

In [7]:
corpus.get_speaker('ThatBelligerentSloth').meta['start_time']

1424463398

For each speaker, we maintain a dictionary in their `meta` information of conversation ID to a record of the speaker's behavior in that conversation:

In [8]:
corpus.get_speaker('ThatBelligerentSloth').meta['conversations']['2wm22t']

{'idx': 2,
 'n_utterances': 2,
 'start_time': 1424491188,
 'utterance_ids': ['cos7k4p', 'cos8ffz']}
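
To make the shape of this record concrete, here is a minimal pure-Python sketch that builds a similar mapping from a toy list of utterances. It is not ConvoKit's implementation: the helper name `build_convo_history`, the dict layout of the toy utterances, and the toy data are all assumptions made for illustration, and `idx` is assumed to order a speaker's conversations by when they entered them.

```python
from collections import defaultdict

def build_convo_history(utterances, is_valid=lambda utt: True):
    """Map speaker -> per-conversation records, mimicking the structure shown above.

    `utterances` is an iterable of dicts with keys
    'id', 'speaker', 'convo_id', 'timestamp' (a hypothetical layout).
    """
    per_speaker = defaultdict(lambda: defaultdict(list))
    for utt in utterances:
        if is_valid(utt):
            per_speaker[utt['speaker']][utt['convo_id']].append(utt)

    history = {}
    for speaker, convos in per_speaker.items():
        records = {}
        for convo_id, utts in convos.items():
            utts = sorted(utts, key=lambda u: u['timestamp'])
            records[convo_id] = {
                'n_utterances': len(utts),
                'start_time': utts[0]['timestamp'],  # when the speaker "entered"
                'utterance_ids': [u['id'] for u in utts],
            }
        # assumption: 'idx' ranks conversations by the speaker's entry time
        ordered = sorted(records, key=lambda c: records[c]['start_time'])
        for idx, convo_id in enumerate(ordered):
            records[convo_id]['idx'] = idx
        history[speaker] = {
            'conversations': records,
            'n_convos': len(records),
            'start_time': records[ordered[0]]['start_time'],
        }
    return history

# Invented toy data, not taken from the corpus
toy_utterances = [
    {'id': 'u1', 'speaker': 'alice', 'convo_id': 'c1', 'timestamp': 100},
    {'id': 'u2', 'speaker': 'alice', 'convo_id': 'c2', 'timestamp': 200},
    {'id': 'u3', 'speaker': 'alice', 'convo_id': 'c2', 'timestamp': 250},
    {'id': 'u4', 'speaker': 'bob', 'convo_id': 'c2', 'timestamp': 210},
]
toy_history = build_convo_history(toy_utterances)
print(toy_history['alice']['conversations']['c2'])
```

Passing a predicate as `is_valid` plays the same role as the `utterance_filter` argument used earlier: utterances that fail it never count towards a speaker's participation.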

To speed up this demo, we will only take the 100 most active speakers.

To help with this, the `get_attribute_table` function call gives us a Pandas dataframe whose indices correspond to speaker names, and which contains the number of conversations each speaker participated in.

In [9]:
speaker_activities = corpus.get_attribute_table('speaker',['n_convos'])

In [10]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

id,n_convos
cdb03b,7159.0
Ansuz07,6501.0
garnteller,6290.0
hacksoncode,5947.0
Nepene,5408.0
GnosticGnome,5211.0
huadpe,4847.0
Grunt08,4623.0
caw81,4204.0
Glory2Hypnotoad,3984.0


In [11]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index

In [12]:
subset_utts = []
for speaker in top_speakers:
    subset_utts += list(corpus.get_speaker(speaker).iter_utterances())
subset_corpus = Corpus(utterances=subset_utts)

In [13]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 539413
Number of Conversations: 66051


Finally, to finish setting things up, we will tokenize the utterances using the `TextParser` transformer (this is somewhat slow; setting the mode to 'tokenize' means we avoid having to perform expensive dependency-parse computations, which we do not need for the present analysis).

In [14]:
from convokit.text_processing import TextProcessor, TextParser

In [16]:
tokenizer = TextParser(mode='tokenize', output_field='tokens', verbosity=1000)
subset_corpus = tokenizer.transform(subset_corpus)

1000/539413 utterances processed
2000/539413 utterances processed
3000/539413 utterances processed
...

Here's what the tokenized output looks like for one utterance (for a more in-depth explanation, check out the `TextParser` documentation).

In [17]:
subset_corpus.get_utterance('cos7k4p').get_info('tokens')

[{'toks': [{'tok': 'Strictly'},
   {'tok': 'speaking'},
   {'tok': 'yes'},
   {'tok': ','},
   {'tok': 'they'},
   {'tok': 'are'},
   {'tok': 'probably'},
   {'tok': 'entitled'},
   {'tok': 'to'},
   {'tok': 'their'},
   {'tok': 'view'},
   {'tok': 'if'},
   {'tok': 'they'},
   {'tok': 'live'},
   {'tok': 'in'},
   {'tok': 'a'},
   {'tok': 'developed'},
   {'tok': 'country'},
   {'tok': '.'}]},
 {'toks': [{'tok': 'Typically'},
   {'tok': 'these'},
   {'tok': 'countries'},
   {'tok': 'agree'},
   {'tok': 'to'},
   {'tok': 'by'},
   {'tok': 'and'},
   {'tok': 'large'},
   {'tok': 'protect'},
   {'tok': 'speech'},
   {'tok': 'as'},
   {'tok': 'free'},
   {'tok': '.'}]},
 {'toks': [{'tok': 'You'},
   {'tok': 'are'},
   {'tok': 'literally'},
   {'tok': 'entitled'},
   {'tok': 'to'},
   {'tok': 'say'},
   {'tok': 'whatever'},
   {'tok': 'you'},
   {'tok': 'want'},
   {'tok': '.'}]},
 {'toks': [{'tok': 'However'},
   {'tok': 'this'},
   {'tok': 'does'},
   {'tok': 'not'},
   {'tok': 'mean'},


# Analysis

The goal of this analysis is to examine what a speaker's conversational behavior looks like within a single conversation, and then how it evolves over the conversations they take part in. To demonstrate what this looks like, we'll start with a simple attribute, wordcount.
First, we count the words in each utterance using the `TextProcessor` transformer. Note this computes _per utterance_ statistics.

In [18]:
wordcounter = TextProcessor(input_field='tokens', output_field='wordcount', 
                           proc_fn=lambda sents: sum(len(sent['toks']) for sent in sents), verbosity=25000)
subset_corpus = wordcounter.transform(subset_corpus) 

25000/539413 utterances processed
50000/539413 utterances processed
75000/539413 utterances processed
100000/539413 utterances processed
125000/539413 utterances processed
150000/539413 utterances processed
175000/539413 utterances processed
200000/539413 utterances processed
225000/539413 utterances processed
250000/539413 utterances processed
275000/539413 utterances processed
300000/539413 utterances processed
325000/539413 utterances processed
350000/539413 utterances processed
375000/539413 utterances processed
400000/539413 utterances processed
425000/539413 utterances processed
450000/539413 utterances processed
475000/539413 utterances processed
500000/539413 utterances processed
525000/539413 utterances processed
539413/539413 utterances processed


In [19]:
subset_corpus.get_utterance('cos7k4p').get_info('wordcount')

97

In [20]:
subset_corpus.get_utterance('cos8ffz').get_info('wordcount')

32
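
As a small illustration of what the `proc_fn` above computes, the sketch below applies the same sum to a toy parse shaped like the `tokens` field shown earlier. The function name `count_tokens` and the toy sentences are assumptions made for this example.

```python
def count_tokens(sents):
    """Total number of tokens across all sentences of one parsed utterance."""
    return sum(len(sent['toks']) for sent in sents)

# Toy parse in the same shape as the 'tokens' field: a list of sentences,
# each holding a list of token dicts
toy_parse = [
    {'toks': [{'tok': 'You'}, {'tok': 'are'}, {'tok': 'right'}, {'tok': '.'}]},
    {'toks': [{'tok': 'Thanks'}, {'tok': '!'}]},
]
toy_wordcount = count_tokens(toy_parse)
print(toy_wordcount)  # 6
```

Note that punctuation marks are tokens in their own right, so this "wordcount" is really a token count.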

Next, we aggregate per-utterance statistics over all the utterances a particular speaker contributed in a conversation. That is, we will turn wordcount into a speaker,convo-level attribute.

We call the `SpeakerConvoAttrs` transformer to do this. Here, `agg_fn=np.mean` means that the speaker,convo-level attribute is an _average_ over utterance lengths, but you could replace this with your own aggregation function (e.g., `max`).

In [21]:
import numpy as np

In [22]:
sc_wordcount = convokit.speaker_convo_helpers.speaker_convo_attrs.SpeakerConvoAttrs('wordcount', agg_fn=np.mean)
subset_corpus = sc_wordcount.transform(subset_corpus)

This transformer annotates each conversation in each Speaker object with a (mean) wordcount:

In [23]:
subset_corpus.get_speaker('ThatBelligerentSloth').meta['conversations']['2wm22t']

{'idx': 2,
 'n_utterances': 2,
 'start_time': 1424491188,
 'utterance_ids': ['cos7k4p', 'cos8ffz'],
 'wordcount': 64.5}
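
The aggregation itself is simple enough to check by hand. The sketch below reproduces the value above from the two per-utterance wordcounts reported earlier (97 and 32); the helper `aggregate_speaker_convo` is a hypothetical stand-in for illustration, not part of ConvoKit.

```python
from statistics import mean

def aggregate_speaker_convo(utt_values, utterance_ids, agg_fn=mean):
    """Aggregate a per-utterance attribute over one speaker's utterances in one conversation."""
    return agg_fn([utt_values[utt_id] for utt_id in utterance_ids])

# per-utterance wordcounts taken from the outputs shown earlier
wordcounts = {'cos7k4p': 97, 'cos8ffz': 32}
utt_ids = ['cos7k4p', 'cos8ffz']

mean_wc = aggregate_speaker_convo(wordcounts, utt_ids)
max_wc = aggregate_speaker_convo(wordcounts, utt_ids, agg_fn=max)
print(mean_wc, max_wc)  # 64.5 97
```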

We will now use this aggregate statistic to analyze how speakers change behavior over time. The particular question here is whether or not speakers systematically increase or decrease in wordcount, and in the number of utterances contributed to each conversation.

To facilitate further analyses, we'll load all the speaker,convo information pertaining to the attributes we want into a dataframe. We'll use the `get_full_attribute_table` function to do this (the particular call tells the function to load a table with wordcount and # of utterances at the speaker,conversation level, and # of conversations i.e., how active the speaker was, at the speaker level).

In [24]:
speaker_convo_len_df = subset_corpus.get_full_attribute_table(speaker_convo_attrs=['wordcount','n_utterances'],
                                             speaker_attrs=['n_convos'])

In [25]:
speaker_convo_len_df.head()

id,convo_id,convo_idx,n_utterances,speaker,wordcount,n_convos__speaker
cdb03b__18x6j5,18x6j5,0,2,cdb03b,24.5,7159
cdb03b__1adg1v,1adg1v,1,1,cdb03b,4.0,7159
cdb03b__1cciah,1cciah,2,2,cdb03b,25.0,7159
cdb03b__1ccvs4,1ccvs4,3,1,cdb03b,35.0,7159
cdb03b__1e2r7u,1e2r7u,4,1,cdb03b,74.0,7159


We perform our longitudinal analyses at the level of life-stages: i.e., contiguous blocks of conversations. Here, we compare the first two life-stages of 10 conversations each: how the speaker behaves in their first 10 conversations, versus their 11th to 20th conversations.
We say that speakers systematically increase (or decrease) in an attribute if, for a significant majority of speakers, the value of this attribute increases (or decreases) from one life-stage to the next.

To this end, we need to aggregate attributes over a life-stage, e.g., mean wordcount. To perform this aggregation we'll define the `get_lifestage_attributes` function, specifying life-stages of 10 conversations each.


In [26]:
def get_lifestage_attributes(attr_df, attr, lifestage_size, agg_fn=np.mean):
    aggs = attr_df.groupby(['speaker', attr_df.convo_idx // lifestage_size])\
        [attr].agg(agg_fn)
    aggs = aggs.reset_index().pivot(index='speaker', columns='convo_idx',
                                   values=attr)
    return aggs
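
For readers less familiar with pandas, the grouping step can be sketched in plain Python: integer division of `convo_idx` by the life-stage size assigns each conversation to a stage, and the attribute is then averaged within each (speaker, stage) bucket. The function name `lifestage_means`, the row layout, and the toy rows are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean

def lifestage_means(rows, attr, lifestage_size):
    """rows: iterable of dicts with 'speaker', 'convo_idx', and `attr`.

    Returns {speaker: {stage: mean of attr over that stage}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        stage = row['convo_idx'] // lifestage_size  # 0-based life-stage index
        buckets[row['speaker']][stage].append(row[attr])
    return {speaker: {stage: mean(vals) for stage, vals in stages.items()}
            for speaker, stages in buckets.items()}

# Invented toy rows, with life-stages of 2 conversations to keep things short
toy_rows = [
    {'speaker': 'alice', 'convo_idx': 0, 'wordcount': 10.0},
    {'speaker': 'alice', 'convo_idx': 1, 'wordcount': 20.0},
    {'speaker': 'alice', 'convo_idx': 2, 'wordcount': 60.0},
    {'speaker': 'alice', 'convo_idx': 3, 'wordcount': 40.0},
    {'speaker': 'bob', 'convo_idx': 0, 'wordcount': 5.0},
]
toy_stages = lifestage_means(toy_rows, 'wordcount', lifestage_size=2)
print(toy_stages)
```

In this sketch a speaker with no conversations in a stage simply has no entry for it; in the pivoted dataframe the corresponding cell would be null.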

We focus on the first 20 conversations (i.e., 2 life-stages). We also ignore all speakers with fewer than 20 conversations, so that we are not biased by survivorship.

In [27]:
subset = speaker_convo_len_df[(speaker_convo_len_df.n_convos__speaker >= 20)
                          & (speaker_convo_len_df.convo_idx < 20)]

In [28]:
stage_wc_df = get_lifestage_attributes(subset, 'wordcount', 10)

In [29]:
stage_wc_df.head()

speaker,0,1
ACrusaderA,99.640812,131.716667
A_Mirror,71.4,94.583333
A_Soporific,287.816667,229.133333
AlphaGoGoDancer,297.825,418.933333
Amablue,161.7,298.008333


In [30]:
stage_wc_df.mean()

convo_idx
0    157.456438
1    147.400859
dtype: float64

In [31]:
from scipy import stats

In [32]:
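def print_lifestage_comparisons(stage_df):
    # compare each life-stage to the next, over speakers with values in both
    for i in range(stage_df.columns.max()):
        
        mask = stage_df[i+1].notnull() & stage_df[i].notnull()
        c1 = stage_df[i+1][mask]
        c0 = stage_df[i][mask]
        
        print('stages %d vs %d (%d speakers)' % (i + 1, i, sum(mask)))
        n_more = sum(c1 > c0)
        n = sum(c1 != c0)  # ties are excluded from the test
        # stats.binom_test was removed in SciPy 1.12; stats.binomtest (SciPy >= 1.7) replaces it
        print('\tprop more: %.3f, binom_p=%.2f' % (n_more/n, stats.binomtest(n_more, n).pvalue))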

In [33]:
print_lifestage_comparisons(stage_wc_df)

stages 1 vs 0 (100 speakers)
	prop more: 0.410, binom_p=0.09


In [34]:
stage_convo_len_df = get_lifestage_attributes(subset, 'n_utterances', 10)

In [35]:
stage_convo_len_df.mean()

convo_idx
0    2.786
1    2.733
dtype: float64

Just looking at the means, there appears to be a slight decrease in wordcount across the population from the first to the second life-stage (and an even smaller one in the number of utterances per conversation). To check significance, we compute the percentage of speakers who experience an increase, and run a binomial test against a null proportion of 50% of speakers (i.e., people randomly increase or decrease).

For wordcount, 41% of speakers increase (binomial p = 0.09), which is close to, but short of, the conventional 0.05 threshold ... maybe more data would help!
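
For intuition about where that p-value comes from, here is a minimal sketch of the exact two-sided binomial test against a null proportion of 0.5, written with the standard library only. The function name `sign_test_p` is an assumption for illustration; the notebook itself relies on SciPy. Because the null distribution is symmetric at 0.5, doubling the smaller tail gives the two-sided p-value.

```python
from math import comb

def sign_test_p(n_more, n):
    """Two-sided exact binomial p-value for n_more successes out of n, null proportion 0.5."""
    k = min(n_more, n - n_more)                       # size of the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)                         # cap at 1 when k is at the centre

# 41 of 100 speakers increased in wordcount (from the output above)
p_wordcount = sign_test_p(41, 100)
print(round(p_wordcount, 2))  # 0.09
```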

In [36]:
print_lifestage_comparisons(stage_convo_len_df)

stages 1 vs 0 (100 speakers)
	prop more: 0.458, binom_p=0.48


Finally, we'll compute some attributes related to linguistic diversity described in the following paper: http://www.cs.cornell.edu/~cristian/Finding_your_voice__linguistic_development.html

In short, for each life-stage, we compare the words used by one speaker in one conversation to the words they use in their other conversations, or the words that others use. As such, this is a speaker,convo-level attribute. Given our small sample here (and the fact that CMV and crisis counseling conversations are very different), we're not going for any scientific claims, but use the following function calls to demonstrate how the pipeline would work.

These attributes are all computed through the `SpeakerConvoDiversityWrapper` transformer, which computes three attributes:

* `div__self`: within-diversity in the paper, comparing language use across a speaker's own conversations
* `div__other`: between-diversity in the paper, comparing language use across different speakers
* `div__adj`: relative diversity: between - within. (intuitively, is the diversity coming from speakers being different from others, beyond being diverse in their own right?)

Under the surface, `SpeakerConvoDiversityWrapper` calls a more general `SpeakerConvoDiversity` transformer, which allows for computation of how divergent a conversation is from any arbitrary reference set of conversations, beyond life-stages (see the documentation for details).
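
To give a rough feel for how the three attributes relate, the toy sketch below scores one conversation against two reference sets and takes the difference. The divergence used here, cross-entropy under an add-one-smoothed unigram model, is an assumption chosen for illustration only; it is not ConvoKit's implementation, and it ignores the sampling (`sample_size`, `n_iters`) and cohort selection that the transformer performs. All names and sentences are invented.

```python
from collections import Counter
from math import log2

def cross_entropy(target_tokens, reference_tokens):
    """Average bits per target token under an add-one-smoothed unigram
    model of the reference tokens (a stand-in divergence for illustration)."""
    ref_counts = Counter(reference_tokens)
    vocab = set(reference_tokens) | set(target_tokens)
    denom = len(reference_tokens) + len(vocab)
    log_probs = [log2((ref_counts[tok] + 1) / denom) for tok in target_tokens]
    return -sum(log_probs) / len(target_tokens)

# one conversation by a speaker, their other conversations, and other speakers' text
convo = 'i think the policy is unfair'.split()
own_other_convos = 'i think the law is unfair i think the rule is fair'.split()
other_speakers = 'sports teams win games when players train hard'.split()

div_self = cross_entropy(convo, own_other_convos)   # vs. the speaker's own language
div_other = cross_entropy(convo, other_speakers)    # vs. other speakers' language
div_adj = div_other - div_self                      # between minus within
print(div_self, div_other, div_adj)
```

In this toy case the conversation is closer to the speaker's own language than to that of others, so the adjusted value is positive.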

In [37]:
from convokit import SpeakerConvoDiversityWrapper

In [38]:
scd = convokit.SpeakerConvoDiversityWrapper(lifestage_size=10, max_exp=20,
                sample_size=300, min_n_utterances=1, n_iters=50, cohort_delta=60*60*24*30*2, verbosity=100)

(This takes a while to run, especially with more speakers involved.)

In [39]:
subset_corpus = scd.transform(subset_corpus)

getting lifestages
getting within diversity
joining tokens across conversation utterances
100 / 708
200 / 708
300 / 708
400 / 708
500 / 708
600 / 708
700 / 708
getting across diversity
joining tokens across conversation utterances
100 / 708
200 / 708
300 / 708
400 / 708
500 / 708
600 / 708
700 / 708
getting relative diversity
100 / 696
200 / 696
300 / 696
400 / 696
500 / 696
600 / 696


In [40]:
div_df = subset_corpus.get_full_attribute_table(['div__self','div__other','div__adj', 'tokens', 'n_utterances'], ['n_convos'])

Note that one present limitation of this methodology is that it requires a speaker's activity in a conversation, and in their other conversations, to be substantive enough. If a speaker doesn't meet the minimum wordcount per conversation, then the function returns `np.nan` for that particular speaker,conversation. Filtering out these null values:

In [41]:
div_df = div_df[div_df.div__self.notnull() | div_df.div__other.notnull()]
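
Note that the filter above combines the two conditions with `|`, so a row is kept when at least one of the two attributes is present. A plain-Python sketch of the same logic on invented toy rows (the row layout and values are assumptions for illustration):

```python
from math import isnan, nan

# Invented toy rows; nan marks an attribute that could not be computed
toy_rows = [
    {'id': 'a__c1', 'div__self': 0.8, 'div__other': 1.1},
    {'id': 'a__c2', 'div__self': nan, 'div__other': 1.3},
    {'id': 'b__c1', 'div__self': nan, 'div__other': nan},
]

# keep a row if either attribute is non-null (mirrors `notnull() | notnull()`)
kept = [row for row in toy_rows
        if not isnan(row['div__self']) or not isnan(row['div__other'])]
kept_ids = [row['id'] for row in kept]
print(kept_ids)  # ['a__c1', 'a__c2']
```

Rows such as `a__c2` survive with one attribute still missing, which is why per-attribute comparisons on the filtered table can cover different numbers of speakers.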

In [42]:
div_df.shape

(696, 9)

As with the wordcount example, we can make cross-lifestage comparisons. Here we unfortunately see no significant population-wide change in either direction. This might be worth exploring with more speakers, though note that the interpretation of this result may differ between CMV and counseling conversations, where speakers are randomly assigned.

In [43]:
for attr in ['div__self','div__other','div__adj']:
    print(attr)
    stage_df = get_lifestage_attributes(div_df, attr, 10)
    print_lifestage_comparisons(stage_df)
    print('\n\n===')

div__self
stages 1 vs 0 (40 speakers)
	prop more: 0.500, binom_p=1.00


===
div__other
stages 1 vs 0 (83 speakers)
	prop more: 0.458, binom_p=0.51


===
div__adj
stages 1 vs 0 (36 speakers)
	prop more: 0.444, binom_p=0.62


===
