## Converting the paired data from "Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions" into ConvoKit format (the data used in section 4 of their paper).

#### Note: we are only converting the subset of the data used to measure successful vs. unsuccessful arguments. All data provided by:
--------------------

Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions
Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, Lillian Lee. 
In Proceedings of the 25th International World Wide Web Conference (WWW'2016).

The paper, data, and associated materials can be found at:
http://chenhaot.com/pages/changemyview.html

If you use this data, please cite:
@inproceedings{tan+etal:16a, 
    author = {Chenhao Tan and Vlad Niculae and Cristian Danescu-Niculescu-Mizil and Lillian Lee}, 
    title = {Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions}, 
    year = {2016}, 
    booktitle = {Proceedings of WWW} 
}

Note: on the page linked above, the data we used is the original data (linked alongside the corresponding README, PDF, and slides). We did *not* use the updated data provided on 11/11/2016.

Before starting the data conversion, you need to download the data, linked above, and extract the data from the tar archive.

------------------------------------

In [1]:
import os

In [2]:
#here I set the working directory to where I store the convokit package 
os.chdir('C:\\Users\\Andrew\\Desktop\\Cornell-Conversational-Analysis-Toolkit')
from convokit import Corpus, Speaker, Utterance, meta_index

In [3]:
import pandas as pd

Load the original pair data:

In [4]:
pairDFtrain=pd.read_json('C:\\Users\\Andrew\\Documents\\pair_task\\train_pair_data.jsonlist',lines=True)
print(len(pairDFtrain))
pairDFtrain['train']=1
pairDFtrain.tail()

3456


Unnamed: 0,op_author,op_text,op_title,positive,negative,op_name,train
3451,helpful_hank,"In opposing injustice, we must strive not to p...",CMV: Drawing images of Mohammed and posting th...,"{'ancestor': 't1_cniw4jr', 'author': 'cold08',...","{'ancestor': 't1_cniu655', 'author': 'learhpa'...",t3_2rsgv3,1
3452,VIRMD,The rate at which income is taxed (at least in...,CMV: The rate at which one's income is taxed s...,"{'ancestor': 't1_cnirwl5', 'author': 'scottevi...","{'ancestor': 't1_cnjrwds', 'author': 'natha105...",t3_2rs57a,1
3453,VIRMD,The rate at which income is taxed (at least in...,CMV: The rate at which one's income is taxed s...,"{'ancestor': 't1_cnjiwww', 'author': 'AdmiralC...","{'ancestor': 't1_cnjrwds', 'author': 'natha105...",t3_2rs57a,1
3454,GetCapeFly,It seems logical to me that school hours shoul...,CMV: School hours should be 9am to 5pm to matc...,"{'ancestor': 't1_cnii75i', 'author': '[deleted...","{'ancestor': 't1_cnijhp3', 'author': 'funchy',...",t3_2rqvf8,1
3455,luxo42,My argument assumes the Christian theology tau...,"CMV: In heaven, as long as an individual has f...","{'ancestor': 't1_cnj7d44', 'author': 'Field-K'...","{'ancestor': 't1_cnih5d9', 'author': '____Matt...",t3_2rq5g3,1


In [5]:
pairDFhold=pd.read_json('C:\\Users\\Andrew\\Documents\\pair_task\\heldout_pair_data.jsonlist',lines=True)
print(len(pairDFhold))
pairDFhold['train']=0
pairDFhold.head()

807


Unnamed: 0,op_author,op_text,op_title,positive,negative,op_name,train
0,923iwek,I'll start off by saying I'm a vegetarian and ...,CMV: The contribution of vegans/vegetarians an...,"{'ancestor': 't1_cundk5r', 'author': 'ghoooooo...","{'ancestor': 't1_cunbl8g', 'author': 'ClimateM...",t3_3j8yfq,0
1,923iwek,I'll start off by saying I'm a vegetarian and ...,CMV: The contribution of vegans/vegetarians an...,"{'ancestor': 't1_cunbkbz', 'author': 'archagon...","{'ancestor': 't1_cuncrke', 'author': 'Diomange...",t3_3j8yfq,0
2,Navyurf,"Hello, I'm Luke and for the longest time a sma...",CMV:I want to live in Scandinavia,"{'ancestor': 't1_cun0c3t', 'author': 'huadpe',...","{'ancestor': 't1_cun2oqr', 'author': 'iamambie...",t3_3j7dlx,0
3,trashlunch,"By ""practical reason,"" I mean a reason that mo...",CMV: There is no practical reason for any indi...,"{'ancestor': 't1_cumn3j4', 'author': 'ReOsIr10...","{'ancestor': 't1_cumqh8n', 'author': 'Omega037...",t3_3j64aa,0
4,VaginalExcrement,"\n_____\n\nAlright, so, i was challenged by a ...",CMV: We Should execute the weak to improve the...,"{'ancestor': 't1_cumhf65', 'author': 'BadKeyMa...","{'ancestor': 't1_cumi9vr', 'author': 'MCBeatho...",t3_3j5b7g,0


In [6]:
pairDF=pd.concat([pairDFtrain,pairDFhold])

In [7]:
len(pairDF)

4263

Note: Each observation has the reply comments in a conversation that changes the OP's (OP: original poster) mind (positive column) and a conversation that does not change the OP's mind (negative column). Unfortunately, this does not include the comments that OP made after their original post: the comments made by the OP in response to the second conversant's arguments. To find the comments made by OP (i.e. the other half of the conversation), we need to retrieve them from the 'all' dataset.

First: collect the unique identifiers for each original post in our dataset

In [8]:
nyms = list(set(pairDF.op_name))
len(nyms)

3051

Collect each post from the full dataset (this has the full comment threads, whereas the pair data above only has the first response):

Note: if you have not run this notebook before, you will need to uncomment the following seven code cells. They load the full dataset into working memory and save only the observations that match the posts in the pair_data above.

In [9]:
# #note: this is over 2 GB of data, uncomment the following two lines to read in the data

# dataT = pd.read_json('C:\\Users\\Andrew\\Documents\\all\\train_period_data.jsonlist', lines=True)
# # len(dataT)

Keep only the posts that are identified in our original dataset:

In [10]:
# #note: this reduces the 2 GB dataset to a similar size as our original dataset

# dataT=dataT[dataT.name.isin(nyms)]
# len(dataT)
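
If holding the full 2 GB file in memory is a problem, the load-then-filter steps above could instead be done in chunks. This is only a sketch of one possible approach, not what this notebook ran: `filter_posts`, its `chunksize` default, and the toy file written below are hypothetical and exist only for illustration.

```python
import json
import os
import tempfile

import pandas as pd


def filter_posts(path, keep_names, chunksize=1000):
    """Read a .jsonlist file chunk by chunk, keeping rows whose 'name' is in keep_names."""
    kept = []
    # with lines=True and a chunksize, read_json returns an iterator of DataFrames
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    for chunk in reader:
        kept.append(chunk[chunk['name'].isin(keep_names)])
    return pd.concat(kept, ignore_index=True)


# toy stand-in for train_period_data.jsonlist (made-up rows)
rows = [{'name': 't3_a', 'comments': []},
        {'name': 't3_b', 'comments': []},
        {'name': 't3_c', 'comments': []}]
with tempfile.NamedTemporaryFile('w', suffix='.jsonlist', delete=False) as f:
    for row in rows:
        f.write(json.dumps(row) + '\n')
    toy_path = f.name

subset = filter_posts(toy_path, {'t3_a', 't3_c'}, chunksize=2)
os.remove(toy_path)
print(list(subset['name']))  # ['t3_a', 't3_c']
```

In the real notebook, `keep_names` would be the `nyms` list collected earlier and `path` the location of the 'all' data on your machine.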

In [11]:
# # do the same for the holdout data 
# dataH = pd.read_json('C:\\Users\\Andrew\\Documents\\all\\heldout_period_data.jsonlist', lines=True)
# len(dataH)

In [12]:
# dataH=dataH[dataH.name.isin(nyms)]
# len(dataH)

In [13]:
# #combine holdout and train datasets
# data = pd.concat([dataT,dataH])

In [14]:
# len(data)

Saving the posts from the full dataset that are the same as posts in our pair data. 

In [15]:
# #note: I save the data as a pickle file so I don't have to reload the 2 GB dataset in my working memory

# data.to_pickle('C:\\Users\\Andrew\\Downloads\\pairAll.pkl')

Here, I have already run this notebook, so I can just load this dataset back into working memory.

In [16]:
data = pd.read_pickle('C:\\Users\\Andrew\\Downloads\\pairAll.pkl')

In [17]:
data.tail()

Unnamed: 0,approved_by,archived,author,author_flair_css_class,author_flair_text,banned_by,clicked,comments,created,created_utc,...,stickied,subreddit,subreddit_id,suggested_sort,thumbnail,title,ups,url,user_reports,visited
2242,,0.0,EconomistMagazine,,2Δ,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1431231205,1431227605,...,False,changemyview,t5_2w2s8,,,CMV: There will never be another military draf...,31,http://www.reddit.com/r/changemyview/comments/...,[],False
2247,,0.0,Mike2800,,,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1431213658,1431210058,...,False,changemyview,t5_2w2s8,,,"CMV: I'm anti abortion, and I feel like an ass...",44,http://www.reddit.com/r/changemyview/comments/...,[],False
2249,,0.0,EpicPiDude,,,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1431210521,1431206921,...,False,changemyview,t5_2w2s8,,,CMV: You should have to pass the citizenship t...,34,http://www.reddit.com/r/changemyview/comments/...,[],False
2254,,0.0,G01denW01f11,,,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1431187622,1431184022,...,False,changemyview,t5_2w2s8,,,CMV: Black and white are colors,27,http://www.reddit.com/r/changemyview/comments/...,[],False
2262,,0.0,WumboWombo,,,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1431111913,1431108313,...,False,changemyview,t5_2w2s8,,,CMV: America's economy is destined to fail,97,http://www.reddit.com/r/changemyview/comments/...,[],False


In [18]:
len(data)

3051

In [19]:
len(pairDF)

4263

In [20]:
data.columns

Index(['approved_by', 'archived', 'author', 'author_flair_css_class',
       'author_flair_text', 'banned_by', 'clicked', 'comments', 'created',
       'created_utc', 'distinguished', 'domain', 'downs', 'edited', 'from',
       'from_id', 'from_kind', 'gilded', 'hidden', 'hide_score', 'id',
       'is_self', 'likes', 'link_flair_css_class', 'link_flair_text', 'media',
       'media_embed', 'mod_reports', 'name', 'num_comments', 'num_reports',
       'over_18', 'permalink', 'quarantine', 'removal_reason',
       'report_reasons', 'saved', 'score', 'secure_media',
       'secure_media_embed', 'selftext', 'selftext_html', 'stickied',
       'subreddit', 'subreddit_id', 'suggested_sort', 'thumbnail', 'title',
       'ups', 'url', 'user_reports', 'visited'],
      dtype='object')

Only keep the comments and the identifier for merging with the original dataset:

In [21]:
data=data[['comments','name']]

In [22]:
pairDF.columns

Index(['op_author', 'op_text', 'op_title', 'positive', 'negative', 'op_name',
       'train'],
      dtype='object')

This joins the comments in the 'all' data with the posts we are interested in studying:

In [23]:
pairDF=pairDF.join(data.set_index('name'), on='op_name')

In [24]:
len(pairDF)

4263
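
The row count stays at 4263 because the join keeps every row of the pair data and attaches the matching post's comments to each, so a post that appears in several pairs gets the same comment list repeated. A toy illustration with made-up frames (`pairs` and `full` are hypothetical stand-ins for `pairDF` and `data`):

```python
import pandas as pd

# two pairs share the post t3_a; one pair belongs to t3_b
pairs = pd.DataFrame({'op_name': ['t3_a', 't3_a', 't3_b'],
                      'positive': ['p1', 'p2', 'p3']})
# one row per post, as in the filtered 'all' data
full = pd.DataFrame({'name': ['t3_a', 't3_b'],
                     'comments': [['c1', 'c2'], ['c3']]})

# left join: look up each op_name in full's index
joined = pairs.join(full.set_index('name'), on='op_name')
print(len(joined))                  # 3
print(joined['comments'].tolist())  # [['c1', 'c2'], ['c1', 'c2'], ['c3']]
```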

In [25]:
pairDF.tail()

Unnamed: 0,op_author,op_text,op_title,positive,negative,op_name,train,comments
802,EpicPiDude,"Take three people, Persons A, B, and C. They l...",CMV: You should have to pass the citizenship t...,"{'ancestor': 't1_cr427yp', 'author': 'Raintee9...","{'ancestor': 't1_cr3xa7x', 'author': '[deleted...",t3_35fjb1,0,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."
803,EpicPiDude,"Take three people, Persons A, B, and C. They l...",CMV: You should have to pass the citizenship t...,"{'ancestor': 't1_cr49j5s', 'author': 'phcullen...","{'ancestor': 't1_cr3xa7x', 'author': '[deleted...",t3_35fjb1,0,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."
804,G01denW01f11,"Artists, pedants, and pedantic artists like to...",CMV: Black and white are colors,"{'ancestor': 't1_cr3mhui', 'author': 'woahmani...","{'ancestor': 't1_cr3n4yp', 'author': 'niczar',...",t3_35edub,0,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."
805,WumboWombo,"I'm young, so of course my biggest concern at ...",CMV: America's economy is destined to fail,"{'ancestor': 't1_cr2suj3', 'author': 'huadpe',...","{'ancestor': 't1_cr2s3nt', 'author': 'scottevi...",t3_35bc4b,0,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."
806,WumboWombo,"I'm young, so of course my biggest concern at ...",CMV: America's economy is destined to fail,"{'ancestor': 't1_cr2sadm', 'author': 'gunnervi...","{'ancestor': 't1_cr2soeg', 'author': 'celerita...",t3_35bc4b,0,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."


Now that we have all comments made within every CMV post in our dataset, we need to extract only the comments that correspond to a positive argument and negative argument (i.e. the ones recorded as either changing OP's mind or not).

First, collect the identifiers for each comment made by the respondent attempting to change the OP's mind (there is a respondent in both the positive and negative columns).

In [26]:
def collectResponses(responseList):
    iDs=[]
    if len(responseList['comments'])>0:
        for each in responseList['comments']:
            iDs.append(each['id'])
    return iDs
pairDF['negIDs']=pairDF.negative.apply(lambda x: collectResponses(x))
pairDF['posIDs']=pairDF.positive.apply(lambda x: collectResponses(x))

Now collect each of the comment identifiers that signify a response to the challenger by OP:

In [27]:
def collectOPcommentIDs(op_auth, allComments, replyIDs):
    opIds =[]
    for comment in allComments:
        if comment['parent_id'].split('_')[1] in replyIDs: 
            if 'author' in comment.keys():
                if comment['author'] == op_auth:
                    opIds.append(comment['id'])

    return opIds

In [28]:
pairDF['opRepliesPos'] = pairDF[['op_author','comments','posIDs']].apply(lambda x: collectOPcommentIDs(x['op_author'],x['comments'],x['posIDs']),axis=1)

In [29]:
pairDF['opRepliesNeg'] = pairDF[['op_author','comments','negIDs']].apply(lambda x: collectOPcommentIDs(x['op_author'],x['comments'],x['negIDs']),axis=1)
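
To make the matching concrete, here is a toy run on made-up comments. The function is restated (under the hypothetical name `collect_op_comment_ids`) so the block runs on its own; all ids and authors are invented.

```python
def collect_op_comment_ids(op_author, all_comments, reply_ids):
    """Return ids of comments written by op_author whose parent is in reply_ids."""
    op_ids = []
    for comment in all_comments:
        # parent_id looks like 't1_<id>'; keep only the bare id
        parent = comment['parent_id'].split('_')[1]
        # .get() covers records that have no 'author' key
        if parent in reply_ids and comment.get('author') == op_author:
            op_ids.append(comment['id'])
    return op_ids


toy_comments = [
    {'id': 'c1', 'parent_id': 't3_post', 'author': 'challenger'},
    {'id': 'c2', 'parent_id': 't1_c1', 'author': 'op'},
    {'id': 'c3', 'parent_id': 't1_c2', 'author': 'challenger'},
    {'id': 'c4', 'parent_id': 't1_c1', 'author': 'bystander'},
    {'id': 'c5', 'parent_id': 't1_c3'},  # record without an author
    {'id': 'c6', 'parent_id': 't1_c3', 'author': 'op'},
]

op_replies = collect_op_comment_ids('op', toy_comments, ['c1', 'c3'])
print(op_replies)  # ['c2', 'c6']
```

Only direct replies by OP to the challenger's comments are kept; the bystander's reply and the author-less record are skipped.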

Here I collect and properly order each of the comment IDs made in the thread _only_ by either OP or the 2nd conversant studied, for both successful and unsuccessful arguments:

In [30]:
def orderThreadids(comments, replyIDs, opCommentIDs):
    threadIDs=list(replyIDs)
    for comment in comments:
        if comment['id'] in opCommentIDs:
            pID= comment['parent_id'].split('_')[1]
            if pID in replyIDs:
                threadIDs.insert(threadIDs.index(pID)+1,comment['id'])
            
    return threadIDs

In [31]:
pairDF['posOrder']= pairDF[['comments','posIDs','opRepliesPos']].apply(lambda x: orderThreadids(x['comments'],x['posIDs'],x['opRepliesPos']) ,axis = 1)

In [32]:
pairDF['negOrder']= pairDF[['comments','negIDs','opRepliesNeg']].apply(lambda x: orderThreadids(x['comments'],x['negIDs'],x['opRepliesNeg']) ,axis = 1)
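
A toy run of the ordering step, again with invented ids and the function restated (as the hypothetical `order_thread_ids`) so the block is self-contained:

```python
def order_thread_ids(comments, reply_ids, op_comment_ids):
    """Insert each OP reply directly after the challenger comment it answers."""
    thread_ids = list(reply_ids)
    for comment in comments:
        if comment['id'] in op_comment_ids:
            parent = comment['parent_id'].split('_')[1]
            if parent in reply_ids:
                thread_ids.insert(thread_ids.index(parent) + 1, comment['id'])
    return thread_ids


toy_comments = [
    {'id': 'c1', 'parent_id': 't3_post'},
    {'id': 'c2', 'parent_id': 't1_c1'},  # OP answers c1
    {'id': 'c3', 'parent_id': 't1_c2'},
    {'id': 'c6', 'parent_id': 't1_c3'},  # OP answers c3
]
ordered = order_thread_ids(toy_comments, ['c1', 'c3'], ['c2', 'c6'])
print(ordered)  # ['c1', 'c2', 'c3', 'c6']

# two OP replies to the same comment: the one seen later lands first
double = [{'id': 'r1', 'parent_id': 't1_c1'},
          {'id': 'r2', 'parent_id': 't1_c1'}]
print(order_thread_ids(double, ['c1'], ['r1', 'r2']))  # ['c1', 'r2', 'r1']
```

The second call shows one property of this approach worth keeping in mind: because every reply is inserted immediately after its parent, multiple OP replies to the same comment come out in reverse of the order in which they appear in `comments`.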

This function takes the ordered thread IDs for only the successful and unsuccessful arguments measured in the original paper and collects the corresponding comments (although, note: I have also collected the OP replies from the 'all' data, which weren't included in the smaller pair_data).

Note: I don't convert this section into convokit format, but instead I convert the full comment threads later in this notebook. If you are interested in looking at the successful and unsuccessful arguments in the convokit format, see the 'success' attribute in each utterance's metadata

In [33]:
def collectThread(comments, orderedThreadids):
    threadComments=[]
    for iD in orderedThreadids:
        for comment in comments:
            if iD==comment['id']:
                threadComments.append(comment)
    return threadComments

In [34]:
pairDF['positiveThread'] = pairDF[['comments','posOrder']].apply(lambda x: collectThread(x['comments'],x['posOrder']),axis=1)
pairDF['negativeThread'] = pairDF[['comments','negOrder']].apply(lambda x: collectThread(x['comments'],x['negOrder']),axis=1)

Note above: I have just collected each individual thread (with OP comments). However, when studying this data, we may be interested in looking at the entire conversation. Therefore, instead of only converting the positive threads and negative threads into convokit format, here I simply add an attribute to the comments if they are part of either the positive or negative thread.

Here I add the success attribute and the pair identification (see my readme file for a more detailed explanation of 'success' and 'pair_ids') :

In [35]:
# Create an identification # for the paired unsuccessful/successful arguments,
# Note: the pair # will be the same for successful-unsuccessful matched pairs with the prefix 'p_' for pair 
# if there is no paired argument for the comment (i.e. it was either the original post by OP or an uncategorized comment), 
# then pair_id = None
c=0
pairIDS={}
for i, r in pairDF.iterrows():
    
    c=c+1
    for comment in r.comments:
        
        if comment['id'] in r.posOrder:
            comment['success']=1
            if comment['name'] in pairIDS.keys():
                pairIDS[comment['name']].append('p_'+str(c))
                pairIDS[comment['name']]=list(set(pairIDS[comment['name']]))
            else:
                pairIDS[comment['name']]=['p_'+str(c)]
                pairIDS[comment['name']]=list(set(pairIDS[comment['name']]))
                
                
        elif comment['id'] in r.negOrder:
            comment['success']=0

            if comment['name'] in pairIDS.keys():
                pairIDS[comment['name']].append('p_'+str(c))
                pairIDS[comment['name']]=list(set(pairIDS[comment['name']]))
            else:
                pairIDS[comment['name']]=['p_'+str(c)]
                pairIDS[comment['name']]=list(set(pairIDS[comment['name']]))
                

        
        if comment['name'] not in pairIDS.keys():
            pairIDS[comment['name']]=[]
        if 'success' not in comment.keys():
            comment['success']=None
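
The same bookkeeping can be written more compactly. The sketch below is an equivalent idea on made-up rows, not the code that produced the corpus; `toy_rows`, `thread`, and `pair_ids` are hypothetical names, and the two rows deliberately share one comment list, mimicking two pairs drawn from the same post.

```python
from collections import defaultdict

# one thread, shared by two pairs (as when a post is re-used)
thread = [{'id': 'c1', 'name': 't1_c1'},
          {'id': 'c2', 'name': 't1_c2'},
          {'id': 'c3', 'name': 't1_c3'},
          {'id': 'c4', 'name': 't1_c4'}]
toy_rows = [
    {'comments': thread, 'posOrder': ['c1'], 'negOrder': ['c2']},
    {'comments': thread, 'posOrder': ['c3'], 'negOrder': ['c2']},
]

pair_ids = defaultdict(set)
for pair_number, row in enumerate(toy_rows, start=1):
    label = 'p_' + str(pair_number)
    for comment in row['comments']:
        if comment['id'] in row['posOrder']:
            comment['success'] = 1
            pair_ids[comment['name']].add(label)
        elif comment['id'] in row['negOrder']:
            comment['success'] = 0
            pair_ids[comment['name']].add(label)
        else:
            # keep an earlier label if there is one, otherwise mark as None
            comment.setdefault('success', None)
            pair_ids[comment['name']]  # make sure the key exists (empty set)

pair_ids = {name: sorted(ids) for name, ids in pair_ids.items()}
print(pair_ids)
# {'t1_c1': ['p_1'], 't1_c2': ['p_1', 'p_2'], 't1_c3': ['p_2'], 't1_c4': []}
```

The unsuccessful comment `c2` is matched against two different successful comments, so it carries both pair labels, while the uncategorized `c4` ends up with an empty list.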

In [36]:
#make a column for pair_ids collected at the op post level, note: this won't be unique at the observation level in our pairDF dataframe, but I'm just doing this for quick conversion and after converting it into convokit, I add the list in at the conversation-level metadata and it is unique per conversation
threads = list(set(pairDF.op_name))
pids =[]
for thread in threads:
    pid=[]
    for i,r in pairDF[pairDF.op_name==thread].iterrows():
        for comment in r.comments:
            if len(pairIDS[comment['name']])>0:
                for p in pairIDS[comment['name']]:
                    pid.append(p)
    pid=list(set(pid))
    pids.append(pid)
pairDF['pIDs']=pairDF.op_name.apply(lambda x: pids[threads.index(x)])

Now the data is collected in a pandas dataframe with each thread's comments fully accounted for. Convert it into convokit format:

The first step is to create a list of all Redditors, or 'users' in convokit parlance:

In [37]:
# collect authors in a set so membership checks stay fast
users = set(pairDF.op_author)

for i, r in pairDF.iterrows():
    for comment in r.comments:
        # some comment records have no 'author' key; skip those
        if 'author' in comment:
            users.add(comment['author'])

users = list(users)

In [38]:
len(users)

34910

Note: I don't have metadata on individual users. I briefly considered creating a unique identifier for each user and including the 'username' as metadata, but since each Reddit username is unique, it would be superfluous. I believe other relevant information (such as whether a Redditor is the original poster) is specific to individual conversations and utterances.

Two metadata points of note: 'author_flair_css_class' and 'author_flair_text' both describe flags that appear next to an author in a subreddit. In the changemyview subreddit the moderators use this to show whether the author has changed someone's mind, and it can be seen as both an award and evidence of credibility in the subreddit. While I could include this as author metadata, I believe that it is actually 'conversation' metadata, because this flag is updated over time if the author changes multiple people's minds over the course of many conversations. Since this data was collected over time, the flag is likely to change per user across multiple conversations, possibly across utterances.

I will include the user_meta dictionary, just in case, so data can be added to it later.

In [39]:
user_meta={}
for user in users:
    user_meta[user]={}

In [40]:
corpus_speakers = {k: Speaker(id = k, meta = v) for k,v in user_meta.items()}

In [41]:
print("number of users in the data = {0}".format(len(corpus_speakers)))

number of users in the data = 34910


Next: create utterances

In [42]:
c=0
count=0
errors=[]
utterance_corpus = {}

for i , r in pairDF.iterrows():
    #this creates an Utterance using the metadata provided in the original file. Note: this is for the original post in each observation within the pandas dataframe
    utterance_corpus[r.op_name]=Utterance(id=r.op_name ,
                                          speaker=corpus_speakers[r.op_author],
                                          conversation_id=r.op_name ,
                                          reply_to=None,
                                          timestamp=None,
                                          text=r.op_text,
                                          meta= {'pair_ids':[],
                                                 'success':None,
                                                 'approved_by': None,
                                                 'author_flair_css_class': None,
                                                 'author_flair_text': None,
                                                 'banned_by': None,
                                                 'controversiality': None,
                                                 'distinguished': None,
                                                 'downs': None,
                                                 'edited': None,
                                                 'gilded': None,
                                                 'likes': None,
                                                 'mod_reports':None,
                                                 'num_reports': None,
                                                 'replies': [com['id'] for com in r.comments if com['parent_id']==r.op_name],
                                                 'report_reasons': None,
                                                 'saved': None,
                                                 'score': None,
                                                 'score_hidden': None,
                                                 'subreddit': None,
                                                 'subreddit_id': None,
                                                 'ups': None,
                                                 'user_reports': None})
    #note: now for every comment in the original thread, make an utterance
    for comment in r.comments:
        try:
            utterance_corpus[comment['name']]=Utterance(id=comment['name'],
                                                        speaker=corpus_speakers[comment['author']],
                                                        conversation_id=r.op_name,
                                                        reply_to=comment['parent_id'],
                                                        timestamp=int(comment['created']),
                                                        text=comment['body'] ,
                                                        meta={
                                                            'pair_ids':pairIDS[comment['name']],
                                                            'success':comment['success'],
                                                            'approved_by': comment['approved_by'],
                                                            'author_flair_css_class': comment['author_flair_css_class'],
                                                            'author_flair_text': comment['author_flair_text'],
                                                            'banned_by': comment['banned_by'],
                                                            'controversiality': comment['controversiality'],
                                                            'distinguished': comment['distinguished'],
                                                            'downs': comment['downs'],
                                                            'edited': comment['edited'],
                                                            'gilded': comment['gilded'],
                                                            'likes': comment['likes'],
                                                            'mod_reports':comment['mod_reports'],
                                                            'num_reports': comment['num_reports'],
                                                            'replies':comment['replies'],
                                                            'report_reasons': comment['report_reasons'],
                                                            'saved': comment['saved'],
                                                            'score': comment['score'],
                                                            'score_hidden': comment['score_hidden'],
                                                            'subreddit': comment['subreddit'],
                                                            'subreddit_id': comment['subreddit_id'],
                                                            'ups': comment['ups'],
                                                            'user_reports': comment['user_reports']
                                                             })

        #this except catches multiple comments that have no text body, see errors examples below
        except:
            c=c+1
            errors.append(comment)
            utterance_corpus[comment['name']]=Utterance(id=comment['name'],
                                                        speaker=Speaker(id='[missing]'),
                                                        conversation_id=r.op_name,
                                                        reply_to=comment['parent_id'],
                                                        timestamp=None,
                                                        text=None ,
                                                        meta={
                                                            'pair_ids':pairIDS[comment['name']],
                                                            'success':comment['success'],
                                                            'approved_by': None,
                                                            'author_flair_css_class':  None,
                                                            'author_flair_text':  None,
                                                            'banned_by':  None,
                                                            'controversiality':  None,
                                                            'distinguished': None,
                                                            'downs':  None,
                                                            'edited':  None,
                                                            'gilded':  None,
                                                            'likes':  None,
                                                            'mod_reports': None,
                                                            'num_reports':  None,
                                                            'replies': None,
                                                            'report_reasons':  None,
                                                            'saved':  None,
                                                            'score':  None,
                                                            'score_hidden':  None,
                                                            'subreddit': None,
                                                            'subreddit_id': None,
                                                            'ups':  None,
                                                            'user_reports':  None
                                                             })

           
print('there were '+str(c)+' comments that were missing common attributes')

there were 530 comments that were missing common attributes


The 530 comments missing common attributes (note that none of them have a text body or author) have been included in the corpus for completeness (note: each was caught by the exception in the above code, but is still included). Here are some examples of these comments:

In [43]:
errors[22]

{'children': ['cm6tktt'],
 'count': 0,
 'id': 'cm6tktt',
 'name': 't1_cm6tktt',
 'parent_id': 't1_cm6tfr2',
 'success': None}

In [44]:
errors[99]

{'children': ['cj50shk'],
 'count': 0,
 'id': 'cj50shk',
 'name': 't1_cj50shk',
 'parent_id': 't1_cj4wzvs',
 'success': None}

In [45]:
errors[395]

{'children': ['cq9ci0g'],
 'count': 1,
 'id': 'cq9ci0g',
 'name': 't1_cq9ci0g',
 'parent_id': 't1_cq9465r',
 'success': None}
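
Rather than relying on the exception, such records could also be detected up front. A minimal sketch, assuming a comment counts as incomplete when it lacks any of the `author`, `body`, or `created` keys that the utterance construction above reads; `is_stub` and `REQUIRED_KEYS` are hypothetical names, and the `full` comment is made up.

```python
# keys the utterance construction needs for speaker, text and timestamp
REQUIRED_KEYS = ('author', 'body', 'created')


def is_stub(comment):
    """True if the comment record is missing any of the required keys."""
    return any(key not in comment for key in REQUIRED_KEYS)


# first record copied from the errors[22] example above
stub = {'children': ['cm6tktt'], 'count': 0, 'id': 'cm6tktt',
        'name': 't1_cm6tktt', 'parent_id': 't1_cm6tfr2', 'success': None}
# invented complete record for contrast
full = {'id': 'abc', 'name': 't1_abc', 'parent_id': 't3_post',
        'author': 'someone', 'body': 'text', 'created': 1431231205}

flags = [is_stub(stub), is_stub(full)]
print(flags)  # [True, False]
```

This checks fewer keys than the `try` block touches, so on the real data its count could differ from 530.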

In [46]:
len(utterance_corpus)

293297

Note above: the number of unique original posts (3051) is smaller than the number of observations in our pair dataset (4263). This holds up under scrutiny for two reasons:
    1. each positive and negative thread correspond to the same original post.
    2. original posts were re-used to compare different successful/non-successful arguments.

##### Creating a corpus from a list of utterances:

In [47]:
utterance_list = [utterance for k,utterance in utterance_corpus.items()]

In [48]:
change_my_view_corpus = Corpus(utterances=utterance_list, version=1)

In [49]:
print("number of conversations in the dataset = {}".format(len(change_my_view_corpus.get_conversation_ids())))

number of conversations in the dataset = 3051


Note: 3051 is the number of original posts recorded in the dataset (both train and hold out data)

In [50]:
convo_ids = change_my_view_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:2]):
    print("sample conversation {}:".format(i))
    print(change_my_view_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['t3_2ro9ux', 't1_cnhplrm', 't1_cnhrvq7', 't1_cnhz66d', 't1_cniauhy', 't1_cnibfev', 't1_cnic0gj', 't1_cnhpsmr', 't1_cnhpvqs', 't1_cnhq7iw', 't1_cnhqrw1', 't1_cnhqzsf', 't1_cni8tcx', 't1_cnhpp4o', 't1_cnhqouu', 't1_cnhrd8u', 't1_cnhrwsq', 't1_cnhs6sc', 't1_cnhtr4t', 't1_cnhuopi', 't1_cnio1bg', 't1_cnhq330', 't1_cnhs7xb', 't1_cnhpnmr', 't1_cnhqhxa', 't1_cnhrkoc', 't1_cnhq7nv', 't1_cnhqcwz', 't1_cnhsyft', 't1_cnhww76', 't1_cnhz5wq', 't1_cni80dr', 't1_cni8e2y']
sample conversation 1:
['t3_2ro0ti', 't1_cnhpddf', 't1_cnhpqan', 't1_cnhuxye', 't1_cni1m79', 't1_cni24ug', 't1_cnhrcu4', 't1_cni06fr', 't1_cnhp0bu', 't1_cnhppsw', 't1_cnhwhma', 't1_cnho6mi', 't1_cnhot32', 't1_cnhp1pb', 't1_cnho7iy', 't1_cnhoqp4', 't1_cnhobzs', 't1_cnhop4t', 't1_cnhp1nq', 't1_cnhpgyd', 't1_cnhp5lp', 't1_cnhplmn', 't1_cni3tyd', 't1_cnhqck4', 't1_cnhpee3', 't1_cnhregg', 't1_cniogf7', 't1_cnhowj2', 't1_cnhxuu1', 't1_cniedbg', 't1_cnixgm0']


##### Add conversation-level metadata:

In [51]:
convos = change_my_view_corpus.iter_conversations()
for convo in convos:
    convo.add_meta('op-userID',pairDF[pairDF.op_name==convo._id].op_author[pairDF[pairDF.op_name==convo._id].index[0]])
    convo.add_meta('op-text-body',pairDF[pairDF.op_name==convo._id].op_text[pairDF[pairDF.op_name==convo._id].index[0]])
    convo.add_meta('op-title',pairDF[pairDF.op_name==convo._id].op_title[pairDF[pairDF.op_name==convo._id].index[0]])
    convo.add_meta('pair_ids',pairDF[pairDF.op_name==convo._id].pIDs[pairDF[pairDF.op_name==convo._id].index[0]])
  

In [52]:
convo_ids= change_my_view_corpus.get_conversation_ids()

In [53]:
for cv in convo_ids:
    change_my_view_corpus.get_conversation(cv).add_meta('train',int(pairDF[pairDF.op_name==cv].train[pairDF[pairDF.op_name==cv].index[0]]))
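
The lookups above filter `pairDF` several times per conversation. One alternative, sketched here on a made-up frame, is to reduce the pair data to one row per post first and then read the metadata from that. `toy_pairs`, `per_post`, and `conversation_meta` are hypothetical names; in the notebook each entry would be passed to `convo.add_meta` as above.

```python
import pandas as pd

# toy stand-in for pairDF: post t3_a appears in two pairs
toy_pairs = pd.DataFrame({
    'op_name': ['t3_a', 't3_a', 't3_b'],
    'op_author': ['alice', 'alice', 'bob'],
    'op_title': ['CMV: A', 'CMV: A', 'CMV: B'],
    'train': [1, 1, 0],
})

# keep the first row for each post, indexed by the post id
per_post = toy_pairs.drop_duplicates('op_name').set_index('op_name')

conversation_meta = {
    op_name: {'op-userID': row.op_author,
              'op-title': row.op_title,
              'train': int(row.train)}
    for op_name, row in per_post.iterrows()
}
print(conversation_meta['t3_a'])
# {'op-userID': 'alice', 'op-title': 'CMV: A', 'train': 1}
```

Taking the first row per post mirrors the `.index[0]` selection used in the cells above.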

##### Add corpus title:

In [54]:
change_my_view_corpus.meta['name'] = "Change My View Corpus"

In [55]:
change_my_view_corpus.print_summary_stats()

Number of Speakers: 34911
Number of Utterances: 293297
Number of Conversations: 3051


In [56]:
change_my_view_corpus.dump('change-my-view-corpus', base_path='C:\\Users\\Andrew\\Desktop\\CMV data')

### Notes
- The original data compiled by the authors only included the challenger replies. To extract the full argument (i.e. a conversation between OP and the challenger), we selected the comments by OP for inclusion in a successful or unsuccessful argument (i.e. "success" = 1 or 0) by collecting all OP replies to any of the corresponding successful/unsuccessful comments by the challenger. This is a conservative measure of the overall "argument." It does not include comments made in response to the challenger's posts by other individuals, nor does it include comments made by OP if they replied to those outside individuals. All other comments in the thread (including separate comments made by OP) have the "success" field taking the value of None.
- If you are interested in expanding the 'arguments' to ensure all conversants are included, then I would suggest the following method:

		1. Collect all originally provided successful and unsuccessful comments (collected at the Utterance-level conditioning on both "success" = 1 or 0 and user_id != the OP's user_id).
		2. Collect all comments made by the OP.
		3. Using the reply_to identifier, recur up the comments made in the full comment thread for each original post; collecting every comment thread that OP has made a comment in.
		4. Select any comment thread from step 3 for inclusion in a successful/unsuccessful argument if the challenger has also made a comment in that thread. 
- Overall, I believe the conservative measurement of 'argument' that I have used is better because the second method (above) would include argument threads where a challenger is only minimally relevant.
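
The four steps of the alternative method could look roughly like the sketch below. It is one interpretation, not code that was run for this corpus: the toy comments are invented, `path_to_post` is a hypothetical helper, and "the challenger has also made a comment in that thread" is read here as "the chain from OP's comment up to the post contains one of the labelled challenger comments".

```python
def path_to_post(comment_name, reply_to):
    """Follow reply_to links upward; return comment names, stopping at the post."""
    path = []
    current = comment_name
    # comment ids start with 't1_', the original post with 't3_'
    while current is not None and current.startswith('t1_'):
        path.append(current)
        current = reply_to.get(current)
    return path


# toy thread: utterance id -> the fields used below
toy = {
    't1_c1': {'reply_to': 't3_post', 'speaker': 'challenger', 'success': 1},
    't1_c2': {'reply_to': 't1_c1', 'speaker': 'bystander', 'success': None},
    't1_c3': {'reply_to': 't1_c2', 'speaker': 'op', 'success': None},
    't1_c4': {'reply_to': 't3_post', 'speaker': 'other', 'success': None},
    't1_c5': {'reply_to': 't1_c4', 'speaker': 'op', 'success': None},
}
reply_to = {name: c['reply_to'] for name, c in toy.items()}

# step 1: labelled comments not written by OP
challenger_comments = {n for n, c in toy.items()
                       if c['success'] in (0, 1) and c['speaker'] != 'op'}
# step 2: all comments by OP
op_comments = [n for n, c in toy.items() if c['speaker'] == 'op']

# steps 3 and 4: walk up from each OP comment, keep chains touching a challenger comment
expanded = set()
for name in op_comments:
    path = path_to_post(name, reply_to)
    if challenger_comments.intersection(path):
        expanded.update(path)

print(sorted(expanded))  # ['t1_c1', 't1_c2', 't1_c3']
```

Here OP's reply to the bystander (`t1_c3`) is pulled into the argument because its chain passes through the challenger's comment, while OP's reply in the unrelated branch (`t1_c5`) is not.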

- Details on the collection procedure:

We started from the data collected by the authors of the Winning Arguments paper (cited above). The data was downloaded from their page:
https://chenhaot.com/pages/changemyview.html (note: data used in this corpus is from the original data collection -- NOT the updated data on 11/11/16) 

Note: we originally intended to only convert their pair_data into Convokit format (i.e. the data they use in Section 4 of the paper, which looks at differences between arguments that were convincing/unconvincing to the OP in changing their mind). However, the pair_data only included the replies to the original post (not OP's other comments in the thread -- so there was no conversation, nor did they have all comments in the thread). Therefore, we matched the OP posts in the pair_data with the same observation in their 'all' data, from which we collected all comments for each thread.