# Convert IQ2 debates to Convokit
*by Marianne Aubin Le Quere and Lucas Van Bramer*

This Python notebook converts the raw IQ2 dataset into the Convokit format. The original dataset can be found at http://tisjune.github.io/research/iq2. The input file is:
  * iq2_data_release.json
  
Many of the instructions below were taken from this GitHub tutorial: https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/converting_movie_corpus.ipynb. 

## Environment Setup

The first step is to ensure your environment is correctly set up. You need to be able to import the convokit package to use this notebook. For more information about how to install convokit, please visit the GitHub page https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit. 

In [235]:
# First, we start by ensuring that the current path is correct. 
# Replace the file path with your local version of convokit to avoid import issues.

import os
os.chdir('/Users/marianneaubin/Documents/Classes/CS6742/Cornell-Conversational-Analysis-Toolkit')



In [12]:
# if you are having issues with spacy, you may need to install it here.
# this is an optional step.

import sys
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy info


spaCy version    2.1.8                         
Location         /Users/marianneaubin/anaconda3/lib/python3.7/site-packages/spacy
Platform         Darwin-18.7.0-x86_64-i386-64bit
Python version   3.7.3                         
Models           en                            



In [237]:
# import convokit, then validate it is correctly imported
# by running the convokit command and checking it exists
import convokit
convokit

<module 'convokit' from '/Users/marianneaubin/Documents/Classes/CS6742/Cornell-Conversational-Analysis-Toolkit/convokit/__init__.py'>

In [238]:
# import required modules

from tqdm import tqdm
from convokit import Corpus, User, Utterance

## Importing your data

Now that you have set up your environment and gotten convokit to work, it's time to import and represent your data! We are working with a JSON file, so it should be easy to load into a workable dictionary format.

In [239]:
# set your data directory to the IQ2 file location
data_dir = "../IQ2/iq2_data_release/"

In [240]:
import json
with open(data_dir + "iq2_data_release.json", "r", encoding='utf-8', errors='ignore') as f:
    debates = json.load(f)

In [242]:
# optionally, check the data is represented correctly
# note this will print the whole first debate, so may be long

print(str(debates['PerformanceEnhancingDrugs-011508']))
print(str(debates['PerformanceEnhancingDrugs-011508']['title']))
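
Printing a whole debate is long, and the later cells rely on only a handful of fields in each record. The sketch below builds a tiny, made-up record with those fields and summarises it. The record contents and the helper name `summarize_debate` are illustrative assumptions and are not part of the dataset or of convokit; only the key names are taken from the cells in this notebook.

```python
# A made-up record that mirrors the fields this notebook reads.
# All values are invented for illustration.
toy_debate = {
    "title": "Toy Debate",
    "date": "Monday, January 1, 2000",
    "url": "http://example.com/toy-debate",
    "summary": "An invented debate.",
    "results": {"pre": {"for": 40.0, "against": 30.0, "undecided": 30.0},
                "post": {"for": 55.0, "against": 40.0, "undecided": 5.0},
                "breakdown": None},
    "speakers": {
        "for": [{"name": "Ann For", "bio": None, "bio_short": None}],
        "against": [{"name": "Al Against", "bio": None, "bio_short": None}],
        "moderator": {"name": "Mo Derator", "bio": None, "bio_short": None},
    },
    "transcript": [
        {"speaker": "Mo Derator", "speakertype": "mod", "segment": 0,
         "paragraphs": ["Welcome."], "nontext": {}},
        {"speaker": "Ann For", "speakertype": "for", "segment": 0,
         "paragraphs": ["Thank you.", "I will argue for the motion."], "nontext": {}},
    ],
}


def summarize_debate(record):
    """Return a few counts that are quicker to read than the full printout."""
    speakers = record["speakers"]
    return {
        "title": record["title"],
        "n_for": len(speakers["for"]),
        "n_against": len(speakers["against"]),
        "n_turns": len(record["transcript"]),
        "n_paragraphs": sum(len(turn["paragraphs"]) for turn in record["transcript"]),
    }


print(summarize_debate(toy_debate))
```

The same helper could be called on `debates['PerformanceEnhancingDrugs-011508']` once the real file is loaded, assuming every record carries these keys.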

Because the debates do not have an id from the outset, we will create one for ease of use.

In [244]:
# add a numeric debate id to each debate dict
# (enumerate avoids shadowing the built-in `id`)
for debate_id, debate in enumerate(debates, start=1):
    debates[debate]['id'] = debate_id

## Creating Users

Our dataset already lists all speakers along with their metadata. They are grouped as 'for,' 'against,' or 'moderator.' For each of these speakers, we will create a User representation in our corpus that contains the following metadata:
  * speaker name
  * bio, if available
  * short bio, if available
  * debate id
  * debate name
  * position (defined as 'for,' 'against,' 'moderator,' or 'misc')
  
The key to access a user is their name.

Note that we do not have a perfectly clean dataset. There are some speakers who speak but are not represented in the given speaker list of the dataset because they are not officially a part of the debate. One example of this is the 'host' of the debate or a 'panelist.' To account for this, we create a generic user called 'Misc,' to map all of these utterances on to later. That way the utterances are still represented.
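
The fallback to 'Misc' can be sketched in isolation. This is a minimal illustration, not the notebook's own cell: the helper `resolve_speaker`, the set `known`, and the speaker type `"host"` are hypothetical, while the three listed types (`mod`, `for`, `against`) are the values checked later in this notebook.

```python
LISTED_TYPES = ("mod", "for", "against")


def resolve_speaker(name, speakertype, known_users):
    """Return the user key for a turn, falling back to 'Misc'."""
    # `not in` checks against every listed type. Writing
    # `speakertype != ("mod" or "for" or "against")` would only compare
    # against "mod", because the `or` chain evaluates to its first truthy value.
    if speakertype not in LISTED_TYPES or name not in known_users:
        return "Misc"
    return name


# Hypothetical user keys, standing in for the keys of the user dictionary.
known = {"Ann For", "Mo Derator", "Misc"}

print(resolve_speaker("Ann For", "for", known))
print(resolve_speaker("Some Host", "host", known))
```

Checking `name not in known_users` as well guards against a speaker whose type is listed but whose name does not appear in the speaker list.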

In [245]:
# the following synthesises all the user metadata, keyed by speaker name
user_meta = {}

def make_user_info(speaker, position, debate):
    # build the metadata dict for one speaker in one debate
    return {
        'bio': speaker['bio'],
        'bio_short': speaker['bio_short'],
        'debate_id': debates[debate]['id'],
        'debate_name': debate,
        'position': position,
        'name': speaker['name'],
    }

for debate in debates:
    speakers = debates[debate]['speakers']

    # for and against speakers
    for position in ('for', 'against'):
        for speaker in speakers[position]:
            user_meta[speaker['name']] = make_user_info(speaker, position, debate)

    # moderator
    moderator = speakers['moderator']
    user_meta[moderator['name']] = make_user_info(moderator, 'moderator', debate)

    # misc speaker, with no bio information
    misc = {'bio': None, 'bio_short': None, 'name': 'Misc'}
    user_meta['Misc'] = make_user_info(misc, 'misc', debate)
    

In [246]:
# we now create the corpus of users
corpus_users = {k: User(name = k, meta = v) for k,v in user_meta.items()}

In [247]:
# sanity check that the number of users is as expected
print("number of users in the data = {0}".format(len(corpus_users)))
# sanity check on one instance
print(corpus_users['Bob Costas'])
print(corpus_users['Bob Costas'].meta)

number of users in the data = 470
User([('name', 'Bob Costas')])
{'bio': None, 'bio_short': None, 'debate_id': 1, 'debate_name': 'PerformanceEnhancingDrugs-011508', 'position': 'moderator', 'name': 'Bob Costas'}


## Creating utterances

Each time a speaker speaks uninterrupted, this counts as an utterance in our dataset. Each of the dataset utterances consists of:
  * utterance id: a unique id we have created for each utterance in the format "debateid_utteranceid". This renders each utterance unique throughout the dataset
  * user: the speaking user. This will be 'Misc' if the speaker is not officially listed as 'for,' 'against,' or a 'moderator'
  * root: the id of the first utterance of the debate
  * reply_to: the id of the preceding utterance
  * timestamp: not present in this case
  * text: text of the utterance
  * metadata:
      * the debate id
      * the current segment
      * nontextual information (e.g. applause, laughter)
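
The id scheme in the list above can be sketched on its own. This is a minimal illustration under the assumption that turns are numbered from 1 within each debate; `build_turn_ids` is a hypothetical helper and not a convokit function.

```python
def build_turn_ids(debate_id, n_turns):
    """Return (utterance_id, root, reply_to) triples for one debate."""
    # Every utterance in a debate shares the same root: the first utterance.
    root = "{}_1".format(debate_id)
    triples = []
    for turn in range(1, n_turns + 1):
        utt_id = "{}_{}".format(debate_id, turn)
        # The first utterance replies to nothing; the rest reply to the previous turn.
        reply_to = None if turn == 1 else "{}_{}".format(debate_id, turn - 1)
        triples.append((utt_id, root, reply_to))
    return triples


for triple in build_turn_ids(4, 3):
    print(triple)
```

Because the debate id is part of every utterance id, ids such as `1_7` and `4_7` do not collide even though both are the seventh turn of their debate.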

In [252]:
# Now moving on to creating Utterances

utterance_corpus = {}

for debate in debates:
    transcript = debates[debate]['transcript']
    debate_id = debates[debate]['id']
    # the first utterance of each debate is the root of its conversation
    root = str(debate_id) + '_1'

    for utt_id, utt in enumerate(transcript, start=1):
        speaker = utt['speaker']
        text = utt['paragraphs']
        meta = {'debate id': debate_id, 'segment': utt['segment'], 'nontext': utt['nontext']}

        utt_unique_id = str(debate_id) + '_' + str(utt_id)
        # every utterance except the first replies to the one before it
        reply_to = str(debate_id) + '_' + str(utt_id - 1) if utt_id != 1 else None

        # map speakers who are not listed as moderator, for, or against to 'Misc'
        # note: `!= ('mod' or 'for' or 'against')` would only compare against 'mod'
        if utt['speakertype'] not in ('mod', 'for', 'against') or speaker not in corpus_users:
            speaker = "Misc"

        utterance_corpus[utt_unique_id] = Utterance(utt_unique_id, corpus_users[speaker], root, reply_to, None, text, meta)
        

In [255]:
#sanity check a few utterances
print(utterance_corpus['1_1'])
print(utterance_corpus['1_53'])
print(utterance_corpus['4_7'])
len(utterance_corpus)

print(corpus_users['Gray Davis'].name)

Utterance({'id': '1_1', 'user': User([('name', 'Bob Costas')]), 'root': '1_1', 'reply_to': None, 'timestamp': None, 'text': ['… And now I’d like to introduce Robert Rosenkranz, who is the chairman of the Rosenkranz Foundation, and the sponsor of Intelligence Squared, who will frame tonight’s debate. Bob? This is Bob.'], 'meta': {'debate id': 1, 'segment': 0, 'nontext': {'applause': [[0, 29]]}}})
Utterance({'id': '1_53', 'user': User([('name', 'Bob Costas')]), 'root': '1_1', 'reply_to': '1_52', 'timestamp': None, 'text': ['All right, suppose we were to adopt Julian’s suggestion, that there were regulated, permissible regulated use of performance- The Rosenkranz Foundation - Intelligence Squared US Debate “Performance Enhancing Drugs in Competitive Sports” enhancing drugs, and in each case it was appropriate to the spirit of the particular sport, that’s fine in the ideal. But it’s naïve to believe that each competitor, many of them obsessed with victory and believing in the full bloom of

In [256]:
# collect the utterances into a list, ready to build the corpus
utterance_list = list(utterance_corpus.values())

## Creating corpus

Create the corpus. Convokit will automatically create conversations from the data in the utterances list.
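
One way to picture this grouping: utterances that share a `root` end up in the same conversation. The sketch below uses plain dictionaries rather than convokit objects, so it illustrates the idea only and should not be read as convokit's implementation; `group_by_root` and the toy records are hypothetical.

```python
from collections import defaultdict


def group_by_root(utterances):
    """Group utterance ids by their root id, preserving input order."""
    conversations = defaultdict(list)
    for utt in utterances:
        conversations[utt["root"]].append(utt["id"])
    return dict(conversations)


# Toy utterances from two debates, reduced to the two fields that matter here.
toy_utterances = [
    {"id": "1_1", "root": "1_1"},
    {"id": "1_2", "root": "1_1"},
    {"id": "2_1", "root": "2_1"},
    {"id": "1_3", "root": "1_1"},
]

print(group_by_root(toy_utterances))
```

Since every utterance of a debate carries the same root, this kind of grouping yields one conversation per debate.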

In [257]:
# create corpus
iq2_corpus = Corpus(utterances=utterance_list, version=1)

In [259]:
# in our case, the number of conversations will be equivalent to the number of debates
print("number of conversations in the dataset = {}".format(len(iq2_corpus.get_conversation_ids())))

number of conversations in the dataset = 108


In [261]:
# sanity check a conversation if desired
# note this will be quite long
convo_ids = iq2_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:5]):
    print("sample conversation {}:".format(i))
    print(iq2_corpus.get_conversation(convo_idx).get_utterance_ids())

## Parsing the corpus

In [13]:
# if needed, download en package for spacy
#!python -m spacy download en
#import nltk;
#nltk.download('punkt')

/Users/marianneaubin/anaconda3/bin/python: Error while finding module specification for 'spacy==2.0.12' (ModuleNotFoundError: No module named 'spacy==2')


In [284]:
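# if desired, parse info
# not currently working: the parser expects each utterance's text to be a str,
# but here it is a list of paragraphs (see the TypeError below)
#from convokit import Parser
#annotator = Parser()
#iq2_corpus_parsed = annotator.fit_transform(iq2_corpus)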

TypeError: Argument 'string' has incorrect type (expected str, got list)
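
The `TypeError` above is consistent with each utterance's `text` being a list of paragraphs rather than a single string. One possible workaround, sketched here as an assumption and not tested against convokit's parser, is to join the paragraphs into one string before building the utterances. `join_paragraphs` is a hypothetical helper, and the newline separator is an arbitrary choice.

```python
def join_paragraphs(paragraphs, sep="\n"):
    """Collapse a list of paragraph strings into one string.

    Empty or whitespace-only paragraphs are dropped, and the rest are
    stripped of surrounding whitespace before joining.
    """
    return sep.join(p.strip() for p in paragraphs if p and p.strip())


toy_paragraphs = ["Thank you. ", "", "I will argue for the motion."]
print(join_paragraphs(toy_paragraphs))
```

Joining loses the paragraph boundaries unless the separator is kept distinctive, so the original list could also be stored in the utterance metadata if those boundaries matter.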

## Adding corpus-level metadata

Each debate also has some metadata associated with it that we want to capture. For each debate, we want to include in the metadata:
  * debate id
  * debate results
  * debate title
  * debate date
  * debate url
  * debate summary

In [276]:
debate_meta = {}
for debate in debates:
    d = debates[debate]
    debate_id, results, title, date, url, summary = \
        d['id'], d['results'], d['title'], d['date'], d['url'], d['summary']
    debate_meta[debate_id] = {'title': title, "url": url, 'summary': summary, 'date': date, 'results': results}

In [277]:
# sanity check for random debate
len(debate_meta)
print(debate_meta[3])

{'title': 'Freedom of Expression Must Include the License to Offend', 'url': 'http://intelligencesquaredus.org/debates/past-debates/item/545-freedom-of-expression-must-include-the-license-to-offend', 'summary': 'Debate description coming soon.', 'date': 'Tuesday, October 16, 2006', 'results': {'breakdown': None, 'post': {'undecided': 1.0, 'for': 83.0, 'against': 16.0}, 'pre': {'undecided': 11.0, 'for': 78.0, 'against': 11.0}}}
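
As a small illustration of how the nested `results` entry can be read, the sketch below computes the change between the `pre` and `post` figures for each option. The helper `vote_shift` is hypothetical, and the input is copied from the output printed above.

```python
def vote_shift(results):
    """Return the post-debate figure minus the pre-debate figure per option."""
    pre, post = results["pre"], results["post"]
    return {option: post[option] - pre[option] for option in pre}


# The `results` entry copied from the sanity-check output above.
results = {
    "breakdown": None,
    "post": {"undecided": 1.0, "for": 83.0, "against": 16.0},
    "pre": {"undecided": 11.0, "for": 78.0, "against": 11.0},
}

print(vote_shift(results))
```

For this debate, both 'for' and 'against' rise by 5.0 while 'undecided' falls by 10.0.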


In [278]:
iq2_corpus.meta['debate_metadata'] = debate_meta

## Save the dataset
We will now dump the corpus to a location of your choice and validate that it was saved correctly.

In [279]:
# name the dataset
iq2_corpus.meta['name'] = "IQ2 Debates Corpus"

In [266]:
# specify the path where you want to save the corpus
iq2_corpus.dump("iq2-corpus", base_path='datasets/iq2-corpus')

In [267]:
from convokit import meta_index

In [268]:
meta_index(filename = "datasets/iq2-corpus/iq2-corpus")

{'utterances-index': {'debate id': "<class 'int'>",
  'segment': "<class 'int'>",
  'nontext': "<class 'dict'>"},
 'users-index': {'bio': "<class 'NoneType'>",
  'bio_short': "<class 'NoneType'>",
  'debate_id': "<class 'int'>",
  'debate_name': "<class 'str'>",
  'position': "<class 'str'>",
  'name': "<class 'str'>"},
 'conversations-index': {},
 'overall-index': {'debate_metadata': "<class 'dict'>",
  'name': "<class 'str'>"},
 'version': 1}