In [1]:
import convokit

In [3]:
from convokit import Corpus, download
import pandas as pd

In this notebook, we demonstrate how to generate a Corpus from pandas DataFrames. Users with CSV data may find it more straightforward to load the CSV as a DataFrame, make a few adjustments, and then generate a Corpus using the `Corpus.from_pandas()` method.
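As a minimal sketch of what "a few adjustments" could mean, suppose a CSV export uses its own column names. The names `msg_id`, `sent_at`, `body`, `author`, `parent_id`, and `thread_id` below are hypothetical, as is the tiny inline CSV. Renaming them to the field names used by the utterance DataFrames in this notebook might look like this:

```python
import io
import pandas as pd

# Hypothetical CSV export; the column names and rows are made up for illustration.
raw_csv = io.StringIO(
    "msg_id,sent_at,body,author,parent_id,thread_id\n"
    "m1,1185295934,Hello there,alice,,m1\n"
    "m2,1185295990,Hi Alice,bob,m1,m1\n"
)

# Read the ID-like columns as strings so numeric-looking IDs are not parsed as numbers.
id_columns = ["msg_id", "author", "parent_id", "thread_id"]
raw_df = pd.read_csv(raw_csv, dtype={col: str for col in id_columns})

# Map the hypothetical column names onto the utterance field names.
my_utt_df = raw_df.rename(columns={
    "msg_id": "id",
    "sent_at": "timestamp",
    "body": "text",
    "author": "speaker",
    "parent_id": "reply_to",
    "thread_id": "conversation_id",
})

print(my_utt_df.columns.tolist())
```

The renamed frame then has the same column layout as the simple utterance DataFrame shown later in this notebook.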

We will use an existing corpus to demonstrate what your own DataFrames should look like.

In [6]:
# Load an existing corpus (Conversations Gone Awry) to use as an example
corpus = Corpus(download('conversations-gone-awry-corpus'))
conversations_gone_awry_conversationdf = corpus.get_conversations_dataframe()
conversations_gone_awry_utterances_df = corpus.get_utterances_dataframe()
print(conversations_gone_awry_conversationdf.columns)
print(conversations_gone_awry_utterances_df.columns)
print(corpus.get_speakers_dataframe().columns)
# Merge the two DataFrames. The conversation ID is the index of the
# conversations DataFrame (not a column), so join against the index.
merged_df = pd.merge(conversations_gone_awry_utterances_df, conversations_gone_awry_conversationdf, left_on='conversation_id', right_index=True)
merged_df.to_csv('conversations.csv')

Dataset already exists at /Users/frankabugnail/.convokit/downloads/conversations-gone-awry-corpus
Index(['vectors', 'meta.page_title', 'meta.page_id', 'meta.pair_id',
       'meta.conversation_has_personal_attack', 'meta.verified',
       'meta.pair_verified', 'meta.annotation_year', 'meta.split'],
      dtype='object')
Index(['timestamp', 'text', 'speaker', 'reply_to', 'conversation_id',
       'meta.is_section_header', 'meta.comment_has_personal_attack',
       'meta.toxicity', 'meta.parsed', 'vectors'],
      dtype='object')
Index(['vectors'], dtype='object')

In [39]:

# Get the utterances as a DataFrame
utterances_df = corpus.get_utterances_dataframe()

# Get the conversations as a DataFrame
conversations_df = corpus.get_conversations_dataframe()

# Merge the utterances with the conversations. The conversations DataFrame
# has no 'speaker' column and keeps its ID in the index, so no renaming is
# needed; join each utterance's conversation_id against that index.
merged_df = pd.merge(utterances_df, conversations_df, left_on='conversation_id', right_index=True)


In [43]:
import pandas as pd
import numpy as np
# Read the CSV file
df = pd.read_csv('conversations.csv')

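# Split the row positions into 40 nearly equal chunks
# (splitting positions avoids passing the DataFrame itself to NumPy)
index_chunks = np.array_split(np.arange(len(df)), 40)

# Save each chunk to a separate CSV file
for i, idx in enumerate(index_chunks):
    df.iloc[idx].to_csv(f'chunk_{i}.csv', index=False)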

In [10]:
# you can ignore this
utt_df = corpus.get_utterances_dataframe().drop(columns=['vectors'])
convo_df = corpus.get_conversations_dataframe().drop(columns=['vectors'])
speaker_df = corpus.get_speakers_dataframe().drop(columns=['vectors'])

Now, take a close look at each of these DataFrames. Notice that each utterance, speaker, and conversation has its own ID. (In this corpus in particular, the conversation ID is the ID of the first utterance in the conversation.)
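If your own data has `reply_to` links but no conversation IDs, one way to follow the same convention is to walk each utterance's reply chain up to its root. The sketch below uses a made-up toy DataFrame and a hypothetical helper `find_root`. It assumes that an utterance whose parent is missing from the data starts its own conversation, which may not be the right choice for every dataset.

```python
import pandas as pd

# Toy utterances: two threads, identified only by their reply_to links.
toy_df = pd.DataFrame({
    "id": ["a1", "a2", "a3", "b1", "b2"],
    "reply_to": [None, "a1", "a2", None, "b1"],
})

# Lookup table from utterance ID to the ID it replies to.
parent_of = dict(zip(toy_df["id"], toy_df["reply_to"]))

def find_root(utt_id):
    """Follow reply_to links upward until reaching an utterance with no usable parent."""
    seen = set()
    while True:
        parent = parent_of.get(utt_id)
        # Stop at a root (no parent), a parent absent from the data, or a cycle.
        if parent is None or pd.isna(parent) or parent not in parent_of or parent in seen:
            return utt_id
        seen.add(utt_id)
        utt_id = parent

# Use the root utterance's ID as the conversation ID.
toy_df["conversation_id"] = toy_df["id"].map(find_root)
print(toy_df)
```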

Utterances have the following **primary data fields**: ID, timestamp, text, speaker (a string ID), reply_to (a string ID), conversation_id (a string ID).

Conversations and Speakers have only one **primary data field**, their ID.

All other information associated with these objects is *metadata* and is included in the DataFrames as columns named *meta.[keyname]*.
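A quick way to compare your own DataFrame against this layout is to sort its columns into primary fields, metadata, and everything else. The helper `describe_utterance_columns` and the example data below are hypothetical and only illustrate the naming convention described above; they are not part of ConvoKit.

```python
import pandas as pd

# Primary utterance fields as listed above; 'id' may be a column or the index.
PRIMARY_UTT_FIELDS = ["id", "timestamp", "text", "speaker", "reply_to", "conversation_id"]

def describe_utterance_columns(df):
    """Return (missing primary fields, metadata columns, remaining columns)."""
    available = set(df.columns) | {df.index.name}
    missing = [f for f in PRIMARY_UTT_FIELDS if f not in available]
    meta = [c for c in df.columns if str(c).startswith("meta.")]
    other = [c for c in df.columns if c not in PRIMARY_UTT_FIELDS and c not in meta]
    return missing, meta, other

# Made-up example: no timestamp, one metadata column, one unprefixed extra column.
example_df = pd.DataFrame({
    "id": ["u1", "u2"],
    "text": ["Hello", "Hi"],
    "speaker": ["alice", "bob"],
    "reply_to": [None, "u1"],
    "conversation_id": ["u1", "u1"],
    "meta.toxicity": [0.0, 0.1],
    "score": [3, 5],
})

missing, meta, other = describe_utterance_columns(example_df)
print("missing primary fields:", missing)
print("metadata columns:", meta)
print("other columns:", other)
```

Columns that land in the last group are worth a second look: renaming them to `meta.[keyname]` makes their role explicit.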

In [11]:
utt_df.head(20)

id,timestamp,text,speaker,reply_to,conversation_id,meta.is_section_header,meta.comment_has_personal_attack,meta.toxicity,meta.parsed
146743638.12652.12652,1185295934.0,== [WIKI_LINK: WP:COMMONNAME] ==\n,Sirex98,,146743638.12652.12652,True,False,0.0,"[{'rt': 3, 'toks': [{'tok': '=', 'tag': 'NFP',..."
146743638.12667.12652,1185277934.0,I notice that earier that moved wiki_link to ...,Sirex98,146743638.12652.12652,146743638.12652.12652,False,False,0.078141,"[{'rt': 33, 'toks': [{'tok': 'I', 'tag': 'PRP'..."
146842219.12874.12874,1185310317.0,"Chen was known in the poker world as ""William""...",2005,146743638.12667.12652,146743638.12652.12652,False,False,0.031784,"[{'rt': 2, 'toks': [{'tok': 'Chen', 'tag': 'NN..."
146860774.13072.13072,1185316241.0,I see what you saying I just read his pokersta...,Sirex98,146842696.12874.12874,146743638.12652.12652,False,False,0.030405,"[{'rt': 13, 'toks': [{'tok': 'I', 'tag': 'PRP'..."
143890867.11926.11926,1184144351.0,==List of slang terms for poker hands==\n,WilyD,,143890867.11926.11926,True,False,0.0,"[{'rt': 0, 'toks': [{'tok': '=', 'tag': 'NFP',..."
143890867.11944.11926,1184126351.0,No more than two editors advocated deletion. ...,WilyD,143890867.11926.11926,143890867.11926.11926,False,False,0.068956,"[{'rt': 5, 'toks': [{'tok': 'No', 'tag': 'DT',..."
143902946.11991.11991,1184131325.0,In the future please don't close Afds when you...,2005,143890867.11944.11926,143890867.11926.11926,False,False,0.150349,"[{'rt': 6, 'toks': [{'tok': 'In', 'tag': 'IN',..."
143945536.12065.12065,1184153887.0,That simply isn't true. If you read the comme...,WilyD,143902946.11991.11991,143890867.11926.11926,False,False,0.228468,"[{'rt': 2, 'toks': [{'tok': 'That', 'tag': 'DT..."
144052463.12169.12169,1184189922.0,"Somehow, I suspect you may wish to participate...",WilyD,143890867.11926.11926,143890867.11926.11926,False,False,0.048606,"[{'rt': 3, 'toks': [{'tok': 'Somehow', 'tag': ..."
144065917.12226.12226,1184194629.0,"I assume your deliberate lying has a point, bu...",2005,144052514.12169.12169,143890867.11926.11926,False,True,0.607831,"[{'rt': 1, 'toks': [{'tok': 'I', 'tag': 'PRP',..."


In [12]:
convo_df.head(10)

id,meta.page_title,meta.page_id,meta.pair_id,meta.conversation_has_personal_attack,meta.verified,meta.pair_verified,meta.annotation_year,meta.split
146743638.12652.12652,User talk:2005,1003212,143890867.11926.11926,False,True,True,2018,train
143890867.11926.11926,User talk:2005,1003212,146743638.12652.12652,True,True,True,2018,train
127296808.516.516,User talk:Billzilla,10051321,144643838.1236.1236,False,True,True,2018,train
144643838.1236.1236,User talk:Billzilla,10051321,127296808.516.516,True,True,True,2018,train
66813686.23567.23567,Talk:Niger uranium forgeries,1005730,68000691.25417.25417,False,True,True,2018,train
68000691.25417.25417,Talk:Niger uranium forgeries,1005730,66813686.23567.23567,True,True,True,2018,train
95100799.58.76,Talk:Indo-Pakistani War of 1965,1006996,14969685.9097.9097,False,True,True,2018,train
14969685.9097.9097,Talk:Indo-Pakistani War of 1965,1006996,95100799.58.76,True,True,True,2018,train
158714043.10458.10458,User talk:Greswik,10094463,156725805.9367.0,False,True,True,2018,train
156725805.9367.0,User talk:Greswik,10094463,158714043.10458.10458,True,True,True,2018,train


In [13]:
# looks like the speakers have no metadata
speaker_df.head(10)

Sirex98
2005
WilyD
H20h0us391
Billzilla
AuburnPilot
The Evil Spartan
Morton devonshire
Commodore Sloat
Captainktainer


If you arrange your data in this format, you can easily generate a Corpus from it. For example, using the above DataFrames, we can re-generate the original corpus!

In [33]:
new_corpus = Corpus.from_pandas(utterances_df=utt_df, speakers_df=speaker_df, conversations_df=convo_df)

30021it [00:01, 20734.48it/s]


In [35]:
new_corpus.print_summary_stats()
# A Corpus has no to_csv() method; export one of its DataFrames instead
new_corpus.get_conversations_dataframe().to_csv('conversations.csv')

Number of Speakers: 8069
Number of Utterances: 30021
Number of Conversations: 4188

In fact, you can **generate corpora from utterance data only**. Let's consider the simplest case, where you have the primary data fields and nothing else.

(Note that 'id' does not have to be the DataFrame index; it can just be another column in your DataFrame.)
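The two layouts are easy to convert between with plain pandas. The toy data below is made up for illustration:

```python
import pandas as pd

# Utterance IDs stored in a regular column...
by_column = pd.DataFrame({
    "id": ["u1", "u2"],
    "text": ["Hello", "Hi"],
    "speaker": ["alice", "bob"],
})

# ...or as the index (named 'id'), as in the DataFrames shown earlier.
by_index = by_column.set_index("id")

# reset_index() turns the named index back into an ordinary column.
round_trip = by_index.reset_index()
print(round_trip.columns.tolist())
```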

In [17]:
# constructing a simple utterance DataFrame, you can ignore this
simple_utt_df = utt_df[['timestamp', 'text', 'speaker', 'reply_to', 'conversation_id']]
# reset_index() moves the 'id' index into a regular 'id' column
simple_utt_df = simple_utt_df.reset_index()

In [18]:
# what your basic utterance data might look like
simple_utt_df

,id,timestamp,text,speaker,reply_to,conversation_id
0,146743638.12652.12652,1185295934.0,== [WIKI_LINK: WP:COMMONNAME] ==\n,Sirex98,,146743638.12652.12652
1,146743638.12667.12652,1185277934.0,I notice that earier that moved wiki_link to ...,Sirex98,146743638.12652.12652,146743638.12652.12652
2,146842219.12874.12874,1185310317.0,"Chen was known in the poker world as ""William""...",2005,146743638.12667.12652,146743638.12652.12652
3,146860774.13072.13072,1185316241.0,I see what you saying I just read his pokersta...,Sirex98,146842696.12874.12874,146743638.12652.12652
4,143890867.11926.11926,1184144351.0,==List of slang terms for poker hands==\n,WilyD,,143890867.11926.11926
...,...,...,...,...,...,...
30016,132882061.154637.154637,1179906337.0,"Umm, please spell this out. Exactly how would ...",John Quiggin,132878058.154555.154555,132829693.152766.152766
30017,132930318.154870.154870,1179928761.0,"""conspiracy theories"", ""bizzare ideas"", and ""c...",Africangenesis,132829693.152766.152766,132829693.152766.152766
30018,132958763.155449.155449,1179937084.0,"Time and time again in the GW debate, refs tha...",Oren0,132930318.154870.154870,132829693.152766.152766
30019,133004162.155823.155685,1179949987.0,Huh? There are a dozen or so op-eds cited here...,John Quiggin,132958763.155449.155449,132829693.152766.152766


In [19]:
new_corpus = Corpus.from_pandas(simple_utt_df)

30021it [00:00, 43439.20it/s]


In [20]:
new_corpus.print_summary_stats()

Number of Speakers: 8069
Number of Utterances: 30021
Number of Conversations: 4188


We generated the same Corpus! The only difference is that because we excluded the conversations and speakers dataframes, these objects will have no metadata, whereas previously the conversations had metadata.
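If you want conversation-level metadata but only have utterance data, one option is to compute it from the utterances and build a conversations DataFrame yourself. The following is a sketch on made-up data; the metadata keys `num_utterances` and `first_timestamp` are hypothetical examples, not fields that ConvoKit defines.

```python
import pandas as pd

# Toy utterance data with two conversations.
toy_utt_df = pd.DataFrame({
    "id": ["a1", "a2", "a3", "b1"],
    "timestamp": [10, 20, 30, 15],
    "conversation_id": ["a1", "a1", "a1", "b1"],
})

# One row per conversation, indexed by conversation ID, with metadata
# stored in columns named meta.[keyname].
toy_convo_df = (
    toy_utt_df.groupby("conversation_id")
    .agg(**{
        "meta.num_utterances": ("id", "count"),
        "meta.first_timestamp": ("timestamp", "min"),
    })
    .rename_axis("id")
)
print(toy_convo_df)
```

The result mirrors the layout of `convo_df` above (ID as the index, metadata in `meta.` columns), so it could be supplied through the `conversations_df` argument in the same way as in the earlier `Corpus.from_pandas()` call.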

In [21]:
# before
corpus.get_conversations_dataframe().drop(columns=['vectors']).head()

id,meta.page_title,meta.page_id,meta.pair_id,meta.conversation_has_personal_attack,meta.verified,meta.pair_verified,meta.annotation_year,meta.split
146743638.12652.12652,User talk:2005,1003212,143890867.11926.11926,False,True,True,2018,train
143890867.11926.11926,User talk:2005,1003212,146743638.12652.12652,True,True,True,2018,train
127296808.516.516,User talk:Billzilla,10051321,144643838.1236.1236,False,True,True,2018,train
144643838.1236.1236,User talk:Billzilla,10051321,127296808.516.516,True,True,True,2018,train
66813686.23567.23567,Talk:Niger uranium forgeries,1005730,68000691.25417.25417,False,True,True,2018,train


In [22]:
# after
new_corpus.get_conversations_dataframe().drop(columns=['vectors']).head()

146743638.12652.12652
143890867.11926.11926
127296808.516.516
144643838.1236.1236
66813686.23567.23567


This concludes our short tutorial on how to generate ConvoKit corpora from pandas DataFrames. More details on `Corpus.from_pandas()` can be found in the documentation.

In [23]:

corpus.get_conversations_dataframe().to_csv('conversations.csv')