# Generating "Reply-to" data for reply comments

### John Burt


### Introduction:

The social media site Reddit is divided into many different communities called subreddits (subs). Each sub covers a specific topic or theme and tends to have regular users posting comments. The dataset I'm working with consists of comments to posts from several different subs.

### Notebook purpose:

This notebook contains code to add "replyto" data columns to the main dataset, based on the parent ID column in each sample. The process wasn't as simple as I thought, and it takes a long time to complete:

- Convert ID text labels to category numbers: make ID_n cols for post_ID, parent_ID, comment_ID
- Find all unique parent IDs.
- Remove parent IDs that are actually the post ID (top level comments).
- Iterate through each parent ID:
    - Get parent comment row.
    - Find all reply comments with that parent ID.
    - Set replyto info for each reply comment using parent comment 
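
As a rough illustration of the steps above, here is a minimal sketch on a made-up four-comment table. The `toy` dataframe and its values are hypothetical, and only the score is copied over; the real dataset has many more columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one post, p1, with two top-level comments
# (c1, c4) and two replies to c1 (c2, c3).
toy = pd.DataFrame({
    'post_ID':    ['p1', 'p1', 'p1', 'p1'],
    'comment_ID': ['c1', 'c2', 'c3', 'c4'],
    'parent_ID':  ['p1', 'c1', 'c1', 'p1'],
    'score':      [10, 3, -2, 5],
})

# Unique parent IDs, minus those that are really post IDs (top-level comments).
parent_ids = set(toy['parent_ID']) - set(toy['post_ID'])

# For each parent comment, copy its score onto every reply to it.
replyto_score = np.full(len(toy), np.nan)
parent_col = toy['parent_ID'].to_numpy()
for pid in parent_ids:
    parent_rows = toy[toy['comment_ID'] == pid]
    if parent_rows.empty:
        continue  # the parent comment is not in the sample
    idx, = np.where(parent_col == pid)
    replyto_score[idx] = parent_rows['score'].iloc[0]

toy['replyto_score'] = replyto_score
print(toy)
```

Rows for c2 and c3 end up with the score of c1, while the top-level comments keep NaN.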
    
I've also included a bit of code from an analysis of the computation speed of the different methods I tried. Hopefully this will be helpful for future projects requiring non-trivial searches through data matrices.

### Methods

The comment data used in this analysis was [acquired using PRAW](https://github.com/johnmburt/springboard/blob/master/capstone_1/reddit_collect_comments_v1.ipynb) from 12 subs. 8 of the subs are non-political, and 4 are political in nature. 


In [3]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

import pandas as pd
pd.options.display.max_columns = 100

import numpy as np
import datetime
import time
import csv
import glob


# source data folder 
srcdir = './data_labeled/'

# the subreddits I'll be analyzing
sub2use = ['aww', 'funny', 'todayilearned','askreddit',
           'photography', 'gaming', 'videos', 'science',
           'politics', 'politicaldiscussion',             
           'conservative', 'the_Donald']

# load all labelled CSVs
dfs = []
for subname in sub2use:
    pathname = srcdir+'comment_sample_'+subname+'_labeled.csv'
#     print('reading',pathname)
    tdf = pd.read_csv(pathname)
    dfs.append(tdf)

# combine all subreddit datasets into one  
df = pd.concat(dfs).drop_duplicates()

# remove any deleted or removed comments 
df = df[(df.text!='[deleted]') & (df.text!='[removed]')]

# drop samples with NaNs
df.dropna(inplace=True)

# drop duplicates
df = df.drop_duplicates(subset='comment_ID')

# reformat parent ids to match comment ids
df.parent_ID = df.parent_ID.str.replace('t1_','')
df.parent_ID = df.parent_ID.str.replace('t3_','')

print('\nTotal comment samples read:',df.shape[0])


Total comment samples read: 3251323


### Create an ID to category number mapping

I do this to help speed up the reply comment search below.

Post IDs are set to a category num of -1 to make it easier to remove them from the parent ID column.
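
To make the idea concrete, here is a small sketch of the same kind of mapping on invented IDs. The `toy` table and `id_map` name are hypothetical, and it uses a plain dict with `Series.map` rather than the exact code in the cell below.

```python
import pandas as pd

# Toy data (hypothetical IDs).
toy = pd.DataFrame({
    'post_ID':    ['p1', 'p1', 'p2'],
    'comment_ID': ['c1', 'c2', 'c3'],
    'parent_ID':  ['p1', 'c1', 'p2'],
})

# Every post ID maps to -1.
id_map = {pid: -1 for pid in toy['post_ID'].unique()}

# Every other ID (comment IDs plus any non-post parent IDs) gets 0, 1, 2, ...
comment_like = sorted((set(toy['comment_ID']) | set(toy['parent_ID'])) - set(id_map))
id_map.update({cid: n for n, cid in enumerate(comment_like)})

toy['parent_ID_n'] = toy['parent_ID'].map(id_map)
toy['comment_ID_n'] = toy['comment_ID'].map(id_map)

# Top-level comments are now simply the rows where parent_ID_n == -1.
print(toy)
```

With integer codes, "is this a top-level comment?" becomes a cheap `>= 0` test, and comparisons inside the search loop work on integers instead of strings.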

In [5]:
from collections import OrderedDict

# Note: to get ID strings from ID numbers, invert the dict:
# invert a dict: {v:k  for k,v in my_dict.items()}

# create a dict that makes all post IDs = -1
pid = list(df['post_ID'].unique())
id_dict = OrderedDict(zip(pid,[-1]*len(pid)))

# create list of parent IDs with post IDs (ie top level comments) removed
spo = set(pid)
spa = set(df['parent_ID'].values)
parentids = list(spa.difference(spo))

# create a list containing only unique comment IDs
allcomids = np.unique(parentids + list(df['comment_ID'].values))

# create a dict mapping each comment ID to a sequential category number
comiddict = OrderedDict(zip(allcomids,range(len(allcomids))))

# combine the post dict and the all comment dict
id_dict.update(comiddict)

# convert parent IDs to category values
df['parent_ID_n'] = [id_dict[x] for x in df['parent_ID']]

# convert comment IDs to category values
df['comment_ID_n'] = [id_dict[x] for x in df['comment_ID']]
                                            

### Create the new columns, fill them with null data

I will load these columns with reply-to data, then add them to the comment dataframe

In [124]:
numrows = df.shape[0]

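# object columns start as None, numeric columns as NaN
# (np.array([object]*numrows) would fill the array with the `object` class itself)
replyto_id = np.full(numrows, None, dtype=object)
replyto_score = np.full(numrows, np.nan)
replyto_pca_score = np.full(numrows, np.nan)
replyto_text = np.full(numrows, None, dtype=object)
replyto_num_replies = np.full(numrows, np.nan)
replyto_com_karma = np.full(numrows, np.nan)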

### Create a dataframe containing only comments that are parents to other comments. 

These are the comments I will iterate through to locate replies to them.
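
A minimal sketch of this selection step, assuming a toy table with hypothetical values. It uses `isin` instead of the index lookup in the cell below, so parent IDs that are missing from the sample are silently skipped.

```python
import pandas as pd

# Toy data (hypothetical): comment 0 has two replies, comment 2 has one.
toy = pd.DataFrame({
    'comment_ID_n': [0, 1, 2, 3],
    'parent_ID_n':  [-1, 0, 0, 2],
    'score':        [10, 3, -2, 5],
})

# Category numbers that appear as a parent, excluding -1 (post IDs).
parent_nums = toy.loc[toy['parent_ID_n'] >= 0, 'parent_ID_n'].unique()

# Keep only the comments whose own number is in that set.
toy_par = toy[toy['comment_ID_n'].isin(parent_nums)]
print(toy_par)
```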

In [123]:
df.reset_index(inplace=True)
df.set_index('comment_ID_n', inplace=True)

par_df = df.loc[np.unique(df['parent_ID_n'][df['parent_ID_n']>=0])]
par_df = par_df.drop_duplicates(subset='parent_ID_n')

df.reset_index(inplace=True)
par_df.reset_index(inplace=True)

print(par_df.shape, par_df['parent_ID_n'].unique().shape)

# pull the ID columns out as numpy arrays for the searches below
parent_ID_n = df['parent_ID_n'].values
# comment_ID is used by the sanity checks further down
comment_ID = df['comment_ID'].values



(649484, 28) (649484,)


### For each parent comment, find all replies to it
Note: this takes a really long time!


In [131]:
count = 0

# for par_idx, par_idn in enumerate(par_df['parent_ID_n']):
for paridx, parent in par_df.iterrows():
    idx, = np.where(parent_ID_n==parent['comment_ID_n'])
    replyto_id[idx] = parent['comment_ID']
    replyto_score[idx] = parent['score']
    replyto_pca_score[idx] = parent['pca_score'] 
    replyto_text[idx] = parent['text']
    replyto_num_replies[idx] = parent['num_replies']
    replyto_com_karma[idx] = parent['u_comment_karma']
    count += 1
    if not count % 5000:
        print(count,' ',end='')


5000  10000  15000  20000  25000  30000  35000  40000  45000  50000  55000  60000  65000  70000  75000  80000  85000  90000  95000  100000  105000  110000  115000  120000  125000  130000  135000  140000  145000  150000  155000  160000  165000  170000  175000  180000  185000  190000  195000  200000  205000  210000  215000  220000  225000  230000  235000  240000  245000  250000  255000  260000  265000  270000  275000  280000  285000  290000  295000  300000  305000  310000  315000  320000  325000  330000  335000  340000  345000  350000  355000  360000  365000  370000  375000  380000  385000  390000  395000  400000  405000  410000  415000  420000  425000  430000  435000  440000  445000  450000  455000  460000  465000  470000  475000  480000  485000  490000  495000  500000  505000  510000  515000  520000  525000  530000  535000  540000  545000  550000  555000  560000  565000  570000  575000  580000  585000  590000  595000  600000  605000  610000  615000  620000  625000  630000  635000  6400

In [136]:
print(df.shape,len(replyto_score),np.sum(~np.isnan(replyto_score)))

(3251323, 28) 3251323 1048906
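
For future reference, a possible alternative to the loop is a left self-join with `DataFrame.merge`. This is not the method used in this notebook and I haven't timed it on the full dataset. The sketch below uses a hypothetical toy table and assumes `comment_ID` is unique, which is what the earlier `drop_duplicates(subset='comment_ID')` is meant to ensure.

```python
import pandas as pd

# Toy data (hypothetical): c2 and c3 reply to c1; c1 and c4 are top level.
toy = pd.DataFrame({
    'comment_ID': ['c1', 'c2', 'c3', 'c4'],
    'parent_ID':  ['p1', 'c1', 'c1', 'p1'],
    'score':      [10, 3, -2, 5],
    'text':       ['parent', 'reply a', 'reply b', 'other'],
})

# Parent-side lookup table: one row per comment, columns renamed to replyto_*.
lookup = toy[['comment_ID', 'score', 'text']].rename(columns={
    'comment_ID': 'replyto_id',
    'score': 'replyto_score',
    'text': 'replyto_text',
})

# Left join: match each comment's parent_ID against the lookup's replyto_id.
# Comments whose parent is a post (or is missing) get NaN in the new columns.
merged = toy.merge(lookup, how='left',
                   left_on='parent_ID', right_on='replyto_id')
print(merged[['comment_ID', 'parent_ID', 'replyto_id', 'replyto_score']])
```

The join replaces one full-column scan per parent comment with a single hash-based match, which is why it may scale better.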


### Assign the reply-to np vectors to columns in df

In [137]:
# assign the replyto np vectors to columns in df

df['replyto_id'] = replyto_id
df['replyto_score'] = replyto_score
df['replyto_pca_score'] = replyto_pca_score
df['replyto_text'] = replyto_text
df['replyto_num_replies'] = replyto_num_replies
df['replyto_com_karma'] = replyto_com_karma


### Check the results

The algorithm is a bit complicated, so it's possible the output is wrong. I'd better double-check.

In [189]:
# do some sanity checking to make sure this is right

# select reply comments
reply_df = df[~df['replyto_score'].isnull()]

# every comment with replyto data: parent_ID should equal replyto_id
print('# parent_ID != replyto_id:',
      (reply_df['parent_ID'] != reply_df['replyto_id']).sum())

# spot-check a random sample: each reply's parent comment should
# belong to the same post and the same sub as the reply itself
numsamps = 1000 # number of samples to test
test_post = []
test_subname = []
for comidx, com in reply_df.sample(numsamps).iterrows():
    idx, = np.where(comment_ID==com['replyto_id'])  
    idx = idx[0]
    #  - same post?
    test_post.append(df['post_ID'].iloc[idx]!=com['post_ID'])
    #  - same sub?
    test_subname.append(df['sub_name'].iloc[idx]!=com['sub_name'])

print('# not same post ID:',np.sum(test_post))    
print('# not same subname:',np.sum(test_subname))   
print()
    
#  - reply text makes sense?
for comidx, com in reply_df.sample(5).iterrows():
    idx, = np.where(comment_ID==com['replyto_id'])  
    idx = idx[0]
    print('Parent index:',idx)
    print('------------------------')
    print('Parent comment:',df['text'].iloc[idx])
    print('Reply:',com['text'])
    print()
    


# parent_ID != replyto_id: 0
# not same post ID: 0
# not same subname: 0

Parent index: 527245
------------------------
Parent comment: Cops are not going to do shit for you, except make the situation worse.
Reply: They wouldn't need to since I was in the right. They also wouldnt do shit for the bar/restaurant. This would end in mine, the cops, and teh establishments time being wasted. Probably a ban from going there in the future. 

But no one is going to jail and that tab isn't being paid by OP.

For $500, I will hang out til the cops show up vs fighting a bouncer or waiter or running away on the tab. 

What would you suggest in this situation if you don't feel that the cops would be any help if it got to that point? Just pay? Run? Physically be confronted by an aggressive bouncer/owner when you try and leave without paying the disputed bill? 

IF it escalated beyond "I am not paying this amount." to "Fuck you aren't leaving til you pay and I will physically stop you if you try." the

In [187]:
# save the reply-enhanced df

if True:
    replypath = srcdir+'comment_sample_all_plus_replies.csv'
    df.to_csv(replypath,index=False)
    print('saved reply data to',replypath)

saved reply data to ./data_labeled/comment_sample_all_plus_replies.csv


### Timeit tests of different methods for searching for a value in a column

I've left this here for future reference.

#### idx = np.where(paridn==id)
- 532 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#### x = paridn[paridn == id]
- 539 ms ± 6.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#### x = df[df['parent_ID_n'] == id]['parent_ID_n']
- 819 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#### idx = (np.abs(paridn - id)).argmin()
- 2.91 s ± 152 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
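
The comparison above can be reproduced outside a notebook with the standard `timeit` module. This is only a sketch: the array is synthetic (random integers standing in for `parent_ID_n`), the function names are made up, and the absolute timings will differ from the ones listed above.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
paridn = rng.integers(0, 1000, size=100_000)  # synthetic stand-in for parent_ID_n
target = int(paridn[0])                       # a value known to be present

def with_where():
    # indices of every match
    idx, = np.where(paridn == target)
    return idx

def with_mask():
    # matching values via a boolean mask
    return paridn[paridn == target]

def with_argmin():
    # index of the first closest value (an exact match here)
    return int(np.abs(paridn - target).argmin())

for fn in (with_where, with_mask, with_argmin):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f'{fn.__name__}: {seconds * 1e3:.3f} ms per call')
```

Note that the `argmin` variant returns only a single index, so it isn't a drop-in replacement when a parent has several replies.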



In [36]:
# %%timeit

def find_nearest(array, value):
    array = np.asarray(array)
    return (np.abs(array - value)).argmin()

paridnlist = df['parent_ID_n'].unique()

# paridn = df['parent_ID_n'].values
paridn = np.asarray(df['parent_ID_n'].values)

idxlist = []
x = 0
for id in df['parent_ID_n'].unique():
#     print(df[df['parent_ID_n'] == id]['parent_ID_n'],end='')
#     x = df[df['parent_ID_n'] == id]['parent_ID_n']
#     x = paridn[paridn == id]
#     idx = find_nearest(paridn, id)
#     idx = (np.abs(paridn - id)).argmin()
    idx, = np.where(paridn==id)
#     print(id,idx.shape,list(idx))
    idxlist.extend(list(idx))


KeyboardInterrupt: 

### Old method, extremely slow:

Left for future reference

In [None]:
# set replyto cols only for replies to negative comments
# df_neg = df[df.score<0]
# for index, comment in df_neg.iterrows():
df_withreply = df[df['num_replies']>0]
print('# with replies =',df_withreply.shape[0])
count=0
for index, comment in df_withreply.iterrows():
#     df['replyto_id'].loc[comment['comment_ID_n']] = comment['comment_ID']
#     df['replyto_score'].loc[comment['comment_ID_n']] = comment['score']
#     df['replyto_pca_score'].loc[comment['comment_ID_n']] = comment['pca_score']
#     df['replyto_text'].loc[comment['comment_ID_n']] = comment['text']
#     break
    try:
        df.loc[comment['comment_ID_n']]['replyto_id'] = comment['comment_ID']
        df.loc[comment['comment_ID_n']]['replyto_score'] = comment['score']
        df.loc[comment['comment_ID_n']]['replyto_pca_score'] = comment['pca_score']
        df.loc[comment['comment_ID_n']]['replyto_text'] = comment['text']
    except:
        pass

#     replies = df['parent_ID']==comment['comment_ID']
#     if replies.any():
# #         print(comment['comment_ID'],comment['score'])
#         df['replyto_id'][replies] = comment['comment_ID']
#         df['replyto_score'][replies] = comment['score']
#         df['replyto_pca_score'][replies] = comment['pca_score']
#         df['replyto_text'][replies] = comment['text']
    count += 1
    if not count % 5000: print(count,' ',end='')


# with replies = 1159680
5000  