This notebook preprocesses the data for modeling in the next notebook.

First, I import the necessary libraries.

networkx is a graph package for Python that makes it easy to represent and analyze data in graph form.

In [1]:
import pandas as pd
import numpy as np
import networkx as nx

I load the posts and comments from previous notebooks, and filter them. I keep only posts with an upvote ratio of at least 0.75, the hypothesis being that a higher ratio of upvotes indicates that the post is constructive, or at the very least contributes to the collective hype around a stock.

I also remove megathreads, both because they were too large to mine comments from efficiently, and because they are very spam-heavy and the opening post gives them no specific direction.

In [53]:
posts = pd.read_csv('posts.csv',index_col=0)
comments = pd.read_csv('comments.csv',index_col=0)

In [3]:
posts = posts[posts.upvote_ratio >= 0.75]

In [None]:
posts.nlargest(50,'num_comments_true')

In [4]:
posts = posts[~posts.index.isin(['l6i12n','l5ne0q','l4lmrx','l5xpai','l7iorh','lm7n51','lq0l68','la5s8i','lcdspa','lbm3vr','lab86a',
                                'l0hhqg','lce3mf','lg0mn2','ld4yet','l74zgc','lgrxxk','lf9rdy','lcoe7l','lb7rg4',
                                'lbyf09','lhk9iv'])]

I also remove comments with no indicated author, and restrict them to ones whose parent post is still in the posts dataframe.

In [56]:
comments = comments[comments.link_id.isin(posts.index)]

In [57]:
comments = comments[~comments.author.isna()]

These are the final shapes of both dataframes.

In [58]:
comments.shape

(2192225, 9)

In [59]:
posts.shape

(289821, 15)

Here, I set the index of comments to be the unique comment id.

In [60]:
comments = comments.set_index('id')

In [61]:
comments.head()

id,author,parent_id,link_id,subreddit,body,score,permalink,created_utc
g3jg6ae,MnVikingsFan34,ikazok,ikazok,wallstreetbets,About as likely as me regaining my missing chr...,69,/r/wallstreetbets/comments/ikazok/what_are_the...,1598924000.0
g3jgdna,hedgeAgainst,ikazok,ikazok,wallstreetbets,Not happening. AMZN is the new BRK.A.,93,/r/wallstreetbets/comments/ikazok/what_are_the...,1598924000.0
g3jgg4f,RainMan214,ikb0s0,ikb0s0,wallstreetbets,Holy shit,2,/r/wallstreetbets/comments/ikb0s0/did_i_win_39...,1598924000.0
g3jgho0,hedgeAgainst,g3jg6ae,ikazok,wallstreetbets,Musk will reveal his new gene therapy next month.,2,/r/wallstreetbets/comments/ikazok/what_are_the...,1598924000.0
g3jgjio,RainMan214,ikazok,ikazok,wallstreetbets,Why not wait till he announces the split? If i...,1,/r/wallstreetbets/comments/ikazok/what_are_the...,1598924000.0


The following code handles comments whose parent comment is not in the dataframe by re-pointing their parent_id at the parent post. This project will not involve analyzing sentiment, so losing the exact reply target won't be problematic for analysis.

In [66]:
index = comments.index

In [9]:
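def in_index(x):
    # Row-wise helper: re-point a comment at its post when its parent comment is missing
    if x['parent_id'] == x['link_id']:
        return x  # already a top-level comment
    if x['parent_id'] not in index:
        x['parent_id'] = x['link_id']
    return x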

In [67]:
ids = comments[comments.parent_id != comments.link_id]

In [70]:
len(ids)

121378

In [69]:
ids = ids[~ids.parent_id.isin(index)].index

In [71]:
comments.loc[ids,'parent_id'] = comments.loc[ids,'link_id']
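
The re-parenting step above can be checked on a toy table. This is only a sketch: the ids (`p1`, `c1`, `cX`, ...) and the names `toy_comments`, `is_reply` and `is_orphan` are made up for illustration and are not part of the real data.

```python
import pandas as pd

# Toy comments table indexed by comment id (all ids are hypothetical).
toy_comments = pd.DataFrame(
    {
        "parent_id": ["p1", "c1", "cX", "p2"],
        "link_id": ["p1", "p1", "p1", "p2"],
    },
    index=pd.Index(["c1", "c2", "c3", "c4"], name="id"),
)

# Replies to another comment: the parent differs from the post id.
is_reply = toy_comments["parent_id"] != toy_comments["link_id"]
# Replies whose parent comment is missing from the table ("cX" here).
is_orphan = is_reply & ~toy_comments["parent_id"].isin(toy_comments.index)

# Point orphaned replies at their post instead.
toy_comments.loc[is_orphan, "parent_id"] = toy_comments.loc[is_orphan, "link_id"]
print(toy_comments)
```

Only `c3` is rewritten; `c2` keeps its parent `c1` because that comment is present in the index.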

In [17]:
posts.to_csv('posts_ready.csv')
comments.to_csv('comments_ready.csv')

In [2]:
posts = pd.read_csv('posts_ready.csv',index_col=0)
comments = pd.read_csv('comments_ready.csv',index_col=0)

In [3]:
posts.head()

id,author,subreddit,selftext,num_comments,score,title,permalink,link_flair_css_class,created_utc,stock,num_comments_true,score_true,upvote_ratio
il81vu,JackMaverick7,wallstreetbets,,2,1,$AMC up 15% today already and labor day weeken...,/r/wallstreetbets/comments/il81vu/amc_up_15_to...,question,1599058000.0,AMC,1.0,1.0,1.0
il848o,JackMaverick7,wallstreetbets,,2,1,$AMC up 15% today already and big viewing numb...,/r/wallstreetbets/comments/il848o/amc_up_15_to...,question,1599058000.0,AMC,1.0,1.0,1.0
ilhn1t,Wiletj1,wallstreetbets,[removed],2,1,Good Expiration Date for Shorting AMC,/r/wallstreetbets/comments/ilhn1t/good_expirat...,question,1599088000.0,AMC,1.0,1.0,1.0
ipu68n,kushkiller1,wallstreetbets,[removed],2,1,AMC $7 Calls 9/18? Going up?,/r/wallstreetbets/comments/ipu68n/amc_7_calls_...,stocks,1599703000.0,AMC,1.0,1.0,1.0
iqqyc6,kushkiller1,wallstreetbets,[removed],2,1,AMC CALLS? 80% of the theaters have reopened!,/r/wallstreetbets/comments/iqqyc6/amc_calls_80...,dd,1599833000.0,AMC,1.0,1.0,1.0


Here, I create a new 'adjacent' column in the posts dataframe that lists, for each post, the ids of the posts it is adjacent to, joined by '|'. I define post B as adjacent to post A when someone who commented on A authored B, B mentions at least one of the same stocks as A, and B was created within 24 hours after A was created. This will be valuable for identifying information cascades consisting of multiple post-comment trees.

In [None]:
count = 0
for i,row in posts.iterrows():
    # Progress indicator for this slow, row-by-row loop
    count += 1
    if count%100 == 0:
        print(count)
    temp_list = ''
    if row.num_comments_true == 0:
        continue
    row_stocks = row.stock.split('|')
    comments_post = comments[comments.link_id == i]
    # One pass per comment, so an author who commented several times is visited several times
    for author in comments_post.author:
        author_posts = posts[posts.author == author]
        for j,post in author_posts.iterrows():
            # The candidate post must mention at least one of the same stocks...
            if any(stock in post.stock.split('|') for stock in row_stocks):
                # ...and be created within 24 hours (86400 s) after this post
                if post.created_utc > row.created_utc and post.created_utc - row.created_utc < 86400:
                    temp_list = temp_list + j + '|'
    posts.loc[i,'adjacent'] = temp_list[:-1]
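
The adjacency rule can also be written so that each commenter is considered only once per post, which avoids repeated ids in the result. The following is a sketch on toy data, not the code that produced the results in this notebook; `toy_posts`, `toy_comments`, `DAY` and `adjacent_posts` are hypothetical names, and unlike the loop above it writes an empty string (rather than leaving NaN) for posts without comments.

```python
import pandas as pd

# Toy data (hypothetical ids, authors and timestamps in seconds).
toy_posts = pd.DataFrame(
    {
        "author": ["alice", "bob", "bob", "carol"],
        "stock": ["GME|AMC", "GME", "TSLA", "AMC"],
        "created_utc": [0, 3600, 7200, 100000],
    },
    index=pd.Index(["p1", "p2", "p3", "p4"], name="id"),
)
toy_comments = pd.DataFrame(
    {
        "author": ["bob", "bob", "carol"],  # bob commented twice on p1
        "link_id": ["p1", "p1", "p1"],
    }
)

DAY = 86400  # 24 hours in seconds


def adjacent_posts(post_id):
    """Return '|'-joined ids of posts adjacent to post_id."""
    row = toy_posts.loc[post_id]
    stocks = set(row.stock.split("|"))
    # unique() ensures each commenter is considered once.
    commenters = toy_comments.loc[toy_comments.link_id == post_id, "author"].unique()
    candidates = toy_posts[toy_posts.author.isin(commenters)]
    # Keep posts created strictly after this one and less than 24 hours later.
    delta = candidates.created_utc - row.created_utc
    in_window = candidates[(delta > 0) & (delta < DAY)]
    # Keep posts that share at least one stock ticker.
    keep = [pid for pid, s in in_window.stock.items() if stocks & set(s.split("|"))]
    return "|".join(keep)


toy_posts["adjacent"] = [adjacent_posts(i) for i in toy_posts.index]
print(toy_posts["adjacent"])
```

Here `p1` is adjacent only to `p2`: `p3` shares no stock with it, and `p4` falls outside the 24-hour window.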

In [12]:
posts.to_csv('posts_temp_july.csv')

In [55]:
posts['adjacent'].value_counts()

l7mf0h|l7t5vq                                                            28
ldbb61                                                                   18
l5bqh1                                                                   18
l6bvz7                                                                   17
k2gzuu                                                                   16
                                                                         ..
la5914|la5pr4|la5uji|la76f4|la9ea6|laak5l|lan5qr|lan7a3|laofjc|labpkj     1
l2bbnw                                                                    1
isot9i|isot9i|isot9i                                                      1
k2419k                                                                    1
lb518s                                                                    1
Name: adjacent, Length: 18413, dtype: int64

In [75]:
posts.to_csv('posts_final_save.csv')
comments.to_csv('comments_final_save.csv')

Here, I add a date column to both the posts and comments dataframes. For each comment, I also add the author of its parent post (column 'link') and that post's stock information (column 'stock').

In [72]:
posts['date'] = pd.to_datetime(posts['created_utc'],errors='coerce',unit='s').dt.strftime("%Y-%m-%d")
comments['date'] = pd.to_datetime(comments['created_utc'],errors='coerce',unit='s').dt.strftime("%Y-%m-%d")
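
As a small check of the date conversion, here is a sketch on two toy values (the Series name `toy_created` is made up). With `errors='coerce'`, a missing timestamp becomes NaT and ends up as a missing value rather than a date string.

```python
import pandas as pd

# One real-looking Unix timestamp (seconds) and one missing value.
toy_created = pd.Series([1598924000.0, float("nan")])

dates = pd.to_datetime(toy_created, errors="coerce", unit="s").dt.strftime("%Y-%m-%d")
print(dates)
```

The first value falls on 2020-09-01 (UTC); the second stays missing.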

In [73]:
comments['link'] = comments['link_id'].apply(lambda x: posts.loc[x,'author'])
comments['stock'] = comments['link_id'].apply(lambda x: posts.loc[x,'stock'])
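
The same lookups can be expressed with `Series.map`, which avoids calling `posts.loc` once per comment. This is a sketch on toy frames (`toy_posts` and `toy_comments` are hypothetical), and it assumes the post ids in the index are unique. One difference in behaviour: a `link_id` that is missing from the posts index yields NaN with `map`, whereas `posts.loc` raises a `KeyError`.

```python
import pandas as pd

toy_posts = pd.DataFrame(
    {"author": ["alice", "bob"], "stock": ["GME|AMC", "TSLA"]},
    index=pd.Index(["p1", "p2"], name="id"),
)
toy_comments = pd.DataFrame(
    {"link_id": ["p1", "p2", "p1"]},
    index=pd.Index(["c1", "c2", "c3"], name="id"),
)

# Map each comment's post id to that post's author and stock string.
toy_comments["link"] = toy_comments["link_id"].map(toy_posts["author"])
toy_comments["stock"] = toy_comments["link_id"].map(toy_posts["stock"])
print(toy_comments)
```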

In [74]:
posts = posts.sort_values('created_utc')
comments = comments.sort_values('created_utc')

Next, I create the graph object. It is a directed graph in which each node is a post or a comment. There are two types of directed edges: post -> comment and comment -> comment edges, indicating a direct reply, and post -> post edges, indicating 'adjacent' posts as defined earlier.

In [4]:
posts_graph = nx.DiGraph()

In [5]:
for i,row in posts.iterrows():
    posts_graph.add_node(i,author=row.author,stocks=row.stock.split('|'),time=row.created_utc,date=row.date,score=row.score_true,posttype='post')

In [6]:
for i,row in comments.iterrows():
    posts_graph.add_node(i,author=row.author,stocks=row.stock.split('|'),time=row.created_utc,date=row.date,score=row.score,posttype='comment')

In [None]:
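count = 0
for i,row in comments.iterrows():
    count += 1
    if count % 100 == 0:
        print(count)
    # Weight is the parent's time minus the reply's time, so it is negative whenever the reply came later
    if row.link_id == row.parent_id:
        # Top-level comment: the parent is the post itself
        weight = posts.loc[row.link_id,'created_utc'] - row.created_utc
    else:
        # Reply to another comment
        weight = comments.loc[row.parent_id,'created_utc'] - row.created_utc
    posts_graph.add_edge(row.parent_id,i,weight=weight)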

In [80]:
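for i,row in posts.iterrows():
    # Skip posts with no adjacency list: NaN (never assigned) or an empty string.
    # pd.isna is used because `row.adjacent == np.NaN` is always False.
    if pd.isna(row.adjacent) or len(row.adjacent) == 0:
        continue
    for adj in row.adjacent.split('|'):
        # Seconds from this post to the adjacent one (positive, unlike the reply-edge weights)
        weight = posts.loc[adj,'created_utc'] - row.created_utc
        posts_graph.add_edge(i,adj,weight=weight)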
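
To make the two edge types concrete, here is a toy graph with both. It is only an illustration: the node ids are made up, and the `kind` edge attribute is an assumption added for readability; the graph built in this notebook stores only `weight` on its edges.

```python
import networkx as nx

g = nx.DiGraph()
g.add_node("p1", posttype="post")
g.add_node("p2", posttype="post")
g.add_node("c1", posttype="comment")
g.add_node("c2", posttype="comment")

# Reply edges: post -> comment and comment -> comment.
g.add_edge("p1", "c1", kind="reply")
g.add_edge("c1", "c2", kind="reply")
# Adjacency edge: post -> post.
g.add_edge("p1", "p2", kind="adjacent")

# Everything reachable from p1, i.e. a cascade that spans both edge types.
cascade = nx.descendants(g, "p1")
print(sorted(cascade))
```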

After constructing the graph, I check the node count and edge count. This is a very sparse graph, with slightly more nodes (about 2.48 million) than edges (about 2.28 million).

In [81]:
len(posts_graph.nodes())

2482046

In [82]:
len(posts_graph.edges())

2282869
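
Sparsity can also be quantified directly. The sketch below uses a made-up toy graph rather than `posts_graph`; `nx.density` and `nx.number_weakly_connected_components` are standard networkx functions, and for a directed graph the density is the edge count divided by n(n-1).

```python
import networkx as nx

# Toy graph: two small reply trees and one isolated post.
g = nx.DiGraph()
g.add_edges_from([("p1", "c1"), ("c1", "c2"), ("p2", "c3")])
g.add_node("p3")

n, m = g.number_of_nodes(), g.number_of_edges()
density = nx.density(g)  # m / (n * (n - 1)) for a directed graph
components = nx.number_weakly_connected_components(g)
print(n, m, density, components)
```

On the full graph the same calls would show how far the density is from 1 and how many separate post-comment clusters exist, although the component count may take a while at this size.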

Finally, I save the graph as a pickle object for the analysis in the next notebook.

In [83]:
nx.write_gpickle(posts_graph,open('posts_graph_connected.pkl','wb'))
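
Note that `nx.write_gpickle` was removed in NetworkX 3.0, so the cell above only runs on older versions. A minimal sketch of the equivalent with the standard `pickle` module follows; the toy graph and the temporary file path are placeholders, not the notebook's actual graph or output file.

```python
import os
import pickle
import tempfile

import networkx as nx

# Toy stand-in for posts_graph.
g = nx.DiGraph()
g.add_edge("p1", "c1", weight=-60.0)

# Write to a temporary location so the sketch leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "toy_graph.pkl")
with open(path, "wb") as f:
    pickle.dump(g, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back, as the next notebook would.
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored.number_of_nodes(), restored.number_of_edges())
```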