<h1>Network of related subreddits where active authors also posted (Coronavirus Anti Lockdown)</h1>

Objective: 
- Identify which Reddit authors post most actively about the Coronavirus Anti Lockdown topic, the most popular submission (post) on that topic, and the related subreddits where the same group of active users also posts

Findings:
- Most active user in the Coronavirus Anti Lockdown subreddit: signed7
- Most popular submission (post): 10,000 anti-lockdown protesters gather in London to claim coronavirus is ‘a hoax’
- Subreddits where the group of active users posts most often: Coronavirus, Worldnews, News

In [1]:
# https://praw.readthedocs.io/en/latest/code_overview/models/submission.html
# https://www.reddit.com/r/redditdev/comments/rhrz9f/404_response_using_literally_the_code_in_the_docs/

In [2]:
# pip install praw

In [3]:
import praw
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx

In [4]:
# Set up Connection with Reddit

reddit = praw.Reddit(client_id='RyUj7x3RzmFJXAoPLzYCYw',
                     client_secret='KbKj5ps21RQ_hZPJF1l7p1FGEDsyRA',
                     user_agent='IS434_JoeyLau',
                     username='joeylau2000',
                     password='')
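
Hard-coding API credentials in a shared notebook exposes them to anyone who can read the file. A minimal sketch of one alternative is to read them from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are assumptions made for illustration; they are not part of PRAW.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
_ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def load_reddit_credentials(environ=None):
    """Collect the keyword arguments for praw.Reddit from the environment."""
    environ = os.environ if environ is None else environ
    missing = [var for var in _ENV_KEYS.values() if var not in environ]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: environ[var] for kwarg, var in _ENV_KEYS.items()}
```

With the variables set, the connection could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.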

In [5]:
# Read the CSV file into a DataFrame

anti_lockdown_data = pd.read_csv('./Reddit_Data/Reddit_Coronavirus_Anti-Lockdown_100.csv')

FileNotFoundError: [Errno 2] File ../Reddit_Data/Reddit_Coronavirus_Anti-Lockdown_100.csv does not exist: '../Reddit_Data/Reddit_Coronavirus_Anti-Lockdown_100.csv'
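
The `FileNotFoundError` above reports `../Reddit_Data/...` while the cell reads from `./Reddit_Data/...`, which suggests the correct relative path depends on the directory the notebook is launched from. A small sketch that tries both locations before loading is given below. The helper name `find_data_file` and the list of candidate directories are assumptions.

```python
from pathlib import Path


def find_data_file(filename, search_dirs=("./Reddit_Data", "../Reddit_Data")):
    """Return the first existing path to `filename` among `search_dirs`."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    tried = ", ".join(str(Path(d) / filename) for d in search_dirs)
    raise FileNotFoundError(f"{filename} not found; tried: {tried}")
```

The result can be passed straight to pandas, for example `pd.read_csv(find_data_file('Reddit_Coronavirus_Anti-Lockdown_100.csv'))`.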

In [None]:
# Sort Data Frame based on highest score

anti_lockdown_data_by_score = anti_lockdown_data.sort_values("score", ascending=False)

In [None]:
# Preview Data Frame

anti_lockdown_data_by_score

In [None]:
# Get the title of the submission (post) with the highest score (upvotes minus downvotes)
# The frame is sorted by score in descending order, so the first row is the top post

anti_lockdown_data_by_score.iloc[0]['title']
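
The top-scoring row can also be located without relying on a particular index label or on the frame being sorted. A minimal sketch with made-up data (the titles, scores, and index labels below are invented for illustration):

```python
import pandas as pd

# Made-up rows standing in for the scraped submissions.
posts = pd.DataFrame(
    {
        "title": ["post a", "post b", "post c"],
        "score": [120, 950, 40],
    },
    index=[7, 3, 9],  # arbitrary labels, as after filtering or sorting
)

# idxmax returns the index label of the highest score,
# so the lookup works whatever the row order is.
top_title = posts.loc[posts["score"].idxmax(), "title"]
print(top_title)
```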

In [None]:
# Count unique authors (out of 100 posts)

anti_lockdown_data_by_score.author.nunique()

In [None]:
# Relationship between score and number of comments

ax = anti_lockdown_data_by_score.plot('score', 'comms_num', kind='scatter', logx=True, logy=True, title='Scatter plot between Score and Number of Comments')
ax.set(xlabel="Score", ylabel="Number of comments")
plt.savefig("./Reddit_Output/ScatterPlot", dpi=150, bbox_inches='tight', pad_inches=0.5)

In [None]:
# Only take users who posted more than once

repeating = anti_lockdown_data_by_score[anti_lockdown_data_by_score.duplicated(['author'], keep=False)] 

In [None]:
repeating

In [None]:
len(repeating)

In [None]:
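# Remove deleted users (saved as the string 'None'; some pandas versions may read it back as NaN)

repeating = repeating[repeating.author.notna() & (repeating.author != 'None')]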

In [None]:
# Out of 100 posts, this is the number of authors who posted more than once

repeating.author.nunique() 
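
The filtering steps above can be checked on a toy frame. This sketch uses invented author names and assumes deleted accounts are stored as the string `'None'`:

```python
import pandas as pd

posts = pd.DataFrame({"author": ["ann", "bob", "ann", "None", "None", "cy"]})

# keep=False marks every row whose author occurs more than once,
# not just the second and later occurrences.
repeating = posts[posts.duplicated(["author"], keep=False)]

# Deleted accounts are stored as the string 'None' in this toy data.
repeating = repeating[repeating.author != "None"]

# Number of distinct authors with more than one post.
n_repeat_authors = repeating.author.nunique()
```

Here only `ann` remains: `bob` and `cy` posted once, and the two `'None'` rows are dropped even though they count as duplicates of each other.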

In [None]:
# See the distribution of authors and their posts

ax = repeating.author.value_counts().plot(kind='bar',title='Distribution of authors and their posts') 
ax.set(xlabel="Authors", ylabel="Number of posts")
plt.savefig("./Reddit_Output/Bargraph",dpi=150, bbox_inches='tight',pad_inches=0.5)

In [None]:
# Compile a list of authors who appear more than once among the subreddit's top posts of all time
# (used for the network graph and for the get_user_posts function)

u_authors = list(repeating.author.unique()) 

In [None]:
def get_user_posts(author, n):
    
    redditor = reddit.redditor(author)
    user_posts_list = []
    
    for submission in redditor.submissions.top(limit = n):
        info_list = []
        info_list.append(submission.id)
        info_list.append(submission.score)
        info_list.append(str(submission.author))
        info_list.append(submission.num_comments)
        info_list.append(str(submission.subreddit))
        user_posts_list.append(info_list)
    
    a = sorted(user_posts_list, key=lambda x: x[1], reverse = True)
    user_posts_df = pd.DataFrame(a)
    return user_posts_df 
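
Because `get_user_posts` calls the Reddit API, it is awkward to check without network access. Below is a minimal sketch of the same row-building logic applied to stand-in objects. The helper `rows_from_submissions`, the `COLUMNS` list, and the `SimpleNamespace` stubs are hypothetical and only mimic the attributes the function reads.

```python
from types import SimpleNamespace

import pandas as pd

COLUMNS = ["id", "score", "author", "num_comments", "subreddit"]


def rows_from_submissions(submissions):
    """Build a score-sorted DataFrame from submission-like objects."""
    rows = [
        (s.id, s.score, str(s.author), s.num_comments, str(s.subreddit))
        for s in submissions
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return pd.DataFrame(rows, columns=COLUMNS)


# Stand-ins exposing only the attributes used above.
fake = [
    SimpleNamespace(id="a1", score=5, author="ann", num_comments=2, subreddit="news"),
    SimpleNamespace(id="b2", score=9, author=None, num_comments=0, subreddit="worldnews"),
]
user_posts_df = rows_from_submissions(fake)
```

In the notebook, `redditor.submissions.top(limit=n)` would be passed in place of `fake`. The stub with `author=None` shows how a missing author turns into the string `'None'` through `str(...)`.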

In [None]:
# Loop through every "influencer" user and get their 10 top posts
frames = [get_user_posts(u, 10) for u in u_authors]
authors_df = pd.concat(frames) if frames else pd.DataFrame()
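
A request for a single account may fail, for example if the account has since been suspended or removed, and one failure would stop the whole loop. Below is a hedged sketch of a loop that records failures and carries on. `collect_posts` and its `fetch` and `errors` parameters are hypothetical names, and the exception types worth catching depend on the PRAW version in use.

```python
def collect_posts(authors, fetch, errors=(Exception,)):
    """Call fetch(author) for each author, skipping those that raise."""
    results, failed = [], []
    for author in authors:
        try:
            results.append(fetch(author))
        except errors:
            # Remember who failed so the gap is visible afterwards.
            failed.append(author)
    return results, failed


def demo_fetch(author):
    """Toy fetch function: 'ghost' simulates an account that cannot be read."""
    if author == "ghost":
        raise LookupError(author)
    return f"posts of {author}"


results, failed = collect_posts(["ann", "ghost", "bob"], demo_fetch)
```

In the notebook this could be used as `frames, failed = collect_posts(u_authors, lambda u: get_user_posts(u, 10))`, followed by `pd.concat(frames)`.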

In [None]:
# Rename the positional columns to descriptive names
authors_df = authors_df.rename(columns={0: "id", 1: "score", 2: "author", 3: "num_comments", 4: "subreddit"})

In [None]:
# Dataframe of other subreddits where authors posted 

authors_df.head(10) 

In [None]:
counts = authors_df['subreddit'].value_counts()
# Only plot the subreddits that appear more than twice
ax = counts[counts > 2].plot(kind='bar', title='Distribution of other subreddits where influencers post')
ax.set(xlabel="Subreddits", ylabel="Number of posts")
plt.savefig("./Reddit_Output/BargraphSubreddits", dpi=150, bbox_inches='tight', pad_inches=0.5)

In [None]:
# Create a dataframe for network graph 

n_df = authors_df[['author', 'subreddit']] 
n_df.head()

In [None]:
# Not a very meaningful graph

g = nx.from_pandas_edgelist(n_df, source='author', target='subreddit') 
nx.draw(g)

In [None]:
# Make list of unique subreddits to use in network graph 

subs = list(n_df.subreddit.unique()) 

In [None]:
plt.figure(figsize=(18, 18))

# Create the graph from the dataframe
g = nx.from_pandas_edgelist(n_df, source='author', target='subreddit') 

# Create a layout for nodes 
layout = nx.spring_layout(g,iterations=50,scale=2)

# Draw the graph piece by piece:
# - Edges are thin and grey
# - Subreddits appear in light blue, sized by their number of connections, and are the ONLY labelled nodes
# - Influencers appear small and grey; those connected to more than one subreddit are highlighted in orange

# Go through every subreddit and ask the graph how many connections it has.
# Multiply that by 80 to get the circle size
sub_size = [g.degree(sub) * 80 for sub in subs]
nx.draw_networkx_nodes(g, 
                       layout, 
                       nodelist=subs, 
                       node_size=sub_size, # a LIST of sizes, based on g.degree
                       node_color='lightblue')

# Draw all the influencers (authors)
nx.draw_networkx_nodes(g, layout, nodelist=u_authors, node_color='#cccccc', node_size=100)

# Draw highly connected influencers 
popular_people = [person for person in u_authors if g.degree(person) > 1]
nx.draw_networkx_nodes(g, layout, nodelist=popular_people, node_color='orange', node_size=100)

nx.draw_networkx_edges(g, layout, width=1, edge_color="#cccccc")

node_labels = dict(zip(subs, subs)) #labels for subs
nx.draw_networkx_labels(g, layout, labels=node_labels)

# No axis needed
plt.axis('off')
plt.title("Network Graph of Related Subreddits")
plt.savefig("../Reddit_Output/NetworkGraph_Corona_Anti_Lockdown", bbox_inches='tight', pad_inches=0.5)
plt.show()
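
The sizing and highlighting rules used in the plot can be checked on a toy edge list without drawing anything. This is a sketch with invented authors and subreddits. One thing it shows is that an undirected `nx.Graph` stores each author-subreddit pair once, so a node's degree counts distinct neighbours rather than posts.

```python
import networkx as nx
import pandas as pd

# Invented edge list; "ann" posts twice in "news" on purpose.
edges = pd.DataFrame(
    {
        "author": ["ann", "ann", "ann", "bob", "cy"],
        "subreddit": ["news", "news", "worldnews", "news", "news"],
    }
)
g = nx.from_pandas_edgelist(edges, source="author", target="subreddit")

subs = list(edges.subreddit.unique())
# Degree of a subreddit node = number of distinct authors linked to it.
sub_size = {sub: g.degree(sub) * 80 for sub in subs}
# Authors linked to more than one subreddit.
popular_people = [a for a in edges.author.unique() if g.degree(a) > 1]
```

Here `news` has three distinct authors and `worldnews` one, and only `ann` counts as highly connected, even though she has three rows in the edge list.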