## Collecting subreddit data
This notebook collects the usernames from all posts and comments on a subreddit, along with those users' activity in other subreddits. To see the code behind the functions used here, look at the collector module.

In [1]:
import time, math
import dill
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio 
from redditalyzer import collector
import chart_studio

# If you are using vscode to render the chart
pio.renderers.default = "vscode"

Run the next two code blocks to collect the usernames of Reddit users who commented on the subreddit within a given time range. The default range is the last 30 days (30d). Change it with the start_date parameter of the request_pushift function; accepted values are listed in the Pushshift API documentation. You should see the printed numbers increase as the script collects more usernames.

To collect usernames from submissions as well, change the endpoint parameter in the request_pushift calls.
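As a rough illustration of what the time range and endpoint settings select, the sketch below builds a Pushshift search URL by hand. `build_pushshift_url` is a hypothetical helper, not part of the collector module, and the mapping of start_date onto Pushshift's `after` query parameter is an assumption about how the collector works.

```python
from urllib.parse import urlencode

# Base URL of the public Pushshift search API (assumed to be what the collector targets).
PUSHSHIFT_BASE = "https://api.pushshift.io/reddit/search"


def build_pushshift_url(subreddit, endpoint="comment", start_date="30d", size=100):
    """Hypothetical helper: build a Pushshift search URL for one page of results.

    start_date is passed as Pushshift's `after` parameter (an assumption);
    it can be a relative value such as "30d" or an epoch timestamp.
    """
    if endpoint not in ("comment", "submission"):
        raise ValueError("endpoint must be 'comment' or 'submission'")
    params = {"subreddit": subreddit, "after": start_date, "size": size, "sort": "asc"}
    return f"{PUSHSHIFT_BASE}/{endpoint}/?{urlencode(params)}"


print(build_pushshift_url("ludwigahgren"))
print(build_pushshift_url("ludwigahgren", endpoint="submission"))
```

Switching the endpoint only changes the path segment of the URL, which is why the comment and submission code blocks in this notebook are otherwise identical.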

In [None]:
data, complete = collector.request_pushift('ludwigahgren')
comment_usernames = collector.pull_usernames(data)

This code block should probably take less than 10 minutes for a month's worth of usernames on r/ludwigahgren.

If the numbers stop printing, the Pushshift servers aren't responding, due to either rate limits or high load. The script retries after 10 seconds, but if this goes on for a prolonged period their servers are probably down and you should stop the script.

**Do not remove any code related to rate limiting. The Pushshift developers are generously providing a free API, and in the worst case you could get blocked if you send too many requests.**
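The retry behaviour described above can be sketched as a small wrapper that waits between failed attempts and gives up after a fixed number of tries, so the script stops on its own when the servers stay down. `call_with_retries` and `flaky_request` are hypothetical names for illustration, and the use of RuntimeError to signal an unresponsive server is assumed from the loop in this notebook.

```python
import time


def call_with_retries(func, delay_seconds=10, max_attempts=30):
    """Call func until it succeeds, sleeping between RuntimeError failures.

    Re-raises the last RuntimeError once max_attempts calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up: the servers are probably down
            time.sleep(delay_seconds)


# Example with a stand-in request that fails twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("server did not respond")
    return "ok"


result = call_with_retries(flaky_request, delay_seconds=0)
print(result, attempts["count"])
```

The delay is set to 0 here only so the example finishes instantly; against the real API the 10 second wait should stay in place.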

In [None]:
while not complete:
    try:
        # Continue from the timestamp of the last item in the previous batch
        comment_date = collector.last_created(data)
        data, complete = collector.request_pushift('ludwigahgren', start_date=comment_date)
        # Add the new usernames to the set collected so far
        comment_usernames = collector.pull_usernames(data, name_set=comment_usernames)
    except RuntimeError:
        # Servers not responding: wait 10 seconds, then retry
        time.sleep(10)
        continue
    except IndexError:
        print(f"Process complete, retrieved {len(comment_usernames)} usernames and ended at timestamp {comment_date}")
        break
    else:
        print(len(comment_usernames), comment_date)


The equivalent code blocks for submission usernames:

In [None]:
data, complete = collector.request_pushift('ludwigahgren', endpoint='submission')
submission_usernames = collector.pull_usernames(data)

In [None]:
while not complete:
    try:
        # Continue from the timestamp of the last item in the previous batch
        submission_date = collector.last_created(data)
        data, complete = collector.request_pushift('ludwigahgren', start_date=submission_date, endpoint='submission')
        # Add the new usernames to the set collected so far
        submission_usernames = collector.pull_usernames(data, name_set=submission_usernames)
    except RuntimeError:
        # Servers not responding: wait 10 seconds, then retry
        time.sleep(10)
        continue
    except IndexError:
        print(f"Process complete, retrieved {len(submission_usernames)} usernames and ended at timestamp {submission_date}")
        break
    else:
        print(len(submission_usernames), submission_date)

Combining username sets and writing to a pickle file.

In [None]:
# Union of both sets; a user who both commented and posted is counted once
total_usernames = comment_usernames | submission_usernames
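For reference, here is a minimal example of combining two Python sets. The usernames are made up for illustration. The `|` operator returns a new set, while `|=` updates the left-hand set in place and cannot be used as part of an assignment expression.

```python
# Made-up usernames, for illustration only
comment_names = {"alice", "bob"}
submission_names = {"bob", "carol"}

# `|` builds a new set and leaves both inputs unchanged
combined = comment_names | submission_names
print(sorted(combined))

# `|=` is a statement that updates the left-hand set in place
in_place = set(comment_names)
in_place |= submission_names
print(in_place == combined)
```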

In [None]:
with open('usernames.dat', 'wb') as picklefile:
    dill.dump(total_usernames, picklefile)

In [4]:
with open('usernames.dat', 'rb') as dillfile:
    username_set = dill.load(dillfile)

Collecting the comments and submissions for our set of usernames using PRAW.

This takes quite a while for a month's worth of users from a moderately active subreddit like r/ludwigahgren because of the large number of requests made. I tried using the Pushshift API to speed it up, but their servers are better suited to a small number of bulk requests than to a large number of small requests like these.

You could split the usernames between the Pushshift API and PRAW to make it go faster, but it should only take a few hours anyway if you're patient.
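If you did want to split the work, one simple approach is to divide the username set into disjoint chunks and hand each chunk to a different collector. The sketch below only shows the splitting step; `split_usernames` is a hypothetical helper and is not part of the collector module.

```python
def split_usernames(usernames, parts=2):
    """Hypothetical helper: split a set of usernames into `parts` disjoint sets.

    Sorting first makes the split deterministic between runs.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    ordered = sorted(usernames)
    # Take every `parts`-th name starting at offset i, so chunk sizes differ by at most one
    return [set(ordered[i::parts]) for i in range(parts)]


# Made-up usernames, for illustration only
names = {"alice", "bob", "carol", "dave", "erin"}
first_half, second_half = split_usernames(names, parts=2)
print(len(first_half), len(second_half))
```

Each chunk could then be passed to a separate retrieval run and the resulting data merged afterwards.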

In [None]:
# Pass the username set loaded from usernames.dat above
activity_data = collector.retrieve_activity(username_set)

Save the data so you don't have to rerun a long-running script.

In [7]:
with open('activity-data.dat', 'wb') as picklefile:
    dill.dump(activity_data, picklefile)

In [5]:
with open('activity-data.dat', 'rb') as dillfile:
    activity_data = dill.load(dillfile)
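The save-then-reload pattern used above can be wrapped in a small caching helper, sketched below. `load_or_compute` is a hypothetical name, and the sketch uses the standard library's pickle module in place of dill so that it runs without extra dependencies.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Hypothetical helper: return the object cached at `path`, or compute and cache it.

    `compute` is a zero-argument callable that is only run when no cache file exists.
    """
    if os.path.exists(path):
        with open(path, "rb") as cache_file:
            return pickle.load(cache_file)
    result = compute()
    with open(path, "wb") as cache_file:
        pickle.dump(result, cache_file)
    return result


# Possible usage in this notebook (not executed here):
# activity_data = load_or_compute(
#     "activity-data.dat",
#     lambda: collector.retrieve_activity(username_set),
# )
```

With this pattern, rerunning the notebook skips the multi-hour collection step whenever the cache file is already present. Delete the file to force a fresh collection.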

### The Data
Finally, we get some data we can actually look at. Each row is another subreddit. `activity` is the combined number of posts and comments made in that subreddit by people who also posted in r/LudwigAhgren. `rludwig` is the number of posts and comments those same users made in r/LudwigAhgren, so you can compare how active they are in each subreddit. `users` is simply the number of users who have posted in both subreddits.
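To make the three columns concrete, here is a minimal sketch of how they could be computed from per-user activity counts. This is an illustration of the definitions above, not the collector module's actual implementation; `summarize_overlap` and the sample data are hypothetical.

```python
def summarize_overlap(user_activity, home="LudwigAhgren"):
    """Hypothetical sketch: aggregate per-user counts into per-subreddit rows.

    user_activity maps username -> {subreddit: number of posts and comments}.
    """
    summary = {}
    for counts in user_activity.values():
        home_count = counts.get(home, 0)
        for subreddit, count in counts.items():
            if subreddit == home:
                continue
            row = summary.setdefault(subreddit, {"activity": 0, "rludwig": 0, "users": 0})
            row["activity"] += count      # this user's activity in the other subreddit
            row["rludwig"] += home_count  # this user's activity in the home subreddit
            row["users"] += 1             # one more overlapping user
    return summary


# Made-up data, for illustration only
sample = {
    "alice": {"LudwigAhgren": 3, "AskReddit": 5},
    "bob": {"LudwigAhgren": 1, "AskReddit": 2, "memes": 4},
}
overlap = summarize_overlap(sample)
print(overlap["AskReddit"])
print(overlap["memes"])
```

Both users are active in AskReddit, so its row sums their counts; only bob is active in memes, so that row reflects bob alone.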

In [6]:
df = pd.DataFrame.from_dict(activity_data, orient="index")

In [7]:
df.sort_values(by=['users'],ascending=False, inplace=True)
df.head(50)

| Subreddit | activity | rludwig | users |
|---|---|---|---|
| AskReddit | 1969 | 1211 | 555.0 |
| LivestreamFail | 2753 | 1278 | 450.0 |
| memes | 1725 | 634 | 354.0 |
| dankmemes | 689 | 408 | 208.0 |
| teenagers | 1563 | 388 | 200.0 |
| 196 | 1440 | 331 | 191.0 |
| nextfuckinglevel | 381 | 347 | 166.0 |
| Cringetopia | 478 | 220 | 147.0 |
| interestingasfuck | 300 | 299 | 147.0 |
| Minecraft | 303 | 291 | 141.0 |


## Plotting
Constructing the plot with plotly

In [8]:
hover_text = []
bubble_size = []
# Adding hover labels and bubble_sizes based on user overlap
for index, row in df.iterrows():
    hover_text.append(
        f'Activity in {index}: {row["activity"]}<br>'
        f'Activity in r/LudwigAhgren: {row["rludwig"]}<br>'
        f'Overlapping Users: {row["users"]}'
    )
    bubble_size.append(row["users"])
df['text'] = hover_text
df['size'] = bubble_size

fig = go.Figure()

subreddit_data = {}
annotations = []
for index, row in df.iloc[0:30, :].iterrows():
    subreddit_data[index] = df.loc[index, :]
    # Annotations
    show_arrow = False
    yshift=15
    startstandoff=0
    yshift_adjust = {'PublicFreakout':20, 'aww': -10, 'Minecraft': 25, 'interestingasfuck': -10, 'PewdiepieSubmissions': -10}
    if index in ['PublicFreakout', 'interestingasfuck','PewdiepieSubmissions', 'aww']:
        yshift = yshift_adjust.get(index)
    annotations.append(dict(
        text=index, showarrow=show_arrow, arrowhead=1,
        # Annotation coordinates on a log axis are given as log10 of the value
        x=math.log10(row['rludwig']), y=math.log10(row['activity']),
        yshift=yshift,
    ))

sizeref = 2. * max(df['size'].iloc[0:30]) / (120 ** 2)
for subreddit_name, subreddit in subreddit_data.items():
    fig.add_trace(go.Scatter(
        x=[subreddit['rludwig']], y=[subreddit['activity']],
        name=subreddit_name, text=subreddit['text'],  
        marker_size=subreddit['size'] // 8, # Tune this parameter to fit the marker size you want
    ))
fig.update_traces(mode='markers', marker=dict(sizemode='area', sizeref=sizeref, line_width=2, opacity=0.5))

# for annotation in annotations:
#     fig.add_annotation(annotation)
fig.update_layout(
    title='r/LudwigAhgren Crossover with other Subreddits (Top 30 by User Overlap)',
    xaxis=dict(
        title='Activity (Posts+Comments) in r/LudwigAhgren',
        gridcolor='white',
        type='linear', # Change to log and uncomment the annotations for better visibility
        gridwidth=2,
    ),
    yaxis=dict(
        title='Activity (Posts+Comments) in other Subreddit',
        gridcolor='white',
        gridwidth=2,
        type='linear' # Change to log and uncomment the annotations for better visibility
    ),
    legend=dict(
        title=dict(text="""Subreddits by overlapping user count<br><br>Bubble size proportional to user count""", font=dict(size=18)),
        itemsizing="trace"
    ),
    width=1650,
    height=900,
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)
fig.show()
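A note on switching the axes to log, as the code comments suggest: Plotly expects annotation positions on a log axis to be given as the base-10 logarithm of the data value, which is why the annotation coordinates in the cell above are converted before being stored. A minimal sketch of that conversion follows; `log_axis_position` is a hypothetical helper, and the sample values are the AskReddit row from the table above.

```python
import math


def log_axis_position(value):
    """Hypothetical helper: convert a data value to its position on a log10 axis."""
    if value <= 0:
        raise ValueError("log axes can only place positive values")
    return math.log10(value)


# AskReddit row from the table above: rludwig=1211, activity=1969
x_position = log_axis_position(1211)
y_position = log_axis_position(1969)
print(round(x_position, 3), round(y_position, 3))
```

Marker coordinates passed to the scatter traces do not need this conversion; only annotation positions do.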

In [11]:
# Code block for publishing to chart studio, ignore.
config=dict(scrollZoom=True)
chart_studio.plotly.plot(fig, filename='ludwig_redditors', config=config)

'https://plotly.com/~brandon-info/3/'

In [12]:
# Export df to csv
df.to_csv(path_or_buf='activity.csv',index=True)