## Collecting subreddit data
This notebook collects usernames from all posts and comments on a subreddit, along with those users' activity in other subreddits. If you want to see the code behind the functions used here, look at the collector module.

In [1]:
import time, math
import dill
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio 
from redditalyzer import collector
import chart_studio

# If you are using vscode to render the chart
pio.renderers.default = "vscode"

Version 7.2.0 of praw is outdated. Version 7.3.0 was released Thursday June 17, 2021.


Enter the name of the subreddit you want to get data from below.

In [2]:
search_in = 'atrioc'

The next two cells collect the Reddit usernames that have commented on the subreddit within a given time range. The default time range is the last 30 days (30d); it can be changed with the start_date parameter of the request_pushift function, and the first cell below passes '1y' for one year. Accepted values can be found in the Pushshift API documentation. You should see the printed numbers increase as the script collects more usernames.

If you want to collect the usernames from submissions as well, change the endpoint parameter of the request_pushift function.

In [None]:
data, complete = collector.request_pushift(search_in, start_date='1y')
comment_usernames = collector.pull_usernames(data)

Collecting a month's worth of usernames on r/ludwigahgren should probably take less than 10 minutes.

If the numbers stop printing, the Pushshift servers aren't responding, either because of rate limits or high load. I set the loop to retry after 10 seconds, but if this goes on for a prolonged period their servers are probably just down and you should stop the script.

**Do not remove any code related to rate limiting. The Pushshift devs are generously providing a free API, and in the worst case you could get blocked if you send too many requests.**

In [None]:
while not complete:
    try:
        comment_date = collector.last_created(data)
        data, complete = collector.request_pushift(search_in, start_date=comment_date)
        comment_usernames = collector.pull_usernames(data, name_set=comment_usernames)
    except RuntimeError:
        time.sleep(10)
        continue
    except IndexError:
        print(f"Process complete, retrieved {len(comment_usernames)} usernames and ended at timestamp {comment_date}")
        break
    else:
        print(len(comment_usernames), comment_date)
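
The loop above retries forever while the server keeps failing. As a minimal sketch of the "retry, but eventually stop" advice, the helper below caps the number of attempts. The names call_with_retry, max_attempts and flaky_fetch are hypothetical and not part of the collector module; the only thing taken from the notebook is that a failed request surfaces as a RuntimeError and that the wait is 10 seconds.

```python
import time


def call_with_retry(fetch, max_attempts=5, delay=10.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting `delay` seconds after each RuntimeError.

    Hypothetical helper: re-raises the last RuntimeError once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up: the service is probably down
            sleep(delay)


# Demo with a fake fetcher that fails twice and then succeeds (no network involved).
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("server busy")
    return ["page of results"]


waits = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky_fetch, delay=10.0, sleep=waits.append)
print(result, waits)
```

Passing the sleep function in as an argument keeps the rate-limiting delay in place for real runs while letting the demo finish instantly.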


Code blocks for submission usernames

In [None]:
data, complete = collector.request_pushift(search_in, endpoint='submission')
submission_usernames = collector.pull_usernames(data)
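
The comment and submission requests differ only in the endpoint. The sketch below shows roughly how such a request URL could be assembled. It is an assumption, not the collector module's actual code: build_pushshift_url is a hypothetical helper, the base URL and the subreddit, after, size and sort parameters follow the public Pushshift search API as it was documented, and the service may have changed since. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed base URL of the Pushshift search API (may no longer be publicly available).
PUSHSHIFT_BASE = "https://api.pushshift.io/reddit/search"


def build_pushshift_url(subreddit, start_date="30d", endpoint="comment", size=100):
    """Return the search URL for one page of results (hypothetical helper)."""
    if endpoint not in ("comment", "submission"):
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    params = {
        "subreddit": subreddit,
        "after": start_date,  # relative offset such as "30d", or an epoch timestamp
        "size": size,         # number of items requested per page
        "sort": "asc",        # oldest first, so the last item marks where the next page starts
    }
    return f"{PUSHSHIFT_BASE}/{endpoint}/?{urlencode(params)}"


print(build_pushshift_url("atrioc", start_date="1y"))
print(build_pushshift_url("atrioc", endpoint="submission"))
```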

In [None]:
while not complete:
    try:
        submission_date = collector.last_created(data)
        data, complete = collector.request_pushift(search_in, start_date=submission_date, endpoint='submission')
        submission_usernames = collector.pull_usernames(data,name_set=submission_usernames)
    except RuntimeError:
        time.sleep(10)
        continue
    except IndexError:
        print(f"Process complete, retrieved {len(submission_usernames)} usernames and ended at timestamp {submission_date}")
        break
    else:
        print(len(submission_usernames), submission_date)

Combining the username sets and writing them to a file with dill (a pickle-compatible serializer).

In [None]:
total_usernames = comment_usernames.union(submission_usernames)

In [None]:
with open(f'usernames-{search_in}.dat', 'wb') as picklefile:
    dill.dump(total_usernames, picklefile)

In [3]:
with open(f'usernames-{search_in}.dat', 'rb') as dillfile:
    username_set = dill.load(dillfile)
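
For reference, here is a self-contained round trip of the same idea on made-up usernames. It uses the standard library's pickle in place of dill (dill exposes the same dump/load interface), and a temporary directory so nothing is left on disk; the sample names and the file name are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path

# Made-up sample data standing in for the collected username sets.
comment_usernames = {"alice", "bob"}
submission_usernames = {"bob", "carol"}

# Set union removes users who appear in both sets.
total_usernames = comment_usernames | submission_usernames  # same as .union()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "usernames-example.dat"
    with open(path, "wb") as f:
        pickle.dump(total_usernames, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(sorted(restored))
```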

Collecting the comments and submissions from our list of names using PRAW.

This takes quite a while for a month's worth of users from a moderately active subreddit like r/ludwigahgren because of the large number of requests made. I tried using the Pushshift API to speed it up, but their servers are better suited to a small number of bulk requests than to a large number of small requests like this.

You could split the usernames between the Pushshift API and PRAW to make it go faster, but it should only take a few hours anyway if you're patient.

In [4]:
activity_data = collector.retrieve_activity(username_set, 'year')

1901 conkiembock
1902 MapoDiddy
1903 JakeHutson
1904 TimInterrante
1905 kai_texans
1906 mo-inc
1907 jovili_
1908 ActuallyGood786
1909 grandma_needs_jesus
1910 yoetboi69
1911 BigwillyTwisty
BigwillyTwisty could not be found. Error: received 404 HTTP response
1912 totally_not_paul
1913 RichardFarter
1914 Slow_Pound8254
1915 letmeoutpls
1916 BryceE212
1917 Bigfoop
1918 gyaruruu
1919 Shrenk69
1920 Will_the_Thrill19
1921 klausklass
1922 9olp
1923 TrueWiggles1305
1924 Wolfgang_Amadeuss
1925 FemboyInASkirt
1926 Bre7t
1927 SHEEPBBQ
1928 I__Synergy__I
1929 EMPTY__Meat
1930 Greenguyme
1931 rohin-m
1932 TheOtodus
1933 Daxnazzle
1934 bvckspaced
1935 giggleump
1936 Hitlers_LeftTesticle
1937 Surlap
1938 bentaylorrr
1939 Fingy123
1940 BrownBagBoy
1941 Ninja_underwear420
1942 VoidMakesVids
1943 NicoNomad
1944 PollosTacos
1945 babeybluecheese
babeybluecheese could not be found. Error: received 404 HTTP response
1946 Anti_Fake_Yoda_Bot
1947 R_Wolf_48
1948 Leafsfan32
1949 riaath
1950 regalchun

Save the data so you don't have to rerun a long script.

In [5]:
with open('activity-data-atrioc.dat', 'wb') as picklefile:
    dill.dump(activity_data, picklefile)

In [None]:
with open(f'activity-data-{search_in}.dat', 'rb') as dillfile:  # same file name the save cell writes for this subreddit
    activity_data = dill.load(dillfile)

### The Data
Finally, we get some data we can actually look at. The activity column is the combined number of posts and comments made in each listed subreddit by people who also posted in r/atrioc (the subreddit set in search_in). The atrioc column is the number of posts and comments those same users made in r/atrioc, so you can compare how active they are in each subreddit. The users column is simply the number of users who have posted in both subreddits.

In [6]:
df = pd.DataFrame.from_dict(activity_data, orient="index")
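
To make the three columns concrete, here is a small sketch of how per-user counts could be folded into that structure. It is an assumption based only on the column descriptions above, not the actual logic of collector.retrieve_activity; aggregate_activity and the sample data are hypothetical. The result is a dict of dicts keyed by subreddit, which is the shape that from_dict with orient="index" turns into one row per subreddit.

```python
from collections import defaultdict


def aggregate_activity(per_user_counts, home):
    """Fold per-user {subreddit: count} maps into per-subreddit totals (hypothetical)."""
    table = defaultdict(lambda: {"activity": 0, home: 0, "users": 0})
    for counts in per_user_counts.values():
        home_count = counts.get(home, 0)
        for subreddit, n in counts.items():
            if subreddit == home:
                continue  # the home subreddit is the reference, not a row
            row = table[subreddit]
            row["activity"] += n        # posts + comments in the other subreddit
            row[home] += home_count     # the same user's activity in the home subreddit
            row["users"] += 1           # one more overlapping user
    return dict(table)


# Made-up counts for two users.
sample = {
    "user_a": {"atrioc": 3, "LudwigAhgren": 5},
    "user_b": {"atrioc": 1, "LudwigAhgren": 2, "memes": 4},
}
result = aggregate_activity(sample, home="atrioc")
for name, row in result.items():
    print(name, row)
```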

In [13]:
df.sort_values(by=['atrioc'],ascending=False, inplace=True)
df.head(50)

subreddit,activity,atrioc,users,text,size
LudwigAhgren,7118,6095,1270,Activity in LudwigAhgren: 7118<br>Activity in ...,1270
LivestreamFail,4089,2852,563,Activity in LivestreamFail: 4089<br>Activity i...,563
AskReddit,2420,2211,621,Activity in AskReddit: 2420<br>Activity in r/a...,621
Stanz,1505,2052,250,Activity in Stanz: 1505<br>Activity in r/atrio...,250
ConnorEatsPants,2945,1361,218,Activity in ConnorEatsPants: 2945<br>Activity ...,218
memes,2130,1317,359,Activity in memes: 2130<br>Activity in r/atrio...,359
Minecraft,569,1114,226,Activity in Minecraft: 569<br>Activity in r/at...,226
dankmemes,1017,949,266,Activity in dankmemes: 1017<br>Activity in r/a...,266
QTCinderella,395,872,123,Activity in QTCinderella: 395<br>Activity in r...,123
196,2127,870,283,Activity in 196: 2127<br>Activity in r/atrioc:...,283


## Plotting
Constructing the plot with plotly

In [25]:
hover_text = []
bubble_size = []
# Build the hover labels and bubble sizes from the user overlap
for index, row in df.iterrows():
    hover_text.append(
        f'Activity in {index}: {row["activity"]}<br>'
        f'Activity in r/{search_in}: {row[search_in]}<br>'
        f'Overlapping Users: {row["users"]}'
    )
    bubble_size.append(row["users"])
df['text'] = hover_text
df['size'] = bubble_size

fig = go.Figure()

subreddit_data = {}
annotations = []
for index, row in df.iloc[0:30, :].iterrows():
    subreddit_data[index] = df.loc[index, :]
    # Annotations: on log axes, x and y must be given as log10 of the data values
    show_arrow = False
    yshift = 15  # default vertical offset of the label, in pixels
    yshift_adjust = {'PublicFreakout':20, 'aww': -10, 'Minecraft': 25, 'interestingasfuck': -10, 'PewdiepieSubmissions': -10}
    if index in ['PublicFreakout', 'interestingasfuck','PewdiepieSubmissions', 'aww']:
        yshift = yshift_adjust.get(index)
    annotations.append(dict(text=index, showarrow=show_arrow, arrowhead=1, x=math.log10(row[search_in]), y=math.log10(row['activity']), yshift=yshift))

sizeref = 2. * max(df['size'].iloc[0:30]) / (120 ** 2)
for subreddit_name, subreddit in subreddit_data.items():
    fig.add_trace(go.Scatter(
        x=[subreddit[f'{search_in}']], y=[subreddit['activity']],
        name=subreddit_name, text=subreddit['text'],  
        marker_size=subreddit['size'] // 8, # Tune this parameter to fit the marker size you want
    ))
fig.update_traces(mode='markers', marker=dict(sizemode='area', sizeref=sizeref, line_width=2, opacity=0.5))

# for annotation in annotations:
#     fig.add_annotation(annotation)
fig.update_layout(
    title=f'r/{search_in} Crossover with other Subreddits (Top 30 by User Overlap)',
    xaxis=dict(
        title=f'Activity (Posts+Comments) in r/{search_in}',
        gridcolor='white',
        type='log', # Log scale; uncomment the annotation loop above to label the bubbles
        gridwidth=2,
    ),
    yaxis=dict(
        title='Activity (Posts+Comments) in other Subreddit',
        gridcolor='white',
        gridwidth=2,
        type='log' # Log scale; uncomment the annotation loop above to label the bubbles
    ),
    legend=dict(
        title=dict(text="""Subreddits by overlapping user count<br><br>Bubble size proportional to user count""", font=dict(size=18)),
        itemsizing="trace"
    ),
    width=1650,
    height=900,
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)
fig.show()
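
The sizeref line in the cell above follows the formula recommended in Plotly's bubble chart documentation for sizemode='area': 2 * max(size) / (desired maximum diameter in pixels) ** 2. The sketch below works through that arithmetic for the largest overlap in this data (1270 users) and a 120 px target. The diameter_px function is my own derivation from that formula, assuming marker area scales linearly with size; it has not been checked against Plotly's rendering code.

```python
import math


def sizeref_for(max_size, max_diameter_px):
    """sizeref per the formula in Plotly's bubble chart docs for sizemode='area'."""
    return 2.0 * max_size / (max_diameter_px ** 2)


def diameter_px(size, sizeref):
    """Approximate marker diameter implied by that formula (assumed, not verified)."""
    return math.sqrt(2.0 * size / sizeref)


sizeref = sizeref_for(max_size=1270, max_diameter_px=120)

# Size passed unscaled: the largest bubble should hit the 120 px target.
print(round(diameter_px(1270, sizeref), 1))

# Size after the integer division by 8 used for marker_size in the cell above.
print(round(diameter_px(1270 // 8, sizeref), 1))
```

Under these assumptions, dividing the sizes by 8 while computing sizeref from the undivided sizes makes the largest bubble much smaller than 120 px, which may or may not be what you want when tuning marker_size.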

In [26]:
# Code block for publishing to chart studio, ignore.
config=dict(scrollZoom=True)
chart_studio.plotly.plot(fig, filename=f'{search_in}_redditors', config=config)

'https://plotly.com/~brandon-info/8/'

In [12]:
# Export df to csv
df.to_csv(path_or_buf='activity-atrioc.csv',index=True)

## Some Exploratory Blocks

In [14]:
df['activity/users'] = df['activity']/df['users']

In [15]:
df[f'{search_in}/users'] = df[f'{search_in}']/df['users']

In [16]:
df.describe()

statistic,activity,atrioc,users,size,activity/users,atrioc/users
count,9688.0,9688.0,9688.0,9688.0,9688.0,9687.0
mean,16.055533,15.043249,4.361272,4.361272,inf,3.371591
std,155.056487,88.345271,19.969481,19.969481,,4.881781
min,1.0,0.0,0.0,0.0,1.0,0.0
25%,1.0,1.0,1.0,1.0,1.0,1.0
50%,2.0,3.0,1.0,1.0,1.5,2.0
75%,8.0,10.0,3.0,3.0,3.0,4.0
max,10870.0,6095.0,1270.0,1270.0,inf,108.0
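
The inf mean for activity/users and the count of 9687 for atrioc/users are consistent with rows where users is 0: pandas float division gives inf for a positive number divided by zero and NaN for 0/0, and describe() leaves NaN out of the count. A minimal pure-Python sketch of guarding against that is shown below; safe_ratio and the sample rows are hypothetical.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    if denominator == 0:
        return math.nan
    return numerator / denominator


# Made-up (activity, users) pairs; the middle row has no overlapping users recorded.
rows = [(12, 4), (3, 0), (9, 3)]

ratios = [safe_ratio(activity, users) for activity, users in rows]

# Average only the rows where the ratio is defined.
finite = [r for r in ratios if not math.isnan(r)]
mean_ratio = sum(finite) / len(finite)
print(ratios, mean_ratio)
```

In the notebook itself, one way to get the same effect would be to divide by df['users'].replace(0, float('nan')) so that the undefined ratios become NaN and are skipped by describe().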


In [17]:
most_dedicated_users = df.query("users >= 5").copy()

In [18]:
most_dedicated_users.describe()

statistic,activity,atrioc,users,size,activity/users,atrioc/users
count,1498.0,1498.0,1498.0,1498.0,1498.0,1498.0
mean,72.708278,69.771696,20.008011,20.008011,3.213305,3.323943
std,269.364012,215.942649,47.818796,47.818796,3.205834,2.525818
min,5.0,1.0,5.0,5.0,1.0,0.166667
25%,12.0,16.0,6.0,6.0,1.5,2.0
50%,25.0,31.0,9.0,9.0,2.2,2.857143
75%,54.0,63.0,18.0,18.0,3.610294,3.875
max,7118.0,6095.0,1270.0,1270.0,40.8,49.875


In [19]:
most_dedicated_users.sort_values(by=[f'{search_in}/users','activity'], ascending=False, inplace=True)

In [20]:
most_dedicated_users.head(50)

subreddit,activity,atrioc,users,text,size,activity/users,atrioc/users
atriocprechat,9,399,8,Activity in atriocprechat: 9<br>Activity in r/...,8,1.125,49.875
clipclub,9,178,6,Activity in clipclub: 9<br>Activity in r/atrio...,6,1.5,29.666667
whodatLIVE,13,138,6,Activity in whodatLIVE: 13<br>Activity in r/at...,6,2.166667,23.0
Quack_001,40,280,15,Activity in Quack_001: 40<br>Activity in r/atr...,15,2.666667,18.666667
Consoom,11,111,6,Activity in Consoom: 11<br>Activity in r/atrio...,6,1.833333,18.5
whodat950,42,357,20,Activity in whodat950: 42<br>Activity in r/atr...,20,2.1,17.85
Trainwreckstv,39,243,14,Activity in Trainwreckstv: 39<br>Activity in r...,14,2.785714,17.357143
JIDSV,13,84,5,Activity in JIDSV: 13<br>Activity in r/atrioc:...,5,2.6,16.8
BotezLive,19,144,9,Activity in BotezLive: 19<br>Activity in r/atr...,9,2.111111,16.0
AZCardinals,72,76,5,Activity in AZCardinals: 72<br>Activity in r/a...,5,14.4,15.2


In [23]:
hover_text = []
bubble_size = []

# Adding hover labels and bubble_sizes based on user overlap
for index, row in most_dedicated_users.iterrows():
    hover_text.append(f'Activity in {index}: {row["activity"]}<br>' +
    f'Activity in r/{search_in}: {row[f"{search_in}"]}<br>' +
    f'Overlapping Users: {row["users"]}')
    bubble_size.append(row["users"])
most_dedicated_users['text'] = hover_text
most_dedicated_users['size'] = bubble_size

fig = go.Figure()

subreddit_data = {}
annotations = []
for index, row in most_dedicated_users.iloc[0:30, :].iterrows():
    subreddit_data[index] = most_dedicated_users.loc[index, :]
    # Annotations
    # show_arrow = False
    # yshift=15
    # startstandoff=0
    # yshift_adjust = {'PublicFreakout':20, 'aww': -10, 'Minecraft': 25, 'interestingasfuck': -10, 'PewdiepieSubmissions': -10}
    # if index in ['PublicFreakout', 'interestingasfuck','PewdiepieSubmissions', 'aww']:
    #     yshift = yshift_adjust.get(index)
    # annotations.append(dict(text=index, showarrow=show_arrow, arrowhead=1, x=math.log10(row[search_in]), y=math.log10(row['activity']), yshift=yshift))

sizeref = 2. * max(most_dedicated_users['size'].iloc[0:30]) / (100 ** 2)
for subreddit_name, subreddit in subreddit_data.items():
    fig.add_trace(go.Scatter(
        x=[subreddit[f'{search_in}/users']], y=[subreddit['activity']],
        name=subreddit_name, text=subreddit['text'],  
        marker_size=subreddit['size'] * 2, # Tune this parameter to fit the marker size you want
    ))
fig.update_traces(mode='markers', marker=dict(sizemode='area', sizeref=sizeref, line_width=2, opacity=0.5))

# for annotation in annotations:
#     fig.add_annotation(annotation)
fig.update_layout(
    title=f'r/{search_in} Most Active User Groups Activity Mapping',
    xaxis=dict(
        title=f'r/{search_in} activity:Users ratio',
        gridcolor='white',
        type='linear', # Change to log and uncomment the annotations for better visibility
        gridwidth=2,
    ),
    yaxis=dict(
        title='Activity (Posts+Comments) in other Subreddit',
        gridcolor='white',
        gridwidth=2,
        type='linear' # Change to log and uncomment the annotations for better visibility
    ),
    legend=dict(
        title=dict(text="""Subreddits by overlapping user count<br><br>Bubble size proportional to user count""", font=dict(size=18)),
        itemsizing="trace"
    ),
    width=1650,
    height=900,
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
)
fig.show()