# AAAAAAAAAAAAAAAAAnalysis

This notebook documents the scraping of submissions and comments from the /r/AAAAAAAAAAAAAAAAA subreddit, a subreddit dedicated to... well... AAAAAAAAAAAAAAAAA.

The notebook is split into the following sections:

1. Scraping data with the pmaw API
2. Cleaning the data to remove uninteresting entries (deleted posts, AutoModerator comments, etc.)
3. Plotting various quantities from the scraped data

# Data Scraping

Submissions and comments are scraped using the pmaw API, mostly following the guidelines here: https://pypi.org/project/pmaw/. Scraping everything took ~40 minutes.

In [1]:
import os
import pandas as pd

from pmaw import PushshiftAPI
api = PushshiftAPI()

def scrape_submissions(subreddit, output_csv, limit=1000, before=None, after=None, overwrite_csv=False):
    """Scrape submissions from `subreddit` and write them to `output_csv`."""
    # Skip the (slow) scrape if the CSV is already on disk
    if os.path.exists(output_csv) and not overwrite_csv:
        print("{} Already exists, not overwriting".format(output_csv))
        return

    # Only pass the time bounds that were actually supplied, so that giving
    # just one of before/after is no longer silently ignored
    time_bounds = {k: v for k, v in (("before", before), ("after", after)) if v is not None}
    submissions = api.search_submissions(subreddit=subreddit, limit=limit, **time_bounds)

    submission_df = pd.DataFrame(submissions)
    submission_df.to_csv(output_csv, header=True, index=False)

    print("\nSuccessfully written {} submissions to {}".format(len(submission_df), output_csv))


def scrape_comments(subreddit, output_csv, limit=1000, before=None, after=None, overwrite_csv=False):
    """Scrape comments from `subreddit` and write them to `output_csv`."""
    # Skip the (slow) scrape if the CSV is already on disk
    if os.path.exists(output_csv) and not overwrite_csv:
        print("{} Already exists, not overwriting".format(output_csv))
        return

    # Only pass the time bounds that were actually supplied
    time_bounds = {k: v for k, v in (("before", before), ("after", after)) if v is not None}
    comments = api.search_comments(subreddit=subreddit, limit=limit, **time_bounds)

    comments_df = pd.DataFrame(comments)
    comments_df.to_csv(output_csv, header=True, index=False)

    print("\nSuccessfully written {} comments to {}".format(len(comments_df), output_csv))

In [2]:
scrape_submissions(subreddit = 'AAAAAAAAAAAAAAAAA', output_csv = 'raw_submissions.csv', limit=None, overwrite_csv=False)
scrape_comments(subreddit = 'AAAAAAAAAAAAAAAAA', output_csv = 'raw_comments.csv', limit=None, overwrite_csv=False)

raw_submissions.csv Already exists, not overwriting
raw_comments.csv Already exists, not overwriting
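
Both scraping functions also take `before` and `after` bounds, which the full scrape above does not use. As a minimal sketch, assuming the bounds are expected as integer epoch timestamps in seconds (which is how I read the examples on the pmaw project page), a time window could be built like this. The helper name `to_epoch`, the 2020 window and the output filename are only illustrative.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Return the UTC epoch timestamp (whole seconds) for midnight on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# Hypothetical one-year window covering 2020
after = to_epoch(2020, 1, 1)
before = to_epoch(2021, 1, 1)
print(after, before)

# The bounds would then be passed through to the scraper, e.g.
# scrape_submissions(subreddit='AAAAAAAAAAAAAAAAA', output_csv='submissions_2020.csv',
#                    limit=None, before=before, after=after)
```

Pinning the timezone to UTC keeps the window the same regardless of where the notebook is run.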


In [3]:
submissions_df = pd.read_csv('raw_submissions.csv', low_memory=False)
comments_df = pd.read_csv('raw_comments.csv', low_memory=False)

# Data Cleaning

The raw data we scraped requires some cleaning to remove entries that don't fit the spirit of the subreddit and/or the analysis: deleted posts, AutoModerator posts, and posts from reddit bots (which usually include a long description of their purpose in their comments).

We will then also drop entries containing URLs and strip subreddit and username links.

### Filtering Authors

After manually looking at the top 500 posters in each dataframe, we can gauge which authors / usernames to filter out. The top 10 from each dataframe are shown here as a preview of what this step looks for.

The authors to remove are '[deleted]', 'AutoModerator' and a list of bots: 'VredditDownloader', 'SaveVideo', 'RepostSleuthBot', 'SaveThisVIdeo', 'sneakpeekbot', 'AutoCrosspostBot', 'CoolDownBot', 'haikusbot', 'MAGIC_EYE_BOT', 'FuckCoolDownBot2', 'FuckThisShitBot41', 'ClickableLinkBot', 'uwutranslator', 'Screem_Bot', 'morse-bot', 'nwordcountbot', 'phonebatterylevelbot', 'SmileBot-2020'

In [4]:
N_FILTER = 10
for v in submissions_df['author'].value_counts()[:N_FILTER].index:
    print(v)
print("")
for v in comments_df['author'].value_counts()[:N_FILTER].index:
    print(v)

[deleted]
fakedimestesso
Mickthebrick1
runningawaywithyrmom
Lansofl
Total-Volume-9387
sindjaika
Catslifephils
hamakaze99
AbundanceLifeStyle

[deleted]
MAGIC_EYE_BOT
runningawaywithyrmom
Mickthebrick1
AutoModerator
VredditDownloader
fakedimestesso
RepostSleuthBot
LonelyGameBoi
billsanzer


In [5]:
# Note, ~ inverts the .isin mask, so we remove posts whose author is in the list
remove_authors = [
    '[deleted]', 'AutoModerator', 'VredditDownloader', 'SaveVideo',
    'RepostSleuthBot', 'SaveThisVIdeo', 'sneakpeekbot', 'AutoCrosspostBot',
    'CoolDownBot', 'haikusbot', 'MAGIC_EYE_BOT', 'FuckCoolDownBot2',
    'FuckThisShitBot41', 'ClickableLinkBot', 'uwutranslator', 'Screem_Bot',
    'morse-bot', 'nwordcountbot', 'phonebatterylevelbot', 'SmileBot-2020',
]
submissions_df = submissions_df[~submissions_df['author'].isin(remove_authors)]
comments_df = comments_df[~comments_df['author'].isin(remove_authors)]
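
The list above was put together by hand. A rough first pass over the most frequent authors could flag names that look like bots before the manual check. This is only a sketch: the helper `flag_bot_like`, its keyword list and the sample authors are hypothetical and not part of the original workflow.

```python
from collections import Counter

def flag_bot_like(authors, top_n=500, keywords=("bot", "automoderator", "[deleted]")):
    """Return the most frequent authors whose name contains any keyword (case-insensitive)."""
    top = [name for name, _ in Counter(authors).most_common(top_n)]
    return [name for name in top if any(k in name.lower() for k in keywords)]

# Hypothetical author column, standing in for submissions_df['author']
authors = (["[deleted]"] * 3 + ["MAGIC_EYE_BOT"] * 2
           + ["Mickthebrick1"] * 2 + ["SaveVideo"])
flagged = flag_bot_like(authors)
print(flagged)
```

A keyword match like this misses bots such as 'SaveVideo', so it can only narrow down the manual review, not replace it.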

### Further Filtering

Here I apply some further filtering to tidy up the text we will be analysing (submission titles and comment bodies): entries containing URLs are dropped, and subreddit/username links are stripped out.

In [6]:
# Drop any NA values in the raw dataframe
submissions_df = submissions_df.dropna(subset=['title'])
comments_df = comments_df.dropna(subset=['body'])

# Drop entries that contain a URL
# pattern = r'[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
# After trialling multiple regex patterns (such as the one above), a plain substring check for "http"
# ends up being the simplest way to remove the majority of these posts while retaining time efficiency.
# Note that bare www links without "http" are not caught by this check.
submissions_df = submissions_df[~submissions_df['title'].str.contains("http", regex=False)]
comments_df = comments_df[~comments_df['body'].str.contains("http", regex=False)]

# regex finds and removes subreddit /r/ links
submissions_df['title'] = submissions_df['title'].replace(r'/r/([^\s/]+)', '', regex=True)
comments_df['body'] = comments_df['body'].replace(r'/r/([^\s/]+)', '', regex=True)

# regex finds and removes username /u/ links
submissions_df['title'] = submissions_df['title'].replace(r'/u/([^\s/]+)', '', regex=True)
comments_df['body'] = comments_df['body'].replace(r'/u/([^\s/]+)', '', regex=True)
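
To see what the two link patterns above actually match, they can be tried on a few plain strings with the standard `re` module. This is a small sketch: the helper `strip_links` and the sample strings are made up for illustration and are not taken from the scraped data.

```python
import re

# Same patterns as in the cell above, compiled once for reuse
SUBREDDIT_LINK = re.compile(r'/r/([^\s/]+)')
USER_LINK = re.compile(r'/u/([^\s/]+)')

def strip_links(text):
    """Remove /r/<name> and /u/<name> links from a string."""
    text = SUBREDDIT_LINK.sub('', text)
    return USER_LINK.sub('', text)

# Hypothetical sample strings
samples = [
    "AAAA see /r/aww for more",
    "thanks /u/some_user!",
    "r/aww without a leading slash",
]
cleaned = [strip_links(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Two things stand out: trailing punctuation attached to a link is removed along with it, and links written without the leading slash are left untouched.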

Finally, there are some HTML encodings in the comments, such as `&amp;`, which should be decoded to avoid the CountVectorizer catching amp as a word.

In [7]:
def html_decode(s):
    """
    Returns the decoded version of the given HTML string for a handful of
    common entities. This does NOT remove normal HTML tags like <p>.
    """
    # (character, entity) pairs; '&amp;' is handled last so that a string like
    # '&amp;lt;' decodes to '&lt;' rather than being decoded twice to '<'
    html_codes = (
        ("'", '&#39;'),
        ('"', '&quot;'),
        ('>', '&gt;'),
        ('<', '&lt;'),
        ('&', '&amp;'),
    )
    for char, entity in html_codes:
        s = s.replace(entity, char)
    return s

submissions_df['title'] = submissions_df['title'].apply(html_decode)
comments_df['body'] = comments_df['body'].apply(html_decode)
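
As an alternative to the hand-written table, the standard library's `html.unescape` covers the full set of HTML entities. The sketch below only shows how it behaves on a few made-up strings; it is not what was used to produce the results in this notebook.

```python
import html

# Hypothetical sample strings, not taken from the scraped data
samples = [
    "Tom &amp; Jerry",
    "&quot;AAAA&quot; he said",
    "it&#39;s &lt;3",
    "&amp;lt;",  # decoded once, giving the literal text '&lt;'
]
decoded = [html.unescape(s) for s in samples]
for before, after in zip(samples, decoded):
    print(repr(before), '->', repr(after))
```

It could be applied to the dataframes in the same way, for example with `.apply(html.unescape)` on the title and body columns.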

# Feature Engineering