# Data Scraper

This file contains the functions used to scrape the data and perform preliminary clean-up during the compilation process. The Python requests library was used to complete this task.

**In this file are the functions:**

- scraper()
    - This function actually scrapes the data. It takes the subreddit whose submissions should be pulled, along with the 'last batch' (the batch that was just scraped). It uses the last row of that batch to determine the date from which to begin the next iteration of the scraping process. It also sets the request parameters: the subreddit, the number of rows to scrape, the exclusion of video content, and the exclusion of anything authored by Reddit's AutoModerator.
    
- compiler()
    - This function takes the subreddit to examine and the maximum number of posts to scrape. It creates an empty list to store all scraped posts and a variable to hold the most recent batch of up to 100 posts. It is built around a while loop that keeps calling scraper() until the maximum number of posts has been reached or a scrape comes back empty. It adds each batch to the larger list of all posts and, once the loop ends, returns that list trimmed to the maximum. Additionally, this function uses time.sleep() to wait a random 3 to 7 seconds between requests, which makes the scrape more 'natural' and less robotic. It was important to include this so as not to overwhelm the Pushshift API.
        
- compile_scrapes()
    - This function merges the bodyweightfitness and weightlifting scrapes into one dataframe and completes some very preliminary cleaning: it removes posts that were deleted by a moderator, the user, or Reddit, and keeps only posts with content in the selftext column. The resulting dataframe is then exported for use in future steps.
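
The way scraper() and compiler() work together, as described above, can be sketched without touching the network. In this minimal sketch, `fake_fetch` and `collect` are hypothetical stand-ins rather than functions from this notebook: `fake_fetch` mimics an API that returns posts newest-first and accepts a `before` cutoff, and `collect` repeats the request until enough posts are gathered or a batch comes back empty.

```python
# Hypothetical in-memory "API": 25 posts with descending created_utc values.
FAKE_POSTS = [{"id": i, "created_utc": 1000 - i} for i in range(25)]


def fake_fetch(before=None, size=10):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


def collect(max_posts, size=10):
    """Page backwards through the fake API using the last post's timestamp."""
    posts = []
    last_batch = None
    while last_batch != [] and len(posts) < max_posts:
        # The oldest post of the previous batch becomes the next cutoff.
        before = last_batch[-1]["created_utc"] if last_batch else None
        last_batch = fake_fetch(before=before, size=size)
        posts.extend(last_batch)
    return posts[:max_posts]


print(len(collect(18)))   # stops once at least 18 posts are available
print(len(collect(100)))  # stops when the fake API runs out of posts
```

The two stopping conditions are independent: the first call ends because the target was reached, the second because a batch came back empty. In this sketch the cutoff is strict, so posts sharing the exact timestamp of a batch's last post would be skipped; whether the real API behaves the same way is not stated here.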
        
        
**In total, 20,000 posts were scraped from each subreddit. After merging the data and removing posts that had been flagged as deleted or had no selftext, the combined total of 40,000 posts was reduced to 26,049 (weightlifting: 12,028, bodyweightfitness: 14,021).**
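
As a quick check on the totals above, the share of posts that survived the clean-up can be computed directly from the reported counts. The variable names below are illustrative only.

```python
# Counts reported above: 20,000 posts scraped per subreddit.
scraped_per_subreddit = 20_000
kept = {"weightlifting": 12_028, "bodyweightfitness": 14_021}

total_kept = sum(kept.values())
total_scraped = scraped_per_subreddit * len(kept)

# Per-subreddit and combined retention after the clean-up step.
for name, count in kept.items():
    print(f"{name}: {count / scraped_per_subreddit:.1%} kept")
print(f"combined: {total_kept} of {total_scraped} ({total_kept / total_scraped:.1%})")
```

The two counts sum to 26,049, which is roughly 65% of the 40,000 scraped posts.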

In [1]:
import requests
import pandas as pd
import numpy as np
import time

import random

import datetime as dt

import math

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Gather data using the requests library

In [17]:
url = 'https://api.pushshift.io/reddit/search/submission'

# Function built to perform multiple scrapes of the Pushshift Reddit API
# It will initiate a new scrape based on the date of the last post pulled
# from the immediately prior scrape
# HELP: https://reddit-api.readthedocs.io/en/latest/#searching-submissions
# HELP: https://www.textjuicer.com/2019/07/crawling-all-submissions-from-a-subreddit/

# This function does the scraping
def scraper(subreddit, last_batch):
    # setting up params to indicate what subreddit
    # size (max is 100)
    # excluding videos, posts created by automoderator
    params = {
    'subreddit' : subreddit,
    'size' : 100,
    'is_video': False,
    'author': '!AutoModerator'
    }
    
    # if the last_batch of scraped submissions is NOT blank / NoneType:
    if last_batch is not None:
        # Find the last row of the most recent scrape to pull the 
        # UTC time in order to know where to initiate the next scrape
        if len(last_batch) > 0:
            # creating the param 'before' = to be included in request.get call below
            # this calculates the utc date of the very last post of the most recent scrape
            params['before'] = last_batch[-1]['created_utc']
        else:
            return []

    # Make the request with url and params that have been updated for each scrape    
    res = requests.get(url, params=params)
    # return json data which will be added to post_list in compiler() below
    return res.json()['data']

# this function stores the scraped data and controls the intervals
# at which scraper() is run
def compiler(subreddit, posts_to_scrape):

    # creating a list to store retrieved posts
    post_list = []
    
    # setting up a list to hold the scraped posts from the 
    # most recent post scrape iteration
    # initially set to a NoneType object
    last_batch = None

    # setting up while loop that will run as long as last_batch is NOT blank
    # AND while the length of the post_list is still less than the number
    # of desired posts to scrape
    while last_batch != [] and len(post_list) < posts_to_scrape:
        # calling the scraper() function and saving it as last_batch
        last_batch = scraper(subreddit, last_batch)
        # adding the most recent batch of scraped submissions to the
        # larger post_list list
        post_list = post_list + last_batch
        # setting up .sleep() to randomly pick a wait time so
        # scrape seems more 'natural'
        # understanding time.sleep(): https://www.datacamp.com/community/tutorials/python-time-sleep
        time.sleep(random.randint(3,7))
    # return the post_list from the 0th row to the max # of posts requested    
    return post_list[:posts_to_scrape]
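
The `before` cursor used in scraper() is a Unix timestamp in seconds, taken from each post's `created_utc` field. When checking where a scrape stopped, it can help to convert that value into a readable UTC datetime. Below is a minimal sketch using the `datetime` module already imported above. The helper name `utc_to_datetime` is hypothetical, and the sample value is the `created_utc` of the first row shown in the `final_df.head(2)` output further down.

```python
import datetime as dt


def utc_to_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc)


print(utc_to_datetime(1615696693).isoformat())  # 2021-03-14T04:38:13+00:00
```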

In [18]:
# Function built to compile and merge dataframes from two subreddits
def compile_scrapes(df1, df2):
    
    # Turning scraped data into dataframes
    df1 = pd.DataFrame(df1)
    df2 = pd.DataFrame(df2)
    # combining the two scrapes into one dataframe
    df = pd.concat([df1, df2], axis=0, sort=False)

    # dropping posts that have been removed via a
    # moderator, user deleted, or by reddit
    df = df[df['removed_by_category'] != 'moderator' ]
    df = df[df['removed_by_category'] != 'deleted' ]
    df = df[df['removed_by_category'] != 'reddit' ]
    
    # keeping only posts with text in selftext
    df = df.loc[df['selftext'].str.len() > 0]

    # Exporting the compiled dataframe
    # (to_csv returns None when given a path, so the result is not stored)
    df.to_csv('../data/compiled_final2.csv', index=False)

    return df
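
To see what the filters in compile_scrapes() keep and drop, here is a small sketch on made-up rows (the data is invented for illustration, not taken from the scrape). It assumes that posts which were never removed have a missing `removed_by_category`, as in the sample rows shown further down. The three inequality filters are collapsed into a single `isin` check, which should behave the same way for these values.

```python
import pandas as pd

# Invented toy rows; only the two columns used by the filters are included.
toy = pd.DataFrame({
    "removed_by_category": [None, "moderator", "deleted", "reddit", None, None],
    "selftext": ["real post", "[removed]", "[deleted]", "[removed]", "", None],
})

# Keep rows whose removal category is not one of the three flagged values.
# Missing categories are kept, because a missing value is not in the list.
flagged = ["moderator", "deleted", "reddit"]
kept = toy[~toy["removed_by_category"].isin(flagged)]

# Keep rows with non-empty selftext. A missing selftext has no length,
# so the comparison is not True and those rows are dropped as well.
kept = kept.loc[kept["selftext"].str.len() > 0]

print(kept)
```

Only the first toy row survives: the flagged rows are removed by the category filter, and the rows with empty or missing selftext are removed by the length filter.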

In [19]:
# Scraping the first subreddit of interest
# and returning a dataframe
df1 = compiler('bodyweightfitness', 150)
df1 = pd.DataFrame(df1)
df1.shape

(150, 68)

In [20]:
# Scraping the second subreddit of interest
# and returning a dataframe
df2 = compiler('weightlifting', 145)
df2 = pd.DataFrame(df2)
df2.shape

(145, 78)

In [14]:
# Calling the compile_scrapes() to complete merge and do initial deletions
final_df = compile_scrapes(df1, df2)

In [15]:
final_df.shape

(26049, 107)

In [16]:
final_df.head(2)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,author_flair_template_id,author_flair_text_color,author_flair_background_color,banned_by,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,author_cakeday,edited,collections,suggested_sort,thumbnail_height,thumbnail_width,distinguished,gilded,link_flair_template_id,link_flair_text,link_flair_css_class,media,media_embed,secure_media,secure_media_embed,gallery_data,is_gallery,media_metadata,poll_data,event_end,event_is_live,event_start,steward_reports,removed_by,updated_utc,og_description,og_title,rte_mode,author_id,archived,author_created_utc,can_gild,category,content_categories,hidden,quarantine,removal_reason,subreddit_name_prefixed,brand_safe,previous_visits
34,[],False,Solfire,,"[{'e': 'text', 't': 'Dam Son'}]",Dam Son,richtext,t2_37jve,False,False,[],False,False,1615696693,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},m4o0zb,True,False,False,False,True,True,False,,[],dark,text,False,False,True,43,0,False,all_ads,/r/bodyweightfitness/comments/m4o0zb/sunday_sh...,False,6.0,,1615696704,1,"**HEY YOU,**\n\nHave you taken any recent pics...",True,False,True,bodyweightfitness,t5_2tf0a,2026376.0,public,self,Sunday Show Off - Because it's perfectly fine ...,0.0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6.0,,,,dark,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
70,[],False,teetee9,,[],,text,t2_16171byh,False,False,[],False,False,1615643440,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},m46noj,True,False,False,False,True,True,False,,[],dark,text,False,False,True,3,0,False,all_ads,/r/bodyweightfitness/comments/m46noj/what_woul...,False,6.0,,1615643451,1,the context that led me to ask this question i...,True,False,False,bodyweightfitness,t5_2tf0a,2025667.0,public,self,what would someone who is 5 foot tall and 114 ...,0.0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [17]:
final_df['subreddit'].value_counts(ascending=True)

weightlifting        12028
bodyweightfitness    14021
Name: subreddit, dtype: int64