<img src="https://imgur.com/3Ua9VYU.png" style="float: left; margin: 18px; height: 75px"> 

## *Data Extraction with Wrangling & Cleaning*
---

## Import Libraries
---

In [2]:
import requests
import pandas as pd
import time

## Pushshift Reddit API
---

In [568]:
#Returns a dataframe with the subreddit and body of each comment.
#The API accepts a maximum 'size' of 500 per request.
#A status code of 200 means the request succeeded.
#AutoModerator comments and [removed] comments are dropped since they are of no use to the model.

def reddit_comment_getter(subreddit,num_posts):
    '''
    Accumulates subreddit comments from Pushshift Reddit API
    returns a pandas dataframe of subreddit & body of comment.
    
    reddit_comment_getter(subreddit,num_posts)
    
    Parameters -
    
    subreddit: name of subreddit in string format('name_of_subreddit') 
    
    num_posts: minimum total rows of data to return, in integer format(#); also the 'size' requested on the first API call
    '''
    url='https://api.pushshift.io/reddit/search/comment'
    params={
        'subreddit': subreddit,
        'size': num_posts
    }
    res=requests.get(url,params)
    if res.status_code != 200:
        return f"Error {res.status_code}: {subreddit} doesn't work! Try Again"
    else:
        resframed=pd.DataFrame(res.json()['data'])
        df1=resframed.query('author != "AutoModerator" & `body` != "[removed]"')[['subreddit','body','created_utc']]

    while len(df1)<num_posts:
        oldest=int(df1['created_utc'].iloc[-1]) #timestamp of the oldest comment collected so far
        params={
            'subreddit': subreddit,
            'size': num_posts-len(df1),
            'before': oldest
        }
        res=requests.get(url,params)
        if res.status_code != 200:
            return f"Error {res.status_code}: {subreddit} doesn't work! Try Again"
        else:
            resframed2=pd.DataFrame(res.json()['data'])
            if resframed2.empty: #no older comments left, so stop paging
                break
            df2=resframed2.query('author != "AutoModerator" & `body` != "[removed]"')[['subreddit','body','created_utc']]
            df1=pd.concat([df1,df2]) #DataFrame.append was removed in pandas 2.0
        time.sleep(3) #pause between requests to avoid 429 (Too Many Requests) errors
    return (df1.reset_index())[['subreddit','body']]


#.query syntax knowledge obtained from:
    #https://stackoverflow.com/questions/26535563/querying-for-nan-and-other-names-in-pandas/26535881#26535881
    #https://stackoverflow.com/questions/40045545/pandas-query-string-where-column-name-contains-special-characters
    #https://stackoverflow.com/questions/13611065/efficient-way-to-apply-multiple-filters-to-pandas-dataframe-or-series
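
The filtering and paging logic above can be tried without touching the network. What follows is a minimal sketch, not the notebook's own implementation: `clean_page`, `collect_comments`, `fetch_page` and the fake data are hypothetical names introduced here, and `fetch_page` stands in for the `requests.get` call so that a fake data source can be supplied. It also assumes each page arrives newest first.

```python
import pandas as pd

KEEP_COLS = ['subreddit', 'body', 'created_utc']

def clean_page(records):
    """Drop AutoModerator and [removed] comments from one page of raw records."""
    frame = pd.DataFrame(records)
    if frame.empty:
        return pd.DataFrame(columns=KEEP_COLS)
    mask = (frame['author'] != 'AutoModerator') & (frame['body'] != '[removed]')
    return frame.loc[mask, KEEP_COLS]

def collect_comments(fetch_page, num_posts):
    """Page backwards in time until num_posts cleaned rows are collected.

    fetch_page(size, before) is a hypothetical callable returning a list of
    comment dicts, newest first; before=None asks for the most recent page.
    """
    pages = []
    total = 0
    before = None
    while total < num_posts:
        records = fetch_page(num_posts - total, before)
        if not records:  # source exhausted, so stop instead of looping forever
            break
        # Move the cursor using the raw page, so a fully filtered page still advances it
        before = min(r['created_utc'] for r in records)
        page = clean_page(records)
        pages.append(page)
        total += len(page)
    if not pages:
        return pd.DataFrame(columns=['subreddit', 'body'])
    return pd.concat(pages, ignore_index=True)[['subreddit', 'body']]

# Fake data source: 10 comments with descending timestamps (made up for illustration)
FAKE = [
    {'subreddit': 'bodybuilding',
     'author': 'AutoModerator' if i % 5 == 0 else f'user{i}',
     'body': '[removed]' if i == 3 else f'comment {i}',
     'created_utc': 1000 - i}
    for i in range(10)
]

def fake_fetch(size, before):
    older = [r for r in FAKE if before is None or r['created_utc'] < before]
    return older[:size]

sample = collect_comments(fake_fetch, 6)
exhausted = collect_comments(fake_fetch, 100)
print(sample)
```

Taking the cursor from the raw page rather than from the filtered rows is a design choice: if every comment on a page were filtered out, a cursor based on the cleaned rows would never move and the same page would be requested again.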

## Retrieving Subreddit Comments
---

In [569]:
bodybuilding=reddit_comment_getter('bodybuilding',1000)
powerlifting=reddit_comment_getter('powerlifting',1000)
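
Since `reddit_comment_getter` returns an error string instead of a dataframe when a request fails, it can help to check each result before passing it on to `pd.concat`. A minimal sketch, where `ensure_frame` is a hypothetical helper that is not part of the notebook and the inputs are made up for illustration:

```python
import pandas as pd

def ensure_frame(result, name):
    """Return result unchanged if it is a DataFrame, otherwise raise with the error text."""
    if not isinstance(result, pd.DataFrame):
        raise RuntimeError(f"{name}: {result}")
    return result

# A successful result passes straight through
ok = ensure_frame(pd.DataFrame({'subreddit': ['bodybuilding'], 'body': ['hi']}), 'bodybuilding')

# An error string (the shape returned on a non-200 status) is turned into an exception
try:
    ensure_frame("Error 429: powerlifting doesn't work! Try Again", 'powerlifting')
    failed = False
    message = ''
except RuntimeError as err:
    failed = True
    message = str(err)
```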

## Concatenating the Two Subreddits
---

In [577]:
pd.concat([bodybuilding,powerlifting],ignore_index=True).to_csv('./dataset/bbandpl.csv',escapechar='\r',index=False)
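
The concatenate-then-save step can be checked with a small round trip. This is a sketch under stated assumptions: the two frames are made-up stand-ins for the real subreddit data, and an in-memory `io.StringIO` buffer replaces the file on disk. With `index=False` the row index is not written, so reading the data back yields only the `subreddit` and `body` columns.

```python
import io
import pandas as pd

# Made-up stand-ins for the two subreddit dataframes
bb = pd.DataFrame({'subreddit': ['bodybuilding'] * 2,
                   'body': ['first, with a comma', 'second "quoted"']})
pl = pd.DataFrame({'subreddit': ['powerlifting'] * 2,
                   'body': ['third', 'fourth']})

# ignore_index=True renumbers the rows 0..n-1 across both frames
combined = pd.concat([bb, pl], ignore_index=True)

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored['subreddit'].value_counts())
```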

In [4]:
#checking csv
pd.read_csv('../dataset/bbandpl.csv')

| Unnamed: 0 | subreddit | body |
|---|---|---|
| 0 | bodybuilding | Lmao so classic is basically open now |
| 1 | bodybuilding | No one had a coach up until late 90s. Seems li... |
| 2 | bodybuilding | Let’s all agree with enough and some. 1st plac... |
| 3 | bodybuilding | 58 now 59 at Christmas |
| 4 | bodybuilding | Where is the cheapest iso100 fruity pebbles? I... |
| ... | ... | ... |
| 1995 | powerlifting | Eddie without a suit hasnt ever pulled over 46... |
| 1996 | powerlifting | I used it for a brief amount of time. I think ... |
| 1997 | powerlifting | Go for it, especially if these sources are pee... |
| 1998 | powerlifting | Is anyone using JuggernautAI for powerlifting?... |
