<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

## Part 1: Webscraping and Data Gathering

## Problem Statement

Using the [Pushshift](https://github.com/pushshift/api) API, I will collect posts from two subreddits, r/depression and r/BipolarReddit. I will then use NLP to train a classifier that predicts which subreddit a given post came from.

### Contents:
- [Posts extraction](#Posts-extraction)
- [Export posts to CSV](#Export-posts-to-CSV)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import requests
import time

## Posts extraction

In [2]:
# Function to request posts from the Pushshift API

def get_posts(subreddit):
    # Initialize the endpoint URL, the query parameters, and an empty list to collect posts
    url = 'https://api.pushshift.io/reddit/search/submission'
    posts_ff = []
    params = {
        'subreddit': subreddit, # the subreddit to scrape from
        'size': 100, # retrieve 100 posts per request (hard limit of 100)
        'before': None, # timestamp cursor: only fetch posts created before this time
    }

    # Send up to 10 requests to collect ~1000 posts
    for i in range(10):
        res = requests.get(url, params=params) # Request data from Pushshift
        if res.status_code != 200: # Print the error and stop if the request fails
            print('Status error', res.status_code)
            break
        print(f'Request iteration: {i+1}\n Status code: {res.status_code}') # Print iteration count and status code
        posts = res.json()['data'] # List of post dictionaries returned by this request
        if len(posts) == 0: # Stop if the request returned no posts
            print('Request did not fetch any results')
            break
        # Use 'created_utc' of the last post as the new 'before' value so the next request continues from there
        params['before'] = posts[-1]['created_utc']
        posts_ff.extend(posts) # Add the posts from this request to the running list
        print('Number of posts scraped: ', len(posts_ff))
        time.sleep(3) # pause between requests
    print('**End of scrape**')
    return posts_ff
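
The `before` cursor logic can be checked without touching the network. The sketch below is not part of the original notebook: `paginate`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, and it assumes that each page is returned newest-first and that `before` excludes the given timestamp.

```python
def paginate(fetch_page, pages=10):
    """Collect posts page by page, moving the `before` cursor back in time."""
    collected = []
    before = None  # no cursor on the first request
    for _ in range(pages):
        posts = fetch_page(before)
        if not posts:  # an empty page means there is nothing older to fetch
            break
        collected.extend(posts)
        before = posts[-1]["created_utc"]  # oldest post on this page
    return collected


# Hypothetical stand-in for the API: 250 fake posts, newest first.
FAKE_POSTS = [{"id": i, "created_utc": 1000 - i} for i in range(250)]


def fake_fetch(before, size=100):
    """Return up to `size` fake posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


result = paginate(fake_fetch)
print(len(result))                     # 250
print(len({p["id"] for p in result}))  # 250, so no post was fetched twice
```

If two posts shared the same `created_utc` across a page boundary, a strict cursor like this one could skip one of them, so checking for duplicate or missing `id` values after a real scrape is a reasonable precaution.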

In [3]:
sub_depress = get_posts('depression')

Request iteration: 1
 Status code: 200
Number of posts scraped:  100
Request iteration: 2
 Status code: 200
Number of posts scraped:  200
Request iteration: 3
 Status code: 200
Number of posts scraped:  300
Request iteration: 4
 Status code: 200
Number of posts scraped:  400
Request iteration: 5
 Status code: 200
Number of posts scraped:  500
Request iteration: 6
 Status code: 200
Number of posts scraped:  600
Request iteration: 7
 Status code: 200
Number of posts scraped:  700
Request iteration: 8
 Status code: 200
Number of posts scraped:  800
Request iteration: 9
 Status code: 200
Number of posts scraped:  900
Request iteration: 10
 Status code: 200
Number of posts scraped:  1000
**End of scrape**


In [4]:
sub_bipolar = get_posts('bipolarreddit')

Request iteration: 1
 Status code: 200
Number of posts scraped:  100
Request iteration: 2
 Status code: 200
Number of posts scraped:  200
Request iteration: 3
 Status code: 200
Number of posts scraped:  300
Request iteration: 4
 Status code: 200
Number of posts scraped:  400
Request iteration: 5
 Status code: 200
Number of posts scraped:  500
Request iteration: 6
 Status code: 200
Number of posts scraped:  600
Request iteration: 7
 Status code: 200
Number of posts scraped:  700
Request iteration: 8
 Status code: 200
Number of posts scraped:  800
Request iteration: 9
 Status code: 200
Number of posts scraped:  900
Request iteration: 10
 Status code: 200
Number of posts scraped:  1000
**End of scrape**


In [5]:
# Check the type of each scrape result
print(type(sub_depress))
print(type(sub_bipolar))

<class 'list'>
<class 'list'>


In [6]:
# Check the keys of the first r/depression post
sub_depress[0].keys()

dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_text', 'author_flair_type', 'author_fullname', 'author_is_blocked', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_created_from_ads_ui', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_richtext', 'link_flair_text_color', 'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'removed_by_category', 'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'suggested_sort', 'thumbnail', 'title', 'total_awards_received', 'treatment_tags', 'upvote_ratio', 'url'

In [7]:
# Check the keys of the first r/BipolarReddit post
sub_bipolar[0].keys()

dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_text', 'author_flair_type', 'author_fullname', 'author_is_blocked', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_created_from_ads_ui', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'total_awards_received', 'treatment_tags', 'upvote_ratio', '
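
The two key lists above are not identical, which is hard to see by eye. A minimal sketch of comparing them with set operations follows; `post_a` and `post_b` are small hypothetical dictionaries standing in for `sub_depress[0]` and `sub_bipolar[0]`, and their values are made up for illustration.

```python
# Hypothetical, heavily shortened stand-ins for the first post of each subreddit.
post_a = {"id": "x1", "title": "t", "selftext": "s", "suggested_sort": "confidence"}
post_b = {"id": "y2", "title": "t", "selftext": "s", "link_flair_text": "Discussion"}

keys_a, keys_b = set(post_a), set(post_b)

only_a = sorted(keys_a - keys_b)  # fields present only in the first post
only_b = sorted(keys_b - keys_a)  # fields present only in the second post
shared = sorted(keys_a & keys_b)  # fields available in both posts

print(only_a)  # ['suggested_sort']
print(only_b)  # ['link_flair_text']
print(shared)  # ['id', 'selftext', 'title']
```

With the real data, the same comparison could be run on `sub_depress[0]` and `sub_bipolar[0]` to see which fields are safe to rely on for both subreddits.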

In [8]:
# Convert each subreddit's list of posts to a DataFrame
df_depress = pd.DataFrame(sub_depress)
df_bipolar = pd.DataFrame(sub_bipolar)

In [9]:
# Display number of rows, columns for dataframes
print(f'Depression DF: {df_depress.shape}')
print(f'Bipolar DF: {df_bipolar.shape}')

Depression DF: (1000, 64)
Bipolar DF: (1000, 71)


In [10]:
df_depress.head(3)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,author_cakeday,post_hint,preview
0,[],False,_Glokta,,[],,text,t2_7dk5nrzz,False,False,...,A very cringy rap/poem I wrote up while I was ...,0,[],1.0,https://www.reddit.com/r/depression/comments/w...,no_ads,0.0,,,
1,[],False,below298,,[],,text,t2_g3sc7al9,False,False,...,Crying in the dark,0,[],1.0,https://www.reddit.com/r/depression/comments/w...,no_ads,0.0,,,
2,[],False,Fistandantis,,[],,text,t2_4acc43w4,False,False,...,After a two years of feeling like shit i am lo...,0,[],1.0,https://www.reddit.com/r/depression/comments/w...,no_ads,0.0,,,


In [11]:
df_bipolar.head(3)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,author_flair_background_color,author_flair_text_color,removed_by_category,author_flair_template_id,post_hint,preview,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,author_cakeday
0,[],False,bejbisvejeok1,,[],,text,t2_q7ov1ijy,False,False,...,,,,,,,,,,
1,[],False,Defiant-Order1997,,[],,text,t2_a6h5ybov,False,False,...,,,,,,,,,,
2,[],False,Crazy-Leadership1834,,[],,text,t2_fhqb0p92,False,False,...,,,,,,,,,,


## Export posts to CSV

In [12]:
df_depress.to_csv('../data/depression_posts.csv',index=False)
df_bipolar.to_csv('../data/bipolar_posts.csv',index=False)
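
The `index=False` argument matters when the files are read back in a later notebook. The sketch below is an illustration on a tiny made-up DataFrame (the `posts` data and the `roundtrip` helper are hypothetical), writing to an in-memory buffer instead of the `../data/` files.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the scraped posts.
posts = [
    {"id": "a1", "title": "First", "selftext": "hello", "created_utc": 1650000000},
    {"id": "b2", "title": "Second", "selftext": "world", "created_utc": 1650000100},
]
df = pd.DataFrame(posts)


def roundtrip(frame, index):
    """Write `frame` to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = roundtrip(df, index=True)
without_index = roundtrip(df, index=False)

print(list(with_index.columns))     # starts with 'Unnamed: 0'
print(list(without_index.columns))  # same columns as the original frame
```

Saving without the index keeps the reloaded columns identical to the original ones, so no extra `Unnamed: 0` column has to be dropped before modelling.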