# Web APIs & Classification

# Reddit API and Classification

For project 3, your goal is two-fold:
1. Using Reddit's API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.

#### About the API

Reddit's API is fairly straightforward. For example, if I want the posts from [`/r/boardgames`](https://www.reddit.com/r/boardgames), all I have to do is add `.json` to the end of the url: https://www.reddit.com/r/boardgames.json

To help you get started, we have a primer video on how to use Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk




---

### Requirements

- Gather and prepare your data using the `requests` library.
- **Create and compare two models**. One of these must be a Bayes classifier; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of the results you found.
- A short presentation outlining your process and findings for a semi-technical audience.
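As a rough illustration of the "create and compare two models" requirement, the sketch below fits a Naive Bayes pipeline and a logistic regression pipeline on the same bag-of-words features. The titles and labels are made up for illustration, and the held-out set is far too small to say anything about real performance; it only shows the shape of the comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy titles standing in for scraped posts (invented, not real data).
train_titles = [
    "new pixel phone leaks with android 11",
    "samsung galaxy update rolls out",
    "best launcher apps for android",
    "iphone 12 event announced",
    "macbook pro battery replacement",
    "ios 14 widgets on the home screen",
]
train_labels = ["Android", "Android", "Android", "apple", "apple", "apple"]

test_titles = ["android phone update", "new iphone and macbook"]
test_labels = ["Android", "apple"]

# Both pipelines share the same vectorizer settings so only the classifier differs.
models = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

# Accuracy on the held-out titles, one entry per model.
scores = {}
for name, model in models.items():
    model.fit(train_titles, train_labels)
    scores[name] = model.score(test_titles, test_labels)
```

With real data you would typically swap the single held-out split for cross-validation and look at more than accuracy.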

**Pro Tip 1:** You can find a good example executive summary [here](https://www.proposify.biz/blog/executive-summary).

**Pro Tip 2:** Reddit will give you 25 posts **per request**. To get enough data, you'll need to hit Reddit's API **repeatedly** (most likely in a `for` loop). _Be sure to use the `time.sleep()` function at the end of your loop to allow for a break in between requests. **THIS IS CRUCIAL**_

**Pro Tip 3:** The API will cap you at 1,000 posts for each subreddit (assuming the subreddit has that many posts).

**Pro Tip 4:** At the end of each loop iteration, be sure to save the results from your scrape as a `csv`: JSON from Reddit > Pandas DataFrame > CSV. That way, if something goes wrong in your loop, you won't lose all your data.
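The pause-and-save pattern from the tips above can be sketched as follows. This is a minimal sketch, not a required implementation: `collect_with_checkpoints`, `pages`, and `pause_seconds` are hypothetical names, and `pages` stands in for whatever yields one list of post dictionaries per request.

```python
import time

import pandas as pd


def collect_with_checkpoints(pages, csv_path, pause_seconds=2.0):
    """Accumulate posts page by page, rewriting the CSV after every page.

    `pages` is any iterable that yields one list of post dicts per request.
    """
    posts = []
    for page in pages:
        posts.extend(page)
        # Checkpoint: everything collected so far survives a later failure.
        pd.DataFrame(posts).to_csv(csv_path, index=False)
        # Pause between requests so the server is not hammered.
        time.sleep(pause_seconds)
    return posts
```

Rewriting the whole file each time is wasteful for large scrapes, but at roughly 1,000 posts per subreddit the cost is small compared with the pause between requests.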



---

### Necessary Deliverables / Submission

- Code and executive summary must be in a clearly commented Jupyter Notebook.
- You must submit your slide deck.
- Materials must be submitted by **9:30 AM on Friday, Oct 2 2020**.



For Project 3 the evaluation categories are as follows:<br>
**The Data Science Process**
- Problem Statement
- Data Collection
- Data Cleaning & EDA
- Preprocessing & Modeling
- Evaluation and Conceptual Understanding
- Conclusion and Recommendations

**Organization and Professionalism**
- Organization
- Visualizations
- Python Syntax and Control Flow
- Presentation

---

### Why did we choose this project for you?
This project covers three of the biggest concepts we cover in the class: Classification Modeling, Natural Language Processing and Data Wrangling/Acquisition.

Part 1 of the project focuses on **Data wrangling/gathering/acquisition**. This is a very important skill as not all the data you will need will be in clean CSVs or a single table in SQL.  There is a good chance that wherever you land you will have to gather some data from some unstructured/semi-structured sources; when possible, requesting information from an API, but often scraping it because they don't have an API (or it's terribly documented).

Part 2 of the project focuses on **Natural Language Processing** and converting standard text data (like Titles and Comments) into a format that allows us to analyze it and use it in modeling.

Part 3 of the project focuses on **Classification Modeling**.  Given that project 2 was a regression focused problem, we needed to give you a classification focused problem to practice the various models, means of assessment and preprocessing associated with classification.   


### The Data Science Process

### Organization and Professionalism

**Project Organization**
- Are modules imported correctly (using appropriate aliases)?
- Are data imported/saved using relative paths?
- Does the README provide a good executive summary of the project?
- Is markdown formatting used appropriately to structure notebooks?
- Is there an appropriate number of comments to support the code?
- Are files & directories organized correctly?
- Are there unnecessary files included?
- Do files and directories have well-structured, appropriate, consistent names?

**Visualizations**
- Are sufficient visualizations provided?
- Do plots accurately demonstrate valid relationships?
- Are plots labeled properly?
- Are plots interpreted appropriately?
- Are plots formatted and scaled appropriately for inclusion in a notebook-based technical report?

**Python Syntax and Control Flow**
- Is care taken to write human readable code?
- Is the code syntactically correct (no runtime errors)?
- Does the code generate desired results (logically correct)?
- Does the code follow general best practices and style guidelines?
- Are Pandas functions used appropriately?
- Are `sklearn` and `NLTK` methods used appropriately?

**Presentation**
- Is the problem statement clearly presented?
- Does a strong narrative run through the presentation building toward a final conclusion?
- Are the conclusions/recommendations clearly stated?
- Is the level of technicality appropriate for the intended audience?
- Is the student substantially over or under time?
- Does the student appropriately pace their presentation?
- Does the student deliver their message with clarity and volume?
- Are appropriate visualizations generated for the intended audience?
- Are visualizations necessary and useful for supporting conclusions/explaining findings?



## Executive Summary

## Problem Statement

**Problem Statement**
- Is it clear what the goal of the project is?
- What type of model will be developed?
- How will success be evaluated?
- Is the scope of the project appropriate?
- Is it clear who cares about this or why this is important to investigate?
- Does the student consider the audience and the primary and secondary stakeholders?

**Data Collection**
- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests such as considering number of requests per second?



## Data Collection & Wrangling

We will be using Reddit's JSON API to collect posts (i.e. threads) from the two subreddits:<br>
- r/Android
- r/apple

At the end of this section, we will have two dataframes, one for each subreddit, containing the collected posts.

### Library Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

from bs4 import BeautifulSoup
import requests
import time
import random

### Get requests
To access the subreddits, we begin by using the `requests` library to send HTTP GET requests. Each request returns a response object containing the response data.

In [2]:
#defining url to access
url_android = 'https://www.reddit.com/r/Android.json'
url_apple = 'https://www.reddit.com/r/apple.json'

In [3]:
#sending get requests to the defined urls
res_android = requests.get(url_android, headers={'User-agent': 'Pony Inc 1.0'})
res_apple = requests.get(url_apple, headers={'User-agent': 'Tony Inc 2.0'})

In [4]:
#checking status_code for android subreddit
#the HTTP 200 OK success status response code indicates that the request has succeeded
res_android.status_code

200

In [5]:
#checking status_code for apple subreddit
#the HTTP 200 OK success status response code indicates that the request has succeeded
res_apple.status_code

200

#### Let's begin with the JSON object from the Android subreddit.
The JSON object is written like a dictionary object with key value pairs.

In [6]:
dict_android = res_android.json()
dict_android

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'Android',
     'selftext': 'Note 1. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.](https://www.reddit.com/r/Android/wiki/index#wiki_.2Fr.2Fandroid_chat_rooms)\n\nThis weekly Sunday thread is for you to let off some steam and speak out about whatever complaint you might have about:  \n\n* Your device.  \n\n* Your carrier.  \n\n* Your device\'s manufacturer.  \n\n* An app  \n\n* Any other company\n\n***  \n\n**Rules**  \n\n1) Please do not target any individuals or try to name/shame any individual. If you hate Google/Samsung/HTC etc. for one thing that is fine, but do not be rude to an individual app developer.\n\n2) If you have a suggestion to solve another user\'s issue, please leave a comment but be sure it\'s constructive! We do not want any flame-wars.  \n\n3) Be respectful of other\'s opinions. Even if you 

#### Exploring the JSON object

In [7]:
dict_android.keys()

dict_keys(['kind', 'data'])

In [8]:
dict_android['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

#### Subreddit posts "location"

In [9]:
#the subreddit posts are nested in the key: children
dict_android['data']['children'][0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'Android',
  'selftext': 'Note 1. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.](https://www.reddit.com/r/Android/wiki/index#wiki_.2Fr.2Fandroid_chat_rooms)\n\nThis weekly Sunday thread is for you to let off some steam and speak out about whatever complaint you might have about:  \n\n* Your device.  \n\n* Your carrier.  \n\n* Your device\'s manufacturer.  \n\n* An app  \n\n* Any other company\n\n***  \n\n**Rules**  \n\n1) Please do not target any individuals or try to name/shame any individual. If you hate Google/Samsung/HTC etc. for one thing that is fine, but do not be rude to an individual app developer.\n\n2) If you have a suggestion to solve another user\'s issue, please leave a comment but be sure it\'s constructive! We do not want any flame-wars.  \n\n3) Be respectful of other\'s opinions. Even if you feel that somebody is "wrong" you don\'t have to go out of your way to prove them w

In [10]:
#key to access the next posts
dict_android['data']['after']

't3_izv8c1'

In [11]:
dict_android['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [12]:
dict_android['data']['children'][0]['kind']

't3'

In [13]:
dict_android['data']['children'][0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort',

In [14]:
#Subreddit of post
dict_android['data']['children'][0]['data']['subreddit']

'Android'

In [15]:
#Title of first post in the subreddit
dict_android['data']['children'][0]['data']['title']

'Sunday Rant/Rage (Sep 27 2020) - Your weekly complaint thread!'

In [16]:
#Title of second post in the subreddit
dict_android['data']['children'][1]['data']['title']

'Google Maps is getting dedicated car mode UI'

In [17]:
#First post content
dict_android['data']['children'][0]['data']['selftext']

'Note 1. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.](https://www.reddit.com/r/Android/wiki/index#wiki_.2Fr.2Fandroid_chat_rooms)\n\nThis weekly Sunday thread is for you to let off some steam and speak out about whatever complaint you might have about:  \n\n* Your device.  \n\n* Your carrier.  \n\n* Your device\'s manufacturer.  \n\n* An app  \n\n* Any other company\n\n***  \n\n**Rules**  \n\n1) Please do not target any individuals or try to name/shame any individual. If you hate Google/Samsung/HTC etc. for one thing that is fine, but do not be rude to an individual app developer.\n\n2) If you have a suggestion to solve another user\'s issue, please leave a comment but be sure it\'s constructive! We do not want any flame-wars.  \n\n3) Be respectful of other\'s opinions. Even if you feel that somebody is "wrong" you don\'t have to go out of your way to prove them wrong. Disagree politely, and move on.'

In [18]:
posts_android = [p['data'] for p in dict_android['data']['children']]

In [19]:
pd.DataFrame(posts_android)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,url_overridden_by_dest,preview,link_flair_template_id
0,,Android,"Note 1. Join our IRC, and Telegram chat-rooms!...",t2_6l4z3,False,,0,False,Sunday Rant/Rage (Sep 27 2020) - Your weekly c...,[],...,https://www.reddit.com/r/Android/comments/j0pd...,2268400,1601205000.0,0,,False,,,,
1,,Android,,t2_q4p0j,False,,0,False,Google Maps is getting dedicated car mode UI,[],...,https://www.androidpolice.com/2020/09/27/googl...,2268400,1601245000.0,0,,False,link,https://www.androidpolice.com/2020/09/27/googl...,{'images': [{'source': {'url': 'https://extern...,
2,,Android,,t2_31mkizvx,False,,0,False,Tasker lets you intercept Samsung S Pen gestur...,[],...,https://www.xda-developers.com/customize-samsu...,2268400,1601216000.0,0,,False,link,https://www.xda-developers.com/customize-samsu...,{'images': [{'source': {'url': 'https://extern...,
3,,Android,,t2_gernm,False,,0,False,22% off nearly everything in European Google S...,[],...,https://store.google.com/,2268400,1601202000.0,0,,False,link,https://store.google.com/,{'images': [{'source': {'url': 'https://extern...,
4,,Android,,t2_cc9vk,False,,0,False,The new Galaxy S20 FE: $100 off at Amazon and ...,[],...,https://www.cnet.com/news/galaxy-s20-fe-preord...,2268400,1601226000.0,1,,False,link,https://www.cnet.com/news/galaxy-s20-fe-preord...,{'images': [{'source': {'url': 'https://extern...,
5,,Android,,t2_g1dl4zr,False,,0,False,U.S. antitrust investigation of Google is comi...,[],...,https://fortune.com/2020/09/27/google-antitrus...,2268400,1601258000.0,0,,False,link,https://fortune.com/2020/09/27/google-antitrus...,{'images': [{'source': {'url': 'https://extern...,
6,,Android,If your device is running Android 6.0+ then yo...,t2_15vsl7,False,,0,False,[Pro Tip] Enable Nearby Share on all your andr...,[],...,https://www.reddit.com/r/Android/comments/j0p1...,2268400,1601203000.0,0,,False,self,,{'images': [{'source': {'url': 'https://extern...,
7,,Android,,t2_jy4qk,False,,0,False,New budget Lenovo P11 tablet leaks with Snapdr...,[],...,https://www.techniknews.net/news/lenovo-p11-al...,2268400,1601225000.0,0,,False,link,https://www.techniknews.net/news/lenovo-p11-al...,{'images': [{'source': {'url': 'https://extern...,
8,,Android,,t2_1xqjsw6h,False,,0,False,"Android 11 got rid of the 4GB limit on videos,...",[],...,https://www.androidpolice.com/2020/09/26/andro...,2268400,1601133000.0,0,,False,link,https://www.androidpolice.com/2020/09/26/andro...,{'images': [{'source': {'url': 'https://extern...,8770017c-413c-11e3-8132-12313b0ae6f4
9,,Android,,t2_1zm8phdi,False,,0,False,Suface Duo vs LG V60: More different than you'...,[],...,https://youtube.com/watch?v=9yGq24i8gA4,2268400,1601191000.0,0,"{'type': 'youtube.com', 'oembed': {'provider_u...",False,rich:video,https://youtube.com/watch?v=9yGq24i8gA4,{'images': [{'source': {'url': 'https://extern...,
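Most of the columns in the full dataframe are unlikely to matter for text classification. As a hedged sketch, the helper below keeps only a few fields from each post before building a dataframe. `extract_fields` is a hypothetical helper, and `sample_listing` is a hand-written stand-in shaped like the response explored above, not real API output.

```python
def extract_fields(listing, fields=("subreddit", "title", "selftext")):
    """Return one small dict per post, keeping only the requested fields."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # .get() returns None instead of raising if a post lacks a field.
        rows.append({field: post.get(field) for field in fields})
    return rows


# Hand-written stand-in with the same nesting as the real JSON object.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_izv8c1",
        "children": [
            {
                "kind": "t3",
                "data": {
                    "subreddit": "Android",
                    "title": "Google Maps is getting dedicated car mode UI",
                    "selftext": "",
                    "ups": 0,
                },
            }
        ],
    },
}

rows = extract_fields(sample_listing)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` in the same way as `posts_android`.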


In [20]:
url_android + '?after=' + dict_android['data']['after']

'https://www.reddit.com/r/Android.json?after=t3_izv8c1'
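String concatenation works here because the `after` token contains only URL-safe characters. A slightly more defensive sketch, assuming nothing beyond the standard library, builds the query string with `urlencode`; `page_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def page_url(base_url, after=None):
    """Return the listing URL for the first page or for the page after a token."""
    if after is None:
        return base_url
    # urlencode escapes any characters that are not safe in a query string.
    return base_url + "?" + urlencode({"after": after})


next_url = page_url("https://www.reddit.com/r/Android.json", "t3_izv8c1")
```

Alternatively, `requests.get` accepts a `params` dictionary and appends the query string itself, which avoids building the URL by hand.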

#### Obtaining posts

Each request returns about 25 posts, and roughly 30 to 35 requests are enough to reach the maximum number of posts available. When we tried more than 35 requests, the URLs printed to track each request showed that the loop had started requesting URLs it had already visited, so the extra requests only returned duplicate posts.
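One way to avoid re-requesting pages is to stop as soon as the listing runs out. The sketch below assumes that the end of a listing shows up either as an `after` value of `None` or as a token that has already been seen; that assumption is not confirmed in this notebook. `collect_posts` and `fetch_page` are hypothetical names, and the fake pages replace real HTTP calls so the example runs offline.

```python
def collect_posts(fetch_page, max_requests=40):
    """Follow `after` tokens until the listing ends or max_requests is reached.

    `fetch_page(after)` must return a parsed listing dict for the given token.
    """
    posts = []
    seen_tokens = set()
    after = None
    for _ in range(max_requests):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        # Stop instead of looping back to a page that was already requested.
        if after is None or after in seen_tokens:
            break
        seen_tokens.add(after)
    return posts


# Two fake pages keyed by the token that requests them.
fake_pages = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}}], "after": "t3_a"}},
    "t3_a": {"data": {"children": [{"data": {"name": "t3_b"}}], "after": None}},
}
posts = collect_posts(fake_pages.get)
```

In real use, `fetch_page` would wrap the `requests.get` call and the pause between requests.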

<br>


In [21]:
%%time
posts_apple = []
after_apple = None
request_num = 0
for a in range(45):
    if after_apple is None:
        current_url = url_apple
    else:
        current_url = url_apple + '?after=' + after_apple
    print(current_url)

    res_apple = requests.get(current_url, headers={'User-agent':'Tony Inc 2.0'})
    if res_apple.status_code != 200:
        print("Status error", res_apple.status_code)
        break
        
    current_dict_apple = res_apple.json()
    current_posts_apple = [p['data'] for p in current_dict_apple['data']['children']]
    posts_apple.extend(current_posts_apple)
    after_apple = current_dict_apple['data']['after']
    request_num += 1
    if request_num % 10 == 0:
        print(f'Request num: {request_num}')
        
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)


https://www.reddit.com/r/apple.json
6
https://www.reddit.com/r/apple.json?after=t3_j06wzm
2
https://www.reddit.com/r/apple.json?after=t3_j02p2w
4
https://www.reddit.com/r/apple.json?after=t3_j05ws0
5
https://www.reddit.com/r/apple.json?after=t3_j0670r
2
https://www.reddit.com/r/apple.json?after=t3_iyu4vk
4
https://www.reddit.com/r/apple.json?after=t3_iybsyi
5
https://www.reddit.com/r/apple.json?after=t3_izcdcw
2
https://www.reddit.com/r/apple.json?after=t3_iysjo5
3
https://www.reddit.com/r/apple.json?after=t3_ix32rn
Request num: 10
4
https://www.reddit.com/r/apple.json?after=t3_iwswjl
6
https://www.reddit.com/r/apple.json?after=t3_ivx31l
4
https://www.reddit.com/r/apple.json?after=t3_ivl1pj
3
https://www.reddit.com/r/apple.json?after=t3_iwa65i
5
https://www.reddit.com/r/apple.json?after=t3_ivxlf0
4
https://www.reddit.com/r/apple.json?after=t3_iuiasz
6
https://www.reddit.com/r/apple.json?after=t3_iuxujh
2
https://www.reddit.com/r/apple.json?after=t3_itphm8
3
https://www.reddit.com/r/app

Below, we define a function that runs this request loop for the subreddit at a given URL and saves the collected posts to a CSV file. Based on the observation above, we set the number of requests to 40, which leaves a margin beyond the 30 to 35 requests needed to reach the available posts.

In [22]:
#Function defined to obtain posts from subreddits
def obtain_posts(url, file_path):
    '''
    Send up to 40 GET requests to a subreddit URL using Reddit's JSON API and
    export the collected posts into a comma-separated values (csv) file.
    
    Parameters
    ----------
    url : str
        string containing the URL to which the get request is sent
    file_path: str, path object
        destination file path for saved csv output containing subreddit posts
    
    '''
    posts = []
    after = None
    #we will be obtaining requests 40 times
    for n in range(40):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        #print url to track request
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']
        
        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,6)
        print(sleep_duration)
        time.sleep(sleep_duration)
    pd.DataFrame(posts).to_csv(file_path, index = False)

In [23]:
%%time
# Let's obtain posts from the Android subreddit and save the output to android_posts.csv
obtain_posts(url_android, '../datasets/android_posts.csv')

https://www.reddit.com/r/Android.json
2
https://www.reddit.com/r/Android.json?after=t3_izv8c1
3
https://www.reddit.com/r/Android.json?after=t3_iyfqq6
4
https://www.reddit.com/r/Android.json?after=t3_ixol4k
5
https://www.reddit.com/r/Android.json?after=t3_ix1lec
6
https://www.reddit.com/r/Android.json?after=t3_iw37u9
3
https://www.reddit.com/r/Android.json?after=t3_iv3jbo
3
https://www.reddit.com/r/Android.json?after=t3_iu03iu
2
https://www.reddit.com/r/Android.json?after=t3_it0irx
2
https://www.reddit.com/r/Android.json?after=t3_isk6l6
5
https://www.reddit.com/r/Android.json?after=t3_ir67te
2
https://www.reddit.com/r/Android.json?after=t3_ipsxt9
2
https://www.reddit.com/r/Android.json?after=t3_ipmt61
5
https://www.reddit.com/r/Android.json?after=t3_ioyuhj
4
https://www.reddit.com/r/Android.json?after=t3_inzthu
5
https://www.reddit.com/r/Android.json?after=t3_imvkqf
3
https://www.reddit.com/r/Android.json?after=t3_im3v3e
3
https://www.reddit.com/r/Android.json?after=t3_il7yie
6
https://

In [24]:
%%time
# Let's obtain posts from the apple subreddit and save the output to apple_posts.csv
obtain_posts(url_apple, '../datasets/apple_posts.csv')

https://www.reddit.com/r/apple.json
4
https://www.reddit.com/r/apple.json?after=t3_j06wzm
3
https://www.reddit.com/r/apple.json?after=t3_j02p2w
6
https://www.reddit.com/r/apple.json?after=t3_j05ws0
4
https://www.reddit.com/r/apple.json?after=t3_j0670r
4
https://www.reddit.com/r/apple.json?after=t3_iyu4vk
5
https://www.reddit.com/r/apple.json?after=t3_iybsyi
3
https://www.reddit.com/r/apple.json?after=t3_izcdcw
3
https://www.reddit.com/r/apple.json?after=t3_iysjo5
2
https://www.reddit.com/r/apple.json?after=t3_ix32rn
6
https://www.reddit.com/r/apple.json?after=t3_iwswjl
4
https://www.reddit.com/r/apple.json?after=t3_ivx31l
3
https://www.reddit.com/r/apple.json?after=t3_ivl1pj
4
https://www.reddit.com/r/apple.json?after=t3_iwa65i
3
https://www.reddit.com/r/apple.json?after=t3_ivxlf0
6
https://www.reddit.com/r/apple.json?after=t3_iuiasz
4
https://www.reddit.com/r/apple.json?after=t3_iuxujh
5
https://www.reddit.com/r/apple.json?after=t3_itphm8
3
https://www.reddit.com/r/apple.json?after=t3

### Data collected
We have managed to collect posts/threads in the Android subreddit and apple subreddit.

In [25]:
#check android posts
df_android = pd.read_csv('../datasets/android_posts.csv')
df_android.shape

(982, 109)

In [28]:
df_android['title'].nunique()

729
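With 729 unique titles among 982 rows, some rows are likely repeats of the same post. A minimal sketch of removing them is shown below, assuming the `name` column (one of the keys listed earlier) uniquely identifies a post. `drop_repeated_posts` is a hypothetical helper and the toy dataframe is invented for illustration.

```python
import pandas as pd


def drop_repeated_posts(df, key="name"):
    """Keep the first occurrence of each post, identified by `key`."""
    return df.drop_duplicates(subset=key).reset_index(drop=True)


# Toy frame in which the first post was scraped twice.
toy = pd.DataFrame(
    {
        "name": ["t3_a", "t3_b", "t3_a"],
        "title": ["first", "second", "first"],
    }
)
deduped = drop_repeated_posts(toy)
```

Deduplicating on an identifier rather than on `title` avoids dropping distinct posts that happen to share a title.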

In [26]:
df_android.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,Android,Android,Android,Android,Android
selftext,"Note 1. Join our IRC, and Telegram chat-rooms!...",,,,
author_fullname,t2_6l4z3,t2_q4p0j,t2_31mkizvx,t2_gernm,t2_cc9vk
saved,False,False,False,False,False
...,...,...,...,...,...
post_hint,,link,link,link,link
url_overridden_by_dest,,https://www.androidpolice.com/2020/09/27/googl...,https://www.xda-developers.com/customize-samsu...,https://store.google.com/,https://www.cnet.com/news/galaxy-s20-fe-preord...
preview,,{'images': [{'source': {'url': 'https://extern...,{'images': [{'source': {'url': 'https://extern...,{'images': [{'source': {'url': 'https://extern...,{'images': [{'source': {'url': 'https://extern...
link_flair_template_id,,,,,


In [27]:
#check apple posts
df_apple = pd.read_csv('../datasets/apple_posts.csv')
df_apple.shape

(986, 108)




**Next:** [Data Cleaning and Exploratory Data Analysis](./02_data_cleaning_and_eda.ipynb)

## References