# Project 3: Web APIs & Classification

## Notebook 1: Data Collection

## Problem Statement

It is an election year in the United States of America, and the Democratic nominee (Joe Biden) has opened a shortlisting process for companies capable of running analysis on social media platforms, to inform his campaign of public sentiment towards himself and his party [Source](https://www.nytimes.com/news-event/2020-election). His competitor, President Trump, has been utilising social media platforms, specifically Twitter, and has a larger presence in the digital space. On Facebook, President Trump's page has more than 29 million followers compared to Joe Biden's 2 million. Twitter tells a similar story: about 80 million followers for Trump and 5.4 million for Biden [Source](https://www.npr.org/2020/05/21/859932268/trump-and-biden-wage-an-uneven-virtual-campaign).

As technology becomes intertwined with our daily lives, it is inevitable that these platforms become more prominent in their influence on elections [Source](https://journalism.uoregon.edu/news/six-ways-media-influences-elections). Hence, it has become increasingly important to understand the social media landscape to improve one's chances when election day arrives. The campaign liaison has specified two categories in which companies can apply for the tender. In the first category, the model must be capable of classifying whether a block of text is generally Democrat-leaning or Republican-leaning. In the second category, the model must perform sentiment analysis to determine whether a block of text is positive or negative. Our team decided to focus on the first category.

We chose Reddit as our platform, and from there we would like to answer the following questions:  

- **Between the two similar subreddits r/democrats and r/Republican, are we able to differentiate the posts using Natural Language Processing models?**  
- **Which model is then likely to work best?**

Success is evaluated on answering the problem statements and producing the model with the highest classification score, based on metrics such as accuracy, precision, recall and F1.
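
To make the four metrics concrete, here is a minimal sketch that computes them from a list of true and predicted labels, treating one class as "positive". The function name `classification_metrics` and the sample labels are made up for illustration; they are not results from this project.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError('need at least one label pair')

    # confusion-matrix counts relative to the chosen positive class
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# made-up labels, for illustration only
y_true = ['dem', 'dem', 'dem', 'rep', 'rep', 'rep']
y_pred = ['dem', 'dem', 'rep', 'rep', 'rep', 'dem']
metrics = classification_metrics(y_true, y_pred, positive='dem')
print(metrics)
```

With two balanced classes such as these subreddits, accuracy is a reasonable headline number, while precision and recall show which side the errors fall on.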

## Executive Summary

This project sets out to propose the classification modeling approach that most accurately enables us to predict which subreddit a post belongs to, using only the title and post data with subreddit names removed. This will support our application for the competition tender with the Democratic nominee's campaign. To illustrate this process and build a proposal using the selected platform, Reddit, we have chosen to build, evaluate, and compare classification models using Natural Language Processing (NLP) tools.  

Our chosen subreddits to compare are:

 - **r/democrats**
 - **r/Republican**

### Requirements

- Gather and prepare your data using the `requests` library.
- **Create and compare two models**. One of these must be a Bayes classifier, however the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of the results you found.
- A short presentation outlining your process and findings for a semi-technical audience.

### Contents:
- [Function to pull post using Reddit API](#Function-to-pull-post-using-Reddit-API)

In [None]:
# library imports
import requests
import time
import pandas as pd
from random import randint

### Function to pull post using Reddit API

Reddit can rank posts by top and filter to include all posts since the subreddit's inception. I decided to scrape the top 1000 posts (a limit of Reddit's architecture, see [source](https://www.reddit.com/r/redditdev/comments/30a7ap/does_reddit_api_limit_total_listings_returned_to/)), with the rationale that top posts are likely to have more content, be it post length or comments.
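
Each listing response carries an `after` token that points to the next page, and the token is `None` once the listing is exhausted. The sketch below shows that paging pattern in isolation, without touching the network. `collect_listing`, `fake_fetch` and `FAKE_PAGES` are hypothetical names used only for illustration; in practice `fetch_page` would wrap a `requests.get` call.

```python
def collect_listing(fetch_page, max_pages):
    """Follow 'after' tokens until the listing ends or max_pages is reached."""
    posts = []
    after = None  # the first request carries no token
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(page['data']['children'])
        after = page['data']['after']
        if after is None:
            # no further pages to request
            break
    return posts


# hypothetical stand-in for the API: two pages keyed by the token that requests them
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}},
                                 {'data': {'name': 't3_b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}],
                      'after': None}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


posts = collect_listing(fake_fetch, max_pages=40)
names = [p['data']['name'] for p in posts]
print(names)
```

Stopping when the token is `None` matters because a request without a token returns the first page again, which would add duplicate posts.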

In [None]:
# set a custom User-agent header to avoid HTTP 429 (Too Many Requests) responses
headers = {'User-agent': 'dsi15-p3-mike'}

In [None]:
# scrape function for a subreddit, using the provided tutorial as reference
def subreddit_scrap(subreddit, after, no_times):
    """Pull up to `no_times` pages of all-time top posts from a subreddit."""
    default_url = 'https://old.reddit.com/'
    ranking = '/top.json?t=all'
    scrap_url = default_url + subreddit + ranking

    posts = []

    for i in range(no_times):
        # the first request has no 'after' token; later ones continue from the previous page
        params = {} if after is None else {'after': after}

        res = requests.get(scrap_url, params=params, headers=headers)

        if res.status_code == 200:
            the_json = res.json()
            posts.extend(the_json['data']['children'])
            after = the_json['data']['after']
            # track unique post names to spot duplicated pages
            unique_post = len({p['data']['name'] for p in posts})
            print('{} unique posts: {}'.format(subreddit, unique_post))
            print('{} after: {}'.format(subreddit, after))
            if after is None:
                # listing exhausted; stop so the first page is not fetched again
                break
        else:
            print(res.status_code)
            break

        # random pause between requests to avoid hammering the server
        time.sleep(randint(2, 6))
    return posts

In [None]:
# there is a limit of 1000 posts due to the architecture of reddit
dems_posts = subreddit_scrap('r/democrats', None, 40)

In [None]:
dems_posts_df = pd.DataFrame(dems_posts)
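
Each element returned by the scrape function is a dictionary with a `kind` key and a nested `data` dictionary, so a DataFrame built directly from the list holds the post fields inside a single `data` column. As a hedged sketch, the helper below pulls a few fields out of that nesting. `flatten_posts` and the sample record are hypothetical and not part of the original workflow.

```python
def flatten_posts(children, fields=('name', 'subreddit', 'title', 'selftext')):
    """Return one flat dict per post, keeping only the requested fields."""
    # .get() leaves None for any field a post does not carry
    return [{f: child['data'].get(f) for f in fields} for child in children]


# made-up record shaped like one element of the listing's 'children'
sample = [
    {'kind': 't3',
     'data': {'name': 't3_x1', 'subreddit': 'democrats',
              'title': 'Example title', 'selftext': '', 'ups': 10}},
]
rows = flatten_posts(sample)
print(rows)
```

Passing the result to `pd.DataFrame(...)` would then give one column per field, which is easier to work with after a round trip through CSV than a stringified dictionary.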

In [None]:
dems_posts_df.to_csv('../datasets/dems_top1000_posts.csv', index = False)

In [None]:
dems_posts_df.shape

In [None]:
reps_posts = subreddit_scrap('r/Republican', None, 40)

In [None]:
reps_posts_df = pd.DataFrame(reps_posts)

In [None]:
reps_posts_df.shape

In [None]:
reps_posts_df.to_csv('../datasets/reps_top1000_posts.csv', index = False)

## Continue to Notebook 02