<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Classifying Reddit Posts from Trump and Biden Subreddits using NLP Classification

---
# Part 1: Web scraping
---

### Overview of this notebook

In this part, we scrape data from Reddit and lightly clean it to obtain a usable CSV file for Part 2.\
In particular, we replace every `'[removed]'` string with the empty string `''` and drop every `'[deleted]'` entry.\
Posts whose `selftext` is empty are not very useful for prediction on their own. However, some titles are informative enough by themselves, so we concatenate the `title` and `selftext` features.\
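Based on [this](https://awc-hse.medium.com/a-title-that-works-characteristics-and-tips-7fe33c5aef67) article we decided on a character limit of 80 for titles; the filter applied in this notebook keeps title-only posts whose title is longer than 60 characters.\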
That concludes this notebook.

### Contents:
- [Background](#Background)
- [Scraping](#Scraping)

## Background

The United States is preparing for an upcoming presidential election, which will determine the country's next leader. The current administration, led by President Trump, is in the process of hiring new employees. Candidates are selected through interviews and projects that evaluate their skills and abilities.

As part of the hiring process, we have been tasked with a project that uses natural language processing (NLP) classification to determine whether a given post comes from the Trump or the Biden subreddit. This requires us to analyze and categorize large amounts of text from both subreddits, using NLP techniques to identify the patterns and features that distinguish the two.

### Data source

We will scrape data from the [r/trump/](https://www.reddit.com/r/trump/) and [r/JoeBiden/](https://www.reddit.com/r/JoeBiden/) subreddits.

### Problem Statement

The goal of this project is to use NLP classification techniques to accurately identify whether a given post comes from the Trump or the Biden subreddit. This task matters because it can shed light on the political preferences and attitudes of these subreddits' users, which may be indicative of broader trends in American politics. The project also helps to evaluate our skills in NLP and machine learning.

We will focus on accuracy as our only metric, because the goal of the project is simply to determine whether each post is classified correctly.

## Scraping

In [1]:
import requests
import time
import random
import pandas as pd

In [2]:
def delay_request(maximum=10):
    """Sleep for a random whole number of seconds between 1 and `maximum` (inclusive)."""
    delay_time = random.randint(1, maximum)
    time.sleep(delay_time)

In [3]:
def scraping_worker(
    url='https://api.pushshift.io/reddit/search/submission',
    subs=('trump', 'joebiden'),
    size=100,
    iterations=18,
):
    """Page backwards through each subreddit and return subreddit, selftext and title."""
    ls = []
    for sub in subs:
        params = {'subreddit': sub, 'size': size}
        for i in range(iterations):

            # Retry until the request succeeds within the current iteration.
            connected = False
            while not connected:
                res = requests.get(url, params=params)
                if res.status_code != 200:
                    print(f'Failed to connect, retrying..., status code: {res.status_code}')
                    delay_request(3)
                else:
                    connected = True

            # Once connected, collect the posts and move on to the next iteration.
            posts = res.json()['data']
            if not posts:  # no older posts left for this subreddit
                break
            ls += posts  # append this page to the running list
            params['before'] = posts[-1]['created_utc']  # next request: posts older than the last one seen
            delay_request()
            print(f'For subreddit: {sub}, iteration: {i+1} completed!')

    return pd.DataFrame(ls)[['subreddit', 'selftext', 'title']]
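
The retry loop in `scraping_worker` keeps requesting until it receives a 200 response. A bounded variant could look like the sketch below. It assumes a hypothetical helper `get_with_retries` with hypothetical `max_attempts` and `wait_seconds` parameters, none of which are part of this notebook. The `fetch` argument stands in for a call such as `requests.get(url, params)`, so the sketch runs without network access.

```python
import time


def get_with_retries(fetch, max_attempts=5, wait_seconds=0.0):
    """Call fetch() until it returns a response with status_code 200.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (a hypothetical stand-in for requests.get).
    Raises RuntimeError once max_attempts attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        res = fetch()
        if res.status_code == 200:
            return res
        print(f'Attempt {attempt} failed, status code: {res.status_code}')
        time.sleep(wait_seconds)
    raise RuntimeError(f'No successful response after {max_attempts} attempts')


class FakeResponse:
    """Tiny stand-in for a response object, used only for this demo."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two failures followed by a success.
responses = iter([FakeResponse(429), FakeResponse(500), FakeResponse(200)])
result = get_with_retries(lambda: next(responses))
print(result.status_code)
```

Raising an error after a fixed number of attempts is one possible design choice; it avoids an endless loop if the API stays unavailable.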

In [4]:
%%time
df = scraping_worker()

For subreddit: trump, iteration: 1 completed!
For subreddit: trump, iteration: 2 completed!
For subreddit: trump, iteration: 3 completed!
For subreddit: trump, iteration: 4 completed!
For subreddit: trump, iteration: 5 completed!
For subreddit: trump, iteration: 6 completed!
For subreddit: trump, iteration: 7 completed!
For subreddit: trump, iteration: 8 completed!
For subreddit: trump, iteration: 9 completed!
For subreddit: trump, iteration: 10 completed!
For subreddit: trump, iteration: 11 completed!
For subreddit: trump, iteration: 12 completed!
For subreddit: trump, iteration: 13 completed!
For subreddit: trump, iteration: 14 completed!
For subreddit: trump, iteration: 15 completed!
For subreddit: trump, iteration: 16 completed!
For subreddit: trump, iteration: 17 completed!
For subreddit: trump, iteration: 18 completed!
For subreddit: joebiden, iteration: 1 completed!
For subreddit: joebiden, iteration: 2 completed!
For subreddit: joebiden, iteration: 3 completed!
For subreddit: j
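
Each iteration in the log corresponds to one page of up to `size` posts. After each page, `before` is set to the `created_utc` of the last post received, so the next request asks for older posts. The sketch below illustrates that cursor logic on an in-memory list. It assumes a hypothetical `fake_search` function in place of the real API and, as the loop above does, that results arrive newest first. The early exit on an empty page is an addition made for the sketch.

```python
def fake_search(posts, size, before=None):
    """Return up to `size` newest posts older than `before` (hypothetical stand-in for the API)."""
    eligible = [p for p in posts if before is None or p['created_utc'] < before]
    eligible.sort(key=lambda p: p['created_utc'], reverse=True)
    return eligible[:size]


def collect(posts, size, iterations):
    """Page backwards in time using the same cursor update as scraping_worker."""
    collected = []
    before = None
    for _ in range(iterations):
        page = fake_search(posts, size, before)
        if not page:  # assumption for this sketch: stop when nothing older is left
            break
        collected += page
        before = page[-1]['created_utc']  # oldest post on this page
    return collected


# Ten made-up posts with increasing timestamps.
all_posts = [{'id': i, 'created_utc': 1000 + i} for i in range(10)]
pages = collect(all_posts, size=4, iterations=3)
print([p['id'] for p in pages])
```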

In [5]:
df.head()

|   | subreddit | selftext | title |
|---|---|---|---|
| 0 | trump |  | EPIC! Project Omega Sends Letter to Georgia’s ... |
| 1 | trump |  | BREAKING: House Oversight Committee Shows Bide... |
| 2 | trump | [removed] | 'The United States of America' is 24 letters. ... |
| 3 | trump | [removed] | 'The United States of America' is 24 letters. ... |
| 4 | trump |  | Peacefully, Queen gena de' Medici Etherton Doc... |


The next cell shows that `selftext` contains `''`, `'[removed]'` and `'[deleted]'` values. We replace the `'[removed]'` strings with `''` and drop the `'[deleted]'` rows.

In [6]:
df['selftext'].value_counts()

In [7]:
df['selftext'] = df['selftext'].replace('[removed]', '')

In [8]:
df = df[df['selftext'] != '[deleted]'].drop_duplicates()
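
To make the effect of these two cleaning steps concrete, here is a small sketch on made-up data. The `toy` DataFrame and its values are invented for illustration and are not taken from the scraped posts.

```python
import pandas as pd

# Toy data covering the three kinds of selftext values (made up for illustration).
toy = pd.DataFrame({
    'subreddit': ['trump', 'trump', 'JoeBiden', 'JoeBiden'],
    'selftext': ['', '[removed]', '[deleted]', 'Some body text'],
    'title': ['Title A', 'Title A', 'Title C', 'Title D'],
})

# '[removed]' becomes an empty string (exact, whole-value match).
toy['selftext'] = toy['selftext'].replace('[removed]', '')

# Rows marked '[deleted]' are dropped, then exact duplicate rows are removed.
toy = toy[toy['selftext'] != '[deleted]'].drop_duplicates()

print(toy)
```

Once `'[removed]'` is replaced, the first two rows are identical, so `drop_duplicates()` keeps only one of them.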

Next, we keep only the rows that are useful for modelling: posts with a non-empty `selftext`, or posts whose `title` is longer than 60 characters.\
We then concatenate `selftext` and `title` into a single `combined` column.

In [9]:
# .copy() avoids a SettingWithCopyWarning when the 'combined' column is added later.
df = df[(df['selftext'] != '') | (df['title'].str.len() > 60)].copy()

In [10]:
df['combined'] = df['selftext'] + ' ' + df['title']
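
The filter and the concatenation can be checked on a small example. The sketch below uses made-up rows; the `toy` DataFrame is hypothetical and only mirrors the column names used in this notebook.

```python
import pandas as pd

# Made-up rows: a short title-only post, a long title-only post, and a post with body text.
toy = pd.DataFrame({
    'selftext': ['', '', 'Body text'],
    'title': ['Short title', 'T' * 61, 'Another short title'],
})

# Keep rows with body text, or with a title longer than 60 characters.
keep = (toy['selftext'] != '') | (toy['title'].str.len() > 60)
toy = toy[keep].copy()

# Concatenate body and title; rows without body text get a leading space.
toy['combined'] = toy['selftext'] + ' ' + toy['title']

print(toy['combined'].tolist())
```

Rows with an empty `selftext` end up with a leading space in `combined`. This is harmless for most tokenizers, and `.str.strip()` would remove it if needed.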

In [11]:
df['subreddit'].value_counts(normalize=True)

trump       0.558313
JoeBiden    0.441687
Name: subreddit, dtype: float64
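
Because accuracy is the project's metric, these proportions give a useful reference point: a model that always predicts the majority class (`trump`) would already score about 0.558 on this data. Below is a minimal sketch of that baseline on made-up labels. The helper names `accuracy` and `majority_baseline` are hypothetical and not part of this notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))


# Made-up labels: 5 'trump' posts and 4 'JoeBiden' posts.
labels = ['trump'] * 5 + ['JoeBiden'] * 4
print(round(majority_baseline(labels), 3))  # equals 5 / 9
```

A classifier is only informative if its accuracy clearly exceeds this baseline.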

In [12]:
df = df.reset_index(drop=True)

In [13]:
df.head()

|   | subreddit | selftext | title | combined |
|---|---|---|---|---|
| 0 | trump |  | EPIC! Project Omega Sends Letter to Georgia’s ... | EPIC! Project Omega Sends Letter to Georgia’s... |
| 1 | trump |  | BREAKING: House Oversight Committee Shows Bide... | BREAKING: House Oversight Committee Shows Bid... |
| 2 | trump |  | 'The United States of America' is 24 letters. ... | 'The United States of America' is 24 letters.... |
| 3 | trump |  | 'The United States of America' is 24 letters. ... | 'The United States of America' is 24 letters.... |
| 4 | trump |  | Peacefully, Queen gena de' Medici Etherton Doc... | Peacefully, Queen gena de' Medici Etherton Do... |


In [14]:
df.isnull().sum()

subreddit    0
selftext     0
title        0
combined     0
dtype: int64

In [15]:
df.shape

(2015, 4)

In [16]:
# df.to_csv('../data/trump-biden-reddit-scape.csv', index = False)