# Classification of Reddit Posts via Natural Language Processing

The objective of this project is to correctly classify posts between different subreddits based purely on the text data. For the purposes of this project, we have picked 3 subreddit pairings, which ascend in difficulty based on the content and nature of the posts:
1. ```r/datascience``` vs ```r/machinelearning```.
2. ```r/news``` vs ```r/theonion```.
3. ```r/theonion``` vs ```r/nottheonion```.

The idea is to start by prototyping on an easy problem and then progressively ratchet up the (perceived) difficulty to motivate the development of an increasingly complex and robust model.

<br>

---

# Data Collection and Scraping

We collect posts from Reddit using the ```pushshift.io``` API. We package the data-scraping steps and oft-used parameters into helper functions in the custom ```bifrost``` library (a reference to the Rainbow Bridge in Norse mythology).

In [1]:
import numpy as np
import pandas as pd
import requests

import bifrost

<br> 

Our first step will be to scrape the necessary data from the subreddits in question. The first question we must answer is:
     
<h2 align="center">How much data should we scrape?</h2>
     
One might think the answer is simple: "As much as we can get!" But this is not a satisfactory answer. We could easily scrape 1,000,000 Reddit posts from each of the 5 subreddits, which would give us an abundance of data. However, vectorizing and training on the 2,000,000 text-based examples in a single pairing would be woefully slow and may even be impossible given hardware limitations (imagine trying to store a $2,000,000 \times 100,000$ matrix in memory; it would probably eat up every last byte of RAM we have and still not be enough!).
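
To make the memory concern concrete, here is a back-of-the-envelope estimate. It assumes a dense matrix with 8 bytes per entry (64-bit floats), which is our assumption for illustration rather than something measured in this notebook; a sparse representation would need far less, so treat the figure as an upper bound.

```python
# Back-of-the-envelope memory estimate for a dense document-term matrix.
# Assumption (for illustration only): 8 bytes per entry (float64).
n_documents = 2_000_000
n_features = 100_000
bytes_per_entry = 8

dense_bytes = n_documents * n_features * bytes_per_entry
dense_terabytes = dense_bytes / 1e12

print(f"{dense_bytes:,} bytes (about {dense_terabytes:.1f} TB)")
# prints: 1,600,000,000,000 bytes (about 1.6 TB)
```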

We must let level heads prevail and choose a more reasonable number. Let us decide (a bit arbitrarily) on 10,000 posts to train on for each subreddit, with an additional 5,000 posts held out as a final test set to give our model a final grade. If it turns out this number is too small, we can always go back and scrape more data.

We proceed by first scraping the 10,000 training posts from each subreddit and logging the UTC creation times. Then, after building our models, we will scrape the additional 5,000 posts with UTC creation times before the last training example (this separation of data scraping is a safeguard against accidentally training on the test set).
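
The time-based separation can be sketched as follows. This is a minimal illustration, assuming posts are collected newest-first so that the last training example is also the oldest one; the helper names `training_cutoff` and `is_strictly_older` are hypothetical, and the held-out timestamps are made up.

```python
# Sketch of the time-based separation between training and held-out posts.
# Posts are plain dicts here; the held-out values are made up for illustration.

def training_cutoff(train_posts):
    """Return the earliest creation time (UTC epoch seconds) in the training set."""
    return min(post["created_utc"] for post in train_posts)


def is_strictly_older(test_posts, cutoff):
    """True if every held-out post was created before the cutoff."""
    return all(post["created_utc"] < cutoff for post in test_posts)


train_posts = [{"created_utc": 1648041965}, {"created_utc": 1648018824}]
test_posts = [{"created_utc": 1648000000}, {"created_utc": 1647990000}]

cutoff = training_cutoff(train_posts)
print(is_strictly_older(test_posts, cutoff))
```

If the check ever returns `False`, at least one held-out post is no older than the training cutoff, so the two scrapes may overlap.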

We provide the following code as a markdown cell to avoid having to wait for the web-scraping process each time we reboot this notebook, but the reader is free to try running it themselves in a code cell.

In [2]:
ds_params = {'subreddit':'datascience',
             'fields':('title', 'selftext','created_utc'),
             'size':100}

```python

datascience = pd.DataFrame(
                        bifrost.mass_collect_reddit( # scrapes subreddit for content, 0.8s time-delay built-in to avoid accidental DDoS
                                        content = 'submission',
                                        params=ds_params, 
                                        iters = 100,  # 1 iteration collects 100 posts.
                                        verbose = True  # demonstrate process with console printout, will be turned off in future calls
                                )
                          )
```

<br> 

And just like that, ```bifrost``` has scraped 10,000 reddit posts from ```r/datascience```.
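
For readers curious about what a helper of this kind might do under the hood, here is a minimal sketch of a paginated collection loop. It is not the actual ```bifrost``` implementation: the names `collect_pages` and `fetch_page` are hypothetical, and the data source is injected as a callable so the loop can be tried without any network access. The 0.8 s default delay mirrors the built-in delay mentioned in the cell above.

```python
import time


def collect_pages(fetch_page, params, iters, delay_seconds=0.8):
    """Call fetch_page repeatedly, walking backwards in time.

    fetch_page is a hypothetical callable that takes a params dict and
    returns a list of post dicts, newest first (e.g. a thin wrapper around
    an HTTP GET). It is passed in so the loop can run without a network.
    """
    posts = []
    page_params = dict(params)  # copy so the caller's dict is not mutated
    for _ in range(iters):
        page = fetch_page(page_params)
        if not page:
            break  # nothing older left to collect
        posts.extend(page)
        # Ask the next request for posts older than the oldest one seen so far.
        page_params["before"] = min(post["created_utc"] for post in page)
        time.sleep(delay_seconds)  # pause between requests
    return posts


def fake_fetch(page_params):
    """Stand-in data source: ten posts with creation times 10 down to 1."""
    before = page_params.get("before", 11)
    newest_first = range(before - 1, 0, -1)
    return [{"created_utc": t} for t in newest_first][: page_params["size"]]


posts = collect_pages(fake_fetch, {"size": 4}, iters=5, delay_seconds=0)
print(len(posts))  # prints: 10
```

Passing the oldest timestamp seen so far as `before` is what keeps consecutive pages from overlapping in this sketch.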

<br>


In [3]:
# load previously scraped results 
# this avoids having to re-scrape each time we reboot the notebook :)
datascience = pd.read_csv('../data/datascience10k.csv')

```python
# save our data to a csv so we don't have to re-scrape every single time
datascience.to_csv('../data/datascience10k.csv')
```

In [4]:
# check shape
datascience.shape

(10000, 4)

In [5]:
datascience.columns

Index(['Unnamed: 0', 'created_utc', 'selftext', 'title'], dtype='object')

In [6]:
datascience.drop(columns='Unnamed: 0',inplace=True)

<br> 

Let's check for null values.


In [7]:
datascience.isnull().sum()

created_utc       0
selftext       1726
title             0
dtype: int64

<br>

We have 1,726 empty bodies (roughly 17% of the posts). Based on the author's Reddit experience, these are most likely either:
1. Questions: the question is in the title, so the body is empty.
2. Content Sharing: the actual content is not text-based, e.g. an embedded YouTube video about NLP.
3. Removed: the post contents were removed by the moderators.

Let's take a quick look.


In [8]:
datascience[ datascience['selftext'].isnull() ].head()

| | created_utc | selftext | title |
|---|---|---|---|
| 20 | 1648041965 | NaN | What are your favorite reference books? |
| 26 | 1648032739 | NaN | Data scientist returning to work from maternit... |
| 28 | 1648029230 | NaN | Logical process order of a sql query and why k... |
| 30 | 1648024571 | NaN | Hypothesis Testing &amp; Anova Model Topics |
| 34 | 1648018824 | NaN | Could you give me idea for entry level project... |


<br>

Let's also now scrape the data for the other 4 subreddits.

In [9]:
ml_params = {'subreddit':'machinelearning','fields':('title', 'selftext','created_utc'),'size':100}

news_params = {'subreddit':'news','fields':('title', 'selftext','created_utc'),'size':100}

onion_params = {'subreddit':'theonion','fields':('title', 'selftext','created_utc'),'size':100}

notonion_params = {'subreddit':'nottheonion','fields':('title', 'selftext','created_utc'),'size':100}

```python
ml = pd.DataFrame(
                        bifrost.mass_collect_reddit(
                                        content = 'submission',
                                        params= ml_params, 
                                        iters = 100,  
                                        verbose = False  
                                )
                          )

ml.to_csv('../data/machinelearning10k.csv')
```

<br> 

```python
news = pd.DataFrame(
                        bifrost.mass_collect_reddit( 
                                        content = 'submission',
                                        params= news_params, 
                                        iters = 100,  
                                        verbose = False  
                                )
                          )

news.to_csv('../data/news10k.csv')
```

<br>

```python
onion = pd.DataFrame(
                        bifrost.mass_collect_reddit( 
                                        content = 'submission',
                                        params= onion_params, 
                                        iters = 100, 
                                        verbose = False  
                                )
                          )

onion.to_csv('../data/onion10k.csv')
```

```python
notonion = pd.DataFrame(
                        bifrost.mass_collect_reddit( 
                                        content = 'submission',
                                        params= notonion_params, 
                                        iters = 100, 
                                        verbose = False  
                                )
                          )

notonion.to_csv('../data/notonion10k.csv')
```

In [10]:
ml = pd.read_csv('../data/machinelearning10k.csv')
news = pd.read_csv('../data/news10k.csv')
onion = pd.read_csv('../data/onion10k.csv')
notonion = pd.read_csv('../data/notonion10k.csv')


<br>

--- 
# 100,000 Post Titles

For future use and general data-hoarding purposes, we'll scrape the post titles of 100,000 Reddit posts in each of the 5 subreddits mentioned above.

In [11]:
ds_epoch = datascience['created_utc'][9999]
ml_epoch = ml['created_utc'][9999]
news_epoch = news['created_utc'][9999]
onion_epoch = onion['created_utc'][9984]
notonion_epoch = notonion['created_utc'][9999]

In [12]:
ds_titles_params =  {'subreddit':'datascience','fields':('title','created_utc'),'size':100, 'before':ds_epoch}

ml_titles_params = {'subreddit':'machinelearning','fields':('title','created_utc'),'size':100, 'before':ml_epoch}

news_titles_params = {'subreddit':'news','fields':('title','created_utc'),'size':100, 'before':news_epoch}

onion_titles_params = {'subreddit':'theonion','fields':('title','created_utc'),'size':100, 'before':onion_epoch}

notonion_titles_params = {'subreddit':'nottheonion','fields':('title', 'created_utc'),'size':100, 'before':notonion_epoch}
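
Since the five parameter dictionaries differ only in the subreddit name and the cutoff epoch, they could also be built with a small helper. This is a sketch only: `make_params` is a hypothetical function that is not part of ```bifrost```, and the epoch values below are placeholders rather than the real cutoffs.

```python
def make_params(subreddit, fields=("title", "created_utc"), size=100, before=None):
    """Build a query-parameter dict in the same shape as the ones above."""
    params = {"subreddit": subreddit, "fields": tuple(fields), "size": size}
    if before is not None:
        params["before"] = before  # only request posts older than this epoch
    return params


subreddits = ["datascience", "machinelearning", "news", "theonion", "nottheonion"]
epochs = {name: 1648018824 for name in subreddits}  # placeholder cutoffs

titles_params = {name: make_params(name, before=epochs[name]) for name in subreddits}
print(titles_params["news"])
```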

In [13]:
ds_iter2 = pd.DataFrame(
                        bifrost.mass_collect_reddit( 
                                        content = 'submission',
                                        params=ds_titles_params, 
                                        iters = 4000,  
                                        verbose = True, 
                                        delay=True
                                )
                          )

pd.concat([datascience[['created_utc','title']], ds_iter2]).to_csv('../data/datascience15k.csv')

JSONDecodeError: [Errno Expecting value] <html>
<head><title>429 Too Many Requests</title></head>
<body bgcolor="white">
<center><h1>429 Too Many Requests</h1></center>
<hr><center>nginx/1.14.0</center>
</body>
</html>
: 0
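
The `429 Too Many Requests` response above means the server is rate limiting us. One common way to cope is to retry with exponential backoff. The sketch below is an assumption-laden illustration, not something ```bifrost``` is known to do: `RateLimited` and `fetch_with_backoff` are hypothetical names, and the request itself is represented by an injected callable so the retry logic can be exercised offline.

```python
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) fetch function on an HTTP 429 response."""


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimited:
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    return fetch()  # final attempt; let any error propagate


calls = {"count": 0}
waits = []  # records the requested delays instead of actually sleeping


def flaky_fetch():
    """Stand-in that is rate limited twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited("429 Too Many Requests")
    return ["post"]


result = fetch_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)  # prints: ['post'] [1.0, 2.0]
```

Doubling the wait after each failure gives the server progressively more room to recover, at the cost of a slower scrape.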

In [None]:
ml = pd.DataFrame(
                        bifrost.mass_collect_reddit( 
                                        content = 'submission',
                                        params= ml_titles_params, 
                                        iters = 4000,  
                                        verbose = False,
                                        delay=True
                                )
                          )

ml.to_csv('../data/machinelearningtitles100k.csv')

In [None]:
news = pd.DataFrame(
                        bifrost.mass_collect_reddit(
                                        content = 'submission',
                                        params= news_titles_params, 
                                        iters = 4000, 
                                        verbose = False,
                                        delay=True
                                )
                          )

news.to_csv('../data/newstitles100k.csv')


In [None]:

onion = pd.DataFrame(
                        bifrost.mass_collect_reddit(
                                        content = 'submission',
                                        params= onion_titles_params, 
                                        iters = 4000,  
                                        verbose = False,
                                        delay=True
                                )
                          )

onion.to_csv('../data/oniontitles100k.csv')

In [None]:


notonion = pd.DataFrame(
                        bifrost.mass_collect_reddit(
                                        content = 'submission',
                                        params= notonion_titles_params, 
                                        iters = 4000,  
                                        verbose = False,
                                        delay=True
                                )
                          )

notonion.to_csv('../data/notoniontitles100k.csv')