# Project 3: Reddit API and NLP

## Problem Statement

A major cyber attack has temporarily rendered the Reddit website dysfunctional, forcing Reddit to explore alternative mechanisms for its essential data classification functions. As a data scientist at Reddit, I am tasked with building an alternative machine learning algorithm that accurately classifies user posts into distinct subreddits (discussion topics) based on the content of the posts. As a trial run to validate the algorithm, the classification model is trained and tested on posts from only two subreddits.

## Executive Summary

This project covers data wrangling, natural language processing, and classification modelling for random posts pulled from two subreddits on the Reddit website. Reddit is a discussion website that allows users to post content and comments on a wide variety of topics, ranging from day-to-day activities such as cooking to highly specialized domains such as programming and science. As described in the problem statement, I have been tasked with developing an alternative machine learning algorithm that accurately classifies user posts into specific categories based on the content of each post. The user content provided to me comes from the Excel and Legal Advice subreddits.

For this project, I scraped about 1,000 random user posts for each subreddit from the Reddit website. The data was then cleaned of punctuation and English stopwords (such as "is", "are", and "and") before building the model. For simplicity, the title and the post text are combined into a separate column called "combined". Since machine learning models accept only numeric input, the entire text dataset, comprising titles and posts, is converted into numbers using a count vectorizer. The vectorizer assigns each word a number according to its frequency, that is, the number of times the word appears in the title and the post.

Once vectorization is complete, three models (Logistic Regression, Naive Bayes, and Decision Tree) are fitted to the dataset and evaluated using the accuracy score. A comparison of the three models indicates that the Naive Bayes model performs best on the testing dataset. This model also misclassified very few posts: only 1 post from the Excel subreddit was misclassified as Legal Advice, and vice versa.

Further research could employ more classification models, such as Random Forest and Support Vector Machines, and add more observations to see whether the accuracy score improves.
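
The count-vectorization step described above can be illustrated with a minimal, library-free sketch. This is not the notebook's implementation (which uses a count vectorizer); the stopword list, the sample posts, and the function names `tokenize` and `count_vectorize` are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"is", "are", "and", "a", "the", "to", "in", "my", "i"}

def tokenize(text):
    """Lowercase the text, drop punctuation, and remove stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def count_vectorize(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[w] for w in vocabulary])
    return vocabulary, matrix

# Hypothetical "combined" texts (title + post), one per row.
combined = [
    "VLOOKUP formula is not working. The formula returns an error in my sheet.",
    "My landlord is keeping the deposit. Is the landlord allowed to keep it?",
]
vocabulary, matrix = count_vectorize(combined)
for row in matrix:
    # show only the words that actually occur in each document
    print({w: c for w, c in zip(vocabulary, row) if c > 0})
```

Each row of `matrix` is the numeric representation of one post: every column corresponds to a word in the shared vocabulary, and the value is how often that word occurs in the post.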


### Contents:
- [Data Dictionary](#Data-Dictionary)
- [Data Import](#Data-Import)
- [Data Cleaning and Exploratory Data Analysis]
- [Preprocessing (Tokenizing and Lemmatizing)]
- [Modelling and Evaluation]
- [Conclusion and Recommendations]

## Data Dictionary

|Variable|Type|Description|
|---|----|----|
|title|object|Title of the user-created post|
|selftext|object|Body text of the user-created post|
|subreddit|object|Subreddit (user-created board) the post was pulled from: excel or legaladvice|

## Data Import

### Import the libraries

In [1]:
# Imports
import requests
import pandas as pd
import time
import random

### Subreddit 1 - Legaladvice

In [2]:
#url of first subreddit - legaladvice 
url = 'https://www.reddit.com/r/legaladvice.json'

In [3]:
res = requests.get(url)

In [4]:
#checking the status of the response; without a custom User-agent header the request returns 429 (Too Many Requests)
res.status_code

429

In [5]:
#retrying with a custom User-agent header
res = requests.get(url, headers={'User-agent': 'Preety'})

In [6]:
res.status_code

200

In [7]:
#json format of the scraped data is similar to a dictionary (just for reference)
reddit_dict = res.json()

In [8]:
#print(reddit_dict)

In [9]:
#scraping the data from legaladvice subreddit into json(dictionary) format
posts = []
after = None

for a in range(50):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Preety'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/legaladvice.json
53
https://www.reddit.com/r/legaladvice.json?after=t3_eu0jg8
17
https://www.reddit.com/r/legaladvice.json?after=t3_etxceo
32
https://www.reddit.com/r/legaladvice.json?after=t3_eu1zsr
17
https://www.reddit.com/r/legaladvice.json?after=t3_etyfz7
21
https://www.reddit.com/r/legaladvice.json?after=t3_eu0hdn
18
https://www.reddit.com/r/legaladvice.json?after=t3_etzr78
25
https://www.reddit.com/r/legaladvice.json?after=t3_etyhtq
20
https://www.reddit.com/r/legaladvice.json?after=t3_etxot6
6
https://www.reddit.com/r/legaladvice.json?after=t3_etwzhs
29
https://www.reddit.com/r/legaladvice.json?after=t3_etvrna
60
https://www.reddit.com/r/legaladvice.json?after=t3_etv7r8
53
https://www.reddit.com/r/legaladvice.json?after=t3_ettziu
57
https://www.reddit.com/r/legaladvice.json?after=t3_ett24z
48
https://www.reddit.com/r/legaladvice.json?after=t3_etv1zn
42
https://www.reddit.com/r/legaladvice.json?after=t3_ethoxr
59
https://www.reddit.com/r/legaladvice.json
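
The loop above always runs for all 50 iterations. The last URL printed shows that once Reddit stops returning an `after` token, the loop starts again from the first page, which is a likely source of the duplicate posts removed later. Below is a minimal sketch of a pagination loop that stops at that point instead. To keep it runnable without network access it uses a stand-in `fetch_page` function (hypothetical, taking the place of `requests.get(...).json()`) and fake pages.

```python
def collect_posts(fetch_page, max_pages=50):
    """Collect posts page by page, stopping when there is no next page.

    fetch_page(after) is a hypothetical callable that returns a listing
    shaped like Reddit's JSON: {'data': {'children': [...], 'after': ...}}.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            # no further pages: stop instead of restarting from the first page
            break
    return posts


# Stand-in for the network call: three fake pages keyed by their 'after' token.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'title': 'post 1'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'title': 'post 2'}}], 'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'title': 'post 3'}}], 'after': None}},
}

def fake_fetch(after):
    return FAKE_PAGES[after]

posts = collect_posts(fake_fetch)
print([p['title'] for p in posts])
```

With a real fetcher, the sleep between requests would sit inside the loop exactly as in the cell above.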

In [10]:
#saving the posts collected in json format (dictionary) into a csv and dataframe 
posts = []
after = None

for a in range(50):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Preety'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
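    if a > 0:
        # append the current page to the posts already saved on disk
        prev_posts = pd.read_csv('legaladvice.csv')
        current_df = pd.DataFrame(current_posts)
        pd.concat([prev_posts, current_df]).to_csv('legaladvice.csv', index = False)
    else:
        # first page: create the csv file
        pd.DataFrame(posts).to_csv('legaladvice.csv', index = False)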

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/legaladvice.json
4
https://www.reddit.com/r/legaladvice.json?after=t3_ets76p
4
https://www.reddit.com/r/legaladvice.json?after=t3_etxceo
6
https://www.reddit.com/r/legaladvice.json?after=t3_eu2c1t
3
https://www.reddit.com/r/legaladvice.json?after=t3_etwwb9
2
https://www.reddit.com/r/legaladvice.json?after=t3_etw3f7
4
https://www.reddit.com/r/legaladvice.json?after=t3_eu020d
5
https://www.reddit.com/r/legaladvice.json?after=t3_etyu3z
4
https://www.reddit.com/r/legaladvice.json?after=t3_etxz27
4
https://www.reddit.com/r/legaladvice.json?after=t3_etx9dy
6
https://www.reddit.com/r/legaladvice.json?after=t3_etvyfq
2
https://www.reddit.com/r/legaladvice.json?after=t3_etvh9b
2
https://www.reddit.com/r/legaladvice.json?after=t3_etub99
4
https://www.reddit.com/r/legaladvice.json?after=t3_etctrx
6
https://www.reddit.com/r/legaladvice.json?after=t3_ets8fo
6
https://www.reddit.com/r/legaladvice.json?after=t3_etrcol
3
https://www.reddit.com/r/legaladvice.json?after=t3_ets8b
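
Saving after every page protects against losing everything if a request fails midway. Here is a minimal sketch of appending each page to a CSV file with the standard-library `csv` module. The helper name `append_posts`, the temporary file path, and the sample rows are assumptions for illustration and are not part of the notebook.

```python
import csv
import os
import tempfile

FIELDS = ['subreddit', 'title', 'selftext']

def append_posts(path, page_posts):
    """Append one page of posts to a CSV file, writing the header only once."""
    is_new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        # extrasaction='ignore' drops any post fields not listed in FIELDS
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction='ignore')
        if is_new_file:
            writer.writeheader()
        writer.writerows(page_posts)

# Two hypothetical pages of scraped posts.
page_1 = [{'subreddit': 'legaladvice', 'title': 'First', 'selftext': 'a', 'score': 3}]
page_2 = [{'subreddit': 'legaladvice', 'title': 'Second', 'selftext': 'b', 'score': 7}]

csv_path = os.path.join(tempfile.mkdtemp(), 'posts.csv')
for page in (page_1, page_2):
    append_posts(csv_path, page)

# Read the file back to confirm both pages were saved.
with open(csv_path, newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
print(len(rows), [r['title'] for r in rows])
```

Appending in place avoids re-reading the whole file on every iteration, which matters more as the number of saved pages grows.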

In [11]:
#checking the number of scraped posts from legaladvice subreddit
len(posts)

1241

In [17]:
#putting the scraped posts into a dataframe, limiting to only 3 columns to be used for classification purposes, and saving it as a csv file
legaladvice_df = pd.DataFrame(posts, columns = ['subreddit', 'title', 'selftext'])
legaladvice_df.to_csv('legaladvice.csv', index = False)

In [18]:
#reading the saved csv file back into a dataframe
legaladvice_df = pd.read_csv('legaladvice.csv')

In [19]:
#checking the legaladvice dataframe  
legaladvice_df.head()

| |subreddit|title|selftext|
|---|---|---|---|
|0|legaladvice|2019 Taxes - IRS Free File program now open|Hey folks, “tax season” is upon us, I wanted t...|
|1|legaladvice|My sister is wanting my kid after I die|I have always been straight forward and let ev...|
|2|legaladvice|Code bootcamp preys on minorities and tricks t...|TLDR: I got tricked into signing an income sha...|
|3|legaladvice|Victoria, Australia. Ex employer uses software...|I worked at a business that among many other t...|
|4|legaladvice|[MA] I was just named as the defendant in a la...|This morning I woke up to papers saying that I...|


In [20]:
legaladvice_df.shape

(1241, 3)

### Subreddit 2 - Excel

In [44]:
#url of second subreddit - excel
url_2 = 'https://www.reddit.com/r/excel.json'

In [45]:
res = requests.get(url_2)

In [46]:
#checking the status of the website
res.status_code

429

In [47]:
res = requests.get(url_2, headers={'User-agent': 'Preety'})

In [48]:
res.status_code

200

In [49]:
#scraping the data from excel subreddit into json(dictionary) format
posts = []
after = None

for a in range(50):
    if after is None:
        current_url = url_2
    else:
        current_url = url_2 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Preety'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/excel.json
2
https://www.reddit.com/r/excel.json?after=t3_ett5n7
11
https://www.reddit.com/r/excel.json?after=t3_etmegs
39
https://www.reddit.com/r/excel.json?after=t3_ethkc8
2
https://www.reddit.com/r/excel.json?after=t3_etbjkp
34
https://www.reddit.com/r/excel.json?after=t3_etcfxz
37
https://www.reddit.com/r/excel.json?after=t3_et4vuy
56
https://www.reddit.com/r/excel.json?after=t3_eszios
7
https://www.reddit.com/r/excel.json?after=t3_esuhkp
31
https://www.reddit.com/r/excel.json?after=t3_esvifi
58
https://www.reddit.com/r/excel.json?after=t3_esl03l
40
https://www.reddit.com/r/excel.json?after=t3_espg81
32
https://www.reddit.com/r/excel.json?after=t3_eshkwa
32
https://www.reddit.com/r/excel.json?after=t3_esf1du
31
https://www.reddit.com/r/excel.json?after=t3_esdqn9
26
https://www.reddit.com/r/excel.json?after=t3_esccdo
2
https://www.reddit.com/r/excel.json?after=t3_es26th
40
https://www.reddit.com/r/excel.json?after=t3_es0h5u
57
https://www.reddit.com/r/excel

In [57]:
#saving the posts collected in json format (dictionary) into a csv and dataframe 
posts = []
after = None

for a in range(50):
    if after is None:
        current_url = url_2
    else:
        current_url = url_2 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Preety'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
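    if a > 0:
        # append the current page to the posts already saved on disk
        prev_posts = pd.read_csv('excel.csv')
        current_df = pd.DataFrame(current_posts)
        pd.concat([prev_posts, current_df]).to_csv('excel.csv', index = False)
    else:
        # first page: create the csv file
        pd.DataFrame(posts).to_csv('excel.csv', index = False)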

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/excel.json
4
https://www.reddit.com/r/excel.json?after=t3_etr93n
6
https://www.reddit.com/r/excel.json?after=t3_etm9w2
6
https://www.reddit.com/r/excel.json?after=t3_etc97f
4
https://www.reddit.com/r/excel.json?after=t3_etbg3s
6
https://www.reddit.com/r/excel.json?after=t3_etcdpe
2
https://www.reddit.com/r/excel.json?after=t3_esxml7
5
https://www.reddit.com/r/excel.json?after=t3_eszbgm
4
https://www.reddit.com/r/excel.json?after=t3_esv9hv
2
https://www.reddit.com/r/excel.json?after=t3_essk5v
2
https://www.reddit.com/r/excel.json?after=t3_esqnyt
3
https://www.reddit.com/r/excel.json?after=t3_esg5yx
2
https://www.reddit.com/r/excel.json?after=t3_esfpdv
5
https://www.reddit.com/r/excel.json?after=t3_esi5ib
3
https://www.reddit.com/r/excel.json?after=t3_esdng4
2
https://www.reddit.com/r/excel.json?after=t3_esc55d
5
https://www.reddit.com/r/excel.json?after=t3_es561d
5
https://www.reddit.com/r/excel.json?after=t3_es1u0u
4
https://www.reddit.com/r/excel.json?after=t3

In [58]:
#checking the total number of posts/rows for excel subreddit 
len(posts)

1247

In [59]:
#putting the scraped data from excel subreddit in a dataframe; keeping only 3 columns, namely - subreddit, title, selftext for classification purposes, and saving it as a csv file
excel_df = pd.DataFrame(posts, columns = ['subreddit', 'title', 'selftext'])
excel_df.to_csv('excel.csv', index = False)

In [62]:
#reading the saved csv file back into a dataframe
excel_df = pd.read_csv('excel.csv')

In [63]:
excel_df.head()

| |subreddit|title|selftext|
|---|---|---|---|
|0|excel|Dynamic Arrays released along with the FILTER,...|Dynamic arrays has released to all Office 365 ...|
|1|excel|Determining the vertex/peak of an irregular cu...|Hi All,\n\nDoes anyone know how to find the ma...|
|2|excel|Enter data on a field/pop-up then fills on a t...|I want to input data on a cell or a pop up fro...|
|3|excel|Is there a way to extract data from a Google s...|Like, can i have a range in Excel reference a ...|
|4|excel|Trying to calculate the minimum of (p&amp;l) d...|need to calculate profit divided by sales in s...|


In [76]:
#checking the dimensions of excel dataframe - 1247 rows and 3 columns 
excel_df.shape

(1247, 3)

In [77]:
#checking null values in the three columns of excel dataframe 
excel_df.isnull().sum()

subreddit    0
title        0
selftext     0
dtype: int64

In [79]:
#checking the duplicated rows in legaladvice dataframe 
legaladvice_df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1236     True
1237     True
1238     True
1239     True
1240     True
Length: 1241, dtype: bool

In [82]:
#dropping duplicates from legaladvice dataframe
legaladvice_df.drop_duplicates(inplace=True)

In [83]:
#checking the number of rows and columns after dropping duplicates from legaladvice dataframe
legaladvice_df.shape

(994, 3)
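
Dropping duplicates reduces the legaladvice data from 1,241 rows to 994. The idea behind `drop_duplicates` (keep the first occurrence of each fully identical row) can be sketched in plain Python. The function name `drop_duplicate_rows` and the sample rows below are hypothetical.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique_rows = []
    for row in rows:
        # a row counts as a duplicate only if all three columns match
        key = (row['subreddit'], row['title'], row['selftext'])
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows

# Hypothetical scraped rows; the third repeats the first, as happens when
# the same page of a subreddit is requested more than once.
rows = [
    {'subreddit': 'legaladvice', 'title': 'Post A', 'selftext': 'text a'},
    {'subreddit': 'legaladvice', 'title': 'Post B', 'selftext': 'text b'},
    {'subreddit': 'legaladvice', 'title': 'Post A', 'selftext': 'text a'},
]
unique_rows = drop_duplicate_rows(rows)
print(len(rows), '->', len(unique_rows))
```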

In [84]:
#dropping duplicates from excel dataframe
excel_df.drop_duplicates(inplace=True)

In [101]:
#checking the number of rows and columns after dropping duplicates from excel dataframe
excel_df.shape

(821, 3)

In [104]:
#concatenating the legaladvice and excel dataframes into a final dataframe 
final = pd.concat([legaladvice_df, excel_df])

In [105]:
final.head()

| |subreddit|title|selftext|
|---|---|---|---|
|0|legaladvice|2019 Taxes - IRS Free File program now open|Hey folks, “tax season” is upon us, I wanted t...|
|1|legaladvice|My sister is wanting my kid after I die|I have always been straight forward and let ev...|
|2|legaladvice|Code bootcamp preys on minorities and tricks t...|TLDR: I got tricked into signing an income sha...|
|3|legaladvice|Victoria, Australia. Ex employer uses software...|I worked at a business that among many other t...|
|4|legaladvice|[MA] I was just named as the defendant in a la...|This morning I woke up to papers saying that I...|


In [106]:
final.tail()

| |subreddit|title|selftext|
|---|---|---|---|
|816|excel|How do I code a sum function for a coded surve...|\n\nHey guys, I don't know a lot about excel  ...|
|817|excel|copy &amp; paste values using a VB code / Macro|Hi guys,\n\ni'm looking for a VB code that can...|
|818|excel|How to Increase all Values in a Column by a Fi...|I have a column of a few sequential hundred va...|
|819|excel|Excel displaying extra black window in task view|When I open excel it open a single windows but...|
|820|excel|I want to look for a specific word in a table ...|I am tasked with displaying the daily change i...|


In [107]:
#checking the dimensions of the final dataframe (this has both excel and legaladvice subreddit)
final.shape

(1815, 3)
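
With 994 legaladvice posts and 821 excel posts, the two classes are not perfectly balanced. A quick calculation of the class shares gives the baseline accuracy that any classifier has to beat, namely the accuracy of always predicting the majority class. The variable names below are illustrative.

```python
# Row counts after dropping duplicates, taken from the shapes printed above.
class_counts = {'legaladvice': 994, 'excel': 821}

total = sum(class_counts.values())
shares = {name: count / total for name, count in class_counts.items()}

# Always predicting the largest class yields an accuracy equal to its share.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = shares[majority_class]

print(total)                        # 1815
print(majority_class)               # legaladvice
print(round(baseline_accuracy, 3))  # 0.548
```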

In [108]:
#saving the final concatenated dataframe to csv file
final.to_csv('final.csv', index=False)