
# Project 3: Web APIs & Classification

## Problem Statement
Among the classification models selected for testing, identify the one that best distinguishes posts from the Learn Programming subreddit from posts from the Learn Machine Learning subreddit.

The project is split into 2 notebooks:

 - Notebook 1: Gathering and Cleaning of Data
 - Notebook 2: Exploratory Data Analysis, Modelling and Evaluation


## Executive Summary
This project applies Natural Language Processing ('NLP') models to classify post contents into one of 2 similar subreddits, based on the words most frequently used in and associated with each subreddit. Accurate classification allows search results to better match the keywords entered by users. As part of the requirements, a Naive Bayes classifier is used, alongside Logistic Regression and Random Forest for evaluation.

The steps taken include:
 - scraping data from the selected subreddits
 - selecting the columns required for NLP classification
 - cleaning the data using BeautifulSoup, NLTK's stopwords and regular expressions

After cleaning, the dataset totals 1657 posts, of which 58.72% are from Learn Programming and 41.28% are from Learn Machine Learning. This works out to a baseline accuracy of 0.5869. The classes are not perfectly balanced, but the imbalance is not large enough to materially affect the accuracy scores.

Exploratory data analysis of the selftext column shows common words that appear frequently in both subreddits, so these words are added to the stop words in order to increase the accuracy of the models. Stemming is selected over lemmatization and the original text, based on the accuracy results from running Logistic Regression across all 3. Stemming does not always reduce a word to an actual dictionary form, but since the focus of the model is accuracy, selections are based primarily on the accuracy figures.
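
To illustrate the point about stemming, the following minimal sketch runs NLTK's PorterStemmer on a few sample words. The words are chosen for illustration only and are not taken from the dataset.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Sample words chosen for illustration only
sample_words = ["learning", "studies", "machine"]

# Porter stemming strips suffixes by rule, so some stems are not dictionary words
stems = {word: stemmer.stem(word) for word in sample_words}
print(stems)
```

A stem such as the one produced for "studies" is not an English word, which is the trade-off accepted here in exchange for accuracy.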

6 models were built for evaluation:
 1) Count Vectorizer with Naive Bayes (MultinomialNB)
 2) Tfidf Vectorizer with Naive Bayes (MultinomialNB)
 3) Tfidf Vectorizer with Logistic Regression
 4) Count Vectorizer with Logistic Regression
 5) Count Vectorizer with Random Forest
 6) Tfidf Vectorizer with Random Forest
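
As a rough sketch of how Model 1 could be assembled, the block below chains CountVectorizer and MultinomialNB in a scikit-learn Pipeline. The four toy documents, their labels and the step names are made up for illustration; the actual models, with their tuned parameters, are built in Notebook 2.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy documents and labels, made up for illustration
# (1 = Learn Programming, 0 = Learn Machine Learning)
docs = [
    "python loop function bug",
    "java class compile error",
    "neural network training loss",
    "gradient descent model training",
]
labels = [1, 1, 0, 0]

# Step names 'cvec' and 'nb' are arbitrary
model_1 = Pipeline([
    ("cvec", CountVectorizer()),  # turn text into word counts
    ("nb", MultinomialNB()),      # Naive Bayes on those counts
])
model_1.fit(docs, labels)

predictions = model_1.predict(["compile error in loop", "training loss gradient"])
print(predictions)
```

Swapping CountVectorizer for TfidfVectorizer, or MultinomialNB for LogisticRegression or RandomForestClassifier, gives the other five combinations in the list.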
 
All 6 models have higher cross-validated mean train scores than their respective test scores, which could indicate overfitting, although the test scores were all in the region of 0.9. The high scores could be a result of the 2 subreddits being largely unrelated, even though one could argue that programming is part and parcel of machine learning.

Models 1 and 4 are very close in terms of test scores and classification report results. Both gave an accuracy of 0.93, and their f1 scores are the same. However, Model 1 gave an equal precision of 0.93 for both subreddits, compared to 0.91 and 0.94 for Model 4. Since there is no preference for class 0 over class 1 or vice versa, the equal precision of Model 1 is preferred. The AUC scores for Models 1 and 4 were also very close, at 0.973 and 0.978 respectively, which suggests that both models separate the two classes well. Because the cross-validated mean train score for Model 1 is slightly lower than that of Model 4, which suggests less overfitting, Model 1 is the best performing model overall.

For the next iteration, the classifier should be applied to unseen data to validate the scores, and lemmatization and additional stop words could be tried to reduce overfitting. The title can also be included together with the selftext, since the question could be in the title itself. The model's accuracy can then be further validated using other subreddits.
 

### Contents
- [Import Libraries](#Import-Libraries)
- [Data Gathering](#Data-Gathering)
- [Data Cleaning](#Data-Cleaning)





## Import Libraries

In [1]:
# library imports
import requests
import time
import pandas as pd
import numpy as np
import re

# preprocessing imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# modeling imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Data Gathering

The loop below requests posts from reddit.com 50 times per subreddit to draw approximately 1200 posts, which leaves room for null values and duplicate posts that may be dropped afterwards.

A time.sleep of 1 second and a custom user agent are used to avoid being blocked by Reddit.
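
The paging logic can be sketched separately from the network call. In the block below, build_page_url, collect_posts and the fetch_json parameter are hypothetical names introduced for illustration, and the sketch assumes that a missing 'after' token means there are no further pages.

```python
import time
from urllib.parse import urlencode


def build_page_url(base_url, after=None):
    """Return the listing URL, adding the 'after' token when paging on."""
    if after is None:
        return base_url
    return base_url + "?" + urlencode({"after": after})


def collect_posts(base_url, fetch_json, pages=50, delay=1.0):
    """Collect post dictionaries across several pages of a listing.

    fetch_json is a hypothetical callable that takes a URL and returns the
    decoded JSON response, so the paging logic can run without a network.
    """
    posts, after = [], None
    for _ in range(pages):
        data = fetch_json(build_page_url(base_url, after))["data"]
        posts.extend(child["data"] for child in data["children"])
        after = data["after"]
        if after is None:  # assumed to mean there are no further pages
            break
        time.sleep(delay)  # pause between requests
    return posts
```

With the requests library, fetch_json could be a small wrapper that calls requests.get with the user agent header and returns res.json().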

### Scrape Data from Subreddit: Learn Programming

In [2]:
#Scraping loop, kept commented out so that the notebook reads the saved csv instead of scraping again
#posts = []
#after = None
#url = 'https://www.reddit.com/r/learnprogramming.json'

#for a in range(50):
#    # The first request has no 'after' token; later requests continue from the last post seen
#    if after is None:
#        current_url = url
#    else:
#        current_url = url + '?after=' + after
#    print(current_url)
#    res = requests.get(current_url, headers={'User-agent': 'scruffy 1.0'})

#    if res.status_code != 200:
#        print('Status error', res.status_code)
#        break

#    current_dict = res.json()
#    current_posts = [p['data'] for p in current_dict['data']['children']]
#    posts.extend(current_posts)
#    after = current_dict['data']['after']

#    # Save everything gathered so far after each request
#    pd.DataFrame(posts).to_csv('./data/learnprogramming.csv', index = False)

#    time.sleep(1)


In [3]:
#Check length of post
#len(posts)

### Scrape Data from Subreddit: Learn Machine Learning

In [4]:
#Scraping loop, kept commented out so that the notebook reads the saved csv instead of scraping again
#posts = []
#after = None
#url = 'https://www.reddit.com/r/learnmachinelearning.json'

#for a in range(50):
#    # The first request has no 'after' token; later requests continue from the last post seen
#    if after is None:
#        current_url = url
#    else:
#        current_url = url + '?after=' + after
#    print(current_url)
#    res = requests.get(current_url, headers={'User-agent': 'scruffy 1.0'})

#    if res.status_code != 200:
#        print('Status error', res.status_code)
#        break

#    current_dict = res.json()
#    current_posts = [p['data'] for p in current_dict['data']['children']]
#    posts.extend(current_posts)
#    after = current_dict['data']['after']

#    # Save everything gathered so far after each request
#    pd.DataFrame(posts).to_csv('./data/learnmachinelearning.csv', index = False)

#    time.sleep(1)

In [5]:
#len(posts)

### Import Files

In [6]:
#Read in csv files saved
learnprogramming = pd.read_csv('./data/learnprogramming.csv')
learnmachinelearning = pd.read_csv('./data/learnmachinelearning.csv')

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 1300)

In [23]:
# Check shape of both datasets for sufficient line items
print('Shape of learnprogramming :',learnprogramming.shape)
print('Shape of learnmachinelearning:',learnmachinelearning.shape)

Shape of learnprogramming : (1252, 103)
Shape of learnmachinelearning: (1239, 110)


## Data Cleaning

### Combine both dataframes

In [8]:
#Combine both dataframes for easier manipulation
combined_df = pd.concat([learnprogramming, learnmachinelearning], ignore_index=True, sort = False)

In [24]:
#Cross check if the dataframes were combined 
combined_df.shape

(1661, 4)

### Select Features List

Based on the Reddit site, the key columns required are:
- Title, which gives the title of the thread
- Selftext, which holds the contents of the thread
- Subreddit, which shows which subreddit the post is from

The author column is also retained in case it comes in handy, as the 2 subreddits are quite closely related and the same author could post in both.


In [10]:
#Return the dataframe with only the selected features from the features list
features_list = ['author', 'title', 'selftext', 'subreddit']
combined_df = combined_df[features_list]

In [11]:
combined_df.head()

| | author | title | selftext | subreddit |
|---|---|---|---|---|
| 0 | michael0x2a | New? READ ME FIRST! | # Welcome to /r/learnprogramming!\n\n## Quick ... | learnprogramming |
| 1 | AutoModerator | What have you been working on recently? [May 0... | What have you been working on recently? Feel f... | learnprogramming |
| 2 | Dealoite | Does anyone else REGRET becoming a web developer? | I'm a web developer and I hate it.\n\nI enjoye... | learnprogramming |
| 3 | OpenSourcere42069 | Programming portfolio? | Hello, I'm looking for some advice on making a... | learnprogramming |
| 4 | Scorlibpl | Need advice about data science | Hey guys, so currently im a IT diploma student... | learnprogramming |


In [12]:
# Check combined dataframe
combined_df.shape

(2491, 4)

### Drop duplicate posts

In [13]:
#Remove rows where both the title and the selftext duplicate an earlier row
combined_df.drop_duplicates(subset = ['title', 'selftext'], keep = 'first', inplace = True)

In [14]:
#Check for number of line items
combined_df.shape

(1953, 4)
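
The subset argument matters here: a row is dropped only when both its title and its selftext repeat an earlier row. A small sketch with made-up rows shows the effect.

```python
import pandas as pd

# Made-up rows: the first two are identical, the third shares only the title
toy = pd.DataFrame({
    "title": ["Help with loops", "Help with loops", "Help with loops"],
    "selftext": ["My for loop never ends", "My for loop never ends", "Different question"],
})

# Only the exact title + selftext repeat is removed; the first occurrence is kept
deduped = toy.drop_duplicates(subset=["title", "selftext"], keep="first")
print(deduped.shape)
```

The third row survives because its selftext differs, even though its title is repeated.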

### Check for null items

In [25]:
#Check for null items
combined_df.isnull().sum()

author       0
title        0
selftext     0
subreddit    0
dtype: int64

In [16]:
#drop NAN rows
combined_df.dropna(inplace = True)

### Encode Subreddit

This encodes the subreddit column as a numeric label. Learn Programming is mapped to 1 (positive) and Learn Machine Learning to 0 (negative).

In [17]:
# Map subreddit names to numeric labels and assign the result back to the column
combined_df['subreddit'] = combined_df['subreddit'].map({'learnprogramming': 1, 'learnmachinelearning': 0})

### Text Cleaning

The function below is applied to each row in the dataset to perform the following on the specified column:
   - remove HTML tags
   - remove URLs
   - remove punctuation and other non-letter characters
   - convert the text to lower case and split it into individual words
   - remove stopwords
   - join the words back together and write the cleaned text back to the dataframe

In [18]:
def clean_contents(df, column_text):
    # 1. Remove HTML tags like <br>, using an explicitly named parser
    removed_html = BeautifulSoup(column_text, 'html.parser').get_text()

    # 2. Remove links: 'http' followed by any run of non-whitespace characters
    removed_links = re.sub(r"http\S+", " ", removed_html)

    # 3. Replace non-letters with spaces
    removed_punctuation = re.sub("[^a-zA-Z]", " ", removed_links)

    # 4. Convert to lower case and split into individual words
    lower_words = removed_punctuation.lower().split()

    # 5. Convert nltk's stopwords into a set for fast lookups
    stops = set(stopwords.words('english'))

    # 6. Keep only the words that are not stop words
    meaningful_words = [word for word in lower_words if word not in stops]

    # 7. Join the words back into one string and replace the original text in the dataframe
    cleansed_text = " ".join(meaningful_words)
    df.replace(column_text, cleansed_text, inplace = True)

In [19]:
for selftext in combined_df['selftext']:
    clean_contents(combined_df, selftext)



In [20]:
for title in combined_df['title']:
    clean_contents(combined_df, title)
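
An alternative design, sketched here, has the cleaning function return the cleaned string and applies it to one column at a time, which avoids searching the whole dataframe for each post. The name clean_text and the small stop word set are hypothetical stand-ins, and tag removal uses a simple regular expression in place of BeautifulSoup so that the sketch needs only pandas.

```python
import re
import pandas as pd

# Tiny stand-in stop word set; the notebook uses NLTK's English stop words
STOPS = {"i", "a", "the", "is", "my", "at", "to", "and"}


def clean_text(text, stops=STOPS):
    """Return lower-cased text with tags, links, non-letters and stop words removed."""
    no_tags = re.sub(r"<[^>]+>", " ", text)      # crude tag removal, assumes simple markup
    no_links = re.sub(r"http\S+", " ", no_tags)  # drop links
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_links)
    words = [w for w in letters_only.lower().split() if w not in stops]
    return " ".join(words)


# Made-up post text for illustration
toy = pd.DataFrame({"selftext": ["My code is at https://example.com/repo <br> Help!"]})
toy["selftext"] = toy["selftext"].apply(clean_text)
print(toy["selftext"].iloc[0])
```

Because the function touches only the string it is given, the same call works unchanged for the title column.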

In [21]:
# Check Number of observations for each subreddit
combined_df['subreddit'].value_counts(normalize = True)

1    0.586996
0    0.413004
Name: subreddit, dtype: float64
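
The larger of the two proportions above is the baseline accuracy, which is the score a model would get by always predicting the majority class. A minimal sketch with made-up labels:

```python
import pandas as pd

# Made-up labels: 1 = Learn Programming, 0 = Learn Machine Learning
labels = pd.Series([1, 1, 1, 0, 0])

# Always predicting the most common class is correct this fraction of the time
baseline_accuracy = labels.value_counts(normalize=True).max()
print(baseline_accuracy)
```

Any model worth keeping should score clearly above this figure on the test set.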

In [22]:
#save a copy of cleaned dataset
combined_df.to_csv('data/cleandata.csv', index = False)