# Project FitNut - Classification Modelling via Natural Language Processing

<img src="../image/cover.png" alt="cover"/>

([*Source*](https://surgesr.com/wp-content/uploads/2021/09/surge-new-logo.png))


# Contents  

## Part 1
### [Overview](#overview)
### [Problem Statement](#problemstatement)
### [Method(ology)](#methodology)
- [Import Libraries](#importlibraries)
- [Model Framework](#modelframework)

### [Data Collection](#datacollection)
- [Fitness](#fitcollection)
- [Nutrition](#nutcollection)

## Part 2 (see Notebook 2)  
---


<div id="overview"></div>

### Overview

The prevalence of Covid-19 has accelerated the digital transformation journey of traditional brick-and-mortar businesses on a global scale. SURGE - an elite private gym concept under the Core Collective group specializing in curated fitness/wellness programmes - is exploring a new business unit that focuses on a tailored dual fitness-and-nutrition concept, as membership rates for its physical gym at 79 Anson Road have dwindled. Before diving in, SURGE plans to study the latest trends and gauge ground sentiment on fitness/nutrition by analyzing two relevant subreddits: [*r/bodyweightfitness*](https://www.reddit.com/r/bodyweightfitness/) and [*r/EatCheapAndHealthy*](https://www.reddit.com/r/EatCheapAndHealthy/). 


<div id="problemstatement"></div>

### Problem Statement

A blanket approach was adopted in downloading 2,000 threads from the *bodyweightfitness* and *EatCheapAndHealthy* subreddits (i.e. the posts were pooled without being sorted by subreddit). However, the fitness and nutrition portfolios are handled by two different teams in SURGE. As the hired Data Science consultant, develop a classification model to determine which of the two subreddits a given thread originates from. 


<div id="methodology"></div>

### Method(ology)

The relevant libraries are imported, and the underlying rationale for the production model is discussed. 

<div id="importlibraries"></div>

##### Import Libraries

In [1]:
# General Modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data Collection/Scraping Modules
import requests

# NLP Modules
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Modelling Modules
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, accuracy_score, plot_roc_curve, roc_auc_score, recall_score, precision_score, f1_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

%matplotlib inline


<div id="modelframework"></div>

##### Model Framework

The goal is to develop a binary classification model. To ensure holistic decision-making in production modelling, we consider a varied set of possible models, each with their respective mechanisms:   

    a) Logistic Regression - A relatively simple discriminative algorithm that estimates the probability P(x) of an observation belonging to class 1 and classifies it against a probability threshold (0.5 by default): above the threshold the observation is assigned to class 1, otherwise to the other class.    

    b) KNeighborsClassifier (KNN) - Classifies an observation by majority vote among its k "nearest neighbours", with proximity typically measured by Euclidean/Manhattan distance across the feature vector space.  

    c) Multinomial Naive Bayes - Classification technique based on Bayes' theorem, whereby the conditional probabilities of an observation's features given each class are combined to determine its predicted class. Naive Bayes makes the "naive" assumption that features are conditionally independent of one another given the class. Multinomial Naive Bayes is selected over Gaussian and Bernoulli as the dataset involves discrete data (word counts) which is not normally distributed, and the predictor variable (i.e. subreddit title + selftext) is not binary in nature.  

    d) Random Forest - Ensemble classification algorithm comprising multiple decision trees, each trained on a bootstrap sample of the data using random subsets of features, which predicts classes based on the "wisdom of crowds"; the predictions of the individual trees are aggregated into a final class, so that the errors of any single tree tend to be outvoted by the rest.  
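
As a rough illustration of how the four candidates above could be compared on equal footing, the sketch below wraps each estimator in the same TF-IDF pipeline. The toy corpus, the dictionary keys, and the hyperparameter values are assumptions made for this example only, not the settings used in the project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = fitness post, 0 = nutrition post
texts = [
    "pull-up bar for door frame",
    "push-up progression routine",
    "squat form and core strength",
    "handstand training tips",
    "cheap healthy lentil soup recipe",
    "budget meal prep with rice and beans",
    "easy oatmeal breakfast ideas",
    "what to cook with frozen vegetables",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Candidate estimators (hyperparameters are illustrative assumptions)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "nb": MultinomialNB(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit every candidate behind an identical vectorizer so results are comparable
fitted = {}
for name, estimator in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", estimator)])
    pipe.fit(texts, labels)
    fitted[name] = pipe

# Each pipeline accepts raw text and returns a class label
predictions = {name: pipe.predict(["pull-up routine"])[0] for name, pipe in fitted.items()}
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the data passed to `fit`, which avoids leaking test-set words into training when the pipeline is later cross-validated.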

Similarly, in evaluating model performance, we consider various scoring metrics to ensure the production model is robust:  

    i) Accuracy - Arguably the most intuitive metric and often the first to be reviewed. Derived by [(TP + TN) / (TP + FP + TN + FN)], it gives the ratio of correct predictions to total predictions. However, accuracy alone is not adequate: it does not distinguish between the types of error made (false positives versus false negatives), and it shifts with the classification threshold chosen, so a favourable figure can mask a biased outcome.  

    ii) F1-score - Defined as the harmonic mean of precision and recall, it follows the formula [2 * (Precision * Recall) / (Precision + Recall)], where Precision is [TP / (TP + FP)] and Recall (i.e. sensitivity) is [TP / (TP + FN)]. Some classification problems cannot accept false negatives (i.e. they prioritize sensitivity, as in identifying potential terrorists or chronic/fatal diseases), while others cannot tolerate false positives (i.e. they prioritize specificity, as when an individual declared positive receives punishment). In the context of this project, there is no positive/negative in this sense, as both binary classes merely state the subreddit origin of a given thread. Hence, we strive for a balanced proxy by examining the F1-score, which not only gives weight to the share of actual positives that are correctly identified (recall) but also indicates how much confidence can be placed in the predicted positives (precision). The F1-score is high only when precision and recall are both high.  

    iii) ROC AUC - Bringing it all together, we analyze the ROC AUC score as well, which equates to the area under the ROC curve plotting the True Positive Rate against the False Positive Rate across classification thresholds. A score close to 1 indicates that the model separates the two classes well, keeping both false positives and false negatives low. Conversely, a score of 0.5 represents what is effectively the worst-case scenario, where the model has no capacity to discriminate between the two classes (i.e. it performs no better than random guessing).   

For all scoring metrics, we strive for a benchmark of 90% (i.e. >= 0.9). The intention of setting this benchmark is to strike a balance between reducing classification errors, which would cost time/effort to manually reroute subreddit posts, and ensuring that the production model is not overfitted.  


*TP = True Positives (Post predicted as belonging to bodyweightfitness subreddit and indeed belonging to bodyweightfitness subreddit)*  
*TN = True Negatives (Post predicted as belonging to EatCheapAndHealthy subreddit and indeed belonging to EatCheapAndHealthy subreddit)*   
*FP = False Positives (Post predicted as belonging to bodyweightfitness subreddit but actually under EatCheapAndHealthy subreddit)*  
*FN = False Negatives (Post predicted as belonging to EatCheapAndHealthy subreddit but actually under bodyweightfitness subreddit)*  
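
To make the formulas above concrete, the following sketch computes each metric by hand from a confusion matrix and checks it against scikit-learn. The labels and predicted probabilities are invented for illustration (1 = bodyweightfitness, 0 = EatCheapAndHealthy) and are not results from this project.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth and predicted probabilities of class 1
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [int(p >= 0.5) for p in y_prob]  # apply a 0.5 threshold

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics computed directly from the formulas in the text
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# ROC AUC needs the probabilities, not the thresholded labels
auc = roc_auc_score(y_true, y_prob)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} roc_auc={auc:.3f}")
```

Note that accuracy, precision, recall and F1 all depend on the 0.5 threshold applied to `y_prob`, whereas ROC AUC is computed over every possible threshold.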


<div id="datacollection"></div>

### Data Collection

The models will be trained using ~2,000 posts from the *bodyweightfitness* and *EatCheapAndHealthy* subreddits (1,000 from each). While this figure may seem excessive given that 1,000 posts should suffice in developing the model, the larger sample size provides a buffer for invalid observations (e.g. duplicate posts, non-text posts). Additionally, a larger sample size will improve the validity of results by virtue of a more encompassing train/test dataset.  

<div id="fitcollection"></div>

##### Fitness Data


In [5]:
fit_url = 'https://api.pushshift.io/reddit/search/submission?subreddit=bodyweightfitness' # URL for fitness subreddit


In [6]:
# Adapted from an r/pushshift thread (https://www.reddit.com/r/pushshift/comments/bfc2m1/capping_at_1000_posts/)
n = 0 # Number of posts collected, initialized at zero
last = '' # Timestamp cut-off for the next request (empty = start from the newest post)
fit_posts = []  # Empty list to append fitness posts subsequently

while n < 1000: # As the current Pushshift API limit is 100 posts per request, we loop until 1,000 posts are obtained
    req_fit = requests.get('{}&before={}'.format(fit_url, last)) # Request the next batch of posts older than `last`
    req_fit.raise_for_status() # Stop on an HTTP error rather than parsing an error page
    fit_data = req_fit.json() # Parse the JSON response
    if not fit_data['data']: # No older posts returned: stop to avoid an infinite loop
        break
    for post in fit_data['data']:
        fit_posts.append(post) # Add scraped post to consolidated list
        n += 1 # Increase post counter by 1
    last = int(post['created_utc']) # Next request fetches posts created before the oldest post collected so far
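
The paging logic above can also be expressed as a reusable helper that is testable without network access. This is only a sketch: `collect_posts`, `fetch_page` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` merely imitates an API that returns posts older than a given timestamp.

```python
from typing import Callable, Dict, List


def collect_posts(fetch_page: Callable[[str], List[Dict]], target: int) -> List[Dict]:
    """Page backwards in time until `target` posts are collected.

    `fetch_page(before)` is assumed to return a list of post dicts older than
    the `before` timestamp, each carrying a `created_utc` key.
    """
    posts: List[Dict] = []
    before = ""  # empty string means "start from the newest post"
    while len(posts) < target:
        batch = fetch_page(before)
        if not batch:  # nothing older available: stop instead of looping forever
            break
        posts.extend(batch)
        before = str(int(batch[-1]["created_utc"]))  # oldest timestamp seen so far
    return posts[:target]  # trim any overshoot from the final batch


def fake_fetch(before: str, page_size: int = 100) -> List[Dict]:
    """Offline stand-in for the API: 250 posts with timestamps 250 down to 1."""
    newest = int(before) - 1 if before else 250
    oldest = max(newest - page_size + 1, 1)
    return [{"created_utc": ts} for ts in range(newest, oldest - 1, -1)]


sample = collect_posts(fake_fetch, target=220)
everything = collect_posts(fake_fetch, target=1000)  # source runs dry at 250 posts
```

Passing the fetcher in as an argument means the same loop could serve both subreddits, and the early `break` covers the case where a subreddit has fewer posts than requested.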
        

In [7]:
fit_df = pd.DataFrame(fit_posts) # Convert scrapped data into pd dataframe format


In [8]:
fit_df = fit_df[['title','selftext','subreddit']] # Extract only relevant columns


In [10]:
fit_df.reset_index(drop = True, inplace = True)


In [11]:
fit_df.info() # Generally no issues, with 1,000 posts scraped and text stored in the expected object (str) dtype
                # Address missing selftext values during data cleaning


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      1000 non-null   object
 1   selftext   994 non-null    object
 2   subreddit  1000 non-null   object
dtypes: object(3)
memory usage: 23.6+ KB


In [12]:
fit_df.head(10) # To address [removed] selftext during data cleaning


| | title | selftext | subreddit |
|---|---|---|---|
| 0 | Which door frame pull-up bar? | Hi I've never trained pull-ups before and now ... | bodyweightfitness |
| 1 | Having trouble gaining weight | [removed] | bodyweightfitness |
| 2 | Pull-up negatives | [removed] | bodyweightfitness |
| 3 | How long to see Results? | [removed] | bodyweightfitness |
| 4 | I don't have weights | [removed] | bodyweightfitness |
| 5 | Upper back strengthening | [removed] | bodyweightfitness |
| 6 | Question about calisthenics skills periodization | First of all,I'm sorry for my bad english\nSo,... | bodyweightfitness |
| 7 | Wrist and sternum pain are killing my gains | [removed] | bodyweightfitness |
| 8 | Help me buy a assistance band. | [removed] | bodyweightfitness |
| 9 | A good tip I picked up for high-frequency trai... | Hey everyone,\n\nI know there's a lot of inter... | bodyweightfitness |


In [13]:
fit_df.duplicated().sum() # Noticeable number of duplicate posts which will likely be deleted during cleaning


25
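
The actual cleaning is deferred to Part 2, but as a hedged preview, the sketch below shows one possible way duplicate rows and `[removed]` or missing bodies could be handled with pandas. The toy rows and the decision to blank out unusable bodies (so that the title remains available as a feature) are assumptions for illustration, not the treatment applied in the project.

```python
import pandas as pd

# Hypothetical rows mimicking the issues seen in the scraped data
toy_df = pd.DataFrame({
    "title": ["Pull-up negatives", "Pull-up negatives",
              "How long to see results?", "Fruit Salad."],
    "selftext": ["[removed]", "[removed]", "Started two weeks ago ...", None],
    "subreddit": ["bodyweightfitness"] * 3 + ["EatCheapAndHealthy"],
})

# duplicated() flags rows that repeat an earlier row across all columns
n_duplicates = int(toy_df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy_df.drop_duplicates().reset_index(drop=True)

# One possible treatment: replace missing or "[removed]" bodies with an empty string
deduped["selftext"] = deduped["selftext"].fillna("").replace("[removed]", "")
```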

<div id="nutcollection"></div>

##### Nutrition Data

In [14]:
nut_url = 'https://api.pushshift.io/reddit/search/submission?subreddit=EatCheapAndHealthy' # URL for nutrition subreddit


In [15]:
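n = 0 # Reset post counter
last = '' # Reset the cut-off so scraping starts from the newest nutrition post
nut_posts = [] # Empty list to append nutrition posts subsequently

while n < 1000:
    req_nut = requests.get('{}&before={}'.format(nut_url, last)) # Request the next batch of posts older than `last`
    req_nut.raise_for_status() # Stop on an HTTP error rather than parsing an error page
    nut_data = req_nut.json() # Parse the JSON response
    if not nut_data['data']: # No older posts returned: stop to avoid an infinite loop
        break
    for post in nut_data['data']:
        nut_posts.append(post) # Add scraped post to consolidated list
        n += 1
    last = int(post['created_utc']) # Next request fetches posts created before the oldest post collected so far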


In [16]:
nut_df = pd.DataFrame(nut_posts)


In [17]:
nut_df = nut_df[['title','selftext','subreddit']]


In [18]:
nut_df.reset_index(drop = True, inplace = True)


In [19]:
nut_df.info() # Generally no issues, with 1,000 posts scraped and text stored in the expected object (str) dtype


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      1000 non-null   object
 1   selftext   1000 non-null   object
 2   subreddit  1000 non-null   object
dtypes: object(3)
memory usage: 23.6+ KB


In [20]:
nut_df.head(10) # To address [removed] and possibly null-value selftext during data cleaning


| | title | selftext | subreddit |
|---|---|---|---|
| 0 | The Smoothie Diet: 21 Day Rapid Weight Loss Pr... | This smoothie diet 21 day rapid weight loss pr... | EatCheapAndHealthy |
| 1 | Full Guide About Tosh Trek | | EatCheapAndHealthy |
| 2 | Colorado alternative to Aldi | I moved to CO a few years ago, and I still hav... | EatCheapAndHealthy |
| 3 | HOW TO LOSE WEIGHT TIPS AND TRICKS | [removed] | EatCheapAndHealthy |
| 4 | This weeks theme ingredient is... Pumpkin! Wha... | Our next key ingredient is **Pumpkin!** Let us... | EatCheapAndHealthy |
| 5 | Healthy meals advice | Hi everyone,\n\nI'm looking for meal ideas. He... | EatCheapAndHealthy |
| 6 | Fruit Salad. | | EatCheapAndHealthy |
| 7 | Accidentally got grapefruit instead of oranges... | [removed] | EatCheapAndHealthy |
| 8 | Buy organic A2 Cow Ghee Online in Delhi | | EatCheapAndHealthy |
| 9 | I’ve inherited many,many meatballs. What can I... | [removed] | EatCheapAndHealthy |


In [21]:
nut_df.duplicated().sum() # Noticeable number of duplicate posts which will likely be deleted during cleaning


25

Having successfully collected the desired number of posts from both subreddits, we combine both into a single dataframe for further processing: 

In [22]:
fitnut_df = pd.concat([fit_df, nut_df])


In [28]:
fitnut_df.shape # Correct total number of subreddit posts (2,000) and columns (3)


(2000, 3)

In [29]:
fitnut_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      2000 non-null   object
 1   selftext   1994 non-null   object
 2   subreddit  2000 non-null   object
dtypes: object(3)
memory usage: 62.5+ KB
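
The index above runs from 0 to 999 for 2,000 entries because `pd.concat` keeps each frame's original row labels by default. The small sketch below, using made-up two-row frames, contrasts this with `ignore_index=True` and shows a quick class-balance check. Since the export is written with `index = False`, the repeated labels do not reach the CSV either way.

```python
import pandas as pd

# Hypothetical miniature versions of fit_df and nut_df
a = pd.DataFrame({"title": ["t1", "t2"], "subreddit": ["bodyweightfitness"] * 2})
b = pd.DataFrame({"title": ["t3", "t4"], "subreddit": ["EatCheapAndHealthy"] * 2})

# Default behaviour: row labels from each frame are preserved, so they repeat
stacked = pd.concat([a, b])

# ignore_index=True discards the old labels and assigns a fresh 0..n-1 index
renumbered = pd.concat([a, b], ignore_index=True)

# Sanity check that both classes are equally represented
class_counts = renumbered["subreddit"].value_counts()
```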


In [31]:
# Export combined scraped dataframe
##fitnut_df.to_csv('../data/fitnut_scrapped.csv', index = False)


This concludes the current notebook (Part 1). In the following notebook (Part 2), we will perform data cleaning and EDA/visualizations, before moving to the formal modelling stage.  