# Project 3 - Book 2: Web APIs & Classification

## Problem Statement

Nutrino is a leading provider of nutrition related data services and analytics. As part of the data science team, we have been tasked to generate business insights curated from popular social media platforms. The company will be able to use that information to better understand customers & markets, enhance decision-making, and ultimately increase profitability.

To do so, we will first scrape data from Reddit and use classification models such as Logistic Regression and Naive Bayes to uncover patterns within 2 popular diets, Keto and Vegan. In developing this proof of concept, we want to be able to classify all available text data into business-ready data, strengthening our core analytics product.

We also hope to reveal previously unrecognised sub-trends that pertain to attitudes, lifestyles and buying behavior. With a better understanding of the population and their eating patterns, we will be able to advise our clients on how they can launch their targeted marketing campaigns and improve the success of their products and programs.


## Executive Summary

As the data science team at Nutrino, we have been tasked to build a classifier to improve the core product of the company, which is to provide nutrition related data services and analytics. We are also tasked to identify patterns in 2 currently trending diets, keto and vegan.

Our classifier achieved an accuracy score above 90%. We also identified patterns in the motivations and preferences of the 2 groups of subredditors, which will help determine the kind of customer engagement to pursue with each group.



## Notebooks:
- [Data Scraping and Cleaning](./book1_data_scrapping_cleaning.ipynb)
- [EDA](./book2_eda.ipynb)
- [Modeling and Recommendations](./book3_preprocesing_modeling_recommendations.ipynb)


## Contents:
- [Import Libraries](#Import-Libraries)
- [Import Data](#Import-Data)
- [Baseline Accuracy](#Baseline-Accuracy)
- [Model Prep](#Model-Prep)
- [Model Selection](#Model-Selection)
- [Modeling Test Data](#Modeling-Test-Data)
- [Classification Metrics](#Classification-Metrics)
- [Recommendations](#Recommendations)
- [Conclusion](#Conclusion)

### Import Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

%matplotlib inline

### Import Data

In [2]:
df = pd.read_csv('../datasets/data_clean.csv')

In [3]:
df.shape

(1647, 8)

In [4]:
df.head()

| | text | vegan_label | subred_name | author | upvotes | num_comments | post_id | word_count |
|---|---|---|---|---|---|---|---|---|
| 0 | regan russell animal rights activist killed st... | 1 | r/vegan | nekkototoro | 5371 | 583 | hca93z | 19 |
| 1 | vegan hacktivists looking developers ui design... | 1 | r/vegan | veganactivismbot | 70 | 0 | f3svif | 184 |
| 2 | last words fellow vegan elijah mcclain murdere... | 1 | r/vegan | VenmoMeFiveBucks | 1430 | 147 | hf6eej | 13 |
| 3 | 30 pounds vegan since may 3 never felt better | 1 | r/vegan | CoyoteaParty | 1195 | 64 | hf55ez | 9 |
| 4 | mink fur farms shut netherlands due covid 19 o... | 1 | r/vegan | PlantPoweredAdam | 2515 | 84 | hezfns | 9 |


In [5]:
df.isnull().sum()
#no null values

text            0
vegan_label     0
subred_name     0
author          0
upvotes         0
num_comments    0
post_id         0
word_count      0
dtype: int64

### Baseline Accuracy

In [6]:
#let's calculate a baseline score so that we know if 
#our model is outperforming our null model

df['vegan_label'].value_counts(normalize=True)

1    0.584699
0    0.415301
Name: vegan_label, dtype: float64

#### Interpretation of baseline accuracy
The baseline accuracy means that if we predict 1 (vegan) for all posts, we would be right about 58.5% of the time.
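
As a minimal sketch of what the baseline represents, the snippet below computes the majority-class accuracy by hand. The class counts (963 vegan, 684 keto) are not printed in the notebook; they are inferred here from the 1,647 rows and the proportions shown above, so treat them as an assumption.

```python
from collections import Counter

# Assumed class counts, inferred from df.shape (1647 rows) and the
# value_counts(normalize=True) output; not printed in the notebook itself.
labels = [1] * 963 + [0] * 684

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A null model that always predicts the majority class is right this often.
baseline_accuracy = majority_count / len(labels)

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should beat this number on held-out data.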

Now let's prepare for modelling!

### Model Prep

#### Model Prep: Create feature matrix (`X`) and target vector (`y`)

In [7]:
X=df['text']
y=df['vegan_label']

#### Model Prep: Train/test split

In [8]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,stratify=y)
#here we stratify due to the unbalanced classes in the dataset
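
To illustrate what stratifying does, here is a simplified pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the function name `stratified_split`, the seed, and the class counts (963 vegan, 684 keto, inferred from the proportions above) are assumptions made for the example.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # sample each class separately
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Assumed class counts, inferred from the notebook's proportions.
labels = [1] * 963 + [0] * 684
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
vegan_share_test = test_counts[1] / len(test_idx)
print(test_counts, round(vegan_share_test, 3))
```

The vegan share in the test set stays close to the overall 58.5%, which is the property `stratify=y` is meant to give us.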

#### Model Prep: Steps for Pipeline

In [9]:
#here we prepare the steps for pipeline
cvec_lr = [('cvec', CountVectorizer()),('lr', LogisticRegression())]
tvec_lr = [('tvec',TfidfVectorizer()),('lr', LogisticRegression())]
cvec_nb = [('cvec', CountVectorizer()),('nb', MultinomialNB())]
tvec_nb = [('tvec',TfidfVectorizer()),('nb', MultinomialNB())]

### Model Selection

#### Choice of Model

In order to select a model, we will be running 2 vectorizers and 2 models. 
The vectorizers are Count Vectorizer and TF-IDF Vectorizer.
Count Vectorizer counts the number of times each word appears in a post. 
TF-IDF goes a step further and down-weights words that appear in many posts. 
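
The difference between the two vectorizers can be sketched with a toy corpus. This uses a simplified textbook form of TF-IDF (raw count times `log(N / df)`); scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ. The three example posts are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-tokenised posts (not taken from the dataset).
docs = [
    ["vegan", "tofu", "recipe"],
    ["keto", "bacon", "recipe"],
    ["vegan", "animal", "rights"],
]

def idf(term, docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    doc_freq = sum(term in doc for doc in docs)
    return math.log(len(docs) / doc_freq)

def tfidf(doc, docs):
    """Raw term count scaled by how rare the term is across the corpus."""
    return {term: count * idf(term, docs) for term, count in Counter(doc).items()}

counts = Counter(docs[0])       # what a count vectorizer would record
weights = tfidf(docs[0], docs)  # what a TF-IDF vectorizer would record

print(counts)
print(weights)
```

In the first post every word has a count of 1, but "tofu" (found in one post) receives a higher TF-IDF weight than "recipe" (found in two).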

The models we chose are Logistic Regression and Naive Bayes (MultiNomial). 

Logistic Regression estimates the relationship between our features and the target variable by estimating probabilities using a sigmoid function.
The Naive Bayes classifier is based on Bayes' theorem, which is in turn based on conditional probabilities. It predicts probabilities for each class, and the class with the highest probability is considered the most likely class. 
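
As a rough sketch of the sigmoid step in logistic regression, the snippet below turns a weighted sum of word counts into a probability. The weights and intercept are hypothetical values chosen only for illustration; they are not the fitted coefficients.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and intercept, for illustration only.
weights = {"animal": 0.5, "carbs": -0.8}
intercept = 0.0

def predict_proba_vegan(word_counts):
    """Probability of the positive (vegan) class for a bag of word counts."""
    z = intercept + sum(weights.get(word, 0.0) * n for word, n in word_counts.items())
    return sigmoid(z)

p_animal_post = predict_proba_vegan({"animal": 2})  # linear score z = 1.0
p_carbs_post = predict_proba_vegan({"carbs": 2})    # linear score z = -1.6

print(round(p_animal_post, 3), round(p_carbs_post, 3))
```

Positive weights push the probability above 0.5 (vegan) and negative weights push it below 0.5 (keto).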

We chose log reg because:
- Y is binary
- easy to interpret

We chose NB because:
- performs well with text classification
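- specifically the Multinomial variant because, with Count Vectorizer, the columns of X are all integer counts (in practice it also works with the fractional values produced by TF-IDF)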
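
For intuition, a multinomial Naive Bayes classifier can be sketched in a few lines. This is a simplified illustration, not scikit-learn's `MultinomialNB`; the four training posts and the helper names `fit_nb` and `predict_nb` are made up, and `alpha` plays the same smoothing role as the `nb__alpha` parameter tuned later.

```python
import math
from collections import Counter

# Hypothetical tokenised posts with labels (1 = vegan, 0 = keto).
train = [
    (["vegan", "tofu", "animal"], 1),
    (["vegan", "animal", "rights"], 1),
    (["keto", "carbs", "bacon"], 0),
    (["keto", "carbs", "weight"], 0),
]

def fit_nb(train, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {word for doc, _ in train for word in doc}
    class_docs = Counter(label for _, label in train)
    word_counts = {c: Counter() for c in class_docs}
    for doc, label in train:
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(train)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Pick the class with the highest posterior log score; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_lik = fit_nb(train)
pred_keto_like = predict_nb(["carbs", "weight"], log_prior, log_lik)
pred_vegan_like = predict_nb(["animal", "tofu"], log_prior, log_lik)
print(pred_keto_like, pred_vegan_like)
```

The smoothing term `alpha` keeps a word that never appears in one class from driving that class's probability to zero.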


### Model 1: CountVectorizer and Logistic Regression

#### Model Prep: Pipe_params

In [10]:
pipe_1_params = {
    'cvec__max_features': [1000, 1500, 2000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.8, 0.85, 0.9],
    'cvec__ngram_range': [(1,1), (1,2)],
    'lr__penalty': ['l1', 'l2'],
    'lr__solver' : ['liblinear'],
    'lr__C':[1,0.1,0.01]
}

#### Model Fit and Score: Train Data

In [11]:
%%time

#instantiate pipeline
pipe_1 = Pipeline(cvec_lr)

gs_1 = GridSearchCV(pipe_1,param_grid=pipe_1_params,cv=5)
gs_1.fit(X_train,y_train)

print(gs_1.best_score_)
print(gs_1.best_params_)

0.958997722095672
{'cvec__max_df': 0.8, 'cvec__max_features': 1500, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1), 'lr__C': 0.1, 'lr__penalty': 'l2', 'lr__solver': 'liblinear'}
CPU times: user 4min 39s, sys: 4.03 s, total: 4min 43s
Wall time: 4min 48s
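
The runtime above is easier to understand once the size of the grid is counted. The sketch below copies `pipe_1_params` from the cell above and counts the parameter combinations with `itertools.product`; it only counts fits and does not run any model.

```python
from itertools import product

# Same grid as pipe_1_params in the notebook.
pipe_1_params = {
    'cvec__max_features': [1000, 1500, 2000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.8, 0.85, 0.9],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'lr__penalty': ['l1', 'l2'],
    'lr__solver': ['liblinear'],
    'lr__C': [1, 0.1, 0.01],
}

# Every combination of one value per parameter is a candidate model.
n_combinations = len(list(product(*pipe_1_params.values())))

cv_folds = 5
# Each candidate is fitted once per fold; GridSearchCV then refits the
# best candidate once on the full training set (refit=True by default).
n_cv_fits = n_combinations * cv_folds

print(f"{n_combinations} combinations x {cv_folds} folds = {n_cv_fits} fits")
```

Adding one more value to any parameter list multiplies the work, which is why the grids here are kept small.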


### Model 2: Tfidf Vectorizer and Logistic Regression

#### Model Prep: Pipe_params

In [12]:
pipe_2_params = {
    'tvec__max_features': [2500, 3000, 3500],
    'tvec__max_df': [0.3, 0.5, 0.7],
    'tvec__sublinear_tf': [True, False],
    'tvec__ngram_range': [(1,1), (1,2)],
    'lr__penalty': ['l1', 'l2'],
    'lr__solver' : ['liblinear'],
    'lr__C':[1,0.1,0.01]
}

#### Model Fit and Score: Train Data

In [13]:
%%time
#instantiate pipeline
pipe_2 = Pipeline(tvec_lr)

gs_2 = GridSearchCV(pipe_2,param_grid=pipe_2_params,cv=5)
gs_2.fit(X_train,y_train)

print(gs_2.best_score_)
print(gs_2.best_params_)

0.9582384206529992
{'lr__C': 1, 'lr__penalty': 'l2', 'lr__solver': 'liblinear', 'tvec__max_df': 0.5, 'tvec__max_features': 2500, 'tvec__ngram_range': (1, 1), 'tvec__sublinear_tf': False}
CPU times: user 4min 39s, sys: 4.29 s, total: 4min 44s
Wall time: 4min 48s


### Model 3: CountVectorizer and Multinomial Naive Bayes

#### Model Prep: Pipe_params

In [14]:
pipe_3_params = {
    'cvec__max_features': [1000, 1500, 2000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.75, 0.8, 0.85],
    'cvec__ngram_range': [(1,1), (1,2)],
    'nb__alpha' : [1.0, 1.1, 1.2],
    'nb__fit_prior' : [True, False]
}


#### Model Fit and Score: Train Data

In [15]:
%%time

#instantiate pipeline
pipe_3 = Pipeline(cvec_nb)

gs_3 = GridSearchCV(pipe_3,param_grid=pipe_3_params,cv=5)
gs_3.fit(X_train,y_train)

print(gs_3.best_score_)
print(gs_3.best_params_)

0.9483675018982536
{'cvec__max_df': 0.75, 'cvec__max_features': 2000, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 1), 'nb__alpha': 1.0, 'nb__fit_prior': True}
CPU times: user 4min 31s, sys: 4.19 s, total: 4min 35s
Wall time: 4min 39s


### Model 4: Tfidf Vectorizer and Multinomial Naive Bayes

#### Model Prep: Pipe_params

In [16]:
pipe_4_params = {
    'tvec__max_features': [2500, 3000, 3500],
    'tvec__max_df': [0.3, 0.5, 0.7],
    'tvec__sublinear_tf': [True, False],
    'tvec__ngram_range': [(1,1), (1,2)],
    'nb__alpha' : [1.0, 1.1, 1.2],
    'nb__fit_prior' : [True, False]
}

#### Model Fit and Score: Train Data

In [17]:
%%time
#instantiate pipeline
pipe_4 = Pipeline(tvec_nb)

gs_4 = GridSearchCV(pipe_4,param_grid=pipe_4_params,cv=5)
gs_4.fit(X_train,y_train)

print(gs_4.best_score_)
print(gs_4.best_params_)

0.9453302961275627
{'nb__alpha': 1.2, 'nb__fit_prior': True, 'tvec__max_df': 0.5, 'tvec__max_features': 2500, 'tvec__ngram_range': (1, 2), 'tvec__sublinear_tf': False}
CPU times: user 4min 16s, sys: 3.33 s, total: 4min 20s
Wall time: 4min 22s


#### Scores recap:

In [18]:
print(f'CVec and LR Score: {gs_1.best_score_}')
print(f'TFIDFVec and LR Score: {gs_2.best_score_}')
print(f'CVec and NB Score: {gs_3.best_score_}')
print(f'TFIDFVec and NB Score: {gs_4.best_score_}')

CVec and LR Score: 0.958997722095672
TFIDFVec and LR Score: 0.9582384206529992
CVec and NB Score: 0.9483675018982536
TFIDFVec and NB Score: 0.9453302961275627


##### Observation on train scores:
Overall, all our scores are pretty similar and way above our baseline score of 0.585. The default score used by gridsearch is the accuracy score. Our cross-validated accuracy scores range from 0.945 to 0.959. This means that our models are doing a pretty great job of classifying text from the 2 subreddits. 

##### Model Selection:
Based on the cross validation scores from all 4 models, we will proceed with the Count Vectorizer and Logistic Regression combination, which gave us the highest mean cross-validated accuracy on the train data, 0.959. 

### Modeling Test Data

In [34]:
gs_1.score(X_test,y_test)

0.9636363636363636

##### Observation of test scores
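Our test accuracy of 0.964 is slightly higher than the cross-validated train score of 0.959. The gap is small, which suggests that the model is not overfitting and generalises well to unseen posts. 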


#### Interpretation of coefficients

In [21]:
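#note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
wd_features = gs_1.best_estimator_.named_steps['cvec'].get_feature_names()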
wd_coef = gs_1.best_estimator_.named_steps['lr'].coef_[0]

coef_df = pd.DataFrame(zip(wd_features,wd_coef),columns=['features','coef']).sort_values('coef',ascending=False)

print(coef_df.head(2))
print(coef_df.tail(2))

     features      coef
1404    vegan  1.326415
97     animal  0.433730
    features      coef
242    carbs -0.766565
721     keto -1.775069


A positive coefficient points towards the vegan class while a negative coefficient points to the keto class. 

Besides the obvious words of vegan and keto, we can interpret the coefficient as follows:
- animal -> Each additional occurrence of "animal" multiplies the odds of the post being classified as a vegan post by e^0.434 = 1.54
- carbs -> Each additional occurrence of "carbs" multiplies the odds of the post being classified as a keto post by e^0.767 = 2.15
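
The arithmetic behind this interpretation can be checked with a short sketch. It exponentiates the four coefficients printed above to get odds ratios for the vegan class; a ratio below 1 means the word lowers the odds of vegan, so its reciprocal is the multiplier on the odds of keto.

```python
import math

# Coefficients copied from the coef_df output above.
coefs = {
    "vegan": 1.326415,
    "animal": 0.433730,
    "carbs": -0.766565,
    "keto": -1.775069,
}

# exp(coef) is the factor applied to the odds of the vegan class
# for each additional occurrence of the word.
odds_ratios = {word: math.exp(coef) for word, coef in coefs.items()}

for word, ratio in odds_ratios.items():
    if ratio >= 1:
        print(f"{word}: odds of vegan x {ratio:.2f}")
    else:
        print(f"{word}: odds of keto x {1 / ratio:.2f}")
```

Note that these factors apply to the odds, not directly to the probability, and that the coefficients come from a regularised model.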

#### Generate Predictions

In [36]:
preds = gs_1.predict(X_test)

### Classification Metrics

#### Confusion Matrix

In [37]:
tn, fp, fn, tp = confusion_matrix(y_test,preds).ravel()
print('True Negative:' ,tn)
print('False Positive (Type I):' ,fp)
print('False Negative (Type II):' ,fn)
print('True Positive:' ,tp)

True Negative: 125
False Positive (Type I): 12
False Negative (Type II): 0
True Positive: 193


In [38]:
#Precision

prec = tp/(tp+fp)

print(f'Precision: {prec}')

Precision: 0.9414634146341463


#### Precision
Precision focuses on type I errors (false positives). Here, a false positive means that the model classifies a post as a vegan post when it is actually a keto post.

Our precision score is 0.94, which means that 94% of the posts the model labelled as vegan really are vegan posts, so it rarely mislabels a keto post as a vegan post. 

The shortcoming of this metric is that it does not account for false negatives, i.e. vegan posts classified as keto posts.

In [39]:
#Recall/Sensitivity

sens = tp/(tp+fn)

print(f'Sensitivity: {sens}')

Sensitivity: 1.0


#### Recall/Sensitivity
Recall focuses on type II errors (false negatives). Here, a type II error means that the model classifies a post as a keto post when it is actually a vegan post.

Our recall score is 1.0, which means that our model never mislabelled a vegan post as a keto post in the test set. 

Conversely, the shortcoming of this metric is that it does not account for false positives. 

In [40]:
#let's take a look at the ROC AUC score, yet another metric
roc_auc_score(y_test, preds)

0.9562043795620438

#### ROC AUC Score
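Our high ROC AUC score of 0.96 (close to 1) indicates that we have a good separation between our vegan and keto classes. Note that the score above was computed from hard class predictions (`preds`) rather than predicted probabilities, so it equals the average of sensitivity and specificity. 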
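
To see where the 0.956 comes from, the sketch below rebuilds the test labels and hard predictions from the confusion matrix and computes the AUC with a small pairwise helper. The helper `roc_auc` is a plain-Python stand-in written for this example, not scikit-learn's function.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Confusion matrix values from the notebook output.
tn, fp, fn, tp = 125, 12, 0, 193

# Rebuild labels and hard predictions that match those counts.
y_true = [0] * (tn + fp) + [1] * (fn + tp)
hard_preds = [0] * tn + [1] * fp + [0] * fn + [1] * tp

auc_from_labels = roc_auc(y_true, hard_preds)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2

print(round(auc_from_labels, 4), round(balanced_accuracy, 4))
```

Both values agree, which shows that an AUC computed from 0/1 predictions carries no ranking information beyond sensitivity and specificity. Passing the positive-class probabilities (for example the second column of `gs_1.predict_proba(X_test)`) to `roc_auc_score` would be the more usual way to measure separation.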

In [41]:
print(classification_report(y_test,preds))

             precision    recall  f1-score   support

          0       1.00      0.91      0.95       137
          1       0.94      1.00      0.97       193

avg / total       0.97      0.96      0.96       330



#### F1 Score
The F1 score shows the balance between precision and recall. It is calculated as 2(precision*recall)/(precision+recall).
Our high weighted-average F1 score of 0.96 shows that we have a good balance of both of the above. 

Ultimately, we measure our success on the F1 score, which generally works well with unbalanced classes; ours are split 58/42. 
Also, since we don't particularly value avoiding false positives over false negatives (or vice versa), accuracy and F1 scores should be sufficient to measure our success. 
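
As a quick check of the formula, the sketch below recomputes the per-class and support-weighted F1 scores from the confusion matrix values printed earlier. The helper name `f1` is just a label used in this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * (precision * recall) / (precision + recall)

# Confusion matrix values from the notebook output (positive class = vegan).
tn, fp, fn, tp = 125, 12, 0, 193

# Vegan (class 1) as the positive class.
f1_vegan = f1(tp / (tp + fp), tp / (tp + fn))

# Keto (class 0) as the positive class: roles of the cells are swapped.
f1_keto = f1(tn / (tn + fn), tn / (tn + fp))

# Weight each class by its support, as in the report's average row.
support_keto, support_vegan = tn + fp, tp + fn
f1_weighted = (f1_keto * support_keto + f1_vegan * support_vegan) / (
    support_keto + support_vegan
)

print(round(f1_keto, 2), round(f1_vegan, 2), round(f1_weighted, 2))
```

The rounded values match the 0.95, 0.97 and 0.96 shown in the classification report.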

### Recommendations

Our recommendations are: 
- further develop the classifier to continuously improve our core product of generating data analytics  

With this proof of concept, we can invest further resources to develop the model to eventually classify all available data, to contribute to our core product of nutrition data related services.  

- vegan insights: focus on educating the target market

With vegan subredditors, we noticed that the users were highly discerning. After all, vegans have made a life-changing commitment to a lifestyle that avoids meat and animal products. They are concerned with making the world a better place. In order to rally these users, frequently post about the science behind veganism, share insights on how they can help reduce animal cruelty, and keep them updated on the change that veganism is bringing.

- keto insights: focus on building a strong community

With keto subredditors, a common goal/motivation is weight loss. Build a community with these people, share progress, recipes and tips. Build a platform where people can freely share their thoughts and concerns. 

### Conclusion
There were 2 key questions that we set out to answer in our problem statement:
1. Can we build a classifier, with at least 90% accuracy, to sort text data from Keto and Vegan subreddits?
2. Can we draw insights from these text data for targeted marketing?

After data cleaning, feature engineering and EDA, we ran our text data through four vectorizer-model combinations. We tuned the hyperparameters to generate the best scores. From there, we scored them with 5-fold cross-validation on the training data, using the accuracy score. We chose the model with the highest accuracy score and scored our optimal model with the test data. Finally, our classifier achieved a test score of 0.964, which means that it correctly classified text data 96.4% of the time.

We also drew insights from the EDA and profiled our target market. We realised differences in motivations and preferences between the 2 groups. Our recommendations are tailored to each group with a focus on educating for the vegans and building a strong community for the keto crowd. 

With these, we will be able to provide better insights to our clients to help with their marketing campaigns. 

#### Next steps

To further improve our model, we could include text from the comment sections, which was not available via the API. We could perhaps explore using PRAW to scrape comments. 

We could also run sentiment analysis on the data to see which posts and comments were favoured by different users of the subreddits. 