## Binary Classification Modeling - r/biology & r/biochemistry Predicting with Boosting

In [42]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import recall_score, f1_score, precision_score
from sklearn.naive_bayes import MultinomialNB

#lemmatizer
import nltk
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
#tokenizer used by the LemmaTokenizer class below
from nltk.tokenize import word_tokenize
# Import stemmer.
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [43]:
#read in the submissions csv file
#get rid of unnamed:0 column with index_col
submissions = pd.read_csv('datasets/cleaned-submission.csv',index_col=0)
warnings.filterwarnings("ignore", category=UserWarning)

#remove all words that apply to the target variable -- biology,bio,biochem,biochemistry
submissions['selftext'].replace('biology','',regex=True,inplace=True)
submissions['selftext'].replace('biochemistry','',regex=True,inplace=True)
submissions['selftext'].replace('chemistry','',regex=True,inplace=True)
submissions['selftext'].replace('biochem','',regex=True,inplace=True)
submissions['selftext'].replace('bio','',regex=True,inplace=True)
submissions['selftext'].replace('chem','',regex=True,inplace=True)
#drop the null rows
submissions.dropna(how='any',axis=0,inplace=True)
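
The six chained `replace` calls above can also be written as a single pattern. The sketch below is an alternative, not the code that produced the results in this notebook; `TARGET_PATTERN` and `strip_target_words` are hypothetical names introduced for illustration. It uses only the standard-library `re` module, lists the longer words first so they win over their own prefixes, and removes matches anywhere inside a word (so "antibiotic" becomes "antitic"), just as the substring replacements above do.

```python
import re

# Longest alternatives first so 'biochemistry' is tried before 'bio' or 'chem'.
# Matching is case-sensitive, like the replace calls in the cell above.
TARGET_PATTERN = re.compile(r"biochemistry|chemistry|biochem|biology|bio|chem")

def strip_target_words(text):
    """Remove every target-related substring from a single document."""
    return TARGET_PATTERN.sub("", text)

sample = "biology and biochemistry of chem lab"
cleaned = strip_target_words(sample)
print(repr(cleaned))
```

Assuming the column contains strings and nulls, this could be applied with `submissions['selftext'].map(strip_target_words, na_action='ignore')` and assigned back to the column, which avoids calling `replace(..., inplace=True)` on a column selection.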

## Functions and Classes

In [44]:
#step over collection of tokens and try to lemmatize each of them
#to use in countvectorizer we pass the new class as the tokenizer
class LemmaTokenizer:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
    def __call__(self,doc):
        return [self.lemmatizer.lemmatize(t) for t in word_tokenize(doc)]

## Stopwords

In [45]:
#longer list of stop words
#taken from the open source work found here: https://gist.github.com/sebleier/554280
#txt found here: https://gist.githubusercontent.com/ZohebAbai/513218c3468130eacff6481f424e4e64/raw/b70776f341a148293ff277afa0d0302c8c38f7e2/gist_stopwords.txt

stop_word = pd.read_csv('datasets/stopwords.csv',index_col=0)
stop_word = list(stop_word['stopwords'])

#remove punctuation from the stop words, as it has already been done in cleaning the text
stop_word = [word.replace("'",'') for word in stop_word]

## Boosting modeling with Selftext

In [46]:
#setting up X and y values for modeling
X = submissions['selftext']
y = np.where(submissions['subreddit']=='Biochemistry',1,0)

In [47]:
#train/test split the data
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify=y,random_state = 18)

## Null Model

In [48]:
#to get the baseline accuracy of the model
#based on the most frequent value in the training data
#biochemistry = 1, biology = 0

biochem_num = y_train.sum()
biology_num = len(y_train)-biochem_num

if biology_num < biochem_num:
    baseline_accur = round(biochem_num/len(y_train),4)
    print(f'The most frequent class is r/biochemistry. The accuracy of the null model is {baseline_accur}.')
    
else:
    baseline_accuracy = round((biology_num)/len(y_train),4)
    print(f'The most frequent class is r/biology. The accuracy of the null model is {baseline_accuracy}.')

The most frequent class is r/biochemistry. The accuracy of the null model is 0.5686.


#### Baseline/Null Model Explained:

The baseline model gives us a starting point against which to compare the performance of later models. In binary classification, a customary baseline/null model is one that always guesses the most frequently occurring class in the training set. The code above prints the accuracy of that null model. Because the random state of the train/test split is fixed, this accuracy should not change when the kernel is restarted. The most frequent class is r/biochemistry, so guessing r/biochemistry for every observation in the training set yields an accuracy of 56.86%.
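
As a cross-check on the hand-rolled calculation, the same null model can be sketched with scikit-learn's `DummyClassifier`. The labels below are toy values made up for illustration (they are not the notebook's data), and `X_toy`, `y_toy`, and `null_model` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (an assumption for illustration): 1 = r/biochemistry, 0 = r/biology.
y_toy = np.array([1, 1, 1, 1, 0, 0, 0])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_toy, y_toy)

# Accuracy equals the share of the majority class (4 of 7 here).
baseline = null_model.score(X_toy, y_toy)
print(round(baseline, 4))
```

In this notebook the equivalent call would fit on `y_train`, and scoring on `y_test` would give the null accuracy on held-out data.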

Let's see if we can't improve on this score with some NLP modeling!

## AdaBoosting with CountVectorizer

In [8]:
#create adaboost model in pipeline
cvect = CountVectorizer(stop_words=stop_word)
pipe_adacv = make_pipeline(cvect,StandardScaler(with_mean=False),
                           AdaBoostClassifier(base_estimator=MultinomialNB(alpha=250)))

#make parameter grid
param_adacv = {
    'countvectorizer__max_features':[2_500,3_500],
    'countvectorizer__ngram_range':[(1,1),(1,2)],
    'adaboostclassifier__n_estimators': [1_500,2_500],
    'adaboostclassifier__learning_rate':[0.1,0.25]
}

In [9]:
#make gridsearch instance
grid_adacv = GridSearchCV(pipe_adacv,param_grid = param_adacv,n_jobs=-1)


grid_adacv.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('countvectorizer',
                                        CountVectorizer(stop_words=['0o', '0s',
                                                                    '3a', '3b',
                                                                    '3d', '6b',
                                                                    '6o', 'a',
                                                                    'a1', 'a2',
                                                                    'a3', 'a4',
                                                                    'ab',
                                                                    'able',
                                                                    'about',
                                                                    'above',
                                                                    'abst',
                                                                    'ac',
        

In [10]:
#the accuracy on the training data
grid_adacv.score(X_train,y_train)

0.8178493050475494

In [11]:
#score on accuracy and recall
print(f'The accuracy of the AdaBoost model is {round(grid_adacv.score(X_test,y_test),4)}.')
print(f'The recall of the AdaBoost model is {round(recall_score(y_test,grid_adacv.predict(X_test)),4)}.')
print(f'The f1 score of the AdaBoost model is {round(f1_score(y_test,grid_adacv.predict(X_test)),4)}.')
print(f'The precision score of the AdaBoost model is {round(precision_score(y_test,grid_adacv.predict(X_test)),4)}.')

The accuracy of the AdaBoost model is 0.7279.
The recall of the AdaBoost model is 0.8707.
The f1 score of the AdaBoost model is 0.7845.
The precision score of the AdaBoost model is 0.7138.


In [12]:
#get results as a dataframe
pd.DataFrame(grid_adacv.cv_results_).sort_values(by='rank_test_score').head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_adaboostclassifier__learning_rate,param_adaboostclassifier__n_estimators,param_countvectorizer__max_features,param_countvectorizer__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
6,12.931333,0.116155,1.547007,0.033536,0.1,2500,3500,"(1, 1)","{'adaboostclassifier__learning_rate': 0.1, 'ad...",0.734918,0.730043,0.735366,0.745732,0.737195,0.736651,0.00512,1
7,14.026517,0.087846,1.635907,0.0186,0.1,2500,3500,"(1, 2)","{'adaboostclassifier__learning_rate': 0.1, 'ad...",0.736137,0.727605,0.735366,0.745732,0.734146,0.735797,0.005811,2
11,9.136817,0.070082,1.038945,0.008278,0.25,1500,3500,"(1, 2)","{'adaboostclassifier__learning_rate': 0.25, 'a...",0.734918,0.730652,0.729878,0.742073,0.735976,0.734699,0.004374,3
10,8.005079,0.097236,0.986915,0.005667,0.25,1500,3500,"(1, 1)","{'adaboostclassifier__learning_rate': 0.25, 'a...",0.729433,0.728215,0.731707,0.747561,0.729878,0.733359,0.007189,4
2,7.876399,0.038841,0.971102,0.005645,0.1,1500,3500,"(1, 1)","{'adaboostclassifier__learning_rate': 0.1, 'ad...",0.729433,0.73248,0.731098,0.739634,0.734146,0.733358,0.003501,5


#### Model Interpretation

This model applies AdaBoost to the best CountVectorizer model found so far, Multinomial Naive Bayes. The AdaBoost algorithm takes in a weak learner and iteratively increases the weight of the observations that earlier learners misclassified, so that subsequent learners focus on those errors. This model performs close to the AdaBoost model with TfidfVectorization: on the test data its recall is essentially the same (0.8707 versus 0.8720), while its accuracy is about 1.6 percentage points lower (0.7279 versus 0.7436).
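
To make the reweighting idea concrete, here is a minimal sketch of one round of the textbook discrete AdaBoost update for two classes. It is an illustration under stated assumptions, not scikit-learn's internal code; `adaboost_reweight` and the toy labels are hypothetical.

```python
import math

def adaboost_reweight(weights, y_true, y_pred, learning_rate=1.0):
    """One discrete AdaBoost weight update for a binary problem.

    Returns the normalized sample weights and the learner's vote weight.
    Assumes the weighted error is strictly between 0 and 1.
    """
    missed = [int(t != p) for t, p in zip(y_true, y_pred)]
    total = sum(weights)
    # Weighted error rate of the current weak learner.
    error = sum(w for w, m in zip(weights, missed) if m) / total
    # Vote weight: larger when the learner makes fewer weighted mistakes.
    alpha = learning_rate * math.log((1.0 - error) / error)
    # Scale up only the misclassified samples, then renormalize.
    raised = [w * math.exp(alpha * m) for w, m in zip(weights, missed)]
    norm = sum(raised)
    return [w / norm for w in raised], alpha

weights = [0.2] * 5
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0]  # only the last sample is misclassified
new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights], round(alpha, 3))
```

With one of five equally weighted samples misclassified, that sample's weight rises from 0.2 to 0.5, so the next weak learner is pushed to classify it correctly.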

## AdaBoosting with TfidfVectorizer

In [13]:
#create adaboost model in pipeline
tfidf_vect = TfidfVectorizer(stop_words=stop_word)
pipe_ada_tfidf = make_pipeline(tfidf_vect,StandardScaler(with_mean=False),
                               AdaBoostClassifier(base_estimator=MultinomialNB(alpha=630)))

#make parameter grid
param_ada_tfidf = {
    'tfidfvectorizer__max_features':[2_700,2_800,2_900],
    'tfidfvectorizer__ngram_range':[(1,1),(1,2)],
    'adaboostclassifier__n_estimators': [150,500],
    'adaboostclassifier__learning_rate':[0.9,0.95]
}

In [14]:
#make gridsearch instance
grid_ada_tfidf = GridSearchCV(pipe_ada_tfidf,param_grid = param_ada_tfidf,n_jobs=-1)


grid_ada_tfidf.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('tfidfvectorizer',
                                        TfidfVectorizer(stop_words=['0o', '0s',
                                                                    '3a', '3b',
                                                                    '3d', '6b',
                                                                    '6o', 'a',
                                                                    'a1', 'a2',
                                                                    'a3', 'a4',
                                                                    'ab',
                                                                    'able',
                                                                    'about',
                                                                    'above',
                                                                    'abst',
                                                                    'ac',
        

In [15]:
#score on the training data
grid_ada_tfidf.score(X_train,y_train)

0.8269934162399415

In [16]:
#score on accuracy and recall
print(f'The accuracy of the AdaBoost model is {round(grid_ada_tfidf.score(X_test,y_test),4)}.')
print(f'The recall of the AdaBoost model is {round(recall_score(y_test,grid_ada_tfidf.predict(X_test)),4)}.')
print(f'The f1 score of the AdaBoost model is {round(f1_score(y_test,grid_ada_tfidf.predict(X_test)),4)}.')
print(f'The precision score of the AdaBoost model is {round(precision_score(y_test,grid_ada_tfidf.predict(X_test)),4)}.')

The accuracy of the AdaBoost model is 0.7436.
The recall of the AdaBoost model is 0.872.
The f1 score of the AdaBoost model is 0.7946.
The precision score of the AdaBoost model is 0.7298.


In [17]:
#casting the results as a dataframe
pd.DataFrame(grid_ada_tfidf.cv_results_).sort_values(by='rank_test_score').head(3)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_adaboostclassifier__learning_rate,param_adaboostclassifier__n_estimators,param_tfidfvectorizer__max_features,param_tfidfvectorizer__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
22,2.93834,0.043447,0.358331,0.037832,0.95,500,2900,"(1, 1)","{'adaboostclassifier__learning_rate': 0.95, 'a...",0.741012,0.750762,0.746951,0.753659,0.743902,0.747257,0.004552,1
10,2.900059,0.024293,0.389555,0.009692,0.9,500,2900,"(1, 1)","{'adaboostclassifier__learning_rate': 0.9, 'ad...",0.74223,0.750762,0.745732,0.756098,0.740854,0.747135,0.005638,2
9,3.956627,0.038928,0.442407,0.006728,0.9,500,2800,"(1, 2)","{'adaboostclassifier__learning_rate': 0.9, 'ad...",0.74223,0.756246,0.742683,0.74939,0.738415,0.745793,0.006309,3


#### Model Interpretation:

AdaBoosting the best model discovered for the evaluation metrics accuracy and recall boosted each of those scores by approximately 1%. This AdaBoost model is the best model learned so far: it has the highest accuracy and recall pairing of all the models. That fits the two evaluation metrics I wanted to balance for my data science question, so this is the model I will choose as the final production model for the NLP classification. The model does show some overfitting (training accuracy of about 0.827 versus 0.744 on the test data), which may skew coefficient values and limit how well it generalizes.
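
One simple way to quantify the overfitting mentioned here is the gap between training and test accuracy. The sketch below uses the scores printed in the cells above, rounded to four decimals; `generalization_gap` is a hypothetical helper, and the gap is only a rough indicator rather than a formal test.

```python
# Accuracy values copied from the cell outputs above (rounded to 4 decimals).
scores = {
    "AdaBoost + CountVectorizer": {"train": 0.8178, "test": 0.7279},
    "AdaBoost + TfidfVectorizer": {"train": 0.8270, "test": 0.7436},
}

def generalization_gap(train_score, test_score):
    """Training accuracy minus test accuracy; larger means more overfitting."""
    return round(train_score - test_score, 4)

gaps = {name: generalization_gap(s["train"], s["test"]) for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap}")
```

Both AdaBoost models lose roughly 8 to 9 percentage points of accuracy between training and test data.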

## GradientBoost with CountVectorization

In [18]:
#create gradient boosting model in pipeline
pipe_gb_cv = make_pipeline(cvect,StandardScaler(with_mean=False),GradientBoostingClassifier())

#make parameter grid
param_gb_cv = {
    'countvectorizer__max_features':[3_750,3_500,4_000],
    'countvectorizer__ngram_range':[(1,1),(1,2)],
    'gradientboostingclassifier__n_estimators': [250,325],
    'gradientboostingclassifier__learning_rate':[0.25,0.3],
    'gradientboostingclassifier__min_samples_split': [5,6],
    'gradientboostingclassifier__max_depth':[6,7,5]
}

In [19]:
#instantiate gridsearch
grid_gb_cv = GridSearchCV(pipe_gb_cv,param_gb_cv,n_jobs=-1)

#train on the training data
grid_gb_cv.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('countvectorizer',
                                        CountVectorizer(stop_words=['0o', '0s',
                                                                    '3a', '3b',
                                                                    '3d', '6b',
                                                                    '6o', 'a',
                                                                    'a1', 'a2',
                                                                    'a3', 'a4',
                                                                    'ab',
                                                                    'able',
                                                                    'about',
                                                                    'above',
                                                                    'abst',
                                                                    'ac',
        

In [20]:
#score on the training data
grid_gb_cv.score(X_train,y_train)

0.9642770056083882

In [21]:
#score on accuracy and recall
print(f'The accuracy of the Gradient Boost model is {round(grid_gb_cv.score(X_test,y_test),4)}.')
print(f'The recall of the Gradient Boost model is {round(recall_score(y_test,grid_gb_cv.predict(X_test)),4)}.')
print(f'The f1 score of the Gradient Boost model is {round(f1_score(y_test,grid_gb_cv.predict(X_test)),4)}.')
print(f'The precision score of the Gradient Boost model is {round(precision_score(y_test,grid_gb_cv.predict(X_test)),4)}.')

The accuracy of the Gradient Boost model is 0.7516.
The recall of the Gradient Boost model is 0.8026.
The f1 score of the Gradient Boost model is 0.7861.
The precision score of the Gradient Boost model is 0.7704.


In [22]:
#the best parameters of the gradient boost
grid_gb_cv.best_params_

{'countvectorizer__max_features': 3750,
 'countvectorizer__ngram_range': (1, 2),
 'gradientboostingclassifier__learning_rate': 0.3,
 'gradientboostingclassifier__max_depth': 7,
 'gradientboostingclassifier__min_samples_split': 5,
 'gradientboostingclassifier__n_estimators': 325}

In [23]:
#get results of the grid cross validations as a dataframe
pd.DataFrame(grid_gb_cv.cv_results_).sort_values(by='rank_test_score').head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_countvectorizer__max_features,param_countvectorizer__ngram_range,param_gradientboostingclassifier__learning_rate,param_gradientboostingclassifier__max_depth,param_gradientboostingclassifier__min_samples_split,param_gradientboostingclassifier__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
41,11.467868,0.051109,0.168353,0.001471,3750,"(1, 2)",0.3,7,5,325,"{'countvectorizer__max_features': 3750, 'count...",0.74284,0.744059,0.742073,0.745732,0.74878,0.744697,0.002387,1
55,10.16548,0.032112,0.120109,0.003798,3500,"(1, 1)",0.25,7,6,325,"{'countvectorizer__max_features': 3500, 'count...",0.738574,0.737355,0.745732,0.747561,0.75,0.743844,0.005003,2
27,9.962712,0.071769,0.164252,0.001687,3750,"(1, 2)",0.25,6,6,325,"{'countvectorizer__max_features': 3750, 'count...",0.738574,0.746496,0.740854,0.75122,0.737805,0.74299,0.005118,3
19,10.309314,0.046702,0.11961,0.00261,3750,"(1, 1)",0.3,7,6,325,"{'countvectorizer__max_features': 3750, 'count...",0.731261,0.744668,0.741463,0.753049,0.743902,0.742869,0.006995,4
75,9.725709,0.059658,0.163148,0.0021,3500,"(1, 2)",0.25,6,6,325,"{'countvectorizer__max_features': 3500, 'count...",0.741012,0.741621,0.743293,0.742073,0.744512,0.742502,0.001253,5


#### Model Interpretation

The GradientBoost model with CountVectorization has a recall about 7 percentage points lower than the AdaBoost model with CountVectorization (0.8026 versus 0.8707). This means that the Gradient Boost model was not identifying true positives as well and was classifying more r/biochemistry posts as r/biology posts. This Gradient Boost model also shows high overfitting (training accuracy of 0.964 versus 0.752 on the test data), so its feature importances are not trustworthy and it is unlikely to generalize well.
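
The link between recall and misclassified r/biochemistry posts can be spelled out with the standard metric definitions. The confusion counts below are hypothetical, chosen only for illustration; the final lines recompute F1 from the precision and recall printed for this Gradient Boost model.

```python
def recall(tp, fn):
    """Share of actual positives (r/biochemistry posts) that were found."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of predicted positives that really were positives."""
    return tp / (tp + fp)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts: 100 true r/biochemistry posts, 20 of them
# misclassified as r/biology (false negatives), plus 25 false positives.
tp, fn, fp = 80, 20, 25
r = recall(tp, fn)
p = precision(tp, fp)
print(round(r, 4), round(p, 4), round(f1(p, r), 4))

# Recompute F1 from the precision and recall printed above for this model.
f1_from_output = f1(0.7704, 0.8026)
print(round(f1_from_output, 4))
```

Every additional r/biochemistry post labelled as r/biology adds a false negative and lowers recall. The recomputed F1 agrees with the printed value of 0.7861 to within the rounding of its inputs.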

## Gradient Boost with TfidfVectorization

In [24]:
#create gradient boosting model in pipeline
tfidf_vect = TfidfVectorizer(stop_words=stop_word)
pipe_gb_tfidf = make_pipeline(tfidf_vect,StandardScaler(with_mean=False),
                               GradientBoostingClassifier())

#make parameter grid
param_gb_tfidf = {
    'tfidfvectorizer__max_features':[500,1_000,2_000],
    'tfidfvectorizer__ngram_range':[(1,1),(1,2)],
    'gradientboostingclassifier__n_estimators': [150,300],
    'gradientboostingclassifier__learning_rate':[0.25,0.5,1],
    'gradientboostingclassifier__min_samples_split': [3,5],
    'gradientboostingclassifier__max_depth':[5,7]
}

In [25]:
#instantiate gridsearch
grid_gb_tfidf = GridSearchCV(pipe_gb_tfidf,param_gb_tfidf,n_jobs=-1)

#train on the training data
grid_gb_tfidf.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('tfidfvectorizer',
                                        TfidfVectorizer(stop_words=['0o', '0s',
                                                                    '3a', '3b',
                                                                    '3d', '6b',
                                                                    '6o', 'a',
                                                                    'a1', 'a2',
                                                                    'a3', 'a4',
                                                                    'ab',
                                                                    'able',
                                                                    'about',
                                                                    'above',
                                                                    'abst',
                                                                    'ac',
        

In [26]:
#score on the training data
grid_gb_tfidf.score(X_train,y_train)

0.96525237746891

In [27]:
#score on accuracy and recall
print(f'The accuracy of the Gradient Boost model is {round(grid_gb_tfidf.score(X_test,y_test),4)}.')
print(f'The recall of the Gradient Boost model is {round(recall_score(y_test,grid_gb_tfidf.predict(X_test)),4)}.')
print(f'The f1 score of the Gradient Boost model is {round(f1_score(y_test,grid_gb_tfidf.predict(X_test)),4)}.')
print(f'The precision score of the Gradient Boost model is {round(precision_score(y_test,grid_gb_tfidf.predict(X_test)),4)}.')

The accuracy of the Gradient Boost model is 0.7352.
The recall of the Gradient Boost model is 0.7987.
The f1 score of the Gradient Boost model is 0.7743.
The precision score of the Gradient Boost model is 0.7514.


In [28]:
#best parameters for the gridsearch model
grid_gb_tfidf.best_params_

{'gradientboostingclassifier__learning_rate': 0.25,
 'gradientboostingclassifier__max_depth': 5,
 'gradientboostingclassifier__min_samples_split': 5,
 'gradientboostingclassifier__n_estimators': 300,
 'tfidfvectorizer__max_features': 2000,
 'tfidfvectorizer__ngram_range': (1, 2)}

In [29]:
#casting the grid results as a dataframe
pd.DataFrame(grid_gb_tfidf.cv_results_).sort_values(by='rank_test_score').head(5)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_gradientboostingclassifier__learning_rate,param_gradientboostingclassifier__max_depth,param_gradientboostingclassifier__min_samples_split,param_gradientboostingclassifier__n_estimators,param_tfidfvectorizer__max_features,param_tfidfvectorizer__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
23,14.405022,0.04816,0.15334,0.004449,0.25,5,5,300,2000,"(1, 2)",{'gradientboostingclassifier__learning_rate': ...,0.719074,0.727605,0.736585,0.734756,0.734756,0.730555,0.006514,1
10,13.619877,0.074639,0.107298,0.001835,0.25,5,3,300,2000,"(1, 1)",{'gradientboostingclassifier__learning_rate': ...,0.714199,0.725168,0.739634,0.731707,0.737805,0.729703,0.009265,2
46,18.384431,0.04774,0.111301,0.001722,0.25,7,5,300,2000,"(1, 1)",{'gradientboostingclassifier__learning_rate': ...,0.729433,0.725777,0.721951,0.730488,0.733537,0.728237,0.004004,3
47,19.534681,0.066273,0.158344,0.005918,0.25,7,5,300,2000,"(1, 2)",{'gradientboostingclassifier__learning_rate': ...,0.716636,0.731871,0.732927,0.728049,0.731098,0.728116,0.005965,4
34,18.664363,0.038445,0.112302,0.002928,0.25,7,3,300,2000,"(1, 1)",{'gradientboostingclassifier__learning_rate': ...,0.716636,0.72273,0.732927,0.728049,0.736585,0.727385,0.007113,5


#### Model Interpretation:

Like the Gradient Boosting model with CountVectorization, the Gradient Boosting model with TfidfVectorization shows high overfitting to the training data (training accuracy of 0.965 versus 0.735 on the test data). Its recall is also about 7 percentage points lower than that of the AdaBoost model with TfidfVectorization (0.7987 versus 0.8720). As we are trying to optimize recall, this is not the model we want going forward.
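
The grid searches in this notebook select hyperparameters by the default accuracy score. Since recall is a metric of interest, one option is to track both metrics and refit on recall. The sketch below shows the idea on a tiny made-up corpus; the documents, labels, and parameter grid are assumptions for illustration, not this project's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (assumption): 1 = biochemistry-like, 0 = biology-like.
docs = [
    "enzyme kinetics and protein folding", "protein assay buffer question",
    "enzyme inhibitor binding constant", "buffer ph for protein purification",
    "bird migration and ecology", "plant cell ecology question",
    "bird species identification help", "ecology of plant species",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
grid = GridSearchCV(
    pipe,
    param_grid={"multinomialnb__alpha": [0.1, 1.0]},
    scoring=["accuracy", "recall"],  # record both metrics for every candidate
    refit="recall",                  # choose and refit the best model by recall
    cv=2,
)
grid.fit(docs, labels)

print(grid.best_params_)
print(sorted(k for k in grid.cv_results_ if k.startswith("mean_test_")))
```

With multiple metrics, `cv_results_` gains one set of columns per metric (for example `mean_test_recall`), so the accuracy and recall trade-off can be inspected in the same results dataframe.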

## AdaBoost with Combined Text, CountVectorization, Naive Bayes

#### Pre-Modeling Cleaning

In [49]:
#read in the combined text csv
combined = pd.read_csv('datasets/combined_title_self.csv',index_col=0)

In [50]:
#shows that there are null values in the datafile
combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22014 entries, 0 to 22013
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  22014 non-null  object
 1   text       21943 non-null  object
dtypes: object(2)
memory usage: 516.0+ KB


In [51]:
#drop the null values from the dataframe
combined.dropna(inplace=True)

In [52]:
#remove all words that apply to the target variable -- biology,bio,biochem,biochemistry
combined['text'].replace('biology','',regex=True,inplace=True)
combined['text'].replace('biochemistry','',regex=True,inplace=True)
combined['text'].replace('chemistry','',regex=True,inplace=True)
combined['text'].replace('biochem','',regex=True,inplace=True)
combined['text'].replace('bio','',regex=True,inplace=True)
combined['text'].replace('chem','',regex=True,inplace=True)

In [53]:
#setting up initial X and y values
#1 corresponds to biochemistry, 0 corresponds to biology
#need to make the str unicode with astype, emoji characters in the strings
#help with the unicode error found below:
#https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document
X_combo = combined['text']
y = np.where(combined['subreddit']=='Biochemistry',1,0)

In [54]:
#train/test split of the data
X_train,X_test,y_train,y_test = train_test_split(X_combo,y,stratify=y,random_state=18)

### Null Model

In [55]:
#to get the baseline accuracy of the model
#based on the most frequent value in the training data
#biochemistry = 1, biology = 0

biochem_num = y_train.sum()
biology_num = len(y_train)-biochem_num

if biology_num < biochem_num:
    baseline_accur = round(biochem_num/len(y_train),4)
    print(f'The most frequent class is r/biochemistry. The accuracy of the null model is {baseline_accur}.')
    
else:
    baseline_accuracy = round((biology_num)/len(y_train),4)
    print(f'The most frequent class is r/biology. The accuracy of the null model is {baseline_accuracy}.')

The most frequent class is r/biochemistry. The accuracy of the null model is 0.5673.


#### Null Model Explained:

The baseline model gives us a starting point against which to compare the performance of later models. In binary classification, a customary baseline/null model is one that always guesses the most frequently occurring class in the training set.

### Model One. Combined Text

In [36]:
#create adaboost model in pipeline
cvect = CountVectorizer(stop_words=stop_word)
pipe_adacv_title = make_pipeline(cvect,StandardScaler(with_mean=False),
                           AdaBoostClassifier(base_estimator=MultinomialNB(alpha=500)))

#make parameter grid
param_adacv_titl = {
    'countvectorizer__max_features':[6_000,6_250,6_500],
    'countvectorizer__ngram_range':[(1,1),(1,2)],
    'adaboostclassifier__n_estimators': [350,650,850],
    'adaboostclassifier__learning_rate':[.75,1,1.25,1.5]
}

In [37]:
#make gridsearch instance
grid_ada_title = GridSearchCV(pipe_adacv_title,param_grid = param_adacv_titl,n_jobs=-1)


grid_ada_title.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('countvectorizer',
                                        CountVectorizer(stop_words=['0o', '0s',
                                                                    '3a', '3b',
                                                                    '3d', '6b',
                                                                    '6o', 'a',
                                                                    'a1', 'a2',
                                                                    'a3', 'a4',
                                                                    'ab',
                                                                    'able',
                                                                    'about',
                                                                    'above',
                                                                    'abst',
                                                                    'ac',
        

In [38]:
#best parameters
grid_ada_title.best_params_

{'adaboostclassifier__learning_rate': 1.25,
 'adaboostclassifier__n_estimators': 850,
 'countvectorizer__max_features': 6500,
 'countvectorizer__ngram_range': (1, 1)}

In [39]:
#scoring on the training data
grid_ada_title.score(X_train,y_train)

0.8263352980494623

In [40]:
#score on different classification metrics
print(f'The accuracy of the AdaBoost model with combined text is {round(grid_ada_title.score(X_test,y_test),4)}.')
print(f'The recall of the AdaBoost model with combined text is {round(recall_score(y_test,grid_ada_title.predict(X_test)),4)}.')
print(f'The f1 score of the AdaBoost model with combined text is {round(f1_score(y_test,grid_ada_title.predict(X_test)),4)}.')
print(f'The precision score of the AdaBoost model with combined text is {round(precision_score(y_test,grid_ada_title.predict(X_test)),4)}.')

The accuracy of the AdaBoost model with combined text is 0.7366.
The recall of the AdaBoost model with combined text is 0.8705.
The f1 score of the AdaBoost model with combined text is 0.7895.
The precision score of the AdaBoost model with combined text is 0.7222.


In [41]:
#dataframe of grid results
pd.DataFrame(grid_ada_title.cv_results_).sort_values(by='rank_test_score').head(3)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_adaboostclassifier__learning_rate,param_adaboostclassifier__n_estimators,param_countvectorizer__max_features,param_countvectorizer__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
52,7.552899,0.0457,0.931866,0.031609,1.25,850,6500,"(1, 1)","{'adaboostclassifier__learning_rate': 1.25, 'a...",0.722965,0.735419,0.743239,0.738377,0.723184,0.732637,0.008198,1
70,7.443069,0.127319,0.799943,0.092289,1.5,850,6500,"(1, 1)","{'adaboostclassifier__learning_rate': 1.5, 'ad...",0.722661,0.735419,0.739897,0.741416,0.722881,0.732455,0.008149,2
34,7.5808,0.076546,0.896024,0.019712,1.0,850,6500,"(1, 1)","{'adaboostclassifier__learning_rate': 1, 'adab...",0.723876,0.736027,0.742328,0.734123,0.723488,0.731968,0.007291,3


### Model Interpretation

The AdaBoost model for combined text with CountVectorizer performed similarly to the MultinomialNB model with combined text, with a test accuracy of about 73.7% and a recall of about 87%. The major difference is that the AdaBoost model shows more overfitting (higher variance) than the MultinomialNB model. Because of this, I have higher confidence in the MultinomialNB model and its coefficients.

## Boosting Models Takeaway

The best model obtained from the project came from this notebook - AdaBoosting with TfidfVectorization. This model obtains about 74% accuracy and 87% recall on the test data, the highest-scoring pair of the evaluation metrics I set out to optimize. AdaBoost with TfidfVectorization is therefore chosen as the final production model! Go Boosting!