# CP1 Baseline notebook

### By Logan Larson

In [5]:
### import packages ###

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


import warnings
warnings.filterwarnings("ignore")


### load data ###

df = pd.read_csv('clean_schefter_tweets')
del df['Unnamed: 0']


In [19]:
# define vectorizer
def make_xy(df, vectorizer=None):   
    if vectorizer is None:
        vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df.Tweet.values.astype('U')) # convert object type to unicode
    X = X.tocsc()  # some versions of sklearn return COO format
    y = (df.Class == 1).values.astype(int)  # np.int was removed in NumPy 1.24
    return X, y


# define cross-validation score
def cv_score(clf, X, y, scorefunc):
    result = 0.
    nfold = 5
    for train, test in KFold(nfold).split(X): # split data into train/test groups, 5 times
        clf.fit(X[train], y[train]) # fit the classifier passed in as clf
        result += scorefunc(clf, X[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average


# define log-likelihood score function
def log_likelihood(clf, x, y):
    prob = clf.predict_log_proba(x)
    irrelevant = y == 0
    relevant = ~irrelevant
    return prob[irrelevant, 0].sum() + prob[relevant, 1].sum()
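
As a hand-checkable illustration of what `log_likelihood` computes, the sketch below re-implements the same sum in plain Python. The `probs` and `labels` values are made up for illustration and are not taken from the tweet data. It assumes each row holds `[P(class 0), P(class 1)]`, which is the column order `predict_log_proba` uses for classes 0 and 1.

```python
import math

def toy_log_likelihood(probs, labels):
    """Sum the log-probability the model assigned to each true label."""
    return sum(math.log(row[label]) for row, label in zip(probs, labels))

# hypothetical predicted probabilities: [P(class 0), P(class 1)] per tweet
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

confident = toy_log_likelihood(probs[:2], labels[:2])  # both predictions correct
with_miss = toy_log_likelihood(probs, labels)          # third prediction is wrong
print(confident, with_miss)
```

Scores closer to zero are better, which is why the tuning loops keep the alpha with the largest `cvscore`.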

In [20]:
# vectorize before train/test split
X, y = make_xy(df)

# split dataset into a training and test set
xtrain, xtest, ytrain, ytest = train_test_split(X, y, random_state=1)
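
As a sanity check on the split, the support counts printed in the classification reports in this notebook (24,780 training rows and 8,261 test rows) can be compared with what a default `train_test_split` call should produce. This is a minimal sketch: the counts are copied by hand from those reports, and the 25% default with the test count rounded up is my understanding of scikit-learn's behavior, not something this notebook verifies.

```python
import math

n_train, n_test = 24780, 8261      # support totals from the reports
n_total = n_train + n_test

# assumption: test_size defaults to 0.25 and the test count is rounded up
expected_test = math.ceil(0.25 * n_total)

pos_train, pos_test = 5595, 1870   # class-1 support in each split
train_pos_rate = pos_train / n_train
test_pos_rate = pos_test / n_test
print(n_total, expected_test, round(train_pos_rate, 3), round(test_pos_rate, 3))
```

The positive-class share is close in both splits even though the split was not stratified.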

## Show train/test classification for all three models

#### 1. Bernoulli Naive Bayes model

In [21]:
BNB = BernoulliNB().fit(xtrain, ytrain)
BNB_train_pred = BNB.predict(xtrain)
BNB_test_pred = BNB.predict(xtest)

print('\n Bernoulli Naive Bayes baseline classifier: \n \n')
print('\n Training Classification Report: \n', classification_report(ytrain, BNB_train_pred, digits = 3, labels=[0,1]))
print('\n Test Classification Report: \n', classification_report(ytest, BNB_test_pred, digits = 3, labels=[0,1]))


 Bernoulli Naive Bayes baseline classifier: 
 


 Training Classification Report: 
               precision    recall  f1-score   support

           0      0.963     0.892     0.926     19185
           1      0.705     0.882     0.783      5595

    accuracy                          0.890     24780
   macro avg      0.834     0.887     0.855     24780
weighted avg      0.905     0.890     0.894     24780


 Test Classification Report: 
               precision    recall  f1-score   support

           0      0.952     0.889     0.919      6391
           1      0.690     0.848     0.761      1870

    accuracy                          0.879      8261
   macro avg      0.821     0.868     0.840      8261
weighted avg      0.893     0.879     0.884      8261



#### 2. Multinomial Naive Bayes model

In [22]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB()
MNB.fit(xtrain, ytrain)

MNBtrain_pred = MNB.predict(xtrain)
MNBtest_pred = MNB.predict(xtest)

print('\n Multinomial Naive Bayes baseline classifier: \n \n')
print('\n Training Classification Report: \n', classification_report(ytrain, MNBtrain_pred, labels=[0,1], digits=3))
print('\n Test Classification Report: \n', classification_report(ytest, MNBtest_pred, labels=[0,1], digits=3))


 Multinomial Naive Bayes baseline classifier: 
 


 Training Classification Report: 
               precision    recall  f1-score   support

           0      0.965     0.884     0.923     19185
           1      0.692     0.890     0.778      5595

    accuracy                          0.886     24780
   macro avg      0.828     0.887     0.851     24780
weighted avg      0.903     0.886     0.890     24780


 Test Classification Report: 
               precision    recall  f1-score   support

           0      0.956     0.878     0.915      6391
           1      0.674     0.861     0.756      1870

    accuracy                          0.874      8261
   macro avg      0.815     0.870     0.836      8261
weighted avg      0.892     0.874     0.879      8261



#### 3. Logistic Regression model

In [23]:
LR = LogisticRegression().fit(xtrain, ytrain)
LR_train_pred = LR.predict(xtrain)
LR_test_pred = LR.predict(xtest)

print('\n Logistic Regression baseline classifier: \n \n')
print('\n Training Classification Report: \n', classification_report(ytrain, LR_train_pred, digits = 3, labels=[0,1]))
print('\n Test Classification Report: \n', classification_report(ytest, LR_test_pred, digits = 3, labels=[0,1]))


 Logistic Regression baseline classifier: 
 


 Training Classification Report: 
               precision    recall  f1-score   support

           0      0.949     0.972     0.961     19185
           1      0.896     0.822     0.857      5595

    accuracy                          0.938     24780
   macro avg      0.922     0.897     0.909     24780
weighted avg      0.937     0.938     0.937     24780


 Test Classification Report: 
               precision    recall  f1-score   support

           0      0.929     0.958     0.943      6391
           1      0.839     0.749     0.791      1870

    accuracy                          0.911      8261
   macro avg      0.884     0.853     0.867      8261
weighted avg      0.908     0.911     0.909      8261



## Determine if there is overfitting

If we define overfitting as a gap of 10 or more points between training and test scores on any of four metrics (accuracy, precision, recall and F1 score), for either the positive (1) or negative (0) class, there is little evidence of overfitting in any of my three models. I therefore have little incentive to regularize any of them. However, the logistic regression model still shows noticeable train/test gaps on the positive class, in both recall (7.3 points) and precision (5.7 points), so I'll proceed with regularization out of curiosity.
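
The gap check described above can be sketched as follows for the logistic regression baseline. The scores are copied by hand from the positive-class rows of the training and test reports, and `THRESHOLD` is just a name I chose for the 10-point cutoff.

```python
# positive-class scores for the logistic regression baseline,
# copied from the training and test classification reports
train = {"precision": 0.896, "recall": 0.822, "f1": 0.857}
test = {"precision": 0.839, "recall": 0.749, "f1": 0.791}

THRESHOLD = 10.0  # points; the overfitting cutoff used in this section

# train-minus-test gap per metric, in points
gaps = {m: round((train[m] - test[m]) * 100, 1) for m in train}
overfit = {m: gap >= THRESHOLD for m, gap in gaps.items()}
print(gaps)
print(overfit)
```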

## Determine optimal hyperparameters for regularization

#### 1. Bernoulli Naive Bayes (alpha)

In [28]:
# define the grid of parameters to search over
alphas = [.01, .1, 1, 5, 10, 50, 100]
min_df = 0.001

# find the best value for alpha
BNB_best_alpha = None
_, itest = train_test_split(range(df.shape[0]))
mask = np.zeros(df.shape[0], dtype=bool)  # builtin bool works across NumPy versions
mask[itest] = True
maxscore = -np.inf
for alpha in alphas:        
    vectorizer = CountVectorizer(min_df = min_df)       
    Xthis, ythis = make_xy(df, vectorizer)
    Xtrainthis = Xthis[mask]
    ytrainthis = ythis[mask]
    
    clf = BernoulliNB(alpha=alpha)
    
    cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)
    
    if cvscore > maxscore:
        maxscore = cvscore
        BNB_best_alpha = alpha
        
print('Bernoulli Naive Bayes optimal alpha: {}'.format(BNB_best_alpha))

Bernoulli Naive Bayes optimal alpha: 5
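
For intuition about what the `alpha` grid is searching over, the sketch below applies additive (Laplace/Lidstone) smoothing to one word's presence probability. The formula `(count + alpha) / (class_total + 2 * alpha)` is my understanding of how `BernoulliNB` smooths a binary feature, and the counts and the helper name `smoothed_presence_prob` are made up for illustration.

```python
def smoothed_presence_prob(docs_with_word, docs_in_class, alpha):
    """Additively smoothed estimate of P(word present | class)."""
    return (docs_with_word + alpha) / (docs_in_class + 2 * alpha)

# hypothetical counts: a word seen in 3 of 1000 newsworthy tweets
for alpha in [0.01, 1, 5, 100]:
    print(alpha, round(smoothed_presence_prob(3, 1000, alpha), 4))
```

Larger values of `alpha` pull every estimate toward 0.5, which is the sense in which it acts as regularization here.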


#### 2. Multinomial Naive Bayes (alpha)

In [29]:
# define the grid of parameters to search over
alphas = [.01, .1, 1, 5, 10, 50, 100]
min_df = 0.001

# find the best value for alpha
MNB_best_alpha = None
_, itest = train_test_split(range(df.shape[0]))
mask = np.zeros(df.shape[0], dtype=bool)  # builtin bool works across NumPy versions
mask[itest] = True
maxscore = -np.inf
for alpha in alphas:        
    vectorizer = CountVectorizer(min_df = min_df)       
    Xthis, ythis = make_xy(df, vectorizer)
    Xtrainthis = Xthis[mask]
    ytrainthis = ythis[mask]
    
    clf = MultinomialNB(alpha=alpha)
    
    cvscore = cv_score(clf, Xtrainthis, ytrainthis, log_likelihood)
    
    if cvscore > maxscore:
        maxscore = cvscore
        MNB_best_alpha = alpha
        
print('Multinomial Naive Bayes optimal alpha: {}'.format(MNB_best_alpha))

Multinomial Naive Bayes optimal alpha: 10


#### 3. Logistic Regression (C)

In [33]:
# tune the C hyperparameter with a 10-fold cross-validated grid search
LR2 = LogisticRegression()
parameters = {"C": [0.0001, 0.001, 0.1, 1, 2, 3, 4, 5, 10, 100, 1000, 10000]}
fitmodel = GridSearchCV(LR2, param_grid=parameters, cv=10, scoring="accuracy").fit(xtrain,ytrain)

print('Optimal C value:', fitmodel.best_params_['C'])

Optimal C value: 1


## Build regularized models

#### 1. Bernoulli Naive Bayes model

In [34]:
BNB_tuned = BernoulliNB(alpha=BNB_best_alpha).fit(xtrain, ytrain)

BNB_tuned_train_pred = BNB_tuned.predict(xtrain)
BNB_tuned_test_pred = BNB_tuned.predict(xtest)

print('\n Tuned Bernoulli Naive Bayes classifier: \n \n')
print('\n Training Classification Report: \n', classification_report(ytrain, BNB_tuned_train_pred, digits = 3, labels=[0,1]))
print('\n Test Classification Report: \n', classification_report(ytest, BNB_tuned_test_pred, digits = 3, labels=[0,1]))


 Tuned Bernoulli Naive Bayes classifier: 
 


 Training Classification Report: 
               precision    recall  f1-score   support

           0      0.947     0.909     0.928     19185
           1      0.726     0.825     0.772      5595

    accuracy                          0.890     24780
   macro avg      0.836     0.867     0.850     24780
weighted avg      0.897     0.890     0.893     24780


 Test Classification Report: 
               precision    recall  f1-score   support

           0      0.941     0.903     0.922      6391
           1      0.709     0.807     0.755      1870

    accuracy                          0.881      8261
   macro avg      0.825     0.855     0.838      8261
weighted avg      0.889     0.881     0.884      8261



#### Compared to non-regularized model:

- Less overfitting in terms of overall accuracy, F1 score and recall
- Slightly more overfitting in terms of precision
- Overall accuracy improved slightly on test data (0.879 to 0.881) and was unchanged on training data (0.890)
- Improved precision on positive class for both test and training data
- Diminished recall on positive class for both test and training data
- Diminished F1 on positive class for both test (0.761 to 0.755) and training data (0.783 to 0.772)

#### 2. Multinomial Naive Bayes model

In [35]:
MNB_tuned = MultinomialNB(alpha=MNB_best_alpha).fit(xtrain, ytrain)

MNB_tuned_train_pred = MNB_tuned.predict(xtrain)
MNB_tuned_test_pred = MNB_tuned.predict(xtest)

print('\n Tuned Multinomial Naive Bayes classifier: \n \n')
print('\n Training Classification Report: \n', classification_report(ytrain, MNB_tuned_train_pred, digits = 3, labels=[0,1]))
print('\n Test Classification Report: \n', classification_report(ytest, MNB_tuned_test_pred, digits = 3, labels=[0,1]))


 Tuned Multinomial Naive Bayes classifier: 
 


 Training Classification Report: 
               precision    recall  f1-score   support

           0      0.947     0.909     0.928     19185
           1      0.726     0.825     0.772      5595

    accuracy                          0.890     24780
   macro avg      0.836     0.867     0.850     24780
weighted avg      0.897     0.890     0.893     24780


 Test Classification Report: 
               precision    recall  f1-score   support

           0      0.941     0.903     0.922      6391
           1      0.709     0.807     0.755      1870

    accuracy                          0.881      8261
   macro avg      0.825     0.855     0.838      8261
weighted avg      0.889     0.881     0.884      8261



#### Compared to non-regularized model:

- Less overfitting in terms of precision, recall, F1 score and overall accuracy
- Overall accuracy improved
- Precision improved for positive class but diminished for negative class
- Recall diminished for positive class (0.861 to 0.807 on test data) but improved for negative class (0.878 to 0.903 on test data)

#### 3a. Logistic regression

In [37]:
LR_tuned = LogisticRegression(C=fitmodel.best_params_['C']).fit(xtrain, ytrain)
LR_tuned_train_pred = LR_tuned.predict(xtrain)
LR_tuned_test_pred = LR_tuned.predict(xtest)

print('\n Tuned Logistic Regression classifier: \n \n')
print('\n Training Classification Report: \n', classification_report(ytrain, LR_tuned_train_pred, digits = 3, labels=[0,1]))
print('\n Test Classification Report: \n', classification_report(ytest, LR_tuned_test_pred, digits = 3, labels=[0,1]))


 Tuned Logistic Regression classifier: 
 


 Training Classification Report: 
               precision    recall  f1-score   support

           0      0.949     0.972     0.961     19185
           1      0.896     0.822     0.857      5595

    accuracy                          0.938     24780
   macro avg      0.922     0.897     0.909     24780
weighted avg      0.937     0.938     0.937     24780


 Test Classification Report: 
               precision    recall  f1-score   support

           0      0.929     0.958     0.943      6391
           1      0.839     0.749     0.791      1870

    accuracy                          0.911      8261
   macro avg      0.884     0.853     0.867      8261
weighted avg      0.908     0.911     0.909      8261



#### Compared to non-regularized model:

Since the optimal C turned out to be the same as the default setting (1), the tuned model is identical to the baseline, which already applies scikit-learn's default L2 penalty. Let's instead compare explicit L1 (lasso) and L2 (ridge) penalties.

#### 3b. Logistic regression with L1 penalty

In [19]:
# lasso

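# the default lbfgs solver in recent scikit-learn does not support L1, so use liblinear
lasso = LogisticRegression(C=1, penalty='l1', solver='liblinear').fit(xtrain,ytrain)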

lasso_train_pred = lasso.predict(xtrain)
lasso_test_pred = lasso.predict(xtest)

print('\n Logistic Regression classifier with L1 penalty: \n \n')
print('\n Training Classification Report: \n', classification_report(ytrain, lasso_train_pred, digits = 3, labels=[0,1]))
print('\n Test Classification Report: \n', classification_report(ytest, lasso_test_pred, digits = 3, labels=[0,1]))


 Logistic Regression classifier with L1 penalty: 
 


 Training Classification Report: 
               precision    recall  f1-score   support

           0       0.94      0.97      0.96     19185
           1       0.88      0.80      0.84      5595

    accuracy                           0.93     24780
   macro avg       0.91      0.89      0.90     24780
weighted avg       0.93      0.93      0.93     24780


 Test Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.96      0.94      6391
           1       0.84      0.75      0.79      1870

    accuracy                           0.91      8261
   macro avg       0.89      0.85      0.87      8261
weighted avg       0.91      0.91      0.91      8261



#### Compared to non-regularized model

Matches the baseline model on test data to two decimal places, but scores lower on training data, most visibly on the positive class (precision 0.88 vs. 0.896, recall 0.80 vs. 0.822).

#### 3c. Logistic regression with L2 penalty

In [38]:
# ridge

ridge = LogisticRegression(C=1, penalty='l2').fit(xtrain,ytrain)

ridge_train_pred = ridge.predict(xtrain)
ridge_test_pred = ridge.predict(xtest)

print('\n Logistic Regression classifier with L2 penalty: \n \n')
print('\n Training Classification Report: \n', classification_report(ytrain, ridge_train_pred, digits = 3, labels=[0,1]))
print('\n Test Classification Report: \n', classification_report(ytest, ridge_test_pred, digits = 3, labels=[0,1]))


 Logistic Regression classifier with L2 penalty: 
 


 Training Classification Report: 
               precision    recall  f1-score   support

           0      0.949     0.972     0.961     19185
           1      0.896     0.822     0.857      5595

    accuracy                          0.938     24780
   macro avg      0.922     0.897     0.909     24780
weighted avg      0.937     0.938     0.937     24780


 Test Classification Report: 
               precision    recall  f1-score   support

           0      0.929     0.958     0.943      6391
           1      0.839     0.749     0.791      1870

    accuracy                          0.911      8261
   macro avg      0.884     0.853     0.867      8261
weighted avg      0.908     0.911     0.909      8261



#### Compared to non-regularized model:

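Performs the same, as expected: `penalty='l2'` and `C=1` are the `LogisticRegression` defaults, so this is the same model as the baseline.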

## Which metric aligns with the business problem?

Model evaluation depends on the business problem: in this scenario, should we care most about overall accuracy, precision, recall, or a combination of precision and recall such as the F1 score?

To come to a decision, let's start from the beginning. We defined a positive case (labeled 1) as a newsworthy tweet and a negative case (labeled 0) as an irrelevant tweet. It follows that a false positive is an irrelevant tweet incorrectly classified as newsworthy, while a false negative is a newsworthy tweet incorrectly classified as irrelevant.

Since I'd rather have an irrelevant tweet classified as newsworthy (false positive) than have a newsworthy tweet classified as irrelevant (false negative) - and risk users missing potentially important information - I primarily want to minimize the number of false negatives. Thus, I should pick the model that performs best in terms of recall.

The primary alternative would be to prioritize precision, but that is less preferable: an irrelevant tweet flagged as newsworthy is a minor cost, since I suspect that potential users of my model could simply ignore such errors. However, I would lose trust quickly if the model failed to recognize the news that matters. Put this way, it doesn't make sense to prioritize either precision or any kind of F score if it means a trade-off in terms of recall.

In prioritizing recall, I initially prefer the non-regularized Multinomial Naive Bayes model, which has the highest recall on positive cases within the test set (0.861).
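
To make the recall comparison concrete, the sketch below ranks the models by positive-class test recall and converts each recall into an approximate count of missed newsworthy tweets. The recall values and the test support of 1,870 are copied by hand from the reports in this notebook; the missed counts are rough estimates because the reported recalls are rounded to three decimals.

```python
# positive-class test recall, copied from the reports in this notebook
test_recall = {
    "Bernoulli NB (baseline)": 0.848,
    "Multinomial NB (baseline)": 0.861,
    "Logistic Regression (baseline)": 0.749,
    "Bernoulli NB (tuned)": 0.807,
    "Multinomial NB (tuned)": 0.807,
}
n_positive_test = 1870  # class-1 support in the test set

# highest recall first
ranked = sorted(test_recall.items(), key=lambda kv: kv[1], reverse=True)
for name, recall in ranked:
    missed = round(n_positive_test * (1 - recall))  # approximate false negatives
    print(f"{name}: recall={recall:.3f}, ~{missed} newsworthy tweets missed")

best_model = ranked[0][0]
```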