# Supervised Machine Learning - Balance of viewpoints classifier
   
In this notebook, the manual content analysis data is used to train and evaluate a classifier that assesses the balance of viewpoints of an article.   
The process includes feature selection, and the evaluation and comparison of different types of classifiers on different types of text representations.

## Load packages

In [1]:
#import relevant packages
import pandas as pd
from pandas import read_excel
import re
import numpy as np
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit, GridSearchCV, train_test_split, cross_val_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from pprint import pprint
import joblib

## Read data

In [2]:
#read in the manually coded data
df = read_excel("mca_cleaned.xlsx")

In [3]:
#remove unreliable coders
df = df[df.CID != 4]
df = df.drop(df[(df.CID != 3) & (df.AID < 100000)].index)
len(df)

301

## Splitting the data into test and training set

In [4]:
#inspect the category distribution to see how balanced it is
df["BOV"].value_counts()

0    212
1     89
Name: BOV, dtype: int64

In [5]:
#create training and testing dataset 
x_train, x_test, y_train, y_test = train_test_split(df["Article"], df.BOV, test_size=0.2, random_state=1)

# Feature engineering
   
Four different types of text representations for the classifier training were used, namely count vectors, TF-IDF vectors with unigrams, and TF-IDF vectors with bigrams and TF-IDF vectors with both uni- and bigrams. All classifiers were trained and tested on all features.   
First, the labels, which remain in a binary format, are renamed. After that, the different vector types are created. 

In [6]:
labels_train = y_train
labels_test = y_test

### Count Vectors

In [7]:
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer="word", 
                             token_pattern=r"\w{1,10}", 
                             min_df = 10, 
                             max_df = 1., 
                             max_features=200)

#features for training
features_train_count = count_vect.fit_transform(x_train).toarray()

#features for testing
features_test_count = count_vect.fit_transform(x_test).toarray()

#inspect the shape
print(features_train_count.shape)
print(features_test_count.shape)

(240, 200)
(61, 200)


### TF-IDF Vectors
In the following, Term Frequency-Inverse Document Frequency is applied in order to represent the text data as a vector that can be used as numerical input for a SML algorithm. 
Three different vectors are created:
    - A vector with unigrams only
    - A vector with bigrams only
    - A vector with unigrams & bigrams
    
In addition to that, several parameters were specified:
    - Terms that appear in less than 10 documents are ignored -> min_df
    - All other terms are included -> max_df
    - In total, up to 200 features can be extracted per text -> max_features

In [8]:
# unigrams
tfidf_vect_ug = TfidfVectorizer(analyzer='word', 
                             token_pattern=r'\w{1,}',
                             ngram_range = (1,1),
                             min_df = 10, 
                             max_df = 1., 
                             max_features=200)

# bigrams
tfidf_vect_bg = TfidfVectorizer(analyzer='word', 
                             token_pattern=r'\w{1,}',
                             ngram_range = (2,2),
                             min_df = 10, 
                             max_df = 1., 
                             max_features=3) #otherwise there is an error

# unigrams and bigrams
tfidf_vect_ubg = TfidfVectorizer(analyzer='word', 
                             token_pattern=r'\w{1,}',
                             ngram_range = (1,2),
                             min_df = 10, 
                             max_df = 1., 
                             max_features=200)


#features for testing
features_train_tfidf_ug = tfidf_vect_ug.fit_transform(x_train).toarray()
features_train_tfidf_bg = tfidf_vect_bg.fit_transform(x_train).toarray()
features_train_tfidf_ubg = tfidf_vect_ubg.fit_transform(x_train).toarray()

#features (=y_train)
features_test_tfidf_ug = tfidf_vect_ug.fit_transform(x_test).toarray()
features_test_tfidf_bg = tfidf_vect_bg.fit_transform(x_test).toarray()
features_test_tfidf_ubg = tfidf_vect_ubg.fit_transform(x_test).toarray()

#Explore the shape of the features
print(features_train_tfidf_ubg.shape)
print(features_test_tfidf_ubg.shape)

(240, 200)
(61, 200)


## Model training & evaluation
In the following, different classifiers are trained and evaluated through their precision, recall and accuracy.

In [9]:
def train_model(classifier, features_train, labels_train, features_test):
    # fit the training dataset on the classifier
    classifier.fit(features_train, labels_train)
    # predict the labels on validation dataset
    predictions = classifier.predict(features_test)
    #calculate the accuracy of the predictions
    accuracy = classification_report(labels_test, predictions)
    print(accuracy)

### Stochastic Gradient Descent Classifier
First, the classifiers for different vectors are trained and evaluated on the training and test data.

In [10]:
#defining the classifier
sgdc = SGDClassifier(loss="hinge", max_iter=200, random_state=8) 
#training and evaluating the classifier
train_model(sgdc, features_train_count, labels_train, features_test_count)
train_model(sgdc, features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(sgdc, features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(sgdc, features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.82      0.68      0.75        41
           1       0.52      0.70      0.60        20

    accuracy                           0.69        61
   macro avg       0.67      0.69      0.67        61
weighted avg       0.72      0.69      0.70        61

              precision    recall  f1-score   support

           0       0.68      0.41      0.52        41
           1       0.33      0.60      0.43        20

    accuracy                           0.48        61
   macro avg       0.51      0.51      0.47        61
weighted avg       0.57      0.48      0.49        61

              precision    recall  f1-score   support

           0       1.00      0.17      0.29        41
           1       0.37      1.00      0.54        20

    accuracy                           0.44        61
   macro avg       0.69      0.59      0.42        61
weighted avg       0.79      0.44      0.37        61

              preci

#### Hyperparameter tuning & cross validation
Second, a grid search is performed in order to establish and cross-validate the best model parameters

In [11]:
def cross_validate(classifier, features_train, labels_train, cv):
    scores = cross_val_score(classifier, features_train, labels_train, cv=cv, scoring ="f1")
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

clf = SGDClassifier(loss="hinge", max_iter=200, random_state=8)
calibrated_clf = CalibratedClassifierCV(base_estimator=clf, method="sigmoid", cv=10)  
grid_params = {"base_estimator__alpha": [0.0001, 0.001, 0.01, 0.1]}  
grid_search = GridSearchCV(estimator=calibrated_clf, param_grid=grid_params, cv=10)
grid_search.fit(features_train_tfidf_bg, labels_train)
print("Count vectors:", grid_search.best_params_)
print(cross_validate(sgdc, features_train_count, labels_train, 10))
grid_search.fit(features_train_count, labels_train)
print("TF-IDF vectors with unigrams:", grid_search.best_params_)
print(cross_validate(sgdc, features_train_tfidf_ug, labels_train, 10))
grid_search.fit(features_train_tfidf_ug, labels_train)
print("TF-IDF vectors with bigrams:", grid_search.best_params_)
print(cross_validate(sgdc, features_train_tfidf_ug, labels_train, 10))
grid_search.fit(features_train_tfidf_bg, labels_train)
print("TF-IDF vectors with uni- and bigrams:",grid_search.best_params_)
print(cross_validate(sgdc, features_train_tfidf_ubg, labels_train, 10))

Count vectors: {'base_estimator__alpha': 0.0001}
Accuracy: 0.35 (+/- 0.39)
None
TF-IDF vectors with unigrams: {'base_estimator__alpha': 0.01}
Accuracy: 0.34 (+/- 0.31)
None
TF-IDF vectors with bigrams: {'base_estimator__alpha': 0.001}
Accuracy: 0.34 (+/- 0.31)
None
TF-IDF vectors with uni- and bigrams: {'base_estimator__alpha': 0.0001}
Accuracy: 0.36 (+/- 0.35)
None


In [12]:
#incorporating the grid-search results for training and evaluating the classifiers for different vectors
train_model(SGDClassifier(loss="hinge", alpha = .0001, max_iter=200, random_state=8), features_train_count, labels_train, features_test_count)
train_model(SGDClassifier(loss="hinge", alpha = .01, max_iter=200, random_state=8), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(SGDClassifier(loss="hinge", alpha = .001, max_iter=200, random_state=8), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(SGDClassifier(loss="hinge", alpha = .0001, max_iter=200, random_state=8), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.82      0.68      0.75        41
           1       0.52      0.70      0.60        20

    accuracy                           0.69        61
   macro avg       0.67      0.69      0.67        61
weighted avg       0.72      0.69      0.70        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              preci

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### Save the best classification results for later comparison

In [13]:
clf = SGDClassifier(loss="hinge", alpha = .0001, max_iter=200, random_state=8)
clf.fit(features_train_count, labels_train)
sgdc_final = clf.predict(features_test_count)

### Naive Bayes Classifier
First, the classifiers for different vectors are trained and evaluated on the training and test data.

In [14]:
# Naive Bayes on Count Vectors
print("Gaussian NB")
train_model(GaussianNB(), features_train_count, labels_train, features_test_count)
train_model(GaussianNB(), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(GaussianNB(), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(GaussianNB(), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

print("")
print("")

print("Multinomial NB")
train_model(MultinomialNB(), features_train_count, labels_train, features_test_count)
train_model(MultinomialNB(), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(MultinomialNB(), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(MultinomialNB(), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

Gaussian NB
              precision    recall  f1-score   support

           0       0.69      0.88      0.77        41
           1       0.44      0.20      0.28        20

    accuracy                           0.66        61
   macro avg       0.57      0.54      0.53        61
weighted avg       0.61      0.66      0.61        61

              precision    recall  f1-score   support

           0       0.68      0.98      0.80        41
           1       0.50      0.05      0.09        20

    accuracy                           0.67        61
   macro avg       0.59      0.51      0.45        61
weighted avg       0.62      0.67      0.57        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

       

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


#### Hyperparameter tuning
Second, a grid search is performed in order to establish and cross-validate the best model parameters. 

In [15]:
clf = MultinomialNB()
calibrated_clf = CalibratedClassifierCV(base_estimator=clf, method="sigmoid", cv=10)  
grid_params = {"base_estimator__alpha": [0.0001, 0.001, 0.01, 0.1]}  
grid_search = GridSearchCV(estimator=calibrated_clf, param_grid=grid_params, cv=10)
grid_search.fit(features_train_tfidf_bg, labels_train)
print("Count vectors:", grid_search.best_params_)
print(cross_validate(sgdc, features_train_count, labels_train, 10))
grid_search.fit(features_train_count, labels_train)
print("TF-IDF vectors with unigrams:", grid_search.best_params_)
print(cross_validate(sgdc, features_train_tfidf_ug, labels_train, 10))
grid_search.fit(features_train_tfidf_ug, labels_train)
print("TF-IDF vectors with bigrams:", grid_search.best_params_)
print(cross_validate(sgdc, features_train_tfidf_ug, labels_train, 10))
grid_search.fit(features_train_tfidf_bg, labels_train)
print("TF-IDF vectors with uni- and bigrams:",grid_search.best_params_)
print(cross_validate(sgdc, features_train_tfidf_ubg, labels_train, 10))

Count vectors: {'base_estimator__alpha': 0.0001}
Accuracy: 0.35 (+/- 0.39)
None
TF-IDF vectors with unigrams: {'base_estimator__alpha': 0.0001}
Accuracy: 0.34 (+/- 0.31)
None
TF-IDF vectors with bigrams: {'base_estimator__alpha': 0.1}
Accuracy: 0.34 (+/- 0.31)
None
TF-IDF vectors with uni- and bigrams: {'base_estimator__alpha': 0.0001}
Accuracy: 0.36 (+/- 0.35)
None


In [16]:
#incorporating the grid-search results for training and evaluating the classifiers for different vectors
train_model(MultinomialNB(alpha = .0001), features_train_count, labels_train, features_test_count)
train_model(MultinomialNB(alpha = .0001), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(MultinomialNB(alpha = .1,), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(MultinomialNB(alpha = .0001), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.84      0.39      0.53        41
           1       0.40      0.85      0.55        20

    accuracy                           0.54        61
   macro avg       0.62      0.62      0.54        61
weighted avg       0.70      0.54      0.54        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              preci

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [17]:
#compare results with the best Gaussian classifier
train_model(GaussianNB(), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.74      0.76      0.75        41
           1       0.47      0.45      0.46        20

    accuracy                           0.66        61
   macro avg       0.61      0.60      0.60        61
weighted avg       0.65      0.66      0.65        61



### Save best classifier for later comparison

In [18]:
clf = GaussianNB()
clf.fit(features_train_tfidf_ubg, labels_train)
nbc_final = clf.predict(features_test_tfidf_ubg)

## Support Vector Machines
First, the classifiers for different vectors are trained and evaluated on the training and test data.

In [19]:
train_model(svm.SVC(random_state=8), features_train_count, labels_train, features_test_count)
train_model(svm.SVC(random_state=8), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(svm.SVC(random_state=8), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(svm.SVC(random_state=8), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              preci

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


#### Hyperparameter tuning
Second, a grid search is performed in order to establish and cross-validate the best model parameters. 

In [20]:
# C
C = [.0001, .001, .01]

# gamma
gamma = [.0001, .001, .01, .1, 1, 10, 100]

# degree
degree = [1, 2, 3, 4, 5]

# kernel
kernel = ['linear', 'rbf', 'poly']

# probability
probability = [True]

# Create the random grid
random_grid = {'C': C,
              'kernel': kernel,
              'gamma': gamma,
              'degree': degree,
              'probability': probability
             }

In [21]:
# First create the base model to tune
svc = svm.SVC(random_state=8)

# Definition of the random search
random_search = RandomizedSearchCV(estimator=svc,
                                   param_distributions=random_grid,
                                   n_iter=50,
                                   scoring='f1',
                                   cv=10, 
                                   verbose=1, 
                                   random_state=8)

In [22]:
random_search.fit(features_train_count, labels_train)

print("The best hyperparameters from Random Search for count vectors are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

Fitting 10 folds for each of 50 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


The best hyperparameters from Random Search for count vectors are:
{'probability': True, 'kernel': 'linear', 'gamma': 0.0001, 'degree': 3, 'C': 0.01}

The mean accuracy of a model with these hyperparameters is:
0.4554390054390055


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:   30.3s finished


In [23]:
random_search.fit(features_train_tfidf_ug, labels_train)

print("The best hyperparameters from Random Search for tfidf vectors with unigrams are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 50 candidates, totalling 500 fits
The best hyperparameters from Random Search for tfidf vectors with unigrams are:
{'probability': True, 'kernel': 'poly', 'gamma': 10, 'degree': 4, 'C': 0.01}

The mean accuracy of a model with these hyperparameters is:
0.3822216672216672


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:   25.5s finished


In [None]:
random_search.fit(features_train_tfidf_bg, labels_train)

print("The best hyperparameters from Random Search for tfidf vectors with bigrams are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

Fitting 10 folds for each of 50 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


In [24]:
random_search.fit(features_train_tfidf_ubg, labels_train)

print("The best hyperparameters from Random Search for tfidf vectors with uni and bigrams are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 50 candidates, totalling 500 fits
The best hyperparameters from Random Search for tfidf vectors with uni and bigrams are:
{'probability': True, 'kernel': 'poly', 'gamma': 10, 'degree': 4, 'C': 0.01}

The mean accuracy of a model with these hyperparameters is:
0.45404761904761914


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:   25.3s finished


In [25]:
#incorporate grid_search results
train_model(svm.SVC(random_state=8, probability= True, kernel= 'linear', gamma= 0.0001, degree= 3, C= 0.01), features_train_count, labels_train, features_test_count)
train_model(svm.SVC(random_state=8, probability= True, kernel= 'poly', gamma= 10, degree= 4, C= 0.01), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(svm.SVC(random_state=8, probability= True, kernel= 'poly', gamma= 10, degree= 4, C= 0.01), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(svm.SVC(random_state=8, probability= True, kernel= 'poly', gamma= 10, degree= 4, C= 0.01), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.76      0.71      0.73        41
           1       0.48      0.55      0.51        20

    accuracy                           0.66        61
   macro avg       0.62      0.63      0.62        61
weighted avg       0.67      0.66      0.66        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              precision    recall  f1-score   support

           0       0.67      1.00      0.80        41
           1       0.00      0.00      0.00        20

    accuracy                           0.67        61
   macro avg       0.34      0.50      0.40        61
weighted avg       0.45      0.67      0.54        61

              preci

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [26]:
#compare it to the base model for the vectors that achieved the best results
base_model = svm.SVC(random_state=8, probability= True, kernel= 'linear', gamma= 0.0001, degree= 3, C= 0.01)
base_model.fit(features_train_count, labels_train)

svc_pred = base_model.predict(features_test_count)
print(classification_report(labels_test, svc_pred))

              precision    recall  f1-score   support

           0       0.76      0.71      0.73        41
           1       0.48      0.55      0.51        20

    accuracy                           0.66        61
   macro avg       0.62      0.63      0.62        61
weighted avg       0.67      0.66      0.66        61



### Save the best classification results for later comparison

In [27]:
clf = svm.SVC(random_state=8, probability= True, kernel= 'linear', gamma= 0.0001, degree= 3, C= 0.01)
clf.fit(features_train_count, labels_train)
svc_final = clf.predict(features_test_count)

## K-nearest neighbour classifier

In [28]:
train_model(KNeighborsClassifier(), features_train_count, labels_train, features_test_count)
train_model(KNeighborsClassifier(), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(KNeighborsClassifier(), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(KNeighborsClassifier(), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.72      0.93      0.81        41
           1       0.62      0.25      0.36        20

    accuracy                           0.70        61
   macro avg       0.67      0.59      0.58        61
weighted avg       0.69      0.70      0.66        61

              precision    recall  f1-score   support

           0       0.66      0.90      0.76        41
           1       0.20      0.05      0.08        20

    accuracy                           0.62        61
   macro avg       0.43      0.48      0.42        61
weighted avg       0.51      0.62      0.54        61

              precision    recall  f1-score   support

           0       0.69      0.98      0.81        41
           1       0.67      0.10      0.17        20

    accuracy                           0.69        61
   macro avg       0.68      0.54      0.49        61
weighted avg       0.68      0.69      0.60        61

              preci

## Cross-Validation for Hyperparameter tuning

### Grid Search Cross Validation

In [30]:
# Create the parameter grid 
n_neighbors = [int(x) for x in np.linspace(start = 1, stop = 190, num = 200)]

param_grid = {"n_neighbors": n_neighbors}

# Create a base model
knnc = KNeighborsClassifier()

# Manually create the splits in CV in order to be able to fix a random_state (GridSearchCV doesn't have that argument)
cv_sets = ShuffleSplit(n_splits = 3, test_size = .2, random_state = 8)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=knnc, 
                           param_grid=param_grid,
                           scoring="f1",
                           cv=cv_sets,
                           verbose=1)

In [31]:
grid_search.fit(features_train_count, labels_train)

print("The best hyperparameters from Grid Search for count vectors are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


The best hyperparameters from Grid Search for count vectors are:
{'n_neighbors': 1}

The mean accuracy of a model with these hyperparameters is:
0.2763278388278388


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:    4.7s finished


In [32]:
grid_search.fit(features_train_tfidf_ug, labels_train)

print("The best hyperparameters from Grid Search for tfidf vectors are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 200 candidates, totalling 600 fits
The best hyperparameters from Grid Search for tfidf vectors are:
{'n_neighbors': 1}

The mean accuracy of a model with these hyperparameters is:
0.4308571696300982


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:    4.9s finished


In [33]:
grid_search.fit(features_train_tfidf_bg, labels_train)

print("The best hyperparameters from Grid Search for tfidf vectors with ngrams are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 200 candidates, totalling 600 fits
The best hyperparameters from Grid Search for tfidf vectors with ngrams are:
{'n_neighbors': 5}

The mean accuracy of a model with these hyperparameters is:
0.203968253968254


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:    2.7s finished


In [34]:
grid_search.fit(features_train_tfidf_ubg, labels_train)

print("The best hyperparameters from Grid Search for tfidf vectors with ngrams are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 200 candidates, totalling 600 fits
The best hyperparameters from Grid Search for tfidf vectors with ngrams are:
{'n_neighbors': 1}

The mean accuracy of a model with these hyperparameters is:
0.4041078190132695


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:    4.5s finished


In [35]:
#incorporating the grid-search results for training and evaluating the classifiers for different vectors
train_model(KNeighborsClassifier(n_neighbors= 1), features_train_count, labels_train, features_test_count)
train_model(KNeighborsClassifier(n_neighbors= 1), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(KNeighborsClassifier(n_neighbors= 5), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(KNeighborsClassifier(n_neighbors= 1), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.71      0.85      0.78        41
           1       0.50      0.30      0.37        20

    accuracy                           0.67        61
   macro avg       0.61      0.58      0.58        61
weighted avg       0.64      0.67      0.65        61

              precision    recall  f1-score   support

           0       0.67      0.85      0.75        41
           1       0.33      0.15      0.21        20

    accuracy                           0.62        61
   macro avg       0.50      0.50      0.48        61
weighted avg       0.56      0.62      0.57        61

              precision    recall  f1-score   support

           0       0.69      0.98      0.81        41
           1       0.67      0.10      0.17        20

    accuracy                           0.69        61
   macro avg       0.68      0.54      0.49        61
weighted avg       0.68      0.69      0.60        61

              preci

In [36]:
#compare results to the best base model
base_model = KNeighborsClassifier()
base_model.fit(features_train_count, labels_train)

base_model_pred = base_model.predict(features_test_count)
print(classification_report(labels_test, base_model_pred))

              precision    recall  f1-score   support

           0       0.72      0.93      0.81        41
           1       0.62      0.25      0.36        20

    accuracy                           0.70        61
   macro avg       0.67      0.59      0.58        61
weighted avg       0.69      0.70      0.66        61



### Save the best classification results for later comparison

In [37]:
knn_final = base_model_pred

## Final Model comparison

In [38]:
print("classification report - Stochastic Gradiend Descent") 
print(classification_report(labels_test, sgdc_final))
print("classification report - Naive Bayes") 
print(classification_report(labels_test, nbc_final))
print("classification report - SVC") 
print(classification_report(labels_test, svc_final))
print("classification report - k nearest neighbour") 
print(classification_report(labels_test, knn_final))

classification report - Stochastic Gradiend Descent
              precision    recall  f1-score   support

           0       0.82      0.68      0.75        41
           1       0.52      0.70      0.60        20

    accuracy                           0.69        61
   macro avg       0.67      0.69      0.67        61
weighted avg       0.72      0.69      0.70        61

classification report - Naive Bayes
              precision    recall  f1-score   support

           0       0.74      0.76      0.75        41
           1       0.47      0.45      0.46        20

    accuracy                           0.66        61
   macro avg       0.61      0.60      0.60        61
weighted avg       0.65      0.66      0.65        61

classification report - SVC
              precision    recall  f1-score   support

           0       0.76      0.71      0.73        41
           1       0.48      0.55      0.51        20

    accuracy                           0.66        61
   macro avg

## Export the best classifier

In [39]:
clf = SGDClassifier(loss="hinge", alpha = .0001, max_iter=200, random_state=8)
clf.fit(features_train_count, labels_train)
svc_pred = clf.predict(features_test_count)
print(classification_report(labels_test, svc_pred))

              precision    recall  f1-score   support

           0       0.82      0.68      0.75        41
           1       0.52      0.70      0.60        20

    accuracy                           0.69        61
   macro avg       0.67      0.69      0.67        61
weighted avg       0.72      0.69      0.70        61



In [40]:
joblib.dump(clf, 'classifier_bov.pkl')

['classifier_bov.pkl']