# Supervised Machine Learning - Neutrality Classifier
   
In this notebook, the manual content analysis data is used to train and evaluate a classifier that assesses the neutrality of an article.   
The process includes feature selection, and the evaluation and comparison of different types of classifiers.

## Load packages

In [2]:
#import relevant packages
import pandas as pd
from pandas import read_excel
import re
import numpy as np
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit, GridSearchCV, train_test_split, cross_val_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from pprint import pprint
import joblib

## Read data

In [3]:
#read in the manually coded data
df = read_excel("mca_cleaned.xlsx")

In [4]:
len(df)

487

## Splitting the data into test and training set
I reran the following models with both the original and the cleaned text. I indicated which one led to better results for each classifier.

In [5]:
df["NEU"].value_counts()

1    283
0    204
Name: NEU, dtype: int64

In [6]:
#create training and testing dataset 
x_train, x_test, y_train, y_test = train_test_split(df["clean text"], df.NEU, test_size=0.2, random_state=1)

# Feature engineering
   
Four different types of text representations for the classifier training were used, namely count vectors, TF-IDF vectors with unigrams, and TF-IDF vectors with bigrams and TF-IDF vectors with both uni- and bigrams. All classifiers were trained and tested on all features.   
First, the labels, which remain in a binary format, are renamed. After that, the different vector types are created. 

In [7]:
labels_train = y_train
labels_test = y_test

### Count Vectors

In [8]:
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer="word", 
                             token_pattern=r"\w{1,10}", 
                             min_df = 10, 
                             max_df = 1., 
                             max_features=200)

#features for training
features_train_count = count_vect.fit_transform(x_train).toarray()

#features for testing
features_test_count = count_vect.fit_transform(x_test).toarray()

#inspect the shape
print(features_train_count.shape)
print(features_test_count.shape)

(389, 200)
(98, 200)


### TF-IDF Vectors
In the following, Term Frequency-Inverse Document Frequency is applied in order to represent the text data as a vector that can be used as numerical input for a SML algorithm. 
Three different vectors are created:
    - A vector with unigrams only
    - A vector with bigrams only
    - A vector with unigrams & bigrams
    
In addition to that, several parameters were specified:
    - Terms that appear in less than 10 documents are ignored -> min_df
    - All other terms are included -> max_df
    - In total, up to 200 features can be extracted per text -> max_features

In [9]:
# unigrams
tfidf_vect_ug = TfidfVectorizer(analyzer='word', 
                             token_pattern=r'\w{1,}',
                             ngram_range = (1,1),
                             min_df = 10, 
                             max_df = 1., 
                             max_features=200)

# bigrams
tfidf_vect_bg = TfidfVectorizer(analyzer='word', 
                             token_pattern=r'\w{1,}',
                             ngram_range = (2,2),
                             min_df = 10, 
                             max_df = 1., 
                             max_features= 6)  #because for at least one article there are only 6 features which leads to an error

# unigrams and bigrams
tfidf_vect_ubg = TfidfVectorizer(analyzer='word', 
                             token_pattern=r'\w{1,}',
                             ngram_range = (1,2),
                             min_df = 10, 
                             max_df = 1., 
                             max_features=200)


#features for testing
features_train_tfidf_ug = tfidf_vect_ug.fit_transform(x_train).toarray()
features_train_tfidf_bg = tfidf_vect_bg.fit_transform(x_train).toarray()
features_train_tfidf_ubg = tfidf_vect_ubg.fit_transform(x_train).toarray()

#features (=y_train)
features_test_tfidf_ug = tfidf_vect_ug.fit_transform(x_test).toarray()
features_test_tfidf_bg = tfidf_vect_bg.fit_transform(x_test).toarray()
features_test_tfidf_ubg = tfidf_vect_ubg.fit_transform(x_test).toarray()

#Explore the shape of the features
print(features_train_tfidf_ubg.shape)
print(features_test_tfidf_ubg.shape)

(389, 200)
(98, 200)


## Model training & evaluation
In the following, different classifiers are trained and evaluated through their precision, recall and accuracy.

In [10]:
def train_model(classifier, features_train, labels_train, features_test):
    # fit the training dataset on the classifier
    classifier.fit(features_train, labels_train)
    # predict the labels on validation dataset
    predictions = classifier.predict(features_test)
    #calculate the accuracy of the predictions
    accuracy = classification_report(labels_test, predictions)
    print(accuracy)

### Stochastic Gradient Descent Classifier
First, the classifiers for different vectors are trained and evaluated on the training and test data.

In [11]:
#defining the classifier
sgdc = SGDClassifier(loss="hinge", max_iter=200, random_state=8) 
#training and evaluating the classifier
train_model(sgdc, features_train_count, labels_train, features_test_count)
train_model(sgdc, features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(sgdc, features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(sgdc, features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.52      0.71      0.60        42
           1       0.70      0.50      0.58        56

    accuracy                           0.59        98
   macro avg       0.61      0.61      0.59        98
weighted avg       0.62      0.59      0.59        98

              precision    recall  f1-score   support

           0       0.47      0.52      0.49        42
           1       0.61      0.55      0.58        56

    accuracy                           0.54        98
   macro avg       0.54      0.54      0.54        98
weighted avg       0.55      0.54      0.54        98

              precision    recall  f1-score   support

           0       0.56      0.48      0.51        42
           1       0.65      0.71      0.68        56

    accuracy                           0.61        98
   macro avg       0.60      0.60      0.60        98
weighted avg       0.61      0.61      0.61        98

              preci

#### Cross-validation
Second, the results are cross-validated

In [12]:
def cross_validate(classifier, features_train, labels_train, cv):
    scores = cross_val_score(classifier, features_train, labels_train, cv=cv, scoring ="f1")
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    
cross_validate(sgdc, features_train_count, labels_train, 10)
cross_validate(sgdc, features_train_tfidf_ug, labels_train, 10)
cross_validate(sgdc, features_train_tfidf_bg, labels_train, 10)
cross_validate(sgdc, features_train_tfidf_ubg, labels_train, 10)

Accuracy: 0.69 (+/- 0.08)
Accuracy: 0.70 (+/- 0.15)
Accuracy: 0.50 (+/- 0.55)
Accuracy: 0.71 (+/- 0.09)


#### Hyperparameter tuning
Third, a grid search is performed in order to establish the best model parameters

In [13]:
clf = SGDClassifier(loss='hinge', max_iter=200)
calibrated_clf = CalibratedClassifierCV(base_estimator=clf, method='sigmoid', cv=10)

grid_params = {'base_estimator__alpha': [0.0001, 0.001, 0.01, 0.1]}  
grid_search = GridSearchCV(estimator=calibrated_clf, param_grid=grid_params, cv=10)
grid_search.fit(features_train_count, labels_train)
print(grid_search.best_params_)

{'base_estimator__alpha': 0.01}


In [14]:
sgdc_final = SGDClassifier(loss="hinge", alpha = .01, max_iter=200) 
#training and evaluating the classifier
train_model(sgdc_final, features_train_count, labels_train, features_test_count)

              precision    recall  f1-score   support

           0       0.52      0.64      0.57        42
           1       0.67      0.55      0.61        56

    accuracy                           0.59        98
   macro avg       0.60      0.60      0.59        98
weighted avg       0.61      0.59      0.59        98



### Save the best classification results for later comparison

In [15]:
clf = SGDClassifier(loss="hinge", alpha = .01, max_iter=200) 
clf.fit(features_train_count, labels_train,)
sgdc_final = clf.predict(features_test_count)

### Naive Bayes Classifier
First, the classifiers for different vectors are trained and evaluated on the training and test data.

In [16]:
# Naive Bayes on Count Vectors
print("Gaussian NB")
train_model(GaussianNB(), features_train_count, labels_train, features_test_count)
train_model(GaussianNB(), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(GaussianNB(), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(GaussianNB(), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

print("")
print("")

print("Multinomial NB")
train_model(MultinomialNB(), features_train_count, labels_train, features_test_count)
train_model(MultinomialNB(), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(MultinomialNB(), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(MultinomialNB(), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

Gaussian NB
              precision    recall  f1-score   support

           0       0.55      0.71      0.62        42
           1       0.72      0.55      0.63        56

    accuracy                           0.62        98
   macro avg       0.63      0.63      0.62        98
weighted avg       0.65      0.62      0.62        98

              precision    recall  f1-score   support

           0       0.45      0.48      0.47        42
           1       0.59      0.57      0.58        56

    accuracy                           0.53        98
   macro avg       0.52      0.52      0.52        98
weighted avg       0.53      0.53      0.53        98

              precision    recall  f1-score   support

           0       0.59      0.40      0.48        42
           1       0.64      0.79      0.70        56

    accuracy                           0.62        98
   macro avg       0.61      0.60      0.59        98
weighted avg       0.62      0.62      0.61        98

       

#### Cross-validation
Second, the results are cross-validated.

In [17]:
print("Gaussian NB; order: count, tfidf_ug, tfidf_bg, tfidf_ubg")
cross_validate(GaussianNB(), features_train_count, labels_train, 10)
cross_validate(GaussianNB(), features_train_tfidf_ug, labels_train, 10)
cross_validate(GaussianNB(), features_train_tfidf_bg, labels_train, 10)
cross_validate(GaussianNB(), features_train_tfidf_ubg, labels_train, 10)
print("Multinomial NB; order: count, tfidf_ug, tfidf_bg, tfidf_ubg")
cross_validate(MultinomialNB(), features_train_count, labels_train, 10)
cross_validate(MultinomialNB(), features_train_tfidf_ug, labels_train, 10)
cross_validate(MultinomialNB(), features_train_tfidf_bg, labels_train, 10)
cross_validate(MultinomialNB(), features_train_tfidf_ubg, labels_train, 10)

Gaussian NB; order: count, tfidf_ug, tfidf_bg, tfidf_ubg
Accuracy: 0.71 (+/- 0.07)
Accuracy: 0.71 (+/- 0.11)
Accuracy: 0.62 (+/- 0.18)
Accuracy: 0.70 (+/- 0.11)
Multinomial NB; order: count, tfidf_ug, tfidf_bg, tfidf_ubg
Accuracy: 0.75 (+/- 0.09)
Accuracy: 0.77 (+/- 0.06)
Accuracy: 0.72 (+/- 0.05)
Accuracy: 0.77 (+/- 0.05)


#### Hyperparameter tuning
Third, a grid search is performed in order to establish the best model parameters. 

In [18]:
clf = MultinomialNB()
calibrated_clf = CalibratedClassifierCV(base_estimator=clf, method='sigmoid', cv=10)  

grid_params = {'base_estimator__alpha': [0.0001, 0.001, 0.01, 0.1]}  
grid_search = GridSearchCV(estimator=calibrated_clf, param_grid=grid_params, cv=10)
grid_search.fit(features_train_tfidf_ubg, labels_train)
print(grid_search.best_params_)

{'base_estimator__alpha': 0.1}


In [19]:
#training and evaluating the classifier
train_model(MultinomialNB(alpha = .01), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.48      0.29      0.36        42
           1       0.59      0.77      0.67        56

    accuracy                           0.56        98
   macro avg       0.53      0.53      0.51        98
weighted avg       0.54      0.56      0.53        98



In [20]:
train_model(GaussianNB(), features_train_count, labels_train, features_test_count)

              precision    recall  f1-score   support

           0       0.55      0.71      0.62        42
           1       0.72      0.55      0.63        56

    accuracy                           0.62        98
   macro avg       0.63      0.63      0.62        98
weighted avg       0.65      0.62      0.62        98



### Save best classifier for later comparison

In [21]:
clf = GaussianNB()
clf.fit(features_train_count, labels_train)
nbc_final = clf.predict(features_test_count)

## Support Vector Machines
First, the classifiers for different vectors are trained and evaluated on the training and test data.

In [22]:
train_model(svm.SVC(), features_train_count, labels_train, features_test_count)
train_model(svm.SVC(), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(svm.SVC(), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(svm.SVC(), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.68      0.60      0.63        42
           1       0.72      0.79      0.75        56

    accuracy                           0.70        98
   macro avg       0.70      0.69      0.69        98
weighted avg       0.70      0.70      0.70        98

              precision    recall  f1-score   support

           0       0.63      0.29      0.39        42
           1       0.62      0.88      0.73        56

    accuracy                           0.62        98
   macro avg       0.63      0.58      0.56        98
weighted avg       0.63      0.62      0.58        98

              precision    recall  f1-score   support

           0       0.79      0.26      0.39        42
           1       0.63      0.95      0.76        56

    accuracy                           0.65        98
   macro avg       0.71      0.60      0.58        98
weighted avg       0.70      0.65      0.60        98

              preci

#### Cross-validation
Second, the results are cross-validated.

In [23]:
cross_validate(svm.SVC(), features_train_count, labels_train, 10)
cross_validate(svm.SVC(), features_train_tfidf_ug, labels_train, 10)
cross_validate(svm.SVC(), features_train_tfidf_bg, labels_train, 10)
cross_validate(svm.SVC(), features_train_tfidf_ubg, labels_train, 10)

Accuracy: 0.76 (+/- 0.13)
Accuracy: 0.78 (+/- 0.06)
Accuracy: 0.72 (+/- 0.05)
Accuracy: 0.78 (+/- 0.05)


#### Hyperparameter tuning
Third, a grid search is performed in order to establish the best model parameters. 

In [24]:
# C
C = [.0001, .001, .01]

# gamma
gamma = [.0001, .001, .01, .1, 1, 10, 100]

# degree
degree = [1, 2, 3, 4, 5]

# kernel
kernel = ['linear', 'rbf', 'poly']

# probability
probability = [True]

# Create the random grid
random_grid = {'C': C,
              'kernel': kernel,
              'gamma': gamma,
              'degree': degree,
              'probability': probability
              }

In [25]:
# First create the base model to tune
svc = svm.SVC()


# Definition of the random search
random_search = RandomizedSearchCV(estimator=svc,
                                   param_distributions=random_grid,
                                   n_iter=50,
                                   scoring='f1',
                                   cv=10, 
                                   verbose=1, 
                                   random_state=8)

In [26]:
random_search.fit(features_train_count, labels_train)

print("The best hyperparameters from Random Search for count vectors are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

Fitting 10 folds for each of 50 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


The best hyperparameters from Random Search for count vectors are:
{'probability': True, 'kernel': 'linear', 'gamma': 0.0001, 'degree': 3, 'C': 0.01}

The mean accuracy of a model with these hyperparameters is:
0.7737994399068956


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  1.1min finished


In [27]:
random_search.fit(features_train_tfidf_ug, labels_train)

print("The best hyperparameters from Random Search for tfidf vectors with unigrams are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 50 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  1.1min finished


The best hyperparameters from Random Search for tfidf vectors with unigrams are:
{'probability': True, 'kernel': 'poly', 'gamma': 10, 'degree': 4, 'C': 0.01}

The mean accuracy of a model with these hyperparameters is:
0.7573195254232818


In [28]:
random_search.fit(features_train_tfidf_bg, labels_train)

print("The best hyperparameters from Random Search for tfidf vectors with bigrams are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

Fitting 10 folds for each of 50 candidates, totalling 500 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


The best hyperparameters from Random Search for tfidf vectors with bigrams are:
{'probability': True, 'kernel': 'poly', 'gamma': 0.001, 'degree': 4, 'C': 0.01}

The mean accuracy of a model with these hyperparameters is:
0.736950467124978


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:    5.8s finished


In [29]:
random_search.fit(features_train_tfidf_ubg, labels_train)

print("The best hyperparameters from Random Search for tfidf vectors with uni and bigrams are:")
print(random_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(random_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 50 candidates, totalling 500 fits
The best hyperparameters from Random Search for tfidf vectors with uni and bigrams are:
{'probability': True, 'kernel': 'poly', 'gamma': 10, 'degree': 4, 'C': 0.01}

The mean accuracy of a model with these hyperparameters is:
0.7563813374183843


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  1.1min finished


In [32]:
svc = svm.SVC(probability= True, kernel= "linear", gamma= 0.0001, degree= 3, C= 0.01, )

#fit the model to the training data
svc.fit(features_train_count, labels_train)

#get predictions
svc_pred = svc.predict(features_test_count)
print(classification_report(labels_test, svc_pred))

              precision    recall  f1-score   support

           0       0.68      0.40      0.51        42
           1       0.66      0.86      0.74        56

    accuracy                           0.66        98
   macro avg       0.67      0.63      0.63        98
weighted avg       0.67      0.66      0.64        98



In [33]:
base_model = svm.SVC()
base_model.fit(features_train_count, labels_train)

svc_pred = base_model.predict(features_test_count)
print(classification_report(labels_test, svc_pred))

              precision    recall  f1-score   support

           0       0.68      0.60      0.63        42
           1       0.72      0.79      0.75        56

    accuracy                           0.70        98
   macro avg       0.70      0.69      0.69        98
weighted avg       0.70      0.70      0.70        98



### Save the best classification results for later comparison

In [34]:
svc_final = base_model.predict(features_test_count)

## K-nearest neighbour classifier

In [35]:
train_model(KNeighborsClassifier(), features_train_count, labels_train, features_test_count)
train_model(KNeighborsClassifier(), features_train_tfidf_ug, labels_train, features_test_tfidf_ug)
train_model(KNeighborsClassifier(), features_train_tfidf_bg, labels_train, features_test_tfidf_bg)
train_model(KNeighborsClassifier(), features_train_tfidf_ubg, labels_train, features_test_tfidf_ubg)

              precision    recall  f1-score   support

           0       0.70      0.17      0.27        42
           1       0.60      0.95      0.74        56

    accuracy                           0.61        98
   macro avg       0.65      0.56      0.50        98
weighted avg       0.64      0.61      0.54        98

              precision    recall  f1-score   support

           0       0.41      0.43      0.42        42
           1       0.56      0.54      0.55        56

    accuracy                           0.49        98
   macro avg       0.48      0.48      0.48        98
weighted avg       0.49      0.49      0.49        98

              precision    recall  f1-score   support

           0       0.64      0.17      0.26        42
           1       0.60      0.93      0.73        56

    accuracy                           0.60        98
   macro avg       0.62      0.55      0.50        98
weighted avg       0.61      0.60      0.53        98

              preci

## Cross-Validation for Hyperparameter tuning

### Grid Search Cross Validation

In [36]:
# Create the parameter grid 
n_neighbors = [int(x) for x in np.linspace(start = 1, stop = 300, num = 200)]

param_grid = {"n_neighbors": n_neighbors}

# Create a base model
knnc = KNeighborsClassifier()

# Manually create the splits in CV in order to be able to fix a random_state (GridSearchCV doesn't have that argument)
cv_sets = ShuffleSplit(n_splits = 3, test_size = .2, random_state = 8)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=knnc, 
                           param_grid=param_grid,
                           scoring="f1",
                           cv=cv_sets,
                           verbose=1)

In [37]:
grid_search.fit(features_train_count, labels_train)

print("The best hyperparameters from Grid Search for count vectors are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


The best hyperparameters from Grid Search for count vectors are:
{'n_neighbors': 40}

The mean accuracy of a model with these hyperparameters is:
0.7450818224449899


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:    7.9s finished


In [38]:
grid_search.fit(features_train_tfidf_ug, labels_train)

print("The best hyperparameters from Grid Search for tfidf vectors are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 200 candidates, totalling 600 fits
The best hyperparameters from Grid Search for tfidf vectors are:
{'n_neighbors': 125}

The mean accuracy of a model with these hyperparameters is:
0.7786153150034552


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:    8.1s finished


In [39]:
grid_search.fit(features_train_tfidf_bg, labels_train)

print("The best hyperparameters from Grid Search for tfidf vectors with ngrams are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 200 candidates, totalling 600 fits
The best hyperparameters from Grid Search for tfidf vectors with ngrams are:
{'n_neighbors': 125}

The mean accuracy of a model with these hyperparameters is:
0.7275625697433444


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:    4.3s finished


In [39]:
grid_search.fit(features_train_tfidf_ubg, labels_train)

print("The best hyperparameters from Grid Search for tfidf vectors with ngrams are:")
print(grid_search.best_params_)
print("")
print("The mean accuracy of a model with these hyperparameters is:")
print(grid_search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 200 candidates, totalling 600 fits
The best hyperparameters from Grid Search for tfidf vectors with ngrams are:
{'n_neighbors': 134}

The mean accuracy of a model with these hyperparameters is:
0.7732016257181274


[Parallel(n_jobs=1)]: Done 600 out of 600 | elapsed:    8.2s finished


In [42]:
n_neighbors = [120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130]
param_grid = {'n_neighbors': n_neighbors}

knnc = KNeighborsClassifier()
cv_sets = ShuffleSplit(n_splits = 3, test_size = .2, random_state = 8)

grid_search = GridSearchCV(estimator=knnc, 
                           param_grid=param_grid,
                           scoring='f1',
                           cv=cv_sets,
                           verbose=1)

grid_search.fit(features_train_tfidf_ug, labels_train)

Fitting 3 folds for each of 11 candidates, totalling 33 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  33 out of  33 | elapsed:    0.6s finished


GridSearchCV(cv=ShuffleSplit(n_splits=3, random_state=8, test_size=0.2, train_size=None),
             estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [120, 121, 122, 123, 124, 125, 126, 127,
                                         128, 129, 130]},
             scoring='f1', verbose=1)

In [43]:
#evaluate the best hyperparameters
clf = KNeighborsClassifier(n_neighbors= 125)
clf.fit(features_train_tfidf_ug, labels_train)
print("classification report - K nearest neighbour - 129 neighbours - tf idf vectors with unigrams") 
print(classification_report(labels_test, clf.predict(features_test_tfidf_ug)))

classification report - K nearest neighbour - 129 neighbours - tf idf vectors with unigrams
              precision    recall  f1-score   support

           0       0.52      0.31      0.39        42
           1       0.60      0.79      0.68        56

    accuracy                           0.58        98
   macro avg       0.56      0.55      0.54        98
weighted avg       0.57      0.58      0.56        98



In [44]:
base_model = KNeighborsClassifier()
base_model.fit(features_train_tfidf_ubg, labels_train)

base_model_pred = base_model.predict(features_test_tfidf_ubg)
print(classification_report(labels_test, base_model_pred))

              precision    recall  f1-score   support

           0       0.53      0.55      0.54        42
           1       0.65      0.64      0.65        56

    accuracy                           0.60        98
   macro avg       0.59      0.60      0.59        98
weighted avg       0.60      0.60      0.60        98



### Save the best classification results for later comparison

In [45]:
knn_final = base_model_pred

## Final Model comparison

In [46]:
print("classification report - Stochastic Gradiend Descent") 
print(classification_report(labels_test, sgdc_final))
print("classification report - Naive Bayes") 
print(classification_report(labels_test, nbc_final))
print("classification report - SVC") 
print(classification_report(labels_test, svc_final))
print("classification report - k nearest neighbour") 
print(classification_report(labels_test, knn_final))

classification report - Stochastic Gradiend Descent
              precision    recall  f1-score   support

           0       0.50      0.52      0.51        42
           1       0.63      0.61      0.62        56

    accuracy                           0.57        98
   macro avg       0.56      0.57      0.56        98
weighted avg       0.57      0.57      0.57        98

classification report - Naive Bayes
              precision    recall  f1-score   support

           0       0.55      0.71      0.62        42
           1       0.72      0.55      0.63        56

    accuracy                           0.62        98
   macro avg       0.63      0.63      0.62        98
weighted avg       0.65      0.62      0.62        98

classification report - SVC
              precision    recall  f1-score   support

           0       0.68      0.60      0.63        42
           1       0.72      0.79      0.75        56

    accuracy                           0.70        98
   macro avg

## Export the best classifier

In [54]:
clf = svm.SVC()
clf.fit(features_train_count, labels_train)
joblib.dump(clf, 'classifier_neu.pkl')

['classifier_neu.pkl']

In [55]:
svc_pred = clf.predict(features_test_count)
print(classification_report(labels_test, svc_pred))

              precision    recall  f1-score   support

           0       0.68      0.60      0.63        42
           1       0.72      0.79      0.75        56

    accuracy                           0.70        98
   macro avg       0.70      0.69      0.69        98
weighted avg       0.70      0.70      0.70        98

