# Performance Baselines

In this notebook we establish performance baselines for SysFake's news classification problem.

The results serve as reference points for judging the performance of our real model.

## Imports

In [1]:
import pickle

import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.preprocessing import robust_scale, normalize, minmax_scale
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.dummy import DummyClassifier

## Data Read

In [2]:
d_full = pd.read_csv('../data/d_full_nonorm.csv')

d_sub = pd.read_csv('../data/d_sub.csv')

d_test = pd.read_csv('../data/human_test_vectors.csv')

d_concat = pd.concat((d_full, d_test))

# Split features from labels, then convert both to NumPy arrays.
x = d_full.drop('label', axis=1).to_numpy()
y = d_full['label'].to_numpy()

## Naive Methods

First we use `sklearn.dummy.DummyClassifier` to make predictions from simple rules: always choosing the most frequent class, and guessing uniformly at random. We score these naive predictions with weighted recall, the standard metric for this problem.
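To make the metric concrete, here is a minimal sketch of support-weighted recall in plain Python. The labels are made up for illustration and the helper `weighted_recall` is hypothetical, not part of the notebook. It also shows that, when every true class is counted, support-weighted recall works out to the same number as plain accuracy.

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

# Toy labels, invented for this example only.
y_true = [1, 1, 1, 2, 2, 3]
y_pred = [1, 1, 2, 2, 3, 3]

wr = weighted_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(wr, accuracy)
```

Each class's support cancels out of its own term, which is why the two values agree.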

### Most Frequent

In [3]:
cross_val_score(DummyClassifier(strategy='most_frequent'),
                x, y,
                scoring='recall_weighted',
                cv=StratifiedKFold(n_splits=10, shuffle=True),
                verbose=3, n_jobs=-1)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    1.1s remaining:    2.7s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    1.1s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.1s finished


array([0.30642633, 0.30642633, 0.30642633, 0.30642633, 0.30642633,
       0.30666667, 0.30666667, 0.30666667, 0.30588235, 0.30588235])

### Random Guessing

In [4]:
cross_val_score(DummyClassifier(strategy='uniform'),
                x, y,
                scoring='recall_weighted',
                cv=StratifiedKFold(n_splits=10, shuffle=True),
                verbose=3, n_jobs=-1)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.0s finished


array([0.14733542, 0.14106583, 0.14655172, 0.12617555, 0.13949843,
       0.14509804, 0.13490196, 0.14431373, 0.15294118, 0.14509804])
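As a sanity check on both dummy baselines, the sketch below simulates them in plain Python. The class counts are hypothetical, chosen only to resemble an imbalanced seven-class problem, and are not the real SysFake label counts. Always predicting the most frequent class scores that class's share of the data, while uniform guessing over K classes scores about 1/K regardless of imbalance.

```python
import random
from collections import Counter

# Hypothetical class counts (assumption, for illustration only).
counts = {1: 3060, 2: 1500, 3: 1500, 5: 1200, 7: 1000, 9: 900, 11: 840}
labels = [label for label, n in counts.items() for _ in range(n)]

# "most_frequent" rule: accuracy equals the share of the largest class.
_, top_count = Counter(labels).most_common(1)[0]
most_frequent_acc = top_count / len(labels)

# "uniform" rule: pick any class with equal probability (seeded for repeatability).
rng = random.Random(0)
classes = sorted(counts)
guesses = [rng.choice(classes) for _ in labels]
uniform_acc = sum(g == t for g, t in zip(guesses, labels)) / len(labels)

print(most_frequent_acc, uniform_acc, 1 / len(classes))
```

The uniform scores of roughly 0.14 above are in line with 1/7 ≈ 0.143.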

## Start-High Performance Ceiling

Next we try two models with very high maximal performance, to approximate the ceiling for how well we can do on this problem.

We run trials with a random forest and a gradient boosting classifier.

Although recall (a metric that is sensitive to true positives) is what we value most for this problem, the model must also be simple enough to run locally on a user's machine, for example when latency is high or the server is down. Both ceiling models are computationally expensive, since each is a large ensemble of decision trees.

In [5]:
c2_x = chi2_kernel(x)

In [6]:
chi2_approx = Nystroem(kernel='chi2').fit_transform(x, y)

rfc = RandomForestClassifier(n_jobs=-1)
gbc = GradientBoostingClassifier()
mnb = MultinomialNB()

In [7]:
scores = cross_val_score(rfc,
                         x, y,
                         scoring='recall_weighted',
                         cv=StratifiedKFold(n_splits=10, shuffle=True),
                         verbose=3, n_jobs=-1)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    3.4s remaining:    8.0s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    3.4s remaining:    1.4s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    4.2s finished


array([0.73902821, 0.73354232, 0.72100313, 0.72805643, 0.7468652 ,
       0.7254902 , 0.73803922, 0.73176471, 0.73647059, 0.72156863])

In [8]:
np.mean(scores)

0.7321828631138976

In [9]:
rfc.fit(x, y)

RandomForestClassifier(n_jobs=-1)

In [10]:
scores = cross_val_score(gbc,
                         x, y,
                         scoring='recall_weighted',
                         cv=StratifiedKFold(n_splits=10, shuffle=True),
                         verbose=3, n_jobs=-1)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:   28.5s remaining:  1.1min
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:   29.3s remaining:   12.5s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   43.3s finished


array([0.72335423, 0.71003135, 0.73667712, 0.70689655, 0.72727273,
       0.71764706, 0.72862745, 0.72705882, 0.73254902, 0.71764706])

In [11]:
np.mean(scores)

0.7227761386686335
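The two means are close, so it helps to look at the fold-to-fold spread as well. The sketch below recomputes the means from the fold scores printed above and adds the sample standard deviation. It is descriptive only: the folds were shuffled independently for each model, so the scores are not paired and this is not a significance test.

```python
from statistics import mean, stdev

# Fold scores copied from the two cross-validation outputs above.
rfc_scores = [0.73902821, 0.73354232, 0.72100313, 0.72805643, 0.7468652,
              0.7254902, 0.73803922, 0.73176471, 0.73647059, 0.72156863]
gbc_scores = [0.72335423, 0.71003135, 0.73667712, 0.70689655, 0.72727273,
              0.71764706, 0.72862745, 0.72705882, 0.73254902, 0.71764706]

rfc_mean, rfc_sd = mean(rfc_scores), stdev(rfc_scores)
gbc_mean, gbc_sd = mean(gbc_scores), stdev(gbc_scores)
gap = rfc_mean - gbc_mean  # difference between the two mean scores

print(f"random forest:     {rfc_mean:.4f} +/- {rfc_sd:.4f}")
print(f"gradient boosting: {gbc_mean:.4f} +/- {gbc_sd:.4f}")
print(f"gap between means: {gap:.4f}")
```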

In [12]:
gbc.fit(x, y)

GradientBoostingClassifier()

In [13]:
scores = cross_val_score(rfc,
                         chi2_approx, y,
                         scoring='recall_weighted',
                         cv=StratifiedKFold(n_splits=10, shuffle=True),
                         verbose=3, n_jobs=-1)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:   17.2s remaining:   40.3s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:   17.3s remaining:    7.3s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   22.1s finished


array([0.72805643, 0.71473354, 0.72257053, 0.71316614, 0.71865204,
       0.72627451, 0.72      , 0.72941176, 0.71843137, 0.72156863])

In [14]:
np.mean(scores)

0.7212864957895384

In [15]:
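# Scale every feature column (all but the last column, assumed to be 'label'),
# then L2-normalize each row.
feature_cols = d_test.columns[:-1]
d_test[feature_cols] = normalize(minmax_scale(robust_scale(d_test[feature_cols])), axis=1)

# Split features from labels, then convert both to NumPy arrays.
x_test = d_test.drop('label', axis=1).to_numpy()
y_test = d_test['label'].to_numpy()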

In [16]:
with open('../models/best-model-rscv-full_kernel-linear_C-7971.3078975898925_breakties-False.pickle', mode='rb') as filein:
    svc = pickle.load(filein)

with open('../models/best_model_rscv_pl-Nys-SGD_p62.pickle', mode='rb') as filein:
    ksgd = pickle.load(filein)

with open('../models/best_model_rscv_SGD_p63.pickle', mode='rb') as filein:
    sgd = pickle.load(filein)

In [17]:
svc_pred = svc.predict(x_test)
ksgd_pred = ksgd.predict(x_test)
sgd_pred = sgd.predict(x_test)
rfc_pred = rfc.predict(x_test)
gbc_pred = gbc.predict(x_test)

In [18]:
methods = ('C-support SVM', 'Chi Squared SGD', 'Vanilla SGD', 'Random Forest', 'Gradient Boosting')
all_predictions = (svc_pred, ksgd_pred, sgd_pred, rfc_pred, gbc_pred)
for method, predictions in zip(methods, all_predictions):
    print('-'*53)
    print(f"{method}:")
    print(classification_report(y_pred=predictions, y_true=y_test, zero_division=0))
    print(confusion_matrix(y_true=y_test, y_pred=predictions))

-----------------------------------------------------
C-support SVM:
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         6
           2       0.19      0.67      0.30         6
           3       1.00      0.20      0.33        10
           5       0.05      0.17      0.08         6
           7       0.09      0.17      0.12         6
           9       0.00      0.00      0.00        13
          11       0.00      0.00      0.00         6

    accuracy                           0.15        53
   macro avg       0.19      0.17      0.12        53
weighted avg       0.23      0.15      0.12        53

[[0 1 0 4 1 0 0]
 [0 4 0 2 0 0 0]
 [0 1 2 5 2 0 0]
 [0 2 0 1 3 0 0]
 [0 3 0 2 1 0 0]
 [0 5 0 4 4 0 0]
 [0 5 0 1 0 0 0]]
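The report and the confusion matrix carry the same information, which the following sketch makes explicit for the C-support SVM result just shown. It uses plain Python, with the matrix copied from the output; rows are true labels and columns are predicted labels, in sorted label order. The helper `per_class_metrics` is hypothetical.

```python
# Confusion matrix for the C-support SVM, copied from the output above.
labels = [1, 2, 3, 5, 7, 9, 11]
cm = [[0, 1, 0, 4, 1, 0, 0],
      [0, 4, 0, 2, 0, 0, 0],
      [0, 1, 2, 5, 2, 0, 0],
      [0, 2, 0, 1, 3, 0, 0],
      [0, 3, 0, 2, 1, 0, 0],
      [0, 5, 0, 4, 4, 0, 0],
      [0, 5, 0, 1, 0, 0, 0]]

def per_class_metrics(cm):
    """Return (precisions, recalls); an empty row or column yields 0.0, like zero_division=0."""
    n = len(cm)
    precisions, recalls = [], []
    for k in range(n):
        row_total = sum(cm[k])                           # samples whose true label is class k
        col_total = sum(cm[i][k] for i in range(n))      # samples predicted as class k
        recalls.append(cm[k][k] / row_total if row_total else 0.0)
        precisions.append(cm[k][k] / col_total if col_total else 0.0)
    return precisions, recalls

precisions, recalls = per_class_metrics(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

for label, p, r in zip(labels, precisions, recalls):
    print(f"{label:>2}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.2f} over {total} samples")
```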
-----------------------------------------------------
Chi Squared SGD:
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         6
           2       0.14      1.00 