# Performance Baselines

In this notebook we establish performance baselines for SysFake's news classification problem: a floor from naive guessing strategies and an approximate ceiling from expensive ensemble models.

We will use these results to judge the performance of our real model.

## Imports

In [29]:
import numpy as np
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.kernel_approximation import Nystroem
from sklearn.dummy import DummyClassifier

## Data Read

In [30]:
d_full = pd.read_csv('../data/d_full.csv')

d_sub = pd.read_csv('../data/d_sub.csv')

# Split the features from the target column and convert both to NumPy arrays.
x = d_sub.drop('label', axis=1).to_numpy()
y = d_sub['label'].to_numpy()

## Naive Methods

First we use `sklearn.dummy.DummyClassifier` to make predictions from two simple rules: always choosing the most frequent class, and guessing uniformly at random. We score these naive predictions with weighted recall (`scoring='recall_weighted'`), the standard metric for this problem, over 10 stratified, shuffled folds.
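
A side note on the metric: weighted recall averages the per-class recall, weighting each class by its share of the true labels. For single-label classification this works out to plain accuracy. The sketch below checks that by hand in pure Python; the labels are made up for illustration and are not SysFake data.

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Per-class recall, weighted by each class's share of the true labels."""
    support = Counter(y_true)                                    # true count per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct count per class
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels, for illustration only.
y_true = ['real', 'real', 'real', 'fake', 'fake', 'satire']
y_pred = ['real', 'real', 'fake', 'fake', 'real', 'satire']

wr = weighted_recall(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(wr, acc)
```

The support terms cancel inside the sum, which is why the two quantities agree.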

### Most Frequent

In [3]:
cross_val_score(DummyClassifier(strategy='most_frequent'),
                x, y,
                scoring='recall_weighted',
                cv=StratifiedKFold(n_splits=10, shuffle=True),
                verbose=3, n_jobs=-1)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    1.2s remaining:    3.0s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    1.2s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.3s finished


array([0.30642633, 0.30642633, 0.30642633, 0.30642633, 0.30666667,
       0.30666667, 0.30666667, 0.30666667, 0.30588235, 0.30588235])

### Random Guessing

In [4]:
cross_val_score(DummyClassifier(strategy='uniform'),
                x, y,
                scoring='recall_weighted',
                cv=StratifiedKFold(n_splits=10, shuffle=True),
                verbose=3, n_jobs=-1)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.0s finished


array([0.15125392, 0.12774295, 0.14184953, 0.13479624, 0.14901961,
       0.15529412, 0.12156863, 0.12784314, 0.14509804, 0.13411765])
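
Both naive scores have simple expected values. Always predicting the most frequent class is correct exactly as often as that class occurs, so the scores of about 0.306 reflect the share of the largest class in each test fold. Guessing uniformly among K classes is correct with probability 1/K regardless of class balance; scores around 0.14 would be consistent with roughly seven classes, although this notebook does not state the class count. A minimal sketch with a made-up class distribution:

```python
# Hypothetical class shares (they sum to 1); not the real SysFake distribution.
shares = [0.31, 0.20, 0.15, 0.12, 0.10, 0.07, 0.05]
k = len(shares)

# 'most_frequent': every prediction is the largest class, so accuracy equals its share.
expected_most_frequent = max(shares)

# 'uniform': whatever the true class is, the guess matches it with probability 1/k.
expected_uniform = sum(share * (1 / k) for share in shares)

print(f"most_frequent: {expected_most_frequent:.3f}")
print(f"uniform:       {expected_uniform:.3f}")
```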

## Start-High Performance Ceiling

Next we try two high-capacity model families to approximate the ceiling for how well we can do on this problem.

We will run trials with a random forest and a gradient boosting learner, each on the raw features and on a Nystroem approximation of the chi-squared kernel feature map.

Although we value recall (a metric that is sensitive to true positives) for our problem, our model must also be simple enough to run locally on a user's machine, for example when the server is slow or unavailable. Both ceiling models are computationally expensive because each is an ensemble of many decision trees, so we treat them as a reference point for achievable performance.

In [31]:
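# Approximate the chi-squared kernel feature map with the Nystroem method.
# The chi2 kernel expects non-negative feature values.
chi2_approx = Nystroem(kernel='chi2').fit_transform(x, y)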

rfc = RandomForestClassifier(n_jobs=-1)
gbc = GradientBoostingClassifier()
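
One caveat with the cell above: `fit_transform` is applied to all of `x` before cross-validation, so the Nystroem components are sampled from rows that later land in test folds. Nystroem ignores `y`, so the effect is probably small, but wrapping the transform and the classifier in a pipeline refits the transform inside each training fold. The sketch below is one way to do that, not what this notebook ran. It uses synthetic data, and the `random_state`, `n_components`, and `n_estimators` values are assumptions chosen to keep the demo small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): non-negative features, three classes.
rng = np.random.default_rng(0)
x_demo = rng.random((200, 8))
y_demo = rng.integers(0, 3, size=200)

# A seeded splitter gives every model the same folds, so scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The pipeline fits Nystroem on the training part of each fold only.
pipe = make_pipeline(
    Nystroem(kernel='chi2', n_components=20, random_state=0),
    RandomForestClassifier(n_estimators=25, random_state=0),
)

demo_scores = cross_val_score(pipe, x_demo, y_demo,
                              scoring='recall_weighted', cv=cv)
print(demo_scores.mean())
```

With the real data, `x_demo` and `y_demo` would be replaced by `x` and `y`, and the same `cv` object could be passed to every `cross_val_score` call.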

In [6]:
scores = cross_val_score(rfc,
                         x, y,
                         scoring='recall_weighted',
                         cv=StratifiedKFold(n_splits=10, shuffle=True),
                         verbose=3, n_jobs=-1)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    2.8s remaining:    6.7s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    2.8s remaining:    1.2s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    3.7s finished


array([0.69357367, 0.69043887, 0.70611285, 0.68495298, 0.68313725,
       0.68470588, 0.67843137, 0.69098039, 0.68392157, 0.69411765])

In [7]:
np.mean(scores)

0.6890372487553015

In [8]:
scores = cross_val_score(gbc,
                         x, y,
                         scoring='recall_weighted',
                         cv=StratifiedKFold(n_splits=10, shuffle=True),
                         verbose=3, n_jobs=-1)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:   19.2s remaining:   44.9s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:   20.0s remaining:    8.5s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   30.8s finished


array([0.70376176, 0.69435737, 0.69200627, 0.70846395, 0.68705882,
       0.6745098 , 0.69647059, 0.6972549 , 0.68470588, 0.69803922])

In [9]:
np.mean(scores)

0.6936628557379064

In [32]:
scores = cross_val_score(rfc,
                         chi2_approx, y,
                         scoring='recall_weighted',
                         cv=StratifiedKFold(n_splits=10, shuffle=True),
                         verbose=3, n_jobs=-1)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:   18.8s remaining:   44.1s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:   19.7s remaining:    8.4s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   23.9s finished


array([0.71551724, 0.6653605 , 0.69670846, 0.67163009, 0.68      ,
       0.6972549 , 0.68705882, 0.69254902, 0.69803922, 0.67215686])

In [33]:
np.mean(scores)

0.6876275124469851

In [36]:
scores = cross_val_score(gbc,
                         chi2_approx, y,
                         scoring='recall_weighted',
                         cv=StratifiedKFold(n_splits=10, shuffle=True),
                         verbose=3, n_jobs=-1)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:  5.5min remaining: 12.9min
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:  5.6min remaining:  2.4min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  9.0min finished


array([0.69905956, 0.69043887, 0.69278997, 0.70141066, 0.69176471,
       0.71294118, 0.6972549 , 0.70117647, 0.67921569, 0.68941176])

In [37]:
np.mean(scores)

0.6955463765443483
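
To put the four ceiling runs side by side, the sketch below recomputes the mean of each score array reported above and adds the fold-to-fold standard deviation. The arrays are copied from this notebook's outputs; the dictionary keys are labels chosen here for readability. Each run drew its own shuffled folds, so small differences between means should be read with caution.

```python
from statistics import mean, pstdev

# Per-fold weighted recall, copied from the cell outputs above.
reported = {
    'rfc, raw features': [0.69357367, 0.69043887, 0.70611285, 0.68495298, 0.68313725,
                          0.68470588, 0.67843137, 0.69098039, 0.68392157, 0.69411765],
    'gbc, raw features': [0.70376176, 0.69435737, 0.69200627, 0.70846395, 0.68705882,
                          0.6745098, 0.69647059, 0.6972549, 0.68470588, 0.69803922],
    'rfc, chi2 Nystroem': [0.71551724, 0.6653605, 0.69670846, 0.67163009, 0.68,
                           0.6972549, 0.68705882, 0.69254902, 0.69803922, 0.67215686],
    'gbc, chi2 Nystroem': [0.69905956, 0.69043887, 0.69278997, 0.70141066, 0.69176471,
                           0.71294118, 0.6972549, 0.70117647, 0.67921569, 0.68941176],
}

# Population standard deviation, matching the default of np.std.
summary = {name: (mean(s), pstdev(s)) for name, s in reported.items()}

for name, (m, sd) in summary.items():
    print(f"{name:<20} mean={m:.4f}  std={sd:.4f}")
```

All four means fall between about 0.688 and 0.696, well above the naive baselines of roughly 0.31 (most frequent) and 0.14 (random guessing).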