# Course 4 - Project - Part 6: Nonlinear classifiers

<a name="top-6"></a>
This notebook is concerned with *Part 6: Nonlinear classifiers*.

**Contents:**
* [Step 0: Loading data](#step-6.0)
* [Step 1: Try with random forests](#step-6.1)
* [Step 2: Try with SVMs](#step-6.2)

## Step 0: Loading data<a name="step-6.0"></a> ([top](#top-6))
---

We begin with some imports.

In [1]:
# Standard library.
import pathlib
import typing as T

# 3rd party.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Project.
import utils

We load the training, validation and test sets with the extracted high-level features.

In [2]:
separator = ''.center(80, '-')

path_train = pathlib.Path.cwd() / 'data' / 'swissroads-features-train.npz'
data_train = utils.load(path_train)
print(separator)
print(f'Dataset: train\n{utils.info(data_train)}')

path_valid = pathlib.Path.cwd() / 'data' / 'swissroads-features-valid.npz'
data_valid = utils.load(path_valid)
print(separator)
print(f'Dataset: valid\n{utils.info(data_valid)}')

path_test = pathlib.Path.cwd() / 'data' / 'swissroads-features-test.npz'
data_test = utils.load(path_test)
print(separator)
print(f'Dataset: test\n{utils.info(data_test)}')

--------------------------------------------------------------------------------
Dataset: train
data: shape=(280, 224, 224, 3), dtype=float32
label_idxs: shape=(280,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(280,), dtype=<U19
features: shape=(280, 1280), dtype=float32
--------------------------------------------------------------------------------
Dataset: valid
data: shape=(139, 224, 224, 3), dtype=float32
label_idxs: shape=(139,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(139,), dtype=<U19
features: shape=(139, 1280), dtype=float32
--------------------------------------------------------------------------------
Dataset: test
data: shape=(50, 224, 224, 3), dtype=float32
label_idxs: shape=(50,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(50,), dtype=<U19
features: shape=(50, 1280), dtype=float32


In [3]:
label_strs = data_train['label_strs']  # Same for all data sets.
assert (
    np.all(data_train['label_strs'] == data_valid['label_strs']) and
    np.all(data_train['label_strs'] == data_test['label_strs'])
)

X_train = data_train['data']
y_train = data_train['label_idxs']
F_train = data_train['features']
N_train = data_train['names']

X_valid = data_valid['data']
y_valid = data_valid['label_idxs']
F_valid = data_valid['features']
N_valid = data_valid['names']

X_test = data_test['data']
y_test = data_test['label_idxs']
F_test = data_test['features']
N_test = data_test['names']

We will fix the seed for the PRNG in order to make computations deterministic.

In [4]:
RANDOM_SEED = 0

## Step 1: Try with random forests<a name="step-6.1"></a> ([top](#top-6))
---

In [5]:
# 3rd party.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

We want to use a random forest classifier.

In this part it makes sense to merge the training and the validation sets for cross-validation (since we would not make use of the validation set otherwise).

In [6]:
X_train_large = np.concatenate([X_train, X_valid])
y_train_large = np.concatenate([y_train, y_valid])
F_train_large = np.concatenate([F_train, F_valid])
N_train_large = np.concatenate([N_train, N_valid])

**Note:** We have imbalanced classes. Several scikit-learn estimators expose a `class_weight` parameter to deal with imbalanced classes, so we will include it in the grid searches.
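As a rough illustration of what `class_weight='balanced'` does: scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`, so rarer classes count more. The sketch below uses a made-up label vector `y_demo` (an assumption, not the project's labels) and compares the manual formula with the helper `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical, imbalanced label vector with 3 classes (not the project's data).
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_demo)

# Heuristic behind class_weight='balanced': n_samples / (n_classes * class counts).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# Same computation via scikit-learn's helper.
auto = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

print(manual)  # The rarest class (2) gets the largest weight.
print(auto)
```

With these counts (6, 3, 1) the weights are 10/18, 10/9 and 10/3, so a mistake on the rarest class costs six times as much as one on the most frequent class.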

We create a random forest classifier.

In [7]:
# Create the estimator.
rf = RandomForestClassifier(random_state=RANDOM_SEED)  # Use defaults.

We perform a cross-validated grid search.

In [8]:
# Set up the cross-validated grid search.
grid = {
    'n_estimators': [1, 5, 10, 100, 200],
    'max_depth': list(range(1, 10 + 1)) + [None],  # 1, 2, ..., 10, None
    'class_weight': [None, 'balanced']
}

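# With 10 splits the model can be trained on 90% of the data points.
# Note: `random_state` only has an effect with `shuffle=True` (recent scikit-learn
# versions reject it otherwise) and `iid` was removed in scikit-learn 0.24, so both
# are omitted here; the folds are deterministic without shuffling.
cv = StratifiedKFold(n_splits=10)
rf_gscv = GridSearchCV(rf, grid, n_jobs=-1, refit=True, cv=cv, return_train_score=True)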

In [9]:
# Fit/evaluate the estimator.
rf_gscv.fit(F_train_large, y_train_large);
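To get a feeling for the cost of this search, the number of candidates can be counted with `ParameterGrid`. This is only a sketch: `grid_demo` repeats the grid defined above under a different name, and the count of fits assumes 10 folds plus one final refit.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the random forest grid from the search above.
grid_demo = {
    'n_estimators': [1, 5, 10, 100, 200],
    'max_depth': list(range(1, 10 + 1)) + [None],
    'class_weight': [None, 'balanced'],
}

n_candidates = len(ParameterGrid(grid_demo))  # 5 * 11 * 2
n_splits = 10
n_fits = n_candidates * n_splits + 1  # +1 for the refit on the full data

print(n_candidates, n_fits)  # 110 1101
```

The row labels in the results table further down (0 to 109) index into this list of candidates.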

We convert the results to a data frame.

In [10]:
df_results = (pd
    .DataFrame(data={
        'n_estimators': rf_gscv.cv_results_['param_n_estimators'],
        'max_depth': rf_gscv.cv_results_['param_max_depth'],
        'class_weight': rf_gscv.cv_results_['param_class_weight'],
        'mean_train_score': rf_gscv.cv_results_['mean_train_score'],
        'mean_test_score': rf_gscv.cv_results_['mean_test_score'],
        'std_test_score': rf_gscv.cv_results_['std_test_score'],
        'params': rf_gscv.cv_results_['params']
    })
    .sort_values(by='mean_test_score', ascending=False)
)

In [11]:
df_results.loc[:, 'n_estimators':'std_test_score'].head()

| | n_estimators | max_depth | class_weight | mean_train_score | mean_test_score | std_test_score |
|---|---|---|---|---|---|---|
| 73 | 100 | 4 | balanced | 0.990723 | 0.91276 | 0.036302 |
| 79 | 200 | 5 | balanced | 0.996561 | 0.909762 | 0.040955 |
| 83 | 100 | 6 | balanced | 0.999473 | 0.909217 | 0.037823 |
| 74 | 200 | 4 | balanced | 0.990192 | 0.90739 | 0.025099 |
| 84 | 200 | 6 | balanced | 1.0 | 0.907183 | 0.037198 |


We compute the accuracy of the best model on the test set.

In [12]:
accuracy_test = rf_gscv.best_estimator_.score(F_test, y_test)
print(f'test accuracy: {accuracy_test * 100:.1f} %')

test accuracy: 96.0 %


In [13]:
rf_gscv.best_params_

{'class_weight': 'balanced', 'max_depth': 4, 'n_estimators': 100}

In [14]:
# Persist the result.
desc = ', '.join([f'{key}={rf_gscv.best_params_[key]}' for key in ['n_estimators', 'max_depth', 'class_weight']])
utils.persist_result('random forest', 'part-06-a', desc, accuracy_test)

**Q: Does increasing the number of trees help?**

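A: According to the results of the grid search, it did help up to a certain point: the five best configurations all use 100 or 200 trees, but going from 100 to 200 trees brought no further gain (the differences in mean test score are well within one standard deviation). Adding trees mainly reduces the variance of the ensemble, so the score is expected to level off rather than keep improving.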
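One way to check this kind of claim is to group the cross-validation results by `n_estimators` and look at the best mean test score for each value. The sketch below does this on synthetic data from `make_classification` with a much smaller grid; the names `X_demo`, `y_demo`, `grid_demo` and `gscv_demo` are illustrative and the numbers it prints say nothing about the project's features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the extracted features (assumption, not the project's data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Small illustrative grid.
grid_demo = {'n_estimators': [1, 10, 50], 'max_depth': [2, 4]}
gscv_demo = GridSearchCV(
    RandomForestClassifier(random_state=0), grid_demo, cv=StratifiedKFold(n_splits=5)
)
gscv_demo.fit(X_demo, y_demo)

# One row per candidate, one column per parameter, plus the mean CV score.
df_demo = pd.DataFrame(gscv_demo.cv_results_['params'])
df_demo['mean_test_score'] = gscv_demo.cv_results_['mean_test_score']

# Best score reached for each number of trees, over all other parameters.
best_per_n = df_demo.groupby('n_estimators')['mean_test_score'].max()
print(best_per_n)
```

Applied to `rf_gscv.cv_results_`, the same grouping would show how the best score evolves from 1 to 200 trees.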

## Step 2: Try with SVMs<a name="step-6.2"></a> ([top](#top-6))
---

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

We want to use a support vector classifier (SVC).

In [16]:
# Create the estimator.
svm_pipe = Pipeline([
    ('svm', LinearSVC(random_state=RANDOM_SEED)),  # Cannot be 'None' with GridSearchCV.
])
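The placeholder step matters because a grid can replace an entire pipeline step: listing estimator instances under the step name (`'svm'`) swaps the estimator, and `svm__<param>` addresses its parameters. The following is a minimal sketch of that mechanism on synthetic data; `pipe_demo`, `grid_demo` and the tiny `C` grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC

# Synthetic stand-in data (assumption, not the project's features).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=10, n_informative=4, n_classes=3, random_state=0
)

# Single-step pipeline; the step needs some estimator as a placeholder.
pipe_demo = Pipeline([('svm', LinearSVC(random_state=0))])

# Each dict is searched separately; the 'svm' key replaces the whole step.
grid_demo = [
    {'svm': [LinearSVC(random_state=0, max_iter=10000)], 'svm__C': [0.01, 1.0]},
    {'svm': [SVC(random_state=0)], 'svm__kernel': ['linear'], 'svm__C': [0.01, 1.0]},
]

gscv_demo = GridSearchCV(pipe_demo, grid_demo, cv=3)
gscv_demo.fit(X_demo, y_demo)

# Both estimator classes appear among the evaluated candidates.
tried = {type(params['svm']).__name__ for params in gscv_demo.cv_results_['params']}
print(sorted(tried))
print(type(gscv_demo.best_estimator_.named_steps['svm']).__name__)
```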

### Linear kernel

We perform a cross-validated grid search.

In [17]:
Cs = np.logspace(-4, 4, num=2 * 8 + 1)  # C defaults to 1.0.

# Set up the cross-validated grid search.
grid = [
    # LinearSVC (minimize: squared hinge loss, strategy: one-vs-rest)
    {
        'svm': [LinearSVC(random_state=RANDOM_SEED)],
        'svm__C': Cs,
        'svm__class_weight': [None, 'balanced']
    },
    # SVC (kernel: linear, minimize: hinge loss, strategy: one-vs-one)
    {
        'svm': [SVC(random_state=RANDOM_SEED)],
        'svm__kernel': ['linear'],
        'svm__C': Cs,
        'svm__class_weight': [None, 'balanced']
    }
]

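# With 10 splits the model can be trained on 90% of the data points.
# Note: `random_state` only has an effect with `shuffle=True` (recent scikit-learn
# versions reject it otherwise) and `iid` was removed in scikit-learn 0.24, so both
# are omitted here; the folds are deterministic without shuffling.
cv = StratifiedKFold(n_splits=10)
svm_gscv = GridSearchCV(svm_pipe, grid, n_jobs=-1, refit=True, cv=cv, return_train_score=True)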

In [18]:
# Fit/evaluate the estimator.
svm_gscv.fit(F_train_large, y_train_large);
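The two grid entries differ not only in the loss but also in the multiclass strategy. With 6 classes, one-vs-rest trains 6 binary classifiers, whereas one-vs-one trains 6 * 5 / 2 = 15. The sketch below makes this visible on synthetic 6-class data (an assumption, not the project's features); note that `SVC` always trains one-vs-one internally and `decision_function_shape='ovo'` merely exposes the pairwise scores.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

n_classes = 6

# Synthetic 6-class problem (assumption, stands in for the real features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=n_classes,
    n_clusters_per_class=1, random_state=0
)

# One-vs-rest: one weight vector per class.
ovr = LinearSVC(random_state=0, max_iter=10000).fit(X_demo, y_demo)
n_ovr = ovr.coef_.shape[0]

# One-vs-one: one decision value per pair of classes.
ovo = SVC(kernel='linear', decision_function_shape='ovo', random_state=0).fit(X_demo, y_demo)
n_ovo = ovo.decision_function(X_demo[:1]).shape[1]

print(n_ovr, n_ovo)  # 6 15
```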

We convert the results to a data frame.

In [19]:
df_results = (pd
    .DataFrame(data={
        'svm_class': np.where(svm_gscv.cv_results_['param_svm__kernel'].mask, 'LinearSVC', 'SVC'),
        'kernel': svm_gscv.cv_results_['param_svm__kernel'],
        'C': svm_gscv.cv_results_['param_svm__C'],
        'class_weight': svm_gscv.cv_results_['param_svm__class_weight'],
        'mean_train_score': svm_gscv.cv_results_['mean_train_score'],
        'mean_test_score': svm_gscv.cv_results_['mean_test_score'],
        'std_test_score': svm_gscv.cv_results_['std_test_score'],
        'params': svm_gscv.cv_results_['params']
    })
    .sort_values(by='mean_test_score', ascending=False)
)

In [20]:
df_results.loc[:, 'svm_class':'std_test_score'].head()

| | svm_class | kernel | C | class_weight | mean_train_score | mean_test_score | std_test_score |
|---|---|---|---|---|---|---|---|
| 38 | SVC | linear | 0.001 | None | 0.956515 | 0.92122 | 0.029588 |
| 43 | SVC | linear | 0.01 | balanced | 0.998145 | 0.920914 | 0.038537 |
| 39 | SVC | linear | 0.001 | balanced | 0.971624 | 0.918787 | 0.027708 |
| 40 | SVC | linear | 0.00316228 | None | 0.986737 | 0.91872 | 0.026391 |
| 42 | SVC | linear | 0.01 | None | 0.995227 | 0.918221 | 0.044201 |


In [21]:
accuracy_test = svm_gscv.best_estimator_.score(F_test, y_test)
print(f'test accuracy: {accuracy_test * 100:.1f} %')

test accuracy: 96.0 %


In [22]:
# Persist the result.
desc = ', '.join([f'{key}={df_results.iloc[0][key]}' for key in ['svm_class', 'kernel', 'C', 'class_weight']])
utils.persist_result('svm linear', 'part-06-b', desc, accuracy_test)

### RBF kernel

We perform a cross-validated grid search.

In [23]:
gammas = [0.01, 0.1, 1.0, 10.0, 'scale']

# Set up the cross-validated grid search.
grid = [
    # SVC (kernel: RBF, minimize: hinge loss, strategy: one-vs-one)
    {
        'svm': [SVC(random_state=RANDOM_SEED)],
        'svm__kernel': ['rbf'],
        'svm__C': Cs,
        'svm__gamma': gammas,
        'svm__class_weight': [None, 'balanced']
    }
]

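# With 10 splits the model can be trained on 90% of the data points.
# Note: `random_state` only has an effect with `shuffle=True` (recent scikit-learn
# versions reject it otherwise) and `iid` was removed in scikit-learn 0.24, so both
# are omitted here; the folds are deterministic without shuffling.
cv = StratifiedKFold(n_splits=10)
svm_gscv = GridSearchCV(svm_pipe, grid, n_jobs=-1, refit=True, cv=cv, return_train_score=True)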

In [24]:
# Fit/evaluate the estimator.
svm_gscv.fit(F_train_large, y_train_large);

We convert the results to a data frame.

In [25]:
df_results = (pd
    .DataFrame(data={
        'svm_class': np.where(svm_gscv.cv_results_['param_svm__kernel'].mask, 'LinearSVC', 'SVC'),
        'kernel': svm_gscv.cv_results_['param_svm__kernel'],
        'C': svm_gscv.cv_results_['param_svm__C'],
        'gamma': svm_gscv.cv_results_['param_svm__gamma'],
        'class_weight': svm_gscv.cv_results_['param_svm__class_weight'],
        'mean_train_score': svm_gscv.cv_results_['mean_train_score'],
        'mean_test_score': svm_gscv.cv_results_['mean_test_score'],
        'std_test_score': svm_gscv.cv_results_['std_test_score'],
        'params': svm_gscv.cv_results_['params']
    })
    .sort_values(by='mean_test_score', ascending=False)
)

In [26]:
df_results.loc[:, 'svm_class':'std_test_score'].head()

| | svm_class | kernel | C | gamma | class_weight | mean_train_score | mean_test_score | std_test_score |
|---|---|---|---|---|---|---|---|---|
| 169 | SVC | rbf | 10000.0 | scale | balanced | 1.0 | 0.918667 | 0.036464 |
| 104 | SVC | rbf | 10.0 | scale | None | 1.0 | 0.918667 | 0.036464 |
| 129 | SVC | rbf | 100.0 | scale | balanced | 1.0 | 0.918667 | 0.036464 |
| 119 | SVC | rbf | 31.6228 | scale | balanced | 1.0 | 0.918667 | 0.036464 |
| 134 | SVC | rbf | 316.228 | scale | None | 1.0 | 0.918667 | 0.036464 |


In [27]:
accuracy_test = svm_gscv.best_estimator_.score(F_test, y_test)
print(f'test accuracy: {accuracy_test * 100:.1f} %')

test accuracy: 90.0 %


In [28]:
# Persist the result.
desc = ', '.join([f'{key}={df_results.iloc[0][key]}' for key in ['svm_class', 'kernel', 'C', 'gamma', 'class_weight']])
utils.persist_result('svm rbf', 'part-06-c', desc, accuracy_test)

**Q: Does the RBF kernel perform better than the linear one?**

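A: In this case it does not. The best RBF configurations reach a mean cross-validation accuracy of 0.919, which is on par with the best linear SVC (0.921), but their test accuracy is lower (90.0 % versus 96.0 %, i.e. 45 versus 48 of the 50 test images). Their mean training accuracy of 1.0 also suggests that they fit the training folds perfectly.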
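All top RBF configurations use `gamma='scale'`, which scikit-learn defines as `1 / (n_features * X.var())`. How that compares with the fixed candidates (0.01 to 10.0) depends on the variance of the features, which is not shown in this notebook. The sketch below therefore uses random 1280-dimensional vectors `F_demo` as a purely hypothetical stand-in; it only illustrates that a gamma which is too large for the data drives the kernel between distinct points to zero, so that each point is only similar to itself.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 1280-dimensional features (assumption).
F_demo = rng.random((20, 1280))

# Value scikit-learn uses for gamma='scale'.
gamma_scale = 1.0 / (F_demo.shape[1] * F_demo.var())

K_scale = rbf_kernel(F_demo, gamma=gamma_scale)
K_fixed = rbf_kernel(F_demo, gamma=1.0)

# Look at similarities between distinct points only.
off_diag = ~np.eye(len(F_demo), dtype=bool)
print(f'gamma_scale = {gamma_scale:.4f}')
print(f'mean off-diagonal similarity (scale): {K_scale[off_diag].mean():.3f}')
print(f'max off-diagonal similarity (gamma=1): {K_fixed[off_diag].max():.3g}')
```

Running the same comparison on `F_train_large` would show whether the fixed gamma values are in a sensible range for the real features.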