# Course 4 - Project - Part 6: Nonlinear classifiers

<a name="top-6"></a>
This notebook is concerned with *Part 6: Nonlinear classifiers*.

**Contents:**
* [Step 0: Loading data](#step-6.0)
* [Step 1: Try with random forests](#step-6.1)
* [Step 2: Try with SVMs](#step-6.2)

## Step 0: Loading data<a name="step-6.0"></a> ([top](#top-6))
---

We load the training, validation and test sets together with the extracted high-level features.

In [1]:
# Standard library.
import os
import pathlib
import typing as T

# 3rd party.
import numpy as np

# Project.
import utils

In [2]:
separator = ''.center(80, '-')

path_train = pathlib.Path.cwd() / 'data' / 'swissroads-features-train.npz'
data_train = utils.load(path_train)
print(separator)
print(f'Dataset: train\n{utils.info(data_train)}')

path_valid = pathlib.Path.cwd() / 'data' / 'swissroads-features-valid.npz'
data_valid = utils.load(path_valid)
print(separator)
print(f'Dataset: valid\n{utils.info(data_valid)}')

path_test = pathlib.Path.cwd() / 'data' / 'swissroads-features-test.npz'
data_test = utils.load(path_test)
print(separator)
print(f'Dataset: test\n{utils.info(data_test)}')

--------------------------------------------------------------------------------
Dataset: train
data: shape=(280, 224, 224, 3), dtype=float32
label_idxs: shape=(280,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(280,), dtype=<U19
features: shape=(280, 1280), dtype=float32
--------------------------------------------------------------------------------
Dataset: valid
data: shape=(139, 224, 224, 3), dtype=float32
label_idxs: shape=(139,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(139,), dtype=<U19
features: shape=(139, 1280), dtype=float32
--------------------------------------------------------------------------------
Dataset: test
data: shape=(50, 224, 224, 3), dtype=float32
label_idxs: shape=(50,), dtype=int64
label_strs: shape=(6,), dtype=<U10
names: shape=(50,), dtype=<U19
features: shape=(50, 1280), dtype=float32


In [3]:
label_strs = data_train['label_strs']  # Same for all data sets.
assert (
    np.all(data_train['label_strs'] == data_valid['label_strs']) and
    np.all(data_train['label_strs'] == data_test['label_strs'])
)

X_train = data_train['data']
y_train = data_train['label_idxs']
F_train = data_train['features']
N_train = data_train['names']

X_valid = data_valid['data']
y_valid = data_valid['label_idxs']
F_valid = data_valid['features']
N_valid = data_valid['names']  # Names must come from the validation set, not the training set.

X_test = data_test['data']
y_test = data_test['label_idxs']
F_test = data_test['features']
N_test = data_test['names']

## Step 1: Try with random forests<a name="step-6.1"></a> ([top](#top-6))
---

In [4]:
# 3rd party.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

We want to use a random forest classifier.

In this part it makes sense to merge the training and the validation sets for cross-validation (since we would not make use of the validation set otherwise).

In [5]:
X_train_large = np.concatenate([X_train, X_valid])
y_train_large = np.concatenate([y_train, y_valid])
F_train_large = np.concatenate([F_train, F_valid])
N_train_large = np.concatenate([N_train, N_valid])

Note that the classes are not balanced.

In [6]:
df_counts = (pd
 .DataFrame(data=pd.Series(data=y_train_large).value_counts(), columns=['count'])
 .set_index(label_strs)
)
df_counts['fraction'] = df_counts['count'] / df_counts['count'].sum()
df_counts.style.format({'fraction': '{:.2%}'})  # Format only the fraction column as a percentage.

Unnamed: 0,count,fraction
bike,99,23.63%
car,96,22.91%
motorcycle,76,18.14%
other,63,15.04%
truck,48,11.46%
van,37,8.83%


We create a random forest classifier.

In [7]:
RANDOM_STATE = 0

# Create the estimator.
rf = RandomForestClassifier(random_state=RANDOM_STATE)  # Use defaults.

We perform a cross-validated grid search.

In [8]:
# Set up the cross-validated grid search.
grid = {
    'n_estimators': [1, 5, 10, 100, 200],
    'max_depth': list(range(1, 10 + 1)) + [None],  # 1, 2, ..., 10, None
    'class_weight': [None, 'balanced', 'balanced_subsample']
}

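# The folds are not shuffled, so no random_state is passed
# (recent scikit-learn versions reject random_state when shuffle=False).
cv = StratifiedKFold(n_splits=10)
# The iid parameter was removed in scikit-learn 0.24; omitting it matches the former iid=False behaviour.
rf_gscv = GridSearchCV(rf, grid, n_jobs=-1, refit=True, cv=cv, return_train_score=True)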
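
The grid includes `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * count)`. As a minimal sketch of what this means here, the weights can be computed by hand from the class counts of the merged training and validation set (the counts are copied from the table in this notebook; the variable names are only illustrative):

```python
import numpy as np

# Class counts of the merged training + validation set (copied from the counts table).
label_strs = ['bike', 'car', 'motorcycle', 'other', 'truck', 'van']
counts = np.array([99, 96, 76, 63, 48, 37])

# 'balanced' heuristic: n_samples / (n_classes * count_per_class).
weights = counts.sum() / (len(counts) * counts)

for label, weight in zip(label_strs, weights):
    print(f'{label:<12}{weight:.3f}')
```

Rare classes such as `van` receive a weight above 1 and frequent classes such as `bike` a weight below 1, so every class contributes the same total weight. The `'balanced_subsample'` option applies the same formula to the bootstrap sample of each tree.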

In [9]:
# Fit/evaluate the estimator.
rf_gscv.fit(F_train_large, y_train_large);

We convert the results into a data frame.

In [10]:
# Collect results in a data frame.
df_results = (pd
    .DataFrame({
        'n_estimators': rf_gscv.cv_results_['param_n_estimators'],
        'max_depth': rf_gscv.cv_results_['param_max_depth'],
        'class_weight': rf_gscv.cv_results_['param_class_weight'],
        'mean_train_score': rf_gscv.cv_results_['mean_train_score'],
        'mean_test_score': rf_gscv.cv_results_['mean_test_score'],
        'std_test_score': rf_gscv.cv_results_['std_test_score'],
        'params': rf_gscv.cv_results_['params']
    })
    .sort_values(by='mean_test_score', ascending=False)
)

In [11]:
df_results.loc[:, 'n_estimators':'std_test_score'].head()

Unnamed: 0,n_estimators,max_depth,class_weight,mean_train_score,mean_test_score,std_test_score
73,100,4,balanced,0.990723,0.91276,0.036302
148,100,8,balanced_subsample,1.0,0.909814,0.035242
79,200,5,balanced,0.996561,0.909762,0.040955
83,100,6,balanced,0.999473,0.909217,0.037823
74,200,4,balanced,0.990192,0.90739,0.025099


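**Comment:** The best configuration reaches a mean cross-validated accuracy of 91.3 % (`mean_test_score`). This score is computed on the cross-validation folds, not on the held-out test set.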
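
Since the search uses `refit=True`, the best configuration is retrained on all the data passed to `fit`, and the fitted search object can then be scored on data it has never seen. The following is a minimal, self-contained sketch on synthetic data (an assumption made so that the block runs on its own; it is not the project data, and the small grid is only illustrative). In this notebook the analogous call would presumably be `rf_gscv.score(F_test, y_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the extracted features (assumption: NOT the project data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_cv, X_holdout, y_cv, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Small illustrative grid; refit=True retrains the best candidate on all of X_cv.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 50], 'max_depth': [4, None]},
    cv=StratifiedKFold(n_splits=5),
    refit=True,
)
search.fit(X_cv, y_cv)

cv_accuracy = search.best_score_                        # Mean accuracy over the CV folds.
holdout_accuracy = search.score(X_holdout, y_holdout)   # Accuracy on unseen data.

print(f'Best parameters:   {search.best_params_}')
print(f'Mean CV accuracy:  {cv_accuracy:.3f}')
print(f'Hold-out accuracy: {holdout_accuracy:.3f}')
```

The two numbers generally differ: the cross-validated score was used to pick the hyperparameters and therefore tends to be slightly optimistic.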

## Step 2: Try with SVMs<a name="step-6.2"></a> ([top](#top-6))
---

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

We want to tune the regularization strength `C` of the SVM classifiers (and `gamma` for the RBF kernel) with a cross-validated grid search.

**Note:** We have imbalanced classes (e.g. 23.63% bike vs. 8.83% van).

In [13]:
df_counts = (pd
 .DataFrame(data=pd.Series(data=y_train_large).value_counts(), columns=['count'])
 .set_index(label_strs)
)
df_counts['fraction'] = df_counts['count'] / df_counts['count'].sum()
df_counts.style.format({'fraction': '{:.2%}'})

Unnamed: 0,count,fraction
bike,99,23.63%
car,96,22.91%
motorcycle,76,18.14%
other,63,15.04%
truck,48,11.46%
van,37,8.83%


In [15]:
# Create the estimator.
svm_pipe = Pipeline([
    ('svm', LinearSVC(random_state=RANDOM_STATE)),
])

In [16]:
Cs = np.logspace(-4, 4, num=2 * 8 + 1)  # C defaults to 1.0.
gammas = [0.01, 0.1, 1.0, 10.0, 'scale']

# Set up the cross-validated grid search.
grid = [
    # LinearSVC (minimize: squared hinge loss, strategy: one-vs-rest)
    {
        'svm__C': Cs,
        'svm__class_weight':[None, 'balanced']
    },
    # SVC (kernel: linear, minimize: hinge loss, strategy: one-vs-one)
    {
        'svm': [SVC(random_state=RANDOM_STATE)],
        'svm__kernel': ['linear'],
        'svm__C': Cs,
        'svm__class_weight':[None, 'balanced']
    },
    # SVC (kernel: RBF, minimize: hinge loss, strategy: one-vs-one)
    {
        'svm': [SVC(random_state=RANDOM_STATE)],
        'svm__kernel': ['rbf'],
        'svm__C': Cs,
        'svm__gamma': gammas,
        'svm__class_weight':[None, 'balanced']
    }
]

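# The folds are not shuffled, so no random_state is passed
# (recent scikit-learn versions reject random_state when shuffle=False).
cv = StratifiedKFold(n_splits=10)
# The iid parameter was removed in scikit-learn 0.24; omitting it matches the former iid=False behaviour.
svm_gscv = GridSearchCV(svm_pipe, grid, n_jobs=-1, refit=True, cv=cv, return_train_score=True)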
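
A grid given as a list of dictionaries is expanded one dictionary at a time, and the candidates of all sub-grids are concatenated. As a rough check of how large this search is, the sketch below rebuilds the same grid and counts the candidates with `ParameterGrid` (the `SVC()` instances here only stand in for the estimators above; their settings do not affect the count):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

Cs = np.logspace(-4, 4, num=2 * 8 + 1).tolist()  # 17 values, two per decade.
gammas = [0.01, 0.1, 1.0, 10.0, 'scale']
class_weights = [None, 'balanced']

grid = [
    # LinearSVC (the default 'svm' step of the pipeline).
    {'svm__C': Cs, 'svm__class_weight': class_weights},
    # SVC with a linear kernel.
    {'svm': [SVC()], 'svm__kernel': ['linear'], 'svm__C': Cs,
     'svm__class_weight': class_weights},
    # SVC with an RBF kernel.
    {'svm': [SVC()], 'svm__kernel': ['rbf'], 'svm__C': Cs,
     'svm__gamma': gammas, 'svm__class_weight': class_weights},
]

sizes = [len(ParameterGrid(sub_grid)) for sub_grid in grid]
n_candidates = len(ParameterGrid(grid))
print(sizes, n_candidates)  # [34, 34, 170] 238
```

With 10 folds this amounts to 2380 fits, plus one final refit of the best candidate.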

In [17]:
# Fit/evaluate the estimator.
svm_gscv.fit(F_train_large, y_train_large);

In [18]:
# Collect results in a data frame.
df_results = (pd
    .DataFrame({
        'svm': svm_gscv.cv_results_['param_svm'],
        'kernel': svm_gscv.cv_results_['param_svm__kernel'],
        'C': svm_gscv.cv_results_['param_svm__C'],
        'gamma': svm_gscv.cv_results_['param_svm__gamma'],
        'class_weight': svm_gscv.cv_results_['param_svm__class_weight'],
        'mean_train_score': svm_gscv.cv_results_['mean_train_score'],
        'mean_test_score': svm_gscv.cv_results_['mean_test_score'],
        'std_test_score': svm_gscv.cv_results_['std_test_score'],
        'params': svm_gscv.cv_results_['params']
    })
    .sort_values(by='mean_test_score', ascending=False)
)

In [19]:
df_results.head()

Unnamed: 0,svm,kernel,C,gamma,class_weight,mean_train_score,mean_test_score,std_test_score,params
157,"SVC(C=1.0, cache_size=200, class_weight='balan...",rbf,1.0,scale,balanced,0.985679,0.926124,0.018708,"{'svm': SVC(C=1.0, cache_size=200, class_weigh..."
167,"SVC(C=1.0, cache_size=200, class_weight='balan...",rbf,3.16228,scale,balanced,0.998411,0.923265,0.040387,"{'svm': SVC(C=1.0, cache_size=200, class_weigh..."
162,"SVC(C=1.0, cache_size=200, class_weight='balan...",rbf,3.16228,scale,,0.995227,0.923265,0.041646,"{'svm': SVC(C=1.0, cache_size=200, class_weigh..."
38,"SVC(C=1.0, cache_size=200, class_weight=None, ...",linear,0.001,,,0.956515,0.92122,0.029588,"{'svm': SVC(C=1.0, cache_size=200, class_weigh..."
152,"SVC(C=1.0, cache_size=200, class_weight='balan...",rbf,1.0,scale,,0.986479,0.920992,0.033499,"{'svm': SVC(C=1.0, cache_size=200, class_weigh..."
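
Most of the top candidates use `gamma='scale'`. Assuming scikit-learn 0.22 or later, this setting corresponds to `1 / (n_features * X.var())`, so the effective kernel width adapts to the spread of the features. The sketch below checks this on random data (an assumption for illustration; it does not use the project features):

```python
import numpy as np
from sklearn.svm import SVC

# Random stand-in data (assumption: NOT the project features).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Value that gamma='scale' is documented to use (scikit-learn >= 0.22).
gamma_scale = 1.0 / (X.shape[1] * X.var())

svc_scale = SVC(kernel='rbf', gamma='scale').fit(X, y)
svc_manual = SVC(kernel='rbf', gamma=gamma_scale).fit(X, y)

# Both models should produce the same decision values.
same = np.allclose(svc_scale.decision_function(X), svc_manual.decision_function(X))
print(f'gamma_scale={gamma_scale:.4f}, identical decision values: {same}')
```

For the 1280-dimensional features used here, the value of `'scale'` depends on the overall feature variance, which may explain why the fixed values in `gammas` perform worse.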
