# Supervised models for feature vectors

According to the literature, the supervised models most often used for trajectory classification are SVM and KNN. In this notebook we test the effectiveness of these models and of several others.

First, we load the feature vectors, in which each trajectory is described by its characteristics.

In [5]:
import feature_vec as fv

metadata = fv.get_selected_data()
feat_vectors, clss_mask, clss = fv.get_feat_vectors(metadata)

100.00%

Now we split the data: 70% for model training, and the remaining 30% is divided again, 85% of it for validation and 15% of it for testing.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(feat_vectors, clss, stratify=clss, 
                                                  random_state = 0, test_size=0.30)

X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, stratify=y_test, 
                                                  random_state = 0, test_size=0.15)
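
Note that the second call splits the held-out part again, so its `test_size=0.15` is a fraction of that 30%, not of the whole dataset. A quick sketch on synthetic stand-in data (`X_demo` and `y_demo` are made up for illustration, they are not the trajectory features) shows the proportions that result, roughly 70% / 25.5% / 4.5%:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature vectors: 1000 samples, 5 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.repeat(np.arange(5), 200)

# Same two-step split as in the cell above.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0, test_size=0.30)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, stratify=y_hold, random_state=0, test_size=0.15)

# Fraction of the whole dataset that ends up in each subset.
n = len(X_demo)
fractions = {"train": len(X_tr) / n, "val": len(X_va) / n, "test": len(X_te) / n}
print(fractions)
```

If a larger test set is wanted, the second `test_size` can be raised (for example `0.5` would give 15% validation and 15% test).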

## K-Nearest Neighbors

In [7]:
from sklearn.neighbors import KNeighborsClassifier

With the default number of neighbors (`n_neighbors=5`).

In [8]:
knn = KNeighborsClassifier(weights='distance')
knn.fit(X_train, y_train)
knn.score(X_val, y_val)

# 0.70

0.7054726368159204

In our experiments, setting the `weights` parameter to `distance` gave better results than `uniform`.
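
A comparison of this kind could be run as in the sketch below. It uses synthetic data from `make_classification` as a stand-in (not the trajectory features), so the scores it prints say nothing about the real dataset; it only shows the procedure of fitting both settings and scoring them on the same held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: 3 classes, 10 features).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=6,
                                     n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0, test_size=0.30)

# Fit one classifier per weighting scheme and keep its validation accuracy.
weight_scores = {}
for weights in ("uniform", "distance"):
    model = KNeighborsClassifier(weights=weights)
    model.fit(X_tr, y_tr)
    weight_scores[weights] = model.score(X_va, y_va)

print(weight_scores)
```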

With the number of neighbors set to 20.

In [9]:
knn = KNeighborsClassifier(weights='distance', n_neighbors=20)
knn.fit(X_train, y_train)
knn.score(X_val, y_val)

# 0.72

0.7223880597014926
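
KNN compares samples by distance, so features measured on larger scales can dominate the neighbor search. The sketch below, on synthetic data where one feature is deliberately inflated, shows how the same classifier could be wrapped with `StandardScaler`. Whether scaling helps on the trajectory features has not been tested here; this is only an idea to try.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the first feature is blown up to mimic mixed units.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=6,
                                     n_classes=3, random_state=0)
X_demo[:, 0] *= 1000.0
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0, test_size=0.30)

# Same KNN settings as above, without and with standardization.
raw_knn = KNeighborsClassifier(weights='distance', n_neighbors=20)
raw_knn.fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(weights='distance', n_neighbors=20))
scaled_knn.fit(X_tr, y_tr)

raw_score = raw_knn.score(X_va, y_va)
scaled_score = scaled_knn.score(X_va, y_va)
print(f"raw: {raw_score:0.4f}  scaled: {scaled_score:0.4f}")
```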

As we can see, increasing the number of neighbors does not improve the accuracy by much. Let's try the other recommended model, the SVM.

## Support Vector Machine

In [10]:
from sklearn.svm import SVC

To work with SVM, we standardize the feature values and use a pipeline to pass the scaled values to the model.

In [11]:
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
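
A pipeline fits the scaler on the training data only and reuses those statistics at prediction time. The minimal sketch below, on synthetic data (an assumption, not the trajectory features), checks that `make_pipeline` gives the same predictions as scaling by hand:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: first 200 rows for training, the rest as "new" data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_new = X_demo[:200], X_demo[200:]
y_tr = y_demo[:200]

# Manual version: learn mean/std on the training rows, apply them to both sets.
scaler = StandardScaler().fit(X_tr)
manual_svc = SVC(kernel='rbf', gamma='auto', random_state=0)
manual_svc.fit(scaler.transform(X_tr), y_tr)
manual_pred = manual_svc.predict(scaler.transform(X_new))

# Pipeline version: the same two steps chained into one estimator.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='auto', random_state=0))
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_new)

same_predictions = bool(np.array_equal(manual_pred, pipe_pred))
print(same_predictions)
```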

Let's try a polynomial kernel of degree 3.

In [12]:
svm = make_pipeline(StandardScaler(), SVC(kernel='poly', degree=3, gamma='scale', random_state = 0))
svm.fit(X_train, y_train)
svm.score(X_val, y_val)

# 0.71

0.7114427860696517

And now with a sigmoid kernel.

In [13]:
svm = make_pipeline(StandardScaler(), SVC(kernel='sigmoid', gamma='auto', random_state = 0))
svm.fit(X_train, y_train)
svm.score(X_val, y_val)

# 0.75

0.7522388059701492

Got better :)

Finally, let's try it with an rbf kernel.

In [14]:
from sklearn.decomposition import PCA
svm = make_pipeline(StandardScaler(), SVC(kernel = 'rbf', gamma='auto', probability=True, random_state = 0))
svm.fit(X_train, y_train)
svm.score(X_val, y_val)

# 0.83

0.83681592039801

Let's evaluate this model on the test set.

In [15]:
from sklearn.metrics import accuracy_score, roc_auc_score

acc_score = accuracy_score(y_test, y_pred=svm.predict(X_test))
auc_score = roc_auc_score(y_test, svm.predict_proba(X_test), multi_class='ovr')
print(f"Accuracy: {acc_score:0.4f}")
print(f"AUC: {auc_score:0.4f}")

Accuracy: 0.8371
AUC: 0.9579
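
Accuracy and AUC summarize all classes at once. A per-class view can show which classes get confused with each other. The sketch below uses `confusion_matrix` and `classification_report` from scikit-learn on synthetic stand-in data (an assumption; three made-up classes rather than the real labels):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 3 classes.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=6,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0, test_size=0.30)

clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='auto', random_state=0))
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Precision, recall and F1 for each class.
print(classification_report(y_te, y_hat))
```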


It is the best we have achieved so far. This is not a bad result, but we will try other classic supervised models.

## Decision Tree Classifier

Let's use a basic decision tree.

In [16]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(criterion='entropy', random_state = 0)
dtc.fit(X_train, y_train)
dtc.score(X_val, y_val)

# 0.83

0.8308457711442786

Oh, this model looks good. What if we put it on steroids?

## Random Forest Classifier

In [17]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(criterion='entropy', max_features='log2', bootstrap=False, random_state=0)
rfc.fit(X_train, y_train)
rfc.score(X_val, y_val)

# 0.90

0.900497512437811

Let's review the most important features.

In [18]:
importances = list(zip(fv.feat_name, rfc.feature_importances_))
print(*sorted(importances, key=lambda x: -x[1]), sep="\n")

('iqr_velocity', 0.11873743232587498)
('mean_velocity', 0.1094967549799762)
('stop_rate', 0.08403357148109494)
('std_velocity', 0.06461460640924746)
('velocity_change_rate', 0.0642636588838411)
('iqr_turning_angle', 0.05501240686834602)
('iqr_heading_change_rate', 0.045699769843295976)
('distance', 0.043231525433121094)
('median_velocity', 0.03803674932408078)
('coef_var_velocity', 0.027088260485747584)
('max_velocity', 0.026846961766626053)
('min_angle', 0.01511989004105316)
('std_turning_angle', 0.014554040032265155)
('iqr_acc_change_rate', 0.014480041253424156)
('iqr_angle', 0.013657525779463916)
('max_turning_angle', 0.012944970402356369)
('var_turning_angle', 0.012377904291021914)
('std_heading_change_rate', 0.01227608026371222)
('min_acceleration', 0.012192524419444005)
('var_angle', 0.01215148449788078)
('max_acc_change_rate', 0.012071222539080226)
('std_acc_change_rate', 0.011838734025787962)
('min_acc_change_rate', 0.011285571139051938)
('max_acceleration', 0.01060314897607270
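
`feature_importances_` is computed from impurity reductions on the training data. As a cross-check, permutation importance on held-out data could be used. The sketch below applies `sklearn.inspection.permutation_importance` to a forest trained on synthetic data; the feature names `f0`, `f1`, ... are placeholders, not the names in `fv.feat_name`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder feature names.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)
names = [f"f{i}" for i in range(X_demo.shape[1])]
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0, test_size=0.30)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

# Shuffle each feature in turn on the validation set and measure the accuracy drop.
perm = permutation_importance(forest, X_va, y_va, n_repeats=5, random_state=0)
ranking = sorted(zip(names, perm.importances_mean), key=lambda x: -x[1])
print(*ranking, sep="\n")
```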

That looks great, let's explore some combinations.

In [19]:
pipe1 = Pipeline([('pca', PCA(n_components = 15)), 
                     ('Random_Forest', 
                      RandomForestClassifier(criterion='entropy', max_features='log2', bootstrap=False, random_state=0))])

pipe1.fit(X_train, y_train)
pipe1.score(X_val, y_val)

# 0.80 :(

0.8049751243781095
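
One possible reason for the drop is that 15 components do not retain enough of the information in the features; PCA is also sensitive to feature scale, so it is usually applied after standardization. The sketch below, on synthetic data (an assumption, not the trajectory features), inspects the cumulative explained variance to choose `n_components`. The 95% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 30 features.
X_demo, _ = make_classification(n_samples=500, n_features=30, n_informative=10,
                                random_state=0)

# Standardize first so that no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_scaled)

# Smallest number of components whose cumulative variance reaches 95%.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_for_95)
```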

In [20]:
pipe2 = Pipeline([('ste', StandardScaler()),
                     ('Random_Forest', 
                      RandomForestClassifier(criterion='entropy', max_features='log2', bootstrap=False, random_state=0))])

pipe2.fit(X_train, y_train)
pipe2.score(X_val, y_val)

# 0.89 :)

0.8955223880597015

### Looking for a good combination of hyperparameters

Grid Search based on out-of-bag score

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid

import pandas as pd

param_grid = ParameterGrid(
                {'n_estimators': [100, 150, 200],
                 'max_features': [5, 7, 9, 12, 20],
                 'max_depth'   : [None, 3, 10, 20],
                 'criterion'   : ['gini', 'entropy']
                }
            )

results = {'params': [], 'oob_accuracy': []}

for params in param_grid:
    
    model = RandomForestClassifier(
                oob_score    = True,
                n_jobs       = -1,
                random_state = 0,
                ** params
             )
    
    model.fit(X_train, y_train)
    
    results['params'].append(params)
    results['oob_accuracy'].append(model.oob_score_)
    print(f"Model: {params} \u2713")

results = pd.DataFrame(results)
results = pd.concat([results, results['params'].apply(pd.Series)], axis=1)
results = results.sort_values('oob_accuracy', ascending=False)
results = results.drop(columns = 'params')
results.head(5)

Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 5, 'n_estimators': 100} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 5, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 5, 'n_estimators': 200} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'n_estimators': 100} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 7, 'n_estimators': 200} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 9, 'n_estimators': 100} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 9, 'n_estimators': 150} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 9, 'n_estimators': 200} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 12, 'n_estimators': 100} ✓
Model: {'criterion': 'gini', 'max_depth': None, 'max_features': 12, 'n_estimators': 150} ✓
Model: {

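|     | oob_accuracy | criterion | max_depth | max_features | n_estimators |
|-----|--------------|-----------|-----------|--------------|--------------|
| 62  | 0.892029     | entropy   | None      | 5            | 200          |
| 107 | 0.892029     | entropy   | 20.0      | 5            | 200          |
| 104 | 0.891304     | entropy   | 10.0      | 20           | 200          |
| 112 | 0.891304     | entropy   | 20.0      | 9            | 150          |
| 67  | 0.891304     | entropy   | None      | 9            | 150          |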


Grid Search based on cross validation

In [None]:
from sklearn.model_selection import RepeatedKFold
import multiprocessing

param_grid = {'n_estimators': [150, 200],
            'max_features': [5, 7, 9, 15, 25],
            'max_depth'   : [None, 3, 10, 20, 30],
            'criterion'   : ['gini', 'entropy']
            }

grid = GridSearchCV(
        estimator  = RandomForestClassifier(random_state = 0),
        param_grid = param_grid,
        scoring    = 'accuracy',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0), 
        refit      = True,
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

results = pd.DataFrame(grid.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head(4)
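
Once a search like this finishes, `best_params_` and `best_score_` hold the winning combination, and with `refit=True` the search object can be scored directly on the validation set. The reduced sketch below runs on synthetic data with a much smaller grid so that it finishes quickly. It swaps in `RepeatedStratifiedKFold` for the `RepeatedKFold` used above, on the assumption that keeping class proportions in each fold is desirable for classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)

# Synthetic stand-in data with 3 classes.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0, test_size=0.30)

# Deliberately tiny grid (illustrative values, not the ones used above).
small_grid = {'n_estimators': [20, 40],
              'max_features': [3, 5],
              'criterion': ['gini', 'entropy']}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    scoring='accuracy',
    cv=RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=0),
    refit=True,
)
search.fit(X_tr, y_tr)

print(search.best_params_, search.best_score_)

# With refit=True, score() uses the best model retrained on all of X_tr.
val_score = search.score(X_va, y_va)
print(val_score)
```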

## Neural Networks

In [30]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

model = keras.models.Sequential()

model.add(keras.layers.Dense(51, activation = 'sigmoid'))
model.add(keras.layers.Dense(300, activation = 'relu'))
model.add(keras.layers.Dense(100, activation = 'relu'))
model.add(keras.layers.Dense(5, activation= 'softmax'))

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
history = model.fit(np.array(X_train), np.array(y_train), epochs=30)

Epoch 1/30
...
Epoch 30/30


In [23]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
import multiprocessing

model1 = MLPClassifier(
                hidden_layer_sizes=(5,),
                learning_rate_init=0.01,
                solver = 'lbfgs',
                max_iter = 1000,
                random_state = 123
            )

model2 = MLPClassifier(
                hidden_layer_sizes=(10,),
                learning_rate_init=0.01,
                solver = 'lbfgs',
                max_iter = 1000,
                random_state = 123
            )

model3 = MLPClassifier(
                hidden_layer_sizes=(20, 20),
                learning_rate_init=0.01,
                solver = 'lbfgs',
                max_iter = 5000,
                random_state = 123
            )

model4 = MLPClassifier(
                hidden_layer_sizes=(50, 50, 50),
                learning_rate_init=0.01,
                solver = 'lbfgs',
                max_iter = 5000,
                random_state = 123
            )

model1.fit(X=X_train, y=y_train)
model2.fit(X=X_train, y=y_train)
model3.fit(X=X_train, y=y_train)
model4.fit(X=X_train, y=y_train)

ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MLPClassifier(hidden_layer_sizes=(50, 50, 50), learning_rate_init=0.01,
              max_iter=5000, random_state=123, solver='lbfgs')

In [24]:
print(model1.score(X_val, y_val),
      model2.score(X_val, y_val),
      model3.score(X_val, y_val),
      model4.score(X_val, y_val))

0.29850746268656714 0.21990049751243781 0.2407960199004975 0.07761194029850746
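
The `ABNORMAL_TERMINATION_IN_LNSRCH` warnings suggest scaling the data, and these networks were trained on unscaled features. The sketch below, on synthetic data with deliberately mismatched feature scales, wraps `MLPClassifier` in a pipeline with `StandardScaler` and uses the `adam` solver. The layer sizes and solver are illustrative choices, and whether this closes the gap on the trajectory features remains to be verified.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; three features are inflated to mimic mixed units.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=6,
                                     n_classes=3, random_state=0)
X_demo[:, :3] *= 1000.0
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0, test_size=0.30)

# Standardize inside the pipeline so the scaler is fitted on training data only.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20, 20), solver='adam',
                  max_iter=500, random_state=123),
)
mlp.fit(X_tr, y_tr)

mlp_score = mlp.score(X_va, y_va)
print(f"{mlp_score:0.4f}")
```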
