# Evaluating the quality of features selected by *tsfresh*

The goal of this notebook is to evaluate the quality of the features previously selected by the **tsfresh** library on the Depresjon dataset.
The following classifiers are tested:
* Random Forest
* Extra Trees
* SVMs

### Cross validation

As the dataset is quite small (55 samples), we use both cross-validation (5-fold and 10-fold) and a classical hold-out validation, with 20% of the dataset used for evaluation and the remaining 80% for training.
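
With only 55 samples, every test fold is tiny, so per-fold accuracy can take only a few discrete values. The sketch below works out the fold sizes, assuming the folds are made as even as possible (as scikit-learn's `KFold` does; stratified splitting may move a sample between folds). The helper name `fold_sizes` is illustrative and not part of the notebook.

```python
def fold_sizes(n_samples: int, n_folds: int) -> list:
    """Sizes of the test folds when samples are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds receive one additional sample.
    return [base + 1 if i < extra else base for i in range(n_folds)]

N_SAMPLES = 55  # size of the dataset used in this notebook

for k in (5, 10):
    sizes = fold_sizes(N_SAMPLES, k)
    # One misclassified sample changes a fold's accuracy by 1 / fold size.
    step = 1 / min(sizes)
    print(f'{k}-fold: test fold sizes {sizes}, accuracy step up to {step:.3f}')

# Single 80/20 split: the test part holds 11 of the 55 samples.
print(f'hold-out: 11 test samples, accuracy step {1 / 11:.3f}')
```

Under these assumptions a single misclassified sample moves a fold's accuracy by 9 to 20 percentage points, which is worth keeping in mind when reading the spread between folds.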

In [1]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier

def create_rf(n_trees):
    return RandomForestClassifier(n_estimators=n_trees)

In [2]:
ids = pd.read_csv('csv/ids.csv')
ids.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   number  55 non-null     object
 1   ill     55 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 1008.0+ bytes


In [3]:
names = ['all', 'night', 'day']
selected = {}
hourly_selected = {}

for n in names:
    selected[n] = pd.read_csv(f'csv/selected_{n}.csv')
    hourly_selected[n] = pd.read_csv(f'csv/selected_hourly_{n}.csv')

#### Random Forest classifier

With the Random Forest classifier, we use forests of 100, 500, and 1000 trees.

In [4]:
from sklearn.model_selection import cross_val_score

FOLDS = [5, 10]
TREES = [100, 500, 1000]
y = ids['ill']

def cross_validate(name: str, data: pd.DataFrame):
    """Print min/max/mean/std of the fold accuracies for each forest size and fold count."""
    for t in TREES:
        for f in FOLDS:
            rf = create_rf(t)
            print(f'[{name}] {f}-fold cross validation with Random Forest Classifier ({t} trees):')
            # With an integer cv and a classifier, scikit-learn uses stratified folds.
            scores = cross_val_score(rf, data, y, cv=f)
            print(f'Results:\n\tMin: {min(scores)}\n\tMax: {max(scores)}\n\tMean: {np.mean(scores)}\n\tStd: {np.std(scores)}')

In [5]:
for name, data in selected.items():
    cross_validate(name, data)

[all] 5-fold cross validation with Random Forest Classifier (100 trees):
Results:
	Min: 0.8181818181818182
	Max: 0.9090909090909091
	Mean: 0.8545454545454545
	Std: 0.04453617714151229
[all] 10-fold cross validation with Random Forest Classifier (100 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.8333333333333334
	Std: 0.18135294011647257
[all] 5-fold cross validation with Random Forest Classifier (500 trees):
Results:
	Min: 0.7272727272727273
	Max: 0.9090909090909091
	Mean: 0.8363636363636363
	Std: 0.06803013430498073
[all] 10-fold cross validation with Random Forest Classifier (500 trees):
Results:
	Min: 0.6
	Max: 1.0
	Mean: 0.8533333333333333
	Std: 0.13840359661351131
[all] 5-fold cross validation with Random Forest Classifier (1000 trees):
Results:
	Min: 0.7272727272727273
	Max: 0.9090909090909091
	Mean: 0.8363636363636363
	Std: 0.06803013430498073
[all] 10-fold cross validation with Random Forest Classifier (1000 trees):
Results:
	Min: 0.6
	Max: 1.0
	Mean: 0.8533333333333333
	Std: 0

In [6]:
for name, data in hourly_selected.items():
    cross_validate(name, data)

[all] 5-fold cross validation with Random Forest Classifier (100 trees):
Results:
	Min: 0.5454545454545454
	Max: 0.9090909090909091
	Mean: 0.7272727272727273
	Std: 0.128564869306645
[all] 10-fold cross validation with Random Forest Classifier (100 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.78
	Std: 0.18147543451754933
[all] 5-fold cross validation with Random Forest Classifier (500 trees):
Results:
	Min: 0.7272727272727273
	Max: 0.9090909090909091
	Mean: 0.7818181818181819
	Std: 0.07272727272727271
[all] 10-fold cross validation with Random Forest Classifier (500 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.76
	Std: 0.18903262505010432
[all] 5-fold cross validation with Random Forest Classifier (1000 trees):
Results:
	Min: 0.6363636363636364
	Max: 0.9090909090909091
	Mean: 0.7636363636363637
	Std: 0.09270944570168699
[all] 10-fold cross validation with Random Forest Classifier (1000 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.78
	Std: 0.18147543451754933
[night] 5-fold cross vali

#### Random Forest results (cross validation)
* Per-fold results from cross validation vary from 40% to 100% accuracy.
* Slightly better results were achieved with features extracted from the signal aggregated by hour.
* The best mean accuracy for a Random Forest classifier trained on the raw signal was **85.45%**.
* The best mean accuracy for a Random Forest classifier trained on the hourly-aggregated signal was **91%**.
* The standard deviation is quite high for both feature sets (raw and hourly-aggregated).
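
One way to get a steadier estimate on such a small dataset is to repeat the stratified split several times with different shuffles and average over all folds. The sketch below is not what the notebook ran: it uses synthetic data from `make_classification` as a stand-in for the selected features, and the fixed `random_state` values are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the selected features (55 samples, 14 columns).
X_demo, y_demo = make_classification(n_samples=55, n_features=14, random_state=0)

# 5 folds repeated 10 times with different shuffles gives 50 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X_demo, y_demo, cv=cv)

print(f'Mean: {np.mean(scores):.3f}  Std: {np.std(scores):.3f}  (n={len(scores)})')
```

The per-fold spread stays large because the folds are still small, but the mean over 50 folds depends less on one particular partition of the data.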

### Test/train split validation

In [7]:
from sklearn.model_selection import train_test_split

TEST_RATIO = .2

def validate(name: str, data: pd.DataFrame):
    for t in TREES:
        X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=TEST_RATIO)
        rf = create_rf(t)
        rf.fit(X_train, y_train)
        score = rf.score(X_test, y_test)
        print(f'[{name}] validation with Random Forest Classifier ({t} trees):')
        print(f'\tTrain size: {X_train.shape}\n\tTest size: {X_test.shape}')
        print(f'\tAccuracy score: {score}')

In [8]:
for name, data in selected.items():
    validate(name, data)

[all] validation with Random Forest Classifier (100 trees):
	Train size: (44, 14)
	Test size: (11, 14)
	Accuracy score: 0.9090909090909091
[all] validation with Random Forest Classifier (500 trees):
	Train size: (44, 14)
	Test size: (11, 14)
	Accuracy score: 1.0
[all] validation with Random Forest Classifier (1000 trees):
	Train size: (44, 14)
	Test size: (11, 14)
	Accuracy score: 0.9090909090909091
[night] validation with Random Forest Classifier (100 trees):
	Train size: (44, 47)
	Test size: (11, 47)
	Accuracy score: 0.7272727272727273
[night] validation with Random Forest Classifier (500 trees):
	Train size: (44, 47)
	Test size: (11, 47)
	Accuracy score: 0.6363636363636364
[night] validation with Random Forest Classifier (1000 trees):
	Train size: (44, 47)
	Test size: (11, 47)
	Accuracy score: 0.6363636363636364
[day] validation with Random Forest Classifier (100 trees):
	Train size: (44, 2)
	Test size: (11, 2)
	Accuracy score: 0.8181818181818182
[day] validation with Random Forest 

In [9]:
for name, data in hourly_selected.items():
    validate(name, data)

[all] validation with Random Forest Classifier (100 trees):
	Train size: (44, 24)
	Test size: (11, 24)
	Accuracy score: 0.9090909090909091
[all] validation with Random Forest Classifier (500 trees):
	Train size: (44, 24)
	Test size: (11, 24)
	Accuracy score: 0.7272727272727273
[all] validation with Random Forest Classifier (1000 trees):
	Train size: (44, 24)
	Test size: (11, 24)
	Accuracy score: 0.8181818181818182
[night] validation with Random Forest Classifier (100 trees):
	Train size: (44, 61)
	Test size: (11, 61)
	Accuracy score: 0.8181818181818182
[night] validation with Random Forest Classifier (500 trees):
	Train size: (44, 61)
	Test size: (11, 61)
	Accuracy score: 1.0
[night] validation with Random Forest Classifier (1000 trees):
	Train size: (44, 61)
	Test size: (11, 61)
	Accuracy score: 0.9090909090909091
[day] validation with Random Forest Classifier (100 trees):
	Train size: (44, 8)
	Test size: (11, 8)
	Accuracy score: 0.8181818181818182
[day] validation with Random Forest 

#### Random Forest results (test/train split)
* Results from the single test/train split vary from roughly 60% to 100% accuracy.
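
With 11 test samples, a single accuracy figure carries a wide margin of uncertainty. As a rough illustration, the sketch below computes a Wilson score interval (approximately 95%, `z = 1.96`) for a few possible outcomes on 11 samples. The helper `wilson_interval` is illustrative and not part of the notebook.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half

# Outcomes seen in the hold-out runs: 7, 9, 10 or 11 correct out of 11.
for correct in (7, 9, 10, 11):
    low, high = wilson_interval(correct, 11)
    print(f'{correct}/11 correct: accuracy {correct / 11:.3f}, interval [{low:.3f}, {high:.3f}]')
```

The intervals for different outcomes overlap heavily, so differences between runs or between forest sizes on one split should be read with caution.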

### Extra Trees classifier

In [10]:
from sklearn.ensemble import ExtraTreesClassifier

def create_et(n_trees):
    return ExtraTreesClassifier(n_estimators=n_trees)

def cross_validate(name: str, data: pd.DataFrame):
    for t in TREES:
        for f in FOLDS:
            clf = create_et(t)
            print(f'[{name}] {f}-fold cross validation with Extra Trees Classifier ({t} trees):')
            scores = cross_val_score(clf, data, y, cv=f)
            print(f'Results:\n\tMin: {min(scores)}\n\tMax: {max(scores)}\n\tMean: {np.mean(scores)}\n\tStd: {np.std(scores)}')
            
def validate(name: str, data: pd.DataFrame):
    for t in TREES:
        X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=TEST_RATIO)
        clf = create_et(t)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        print(f'[{name}] validation with Extra Trees Classifier ({t} trees):')
        print(f'\tTrain size: {X_train.shape}\n\tTest size: {X_test.shape}')
        print(f'\tAccuracy score: {score}')

In [11]:
for name, data in selected.items():
    cross_validate(name, data)

[all] 5-fold cross validation with Extra Trees Classifier (100 trees):
Results:
	Min: 0.6363636363636364
	Max: 0.9090909090909091
	Mean: 0.7454545454545454
	Std: 0.10601730717900547
[all] 10-fold cross validation with Extra Trees Classifier (100 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.8133333333333332
	Std: 0.17269111795984826
[all] 5-fold cross validation with Extra Trees Classifier (500 trees):
Results:
	Min: 0.6363636363636364
	Max: 0.9090909090909091
	Mean: 0.7636363636363636
	Std: 0.09270944570168699
[all] 10-fold cross validation with Extra Trees Classifier (500 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.8133333333333332
	Std: 0.17269111795984826
[all] 5-fold cross validation with Extra Trees Classifier (1000 trees):
Results:
	Min: 0.7272727272727273
	Max: 0.9090909090909091
	Mean: 0.7818181818181819
	Std: 0.07272727272727271
[all] 10-fold cross validation with Extra Trees Classifier (1000 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.7966666666666666
	Std: 0.19290181728

In [12]:
for name, data in hourly_selected.items():
    cross_validate(name, data)

[all] 5-fold cross validation with Extra Trees Classifier (100 trees):
Results:
	Min: 0.6363636363636364
	Max: 0.9090909090909091
	Mean: 0.7636363636363637
	Std: 0.09270944570168699
[all] 10-fold cross validation with Extra Trees Classifier (100 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.78
	Std: 0.18808981306221179
[all] 5-fold cross validation with Extra Trees Classifier (500 trees):
Results:
	Min: 0.6363636363636364
	Max: 1.0
	Mean: 0.8181818181818181
	Std: 0.1149919149152138
[all] 10-fold cross validation with Extra Trees Classifier (500 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.8133333333333332
	Std: 0.1944793619441976
[all] 5-fold cross validation with Extra Trees Classifier (1000 trees):
Results:
	Min: 0.6363636363636364
	Max: 1.0
	Mean: 0.8181818181818181
	Std: 0.1149919149152138
[all] 10-fold cross validation with Extra Trees Classifier (1000 trees):
Results:
	Min: 0.4
	Max: 1.0
	Mean: 0.7933333333333333
	Std: 0.18427033281447006
[night] 5-fold cross validation with Ext

In [13]:
for name, data in selected.items():
    validate(name, data)

[all] validation with Extra Trees Classifier (100 trees):
	Train size: (44, 14)
	Test size: (11, 14)
	Accuracy score: 0.7272727272727273
[all] validation with Extra Trees Classifier (500 trees):
	Train size: (44, 14)
	Test size: (11, 14)
	Accuracy score: 0.7272727272727273
[all] validation with Extra Trees Classifier (1000 trees):
	Train size: (44, 14)
	Test size: (11, 14)
	Accuracy score: 0.7272727272727273
[night] validation with Extra Trees Classifier (100 trees):
	Train size: (44, 47)
	Test size: (11, 47)
	Accuracy score: 0.6363636363636364
[night] validation with Extra Trees Classifier (500 trees):
	Train size: (44, 47)
	Test size: (11, 47)
	Accuracy score: 0.8181818181818182
[night] validation with Extra Trees Classifier (1000 trees):
	Train size: (44, 47)
	Test size: (11, 47)
	Accuracy score: 0.7272727272727273
[day] validation with Extra Trees Classifier (100 trees):
	Train size: (44, 2)
	Test size: (11, 2)
	Accuracy score: 0.9090909090909091
[day] validation with Extra Trees C

In [14]:
for name, data in hourly_selected.items():
    validate(name, data)

[all] validation with Extra Trees Classifier (100 trees):
	Train size: (44, 24)
	Test size: (11, 24)
	Accuracy score: 0.6363636363636364
[all] validation with Extra Trees Classifier (500 trees):
	Train size: (44, 24)
	Test size: (11, 24)
	Accuracy score: 0.7272727272727273
[all] validation with Extra Trees Classifier (1000 trees):
	Train size: (44, 24)
	Test size: (11, 24)
	Accuracy score: 0.8181818181818182
[night] validation with Extra Trees Classifier (100 trees):
	Train size: (44, 61)
	Test size: (11, 61)
	Accuracy score: 0.8181818181818182
[night] validation with Extra Trees Classifier (500 trees):
	Train size: (44, 61)
	Test size: (11, 61)
	Accuracy score: 0.8181818181818182
[night] validation with Extra Trees Classifier (1000 trees):
	Train size: (44, 61)
	Test size: (11, 61)
	Accuracy score: 1.0
[day] validation with Extra Trees Classifier (100 trees):
	Train size: (44, 8)
	Test size: (11, 8)
	Accuracy score: 0.9090909090909091
[day] validation with Extra Trees Classifier (500 

#### Extra Trees results
* The Extra Trees classifier achieved results almost identical to those of the Random Forest classifier, with per-fold results varying from 40% to 100% in cross-validation.
* The best mean accuracy for an Extra Trees classifier trained on the raw signal was **85.45%**.
* The best mean accuracy for an Extra Trees classifier trained on the hourly-aggregated signal was **91%**.
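
Comparing the two ensembles is easier when both run through the same function with the same folds and seed. The sketch below is one possible way to do that. The `MODELS` registry, the `cross_validate_models` helper, the fixed seed, and the synthetic data are all assumptions for illustration, not the notebook's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical registry: display name -> estimator class.
MODELS = {
    'Random Forest': RandomForestClassifier,
    'Extra Trees': ExtraTreesClassifier,
}

def cross_validate_models(X, y, n_trees=100, folds=5, seed=0):
    """Return {model name: (mean, std)} of the fold accuracies."""
    results = {}
    for model_name, model_cls in MODELS.items():
        clf = model_cls(n_estimators=n_trees, random_state=seed)
        scores = cross_val_score(clf, X, y, cv=folds)
        results[model_name] = (np.mean(scores), np.std(scores))
    return results

# Synthetic stand-in for the selected features (55 samples, 14 columns).
X_demo, y_demo = make_classification(n_samples=55, n_features=14, random_state=0)
for model_name, (mean, std) in cross_validate_models(X_demo, y_demo).items():
    print(f'{model_name}: mean {mean:.3f}, std {std:.3f}')
```

Returning the numbers instead of printing them also makes it straightforward to collect results for all feature sets into one table.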

### SVM

We also test an SVM, using GridSearchCV to find the best SVC parameters.

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

scores = ['precision', 'recall']
tuned_parameters = [
    {
        'kernel': ['rbf'], 
        'gamma': [1e-4, 1e-3, 1e-2, 1e-1, "scale", "auto"],
        'C': [1, 10, 20, 50, 100, 200, 500, 1000]
    },
]

def grid_search_svc(name: str, data: pd.DataFrame):
    X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=TEST_RATIO, random_state=0)
    
    for score in scores:
        print(f'[{name}] SVC tuning for {score}')

        clf = GridSearchCV(SVC(), tuned_parameters, scoring=f'{score}_macro', n_jobs=-1)
        clf.fit(X_train, y_train)

        print("Best parameters set found on development set:")
        print(clf.best_params_)
        print("Grid scores on development set:")
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

        y_true, y_pred = y_test, clf.predict(X_test)
        print(classification_report(y_true, y_pred))

In [16]:
for name, data in selected.items():
    grid_search_svc(name, data)

[all] SVC tuning for precision
Best parameters set found on development set:
{'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
Grid scores on development set:
0.272 (+/-0.022) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.758 (+/-0.170) for {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.696 (+/-0.249) for {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 20, 'gamma': 0.0001, 'kernel': 'rbf'}
0.272 (+/-0.022) f

Best parameters set found on development set:
{'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Grid scores on development set:
0.500 (+/-0.000) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.725 (+/-0.237) for {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.745 (+/-0.301) for {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 20, 'gamma': 0.0001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 20, 'gamma': 0.001, '

              precision    recall  f1-score   support

           0       0.88      0.88      0.88         8
           1       0.67      0.67      0.67         3

    accuracy                           0.82        11
   macro avg       0.77      0.77      0.77        11
weighted avg       0.82      0.82      0.82        11



In [17]:
for name, data in hourly_selected.items():
    grid_search_svc(name, data)

[all] SVC tuning for precision
Best parameters set found on development set:
{'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
Grid scores on development set:
0.272 (+/-0.022) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.736 (+/-0.228) for {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.736 (+/-0.228) for {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
0.272 (+/-0.022) for {'C': 20, 'gamma': 0.0001, 'kernel': 'rbf'}
0.272 (+/-0.022) f

Best parameters set found on development set:
{'C': 200, 'gamma': 'scale', 'kernel': 'rbf'}
Grid scores on development set:
0.500 (+/-0.000) for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.705 (+/-0.185) for {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.720 (+/-0.307) for {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 10, 'gamma': 'auto', 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 20, 'gamma': 0.0001, 'kernel': 'rbf'}
0.500 (+/-0.000) for {'C': 20, 'gamma': 0.001, 

#### SVC results
* With SVCs, better results were also achieved with features extracted from the aggregated signal: 91% accuracy on the whole dataset.
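
In the grid output above, every fixed `gamma` value scores 0.272 macro precision and 0.500 macro recall with almost no spread, which is consistent with the classifier predicting a single class. One possible cause, not verified on this data, is that the tsfresh features live on very different scales. The sketch below shows how a `StandardScaler` could be put in front of the SVC inside the grid search. It runs on synthetic data, and the reduced parameter grid is an assumption chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the selected features (55 samples, 14 columns).
X_demo, y_demo = make_classification(n_samples=55, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# make_pipeline names the steps after their classes: 'standardscaler', 'svc'.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {
    'svc__gamma': [1e-4, 1e-3, 1e-2, 1e-1, 'scale'],
    'svc__C': [1, 10, 100, 1000],
}

# The scaler is refit on each training fold, so no test-fold statistics leak in.
search = GridSearchCV(pipe, param_grid, scoring='precision_macro', cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(f'Test accuracy: {search.best_estimator_.score(X_test, y_test):.3f}')
```

Keeping the scaler inside the pipeline, instead of scaling the whole dataset up front, means the held-out samples never influence the scaling parameters.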