# Experiments KuHar

This notebook runs basic experiments on the balanced KuHar dataset in the following steps:
1. Quickly load the train, validation and test CSV subsets of the balanced KuHar dataset using the `PandasDatasetsIO` helper
2. Wrap the dataframes in librep's `Dataset` interface using `PandasMultiModalDataset`
3. Apply the Fourier transform to the dataset
4. Train and evaluate SVM, KNN and Random Forest classification models in both the time and frequency domains

Together, these experiments measure how the SVM, KNN and RF models perform on the balanced KuHar dataset in the time domain and in the frequency domain.

## Common imports and definitions

In [1]:
from pathlib import Path  # For defining dataset Paths
import sys                # For including the librep package

# This must be done if librep is not installed via pip,
# as this directory (examples) is apart from the librep package root
sys.path.append("..")

# Third party imports
import pandas as pd
import numpy as np

# Librep imports
from librep.utils.dataset import PandasDatasetsIO          # For quick load train, test and validation CSVs
from librep.datasets.multimodal import PandasMultiModalDataset # Wrap CSVs to librep's `Dataset` interface



## Loading data
Change the path to use other datasets.

In [2]:
# Path for the KuHar balanced view with the same activities (and label numbers)
# The directory is assumed to contain train.csv, validation.csv and test.csv
dataset_path = Path("../data/views/KuHar/balanced_view")

Once the path is defined, we can load the CSVs as pandas dataframes.

In [3]:
# Kuhar dataframes
train, validation, test = PandasDatasetsIO(dataset_path).load()
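
For readers without librep at hand, the loading step can be approximated with plain pandas. The sketch below assumes that `PandasDatasetsIO(...).load()` simply reads `train.csv`, `validation.csv` and `test.csv` from the given directory and returns them in that order; the helper name `load_splits` is hypothetical and not part of librep.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_splits(root: Path):
    """Hypothetical stand-in for PandasDatasetsIO(root).load().

    Assumes the directory holds train.csv, validation.csv and test.csv.
    """
    names = ("train.csv", "validation.csv", "test.csv")
    return tuple(pd.read_csv(root / name) for name in names)


# Tiny demo on throwaway files, so the sketch runs without the KuHar data.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, rows in (("train", 3), ("validation", 2), ("test", 1)):
        demo = pd.DataFrame({"accel-x-0": [0.0] * rows, "activity code": [0] * rows})
        demo.to_csv(root / f"{name}.csv", index=False)
    demo_train, demo_validation, demo_test = load_splits(root)

print(len(demo_train), len(demo_validation), len(demo_test))  # 3 2 1
```

Writing with `index=False`, as in the demo, avoids the `Unnamed: 0` columns that pandas creates when an index is saved to CSV and later read back as a regular column.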

Let's take a look at the train dataframe.

In [4]:
train.head()

| Unnamed: 0.1 | Unnamed: 0 | accel-x-0 | accel-x-1 | accel-x-2 | accel-x-3 | accel-x-4 | accel-x-5 | accel-x-6 | accel-x-7 | accel-x-8 | ... | gyro-z-299 | accel-start-time | gyro-start-time | accel-end-time | gyro-end-time | activity code | length | serial | index | user |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.056118 | 0.034403 | 0.052704 | 0.070734 | 0.020224 | -0.048252 | -0.033161 | -0.006543 | -0.001562 | ... | -0.005646 | 30.379 | 30.331 | 33.433 | 33.352 | 0 | 300 | 6 | 3000 | 1040 |
| 1 | 1 | -0.019538 | -0.016915 | 0.021001 | 0.055937 | 0.036128 | 0.004878 | -0.032916 | -0.044168 | -0.04817 | ... | -0.005636 | 35.36 | 35.349 | 38.38 | 38.371 | 0 | 300 | 1 | 3300 | 1025 |
| 2 | 2 | 0.078851 | 0.067761 | 0.042445 | -0.016207 | -0.060515 | -0.052389 | -0.039572 | -0.020855 | -0.020164 | ... | 0.000831 | 0.006 | 0.009 | 2.995 | 2.997 | 0 | 300 | 1 | 0 | 1010 |
| 3 | 3 | -0.06795 | 0.00145 | 0.095617 | 0.070418 | -0.008559 | -0.001449 | -0.013325 | -0.036775 | -0.043285 | ... | 0.001721 | 3.045 | 3.034 | 6.067 | 6.057 | 0 | 300 | 1 | 300 | 1058 |
| 4 | 4 | -0.03076 | -0.005518 | 0.005185 | 0.029851 | 0.029403 | 0.007791 | 0.007751 | -0.005227 | -0.019164 | ... | 0.011505 | 0.001 | 0.001 | 2.957 | 2.956 | 0 | 300 | 1 | 0 | 1015 |


## Creating a Librep dataset from pandas dataframes

Change the features to use other datasets.

In [5]:
# Kuhar features to select
features = [
    "accel-x",
    "accel-y",
    "accel-z",
    "gyro-x",
    "gyro-y",
    "gyro-z"
]

# Creating the datasets

# Train
train_dataset = PandasMultiModalDataset(
    train,
    feature_prefixes=features,
    label_columns="activity code",
    as_array=True
)

# Validation
validation_dataset = PandasMultiModalDataset(
    validation,
    feature_prefixes=features,
    label_columns="activity code",
    as_array=True
)

# Test
test_dataset = PandasMultiModalDataset(
    test,
    feature_prefixes=features,
    label_columns="activity code",
    as_array=True
)
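
As an illustration of what `feature_prefixes` selects, the following sketch gathers every column whose name starts with one of the prefixes and stacks them into one flat array per row. It is an approximation written with plain pandas, not librep's actual implementation; `select_by_prefix` and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd


def select_by_prefix(df: pd.DataFrame, prefixes, label_column: str):
    """Hypothetical helper: return (X, y) using prefix-matched columns.

    Columns are taken prefix by prefix, so each prefix forms one contiguous
    window in the flattened feature vector.
    """
    columns = []
    for prefix in prefixes:
        # The trailing "-" keeps e.g. "accel-x" from matching "accel-x2-0".
        columns += [c for c in df.columns if c.startswith(prefix + "-")]
    return df[columns].to_numpy(), df[label_column].to_numpy()


# Toy frame with 2 prefixes x 3 time steps, plus metadata that must be ignored.
toy = pd.DataFrame({
    "accel-x-0": [1.0, 2.0], "accel-x-1": [1.1, 2.1], "accel-x-2": [1.2, 2.2],
    "gyro-z-0": [9.0, 8.0], "gyro-z-1": [9.1, 8.1], "gyro-z-2": [9.2, 8.2],
    "activity code": [0, 1], "user": [1040, 1025],
})
X_toy, y_toy = select_by_prefix(toy, ["accel-x", "gyro-z"], "activity code")
print(X_toy.shape)  # (2, 6)
```

With the six KuHar prefixes and 300 time steps each, the same kind of selection yields 6 × 300 = 1800 features per sample.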

## Inspect sample

In [6]:
# Let's print the first sample of train_dataset.
# It is a tuple, with a vector of 1800 elements as the first element and the label as the second
x = train_dataset[0]
print(x)

(array([-0.05611801,  0.03440285,  0.05270386, ..., -0.00777642,
       -0.00671116, -0.0056459 ]), 0)


In [7]:
# Inspecting sample
print(f"The sample 0: {x[0]}")
print(f"Shape of sample 0: {x[0].shape}")
print(f"The label of sample 0: {x[1]}")

The sample 0: [-0.05611801  0.03440285  0.05270386 ... -0.00777642 -0.00671116
 -0.0056459 ]
Shape of sample 0: (1800,)
The label of sample 0: 0
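
The 1800 values are the six sensor axes laid end to end, 300 time steps each (the last value printed above matches the `gyro-z-299` column of the first dataframe row). Assuming the axes are stored in the order of the `features` list, a flat sample can be viewed as a 6 × 300 matrix as sketched below; the array here is synthetic, not KuHar data.

```python
import numpy as np

axes = ["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"]
steps = 300

# Synthetic flat sample standing in for train_dataset[0][0].
flat_sample = np.arange(len(axes) * steps, dtype=float)

# One row per axis, one column per time step (assumes axis-major layout).
per_axis = flat_sample.reshape(len(axes), steps)

for name, signal in zip(axes, per_axis):
    print(f"{name}: first={signal[0]:.0f}, last={signal[-1]:.0f}")
```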


## Fourier Transform

In [8]:
from librep.datasets.multimodal import TransformMultiModalDataset
from librep.transforms.fft import FFT

In [9]:
fft_transform = FFT(centered=True)
transformer = TransformMultiModalDataset(transforms=[fft_transform], new_window_name_prefix="fft.")

### Apply FFT to KuHar

In [10]:
train_dataset_fft = transformer(train_dataset)
validation_dataset_fft = transformer(validation_dataset)
test_dataset_fft = transformer(test_dataset)

In [11]:
train_dataset[:][0]

array([[-5.61180100e-02,  3.44028470e-02,  5.27038570e-02, ...,
        -7.77642340e-03, -6.71116000e-03, -5.64589630e-03],
       [-1.95379260e-02, -1.69153210e-02,  2.10008620e-02, ...,
        -5.63590970e-03, -5.63590970e-03, -5.63590970e-03],
       [ 7.88507460e-02,  6.77614200e-02,  4.24451830e-02, ...,
         2.66335900e-03,  4.49595400e-03,  8.30763950e-04],
       ...,
       [-4.49047000e+00, -4.37737460e+00, -3.15459060e+00, ...,
         1.55295460e-01,  2.96276410e-03, -2.68679440e-01],
       [-1.35669830e+01, -1.23066845e+01, -1.06537895e+01, ...,
         7.37994550e-01,  8.41325160e-01,  9.55308400e-01],
       [-6.65145870e+00, -4.99298000e+00, -3.75397440e+00, ...,
         9.16958870e-01,  8.77544050e-01,  8.98849370e-01]])

In [12]:
train_dataset_fft[:][0]

array([[1.40451425e-01, 2.64954973e-01, 2.33269584e-01, ...,
        5.54874483e-03, 1.31394597e-02, 1.02363685e-02],
       [1.03750273e-02, 2.86165385e-01, 2.29687236e-01, ...,
        1.58202500e-02, 1.23533180e-02, 1.48931568e-02],
       [7.98449419e+00, 8.36490063e-01, 7.98251733e-01, ...,
        3.19458726e-02, 1.32013312e-02, 9.01949793e-03],
       ...,
       [1.17398675e+02, 4.41272871e+01, 1.13623326e+02, ...,
        1.74773671e-01, 2.30094237e-01, 2.01677331e-01],
       [6.21067120e+01, 1.78165592e+02, 1.04645300e+02, ...,
        6.27752450e-01, 6.57386879e-01, 5.43437751e-01],
       [1.95639551e+02, 1.46506404e+02, 6.75411949e+01, ...,
        2.49181044e-01, 2.13343332e-01, 1.39736334e-01]])
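
The transformed values are all non-negative, which is consistent with a magnitude spectrum. A rough NumPy equivalent of the transform is sketched below. It assumes that `FFT(centered=True)` takes the absolute value of the discrete Fourier transform and keeps only the first half of the bins, and that `TransformMultiModalDataset` applies it to each window separately; both are assumptions about librep, and `fft_magnitude_per_window` is a hypothetical name.

```python
import numpy as np


def fft_magnitude_per_window(X: np.ndarray, n_windows: int, half: bool = True) -> np.ndarray:
    """Hypothetical sketch: per-window FFT magnitude for flat samples.

    X has shape (n_samples, n_windows * window_size). Each window is
    transformed on its own; with half=True only the first half of the bins
    is kept, since the magnitude spectrum of a real signal is symmetric.
    """
    n_samples = X.shape[0]
    windows = X.reshape(n_samples, n_windows, -1)
    spectrum = np.abs(np.fft.fft(windows, axis=-1))
    if half:
        spectrum = spectrum[..., : spectrum.shape[-1] // 2]
    return spectrum.reshape(n_samples, -1)


# Synthetic check: two windows of 300 steps, pure sines with 5 and 20 cycles.
t = np.arange(300)
sample = np.concatenate([np.sin(2 * np.pi * 5 * t / 300),
                         np.sin(2 * np.pi * 20 * t / 300)])
X_freq = fft_magnitude_per_window(sample[None, :], n_windows=2)

print(X_freq.shape)  # (1, 300)
peaks = X_freq.reshape(2, -1).argmax(axis=1)
print(peaks.tolist())  # [5, 20]
```

Each window's strongest bin lands on the number of cycles of its sine, which is the kind of structure the frequency-domain classifiers can exploit.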

## Train and evaluate Random Forest classifier

In [22]:
from librep.utils.workflow import SimpleTrainEvalWorkflow, MultiRunWorkflow
from librep.estimators import RandomForestClassifier
from librep.metrics.report import ClassificationReport
import yaml

reporter = ClassificationReport(
    use_accuracy=True, use_f1_score=True, use_classification_report=False,
    use_confusion_matrix=False, plot_confusion_matrix=False,
)
experiment = SimpleTrainEvalWorkflow(
    estimator=RandomForestClassifier, estimator_creation_kwags={'n_estimators': 100},
    do_not_instantiate=False, do_fit=True, evaluator=reporter,
)
multi_run_experiment = MultiRunWorkflow(workflow=experiment, num_runs=3, debug=False)
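
Conceptually, the two workflow objects fit a fresh estimator on the training set, score its predictions on the test set, and repeat this several times. The sketch below reproduces that loop with scikit-learn only; it is not librep's implementation, the function `run_once` is hypothetical, and the data is synthetic.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score


def run_once(estimator_cls, estimator_kwargs, X_train, y_train, X_test, y_test):
    """Hypothetical single train/eval run returning metrics and timing."""
    start = time.time()
    model = estimator_cls(**estimator_kwargs)  # new instance for every run
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": float(accuracy_score(y_test, y_pred)),
        "f1 score (macro)": float(f1_score(y_test, y_pred, average="macro")),
        "time taken": time.time() - start,
    }


# Synthetic, well-separated two-class problem.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (100, 8)), rng.normal(5.0, 1.0, (100, 8))])
y_demo = np.array([0] * 100 + [1] * 100)
train_mask = np.arange(200) % 4 != 0  # hold out every fourth sample
X_tr, y_tr = X_demo[train_mask], y_demo[train_mask]
X_te, y_te = X_demo[~train_mask], y_demo[~train_mask]

demo_runs = [
    run_once(RandomForestClassifier, {"n_estimators": 100}, X_tr, y_tr, X_te, y_te)
    for _ in range(3)
]
for i, run in enumerate(demo_runs, start=1):
    print(i, round(run["accuracy"], 3))
```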

In [26]:
combined_train_dset = PandasMultiModalDataset(
    pd.concat([train, validation]),
    feature_prefixes=features,
    label_columns="activity code",
    as_array=True
)

x = combined_train_dset[0]
print(x)
print(f"The sample 0: {x[0]}")
print(f"Shape of sample 0: {x[0].shape}")
print(f"The label of sample 0: {x[1]}")
print(train_dataset)
print(validation_dataset)
print(combined_train_dset)


result = multi_run_experiment(combined_train_dset, test_dataset)
print(yaml.dump(result, sort_keys=True, indent=4))

(array([[-0.05611801,  0.03440285,  0.05270386, ..., -0.00777642,
        -0.00671116, -0.0056459 ],
       [-0.12517166, -0.07503891, -0.02219009, ..., -0.0005188 ,
        -0.00263977, -0.00370789]]), 0    0
0    0
Name: activity code, dtype: int64)
The sample 0: [[-0.05611801  0.03440285  0.05270386 ... -0.00777642 -0.00671116
  -0.0056459 ]
 [-0.12517166 -0.07503891 -0.02219009 ... -0.0005188  -0.00263977
  -0.00370789]]
Shape of sample 0: (2, 1800)
The label of sample 0: 0    0
0    0
Name: activity code, dtype: int64
PandasMultiModalDataset: samples=3168, features=1800, no. window=6
PandasMultiModalDataset: samples=234, features=1800, no. window=6
PandasMultiModalDataset: samples=3402, features=1800, no. window=6
runs:
-   end: 1662384876.1513305
    result:
    -   accuracy: 0.7106481481481481
        f1 score (macro): 0.697919052379794
        f1 score (micro): 0.710648148148148
        f1 score (weighted): 0.7233772439165023
    run id: 1
    start: 1662384866.423788
    time 
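
Note that sample 0 of the combined dataset has shape `(2, 1800)` and a two-element label, whereas `train_dataset[0]` returned a single vector. A likely cause, assuming the dataset looks rows up by index label, is that `pd.concat` keeps the original row labels, so label 0 occurs once from `train` and once from `validation`. The toy example below shows the effect with plain pandas and how `ignore_index=True` avoids it.

```python
import pandas as pd

part_a = pd.DataFrame({"value": [1.0, 2.0], "activity code": [0, 0]})
part_b = pd.DataFrame({"value": [3.0, 4.0], "activity code": [0, 1]})

# Default concat keeps each frame's own labels: 0, 1, 0, 1.
stacked = pd.concat([part_a, part_b])
print(stacked.loc[0].shape)  # (2, 2): two rows share label 0

# ignore_index=True relabels the rows 0..n-1, so every label is unique.
relabeled = pd.concat([part_a, part_b], ignore_index=True)
print(relabeled.loc[0].shape)  # (2,): a single row
```

Applied here, that would mean building the combined dataset from `pd.concat([train, validation], ignore_index=True)`; whether this changes the reported metrics has not been checked.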

In [15]:
combined_train_dset_fft = transformer(combined_train_dset)

result = multi_run_experiment(combined_train_dset_fft, test_dataset_fft)
print(yaml.dump(result, sort_keys=True, indent=4))

runs:
-   end: 1662058171.5942078
    result:
    -   accuracy: 0.8287037037037037
        f1 score (macro): 0.8287587527804129
        f1 score (micro): 0.8287037037037037
        f1 score (weighted): 0.8286486546269943
    run id: 1
    start: 1662058165.8094182
    time taken: 5.784789562225342
-   end: 1662058177.4075127
    result:
    -   accuracy: 0.8310185185185185
        f1 score (macro): 0.8288986729825647
        f1 score (micro): 0.8310185185185185
        f1 score (weighted): 0.8331383640544722
    run id: 2
    start: 1662058171.59421
    time taken: 5.813302755355835
-   end: 1662058183.2215347
    result:
    -   accuracy: 0.8402777777777778
        f1 score (macro): 0.8396976809360595
        f1 score (micro): 0.8402777777777778
        f1 score (weighted): 0.840857874619496
    run id: 3
    start: 1662058177.4075143
    time taken: 5.814020395278931
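
The per-run dictionaries can be condensed into a mean and standard deviation. The sketch below assumes `result` has the structure shown in the YAML dump (a `runs` list whose entries hold a one-element `result` list) and hard-codes the three accuracies printed above so that it runs on its own; `summarize` is a hypothetical helper.

```python
import numpy as np


def summarize(result: dict, metric: str = "accuracy"):
    """Hypothetical helper: mean and standard deviation of a metric over runs."""
    values = [run["result"][0][metric] for run in result["runs"]]
    return float(np.mean(values)), float(np.std(values))


# Accuracies copied from the Random Forest frequency-domain runs above.
rf_fft_result = {
    "runs": [
        {"run id": 1, "result": [{"accuracy": 0.8287037037037037}]},
        {"run id": 2, "result": [{"accuracy": 0.8310185185185185}]},
        {"run id": 3, "result": [{"accuracy": 0.8402777777777778}]},
    ]
}
mean_acc, std_acc = summarize(rf_fft_result)
print(f"accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```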



## Train and evaluate Support Vector Machine classifier

In [16]:
#from librep.estimators import SVC
from sklearn.svm import SVC

experiment = SimpleTrainEvalWorkflow(
    estimator=SVC, estimator_creation_kwags={'C': 3.0, 'kernel': "rbf"},
    do_not_instantiate=False, do_fit=True, evaluator=reporter,
)
multi_run_experiment = MultiRunWorkflow(workflow=experiment, num_runs=3, debug=False)

result = multi_run_experiment(combined_train_dset, test_dataset)
print(yaml.dump(result, sort_keys=True, indent=4))

runs:
-   end: 1662058188.820096
    result:
    -   accuracy: 0.42592592592592593
        f1 score (macro): 0.4038146377355998
        f1 score (micro): 0.42592592592592593
        f1 score (weighted): 0.44803721411625214
    run id: 1
    start: 1662058183.2274892
    time taken: 5.592606782913208
-   end: 1662058194.379537
    result:
    -   accuracy: 0.42592592592592593
        f1 score (macro): 0.4038146377355998
        f1 score (micro): 0.42592592592592593
        f1 score (weighted): 0.44803721411625214
    run id: 2
    start: 1662058188.8200982
    time taken: 5.559438943862915
-   end: 1662058200.1113172
    result:
    -   accuracy: 0.42592592592592593
        f1 score (macro): 0.4038146377355998
        f1 score (micro): 0.42592592592592593
        f1 score (weighted): 0.44803721411625214
    run id: 3
    start: 1662058194.3795393
    time taken: 5.731777906417847



In [17]:
result = multi_run_experiment(combined_train_dset_fft, test_dataset_fft)
print(yaml.dump(result, sort_keys=True, indent=4))

runs:
-   end: 1662058201.4200947
    result:
    -   accuracy: 0.7685185185185185
        f1 score (macro): 0.7510101967323339
        f1 score (micro): 0.7685185185185186
        f1 score (weighted): 0.7860268403047032
    run id: 1
    start: 1662058200.116313
    time taken: 1.3037817478179932
-   end: 1662058202.7162719
    result:
    -   accuracy: 0.7685185185185185
        f1 score (macro): 0.7510101967323339
        f1 score (micro): 0.7685185185185186
        f1 score (weighted): 0.7860268403047032
    run id: 2
    start: 1662058201.4200969
    time taken: 1.2961750030517578
-   end: 1662058204.0127006
    result:
    -   accuracy: 0.7685185185185185
        f1 score (macro): 0.7510101967323339
        f1 score (micro): 0.7685185185185186
        f1 score (weighted): 0.7860268403047032
    run id: 3
    start: 1662058202.716274
    time taken: 1.29642653465271



## Train and evaluate K-Nearest Neighbors classifier

In [18]:
#from librep.estimators import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

experiment = SimpleTrainEvalWorkflow(
    estimator=KNeighborsClassifier, estimator_creation_kwags={'n_neighbors': 1},
    do_not_instantiate=False, do_fit=True, evaluator=reporter,
)
multi_run_experiment = MultiRunWorkflow(workflow=experiment, num_runs=3, debug=False)

result = multi_run_experiment(combined_train_dset, test_dataset)
print(yaml.dump(result, sort_keys=True, indent=4))

runs:
-   end: 1662058204.2287405
    result:
    -   accuracy: 0.3773148148148148
        f1 score (macro): 0.3769408236270271
        f1 score (micro): 0.3773148148148149
        f1 score (weighted): 0.3776888060026025
    run id: 1
    start: 1662058204.0184376
    time taken: 0.21030282974243164
-   end: 1662058204.301439
    result:
    -   accuracy: 0.3773148148148148
        f1 score (macro): 0.3769408236270271
        f1 score (micro): 0.3773148148148149
        f1 score (weighted): 0.3776888060026025
    run id: 2
    start: 1662058204.228744
    time taken: 0.07269501686096191
-   end: 1662058204.373066
    result:
    -   accuracy: 0.3773148148148148
        f1 score (macro): 0.3769408236270271
        f1 score (micro): 0.3773148148148149
        f1 score (weighted): 0.3776888060026025
    run id: 3
    start: 1662058204.3014412
    time taken: 0.071624755859375



In [19]:
result = multi_run_experiment(combined_train_dset_fft, test_dataset_fft)
print(yaml.dump(result, sort_keys=True, indent=4))

runs:
-   end: 1662058204.412305
    result:
    -   accuracy: 0.8194444444444444
        f1 score (macro): 0.8212660219703911
        f1 score (micro): 0.8194444444444444
        f1 score (weighted): 0.8176228669184978
    run id: 1
    start: 1662058204.3797736
    time taken: 0.0325314998626709
-   end: 1662058204.4388378
    result:
    -   accuracy: 0.8194444444444444
        f1 score (macro): 0.8212660219703911
        f1 score (micro): 0.8194444444444444
        f1 score (weighted): 0.8176228669184978
    run id: 2
    start: 1662058204.4123073
    time taken: 0.02653050422668457
-   end: 1662058204.467535
    result:
    -   accuracy: 0.8194444444444444
        f1 score (macro): 0.8212660219703911
        f1 score (micro): 0.8194444444444444
        f1 score (weighted): 0.8176228669184978
    run id: 3
    start: 1662058204.43884
    time taken: 0.028695106506347656

