# Notebook 03 - Preprocessing and Training
## Overview
1. Load data from previous notebook
2. Organize the dataframe by assigning a multi-index based on measurement session and session time
3. Create a train/test split by splitting on the first-level index: profile_id
4. Standardize both train and test data using a scaler fit to only the training data
5. Create and save the train and test X and y datasets

In [2]:
# imports
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [3]:
# get data
filepath = '../data/measures_v2.csv'  # forward slashes avoid invalid backslash escapes
df = pd.read_csv(filepath)

In [4]:
# review df structure
df.head()

,u_q,coolant,stator_winding,u_d,stator_tooth,motor_speed,i_d,i_q,pm,stator_yoke,ambient,torque,profile_id
0,-0.450682,18.805172,19.08667,-0.350055,18.293219,0.002866,0.004419,0.000328,24.554214,18.316547,19.850691,0.187101,17
1,-0.325737,18.818571,19.09239,-0.305803,18.294807,0.000257,0.000606,-0.000785,24.538078,18.314955,19.850672,0.245417,17
2,-0.440864,18.82877,19.08938,-0.372503,18.294094,0.002355,0.00129,0.000386,24.544693,18.326307,19.850657,0.176615,17
3,-0.327026,18.835567,19.083031,-0.316199,18.292542,0.006105,2.6e-05,0.002046,24.554018,18.330833,19.850647,0.238303,17
4,-0.47115,18.857033,19.082525,-0.332272,18.291428,0.003133,-0.064317,0.037184,24.565397,18.326662,19.850639,0.208197,17


To organize the dataframe, I'm adding a new column containing the `runtime` (elapsed time in seconds, assuming a 2 Hz sampling rate) for each test (denoted by `profile_id`) and then using `profile_id` and `runtime` as the multi-index of the dataframe.

In [5]:
# create new column for runtime and initialize to 0.0 (float, since runtime holds fractional seconds)
df['runtime'] = 0.0

In [6]:
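# function for calculating runtime (elapsed seconds) for a unique profile_id
def calculate_runtime(profile_id, Hz):
    # boolean mask selecting the rows of this measurement session
    subset = df['profile_id'] == profile_id
    timestep = 1.0 / Hz
    # one timestamp per row; using the row count keeps the length correct for any Hz
    # and avoids float rounding in np.arange with a fractional step
    df.loc[subset, 'runtime'] = np.arange(subset.sum()) * timestep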

In [7]:
# get list of unique profile_id
u_profile_id = df.profile_id.unique()

# call calculate_runtime for each id in the list
for u_id in u_profile_id:
    calculate_runtime(u_id, 2)

In [8]:
# set profile_id and runtime as multi-index
df.set_index(['profile_id','runtime'], inplace=True)
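
The runtime column can also be built without a per-profile loop. The following is a minimal sketch, assuming each profile's rows are already in chronological order and were sampled at a fixed rate; the toy dataframe and the `SAMPLE_RATE_HZ` name are illustrative and not part of the original notebook.

```python
import pandas as pd

SAMPLE_RATE_HZ = 2  # assumed fixed sampling rate (the value passed to calculate_runtime)

# toy stand-in for the real measurements: two sessions of different lengths
toy = pd.DataFrame({
    "profile_id": [17, 17, 17, 5, 5],
    "u_q": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# cumcount() numbers the rows within each profile_id as 0, 1, 2, ...
# dividing by the sampling rate converts that row counter to elapsed seconds
toy["runtime"] = toy.groupby("profile_id").cumcount() / SAMPLE_RATE_HZ

# same multi-index layout as the notebook: session first, then time within session
toy = toy.set_index(["profile_id", "runtime"])
print(toy)
```

Because the counter restarts for every `profile_id`, each session begins at a runtime of 0.0 regardless of how many rows it contains.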

To avoid test data leaking into the training dataset, I will create the train-test split before scaling. There are two methods to consider: (1) the `profile_id`s can be split into a train set and a test set, or (2) the time series within each `profile_id` can be split into a train set and a test set.

Because each `profile_id` represents a new motor cycle, it makes more sense to train on one set of `profile_id`s and predict the target values for held-out `profile_id`s (method 1). This mirrors production use, where a deployed machine learning model would have to perform on motor cycles it has never seen.

Accordingly, the measurement sessions are randomly sampled into a training set of `profile_id`s and a testing set of `profile_id`s.
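
To make the session-level split concrete, here is a small sketch that splits a list of session ids rather than individual rows. The ids in `profile_ids` are hypothetical, and `random_state=0` is an assumption added only so the sketch is repeatable; the notebook itself does not fix a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical session ids standing in for the first level of the multi-index
profile_ids = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

# splitting the ids (not the rows) keeps every session entirely in one set;
# test_size=0.25 is the scikit-learn default when no size is given
train_ids, test_ids = train_test_split(profile_ids, test_size=0.25, random_state=0)

# sanity check: no session may appear on both sides of the split
overlap = set(train_ids) & set(test_ids)
print(len(train_ids), len(test_ids), overlap)
```

An empty `overlap` set confirms that no measurement session contributes rows to both the training and the testing data.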

In [9]:
# get train vs test indices of first-level of index
train_ix, test_ix = train_test_split(df.index.levels[0])

# create train and test dataframes
train_df = df.loc[train_ix]
test_df = df.loc[test_ix]

# fit a StandardScaler() to the training dataframe
scaler = StandardScaler().fit(train_df)

# transform both training and testing dataframes using the scaler
scaled_train_df = pd.DataFrame(scaler.transform(train_df), index=train_df.index, columns=train_df.columns)
scaled_test_df = pd.DataFrame(scaler.transform(test_df), index=test_df.index, columns=test_df.columns)

# create X and y from the scaled train and test dataframes
target_features = ['stator_winding', 'stator_tooth', 'stator_yoke', 'pm', 'torque']
X_train_scaled = scaled_train_df.drop(target_features, axis=1)
X_test_scaled  = scaled_test_df.drop(target_features, axis=1)
y_train_scaled = scaled_train_df[target_features]
y_test_scaled  = scaled_test_df[target_features]

X_train = train_df.drop(target_features, axis=1)
X_test  = test_df.drop(target_features, axis=1)
y_train = train_df[target_features]
y_test  = test_df[target_features]
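
Since one scaler is fit on all columns at once, converting scaled target values (for example, model predictions) back to physical units requires the mean and scale of the matching columns. The sketch below shows one way to do that on toy data; the values are made up and the variable names `targets`, `target_pos`, and `y_restored` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy training frame; column names borrowed from the dataset, values made up
train = pd.DataFrame({
    "coolant": [18.0, 20.0, 22.0, 24.0],
    "pm": [24.0, 30.0, 36.0, 42.0],
})
targets = ["pm"]

# fit on the training data only, then scale it
scaler = StandardScaler().fit(train)
scaled = pd.DataFrame(scaler.transform(train), index=train.index, columns=train.columns)

# positions of the target columns in the order the scaler saw them
target_pos = [train.columns.get_loc(c) for c in targets]

# undo z = (x - mean) / scale for the target columns only
y_scaled = scaled[targets].to_numpy()
y_restored = y_scaled * scaler.scale_[target_pos] + scaler.mean_[target_pos]

print(np.allclose(y_restored, train[targets].to_numpy()))
```

An alternative design would be to fit a second `StandardScaler` on the target columns alone, so that its `inverse_transform` can be applied to predictions directly.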

In [10]:
# double check it looks as expected
X_train.head()

profile_id,runtime,u_q,coolant,u_d,motor_speed,i_d,i_q,ambient
63,0.0,0.246925,28.045749,3.022945,0.024556,-2.000174,1.098522,24.808095
63,0.5,0.284046,28.004429,2.84273,21.57778,-8.141494,-23.634663,24.807355
63,1.0,2.693688,27.977071,5.815121,106.637736,-18.826727,-57.558698,24.806825
63,1.5,7.426934,27.958966,10.690675,238.624395,-24.998416,-79.059977,24.80741
63,2.0,14.126337,27.948243,15.97152,404.092397,-27.168663,-88.892394,24.814582


In [11]:
# double check it looks as expected
y_train.head()

profile_id,runtime,stator_winding,stator_tooth,stator_yoke,pm,torque
63,0.0,27.943351,26.384509,26.426604,27.592874,0.000318
63,0.5,27.913379,26.384509,26.471587,27.589586,-18.429788
63,1.0,27.91602,26.384509,26.462338,27.593999,-44.562251
63,1.5,27.921771,26.384509,26.45571,27.563574,-60.881597
63,2.0,27.948079,26.384509,26.435527,27.560592,-68.017779


In [12]:
# double check it looks as expected
X_test.head()

profile_id,runtime,u_q,coolant,u_d,motor_speed,i_d,i_q,ambient
72,0.0,-1.887356,30.721162,1.946434,0.022686,-2.001102,1.098022,23.886441
72,0.5,0.604919,30.721209,0.109622,27.444022,-8.94431,26.1845,23.885538
72,1.0,4.665253,30.721242,-6.753488,116.920677,-35.659968,80.725465,23.883657
72,1.5,9.206368,30.721266,-16.699363,251.384409,-59.658771,125.454513,23.88038
72,2.0,14.197583,30.720804,-28.559625,418.588814,-77.320911,158.080952,23.874869


In [13]:
# double check it looks as expected
y_test.head()

profile_id,runtime,stator_winding,stator_tooth,stator_yoke,pm,torque
72,0.0,32.113178,31.291477,30.660012,37.112483,0.000187
72,0.5,32.115623,31.296847,30.708855,37.111457,18.702919
72,1.0,32.122736,31.283346,30.743853,37.116662,63.775405
72,1.5,32.130268,31.28334,30.77705,36.924837,101.423441
72,2.0,32.117775,31.212032,30.820728,36.922961,128.924817


In [14]:
# save X_train, X_test, y_train, y_test dataframes
# (forward slashes avoid invalid backslash escapes in the path strings)
X_train.to_csv('../data/X_train.csv')
X_test.to_csv('../data/X_test.csv')
y_train.to_csv('../data/y_train.csv')
y_test.to_csv('../data/y_test.csv')

# save the standardized versions alongside the unscaled ones
X_train_scaled.to_csv('../data/X_train_scaled.csv')
X_test_scaled.to_csv('../data/X_test_scaled.csv')
y_train_scaled.to_csv('../data/y_train_scaled.csv')
y_test_scaled.to_csv('../data/y_test_scaled.csv')
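
When these files are read back in a later notebook, the two index levels are stored as ordinary leading columns and have to be restored explicitly. The following sketch illustrates the round trip on a toy frame, writing to an in-memory buffer instead of disk to stay self-contained; the toy values are made up.

```python
import io
import numpy as np
import pandas as pd

# toy frame with the same two-level index layout as X_train
index = pd.MultiIndex.from_tuples(
    [(63, 0.0), (63, 0.5), (72, 0.0)], names=["profile_id", "runtime"]
)
original = pd.DataFrame({"u_q": [0.25, 0.28, -1.89]}, index=index)

# to_csv writes the index levels as the first two columns
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# index_col=[0, 1] rebuilds the multi-index from those two columns
restored = pd.read_csv(buffer, index_col=[0, 1])
print(restored.index.names)
```

Without `index_col`, `profile_id` and `runtime` would come back as regular feature columns and would need to be set as the index again before modeling.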