# Notebook 03 - Preprocessing and Training
## Overview
1. Load data from previous notebook
2. Organize the dataframe by assigning a multi-index based on measurement session and session time
3. Create a train/test split by splitting on the first-level index: profile_id
4. Standardize both train and test data using a scaler fit to only the training data
5. Create and save the train and test X and y datasets

In [146]:
# imports
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [147]:
# get data
filepath = '../data/measures_v2.csv'
df = pd.read_csv(filepath)

In [148]:
# review df structure
df.head()

,u_q,coolant,stator_winding,u_d,stator_tooth,motor_speed,i_d,i_q,pm,stator_yoke,ambient,torque,profile_id
0,-0.450682,18.805172,19.08667,-0.350055,18.293219,0.002866,0.004419,0.000328,24.554214,18.316547,19.850691,0.187101,17
1,-0.325737,18.818571,19.09239,-0.305803,18.294807,0.000257,0.000606,-0.000785,24.538078,18.314955,19.850672,0.245417,17
2,-0.440864,18.82877,19.08938,-0.372503,18.294094,0.002355,0.00129,0.000386,24.544693,18.326307,19.850657,0.176615,17
3,-0.327026,18.835567,19.083031,-0.316199,18.292542,0.006105,2.6e-05,0.002046,24.554018,18.330833,19.850647,0.238303,17
4,-0.47115,18.857033,19.082525,-0.332272,18.291428,0.003133,-0.064317,0.037184,24.565397,18.326662,19.850639,0.208197,17


To organize the dataframe, I add a `runtime` column holding the elapsed time within each measurement session (identified by `profile_id`), then use `profile_id` and `runtime` as the dataframe's multi-index.

In [149]:
# create new column for runtime and initialize to 0.0 (float, since runtime holds fractional seconds)
df['runtime'] = 0.0

In [150]:
# function for calculating runtime (in seconds) for a unique profile_id sampled at Hz
def calculate_runtime(profile_id, Hz):
    subset = df['profile_id'] == profile_id
    timestep = 1.0 / Hz
    # one timestamp per row: 0, timestep, 2*timestep, ...
    df.loc[subset, 'runtime'] = np.arange(subset.sum()) * timestep

In [151]:
# get list of unique profile_id
u_profile_id = df.profile_id.unique()

# call calculate_runtime for each id in the list
for u_id in u_profile_id:
    calculate_runtime(u_id, 2)

In [152]:
# set profile_id and runtime as multi-index
df.set_index(['profile_id','runtime'], inplace=True)
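
As an alternative to looping over each `profile_id`, the same `runtime` column can be built in one vectorized step. The sketch below is a minimal illustration on a toy dataframe, not the code this notebook ran: `toy`, `toy_indexed`, and `SAMPLE_RATE_HZ` are hypothetical names, and the 2 Hz rate is assumed from the value passed to `calculate_runtime` above.

```python
import pandas as pd

# Toy stand-in for the measurement data: two sessions of different lengths.
toy = pd.DataFrame({
    "profile_id": [17, 17, 17, 4, 4],
    "u_q": [0.1, 0.2, 0.3, 0.4, 0.5],
})

SAMPLE_RATE_HZ = 2  # assumed sampling rate, matching the value passed to calculate_runtime

# cumcount() numbers the rows within each profile_id as 0, 1, 2, ...;
# dividing by the sampling rate converts the sample number to elapsed seconds.
toy["runtime"] = toy.groupby("profile_id").cumcount() / SAMPLE_RATE_HZ

# Same multi-index layout as the notebook: session first, then time within the session.
toy_indexed = toy.set_index(["profile_id", "runtime"])
print(toy_indexed)
```

Because the row count comes from the group itself, this version does not depend on the sampling rate being exactly 2 Hz.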

To avoid leaking test data into the training set, I create the train-test split before scaling. There are two ways to split: (1) split the `profile_id`s into a train set and a test set, or (2) split the time series within each `profile_id` into a train set and a test set.

Because each `profile_id` represents a separate motor cycle, method 1 makes more sense: train on the training `profile_id`s and predict the target values for the test `profile_id`s, just as a deployed model would have to perform on motor cycles it has never seen.

The measurement sessions are therefore randomly sampled into a training set of `profile_id`s and a testing set of `profile_id`s.
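
One way to sketch method 1 with a reproducible result is scikit-learn's `GroupShuffleSplit`, which keeps every row of a group on the same side of the split. This is an illustration on toy data rather than the split used in this notebook; the dataframe `toy` and the `test_size` and `random_state` values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: 8 hypothetical sessions with 5 samples each.
toy = pd.DataFrame({
    "profile_id": np.repeat(np.arange(8), 5),
    "u_q": np.linspace(0.0, 1.0, 40),
})

# One shuffled split in which every profile_id lands entirely on one side.
# test_size and random_state are illustrative choices, not values from this notebook.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_rows, test_rows = next(splitter.split(toy, groups=toy["profile_id"]))

train_ids = set(toy.iloc[train_rows]["profile_id"])
test_ids = set(toy.iloc[test_rows]["profile_id"])
print("train sessions:", sorted(train_ids))
print("test sessions:", sorted(test_ids))
```

Fixing `random_state` means rerunning the cell yields the same sessions in each set, which makes later results easier to compare.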

In [153]:
# randomly split the unique profile_id values (first index level) into train and test sets
# (train_test_split holds out 25% for testing by default)
train_ix, test_ix = train_test_split(df.index.levels[0])

# create train and test dataframes
train_df = df.loc[train_ix]
test_df = df.loc[test_ix]

# fit a StandardScaler() to the training dataframe
scaler = StandardScaler().fit(train_df)

# transform both training and testing dataframes using the scaler
scaled_train_df = pd.DataFrame(scaler.transform(train_df), index=train_df.index, columns=train_df.columns)
scaled_test_df = pd.DataFrame(scaler.transform(test_df), index=test_df.index, columns=test_df.columns)

# create X and y from the scaled train and test dataframes
target_features = ['stator_winding', 'stator_tooth', 'stator_yoke', 'pm', 'torque']
X_train = scaled_train_df.drop(target_features, axis=1)
X_test = scaled_test_df.drop(target_features, axis=1)
y_train = scaled_train_df[target_features]
y_test = scaled_test_df[target_features]
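
A quick way to confirm that the scaler was fit on the training data only is to inspect the column statistics after transforming. The sketch below uses randomly generated toy data, so all names and values in it are hypothetical: the scaled training columns should have mean 0 and standard deviation 1, while the scaled test columns generally will not.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy train/test frames drawn from different distributions (hypothetical values).
toy_train = pd.DataFrame(rng.normal(50.0, 10.0, size=(200, 2)), columns=["coolant", "pm"])
toy_test = pd.DataFrame(rng.normal(60.0, 5.0, size=(50, 2)), columns=["coolant", "pm"])

# The scaler's mean and variance come from the training rows only.
toy_scaler = StandardScaler().fit(toy_train)
scaled_train = pd.DataFrame(toy_scaler.transform(toy_train), columns=toy_train.columns)
scaled_test = pd.DataFrame(toy_scaler.transform(toy_test), columns=toy_test.columns)

# Training columns are centered with unit (population) standard deviation.
print(scaled_train.mean().round(6))
print(scaled_train.std(ddof=0).round(6))

# Test columns reuse the training statistics, so they are usually off-center.
print(scaled_test.mean().round(3))
```

`StandardScaler` uses the population standard deviation, which is why the check passes `ddof=0` to `DataFrame.std`.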

In [154]:
# double check it looks as expected
X_train.head()

profile_id,runtime,u_q,coolant,u_d,motor_speed,i_d,i_q,ambient
19,0.0,-1.249976,-0.823594,0.4061,-1.204935,1.023683,-0.346287,-1.234916
19,0.5,-1.250084,-0.822784,0.406157,-1.204934,1.023785,-0.346285,-1.219072
19,1.0,-1.247974,-0.8222,0.406982,-1.204938,1.023755,-0.346272,-1.206869
19,1.5,-1.251627,-0.821564,0.405315,-1.204937,1.023745,-0.346244,-1.20222
19,2.0,-1.249549,-0.820916,0.406178,-1.204934,1.02371,-0.346248,-1.196251


In [155]:
# double check it looks as expected
y_train.head()

profile_id,runtime,stator_winding,stator_tooth,stator_yoke,pm,torque
19,0.0,-1.646321,-1.695455,-1.518356,-1.819588,-0.340628
19,0.5,-1.646564,-1.6961,-1.518176,-1.819467,-0.34056
19,1.0,-1.646195,-1.695682,-1.518289,-1.819479,-0.339786
19,1.5,-1.646579,-1.696179,-1.518805,-1.819378,-0.34105
19,2.0,-1.646549,-1.696199,-1.51903,-1.819603,-0.340513


In [156]:
# double check it looks as expected
X_test.head()

profile_id,runtime,u_q,coolant,u_d,motor_speed,i_d,i_q,ambient
30,0.0,-1.23572,-0.943931,0.400362,-1.204921,1.03727,-0.351771,-2.410569
30,0.5,-1.237221,-0.946349,0.405558,-1.204927,1.033438,-0.35019,-2.410569
30,1.0,-1.238291,-0.947779,0.408632,-1.20493,1.030649,-0.349091,-2.410569
30,1.5,-1.239694,-0.948044,0.412156,-1.204929,1.028686,-0.348256,-2.410569
30,2.0,-1.2388,-0.947937,0.413536,-1.204941,1.027251,-0.347704,-2.410569


In [157]:
# double check it looks as expected
y_test.head()

profile_id,runtime,stator_winding,stator_tooth,stator_yoke,pm,torque
30,0.0,-1.634683,-1.698034,-1.528661,-1.904041,-0.351569
30,0.5,-1.634963,-1.697698,-1.528716,-1.904529,-0.35157
30,1.0,-1.635547,-1.697172,-1.5287,-1.904427,-0.351607
30,1.5,-1.635462,-1.696813,-1.528845,-1.904141,-0.351598
30,2.0,-1.635886,-1.696603,-1.528976,-1.903938,-0.349925


In [158]:
# save X_train, X_test, y_train, y_test dataframes
X_train.to_csv('../data/X_train.csv')
X_test.to_csv('../data/X_test.csv')
y_train.to_csv('../data/y_train.csv')
y_test.to_csv('../data/y_test.csv')
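
When these files are loaded again, the two index levels come back as ordinary columns unless they are named in `index_col`. The sketch below shows the round trip on a small toy dataframe written to a temporary directory; the file name `X_toy.csv` and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same two index levels as the saved datasets.
toy = pd.DataFrame(
    {"u_q": [-1.25, -1.24, 0.3], "coolant": [-0.82, -0.81, 0.1]},
    index=pd.MultiIndex.from_tuples(
        [(19, 0.0), (19, 0.5), (30, 0.0)], names=["profile_id", "runtime"]
    ),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "X_toy.csv")  # hypothetical file name
    toy.to_csv(csv_path)  # the index levels are written as the first two columns
    # index_col rebuilds the multi-index from those two columns on load
    restored = pd.read_csv(csv_path, index_col=["profile_id", "runtime"])

print(restored.index.names)
print(restored)
```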