# Notebook 03 - Preprocessing and Training
## Overview
1. Load data from previous notebook
2. Organize the dataframe by assigning a multi-index based on measurement session and session time
3. Create a train/test split by splitting on the first-level index: profile_id
4. Standardize both train and test data using a scaler fit to only the training data
5. Create and save the train and test X and y datasets

In [1]:
# imports
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [2]:
# get data
filepath = '../data/measures_v2.csv'  # forward slashes avoid invalid backslash escapes and work on every OS
df = pd.read_csv(filepath)

In [3]:
# review df structure
df.head()

Unnamed: 0,u_q,coolant,stator_winding,u_d,stator_tooth,motor_speed,i_d,i_q,pm,stator_yoke,ambient,torque,profile_id
0,-0.450682,18.805172,19.08667,-0.350055,18.293219,0.002866,0.004419,0.000328,24.554214,18.316547,19.850691,0.187101,17
1,-0.325737,18.818571,19.09239,-0.305803,18.294807,0.000257,0.000606,-0.000785,24.538078,18.314955,19.850672,0.245417,17
2,-0.440864,18.82877,19.08938,-0.372503,18.294094,0.002355,0.00129,0.000386,24.544693,18.326307,19.850657,0.176615,17
3,-0.327026,18.835567,19.083031,-0.316199,18.292542,0.006105,2.6e-05,0.002046,24.554018,18.330833,19.850647,0.238303,17
4,-0.47115,18.857033,19.082525,-0.332272,18.291428,0.003133,-0.064317,0.037184,24.565397,18.326662,19.850639,0.208197,17


To organize the dataframe, I add a `runtime` column holding the elapsed time within each measurement session (identified by
`profile_id`), then use `profile_id` and `runtime` together as the multi-index of the dataframe.

In [4]:
# create new column for runtime and initialize to 0.0 (float, because runtime holds fractional values)
df['runtime'] = 0.0

In [5]:
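# function for calculating runtime for a unique profile_id, sampled at Hz samples per second
def calculate_runtime(profile_id, Hz):
    # boolean mask selecting the rows that belong to this profile_id
    subset = df['profile_id'] == profile_id
    timestep = 1.0 / Hz
    # one timestamp per row: 0, timestep, 2*timestep, ... (valid for any Hz)
    df.loc[subset, 'runtime'] = np.arange(subset.sum()) * timestep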

In [6]:
# get list of unique profile_id
u_profile_id = df.profile_id.unique()

# call calculate_runtime for each id in the list
for u_id in u_profile_id:
    calculate_runtime(u_id, 2)
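
As a possible alternative to the loop above, the same `runtime` column can be computed in one vectorized step with `groupby(...).cumcount()`. This is only a minimal sketch on a small made-up dataframe (`toy_df` and `SAMPLE_RATE_HZ` are hypothetical names, not part of this notebook), and it assumes the rows within each `profile_id` are already in chronological order and sampled at a constant 2 Hz, as in the call above.

```python
import pandas as pd

# Hypothetical toy data: two sessions with 3 and 2 rows (not the real dataset)
toy_df = pd.DataFrame({
    'profile_id': [17, 17, 17, 18, 18],
    'u_q': [-0.45, -0.33, -0.44, 0.38, 0.48],
})

SAMPLE_RATE_HZ = 2  # assumed sampling rate, matching calculate_runtime(u_id, 2)

# cumcount() numbers the rows of each group 0, 1, 2, ... in their existing order;
# dividing by the sampling rate converts the row number to elapsed time
toy_df['runtime'] = toy_df.groupby('profile_id').cumcount() / SAMPLE_RATE_HZ

print(toy_df['runtime'].tolist())  # [0.0, 0.5, 1.0, 0.0, 0.5]
```

Because the row counter restarts at every new `profile_id`, each session starts at 0.0 without an explicit loop over the unique IDs.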

In [7]:
# set profile_id and runtime as multi-index
df.set_index(['profile_id','runtime'], inplace=True)

To avoid leaking test data into the training dataset, I create the train-test split before scaling. There are two ways to split: (1) the `profile_id`s can be split into a train set and a test set, or (2) the time series within each `profile_id` can be split into a train portion and a test portion.

Because each `profile_id` represents a new motor cycle, it makes more sense to train on one set of `profile_id`s and predict the target values for held-out `profile_id`s (method 1). This mirrors production, where a deployed model has to perform on motor cycles it has never seen.

Accordingly, the measurement sessions are randomly sampled into a training set of `profile_id`s and a testing set of `profile_id`s.
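
The session-level split described above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the notebook's exact cell: `toy_df`, `session_ids`, and the explicit `test_size=0.25` are hypothetical choices made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 sessions with 2 rows each (not the real dataset)
toy_df = pd.DataFrame({
    'profile_id': [1, 1, 2, 2, 3, 3, 4, 4],
    'runtime':    [0.0, 0.5] * 4,
    'pm':         [24.1, 24.2, 25.0, 25.1, 26.3, 26.4, 27.0, 27.2],
}).set_index(['profile_id', 'runtime'])

# Split the unique session IDs, not the individual rows
session_ids = toy_df.index.get_level_values('profile_id').unique().to_numpy()
train_ids, test_ids = train_test_split(session_ids, test_size=0.25, random_state=23)

# Select whole sessions by their first-level index value
toy_train = toy_df.loc[list(train_ids)]
toy_test = toy_df.loc[list(test_ids)]

# No session appears on both sides of the split
overlap = set(train_ids) & set(test_ids)
print(len(train_ids), len(test_ids), overlap)  # 3 1 set()
```

Splitting on IDs keeps every row of a session on the same side, so the model is never evaluated on a session it partially saw during training.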

In [8]:
# randomly split the unique profile_id values (first level of the index);
# train_test_split holds out 25% for testing by default
train_ix, test_ix = train_test_split(df.index.levels[0], random_state=23)

# create train and test dataframes
train_df = df.loc[train_ix]
test_df = df.loc[test_ix]

# fit a StandardScaler() to the training dataframe
scaler = StandardScaler().fit(train_df)

# transform both training and testing dataframes using the scaler
scaled_train_df = pd.DataFrame(scaler.transform(train_df), index=train_df.index, columns=train_df.columns)
scaled_test_df = pd.DataFrame(scaler.transform(test_df), index=test_df.index, columns=test_df.columns)

# create X and y from the scaled train and test dataframes
target_features = ['stator_winding', 'stator_tooth', 'stator_yoke', 'pm', 'torque']
X_train_scaled = scaled_train_df.drop(target_features, axis=1)
X_test_scaled  = scaled_test_df.drop(target_features, axis=1)
y_train_scaled = scaled_train_df[target_features]
y_test_scaled  = scaled_test_df[target_features]

X_train = train_df.drop(target_features, axis=1)
X_test  = test_df.drop(target_features, axis=1)
y_train = train_df[target_features]
y_test  = test_df[target_features]
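
A quick sanity check of the scaling step might look like the sketch below. It uses small made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins for `train_df` and `test_df`). The `inverse_transform` call at the end is not used in this notebook; it is shown only as one possible way to map scaled values back to original units, which may be useful because the target columns are scaled along with the inputs.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for train_df and test_df
toy_train = pd.DataFrame({'coolant': [18.0, 20.0, 22.0], 'pm': [24.0, 26.0, 31.0]})
toy_test = pd.DataFrame({'coolant': [30.0, 32.0], 'pm': [40.0, 44.0]})

# Fit on the training rows only, so test statistics never influence the scaler
scaler = StandardScaler().fit(toy_train)

scaled_train = pd.DataFrame(scaler.transform(toy_train), columns=toy_train.columns)
scaled_test = pd.DataFrame(scaler.transform(toy_test), columns=toy_test.columns)

# Training columns end up with mean 0 and unit population standard deviation
train_is_standardized = (
    np.allclose(scaled_train.mean(), 0.0)
    and np.allclose(scaled_train.std(ddof=0), 1.0)
)
print(train_is_standardized)

# Test columns generally do not, because they reuse the training mean and scale
print(scaled_test.mean().round(2).tolist())

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled_test)
print(np.allclose(restored, toy_test.to_numpy()))
```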

In [9]:
# double check it looks as expected
X_train.head()

profile_id,runtime,u_q,coolant,u_d,motor_speed,i_d,i_q,ambient
18,0.0,0.383218,18.219666,-0.168042,0.001803,-0.001667,-0.001413,23.586594
18,0.5,0.479439,18.2264,-0.255347,-0.004918,0.001103,0.00061,23.63596
18,1.0,0.376068,18.256712,-0.128713,0.001712,-0.002352,-0.000151,23.699308
18,1.5,0.448889,18.278429,-0.214218,-0.002492,-0.001673,-0.000596,23.744698
18,2.0,0.39092,18.293993,-0.171785,0.006671,0.000534,0.001429,23.777222


In [10]:
# double check it looks as expected
y_train.head()

profile_id,runtime,stator_winding,stator_tooth,stator_yoke,pm,torque
18,0.0,19.775839,19.099569,18.989172,24.082214,5.22384
18,0.5,19.770027,19.097933,18.991302,24.090042,5.26395
18,1.0,19.763933,19.097601,18.996777,24.092897,5.19229
18,1.5,19.74799,19.093012,18.97394,24.100513,5.236675
18,2.0,19.752001,19.093916,18.950449,24.10927,5.213964


In [11]:
# double check it looks as expected
X_test.head()

profile_id,runtime,u_q,coolant,u_d,motor_speed,i_d,i_q,ambient
19,0.0,-0.498699,19.277172,2.013678,0.000774,-1.999741,1.09183,22.221134
19,0.5,-0.503534,19.294641,2.017304,0.00187,-1.993066,1.092055,22.25308
19,1.0,-0.408966,19.307226,2.069818,-0.004355,-1.995046,1.093218,22.277687
19,1.5,-0.572685,19.320934,1.963754,-0.003258,-1.99565,1.095758,22.28706
19,2.0,-0.479527,19.334896,2.018658,0.0019,-1.997938,1.095445,22.299097


In [12]:
# double check it looks as expected
y_test.head()

profile_id,runtime,stator_winding,stator_tooth,stator_yoke,pm,torque
19,0.0,19.157566,18.426081,18.486979,24.778728,1.236157
19,0.5,19.150526,18.411228,18.490557,24.781017,1.241396
19,1.0,19.1612,18.420845,18.488298,24.780787,1.300823
19,1.5,19.150074,18.409407,18.477997,24.782694,1.203749
19,2.0,19.150938,18.40893,18.473511,24.778448,1.244979


In [13]:
# save X_train, X_test, y_train, y_test dataframes
# forward slashes avoid invalid backslash escapes and work on every OS
X_train.to_csv('../data/X_train.csv')
X_test.to_csv('../data/X_test.csv')
y_train.to_csv('../data/y_train.csv')
y_test.to_csv('../data/y_test.csv')

X_train_scaled.to_csv('../data/X_train_scaled.csv')
X_test_scaled.to_csv('../data/X_test_scaled.csv')
y_train_scaled.to_csv('../data/y_train_scaled.csv')
y_test_scaled.to_csv('../data/y_test_scaled.csv')
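
When these files are loaded again later, the multi-index is not restored automatically, because `to_csv` writes the index levels as ordinary leading columns. The sketch below shows one way to rebuild it with the `index_col` argument of `pd.read_csv`. The toy frame, the temporary directory, and the file name `X_toy.csv` are hypothetical and stand in for the real files saved above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame with the same index layout as X_train
toy_X = pd.DataFrame({
    'profile_id': [18, 18, 19],
    'runtime': [0.0, 0.5, 0.0],
    'coolant': [18.25, 18.5, 19.25],
}).set_index(['profile_id', 'runtime'])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'X_toy.csv'  # hypothetical file name
    toy_X.to_csv(csv_path)  # index levels are written as the first two columns

    # Name the index columns explicitly to rebuild the multi-index on load
    reloaded = pd.read_csv(csv_path, index_col=['profile_id', 'runtime'])

print(list(reloaded.index.names))  # ['profile_id', 'runtime']
print(reloaded.shape)              # (3, 1)
```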