# Workshop Notebook

This Jupyter notebook is the interactive part of this workshop.

## Step 1: Inspect the data

When dealing with any new type of data, the first thing we usually want to do is inspect it to build some intuition. Visualizing the data often gives us ideas about how to tackle it and which features we can extract from it.

In [None]:
print('Hello World! I am ready to learn!')


## Step 2: Explore the data

The first thing a data scientist wants to do when approaching a data problem is to look at the data. Since we are working with vibration signals, we should see what they look like.



What has been done here is to load a CSV file containing rows of file names and corresponding train types. The files they point to are stored as binary blobs, which can be found in data/signals. The table shown above is an excerpt of this list as it was read into a dataframe.
Let us explore a couple of the signatures we can find there. I also encourage you to look at more of them to get an even better idea of the data.
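Before plotting, it can help to check how many recordings each train type has. The following is only a minimal sketch: it uses a small made-up table in place of data/training_labels.csv. The column names `filename` and `train_type` are the ones used in the cell below, but the rows and file names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/training_labels.csv; these rows are made up.
csv_text = """filename,train_type
sig_001.bin,train_a
sig_002.bin,train_b
sig_003.bin,train_a
sig_004.bin,unknown
"""
labels = pd.read_csv(io.StringIO(csv_text))

# Number of recordings per train type
counts = labels['train_type'].value_counts()
print(counts)

# First file name of one train type, selected the same way as for the plots
first_a = labels.loc[labels['train_type'] == 'train_a', 'filename'].iloc[0]
print(first_a)
```

On the real data, replacing `io.StringIO(csv_text)` with the path to the CSV file should give the same kind of overview.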


In [2]:
import matplotlib.pyplot as plt
from helpers import load_binary
from helpers import plot_size
import pandas as pd
%matplotlib inline

df = pd.read_csv('data/training_labels.csv')
type_a = df.loc[df['train_type'] == 'train_a']
type_b = df.loc[df['train_type'] == 'train_b']
type_c = df.loc[df['train_type'] == 'train_c']
type_d = df.loc[df['train_type'] == 'train_d']
type_u = df.loc[df['train_type'] == 'unknown']

file_a = 'data/signals/training/' + type_a['filename'].iloc[0]
file_b = 'data/signals/training/' + type_b['filename'].iloc[0]
file_c = 'data/signals/training/' + type_c['filename'].iloc[0]
file_d = 'data/signals/training/' + type_d['filename'].iloc[0]
file_u = 'data/signals/training/' + type_u['filename'].iloc[0]

signal_a = load_binary(file_a)
signal_b = load_binary(file_b)
signal_c = load_binary(file_c)
signal_d = load_binary(file_d)
signal_u = load_binary(file_u)

plot_size(16, 8)
plt.subplot(511)
plt.title('Train A')
plt.plot(signal_a)
plt.subplot(512)
plt.title('Train B')
plt.plot(signal_b)
plt.subplot(513)
plt.title('Train C')
plt.plot(signal_c)
plt.subplot(514)
plt.title('Train D')
plt.plot(signal_d)
plt.subplot(515)
plt.title('Unknown Train')
plt.plot(signal_u)
plt.tight_layout()
plt.show()


Here we can already see a couple of things to expect from the data. The data comes as time series with thousands of time steps and variable lengths. It is filled with impulses, likely from the wheels of the train passing over the sensor. It is generally hard and time-consuming for people to find good patterns by hand in long signals like these, so it would be nice if a machine learning algorithm could learn by itself something that correlates with the train types we want to recognize.

## Step 3: Simple Neural Network

Neural networks generally take one fixed type of input to produce one type of output. This means that we need to do something about the fact that the different signals have different lengths. There are a couple of intuitive ways to do this:
<img src="files/images/timeseries_feature.png">
We can either cut and pad the signal after a certain point, or we can stretch / compress the signals to make them all equally long. The shorter the signal, the easier and faster it is to train neural networks to find good patterns, but it is also possible that there will not be enough information left in the signal to get the best results.
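The two options can be sketched with plain NumPy. This is an illustration only, not the code the notebook uses: `cut_or_pad` and `stretch_or_compress` are hypothetical helper names, and the stretching here uses simple linear interpolation via `np.interp` instead of `scipy.signal.resample`.

```python
import numpy as np

def cut_or_pad(signal, length):
    """Keep the first `length` samples, or append zeros to reach `length`."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) >= length:
        return signal[:length]
    return np.concatenate([signal, np.zeros(length - len(signal))])

def stretch_or_compress(signal, length):
    """Resample the signal to `length` points by linear interpolation."""
    signal = np.asarray(signal, dtype=float)
    old_positions = np.linspace(0.0, 1.0, num=len(signal))
    new_positions = np.linspace(0.0, 1.0, num=length)
    return np.interp(new_positions, old_positions, signal)

short_signal = np.array([1.0, 2.0, 3.0])
long_signal = np.arange(10, dtype=float)

padded = cut_or_pad(short_signal, 5)                # zeros appended at the end
cut = cut_or_pad(long_signal, 5)                    # only the first 5 samples kept
stretched = stretch_or_compress(short_signal, 5)    # 3 samples spread over 5 points
compressed = stretch_or_compress(long_signal, 5)    # 10 samples squeezed into 5 points

print(padded, cut, stretched, compressed)
```

Note that Keras's `pad_sequences` defaults to `padding='pre'`, `truncating='pre'` and `dtype='int32'`, so unlike this sketch it pads and cuts at the start of the signal, and floating-point signals are cast to integers unless `dtype` is set.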

Another issue is that neural networks do not understand words, which means we must convert our "target variables" to numbers. Here we use something called one-hot encoding: an array with as many elements as there are classes to classify, where each index represents one class. Thus:
- "train_a" -> [1, 0, 0, 0, 0]
- "train_b" -> [0, 1, 0, 0, 0]
- "train_c" -> [0, 0, 1, 0, 0]
- "train_d" -> [0, 0, 0, 1, 0]
- "unknown" -> [0, 0, 0, 0, 1]
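The mapping above can be sketched in a few lines. The helper `train_to_id5` used in the next cell lives in `helpers` and is not shown in this notebook, so the `one_hot` and `decode` functions below are hypothetical illustrations, not its actual implementation.

```python
import numpy as np

# Class order taken from the list above
CLASSES = ['train_a', 'train_b', 'train_c', 'train_d', 'unknown']

def one_hot(train_type):
    """Return a vector with a 1 at the index of the given class."""
    vector = np.zeros(len(CLASSES), dtype=int)
    vector[CLASSES.index(train_type)] = 1
    return vector

def decode(probabilities):
    """Map a model output back to the class with the highest score."""
    return CLASSES[int(np.argmax(probabilities))]

encoded = one_hot('train_c')
decoded = decode([0.1, 0.05, 0.7, 0.1, 0.05])
print(encoded, decoded)
```

The `decode` step is the reverse direction: a softmax output gives one score per class, and the index of the largest score tells us which train type the model predicts.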

In [4]:
from helpers import train_to_id5, load_dataset, plot_validation_history
from keras.layers import Dense, Conv1D, MaxPool1D, Flatten
from keras.preprocessing import sequence
from keras.models import Sequential
from scipy.signal import resample
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline


# Change this to set how many steps long you want your time-series to be
# ---- You can modify this number to decide how long signals you want to use ---- #
input_length = 5000


# A function to extract the values we need as input and output for the model training
# -------- You can make changes in this function -------- #
def extract_features(signals, train_types):
    model_input = []
    model_target = []
    
    # Iterate over all signals and corresponding train types
    for signal, train_type in zip(signals, train_types):
                
        # Assemble the signal into one data point
        # --------- Uncomment Line below to stretch / compress signal ----------- #
        # signal = resample(signal, input_length)
        input_vector = np.reshape(signal, (-1, 1))  # special case if you have only 1 time series
    
        # Convert train type to number
        target = train_to_id5(train_type)
        
        # Add to dataset to be fed to a machine learning algorithm
        model_input.append(input_vector)
        model_target.append(target)
    
    # Convert to a more digestible format and return the data; padding also makes all signals equally long
    model_input = sequence.pad_sequences(model_input, input_length)
    model_target = np.array(model_target)
    return model_input, model_target


# Load the data
training_x, training_y = load_dataset(dataset='training')
validate_x, validate_y = load_dataset(dataset='validate')

# Transform the data / extract features
training_x, training_y = extract_features(training_x, training_y)
validate_x, validate_y = extract_features(validate_x, validate_y)

# Build a Convolutional Neural Network
# ------- You can change the number of filters and kernel_size here --------- #
model = Sequential()
model.add(Conv1D(filters=4, kernel_size=5, padding='valid', input_shape=training_x.shape[1:]))
model.add(MaxPool1D(2))
model.add(Conv1D(filters=4, kernel_size=5, padding='valid'))
model.add(MaxPool1D(2))
model.add(Flatten())
model.add(Dense(units=5, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])


# Fit a model to the data. Note that fewer epochs are needed here
# ------- You can change the number of epochs and batch_size here ----------- #
logger = model.fit(training_x, training_y, epochs=25, batch_size=16, validation_data=(validate_x, validate_y))
plot_validation_history(logger)


## Step 4: It is your turn

The goal now is to get the highest possible validation accuracy by tweaking the parameters you have available. Change the number of time steps you include in your model, the number of filters in the neural network, or the epochs and batch_size. Keep in mind that if a run seems to be taking too long, you can kill it with the stop button and then undo the changes that slowed it down.
You can also go to the "exampls.ipynb" notebook to get some ideas about what kind of approaches you can take. Some of these come in the form of exercises as well. The overall goal is to explore different ways to get a higher validation score, so you do not need to do the exercises if you do not want to.
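When changing `input_length`, the number of filters, or `kernel_size`, it can help to estimate how large the model becomes before training it. The sketch below only does the arithmetic for the architecture in Step 3 (two blocks of `Conv1D` with `padding='valid'` followed by `MaxPool1D(2)`, then `Flatten` and a 5-unit `Dense` layer); the function names are hypothetical and no Keras call is involved.

```python
def conv1d_valid_length(length, kernel_size):
    """Output length of a stride-1 convolution with padding='valid'."""
    return length - kernel_size + 1

def maxpool1d_length(length, pool_size):
    """Output length of max pooling when the stride equals the pool size."""
    return length // pool_size

def flattened_size(input_length, filters=4, kernel_size=5, pool_size=2, blocks=2):
    """Number of values reaching the Dense layer after Flatten."""
    length = input_length
    for _ in range(blocks):
        length = conv1d_valid_length(length, kernel_size)
        length = maxpool1d_length(length, pool_size)
    return length * filters

def dense_parameters(n_inputs, units=5):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * units + units

for n in (1000, 5000, 20000):
    size = flattened_size(n)
    print(n, size, dense_parameters(size))
```

With the default `input_length = 5000` this gives 4988 flattened values and 24945 parameters in the final layer, so most of the model's size is driven by how long the input signal is.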

In [None]:
# Feel free to paste code into here if you want to keep the code above clean and copy-pastable
