# 1 Time series classification

Time series classification (TSC) operates on time series data: sequences of values ordered by time. Each data sample is labelled as belonging to a particular class, and a TSC system is trained on these labelled samples so that it can classify unlabelled ones. TSC has a wide range of applications. Smartwatch data is used to classify human activities (walking, running, ascending stairs, etc.). Accelerometers on tagged wild animals are used to monitor animal behaviour (hunting, sleeping) for environmental studies. Sensors on industrial machines are used to classify time series samples as either normal or preceding a failure, informing machine maintenance schedules.
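
To make the idea of training on labelled samples concrete, here is a minimal sketch of one simple approach, a nearest-neighbour classifier. It is not the model used in this workshop. The data is synthetic, and the names `make_sample` and `classify_1nn` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 70)  # 70 data points per sample


def make_sample(label):
    """Hypothetical synthetic sample: class 0 is a noisy sine wave, class 1 a noisy cosine wave."""
    base = np.sin(t) if label == 0 else np.cos(t)
    return base + rng.normal(scale=0.1, size=t.size)


# Labelled training samples
train_labels = np.array([0, 0, 0, 1, 1, 1])
train_data = np.array([make_sample(label) for label in train_labels])


def classify_1nn(sample):
    """Return the label of the training sample closest to `sample` (Euclidean distance)."""
    distances = np.linalg.norm(train_data - sample, axis=1)
    return train_labels[np.argmin(distances)]


# Classify a new sample without telling the classifier its label
unlabelled = make_sample(1)
predicted = classify_1nn(unlabelled)
print('Predicted class:', predicted)
```

The classifier never sees the new sample's label. It simply assigns the label of the most similar training sample.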

## 1.1 Our dataset
This workshop uses the SonyAIBORobotSurface1 dataset from the [Time Series Classification Repository](https://www.timeseriesclassification.com) (Dau et al., 2018). The data was collected by Vail and Veloso (2004) at Carnegie Mellon University from an accelerometer on a Sony AIBO robot. Their aim was to detect the surface that the robot was walking on in order to optimise its gait for that surface. The robots competed in the RoboCup League, a football game played on a carpeted field.



![The Sony AIBO Robot is a robot dog. It is pictured with a ball.](https://i1.wp.com/www.techdigest.tv/wp-content/uploads/2015/06/aibo-560.jpg "Sony AIBO Robot")


Image from Tech Digest (2015) (www.techdigest.tv)

## 1.2 Data capture and processing

The time series data provided is the robot's forward acceleration, recorded by its onboard accelerometer and sampled 125 times per second (125 Hz).


The data has been pre-processed:
+ Each sample has the same number of data points: 70 (0.56s).
+ The samples are aligned - each starts at the same point in the robot's walk.
+ Each sample has been labelled as either cement or carpet. 
+ The dataset has been standardised to give it a mean of 0 and a standard deviation of 1. The original data was in the range of approximately [0, 0.4] gravities and had a positive mean because the robot leans forwards slightly, whereas our data is in the range of approximately [-3, 3].
+ The dataset has been balanced so that it contains an equal number of class 0 and class 1 samples.
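
Two of the points above can be checked with a few lines of code. This is a sketch only: the raw values are made up, and it assumes that standardisation is applied to each sample separately, which this notebook does not state.

```python
import numpy as np

# Duration of one sample: 70 data points recorded at 125 Hz
sampling_rate_hz = 125
points_per_sample = 70
duration_s = points_per_sample / sampling_rate_hz
print('Sample duration in seconds:', duration_s)

# Made-up raw accelerometer values in the range [0, 0.4] gravities
rng = np.random.default_rng(42)
raw = rng.uniform(0.0, 0.4, size=points_per_sample)

# Standardise: subtract the mean, then divide by the standard deviation
standardised = (raw - raw.mean()) / raw.std()
print('Mean after standardising:', abs(round(standardised.mean(), 6)))
print('Standard deviation after standardising:', round(standardised.std(), 6))
```

Standardising removes the positive offset caused by the robot's forward lean, so only the shape of the signal remains.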


![A plot of example data samples](images/example_datasamples.png "Example data samples")

## 1.3 Dataset format
The entire dataset is provided in a single text file (.txt).

![621x71 matrix of data](images/time_series_dataset.png "Dataset")
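
The sketch below shows how a file with this layout can be read: each row is one sample, the first column is the class label and the remaining columns are the data points. The miniature file contents are made up, with 4 data points per row instead of 70, and the `mini_` names are hypothetical.

```python
import io

import pandas as pd

# Made-up miniature file: 3 samples, tab-separated, label first
text = (
    "1\t0.5\t-0.2\t0.1\t0.9\n"
    "0\t-1.1\t0.3\t0.4\t-0.6\n"
    "1\t0.0\t0.7\t-0.8\t0.2\n"
)

# io.StringIO lets pandas read the string as if it were a file
mini_df = pd.read_csv(io.StringIO(text), sep='\t', header=None)
mini_data = mini_df.values  # Convert to a numpy array

mini_labels = mini_data[:, 0].astype(int)  # First column: class labels
mini_series = mini_data[:, 1:]             # Remaining columns: time series values

print('Shape of the whole matrix:', mini_data.shape)
print('Labels:', mini_labels)
print('Shape of the time series data:', mini_series.shape)
```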

# 2 Python modules and functions

## 2.1 Load Python modules
Import the Python modules that we will need. This is code that many other developers have made available for the general public to use ("open source software").

In [None]:
import numpy as np  # Arrays, matrices and functions on them. Required by Pandas, below
import pandas as pd # A data analysis library
from sklearn.model_selection import train_test_split # scikit-learn, machine learning tools
import matplotlib.pyplot as plt # A plotting library
import seaborn as sns # Built on matplotlib, facilitates aesthetically pleasing plots

# General settings
sns.set_style('whitegrid') # Plots will have a white grid
# Variables that will help us work with the classes
class_names = ['cement', 'carpet']
class_colors = ['darkorange', 'steelblue']

## 2.2 Functions

The functions below wrap up some lines of code that we will use later in this notebook. We will treat them as black boxes and will not go through their details in this workshop.

In [None]:
def load_data(filename):
    ''' Load the data from a file in a GitHub repo '''
    url_root = 'https://raw.githubusercontent.com/jarusgnuj/ai-ml-wksh/master/data/UCR_TSC_archive/SonyAIBORobotSurface1_IoC'
    url = url_root+'/'+filename
    robot_df = pd.read_csv(url, sep='\t', header=None) # Use Pandas to load the data into a Pandas DataFrame
    print('Loaded from', url)
    robot_data = robot_df.values # Convert from a Pandas DataFrame to a numpy array
    print('The shape of robot_data is', robot_data.shape)
    print('Number of samples of class 0 (cement)', (robot_data[:,0].astype(int) == 0).sum())
    print('Number of samples of class 1 (carpet)', (robot_data[:,0].astype(int) == 1).sum())
    print('')
    return robot_data


def plot_data_samples(data, labels, sample_numbers):
    ''' Plot the time series data relating to the input list of sample numbers '''
    # sample_numbers is a list of row numbers, e.g. [1, 7, 22, 42]
    fig, ax = plt.subplots()

    for i in sample_numbers:
        plt.plot(data[i], label=class_names[labels[i]], color=class_colors[labels[i]])
        print('sample', i, 'class', str(labels[i]), class_names[labels[i]])

    print('')
    plt.ylim([-3.5, 3.5])
    plt.title('Orange : cement (class 0)\nBlue : carpet (class 1)')
    ax.set_ylabel('Accelerometer data')
    ax.set_xlabel('Data point number')
    

def plot_single_sample(data, sample_number):
    ''' Plot the time series data relating to this sample number. '''
    fig, ax = plt.subplots()
    plt.plot(data[sample_number], color='darkred')
    txt = 'Sample '+str(sample_number)+': Cement or carpet?\nDo you recognise the data\'s pattern?'
    plt.suptitle(txt)
    ax.set_ylabel('Standardised x-axis accelerometer data')
    ax.set_xlabel('Data point number')

# 3 Load the data

In [None]:
filename = 'SonyAIBORobotSurface1_IoC_BALANCED.txt'
robot_data = load_data(filename) # This is a function that we created earlier in this notebook

# Preview the top-left corner of the data matrix
print('The robot_data is a matrix. These are the first 7 rows and 5 columns of robot_data:\n', robot_data[:7, :5], '\n')

## 3.1 Process the data
Separate out the labels vector from the time series data samples.

class 0 : cement

class 1 : carpet

In [None]:
labels = robot_data[:,0].astype(int)
data = robot_data[:,1:]
print('The shape of the data matrix is', data.shape)
print('The shape of the labels vector is', labels.shape)

# 4 Plot the data
Select the row numbers of the samples you wish to plot. In Python, and many other programming languages, the first row in a matrix is row 0.

In [None]:
plot_data_samples(data, labels, [0, 1, 2, 3]) ### CHANGE PARAMETER HERE ###

## 4.1 Exercise 1a : Explore the data
In the cell above, look for the code comment "### CHANGE PARAMETER HERE ###" and enter the data sample numbers that you wish to plot. E.g. you could enter [7, 41, 43, 44, 45, 46] to plot sample 7, sample 41 and the samples from 43 to 46.

Add another cell so that you can compare two plots if you wish.

Explore the data and try to answer the following questions:
+ Compare a cement sample to a carpet sample.
  + How do they differ?
+ Select 5 different class 0 samples and plot them together.
  + Do they look similar?
+ Select 5 different class 1 samples and plot them together.
  + How do class 0 and class 1 samples differ?
+ Compare the samples at the beginning of the dataset to those near the end.
+ Do any of the data samples look odd in any way? 
  + Could that data sample be mis-labelled or mis-aligned?
  
The cell below will help you identify which data samples are of class 0 and which are of class 1.

In [None]:
print('Labels of the first 25 data samples:')
print(labels[:25])

# Print the label of one sample, sample i.
i = 8
print('\n', 'Label of data sample', i, ':', labels[i])
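
If you would rather list the row numbers of each class directly, the sketch below shows one way to do it with numpy. The `example_labels` array is made up for the illustration.

```python
import numpy as np

example_labels = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # Made-up labels

# np.flatnonzero returns the positions where the condition is True
cement_rows = np.flatnonzero(example_labels == 0)
carpet_rows = np.flatnonzero(example_labels == 1)

print('Class 0 (cement) rows:', cement_rows)
print('Class 1 (carpet) rows:', carpet_rows)
```

Applied to this notebook's `labels` vector, `np.flatnonzero(labels == 0)[:5]` would give the row numbers of the first five class 0 samples, which can then be passed to `plot_data_samples`.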

## End of exercise 1a

## 4.2 Discussion 1a
Do the class 0 and class 1 samples look different?

In what way?

What class would you say the sample below belongs to?

In [None]:
sample_number = np.random.randint(data.shape[0])  # Pick a random row number from the dataset
plot_single_sample(data, sample_number)

Were you right? Let's find out.

In [None]:
sample_number = 0  ### CHANGE PARAMETER HERE ###
print('sample_number:', sample_number)
print('sample_label:', labels[sample_number])
print('class_name:', class_names[labels[sample_number]])

# 5 Split the dataset into development and final test datasets
We've explored the data. The next step is to begin developing a machine learning model that can classify our data. Before we start, we'll set aside some data to use to test the model once we've finalised it.

![The dataset is split into two, unequal sets](images/final_test_dataset.png "Split into development and final test datasets")

The dataset split is stratified so that the subsets both have the same class balance as the original dataset (50:50 in this case).
![The split is stratified](images/stratify.png "Stratified split")
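
The effect of stratification is easier to see on an imbalanced dataset. The sketch below uses made-up data with an 80:20 class balance and checks that the test subset keeps that balance. The `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up dataset: 80 samples of class 0 and 20 samples of class 1
toy_labels = np.array([0] * 80 + [1] * 20)
toy_data = np.arange(100).reshape(-1, 1)  # One dummy value per sample

toy_train, toy_test, toy_train_labels, toy_test_labels = train_test_split(
    toy_data, toy_labels, test_size=20, random_state=21, stratify=toy_labels)

# Count the samples of each class in the two subsets
print('Training subset class counts:', np.bincount(toy_train_labels))
print('Test subset class counts:', np.bincount(toy_test_labels))
```

Both subsets keep the original 80:20 balance: 64 and 16 samples in the training subset, 16 and 4 in the test subset.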

In [None]:
final_test_set_size = 100

# Use the train_test_split from the scikit-learn (sklearn) module
data_dev, data_finaltest, labels_dev, labels_finaltest = train_test_split(
    data, labels, test_size=final_test_set_size, random_state=21, stratify=labels)

print('The shape of data_dev is', data_dev.shape)
print('The shape of data_finaltest is', data_finaltest.shape)
print('Development data:')
print('Number of samples of class 0', (labels_dev == 0).sum())
print('Number of samples of class 1', (labels_dev == 1).sum())
print('Final test data:')
print('Number of samples of class 0', (labels_finaltest == 0).sum())
print('Number of samples of class 1', (labels_finaltest == 1).sum())

These datasets could now be saved to file and reloaded in the next notebook. Instead, we'll load a prepared dataset in the next notebook.
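
For reference, this is a minimal sketch of how arrays like these could be saved to a single file and reloaded with numpy. The arrays, the file name and the temporary folder are all made up for the illustration, and this is not how the prepared dataset was created.

```python
import os
import tempfile

import numpy as np

# Made-up stand-ins for the development data and labels
example_data_dev = np.random.default_rng(1).normal(size=(6, 70))
example_labels_dev = np.array([0, 1, 0, 1, 1, 0])

# Save both arrays into one .npz file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'example_split.npz')
np.savez(path, data_dev=example_data_dev, labels_dev=example_labels_dev)

# Reload the arrays by the names they were saved under
with np.load(path) as saved:
    reloaded_data = saved['data_dev']
    reloaded_labels = saved['labels_dev']

print('Reloaded data shape:', reloaded_data.shape)
print('Reloaded labels:', reloaded_labels)
```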

# 6 Discussion 1b : Your own time series classification applications
+ Can you think of some useful applications within your organisation?
+ Does your organisation generate or depend upon time series data?

# References
Dau, H. A., Bagnall, A., Kamgar, K., Yeh, C.-C. M., Zhu, Y., Gharghabi, S., Ratanamahatana, C. A. and Keogh, E. (2018) ‘The UCR Time Series Archive’ [Online]. Available at http://arxiv.org/abs/1810.07758 (Accessed 4 May 2019).

Vail, D. and Veloso, M. (2004) ‘Learning from accelerometer data on a legged robot’, *IFAC Proceedings*, vol. 37, no. 8, pp. 822–827 [Online]. Available at https://www.cs.cmu.edu/~mmv/papers/04iav-doug.pdf (Accessed 4 May 2019).