# Data analysis

In this exercise, we look at how to load, manipulate, and visualize data sets in Python, and at how to compute features from them.


## Pandas
A widely used Python package for data management and analysis is `pandas`. The basic data structure provided by this package is the data frame. A data frame is a table that can in principle be used in the same way as a table in a relational database, e.g. rows or columns (or both) can be selected.

In [None]:
import numpy as np
import pandas as pd
from scipy.io import arff
import matplotlib.pyplot as plt

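# loadarff returns a tuple (records, metadata)
data = arff.loadarff('S08.arff')
# Build the data frame from the records (first tuple element)
df = pd.DataFrame(data[0])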
df
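
As a small, self-contained illustration of what `loadarff` returns, the following sketch parses an ARFF document from a string instead of a file. The attribute names and values are made up for illustration and are not taken from `S08.arff`. It suggests that the first tuple element holds the records, the second the metadata, and that nominal attributes arrive as byte strings, which may need decoding before further use.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Hypothetical ARFF content, only for illustration (not the S08 data set)
content = """@relation toy
@attribute acc_x numeric
@attribute acc_y numeric
@attribute label {walk,sit}
@data
0.5,-0.1,walk
0.8,0.2,sit
0.6,-0.3,walk
"""

# loadarff accepts a file name or a file-like object
records, meta = arff.loadarff(StringIO(content))
toy_df = pd.DataFrame(records)

# Nominal attributes are loaded as bytes (e.g. b'walk'); decode them to str
toy_df["label"] = toy_df["label"].str.decode("utf-8")
print(toy_df)
```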

### Select rows and columns

With pandas, rows and columns can easily be selected from a data frame. One option is to select columns by name with `loc`.


In [None]:
accx = df.loc[:,"Sensor_T8_Acceleration_X"]
accx

The `:` here stands for "select all rows". If only a single column is selected, the result is of type `Series`; otherwise the result is again a data frame. Rows and columns can also be selected by their integer position with `iloc`. For example, the following expression selects the columns at positions 1 to 3 (the end of the slice is exclusive):

In [None]:
acc = df.iloc[:,1:4]
acc
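
The difference between a `Series` and a data frame result can be checked on a small toy frame. This is a minimal sketch; the column names `time`, `acc_x`, `acc_y` and `acc_z` are hypothetical and only stand in for the much longer sensor column names.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "time": [0.0, 0.1, 0.2, 0.3],
    "acc_x": [0.5, 0.6, 0.8, 0.9],
    "acc_y": [-0.1, 0.2, -0.3, 0.4],
    "acc_z": [9.8, 9.7, 9.9, 9.8],
})

single = toy.loc[:, "acc_x"]         # one column by name: a Series
several = toy.iloc[:, 1:4]           # positions 1, 2, 3: a DataFrame
single_as_frame = toy.iloc[:, 1:2]   # a slice of length one is still a DataFrame

print(type(single).__name__, type(several).__name__, type(single_as_frame).__name__)
```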

Rows can be accessed in the same way. For example, the following expression returns the first 5 rows:

In [None]:
df.iloc[0:5,:]

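Row and column selection can also be combined. Note that `loc` slices by label and includes the end label, so with the default integer index the following expression returns six rows (labels 0 to 5):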

In [None]:
df.loc[0:5,["Sensor_T8_Acceleration_X", "Sensor_T8_Acceleration_Y"]]
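
Slices behave differently in `loc` and `iloc`, which is easy to overlook. The sketch below uses a hypothetical toy frame with the default integer index to compare the two: `loc` treats `0:2` as labels and includes the end, while `iloc` treats it as positions and excludes the end.

```python
import pandas as pd

# Hypothetical toy data with the default index 0, 1, 2, 3
toy = pd.DataFrame({
    "acc_x": [0.5, 0.6, 0.8, 0.9],
    "acc_y": [-0.1, 0.2, -0.3, 0.4],
})

by_label = toy.loc[0:2, ["acc_x", "acc_y"]]   # labels 0, 1, 2
by_position = toy.iloc[0:2, 0:2]              # positions 0, 1

print(len(by_label), len(by_position))
```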

Another practical option is to select rows or columns using Boolean expressions. The following expression, for example, returns all rows for which the value of `Sensor_T8_Acceleration_X` is less than 0.7.

In [None]:
df.loc[df.Sensor_T8_Acceleration_X < 0.7,:]
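
The Boolean expression itself is a `Series` of `True`/`False` values, one per row, which `loc` uses as a mask. As a minimal sketch on hypothetical toy data, several conditions can be combined with `&` (and) or `|` (or); each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "acc_x": [0.5, 0.6, 0.8, 0.9],
    "acc_y": [-0.1, 0.2, -0.3, 0.4],
})

mask = toy["acc_x"] < 0.7                 # Boolean Series, one entry per row
small_x = toy.loc[mask, :]

# Combine two conditions; the parentheses are required
small_x_negative_y = toy.loc[(toy["acc_x"] < 0.7) & (toy["acc_y"] < 0), :]

print(small_x)
print(small_x_negative_y)
```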

### Insert values

Values can be written into a data frame using the same selection syntax. The following expression sets all values in the `Sensor_T8_Acceleration_Y` column that are less than 0 to the value 0.

In [None]:
df.loc[df.Sensor_T8_Acceleration_Y < 0,"Sensor_T8_Acceleration_Y"] = 0
df
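
Note that such an assignment modifies the data frame in place. The following sketch, on hypothetical toy data, works on a copy to keep the original values and compares the result with `Series.clip`, which is one possible alternative for this particular case.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"acc_y": [-0.1, 0.2, -0.3, 0.4]})

# Work on a copy so that the original frame stays unchanged
modified = toy.copy()
modified.loc[modified["acc_y"] < 0, "acc_y"] = 0

# Alternative for this special case: limit all values to a lower bound of 0
clipped = toy["acc_y"].clip(lower=0)

print(modified)
```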

### Apply
In many cases, we want to apply a function to each row or column of the data. The method `apply` allows us to do this. By default, it passes each column to the function. The following expression calculates the mean value of each of the columns at positions 1 to 30:

In [None]:
df.iloc[:,1:31].apply(np.mean)
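
To apply the function per row instead of per column, `apply` takes the argument `axis=1`. The sketch below shows both directions on hypothetical toy data and compares the column-wise result with the built-in `mean` method, which gives the same values for this simple case.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "acc_x": [0.5, 0.6, 0.8],
    "acc_y": [-0.1, 0.2, -0.3],
})

col_means = toy.apply(np.mean)            # default axis=0: one value per column
row_means = toy.apply(np.mean, axis=1)    # axis=1: one value per row
builtin_means = toy.mean()                # built-in column-wise mean

print(col_means)
print(row_means)
```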

## Exercise 1

Calculate the distribution of the classes in `df` (hint: `collections.Counter`).

Plot the distribution of classes as a bar plot.

Plot some accelerometer axes (e.g. "Sensor_RightForeArm_Acceleration_X", "Sensor_RightForeArm_Acceleration_Y", "Sensor_RightForeArm_Acceleration_Z") as a line plot. The different axes should be displayed in different colors.

## Exercise 2

Next, we want to calculate some features on the data. For sequential data, features are typically calculated based on segments. This means that we first calculate a feature function (mean, ...) for rows 1 to n, then for n+1 to 2n, and so on. The segments can also overlap. 
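
To make the segmentation concrete, the following sketch only computes the row ranges of the segments, not the features themselves. The helper `segment_bounds` is hypothetical and rests on two assumptions that the text does not fix: rows are counted from 0 with an exclusive end (as in `iloc`), and a trailing segment shorter than the window is dropped.

```python
def segment_bounds(n_rows, window, overlap=0.0):
    """Return (start, end) row positions of all complete segments.

    overlap is the fraction of the window shared by neighbouring segments,
    e.g. 0.5 for 50% overlap. The end position is exclusive.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Distance between the starts of two consecutive segments (at least 1 row)
    step = max(1, int(round(window * (1.0 - overlap))))
    # Assumption: an incomplete segment at the end is dropped
    return [(start, start + window)
            for start in range(0, n_rows - window + 1, step)]


print(segment_bounds(10, 4))        # no overlap
print(segment_bounds(10, 4, 0.5))   # 50% overlap
```

Each pair can then be used to slice the data, for example with `df.iloc[start:end]`.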

Implement the function `feature`, which calculates a given statistical feature function (mean, ...) for a given window size, overlap and data set. Then calculate the mean, median and variance of the accelerometer data of the right foot with segment lengths of 128, 256 and 512. Use 50% overlap and plot the result.

