# Simple ETL for the Inertial Measurement Unit (IMU) sensors

An inertial measurement unit (IMU) is an electronic device that measures and
reports a body's specific force, angular rate, and sometimes the magnetic field
surrounding the body, using a combination of accelerometers and gyroscopes,
and sometimes also magnetometers.

**Yaw**.  
The yaw axis (*vertical axis*) has its origin at the center of gravity
and is directed towards the bottom of the aircraft, perpendicular to the wings
and to the fuselage reference line. Motion about this axis is called yaw. A
positive yawing motion moves the nose of the aircraft to the right. The rudder
is the primary control of yaw.

**Pitch**.  
The pitch axis (*transverse* or
*lateral axis*) has its origin at the center of gravity and is directed to the
right, parallel to a line drawn from wingtip to wingtip. Motion about this axis
is called pitch. A positive pitching motion raises the nose of the aircraft and
lowers the tail. The elevators are the primary control of pitch.

**Roll**.  
The
roll axis (*longitudinal axis*) has its origin at the center of gravity and is
directed forward, parallel to the fuselage reference line. Motion about this
axis is called roll. An angular displacement about this axis is called bank. A
positive rolling motion lifts the left wing and lowers the right wing. The pilot
rolls by increasing the lift on one wing and decreasing it on the other. This
changes the bank angle. The ailerons are the primary control of bank. The rudder
also has a secondary effect on bank.

The figure below schematically depicts the three rotations: **Roll**, **Yaw**,
and **Pitch**.


![image](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Yaw_Axis_Corrected.svg/375px-Yaw_Axis_Corrected.svg.png) 


In the animal experiment the IMUs were strapped to three different places on
the animal: 1) the torso, 2) the left leg, and 3) the right leg. This is
depicted schematically in the figure below, where the orange squares are the
IMUs.

<img src="images/Picture1_.png" width="250" height="250" />

We now have an idea of what IMUs are and how they were applied in the
experiment. We uploaded the recorded IMU data to the datalake stack. Note that
the recordings are in a native format and need to be pre-processed before we
can visualize them and extract features.

## First import the required libraries

In [None]:
import os
import sys
import subprocess
from pathlib import Path
from IPython.display import clear_output
import argparse
from subprocess import call
import pixiedust
from pyspark.sql.functions import input_file_name, split
import pyspark.sql.functions as func
from pyspark.sql.window import Window

Make sure that Spark works properly.

In [None]:
spark

## Converting native files to readable format 

Data from the IMUs is in a native format and needs to be pre-processed. For
this we have a customized C++ program (thanks to Jeremy) that takes all
available information and writes it to a *txt* file.

In [None]:
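# Recursively find every native .mtb recording under the "files" folder
pathlist = Path("files").glob('**/*.mtb')
for filename in pathlist:
    # check_output raises CalledProcessError if the converter exits with an error
    subprocess.check_output(["convertmtb", str(filename)])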

Due to the nature of the C++ program, all output files are written to the same
folder as the original *mtb* file.

For convenience we move the output files to a new folder (here the
*accelerometer* folder, a subfolder of *work*). This can be done with the
following lines, where the leading `!` tells the notebook to execute the line
as a Linux shell command.

In [None]:
!mkdir -p /home/jovyan/work/accelerometer
!mv /home/jovyan/files/accelerometer/*.txt /home/jovyan/work/accelerometer
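
The same move can also be done from Python, which avoids an error from `mv` when no *txt* files are present. This is only a minimal sketch under the folder layout assumed above; `move_txt_files` is a hypothetical helper name and not part of the original notebook.

```python
import shutil
from pathlib import Path


def move_txt_files(src_dir, dst_dir):
    """Move every *.txt file from src_dir into dst_dir and return the moved names."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    moved = []
    for txt in sorted(src.glob("*.txt")):  # non-recursive, like the shell glob
        shutil.move(str(txt), str(dst / txt.name))
        moved.append(txt.name)
    return moved


# Paths assumed from the shell cell above:
# move_txt_files("/home/jovyan/files/accelerometer", "/home/jovyan/work/accelerometer")
```

The function returns the list of moved file names, so an empty list signals that the conversion step produced no output.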

## Load all sensor data

Now that we have the pre-processed data, we can load it into a Spark dataframe.

In [None]:
channelsDFall = spark.read.csv('work/accelerometer/', inferSchema=True, header=True, sep=" ")

We can inspect the dataframe with the `.show()` and `.printSchema()` methods.

In [None]:
channelsDFall.show(2)

In [None]:
channelsDFall.printSchema()

## Link data with turkey id

As with the force plate, here we also want to link the filename to each row.

In [None]:
df = channelsDFall.withColumn("input", input_file_name())

Here we split the filename into two parts: one containing the ID, the other the
IMU.

In [None]:
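# Element 7 of the '/'-split path is the file name for paths under work/accelerometer
df = df.withColumn('filename', split(df.input, '/')[7])
# split() takes a regular expression, so the dot is escaped to match a literal "."
df = df.withColumn('ID_IMUs', split(df.filename, r'\.mtb_')[0])
df = df.withColumn('IMU', split(df.filename, r'\.mtb_')[1].substr(1, 8))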
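
To make the splitting logic easier to follow, here is a plain-Python sketch of the same idea applied to a single path. The file name in the example is made up for illustration, and `parse_imu_filename` is a hypothetical helper. Unlike the Spark cell, it takes the last path component instead of a fixed index.

```python
def parse_imu_filename(path):
    """Return (ID part, IMU part) of a converted file path."""
    name = path.rsplit("/", 1)[-1]                 # last path component
    recording, _, rest = name.partition(".mtb_")   # text before / after ".mtb_"
    return recording, rest[:8]                     # first 8 characters give the IMU


# Hypothetical file name, made up for illustration only
example = "file:///home/jovyan/work/accelerometer/trial-01.mtb_00B41234.txt"
print(parse_imu_filename(example))  # ('trial-01', '00B41234')
```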

Show again to check that we got what we anticipated.

In [None]:
df.printSchema()

In [None]:
df.select('ID_IMUs', 'IMU').distinct().show()

In [None]:
sorted(df.groupBy(df.ID_IMUs).count().collect())

Unfortunately, in this case the turkey identifier is not part of the file path.
Instead, there is a separate metadata file that records which IMUs were placed
on each turkey.

So we are going to load that as well and see what is in it.

In [None]:
metadataDF = spark.read.csv('files/Walking trial_IDmatch_edu.csv', header=True, sep=",")

In [None]:
metadataDF.printSchema()

In [None]:
metadataDF.show(3)

The `Wingband0` column is the turkey identifier, so let's rename it to `ID` for
simplicity.

In [None]:
metadataDF = metadataDF.withColumnRenamed('Wingband0', 'ID')

In [None]:
metadataDF.printSchema()

Now we have to combine the two dataframes based on the IMU identifier. We can do
this with the `join` method.

<!-- This is too advanced
For more information about how to join your data see the figure below or
[here](http://kirillpavlov.com/blog/2016/04/23/beyond-traditional-join-with-apache-spark/).
In our case we want to include all data, however we can suffice with an inner
join because the metadata file was manually curated.
<img src="http://kirillpavlov.com/images/join-types.png" width="500" height="500" />
-->

In [None]:
df = df.join(metadataDF, df.ID_IMUs == metadataDF.IMUfiles)
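
Spark's `join` defaults to an inner join, so sensor rows whose `ID_IMUs` value has no match in the metadata are silently dropped. The plain-Python sketch below illustrates that behaviour on made-up values; `inner_join` is a hypothetical helper written for this illustration, not a Spark function.

```python
def inner_join(rows, metadata, left_key, right_key):
    """Inner join two lists of dicts; rows without a metadata match are dropped."""
    lookup = {}
    for m in metadata:
        lookup.setdefault(m[right_key], []).append(m)
    joined = []
    for r in rows:
        for m in lookup.get(r[left_key], []):
            joined.append({**r, **m})
    return joined


# Made-up values for illustration only
rows = [{"ID_IMUs": "rec-a", "Roll": 1.5}, {"ID_IMUs": "rec-b", "Roll": -0.5}]
metadata = [{"IMUfiles": "rec-a", "ID": "T1"}]

joined = inner_join(rows, metadata, "ID_IMUs", "IMUfiles")
print(joined)  # [{'ID_IMUs': 'rec-a', 'Roll': 1.5, 'IMUfiles': 'rec-a', 'ID': 'T1'}]

# Keys present in the sensor data but missing from the metadata
dropped = {r["ID_IMUs"] for r in rows} - {m["IMUfiles"] for m in metadata}
print(dropped)  # {'rec-b'}
```

Comparing the distinct keys on both sides in this way is a quick check that the manually curated metadata covers every recording.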

In [None]:
df.printSchema()

To check whether the join was successful, we can show the columns on which we
joined and include the turkey identifier as well.

In [None]:
df.select('PackedCounter','IMUfiles', 'ID_IMUs', 'ID').show(10)

Creating the metadata file could be automated further; for now it was done
manually by checking the start and end times in the (log) files. For the
Force Plate the output file has the time at which the recording stopped, whereas
the IMU (and 3D-video) files have the time at which the recording started. This
information was combined in a matrix and then manually inspected to check
whether the time-stamps were right for the different files. Any discrepancies
were written down and filtered out if possible.

Now that we have the IMU IDs linked to each turkey, we can go on to process
the data.


## Clean data and calculate summary statistics

The dataframe already has a column called 'StatusWord', which indicates
whether the device is functioning properly.

The 'StatusWord' column takes different values; any value other than 2
represents 'flagged' data.

First let's calculate some summary statistics.

In [None]:
df.select(df.ID,df.Roll,df.Yaw,df.Pitch).describe().show()

This could be more useful per animal id, as:

In [None]:
df.select(df.ID,df.Roll,df.Yaw,df.Pitch).groupBy('ID').mean().show()

Or even per animal and IMU, as:

In [None]:
df.select(df.ID,df.IMU, df.Roll,df.Yaw,df.Pitch).groupBy('ID', 'IMU').mean().show()

## Calculate a single feature

Ideally, a feature that summarizes the accelerometer data would be the
number of steps that the turkey has taken in the gait walk.

As a proxy for this number we will estimate how many times the roll axis
measurement changed sign.


Let's create a dataframe with the columns we need:

- Roll is the rotation on the roll principal axis
- ID is the turkey id
- IMU is the sensor id
- StatusWord is a code that indicates whether the sensor is working properly. The value 2 is used for valid measurements.
- PackedCounter is a counter of the packets, i.e. an identifier for time

In [None]:
small = df.select(df.ID, df.IMU, df.StatusWord, df.PackedCounter, df.Roll)

Filter out the flagged measurements:

In [None]:
small = small.filter(small.StatusWord==2)
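
It is worth knowing how much data such a filter discards. The sketch below shows the idea in plain Python on made-up status values; `status_summary` is a hypothetical helper, and the only assumption taken from the notebook is that 2 marks a valid measurement.

```python
from collections import Counter


def status_summary(status_words, valid=2):
    """Count rows per StatusWord value and return (counts, fraction kept)."""
    counts = Counter(status_words)
    total = sum(counts.values())
    kept = counts.get(valid, 0)
    return counts, (kept / total if total else 0.0)


# Made-up status values for illustration only
counts, fraction = status_summary([2, 2, 2, 6, 2, 18, 2, 2])
print(dict(counts))  # {2: 6, 6: 1, 18: 1}
print(fraction)      # 0.75
```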

To detect a sign change in the roll value, we
will first add a new column with the value of the previous step
(a time lag function).

In [None]:
# Previous Roll value within each (turkey, sensor) series, ordered by time
roll_window = Window.partitionBy("ID", "IMU").orderBy("PackedCounter")
df_lag = small.withColumn('prev_Roll', func.lag(small['Roll']).over(roll_window))

Then we calculate the signum of the current Roll times the Roll of the previous step.
If it is -1 the sign changed; if it is +1 it remained the same. The first row of each
sensor has no previous value, so its result is null and it is not counted.

In [None]:
df_lag = df_lag.withColumn("step",func.signum(df_lag.Roll*df_lag.prev_Roll))

Next, summarize by counting the minuses per sensor.
As we do not know which sensor is attached to which leg, we will use the
minimum number of sign changes out of the three sensors as the estimated feature.

In [None]:
# Count sign changes per sensor, then keep the smallest count per turkey
rc = (df_lag.filter('step = -1')
            .groupBy('ID', 'IMU').count()
            .groupBy('ID').min('count'))
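
The lag, signum and minimum steps can be checked on a tiny made-up example in plain Python. This is only a sketch of the same logic, not the Spark implementation; the function names and the sample values are hypothetical.

```python
def count_sign_changes(values):
    """Count how often two consecutive values have opposite signs."""
    changes = 0
    for prev, cur in zip(values, values[1:]):  # pair each value with its predecessor (the lag)
        if prev * cur < 0:                     # signum of the product is -1
            changes += 1
    return changes


def min_changes_per_turkey(roll_by_sensor):
    """roll_by_sensor maps (turkey id, IMU id) -> time-ordered Roll values."""
    best = {}
    for (turkey, _imu), rolls in roll_by_sensor.items():
        n = count_sign_changes(rolls)
        if n == 0:
            continue  # like the Spark filter: a sensor without sign changes yields no row
        best[turkey] = min(n, best.get(turkey, n))
    return best


# Made-up Roll values for illustration only
data = {
    ("T1", "imu-a"): [1.0, -2.0, 3.0, -1.0],   # 3 sign changes
    ("T1", "imu-b"): [0.5, 0.7, -0.2, -0.4],   # 1 sign change
    ("T1", "imu-c"): [2.0, 1.0, -1.0, 2.0],    # 2 sign changes
    ("T2", "imu-d"): [1.0, 2.0, 3.0],          # no sign change
}
print(min_changes_per_turkey(data))  # {'T1': 1}
```

Note that a sensor without any sign change never reaches the minimum, because filtering on `step = -1` removes all of its rows before counting. A turkey whose sensors all show no sign change is therefore absent from the result.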

## Store in a file

As a final step, let's show the indicator and save it in a file.

In [None]:
rc.show()

Then store in a file :)

In [None]:
rc.withColumnRenamed('min(count)','steps').write.csv("acc_features.csv", header=True, mode='overwrite')