# Data lake stack in Animal Sciences 
![image](http://breed4food.com/wp-content/uploads/2014/09/Breed4Food-logo.jpg)

This work is embedded within the Big Data project of [Breed4Food](http://breed4food.com). 
Here we explore the possibility of using a data lake stack for storing and analyzing sensor data, with the aim of improving scalability, modularity, and interoperability. The selected use case is an animal experiment in which the gait score of 200 turkeys was determined. Gait scoring is traditionally performed by a trained person. In this experiment, different types of sensors were used to explore to what extent they can describe or mirror the gait score assigned by a trained person.

### Data & Sensors
* Gait score (visual assessment by a trained person)
* Body Weight (Weighing scale)
* Force Plate (Kistler)
* Accelerometers / inertial measurement units (IMUs) (Xsens MTw Awinda)
* 3D Video camera (Intel Realsense D415)

## ETL
The _**Extract, Transform, and Load (ETL) procedure**_ is the general procedure of copying data from one or more sources into a destination system that represents the data differently from the source(s). 
* Extract: retrieve data from a source
* Transform: convert the retrieved data according to rules and lookup tables, or combine data from different sources
* Load: save the data in a different location
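
The three steps above can be sketched in a few lines of Python. This is only an illustration and not the code used in the tutorials: the CSV layout, the column names (`animal_id`, `time_s`, `force_n`), the values, and the output file name are all assumptions made up for this example.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# Hypothetical raw export: one row per force sample (values are made up).
RAW_CSV = """animal_id,time_s,force_n
T001,0.00,101.5
T001,0.01,118.0
T002,0.00,95.0
T002,0.01,97.5
"""


def extract(source):
    """Extract: read rows from a text source into a list of dicts."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows):
    """Transform: convert types and compute the peak force per animal."""
    peak = {}
    for row in rows:
        animal = row["animal_id"]
        force = float(row["force_n"])
        peak[animal] = max(force, peak.get(animal, force))
    return peak


def load(features, destination):
    """Load: save the derived features in a different location and format."""
    destination.write_text(json.dumps(features, indent=2))
    return destination


# Hypothetical output location in the system temp directory.
out_path = Path(tempfile.gettempdir()) / "fp_features_example.json"
features = transform(extract(RAW_CSV))
load(features, out_path)
print(features)
```

Keeping each step in its own function means a different source (for example a binary sensor file) only requires a new `extract`, while `transform` and `load` can stay the same.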

The ETL procedure is becoming more important because we need to handle ever-increasing datasets, varying data structures, and heterogeneous, multimodal data. In the animal experiment, each sensor produced a different data type. For example, the **Force Plate (Kistler)** produced binary files in the Technical Data Management Streaming (TDMS) format, a file format designed to help engineers and scientists store the large amounts of data generated during simulations and tests. In our data lake stack we want to be able to scale up the ETL procedure for each sensor, so that when a large number of animals is investigated with a certain sensor, the time needed for the ETL procedure stays as short as possible.
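
One possible way to scale a per-sensor ETL step is to treat every recording as an independent unit of work and process the files concurrently. The sketch below uses a thread pool from the Python standard library. It is an assumption for illustration, not necessarily how the tutorials do it. The helper names `etl_one_file` and `etl_many` are hypothetical, and small text files stand in for the real TDMS recordings.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def etl_one_file(path):
    """Stand-in ETL step for one file: read the numbers, return (name, mean)."""
    values = [float(line) for line in path.read_text().split()]
    return path.stem, sum(values) / len(values)


def etl_many(paths, max_workers=4):
    """Apply etl_one_file to every path concurrently and collect a dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(etl_one_file, paths))


# Create three small placeholder files so the sketch runs anywhere.
workdir = Path(tempfile.mkdtemp())
placeholder = {"T001": [1, 2, 3], "T002": [4, 5, 6], "T003": [7, 8, 9]}
for name, values in placeholder.items():
    (workdir / f"{name}.txt").write_text("\n".join(str(v) for v in values))

results = etl_many(sorted(workdir.glob("*.txt")))
print(results)
```

Because each file is handled independently, the same pattern can be moved to a process pool or a cluster framework when parsing becomes CPU-bound. Only `etl_many` would need to change.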

In short, we want to combine data from different sensors at scale (and extract features), and to move from proprietary formats to FAIR (Findable, Accessible, Interoperable, Reusable) data. Once these data are loaded, they can be visualized and analyzed with linear regression (and machine learning).


## Force Plate
To get familiar with the Jupyter Notebook and the scripting, we first focus on a single sensor. The first tutorial can be found here >>> [Tutorial One](3_FP_single.ipynb) <<< and explains how to extract, transform, and load the data of a single Force Plate file. 

Now that you are somewhat familiar with the Jupyter Notebook and the scripting, we will load all available data of a single sensor. The second tutorial can be found here >>> [Tutorial Two](4_FP_multi.ipynb) <<<.

## Inertial Measurement Units (Accelerometers)
In line with the Force Plate data, we also prepared a tutorial on the IMUs, which can be found here >>> [Tutorial Three](5_Acc_multi.ipynb) <<<.

## 3D Video
In addition to the already mentioned sensors, 3D videos were made as well. 

# Joining multiple sensor features

While the overall goal of this tutorial is ETL, we will also take a first step into data analytics!

We will join the extracted features from the Force Plate and the accelerometer sensors, and perform a linear regression to estimate the gait score! 

Go to: >>> [Tutorial Four](6_LinReg.ipynb) <<<
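
To give an idea of what joining features and fitting a linear regression involves, here is a minimal sketch using only the Python standard library. It is not the code of the tutorial notebook, which may well use other libraries. All feature names and numbers are synthetic placeholders, not results from the turkey experiment. The scores were constructed as `0.5 + 0.5 * fp + 1.0 * imu` so that the fit can be checked.

```python
def join_features(fp, imu, scores):
    """Inner join on animal id: keep only animals present in all three tables."""
    ids = sorted(set(fp) & set(imu) & set(scores))
    X = [[1.0, fp[i], imu[i]] for i in ids]  # leading 1.0 is the intercept term
    y = [scores[i] for i in ids]
    return ids, X, y


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)


# Synthetic placeholder features per animal id (NOT experimental data).
fp_feature = {"T001": 1.0, "T002": 2.0, "T003": 3.0, "T004": 4.0, "T005": 5.0}
imu_feature = {"T001": 0.5, "T002": 0.1, "T003": 0.9, "T004": 0.3, "T006": 0.7}
gait_score = {"T001": 1.5, "T002": 1.6, "T003": 2.9, "T004": 2.8, "T005": 3.0}

ids, X, y = join_features(fp_feature, imu_feature, gait_score)
intercept, coef_fp, coef_imu = fit_ols(X, y)
print(ids)
print(round(intercept, 6), round(coef_fp, 6), round(coef_imu, 6))
```

Animals missing from any one of the tables (here `T005` and `T006`) are dropped by the inner join, so the regression only sees animals with a complete set of features and a gait score.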


# Summarizing
TODO: make a new summarizing notebook with lessons learned, discussion points, etc.