# 0. Setting up the Development Environment

## Lab steps


*   Integrate GitHub with Colab to make uploading data easier
*   Download the code and data from GitHub to your Google Drive
*   Visualize the data

## Resources

* An introduction to the features provided by Colab [here](https://colab.research.google.com/notebooks/intro.ipynb) (in French);
* Another introduction to the features provided by Colab [here](https://colab.research.google.com/notebooks/basic_features_overview.ipynb);
* A refresher on Python, NumPy, and Matplotlib can be found [here](https://colab.research.google.com/github/cs231n/cs231n.github.io/blob/master/python-colab.ipynb) (from Stanford's cs231n course);
* An introductory Colab notebook for TensorFlow 2 can be found [here](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb).

# 1. Integration between Drive-Colab-Github

This section is based on two tutorials: [this tutorial](https://towardsdatascience.com/google-drive-google-colab-github-dont-just-read-do-it-5554d5824228) and [this tutorial](https://towardsdatascience.com/colaboratory-drive-github-the-workflow-made-simpler-bde89fba8a39).


### The workflow
The workflow we will use during the labs is a simple four-step process:
0. First, create a fresh Google Drive dedicated solely to this lab;
1. After connecting to the Colab runtime, mount Google Drive and update your space from GitHub;
2. Work with the notebooks and the files you need (your modules, libraries, etc.) as you would in an editor;
3. (Optional) Save your work by synchronizing your Drive with GitHub using the operational notebook.


<p align="center">
<img src="https://user-images.githubusercontent.com/8298445/104195015-6086cb80-5422-11eb-8cba-60ab7478c5ac.png" width="300px" title="https://towardsdatascience.com/colaboratory-drive-github-the-workflow-made-simpler-bde89fba8a39"/>
</p>
<p align="center">
Figure: From <a href="https://towardsdatascience.com/colaboratory-drive-github-the-workflow-made-simpler-bde89fba8a39">this tutorial</a>.
</p>


Before going into detail, let’s look at the role of each component (Google Drive, Google Colab, GitHub) and how they interact (based on [this tutorial](https://towardsdatascience.com/google-drive-google-colab-github-dont-just-read-do-it-5554d5824228)).

* **Google Colab:** is used as a shell to run bash and git commands.
* **Google Drive:** When we use Google Colab, our work is stored only temporarily in a virtual machine, for around 8 to 12 hours. Google Drive provides persistent cloud storage (15 GB for free) and integrates easily with Google Colab. We will use it as the permanent location of the cloned GitHub repository that we work on.
* **GitHub:** A code hosting platform for version control and collaboration, which hosts the repository containing this notebook. The repository also contains other useful code.



### Connecting, mounting and updating

In [None]:
from google.colab import drive
from os.path import join

ROOT = '/content/drive'     # default for the drive
PROJ = 'MyDrive/ml-iot'       # path to your project on Drive

drive.mount(ROOT)            # we mount the drive at /content/drive

PROJECT_PATH = join(ROOT, PROJ)
!mkdir -p "{PROJECT_PATH}" # -p: no error if the directory already exists
%cd "{PROJECT_PATH}"

In [None]:
GIT_REPOSITORY = "2022-ml-iot-lab-2"
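GIT_PATH = "https://github.com/institut-galilee/" + GIT_REPOSITORY + ".git"
!git clone "{GIT_PATH}"
# "!cd" would run in a throwaway subshell and leave the notebook's working directory unchanged;
# use %cd when you need to move into the cloned repository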
# !rsync -aP --exclude="{GIT_REPOSITORY}"/data/ "{GIT_REPOSITORY}"/generated/

In [None]:
!ls

The snippets above mount Google Drive at `/content/drive`, create the project directory there, and clone the GitHub repository into it. The `rsync` line is left commented out; it is only needed if you want to copy files selectively.

These cells can be executed multiple times without harm: an already mounted Drive, an existing project directory, or an already cloned repository is left untouched. With `rsync`, only what is new gets copied, and `--exclude` lets you skip content that would take too long to copy (such as the data).
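
The "update only what is new" idea can also be applied to the clone step itself. The following is a minimal sketch, not part of the lab repository: the helper name `git_sync_command` is hypothetical, and it only decides which git command to run, depending on whether a working copy already exists.

```python
import os


def git_sync_command(repo_url, repo_dir):
    """Return the git command (as a list) that brings repo_dir up to date."""
    if os.path.isdir(os.path.join(repo_dir, ".git")):
        # already cloned: only fetch and merge what changed upstream
        return ["git", "-C", repo_dir, "pull"]
    # first run: create the working copy
    return ["git", "clone", repo_url, repo_dir]
```

In a notebook cell, the returned list could be passed to `subprocess.run(git_sync_command(GIT_PATH, GIT_REPOSITORY), check=True)`. Note that `git pull` can fail if you have local, uncommitted modifications that conflict with upstream changes.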

### [Optional] Save changes to GitHub

(Optional, only if you work on your own fork of the repository!)

To save your changes, run the following commands. They push your modifications to your fork (code repository).

In [None]:
# set your identity once, before the first commit
# !git config --global user.email "your github email"
# !git config --global user.name "your first and last name"
# !git add .
# !git commit -m "Put here a message that describes your modifications, e.g. answering question 3"
# !git push origin master

# 2. Discovering the SHL Dataset

## Goals/Outline

In this first series of labs, we will take a look at the Sussex-Huawei Locomotion (SHL) dataset. We will also build our first Keras model and train it to recognize human activities (run, walk, bike, etc.) from sensor data such as accelerometer, gyroscope, and magnetometer signals.

1. We will first work on a subset of the SHL dataset containing only examples from one of the four body locations (**Torso**, Hips, Hand, and Bag) and three of the fifteen modalities (**accelerometer, gyroscope, magnetometer**, gravitation, ambient pressure, etc.);

2. We will build a data sample that will be used to train our activity recognition models;

3. We will take a look at the signals, visualize them, apply some preprocessing to them, and extract valuable features from them;

4. We will see the dimensions and structure of the dataset that we feed to the Keras models;

5. We will explore both raw-signal and feature-based activity recognition models;

6. After adding other modalities, we will explore various sensor fusion schemes, i.e. channel, modality, and grouped fusion.

## Download a subset of the dataset

In this part, we will launch two scripts, `get_data.sh` and `extract_data.sh`, which download a subset of the SHL dataset and extract it to the right folder, respectively.
If you want to work on the original dataset (really heavy) as a personal side project, check out the commented lines inside `get_data.sh`.

In [None]:
# check where we are located in the filesystem tree
!pwd
!ls

In [None]:
# if needed, go inside the cloned repository (be aware of the difference between %cd and !cd)
%cd "{GIT_REPOSITORY}"
!pwd
!ls

In [10]:
# give execution permissions to the two scripts
!chmod +x get_data.sh
!chmod +x extract_data.sh

In [None]:
# launch the download; this may take a while!
!./get_data.sh

In [None]:
# extract the downloaded zip files; this may also take a while!
!./extract_data.sh

## Structure of the code repository


---



On the left side, the file system button lets you display the contents of the current folder. The structure should be the same as in the GitHub repository.

```
├── data/            # where the initial zipped data will be stored
├── generated/       # generated from sample after basic transformations (ready for ML algorithms) 
     ├── sample/     # subset of data to use in your experiments
          ├── train/
          └── validation/
     ├── tmp/        # where downloaded zipped data will be extracted. This is where data is read from
     └── 0.1/        # used to store data in a memory-convenient manner (not to be changed)
├── dataset.py       # Python code used to load data
├── config.py        # provide needed configurations: paths ... (not to be changed)
├── sample.py        # contains the code to extract samples from the data 
├── get_data.sh      # shell script to download data 
└── extract_data.sh  # shell script to extract the downloaded data
```
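
As a quick sanity check, you can compare your working copy against the tree above. The sketch below is an illustration only: the helper `missing_entries` is hypothetical, and the list of expected names is simply copied from the tree (the `data/` and `generated/` folders may only appear after the download and extraction scripts have run).

```python
from pathlib import Path

# top-level entries copied from the tree above; adjust if your checkout differs
EXPECTED = [
    "data", "generated", "dataset.py", "config.py",
    "sample.py", "get_data.sh", "extract_data.sh",
]


def missing_entries(repo_root, expected=EXPECTED):
    """Return the expected entries that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_entries(".")` from inside the cloned repository returns an empty list when everything is in place.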

## Structure of the Raw Dataset

Here we explore the structure of the SHL dataset as it is provided by the team responsible for collecting and maintaining it.

<p align="center">
<img src="https://user-images.githubusercontent.com/8298445/104208754-27a22300-5431-11eb-8a5c-d3093fdf3de4.png" width="300px"/>
</p>

```
├── generated/
    └── sample
        ├── train
            ├── Torso/
                ├── Acc_x.txt
                ├── Acc_y.txt
                ├── Acc_z.txt
                ├── Gyr_x.txt
                ...
                └── Labels.txt
            ├── Hand             # Not included in the subset
            ├── Hips             # Not included in the subset
            └── Bag              # Not included in the subset
        └── validation
```

To see what the first 10 rows of e.g. `Acc_x.txt` look like, execute the following command.
Each row contains 500 data points measured by an accelerometer (along the x axis) while the user performs one of the eight given activities. These 500 data points are consecutive in time and correspond to 5 seconds of the performed activity.

In [None]:
!head -10 ./generated/tmp/sample/train/Torso/Acc_x.txt
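
Since each row holds 500 samples covering 5 seconds, the implied sampling rate is 500 / 5 = 100 samples per second. The sketch below shows how such rows could be read into a NumPy array and given a time axis in seconds. It assumes whitespace-separated values and uses synthetic data in place of the real file; `load_frames` and `time_axis` are hypothetical helper names.

```python
import io

import numpy as np

FRAME_LEN = 500                              # samples per row, as stated above
FRAME_SECONDS = 5.0                          # duration of one row, as stated above
SAMPLING_RATE = FRAME_LEN / FRAME_SECONDS    # samples per second


def load_frames(source):
    """Read whitespace-separated rows into an array of shape (rows, samples)."""
    return np.atleast_2d(np.loadtxt(source))


def time_axis(n_samples, rate=SAMPLING_RATE):
    """Timestamps in seconds for n_samples consecutive measurements."""
    return np.arange(n_samples) / rate


# synthetic stand-in for a channel file: 3 rows of 500 values each
demo = io.StringIO("\n".join(" ".join(["0.5"] * FRAME_LEN) for _ in range(3)))
frames = load_frames(demo)
t = time_axis(frames.shape[1])
print(frames.shape, SAMPLING_RATE, t[-1])
```

To try it on real data, pass the path of a channel file to `load_frames` instead of the `StringIO` object.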

## Exploring and visualizing the sensory signals


---



We will use Plotly to explore the sensor signals interactively.

### What is Plotly

Plotly lets you create interactive charts and maps with APIs in Python, R, and JavaScript. It is intuitive and highly customisable, and it integrates nicely with Pandas DataFrames in Python through the [Plotly Express](https://plotly.com/python/plotly-express/) module, which was a separate package before being included in Plotly version 4.

### Plotly
Basic [installation of Plotly](https://plotly.com/python/getting-started/#installation)
```bash
pip install plotly
```
This makes Plotly available in the Python environment. It does not automatically allow you to render Plotly figures in notebooks with `fig.show()`, but in notebooks they can be rendered as follows:

In [None]:
!pip install plotly

In [None]:
from IPython.display import HTML
import pandas as pd
import plotly.graph_objects as go

# the file has no header row: every line is one 500-sample frame
df = pd.read_csv('./generated/sample/train/Torso/Gyr_x.txt', sep=' ', header=None)

fig = go.Figure()
# plot the first 10 frames one after another on a common sample axis
for i in range(10):
    fig.add_trace(
        go.Scatter(
            x=list(range(500 * i, 500 * (i + 1))),
            y=df.iloc[i, :],
            mode="lines",
            line=go.scatter.Line(color="blue"),
            showlegend=False)
    )
# embed the figure as HTML so that it renders inside the notebook
HTML(fig.to_html())

## Loading the dataset

Now that we have built a sample and stored it in the Drive, we will use it to train our activity recognition models.
Before that, we need to put the data into a convenient and memory-efficient data structure. The added value of its backend implementation (which uses OS-based memory mapping) will become apparent when you work on the entire dataset.
In the following, we will just take a high-level look at the data structure.
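
To give an intuition for what OS-based memory mapping brings, here is a small sketch using `numpy.memmap`. It is not the implementation used in `dataset.py`; the file name and sizes are made up for the demonstration. The array lives in a file on disk, and only the slices that are actually indexed are read into memory.

```python
import os
import tempfile

import numpy as np

# hypothetical sizes, chosen only for the demonstration
n_frames, frame_len = 8, 500
path = os.path.join(tempfile.mkdtemp(), "frames.dat")

# write once: the array is backed by a file on disk
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_frames, frame_len))
writer[:] = np.arange(n_frames, dtype=np.float32)[:, None]   # frame i is filled with the value i
writer.flush()
del writer

# read back lazily: the operating system loads pages only when they are accessed
reader = np.memmap(path, dtype=np.float32, mode="r", shape=(n_frames, frame_len))
batch = np.array(reader[2:4])     # copies just two frames into memory
print(batch.shape, batch[0, 0], batch[1, 0])
```

With the full dataset, this kind of access pattern lets you iterate over batches without ever holding the whole array in RAM.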

The data structure that will be used resembles the one depicted in the figure below. It has three axes: (0) the elements; (1) time; and (2) the channels.

<p align="center">
<img src="https://user-images.githubusercontent.com/8298445/105118442-b583ab00-5ace-11eb-989b-ac1eba8cb26d.png" height="300px"/>
</p>
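
The three-axis layout can be reproduced with plain NumPy. The sketch below uses synthetic values and is only an illustration of the indexing, not of how `DataReader` builds its arrays: one (elements, time) array per channel is stacked along a new last axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_elements, n_steps = 4, 500

# one synthetic (elements, time) array per channel, standing in for files such as Acc_x.txt
channels = {
    name: rng.normal(size=(n_elements, n_steps))
    for name in ("Acc_x", "Acc_y", "Acc_z")
}

# stack along a new last axis: (elements, time, channels)
data = np.stack([channels[name] for name in channels], axis=-1)
print(data.shape)

first_element = data[0]      # (time, channels): all channels of one 5-second window
acc_y = data[:, :, 1]        # (elements, time): one channel across all elements
```

Axis 0 selects an element (one labelled window), axis 1 a time step inside it, and axis 2 a sensor channel.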

For these labs, we provide you with a Python class, `DataReader`, which loads the data in this format. Use it to load and manipulate the data.

To see how to manipulate tensors in TensorFlow, please check out [this](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/tensor.ipynb) introductory Colab notebook.

In [None]:
from dataset import DataReader
import sample

# get the size of the sample
sample_idx = sample.load_index("./generated/sample/sample_idx.pickle")
sample_size = sample.size_of_index(sample_idx)
print(sample_size)

In [None]:
# when run for the first time, this may take a while!
train = DataReader(what='train', train_frames=sample_size)
valid = DataReader(what='validation')

**TODO**
Check the shape of the returned objects and try to visualize their contents using the methods provided by the class `Dataset` (`dataset.py`).