# 0. Setting Up the Development Environment

---


## Resources

* An introduction to the features provided by Colab [here](https://colab.research.google.com/notebooks/intro.ipynb) (in French);
* Another introduction to the features provided by Colab [here](https://colab.research.google.com/notebooks/basic_features_overview.ipynb);
* A refresher on Python, NumPy, and Matplotlib can be found [here](https://colab.research.google.com/github/cs231n/cs231n.github.io/blob/master/python-colab.ipynb) (from Stanford's cs231n course);
* An introductory Colab notebook for TensorFlow 2 can be found [here](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb).

## Lab steps


*   Integrate GitHub with your Colab to simplify the data upload process
*   Download code and data from GitHub to your Google Drive



## Integration between Google Drive --- Colab --- Github

The following content is based on two tutorials: [this tutorial](https://towardsdatascience.com/google-drive-google-colab-github-dont-just-read-do-it-5554d5824228) and [this tutorial](https://towardsdatascience.com/colaboratory-drive-github-the-workflow-made-simpler-bde89fba8a39).


### The workflow
The workflow we will be using during the labs is a simple three-step process:
1. First, after connecting to the Colab runtime, you mount Google Drive and update your workspace from GitHub.
2. You work with the notebooks and the files you need (your modules, libraries, etc.) as you would in an editor.
3. (Optional) You save your work by synchronizing your Drive with GitHub using the operational notebook.


<p align="center">
<img src="https://user-images.githubusercontent.com/8298445/104195015-6086cb80-5422-11eb-8cba-60ab7478c5ac.png" width="300px" title="https://towardsdatascience.com/colaboratory-drive-github-the-workflow-made-simpler-bde89fba8a39"/>
</p>
<p align="center">
Figure: From <a href="https://towardsdatascience.com/colaboratory-drive-github-the-workflow-made-simpler-bde89fba8a39">this tutorial</a>.
</p>


Before going into detail, let's look at the role of each component (Google Drive, Google Colab, GitHub) and how they interact (based on [this tutorial](https://towardsdatascience.com/google-drive-google-colab-github-dont-just-read-do-it-5554d5824228)).

* **Google Colab:** Used as a shell to run bash and git commands.
* **Google Drive:** When we use Google Colab, our work is stored only temporarily in a virtual machine, for around 8 to 12 hours. Google Drive lets you keep your work in persistent cloud storage. It provides 15 GB of free storage and integrates easily with Google Colab. We will use it as the permanent location of the cloned GitHub repository that we work on.
* **GitHub:** A code hosting platform for version control and collaboration. For each lab, you have to get a personal version of the code repository: in other words, you have to _fork_ the repository ([tutorial](https://guides.github.com/activities/hello-world/)).



### Connecting, mounting and updating

In [None]:
from google.colab import drive
from os.path import join

ROOT = '/content/drive'     # default for the drive
PROJ = 'MyDrive/ml-iot'       # path to your project on Drive

drive.mount(ROOT)           # we mount the drive at /content/drive

PROJECT_PATH = join(ROOT, PROJ)
!mkdir -p "{PROJECT_PATH}"  # -p: no error if the directory already exists
%cd "{PROJECT_PATH}"

In [None]:
GIT_USERNAME = "institut-galilee"  # Shared repository. To synchronize with your own fork, fork the GitHub repository and replace this name with your username.
GIT_TOKEN = "XXX"  # Only needed if you work with your own fork. Generate a token and put it here. Keep it confidential: it is very sensitive information!
GIT_REPOSITORY = "2021-ml-iot-labs"

In [None]:
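# GIT_PATH = "https://"+GIT_TOKEN+"@github.com/"+GIT_USERNAME+"/"+GIT_REPOSITORY+".git"  # variant for your own fork
GIT_PATH = "https://github.com/"+GIT_USERNAME+"/"+GIT_REPOSITORY+".git"
!git clone "{GIT_PATH}"
# N.B. `!cd` runs in a temporary subshell and does not change the notebook's working directory (use `%cd` for that)
!cd "{GIT_REPOSITORY}"
# N.B. with a single path argument, rsync only lists the files; add a destination path to actually copy them
!rsync -aP --exclude="{GIT_REPOSITORY}"/data/ "{GIT_REPOSITORY}"/generated/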

In [None]:
!ls

The snippets above mount Google Drive at `/content/drive` and create our project's directory. They then clone the GitHub repository into that directory. Finally, the `rsync` step is meant to copy the contents of the Drive directory over to the local runtime.

A nice thing about this solution is that it won't crash if executed multiple times: each run only updates what is new. Also, `rsync` lets us exclude content that may take too long to copy, such as the `data/` directory.

### Save changes to GitHub (optional, only if you work on your own fork of the repository!)

In order to save your changes, please perform the following commands. These allow you to put your modifications in your fork (code repository).

In [None]:
# configure your identity first, otherwise `git commit` will refuse to run
# !git config --global user.email "your github email"
# !git config --global user.name "your first and last name"
# !git add .
# !git commit -m "Put here a message that describes your modifications, e.g. answering question 3"
# !git push origin master

# 1. Discovering the SHL Dataset


---

## Goals/Outline

In this first series of labs, we will take a look at the Sussex-Huawei Locomotion (SHL) dataset. We will also build our first Keras model and train it to recognize human activities (run, walk, bike, etc.) from sensor data such as accelerometer, gyroscope, and magnetometer signals.

1. We will first work on a subset of the SHL dataset containing only examples from one of the four body locations (**Torso**, Hips, Hand, and Bag) and three of the fifteen modalities (**accelerometer, gyroscope, magnetometer**, gravitation, ambient pressure, etc.);

2. We will build a data sample which will be used to train our activity recognition models;

3. We will take a look at the signals, visualize them, apply some preprocessing to them, and extract valuable features from them;

4. We will see the dimensions and structure of the dataset that we feed the Keras models with;

5. We will explore both raw-signal and feature-based activity recognition models;

6. After adding other modalities, we will explore various sensor fusion schemes, i.e. channel, modality, and grouped fusion.

## Download a subset of the dataset

In this part, we will launch two scripts, `get_data.sh` and `extract_data.sh`, which respectively download a subset of the SHL dataset and extract it to the right folder.
If you want to work on the original dataset (which is very large) as a personal side project, you can check out the commented lines inside `get_data.sh`.

In [None]:
# check where we are located in the filesystem tree
!pwd
!ls

In [None]:
# if needed, go inside the cloned repository (be aware of the difference between %cd and !cd.)
%cd "{GIT_REPOSITORY}"
!pwd
!ls
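
The difference between `%cd` and `!cd` mentioned in the cell above can be illustrated in plain Python. This is only an analogy, assuming that `!cd` behaves like a command run in a child shell and `%cd` like a directory change in the notebook process itself; `target_dir` is an arbitrary directory chosen for the illustration.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.gettempdir()  # any existing directory would do

# Analogue of `!cd`: the directory change happens in a child shell and is lost.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
after_subshell = os.getcwd()

# Analogue of `%cd`: the current process itself changes directory.
os.chdir(target_dir)
after_chdir = os.getcwd()

os.chdir(start_dir)  # restore the initial directory

print('unchanged after the subshell:', after_subshell == start_dir)
print('changed after chdir:', os.path.samefile(after_chdir, target_dir))
```

This is why the notebook uses `%cd` whenever later cells must run from inside the repository.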

In [None]:
# give execution permissions to the two scripts
!chmod +x get_data.sh
!chmod +x extract_data.sh

In [None]:
# launch the command; the download may take a while!
!./get_data.sh

In [None]:
# extract the downloaded zip files using the following command; this may also take a while!
!./extract_data.sh

## Structure of the code repository


---



On the left side, you can see the file system button, which allows you to display the contents of the current folder. This structure should be the same as that found in the GitHub repository.

```
├── data/            # where the initial zipped data will be stored
├── generated/       # generated from sample after basic transformations (ready for ML algorithms) 
     ├── sample/     # subset of data to use in your experiments
          ├── train/
          └── validation/
     ├── tmp/        # where downloaded zipped data will be extracted. This is where data is read from
     └── 0.1/        # used to store data in a memory-convenient manner (not to be changed)
├── dataset.py       # Python code used to load data
├── config.py        # provide needed configurations: paths ... (not to be changed)
├── sample.py        # contains the code to extract samples from the data 
├── get_data.sh      # shell script to download data 
└── extract_data.sh
```

## Structure of the Raw Dataset

We will explore here the structure of the SHL dataset as it is provided by the team responsible for collecting and maintaining it.

<p align="center">
<img src="https://user-images.githubusercontent.com/8298445/104208754-27a22300-5431-11eb-8a5c-d3093fdf3de4.png" width="300px"/>
</p>

```
├── generated/
     ├── tmp/
          ├── train/
               ├── Torso/
                    ├── Acc_x.txt
                    ├── Acc_y.txt
                    ├── Acc_z.txt
                    ├── Gyr_x.txt
                    ...
                    └── Label.txt
               ├── Hand/            # Not included in the subset
               ├── Hips/            # Not included in the subset
               └── Bag/             # Not included in the subset
          └── validation/
```

To see what the first 10 rows of e.g. `Acc_x.txt` look like, execute the following command.
Each row contains 500 data points measured by an accelerometer (along the x axis) while the user performs one of the eight given activities. These 500 data points are successive in time and correspond to 5 seconds of the performed activity, i.e. a sampling rate of 100 Hz.

In [None]:
!head -10 ./generated/tmp/train/Torso/Acc_x.txt
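
The row layout described above (500 space-separated values covering 5 seconds) can also be checked from Python. The sketch below is a minimal illustration on synthetic data: `fake_rows` is a made-up stand-in for a few rows of `Acc_x.txt`, written to an in-memory buffer so that the example does not depend on the downloaded files.

```python
import io
import numpy as np

SAMPLES_PER_SEGMENT = 500   # data points per row, as stated above
SEGMENT_DURATION_S = 5.0    # seconds covered by one row, as stated above

# Synthetic stand-in for a few rows of Acc_x.txt (hypothetical values).
rng = np.random.default_rng(0)
fake_rows = rng.normal(size=(3, SAMPLES_PER_SEGMENT))
buffer = io.StringIO()
np.savetxt(buffer, fake_rows, delimiter=' ')
buffer.seek(0)

# One row per segment, one column per time step.
segments = np.loadtxt(buffer)

# Sampling rate and time axis (in seconds) of a single segment.
sampling_rate_hz = SAMPLES_PER_SEGMENT / SEGMENT_DURATION_S
time_axis = np.arange(SAMPLES_PER_SEGMENT) / sampling_rate_hz

print(segments.shape)      # (3, 500)
print(sampling_rate_hz)    # 100.0
```

With the real files, the buffer would simply be replaced by the path to the channel file.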

## Exploring and visualizing the sensory signals


---



We will use Plotly in order to allow exploring the sensory signals in an interactive manner.

### What is Plotly

Plotly allows one to create interactive charts and maps with APIs in Python, R, and JavaScript. It is intuitive and highly customisable and, since version 4, it integrates nicely with Pandas DataFrames in Python through the [Plotly Express](https://plotly.com/python/plotly-express/) module, which used to be a separate package and is now bundled with Plotly.

### Installing Plotly
Basic [installation of Plotly](https://plotly.com/python/getting-started/#installation):
```bash
pip install plotly
```
This makes Plotly available in the Python environment. Figures may not render automatically in notebooks with `fig.show()`, but they can be rendered as follows:

In [None]:
!pip install plotly

In [None]:
from IPython.display import HTML
import pandas as pd
import plotly.graph_objects as go

# the file has no header row; read only the 10 segments that we plot
df = pd.read_csv('./generated/tmp/train/Torso/Gyr_x.txt', sep=' ', header=None, nrows=10)

fig = go.Figure()
for i in range(10):
    fig.add_trace(
        go.Scatter(
            # place the 10 segments one after the other on the x axis
            x=[j+(500*i) for j in range(500)],
            y=df.iloc[i, :],
            mode="lines",
            line=go.scatter.Line(color="blue"),
            showlegend=False)
    )
HTML(fig.to_html())

## Dealing with the Prohibitive Size of the Dataset

Here, issues may be related to the training time on the virtual machines provided by Colab and to the storage constraints of the Drive.

We want you to become concretely familiar with the structure of the dataset and, at the same time, deal with its prohibitive size. To this end, we will reduce the size of the dataset by selecting various portions which will be used as our training sample.

This is also an opportunity to analyze the signals, check the transitions between different activities (the boundary is not always as sharp as you think), etc. All these aspects can be considered as criteria for outlier detection. Try to reason about that.
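
As a starting point for reasoning about transitions, the sketch below locates the points where the activity label changes in a toy label sequence. It assumes one label per segment, encoded with the integer codes of `coarselabel_map`; the sequence itself is made up for the illustration.

```python
import numpy as np

# Toy label sequence, one label per segment (1=still, 2=walk, 3=run).
labels = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 1, 1])

# A transition occurs wherever two consecutive labels differ.
change_points = np.flatnonzero(np.diff(labels) != 0) + 1

# Turn the change points into (label, start, end) runs, end inclusive.
starts = np.concatenate(([0], change_points))
ends = np.concatenate((change_points - 1, [len(labels) - 1]))
runs = list(zip(labels[starts].tolist(), starts.tolist(), ends.tolist()))

print(change_points.tolist())  # [3, 5, 9]
print(runs)                    # [(1, 0, 2), (2, 3, 4), (3, 5, 8), (1, 9, 10)]
```

Segments located right before or after a change point are natural candidates to inspect visually, since the boundary between two activities may be blurry there.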

In [5]:
import csv
import functools  # reduce()
import numpy as np

# N.B. commented lines are not considered in the subset of the dataset but are provided in the original dataset.

coarselabel_map = {
   # 'null'  : 0,
   'still' : 1,
   'walk'  : 2,
   'run'   : 3,
   'bike'  : 4,
   'car'   : 5,
   'bus'   : 6,
   'train' : 7,
   'subway': 8,
}

# channels corresponding to the columns of <position>_motion.txt files
# ordered according to the SHL dataset documentation.
channels_basic = {
    # [...]
    2: 'Acc_x',
    3: 'Acc_y',
    4: 'Acc_z',
    5: 'Gyr_x',
    6: 'Gyr_y',
    7: 'Gyr_z',
    8: 'Mag_x',
    9: 'Mag_y',
    10: 'Mag_z',
    # 11: 'Ori_w',
    # 12: 'Ori_x',
    # 13: 'Ori_y',
    # 14: 'Ori_z',
    # 15: 'Gra_x',
    # 16: 'Gra_y',
    # 17: 'Gra_z',
    # 18: 'LAcc_x',
    # 19: 'LAcc_y',
    # 20: 'LAcc_z',
    # 21: 'Pressure'
    # [...]
}

**TODO: Write a function that takes as input an activity and returns an index containing the starting and ending point of each portion of data corresponding to that activity**

In [6]:
def index_of_activity(activity:str) -> list:
    """
    Return the list of (start, end) row indexes, both inclusive, of each
    portion of consecutive segments labelled with the given activity.

    Example:
    ```python
    idxs = index_of_activity('run')
    print('Portions of activity {} :'.format('run'))
    for index in idxs:
        print('[{}; {}]'.format(index[0], index[1]))
    ```
    """
    print('Constructing the index of {} ...'.format(activity))
    idxs=[]
    label_of_activity = coarselabel_map[activity]
    with open('./generated/tmp/train/Torso/Label.txt') as labels:
        reader = csv.reader(labels, delimiter=' ')

        start = -1
        last = -1
        for i, row in enumerate(reader):
            last = i
            # the first label of a row is taken as the label of the segment
            if (int(row[0]) == label_of_activity) and (start == -1):
                # we found the starting point of a portion
                start = i

            if (int(row[0]) != label_of_activity) and (start != -1):
                # we found the end of a portion
                # print('portion found [{};{}]'.format(start, i-1))
                idxs.append((start, i-1))
                start = -1

        if start != -1:
            # the file ends while a portion is still open
            idxs.append((start, last))

    return idxs

**TODO: Write a function that extracts the portions of data from the ...**

In [9]:
!mkdir -p generated/sample/train/Torso

In [10]:
def extract_examples(sample_index:list, channel:str) -> None:
    """
    Given a sample index in the form of a list of tuples (start, end)
    specifying the starting and ending point of a portion of examples,
    this function extracts the corresponding examples to a new file
    (e.g. tmp/train/Torso/Acc_x.txt -> sample/train/Torso/Acc_x.txt).
    """
    # sort list in ascending order
    sample_index_sorted = sorted(sample_index, key=lambda tup: tup[0])

    with open('./generated/tmp/train/Torso/'+channel+'.txt', 'r') as input_:
        with open('./generated/sample/train/Torso/'+channel+'.txt', 'w') as output_:
            reader = csv.reader(input_)
            writer = csv.writer(output_)

            try:
                # the portions are sorted in ascending order and do not overlap
                sample_index_iter = iter(sample_index_sorted)
                idx = next(sample_index_iter)
                for i, row in enumerate(reader):
                    if i > idx[1]:
                        # move on to the next portion
                        idx = next(sample_index_iter)

                    # keep rows start..end-1 so that the number of rows
                    # written matches size_of_index() (end - start)
                    if idx[0] <= i < idx[1]:
                        writer.writerow(row)
            except StopIteration:
                # no portion left: the remaining rows are ignored
                pass

**TODO: Build your own training sample.** Given that the sample has to be independent and identically distributed (i.i.d.), you can use a random generator to select the portions of data.

In order to get a different sample for each student, please provide your student ID number as the seed of the random generator.
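
The role of the seed can be illustrated with a minimal sketch. Note that it uses NumPy's `default_rng` generator rather than the `np.random.seed` call used in this lab, and that `pick_portions` is a hypothetical helper introduced only for this example.

```python
import numpy as np


def pick_portions(num_portions: int, how_many: int, seed: int) -> list:
    """Select `how_many` distinct portion numbers, reproducibly for a given seed."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(num_portions, size=how_many, replace=False)
    return sorted(chosen.tolist())


# The same seed always yields the same selection...
first_run = pick_portions(num_portions=50, how_many=5, seed=1024)
second_run = pick_portions(num_portions=50, how_many=5, seed=1024)

# ...while another seed (another student) will generally yield a different one.
other_student = pick_portions(num_portions=50, how_many=5, seed=2048)

print(first_run == second_run)
print(first_run, other_student)
```

Using the student ID as the seed therefore makes each sample both personal and reproducible from one session to the next.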

In [11]:
your_student_id_number = 1024


def size_of_index(index:list) -> int:
    size = functools.reduce(lambda acc,portion: (portion[1]-portion[0])+acc, index, 0)
    return size


def sample(indexes:dict) -> list:
    """
    Given the index associated to each activity, this function generates
    a single index by selecting portions for each activity. This function
    ensures that:
        - each activity has a sufficient number of representatives;
        - the portions have a sufficient number of adjacent segments;
        - the number of segments of each activity is balanced;
    """
    np.random.seed(your_student_id_number)

    # 1. For each activity:
    #   a. filter portions according to their size, e.g. we do not want portions
    #      containing only 5 segments;
    #   b. compute the average size of the portions;
    min_portion_size = 10
    filtered_indexes = {}
    average_size = {}
    for activity in indexes:
        print('filtering the index of {} ...'.format(activity))
        filtered_indexes[activity] = list(filter(lambda portion: portion[1]-portion[0]>min_portion_size, indexes[activity]))
        average_size[activity] = size_of_index(filtered_indexes[activity])/len(filtered_indexes[activity])
        print(' - average size of {} filtered portions: {}'.format(activity, average_size[activity]))

    # 2. determine how many portions to sample for each activity;
    # 3. sample using np.random.choice;
    selected = []
    num_segment_per_activity = 2000
    for activity in indexes:
        # number of portions needed to reach the target number of segments,
        # capped by the number of available portions (required by replace=False)
        num_portions = min(int(num_segment_per_activity/average_size[activity]),
                           len(filtered_indexes[activity]))
        ret = np.random.choice(len(filtered_indexes[activity]), size=num_portions, replace=False)
        print(' - selected portions for {}:{}'.format(activity,ret))
        idx = [filtered_indexes[activity][i] for i in ret]
        print(' - size (# of segments) of this index:{}'.format(size_of_index(idx)))
        for por in idx:
            selected.append(por)

    return selected
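
One way to check the balance requirement stated in the docstring is to count the segments selected for each activity. The sketch below is only an illustration on a made-up index: `segments_per_activity` is a hypothetical helper, and it measures a portion as `end - start`, the same convention as `size_of_index`.

```python
def segments_per_activity(indexes: dict) -> dict:
    """Count the segments of each activity, measuring a portion as end - start."""
    return {
        activity: sum(end - start for start, end in portions)
        for activity, portions in indexes.items()
    }


# Made-up per-activity index: activity -> list of (start, end) portions.
toy_indexes = {
    'walk': [(0, 40), (100, 160)],
    'run': [(200, 290)],
    'bike': [(300, 320)],
}

counts = segments_per_activity(toy_indexes)

# Ratio between the most and the least represented activity (1.0 = balanced).
imbalance = max(counts.values()) / min(counts.values())

print(counts)     # {'walk': 100, 'run': 90, 'bike': 20}
print(imbalance)  # 5.0
```

A ratio far from 1, as for `bike` in this toy index, suggests that the sample under-represents some activities.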

Now, test your implementation using the following code.

In [None]:
def build_sample() -> list:
    idxs = {}
    for activity in coarselabel_map:
        idxs[activity] = index_of_activity(activity)

    sample_idx = sample(idxs)
    print(sample_idx)

    for _, channel in channels_basic.items():
        extract_examples(sample_idx, channel)

    # TODO. extract also the labels

    # return the used index
    return sample_idx


sample_idx = build_sample()

# TODO. save the index (the list of portions) in order to keep track of its correspondence with the original dataset


Should we do the same thing with the validation data? In any case, we need both `train` and `validation` (and later `test`) inside the `sample/` folder.

In [None]:
!cp -r generated/tmp/validation/ generated/sample/validation

### Loading the dataset

Now that we have built a sample and stored it in the Drive, we will use it to train our activity recognition models.
Before that, we need to put the data into a convenient and memory-efficient data structure. The added value of its backend implementation (which uses OS-based memory mapping) will become apparent when you work on the entire dataset.
In the following, we will just give a high-level overview of the data structure.
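
To give an idea of what OS-based memory mapping means, here is a minimal sketch using `numpy.memmap`. It is not the implementation used by the provided code, whose internals are not shown here; the file name and array sizes are made up for the illustration.

```python
import os
import tempfile
import numpy as np

shape = (8, 500, 3)  # toy sizes: (elements, time, channels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_sample.dat')  # hypothetical file name

    # Write phase: the array is backed by a file on disk.
    writable = np.memmap(path, dtype=np.float32, mode='w+', shape=shape)
    writable[:] = np.arange(np.prod(shape), dtype=np.float32).reshape(shape)
    writable.flush()
    del writable

    # Read phase: only the slices we actually touch need to be loaded.
    readonly = np.memmap(path, dtype=np.float32, mode='r', shape=shape)
    first_element = np.array(readonly[0])  # copy a single element into memory
    file_size = os.path.getsize(path)
    del readonly

print(first_element.shape)  # (500, 3)
print(file_size)            # 48000 bytes = 8 * 500 * 3 * 4
```

With this kind of structure, the full dataset can stay on disk while training code reads it slice by slice.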

The data structure that will be used resembles the one depicted in the figure below. It has three axes: (0) the elements; (1) time; and (2) the channels.

<p align="center">
<img src="https://user-images.githubusercontent.com/8298445/105118442-b583ab00-5ace-11eb-989b-ac1eba8cb26d.png" height="300px"/>
</p>
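
The three-axis layout can be reproduced on toy data by stacking per-channel arrays along a last axis. This is a sketch with synthetic values and only three of the channels listed in `channels_basic`; it says nothing about how the provided code builds its tensors.

```python
import numpy as np

num_segments, num_samples = 4, 500           # toy sizes
channel_names = ['Acc_x', 'Acc_y', 'Acc_z']  # three of the channels in channels_basic

# One (elements, time) array per channel, filled with synthetic values.
rng = np.random.default_rng(0)
per_channel = {
    name: rng.normal(size=(num_segments, num_samples))
    for name in channel_names
}

# Stack along a new last axis: (elements, time, channels).
tensor = np.stack([per_channel[name] for name in channel_names], axis=-1)

one_element = tensor[0]       # all channels of the first element: (time, channels)
acc_y_all = tensor[:, :, 1]   # the Acc_y channel of every element: (elements, time)

print(tensor.shape)       # (4, 500, 3)
print(one_element.shape)  # (500, 3)
print(acc_y_all.shape)    # (4, 500)
```

Indexing along axis 0 selects an element, along axis 1 a time step, and along axis 2 a channel.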

For these labs, we provide you with a Python class, `DataReader`, which loads the data in this format. Use it to load and manipulate the data.

In order to see how to manipulate tensors in TensorFlow, please check out [this](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/tensor.ipynb) introductory Colab.

In [None]:
from dataset import DataReader

# get the size of the constructed sample
sample_size = size_of_index(sample_idx)
print(sample_size)


# when run for the first time, this may take a while!
train = DataReader(what='train', train_frames=sample_size)
valid = DataReader(what='validation')

**TODO**
Check out the shape of the returned objects and try to visualize their contents using the provided methods.