## TODO:
Change the function names if you want to.
- `stitch_datasets(path_to_labels_jsons, path_to_open_pose_jsons)`: stitch the labels (confused, unknown, not confused, no test subject) with the OpenPose data extracted for each frame (about 100 features of body joints).
  - Input:
    - Path to the labels jsons
    - Path to the OpenPose features jsons
  - Returns:
    - Dataframe where each row has a label and the OpenPose features

- `preprocess(dataframe)`: standardize the data to zero mean and unit variance.
  - Some pre-processing steps will be part of the `stitch_datasets` function
    - Remove rows with no OpenPose feature data

- `split_data(dataframe)`: scikit-learn has `train_test_split` for this. I'm guessing TensorFlow has an equivalent as well.
  - Input:
    - Dataframe of the whole dataset
  - Returns:
    - `train_data`: a training dataset
    - `test_data`: a testing dataset

- `create_model(train_data)`: this is where the fun happens. Let's build a NN!

- `test_model(test_data)`: run cross-validation, or whichever evaluation is appropriate and commonly used for NNs
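
As a rough illustration of the `preprocess` and `split_data` steps above, here is a minimal NumPy sketch. It assumes the features are already in a 2-D array with one row per frame and the labels are integer-encoded. The helper names `fit_standardizer` and `standardize`, the 80/20 split, and the toy data are all hypothetical. The mean and standard deviation are computed on the training rows only, so no information from the test rows leaks into the scaling.

```python
import numpy as np

def split_data(features, labels, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into train and test subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], labels[train_idx], features[test_idx], labels[test_idx]

def fit_standardizer(train_features):
    """Compute the per-feature mean and standard deviation on the training rows."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std[std == 0] = 1.0  # constant columns would otherwise cause a division by zero
    return mean, std

def standardize(features, mean, std):
    """Shift and scale the features with previously fitted statistics."""
    return (features - mean) / std

# Toy data standing in for the real dataset (assumption: 6 features, 4 label classes)
rng = np.random.default_rng(42)
features = rng.normal(loc=5.0, scale=3.0, size=(100, 6))
labels = rng.integers(0, 4, size=100)

train_x, train_y, test_x, test_y = split_data(features, labels)
mean, std = fit_standardizer(train_x)
train_x = standardize(train_x, mean, std)
test_x = standardize(test_x, mean, std)
```

Consecutive frames from one recording are highly correlated, so splitting by recording rather than by row may give a more honest test score.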

## NN architecture
[The mostly complete chart of Neural Networks, explained](https://towardsdatascience.com/the-mostly-complete-chart-of-neural-networks-explained-3fb6f2367464)

We want:
- Some temporality

Architectures to start from:
- RNN: Recurrent Neural Network
  - Simple model with context preservation
- LSTM: Long Short-Term Memory
  - More complicated than RNN, but better for sequences (I think?)
- GRU: Gated Recurrent Unit
  - Similar to LSTM but less resource intensive, with similar performance
- AE / VAE: Auto-Encoder / Variational Auto-Encoder
  - Good at classifying patterns. Seems to be more for unsupervised learning, which is not what we're after. I wonder if it can do logistic classification. Perhaps we could generate stick figures of confused people. A little off topic though.
- SVM: Support Vector Machine
  - Can be used as a baseline

## Download our dataset from our Dropbox and our main dependencies
Make sure to add all newly created data to the `combined_jsons` folder in our shared Dropbox. The notebook downloads the data directly from it with `curl`.

Uncomment the cell below when running in the cloud.

In [6]:
# %%capture
# import os
# import shutil
# import numpy as np
# 
# DATA_DIR = "data"
# REDOWNLOAD = True
# 
# if REDOWNLOAD:
#   shutil.rmtree(DATA_DIR, ignore_errors=True)  # don't fail if the folder doesn't exist yet
# 
# if not os.path.exists(DATA_DIR):
#   !curl -L -o data.zip https://www.dropbox.com/sh/3ty3gszbpexan9q/AAC4F7GnYk-o0CU-HvM29sd9a?dl=0
#   !unzip data.zip -d data
#   !rm data.zip

## Preprocess

In [7]:
import json
from glob import glob
from typing import List

DATA_DIR = "data/combined_jsons"
# Sort the paths so the sequences are loaded in a reproducible order
dataset_paths = sorted(glob(f"{DATA_DIR}/*"))

# Load every JSON file into one list: sequences -> frames -> keypoint values
raw_sequences: List[List[List[float]]] = []
for path in dataset_paths:
  with open(path, "r") as f:
    raw_sequences.append(json.load(f))

Compute the centroid of every frame. If the centroid shifts sharply between two consecutive frames, OpenPose may have picked up a different person.
Centroid code taken from [here](https://stackoverflow.com/questions/23020659/fastest-way-to-calculate-the-centroid-of-a-set-of-coordinate-tuples-in-python-wi).

The OpenPose output format is documented [here](https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/output.md). A frame is represented as `x1,y1,c1,x2,y2,c2,...`, where each keypoint is an (x, y) position followed by a confidence score `c`.
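
A minimal sketch of the per-frame centroid, assuming each frame is a flat list of `x, y, c` triples for a single person and that undetected keypoints carry a confidence of zero. The function name `frame_centroid` and the `min_confidence` parameter are hypothetical, not part of `jupyter_tools`.

```python
import numpy as np

def frame_centroid(frame, min_confidence=0.0):
    """Return the (x, y) centroid of the detected keypoints, or None if there are none."""
    keypoints = np.asarray(frame, dtype=float).reshape(-1, 3)  # one row per (x, y, c)
    detected = keypoints[keypoints[:, 2] > min_confidence]     # skip undetected keypoints
    if len(detected) == 0:
        return None
    return detected[:, :2].mean(axis=0)

# Two detected keypoints and one undetected keypoint (all zeros)
frame = [10.0, 20.0, 0.9, 30.0, 40.0, 0.8, 0.0, 0.0, 0.0]
centroid = frame_centroid(frame)
```

Excluding zero-confidence keypoints keeps missing joints from dragging the centroid toward the origin.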

If a dropped frame sits between two valid sequences, consider stitching them back together.
* Drop frames which have **no subject** in them. DONE.
* Drop frames which have the **wrong** subject in them. DONE.
* Give unit variance (and zero mean?) to all points.
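
The dropping and stitching rules above could look roughly like the sketch below. This is not the actual `stitch_frames` implementation. It assumes empty frames contain only zero-confidence keypoints, and `split_into_runs`, `max_jump` (in pixels) and `max_gap` (in frames) are hypothetical names with placeholder values. Unlike the list above, it starts a new run on a large centroid jump instead of dropping the frame.

```python
import numpy as np

def centroid(frame):
    """Centroid of the detected keypoints in a flat x, y, c frame, or None if empty."""
    pts = np.asarray(frame, dtype=float).reshape(-1, 3)
    pts = pts[pts[:, 2] > 0]
    return pts[:, :2].mean(axis=0) if len(pts) else None

def split_into_runs(frames, max_jump=50.0, max_gap=1):
    """Drop empty frames and group the rest into runs that plausibly show one subject."""
    runs, current = [], []
    previous, gap = None, 0
    for frame in frames:
        c = centroid(frame)
        if c is None:
            gap += 1  # no subject: drop the frame but remember the gap length
            continue
        same_subject = previous is None or np.linalg.norm(c - previous) <= max_jump
        if current and (gap > max_gap or not same_subject):
            # gap too long or centroid jumped: close the run and start a new one
            runs.append(current)
            current = []
        current.append(frame)  # short gaps are stitched by simply continuing the run
        previous, gap = c, 0
    if current:
        runs.append(current)
    return runs

a = [100.0, 100.0, 1.0]
b = [102.0, 101.0, 1.0]
empty = [0.0, 0.0, 0.0]
far = [400.0, 400.0, 1.0]
runs = split_into_runs([a, b, empty, b, empty, empty, a, far])
```

In this toy input the single empty frame is bridged, the two-frame gap ends the first run, and the distant centroid starts a third run.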

In [8]:
from jupyter_tools.preprocessing import stitch_frames

parsed_sequences = stitch_frames(raw_sequences)

## Feature Extraction
Use an autoencoder to reduce the number of features.
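
Before training an autoencoder, a linear baseline may be useful for comparison. The sketch below swaps in PCA (via NumPy's SVD) for the autoencoder. It is not an autoencoder itself, but it follows the same encode/decode pattern. The function names and the toy data are hypothetical.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return the feature mean and the top principal directions (one per row)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def encode(features, mean, components):
    """Project the features onto the principal directions."""
    return (features - mean) @ components.T

def decode(codes, mean, components):
    """Map the reduced codes back to the original feature space."""
    return codes @ components + mean

# Toy data: 12 observed features generated from only 3 underlying dimensions
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 12))
features = latent @ mixing

mean, components = fit_pca(features, n_components=3)
codes = encode(features, mean, components)
reconstruction = decode(codes, mean, components)
```

If a nonlinear autoencoder with the same code size does not reconstruct the pose features noticeably better than this baseline, the extra complexity may not be worth it.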

## Model
Create 3 models:
* Simple RNN
* LSTM
* GRU
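
Keras recurrent layers take input of shape `(batch, timesteps, features)`, while the parsed sequences will likely differ in length. One common way to bridge this is to zero-pad the sequences and keep a mask of the real frames. The sketch below assumes every frame has the same number of values. `pad_batch` is a hypothetical helper, not part of `jupyter_tools`.

```python
import numpy as np

def pad_batch(sequences, pad_value=0.0):
    """Pad variable-length sequences to one array and mark which frames are real."""
    n_features = len(sequences[0][0])
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len, n_features), pad_value, dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = np.asarray(seq, dtype=np.float32)
        mask[i, :len(seq)] = True  # True marks a real frame, False marks padding
    return batch, mask

# Two toy sequences of different lengths, three values per frame
sequences = [
    [[1.0, 2.0, 0.9], [1.5, 2.5, 0.8]],
    [[3.0, 4.0, 0.7]],
]
batch, mask = pad_batch(sequences)
```

The mask lets the model, or the loss computation, ignore padded frames so they do not count as observations.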

For testing, it would be beneficial to write a function (in `jupyter_tools`) to display the frames as a video with a label indicating the prediction from the model.

In [9]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


## Resources
### Educational
* [DeepMind deep learning lectures](https://www.youtube.com/playlist?list=PLqYmG7hTraZCDxZ44o4p3N5Anz3lLRVZF)
### Annotation tools
* [List of open source solutions](https://www.simonwenkel.com/2019/07/19/list-of-annotation-tools-for-machine-learning-research.html)
* [opencv/cvat](https://github.com/opencv/cvat)
* [alexandre01/UltimateLabeling](https://github.com/alexandre01/UltimateLabeling)
### Existing trained models
* [onnx/models](https://github.com/onnx/models)