# Introduction

This notebook provides an overview of the data collection process and the approach to preprocessing the kinematic data for the downloaded dance video dataset.

## Data Collection using YouTube Data API and Pytube

The video data used for exploratory data analysis was downloaded using the [YouTube Data API](https://developers.google.com/youtube/v3/docs/search/list) and [Pytube](https://pytube.io/en/latest/), so that only authorized videos were collected for analysis. To increase the likelihood of finding relevant, clean videos that focus on individual dancers rather than groups, the collection function runs a keyword search that combines the genre name with terms such as "solo choreography", "solo practice", or "dance cover". The expected video format is `mp4`, `width:360`, `height:640`, `max_length:120`, `min_views:100`.
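
As a rough illustration of the keyword search and the format constraints described above, the sketch below builds one query per search term and checks a video's metadata against the length and view thresholds. The names `build_search_queries` and `meets_criteria`, the metadata keys `length` and `views`, and the reading of `max_length` as seconds are assumptions made for this example; they are not taken from `collection.py`.

```python
# Search terms quoted in the text above
SEARCH_TERMS = ("solo choreography", "solo practice", "dance cover")

# Thresholds quoted in the text above (ASSUMPTION: max_length is in seconds)
MAX_LENGTH_SECONDS = 120
MIN_VIEWS = 100


def build_search_queries(genre):
    """Combine a genre name with each solo-dance search term."""
    return [f"{genre} {term}" for term in SEARCH_TERMS]


def meets_criteria(video):
    """Check a metadata dict with hypothetical keys `length` and `views`."""
    return video["length"] <= MAX_LENGTH_SECONDS and video["views"] >= MIN_VIEWS


queries = build_search_queries("ballet")
candidate = {"length": 95, "views": 1200}  # made-up metadata for illustration
print(queries, meets_criteria(candidate))
```

Keeping the query construction separate from the filtering step makes it possible to adjust the search terms or thresholds without touching the download logic.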

For more details about the data collection process, please refer to the code in [/src/data/collection.py](https://github.com/kayesokua/gestures/blob/main/src/data/collection.py).

In [None]:
from src.data.collection import extract_video_from_youtube

extract_video_from_youtube(query='contemporary', max_count=5)
extract_video_from_youtube(query='ballet', max_count=5)
extract_video_from_youtube(query='folk', max_count=5)

## Pose Estimation using MediaPipe

After downloading the video data, the kinematic data is extracted using [MediaPipe Pose Solution](https://github.com/google/mediapipe/blob/master/docs/solutions/pose.md). The chosen output format is `csv`, with relative values by default. In this file, we record the `x`, `y`, `z` coordinates, and we obtain the `fps` using [OpenCV](https://docs.opencv.org/4.x/). We assign `NaN` to frames where a pose cannot be detected.
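
The following is a minimal sketch of this extraction step, not the code in `annotation.py`. It assumes the legacy `mp.solutions.pose` interface of MediaPipe, which returns 33 landmarks per detected pose, and it uses hypothetical names (`landmarks_to_row`, `extract_landmarks`) and a hypothetical column layout (`frame, x0, y0, z0, ...`). Frames without a detected pose are written as rows of `NaN`.

```python
import csv
import math

NUM_LANDMARKS = 33  # MediaPipe Pose returns 33 body landmarks per detected pose


def landmarks_to_row(frame_index, landmarks):
    """Flatten one frame into [frame, x0, y0, z0, ..., x32, y32, z32].

    `landmarks` is a sequence of (x, y, z) tuples, or None when no pose was
    detected, in which case every coordinate is NaN.
    """
    if landmarks is None:
        return [frame_index] + [math.nan] * (NUM_LANDMARKS * 3)
    row = [frame_index]
    for x, y, z in landmarks:
        row.extend([x, y, z])
    return row


def extract_landmarks(video_path, csv_path):
    """Run MediaPipe Pose on every frame, write one CSV row per frame, return fps."""
    # Imported lazily so the helper above works without these packages installed.
    import cv2
    import mediapipe as mp

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    header = ["frame"] + [f"{axis}{i}" for i in range(NUM_LANDMARKS) for axis in "xyz"]

    with mp.solutions.pose.Pose() as pose, open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        frame_index = 0
        while True:
            ok, frame = cap.read()
            if not ok:  # end of video or unreadable frame
                break
            # MediaPipe expects RGB input, while OpenCV decodes frames as BGR.
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                landmarks = None
            else:
                landmarks = [(p.x, p.y, p.z) for p in result.pose_landmarks.landmark]
            writer.writerow(landmarks_to_row(frame_index, landmarks))
            frame_index += 1
    cap.release()
    return fps
```

Writing a `NaN` row instead of skipping the frame keeps row indices aligned with frame numbers, so timestamps can still be recovered from the `fps`.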

The code snippet below gathers all videos in `mp4` format and extracts landmarks and screenshots. For more details about the data annotation process, please refer to the code in [/src/data/annotation.py](https://github.com/kayesokua/gestures/blob/main/src/data/annotation.py).

In [None]:
import os
from src.data.annotation import extract_landmarks_from_videos

video_path = 'data/external/test'

if os.path.exists(video_path):
    extract_landmarks_from_videos(video_path)
else:
    print("Path does not exist.")

## Handling Missing Poses and Outlier Detection

Since we are handling dance videos with different cinematography styles, linear interpolation or median imputation might not be appropriate for handling missing kinematic data. Therefore, the proposed solution is to detect outliers instead, by generating a binary label with the [Isolation Forest algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html): `1` indicates normal data and `-1` indicates an outlier.
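
A minimal sketch of this labelling step with scikit-learn's `IsolationForest` is shown below. The function name `label_outliers` and the constant fill for missing values are assumptions made for illustration: depending on the scikit-learn version, `IsolationForest` may reject `NaN` input, and the project's own handling in `processing.py` may differ.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def label_outliers(features, fill_value=0.0, random_state=0):
    """Return 1 (normal) or -1 (outlier) for each row of a (frames, features) array."""
    X = np.asarray(features, dtype=float)
    # ASSUMPTION: missing poses (NaN) are replaced with a constant so the model
    # can be fitted; this is an illustrative choice, not the project's method.
    X = np.nan_to_num(X, nan=fill_value)
    model = IsolationForest(random_state=random_state)
    return model.fit_predict(X)


# Synthetic stand-in for landmark data: 50 frames, 6 coordinates each
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 6))
frames[10] = np.nan  # a frame where no pose was detected
labels = label_outliers(frames)
print(labels[:12])
```

Fixing `random_state` makes the labels reproducible between runs, which helps when comparing outlier counts across videos.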

For more details about the data processing step, please refer to the code in [/src/data/processing.py](https://github.com/kayesokua/gestures/blob/main/src/data/processing.py).

In [None]:
from src.data.processing import process_landmarks_using_isolation_forest
process_landmarks_using_isolation_forest("data/interim")

# Summary

This notebook provided an overview of the data we have for exploration:

1. Videos (MP4) with category as filename: `data/external/{category_i}.mp4`
2. Kinematic data extracted using MediaPipe (CSV): `data/interim/{category_i}/landmarks_rel.csv`
3. Kinematic data with outlier information (CSV): `data/processed/{category_i}.csv`
4. Frame screenshots (PNG) saved in chronological order: `data/interim/{video_filename}/*.png`