# Action Recognition Cleaning
This script is run after raw data collection and processing.

#### Raw Data Collection & Processing Summary
The raw data collection process consisted of manually generating 210 raw video clips for each of 6 different actions (wave, kiss, middle finger, heart, salute, and idle), 1,260 clips in total, with each video clip being 40 frames long. Once the video clips were generated, each video frame was processed using Google's MediaPipe Holistic model to identify and extract key landmarks on the human body.

Each landmark has an x, y, and z value, with an additional visibility value for the general pose landmarks, and 543 landmarks are collected for each frame. These landmarks are used as the data points when training our machine learning models. Counting the x, y, z, and visibility (pose only) values of every landmark, there are 1662 data points generated for each frame.

- Number of Pose landmarks = 33 (each with an x, y, z & visibility value) = 132 data points
- Number of Face Landmarks = 468 (each with an x, y, z value) = 1404 data points
- Left Hand Landmarks = 21 (each with an x, y, z value) = 63 data points
- Right Hand Landmarks = 21 (each with an x, y, z value) = 63 data points
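
The per-frame totals can be checked with a few lines of arithmetic. The sketch below only restates the counts listed above; the name `landmark_groups` is illustrative and does not come from the project code.

```python
# (number of landmarks, values per landmark) for each group, as listed above
landmark_groups = {
    "pose": (33, 4),        # x, y, z, visibility
    "face": (468, 3),       # x, y, z
    "left_hand": (21, 3),   # x, y, z
    "right_hand": (21, 3),  # x, y, z
}

total_landmarks = sum(count for count, _ in landmark_groups.values())
values_per_frame = sum(count * width for count, width in landmark_groups.values())

print(total_landmarks)   # 543
print(values_per_frame)  # 1662
```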

The landmark data for each video frame is then flattened, concatenated, and saved as a numpy array. The reasons why the Mediapipe generated landmarks are used over the raw video images are:

1. Using the location of a person's body parts within the camera frame substantially reduces the amount of data needed to train the model(s). A typical default web camera resolution is 640 by 480 pixels, a total of 307,200 pixels. With each pixel having 3 color channels (red, green, and blue), a raw frame would contain 921,600 data points, roughly 554x more than our 1662 landmark values. Passing all of that pixel data to the model would drastically increase model complexity, CPU usage, and runtime.

2. Training an object recognition model is a difficult task in and of itself, and MediaPipe's holistic model provides a stable detection algorithm that works in a variety of environments. By relying on it, we are far less exposed to external factors, such as background and lighting, which would otherwise impact our model's performance.

3. By only passing through body location data, no other external noise reaches our model. This provides cleaner data for our model to detect actions and gestures.
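
The size comparison in point 1 can be worked through directly. This is a small arithmetic sketch that assumes a 640 by 480 frame with 3 color channels; the variable names are illustrative.

```python
# Raw pixel values in one 640 x 480 frame with 3 color channels
width, height, channels = 640, 480, 3
pixel_values = width * height * channels

# Landmark values extracted per frame
landmark_values = 1662

ratio = pixel_values / landmark_values
print(pixel_values)     # 921600
print(round(ratio, 1))  # 554.5
```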

The goal of this notebook is to convert all of the individual numpy arrays, containing the 1662 landmark data points for each frame, into a pandas dataframe for further downstream analysis and feature engineering.

In [1]:
import numpy as np
import pandas as pd
import os
import mediapipe as mp
from keypoints import Pipe
import cv2

## Load all numpy video frames

The first step is to load all of the data. We do this by looping through each action folder and loading the extracted landmark values from each frame of every video.

The data is stored in the format below. The next code cell loops through every folder, collects the individual frame data, and records its corresponding classification tag.

* Processed_Data >
    * wave_data >
        * subject1_waving_videos >
            * Liam_wave_video_sample_1 >
                * frame0.npy
                * frame1.npy
                * ...
                * frame39.npy
        * subject2_waving_videos >
            * ...
    * heart >
        * ...
    * n_action
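
One detail worth keeping in mind when deriving class labels from folder names: `os.listdir` returns entries in arbitrary order, so the integer assigned to each action can differ between machines or runs. The following is a minimal sketch of a reproducible mapping. It uses a temporary directory with made-up folder names in place of `Processed_Data`, and the sorting and directory filtering are suggestions rather than what the notebook does.

```python
import os
import tempfile

# Build a throwaway directory tree standing in for Processed_Data (folder names are illustrative)
with tempfile.TemporaryDirectory() as data_path:
    for action in ["wave", "heart", "idle"]:
        os.makedirs(os.path.join(data_path, action))

    # keep only directories and sort them so the label numbering is stable across runs
    action_list = sorted(
        name for name in os.listdir(data_path)
        if os.path.isdir(os.path.join(data_path, name))
    )
    label_map = {label: num for num, label in enumerate(action_list)}

print(label_map)  # {'heart': 0, 'idle': 1, 'wave': 2}
```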
    

In [69]:
# set path to saved data
data_path = "./Processed_Data"

# get list of all detected actions
action_list = os.listdir(data_path)

# create a mapping dict to detected action
label_map = {label: num for num, label in enumerate(action_list)}


# Instantiate Mediapipe model
model = Pipe()

# number of frames per video
frames_per_vid = model.frames_per_seq # Pipe module is not importing properly

# set up video and label lists
videos, labels = [], []

# loop through each action folder to get all video data
for action in action_list:
    
    # loop through every data file for each person who provided data
    # get the list of the folder paths for everyone who provided data  
    diff_data_subjects = os.listdir(os.path.join(data_path, action))
    
    # loop through each person's folder
    for data_subject in diff_data_subjects:
        
        # loop through each of the 30 video samples recorded per subject for this action
        for vid_number in range(0, model.num_sample_videos):
            print(action, data_subject, vid_number)
            
            # window which will store data for all 40 frames in a video. The full window will then be
            # appended to the videos list. Each window will be 40 x 1662 in shape
            window = []
            
            # loop through each frame per video sample
            for frame_num in range(0, frames_per_vid):
                
                # set the path to each video frame
                path = os.path.join(data_path, action, data_subject, str(vid_number),
                                    data_subject+'{}.npy'.format(str(frame_num)))
                # load the numpy array to specific video frame and append to window list
                window.append(np.load(path))
                
            # after each video sample append the window list to the videos list
            videos.append(window)
            
            # append corresponding label to the labels list
            labels.append(label_map[action])

finger Dev_sitting_Data 0
finger Dev_sitting_Data 1
finger Dev_sitting_Data 2
finger Dev_sitting_Data 3
finger Dev_sitting_Data 4
finger Dev_sitting_Data 5
finger Dev_sitting_Data 6
finger Dev_sitting_Data 7
finger Dev_sitting_Data 8
finger Dev_sitting_Data 9
finger Dev_sitting_Data 10
finger Dev_sitting_Data 11
finger Dev_sitting_Data 12
finger Dev_sitting_Data 13
finger Dev_sitting_Data 14
finger Dev_sitting_Data 15
finger Dev_sitting_Data 16
finger Dev_sitting_Data 17
finger Dev_sitting_Data 18
finger Dev_sitting_Data 19
finger Dev_sitting_Data 20
finger Dev_sitting_Data 21
finger Dev_sitting_Data 22
finger Dev_sitting_Data 23
finger Dev_sitting_Data 24
finger Dev_sitting_Data 25
finger Dev_sitting_Data 26
finger Dev_sitting_Data 27
finger Dev_sitting_Data 28
finger Dev_sitting_Data 29
finger Kieran_table_Data 0
finger Kieran_table_Data 1
finger Kieran_table_Data 2
finger Kieran_table_Data 3
finger Kieran_table_Data 4
finger Kieran_table_Data 5
finger Kieran_table_Data 6
finger Kier

heart Liam_BS_Standing_Data 23
heart Liam_BS_Standing_Data 24
heart Liam_BS_Standing_Data 25
heart Liam_BS_Standing_Data 26
heart Liam_BS_Standing_Data 27
heart Liam_BS_Standing_Data 28
heart Liam_BS_Standing_Data 29
heart Liam_Kitchen_Data 0
heart Liam_Kitchen_Data 1
heart Liam_Kitchen_Data 2
heart Liam_Kitchen_Data 3
heart Liam_Kitchen_Data 4
heart Liam_Kitchen_Data 5
heart Liam_Kitchen_Data 6
heart Liam_Kitchen_Data 7
heart Liam_Kitchen_Data 8
heart Liam_Kitchen_Data 9
heart Liam_Kitchen_Data 10
heart Liam_Kitchen_Data 11
heart Liam_Kitchen_Data 12
heart Liam_Kitchen_Data 13
heart Liam_Kitchen_Data 14
heart Liam_Kitchen_Data 15
heart Liam_Kitchen_Data 16
heart Liam_Kitchen_Data 17
heart Liam_Kitchen_Data 18
heart Liam_Kitchen_Data 19
heart Liam_Kitchen_Data 20
heart Liam_Kitchen_Data 21
heart Liam_Kitchen_Data 22
heart Liam_Kitchen_Data 23
heart Liam_Kitchen_Data 24
heart Liam_Kitchen_Data 25
heart Liam_Kitchen_Data 26
heart Liam_Kitchen_Data 27
heart Liam_Kitchen_Data 28
heart Liam

idle William_desk_Data 2
idle William_desk_Data 3
idle William_desk_Data 4
idle William_desk_Data 5
idle William_desk_Data 6
idle William_desk_Data 7
idle William_desk_Data 8
idle William_desk_Data 9
idle William_desk_Data 10
idle William_desk_Data 11
idle William_desk_Data 12
idle William_desk_Data 13
idle William_desk_Data 14
idle William_desk_Data 15
idle William_desk_Data 16
idle William_desk_Data 17
idle William_desk_Data 18
idle William_desk_Data 19
idle William_desk_Data 20
idle William_desk_Data 21
idle William_desk_Data 22
idle William_desk_Data 23
idle William_desk_Data 24
idle William_desk_Data 25
idle William_desk_Data 26
idle William_desk_Data 27
idle William_desk_Data 28
idle William_desk_Data 29
kiss Dev_sitting_Data 0
kiss Dev_sitting_Data 1
kiss Dev_sitting_Data 2
kiss Dev_sitting_Data 3
kiss Dev_sitting_Data 4
kiss Dev_sitting_Data 5
kiss Dev_sitting_Data 6
kiss Dev_sitting_Data 7
kiss Dev_sitting_Data 8
kiss Dev_sitting_Data 9
kiss Dev_sitting_Data 10
kiss Dev_sittin

salute Liam_BS_Standing_Data 11
salute Liam_BS_Standing_Data 12
salute Liam_BS_Standing_Data 13
salute Liam_BS_Standing_Data 14
salute Liam_BS_Standing_Data 15
salute Liam_BS_Standing_Data 16
salute Liam_BS_Standing_Data 17
salute Liam_BS_Standing_Data 18
salute Liam_BS_Standing_Data 19
salute Liam_BS_Standing_Data 20
salute Liam_BS_Standing_Data 21
salute Liam_BS_Standing_Data 22
salute Liam_BS_Standing_Data 23
salute Liam_BS_Standing_Data 24
salute Liam_BS_Standing_Data 25
salute Liam_BS_Standing_Data 26
salute Liam_BS_Standing_Data 27
salute Liam_BS_Standing_Data 28
salute Liam_BS_Standing_Data 29
salute Liam_Kitchen_Data 0
salute Liam_Kitchen_Data 1
salute Liam_Kitchen_Data 2
salute Liam_Kitchen_Data 3
salute Liam_Kitchen_Data 4
salute Liam_Kitchen_Data 5
salute Liam_Kitchen_Data 6
salute Liam_Kitchen_Data 7
salute Liam_Kitchen_Data 8
salute Liam_Kitchen_Data 9
salute Liam_Kitchen_Data 10
salute Liam_Kitchen_Data 11
salute Liam_Kitchen_Data 12
salute Liam_Kitchen_Data 13
salute Lia

wave Liam_Standing_Data 12
wave Liam_Standing_Data 13
wave Liam_Standing_Data 14
wave Liam_Standing_Data 15
wave Liam_Standing_Data 16
wave Liam_Standing_Data 17
wave Liam_Standing_Data 18
wave Liam_Standing_Data 19
wave Liam_Standing_Data 20
wave Liam_Standing_Data 21
wave Liam_Standing_Data 22
wave Liam_Standing_Data 23
wave Liam_Standing_Data 24
wave Liam_Standing_Data 25
wave Liam_Standing_Data 26
wave Liam_Standing_Data 27
wave Liam_Standing_Data 28
wave Liam_Standing_Data 29
wave William_desk_Data 0
wave William_desk_Data 1
wave William_desk_Data 2
wave William_desk_Data 3
wave William_desk_Data 4
wave William_desk_Data 5
wave William_desk_Data 6
wave William_desk_Data 7
wave William_desk_Data 8
wave William_desk_Data 9
wave William_desk_Data 10
wave William_desk_Data 11
wave William_desk_Data 12
wave William_desk_Data 13
wave William_desk_Data 14
wave William_desk_Data 15
wave William_desk_Data 16
wave William_desk_Data 17
wave William_desk_Data 18
wave William_desk_Data 19
wave

In [120]:
# convert lists to numpy arrays
videos = np.array(videos)
labels = np.array(labels)

# check to make sure everything is loaded in properly, we are expecting to see 1260 unique video samples, 
# each 40 frames long, and with every frame consisting of 1662 data points.
print(videos.shape)
print(labels.shape)
print(videos[0][0].shape)

(1260, 40, 1662)
(1260,)
(1662,)


Data was properly loaded with 1260 observations (each observation having 40 frames and 1662 landmark co-ordinates per frame). 

The data is currently in the correct format to pass through to a neural network, but in order to further explore the data we are going to reshape the data into a 2D matrix and separate the landmarks back out into their respective body part groupings (i.e. face landmarks, left hand landmarks, right hand landmarks, and pose landmarks).
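
As a rough illustration of the reshape described above, the sketch below collapses the video and frame axes of a small synthetic array so that each row is a single frame. The array contents and the name `frames_2d` are made up for the example.

```python
import numpy as np

# Synthetic stand-in for the loaded data: 3 videos, 40 frames, 1662 values per frame
videos = np.arange(3 * 40 * 1662, dtype=np.float64).reshape(3, 40, 1662)

# Collapse the video and frame axes so that each row is a single frame
frames_2d = videos.reshape(-1, videos.shape[-1])

print(frames_2d.shape)  # (120, 1662)

# Row i of the 2D matrix is frame (i % 40) of video (i // 40)
same_frame = np.array_equal(frames_2d[45], videos[1, 5])
print(same_frame)  # True
```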

As mentioned previously, each landmark corresponds to a specific point on the human body, and each landmark has a fixed position in our list of 1662 values (e.g. the first 3 values always correspond to the x, y, & z position of the nose landmark). This makes unraveling the landmarks back into their respective groups straightforward.

The landmark groups were concatenated in the below order:
1. pose (33 landmarks each with an x, y, z, & visibility value - always corresponds to the first 132 values)
2. face (468 landmarks each with an x, y, & z value - always corresponds to the next 1404 values)
3. left hand (21 landmarks each with an x, y, & z value - always corresponds to the following 63 values)
4. right hand (21 landmarks each with an x, y, & z value - always corresponds to the last 63 values)
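
Because the group order and sizes are fixed, the slice boundaries follow from a cumulative sum. The sketch below derives them and splits a synthetic 1662-value frame; the names `group_sizes`, `boundaries`, and `parts` are illustrative.

```python
import numpy as np

# Number of values contributed by each group, in concatenation order
group_sizes = {"pose": 33 * 4, "face": 468 * 3, "left_hand": 21 * 3, "right_hand": 21 * 3}

# Cumulative offsets give the slice boundaries: 132, 1536, 1599
boundaries = np.cumsum(list(group_sizes.values()))[:-1]

# Synthetic frame vector standing in for one row of landmark data
frame = np.arange(1662, dtype=np.float64)

# np.split cuts the vector at each boundary, yielding one array per group
parts = dict(zip(group_sizes, np.split(frame, boundaries)))
for name, values in parts.items():
    print(name, values.shape)

# pose values can be viewed as 33 rows of (x, y, z, visibility)
pose = parts["pose"].reshape(33, 4)
```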

In [105]:
# Create dictionaries to hold landmark data for each body group
pose_values = {}
face_values = {}
lh_values = {}
rh_values = {}

# There are 210 samples per class and we want each observation label to carry information relating it back to its
# respective action class, video sample number, and frame number
# number of video samples per action class (6 classes, with videos grouped by class)
samples_per_class = videos.shape[0] // 6

# loop through every video
for video_sample_number in range(videos.shape[0]):
    
    # determine the video's sample number within its action class (0 to samples_per_class - 1)
    vid_sample = video_sample_number % samples_per_class
        
    # loop through each frame in each video sample
    for frame in range(videos.shape[1]):
        # create the row label which contains the current frame's action/video_sample_number/frame_num_in_video_sample
        row_name = action_list[labels[video_sample_number]] + "_" + str(vid_sample) + "_" + str(frame)
        
        # split out landmark data to their respective body regions
        pose_values[row_name] = videos[video_sample_number][frame][0:132]
        face_values[row_name] = videos[video_sample_number][frame][132:1536]
        lh_values[row_name] = videos[video_sample_number][frame][1536:1599]
        rh_values[row_name] = videos[video_sample_number][frame][1599:]

In [106]:
# ensure split happened correctly
pose_values

{'finger_0_0': array([ 5.03753543e-01,  5.48919380e-01, -7.53539562e-01,  9.99928415e-01,
         5.15265584e-01,  5.08145988e-01, -6.85853839e-01,  9.99795496e-01,
         5.26490808e-01,  5.08343518e-01, -6.85838223e-01,  9.99785542e-01,
         5.37297428e-01,  5.07956922e-01, -6.85886025e-01,  9.99756753e-01,
         4.80388641e-01,  5.05482972e-01, -7.31253982e-01,  9.99790013e-01,
         4.65064138e-01,  5.04712403e-01, -7.31387019e-01,  9.99787390e-01,
         4.49204683e-01,  5.04193366e-01, -7.31801391e-01,  9.99798357e-01,
         5.35020053e-01,  5.30291617e-01, -3.81446779e-01,  9.99620020e-01,
         4.10546452e-01,  5.26878655e-01, -5.81875682e-01,  9.99906600e-01,
         5.16344547e-01,  5.90068102e-01, -6.42787814e-01,  9.99897778e-01,
         4.77049679e-01,  5.90370297e-01, -7.01228440e-01,  9.99947250e-01,
         5.76335371e-01,  7.75411665e-01, -1.39373124e-01,  9.98877347e-01,
         2.85322607e-01,  7.61919737e-01, -5.72005928e-01,  9.99282181e-01

In [109]:
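# Convert dictionaries to dataframes, setting each key as that row's index
pose_df = pd.DataFrame.from_dict(pose_values, orient='index')
face_df = pd.DataFrame.from_dict(face_values, orient='index')
lh_df = pd.DataFrame.from_dict(lh_values, orient='index')
rh_df = pd.DataFrame.from_dict(rh_values, orient='index')

# confirm each dataframe has one row per frame (1260 videos x 40 frames = 50400 rows)
print(pose_df.shape)
print(face_df.shape)
print(lh_df.shape)
print(rh_df.shape)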


(50400, 132)
(50400, 1404)
(50400, 63)
(50400, 63)


In [138]:
# reset indexes - keeping label index as a col
pose_df.reset_index(inplace=True)
face_df.reset_index(inplace=True)
lh_df.reset_index(inplace=True)
rh_df.reset_index(inplace=True)

# display the pose dataframe
pose_df

| | index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | finger_0_0 | 0.503754 | 0.548919 | -0.753540 | 0.999928 | 0.515266 | 0.508146 | -0.685854 | 0.999795 | 0.526491 | ... | 0.524843 | 0.000192 | 0.564398 | 2.054700 | 0.709411 | 0.000075 | 0.446864 | 2.067051 | 0.161930 | 0.000284 |
| 1 | finger_0_1 | 0.492862 | 0.526372 | -0.782448 | 0.999930 | 0.507311 | 0.485868 | -0.715031 | 0.999801 | 0.518058 | ... | 0.125514 | 0.000176 | 0.563134 | 2.146005 | 0.356706 | 0.000138 | 0.440232 | 2.171197 | -0.238574 | 0.000278 |
| 2 | finger_0_2 | 0.487094 | 0.525269 | -0.721858 | 0.999901 | 0.504958 | 0.484562 | -0.656412 | 0.999745 | 0.516708 | ... | 0.515209 | 0.000159 | 0.535052 | 2.247408 | 0.543423 | 0.000127 | 0.410596 | 2.269235 | 0.034689 | 0.000252 |
| 3 | finger_0_3 | 0.477871 | 0.525034 | -0.771861 | 0.999884 | 0.501081 | 0.484672 | -0.705925 | 0.999717 | 0.514394 | ... | 0.652762 | 0.000144 | 0.503392 | 2.355783 | 0.570779 | 0.000117 | 0.364500 | 2.360580 | 0.187702 | 0.000229 |
| 4 | finger_0_4 | 0.476038 | 0.525307 | -0.707020 | 0.999885 | 0.498934 | 0.484577 | -0.645622 | 0.999721 | 0.512437 | ... | 0.675888 | 0.000130 | 0.486453 | 2.438031 | 0.543588 | 0.000108 | 0.346792 | 2.442024 | 0.220940 | 0.000207 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50395 | wave_209_35 | 0.648843 | 0.520429 | -0.572193 | 0.999963 | 0.664955 | 0.473858 | -0.531525 | 0.999896 | 0.675880 | ... | 0.464300 | 0.000010 | 0.617601 | 2.492623 | -0.000934 | 0.000028 | 0.454778 | 2.484513 | -0.035375 | 0.000019 |
| 50396 | wave_209_36 | 0.648735 | 0.523198 | -0.657996 | 0.999963 | 0.664864 | 0.475814 | -0.618770 | 0.999895 | 0.675761 | ... | 0.629114 | 0.000010 | 0.611808 | 2.496114 | 0.063932 | 0.000028 | 0.448229 | 2.486378 | 0.115098 | 0.000019 |
| 50397 | wave_209_37 | 0.646986 | 0.525713 | -0.636677 | 0.999963 | 0.663797 | 0.477565 | -0.598282 | 0.999893 | 0.674494 | ... | 0.630575 | 0.000010 | 0.610205 | 2.501393 | 0.056555 | 0.000028 | 0.449024 | 2.495972 | 0.118353 | 0.000019 |
| 50398 | wave_209_38 | 0.645335 | 0.527644 | -0.682062 | 0.999961 | 0.662593 | 0.479552 | -0.640354 | 0.999889 | 0.672889 | ... | 0.672138 | 0.000011 | 0.610691 | 2.507698 | 0.080599 | 0.000027 | 0.451315 | 2.505574 | 0.152360 | 0.000020 |


In [144]:
# create action category, video sample number and frame number columns
landmark_list = [pose_df, face_df, lh_df, rh_df]
for landmark_df in landmark_list:
    landmark_df['action'] = landmark_df['index'].apply(lambda x: x.split("_")[0])
    landmark_df['video sample number'] = landmark_df['index'].apply(lambda x: x.split("_")[1])
    landmark_df['frame number'] = landmark_df['index'].apply(lambda x: x.split("_")[2])
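
One caveat with splitting on every underscore is that an action name containing an underscore (for example a hypothetical `middle_finger` folder) would shift the fields. A possible alternative, sketched here on a toy dataframe rather than the real data, splits from the right and casts the numeric fields to integers. Note that casting would change the dtypes reported for these columns.

```python
import pandas as pd

# Toy dataframe with row labels in the action_videoSample_frame format
toy_df = pd.DataFrame({"index": ["wave_0_0", "wave_0_1", "middle_finger_12_39"]})

# Split from the right at most twice, so underscores in the action name are preserved
parts = toy_df["index"].str.rsplit("_", n=2, expand=True)
toy_df["action"] = parts[0]
toy_df["video sample number"] = parts[1].astype(int)
toy_df["frame number"] = parts[2].astype(int)

print(toy_df)
```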
    

In [156]:
# sanity check
pose_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50400 entries, 0 to 50399
Columns: 136 entries, index to frame number
dtypes: float64(132), object(4)
memory usage: 52.3+ MB


In [None]:
New columns 

In [None]:
We can do a time series analysis to see whether a landmark's previous position is a good indicator of where it is going next.

We can look at the average location of each landmark per action.

We can look at the min and max values of selected landmarks for every action.

We need to import the body part diagrams and pick the relevant body parts, to see whether we can reduce complexity by dropping landmarks that carry little information.
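
Several of these ideas (per-action averages, min and max ranges, and frame-to-frame movement) could be sketched with pandas `groupby`. The toy dataframe below uses two made-up landmark columns, `lm_0` and `lm_1`, in place of the real integer-named columns, and the values are invented purely for illustration.

```python
import pandas as pd

# Toy data: two actions, one video each, three frames, two landmark columns (values are made up)
toy_df = pd.DataFrame({
    "action": ["wave"] * 3 + ["idle"] * 3,
    "video sample number": [0] * 6,
    "frame number": [0, 1, 2] * 2,
    "lm_0": [0.10, 0.20, 0.40, 0.50, 0.50, 0.50],
    "lm_1": [0.90, 0.80, 0.70, 0.30, 0.30, 0.30],
})
landmark_cols = ["lm_0", "lm_1"]

# average location of each landmark value per action
mean_per_action = toy_df.groupby("action")[landmark_cols].mean()

# min and max of each landmark value per action
range_per_action = toy_df.groupby("action")[landmark_cols].agg(["min", "max"])

# frame-to-frame change within each video (NaN on the first frame of every video)
ordered = toy_df.sort_values(["action", "video sample number", "frame number"])
deltas = ordered.groupby(["action", "video sample number"])[landmark_cols].diff()

print(mean_per_action)
print(range_per_action)
print(deltas)
```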




In [None]:
The

![Alt text](https://camo.githubusercontent.com/b0f077393b25552492ef5dd7cd9fd13f386e8bb480fa4ed94ce42ede812066a1/68747470733a2f2f6d65646961706970652e6465762f696d616765732f6d6f62696c652f68616e645f6c616e646d61726b732e706e67)

![Alt text](https://camo.githubusercontent.com/7fbec98ddbc1dc4186852d1c29487efd7b1eb820c8b6ef34e113fcde40746be2/68747470733a2f2f6d65646961706970652e6465762f696d616765732f6d6f62696c652f706f73655f747261636b696e675f66756c6c5f626f64795f6c616e646d61726b732e706e67)

![canonical_face_model_uv_visualization.png](attachment:canonical_face_model_uv_visualization.png)

In [None]:

# instantiate mediapipe model
model = Pipe()

# set directory for raw data and directory for data post processing
root_raw = './Raw_Data'

raw_data_files = os.listdir(root_raw)
actions = ['wave', 'salute', 'kiss', 'idle', 'heart', 'finger']

videos, label = [], []

# set up holistic model
with model.mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
    for file in raw_data_files:
        for action in actions:
            for video_num in range(model.num_sample_videos):
                print(file, action, video_num)
                window = []
                for frame_num in range(model.frames_per_seq):
                    
                    # create the path to the raw video frame
                    frame_path = os.path.join(root_raw, file, action, str(video_num), str(frame_num) + '.npy')

                    # load the numpy frame
                    frame = np.load(frame_path)

                    # make mediapipe detections
                    frame, results = model.pose_detection(frame, holistic)

                    # draw landmarks on frame to be rendered
                    #model.draw_landmarks(frame, results)

                    #cv2.imshow('frame', frame)

                    # extract pose, face, lh, & rh landmarks
                    all_lm = model.extract_landmarks2(results)

                    window.append(all_lm)
                videos.append(window)
                label.append(action)
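
The `extract_landmarks2` method belongs to the `keypoints` module, which is not shown in this notebook. As an assumption about what such a flatten-and-concatenate step could look like, the sketch below builds a 1662-value vector from an object exposing the `pose_landmarks`, `face_landmarks`, `left_hand_landmarks`, and `right_hand_landmarks` attributes of MediaPipe Holistic results. The function name `flatten_landmarks` and the zero-filling of undetected body parts are hypothetical choices, and a `SimpleNamespace` stands in for a real MediaPipe result so the example runs without a camera.

```python
from types import SimpleNamespace

import numpy as np


def flatten_landmarks(results):
    """Flatten holistic results into one vector ordered pose, face, left hand, right hand."""
    def group(landmark_list, count, with_visibility=False):
        width = 4 if with_visibility else 3
        # assumption: an undetected body part is None and is filled with zeros to keep the length fixed
        if landmark_list is None:
            return np.zeros(count * width)
        rows = [
            [lm.x, lm.y, lm.z, lm.visibility] if with_visibility else [lm.x, lm.y, lm.z]
            for lm in landmark_list.landmark
        ]
        return np.array(rows).flatten()

    return np.concatenate([
        group(results.pose_landmarks, 33, with_visibility=True),
        group(results.face_landmarks, 468),
        group(results.left_hand_landmarks, 21),
        group(results.right_hand_landmarks, 21),
    ])


# Stand-in for a MediaPipe result: pose detected, face and hands missing
fake_point = SimpleNamespace(x=0.5, y=0.5, z=-0.1, visibility=0.99)
fake_results = SimpleNamespace(
    pose_landmarks=SimpleNamespace(landmark=[fake_point] * 33),
    face_landmarks=None,
    left_hand_landmarks=None,
    right_hand_landmarks=None,
)

vector = flatten_landmarks(fake_results)
print(vector.shape)  # (1662,)
```

Keeping the vector length fixed regardless of which body parts were detected is what allows every frame to be stacked into a single array later on.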