# Objective:

As input to the system, take the live feed from the webcam and use pose estimation to map out a small dance tutorial.

# Approach:
- We will take a pretrained **openpose estimation model** to prdict the **18 keypoints** on a human body.
- We take openpose model for tensorflow by Ildoo Kim
  - GitHub Repo Link: https://github.com/ildoonet/tf-pose-estimation
<br>**[!] Note**: Some how I found issues with this repo to work with tensorflow 2.0 and followed a modified repo of his by Gunjan Seth.<br>
GitHub Repo Link: https://github.com/gsethi2409/tf-pose-estimation
<br>Medium Blog by Gunjan Seth: https://medium.com/@gsethi2409/pose-estimation-with-tensorflow-2-0-a51162c095ba
- The keypoints of the dancer are obtained and stored in a array list.
- These keypoints are **normalized**.
- The user feed is taken and the keypoints are detected.
- The keypoints are normalized and the **cosine similarity** is found between the user keypoints and the array of dancer keypoints.
- The minimum similarity score is **compared with the threshold** and then it displays is the user steps are correct or not for the given dancer moves.

# Constraints To Look For:
1. The model should be fast for prediction. Latency should be avoided.
2. Predictions should be accurate and the steps should be close enough with  the dancer.


## Import the Necessary Libraries

In [2]:
import sys
import time
import logging

import numpy as np
import cv2
import tensorflow as tf
import pandas as pd

from tf_pose import common
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh
import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer
import warnings
warnings.filterwarnings('ignore')

import os
os.environ["CUDA_VISIBLE_DEVICES"]=''



## Model and TfPose Estimator
We initialize the pretrained model with the required parameters as seen below.

In [3]:
camera = 0
resize = '432x368'     # resize images before they are processed
resize_out_ratio = 4.0 # resize heatmaps before they are post-processed
model='mobilenet_v2_large'
show_process = False
tensorrt = False       # for tensorrt process

In [4]:
w, h = model_wh(resize)
if w > 0 and h > 0:
    e = TfPoseEstimator(get_graph_path(model), target_size=(w, h), trt_bool=False)
else:
    e = TfPoseEstimator(get_graph_path(model), target_size=(432, 368), trt_bool=False)
print('********* Model Ready *************')

[2022-07-27 11:42:03,713] [TfPoseEstimator] [INFO] loading graph from c:\Users\SSAFY\Desktop\모션인식\models\graph/mobilenet_v2_large/graph_opt.pb(default size=432x368)
2022-07-27 11:42:03,713 INFO loading graph from c:\Users\SSAFY\Desktop\모션인식\models\graph/mobilenet_v2_large/graph_opt.pb(default size=432x368)


TfPoseEstimator/image
TfPoseEstimator/MobilenetV2/Conv/BatchNorm/Const
TfPoseEstimator/MobilenetV2/Conv/BatchNorm/Const_1
TfPoseEstimator/MobilenetV2/expanded_conv/depthwise/BatchNorm/Const
TfPoseEstimator/MobilenetV2/expanded_conv/depthwise/BatchNorm/Const_1
TfPoseEstimator/MobilenetV2/expanded_conv/project/BatchNorm/Const
TfPoseEstimator/MobilenetV2/expanded_conv/project/BatchNorm/Const_1
TfPoseEstimator/MobilenetV2/expanded_conv_1/expand/BatchNorm/Const
TfPoseEstimator/MobilenetV2/expanded_conv_1/expand/BatchNorm/Const_1
TfPoseEstimator/MobilenetV2/expanded_conv_1/depthwise/BatchNorm/Const
TfPoseEstimator/MobilenetV2/expanded_conv_1/depthwise/BatchNorm/Const_1
TfPoseEstimator/MobilenetV2/expanded_conv_1/project/BatchNorm/Const
TfPoseEstimator/MobilenetV2/expanded_conv_1/project/BatchNorm/Const_1
TfPoseEstimator/MobilenetV2/expanded_conv_2/expand/BatchNorm/Const
TfPoseEstimator/MobilenetV2/expanded_conv_2/expand/BatchNorm/Const_1
TfPoseEstimator/MobilenetV2/expanded_conv_2/depthwise/

# Take position from the trainer (dancer):
- We made two functions to get all the keypoints from the trainer and store them in a dataframe and in a list.
-  The function **"dance_video_processing"** is used to predict all the keypoints from the video and return all the keypoints for the video.
- The function **"get_position"** is used to take all the keypoints that are returned from the above function, preprocess them and return the dataframe and the list of keypoints.  

In [26]:
from anyio import current_time


def dance_video_processing(video_path= r'dance_video/dancer.mp4',showBG = True):

    cap = cv2.VideoCapture(video_path)
    if cap.isOpened() is False:
        print("Error opening video stream or file")

    # fps_time = 0
    prev_time = 0
    FPS =10
    keypoints_list=[]
    while True:
        ret_val, image = cap.read()
        cv2.imshow('Dancer2', image)
        current_time = time.time() - prev_time
        dim = (368, 428)
        if (ret_val is True) and (current_time > 1./FPS):
             # resize image
            image = cv2.resize(image, dim, interpolation = cv2.INTER_AREA)
            humans = e.inference(image,
                                 resize_to_default=(w > 0 and h > 0),
                                 upsample_size=4.0)
            if not showBG:
                image = np.zeros(image.shape)
            # Plotting the keypoints and lines to the image 
            image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
            npimg = np.copy(image)
            image_h, image_w = npimg.shape[:2]
            centers = {}
            
            for human in humans:
                      # draw point
                    for i in range(common.CocoPart.Background.value):
                            if i not in human.body_parts.keys():
                                    continue
                            body_part = human.body_parts[i]
                            x_axis=int(body_part.x * image_w + 0.5)
                            y_axis=int(body_part.y * image_h + 0.5) 
                            center=[x_axis,y_axis]
                            centers[i] = center
                            keypoints_list.append(centers)
            # To display fps
            cv2.putText(image, "FPS: %f" % (1.0 / (time.time() - prev_time)), (10, 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
            # To display image
            cv2.imshow('Dancer', image)
            prev_time = time.time()

            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
            
        elif (ret_val is False):
            break
        else:
            cv2.waitKey(1)
    print(len(keypoints_list))
    cap.release()
    cv2.destroyAllWindows()
    return keypoints_list

In [26]:
def dance_video_processing(video_path= r'dance_video/dancer.mp4',showBG = True):
        cap = cv2.VideoCapture(video_path)
        if cap.isOpened() is False:
            print("Error opening video stream or file")
        fps_time = 0
        while True:
            ret_val, image = cap.read()
            dim = (368, 428)
            if ret_val:
                 # resize image
                image = cv2.resize(image, dim, interpolation = cv2.INTER_AREA)
                humans = e.inference(image,
                                     resize_to_default=(w > 0 and h > 0),
                                     upsample_size=4.0)
                if not showBG:
                    image = np.zeros(image.shape)
                # Plotting the keypoints and lines to the image 
                image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
                npimg = np.copy(image)
                image_h, image_w = npimg.shape[:2]
                centers = {}
                keypoints_list=[]
                for human in humans:
                          # draw point
                        for i in range(common.CocoPart.Background.value):
                                if i not in human.body_parts.keys():
                                        continue

                                body_part = human.body_parts[i]
                                x_axis=int(body_part.x * image_w + 0.5)
                                y_axis=int(body_part.y * image_h + 0.5) 
                                center=[x_axis,y_axis]
                                centers[i] = center
                                keypoints_list.append(centers)
                # To display fps
                cv2.putText(image, "FPS: %f" % (1.0 / (time.time() - fps_time)), (10, 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
                # To display image
                cv2.imshow('Dancer', image)
                fps_time = time.time()
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
                
            else:
                break
        #print(keypoints_list)
        cap.release()
        cv2.destroyAllWindows()
        return keypoints_list

In [6]:
def get_position(video_path= r'dance_video/dancer.mp4',showBG = True):
    keypoints_list=dance_video_processing()
    df = pd.DataFrame(keypoints_list)
    df.to_csv("keypoints_list",index=False )
    #features=[0]*32
    
    #print(features)
    keyp_list=[]
    #data=pd.Dataframe()
    print(len(keypoints_list))
    # Preprocessing of the keypoints data
    for i in range(0, len(keypoints_list)):
        k=-2
        features=[0]*36
        for j in range(0,18):
            k=k+2
            try:
                if k>=36:
                    break
                #print(k)
                #print(keypoints_list[i][j])
                features[k]=keypoints_list[i][j][0]
                features[k+1]=keypoints_list[i][j][1]
            except:
                features[k]=0
                features[k+1]=0
        # print(features)
        keyp_list.append(features) # features : 한 프레임의 position 값 
    # print(keyp_list)
    # Getting all the feature column names for intialization of our dataframe.
    column_names=[]
    for i in range(36):
        column_names.append(str(i))
    data=pd.DataFrame(keyp_list,columns=column_names)
    return data,keyp_list

In [27]:
dance_video_processing()

error: OpenCV(4.6.0) D:\a\opencv-python\opencv-python\opencv\modules\highgui\src\window.cpp:967: error: (-215:Assertion failed) size.width>0 && size.height>0 in function 'cv::imshow'


: 

In [18]:
data,keyp_list=get_position()
print(keyp_list)

13
[[152, 60, 182, 100, 150, 107, 141, 151, 99, 151, 215, 91, 245, 128, 249, 181, 153, 209, 0, 0, 0, 0, 204, 202, 0, 0, 0, 0, 145, 56, 157, 54, 0, 0, 177, 54], [152, 60, 182, 100, 150, 107, 141, 151, 99, 151, 215, 91, 245, 128, 249, 181, 153, 209, 0, 0, 0, 0, 204, 202, 0, 0, 0, 0, 145, 56, 157, 54, 0, 0, 177, 54], [152, 60, 182, 100, 150, 107, 141, 151, 99, 151, 215, 91, 245, 128, 249, 181, 153, 209, 0, 0, 0, 0, 204, 202, 0, 0, 0, 0, 145, 56, 157, 54, 0, 0, 177, 54], [152, 60, 182, 100, 150, 107, 141, 151, 99, 151, 215, 91, 245, 128, 249, 181, 153, 209, 0, 0, 0, 0, 204, 202, 0, 0, 0, 0, 145, 56, 157, 54, 0, 0, 177, 54], [152, 60, 182, 100, 150, 107, 141, 151, 99, 151, 215, 91, 245, 128, 249, 181, 153, 209, 0, 0, 0, 0, 204, 202, 0, 0, 0, 0, 145, 56, 157, 54, 0, 0, 177, 54], [152, 60, 182, 100, 150, 107, 141, 151, 99, 151, 215, 91, 245, 128, 249, 181, 153, 209, 0, 0, 0, 0, 204, 202, 0, 0, 0, 0, 145, 56, 157, 54, 0, 0, 177, 54], [152, 60, 182, 100, 150, 107, 141, 151, 99, 151, 215, 91, 24


**Observation:** 
- We can see how the keypoints data looks from the above example.
- Since they are 18 keypoints and each keypoint has **x-coordinate** and **y-coordinate** we have **36 columns** (18 x 2).

# Cosine Similarity:
Cosine Similarity function for our model to find the keypoints.

In [13]:
def findCosineSimilarity_1(source_representation, test_representation):
    import numpy as np
    a = np.matmul(np.transpose(source_representation), test_representation)
    b = np.sum(np.multiply(source_representation, source_representation))
    c = np.sum(np.multiply(test_representation, test_representation))
    return 1 - (a / (np.sqrt(b) * np.sqrt(c)))

# Comparing:
Comparing the user images with keypoints of the dancer. 

In [28]:
def compare_positions(trainer_video,user_video,keyp_list, dim=(420,720)):
    cap = cv2.VideoCapture(trainer_video)
    cam = cv2.VideoCapture(user_video) 
    cam.set(3, w)
    cam.set(4, h)
    # fps_time = 0 #Initializing fps to 0
    prev_time = 0
    FPS = 0.5
    while True:
        ret_val, image_1 = cam.read()
        e_d=0
        ret_val_1,image_2 = cap.read()
        current_time = time.time() - prev_time
        if ret_val_1 and ret_val and (current_time > 1./FPS):
            # resizing the images
            image_1 = cv2.resize(image_1, dim, interpolation = cv2.INTER_AREA)
            image_2 = cv2.resize(image_2, dim, interpolation = cv2.INTER_AREA)
            
            humans_2 = e.inference(image_1, resize_to_default=(w > 0 and h > 0),upsample_size=4.0 )
            dancers_1 = e.inference(image_2,resize_to_default=(w > 0 and h > 0),upsample_size=4.0)
            
            #Dancer keypoints and normalization
            transformer = Normalizer().fit(keyp_list)  
            keyp_list=transformer.transform(keyp_list)
            
            # Showing FPS
            cv2.putText(image_2, "FPS: %f" % (1.0 / (time.time() - prev_time)), (10, 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
           
            # Getting User keypoints, normalization and comparing also plotting the keypoints and lines to the image
            image_1 = TfPoseEstimator.draw_humans(image_1, humans_2, imgcopy=False) # 선 표시하는 코드
            image_2 = TfPoseEstimator.draw_humans(image_2, dancers_1, imgcopy=False) # 선 표시하는 코드
            # Displaying the dancer feed.
            cv2.imshow('Dancer Window', image_2)
            npimg = np.copy(image_1)
            image_h, image_w = npimg.shape[:2]
            centers = {}
            keypoints_list=[]
            for human in humans_2:
                          # draw point
                    for i in range(common.CocoPart.Background.value):
                                if i not in human.body_parts.keys():
                                        continue

                                body_part = human.body_parts[i]
                                x_axis=int(body_part.x * image_w + 0.5)
                                y_axis=int(body_part.y * image_h + 0.5)
                                center=[x_axis,y_axis]
                                centers[i] = center
                    k=-2
                    features=[0]*36
                    for j in range(0,18):
                        k=k+2
                        try:
                            if k>=36:
                                break
                            #print(k)
                            #print(keypoints_list[i][j])
                            features[k] = centers[j][0]
                            features[k+1] = centers[j][1]
                        except:
                            features[k]=0
                            features[k+1]=0
                    features=transformer.transform([features])
                    #print(features[0])
                    min_=100 # Intializing a value to get minimum cosine similarity score from the dancer array list with the user
                    for j in keyp_list:
                        #print(j)
                        sim_score=findCosineSimilarity_1(j,features[0])
                        print(sim_score)
                        #print(sim_score)
                        #Getting the minimum Cosine Similarity Score
                        if min_>sim_score:
                            min_=sim_score
            # Displaying the minimum cosine score
            cv2.putText(image_1, str(min_), (10, 30),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
            # If the disctance is below the threshold
            if min_<0.15:
                cv2.putText(image_1, "CORRECT STEPS", (120, 700),
                            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
            else:
                cv2.putText(image_1,  "NOT CORRECT STEPS", (80, 700),
                            cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
            cv2.putText(image_1, "FPS: %f" % (1.0 / (time.time() - prev_time)), (10, 50),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
            # Display the user feed
            flipped_image = cv2.flip(image_1, 1)
            cv2.imshow('User Window', flipped_image)

            prev_time = time.time()
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        else:
            break

    cam.release()
    cap.release()
    cv2.destroyAllWindows()

##### Note:
Since I cant dance, I'll be using a video for this :P.<br> We can replce the **user_video** attribute to **0 or 1** to turn on live camera depending on the type of camera we have.
### For a wrong positions:

In [29]:
compare_positions(r'dance_video/dancer.mp4',0,keyp_list)

0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674
0.4991436055639674


### For a correct positions:

In [None]:
compare_positions(r'dance_video/dancer.mp4',1,keyp_list)

NameError: name 'compare_positions' is not defined

# Conclusion:

- We have developed a pose estimation similarity pipeline to compare similarity between two poses from the given feed of videos or live cam.<br>
**Flaws:**
- This approach fails when the trainer is far or the user is near to the camera or vise-versa. This happens because there is a **scale variation** between the keypoints of the image.<br>
**Solution:**
- We can eleminate this problem by **croping out the image of a peron** using a CNN architecture like Yolo or anything that could detect the bounding boxes of a person.
- This image then can be fed to the openpose model to estimate keypoints for both the sources.<br>
**Scope of improvement:**
- The accuracy of the model for keypoint prediction can be increased by taking a much powerful pretrained model architecture than mobilenet.