# Pose Estimation
Pose estimation is an area of computer vision that involves detecting and tracking the position and orientation of objects, typically human body parts, in images or videos. This technique is widely used in various applications, from augmented reality and virtual reality to healthcare and sports analytics.

There are two main types of pose estimation:

- 2D Pose Estimation: This involves identifying key points on a 2D image, such as the head, shoulders, elbows, and knees. A popular tool for this is OpenPose, which can detect multiple people in real time.
- 3D Pose Estimation: This goes a step further by predicting the 3D coordinates of key points, providing a more detailed understanding of the object’s spatial orientation. This is particularly useful in robotics and animation.

The goal of pose estimation is to detect the location and orientation of a person’s body parts, such as joints and limbs (keypoints), in an image or video.

![image.png](image.png)

There are two main approaches to pose estimation: single-person and multi-person. Single-person pose estimation finds the pose of one person in an image. The person’s location and the number of keypoints to look for are known in advance, which makes it a regression problem. Multi-person pose estimation is harder, because the number of people and their positions in the image are unknown.

Single-person pose estimation can be further divided into two frameworks: direct regression-based and heatmap-based. Direct regression-based frameworks predict keypoint coordinates directly from a feature map. Heatmap-based frameworks generate heatmaps for all keypoints within the image and then use additional methods to construct the final stick figure.
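
As a rough illustration of the heatmap-based framework, the sketch below decodes one keypoint per heatmap by taking the location of the maximum response. The function name `decode_heatmaps`, the `(K, H, W)` layout, and the stride of 8 are assumptions made for this example, not part of any specific library.

```python
import numpy as np


def decode_heatmaps(heatmaps, stride=8):
    """Return one (x, y, score) row per keypoint from heatmaps of shape (K, H, W)."""
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    # Index of the strongest response in each heatmap
    flat_idx = flat.argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (height, width))
    scores = flat[np.arange(num_keypoints), flat_idx]
    # Map heatmap cells back to input-image pixels (cell centre); stride is assumed
    coords = np.stack([xs, ys], axis=1) * stride + stride / 2
    return np.column_stack([coords, scores])


# Two synthetic heatmaps with a single bright cell each
heatmaps = np.zeros((2, 4, 6), dtype=np.float32)
heatmaps[0, 1, 2] = 0.9
heatmaps[1, 3, 5] = 0.7
keypoints = decode_heatmaps(heatmaps)
```

Taking a single maximum only works when exactly one person is present; with several people each heatmap contains several peaks.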

The multi-person pose estimation problem is usually approached in one of two ways. The first, called top-down, applies a person detector and then runs a pose estimation algorithm on every detected person. The problem is thus decoupled into two subproblems, and state-of-the-art methods from both areas can be used. The inference speed of this approach depends strongly on the number of people detected in the image.
The second, called bottom-up, is more robust to the number of people. All keypoints in the image are detected first and then grouped into human instances. This approach is usually faster than top-down, since it finds keypoints once and does not rerun pose estimation for each person.
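
The difference in cost can be seen in a toy sketch that only counts network invocations. All functions here are hypothetical stand-ins (the "image" is just a dict holding person boxes), so this shows the control flow rather than a real detector or pose network.

```python
calls = {"detector": 0, "pose_net": 0, "keypoint_net": 0}


def detect_people(image):
    calls["detector"] += 1
    return image["boxes"]


def estimate_single_pose(image, box):
    calls["pose_net"] += 1
    return {"box": box, "keypoints": []}


def detect_keypoints(image):
    calls["keypoint_net"] += 1
    # Pretend one keypoint was found per person, tagged with the person index
    return [(person_id, "nose") for person_id, _ in enumerate(image["boxes"])]


def group_keypoints(keypoints):
    people = {}
    for person_id, name in keypoints:
        people.setdefault(person_id, []).append(name)
    return list(people.values())


def top_down(image):
    """One detector pass, then one pose pass per detected person."""
    return [estimate_single_pose(image, box) for box in detect_people(image)]


def bottom_up(image):
    """One keypoint pass over the whole image, then grouping into people."""
    return group_keypoints(detect_keypoints(image))


image = {"boxes": [(0, 0, 10, 20), (30, 0, 10, 20), (60, 0, 10, 20)]}
top_down_poses = top_down(image)
bottom_up_poses = bottom_up(image)
```

With three people, the top-down path runs the pose network three times after one detector pass, while the bottom-up path runs its keypoint network once regardless of how many people are present.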

The task is to predict a pose skeleton for every person in an image. The skeleton consists of keypoints (or joints): ankles, knees, hips, elbows, etc. The pipeline has two steps:
- Neural network inference, which produces two tensors: keypoint heatmaps and their pairwise relations (part affinity fields, PAFs).
- Grouping keypoints by person instance. This includes upsampling the tensors to the original image size, extracting keypoints at the heatmap peaks, and grouping them by instance.
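
The peak-extraction part of the second step can be sketched as follows. This is a simplified illustration: it works on a single 2D heatmap at its native resolution (no upsampling), and the 4-neighbour comparison and the threshold value are assumptions chosen for the example.

```python
import numpy as np


def find_peaks(heatmap, threshold=0.1):
    """Return (x, y, score) for every local maximum above `threshold` in a 2D heatmap."""
    # Pad with -inf so border cells can be compared with their neighbours
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    is_peak = (
        (center > threshold)
        & (center > padded[1:-1, :-2])  # left neighbour
        & (center > padded[1:-1, 2:])   # right neighbour
        & (center > padded[:-2, 1:-1])  # upper neighbour
        & (center > padded[2:, 1:-1])   # lower neighbour
    )
    ys, xs = np.nonzero(is_peak)
    return [(int(x), int(y), float(heatmap[y, x])) for y, x in zip(ys, xs)]


# Synthetic heatmap for one keypoint type with two people in the image
heatmap = np.zeros((5, 7), dtype=np.float64)
heatmap[1, 1] = 0.8
heatmap[3, 5] = 0.6
heatmap[3, 4] = 0.3  # shoulder of the second peak, not a peak itself
peaks = find_peaks(heatmap)
```

Each peak is a candidate keypoint; deciding which candidates belong to the same person is the job of the grouping stage.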
  
# Models
- https://github.com/bmartacho/OmniPose
- https://github.com/open-mmlab/mmpose
- https://www.tensorflow.org/hub/tutorials/movenet


![image.png](image2.png)

- `pip install --upgrade pip && pip install -r requirements.txt`
- `jupyter labextension install --no-build @jupyter-widgets/jupyterlab-manager`
- `jupyter labextension install --no-build jupyter-datawidgets/extension`
- `jupyter labextension install jupyter-threejs`
- `jupyter labextension list`

- https://github.com/openvinotoolkit/open_model_zoo/tree/master/tools/model_tools

In [None]:
# %pip install pythreejs "openvino-dev>=2024.0.0" "opencv-python" "torch" "onnx<1.16.2" --extra-index-url https://download.pytorch.org/whl/cpu 

# pip install openvino-dev

In [1]:
import collections
import time
from pathlib import Path

import cv2
import ipywidgets as widgets
import numpy as np
from IPython.display import clear_output, display
import openvino as ov

# Fetch `notebook_utils` module
import requests

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
with open("notebook_utils.py", "w") as f:
    f.write(r.text)

r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/engine3js.py",
)
with open("engine3js.py", "w") as f:
    f.write(r.text)

import notebook_utils as utils
import engine3js as engine

### Model
https://docs.openvino.ai/2022.3/omz_models_model_human_pose_estimation_3d_0001.html


The network first extracts features, then performs an initial estimation of heatmaps and PAFs, followed by 5 refinement stages. It can find 18 types of keypoints. The grouping procedure then searches for the best pair (by affinity) for each keypoint from a predefined list of keypoint pairs, e.g. left elbow and left wrist, right hip and right knee, left eye and left ear, and so on, 19 pairs overall.
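
One common way to score the affinity between two candidate keypoints is to sample the part affinity field along the segment that joins them and measure how well it aligns with the segment direction. The sketch below shows that idea under assumptions of this example: the name `paf_score`, nearest-cell sampling, and 10 sample points are illustrative choices, not the exact procedure of the model above.

```python
import numpy as np


def paf_score(paf_x, paf_y, point_a, point_b, num_samples=10):
    """Average alignment between the PAF and the segment from point_a to point_b."""
    a = np.asarray(point_a, dtype=np.float64)
    b = np.asarray(point_b, dtype=np.float64)
    direction = b - a
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    # Sample points evenly along the segment and read the field at the nearest cell
    samples = a + np.linspace(0.0, 1.0, num_samples)[:, None] * direction
    xs = np.clip(np.round(samples[:, 0]).astype(int), 0, paf_x.shape[1] - 1)
    ys = np.clip(np.round(samples[:, 1]).astype(int), 0, paf_x.shape[0] - 1)
    # Dot product of the field vector with the unit direction, averaged over samples
    return float(np.mean(paf_x[ys, xs] * unit[0] + paf_y[ys, xs] * unit[1]))


# Synthetic field: a horizontal "forearm" pointing in +x along row 5
paf_x = np.zeros((10, 10))
paf_y = np.zeros((10, 10))
paf_x[5, 2:8] = 1.0

elbow = (2, 5)
good_score = paf_score(paf_x, paf_y, elbow, (7, 5))  # wrist lying on the limb
bad_score = paf_score(paf_x, paf_y, elbow, (2, 9))   # wrist candidate of another person
```

The candidate pair with the highest score is kept as a limb; here the wrist lying on the field scores close to 1 and the unrelated one close to 0.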

https://arxiv.org/pdf/1811.12004

https://arxiv.org/pdf/1712.03453


In [2]:
# directory where model will be downloaded
base_model_dir = "model"

# model name as named in Open Model Zoo
model_name = "human-pose-estimation-3d-0001"
# selected precision (FP32, FP16)
precision = "FP32"

BASE_MODEL_NAME = f"{base_model_dir}/public/{model_name}/{model_name}"
model_path = Path(BASE_MODEL_NAME).with_suffix(".pth")
onnx_path = Path(BASE_MODEL_NAME).with_suffix(".onnx")

ir_model_path = Path(f"{base_model_dir}/public/{model_name}/{precision}/{model_name}.xml")
model_weights_path = Path(f"{base_model_dir}/public/{model_name}/{precision}/{model_name}.bin")

if not model_path.exists():
    download_command = f"omz_downloader --name {model_name} --output_dir {base_model_dir}"
    ! $download_command

### Model Conversion

Equivalent command: `omz_converter --name human-pose-estimation-3d-0001 --precisions FP32 --download_dir model --output_dir model`

In [3]:
if not onnx_path.exists():
    convert_command = (
        f"omz_converter --name {model_name} --precisions {precision} "
        f"--download_dir {base_model_dir} --output_dir {base_model_dir}"
    )
    ! $convert_command

### Select inference device

In [4]:
device = utils.device_widget()

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

### Load the model

In [5]:
# initialize inference engine
core = ov.Core()
# read the network and corresponding weights from file
model = core.read_model(model=ir_model_path, weights=model_weights_path)
# load the model on the specified device
compiled_model = core.compile_model(model=model, device_name=device.value)
infer_request = compiled_model.create_infer_request()
input_tensor_name = model.inputs[0].get_any_name()

# get input and output names of nodes
input_layer = compiled_model.input(0)
output_layers = list(compiled_model.outputs)

In [6]:
input_layer, output_layers

(<ConstOutput: names[data] shape[1,3,256,448] type: f32>,
 [<ConstOutput: names[features] shape[1,57,32,56] type: f32>,
  <ConstOutput: names[heatmaps] shape[1,19,32,56] type: f32>,
  <ConstOutput: names[pafs] shape[1,38,32,56] type: f32>])

In [7]:
input_layer.any_name, [o.any_name for o in output_layers]

('data', ['features', 'heatmaps', 'pafs'])
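
The shapes printed above can be cross-checked with a little arithmetic. The snippet below only reuses the numbers from that output; reading the 38 PAF channels as two vector components per limb is an assumption about the channel layout.

```python
input_shape = (1, 3, 256, 448)
output_shapes = {
    "features": (1, 57, 32, 56),
    "heatmaps": (1, 19, 32, 56),
    "pafs": (1, 38, 32, 56),
}

# Spatial downsampling factor between the input and the output maps
stride_h = input_shape[2] // output_shapes["heatmaps"][2]
stride_w = input_shape[3] // output_shapes["heatmaps"][3]

# Assumed layout: one (x, y) vector pair per limb in the PAF tensor
paf_vectors = output_shapes["pafs"][1] // 2
```

Both strides come out as 8, which matches the `stride = 8` used later in the notebook, and 38 PAF channels correspond to 19 two-component vectors.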

## Draw 2D Pose Overlays
We need to define the connections between joints in advance so that, once the inference results are available, we can draw the structure of the human body on the output image. Joints are drawn as circles and limbs are drawn as lines.

https://github.com/openvinotoolkit/open_model_zoo/tree/master/demos/human_pose_estimation_3d_demo/python
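
To make the idea of predefined joint connections concrete, here is a minimal sketch that turns a list of edges and a set of 2D keypoints into line segments. The 4-joint `EDGES` list and the `limb_segments` helper are hypothetical and much smaller than the `body_edges` used in this notebook.

```python
import numpy as np

# Hypothetical skeleton: 0 neck, 1 shoulder, 2 elbow, 3 wrist
EDGES = [(0, 1), (1, 2), (2, 3)]


def limb_segments(keypoints, edges, min_score=0.0):
    """Return ((x1, y1), (x2, y2)) pixel segments for edges whose two joints are visible.

    keypoints: array of shape (J, 3) holding x, y, score per joint.
    """
    segments = []
    for start, end in edges:
        if keypoints[start, 2] > min_score and keypoints[end, 2] > min_score:
            p1 = tuple(int(v) for v in np.round(keypoints[start, :2]))
            p2 = tuple(int(v) for v in np.round(keypoints[end, :2]))
            segments.append((p1, p2))
    return segments


keypoints = np.array(
    [
        [10.0, 10.0, 0.9],
        [20.0, 10.0, 0.8],
        [30.0, 20.0, 0.0],  # elbow not detected
        [40.0, 30.0, 0.7],
    ]
)
segments = limb_segments(keypoints, EDGES)
```

With OpenCV, each segment could then be drawn with `cv2.line` and each visible joint with `cv2.circle`.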


In [11]:
from pose_utis import draw_poses,  body_edges, body_edges_2d

In [12]:
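def model_infer(scaled_img, stride, infer_request):
    """
    Run model inference on the input image

    Parameters:
        scaled_img: image resized to the input size of the model (H, W, C)
        stride: int, network stride; height and width are cropped to a multiple of it
        infer_request: OpenVINO infer request created from the compiled model
    Returns:
        tuple of (features, heatmaps, pafs) arrays with the batch dimension removed
    """

    # Crop the image so that its height and width are multiples of the stride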
    img = scaled_img[
        0 : scaled_img.shape[0] - (scaled_img.shape[0] % stride),
        0 : scaled_img.shape[1] - (scaled_img.shape[1] % stride),
    ]

    img = np.transpose(img, (2, 0, 1))[None,]
    infer_request.infer({input_tensor_name: img})
    # Read the three output tensors by name
    results = {name: infer_request.get_tensor(name).data[:] for name in ("features", "heatmaps", "pafs")}
    # Drop the batch dimension and return the outputs in a fixed order
    results = (results["features"][0], results["heatmaps"][0], results["pafs"][0])

    return results
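
The preprocessing inside `model_infer` can be checked in isolation on a dummy array. The frame size below is made up for the example; the cropping and transposition mirror the function above.

```python
import numpy as np

stride = 8
# Dummy frame in OpenCV layout: height, width, channels
frame = np.zeros((260, 450, 3), dtype=np.uint8)

# Crop so that height and width are multiples of the stride
cropped = frame[
    : frame.shape[0] - frame.shape[0] % stride,
    : frame.shape[1] - frame.shape[1] % stride,
]

# HWC -> CHW, then add a leading batch dimension
batch = np.transpose(cropped, (2, 0, 1))[None]
```

A 260x450 frame is cropped to 256x448 and becomes a `(1, 3, 256, 448)` batch, which is the input shape reported for this model.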


In [13]:
def run_pose_estimation(source=0, flip=False, use_popup=False, skip_frames=0, infer_request=None):
    """
    Take 2D images as input, run inference with OpenVINO,
    get the 3D joint coordinates, and draw the 3D human skeleton in the scene

    :param source:      Webcam number (the primary webcam is 0) or path to a video file.
    :param flip:        Passed to VideoPlayer to flip the captured image.
    :param use_popup:   False to show encoded frames in this notebook, True to create a popup window.
    :param skip_frames: Number of frames to skip at the beginning of the video.
    :param infer_request: OpenVINO infer request used to run the model.
    """

    focal_length = -1  # default
    stride = 8
    player = None
    skeleton_set = None

    try:
        # create a video player that reads frames from the camera or video file at the target fps
        # change 'skip_first_frames' to skip the first N frames and fast-forward the video
        player = utils.VideoPlayer(source, flip=flip, fps=30, skip_first_frames=skip_frames)
        # start capturing
        player.start()

        input_image = player.next()
        # set the window size
        resize_scale = 450 / input_image.shape[1]
        windows_width = int(input_image.shape[1] * resize_scale)
        windows_height = int(input_image.shape[0] * resize_scale)

        # use visualization library
        engine3D = engine.Engine3js(grid=True, axis=True, view_width=windows_width, view_height=windows_height)

        if use_popup:
            # display the 3D human pose in this notebook, and the original frame in a popup window
            display(engine3D.renderer)
            title = "Press ESC to Exit"
            cv2.namedWindow(title, cv2.WINDOW_KEEPRATIO | cv2.WINDOW_AUTOSIZE)
        else:
            # set the 2D image box, show both human pose and image in the notebook
            imgbox = widgets.Image(format="jpg", height=windows_height, width=windows_width)
            display(widgets.HBox([engine3D.renderer, imgbox]))

        skeleton = engine.Skeleton(body_edges=body_edges)

        processing_times = collections.deque()

        while True:
            # grab the frame
            frame = player.next()
            if frame is None:
                print("Source ended")
                break

            # resize image and change dims to fit neural network input
            # (see https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/human-pose-estimation-3d-0001)
            scaled_img = cv2.resize(frame, dsize=(model.inputs[0].shape[3], model.inputs[0].shape[2]))

            if focal_length < 0:  # Focal length is unknown
                focal_length = np.float32(0.8 * scaled_img.shape[1])

            # inference start
            start_time = time.time()
            # get results
            inference_result = model_infer(scaled_img, stride,infer_request)

            # inference stop
            stop_time = time.time()
            processing_times.append(stop_time - start_time)
            # Parse the network outputs into 3D and 2D pose coordinates
            poses_3d, poses_2d = engine.parse_poses(inference_result, 1, stride, focal_length, True)

            # use processing times from last 200 frames
            if len(processing_times) > 200:
                processing_times.popleft()

            processing_time = np.mean(processing_times) * 1000
            fps = 1000 / processing_time

            if len(poses_3d) > 0:
                # From here, you can rotate the 3D point positions using the function "draw_poses",
                # or you can directly make the correct mapping below to properly display the object image on the screen
                poses_3d_copy = poses_3d.copy()
                x = poses_3d_copy[:, 0::4]
                y = poses_3d_copy[:, 1::4]
                z = poses_3d_copy[:, 2::4]
                poses_3d[:, 0::4], poses_3d[:, 1::4], poses_3d[:, 2::4] = (
                    -z + np.ones(poses_3d[:, 2::4].shape) * 200,
                    -y + np.ones(poses_3d[:, 2::4].shape) * 100,
                    -x,
                )

                poses_3d = poses_3d.reshape(poses_3d.shape[0], 19, -1)[:, :, 0:3]
                people = skeleton(poses_3d=poses_3d)

                try:
                    engine3D.scene_remove(skeleton_set)
                except Exception:
                    pass

                engine3D.scene_add(people)
                skeleton_set = people

                # draw 2D
                frame = draw_poses(frame, poses_2d, scaled_img, use_popup)

            else:
                try:
                    engine3D.scene_remove(skeleton_set)
                    skeleton_set = None
                except Exception:
                    pass

            cv2.putText(
                frame,
                f"Inference time: {processing_time:.1f}ms ({fps:.1f} FPS)",
                (10, 30),
                cv2.FONT_HERSHEY_COMPLEX,
                0.7,
                (0, 0, 255),
                1,
                cv2.LINE_AA,
            )

            if use_popup:
                cv2.imshow(title, frame)
                key = cv2.waitKey(1)
                # escape = 27, use ESC to exit
                if key == 27:
                    break
            else:
                # encode numpy array to jpg
                imgbox.value = cv2.imencode(
                    ".jpg",
                    frame,
                    params=[cv2.IMWRITE_JPEG_QUALITY, 90],
                )[1].tobytes()

            engine3D.renderer.render(engine3D.scene, engine3D.cam)

    except KeyboardInterrupt:
        print("Interrupted")
    except RuntimeError as e:
        print(e)
    finally:
        clear_output()
        if player is not None:
            # stop capturing
            player.stop()
        if use_popup:
            cv2.destroyAllWindows()
        if skeleton_set:
            engine3D.scene_remove(skeleton_set)
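
The axis remapping in the middle of `run_pose_estimation` is easier to follow on a small array. The sketch below assumes each pose is a flat row of 19 joints times (x, y, z, score), which is what the `0::4` slicing and the final reshape suggest; `remap_for_display` is a hypothetical helper name.

```python
import numpy as np


def remap_for_display(poses_3d, num_joints=19):
    """Map model coordinates to scene coordinates and drop the score column."""
    poses = np.asarray(poses_3d, dtype=np.float32).reshape(len(poses_3d), num_joints, 4)
    x, y, z = poses[..., 0], poses[..., 1], poses[..., 2]
    # Same mapping as in the loop above: (x, y, z) -> (-z + 200, -y + 100, -x)
    return np.stack([-z + 200, -y + 100, -x], axis=-1)


# One person, all joints at the origin except the first one
poses = np.zeros((1, 19 * 4), dtype=np.float32)
poses[0, 0:4] = [1.0, 2.0, 3.0, 0.9]
display_poses = remap_for_display(poses)
```

The result has shape `(people, 19, 3)`, the form passed to the skeleton object in the loop.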

In [20]:
USE_WEBCAM = False

cam_id = 0
# Alternative sources; uncomment one to use it instead of the local file.
# video_path = "https://storage.openvinotoolkit.org/data/test_data/videos/face-demographics-walking.mp4"
# video_path = "https://github.com/tensorflow/tfjs-models/raw/master/pose-detection/assets/dance_input.gif"
video_path = "file://D:/repos/openvino/pose/data/1585619-sd_960_540_30fps.mp4"
source = cam_id if USE_WEBCAM else video_path

run_pose_estimation(source=source, flip=isinstance(source, int), use_popup=False, infer_request=infer_request)
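
The timing overlay inside `run_pose_estimation` averages recent inference times to report FPS. A compact variant of that bookkeeping is sketched below; it swaps the manual `popleft` for `collections.deque(maxlen=...)`, and the `rolling_fps` name is an assumption of this example.

```python
import collections


def rolling_fps(durations_s, window=200):
    """Yield the FPS estimated from the mean of the last `window` durations (seconds)."""
    recent = collections.deque(maxlen=window)  # old entries are dropped automatically
    for duration in durations_s:
        recent.append(duration)
        mean_ms = sum(recent) / len(recent) * 1000
        yield 1000 / mean_ms


# Three frames with a window of two: the first duration is forgotten on the third frame
fps_values = list(rolling_fps([0.05, 0.05, 0.1], window=2))
```

Averaging over a window smooths out per-frame jitter, so the number drawn on the frame changes gradually rather than flickering.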