In [1]:
import numpy as np
import pandas as pd
import cv2
import mediapipe as mp
from IPython.display import clear_output
import time


# Initial Research

MediaPipe has a z-axis output, which apparently estimates the pose in a 2 m³ box
where the origin is the center of the hips. As usual, the point of the algorithm is
to determine the _relative_ position of the key points, not their absolute
position or distance.

Nevertheless, some code needs to be written to estimate the proportions of each
subject. Assume that the person is orthogonal to the camera when their body
makes the largest "frame", since this is the most accurate 2D depiction of
their body. The question is then: how do you determine when the subject is
orthogonal to the camera?

The easiest approach is to figure out when the subject has the same z-axis
readout on both shoulders. This would at least mean that the torso is close to
being orthogonal to the camera. If we keep only the readouts where the two
shoulder z values are within some threshold of each other, those frames could
be used to re-create the model with the target ratios.
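
The shoulder check described above might look like the sketch below. The function name `shoulders_level_in_z` and the default threshold of 0.1 are placeholders, not tuned values; the inputs are assumed to be the `z` fields of the two shoulder landmarks as MediaPipe reports them.

```python
def shoulders_level_in_z(left_z: float, right_z: float, threshold: float = 0.1) -> bool:
    """Return True when both shoulders sit at roughly the same depth.

    threshold is a hypothetical tolerance and needs tuning on real data.
    """
    return abs(left_z - right_z) <= threshold


# Shoulder z values copied from the sample readout further down in this notebook.
print(shoulders_level_in_z(-0.29845184, -0.3842694))
```

A difference-based check like this only says the shoulders are level in depth; it does not say whether the subject is facing toward or away from the camera.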

# Some thoughts on UI
- Would be good if it had a text readout of what your body is "kinda like"
so it will give the user some assurance that the thing is working


# MediaPipe

Below is some code using MediaPipe to get the key metrics out of the pose estimation.

In [2]:
# Initialize stuff from mediapipe
mp_drawing = mp.solutions.drawing_utils
mp_pose = mp.solutions.pose


In [3]:
cap = cv2.VideoCapture(0)

with mp_pose.Pose(min_detection_confidence=0.5, min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            # Stop if the camera did not return a frame
            break

        # Recolor image
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False

        # Make detection
        results = pose.process(image)

        try:
            landmarks = results.pose_landmarks.landmark
            print(f"R_Sh:\n{landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER.value]}")
            print(f"L_Sh:\n{landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value]}")

            # Save values if all torso Z values are close to zero (below the threshold)
            # and all four torso landmarks are clearly visible
            if np.max(np.abs([landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER.value].z,
                              landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value].z,
                              landmarks[mp_pose.PoseLandmark.RIGHT_HIP.value].z,
                              landmarks[mp_pose.PoseLandmark.LEFT_HIP.value].z])) < 0.5 and\
                np.min([landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER.value].visibility,
                        landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER.value].visibility,
                        landmarks[mp_pose.PoseLandmark.RIGHT_HIP.value].visibility,
                        landmarks[mp_pose.PoseLandmark.LEFT_HIP.value].visibility]) > 0.9:
                print("Can See Torso")

                # TODO Write Framework here to save dataframe of correct torso values.
                # The Z value threshold in the above lines need tweaking as well.

        except AttributeError:
            # results.pose_landmarks is None when no pose is detected in the frame
            pass

        # Recolor image back to BGR
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        # Show detections
        mp_drawing.draw_landmarks(image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS,
                                  mp_drawing.DrawingSpec(color=(245,117,66), thickness=2, circle_radius=2),
                                  mp_drawing.DrawingSpec(color=(245,66,230), thickness=2, circle_radius=2),)

        cv2.imshow("Mediapipe feed", image)

        if cv2.waitKey(10) & 0xFF == ord("q"):
            break

        clear_output(wait=True)
    cap.release()
    cv2.destroyAllWindows()
    cv2.waitKey(1)


R_Sh:
x: 0.30530035
y: 0.79772305
z: -0.3842694
visibility: 0.99893445

L_Sh:
x: 0.7562555
y: 0.8035553
z: -0.29845184
visibility: 0.9988245



Another way to run MediaPipe is described here:
https://developers.google.com/mediapipe/solutions/vision/pose_landmarker/python#live-stream

The results and performance of each model (lite, full, heavy) are summarized here: https://storage.googleapis.com/mediapipe-assets/Model%20Card%20BlazePose%20GHUM%203D.pdf

**Issues with asynchronous processing:** Asynchronous readouts do not look suitable if we are drawing an overlay on the video frames, because the results arrive through a callback rather than together with the frame being displayed. Keep processing synchronous so that the overlay stays in step with the live feed. The main code differences in live stream mode are passing a timestamp in ms, calling `detect_async()` instead of `detect()`, and providing a result callback.

**From the oracle**
> If you're running things synchronously, especially for real-time or live feed scenarios with frameworks like MediaPipe, you typically process each frame one at a time in a loop. In such cases, you don't necessarily need to use a callback function. Instead, you can directly process each frame as you capture it, analyze it with the pose detection model, and immediately draw the keypoints or landmarks onto the frame before displaying it. This approach ensures minimal delay between capturing a frame, processing it, and displaying the results, making it suitable for real-time applications.
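
A synchronous loop of the kind described above could be sketched with the Pose Landmarker task as follows. This is an assumption-laden sketch, not the notebook's tested code: it swaps in the `VIDEO` running mode with `detect_for_video()`, draws plain circles instead of using `mp_drawing`, and the names `to_pixel` and `run_sync_overlay` are hypothetical.

```python
import time


def to_pixel(x_norm, y_norm, width, height):
    """Map normalized landmark coordinates to integer pixel coordinates."""
    return int(x_norm * width), int(y_norm * height)


def run_sync_overlay(model_path, camera_index=0):
    """Process the webcam feed frame by frame and draw landmarks right away."""
    # Imported here so to_pixel can be used without the camera stack.
    import cv2
    import mediapipe as mp

    options = mp.tasks.vision.PoseLandmarkerOptions(
        base_options=mp.tasks.BaseOptions(model_asset_path=model_path),
        running_mode=mp.tasks.vision.RunningMode.VIDEO)

    cap = cv2.VideoCapture(camera_index)
    start = time.monotonic()
    with mp.tasks.vision.PoseLandmarker.create_from_options(options) as landmarker:
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # MediaPipe expects RGB, OpenCV delivers BGR.
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)

            # Blocking call: the result belongs to exactly this frame.
            timestamp_ms = int((time.monotonic() - start) * 1000)
            result = landmarker.detect_for_video(mp_image, timestamp_ms)

            height, width = frame.shape[:2]
            for person in result.pose_landmarks:
                for lm in person:
                    cv2.circle(frame, to_pixel(lm.x, lm.y, width, height),
                               3, (0, 255, 0), -1)

            cv2.imshow("Mediapipe feed", frame)
            if cv2.waitKey(10) & 0xFF == ord("q"):
                break

    cap.release()
    cv2.destroyAllWindows()
```

Calling `run_sync_overlay(model_path)` with the path to a downloaded `.task` model would open the feed; no callback is needed because each result is returned directly by the call that processed the frame.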

In [3]:
"""
Below is an experiment with asynchronous processing, which is a bit more complex
than running a simple live feed. This method is better if we do not have any
overlay on the video, but ignore this part since it is not critical
to run things asynchronously. Code is kept in notebook for reference.
"""

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
import time

# Load model
model_path = "/Users/homemasaki/code/projects/fit_me/models/pose_landmarker_full.task"

# Drawing utility
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

# Create the task
BaseOptions = mp.tasks.BaseOptions
PoseLandmarker = mp.tasks.vision.PoseLandmarker
PoseLandmarkerOptions = mp.tasks.vision.PoseLandmarkerOptions
VisionRunningMode = mp.tasks.vision.RunningMode

# Create a pose landmarker instance with the live stream mode:

options = PoseLandmarkerOptions(
    base_options=BaseOptions(model_asset_path=model_path),
    running_mode=VisionRunningMode.LIVE_STREAM)

# Grab frame using OpenCV
cap = cv2.VideoCapture(0)

with PoseLandmarker.create_from_options(options) as landmarker:
    while cap.isOpened():

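        # Grab frame using OpenCV; stop if no frame is returned.
        ret, frame = cap.read()
        if not ret:
            break

        # Convert the BGR frame to RGB and wrap it in a MediaPipe Image object
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)

        # Live stream mode uses detect_async() with a timestamp in ms
        landmarker.detect_async(mp_image, int(time.time() * 1000))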

        #cv2.imshow("Mediapipe feed", frame)

        if cv2.waitKey(10) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
cv2.waitKey(1)


ValueError: The vision task is in live stream mode, a user-defined result callback must be provided.
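
The error above says that live stream mode needs a `result_callback` in the options. A minimal sketch of one way to supply it is shown below; `LatestResult` and `make_live_stream_options` are hypothetical names, and the lock is there on the assumption that the callback may be invoked from a different thread than the capture loop.

```python
import threading


class LatestResult:
    """Hold the most recent result delivered by the live stream callback."""

    def __init__(self):
        self._lock = threading.Lock()
        self._result = None
        self._timestamp_ms = None

    def callback(self, result, output_image, timestamp_ms):
        # Signature expected for result_callback: result, image, timestamp in ms.
        with self._lock:
            self._result = result
            self._timestamp_ms = timestamp_ms

    def get(self):
        """Return the last stored (result, timestamp_ms) pair."""
        with self._lock:
            return self._result, self._timestamp_ms


def make_live_stream_options(model_path, holder):
    """Build LIVE_STREAM options that include the required callback."""
    import mediapipe as mp  # imported here so LatestResult works on its own

    return mp.tasks.vision.PoseLandmarkerOptions(
        base_options=mp.tasks.BaseOptions(model_asset_path=model_path),
        running_mode=mp.tasks.vision.RunningMode.LIVE_STREAM,
        result_callback=holder.callback)


# Simulate one callback invocation with placeholder values.
holder = LatestResult()
holder.callback("fake result", None, 42)
print(holder.get())
```

With options built this way, the capture loop would call `landmarker.detect_async(mp_image, timestamp_ms)` and read `holder.get()` when it wants to draw. The stored result can lag behind the frame currently on screen, which is the overlay concern noted earlier.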