# Static Model Test
This notebook tests the defined models using a Gradio interface where one can simply upload an image or video, choose a model and then get a classification of the performed exercise.

In [19]:
import gradio as gr
import os
import cv2
import numpy as np
import mediapipe as mp
import tensorflow as tf
from mediapipe.framework.formats import landmark_pb2

In [21]:
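# Disable oneDNN custom ops. TensorFlow reads this flag when it is loaded,
# so it should be set before `import tensorflow` (ideally in the first cell).
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"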

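# Build the path with os.path.join: "\s" in a plain string literal is an invalid escape sequence
MODEL_PATH = os.path.join("..", "skeleton_lstm_multiclass6_v3.h5")
# bench_press: 0, lat_machine: 1, pull_up: 2, push_up: 3, squat: 4, split_squat: 5
CLASSES = ["bench_press", "lat_machine", "pull_up", "push_up", "squat", "split_squat"]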

KEYPOINT_DIM = 132  # 33 landmarks with x,y,z,visibility

# ——— load trained model ———
model = tf.keras.models.load_model(MODEL_PATH)

In [22]:
# ——— init Mediapipe Pose & etc. ———
mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils
mp_styles = mp.solutions.drawing_styles

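pose = mp_pose.Pose(
    static_image_mode=True,        # run detection on every frame independently (no tracking between frames)
    model_complexity=1,
    enable_segmentation=False,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5    # ignored while static_image_mode=True
)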

In [23]:
# ——— helper to extract keypoints array from Mediapipe results ———
def extract_keypoints_from_results(results):
    # build list in the same order as training: (lm0.x, lm0.y, lm0.z, lm0.v, lm1.x, …)
    kpts = []
    for lm in results.pose_landmarks.landmark:
        kpts.extend([lm.x, lm.y, lm.z, lm.visibility])
    return np.array(kpts, dtype=np.float32)  # shape = (132,)

In [25]:
# Create Gradio interface for flipped version
demo_flipped = gr.Interface(
    fn=process_video_flipped,
    inputs=gr.Video(),
    outputs=[
        gr.Textbox(label="Predictions (Flipped)", lines=4),
        gr.Image(label="Flipped Pose Visualization")
    ],
    title="Exercise Classification (Flipped)",
    description="Upload a video to classify the exercise using flipped pose keypoints."
)

# Launch the interface
demo_flipped.launch()

* Running on local URL:  http://127.0.0.1:7866
* To create a public link, set `share=True` in `launch()`.




In [27]:
# ORIGINAL
def process_video(video):
    # Initialize video capture
    cap = cv2.VideoCapture(video)
    
    # Get total number of frames
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    
    # Calculate start frame for middle 30 frames
    max_frames = 30
    if total_frames <= max_frames:
        start_frame = 0
    else:
        start_frame = (total_frames - max_frames) // 2
    
    # Set the starting position
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    
    sequence = []
    frame_count = 0
    # Remember the last frame in which a pose was found, for the visualization
    last_pose_frame = None
    last_pose_landmarks = None
    
    while cap.isOpened() and frame_count < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
            
        # Convert frame to RGB for MediaPipe
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        
        # Get pose landmarks
        results = pose.process(frame_rgb)
        
        if results.pose_landmarks:
            # Extract keypoints
            keypoints = extract_keypoints_from_results(results)
            sequence.append(keypoints)
            frame_count += 1
            last_pose_frame = frame_rgb.copy()
            last_pose_landmarks = results.pose_landmarks
    
    cap.release()
    
    # Without a single detected pose there is nothing to classify
    if not sequence:
        blank = np.zeros((480, 640, 3), dtype=np.uint8)
        cv2.putText(blank, "No pose detected in video", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
        return "No pose detected in video", blank
    
    # If we don't have enough frames, pad with the last frame's keypoints
    while len(sequence) < max_frames:
        sequence.append(sequence[-1])
    
    # Convert sequence to numpy array and reshape for model input
    sequence = np.array(sequence).reshape(1, max_frames, KEYPOINT_DIM)
    
    # Get model predictions
    predictions = model.predict(sequence, verbose=0)[0]
    
    # Get top 3 predictions
    top_3_idx = np.argsort(predictions)[-3:][::-1]
    top_3_classes = [CLASSES[i] for i in top_3_idx]
    top_3_confidences = [float(predictions[i]) for i in top_3_idx]
    
    # Create prediction text
    prediction_text = "Top 3 Predictions:\n"
    for i in range(3):
        prediction_text += f"{i+1}. {top_3_classes[i]}: {top_3_confidences[i]:.2%}\n"
    
    # Visualize the last frame in which a pose was detected (already RGB, as Gradio expects)
    last_frame = last_pose_frame
    # Draw pose landmarks
    mp_drawing.draw_landmarks(
        last_frame,
        last_pose_landmarks,
        mp_pose.POSE_CONNECTIONS,
        landmark_drawing_spec=mp_styles.get_default_pose_landmarks_style()
    )
    
    # Add prediction text to frame
    y_position = 30
    for i in range(3):
        text = f"{top_3_classes[i]}: {top_3_confidences[i]:.2%}"
        cv2.putText(last_frame, text, (10, y_position), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        y_position += 40
    
    return prediction_text, last_frame

In [28]:
# Create Gradio interface
demo = gr.Interface(
    fn=process_video,
    inputs=gr.Video(),  # Changed from Image to Video
    outputs=[
        gr.Textbox(label="Predictions", lines=4),
        gr.Image(label="Last Frame with Pose")
    ],
    title="Exercise Classification (Video)",
    description="Upload a video to classify the exercise being performed."
)

# Launch the interface
demo.launch()

* Running on local URL:  http://127.0.0.1:7867
* To create a public link, set `share=True` in `launch()`.




## Results 16.06.2025

### Findings 
- **Point-of-View / Flipping Effect** <br> When the keypoints are flipped horizontally (i.e. mirrored left <-> right), the model changes its prediction for the same original video. This strongly suggests that the model relies on absolute x-positions in the frame (e.g. a left-side bias) rather than on the relative geometry of the joints.
- **Overfitting** <br> Because performance drops on flipped (but otherwise identical) inputs, the model is likely memorizing frame-specific patterns (camera offset, background context) rather than learning the invariant shape of the exercise.
- **Class imbalance** <br> The processed CSV files contain very different numbers of samples per class, which leaves low-sample classes vulnerable to misclassification:
    - bench_press: 387 elements
    - bulgarian_squat: 452 elements
    - lat_machine: 333 elements
    - pull_up: 167 elements
    - push_up: 231 elements
    - split_squat: 665 elements
- **Misclassification Bias** <br> In 3-class classification, push_up and pull_up are often misclassified as split_squat. This points to a bias towards the overrepresented class: under-represented classes commonly get “sucked into” an overrepresented neighbor.
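
One common way to counteract the imbalance listed above is to weight each class inversely to its frequency during training. The sketch below only computes such weights from the sample counts in the list. The helper name `balanced_class_weights` is hypothetical, and whether class weighting actually helps this particular model has not been tested here.

```python
# Sample counts per class, copied from the list above
counts = {
    "bench_press": 387,
    "bulgarian_squat": 452,
    "lat_machine": 333,
    "pull_up": 167,
    "push_up": 231,
    "split_squat": 665,
}


def balanced_class_weights(counts):
    """Hypothetical helper: weight = total / (n_classes * count), so rare classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


weights = balanced_class_weights(counts)

# Print classes from most to least up-weighted
for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {w:.2f}")
```

Keras accepts such weights through the `class_weight` argument of `model.fit`, as a dict keyed by integer class index. The mapping from class names to indices has to follow the label encoding used during training.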

### Mitigation Steps
- **Data Augmentation**
    - flip the input videos, or rotate them by a small angle
    - add each video twice: once unchanged and once mirrored horizontally (remember to swap left/right landmark indices, e.g. left_elbow <-> right_elbow, before training)
    - change the sequence length (e.g. 50 frames)
- **Expand Dataset**
    - More videos from different angles, lighting, and subjects will improve generalization.
    - Classes such as split_squat and bulgarian_squat may have to be dropped or replaced, as they are not covered by large public datasets.
- **Mediapipe** <br> Landmark coordinates are normalized, with the origin at the top left of the image; x and y are scaled to [0, 1] by the image width and height. One could consider some improvements:
    - **Center on a root joint** <br> Subtract the hip midpoint (or another stable landmark) from every x,y,z so that the model sees poses relative to the body, not the image
    - **Scale Normalization** <br> Divide distances by the length of the torso or the shoulder width so that “big” vs. “small” people don’t confuse the network
    - **Hand-crafted Features** <br> Calculate distances and angles between joints and add them as extra model inputs to give it more information
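
The mirroring augmentation and the two normalization steps above can be sketched with plain NumPy on sequences of shape `(frames, 132)`, i.e. the `(x, y, z, visibility)` layout used in this notebook. This is a minimal sketch, not the preprocessing the model was trained with: the function names `flip_horizontal` and `center_and_scale` are hypothetical, the landmark indices follow the standard MediaPipe Pose numbering, and shoulder width is just one possible choice of scale.

```python
import numpy as np

N_LANDMARKS = 33  # MediaPipe Pose landmarks, each stored as (x, y, z, visibility)

# Left/right landmark index pairs in MediaPipe Pose: eyes, ears, mouth, then limbs and feet
LR_PAIRS = [(1, 4), (2, 5), (3, 6), (7, 8), (9, 10)] + [(i, i + 1) for i in range(11, 32, 2)]
LEFT_SHOULDER, RIGHT_SHOULDER, LEFT_HIP, RIGHT_HIP = 11, 12, 23, 24


def flip_horizontal(seq):
    """Mirror a (frames, 132) sequence left <-> right and swap paired landmarks."""
    pts = seq.reshape(-1, N_LANDMARKS, 4).copy()
    pts[..., 0] = 1.0 - pts[..., 0]  # mirror the normalized x coordinate
    for left, right in LR_PAIRS:
        pts[:, [left, right]] = pts[:, [right, left]]  # e.g. left_elbow <-> right_elbow
    return pts.reshape(seq.shape[0], -1)


def center_and_scale(seq, eps=1e-6):
    """Express x, y, z relative to the hip midpoint, in units of shoulder width."""
    pts = seq.reshape(-1, N_LANDMARKS, 4).copy()
    xyz = pts[..., :3]
    hip_mid = (xyz[:, LEFT_HIP] + xyz[:, RIGHT_HIP]) / 2.0  # shape (frames, 3)
    shoulder_w = np.linalg.norm(xyz[:, LEFT_SHOULDER] - xyz[:, RIGHT_SHOULDER], axis=-1)
    # Visibility (last channel) is left untouched
    pts[..., :3] = (xyz - hip_mid[:, None, :]) / (shoulder_w[:, None, None] + eps)
    return pts.reshape(seq.shape[0], -1)


# Random stand-in for one 30-frame keypoint sequence
seq = np.random.default_rng(0).random((30, 132)).astype(np.float32)
augmented = center_and_scale(flip_horizontal(seq))
print(augmented.shape)  # (30, 132)
```

Shoulder width shrinks towards zero in side views, so torso length (shoulder midpoint to hip midpoint) may be the more stable scale for exercises that are filmed in profile. Any such normalization would have to be applied identically at training and inference time.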