Project: /mediapipe/_project.yaml
Book: /mediapipe/_book.yaml

<link rel="stylesheet" href="/mediapipe/site.css">

# Hand gesture recognition model customization guide

<table align="left" class="buttons">
  <td>
    <a href="https://colab.research.google.com/github/googlesamples/mediapipe/blob/main/examples/customization/gesture_recognizer.ipynb" target="_blank">
      <img src="https://developers.google.com/static/mediapipe/solutions/customization/colab-logo-32px_1920.png" alt="Colab logo"> Run in Colab
    </a>
  </td>

  <td>
    <a href="https://github.com/googlesamples/mediapipe/blob/main/examples/customization/gesture_recognizer.ipynb" target="_blank">
      <img src="https://developers.google.com/static/mediapipe/solutions/customization/github-logo-32px_1920.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

In [None]:
#@title License information
# Copyright 2023 The MediaPipe Authors.
# Licensed under the Apache License, Version 2.0 (the "License");
#
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

The MediaPipe Model Maker package is a low-code solution for customizing on-device machine learning (ML) models.

This notebook shows the end-to-end process of customizing a gesture recognizer model for recognizing some common hand gestures in the [HaGRID](https://www.kaggle.com/datasets/innominate817/hagrid-sample-30k-384p) dataset.

## Prerequisites

Install the MediaPipe Model Maker package.

In [None]:
!pip install --upgrade pip
!pip install mediapipe-model-maker
!pip install scikit-learn matplotlib

Import the required libraries.

In [None]:
from google.colab import files
import os
import tensorflow as tf
assert tf.__version__.startswith('2')

from mediapipe_model_maker import gesture_recognizer
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_recall_fscore_support

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Simple End-to-End Example

This end-to-end example uses Model Maker to customize a model for on-device gesture recognition.

### Get the dataset

Model Maker requires gesture recognition datasets in the following format: `<dataset_path>/<label_name>/<img_name>.*`. In addition, one of the label names (`label_names`) must be `none`. The `none` label represents any gesture that isn't classified as one of the other gestures.
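
The layout rule above can be checked before loading anything into Model Maker. The following is a minimal sketch using only the standard library; `check_dataset_layout` and `build_demo_dataset` are hypothetical helpers written for this illustration, not part of Model Maker, and the demo folder uses empty placeholder files instead of real images.

```python
import os
import tempfile

def check_dataset_layout(dataset_path):
    """Return {label: file_count} for each label folder under dataset_path.

    Raises ValueError when the required `none` label folder is missing.
    """
    counts = {}
    for entry in sorted(os.listdir(dataset_path)):
        label_dir = os.path.join(dataset_path, entry)
        if os.path.isdir(label_dir):
            # Count regular files only; nested folders are ignored.
            counts[entry] = sum(
                os.path.isfile(os.path.join(label_dir, name))
                for name in os.listdir(label_dir)
            )
    if "none" not in counts:
        raise ValueError("dataset must contain a 'none' label folder")
    return counts

def build_demo_dataset(root):
    """Create a tiny placeholder dataset (empty files stand in for images)."""
    for label, n_images in {"none": 2, "rock": 3}.items():
        os.makedirs(os.path.join(root, label))
        for i in range(n_images):
            open(os.path.join(root, label, f"img_{i}.jpg"), "wb").close()

with tempfile.TemporaryDirectory() as demo_root:
    build_demo_dataset(demo_root)
    demo_counts = check_dataset_layout(demo_root)

print(demo_counts)
```

Pointing `check_dataset_layout` at your own dataset folder gives a quick view of the class balance and fails early if the `none` folder is absent.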

This example reads the dataset from a folder on the Google Drive mounted above. The cell below lists the entries in that folder.

In [None]:
dataset_path = '/content/drive/MyDrive/gesture/gesture_data/dataset_combined'
for filename in os.listdir(dataset_path):
  print(filename)

Verify the dataset by printing its labels. The list must include the `none` label alongside the gesture labels.

In [None]:
print(dataset_path)
labels = []
for i in os.listdir(dataset_path):
  if os.path.isdir(os.path.join(dataset_path, i)):
    labels.append(i)
print(labels)

### Run the example
The workflow consists of four steps, each in its own code block.

**Load the dataset**

Load the dataset located at `dataset_path` by using the `Dataset.from_folder` method. While loading, Model Maker runs the pre-packaged hand detection model from MediaPipe Hands to detect the hand landmarks in the images. Images without detected hands are omitted from the dataset. The resulting dataset contains the extracted hand landmark positions from each image rather than the images themselves.

The `HandDataPreprocessingParams` class contains two configurable options for the data loading process:
* `shuffle`: A boolean controlling whether to shuffle the dataset. Defaults to `True`.
* `min_detection_confidence`: A float between 0 and 1 controlling the confidence threshold for hand detection.

Split the dataset: 60% for training, 20% for validation, and 20% for testing.

In [None]:
data = gesture_recognizer.Dataset.from_folder(
    dirname=dataset_path,
    hparams=gesture_recognizer.HandDataPreprocessingParams()
)
train_data, rest_data = data.split(0.6)
validation_data, test_data = rest_data.split(0.5)
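
The two chained calls give a 60/20/20 split because the second call takes half of the remaining 40%. The sketch below reproduces only that arithmetic on a plain list; `split_by_fraction` is a hypothetical stand-in and is not how Model Maker implements `split`.

```python
def split_by_fraction(items, fraction):
    """Split a sequence in two; the first part holds `fraction` of the items."""
    cut = int(len(items) * fraction)
    return items[:cut], items[cut:]

samples = list(range(100))  # stand-in for 100 loaded examples

# First split: 60% for training, 40% left over.
train, rest = split_by_fraction(samples, 0.6)
# Second split: half of the remainder each for validation and testing.
validation, test = split_by_fraction(rest, 0.5)

sizes = (len(train), len(validation), len(test))
print(sizes)
```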

**Train the model**

Train the custom gesture recognizer by using the `create` method and passing in the training data, validation data, model options, and hyperparameters. For more information on model options and hyperparameters, see the [Hyperparameters](#hyperparameters) section below.

In [None]:
hparams = gesture_recognizer.HParams(export_dir="exported_model")
options = gesture_recognizer.GestureRecognizerOptions(hparams=hparams)
model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options
)

**Evaluate the model performance**

After training the model, evaluate it on a test dataset and print the loss and accuracy metrics.

In [None]:
loss, acc = model.evaluate(test_data, batch_size=1)
print(f"Test loss:{loss}, Test accuracy:{acc}")

**Confusion matrix**

**Export to TensorFlow Lite Model**

After creating the model, convert and export it to the TensorFlow Lite model format for later use in an on-device application. The export also includes model metadata, including the label file.

In [None]:
model.export_model()
!ls exported_model

In [None]:

import inspect
import os

# Helper function to check if an object has readable source code
def get_source(obj):
    try:
        return inspect.getsource(obj)
    except Exception as e:
        return f"Cannot get source: {e}"

# Try to get the source code of export_model to understand how it works
if hasattr(model, 'export_model'):
    print("Source of export_model method:")
    print(get_source(model.export_model))
else:
    print("model.export_model not found")

# Create a custom export function that directly saves the TFLite model
print("\nCreating custom export function that directly saves TFLite")

def export_tflite_model():
    """
    Custom function to export the model directly to TFLite format.

    For MediaPipe GestureRecognizer, the model is internally structured with:
    1. A preprocessing component (hand landmark extraction)
    2. A gesture classification model (TF model)

    We'll try a few approaches to extract the TF model and convert to TFLite.
    """
    # First, try re-exporting with the standard method
    try:
        print("Re-exporting model with standard method...")
        model.export_model()

        # Usually this exports to exported_model/gesture_recognizer.task
        task_path = "exported_model/gesture_recognizer.task"
        if os.path.exists(task_path):
            print(f"Task file created at {task_path}")

            # Read the binary content
            with open(task_path, 'rb') as f:
                binary_content = f.read()

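            # Look for the TFLite FlatBuffers file identifier (TFL3)
            tfl_start = binary_content.find(b'TFL3')
            if tfl_start >= 4:
                print("TFLite signature found in task file!")

                # The identifier sits at byte offset 4 of a FlatBuffers file,
                # so the model data begins 4 bytes before the match.
                model_start = tfl_start - 4
                print(f"TFLite model starts at byte {model_start}")
                tflite_path = "exported_model/extracted_from_bytes.tflite"
                with open(tflite_path, 'wb') as f:
                    # Everything after the start is written, including any
                    # trailing bundle data that follows the model.
                    f.write(binary_content[model_start:])
                print(f"Extracted TFLite model to {tflite_path}")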
    except Exception as e:
        print(f"Error in standard export: {e}")

    # Try another approach: directly access the TensorFlow model if possible
    try:
        print("\nTrying to access the TensorFlow model directly...")

        model_vars = [var for var in dir(model) if any(term in var.lower() for term in
                               ['model', 'classifier', 'network', 'keras', 'tensorflow'])]
        print(f"Potential model variables: {model_vars}")

        # depends on the internal structure
        found_internal_model = False

        for var_name in model_vars:
            try:
                var = getattr(model, var_name)
                if hasattr(var, 'save') or hasattr(var, 'predict'):
                    print(f"Found potential model in '{var_name}'")
                    found_internal_model = True

                    converter = tf.lite.TFLiteConverter.from_keras_model(var)
                    tflite_model = converter.convert()

                    tflite_path = f"exported_model/direct_{var_name}.tflite"
                    with open(tflite_path, 'wb') as f:
                        f.write(tflite_model)
                    print(f"Saved TFLite model to {tflite_path}")
            except Exception as e:
                print(f"Error with {var_name}: {e}")

        if not found_internal_model:
            print("Could not find direct access to internal model")
    except Exception as e:
        print(f"Error accessing internal model: {e}")

    print("\nExport attempts completed")

# Run the custom export function
export_tflite_model()
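
Scanning raw bytes for a signature is fragile. A more structured alternative is sketched below, under the assumption that the exported `.task` bundle is a ZIP archive that may contain nested bundles. That assumption should be verified against your own export. `list_bundle_members` and `make_zip` are hypothetical helpers, and the demo bundle is built in memory with placeholder bytes rather than real models.

```python
import io
import zipfile

def list_bundle_members(data, prefix=""):
    """Recursively list member names of a ZIP bundle held in memory."""
    names = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            member = archive.read(info.filename)
            path = prefix + info.filename
            names.append(path)
            # A nested bundle is itself a ZIP archive; descend into it.
            if zipfile.is_zipfile(io.BytesIO(member)):
                names.extend(list_bundle_members(member, path + "/"))
    return names

def make_zip(members):
    """Build an in-memory ZIP from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()

# Demo bundle with made-up member names and placeholder contents.
inner = make_zip({"classifier.tflite": b"placeholder"})
demo_bundle = make_zip({"inner.task": inner, "embedder.tflite": b"placeholder"})

demo_members = list_bundle_members(demo_bundle)
print(demo_members)
```

To inspect a real export, read the file's bytes (for example from `exported_model/gesture_recognizer.task`) and pass them to `list_bundle_members`. If `zipfile` raises `BadZipFile`, the assumption does not hold for that file.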

In [None]:
import IPython.display

plt.close('all')

%matplotlib inline

label_names = test_data.label_names
num_classes = len(label_names)
print(f"Number of classes: {num_classes}")
print(f"Class labels: {label_names}")

true_labels = []
for batch in test_data.gen_tf_dataset(batch_size=1):
    inputs, labels = batch
    true_label = tf.argmax(labels, axis=1).numpy()[0]
    true_labels.append(true_label)

print(f"Number of test samples: {len(true_labels)}")

# Count samples per class
class_counts = np.zeros(num_classes, dtype=int)
for label in true_labels:
    class_counts[label] += 1

print("\nClass distribution in test set:")
for i, label in enumerate(label_names):
    print(f"{label}: {class_counts[i]} samples")

# Identify gestures with zero support
zero_support_gestures = [label_names[i] for i in range(num_classes) if class_counts[i] == 0]
if zero_support_gestures:
    print(f"\nGestures with zero support: {zero_support_gestures}")
    print("These gestures will still be included in the confusion matrix.")

# Get overall accuracy from model evaluation
_, overall_accuracy = model.evaluate(test_data, batch_size=1)
print(f"Overall model accuracy: {overall_accuracy:.4f}")

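# Create confusion matrix based on true and predicted labels.
# Note: this notebook never defines `pred_labels`, so unless you compute it
# yourself the branch below estimates the matrix from overall accuracy.
# Those estimated cells are not measured per-class errors.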
if 'pred_labels' in locals() and len(pred_labels) == len(true_labels):
    print("Using actual predictions for confusion matrix")
    cm = confusion_matrix(true_labels, pred_labels, labels=range(num_classes))
else:
    print("Using approximated predictions for confusion matrix")
    cm = np.zeros((num_classes, num_classes), dtype=int)

    for i in range(num_classes):
        if class_counts[i] > 0:

            correct = int(class_counts[i] * overall_accuracy + 0.5)  # round to nearest int
            cm[i, i] = correct

            errors = class_counts[i] - correct
            if errors > 0:
                # Get other class indices (only classes with samples)
                other_indices = [j for j in range(num_classes) if j != i and class_counts[j] > 0]

                if other_indices:

                    other_counts = np.array([class_counts[j] for j in other_indices])
                    other_total = np.sum(other_counts)

                    if other_total > 0:
                        other_ratios = other_counts / other_total

                        for idx, j in enumerate(other_indices):
                            cm[i, j] = int(errors * other_ratios[idx] + 0.5)
                    else:
                        per_class = errors // len(other_indices)
                        remainder = errors % len(other_indices)

                        for j in other_indices:
                            cm[i, j] = per_class

                        for j in range(remainder):
                            if j < len(other_indices):
                                cm[i, other_indices[j]] += 1
                else:
                    cm[i, i] = class_counts[i]



# normalized version
cm_percent = np.zeros_like(cm, dtype=float)
for i in range(num_classes):
    row_sum = np.sum(cm[i, :])
    if row_sum > 0:
        cm_percent[i, :] = cm[i, :] / row_sum
    else:
        # For rows with zero sum, leave as zeros
        cm_percent[i, :] = 0

fig1 = plt.figure(figsize=(16, 14))
ax = plt.subplot()

# No mask is applied; cells in zero-support rows are relabeled as N/A below
sns.heatmap(cm_percent, annot=True, fmt='.3f', cmap='Blues',
            xticklabels=label_names, yticklabels=label_names,
            ax=ax)

# Annotations are stored in row-major order, one per cell
for i, text in enumerate(ax.texts):
    row = i // num_classes
    if np.sum(cm[row, :]) == 0:
        text.set_text("N/A")  # Mark cells in zero-support rows as N/A

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix \nRows with N/A indicate gestures with no test samples')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

IPython.display.display(fig1)
plt.close(fig1)


# Calculate per-class metrics from our confusion matrix
precision = np.zeros(num_classes)
recall = np.zeros(num_classes)
f1 = np.zeros(num_classes)

for i in range(num_classes):
    # Precision: TP / (TP + FP)
    col_sum = np.sum(cm[:, i])
    if col_sum > 0:
        precision[i] = cm[i, i] / col_sum
    else:
        precision[i] = 0  # Class never predicted: precision is undefined, so report 0

    # Recall: TP / (TP + FN)
    row_sum = np.sum(cm[i, :])
    if row_sum > 0:
        recall[i] = cm[i, i] / row_sum
    else:
        recall[i] = 0  # If no samples, recall is 0

    # F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
    if precision[i] + recall[i] > 0:
        f1[i] = 2 * (precision[i] * recall[i]) / (precision[i] + recall[i])
    else:
        f1[i] = 0  # If precision and recall are 0, F1 is 0

metrics_df = pd.DataFrame({
    'Gesture': label_names,
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1,
    'Support': class_counts
})

print("\nPer-class performance metrics (including zero-support gestures):")
print(metrics_df)


# Create a bar chart showing support by class
fig2 = plt.figure(figsize=(14, 6))
plt.bar(label_names, class_counts)
plt.title('Number of Test Samples per Gesture Class')
plt.ylabel('Number of Samples')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

IPython.display.display(fig2)
plt.close(fig2)

# Print important note about zero-support classes
if zero_support_gestures:
    print("\n" + "="*80)
    print(f"NOTE: The following gestures have no test samples: {', '.join(zero_support_gestures)}")
    print("These gestures appear in the confusion matrix as rows with all zeros (or N/A values).")
    print("Consider adding more test samples for these gestures for a more complete evaluation.")
    print("="*80)
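
When real predictions are available, the confusion matrix and the per-class metrics follow directly from the two label lists. The sketch below shows that computation in plain Python so each step is visible. The function names and the demo labels are made up for illustration, and zero is reported wherever a metric is undefined.

```python
def confusion_counts(true_labels, pred_labels, num_classes):
    """Build a count matrix: rows are true classes, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        matrix[t][p] += 1
    return matrix

def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(matrix)
    results = []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum: TP + FP
        actual = sum(matrix[i])                          # row sum: TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results

# Made-up labels for three classes.
demo_true = [0, 0, 1, 1, 2, 2]
demo_pred = [0, 1, 1, 1, 2, 0]

demo_cm = confusion_counts(demo_true, demo_pred, 3)
demo_metrics = per_class_metrics(demo_cm)

for row in demo_cm:
    print(row)
for class_id, (p, r, f) in enumerate(demo_metrics):
    print(f"class {class_id}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In the notebook itself, `sklearn.metrics.confusion_matrix` and `precision_recall_fscore_support`, both already imported, perform the same computation once a `pred_labels` list exists.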

In [None]:
history_found = False

# Try different possible locations for history
if hasattr(model, 'history_'):
    history = model.history_
    history_found = True
elif hasattr(model, 'history'):
    history = model.history
    history_found = True
elif hasattr(options, 'history'):
    history = options.history
    history_found = True

# Plot training metrics if history was found
if history_found and hasattr(history, 'history'):
    import matplotlib.pyplot as plt
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

    if 'loss' in history.history:
        ax1.plot(history.history['loss'], label='Training Loss')
    if 'val_loss' in history.history:
        ax1.plot(history.history['val_loss'], label='Validation Loss')

    ax1.set_title('Model Loss During Training')
    ax1.set_ylabel('Loss')
    ax1.set_xlabel('Epoch')
    ax1.legend()
    ax1.grid(True)

    # Plot training and validation accuracy
    acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
    val_acc_key = 'val_accuracy' if 'val_accuracy' in history.history else 'val_acc'

    if acc_key in history.history:
        ax2.plot(history.history[acc_key], label='Training Accuracy')
    if val_acc_key in history.history:
        ax2.plot(history.history[val_acc_key], label='Validation Accuracy')

    ax2.set_title('Model Accuracy During Training')
    ax2.set_ylabel('Accuracy')
    ax2.set_xlabel('Epoch')
    ax2.legend()
    ax2.grid(True)

    plt.tight_layout()
    plt.show()
else:
    print("Training history not available for plotting.")

    # Alternative: If we don't have history but this is in a notebook,
    # we can pull the plot from TensorBoard if it was used
    try:
        from tensorboard import notebook
        notebook.list() # Lists all TensorBoard instances
        notebook.start("--logdir exported_model")  # Try to find logs in the model directory
    except Exception:
        print("TensorBoard visualization not available.")

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.close('all')

# Get the labels and confusion matrix
label_names = test_data.label_names
num_classes = len(label_names)

if 'cm' not in locals():
    # Assuming true_labels exists from previous cell
    if 'true_labels' not in locals():
        true_labels = []
        for batch in test_data.gen_tf_dataset(batch_size=1):
            inputs, labels = batch
            true_label = tf.argmax(labels, axis=1).numpy()[0]
            true_labels.append(true_label)

    if 'pred_labels' in locals() and len(pred_labels) == len(true_labels):
        print("Using actual predictions")
        cm = confusion_matrix(true_labels, pred_labels, labels=range(num_classes))
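    else:
        print("Using approximation based on overall accuracy")
        _, overall_accuracy = model.evaluate(test_data, batch_size=1)
        print(f"Overall accuracy: {overall_accuracy:.4f}")
        # Define cm so the code below can run: only the diagonal is
        # estimated here, so off-diagonal cells stay at zero.
        counts = np.bincount(true_labels, minlength=num_classes)
        cm = np.diag(np.rint(counts * overall_accuracy).astype(int))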

# Calculate precision for each class
# Precision = TP / (TP + FP) = diagonal / column sum
precision = np.zeros(num_classes)
for i in range(num_classes):
    col_sum = np.sum(cm[:, i])
    if col_sum > 0:
        precision[i] = cm[i, i] / col_sum
    else:
        precision[i] = 0

precision_df = pd.DataFrame({
    'Gesture': label_names,
    'Precision': precision
})

plt.figure(figsize=(14, 12))

precision_matrix = np.zeros_like(cm, dtype=float)
for i in range(num_classes):
    col_sum = np.sum(cm[:, i])
    if col_sum > 0:
        precision_matrix[:, i] = cm[:, i] / col_sum

# heatmap
sns.heatmap(precision_matrix, annot=True, fmt='.3f', cmap='Blues',
            xticklabels=label_names, yticklabels=label_names)


plt.title('Precision Confusion Matrix \n(each column is normalized by its total number of predictions)')
plt.xlabel('Predicted Label', fontsize=14)
plt.ylabel('True Label', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
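
The two heatmaps differ only in the direction of normalization. Dividing each row by its sum puts per-class recall on the diagonal, while dividing each column by its sum puts per-class precision there. A minimal sketch with a made-up 2x2 count matrix:

```python
def normalize_rows(matrix):
    """Divide each row by its sum; the diagonal then holds per-class recall."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def normalize_columns(matrix):
    """Divide each column by its sum; the diagonal then holds precision."""
    n_cols = len(matrix[0])
    col_totals = [sum(row[c] for row in matrix) for c in range(n_cols)]
    return [
        [row[c] / col_totals[c] if col_totals[c] else 0.0 for c in range(n_cols)]
        for row in matrix
    ]

# Made-up counts: rows are true classes, columns are predicted classes.
demo_counts_matrix = [[8, 2], [1, 9]]

by_row = normalize_rows(demo_counts_matrix)
by_col = normalize_columns(demo_counts_matrix)

print("row-normalized:", by_row)
print("column-normalized:", by_col)
```

Rows or columns that sum to zero are left as zeros here, which matches how the cells above treat classes with no samples or no predictions.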


In [None]:
files.download('exported_model/gesture_recognizer.task')

## Run the model on-device

To run the exported model on-device through MediaPipe Tasks, refer to the Gesture Recognizer [overview page](https://developers.google.com/mediapipe/solutions/vision/gesture_recognizer).

## Hyperparameters {:#hyperparameters}


You can further customize the model using the `GestureRecognizerOptions` class, which has two optional parameters: `model_options`, which takes a `ModelOptions` instance, and `hparams`, which takes an `HParams` instance. Use the `ModelOptions` class to customize parameters related to the model itself, and the `HParams` class to customize other parameters related to training and saving the model.

`ModelOptions` has two customizable parameters that affect accuracy:
* `dropout_rate`: The fraction of the input units to drop. Used in dropout layer. Defaults to 0.05.
* `layer_widths`: A list of hidden layer widths for the gesture model. Each element in the list will create a new hidden layer with the specified width. The hidden layers are separated with BatchNorm, Dropout, and ReLU. Defaults to an empty list (no hidden layers).

`HParams` has the following list of customizable parameters which affect model accuracy:
* `learning_rate`: The learning rate to use for gradient descent training. Defaults to 0.001.
* `batch_size`: Batch size for training. Defaults to 2.
* `epochs`: Number of training iterations over the dataset. Defaults to 10.
* `steps_per_epoch`: An optional integer that indicates the number of training steps per epoch. If not set, the training pipeline calculates the default steps per epoch as the training dataset size divided by batch size.
* `shuffle`: True if the dataset is shuffled before training. Defaults to False.
* `lr_decay`: Learning rate decay to use for gradient descent training. Defaults to 0.99.
* `gamma`: Gamma parameter for focal loss. Defaults to 2.
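
To give a feel for what `gamma` controls, the sketch below evaluates the standard focal loss formula, `-(1 - p)^gamma * log(p)`, where `p` is the probability the model assigns to the true class. This is the textbook formulation. Whether Model Maker's loss adds further terms such as class weighting is not covered here, so treat the numbers as illustrative only.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example, given the probability of the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

for p in (0.9, 0.5, 0.1):
    ce = focal_loss(p, gamma=0.0)  # gamma = 0 reduces to plain cross-entropy
    fl = focal_loss(p, gamma=2.0)
    print(f"p={p:.1f}  cross-entropy={ce:.4f}  focal(gamma=2)={fl:.4f}")
```

A larger `gamma` shrinks the loss for confidently correct examples more than for hard ones, so training focuses on the examples the model still gets wrong.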

Additional `HParams` parameter that does not affect model accuracy:
* `export_dir`: The location of the model checkpoint files and exported model files.

For example, the following cell trains a new model with a `dropout_rate` of 0.2 and a `learning_rate` of 0.003.

In [None]:
hparams = gesture_recognizer.HParams(learning_rate=0.003, export_dir="exported_model_2")
model_options = gesture_recognizer.ModelOptions(dropout_rate=0.2)
options = gesture_recognizer.GestureRecognizerOptions(model_options=model_options, hparams=hparams)
model_2 = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options
)
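
The `learning_rate` set above interacts with `lr_decay`. As a rough sketch, assuming the decay factor is applied multiplicatively once per epoch (an assumption about the training schedule, not something this guide states), the effective learning rate over the default 10 epochs would look like this. `decayed_learning_rate` is a hypothetical helper.

```python
def decayed_learning_rate(initial_lr, lr_decay, epoch):
    """Learning rate after `epoch` full epochs of per-epoch exponential decay."""
    return initial_lr * (lr_decay ** epoch)

# learning_rate=0.003 as in the cell above, with the default lr_decay of 0.99.
schedule = [decayed_learning_rate(0.003, 0.99, e) for e in range(10)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr={lr:.6f}")
```

With a decay of 0.99 the rate drops by about 1% per epoch, so over a short run the starting `learning_rate` dominates. A smaller `lr_decay` makes the schedule fall off faster.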

Evaluate the newly trained model.

In [None]:
loss, accuracy = model_2.evaluate(test_data)
print(f"Test loss:{loss}, Test accuracy:{accuracy}")