# HowTo: MediaPipe Model Training Jupyter Notebook for Custom Model HGR🖐️ with Gesture Recognition Task 

---

Author: [*jk4e*](https://github.com/jk4e)  
Date: *Sep. 2024*  
File: *mp_model_maker.ipynb*

---

This Jupyter notebook is designed for local training of a custom hand gesture recognition model. It demonstrates the complete end-to-end process of customizing a gesture recognition model to identify hand gestures using the provided example dataset. This notebook serves as an additional resource to the official Google Colab notebook and guide from the MediaPipe documentation.

### Why Train Locally Instead of Using Google Colab?

Through my own testing, I found that training on Google Colab is generally 1) unnecessary, and 2) not faster than training locally. This is because the vector embedding of hand keypoints takes ten times as long as the actual model training, or even longer, and the embedding step cannot be accelerated with a GPU.

If you have a decent CPU with more than 4 cores and enough RAM, local training is likely the better option. Free Google Colab provides only relatively low-performance hardware (2 CPU cores), so the local approach is often faster. Moreover, you don't necessarily need an NVIDIA GPU for neural network training. The model architecture, the small number of images required for training (around 100+ per class), and the low number of epochs can easily be handled by a CPU.

In addition to this notebook, you'll find other helpful resources to assist you in getting started.

Helpful resources:
- Documentation - Official Guide for Gesture Recognition Task: https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer
- Setup - Setup guide for Python: https://ai.google.dev/edge/mediapipe/solutions/setup_python
- Training - Hand gesture recognition model customization guide: https://ai.google.dev/edge/mediapipe/solutions/customization/gesture_recognizer
- Inference - Gesture recognition guide for Python: https://ai.google.dev/edge/mediapipe/solutions/vision/gesture_recognizer/python


- A Guide for Beginners: https://blog.roboflow.com/what-is-mediapipe/
- GitHub: https://github.com/google-ai-edge/mediapipe
- API Reference: https://ai.google.dev/edge/api

Google Colab / Jupyter Notebooks:
- MP HGR Walkthrough: https://github.com/google-ai-edge/mediapipe-samples/blob/main/tutorials/gesture_recognizer/WiMLS_2022_MediaPipe_Gesture_Recognizer_Walkthrough.ipynb
- Inference: https://colab.research.google.com/github/googlesamples/mediapipe/blob/main/examples/gesture_recognizer/python/gesture_recognizer.ipynb
- Customization: https://colab.research.google.com/github/googlesamples/mediapipe/blob/main/examples/customization/gesture_recognizer.ipynb


## Prerequisites and Installation

Before training with this notebook, you'll need to install the necessary Python packages.

1. Clone the repository to your local machine.
2. Create a virtual environment and activate it, e.g., using conda.
3. Navigate to the cloned repository's root directory.
4. Install the required packages using pip:
    ```bash
    pip install -r requirements.txt
    ```

**Note:** Currently, the [MediaPipe Model Maker package](https://pypi.org/project/mediapipe-model-maker/) can only be installed on Linux. If you're using Windows or a Mac with Apple silicon, the installation will fail with an error.

For Windows users, the workaround is to install WSL2 (Windows Subsystem for Linux). Once inside WSL2, you can install the package. If you prefer not to use WSL2 or set up a Linux environment, your only alternative is to use Google Colab.
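
Before running `pip install`, it can help to confirm which platform the current Python interpreter reports and whether the package is already importable. The following is a minimal sketch that uses only the standard library; the helper name `model_maker_status` is hypothetical and not part of MediaPipe.

```python
import importlib.util
import platform


def model_maker_status():
    """Return (operating system name, whether mediapipe_model_maker is importable)."""
    system = platform.system()  # e.g. "Linux", "Windows", or "Darwin" (macOS)
    installed = importlib.util.find_spec("mediapipe_model_maker") is not None
    return system, installed


system, installed = model_maker_status()
print(f"Platform: {system}, mediapipe_model_maker importable: {installed}")
```

When this is run inside WSL2, the reported platform should be `Linux`, because the interpreter runs on the Linux side.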

## Import the required Python packages

**Note:** You can typically ignore any warnings printed while these packages are imported.

In [None]:
import os
import gc
import tensorflow as tf
assert tf.__version__.startswith("2")

from mediapipe_model_maker import gesture_recognizer

import matplotlib.pyplot as plt


**Cleaning Up with Garbage Collector**: To free up resources, especially when running the notebook multiple times, it's good practice to manually clean up the RAM using Python's garbage collector. This can help avoid memory issues and ensure smoother performance.

**Tip**: It's also a good practice to restart the notebook kernel occasionally to completely reset the environment and clear any lingering resources.

In [None]:
gc.collect()

**INFO**: Print the current working directory.

In [None]:
print(os.getcwd())

## Get the Dataset

This notebook demonstrates the complete process of customizing a gesture recognition model for identifying hand gestures using the Rock Paper Scissors (RPS) dataset. For the gesture recognition model in Model Maker, the dataset must follow this directory structure: 

`<dataset_path>/<label_name>/<img_name>.*`

Additionally, one of the label names (`label_name`) must be `none`, which represents any gesture that isn't classified as one of the specific gestures in your dataset.

**TODO**: Update the file path to your dataset folder if you are training a custom model on your own dataset.

You have two options:
1. Download and use the example Rock Paper Scissors dataset.
2. Set your own custom dataset path.

**Note**: Ensure that your dataset follows the required subfolder structure and includes a `none` folder, even if it's empty. While you can train with an empty `none` folder, it must be present in the directory structure; otherwise, you will encounter an error. 

It’s also important to note that omitting the `none` label from your dataset is not recommended, as it can negatively impact your model's accuracy and ability to generalize to unseen gestures.
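
The layout rules above can be checked before the (slow) loading step. This is a minimal sketch under the assumption that every subfolder of the dataset directory is a label; the helper name `check_dataset_layout` is hypothetical and not part of Model Maker.

```python
import os


def check_dataset_layout(dataset_path):
    """Return {label: file count} and fail early if the `none` folder is missing."""
    counts = {}
    for entry in sorted(os.listdir(dataset_path)):
        label_dir = os.path.join(dataset_path, entry)
        if os.path.isdir(label_dir):
            # Count regular files only; nested folders are ignored.
            counts[entry] = sum(
                1
                for name in os.listdir(label_dir)
                if os.path.isfile(os.path.join(label_dir, name))
            )
    if "none" not in counts:
        raise ValueError(f"'{dataset_path}' has no 'none' label folder")
    return counts
```

Calling `check_dataset_layout(dataset_path)` once the dataset path is set returns the number of files per label, which also makes an empty or very small `none` folder easy to spot.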


**Option**: Train with the example dataset (if you train on your own dataset, ignore this cell)

In [None]:
import zipfile
import os

!wget https://storage.googleapis.com/mediapipe-tasks/gesture_recognizer/rps_data_sample.zip

# Path to the downloaded zip file and the folder it unpacks to
zip_file_path = "rps_data_sample.zip"
dataset_path = "rps_data_sample"

# Unzip into the current working directory; the archive contains the
# top-level folder "rps_data_sample"
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(".")

# Verify the contents
for root, dirs, files in os.walk(dataset_path):
    for file in files:
        print(os.path.join(root, file))



**Option**: Train with your own dataset (set the path to the dataset folder)

In [5]:
dataset_path = "rps_data_sample"  # change the dataset path


**INFO**: Print the dataset path and the data labels to verify that everything is set up correctly. For the RPS dataset there should be four gesture labels, one of them being the `none` gesture.

You should get the following cell output when training with the example dataset:

```
rps_data_sample
 Labels: ['scissors', 'rock', 'none', 'paper'] 
```

In [None]:
print(dataset_path)
labels = [
    i
    for i in os.listdir(dataset_path)
    if os.path.isdir(os.path.join(dataset_path, i))
]
print(f"\033[32m Labels: {labels} \033[0m")

## Show Examples of Each Label from the Dataset

**INFO:** This section plots a subset of images for each class in the dataset. If your `none` folder is empty, the plots for that label will be blank. Viewing these plots can help you better understand the content and quality of the images in your dataset.

The function displays a few example images for each gesture. You can easily adjust the number of examples shown per plot by modifying the `NUM_EXAMPLES = 5` parameter.

In [None]:
NUM_EXAMPLES = 5

%matplotlib inline

# Show the images.
for label in labels:
    label_dir = os.path.join(dataset_path, label)
    # Sort for a stable selection; a label folder may hold fewer images than requested.
    example_filenames = sorted(os.listdir(label_dir))[:NUM_EXAMPLES]
    fig, axs = plt.subplots(1, NUM_EXAMPLES, figsize=(10, 2), squeeze=False)
    for ax in axs[0]:
        # Hide the axes on every subplot, including unused ones.
        ax.axis("off")
    for ax, filename in zip(axs[0], example_filenames):
        ax.imshow(plt.imread(os.path.join(label_dir, filename)))
    fig.suptitle(f"Showing {len(example_filenames)} examples for {label}")

plt.show()

The core training workflow consists of four steps, each separated into its own code block.

## Load the Dataset

This step involves loading the dataset and performing vector embedding of the hand keypoints. If your dataset is large, this process may take some time.

**How It Works:**
The dataset is loaded from the `dataset_path` using the `Dataset.from_folder` method. When loading, the pre-packaged hand detection model from MediaPipe Hands detects hand landmarks in the images. Any images without detected hands are omitted from the dataset. The resulting dataset will contain the extracted hand landmark positions from each image, instead of the raw images themselves.

### Split the Dataset

You can adjust the dataset split as needed. The default split is 80% for training, 10% for validation, and 10% for testing (splitting the remaining data equally between validation and test sets).

```python
train_data, rest_data = data.split(0.8)
validation_data, test_data = rest_data.split(0.5)
```
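
The resulting subset sizes can be worked out by hand. The sketch below assumes that each split rounds the first part down to a whole number, which is consistent with the example output of this notebook (378 / 47 / 48 for 473 loaded samples); the helper name `split_sizes` is hypothetical.

```python
def split_sizes(total, train_fraction=0.8, validation_fraction=0.5):
    """Sizes of the train/validation/test subsets for two chained splits."""
    train = int(total * train_fraction)  # first split: train vs. rest
    rest = total - train
    validation = int(rest * validation_fraction)  # second split: validation vs. test
    test = rest - validation
    return train, validation, test


print(split_sizes(473))  # (378, 47, 48)
```

Changing the first fraction to `0.7`, for example, leaves 30% of the samples to be divided between validation and test.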

### Cell Output

Once the dataset is processed, a summary is displayed in the cell output, showing the absolute sizes of the train, validation, and test datasets. Note that during preprocessing, some images may not have detectable hands or extractable keypoints. As a result, the actual data sizes may be slightly lower.

Cell output for example dataset:
```
 Number of classes: 4 
 Train data size: 378 
 Validation data size: 47 
 Test data size: 48 
```
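
To see how many images were omitted because no hand was detected, the number of image files on disk can be compared with the number of loaded samples. This is a rough sketch: the helper names and the list of file extensions are assumptions, not part of Model Maker.

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")  # assumed set of image suffixes


def count_images(dataset_path):
    """Count image files in every label folder under dataset_path."""
    total = 0
    for label in os.listdir(dataset_path):
        label_dir = os.path.join(dataset_path, label)
        if os.path.isdir(label_dir):
            total += sum(
                1
                for name in os.listdir(label_dir)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return total


def dropped_images(images_on_disk, loaded_samples):
    """Number of images that did not make it into the loaded dataset."""
    return images_on_disk - loaded_samples
```

After the loading cell has run, `dropped_images(count_images(dataset_path), len(data))` gives the number of omitted images.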

### Parameters

The `HandDataPreprocessingParams` class provides two configurable options for data loading:

- **shuffle**: A boolean that controls whether the dataset should be shuffled. Defaults to `True`.
- **min_detection_confidence**: A float (0 to 1) that sets the confidence threshold for hand detection. Defaults to `0.7`. A higher value means the detection model must be more confident in identifying a hand, which may improve dataset quality but reduce the number of usable images, as less confident detections are discarded.
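
The trade-off behind `min_detection_confidence` can be illustrated with plain Python. The scores below are hypothetical values, not real detector output, and the sketch assumes that a detection is kept when its score reaches the threshold.

```python
def count_kept(confidences, min_detection_confidence=0.7):
    """Count detections whose score reaches the confidence threshold."""
    return sum(1 for score in confidences if score >= min_detection_confidence)


# Hypothetical hand-detection scores for five images (not real measurements).
scores = [0.95, 0.82, 0.66, 0.55, 0.41]

print(count_kept(scores, 0.7))  # 2
print(count_kept(scores, 0.5))  # 4
```

In the notebook, a different threshold would be passed as `gesture_recognizer.HandDataPreprocessingParams(min_detection_confidence=0.5)`.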

**Note:** This processing step can take considerable time, so be patient!

In [None]:

data = gesture_recognizer.Dataset.from_folder(
    dirname=dataset_path, hparams=gesture_recognizer.HandDataPreprocessingParams()
)

train_data, rest_data = data.split(0.8)
validation_data, test_data = rest_data.split(0.5)

num_of_classes = len(labels)
num_of_train_data = len(train_data)
num_of_validation_data = len(validation_data)
num_of_test_data = len(test_data)

print(f"\033[35m Number of classes: {num_of_classes} \033[0m")
print(f"\033[35m Train data size: {num_of_train_data} \033[0m")
print(f"\033[35m Validation data size: {num_of_validation_data} \033[0m")
print(f"\033[35m Test data size: {num_of_test_data} \033[0m")

## Train a model with the given parameters (standard or custom)

Train the custom gesture recognizer by using the create method and passing in the training data, validation data, model options, and hyperparameters. For more information on model options and hyperparameters, see the Hyperparameters section below or take a look in the [Docs](https://ai.google.dev/edge/mediapipe/solutions/customization/gesture_recognizer)/[API Reference](https://ai.google.dev/edge/api/mediapipe/python/mediapipe_model_maker/gesture_recognizer/HParams).

### Standard Model

Train the model using the default values.


In [None]:
# -- Standard setup of model maker --

hparams = gesture_recognizer.HParams(export_dir="exported_model")
options = gesture_recognizer.GestureRecognizerOptions(hparams=hparams)
model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options
)


## Evaluation of the Model Performance

After training the model, evaluate its performance on the test dataset by printing the loss and accuracy metrics. The goal is to achieve a low loss value and a high accuracy score.

**Note**: Be cautious when interpreting high test accuracy values. If your model achieves an accuracy of `1.0`, this doesn't necessarily indicate that you’ve built a perfect model. It’s more likely that your model is overfitting, especially if your dataset is small and lacks variation. The true goal is to build a model with good generalization.

To ensure generalization, your dataset should represent a wide range of examples. For instance, it shouldn't only contain images of your own hand, but also those of other people’s hands (depending on the task and goal of your project).



In [None]:
loss, acc = model.evaluate(test_data, batch_size=1)
print(f"Test loss:{loss}, Test accuracy:{acc}")

## Export the trained model as a TensorFlow Lite Model

After training the model, convert and export it to the TensorFlow Lite format for later use in an on-device application. The export also bundles model metadata, including the label file.

In [None]:
model.export_model()
!ls exported_model
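
The `!ls` command relies on a Unix-like shell. As a portable alternative, the export folder can be listed from Python. This is a minimal sketch; the helper name `list_exported_files` is hypothetical, and the exported bundle is expected to be named `gesture_recognizer.task`, which should be treated as an assumption and checked against the listing.

```python
from pathlib import Path


def list_exported_files(export_dir="exported_model"):
    """Return (file name, size in bytes) for each file directly inside export_dir."""
    export_path = Path(export_dir)
    if not export_path.is_dir():
        return []  # nothing exported yet
    return sorted(
        (item.name, item.stat().st_size)
        for item in export_path.iterdir()
        if item.is_file()
    )


for name, size in list_exported_files("exported_model"):
    print(f"{name}: {size / 1024:.1f} KiB")
```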

## Run the Model On-Device

To deploy the TFLite model for on-device usage through MediaPipe Tasks, refer to the Gesture Recognizer [overview page](https://developers.google.com/mediapipe/solutions/vision/gesture_recognizer).

You're not limited to using Python; the model can also be integrated into Android, iOS, or Web applications.

# 🎮 Playground for Hyperparameter Tuning

----

You can further customize the model using the `GestureRecognizerOptions` class, which has two optional parameters for `ModelOptions` and `HParams`. Use the `ModelOptions` class to customize parameters related to the model itself, and the `HParams` class to customize other parameters related to training and saving the model.

`ModelOptions` has two customizable parameters that affect accuracy:
* `dropout_rate`: The fraction of the input units to drop. Used in dropout layer. Defaults to 0.05.
* `layer_widths`: A list of hidden layer widths for the gesture model. Each element in the list will create a new hidden layer with the specified width. The hidden layers are separated with BatchNorm, Dropout, and ReLU. Defaults to an empty list (no hidden layers).

`HParams` has the following list of customizable parameters which affect model accuracy:
* `learning_rate`: The learning rate to use for gradient descent training. Defaults to 0.001.
* `batch_size`: Batch size for training. Defaults to 2.
* `epochs`: Number of training iterations over the dataset. Defaults to 10.
* `steps_per_epoch`: An optional integer that indicates the number of training steps per epoch. If not set, the training pipeline calculates the default steps per epoch as the training dataset size divided by batch size.
* `shuffle`: True if the dataset is shuffled before training. Defaults to False.
* `lr_decay`: Learning rate decay to use for gradient descent training. Defaults to 0.99.
* `gamma`: Gamma parameter for focal loss. Defaults to 2.
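
Two of the entries above involve a little arithmetic. The sketch below shows the default `steps_per_epoch` calculation (assuming integer division) and the textbook focal loss for a single sample. The Model Maker implementation may differ in details such as class weighting, so this is an illustration and not the library's code; both function names are hypothetical.

```python
import math


def default_steps_per_epoch(train_size, batch_size):
    """Steps per epoch when steps_per_epoch is not set (integer division assumed)."""
    return train_size // batch_size


def focal_loss(p_true, gamma=2.0):
    """Textbook focal loss for one sample, given the predicted probability of the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


print(default_steps_per_epoch(378, 2))  # 189

# A confident, correct prediction contributes far less loss than an uncertain one.
print(focal_loss(0.9), focal_loss(0.5))
```

With `gamma=0` the expression reduces to ordinary cross-entropy; larger values of `gamma` shift the focus of training toward examples the model still gets wrong.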

Additional `HParams` parameter that does not affect model accuracy:
* `export_dir`: The location of the model checkpoint files and exported model files.

### Custom Model (an example; see the adjustable parameters above)

In [None]:
hparams = gesture_recognizer.HParams(learning_rate=0.003, export_dir="exported_model_2")
model_options = gesture_recognizer.ModelOptions(dropout_rate=0.2)
options = gesture_recognizer.GestureRecognizerOptions(model_options=model_options, hparams=hparams)
model_2 = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options
)

In [None]:
loss, accuracy = model_2.evaluate(test_data)
print(f"Test loss:{loss}, Test accuracy:{accuracy}")

In [None]:
model_2.export_model()
!ls exported_model_2