### Update your test data path here

In [1]:
test_data = './data/val'
test_labels = './data/val.csv'
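
Before running the evaluation cells, it can help to confirm that both paths resolve. The following is a minimal sketch and not part of the original notebook: `check_test_paths` is a hypothetical helper name, and it only checks that the directory and the CSV file exist, without assuming anything about their contents.

```python
from pathlib import Path


def check_test_paths(data_dir, labels_csv):
    """Return a list of human-readable problems; an empty list means both paths exist."""
    problems = []
    if not Path(data_dir).is_dir():
        problems.append(f"Test data directory not found: {data_dir}")
    if not Path(labels_csv).is_file():
        problems.append(f"Labels CSV not found: {labels_csv}")
    return problems


# Example with the default paths used in this notebook
for problem in check_test_paths('./data/val', './data/val.csv'):
    print(problem)
```

Nothing is printed when both paths are in place.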

### Model Performance Summary Table


| **Index** | **Model**                     | **Key Hyperparameters**                                                                 | **Results**                             | **Conclusion & Explanation**                                                                                                                                                                                                                                                                                                       |
|-----------|--------------------------------|-----------------------------------------------------------------------------------------|------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1         | Conv2D + GRU                  | Conv2D (16, 32, 64 filters), GRU (64 units), Dense (5 units), Dropout (40%)             | Accuracy: Train 98.79%, Val 94.0%        | The model learns spatial and temporal features effectively. Conv2D layers capture spatial features, while GRU layers handle temporal dependencies. Regularization with dropout minimizes overfitting, enabling strong generalization. This model performs robustly due to the balance of spatial and temporal learning.                   |
| 2         | Conv2D + LSTM                 | Conv2D (16, 32, 64 filters), LSTM (32 units), Dense (5 units), Dropout (50%)            | Accuracy: Train 92.01%, Val 91.0%        | Conv2D layers extract spatial patterns effectively, while the LSTM layer captures temporal dependencies. However, its slightly lower accuracy compared to Conv2D+GRU indicates GRUs might better capture the temporal nuances for gesture recognition tasks. Dropout helps reduce overfitting.                                           |
| 3         | Conv3D Without Pretraining     | 3 Conv3D layers (filters: 32, 64, 128), GlobalAvgPooling, Dense (128 units, dropout 50%) | Accuracy: Train 93.82%, Val 89.0%        | The Conv3D model effectively learns spatial and temporal features. Its consistent performance shows that raw spatial-temporal features are well captured without needing a pretrained base. The dropout layer effectively regularizes training. Further optimization could focus on data augmentation.                              |
| 4         | GRU With MediaPipe Keypoints   | GRU (64 units), Flatten, Dense (5 units), Dropout (50%)                                  | Accuracy: Train 95.5%, Val 99.0%         | MediaPipe keypoints simplify the input space, leading to a lightweight model with only 25k parameters. The GRU efficiently models the temporal dependencies in hand gestures, producing excellent generalization. This approach is computationally efficient, ideal for real-time applications, and robust due to keypoint-based input. |
| 5         | MobileNetV2 + GRU (Pretrained) | Pretrained MobileNetV2, GRU (32 units), Dense (5 units), Dropout (50%)                  | Accuracy: Train 92.01%, Val 91.0%        | The pretrained MobileNetV2 effectively extracts spatial features, while the GRU layer models temporal dependencies. However, freezing the pretrained base limits the model’s ability to adapt to specific gesture tasks. Fine-tuning the base layers could further improve performance.                                                  |
| 6         | MobileNetV3Small + GRU         | Pretrained MobileNetV3Small, GRU (64 units), Dense (5 units), Dropout (50%)             | Accuracy: Train 40%, Val 45%             | The frozen MobileNetV3Small base limits the model’s performance, resulting in underfitting. Temporal modeling via GRU is insufficient to compensate for the lack of fine-tuning. This model requires significant improvements through fine-tuning, data augmentation, or more robust temporal modeling.                                   |
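
One way to read the table is to look at the gap between training and validation accuracy alongside the validation score itself. The sketch below uses only the accuracies reported in the table; the dictionary layout and the `generalization_gap` name are illustrative choices, not part of the original notebook.

```python
# (train accuracy %, validation accuracy %) as reported in the table above
reported = {
    "Conv2D + GRU": (98.79, 94.0),
    "Conv2D + LSTM": (92.01, 91.0),
    "Conv3D": (93.82, 89.0),
    "GRU + MediaPipe keypoints": (95.5, 99.0),
    "MobileNetV2 + GRU": (92.01, 91.0),
    "MobileNetV3Small + GRU": (40.0, 45.0),
}


def generalization_gap(train_acc, val_acc):
    """Train minus validation accuracy in percentage points; a large positive gap suggests overfitting."""
    return round(train_acc - val_acc, 2)


# List models by validation accuracy, best first
ranked = sorted(reported.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, val_acc) in ranked:
    gap = generalization_gap(train_acc, val_acc)
    print(f"{name:28s} val={val_acc:5.1f}%  gap={gap:+.2f}")
```

A negative gap, as for the MediaPipe keypoint model, means the validation score is higher than the training score.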


### Best in Each Category

| **Category**        | **Model**                     | **Key Hyperparameters**                                                                 | **Results**                             | **Why This Model Stands Out**                                                                                                                                                                                                                                                              |
|----------------------|--------------------------------|-----------------------------------------------------------------------------------------|------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Non-Pretrained**   | Conv3D                        | 3 Conv3D layers (filters: 32, 64, 128), GlobalAvgPooling, Dense (128 units, dropout 50%) | Accuracy: Train 93.82%, Val 89.0%        | Conv3D achieves consistent performance with stable loss and accuracy trends. Its ability to learn spatial and temporal features simultaneously, without relying on pretrained weights, makes it a robust option. The stability suggests potential for improvement with additional training.           |
| **Pretrained**       | GRU with MediaPipe Keypoints  | GRU (64 units), Flatten, Dense (5 units), Dropout (50%), MediaPipe Hand Keypoints       | Accuracy: Train 95.5%, Val 99.0%         | Leveraging MediaPipe’s pretrained keypoint extractor reduces input complexity, allowing the lightweight GRU-based model to achieve exceptional accuracy and efficiency. Its simplicity and computational efficiency make it ideal for real-time applications while maintaining robust generalization. |


### Download and Save `utilities.py`

The following script checks whether `utilities.py` exists locally. If the file is not found, it is downloaded from the <https://github.com/mohiteamit/upGrad-Gesture-Recognition> repository.


In [2]:
# Download utilities.py
import os
import requests

file_name = "utilities.py"
url = "https://raw.githubusercontent.com/mohiteamit/upGrad-Gesture-Recognition/refs/heads/main/utilities.py"

# Check if the file exists
if not os.path.exists(file_name):
    print(f"{file_name} not found. Downloading...")
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Treat HTTP error statuses as failures
    except requests.RequestException as e:
        # Raise instead of calling exit(), which is not a reliable way to stop a notebook cell
        raise RuntimeError(f"Failed to download {file_name}: {e}") from e
    with open(file_name, "wb") as file:
        file.write(response.content)
    print(f"{file_name} downloaded successfully.")

### Download and Verify Models

The script downloads each model from a list of URLs into the output directory. An existing file is verified by comparing it byte-for-byte with the remote copy; a missing or mismatched file is downloaded again.

- **Output Directory**: `models_to_evaluate`
- **Model URLs**: Pre-defined list

In [3]:
import os
import requests

# List of model URLs
model_urls = [
    "https://github.com/mohiteamit/upGrad-Gesture-Recognition/raw/refs/heads/main/best-models/Conv2D+GRU.keras",
    "https://github.com/mohiteamit/upGrad-Gesture-Recognition/raw/refs/heads/main/best-models/Conv2D+LSTM.keras",
    "https://github.com/mohiteamit/upGrad-Gesture-Recognition/raw/refs/heads/main/best-models/Conv3D-32-64-128.keras",
    "https://github.com/mohiteamit/upGrad-Gesture-Recognition/raw/refs/heads/main/best-models/pretrained-MobileNetV2+GRU.keras",
    "https://github.com/mohiteamit/upGrad-Gesture-Recognition/raw/refs/heads/main/best-models/pretrained-MobileNetV3Small+GRU.keras",
    "https://github.com/mohiteamit/upGrad-Gesture-Recognition/raw/refs/heads/main/best-models/pretrained-mediapipe+gru.keras",
]

# Directory to save models
output_dir = "models_to_evaluate"
os.makedirs(output_dir, exist_ok=True)

# Function to verify file integrity against the remote content
def verify_file(file_path, remote_content):
    with open(file_path, 'rb') as f:
        return f.read() == remote_content

# Download models: fetch each URL once and (re)write the file only if it is missing or differs
failed = []
for url in model_urls:
    filename = os.path.join(output_dir, os.path.basename(url))
    try:
        response = requests.get(url, timeout=60)
        if not response.ok:
            print(f"Failed to download: {url} (HTTP {response.status_code})")
            failed.append(url)
        elif not os.path.exists(filename) or not verify_file(filename, response.content):
            with open(filename, 'wb') as f:
                f.write(response.content)
    except Exception as e:
        print(f"Error processing {url}: {e}")
        failed.append(url)

# Report success only if every model was fetched
print("Models downloaded." if not failed else f"{len(failed)} model(s) failed to download.")

Models downloaded.
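
The verification above compares a local file against a fresh copy fetched from the URL, so every check costs a full download. If trusted checksums for the model files were available, a hash comparison would avoid that. The sketch below assumes such SHA-256 digests exist; the repository is not known to publish them, and `sha256_of` and `is_intact` are hypothetical helper names.

```python
import hashlib


def sha256_of(file_path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(file_path, expected_sha256):
    """True if the file is readable and matches the expected digest (assumed to come from a trusted source)."""
    try:
        return sha256_of(file_path) == expected_sha256.lower()
    except OSError:
        # Missing or unreadable files count as not intact
        return False
```

With digests on hand, the download loop would only need to fetch a file when `is_intact` returns `False`.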


### Import Necessary Modules

- **GestureDataGenerator**: Custom data generator from `utilities.py`.
- **TensorFlow**: Framework for deep learning.
- **load_model**: Used to load pre-trained models.

In [4]:
from utilities import GestureDataGenerator
import tensorflow as tf
from tensorflow.keras.models import load_model

### Evaluate `Conv2D+GRU` Model with TensorFlow 2.10.x

The script checks for TensorFlow version compatibility and evaluates the `Conv2D+GRU` model using the `GestureDataGenerator`.

- **TensorFlow Version**: `2.10.x`.
- **Image Size**: `(120, 120)`
- **Model Path**: `models_to_evaluate/Conv2D+GRU.keras`

In [10]:
if tf.__version__.startswith("2.10"):
    image_size = (120, 120)

    test_generator = GestureDataGenerator(
        data_path=test_data,
        labels_csv=test_labels,
        image_size=image_size
    )

    Conv2D_GRU = load_model('models_to_evaluate/Conv2D+GRU.keras')                   # Best image size 120x120
    Conv2D_GRU.summary()
    evaluation_results = Conv2D_GRU.evaluate(test_generator)
    for metric, value in zip(Conv2D_GRU.metrics_names, evaluation_results):
        print(f"{metric}: {value:.4f}")
else:
    raise ValueError("This model requires TensorFlow 2.10.x")

13 batches created, each of size 8, with 100 sequences of 30 images each. Use MediaPipe: False
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 time_distributed_70 (TimeDi  (None, 30, None, None, 1  448      
 stributed)                  6)                                  
                                                                 
 time_distributed_71 (TimeDi  (None, 30, None, None, 1  64       
 stributed)                  6)                                  
                                                                 
 time_distributed_72 (TimeDi  (None, 30, None, None, 1  0        
 stributed)                  6)                                  
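
Because the saved models target two different TensorFlow releases (2.10.x and 2.18.x, as listed in each evaluation section), a single kernel can only evaluate half of them. Instead of raising an error in each cell, the requirements stated in this notebook can be collected in one place and filtered by the installed version. This is a sketch of that idea: `MODEL_REQUIREMENTS` and `models_for` are illustrative names, and the version string is passed in as an argument so the snippet runs without TensorFlow installed.

```python
# Requirements as stated in the evaluation sections of this notebook:
# model file -> (required TensorFlow major.minor, image size)
MODEL_REQUIREMENTS = {
    "Conv2D+GRU.keras": ("2.10", (120, 120)),
    "Conv2D+LSTM.keras": ("2.10", (120, 120)),
    "Conv3D-32-64-128.keras": ("2.10", (200, 200)),
    "pretrained-MobileNetV2+GRU.keras": ("2.18", (224, 224)),
    "pretrained-MobileNetV3Small+GRU.keras": ("2.18", (224, 224)),
    "pretrained-mediapipe+gru.keras": ("2.18", (256, 256)),
}


def models_for(tf_version):
    """Return the model files whose required major.minor matches tf_version (e.g. '2.10.1')."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return [name for name, (required, _) in MODEL_REQUIREMENTS.items() if required == major_minor]


print(models_for("2.10.1"))
print(models_for("2.18.0"))
```

In the notebook, `tf.__version__` would be passed as `tf_version`, and the evaluation loop would skip any model not returned for the running kernel.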
                                 

### Evaluate `Conv2D+LSTM` Model

The script checks for TensorFlow version compatibility and evaluates the `Conv2D+LSTM` model using the `GestureDataGenerator`.

- **TensorFlow Version**: `2.10.x`.
- **Image Size**: `(120, 120)`
- **Model Path**: `models_to_evaluate/Conv2D+LSTM.keras`


In [None]:
if tf.__version__.startswith("2.10"):
    image_size = (120, 120)

    test_generator = GestureDataGenerator(
        data_path=test_data,
        labels_csv=test_labels,
        image_size=image_size
    )

    Conv2D_LSTM = load_model('models_to_evaluate/Conv2D+LSTM.keras')
    Conv2D_LSTM.summary()
    evaluation_results = Conv2D_LSTM.evaluate(test_generator)
    for metric, value in zip(Conv2D_LSTM.metrics_names, evaluation_results):
        print(f"{metric}: {value:.4f}")
else:
    raise ValueError("This model requires TensorFlow 2.10.x")

13 batches created, each of size 8, with 100 sequences of 30 images each. Use MediaPipe: False
Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 time_distributed_50 (TimeDi  (None, 30, None, None, 1  448      
 stributed)                  6)                                  
                                                                 
 time_distributed_51 (TimeDi  (None, 30, None, None, 1  64       
 stributed)                  6)                                  
                                                                 
 time_distributed_52 (TimeDi  (None, 30, None, None, 1  0        
 stributed)                  6)                                  
                                                                 
 time_distributed_53 (TimeDi  (None, 30, None, None, 3  4640     
 stributed)                  2)                                  
                         

### Evaluate `Conv3D` Model - Best Model Without Pretraining

**The best-performing model without transfer learning**

The script checks for TensorFlow version compatibility and evaluates the `Conv3D` model using the `GestureDataGenerator`.

- **TensorFlow Version**: `2.10.x`.
- **Image Size**: `(200, 200)`
- **Model Path**: `models_to_evaluate/Conv3D-32-64-128.keras`

**Note:** This model has shown potential to perform even better with more training data and additional epochs. Conv2D+GRU and Conv2D+LSTM achieve higher validation accuracy (94% and 91%, respectively, versus 89%), but their loss is less predictable. Conv3D's loss is more stable, so it is expected to perform comparably on unseen data.

In [None]:
if tf.__version__.startswith("2.10"):
    image_size = (200, 200)

    test_generator = GestureDataGenerator(
        data_path=test_data,
        labels_csv=test_labels,
        image_size=image_size
    )
    Conv3D_32_64_128 = load_model('models_to_evaluate/Conv3D-32-64-128.keras') 
    Conv3D_32_64_128.summary()
    evaluation_results = Conv3D_32_64_128.evaluate(test_generator)
    for metric, value in zip(Conv3D_32_64_128.metrics_names, evaluation_results):
        print(f"{metric}: {value:.4f}")
else:
    raise ValueError("This model requires TensorFlow 2.10.x")

13 batches created, each of size 8, with 100 sequences of 30 images each. Use MediaPipe: False
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv3d_3 (Conv3D)           (None, 30, 200, 200, 32)  2624      
                                                                 
 max_pooling3d_2 (MaxPooling  (None, 15, 100, 100, 32)  0        
 3D)                                                             
                                                                 
 conv3d_4 (Conv3D)           (None, 15, 100, 100, 64)  55360     
                                                                 
 max_pooling3d_3 (MaxPooling  (None, 8, 50, 50, 64)    0         
 3D)                                                             
                                                                 
 conv3d_5 (Conv3D)           (None, 8, 50, 50, 128)    221312    
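
The parameter counts in this summary can be reproduced by hand, which is a quick sanity check that the loaded file matches the architecture in the table. The sketch assumes 3×3×3 kernels with a bias term and a 3-channel (RGB) input; neither is stated in the notebook, but both are consistent with the counts shown. `conv_params` is an illustrative helper name.

```python
from math import prod


def conv_params(kernel_shape, in_channels, filters, use_bias=True):
    """Trainable parameters of a convolution layer: one kernel per filter, plus an optional bias each."""
    weights_per_filter = prod(kernel_shape) * in_channels
    return (weights_per_filter + (1 if use_bias else 0)) * filters


# Assumed: 3x3x3 kernels and RGB input; the filter counts 32, 64, 128 come from the table
kernel = (3, 3, 3)
print(conv_params(kernel, 3, 32))     # 2624, matches conv3d_3
print(conv_params(kernel, 32, 64))    # 55360, matches conv3d_4
print(conv_params(kernel, 64, 128))   # 221312, matches conv3d_5
```

The same formula with a `(3, 3)` kernel gives 448 and 4640, which is consistent with the first and fourth `TimeDistributed` layers in the Conv2D summaries.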
                         

### Evaluate `MobileNetV2+GRU` Model

The following code evaluates the `MobileNetV2+GRU` model using TensorFlow 2.18.x on test data.

- **TensorFlow Version**: `2.18.x`.
- **Image Size**: `(224, 224)`
- **Use MediaPipe**: `False`
- **Model Path**: `models_to_evaluate/pretrained-MobileNetV2+GRU.keras`

In [7]:
if tf.__version__.startswith("2.18"):
    image_size = (224, 224)

    test_generator = GestureDataGenerator(
        data_path=test_data,
        labels_csv=test_labels,
        image_size=image_size,
        use_mediapipe=False
    )

    MobileNetV2_GRU = load_model('models_to_evaluate/pretrained-MobileNetV2+GRU.keras')
    MobileNetV2_GRU.summary()
    evaluation_results = MobileNetV2_GRU.evaluate(test_generator)
    for metric, value in zip(MobileNetV2_GRU.metrics_names, evaluation_results):
        print(f"{metric}: {value:.4f}")
else:
    raise ValueError("This model requires TensorFlow 2.18.x")

13 batches created, each of size 8, with 100 sequences of 30 images each. Use MediaPipe: False


13/13 ━━━━━━━━━━━━━━━━━━━━ 53s 2s/step - accuracy: 0.8174 - loss: 0.8290
loss: 0.8503
compile_metrics: 0.8200
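
The generator message and the final accuracy can be cross-checked with simple arithmetic. The sketch below assumes the generator keeps the final partial batch rather than dropping it, which is what 13 batches for 100 sequences implies; the variable names are illustrative.

```python
import math

num_sequences = 100   # from the generator message
batch_size = 8        # from the generator message

# 12 full batches cover 96 sequences; the remainder forms a 13th, smaller batch
num_batches = math.ceil(num_sequences / batch_size)
last_batch_size = num_sequences - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)   # 13 4

# An accuracy of 0.82 over 100 sequences corresponds to 82 correct predictions
accuracy = 0.82
correct = round(accuracy * num_sequences)
print(correct)   # 82
```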


### Evaluate `MobileNetV3Small+GRU` Model

The following evaluates the `MobileNetV3Small+GRU` model using TensorFlow 2.18.x on test data.

- **TensorFlow Version**: `2.18.x`.
- **Image Size**: `(224, 224)`
- **Model Path**: `models_to_evaluate/pretrained-MobileNetV3Small+GRU.keras`

In [8]:
if tf.__version__.startswith("2.18"):
    image_size = (224, 224)

    test_generator = GestureDataGenerator(
        data_path=test_data,
        labels_csv=test_labels,
        image_size=image_size,
    )

    MobileNetV3Small_GRU = load_model('models_to_evaluate/pretrained-MobileNetV3Small+GRU.keras')
    MobileNetV3Small_GRU.summary()
    evaluation_results = MobileNetV3Small_GRU.evaluate(test_generator)
    for metric, value in zip(MobileNetV3Small_GRU.metrics_names, evaluation_results):
        print(f"{metric}: {value:.4f}")
else:
    raise ValueError("This model requires TensorFlow 2.18.x")        

13 batches created, each of size 8, with 100 sequences of 30 images each. Use MediaPipe: False


13/13 ━━━━━━━━━━━━━━━━━━━━ 38s 847ms/step - accuracy: 0.3608 - loss: 1.4221
loss: 1.4710
compile_metrics: 0.3600
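
The `compile_metrics` label in this output appears to come from `metrics_names` under TensorFlow 2.18 (Keras 3), which reports the compiled metrics as a single container rather than by name. Calling `evaluate` with `return_dict=True` returns the values keyed by metric name instead. The helper below is a sketch: `report_metrics` is a hypothetical name, and it works with any object whose `evaluate` method accepts `return_dict=True`, as Keras models do.

```python
def report_metrics(model, data):
    """Evaluate the model and print each metric under its own name; returns the metrics dict."""
    results = model.evaluate(data, return_dict=True)
    for name, value in results.items():
        print(f"{name}: {value:.4f}")
    return results
```

In this notebook it would be called as `report_metrics(MobileNetV3Small_GRU, test_generator)`, which should label the second value `accuracy` rather than `compile_metrics`.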


### Evaluate `Mediapipe+GRU` Model - Best Model With Pretraining

**The best-performing model with transfer learning**

The following code evaluates the `Mediapipe+GRU` model using TensorFlow 2.18.x on test data. The model uses MediaPipe Hands as a preprocessing step that reduces each image to a (21, 3) array representing the 21 key points of the hand.

- **TensorFlow Version**: Must start with `2.18`.
- **Image Size**: `(256, 256)`
- **Use MediaPipe**: `True`
- **Model Path**: `models_to_evaluate/pretrained-mediapipe+gru.keras`

**Note:** The performance of this model will vary depending on how it is deployed in practice. However, because it focuses only on the hands and ignores all other visual noise, it is expected to outperform the other models. The model is also CPU-friendly and does not require GPU hardware to predict a single hand gesture at a time.
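
To make the size reduction concrete, the sketch below compares the number of values per frame in a 256×256 RGB image with a (21, 3) keypoint array, and flattens one frame's keypoints into a single feature vector. Flattening to 63 values per frame is an assumption about how the keypoints are fed to the GRU, not something stated in the notebook, and the zero-valued landmarks are placeholders rather than MediaPipe output.

```python
FRAMES_PER_SEQUENCE = 30        # from the generator message
NUM_KEYPOINTS, COORDS = 21, 3   # one (21, 3) array per frame

pixels_per_frame = 256 * 256 * 3           # values in one 256x256 RGB frame
values_per_frame = NUM_KEYPOINTS * COORDS  # values in one keypoint array
print(pixels_per_frame, values_per_frame)


def flatten_keypoints(frame_keypoints):
    """Turn one frame's 21 (x, y, z) keypoints into a flat list of 63 values."""
    return [coord for point in frame_keypoints for coord in point]


# Placeholder sequence: 30 frames of 21 zeroed keypoints
sequence = [[(0.0, 0.0, 0.0)] * NUM_KEYPOINTS for _ in range(FRAMES_PER_SEQUENCE)]
features = [flatten_keypoints(frame) for frame in sequence]
print(len(features), len(features[0]))
```

Under these assumptions each frame shrinks from 196,608 values to 63, which is why the downstream model can stay small.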

In [9]:
if tf.__version__.startswith("2.18"):
    image_size = (256, 256)

    test_generator = GestureDataGenerator(
        data_path=test_data,
        labels_csv=test_labels,
        image_size=image_size,
        use_mediapipe=True
    )

    mediapipe_GRU = load_model('models_to_evaluate/pretrained-mediapipe+gru.keras')
    mediapipe_GRU.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    mediapipe_GRU.summary()
    evaluation_results = mediapipe_GRU.evaluate(test_generator)
    for metric, value in zip(mediapipe_GRU.metrics_names, evaluation_results):
        print(f"{metric}: {value:.4f}")
else:
    raise ValueError("This model requires TensorFlow 2.18.x")

13 batches created, each of size 8, with 100 sequences of 30 images each. Use MediaPipe: True


13/13 ━━━━━━━━━━━━━━━━━━━━ 57s 4s/step - accuracy: 0.8749 - loss: 0.4942
loss: 0.3874
compile_metrics: 0.9300
