# Deep Learning for Computer Vision: Final Project

## Computer Science: COMS W 4995 05

### Proposal: Due November 5, 2024
### Presentations: Due December 3 and 5, 2024
### Final Report: Due December 9, 2024

### Project Overview

The final project is one of the most important and, hopefully, exciting components of the course. You will have the opportunity to develop a deep learning system of your own choosing. 
You are free to select whatever framework (PyTorch, TensorFlow, etc.) you like, but you need to create a report on your project in a Jupyter notebook. You are also free to build on publicly available models and code, but your report must clearly give attribution for the work of others and must clearly delineate your contributions. Also, half of the class will present their projects during the last two days. All students will prepare videos of their presentations and submit them when the final report is due.

### Project Proposal

The project description should include the title of the project, participants, a description of the objectives of the project, and a plan for how the project will be completed. The description of the objectives should include modest predictions of the success of the project. The plan for completion should include a description of the training data and how it will be obtained, a discussion of what deep learning framework will be used and why, and a rough description of the planned network architecture.

You are permitted to work together on a project in groups of two or three, but group size must not exceed three participants.  For group projects there must be a clearly delineated division of labor: you should state in the project description and project report who was responsible for which portion of the project. Each student must hand in a separate report. (Students will not necessarily get the same grade for the same project.)

You should mention whether you are simply re-implementing what others have done before but applying it to new data, or whether you are attempting to do something that is, to the best of your knowledge, new. Creative and original projects will be judged more kindly than those that rehash something in the existing literature. Projects that include a component in which data is acquired and curated into training and validation sets will be viewed more favorably than those that simply download an existing data set such as ImageNet.

As this is a computer vision course it is expected that your data will be visual, but exceptions might be made if the student is enthusiastic and persuasive enough. The most straightforward project would be to build a system that classifies images into categories. A more difficult project might be to build a system that detects and localizes a type of object within an image. A still more complicated project might involve joining a ConvNet/Vision Transformer with an LLM Transformer for a problem (like image captioning) that requires vision and language. But again, creative and original projects will be judged more kindly.  

It is important to scope your project so that you get some working results. Project reports that say "I tried this and this but nothing seemed to work..." are discouraged. Above all, you should demonstrate end-to-end fluency in the basics of deep learning. 

I cannot wait to see the results. Good luck!

<h1><strong>Project: RealGestureX</strong></h1>

Participant: Luigi Liu (ll3840)

<h1>Introduction</h1>

I plan to develop an integrated hand gesture recognition system that recognizes both static and dynamic hand gestures in real time. By combining hand tracking with deep learning models, this project aims to create an intuitive interface for human-computer interaction, with applications in smart home controls, accessibility tools, and interactive gaming.

<h3>Objectives</h3>

<strong>Primary Objective</strong>:

To design and implement a gesture recognition system that accurately identifies a set of predefined static and dynamic hand gestures using PyTorch.

<strong>Secondary Objectives</strong>:

Develop a custom dataset encompassing both static and dynamic gestures to train and validate the model.<br>
Integrate hand tracking using MediaPipe Hands to extract meaningful hand landmarks.<br>
Optimize the model architecture to ensure real-time performance with high accuracy.<br>
Deploy the system in a user-friendly interface that maps recognized gestures to specific commands.<br>


<h3>Project Plan for Completion</h3>

Gesture Selection: The system will recognize the following gestures:

Static Gestures:

Pointing,
Open Palm,
Thumb and Index Finger Touching,
Thumb and Middle Finger Touching,
Fist

Dynamic Gestures:

Swipe Up,
Swipe Down,
Swipe Left,
Swipe Right

Custom Dataset Creation: Due to the specificity of the required gestures, a custom dataset will be developed to ensure comprehensive coverage and high-quality samples. Each gesture sequence will be labeled accurately, with separate directories for training, validation, and testing sets to facilitate unbiased model evaluation.
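
One way to produce the three splits is a seeded shuffle over the recorded sample identifiers, sketched below. The 70/15/15 ratio, the function name `split_samples`, and the sample naming scheme are assumptions made for illustration; the proposal does not fix any of them.

```python
import random

def split_samples(sample_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle sample IDs reproducibly and partition them into three disjoint splits."""
    ids = sorted(sample_ids)          # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)  # local RNG leaves the global random state untouched
    n_val = round(len(ids) * val_frac)
    n_test = round(len(ids) * test_frac)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }

# Hypothetical clip names; real names would follow the recording setup.
splits = split_samples([f"swipe_up_{i:03d}" for i in range(100)])
```

If several participants are recorded, splitting by participant rather than by clip may give a more honest estimate of how the model generalizes to new users.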

Tools and Frameworks:<br>
Camera Setup: Utilize high-resolution webcams to capture clear images and videos of hand gestures.<br>
Hand Tracking: Employ MediaPipe Hands to extract 21 3D hand landmarks per frame, providing detailed spatial information.<br>
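
The 21 landmarks of one frame can be turned into a model-ready array roughly as follows. This is a minimal sketch: it assumes each landmark is an object with `x`, `y`, and `z` attributes (as in MediaPipe's Python solutions API) and that index 0 is the wrist. The helper names and the wrist-centered, unit-scale normalization are illustrative choices, not part of the proposal, and the input here is a synthetic stand-in rather than real tracker output.

```python
from types import SimpleNamespace
import numpy as np

def landmarks_to_array(landmarks):
    """Convert 21 landmark objects (each with .x, .y, .z) to a (21, 3) float array."""
    pts = np.array([[p.x, p.y, p.z] for p in landmarks], dtype=np.float32)
    if pts.shape != (21, 3):
        raise ValueError(f"expected 21 landmarks, got shape {pts.shape}")
    return pts

def normalize_hand(pts):
    """Center on the wrist (index 0) and scale so the farthest landmark is at distance 1."""
    centered = pts - pts[0]
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / scale if scale > 0 else centered

# Synthetic stand-in for one frame of tracker output.
fake = [SimpleNamespace(x=0.5 + 0.01 * i, y=0.5 - 0.02 * i, z=0.0) for i in range(21)]
features = normalize_hand(landmarks_to_array(fake))
```

Normalizing this way removes the hand's position and apparent size in the image, which may make the classifier less sensitive to where the hand sits relative to the camera.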

Data Augmentation:

Spatial Augmentation: Apply rotations, scaling, and translations to hand landmarks to simulate different hand orientations and positions.<br>
Temporal Augmentation: Vary the speed of gesture execution to capture different movement dynamics.<br>
Noise Injection: Introduce slight perturbations to landmark positions to enhance model robustness against real-world variances.<br>
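
The three augmentation types above could be combined in a single function along these lines. This is a sketch under assumptions: a clip is a NumPy array of shape `(T, 21, 3)`, rotation happens in the image plane around the clip's centroid, and all parameter ranges (angle, scale, shift, speed, noise level) are placeholder values that would need tuning.

```python
import numpy as np

def augment_sequence(seq, rng, max_angle_deg=15.0, scale_range=(0.9, 1.1),
                     max_shift=0.05, speed_range=(0.8, 1.2), noise_std=0.005):
    """Augment one gesture clip of shape (T, 21, 3); returns an array of shape (T', 21, 3)."""
    # Spatial: rotate in the (x, y) plane about the clip centroid, then scale and translate.
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    center = seq.mean(axis=(0, 1))
    scale = rng.uniform(*scale_range)
    shift = np.append(rng.uniform(-max_shift, max_shift, size=2), 0.0)
    out = (seq - center) @ rot.T * scale + center + shift

    # Temporal: resample along time by linear interpolation to mimic faster or slower motion.
    speed = rng.uniform(*speed_range)
    t_new = max(2, int(round(len(out) / speed)))
    src = np.linspace(0, len(out) - 1, t_new)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, len(out) - 1)
    w = (src - lo)[:, None, None]
    out = (1 - w) * out[lo] + w * out[hi]

    # Noise: small Gaussian jitter on every coordinate.
    return out + rng.normal(0.0, noise_std, size=out.shape)

rng = np.random.default_rng(0)
clip = rng.random((30, 21, 3))          # synthetic stand-in for a recorded clip
augmented = augment_sequence(clip, rng)
```

Because the swipe gestures are defined by direction, the rotation range probably has to stay small; a large rotation could turn a Swipe Up into something closer to a Swipe Left.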


<h3>Network Architecture Design</h3>

Hybrid Model Architecture: CNN + LSTM

Given the need to handle both static and dynamic gestures, a hybrid architecture is a natural fit: Convolutional Neural Networks (CNNs) extract spatial features from each frame, and Long Short-Term Memory networks (LSTMs) model the temporal sequence of those features.

Architecture Overview:

CNN Layers:<br>
Purpose: Extract spatial features from each frame or hand landmark data.<br>
Structure: A series of convolutional, batch normalization, activation (ReLU), and pooling layers to progressively capture complex features.<br>

LSTM Layers:<br>
Purpose: Capture temporal dependencies and motion patterns across the sequence of frames.<br>
Structure: One or more LSTM layers (potentially bidirectional) to process the sequence data effectively.<br>

Fully Connected Layers:<br>
Purpose: Classify the extracted features into predefined gesture categories.<br>
Structure: Dense layers with dropout for regularization, culminating in a softmax layer for multi-class classification.<br>
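
A minimal PyTorch sketch of this three-stage layout is shown below. It assumes the input is a batch of landmark clips shaped `(batch, frames, 21, 3)` and that the nine gestures listed earlier (five static, four dynamic) form the output classes. The class name `GestureNet`, the layer widths, the kernel sizes, and the dropout rate are placeholders rather than the final design.

```python
import torch
import torch.nn as nn

class GestureNet(nn.Module):
    """Per-frame 1D CNN over the 21 landmarks, followed by an LSTM over time."""

    def __init__(self, num_classes=9, hidden=64):
        super().__init__()
        # Each frame is treated as 3 channels (x, y, z) along a length-21 landmark axis.
        self.cnn = nn.Sequential(
            nn.Conv1d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),   # pool over landmarks: one 64-d vector per frame
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(2 * hidden, num_classes),
        )

    def forward(self, x):
        # x: (batch, frames, 21, 3)
        b, t, n, c = x.shape
        frames = x.reshape(b * t, n, c).transpose(1, 2)          # (b*t, 3, 21)
        feats = self.cnn(frames).squeeze(-1).reshape(b, t, -1)   # (b, t, 64)
        _, (h_n, _) = self.lstm(feats)
        last = torch.cat([h_n[0], h_n[1]], dim=1)                # final forward and backward states
        return self.head(last)                                   # raw logits

model = GestureNet().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 30, 21, 3))
    probs = logits.softmax(dim=-1)   # softmax applied at inference time
```

The model returns logits rather than probabilities because `nn.CrossEntropyLoss` applies the log-softmax internally during training; the explicit softmax is only needed when reporting class probabilities.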

<h3>Innovation and Originality</h3>

While gesture recognition systems utilizing CNN + LSTM architectures have been explored in existing research (e.g., the IEEE paper “Hand Gesture Recognition Using CNN and LSTM”), this project distinguishes itself through the following innovative and original approaches:

Integration of Static and Dynamic Gesture Recognition:

Unique Approach: Unlike many existing systems that focus solely on either static or dynamic gestures, this project handles both within a single real-time framework. This integration allows for a more versatile interaction model, catering to a broader range of user inputs.<br>
Real-Time Performance: Emphasizing real-time processing means that both static and dynamic gestures are recognized and responded to with low latency, improving user experience and system responsiveness.<br>

Custom Dataset Creation:

Tailored Data: Instead of relying on existing datasets, the project involves creating a bespoke dataset that precisely matches the project's specific gesture requirements. This is expected to improve accuracy and relevance for the recognition tasks.<br>
Data Diversity: By incorporating variations in participants, lighting conditions, and gesture execution styles, the dataset is intended to improve the model's robustness and generalizability.<br>

Optimized Hybrid Architecture:

Adaptive Modeling: The CNN + LSTM architecture is designed to capture both spatial features from individual frames and temporal dynamics across sequences. This dual capability is crucial for distinguishing between gestures that look similar in a single frame but differ in motion.<br>
Performance Enhancements: Techniques such as bidirectional LSTMs and, if the project is extended, attention mechanisms can further refine the model's ability to focus on critical movement patterns, improving overall recognition accuracy.<br>

System Integration and Deployment:

User-Friendly Interface: Beyond model development, the project emphasizes the integration of the recognition system into practical applications, mapping gestures to real-world commands. This end-to-end approach aims to make the system not only theoretically sound but also practically applicable.<br>
Scalability: The system is designed with scalability in mind, so that new gestures and functionalities can be added without significant overhauls to the existing infrastructure.<br>

Comprehensive Evaluation and Testing:

Real-World Testing: Conducting user testing across different scenarios helps verify that the system performs reliably under varied conditions, addressing potential real-world challenges that purely academic models might overlook.<br>
Performance Metrics: Utilizing a range of evaluation metrics, including confusion matrices and classification reports, provides a holistic understanding of the model's strengths and areas for improvement.<br>
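
The evaluation metrics mentioned above can be computed from predicted and true labels as sketched here. The labels are toy data for three classes, not results from this project, and the helper names are illustrative.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_report(cm):
    """Precision, recall, and F1 per class, with 0.0 where a denominator is zero."""
    tp = np.diag(cm).astype(float)
    pred_total = cm.sum(axis=0)
    true_total = cm.sum(axis=1)
    precision = np.divide(tp, pred_total, out=np.zeros_like(tp), where=pred_total > 0)
    recall = np.divide(tp, true_total, out=np.zeros_like(tp), where=true_total > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
precision, recall, f1 = per_class_report(cm)
accuracy = np.trace(cm) / cm.sum()
```

In practice, `confusion_matrix` and `classification_report` from `sklearn.metrics` provide the same quantities; the hand-written version only makes the definitions explicit.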

<h3>Expected Outcomes and Success Metrics</h3>


Accuracy: Achieve an overall gesture recognition accuracy of 85% or higher across both static and dynamic gestures.<br>
Real-Time Performance: Ensure the system processes gestures with a latency of less than 200 milliseconds, facilitating smooth user interactions.<br>
Robustness: Demonstrate the system's ability to accurately recognize gestures under varying lighting conditions, hand orientations, and across different users.<br>
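
The accuracy and latency targets could be checked with a small harness like the one below. It is a sketch under assumptions: latency is taken as the median wall-clock time of one processing step, which is one possible reading of the 200 ms target, and the timed workload is a placeholder rather than the real tracking and inference pipeline.

```python
import time
import statistics

def measure_latency_ms(fn, n_warmup=5, n_runs=50):
    """Return per-call wall-clock latencies of fn() in milliseconds."""
    for _ in range(n_warmup):      # warm-up calls are not timed
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def meets_targets(accuracy, latencies_ms, min_accuracy=0.85, max_latency_ms=200.0):
    """Check both targets: accuracy of at least 85% and median latency under 200 ms."""
    return accuracy >= min_accuracy and statistics.median(latencies_ms) < max_latency_ms

# Placeholder workload standing in for one tracking-plus-inference step.
latencies = measure_latency_ms(lambda: sum(range(10_000)))
ok = meets_targets(accuracy=0.90, latencies_ms=latencies)
```

Reporting a high percentile alongside the median may also be useful, since occasional slow frames affect how responsive the interface feels.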