# Linkit Challenge SS 2023 Hand Gesture Detection - Team X

## 1. Setup


First, create a virtual environment so that the package versions of this project stay separate from your other Python installations.
Create the environment with the following command in Terminal (Mac) or CMD (Windows):

python3 -m venv venv


Then activate it with
- in CMD: venv\Scripts\activate
- in Terminal: . venv/bin/activate
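
To confirm that the notebook kernel really runs inside the activated environment, a quick check like the following may help. This is a minimal sketch: `in_virtualenv` is a hypothetical helper name, and it relies on the fact that `sys.prefix` differs from `sys.base_prefix` inside an environment created with `venv`.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("virtual environment active:", in_virtualenv())
```

If this prints `False`, the notebook is probably using a different kernel than the environment you activated.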

In [None]:
# only run once to setup your environment
first_run = False

if first_run:
    # install the packages listed in requirements.txt
    !pip3 install -r requirements.txt
    # clone the git repo of ultralytics/yolov5
    !git clone https://github.com/ultralytics/yolov5.git
    # install dependencies for yolo
    !pip3 install -r yolov5/requirements.txt
    # or from website
    #!pip3 install -qr https://raw.githubusercontent.com/ultralytics/yolov5/master/requirements.txt

#### 1.1. Import Packages

In [13]:
import pytorch_lightning as pl # PyTorch Lightning for easier training and evaluation of models
import torch # PyTorch
import cv2  # OpenCV for image processing
import matplotlib.pyplot as plt # for plotting
import matplotlib.patches as patches  # for plotting bounding boxes
%matplotlib inline
import uuid   # Unique identifier
import os # File system operations
import time # Time operations
import subprocess # running shell commands
import pandas as pd


## 2. Annotation

#### 2.1. Capture Images

Select how many images you want to take per class and choose the path where these images will be stored.

In [11]:
labels = ['rock', 'paper', 'scissor']
number_imgs = 1
IMAGES_PATH = os.path.join('datasets', 'gestures', 'images')
LABELS_PATH = os.path.join('datasets', 'gestures', 'labels')

Create the directories (works on all operating systems)

In [None]:
def create_dir(path):
    # os.makedirs also creates missing parent folders and behaves the same on every OS
    os.makedirs(path, exist_ok=True)


In [None]:
for structure in [IMAGES_PATH,LABELS_PATH]:
    create_dir(structure)
    # create subfolder
    for folder in ['test-dev','train','val']:
        create_dir(os.path.join(structure, folder))

Capture the images! You can adjust the time between frames to give yourself more time to switch between hand gestures.

In [None]:
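stop = False
for label in labels:
    cap = cv2.VideoCapture(0)
    print('Collecting images for {}'.format(label))
    time.sleep(7) # time to switch to the next gesture
    for imgnum in range(number_imgs):
        print('Collecting image {}'.format(imgnum))
        ret, frame = cap.read()
        if not ret:
            # the camera did not deliver a frame, so there is nothing to save
            print('Could not read a frame from the camera')
            continue
        imgname = os.path.join(IMAGES_PATH,'train',label+'.'+'{}.jpg'.format(str(uuid.uuid1())))
        cv2.imwrite(imgname, frame)
        cv2.imshow('frame', frame)
        time.sleep(2) # time between frames

        # press q to stop capturing for all remaining labels
        if cv2.waitKey(1) & 0xFF == ord('q'):
            stop = True
            break
    cap.release() # free the camera before it is opened again for the next label
    if stop:
        break
cv2.destroyAllWindows()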

Go to https://www.makesense.ai and label your data there. Create the labels in the order rock, paper, scissor, so that they receive the class IDs 0, 1 and 2.

**Export the labels in yolo format**
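
Each line of an exported YOLO label file has the form `class x_center y_center width height`, where the four box values are fractions of the image width and height. The sketch below converts one such line to pixel corners. The helper name `parse_yolo_line` and the sample line are made up for illustration.

```python
def parse_yolo_line(line: str, image_width: int, image_height: int):
    """Convert one YOLO label line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    class_id, x_c, y_c, w, h = line.split()
    # center and size are stored relative to the image size
    x_c, w = float(x_c) * image_width, float(w) * image_width
    y_c, h = float(y_c) * image_height, float(h) * image_height
    # the corners lie half a box size away from the center
    return (int(class_id),
            round(x_c - w / 2), round(y_c - h / 2),
            round(x_c + w / 2), round(y_c + h / 2))

# hypothetical label line for a 640x480 image
print(parse_yolo_line("1 0.5 0.5 0.25 0.5", 640, 480))  # (1, 240, 120, 400, 360)
```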

Understanding the bounding box:

In [61]:
def draw_bounding_box(image_path, label_path):
    # Load the image
    image = cv2.imread(image_path)
    if image is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        raise FileNotFoundError('Could not read image: ' + image_path)
    image_height, image_width, _ = image.shape
    
    # Read the first bounding box (YOLO format: class x_center y_center width height, all relative)
    coordinates = pd.read_csv(label_path, sep=' ', header=None).iloc[0, 1:]
    x_rel = coordinates.iloc[0]
    y_rel = coordinates.iloc[1]
    width_rel = coordinates.iloc[2]
    height_rel = coordinates.iloc[3]
    
    # Convert the relative box center and size to pixels
    x = int(x_rel * image_width)
    y = int(y_rel * image_height)
    width = int(width_rel * image_width)
    height = int(height_rel * image_height)
    
    # Calculate the corners of the bounding box from its center and size
    x_min = int(x - width / 2)
    y_min = int(y - height / 2)
    x_max = int(x + width / 2)
    y_max = int(y + height / 2)
    
    # Draw the bounding box rectangle on the image
    cv2.rectangle(image, (x_min, y_min), (x_max, y_max), (0, 255, 0), 2)
    
    # Display the image with bounding box
    cv2.imshow('Image with Bounding Box', image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

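def draw_bounding_box_fast_access(subfolder, img_name):
    # image and label share the same file name and differ only in folder and extension
    image_path = os.path.join(IMAGES_PATH,subfolder,img_name+'.jpg')
    label_path = os.path.join(LABELS_PATH,subfolder,img_name+'.txt')
    draw_bounding_box(image_path, label_path)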


In [52]:
image_path = os.path.join(IMAGES_PATH,'train','c.jpg')
label_path = os.path.join(LABELS_PATH,'train','paper.c1e51c9a-fe04-11ed-a540-161b77bd6551.txt')

In [63]:
draw_bounding_box_fast_access('train','paper.c1e51c9a-fe04-11ed-a540-161b77bd6551')

In [64]:
cv2.destroyAllWindows()

## 3. Training

#### 3.2. Load and Test Pretrained Model
In computer vision, we usually start from pretrained models to reduce the number of samples required for training. In this case, we use a pretrained YOLOv5 model that was trained on the COCO dataset. The COCO dataset contains 80 object classes, which do not cover our hand gestures. For now, let's just test the model on a random image.

#### Tasks:
_**Task 3.1:**_ Can you find better models for our task?

_**Task 3.2:**_ Can you find better pretrained weights for our task?

_Note:_ You can find possible models and weights on:
1. Hugging Face [Model Hub](https://huggingface.co/models?pipeline_tag=object-detection&sort=downloads).
2. PyTorch [Model Zoo](https://pytorch.org/docs/stable/torchvision/models.html).
3. PyTorch [Model Hub](https://pytorch.org/hub/).


#### 3.3 Adapt the Model for our Task
We need to adapt the model for our task: the output layer of the pretrained model predicts the 80 COCO classes, so it has to be replaced with a new layer that predicts only our 3 classes (one for each hand gesture).
Since we use the [yolov5 implementation of ultralytics](https://github.com/ultralytics/yolov5), we can use their provided training script to train our model.


In [None]:
class yolov5:
    
    def __init__(self, img: int=640, conf: float=0.4):
        self.img = str(img)
        self.conf = conf
        # preparing command
        self.__algo = 'yolov5'
        self.__exePre = 'python3 '+self.__algo+'/'

    
    def getLastWeights(self) -> str:
        weightsPath = os.path.join(self.__algo,'runs','train')
        # map the run number of every exp folder that holds a last.pt to that file
        # (the runs are named exp, exp2, exp3, ...; plain 'exp' counts as run 1)
        runs = {}
        if os.path.isdir(weightsPath):
            for folder in os.listdir(weightsPath):
                suffix = folder[3:]
                lastPt = os.path.join(weightsPath,folder,'weights','last.pt')
                if folder.startswith('exp') and (suffix == '' or suffix.isdigit()) and os.path.exists(lastPt):
                    runs[int(suffix or 1)] = lastPt
        if runs:
            # compare the run numbers as integers, so that exp10 beats exp9
            return runs[max(runs)]
        # no trained weights yet: use the pretrained coco weights
        return os.path.join(self.__algo,'yolov5s.pt')
        
    
    def train(self, data: str, weights: str=None, batch: int=16, epochs: int=3, workers: int=0, save_period: int=1):
        if weights is None:
            weights = self.getLastWeights()
        # self.img is already a string (see __init__)
        exeStr = self.__exePre+'train.py --img '+self.img+' --batch '+str(batch)+' --epochs '+str(epochs)+' --data '+data+' --weights '+weights+' --cache --workers '+str(workers)+' --save-period '+str(save_period)
        print(exeStr)
        subprocess.run(exeStr, shell=True)

    def liveCameraPrediction(self, weights: str=None, conf: float=None):
        if conf is None:
            conf = self.conf
        if weights is None:
            weights = self.getLastWeights()
        # --source 0 selects the default webcam
        exeStr = self.__exePre+'detect.py --source 0 --weights '+weights+' --conf '+str(conf)
        print(exeStr)
        subprocess.run(exeStr, shell=True)

    def predict(self, dataYAML: str, weights: str=None, conf: float=None):
        if conf is None:
            # keep all detections, so that the evaluation sees the full range of confidence values
            conf = 0
        if weights is None:
            weights = self.getLastWeights()
        exeStr = self.__exePre+'val.py --weights '+weights+' --data '+dataYAML+' --img '+self.img+' --conf '+str(conf)
        print(exeStr)
        subprocess.run(exeStr, shell=True)




First we define our 3 label names and the paths to the images and labels.
This is done in the dataset.yaml file.

Make sure to use the correct mapping between class ID and label:
- 0 -> "rock"
- 1 -> "paper"
- 2 -> "scissor"
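
As a rough illustration, the content of such a file could be generated as follows. This is a sketch under assumptions, not the file used by the authors: the keys `path`, `train`, `val`, `test` and `names` follow the layout of the dataset files shipped with the ultralytics repository, `build_dataset_yaml` is a hypothetical helper, and the root path `../datasets/gestures` is a guess that has to match where the training script looks for the data on your machine.

```python
# class names in the same order as they were created during labeling
labels = ['rock', 'paper', 'scissor']

def build_dataset_yaml(root: str, names: list) -> str:
    """Build the text of a YOLOv5-style dataset file (assumed layout)."""
    lines = [
        f"path: {root}",           # dataset root folder (assumption)
        "train: images/train",     # sub-folders created in section 2
        "val: images/val",
        "test: images/test-dev",
        "names:",
    ]
    # the position in the list becomes the class ID
    lines += [f"  {class_id}: {name}" for class_id, name in enumerate(names)]
    return "\n".join(lines) + "\n"

yaml_text = build_dataset_yaml('../datasets/gestures', labels)
print(yaml_text)
```

Writing `yaml_text` to `dataset.yaml` would then give the training script a file with the class IDs in the same order as above.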

Now we train our model.
We use the dataset.yaml file to define the paths to the images and labels, and the number of classes.

For other parameter configurations, refer to:
https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data

### Monitoring
It is important to monitor the training process to ensure that the model is training properly.
To do so, we recommend [Weights and Biases](https://wandb.ai/) (or [tensorboard](https://www.tensorflow.org/tensorboard)).
Both tools keep track of the training process and automatically log the results.
#### Weights and Biases
1. Create an account on [Weights and Biases](https://wandb.ai/)
2. Install the wandb package `pip install wandb`
3. Login to your account `wandb login`
4. Run the training script with the `--project` flag `python train.py --project <project_name>`
5. Go to your [Weights and Biases](https://wandb.ai/) dashboard to view the results

For more information on the YOLOv5 integration with Weights and Biases, refer to the [integration guide](https://docs.wandb.ai/guides/integrations/yolov5).


In [None]:
## with usual commands:
# Train YOLOv5s on our dataset for 3 epochs, starting from the pretrained weights
#!cd yolov5 && python train.py --img 640 --batch 16 --epochs 3 --data ../dataset.yaml --weights yolov5s.pt --cache --workers 0 --save-period 1
# if you want to continue training from previously trained weights, run e.g.
#!cd yolov5 && python train.py --img 640 --batch 16 --epochs 2 --data ../dataset.yaml --weights runs/train/exp3/weights/last.pt --cache --workers 0 --save-period 1

## with our class
# usually we need to train more epochs than 3!
model = yolov5()
model.train('dataset.yaml')

## 4. Inference
Now that we have trained our model, we can use it to detect hand gestures.
The `detect.py` script runs the detection, here on the live webcam stream:

1. The `--source 0` argument specifies that we want to use the webcam as input.
2. The `--weights path.pt` argument specifies the path to the weights of the model.
3. The `--conf 0.X` argument specifies the confidence threshold.

The confidence threshold determines the minimum confidence score for a bounding box to be considered as a detection. If the confidence score is below the threshold, the bounding box will be ignored. This is useful to filter out false positives.
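
The effect of the threshold can be illustrated with a few made-up detections. This is only a sketch of the filtering idea, not the code inside `detect.py`: the detections, the helper `filter_by_confidence` and the choice to keep scores equal to the threshold are assumptions.

```python
# hypothetical detections: (label, confidence score)
detections = [("rock", 0.91), ("paper", 0.35), ("scissor", 0.62), ("rock", 0.12)]

def filter_by_confidence(dets, conf: float = 0.4):
    """Keep only the detections whose score reaches the threshold."""
    return [d for d in dets if d[1] >= conf]

print(filter_by_confidence(detections))            # [('rock', 0.91), ('scissor', 0.62)]
print(len(filter_by_confidence(detections, 0.1)))  # 4
```

A higher threshold removes more uncertain boxes but may also drop correct detections, so it is worth trying a few values with the live camera.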

In [None]:
# test your model with real-time predictions from your camera
#  (switch to the Python window that pops up to see the camera stream)
model.liveCameraPrediction()

In [None]:
# predict on your test-dev data and look at the mAP
#  remember: don't set any conf value!
#            The mAP is computed over the whole range of conf values, which is only possible with conf = 0
model.predict('eval.yaml')
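
Why the threshold matters for this metric can be seen in a simplified example. The sketch below computes the average precision of a single class as a plain step sum under the precision-recall curve. It is not the implementation in `val.py` (which also interpolates the curve and averages over classes and IoU thresholds); the function `average_precision` and the detections are made up.

```python
def average_precision(detections, num_ground_truth: int, conf: float = 0.0) -> float:
    """Area under the precision-recall curve (simple step sum, no interpolation).

    detections: list of (confidence, is_true_positive) pairs for one class.
    """
    # drop detections below the threshold, then rank the rest by confidence
    kept = sorted((d for d in detections if d[0] >= conf), reverse=True)
    true_positives = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(kept, start=1):
        if is_tp:
            true_positives += 1
            # each true positive raises recall by 1 / num_ground_truth;
            # weight that step with the precision at this rank
            ap += (true_positives / rank) / num_ground_truth
    return ap

# hypothetical detections for one class with 4 ground-truth boxes
dets = [(0.9, True), (0.8, True), (0.6, False), (0.3, True), (0.2, True)]
print(average_precision(dets, 4))            # uses all detections
print(average_precision(dets, 4, conf=0.5))  # the low-confidence hits are lost
```

With a threshold of 0.5 the two correct low-confidence detections never enter the curve, so the score drops even though the model itself is unchanged.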

Send us the ipynb notebook and your best YOLO weights (.pt file).

We then predict your score on a test set that is unknown to you.