# Introduction

In this notebook, I describe the pipeline of my solution to the gesture recognition problem ✌🏻. You can find more information about the task [here](https://boosters.pro/championship/machinescansee2021/overview).

My [Github Repo](https://github.com/gorodion/GestureRecognition) 🌟

Let's go!

# Converting hand datasets

In the first part we will download the hand datasets and convert their annotations to CSV format

You can find the converted result [here](https://drive.google.com/file/d/14zGOEpDfGEb_Chd4fbPYyAj0JoKcBrVS/view?usp=sharing)

## Import

In [1]:
from tqdm.notebook import tqdm
import cv2 as cv
import matplotlib.pyplot as plt
from pathlib import Path
import skimage.io
import pandas as pd
import scipy.io

def read(path, as_gray=False):
    return skimage.io.imread(path, as_gray=as_gray)

def show(img):
    plt.imshow(img)
    plt.grid()
    plt.axis('off')
    plt.show()
    
plt.rcParams['figure.figsize'] = (8, 8)
plt.style.use('dark_background')

In [2]:
HAND_DIR = Path('hand_detection')
HAND_DATA = HAND_DIR / 'data'

for path in (HAND_DIR, HAND_DATA):
    path.mkdir(parents=True, exist_ok=True)

## Downloading dataset

You can find the first dataset [here](https://www.robots.ox.ac.uk/~vgg/data/hands/)

In [3]:
!wget -O hand_dataset.tar.gz https://www.robots.ox.ac.uk/~vgg/data/hands/downloads/hand_dataset.tar.gz
!tar xzf hand_dataset.tar.gz -C $HAND_DATA

--2021-07-27 20:20:39--  https://www.robots.ox.ac.uk/~vgg/data/hands/downloads/hand_dataset.tar.gz
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 250468306 (239M) [application/x-gzip]
Saving to: ‘hand_dataset.tar.gz’


2021-07-27 20:20:47 (28.6 MB/s) - ‘hand_dataset.tar.gz’ saved [250468306/250468306]



The second one is [here](https://www3.cs.stonybrook.edu/~cvl/projects/hand_det_attention/) (we only use the COCO-Hand dataset)

In [4]:
!wget -O coco-hands.zip http://vision.cs.stonybrook.edu/~supreeth/COCO-Hand.zip
!unzip -q coco-hands.zip -d $HAND_DATA

--2021-07-27 20:20:51--  http://vision.cs.stonybrook.edu/~supreeth/COCO-Hand.zip
Resolving vision.cs.stonybrook.edu (vision.cs.stonybrook.edu)... 130.245.4.232
Connecting to vision.cs.stonybrook.edu (vision.cs.stonybrook.edu)|130.245.4.232|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1296190351 (1.2G) [application/zip]
Saving to: ‘coco-hands.zip’


2021-07-27 20:21:32 (30.2 MB/s) - ‘coco-hands.zip’ saved [1296190351/1296190351]



## Converting datasets

### First dataset

In [5]:
def extract_resolutions(images_path: Path):
    resols = []
    for i in tqdm(list(images_path.glob(f'*.jpg'))):
        resols.append((i.name, *read(i).shape[:2]))

    resols = pd.DataFrame(resols, columns=['id', 'height', 'width'])
    return resols

def convert2csv(images_path, annot_path, save_path):
    tqdm.write('Extracting annotations')

    annot = []
    for mat_path in tqdm(list(annot_path.glob('*.mat'))):
        mat = scipy.io.loadmat(mat_path)
        filename = mat_path.name.replace('.mat', '.jpg')
        for hand in mat['boxes'][0]:
            # the first four fields are the corner points, each stored as (y, x)
            box = list(hand[0][0])[:4]
            xmin = round(min(x[0][1] for x in box))
            xmax = round(max(x[0][1] for x in box))
            ymin = round(min(x[0][0] for x in box))
            ymax = round(max(x[0][0] for x in box))
            annot.append((filename, xmin, ymin, xmax, ymax))

    df = pd.DataFrame(annot, columns=['id', 'xmin', 'ymin', 'xmax', 'ymax'])
    
    tqdm.write('Extracting resolutions')
    resols = extract_resolutions(images_path)

    # merging
    df = pd.merge(df, resols, how='left')

    df.to_csv(save_path, index=False)
    return df
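
The box extraction above reduces each annotated hand, stored as four corner points in `(y, x)` order, to an axis-aligned rectangle. Below is a minimal standalone sketch of that reduction. The helper name `corners_to_bbox` and the corner coordinates are made up for illustration and are not part of the dataset or the notebook.

```python
def corners_to_bbox(corners):
    """Axis-aligned bounding box of (y, x) corner points.

    Returns (xmin, ymin, xmax, ymax) rounded to integers.
    """
    ys = [y for y, _ in corners]
    xs = [x for _, x in corners]
    return round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys))


# Illustrative values only: a slightly rotated quadrilateral, (y, x) per corner.
corners = [(120.4, 210.7), (118.2, 260.1), (171.9, 262.3), (174.6, 212.8)]
bbox = corners_to_bbox(corners)
print(bbox)  # (211, 118, 262, 175)
```

Taking the min/max over the corners loses the rotation of the original quadrilateral, which is acceptable here because the detector is trained on axis-aligned boxes.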

In [6]:
data_subdir = HAND_DATA / 'hand_dataset'

for part in ('training', 'validation', 'test'):
    print(part, 'dataset converting..')
    subdir = data_subdir / f'{part}_dataset/{part}_data'
    images_path = subdir / 'images'
    annot_path = subdir / 'annotations'
    save_path = subdir / 'annotations.csv'
    convert2csv(images_path, annot_path, save_path)

training dataset converting..
Extracting annotations


HBox(children=(FloatProgress(value=0.0, max=4069.0), HTML(value='')))


Extracting resolutions


HBox(children=(FloatProgress(value=0.0, max=4069.0), HTML(value='')))


validation dataset converting..
Extracting annotations


HBox(children=(FloatProgress(value=0.0, max=738.0), HTML(value='')))


Extracting resolutions


HBox(children=(FloatProgress(value=0.0, max=738.0), HTML(value='')))


test dataset converting..
Extracting annotations


HBox(children=(FloatProgress(value=0.0, max=821.0), HTML(value='')))


Extracting resolutions


HBox(children=(FloatProgress(value=0.0, max=821.0), HTML(value='')))




### Second dataset

Extract annotations

In [7]:
coco_path = HAND_DATA / 'COCO-Hand/COCO-Hand-S'
annot = pd.read_csv(coco_path / 'COCO-Hand-S_annotations.txt', header=None)
annot = annot.iloc[:, :5]
annot.columns = ['id', 'xmin', 'xmax', 'ymin', 'ymax']
annot = annot[['id', 'xmin', 'ymin', 'xmax', 'ymax']]

Extract resolutions, merge and save dataframe

In [8]:
img_path = coco_path / 'COCO-Hand-S_Images'
resols = [(i, *read(img_path / i).shape[:2]) for i in tqdm(annot.id.unique())]
resols = pd.DataFrame(resols, columns=['id', 'height', 'width'])

annot = pd.merge(annot, resols, how='left')
annot.to_csv(coco_path / 'annotations.csv', index=False)

HBox(children=(FloatProgress(value=0.0, max=4534.0), HTML(value='')))




## Location changes

### First dataset

In [9]:
data_subdir = HAND_DATA / 'hand_dataset'
for part in ('training', 'validation', 'test'):
    subdir = data_subdir / f'{part}_dataset/{part}_data'
    save_subdir = data_subdir / part

    !rm -r {subdir / 'annotations'}
    !mv {subdir} {save_subdir}
    !rm -r {subdir.parent}

!rm -r {data_subdir / 'evaluation_code'}

### Second dataset

In [10]:
!mv {img_path} {img_path.parent / 'images'} 
!mv {coco_path} {HAND_DATA / 'coco_hand'}
!rm -r {coco_path.parent}

### Directory tree

In [None]:
!sudo apt-get install tree

The resulting directory tree

In [None]:
!tree {HAND_DATA} -L 4 -I '*.jpg|*.mat|*.m|*.txt'

/content/hand_detection/data
├── coco_hand
│   ├── annotations.csv
│   └── images
└── hand_dataset
    ├── test
    │   ├── annotations.csv
    │   └── images
    ├── training
    │   ├── annotations.csv
    │   └── images
    └── validation
        ├── annotations.csv
        └── images

9 directories, 4 files


## Overview

In [13]:
def visualize(df: pd.DataFrame, images_path: Path, n=5):
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for name, vals in df.groupby('id').apply(lambda x: list(x.values)).sample(n).items():
        img = read(images_path / name)
        for val in vals:
            val = val[1:]
            cv.rectangle(img, (val[0], val[1]), (val[2], val[3]), 255, 2)
        show(img)

### First dataset

In [None]:
path = HAND_DATA / 'hand_dataset/training'
train = pd.read_csv(path / 'annotations.csv')
images_path = path / 'images'
visualize(train, images_path)

### Second dataset

In [None]:
path = HAND_DATA / 'coco_hand'
coco_df = pd.read_csv(path / 'annotations.csv')
img_path = Path(path / 'images')
visualize(coco_df, img_path)

In this part we converted the annotations to CSV format. Let's move on!

# Hand detection training

In this part we will train a YOLOv5 model to detect hands

<font color='red'>Please, specify <u>absolute</u> path to your project in the following cell</font>

In [12]:
%env PROJECT_DIR=/content

env: PROJECT_DIR=/content


## Installing & Import

In [13]:
!git clone https://github.com/ultralytics/yolov5  # clone repo
%pip install -qr yolov5/requirements.txt # install dependencies

Cloning into 'yolov5'...
remote: Enumerating objects: 8459, done.
remote: Counting objects: 100% (173/173), done.
remote: Compressing objects: 100% (121/121), done.
remote: Total 8459 (delta 84), reused 105 (delta 52), pack-reused 8286
Receiving objects: 100% (8459/8459), 9.58 MiB | 28.02 MiB/s, done.
Resolving deltas: 100% (5826/5826), done.


In [14]:
from tqdm.notebook import tqdm
import os
from IPython.display import clear_output
import cv2 as cv
import skimage.io
import matplotlib.pyplot as plt
from pathlib import Path
import random
import pickle
from functools import partial
import pandas as pd
import numpy as np
import yaml
import shutil

def read(path, as_gray=False):
    return skimage.io.imread(path, as_gray=as_gray)

def show(img):
    plt.imshow(img)
    plt.grid()
    plt.axis('off')
    plt.show()
    
plt.rcParams['figure.figsize'] = (8, 8)
plt.style.use('dark_background')

In [15]:
PROJECT_DIR = Path(os.environ['PROJECT_DIR'])
HAND_DIR = PROJECT_DIR / 'hand_detection'
HAND_DATA = HAND_DIR / 'data'
HAND_MODELS = HAND_DIR / 'models'

HAND_DATA1 = HAND_DATA / 'hand_dataset'
HAND_DATA2 = HAND_DATA / 'coco_hand'

You should have the following directory layout

In [None]:
# !tree {HAND_DATA} -L 4 -I '*.jpg|*.mat|*.m|*.txt'

/content/hand_detection/data
├── coco_hand
│   ├── annotations.csv
│   └── images
└── hand_dataset
    ├── test
    │   ├── annotations.csv
    │   └── images
    ├── training
    │   ├── annotations.csv
    │   └── images
    └── validation
        ├── annotations.csv
        └── images

9 directories, 4 files


If you don't have these files, you can [download](https://drive.google.com/file/d/1-Zih6R3hXILx604NUmZEn372QWG6EgyJ/view?usp=sharing) them and unzip them into the $HAND_DIR folder, or uncomment and run the following cell

In [16]:
# !gdown --id 1-Zih6R3hXILx604NUmZEn372QWG6EgyJ
# !unzip -q hand_data.zip -d $HAND_DIR

Downloading...
From: https://drive.google.com/uc?id=1-Zih6R3hXILx604NUmZEn372QWG6EgyJ
To: /content/hand_data.zip
602MB [00:04, 128MB/s]


## Data preparation

Here we will convert our datasets to the YOLO label format

In [17]:
def convert2coco(df: pd.DataFrame, save_dir: Path):
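    '''
    Write one YOLO-format label file per image: `class x_center y_center width height`,
    normalized by the image size (despite the function name, this is not COCO JSON)
    '''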
    os.makedirs(save_dir, exist_ok=True)
    cls_idx = 0 # since we only have one class
    for name, data in tqdm(df.groupby('id')):
        with open(save_dir / (name[:-4] + '.txt'), 'w') as f:
            for _, row in data.iterrows():
                x = row.width
                y = row.height
                x_center = (row.xmin + row.xmax) / (2 * x)
                y_center = (row.ymin + row.ymax) / (2 * y)
                width = (row.xmax - row.xmin) / x
                height = (row.ymax - row.ymin) / y
                assert all(0. <= i <= 1. for i in (x_center, y_center, width, height)), f'Invalid annotation {name}'
                print(cls_idx, x_center, y_center, width, height, file=f)
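
To make the arithmetic in the conversion above concrete, here is a small worked example for a single box. The helper name `to_yolo` and the numbers are illustrative assumptions. The formulas are the same ones used in `convert2coco`.

```python
def to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space corner box to normalized (x_center, y_center, width, height)."""
    x_center = (xmin + xmax) / (2 * img_w)
    y_center = (ymin + ymax) / (2 * img_h)
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


# Illustrative box in a 640x480 image.
label = to_yolo(160, 120, 480, 360, img_w=640, img_h=480)
print(0, *label)  # class index 0 followed by the four normalized values: 0 0.5 0.5 0.5 0.5
```

All four values must lie in [0, 1], which is exactly what the assertion in the conversion function checks. A box that extends past the image border fails it.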

The first hand dataset

In [18]:
for part in ('training', 'validation', 'test'):
    print(part, 'dataset converting..')
    subdir = HAND_DATA1 / part
    df = pd.read_csv(subdir / 'annotations.csv')
    save_dir = subdir / 'labels'
    convert2coco(df, save_dir)

training dataset converting..


HBox(children=(FloatProgress(value=0.0, max=4069.0), HTML(value='')))


validation dataset converting..


HBox(children=(FloatProgress(value=0.0, max=738.0), HTML(value='')))


test dataset converting..


HBox(children=(FloatProgress(value=0.0, max=821.0), HTML(value='')))




The second dataset (COCO-Hand)

In [19]:
coco_df = pd.read_csv(HAND_DATA2 / 'annotations.csv')
coco_save_dir = HAND_DATA2 / 'labels'

# image with invalid annotation
coco_df = coco_df[coco_df.id != '000000038031.jpg']

convert2coco(coco_df, coco_save_dir)

HBox(children=(FloatProgress(value=0.0, max=4533.0), HTML(value='')))




We create a configuration file that contains information about the dataset 

In [20]:
data_config = {
    'names': ['hand'],
    'nc': 1,
    'train': [
              f'{HAND_DATA1}/training/images',
              f'{HAND_DATA2}/images'
              ],
    'val': f'{HAND_DATA1}/validation/images'
}

In [21]:
with open('yolov5/data/hand.yaml', 'w') as f:
    yaml.dump(data_config, f, default_flow_style=False)

## Train

Here we train the yolov5x model for only 3 epochs, which is quite enough. The model's weights will be saved in the directory $HAND_MODELS/yolov5x/weights/

In [None]:
%cd yolov5

!python train.py --img 640 --batch 8 --epochs 3 \
    --data data/hand.yaml \
    --cfg models/yolov5x.yaml \
    --weights yolov5x.pt \
    --project $HAND_MODELS \
    --name yolov5x

# don't forget to return to $PROJECT_DIR
%cd ..

## Overview of the results

Let's look at the results on the test dataset

In [None]:
model_path = HAND_MODELS / 'yolov5x/weights/best.pt' 
source_path = HAND_DATA1 / 'test/images'
sample_dir = Path('sample')
sample_dir.mkdir(exist_ok=True)

# random.sample draws without replacement, so we always get 16 distinct images
for filename in random.sample(list(source_path.glob('*.jpg')), k=16):
    shutil.copy(filename, sample_dir / filename.name)

In [None]:
!python yolov5/detect.py \
    --weights {model_path} \
    --source {sample_dir} \
    --img 640 --conf 0.5

[34m[1mdetect: [0mweights=['/content/hand_detection/models/yolov5x/weights/best.pt'], source=sample, imgsz=640, conf_thres=0.5, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False
[31m[1mrequirements:[0m /content/requirements.txt not found, check failed.
YOLOv5 🚀 v5.0-313-g6e4358f torch 1.9.0+cu102 CUDA:0 (Tesla V100-SXM2-16GB, 16160.5MB)

Fusing layers... 
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Model Summary: 476 layers, 87198694 parameters, 0 gradients, 217.1 GFLOPs
image 1/16 /content/sample/VOC2007_104.jpg: 640x480 1 hand, Done. (0.030s)
image 2/16 /content/sample/VOC2007_140.jpg: 448x640 1 hand, Done. (0.022s)
image 3/16 /content/sample/VOC2007_187.jpg: 480x640 1 hand, Done. (

In [None]:
_, axs = plt.subplots(4, 4, figsize=(20,20))
axs = axs.ravel()

# don't forget to change to your path with detections
detections_dir = Path('/content/runs/detect/exp')

for i, filename in enumerate(detections_dir.glob('*.jpg')):
    axs[i].imshow(read(filename))
    axs[i].axis('off')
    if i == 15: break

Not bad. Move on!

# Hand detection inference

In this part we will get predictions from the trained YOLOv5 hand detector for the images of the gesture dataset provided in the [competition](https://boosters.pro/championship/machinescansee2021/data)

For this part you should already have a trained hand detection model. You can find a trained model [here](https://drive.google.com/file/d/1-CELzTRZObz9dGD28pB0xqeKrUKTtJc5/view?usp=sharing)

In [None]:
# uncomment and run to download hand detection model
# !gdown --id 1-CELzTRZObz9dGD28pB0xqeKrUKTtJc5

<font color='red'>Please, specify paths to your project and hand detector model in the following cell</font>

In [3]:
%env PROJECT_DIR=/content
%env HAND_MODEL=best.pt

env: PROJECT_DIR=/content
env: HAND_MODEL=best.pt


## Installing & Import

In [None]:
!git clone https://github.com/ultralytics/yolov5  # clone repo
%pip install -qr yolov5/requirements.txt # install dependencies

Cloning into 'yolov5'...
remote: Enumerating objects: 8396, done.
remote: Counting objects: 100% (110/110), done.
remote: Compressing objects: 100% (72/72), done.
remote: Total 8396 (delta 56), reused 76 (delta 38), pack-reused 8286
Receiving objects: 100% (8396/8396), 9.30 MiB | 15.95 MiB/s, done.
Resolving deltas: 100% (5796/5796), done.
     |████████████████████████████████| 636 kB 15.5 MB/s

## Hand detector inference

In [4]:
from tqdm.notebook import tqdm
import os, sys, subprocess
from IPython.display import clear_output
import cv2 as cv
import skimage.io
import matplotlib.pyplot as plt
from pathlib import Path
import random
import pickle
from functools import partial
import pandas as pd
import numpy as np
from sklearn.model_selection import GroupKFold

def read(path, as_gray=False):
    return skimage.io.imread(path, as_gray=as_gray)

def show(img):
    plt.imshow(img)
    plt.grid()
    plt.axis('off')
    plt.show()
    
plt.rcParams['figure.figsize'] = (8, 8)
plt.style.use('dark_background')

In [5]:
PROJECT_DIR = Path(os.environ['PROJECT_DIR'])
GESTURE_DIR = PROJECT_DIR / 'gesture_clf'
GESTURE_DATA = GESTURE_DIR / 'data'
GESTURE_DETS = GESTURE_DATA / 'detections/labels'
HAND_MODEL = os.environ['HAND_MODEL']

for path in (PROJECT_DIR, GESTURE_DIR, GESTURE_DATA):
    path.mkdir(parents=True, exist_ok=True)

## Downloading & unzipping gesture dataset

Again you can find all the data [here](https://boosters.pro/championship/machinescansee2021/data)

For our example we will download and unzip a small part of the gesture dataset

In [None]:
# for i in range(1, 9):
#   !wget -O $GESTURE_DATA/data{i}.zip https://boosters.pro/api/ch/files/pub/train_data{i}.zip  

!wget -O $GESTURE_DATA/data9.zip https://boosters.pro/api/ch/files/pub/train_data9.zip

--2021-07-25 13:16:04--  https://boosters.pro/api/ch/files/pub/train_data9.zip
Resolving boosters.pro (boosters.pro)... 91.206.14.169
Connecting to boosters.pro (boosters.pro)|91.206.14.169|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2525369522 (2.4G) [application/zip]
Saving to: ‘gesture_clf/data/data9.zip’


2021-07-25 13:19:59 (10.3 MB/s) - ‘gesture_clf/data/data9.zip’ saved [2525369522/2525369522]



In [19]:
# for i in range(1, 9):
#     !unzip -q $GESTURE_DATA/data{i}.zip -d $GESTURE_DATA/unzipped
    
!unzip -q $GESTURE_DATA/data9.zip -d $GESTURE_DATA/unzipped

Let's download the CSV file with the gesture labels

In [None]:
!wget -O $GESTURE_DATA/train.csv https://boosters.pro/api/ch/files/pub/train.csv

--2021-07-25 14:26:33--  https://boosters.pro/api/ch/files/pub/train.csv
Resolving boosters.pro (boosters.pro)... 91.206.14.169
Connecting to boosters.pro (boosters.pro)|91.206.14.169|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23339008 (22M) [application/octet-stream]
Saving to: ‘/content/gesture_clf/data/train.csv’


2021-07-25 14:26:33 (47.7 MB/s) - ‘/content/gesture_clf/data/train.csv’ saved [23339008/23339008]



## Inference

In [None]:
IMAGES_PATH = Path(f"{GESTURE_DATA}/unzipped")

In [None]:
subprocess.check_call([
    sys.executable, 'yolov5/detect.py', 
    '--weights', HAND_MODEL, 
    '--source', str(IMAGES_PATH / '**/*.jpg'), 
    '--img', '640', 
    '--conf', '0.3', 
    '--save-txt', '--save-conf', '--nosave', '--exist-ok',
    '--project', GESTURE_DATA,
    '--name', 'detections'])

## Reading csv file with labels

In [None]:
df = pd.read_csv(f'{GESTURE_DATA}/train.csv')

Now we keep only the frames in which hands were detected. **Note:** you can skip the cell below if you downloaded all the gesture data.

In [None]:
labels = [i.stem for i in GESTURE_DETS.glob('*.txt')]
df = df[df.frame_path.apply(lambda x: Path(x).stem).isin(labels)]

## Data splitting

Here we split the data by video using GroupKFold, i.e. the training and validation sets share no videos

In [None]:
def group_kfold(df: pd.DataFrame, groups: pd.Series):
    gkf = GroupKFold(n_splits=5)
    train_idx, test_idx = next(gkf.split(df, df, groups))
    return df.iloc[train_idx], df.iloc[test_idx]

train_df, val_df = group_kfold(df, df.video_name)
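
The point of a group-wise split is that all frames of one video land on the same side. The sketch below illustrates the idea in pure Python with a greedy assignment of groups to folds. It is a simplified illustration with a hypothetical helper (`group_folds`) and made-up video names, not scikit-learn's implementation, which the notebook actually uses.

```python
from collections import Counter


def group_folds(groups, n_splits):
    """Assign every group to one fold, largest groups first, always to the lightest fold."""
    sizes = Counter(groups)
    fold_load = [0] * n_splits
    fold_of_group = {}
    for group, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = fold_load.index(min(fold_load))
        fold_of_group[group] = lightest
        fold_load[lightest] += size
    return [fold_of_group[g] for g in groups]


# Hypothetical video names, one entry per frame.
videos = ["a", "a", "a", "b", "b", "c", "c", "d", "e", "e"]
folds = group_folds(videos, n_splits=5)

# Use fold 0 for validation and the remaining folds for training.
val_idx = [i for i, f in enumerate(folds) if f == 0]
train_idx = [i for i, f in enumerate(folds) if f != 0]
train_videos = {videos[i] for i in train_idx}
val_videos = {videos[i] for i in val_idx}
print(train_videos & val_videos)  # set(): no video appears on both sides
```

Splitting frame by frame instead would put near-identical neighboring frames of one video into both sets and inflate the validation accuracy.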

## Postprocessing

Let's parse the txt files with the hand detections

In [None]:
def xywh2xyxy(x, y, w, h):
    # center/size box -> corner box
    x0, y0 = x - w / 2, y - h / 2
    x1, y1 = x + w / 2, y + h / 2
    return x0, y0, x1, y1

def expand_box(x0, y0, x1, y1, n=2.5):
    deltaX = (x1 - x0) / n
    deltaY = (y1 - y0) / n
    x0 = np.clip(x0-deltaX, 0, None)
    x1 = np.clip(x1+deltaX, None, 1)
    y0 = np.clip(y0-deltaY, 0, None)
    y1 = np.clip(y1+deltaY, None, 1)
    return x0, y0, x1, y1
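
As a worked example of the two helpers above, the sketch below re-implements them with the built-in `min`/`max` so that it runs without NumPy. The names `xywh_to_corners` and `expand_clamped` and the sample boxes are hypothetical. The arithmetic mirrors `xywh2xyxy` and `expand_box`.

```python
def xywh_to_corners(x, y, w, h):
    """Center/size box -> corner box (all values normalized to [0, 1])."""
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def expand_clamped(x0, y0, x1, y1, n=2.5):
    """Grow the box by 1/n of its size on every side, clamped to the image."""
    dx, dy = (x1 - x0) / n, (y1 - y0) / n
    return max(x0 - dx, 0.0), max(y0 - dy, 0.0), min(x1 + dx, 1.0), min(y1 + dy, 1.0)


box = xywh_to_corners(0.5, 0.5, 0.2, 0.4)
print([round(v, 3) for v in box])                   # [0.4, 0.3, 0.6, 0.7]
print([round(v, 3) for v in expand_clamped(*box)])  # [0.32, 0.14, 0.68, 0.86]

# A box near the border is clamped rather than pushed outside the image.
edge = expand_clamped(0.0, 0.1, 0.3, 0.9)
print([round(v, 3) for v in edge])                  # [0.0, 0.0, 0.42, 1.0]
```

With `n=2.5` each side grows by 40% of the box size, so the crop passed to the classifier is 1.8 times wider and taller than the detection unless it hits the image border.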

In [None]:
def extract_boxes(df, labels_path, fn=None):
    output = []
    for _, row in tqdm(df.iterrows(), total=len(df)):
        label_file = labels_path / (Path(row.frame_path).stem + '.txt')
        # no hands were detected in this frame
        if not label_file.is_file():
            # if the frame does contain a gesture, remember it for future fine-tuning
            if row.class_name != 'no_gesture' and fn is not None:
                fn.append(row.frame_path)
            continue

        # parsing txt file with detections
        with open(label_file) as f:
            for i, line in enumerate(f.read().splitlines()):
                _, x, y, wid, hei, conf = map(float, line.split())
                x0, y0, x1, y1 = xywh2xyxy(x, y, wid, hei)
                x0, y0, x1, y1 = expand_box(x0, y0, x1, y1)
                output.append([row.frame_path, x0, y0, x1, y1, row.class_name, conf])
    output = pd.DataFrame(output, columns=['frame_path', 'xmin', 'ymin', 'xmax', 'ymax', 'class_name', 'conf'])
    return output
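
Each line of a detection file, as unpacked by the parser above, holds six whitespace-separated numbers: class index, normalized center x, center y, width, height, and confidence. The following minimal sketch parses one such line. The helper name `parse_detection` and the sample line are made up for illustration.

```python
def parse_detection(line):
    """Split one label line into (class_id, x, y, w, h, conf)."""
    cls, x, y, w, h, conf = map(float, line.split())
    return int(cls), x, y, w, h, conf


# Made-up example line in the same field order the parser above expects.
sample = "0 0.512 0.431 0.120 0.185 0.87"
cls_id, x, y, w, h, conf = parse_detection(sample)
print(cls_id, (x, y, w, h), conf)
```

The confidence column is present only because the detector was run with `--save-conf`. Without that flag a line has five fields and the six-way unpacking would raise a `ValueError`.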

In [None]:
fn = [] # False Negatives samples
train_df = extract_boxes(train_df, GESTURE_DETS, fn)
val_df = extract_boxes(val_df, GESTURE_DETS)

HBox(children=(FloatProgress(value=0.0, max=4873.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1218.0), HTML(value='')))




Now we're going to choose a suitable confidence threshold

In [None]:
_, axs = plt.subplots(4, 4, figsize=(20,20))
axs = axs.ravel()

for i, (frame_path, data) in enumerate(train_df.sample(frac=1).groupby('frame_path', sort=False)):
    img = read(IMAGES_PATH / data.frame_path.values[0])
    h, w = img.shape[:2]
    for _, row in data.iterrows():
        x0, x1 = (int(x*w) for x in (row.xmin, row.xmax))
        y0, y1 = (int(y*h) for y in (row.ymin, row.ymax))
        cv.rectangle(img, (x0, y0), (x1, y1), 255, 3)
        # pass thickness and lineType by keyword: the 7th positional argument of putText is the thickness
        cv.putText(img, str(round(row.conf, 2)), (x0, y0), cv.FONT_HERSHEY_SIMPLEX, 2, (255,)*3,
                   thickness=2, lineType=cv.LINE_AA)
    axs[i].imshow(img)
    axs[i].axis('off')
    if i == 15: break

We keep only the uppermost box in each frame (the one with the smallest ymin), assuming that it contains the target gesture (the one specified in train.csv)

In [None]:
def extract_upper_boxes(df, conf, fn=None):
    idxs = []
    for frame_path, data in tqdm(df.groupby('frame_path')):
        cls_name = data.class_name.values[0]
        data = data[data.conf >= conf]
        if len(data) == 0:
            # if not found > conf
            if cls_name != 'no_gesture' and fn is not None:
                fn.append(frame_path)
        else:
            upper_box_idx = data.ymin.idxmin()
            idxs.append(upper_box_idx)
    return df.loc[idxs]
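
The selection rule above can be illustrated on a single frame without pandas. This is a minimal sketch with a hypothetical helper `pick_upper_box` and made-up detections. It applies the same two steps: filter by confidence, then take the smallest `ymin`.

```python
def pick_upper_box(boxes, conf_thr):
    """Return the confident box with the smallest ymin, or None if none passes."""
    confident = [b for b in boxes if b["conf"] >= conf_thr]
    if not confident:
        return None
    return min(confident, key=lambda b: b["ymin"])


# Hypothetical detections for a single frame (normalized coordinates).
frame_boxes = [
    {"ymin": 0.55, "conf": 0.91},  # lower hand, e.g. resting
    {"ymin": 0.20, "conf": 0.64},  # upper hand, assumed to show the gesture
    {"ymin": 0.05, "conf": 0.12},  # spurious detection below the threshold
]
print(pick_upper_box(frame_boxes, conf_thr=0.3))   # {'ymin': 0.2, 'conf': 0.64}
print(pick_upper_box(frame_boxes, conf_thr=0.95))  # None
```

Note that the threshold is applied before the position rule, so a low-confidence detection near the top of the frame cannot displace a confident one.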

In [None]:
train_df = extract_upper_boxes(train_df, 0.3, fn)
val_df = extract_upper_boxes(val_df, 0.3)

HBox(children=(FloatProgress(value=0.0, max=4873.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1218.0), HTML(value='')))




In [None]:
train_df.to_csv(f'{GESTURE_DATA}/train_loc_cls.csv', index=False)
val_df.to_csv(f'{GESTURE_DATA}/val_loc_cls.csv', index=False)

# Gesture classifier

In this part we will train a model to classify gestures

For this part you need CSV annotation files that include the bbox coordinates and the class label for each frame in our gesture dataset. Here are the annotations for the [train](https://drive.google.com/file/d/1-8ACJsRn2r4m1YxTB4QZJBIK6jUjwON3/view?usp=sharing) set and the [validation](https://drive.google.com/file/d/1-F50UxD7llYmDD-_GsLa4WMtbkvX0uAc/view?usp=sharing) set, obtained from data9.zip (only a part of [all the data](https://boosters.pro/championship/machinescansee2021/data))

<font color='red'>Please, specify path to your project in the following cell</font>

In [6]:
%env PROJECT_DIR=/content

env: PROJECT_DIR=/content


## Import

In [7]:
import os
from IPython.display import clear_output
import cv2 as cv
from PIL import Image
import matplotlib.pyplot as plt
from pathlib import Path
import random
import pickle
from functools import partial
import albumentations as A
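# recent albumentations releases ship only ToTensorV2; Normalize already yields floats, so it is equivalent here
from albumentations.pytorch import ToTensorV2 as ToTensor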
from skimage.io import imread
from tqdm.notebook import tqdm
import pandas as pd
from sklearn.metrics import classification_report, roc_curve


import numpy as np
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from torchvision.transforms import functional as F
from torchvision import models
from torchsummary import summary

def read(path, as_gray=False):
    return imread(path, as_gray=as_gray)

def show(img):
    plt.imshow(img)
    plt.grid()
    plt.axis('off')
    plt.show()
    
plt.rcParams['figure.figsize'] = (8, 8)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.style.use('dark_background')

In [8]:
PROJECT_DIR = Path(os.environ['PROJECT_DIR'])
GESTURE_DIR = PROJECT_DIR / 'gesture_clf'
GESTURE_DATA = GESTURE_DIR / 'data'
IMAGES_PATH = GESTURE_DATA / 'unzipped'
TRAIN_CSV = GESTURE_DATA / 'train_loc_cls.csv'
VAL_CSV = GESTURE_DATA / 'val_loc_cls.csv'

for path in (PROJECT_DIR, GESTURE_DIR, GESTURE_DATA):
    path.mkdir(parents=True, exist_ok=True)

You should have the following directory layout

In [None]:
# !tree {GESTURE_DATA} -L 1

/content/gesture_clf/data
├── train_loc_cls.csv
├── unzipped
└── val_loc_cls.csv

1 directory, 2 files


## Data preparation

In [10]:
train_df = pd.read_csv(TRAIN_CSV)
val_df = pd.read_csv(VAL_CSV)

for df in train_df, val_df:
    df['frame_path'] = str(IMAGES_PATH) + '/' + df['frame_path']

In [11]:
CLASSES = ['dislike', 'like', 'mute', 'no_gesture', 'ok', 'stop', 'victory']

class GestDataset(Dataset):
    classes = CLASSES
    cls2idx = {j: i for i, j in enumerate(classes)}

    transform = A.Compose([
        A.Resize(224, 224),
        A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ToTensor()
    ])

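    def __init__(self, df, transform=None, phase='train'):
        assert phase in ('train', 'val', 'test'), 'Phase must be `train`/`val`/`test`'
        self.df = df
        self.phase = phase
        # prepend the optional augmentations to the class-level resize/normalize pipeline
        steps = [t for t in (transform, self.transform) if t is not None]
        self.transform = A.Compose(steps)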

        
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = imread(row.frame_path)
        h, w = img.shape[:2]
        x0, x1 = (int(x*w) for x in (row.xmin, row.xmax))
        y0, y1 = (int(y*h) for y in (row.ymin, row.ymax))
        
        cls = row.class_name
        idx_cls = self.cls2idx[cls]
        img = img[y0:y1, x0:x1]

        if self.phase in ('train', 'val'):
            out_tensor = self.apply_transform(img)
            return out_tensor, idx_cls

        elif self.phase == 'test':
            return self.apply_transform(img)

    def __len__(self):
        return len(self.df)

    def apply_transform(self, img, mask=None):
        transformed = self.transform(image=img)
        return transformed['image']
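
The crop in `__getitem__` maps the normalized box to pixel indices by multiplying by the image size and truncating with `int`. Here is a small worked example of that mapping, with an illustrative image size and a hypothetical helper name `crop_bounds`.

```python
def crop_bounds(box, img_w, img_h):
    """Map a normalized (xmin, ymin, xmax, ymax) box to integer pixel slice bounds."""
    xmin, ymin, xmax, ymax = box
    x0, x1 = int(xmin * img_w), int(xmax * img_w)
    y0, y1 = int(ymin * img_h), int(ymax * img_h)
    return x0, y0, x1, y1


# Illustrative 640x480 frame.
x0, y0, x1, y1 = crop_bounds((0.32, 0.14, 0.68, 0.86), img_w=640, img_h=480)
print(x0, y0, x1, y1)      # 204 67 435 412
print((y1 - y0, x1 - x0))  # crop shape (rows, cols): (345, 231)
```

The resulting slice is `img[y0:y1, x0:x1]`, rows first, which is why the y bounds are scaled by the height and the x bounds by the width.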

In [12]:
BS = 32

train_transform = A.Compose([
    A.Rotate(30),                         
    A.HorizontalFlip(p=0.5),
    A.HueSaturationValue(p=0.5),
    A.Blur(p=0.1),
    A.ToGray(p=0.05),
    A.RandomBrightnessContrast(p=0.3),
])

train_ds = GestDataset(train_df, transform=train_transform, phase='train')
val_ds = GestDataset(val_df, phase='val')


train_dl = torch.utils.data.DataLoader(
    train_ds, batch_size=BS, shuffle=True, num_workers=2, drop_last=True)

val_dl = torch.utils.data.DataLoader(
    val_ds, batch_size=BS, shuffle=True, num_workers=2)

Let's see what the classifier will receive as input

In [13]:
def tensor2img(tensor):
    # undo the ImageNet normalization: CHW float tensor -> HWC uint8 image
    img = tensor.permute(1, 2, 0).numpy() * (0.229, 0.224, 0.225) + (0.485, 0.456, 0.406)
    return (img.clip(0, 1) * 255).astype('uint8')
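
The function above inverts the normalization applied in the dataset transform: `A.Normalize` computes `(pixel / 255 - mean) / std` per channel, so multiplying by `std`, adding `mean`, and scaling by 255 recovers the pixel. The round trip for a single RGB pixel can be checked in plain Python. The helper names and the pixel value are illustrative, and this sketch rounds where the notebook truncates with `astype('uint8')`.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize(pixel):
    """uint8 RGB pixel -> normalized floats, using the ImageNet statistics."""
    return tuple((c / 255 - m) / s for c, m, s in zip(pixel, MEAN, STD))


def denormalize(values):
    """Inverse mapping back to integers in the uint8 range."""
    return tuple(round((v * s + m) * 255) for v, m, s in zip(values, MEAN, STD))


pixel = (200, 128, 30)
print(denormalize(normalize(pixel)))  # (200, 128, 30)
```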

In [None]:
_, axs = plt.subplots(nrows=4, ncols=4, figsize=(16,16))
x, y = next(iter(train_dl))
for x0, y0, ax, _ in zip(x, y, axs.ravel(), range(16)):
    ax.set_title(CLASSES[y0])
    ax.axis('off')
    ax.imshow(tensor2img(x0))

## Model

In [None]:
!git clone https://github.com/rwightman/pytorch-image-models.git
!mv /content/pytorch-image-models/timm ./timm

Cloning into 'pytorch-image-models'...
remote: Enumerating objects: 7503, done.
remote: Counting objects: 100% (1680/1680), done.
remote: Compressing objects: 100% (652/652), done.
remote: Total 7503 (delta 1191), reused 1355 (delta 1015), pack-reused 5823
Receiving objects: 100% (7503/7503), 17.59 MiB | 25.16 MiB/s, done.
Resolving deltas: 100% (5460/5460), done.


We will use the EfficientNetV2-M model

In [15]:
import timm

model = timm.create_model('tf_efficientnetv2_m_in21ft1k', pretrained=False, num_classes=7)
model.to(device);

## Training functions

In [None]:
from functools import partial

def one_epoch(model, loss_fn, opt, dataloader, steps, phase, device):
    epoch_loss, epoch_acc = 0., 0.
    def one_step(X, y):
        nonlocal epoch_loss, epoch_acc
        X = X.to(device)
        y = y.to(device)
        opt.zero_grad()

        y_pred = model(X)
        loss = loss_fn(y_pred, y)
        if phase == 'train':
            loss.backward()
            opt.step()

        epoch_loss += loss.item()
        epoch_acc += (y_pred.argmax(1) == y).float().mean().item()

    if phase == 'train':
        model.train()
    else:
        model.eval()
    
    for step, (X, y) in enumerate(tqdm(dataloader)):
        one_step(X, y)
        if step + 1 == steps: 
            return epoch_loss / steps, epoch_acc / steps
    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)

fit_epoch = partial(one_epoch, phase='train', device=device)
eval_epoch = torch.no_grad()(partial(one_epoch, phase='val', device=device))

In [None]:
def fit(model, loss_fn, opt, num_epoch, train_dl, val_dl, scheduler=None, train_steps=None, val_steps=None, plot_period=None, save_path=None, history=None):
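    # read the sizes from the dataloaders passed in, not from the global train_ds/val_ds
    tqdm.write(f'Train dataset size: {len(train_dl.dataset)}')
    tqdm.write(f'Train dataloader size: {len(train_dl)}')
    tqdm.write(f'Val dataset size: {len(val_dl.dataset)}')
    tqdm.write(f'Val dataloader size: {len(val_dl)}')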

    if train_steps is not None:
        tqdm.write(f'Number of iters per train epoch: {len(train_dl) // train_steps}')
    if val_steps is not None:
        tqdm.write(f'Number of iters per val epoch: {len(val_dl) // val_steps}')

    if history is None:
        history = {stat: {'train': [], 'val': []} for stat in ['loss', 'acc']}
        max_acc = 0.
    else:
        max_acc = max(history['acc']['val'])

    try:
        for epoch in range(num_epoch):
            loss, acc = fit_epoch(model, loss_fn, opt, train_dl, train_steps)
            history['loss']['train'].append(loss)
            history['acc']['train'].append(acc)


            loss, acc = eval_epoch(model, loss_fn, opt, val_dl, val_steps)
            history['loss']['val'].append(loss)
            history['acc']['val'].append(acc)
            tqdm.write(f'val loss: {round(loss, 3)} val acc: {round(acc, 2)}')

            # scheduler
            if scheduler is not None:
                scheduler.step(history['loss']['val'][-1])

            # training visualization
            if plot_period is None or epoch % plot_period == 0:
                plot_lc(history)

            # checkpoint
            if history['acc']['val'][-1] > max_acc:
                max_acc = history['acc']['val'][-1]
                torch.save(model.state_dict(), save_path if save_path is not None else 'model.pth')
                tqdm.write(f'accuracy improved: {max_acc:.0%}. Model saved')

    except KeyboardInterrupt:
        print('keyboard interrupt')
    finally:
        return history

In [None]:
# plot learning curves
def plot_lc(history):
    fig, axes = plt.subplots(ncols=len(history), figsize=(8*len(history),8))
    for ax, (name, vals) in zip(axes, history.items()):
        ax.plot(vals['train'], label=f'Train {name}')
        ax.plot(vals['val'], label=f'Val {name}')
        ax.set_xlabel('Num epoch')
        ax.set_ylabel(name)
        ax.set_title(name + ' graph')
        plt.axis('on')
        ax.legend()
    plt.savefig('learning_curves.png')
    plt.close()

In [None]:
loss_fn = torch.nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, patience=5)

## Train

For our example we will train for only 3 epochs

In [None]:
history = fit(model, loss_fn, opt, 3, train_dl, val_dl, scheduler=scheduler)

## Evaluation

I'll evaluate a model that was trained on **half** of the data. You can find this model [here](https://drive.google.com/file/d/1aY4NxgXx8hY4NI8u-0Slj_d7z8qiRH2S/view?usp=sharing)

In [16]:
model = timm.create_model('tf_efficientnetv2_m_in21ft1k', pretrained=False, num_classes=7)
state_dict = torch.load('effnetv2_m.pth', map_location=device)
model.load_state_dict(state_dict)
model.to(device)
model.eval();

### Metrics

In [None]:
@torch.no_grad()
def predict(dataloader):
    '''
    returns two lists: true labels and predicted probabilities
    '''
    y_true, y_pred = [], []
    for x, y in tqdm(dataloader):
        # use `device` instead of a hard-coded .cuda() so this also runs on CPU
        probs = model(x.to(device)).softmax(1).cpu().numpy()
        y_true.extend(y.tolist())
        y_pred.extend(probs)
    return y_true, y_pred

In [None]:
y_true, y_pred = predict(val_dl)

In [None]:
preds = pd.DataFrame(y_pred, columns=CLASSES)
preds['true'] = [CLASSES[y] for y in y_true]

In [None]:
print(classification_report(preds.true, preds[CLASSES].idxmax(1)))

              precision    recall  f1-score   support

     dislike       0.89      0.78      0.83        60
        like       0.84      0.70      0.77        54
        mute       0.96      0.98      0.97        55
  no_gesture       0.80      0.89      0.84       196
          ok       0.99      0.93      0.96       377
        stop       0.90      0.91      0.90       286
     victory       0.89      0.95      0.92       190

    accuracy                           0.91      1218
   macro avg       0.90      0.88      0.88      1218
weighted avg       0.91      0.91      0.91      1218



Let's calculate and plot the competition metric for our sample data

In [None]:
def plot_metric(preds: pd.DataFrame):
    fig, axs = plt.subplots(2, 3, figsize=(16,10))
    score = 0.
    TARGET_FPR = 0.002
    classes = ('ok', 'victory', 'like', 'dislike', 'stop', 'mute')

    for cls, ax in zip(classes, axs.ravel()):
        true = (preds['true'] == cls).astype(int).values
        pred = preds[cls].values
        fpr, tpr, thr = roc_curve(true, pred)

        if fpr[0] < TARGET_FPR:
            target_tpr = tpr[fpr < TARGET_FPR][-1]
        else:
            target_tpr = 0.0
        score += target_tpr

        ax.plot(fpr, tpr)
        ax.set_title(f'Score for `{cls}` class: {target_tpr:.2f}')
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')

    fig.suptitle(f'Total metric: {score / len(classes):.2f}')

In [None]:
plot_metric(preds)
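
For intuition about a single class score, here is a minimal pure-Python sketch of the quantity that `plot_metric` reads off the ROC curve: the highest TPR reachable while the FPR stays below the target. The function name `tpr_at_fpr` and the toy data are made up for illustration; the notebook itself relies on `roc_curve` for this.

```python
def tpr_at_fpr(y_true, scores, target_fpr):
    """Highest TPR over score thresholds whose FPR is strictly below target_fpr."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_tpr = 0.0
    # try each distinct score as a threshold: predict positive when score >= thr
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        if fp / n_neg < target_fpr:
            best_tpr = max(best_tpr, tp / n_pos)
    return best_tpr


# toy data: 4 positives and 6 negatives with hypothetical scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.05, 0.01]

strict = tpr_at_fpr(y_true, scores, target_fpr=0.002)  # no false positives allowed here
loose = tpr_at_fpr(y_true, scores, target_fpr=0.2)     # one false positive out of 6 is fine
print(strict, loose)
```

In this toy sample a single false positive already pushes the FPR to 1/6, far above 0.002, so the strict score counts only the positives ranked above every negative.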

### Overview

Let's look at the predictions

In [None]:
_, axs = plt.subplots(nrows=4, ncols=4, figsize=(16,16))
x, y = next(iter(val_dl))
with torch.no_grad():
    y_pred = model(x.to(device)).argmax(1).cpu().numpy()

for x0, y0, y_pred0, ax, _ in zip(x, y, y_pred, axs.ravel(), range(16)):
    ax.set_title(CLASSES[y0], color='red' if y_pred0 != y0 else 'white')
    ax.axis('off')
    ax.imshow(tensor2img(x0))

That's all for gesture classification!

# End-to-end inference

In this part, we will write an end-to-end inference function that combines hand localization with classification of each detected hand as a gesture (or no gesture).

Let's download the trained models

In [4]:
!gdown --id 1-CELzTRZObz9dGD28pB0xqeKrUKTtJc5 # best.pt
!gdown --id 1aY4NxgXx8hY4NI8u-0Slj_d7z8qiRH2S # effnetv2_m.pth

Downloading...
From: https://drive.google.com/uc?id=1-CELzTRZObz9dGD28pB0xqeKrUKTtJc5
To: /content/best.pt
175MB [00:01, 132MB/s]
Downloading...
From: https://drive.google.com/uc?id=1aY4NxgXx8hY4NI8u-0Slj_d7z8qiRH2S
To: /content/effnetv2_m.pth
213MB [00:01, 123MB/s]


<font color='red'>Don't forget to specify the paths to the hand detector and the gesture classifier</font>

In [4]:
%env HAND_MODEL=best.pt
%env GESTURE_MODEL=effnetv2_m.pth

env: HAND_MODEL=best.pt
env: GESTURE_MODEL=effnetv2_m.pth


Download the additional module (`timm`)

In [31]:
# timm
!git clone https://github.com/rwightman/pytorch-image-models.git
!mv /content/pytorch-image-models/timm ./timm

Cloning into 'pytorch-image-models'...
remote: Enumerating objects: 7503, done.
remote: Counting objects: 100% (1680/1680), done.
remote: Compressing objects: 100% (658/658), done.
remote: Total 7503 (delta 1193), reused 1349 (delta 1009), pack-reused 5823
Receiving objects: 100% (7503/7503), 17.60 MiB | 20.25 MiB/s, done.
Resolving deltas: 100% (5462/5462), done.


Download an example image

In [142]:
!wget -O example.jpg "https://api.time.com/wp-content/uploads/2014/07/140709-winston-churchill-eisenstedt.jpg?quality=85"

--2021-07-27 18:53:17--  https://api.time.com/wp-content/uploads/2014/07/140709-winston-churchill-eisenstedt.jpg?quality=85
Resolving api.time.com (api.time.com)... 192.0.66.64, 2a04:fa87:fffd::c000:4240
Connecting to api.time.com (api.time.com)|192.0.66.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115202 (113K) [image/jpeg]
Saving to: ‘example.jpg’


2021-07-27 18:53:18 (3.46 MB/s) - ‘example.jpg’ saved [115202/115202]



Here is the inference class

In [1]:
import torch
import skimage.io
import cv2 as cv
import os
import timm
import numpy as np

import albumentations as A
from albumentations.pytorch import ToTensor
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


class Inference:
    CLASSES = ['dislike', 'like', 'mute', 'no_gesture', 'ok', 'stop', 'victory']
    NO_GESTURE_IDX = CLASSES.index('no_gesture')

    clf_transform = A.Compose([
        A.Resize(224, 224),
        A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ToTensor()
    ])

    def __init__(self, hand_conf=0.3):
        '''
        hand_conf: confidence threshold for hand detection
        '''
        self.yolo = self.load_yolo(os.environ['HAND_MODEL'])
        self.clf = self.load_clf(os.environ['GESTURE_MODEL'])
        self.CONF = hand_conf

    @staticmethod
    def load_yolo(yolo_path):
        return torch.hub.load('ultralytics/yolov5', 'custom', path=yolo_path, force_reload=True)

    @staticmethod
    def load_clf(clf_path):
        clf = timm.create_model('tf_efficientnetv2_m_in21ft1k', pretrained=False, num_classes=7)
        state_dict = torch.load(clf_path, map_location=device)
        clf.load_state_dict(state_dict)
        clf.to(device)
        # switch to inference mode (fixed batch-norm statistics, no dropout)
        clf.eval()
        return clf

    @torch.no_grad()
    def clf_predict(self, dataloader):
        preds = []
        for x in dataloader:
            # use `device` instead of a hard-coded .cuda() so this also runs on CPU
            preds.extend(self.clf(x.to(device)).softmax(1).cpu().tolist())
        return preds

    def __call__(self, img_path):
        '''
        Predicts gesture on the image with path `img_path`
        '''
        imgs_paths = [img_path]
        results = self.yolo(imgs_paths)
        # extract and filter bboxes by confidence
        bboxes = [list(map(int, xyxy)) 
            for *xyxy, conf, cls in results.xyxy[0].cpu().numpy() 
                if conf > self.CONF]

        if not bboxes:
            return 'no gesture'

        img = skimage.io.imread(img_path)
        height, width = img.shape[:2]

        # expand bboxes
        expanded_bboxes = [self.expand_box(bbox, height, width) for bbox in bboxes]
        # get crops from the image
        crops = [img[y0:y1, x0:x1] for x0, y0, x1, y1 in expanded_bboxes]
        # transform crops to tensors
        tensors = [self.clf_transform(image=crop)['image'] for crop in crops]
        dl = DataLoader(tensors, batch_size=16)
        # get predictions from gesture classifier
        preds = self.clf_predict(dl)
        
        # filter `no gesture` class
        preds = [pred for pred in preds if np.argmax(pred) != self.NO_GESTURE_IDX]
        if not preds:
            return 'no gesture'

        # get gesture class index with the highest probability across all hands:
        # argmax over the flattened (hands x classes) list, then modulo the class count
        cls_idx = np.argmax(preds) % len(self.CLASSES)
        return self.CLASSES[cls_idx]

    @staticmethod
    def expand_box(xyxy, height, width, n=2.5):
        x0, y0, x1, y1 = xyxy
        deltaX = (x1 - x0) / n
        deltaY = (y1 - y0) / n
        x0 = np.clip(x0-deltaX, 0, None)
        x1 = np.clip(x1+deltaX, None, width)
        y0 = np.clip(y0-deltaY, 0, None)
        y1 = np.clip(y1+deltaY, None, height)
        return list(map(int, (x0, y0, x1, y1)))
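
To make the box expansion concrete, the sketch below restates `expand_box` as a standalone function in plain Python and applies it to made-up coordinates. Using `min`/`max` in place of `np.clip` is my substitution so that the example has no dependencies; the arithmetic is otherwise the same.

```python
def expand_box(xyxy, height, width, n=2.5):
    """Grow a box by 1/n of its size on every side, clipped to the image."""
    x0, y0, x1, y1 = xyxy
    delta_x = (x1 - x0) / n
    delta_y = (y1 - y0) / n
    x0 = max(x0 - delta_x, 0)
    x1 = min(x1 + delta_x, width)
    y0 = max(y0 - delta_y, 0)
    y1 = min(y1 + delta_y, height)
    return [int(v) for v in (x0, y0, x1, y1)]


# hypothetical 100x50 hand box inside a 640x480 image
inner = expand_box([200, 100, 300, 150], height=480, width=640)
# hypothetical box near the top-left corner: the expansion is clipped at 0
corner = expand_box([10, 10, 110, 60], height=480, width=640)
print(inner, corner)
```

With `n=2.5` an unclipped box grows to 1.8 times its original width and height, which gives the classifier some context around the detected hand.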

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 8)
plt.style.use('dark_background')

def show(img):
    plt.imshow(img)
    plt.grid()
    plt.axis('off')
    plt.show()

def read(path):
    return skimage.io.imread(path)

In [5]:
inference = Inference()

Downloading: "https://github.com/ultralytics/yolov5/archive/master.zip" to /root/.cache/torch/hub/master.zip
YOLOv5 🚀 2021-7-27 torch 1.9.0+cu102 CUDA:0 (Tesla P100-PCIE-16GB, 16280.875MB)

Fusing layers... 
Model Summary: 476 layers, 87198694 parameters, 0 gradients
Adding AutoShape... 


In [None]:
img_path = 'example.jpg'
cls_name = inference(img_path)

plt.title(cls_name)
show(read(img_path))
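
The label shown in the title comes from `np.argmax(preds) % len(self.CLASSES)` inside `Inference.__call__`. Without an axis argument, `np.argmax` works on the flattened list of per-hand probability rows, so the remainder recovers the class of the single most confident prediction across all hands. The sketch below reproduces that step in plain Python with made-up probabilities:

```python
CLASSES = ['dislike', 'like', 'mute', 'no_gesture', 'ok', 'stop', 'victory']

# made-up softmax outputs for two detected hands (each row sums to 1)
preds = [
    [0.05, 0.60, 0.05, 0.10, 0.10, 0.05, 0.05],  # hand 0: most likely 'like'
    [0.01, 0.02, 0.02, 0.03, 0.02, 0.05, 0.85],  # hand 1: most likely 'victory'
]

# flatten the rows and find the position of the global maximum,
# which is what np.argmax does on a 2-D input without an axis argument
flat = [p for row in preds for p in row]
flat_idx = max(range(len(flat)), key=flat.__getitem__)

# quotient = which hand, remainder = which class
hand_idx, cls_idx = divmod(flat_idx, len(CLASSES))
print(hand_idx, CLASSES[cls_idx])
```

So when several hands show different gestures, the image gets the label of the hand the classifier is most sure about.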

Great! Thank you for viewing my solution. I hope this was useful for you. Bye!