# Sign language detector using _mediapipe_ and _scikit-learn_

## Introduction 

A while back, I discovered [mediapipe](https://google.github.io/mediapipe) and I liked it. It offers a wide variety of detection solutions for pose, hands, face, iris, and more. 

I was thinking to myself: what happens if I use this to train a custom model for sign language detection? It has been done before, but it looked like a cool project to do in a weekend. 

## Step 1 - Installing dependencies

In order to get this thing to work, you need to install these packages:

* scikit-learn 
* numpy (a dependency of scikit-learn, so it gets installed automatically)
* pandas (used for preprocessing in step 7)
* opencv-python 
* mediapipe 

All are available on [PyPI](https://pypi.org) and there is nothing really extraordinary in this list. 
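
Before moving on, it can help to confirm that everything imports. The sketch below is not part of the original notebook: `missing_packages` is a hypothetical helper, and the list uses import names (`sklearn`, `cv2`), which differ from the PyPI names above. I also included `pandas`, since step 7 imports it.

```python
import importlib.util

# Import names, not PyPI names: scikit-learn -> sklearn, opencv-python -> cv2.
# pandas is included because step 7 imports it.
REQUIRED = ["sklearn", "numpy", "pandas", "cv2", "mediapipe"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can be installed with pip before running the rest of the notebook.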

## Step 2 - Importing dependencies into the code

In [1]:
import mediapipe as mp 
import cv2

## Step 3 - Initializing some needed functionalities from mediapipe

Here, I used `mp.solutions.hands`. If you want to track something other than hands, change this line to the matching mediapipe solution. 

In [2]:
mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

## Step 4 - Just get some input from the camera

This step isn't necessary at all. I only did it because I wanted to check how many landmarks I can extract from a hand gesture. So if you already know how much data you're dealing with, just skip this part of the code. 

P.S.: I copied this code snippet from an old project of mine, so you can clearly see some leftovers in it, like the unused `landmark_list = []`. 

In [53]:
camera = cv2.VideoCapture(1)
with mp_hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5, max_num_hands = 1) as hands:
    while camera.isOpened():
        success, image = camera.read()
        if not success:
            # Stop instead of crashing in cvtColor when no frame could be read.
            break
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False 
        results = hands.process(image)

        image.flags.writeable = True
        landmark_list = []
        if results.multi_hand_landmarks:
            for landmark in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(image, landmark, mp_hands.HAND_CONNECTIONS)
            
        cv2.imshow("Camera No. 1", cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        if cv2.waitKey(10) & 0xff == ord('q'):
            break

camera.release()
cv2.destroyAllWindows()
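
A short note on the exit check in the loop: `cv2.waitKey` returns -1 when no key was pressed, and masking with `0xff` keeps only the low byte of the key code before it is compared to `ord('q')`. The sketch below reproduces that check without a camera; `is_quit_key` is a hypothetical helper name, and the key code with extra high bits is made up for illustration.

```python
def is_quit_key(key_code, quit_char="q"):
    """Mimic `cv2.waitKey(10) & 0xff == ord('q')` for a given key code."""
    # `&` binds tighter than `==`, so this is (key_code & 0xff) == ord(quit_char).
    return key_code & 0xff == ord(quit_char)


print(is_quit_key(ord("q")))             # a plain 'q' press
print(is_quit_key(-1))                   # no key pressed: -1 & 0xff is 255
print(is_quit_key(0x100000 | ord("q")))  # made-up code with extra high bits set
```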

In [30]:
num_coord = len(results.multi_hand_landmarks[0].landmark)
num_coord

21

## Step 5 - Data gathering 

Here, I use the same code from step 4 to collect data from the camera. But first we need to prepare the dataset file. 

In [82]:
import csv
import numpy as np 
import os

Here I made a list whose index 0 is called `class`. Why? Because we need a class label for each sign. In my case, the classes were _Thumbs up, Thumbs down, Rock on, Love_. You can change them to anything you like. 

In [73]:
landmarks = ['class']
for num in range(1, num_coord + 1):
    landmarks += [f'x{num}', f'y{num}', f'z{num}'] 

In [48]:
landmarks
len(landmarks)

64

This cell just writes the header row (the first line) of my `coords.csv` file. Nothing magical.

In [74]:
with open('coords.csv', mode='w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(landmarks)
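
To see what that header looks like without touching `coords.csv`, here is a small sketch that writes it into an in-memory buffer. It assumes 21 landmarks (the value of `num_coord` found in step 4) with three coordinates each; `build_header` is a hypothetical helper, not something from the notebook.

```python
import csv
import io


def build_header(num_coord=21):
    """Build ['class', 'x1', 'y1', 'z1', ..., 'x<N>', 'y<N>', 'z<N>']."""
    header = ["class"]
    for num in range(1, num_coord + 1):
        header += [f"x{num}", f"y{num}", f"z{num}"]
    return header


header = build_header()

# Write to an in-memory buffer instead of the real coords.csv.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(header)

print(len(header))             # 1 label column + 3 * 21 coordinates
print(buffer.getvalue()[:20])  # start of the header line
```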
    

In [90]:
class_id = "rock on"

## Step 6 - Adding data to the file 

Here, I tweaked my camera code to add data to the CSV file. Above, you see the `class_id` variable. In the body of the loop, we use it as index 0 of our row. After that, we flatten the coordinates we got from the hand landmarks and append the row to our dataset. To record another sign, change `class_id` and run the loop again. 

For more information, just see the `coords.csv` file in this repository. 

In [91]:
camera = cv2.VideoCapture(1)
with mp_hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5, max_num_hands = 1) as hands:
    while camera.isOpened():
        success, image = camera.read()
        if not success:
            # Stop instead of crashing in cvtColor when no frame could be read.
            break
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False 
        results = hands.process(image)

        image.flags.writeable = True
        landmark_list = []
        if results.multi_hand_landmarks:
            for landmark in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(image, landmark, mp_hands.HAND_CONNECTIONS)
        
        # Only record a row when a hand was actually detected in this frame,
        # instead of silently swallowing every exception.
        if results.multi_hand_landmarks:
            hand = results.multi_hand_landmarks[0].landmark
            # Flatten [[x, y, z], ...] into one row and put the label at index 0.
            hand_row = list(np.array([[landmark.x, landmark.y, landmark.z] for landmark in hand]).flatten())
            hand_row.insert(0, class_id)
            with open('coords.csv', mode='a', newline='') as f:
                csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                csv_writer.writerow(hand_row)
            
        cv2.imshow("Camera No. 1", cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        if cv2.waitKey(10) & 0xff == ord('q'):
            break

camera.release()
cv2.destroyAllWindows()
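
The flattening step in the loop above can be tried without a camera. In this sketch, `SimpleNamespace` objects stand in for mediapipe landmarks (they only mimic the `.x`, `.y`, `.z` attributes), the coordinate values are made up, and `landmarks_to_row` is a hypothetical helper. It uses a plain loop, which produces the same ordering as the `np.array(...).flatten()` call.

```python
from types import SimpleNamespace


def landmarks_to_row(class_id, landmarks):
    """Return [class_id, x1, y1, z1, x2, y2, z2, ...] for one detected hand."""
    row = [class_id]
    for landmark in landmarks:
        row += [landmark.x, landmark.y, landmark.z]
    return row


# 21 fake landmarks with made-up coordinates, standing in for
# results.multi_hand_landmarks[0].landmark.
fake_hand = [SimpleNamespace(x=i / 100, y=i / 50, z=-i / 200) for i in range(21)]

row = landmarks_to_row("rock on", fake_hand)
print(len(row))  # label + 21 * 3 coordinates
print(row[:4])
```

Each row written this way lines up with the header: the label first, then the x, y, z values of every landmark in order.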

## Step 7 - Data Preprocessing 

Here I don't really do anything magical. This is just preprocessing the data using _pandas_. We open the CSV file, separate our X and y, and then split them into the `x_train`, `y_train`, `x_test` and `y_test` variables. 

### Axes 

* X: the features we need, i.e. the coordinates of the hand landmarks such as finger joints and fingertips. 
* y: the labels we need, i.e. the sign each row belongs to. 

In [92]:
import pandas as pd 
from sklearn.model_selection import train_test_split

In [93]:
df = pd.read_csv('coords.csv')

In [96]:
X = df.drop('class', axis=1)
y = df['class']

In [97]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
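
Here is the same split on a tiny made-up table, just to show the shapes involved. The column values and the two class names are invented for illustration; only the calls (`drop`, and `train_test_split` with `test_size=0.3` and `random_state=1234`) mirror the cells above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows: a label column plus two feature columns.
toy = pd.DataFrame({
    "class": ["thumbs up", "thumbs down"] * 5,
    "x1": [i / 10 for i in range(10)],
    "y1": [1 - i / 10 for i in range(10)],
})

X_toy = toy.drop("class", axis=1)  # features only
y_toy = toy["class"]               # labels only

x_tr, x_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1234)

print(X_toy.columns.tolist())
print(len(x_tr), len(x_te))  # 70% / 30% of the ten rows
```

The features and labels are shuffled together, so each row in `x_tr` still matches its label in `y_tr`.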

## Step 8 - Finding which algorithm is better 

Here, I just tried a few different algorithms, as you can see. I needed to find out which learning algorithm works best for this project. 

In [104]:
from sklearn.pipeline import make_pipeline 
from sklearn.preprocessing import StandardScaler 

from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

As you can see, I put the different algorithms in a bunch of pipelines. For some reason `LogisticRegression` didn't work for me, so it is commented out. If you can solve its problem, I'll appreciate a pull request. 

In [106]:
pipelines = {
    #'lr':make_pipeline(StandardScaler(), LogisticRegression()),
    'rc':make_pipeline(StandardScaler(), RidgeClassifier()),
    'rf':make_pipeline(StandardScaler(), RandomForestClassifier()),
    'gb':make_pipeline(StandardScaler(), GradientBoostingClassifier()),
}
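
As a rough illustration of what one of these pipelines does, the sketch below fits the `rc` pipeline on six made-up two-feature rows instead of real landmark data. The numbers and the two test points are invented; the point is only that `make_pipeline` chains the scaler and the classifier, so `fit` and `predict` apply both steps in order.

```python
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up, clearly separable data: two features, two classes.
X_demo = [[0.0, 1.0], [0.1, 1.1], [0.2, 0.9],
          [1.0, 0.0], [0.9, 0.1], [1.1, 0.2]]
y_demo = ["thumbs up"] * 3 + ["thumbs down"] * 3

pipe = make_pipeline(StandardScaler(), RidgeClassifier())
pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
print(list(pipe.named_steps))
print(pipe.predict([[0.05, 1.0], [1.0, 0.05]]))
```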

Here, we just train our models. This is the time-consuming part. Depending on the size of your dataset, it may take anywhere from a few minutes to a few hours. 

On my system (a 2019 MacBook Pro with an i7 chip and 16 GB of RAM) it took about 5 or 6 minutes. I haven't tested it on [Colab](https://colab.research.google.com) or any other computer, but there should be no significant difference.

In [107]:
fit_models = {}
for algo, pipeline in pipelines.items():
    model = pipeline.fit(x_train, y_train)
    fit_models[algo] = model

Here, we use `accuracy_score` from `sklearn.metrics` to find which model does better on the test set. In my case, `RandomForestClassifier` scored slightly higher, so I went with that. 

In [111]:
from sklearn.metrics import accuracy_score

In [112]:
for algo, model in fit_models.items():
    yhat = model.predict(x_test)
    print(algo, accuracy_score(y_test, yhat))

rc 0.9838998211091234
rf 0.9856887298747764
gb 0.9838998211091234
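
For reference, the number `accuracy_score` reports is simply the fraction of test rows whose predicted label matches the true one. The sketch below recomputes that by hand on a made-up set of labels and then picks the best entry from the scores printed above; `accuracy` is a hypothetical helper, and the eight labels are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 7 of the 8 predictions are right.
y_true = ["love", "love", "rock on", "rock on",
          "thumbs up", "thumbs up", "thumbs down", "thumbs down"]
y_pred = ["love", "love", "rock on", "love",
          "thumbs up", "thumbs up", "thumbs down", "thumbs down"]
print(accuracy(y_true, y_pred))  # 7 / 8

# The scores printed by the notebook cell above.
scores = {"rc": 0.9838998211091234, "rf": 0.9856887298747764, "gb": 0.9838998211091234}
best = max(scores, key=scores.get)
print(best)
```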


## Final Step - Let the party begin!

And I don't have any other explanation for this.

In [126]:
model = fit_models['rf']


In [128]:
camera = cv2.VideoCapture(1)
with mp_hands.Hands(min_detection_confidence=0.5, min_tracking_confidence=0.5, max_num_hands = 1) as hands:
    while camera.isOpened():
        success, image = camera.read()
        if not success:
            # Stop instead of crashing in cvtColor when no frame could be read.
            break
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image.flags.writeable = False 
        results = hands.process(image)

        image.flags.writeable = True
        landmark_list = []
        if results.multi_hand_landmarks:
            for landmark in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(image, landmark, mp_hands.HAND_CONNECTIONS)
        
        # Only predict when a hand was actually detected in this frame,
        # instead of silently swallowing every exception.
        if results.multi_hand_landmarks:
            hand = results.multi_hand_landmarks[0].landmark
            hand_row = list(np.array([[landmark.x, landmark.y, landmark.z] for landmark in hand]).flatten())
            # Reuse the training column names so the pipeline sees the same feature names.
            x = pd.DataFrame([hand_row], columns=X.columns)
            prediction = str(model.predict(x)[0])
            cv2.putText(image, prediction, (0, 50), cv2.FONT_HERSHEY_COMPLEX, 2, (0, 0, 255), 2, cv2.LINE_AA)
            
        cv2.imshow("Camera No. 1", cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        if cv2.waitKey(10) & 0xff == ord('q'):
            break

camera.release()
cv2.destroyAllWindows()