# Exploratory data analysis (EDA)

## Dataset

The **Hand Gesture of the Colombian sign language** dataset is taken from [Kaggle](https://www.kaggle.com/evernext10/hand-gesture-of-the-colombian-sign-language). <br>
The images represent hand gestures for numbers (0-5) and vowels (A, E, I, O, U). <br>

## Files

### Filenames

Let's see how many images there are for each sign.

In [None]:
from os import listdir
from os.path import isfile, join

def get_gestures_count(men = True):
    # Map each sign to the number of image files in its directory.
    gestures = {}
    signs = ['0', '1', '2', '3', '4', '5', 'A', 'E', 'I', 'O', 'U']
    for sign in signs:
        path = '../data/men/' + sign + '/' if men else '../data/Woman/' + sign + '/'
        files = [file for file in listdir(path) if isfile(join(path, file))]
        gestures[sign] = len(files)
    return gestures

men_gestures = get_gestures_count()
women_gestures = get_gestures_count(False)

- Men <br>
Below we can observe the number of images for each gesture.

In [None]:
import pandas as pd

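# The dict values are scalars, so build the frame from (gesture, count) pairs.
men_gestures_df = pd.DataFrame(list(men_gestures.items()), columns=['Gesture', 'Count'])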
display(men_gestures_df)

`Observation: ` <br>
TODO

- Women <br>
Below we can observe the number of images for each gesture.

In [None]:
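# The dict values are scalars, so build the frame from (gesture, count) pairs.
women_gestures_df = pd.DataFrame(list(women_gestures.items()), columns=['Gesture', 'Count'])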
display(women_gestures_df)

`Observation: ` <br>
TODO

- Overall sum of images for each gesture

In [None]:
# Sum the per-gesture counts; fill_value=0 covers a gesture missing from one group.
gestures_df = (pd.Series(men_gestures)
               .add(pd.Series(women_gestures), fill_value=0)
               .rename_axis('Gesture')
               .reset_index(name='Count'))
display(gestures_df)

`Observation: ` <br>
TODO

Next, I check whether every file in each sign's directory has the sign name as a **prefix**, e.g. whether all images representing A have the letter 'A' at the beginning of the filename. <br>
If so, it will be easier to create the image dataset.

In [None]:
def check_if_filename_starts_with_gesture_name(men = True):
    # Map each sign to True when every filename in its directory starts with the sign.
    filenames_start_with_gesture_name = {}
    signs = ['0', '1', '2', '3', '4', '5', 'A', 'E', 'I', 'O', 'U']
    for sign in signs:
        path = '../data/men/' + sign + '/' if men else '../data/Woman/' + sign + '/'
        files = [f for f in listdir(path) if isfile(join(path, f))]
        filenames_start_with_gesture_name[sign] = all(file.startswith(sign) for file in files)
    return filenames_start_with_gesture_name
        
men_filenames_start_with_gesture_name = check_if_filename_starts_with_gesture_name()
women_filenames_start_with_gesture_name = check_if_filename_starts_with_gesture_name(False)

- Men <br>
Below we can see whether every file in a given gesture directory has a filename that begins with the gesture name.

In [None]:
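# The dict values are scalar booleans, so build the frame from (gesture, flag) pairs.
men_filenames_start_with_gesture_name_df = pd.DataFrame(
    list(men_filenames_start_with_gesture_name.items()),
    columns=['Gesture', 'Starts with gesture name'])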
display(men_filenames_start_with_gesture_name_df)

`Observation: ` <br>
TODO

- Women <br>
Below we can see whether every file in a given gesture directory has a filename that begins with the gesture name.

In [None]:
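# The dict values are scalar booleans, so build the frame from (gesture, flag) pairs.
women_filenames_start_with_gesture_name_df = pd.DataFrame(
    list(women_filenames_start_with_gesture_name.items()),
    columns=['Gesture', 'Starts with gesture name'])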
display(women_filenames_start_with_gesture_name_df)

`Observation: ` <br>
TODO
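
If the prefix check holds for every directory, the label of an image can be read directly from its filename. The following is a minimal sketch of that idea, not part of the original analysis: the helper name `label_from_filename` and the example filenames are hypothetical and serve only as illustration.

```python
from typing import Optional, Sequence

SIGNS = ['0', '1', '2', '3', '4', '5', 'A', 'E', 'I', 'O', 'U']

def label_from_filename(filename: str, signs: Sequence[str] = SIGNS) -> Optional[str]:
    """Return the sign whose name prefixes the filename, or None if none matches."""
    for sign in signs:
        if filename.startswith(sign):
            return sign
    return None

# Hypothetical filenames, used only to illustrate the idea.
examples = ['A_001.jpg', '3_042.jpg', 'photo.jpg']
labels = [label_from_filename(name) for name in examples]
print(labels)  # ['A', '3', None]
```

Because every sign is a single character, no sign is a prefix of another, so the first match is unambiguous.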

### Data format
Do all images have the same extension?

In [None]:
def check_filenames_extensions(men = True):
    # Map each sign to the set of file extensions found in its directory.
    extensions = {}
    signs = ['0', '1', '2', '3', '4', '5', 'A', 'E', 'I', 'O', 'U']
    for sign in signs:
        path = '../data/men/' + sign + '/' if men else '../data/Woman/' + sign + '/'
        files = [f for f in listdir(path) if isfile(join(path, f))]
        extensions[sign] = set([file.split('.')[-1] for file in files])
    return extensions
        
men_files_extensions = check_filenames_extensions()
women_files_extensions = check_filenames_extensions(False)

- Men <br>
The dataframe below shows the unique extensions of the images representing each gesture.

In [None]:
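# pandas cannot build columns from sets, so join each set into one sorted string.
men_files_extensions_df = pd.DataFrame(
    [(sign, ', '.join(sorted(exts))) for sign, exts in men_files_extensions.items()],
    columns=['Gesture', 'Extensions'])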
display(men_files_extensions_df)

`Observation: ` <br>
TODO

- Women <br>
The dataframe below shows the unique extensions of the images representing each gesture.

In [None]:
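# pandas cannot build columns from sets, so join each set into one sorted string.
women_files_extensions_df = pd.DataFrame(
    [(sign, ', '.join(sorted(exts))) for sign, exts in women_files_extensions.items()],
    columns=['Gesture', 'Extensions'])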
display(women_files_extensions_df)

`Observation: ` <br>
TODO

### Dimensions

In [None]:
from PIL import Image

def get_dimension_statistics(men = True):
    # Collect the width and height of every image, grouped by sign.
    widths = {}
    heights = {}
    signs = ['0', '1', '2', '3', '4', '5', 'A', 'E', 'I', 'O', 'U']
    for sign in signs:
        path = '../data/men/' + sign + '/' if men else '../data/Woman/' + sign + '/'
        widths[sign], heights[sign] = [], []
        for f in listdir(path):
            if isfile(join(path, f)):
                # The with block closes the file once the size has been read.
                with Image.open(join(path, f)) as img:
                    width, height = img.size
                widths[sign].append(width)
                heights[sign].append(height)
    return widths, heights

widths_men, heights_men = get_dimension_statistics()
widths_women, heights_women = get_dimension_statistics(False)

- Men <br>
Below are statistics of the image dimensions in the *men* directory for each gesture.

In [None]:
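# Wrap each list in a Series so gestures with different image counts are padded with NaN.
widths_men_df = pd.DataFrame({sign: pd.Series(values) for sign, values in widths_men.items()})
display(widths_men_df.describe())

heights_men_df = pd.DataFrame({sign: pd.Series(values) for sign, values in heights_men.items()})
display(heights_men_df.describe())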

`Observation: ` <br>
TODO

- Women <br>
Below are statistics of the image dimensions in the *women* directory for each gesture.

In [None]:
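# Wrap each list in a Series so gestures with different image counts are padded with NaN.
widths_women_df = pd.DataFrame({sign: pd.Series(values) for sign, values in widths_women.items()})
display(widths_women_df.describe())

heights_women_df = pd.DataFrame({sign: pd.Series(values) for sign, values in heights_women.items()})
display(heights_women_df.describe())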

`Observation: ` <br>
TODO
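
The `describe()` tables summarize widths and heights separately, so they do not show which width goes with which height. A complementary view is to count how often each (width, height) pair occurs. The sketch below assumes dictionaries shaped like the ones returned by `get_dimension_statistics`; the helper name `count_image_sizes` and the toy values are hypothetical and are not measurements from the dataset.

```python
from collections import Counter

def count_image_sizes(widths, heights):
    """Count how often each (width, height) pair occurs, per gesture."""
    return {sign: Counter(zip(widths[sign], heights[sign])) for sign in widths}

# Toy values for illustration only; not measurements from the dataset.
toy_widths = {'A': [640, 640, 320], 'E': [640]}
toy_heights = {'A': [480, 480, 240], 'E': [480]}

size_counts = count_image_sizes(toy_widths, toy_heights)
print(size_counts['A'].most_common())  # [((640, 480), 2), ((320, 240), 1)]
```

On the real data the call would be `count_image_sizes(widths_men, heights_men)`; a single entry per gesture would indicate that all of its images share one resolution.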

### Data mode
What are the images' data modes?

In [None]:
def get_unique_img_modes(men = True):
    # Map each sign to the set of PIL image modes (e.g. 'RGB', 'L') found in its directory.
    unique_modes = {}
    signs = ['0', '1', '2', '3', '4', '5', 'A', 'E', 'I', 'O', 'U']
    for sign in signs:
        path = '../data/men/' + sign + '/' if men else '../data/Woman/' + sign + '/'
        modes = [Image.open(path + f).mode for f in listdir(path) if isfile(join(path, f))]
        unique_modes[sign] = set(modes)
    return unique_modes
        
men_unique_img_modes = get_unique_img_modes()
women_unique_img_modes = get_unique_img_modes(False)

- Men <br>
The data frame below shows the unique image modes for each gesture.

In [None]:
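# pandas cannot build columns from sets, so join each set into one sorted string.
men_unique_img_modes_df = pd.DataFrame(
    [(sign, ', '.join(sorted(modes))) for sign, modes in men_unique_img_modes.items()],
    columns=['Gesture', 'Modes'])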
display(men_unique_img_modes_df)

`Observation: ` <br>
TODO

- Women <br>
The data frame below shows the unique image modes for each gesture.

In [None]:
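# pandas cannot build columns from sets, so join each set into one sorted string.
women_unique_img_modes_df = pd.DataFrame(
    [(sign, ', '.join(sorted(modes))) for sign, modes in women_unique_img_modes.items()],
    columns=['Gesture', 'Modes'])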
display(women_unique_img_modes_df)

`Observation: ` <br>
TODO

## Train and validation data

The data is not split into training and validation sets, so I will split it myself in *TODO: fill with proper filename*.

The pictures are divided into gestures shown by men and gestures shown by women. I would like a classifier that recognizes a gesture no matter who shows it, so I have decided to merge the corresponding signs from both groups.
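
One possible way to merge both groups and hold out a validation set is sketched below. It is not the final implementation: the helper names, the 20% validation fraction, the fixed seed, and the toy file paths are all assumptions made for illustration. The split is done per sign so that every gesture keeps the same proportion in both sets.

```python
import random

def merge_groups(men_files, women_files):
    """Concatenate the file lists of both groups for every sign."""
    signs = sorted(set(men_files) | set(women_files))
    return {s: men_files.get(s, []) + women_files.get(s, []) for s in signs}

def split_train_validation(files_by_sign, validation_fraction=0.2, seed=0):
    """Shuffle each sign's files and hold out a fraction for validation."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation = {}, {}
    for sign, files in files_by_sign.items():
        shuffled = list(files)  # copy, so the input lists stay untouched
        rng.shuffle(shuffled)
        n_val = int(len(shuffled) * validation_fraction)
        validation[sign] = shuffled[:n_val]
        train[sign] = shuffled[n_val:]
    return train, validation

# Hypothetical file paths, used only to illustrate the idea.
toy_men = {'A': [f'men/A/A_{i}.jpg' for i in range(5)]}
toy_women = {'A': [f'Woman/A/A_{i}.jpg' for i in range(5)]}

merged = merge_groups(toy_men, toy_women)
train, validation = split_train_validation(merged)
print(len(train['A']), len(validation['A']))  # 8 2
```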

## Gestures to recognize
As the dataset is large, I have decided to recognize only half of the gestures. Chosen gestures: *TODO: choose them*