
This project is an exploratory exercise on event-based camera data. It aims to enhance classification model performance in accurately classifying pronounced words in lip videos.


WordsRecognitionByLipVideo


This GitHub repository showcases a project focused on classifying event data obtained from an event-based sensor (an event camera, also known as a neuromorphic sensor).

Please note that the dataset was used for a Kaggle challenge but is no longer available due to ownership rights. However, please feel free to reach out if you are interested in the dataset.

This project was carried out in PyCharm and is therefore optimized for it. However, this should not keep you from using your own preferred IDE or environment.

Introduction

An event is a 4-tuple $(x,y,p,t)$ where

  • $(x,y)$ denotes the pixel's position associated with the event.
  • $p$ is a boolean indicating whether the luminosity increased or decreased.
  • $t$ represents the timestamp (in $\mu s$) from the start of the recording.

Event data are provided as DataFrames, with each row representing an event, sorted in ascending order by timestamp.

Note: In the hardware configuration provided by the manufacturer, $x$ ranges from $0$ to $480$, $y$ ranges from $0$ to $640$, $p$ is either $0$ (decrease in luminosity) or $1$ (increase in luminosity), and $t$ is a floating-point number.
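As an illustration, a single recording can be inspected with pandas. This is a minimal sketch: the file path, the column names, and the assumption that the CSVs have no header row are illustrative, not guarantees about the actual data layout.

    import pandas as pd

    # Hypothetical path: the first recording of the word "Addition"
    events = pd.read_csv(
        "train10/train10/Addition/0.csv",
        header=None,                      # assumption: no header row in the raw CSVs
        names=["x", "y", "p", "t"],       # assumption: columns ordered as in the 4-tuple
    )

    # Events are expected to be sorted by timestamp
    assert events["t"].is_monotonic_increasing

    print(events.head())
    print(f"{len(events)} events over {events['t'].iloc[-1] / 1e6:.2f} s")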


Project Objective

The primary goal of this project is to address the following problem:

Problem: Construct a classifier that can determine the class (i.e. the pronounced word) of a new, unseen example as accurately as possible.

The main metric used to assess the performance of the models is accuracy.
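For reference, accuracy is simply the fraction of examples whose predicted label matches the true label. A minimal sketch, assuming scikit-learn is available in the environment:

    from sklearn.metrics import accuracy_score

    y_true = ["Addition", "Carnaval", "Ecole", "Addition"]
    y_pred = ["Addition", "Carnaval", "Pyjama", "Addition"]

    print(accuracy_score(y_true, y_pred))  # 0.75: 3 of the 4 predictions are correct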

Usage

To use this project, follow these steps:

  1. Clone the repository: First, clone this repository to your local machine using

    git clone https://github.com/mariamargherita/WordsRecognitionByLipVideo.git
  2. Obtain the dataset: To obtain the dataset, please reach out. Unfortunately, the data are not publicly available due to ownership rights.
    The training data have the following structure:

    local_repo/
    ├──── train10/
    │       ├── train10/
    │             ├── Addition/
    │             ├── Carnaval/
    │             ├── Decider/
    │             ├── Ecole/
    │             ├── Fillette/
    │             ├── Huitre/
    │             ├── Joyeux/
    │             ├── Musique/
    │             ├── Pyjama/
    │             └── Ruisseau/
    ├──── .venv/
    ├──── .gitignore
    ├──── .LICENSE
    ├──── ...
    └──── *.ipynb
    

Every folder within train10/train10/ holds 32 CSV files, named from 0.csv to 31.csv. These files contain event data focused on the face of a speaker uttering a specific French word, which is also the name of the parent folder.
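Given that layout, the recordings and their labels can be enumerated as follows. This is a hedged sketch, assuming the folder structure shown above; DATA_DIR must be adjusted to your local path.

    from pathlib import Path

    DATA_DIR = Path("train10/train10")  # adjust to where the dataset was extracted

    samples = []
    for word_dir in sorted(p for p in DATA_DIR.iterdir() if p.is_dir()):
        for csv_path in sorted(word_dir.glob("*.csv")):
            samples.append((csv_path, word_dir.name))  # (recording path, word label)

    print(f"{len(samples)} recordings found")  # expected: 10 words x 32 files = 320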

  3. Install the virtual environment: This project requires a specific Conda environment. You can install it by typing the following in your terminal:

    conda env create -f lip_video_env.yml

You should now be able to run the Notebooks.
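Before launching them, remember to activate the newly created environment. The environment name below is an assumption; use whichever name is declared inside lip_video_env.yml:

    conda activate lip_video_env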


Project Outline

This repository contains the following files:

 ├──── checkpoints folder: contains the checkpoint of the best model with respect to validation accuracy
 ├──── plots folder: contains some sample visualizations. If the code is run, accuracy vs. validation accuracy and loss vs. validation loss plots will also be stored here.
 ├──── data_handling.py: Python file containing the data loading and preprocessing steps
 ├──── lip_video_env.yml: .yml file containing the Conda environment needed to run the code (see the Usage section above)
 ├──── model.py: Python file containing the CNN-LSTM model
 ├──── pipeline.py: Python file containing the project pipeline
 └──── utils.py: Python file containing utility functions used to run the code

Data Preprocessing

The initial phase of the project involves preprocessing the raw event data.

During preprocessing we experimented with noise reduction, but it did not improve model performance. This is probably because, in our use case, the noise helps the model generalize, so we decided not to denoise the data. We also split the data into mini-batches, a preprocessing step that helps CNN-LSTM model performance.

Note: The noise reduction code was left in the repository for reference.
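For context, a common denoising strategy for event data is to drop isolated events, i.e. events with no spatio-temporal neighbours. The sketch below is only an illustrative approximation of that idea (coarse grid bucketing); it is not the implementation kept in this repository, and the column names and thresholds are assumptions.

    import pandas as pd

    def drop_isolated_events(events: pd.DataFrame, cell: int = 2, dt_us: float = 10_000.0) -> pd.DataFrame:
        # Keep only events whose coarse (x, y, t) grid cell contains at least
        # two events; isolated events are treated as noise and removed.
        keys = (
            (events["x"] // cell).astype(int).astype(str) + "_"
            + (events["y"] // cell).astype(int).astype(str) + "_"
            + (events["t"] // dt_us).astype(int).astype(str)
        )
        counts = keys.map(keys.value_counts())
        return events[counts >= 2]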


Model Selection and Training

After preprocessing and exploring the data, the next step is model selection and training.

We tried different model complexities and tuned both parameters and hyperparameters. We also tried different test sizes and batch sizes: since we do not have much data, these choices can have a strong impact on how well the neural network learns to generalize. Finally, we made sure to add dropout and early stopping to limit overfitting.
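To make the last point concrete, here is a minimal sketch of dropout, early stopping and checkpointing in Keras. It assumes a TensorFlow/Keras setup (suggested, but not guaranteed, by the checkpoints folder); the layer sizes, input shape and file names are illustrative, and the actual architecture lives in model.py.

    from tensorflow.keras import layers, models, callbacks

    # Illustrative CNN-LSTM skeleton, not the architecture from model.py
    model = models.Sequential([
        layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu"),
                               input_shape=(None, 64, 64, 1)),   # (frames, height, width, channels)
        layers.TimeDistributed(layers.MaxPooling2D()),
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(64),
        layers.Dropout(0.5),                       # dropout to limit overfitting
        layers.Dense(10, activation="softmax"),    # one class per French word
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

    early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
    checkpoint = callbacks.ModelCheckpoint("checkpoints/best_model.h5",   # hypothetical filename
                                           monitor="val_accuracy", save_best_only=True)

    # model.fit(x_train, y_train, validation_split=0.1, epochs=50, batch_size=8,
    #           callbacks=[early_stop, checkpoint])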


Results

We achieved 93% accuracy on the test set.

To achieve this, we trained the model on 90% of the training data and reserved the remaining 10% for validation. Once we found the model with the best performance on the validation data, we retrained that model on the full training data and predicted the test labels, obtaining a test accuracy of 93%.
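A minimal sketch of the 90/10 split described above, assuming scikit-learn; the array shapes and names are placeholders for the preprocessed recordings and their labels.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholders standing in for the preprocessed recordings and their word labels
    features = np.random.rand(320, 16, 64, 64, 1)   # 320 recordings, illustrative shape
    labels = np.repeat(np.arange(10), 32)           # 10 words x 32 recordings

    # 90% train / 10% validation, stratified so every word appears in both splits
    x_train, x_val, y_train, y_val = train_test_split(
        features, labels, test_size=0.1, stratify=labels, random_state=42
    )
    print(x_train.shape, x_val.shape)               # (288, ...) and (32, ...)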


Contributions

Here are some steps that could still be taken to potentially improve the models:

  • Add regularization to further limit overfitting
  • Add attention layers to the neural network architecture
  • Tune the Adam optimizer's parameters (e.g. the learning rate)
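As an example of the last point, a hedged sketch of lowering Adam's learning rate in Keras (same assumed Keras setup as above; `model` stands for an already built model such as the sketch in the Model Selection section, and the value is illustrative):

    from tensorflow.keras.optimizers import Adam

    # Keras' default learning rate for Adam is 1e-3; smaller values are often worth trying on small datasets
    model.compile(optimizer=Adam(learning_rate=3e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])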
