In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
import seaborn as sns
import os
from scipy import signal

from models import top_models
from constants import *
from read_model_performance import get_model_performance

# loads cached results and times for all models
model_performance = get_model_performance()

sns.set_palette(['#323A4C', '#C65861'])

# Executive Summary

The core challenge of this project was to develop an intuitive and user-friendly product tailored to the needs of individuals with disabilities. We addressed this by creating an eye-tracking controlled RC car, utilising machine learning and data science techniques to optimise the accuracy, latency and robustness of our product.

From the physics side, we evaluated different electrode positions and signal-processing techniques, including Fast Fourier Transform (FFT) filtering and downsampling. By analysing the pattern of maxima and minima in the data, we were also able to mark event intervals automatically.

The results demonstrated an average software latency of 0.02 seconds in classification, false positive rate of 0.4%, and less than a 5% difference in accuracy between trained and untrained users.

Our eye movement detection technology holds immense potential in creating more accessible products for individuals with disabilities where hands-free technology is required.

# Introduction

People with disabilities often face reduced autonomy, which affects their social and occupational lives. We aim to develop a fast, reliable eye-movement detection technology that combines Physics and Data Science, initially tested by controlling an RC car. This technology is intended to provide a more inclusive play experience for people with disabilities. The same approach could extend to healthcare, for example wheelchair control and cognitive testing. We investigate how different factors and processes affect electrooculogram (EOG) signal quality, and explore Machine Learning techniques for effective eye-movement classification, considering the relationship between concepts such as Nyquist’s theorem, downsampling, and signal input variations.

# Methods

![Project Design Workflow](img/figure-01.png){#fig-01}

We developed an iterative process in which we tested different data collection and cleaning methods, as well as different classification methods and variables. Considerations combining Physics and Data Science were made at every step in the process, including the best way to collect signals, filter signals, classify events, and eventually predict events. The model optimisation process was completed for each classifier method (see @fig-classifier-eval).

# Data Collection

The scale and reliability of data collected underpins the quality of event classification. We collected data containing sequences of labelled eye movements (left-look, right-look, and blink), so we could build a classification model which accurately predicts these movements. We saved these as `.csv` files, each associated with another file containing the "markers" (labels and times) of each eye movement. In total, we recorded 813 events. See the appendix for definition and notation of events.

## Electrode Position

Electrode position is crucial in obtaining distinguishable signals that will correspond to actions by the RC car. Initially, electrodes were placed between the eyebrows and on the upper temple/eyebrow area, forming a horizontal line. This position was adequate for classifying left and right looks; however, it produced an inconsistent blink signal, which raised concerns about reliably detecting blinks.

We explored the relationship between electrode placement and the signal strength, by varying each electrode position and calculating the signal strength through the average power spectral density (PSD) as a function of distance from the left eye (see appendix). These results indicate that electrode placement should be as close as possible to the eyes. We can see the effect of this in @fig-electrode-signal, where the placement of the electrode forms a horizontal line that intersects the eyes (see @fig-electrode-placement).
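The signal-strength measure described above can be sketched as follows. This is a minimal illustration only: it uses a plain NumPy periodogram rather than any particular library routine, and the function name `average_psd`, the 1000 Hz sampling rate, and the synthetic sine waves are assumptions made for the example, not the recordings or code used in the project.

```python
import numpy as np


def average_psd(voltage, sample_rate_hz):
    """Mean one-sided periodogram PSD of a signal, ignoring the DC bin."""
    voltage = np.asarray(voltage, dtype=float)
    n = voltage.size
    # Remove the baseline so a constant offset does not count as signal strength
    spectrum = np.fft.rfft(voltage - voltage.mean())
    psd = np.abs(spectrum) ** 2 / (sample_rate_hz * n)
    # One-sided spectrum: double every bin except DC (and Nyquist when n is even)
    last = -1 if n % 2 == 0 else None
    psd[1:last] *= 2
    return float(psd[1:].mean())


# Synthetic 2 Hz signals with different amplitudes (hypothetical stand-ins
# for recordings taken far from and close to the eye)
sample_rate_hz = 1000
t = np.arange(2 * sample_rate_hz) / sample_rate_hz
far_electrode = 1.0 * np.sin(2 * np.pi * 2 * t)
near_electrode = 3.0 * np.sin(2 * np.pi * 2 * t)

far_strength = average_psd(far_electrode, sample_rate_hz)
near_strength = average_psd(near_electrode, sample_rate_hz)
print(far_strength, near_strength)
```

Because power scales with the square of amplitude, tripling the amplitude in this toy example raises the average PSD ninefold, which is why the measure separates strong and weak electrode positions clearly.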


In [None]:
def plot_features(df, ax):
    '''
    Plots event intervals for signals given a dataframe
    '''
    # (event code, colour, line style, legend label);
    # '_nolegend_' keeps the dashed lines out of the legend
    markers = [
        (1, 'g', '-', 'Left'),
        (2, 'g', '--', '_nolegend_'),
        (3, 'r', '-', 'Right'),
        (4, 'r', '--', '_nolegend_'),
        (13, 'orange', '-', 'Blink'),
        (14, 'orange', '--', '_nolegend_'),
    ]
    for event, color, linestyle, label in markers:
        for n, t in enumerate(df[df['event'] == event]['T']):
            # label only the first line of each type to avoid duplicate legend entries
            ax.axvline(t, color=color, linestyle=linestyle,
                       label=label if n == 0 else '_nolegend_')


fig, (ax1, ax2) = plt.subplots(1,2)

#weak data path
weak_data = 'Test_csv/CL/FFT/101FFT'
#read in dataframe from csv
df = pd.read_csv(f'{weak_data}.csv')

#plot signal and event intervals
ax1.plot(df['T'], df['V'])
plot_features(df, ax1)
ax1.set_xlabel('time (s)')
ax1.set_title('Signal From Original Electrode Position')
ax1.set_ylabel('Voltage (a.u.)')
ax1.legend()
ax1.set_ylim(-50, 1050)

#FFT data path
FFT_data = 'Test_csv/CL/FFT/1FFT'

#read in dataframe from csv
df = pd.read_csv(f'{FFT_data}.csv')

#plot signals and event intervals
ax2.plot(df['T'], df['V'])
plot_features(df, ax2)
ax2.set_xlabel('time (s)')
ax2.set_title('Signal From Optimised Electrode Position')
ax2.set_ylabel("Voltage (a.u.)")
ax2.legend()
ax2.set_ylim(-50, 1050)

plt.show()

## Automatic Markers

Originally, we labelled events using a Python script that marked events based on a timed keypress; however, there was a steep learning curve for labelling events accurately this way. To improve on this, we implemented an automatic marker based on observations from previous data, which suggested that the start of an event was approximately 0.15 seconds before the maximum/minimum voltage. All events were considered to have a length of 0.5 seconds. The implementation of automatic markers greatly increased the scale and reliability of our training data.
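The marker rule above can be sketched in a few lines. The 0.15-second lead and 0.5-second length come from the text; the function name `auto_marker`, the use of the median as the baseline, and the synthetic bump are assumptions for illustration and may differ from the project's script.

```python
import numpy as np

EVENT_LEAD_S = 0.15    # event starts ~0.15 s before the extremum (from the text)
EVENT_LENGTH_S = 0.5   # fixed event length (from the text)


def auto_marker(times, voltages):
    """Return (start, end) of the event around the largest deviation from the baseline."""
    times = np.asarray(times, dtype=float)
    voltages = np.asarray(voltages, dtype=float)
    # Assumption: the median approximates the resting (forward-looking) voltage
    baseline = np.median(voltages)
    # Works for both maxima and minima by looking at the absolute deviation
    peak_index = int(np.argmax(np.abs(voltages - baseline)))
    start = times[peak_index] - EVENT_LEAD_S
    return start, start + EVENT_LENGTH_S


# Synthetic 2-second recording at 100 Hz with a single bump peaking at t = 1.0 s
times = np.arange(200) / 100
voltages = 500 + 300 * np.exp(-((times - 1.0) / 0.1) ** 2)

start, end = auto_marker(times, voltages)
print(start, end)
```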

# Data Cleaning/Transformation

The required input for the classification models is a list of aggregated features from the window being classified. The training data was created by splitting signal sequences into overlapping 1-second windows, which were then cleaned before features were extracted. This gives the classifiers better insight into the signal in each window, without the entire sequence having to be passed to the classifier as attributes.

## Fast Fourier Transform

Some noise is introduced by the measuring equipment itself, such as the wires, electrodes, and internal workings of the SpikerBox. We use the Fast Fourier Transform (FFT) to convert signals from the time domain into the frequency domain, which lets us observe which frequencies make up our EOG signal. This enables the use of a ‘Gaussian blur’, which acts as a low-pass filter that attenuates noisy high-frequency components. This is effective because the frequencies of our desired signals are below 5 Hz. Filtering plays a major role in producing clean signals for the machine learning models.
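One way to realise a Gaussian low-pass filter in the frequency domain is sketched below. It is an illustration under stated assumptions: the function name `gaussian_lowpass`, the choice of the 5 Hz cutoff as the Gaussian width, and the synthetic 1 Hz signal with 50 Hz interference are not taken from the project code.

```python
import numpy as np


def gaussian_lowpass(voltage, sample_rate_hz, cutoff_hz=5.0):
    """Attenuate high frequencies by weighting the spectrum with a Gaussian."""
    voltage = np.asarray(voltage, dtype=float)
    freqs = np.fft.rfftfreq(voltage.size, d=1.0 / sample_rate_hz)
    spectrum = np.fft.rfft(voltage)
    # Assumption: cutoff_hz is used as the standard deviation of the Gaussian
    weights = np.exp(-0.5 * (freqs / cutoff_hz) ** 2)
    return np.fft.irfft(spectrum * weights, n=voltage.size)


# Synthetic example: a slow 1 Hz "eye movement" plus 50 Hz interference
sample_rate_hz = 1000
t = np.arange(2 * sample_rate_hz) / sample_rate_hz
slow = np.sin(2 * np.pi * 1 * t)
noisy = slow + 0.5 * np.sin(2 * np.pi * 50 * t)

filtered = gaussian_lowpass(noisy, sample_rate_hz)
print(np.max(np.abs(noisy - slow)), np.max(np.abs(filtered - slow)))
```

A Gaussian weighting has no sharp edge, so components just above the cutoff are reduced gradually rather than removed outright; a narrower or wider Gaussian trades noise suppression against distortion of the slow signal.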

## Rolling Windows

From each sequence in the collected data, overlapping 1-second windows, each 0.1 seconds apart, were extracted (see @fig-rolling-window). This increased the scale of the training data, as opposed to fully-separated windows, thus increasing our classification models' accuracy. This also better simulates the streaming condition, since we are performing classifications every 0.1 seconds, so we have to be able to detect eye-movements which are located in different portions of each window. This increases the product's reliability.
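The window extraction described above can be sketched as below. The 1-second window and 0.1-second step come from the text; the function name `rolling_windows` and the 100 Hz toy sequence are assumptions for illustration.

```python
def rolling_windows(samples, sample_rate_hz, window_s=1.0, step_s=0.1):
    """Split a sequence into overlapping windows of window_s seconds, step_s apart."""
    window = int(round(window_s * sample_rate_hz))
    step = int(round(step_s * sample_rate_hz))
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, step)]


# A 3-second toy sequence sampled at 100 Hz (sample values are just indices)
sequence = list(range(300))

overlapping = rolling_windows(sequence, sample_rate_hz=100)
separate = rolling_windows(sequence, sample_rate_hz=100, step_s=1.0)
print(len(overlapping), len(separate))
```

For this toy sequence the overlapping scheme yields 21 windows where fully separated windows yield only 3, which illustrates how overlap increases the amount of training data drawn from the same recording.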

## Feature Set

The feature set was decided by examining the waveforms of each eye movement (see **Figure X**) and deciding how to capture event waveforms using the fewest attributes. Signal variation between users was accounted for by including features that are independent of baseline values (e.g., the difference between the median and maximum values). The reduced feature set (see **Figure X**) lowers the classification models' complexity and latency compared to alternatives such as TSFresh, which produces 770+ features that aren't tailored to this type of data. Further improvements could be made by considering a larger set of features, or by combining our features with TSFresh.
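To illustrate what "independent of baseline values" means, the sketch below computes two such features. Only the difference between the median and the maximum is named in the text; `median_minus_min`, the function name, and the sample values are hypothetical additions for the example and are not the project's actual feature set.

```python
import numpy as np


def baseline_free_features(window):
    """Features measured relative to the window's median, not its absolute level."""
    window = np.asarray(window, dtype=float)
    median = np.median(window)
    return {
        "max_minus_median": float(window.max() - median),  # named in the text
        "median_minus_min": float(median - window.min()),  # hypothetical companion feature
    }


# The same waveform recorded at two different baseline voltages
window = [500, 500, 520, 800, 510, 500, 300, 500]
shifted = [v + 200 for v in window]

features = baseline_free_features(window)
shifted_features = baseline_free_features(shifted)
print(features, shifted_features)
```

Both windows give identical features, so a user whose resting voltage sits higher or lower does not change what the classifier sees.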

# Machine Learning Model

We used a Machine Learning model to classify every window collected from the SpikerBox, as one of:

1. Non-event
2. Left
3. Right
4. Blink

We considered using a simple threshold (e.g., on the standard deviation) for event detection; however, these measures weren't robust, as optimal thresholds varied between users and environments. For example, suboptimal electrode placement increased the signal noise. To solve this, we included Non-events as a target class, and the Machine Learning model captured the waveforms’ complexity better than a threshold could, particularly due to the feature set explored earlier.

## Classifier Evaluation

### Workflow

We tested many combinations of models, hyperparameters, downsampling rates, and feature selection methods. To test each of these, we ran 5-fold cross-validation repeated 50 times, saving various performance metrics at each iteration. Cross-validation reduced the risk of selecting a model that overfits a single train/test split, and the repetitions minimised the effect of outliers, providing a more precise measurement of each metric.

During the evaluation phase, the accuracy was initially massively inflated, because one of the features is the classification of the previous window, and the training data had 100% accurate labels for this feature. This meant that the accuracy relied on the assumption that the model always classified the previous window correctly. To overcome this, we stored the windows' original sequence and position in that sequence. Then, the training data was split by sequence, instead of window. The testing was done by classifying one sequence (in order) at a time, and using the previous prediction as the label for the corresponding feature in the next window. This meant that the testing was done in the same condition as in real product use, providing more representative accuracy measures.
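The two ideas above, splitting by sequence and feeding each prediction into the next window, can be sketched as follows. The function names, the label strings, and the toy threshold rule standing in for a trained classifier are all assumptions for illustration.

```python
def split_by_sequence(sequence_ids, test_ids):
    """Return (train, test) window indices so that no sequence appears in both."""
    train = [i for i, s in enumerate(sequence_ids) if s not in test_ids]
    test = [i for i, s in enumerate(sequence_ids) if s in test_ids]
    return train, test


def classify_sequence(windows, predict, initial_label="none"):
    """Classify windows in order, using the previous prediction as a feature."""
    predictions = []
    previous = initial_label
    for features in windows:
        label = predict(features, previous)
        predictions.append(label)
        previous = label  # the model's own output, not the true label
    return predictions


def toy_predict(amplitude, previous):
    """Hypothetical stand-in for a trained model: easier to stay in 'left' than to enter it."""
    if amplitude > 100 or (previous == "left" and amplitude > 50):
        return "left"
    return "none"


train_idx, test_idx = split_by_sequence(["a", "a", "b", "b", "c"], test_ids={"b"})
predicted = classify_sequence([10, 150, 70, 70, 10], toy_predict)
print(train_idx, test_idx, predicted)
```

In the toy run, the amplitude of 70 is classified as "left" only because the preceding window was predicted as "left", which shows how an early mistake could propagate and why testing with true previous labels would overstate accuracy.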

The software latency of each model was calculated by simulating 100 random windows, and measuring the time for each window's filtering, feature extraction, and classification. This allows us to see how long it takes from signal input to classification output.
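A latency measurement of this kind can be sketched with the standard library timer. The function name `mean_latency`, the 1000-sample windows, and the placeholder `toy_pipeline` (standing in for filtering, feature extraction, and classification) are assumptions for illustration.

```python
import random
import time


def mean_latency(process_window, windows):
    """Average wall-clock time taken to process one window, in seconds."""
    durations = []
    for window in windows:
        start = time.perf_counter()
        process_window(window)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def toy_pipeline(window):
    """Placeholder for filter -> features -> classify; returns max minus median."""
    ordered = sorted(window)
    return ordered[-1] - ordered[len(ordered) // 2]


# 100 random windows, matching the number of windows described in the text
random.seed(0)
windows = [[random.random() for _ in range(1000)] for _ in range(100)]

latency_s = mean_latency(toy_pipeline, windows)
print(f"{latency_s:.6f} s per window")
```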

### Metrics

We used the proportion of correctly labelled windows as a basic metric to gauge the relative performance of different models when making early design decisions. We did this by dividing the number of correctly classified windows, by the number of total windows. Although this doesn't give a good idea of how well the product works in terms of real usage, it is strongly linked to event accuracy and false positive rate, and thus worked as a good proxy during early development.

When making larger decisions, more closely linked to the end product, we evaluated the false positive rate. This was done by dividing the number of times an "event" was detected, when no event occurred, by the total number of actual events. This measure is important in ensuring the safety of resulting products, since false positives would result in unwanted movement, which poses a safety risk.

We also used event accuracy, measuring the number of misclassified events, compared to the number of total events. This is a better metric for product accuracy than the window accuracy, since it relates directly to the feel of the product.
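The three metrics can be written down directly from the definitions above. The function names and all numbers in the example are made up for illustration, and expressing event accuracy as one minus the misclassified fraction is our reading of the definition rather than a quoted formula.

```python
def window_accuracy(true_labels, predicted_labels):
    """Correctly classified windows divided by the total number of windows."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def false_positive_rate(false_detections, actual_events):
    """Detections with no real event, divided by the number of actual events."""
    return false_detections / actual_events


def event_accuracy(misclassified_events, actual_events):
    """Assumed form: fraction of actual events that were not misclassified."""
    return 1 - misclassified_events / actual_events


# Illustrative numbers only, not results from the project
truth = ["none", "left", "left", "none", "blink"]
guess = ["none", "left", "none", "none", "blink"]

print(window_accuracy(truth, guess))
print(false_positive_rate(false_detections=2, actual_events=50))
print(event_accuracy(misclassified_events=5, actual_events=50))
```

Note that the false positive rate defined here is normalised by the number of actual events rather than by the number of non-event windows, so it reads as "unwanted movements per intended movement".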

Our software latency metric measured the average time from signal input to classification, and allowed us to adjust the SpikerBox's buffer size according to the speed. Our goal was to get the average classification time under 0.1 seconds, since this is the minimum buffer size, and is also the response time required for technology to feel instantaneous [@response-time].

## Model Selection

### Hyperparameters and Variables

The hyperparameters and variables for each classification method (see **Figure X** for methods) were selected by evaluating all variations of them, tested on training data with no downsampling and using the full feature set. Then, only the model with the highest accuracy for each method was kept (see **Figure Y**). Keeping all other variables constant allowed for fair comparison; however, it would have been ideal to test each of these models with different feature selection methods and downsampling rates, since some of the models may have performed better in other circumstances. Unfortunately, this was not viable, since the computation time of the model evaluations was too large.

### Downsampling

From physics, Nyquist’s theorem states that we can capture all important features of our signal by sampling at a rate of at least double the maximum frequency we want to keep. By inspecting the signals’ periodograms (see **Figure X**), we can see that most of the signal strength lies in regions below 5 Hz, so a sampling rate of 10 Hz is sufficient in theory. Since the SpikerBox delivers 1000 samples every 0.1 seconds (10 kHz), this gives a theoretical downsampling rate of approximately 10,000 / 10 = 1000.
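The arithmetic behind the theoretical rate can be checked in a few lines. The 5 Hz band and the figure of 1000 samples per 0.1 seconds are from the report; the function and variable names are assumptions for illustration.

```python
def max_downsampling_factor(sample_rate_hz, max_signal_hz):
    """Largest downsampling factor that still satisfies the Nyquist criterion."""
    nyquist_rate_hz = 2 * max_signal_hz
    return sample_rate_hz / nyquist_rate_hz


samples_per_chunk = 1000   # samples delivered per chunk
chunks_per_second = 10     # one chunk every 0.1 s
sample_rate_hz = samples_per_chunk * chunks_per_second  # 10,000 Hz

theoretical_factor = max_downsampling_factor(sample_rate_hz, max_signal_hz=5)
rate_after_200 = sample_rate_hz / 200  # effective rate at the chosen factor of 200

print(theoretical_factor, rate_after_200)
```

A factor of 200 leaves an effective sampling rate of 50 Hz, five times the 10 Hz Nyquist minimum for a 5 Hz band, so it keeps a safety margin that the theoretical limit of 1000 does not.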

To validate this prediction and find the optimal downsample rate, we evaluated the window classification accuracy, and the latency, of a fixed set of models, under various different downsampling rates. Figure X confirms that a downsampling rate of 1000 is viable, however a rate of 200 was found to provide the best tradeoff between accuracy and classification time. Evidently, communication between disciplines was essential to produce this result.

### Feature Selection

The feature selection method was decided by evaluating the latency and window accuracy of the best of each classification method using different feature selection methods (see **Figure X**), using a downsampling rate of 200. Then, the method with the best accuracy for each model was selected. Latency was ignored, because it is well below our target regardless of feature selection.

### Classifier Method

To decide on the final classifier, we evaluated the event accuracy and false positive rate of each classifier method, using their best feature selection methods. As seen in **Figure X**, the K-Nearest Neighbours and Random Forest classifiers had the best results, and we ended up choosing the Random Forest classifier, due to the lower false positive rate. This is because we felt that user safety was a higher priority than ease of use.

# Resulting Product

The RC car was designed to connect over Bluetooth via an Arduino. The components are:

- SpikerBox
- Electrode stickers
- Alligator clips
- Car body
- HC-05 Bluetooth module
- Arduino UNO R3
- 2 × 12 V motors
- L298N motor controller
- 2 × 9 V battery sources, for the Arduino and the motor controller

The car's controls were designed to be intuitive: the car spins left or right during a directional look and stops spinning when the eyes return to the forward position, while a fast blink starts or stops forward motion. The blink needs to be emphasised so that a normal blink is not classified as a command. Using a fast blink also means that the user’s eyes can be fixed on the car before it starts moving forward.
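The control scheme above can be viewed as a small state machine. The sketch below is written in Python for illustration only (the car itself runs on an Arduino), and the state names, event names, and the choice to let a look override forward motion are assumptions rather than the project's firmware logic.

```python
def next_command(state, event):
    """Return the car's next state given its current state and a detected event.

    States: 'stopped', 'forward', 'spin_left', 'spin_right' (hypothetical names).
    """
    if event == "left_start":
        return "spin_left"
    if event == "right_start":
        return "spin_right"
    if event in ("left_end", "right_end"):
        return "stopped"  # eyes returned to the forward position
    if event == "blink":
        # A fast blink toggles between moving forward and stopping
        return "stopped" if state == "forward" else "forward"
    return state  # unknown events leave the car as it is


state = "stopped"
history = []
for event in ["blink", "blink", "left_start", "left_end", "right_start", "right_end"]:
    state = next_command(state, event)
    history.append(state)

print(history)
```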

## Deployment

Once the user is set up with the electrode stickers attached to the SpikerBox, a Random Forest classifier is trained on the training data, with a downsampling rate of 200 and no feature selection applied. Once streaming begins, the user's eye movements are picked up as signals by the electrode stickers. These signals go to the SpikerBox, which sends 1000 samples every 0.1 seconds to a computer. There, the most recent 1 second of data is filtered, downsampled, and aggregated into the features in **Figure X**. These features are given as input to the classifier, which outputs a classification for that window. If there is a new event, a Bluetooth signal is sent to the RC car, which interprets the event and controls its movement using DC motors.
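The "most recent 1 second" buffering step can be sketched with a fixed-length deque. The chunk size and window length follow the text; the generator name `stream_windows` and the dummy chunks are assumptions, and filtering, downsampling, and classification are deliberately left out.

```python
from collections import deque

SAMPLES_PER_CHUNK = 1000   # samples received every 0.1 s (from the text)
CHUNKS_PER_WINDOW = 10     # 10 chunks = the most recent 1 second


def stream_windows(chunks):
    """Yield the latest 1-second window each time a new chunk arrives."""
    buffer = deque(maxlen=SAMPLES_PER_CHUNK * CHUNKS_PER_WINDOW)
    for chunk in chunks:
        buffer.extend(chunk)  # the oldest samples fall off the left automatically
        if len(buffer) == buffer.maxlen:
            yield list(buffer)


# Twelve dummy chunks, each filled with its own index so windows are easy to inspect
chunks = [[i] * SAMPLES_PER_CHUNK for i in range(12)]
windows = list(stream_windows(chunks))

print(len(windows), windows[0][0], windows[-1][0], windows[-1][-1])
```

No window is produced until a full second of data has arrived, after which each new 0.1-second chunk yields one window, matching the 0.1-second classification interval used for the rolling windows in training.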

Some practical considerations for obtaining the cleanest signals for optimal classification performance:

- **Clean skin** - Removing dirt and oil will enhance the signal.
- **Laptop Chargers** - Laptop chargers should be disconnected to not cause any electrical interference.
- **Check battery** - The SpikerBox requires a power source above 7.5V.

## Product Performance

### Classification Accuracy

The final accuracy for classification of events (evaluated using cross-validation) was **x%**. We also tested the event accuracy by using the final product, and recording whether each event was correctly classified. This was done across multiple users, some of whom did not contribute to the training data. This accuracy was **y%**.

### False Positive Rate

We measured the false positive rate using cross-validation, and found that our classifier had only a **x%** false positive rate. We did not formally measure the false positive rate in real use of the product; however, we did not observe any false positives while using it.

### Latency

Our product has an average software latency of **x** seconds. We also measured the latency of the entire product by recording videos of eye movements and the corresponding car movements, and measuring the time between them. Using this method, we found that the product's latency is **x** seconds on average, although it was sometimes as large as 1 second.

### Robustness

We measured the robustness of the product by comparing the event classification accuracy of different users, under lab conditions (clean signals, no distractions). We found that users who contributed to the training data had an accuracy of **x**, whereas other users had an accuracy of **y**. 

### User Experience

After extensive product use, we have found that there is a large learning curve for controlling the car. This includes learning exactly what type of eye movement is expected by the classifiers, particularly for blinking. Furthermore, turning was difficult at times, since users couldn't look at the car while turning.

Directional movements were too fast, making it difficult to turn less than 180 degrees. Also, the Bluetooth output was sometimes missed by the Arduino, so correctly classified events did not always reach the car.

Accuracy slightly decreased compared to lab testing. Factors like nerves, SpikerBox battery state, distraction levels, and variable electrode placement affected classification performance. The classifier was ineffective when improperly set up but performed well under optimal conditions.

# Discussion

Our performance measures varied significantly across cross-validation, lab-condition testing, and product testing. This is because the training data was collected in a controlled environment with optimal electrode placement, battery health, and minimal distractions. These factors affect signal magnitude and noise. The relative movement between the centre and left changes in different environments, impacting the maximum signal value. As a result, our cross-validation metrics assume controlled variables, which do not reflect real user experience. Additionally, many sequences were collected in the same session, which was not accounted for in our cross-validation splits, leading to inflated performance. To address this, we will separate training data by session in the future and collect data in suboptimal conditions to capture real-world variability in the classification model.

Signal output from blinks varied between users. A “fast blink” is subjective, so new users will need to practise to get this correct. We explored alternatives such as winking; however, this led to even more variability in signals, since not everyone can wink easily.

The overall latency was suboptimal due to hardware limitations, and upgrading the hardware could reduce it in future. Our software latency figure also does not capture the delay introduced by SpikerBox buffering. If an event occurs at the beginning of a 0.1-second buffer, the classifier only receives it 0.1 seconds later. On average an event falls in the middle of the buffer, adding about 0.05 seconds. Combined with the 0.02-second classification time, the software’s true latency is about 0.07 seconds on average.

Despite these limitations, our system is fairly robust due to cross-validation, feature independence from individual baselines, and contributions from four trainers.

# Conclusion

## Findings

We achieved our aim in creating fast, reliable eye-movement detection technology to create ocular-enabled manoeuvrability. Our classifier had excellent performance metrics including an accuracy rate of **x**, classification latency of 0.02 seconds, and less than 5% difference in accuracy between trained and untrained users. While we did face some limitations, including variability in user interpretation of signals and some hardware constraints, these challenges provided insights for future enhancements.

## Future Work

To improve our technology, we can explore improvements such as semi-supervised machine learning algorithms, and building a more robust, diverse dataset, accounting for external factors’ influence on signal features.

This project's ability to detect eye movements to allow ocular-enabled manoeuvrability can be extended to several industries. We could explore the use of this technology for controlling wheelchairs, as well as using the eye movement detection for health and accessibility purposes, such as controlling software, or completing cognitive tests for the physically disabled.

# Appendix

## Additional Information

### Defining Eye Movement "Events"

Our classification method detects the beginning and end of directional eye movements as separate events, allowing us to measure the duration of a left or right look. This was designed to allow for more control over the RC car. We denote an event as a string: “C” for when the user is looking forward, “L” for left, “R” for right, and “B” for a fast blink. For example, “CLCB” denotes the user looking forward, then left, then back to forward, and finally blinking. In this project we fixed the left signal to produce a maximum.
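The notation can be turned into a list of start and end events with a short parser. This is a sketch under assumptions: the function name `parse_events`, the event names, and the requirement that a sequence begins with the user looking forward ("C") are illustrative choices, not part of the project code.

```python
LOOK_NAMES = {"L": "left", "R": "right"}


def parse_events(sequence):
    """Convert a string such as 'CLCB' into a list of eye-movement events.

    Assumes the sequence starts with 'C' (user looking forward).
    """
    events = []
    for previous, current in zip(sequence, sequence[1:]):
        if current == "B":
            events.append("blink")
        elif current in LOOK_NAMES:
            events.append(f"{LOOK_NAMES[current]}_start")
        elif current == "C" and previous in LOOK_NAMES:
            events.append(f"{LOOK_NAMES[previous]}_end")
    return events


example = parse_events("CLCB")
print(example)
```

Treating the start and end of a look as separate events is what allows the duration of a left or right look to be measured, since the time between the two events is the length of the look.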

### Further Electrode Placement Analysis

Testing involved moving the left temple electrode down the face while keeping the central electrode fixed. Then the central eyebrow electrode was moved vertically up the face while the left electrode was fixed. Each data point was an average of 3 of each directional event.

**Figure X** indicates that the signal from the left electrode follows an inversely proportional relationship with distance, whereas the signal from the central electrode remains roughly constant. This means that to maximise the left and right signals, we should move the electrodes closer to the eyes, such that the line connecting the electrodes is horizontal and intersects the eyes. Furthermore, this should provide complementary signals, as the left and right eyes have symmetric electrical potentials, so left and right looks would produce distinguishable signals. Figure X confirms this analysis, as we attain stronger signals that make it easier to distinguish blinks.


## Additional Figures

![Rolling Window Representation](img/rolling-window.png){#fig-rolling-window}

![Electrode Placement: Blue - Original Position, Red - Optimal Position](img/electrode-placement.png){#fig-electrode-placement}

## Student Contributions

**510462200** - I created Python Script for Automatic Event Markers and data collection. I had the physicist roles of signal analyst of electrode placement, theoretical downsample rate. Responsible for all technical hardware of the project - researching, constructing, coding the Arduino car, SpikerBox. I was in charge of organising data collection procedures and collecting the majority of data. I contributed physics sections to the report and presentation and performed the live demonstration and recorded a video backup. 

**510517588** - I wrote all the code for the streaming classifier, downsampling, feature selection, and classifier evaluation, as well as code for several different classifier methods. I also wrote the code for extracting features and generating the training data. Myself and 510462200 made most of the project's design decisions. I was a major contributor to both the presentation and report, in charge of creating all of the data science-related plots, as well as structuring, editing, and writing a large portion of the data science side of the report. I designed and wrote most of the slides in the presentation, and wrote the majority of the script.

**500465031** - I helped with the initial research into the parts needed to build the Arduino car. I also contributed to both the presentation and the report, working on the introduction, conclusion, findings, and future applications of our technology in each. For the presentation, I was also involved in its structure, design, and delivery. I took part in data collection multiple times throughout the project, from the initial trial and error through to the end.

**500411142** - I wrote part of the model evaluation and model selection code, and focused on the presentation slides and the report. For the slides, I improved the visual experience, combined different pages, and extracted the key points of each sentence. For the report, I wrote the model evaluation and selection sections. I also took part in data collection and helped improve the final report.

**500443660** - I helped with part of the model evaluation and model selection. For the presentation slides, I contributed some revisions. For the report, I wrote the project results for latency, false positive rate, and robustness. I was also involved in some data collection and, as a tester, helped complete the car backup work and added it to the product performance part of the report.

**490021440** - I wrote a part of the model selection code. For the presentation, I wrote the scripts. And for the report, I did the part of product overview, deployment and event classification accuracy.

## Code

The code is stored on [this GitHub repository](https://github.com/jooshford/eye-movement-classifier). There are instructions for each step of the process in the README file.