# About Dataset

The dataset contains EEG signals from 11 subjects, each sample labeled as alert or drowsy. It can be opened with MATLAB. We extracted the data for our own research from another public dataset:

Cao, Z., et al., Multi-channel EEG recordings during a sustained-attention driving task. Scientific data, 2019. 6(1): p. 1-8.

If you find the dataset useful, please credit their work.

The details on how the data were extracted are described in our paper:

"Jian Cui, Zirui Lan, Yisi Liu, Ruilin Li, Fan Li, Olga Sourina, Wolfgang Müller-Wittig, A Compact and Interpretable Convolutional Neural Network for Cross-Subject Driver Drowsiness Detection from Single-Channel EEG, Methods, 2021, ISSN 1046-2023, https://doi.org/10.1016/j.ymeth.2021.04.017."

The code for the paper above is available at:

https://github.com/cuijiancorbin/A-Compact-and-Interpretable-Convolutional-Neural-Network-for-Single-Channel-EEG

The data file contains three variables: EEGsample, substate and subindex.

"EEGsample" contains 2022 EEG samples of size 30x384 from 11 subjects. Each sample is 3 s of EEG data sampled at 128 Hz from 30 EEG channels.
"subindex" is an array of size 2022x1. It contains the subject index (1 to 11) of each EEG sample.
"substate" is an array of size 2022x1. It contains the labels of the samples: 0 corresponds to the alert state and 1 corresponds to the drowsy state.

The unbalanced version of this dataset is accessible from:
https://figshare.com/articles/dataset/EEG_driver_drowsiness_dataset_unbalanced_/16586957
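
As a quick sanity check of the variable layout described above, the sketch below validates three arrays against each other. `check_layout` is a hypothetical helper, not part of the dataset or the paper's code, and the small zero-filled arrays only stand in for the real variables. The 30x384 sample size is an assumption derived from the 30 channels and 3 s at 128 Hz stated above.

```python
import numpy as np

def check_layout(eeg, subindex, substate, n_channels=30, fs=128, seconds=3):
    """Return True if the arrays follow the layout described above (hypothetical helper)."""
    n = eeg.shape[0]
    return (
        eeg.shape == (n, n_channels, fs * seconds)
        and subindex.shape == (n, 1)
        and substate.shape == (n, 1)
        and set(np.unique(substate)) <= {0, 1}
    )

# Placeholder arrays with the documented shapes (not real EEG data)
demo_eeg = np.zeros((10, 30, 384))
demo_subindex = np.ones((10, 1), dtype=int)
demo_substate = np.zeros((10, 1), dtype=int)

layout_ok = check_layout(demo_eeg, demo_subindex, demo_substate)
print("Layout matches the description:", layout_ok)
```

On the real file, the same call would take `mat_data['EEGsample']`, `mat_data['subindex']` and `mat_data['substate']` once they are loaded.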

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import scipy.io  # loadmat is called as scipy.io.loadmat below
import matplotlib.pyplot as plt
from tabulate import tabulate

# Loading Dataset

In [None]:
file_path = 'EEG driver drowsiness dataset.mat'
mat_data = scipy.io.loadmat(file_path)

In [None]:
print(mat_data)

# Exploratory Data Analysis

In [None]:
# Inspecting the keys and structure of the loaded data
mat_data.keys(), {key: type(mat_data[key]) for key in mat_data.keys()}
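
Besides the stored variables, the dictionary returned by `scipy.io.loadmat` also holds metadata entries whose names start and end with double underscores (`__header__`, `__version__`, `__globals__`). A minimal sketch of filtering them out is shown below. To stay independent of the dataset file, it writes a tiny stand-in file to memory first; the array contents are placeholders.

```python
import io

import numpy as np
from scipy.io import loadmat, savemat

# Write a tiny stand-in .mat file to memory so the example runs without the dataset
buffer = io.BytesIO()
savemat(buffer, {"EEGsample": np.zeros((2, 30, 384)), "substate": np.array([[0], [1]])})
buffer.seek(0)
demo = loadmat(buffer)

# Keys wrapped in double underscores are metadata added by the MAT-file reader
variables = {k: v for k, v in demo.items() if not k.startswith("__")}
for name, value in variables.items():
    print(name, value.shape, value.dtype)
```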

<div style="background-color: #cce5ff; padding: 10px; border: 1px solid #0066cc;">
    <h2 style="color: #0066cc; font-weight: bold;">Assigning Variables</h2>
    
</div>


In [None]:
eeg_samples = mat_data['EEGsample']
subindex = mat_data['subindex']
substates = mat_data['substate']

### Calculate the duration of each sample in seconds

Each sample holds 384 time points per channel, recorded at 128 Hz. Dividing the number of time points by the sampling rate therefore gives a duration of 3 s per sample.

In [None]:
# Calculate the duration of each sample in seconds
sampling_rate = 128  # Hz

num_samples, num_channels, num_time_points = eeg_samples.shape
sample_duration = num_time_points / sampling_rate
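
The same relation gives a time stamp for every point of a sample, which can be handy as an x-axis in seconds. The sketch below assumes the 128 Hz sampling rate and 384 time points stated in the dataset description; `time_axis` is a name introduced here for illustration.

```python
import numpy as np

sampling_rate = 128      # Hz, as stated in the dataset description
num_time_points = 384    # time points per sample

# Time stamp of every point in seconds, starting at 0 s
time_axis = np.arange(num_time_points) / sampling_rate
print(time_axis[0], time_axis[-1])  # 0.0 2.9921875
```

The last time stamp is one sampling interval short of 3 s because the first point sits at 0 s.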

### Check for missing values

In [None]:
missing_values = np.isnan(eeg_samples).sum()
if missing_values == 0:
    print("No missing values in the EEG data.")
else:
    print("Number of missing values:", missing_values)
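
`np.isnan` does not flag infinite values, which can be just as problematic for later statistics. As a hedged extension of the check above, the sketch below uses `np.isfinite` on a small stand-in array (not the real data) and also reports which samples are affected.

```python
import numpy as np

# Small stand-in array with one NaN and one infinite value
demo = np.zeros((2, 3, 4))
demo[0, 1, 2] = np.nan
demo[1, 0, 0] = np.inf

# True wherever a value is NaN or +/- infinity
non_finite = ~np.isfinite(demo)
print("Non-finite values:", int(non_finite.sum()))

# Indices along the first axis (samples) that contain a non-finite value
bad_samples = np.unique(np.nonzero(non_finite)[0])
print("Affected samples:", bad_samples)
```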

In [None]:
print("Number of subjects:", len(np.unique(subindex)))
print("EEGsample shape:", eeg_samples.shape)
print("Number of Samples:", num_samples)
print("Number of Channels:", num_channels)
print("Number of Time Points:", num_time_points)
print("Sample Duration (seconds):", sample_duration)

unique_labels, label_counts = np.unique(substates, return_counts=True)
print("Unique labels:", unique_labels)
print("Label counts:", label_counts)

#### The initial exploration of the EEG dataset reveals the following details:

#### EEG Samples (EEGsample):

  * The dataset contains 2022 EEG samples.
  * Each EEG sample is from 30 channels.
  * Each channel has 384 data points, corresponding to a 3-second EEG recording at a sampling rate of 128Hz.

#### Subject States (substate):

  * There are two unique states: 0 representing the alert state and 1 representing the drowsy state.
  * Each state has 1011 samples, indicating a balanced dataset with respect to the two states.

#### Subject Indexes (subindex):

  * There are 11 unique subjects in the dataset (labeled 1 to 11).
  * The distribution of samples across subjects varies, ranging from a minimum of 102 samples to a maximum of 314 samples per subject.

## State-specific Analysis: 'Alert' and 'Drowsy'

In [None]:
substates_flat = substates.flatten()

In [None]:
# Calculating means and standard deviations for each channel in both states

#Alert States
mean_alert = np.mean(eeg_samples[substates_flat == 0], axis=(0, 2))
std_alert = np.std(eeg_samples[substates_flat == 0], axis=(0, 2))

#Drowsy States
mean_drowsy = np.mean(eeg_samples[substates_flat == 1], axis=(0, 2))
std_drowsy = np.std(eeg_samples[substates_flat == 1], axis=(0, 2))
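
Passing `axis=(0, 2)` reduces over the sample axis and the time axis at once, leaving one value per channel. The sketch below illustrates this on a small random array (a stand-in, not the EEG data) and checks it against an explicit reshape.

```python
import numpy as np

rng = np.random.default_rng(0)
demo = rng.normal(size=(5, 3, 8))   # (samples, channels, time points)

# Reduce over samples (axis 0) and time points (axis 2)
per_channel_mean = demo.mean(axis=(0, 2))
print(per_channel_mean.shape)       # one value per channel

# Same result by moving channels to the front and flattening the other axes
flattened = demo.transpose(1, 0, 2).reshape(3, -1)
assert np.allclose(per_channel_mean, flattened.mean(axis=1))
```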

In [None]:
# Creating a DataFrame for easy viewing
stats_df = pd.DataFrame({
    'Channel': range(1, 31),
    'Mean_Alert': mean_alert,
    'Std_Alert': std_alert,
    'Mean_Drowsy': mean_drowsy,
    'Std_Drowsy': std_drowsy
})

In [None]:
stats_df  

## EEG Channel Statistics Across Samples and Timepoints

In [None]:
# Compute basic statistics for each EEG channel
# Axis 0 indexes the samples and axis 2 the time points, so each result has one value per channel
mean_values = np.mean(eeg_samples, axis=(0, 2))  # Mean over samples and time points
std_values = np.std(eeg_samples, axis=(0, 2))    # Standard deviation over samples and time points
min_values = np.min(eeg_samples, axis=(0, 2))    # Minimum over samples and time points
max_values = np.max(eeg_samples, axis=(0, 2))    # Maximum over samples and time points

# Create a list of dictionaries for each channel's statistics
channel_stats = [
    {
        "Channel": channel_index + 1,
        "Mean": f"{mean_values[channel_index]:.4f}",
        "Std Dev": f"{std_values[channel_index]:.4f}",
        "Min": f"{min_values[channel_index]:.4f}",
        "Max": f"{max_values[channel_index]:.4f}"
    }
    for channel_index in range(num_channels)
]

In [None]:
# Printing the table
print("Basic Statistics for EEG Channels")
print(tabulate(channel_stats, headers="keys", tablefmt="grid"))

# Data Visualization

## Distribution of Subindex and Substate Values

In [None]:
# Get unique values and their counts for subindex and substate
unique_subindex, counts_subindex = np.unique(mat_data['subindex'], return_counts=True)
unique_substate, counts_substate = np.unique(mat_data['substate'], return_counts=True)

In [None]:
# Set up subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot histogram for subindex
axes[0].bar(unique_subindex, counts_subindex, align='center', alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Frequency of Subindex Values')
axes[0].set_xlabel('Subindex Values')
axes[0].set_ylabel('Frequency')
axes[0].grid(True, linestyle='--', alpha=0.7)

# Add count annotations on top of the bars for subindex
for x, y in zip(unique_subindex, counts_subindex):
    axes[0].text(x, y + 0.1, str(y), ha='center', va='bottom')


    

# Plot histogram for substate
axes[1].bar(unique_substate, counts_substate, align='center', alpha=0.7, color='lightcoral', edgecolor='black')
axes[1].set_title('Frequency of Substate Values')
axes[1].set_xlabel('Substate Values')
axes[1].set_ylabel('Frequency')
axes[1].grid(True, linestyle='--', alpha=0.7)

# Add count annotations on top of the bars for substate
for x, y in zip(unique_substate, counts_substate):
    axes[1].text(x, y + 0.1, str(y), ha='center', va='bottom')




# Adjust layout for better spacing
plt.tight_layout()

# Show the plots
plt.show()

## Analysis of Substate Distribution Across Subindices

In [None]:
# Group by subindex and count occurrences of 0 and 1 in substate
subindex_values, substate_counts = np.unique(mat_data['subindex'], return_counts=True)
substate_0_counts = []
substate_1_counts = []

In [None]:
for subindex_value in subindex_values:
    substate_values_for_subindex = mat_data['substate'][mat_data['subindex'] == subindex_value]
    substate_0_counts.append(np.sum(substate_values_for_subindex == 0))
    substate_1_counts.append(np.sum(substate_values_for_subindex == 1))

# Set up the bar chart
fig, ax = plt.subplots(figsize=(10, 6))

bar_width = 0.35
bar_positions_0 = np.arange(len(subindex_values))
bar_positions_1 = bar_positions_0 + bar_width

ax.bar(bar_positions_0, substate_0_counts, width=bar_width, label='Substate 0', alpha=0.7, color='skyblue', edgecolor='black')
ax.bar(bar_positions_1, substate_1_counts, width=bar_width, label='Substate 1', alpha=0.7, color='lightcoral', edgecolor='black')

# Add count annotations on top of each bar
for x, y in zip(bar_positions_0, substate_0_counts):
    ax.text(x, y + 0.1, str(y), ha='center', va='bottom')

for x, y in zip(bar_positions_1, substate_1_counts):
    ax.text(x, y + 0.1, str(y), ha='center', va='bottom')

ax.set_xticks(bar_positions_0 + bar_width / 2)
ax.set_xticklabels(subindex_values)
ax.set_xlabel('Subindex Values')
ax.set_ylabel('Frequency')
ax.set_title('Frequency of Substate Values for Each Subindex')
ax.legend()
ax.grid(True, linestyle='--', alpha=0.7)

plt.show()
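
The per-subject label counts can also be tabulated without an explicit loop. The sketch below uses `pd.crosstab` on small stand-in arrays shaped like `subindex` and `substate` (n x 1); the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in labels shaped like the real arrays (n x 1)
demo_subindex = np.array([[1], [1], [2], [2], [2], [3]])
demo_substate = np.array([[0], [1], [0], [0], [1], [1]])

# Rows are subjects, columns are states, cells are sample counts
counts = pd.crosstab(
    pd.Series(demo_subindex.ravel(), name="subindex"),
    pd.Series(demo_substate.ravel(), name="substate"),
)
print(counts)
```

On the real data, `mat_data['subindex'].ravel()` and `mat_data['substate'].ravel()` would take the place of the stand-in arrays.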

# Visualization of EEG waveforms

## EEG Waveforms Comparison : Separate Subplots for Each Channel

This code draws one row of subplots per EEG channel. The left panel shows the mean signal over time across all Alert samples, and the right panel shows the mean across all Drowsy samples.

In [None]:
# Selecting all channels for visualization
all_channels = list(range(1, 31))

In [None]:
# Creating plots
fig, axes = plt.subplots(nrows=len(all_channels), ncols=2, figsize=(15, 5 * len(all_channels)))
fig.suptitle('EEG Waveforms Comparison: Alert vs Drowsy States (Individual Channel Plots)', fontsize=16)

# Plotting for each channel
for i, channel in enumerate(all_channels):
    # Alert state
    axes[i, 0].plot(eeg_samples[substates_flat == 0][:, channel - 1, :].mean(axis=0))
    axes[i, 0].set_title(f'Channel {channel} - Alert State')
    axes[i, 0].set_xlabel('Time Points')
    axes[i, 0].set_ylabel('EEG Signal')
    axes[i, 0].grid(True)  # Add grid lines

    # Drowsy state
    axes[i, 1].plot(eeg_samples[substates_flat == 1][:, channel - 1, :].mean(axis=0))
    axes[i, 1].set_title(f'Channel {channel} - Drowsy State')
    axes[i, 1].set_xlabel('Time Points')
    axes[i, 1].grid(True)  # Add grid lines

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

plt.show()

## EEG Waveforms Comparison : Combined Plots for All Channels

This code draws one subplot per EEG channel and overlays the mean Alert and mean Drowsy waveforms in it, so the two states can be compared directly on the same axes.

In [None]:
# Creating plots
fig, axes = plt.subplots(nrows=len(all_channels), ncols=1, figsize=(10, 3 * len(all_channels)))
fig.suptitle('EEG Waveforms Comparison: Alert vs Drowsy States (Combined Channel Plots)', fontsize=16)

# Plotting for each channel
for i, channel in enumerate(all_channels):
    # Alert state
    alert_mean = eeg_samples[substates_flat == 0][:, channel - 1, :].mean(axis=0)
    axes[i].plot(alert_mean, label='Alert', alpha=0.7)

    # Drowsy state
    drowsy_mean = eeg_samples[substates_flat == 1][:, channel - 1, :].mean(axis=0)
    axes[i].plot(drowsy_mean, label='Drowsy', alpha=0.7)

    axes[i].set_title(f'Channel {channel}')
    axes[i].set_xlabel('Time Points')
    axes[i].set_ylabel('EEG Signal')
    axes[i].legend()

    # Add grid
    axes[i].grid(True)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

plt.show()

## EEG Data Visualization for Multiple Subjects and Channels

This code is designed to visually represent EEG data for multiple subjects and channels. For each subject, a dedicated figure is generated, and within each figure, there are individual subplots for each EEG channel. These subplots illustrate the EEG signal's amplitude over time for the respective channel.

In [None]:
num_subjects = 11 
num_channels = 30 

In [None]:
# eeg_samples is indexed by sample, not by subject, so look up each
# subject's samples through subindex and plot the first one
subindex_flat = subindex.flatten()

for subject in np.unique(subindex_flat):
    subject_sample = eeg_samples[subindex_flat == subject][0]

    # Create a figure for each subject
    fig, axes = plt.subplots(nrows=num_channels, ncols=1, figsize=(10, 3 * num_channels))
    fig.suptitle(f'EEG Data for Subject {subject} (first sample)', fontsize=16)

    for channel in range(num_channels):
        # Extract data for the current channel of the selected sample
        channel_data = subject_sample[channel, :]

        # Plotting
        axes[channel].plot(channel_data, label=f'Channel {channel + 1}')
        axes[channel].set_title(f'Channel {channel + 1}')
        axes[channel].set_xlabel('Time Points')
        axes[channel].set_ylabel('EEG Signal')
        axes[channel].legend()
        axes[channel].grid(True)

    plt.tight_layout()

    plt.show()

# Dimensionality Reduction

## EEG Signal Averaging-Based Channel Reduction

<div style="background-color: #cce5ff; padding: 10px; border: 1px solid #0066cc;">
    <h2 style="color: #0066cc; font-weight: bold;">Assigning Variables</h2>
    
</div>


In [None]:
eeg_samples = mat_data['EEGsample']
subindex = mat_data['subindex']
substates = mat_data['substate']

## Data Manipulation and Channel Averaging

In [None]:
processed_samples = []

for sample in eeg_samples:
    
    # Transpose the sample
    transposed = sample.T

    # Reduce channels (Averaging Method)
    reduced = np.mean(transposed, axis=1)

    # Flatten the sample and add to the list
    processed_samples.append(reduced.flatten())

# Convert the list to a Pandas DataFrame
Averaging = pd.DataFrame(processed_samples)
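
The averaging step takes the mean across the channels at every time point, so each sample is reduced from channels x time points to a single row of time points. A vectorized sketch of the same reduction is shown below on a random stand-in array (not the EEG data); `averaged_df` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = rng.normal(size=(4, 30, 384))   # (samples, channels, time points)

# Mean over the channel axis keeps one value per time point
averaged = demo.mean(axis=1)            # shape (4, 384)

# Same numbers as transposing each sample and averaging along axis 1
loop_version = np.array([sample.T.mean(axis=1) for sample in demo])
assert np.allclose(averaged, loop_version)

averaged_df = pd.DataFrame(averaged)
print(averaged_df.shape)
```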

In [None]:
Averaging

In [None]:
df_substates_flat = pd.DataFrame(substates, columns=['substate'])


new_data = pd.concat([Averaging, df_substates_flat], axis=1)


print(new_data.head())

In [None]:
# Assuming 'substate' is the column containing the labels
label_0 = new_data[new_data['substate'] == 0].iloc[0]
label_1 = new_data[new_data['substate'] == 1].iloc[0]

# Plotting substate 0
plt.figure(figsize=(12, 6))
plt.subplot(2, 1, 1)  # 2 rows, 1 column, plot 1
plt.plot(label_0[:-1], label='EEG Signal - Substate 0', color='blue')
plt.title("EEG Sample Plot - Substate 0 (Alert)")
plt.xlabel("Time Points")  # each column of the averaged data is one time point
plt.ylabel("EEG Signal")
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()  # legend for the first subplot

# Plotting substate 1
plt.subplot(2, 1, 2)  # 2 rows, 1 column, plot 2
plt.plot(label_1[:-1], label='EEG Signal - Substate 1', color='orange')
plt.title("EEG Sample Plot - Substate 1 (Drowsy)")
plt.xlabel("Time Points")
plt.ylabel("EEG Signal")
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()  # legend for the second subplot

plt.tight_layout()  # To prevent overlapping of subplot titles and axis labels
plt.show()

# Vanilla Machine Learning Models

## Importing Libraries

In [None]:
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## Training and Testing Split

In [None]:
X = Averaging 
y = substates_flat 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
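
A random split places samples from the same subject in both the training and the test set. Since the paper cited in the dataset description targets cross-subject detection, a subject-wise split may be a useful complement. The sketch below uses scikit-learn's `LeaveOneGroupOut` on small stand-in arrays; it is an illustration under that assumption and is not claimed to reproduce the paper's evaluation protocol.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 5))     # stand-in features
y_demo = np.tile([0, 1], 6)           # stand-in labels
groups = np.repeat([1, 2, 3], 4)      # stand-in subject indexes

# One split per subject: that subject is held out, the rest is used for training
splits = list(LeaveOneGroupOut().split(X_demo, y_demo, groups=groups))
for train_idx, test_idx in splits:
    # No subject appears on both sides of the split
    assert np.intersect1d(groups[train_idx], groups[test_idx]).size == 0
    print("held-out subject:", np.unique(groups[test_idx]), "train size:", len(train_idx))
```

On the real data, `groups` would be `subindex.flatten()`, with `X` and `y` as defined above.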

## Machine Learning

In [None]:
# Initialize the models
svm_model = SVC()
rf_model = RandomForestClassifier()
knn_model = KNeighborsClassifier()
xgb_model = XGBClassifier()

# Fit the models
svm_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
knn_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)

# Predict on the test set
svm_pred = svm_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
knn_pred = knn_model.predict(X_test)
xgb_pred = xgb_model.predict(X_test)

# Evaluate the models
svm_acc = accuracy_score(y_test, svm_pred)
rf_acc = accuracy_score(y_test, rf_pred)
knn_acc = accuracy_score(y_test, knn_pred)
xgb_acc = accuracy_score(y_test, xgb_pred)

print(f"SVM Accuracy: {svm_acc}")
print(f"Random Forest Accuracy: {rf_acc}")
print(f"KNN Accuracy: {knn_acc}")
print(f"XGBoost Accuracy: {xgb_acc}")
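
SVM and KNN both rely on distances between feature vectors, so their results can depend on feature scaling. The sketch below shows one way to wrap them in a pipeline with `StandardScaler` and score them with cross-validation instead of a single split. The data is synthetic and the scores say nothing about the EEG dataset; this is an illustrative variation, not the setup used above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 10))            # synthetic features
y_demo = (X_demo[:, 0] > 0).astype(int)       # synthetic binary labels

# Scaling is fitted inside each fold, so test folds do not leak into it
models = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name} mean CV accuracy: {score:.3f}")
```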