# Exploratory Analysis - Reducing Commercial Aviation Fatalities

### References
* Violin Plots: https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-2-connect
* EEG Article and Image: https://en.wikipedia.org/wiki/10%E2%80%9320_system_(EEG)

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
import os

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go

In [None]:
py.init_notebook_mode(connected=True)
color = sns.color_palette()
print(os.listdir("../input"))

## Load data

In [None]:
%%time
dtypes = {'crew': 'int8', 'seat': 'int8'}
train_df = pd.read_csv("../input/train.csv", dtype=dtypes)

In [None]:
print(train_df.shape)
train_df.head()

We first check how many samples (rows) were recorded for each crew:

In [None]:
train_df.crew.value_counts()
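`value_counts()` counts rows per crew, not simulation runs. The sketch below reports both, assuming the training data has an `experiment` column that identifies the run type. That column name is an assumption (it is not shown above), and the toy frame stands in for `train_df`.

```python
import pandas as pd

def runs_per_crew(df, crew_col="crew", run_col="experiment"):
    """Return the number of samples and of distinct runs for each crew.

    The column names are assumptions about the dataset layout.
    """
    grouped = df.groupby(crew_col)[run_col]
    return pd.DataFrame({
        "n_samples": grouped.size(),    # rows recorded for the crew
        "n_runs": grouped.nunique(),    # distinct run labels for the crew
    })

# Toy data standing in for train_df
toy = pd.DataFrame({
    "crew": [1, 1, 1, 2, 2],
    "experiment": ["CA", "CA", "DA", "SS", "SS"],
})
summary = runs_per_crew(toy)
print(summary)
```

On the real data this would be called as `runs_per_crew(train_df)`, provided the column exists.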

> `event` - The state of the pilot at the given time: one of A = baseline, B = SS, C = CA, D = DA

In [None]:
to_replace = ['A', 'B', 'C', 'D']
new_values = ['Baseline', 'Startle/Surprise', 'Channelized Attention', 'Diverted Attention']

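# Assign the result back: an in-place replace on an attribute-accessed column
# may act on a copy and leave train_df unchanged
train_df['event'] = train_df['event'].replace(to_replace, new_values)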

In [None]:
train_df.event.value_counts().plot(kind='bar', rot=15)

## Analyzing non-EEG data
We first analyze the signals that are not EEG recordings: respiration, galvanic skin response, and ECG. For each one, we look at its distribution by event type, then by crew.
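Violin plots show the shape of each distribution. A numeric summary per group can complement them. The helper below is a minimal sketch: `summarize_by` is a hypothetical name, and the toy frame stands in for `train_df`.

```python
import pandas as pd

def summarize_by(df, value_col, group_col):
    """Quartiles and inter-quartile range of value_col within each group."""
    q = df.groupby(group_col)[value_col].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q25", "median", "q75"]
    q["iqr"] = q["q75"] - q["q25"]
    return q

# Toy data standing in for train_df
toy = pd.DataFrame({
    "event": ["Baseline"] * 4 + ["Startle/Surprise"] * 4,
    "r": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
stats = summarize_by(toy, "r", "event")
print(stats)
```

The same call works for any signal and grouping, for example `summarize_by(train_df, 'gsr', 'crew')`.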

> * `r` - Respiration, a measure of the rise and fall of the chest. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
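The quoted resolution and range imply a bit depth for the sensor. The arithmetic below is a quick check using only the two figures from the description. The result is consistent with a 24-bit converter, though the description does not state this.

```python
import math

resolution_v = 0.2384186e-6   # volts per bit, from the description above
v_min, v_max = -2.0, 2.0      # sensor range in volts

levels = (v_max - v_min) / resolution_v   # number of distinct steps over the range
bits = math.log2(levels)                  # equivalent bit depth

print(f"{levels:,.0f} levels, about {bits:.2f} bits")
```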

In [None]:
plt.figure(figsize=(10,5))
sns.violinplot(x='r', y='event', data=train_df)
plt.ylabel('Experiment Type', fontsize=12)
plt.xlabel('Respiration Level (µV)', fontsize=12)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.violinplot(x='crew', y='r', data=train_df)
plt.xlabel('Crew ID', fontsize=12)
plt.ylabel('Respiration Level (µV)', fontsize=12)
plt.show()

> * `gsr` - Galvanic Skin Response, a measure of electrodermal activity. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.

In [None]:
plt.figure(figsize=(10,5))
sns.violinplot(x='gsr', y='event', data=train_df)
plt.ylabel('Experiment Type', fontsize=12)
plt.xlabel('Galvanic Skin Response (µV)', fontsize=12)
plt.show()

In [None]:
plt.figure(figsize=(10,6))
sns.violinplot(x='crew', y='gsr', data=train_df)
plt.xlabel('Crew ID', fontsize=12)
plt.ylabel('Galvanic Skin Response (µV)', fontsize=12)
plt.show()

> * `ecg` - 3-point Electrocardiogram signal. The sensor had a resolution/bit of .012215 µV and a range of -100mV to +100mV. The data are provided in microvolts.

In [None]:
plt.figure(figsize=(10,5))
sns.violinplot(x='ecg', y='event', data=train_df)
plt.ylabel('Experiment Type', fontsize=12)
plt.xlabel('3-point Electrocardiogram signal (µV)', fontsize=12)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.violinplot(x='crew', y='ecg', data=train_df)
plt.xlabel('Crew ID', fontsize=12)
plt.ylabel('3-point Electrocardiogram signal (µV)', fontsize=12)
plt.show()

## Analyze EEG Data

A quick search shows that the EEG data is collected using the **International 10–20 system**. Wikipedia describes it as follows:
> The 10–20 system or International 10–20 system is an internationally recognized method to describe and apply the location of scalp electrodes in the context of an EEG exam, polysomnograph sleep study, or voluntary lab research. This method was developed to maintain standardized testing methods ensuring that a subject's study outcomes (clinical or research) could be compiled, reproduced, and effectively analyzed and compared using the scientific method. The system is based on the relationship between the location of an electrode and the underlying area of the brain, specifically the cerebral cortex.

The following image, retrieved from the article, shows where each of the electrodes is placed:

![eeg](https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/21_electrodes_of_International_10-20_system_for_EEG.svg/350px-21_electrodes_of_International_10-20_system_for_EEG.svg.png)

The naming convention is described below:
> Each electrode placement site has a letter to identify the lobe, or area of the brain it is reading from: Pre-frontal (Fp), Frontal (F), Temporal (T), Parietal (P), Occipital (O), and Central (C). Note that there is no "central lobe"; due to their placement, and depending on the individual, the "C" electrodes can exhibit/represent EEG activity more typical of Frontal, Temporal, and some Parietal-Occipital activity, and are always utilized in polysomnography sleep studies for the purpose of determining stages of sleep.

Let's see which ones are available:
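One way to list the available sites is to filter the column names by the `eeg_` prefix and group them by the lobe letters quoted above. This is a minimal sketch: `eeg_sites_by_lobe` is a hypothetical helper, and the toy column list only contains names that appear in this notebook.

```python
# Lobe letters from the naming convention quoted above
LOBES = {"fp": "Pre-frontal", "f": "Frontal", "t": "Temporal",
         "p": "Parietal", "o": "Occipital", "c": "Central"}

def eeg_sites_by_lobe(columns, prefix="eeg_"):
    """Group EEG column names by lobe, based on the letters of each site name."""
    groups = {}
    for col in columns:
        if not col.startswith(prefix):
            continue  # skip non-EEG columns
        site = col[len(prefix):]
        # Drop the trailing position marker (a digit or 'z' for the midline)
        letters = site.rstrip("0123456789z")
        lobe = LOBES.get(letters, "Not in quoted list")
        groups.setdefault(lobe, []).append(site)
    return groups

# Toy column list standing in for train_df.columns
toy_columns = ["crew", "eeg_fp1", "eeg_fp2", "eeg_f3", "eeg_fz",
               "eeg_poz", "ecg", "r", "gsr"]
groups = eeg_sites_by_lobe(toy_columns)
print(groups)
```

On the real data the call would be `eeg_sites_by_lobe(train_df.columns)`. Sites whose letters are not in the quoted list are grouped separately.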

### Prefrontal and Frontal Analysis

In [None]:
areas = ['fp1', 'fp2', 'f3', 'f4', 'f7', 'f8', 'fz']
df = train_df[['eeg_' + st for st in areas]]

We notice that A1 and A2 are not in the dataset. The dataset also contains EEG data for POz (`eeg_poz`), which is not described in the article. I believe it lies in the area between Pz, O1 and O2, but I don't have a source to back this claim.

*To be continued*