# Exploratory Data Analysis
This notebook uses the following [dataset](https://physionet.org/content/eeg-power-anesthesia/1.0.0/) from PhysioNet.

It is a starting point for further EDA.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re  # the standard-library module is enough for the pattern used below
sns.set()

In [None]:
# Folder containing the CSV files for patient 54
DATA_DIR = '../EEG data/multitaper-spectra-recorded-during-gabaergic-anesthetic-unconsciousness-1.0.0/OR/'

df_data_54_EEGquality = pd.read_csv(DATA_DIR + '54_EEGquality.csv')
df_data_54_f = pd.read_csv(DATA_DIR + '54_f.csv')
df_data_54_l = pd.read_csv(DATA_DIR + '54_l.csv')
df_data_54_Sdb = pd.read_csv(DATA_DIR + '54_Sdb.csv')
df_data_54_t = pd.read_csv(DATA_DIR + '54_t.csv')

# Side-by-side view of the single-column frames; shorter frames are padded with NaN
df_data = pd.concat([df_data_54_t, df_data_54_f, df_data_54_l, df_data_54_EEGquality], axis=1)
df_data.columns = ['time', 'frequency', 'state', 'EEG_quality']

For patient 54, the dataframes have the following numbers of rows:
* EEGquality: 2639
* Frequency: 99 (100 different frequency bins)
* State: 3809
* Sdb: 99 (2640 columns)
* Time: 2639

From the above, we see:
* EEGquality has the same length as Time, so it can be mapped to time.
* The Sdb columns can be mapped to time. The Sdb rows seem to coincide with the frequency bins.
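
The lengths listed above can also be collected programmatically, which makes the comparison easier to repeat for other patients. The following is a minimal sketch: `summarize_shapes` is a hypothetical helper, not part of the notebook, and the toy frames only stand in for the loaded CSV files.

```python
import pandas as pd


def summarize_shapes(frames):
    """Return one row per named DataFrame with its row and column counts."""
    shapes = {
        name: {"rows": df.shape[0], "columns": df.shape[1]}
        for name, df in frames.items()
    }
    return pd.DataFrame.from_dict(shapes, orient="index")


# Toy frames with made-up shapes, standing in for the real data
toy_frames = {
    "t": pd.DataFrame({"time": range(5)}),
    "Sdb": pd.DataFrame([[0.0] * 6] * 3),
}
summary = summarize_shapes(toy_frames)
print(summary)
```

In the notebook, the dictionary would instead hold the frames loaded above, for example `{'t': df_data_54_t, 'Sdb': df_data_54_Sdb}`.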

3809/2 = 1904.5

Questions:
* Why does the length of "State" not match any of the other files?

## Preprocessing

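Frequency and Sdb are each missing one row. The missing row was read as the column header: by default, `pd.read_csv` treats the first line of a file as column names, and these files appear to start directly with data. The same seems to apply to Time (2639 rows against 2640 Sdb columns). The header therefore has to be moved back into the data as the first row.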
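
The effect can be reproduced on a small in-memory file. This is only a sketch on made-up values; it assumes, as the observation above suggests, that the real files have no header line. It also shows that passing `header=None` to `pd.read_csv` keeps the first line as data, which may be an alternative to repairing the frames afterwards.

```python
import io

import pandas as pd

# Toy single-column file with three values and no header line (made-up data)
raw = "1.0000000e+00\n2.0000000e+00\n3.0000000e+00\n"

# Default behaviour: the first line is consumed as the column name
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data; names= sets the column label directly
df_no_header = pd.read_csv(io.StringIO(raw), header=None, names=["frequency"])

print(len(df_default), len(df_no_header))
```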

In [None]:
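def column_offset(df, column_name):
    """Move the column labels of df back into the data as the first row.

    The first CSV line was read as the header, so the labels are parsed to
    floats, prepended as row 0, and the columns are renamed to column_name.
    """
    df_pp = df.copy()                           # copying the original frame
    extra_row = df_pp.columns                   # header labels that are really data
    extra_row_float = []
    # raw string; matches a float in scientific notation, e.g. -1.25e+01
    match = r'-?\d+\.\d+e[+-]\d+'

    for row in extra_row:
        row_clean = re.findall(match, row)[0]   # IndexError if the label holds no float
        extra_row_float.append(float(row_clean))

    df_pp.loc[-1] = extra_row_float             # adding a row
    df_pp.index = df_pp.index + 1               # shifting index
    df_pp = df_pp.sort_index()                  # sorting by index
    df_pp.columns = column_name                 # changing the column name
    return df_pp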
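
The regular expression in `column_offset` is needed because the header labels are strings. When a header contains duplicate labels, pandas also appends a suffix such as `.1` to keep column names unique, so a plain `float(label)` can fail. The sketch below isolates the parsing step; `label_to_float` and the example labels are hypothetical, and the scientific-notation format is an assumption based on the pattern used above.

```python
import re

# Assumed label format: a float in scientific notation, optionally followed
# by a pandas de-duplication suffix such as ".1"
FLOAT_PATTERN = re.compile(r"-?\d+\.\d+e[+-]\d+")


def label_to_float(label):
    """Extract the scientific-notation float at the start of a column label."""
    found = FLOAT_PATTERN.search(label)
    if found is None:
        raise ValueError(f"no float found in column label {label!r}")
    return float(found.group())


labels = ["-1.2500000e+01", "3.0000000e+00.1"]  # made-up examples
values = [label_to_float(label) for label in labels]
print(values)
```

Raising a `ValueError` with the offending label is a design choice for easier debugging; the function above would instead fail with an `IndexError` on such a label.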


In [None]:
df_data_54_f_pp = column_offset(df_data_54_f, ['frequency'])
df_data_54_t_pp = column_offset(df_data_54_t, ['time'])
# Number the Sdb columns, then transpose so rows are time points and columns are frequency bins
df_data_54_Sdb_pp = column_offset(df_data_54_Sdb, list(range(len(df_data_54_Sdb.columns)))).T
df_data_54_Sdb_pp.head()

In [None]:
# Power over time for the frequency bin in column 3
ax = sns.lineplot(x=df_data_54_Sdb_pp.index, y=df_data_54_Sdb_pp[3])
ax.set_ylim(-10,40)
ax.set_ylabel('Power [dB]')
ax.set_xlabel('Time [2 s]')

In [None]:
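# Distribution of power values over time for the frequency bin in column 0
ax = sns.histplot(data=df_data_54_Sdb_pp[0])
ax.set_xlim(-10,50)
ax.set_xlabel('Power [dB]')             # the x-axis shows power values, not frequency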