## Table of Contents


### 1. [Dataset categories](#dataset)
### 2. [Pre-processing](#preprocess)
### 3. [1-d data description](#onedim)
### 4. [Questions](#qs)
* [Can one patient have multiple types of seizures?](#seizures)
* [How many patients in total?](#patients)
* [How many records are there for each seizure type?](#records)

## Dataset Categories <a class="anchor" id="dataset"></a>
    Categories of dataset:
    index                  int. index of record
    fileNo                 int. file number; identical for all records from the same file
    patient                int. patient ID
    session                str. sxxx; one patient can have multiple sessions
    file                   str. txxx; one session can have multiple files
    EEGtype                str. EMU, ICU, Inpatient, Outpatient, Unknown
    EEGsubtype             str. {EMU: EMU, ICU: [NICU, RICU, NSICU, SICU, CICU, BURN, ICU, PICU], Inpatient: [ER, OR, General], Outpatient: Outpatient, Unknown: Unknown}
    LTM-or-Routine         str. Routine, LTM
    Normal/Abnormal        str. Normal, Abnormal
    No.Seizures/File       int. number of seizures per file
    No.Seizures/Session    int. number of seizures per session
    floderType             str. train, dev
    channelConfig          str. 01_tcp_ar, 02_tcp_le, 03_tcp_ar_a
    date                   str. record date
    start                  float. seizure start time in the file (sec)
    end                    float. seizure end time in the file (sec)
    seizureType            str. FNSZ, GNSZ, CPSZ, TCSZ, ABSZ, TNSZ, SPSZ, MYSZ
    link                   str. link of the webpage
    size                   float. size of the .edf file (MB)
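
The column list above can be turned into a quick sanity check on the loaded CSV. This is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the toy frame holds made-up values.

```python
import pandas as pd

# Column names exactly as documented above (including the 'floderType'
# spelling that the CSV itself uses).
EXPECTED_COLUMNS = [
    "index", "fileNo", "patient", "session", "file", "EEGtype", "EEGsubtype",
    "LTM-or-Routine", "Normal/Abnormal", "No.Seizures/File",
    "No.Seizures/Session", "floderType", "channelConfig", "date",
    "start", "end", "seizureType", "link", "size",
]


def missing_columns(frame: pd.DataFrame) -> list:
    """Return the documented columns that are absent from `frame`."""
    return [c for c in EXPECTED_COLUMNS if c not in frame.columns]


# Toy frame with made-up values; it deliberately lacks most columns.
toy = pd.DataFrame({"index": [1], "fileNo": [1], "patient": [6546]})
print(missing_columns(toy)[:3])
```

On the real data, `missing_columns(df)` would be expected to return an empty list if the CSV matches the description.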

In [1]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('data/seizureSummary.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7513 entries, 0 to 7512
Data columns (total 19 columns):
index                  7513 non-null int64
fileNo                 7513 non-null int64
patient                1423 non-null float64
session                1423 non-null object
file                   5612 non-null object
EEGtype                1423 non-null object
EEGsubtype             1423 non-null object
LTM-or-Routine         1423 non-null object
Normal/Abnormal        1423 non-null object
No.Seizures/File       5612 non-null float64
No.Seizures/Session    1423 non-null float64
floderType             7513 non-null object
channelConfig          7513 non-null object
date                   7513 non-null object
start                  3050 non-null float64
end                    3050 non-null float64
seizureType            3050 non-null object
link                   7513 non-null object
size                   7513 non-null float64
dtypes: float64(6), int64(2), object(11)
memory usage

In [3]:
df_copy = df.copy()

## Pre-processing <a class="anchor" id="preprocess"></a>

  

In [4]:
# forward-fill NA values with the value from the row above
# (DataFrame.ffill() replaces the deprecated fillna(method='ffill'))
fill_cols = ['patient', 'session', 'file', 'EEGtype', 'EEGsubtype', 'LTM-or-Routine', 'Normal/Abnormal', 'No.Seizures/File', 'No.Seizures/Session']
df[fill_cols] = df[fill_cols].ffill()

# data type transform
df["patient"] = df['patient'].astype(int).astype(str)

# drop rows whose EEGtype or EEGsubtype is 'Unknown' (the EEG report is not informative)
unknown = df[df['EEGtype'] == 'Unknown']
df = df.drop(unknown.index, axis=0) 
unknown = df[df['EEGsubtype'] == 'Unknown']
df = df.drop(unknown.index, axis=0) 

# sort for easier access
df = df.sort_values(by=['patient','EEGsubtype'])
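
To see why the forward fill is needed, it helps to look at a small frame shaped like the summary file, where session-level fields appear once and stay empty on the following rows. The sketch below uses made-up values and only two of the filled columns.

```python
import pandas as pd

# Made-up rows mimicking the summary layout: patient/session are given once,
# then left empty for the remaining rows of the same session.
toy = pd.DataFrame({
    "patient": [6546.0, None, None, 258.0, None],
    "session": ["s033", None, None, "s002", None],
    "seizureType": ["FNSZ", "FNSZ", None, "GNSZ", None],
})

cols = ["patient", "session"]
toy[cols] = toy[cols].ffill()  # copy the last seen value downwards

# Once no NaN is left, the float IDs can be turned into strings.
toy["patient"] = toy["patient"].astype(int).astype(str)
print(toy)
```

Note that `seizureType` is left untouched on purpose: an empty value there means the row has no seizure, not that the value was omitted.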

### Condition: if one patient has different types of EEG, treat each type as a different patient.

In [5]:
# differentiate EEG subtypes within the same patient by adding a numeric suffix
df['patient_suffix'] = df.groupby('patient')['EEGsubtype'].transform(lambda x: (~x.duplicated()).cumsum())

In [6]:
# build patientID as <patient> + '0' + <suffix>
df['patientID'] = df['patient'] + '0' + df['patient_suffix'].astype(str)
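
The suffix trick is compact, so here is a small sketch of what `(~x.duplicated()).cumsum()` does inside each patient group. The rows are made up and are assumed to be sorted by patient and EEG subtype, as in the cell above.

```python
import pandas as pd

# Made-up rows, already sorted by EEGsubtype within each patient.
toy = pd.DataFrame({
    "patient": ["6546", "6546", "6546", "258", "258"],
    "EEGsubtype": ["EMU", "General", "General", "NICU", "NICU"],
})

# ~duplicated() is True on the first occurrence of each subtype within a
# patient; cumsum() turns those flags into a running number 1, 2, 3, ...
toy["patient_suffix"] = (
    toy.groupby("patient")["EEGsubtype"]
       .transform(lambda x: (~x.duplicated()).cumsum())
)
toy["patientID"] = toy["patient"] + "0" + toy["patient_suffix"].astype(str)
print(toy["patientID"].tolist())
```

Rows of the same patient that share an EEG subtype get the same ID, while a new subtype bumps the suffix.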


## 1-d data description <a class="anchor" id="onedim"></a>


In [7]:
(df['fileNo'].value_counts() > 5).value_counts()

False    4383
True      114
Name: fileNo, dtype: int64


### EEG type
    

In [8]:
df['EEGtype'].value_counts()

ICU           2789
Inpatient     2363
EMU           1577
Outpatient     397
Name: EEGtype, dtype: int64

### EEG subtype & Seizure Type

    BURN    Burn Unit                          FNSZ    Focal Non-Specific Seizure
    CICU    Cardiac Intensive Care             GNSZ    Generalized Non-Specific Seizure
    ICU     Intensive Care Unit                SPSZ    Simple Partial Seizure
    NICU    Neuro-ICU Facility                 CPSZ    Complex Partial Seizure
    NSICU   Neural Surgical ICU                ABSZ    Absence Seizure
    PICU    Pediatric Intensive Care Unit      TNSZ    Tonic Seizure
    RICU    Respiratory Intensive Care Unit    CNSZ    Clonic Seizure
    SICU    Surgical Intensive Care Unit       TCSZ    Tonic Clonic Seizure
                                               ATSZ    Atonic Seizure
                                               MYSZ    Myoclonic Seizure

In [9]:
df['EEGsubtype'].value_counts()

General       2354
EMU           1580
NICU          1164
RICU           578
NSICU          533
Outpatient     396
SICU           249
CICU           156
BURN            55
ICU             44
ER               9
PICU             7
OR               1
Name: EEGsubtype, dtype: int64

In [10]:
df['seizureType'].value_counts()

FNSZ    1788
GNSZ     541
CPSZ     349
TNSZ      62
SPSZ      52
TCSZ      47
ABSZ      45
MYSZ       3
Name: seizureType, dtype: int64

### LTM-or-Routine
    LTM: long-term recording
    Routine: routine recording

In [11]:
df['LTM-or-Routine'].value_counts()

LTM        4606
Routine    2520
Name: LTM-or-Routine, dtype: int64

In [12]:
# number of patients and total file size for LTM and Routine recordings

df_LR_NoPatient = df.pivot_table(index=['LTM-or-Routine'], values='patientID', aggfunc='nunique')
df_LR_filesize = df.pivot_table(index=['LTM-or-Routine'], values='size', aggfunc='sum')
pd.concat((df_LR_NoPatient, df_LR_filesize), axis=1)

| LTM-or-Routine | patientID | size |
|---|---|---|
| LTM | 246 | 42304.136 |
| Routine | 557 | 40614.028 |
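
The same kind of summary can be produced in one step with `groupby(...).agg(...)` and named aggregations, which avoids concatenating two pivot tables. A minimal sketch on made-up rows; the output column names `n_patients` and `total_size` are chosen here for illustration.

```python
import pandas as pd

# Made-up rows; sizes are in MB.
toy = pd.DataFrame({
    "LTM-or-Routine": ["LTM", "LTM", "LTM", "Routine", "Routine"],
    "patientID": ["654601", "654601", "25801", "25801", "77701"],
    "size": [3.8, 24.0, 9.4, 17.0, 20.0],
})

# One pass: distinct patients and summed file size per recording type.
summary = toy.groupby("LTM-or-Routine").agg(
    n_patients=("patientID", "nunique"),
    total_size=("size", "sum"),
)
print(summary)
```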


### Normal/Abnormal

In [13]:
df['Normal/Abnormal'].value_counts()

Abnormal    6308
Normal       818
Name: Normal/Abnormal, dtype: int64

In [14]:
# Open question: why do recordings labelled Normal still have seizure records?
df.pivot_table(index=['Normal/Abnormal', 'seizureType'], values='patientID', aggfunc='nunique')

| Normal/Abnormal | seizureType | patientID |
|---|---|---|
| Abnormal | ABSZ | 9 |
| Abnormal | CPSZ | 41 |
| Abnormal | FNSZ | 153 |
| Abnormal | GNSZ | 74 |
| Abnormal | MYSZ | 2 |
| Abnormal | SPSZ | 3 |
| Abnormal | TCSZ | 11 |
| Abnormal | TNSZ | 3 |
| Normal | CPSZ | 2 |
| Normal | FNSZ | 8 |


### Channel config: LE & AR
    Linked Ears Reference (A1+A2, LE, RE): based on the assumption that sites like the ears and mastoid bone lack electrical activity, often implemented using only one ear;

    The Average Reference (AR): uses the average of a finite number of electrodes as a reference.
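
As a rough illustration of the two referencing schemes, the sketch below re-references a made-up recording with NumPy. It only shows the arithmetic behind the definitions above; it is not the processing pipeline used to produce the corpus, and the array names (`scalp`, `a1`, `a2`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up recording: 4 scalp channels x 5 samples, plus two ear electrodes.
scalp = rng.normal(size=(4, 5))
a1 = rng.normal(size=5)
a2 = rng.normal(size=5)

# Average reference (AR): subtract the mean over channels at each sample.
ar = scalp - scalp.mean(axis=0, keepdims=True)

# Linked ears (LE): subtract the mean of the two ear electrodes at each sample.
le = scalp - (a1 + a2) / 2.0

# After average referencing, the channels sum to zero at every sample.
print(np.allclose(ar.mean(axis=0), 0.0))
```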
    


In [15]:
df['channelConfig'].value_counts()

01_tcp_ar      4912
03_tcp_ar_a    1432
02_tcp_le       782
Name: channelConfig, dtype: int64


## Questions <a class="anchor" id="qs"></a>


## Can one patient have multiple types of seizures?  <a class="anchor" id="seizures"></a>
    Answer: Yes. Among 257 patients with seizures, 48 have multiple types of seizures.


In [16]:
# count the patients that have more than one seizure type
df = df.dropna(subset=['seizureType']) 
df_seizureType = df.drop_duplicates(['patientID','seizureType'])
(df_seizureType['patientID'].value_counts() != 1).value_counts()

False    209
True      48
Name: patientID, dtype: int64
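
An equivalent way to get these two numbers is to count distinct seizure types per patient with `groupby(...).nunique()`, which skips missing values by default. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; the last patient has no seizure recorded.
toy = pd.DataFrame({
    "patientID": ["654602", "654602", "654602", "25801", "25801", "77701"],
    "seizureType": ["FNSZ", "GNSZ", "FNSZ", "CPSZ", "CPSZ", None],
})

# Distinct seizure types per patient (missing values are not counted).
types_per_patient = toy.groupby("patientID")["seizureType"].nunique()

with_seizures = types_per_patient[types_per_patient > 0]
multi = with_seizures[with_seizures > 1]
print(len(with_seizures), len(multi))
```

This variant does not require dropping the rows without a seizure from `df` first, so the full frame stays available for later cells.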

In [17]:
# examples of patients with several seizure types
df_seizureType['patientID'].value_counts()[:3]

654602    5
623001    4
845301    3
Name: patientID, dtype: int64

In [18]:
df_seizureType[df_seizureType['patientID'] == '654602']

| | index | fileNo | patient | session | file | EEGtype | EEGsubtype | LTM-or-Routine | Normal/Abnormal | No.Seizures/File | ... | floderType | channelConfig | date | start | end | seizureType | link | size | patient_suffix | patientID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1069 | 1070 | 736 | 6546 | s033 | t003 | Inpatient | General | LTM | Abnormal | 2.0 | ... | train | 01_tcp_ar | 2014_03_16 | 1.0 | 29.0022 | FNSZ | /train/01_tcp_ar/065/00006546/s033_2014_03_16/... | 3.8 | 2 | 654602 |
| 6246 | 6247 | 96 | 6546 | s014 | t000 | Inpatient | General | Routine | Abnormal | 7.0 | ... | dev | 01_tcp_ar | 2011_03_15 | 90.972 | 160.672 | GNSZ | /dev/01_tcp_ar/065/00006546/s014_2011_03_15/00... | 24.0 | 2 | 654602 |
| 6315 | 6316 | 132 | 6546 | s024 | t000 | Inpatient | General | LTM | Abnormal | 4.0 | ... | dev | 01_tcp_ar | 2012_02_25 | 240.9727 | 286.5742 | SPSZ | /dev/01_tcp_ar/065/00006546/s024_2012_02_25/00... | 9.4 | 2 | 654602 |
| 6325 | 6326 | 136 | 6546 | s025 | t003 | Inpatient | General | LTM | Abnormal | 2.0 | ... | dev | 01_tcp_ar | 2012_02_26 | 741.5547 | 793.5547 | CPSZ | /dev/01_tcp_ar/065/00006546/s025_2012_02_26/00... | 17.0 | 2 | 654602 |
| 7506 | 7507 | 1007 | 6546 | s013 | t000 | Inpatient | General | LTM | Abnormal | 1.0 | ... | dev | 03_tcp_ar_a | 2011_02_18 | 290.0175 | 361.0775 | TCSZ | /dev/03_tcp_ar_a/065/00006546/s013_2011_02_18/... | 20.0 | 2 | 654602 |


### Condition: if one patient has different types of seizures, treat each type as a different patient.

In [19]:
df = df.sort_values(by=['patientID', 'seizureType'], ascending=False)
# differentiate seizure types within the same patientID by adding a numeric suffix
df['seizure_suffix'] = df.groupby('patientID')['seizureType'].transform(lambda x: (~x.duplicated()).cumsum())

# build patientID2 as <patientID> + '0' + <seizure suffix>
df['patientID2'] = df['patientID'] + '0' + df['seizure_suffix'].astype(str)

## How many patients in total? <a class="anchor" id="patients"></a>
    

### In general: 
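    Answer: There are 317 unique patients in total, counting each (patient, EEG subtype, seizure type) combination with at least one seizure record as a separate patient.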

In [26]:
# count the number of unique patients
df_patient = df.drop_duplicates(['patientID2'])
df_patient['patientID2'].value_counts().sum()

317

### By channel config: 
    Answer: 
    01_tcp_ar: 185
    02_tcp_le: 77
    03_tcp_ar_a: 74
   

In [27]:

print("01_tcp_ar: ", df.loc[df['channelConfig'] == '01_tcp_ar'].drop_duplicates(['patientID2'])['patientID2'].value_counts().sum())
print("02_tcp_le: ", df.loc[df['channelConfig'] == '02_tcp_le'].drop_duplicates(['patientID2'])['patientID2'].value_counts().sum())
print("03_tcp_ar_a:", df.loc[df['channelConfig'] == '03_tcp_ar_a'].drop_duplicates(['patientID2'])['patientID2'].value_counts().sum())



01_tcp_ar:  185
02_tcp_le:  77
03_tcp_ar_a: 74


    The per-config counts sum to 336, which is more than the 317 unique patients, so some patients have recordings under more than one channel config.
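
The overlap between channel configs can also be measured directly: 185 + 77 + 74 = 336, which exceeds 317 by 19. The sketch below shows one way to count the patients behind that excess, on made-up rows with hypothetical IDs `A`, `B` and `C`.

```python
import pandas as pd

# Made-up rows: patient A was recorded under two different channel configs.
toy = pd.DataFrame({
    "patientID2": ["A", "A", "B", "C", "C"],
    "channelConfig": ["01_tcp_ar", "02_tcp_le", "01_tcp_ar",
                      "03_tcp_ar_a", "03_tcp_ar_a"],
})

# Distinct patients per config, and distinct configs per patient.
per_config = toy.groupby("channelConfig")["patientID2"].nunique()
configs_per_patient = toy.groupby("patientID2")["channelConfig"].nunique()

# Excess of the per-config total over the number of unique patients.
excess = per_config.sum() - toy["patientID2"].nunique()
n_multi_config = int((configs_per_patient > 1).sum())
print(excess, n_multi_config)
```

The two numbers coincide only when no patient appears under all three configs; such a patient would add two to the excess but count once in `n_multi_config`.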

## How many records are there for each seizure type? <a class="anchor" id="records"></a>

    Sub-question: How many records are there for each seizure type by channel config?
    

### Seizure type details
    SEIZ    Seizure
    A general class for seizures. All of the specific seizure classes below fall into this class.

    FNSZ    Focal Non-Specific Seizure
    Covers lobe, hemispheric and focal seizures regardless of their location on the scalp (e.g. temporal lobe seizures, left hemispheric seizures).

    GNSZ    Generalized Non-Specific Seizure
    Seizures that occur over (almost) all the channels.

    SPSZ    Simple Partial Seizure (Focal)
    A focal seizure containing simple waves that starts in one area of the brain and (sometimes) spreads towards other lobes. Not harmful; the patient is conscious. Length: variable

    CPSZ    Complex Partial Seizure (Focal)
    A seizure that contains complex waves. Harmful; could be a non-convulsive seizure. Length: variable

    ABSZ    Absence Seizure
    Brief seizures that usually contain 3 to 6 Hz spike-and-wave complexes. Length: typically 3-4 seconds, up to 11-12 seconds

    TNSZ    Tonic Seizure
    A seizure marked by stiffening of the muscles. Length: variable

    CNSZ    Clonic Seizure
    A seizure marked by continuous jerking of the muscles. Length: variable

    TCSZ    Tonic-Clonic Seizure
    The most severe seizure type, with stiffening in the beginning stage and jerking in the later stage. Length: variable

    ATSZ    Atonic Seizure
    A very brief seizure (about 1 second long) in which the patient loses consciousness for a second. (Not important for ICU patients at all.) Length: 1 second

    MYSZ    Myoclonic Seizure
    A very brief motor seizure event, lasting about 1-2 seconds, that includes periodic jerks of the muscles. Length: 1-3 seconds

    NESZ    Non-Epileptic Seizure
    A seizure event that is not caused by epilepsy. Length: variable
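
Since each description mentions a typical length, it can be useful to attach the full names to the records and compare observed durations per type. The sketch below is an illustration on made-up `start`/`end` values; `SEIZURE_NAMES` is a hypothetical lookup built from the list above.

```python
import pandas as pd

# Code-to-name lookup taken from the list above.
SEIZURE_NAMES = {
    "FNSZ": "Focal Non-Specific Seizure",
    "GNSZ": "Generalized Non-Specific Seizure",
    "SPSZ": "Simple Partial Seizure",
    "CPSZ": "Complex Partial Seizure",
    "ABSZ": "Absence Seizure",
    "TNSZ": "Tonic Seizure",
    "CNSZ": "Clonic Seizure",
    "TCSZ": "Tonic-Clonic Seizure",
    "ATSZ": "Atonic Seizure",
    "MYSZ": "Myoclonic Seizure",
    "NESZ": "Non-Epileptic Seizure",
}

# Made-up seizure events; start and end are in seconds from the file start.
toy = pd.DataFrame({
    "seizureType": ["FNSZ", "GNSZ", "FNSZ"],
    "start": [1.0, 90.0, 10.0],
    "end": [29.0, 160.0, 20.0],
})

toy["seizureName"] = toy["seizureType"].map(SEIZURE_NAMES)
toy["duration"] = toy["end"] - toy["start"]

# Mean seizure duration (seconds) per seizure type.
mean_duration = toy.groupby("seizureType")["duration"].mean()
print(mean_duration)
```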


In [28]:
# size is the .edf file size in megabytes
df_type_NoPatient = df.pivot_table(index=['seizureType'], values='patientID2', aggfunc='nunique')
df_type_filesize = df.pivot_table(index=['seizureType'], values='size', aggfunc='sum')
pd.concat((df_type_NoPatient, df_type_filesize), axis=1)

| seizureType | patientID2 | size |
|---|---|---|
| ABSZ | 9 | 879.2 |
| CPSZ | 43 | 5715.5 |
| FNSZ | 159 | 24926.372 |
| GNSZ | 83 | 9886.1 |
| MYSZ | 2 | 49.0 |
| SPSZ | 4 | 1874.7 |
| TCSZ | 13 | 734.8 |
| TNSZ | 4 | 544.7 |


In [29]:
df_NoPatient = df.pivot_table(index=['channelConfig', 'seizureType'], values='patientID2', aggfunc='nunique')
df_filesize = df.pivot_table(index=['channelConfig', 'seizureType'], values='size', aggfunc='sum')
pd.concat((df_NoPatient, df_filesize), axis=1)

channelConfig,seizureType,patientID2,size
01_tcp_ar,ABSZ,1,48.0
01_tcp_ar,CPSZ,17,1841.3
01_tcp_ar,FNSZ,98,15230.311
01_tcp_ar,GNSZ,53,7605.7
01_tcp_ar,MYSZ,1,23.0
01_tcp_ar,SPSZ,4,1874.7
01_tcp_ar,TCSZ,7,382.8
01_tcp_ar,TNSZ,4,544.7
02_tcp_le,ABSZ,8,831.2
02_tcp_le,CPSZ,12,1767.9
