In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns

# PECARN TBI

In [None]:
pecarn_tbi = pd.read_csv('../data/TBI PUD 10-08-2013.csv', index_col=0)

# Data Types
Most columns in the PECARN dataset are categorical; the exceptions are the GCS and age columns, which are numeric.

Because some values are missing, it helps to convert every column to a nullable integer type (Int64) first, and then convert the categorical columns to a Categorical type.
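As a small illustration of why the nullable type helps, the sketch below runs the same two-step conversion on a toy column. The values are made up for the example and are not taken from the PECARN file.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): integer codes stored as floats, with one gap.
raw = pd.Series([1.0, 0.0, np.nan, 1.0])

# Nullable Int64 keeps the codes as integers and the gap as <NA>.
as_int = raw.astype('Int64')

# Converting to category keeps only the observed codes as categories;
# the missing value stays missing rather than becoming its own category.
as_cat = as_int.astype('category')

print(as_int.dtype)
print(list(as_cat.cat.categories))
print(int(as_cat.isna().sum()))
```

A plain `int64` cast would fail on the missing value, which is why the nullable `Int64` type is used as the intermediate step.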

In [None]:
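for col in list(pecarn_tbi):
    try:
        # Cast to float first, then to nullable Int64 so missing values are kept as <NA>.
        pecarn_tbi[col] = pecarn_tbi[col].astype(float).astype('Int64')
        # GCS scores and ages stay numeric; every other column becomes categorical.
        if col not in ['AgeinYears', 'AgeInMonth', 'GCSEye', 'GCSVerbal', 'GCSMotor', 'GCSTotal']:
            pecarn_tbi[col] = pecarn_tbi[col].astype('category')
    except (ValueError, TypeError):
        # Leave columns that cannot be cast to integers unchanged.
        pass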

# Initial Investigation
Apart from the age column clean-up below, the *pecarn_tbi* dataframe will not be changed from this point onwards; a separate *model_inputs* dataframe will be constructed instead.

## Age
PECARN evaluates the dataset with two rule sets: one for children younger than 2 years, and another for children aged 2 years and older.

We don't need both *AgeinYears* and *AgeInMonth*, but it may make sense to record an infant's age in months in the end-user UI.


In [None]:
pecarn_tbi_by_age_group = pecarn_tbi.groupby('AgeTwoPlus')
# Select several columns from a groupby with a list (double brackets), not a tuple.
pecarn_tbi_by_age_group[['AgeinYears', 'AgeInMonth']].describe()

Let's drop *AgeInMonth* for now, and fix the *AgeinYears* typo by renaming the column to *Age* while we are at it.

In [None]:
if 'AgeInMonth' in list(pecarn_tbi):
    pecarn_tbi = pecarn_tbi.drop(columns='AgeInMonth')
pecarn_tbi.rename(columns={'AgeinYears': 'Age'}, inplace=True)

# Injury Type
The study reports *"Children were excluded with trivial injury mechanisms defined by ground-level falls or walking or running into stationary objects, and no signs or symptoms of head trauma other than scalp abrasions and lacerations."*

It is not clear from this statement whether those children have already been removed from the public dataset, so this needs investigating.


### Trivial Injury Mechanisms

In [None]:
g = sns.FacetGrid(pecarn_tbi, col='High_impact_InjSev', height=4, aspect=.7)
g.map(sns.countplot, 'InjuryMech')

In [None]:
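# Pass the series as x explicitly; recent seaborn versions treat the first positional argument as data.
sns.countplot(x=pecarn_tbi.loc[pecarn_tbi['High_impact_InjSev'] != 1, 'InjuryMech'])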

It looks like *InjuryMech* = 6 (Fall to ground from standing/walking/running) and *InjuryMech* = 7 (Walked or ran into stationary object) are in the data.

It also looks like removing *High_impact_InjSev* category 1 will exclude *InjuryMech* categories 6 and 7, which appear in the dataset but should have been excluded.
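The plots suggest this overlap, but a cross-tabulation makes it easier to confirm. The sketch below shows the idea on a toy dataframe with made-up rows; only the column names *InjuryMech* and *High_impact_InjSev* come from the dataset, and the same two steps could be run on *pecarn_tbi*.

```python
import pandas as pd

# Toy rows (hypothetical values, not PECARN records).
toy = pd.DataFrame({
    'InjuryMech':         [6, 7, 6, 8, 1, 7],
    'High_impact_InjSev': [1, 1, 1, 2, 3, 1],
})

# Rows are mechanisms, columns are severity categories, cells are patient counts.
overlap = pd.crosstab(toy['InjuryMech'], toy['High_impact_InjSev'])
print(overlap)

# Count trivial mechanisms (6 and 7) that survive the `!= 1` filter.
kept = toy[toy['High_impact_InjSev'] != 1]
remaining_trivial = int(kept['InjuryMech'].isin([6, 7]).sum())
print(remaining_trivial)
```

If the count of remaining trivial mechanisms is zero on the real data, the severity filter alone is enough to drop them.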

In [None]:
model_inputs = pecarn_tbi[pecarn_tbi['High_impact_InjSev'] != 1]

Patients were also excluded if they had penetrating trauma, known brain tumors, pre-existing neurological disorders complicating assessment, or neuroimaging at an outside hospital before transfer.

Patients were excluded if they had ventricular shunts or bleeding disorders.

# Imbalanced Data
Counting rows by *PosIntFinal* and *AgeTwoPlus* suggests the outcome classes are quite imbalanced.

TODO: need to decide how to handle this.

In [None]:
pecarn_tbi.groupby(['PosIntFinal','AgeTwoPlus']).count()

In [None]:
model_inputs.groupby(['PosIntFinal','AgeTwoPlus']).count()
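Raw counts can be hard to compare across age groups, so it may help to look at proportions as well. The following is a minimal sketch on made-up data; the column names match the dataset, but the rows and the resulting rates are purely illustrative.

```python
import pandas as pd

# Toy outcome data (hypothetical values, not PECARN records).
toy = pd.DataFrame({
    'PosIntFinal': [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    'AgeTwoPlus':  [1, 1, 1, 2, 2, 2, 2, 2, 1, 2],
})

# Share of each outcome class within each age group (each row sums to 1).
shares = pd.crosstab(toy['AgeTwoPlus'], toy['PosIntFinal'], normalize='index')
print(shares)

# Overall positive rate across both age groups.
overall_rate = toy['PosIntFinal'].mean()
print(overall_rate)
```

Seeing the positive share per group makes it easier to decide later between options such as resampling or class weights.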

# Response Variables
In the original dataset, the *PosIntFinal* variable is the response or target variable.

However, *PosIntFinal* is "Yes" when at least one of the *HospHeadPosCT*, *Intub24Head*, *Neurosurgery*, or *DeathTBI* variables is "Yes". The model probably doesn't need to predict which of these outcomes occurred (?)

In [None]:
responses_all_colnames = ['PosIntFinal', 'HospHeadPosCT', 'Intub24Head', 'Neurosurgery', 'DeathTBI']
pecarn_tbi[responses_all_colnames].describe()

In [None]:
model_inputs[responses_all_colnames].describe()
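The stated relationship between *PosIntFinal* and its four component variables can be checked directly. The sketch below does so on a toy dataframe and assumes that the value 1 codes "Yes"; that coding is an assumption for the example and should be confirmed against the data dictionary before running it on *pecarn_tbi*.

```python
import pandas as pd

# Toy rows (hypothetical values); 1 is assumed to mean "Yes".
toy = pd.DataFrame({
    'HospHeadPosCT': [0, 1, 0, 0],
    'Intub24Head':   [0, 0, 0, 1],
    'Neurosurgery':  [0, 0, 0, 1],
    'DeathTBI':      [0, 0, 0, 0],
    'PosIntFinal':   [0, 1, 0, 1],
})

components = ['HospHeadPosCT', 'Intub24Head', 'Neurosurgery', 'DeathTBI']

# True where at least one component outcome is "Yes".
any_component = toy[components].eq(1).any(axis=1)

# Rows where the composite flag disagrees with its components.
mismatches = toy[any_component != toy['PosIntFinal'].eq(1)]
print(len(mismatches))
```

Any mismatched rows on the real data would be worth inspecting, since missing component values could explain them.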