## Introduction<a id='Introduction'></a>

The purpose of this notebook is to preprocess our data so it is ready for machine learning. The steps are:

    1. Create dummy or indicator features for categorical variables
    2. Split your data into testing and training datasets
    3. Standardize the magnitude of numeric features using a scaler
    
We will then save the split data to new CSV files for use with machine learning models in a later notebook.

## Table of Contents<a id='Table_of_Contents'></a>

* [Introduction](#Introduction)
* [Table of Contents](#Table_of_Contents)
* [Imports](#Imports)
* [View Data](#View_Data)
* [Create Dummy Columns](#Create_Dummy_Columns)
* [Feature Engineering](#Feature_Engineering)
* [Train Test Split](#Train_Test_Split)
* [Save the training/testing data](#Save_the_training/testing_data)
* [Conclusion](#Conclusion)

## Imports<a id='Imports'></a>

Here we import the libraries needed for preprocessing and read the cleaned data into the data dataframe for further processing.

In [1]:
#Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler

#Don't display future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
sns.set()

In [2]:
#Import cleaned data
data = pd.read_csv('data/clean_data/clean_data.csv')

## View Data<a id='View_Data'></a>

Here we will make some adjustments to the dataset so we have all the right data types/values that we need. 

In [3]:
#Convert ProductCode to strings, since the codes are categories rather than integers.
data['ProductCode'] = data['ProductCode'].astype(str)

#Product codes: 1293 (amusement devices), 3295 (water slides, public), and 3259 (go-karts).
#Map each product code to a label that includes its meaning.
product_codes = {'1293':'amusement devices (1293)', '3295': 'water slides, public (3295)', '3259':'go-karts (3259)'}

#Replace the codes in the ProductCode column only, so no other column is touched.
data['ProductCode'] = data['ProductCode'].replace(product_codes)

#Convert to Date Time object
data['Treatment_Date'] = pd.to_datetime(data['Treatment_Date'])
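
One detail of the mapping step is worth a quick illustration: calling replace on a whole dataframe with a flat dictionary applies the mapping to every column, while calling it on a single column limits the change to that column. The toy frame below uses made-up values and a hypothetical Notes column that is not part of our dataset; it is only a sketch of the difference.

```python
import pandas as pd

# Toy data: the value '1293' appears in two different columns.
toy = pd.DataFrame({'ProductCode': ['1293', '3259'],
                    'Notes': ['1293', 'n/a']})
labels = {'1293': 'amusement devices (1293)', '3259': 'go-karts (3259)'}

# DataFrame-wide replace: every matching cell changes, including Notes.
frame_wide = toy.replace(labels)

# Column-scoped replace: only ProductCode changes.
scoped = toy.copy()
scoped['ProductCode'] = scoped['ProductCode'].replace(labels)

print(frame_wide)
print(scoped)
```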

In [4]:
data.set_index('CPSC_Case_Number', inplace = True)

We have also set the Case Number as the index since it is a unique identifier for each record. Let's look at the current data types for the remaining columns.

In [5]:
data.dtypes

Treatment_Date     datetime64[ns]
Age                         int64
Sex                        object
Narrative                  object
Diagnosis                  object
Other_Diagnosis            object
BodyPart                   object
Disposition                object
ProductCode                object
Device_category            object
Device_type                object
Location                   object
Stratum                    object
PSU                         int64
Weight                    float64
dtype: object

Here we see that everything is a string except for Treatment_Date (a datetime), Age, PSU, and Weight. No information was available for what PSU and Weight actually mean, so I will leave them alone for now. Let's view the data to make sure it looks acceptable.

In [6]:
data

CPSC_Case_Number,Treatment_Date,Age,Sex,Narrative,Diagnosis,Other_Diagnosis,BodyPart,Disposition,ProductCode,Device_category,Device_type,Location,Stratum,PSU,Weight
180125260,2017-12-31,3,F,3 YOF JUMPING BOUNCE HOUSE W/MOM JUMPED UP LAN...,Fracture,,"Leg, lower",Treated and released,amusement devices (1293),Inflatables,inflatable,Sports/recreation,C,32,4.7570
180108428,2017-12-31,10,F,10YOF PLAYING *** AT *** AT A PARTY AT *** LOC...,Dental injury,,Mouth,Treated and released,amusement devices (1293),Not identified or unrelated,not identified,Sports/recreation,C,8,4.7570
180120413,2017-12-31,14,M,14YOM- PT WAS PLAYING *** TODAY SWELLING TO RI...,Other,TENDONITIS,Knee,Treated and released,amusement devices (1293),Not identified or unrelated,not identified,Sports/recreation,C,90,4.7570
180125238,2017-12-30,2,F,2 YOF JUMPING IN BOUNCE HOUSE LANDED AWKWARDLY...,Fracture,,"Leg, lower",Treated and released,amusement devices (1293),Inflatables,inflatable,Sports/recreation,C,32,4.7570
180135290,2017-12-30,17,F,"17YOF DRIVING GO CART, RAN INTO BARRIER, HIT L...","Strain, sprain",,Hand,Treated and released,go-karts (3259),Go karts,go kart,Sports/recreation,M,54,79.1731
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130113361,2013-01-03,56,M,"DX CONTU CHEST WALL: 56YOM GO-CART RACING, HIT...","Contusion, abrasion",,"Trunk, upper",Treated and released,go-karts (3259),Go karts,go kart,Sports/recreation,V,67,14.8537
130109590,2013-01-02,12,F,12YO F WAS PLAYING *** WHEN BUMPED ELBOW ON GU...,Laceration,,Elbow,Treated and released,amusement devices (1293),Not identified or unrelated,not identified,Sports/recreation,L,89,77.2173
130113339,2013-01-02,55,M,DX ACITE EXACERBATOPM PF CHR LBP/SCIATICA/RADI...,Nerve damage,,"Trunk, lower",Treated and released,amusement devices (1293),Not identified or unrelated,not identified,Sports/recreation,V,67,14.8537
130109054,2013-01-01,6,M,6YOM PLAYING IN A BOUNCE HOUSE AND HIT HEAD SU...,Hematoma,,Face,Treated and released,amusement devices (1293),Inflatables,inflatable,Sports/recreation,S,73,76.7142


## Create Dummy Columns<a id='Create_Dummy_Columns'></a>

Let's now convert the data into dummy variable columns so that all the categorical columns are split out as 0 or 1 as opposed to strings. First, we will create all the necessary dummy columns which include the following:

    - Sex
    - BodyPart
    - ProductCode
    - Device_category
    - Device_type
    - Location
    - Stratum
    
Then we will convert the "Treatment_Date" field into new columns for day, month, and year. The original Treatment_Date column is dropped later, when we build X.

In [7]:
#Not including 'Diagnosis' as that is what we want to predict.
#Not including 'Other_Diagnosis' as that value is derived from the Diagnosis column.
d_data = pd.get_dummies(data, columns= ['Sex', 'BodyPart', 
                               'ProductCode', 'Device_category', 
                               'Device_type', 'Location', 'Stratum'], drop_first=True)

#Need to convert the treatment date to Day, Month, and Year
d_data['Treatment_Day'] = d_data['Treatment_Date'].dt.day
d_data['Month_Day'] = d_data['Treatment_Date'].dt.month
d_data['Year_Day'] = d_data['Treatment_Date'].dt.year

In [8]:
d_data.head()

CPSC_Case_Number,Treatment_Date,Age,Narrative,Diagnosis,Other_Diagnosis,Disposition,PSU,Weight,Sex_M,BodyPart_Ankle,...,Location_Sports/recreation,Location_Street,Location_Unknown,Stratum_L,Stratum_M,Stratum_S,Stratum_V,Treatment_Day,Month_Day,Year_Day
180125260,2017-12-31,3,3 YOF JUMPING BOUNCE HOUSE W/MOM JUMPED UP LAN...,Fracture,,Treated and released,32,4.757,0,0,...,1,0,0,0,0,0,0,31,12,2017
180108428,2017-12-31,10,10YOF PLAYING *** AT *** AT A PARTY AT *** LOC...,Dental injury,,Treated and released,8,4.757,0,0,...,1,0,0,0,0,0,0,31,12,2017
180120413,2017-12-31,14,14YOM- PT WAS PLAYING *** TODAY SWELLING TO RI...,Other,TENDONITIS,Treated and released,90,4.757,1,0,...,1,0,0,0,0,0,0,31,12,2017
180125238,2017-12-30,2,2 YOF JUMPING IN BOUNCE HOUSE LANDED AWKWARDLY...,Fracture,,Treated and released,32,4.757,0,0,...,1,0,0,0,0,0,0,30,12,2017
180135290,2017-12-30,17,"17YOF DRIVING GO CART, RAN INTO BARRIER, HIT L...","Strain, sprain",,Treated and released,54,79.1731,0,0,...,1,0,0,0,1,0,0,30,12,2017


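Each categorical column has been expanded into indicator columns that hold a 1 when the row has that value and a 0 otherwise. We dropped the first level of each category because it is implied when all the other indicators for that category are 0, so keeping it would only duplicate information.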
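
To make the effect of drop_first concrete, here is a minimal sketch on a made-up three-row frame (not our dataset). The dtype=int argument is added here so the output shows 0/1 on newer pandas versions, which otherwise return True/False.

```python
import pandas as pd

# Toy data with two categorical columns (values are illustrative only).
toy = pd.DataFrame({'Sex': ['F', 'M', 'F'],
                    'Stratum': ['C', 'M', 'V']})

# drop_first=True removes the first level of each category (here 'F' and 'C').
encoded = pd.get_dummies(toy, columns=['Sex', 'Stratum'],
                         drop_first=True, dtype=int)

print(encoded)
```

The first row (F, C) is all zeros: it belongs to the dropped baseline level of both categories, which is why no information is lost by dropping those columns.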

## Feature Engineering<a id='Feature_Engineering'></a>

Let's now split the Diagnosis column into two categories to simplify the model for now. We will use "Common" and "Uncommon" so we can predict which situations contribute to the most common injuries and safeguards can be put in place to help prevent them. The common category will be the three most frequent diagnoses, which are "Strain, sprain", "Fracture", and "Contusion, abrasion". Everything else will be considered uncommon.

In [9]:
d_data['Diagnosis'].value_counts()

Strain, sprain                1391
Fracture                      1280
Contusion, abrasion           1030
Other                          986
Laceration                     660
Internal injury                540
Concussion                     233
Dislocation                    140
Dental injury                   73
Hematoma                        48
Burn, thermal                   43
Dermatitis, conjunctivitis      33
Crushing                        22
Avulsion                        21
Foreign body                    18
Nerve damage                    17
Puncture                        11
Hemorrhage                       9
Amputation                       9
Submersion                       6
Radiation                        3
Electric shock                   3
Anoxia                           2
Aspiration                       2
Poisoning                        2
Ingestion                        1
Burn, chemical                   1
Name: Diagnosis, dtype: int64

Based on the value counts above, the split will be 3701 common and 2883 uncommon records, roughly a 56%/44% split of the data.
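
The arithmetic behind those numbers can be checked directly from the value counts. This is just a worked calculation using the counts printed above.

```python
# Counts for the three most frequent diagnoses, taken from value_counts() above.
common_counts = {'Strain, sprain': 1391, 'Fracture': 1280, 'Contusion, abrasion': 1030}
total_rows = 6584  # total number of records in the dataset

n_common = sum(common_counts.values())
n_uncommon = total_rows - n_common

# Share of each class as a percentage of all records.
pct_common = 100 * n_common / total_rows
pct_uncommon = 100 * n_uncommon / total_rows

print(n_common, n_uncommon)
print(round(pct_common), round(pct_uncommon))
```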

In [10]:
common = {'Strain, sprain':'common' , 'Fracture':'common', 'Contusion, abrasion':'common'}

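#Map the three common diagnoses to 'common'; every other diagnosis becomes 'uncommon'.
#Filling only this column leaves missing values elsewhere (e.g. Other_Diagnosis) untouched.
d_data['Split_Diagnosis'] = d_data['Diagnosis'].map(common).fillna('uncommon')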

d_data['Split_Diagnosis'].value_counts()

common      3701
uncommon    2883
Name: Split_Diagnosis, dtype: int64

Let's look at the new Split_Diagnosis column:

In [11]:
d_data.loc[:,['Diagnosis','Split_Diagnosis']]

CPSC_Case_Number,Diagnosis,Split_Diagnosis
180125260,Fracture,common
180108428,Dental injury,uncommon
180120413,Other,uncommon
180125238,Fracture,common
180135290,"Strain, sprain",common
...,...,...
130113361,"Contusion, abrasion",common
130109590,Laceration,uncommon
130113339,Nerve damage,uncommon
130109054,Hematoma,uncommon


We see that the only rows labeled common are those with one of the three diagnoses we selected, so the new Split_Diagnosis field is exactly what we want for our purposes.
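
As a small illustration of how the mapping works, the sketch below uses a made-up four-row frame: map leaves a missing value wherever a diagnosis is not a key in the dictionary, and filling only that one column keeps missing values in other columns (such as Other_Diagnosis) as they were.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({'Diagnosis': ['Fracture', 'Other', 'Strain, sprain', 'Hematoma'],
                    'Other_Diagnosis': [None, 'TENDONITIS', None, None]})
common = {'Strain, sprain': 'common', 'Fracture': 'common', 'Contusion, abrasion': 'common'}

# map() returns NaN for any diagnosis that is not a key in the dictionary.
mapped = toy['Diagnosis'].map(common)

# Filling only this Series leaves missing values in other columns alone.
toy['Split_Diagnosis'] = mapped.fillna('uncommon')

print(toy)
```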

Let's also add a day-of-week feature, with Monday through Sunday encoded as 0 through 6.

In [12]:
d_data['day_of_week'] = d_data['Treatment_Date'].dt.dayofweek

Let's view the new column and see how it looks. 

In [13]:
d_data['day_of_week_S'] = d_data['Treatment_Date'].apply(lambda x: x.strftime('%A'))

d_data.loc[:,['day_of_week','Treatment_Date','day_of_week_S']]

CPSC_Case_Number,day_of_week,Treatment_Date,day_of_week_S
180125260,6,2017-12-31,Sunday
180108428,6,2017-12-31,Sunday
180120413,6,2017-12-31,Sunday
180125238,5,2017-12-30,Saturday
180135290,5,2017-12-30,Saturday
...,...,...,...
130113361,3,2013-01-03,Thursday
130109590,2,2013-01-02,Wednesday
130113339,2,2013-01-02,Wednesday
130109054,1,2013-01-01,Tuesday


In [14]:
#Dropping the string value as it was just used for comparison's sake.
d_data.drop(['day_of_week_S'], axis=1, inplace=True)
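
The same check can be reproduced on a few dates in isolation. This sketch uses three dates that appear in the table above; dt.day_name() is used here as an alternative to strftime('%A') for getting the weekday name.

```python
import pandas as pd

# Three dates taken from the output above.
dates = pd.Series(pd.to_datetime(['2017-12-31', '2017-12-30', '2013-01-01']))

# dayofweek encodes Monday as 0 through Sunday as 6.
day_number = dates.dt.dayofweek
day_name = dates.dt.day_name()

print(pd.DataFrame({'date': dates, 'day_of_week': day_number, 'name': day_name}))
```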

We now have the data that we want to split out for training and testing purposes. 

## Train Test Split<a id='Train_Test_Split'></a>

X will contain every column except the targets ("Diagnosis" and the derived "Split_Diagnosis") and the columns that cannot be used as features, which in this case are the following:

    "Other Diagnosis" because that is derived from Diagnosis.
    "Treatment Date" as that is split up into Day, Month and Year for separate columns.
    "Narrative" as there was no way to quantify that field.
    "Disposition" because that occurs after the Diagnosis. 

The variable y will be derived from the "Split_Diagnosis" column, with 1 denoting a common diagnosis and 0 an uncommon one. We will then split X and y into a training set (80%) and a test set (20%).

In [15]:
#Setup X and y variables
#X is the rest of the dataframe
#y is the variable to predict
X = d_data.drop(["Diagnosis", 'Split_Diagnosis','Other_Diagnosis','Treatment_Date','Narrative','Disposition'], axis=1)
y = pd.get_dummies(d_data['Split_Diagnosis'])

#Keep just the common column for y
y.drop('uncommon', axis=1, inplace=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
#The numeric columns (Age, PSU, Weight) are scaled in the next cell, after the split.

Since almost all of our data are categorical, there isn't much to scale on the numerical side, as Age is our only real numerical value. However, for the sake of this exercise, we will standardize the Age, PSU, and Weight columns.

In [16]:
#Columns to standardize
num_cols = ['Age', 'PSU', 'Weight']

#Fit the scaler on the training set only, then apply the same transform to both sets
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
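
To show what the scaler is doing, here is a minimal sketch in plain NumPy with made-up ages. It mirrors the fit-on-train, transform-both pattern: the mean and standard deviation come from the training values only and are then reused for the test values. StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np

# Made-up ages for illustration only.
train_age = np.array([2.0, 4.0, 6.0])
test_age = np.array([4.0, 8.0])

# "Fit": learn the mean and population standard deviation from the training data.
mean = train_age.mean()
std = train_age.std()  # ddof=0 by default

# "Transform": apply the training statistics to both sets.
train_scaled = (train_age - mean) / std
test_scaled = (test_age - mean) / std

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1, while the test values generally do not, because they are expressed relative to the training statistics.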

Let's now confirm the shape of each train and test value. 

In [17]:
X_train.shape

(5267, 102)

In [18]:
X_test.shape

(1317, 102)

In [19]:
y_train.shape

(5267, 1)

In [20]:
y_test.shape

(1317, 1)
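
The row counts printed above follow from the 80/20 split. The sketch below assumes that train_test_split rounds the test share up to a whole number of rows and gives the remainder to the training set, which matches the shapes we got.

```python
import math

n_samples = 6584   # total rows (3701 common + 2883 uncommon)
test_size = 0.2

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```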

We see that there are 5267 rows for the training data and 1317 rows for the test data. We also see that the X values have 102 columns while the y values have 1 column. Let's view the X_train data to make sure it looks as we would expect.

In [21]:
X_train

CPSC_Case_Number,Age,PSU,Weight,Sex_M,BodyPart_Ankle,"BodyPart_Arm, lower","BodyPart_Arm, upper",BodyPart_Ear,BodyPart_Elbow,BodyPart_Eyeball,...,Location_Street,Location_Unknown,Stratum_L,Stratum_M,Stratum_S,Stratum_V,Treatment_Day,Month_Day,Year_Day,day_of_week
140248355,-0.670665,0.829478,1.357959,0,0,0,1,0,0,0,...,0,0,0,0,1,0,25,7,2013,3
131033338,-0.889547,0.024048,1.292006,1,0,0,0,0,0,0,...,0,0,0,0,1,0,5,10,2013,5
141120816,-1.035468,0.059067,1.161436,1,0,0,0,0,0,0,...,0,0,1,0,0,0,19,10,2014,6
150717427,-0.451783,0.654385,-0.554388,1,0,1,0,0,0,0,...,0,0,0,0,0,1,5,7,2015,6
130910795,-0.743625,-0.396176,-0.820458,0,0,0,0,0,0,0,...,0,0,0,0,0,0,31,8,2013,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150604252,-0.962507,-0.851419,-0.850115,0,0,0,0,0,0,0,...,0,1,0,0,0,0,24,5,2015,6
140443952,-0.889547,-0.361157,-0.848867,0,0,0,0,0,0,0,...,0,0,0,0,0,0,17,4,2014,3
140426349,2.831447,1.214684,-0.597344,0,0,0,0,0,0,0,...,0,0,0,0,0,1,8,4,2014,1
140125949,-0.305861,1.074609,1.326441,0,0,0,0,0,0,0,...,0,0,0,0,1,0,11,1,2014,5


Each row is indexed by its case number, followed by its feature values. Now let's look at the y_train values.

In [22]:
y_train

CPSC_Case_Number,common
140248355,1
131033338,1
141120816,0
150717427,1
130910795,1
...,...
150604252,1
140443952,1
140426349,1
140125949,0


We see each case number and the diagnosis they ended up with. 1 will refer to a common diagnosis while a 0 will refer to an uncommon diagnosis. This will give our model the training data it needs to see how the other values correlate to the diagnosis. 

## Save the training/testing data<a id='Save_the_training/testing_data'></a>

Now let's save the train and test dataframes so we can use them in a later notebook.

In [23]:
#Save the train and test sets to separate .csv files in the clean_data folder (quoting=1 quotes every field)
X_train.to_csv("data/clean_data/X_train.csv", index=True, quoting=1)

X_test.to_csv("data/clean_data/X_test.csv", index=True, quoting=1)

y_train.to_csv("data/clean_data/y_train.csv", index=True, quoting=1)

y_test.to_csv("data/clean_data/y_test.csv", index=True, quoting=1)
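
For reference, here is a sketch of how one of these files could be read back later with the case number restored as the index. It uses a made-up two-row frame and an in-memory buffer instead of a real file; quoting=1 is the numeric value of csv.QUOTE_ALL from the standard library.

```python
import csv
import io
import pandas as pd

# Toy stand-in for X_train (made-up values).
toy = pd.DataFrame({'Age': [-0.5, 1.25], 'Sex_M': [0, 1]},
                   index=pd.Index([140248355, 131033338], name='CPSC_Case_Number'))

# Write with every field quoted, as in the cell above.
buffer = io.StringIO()
toy.to_csv(buffer, index=True, quoting=csv.QUOTE_ALL)

# Read it back, restoring the case number as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='CPSC_Case_Number')

print(restored)
```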

## Conclusion<a id='Conclusion'></a>

In this notebook, we created dummy columns for the categorical variables, split the data into training and testing sets, and standardized the numerical values. We split the Diagnosis column into two categories, common and uncommon. The common diagnoses are "Strain, sprain", "Fracture", and "Contusion, abrasion"; everything else is uncommon. We then saved the standardized, dummy-encoded train and test datasets to the X_train, X_test, y_train, and y_test files in the clean_data folder.