## Introduction

This notebook preprocesses our data to get it ready for machine learning. The steps are:

1. Create dummy or indicator features for categorical variables
2. Standardize the magnitude of numeric features using a scaler
3. Split the data into training and testing datasets

## Imports

Here we import the necessary libraries and the cleaned data, which we will use for preprocessing and for building a basic machine learning model on the dataset.

In [20]:
#Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

#Don't display future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
sns.set()

In [64]:
#Import cleaned data
data = pd.read_csv('data/clean_data/clean_data.csv')

## View Data

Here we will make some adjustments to the dataset so that every column has the data type and values we need.

In [65]:
#Update the data type to string for Product Code, as the codes are categories, not ints.
data['ProductCode'] = data['ProductCode'].astype(str)

#Product codes: 1293 (amusement devices), 3295 (water slides, public), and 3259 (go-karts).
#Update the product code values to align with their actual meaning.

#Create a dictionary for the product codes
product_codes = {'1293':'amusement devices (1293)', '3295': 'water slides, public (3295)', '3259':'go-karts (3259)'}

#Replace the codes in the ProductCode column only, so matching strings in other columns are left untouched
data['ProductCode'] = data['ProductCode'].replace(product_codes)

#Convert to Date Time object
data['Treatment_Date'] = pd.to_datetime(data['Treatment_Date'])

In [66]:
data.set_index('CPSC_Case_Number', inplace = True)

We have also set the Case Number as the index since it is a unique identifier for each row. Let's look at the current data types for the remaining columns.

In [67]:
data.dtypes

Treatment_Date     datetime64[ns]
Age                         int64
Sex                        object
Narrative                  object
Diagnosis                  object
Other_Diagnosis            object
BodyPart                   object
Disposition                object
ProductCode                object
Device_category            object
Device_type                object
Location                   object
Stratum                    object
PSU                         int64
Weight                    float64
dtype: object

Here we see that every column is a string (object) except for Treatment_Date, Age, PSU, and Weight. No information was available for what PSU and Weight actually mean, so I will leave them alone for now. Let's view the data to make sure it looks acceptable.

In [71]:
data.head()

| CPSC_Case_Number | Treatment_Date | Age | Sex | Narrative | Diagnosis | Other_Diagnosis | BodyPart | Disposition | ProductCode | Device_category | Device_type | Location | Stratum | PSU | Weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 180125260 | 2017-12-31 | 3 | F | 3 YOF JUMPING BOUNCE HOUSE W/MOM JUMPED UP LAN... | Fracture |  | Leg, lower | Treated and released | amusement devices (1293) | Inflatables | inflatable | Sports/recreation | C | 32 | 4.757 |
| 180108428 | 2017-12-31 | 10 | F | 10YOF PLAYING *** AT *** AT A PARTY AT *** LOC... | Dental injury |  | Mouth | Treated and released | amusement devices (1293) | Not identified or unrelated | not identified | Sports/recreation | C | 8 | 4.757 |
| 180120413 | 2017-12-31 | 14 | M | 14YOM- PT WAS PLAYING *** TODAY SWELLING TO RI... | Other | TENDONITIS | Knee | Treated and released | amusement devices (1293) | Not identified or unrelated | not identified | Sports/recreation | C | 90 | 4.757 |
| 180125238 | 2017-12-30 | 2 | F | 2 YOF JUMPING IN BOUNCE HOUSE LANDED AWKWARDLY... | Fracture |  | Leg, lower | Treated and released | amusement devices (1293) | Inflatables | inflatable | Sports/recreation | C | 32 | 4.757 |
| 180135290 | 2017-12-30 | 17 | F | 17YOF DRIVING GO CART, RAN INTO BARRIER, HIT L... | Strain, sprain |  | Hand | Treated and released | go-karts (3259) | Go karts | go kart | Sports/recreation | M | 54 | 79.1731 |


## Get Dummies

Let's now convert the categorical columns into dummy variable columns, so that each category is represented as 0 or 1 instead of a string. We will also split the "Treatment_Date" field into new columns for day, month, and year. Lastly, we will drop the "Narrative" field, as it will not be used with the machine learning algorithms.

In [72]:
#Diagnosis is left out of the dummy columns because it is the first prediction target.
d_data = pd.get_dummies(data, columns= ['Sex', 'Other_Diagnosis', 'BodyPart', 
                               'Disposition', 'ProductCode', 'Device_category', 
                               'Device_type', 'Location', 'Stratum', 'PSU'], drop_first=True)

#Need to convert the treatment date to Day, Month, and Year
d_data['Treatment_Day'] = d_data['Treatment_Date'].dt.day
d_data['Month_Day'] = d_data['Treatment_Date'].dt.month
d_data['Year_Day'] = d_data['Treatment_Date'].dt.year
d_data.drop('Treatment_Date', axis=1,inplace= True)
d_data.drop('Narrative', axis=1,inplace= True)

In [73]:
d_data.head()

| CPSC_Case_Number | Age | Diagnosis | Weight | Sex_M | Other_Diagnosis_ABD PAIN | Other_Diagnosis_ABD PX | Other_Diagnosis_ABDOMINAL PAIN | Other_Diagnosis_ABSCESS | Other_Diagnosis_ACHE | Other_Diagnosis_ACUTE ABD PAIN | ... | PSU_95 | PSU_96 | PSU_97 | PSU_98 | PSU_99 | PSU_100 | PSU_101 | Treatment_Day | Month_Day | Year_Day |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 180125260 | 3 | Fracture | 4.757 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 12 | 2017 |
| 180108428 | 10 | Dental injury | 4.757 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 12 | 2017 |
| 180120413 | 14 | Other | 4.757 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 12 | 2017 |
| 180125238 | 2 | Fracture | 4.757 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30 | 12 | 2017 |
| 180135290 | 17 | Strain, sprain | 79.1731 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30 | 12 | 2017 |


We see that our categorical columns are now converted to 0 or 1, with a 1 marking the rows that have that specific value. We dropped the first column of each category because it is redundant: a row with 0 in all of the remaining columns must belong to the dropped category. Now we can set up our training and testing sets.
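To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset; `dtype=int` is passed only so the output shows 0/1 on newer pandas versions, which otherwise return booleans.

```python
import pandas as pd

# Toy frame; the values are illustrative, not rows from the injury data.
toy = pd.DataFrame({
    "Age": [3, 10, 14],
    "Sex": ["F", "F", "M"],
    "Stratum": ["C", "M", "S"],
})

# drop_first=True drops the first level of each encoded column (in sorted
# order), so "F" and "C" become the implicit reference categories.
toy_dummies = pd.get_dummies(
    toy, columns=["Sex", "Stratum"], drop_first=True, dtype=int
)

print(list(toy_dummies.columns))
print(toy_dummies)
```

A row with `Sex_M == 0` is female, and a row with 0 in both `Stratum_M` and `Stratum_S` belongs to stratum C, so no information is lost by dropping those columns.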

## Train Test Split

X will be all the columns except the one we want to predict, which to start will be "Diagnosis". The variable y will be the "Diagnosis" column converted to dummy columns, one per diagnosis, each denoted as a 0 or a 1. We will then split our X and y values into a train (80%) and test (20%) set.

In [74]:
#Setup X and y variables
#X is the rest of the dataframe
#y is the variable to predict
X = d_data.drop("Diagnosis", axis=1)
y = pd.get_dummies(d_data['Diagnosis'])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
#Nothing to scale since age is the only real numerical value we are using. 

Since almost all of our features are categorical, there is little to scale on the numerical side: Age is the only real numerical value we are using, and Weight is left as-is for now. Let's now confirm the shape of each train and test set.
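As an aside, if scaling is wanted later (step 2 in the introduction), a minimal sketch could look like the following. It uses scikit-learn's `StandardScaler` on made-up numbers; the notebook itself does not scale anything, so the frame `toy_X` and the choice of scaler are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric features; values are illustrative only.
toy_X = pd.DataFrame({
    "Age": [3, 10, 14, 2, 17, 6, 54, 11, 9, 5],
    "Weight": [4.757, 4.757, 4.757, 4.757, 79.1731,
               81.0979, 14.3089, 80.0213, 15.7762, 6.6878],
})

toy_train, toy_test = train_test_split(toy_X, test_size=0.2, random_state=42)

# Fit on the training rows only, so no test-set statistics leak into the scaler.
scaler = StandardScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(toy_train),
    columns=toy_train.columns, index=toy_train.index,
)
# Reuse the training mean and standard deviation for the test rows.
test_scaled = pd.DataFrame(
    scaler.transform(toy_test),
    columns=toy_test.columns, index=toy_test.index,
)

print(train_scaled.mean().round(6))
print(train_scaled.std(ddof=0).round(6))
```

The scaled training columns have mean 0 and (population) standard deviation 1; the test columns generally will not, because they reuse the training statistics.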

In [75]:
X_train.shape

(5267, 433)

In [76]:
X_test.shape

(1317, 433)

In [77]:
y_train.shape

(5267, 27)

In [78]:
y_test.shape

(1317, 27)
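The row counts in these shapes can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional test size up and gives the remaining rows to the training set, which matches the numbers above; the zero-filled arrays are placeholders, not the real data.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5267 + 1317              # total rows implied by the shapes above
n_test = math.ceil(0.2 * n_rows)  # 20% of 6584 is 1316.8, rounded up
n_train = n_rows - n_test
print(n_train, n_test)

# Same check with train_test_split on placeholder arrays of that length.
X_demo = np.zeros((n_rows, 1))
y_demo = np.zeros(n_rows)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```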

We see that there are 5267 rows for the training data and 1317 rows for the test data. We also see that the X values have 433 columns while the y values have 27 columns. Let's view the X_train data to make sure it looks as we would expect it to. 

In [79]:
X_train

| CPSC_Case_Number | Age | Weight | Sex_M | Other_Diagnosis_ABD PAIN | Other_Diagnosis_ABD PX | Other_Diagnosis_ABDOMINAL PAIN | Other_Diagnosis_ABSCESS | Other_Diagnosis_ACHE | Other_Diagnosis_ACUTE ABD PAIN | Other_Diagnosis_ACUTE COSTOCOMNDRISI | ... | PSU_95 | PSU_96 | PSU_97 | PSU_98 | PSU_99 | PSU_100 | PSU_101 | Treatment_Day | Month_Day | Year_Day |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 140248355 | 6 | 81.0979 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25 | 7 | 2013 |
| 131033338 | 3 | 78.8451 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 10 | 2013 |
| 141120816 | 1 | 74.3851 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 10 | 2014 |
| 150717427 | 9 | 15.7762 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 7 | 2015 |
| 130910795 | 5 | 6.6878 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 8 | 2013 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150604252 | 2 | 5.6748 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 | 5 | 2015 |
| 140443952 | 3 | 5.7174 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17 | 4 | 2014 |
| 140426349 | 54 | 14.3089 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 4 | 2014 |
| 140125949 | 11 | 80.0213 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 1 | 2014 |


We can see each case number in the training data along with its feature values. Now let's look at the y_train value.

In [80]:
diagnosis = y_train.columns #Saving the column values so we can identify each column later on. 
y_train

| CPSC_Case_Number | Amputation | Anoxia | Aspiration | Avulsion | Burn, chemical | Burn, thermal | Concussion | Contusion, abrasion | Crushing | Dental injury | ... | Ingestion | Internal injury | Laceration | Nerve damage | Other | Poisoning | Puncture | Radiation | Strain, sprain | Submersion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 140248355 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 131033338 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 141120816 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 150717427 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 130910795 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150604252 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 140443952 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 140426349 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 140125949 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


We see each case number and the diagnosis they ended up with. This will give our model the training data it needs to see how the other values correlate to the diagnosis. 

## Guessing with Mean

Before we create a model, let's establish a baseline. Taking the mean of each column in the training set gives the percentage of cases with each Diagnosis, which is what we would expect if we guessed from the class frequencies alone.

In [81]:
train_percent = y_train.mean()*100 #Column-wise mean; newer pandas versions may reduce np.mean(DataFrame) to a single number
train_percent

Amputation                     0.113917
Anoxia                         0.037972
Aspiration                     0.018986
Avulsion                       0.341751
Burn, chemical                 0.018986
Burn, thermal                  0.550598
Concussion                     3.474464
Contusion, abrasion           15.872413
Crushing                       0.379723
Dental injury                  1.082210
Dermatitis, conjunctivitis     0.512626
Dislocation                    2.050503
Electric shock                 0.037972
Foreign body                   0.265806
Fracture                      19.176002
Hematoma                       0.664515
Hemorrhage                     0.151889
Ingestion                      0.018986
Internal injury                8.239985
Laceration                    10.347446
Nerve damage                   0.303778
Other                         14.752231
Poisoning                      0.018986
Puncture                       0.208848
Radiation                      0.056958


We see that the "Strain, sprain" diagnosis makes up 21.2% of the cases in the training set. These shares show what we would expect for each Diagnosis if we guessed from the training data alone.
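One way to turn these shares into a single baseline number is to always guess the most frequent diagnosis; the accuracy of that guess equals the share of that class. The sketch below illustrates the idea on made-up labels, so the names `toy_labels` and `toy_y` and the resulting figures are assumptions, not results from the dataset.

```python
import pandas as pd

# Toy labels; illustrative only.
toy_labels = pd.Series([
    "Fracture", "Other", "Fracture", "Laceration", "Fracture",
    "Other", "Strain, sprain", "Fracture", "Other", "Laceration",
])
toy_y = pd.get_dummies(toy_labels, dtype=int)

# Column means of a one-hot frame are the class shares.
class_share = toy_y.mean() * 100
majority_class = class_share.idxmax()

# Always guessing the majority class is right exactly as often as it occurs.
baseline_accuracy = (toy_labels == majority_class).mean()
print(class_share)
print(majority_class, baseline_accuracy)
```

Any model worth keeping should beat this majority-class accuracy on the test set.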

## Modeling

Now let's test a model. We can try Linear Regression first to make sure we can get the machine learning running smoothly and try to make sense of the results. 

In [83]:
#Try linear regression first

linreg = LinearRegression()
linreg.fit(X_train,y_train)

y_pred = linreg.predict(X_test)
(pd.DataFrame(y_pred, columns= diagnosis, index=y_test.index) *100).T

| CPSC_Case_Number | 170827787 | 160852455 | 170157166 | 141223560 | 130852418 | 130850294 | 130622327 | 140644499 | 141043892 | 161009850 | ... | 160605794 | 140960494 | 160406571 | 150834170 | 130658810 | 161044669 | 130902754 | 171106994 | 160504952 | 160205773 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amputation | -0.30494 | 0.123795 | 0.010496 | -0.464502 | -1.350684 | 1.834393 | -0.452217 | -0.10954 | 0.153382 | 0.079069 | ... | 0.461394 | -1.402919 | 0.542218 | 0.017948 | -0.183036 | -0.319931 | -0.268615 | -0.429663 | 0.177081 | 0.49176 |
| Anoxia | -0.138998 | -0.00886 | 0.802048 | -0.757487 | -0.080827 | 0.072271 | -0.113203 | 0.0757 | -0.107189 | -0.0419 | ... | -0.20718 | 0.003853 | -0.039199 | 0.094268 | -0.315144 | -0.279145 | -0.083387 | -0.164868 | -0.077592 | 0.052577 |
| Aspiration | -0.056878 | -0.015776 | 0.327169 | 0.091346 | 0.066439 | 0.068145 | 0.151804 | 0.022288 | 0.023229 | 0.014986 | ... | 0.030997 | 0.016008 | -0.046653 | 0.127059 | 0.093322 | 0.055765 | 0.123638 | -0.063085 | 0.000734 | 0.031977 |
| Avulsion | 0.342918 | 0.352372 | -0.105274 | -0.50686 | -0.919458 | 3.0302 | -0.56882 | -1.238939 | 0.222877 | 0.349072 | ... | -0.853804 | -0.012545 | -0.164127 | 1.951948 | 0.48078 | 1.605495 | 0.102381 | 1.588791 | -0.155248 | -0.572764 |
| Burn, chemical | -0.021937 | 0.103399 | 0.020405 | -0.039538 | -0.054083 | -0.025689 | 0.05714 | -0.036958 | 0.312478 | 0.160148 | ... | -0.07104 | -0.142117 | 0.218381 | -0.08639 | -0.013607 | -0.194502 | -0.059072 | -0.153313 | 0.205272 | -0.047999 |
| Burn, thermal | -0.058401 | -0.781174 | -0.886972 | -0.452991 | -2.760312 | 1.484586 | 5.006734 | -1.466008 | 1.7363 | -0.293667 | ... | -0.500372 | -0.859864 | -0.060235 | 1.719753 | 0.296537 | -1.956613 | -0.425766 | -2.356918 | 2.934243 | -0.343637 |
| Concussion | 1.288284 | -1.014004 | 18.611813 | -0.837432 | -0.608096 | -0.499364 | 5.884704 | -3.164835 | 1.415594 | -0.938197 | ... | 2.590544 | 6.00177 | 0.661348 | 22.916037 | -5.491601 | 0.777572 | 6.91512 | 3.897153 | 0.24937 | 1.215859 |
| Contusion, abrasion | 14.87293 | 32.532088 | 7.066135 | 38.75382 | 10.443583 | 9.708504 | 43.142769 | -3.271033 | 42.483746 | 7.325715 | ... | 23.335314 | 24.29745 | 56.297182 | -2.252086 | 3.139543 | 46.401169 | -12.438334 | 27.140726 | 11.833013 | -3.81943 |
| Crushing | 1.385972 | -0.186659 | 3.084086 | 0.741477 | -0.576957 | 1.379316 | -0.262002 | 0.637653 | 0.302191 | -0.224562 | ... | 1.166697 | 0.250569 | -0.394311 | -0.045348 | 0.524686 | 1.693701 | 2.075211 | -1.145216 | 0.71726 | 1.307705 |
| Dental injury | -0.521808 | -0.441588 | -2.200362 | -0.26264 | -1.599728 | -0.835333 | -2.170608 | -1.675106 | -0.002902 | -0.186419 | ... | 0.254955 | -0.350637 | -0.153676 | -0.95154 | 0.024705 | -0.339628 | 0.640066 | 0.485978 | -0.97661 | -0.687454 |


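The prediction gives, for each case number, a score for every diagnosis on a percentage scale. The higher the score, the more likely that is the Diagnosis for that case. Because linear regression output is unbounded, some scores are negative, so they are best read as relative scores rather than true probabilities.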
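To turn these scores into a single predicted diagnosis per case, one option is to take the highest-scoring column in each row and compare it with the true label. The sketch below shows this on synthetic stand-in data; the random features, the three class names, and the variable names are assumptions for illustration, not the notebook's data or results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data; not the injury dataset.
classes = ["Fracture", "Laceration", "Other"]
X_demo = rng.normal(size=(200, 4))
labels = pd.Series(rng.choice(classes, size=200))
y_demo = pd.get_dummies(labels, dtype=int)

# Fit one linear model per one-hot target column, as in the cell above.
model = LinearRegression().fit(X_demo, y_demo)
scores = pd.DataFrame(model.predict(X_demo), columns=y_demo.columns)

# The highest-scoring column in each row becomes the predicted label.
predicted = scores.idxmax(axis=1)
accuracy = (predicted == labels).mean()

print(predicted.head())
print(accuracy)
```

With the notebook's objects, the same idea would apply `idxmax(axis=1)` to both the prediction frame and `y_test`, then compare the two label series.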

## Conclusion

## To Do

1. Table of Contents
2. Conclusion