# 4 Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Create Dummy Features](#4.5_Create_Dummy_Features)
  * [4.6 Train/Test Split](#4.6_Train/Test_Split)

## 4.2 Introduction<a id='4.2_Introduction'></a>

This step prepares the selected features for modeling:
* Create dummy features for the categorical columns
* Split the data into training and testing subsets

## 4.3 Imports<a id='4.3_Imports'></a>

In [6]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split

## 4.4 Load Data<a id='4.4_Load_Data'></a>

In [8]:
data = pd.read_csv('../data/processed/heart_attack_prediction_dataset_selected_features.csv')
data.head(10)

Unnamed: 0,Cholesterol,Diabetes,Exercise Hours Per Week,Triglycerides,Systolic,Age,Previous Heart Problems,Medication Use,Heart Attack Risk,Cholesterol Level,Blood Pressure
0,208,0,4.168,286,158,67,0,0,0,At Risk,Stage 2 Hypertension
1,389,1,1.813,235,165,21,1,0,0,Dangerous,Stage 2 Hypertension
2,324,1,2.078,587,174,21,1,1,0,Dangerous,Stage 2 Hypertension
3,383,1,9.828,378,163,84,1,0,0,Dangerous,Stage 2 Hypertension
4,318,1,5.804,231,91,66,1,0,0,Dangerous,Healthy
5,297,1,0.625,795,172,54,1,1,1,Dangerous,Stage 2 Hypertension
6,358,0,4.098,284,102,90,0,0,1,Dangerous,Healthy
7,220,0,3.428,370,131,84,0,1,1,At Risk,Stage 1 Hypertension
8,145,1,16.868,790,144,20,0,0,0,Hearty Health,Stage 2 Hypertension
9,248,0,0.195,232,160,43,0,0,0,Dangerous,Stage 2 Hypertension


## 4.5 Create Dummy Features<a id='4.5_Create_Dummy_Features'></a>

The Cholesterol Level and Blood Pressure columns are categorical features and need to be converted to numeric dummy features.

In [11]:
data_encoded = pd.get_dummies(data, columns=['Cholesterol Level', 'Blood Pressure'], dtype=int)
data_encoded.head(10)

Unnamed: 0,Cholesterol,Diabetes,Exercise Hours Per Week,Triglycerides,Systolic,Age,Previous Heart Problems,Medication Use,Heart Attack Risk,Cholesterol Level_At Risk,Cholesterol Level_Dangerous,Cholesterol Level_Hearty Health,Blood Pressure_Elevated,Blood Pressure_Healthy,Blood Pressure_Hypertension Crisis,Blood Pressure_Stage 1 Hypertension,Blood Pressure_Stage 2 Hypertension
0,208,0,4.168,286,158,67,0,0,0,1,0,0,0,0,0,0,1
1,389,1,1.813,235,165,21,1,0,0,0,1,0,0,0,0,0,1
2,324,1,2.078,587,174,21,1,1,0,0,1,0,0,0,0,0,1
3,383,1,9.828,378,163,84,1,0,0,0,1,0,0,0,0,0,1
4,318,1,5.804,231,91,66,1,0,0,0,1,0,0,1,0,0,0
5,297,1,0.625,795,172,54,1,1,1,0,1,0,0,0,0,0,1
6,358,0,4.098,284,102,90,0,0,1,0,1,0,0,1,0,0,0
7,220,0,3.428,370,131,84,0,1,1,1,0,0,0,0,0,1,0
8,145,1,16.868,790,144,20,0,0,0,0,0,1,0,0,0,0,1
9,248,0,0.195,232,160,43,0,0,0,0,1,0,0,0,0,0,1
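
To see the encoding on a small scale, the following minimal sketch applies `pd.get_dummies` to a hypothetical three-row frame (`toy` is an illustration, not part of the dataset). Each category becomes its own 0/1 column named `<column>_<value>`. The `drop_first=True` variant is shown only as an option, not as something this notebook does; it removes one level per feature, which can help some linear models avoid perfectly collinear columns.

```python
import pandas as pd

# Hypothetical toy frame; the values only mimic the real columns.
toy = pd.DataFrame({
    "Cholesterol": [208, 389, 145],
    "Cholesterol Level": ["At Risk", "Dangerous", "Hearty Health"],
})

# Same call pattern as in the cell above: one indicator column per category.
toy_encoded = pd.get_dummies(toy, columns=["Cholesterol Level"], dtype=int)
print(toy_encoded.columns.tolist())

# Optional variant: drop the first (alphabetically sorted) level of each feature.
toy_reduced = pd.get_dummies(
    toy, columns=["Cholesterol Level"], dtype=int, drop_first=True
)
print(toy_reduced.columns.tolist())
```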


In [12]:
# Check the data types of all columns to confirm they are all numeric
data_encoded.dtypes

Cholesterol                              int64
Diabetes                                 int64
Exercise Hours Per Week                float64
Triglycerides                            int64
Systolic                                 int64
Age                                      int64
Previous Heart Problems                  int64
Medication Use                           int64
Heart Attack Risk                        int64
Cholesterol Level_At Risk                int32
Cholesterol Level_Dangerous              int32
Cholesterol Level_Hearty Health          int32
Blood Pressure_Elevated                  int32
Blood Pressure_Healthy                   int32
Blood Pressure_Hypertension Crisis       int32
Blood Pressure_Stage 1 Hypertension      int32
Blood Pressure_Stage 2 Hypertension      int32
dtype: object
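
Reading the dtype listing by eye works for 17 columns, but the same check can be made programmatic. The sketch below uses `pandas.api.types.is_numeric_dtype`; the helper name `non_numeric_columns` and the toy frame are assumptions introduced for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(df):
    """Return the names of columns whose dtype is not numeric."""
    return [col for col in df.columns if not is_numeric_dtype(df[col])]


# Hypothetical toy frame standing in for the notebook's data.
toy = pd.DataFrame({"Age": [67, 21], "Blood Pressure": ["Healthy", "Elevated"]})

# Before encoding, the categorical column is reported.
print(non_numeric_columns(toy))

# After encoding, nothing should be left over.
toy_encoded = pd.get_dummies(toy, columns=["Blood Pressure"], dtype=int)
print(non_numeric_columns(toy_encoded))
```

Given the dtypes listed above, `non_numeric_columns(data_encoded)` would be expected to return an empty list.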

In [13]:
data_encoded.shape

(8763, 17)

## 4.6 Train/Test Split<a id='4.6_Train/Test_Split'></a>

Partition the data into a 70/30 train/test split.

In [16]:
len(data_encoded) * .7, len(data_encoded) * .3

(6134.099999999999, 2628.9)
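
The fractional values above cannot be used directly as row counts. A short sketch of the arithmetic, assuming `train_test_split` rounds the test share up and assigns the remaining rows to training (an assumption about its rounding that matches the shapes reported later in this section):

```python
import math

n_samples = 8763   # rows in data_encoded, from the shape reported earlier
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # test share rounded up
n_train = n_samples - n_test                # remainder goes to training
print(n_train, n_test)
```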

* X contains all features except the target, "Heart Attack Risk"
* y contains only the target, "Heart Attack Risk"

In [18]:
X = data_encoded.drop("Heart Attack Risk", axis=1)
y = data_encoded["Heart Attack Risk"]

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [20]:
X_train.shape, X_test.shape

((6134, 16), (2629, 16))
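
The split above is purely random, so the share of positive cases may differ slightly between the two subsets. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below is an optional variant on synthetic stand-in data (`X_toy` and `y_toy` are hypothetical), not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 30% positive class.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([1] * 30 + [0] * 70, name="target")

# stratify=y_toy keeps the class ratio approximately equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Share of positive cases in each subset.
print(y_tr.mean(), y_te.mean())
```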

In [21]:
X_train.head()

Unnamed: 0,Cholesterol,Diabetes,Exercise Hours Per Week,Triglycerides,Systolic,Age,Previous Heart Problems,Medication Use,Cholesterol Level_At Risk,Cholesterol Level_Dangerous,Cholesterol Level_Hearty Health,Blood Pressure_Elevated,Blood Pressure_Healthy,Blood Pressure_Hypertension Crisis,Blood Pressure_Stage 1 Hypertension,Blood Pressure_Stage 2 Hypertension
2963,346,0,2.68,143,99,77,0,1,0,1,0,0,1,0,0,0
599,287,0,11.942,385,162,59,1,0,0,1,0,0,0,0,0,1
45,185,0,16.16,675,120,33,0,1,0,0,1,1,0,0,0,0
1444,352,0,3.06,799,101,74,1,1,0,1,0,0,1,0,0,0
1652,260,1,19.363,36,90,29,1,1,0,1,0,0,1,0,0,0


In [22]:
X_test.head()

Unnamed: 0,Cholesterol,Diabetes,Exercise Hours Per Week,Triglycerides,Systolic,Age,Previous Heart Problems,Medication Use,Cholesterol Level_At Risk,Cholesterol Level_Dangerous,Cholesterol Level_Hearty Health,Blood Pressure_Elevated,Blood Pressure_Healthy,Blood Pressure_Hypertension Crisis,Blood Pressure_Stage 1 Hypertension,Blood Pressure_Stage 2 Hypertension
1226,340,0,9.871,315,124,65,1,1,0,1,0,1,0,0,0,0
7903,361,1,2.763,471,177,77,0,0,0,1,0,0,0,0,0,1
1559,341,1,16.325,104,156,70,1,0,0,1,0,0,0,0,0,1
3621,392,0,5.162,201,155,47,0,0,0,1,0,0,0,0,0,1
7552,173,0,3.681,638,103,63,0,1,0,0,1,0,1,0,0,0


In [35]:
data_encoded.to_csv('../data/processed/heart_attack_prediction_dataset_encoded.csv', index=False)
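
Passing `index=False` keeps the row index out of the file, so reloading it does not produce an extra unnamed index column. A minimal round-trip sketch, using a hypothetical toy frame and a temporary file name rather than the project's paths:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for data_encoded.
toy = pd.DataFrame({"Age": [67, 21], "Heart Attack Risk": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_encoded.csv")  # hypothetical file name
    toy.to_csv(path, index=False)   # do not write the row index
    reloaded = pd.read_csv(path)

# The reloaded frame has the same columns and values as the original.
print(reloaded.columns.tolist())
print(reloaded.equals(toy))
```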