# <h1><center>Heart Attack Dataset -- Preprocessing and Training Data Development</center></h1>

### <center>By: Hio Wa Mak</center>

## Import packages

In [1]:
#Import packages for data handling, plotting, and modeling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import statsmodels.api as sm # Module for running linear regression
from statsmodels.graphics.api import abline_plot # Add reference line to a plot
from sklearn.metrics import mean_squared_error, r2_score # calculate mean squared error and r square
from sklearn.model_selection import train_test_split #  split data into training and testing sets
from sklearn import linear_model, preprocessing # Linear_model is used to run OLS models and preprocessing can help process the data before running models

## Read in the data

In [2]:
# Read in data from a csv file
heart = pd.read_csv('../data/heartattack_cleaned.csv')

### Examine details about the heart attack dataset

In [3]:
#Examine info on this dataset
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442067 entries, 0 to 442066
Data columns (total 34 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      442067 non-null  object 
 1   Sex                        442067 non-null  object 
 2   GeneralHealth              440972 non-null  object 
 3   PhysicalHealthDays         431470 non-null  float64
 4   MentalHealthDays           433275 non-null  float64
 5   LastCheckupTime            434026 non-null  object 
 6   PhysicalActivities         441095 non-null  object 
 7   SleepHours                 436871 non-null  float64
 8   RemovedTeeth               431057 non-null  object 
 9   HadHeartAttack             442067 non-null  object 
 10  HadAngina                  438479 non-null  object 
 11  HadStroke                  440997 non-null  object 
 12  HadAsthma                  440630 non-null  object 
 13  HadSkinCancer              43

In [4]:
#Number of columns and observations
heart.shape

(442067, 34)

In [5]:
#Look at the first few observations
heart.head()

Unnamed: 0,State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,...,DifficultyErrands,SmokerStatus,ECigaretteUsage,ChestScan,RaceEthnicityCategory,AgeCategory,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers
0,Alabama,Female,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,No,8.0,,No,...,No,Never smoked,Not at all (right now),No,"White only, Non-Hispanic",Age 80 or older,,,,No
1,Alabama,Female,Excellent,0.0,0.0,,No,6.0,,No,...,No,Never smoked,Never used e-cigarettes in my entire life,No,"White only, Non-Hispanic",Age 80 or older,1.6,68.04,26.57,No
2,Alabama,Female,Very good,2.0,3.0,Within past year (anytime less than 12 months ...,Yes,5.0,,No,...,No,Never smoked,Never used e-cigarettes in my entire life,No,"White only, Non-Hispanic",Age 55 to 59,1.57,63.5,25.61,No
3,Alabama,Female,Excellent,0.0,0.0,Within past year (anytime less than 12 months ...,Yes,7.0,,No,...,No,Current smoker - now smokes some days,Never used e-cigarettes in my entire life,Yes,"White only, Non-Hispanic",,1.65,63.5,23.3,No
4,Alabama,Female,Fair,2.0,0.0,Within past year (anytime less than 12 months ...,Yes,9.0,,No,...,No,Never smoked,Never used e-cigarettes in my entire life,Yes,"White only, Non-Hispanic",Age 40 to 44,1.57,53.98,21.77,Yes


## Recoding

### Recode categorical variables into numbers for subsequent analyses

#### Recode binary (yes and no) variables

In [6]:
#Select binary columns
binarycolumns = ['PhysicalActivities', 'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma',
                'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis',
                'DeafOrHardOfHearing', 'BlindOrVisionDifficulty', 'DifficultyConcentrating', 'DifficultyWalking',
                'DifficultyDressingBathing', 'DifficultyErrands', 'ChestScan', 'AlcoholDrinkers']

#Create a new dataset
heartn = heart.copy()

for column in binarycolumns:
    heartn[column] = heart[column].map({'Yes': 1, 'No': 0})

#View recoded dataset
heartn.head()

Unnamed: 0,State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,...,DifficultyErrands,SmokerStatus,ECigaretteUsage,ChestScan,RaceEthnicityCategory,AgeCategory,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers
0,Alabama,Female,Very good,0.0,0.0,Within past year (anytime less than 12 months ...,0.0,8.0,,0,...,0.0,Never smoked,Not at all (right now),0.0,"White only, Non-Hispanic",Age 80 or older,,,,0.0
1,Alabama,Female,Excellent,0.0,0.0,,0.0,6.0,,0,...,0.0,Never smoked,Never used e-cigarettes in my entire life,0.0,"White only, Non-Hispanic",Age 80 or older,1.6,68.04,26.57,0.0
2,Alabama,Female,Very good,2.0,3.0,Within past year (anytime less than 12 months ...,1.0,5.0,,0,...,0.0,Never smoked,Never used e-cigarettes in my entire life,0.0,"White only, Non-Hispanic",Age 55 to 59,1.57,63.5,25.61,0.0
3,Alabama,Female,Excellent,0.0,0.0,Within past year (anytime less than 12 months ...,1.0,7.0,,0,...,0.0,Current smoker - now smokes some days,Never used e-cigarettes in my entire life,1.0,"White only, Non-Hispanic",,1.65,63.5,23.3,0.0
4,Alabama,Female,Fair,2.0,0.0,Within past year (anytime less than 12 months ...,1.0,9.0,,0,...,0.0,Never smoked,Never used e-cigarettes in my entire life,1.0,"White only, Non-Hispanic",Age 40 to 44,1.57,53.98,21.77,1.0
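
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected label would silently become missing. A minimal check of this, assuming a small made-up frame (the `toy` data, the `'Unsure'` label, and the `new_missing` name are illustrative and not from the dataset):

```python
import pandas as pd

# Toy data standing in for two binary columns; 'Unsure' is an unexpected label
toy = pd.DataFrame({
    'HadAngina': ['Yes', 'No', None, 'No'],
    'HadStroke': ['No', 'Unsure', 'Yes', 'No'],
})

recoded = toy.copy()
for column in toy.columns:
    recoded[column] = toy[column].map({'Yes': 1, 'No': 0})

# Missing values created by the mapping itself (not already missing in the source)
new_missing = recoded.isna().sum() - toy.isna().sum()
print(new_missing)
```

A nonzero count for a column would suggest that it contains labels other than 'Yes' and 'No'.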


#### Recode other categorical variables (e.g., those that were not yes and no) into numbers

In [7]:
#Recode other categorical variables using ordinal encoding
heartn['Female'] = heartn['Sex'].map({'Female': 1, 'Male': 0})
heartn['GeneralHealth'] = heartn['GeneralHealth'].map({'Poor': 0, 'Fair': 1, 'Good': 2, 
                                                       'Very good': 3, 'Excellent': 4})
heartn['LastCheckupTime'] = heartn['LastCheckupTime'].map({'Within past year (anytime less than 12 months ago)': 0, 
                                                         'Within past 2 years (1 year but less than 2 years ago)': 1,
                                                         'Within past 5 years (2 years but less than 5 years ago)': 2,
                                                         '5 or more years ago': 3})
heartn['RemovedTeeth'] = heartn['RemovedTeeth'].map({'None of them': 0, 
                                                   '1 to 5': 1,
                                                   '6 or more, but not all': 2,
                                                   'All': 3})
heartn['HadDiabetes'] = heartn['HadDiabetes'].map({'No': 0, 
                                                 'Yes, but only during pregnancy (female)': 0,
                                                 'No, pre-diabetes or borderline diabetes': 1,
                                                 'Yes': 2})
heartn['SmokerStatus'] = heartn['SmokerStatus'].map({'Never smoked': 0, 
                                                   'Former smoker': 1,
                                                   'Current smoker - now smokes some days': 2,
                                                   'Current smoker - now smokes every day': 3})
heartn['ECigaretteUsage'] = heartn['ECigaretteUsage'].map({'Never used e-cigarettes in my entire life': 0,
                                                         'Not at all (right now)': 1,
                                                         'Use them some days': 2,
                                                         'Use them every day': 3})
heartn['AgeCategory'] = heartn['AgeCategory'].map({'Age 18 to 24': 0, 'Age 25 to 29': 1, 
                                                 'Age 30 to 34': 2, 'Age 35 to 39': 3, 
                                                 'Age 40 to 44': 4, 'Age 45 to 49': 5, 
                                                 'Age 50 to 54': 6, 'Age 55 to 59': 7, 
                                                 'Age 60 to 64': 8, 'Age 65 to 69': 9, 
                                                 'Age 70 to 74': 10, 'Age 75 to 79': 11, 
                                                 'Age 80 or older': 12})
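
As an alternative to hand-written dictionaries, the level order can be stored once in an ordered categorical and the integer codes read from it. This is only a sketch on toy values, not what the notebook does; `levels`, `s`, and `codes` are illustrative names:

```python
import numpy as np
import pandas as pd

levels = ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']
s = pd.Series(['Very good', 'Excellent', None, 'Fair'])

# An ordered categorical stores the level order explicitly
cat = pd.Categorical(s, categories=levels, ordered=True)

# Codes follow the order of `levels`; missing values are coded -1, so turn them back into NaN
codes = pd.Series(cat.codes, index=s.index).astype('float64')
codes = codes.where(codes >= 0)
print(codes.tolist())
```

The resulting codes match the dictionary mapping used for `GeneralHealth` (Poor = 0 through Excellent = 4).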

In [8]:
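#Recode BMI into ordinal categories
# Function to categorize BMI
def categorize_bmi(bmi):
    # Keep missing BMI missing; otherwise NaN fails every comparison and lands in the final else branch
    if pd.isna(bmi):
        return np.nan
    # Contiguous cutoffs, so values such as 24.95 or 29.95 are not skipped
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal weight'
    elif bmi < 30:
        return 'Overweight'
    elif bmi < 35:
        return 'Obesity I'
    elif bmi < 40:
        return 'Obesity II'
    else:
        return 'Obesity III'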

# Apply function to the BMI column
heartn['BMICategory'] = heartn['BMI'].apply(categorize_bmi)

# Define a mapping from category to numeric value
category_to_numeric = {
    'Underweight': 0,
    'Normal weight': 1,
    'Overweight': 2,
    'Obesity I': 3,
    'Obesity II': 4,
    'Obesity III': 5
}

# Apply the mapping
heartn['BMICategory'] = heartn['BMICategory'].map(category_to_numeric)
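
The same binning can be sketched with `pd.cut`, which returns the numeric category directly and leaves missing BMI values as `NaN`. The cutoffs 18.5, 25, 30, 35, and 40 are assumed here to be the intended boundaries, and the `bmi` values are made up for illustration:

```python
import numpy as np
import pandas as pd

bmi = pd.Series([17.0, 24.95, 29.9, 34.99, 41.2, np.nan])

# Left-closed bins [18.5, 25), [25, 30), ... leave no gaps between categories
edges = [-np.inf, 18.5, 25, 30, 35, 40, np.inf]

# labels=False returns the bin position (0 = Underweight ... 5 = Obesity III)
bmi_category = pd.cut(bmi, bins=edges, right=False, labels=False)
print(bmi_category.tolist())
```

Because the bins share their edges, a value such as 24.95 falls into 'Normal weight' rather than slipping through to the last category.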

#### Drop variables with duplicated information

In [9]:
#Drop height, weight, and BMI columns because they overlap with BMI category, and the Sex column because it is the same as the Female column.
heartn = heartn.drop(['HeightInMeters', 'WeightInKilograms', 'BMI', 'Sex'], axis=1)

In [10]:
heartn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442067 entries, 0 to 442066
Data columns (total 32 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      442067 non-null  object 
 1   GeneralHealth              440972 non-null  float64
 2   PhysicalHealthDays         431470 non-null  float64
 3   MentalHealthDays           433275 non-null  float64
 4   LastCheckupTime            434026 non-null  float64
 5   PhysicalActivities         441095 non-null  float64
 6   SleepHours                 436871 non-null  float64
 7   RemovedTeeth               431057 non-null  float64
 8   HadHeartAttack             442067 non-null  int64  
 9   HadAngina                  438479 non-null  float64
 10  HadStroke                  440997 non-null  float64
 11  HadAsthma                  440630 non-null  float64
 12  HadSkinCancer              439303 non-null  float64
 13  HadCOPD                    44

In [11]:
#RaceEthnicityCategory
print(heartn['RaceEthnicityCategory'].value_counts())

#State
print(heartn['State'].value_counts())
print(heartn['State'].nunique())

#Exclude respondents from U.S. territories (Puerto Rico, Guam, Virgin Islands), keeping only states and the District of Columbia
states_to_drop = ['Puerto Rico', 'Guam', 'Virgin Islands']
heartn = heartn[~heartn['State'].isin(states_to_drop)]

#Check again
print(heartn.shape)
print(heartn['State'].nunique())

RaceEthnicityCategory
White only, Non-Hispanic         318782
Hispanic                          42524
Black only, Non-Hispanic          35116
Other race only, Non-Hispanic     22442
Multiracial, Non-Hispanic          9505
Name: count, dtype: int64
State
Washington              25997
New York                17631
Minnesota               16738
Ohio                    16394
Maryland                16299
Texas                   14129
Florida                 13282
Wisconsin               11210
Kansas                  11179
Massachusetts           10958
California              10853
Maine                   10584
Indiana                 10378
Virginia                10353
Arizona                 10089
South Carolina           9967
Michigan                 9958
Utah                     9773
Connecticut              9702
Colorado                 9310
Georgia                  9163
Iowa                     8882
Vermont                  8766
New Jersey               8154
Hawaii                   7

In [12]:
#Recode state and race/ethnicity using one-hot encoding (for machine learning models)
heart_one_hot = pd.get_dummies(heartn, columns=['State', 'RaceEthnicityCategory'])

# Convert columns to integer type
state_columns = heart_one_hot.filter(like='State').columns
race_columns = heart_one_hot.filter(like='RaceEthnicityCategory').columns
# Convert those columns to integers
heart_one_hot[state_columns] = heart_one_hot[state_columns].astype(int)
heart_one_hot[race_columns] = heart_one_hot[race_columns].astype(int)

#Rename the categories
heart_one_hot = heart_one_hot.rename(columns={
    'RaceEthnicityCategory_White only, Non-Hispanic': 'Non_Hispanic_White',
    'RaceEthnicityCategory_Black only, Non-Hispanic': 'Non_Hispanic_Black',
    'RaceEthnicityCategory_Other race only, Non-Hispanic': 'Non_Hispanic_other',
    'RaceEthnicityCategory_Multiracial, Non-Hispanic': 'Non_Hispanic_multiracial',
    'RaceEthnicityCategory_Hispanic': 'Hispanic'
})

#Remove the state prefix
heart_one_hot.columns = heart_one_hot.columns.str.replace('State_', '')

#Dummy-code race/ethnicity and state by dropping one reference category each (for linear models, to avoid multicollinearity)
heart_dummy = heart_one_hot.drop(['Non_Hispanic_White', 'District of Columbia'], axis=1)

print(heart_dummy.info())

<class 'pandas.core.frame.DataFrame'>
Index: 432809 entries, 0 to 432808
Data columns (total 84 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   GeneralHealth              431744 non-null  float64
 1   PhysicalHealthDays         422377 non-null  float64
 2   MentalHealthDays           424173 non-null  float64
 3   LastCheckupTime            424890 non-null  float64
 4   PhysicalActivities         431843 non-null  float64
 5   SleepHours                 427814 non-null  float64
 6   RemovedTeeth               422017 non-null  float64
 7   HadHeartAttack             432809 non-null  int64  
 8   HadAngina                  429275 non-null  float64
 9   HadStroke                  431745 non-null  float64
 10  HadAsthma                  431384 non-null  float64
 11  HadSkinCancer              430066 non-null  float64
 12  HadCOPD                    431001 non-null  float64
 13  HadDepressiveDisorder      430415 
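
The difference between the one-hot and the dummy-coded versions can be seen on a toy column. In this sketch the `Race` column and its three levels are made up; only `pd.get_dummies` and the drop-one-reference idea come from the cell above:

```python
import pandas as pd

toy = pd.DataFrame({'Race': ['White', 'Black', 'Hispanic', 'White']})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy, columns=['Race'], dtype=int)

# Every row has exactly one 1, so the indicators always sum to 1 (the same as an intercept column)
print(one_hot.sum(axis=1).tolist())

# Dropping one reference category removes that exact linear dependence
dummy = one_hot.drop(columns=['Race_White'])
print(dummy.columns.tolist())
```

With the reference category dropped, each remaining coefficient in a linear model is read as a contrast against that reference group.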

In [13]:
print(heart_one_hot.info())
heart_one_hot.head()

<class 'pandas.core.frame.DataFrame'>
Index: 432809 entries, 0 to 432808
Data columns (total 86 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   GeneralHealth              431744 non-null  float64
 1   PhysicalHealthDays         422377 non-null  float64
 2   MentalHealthDays           424173 non-null  float64
 3   LastCheckupTime            424890 non-null  float64
 4   PhysicalActivities         431843 non-null  float64
 5   SleepHours                 427814 non-null  float64
 6   RemovedTeeth               422017 non-null  float64
 7   HadHeartAttack             432809 non-null  int64  
 8   HadAngina                  429275 non-null  float64
 9   HadStroke                  431745 non-null  float64
 10  HadAsthma                  431384 non-null  float64
 11  HadSkinCancer              430066 non-null  float64
 12  HadCOPD                    431001 non-null  float64
 13  HadDepressiveDisorder      430415 

Unnamed: 0,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,HadStroke,...,Virginia,Washington,West Virginia,Wisconsin,Wyoming,Non_Hispanic_Black,Hispanic,Non_Hispanic_multiracial,Non_Hispanic_other,Non_Hispanic_White
0,3.0,0.0,0.0,0.0,0.0,8.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
1,4.0,0.0,0.0,,0.0,6.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2,3.0,2.0,3.0,0.0,1.0,5.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
3,4.0,0.0,0.0,0.0,1.0,7.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
4,1.0,2.0,0.0,0.0,1.0,9.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1


In [14]:
print(heart_dummy.info())
heart_dummy.head()

<class 'pandas.core.frame.DataFrame'>
Index: 432809 entries, 0 to 432808
Data columns (total 84 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   GeneralHealth              431744 non-null  float64
 1   PhysicalHealthDays         422377 non-null  float64
 2   MentalHealthDays           424173 non-null  float64
 3   LastCheckupTime            424890 non-null  float64
 4   PhysicalActivities         431843 non-null  float64
 5   SleepHours                 427814 non-null  float64
 6   RemovedTeeth               422017 non-null  float64
 7   HadHeartAttack             432809 non-null  int64  
 8   HadAngina                  429275 non-null  float64
 9   HadStroke                  431745 non-null  float64
 10  HadAsthma                  431384 non-null  float64
 11  HadSkinCancer              430066 non-null  float64
 12  HadCOPD                    431001 non-null  float64
 13  HadDepressiveDisorder      430415 

Unnamed: 0,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,HadStroke,...,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming,Non_Hispanic_Black,Hispanic,Non_Hispanic_multiracial,Non_Hispanic_other
0,3.0,0.0,0.0,0.0,0.0,8.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,4.0,0.0,0.0,,0.0,6.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,3.0,2.0,3.0,0.0,1.0,5.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,4.0,0.0,0.0,0.0,1.0,7.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,1.0,2.0,0.0,0.0,1.0,9.0,,0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


## Examine Missing Data

In [15]:
# Examine missing data
print(heart_dummy.info())

<class 'pandas.core.frame.DataFrame'>
Index: 432809 entries, 0 to 432808
Data columns (total 84 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   GeneralHealth              431744 non-null  float64
 1   PhysicalHealthDays         422377 non-null  float64
 2   MentalHealthDays           424173 non-null  float64
 3   LastCheckupTime            424890 non-null  float64
 4   PhysicalActivities         431843 non-null  float64
 5   SleepHours                 427814 non-null  float64
 6   RemovedTeeth               422017 non-null  float64
 7   HadHeartAttack             432809 non-null  int64  
 8   HadAngina                  429275 non-null  float64
 9   HadStroke                  431745 non-null  float64
 10  HadAsthma                  431384 non-null  float64
 11  HadSkinCancer              430066 non-null  float64
 12  HadCOPD                    431001 non-null  float64
 13  HadDepressiveDisorder      430415 
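
The non-null counts from `info()` can be hard to compare by eye. A compact alternative is the percentage of missing values per column, sketched here on a made-up frame (`toy` and `missing` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'SleepHours': [8.0, np.nan, 5.0, 7.0],
    'HadAngina': [0.0, 0.0, np.nan, np.nan],
    'HadHeartAttack': [0, 0, 1, 0],
})

# isna().mean() is the fraction of missing values in each column
missing = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing.round(1))
```

Applied to `heart_dummy`, the same expression would list the columns with the most missing data first.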

## Training and Testing Data Split

In [16]:
print(heart_dummy.columns)

Index(['GeneralHealth', 'PhysicalHealthDays', 'MentalHealthDays',
       'LastCheckupTime', 'PhysicalActivities', 'SleepHours', 'RemovedTeeth',
       'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma',
       'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease',
       'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing',
       'BlindOrVisionDifficulty', 'DifficultyConcentrating',
       'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands',
       'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'AgeCategory',
       'AlcoholDrinkers', 'Female', 'BMICategory', 'Alabama', 'Alaska',
       'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
       'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois',
       'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
       'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New 

In [17]:
X = heart_dummy.drop('HadHeartAttack', axis=1)
y = heart_dummy['HadHeartAttack'].values
print(type(X), type(y))

# Separate features into categorical/ordinal and numeric blocks because they are imputed differently
num_cols = ['SleepHours', 'PhysicalHealthDays', 'MentalHealthDays']
X_cat = heart_dummy.drop(['HadHeartAttack'] + num_cols, axis=1).values
X_num = heart_dummy[num_cols].values

# Examine data shape
print(X_cat.shape)
print(X_num.shape)

# Perform the training and testing split in one call so both feature blocks and y share the same row partition
# (train_test_split was already imported at the top of the notebook)
X_train_cat, X_test_cat, X_train_num, X_test_num, y_train, y_test = train_test_split(
    X_cat, X_num, y, test_size=0.3, random_state=12)

<class 'pandas.core.frame.DataFrame'> <class 'numpy.ndarray'>
(432809, 80)
(432809, 3)
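
The split relies on the categorical block, the numeric block, and the target staying row-aligned. The sketch below uses small random arrays (all values are made up) to show that `train_test_split` partitions every array passed in one call with the same indices, and that a separate call with the same `random_state` and sample count reproduces that partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_cat = rng.integers(0, 2, size=(10, 2)).astype(float)
X_num = rng.normal(size=(10, 3))
y = np.arange(10)  # row ids used as the target so alignment is easy to check

# One call: every array is split with the same shuffled indices
Xc_tr, Xc_te, Xn_tr, Xn_te, y_tr, y_te = train_test_split(
    X_cat, X_num, y, test_size=0.3, random_state=12)

# A separate call with the same random_state gives the same partition
_, _, y_tr2, y_te2 = train_test_split(X_num, y, test_size=0.3, random_state=12)
print(np.array_equal(y_te, y_te2))

# Test rows of X_num are exactly the rows indexed by the test targets
print(np.array_equal(X_num[y_te], Xn_te))
```

Passing all arrays in a single call is the less fragile option, since alignment then does not depend on remembering to repeat the seed.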


## Impute Missing Data in Training and Test Sets

In [18]:
from sklearn.impute import SimpleImputer
# Impute missing data in categorical/ordinal columns (with mode)
imp_cat = SimpleImputer(strategy='most_frequent')
X_train_cat = imp_cat.fit_transform(X_train_cat)
X_test_cat = imp_cat.transform(X_test_cat)

# Impute missing data in numeric columns (with mean, the SimpleImputer default, stated explicitly)
imp_num = SimpleImputer(strategy='mean')
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)

# Combine categorical/ordinal and numeric columns after imputation
# (numeric columns come first, so the column order differs from heart_dummy)
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)

In [19]:
# Check data again after imputation
X_train_missing_count = np.sum(np.isnan(X_train))
print(f"Number of missing values: {X_train_missing_count}")
X_test_missing_count = np.sum(np.isnan(X_test))
print(f"Number of missing values: {X_test_missing_count}")

Number of missing values: 0
Number of missing values: 0
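
Fitting the imputer on the training set only means that the fill value comes from training rows and is then reused for the test set. A minimal sketch with made-up one-column arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [100.0]])

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)                 # learns the mean of the observed training values only

print(imp.statistics_)           # per-column fill values learned during fit
X_test_imp = imp.transform(X_test)
print(X_test_imp.ravel())        # the missing test value is filled with the training mean
```

The large test value (100.0) has no influence on the fill value, which is the point of fitting after the split.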


## Standardize Features

In [20]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Check after standardization: raw feature means/SDs (X) vs. scaled training means/SDs (expected to be about 0 and 1)
print(np.mean(X, axis=0), np.std(X, axis=0))
print(np.mean(X_train_scaled, axis=0), np.std(X_train_scaled, axis=0))

GeneralHealth               2.441931
PhysicalHealthDays          4.313305
MentalHealthDays            4.382766
LastCheckupTime             0.341620
PhysicalActivities          0.763954
                              ...   
Wyoming                     0.009526
Non_Hispanic_Black          0.078758
Hispanic                    0.084913
Non_Hispanic_multiracial    0.021291
Non_Hispanic_other          0.048215
Length: 83, dtype: float64 GeneralHealth               1.048537
PhysicalHealthDays          8.651013
MentalHealthDays            8.372134
LastCheckupTime             0.775742
PhysicalActivities          0.424651
                              ...   
Wyoming                     0.097136
Non_Hispanic_Black          0.269360
Hispanic                    0.278752
Non_Hispanic_multiracial    0.144353
Non_Hispanic_other          0.214221
Length: 83, dtype: float64
[ 1.32901510e-13 -2.83636232e-14  2.93282133e-14 -1.60173104e-15
  2.72998017e-15 -8.60228442e-16  5.90502652e-15 -1.72816372e-15
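
Means on the order of 1e-13 or smaller are zero up to floating-point error. The sketch below repeats the check on simulated data (the arrays are random draws, not the heart attack data). It also shows that the test set, which is transformed with the training means and standard deviations, is only approximately centered:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Fit on the training data only, then apply the same shift and scale to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(np.allclose(X_train_scaled.mean(axis=0), 0.0))
print(np.allclose(X_train_scaled.std(axis=0), 1.0))
print(X_test_scaled.mean(axis=0))  # close to, but not exactly, zero
```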
  

### Save datasets

In [21]:
#Dummy encoding (state + race)
path = 'D:\\Springboard\\Projects\\capstone2\\Heart-Attack\\data\\'

# Save training and testing data as CSV files for subsequent modeling
pd.DataFrame(X_train_scaled).to_csv(path + 'X_train_scaled.csv', index=False)
pd.DataFrame(X_test_scaled).to_csv(path + 'X_test_scaled.csv', index=False)
pd.DataFrame(y_train).to_csv(path + 'y_train.csv', index=False)
pd.DataFrame(y_test).to_csv(path + 'y_test.csv', index=False)
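
Because the scaled arrays are plain NumPy arrays, the CSV files written above have integer column headers. One way to keep feature names is to pass them when building the DataFrame. In this sketch the three names, the tiny array, and the temporary folder are hypothetical stand-ins; in the notebook the name list would need to follow the combined order (numeric columns first, then categorical):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the scaled array and its column order
feature_names = ['SleepHours', 'PhysicalHealthDays', 'HadAngina']
X_train_scaled = np.array([[0.1, -1.2, 1.0], [-0.1, 1.2, -1.0]])

out_dir = Path(tempfile.mkdtemp())  # temporary folder used only for this sketch
out_file = out_dir / 'X_train_scaled.csv'

# Attach the names before saving so the header row is meaningful
pd.DataFrame(X_train_scaled, columns=feature_names).to_csv(out_file, index=False)

reloaded = pd.read_csv(out_file)
print(reloaded.columns.tolist())
```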

## Summary

In the preprocessing stage, the primary objective is to prepare the dataset for machine learning analyses. A key part of this process is converting categorical features into numerical ones through several steps. First, I transformed binary 'Yes'/'No' columns into 1/0. Second, ordinal categorical columns were recoded into ordered numeric values, and BMI was binned into an ordinal BMI category. Third, I applied one-hot encoding to purely categorical features (race/ethnicity and state) and created a dummy-coded version that drops the reference categories ('Non_Hispanic_White' and 'District of Columbia') to prevent multicollinearity in the analysis. Additionally, I removed observations from U.S. territories (Puerto Rico, Guam, and the Virgin Islands) and eliminated columns with duplicated information after recoding.

After converting the categorical variables into numeric features, I split the data into training (70%) and testing (30%) sets. Since the imputation process for missing data differs between categorical and numeric features, I first separated the data into these two types. For numeric features, I applied mean imputation, and for categorical features, I used mode imputation. Once the missing values were imputed, I combined the categorical and numeric features back into a single training set and a single testing set. Finally, I standardized all features to ensure they are on the same scale, a crucial step for many machine learning models. Note that the imputers and the scaler were fit on the training set only and then applied to the testing set, to avoid data leakage. The processed datasets (X_train_scaled, X_test_scaled, y_train, and y_test) were saved as CSV files for subsequent modeling.
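
The split, imputation, and standardization steps described above can also be expressed as a single scikit-learn pipeline. This is a sketch on simulated data, not the code used in this notebook: the array, the column positions, and the names `impute` and `prep` are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 2:] = (X[:, 2:] > 0).astype(float)   # last two columns play the role of binary features
X[rng.random(X.shape) < 0.1] = np.nan     # set roughly 10% of the values to missing
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12)

# Mean imputation for numeric columns, mode imputation for categorical columns
impute = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), [0, 1]),
    ('cat', SimpleImputer(strategy='most_frequent'), [2, 3]),
])
prep = Pipeline([('impute', impute), ('scale', StandardScaler())])

# All statistics are learned from the training set, then reused on the test set
X_train_scaled = prep.fit_transform(X_train)
X_test_scaled = prep.transform(X_test)
print(X_train_scaled.shape, X_test_scaled.shape)
```

Keeping the steps in one object makes it harder to accidentally fit an imputer or scaler on the test set.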