### ***PREDICTING STROKE RISK USING PATIENT HEALTH DATA - PREPROCESSING AND TRAINING DATA DEVELOPMENT***

***Preprocessing and Training Data Plan***

1. Create Dummy/Indicator Features for Categorical Variables

    Goal: Convert categorical variables into numerical form using dummy variables (one-hot encoding).
    Steps:
        Identify categorical columns that need to be encoded.
        Use pandas.get_dummies() to create dummy variables for each categorical feature.

2. Standardize Numeric Features

    Goal: Standardize the numeric features to a mean of 0 and a standard deviation of 1. This matters most for algorithms that are sensitive to feature scale, such as regularized logistic regression, KNN, and neural networks.

    Steps:
        Identify the numeric columns to scale.
        Use StandardScaler from sklearn to scale the numeric columns.

3. Split Data into Training and Testing Subsets

    Goal: Split the cleaned and preprocessed data into training and testing sets. Typically, an 80/20 or 70/30 split is used for training and testing respectively.
    Steps:
        Separate features (X) and the target variable (y).
        Use train_test_split() from sklearn.model_selection to create the training and testing sets.

Key Considerations

    Categorical Data: Ensure that all categorical variables (like gender, residence type) are one-hot encoded so the model does not read category labels as ordered numbers.
    Scaling: Apply scaling only to numeric features, as categorical features don’t need scaling.
    Data Splitting: Use the same random_state throughout the workflow so that every split is reproducible.
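
The reproducibility point can be checked directly. The following is a minimal sketch on made-up arrays (the data and the SEED name are illustrative assumptions, not part of the project): calling train_test_split twice with the same random_state returns identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # hypothetical shared constant; reuse it wherever randomness is involved

# Toy data, invented for illustration only
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

first = train_test_split(X, y, test_size=0.2, random_state=SEED)
second = train_test_split(X, y, test_size=0.2, random_state=SEED)

# Each call returns [X_train, X_test, y_train, y_test]; compare them pairwise
identical = all(np.array_equal(a, b) for a, b in zip(first, second))
print("Splits identical:", identical)
```

Defining the seed once and passing it everywhere avoids accidentally mixing different values across cells.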

In [5]:
# First we start by importing the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [9]:
# Load the cleaned, encoded dataset created in the previous step of this project
file_path = 'C:/Users/hecsa/Springboard/Springboard Github/Springboard/Data Science Capstone Two/dataset/stroke_data_encoded.csv'
stroke_data_encoded = pd.read_csv(file_path)

# Show the first few rows of the dataset to inspect its structure
stroke_data_encoded.head()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Male,gender_Other,ever_married_Yes,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Urban,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes,age_group,age_hypertension_interaction
0,9046,67.0,0,1,2.706375,1.005086,1,True,False,True,False,True,False,False,True,True,False,False,senior,0.0
1,51676,61.0,0,0,2.121559,-0.098981,1,False,False,True,False,False,True,False,False,False,True,False,senior,0.0
2,31112,80.0,0,1,-0.005028,0.472536,1,True,False,True,False,True,False,False,False,False,True,False,senior,0.0
3,60182,49.0,0,0,1.437358,0.719327,1,False,False,True,False,True,False,False,True,False,False,True,middle-aged,0.0
4,1665,79.0,1,0,1.501184,-0.631531,1,False,False,True,False,False,True,False,False,False,True,False,senior,79.0


We will use pd.get_dummies() to convert categorical variables into dummy/indicator variables.

In [18]:
# Step 1: Create Dummy Variables for Categorical Features

# Identify categorical columns (already True/False indicators from the earlier encoding step)
categorical_columns = [
    'gender_Male', 'gender_Other', 'ever_married_Yes', 'work_type_Never_worked',
    'work_type_Private', 'work_type_Self-employed', 'work_type_children', 'Residence_type_Urban',
    'smoking_status_formerly smoked', 'smoking_status_never smoked', 'smoking_status_smokes',
]

# Create dummy variables; drop_first=True drops the first level of each column to avoid multicollinearity
stroke_data_preprocessed = pd.get_dummies(stroke_data_encoded, columns=categorical_columns, drop_first=True)

# View the result
stroke_data_preprocessed.head()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,age_group,age_hypertension_interaction,gender_Male_True,gender_Other_True,ever_married_Yes_True,work_type_Never_worked_True,work_type_Private_True,work_type_Self-employed_True,work_type_children_True,Residence_type_Urban_True,smoking_status_formerly smoked_True,smoking_status_never smoked_True,smoking_status_smokes_True
0,9046,67.0,0,1,2.706375,1.005086,1,senior,0.0,True,False,True,False,True,False,False,True,True,False,False
1,51676,61.0,0,0,2.121559,-0.098981,1,senior,0.0,False,False,True,False,False,True,False,False,False,True,False
2,31112,80.0,0,1,-0.005028,0.472536,1,senior,0.0,True,False,True,False,True,False,False,False,False,True,False
3,60182,49.0,0,0,1.437358,0.719327,1,middle-aged,0.0,False,False,True,False,True,False,False,True,False,False,True
4,1665,79.0,1,0,1.501184,-0.631531,1,senior,79.0,False,False,True,False,False,True,False,False,False,True,False


Explanation:

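    We first identify the categorical columns to encode. In this dataset they are already True/False indicators from the previous encoding step.
    The pd.get_dummies() function creates one dummy/indicator column per category level.
    drop_first=True drops the first level of each column so the dummies are not perfectly collinear. For a boolean column this leaves a single <name>_True column, as the output above shows.
    age_group is not in categorical_columns, so it is still a string column (senior, middle-aged). It must be encoded or dropped before fitting a model that requires numeric input.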
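
To see what drop_first=True does on each kind of column, here is a small sketch on a made-up frame. The rows and the 'young' label are assumptions for illustration; only the column names age_group and gender_Male come from the dataset above.

```python
import pandas as pd

# Toy frame, invented for illustration: one string column and one boolean column
toy = pd.DataFrame({
    "age_group": ["senior", "middle-aged", "senior", "young"],
    "gender_Male": [True, False, True, False],
})

# Levels are sorted, and drop_first=True drops the first level of each column:
# 'middle-aged' for age_group and False for gender_Male
encoded = pd.get_dummies(toy, columns=["age_group", "gender_Male"], drop_first=True)
print(encoded.columns.tolist())
```

A string column with three levels yields two indicator columns, while a boolean column yields one column with a _True suffix.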

We will now standardize the numeric features to have a mean of 0 and a standard deviation of 1 using StandardScaler from sklearn.

In [32]:
# Step 2: Standardize the Numeric Columns

# Save the original pre-scaled version
stroke_data_prescaled = stroke_data_preprocessed.copy()

# Identify numeric columns
numeric_columns = ['age', 'avg_glucose_level', 'bmi']

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the numeric columns and transform them
stroke_data_scaled = stroke_data_preprocessed.copy()  # Copy the dataset to apply scaling
stroke_data_scaled[numeric_columns] = scaler.fit_transform(stroke_data_scaled[numeric_columns])

# View the scaled data
stroke_data_scaled.head()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,age_group,age_hypertension_interaction,gender_Male_True,gender_Other_True,ever_married_Yes_True,work_type_Never_worked_True,work_type_Private_True,work_type_Self-employed_True,work_type_children_True,Residence_type_Urban_True,smoking_status_formerly smoked_True,smoking_status_never smoked_True,smoking_status_smokes_True
0,9046,1.051434,0,1,2.706375,1.005086,1,senior,0.0,True,False,True,False,True,False,False,True,True,False,False
1,51676,0.78607,0,0,2.121559,-0.098981,1,senior,0.0,False,False,True,False,False,True,False,False,False,True,False
2,31112,1.62639,0,1,-0.005028,0.472536,1,senior,0.0,True,False,True,False,True,False,False,False,False,True,False
3,60182,0.255342,0,0,1.437358,0.719327,1,middle-aged,0.0,False,False,True,False,True,False,False,True,False,False,True
4,1665,1.582163,1,0,1.501184,-0.631531,1,senior,79.0,False,False,True,False,False,True,False,False,False,True,False


In [36]:
# Save both versions locally
stroke_data_prescaled.to_csv('stroke_data_prescaled.csv', index=False)
stroke_data_scaled.to_csv('stroke_data_scaled.csv', index=False)

Explanation:

    We identify the numeric columns that need to be scaled.
    StandardScaler standardizes these columns by removing the mean and scaling them to unit variance.
    This puts the numeric features on a comparable scale, which is particularly important for certain models (e.g., logistic regression, KNN).
    In the outputs above only age changes, since avg_glucose_level and bmi already appear standardized in the loaded file.
    We saved prescaled and scaled versions of the data.
    For modeling: Continue using the scaled age values.
    For visualization or presentation: Use the original age values for clarity.

Key points:

    stroke_data_prescaled: The dataset with the original values before this scaling step.
    stroke_data_scaled: The dataset with scaled numeric columns.
    Both versions are saved locally as CSV files.
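
The cell above fits the scaler on the full dataset before splitting, so the test rows influence the mean and standard deviation used for scaling. A common alternative, sketched here on randomly generated numbers (the toy frame is an assumption; only the column names match the project), is to split first, fit StandardScaler on the training rows, and reuse it on the test rows. inverse_transform then recovers the original values for presentation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data, randomly generated for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.uniform(20, 80, size=50),
    "avg_glucose_level": rng.uniform(60, 250, size=50),
    "bmi": rng.uniform(18, 40, size=50),
    "stroke": rng.integers(0, 2, size=50),
})
numeric_columns = ["age", "avg_glucose_level", "bmi"]

X_train, X_test, y_train, y_test = train_test_split(
    toy[numeric_columns], toy["stroke"], test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transformation to the test rows
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
X_test_scaled[numeric_columns] = scaler.transform(X_test[numeric_columns])

# Recover the original units, e.g. for plots or reports
recovered = scaler.inverse_transform(X_train_scaled[numeric_columns])
```

With this ordering the training columns have mean 0 and unit variance, while the test columns are only approximately standardized, which is expected.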

In [39]:
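# Step 3: Split the data into Training and Testing Sets
# Note: X still includes id and the string column age_group; drop or encode them before fitting a model
X = stroke_data_scaled.drop(columns='stroke')
y = stroke_data_scaled['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Inspect the shapes of the training and testing sets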
print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Training set size: (4088, 19)
Testing set size: (1022, 19)
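
The split above does not stratify on the target. If stroke cases are rare in the data (an assumption here; check y.value_counts() first), a purely random split can leave the two sets with different positive rates. A hedged sketch with invented labels shows how the stratify argument of train_test_split keeps the class ratio the same in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels, invented for illustration: 10 positives out of 200 rows (5%)
y = np.array([1] * 10 + [0] * 190)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Positive rate in train:", y_train.mean())
print("Positive rate in test:", y_test.mean())
```

Whether to stratify in this project depends on how imbalanced the stroke column actually is.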


In [41]:
# Save the training and testing sets as CSV files
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)
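
When these files are loaded again in a later notebook, pd.read_csv returns a one-column DataFrame for the target files rather than a Series. The sketch below uses a temporary directory and a made-up three-row target (both are assumptions for illustration) to show the round trip and how to get a Series back:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy target, invented for illustration only
y_train = pd.Series([0, 1, 0], name="stroke")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "y_train.csv"
    y_train.to_csv(path, index=False)   # writes a header row named 'stroke'
    loaded = pd.read_csv(path)          # one-column DataFrame
    y_loaded = loaded["stroke"]         # select the column to get a Series again

print(type(loaded).__name__, type(y_loaded).__name__)
```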