## DATA PRE-PROCESSING

Data preprocessing is the manipulation, filtering, or augmentation of data before it is analyzed. It is often a decisive step: the choice of preprocessing pipeline can strongly affect the conclusions drawn from the downstream analysis. Common methods include cleaning, instance selection, normalization, one-hot encoding, data transformation, feature extraction, and feature selection.

**Goal**

The overall aim of data preprocessing is to prepare the data so that machine learning models perform well and remain interpretable, or so that the analysis itself is effective. Properly preprocessed data can lead to more accurate models, better generalization, and better-informed decisions.

Let's start by importing the libraries and loading the dataset into a DataFrame:

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [8]:
stroke = pd.read_csv(r"C:\Users\maria\Desktop\proyecto infarto de miocardio\healthcare-dataset-stroke-data.csv")
stroke = stroke.dropna(subset=['bmi'])
stroke.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
| 5 | 56669 | Male | 81.0 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |
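
The `dropna(subset=['bmi'])` call above discards every row whose `bmi` is missing while keeping the original index labels, which is presumably why the index in the output jumps from 0 to 2. A quick way to see how many rows such a drop affects is to count the missing values per column first. The sketch below uses a small made-up frame (`toy`) rather than the real file, so its numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the stroke data
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "stroke": [1, 1, 1, 1],
})

# Count missing values per column before deciding what to drop
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows where 'bmi' is missing; index labels are not renumbered
cleaned = toy.dropna(subset=["bmi"])
print(cleaned.index.tolist())
```

If a contiguous index is preferred after dropping rows, `reset_index(drop=True)` can be chained onto the result.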


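Now, let's use NumPy's vectorize to wrap a dictionary lookup so that it can be applied to a whole column at once. Note that np.vectorize is a convenience wrapper (internally it is essentially a loop), not a performance optimization. The wrapped function replaces each label with its corresponding number in the dictionary; any label that is missing from the dictionary becomes -1.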

In [9]:
column_mapping = {
    'gender': {"Male": 0, "Female": 1, "Other": 2},
    'ever_married': {"Yes": 1, "No": 0},
    'work_type': {"Private": 0, "Self-employed": 1, "Govt_job": 2, "children": 3, "Never_worked": 4},
    'Residence_type':{"Urban": 0, "Rural": 1},
    'smoking_status': {"formerly smoked": 0, "never smoked": 1, "smokes": 2, "Unknown": 3}
}

# Replace each label with its integer code; labels not in the mapping become -1
for column, mapping in column_mapping.items():
    fmap = np.vectorize(lambda t: mapping.get(t, -1))
    stroke[column] = fmap(stroke[column])
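
As an alternative sketch, pandas can do the same replacement with `Series.map`, which accepts a dictionary directly. Labels missing from the dictionary become `NaN`, so `fillna(-1)` reproduces the -1 fallback used above. The frame `demo`, the mapping `demo_mapping`, and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame; "n/a" is deliberately absent from the mapping
demo = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "n/a"],
    "ever_married": ["Yes", "No", "Yes", "No"],
})

demo_mapping = {
    "gender": {"Male": 0, "Female": 1, "Other": 2},
    "ever_married": {"Yes": 1, "No": 0},
}

for column, mapping in demo_mapping.items():
    # map() looks each label up in the dict; unknown labels become NaN,
    # which fillna(-1) turns into the same fallback code as before
    demo[column] = demo[column].map(mapping).fillna(-1).astype(int)

print(demo)
```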

In [10]:
stroke.head()

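| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | 0 | 67.0 | 0 | 1 | 1 | 0 | 0 | 228.69 | 36.6 | 0 | 1 |
| 2 | 31112 | 0 | 80.0 | 0 | 1 | 1 | 0 | 1 | 105.92 | 32.5 | 1 | 1 |
| 3 | 60182 | 1 | 49.0 | 0 | 0 | 1 | 0 | 0 | 171.23 | 34.4 | 2 | 1 |
| 4 | 1665 | 1 | 79.0 | 1 | 0 | 1 | 1 | 1 | 174.12 | 24.0 | 1 | 1 |
| 5 | 56669 | 0 | 81.0 | 0 | 0 | 1 | 0 | 0 | 186.21 | 29.0 | 0 | 1 |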
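
Integer codes like these imply an order (for example, that a work_type of 4 is "greater" than 0) which nominal categories do not actually have, and that can matter for some models. One-hot encoding, mentioned at the start of this section, avoids the issue by giving each category its own 0/1 column. The following is a minimal sketch on a made-up frame using `pd.get_dummies`; it is not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy column with three distinct categories
demo = pd.DataFrame(
    {"smoking_status": ["smokes", "never smoked", "Unknown", "smokes"]}
)

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(demo, columns=["smoking_status"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no artificial ordering between categories is introduced.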


In the exploratory data analysis we observed that the features id and Residence_type contribute little information. We therefore mark them for removal here; they are dropped when the feature matrix is built in the next step.

In [11]:
drop_columns = ['id', 'Residence_type']

Let's now separate the features (X) from the target (Y) and split them into training and test sets:

In [12]:
X = stroke.drop('stroke',axis=1)
X = X.drop(drop_columns, axis=1)
Y = stroke['stroke'].values

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state=0)
len(X_train), len(X_test), len(X_train.columns)

(3436, 1473, 9)

The target variable stroke takes one of two numerical values:

- 1 = Positive class (The patient has had a stroke)
- 0 = Negative class (The patient has not had a stroke)
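
When one class is much rarer than the other, a plain random split can leave the training and test sets with different class proportions. Passing `stratify` to `train_test_split` keeps the proportions roughly equal in both sets. The sketch below uses synthetic labels with 10% positives (an assumed figure chosen only for illustration), not the stroke data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels: 100 positives, 900 negatives (assumed ratio)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Share of positives in each split
print(y_tr.mean(), y_te.mean())
```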

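Let's standardize the features so that they are all on the same scale. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test set leaks into the preprocessing. Note that, by default, the scaler returns NumPy arrays rather than DataFrames: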

In [13]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
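
To make the effect concrete, here is a small sketch on synthetic data (the arrays and their scales are made up). It checks that the scaled training columns have mean approximately 0 and standard deviation approximately 1, and that the test set is scaled with the training statistics stored in the scaler's `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data on very different scales (assumed values)
rng = np.random.default_rng(0)
train = rng.normal(loc=[50.0, 100.0], scale=[20.0, 40.0], size=(200, 2))
test = rng.normal(loc=[50.0, 100.0], scale=[20.0, 40.0], size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train only
test_scaled = scaler.transform(test)        # reuse the training statistics

# Training columns are centered and have unit variance after scaling
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))

# The test set is transformed with the training mean and scale
manual = (test - scaler.mean_) / scaler.scale_
print(np.allclose(test_scaled, manual))
```

Because the test set uses the training statistics, its own column means and standard deviations will generally be close to, but not exactly, 0 and 1.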

We are now ready to build the model.