# Prepare Your Data For Machine Learning

Many machine learning algorithms make assumptions about your data. It is often a good idea to prepare your data in a way that best exposes the structure of the problem to the machine learning algorithms you intend to use. In this chapter you will discover how to prepare your data for machine learning in Python using scikit-learn. After completing this chapter you will know how to:

1. Handle missing values.
2. Encode categorical data.
3. Rescale data.
4. Standardize data.
5. Normalize data.
6. Binarize data.

## Why Data Pre-processing
You almost always need to pre-process your data. It is a required step. A difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, even when you follow all of the rules and prepare your data, algorithms can sometimes deliver better results without pre-processing.

Generally, I would recommend creating many different views and transforms of your data, then exercising a handful of algorithms on each view of your dataset. This will help you flush out which data transforms are better at exposing the structure of your problem.
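
As a minimal sketch of that workflow, the snippet below scores two algorithms on three views of the same data using cross-validation. The synthetic data from `make_classification`, the choice of models, and the names `views`, `models`, and `scores` are all assumptions made for illustration; they are not part of this chapter's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (an assumption; swap in your own X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Candidate views of the data; None means no pre-processing
views = {"raw": None, "rescaled": MinMaxScaler(), "standardized": StandardScaler()}
# A handful of algorithms to exercise on each view (illustrative choices)
models = {"logistic": LogisticRegression(max_iter=1000), "knn": KNeighborsClassifier()}

scores = {}
for view_name, transform in views.items():
    for model_name, model in models.items():
        # Build a pipeline so the transform is re-fit inside each CV fold
        steps = [model] if transform is None else [transform, model]
        pipeline = make_pipeline(*steps)
        scores[(view_name, model_name)] = cross_val_score(
            pipeline, X_demo, y_demo, cv=5
        ).mean()

for key, value in scores.items():
    print(key, round(value, 3))
```

Wrapping the transform and the model in one pipeline means the transform is learned only from the training part of each fold, which keeps the comparison between views fair.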

## Data Transformation 
In this chapter you will work through several data pre-processing techniques for machine learning. The Cardiovascular Disease dataset is used in each technique. Each technique follows the same structure:

- Load the dataset
- Split the dataset into the input and output variables for machine learning
- Apply a pre-processing transform to the input variables
- Summarize the data to show the change

The `scikit-learn` library provides two standard idioms for transforming data. Each is useful in different circumstances. The transforms are calculated in such a way that they can be applied to your training data and to any samples of data you may have in the future. The scikit-learn documentation has more information on how to use the various pre-processing methods. The two idioms are:

- Fit and Multiple Transform.
- Combined Fit-And-Transform.
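
The two idioms can be sketched side by side as follows. This is a small illustration with `MinMaxScaler` on made-up arrays (`train` and `new` are hypothetical names and values), not the chapter's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: one feature, three training rows and two later rows
train = np.array([[1.0], [2.0], [3.0]])
new = np.array([[2.0], [4.0]])

# Idiom 1: fit once, then transform as many datasets as needed
scaler = MinMaxScaler()
scaler.fit(train)                        # learns min=1 and max=3 from train
train_scaled = scaler.transform(train)   # [[0.0], [0.5], [1.0]]
new_scaled = scaler.transform(new)       # reuses the learned min and max

# Idiom 2: fit and transform the same data in a single call
train_scaled_again = MinMaxScaler().fit_transform(train)

print(train_scaled.ravel())
print(new_scaled.ravel())
```

The first idiom is the one to reach for when the same transform must later be applied to data that was not available at fit time. The second is a convenience for when you only need to transform the data you fitted on.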


In [4]:
# Import necessary packages
import pandas as pd

In [5]:
# Load the dataset from a CSV file
df = pd.read_csv("../data/Heart_Attack.csv")
df.head()

Unnamed: 0,Age,Gender,Heart rate,Systolic blood pressure,Diastolic blood pressure,Blood sugar,CK-MB,Troponin,Outcome
0,64,Male,66.0,160.0,83.0,160.0,1.8,0.012,Negative
1,21,Male,94.0,98.0,46.0,296.0,6.75,1.06,Positive
2,55,Male,64.0,160.0,77.0,270.0,1.99,0.003,Negative
3,64,Male,,120.0,55.0,270.0,13.87,0.122,Positive
4,55,Male,64.0,112.0,65.0,300.0,1.08,0.003,Negative


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1319 entries, 0 to 1318
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       1319 non-null   int64  
 1   Gender                    1311 non-null   object 
 2   Heart rate                1311 non-null   float64
 3   Systolic blood pressure   1317 non-null   float64
 4   Diastolic blood pressure  1318 non-null   float64
 5   Blood sugar               1317 non-null   float64
 6   CK-MB                     1318 non-null   float64
 7   Troponin                  1319 non-null   float64
 8   Outcome                   1313 non-null   object 
dtypes: float64(6), int64(1), object(2)
memory usage: 92.9+ KB


In [7]:
df.isnull().sum()

Age                         0
Gender                      8
Heart rate                  8
Systolic blood pressure     2
Diastolic blood pressure    1
Blood sugar                 2
CK-MB                       1
Troponin                    0
Outcome                     6
dtype: int64

## Select Columns by Data Type
The cells below show a practical way to select columns in a dataset based on their data types, using `make_column_selector` from `scikit-learn`. The selector is a callable that returns the names of the matching columns, so it can be passed directly to `df[...]`:

In [8]:
from sklearn.compose import make_column_selector

In [9]:
# Create a selector for columns with numeric data types
num_cols = make_column_selector(dtype_include='number')
df[num_cols]

Unnamed: 0,Age,Heart rate,Systolic blood pressure,Diastolic blood pressure,Blood sugar,CK-MB,Troponin
0,64,66.0,160.0,83.0,160.0,1.80,0.012
1,21,94.0,98.0,46.0,296.0,6.75,1.060
2,55,64.0,160.0,77.0,270.0,1.99,0.003
3,64,,120.0,55.0,270.0,13.87,0.122
4,55,64.0,112.0,65.0,300.0,1.08,0.003
...,...,...,...,...,...,...,...
1314,44,94.0,122.0,67.0,204.0,1.63,0.006
1315,66,84.0,125.0,55.0,149.0,1.33,0.172
1316,45,85.0,168.0,104.0,96.0,1.24,4.250
1317,54,58.0,117.0,68.0,443.0,5.80,0.359


In [10]:
# Create a selector for columns excluding numeric data types
cat_cols = make_column_selector(dtype_exclude='number')
df[cat_cols]

Unnamed: 0,Gender,Outcome
0,Male,Negative
1,Male,Positive
2,Male,Negative
3,Male,Positive
4,Male,Negative
...,...,...
1314,Male,Negative
1315,Male,Positive
1316,Male,Positive
1317,Male,Positive
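
To make explicit what the selector returns, here is a small sketch on a toy DataFrame (the `toy` frame and its values are made up for illustration). Calling the selector on a DataFrame yields a plain list of column names; pandas accepts a callable inside `[]` and calls it with the DataFrame, which is why `df[num_cols]` works in the cells above.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with two numeric columns and one text column (made-up values)
toy = pd.DataFrame(
    {"Age": [64, 21], "Gender": ["Male", "Female"], "Troponin": [0.012, 1.06]}
)

num_selector = make_column_selector(dtype_include="number")
cat_selector = make_column_selector(dtype_exclude="number")

# Calling a selector returns the matching column names as a list
num_names = num_selector(toy)
cat_names = cat_selector(toy)

print(num_names)
print(cat_names)
print(toy[num_names].shape)
```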


## Handling Missing Values for Numerical Variables

In [11]:
from sklearn.impute import SimpleImputer

In [12]:
# Create a SimpleImputer instance with the desired strategy (e.g., mean, median, or constant)
numeric_imputer = SimpleImputer(strategy='mean')

In [13]:
# Use the fit_transform method to fill missing values in the specified numeric columns
df[num_cols] = numeric_imputer.fit_transform(df[num_cols])

In [14]:
# Check again: display the count of missing values in each column of the DataFrame
df.isnull().sum()

Age                         0
Gender                      8
Heart rate                  0
Systolic blood pressure     0
Diastolic blood pressure    0
Blood sugar                 0
CK-MB                       0
Troponin                    0
Outcome                     6
dtype: int64
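
The sketch below shows what `SimpleImputer(strategy='mean')` learns and how the learned values can be reused on later data. The arrays `train` and `new` are made up for illustration. Fitting on one set and transforming another is the "fit and multiple transform" idiom described earlier.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: two numeric features with some missing entries
train = np.array([[60.0, 120.0], [np.nan, 130.0], [80.0, np.nan]])
new = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# Column means computed from the non-missing training values
learned_means = imputer.statistics_

train_filled = imputer.transform(train)
new_filled = imputer.transform(new)  # filled with the training means

print(learned_means)
print(new_filled)
```

Here the first column's mean is (60 + 80) / 2 = 70 and the second's is (120 + 130) / 2 = 125, so every missing entry in either array is replaced by the corresponding training mean.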

## Handling Missing Values for Categorical Variables

In [15]:
# Create a SimpleImputer instance with the desired strategy (e.g., 'most_frequent' for mode)
categorical_imputer = SimpleImputer(strategy='most_frequent')

In [16]:
# Use the fit_transform method to fill missing values in the specified categorical columns
df[cat_cols] = categorical_imputer.fit_transform(df[cat_cols])

In [17]:
# Check again: display the count of missing values in each column of the DataFrame
df.isnull().sum()

Age                         0
Gender                      0
Heart rate                  0
Systolic blood pressure     0
Diastolic blood pressure    0
Blood sugar                 0
CK-MB                       0
Troponin                    0
Outcome                     0
dtype: int64

In [18]:
df.head()

Unnamed: 0,Age,Gender,Heart rate,Systolic blood pressure,Diastolic blood pressure,Blood sugar,CK-MB,Troponin,Outcome
0,64.0,Male,66.0,160.0,83.0,160.0,1.8,0.012,Negative
1,21.0,Male,94.0,98.0,46.0,296.0,6.75,1.06,Positive
2,55.0,Male,64.0,160.0,77.0,270.0,1.99,0.003,Negative
3,64.0,Male,78.357742,120.0,55.0,270.0,13.87,0.122,Positive
4,55.0,Male,64.0,112.0,65.0,300.0,1.08,0.003,Negative
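
As a small illustration of the `most_frequent` strategy, the sketch below fills missing text values with each column's mode. The `toy` frame and its values are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical data with one missing value per column (made-up values)
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", np.nan],
        "Outcome": ["Positive", np.nan, "Negative", "Positive"],
    }
)

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

The missing `Gender` becomes `Male` and the missing `Outcome` becomes `Positive`, because those are the most common values in their columns.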


## Encoding Categorical Data 

In [23]:
from sklearn.preprocessing import OneHotEncoder

In [24]:
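# Create a OneHotEncoder instance with default settings (returns a sparse matrix)
ohe = OneHotEncoder()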

In [25]:
ohe.fit_transform(df[make_column_selector(dtype_exclude="number")])

<1319x4 sparse matrix of type '<class 'numpy.float64'>'
	with 2638 stored elements in Compressed Sparse Row format>

In [26]:
df.head()

Unnamed: 0,Age,Gender,Heart rate,Systolic blood pressure,Diastolic blood pressure,Blood sugar,CK-MB,Troponin,Outcome
0,64.0,Male,66.0,160.0,83.0,160.0,1.8,0.012,Negative
1,21.0,Male,94.0,98.0,46.0,296.0,6.75,1.06,Positive
2,55.0,Male,64.0,160.0,77.0,270.0,1.99,0.003,Negative
3,64.0,Male,78.357742,120.0,55.0,270.0,13.87,0.122,Positive
4,55.0,Male,64.0,112.0,65.0,300.0,1.08,0.003,Negative


## Split the Dataset

In [9]:
# Separate the DataFrame into input (features) and output (target) components
# Features (Input): Exclude the 'Outcome' column to create the input DataFrame (X)
X = df.drop('Outcome', axis=1)
X.head()

Unnamed: 0,Age,Gender,Heart rate,Systolic blood pressure,Diastolic blood pressure,Blood sugar,CK-MB,Troponin
0,64.0,Male,66.0,160.0,83.0,160.0,1.8,0.012
1,21.0,Male,94.0,98.0,46.0,296.0,6.75,1.06
2,55.0,Male,64.0,160.0,77.0,270.0,1.99,0.003
3,64.0,Male,78.357742,120.0,55.0,270.0,13.87,0.122
4,55.0,Male,64.0,112.0,65.0,300.0,1.08,0.003


In [10]:
# Target (Output): Extract only the 'Outcome' column to create the target Series (y)
y = df['Outcome']
y.head()

0    Negative
1    Positive
2    Negative
3    Positive
4    Negative
Name: Outcome, dtype: object

## Rescale Data

In [2]:
import pandas as pd 
from sklearn.preprocessing import MinMaxScaler