# Machine Learning 101 - Preprocessing

## How to Prepare a Dataset for Machine Learning Models

Author: Kris Barbier

### Overview:
This notebook aims to outline the steps needed to prepare a set of data for machine learning, a process called preprocessing. I will also explain the differences between supervised and unsupervised learning.

### Types of Machine Learning

- In the field of machine learning, there are two main types: supervised learning and unsupervised learning.
- Supervised machine learning is where the model is given data to train on that also contains the "answers" to the question it is trying to work out. For example, if a model is trying to predict house prices based on certain features in the data (year built, size, location, etc.), the dataset given to the model also includes the actual house price so that the model may "learn" more quickly about how those features interact to affect the sales price. Supervised models are commonly used to make predictions, and come in two types (which will be discussed in later notebooks): regression and classification.
- Unsupervised learning is the opposite. An unsupervised model does not receive the answers to the problem, and therefore it must learn how to use the patterns and trends that it finds in the data to complete its task. A common use case for unsupervised learning models is clustering, which could be used for customer segmentation analysis.
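
- As a rough illustration of the difference, the sketch below fits one supervised model and one unsupervised model on the same features. The toy house data, the prices, and the choice of `LinearRegression` and `KMeans` are assumptions made for this example only; they are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Hypothetical toy data: house size (square feet) and number of bedrooms
X = np.array(
    [[800, 1], [1000, 2], [1500, 3], [2000, 3], [2500, 4], [3000, 5]],
    dtype=float,
)
# Hypothetical sale prices: the "answers" a supervised model learns from
y = np.array([150000, 180000, 240000, 300000, 360000, 420000], dtype=float)

# Supervised: fit() receives both the features (X) and the target (y)
reg = LinearRegression().fit(X, y)
predicted_price = reg.predict(np.array([[1800.0, 3.0]]))

# Unsupervised: fit() receives only the features and groups similar rows
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
cluster_labels = km.labels_

print(predicted_price)
print(cluster_labels)
```

- Note that only the supervised model's `fit` call is given `y`. The clustering model has to find structure in `X` on its own.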

### Preprocessing Overview

- Once a data set has been sufficiently cleaned and explored for the task at hand, it needs to be processed in such a way that a machine learning model can use it to solve the problem outlined. This involves several steps, outlined below:

     1. If doing supervised machine learning, the first step is to split the data into two sets: training data and testing data. This step, called a "validation split" (also known as a train-test split), is very important and cannot be overlooked. The split gives the model data to learn from (the training set) and data to be evaluated on (the testing set). Models must be tested on "unseen" data before you can be confident about employing them. The portion of the original data set reserved as the testing set acts as that new, unseen data. The testing set usually comprises 20-30% of the original data set.
     2. Most machine learning models cannot work directly with text. Categorical data needs to be transformed into numerical data through a process called "encoding." The approach used in this notebook, one-hot encoding, creates a new binary (0 or 1) column for every value found in the original feature. Each row contains a 1 in the new column that matches its original value, and a 0 in the others.
     3. Numerical features are often measured in different units and on very different scales. For example, in a data set related to predicting a house price, there may be numerical data pertaining to the size of the house (in square feet), but also numbers that relate how many bedrooms/bathrooms there are. These two features have different units, and they need to be transformed so that they are on a comparable scale. This process is called scaling. The standard scaler used in this notebook transforms each column to have a mean of 0 and a standard deviation of 1. The resulting values are not confined to the range 0 to 1; a different technique, min-max scaling, is the one that maps values into that range.
     4. Finally, missing values need to be filled in, since many machine learning models cannot handle them. Depending on the type of data missing (categorical, ordinal, numeric), there are different strategies for filling null values prior to machine learning.
        
- After completing these steps, the data will be ready to plug into a machine learning model!
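
- To make encoding and scaling concrete, here is a minimal sketch that performs both by hand with pandas. The toy `region` and `sqft` values are hypothetical, and `pd.get_dummies` is used here only as a simple stand-in for the scikit-learn encoder used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "sqft": [1000.0, 1500.0, 2000.0, 2500.0],
})

# One-hot encoding: one new 0/1 column per category value
encoded = pd.get_dummies(toy["region"], prefix="region", dtype=int)

# Standardization: subtract the column mean, then divide by the column
# standard deviation (ddof=0 is the population standard deviation,
# which is what StandardScaler uses)
sqft_scaled = (toy["sqft"] - toy["sqft"].mean()) / toy["sqft"].std(ddof=0)

print(encoded)
print(sqft_scaled)
```

- Each row of `encoded` contains exactly one 1. The scaled column has a mean of 0 and a standard deviation of 1, and its values fall on both sides of 0.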

## Preprocessing Steps in Code

### Load Libraries and Read in Data

- Here, we will import common libraries for data science and machine learning. Scikit-learn, or sklearn, is a common machine learning library that has several different modules to select tools from.

In [13]:
#Common imports for data science
import pandas as pd
import numpy as np

#Imports for machine learning 
from sklearn.model_selection import train_test_split  #For validation split

#Imports for feature transformations
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

#Imports for building preprocessing object
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

#Set sklearn output to pandas
from sklearn import set_config
set_config(transform_output = 'pandas')

In [2]:
#Read in sample dataset from repo folder
file_path = "Data/insurance_mod.csv"
df = pd.read_csv(file_path)
#Preview data
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,1,southwest,16885.0
1,18,male,33.77,1,0,southeast,1726.0
2,28,male,33.0,3,0,southeast,4449.0
3,33,male,22.705,0,0,northwest,21984.0
4,32,male,28.88,0,0,northwest,3867.0


### Perform Validation Split

- Here, we define two variables: X and y. The y variable is the target, or the variable the model will be predicting. In this case, it is charges.
- Then, the data is split into training and testing data. The default split is 75/25% for training and testing, respectively.

In [3]:
#Define X and y variables
y = df['charges']
X = df.drop(columns = 'charges')

In [5]:
#Perform validation split
#Setting a random state will make this reproducible in the future
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#Verify the split is correct
X_train.head()  #Note the absence of the charges column from the X_train data

Unnamed: 0,age,sex,bmi,children,smoker,region
693,24,male,23.655,0,0,northwest
1297,28,female,26.51,2,0,southeast
634,51,male,39.7,1,0,southwest
1022,47,male,36.08,1,1,southeast
178,46,female,28.9,2,0,southwest
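
- The split proportions can be checked on a small stand-in dataset. The sketch below uses hypothetical arrays of 100 rows (not the insurance data) to show the default 75/25 split, and how the `test_size` argument changes it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default split: 25% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Explicit split: hold out 20% of the rows instead
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Using the same random_state again reproduces the same split
X_tr_again, X_te_again, _, _ = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))
print(len(X_tr20), len(X_te20))
```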


### Transform Features

- Now, we will perform the steps needed to transform the features into machine learning ready variables, using imputation for missing values (if any), scaling for numeric variables, and encoding for categorical variables.

In [6]:
#Define numeric and categorical columns
num_cols = X_train.select_dtypes('number').columns
cat_cols = X_train.select_dtypes('object').columns

In [7]:
##Instantiate transformers

#Simple Imputer - imputes missing values
#Here are two imputers, one for categorical and one for numeric data
#Categorical imputer:
impute_missing = SimpleImputer(strategy='constant', fill_value='Missing')

#Numeric imputer:
impute_mean = SimpleImputer(strategy='mean')

#Standard scaler:
scaler = StandardScaler()

#OneHotEncoder:
cat_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
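
- The `handle_unknown='ignore'` setting matters when the testing data contains a category that never appeared in the training data. The sketch below uses hypothetical region values to show what happens in that case. It calls `.toarray()` on the encoder's default sparse output rather than relying on the `sparse_output` argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories
train_regions = np.array([["northwest"], ["southeast"], ["southwest"]])
# "northeast" was never seen during fitting
new_regions = np.array([["southeast"], ["northeast"]])

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_regions)

# Convert the sparse result to a regular array for easy viewing
encoded_new = enc.transform(new_regions).toarray()

print(enc.categories_[0])
print(encoded_new)
```

- The unseen category is encoded as a row of all zeros instead of raising an error.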

### Create Pipelines

- The next step will be to create pipelines for each type of data that will include the transformers created above. These pipelines make it easy to control the flow of data into the final preprocessing object.

In [8]:
##Create two pipelines
#Numerical pipeline
num_pipe = make_pipeline(impute_mean, scaler)

#Categorical pipeline
cat_pipe = make_pipeline(impute_missing, cat_encoder)
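
- A pipeline simply runs its steps in order, passing the output of one step into the next. As a small check, the sketch below compares a numeric pipeline to the same two steps performed by hand, using a hypothetical single-column array with one missing value.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical ages with one missing value
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

# Pipeline: impute the mean, then scale
pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
piped = pipe.fit_transform(ages)

# The same two steps performed manually, in the same order
imputed = SimpleImputer(strategy="mean").fit_transform(ages)
manual = StandardScaler().fit_transform(imputed)

# make_pipeline names each step after its class, in lowercase
step_names = [name for name, _ in pipe.steps]

print(piped.ravel())
print(step_names)
```

- The missing value is filled with the column mean (30), so after scaling it becomes 0.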

### Final Preprocessor

- Here, we will make the final preprocessing object that we will use to transform the data into the format needed for machine learning. The preprocessing object is called a "column transformer," and requires certain arguments when it is created, including tuples containing information about each pipeline created above.

In [9]:
##Create tuples for each pipeline
##Include a title for the pipeline, the pipe variable, and the column variable

#Numeric tuple
num_tuple = ('Numeric', num_pipe, num_cols)

#Categorical tuple
cat_tuple = ('Categorical', cat_pipe, cat_cols)

In [10]:
##Finally, create final preprocessor
preprocessor = ColumnTransformer([num_tuple, cat_tuple], verbose_feature_names_out=False)

#View preprocessor
preprocessor

- When we view the final preprocessor, we can see that it includes all of the transformers we created earlier, as they pertain to the different types of data (numeric and categorical).

### Final Steps: Fit and Transform Data

- After completing the preprocessing object, the next step is to use it to actually transform the data into the format needed for machine learning models to use. So far, the data has not been scaled, imputed, or encoded, so we need to complete this final step before we can start using models.
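- When transforming the data, we never want to fit the preprocessor on the testing data, as this leaks information from the testing set into the model and invalidates the split. Fit on the training data only, then use transform on both the training and testing data!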

In [11]:
#Fit preprocessor to training data
preprocessor.fit(X_train)

In [14]:
#Transform training and testing data and save as new, transformed variables (tf)
X_train_tf = preprocessor.transform(X_train)
X_test_tf = preprocessor.transform(X_test)

#Verify the change:
X_train_tf.head()



Unnamed: 0,age,bmi,children,smoker,sex_female,sex_male,region_northeast,region_northwest,region_southeast,region_southwest
693,-1.087167,-1.140875,-0.9175,-0.508399,0.0,1.0,0.0,1.0,0.0,0.0
1297,-0.802106,-0.665842,0.743605,-0.508399,1.0,0.0,0.0,0.0,1.0,0.0
634,0.836992,1.528794,-0.086947,-0.508399,0.0,1.0,0.0,0.0,0.0,1.0
1022,0.551932,0.926476,-0.086947,1.96696,0.0,1.0,0.0,0.0,1.0,0.0
178,0.480667,-0.268178,0.743605,-0.508399,1.0,0.0,0.0,0.0,0.0,1.0


- In this final output, we can see that the numeric data was scaled, and the categorical data was encoded. Now, we can use this data for machine learning!
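
- To see why fitting happens on the training data only, consider a scaler on its own. In the sketch below, which uses hypothetical numbers, the scaler learns its mean and standard deviation from the training values and then reuses them on the testing values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and testing values for a single feature
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0], [50.0]])

# fit() learns the mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)

# transform() applies those learned values to both sets
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.mean_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

- The scaled training data has a mean of 0, but the scaled testing data does not. That is expected, because the testing values are measured against what was learned from the training set.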

## Conclusion

- Preprocessing is a necessary step in order to prepare data for machine learning modeling. The steps to be taken include:
    1. Validation Split: Split the data into two parts, training and testing. This is to validate that the model can be used on unseen data.
    2. Transform features based on data types:
        - Numeric data needs to have missing values imputed, and be scaled to account for different units of measurement.
        - Categorical data needs to have missing values imputed and features encoded to become numeric values.
    3. Create pipelines to easily plug data into the preprocessing object.
    4. Use the preprocessor to transform data into the proper format. Never fit the preprocessor on the testing data, only the training data.
    
- For additional questions or more information, contact the author at krisbarbier02@gmail.com