## Demo: Data Pre-Processing in scikit-learn

In scikit-learn, data preprocessing refers to the preparation and transformation of raw data into a format that is suitable for machine learning algorithms. This process typically involves several steps, including:

- Handling missing values, outliers, or other inconsistencies in the dataset. Scikit-learn provides tools for imputation of missing values.

- Scaling numerical features so that they have similar scales. This is important for many machine learning algorithms, particularly those based on distance calculations or gradient descent optimization. Scikit-learn offers various scaling methods such as standardization (`StandardScaler`) and min-max scaling (`MinMaxScaler`).

- Converting categorical features into a numerical format suitable for machine learning algorithms. This may involve techniques such as one-hot encoding (`OneHotEncoder`) or ordinal encoding (`OrdinalEncoder`). Note that `LabelEncoder` is intended for encoding the target variable rather than input features.



## Creating Train-Test Splits

The primary purpose of splitting data into train and test sets is to evaluate the performance of the model on unseen data. By training the model on a subset of the data (the training set) and then evaluating its performance on a separate subset (the test set), we can assess how well the model generalizes to new data. This helps us understand whether the model has learned meaningful patterns or is simply memorizing the training examples.

Memorizing the training examples is the essence of overfitting: the model performs well on the training data but fails to generalize to new data. Evaluating on a separate test set lets us detect this. If the model performs significantly worse on the test set than on the training set, it has likely overfit the training data, and adjustments may be needed to improve its generalization ability.
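As a rough illustration of how a train-test gap reveals overfitting, the sketch below fits an unconstrained decision tree to synthetic, noisy data. The data-generating function and the choice of `DecisionTreeRegressor` are assumptions made purely for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus noise (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A fully grown tree can memorize the training set, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = tree.score(X_test, y_test)     # R^2 on held-out data

print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

A large gap between the two scores is the signal to look for; constraining the model (for example with `max_depth`) would typically narrow it.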

In addition, machine learning models often have hyperparameters that need to be tuned for optimal performance. Splitting the data allows us to tune hyperparameters using only the training set (for example, with cross-validation or a validation subset held out from the training data) while keeping the test set untouched. This helps prevent "leakage" of information from the test set into the model training process, which could lead to overly optimistic performance estimates.
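One possible way to tune hyperparameters without touching the test set is cross-validated grid search on the training data. The sketch below uses synthetic data from `make_regression` and a `Ridge` model with an arbitrary `alpha` grid; all of these are assumptions for the example rather than choices made elsewhere in this post.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (hypothetical, for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)

# Cross-validation happens entirely inside the training set.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# The test set is used once, at the very end, for the final estimate.
test_r2 = search.score(X_test, y_test)

print(search.best_params_, round(test_r2, 3))
```

Because the test rows never influence which `alpha` is selected, the final score remains an honest estimate of performance on unseen data.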


### Preparing the Small Ames Housing Dataset for a Regression Task

The Small Ames Housing Dataset consists of 2,930 records, each of which represents a home sale in Ames, Iowa.
The target variable is `sale_price`. The goal is to create a model that predicts `sale_price` using
the other columns in the dataset.

We read the data into a Pandas DataFrame and display the first few records:

In [10]:

import numpy as np
import pandas as pd


df = pd.read_csv(
    "https://gist.githubusercontent.com/jtrive84/9b96df3f5b23ef1cef68c6bbe5983153/raw/588858f552167732de4d2440505eab54d8d80316/ames-housing-small.csv",
    na_values=""
    )

df.head()

|   | lot_area | bld_type | house_style | exterior | foundation | basement_sqft | central_air | first_floor_sqft | full_bath | kitchen_score | fireplaces | garage_type | garage_nbr_cars | paved_drive | sale_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31770 | 1Fam | 1Story | BrkFace | CBlock | 1080.0 | Y | 1656 | 1 | TA | 2 | Attchd | 2 | P | 215000 |
| 1 | 11622 | 1Fam | 1Story | VinylSd | CBlock | 882.0 | Y | 896 | 1 | TA | 0 | Attchd | 1 | Y | 105000 |
| 2 | 14267 | 1Fam | 1Story | Wd Sdng | CBlock | 1329.0 | Y | 1329 | 1 | Gd | 0 | Attchd | 1 | Y | 172000 |
| 3 | 11160 | 1Fam | 1Story | BrkFace | CBlock | 2110.0 | Y | 2110 | 2 | Ex | 2 | Attchd | 2 | Y | 244000 |
| 4 | 13830 | 1Fam | 2Story | VinylSd | PConc | 928.0 | Y | 928 | 2 | TA | 1 | Attchd | 2 | Y | 189900 |



The first step is to identify which columns represent continuous features and which represent categorical features. Categorical features are usually strings (or objects), but can also be integer-valued (for example, `fireplaces` should be treated as a categorical feature, not a continuous feature, since the column only takes on 5 distinct values). 

Note also that categorical features can be one of two types:

- **Nominal features** are categorical variables where the categories do not have a natural ordering or hierarchy ("bird", "cat", "dog").

- **Ordinal features** are categorical variables whose categories have a natural order or hierarchy, such as "low", "medium", "high", or `fireplaces` in our dataset.
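A simple heuristic for spotting integer-valued columns that behave like categories is to count distinct values per column. The sketch below uses the first five rows shown above for three of the columns; the cutoff `max_levels` is a hypothetical threshold picked for this tiny sample, not a rule used elsewhere in this post.

```python
import pandas as pd

# First five rows of three columns, as displayed by df.head() above.
sample = pd.DataFrame({
    "lot_area": [31770, 11622, 14267, 11160, 13830],
    "fireplaces": [2, 0, 0, 2, 1],
    "bld_type": ["1Fam", "1Fam", "1Fam", "1Fam", "1Fam"],
})

max_levels = 3  # hypothetical cutoff; tune for the real dataset

# Count distinct values per column, then keep numeric columns at or below the cutoff.
n_unique = sample.nunique()
numeric_cols = sample.select_dtypes(include="number").columns
low_cardinality = [col for col in numeric_cols if n_unique[col] <= max_levels]

print(low_cardinality)
```

On the full dataset a larger cutoff would be more sensible, and the resulting list is best treated as a set of candidates to review by hand rather than a final decision.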

We can print the data types of each column in the DataFrame to give us an idea of which features are continuous and which are categorical:

In [2]:
df.dtypes

lot_area              int64
bld_type             object
house_style          object
exterior             object
foundation           object
basement_sqft       float64
central_air          object
first_floor_sqft      int64
full_bath             int64
kitchen_score        object
fireplaces            int64
garage_type          object
garage_nbr_cars       int64
paved_drive          object
sale_price            int64
dtype: object

We treat `lot_area`, `first_floor_sqft` and `basement_sqft` as continuous features. All other columns will be considered categorical. 

In [3]:

target = "sale_price"

continuous = [
    "lot_area", "first_floor_sqft", "basement_sqft"
    ]

categorical = [
    "bld_type", "house_style", "exterior", "foundation", "central_air", 
    "full_bath", "kitchen_score", "fireplaces", "garage_type", "garage_nbr_cars", 
    "paved_drive"
    ]


scikit-learn has good support for imputing continuous features, but for categorical features we will typically forgo imputation. Instead of imputing missing categorical values, we identify them as "missing", and treat those values as a category unto themselves. This will allow us to assess whether these missing values have any association with the target variable as a group. 

In [5]:

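# Label missing categorical values and cast every categorical column to string.
# Caution: in the run shown here, astype("str") turned NaN into the string "nan"
# before fillna was applied, which is why the encoded columns further down
# end in "_nan" rather than "_missing".
for feature in categorical:
    df[feature] = df[feature].astype("str").fillna("missing")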
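The order of `astype` and `fillna` matters when labeling missing categories. The sketch below compares both orderings on a small hypothetical Series. The result of casting first depends on the pandas version: in many releases the cast converts NaN to the literal string "nan" so that `fillna` has nothing left to fill, while newer releases with a dedicated string dtype may preserve the missing value. Filling first avoids the ambiguity.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing value.
s = pd.Series(["Attchd", np.nan, "Detchd"], dtype="object")

# Cast first, then fill: version-dependent ("nan" in many pandas releases).
cast_then_fill = s.astype("str").fillna("missing")

# Fill first, then cast: the missing value is reliably labeled "missing".
fill_then_cast = s.fillna("missing").astype("str")

print(cast_then_fill.tolist())
print(fill_then_cast.tolist())
```

If the goal is a category literally named "missing", the fill-then-cast ordering is the safer of the two.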


We are now in a position to create our preprocessing pipelines. We initialize a `ColumnTransformer` instance, which gives us the ability to define separate preprocessing steps for different groups of columns (in our case, categorical vs. continuous). 

Let's assume we intend to use the `LinearRegression` model in scikit-learn. As most models don't support categorical features, it is necessary to encode them using a numerical representation. We will [one-hot encode](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) our categorical features within our pre-processing pipeline. 

Note that some models can accept ordinal categorical features directly (for example, `GradientBoostingRegressor`), but for now we are assuming all categorical features will be one-hot encoded. 

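We also impute missing continuous values using `KNNImputer`, which fills each missing value using the corresponding values from the most similar rows (the nearest neighbors, measured on the features that are present). This is often preferable to `SimpleImputer`, which fills each column with a single statistic (such as the mean or median) computed from that column alone and therefore ignores relationships between features.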
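The difference between the two imputers is easiest to see on a small array with two correlated columns. The numbers below are made up for illustration, and `n_neighbors=1` is chosen only to make the result easy to follow (the default is 5).

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Two strongly related hypothetical features; the last row is missing the second one.
X = np.array([
    [1000.0, 1000.0],
    [1100.0, 1050.0],
    [2000.0, 2100.0],
    [2100.0, np.nan],
])

# SimpleImputer: fills with the column mean, regardless of the first feature.
simple = SimpleImputer(strategy="mean").fit_transform(X)

# KNNImputer: borrows the value from the most similar row (here, the third row).
knn = KNNImputer(n_neighbors=1).fit_transform(X)

print(simple[3, 1])
print(knn[3, 1])
```

The mean-based fill lands well below what the first feature suggests, while the neighbor-based fill stays consistent with the pattern in the other rows.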

Finally, we standardize the continuous features to have mean 0 and standard deviation 1. This is not strictly necessary, but helps convergence for models that rely on gradient descent. It also makes model coefficients more easily interpretable. 
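Standardization subtracts the column mean and divides by the column standard deviation. The sketch below checks that `StandardScaler` matches this manual calculation, using the five `basement_sqft` values shown in the table above. Note that `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# basement_sqft values from the first five rows shown earlier.
x = np.array([[1080.0], [882.0], [1329.0], [2110.0], [928.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score: (value - mean) / standard deviation, with ddof=0.
manual = (x - x.mean()) / x.std()

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```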

Here is how we would set up the pre-processing pipeline, starting with creating the train-test split.



In [6]:
"""
Example pre-processing pipeline for Small Ames Housing Dataset.
"""
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


# Create train-test split. Use 15% of data for test set. 
y = df[target]
dftrain, dftest, ytrain, ytest = train_test_split(df, y, test_size=.15, random_state=516)


# Continuous feature preprocessing  pipeline.
continuous_transformer = make_pipeline(
    KNNImputer(), 
    StandardScaler()
    )

# Categorical feature preprocessing pipeline.
categorical_transformer = make_pipeline(
    OneHotEncoder(drop="first", sparse_output=False)
    )

# Combine categorical and continuous feature pipelines.
transformer = ColumnTransformer(transformers=[
    ("continuous" , continuous_transformer, continuous),  
    ("categorical", categorical_transformer, categorical)
    ], remainder="drop"
    ).set_output(transform="pandas")


# Call fit_transform on dftrain.
dftrain = transformer.fit_transform(dftrain)


# Only call transform on dftest. 
dftest = transformer.transform(dftest)


Note that we call `fit_transform` on the training set but only `transform` on the test set. Fitting learns statistics from the data (for example, the means and standard deviations used for scaling), so fitting on the test set would leak information from it into the preprocessing. This is a common mistake that even experienced machine learning practitioners sometimes make. **Ensure `fit_transform` is only ever called on the training set, and that only `transform` is called on the test/validation set.**

We can inspect our training data and see that the specified preprocessing steps have been executed:


In [7]:
dftrain.head()

|   | continuous__lot_area | continuous__first_floor_sqft | continuous__basement_sqft | categorical__bld_type_2fmCon | categorical__bld_type_Duplex | categorical__bld_type_Twnhs | categorical__bld_type_TwnhsE | categorical__house_style_1.5Unf | categorical__house_style_1Story | categorical__house_style_2.5Fin | ... | categorical__garage_type_BuiltIn | categorical__garage_type_CarPort | categorical__garage_type_Detchd | categorical__garage_type_nan | categorical__garage_nbr_cars_1 | categorical__garage_nbr_cars_2 | categorical__garage_nbr_cars_3 | categorical__garage_nbr_cars_4 | categorical__paved_drive_P | categorical__paved_drive_Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2488 | -0.102344 | -0.446313 | -0.145284 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2688 | -0.336538 | -1.255068 | -1.468916 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1080 | -0.804663 | 1.054573 | 0.775406 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2880 | -0.568115 | 1.5667 | 1.1806 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1626 | 0.090767 | -0.907735 | -1.495929 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


Continuous feature names are prefixed with `continuous__` and categorical feature names with `categorical__` (the transformer name followed by a double underscore). Each categorical column now consists only of 1s and 0s. Notice there are no longer string or object type columns:

In [8]:

dftrain.dtypes


continuous__lot_area                float64
continuous__first_floor_sqft        float64
continuous__basement_sqft           float64
categorical__bld_type_2fmCon        float64
categorical__bld_type_Duplex        float64
categorical__bld_type_Twnhs         float64
categorical__bld_type_TwnhsE        float64
categorical__house_style_1.5Unf     float64
categorical__house_style_1Story     float64
categorical__house_style_2.5Fin     float64
categorical__house_style_2.5Unf     float64
categorical__house_style_2Story     float64
categorical__house_style_SFoyer     float64
categorical__house_style_SLvl       float64
categorical__exterior_CemntBd       float64
categorical__exterior_HdBoard       float64
categorical__exterior_MetalSd       float64
categorical__exterior_Plywood       float64
categorical__exterior_VinylSd       float64
categorical__exterior_Wd Sdng       float64
categorical__exterior_WdShing       float64
categorical__exterior_nan           float64
categorical__foundation_CBlock  


Finally, we are ready to use our data to fit a model. Let's use the `LinearRegression` model mentioned earlier:

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

lr = LinearRegression()

# Fit model on training set. 
mdl = lr.fit(dftrain, ytrain)

# Generate predictions on test set.
ypred = mdl.predict(dftest)

# Compute mean absolute error.
mae = mean_absolute_error(ytest, ypred)

mae


22161.50395534319
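The mean absolute error is the average of the absolute differences between actual and predicted values, so it is expressed in the same units as `sale_price`. The sketch below verifies this on five sale prices from the table shown earlier, paired with hypothetical predictions invented for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Actual sale prices from the first five rows shown earlier.
y_true = np.array([215000, 105000, 172000, 244000, 189900])

# Hypothetical predictions, made up for illustration.
y_hat = np.array([200000, 120000, 170000, 250000, 180000])

mae_sklearn = mean_absolute_error(y_true, y_hat)
mae_manual = np.mean(np.abs(y_true - y_hat))

print(mae_sklearn, mae_manual)
```

Read this way, the value above says that predictions on the test set miss the actual sale price by roughly $22,000 on average.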