### Building Efficient Machine Learning Pipelines: A Step-by-Step Guide

This notebook is inspired by the article "[A Gentle Introduction to Machine Learning Modeling Pipelines](https://machinelearningmastery.com/machine-learning-modeling-pipelines/)" from the Machine Learning Mastery website. It offers a practical and detailed walkthrough of how to create and implement machine learning pipelines effectively.

Machine learning pipelines are essential tools that streamline the process of building models by combining multiple steps into a cohesive, repeatable workflow. This notebook will guide you through the core concepts and best practices for structuring your machine learning projects, making your workflows more efficient and easier to manage.

In this notebook, we cover:

- Data Preprocessing: How to prepare and clean your data for modeling, including handling missing values, scaling, and encoding categorical variables.

- Feature Engineering: Transforming and selecting features to improve the performance of your model.

- Model Selection: Choosing the right machine learning algorithm based on your data and problem.

- Pipeline Construction: Creating pipelines that seamlessly integrate preprocessing, feature engineering, and model training.

- Evaluation and Optimization: Assessing model performance and fine-tuning hyperparameters using cross-validation.

### What Is a Modeling Pipeline?

A pipeline is a linear sequence of data preparation operations, modeling operations, and prediction transform operations. It allows the sequence of steps to be specified, evaluated, and used as an atomic unit.

**Pipeline**: A linear sequence of data preparation and modeling steps that can be treated as an atomic unit.

To make the idea clear, let's look at two simple examples:

The first example uses data normalization for the input variables and fits a logistic regression model:
- [Input], [Normalization], [Logistic Regression], [Predictions]
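
A minimal sketch of this first example, assuming scikit-learn's `MinMaxScaler` as the normalization step and a synthetic dataset from `make_classification` (both are illustrative choices, not taken from the original article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification data (assumption, for illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# [Input] -> [Normalization] -> [Logistic Regression] -> [Predictions]
norm_logreg = Pipeline([
    ('normalize', MinMaxScaler()),    # rescale each feature to [0, 1]
    ('model', LogisticRegression()),
])

# The whole sequence is fit and used as a single object
norm_logreg.fit(X_train, y_train)
preds = norm_logreg.predict(X_test)
accuracy = norm_logreg.score(X_test, y_test)
print(preds[:5], accuracy)
```

Calling `fit` learns the scaling parameters from the training data only, and `predict` reapplies them to new data before the model sees it.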

The second example standardizes the input variables, applies RFE feature selection, and fits a support vector machine:

- [Input], [Standardization], [RFE], [SVM], [Predictions]
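
This second example could look roughly like the sketch below. It assumes synthetic data, a logistic regression as the estimator that RFE uses to rank features, and five selected features. All three are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption, for illustration only)
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# [Input] -> [Standardization] -> [RFE] -> [SVM] -> [Predictions]
rfe_svm = Pipeline([
    ('standardize', StandardScaler()),
    # RFE needs an estimator that exposes coef_ or feature_importances_
    ('select', RFE(estimator=LogisticRegression(max_iter=1000),
                   n_features_to_select=5)),
    ('model', SVC()),
])

rfe_svm.fit(X_train, y_train)
preds = rfe_svm.predict(X_test)

# Individual steps stay accessible by name after fitting
n_kept = rfe_svm.named_steps['select'].n_features_
print("Features kept by RFE:", n_kept)
```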

import pandas as pd
import numpy as np

import joblib

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

# Load the datasets
train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')

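# Create a validation set (roughly 20%) from the training set.
# The generator is seeded so the split is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(42)
msk = rng.random(len(train_df)) < 0.8
val_df = train_df[~msk]
train_df = train_df[msk]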

# Define feature groups
nominal = ["MSZoning", "LotShape", "LandContour", "LotConfig", "Neighborhood",
           "Condition1", "BldgType", "RoofStyle",
           "Foundation", "CentralAir", "SaleType", "SaleCondition"]

ordinal = ["LandSlope", "OverallQual", "OverallCond", "YearRemodAdd",
          "ExterQual", "ExterCond", "BsmtQual", "BsmtCond", "BsmtExposure",
          "KitchenQual", "Functional", "GarageCond", "PavedDrive"]

numerical = ["LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtUnfSF",
            "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "GrLivArea", "GarageArea",
            "OpenPorchSF"]

# Separate features and labels
train_features = train_df[nominal + ordinal + numerical]
train_label = train_df['SalePrice']

val_features = val_df[nominal + ordinal + numerical]
val_label = val_df['SalePrice']

test_features = test_df[nominal + ordinal + numerical]

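# Define pipelines for different feature types
ordinal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Categories unseen during fit are encoded as -1 instead of raising an error.
    # Without an explicit `categories` argument, codes follow sorted order,
    # not the semantic order of the ratings.
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
])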

nominal_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=True, handle_unknown="ignore"))
])

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Create a ColumnTransformer to handle different types of features
preprocessing_pipeline = ColumnTransformer([
    ('nominal_preprocessor', nominal_pipeline, nominal),
    ('ordinal_preprocessor', ordinal_pipeline, ordinal),
    ('numerical_preprocessor', numerical_pipeline, numerical)
])

# Combine the preprocessor and the model into a complete pipeline
complete_pipeline = Pipeline([
    ('preprocessor', preprocessing_pipeline),
    ('estimator', LinearRegression())
])

# Fit the pipeline on the training data and evaluate on validation data.
# For a regressor, score() returns the coefficient of determination (R^2).
complete_pipeline.fit(train_features, train_label)
score = complete_pipeline.score(val_features, val_label)

# Print the validation score
print("Validation score:", score)

# Persist the fitted pipeline (preprocessing and model together) to disk
pipeline_filename = 'pipeline.pkl'
joblib.dump(complete_pipeline, pipeline_filename)

# Reload the pipeline and predict the sale prices for the test dataset
pipeline = joblib.load(pipeline_filename)
predictions = pipeline.predict(test_features)
print(predictions)

Validation score: 0.8694063878949725
[107114. 172390. 175274. ... 136586. 129194. 218824.]
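
One detail of the ordinal step is worth a closer look. By default, `OrdinalEncoder` assigns integer codes in sorted order of the category values, which for string ratings is alphabetical rather than semantic. The sketch below uses a toy `ExterQual` column and assumes the rating codes run from `Po` (poor) to `Ex` (excellent), as in the dataset's data description. The `quality_order` list is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with the five quality codes (illustrative data)
toy = pd.DataFrame({'ExterQual': ['TA', 'Gd', 'Ex', 'Fa', 'Po']})

# Default behaviour: categories are sorted, so Ex=0, Fa=1, Gd=2, Po=3, TA=4
default_codes = OrdinalEncoder().fit_transform(toy).ravel()

# Explicit order (assumed): Po < Fa < TA < Gd < Ex
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
ordered_encoder = OrdinalEncoder(categories=[quality_order])
ordered_codes = ordered_encoder.fit_transform(toy).ravel()

print(default_codes)  # [4. 2. 0. 1. 3.]
print(ordered_codes)  # [2. 3. 4. 1. 0.]
```

With an explicit order, a larger code always means a better rating, which is the relationship a linear model can actually use.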

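The validation score above comes from a single random split. For the evaluation and tuning step, a pipeline can be passed directly to scikit-learn's cross-validation and grid-search utilities. The sketch below is a self-contained illustration under stated assumptions: it uses synthetic regression data and swaps in `Ridge` for `LinearRegression` so that there is a hyperparameter (`alpha`) to tune. Neither choice comes from the notebook's own code.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('estimator', Ridge()),
])

# 5-fold cross-validation: the scaler is refit on each training fold,
# so no statistics from the held-out fold reach the model
cv_scores = cross_val_score(ridge_pipeline, X, y, cv=5, scoring='r2')
print("Mean R^2:", cv_scores.mean())

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {'estimator__alpha': [0.1, 1.0, 10.0]}
search = GridSearchCV(ridge_pipeline, param_grid, cv=5, scoring='r2')
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The same pattern should apply to a pipeline that includes a `ColumnTransformer`. Nested parameters then chain step names, for example `preprocessor__numerical_preprocessor__imputer__strategy` for the pipeline built earlier.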