# Training an ML Model From Scratch

> A demonstration of ML pipelines and how to develop them

## About

This notebook is to help those attending my 2024 Intel oneAPI Workshop on reusing and extending deep learning models.

This notebook, while it does not provide any deep learning code, does lay the foundational information for:

1. What is a machine learning pipeline,
2. How to train a machine learning model,
3. How to evaluate a machine learning model, and
4. How to execute a machine learning model on new data.

## Pipelines

A *data pipeline* can be thought of as a series of steps or actions to manipulate data to adhere to a specified format **without**.

A common data pipeline pattern is *ETL*, or *Extract, Transform, Load*.

*Machine learning pipelines* expand on ETL by including further steps to train, evaluate, refine, and deploy machine learning models.

### Data Pipeline

From IBM:

```
A data pipeline is a method in which raw data is ingested from various data sources, transformed and then ported to a data store, such as a data lake or data warehouse, for analysis.
```

- [IBM, *What is a data pipeline?*](https://www.ibm.com/topics/data-pipeline)

To illustrate the flow of data, see the figure below:

![Data pipeline example image](images/data_pipeline_example.png)

**NOTE:** I intentionally did not add any detail on *how* data flows through the pipeline. At a high level, pipelines can be thought of as implementation-agnostic details of a larger application.

**NOTE:** The pipeline described can also be thought of as an *Extract, Transform, Load* (*ETL*) pipeline, which we discuss next. Although it is uncommon, a data pipeline, unlike an ETL pipeline, **does not need to** transform data to be considered a data pipeline.

### Extract, Transform, Load (ETL) Pipeline

ETL pipelines are a popular sub-category of data pipelines that follow a rigid set of instructions for manipulating data.

These instructions are:

1. Extract the data from a source data store, repository, or database,
2. Transform the data with set algorithmic instructions or processes to generate a subset of the data, new representations of the data, or entirely new data, and
3. Load the transformed data into a data store, repository, or database.
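
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the CSV text, the `strong_wines` table, and the 13% threshold are hypothetical values chosen for the example, not part of this workshop's dataset.

```python
import csv
import io
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical source data; in practice this could be a file, an API, or a database
RAW_CSV = """name,alcohol
wine_a,14.2
wine_b,13.2
wine_c,12.4
"""

def extract(source: str) -> List[Dict[str, str]]:
    """Extract: read rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows: List[Dict[str, str]]) -> List[Tuple[str, float]]:
    """Transform: cast types and keep a subset (alcohol of at least 13%)."""
    typed = [(row["name"], float(row["alcohol"])) for row in rows]
    return [(name, alcohol) for name, alcohol in typed if alcohol >= 13.0]

def load(rows: List[Tuple[str, float]], connection: sqlite3.Connection) -> None:
    """Load: write the transformed rows into a destination table."""
    connection.execute("CREATE TABLE IF NOT EXISTS strong_wines (name TEXT, alcohol REAL)")
    connection.executemany("INSERT INTO strong_wines VALUES (?, ?)", rows)
    connection.commit()

# Run the pipeline end to end against an in-memory database
connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
stored = connection.execute("SELECT name, alcohol FROM strong_wines ORDER BY name").fetchall()
print(stored)
```

Keeping each stage in its own function means that the source or the destination can be swapped without touching the transformation logic.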

From IBM:

```
ETL pipelines follow a specific sequence. As the abbreviation implies, they extract data, transform data, and then load and store data in a data repository. Not all data pipelines need to follow this sequence.
```

- [IBM, *What is a data pipeline?*](https://www.ibm.com/topics/data-pipeline)

To illustrate the flow of data, see the figure below:

![ETL pipeline example image](images/etl_pipeline_example.png)

ETL pipelines can be used to allow for interoperability between two different applications on the same data.

Assume we have two applications, **x** and **y** and source data **a**:

If application **x** takes **a** as input and outputs new data **b**, then application **y** can take **b** as input and output new data **c**.
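
As a rough sketch of this chaining, the two applications can be modelled as plain functions. Both `application_x` and `application_y`, and the temperature data they operate on, are hypothetical stand-ins chosen only to make the data flow concrete.

```python
from typing import List

def application_x(a: List[float]) -> List[float]:
    """Hypothetical application x: converts Celsius readings (a) into Fahrenheit (b)."""
    return [value * 9 / 5 + 32 for value in a]

def application_y(b: List[float]) -> float:
    """Hypothetical application y: averages the Fahrenheit readings (b) into one value (c)."""
    return sum(b) / len(b)

a = [0.0, 50.0, 100.0]   # source data
b = application_x(a)     # output of x, input of y
c = application_y(b)     # output of y
print(b, c)
```

Application **y** never needs to know how **a** was stored or produced; it only depends on the format of **b**.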

### Machine Learning Pipelines

![ML pipeline stages image](images/ml_pipeline_stages.png)

- Yibo Wang, Ying Wang, Tingwei Zhang, Yue Yu, Shing-Chi Cheung, Hai Yu, and Zhiliang Zhu. 2023. *Can Machine Learning Pipelines Be Better Configured*? In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). Association for Computing Machinery, New York, NY, USA, 463–475. [https://doi.org/10.1145/3611643.3616352](https://doi.org/10.1145/3611643.3616352)

ML pipelines expand on the concepts of data and ETL pipelines by including *feedback loops*, *feature engineering*, and ML-specific stages including *training*, *evaluation*, and *deployment*.

Furthermore, Machine Learning Operations (MLOps), an extension of DevOps practices aimed at machine and deep learning, takes into consideration the state of the model post-deployment and how to update the model so that it continuously matches its requirements.

While MLOps is an interesting and exciting topic, **it is not covered** in this workshop.

However, please take a look at [Intel's MLOps Professional course](https://www.intel.com/content/www/us/en/developer/certification/mlops.html) for more information.


#### Feedback Loops

Machine learning is not strictly an engineering task, but also a scientific one.

This is to say that when you train a model on a dataset, you may not get the best result the first time.

Different model architectures, implementations, hyper-parameters, and features may result in better or worse models.

Thus, while you need to be a software engineer to build a machine learning pipeline, you also need to be a computer scientist and explore different model configurations to identify which best suits your needs.

For this workshop, we are less interested in the software engineering aspect, and more interested in the computer science one.

#### Feature Engineering

Feature engineering is the act of running a data source through a *data pipeline* to **extract** the features of the data that are relevant for training an ML model.

Given that we have to extract relevant features, an ETL pipeline is a good first choice when designing a data pipeline for a machine learning model.

#### ML Specific Stages

**Training** is the process of taking your engineered data and processing it with an algorithm that continuously updates its underlying weights as the data is passed through it (this is the core of ML and DL).
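
To make "updating weights" concrete, here is a minimal sketch of gradient descent fitting a one-feature linear model. It is a generic illustration and not how the SVM used later in this notebook is trained internally; the toy data, learning rate, and epoch count are assumptions chosen so the example converges.

```python
# Toy data generated from the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# The model's weights start at arbitrary values and are refined during training
weight, bias = 0.0, 0.0
learning_rate = 0.05

for epoch in range(2000):
    # Accumulate the gradient of the mean squared error over the whole dataset
    grad_weight, grad_bias = 0.0, 0.0
    for x, y in zip(xs, ys):
        error = (weight * x + bias) - y
        grad_weight += 2 * error * x / len(xs)
        grad_bias += 2 * error / len(xs)

    # Move each weight a small step in the direction that reduces the error
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

print(f"weight={weight:.4f}, bias={bias:.4f}")
```

After enough passes over the data, `weight` and `bias` approach the values 2 and 1 that generated the toy data.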

**Evaluation** is the process of passing labelled testing data into your trained ML model and computing metrics such as accuracy, precision, and recall.
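
As a sketch of what these metrics measure, they can be computed by hand from a list of true labels and a list of predictions. The labels below are made up for illustration; in practice a library such as scikit-learn provides equivalent functions.

```python
from typing import List

def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)

def precision(y_true: List[int], y_pred: List[int], positive: int) -> float:
    """Of the samples predicted as `positive`, the fraction that truly are."""
    truths = [truth for truth, guess in zip(y_true, y_pred) if guess == positive]
    return truths.count(positive) / len(truths) if truths else 0.0

def recall(y_true: List[int], y_pred: List[int], positive: int) -> float:
    """Of the samples that truly are `positive`, the fraction that were predicted as such."""
    guesses = [guess for truth, guess in zip(y_true, y_pred) if truth == positive]
    return guesses.count(positive) / len(guesses) if guesses else 0.0

# Hypothetical labels for a three-class problem
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

print(accuracy(y_true, y_pred))
print(precision(y_true, y_pred, positive=2))
print(recall(y_true, y_pred, positive=2))
```

Precision and recall are computed per class here: class 2 is always found when it occurs (high recall), but one sample of class 1 is also mislabelled as class 2 (lower precision).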

**Deployment** is the process of integrating your model into an application so that users can provide it with completely unseen data.

There is enough academic and professional literature on all three of these stages to fill several volumes, so for conciseness, I will not expand on the intricacies of these stages until they become relevant.

## Time To Code!

### Install requirements

In [1]:
%pip install --upgrade pip
%pip install progress ucimlrepo numpy pandas "unidist[all]" "modin[all]" scikit-learn scikit-learn-intelex

Note: you may need to restart the kernel to use updated packages.


### Download (Extract) Dataset From UCI Machine Learning Repository

> Wine dataset hosted on the *UCI Machine Learning Repository* ([source](https://archive.ics.uci.edu/dataset/109/wine))

In [2]:
from ucimlrepo import fetch_ucirepo
from ucimlrepo.dotdict import dotdict
from pandas import DataFrame
from numpy import ndarray
import warnings
from typing import List

# Disable warnings
warnings.filterwarnings(action="ignore")

# Random state value to use for consistency
RANDOM_STATE: int = 42

# Download dataset
wine: dotdict = fetch_ucirepo(id=109)

# Extract dataset as DataFrame
wineDF: DataFrame = wine["data"]["original"]

# Get column names of Wine dataset
columns: List[str] = wineDF.columns.to_list()

# Get number of rows of Wine dataset
rowCount: int = wineDF.shape[0]

# Create Wine features dataframe (excludes class labels)
wineFeaturesDF: DataFrame = wineDF.drop(labels="class", axis=1, inplace=False)

# Convert DataFrames to NDArrays for ease of use
wineFeaturesNDArrary: ndarray = wineFeaturesDF.to_numpy()
wineNDArray: ndarray = wineDF.to_numpy()

### Transform Dataset By Standardizing It

Currently, the features in our dataset span very different ranges: some values are much larger than others.

We have to standardize the data within this dataset prior to performing any machine learning on it.

As we are manually adjusting the features of the dataset, our actions are henceforth called *feature engineering*.
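
Standardizing a feature means subtracting its mean and dividing by its standard deviation. The sketch below does this by hand for two made-up features on very different scales; the values are hypothetical, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev
from typing import List

def standardize(values: List[float]) -> List[float]:
    """Rescale values to zero mean and unit variance: z = (x - mean) / std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n, not n - 1)
    return [(value - mean) / std for value in values]

# Two hypothetical features with very different magnitudes
small_feature = [12.0, 13.0, 14.0]
large_feature = [500.0, 1000.0, 1500.0]

small_scaled = standardize(small_feature)
large_scaled = standardize(large_feature)

print(small_scaled)
print(large_scaled)
```

After scaling, both features occupy the same range, so neither dominates purely because of its units. A common variation is to compute the mean and standard deviation from the training rows only and reuse them for the validation and testing rows; this notebook scales the full dataset before splitting.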

In [3]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler preprocessor to remove the mean and scale to unit variance
scaler: StandardScaler = StandardScaler()
scaledWineFeaturesNDArray: ndarray = scaler.fit_transform(X=wineFeaturesNDArrary)

# Take the scaled data and create a complete DataFrame object representing the data
scaledWineDF: DataFrame = DataFrame(data=scaledWineFeaturesNDArray)
scaledWineDF["class"] = wineDF["class"]
scaledWineDF.columns = columns

# Create NDArray for ease of use
scaledWineNDArray = scaledWineDF.to_numpy()

### Transform Training, Validation, and Testing Datasets From Original Dataset

We will generate these datasets from our scaled data:

- **Training** dataset will consist of a unique 50% of our original data and will be used for *training* the model.
- **Validation** dataset will consist of a unique 25% of our original data and will be used for *validating* our training process as we train the model.
- **Testing** dataset will consist of the remaining unique 25% of our original data and will be used for testing the final version of the model.

In [4]:
# Uses Intel Extension for Scikit-learn
from sklearnex.model_selection import train_test_split
from modin.pandas import DataFrame as Modin_DataFrame

# Generate training and temporary data splits
training: ndarray
temp: ndarray
validation: ndarray
testing: ndarray
training, temp = train_test_split(scaledWineNDArray, test_size=0.5, train_size=0.5, random_state=RANDOM_STATE, shuffle=True)

# From the temporary data split, generate the validation and testing splits
validation, testing = train_test_split(temp, test_size=0.5, train_size=0.5, random_state=RANDOM_STATE, shuffle=True)

# Convert ndarrays to Modin DataFrames for ease of use
trainingDF: Modin_DataFrame = Modin_DataFrame(data=training, columns=columns)
validationDF: Modin_DataFrame = Modin_DataFrame(data=validation, columns=columns)
testingDF: Modin_DataFrame = Modin_DataFrame(data=testing, columns=columns)

# Print out Modin DataFrames
# print(trainingDF)
# print(validationDF)
# print(testingDF)

# Print out size stats of training, validation, and testing DataFrames
print(f"Training data size w.r.t original: {trainingDF.shape[0] / rowCount}")
print(f"Validation data size w.r.t original: {validationDF.shape[0] / rowCount}")
print(f"Testing data size w.r.t original: {testingDF.shape[0] / rowCount}")

2024-03-19 14:59:12,690	INFO worker.py:1724 -- Started a local Ray instance.


Training data size w.r.t original: 0.5
Validation data size w.r.t original: 0.24719101123595505
Testing data size w.r.t original: 0.25280898876404495
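
The validation and testing fractions are not exactly 0.25 because the Wine dataset has 178 rows, so the 89-row temporary split cannot be halved evenly. The arithmetic can be sketched as follows, assuming (as I understand scikit-learn's behaviour, and as the output above suggests) that the test side of a split is rounded up and the train side is rounded down.

```python
import math

row_count = 178  # number of rows in the UCI Wine dataset

# First split: 50% training, 50% temporary
temp_count = math.ceil(0.5 * row_count)       # "test" side, rounded up
training_count = math.floor(0.5 * row_count)  # "train" side, rounded down

# Second split of the temporary data: validation is the "train" side, testing the "test" side
testing_count = math.ceil(0.5 * temp_count)
validation_count = math.floor(0.5 * temp_count)

print(training_count, validation_count, testing_count)
print(validation_count / row_count, testing_count / row_count)
```

This gives 89 training rows, 44 validation rows, and 45 testing rows, which matches the fractions printed above.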


### Load Training Dataset Into Model For Training

Now that we have created our training, validation, and testing datasets, we now need to train our machine learning model.

For our particular task, we will be classifying wines given their features.

For this task, we are going to leverage Support Vector Machines (SVMs) as they are fast and efficient algorithms for classifying data.

Other models that we could have used were Decision Trees, Multilayer Perceptrons, or K-Nearest Neighbors.

In addition to training our models, we need to figure out which model configuration is the best.

To do so, we will leverage our validation dataset by generating metrics for each model we train.

Thus, the best model (as measured by accuracy in our case) will be returned to us.

In [5]:
# Uses Intel Extension for Scikit-learn
from sklearnex.svm import SVC
from sklearn.model_selection import ParameterGrid
from progress.bar import Bar
from typing import Tuple
from sklearn.metrics import accuracy_score

def trainModels(hyper: ParameterGrid) -> Tuple[int, float, SVC]:
    bestModel: Tuple[int, float, SVC] | None = None

    # Get training features and classes
    features: ndarray = trainingDF.drop(labels="class", axis=1, inplace=False).to_numpy()
    classes: ndarray = trainingDF["class"].to_numpy()

    # Get validation features and classes
    validationFeatures: ndarray = validationDF.drop(labels="class", axis=1, inplace=False).to_numpy()
    validationClasses: ndarray = validationDF["class"].to_numpy()

    # For each hyperparameter combination, train a model and evaluate on the validation dataset
    for idx in range(len(hyper)):
        # Instantiate the model with hyperparameters
        svm: SVC = SVC(**hyper[idx])

        # Train the model on the training data
        svm.fit(X=features, y=classes)

        # Make predictions on the validation data
        pred: ndarray = svm.predict(X=validationFeatures)

        # Compute accuracy of the model on the validation data
        accuracy: float = accuracy_score(y_true=validationClasses, y_pred=pred)

        # Keep the model, its accuracy, and its hyperparameter configuration index if it is the best so far
        if bestModel is None or bestModel[1] < accuracy:
            bestModel = (idx, accuracy, svm)
            print(f"Best model is config {idx} w/ {accuracy} accuracy on validation data")

    return bestModel

# Define hyper-parameters
hyper: ParameterGrid = ParameterGrid(param_grid={
    "C": [0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": [2, 3, 4, 5, 10, 15, 20],
    "probability": [True],
    "max_iter": [10, 20, 30, 40, 50, 100],
})

idx: int
accuracy: float
svm: SVC
idx, accuracy, svm = trainModels(hyper=hyper)

Best model is config 0 w/ 0.9545454545454546 accuracy on validation data
Best model is config 168 w/ 1.0 accuracy on validation data
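
To get a feel for the size of this search, the grid can be enumerated with the standard library. The sketch below assumes that `ParameterGrid` orders combinations by sorted parameter name with the last name cycling fastest, which is my understanding of its behaviour; running `hyper[168]` on the grid above is the authoritative way to see the winning configuration.

```python
from itertools import product

param_grid = {
    "C": [0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": [2, 3, 4, 5, 10, 15, 20],
    "probability": [True],
    "max_iter": [10, 20, 30, 40, 50, 100],
}

# Assumed ordering: sorted keys, with the last key cycling fastest
keys = sorted(param_grid)
combinations = [
    dict(zip(keys, values))
    for values in product(*(param_grid[key] for key in keys))
]
total = len(combinations)

# `degree` only affects the "poly" kernel, so many combinations train identical models
effective = {
    (combo["C"], combo["kernel"], combo["max_iter"],
     combo["degree"] if combo["kernel"] == "poly" else None)
    for combo in combinations
}

print(total, len(effective))
print(combinations[168])
```

The grid contains 1176 combinations, but only 420 of them are distinct once the unused `degree` values are ignored for the non-polynomial kernels, so restricting `degree` to the `"poly"` kernel would shorten the search considerably.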


### Load Testing Dataset Into Model For Evaluation

Now that we have trained our SVM model, we need to evaluate it against the testing dataset.

At this point, the machine learning model has not yet seen this data.

In [6]:
# Get testing features and classes
features: ndarray = testingDF.drop(labels="class", axis=1, inplace=False).to_numpy()
classes: ndarray = testingDF["class"].to_numpy()

# Generate predictions on testing data
pred: ndarray = svm.predict(X=features)

# Compute accuracy of the predictions on the testing data
accuracy: float = accuracy_score(y_pred=pred, y_true=classes)

print(f"The best model's accuracy on the testing dataset is {accuracy}")

The best model's accuracy on the testing dataset is 0.9555555555555556
