# Example `mlexpy` usage on the iris dataset for multi-class classification using a `RandomForestClassifier` and an `SGDClassifier`.

In [1]:
# First, load the dataset, models, and mlexpy modules...
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent))
from sklearn.datasets import load_iris
from mlexpy import experiment, pipeline_utils, processor

from typing import List, Optional, Union, Callable, Type

# load a random forest and sgd classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

# and numpy, pandas
import numpy as np
import pandas as pd

## First, an example of the general method flow with `mlexpy` (as described in the `README`):
1. Load in the dataset
2. Create your training and testing set split. This results in an immutable named tuple termed an `ExperimentSetup`, which is made up of 2 `MLSetup` named tuples. Each `MLSetup` has 2 attributes: `.obs` and `.labels`. The `.obs` attribute is your feature set (in `mlexpy` a pandas DataFrame) and `.labels` holds the matching labels (a pandas Series). An `ExperimentSetup` thus contains one `MLSetup` to use for training and one `MLSetup` to use _purely_ for testing. This is meant to differentiate, in simple and clear Pythonic language, the training data (`ExperimentSetup.train_data`) from the test data (`ExperimentSetup.test_data`).
    - Note: `mlexpy` defaults to a stratified train/test split to retain the class proportions (including any class imbalance) in both the training and testing sets.
3. Define a class to do the data processing / feature engineering that inherits from the `mlexpy.processor.ProcessPipelineBase` class, and a class to do the model training that inherits from the `mlexpy.experiment.ClassifierExperimentBase` class. (The notebook cells below outline this usage more explicitly.)

    - `mlexpy` operates in an object-oriented framework. These base classes are built to carry a large amount of convenient, clear, and reproducible behavior.

4. Perform your feature engineering, and perform your model training.
5. Evaluate your model.
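
The nested named tuple layout described in step 2 can be sketched with the standard library alone. This is a minimal illustration, not `mlexpy`'s actual definitions: the two classes below are stand-ins that only mirror the attribute names used in this notebook (`.train_data`, `.test_data`, `.obs`, `.labels`), and plain lists stand in for the pandas objects.

```python
from typing import Any, NamedTuple


class MLSetup(NamedTuple):
    """Stand-in for mlexpy's MLSetup: a feature set plus its labels."""
    obs: Any     # a pandas DataFrame in mlexpy; a plain list here
    labels: Any  # a pandas Series in mlexpy; a plain list here


class ExperimentSetup(NamedTuple):
    """Stand-in for mlexpy's ExperimentSetup: one MLSetup per split."""
    train_data: MLSetup
    test_data: MLSetup


setup = ExperimentSetup(
    train_data=MLSetup(obs=[[5.1, 3.5], [6.2, 2.9]], labels=[0, 1]),
    test_data=MLSetup(obs=[[5.9, 3.0]], labels=[2]),
)

# Data is reached by name, which keeps train and test hard to mix up.
print(setup.train_data.labels)
print(setup.test_data.obs)

# Named tuples are immutable: reassigning a field raises AttributeError.
try:
    setup.test_data = setup.train_data
except AttributeError as err:
    print(f"Cannot reassign a field: {err}")
```

Immutability here only protects the tuple's fields from being rebound; the objects stored inside (for example a DataFrame) can still be modified in place.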

### (1) We will see how this works with all of your development in a Jupyter notebook:


In [2]:
# First, set the random seed(s) for the experiment
MODEL_SEED = 10
PROCESS_SEED = 100

model_rs = np.random.RandomState(MODEL_SEED)
process_rs = np.random.RandomState(PROCESS_SEED)

# Next, read in the dataset as a dataframe. Because mlexpy is meant to be an exploratory/experimental tool,
# dataframes are preferred for their readability.
data = load_iris(as_frame=True)
features = data["data"]
labels = data["target"]

# We want to look at the dataset for any faulty records...
print(features.isna().sum())

# Spoiler -- there are none in the features. Next look in the labels...
print(labels.isna().sum())

# Spoiler -- none again, so we use all data.


sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64
0


In [3]:

# Now, generate the ExperimentSetup object, that splits the dataset for training and testing.
experiment_setup = pipeline_utils.get_stratified_train_test_data(train_data=features, label_data=labels, test_frac=0.35, random_state=process_rs)

# This provides us with a named tuple, with attributes of .train_data and .test_data 
# each one with attributes of .obs and .labels. For example...
train_label_count = experiment_setup.train_data.labels.shape[0]
test_label_count = experiment_setup.test_data.labels.shape[0]
total_data_count = features.shape[0]

print(f"Train labels are {round(train_label_count / total_data_count * 100, 2)}% of the original data ({train_label_count}).")
print(f"Test labels are {round(test_label_count / total_data_count * 100, 2)}% of the original data ({test_label_count}).")

Train labels are 64.67% of the original data (97).
Test labels are 35.33% of the original data (53).
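
To make the idea of a stratified split concrete, here is a small standard-library sketch of what stratification generally means: the test fraction is sampled from every class separately, so class proportions carry over to both splits. The function `stratified_split` and the toy labels are hypothetical and are not how `pipeline_utils.get_stratified_train_test_data` is implemented.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_frac, seed):
    """Return (train_idx, test_idx), sampling test_frac of every class."""
    rng = random.Random(seed)

    # Group record positions by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 3 balanced classes with 20 records each.
toy_labels = [0] * 20 + [1] * 20 + [2] * 20
train_idx, test_idx = stratified_split(toy_labels, test_frac=0.25, seed=100)

# Every class contributes the same share to each split.
print(Counter(toy_labels[i] for i in train_idx))
print(Counter(toy_labels[i] for i in test_idx))
```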


In [4]:
# Now, define the processing class. This inherits from the `ProcessPipelineBase` class. 
# For minimal functionality, this class simply needs the `.process_data()` method to be defined. Leaving
# this method undefined will result in a `NotImplementedError`.

# The following shows an example of how to use this class:
class IrisPipeline(processor.ProcessPipelineBase):
    def __init__(
        # All of the Optional arguments are not strictly necessary but shown for completeness.
        self, 
        process_tag: str = "iris_development", 
        model_dir = None, 
        model_storage_function = None, 
        model_loading_function = None
        ) -> None:
        super().__init__(process_tag, model_dir, model_storage_function, model_loading_function)

    # Now -- define the .process_data() method.
    def process_data(self, df: pd.DataFrame, training: bool = True) -> pd.DataFrame:
        # Now, simply do all feature engineering in this method, and return the final data/feature set to perform
        # predictions on.

        # Imagine we have 1 desired feature to engineer, petal/sepal area, and then normalize the feature values.
        # We need to pay attention in the normalizing step, because the scaler may ONLY be fit on the training set;
        # for the test set we can ONLY apply the already-fitted scaler. Thus we have a fork in the process when
        # doing the feature normalization.
        
        # In order to easily maintain reproducibility in data processing, any model based feature engineering (such
        # as normalization) is done by creating a specific data structure storing the order of steps for processing each column, 
        # and the model that should be applied. This is somewhat similar to the ColumnTransformer in sklearn.

        # Model based features are handled in the .fit_model_based_features() method, described below.
         
        # Let's begin:

        # Make a copy of the passed df so the caller's data is not modified.
        df = df.copy()

        # First, simplify the column names (e.g. "sepal length (cm)" -> "sepal_length")...
        df.columns = [col.replace(" (cm)", "").replace(" ", "_") for col in df.columns]

        # ... then compute the petal / sepal areas.
        for part in ["petal", "sepal"]:
            df[f"{part}_area"] = df[f"{part}_length"] * df[f"{part}_width"]

        # Now perform the training / testing dependent feature processing. This is why a `training` boolean is passed.
        if training:
            # Now FIT all of the model based features...
            self.fit_model_based_features(df)
            # ... and get the results of a transformation of all model based features.
            model_features = self.transform_model_based_features(df)
        else:
            # Here we can ONLY apply the transformation
            model_features = self.transform_model_based_features(df)

        # Imagine we want to use ONLY the scaled features for prediction, so we retrieve only the scaled columns.
        # (This is easy because the columns are renamed with the model name in the column name)
        prediction_df = model_features[[col for col in model_features if "standardscaler" in col.lower()]]

        return prediction_df

    def fit_model_based_features(self, df: pd.DataFrame) -> None:
        # Here we do any processing of columns that will require a model based transformation / engineering.

        # In this case, simply fit a standard (normalization) scaler to the numerical columns. 
        # This case will result in additional columns on the dataframe named as 
        # "<original-column-name>_StandardScaler()".

        # Note: there are no returned values for this method, the result is an update in the self.column_transformations dictionary
        for column in df.columns:
            if df[column].dtype not in ("float", "int"):
                continue
            self.fit_scaler(df[column], standard_scaling=True)

In [5]:
# As an example, let's look at the outputs of the `.process_data()` method.
iris_processor = IrisPipeline(model_dir=Path.cwd())  # set the model path to the examples directory

# now run the process_data method
processed_df = iris_processor.process_data(df=experiment_setup.train_data.obs.copy(), training=True)

processed_df.head()

INFO:mlexpy.processor:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.processor:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.processor:setting the model path to /Users/NathanSankary/git/mlexpy/examples. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:Fitting a standard scaler to sepal_length.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_width.
INFO:mlexpy.processor:Fitting a standard scaler to petal_length.
INFO:mlexpy.processor:Fitting a standard scaler to petal_width.
INFO:mlexpy.processor:Fitting a standard scaler to petal_area.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_area.
INFO:mlexpy.processor:Dumping 1 models to /Users/NathanSankary/git/mlexpy/examples/iris_development/sepal_length
INFO:mlexpy.processor:Dumping 1 models to /Users/NathanSankary/git/mlexpy/examples/iris_development/sepal_width

|     | sepal_length_standardscaler() | sepal_width_standardscaler() | petal_length_standardscaler() | petal_width_standardscaler() | petal_area_standardscaler() | sepal_area_standardscaler() |
| --- | --- | --- | --- | --- | --- | --- |
| 131 | 2.552306 | 1.77599 | 1.50883 | 1.061415 | 1.551161 | 3.572511 |
| 127 | 0.316317 | -0.117098 | 0.617499 | 0.783903 | 0.651933 | 0.143745 |
| 84 | -0.553235 | -0.117098 | 0.379811 | 0.367633 | 0.184243 | -0.470624 |
| 111 | 0.688982 | -0.827006 | 0.855188 | 0.922659 | 0.934354 | -0.154663 |
| 96 | -0.18057 | -0.353734 | 0.201545 | 0.09012 | -0.107215 | -0.374081 |
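
The reason for the `training` flag in `.process_data()` can be shown with a tiny hand-rolled scaler. This is a sketch under stated assumptions: `SimpleStandardScaler` and the toy values are hypothetical and only mimic the fit-then-transform pattern (it is not `mlexpy`'s `fit_scaler`). The statistics are learned from the training values only and then reused, unchanged, on the test values.

```python
from statistics import fmean, pstdev


class SimpleStandardScaler:
    """Toy scaler: learn mean and spread in fit(), reuse them in transform()."""

    def fit(self, values):
        self.mean_ = fmean(values)
        self.scale_ = pstdev(values)  # population standard deviation
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


train_area = [1.0, 2.0, 3.0, 4.0, 5.0]
test_area = [3.0, 6.0]

# training=True branch: fit on the training data, then transform it.
scaler = SimpleStandardScaler().fit(train_area)
train_scaled = scaler.transform(train_area)

# training=False branch: ONLY transform, using the training statistics.
test_scaled = scaler.transform(test_area)

print(train_scaled)
print(test_scaled)
```

Refitting on the test values instead would leak information about the test set into the features and would place the two splits on different scales.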


In [6]:
# Now, define the IrisExperiment to inherit from the ClassifierExperimentBase class.
# This functionality works similarly to the `ProcessPipelineBase` class, where an experiment is instantiated
# as a class, inheriting a variety of experimental tooling.

# We need to define 1 method for our child class, shown below, to handle the training and testing data processing outlined in 
# our IrisPipeline class

class IrisExperiment(experiment.ClassifierExperimentBase):
    def __init__(
        self, 
        train_setup: pipeline_utils.MLSetup, 
        test_setup: pipeline_utils.MLSetup, 
        cv_split_count: int, 
        rnd_int: int = 100, 
        model_dir: Optional[Union[str, Path]] = None, 
        model_storage_function: Optional[Callable] = None, 
        model_loading_function: Optional[Callable] = None, 
        model_tag: str = "example_development_model",
        process_tag: str = "example_development_process"
        ) -> None:
        super().__init__(train_setup, test_setup, cv_split_count, rnd_int, model_dir, model_storage_function, model_loading_function, model_tag, process_tag)


    def process_data(
        self,
        process_method_str: str = "process_data"
    ) -> pipeline_utils.ExperimentSetup:


        # Now do the data processing using the pipeline method named in process_method_str.
        process_method = getattr(self.pipeline, process_method_str)
        train_df = process_method(self.training.obs, training=True)
        test_df = process_method(self.testing.obs, training=False)

        print(
            f"The train data are of size {train_df.shape}, the test data are {test_df.shape}."
        )

        assert (
            len(set(train_df.index).intersection(set(test_df.index))) == 0
        ), "There are duplicated indices in the train and test set."

        return pipeline_utils.ExperimentSetup(
            pipeline_utils.MLSetup(
                train_df,
                self.training.labels,
            ),
            pipeline_utils.MLSetup(
                test_df,
                self.testing.labels,
            ),
        )
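
The `getattr` lookup used in `process_data` above is what lets a processing method be chosen by name. A minimal standalone sketch of that pattern follows; `ToyPipeline` and `run_processing` are hypothetical names used only for illustration, and dictionaries stand in for DataFrame rows.

```python
class ToyPipeline:
    """Two alternative processing methods on one pipeline object."""

    def process_data(self, rows, training=True):
        # Default behaviour: keep only the scaled value of each row.
        return [{"scaled": row["scaled"]} for row in rows]

    def process_data_keep_all_columns(self, rows, training=True):
        # Alternative behaviour: keep every column.
        return [dict(row) for row in rows]


def run_processing(pipeline, rows, process_method_str="process_data"):
    # Resolve the method from its name, then call it like any other method.
    process_method = getattr(pipeline, process_method_str)
    return process_method(rows, training=True)


rows = [{"raw": 5.1, "scaled": 0.3}]
default_out = run_processing(ToyPipeline(), rows)
all_out = run_processing(ToyPipeline(), rows, "process_data_keep_all_columns")

print(default_out)
print(all_out)
```

A misspelled method name fails immediately with an `AttributeError`, before any data is processed.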
    

In [7]:
# Now, our "work" is done, so let's pass our data through this process! Let's try using a random forest model.

# Define the experiment
experiment_obj = IrisExperiment(
    train_setup=experiment_setup.train_data,
    test_setup=experiment_setup.test_data,
    cv_split_count=20,
    model_tag="example_development_model",
    process_tag="example_development_process",
    model_dir=Path.cwd()
)

# Set the pipeline attribute to use
experiment_obj.set_pipeline(IrisPipeline)

# Now begin the experimentation, start with performing the data processing...
processed_datasets = experiment_obj.process_data()

# ... then train our model...
trained_model = experiment_obj.train_model(
    RandomForestClassifier(random_state=model_rs),  # This is why we have 2 different random states...
    processed_datasets,
    # model_algorithm.hyperparams,  # If this is passed, then cross validation search is performed, but slow.
)

INFO:mlexpy.experiment:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.experiment:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.experiment:setting the model path to /Users/NathanSankary/git/mlexpy/examples. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.processor:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.processor:setting the model path to /Users/NathanSankary/git/mlexpy/examples/example_development_process. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:Fitting a standard scaler to sepal_length.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_width.
INFO:mlexpy.processor:Fitting a standard scaler to petal_len

The train data are of size (97, 6), the test data are (53, 6).


INFO:mlexpy.experiment:Model trained


In [8]:
# Now, evaluate the predictions. ClassifierExperimentBase provides some standard classification metrics
# and evaluations.

# Get the predictions and evaluate the performance.
predictions = experiment_obj.predict(processed_datasets, trained_model)
class_probabilities = experiment_obj.predict(processed_datasets, trained_model, proba=True)
results = experiment_obj.evaluate_predictions(
    processed_datasets.test_data.labels,
    predictions=predictions,
    class_probabilities=class_probabilities,
)




The f1_macro score is: 
 0.9558404558404558.

The f1_micro score is: 
 0.9622641509433962.

The f1_weighted score is: 
 0.9622641509433962.

The log_loss score is: 
 0.08168998066053094.

The balanced_accuracy score is: 
 0.9558404558404558.

The accuracy score is: 
 0.9622641509433962.

The confusion_matrix score is: 
 [[22  0  0]
 [ 0 12  1]
 [ 0  1 17]].

The classification_report score is: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.92      0.92      0.92        13
           2       0.94      0.94      0.94        18

    accuracy                           0.96        53
   macro avg       0.96      0.96      0.96        53
weighted avg       0.96      0.96      0.96        53
.
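
Several of the scores above can be recomputed by hand from the printed confusion matrix, which also explains why some of them coincide. The sketch below assumes the scikit-learn convention of rows as true classes and columns as predicted classes. For single-label multi-class problems the micro-averaged F1 equals the accuracy, and because this particular matrix is symmetric, precision equals recall for every class, so the macro F1 equals the balanced accuracy.

```python
# Confusion matrix copied from the output above.
confusion = [
    [22, 0, 0],
    [0, 12, 1],
    [0, 1, 17],
]
n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))

accuracy = correct / total  # 51 correct predictions out of 53

recalls, f1_scores = [], []
for i in range(n_classes):
    true_pos = confusion[i][i]
    actual = sum(confusion[i])                                   # row sum
    predicted = sum(confusion[r][i] for r in range(n_classes))  # column sum
    recall = true_pos / actual
    precision = true_pos / predicted
    recalls.append(recall)
    f1_scores.append(2 * precision * recall / (precision + recall))

balanced_accuracy = sum(recalls) / n_classes  # mean per-class recall
f1_macro = sum(f1_scores) / n_classes         # unweighted mean of per-class F1

print(f"accuracy          = {accuracy:.6f}")
print(f"balanced_accuracy = {balanced_accuracy:.6f}")
print(f"f1_macro          = {f1_macro:.6f}")
```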


### (2) Next, how this same process might look when developing as modules. 
We now use the exact same model and dataset, but import our classes from a module instead of defining them in the notebook.

In [9]:
# The only change is that we now need to import the classes we developed above.
# Rename them to make clear which class does what.
from from_module_example import IrisExperiment as IrisExpImport
from from_module_example import IrisPipeline as IrisPipeImport

# First, reset our seeds...
model_rs = np.random.RandomState(MODEL_SEED)


# Define the experiment
imported_experiment = IrisExpImport(
    train_setup=experiment_setup.train_data,
    test_setup=experiment_setup.test_data,
    cv_split_count=20,
    model_tag="example_development_model",
    process_tag="example_development_process",
    model_dir=Path.cwd()
)

# Set the pipeline to use
imported_experiment.set_pipeline(IrisPipeImport)

# Now begin the experimentation, start with performing the data processing...
processed_datasets = imported_experiment.process_data()

# ... then train the model...
trained_model = imported_experiment.train_model(
    RandomForestClassifier(random_state=model_rs),  # This is why we have 2 different random states...
    processed_datasets,
    # model_algorithm.hyperparams,  # If this is passed, then cross validation search is performed, but slow.
)

# Get the predictions and evaluate the performance.
predictions = imported_experiment.predict(processed_datasets, trained_model)
class_probabilities = imported_experiment.predict(processed_datasets, trained_model, proba=True)
results = imported_experiment.evaluate_predictions(
    processed_datasets.test_data.labels,
    predictions=predictions,
    class_probabilities=class_probabilities,
)


INFO:mlexpy.experiment:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.experiment:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.experiment:setting the model path to /Users/NathanSankary/git/mlexpy/examples. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.processor:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.processor:setting the model path to /Users/NathanSankary/git/mlexpy/examples/example_development_process. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:Fitting a standard scaler to sepal_length.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_width.
INFO:mlexpy.processor:Fitting a standard scaler to petal_len

The train data are of size (97, 6), the test data are (53, 6).


INFO:mlexpy.experiment:Model trained



The f1_macro score is: 
 0.9558404558404558.

The f1_micro score is: 
 0.9622641509433962.

The f1_weighted score is: 
 0.9622641509433962.

The log_loss score is: 
 0.08168998066053094.

The balanced_accuracy score is: 
 0.9558404558404558.

The accuracy score is: 
 0.9622641509433962.

The confusion_matrix score is: 
 [[22  0  0]
 [ 0 12  1]
 [ 0  1 17]].

The classification_report score is: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.92      0.92      0.92        13
           2       0.94      0.94      0.94        18

    accuracy                           0.96        53
   macro avg       0.96      0.96      0.96        53
weighted avg       0.96      0.96      0.96        53
.


#### And you can see that we get the exact same results when using the imported modules.
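
Matching results depend on re-creating the `RandomState` before each run, as the cells in this notebook do. A `RandomState` object is stateful: every draw advances it, so an object that has already been used will not replay the same sequence. A short sketch of that behaviour, with illustrative variable names and an arbitrary draw size:

```python
import numpy as np

MODEL_SEED = 10

model_rs = np.random.RandomState(MODEL_SEED)
first_draw = model_rs.randint(0, 1_000_000, size=5)
second_draw = model_rs.randint(0, 1_000_000, size=5)  # the state has advanced

# Re-creating the generator from the same seed replays the sequence.
fresh_rs = np.random.RandomState(MODEL_SEED)
fresh_draw = fresh_rs.randint(0, 1_000_000, size=5)

print(np.array_equal(first_draw, fresh_draw))   # same seed, same starting state
print(np.array_equal(first_draw, second_draw))  # same object, later state
```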

### (3) Do the same process, this time using an `SGDClassifier`

This time, all we need to do is change the model that is passed when training the model.

In [10]:
# The only change is the model that we pass to `.train_model()`; the imported classes are reused as-is.

# Again, reset our seeds...
model_rs = np.random.RandomState(MODEL_SEED)


# Define the experiment
imported_experiment = IrisExpImport(
    train_setup=experiment_setup.train_data,
    test_setup=experiment_setup.test_data,
    cv_split_count=20,
    model_tag="example_development_model",
    process_tag="example_development_process",
    model_dir=Path.cwd()
)

# Now begin the experimentation, start with performing the data processing...
processed_datasets = imported_experiment.process_data()

# ... then train the model...
trained_model = imported_experiment.train_model(
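    # Logistic-regression loss. On scikit-learn < 1.1 use loss="log", the older name for the same loss.
    SGDClassifier(random_state=model_rs, loss="log_loss"),  # This is why we have 2 different random states...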
    processed_datasets,
    # model_algorithm.hyperparams,  # If this is passed, then cross validation search is performed, but slow.
)

# Get the predictions and evaluate the performance.
predictions = imported_experiment.predict(processed_datasets, trained_model)
class_probabilities = imported_experiment.predict(processed_datasets, trained_model, proba=True)
results = imported_experiment.evaluate_predictions(
    processed_datasets.test_data.labels,
    predictions=predictions,
    class_probabilities=class_probabilities,
)


INFO:mlexpy.experiment:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.experiment:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.experiment:setting the model path to /Users/NathanSankary/git/mlexpy/examples. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.processor:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.processor:setting the model path to /Users/NathanSankary/git/mlexpy/examples/example_development_process. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:Fitting a standard scaler to sepal_length.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_width.
INFO:mlexpy.processor:Fitting a standard scaler to petal_len

The train data are of size (97, 6), the test data are (53, 6).

The f1_macro score is: 
 0.9558404558404558.

The f1_micro score is: 
 0.9622641509433962.

The f1_weighted score is: 
 0.9622641509433962.

The log_loss score is: 
 0.1955833766845036.

The balanced_accuracy score is: 
 0.9558404558404558.

The accuracy score is: 
 0.9622641509433962.

The confusion_matrix score is: 
 [[22  0  0]
 [ 0 12  1]
 [ 0  1 17]].

The classification_report score is: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.92      0.92      0.92        13
           2       0.94      0.94      0.94        18

    accuracy                           0.96        53
   macro avg       0.96      0.96      0.96        53
weighted avg       0.96      0.96      0.96        53
.


### (4) Now what if we wanted to use a different set of columns (ex. ALL cols not only the scaled columns)?
We do that simply by redefining the method that processes our data. We can either overwrite the method with the change, or write a new processing method for this specific case.

In [11]:
# All we need to do is change the method that is called to perform our data processing below
from from_module_example import IrisPipeline as IrisPipeImport

# Again, reset our seeds...
model_rs = np.random.RandomState(MODEL_SEED)


# Define the experiment
imported_experiment = IrisExpImport(
    train_setup=experiment_setup.train_data,
    test_setup=experiment_setup.test_data,
    cv_split_count=20,
    model_tag="example_development_model",
    process_tag="example_development_process",
    model_dir=Path.cwd()
)

# Now begin the experimentation. Here, however, we provide a string corresponding to the name of the method
# to use for the data processing.
# Not providing any process_method_str value will default to using "process_data".
processed_datasets = imported_experiment.process_data(process_method_str="process_data_keep_all_columns")

# ... then train the model...
trained_model = imported_experiment.train_model(
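    # Logistic-regression loss. On scikit-learn < 1.1 use loss="log", the older name for the same loss.
    SGDClassifier(random_state=model_rs, loss="log_loss"),  # This is why we have 2 different random states...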
    processed_datasets,
    # model_algorithm.hyperparams,  # If this is passed, then cross validation search is performed, but slow.
)

# Get the predictions and evaluate the performance.
predictions = imported_experiment.predict(processed_datasets, trained_model)
class_probabilities = imported_experiment.predict(processed_datasets, trained_model, proba=True)
results = imported_experiment.evaluate_predictions(
    processed_datasets.test_data.labels,
    predictions=predictions,
    class_probabilities=class_probabilities,
)


INFO:mlexpy.experiment:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.experiment:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.experiment:setting the model path to /Users/NathanSankary/git/mlexpy/examples. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.processor:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.processor:setting the model path to /Users/NathanSankary/git/mlexpy/examples/example_development_process. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:Fitting a standard scaler to sepal_length.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_width.
INFO:mlexpy.processor:Fitting a standard scaler to petal_len

The train data are of size (97, 12), the test data are (53, 12).

The f1_macro score is: 
 0.9781305114638448.

The f1_micro score is: 
 0.9811320754716981.

The f1_weighted score is: 
 0.9812119397025056.

The log_loss score is: 
 0.139554071082744.

The balanced_accuracy score is: 
 0.9814814814814815.

The accuracy score is: 
 0.9811320754716981.

The confusion_matrix score is: 
 [[22  0  0]
 [ 0 13  0]
 [ 0  1 17]].

The classification_report score is: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        22
           1       0.93      1.00      0.96        13
           2       1.00      0.94      0.97        18

    accuracy                           0.98        53
   macro avg       0.98      0.98      0.98        53
weighted avg       0.98      0.98      0.98        53
.


As we can see, using all of the columns results in better scores for the `SGDClassifier`. (We confirm that the data was processed differently by looking at the logged train and test data shapes: `The train data are of size (97, 12), the test data are (53, 12).`)
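
The jump from 6 to 12 columns is consistent with keeping the 6 unscaled columns next to their 6 scaled counterparts. That reading is an assumption, since the `process_data_keep_all_columns` method lives in `from_module_example` and is not shown here. Under that assumption, the column filter from `IrisPipeline.process_data` behaves as follows on column names alone:

```python
# Unscaled column names, as produced by the renaming and area steps.
raw_columns = [
    "sepal_length", "sepal_width", "petal_length",
    "petal_width", "petal_area", "sepal_area",
]
# Scaled column names, following the naming seen in the processed output.
scaled_columns = [f"{col}_standardscaler()" for col in raw_columns]

# Assumed result of keeping everything: raw and scaled side by side.
all_columns = raw_columns + scaled_columns

# The filter used in process_data keeps only the scaled columns.
scaled_only = [col for col in all_columns if "standardscaler" in col.lower()]

print(len(scaled_only), len(all_columns))  # 6 12
```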