# Example `mlexpy` usage on the iris dataset for multi-class classification using a RandomForest and SGDClassifier.

In [1]:
# First, load the dataset, models, and mlexpy modules...
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent))
from sklearn.datasets import load_iris
from mlexpy import experiment, pipeline_utils, processor

from typing import List, Optional, Union, Callable

# load a random forest and sgd classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

# and numpy and pandas
import numpy as np
import pandas as pd

## First, an example of the general method flow with `mlexpy` (as described in the `README`):
1. Load in the dataset
2. Create your training and testing set split. This results in an immutable named tuple termed an `ExperimentSetup`, which is made up of 2 `MLSetup` named tuples. Each `MLSetup` has 2 attributes: `.obs`, the feature set (in `mlexpy` a pandas DataFrame), and `.labels` (a pandas Series). An `ExperimentSetup` thus contains one `MLSetup` to use for training and one `MLSetup` to use _purely_ for testing. This is meant to differentiate, in simple and clearly pythonic language, the training data (`ExperimentSetup.train_data`) from the test data (`ExperimentSetup.test_data`).
    - Note: `mlexpy` defaults to a stratified train/test split to retain the class proportions (including any class imbalance) in both the training and testing sets.
3. Define a class to do the data processing / feature engineering that inherits from the `mlexpy.processor.ProcessPipelineBase` class, and a class to do the model training that inherits from the `mlexpy.experiment.ClassifierExpirament` class. (The notebook cells below outline this usage explicitly.)

    - `mlexpy` operates in an object-oriented framework. These base classes are built to carry a large amount of convenient, clear, and reproducible behavior.

4. Perform your feature engineering, and perform your model training.
5. Evaluate your model.
6. Store your model (and feature transformation models).
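
To make the structure from step 2 concrete, here is a minimal sketch of how two nested named tuples could behave. The class definitions below are illustrative stand-ins written for this example, not `mlexpy`'s actual source; only the attribute names (`.obs`, `.labels`, `.train_data`, `.test_data`) are taken from this notebook, and the toy data is made up.

```python
from typing import NamedTuple

import pandas as pd


class MLSetup(NamedTuple):
    """Illustrative stand-in: the features and labels of one split."""

    obs: pd.DataFrame
    labels: pd.Series


class ExperimentSetup(NamedTuple):
    """Illustrative stand-in: a training split plus a held-out test split."""

    train_data: MLSetup
    test_data: MLSetup


# Toy data, made up for this example.
features = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
labels = pd.Series([0, 1, 0, 1], name="y")

setup = ExperimentSetup(
    train_data=MLSetup(features.iloc[:3], labels.iloc[:3]),
    test_data=MLSetup(features.iloc[3:], labels.iloc[3:]),
)

print(setup.train_data.obs.shape)
print(setup.test_data.labels.shape)

# Named tuples are immutable: reassigning a field raises AttributeError.
try:
    setup.train_data = setup.test_data
    reassigned = True
except AttributeError:
    reassigned = False
print(f"Reassignment succeeded: {reassigned}")
```

Immutability of the tuple guards against accidentally swapping the training and test splits part-way through an experiment, although the DataFrames held inside remain mutable objects.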


In [2]:
# First, set the random seed(s) for the experiment
MODEL_SEED = 10
PROCESS_SEED = 100

model_rs = np.random.RandomState(MODEL_SEED)
process_rs = np.random.RandomState(PROCESS_SEED)

# Next, read in the dataset as a dataframe. Because mlexpy is meant to be an exploratory/experimental tool, 
# dataframes are preferred for their readability.
data = load_iris(as_frame=True)
features = data["data"]
labels = data["target"]

# We want to look at the dataset for any faulty records...
print(features.isna().sum())
print(features.describe())

# Spoiler -- there are none in the features. Next look in the labels...
print(labels.isna().sum())

# Spoiler -- none again, so we use all data.


sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.350000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)  
count        150.000000  
mean           1.199333  
std            0.762238  
min            0.100000  
25%            0.300000  
50%            1.300000  
75%            1.800000  
max            2.500000  
0
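
The cell above creates two separate `np.random.RandomState` generators, one for modeling and one for processing. A plausible reading of that choice, offered here as an assumption rather than a statement from the `mlexpy` documentation, is that consuming random numbers in one stage should not shift the sequence seen by the other. The sketch below uses only NumPy to show both properties.

```python
import numpy as np

MODEL_SEED = 10
PROCESS_SEED = 100

# Two generators built from the same seed produce identical draws...
first = np.random.RandomState(PROCESS_SEED)
second = np.random.RandomState(PROCESS_SEED)
draws_first = first.randint(0, 1000, size=5)
draws_second = second.randint(0, 1000, size=5)
same_seed_matches = bool((draws_first == draws_second).all())

# ...and drawing from the model generator leaves the processing generator untouched.
model_rs = np.random.RandomState(MODEL_SEED)
process_rs = np.random.RandomState(PROCESS_SEED)
_ = model_rs.randint(0, 1000, size=50)  # consume part of the model stream
draws_after = process_rs.randint(0, 1000, size=5)
streams_independent = bool((draws_after == draws_first).all())

print(same_seed_matches, streams_independent)  # True True
```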


In [3]:

# Now, generate the ExperimentSetup object, which splits the dataset for training and testing.
expirament_setup = pipeline_utils.get_stratified_train_test_data(train_data=features, label_data=labels, test_frac=0.35, random_state=process_rs)

# This provides us with a named tuple, with attributes of .train_data and .test_data 
# each one with attributes of .obs and .labels. For example...
train_label_count = expirament_setup.train_data.labels.shape[0]
test_label_count = expirament_setup.test_data.labels.shape[0]
total_data_count = features.shape[0]

print(f"Train labels are {round(train_label_count / total_data_count * 100, 2)}% of the original data ({train_label_count}).")
print(f"Test labels are {round(test_label_count / total_data_count * 100, 2)}% of the original data ({test_label_count}).")

Train labels are 64.67% of the original data (97).
Test labels are 35.33% of the original data (53).
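
For readers who want to see what a stratified split does to class proportions, here is a minimal sketch using scikit-learn's `train_test_split` directly. Whether `get_stratified_train_test_data` wraps this exact function is an assumption; the sketch only illustrates the technique, with the same test fraction (0.35) used above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris(as_frame=True)
features, labels = data["data"], data["target"]

# stratify=labels keeps each class's share roughly equal in both splits.
train_obs, test_obs, train_labels, test_labels = train_test_split(
    features,
    labels,
    test_size=0.35,
    stratify=labels,
    random_state=100,
)

print(len(train_labels), len(test_labels))
print(train_labels.value_counts(normalize=True).round(2).to_dict())
print(test_labels.value_counts(normalize=True).round(2).to_dict())
```

Iris has three equally sized classes, so each class should make up close to one third of both the training and the test labels.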


In [4]:
# Now, define the processing class. This inherits from the `ProcessPipelineBase` class. 
# For minimal functionality, this class simply needs the `.process_data()` method to be defined. Leaving 
# this method undefined will result in a `NotImplementedError`.

# The following shows an example of how to use this class:

class IrisPipeline(processor.ProcessPipelineBase):
    def __init__(
        # All of the Optional arguments are not strictly necessary but are shown for completeness.
        self, 
        process_tag: str = "_development", 
        model_dir = None, 
        model_storage_function = None, 
        model_loading_function = None
        ) -> None:
        super().__init__(process_tag, model_dir, model_storage_function, model_loading_function)

    # Now -- define the .process_data() method.
    def process_data(self, df_i: pd.DataFrame, training: bool = True) -> pd.DataFrame:
        """Do all feature engineering in this method, and return the final data/feature set to perform
        predictions on.

        Imagine we have 1 desired feature to engineer, petal/sepal area, and then normalize the features.
        We need to pay attention to the normalizing step, because the scaler may ONLY be fit on the training
        set and then applied, unchanged, to the test set. Thus there is a fork in the process when doing the
        feature normalization.
        
        In order to easily maintain reproducibility in data processing, any model based feature engineering (such
        as normalization) is done by creating a specific data structure storing the order for processing each column, 
        and the model that should be applied. This is somewhat similar to the ColumnTransformer in sklearn.

        Model based features are handled in the .fit_model_based_features() method, described below.
         
        Let's begin:
        """

        # Do a copy of the passed df
        df = df_i.copy()

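        # First, simplify the column names (e.g. "sepal length (cm)" -> "sepal_length"), then compute
        # the petal / sepal areas. The unit suffix is removed explicitly, because str.strip() removes a
        # set of characters rather than an exact suffix.
        df.columns = [col.replace(" (cm)", "").replace(" ", "_") for col in df.columns]

        for part in ["petal", "sepal"]:
            df[f"{part}_area"] = df[f"{part}_length"] * df[f"{part}_width"]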

        # Now perform the training / testing dependent feature processing. This is why a `training` boolean is passed.

        if training:
            # Now FIT all of the model based features...
            self.fit_model_based_features(df)
            # ... and get the results of a transformation of all model based features.
            model_features = self.transform_model_based_features(df)
        else:
            # Here we can ONLY apply the transformation
            model_features = self.transform_model_based_features(df)
        
        # Now, join these 2 dataframes together "horizontally"

        all_feature_df = pd.concat([df, model_features], axis=1)

        # Imagine we only want to use the scaled features for prediction, then we retrieve only the scaled columns.
        # (This is easy because the columns are renamed with the model name in the column name)

        prediction_df = all_feature_df[[col for col in all_feature_df if "standardscaler" in col]]

        return prediction_df

    def fit_model_based_features(self, df: pd.DataFrame) -> None:
        """
        Here we do any processing of columns that will require a model based transformation / engineering.

        In this case, simply fit a standard (normalization) scaler to the numerical columns. 
        This case will result in additional columns on the dataframe named as 
        "<original-column-name>_standardscaler()".

        Note: there are no returned values for this method; the result is an update to the self.column_transformations dictionary.
        """
        for column in df.columns:
            if not pd.api.types.is_numeric_dtype(df[column]):
                continue
            self.fit_scaler(df[column], standard_scaling=True)
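
The fit-on-train, transform-on-test fork described in the docstring can be shown in isolation. The sketch below uses scikit-learn's `StandardScaler` directly on synthetic data rather than the `mlexpy` helpers (`fit_scaler`, `transform_model_based_features`), whose internals are not shown in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for this example.
rng = np.random.RandomState(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
test = rng.normal(loc=5.0, scale=2.0, size=(40, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # mean and std are learned from training rows only
test_scaled = scaler.transform(test)        # the same statistics are reused, nothing is refit

# The test set is scaled with the TRAINING mean and std, so its own mean is
# close to, but not exactly, zero.
manual_test_scaled = (test - train.mean()) / train.std()
print(np.allclose(test_scaled, manual_test_scaled))
```

Fitting the scaler on the test rows as well would leak information about the held-out data into the features, which is what the `training` flag is meant to prevent.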

In [5]:
# As an example, let's look at the outputs of the `.process_data()` method.

iris_processor = IrisPipeline(model_dir=Path.cwd())  # set the model path to the examples directory

# now run the process_data method
processed_df = iris_processor.process_data(df_i=expirament_setup.train_data.obs.copy(), training=True)

processed_df.head()

INFO:mlexpy.processor:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.processor:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.processor:setting the model path to /Users/NathanSankary/git/mlexpy/examples. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:Fitting a standard scaler to sepal_length.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_width.
INFO:mlexpy.processor:Fitting a standard scaler to petal_length.
INFO:mlexpy.processor:Fitting a standard scaler to petal_width.
INFO:mlexpy.processor:Fitting a standard scaler to petal_area.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_area.
INFO:mlexpy.processor:Applying the StandardScaler() to sepal_length
INFO:mlexpy.processor:Applying the StandardScaler() to sepal_width
INFO:mlexpy.processor:Applying the StandardScaler() to petal_length
INFO:mlexpy.processor:A

| | sepal_length_standardscaler() | sepal_width_standardscaler() | petal_length_standardscaler() | petal_width_standardscaler() | petal_area_standardscaler() | sepal_area_standardscaler() |
|---|---|---|---|---|---|---|
| 101 | -0.077948 | -0.898694 | 0.748068 | 0.898228 | 0.80441 | -0.707603 |
| 59 | -0.80965 | -0.898694 | 0.067794 | 0.246071 | -0.090103 | -1.205176 |
| 83 | 0.165953 | -0.898694 | 0.748068 | 0.506934 | 0.480863 | -0.541745 |
| 6 | -1.541352 | 0.791648 | -1.349444 | -1.188672 | -1.155906 | -0.713746 |
| 92 | -0.077948 | -1.140172 | 0.124483 | -0.014791 | -0.229673 | -0.885746 |
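
One pitfall is worth knowing when cleaning column names such as `sepal length (cm)` into the `sepal_length` form seen above: `str.strip` takes a set of characters to remove from both ends, not an exact suffix. It happens to give the intended result for the four iris columns, but the hypothetical `crown width (cm)` column in this sketch (not part of the dataset) shows how it can over-trim.

```python
# The third column name is hypothetical and only illustrates the pitfall.
columns = ["sepal length (cm)", "petal width (cm)", "crown width (cm)"]

# str.strip removes ANY of the characters "_", "(", "c", "m", ")" from both ends,
# so a name that starts with "c" or "m" loses its first letter.
stripped = [col.replace(" ", "_").strip("_(cm)") for col in columns]

# Removing the exact unit text first avoids that.
suffix_removed = [col.replace(" (cm)", "").replace(" ", "_") for col in columns]

print(stripped)        # ['sepal_length', 'petal_width', 'rown_width']
print(suffix_removed)  # ['sepal_length', 'petal_width', 'crown_width']
```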


In [6]:
# Now, define the IrisExpirament class to inherit from the ClassifierExpirament class.
# This works similarly to the `ProcessPipelineBase` class: an experiment is instantiated
# as a class, inheriting a variety of experimental tooling.

# We need to define 1 method for our child class, shown below, to handle the training and test data processing outlined in 
# our IrisPipeline class

class IrisExpirament(experiment.ClassifierExpirament):
    def __init__(
        self, 
        train_setup: pipeline_utils.MLSetup, 
        test_setup: pipeline_utils.MLSetup, 
        cv_split_count: int, 
        rnd_int: int = 100, 
        model_dir: Optional[Union[str, Path]] = None, 
        model_storage_function: Optional[Callable] = None, 
        model_loading_function: Optional[Callable] = None, 
        model_tag: str = "_development",
        process_tag: str = "_development"
        ) -> None:
        super().__init__(train_setup, test_setup, cv_split_count, rnd_int, model_dir, model_storage_function, model_loading_function, model_tag, process_tag)


    def process_data(
        self,
    ) -> pipeline_utils.ExperimentSetup:

        processor = IrisPipeline(process_tag=self.process_tag, model_dir=self.model_dir)

        # Label encoding would happen here. The iris targets are already integer encoded,
        # so the encoder calls are left commented out and the labels are used as-is.
        # processor.fit_label_encoder(self.training.labels)
        # encoded_training_labels = processor.encode_labels(self.training.labels)
        # encoded_testing_labels = processor.encode_labels(self.testing.labels)
        encoded_training_labels = self.training.labels
        encoded_testing_labels = self.testing.labels

        # Now call the .process_data() method we defined above.
        train_df = processor.process_data(self.training.obs, training=True)
        test_df = processor.process_data(self.testing.obs, training=False)

        # Keep only the rows present in both the processed features and the labels.
        retained_training_indices = set(train_df.index).intersection(
            encoded_training_labels.index
        )
        retained_testing_indices = set(test_df.index).intersection(
            encoded_testing_labels.index
        )

        print(
            f"The train data are of size {train_df.shape}, the test data are {test_df.shape}."
        )

        assert (
            len(set(train_df.index).intersection(set(test_df.index))) == 0
        ), "There are overlapping indices in the train and test set."

        return pipeline_utils.ExperimentSetup(
            pipeline_utils.MLSetup(
                train_df.loc[list(retained_training_indices)],
                encoded_training_labels.loc[list(retained_training_indices)],
            ),
            pipeline_utils.MLSetup(
                test_df.loc[list(retained_testing_indices)],
                encoded_testing_labels.loc[list(retained_testing_indices)],
            ),
        )
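
The index intersection in `process_data` keeps features and labels aligned when processing drops rows. The sketch below shows the same idea on toy data (made up for this example) using `pandas.Index.intersection`, one possible alternative to a Python `set` that returns an index which can be passed straight to `.loc`.

```python
import pandas as pd

# Toy data, made up for this example; row 11 has a missing feature value.
obs = pd.DataFrame({"petal_area": [0.28, None, 7.5, 12.0]}, index=[10, 11, 12, 13])
labels = pd.Series([0, 0, 1, 2], index=[10, 11, 12, 13], name="target")

# Suppose processing drops the incomplete row.
processed = obs.dropna()

# Keep only the index values present in both objects.
shared = processed.index.intersection(labels.index)
aligned_obs = processed.loc[shared]
aligned_labels = labels.loc[shared]

print(list(aligned_obs.index))  # [10, 12, 13]
print(list(aligned_labels))     # [0, 1, 2]
```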
    

In [7]:
print(len(expirament_setup.train_data.labels))


97


In [8]:
# Now, our "work" is done, let's pass our data through this process!

# Try using a random forest

# Define the experiment
expirament = IrisExpirament(
    train_setup=expirament_setup.train_data,
    test_setup=expirament_setup.test_data,
    cv_split_count=20,
    model_tag="iris_classification_example_model",
    process_tag="iris_classification_example_process",
    model_dir=Path.cwd()
)

# Now begin the experimentation. Start with performing the data processing...
processed_datasets = expirament.process_data()

# ... then train the model...
trained_model = expirament.train_model(
    RandomForestClassifier(random_state=model_rs),  # This is why we have 2 different random states...
    processed_datasets,
    # model_algorithm.hyperparams,  # If this is passed, a cross-validated hyperparameter search is performed, which is slower.
)

INFO:mlexpy.experiment:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.experiment:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.experiment:setting the model path to /Users/NathanSankary/git/mlexpy/examples. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:No model storage function provided. Using the default class method (joblib, or .store_model native method).
INFO:mlexpy.processor:No model loading function provided. Using the default class method (joblib, or .load_model native method).
INFO:mlexpy.processor:setting the model path to /Users/NathanSankary/git/mlexpy/examples/iris_classification_example_process. (Converting from string to pathlib.Path)
INFO:mlexpy.processor:Fitting a standard scaler to sepal_length.
INFO:mlexpy.processor:Fitting a standard scaler to sepal_width.
INFO:mlexpy.processor:Fitting a standard scaler to p

The train data are of size (97, 6), the test data are (53, 6).


INFO:mlexpy.experiment:Model trained
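
The commented-out hyperparameter argument in the training call mentions a cross-validated search. How `mlexpy` runs that search is not shown in this notebook, so the sketch below uses scikit-learn's `GridSearchCV` as a stand-in for the general technique. The parameter grid and fold count are hypothetical values chosen to keep the example fast.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical search space, for illustration only.
param_grid = {"n_estimators": [10, 25], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=10),
    param_grid=param_grid,
    cv=5,                # number of cross-validation folds
    scoring="f1_macro",  # one of the metrics reported later in this notebook
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Every combination in the grid is fit once per fold, which is why a search like this is noticeably slower than training a single model.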


In [10]:
# Now, evaluate the predictions. The ClassifierExpirament base class provides some standard classification metrics
# and evaluations.

# Get the predictions and evaluate the performance.
predictions = expirament.predict(processed_datasets, trained_model)
results = expirament.evaluate_predictions(processed_datasets, predictions=predictions)




The f1_macro score is: 
 0.8848684210526315.

The f1_micro score is: 
 0.8867924528301887.

The f1_weighted score is: 
 0.8856752730883812.

The balanced_accruacry score is: 
 0.8877995642701525.

The accuracy score is: 
 0.8867924528301887.

The confusion_matrix score is: 
 [[18  0  0]
 [ 0 16  1]
 [ 0  5 13]].

The classification_report score is: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       0.76      0.94      0.84        17
           2       0.93      0.72      0.81        18

    accuracy                           0.89        53
   macro avg       0.90      0.89      0.88        53
weighted avg       0.90      0.89      0.89        53
.
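
As a cross-check, the summary scores above can be recomputed from the confusion matrix alone. The sketch below uses only NumPy and assumes the scikit-learn convention that rows are true classes and columns are predicted classes, which agrees with the per-class support in the classification report.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[18, 0, 0],
               [0, 16, 1],
               [0, 5, 13]])

true_pos = np.diag(cm)
support = cm.sum(axis=1)    # number of true examples per class
predicted = cm.sum(axis=0)  # number of predictions per class

precision = true_pos / predicted
recall = true_pos / support
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / cm.sum()  # equals f1_micro for single-label multi-class data
balanced_accuracy = recall.mean()     # unweighted mean of per-class recall
f1_macro = f1.mean()                  # unweighted mean of per-class F1
f1_weighted = (f1 * support).sum() / support.sum()

print(accuracy, balanced_accuracy, f1_macro, f1_weighted)
```

The recomputed values agree with the reported accuracy, balanced accuracy, macro F1, and weighted F1. Most of the error comes from 5 class-2 flowers predicted as class 1.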
