# Iris species detection - SVM model

##### Jupyter helpers:

In [None]:
%reload_ext autoreload
%autoreload 2

Define imports

In [None]:
from os import path
from iris.data import DataLoader
from iris.data_processing import DataProcessor, EmptyProcessor
from iris.models import BaseModel
from iris.experimentation import MlflowExperimentation
from iris.evaluation import Evaluator, EvaluationMetrics
from iris import ExperimentRunner

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics

## Load data

### DataLoader definition

The first step is to implement a `DataLoader`, which defines the logic for obtaining a dataset. It could fetch the dataset from a local folder or from a remote location such as the web, S3, or Blob storage.

To implement a `DataLoader`, there are two main functions to be created:
- `download_dataset`: A function for downloading the dataset to the local machine. It should be implemented so that it downloads only once and afterwards checks whether the dataset already exists locally.
- `get_dataset`: A function that returns the dataset, ready for modeling, to the experiment code itself.
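The "download only once" behavior described for `download_dataset` can be sketched as follows. This is a minimal standalone illustration, not the framework's implementation: the helper name `download_once`, the injectable `fetch` argument, and the `https://example.com/iris.csv` URL are assumptions made for the example. The demo uses a fake fetcher so it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_once(url, target, fetch=urlretrieve):
    """Download `url` to `target` only if `target` does not exist yet.

    Returns True when a download happened, False when the local copy was reused.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    target = Path(target)
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(target))
    return True


# Demo with a fake fetcher that writes a tiny CSV instead of hitting the network.
calls = []


def fake_fetch(url, filename):
    calls.append(url)
    Path(filename).write_text("Id,Species\n1,Iris-setosa\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw" / "iris.csv"
    # Hypothetical URL, used only as a label by the fake fetcher.
    first = download_once("https://example.com/iris.csv", target, fetch=fake_fetch)
    second = download_once("https://example.com/iris.csv", target, fetch=fake_fetch)

print(first, second, len(calls))
```

Inside a `DataLoader` subclass, the same existence check could live in `download_dataset`, so that re-running the notebook reuses the local copy.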

Note:
- Each dataset should have a name and a version, which record exactly what data was used for the experiment. This makes it possible to reproduce the experiment.
- The dataset obtained should already be ready for modeling. Any train/test split should be done before the dataset is loaded. We don't want to introduce any randomness here, so that models are compared on the exact same data.
- In this example, we've added a new method, `prep_dataset_for_modeling`. This method performs the train/test split, but it shouldn't be called in the usual lifecycle of a notebook.
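One way to keep a one-off split free of run-to-run randomness is to fix the seed. The sketch below uses scikit-learn's `train_test_split` with `random_state` and `stratify`. The small table, the helper name `split`, and the seed value are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the raw iris table (values are made up for illustration).
iris = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8, 5.0, 6.9, 7.1, 4.7, 6.5, 6.0],
        "Species": ["setosa", "setosa", "versicolor", "versicolor",
                    "virginica", "virginica"] * 2,
    }
)


def split(frame, seed=42):
    # A fixed random_state makes the shuffle deterministic; stratify keeps
    # the class proportions the same in both parts.
    return train_test_split(
        frame, test_size=0.5, random_state=seed, stratify=frame["Species"]
    )


train_a, test_a = split(iris)
train_b, test_b = split(iris)

# The same seed yields the same rows in the same order.
print(train_a.index.equals(train_b.index))
```

Storing the resulting files under a dataset version, as the notes above suggest, then ties every experiment to one specific split.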


In [None]:
class IrisDataLoader(DataLoader):
    
    def get_dataset(self):
        train = pd.read_csv(f"../data/processed/{self.dataset_name}-{self.dataset_version}-train.csv",index_col="Id")
        test = pd.read_csv(f"../data/processed/{self.dataset_name}-{self.dataset_version}-test.csv",index_col="Id")
        
        X_train = train.drop('Species',axis=1)
        y_train = train['Species']
        X_test = test.drop('Species',axis=1)
        y_test = test['Species']
        
        print(f"Loaded {len(train)} train and {len(test)} test samples")
        return X_train, y_train, X_test, y_test
      
    def download_dataset(self):
        # No download step: this example reads the CSV files from the local data folder
        pass

    def prep_dataset_for_modeling(self):
        """
        Creates a train/test split of the dataset and stores it in data/processed
        """
        print("Creating train/test split")
        iris = pd.read_csv(f"../data/raw/{self.dataset_name}.csv",index_col='Id')
        # Fixed seed so that re-running this method reproduces the same split
        train, test = train_test_split(iris, test_size=0.3, random_state=42)
        train.to_csv(f"../data/processed/{self.dataset_name}-{self.dataset_version}-train.csv")
        test.to_csv(f"../data/processed/{self.dataset_name}-{self.dataset_version}-test.csv")
        

### Load data
Once we have implemented our `DataLoader`, we can instantiate it and call `download_dataset()` followed by `get_dataset()`. This way we ensure that our notebook can be run anywhere.


In [None]:
data_loader = IrisDataLoader(dataset_name = 'iris', dataset_version = "1")
# One-off step: creates the train/test split files. Skip it once they exist in data/processed.
data_loader.prep_dataset_for_modeling()
data_loader.download_dataset()
X_train, y_train, X_test, y_test = data_loader.get_dataset()

X_train.head()

## Experiment logging/tracking

The next phase is the experiment logger definition. The default one uses MLflow, but the API is generic and can be extended to any experiment tracking mechanism. 
The experimentation class collects all the parameters and metrics that the experiments emit along the way (from the dataset name and version, through the model hyperparameters, up to the final metric values).
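Because the tracking API is generic, a different backend mainly needs to accept parameters and metrics. The sketch below is a hypothetical in-memory tracker. The class name `InMemoryExperimentation` and the methods `log_param`, `log_params` and `log_metric` are assumptions for illustration and may not match the framework's actual base class. The accuracy value is a placeholder, not a result.

```python
class InMemoryExperimentation:
    """Hypothetical tracker that keeps params and metrics in plain dicts."""

    def __init__(self):
        self.params = {}
        self.metrics = {}

    def log_param(self, key, value):
        self.params[key] = value

    def log_params(self, params):
        for key, value in params.items():
            self.log_param(key, value)

    def log_metric(self, key, value):
        # Metrics are stored as floats so they can be compared across runs.
        self.metrics[key] = float(value)


tracker = InMemoryExperimentation()
tracker.log_params({"dataset_name": "iris", "dataset_version": "1", "kernel": "rbf"})
tracker.log_metric("accuracy", 0.95)  # placeholder value
print(tracker.params, tracker.metrics)
```

A tracker like this can be handy in unit tests, where starting an MLflow run would be unnecessary overhead.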

To use the default one, just call `MlflowExperimentation()`.

> Note: If you plan to use MLflow hosted on Databricks, follow these steps:
1. Pass `tracking_uri='databricks'` to the `MlflowExperimentation` object
2. See [this doc on how to create a personal access token](https://docs.databricks.com/dev-tools/api/latest/authentication.html#token-management) 
3. See [this doc on setting up databricks-cli](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/)
4. [Create a new experiment in MLflow](https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/) (if needed)


In [None]:
experimentation = MlflowExperimentation()

## Modeling

The next step is writing our actual model, with optional preprocessing and postprocessing.

The class to implement is `BaseModel`, which exposes the sklearn-style `fit` and `predict` functions that need to be implemented.

Note:
- The model class needs to define which parameters should be logged, by adding keys and values to the `self.hyper_params` dict, or by passing the variables to the super's `__init__` method, e.g. `super().__init__(param_a=param_a, param_b=param_b,...)`.
- The base class contains fields for `DataProcessor`s: `preprocessor` and `postprocessor`. Use these if you want the preprocessing or postprocessing to occur during the model call. This makes it easier to operationalize the model in a new environment without having to provide all the preprocessing and postprocessing scripts.
- It is also possible to pass the `Experimentation` object, if it is required during training (for example, while storing values for each epoch during model training)
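The two ways of registering hyperparameters mentioned in the notes can be sketched like this. `SketchBaseModel` is a hypothetical stand-in written for this example. How the framework's `BaseModel` stores keyword arguments is an assumption here, and `KernelModel` with its `C` and `library` entries is purely illustrative.

```python
class SketchBaseModel:
    """Hypothetical stand-in for the framework's BaseModel."""

    def __init__(self, preprocessor=None, postprocessor=None, **hyper_params):
        self.preprocessor = preprocessor
        self.postprocessor = postprocessor
        # Every extra keyword argument is kept so it can be logged later.
        self.hyper_params = dict(hyper_params)


class KernelModel(SketchBaseModel):
    def __init__(self, kernel="rbf", C=1.0):
        # Option 1: forward the values to the base class.
        super().__init__(kernel=kernel, C=C)
        # Option 2: add keys directly to the dict.
        self.hyper_params["library"] = "scikit-learn"


model = KernelModel(kernel="linear")
print(model.hyper_params)
```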

> This is a simple example that wraps scikit-learn's SVM model. Hyperparameters can be set in the `__init__` method and passed to the super `__init__` to be stored as parameters.

> We are not doing any data preprocessing here, but if we did, we would just create an `IrisPreprocessor(DataProcessor)` class and implement the logic there. Then we would pass it to the model as a parameter, and the model would run it on every sample during training and testing.
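As a rough idea of what such a preprocessor might look like, the sketch below rescales columns with fixed bounds. It is a standalone class rather than a real `DataProcessor` subclass, because only the `apply_batch` method is visible in this notebook. The class name `MinMaxPreprocessor`, its constructor arguments, and the bounds are assumptions chosen for illustration.

```python
import pandas as pd


class MinMaxPreprocessor:
    """Hypothetical preprocessor: rescales columns to [0, 1] using fixed bounds."""

    def __init__(self, lower, upper):
        # Bounds are given per column name, e.g. {"PetalLengthCm": 1.0}.
        self.lower = pd.Series(lower, dtype=float)
        self.upper = pd.Series(upper, dtype=float)

    def apply_batch(self, X):
        # Columns are aligned by name; a new DataFrame is returned.
        cols = self.lower.index
        return (X[cols] - self.lower) / (self.upper - self.lower)


X = pd.DataFrame({"PetalLengthCm": [1.0, 4.0, 7.0], "PetalWidthCm": [0.0, 1.5, 3.0]})
pre = MinMaxPreprocessor(
    lower={"PetalLengthCm": 1.0, "PetalWidthCm": 0.0},  # made-up bounds
    upper={"PetalLengthCm": 7.0, "PetalWidthCm": 3.0},
)
scaled = pre.apply_batch(X)
print(scaled)
```

Because the bounds are fixed up front, the same transformation is applied at training and prediction time, which is the property the model wrapper relies on.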

In [None]:
class IrisSVMModel(BaseModel):
    """
    sklearn SVM model wrapper
    """

    def __init__(
        self, features, kernel="rbf", label="Species", preprocessor=EmptyProcessor()
    ):
        self.features = features
        self.kernel = kernel
        self.model = None

        super().__init__(features=features, label=label, kernel=kernel, preprocessor=preprocessor)

    def fit(self, X, y=None) -> None:
        train_X = X[self.features]
        train_y = y

        train_X_processed = self.preprocessor.apply_batch(train_X)

        print("Fitting model")
        self.model = svm.SVC(kernel=self.kernel)
        self.model.fit(train_X_processed, train_y)
        print(f"Finished fitting model {self.model}")

    def predict(self, X):
        test_X = X[self.features]
        test_X_processed = self.preprocessor.apply_batch(test_X)

        print(f"Predicting on {len(test_X)} samples")
        predictions = self.model.predict(test_X_processed)
        print("Finished prediction")
        return predictions


### Model training

The model we just created can be called and fitted. Alternatively, we can postpone the fit to the last part, which performs a full experiment cycle.


In [None]:
svm_model = IrisSVMModel(features = ['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm'])
svm_model.fit(X=X_train, y=y_train)

## Model evaluation
In this phase, we define how the model should be evaluated. There are two main building blocks:
- `Evaluator`: Which holds the logic for how evaluation takes place. The function to implement is `evaluate`.
- `EvaluationMetrics`: Which holds the actual values of metrics. The function to implement is `get_metrics`.
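An evaluator is not limited to a single number. As a sketch of the kind of logic `evaluate` could hold, the function below computes accuracy and macro-averaged F1 with scikit-learn and returns them as a dict, similar in spirit to `get_metrics`. The function name and the toy labels are illustrative assumptions.

```python
from sklearn import metrics


def evaluate(y_test, prediction):
    """Return several metrics as a plain dict, ready to be logged."""
    return {
        # Both scikit-learn functions take (y_true, y_pred) in that order.
        "accuracy": metrics.accuracy_score(y_test, prediction),
        "macro_f1": metrics.f1_score(y_test, prediction, average="macro"),
    }


y_test = ["setosa", "setosa", "virginica", "virginica"]
prediction = ["setosa", "virginica", "virginica", "virginica"]
scores = evaluate(y_test, prediction)
print(scores)
```

Here three of the four toy predictions are correct, so the accuracy is 0.75, while macro F1 averages the per-class F1 scores and therefore weights both classes equally.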

> In this example, we implement a simple `IrisEvaluationMetrics` class, and an `IrisEvaluator` class, which holds the evaluation logic.

In [None]:
class IrisEvaluationMetrics(EvaluationMetrics):
    def __init__(self, accuracy):
        self.accuracy=accuracy
    
    def get_metrics(self):
        return {"accuracy":self.accuracy}
    
    def __repr__(self):
        return str(self.__dict__)

class IrisEvaluator(Evaluator):
    def evaluate(self, y_test, prediction) -> EvaluationMetrics:
        # accuracy_score expects (y_true, y_pred)
        return IrisEvaluationMetrics(accuracy=metrics.accuracy_score(y_test, prediction))

evaluator = IrisEvaluator()

## Running an experiment

To run the full experiment, we use the `ExperimentRunner` class. This class is in charge of evaluating the model on a test dataset, calculating metrics, collecting all params and metrics, and logging them to the experiment logger. It acts as an experiment orchestrator. 
In addition to the collected params and metrics, you can pass extra params in the call to `ExperimentRunner`, and these will be collected too. 

> In many cases the `ExperimentRunner` class can be used without any modification. If modifications are needed, make sure that you implement the various functions, and verify that the different params and metrics are logged correctly (in the `__init__`).


In the following cell we instantiate the `ExperimentRunner` object, while passing all the previous building blocks.

Finally, we call `experiment_runner.evaluate()` to perform prediction on the supplied test set, calculate metrics and store everything in the experiment logger.
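Conceptually, the evaluation step described above amounts to "predict, score, log". The sketch below shows that flow with plain Python stand-ins. It is not the `ExperimentRunner` implementation: `run_experiment`, `MajorityModel`, `accuracy` and the callback-style loggers are all hypothetical names introduced for this example.

```python
def run_experiment(model, X_test, y_test, evaluate, log_param, log_metric, **extra_params):
    """Hypothetical orchestration: predict, score, then log params and metrics."""
    predictions = model.predict(X_test)
    scores = evaluate(y_test, predictions)
    # Model hyperparameters and any extra params are logged together.
    for key, value in {**model.hyper_params, **extra_params}.items():
        log_param(key, value)
    for key, value in scores.items():
        log_metric(key, value)
    return scores


class MajorityModel:
    """Toy model that always predicts one label."""

    hyper_params = {"strategy": "constant"}

    def predict(self, X):
        return ["setosa"] * len(X)


def accuracy(y_true, y_pred):
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return {"accuracy": hits / len(y_true)}


logged_params, logged_metrics = {}, {}
scores = run_experiment(
    MajorityModel(),
    X_test=[[5.1], [6.3], [4.9], [7.0]],
    y_test=["setosa", "virginica", "setosa", "versicolor"],
    evaluate=accuracy,
    log_param=logged_params.__setitem__,
    log_metric=logged_metrics.__setitem__,
    note="baseline",  # extra param, collected alongside the model's own
)
print(scores, logged_params)
```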


In [None]:
experiment_runner = ExperimentRunner(
    model=svm_model,
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    data_loader=data_loader,
    log_experiment=True,
    experiment_logger=experimentation,
    evaluator=evaluator,
    experiment_name="Experiment",
)

results = experiment_runner.evaluate()
print(results)


### Summary

This example flow demonstrates how to use the different building blocks in this framework.

**Possible next steps:**
1. Implement the different modules in the Python package, and use them in other notebooks / scripts / modules
2. Run `mlflow ui` from this notebook's path and observe the different parameters and metrics stored
3. Create a [notebook template](../notebook_templates/notebook_template.md) for your experiment, which can be used to generate new notebooks containing the experimental flow (loading data, experimentation, evaluation, running the experiment)

To start the MLflow UI, run `!mlflow ui` and open http://localhost:5000/#/ in your browser.


In [None]:
!mlflow ui

Open http://localhost:5000/#/ to view the MLflow dashboard.