# YellowCab

# Experiment Tracking and Model Management with MLflow

There are many ways to use the MLflow Tracking API. For simple local use, the easiest option is to leave data management to MLflow and let it store runs, metrics, models and artifacts locally. For more advanced usage, all of this information can be stored in databases. You can find the details in MLflow's documentation [here](https://mlflow.org/docs/latest/tracking.html#scenario-1-mlflow-on-localhost).

## Exploring MLflow

MLflow setup:
* Tracking server: no
* Backend store: local filesystem
* Artifacts store: local filesystem

The experiments can be explored locally by launching the MLflow UI.

First make sure you run the mlflow server in this challenge directory. Open a new terminal and run the following command:

```bash
mlflow ui
```
Let's set and print the tracking URI, where the experiments and runs are going to be logged. The host and port in the URI must match those of the running MLflow server.

In [4]:
import mlflow

mlflow.set_tracking_uri("http://localhost:5042")
print(f"tracking URI: '{mlflow.get_tracking_uri()}'")


After this initialization, we can create a client to connect to the API and see which experiments are present.

By referring to MLflow's [documentation](https://mlflow.org/docs/latest/python_api/mlflow.client.html), create a client and display a list of the available experiments using the search_experiments function. This function could prove useful later to explore experiments programmatically (rather than in the UI).

In [5]:
from mlflow import MlflowClient

client = MlflowClient()
client.search_experiments()

We see that there is a default experiment for which the runs are stored locally in the mlruns folder.

### Creating an experiment and logging a new run

An experiment is a logical entity that groups the logs of multiple attempts at solving the same problem, called runs. \
We will now work with the classic sklearn iris dataset. Our goal here is to classify the different iris species. To track our model's performance, we will log every attempt as a "run" and create a new experiment "iris-experiment-1" to group them.

Look up the mlflow.run and mlflow.start_run functions [here](https://mlflow.org/docs/latest/python_api/mlflow.html?highlight=start_run#mlflow.start_run) to find out how to manage runs.
Explore [this part](https://mlflow.org/docs/latest/python_api/mlflow.html) to learn more about the log_params, log_metrics and log_artifact functions. Find out how to log sklearn models [here](https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html).

Complete the following in order to log the parameters, interesting metrics and the model.

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_experiment("iris-experiment-1")
X, y = load_iris(return_X_y=True)

params = {"C": 0.1, "random_state": 42}

model = LogisticRegression(**params).fit(X, y)
y_pred = model.predict(X)
accuracy = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred, average="weighted")

with mlflow.start_run() as run:
    run_id = run.info.run_id
    print(f"run ID: '{run_id}'")
    # Log params
    mlflow.log_params(params)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1", f1)

    print(f"default artifacts URI: '{mlflow.get_artifact_uri()}'")

    # Log the model with its signature and register it as "iris_lr_model"
    from mlflow.models.signature import infer_signature

    signature = infer_signature(X, model.predict(X))
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="iris_lr_model",
        signature=signature,
    )

    # Log tags
    mlflow.set_tag("mlflow.source.type", "notebook")

In [7]:
experiments = client.search_experiments()
experiments

Try running the training cell above with various parameters so that you have several runs to compare.
You can now explore your run(s) using the ui: \
(Paste "mlflow ui --host 0.0.0.0 --port 5002" in your terminal, or run the cell below)

**N.B.** Make sure you are in the lecture folder and not the repo root!

In [8]:
#!mlflow ui --host 0.0.0.0 --port 5002

You will have to interrupt the cell to continue experimenting.

### Interacting with the model registry

If you are satisfied with the last run's model, you can transform the logged model into a registered model. It will be logged in the Model Registry, which makes it easier to use in production and manage versions.

In [9]:
# We already have our run id from above. Let's use it to register the model

# "model" is the artifact path that was passed to log_model in the run above
result = mlflow.register_model(f"runs:/{run_id}/model", "iris_lr_model")

# Use Case

Now we will get back to our taxi rides use case: 

In [11]:
import pandas as pd
import seaborn as sns
import numpy as np

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

from sklearn.metrics import root_mean_squared_error

from typing import List
from scipy.sparse import csr_matrix
import mlflow

## 0 - Download Data

In [12]:
!pip install gdown --quiet

In [13]:
import gdown
import os

DATA_FOLDER = "../../data"
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"
train_path = f"{DATA_FOLDER}/yellow_tripdata_2021-01.parquet"
test_path = f"{DATA_FOLDER}/yellow_tripdata_2021-02.parquet"
predict_path = f"{DATA_FOLDER}/yellow_tripdata_2021-03.parquet"

# Create the data folder if it does not exist yet
if not os.path.exists(DATA_FOLDER):
    os.makedirs(DATA_FOLDER)
    print(f"New directory {DATA_FOLDER} created!")

# Download each monthly file only if it is not already on disk,
# so that an existing but incomplete folder still gets filled
for path in (train_path, test_path, predict_path):
    if not os.path.exists(path):
        gdown.download(f"{BASE_URL}/{os.path.basename(path)}", path, quiet=False)

## 1 - Load data

In [14]:
def load_data(path: str):
    return pd.read_parquet(path)


train_df = load_data(train_path)
train_df.head()

## 2 - Prepare the data

Let's prepare the data to make it Machine Learning ready. \
For this, we need to clean it, compute the target (what we want to predict), and compute some features to help the model understand the data better.

### 2-1 Compute the target

We want to predict a taxi trip duration in minutes. Let's compute it as a difference between the drop-off time and the pick-up time for each trip.
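
As a quick sanity check of the arithmetic, the sketch below computes the duration of a single trip using only the standard library. The two timestamps are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime

# Illustrative pick-up and drop-off times (not real data)
pickup = datetime(2021, 1, 1, 0, 30, 10)
dropoff = datetime(2021, 1, 1, 0, 36, 40)

# Subtracting two datetimes gives a timedelta; convert it to minutes
duration_minutes = (dropoff - pickup).total_seconds() / 60
print(duration_minutes)  # 6.5
```

The pandas version in the next cell applies the same subtraction and conversion to whole columns at once.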

In [15]:
def compute_target(
    df: pd.DataFrame,
    pickup_column: str = "tpep_pickup_datetime",
    dropoff_column: str = "tpep_dropoff_datetime",
) -> pd.DataFrame:
    df["duration"] = df[dropoff_column] - df[pickup_column]
    df["duration"] = df["duration"].dt.total_seconds() / 60
    return df


train_df = compute_target(train_df)

In [16]:
sns.histplot(train_df["duration"], bins=100)
train_df["duration"].describe()


Let's remove outliers and reduce the scope to trips that last between 1 minute and 1 hour.
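
To make the bounds explicit, here is a small sketch of this kind of filter on made-up durations. It assumes pandas' `Series.between`, which includes both bounds by default, so trips of exactly 1 and exactly 60 minutes are kept.

```python
import pandas as pd

# Made-up durations in minutes, including values on and outside the bounds
trips = pd.DataFrame({"duration": [0.5, 1.0, 12.3, 60.0, 61.0]})

# between() is inclusive on both sides by default
kept = trips[trips["duration"].between(1, 60)]
print(kept["duration"].tolist())  # [1.0, 12.3, 60.0]
```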

In [17]:
MIN_DURATION = 1
MAX_DURATION = 60


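def filter_outliers(
    df: pd.DataFrame, min_duration: int = MIN_DURATION, max_duration: int = MAX_DURATION
) -> pd.DataFrame:
    # Keep trips whose duration lies within the bounds (both included).
    # Copy the result so that later column assignments do not target a view.
    return df[df["duration"].between(min_duration, max_duration)].copy()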


train_df = filter_outliers(train_df)
sns.histplot(train_df["duration"], bins=100)


### 2-2 Prepare features

#### 2-2-1 Categorical features

Most machine learning models cannot consume categorical values directly. They must first be transformed into a numerical representation that the model can use.

In [18]:
# Same columns as the default used by the feature extraction step
CATEGORICAL_COLS = ["PULocationID", "DOLocationID", "passenger_count"]


def encode_categorical_cols(df: pd.DataFrame, categorical_cols: List[str] = None) -> pd.DataFrame:
    if categorical_cols is None:
        categorical_cols = CATEGORICAL_COLS
    # Missing values become -1, then every value is cast to a string category
    df[categorical_cols] = df[categorical_cols].fillna(-1).astype("int")
    df[categorical_cols] = df[categorical_cols].astype("str")
    return df


train_df = encode_categorical_cols(train_df)


In [19]:
def extract_x_y(
    df: pd.DataFrame,
    categorical_cols: List[str] = None,
    dv: DictVectorizer = None,
    with_target: bool = True,
) -> tuple:
    if categorical_cols is None:
        categorical_cols = ["PULocationID", "DOLocationID", "passenger_count"]
    dicts = df[categorical_cols].to_dict(orient="records")

    # Fit a new vectorizer only when none is provided (i.e. on the training data)
    if dv is None:
        dv = DictVectorizer()
        dv.fit(dicts)

    # The target only exists once the duration has been computed
    y = df["duration"].values if with_target else None

    x = dv.transform(dicts)
    return x, y, dv


X_train, y_train, dv = extract_x_y(train_df)
X_train, y_train

## 3 - Train model

We train a basic linear regression model to get a baseline performance.

In [20]:
def train_model(x_train: csr_matrix, y_train: np.ndarray):
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    return lr


model = train_model(X_train, y_train)


## 4 - Evaluate model

We evaluate the model on the train and test data.
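
The metric used in this notebook is the root mean squared error (RMSE), obtained from scikit-learn's `root_mean_squared_error`. As a worked example, the sketch below recomputes it by hand on made-up durations, next to the mean absolute error, to show that a single large miss weighs more heavily in the RMSE. The helper name `rmse` and the numbers are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up durations in minutes: three small errors and one large one
y_true_example = [10.0, 20.0, 30.0, 40.0]
y_pred_example = [11.0, 19.0, 31.0, 47.0]

rmse_value = rmse(y_true_example, y_pred_example)
mae_value = sum(abs(t - p) for t, p in zip(y_true_example, y_pred_example)) / 4

print(f"RMSE: {rmse_value:.3f}")  # sqrt(13)
print(f"MAE:  {mae_value:.3f}")   # 2.5
```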

### 4-1 On train data

In [21]:
def predict_duration(input_data: csr_matrix, model: LinearRegression):
    return model.predict(input_data)


def evaluate_model(y_true: np.ndarray, y_pred: np.ndarray):
    return root_mean_squared_error(y_true, y_pred)


prediction = predict_duration(X_train, model)
train_me = evaluate_model(y_train, prediction)
train_me

### 4-2 On test data

In [22]:
test_df = load_data(test_path)

In [23]:
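test_df = compute_target(test_df)
# Apply the same outlier filter as on the train set so that both errors are comparable
test_df = filter_outliers(test_df)
test_df = encode_categorical_cols(test_df)
X_test, y_test, _ = extract_x_y(test_df, dv=dv)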

In [24]:
y_pred_test = predict_duration(X_test, model)
test_me = evaluate_model(y_test, y_pred_test)
test_me

## 5 - Log Model Parameters to MLflow

Now that all our development functions are built and tested, let's create a training pipeline and log the training parameters, metrics and model to MLflow.

Create a training flow and log all the important parameters, metrics and the model. Think about what is important enough to be logged.

In [25]:
N_ESTIMATORS = 50

In [26]:
import loguru
from sklearn.ensemble import RandomForestRegressor

# Set the experiment name
mlflow.set_experiment("taxi-trip-duration")

# Load data
train_df = load_data(train_path)
test_df = load_data(test_path)
loguru.logger.info("✅ Data loaded")

# Compute target and filter outliers
train_df = compute_target(train_df)
train_df = filter_outliers(train_df)

test_df = compute_target(test_df)
test_df = filter_outliers(test_df)
loguru.logger.info("✅ Train and test data processed")


# Encode categorical columns (must happen before extracting X and y)
train_df = encode_categorical_cols(train_df)
test_df = encode_categorical_cols(test_df)

# Extract X and y: fit the vectorizer on the train set, then reuse it on the test set
X_train, y_train, dv = extract_x_y(train_df)
X_test, y_test, _ = extract_x_y(test_df, dv=dv)
loguru.logger.info("✅ Created X,y")

# Train model
loguru.logger.info("Training Random Forest model with {} estimators", N_ESTIMATORS)
forest = RandomForestRegressor(
    n_estimators=N_ESTIMATORS,
    random_state=42,
    verbose=2,
    max_depth=10,
)
forest.fit(X_train, y_train)
loguru.logger.info("✅ Model trained")
# Evaluate model on train set
forest_train_pred = forest.predict(X_train)
train_rmse = evaluate_model(y_train, forest_train_pred)
loguru.logger.info("Train RMSE: {:.2f}", train_rmse)

# Evaluate model on test set
forest_test_pred = forest.predict(X_test)
test_rmse = evaluate_model(y_test, forest_test_pred)
loguru.logger.info("Test RMSE: {:.2f}", test_rmse)

In [27]:
forest.predict(X_test[:5])

In [31]:
from mlflow.models.signature import infer_signature

# Start a run
with mlflow.start_run() as run:
    run_id = run.info.run_id

    # Set tags for the run
    mlflow.set_tag("model", "Random Forest")
    mlflow.set_tag("model_version", "v1")
    
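    # Log the parameters that were actually used to train the model
    mlflow.log_params(
        {
            "n_estimators": forest.n_estimators,
            "max_depth": forest.max_depth,
            "random_state": forest.random_state,
        }
    )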
    # Log metrics
    mlflow.log_metric("train_rmse", train_rmse)
    mlflow.log_metric("test_rmse", test_rmse)

    # Log your model
    signature = infer_signature(X_train, forest.predict(X_train))
    mlflow.sklearn.log_model(forest, "random_forest", signature=signature)
    loguru.logger.info("✅ Model logged as 'random_forest'")

If the model is satisfactory, we stage it as production using the appropriate version. This will help us retrieve it for predictions. Note that only a registered model version can be staged, so the model must first be registered in the Model Registry.

Create an MLflow client and use the [mlflow documentation](https://mlflow.org/docs/latest/python_api/mlflow.client.html?highlight=transition_model_version_stage#mlflow.client.MlflowClient.transition_model_version_stage) to stage the appropriate model version as being in "production".

In [29]:
client = MlflowClient()
...

## Saving the model locally

In [32]:
import os
import pickle

# Create the models folder if it does not exist yet
if not os.path.exists("../models"):
    os.makedirs("../models")
    print("New directory 'models' created!")
with open("../models/forest_model.pkl", "wb") as f:
    pickle.dump(forest, f)

## Productionizing the model

Now that we have working logic, create a Python package to make it easier to use.  
The package could have the following structure (not everything is mandatory):
```
yellowcab/
    __init__.py
    data.py
    model.py
    predict.py
    train.py
    utils.py
```
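
As one possible starting point, here is a minimal sketch of helpers that could live in `utils.py`. The function names `save_pickle` and `load_pickle` are hypothetical. Bundling the fitted vectorizer with the model is a suggestion rather than something the notebook does: predictions on new data need the same `DictVectorizer` that was fitted on the training set.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_pickle(obj: Any, path: str) -> None:
    """Serialize `obj` to `path`, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("wb") as f:
        pickle.dump(obj, f)


def load_pickle(path: str) -> Any:
    """Load an object previously written by `save_pickle`."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Usage with a stand-in object; in the notebook this could be
# {"model": forest, "dv": dv} so that both are restored together.
bundle = {"model": "stand-in for the trained model", "dv": "stand-in for the vectorizer"}
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "models", "bundle.pkl")
    save_pickle(bundle, bundle_path)
    restored = load_pickle(bundle_path)

print(restored == bundle)  # True
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.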
