# Notebook for experiment tracking
## Store everything remotely
Original MLflow setup:
* Tracking server: hosted in a container in Fargate.
* Backend store: Amazon RDS database.
* Artifact store: S3 bucket.

The interaction with MLflow usually happens in training jobs, when data scientists evaluate their experiments, or in APIs that expose our models. In this setup the tracking server URI is reachable online, so everyone with a username and password can access it. Since the server runs in a Docker container (on Fargate), it can be taken down when no one is using it. No information is lost: the ML models are stored as artifacts in an S3 bucket, and the MLflow Tracking Server stores experiment and run metadata, as well as params, metrics, and tags for runs, in RDS.

To run this notebook, launch the MLflow server with the corresponding GitHub action. The `mlflow server` command is in exp-track-mod-reg-mlflowFargate\Dockerfile.

The URI of the MLflow Tracking Server is listed in the outputs of the CloudFormation stack "mlflow-server".
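
As a rough sketch of how the URI could be read programmatically instead of copied from the console: the helper below scans a CloudFormation `describe_stacks` response for the first output value that looks like an HTTP URL. The function names, the "first URL-like output" rule and the default region (taken from the load balancer hostname used in this notebook) are assumptions. The real output key of the "mlflow-server" stack is not shown here, so the demonstration uses a hand-written response with a placeholder key and URL.

```python
def find_output_url(response, stack_name="mlflow-server"):
    """Return the first http(s) output value of the named stack, or None."""
    for stack in response.get("Stacks", []):
        if stack.get("StackName") != stack_name:
            continue
        for output in stack.get("Outputs", []):
            value = output.get("OutputValue", "")
            if value.startswith(("http://", "https://")):
                return value
    return None


def fetch_tracking_uri(stack_name="mlflow-server", region_name="eu-central-1"):
    """Query CloudFormation; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so find_output_url works without boto3

    cfn = boto3.client("cloudformation", region_name=region_name)
    return find_output_url(cfn.describe_stacks(StackName=stack_name), stack_name)


# Offline demonstration with a hand-written response of the same shape.
# "ExampleKey" and the URL are placeholders, not the real stack outputs.
fake_response = {
    "Stacks": [
        {
            "StackName": "mlflow-server",
            "Outputs": [
                {"OutputKey": "ExampleKey", "OutputValue": "http://example.invalid/"},
            ],
        }
    ]
}
print(find_output_url(fake_response))
```

If the stack exposes the load balancer as a bare hostname without a scheme, this helper returns `None` and the "http://" prefix has to be added by hand.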

--- The first parts of the notebook are taken from 08-my_project/EDA/Version_1.ipynb ---

## Testing mlflow tracking server

In [None]:
Fargate_tracking_uri = "http://mlflo-mlflo-1t73jy0dxw3bw-f2c1f1638afa4ab6.elb.eu-central-1.amazonaws.com/"

In [None]:
import mlflow
mlflow.set_tracking_uri(Fargate_tracking_uri) #NOTE: Important!!! Put the "http://" at the beginning or it will not work properly
#NOTE: Important!!! Set the tracking URI here, because otherwise the artifacts are stored locally.
#NOTE: Important!!! The URI changes every time the tracking server is started on Fargate, so it has to be updated every time
print(f"tracking URI: '{mlflow.get_tracking_uri()}'")
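
Since a missing "http://" is easy to overlook, a small guard could be run before calling `mlflow.set_tracking_uri`. This is only a sketch: `check_tracking_uri` is a hypothetical helper, not part of MLflow, and it is deliberately strict for the remote-server setup used here. MLflow also accepts other kinds of tracking URIs (for example local paths), which this check would reject.

```python
from urllib.parse import urlparse


def check_tracking_uri(uri):
    """Return uri unchanged if it is an http(s) URL with a host, else raise."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an http(s) URL, got {uri!r}")
    return uri


# A complete URL passes through unchanged (placeholder hostname).
checked = check_tracking_uri("http://my-load-balancer.example.invalid/")

# The same hostname without the scheme is rejected.
try:
    check_tracking_uri("my-load-balancer.example.invalid/")
except ValueError as err:
    print(err)
```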

In [2]:
mlflow.get_tracking_uri()

'http://mlflo-mlflo-1t73jy0dxw3bw-f2c1f1638afa4ab6.elb.eu-central-1.amazonaws.com/'

In [3]:
mlflow.set_experiment("my-experimenta")

2022/12/04 18:50:26 INFO mlflow.tracking.fluent: Experiment with name 'my-experimenta' does not exist. Creating a new experiment.


<Experiment: artifact_location='s3://34aghxe-mlflow-server-artifacts-141507290110/1', creation_time=1670176225608, experiment_id='1', last_update_time=1670176225608, lifecycle_stage='active', name='my-experimenta', tags={}>

In [12]:
from mlflow.tracking import MlflowClient
client = MlflowClient()
experiments = client.search_experiments() # returns a list of mlflow.entities.Experiment
experiments

[<Experiment: artifact_location='s3://34aghxe-mlflow-server-artifacts-141507290110/1', creation_time=1670176225608, experiment_id='1', last_update_time=1670176225608, lifecycle_stage='active', name='my-experimenta', tags={}>,
 <Experiment: artifact_location='s3://34aghxe-mlflow-server-artifacts-141507290110/0', creation_time=1670175501589, experiment_id='0', last_update_time=1670175501589, lifecycle_stage='active', name='Default', tags={}>]

In [None]:
mlflow.end_run() #NOTE: Important to end the run after setting the experiment in mlflow 2.0.1. Otherwise it does not work

In [None]:
experiments
run = client.create_run(experiments[0].experiment_id) # returns mlflow.entities.Run
client.log_param(run.info.run_id, "hello", "world")
client.set_terminated(run.info.run_id)

## Import Libs

In [5]:
from datetime import timedelta, datetime, timezone
from pandas._libs.tslibs.timestamps import Timestamp
import pandas as pd
import numpy as np
import mlflow
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

## Get and prepare data

In [None]:
LAST_DAYS = 30 #Number of days to get data from

now = datetime.now()

dfs = []  # empty list which will hold your dataframes
df_temp_2 = pd.DataFrame()
for d in range(1, LAST_DAYS): #NOTE: loads the days from 1 to LAST_DAYS - 1 before today, i.e. LAST_DAYS - 1 daily reports
    date = now - timedelta(days=d)
    date_str = date.strftime("%m-%d-%Y")
    # print(date_str)
    source_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/' + date_str + '.csv'
    df_temp = pd.read_csv(source_url)
    df_temp.rename(columns={"Last_Update": "Date"}, inplace=True) #Rename dataframe column from "Last_Update" to "Date"
    df_temp_2 = df_temp[["Admin2", "Province_State", "Country_Region","Confirmed", "Deaths"]].copy() #NOTE: .copy() avoids the SettingWithCopyWarning on the next line. TODO: consider also other columns in future versions like Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
    df_temp_2["Date"] = date.strftime("%Y-%m-%d") #NOTE: use the day of the report file as the date
    dfs.append(df_temp_2)  # append dataframe to list
    
res = pd.concat(dfs, ignore_index=True)  # concatenate list of dataframes
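
To make the loop bounds concrete, the sketch below builds only the list of report URLs, without downloading anything. `daily_report_urls` is a hypothetical helper, and the fixed `now` value is a stand-in for `datetime.now()` taken from the log timestamps in this notebook.

```python
from datetime import datetime, timedelta

BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)


def daily_report_urls(now, last_days):
    """URLs for the days now-1, now-2, ..., now-(last_days-1)."""
    urls = []
    for d in range(1, last_days):  # same bounds as the download loop
        date = now - timedelta(days=d)
        urls.append(BASE_URL + date.strftime("%m-%d-%Y") + ".csv")
    return urls


urls = daily_report_urls(datetime(2022, 12, 4, 18, 50), 30)
print(len(urls))  # 29 files, not 30
print(urls[0])    # most recent report: 12-03-2022.csv
print(urls[-1])   # oldest report: 11-05-2022.csv
```

With `LAST_DAYS = 30` the loop therefore covers 29 days; `range(1, LAST_DAYS + 1)` would be needed to load a full 30.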


In [7]:
# group by Province_State, Country_Region and Date, and sum Confirmed and Deaths
df = res.groupby(['Province_State','Country_Region','Date']).agg({'Confirmed':'sum', 'Deaths':'sum'})
df.reset_index(inplace=True)
df.rename(columns={"Confirmed": "ConfirmedCases", "Deaths": "Fatalities"}, inplace=True)

In [8]:
## Prepare train(dev)-test set
loc_group = ["Province_State", "Country_Region"]
def preprocess(df):
    df["Date"] = df["Date"].astype("datetime64[ms]")
    for col in loc_group:
        df[col] = df[col].fillna("none") #NOTE: replace all NaN with "none"
    return df
df = preprocess(df)

TARGETS = ["ConfirmedCases", "Fatalities"]
for col in TARGETS:
    df[col] = np.log1p(df[col]) 
    
for col in TARGETS:
    df["prev_{}".format(col)] = df.groupby(loc_group)[col].shift() #NOTE: the prev_ columns hold the same values as the target columns, delayed by one day
    
    
df = df[df["Date"] > df["Date"].min()].copy() #NOTE: removes the first day since it has NaNs in the "prev" columns

TEST_DAYS = 7 #Number of days to test the model
TEST_FIRST = now - timedelta(days=TEST_DAYS)
TEST_FIRST = TEST_FIRST.replace(hour=0, minute=0, second=0, microsecond=0)
TEST_FIRST = Timestamp(TEST_FIRST)

dev_df, test_df = df[df["Date"] < TEST_FIRST].copy(), df[df["Date"] >= TEST_FIRST].copy() #NOTE: the model is tested on the last 7 (TEST_DAYS) days and
                                                                                          # trained on the rest of the loaded period (LAST_DAYS), excluding those last 7 days
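
The resulting split can be checked with plain dates. The sketch below assumes that every daily report in the period was loaded and uses a fixed `now` (taken from the log timestamps in this notebook) instead of `datetime.now()`.

```python
from datetime import datetime, timedelta

LAST_DAYS, TEST_DAYS = 30, 7
now = datetime(2022, 12, 4, 18, 50)  # fixed stand-in for datetime.now()

# Dates covered by the download loop, as midnight timestamps.
midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
dates = sorted(midnight - timedelta(days=d) for d in range(1, LAST_DAYS))

dates = dates[1:]  # the earliest day is dropped (it has no previous-day values)
test_first = midnight - timedelta(days=TEST_DAYS)

dev_dates = [d for d in dates if d < test_first]
test_dates = [d for d in dates if d >= test_first]

print(len(dev_dates), dev_dates[0].date(), dev_dates[-1].date())
print(len(test_dates), test_dates[0].date(), test_dates[-1].date())
```

Under these assumptions the model is trained on 21 days (2022-11-06 to 2022-11-26) and tested on 7 days (2022-11-27 to 2022-12-03), which matches the dates printed during evaluation.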


In [9]:
features = ["prev_{}".format(col) for col in TARGETS]

## Modeling

In [10]:
#TODO: fix the UserWarnings that appear when running this cell

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

def evaluate(df):
    error = 0
    for col in TARGETS:
        error += rmse(df[col].values, df["pred_{}".format(col)].values) #NOTE: checks the error between the predicted columns and the target columns
    return np.round(error/len(TARGETS), 5)


def predict(test_df, first_day, num_days, val=False):
    y_pred = np.clip(model.predict(test_df.loc[test_df["Date"] == first_day][features]), None, 16)#NOTE: predict the targets for the first day and
                                                                                                    #cap (clip) them at max=16
 
    for i, col in enumerate(TARGETS):
        test_df["pred_{}".format(col)] = 0
        test_df.loc[test_df["Date"] == first_day, "pred_{}".format(col)] = y_pred[:, i] #NOTE: store the first-day predictions in the pred_ columns

    if val:
        print(first_day, evaluate(test_df[test_df["Date"] == first_day])) #NOTE: print the date of the first day and the error between the predicted targets and the real targets

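    for d in range(1, num_days): #NOTE: repeat for the remaining days, feeding each day's predictions back in as the next day's features
        y_pred = np.clip(model.predict(pd.DataFrame(y_pred, columns=features)), None, 16) #NOTE: the DataFrame keeps the feature names the model was fitted with, which avoids the scikit-learn feature-name UserWarning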
        date = first_day + timedelta(days=d)

        for i, col in enumerate(TARGETS):
            test_df.loc[test_df["Date"] == date, "pred_{}".format(col)] = y_pred[:, i]

        if val:
            print(date, evaluate(test_df[test_df["Date"] == date])) #NOTE: when we see all the errors we can see that the farther the date from the first day the higher the error
        
    return test_df
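
Because the targets were transformed with `np.log1p`, both the predictions and the clip limit of 16 live in log space. The short sketch below uses the standard-library equivalents `math.log1p` and `math.expm1` to show what that limit means in raw counts; `CLIP_MAX` and the sample count are illustrative names and values only.

```python
import math

CLIP_MAX = 16  # upper bound passed to np.clip(..., None, 16)

# expm1 is the inverse of log1p, so this is the largest count the model can predict.
max_count = math.expm1(CLIP_MAX)
print(round(max_count))  # roughly 8.9 million

# Round trip for an arbitrary count: log1p followed by expm1 restores the value.
count = 12345
transformed = math.log1p(count)
restored = math.expm1(transformed)
print(transformed, restored)
```

The RMSE values reported by `evaluate` are computed on the log-transformed columns, so they are errors in log space, not in numbers of cases.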

In [11]:
#NOTE: despite the label, this prints the artifact URI of the current run, not the tracking URI
print(f"tracking URI: '{mlflow.get_artifact_uri()}'")

tracking URI: 's3://34aghxe-mlflow-server-artifacts-141507290110/1/bd34cf1e96834b99b18158f38e75ae88/artifacts'


In [14]:
mlflow.set_experiment("experiment-covid-3") 

with mlflow.start_run():
    
    degree_poly = 2
    incl_bias = False
    mlflow.log_param('degree_poly', degree_poly)
    mlflow.log_param('include_bias', incl_bias)
    
    model = Pipeline([('poly', PolynomialFeatures(degree=degree_poly, include_bias=incl_bias)),
                  ('linear', LinearRegression())])
    
    model.fit(dev_df[features], dev_df[TARGETS])
    [mean_squared_error(dev_df[TARGETS[i]], model.predict(dev_df[features])[:, i]) for i in range(len(TARGETS))] #NOTE: check the mean_squared_error from the training dataset
    
    test_df = predict(test_df, TEST_FIRST, TEST_DAYS, val=True) #NOTE: makes predictions 
    #for TEST_DAYS number of days ...just to print it on the screen
    
    eval_RMSE = evaluate(test_df) #NOTE: the error of all the predictions
    print("RMSE:", eval_RMSE)
    mlflow.log_metric("evaluated_RMSE", eval_RMSE)

    mlflow.sklearn.log_model(model, artifact_path="models")
    print(f"Artifacts URI: '{mlflow.get_artifact_uri()}'") #returns where the model is stored
    

2022/12/04 18:54:14 INFO mlflow.tracking.fluent: Experiment with name 'experiment-covid-3' does not exist. Creating a new experiment.


2022-11-27 00:00:00 0.02348
2022-11-28 00:00:00 0.09259
2022-11-29 00:00:00 0.0594
2022-11-30 00:00:00 0.04808
2022-12-01 00:00:00 0.24874
2022-12-02 00:00:00 0.24907
2022-12-03 00:00:00 0.25331
RMSE: 0.17062




Artifacts URI: 's3://34aghxe-mlflow-server-artifacts-141507290110/2/0c2df617d52d47a08f3f840daffc19a9/artifacts'


In [15]:
mlflow.end_run()

In the MLflow UI, the run ID for this run is 0c2df617d52d47a08f3f840daffc19a9 (it also appears in the artifacts URI printed above).

In [16]:
#NOTE: here I change the degree_poly to have another experiment
mlflow.set_experiment("experiment-covid-2") 

with mlflow.start_run(): 
    
    degree_poly = 3
    incl_bias = False
    mlflow.log_param('degree_poly', degree_poly)
    mlflow.log_param('include_bias', incl_bias)
    
    model = Pipeline([('poly', PolynomialFeatures(degree=degree_poly, include_bias=incl_bias)),
                  ('linear', LinearRegression())])
    
    model.fit(dev_df[features], dev_df[TARGETS])
    [mean_squared_error(dev_df[TARGETS[i]], model.predict(dev_df[features])[:, i]) for i in range(len(TARGETS))] #NOTE: check the mean_squared_error from the training dataset
    
    test_df = predict(test_df, TEST_FIRST, TEST_DAYS, val=True) #NOTE: makes predictions 
    #for TEST_DAYS number of days ...just to print it on the screen
    
    eval_RMSE = evaluate(test_df) #NOTE: the error of all the predictions
    print("RMSE:", eval_RMSE)
    mlflow.log_metric("evaluated_RMSE", eval_RMSE)

    mlflow.sklearn.log_model(model, artifact_path="models")
    print(f"default artifacts URI: '{mlflow.get_artifact_uri()}'") #returns where the model is stored
    

2022/12/04 18:56:05 INFO mlflow.tracking.fluent: Experiment with name 'experiment-covid-2' does not exist. Creating a new experiment.


2022-11-27 00:00:00 0.02392
2022-11-28 00:00:00 0.09341
2022-11-29 00:00:00 0.06126
2022-11-30 00:00:00 0.05069
2022-12-01 00:00:00 0.25114
2022-12-02 00:00:00 0.25199
2022-12-03 00:00:00 0.25664
RMSE: 0.1727




default artifacts URI: 's3://34aghxe-mlflow-server-artifacts-141507290110/3/6509bec6c96d4f9d8e1b88c0812e1590/artifacts'


In [17]:
mlflow.end_run()

In the MLflow UI, the run ID for this run is 6509bec6c96d4f9d8e1b88c0812e1590 (it also appears in the artifacts URI printed above).

Comparing the two runs in MLflow, the first one (degree_poly = 2) has the lower evaluated RMSE (0.17062 vs. 0.1727), so I will choose that model => RUNID = 0c2df617d52d47a08f3f840daffc19a9
The model selection is covered in more detail in exp-track-mod-reg-mlflowFargate\model-registry.ipynb