# Notebook for experiment tracking
## Store remote only artifacts (does not work)
MLflow original setup:
* Tracking server: local filesystem.
* Backend store: local sqlite database.
* Artifacts store: s3 bucket.

The Reviewers must be able to see the experiments so the MLflow Tracking Server (creates and manages experiments and runs) will be local. Having it remotely it would mean having an EC2 instance shared with other users which could incurr in unnecessary costs. 

For the same reason the backend will be stored locally.

Since we want the models to be available in the cloud the Artifacts store will be a s3 bucket.

NOTE: I follow video 4.3: https://app.clickup.com/t/2k55fz6 and 04-deployment/web-service-mlflow/random-forest_with_pipeline.ipynb

To run this you need to launch the mlflow server locally by running the following command in your terminal:

`mlflow server --backend-store-uri=sqlite:///mlflow_db.db --default-artifact-root=s3://mlflow-artifacts-remote-jaime/`


--- The first parts of the notebook taken from 08-my_project/EDA/Version_1.ipynb ---

## Import Libs

In [26]:
from datetime import timedelta, datetime, timezone
from pandas._libs.tslibs.timestamps import Timestamp
import pandas as pd
import numpy as np
import mlflow
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

## Get and prepare data

In [27]:
LAST_DAYS = 30 #Number of days to get data from

now = datetime.now()

dfs = []  # empty list which will hold your dataframes
df_temp_2 = pd.DataFrame()
for d in range(1, LAST_DAYS): #NOTE: do the same that has been done for the first day but for the whole period
    date = now - timedelta(days=d)
    date_str = date.strftime("%m-%d-%Y")
    # print(date_str)
    source_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/' + date_str + '.csv'
    df_temp = pd.read_csv(source_url)
    df_temp.rename(columns={"Last_Update": "Date"}, inplace=True) #Renane dataframe column from "Last_Update" to "Date"
    df_temp_2 = df_temp[["Admin2", "Province_State", "Country_Region","Confirmed", "Deaths"]] #TODO: consider also other columns in future versions like Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
    df_temp_2["Date"] = date.strftime("%Y-%m-%d") #TODO: fix this so that no warning comes
    dfs.append(df_temp_2)  # append dataframe to list
    
res = pd.concat(dfs, ignore_index=True)  # concatenate list of dataframes


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user

In [28]:
# group by Country_Region and sum Confirmed and Deaths
df = res.groupby(['Province_State','Country_Region','Date']).agg({'Confirmed':'sum', 'Deaths':'sum'})
df.reset_index(inplace=True)
df.rename(columns={"Confirmed": "ConfirmedCases", "Deaths": "Fatalities"}, inplace=True)

In [29]:
## Prepare train(dev)-test set
loc_group = ["Province_State", "Country_Region"]
def preprocess(df):
    df["Date"] = df["Date"].astype("datetime64[ms]")
    for col in loc_group:
        df[col].fillna("none", inplace=True) #NOTE: replace all NaN with none
    return df
df = preprocess(df)

TARGETS = ["ConfirmedCases", "Fatalities"]
for col in TARGETS:
    df[col] = np.log1p(df[col]) 
    
for col in TARGETS:
    df["prev_{}".format(col)] = df.groupby(loc_group)[col].shift() #NOTE: the prev_ columns basically has the same than the others but delayed one day
    
    
df = df[df["Date"] > df["Date"].min()].copy() #NOTE: removes the first day since it has NaNs in the "prev" columns

TEST_DAYS = 7 #Number of days to test the model
TEST_FIRST = now - timedelta(days=TEST_DAYS)
TEST_FIRST = TEST_FIRST.replace(hour=0, minute=0, second=0, microsecond=0)
TEST_FIRST = Timestamp(TEST_FIRST)

dev_df, test_df = df[df["Date"] < TEST_FIRST].copy(), df[df["Date"] >= TEST_FIRST].copy() #I am testing the model with the predictions of the last 7 (TEST_DAYS) days and
                                                                                          # training it with data from the previous 30 days (LAST_DAYS) excluding the last 7 days (TEST_DAYS)


In [30]:
features = ["prev_{}".format(col) for col in TARGETS]

## Modeling

In [31]:
mlflow.set_tracking_uri("http://127.0.0.1:5000") #NOTE: Important!!!  to set the tracking uri here cause otherwise it stores the artifact locally

In [32]:
#NOTE: since we did not specify a tracking uri by default mlflow will store everything in the local system
print(f"tracking URI: '{mlflow.get_tracking_uri()}'")

tracking URI: 'http://127.0.0.1:5000'


In [33]:
#TODO: fix the UserWarnings that appear when running this cell

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

def evaluate(df):
    error = 0
    for col in TARGETS:
        error += rmse(df[col].values, df["pred_{}".format(col)].values) #NOTE: checks the error between the predicted columns and the target columns
    return np.round(error/len(TARGETS), 5)


def predict(test_df, first_day, num_days, val=False):
    y_pred = np.clip(model.predict(test_df.loc[test_df["Date"] == first_day][features]), None, 16)#NOTE: here he is predicting the targets for the first day and 
                                                                                                    #saturating (clip) them with max=16
 
    for i, col in enumerate(TARGETS):
        test_df["pred_{}".format(col)] = 0
        test_df.loc[test_df["Date"] == first_day, "pred_{}".format(col)] = y_pred[:, i] #NOTE: here he sets the predicted columns

    if val:
        print(first_day, evaluate(test_df[test_df["Date"] == first_day])) #NOTE: print the date of the first day and the error between the predicted targets and the real targets

    for d in range(1, num_days): #NOTE: do the same that has been done for the first day but for the whole period
        y_pred = np.clip(model.predict(y_pred), None, 16)
        date = first_day + timedelta(days=d)

        for i, col in enumerate(TARGETS):
            test_df.loc[test_df["Date"] == date, "pred_{}".format(col)] = y_pred[:, i]

        if val:
            print(date, evaluate(test_df[test_df["Date"] == date])) #NOTE: when we see all the errors we can see that the farther the date from the first day the higher the error
        
    return test_df

In [34]:
mlflow.set_experiment("experiment-covid-2") 

with mlflow.start_run():
    
    degree_poly = 2
    incl_bias = False
    mlflow.log_param('degree_poly', degree_poly)
    mlflow.log_param('include_bias', incl_bias)
    
    model = Pipeline([('poly', PolynomialFeatures(degree=degree_poly, include_bias=incl_bias)),
                  ('linear', LinearRegression())])
    
    model.fit(dev_df[features], dev_df[TARGETS])
    [mean_squared_error(dev_df[TARGETS[i]], model.predict(dev_df[features])[:, i]) for i in range(len(TARGETS))] #NOTE: check the mean_squared_error from the training dataset
    
    test_df = predict(test_df, TEST_FIRST, TEST_DAYS, val=True) #NOTE: makes predictions 
    #for TEST_DAYS number of days ...just to print it on the screen
    
    eval_RMSE = evaluate(test_df) #NOTE: the error of all the predictions
    print("RMSE:", eval_RMSE)
    mlflow.log_metric("evaluated_RMSE", eval_RMSE)

    mlflow.sklearn.log_model(model, artifact_path="models")
    print(f"default artifacts URI: '{mlflow.get_artifact_uri()}'") #returns where the model is stored
    

2022/08/10 12:46:33 INFO mlflow.tracking.fluent: Experiment with name 'experiment-covid-2' does not exist. Creating a new experiment.


2022-08-03 00:00:00 0.02766
2022-08-04 00:00:00 0.04321
2022-08-05 00:00:00 0.0611
2022-08-06 00:00:00 0.07638
2022-08-07 00:00:00 0.09059
2022-08-08 00:00:00 0.10535
2022-08-09 00:00:00 0.30238
RMSE: 0.13264
default artifacts URI: 's3://mlflow-artifacts-remote-jaime/4/16082a31f2be4eadb6f368b4ded2d309/artifacts'


In [35]:
mlflow.list_experiments() 
#NOTE: this list does not represent the structure in the S3 bucket itself. You have to get the RUNID from mlflow ui and then look for that id under the 
# folders in the s3 bucket https://s3.console.aws.amazon.com/s3/buckets/mlflow-artifacts-remote-jaime?region=eu-central-1&tab=objects


[<Experiment: artifact_location='s3://mlflow-artifacts-remote-jaime/0', experiment_id='0', lifecycle_stage='active', name='Default', tags={}>,
 <Experiment: artifact_location='s3://mlflow-artifacts-remote-jaime/1', experiment_id='1', lifecycle_stage='active', name='my-experiment-1', tags={}>,
 <Experiment: artifact_location='s3://mlflow-artifacts-remote-jaime/2', experiment_id='2', lifecycle_stage='active', name='experiment-covid-1', tags={}>,
 <Experiment: artifact_location='s3://mlflow-artifacts-remote-jaime/3', experiment_id='3', lifecycle_stage='active', name='my-cool-experiment', tags={}>,
 <Experiment: artifact_location='s3://mlflow-artifacts-remote-jaime/4', experiment_id='4', lifecycle_stage='active', name='experiment-covid-2', tags={}>]

I see in mlflow ui that the RUNID for this experiment is 16082a31f2be4eadb6f368b4ded2d309

In [36]:
#NOTE: here UI change the degree_poly to have another experiment
mlflow.set_experiment("experiment-covid-2") 

with mlflow.start_run(): 
    
    degree_poly = 3
    incl_bias = False
    mlflow.log_param('degree_poly', degree_poly)
    mlflow.log_param('include_bias', incl_bias)
    
    model = Pipeline([('poly', PolynomialFeatures(degree=degree_poly, include_bias=incl_bias)),
                  ('linear', LinearRegression())])
    
    model.fit(dev_df[features], dev_df[TARGETS])
    [mean_squared_error(dev_df[TARGETS[i]], model.predict(dev_df[features])[:, i]) for i in range(len(TARGETS))] #NOTE: check the mean_squared_error from the training dataset
    
    test_df = predict(test_df, TEST_FIRST, TEST_DAYS, val=True) #NOTE: makes predictions 
    #for TEST_DAYS number of days ...just to print it on the screen
    
    eval_RMSE = evaluate(test_df) #NOTE: the error of all the predictions
    print("RMSE:", eval_RMSE)
    mlflow.log_metric("evaluated_RMSE", eval_RMSE)

    mlflow.sklearn.log_model(model, artifact_path="models")
    print(f"default artifacts URI: '{mlflow.get_artifact_uri()}'") #returns where the model is stored
    

2022-08-03 00:00:00 0.0381
2022-08-04 00:00:00 0.06246
2022-08-05 00:00:00 0.08495
2022-08-06 00:00:00 0.10244
2022-08-07 00:00:00 0.11765
2022-08-08 00:00:00 0.13222
2022-08-09 00:00:00 0.31463
RMSE: 0.14873




default artifacts URI: 's3://mlflow-artifacts-remote-jaime/4/ae2e389613094cf48d62eed43ce7850e/artifacts'


I see in mlflow ui that the RUNID for this experiment is ae2e389613094cf48d62eed43ce7850e

Comparing the experiments in mlflow the first one has less evaluated RSME so I will choose that model => RUNID = 16082a31f2be4eadb6f368b4ded2d309
This will also be more in detail in 08-my_project/exp-track-mod-reg/model-registry.ipynb