## Intro to MLflow
Save experiments locally

### 1) Install mlflow

In [5]:
#!pip install mlflow

### 2) Start the MLflow server (only necessary when working with MLflow locally)
To log experiments in MLflow, **you only need to run the code; the artifacts, parameters and metrics are saved locally.**

However, **to see the results in the web page (user interface), you must run the server.**

**Instructions** to run the MLflow server locally and open the UI:

- Open a console, for example Anaconda Prompt.
- Navigate in the console (`cd`) to the folder where the artifacts, parameters and metrics of the MLflow experiments are saved.
- Run (from the location of the "mlruns" folder, which is the default):
          `mlflow server`
  
- Run v2 (set the path to the MLflow folder explicitly):
        `mlflow server --default-artifact-root <path to the artifacts folder>`

- Either command starts the MLflow server on your local machine at the URL http://localhost:5000.
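
The two commands above can also be assembled and launched from Python, which may be convenient when the server is started from a script. The sketch below is only an illustration under assumptions: the helper names `build_server_command` and `start_server` and their defaults are hypothetical, while `--default-artifact-root`, `--host` and `--port` are existing options of `mlflow server`.

```python
import subprocess


def build_server_command(artifact_root=None, host="127.0.0.1", port=5000):
    """Return the argument list for `mlflow server` (hypothetical helper)."""
    command = ["mlflow", "server", "--host", host, "--port", str(port)]
    if artifact_root is not None:
        # Same as "Run v2": point the server at an explicit artifact folder.
        command += ["--default-artifact-root", str(artifact_root)]
    return command


def start_server(**kwargs):
    """Launch the server in the background and return the process handle."""
    return subprocess.Popen(build_server_command(**kwargs))
```

Calling `server = start_server()` starts the UI in the background, and `server.terminate()` stops it again.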

In [6]:
#######  mlflow server --default-artifact-root <path to store artifacts>


#######  mlflow server

### 3) Import mlflow

In [7]:
import mlflow
!pip show mlflow

Name: mlflow
Version: 2.3.0
Summary: MLflow: A Platform for ML Development and Productionization
Home-page: https://mlflow.org/
Author: Databricks
Author-email: 
License: Apache License 2.0
Location: d:\anaconda\envs\data-science-python-3-10\lib\site-packages
Requires: alembic, click, cloudpickle, databricks-cli, docker, entrypoints, Flask, gitpython, importlib-metadata, Jinja2, markdown, matplotlib, numpy, packaging, pandas, protobuf, pyarrow, pytz, pyyaml, querystring-parser, requests, scikit-learn, scipy, sqlalchemy, sqlparse, waitress
Required-by: 


In [8]:
# import other packages

import warnings
warnings.filterwarnings("ignore")

import os
import pandas as pd
import numpy as np
import pickle

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

### 4) Connect to mlflow
In this step you choose which MLflow server to connect to:
- local
- cloud cluster
- etc.

In [9]:
# connect to mlflow
path_local_artifacts_mlflow = 'mlruns'

mlflow.set_tracking_uri(path_local_artifacts_mlflow)
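
The choice between a local folder and a remote server comes down to the URI passed to `mlflow.set_tracking_uri`. A minimal sketch, assuming the URI is read from the `MLFLOW_TRACKING_URI` environment variable (which MLflow itself also reads) and falls back to the local `mlruns` folder; the helper name `resolve_tracking_uri` is hypothetical:

```python
import os


def resolve_tracking_uri(default="mlruns"):
    """Pick the tracking URI: environment variable first, local folder otherwise."""
    # A running server would be addressed as e.g. "http://localhost:5000";
    # a plain path such as "mlruns" stores everything on the local disk.
    return os.environ.get("MLFLOW_TRACKING_URI") or default


tracking_uri = resolve_tracking_uri()
print("tracking URI:", tracking_uri)
# Then, in the notebook: mlflow.set_tracking_uri(tracking_uri)
```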

### 5) Set the experiment
If the experiment doesn't exist, it will be created automatically.

In [10]:
experiment_name = 'test_local_mlflow'
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='file:///D:/github-mi-repo/registry_experiments/mlflow/mlruns/422972033151819143', creation_time=1703448974509, experiment_id='422972033151819143', last_update_time=1703448974509, lifecycle_stage='active', name='test_local_mlflow', tags={}>

In [11]:
# validate the experiment is created/set
mlflow.get_experiment_by_name(experiment_name)

<Experiment: artifact_location='file:///D:/github-mi-repo/registry_experiments/mlflow/mlruns/422972033151819143', creation_time=1703448974509, experiment_id='422972033151819143', last_update_time=1703448974509, lifecycle_stage='active', name='test_local_mlflow', tags={}>

### 6) Generation of parameters, variables, artifacts to save
- In a real project, all of these items are produced during model training.

- In this example, some of them are created by hand just to serve as examples.

### 6.0) Generate data and train a model

In [12]:
#### generate data ###


# parameters
len_data = 1000
number_columns = 6
data = []
list_variables = ["240FY050.RO02" , "SGM-PI9514", "SSTRIPPING015", "SGM-PI9516" , "SGM-PI9512", "target"]


# seed - replicability
np.random.seed(42)


# generate random data
for column in range(number_columns):
    random_choise = np.random.choice(10) + 1 # amplitude
    data_column = np.random.rand(len_data)
    data_column = random_choise * data_column
    data.append(data_column)
    

# transform into a dataframe
data = pd.DataFrame(data).T
data.columns = list_variables


# divide into train and test
features = [variable for variable in list_variables if variable != 'target']  # a list keeps the column order stable between sessions (a set does not)
X_train, X_test, y_train, y_test = train_test_split(data[features], data['target'], test_size = 0.2, random_state=42)

print('TRAIN')
print('X_train', X_train.shape)
print('y_train', y_train.shape)

print('\nTEST')
print('X_test', X_test.shape)
print('y_test', y_test.shape)

TRAIN
X_train (800, 5)
y_train (800,)

TEST
X_test (200, 5)
y_test (200,)


In [13]:
#### train model ####

#model = LinearRegression()
model = RandomForestRegressor(random_state = 42)
model.fit(X_train, y_train)



#### prediction and evaluation ####

### RMSE
# take the square root explicitly: the `squared` argument is deprecated in recent scikit-learn releases
rmse_train = np.sqrt(mean_squared_error(y_train,
                                        model.predict(X_train)))

rmse_test = np.sqrt(mean_squared_error(y_test,
                                       model.predict(X_test)))


### R2
r2_train = r2_score(y_train,
                   model.predict(X_train))

r2_test = r2_score(y_test,
                   model.predict(X_test))
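
For reference, the two metrics computed above reduce to short formulas: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio between the residual sum of squares and the total sum of squares. The sketch below re-implements them in plain Python for illustration only; the function names `rmse` and `r2` are hypothetical and the notebook itself keeps using scikit-learn.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(rmse(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions give 0.0
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))    # always predicting the mean gives 0.0
```

An R2 close to 0 means the model is no better than always predicting the mean of the target, which is the expected outcome when features and target are independent random numbers, as in this example.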

### 6.1) Generate parameters to save in mlflow

In [14]:
# print the list of tags
print('list of tags', list_variables)

list of tags ['240FY050.RO02', 'SGM-PI9514', 'SSTRIPPING015', 'SGM-PI9516', 'SGM-PI9512', 'target']


In [15]:
# model type
model_type = "RF"
model_type

'RF'

In [16]:
# date range of the training data
start_train = "2020-01-01"
end_train = "2022-12-01"

### 6.2) Generate metrics to save in mlflow

In [13]:
# print the metrics
print('rmse_train: ', rmse_train)
print('rmse_test: ', rmse_test)
print('r2_train: ', r2_train)
print('r2_test: ', r2_test)

rmse_train:  0.5805765256037049
rmse_test:  1.4541631589134607
r2_train:  0.8383333507540055
r2_test:  0.0005404574429910269


### 6.3) Generate artifacts to save in mlflow
- models
- data
- graphs
- etc

In [17]:
# create a pickle artifact with the model (the local file is removed after logging)
model_name = 'model.pkl'
with open(model_name, 'wb') as file:
    pickle.dump(model, file)

In [18]:
# create a CSV artifact with the data (the local file is removed after logging)
data_name = 'data.csv'
data.to_csv(data_name)

### 7) Save the results of the run
- An "experiment" is the larger unit, and it is made up of smaller units called "runs". Each time you train a model (what you might informally call an experiment), you save its results into a "run".

In [24]:
# initialize run
run_name = model_type
mlflow.start_run(run_name = run_name)
run = mlflow.active_run()

In [20]:
# save parameters
mlflow.log_param("Tags", str(list_variables))
mlflow.log_param("Modelo", model_type)
mlflow.log_param("Inicio Train", start_train)
mlflow.log_param("Fin Train", end_train)

'2022-12-01'

In [21]:
# save metrics
mlflow.log_metric("RMSE_train", rmse_train)
mlflow.log_metric("RMSE_test", rmse_test)
mlflow.log_metric("R2_train", r2_train)
mlflow.log_metric("R2_test", r2_test)

In [22]:
# save artifacts. The easiest way is to save the artifact locally and then upload it to mlflow

# save model
mlflow.log_artifact(model_name)
os.remove(model_name)

# save data
mlflow.log_artifact(data_name)
os.remove(data_name)
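
A variation on the same idea, sketched under assumptions: writing the file into a temporary directory removes the need for the manual `os.remove` calls, because the directory is deleted once the upload is done. The helper name `log_pickled_artifact` is hypothetical, and the import sits inside the function only so that the helper can be defined before MLflow is installed.

```python
import os
import pickle
import tempfile


def log_pickled_artifact(obj, filename):
    """Pickle `obj` to a temporary file, log it to the active run, then clean up."""
    import mlflow  # lazy import: see the note above

    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, filename)
        with open(local_path, "wb") as file:
            pickle.dump(obj, file)
        # The artifact keeps the file name; the temporary folder is not recorded.
        mlflow.log_artifact(local_path)
    # Leaving the `with` block deletes the temporary directory and the file in it.
```

With the objects of this notebook, `log_pickled_artifact(model, 'model.pkl')` would replace the pickle, log and remove steps for the model.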

### 8) Finish the run

In [23]:
# end the run
mlflow.end_run()
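
Steps 7 and 8 can also be written with `mlflow.start_run` as a context manager, which ends the run automatically even if an exception is raised in between. The sketch below is one possible arrangement, not the only one: the helper name `log_training_run` and its arguments are hypothetical, while `log_params`, `log_metrics` and `log_artifact` are logging functions provided by MLflow. The import sits inside the function only so that the helper can be defined before MLflow is installed.

```python
def log_training_run(run_name, params, metrics, artifact_paths=()):
    """Log one run (params, metrics, artifacts) and return its run id."""
    import mlflow  # lazy import: see the note above

    # The run is closed when the `with` block exits, so no explicit end_run().
    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(params)    # dict of name -> value
        mlflow.log_metrics(metrics)  # dict of name -> float
        for path in artifact_paths:
            mlflow.log_artifact(path)  # upload an existing local file
        return run.info.run_id
```

With the variables of this notebook, a call could look like `log_training_run(model_type, {"Modelo": model_type}, {"RMSE_test": rmse_test}, [model_name, data_name])`, made before the local files are removed.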

### 9) Extras
MLflow also allows you to:

- Register the generated model artifact (recording the package used to train the model and its requirements), then register the run under the "Models" menu and, from there, deploy it using MLflow services (MLflow running on a cloud cluster).

- Choose, among all the runs, the one with the best metric.
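
Choosing the best run can be done programmatically with `mlflow.search_runs`. A minimal sketch, assuming the metric names logged in this notebook (for example `RMSE_test`, where lower is better); the helper name `best_run` is hypothetical, and the import sits inside the function only so that the helper can be defined before MLflow is installed.

```python
def best_run(experiment_name, metric="RMSE_test", higher_is_better=False):
    """Return the run with the best value of `metric`, or None if there are no runs."""
    import mlflow  # lazy import: see the note above

    direction = "DESC" if higher_is_better else "ASC"
    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=[f"metrics.{metric} {direction}"],  # best run first
        max_results=1,
        output_format="list",  # list of Run objects instead of a DataFrame
    )
    return runs[0] if runs else None
```

For instance, `best_run('test_local_mlflow')` would return a `Run` object whose `info.run_id` and `data.metrics` identify the run with the lowest `RMSE_test`; for a metric such as `R2_test`, pass `higher_is_better=True`.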