Here I am going to run MLflow, or should I say a 'tracking server' (which is what MLflow in essence is), using an EC2 Linux instance as the 'server', production-grade PostgreSQL database storage instead of SQLite, and an S3 bucket to store artifacts.

This kind of environment is useful when you work in a team of Data Scientists on ML projects, where it is natural to connect to a centralized system, outside of any local machine, that everybody can access and that is always up and running.

I wanted to play with it because there is so much to learn about some AWS services I've already encountered in the wild. It is worth mentioning that this is just a 'scenario' and maybe not a suitable option under high-intensity traffic (imagine a team of 20 Data Scientists who each have multiple runs); in other words, it is not scalable. Its sole purpose is to raise the level of ML knowledge.

---

Quick refresher:

The term "server" is more general and refers to a computer or system that provides services or resources to other computers, known as clients. A server can take various forms and serve different purposes depending on the context.

In the case of MLflow's tracking server, it's not just a database. It's a component that provides a service for tracking and managing machine learning experiments. It does involve storage of data, such as experiment metadata, parameters, metrics, and artifacts, but it also includes functionality for organizing, querying, and serving this information.

---

## Why PostgreSQL instead of SQLite?

Here's what you're storing in the SQLite database when you use MLflow with SQLite as the backend:

**Experiment Metadata:** Information about each experiment, such as the experiment name, creation time, and last update time.

**Run Metadata:** Details about each run within an experiment, including the run ID, start time, end time, source code version, and any tags associated with the run.

**Parameters:** The values of parameters used in each run. These could be hyperparameters or any other parameters you log during your experiment.

**Metrics:** The values of metrics that you log during the execution of your machine learning code.

**Tags:** Any additional tags you attach to experiments or runs for better organization or categorization.

**Artifacts:** **References or links to artifacts (files) produced during the run, such as model files, plots, or other output files.**


The SQLite database serves as a lightweight, file-based database that stores this metadata. It provides a simple and self-contained solution for small to medium-scale projects. Keep in mind that for larger-scale or production-grade use cases, you might consider using other databases, such as MySQL or PostgreSQL, as the backend for MLflow's tracking server.
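To make the list above concrete, here is a minimal sketch of which MLflow call ends up in which category. It assumes MLflow is installed and uses a local `sqlite:///mlflow.db` file as the backend; the function name `log_demo_run`, the experiment name, the parameter, the metric value, the tag, and `note.txt` are all made up for illustration.

```python
import tempfile
from pathlib import Path


def log_demo_run(tracking_uri: str = "sqlite:///mlflow.db") -> str:
    """Log one toy run and return its run ID (every name and value here is made up)."""
    import mlflow  # imported lazily so this file loads even where MLflow is not installed

    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("sqlite-demo")        # experiment metadata

    with mlflow.start_run() as run:             # run metadata (run ID, start and end time)
        mlflow.log_param("max_depth", 5)        # parameters
        mlflow.log_metric("auc", 0.81)          # metrics (illustrative number, not a real result)
        mlflow.set_tag("stage", "demo")         # tags

        with tempfile.TemporaryDirectory() as tmp:
            note = Path(tmp) / "note.txt"
            note.write_text("hello from a demo run\n")
            # The file is copied to the artifact store; the database only
            # records where the run's artifacts live.
            mlflow.log_artifact(str(note))

    return run.info.run_id
```

Calling `log_demo_run()` should create `mlflow.db` in the working directory. With no server in front, the artifact file itself typically lands in a local `mlruns` folder, not in the database.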

---

Execute this on the EC2 instance:

`sudo yum update`

If the `pip3` command is missing on the Amazon Linux EC2 instance, do this first:

`sudo yum -y install python-pip`

`pip3 install mlflow boto3 psycopg2-binary`

`aws configure`


Be **very** careful when filling in the placeholders of this command:

`mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWORD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME`

`DB_ENDPOINT` -> Found under 'Connectivity and security' for your database in the `AWS RDS` console
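One way to avoid mistakes here is to let Python assemble the command. This is only a sketch: `backend_store_uri` and `mlflow_server_command` are helper names I made up, and every value in the usage is a placeholder. Percent-encoding the password matters when it contains characters like `@` or `:`, which would otherwise be read as URI delimiters.

```python
import shlex
from urllib.parse import quote


def backend_store_uri(user: str, password: str, endpoint: str,
                      db_name: str, port: int = 5432) -> str:
    """Build the PostgreSQL URI, percent-encoding the user name and password."""
    return (f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{endpoint}:{port}/{db_name}")


def mlflow_server_command(uri: str, bucket: str,
                          host: str = "0.0.0.0", port: int = 5000) -> str:
    """Return the shell command as one string, quoted where the shell needs it."""
    args = ["mlflow", "server", "-h", host, "-p", str(port),
            "--backend-store-uri", uri,
            "--default-artifact-root", f"s3://{bucket}"]
    return shlex.join(args)


# Placeholder values only; substitute your own RDS and S3 details.
uri = backend_store_uri("DB_USER", "p@ss:word", "DB_ENDPOINT", "DB_NAME")
print(mlflow_server_command(uri, "S3_BUCKET_NAME"))
```

Printing the command first lets you inspect it before pasting it into the EC2 terminal.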



In [1]:
# Run preprocessing.py so that the names it defines are available in this notebook
%run preprocessing.py

In [2]:
import mlflow
import pickle
import xgboost as xgb
import os

Before doing the next step, you have to run this (locally, in the terminal):

`aws configure --profile <name_of_your_profile>`

**Note: It will ask you for the AWS Access Key ID, AWS Secret Access Key, and Default region name.
When it asks for the Default output format, just press Enter. If you forgot those credentials, create a new access key. :)**

Then quickly check that the profile was created:

`cat ~/.aws/config`
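If you prefer to check from Python instead of reading the file by eye, here is a small sketch. It relies on the convention that named profiles appear in `~/.aws/config` as `[profile NAME]` sections, while the default profile is plain `[default]`; `has_profile` is a helper name I made up.

```python
import configparser


def has_profile(config_text: str, profile: str) -> bool:
    """Return True if the AWS config text contains a section for the given profile."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Named profiles are stored as "[profile NAME]"; the default one is "[default]".
    section = "default" if profile == "default" else f"profile {profile}"
    return parser.has_section(section)


# Example config text with a made-up region, shaped like ~/.aws/config.
sample = """
[default]
region = eu-north-1

[profile practise]
region = eu-north-1
"""
print(has_profile(sample, "practise"))
```

To run it against the real file, pass in `Path("~/.aws/config").expanduser().read_text()` (with `from pathlib import Path`).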

The next command sets the `AWS_PROFILE` environment variable to the name of that profile:

In [3]:
os.environ['AWS_PROFILE'] = 'practise'

In [4]:
TRACKING_SERVER_HOST = 'ec2-16-171-154-38.eu-north-1.compute.amazonaws.com' #Not active anymore :)

mlflow.set_tracking_uri(f"http://{TRACKING_SERVER_HOST}:5000")

In [5]:
print(f"tracking URI: '{mlflow.get_tracking_uri()}'")

tracking URI: 'http://ec2-16-171-154-38.eu-north-1.compute.amazonaws.com:5000'
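Before logging anything, it can be handy to check that the server actually answers. Below is a minimal sketch using only the standard library. It assumes the tracking server exposes a `/health` endpoint (recent MLflow versions do, as far as I know); `health_url` and `server_is_up` are hypothetical helper names.

```python
from urllib.error import URLError
from urllib.request import urlopen


def health_url(host: str, port: int = 5000) -> str:
    """Build the URL of the (assumed) health endpoint of the tracking server."""
    return f"http://{host}:{port}/health"


def server_is_up(host: str, port: int = 5000, timeout: float = 5.0) -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urlopen(health_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

If `server_is_up(TRACKING_SERVER_HOST)` returns `False`, the usual suspects are a stopped EC2 instance, a security group that does not allow inbound traffic on port 5000, or an `mlflow server` process that is no longer running.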


In [6]:
mlflow.search_experiments()

[<Experiment: artifact_location='s3://mlops-bucket-v3/0', creation_time=1702927385651, experiment_id='0', last_update_time=1702927385651, lifecycle_stage='active', name='Default', tags={}>]

In [7]:
mlflow.set_experiment("mlflow-aws")

2023/12/18 20:26:15 INFO mlflow.tracking.fluent: Experiment with name 'mlflow-aws' does not exist. Creating a new experiment.


<Experiment: artifact_location='s3://mlops-bucket-v3/1', creation_time=1702927575545, experiment_id='1', last_update_time=1702927575545, lifecycle_stage='active', name='mlflow-aws', tags={}>
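Notice the pattern in `artifact_location`: it is the default artifact root passed to `mlflow server` plus the experiment ID. As far as I understand MLflow's default layout, each run then gets its own `<run_id>/artifacts` folder underneath. The sketch below only mirrors that string layout and does not call MLflow; the function names, the run ID, and the file path in the usage are made up.

```python
def experiment_artifact_location(artifact_root: str, experiment_id: str) -> str:
    """Artifact root plus experiment ID, e.g. s3://bucket/1."""
    return f"{artifact_root.rstrip('/')}/{experiment_id}"


def run_artifact_uri(artifact_root: str, experiment_id: str,
                     run_id: str, artifact_path: str = "") -> str:
    """Assumed default layout: <root>/<experiment_id>/<run_id>/artifacts/<path>."""
    base = f"{experiment_artifact_location(artifact_root, experiment_id)}/{run_id}/artifacts"
    return f"{base}/{artifact_path}" if artifact_path else base


# Hypothetical run ID and file path, only to show the shape of the result.
print(run_artifact_uri("s3://mlops-bucket-v3", "1", "abc123", "plots/roc.png"))
```

Knowing this layout makes it easier to find a run's files when browsing the bucket in the S3 console.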

In [8]:
best_params = {'learning_rate': 0.057786841452234214,
               'max_depth': 5.0,
               'min_child_weight': 11.353071298640767,
               'reg_alpha': 0.008039345251325283,
               'reg_lambda': 0.003694981097974786}

In [9]:
with mlflow.start_run():

    # Names not defined in this notebook (RANDOM_STATE, dv, roc_auc_score,
    # the data splits) are assumed to come from preprocessing.py.
    booster = xgb.XGBClassifier(
        max_depth=int(best_params['max_depth']),
        learning_rate=best_params['learning_rate'],
        reg_alpha=best_params['reg_alpha'],
        reg_lambda=best_params['reg_lambda'],
        min_child_weight=best_params['min_child_weight'],
        objective='binary:logistic',
        eval_metric='auc',
        seed=RANDOM_STATE,
        n_estimators=1000,
        early_stopping_rounds=50
    )

    # Pickle the DictVectorizer so it can be logged as an artifact
    with open("./models/preprocessor.bin", "wb") as f_out:
        pickle.dump(dv, f_out)

    # Enable autologging before fitting so the training is captured
    mlflow.autolog()

    booster.fit(X_train, y_train,
                eval_set=[(X_valid, y_val)])

    # Validation ROC AUC from the predicted probability of the positive class
    y_pred = booster.predict_proba(X_valid)[:, 1]
    roc_auc = roc_auc_score(y_val, y_pred)

    # Log the DictVectorizer as an artifact
    mlflow.log_artifact("./models/preprocessor.bin", artifact_path="preprocessors")

2023/12/18 20:26:50 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/12/18 20:26:50 INFO mlflow.tracking.fluent: Autologging successfully enabled for xgboost.


[0]	validation_0-auc:0.76757
[1]	validation_0-auc:0.77520
[2]	validation_0-auc:0.79604
[3]	validation_0-auc:0.79677
[4]	validation_0-auc:0.80054
[5]	validation_0-auc:0.80208
[6]	validation_0-auc:0.80581
[7]	validation_0-auc:0.80793
[8]	validation_0-auc:0.80848
[9]	validation_0-auc:0.80954
[10]	validation_0-auc:0.81088
[11]	validation_0-auc:0.81302
[12]	validation_0-auc:0.81364
[13]	validation_0-auc:0.81375
[14]	validation_0-auc:0.81405
[15]	validation_0-auc:0.81445
[16]	validation_0-auc:0.81390
[17]	validation_0-auc:0.81454
[18]	validation_0-auc:0.81422
[19]	validation_0-auc:0.81418
[20]	validation_0-auc:0.81548
[21]	validation_0-auc:0.81538
[22]	validation_0-auc:0.81625
[23]	validation_0-auc:0.81737
[24]	validation_0-auc:0.81788
[25]	validation_0-auc:0.81759
[26]	validation_0-auc:0.81806
[27]	validation_0-auc:0.81847
[28]	validation_0-auc:0.81884
[29]	validation_0-auc:0.81919
[30]	validation_0-auc:0.81903
[31]	validation_0-auc:0.81933
[32]	validation_0-auc:0.81904
[33]	validation_0-au

