This project contains an MLOps scaffold for machine learning projects. MLOps is essential to most projects, from enterprise solutions and research programs all the way down to small-scale or even learning projects. I found myself writing the same boilerplate over and over for my machine learning and data science projects, so I decided to gather the required infrastructure in this repository to make bootstrapping future projects easy. Of course, the project can be used as is or enhanced by anyone.
All dependencies are declared in the `setup.py` file. I strongly recommend installing them inside a virtual environment. To install the dependencies, run:

```
pip install .
```
- `pandas` is used for data manipulation.
- `numpy` is used for numerical operations.
- `matplotlib` is used to visualize the results.
- `scikit-learn` is used for the ready-to-use ML models.
- `scikit-optimize` is used for the Bayesian optimization algorithms.
- `mlflow` is used to handle the MLOps side: experiment tracking, runs, and artifacts.
One major difference between managing conventional software and machine learning code is the data. The data used in a project are not static; they change frequently as we try out different preprocessing methods and techniques in search of, for instance, performance improvements. This means we also need to version our data, which can range from a few KBs to GBs or even TBs. That's why we use DVC, a data version control toolkit that runs on top of git, so be sure to install DVC on your machine. To fully incorporate DVC you will also need remote storage. There are plenty of solutions depending on the project's needs and the resources available to you. For smaller projects I tend to use my Google Drive, because it's really easy to get it up and running in a couple of minutes.
The data are stored in the `data` folder, any data related to MLFlow are stored in the `mlruns` folder, while we manually keep some experiment results in the `results` folder. All three folders are versioned via DVC. The GDrive folder is public and contains some rudimentary data for demonstration purposes.
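For reference, a minimal DVC setup along those lines might look as follows; the Google Drive folder id is a placeholder, and depending on your DVC version you may need the Google Drive extra (`pip install "dvc[gdrive]"`):

```
# initialize DVC inside the existing git repository
dvc init

# register a Google Drive folder as the default remote (placeholder id)
dvc remote add -d storage gdrive://<your-gdrive-folder-id>

# start tracking the versioned folders
dvc add data mlruns results

# commit the generated .dvc files, then upload the data to the remote
git add .
git commit -m "Track data, mlruns and results with DVC"
dvc push
```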
Regarding the core scripts, we have:

- `/core/utilities.py`: contains a utility that reads input parameters through the command line.
- `/core/scale.py`: contains the basic scaling functionalities.
- `/core/experiment.py`: contains the core management of the experiments.
- `/core/train.py`: handles the actual training of the models.
- `/core/evaluate.py`: contains basic model evaluation techniques such as error metrics.
All scripts contain some basic functionality that I frequently use. The code is structured so that enhancements or additions are easy to integrate. The error metrics, as well as some of the interpretability tools, target regression problems; however, they can easily be swapped for the corresponding classification metrics if needed.
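As a rough sketch of what that swap looks like with scikit-learn (illustrative helper names, not the actual contents of `core/evaluate.py`):

```python
from sklearn.metrics import (
    accuracy_score,         # classification
    f1_score,               # classification
    mean_absolute_error,    # regression
    r2_score,               # regression
)

def regression_metrics(y_true, y_pred):
    """Error metrics for regression problems."""
    return {"mae": mean_absolute_error(y_true, y_pred), "r2": r2_score(y_true, y_pred)}

def classification_metrics(y_true, y_pred):
    """The classification counterpart: same shape, different metrics."""
    return {"accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred, average="weighted")}
```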
Hyperparameter tuning can be done in three different ways out of the box, using K-fold cross-validation (a sketch of all three follows the list):
- Grid Search
- Random Search
- Bayesian Optimization
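For reference, a minimal sketch of the three modes using scikit-learn and scikit-optimize; the estimator and parameter grid below are placeholders, not the exact settings of the experiment scripts:

```python
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor
from skopt import BayesSearchCV

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = DecisionTreeRegressor(random_state=42)
params = {"max_depth": [2, 4, 8, 16], "min_samples_leaf": [1, 5, 10]}

# Grid Search: exhaustively evaluates every combination in `params`
grid_search = GridSearchCV(model, params, cv=cv, scoring="neg_mean_absolute_error")

# Random Search: samples a fixed number of combinations from `params`
random_search = RandomizedSearchCV(
    model, params, n_iter=10, cv=cv,
    scoring="neg_mean_absolute_error", random_state=42,
)

# Bayesian Optimization: uses previous results to choose promising candidates
bayes_search = BayesSearchCV(
    model, params, n_iter=10, cv=cv,
    scoring="neg_mean_absolute_error", random_state=42,
)

# all three share the same interface, e.g.:
# grid_search.fit(X_train, y_train); grid_search.best_params_
```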
Regardless of the model to be trained, the core scripts remain the same. To train a new model, create a script similar to `decision_trees_experiment.py`, which I have included to showcase how to run experiments. To run a new experiment, execute:

```
py .\decision_trees_experiment.py --data-file "./data/diabetes.csv" --mode "grid_search" --experiment-name "Test"
```

To include an additional run in the same experiment, execute:

```
py .\decision_trees_experiment.py --data-file "./data/diabetes.csv" --mode "bayesian" --experiment-id {experiment_id}
```
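For orientation, here is a stripped-down sketch of what such a script does, logging a run directly with MLflow; the repository's `core` modules wrap this logic behind their own API, so treat the structure, not the exact calls of `core/experiment.py`, as the takeaway:

```python
import argparse

import mlflow
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

parser = argparse.ArgumentParser()
parser.add_argument("--data-file", required=True)
parser.add_argument("--target", default="target")  # illustrative default
parser.add_argument("--experiment-name", default="Test")
args = parser.parse_args()

# load the data and split it into train/test sets
df = pd.read_csv(args.data_file)
X, y = df.drop(columns=[args.target]), df[args.target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train a model and log parameters/metrics under the given experiment
mlflow.set_experiment(args.experiment_name)
with mlflow.start_run():
    model = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_train, y_train)
    mlflow.log_param("max_depth", 4)
    mlflow.log_metric("mae", mean_absolute_error(y_test, model.predict(X_test)))
```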
The detailed list of parameters can be found below:

- `--data-file {data_file}`: The data file to be used in the training process.
- `--test-size {test_size}`: The test size.
- `--target {target}`: The column name of the target variable.
- `--mode {hyperparameter_tuning_id}`: `'grid_search'` -> Grid Search, `'bayesian'` -> Bayesian, else -> Random.
- `--scale {scale}`: `'standard'` -> Standard Scaler, `'minmax'` -> Min-Max Scaler.
- `--experiment-name {name}`: The name of the experiment to be created.
- `--experiment-tags {tags}`: A dictionary containing any extra information.
- `--experiment-id {id}`: The id of an already existing experiment.
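A hedged sketch of how these flags could be declared with argparse; the actual parsing lives in `core/utilities.py` and may differ, and the defaults below are illustrative only:

```python
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--data-file", required=True, help="Data file used for training.")
parser.add_argument("--test-size", type=float, help="The test size.")
parser.add_argument("--target", help="Column name of the target variable.")
parser.add_argument("--mode", help="'grid_search', 'bayesian', anything else -> random.")
parser.add_argument("--scale", help="'standard' or 'minmax'.")
parser.add_argument("--experiment-name", help="Name of the experiment to be created.")
parser.add_argument("--experiment-tags", type=json.loads, help="Dict with extra information.")
parser.add_argument("--experiment-id", help="Id of an already existing experiment.")
args = parser.parse_args()
```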
The results can be visualized in the MLFlow UI. To do so, run:

```
mlflow ui
```

and open the printed URL (http://127.0.0.1:5000 by default) in your browser.
Additionally, in the `results` folder one can find extra plots such as (a sketch of how they can be produced follows the list):
- Feature importance plot
- Permutation importance plot
- Partial dependence plot
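A rough sketch of producing these with scikit-learn and matplotlib; it assumes `model` is a fitted tree-based estimator and `X_test`/`y_test` is a held-out DataFrame split, none of which are defined here:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

# Feature importance: tree-based models expose feature_importances_ directly
plt.barh(X_test.columns, model.feature_importances_)
plt.title("Feature importance")
plt.savefig("results/feature_importance.png")
plt.close()

# Permutation importance: shuffle each feature and measure the drop in score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
plt.barh(X_test.columns, result.importances_mean)
plt.title("Permutation importance")
plt.savefig("results/permutation_importance.png")
plt.close()

# Partial dependence: marginal effect of selected features on the prediction
PartialDependenceDisplay.from_estimator(model, X_test, features=[0, 1])
plt.savefig("results/partial_dependence.png")
plt.close()
```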
Of course, new evaluation metrics or plots can be added via the `core/evaluate.py` script.