In [2]:
import warnings
warnings.filterwarnings("ignore")

# Holistic AI Library and MLflow Integration - Reducing Bias and Improving Reproducibility in Machine Learning Deployments

One of the greatest challenges in deploying artificial intelligence (AI)-based models for decision-making is that they may reproduce discriminatory biases present in the training data and pass them on to the model's outcomes. Bias mitigation tools are therefore needed to monitor and control experiments conducted with models, especially for models deployed in production.

Other significant challenges associated with the implementation of AI models include management, monitoring, and result reproducibility. [MLflow](https://mlflow.org/) is an open-source platform for managing the life cycle of AI models. It enables data scientists to organize and manage experiments, track metrics, store artifacts, and manage AI models, in addition to providing production monitoring.

To aid in the task of conducting reproducible experiments while mitigating discriminatory biases, this tutorial introduces an integration between MLflow and the Holistic AI library. Through this integration, we aim to facilitate a more responsible implementation of AI.

### MLflow Environment Setup

In [6]:
!pip install -q mlflow


In [7]:
import mlflow

# set the tracking server to be localhost
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

To start the MLflow server, run the following command in the terminal:

```bash
mlflow server --host 127.0.0.1 --port 8080
```

### Data Preprocessing

The dataset that we will use is the "Adult" dataset from the UCI Machine Learning Repository. It is a publicly available dataset that contains information such as the age, education, marital status, race, and gender of individuals from the United States. The objective is to predict whether an individual's income is above or below $50K per year.

Source: [Holistic AI Datasets](https://holistic-ai.readthedocs.io/en/latest/datasets.html)

In [8]:
# load a dataset
from holisticai.datasets import load_dataset

dataset = load_dataset('adult', protected_attribute='sex')
dataset

In [9]:
# print the shape of X and y
print('Data shape:', dataset['X'].shape, dataset['y'].shape)

Data shape: (45222, 82) (45222,)


In this step, we split the dataset into training and test sets. In this case, it is important to ensure that the groups are represented in the same proportion in both sets.

In [10]:
from sklearn.model_selection import train_test_split

train_test = dataset.train_test_split(test_size=0.2, random_state=42)
train_test
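The idea of keeping group proportions equal across splits can be sketched without any library. The snippet below is a minimal illustration on a toy list of group labels. `stratified_split` and `share` are hypothetical helpers written only for this example; they are not part of HolisticAI or scikit-learn, and the tutorial itself relies on `dataset.train_test_split`.

```python
import random


def stratified_split(groups, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each group keeps roughly the same share in both splits."""
    rng = random.Random(seed)

    # collect the row indices that belong to each group value
    by_group = {}
    for idx, group in enumerate(groups):
        by_group.setdefault(group, []).append(idx)

    train_idx, test_idx = [], []
    for members in by_group.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def share(groups, indices, value):
    """Fraction of the selected rows whose group equals `value`."""
    return sum(1 for i in indices if groups[i] == value) / len(indices)


# toy data: 30 members of group "a" and 70 members of group "b"
groups = ["a"] * 30 + ["b"] * 70
train_idx, test_idx = stratified_split(groups)

# both shares are 0.3 for this toy input
print(share(groups, train_idx, "a"), share(groups, test_idx, "a"))
```

A similar check on the real splits (comparing the mean of `group_a` in the training and test sets) would show whether the proportions match in practice.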

### Training the models

In the next code snippet, we explore bias mitigation techniques for machine learning models using the HolisticAI library. The code shows how to use several mitigators (reweighing, learning fair representation, correlation remover, and disparate impact remover) to address bias in a Logistic Regression model. By passing the group membership of each sample to HolisticAI's pipeline, the code accounts for the different demographic groups during both training and evaluation.

In [11]:
from holisticai.bias import mitigation
import holisticai.bias.metrics as bias_metrics
from holisticai.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# data split
train = train_test['train']
test = train_test['test']

# define the list of mitigators
mitigators = {
    "reweighing": mitigation.Reweighing(),
    "learning_fair_representation": mitigation.LearningFairRepresentation(),
    "correlation_remover": mitigation.CorrelationRemover(),
    "disparate_impact_remover": mitigation.DisparateImpactRemover(),
    "no_mitigator": None
}

# fit parameters
fit_params = {
    "bm__group_a": train['group_a'], 
    "bm__group_b": train['group_b']
}

# predict parameters
predict_params = {
    "bm__group_a": test['group_a'], 
    "bm__group_b": test['group_b']
}

# define model's parameters
model_params = {
    "solver": "lbfgs", 
    "max_iter": 100, 
    "multi_class": "auto", 
    "random_state": 42
}
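To give some intuition for the first mitigator in the list, the classic reweighing scheme assigns each training sample the weight P(group) · P(label) / P(group, label), so that group membership and label look statistically independent after weighting. The snippet below sketches that general technique on toy data. It is not HolisticAI's implementation of `Reweighing`, and `reweighing_weights` is a hypothetical helper introduced for this illustration.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight each sample by P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# toy data: group "a" receives the positive label more often than group "b"
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
for g, y, w in zip(groups, labels, weights):
    print(g, y, round(w, 3))
```

Under-represented combinations (here, group "a" with label 0 and group "b" with label 1) receive weights above 1, and over-represented ones receive weights below 1.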

### MLflow tracking

This code snippet defines a function, `train_model`, for training and evaluating machine learning models with MLflow tracking. The function lets users add bias mitigation to the training pipeline through its `mitigator` parameter. Using scikit-learn's Logistic Regression as the base model and the HolisticAI pipeline, the code wraps the trained pipeline in a custom class, `MyModel`, so that it can be logged through MLflow's `pyfunc` interface.

The function predicts on the test set and then computes and logs three metrics: accuracy, disparate impact, and statistical parity. It also logs the model parameters, a descriptive tag, and the trained model itself, providing a transparent record of the model's performance and bias characteristics. This approach supports reproducibility and later analysis of each run.

In [12]:
# define the model as a class
class MyModel(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model
    
    def predict(self, ctx, model_input, params=None):
        pred_params = {
            "bm__group_a": model_input['group_a'], 
            "bm__group_b": model_input['group_b']
        }
        # drop the group columns so that only the features reach the pipeline
        xtest = model_input.drop(['group_a', 'group_b'], axis=1)
        return self.model.predict(xtest, **pred_params)

In [16]:
# a function to train the model with pipeline and save logs
def train_model(name, mitigator=None):
    with mlflow.start_run(run_name=name):

        # define the pipeline steps
        steps = [('bm_preprocessing', mitigator)] if mitigator is not None else []
        steps.append(('model', LogisticRegression(**model_params)))

        # fit the pipeline and predict on the test set
        pipeline = Pipeline(steps=steps)
        pipeline.fit(train['X'], train['y'], **fit_params)
        y_pred = pipeline.predict(test['X'], **predict_params)

        # calculate metrics
        accuracy = accuracy_score(test['y'], y_pred)
        disp_impact = bias_metrics.disparate_impact(test['group_a'], test['group_b'], y_pred)
        stat_parity = bias_metrics.statistical_parity(test['group_a'], test['group_b'], y_pred)

        # log the metrics
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("disparate_impact", disp_impact)
        mlflow.log_metric("statistical_parity", stat_parity)

        # log the model params
        mlflow.log_params(model_params)

        # set a tag that we can use to remind ourselves what this run was for
        mlflow.set_tag("Training Info", f"Basic LR model with {name}")

        # log the model
        mlflow.pyfunc.log_model(
            python_model=MyModel(pipeline),
            artifact_path=f"model_{name}",
        )
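The two bias metrics logged in `train_model` are commonly defined in terms of each group's success rate, that is, the fraction of positive predictions within the group: statistical parity as the difference between the two rates and disparate impact as their ratio. The sketch below works under that assumption with plain Python lists. The `*_sketch` functions are hypothetical stand-ins, and HolisticAI's own `bias_metrics` functions may handle edge cases (such as empty groups) differently.

```python
def success_rate(group_mask, y_pred):
    """Fraction of positive predictions among the members of one group."""
    selected = [pred for member, pred in zip(group_mask, y_pred) if member]
    return sum(selected) / len(selected)


def statistical_parity_sketch(group_a, group_b, y_pred):
    """Difference in success rates; 0 means both groups are treated alike."""
    return success_rate(group_a, y_pred) - success_rate(group_b, y_pred)


def disparate_impact_sketch(group_a, group_b, y_pred):
    """Ratio of success rates; 1 means both groups are treated alike."""
    return success_rate(group_a, y_pred) / success_rate(group_b, y_pred)


# toy data: 4 members per group, boolean membership masks
group_a = [True, True, True, True, False, False, False, False]
group_b = [not member for member in group_a]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]

print(statistical_parity_sketch(group_a, group_b, y_pred))  # -0.25
print(disparate_impact_sketch(group_a, group_b, y_pred))    # 0.5
```

In this toy case group_a is selected 25% of the time and group_b 50% of the time, so both metrics indicate that group_a is disadvantaged.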

### Training the models and tracking the experiments

To save the results for each model and mitigator, we define a simple loop that iterates over the dictionary of mitigators and, for each one, calls the `train_model` function with the corresponding name and mitigator.

In [None]:
# train the model for each mitigator and save the logs
for name, mitigator in mitigators.items():
    train_model(name, mitigator)
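Once all runs are logged, a natural next step is to compare them. One possible heuristic, offered here as an assumption rather than something this tutorial prescribes, is to rank runs by how close their disparate impact is to 1 while requiring a minimum accuracy. The helper `rank_runs` is hypothetical, and the metric values below are made-up placeholders, not results from these experiments.

```python
def rank_runs(runs, min_accuracy=0.0):
    """Return run names ordered by |1 - disparate_impact|, keeping only runs above min_accuracy."""
    eligible = {
        name: metrics
        for name, metrics in runs.items()
        if metrics["accuracy"] >= min_accuracy
    }
    return sorted(eligible, key=lambda name: abs(1.0 - eligible[name]["disparate_impact"]))


# placeholder values for illustration only
placeholder_runs = {
    "run_x": {"accuracy": 0.80, "disparate_impact": 0.40},
    "run_y": {"accuracy": 0.78, "disparate_impact": 0.90},
    "run_z": {"accuracy": 0.60, "disparate_impact": 1.00},
}

ranking = rank_runs(placeholder_runs, min_accuracy=0.75)
print(ranking)  # ['run_y', 'run_x']
```

The same comparison can be done visually in the MLflow UI, where the logged metrics of each run appear side by side.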

### Load bias-mitigated models and make predictions

In the next code snippet, we turn to the practical aspect of deploying and using a machine learning model that has been logged and saved with MLflow. The process begins with loading a previously saved model, specifically the one trained with the 'learning_fair_representation' mitigator, which is retrieved using its unique run identifier. 

The model is then loaded as a PyFuncModel, making it compatible with MLflow's PyFunc API. Subsequently, predictions are made on a Pandas DataFrame, simulating real-world input data for the model. The DataFrame includes features from the test set along with corresponding group information. The predictions are then printed, showcasing how to seamlessly apply a previously trained model to new data. 

This code snippet highlights the ease of model deployment and prediction using MLflow, demonstrating the practical utility of the platform in the machine learning development lifecycle.

In [None]:
logged_model = 'runs:/dede439c576a4d079f562e3d9ad7bb52/model_learning_fair_representation'

# load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)

# predict on a Pandas DataFrame.
input_data = test['X'].copy()
input_data["group_a"] = test['group_a']
input_data["group_b"] = test['group_b']

predictions = loaded_model.predict(input_data)
print(predictions)
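The run identifier in the URI above belongs to one specific run on the author's tracking server, so it will differ in any other environment. MLflow model URIs of this kind follow the pattern `runs:/<run_id>/<artifact_path>`, and `train_model` logs each model under `model_{name}`. A small helper can therefore build the URI for any mitigator; `model_uri` and the placeholder run ID below are hypothetical and only illustrate the string format.

```python
def model_uri(run_id, mitigator_name):
    """Build the MLflow URI for a model logged by train_model under 'model_<name>'."""
    return f"runs:/{run_id}/model_{mitigator_name}"


# replace the placeholder with a run ID copied from your own MLflow UI
uri = model_uri("your-run-id", "learning_fair_representation")
print(uri)  # runs:/your-run-id/model_learning_fair_representation
```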

### Deploy the model with Holistic AI, MLflow and FastAPI

In [26]:
!pip install -q fastapi


The same model can also be served through another framework. Here, a FastAPI application is set up to serve predictions from a machine learning model trained and logged using MLflow. The script begins by configuring the MLflow tracking URI and loading the pre-trained model with the 'learning_fair_representation' mitigator. FastAPI is then initialized, and a '/predict' route is defined to handle HTTP POST requests for making predictions. 

Input data is expected to conform to a specific format, validated using a Pydantic model named `InputData`. The route's function processes incoming data, converting it into a Pandas DataFrame, and generates predictions using the loaded MLflow model. Predictions are returned as a JSON response, and exception handling is implemented to manage potential errors, providing informative error messages with appropriate HTTP status codes. 

Create an _app.py_ file with the code below:

In [None]:
from fastapi import FastAPI, HTTPException
from fastapi.encoders import jsonable_encoder
from pydantic import BaseModel
from typing import List

import pandas as pd
import mlflow.pyfunc
import mlflow

# set the tracking server to be localhost
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# define the path to the MLflow model
logged_model = 'runs:/dede439c576a4d079f562e3d9ad7bb52/model_learning_fair_representation'

# load the MLflow model using mlflow.pyfunc.load_model()
loaded_model = mlflow.pyfunc.load_model(logged_model)

app = FastAPI()  # Initialize a FastAPI application

# define a Pydantic model for input data validation
class InputData(BaseModel):
    data: List[List[float]]
    columns: List[str]

# define a route and function for making predictions
@app.post('/predict')  # Define a route '/predict' that accepts HTTP POST requests
async def predict(input_data: InputData):
    try:
        input_df = pd.DataFrame(jsonable_encoder(input_data.data), columns=input_data.columns)

        # make predictions using the loaded model
        predictions = loaded_model.predict(input_df)  # Use the model to make predictions on input data

        # return predictions as a JSON response
        return {"predictions": predictions.tolist()}  # Convert predictions to a JSON response

    except Exception as e:
        # Handle exceptions (e.g., invalid input or model errors) and return an error message as a JSON response
        raise HTTPException(status_code=400, detail=str(e))

To run the application, use the following command in the terminal:

```bash
uvicorn app:app --host 127.0.0.1 --port 8000
```

Once the application is running, the API documentation is available at http://127.0.0.1:8000/docs.

This code demonstrates the integration of FastAPI and MLflow with the Holistic AI library, creating a platform for deploying bias-mitigated machine learning models with a focus on ease of use and real-time prediction.

#### Make Predictions with FastAPI

Now you can make predictions through the FastAPI application. First, create a client script from the terminal:

```bash
touch request.py
```

Next, import the required libraries, including `requests` for making HTTP requests and `json` for parsing the serialized data.

The client script interacts with the FastAPI application that hosts the machine learning model. First, the input data is prepared with the same structure used during model training and testing, which includes adding the 'group_a' and 'group_b' columns to the DataFrame.

The input data is then serialized with pandas' `to_json` method and parsed back into a dictionary. Subsequently, a POST request carrying that payload is sent to the local FastAPI server ('http://127.0.0.1:8000/predict').

The response from the server is captured, and predictions are extracted from the returned JSON content. This code provides a practical example of how to interact with a deployed machine learning model using client-side scripting.
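The shape of the request body explains why it validates against `InputData`. With `orient='split'`, pandas emits an object with `columns`, `index`, and `data` keys; the Pydantic model declares only `data` and `columns`, and Pydantic ignores unknown fields such as `index` by default. The sketch below rebuilds such a payload with the standard library only. `to_split_payload`, the column names, and the values are hypothetical, and the positional index is a simplification of the row labels a real DataFrame would carry.

```python
import json


def to_split_payload(rows, columns):
    """Build a dict shaped like the output of DataFrame.to_json(orient='split')."""
    return {
        "columns": list(columns),
        "index": list(range(len(rows))),  # simplified positional row labels
        "data": [list(row) for row in rows],
    }


# hypothetical feature names and values, two rows for illustration
columns = ["feature_0", "feature_1", "group_a", "group_b"]
rows = [
    [0.5, 1.0, 1.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
]

payload = to_split_payload(rows, columns)
body = json.dumps(payload)
print(body)
```

Each inner list of `data` must have one value per entry in `columns`, since the server passes both to the `pd.DataFrame` constructor.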


In [28]:
import json 
import requests

# lets use the same input data as before
input_data = test['X'].copy()
input_data["group_a"] = test['group_a']
input_data["group_b"] = test['group_b']

# convert input_data to JSON format
json_input_data = input_data.to_json(orient='split')
data = json.loads(json_input_data)

# make a POST request to the local FastAPI server
response = requests.post('http://127.0.0.1:8000/predict', json=data)

# get predictions from the JSON response of the FastAPI server
predictions = response.json()
print(predictions)

{'predictions': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... (output truncated)
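Because the server converts any failure into an `HTTPException` with status 400, a client may want to check the status code before reading the predictions. FastAPI returns such errors as JSON with a `detail` field. The sketch below operates on a status code and an already decoded body so that it runs without a server; `parse_prediction_response` is a hypothetical helper, not part of `requests` or FastAPI.

```python
def parse_prediction_response(status_code, body):
    """Return the predictions list, or raise if the server reported an error."""
    if status_code != 200:
        # FastAPI puts the HTTPException message under the 'detail' key
        raise RuntimeError(f"Prediction request failed ({status_code}): {body.get('detail')}")
    return body["predictions"]


# successful response
print(parse_prediction_response(200, {"predictions": [0, 1, 0]}))

# error response, shaped like the one raised in the /predict route
try:
    parse_prediction_response(400, {"detail": "example error message"})
except RuntimeError as error:
    print(error)
```

With the `requests` call shown earlier, this would be invoked as `parse_prediction_response(response.status_code, response.json())`.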

### Summary

In conclusion, this tutorial highlights the integration of MLflow and FastAPI and the use of the Holistic AI library for mitigating bias in machine learning models. 

MLflow enables model tracking, management, and deployment, supporting transparency and reproducibility in the machine learning development lifecycle. FastAPI's automatic OpenAPI and JSON Schema generation provides an efficient platform for building APIs that serve real-time predictions. 

Additionally, the code showcases Holistic AI, a library designed to measure and mitigate bias in models. By incorporating bias measurement and mitigation techniques, such as the 'learning_fair_representation' mitigator behind the deployed model in this example, developers can improve the fairness of their machine learning models. Combining MLflow, FastAPI, and Holistic AI in this way offers a guide for deploying and consuming bias-aware machine learning models and promotes responsible and inclusive AI practices in production environments.