Governed ML Operationalization Enablement

Governed operationalization of AI models is a framework that combines process, people, and technology to help ensure the trustworthiness of AI solutions used in business. The approach uses data and AI technologies that are integrated with an open and diverse ecosystem, and it is rooted in the principles of trustworthy AI ethics. Governed operationalization of AI models covers the entire lifecycle of ML models, from inception to decommissioning. The diagram below captures the process and people aspects of this lifecycle.

For more details on governed operationalization of AI models, please refer to https://opendatascience.com/trustworthy-ai-operationalizing-ai-models-with-governance-part-1/.

The enablement materials in this GitHub repository take you through end-to-end Governed ML Operationalization for a given use case on heterogeneous platforms. In this lab, we demonstrate end-to-end pipeline creation using the Airline Delay dataset. The stages of Governed MLOps covered in this lab are:

- Model Governance Workflow Initiation
- Model Candidacy Validation
- Data Acquisition
- Model Development
- Model Validation
- Model Deployment
- Model Productionization
- Model Monitoring

Step by Step Lab Instructions:

Use Case

Flight delays have become an important concern for air transportation worldwide because of the financial losses they cause the aviation industry. According to data from the Bureau of Transportation Statistics (BTS) of the United States, over 20% of US flights were delayed during 2018, which resulted in a severe economic impact equivalent to US$41 billion.

These delays inconvenience not only the airlines but also the passengers. For passengers, the result is longer travel time, which increases the expenses associated with food and lodging and ultimately causes stress. The airlines incur extra costs associated with their crews, aircraft repositioning, fuel consumption while trying to reduce elapsed times, and many others, which together tarnish their reputation and often result in a loss of passenger demand.

The reasons for these delays vary from air congestion to weather conditions, mechanical problems, difficulties while boarding passengers, and simply the airline's inability to handle demand given its capacity. In this course, we showcase the machine learning operationalization (MLOps) workflow applied to a model that predicts flight delays.

Dataset Information

This lab uses two datasets.
- The first dataset contains flight information (FLIGHT_ID, MONTH, DAY, DAY_OF_THE_WEEK, DEPARTURE_DELAY, TAXI_OUT, DISTANCE, DELAYED, YEAR).
  - The DELAYED field is the target variable to be predicted.
- The second dataset contains flight destination information (FLIGHT_ID, ORIGIN_AIRPORT, DESTINATION_AIRPORT).

Each group receives a different version of these datasets, with the same attributes/columns but different data values.
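As a minimal sketch of how the two record layouts above could be represented, the following Python snippet models each dataset as a dataclass and separates the DELAYED target from the remaining features. The column types, the class names `FlightInfo` and `FlightRoute`, the helper `split_features_target`, and the sample values are assumptions made for illustration; the lab does not specify them. The sketch also assumes DEPARTURE_DELAY and TAXI_OUT are two separate columns.

```python
from dataclasses import dataclass, asdict


@dataclass
class FlightInfo:
    """One row of the first dataset (column types are assumed)."""
    FLIGHT_ID: int
    MONTH: int
    DAY: int
    DAY_OF_THE_WEEK: int
    DEPARTURE_DELAY: float
    TAXI_OUT: float
    DISTANCE: float
    DELAYED: int  # target variable to be predicted
    YEAR: int


@dataclass
class FlightRoute:
    """One row of the second dataset (column types are assumed)."""
    FLIGHT_ID: int
    ORIGIN_AIRPORT: str
    DESTINATION_AIRPORT: str


TARGET = "DELAYED"


def split_features_target(row: FlightInfo):
    """Return (features dict without the target, target value)."""
    record = asdict(row)
    target = record.pop(TARGET)
    return record, target


# Made-up example row, not taken from the lab data.
flight = FlightInfo(1, 1, 15, 4, 12.0, 18.0, 733.0, 1, 2018)
features, target = split_features_target(flight)
print(sorted(features))
print(target)
```

Both layouts share FLIGHT_ID, which is the natural key for combining them later in the lab.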

### Details of Governed MLOps lifecycle workflow

In this lab, you will work through the following modules in order to execute the Governed MLOps flow. The phases and steps of the Governed ModelOps flow are depicted in the diagram below.

The notebooks linked in the various sections of this README contain instructions for executing the steps in the Governed MLOps flow.

Part 1 : Governed Operationalization of Models developed and deployed in IBM Platform

In this part of the enablement, you will learn how to operationalize models developed and deployed in the IBM Platform. This part consists of modules that take you through the stages of the model lifecycle workflow for Governed Model Operationalization.

Please note the following key aspects of this part:

- Each person executes every step of the model lifecycle workflow shown above by changing roles.
- Each person needs 1 Dataset, 1 Catalog, 2 Projects, and 2 Deployment Spaces. The Catalog and Deployment Spaces are pre-created for each person in the IBM tool. However, you have to create the 2 Projects yourself as you go (instructions are provided).

Part 1 - Module 1 : Model Governance Workflow Initiation

This module covers Governance Workflow Initiation in IBM OpenPages with Watson. The Workflow feature of IBM OpenPages is used to manage the Model Governance Lifecycle for Risk and Compliance.

Role : This step needs to be executed by ModelOwner

Instructions are provided in the notebook: Model Governance Workflow Initiation
After executing the instructions, you will have entered a unique model name in OpenPages. Please note this model name, because it is used for the rest of the lab. The Model Workflow is then assigned to ModelApprover for Model Candidacy Validation.

Part 1 - Module 2 : Model Candidacy Validation

This module of the workflow captures the model details as part of the Model Governance Workflow using IBM OpenPages.

Role : This step needs to be executed by ModelApprover.

Log into OpenPages as ModelApprover. From the task list, select the model corresponding to the unique model name as obtained in the previous module.
Instructions are provided in the notebook: Model Candidacy Validation
After this step, the Model Workflow would be assigned to ModelDataEngineer for Data Acquisition.

Part 1 - Module 3 : Data Sourcing/Data Acquisition

This module provides the steps where a data engineer can source the data needed to develop a model.

3.a In this sub-step, the Data Engineer reviews the information that is needed for data sourcing.

Role : This step needs to be executed by ModelDataEngineer.

Log into OpenPages as ModelDataEngineer. From the task list, select the model corresponding to the unique model name obtained from the Model Governance Workflow Initiation module.
Review the information needed for data sourcing (Model Details, Model Catalog, details about the data source, etc.). This step is for review and information purposes only. No action needs to be taken in OpenPages yet.

3.b In this sub-step, the Data Engineer navigates back to the home page URL of Cloud Pak for Data and creates a joined virtualized view of the raw datasets. The virtualized dataset is then added to your catalog and profiled.

Role : This step needs to be executed by ModelDataEngineer.

Note: The datasets are pre-loaded in DB2 and Postgres.
   - Naming convention:
     - <MONTH>_FLIGHT_INFORMATION for DB2
     - <month>_airport_information for Postgres
   - We will do some feature engineering and join the datasets using Data Virtualization in the notebook below.
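To make the naming convention and the join concrete, here is a small Python sketch. It is only a local stand-in: it uses an in-memory SQLite database in place of DB2, Postgres, and Data Virtualization, and it assumes the two tables are joined on FLIGHT_ID. The helper names `table_names` and `build_joined_view`, the view name `flights_joined`, the month token `"jan"`, and the sample rows are all hypothetical; the actual month format and join steps are defined in the lab notebook.

```python
import sqlite3


def table_names(month: str) -> tuple[str, str]:
    """Build the two source table names for a month token (format assumed)."""
    flights = f"{month.upper()}_FLIGHT_INFORMATION"    # DB2-style name
    airports = f"{month.lower()}_airport_information"  # Postgres-style name
    return flights, airports


def build_joined_view(conn, month, view_name="flights_joined"):
    """Create a view joining both tables on FLIGHT_ID (assumed join key)."""
    flights, airports = table_names(month)
    conn.execute(
        f'CREATE VIEW "{view_name}" AS '
        f"SELECT f.*, a.ORIGIN_AIRPORT, a.DESTINATION_AIRPORT "
        f'FROM "{flights}" AS f '
        f'JOIN "{airports}" AS a ON f.FLIGHT_ID = a.FLIGHT_ID'
    )
    return view_name


# Local demonstration with made-up rows and a reduced set of columns.
conn = sqlite3.connect(":memory:")
flights, airports = table_names("jan")
conn.execute(
    f'CREATE TABLE "{flights}" (FLIGHT_ID INTEGER, DISTANCE REAL, DELAYED INTEGER)'
)
conn.execute(
    f'CREATE TABLE "{airports}" '
    "(FLIGHT_ID INTEGER, ORIGIN_AIRPORT TEXT, DESTINATION_AIRPORT TEXT)"
)
conn.executemany(
    f'INSERT INTO "{flights}" VALUES (?, ?, ?)', [(1, 733.0, 1), (2, 250.0, 0)]
)
conn.executemany(
    f'INSERT INTO "{airports}" VALUES (?, ?, ?)',
    [(1, "JFK", "LAX"), (2, "ORD", "SFO")],
)

view = build_joined_view(conn, "jan")
rows = conn.execute(f'SELECT * FROM "{view}" ORDER BY FLIGHT_ID').fetchall()
print(rows)
```

In the lab itself the equivalent view is created through the Data Virtualization user interface, so this code is only meant to show the shape of the result: one row per flight with both the flight and airport attributes.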

Instructions are provided in the notebook: Data Acquisition

3.c In this sub-step, the relevant information about data sourcing is updated in the model workflow.

Role : This step needs to be executed by ModelDataEngineer

3.c.1. Log into OpenPages as ModelDataEngineer, navigate to the tasks section, and click on your model name obtained from the Model Governance Workflow Initiation module.
3.c.2. Click on the Data Sourcing view and fill out the necessary fields (Training Data Asset Name and Training Data Quality Flag, which indicates whether the training data quality is acceptable).
- Training Data Asset Name: Provide the name of the virtualized dataset that you created using Data Virtualization and added to the catalog.
- Training Data Quality Flag: Set the flag to true if data profiling shows that the data has no null values, the attributes are not skewed, and so on.
3.c.3. Click Save. Then open the actions menu at the top right and click Data Acquisition Verified. This moves the model to the Data Acquisition Completed stage.
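The quality criteria mentioned for the Training Data Quality Flag (no null values, attributes not skewed) can be illustrated with a short Python sketch. This is not the profiling that the catalog performs; it is a hedged approximation under stated assumptions. The function names `skewness` and `training_data_quality_flag`, the skewness threshold of 1.0, and the sample rows are all hypothetical.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: mean of cubed z-scores (0.0 for constant data)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0
    return mean(((v - mu) / sigma) ** 3 for v in values)


def training_data_quality_flag(rows, numeric_columns, max_abs_skew=1.0):
    """Return True if no value is null and no numeric column is too skewed.

    max_abs_skew is an assumed threshold, not a value defined by the lab.
    """
    # Null check first, so the skewness calculation never sees None.
    for row in rows:
        if any(value is None for value in row.values()):
            return False
    for column in numeric_columns:
        if abs(skewness([row[column] for row in rows])) > max_abs_skew:
            return False
    return True


# Made-up rows for illustration only.
clean = [{"FLIGHT_ID": i, "DISTANCE": d} for i, d in enumerate([100, 200, 300, 400])]
with_null = clean + [{"FLIGHT_ID": 99, "DISTANCE": None}]
skewed = [{"FLIGHT_ID": i, "DISTANCE": d} for i, d in enumerate([1, 1, 1, 1, 100])]

print(training_data_quality_flag(clean, ["DISTANCE"]))
print(training_data_quality_flag(with_null, ["DISTANCE"]))
print(training_data_quality_flag(skewed, ["DISTANCE"]))
```

In practice the decision is made by reading the profiling results in the catalog; a check like this only shows one way the "no nulls, not skewed" rule could be made explicit.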

3.d. In this sub-step, the model owner needs to take the appropriate action in order to make the model ready for development.

Role : This step needs to be executed by ModelOwner.

3.d.1. Log into OpenPages as ModelOwner, navigate to the tasks section, and click on the model name obtained from the Model Governance Workflow Initiation module.
3.d.2. The Model Owner verifies the training data quality flag and the training dataset by getting the dataset name from OpenPages and inspecting the data asset in the respective catalog, which is reached from the home page URL of Cloud Pak for Data.
3.d.3. The Model Owner then updates the Model Life Cycle Stage to Approved for Development and clicks Save. Next, the Model Owner opens the actions menu at the top right and selects Model Development to indicate that the model is ready for development. This moves the model to the Approved for Development stage.
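Because each person plays every role in this part, it can help to see where the acting role changes between steps. The following Python sketch lists the steps of Modules 1 to 3 with the roles stated above and reports each handoff, that is, each point where you need to log in under a different role. The step labels are abbreviated paraphrases, and the names `STEP_ROLES` and `role_handoffs` are hypothetical.

```python
# Steps of Modules 1-3 and the role that executes each one, as described above.
STEP_ROLES = [
    ("Module 1: Model Governance Workflow Initiation", "ModelOwner"),
    ("Module 2: Model Candidacy Validation", "ModelApprover"),
    ("Module 3.a: Review data sourcing information", "ModelDataEngineer"),
    ("Module 3.b: Create joined virtualized view", "ModelDataEngineer"),
    ("Module 3.c: Update data sourcing in the workflow", "ModelDataEngineer"),
    ("Module 3.d: Approve model for development", "ModelOwner"),
]


def role_handoffs(steps):
    """Return (step, previous_role, new_role) wherever the acting role changes."""
    handoffs = []
    for (_, prev_role), (step, role) in zip(steps, steps[1:]):
        if role != prev_role:
            handoffs.append((step, prev_role, role))
    return handoffs


handoffs = role_handoffs(STEP_ROLES)
for step, prev_role, role in handoffs:
    print(f"Before '{step}': switch from {prev_role} to {role}")
```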

Part 1 - Module 4 : Model Development

This module provides the steps a data scientist follows to develop a model. In the Model Development step, the developer or data scientist creates the development project.

4.a. In this sub-step, the Model Developer reviews the information that is needed for model development.