# ML Insights run with Custom Metrics

# Use Case

This notebook shows how to define and use custom metrics with the ML Insights declarative API for metric calculation.

## Note

### Custom Metrics
- sum_divide_by_two: returns the total sum divided by 2.
- sum_divide_by_k: returns the total sum divided by k, where k lies in the range [1, length of the data set].
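The arithmetic behind the two metrics can be sketched in plain Python, independent of ML Insights. The function names mirror the metric names above, while the sample readings and the range check on k are illustrative assumptions, not part of the notebook's data or the library's behaviour.

```python
# Plain-Python illustration of the two custom metrics (no ML Insights dependency).
def sum_divide_by_two(values):
    """Return the total sum of the values divided by 2."""
    return sum(values) / 2


def sum_divide_by_k(values, k):
    """Return the total sum of the values divided by k, with 1 <= k <= len(values)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    return sum(values) / k


readings = [72.0, 66.0, 64.0, 40.0, 74.0]  # made-up sample values, sum is 316.0
print(sum_divide_by_two(readings))      # 158.0
print(sum_divide_by_k(readings, k=5))   # 63.2
```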
    
## About Dataset
The data was collected and made available by the “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.

The data set contains medical and demographic data of patients. It consists of features such as Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome, Prediction, BMICategory, and Prediction_Score.

Dataset source : https://www.kaggle.com/datasets/kandij/diabetes-dataset


# Install ML Observability Insights Library SDK

- Prerequisites
    - Linux/Mac (Intel CPU)
    - Python 3.8 and 3.9 only


- Installation
    - ML Insights is available as a Python package (via Artifactory) and can be installed with pip, as shown below. Depending on the execution engine used for the run, you can install a scoped package: for example, oracle-ml-insights[dask] for Dask, oracle-ml-insights[spark] for Spark, and oracle-ml-insights for the native engine. To install all dependencies, use oracle-ml-insights[all].

      !pip install oracle-ml-insights

Refer : [Installation and Setup](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/tutorials/install.html)

In [None]:
!python3 -m pip install oracle-ml-insights

In [None]:
!python3 -m pip install matplotlib

# 1 ML Insights Imports 

In [25]:
# imports

import os
from typing import Any
import pyarrow as pa
import pandas as pd
import json

# Import Data Quality metrics 
from mlm_insights.core.metrics.count import Count
from mlm_insights.core.metrics.min import Min
from mlm_insights.core.metrics.mean import Mean
from mlm_insights.core.metrics.sum import Sum
from mlm_insights.core.metrics.standard_deviation import StandardDeviation

# Import Data Integrity metrics
from mlm_insights.core.metrics.rows_count import RowCount
from mlm_insights.core.metrics.distinct_count import DistinctCount
from mlm_insights.core.metrics.duplicate_count import DuplicateCount



from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.core.post_processors.local_writer_post_processor import LocalWriterPostProcessor


# import data reader
from mlm_insights.core.data_sources import LocalDatePrefixDataSource
from mlm_insights.mlm_native.readers import CSVNativeDataReader



# Custom Metrics

- To configure a custom metric, extend the ML Insights MetricBase interface, the abstract base class for defining an Insights metric, and override its methods with the custom metric logic.

    - create method: factory method that creates the metric object. The configuration is available in config.
    - compute method: holds the custom logic that calculates the metric value from the passed series object.
    - merge method: merges another metric with the current metric and returns a new metric instance. Use it to combine the states of the two metrics.
    - get_result method: returns the computed value of the metric; used internally by the to_json() method.
    - get_standard_metric_result method: returns the computed value of the metric in a standard form; used internally by the to_json() method.


Refer : [Metrics Component Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/metrics_component.html)
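The create / compute / merge / get_result life cycle described above can be sketched with a standalone class. This is only an illustration of the pattern: the class below is hypothetical, does not subclass the real MetricBase, and its method signatures are assumptions rather than the library's actual interface.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class SumDivideByKSketch:
    """Hypothetical stand-in for a custom metric; NOT the real MetricBase API."""
    k: int = 1
    total: float = 0.0  # running sum is the only state that needs merging

    @classmethod
    def create(cls, config: Optional[Dict[str, Any]] = None) -> "SumDivideByKSketch":
        # Factory: read k from the configuration, defaulting to 1.
        config = config or {}
        return cls(k=int(config.get("k", 1)))

    def compute(self, column: Iterable[float]) -> None:
        # Accumulate the sum of one chunk of data into the metric state.
        self.total += sum(column)

    def merge(self, other: "SumDivideByKSketch") -> "SumDivideByKSketch":
        # Combine two partial states and return a new instance.
        return SumDivideByKSketch(k=self.k, total=self.total + other.total)

    def get_result(self) -> Dict[str, float]:
        # Derive the final value from the accumulated state.
        return {"value": self.total / self.k}


part_a = SumDivideByKSketch.create({"k": 5})
part_a.compute([70.0, 80.0])
part_b = SumDivideByKSketch.create({"k": 5})
part_b.compute([60.0, 90.0])
merged = part_a.merge(part_b)
print(merged.get_result())  # {'value': 60.0}
```

Keeping the running sum as state, and dividing only in get_result, means that merging partial results from different data chunks gives the same answer as a single pass over all the data.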

In [26]:
# import Custom metrics  
from sum_divide_by_two_custom_metrics import SumDivideByTwo
from sum_divide_by_k_custom_metrics import SumDivideByK

# 2 Configure Feature schema

Feature Schema defines the structure and metadata of the input data, including data type, column type, and column mapping. The framework uses this information as the ground truth: any deviation in the actual data is treated as an anomaly, and the framework usually ignores such anomalies.
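As a rough illustration of treating the schema as ground truth, the sketch below keeps only the values that conform to a declared FLOAT type and sets the rest aside. It is a plain-Python analogy with a hypothetical helper name; how ML Insights itself detects and handles anomalies is not shown in this notebook.

```python
def conforming_floats(raw_values):
    """Split raw values into those that parse as float and those that do not."""
    kept, anomalies = [], []
    for value in raw_values:
        try:
            kept.append(float(value))
        except (TypeError, ValueError):
            # The value does not match the declared FLOAT type.
            anomalies.append(value)
    return kept, anomalies


kept, anomalies = conforming_floats(["72", 66, "n/a", None, 64.5])
print(kept)       # [72.0, 66.0, 64.5]
print(anomalies)  # ['n/a', None]
```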

In [27]:
def get_input_schema():
    return {
        "Pregnancies": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "BloodPressure": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "SkinThickness": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "Insulin": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "BMI": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "Age": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "DiabetesPedigreeFunction": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "Outcome": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS, column_type=ColumnType.TARGET),
        "Prediction": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS, column_type=ColumnType.PREDICTION),
        "BMICategory": FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL),
    }



# 3 Configure Metrics 

Metrics are the core construct of the framework. This component is responsible for calculating all statistical metrics and algorithms. Metric components work based on the type of features available (e.g. input feature, output feature), their data type (e.g. int, float, string), and any additional context (e.g. whether a previous computation is available to compare against). ML Insights provides commonly used metrics out of the box for different ML observability use cases.

Refer : [Metrics Component Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/metrics_component.html)



In [28]:
def get_metrics():
    metrics = [
               MetricMetadata(klass=Mean),
               MetricMetadata(klass=Sum),
               MetricMetadata(klass=SumDivideByTwo), # register custom metrics 
               MetricMetadata(klass=SumDivideByK, config={'k': 5}) # register custom metrics 
              ]
    uni_variate_metrics = {
        "BloodPressure": metrics
        
    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details

# 4 Configure Data Reader

The Data Reader ingests raw data into the framework. This component is primarily responsible for understanding different data formats (e.g. jsonl, csv) and how to read them properly. In essence, given a set of valid file locations that represent files of a specific type, the reader decodes their content and loads it into memory.

Additionally, the Data Source component is an optional subcomponent that is usually used alongside the Reader. Its primary responsibility is to embed the logic for filtering and partitioning the files to be read by the framework.

Refer : [Data Reader Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/data_reader_component.html)
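To make the filtering idea concrete, the sketch below expands a date range into one folder prefix per day under a base location. The helper name is hypothetical and the inclusive end date is an assumption based on the locations printed later in this notebook; it is not the actual implementation of LocalDatePrefixDataSource.

```python
from datetime import date, timedelta


def date_prefixes(base_location, start, end):
    """Return one '<base>/<YYYY-MM-DD>' prefix per day, end date inclusive."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    days = (last - first).days
    return [
        f"{base_location}/{(first + timedelta(days=offset)).isoformat()}"
        for offset in range(days + 1)
    ]


print(date_prefixes("input_data/diabetes_prediction", "2023-06-26", "2023-06-27"))
# ['input_data/diabetes_prediction/2023-06-26', 'input_data/diabetes_prediction/2023-06-27']
```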

In [29]:
def get_reader():
    data = {
        "file_type": "csv",
        "date_range": {"start": "2023-06-26", "end": "2023-06-27"}
    }
    base_location = "input_data/diabetes_prediction"
    ds = LocalDatePrefixDataSource(base_location, **data)
    print(ds.get_data_location())
    csv_reader = CSVNativeDataReader(data_source=ds)
    return csv_reader



# 5 Compute the Profile 

Create the builder object, which provides the core set of APIs for configuring the behavior of the monitoring run. By selecting which components and variants to run, all aspects of the monitoring task can be customized and configured.

The run() method runs the internal workflow. It also handles the life cycle of each component passed in, including creation (if required), invoking interface functions, and destruction. Additionally, the runner handles more advanced operations such as thread pooling and compute engine abstraction.

Refer : [Builder Object Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/builder_object.html)


In [30]:
def main():    
    # Set up the insights builder by passing: input schema, metrics, reader and post processors
    runner = InsightsBuilder(). \
        with_input_schema(get_input_schema()). \
        with_metrics(metrics=get_metrics()). \
        with_reader(reader=get_reader()). \
        with_post_processors(post_processors=[LocalWriterPostProcessor(file_location="output_data/profiles", file_name="custom_metrics_profile.bin")]). \
        build()

    # Run the evaluation
    run_result = runner.run()
    return run_result.profile
    
profile = main()
profile.to_pandas()

['input_data/diabetes_prediction/2023-06-26/2023-06-26.csv', 'input_data/diabetes_prediction/2023-06-27/2023-06-27.csv']


| | Mean | SumDivideByTwo | SumDivideByK | Sum |
|---|---|---|---|---|
| BloodPressure | 69.134328 | 32424.0 | 12969.6 | 64848.0 |


# 6 Profile Result

In [31]:
profile_json = profile.to_json()
print(profile_json)




{'dataset_metrics': {}, 'feature_metrics': {'BloodPressure': {'Mean': {'metric_name': 'Mean', 'metric_description': 'Feature Metric to compute mean', 'variable_count': 1, 'variable_names': ['mean'], 'variable_types': ['CONTINUOUS'], 'variable_dtypes': ['FLOAT'], 'variable_dimensions': [0], 'metric_data': [69.13432835820896], 'metadata': {}, 'error': ''}, 'SumDivideByTwo': {'metric_name': 'SumDivideByTwo', 'metric_description': '', 'variable_count': 1, 'variable_names': ['sumdividebytwo'], 'variable_types': ['CONTINUOUS'], 'variable_dtypes': ['FLOAT'], 'variable_dimensions': [0], 'metric_data': [32424.0], 'metadata': {}, 'error': ''}, 'SumDivideByK': {'metric_name': 'SumDivideByK', 'metric_description': '', 'variable_count': 1, 'variable_names': ['sumdividebyK'], 'variable_types': ['CONTINUOUS'], 'variable_dtypes': ['FLOAT'], 'variable_dimensions': [0], 'metric_data': [12969.6], 'metadata': {}, 'error': ''}, 'Sum': {'metric_name': 'Sum', 'metric_description': '', 'variable_count': 1, 'v

In [32]:
print("The Sum Metric value : ")
print(profile_json['feature_metrics']['BloodPressure']['Sum']['metric_data'])

The Sum Metric value : 
[64848.0]


In [33]:
print("The SumDivideByTwo Metric value : ")
print(profile_json['feature_metrics']['BloodPressure']['SumDivideByTwo']['metric_data'])

The SumDivideByTwo Metric value : 
[32424.0]


In [34]:
print("The SumDivideByK Metric value : ")
print(profile_json['feature_metrics']['BloodPressure']['SumDivideByK']['metric_data'])

The SumDivideByK Metric value : 
[12969.6]
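The reported values can be cross-checked against each other without rerunning the profile. The sketch below copies the numbers from the outputs above and uses k = 5 from the metric configuration. The implied row count is derived from Sum / Mean and is an inference, not a value reported by the notebook.

```python
import math

# Values copied from the profile output above.
reported = {
    "Mean": 69.13432835820896,
    "Sum": 64848.0,
    "SumDivideByTwo": 32424.0,
    "SumDivideByK": 12969.6,
}
k = 5  # from MetricMetadata(klass=SumDivideByK, config={'k': 5})

# The custom metrics should be consistent with the built-in Sum metric.
assert math.isclose(reported["Sum"] / 2, reported["SumDivideByTwo"])
assert math.isclose(reported["Sum"] / k, reported["SumDivideByK"])

# Sum / Mean gives the number of values that contributed to the metrics.
implied_count = round(reported["Sum"] / reported["Mean"])
print(implied_count)  # 938
```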
