# ML Insights: Performance Metrics for Regression Models

## Use Case  

This notebook shows how to configure the following regression metrics using ML Insights:

- MeanAbsoluteError: Computes the mean absolute error (MAE) regression loss of the dataset
- MeanSquaredError: Computes the mean squared error (MSE) regression loss of the dataset
- R2Score: Computes the R2 score between the target and prediction columns
- RootMeanSquaredError: Computes the root mean squared error (RMSE) regression loss of the dataset
- MeanSquaredLogError: Computes the mean squared log error (MSLE) regression loss of the dataset
- MeanAbsolutePercentageError: Computes the mean absolute percentage error (MAPE) regression loss of the dataset
- MaxError: Computes the max error regression loss of the dataset
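
For reference, the textbook definitions behind these metrics can be sketched in plain Python. This is a minimal illustration on made-up numbers, not the ML Insights implementation: the function name `regression_metrics` and the sample values are assumptions, and MAPE is returned here as a fraction rather than a percentage.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute seven regression metrics from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # MSLE compares log(1 + value), so it needs non-negative inputs.
    msle = sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / n

    # MAPE divides by the actual value, so zero targets are not handled here.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    # R2 = 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {
        "MeanAbsoluteError": mae,
        "MeanSquaredError": mse,
        "RootMeanSquaredError": rmse,
        "MeanSquaredLogError": msle,
        "MeanAbsolutePercentageError": mape,
        "R2Score": r2,
        "MaxError": max(abs(e) for e in errors),
    }


# Made-up target and prediction values, for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 400.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that RootMeanSquaredError is always the square root of MeanSquaredError, which gives a quick consistency check on any computed profile.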

## Note

- Performance metrics for regression models work only with continuous input features.
- Performance metrics for regression models require target and prediction features in the feature schema. This includes:
    - Column type Target for the ground truth column. If this column is missing or not configured, Insights throws a validation error.
    - Column type Prediction for the prediction column. If this column is missing or not configured, Insights throws a validation error.
- All regression metrics can be viewed using the following Profile API:
    - to_json
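
The target and prediction requirement above can be checked before a run with a small pre-flight helper. The sketch below is hypothetical and independent of the SDK's own validation: `check_regression_schema` is an illustrative name, and the plain strings `"INPUT"`, `"TARGET"`, and `"PREDICTION"` are stand-ins for the SDK's `ColumnType` values.

```python
def check_regression_schema(column_types):
    """Return a list of problems; an empty list means both roles are present.

    `column_types` maps column name -> "INPUT" | "TARGET" | "PREDICTION".
    These strings are illustrative stand-ins, not SDK objects.
    """
    problems = []
    for required in ("TARGET", "PREDICTION"):
        if required not in column_types.values():
            problems.append(f"no column configured with column type {required}")
    return problems


# Column roles as used later in this notebook.
schema_roles = {
    "Floor": "INPUT",
    "LotSize": "INPUT",
    "LotArea": "INPUT",
    "SalePrice": "PREDICTION",
    "SaleTarget": "TARGET",
}
print(check_regression_schema(schema_roles))

# Dropping the ground truth column is reported as a problem.
without_target = {k: v for k, v in schema_roles.items() if v != "TARGET"}
print(check_regression_schema(without_target))
```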

### About Dataset 

The dataset contains data collected across various real estate property aggregators. It includes features such as Floor, LotSize, LotArea, SalePrice, and SaleTarget.

Dataset source: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

# Install ML Observability Insights Library SDK

- Prerequisites
    - Linux/Mac (Intel CPU)
    - Python 3.8 and 3.9 only
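
The prerequisites listed above can be checked from Python before installing. This is a minimal sketch using only the standard library; the helper name `environment_problems` is hypothetical, and the CPU architecture requirement is not checked here.

```python
import platform
import sys

SUPPORTED_PYTHON = {(3, 8), (3, 9)}
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}  # platform.system() reports macOS as "Darwin"


def environment_problems(version=None, system=None):
    """Return a list of unmet prerequisites (empty if none were found).

    `version` and `system` default to the running interpreter and OS;
    they are parameters only so the check can be exercised with other values.
    """
    version = tuple(version or sys.version_info[:2])
    system = system or platform.system()
    problems = []
    if version[:2] not in SUPPORTED_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is not 3.8 or 3.9")
    if system not in SUPPORTED_SYSTEMS:
        problems.append(f"{system} is not Linux or Mac")
    return problems


print(environment_problems())
```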


- Installation
    - ML Insights is available as a Python package (via Artifactory) that can be installed with pip, as shown below. Depending on the execution engine used for the run, you can install a scoped package. For example, use oracle-ml-insights[dask] to run on Dask, oracle-ml-insights[spark] to run on Spark, and oracle-ml-insights to run natively. To install all the dependencies, use oracle-ml-insights[all].

      !pip install oracle-ml-insights


Refer : [Installation and Setup](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/tutorials/install.html)

In [None]:
!python3 -m pip install oracle-ml-insights

In [None]:
!python3 -m pip install scikit-learn

# 1 ML Insights Imports 

In [14]:
# imports

import os
from typing import Any
import pyarrow as pa
import pandas as pd
import json

# Import metrics
from mlm_insights.core.features.feature import FeatureMetadata
from mlm_insights.core.metrics.count import Count
from mlm_insights.core.metrics.max import Max
from mlm_insights.core.metrics.mean import Mean
from mlm_insights.core.metrics.min import Min

# Import dataset metrics
from mlm_insights.core.metrics.rows_count import RowCount

# Import Regression metrics
from mlm_insights.core.metrics.regression_metrics.max_error import MaxError
from mlm_insights.core.metrics.regression_metrics.mean_absolute_error import MeanAbsoluteError
from mlm_insights.core.metrics.regression_metrics.mean_absolute_percentage_error import MeanAbsolutePercentageError
from mlm_insights.core.metrics.regression_metrics.mean_squared_error import MeanSquaredError
from mlm_insights.core.metrics.regression_metrics.mean_squared_log_error import MeanSquaredLogError
from mlm_insights.core.metrics.regression_metrics.r2_score import R2Score
from mlm_insights.core.metrics.regression_metrics.root_mean_squared_error import RootMeanSquaredError

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

# import data reader
from mlm_insights.core.data_sources import LocalDatePrefixDataSource
from mlm_insights.mlm_native.readers import CSVNativeDataReader

# import post processor 
from mlm_insights.core.post_processors.local_writer_post_processor import LocalWriterPostProcessor
from mlm_insights.builder.insights_builder import InsightsBuilder





# 2 Configure Feature schema

Feature Schema defines the structure and metadata of the input data, which includes data type, column type, and column mapping. The framework uses this information as the ground truth: any deviation in the actual data is treated as an anomaly, and the framework will usually ignore such anomalies.
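
To make "deviation from the schema" concrete, the sketch below scans rows for values that do not match a declared integer type. It is a hypothetical stand-alone helper (`find_type_deviations` is not part of ML Insights) and does not reflect how the framework detects or handles anomalies internally; the sample rows are made up.

```python
def find_type_deviations(rows, integer_columns):
    """Return (row_index, column, raw_value) for values that are not integers."""
    deviations = []
    for index, row in enumerate(rows):
        for column in integer_columns:
            raw = row.get(column)  # None when the column is missing
            try:
                int(raw)
            except (TypeError, ValueError):
                deviations.append((index, column, raw))
    return deviations


# Made-up rows, shaped like the output of csv.DictReader.
rows = [
    {"Floor": "2", "LotSize": "8450"},
    {"Floor": "two", "LotSize": "9600"},  # non-numeric value
    {"Floor": "1"},                       # LotSize missing
]
deviations = find_type_deviations(rows, ["Floor", "LotSize"])
print(deviations)
```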

In [15]:
def get_input_schema():
    return {
        "Floor": FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS, column_type=ColumnType.INPUT),
        "LotSize": FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS, column_type=ColumnType.INPUT),
        "LotArea": FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS, column_type=ColumnType.INPUT),
        "SalePrice": FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS, column_type=ColumnType.PREDICTION),
        "SaleTarget": FeatureType(data_type=DataType.INTEGER, variable_type=VariableType.CONTINUOUS, column_type=ColumnType.TARGET)
    }

# 3 Configure Metrics

Metrics are the core construct of the framework. This component is responsible for calculating all statistical metrics and algorithms. Metric components work based on the type of features available (e.g. input feature, output feature), their data type (e.g. int, float, string), and additional context (e.g. whether a previous computation is available to compare against). ML Insights provides commonly used metrics out of the box for different ML observability use cases.

Refer : [Metrics Component Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/metrics_component.html)



In [16]:
def get_metrics():
    metrics = [
               MetricMetadata(klass=Max),
               MetricMetadata(klass=Min),
               MetricMetadata(klass=Count),
               MetricMetadata(klass=Mean)
              ]
    uni_variate_metrics = {
        "SalePrice": metrics,
        "SaleTarget": metrics
    }
    
    dataset_metrics = [MetricMetadata(klass=RowCount),
                       MetricMetadata(klass=MeanAbsoluteError),
                       MetricMetadata(klass=MeanSquaredError),
                       MetricMetadata(klass=RootMeanSquaredError),
                       MetricMetadata(klass=MeanAbsolutePercentageError),
                       MetricMetadata(klass=MeanSquaredLogError),
                       MetricMetadata(klass=R2Score),
                       MetricMetadata(klass=MaxError)]
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=dataset_metrics)
    return metric_details

# 4 Configure Data Reader

Data Reader allows for ingestion of raw data into the framework. This component is primarily responsible for understanding different data formats (e.g. jsonl, csv) and how to read them properly. In essence, given a set of valid file locations that represent files of a specific type, the reader decodes the content and loads it into memory.

Additionally, the Data Source component is an optional subcomponent that is usually used alongside the Reader. Its primary responsibility is to embed the logic for filtering and partitioning the files to be read by the framework.

Refer : [Data Reader Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/data_reader_component.html)
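
As an illustration of the kind of filtering a date-prefix data source performs, the sketch below selects date-named folders that fall inside a date range. It is not the `LocalDatePrefixDataSource` implementation: the `YYYY-MM-DD` folder naming, the inclusive end date, and the helper name `select_date_folders` are all assumptions made for this example.

```python
from datetime import date


def select_date_folders(folder_names, start, end):
    """Keep names that parse as YYYY-MM-DD and fall within [start, end]."""
    start_day = date.fromisoformat(start)
    end_day = date.fromisoformat(end)
    selected = []
    for name in folder_names:
        try:
            day = date.fromisoformat(name)
        except ValueError:
            continue  # skip folders that are not named after a date
        if start_day <= day <= end_day:
            selected.append(name)
    return sorted(selected)


# Hypothetical folder listing under a base location.
folders = ["2023-06-25", "2023-06-26", "2023-06-27", "2023-06-28", "archive"]
selected = select_date_folders(folders, "2023-06-26", "2023-06-27")
print(selected)
```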

In [17]:
def get_data_reader(start_date, end_date):
    
    # Define Data Format
    data = {
        "file_type": "csv",
        "date_range": {"start": start_date, "end": end_date}
    }
    
    # Define Data Location
    base_location = "input_data/house_price_prediction_dataset"
    
    # Create new Dataset
    ds = LocalDatePrefixDataSource(base_location, **data)
    
    # Load Dataset
    csv_reader = CSVNativeDataReader(data_source=ds)
    
    
    return csv_reader

# 5 Compute the Profile 

Create the builder object, which provides the core set of APIs for configuring the behavior of the monitoring run. By selecting which components and variants to run, all aspects of the monitoring task can be customised and configured.

The run() method runs the internal workflow. It also handles the life cycle of each component passed, which includes creation (if required), invoking interface functions, and destruction. Additionally, the runner handles more advanced operations such as thread pooling and compute engine abstraction.

Refer : [Builder Object Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/builder_object.html)


In [18]:
# Create the ML Monitoring Metrics
def run_evaluation(start_date, end_date, output_location, output_file):    
    
    # Set up the insights builder by passing: input schema, metric, reader and engine details
    runner = InsightsBuilder(). \
        with_input_schema(get_input_schema()). \
        with_metrics(metrics=get_metrics()). \
        with_reader(reader=get_data_reader(start_date, end_date)). \
        with_post_processors(post_processors=[LocalWriterPostProcessor(file_location=output_location, file_name=output_file)]). \
        build()

    # Run the Evaluation of Metrics
    run_result = runner.run()
    
    return run_result.profile

In [19]:
# Define Baseline Dates
base_start_date = '2023-06-26'
base_end_date = '2023-06-27'

# Execute Base Profile - Pass in Data Start, Data End, Output Location, Output File
profile = run_evaluation(base_start_date, base_end_date, 'output_data/profiles', 'regression_metrics_profile.bin')

# 6 Profile Result

## 6.1 Visualize the Profile in tabular format

In [20]:
profile.to_pandas()

| | Count.total_count | Count.missing_count | Count.missing_count_percentage | Min | Mean | Max |
|---|---|---|---|---|---|---|
| SalePrice | 1770.0 | 0.0 | 0.0 | 34911.0 | 183422.359322 | 555111.0 |
| SaleTarget | 1770.0 | 0.0 | 0.0 | 39311.0 | 181564.945763 | 451951.0 |


In [21]:
profile_json = profile.to_json()
dataset_metrics = profile_json['dataset_metrics']
print(json.dumps(dataset_metrics,sort_keys=True, indent=4))


{
    "MaxError": {
        "metadata": {},
        "metric_data": [
            402000.0
        ],
        "metric_description": "Computes the maximum residual error",
        "metric_name": "MaxError",
        "variable_count": 1,
        "variable_dimensions": [
            0
        ],
        "variable_dtypes": [
            "FLOAT"
        ],
        "variable_names": [
            "max_error"
        ],
        "variable_types": [
            "CONTINUOUS"
        ]
    },
    "MeanAbsoluteError": {
        "metadata": {},
        "metric_data": [
            78467.62372881356
        ],
        "metric_description": "Computes Mean Absolute Error regression loss",
        "metric_name": "MeanAbsoluteError",
        "variable_count": 1,
        "variable_dimensions": [
            0
        ],
        "variable_dtypes": [
            "FLOAT"
        ],
        "variable_names": [
            "mean_absolute_error"
        ],
        "variable_types": [
            "CONTINUOUS"
   

In [22]:
pd.json_normalize(dataset_metrics).T.dropna()

| | 0 |
|---|---|
| MeanSquaredLogError.metric_name | MeanSquaredLogError |
| MeanSquaredLogError.metric_description | Computes Mean Squared Log Error regression loss |
| MeanSquaredLogError.variable_count | 1 |
| MeanSquaredLogError.variable_names | [mean_squared_log_error] |
| MeanSquaredLogError.variable_types | [CONTINUOUS] |
| ... | ... |
| RootMeanSquaredError.variable_names | [root_mean_squared_error] |
| RootMeanSquaredError.variable_types | [CONTINUOUS] |
| RootMeanSquaredError.variable_dtypes | [FLOAT] |
| RootMeanSquaredError.variable_dimensions | [0] |
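
To pull just the metric values out of the `dataset_metrics` dictionary, each entry's `variable_names` can be paired with its `metric_data`. The sketch below works on a hand-copied sample holding the two entries visible in the JSON output above; pairing the two lists by position is an assumption based on the shape of that output, and `metric_values` is a hypothetical helper name.

```python
# Sample copied from the JSON output above (only the fields used here).
sample_dataset_metrics = {
    "MaxError": {
        "metric_data": [402000.0],
        "variable_names": ["max_error"],
    },
    "MeanAbsoluteError": {
        "metric_data": [78467.62372881356],
        "variable_names": ["mean_absolute_error"],
    },
}


def metric_values(dataset_metrics):
    """Map each metric name to {variable_name: value}, paired by position."""
    return {
        name: dict(zip(entry["variable_names"], entry["metric_data"]))
        for name, entry in dataset_metrics.items()
    }


values = metric_values(sample_dataset_metrics)
for metric_name, variables in values.items():
    print(metric_name, variables)
```

In the notebook itself, the same helper could be applied to the `dataset_metrics` variable from the earlier cell instead of the sample.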
