# Minimal ML Insights to calculate metrics

# Use Case

This notebook demonstrates the Minimal ML Insights option: after you load the data and define its schema, the framework automatically decides which metrics to evaluate, based on heuristics.

Note: Minimal ML Insights computes only data quality and data integrity metrics. To compute performance, classification, or drift metrics, run ML Insights using the config or API option.

### About Dataset
The Iris flower data set, also known as Fisher's Iris data set, is a multivariate data set. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

Dataset source: https://archive.ics.uci.edu/dataset/53/iris

## What is ML Insights 

ML Insights is a Python library for data scientists, ML engineers, and developers. Insights can be used to ingest data in different formats, apply row-based transformations, and monitor data and ML models from validation to production.

Along with the library, ML Insights provides multiple ways to process and evaluate data and ML models. The options include a low-code alternative for customisation, a pre-built application, and further extensibility through custom applications and custom components.

ML Insights helps evaluate and monitor data and ML models across the entire ML observability lifecycle.

ML Insights provides components for tasks such as data ingestion, row-level data transformation, metric calculation, and post-processing of metric output.


- Insights currently supports CSV, JSON, and JSONL data formats.
- It also supports major execution engines such as native Pandas, Dask, and Spark.
- Insights provides metrics in groups such as:
    - Data Integrity
    - Data Quality / Summary
    - Feature and Prediction Drift Detection
    - Model Performance for both classification and regression models

# Install ML Observability Insights Library SDK

- Prerequisites
    - Linux/Mac (Intel CPU)
    - Python 3.8 and 3.9 only


- Installation
    - ML Insights is available as a Python package (via Artifactory) that can be installed with pip, as shown below. Depending on the execution engine you want to run on, you can install a scoped package: use oracle-ml-insights[dask] for Dask, oracle-ml-insights[spark] for Spark, and oracle-ml-insights for native. To install all dependencies, use oracle-ml-insights[all].

      !pip install oracle-ml-insights

Refer to: [Installation and Setup](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/tutorials/install.html)
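
The prerequisites and package variants listed above can be checked before installing. The following is a minimal sketch, not part of ML Insights: `is_supported` and `pip_requirement` are hypothetical helper names, the CPU architecture is not checked, and the extras names (`dask`, `spark`, `all`) are taken from the installation notes above.

```python
import platform
import sys

# Values taken from the prerequisites listed above.
SUPPORTED_PYTHON = {(3, 8), (3, 9)}
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}  # platform.system() reports macOS as "Darwin"

# Extras named in the installation notes; "native" means no extra.
ENGINE_EXTRAS = {"native": "", "dask": "dask", "spark": "spark", "all": "all"}


def is_supported(version=None, system=None):
    """Return True when the interpreter and OS match the stated prerequisites."""
    major_minor = tuple(version or sys.version_info)[:2]
    system = system or platform.system()
    return major_minor in SUPPORTED_PYTHON and system in SUPPORTED_SYSTEMS


def pip_requirement(engine="native"):
    """Build the pip requirement string for the chosen execution engine."""
    extra = ENGINE_EXTRAS[engine]  # raises KeyError for an unknown engine
    return f"oracle-ml-insights[{extra}]" if extra else "oracle-ml-insights"


print(is_supported(), pip_requirement("dask"))
```

The printed requirement string can then be passed to `python3 -m pip install`, quoted so the shell does not interpret the square brackets.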

This example notebook showcases how to use the Insights config reader to run the evaluation based on a monitor config. A sample monitor config JSON and sample data are available under `monitor_configs/monitor_config.json` and `input_data/iris-dataset` respectively.

In [None]:
!python3 -m pip install oracle-ml-insights

# 1 ML Insights Imports

In [40]:
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.config_reader.insights_config_reader import InsightsConfigReader
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType


In [41]:
from sklearn.datasets import load_iris
import pandas as pd

# 2 Load Data 

In [42]:
iris_dataset = load_iris(as_frame=True)
iris_data_frame = pd.DataFrame(data=iris_dataset.data, columns=iris_dataset.feature_names)

# 3 Configure Feature Schema

In [43]:
def get_input_schema():
    return {
        "sepal length (cm)": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "sepal width (cm)": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "petal length (cm)": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "petal width (cm)": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS)
    }
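
Because the schema keys above are the column names of the data frame, a typo in a key is easy to miss. The following is a minimal sketch of a pre-flight check, assuming that schema keys are matched to columns by exact string; `check_schema_against_columns` is a hypothetical helper, not an ML Insights API.

```python
def check_schema_against_columns(schema_keys, frame_columns):
    """Compare schema keys with data frame column names.

    Returns (missing, unmapped): schema keys that are not columns,
    and columns that have no schema entry.
    """
    schema_keys = list(schema_keys)
    frame_columns = list(frame_columns)
    missing = [name for name in schema_keys if name not in frame_columns]
    unmapped = [name for name in frame_columns if name not in schema_keys]
    return missing, unmapped


# Column names as produced by sklearn's load_iris(as_frame=True).
iris_columns = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# A schema with a deliberate typo in the last key (missing space).
typo_keys = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width(cm)",
]

missing, unmapped = check_schema_against_columns(typo_keys, iris_columns)
print(missing, unmapped)
```

In this notebook the same check could be called as `check_schema_against_columns(get_input_schema(), iris_data_frame.columns)`, since iterating a dictionary yields its keys.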



# 4 Compute Profile

After loading the data and defining its schema, we pass both to the ML Insights Builder component, which returns the runner (workflow) component. Insights automatically decides which metrics to evaluate, based on heuristics.



In [44]:
runner = (
    InsightsBuilder()
    .with_input_schema(get_input_schema())
    .with_data_frame(data_frame=iris_data_frame)
    .build()
)

With the runner object built, we can now call the run API to get the profile object, which contains all the metric output.

In [45]:
run_result = runner.run()
# Use Materialized View API to see the results
profile = run_result.profile


# 5 Profile Result

This is a data frame representation of our profile.

- The number of rows is equal to the number of features.
- Each column corresponds to a specific metric associated with a feature and its values.

Let's take a look at the generated profile output. We have the generic Iris data set here, with the following features:

- sepal length
- sepal width
- petal length
- petal width

Given that no metrics were requested explicitly, Insights has automatically decided which metrics to generate for each feature. The framework runs a heuristic based on:

- the data type of the feature
- the variable type of the feature
- the column type, e.g. whether it is an input or a prediction column
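
To make the idea of type-driven metric selection concrete, here is a purely illustrative sketch. It is not the actual ML Insights heuristic: the `Feature` class, the `pick_metrics` function, and the rules themselves are assumptions made for this example, and the column type is left out for brevity. The type and metric names are borrowed from the output shown in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a schema entry (not the library's FeatureType)."""
    data_type: str       # e.g. "FLOAT", "INTEGER", "STRING"
    variable_type: str   # e.g. "CONTINUOUS", "DISCRETE", "NOMINAL"


def pick_metrics(feature):
    """Toy rule set: choose metric names from the feature's declared types."""
    metrics = ["Count", "DistinctCount"]  # meaningful for any feature
    numeric = feature.data_type in {"FLOAT", "INTEGER"}
    if numeric:
        metrics += ["Min", "Max", "Sum"]
    if numeric and feature.variable_type == "CONTINUOUS":
        metrics += ["Mean", "StandardDeviation", "Quartiles"]
    return metrics


print(pick_metrics(Feature("FLOAT", "CONTINUOUS")))
print(pick_metrics(Feature("STRING", "NOMINAL")))
```

The point of the sketch is only that declaring the types in the schema is what lets a framework choose sensible metrics without an explicit metric list.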

The output is shown in tabular format because we explicitly used the API to get the data frame representation of the profile. However, the user can also pass specific post processors to emit the profile data in different formats.

## 5.1 Visualize the Profile in tabular format


In [46]:
profile.to_pandas()

| | Skewness | StandardDeviation | Min | IsConstantFeature | IQR | Range | KolmogorovSmirnov | ProbabilityDistribution.bins | ProbabilityDistribution.density | Variance | ... | Count.missing_count_percentage | Kurtosis | DistinctCount | Sum | Max | Quartiles.q1 | Quartiles.q2 | Quartiles.q3 | Mean | IsQuasiConstantFeature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sepal length (cm) | 0.311753 | 0.825301 | 4.3 | False | 1.3 | 3.6 | Missing required parameter profile or registry... | [4.3, 4.7, 5.1, 5.5, 5.9, 6.300000000000001, 6... | [0.06, 0.15333333333333335, 0.1333333333333333... | 0.681122 | ... | 0.0 | -0.573568 | 35 | 876.5 | 7.9 | 5.1 | 5.8 | 6.4 | 5.843333 | False |
| sepal width (cm) | 0.315767 | 0.434411 | 2.0 | False | 0.5 | 2.4 | Missing required parameter profile or registry... | [2.0, 2.2666666666666666, 2.533333333333333, 2... | [0.02666666666666667, 0.1, 0.18666666666666668... | 0.188713 | ... | 0.0 | 0.180976 | 23 | 458.6 | 4.4 | 2.8 | 3.0 | 3.3 | 3.057333 | False |
| petal length (cm) | -0.272128 | 1.759404 | 1.0 | False | 3.5 | 5.9 | Missing required parameter profile or registry... | [1.0, 1.6555555555555554, 2.311111111111111, 2... | [0.29333333333333333, 0.03999999999999998, 0.0... | 3.095503 | ... | 0.0 | -1.395536 | 43 | 563.7 | 6.9 | 1.6 | 4.4 | 5.1 | 3.758 | False |
| petal width (cm) | -0.101934 | 0.759693 | 0.1 | False | 1.5 | 2.4 | Missing required parameter profile or registry... | [0.1, 0.3666666666666667, 0.6333333333333333, ... | [0.2733333333333333, 0.06, 0.0, 0.066666666666... | 0.577133 | ... | 0.0 | -1.336067 | 22 | 179.9 | 2.5 | 0.3 | 1.3 | 1.8 | 1.199333 | False |
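
Several values in the table above can be cross-checked against each other using only their definitions. The following sketch does this for the `sepal length (cm)` row. The numbers are copied from the table; the row count of 150 comes from the dataset description (50 samples from each of three species) and assumes no missing values, which matches the reported missing count percentage of 0.0. The tolerance is an arbitrary choice to absorb the rounding of the printed values.

```python
import math

# Values copied from the "sepal length (cm)" row of the profile table.
row = {
    "Min": 4.3, "Max": 7.9, "Range": 3.6,
    "Sum": 876.5, "Mean": 5.843333,
    "q1": 5.1, "q3": 6.4, "IQR": 1.3,
    "StandardDeviation": 0.825301, "Variance": 0.681122,
}
row_count = 150  # 50 samples x 3 species, no missing values


def close(a, b, tol=1e-5):
    """Compare two floats, allowing for rounding in the printed table."""
    return math.isclose(a, b, abs_tol=tol)


checks = {
    "Range == Max - Min": close(row["Max"] - row["Min"], row["Range"]),
    "IQR == q3 - q1": close(row["q3"] - row["q1"], row["IQR"]),
    "Mean == Sum / count": close(row["Sum"] / row_count, row["Mean"]),
    "Variance == StandardDeviation ** 2": close(row["StandardDeviation"] ** 2, row["Variance"]),
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```

The same pattern can be applied to the other three rows by swapping in their values.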


## 5.2 Visualize the Profile in JSON format

In [47]:
profile.to_json()

{'dataset_metrics': {'RowCount': {'metric_name': 'RowCount',
   'metric_description': 'Dataset-level Metric to compute the total row count of the dataset',
   'variable_count': 1,
   'variable_names': ['rows_count'],
   'variable_types': ['DISCRETE'],
   'variable_dtypes': ['INTEGER'],
   'variable_dimensions': [0],
   'metric_data': [150.0],
   'metadata': {}},
  'PearsonCorrelation': {'metric_name': 'PearsonCorrelation',
   'metric_description': "Pearson's Correlation Coefficient matrix between n numeric features",
   'variable_count': 2,
   'variable_names': ['feature_list', 'matrix'],
   'variable_types': ['NOMINAL', 'CONTINUOUS'],
   'variable_dtypes': ['STRING', 'FLOAT'],
   'variable_dimensions': [1, 2],
   'metric_data': [['petal length (cm)',
     'petal width (cm)',
     'sepal length (cm)',
     'sepal width (cm)'],
    [[1.0, 0.9628654314027963, 0.8717537758865833, -0.4284401043305396],
     [0.9628654314027963, 1.0, 0.8179411262715753, -0.36612593253643916],
     [0.871753