# ML Insights: Data Quality & Data Integrity Metrics

# Use Case

This notebook shows how to configure the following Data Quality and Data Integrity metrics using ML Insights.

    Sum = Computes the sum of the input.
    TypeMetric = Computes the type metric (per-type value counts) of the input.
    Mean = Computes the mean of the input.
    Min = Computes the minimum of the input.
    StandardDeviation = Computes the standard deviation of the input.
    Variance = Computes the variance of the input.
    Max = Computes the maximum of the input.
    Range = Computes the range of the input.
    Count = Computes the total count, NaN/missing count, and missing count percentage of the input.
    Skewness = Computes the skewness of the input.
    Kurtosis = Computes the kurtosis of the input.
    Quartiles = Computes the quartiles of the input.
    IQR = Computes the interquartile range (IQR) of the input.
    TopKFrequentElements = Computes the top K most frequent elements of the input.
    FrequencyDistribution = Computes the frequency distribution of the input.
    DistinctCount = Computes the distinct count of the input.
    DuplicateCount = Computes the number of items that duplicate another item in the data, and the duplicate count as a percentage of the total count.
    Mode = Computes the mode of the input.
    IsConstantFeature = Computes whether the input feature is constant.
    IsQuasiConstantFeature = Computes whether the input feature is quasi-constant.
    ProbabilityDistribution = Computes the probability density function (PDF) of the dataset for the configured number of bins, or for the default number of bins.
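
As a point of reference for what several of these metrics mean, the following is a minimal sketch that computes them with the Python standard library on a tiny hand-made sample. The `describe` helper and the sample values are hypothetical, and the conventions chosen here (population variance, inclusive quartiles, duplicates counted as total minus distinct) are assumptions; ML Insights may use different definitions or approximations, so its numbers need not match.

```python
import statistics
from collections import Counter


def describe(values):
    """Return a few reference summary statistics for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    counts = Counter(values)
    return {
        "Sum": sum(values),
        "Mean": statistics.fmean(values),
        "Min": min(values),
        "Max": max(values),
        "Range": max(values) - min(values),
        # Assumption: population (not sample) variance and standard deviation.
        "Variance": statistics.pvariance(values),
        "StandardDeviation": statistics.pstdev(values),
        "Quartiles": (q1, q2, q3),
        "IQR": q3 - q1,
        "Mode": statistics.multimode(values),
        "DistinctCount": len(counts),
        # Assumption: every repeat of an earlier value counts as one duplicate.
        "DuplicateCount": len(values) - len(counts),
    }


sample = [1.0, 2.0, 2.0, 3.0, 4.0]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing such reference values with the profile produced later in the notebook can help confirm which convention a given metric follows.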

## Note
    
- All Data Quality & Data Integrity metrics require the VariableType and DataType of each feature to be defined in the feature schema. This includes:
    - variable_type=VariableType.CONTINUOUS for numerical features
    - variable_type=VariableType.NOMINAL for categorical features
    - Supported data types: data_type=DataType.FLOAT, data_type=DataType.INTEGER, data_type=DataType.STRING, data_type=DataType.TEXT, data_type=DataType.BOOLEAN
- All computed metrics can be viewed using the following Profile APIs:
    - to_json
    - to_pandas
    
### About Dataset 

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

Dataset source: https://archive.ics.uci.edu/dataset/53/iris

# Install ML Observability Insights Library SDK

- Prerequisites
    - Linux/Mac (Intel CPU)
    - Python 3.8 and 3.9 only


- Installation
    - ML Insights is available as a Python package (via Artifactory) that can be installed with pip, as shown below. Depending on the execution engine you want to run on, you can install a scoped package. For example, use oracle-ml-insights[dask] to run on Dask, oracle-ml-insights[spark] to run on Spark, and oracle-ml-insights to run natively. To install all dependencies, use oracle-ml-insights[all].

      !pip install oracle-ml-insights

Refer: [Installation and Setup](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/tutorials/install.html)
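
Because the prerequisites list only Python 3.8 and 3.9, it can be useful to check the interpreter version before installing. The sketch below is one way to do that with the standard library; the `is_supported_python` helper and the `SUPPORTED_VERSIONS` constant are hypothetical names, not part of ML Insights.

```python
import sys

# Versions listed under Prerequisites in this notebook.
SUPPORTED_VERSIONS = {(3, 8), (3, 9)}


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version is one of the listed ones."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if not is_supported_python():
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Warning: Python {current} is outside the listed prerequisites")
```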


In [None]:
!python3 -m pip install oracle-ml-insights

# 1 ML Insights Imports 

In [12]:
# imports

import os
from typing import Any
import pyarrow as pa
import pandas as pd
import json


# Import metrics
from mlm_insights.core.features.feature import FeatureMetadata
from mlm_insights.core.metrics.count import Count
from mlm_insights.core.metrics.distinct_count import DistinctCount
from mlm_insights.core.metrics.duplicate_count import DuplicateCount
from mlm_insights.core.metrics.iqr import IQR
from mlm_insights.core.metrics.is_constant_feature import IsConstantFeature
from mlm_insights.core.metrics.is_quasi_constant_feature import IsQuasiConstantFeature
from mlm_insights.core.metrics.kurtosis import Kurtosis
from mlm_insights.core.metrics.max import Max
from mlm_insights.core.metrics.mean import Mean
from mlm_insights.core.metrics.min import Min
from mlm_insights.core.metrics.mode import Mode
from mlm_insights.core.metrics.range import Range
from mlm_insights.core.metrics.rows_count import RowCount
from mlm_insights.core.metrics.skewness import Skewness
from mlm_insights.core.metrics.standard_deviation import StandardDeviation
from mlm_insights.core.metrics.top_k_frequent_elements import TopKFrequentElements
from mlm_insights.core.metrics.type_metric import TypeMetric
from mlm_insights.core.metrics.variance import Variance
from mlm_insights.core.metrics.quartiles import Quartiles
from mlm_insights.core.metrics.sum import Sum

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.data_sources import LocalDatePrefixDataSource
from mlm_insights.mlm_native.readers import CSVNativeDataReader
from mlm_insights.core.post_processors.local_writer_post_processor import LocalWriterPostProcessor
from mlm_insights.builder.insights_builder import InsightsBuilder

# 2 Configure Feature schema

The feature schema defines the structure and metadata of the input data, including data type, column type, and column mapping. The framework uses this information as the ground truth: any deviation in the actual data is treated as an anomaly, and the framework will usually ignore such anomalies.

In [13]:
def get_input_schema():
    return {
        "sepal length (cm)": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "sepal width (cm)": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "petal length (cm)": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS),
        "petal width (cm)": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS)
    }
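
To make the idea of "ignoring values that deviate from the schema" concrete, here is a small illustrative sketch in plain Python. It is not how ML Insights implements schema enforcement; the `filter_floats` helper and the sample values are hypothetical, and the rule used (keep only `float` instances) is an assumption chosen for simplicity.

```python
def filter_floats(values):
    """Split values into floats (kept) and everything else (skipped)."""
    kept, skipped = [], []
    for value in values:
        if isinstance(value, float):
            kept.append(value)
        else:
            # Anything that does not match the declared type is set aside.
            skipped.append(value)
    return kept, skipped


raw_column = [5.1, 4.9, "n/a", None, 6.3]
kept, skipped = filter_floats(raw_column)
print("kept:", kept)
print("skipped:", skipped)
```

In this sketch a metric such as Sum would then be computed over `kept` only, which is the practical effect of treating the schema as the ground truth.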



# 3 Configure Metrics

Metrics are the core construct of the framework. This component is responsible for calculating all statistical metrics and algorithms. Metric components work based on the type of features available (e.g. input feature, output feature), their data type (e.g. int, float, string), and any additional context (e.g. whether a previous computation is available to compare against). ML Insights provides commonly used metrics out of the box for different ML observability use cases.

Refer: [Metrics Component Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/metrics_component.html)



In [14]:
def get_metrics():
    metrics = [
               MetricMetadata(klass=Sum),
               MetricMetadata(klass=Quartiles),
               MetricMetadata(klass=Max),
               MetricMetadata(klass=Min),
               MetricMetadata(klass=Count),
               MetricMetadata(klass=Mean),
               MetricMetadata(klass=Skewness),
               MetricMetadata(klass=TypeMetric),
               MetricMetadata(klass=StandardDeviation),
               MetricMetadata(klass=Variance),
               MetricMetadata(klass=IsConstantFeature),
               MetricMetadata(klass=IsQuasiConstantFeature),
               MetricMetadata(klass=Kurtosis),
               MetricMetadata(klass=DistinctCount),
               MetricMetadata(klass=DuplicateCount),
               MetricMetadata(klass=IQR),
               MetricMetadata(klass=Mode),
               MetricMetadata(klass=Range)
              ]
    uni_variate_metrics = {
        "sepal length (cm)": metrics,
        "sepal width (cm)": metrics
    }
    metric_details = MetricDetail(univariate_metric=uni_variate_metrics,
                                  dataset_metrics=[])
    return metric_details
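
In `get_metrics()` the same metric list is attached by hand to two of the four schema features. If the same metrics are wanted for every feature, the mapping can be generated instead. The sketch below shows one way to do this in plain Python; `metrics_for_features` is a hypothetical helper, and the strings in the example stand in for the `MetricMetadata` objects used in the notebook.

```python
def metrics_for_features(feature_names, metrics):
    """Map every feature name to its own copy of the metric list."""
    return {name: list(metrics) for name in feature_names}


# Placeholders for the MetricMetadata(klass=...) entries built in get_metrics().
placeholder_metrics = ["Sum", "Mean", "Max"]
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
univariate_config = metrics_for_features(feature_names, placeholder_metrics)
print(univariate_config)
```

The returned dictionary has the same shape as `uni_variate_metrics`, so with real `MetricMetadata` entries and the keys of `get_input_schema()` it could be passed as `univariate_metric` to `MetricDetail`.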

# 4 Configure Data Reader

The Data Reader allows raw data to be ingested into the framework. This component is primarily responsible for understanding different data formats (e.g. jsonl, csv) and how to read them properly. In essence, given a set of valid file locations that represent files of a specific type, the reader decodes their content and loads it into memory.

Additionally, the Data Source is an optional subcomponent that is usually used alongside the reader. Its primary responsibility is to embed the logic for filtering and partitioning the files to be read by the framework.

Refer: [Data Reader Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/data_reader_component.html)

In [15]:
# def get_data_frame():
#     iris_dataset = load_iris(as_frame=True)
#     iris_data_frame = pd.DataFrame(data=iris_dataset.data, columns=iris_dataset.feature_names)
#     return iris_data_frame

# def get_reader():
#     data = {
#         "file_type": "csv",
#         "date_range": {"start": "2023-06-24", "end": "2023-06-27"}
#     }
#     base_location ="input_data/iris_dataset"
#     ds = LocalDatePrefixDataSource(base_location, **data)
#     print(ds.get_data_location())
#     csv_reader = CSVNativeDataReader(data_source=ds)
#     return csv_reader

def get_data_reader(start_date, end_date):
    # Describe the input files: CSV format, restricted to the given date range
    data = {
        "file_type": "csv",
        "date_range": {"start": start_date, "end": end_date}
    }

    # Base folder that holds the input data
    base_location = "input_data/iris_dataset"

    # Create the data source, which selects the files to read
    ds = LocalDatePrefixDataSource(base_location, **data)

    # Create the CSV reader on top of the data source
    csv_reader = CSVNativeDataReader(data_source=ds)

    return csv_reader
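
The data source above is configured with a base location and a date range. As an illustration of date-based partitioning in general, the sketch below lists one folder per day in an inclusive range. The `base/YYYY-MM-DD` folder layout and the `date_prefixed_locations` helper are assumptions made for this example; the layout that `LocalDatePrefixDataSource` actually expects should be taken from its documentation.

```python
from datetime import date, timedelta


def date_prefixed_locations(base_location, start, end):
    """List one folder path per day in [start, end], both ends included."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    day_count = (last - first).days + 1
    # Assumed layout: <base_location>/<YYYY-MM-DD>
    return [
        f"{base_location}/{(first + timedelta(days=offset)).isoformat()}"
        for offset in range(day_count)
    ]


locations = date_prefixed_locations(
    "input_data/iris_dataset", "2023-06-26", "2023-06-29"
)
for location in locations:
    print(location)
```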

# 5 Compute the Profile 

Create the builder object, which provides the core set of APIs through which users configure the behavior of their monitoring. By selecting which components and variants to run, all aspects of the monitoring task can be customized and configured.

The run() method is responsible for running the internal workflow. It also handles the life cycle of each component passed, including creation (if required), invoking interface functions, and destruction. Additionally, the runner handles more advanced operations such as thread pooling and compute engine abstraction.

Refer: [Builder Object Documentation](https://docs.oracle.com/en-us/iaas/tools/ml-insights-docs/latest/ml-insights-documentation/html/user_guide/getting_started/builder_object.html)


In [16]:
# Build and run the ML Insights workflow and return the computed profile
def run_evaluation(start_date, end_date, output_location, output_file):

    # Set up the insights builder by passing: input schema, metrics, reader and post processors
    runner = InsightsBuilder(). \
        with_input_schema(get_input_schema()). \
        with_metrics(metrics=get_metrics()). \
        with_reader(reader=get_data_reader(start_date, end_date)). \
        with_post_processors(post_processors=[LocalWriterPostProcessor(file_location=output_location, file_name=output_file)]). \
        build()

    # Run the Evaluation of Metrics
    run_result = runner.run()
    
    return run_result.profile
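
The builder and runner follow a common pattern: chained `with_*` calls collect components, `build()` validates them and returns a runner, and `run()` drives the workflow. The toy sketch below illustrates that pattern only. `ToyBuilder` and `ToyRunner` are hypothetical classes written for this example; they are not the ML Insights implementation and omit everything it does beyond the basic read, compute and post-process steps.

```python
class ToyRunner:
    def __init__(self, reader, metrics, post_processors):
        self.reader = reader
        self.metrics = metrics
        self.post_processors = post_processors

    def run(self):
        data = self.reader()  # 1. ingest the data
        # 2. compute every configured metric on the data
        result = {name: compute(data) for name, compute in self.metrics.items()}
        # 3. hand the result to each post processor
        for post_processor in self.post_processors:
            post_processor(result)
        return result


class ToyBuilder:
    def __init__(self):
        self._reader = None
        self._metrics = {}
        self._post_processors = []

    def with_reader(self, reader):
        self._reader = reader
        return self  # returning self is what makes the calls chainable

    def with_metrics(self, metrics):
        self._metrics = dict(metrics)
        return self

    def with_post_processors(self, post_processors):
        self._post_processors = list(post_processors)
        return self

    def build(self):
        if self._reader is None:
            raise ValueError("a reader is required")
        return ToyRunner(self._reader, self._metrics, self._post_processors)


written = []  # stands in for a file written by a post processor
toy_runner = (
    ToyBuilder()
    .with_reader(lambda: [1.0, 2.0, 3.0])
    .with_metrics({"Sum": sum, "Max": max})
    .with_post_processors([written.append])
    .build()
)
toy_profile = toy_runner.run()
print(toy_profile)
```

Wrapping the chain in parentheses, as done here, is an alternative to the trailing backslashes used in `run_evaluation`.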


In [17]:
# Define Baseline Dates
base_start_date = '2023-06-26'
base_end_date = '2023-06-29'

# Execute Profile - Pass in Data Start, Data End, Output Location, Output File
profile = run_evaluation(base_start_date, base_end_date, 'output_data/profiles', 'classification_metrics_profile.bin')

# 6 Profile Result

## 6.1 Visualize the Profile in tabular format

In [18]:
profile.to_pandas()

| | Skewness | StandardDeviation | Min | IsConstantFeature | IQR | Mode | Range | TypeMetric.string_type_count | TypeMetric.integral_type_count | TypeMetric.fractional_type_count | ... | Count.missing_count_percentage | Max | DistinctCount | Sum | Kurtosis | Quartiles.q1 | Quartiles.q2 | Quartiles.q3 | Mean | IsQuasiConstantFeature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sepal length (cm) | 0.311753 | 0.825301 | 4.3 | False | 1.3 | [5.0] | 3.6 | 0 | 0 | 600 | ... | 0.0 | 7.9 | 35 | 3506.0 | -0.573568 | 5.1 | 5.8 | 6.4 | 5.843333 | False |
| sepal width (cm) | 0.315767 | 0.434411 | 2.0 | False | 0.5 | [3.0] | 2.4 | 0 | 0 | 600 | ... | 0.0 | 4.4 | 23 | 1834.4 | 0.180976 | 2.8 | 3.0 | 3.3 | 3.057333 | False |


## 6.2 Visualize the Profile in JSON format

In [19]:


profile_json = profile.to_json()
dataset_metrics = profile_json
print(json.dumps(dataset_metrics, sort_keys=True, indent=4))


{
    "dataset_metrics": {},
    "feature_metrics": {
        "sepal length (cm)": {
            "Count": {
                "metadata": {},
                "metric_data": [
                    600.0,
                    0.0,
                    0.0
                ],
                "metric_description": "Feature metric that returns total count, missing count and missing count percentage",
                "metric_name": "Count",
                "variable_count": 3,
                "variable_dimensions": [
                    0,
                    0,
                    0
                ],
                "variable_dtypes": [
                    "INTEGER",
                    "INTEGER",
                    "FLOAT"
                ],
                "variable_names": [
                    "total_count",
                    "missing_count",
                    "missing_count_percentage"
                ],
                "variable_types": [
                    "CONTINUOUS",
            

In [20]:
pd.json_normalize(dataset_metrics).T.dropna()

| | 0 |
|---|---|
| feature_metrics.sepal length (cm).Skewness.metric_name | Skewness |
| feature_metrics.sepal length (cm).Skewness.metric_description | Feature Metric to compute Skewness |
| feature_metrics.sepal length (cm).Skewness.variable_count | 1 |
| feature_metrics.sepal length (cm).Skewness.variable_names | [skewness] |
| feature_metrics.sepal length (cm).Skewness.variable_types | [CONTINUOUS] |
| ... | ... |
| feature_metrics.sepal width (cm).IsQuasiConstantFeature.variable_names | [is_quasi_constant] |
| feature_metrics.sepal width (cm).IsQuasiConstantFeature.variable_types | [BINARY] |
| feature_metrics.sepal width (cm).IsQuasiConstantFeature.variable_dtypes | [BOOLEAN] |
| feature_metrics.sepal width (cm).IsQuasiConstantFeature.variable_dimensions | [0] |
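
In the JSON profile, each metric entry keeps its values in `metric_data` and their labels in `variable_names`, as in the `Count` entry of the JSON output. The sketch below pairs the two lists into a dictionary. It assumes the lists are aligned by position, which the `Count` entry suggests but this notebook does not state; `named_values` is a hypothetical helper, and `count_entry` is an abbreviated copy of the entry shown in the output.

```python
def named_values(metric_entry):
    """Pair each variable name with the value at the same position."""
    return dict(zip(metric_entry["variable_names"], metric_entry["metric_data"]))


# Abbreviated copy of the "Count" entry for "sepal length (cm)".
count_entry = {
    "metric_name": "Count",
    "metric_data": [600.0, 0.0, 0.0],
    "variable_names": [
        "total_count",
        "missing_count",
        "missing_count_percentage",
    ],
}
count_values = named_values(count_entry)
print(count_values)
```

With the profile from this notebook, the same helper could be applied to `profile_json["feature_metrics"]["sepal length (cm)"]["Count"]`.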
