# Demo Notebook for Metrics Correlation Model Testing

#### [Download notebook](https://github.com/opensearch-project/opensearch-py-ml/blob/main/docs/source/examples/demo_metrics_correlation.ipynb)


## Introduction

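This notebook demonstrates how to run metrics correlation through the ML Commons API using the `opensearch-py-ml` client. Metrics correlation looks for windows in time where several metrics behave anomalously together.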

## Step 0: Imports

Install the following packages from a terminal if you haven't already. You can also install them from the notebook by uncommenting the `pip install` line below and executing the cell.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# !pip install pandas matplotlib numpy opensearch-py opensearch-py-ml

In [3]:
# Suppress the warnings that opensearch-py-ml emits each time a DataFrame connection is made
import warnings
warnings.filterwarnings('ignore')

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import opensearch_py_ml as oml
from opensearchpy import OpenSearch

# Import standard test settings for consistent results
from opensearch_py_ml.conftest import *

## Step 1: Set up clients and define helper functions


In [5]:
CLUSTER_URL = 'https://localhost:9200'

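def get_os_client(cluster_url=CLUSTER_URL,
                  username='admin',
                  password='admin'):
    '''
    Get OpenSearch client
    :param cluster_url: cluster URL like https://ml-te-netwo-1s12ba42br23v-ff1736fa7db98ff2.elb.us-west-2.amazonaws.com:443
    :param username: user name for HTTP basic authentication
    :param password: password for HTTP basic authentication
    :return: OpenSearch client
    '''
    client = OpenSearch(
        hosts=[cluster_url],
        http_auth=(username, password),
        # Certificate verification is disabled, which suits a local demo cluster but not production
        verify_certs=False
    )
    return client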


In [6]:
client = get_os_client()

In [7]:
# Create the ML Commons client on top of the OpenSearch client
from opensearch_py_ml.ml_commons import MLCommonClient
ml_client = MLCommonClient(client)

## Step 2: Create data index


In [8]:
# Read the CSV files into DataFrames
df_pd = pd.read_csv("data/SMD_small.csv.zip", header=None, index_col=None)
Y = pd.read_csv("data/SMD_small_labels.csv.zip", header=None, index_col=None)

The dataset has 100 rows and 38 columns. Each row represents a point in time and each column represents a metric, so this is a regular time series dataset.

In [9]:
df_pd

Unnamed: 0,0,1,...,36,37
0,0.096774,0.088983,...,0.0,0.0
1,0.107527,0.062500,...,0.0,0.0
2,0.107527,0.104873,...,0.0,0.0
3,0.096774,0.123941,...,0.0,0.0
4,0.096774,0.070975,...,0.0,0.0
...,...,...,...,...,...
95,0.494624,0.072034,...,0.0,0.0
96,0.494624,0.127119,...,0.0,0.0
97,0.494624,0.069915,...,0.0,0.0
98,0.494624,0.046610,...,0.0,0.0


However, to use this dataset for metrics correlation later, each row needs to represent a metric. Therefore, we need to transpose it before populating the index.

In [10]:
df_pd_transposed = df_pd.transpose()
df_pd_transposed

Unnamed: 0,0,1,...,98,99
0,0.096774,0.107527,...,0.494624,0.494624
1,0.088983,0.062500,...,0.046610,0.052966
2,0.090301,0.081382,...,0.060201,0.059086
3,0.102207,0.097561,...,0.052265,0.052265
4,0.000000,0.000000,...,0.000000,0.000000
...,...,...,...,...,...
33,0.000022,0.000034,...,0.000034,0.000022
34,0.102150,0.103308,...,0.047709,0.052994
35,0.102143,0.103301,...,0.047705,0.052990
36,0.000000,0.000000,...,0.000000,0.000000
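
The effect of the transpose is easiest to see on a tiny frame. The sketch below uses made-up values (three time points, two metrics) rather than the SMD data; the names `toy` and `toy_t` are illustrative only.

```python
import pandas as pd

# Synthetic example: 3 time points (rows) x 2 metrics (columns)
toy = pd.DataFrame([[0.1, 1.0],
                    [0.2, 2.0],
                    [0.3, 3.0]])

# After transposing, each row holds the full time series of one metric
toy_t = toy.transpose()

print(toy.shape)               # (3, 2): time points x metrics
print(toy_t.shape)             # (2, 3): metrics x time points
print(toy_t.iloc[0].tolist())  # [0.1, 0.2, 0.3]: time series of the first metric
```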


In [12]:
# Create an index and populate it with the dataset
df = oml.pandas_to_opensearch(df_pd_transposed,
                              os_client=client,
                              os_dest_index='smd',
                              os_if_exists="replace",
                              os_refresh=True)

In [13]:
df

Unnamed: 0,0,1,...,98,99
0,0.096774,0.107527,...,0.494624,0.494624
1,0.088983,0.062500,...,0.046610,0.052966
2,0.090301,0.081382,...,0.060201,0.059086
3,0.102207,0.097561,...,0.052265,0.052265
4,0.000000,0.000000,...,0.000000,0.000000
...,...,...,...,...,...
33,0.000022,0.000034,...,0.000034,0.000022
34,0.102150,0.103308,...,0.047709,0.052994
35,0.102143,0.103301,...,0.047705,0.052990
36,0.000000,0.000000,...,0.000000,0.000000


Since we did not provide column names, the columns are labeled with numbers, and the `_id` field is used as the DataFrame index.

In [14]:
print(df.columns)
print(df.index.os_index_field)

Index(['0', '1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22',
       '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36',
       '37', '38', '39', '4', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '5',
       '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '6', '60', '61', '62', '63',
       '64', '65', '66', '67', '68', '69', '7', '70', '71', '72', '73', '74', '75', '76', '77',
       '78', '79', '8', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '9', '90',
       '91', '92', '93', '94', '95', '96', '97', '98', '99'],
      dtype='object')
_id
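
The column listing above is in lexicographic rather than numeric order ('0', '1', '10', ...), as the labels are strings. If you need them in time order, for example to label a plot axis, one option is to sort by integer value. A minimal sketch on a plain list of labels; `labels` is a made-up subset, not read from the index.

```python
# Hypothetical subset of string labels, in the lexicographic order shown above
labels = ['0', '1', '10', '11', '2', '20', '3']

# Sort by integer value to recover time order
time_ordered = sorted(labels, key=int)
print(time_ordered)  # ['0', '1', '2', '3', '10', '11', '20']
```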


In [15]:
print(df.os_info())

os_index_pattern: smd
Index:
 os_index_field: _id
 is_source_field: False
Mappings:
 capabilities:
   os_field_name  is_source os_dtype os_date_format pd_dtype  is_searchable  is_aggregatable  is_scripted aggregatable_os_field_name
0              0       True   double           None  float64           True             True        False                          0
1              1       True   double           None  float64           True             True        False                          1
10            10       True   double           None  float64           True             True        False                         10
11            11       True   double           None  float64           True             True        False                         11
12            12       True   double           None  float64           True             True        False                         12
13            13       True   double           None  float64           True             True        Fal

## Step 3: Metrics correlation


In [None]:
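# Assumption: the request body must be JSON-serializable, so send the metric rows
# as a plain list of lists rather than the OpenSearch-backed DataFrame
input_json = {
    "metrics": df_pd_transposed.values.tolist()
}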
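
Since the request is sent to the cluster as JSON, it can help to check the payload before calling the API. The sketch below assumes that `metrics` should be a list of equal-length numeric rows, one row per metric. `check_metrics_payload` is a hypothetical helper, not part of `opensearch-py-ml`, and the values are made up.

```python
import json

def check_metrics_payload(payload):
    """Raise ValueError unless payload["metrics"] is a non-empty rectangular list of numbers."""
    rows = payload["metrics"]
    if not rows or not all(isinstance(r, list) and r for r in rows):
        raise ValueError("metrics must be a non-empty list of non-empty lists")
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all metric rows must have the same length")
    if not all(isinstance(v, (int, float)) for r in rows for v in r):
        raise ValueError("metric values must be numeric")
    # json.dumps raises TypeError if anything is not JSON-serializable
    json.dumps(payload)
    return len(rows), len(rows[0])

# Made-up payload: 2 metrics x 3 time points
toy_payload = {"metrics": [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]}
print(check_metrics_payload(toy_payload))  # (2, 3)
```

The returned pair is (number of metrics, number of time points), which gives a quick way to confirm that the data was transposed before sending.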

In [1]:
results = ml_client.execute(
    algorithm_name = "METRICS_CORRELATION",
    input_json = input_json
)
results


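If this cell raises `NameError: name 'ml_client' is not defined`, it was run in a fresh kernel before the cells in Step 1; run those cells first so that `ml_client` exists.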
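
Once the call succeeds, the response can be unpacked into one record per detected event. The sketch below assumes a response shaped like the example in the ML Commons documentation (`output.inference_results`, each entry with `event_window` and `suspected_metrics`); verify this against the actual response from your cluster. The `sample` dictionary is made up for illustration and is not output from this notebook, and `summarize_events` is a hypothetical helper.

```python
def summarize_events(response):
    """Return a list of (start, end, suspected_metrics) tuples, one per event."""
    events = response.get("output", {}).get("inference_results", [])
    return [
        (e["event_window"][0], e["event_window"][1], e["suspected_metrics"])
        for e in events
    ]

# Made-up response for illustration only
sample = {
    "function_name": "METRICS_CORRELATION",
    "output": {
        "inference_results": [
            {"event_window": [52, 72], "suspected_metrics": [1, 2]}
        ]
    },
}

for start, end, metrics in summarize_events(sample):
    print(f"event from t={start} to t={end}, suspected metrics: {metrics}")
```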

## Step 4: Result visualization


In [None]:
# Plot every metric (one line per column of df_pd) against the time index
plt.plot(df_pd.index, df_pd)
plt.title("All metrics")
plt.show()
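
To relate a detected event to the raw series, the event window can be shaded on top of the metric plot. This is a sketch on synthetic data, assuming an event window given as a `(start, end)` pair of time indices. The window values and the random series are made up, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
series = rng.random((100, 3))  # synthetic: 100 time points x 3 metrics
event_window = (52, 72)        # hypothetical (start, end) time indices

fig, ax = plt.subplots()
ax.plot(np.arange(series.shape[0]), series)  # one line per metric
ax.axvspan(*event_window, alpha=0.3, color="red", label="event window")
ax.set_xlabel("time index")
ax.set_ylabel("metric value")
ax.set_title("Metrics with event window")
ax.legend()
```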
