## Best Practices for Using EDDI with Time-Series Data

<!-- A typical ML downstream process involves input features predicting the output targets. Also, a test set is usually left out to do the evaluation. We make sure to prevent information leakage during evaluation in two aspects:
<ul>
  <li><strong>Testset to trainset</strong>: Only the trainset should be used to train the imputation model.</li>
  <li><strong>Targets to predictors</strong>: To prevent data leakage from target features into predictive features, we have two options:</li>
      <ol>
        <li>Evaluate the imputation and downstream task as a package</li>
        <li>Impute input and output columns of the dataset independently [as if they are separate datasets].</li>
      </ol>
      We use the second method in this notebook to focus our evaluation on the imputation piece only. Furthermore, if we need to seal the outputs from each other, we need three datasets in this case. Clearly, with only one column in a dataset, we cannot get a better result than plain "mean" imputation from EDDI. Therefore, at this point we focus only on <strong>input imputation</strong>. In later notebooks, we will see that temporal signals can help with this issue.
</ul> -->

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append('.')

In [3]:
from helpers.dataprep_utils import compute_missing_ratio, get_variables_metadata, extend_features_with_temporal_window
from helpers.sas_utils import get_data_link, get_placeholder_link
from helpers.service_utils import RestEDDI, train_eddi, batchinference_eddi

import pandas as pd
import os
import toml

Load data:

In [4]:
df_sensors = pd.read_csv('./data_generated/sensor_wide.csv', index_col=0)
df_sensors.head(1)

| time | IN1 | IN2 | IN3 | IN4 | IN5 | IN6 | IN7 | IN8 | Out1 | Out2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.077744 | 0.795565 | -0.665503 | 0.879321 | 0.134419 | -1.133765 | 0.253945 | 0.109987 | -0.122686 | 0.123661 |


Load EDDI & Storage setup

In [5]:
config = toml.load("./config/config.toml")

container_sas_link = config['blob']['container_sas_link']
subscription_key = config['eddi']['subscription_key']
rest_eddi = RestEDDI(subscription_key, api_version="v2.3")
blob_storage_root = 'sensor_datasets' # name for the blob storage directory within the provided container to store the data
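
Before calling the service, it can help to fail early when the configuration is incomplete. The sketch below checks for the two keys read in the cell above. `require_keys` is a hypothetical helper, not part of the `helpers` package, and the inline dictionary is only a stand-in for the parsed `config.toml`.

```python
def require_keys(config, required):
    """Return the missing 'section.key' entries (hypothetical helper)."""
    missing = []
    for section, key in required:
        if key not in config.get(section, {}):
            missing.append(f"{section}.{key}")
    return missing


# The two settings this notebook reads from ./config/config.toml.
REQUIRED = [("blob", "container_sas_link"), ("eddi", "subscription_key")]

# Stand-in for toml.load(...); the value is a placeholder, not a real credential.
example_config = {"blob": {"container_sas_link": "<sas-link>"}, "eddi": {}}

missing = require_keys(example_config, REQUIRED)
print(missing)  # ['eddi.subscription_key']
```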

Keep the train and test datasets separate. Assuming the downstream task uses the first 70% of the data for training and the remainder for testing, we apply the same split here:

In [6]:
splitpoint = df_sensors.index[int(df_sensors.shape[0] * 0.7)]
df_train = df_sensors[df_sensors.index < splitpoint]
df_test = df_sensors[df_sensors.index >= splitpoint]

# Create the output directories (including ./data_prepared/ itself) if they do not exist yet
os.makedirs('./data_prepared/train/', exist_ok=True)
os.makedirs('./data_prepared/test/', exist_ok=True)
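
Because the data is a time series, the split is chronological rather than random, so no later observation ends up in the training part. A minimal sketch on a toy frame (the values and the single column are made up for illustration):

```python
import pandas as pd

# Toy time-indexed frame with 10 rows; values are illustrative only.
df = pd.DataFrame({"IN1": range(10)}, index=pd.RangeIndex(10, name="time"))

# First index label that belongs to the test part (70/30 split).
splitpoint = df.index[int(df.shape[0] * 0.7)]

df_train = df[df.index < splitpoint]
df_test = df[df.index >= splitpoint]

# Every training timestamp precedes every test timestamp.
assert df_train.index.max() < df_test.index.min()
print(len(df_train), len(df_test))
```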

Impute the input and output columns of the dataset independently, as if they were separate datasets. Since imputing the output columns brings no added value at this point, we skip it for now and return to it in later steps.

In [7]:
in_columns = ["IN1", "IN2", "IN3", "IN4", "IN5"]
df_in = df_sensors[in_columns].copy()

Include temporal features based on a window size that suits the data. You can use the helper function _extend_features_with_temporal_window_ to extend your dataframe.

To select the best-suited feature set, you can adjust the granularity and window of the temporal features through these parameters of the function: temp_start, temp_step_size, and temp_num_steps.

In [8]:
df_ext, ext_colnames = extend_features_with_temporal_window(df_in, to_extend_columns=in_columns, temp_start=1, temp_step_size=1, temp_num_steps=1)
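
The implementation of `extend_features_with_temporal_window` is not shown in this notebook. As a rough illustration of the idea only, the sketch below appends lagged copies of each column with `DataFrame.shift`. The function `add_lag_features`, the `_lag{k}` column naming, and the reading of the three parameters as "first lag, spacing between lags, number of lags" are all assumptions, not the helper's documented behaviour.

```python
import pandas as pd


def add_lag_features(df, columns, start=1, step_size=1, num_steps=1):
    """Append lagged copies of `columns` (hypothetical stand-in for the helper).

    Lags used: start, start + step_size, ... for num_steps values in total.
    """
    out = df.copy()
    for k in range(num_steps):
        lag = start + k * step_size
        for col in columns:
            # shift(lag) moves values down, so row t holds the value from t - lag.
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    return out, list(out.columns)


toy = pd.DataFrame({"IN1": [1.0, 2.0, 3.0, 4.0]})
extended, names = add_lag_features(toy, ["IN1"], start=1, step_size=1, num_steps=2)
print(names)  # ['IN1', 'IN1_lag1', 'IN1_lag2']
```

In this sketch the first rows of each lagged column are empty, because no earlier observation exists for them.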

**Get variable metadata**: ideally, the metadata should be provided by a domain expert so that we do not need to rely on the current data distribution (min/max) to infer it.
Here we rely on the full dataset to infer the metadata, so that we do not run into out-of-distribution values in the test set. The _get_variables_metadata_ helper function helps with that.

In [9]:
variables_metadata_in = get_variables_metadata(df_ext, ext_colnames,'./config/variables_metadata_in.json')
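
To make the min/max inference concrete, here is a minimal sketch that derives bounds per column from the observed data. The function `infer_continuous_metadata` and the field names (`name`, `type`, `lower`, `upper`) are illustrative assumptions; they are not taken from `get_variables_metadata` or from the schema the EDDI service expects.

```python
import pandas as pd


def infer_continuous_metadata(df, columns):
    """Build one metadata entry per column from the observed min/max.

    Field names are illustrative assumptions, not the EDDI schema.
    """
    return [
        {
            "name": col,
            "type": "continuous",
            "lower": float(df[col].min()),  # min/max skip NaN by default
            "upper": float(df[col].max()),
        }
        for col in columns
    ]


toy = pd.DataFrame({"IN1": [0.5, None, -1.0], "IN2": [2.0, 3.0, None]})
metadata = infer_continuous_metadata(toy, ["IN1", "IN2"])
print(metadata[0])
```

Bounds inferred this way are only as wide as the data seen so far, which is why expert-provided ranges are preferable when available.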

In [11]:
df_train_in = df_train[in_columns].copy().fillna('')
df_test_in = df_test[in_columns].copy().fillna('')

df_train_in, _ = extend_features_with_temporal_window(df_train_in, to_extend_columns=in_columns, temp_start=1, temp_step_size=1, temp_num_steps=1)
df_test_in, _ = extend_features_with_temporal_window(df_test_in, to_extend_columns=in_columns, temp_start=1, temp_step_size=1, temp_num_steps=1)

df_train_in.to_csv('./data_prepared/train/sensor_wide_in.csv', index=False, header=False)
df_test_in.to_csv('./data_prepared/test/sensor_wide_in.csv', index=False, header=False)
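
The prepared files are written without header or index, and missing entries become empty fields. A small sketch of the resulting layout on toy values (whether the service requires exactly this layout is an assumption based on the cell above):

```python
import io

import pandas as pd

toy = pd.DataFrame({"IN1": [1.0, float("nan")], "IN2": [float("nan"), 2.0]})

buffer = io.StringIO()
# fillna('') turns NaN into empty strings, so they serialize as empty fields.
toy.fillna("").to_csv(buffer, index=False, header=False)

lines = buffer.getvalue().splitlines()
print(lines)
```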

Train EDDI imputation model

In [12]:
# uploads the train data to blob storage & returns the SAS link
training_data_source = get_data_link('./data_prepared/train/sensor_wide_in.csv', container_sas_link, directory_name=blob_storage_root)

In [13]:
train_input = {
    "training_data_source": training_data_source,
    "model_hyperparams":{
        "decoder_variances": 1e-6
    },
    "variables_metadata": variables_metadata_in,
    "training_hyperparams":{
        "epochs": 500,
        "iterations":40
    }
}

In [14]:
model_id = train_eddi(train_input, rest_eddi, wait_for_completion=True)

<Response [200]>
200
'aed984d00f7a4406b51eb82977f47d7f'
Operation status: Completed
train running time: 321.34695172309875 seconds
{'created_time': '2022-04-27T17:59:07.466809+00:00',
 'description': 'Model for efficient decision making with heterogenous data',
 'id': 'aed984d00f7a4406b51eb82977f47d7f'}
https://ms-azua-api.azurewebsites.net/saas-api/models/aed984d00f7a4406b51eb82977f47d7f?api-version=v2.3


Run batch inference on the train and test datasets and store the results

In [15]:
batch_inference_input = {
    "hyperparameters":
    {
        "sample_count": 50
    }
}

In [16]:
# batch-inference on train-dataset
batch_inference_input["data_source"] = training_data_source
batch_inference_input["output"] = get_placeholder_link('sensor_wide_in_impute.csv', container_sas_link, directory_name=blob_storage_root, delete_prev=True)
batchinference_eddi(batch_inference_input, model_id, rest_eddi)

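# download the imputed train data and overwrite the local file that still contains missing values
# (assumption: the imputed result is written to the "output" link; "data_source" is the un-imputed input)
df_train_in_impute = pd.read_csv(batch_inference_input["output"], names=ext_colnames)
df_train_in_impute[in_columns].to_csv('./data_prepared/train/sensor_wide_in.csv', header=True)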

<Response [200]>
200
'a8d685474c804e87a80f34bd0d8123fa'
Operation status: Completed
batchinference running time: 108.77870512008667 seconds
Completed


In [17]:
# batch-inference on test-dataset
batch_inference_input["data_source"] = get_data_link('./data_prepared/test/sensor_wide_in.csv', container_sas_link, directory_name=blob_storage_root)
batch_inference_input["output"] = get_placeholder_link('sensor_wide_in_impute.csv', container_sas_link, directory_name=blob_storage_root, delete_prev=True)
batchinference_eddi(batch_inference_input, model_id, rest_eddi)

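# download the imputed test data and overwrite the local file that still contains missing values
# (assumption: the imputed result is written to the "output" link; "data_source" is the un-imputed input)
df_test_in_impute = pd.read_csv(batch_inference_input["output"], names=ext_colnames)
df_test_in_impute[in_columns].to_csv('./data_prepared/test/sensor_wide_in.csv', header=True)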

<Response [200]>
200
'ac740cf525fa4ce1b5ed163755f2f7c8'
Operation status: Completed
batchinference running time: 72.88185286521912 seconds
Completed
