## Data Drift

Data drift occurs when a machine learning model's performance degrades, or otherwise differs, on unseen data compared to its training data because the data distribution changes over time. In this notebook, we explore and visualize data drift on a simple XGBoost model that predicts credit card acceptance based on an applicant's age, credit score, years of education, and years in employment. This demo is a Jupyter notebook counterpart to a preexisting demo, [Data Drift](https://github.com/trustyai-explainability/odh-trustyai-demos.git). Please refer to its `README.md` for more details, as well as for how to perform the following steps in ODH.

### Prerequisites
Follow the instructions in the [Installation](https://github.com/trustyai-explainability/odh-trustyai-demos/tree/main/1-Installation) section. Additionally, follow the instructions in the `Deploy Model` section of the [Data Drift](https://github.com/trustyai-explainability/odh-trustyai-demos/blob/main/3-DataDrift/README.md) demo. Before proceeding, check that you have the following:
- ODH installation
- A TrustyAI Operator
- A model-namespace project containing an instance of the TrustyAI Service
- A model storage container
- A Seldon MLServer serving runtime
- The deployed credit model


### Imports

In [None]:
from json import load
from requests import get, post

### Initialize Metrics Service
To use the metrics service, we first need to provide our OpenShift login token and the route URL of the TrustyAI service. Replace the placeholders below with your own values.

In [None]:
authentication_token = 'YOUR_TOKEN'
trustyai_service_route_url = 'TRUSTYAI_SERVICE_URL'
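Hard-coding a login token in a notebook makes it easy to leak by accident. As a minimal sketch, the two settings could instead be read from environment variables. The variable names `TRUSTYAI_TOKEN` and `TRUSTYAI_SERVICE_URL` and the helper `load_trustyai_settings` are hypothetical and not defined by the demo.

```python
import os


def load_trustyai_settings(env=None):
    """Read the token and service URL from environment variables.

    The variable names used here are illustrative assumptions,
    not something the demo or TrustyAI defines.
    """
    env = os.environ if env is None else env
    token = env.get('TRUSTYAI_TOKEN', '').strip()
    # Strip a trailing slash so f'{url}/{resource_path}' never yields '//'.
    url = env.get('TRUSTYAI_SERVICE_URL', '').strip().rstrip('/')
    if not token or not url:
        raise ValueError('Set TRUSTYAI_TOKEN and TRUSTYAI_SERVICE_URL first')
    return token, url
```

With both variables set, `authentication_token, trustyai_service_route_url = load_trustyai_settings()` would replace the two assignments above.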

In [None]:
def get_from_trustyai(resource_path):
    """Send an authenticated GET to the TrustyAI service and print the reply."""
    url = f'{trustyai_service_route_url}/{resource_path}'
    header = {'Authorization': f'Bearer {authentication_token}'}
    response = get(url, headers=header)
    print(response.text)
    return response


def post_to_trustyai(resource_path, payload):
    """Send an authenticated POST with a JSON payload and print the reply."""
    url = f'{trustyai_service_route_url}/{resource_path}'
    header = {'Authorization': f'Bearer {authentication_token}'}
    response = post(url, headers=header, json=payload)
    print(response.text)
    return response

### Upload Model Training Data To TrustyAI

In [None]:
with open('data/training_data.json', 'r') as inputfile:
    training_data = load(inputfile)

post_to_trustyai('data/upload', payload=training_data)

In [None]:
get_from_trustyai('info')

### Label Data Fields

In [None]:
name_mapping = {
    "modelId": "gaussian-credit-model",
    "inputMapping": {
        "credit_inputs-0": "Age",
        "credit_inputs-1": "Credit Score",
        "credit_inputs-2": "Years of Education",
        "credit_inputs-3": "Years of Employment"
    },
    "outputMapping": {
       "predict-0": "Acceptance Probability"
    }
}

post_to_trustyai('info/names', payload=name_mapping)
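For models with many features, writing the mapping by hand is error-prone. The sketch below builds the same payload from ordered lists of names, assuming the raw field names follow the `<tensor name>-<index>` pattern seen above. The helper `build_name_mapping` is hypothetical.

```python
def build_name_mapping(model_id, input_prefix, input_names,
                       output_prefix, output_names):
    """Build an `info/names` payload shaped like the one above.

    Assumes raw fields are named '<prefix>-<index>', as in this demo.
    """
    return {
        'modelId': model_id,
        'inputMapping': {
            f'{input_prefix}-{i}': name for i, name in enumerate(input_names)
        },
        'outputMapping': {
            f'{output_prefix}-{i}': name for i, name in enumerate(output_names)
        },
    }


generated_mapping = build_name_mapping(
    'gaussian-credit-model',
    'credit_inputs',
    ['Age', 'Credit Score', 'Years of Education', 'Years of Employment'],
    'predict',
    ['Acceptance Probability'],
)
```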

### Register Drift Monitoring

In [None]:
drift_request_payload = {
    'modelId': 'gaussian-credit-model',
    'referenceTag': 'TRAINING',
}

post_to_trustyai('metrics/drift/meanshift/request', drift_request_payload)

In [None]:
get_from_trustyai('metrics/all/requests')

In [None]:
model_server_endpoint = 'http://modelmesh-serving.trustyai-demo:8008'
model_inference_endpoint = f'{model_server_endpoint}/v2/models/gaussian-credit-model/infer'

sample_payload = {
    'inputs': [
        {
            'name': 'credit_inputs',
            'shape': [1, 4],
            'datatype': 'FP64',
            'data': [47., 479., 13., 21.],
        }
    ]
}
post(model_inference_endpoint, json=sample_payload)
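The sample payload above carries a single applicant. As a rough sketch, a small helper can pack several applicants into one request of the same shape. It assumes the endpoint accepts tensor data flattened in row-major order, and the helper name `build_inference_payload` is hypothetical.

```python
def build_inference_payload(rows, input_name='credit_inputs', datatype='FP64'):
    """Pack equal-length feature rows into one inference request body."""
    if not rows:
        raise ValueError('rows must not be empty')
    n_features = len(rows[0])
    if any(len(row) != n_features for row in rows):
        raise ValueError('all rows must have the same number of features')
    return {
        'inputs': [
            {
                'name': input_name,
                'shape': [len(rows), n_features],
                'datatype': datatype,
                # Flattened row-major: all features of row 0, then row 1, ...
                'data': [float(value) for row in rows for value in row],
            }
        ]
    }


# Reproduces the single-applicant payload shown above.
rebuilt_payload = build_inference_payload([[47, 479, 13, 21]])
```

Checking that every row has the same length before sending keeps the declared `shape` consistent with the amount of `data`.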

### Collect "Real-World" Inferences

Let's send the deployed model live data that deviates significantly from the training data in the age and credit score dimensions. TrustyAI should then detect significant data drift for these two features.

![Alt text](gaussian_credit_model_distributions.png)

In [None]:
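# Batch files are numbered 0, 5, ..., 595.
for batch in range(0, 596, 5):
    with open(f"data/data_batches/{batch}.json", 'r') as inputfile:
        input_data = load(inputfile)
    response = post(model_inference_endpoint, json=input_data)
    # Report the HTTP status so failed requests are visible.
    print(f'processed batch {batch} of 595 (HTTP {response.status_code})')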

As observed, the meanshift values for each of the features have changed drastically between the training data and the live data, dropping below 1.0. In particular, the `Age` and `Credit Score` values fall below the 0.05 p-value threshold, so these two features differ significantly from the training data. Thus, it is clear that our model is subject to data drift.
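To build intuition for why a value near 1.0 means "no drift" and a value below 0.05 means "significant drift", here is a minimal sketch of a mean-shift test on one feature. It assumes the metric is a per-feature p-value for the hypothesis that the reference and live data share the same mean. It uses Welch's t statistic with a normal approximation, is not TrustyAI's implementation, and the ages are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_shift_p_value(reference, current):
    """Two-sided p-value for 'both samples share the same mean'.

    Uses Welch's t statistic with a normal approximation to the
    t distribution, which is reasonable only for large samples.
    """
    standard_error = sqrt(variance(reference) / len(reference)
                          + variance(current) / len(current))
    t_statistic = (mean(current) - mean(reference)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(t_statistic)))


# Made-up ages for illustration only; not the demo's data.
training_ages = [30, 35, 40, 45, 50, 55, 60] * 20
live_ages = [age + 15 for age in training_ages]

p_same = mean_shift_p_value(training_ages, training_ages)  # identical samples
p_shifted = mean_shift_p_value(training_ages, live_ages)   # shifted by 15 years
print(p_same, p_shifted)
```

Identical samples give a p-value of 1.0, while the shifted sample falls far below 0.05, mirroring the pattern described for `Age` and `Credit Score`.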