# Concept Drift - Deployer
Deploy a streaming concept drift detector on a labeled stream.  
It initializes the selected drift detectors with the statistics of `base_dataset` and deploys the [concept_drift_streaming](https://github.com/mlrun/functions/blob/master/concept_drift_streaming/concept_drift_streaming.ipynb) function from the hub, <br>
adding a [V3IOStreamTrigger](https://nuclio.io/docs/latest/reference/triggers/v3iostream/) so that the deployed function listens to `input_stream`.

### **Steps**

1. [Data exploration](#Data-exploration)
2. [Creating the input stream](#Creating-the-input-stream)
3. [Importing the function](#Importing-the-function)
4. [Running the function remotely](#Running-the-function-remotely)
5. [Testing the function](#Testing-the-function)

### **Data exploration**

To evaluate a drift detector's performance through its detection metrics, we need to know in advance where the real drifts occur.<br>
This is only possible with synthetic datasets.<br> The scikit-multiflow framework can generate several kinds of synthetic data that simulate drifts.<br>
[Harvard Dataverse](https://dataverse.harvard.edu) provides further explanation of the [dataset used here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5OWRGB), along with other kinds of drifted datasets.<br>
mixed_0101_abrupto has 4 concepts and 3 drifts, at time steps 10000, 20000, and 30000.<br>
The dataset is split into train and test parts: the train part (the first 5000 examples) is used to train the model (easily generated using [sklearn_classifier](https://github.com/mlrun/functions/blob/master/sklearn_classifier/sklearn_classifier.ipynb)). <br>
The test part (which already contains the model's predictions) will be pushed to the input stream in order to detect drifts.
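
To make concrete what detecting drift on a labeled, already-predicted stream involves, here is a minimal sketch of the DDM idea (Drift Detection Method): track the running error rate of the predictions and flag a drift when it rises well above its historical minimum. The class name `SimpleDDM`, its parameters, and the synthetic stream are illustrative assumptions; this is a simplified version and not the code of the hub function.

```python
import math


class SimpleDDM:
    """Simplified DDM: monitors the error rate of (label, prediction) pairs."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples      # samples to see before testing
        self.warning_level = warning_level  # std multiples for a warning
        self.drift_level = drift_level      # std multiples for a drift
        self.n = 0
        self.errors = 0
        self.p_min = math.inf               # error rate at the best point so far
        self.s_min = math.inf               # its standard deviation

    def update(self, label, prediction):
        """Consume one labeled prediction; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(label != prediction)
        if self.n < self.min_samples:
            return "stable"
        p = self.errors / self.n              # running error rate
        s = math.sqrt(p * (1 - p) / self.n)   # its standard deviation
        if p + s < self.p_min + self.s_min:   # remember the best point seen
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Synthetic stream (an assumption for illustration): the "model" is wrong on
# every 10th sample, then on every sample from position 100 onward.
labels = [1] * 200
predictions = [0 if (i % 10 == 9 or i >= 100) else 1 for i in range(200)]

detector = SimpleDDM()
first_drift = next(
    (i for i, (y, y_hat) in enumerate(zip(labels, predictions))
     if detector.update(y, y_hat) == "drift"),
    None,
)
print("first drift flagged at position:", first_drift)
```

The sketch stops at the first drift and never resets its statistics; a production detector would typically reset after signalling a drift so that it can find the next one.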

In [1]:
import pandas as pd
data_path = 'https://s3.wasabisys.com/iguazio/data/function-marketplace-data/concept_drift/mixed_0101_abrupto.csv'
predicted_train_path = 'https://s3.wasabisys.com/iguazio/data/function-marketplace-data/concept_drift/predicted_abrupto_train.csv'
predicted_test_data_path = 'https://s3.wasabisys.com/iguazio/data/function-marketplace-data/concept_drift/predicted_abrupto_test.csv'
# You can find the model used here
models_path = 'https://s3.wasabisys.com/iguazio/models/function-marketplace-models/concept_drift/concept_drift_random_forest.pkl'
original_data = pd.read_csv(data_path)
original_data.head()

Unnamed: 0,X1,X2,X3,X4,class
0,0.0,1.0,0.460101,0.592744,1.0
1,1.0,1.0,0.588788,0.574984,0.0
2,0.0,0.0,0.401641,0.679325,1.0
3,1.0,1.0,0.306076,0.182108,0.0
4,0.0,0.0,0.962847,0.579245,1.0


In [2]:
predicted_test = pd.read_csv(predicted_test_data_path)
predicted_test.tail()

Unnamed: 0,X1,X2,X3,X4,class,predicted_col
34995,0.0,0.0,0.010106,0.647269,0.0,1.0
34996,1.0,1.0,0.293651,0.737291,1.0,0.0
34997,0.0,0.0,0.848546,0.552337,0.0,1.0
34998,1.0,1.0,0.614754,0.859896,1.0,0.0
34999,1.0,0.0,0.265306,0.843716,0.0,1.0


### **Creating the input stream**

In [3]:
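import os

# The container name is the first component of V3IO_HOME, prefixed with '/'
container = os.path.join('/', os.environ['V3IO_HOME'].split('/')[0])
user = os.environ["V3IO_USERNAME"]
# Drop the first 6 characters of the working directory (assumed here to be the '/User/' mount prefix)
rel_path = os.getcwd()[6:] + '/artifacts'

# Stream paths relative to the container (used by the v3io client)
base_input_stream = os.path.join(user, rel_path) + "/inputs_stream"
base_output_stream = os.path.join(user, rel_path) + "/output_stream"
# Full paths including the container (passed to the function as parameters)
input_stream = os.path.join(container, base_input_stream)
output_stream = os.path.join(container, user, rel_path) + "/output_stream"
tsdb_path = os.path.join(container, user, rel_path) + "/output_tsdb"

stream_consumer_group = 'cg45'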
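
The path construction above depends on the notebook's environment, so it can be hard to tell what it produces. The sketch below repeats the same string logic as a pure function with example inputs. The helper name `build_stream_paths` is hypothetical, and the example values (`V3IO_HOME` of `users/dani`, working directory `/User/test/functions/concept_drift`) are assumptions inferred from the paths in the run summary further down, not values confirmed by the source.

```python
import posixpath


def build_stream_paths(v3io_home, username, cwd, prefix_len=6):
    """Mirror the notebook's path logic without touching os.environ or os.getcwd()."""
    container = posixpath.join("/", v3io_home.split("/")[0])
    # Strip the assumed '/User/' prefix (6 characters) from the working directory
    rel_path = cwd[prefix_len:] + "/artifacts"
    base_input_stream = posixpath.join(username, rel_path) + "/inputs_stream"
    base_output_stream = posixpath.join(username, rel_path) + "/output_stream"
    return {
        "container": container,
        "base_input_stream": base_input_stream,
        "base_output_stream": base_output_stream,
        "input_stream": posixpath.join(container, base_input_stream),
        "output_stream": posixpath.join(container, base_output_stream),
        "tsdb_path": posixpath.join(container, username, rel_path) + "/output_tsdb",
    }


# Example inputs are assumptions, chosen to match the run summary in this notebook
paths = build_stream_paths(
    v3io_home="users/dani",
    username="dani",
    cwd="/User/test/functions/concept_drift",
)
for name, value in paths.items():
    print(f"{name:20s} {value}")
```

Note that the fixed 6-character slice only works when the working directory really starts with a 6-character prefix such as `/User/`; in any other layout the relative path would be cut in the wrong place.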

In [4]:
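import v3io.dataplane

client = v3io.dataplane.Client()
# Create the input stream; the status is checked explicitly below instead of being raised by the call
response = client.stream.create(container=container,
                                stream_path=base_input_stream,
                                shard_count=1,
                                raise_for_status=v3io.dataplane.RaiseForStatus.never)
# Accept 204 (created) and 409 (the stream already exists); raise on any other status
response.raise_for_status([409, 204])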

### **Importing the function**

In [5]:
# Importing the function
import mlrun
mlrun.set_environment(project='function-marketplace')

fn = mlrun.import_function("hub://concept_drift:development")
fn.apply(mlrun.auto_mount())

> 2021-10-25 10:27:04,105 [info] created and saved project function-marketplace


<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f145dd80fd0>

### **Running the function remotely**

In [6]:
drift_run = fn.run(name='concept_drift',
                   params={'input_stream'    : input_stream,
                           'consumer_group'  : stream_consumer_group,
                           'output_stream'   : output_stream,
                           'output_tsdb'     : tsdb_path,
                           'tsdb_batch_size' : 1,
                           'models'          : ['ddm', 'eddm', 'pagehinkley'], # defaults
                           'label_col'       : 'class',
                           'prediction_col'  : 'predicted_col',
                           'fn_tag'          : 'development'},
                   inputs={'base_dataset'    : predicted_train_path},
                   artifact_path = os.path.join(os.getcwd(), 'artifacts'))

> 2021-10-25 10:27:04,567 [info] starting run concept_drift uid=fa07c222e77d4eac86d2ce9317aaded1 DB=http://mlrun-api:8080
> 2021-10-25 10:27:04,709 [info] Job is running in the background, pod: concept-drift-ggxgb
> 2021-10-25 10:27:11,199 [info] Loading base dataset
> 2021-10-25 10:27:13,227 [info] Creating models
> 2021-10-25 10:27:13,227 [info] Streaming data to models
> 2021-10-25 10:27:13,347 [info] Logging ready models
> 2021-10-25 10:27:13,487 [info] Deploying Concept Drift Streaming function
> 2021-10-25 10:27:13,490 [info] Starting remote function deploy
2021-10-25 10:27:13  (info) Deploying function
2021-10-25 10:27:13  (info) Building
2021-10-25 10:27:13  (info) Staging files and preparing base images
2021-10-25 10:27:13  (info) Building processor image
2021-10-25 10:27:15  (info) Build complete
2021-10-25 10:27:21  (info) Function deploy complete
> 2021-10-25 10:27:21,797 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-function-marketplace-conce

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
function-marketplace,...17aaded1,0,Oct 25 10:27:10,completed,concept_drift,v3io_user=dani; kind=job; owner=dani; host=concept-drift-ggxgb,base_dataset,"input_stream=/users/dani/test/functions/concept_drift/artifacts/inputs_stream; consumer_group=cg45; output_stream=/users/dani/test/functions/concept_drift/artifacts/output_stream; output_tsdb=/users/dani/test/functions/concept_drift/artifacts/output_tsdb; tsdb_batch_size=1; models=['ddm', 'eddm', 'pagehinkley']; label_col=class; prediction_col=predicted_col; fn_tag=development",,eddm_concept_drift; pagehinkley_concept_drift; ddm_concept_drift





> 2021-10-25 10:27:23,031 [info] run executed, status=completed


### **Testing the function**
> Note that we are testing the deployed function, concept_drift_streaming.

In [7]:
import json
import datetime

# Reshaping the data to V3IOStream format.
def restructure_stream_event(event):
    # Move the feature columns into request.instances; keep the label and metadata at the top level
    instances = [dict()]
    for key in predicted_test.keys():
        if key not in ['when', 'class', 'model', 'worker', 'hostname', 'predicted_col']:
            instances[0].update({key: event.pop(key)})
    event['request'] = {'instances': instances}
    # The model's prediction becomes the response
    event['resp'] = [int(event.pop('predicted_col'))]
    event['when'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    event['model'] = 'sklearn.ensemble.RandomForestClassifier'
    return event


records = json.loads(predicted_test.to_json(orient='records'))
records = [{'data': json.dumps(restructure_stream_event(record))} for record in records]

# Showing the first record
records[0]

{'data': '{"class": 1.0, "request": {"instances": [{"X1": 0.0, "X2": 0.0, "X3": 0.0634475073, "X4": 0.4136568818}]}, "resp": [1], "when": "2021-10-25 10:27:23.152584", "model": "sklearn.ensemble.RandomForestClassifier"}'}

In [8]:
# Creating v3io client
v3io_client = v3io.dataplane.Client()

# Pushing some undrifted data to the input stream
response = v3io_client.stream.put_records(container=container,
                                          stream_path=base_input_stream, 
                                          records=records[4900:5100])

In [9]:
# Getting earliest location in the shard
location = json.loads(v3io_client.stream.seek(container=container,
                                              stream_path=base_input_stream,
                                              shard_id=0,
                                              seek_type='EARLIEST').body)['Location']
# Getting records from input stream
response = v3io_client.stream.get_records(container=container,
                                          stream_path=base_input_stream,
                                          shard_id=0, location=location)
# Showing the last sequence that is written to the input stream
json.loads(response.body)['Records'][-1]

{'SequenceNumber': 200,
 'Data': 'eyJjbGFzcyI6IDAuMCwgInJlcXVlc3QiOiB7Imluc3RhbmNlcyI6IFt7IlgxIjogMC4wLCAiWDIiOiAwLjAsICJYMyI6IDAuMzMzMTYzNjk4OSwgIlg0IjogMC40MjE2NzY1Njg3fV19LCAicmVzcCI6IFsxXSwgIndoZW4iOiAiMjAyMS0xMC0yNSAxMDoyNzoyMy4yOTM3OTgiLCAibW9kZWwiOiAic2tsZWFybi5lbnNlbWJsZS5SYW5kb21Gb3Jlc3RDbGFzc2lmaWVyIn0=',
 'ArrivalTimeSec': 1635157644,
 'ArrivalTimeNSec': 395309631}
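
The `Data` field returned by `get_records` is base64-encoded JSON, which is why the record above is not human-readable. The sketch below shows the decoding step in isolation. The helper name `decode_stream_record` and the sample payload are hypothetical; the record is built locally so that the example runs without a v3io connection.

```python
import base64
import json


def decode_stream_record(record):
    """Return the JSON payload carried in a stream record's base64 'Data' field."""
    return json.loads(base64.b64decode(record["Data"]))


# Hypothetical record, shaped like the entries in the 'Records' list above
payload = {"class": 0.0,
           "request": {"instances": [{"X1": 0.0, "X2": 0.0}]},
           "resp": [1]}
record = {"SequenceNumber": 1,
          "Data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")}

decoded = decode_stream_record(record)
print(decoded["resp"], decoded["class"])
```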

#### Wait a few moments before reading the output stream: the function must first be triggered by the input stream, and only then does it write to the output stream

In [11]:
# Getting earliest location in the shard
location = json.loads(v3io_client.stream.seek(container=container,
                                              stream_path=base_output_stream,
                                              shard_id=0,
                                              seek_type='EARLIEST').body)['Location']
# Getting records from output stream
response = v3io_client.stream.get_records(container=container,
                                          stream_path=base_output_stream,
                                          shard_id=0, location=location)

In [12]:
# Showing the detected changes
import base64
for instance in json.loads(response.body)['Records']:
    seq = instance["SequenceNumber"]
    # Each record's payload is base64-encoded JSON
    data = json.loads(base64.b64decode(instance['Data']))
    if data['ddm_drift'] == 1 or data['eddm_drift'] == 1:
        print(f'sequence number : {seq}, data : {data}')



We can see that the system detected a change at the 106th pushed instance, which corresponds to instance 10006 of the original dataset: <br>
the first 5000 instances were used for training, we started pushing from instance 4900 of the test dataset (instance 9900 of the original dataset), and we pushed only 200 instances.
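
The index arithmetic above can be written out as a small helper. This is a sketch that follows the counting used in the text; the function name `original_position` and the constants are illustrative, and it assumes one output record per pushed record and a purely positional train/test split. Depending on whether positions are counted from 0 or from 1, the result may shift by one.

```python
TRAIN_SIZE = 5000  # first 5000 instances were used for training
PUSH_START = 4900  # pushing started at this instance of the test dataset


def original_position(sequence_number, push_start=PUSH_START, train_size=TRAIN_SIZE):
    """Map a stream sequence number back to a position in the original dataset."""
    return train_size + push_start + sequence_number


detected = original_position(106)
known_drift = 10000  # first drift position stated for mixed_0101_abrupto
print(f"detected at {detected}, {detected - known_drift} instances after the drift")
```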


[Back to the top](#Concept-Drift---Deployer)