# Stream to Parquet

Part of the [network operations](https://github.com/mlrun/demos/tree/0.7.x/network-operations) demo pipeline, this function listens to a labeled stream and writes it to Parquet files.<br>
It also deploys the [virtual_drift](https://github.com/mlrun/functions/tree/master/virtual_drift) function from the hub, which computes drift magnitude metrics between a base dataset t and a dataset u.<br>
Here, as in the demo, the base dataset is the one the model was trained on, and dataset u is the data the model predicted on.<br>
virtual_drift writes its output to TSDB.
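
To make "drift magnitude" concrete, here is a minimal sketch of three common distances between discrete distributions: total variation distance, Hellinger distance, and Kullback-Leibler divergence. Choosing these three is an assumption based on the `tvd`, `helinger`, and `kld` column names in the TSDB output at the end of this page; the actual virtual_drift implementation may differ, for example in smoothing or in how it handles zero probabilities. The distributions `t` and `u` are made-up values.

```python
import math

def tvd(t, u):
    """Total variation distance: half the L1 distance between t and u."""
    return 0.5 * sum(abs(a - b) for a, b in zip(t, u))

def hellinger(t, u):
    """Hellinger distance, bounded in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(t, u)))

def kld(t, u):
    """Kullback-Leibler divergence of t from u; infinite if u is 0 where t is not."""
    total = 0.0
    for a, b in zip(t, u):
        if a == 0:
            continue  # 0 * log(0 / b) is taken as 0
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total

# Hypothetical class distributions for the base dataset (t) and a stream window (u).
t = [0.5, 0.5]
u = [0.4, 0.6]
print(tvd(t, u), hellinger(t, u), kld(t, u))
```

All three are zero when the two distributions are identical and grow as they move apart.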

### **Steps**

1. [Data exploration](#Data-exploration)
2. [Creating the labeled stream](#Creating-the-labeled-stream)
3. [Importing the function](#Importing-the-function)
4. [Running the function remotely](#Running-the-function-remotely)
5. [Testing the function](#Testing-the-function)

### **Data exploration**

To evaluate a drift detector with detection metrics, we need to know in advance where a real drift occurs.<br>
This is only possible with synthetic datasets.<br> The scikit-multiflow framework can generate several kinds of synthetic data that simulate drifts.<br>
[Harvard Dataverse](https://dataverse.harvard.edu) provides further explanations of the [dataset used here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5OWRGB), along with other kinds of drifted datasets.<br>
mixed_0101_abrupto has 4 concepts and 3 drifts, at time steps 10000, 20000, and 30000.<br>
The dataset is split into train and test parts. The train part (the first 5000 examples) is used to train the model, which is easily generated with [sklearn_classifier](https://github.com/mlrun/functions/blob/master/sklearn_classifier/sklearn_classifier.ipynb). <br>
The test part (which the model has already predicted) is pushed to the input stream in order to detect drifts.
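
As a small illustration of the layout described above, the sketch below maps a time step of the original dataset to its concept index, using the drift positions 10000, 20000, and 30000. The helper name `concept_at` and the assumption that a drift takes effect exactly at the listed step are illustrative and not taken from the dataset documentation.

```python
from bisect import bisect_right

DRIFT_STEPS = [10000, 20000, 30000]  # drift positions quoted in the text
TRAIN_SIZE = 5000                    # first examples used for training

def concept_at(step):
    """Return the 0-based concept index for a time step (hypothetical helper)."""
    return bisect_right(DRIFT_STEPS, step)

print(concept_at(0), concept_at(9999), concept_at(10000), concept_at(35000))
# Under this assumption, every training row belongs to the first concept.
print({concept_at(s) for s in range(TRAIN_SIZE)})
```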

In [1]:
import pandas as pd
data_path = 'https://s3.wasabisys.com/iguazio/data/function-marketplace-data/concept_drift/mixed_0101_abrupto.csv'
base_dataset = 'https://s3.wasabisys.com/iguazio/data/function-marketplace-data/concept_drift/predicted_abrupto_train.csv'
# The predicted test data is pushed to the stream
predicted_test_data_path = 'https://s3.wasabisys.com/iguazio/data/function-marketplace-data/concept_drift/predicted_abrupto_test.csv'
# You can find the model used here
models_path = 'https://s3.wasabisys.com/iguazio/models/function-marketplace-models/concept_drift/concept_drift_random_forest.pkl'
original_data = pd.read_csv(data_path)
original_data.head()

Unnamed: 0,X1,X2,X3,X4,class
0,0.0,1.0,0.460101,0.592744,1.0
1,1.0,1.0,0.588788,0.574984,0.0
2,0.0,0.0,0.401641,0.679325,1.0
3,1.0,1.0,0.306076,0.182108,0.0
4,0.0,0.0,0.962847,0.579245,1.0


In [2]:
predicted_test = pd.read_csv(predicted_test_data_path)
predicted_test.tail()

Unnamed: 0,X1,X2,X3,X4,class,predicted_col
34995,0.0,0.0,0.010106,0.647269,0.0,1.0
34996,1.0,1.0,0.293651,0.737291,1.0,0.0
34997,0.0,0.0,0.848546,0.552337,0.0,1.0
34998,1.0,1.0,0.614754,0.859896,1.0,0.0
34999,1.0,0.0,0.265306,0.843716,0.0,1.0


### **Creating the labeled stream**

In [3]:
import os

# The container name is the first component of V3IO_HOME
container = os.path.join('/', os.environ['V3IO_HOME'].split('/')[0])
user = os.environ["V3IO_USERNAME"]
# Drop the leading '/User/' from the working directory
rel_path = os.getcwd()[6:] + '/artifacts'

# Stream and TSDB paths, relative to the container
base_input_stream = os.path.join(user, rel_path) + "/inputs_stream"
base_output_stream = os.path.join(user, rel_path) + "/output_stream"
input_stream = os.path.join(container, base_input_stream)
tsdb_path = os.path.join(user, rel_path) + "/output_tsdb"

stream_consumer_group = 's2p'
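
To see what these variables end up holding, the sketch below repeats the same string operations with hypothetical values (`V3IO_HOME='users/admin'`, `V3IO_USERNAME='admin'`, working directory `/User/demo`). It uses `posixpath` so the result does not depend on the operating system; real values depend on your cluster.

```python
import posixpath

# Hypothetical environment, for illustration only.
v3io_home = 'users/admin'
user = 'admin'
cwd = '/User/demo'

container = posixpath.join('/', v3io_home.split('/')[0])
rel_path = cwd[6:] + '/artifacts'  # drop the leading '/User/'
base_input_stream = posixpath.join(user, rel_path) + '/inputs_stream'
input_stream = posixpath.join(container, base_input_stream)
tsdb_path = posixpath.join(user, rel_path) + '/output_tsdb'

print(container)
print(base_input_stream)
print(input_stream)
print(tsdb_path)
```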

In [4]:
import v3io.dataplane

client = v3io.dataplane.Client()
# Create the input stream; the status is checked explicitly below
response = client.stream.create(container=container,
                                stream_path=base_input_stream,
                                shard_count=1,
                                raise_for_status=v3io.dataplane.RaiseForStatus.never)
# Accept 204 (created) and 409 (the stream already exists)
response.raise_for_status([409, 204])

### **Importing the function**

In [5]:
import mlrun

# Importing the function
mlrun.set_environment(project='function-marketplace')

fn = mlrun.import_function("hub://stream_to_parquet:development")
fn.apply(mlrun.auto_mount())

fn.add_v3io_stream_trigger(stream_path=input_stream, name='stream', group=stream_consumer_group)

> 2021-10-26 14:37:45,224 [info] created and saved project function-marketplace


### **Running the function remotely**

In [6]:
# Configure the function through environment variables
fn.set_envs({'window': 200,
             'save_to': os.path.join('/User', rel_path, 'inference_pq'),
             'prediction_col': 'predicted_col',
             'label_col': 'class',
             'base_dataset': base_dataset,
             'results_tsdb_container': container[1:],
             'results_tsdb_table': tsdb_path,
             'mount_path': os.path.join(container, user),
             'mount_remote': container,
             'artifact_path': os.path.join('/User', rel_path)})

fn.deploy()

> 2021-10-26 14:37:45,513 [info] Starting remote function deploy
2021-10-26 14:37:45  (info) Deploying function
2021-10-26 14:37:45  (info) Building
2021-10-26 14:37:45  (info) Staging files and preparing base images
2021-10-26 14:37:45  (info) Building processor image
2021-10-26 14:37:47  (info) Build complete
2021-10-26 14:37:55  (info) Function deploy complete
> 2021-10-26 14:37:55,689 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-function-marketplace-stream-to-parquet.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['default-tenant.app.dev39.lab.iguazeng.com:31445']}


'http://default-tenant.app.dev39.lab.iguazeng.com:31445'

### **Testing the function**

In [7]:
import json
import datetime

# Reshape a flat record into the event layout pushed to the V3IO stream.
def restructure_stream_event(context, event):
    reserved = ['when', 'model', 'worker', 'hostname', 'predicted_col']
    instance = {}
    # Move the feature and label columns into the single request instance.
    for key in predicted_test.keys():
        if key not in reserved:
            instance[key] = event.pop(key)
    # The prediction is kept in the instance as well as in 'resp'.
    instance['predicted_col'] = event.get('predicted_col')
    event['request'] = {'instances': [instance]}
    event['resp'] = [int(event.pop('predicted_col'))]
    event['when'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    event['model'] = 'sklearn.ensemble.RandomForestClassifier'
    return event


records = json.loads(predicted_test.to_json(orient='records'))
# The context argument is unused here, so None is passed.
records = [{'data': json.dumps(restructure_stream_event(None, record))} for record in records]

# showing first record
records[0]

{'data': '{"request": {"instances": [{"X1": 0.0, "X2": 0.0, "X3": 0.0634475073, "X4": 0.4136568818, "class": 1.0, "predicted_col": 1.0}]}, "resp": [1], "when": "2021-10-26 14:37:55.864974", "model": "sklearn.ensemble.RandomForestClassifier"}'}
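
The reshaping can be tried on a single hand-written record, without a stream or the dataset. The sketch below is a standalone variant of the cell above: it takes the column names from the record itself instead of `predicted_test`, and the function name `restructure`, the sample values, and the fixed timestamp are made up for illustration.

```python
import json

def restructure(record, when):
    """Standalone variant of restructure_stream_event (illustrative only)."""
    event = dict(record)  # work on a copy so the input is not mutated
    instance = {k: event.pop(k) for k in list(event) if k != 'predicted_col'}
    instance['predicted_col'] = event['predicted_col']
    return {
        'request': {'instances': [instance]},
        'resp': [int(event.pop('predicted_col'))],
        'when': when,
        'model': 'sklearn.ensemble.RandomForestClassifier',
    }

sample = {'X1': 0.0, 'X2': 1.0, 'X3': 0.25, 'X4': 0.75, 'class': 1.0, 'predicted_col': 0.0}
event = restructure(sample, when='2021-10-26 14:37:55.000000')
print(json.dumps(event, indent=2))
```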

In [8]:
# Push the first 20,000 records to the input stream in batches of 500
step = 500
for i in range(0, 20000, step):
    response = client.stream.put_records(container=container,
                                         stream_path=base_input_stream,
                                         records=records[i:i+step])
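
The loop above sends records in slices of 500 and stops at index 20000. The slicing can be checked on its own, without a stream; `batches` is a hypothetical helper, and the integer list is only a stand-in for the stream records.

```python
def batches(items, size, limit=None):
    """Yield consecutive slices of at most `size` items, up to `limit` items."""
    end = len(items) if limit is None else min(limit, len(items))
    for start in range(0, end, size):
        yield items[start:min(start + size, end)]

fake_records = list(range(35000))  # stand-in for the stream records
chunks = list(batches(fake_records, size=500, limit=20000))
print(len(chunks), len(chunks[0]), chunks[-1][-1])
```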

In [13]:
# Reading from TSDB
import v3io_frames as v3f

v3f_client = v3f.Client(os.environ["V3IO_FRAMESD"],container=container[1:])
v3f_client.read(backend='tsdb',table=tsdb_path)

time,class_shift_helinger,class_shift_kld,class_shift_tvd,prior_helinger,prior_kld,prior_tvd,stream
2021-10-26 14:38:08.027000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream
2021-10-26 14:38:08.699000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream
2021-10-26 14:38:09.599000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream
2021-10-26 14:38:10.759000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream
2021-10-26 14:38:11.561000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream
...,...,...,...,...,...,...,...
2021-10-26 14:39:42.037000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream
2021-10-26 14:39:42.191000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream
2021-10-26 14:39:42.586000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream
2021-10-26 14:39:42.816000+00:00,0.001759,0.000025,0.002488,1.0,10.0,1.0,some_stream


[Back to the top](#Stream-to-Parquet)