# Event Handler
--------------------------------------------------------------------

This function is designed to write incoming events to a configurable output.<br>
The following outputs are currently supported:<br>
  * V3IO Stream
        Events are written to a V3IO stream and partitioned across the stream shards based on a configurable field of the incoming events.
  * Parquet
        Each batch of events (1024 records by default) is stored in a Parquet file, partitioned by event time (year, month, day, hour).
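
As a rough illustration of the time-based layout, the sketch below derives the four partition values from an event timestamp and joins them into a Hive-style relative directory (`column=value`), which is the layout pandas typically produces when `partition_cols` is used. The helper name `partition_dir` is hypothetical and is not part of the function; the column names and the default timestamp format mirror the function code later in this notebook.

```python
from datetime import datetime

# Default timestamp format used by the function (TS_FORMAT).
TS_FORMAT = '%Y-%m-%d %H:%M:%S.%f'


def partition_dir(event_time, ts_format=TS_FORMAT):
    """Return the Hive-style relative directory for an event timestamp.

    Hypothetical helper, for illustration only.
    """
    dt = datetime.strptime(event_time, ts_format)
    parts = [('pq_year', dt.strftime('%Y')),
             ('pq_month', dt.strftime('%m')),
             ('pq_day', dt.strftime('%d')),
             ('pq_hour', dt.strftime('%H'))]
    return '/'.join(f'{name}={value}' for name, value in parts)


print(partition_dir('2020-02-02 12:20:22.333332'))
# pq_year=2020/pq_month=02/pq_day=02/pq_hour=12
```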

## Create and Test a Local Function 

[Nuclio](https://nuclio.io/) is a high-performance open-source and managed serverless framework, which is available as a predefined tenant-wide platform service (`nuclio`).
The demo uses Nuclio to create and deploy serverless functions.
Therefore, you need to import the Nuclio package and configure Nuclio for your project.

The platform's Jupyter Notebook service preinstalls the [nuclio-jupyter SDK](https://github.com/nuclio/nuclio-jupyter/blob/master/README.md) for creating and deploying Nuclio functions with Python and Jupyter Notebook.
The tutorial uses the Nuclio magic commands and annotation comments of this SDK to automate function code generation.
The magic commands are initialized when you import the `nuclio` package.<br>
The `%nuclio` magic commands are used to run Nuclio commands from Jupyter notebooks (`%nuclio <Nuclio command>`).
You can also use `%%nuclio` at the start of a cell to identify the entire cell as containing Nuclio code.
The `# nuclio: start-code`, `# nuclio: end-code`, and `# nuclio: ignore` section-marker annotations notify Nuclio of the beginning or end of code sections.
Nuclio ignores all notebook code before a `# nuclio: start-code` marker or after a `# nuclio: end-code` marker.
Nuclio translates all other notebook code sections into function code, except for sections that are marked with the `# nuclio: ignore` marker.
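
To make the marker rules concrete, here is a simplified, self-contained model of the selection logic described above. It only illustrates the documented behavior and is not the nuclio-jupyter implementation: the function name `select_function_cells`, the representation of cells as strings, and the assumption that each marker is the first line of its own cell are all hypothetical simplifications.

```python
START = '# nuclio: start-code'
END = '# nuclio: end-code'
IGNORE = '# nuclio: ignore'


def _first_line(cell):
    """Return the stripped first line of a cell, or '' for an empty cell."""
    lines = cell.strip().splitlines()
    return lines[0].strip() if lines else ''


def select_function_cells(cells):
    """Return the cells that would be kept as function code (simplified model)."""
    markers = [_first_line(cell) for cell in cells]
    # Everything up to and including the start marker is skipped.
    start = markers.index(START) + 1 if START in markers else 0
    # Everything from the end marker onward is skipped.
    end = markers.index(END) if END in markers else len(cells)
    # Cells that begin with the ignore marker are dropped.
    return [cell for cell, marker in zip(cells[start:end], markers[start:end])
            if marker != IGNORE]


notebook = [
    'import nuclio',
    '# nuclio: start-code',
    'import os',
    '# nuclio: ignore\nprint("debug")',
    'def handler(context, event):\n    return "ok"',
    '# nuclio: end-code',
    'handler(None, None)',
]
kept = select_function_cells(notebook)
print(len(kept))  # 2
```

Only the `import os` cell and the `handler` definition survive in this model; the initial import, the ignored debug cell, and the local test call are left out.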

### Import Nuclio

The following code imports the `nuclio` Python package.

In [None]:
import nuclio

#### Configure Nuclio

The following code performs basic Nuclio function configuration: it defines the function's base container image (`mlrun/mlrun`), the readiness timeout (200 seconds), and the function kind (`nuclio`). The `# nuclio: start-code` marker that follows instructs Nuclio to start processing code only from that location.

> **Note:** You can add code to define function dependencies and perform additional configuration after the `# nuclio: start-code` marker.

In [None]:
%%nuclio config
spec.build.baseImage = "mlrun/mlrun"
spec.readinessTimeoutSeconds = 200
kind = "nuclio"

In [None]:
# nuclio: start-code

In [None]:
import os
import pandas as pd
import json
from datetime import datetime
import v3io.dataplane

In [None]:
def init_context(context):
    # Default to 'false' so that an unset flag does not fail on .lower()
    setattr(context, 'stream_sink_flag', os.getenv('STREAM_SINK_FLAG', 'false'))
    setattr(context, 'parquet_sink_flag', os.getenv('PARQUET_SINK_FLAG', 'false'))

    # For writing to parquet
    if context.parquet_sink_flag.lower() == 'true':
        setattr(context, 'batch', [])
        setattr(context, 'batch_size', int(os.getenv('PARQUET_BATCH_SIZE', 1024)))

        setattr(context, 'timestamp_key', os.getenv('TS_KEY'))
        setattr(context, 'timestamp_format', os.getenv('TS_FORMAT', '%Y-%m-%d %H:%M:%S.%f'))

        setattr(context, 'pq_partitions', ['pq_year', 'pq_month', 'pq_day', 'pq_hour'])

        setattr(context, 'target_path', os.getenv('PARQUET_TARGET_PATH'))
        os.makedirs(context.target_path, exist_ok=True)

        # in case of an inference stream set the names of features and predictions.
        features = os.getenv('FEATURES')
        if features is not None:
            features = features.split(',')
        setattr(context, 'features', features)

        predictions = os.getenv('PREDICTIONS')
        if predictions is not None:
            predictions = predictions.split(',')
        setattr(context, 'predictions', predictions)

    # For writing to v3io stream
    if context.stream_sink_flag.lower() == 'true':
        v3io_access_key = os.getenv('V3IO_ACCESS_KEY')
        container = os.getenv('CONTAINER')
        output_stream_path = os.getenv('OUTPUT_STREAM_PATH')
        partition_attr = os.getenv('PARTITION_ATTR')
        v3io_api = os.getenv('V3IO_API')
        v3io_client = v3io.dataplane.Client(endpoint=v3io_api, access_key=v3io_access_key)

        setattr(context, 'v3io_client', v3io_client)
        setattr(context, 'partition_attr', partition_attr)
        setattr(context, 'container', container)
        setattr(context, 'output_stream_path', output_stream_path)
    pass

In [None]:
def handler(context, event):
    if isinstance(event.body, dict):
        event_dict = event.body
    else:
        event_dict = json.loads(event.body)

    context.logger.info_with('Got invoked',
                             trigger_kind=event.trigger.kind,
                             event_body=event_dict)

    if context.stream_sink_flag.lower() == 'true':
        stream_sink_handler(context, event_dict)
    if context.parquet_sink_flag.lower() == 'true':
        parquet_sink_handler(context, event_dict)
    pass


def stream_sink_handler(context, event):
    partition_key = event.get(context.partition_attr)
    record = event_to_record(event, partition_key)
    
    resp = context.v3io_client.stream.put_records(container=context.container,
                                                  stream_path=context.output_stream_path,
                                                  records=[record],
                                                  raise_for_status=v3io.dataplane.RaiseForStatus.never)

    context.logger.info_with('Sent event to stream',
                             record=record,
                             response_status=resp.status_code,
                             response_body=resp.body.decode('utf-8'))
    pass


def event_to_record(event_dict, partition_key):
    event_str = json.dumps(event_dict)
    return {'data': event_str, 'partition_key': str(partition_key)}


def parquet_sink_handler(context, event):
    # for inference events
    if context.features is not None and context.predictions is not None:
        event = flatten_inference_event(context, event)

    event_with_time_partitions = add_time_partition_attributes(context, event)

    # add the incoming event to the current batch
    context.batch.append(event_with_time_partitions)

    # check if batch size reached
    if context.batch_size == len(context.batch):
        written_records = write_batch(context)
        context.logger.info_with('Written batch',
                                 written_records=written_records)
    pass


def flatten_inference_event(context, event):
    # add parsed features to the event
    feature_values = event['request']['instances'][0]
    event.update(zip(context.features, feature_values))

    # add parsed predictions to the event
    prediction_values = event['resp']
    event.update(zip(context.predictions, prediction_values))

    return event


def add_time_partition_attributes(context, event):
    if hasattr(context, 'timestamp_key') and event.get(context.timestamp_key) is not None:
        # parse the event time
        dt_object = datetime.strptime(event[context.timestamp_key], context.timestamp_format)
    else:
        # if event time is missing or not configured, use current datetime
        dt_object = datetime.now()

    # add the partition attributes
    event['pq_year'] = dt_object.strftime('%Y')
    event['pq_month'] = dt_object.strftime('%m')
    event['pq_day'] = dt_object.strftime('%d')
    event['pq_hour'] = dt_object.strftime('%H')

    return event


def write_batch(context):
    df = pd.DataFrame.from_records(context.batch)
    df.to_parquet(path=context.target_path, partition_cols=context.pq_partitions)
    # post write cleanup
    context.batch = []
    return len(df.index)


The following cell uses the `# nuclio: end-code` marker to mark the end of a Nuclio code section and instruct Nuclio to stop parsing the notebook at this point.<br>
> **IMPORTANT:** Do not remove the end-code cell.

In [None]:
# nuclio: end-code
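
For inference streams, `flatten_inference_event` expects the feature values under `request.instances[0]` and the prediction values under `resp`. The following sketch shows the effect on a made-up event. The feature and prediction names (as they might be passed through `FEATURES` and `PREDICTIONS`) and all values are hypothetical, and the standalone `flatten` helper copies the event instead of updating it in place, unlike the function code above.

```python
def flatten(event, features, predictions):
    """Standalone version of the flattening logic, for illustration only."""
    flat = dict(event)  # shallow copy, so the input event is left untouched
    # Pair each feature name with the value at the same position.
    flat.update(zip(features, event['request']['instances'][0]))
    # Pair each prediction name with the value at the same position.
    flat.update(zip(predictions, event['resp']))
    return flat


# Hypothetical comma-separated settings, split the same way as in init_context.
features = 'sepal_length,sepal_width'.split(',')
predictions = 'label'.split(',')

event = {'request': {'instances': [[5.1, 3.5]]}, 'resp': [0]}
flat = flatten(event, features, predictions)
print(flat['sepal_length'], flat['sepal_width'], flat['label'])  # 5.1 3.5 0
```

The original `request` and `resp` keys remain in the flattened event alongside the new top-level columns.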

## Test Locally

In [None]:
import v3io.dataplane

test_path = os.path.join(os.getcwd(), 'test')

# Create a test target dir for the parquet output
target_path = os.path.join(test_path, 'event-handler-pq')
os.makedirs(target_path, exist_ok=True)

# Create a test target stream
v3io_client = v3io.dataplane.Client()
container = 'users'
output_stream_path = os.path.join(test_path.replace('/User', os.getenv('V3IO_USERNAME')), 'event-handler-stream')
v3io_client.stream.create(container=container, stream_path=output_stream_path, shard_count=1)


In [None]:
# set a few parameters via environment variables
envs = {'PARQUET_SINK_FLAG': 'true',
        'STREAM_SINK_FLAG': 'true',
        'PARQUET_TARGET_PATH' : target_path,
        'PARQUET_BATCH_SIZE': 10,
        'TS_KEY': 'event_time',
        'TS_FORMAT': '%Y-%m-%d %H:%M:%S.%f',
        'CONTAINER': container,
        'OUTPUT_STREAM_PATH': output_stream_path,
        'PARTITION_ATTR': 'user_id'}

for key, value in envs.items():
    os.environ[key] = str(value)
init_context(context)

In [None]:
# trigger with 9 events:

nine_events = [b'{"user_id" : 1 , "event_type": "spin", "event_time": "2020-02-02 12:20:22.333332"}',
              b'{"user_id" : 2 , "event_type": "spin", "event_time": "2020-02-02 12:20:23.333332"}',
              b'{"user_id" : 3 , "event_type": "spin", "event_time": "2020-02-02 12:20:24.333332"}',
              b'{"user_id" : 4 , "event_type": "spin", "event_time": "2020-02-02 12:20:25.333332"}',
              b'{"user_id" : 5 , "event_type": "spin", "event_time": "2020-02-02 12:20:26.333332"}',
              b'{"user_id" : 6 , "event_type": "spin", "event_time": "2020-02-02 12:20:27.333332"}',
              b'{"user_id" : 7 , "event_type": "spin", "event_time": "2020-02-02 12:20:28.333332"}',
              b'{"user_id" : 8 , "event_type": "spin", "event_time": "2020-02-02 12:20:29.333332"}',
              b'{"user_id" : 9 , "event_type": "spin", "event_time": "2020-02-02 12:20:30.333332"}']

for e in nine_events:
    event = nuclio.Event(body=e)
    handler(context, event)

In [None]:
# check whether a parquet file has been created (none is expected yet)
!ls -l {target_path}

In [None]:
# trigger the tenth event which should trigger the creation of the parquet file.
tenth_event = b'{"user_id" : 10 , "event_type": "spin", "event_time": "2020-02-02 12:20:31.333332"}'
event = nuclio.Event(body=tenth_event)
handler(context, event)

In [None]:
# check whether a parquet file has been created
!ls -l {target_path}
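
The two listings should differ because the handler writes only when the buffer length equals the configured batch size. The sketch below reproduces that rule in isolation, with a plain list standing in for the Parquet write; the class name `BatchBuffer` and its `flushed` attribute are hypothetical stand-ins, not part of the function.

```python
class BatchBuffer:
    """Collects events and flushes them once batch_size is reached."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.batch = []
        self.flushed = []  # stands in for the Parquet files written so far

    def add(self, event):
        self.batch.append(event)
        if len(self.batch) == self.batch_size:
            self.flushed.append(self.batch)
            self.batch = []


buf = BatchBuffer(batch_size=10)
for user_id in range(1, 10):
    buf.add({'user_id': user_id})
print(len(buf.flushed), len(buf.batch))  # 0 9: nothing flushed yet

buf.add({'user_id': 10})
print(len(buf.flushed), len(buf.batch))  # 1 0: the tenth event triggers the flush
```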

In [None]:
# cleanup
!rm -rf {test_path}

## Nuclio Deploy

Define the `target_path`s for the parquet files

### Convert code to function

We use the MLRun `code_to_function` function to convert the Python code to a Nuclio function. We then set the relevant environment variables and streaming trigger.

In [None]:
from mlrun import code_to_function, mount_v3io

fn = code_to_function(name='event-handler', kind='nuclio')

### Configure function instances
Here we configure a function instance for each stream that we want to write to Parquet.

In [None]:
fn.set_envs(envs)
# Configure a mount on the nuclio function from '/User' to our home directory '~/'.
fn.apply(mount_v3io())


### Deploy

In [None]:
fn.deploy()

## Done