# 1. Event Generator

Generate an event stream that simulates incoming data. In a real deployment, you would supply your own input data instead of this simulated data.

The output is a stream of events called `generated-stream`.

![Model deployment with streaming Real-time operational Pipeline](../../assets/images/model-deployment-with-streaming.png)

The event generator creates the following events: `new_registration`, `new_purchase`, `new_bet` and `new_win`, with the following fields:

| new_registration |   | new_purchase  |   | new_bet    |   | new_win    |
|------------------|---|---------------|---|------------|---|------------|
| user_id          |   | user_id       |   | user_id    |   | user_id    |
| event_type       |   | event_type    |   | event_type |   | event_type |
| event_time       |   | event_time    |   | event_time |   | event_time |
| name             |   | amount        |   | bet_amount |   | win_amount |
| date_of_birth    |   |               |   |            |   |            |
| street_address   |   |               |   |            |   |            |
| city             |   |               |   |            |   |            |
| country          |   |               |   |            |   |            |
| postcode         |   |               |   |            |   |            |
| affiliate_url    |   |               |   |            |   |            |
| campaign         |   |               |   |            |   |            |

In addition, `new_registration` events include a `label` field that indicates whether the user has churned (1 for churned, 0 otherwise).
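
The table and the `label` note above can be summarized as one set of expected fields per event type. The following is a minimal sketch and not part of the original notebook: `COMMON_FIELDS`, `EVENT_FIELDS` and `missing_fields` are hypothetical names introduced here for illustration, and the dictionary keys assume the `event_type` values that the generator functions later in this notebook emit (`registration`, `purchase`, `bet`, `win`).

```python
# Fields shared by every event type (from the table above).
COMMON_FIELDS = {'user_id', 'event_type', 'event_time'}

# Expected fields per event type; the keys are assumed event_type values.
EVENT_FIELDS = {
    'registration': COMMON_FIELDS | {'name', 'date_of_birth', 'street_address',
                                     'city', 'country', 'postcode',
                                     'affiliate_url', 'campaign', 'label'},
    'purchase': COMMON_FIELDS | {'amount'},
    'bet': COMMON_FIELDS | {'bet_amount'},
    'win': COMMON_FIELDS | {'win_amount'},
}

def missing_fields(event):
    """Return the expected fields that are absent from an event dict."""
    expected = EVENT_FIELDS[event['event_type']]
    return expected - set(event)

bet = {'user_id': 'u-1', 'event_type': 'bet',
       'event_time': '2020-08-21 01:29:36', 'bet_amount': 9}
incomplete_purchase = {'user_id': 'u-1', 'event_type': 'purchase',
                       'event_time': '2020-08-21 01:30:00'}

print(missing_fields(bet))                  # set()
print(missing_fields(incomplete_purchase))  # {'amount'}
```

A check like this can be useful when you replace the simulated data with your own input, since downstream steps rely on these field names.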

## Prerequisites

We use the `faker` module to generate data. Run the cell below; if it has to install `faker`, it restarts the notebook's kernel automatically.

In [1]:
import sys
import subprocess
import pkg_resources
import IPython

required = {'faker'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed
previously_installed = required.intersection(installed)

if missing:
    print(f'Installing {",".join(missing)}')
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
    print('Restarting kernel')
    IPython.Application.instance().kernel.do_shutdown(True) #automatically restarts kernel
if previously_installed:
    print(f'Already installed: {",".join(previously_installed)}')

Already installed: faker


## Initialize

Load the project

In [2]:
from mlrun import load_project
from os import path

project_path = path.abspath('conf')
project = load_project(project_path)

Get the path of the generated stream. This is where we write the output data.

In [3]:
container = project.params.get('CONTAINER')
output_stream = project.params.get('STREAM_CONFIGS').get('generated-stream')
output_stream_path = output_stream.get('path')
print(f'Container: {container}\nOutput stream path: {output_stream_path}')

Container: users
Output stream path: iguazio/examples/model-deployment-with-streaming/data/generated-stream


In [4]:
from random import randint, random
from datetime import datetime, timedelta

def gen_postcode(is_churn):
    # if is_churn is true, the postcode modulo 3 is 0 or 1
    # if is_churn is false, the postcode modulo 3 is 0 or 2
    # this encodes information in the postcode that our ML model can learn
    base_postcode = 3 * randint(3334,33333)
    group = randint(0,1)
    if is_churn:
        return base_postcode + group
    else:
        return base_postcode + (group * 2)

# event functions
def new_registration(fake, id, event_time, is_churn):
    return {'user_id': id,
            'event_type': 'registration',
            'event_time': event_time,
            'name':fake.name(),
            'date_of_birth': fake.date(),
            'street_address': fake.street_address(),
            'city': fake.city(),
            'country': fake.country(),
            'postcode': gen_postcode(is_churn),
            'affiliate_url': fake.image_url(),
            'campaign': fake.ean8()}

def new_purchase(fake, id, event_time):
    return {'user_id': id,
            'event_type': 'purchase',
            'event_time': event_time,
            'amount': fake.randomize_nb_elements(number=50)}

def new_bet(fake, id, event_time):
    return {'user_id': id,
            'event_type': 'bet',
            'event_time': event_time,
            'bet_amount': fake.randomize_nb_elements(number=10)}
    
def new_win(fake, id, event_time):
    return {'user_id': id,
            'event_type': 'win',
            'event_time': event_time,
            'win_amount': fake.randomize_nb_elements(number=200)}

def gen_event_date(is_churn, prev_event_date=None):
    if prev_event_date is None:
        #generate first event date
        return datetime.now() - timedelta(hours=randint(48,96))
    else:
        if prev_event_date + timedelta(hours=30) < datetime.now() and not is_churn and randint(1,1000) <= 5:
            # if the user is not churned and it is possible, generate the event on the following day with probability 0.005
            return prev_event_date + timedelta(hours=randint(15,24))
        else:
            return prev_event_date + timedelta(seconds=randint(5,100))
        
def generate_events(fake, user_ids, events_dist, num_events, is_churn):
    events = []
    for id in user_ids:
        # register
        event_time = gen_event_date(is_churn)
        reg_event = new_registration(fake, id, event_time, is_churn)
        reg_event['label'] = int(is_churn)
        events.append(reg_event)
        for _ in range(num_events):
            # generate event according to dist
            acc_prob = 0
            rand = random()
            for event_dist in events_dist:
                if rand <= event_dist['probability']+acc_prob:
                    event_time = gen_event_date(is_churn, event_time)
                    new_event = event_dist['generator'](fake, id, event_time)
                    events.append(new_event)
                    break
                else:
                    acc_prob += event_dist['probability']
    return events

# 70% churn users 
NUM_USERS_GROUP1 = 1400
NUM_USERS_GROUP2 = 600 
NUM_USERS = NUM_USERS_GROUP1+NUM_USERS_GROUP2

EVENTS_PER_USER = 1000

GROUP1_EVENTS_DIST = [{'probability': 0.1, 'generator': new_purchase}, 
                      {'probability': 0.89, 'generator': new_bet}, 
                      {'probability': 0.01, 'generator': new_win}]

GROUP2_EVENTS_DIST = [{'probability': 0.1, 'generator': new_purchase}, 
                      {'probability': 0.85, 'generator': new_bet},
                      {'probability': 0.05, 'generator': new_win}]
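
Two details of the cell above can be checked in isolation: the postcode encoding (the residue modulo 3 depends on the churn flag) and the selection of an event type by walking the cumulative probabilities. The following is a minimal sketch that uses only the standard library. `pick_generator` is a hypothetical helper that mirrors the inner loop of `generate_events`, and this variant of `gen_postcode` takes an explicit `random.Random` instance (an addition made here for reproducibility, not something the notebook does). String labels stand in for the generator functions.

```python
from collections import Counter
from random import Random

def gen_postcode(rng, is_churn):
    # Same arithmetic as gen_postcode in the cell above, with an explicit RNG.
    base_postcode = 3 * rng.randint(3334, 33333)
    group = rng.randint(0, 1)
    return base_postcode + (group if is_churn else group * 2)

def pick_generator(rng, events_dist):
    # Walk the cumulative probabilities until the draw falls inside a bucket.
    draw = rng.random()
    acc_prob = 0.0
    for event_dist in events_dist:
        acc_prob += event_dist['probability']
        if draw <= acc_prob:
            return event_dist['generator']
    # Fallback in case floating-point round-off leaves the total just below 1.
    return events_dist[-1]['generator']

rng = Random(42)

# Residues modulo 3 observed for each user group.
churn_residues = {gen_postcode(rng, True) % 3 for _ in range(1000)}
non_churn_residues = {gen_postcode(rng, False) % 3 for _ in range(1000)}
print(churn_residues, non_churn_residues)

# Same probabilities as GROUP1_EVENTS_DIST, with labels instead of functions.
dist = [{'probability': 0.1, 'generator': 'purchase'},
        {'probability': 0.89, 'generator': 'bet'},
        {'probability': 0.01, 'generator': 'win'}]
num_draws = 100_000
counts = Counter(pick_generator(rng, dist) for _ in range(num_draws))
print({name: count / num_draws for name, count in counts.items()})
```

The observed frequencies should land close to the configured probabilities, and residue 1 should appear only for churned users while residue 2 should appear only for non-churned users.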

## Create V3IO Client

With the dataplane client you can manipulate data in the platform's multi-model data layer, including:
* Objects
* Key-values (NoSQL)
* Streams
* Containers

Under the hood, the client connects through the platform's [web API](https://www.iguazio.com/docs/reference/latest-release/api-reference/web-apis/) and wraps each low-level API with an interface. Calls are blocking, but you can use the batching interface to send multiple requests in parallel for greater performance.

In [5]:
import v3io.dataplane
from os import getenv
v3io_client = v3io.dataplane.Client(endpoint=project.params.get('WEB_API'),
                                    access_key=getenv('V3IO_ACCESS_KEY'))

## Generate Events

In [6]:
import uuid
from faker import Faker

fake = Faker()

group1_user_ids = (str(uuid.uuid4()) for _ in range(NUM_USERS_GROUP1))
group2_user_ids = (str(uuid.uuid4()) for _ in range(NUM_USERS_GROUP2))

group1_events = generate_events(fake, group1_user_ids, GROUP1_EVENTS_DIST, EVENTS_PER_USER, True)
group2_events = generate_events(fake, group2_user_ids, GROUP2_EVENTS_DIST, EVENTS_PER_USER, False)

print(f'Events generated: {len(group1_events)+len(group2_events)}')
print(f'Events preview: {group1_events[1:5]}')

Events generated: 2002000
Events preview: [{'user_id': 'bd28cb66-d330-4186-9d43-eb13a6bb2a2d', 'event_type': 'bet', 'event_time': datetime.datetime(2020, 8, 21, 1, 29, 36, 215472), 'bet_amount': 9}, {'user_id': 'bd28cb66-d330-4186-9d43-eb13a6bb2a2d', 'event_type': 'bet', 'event_time': datetime.datetime(2020, 8, 21, 1, 30, 22, 215472), 'bet_amount': 10}, {'user_id': 'bd28cb66-d330-4186-9d43-eb13a6bb2a2d', 'event_type': 'bet', 'event_time': datetime.datetime(2020, 8, 21, 1, 31, 18, 215472), 'bet_amount': 13}, {'user_id': 'bd28cb66-d330-4186-9d43-eb13a6bb2a2d', 'event_type': 'bet', 'event_time': datetime.datetime(2020, 8, 21, 1, 32, 57, 215472), 'bet_amount': 9}]


## Write generated events to V3IO Stream

#### Sort the events based on their event time

In [7]:
events = (group1_events + group2_events)
events.sort(key=lambda event: event.get('event_time'))

#### Ingest in small batches to V3IO Stream

The sorted events are split into batches of 1,000, and each batch is written to the output stream.

To get the highest possible throughput, send many requests to the data layer and wait for all the responses to arrive, rather than sending each request and waiting for its response. The SDK supports this through batching. Any API call can be made through the client's built-in `batch` object. The call receives the same arguments it would normally receive (except for `raise_for_status`) and returns without waiting for the response. To wait for all pending responses, call `wait()` on the `batch` object:

In [8]:
import json
batch_size = 1000
for i in range(0, len(events), batch_size):
    # Convert the events to records
    records = [{'data': json.dumps(event, default=str)} for event in events[i:i+batch_size]]
    v3io_client.batch.put_records(container=container, path=output_stream_path, records=records)

responses = v3io_client.batch.wait()

The loop above issues all `put_records` requests to the data layer in parallel. The call to `wait()` blocks until either all responses arrive, in which case it returns a `Responses` object containing the `responses` of each call, or an error occurs, in which case an exception is raised. You can pass `raise_for_status` to `wait()` to control which response statuses raise an exception.

> Note: The `batch` object is stateful, so you can only create one batch at a time. However, you can create multiple parallel batches yourself through the client's `create_batch()` interface.
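
The batching loop above relies on two plain-Python steps that can be tried without a platform connection: slicing the event list into fixed-size chunks, and serializing each event with `json.dumps(..., default=str)` so that `datetime` values become strings. The following is a minimal sketch; `to_record_batches` is a hypothetical helper introduced here for illustration and is not part of the v3io SDK.

```python
import json
from datetime import datetime

def to_record_batches(events, batch_size):
    """Yield lists of {'data': <JSON string>} records, at most batch_size each."""
    for start in range(0, len(events), batch_size):
        chunk = events[start:start + batch_size]
        # default=str converts values json cannot encode natively, such as datetime.
        yield [{'data': json.dumps(event, default=str)} for event in chunk]

sample_events = [{'user_id': f'u-{i}', 'event_type': 'bet',
                  'event_time': datetime(2020, 8, 21, 1, 29, 36),
                  'bet_amount': i}
                 for i in range(5)]

batches = list(to_record_batches(sample_events, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
print(batches[0][0]['data'])
```

Each yielded list has the same shape as the `records` argument built in the loop above. Note that `default=str` is a one-way conversion: a consumer that needs `datetime` objects has to parse the string again.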

In [9]:
records_sent = sum(len(json.loads(resp.body)['Records']) for resp in responses)
print(f'Records sent {records_sent}')

failed_records = sum(json.loads(resp.body)['FailedRecordCount'] for resp in responses)

if failed_records > 0:
    print(f'Failed to stream {failed_records}')
else:
    print('All data streamed successfully.')

Records sent 2002000
All data streamed successfully.
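
The counting logic in the cell above can be exercised without a live stream by feeding it stand-in responses. The following is a minimal sketch under stated assumptions: the `SimpleNamespace` objects mimic only the `body` attribute, the body layout (`Records`, `FailedRecordCount`) is taken from the cell above, and the empty `Records` entries are placeholders rather than real API output. `summarize` is a hypothetical helper name.

```python
import json
from types import SimpleNamespace

def summarize(responses):
    """Return (records_reported, records_failed) summed over all responses."""
    # Parse each body once instead of once per metric.
    bodies = [json.loads(resp.body) for resp in responses]
    reported = sum(len(body['Records']) for body in bodies)
    failed = sum(body['FailedRecordCount'] for body in bodies)
    return reported, failed

# Stand-in responses: only the fields read by the cell above are populated.
fake_responses = [
    SimpleNamespace(body=json.dumps({'FailedRecordCount': 0,
                                     'Records': [{}, {}, {}]})),
    SimpleNamespace(body=json.dumps({'FailedRecordCount': 1,
                                     'Records': [{}, {}]})),
]

reported, failed = summarize(fake_responses)
print(f'Records reported: {reported}, failed: {failed}')  # Records reported: 5, failed: 1
```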


## Done

Continue to [**2-incoming-event-handler.ipynb**](2-incoming-event-handler.ipynb) to process the incoming data.