# 0. Setup

<a id="model-deployment-with-streaming"></a>
## Model Deployment with Streaming

Deploy a model with streaming information. [The demo](model-deployment-with-streaming/0-setup.ipynb) covers the use case of 1<sup>st</sup>-day churn.

The importance of 1<sup>st</sup>-day churn prediction:
- In some segments of the gaming industry, the average 1<sup>st</sup>-day churn is as high as 70%.
- Acquiring new customers is 5x&ndash;25x more expensive than retaining existing ones.
- Reducing churn by just 5% can boost profitability by 75%.
- Improving retention has a 2x&ndash;4x greater impact on growth than acquisition.
- The probability of selling to an existing customer is 60%&ndash;70%, but only 5%&ndash;20% for a prospect.
- Churn rate also informs metrics like customer lifetime value (LTV).

This demo consists of several steps:

![Model deployment with streaming Real-time operational Pipeline](../../assets/images/model-deployment-with-streaming.png)

While this demo covers the use case of 1<sup>st</sup>-day churn, it is easy to replace the data, the related features, and the trained model, and to reuse the same workflow for other business cases.

These steps are covered by the following notebooks:

- **0. Setup** (this notebook) - Creates the project and the relevant streams.
- [**1. Event generator**](1-event-generator.ipynb) - Generates events for training and serving.
- [**2. Incoming event handler**](2-incoming-event-handler.ipynb) - Receives data from the input. This is a common input stream for all the data, so you can replace the event source (in this case, a data generator) without affecting the rest of the flow.
- [**2a. Stream to parquet**](2a-stream-to-parquet.ipynb) - Stores all incoming data in parquet files.
- [**3a. Enrichment table**](3a-enrichment-table.ipynb) - Creates an enrichment table (lookup values).
- [**3b. Enrich stream**](3b-enrich-stream.ipynb) - Enriches the stream using the enrichment table.
- [**4. Stream to features**](4-stream-to-features.ipynb) - Updates the aggregation features using the incoming event handler.
- [**5. Serving**](5-serving.ipynb) - Serves the model and processes the data from the enriched stream and the aggregation features.

This demo comes with a pre-trained model that uses the base features, enrichment data, and derived features, all calculated from the same generated data. You can retrain the model or train a new one by opening and running the [**4b. optional training notebook**](4b.-optional-training.ipynb). To train a new model, first make sure that enough data has been collected from the streams into the data storage.

## About this demo

### Input Data

The event generator ([1-event-generator.ipynb](1-event-generator.ipynb)) creates the following events: `new_registration`, `new_purchases`, `new_bet` and `new_win` with the following data:

| new_registration |   | new_purchases |   | new_bet    |   | new_win    |
|------------------|---|---------------|---|------------|---|------------|
| user_id          |   | user_id       |   | user_id    |   | user_id    |
| event_type       |   | event_type    |   | event_type |   | event_type |
| event_time       |   | event_time    |   | event_time |   | event_time |
| name             |   | amount        |   | bet_amount |   | win_amount |
| date_of_birth    |   |               |   |            |   |            |
| street_address   |   |               |   |            |   |            |
| city             |   |               |   |            |   |            |
| country          |   |               |   |            |   |            |
| postcode         |   |               |   |            |   |            |
| affiliate_url    |   |               |   |            |   |            |
| campaign         |   |               |   |            |   |            |

In addition, `new_registration` includes a `label` column that indicates whether the user has churned (1 for churned, 0 for not churned).
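
To make the schema more concrete, here is a rough sketch of how two of these events might look as Python dictionaries. All values are invented for illustration, and the timestamp format is an assumption; the event generator notebook defines the real ones. `COMMON_FIELDS` and `has_common_fields` are hypothetical helpers, not part of the demo.

```python
# Fields shared by every event type, as listed in the table above.
COMMON_FIELDS = ('user_id', 'event_type', 'event_time')

# Illustrative values only; the real generator may use different formats.
sample_bet = {
    'user_id': 'u-0001',
    'event_type': 'new_bet',
    'event_time': '2020-01-01 12:00:00',
    'bet_amount': 25.0,
}
sample_win = {
    'user_id': 'u-0001',
    'event_type': 'new_win',
    'event_time': '2020-01-01 12:00:05',
    'win_amount': 40.0,
}


def has_common_fields(event):
    """Return True if the event carries all the fields shared by every event type."""
    return all(field in event for field in COMMON_FIELDS)
```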

### Enrichment

The enrichment table ([3a-enrichment-table.ipynb](3a-enrichment-table.ipynb)) contains a lookup of postcode and returns a socioeconomic index (`socioeconomic_idx`). The enriched stream contains the original data and the enriched data.
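
The lookup can be pictured with a small sketch in which a plain dictionary stands in for the platform's key-value table. The postcodes, index values, and the `enrich_event` helper are assumptions made for illustration; the actual enrichment is implemented in the notebooks linked above.

```python
# Stand-in for the key-value enrichment table: postcode -> socioeconomic index.
# The postcodes and index values here are invented for illustration.
enrichment_table = {'12345': 3, '67890': 7}


def enrich_event(event, table, default=None):
    """Return a copy of the event with a `socioeconomic_idx` field added."""
    enriched = dict(event)  # copy, so the original event stays untouched
    enriched['socioeconomic_idx'] = table.get(event.get('postcode'), default)
    return enriched


registration = {
    'user_id': 'u-0001',
    'event_type': 'new_registration',
    'postcode': '12345',
}
enriched = enrich_event(registration, enrichment_table)
```

The enriched record keeps every original field and gains `socioeconomic_idx`; an unknown postcode falls back to `default`.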

### Feature calculation

During the feature calculation ([4-stream-to-features.ipynb](4-stream-to-features.ipynb)), we calculate the sum, mean, count, and variance of the three amount fields (`amount`, `bet_amount`, and `win_amount` for `new_purchases`, `new_bet`, and `new_win`, respectively). This results in the following list of fields:

- purchase_sum
- purchase_mean
- purchase_count
- purchase_var
- bet_sum
- bet_mean
- bet_count
- bet_var
- win_sum
- win_mean
- win_count
- win_var
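
As a rough, offline illustration of these aggregations, the sketch below computes the same twelve fields from a list of event dictionaries. It is not the streaming implementation used in the demo: `AMOUNT_FIELDS` and `aggregate_features` are hypothetical names, and the use of population variance (and of `0.0` for empty groups) is an assumption.

```python
from statistics import mean, pvariance

# Event type -> (amount field, feature-name prefix), following the text above.
AMOUNT_FIELDS = {
    'new_purchases': ('amount', 'purchase'),
    'new_bet': ('bet_amount', 'bet'),
    'new_win': ('win_amount', 'win'),
}


def aggregate_features(events):
    """Compute sum, mean, count, and variance per amount field for one user."""
    values = {prefix: [] for _, prefix in AMOUNT_FIELDS.values()}
    for event in events:
        spec = AMOUNT_FIELDS.get(event['event_type'])
        if spec is None:
            continue  # e.g. new_registration has no amount field
        field, prefix = spec
        values[prefix].append(event[field])

    features = {}
    for prefix, amounts in values.items():
        features[f'{prefix}_sum'] = sum(amounts)
        features[f'{prefix}_mean'] = mean(amounts) if amounts else 0.0
        features[f'{prefix}_count'] = len(amounts)
        # Population variance is an assumption; the demo may use sample variance.
        features[f'{prefix}_var'] = pvariance(amounts) if amounts else 0.0
    return features


example_features = aggregate_features([
    {'event_type': 'new_purchases', 'amount': 10},
    {'event_type': 'new_purchases', 'amount': 20},
    {'event_type': 'new_bet', 'bet_amount': 5},
])
```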

## Prerequisites

<a id="gs-mlrun-install"></a>The tutorial uses MLRun to create a project, implement and execute an ML pipeline, and track the execution.
(For more information about MLRun, see Step 1.)
To use MLRun, you must first ensure that it's installed and running as a service on your platform cluster.
Look for an `mlrun` service on the **Services** page of the platform dashboard.
For more information and additional assistance, contact the Iguazio [support team](mailto:support@iguazio.com).

To use MLRun from Jupyter Notebook, you need to run the following code to install the `mlrun` Python package.
This needs to be done only once per Jupyter Notebook service.
> **Note:** You must **restart the Jupyter kernel** to complete the installation.

In [1]:
import sys
import subprocess
import pkg_resources
import IPython

required = {'mlrun'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed
previously_installed = required.intersection(installed)

if missing:
    print(f'Installing {",".join(missing)}')
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
    print('Restarting kernel')
    IPython.Application.instance().kernel.do_shutdown(True) #automatically restarts kernel
if previously_installed:
    print(f'Already installed: {",".join(previously_installed)}')

Already installed: mlrun
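
On recent Python versions `pkg_resources` is deprecated, so the same "which packages are missing?" check could also be sketched with the standard-library `importlib.metadata` module (Python 3.8 or later). `find_missing` is a hypothetical helper name; the cell above remains the notebook's own approach.

```python
from importlib import metadata


def find_missing(required):
    """Return the subset of `required` distribution names that are not installed."""
    missing = set()
    for name in required:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.add(name)
    return missing
```

For example, `find_missing({'mlrun'})` returns an empty set when `mlrun` is already installed.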


## Configure

The configuration below is shared across the notebooks. Change the values in this subsection if you would like different configuration settings.

### Project

Projects in the platform are used to package multiple functions, workflows, and artifacts. Set the project base name here.

In [2]:
PROJECT_BASE_NAME = "model-deployment-with-streaming"

### Data

All data in the platform is stored in user-defined data containers. In this case we use the predefined "users" container. For more information, refer to the [Data containers, collections, and objects documentation](https://www.iguazio.com/docs/latest-release/concepts/containers-collections-objects).

In [3]:
CONTAINER = 'users'

Data path in which to store the stream data and the key-value (KV) tables:

In [4]:
from os import getenv, path

V3IO_USERNAME = getenv('V3IO_USERNAME')
DATA_PATH = path.join(V3IO_USERNAME, 'examples', PROJECT_BASE_NAME, 'data')

Define the path and shard count of each stream:

In [5]:
from urllib.parse import urljoin
WEB_API = "http://v3io-webapi:8081"
WEB_API_USERS = urljoin(WEB_API, CONTAINER)
STREAM_CONFIGS = {'generated-stream': {
                        'path': path.join(DATA_PATH, 'generated-stream'),
                        'shard_count': 8},
                  'incoming-events-stream': {
                        'path': path.join(DATA_PATH, 'incoming-events-stream'),
                        'shard_count': 8
                  },
                  'enriched-events-stream': {
                        'path': path.join(DATA_PATH, 'enriched-events-stream'),
                        'shard_count': 8
                  },
                  'serving-stream': {
                        'path': path.join(DATA_PATH, 'serving-stream'),
                        'shard_count': 8
                  },
                  'inference-stream': {
                        'path': path.join(DATA_PATH, 'inference-stream'),
                        'shard_count': 8
                  }
                 }
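
A side note on `urljoin`, which is used above to build `WEB_API_USERS`: its result depends on whether the base URL ends with a slash. The sketch below only demonstrates standard-library behavior; the `some/stream` path is a made-up example.

```python
from urllib.parse import urljoin

base = 'http://v3io-webapi:8081'

# A base URL without a path: the relative part is simply appended.
container_url = urljoin(base, 'users')

# Without a trailing slash, the last path segment of the base is replaced...
replaced = urljoin('http://v3io-webapi:8081/users', 'some/stream')
# ...whereas with a trailing slash it is kept.
appended = urljoin('http://v3io-webapi:8081/users/', 'some/stream')
```

If you extend `WEB_API_USERS` with further path segments, appending a trailing slash first avoids silently dropping the container name.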

When we stream data, we associate the records with a specific partition key to ensure that similar records are assigned to the same shard. For more information, see the [stream sharding and partitioning description](https://www.iguazio.com/docs/latest-release/concepts/streams/#stream-sharding-and-partitioning).

In [6]:
PARTITION_ATTR = "user_id"
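
The idea behind a partition key can be sketched as a deterministic mapping from the key to a shard index. The hash function below is an assumption chosen for illustration; the platform's actual assignment logic may differ, and `shard_for_key` is a hypothetical helper.

```python
import hashlib


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index in the range [0, shard_count)."""
    digest = hashlib.sha256(str(partition_key).encode('utf-8')).digest()
    # Interpret the first 8 bytes as an integer and fold it onto the shards.
    return int.from_bytes(digest[:8], 'big') % shard_count


# All events of the same user land on the same shard.
first = shard_for_key('u-0001', 8)
second = shard_for_key('u-0001', 8)
```

Because the mapping depends only on the key, using `user_id` as the partition attribute keeps each user's events in order within a single shard.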

Target path for storing the raw data as parquet files.
The parquet files are written through the file mount, so the path is configured to start with '/User', which is mounted to the user's home directory.

In [7]:
# Replace only the leading username directory (count=1) with the '/User' mount
PARQUET_TARGET_PATH = path.join(DATA_PATH.replace(V3IO_USERNAME, '/User', 1), 'events-pq')
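
The mapping from a container path to the file mount can also be written as a small helper that swaps only the leading username directory. This is a sketch under the assumption that the data path always starts with `<username>/`; `to_mounted_path` is a hypothetical name.

```python
import posixpath


def to_mounted_path(data_path, username, mount_point='/User'):
    """Swap the leading `<username>/` directory of a container path for the file mount."""
    prefix = username + '/'
    if not data_path.startswith(prefix):
        raise ValueError(f'{data_path!r} does not start with {prefix!r}')
    return posixpath.join(mount_point, data_path[len(prefix):])


example_mounted = to_mounted_path(
    'iguazio/examples/model-deployment-with-streaming/data', 'iguazio')
```

Unlike a plain string replacement, this leaves any later occurrence of the username in the path untouched.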

Target path for storing the enrichment table (a key-value table):

In [8]:
ENRICHMENT_TABLE_PATH = path.join(DATA_PATH, 'enrichment-table')

Target path for storing the calculated features:

In [9]:
FEATURE_TABLE_PATH = path.join(DATA_PATH, 'feature-table')

## Create V3IO Client

With the dataplane client you can manipulate data in the platform's multi-model data layer, including:
* Objects
* Key-values (NoSQL)
* Streams
* Containers

Under the hood, the client connects through the platform's [web API](https://www.iguazio.com/docs/reference/latest-release/api-reference/web-apis/) and wraps each low-level API with an interface. Calls are blocking, but you can use the batching interface to send multiple requests in parallel for greater performance.

In [10]:
import v3io.dataplane
v3io_client = v3io.dataplane.Client(endpoint=WEB_API,
                                    access_key=getenv('V3IO_ACCESS_KEY'))

## Manage Streams

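### Delete all streams

Clean up any streams left over from previous runs. A 404 status in the output means that the stream did not exist, which is expected on a first run.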

In [11]:
for stream_name, stream_config in STREAM_CONFIGS.items():
    resp = v3io_client.delete_stream(container=CONTAINER, path=stream_config['path'], 
                                     raise_for_status=v3io.dataplane.RaiseForStatus.never)
    print(f'Delete Stream call for stream {stream_name} returned with status {resp.status_code}, and content: {resp.body.decode("utf-8")}')

Delete Stream call for stream generated-stream returned with status 404, and content: {
	"ErrorCode": -2,
	"ErrorMessage": "No such file or directory"
}
Delete Stream call for stream incoming-events-stream returned with status 404, and content: {
	"ErrorCode": -2,
	"ErrorMessage": "No such file or directory"
}
Delete Stream call for stream enriched-events-stream returned with status 404, and content: {
	"ErrorCode": -2,
	"ErrorMessage": "No such file or directory"
}
Delete Stream call for stream serving-stream returned with status 404, and content: {
	"ErrorCode": -2,
	"ErrorMessage": "No such file or directory"
}
Delete Stream call for stream inference-stream returned with status 404, and content: {
	"ErrorCode": -2,
	"ErrorMessage": "No such file or directory"
}


### Create all streams

In [12]:
for stream_name, stream_config in STREAM_CONFIGS.items():
    print(stream_config['path'])
    resp = v3io_client.create_stream(container=CONTAINER,
                                     path=stream_config['path'],
                                     shard_count=stream_config['shard_count'],
                                     raise_for_status=v3io.dataplane.RaiseForStatus.never)
    print(f'Create Stream call for stream {stream_name} returned with status {resp.status_code}, and content: {resp.body.decode("utf-8")}')

iguazio/examples/model-deployment-with-streaming/data/generated-stream
Create Stream call for stream generated-stream returned with status 204, and content: 
iguazio/examples/model-deployment-with-streaming/data/incoming-events-stream
Create Stream call for stream incoming-events-stream returned with status 204, and content: 
iguazio/examples/model-deployment-with-streaming/data/enriched-events-stream
Create Stream call for stream enriched-events-stream returned with status 204, and content: 
iguazio/examples/model-deployment-with-streaming/data/serving-stream
Create Stream call for stream serving-stream returned with status 204, and content: 
iguazio/examples/model-deployment-with-streaming/data/inference-stream
Create Stream call for stream inference-stream returned with status 204, and content: 
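
Because both loops pass `RaiseForStatus.never`, failed calls are only visible in the printed status codes. A small helper along the following lines could make the outcome explicit. It is a sketch based only on the statuses shown in the outputs above (204 on success, 404 when deleting a stream that does not exist); `summarize_status` is a hypothetical name.

```python
def summarize_status(action, status_code):
    """Classify a stream-management response by its HTTP status code."""
    if 200 <= status_code < 300:
        return 'ok'
    if action == 'delete' and status_code == 404:
        # Nothing to delete, which is fine for a cleanup step.
        return 'ok (stream did not exist)'
    return 'failed'
```

For example, inside the loops you could print `summarize_status('delete', resp.status_code)` or `summarize_status('create', resp.status_code)` next to each stream name.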


## Set Up the MLRun Project

Projects are created by using the `new_project` MLRun method, which receives the following parameters:

- **`name`** (Required) &mdash; the project name.
- **`context`** &mdash; the path to a local project directory (the project's context directory).
  The project directory contains a project-configuration file (default: **project.yaml**), which defines the project, and additional generated Python code.
  The project file is created when you save your project (using the `save` MLRun project method), as demonstrated in the **Save the Project** section of this notebook.
- **`functions`** &mdash; a list of function objects or links to function code or objects.
- **`init_git`** &mdash; set to `True` to perform Git initialization of the project directory (`context`).
  > **Note:** It's customary to store project code and definitions in a Git repository.

Projects are visible in the MLRun dashboard only after they're saved to the MLRun database, which happens whenever you run code for a project.

The following code creates a project using the `PROJECT_BASE_NAME`, concatenated with your current running username in the platform (**&lt;V3IO_USERNAME&gt;**), and sets the project directory to a **conf** directory in the current demo directory (for example, **/User/demos/model-deployment-with-streaming/conf**; the exact path depends on where the demo is located).

> **Note:** Platform projects are shared among all users of the parent tenant, to facilitate collaboration. Therefore,
>
> - Synchronize the execution of your projects with other users on your platform cluster, as needed, or use unique project names to avoid conflicts.
>   You can easily change the default project name for this tutorial by changing the definition of the `PROJECT_BASE_NAME` variable, defined at the beginning of the notebook.
> - Don't include in your project proprietary information that you don't want to expose to other users.
>   Note that while projects are a useful tool, you can easily develop and run code in the platform without using projects.

In [13]:
from mlrun import new_project

project_name = '-'.join(filter(None, [PROJECT_BASE_NAME, getenv('V3IO_USERNAME', None)]))
project_path = path.abspath('conf')
project = new_project(project_name, project_path, init_git=True)

print(f'Project path: {project_path}\nProject name: {project_name}')

Project path: /User/work/tutorials/demos/model-deployment-with-streaming/conf
Project name: model-deployment-with-streaming-iguazio
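
The `'-'.join(filter(None, ...))` idiom used above skips any missing part, so the project name stays valid even when `V3IO_USERNAME` is not set. A minimal standalone sketch of that behavior (`build_project_name` is a hypothetical wrapper, not an MLRun function):

```python
def build_project_name(base_name, username):
    """Join the base name and the username with '-', skipping missing parts."""
    return '-'.join(filter(None, [base_name, username]))


with_user = build_project_name('model-deployment-with-streaming', 'iguazio')
without_user = build_project_name('model-deployment-with-streaming', None)
```

With a username the result matches the output shown above; without one, the base name is used on its own and no trailing hyphen is added.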


[MLRun](https://github.com/mlrun/mlrun) is a generic and convenient mechanism for data scientists and software developers to describe and run tasks related to machine learning in various, scalable runtime environments and ML pipelines while automatically tracking executed code, metadata, inputs, and outputs.
MLRun integrates with the Nuclio serverless framework and with the Kubeflow Pipelines framework for running ML pipelines.
The demo uses MLRun to create a project, run Nuclio serverless functions, as well as run the model training.
Before running your code, you need to set some MLRun configurations:

- <a id="gs-mlrun-config-artifcats-path"></a>**Artifacts path** &mdash; the location for storing versioned data artifacts (such as files, objects, data sets, and models) that are produced or consumed by functions, runs, and workflows.
  The path can be defined either as a local directory path or as a URL (of the format `s3://*`, `v3io://*`, etc.).
  You can set the artifacts path either by defining an `MLRUN_ARTIFACT_PATH` environment variable (which applies globally throughout the current environment) or as part of the MLRun configuration.
 
  If the target directory doesn't exist, MLRun creates it.
  You can use the notation `{{run.uid}}` in the path to signify the current run ID.
  For pipelines, you can use the notation `{{workflow.uid}}` to signify the workflow ID.
  This allows you to create a unique artifacts directory for each executed job or workflow.

  After you run an MLRun job, the artifacts directory might contain one or more of the following directories:
 
  - **plots** &mdash; a directory for storing images, figures, and plotlines.
  - **models** &mdash; a directory for storing all trained models.
  - **data** &mdash; a directory for storing any other type of data artifact, such as data sets.
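
As an illustration of the `{{run.uid}}` notation described above, the sketch below builds a per-run artifacts path template and previews what it might look like once resolved. The substitution itself is performed by MLRun; `preview_artifact_path` and the run ID `abc123` are hypothetical and exist only to show the effect.

```python
from os import path

# Template containing the run-ID placeholder described above.
artifact_template = path.join(path.abspath('artifacts'), '{{run.uid}}')


def preview_artifact_path(template, run_uid):
    """Show what the path would look like once the run ID is filled in."""
    return template.replace('{{run.uid}}', run_uid)


preview = preview_artifact_path(artifact_template, 'abc123')
```

A template like this could be assigned as the artifacts path so that each run writes to its own directory.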

The following code sets the artifacts path to an **artifacts** directory within the tutorial directory (for example, **/User/demos/model-deployment-with-streaming/artifacts**).

In [14]:
from mlrun import mlconf

# Target location for storing pipeline artifacts
project.artifact_path = path.abspath('artifacts')
# MLRun DB path or API service URL
#mlconf.dbpath = mlconf.dbpath or 'http://mlrun-api:8080'

print(f'Artifacts path: {project.artifact_path}\nMLRun DB path: {mlconf.dbpath}')

Artifacts path: /User/work/tutorials/demos/model-deployment-with-streaming/artifacts
MLRun DB path: http://mlrun-api:8080


## Shared Configuration

Store the configuration defined in this notebook in the project `params`. We will use these values in subsequent notebooks.

In [15]:
project.params['PROJECT_BASE_NAME'] = PROJECT_BASE_NAME
project.params['STREAM_CONFIGS'] = STREAM_CONFIGS
project.params['CONTAINER'] = CONTAINER
project.params['WEB_API'] = WEB_API
project.params['WEB_API_USERS'] = WEB_API_USERS
project.params['PARTITION_ATTR'] = PARTITION_ATTR
project.params['PARQUET_TARGET_PATH'] = PARQUET_TARGET_PATH
project.params['ENRICHMENT_TABLE_PATH'] = ENRICHMENT_TABLE_PATH
project.params['FEATURE_TABLE_PATH'] = FEATURE_TABLE_PATH

In [16]:
from IPython.display import display, JSON
display(JSON(project.params, expanded=True))

<IPython.core.display.JSON object>
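
In a later notebook, once the project has been loaded, the stored values could be read back along these lines. This is a sketch in which a plain dictionary stands in for `project.params`; `stream_settings` is a hypothetical helper, and the path is copied from the output shown earlier.

```python
def stream_settings(params, stream_name):
    """Return (container, path, shard_count) for a named stream."""
    config = params['STREAM_CONFIGS'][stream_name]
    return params['CONTAINER'], config['path'], config['shard_count']


# A plain dict standing in for `project.params`.
example_params = {
    'CONTAINER': 'users',
    'STREAM_CONFIGS': {
        'serving-stream': {
            'path': 'iguazio/examples/model-deployment-with-streaming/data/serving-stream',
            'shard_count': 8,
        },
    },
}
container, stream_path, shard_count = stream_settings(example_params, 'serving-stream')
```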

### Save the Project

In [17]:
project.save()

## Done

Continue to [**1-event-generator.ipynb**](1-event-generator.ipynb) to generate events for training and serving.