# LocalCart scenario part 2: Creating streaming pipelines

This notebook contains two examples of creating streaming pipelines:

- [Example 1: Creating a Message Hub to Redis streaming pipeline that uses Code and Aggregation operators](#intro_a)<br>
- [Example 2: Creating a Message Hub to Object Storage streaming pipeline](#intro_b)<br>

This notebook runs on Python 2 with Spark 2.0.

<a id="intro_a"></a>
***

# Example 1: Creating a Message Hub to Redis streaming pipeline that uses Code and Aggregation operators

***


## Introduction 


A web or mobile app triggers events as a user navigates a web site. These clickstream events indicate when a customer logs in, adds something to a basket, completes an order, and logs out. The events are placed into a configured Message Hub (Apache Kafka) instance, which provides a scalable way to buffer the data before it is saved, analysed, and rendered. 

[Notebook #1 - Creating a Kafka Producer of ClickStream events](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-1.ipynb) generates clickstream events for LocalCart and sends them to Message Hub to show how data can be collected offline and streamed to the cloud later. A [Java app](https://localcartkafkaproducer.mybluemix.net/LocalCartKafkaProducer/) continuously feeds a simulated stream of events to Message Hub. 

This example creates a streaming pipeline that ingests those clickstream events, processes each event type, aggregates the events for real-time analysis, and streams them to Redis storage on the cloud for later analysis. 



### Static data analysis

To analyse the data at rest, we can create a streaming pipeline job that takes the data from Message Hub and stores it in CSV files. This process is described in [Example 2: Creating a Message Hub to Object Storage streaming pipeline](#intro_b). These files can be concatenated and loaded into a Jupyter notebook. We can use [PixieDust](https://github.com/ibm-cds-labs/pixiedust) to analyse the data. This type of analysis with PixieDust is done in [Notebook #4: Visualize streaming data](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-4.ipynb).


### Streaming data analysis

In order to gather aggregations of each event as it arrives, we can set up a streaming pipeline that performs aggregation operations and writes the output to Redis. Aggregation operations enable us to do some real-time analysis as the data streams.


### Visualization

After we have aggregations in Redis, we can see the data in dashboards by using PixieDust in Jupyter notebooks, or by using custom-made Node.js web applications.  

## Table of contents

1. [Introduction](#intro_a)<br>
2. [Scenario](#process_a)<br>
3. [Collect data from Message Hub](#collect_a)<br>
4. [Process the incoming events with Python code](#process_events)<br>
5. [Set up aggregation functions for events](#aggregation)<br>
6. [Provision and set up Redis database](#redis)<br>
7. [Repeat for next streams ....](#repeat)<br>
8. [Summary and next steps](#summary)<br>


<a id="process_a"></a>
## Scenario

In this notebook, our aim is to analyse the incoming events by using the streaming pipelines service. The following graphic shows LocalCart clickstream events that are generated and sent from the Message Hub service. 
<img src='https://github.com/notebookgraphics/advobeta/blob/master/NB2_COMPLETE_STREAMING_PIPELINE.gif?raw=true'></img>

- The first stream deals with the login event type. It is aggregated to get its count.
- The second stream deals with the add_to_cart event type. It is processed by Python code to compute the total basket price and the UTC time stamp, and then aggregated to get the sum.
- The third stream deals with the checkout event type. It is processed by Python code to compute the total basket price and the UTC time stamp, and then aggregated to get the sum.

We use the same Redis instance for all three streams in our pipeline.


<a id="collect_a"></a>
## Collect data from Message Hub

First we need to create a streaming pipeline that collects data from a Message Hub operator.


***

### Steps

In IBM Data Science Experience, do these steps:

1. Select a project that you want to contain the streaming pipeline.
1. Click the **Analytics Assets** tab.
1. In the Streaming Pipelines section, click **add streaming pipelines**.
1. In the Create New Streaming Pipeline window, click **Manual**. Type in a name for the pipeline, and then click **Create**.
1. Drag a MessageHub source operator into the pipeline canvas.
1. Configure the MessageHub operator by doing these steps in the Properties pane:
	1. Select the ClickStream MessageHub instance.
	1. Select the topic that you want to aggregate. For example, you might select "Login".
	1. Click **Edit Schema** to match the incoming data:
        - `customer_id` - text - `.customer_id`
        - `click_event_type` - text - `.click_event_type`
        - `event_time` - text - `.event_time`
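
As a rough illustration of the schema above, the following sketch assumes that each Message Hub message arrives as a JSON string and that a path such as `.customer_id` selects a top-level key of that message. The `SCHEMA` mapping and the `extract_fields` helper are hypothetical names used only for this illustration; they are not part of the streaming pipelines service.

```python
import json

# Hypothetical mapping of schema attribute -> path, mirroring the Edit Schema entries
SCHEMA = {
    'customer_id': '.customer_id',
    'click_event_type': '.click_event_type',
    'event_time': '.event_time',
}

def extract_fields(message, schema=SCHEMA):
    """Pick the schema attributes out of one JSON message as text values."""
    payload = json.loads(message)
    # Assumption: a path like '.customer_id' names a top-level key of the message
    return dict((name, str(payload[path.lstrip('.')]))
                for name, path in schema.items())

message = ('{"customer_id": "14420", "click_event_type": "login", '
           '"event_time": "2017-04-10 15:54:35 IST"}')
fields = extract_fields(message)
print(fields['click_event_type'])  # login
```

Every attribute is kept as text here because the schema declares all three fields with the text type.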


Our streaming pipeline now has its first operator and looks like this: <img src='https://github.com/notebookgraphics/advobeta/blob/master/NB2_MH.gif?raw=true'></img>

<a id="process_events"></a>
## Process the incoming events with Python code

Python code can be used to process incoming events. The Python code has the following form:

In [4]:
def process(event):
    return {'output':'output'}

With each incoming event, the streaming pipeline will call "process" with the incoming data. Whatever your function returns will be sent to the next block. Here is an example of a login clickstream event:

In [3]:
ev = { 
    'customer_id': '14420', 
    'click_event_type': 'login',     
    'total_price_of_basket' : "0.0",
    'total_number_of_items_in_basket' : "0",
    'total_number_of_distinct_items_in_basket' : "0",
    'event_time': '2017-04-10 15:54:35 IST'
}

Let's make a "process" function that parses the time stamp and returns the parsed date:

In [2]:
from dateutil.parser import parse
from dateutil import tz
def process(event):
    # dateutil's parser doesn't understand "IST" as a timezone indicator, so swap for +05:30
    dt = parse(event['event_time'].replace('IST','+05:30'))
    
    # convert to UTC timezone too
    event['dt_utc'] = dt.astimezone(tz.gettz('UTC'))

    return event

and convert our sample data with it:

In [1]:
print process(ev)

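
To check the conversion without the pipeline, the sketch below mimics how a handler is called once per event and how its return value is passed on. It is a standard-library stand-in, not the pipeline's own mechanism: `process_stdlib` and `run_stream` are hypothetical names, and the code assumes that "IST" means India Standard Time (UTC+05:30), as the `process` function above does, and that timestamps end in either " IST" or " UTC".

```python
from datetime import datetime, timedelta

# Assumption: 'IST' is India Standard Time, 5 hours 30 minutes ahead of UTC
IST_OFFSET = timedelta(hours=5, minutes=30)
STAMP_FORMAT = '%Y-%m-%d %H:%M:%S'

def process_stdlib(event):
    """Hypothetical stdlib-only variant of process: adds a naive UTC datetime."""
    stamp, zone = event['event_time'].rsplit(' ', 1)
    local = datetime.strptime(stamp, STAMP_FORMAT)
    if zone == 'IST':
        event['dt_utc'] = local - IST_OFFSET
    elif zone == 'UTC':
        event['dt_utc'] = local
    else:
        raise ValueError('unexpected timezone label: ' + zone)
    return event

def run_stream(events, handler):
    """Call handler once per event and collect what it returns,
    mimicking how each result is passed to the next operator."""
    return [handler(dict(event)) for event in events]

sample = [{'customer_id': '14420',
           'click_event_type': 'login',
           'event_time': '2017-04-10 15:54:35 IST'}]
results = run_stream(sample, process_stdlib)
print(results[0]['dt_utc'])  # 2017-04-10 10:24:35
```

The handler receives a copy of each event, so the sample data is left unchanged between runs.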


### Steps

In the pipeline canvas, do these steps:
1. Drag a Code operator from the Processing and Analytics area, and then drop it on the canvas next to the MessageHub operator.
2. Drag your mouse pointer from the output port of the MessageHub operator to the input port of the Code operator to connect them.
3. Click the **Code** operator to open its Properties pane. 
    - Copy and paste the following code into the Code block:

    ```
        from dateutil.parser import parse
        from dateutil import tz
        def process(event):
            # dateutil's parser doesn't understand "IST" as a timezone indicator, so swap for +05:30
            dt = parse(event['event_time'].replace('IST','+05:30'))
    
            # convert to UTC timezone too
            event['dt_utc'] = dt.astimezone(tz.gettz('UTC'))

            return event
    ```

   - Click **Edit Schema** to edit the Code operator's schema so that it matches the schema in the MessageHub operator. Add a new attribute to the Code operator's schema: `dt_utc`, which is of type Date.


You now have a streaming pipeline with two operators that looks like this:
<img src='https://github.com/notebookgraphics/advobeta/blob/master/NB2_MH_CODE.gif?raw=true'></img>



Let's turn now to the third operator, Aggregation.

<a id="aggregation"></a>
## Set up aggregation functions for events

Streaming data can be aggregated, and a function such as sum, count, minimum, or maximum can then be applied to the aggregation before the result is written to the Redis database. Our aim is to calculate the following data for a sliding one-hour window:

- login_count - the number of people who logged into LocalCart 
- basket_count - the number of items added into a shopping cart
- checkout_count - the number of purchases made
- basket_total - the total price of items added into a shopping cart
- checkout_total - the total price of items purchased

Let's walk through how to assign an aggregation function for the `login_count` event type in our streaming data.



### Steps

In the pipeline canvas, do these steps:

1. Drag an Aggregation operator from the Processing and Analytics area, and then drop it on the canvas next to the Code operator.
2. Drag your mouse pointer from the output port of the Code operator to the input port of the Aggregation operator to connect them.
3. Click the Aggregation operator to open its Properties pane. Set the following parameters:
    - Type - 'sliding'
    - Time Units - 'hour'
    - Number of Time Units - 1
    - Partition By - leave unchanged
    - Group By - leave unchanged
4. In the Functions area of the Aggregation Properties pane, assign the following values:
    - Output Field Name - `login_count`
    - Function Type - count
    - Apply Function To - `login_count` 
     ***
     

Repeat the steps above for the `basket_count` and `checkout_count` Aggregation operators on their respective Message Hub topics.
    
The `basket_total` and `checkout_total` aggregations are achieved by adding a second aggregation function to the existing block, this time using a "Sum" function.
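
One way to picture a sliding one-hour window is as a queue of recent events from which older entries are evicted. The sketch below is only an illustration under stated assumptions: events arrive in time order, and an event is dropped once it is an hour or more older than the newest one. `SlidingWindow` is a hypothetical class, and the Aggregation operator's actual windowing semantics may differ.

```python
from collections import deque
from datetime import datetime, timedelta

class SlidingWindow(object):
    """Hypothetical sliding window that supports count and sum."""

    def __init__(self, width=timedelta(hours=1)):
        self.width = width
        self.items = deque()  # (timestamp, value) pairs, oldest first

    def add(self, when, value=0.0):
        """Record one event, then evict entries that fell out of the window."""
        self.items.append((when, value))
        # Assumption: events arrive in time order, so the oldest is on the left
        while self.items[0][0] <= when - self.width:
            self.items.popleft()

    def count(self):
        return len(self.items)

    def total(self):
        return sum(value for _, value in self.items)

start = datetime(2017, 6, 23, 12, 0, 0)
window = SlidingWindow()
window.add(start, 10.0)
window.add(start + timedelta(minutes=30), 5.5)
window.add(start + timedelta(minutes=61), 2.0)  # pushes the first event out
print(window.count())  # 2
print(window.total())  # 7.5
```

A count function corresponds to `count()` here, and a Sum function corresponds to `total()` over the same window.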


Our pipeline now has three operators and looks like this:
<img src='https://github.com/notebookgraphics/advobeta/blob/master/NB2_MH_CODE_AGGREG.gif?raw=true'></img>



<a id="redis"></a>
## Provision and set up Redis database

Redis is an in-memory database. It stores its data in RAM, making it a very fast way of storing and retrieving data. It provides a set of primitive data structures, but we only concern ourselves with [hashes](https://redis.io/commands#hash) for this exercise.

A Redis hash is a data structure that allows several keys to be stored together. We are going to configure a Redis hash called "funnel" that contains the following output:

- login_count - the number of people who logged into LocalCart
- basket_count - the number of items added into a shopping cart
- checkout_count - the number of purchases made
- basket_total - the total price of items added into a shopping cart
- checkout_total - the total price of items purchased

These are the outputs of the aggregation functions in our streaming pipeline. Let's provision our own Redis service:



### Steps

In the IBM Bluemix Dashboard, do these steps:

1. Click the **Services** tab.
1. Choose the **Redis By Compose** service.
1. Provision a new Redis By Compose service. Note its authentication details (hostname, port, and password).



Now we can play with the Redis service in this notebook by installing the Python Redis library with the following command:

In [26]:
!pip install redis



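We import the library and open a connection with these commands, replacing the placeholder host, port, and password with the details of your own Redis service: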

In [37]:
import redis
r = redis.StrictRedis(host='abc.com', port=10115, db=0, password='ABCDEFGHI')

We can then create a hash called 'funnel' to store our real-time data to the database by using the `hset` function:


In [5]:
r.hset('funnel', 'basket_count', 554);
r.hset('funnel', 'basket_total', 951);
r.hset('funnel', 'checkout_count', 21);
r.hset('funnel', 'checkout_total', 5400);
r.hset('funnel', 'login_count', 100);


We can also use this connection to retrieve all the values from our 'funnel' hash using `hgetall`:

In [6]:
r.hgetall('funnel')


**Note:** 
The Redis connection above seems to freeze in this notebook after a minute or so. In this case, you will need to restart the notebook kernel to restore it.
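
Redis stores hash values as strings, so the dictionary that `hgetall` returns holds text (bytes under Python 3, unless the client is created with `decode_responses=True`) rather than numbers. The sketch below shows one possible way to convert such a dictionary for analysis. `parse_funnel` is a hypothetical helper, and the rule that fields ending in `_count` are integers is an assumption based on the field names used in this notebook.

```python
def parse_funnel(raw):
    """Convert a dict shaped like an hgetall result into numeric values."""
    parsed = {}
    for key, value in raw.items():
        # Decode bytes keys and values if the client returned them undecoded
        if isinstance(key, bytes):
            key = key.decode('utf-8')
        if isinstance(value, bytes):
            value = value.decode('utf-8')
        number = float(value)
        # Assumption: '*_count' fields are whole numbers, the rest are totals
        parsed[key] = int(number) if key.endswith('_count') else number
    return parsed

# A dictionary shaped like the hgetall result; no Redis connection is needed here
raw = {b'login_count': b'100', b'basket_total': b'951', b'checkout_count': b'21'}
funnel = parse_funnel(raw)
print(funnel['login_count'])   # 100
print(funnel['basket_total'])  # 951.0
```

With a live connection, the same helper could be applied as `parse_funnel(r.hgetall('funnel'))`.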


We can now store the aggregations from our streaming pipeline in Redis.


### Steps

Let's add a Redis output to our streaming pipelines. In the streaming pipeline canvas, do these steps:

1. Drag a Redis operator from the Target area, and then drop it on the canvas next to the Aggregation operator.
1. Drag your mouse pointer from the output port of the Aggregation operator to the input port of the Redis operator to connect them.
2. Click the **Redis** operator to open its Properties pane. 
    - Type in the value of 'hostname', 'port', and 'password' of your Redis by Compose service.
    - In the Key Template field, type in "funnel". 


***

Your pipeline should now look like this:
<img src='https://github.com/notebookgraphics/advobeta/blob/master/NB2_STREAMING_PIPELINE.gif?raw=true'></img>


Congratulations! You just created a streaming pipeline that takes clickstream data from MessageHub, processes the data with Python, aggregates the event types and applies functions to them, and then writes counts and totals to Redis storage.


<a id="repeat"></a>
## Repeat for next streams

You now need to create two more streams in the pipeline. One stream is for the "add_to_cart" Message Hub topic. The second stream is for the "checkout" Message Hub topic. The messages on those topics are a bit more detailed:

In [19]:
add_to_cart_event = {
    "customer_id": "13859",
    "click_event_type": "add_to_cart",
    "product_name": "Oatmeal",
    "product_category": "Food",
    "product_price": "2.49",
    "total_price_of_basket": "153.41",
    "total_number_of_items_in_basket": "19",
    "total_number_of_distinct_items_in_basket": "6",
    "event_time": "2017-06-23 12:56:18 UTC"
}
checkout_event =  {
    "customer_id": "11828",
    "click_event_type": "checkout",
    "total_price_of_basket": "72.80000000000001",
    "total_number_of_items_in_basket": "20",
    "total_number_of_distinct_items_in_basket": "5",
    "session_duration": "440",
    "event_time": "2017-06-23 13:09:12 UTC"
}

When you create those two streams, you need to add extra fields to the Message Hub schema and parse them correctly in the Python code.

In [7]:
from dateutil.parser import parse
from dateutil import tz
def process(event):
    # dateutil's parser doesn't understand "IST" as a timezone indicator, so swap for +05:30
    dt = parse(event['event_time'].replace('IST','+05:30'))
    
    # convert to UTC timezone too
    event['dt_utc'] = dt.astimezone(tz.gettz('UTC'))
    event['total_price_of_basket'] = float(event['total_price_of_basket'])
    return event
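
The conversion to `float` matters because the prices arrive as text, and text values cannot be summed. The sketch below illustrates this with the two basket prices from the sample events above. `to_number` is a hypothetical helper that performs the same conversion as the `process` function without changing the original event.

```python
def to_number(event, field='total_price_of_basket'):
    """Return a copy of the event with one text field converted to float."""
    converted = dict(event)
    converted[field] = float(event[field])
    return converted

# Basket prices taken from the sample add_to_cart and checkout events
events = [
    {'total_price_of_basket': '153.41'},
    {'total_price_of_basket': '72.80000000000001'},
]
total = sum(to_number(event)['total_price_of_basket'] for event in events)
print(round(total, 2))  # 226.21
```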



Drag a second Aggregation operator to the canvas, connect it to the Code operator for the add_to_cart event type, and define the Aggregation Properties pane with the following values:

- Type - 'sliding'
- Time Units - 'hour'
- Number of Time Units - 1
- Partition By - leave unchanged
- Group By - leave unchanged
- Output Field Name - `basket_total`
- Function Type - sum
- Apply Function To - `basket_total` 


Drag a third Aggregation operator to the canvas, connect it to the Code operator for the checkout event type, and define the Aggregation Properties pane with the following values:

- Type - 'sliding'
- Time Units - 'hour'
- Number of Time Units - 1
- Partition By - leave unchanged
- Group By - leave unchanged
- Output Field Name - `checkout_total`
- Function Type - sum
- Apply Function To - `checkout_total` 


We use the same Redis instance for all three streams in our pipeline. Consequently, you only need to create the Redis By Compose service one time.

Our complete streaming pipeline now looks like this: 

<img src='https://github.com/notebookgraphics/advobeta/blob/master/NB2_COMPLETE_STREAMING_PIPELINE.gif?raw=true'></img>

<a id="summary"></a>

## Summary and next steps
In this section, you set up a streaming pipeline that used data from [Notebook #1: Creating a Kafka Producer of ClickStream events](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-1.ipynb). The data included clickstream events (log in, browse, add to cart, logout without purchase, logout with purchase, and checkout).  



<a id="intro_b"></a>

***
# Example 2: Creating a Message Hub to Object Storage streaming pipeline
***


## Introduction


A web or mobile app triggers events as a user navigates a web site. These clickstream events indicate when a customer logs in, adds something to a basket, completes an order, and logs out. The events are placed into a configured Message Hub (Apache Kafka) instance, which provides a scalable way to buffer the data before it is saved, analysed, and rendered. 

[Notebook #1: Creating a Kafka Producer of ClickStream events](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-1.ipynb) generates clickstream events for LocalCart and sends them to Message Hub to show how data can be collected offline and streamed to the cloud later. A [Java app](https://localcartkafkaproducer.mybluemix.net/LocalCartKafkaProducer/) continuously feeds a simulated stream of events to Message Hub. 

This section creates streaming pipelines that ingest those clickstream events and write them in CSV format to Object Storage for later analysis.

These files can be concatenated and loaded into a Jupyter notebook. We can use [PixieDust](https://github.com/ibm-cds-labs/pixiedust) to analyse the data. This type of analysis with PixieDust is done in [Notebook #4: Visualize streaming data](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-4.ipynb).



## Table of contents

1. [Introduction](#intro_b)<br>
2. [Scenario](#process_b)<br>
3. [Steps to collect data from Message Hub](#collect_b)<br>
4. [Summary and next steps](#summary_b)<br>
   

<a id="process_b"></a>
## Scenario 

In this notebook, our aim is to persist the incoming events as CSV files by using the streaming pipelines service. The following graphic shows LocalCart clickstream events that are generated and sent from the Message Hub service. 
<img src='https://github.com/ibm-watson-data-lab/advo-beta-producer/blob/master/graphics/NB2a_CSV_PIPELINE.png?raw=true'></img>


<a id="collect_b"></a>
## Steps to collect data from Message Hub

First, we need to create streaming pipelines that capture clickstream data from a Message Hub operator. Each pipeline will stream data of one clickstream event type to its own CSV file in Object Storage. All CSV files are in the same instance of Object Storage.


In IBM Data Science Experience, do these steps:

1. Select a project that you want to contain the streaming pipeline.
1. Click the **Analytics Assets** tab.
1. In the Streaming Pipelines section, click **add streaming pipelines**.
1. In the Create New Streaming Pipeline window, type in a name for the pipeline such as **addtocart2csv**. Click **Wizard**, and then click **Create**. 
1. In the Select Source window, click **MessageHub**.
1. Under the Instance drop-down menu, select your MessageHub instance.
1. Under the Topic drop-down menu, select **add_to_cart**. Click **Continue**.
1. Wait for the Data Preview window to load the streaming data. Click **Continue**.
1. In the Select Target window, click **Object Storage**.
1. Under the Object Storage Instance drop-down menu, select the instance that is used by the DSX project.
   <br>
   > Take note of the  Object Storage instance name. You will need this information in [Notebook 3b: Static clickstream analysis](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-3b.ipynb) when you load and analyze the clickstream events.
1. Under the Container drop-down menu, select the Object Storage container you want to write to. 
   <br>
   > Take note of the  Object Storage container name. You will need this information in [Notebook 3b: Static clickstream analysis](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-3b.ipynb) when you load and analyze the clickstream events.
1. Under File Name, type **add_to_cart-TIMESTAMP.csv** (**Note:** "TIMESTAMP" is a reserved word that will be replaced with an actual timestamp when the file is written).
1. Under Format, select **csv**.
1. Under Delimiter, select **Comma (,)**.
1. Click **Save**. The pipeline is created for you, but it will be in "Stopped" state.
1. In the next window, click the **Run** icon to start the streaming pipeline.
 

Next, repeat the steps above for each Message Hub topic: browsing, checkout, clickstream, login, logout_with_purchase, and logout_without_purchase.


<a id="summary_b"></a>
## Summary and next steps
In this section, you created seven streaming pipelines to capture clickstream event data from a Message Hub operator. The data is then written to a CSV file on Object Storage for later analysis. 

You can do the following steps to access those CSV files for further analysis.

#### Accessing CSV files on Object Storage
1. Log in to [Bluemix](https://console.bluemix.net/) by using your DSX credentials.
1. Navigate to the space where the Object Storage instance is located. This space is what you selected when you created the DSX project.
1. Open the Object Storage instance.

#### Accessing CSV files on Object Storage manually
1. Open the **Manage** tab, and then select the container that you specified when you created the data collection pipeline. 
1. Select a CSV file. In the "Select Action" list, select "Download File" to view it.
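
After a few files are downloaded, they can be combined before analysis. The following sketch is one possible approach and rests on an assumption that this notebook does not confirm: that every CSV file for a topic starts with the same header row. `concat_csv` is a hypothetical helper, and the sample data is invented only to show the shape of the input.

```python
import csv

def concat_csv(texts):
    """Combine CSV documents that share a header into (header, rows)."""
    header, rows = None, []
    for text in texts:
        reader = csv.reader(text.splitlines())
        file_header = next(reader, None)
        if file_header is None:
            continue  # skip empty files
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError('CSV files have different headers')
        # Keep every non-blank data row
        rows.extend(row for row in reader if row)
    return header, rows

# Invented sample content standing in for two downloaded files
first = 'customer_id,click_event_type\n14420,login\n'
second = 'customer_id,click_event_type\n13859,login\n11828,login\n'
header, rows = concat_csv([first, second])
print(header)     # ['customer_id', 'click_event_type']
print(len(rows))  # 3
```

If the files turn out to have no header row, the first line of each file would have to be treated as data instead.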

#### Accessing CSV files on Object Storage programmatically
1. Open the **Service credentials** tab. Select a Key Name, and then click **View credentials**. 
1. Copy the credentials and provide this information whenever you want to load data files programmatically, such as in [Notebook 3b: Static clickstream analysis](https://github.com/wdp-beta/get-started/blob/master/notebooks/localcart-scenario-part-3b.ipynb).



***

### Authors

Glynn Bird is a Developer Advocate for Watson Data Platform at IBM. 

Raj Singh is a Developer Advocate for Watson Data Platform at IBM.

***
Copyright © IBM Corp. 2017. This notebook and its source code are released under the terms of the MIT License.