# Kaskada Demo
## Let's use a common ML challenge: Customer Retention. 

Companies collect massive amounts of data describing user behavior, using platforms like Splunk, Heap, or Segment, or even basic event logs. How can you use this data to predict user retention and revenue, and to identify which customers are likely to be the most successful?

And how do we then make that information available to customer success reps trying to save accounts, to sales reps judging whether a new customer is likely to succeed, and to revenue leaders forecasting quarterly and annual revenue?

**Note**: Due to the terms and conditions under which the data used in this notebook is made available, anyone interested in recreating this work will need to download the files from Kaggle and follow the instructions below to create their own Kaskada account and upload the data.

# Step 1: Setup Kaskada Client

## Session Builder
The next version of Kaskada will use an API Session Builder that closely follows PySpark's approach to local connections.

###  Local Session Builder
By default, the local session builder (`LocalBuilder`) assumes:
* Endpoint: `localhost:50051` for the API server
* Is Secure: `False`
* Will spin up the API server and Compute Server binaries.
  * Assumes Kaskada root is **~/.cache/kaskada**. Override by setting *KASKADA_PATH*
  * Assumes the binaries are stored in *KASKADA_PATH/bin*. Override by setting *KASKADA_BIN_PATH* (default is bin)
  * Assumes the logs are stored in *KASKADA_PATH/logs*. Override by setting *KASKADA_LOG_PATH* (default is logs)
  
Most people running locally will want to spin up the server locally by just using: `LocalBuilder().build()`.

In [None]:
from kaskada.api.session import LocalBuilder

session = LocalBuilder().build()
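
The path defaults listed above can be sketched in plain Python. This is only an illustration of the described rules: `resolve_kaskada_paths` is a hypothetical helper, not part of the `kaskada` package, and it assumes that *KASKADA_BIN_PATH* and *KASKADA_LOG_PATH* are directory names resolved relative to *KASKADA_PATH*.

```python
import os
from pathlib import Path


def resolve_kaskada_paths(env=None):
    """Hypothetical helper mirroring the documented defaults (not a kaskada API)."""
    env = os.environ if env is None else env
    # Root defaults to ~/.cache/kaskada unless KASKADA_PATH is set.
    root = Path(env.get("KASKADA_PATH", "~/.cache/kaskada")).expanduser()
    # Assumption: the bin and log settings are names relative to the root.
    bin_dir = root / env.get("KASKADA_BIN_PATH", "bin")
    log_dir = root / env.get("KASKADA_LOG_PATH", "logs")
    return {"root": root, "bin": bin_dir, "logs": log_dir}


# Example: override only the root and keep the default sub-directories.
paths = resolve_kaskada_paths({"KASKADA_PATH": "/tmp/kaskada-demo"})
print(paths["bin"], paths["logs"])
```

To change where a real local session looks, set the environment variables before calling `LocalBuilder().build()`.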

# Step 2: Prepare the data
Download the data and agree to the terms and conditions of this [research prediction competition](https://www.kaggle.com/c/kkbox-churn-prediction-challenge/data). 

The files you'll need are titled:


*   user_logs_v2.csv.7z
*   transactions.csv.7z
*   members_v3.csv.7z

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
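
The Kaggle downloads are `.7z` archives, while the load calls in this notebook read extracted `.csv` files from a `kkbox-churn-prediction-challenge` directory. A minimal sketch for confirming that the extracted files are in place before loading follows; `missing_csvs` is a hypothetical helper introduced here for illustration.

```python
from pathlib import Path


def missing_csvs(data_dir, filenames):
    """Return the expected files that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in filenames if not (base / name).is_file()]


# The two files that the load calls in this notebook read.
expected = ["transactions.csv", "members_v3.csv"]
missing = missing_csvs("kkbox-churn-prediction-challenge", expected)
if missing:
    print("Extract these files before loading:", missing)
```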

Now we're ready to create tables for the data and load it into Kaskada.

In [None]:
from kaskada import table

# Start from a clean slate: delete every table that already exists
for t in table.list_tables().tables:
    table.delete_table(t.table_name)

In [None]:
# Create Transaction table and load data into it
table.create_table(
    table_name = "Transaction",
    time_column_name = "transaction_date",
    entity_key_column_name = "msno",
    grouping_id="User"
)
table.load("Transaction", "kkbox-churn-prediction-challenge/transactions.csv")

# Create Member table and load data into it
table.create_table(
    table_name = "Member",
    time_column_name = "registration_init_time",
    entity_key_column_name = "msno",
    grouping_id="User"
)
table.load("Member", "kkbox-churn-prediction-challenge/members_v3.csv")

table.list_tables()

# Step 3: Feature Engineering
## Data scientists love Jupyter
### With Kaskada's Python library and FENL magic they can keep all their favorite parts
- Iterative exploration, drill down and manipulation of data 
- Data cleaning
- Statistical modeling
- Visual storytelling with words, graphs and code
- Training ML models


### Iterative exploration, drill down and manipulation of data with Kaskada

#### Connect your event-based data directly to Kaskada

Let's take a look at the transaction events and membership information for a single member, `msno=LWekcgcnUIqi22v63xuIMX4GYbxapmPMoDnLMVLFSTs=`, to understand the columns available.

In [None]:
%load_ext fenlmagic

In [None]:
%%fenl
Transaction | when(Transaction.msno == "LWekcgcnUIqi22v63xuIMX4GYbxapmPMoDnLMVLFSTs=")

In [None]:
%%fenl
Member | when(Member.msno == "LWekcgcnUIqi22v63xuIMX4GYbxapmPMoDnLMVLFSTs=")

### Data Cleaning and Visualizing with Kaskada
Visualizations help you avoid missing the forest (the distribution) for the trees (individual data points). Jupyter makes it easy to craft visualizations that then inform the data scientist's decisions about further feature engineering and selection.

#### With Kaskada, data scientists can define features directly from the event-based data even when:

- The transaction log is quite busy, often with multiple entries recorded on a given transaction date (`transaction_date`), as can be observed for this customer on several dates, including 2016-09-25.
- Some records have a value of zero or less for payment plan days (`payment_plan_days`).
- Many transaction entries change the subscription's expiration date (`membership_expire_date`).
- There are backdated records caused by subscription management activity, such as a change in auto-renewal status.

In [None]:
%%fenl --var df_explore

{
    payment_plan_days: Transaction.payment_plan_days,
    payment_method_id: Transaction.payment_method_id,
    trans_at: Transaction.transaction_date,
    membership_expire_date: Transaction.membership_expire_date,
}

In [None]:
df_explore.dataframe.columns

In [None]:
plt.hist(df_explore.dataframe.payment_plan_days, bins = 100)
plt.show()
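
To make the quirks listed above concrete, here is a small plain-Python sketch that counts them on toy records. The values are made up for illustration, `summarize_quirks` is a hypothetical helper, and treating "backdated" as "expiration date earlier than transaction date" is one possible reading rather than a definition taken from the dataset.

```python
from collections import Counter
from datetime import date

# Toy records with made-up values; only the column names come from the Transaction table.
rows = [
    {"msno": "u1", "transaction_date": date(2016, 9, 25),
     "membership_expire_date": date(2016, 10, 25), "payment_plan_days": 30},
    {"msno": "u1", "transaction_date": date(2016, 9, 25),
     "membership_expire_date": date(2016, 10, 25), "payment_plan_days": 0},
    {"msno": "u1", "transaction_date": date(2016, 10, 25),
     "membership_expire_date": date(2016, 10, 1), "payment_plan_days": 30},
    {"msno": "u2", "transaction_date": date(2016, 9, 1),
     "membership_expire_date": date(2016, 10, 1), "payment_plan_days": 30},
]


def summarize_quirks(rows):
    """Count the data issues described above for a list of transaction dicts."""
    per_day = Counter((r["msno"], r["transaction_date"]) for r in rows)
    return {
        # (member, date) pairs with more than one entry
        "busy_days": sum(1 for n in per_day.values() if n > 1),
        # entries with zero or negative plan days
        "non_positive_plan_days": sum(1 for r in rows if r["payment_plan_days"] <= 0),
        # entries whose expiration precedes the transaction itself
        "expires_before_transaction": sum(
            1 for r in rows if r["membership_expire_date"] < r["transaction_date"]
        ),
    }


print(summarize_quirks(rows))
```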

We can simply add logic around these events to select the correct examples for prediction time and label time. But first, let's see whether the following steps get us what we want:

- Eliminate transactions with 0 or fewer plan days
- Select the maximum expiration date
- Handle backdated records
- Complete a query over all transactions to see how many we have left. 

In [None]:
%%fenl

# 1. Data Cleaning

let meaningful_txns = Transaction | if(Transaction.payment_plan_days > 0)
        
let max_expires_at = max(meaningful_txns.membership_expire_date)
let expiration_is_previous = (max_expires_at < meaningful_txns.transaction_date)
        
let subscription_expires_at =  max_expires_at | if(not(expiration_is_previous)) | else(meaningful_txns.transaction_date)

in {
    payment_plan_days: meaningful_txns.payment_plan_days,
    payment_method_id: meaningful_txns.payment_method_id,
    trans_at: meaningful_txns.transaction_date,
    membership_expire_date: meaningful_txns.membership_expire_date,
    expires_at: subscription_expires_at
}

In [None]:
plt.hist(_.dataframe.payment_plan_days, bins = 100)
plt.show()
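
For readers new to FENL, the cleaning query above can be approximated in plain Python. This is a sketch under stated assumptions, not Kaskada's implementation: it assumes `max(...)` behaves as a running maximum per member over events in time order, and `clean_transactions` and the toy rows are hypothetical.

```python
from datetime import date


def clean_transactions(rows):
    """Approximate the FENL cleaning step on a list of transaction dicts."""
    running_max = {}  # latest known expiration per member
    cleaned = []
    for r in sorted(rows, key=lambda r: r["transaction_date"]):
        # Drop transactions with zero or fewer plan days.
        if r["payment_plan_days"] <= 0:
            continue
        user = r["msno"]
        prev = running_max.get(user)
        expires = r["membership_expire_date"]
        current = expires if prev is None else max(prev, expires)
        running_max[user] = current
        # If the best-known expiration is already in the past, fall back
        # to the transaction date, as the if/else in the query does.
        expires_at = current if current >= r["transaction_date"] else r["transaction_date"]
        cleaned.append({**r, "expires_at": expires_at})
    return cleaned


# Made-up rows for a single member.
toy = [
    {"msno": "u1", "transaction_date": date(2016, 9, 25),
     "membership_expire_date": date(2016, 10, 25), "payment_plan_days": 30},
    {"msno": "u1", "transaction_date": date(2016, 9, 26),
     "membership_expire_date": date(2016, 10, 1), "payment_plan_days": 0},
    {"msno": "u1", "transaction_date": date(2016, 10, 20),
     "membership_expire_date": date(2016, 10, 10), "payment_plan_days": 30},
    {"msno": "u1", "transaction_date": date(2016, 12, 1),
     "membership_expire_date": date(2016, 11, 1), "payment_plan_days": 30},
]
for row in clean_transactions(toy):
    print(row["transaction_date"], row["expires_at"])
```

The second row is dropped, the third keeps the earlier, later-dated expiration, and the fourth falls back to its own transaction date.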

The example below computes the target feature, churn, at a data-dependent prediction time plus 30 days.

In [None]:
%%fenl --var df_training

# 1. Data Cleaning

let meaningful_txns = Transaction | if(Transaction.payment_plan_days > 0)
        
let max_expires_at = max(meaningful_txns.membership_expire_date)
let expiration_is_previous = (max_expires_at < meaningful_txns.transaction_date)
        
let subscription_expires_at =  max_expires_at | if(not(expiration_is_previous)) | else(meaningful_txns.transaction_date)

let cleaned_transactions = {
    msno: meaningful_txns.msno,
    payment_plan_days: meaningful_txns.payment_plan_days,
    payment_method_id: meaningful_txns.payment_method_id,
    trans_at: meaningful_txns.transaction_date,
    membership_expire_date: meaningful_txns.membership_expire_date,
    expires_at: subscription_expires_at
}

# 2. Churned Transactions

let shifted_txn = cleaned_transactions 
    | shift_to($input.membership_expire_date | add_time(days(30)))

let last_txn = last(cleaned_transactions)

let membership_history = {
    trans_at: shifted_txn.trans_at,
    expires_at: shifted_txn.membership_expire_date,
    churned: shifted_txn.trans_at == last_txn.trans_at,
} | when(is_valid(shifted_txn))

let initial_txn = membership_history.trans_at | first()
let churn_txn = membership_history | if(membership_history.churned) | first()

let churn_subscription = {
    starts_at: initial_txn,
    ends_at: churn_txn.trans_at,
    churned: true
} 
let active_subscription = {
    starts_at: initial_txn,
    ends_at: null,
    churned: false
}

# 3. Features

let current_subscription = churn_subscription | if(is_valid(churn_txn)) | else(active_subscription)

let first_transaction = cleaned_transactions | first()

in {
    churned: current_subscription.churned,
    duration_days: days_between(current_subscription.ends_at, current_subscription.starts_at) as i32,
    payment_plan_days: first_transaction.payment_plan_days,
    payment_method_id: first_transaction.payment_method_id,
    registered_via: first(Member).registered_via | else(-1)
} | when(is_valid(current_subscription))
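
The churn-labeling idea in the query above (shift each transaction to its expiration date plus 30 days, then check whether it is still the member's latest transaction) can be sketched in plain Python. This is an approximation under assumptions: `label_churn` and its `as_of` parameter are hypothetical, rows whose label time falls before the transaction or after the end of the available data are assumed to produce no output, and the dates are made up.

```python
from datetime import date, timedelta


def label_churn(txns, as_of, grace_days=30):
    """txns: one member's (trans_at, membership_expire_date) tuples."""
    txns = sorted(txns)
    labels = []
    for trans_at, expires in txns:
        label_time = expires + timedelta(days=grace_days)
        # Assumption: no label is produced when the label time precedes
        # the transaction or lies beyond the data we have (as_of).
        if label_time < trans_at or label_time > as_of:
            continue
        # Latest transaction made on or before the label time.
        latest = max(t for t, _ in txns if t <= label_time)
        labels.append({
            "trans_at": trans_at,
            "label_time": label_time,
            # Churned if nothing newer arrived within the grace period.
            "churned": latest == trans_at,
        })
    return labels


history = [
    (date(2016, 1, 1), date(2016, 1, 31)),
    (date(2016, 2, 1), date(2016, 3, 2)),
    (date(2016, 6, 1), date(2016, 7, 1)),
]
for label in label_churn(history, as_of=date(2016, 7, 15)):
    print(label["trans_at"], label["churned"])
```

Here the first transaction is renewed in time, the second is followed by a gap longer than 30 days, and the third has no label yet because its label time has not been reached.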

## To summarize, with Kaskada a data scientist can compute feature values at arbitrary data-dependent points in time, train a model, and make feature values available in production for their data engineer.

##### 1. Build predictor and target features with Kaskada

1.   Write the predictors in a single record computed at their prediction time
2.   Shift the features forward to label time and compute the label value

##### 2. Compute train and test sets by specifying model context with Prediction and Label Times

##### 3. Train, score and compare models with your favorite libraries
##### 4. Iterate, select final features and model, and hand off

#### And so on: we can use any library to
- Compute prediction probabilities
- Do a naive model comparison
- Compute ROC and AUC
- Compute the average precision
- Balance the class weights
- Compute the class weights
- Train additional models such as RandomForest classifiers
- Test, evaluate and compare model performance before iterating on features and selecting the final features and models
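
Two of the steps above, computing class weights and computing AUC, can be sketched with only the standard library. In practice a library such as scikit-learn would normally be used (its `compute_class_weight` with `class_weight="balanced"` applies the same weight formula, and `roc_auc_score` computes AUC). The labels and scores below are made up and are not results from this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy churn labels (1 = churned) and model scores, for illustration only.
y = [1, 0, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.1, 0.35, 0.3]

print(balanced_class_weights(y))
print(roc_auc(y, scores))
```

The pairwise AUC is quadratic in the number of examples, which is fine for a sketch but not for a dataset of this size.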

# Step 4: Going to Production
## But, Data and Machine Learning Engineers Hate Jupyter
The non-linearity that was so beneficial to the data scientist turns into a nightmarish choose-your-own-adventure book for the machine learning engineer trying to recreate the final path the data scientist took. Whereas all the failures, mistakes, and errors made along the way are useful for a data scientist to hold onto, they make the machine learning engineer's job closer to archaeology: trying to discover and interpret the hidden meaning behind for loops, boolean statements, and dropped columns.

### With Kaskada you can bridge the gap to production
- Kaskada connects directly to the event-based data available in production
- Data scientists define the predictor features used to power training sets
- Data and ML Engineers call Kaskada to compute the **same** features at the time of now() in production
- Kaskada provides production-grade targets such as Pulsar for feature and model serving

### Make Features as Code available for Airflow jobs, etc.

In [None]:
%%fenl --var feature_vector

{
    payment_plan_days: Transaction.payment_plan_days,
    payment_method_id: Transaction.payment_method_id,
    trans_at: Transaction.transaction_date,
    membership_expire_date: Transaction.membership_expire_date,
}

In [None]:
from kaskada import view

view.create_view(
  view_name = "Features", 
  expression = feature_vector.query,
)

In [None]:
view.list_views(search = "Features")