# Feature Store Lifecycle: Audience Selection for Direct Mail Marketing

In this notebook, we will cover how to register, ingest, and access features in Tern's Feature Store.  

We will complete the following steps:
1. [Background](#Background)
2. [Data Preparation](#Data-Preparation)
3. [Ingest to Feature Store](#Ingest-to-Feature-Store)
4. [Retrieve Online Features](#Retrieve-Online-Features)
5. [Get Batch Features](#Get-Batch-Features)

## Background

We will use a direct mail marketing dataset to walk through how to work with the Tern Feature Store client. The data includes demographic attributes, campaign information, macroeconomic indicators, and other attributes. Our goal is to select a direct mail audience by building a model that predicts whether a customer will enroll in a term deposit at a bank after one or more phone calls.  

Before getting started, review:
- Tern API documentation @ https://docs.tern.ai
- Feature discovery @ http://amundsen.sandbox.tern.ai

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

For our walk-through, we have split the original dataset into the three datasets below:
- demo_data.feather: Demographic data for every customer
- campaign_data.feather: Campaign information for every customer and campaign 
- dataset_target.feather: Dataset that has macroeconomic indicators, other features, the target/label, and the target timestamp. We will use this data for model building in the last step of this notebook.

## Data Preparation

Let's import the Python libraries and the feature store client modules.

In [1]:
from datetime import datetime
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                
import feather
import s3fs
import fsspec
from data_prep import stratified_sample
from data_prep import extract_feature_list
from client import MarlinServiceClient
from marlin_service_pb2 import DataType
import json

To access Tern's Feature Store, we need to configure the server address, the port, and the location where offline features will be stored. The offline location can also be configured to point to a cloud data warehouse such as Snowflake, Google BigQuery, or Amazon Redshift.

In [2]:
SERVER_ADDRESS = 'marlin-api-service.default.svc.cluster.local'
PORT = 6060
LOCATION_BATCH_FEATURES = "s3://marlin-offline-store/store-data"

client = MarlinServiceClient(SERVER_ADDRESS, PORT, LOCATION_BATCH_FEATURES)
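
The connection settings above are hard-coded for one cluster. A minimal sketch of reading them from environment variables instead, falling back to the notebook's values, could look like the following. The variable names `MARLIN_SERVER_ADDRESS`, `MARLIN_PORT`, and `MARLIN_OFFLINE_LOCATION` and the helper `load_store_config` are hypothetical and not part of the client.

```python
import os
from typing import Mapping, NamedTuple


class StoreConfig(NamedTuple):
    server_address: str
    port: int
    offline_location: str


def load_store_config(env: Mapping[str, str] = os.environ) -> StoreConfig:
    """Read connection settings from env, defaulting to the notebook's values."""
    # The MARLIN_* variable names are assumptions made for this sketch.
    return StoreConfig(
        server_address=env.get("MARLIN_SERVER_ADDRESS",
                               "marlin-api-service.default.svc.cluster.local"),
        port=int(env.get("MARLIN_PORT", "6060")),
        offline_location=env.get("MARLIN_OFFLINE_LOCATION",
                                 "s3://marlin-offline-store/store-data"),
    )


config = load_store_config()
# The fields are in the same order as the positional arguments used above,
# so the client could be built with MarlinServiceClient(*config).
```

Keeping the settings in one immutable tuple makes it easy to log or compare the configuration a notebook run actually used.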

Now we are ready to load and ingest features.

## Ingest to Feature Store

### Feature Group 1 - Demographic Data:

Let's start by importing the demographic data and ingesting it into the Feature Store for both online and offline availability.

In [3]:
df1 = pd.read_feather('data/demo_data.feather')
df1.head(5)

Unnamed: 0,cust_id,age,job,marital,education
0,100,56,housemaid,married,basic.4y
1,101,57,services,married,high.school
2,102,37,services,married,high.school
3,103,40,admin.,married,basic.6y
4,104,56,services,married,high.school


#### Demographic Data Definition:

age: Customer's age (numeric)  
job: Type of job (categorical: 'admin.', 'services', ...)  
marital: Marital status (categorical: 'married', 'single', ...)  
education: Level of education (categorical: 'basic.4y', 'high.school', ...)

#### Feature Group Registration & Feature Ingestion:

In [4]:
# We will set some names for ease of use later.
feature_group_demographics = "demographics"
entity_name_cid = "cust_id"
entity_value_type_cid = DataType.LONG
event_ts_cid = int(datetime.timestamp(datetime(2020,4,30))) # 04/30/2020 @ 12:00am in the kernel's local time zone (naive datetime); equals UTC midnight only if the kernel runs in UTC
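
`datetime.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the value above corresponds to midnight UTC only when the kernel itself runs in UTC. A small sketch of a time-zone-independent variant using only the standard library follows; the helper name `utc_epoch_seconds` is illustrative.

```python
from datetime import datetime, timezone


def utc_epoch_seconds(year: int, month: int, day: int) -> int:
    """Epoch seconds for midnight UTC on the given date, regardless of local time zone."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


event_ts_utc = utc_epoch_seconds(2020, 4, 30)
print(event_ts_utc)  # 1588204800
```

Attaching `timezone.utc` explicitly keeps the event timestamp reproducible across machines.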

```
Register features in the store by defining a feature group.
Arguments:
feature_group_name - Name of the feature group
online             - Flag indicating whether features of this feature group should be available for low-latency online access
offline            - Flag indicating whether features of this feature group should be available for batch access
author             - Person or user ID creating the feature group
source_code        - Metadata about the source code generating the features, for example a link to the source code
entities           - Dictionary of entity names and their data types
features           - Dictionary of feature names and their data types
```

In [5]:
%%time

client.register_feature_group(feature_group_name=feature_group_demographics,
                              author="John S",
                              online=True,
                              offline=True,
                              source_code="customer_demographics.py",
                              entities={entity_name_cid : entity_value_type_cid},
                              features={'age': DataType.LONG, 
                                        'job': DataType.STRING,
                                        'marital': DataType.STRING, 
                                        'education': DataType.STRING})

CPU times: user 1.52 ms, sys: 235 µs, total: 1.76 ms
Wall time: 13.6 ms
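
Before ingesting, it can help to confirm that the DataFrame's column dtypes line up with the registered types. The check below is a sketch under assumptions: plain strings (`"LONG"`, `"STRING"`) stand in for the client's `DataType` values, and the helper `schema_mismatches` is hypothetical rather than part of the feature store client.

```python
import pandas as pd

# Hypothetical schema: plain strings stand in for the client's DataType values.
SCHEMA = {"cust_id": "LONG", "age": "LONG", "job": "STRING",
          "marital": "STRING", "education": "STRING"}


def schema_mismatches(df: pd.DataFrame, schema: dict) -> list:
    """Return human-readable problems; an empty list means the frame matches."""
    problems = []
    for column, expected in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif expected == "LONG" and not pd.api.types.is_integer_dtype(df[column]):
            problems.append(f"{column}: expected integer, got {df[column].dtype}")
        elif expected == "STRING" and not pd.api.types.is_string_dtype(df[column]):
            problems.append(f"{column}: expected string, got {df[column].dtype}")
    return problems


sample = pd.DataFrame({"cust_id": [100, 101], "age": [56, 57],
                       "job": ["housemaid", "services"],
                       "marital": ["married", "married"],
                       "education": ["basic.4y", "high.school"]})
print(schema_mismatches(sample, SCHEMA))  # []
```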




In [6]:
%%time

# Ingest the features in the DataFrame into the feature store, based on the feature registration above.

client.feature_ingest(data=df1,
                      entities=entity_name_cid,
                      feature_group_name=feature_group_demographics,
                      event_timestamp=event_ts_cid)

CPU times: user 10.8 s, sys: 1.1 s, total: 11.9 s
Wall time: 17 s


### Feature Group 2 - Campaign Information:

Let's import the second dataset, the campaign information, and ingest it into the Feature Store for both online and offline availability.

In [7]:
df2 = pd.read_feather('data/campaign_data.feather')
df2.head(5)

Unnamed: 0,cust_id,campaign_id,campaign,pdays,previous,poutcome
0,100,1010,1,999,0,nonexistent
1,101,1011,1,999,0,nonexistent
2,102,1012,1,999,0,nonexistent
3,103,1013,1,999,0,nonexistent
4,104,1014,1,999,0,nonexistent


#### Campaign information:

campaign: Number of contacts performed during this campaign and for this client (numeric, includes last contact)  
pdays: Number of days that passed after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)  
previous: Number of contacts performed before this campaign and for this client (numeric)  
poutcome: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)  


#### Feature Group Registration & Feature Ingestion:

In [8]:
# Second feature group: set some names for ease of use later.
feature_group_campaign = "campaign_info"
entity_name_cid = "cust_id"
entity_value_type_cid = DataType.LONG
entity_name_cmp = "campaign_id"
entity_value_type_cmp = DataType.LONG
event_ts_cmp = int(datetime.timestamp(datetime(2020,4,26)))  # 04/26/2020 @ 12:00am (UTC)

In [9]:
%%time

# Register features
# Note that this feature group has 2 entities defined, cust_id and campaign_id

client.register_feature_group(feature_group_name=feature_group_campaign,
                              author="Arun K",
                              online=True,
                              offline=True,
                              source_code="campaign.py",
                              entities={entity_name_cid : entity_value_type_cid, 
                                        entity_name_cmp : entity_value_type_cmp},
                              features={'campaign': DataType.LONG, 
                                        'pdays': DataType.LONG,
                                        'previous': DataType.LONG, 
                                        'poutcome': DataType.STRING})

CPU times: user 810 µs, sys: 88 µs, total: 898 µs
Wall time: 2.68 ms




In [10]:
%%time

# Ingest the features in the DataFrame into the feature store, based on the feature registration above.
# Note the multiple entities are defined as a list.

client.feature_ingest(data=df2,
                      entities=[entity_name_cid, entity_name_cmp],
                      feature_group_name=feature_group_campaign,
                      event_timestamp=event_ts_cmp)

CPU times: user 10.4 s, sys: 1.17 s, total: 11.6 s
Wall time: 16.7 s
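
With two entities, each row is identified by the combination of `cust_id` and `campaign_id`. This notebook does not say how the store treats repeated entity combinations within one ingest, so a pre-ingest uniqueness check can be a useful safeguard. A minimal pandas sketch follows; the helper name `duplicate_entity_rows` and the sample values are illustrative.

```python
import pandas as pd


def duplicate_entity_rows(df: pd.DataFrame, entity_columns: list) -> pd.DataFrame:
    """Rows whose entity key combination appears more than once."""
    return df[df.duplicated(subset=entity_columns, keep=False)]


sample = pd.DataFrame({
    "cust_id":     [100, 100, 101, 101],
    "campaign_id": [1010, 1011, 1012, 1012],
    "campaign":    [1, 2, 1, 3],
})
dupes = duplicate_entity_rows(sample, ["cust_id", "campaign_id"])
print(len(dupes))  # 2
```

Only the last two rows share the full key `(101, 1012)`. Customer 100 appears twice but with different campaign IDs, which is valid for a two-entity feature group.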


Now our features for both Feature Groups are stored in Tern's Feature Store. You can search and discover them in Amundsen.

## Retrieve Online Features

We can now retrieve features online using `get_features_as_dict`. This method for retrieval provides low latency for online inference and integrating predictions into applications/products.  
Let's retrieve the features we just ingested above for each feature group:

### Get Online Features - Demographics Feature Group:

In [64]:
# Create a list of feature names we want to retrieve online. 

get_demographics_feature_names=[]
entity=[entity_name_cid]
for column in df1:
    if column not in entity:
        get_demographics_feature_names.append(column)

print(get_demographics_feature_names)

['age', 'job', 'marital', 'education']


In [69]:
%%time

# Get online features with a feature_group_name, entity name/value and the feature name list.

features = client.get_features_as_dict(feature_group_name=feature_group_demographics,
                                       entities={entity_name_cid: 39233},
                                       features=get_demographics_feature_names)


print("List Features Requested:")

print(json.dumps(features, indent=4, sort_keys=True))

List Features Requested:
{
    "demographics.age": 78,
    "demographics.education": "professional.course",
    "demographics.job": "retired",
    "demographics.marital": "divorced"
}
CPU times: user 1.73 ms, sys: 0 ns, total: 1.73 ms
Wall time: 3.61 ms
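
The response keys are prefixed with the feature group name and, because of `sort_keys=True`, printed alphabetically. A model usually expects values in a fixed feature order, so a small helper can rebuild that order from the dictionary. This is a sketch; `to_feature_row` is a hypothetical helper, and the response literal is copied from the output above.

```python
def to_feature_row(response: dict, feature_group: str, feature_names: list) -> list:
    """Order the prefixed response values to match feature_names."""
    return [response[f"{feature_group}.{name}"] for name in feature_names]


# Values copied from the response printed above.
response = {
    "demographics.age": 78,
    "demographics.education": "professional.course",
    "demographics.job": "retired",
    "demographics.marital": "divorced",
}
row = to_feature_row(response, "demographics", ["age", "job", "marital", "education"])
print(row)  # [78, 'retired', 'divorced', 'professional.course']
```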


### Get Online Features for Campaign info Feature Group:

In [92]:
# Create a list of feature names we want to retrieve online. 

get_campaign_feature_names=[]
entity=[entity_name_cid,entity_name_cmp]
for column in df2:
    if column not in entity:
        get_campaign_feature_names.append(column)
print(get_campaign_feature_names)

['campaign', 'pdays', 'previous', 'poutcome']


In [93]:
%%time

# Get online features with a feature_group_name, entity name/value and the feature name list.
# Note for this feature group we need 2 entities to access.

features = client.get_features_as_dict(feature_group_name=feature_group_campaign,
                               entities={entity_name_cid: 39233, 
                                         entity_name_cmp: 40143},
                               features=get_campaign_feature_names)

print("List Features Requested:")
print(json.dumps(features, indent=4, sort_keys=True))

List Features Requested:
{
    "campaign_info.campaign": 1,
    "campaign_info.pdays": 999,
    "campaign_info.poutcome": "failure",
    "campaign_info.previous": 1
}
CPU times: user 2.03 ms, sys: 0 ns, total: 2.03 ms
Wall time: 2.59 ms
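
Conceptually, an online lookup is a key-value read keyed by the feature group and the full entity combination. The toy in-memory store below illustrates why both `cust_id` and `campaign_id` must be supplied for this feature group. It does not describe how Tern's Feature Store is implemented; the class `ToyOnlineStore` and its methods are hypothetical.

```python
class ToyOnlineStore:
    """In-memory stand-in: one row of values per (group, entity key)."""

    def __init__(self):
        self._rows = {}

    @staticmethod
    def _key(group: str, entities: dict) -> tuple:
        # Sort entity names so the key does not depend on dict ordering.
        return (group, tuple(sorted(entities.items())))

    def put(self, group: str, entities: dict, values: dict) -> None:
        self._rows[self._key(group, entities)] = dict(values)

    def get(self, group: str, entities: dict, features: list) -> dict:
        row = self._rows[self._key(group, entities)]  # KeyError if the key is incomplete
        return {f"{group}.{name}": row[name] for name in features}


store = ToyOnlineStore()
store.put("campaign_info", {"cust_id": 39233, "campaign_id": 40143},
          {"campaign": 1, "pdays": 999, "previous": 1, "poutcome": "failure"})
print(store.get("campaign_info", {"cust_id": 39233, "campaign_id": 40143},
                ["pdays", "poutcome"]))
# {'campaign_info.pdays': 999, 'campaign_info.poutcome': 'failure'}
```

Looking up the same group with only `cust_id` raises `KeyError` in this toy version, because the partial key never matches a stored row.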


## Get Batch Features

For offline modelling you may need to access large amounts of feature data. The batch feature request returns a point-in-time-correct join of the stored features with your target dataset.

Now you are ready to build your model using the features ingested in the store earlier.

To access feature data, follow these steps:

1. Create a target dataset which has the following structure:
    - entity names and values (cust_id) - In this case, the customer IDs that are of interest for building the training data.
    - target_timestamp - Timestamp when the event of interest is observed. It provides a temporal cutoff with respect to the feature timestamps and is used to build point-in-time-accurate training data.
    - Other columns (optional) - You may want to add the target or label name and values if they are not part of the features requested from the store. This gives you complete training data containing both the features and the target of interest.
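
As an illustration of the structure described above, a target dataset can also be assembled by hand. The sketch below builds a three-row frame with pandas, using the column names from this notebook and sample values; the cutoff is expressed in epoch seconds, matching the `target_timestamp` values in the stored file.

```python
from datetime import datetime, timezone

import pandas as pd

# Cutoff shared by every row in this walk-through: 2020-05-01 00:00:00 UTC.
cutoff = int(datetime(2020, 5, 1, tzinfo=timezone.utc).timestamp())

target = pd.DataFrame({
    "cust_id": [100, 101, 102],          # entity values
    "campaign_id": [1010, 1011, 1012],   # second entity, needed for campaign_info
    "target_timestamp": cutoff,          # epoch seconds, broadcast to all rows
    "y": ["no", "no", "no"],             # optional label column
})
print(target["target_timestamp"].unique())  # [1588291200]
```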

In [15]:
df_target = pd.read_feather('data/dataset_target.feather') # Read data from stored feather file

In [16]:
df_target.head(3)

Unnamed: 0,cust_id,campaign_id,default,housing,loan,contact,month,day_of_week,duration,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,target_timestamp
0,100,1010,no,no,no,telephone,may,mon,261,1.1,93.994,-36.4,4.857,5191.0,no,1588291200
1,101,1011,unknown,no,no,telephone,may,mon,149,1.1,93.994,-36.4,4.857,5191.0,no,1588291200
2,102,1012,no,yes,no,telephone,may,mon,226,1.1,93.994,-36.4,4.857,5191.0,no,1588291200


In [17]:
df_target= df_target[['cust_id', 'campaign_id','target_timestamp','y']]
df_target.head(5)

Unnamed: 0,cust_id,campaign_id,target_timestamp,y
0,100,1010,1588291200,no
1,101,1011,1588291200,no
2,102,1012,1588291200,no
3,103,1013,1588291200,no
4,104,1014,1588291200,no


**Target variable:** y - Has the client subscribed to a term deposit? (binary: 'yes','no')  

2. Use the client and following parameters to get the training data:
    - dataset created in Step 1
    - list of features with their feature group names. E.g. `[feature_group_name1:feature_name1, feature_group_name2:feature_name2]`  
  
In this step you could use Amundsen to discover the features available in the store. See the documentation for more information.

In [18]:
fsspec.filesystem('s3').invalidate_cache()

all_features=['campaign_info:campaign', 
              'campaign_info:pdays', 
              'campaign_info:previous', 
              'campaign_info:poutcome', 
              'demographics:age', 
              'demographics:job', 
              'demographics:marital', 
              'demographics:education']
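
Rather than typing each `group:feature` string by hand, the list can be generated from a mapping of feature groups to feature names. A minimal sketch, where `build_feature_refs` is a hypothetical helper and not part of the client:

```python
def build_feature_refs(groups: dict) -> list:
    """Flatten {group: [feature, ...]} into 'group:feature' strings."""
    return [f"{group}:{name}" for group, names in groups.items() for name in names]


refs = build_feature_refs({
    "campaign_info": ["campaign", "pdays", "previous", "poutcome"],
    "demographics": ["age", "job", "marital", "education"],
})
print(refs[:2])  # ['campaign_info:campaign', 'campaign_info:pdays']
```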

In [19]:
%%time

# Given the target dataset and the feature list, this returns point-in-time-correct training data as a DataFrame.

training_data = client.get_batch_features(entity_df=df_target, features=all_features)

CPU times: user 1min 43s, sys: 10.3 s, total: 1min 53s
Wall time: 2min 6s
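
The store's join logic is not shown in this notebook. To illustrate what a point-in-time join means, the sketch below uses `pandas.merge_asof` as a stand-in, assuming the semantics "latest feature row at or before the target timestamp". The sample frames and the `event_timestamp` column name are made up for the example.

```python
import pandas as pd

# Two versions of the same feature for customer 100, plus one for customer 101.
features = pd.DataFrame({
    "cust_id": [100, 100, 101],
    "event_timestamp": [1588204800, 1588377600, 1588204800],  # Apr 30, May 2, Apr 30 (UTC)
    "age": [56, 57, 41],
})
targets = pd.DataFrame({
    "cust_id": [100, 101],
    "target_timestamp": [1588291200, 1588291200],  # May 1 (UTC)
})

# merge_asof requires both frames to be sorted by their time keys.
joined = pd.merge_asof(
    targets.sort_values("target_timestamp", kind="mergesort"),
    features.sort_values("event_timestamp", kind="mergesort"),
    left_on="target_timestamp",
    right_on="event_timestamp",
    by="cust_id",
    direction="backward",  # latest feature row at or before the target time
)
print(joined[["cust_id", "age"]].to_dict("records"))
# [{'cust_id': 100, 'age': 56}, {'cust_id': 101, 'age': 41}]
```

Customer 100 gets the April 30 value (56) rather than the May 2 value (57), because the later row was not yet known at the target time. This is what prevents future information from leaking into the training data.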


In [20]:
training_data.head(5)

Unnamed: 0,cust_id,campaign_id,target_timestamp,y,campaign_info.campaign,campaign_info.pdays,campaign_info.previous,campaign_info.poutcome,demographics.age,demographics.job,demographics.marital,demographics.education
0,107,1017,1588291200,no,1,999,0,nonexistent,41,blue-collar,married,unknown
56,27544,28454,1588291200,no,1,999,0,nonexistent,57,blue-collar,married,basic.9y
112,25500,26410,1588291200,no,3,999,1,failure,34,admin.,married,high.school
168,30807,31717,1588291200,no,1,999,2,failure,41,blue-collar,married,basic.9y
224,39206,40116,1588291200,yes,1,6,2,success,53,admin.,married,university.degree


In [21]:
training_data.tail(5)

Unnamed: 0,cust_id,campaign_id,target_timestamp,y,campaign_info.campaign,campaign_info.pdays,campaign_info.previous,campaign_info.poutcome,demographics.age,demographics.job,demographics.marital,demographics.education
2306248,40121,41031,1588291200,yes,3,999,0,nonexistent,27,services,single,high.school
2306304,34114,35024,1588291200,no,1,999,1,failure,26,admin.,single,high.school
2306360,6319,7229,1588291200,no,2,999,0,nonexistent,33,admin.,married,basic.9y
2306416,26455,27365,1588291200,no,1,999,0,nonexistent,32,admin.,married,high.school
2306472,37748,38658,1588291200,no,1,6,2,success,33,admin.,married,high.school


In [22]:
training_data.count()

cust_id                   41188
campaign_id               41188
target_timestamp          41188
y                         41188
campaign_info.campaign    41188
campaign_info.pdays       41188
campaign_info.previous    41188
campaign_info.poutcome    41188
demographics.age          41188
demographics.job          41188
demographics.marital      41188
demographics.education    41188
dtype: int64

In [23]:
training_data.describe()

Unnamed: 0,cust_id,campaign_id,target_timestamp,campaign_info.campaign,campaign_info.pdays,campaign_info.previous,demographics.age
count,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0
mean,20693.5,21603.5,1588291000.0,2.567593,962.475454,0.172963,40.02406
std,11890.09578,11890.09578,0.0,2.770014,186.910907,0.494901,10.42125
min,100.0,1010.0,1588291000.0,1.0,0.0,0.0,17.0
25%,10396.75,11306.75,1588291000.0,1.0,999.0,0.0,32.0
50%,20693.5,21603.5,1588291000.0,2.0,999.0,0.0,38.0
75%,30990.25,31900.25,1588291000.0,3.0,999.0,0.0,47.0
max,41287.0,42197.0,1588291000.0,56.0,999.0,7.0,98.0
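
The `pdays` mean of about 962 is dominated by the value 999, which in the source dataset's documentation denotes a client who was not previously contacted rather than a real day count. One possible way to handle this before modelling is sketched below with pandas; the column names `previously_contacted` and `pdays_clean` and the sample values are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"campaign_info.pdays": [999, 999, 6, 3]})

NOT_CONTACTED = 999  # sentinel value used by the source dataset

# Keep the information in a separate flag, then blank out the sentinel.
frame["previously_contacted"] = frame["campaign_info.pdays"] != NOT_CONTACTED
frame["pdays_clean"] = frame["campaign_info.pdays"].where(frame["previously_contacted"])

print(frame["pdays_clean"].mean())  # 4.5
```

Splitting the sentinel into a boolean flag keeps the signal while letting summary statistics describe only real day counts.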


Now we can take further steps to build an ML model that predicts which customers are likely to make a deposit. See the second notebook, which walks through the process from model building to online predictions.