# Katonic Feature Store

## Use case: Bank Customer churn prediction
### Overview 
#### Predicting whether a customer will leave their banking service. We make the predictions with ML models trained on selected customer features.

In [1]:
# Importing the FeatureStore functionality from the Kfs package.
from katonic.fs.feature_store import FeatureStore

### Using Local Provider.

In [2]:
fs = FeatureStore(
    user_name = "user", # user name.
    project_name = "bank_churn_modelling", # give a name for your project.
    description = "using machine learning for churn prediction", # project description.
) 

In [3]:
# We've successfully initialized FeatureStore with the Local Provider.
# Let's import a few more necessary classes.
from katonic.fs.entities import Entity, FeatureView
from katonic.fs.core.offline_stores import FileSource
from katonic.fs.value_type import ValueType

In [4]:
# Let's define the Entity key.
entity_key = Entity(name = "customerid", value_type=ValueType.INT64)

In [5]:
data_source = FileSource(
    path = "datasets/churn_data.csv", # Provide a path for your data source file.
    file_format = "csv", # Format of your data source: CSV or PARQUET.
    event_timestamp_column = "event_timestamp" # The column that records when each event occurred.
)

Feature views let users register their organization's data sources with the Feature Store as offline feature sources, which can then be used for both model training and online inference.

The feature view definition below tells Feature Store where to find the customer churn features.

In [6]:
# Defining the Columns that we want to use for Creating a Feature View.
cols = ["creditscore", "age", "tenure", "balance", "numofproducts", "hascrcard", "isactivemember", "estimatedsalary"]

In [7]:
# Feature View

churn_modelling_view  = FeatureView(
    name="churn_modelling_local", # Feature view name
    entities=["customerid"], # Entity Key
    ttl="2d", # Time to live (ttl): the period for which the feature view exists, given in hours, days, or months.
    features=cols, # Columns you want in Feature View.
    batch_source=data_source, # data source
)

Now that we have defined our first feature view, we can apply the changes to create our feature registry and configure our infrastructure:

## Register and Deploy feature definitions.

In [8]:
# Write the data to Offline Store.
fs.write_table([entity_key, churn_modelling_view])

Registered entity customerid
Registered feature view churn_modelling_local
Deploying infrastructure for churn_modelling_local


The preceding `write_table` function will:

- Store all entity and feature view definitions in a local file called `registry.db`.
- Create an empty `SQLite` table for serving the churn features.
- Ensure that the data sources defined with `FileSource` are available.

### Building a training dataset.

In [9]:
import pandas as pd

In [10]:
# Your entity dataframe, which includes the entity key and the event timestamp column.
df = pd.read_csv("datasets/churn_df.csv")

# Make sure the timestamp column is parsed as a datetime type.
df["event_timestamp"] = pd.to_datetime(df["event_timestamp"])

In [11]:
# Looking at the Entity Data Frame. 

df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,customerid,event_timestamp,exited
0,0,0,15634602,2016-02-08 00:37:08+00:00,1
1,1,1,15647311,2016-02-08 05:56:20+00:00,0
2,2,2,15619304,2016-02-08 06:15:39+00:00,1
3,3,3,15701354,2016-02-08 06:15:39+00:00,0
4,4,4,15737888,2016-02-08 06:51:45+00:00,0


In [22]:
df.drop(["Unnamed: 0", "Unnamed: 0.1"],axis=1,inplace=True)
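The `Unnamed: 0` columns dropped above usually appear when a dataframe's index was written to the CSV and then read back as data. A minimal sketch of one way to avoid them at read and write time, using a small hypothetical in-memory file rather than the real `churn_df.csv`:

```python
import io
import pandas as pd

# Hypothetical miniature of an entity file that was saved together with its index.
raw = io.StringIO(
    ",customerid,event_timestamp,exited\n"
    "0,15634602,2016-02-08 00:37:08+00:00,1\n"
    "1,15647311,2016-02-08 05:56:20+00:00,0\n"
)

# index_col=0 treats the first column as the index instead of a data column;
# parse_dates converts the timestamp column while reading.
entity_df = pd.read_csv(raw, index_col=0, parse_dates=["event_timestamp"])

# When writing the file back, index=False prevents a new index column.
csv_text = entity_df.to_csv(index=False)
```

With this approach no separate `drop` call is needed, since the index never becomes a data column.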

In [23]:
# Basic Information about your data.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6844 entries, 0 to 6843
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   customerid       6844 non-null   int64              
 1   event_timestamp  6844 non-null   datetime64[ns, UTC]
 2   exited           6844 non-null   int64              
dtypes: datetime64[ns, UTC](1), int64(2)
memory usage: 160.5 KB


In [24]:
# Getting the historical features from Offline Store for Training a Model.

training_df = fs.get_historical_features(
    entity_df = df, # Your Entity Data Frame.
    feature_view = ["churn_modelling_local"], # Feature View name.
    features = cols # The columns you want to retrieve
).to_df()
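Feature stores of this kind typically build training data with a point-in-time join: for each entity row, the most recent feature values at or before that row's timestamp are attached. The sketch below illustrates the idea with plain pandas on hypothetical data. Whether `get_historical_features` works exactly this way, and whether `ttl` acts as the lookback window, is an assumption that this notebook does not confirm.

```python
import pandas as pd

# Hypothetical feature rows: two snapshots for customer 1, one old snapshot for customer 2.
features = pd.DataFrame({
    "customerid": [1, 1, 2],
    "event_timestamp": pd.to_datetime(
        ["2016-02-01 00:00", "2016-02-07 00:00", "2016-01-01 00:00"], utc=True
    ),
    "creditscore": [600, 619, 700],
})

# Entity dataframe: the keys and the times at which we want feature values.
entities = pd.DataFrame({
    "customerid": [1, 2],
    "event_timestamp": pd.to_datetime(
        ["2016-02-08 00:37", "2016-02-08 05:56"], utc=True
    ),
    "exited": [1, 0],
})

# For each entity row, take the latest feature row at or before its timestamp,
# but only if it is at most 2 days old (assumed to mirror ttl="2d").
joined = pd.merge_asof(
    entities.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customerid",
    direction="backward",
    tolerance=pd.Timedelta(days=2),
)
```

In this sketch customer 1 receives the 2016-02-07 snapshot, while customer 2 gets a missing value because its only snapshot is older than the 2-day window.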

In [25]:
training_df.head()

Unnamed: 0,event_timestamp,customerid,exited,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary
0,2016-02-08 00:37:08+00:00,15634602,1,619,42,2,0.0,1,1,1,101348.88
1,2016-02-08 05:56:20+00:00,15647311,0,608,41,1,83807.86,1,0,1,112542.58
2,2016-02-08 06:15:39+00:00,15619304,1,502,42,8,159660.8,3,1,0,113931.57
3,2016-02-08 06:15:39+00:00,15701354,0,699,39,1,0.0,2,0,0,93826.63
4,2016-02-08 06:51:45+00:00,15737888,0,850,43,2,125510.82,1,1,1,79084.1


Once we have retrieved the complete training dataset, we can:

- Drop timestamp columns and the `customerid` column.
- Encode categorical features (if any).
- Split the training dataframe into a train, validation, and test set.
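The three steps above can be sketched as follows. The dataframe is a hypothetical stand-in for `training_df`, the `geography` column is an invented categorical feature added only for illustration, and the 60/20/20 split ratio is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for training_df, with one invented categorical column.
data = pd.DataFrame({
    "customerid": range(100),
    "event_timestamp": pd.date_range("2016-02-08", periods=100, freq="D", tz="UTC"),
    "creditscore": [500 + i for i in range(100)],
    "geography": ["France", "Spain"] * 50,  # hypothetical categorical feature
    "exited": [0, 1] * 50,
})

# 1. Drop the timestamp, the entity key, and the target from the features.
X = data.drop(columns=["customerid", "event_timestamp", "exited"])
y = data["exited"]

# 2. One-hot encode the categorical column.
X = pd.get_dummies(X, columns=["geography"])

# 3. Split into 60% train, 20% validation, 20% test, keeping class balance.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)
```

The second split uses `test_size=0.25` because a quarter of the remaining 80% equals 20% of the full dataset.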

In [26]:
# Importing the model class and the utilities for saving and loading the model.
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

In [27]:
# Removing the unnecessary columns and splitting the data into features (X) and target (y).

X_train = training_df.drop(["event_timestamp", "exited", "customerid"], axis=1)
y_train = training_df["exited"]

In [28]:
X_train.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary
0,619,42,2,0.0,1,1,1,101348.88
1,608,41,1,83807.86,1,0,1,112542.58
2,502,42,8,159660.8,3,1,0,113931.57
3,699,39,1,0.0,2,0,0,93826.63
4,850,43,2,125510.82,1,1,1,79084.1


In [29]:
# Building a model with training data.

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
dump(rfc,"churn_model_rfc.bin")

['churn_model_rfc.bin']

Before we can make online predictions with our churn model, we must populate the online store with feature values. To load features into the online store, we use the `publish_table` function:

In [30]:
# Populating the online store with the latest features.
# Only features that fall between the start and end timestamps are materialized.

from datetime import datetime

fs.publish_table(
    start_ts = datetime(2016, 1, 20), # Give a start date
    end_ts = datetime(2016, 6, 20) # End date.
)

Materializing 1 feature views from 2016-01-20 00:00:00+00:00 to 2016-06-20 00:00:00+00:00 into the redis online store.



This function loads features from the offline store into the online store for the period between `start_ts` and `end_ts`. The `publish_table` function can be called repeatedly as more data becomes available to keep the online store fresh.
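One way to organize repeated calls is to remember where the previous materialization ended and start the next one from there. The helper below is a hypothetical sketch and not part of the Katonic API. It only computes the next time window, and the 30-day cap is an arbitrary assumption.

```python
from datetime import datetime, timedelta

def next_window(last_end, now, max_span=timedelta(days=30)):
    """Return the (start_ts, end_ts) pair for the next refresh, or None if up to date."""
    if now <= last_end:
        return None
    # Start where the previous run ended and cap the span to keep each run bounded.
    return last_end, min(now, last_end + max_span)

# The previous publish_table call in this notebook ended on 2016-06-20.
last_end = datetime(2016, 6, 20)
window = next_window(last_end, now=datetime(2016, 6, 27))

# The pair could then be passed on, e.g.:
# fs.publish_table(start_ts=window[0], end_ts=window[1])
```

Starting each window at the previous end avoids gaps between runs.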

## Fetching a Feature Vector at low latency.
### We now fetch the latest features for a few customers and use them as test data for a prediction.

In [31]:
# Getting the Online features by using the entity keys.

customer_ids = [15569892,15769959,15584532,15682355]

test = fs.get_online_features(
    entity_rows=[{"customerid": customer_id} for customer_id in customer_ids], # Entity keys 
    feature_view=["churn_modelling_local"], # Feature View name
    features=cols, # Columns
).to_df()


In [32]:
# Test Dataset.
test.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,customerid
0,516.0,35.0,10.0,57369.61,1.0,1.0,1.0,101699.77,15569892
1,597.0,53.0,4.0,88381.21,1.0,1.0,0.0,69384.71,15769959
2,709.0,36.0,7.0,0.0,1.0,0.0,1.0,42085.58,15584532
3,772.0,42.0,3.0,75075.31,2.0,1.0,0.0,92888.52,15682355


In [33]:
# Loading the pre-trained model and predicting.
model = load("churn_model_rfc.bin")

model.predict(test.drop("customerid", axis=1))

array([0, 1, 0, 0])
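Online rows can arrive with extra columns such as `customerid`, or in a different column order than the training data. A minimal sketch, on hypothetical toy data, of selecting the training columns explicitly before predicting and of reading churn probabilities instead of hard labels:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Column order used at training time (a shortened, hypothetical feature list).
feature_cols = ["creditscore", "age"]

# Tiny toy training set, for illustration only.
train = pd.DataFrame({"creditscore": [500, 600, 700, 800], "age": [60, 50, 40, 30]})
labels = [1, 1, 0, 0]
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(train[feature_cols], labels)

# Online rows with an extra key column and a different column order.
online = pd.DataFrame({"age": [35, 55], "customerid": [1, 2], "creditscore": [780, 520]})

# Select and reorder the columns to match the training layout.
aligned = online[feature_cols]

# Column 1 of predict_proba is the estimated probability of class 1 (exited).
churn_proba = clf.predict_proba(aligned)[:, 1]
```

Indexing with the training column list both removes the entity key and fixes the column order in one step.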