# Katonic Feature Store

## Use case: Subscription Churn Prediction

### Overview:
The goal is to predict whether a customer will cancel their subscription. We do this with an ML model trained on a number of features that we ingest into the Feature Store.
This demo uses the Feature Store with scikit-learn to train a model on historical features.

In [23]:
# Import the FeatureStore functionality from the Kfs package.

from katonic.fs.feature_store import FeatureStore

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Next, initialize the Feature Store with the following parameters.

fs = FeatureStore(
    user_name = "user", # e.g. the person's name.
    project_name = "subscriber_churn", # project name.
    description = "This is a demo for feature store", # project description.
)

Now that we’ve configured our infrastructure, let’s register the subscription stats features we will use during training and serving.

In [3]:
# We've successfully initialized the FeatureStore with the Local Provider.
# Let's import some more necessary functions.

from katonic.fs.entities import Entity, FeatureView
from katonic.fs.core.offline_stores import FileSource
from katonic.fs.value_type import ValueType

In [4]:
# Let's define the Entity key.

entity = Entity(name="customer_id", value_type=ValueType.INT64)

In [7]:
batch_source = FileSource(
    path = "datasets/subscription_data.csv", # Path to your data source file.
    file_format = "csv",  # Format of your data source: CSV or PARQUET.
    event_timestamp_column="event_timestamp", # Column that holds the time at which each event occurred.
)

In [8]:
batch_source.to_dict()

{'type': 'BATCH_FILE',
 'file_options': {'file_format': 'csv',
  'file_url': 'datasets/subscription_data.csv'},
 'event_timestamp_column': 'event_timestamp',
 'created_timestamp_column': ''}
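
Before registering a file source, it can be worth checking that the file really contains the entity key, the event timestamp column, and the features you plan to declare. The sketch below is not part of the Katonic API: `check_source_columns` is a hypothetical helper, and the small CSV it reads is made-up sample data written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def check_source_columns(path, required_columns):
    """Return the required columns that are missing from the CSV header."""
    # nrows=0 reads only the header row, so this stays cheap for large files.
    header = pd.read_csv(path, nrows=0).columns
    return [col for col in required_columns if col not in header]


# Made-up sample data, only for illustration.
sample = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "event_timestamp": ["2021-10-17 01:00:00+00:00", "2021-10-17 01:00:00+00:00"],
        "year": [2015, 2015],
        "videos_watched": [1, 8],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample_subscription_data.csv"
    sample.to_csv(csv_path, index=False)
    missing = check_source_columns(
        csv_path,
        ["customer_id", "event_timestamp", "year", "videos_watched", "maximum_days_inactive"],
    )

print(missing)  # ['maximum_days_inactive']
```

An empty list would mean every column you intend to reference is present in the file.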

Feature views let users register their organization's data sources with the Feature Store as offline stores, and then use those offline stores for both training and online inference.

The following feature view definition tells Feature Store where to find the subscription stats features.

In [9]:
# Feature View
subscription_stats  = FeatureView(
    name="subscription_stats_fv", # Feature view name
    entities=["customer_id"], # Entity Key
    ttl="2d", # Time period for which the feature view exists; units can be hours, days, or months.
    features=["year", "no_of_days_subscribed", "minimum_daily_mins", "maximum_daily_mins", "videos_watched", "maximum_days_inactive"],
    batch_source=batch_source,  # data source
)

Now that we have defined our first feature view, we can apply the changes to create our feature registry and configure our infrastructure:

### Register and deploy feature definitions

In [10]:
# Write the data to Offline Store.

fs.write_table([entity, subscription_stats])

Registered entity customer_id
Registered feature view subscription_stats_fv
Deploying infrastructure for subscription_stats_fv


The preceding `write_table` function will:

- Store all entity and feature view definitions in a local file called `registry.db`.
- Create an empty `SQLite` table for serving subscription statistics features.
- Ensure that the data sources defined with `FileSource` are available.

## Let's build a training set using the Feature Store itself
Now we identify the features we want to query from the Feature Store.

In [11]:
# Your entity DataFrame, which includes the entity key and the event timestamp column.

import pandas as pd
df = pd.read_csv("datasets/subscription_entity_df.csv")

# Make sure the timestamp column has a datetime data type.
df["event_timestamp"] = pd.to_datetime(df["event_timestamp"])

In [12]:
df.dtypes

customer_id                      int64
event_timestamp    datetime64[ns, UTC]
churn                            int64
dtype: object
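
The listing above shows `event_timestamp` as a timezone-aware UTC column. If your own file stores timestamps without an offset, `pd.to_datetime` alone returns timezone-naive values. A minimal sketch, using made-up rows, of one way to end up with the same UTC dtype:

```python
import pandas as pd

# Made-up entity rows whose timestamps carry no UTC offset.
entity_df = pd.DataFrame(
    {
        "customer_id": [100198, 258935],
        "event_timestamp": ["2021-10-17 01:00:00", "2021-10-17 01:00:00"],
        "churn": [0, 0],
    }
)

# Without utc=True the parsed values are timezone-naive.
naive = pd.to_datetime(entity_df["event_timestamp"])

# utc=True localizes naive values to UTC (and converts aware values to UTC).
entity_df["event_timestamp"] = pd.to_datetime(entity_df["event_timestamp"], utc=True)

print(naive.dt.tz)                         # None
print(entity_df["event_timestamp"].dt.tz)  # UTC
```

This assumes the naive values were recorded in UTC in the first place; timestamps recorded in another zone would need to be localized to that zone before converting.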

In [13]:
# Get the historical features from the Offline Store for training a model.

training_df = fs.get_historical_features(
    entity_df=df, # Your Entity Data Frame.
    feature_view=["subscription_stats_fv"], # Feature View name.
    features=["year", "no_of_days_subscribed", "minimum_daily_mins", "maximum_daily_mins", "videos_watched", "maximum_days_inactive"],
).to_df()
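
Historical retrieval of this kind is usually described as a point-in-time join: for each entity row, take the most recent feature values recorded at or before that row's timestamp. How Katonic implements this internally is not shown in this notebook; the sketch below only illustrates the general technique with `pandas.merge_asof` on made-up data, and treating the `tolerance` argument as the counterpart of the feature view's `ttl` is an assumption.

```python
import pandas as pd

# Made-up feature rows: two snapshots for customer 1, one for customer 2.
features = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "event_timestamp": pd.to_datetime(
            ["2021-10-10", "2021-10-16", "2021-10-01"], utc=True
        ),
        "videos_watched": [3, 5, 7],
    }
)

# Made-up entity rows: the timestamps we want features "as of".
entities = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "event_timestamp": pd.to_datetime(["2021-10-17", "2021-10-17"], utc=True),
        "churn": [0, 1],
    }
)

# merge_asof needs both frames sorted by the time key.
joined = pd.merge_asof(
    entities.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
    direction="backward",          # only feature rows at or before the entity timestamp
    tolerance=pd.Timedelta("2d"),  # assumed to play the role of the feature view's ttl
)

print(joined)
```

Customer 1 picks up the 2021-10-16 snapshot, while customer 2 gets a missing value because its only snapshot is older than the two-day tolerance.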

In [14]:
training_df.head()

| | event_timestamp | customer_id | churn | year | no_of_days_subscribed | minimum_daily_mins | maximum_daily_mins | videos_watched | maximum_days_inactive |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-04-12 07:00:00+00:00 | 258594 | 0 | 2015 | 127 | 6.8 | 37.57 | 4 | 2 |
| 1 | 2021-04-12 07:00:00+00:00 | 437014 | 0 | 2015 | 54 | 10.1 | 36.4 | 3 | 3 |
| 2 | 2021-04-12 07:00:00+00:00 | 601179 | 1 | 2015 | 159 | 10.4 | 32.15 | 5 | 3 |
| 3 | 2021-10-17 01:00:00+00:00 | 100198 | 0 | 2015 | 62 | 12.2 | 16.81 | 1 | 4 |
| 4 | 2021-10-17 01:00:00+00:00 | 258935 | 0 | 2015 | 122 | 13.7 | 41.45 | 8 | 4 |


Once we have retrieved the complete training dataset, we can:

- Drop timestamp columns and the `customer_id` column.
- Encode categorical features (if any).
- Split the training dataframe into a train, validation, and test set.
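
The three steps listed above can be sketched with scikit-learn's `train_test_split`. The frame here is a made-up stand-in for `training_df` so the example runs on its own, and the 60/20/20 proportions are only an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for training_df, only so the sketch runs on its own.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(
    {
        "customer_id": np.arange(100),
        "event_timestamp": pd.Timestamp("2021-10-17", tz="UTC"),
        "videos_watched": rng.integers(0, 10, size=100),
        "maximum_days_inactive": rng.integers(0, 6, size=100),
        "churn": np.tile([0, 0, 0, 1], 25),
    }
)

# Drop the identifier and timestamp columns; keep only model inputs and the label.
X = toy_df.drop(columns=["customer_id", "event_timestamp", "churn"])
y = toy_df["churn"]

# First carve out 20% as the test set, then 25% of the remainder as validation,
# which gives a 60/20/20 split overall. stratify keeps the churn rate similar.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

None of the features in this dataset are categorical, so no encoding step is shown.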

In [15]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1158 entries, 0 to 1157
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype              
---  ------                 --------------  -----              
 0   event_timestamp        1158 non-null   datetime64[ns, UTC]
 1   customer_id            1158 non-null   int64              
 2   churn                  1158 non-null   int64              
 3   year                   1158 non-null   int64              
 4   no_of_days_subscribed  1158 non-null   int64              
 5   minimum_daily_mins     1158 non-null   float64            
 6   maximum_daily_mins     1158 non-null   float64            
 7   videos_watched         1158 non-null   int64              
 8   maximum_days_inactive  1158 non-null   int64              
dtypes: datetime64[ns, UTC](1), float64(2), int64(6)
memory usage: 90.5 KB


In [16]:
# Prepare the training data.

from joblib import dump
from sklearn.tree import DecisionTreeClassifier

# Remove the columns that are not model inputs and split the data into
# independent features (train_X) and the dependent target (train_Y).
target = "churn"

train_X = training_df.drop(columns=[target, "event_timestamp", "customer_id"])
train_Y = training_df[target]

In [17]:
train_X.head()

| | year | no_of_days_subscribed | minimum_daily_mins | maximum_daily_mins | videos_watched | maximum_days_inactive |
|---|---|---|---|---|---|---|
| 0 | 2015 | 127 | 6.8 | 37.57 | 4 | 2 |
| 1 | 2015 | 54 | 10.1 | 36.4 | 3 | 3 |
| 2 | 2015 | 159 | 10.4 | 32.15 | 5 | 3 |
| 3 | 2015 | 62 | 12.2 | 16.81 | 1 | 4 |
| 4 | 2015 | 122 | 13.7 | 41.45 | 8 | 4 |


In [18]:
# Build a model with the training data.

tree = DecisionTreeClassifier()
# Sort the columns alphabetically so the feature order is deterministic.
tree.fit(train_X[sorted(train_X)], train_Y)

# Save the model.
dump(tree, "subscription_pred.bin")

['subscription_pred.bin']

Before we can make online predictions with our `subscription_pred` model, we must populate the online store with feature values. To load features into the online store, we use the `publish_table` function:

In [19]:
# Populate the Online Store with the latest features.
# Features that fall within the given time period are materialized.
from datetime import datetime

fs.publish_table(
    start_ts = datetime(2021, 10, 1),
    end_ts = datetime(2021, 11, 1)
)

Materializing 1 feature views from 2021-10-01 00:00:00+00:00 to 2021-11-01 00:00:00+00:00 into the redis online store.



This function loads features from the offline store, from `start_ts` up to `end_ts`, into the online store. The `publish_table` function can be called repeatedly as more data becomes available to keep the online store fresh.
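
One way to organize repeated calls is to start each run where the previous one ended. The sketch below is an assumption about how you might schedule this, not Katonic functionality: `next_window` is a hypothetical helper, the "current" time is made up, and the call to `publish_table` is left as a comment because it needs a live feature store.

```python
from datetime import datetime, timedelta


def next_window(last_end_ts, now, max_span=timedelta(days=31)):
    """Return the next (start_ts, end_ts) window, or None if there is nothing new."""
    if now <= last_end_ts:
        return None
    # Start where the previous run stopped so no interval is skipped,
    # and cap the span so a long gap is caught up in several smaller runs.
    return last_end_ts, min(now, last_end_ts + max_span)


last_end = datetime(2021, 11, 1)  # end_ts of the publish_table call shown earlier
now = datetime(2021, 11, 8)       # made-up "current" time

window = next_window(last_end, now)
print(window)

# With a live feature store, the window could then be passed on, for example:
# fs.publish_table(start_ts=window[0], end_ts=window[1])
```

You would also need to persist the last `end_ts` between runs, for instance in a small file or a scheduler variable.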

### Fetching a feature vector at low latency
Now we have everything we need to make a prediction.

#### Let's make the predictions

In [20]:
# Get the online features by using the entity keys.

customer_ids = [100198, 100756, 101653]
test = fs.get_online_features(
    entity_rows=[{"customer_id": customer_id} for customer_id in customer_ids], # Entity Keys
    feature_view=["subscription_stats_fv"], # Feature View name
    features=["year", "no_of_days_subscribed", "minimum_daily_mins", "maximum_daily_mins", "videos_watched", "maximum_days_inactive"],
).to_df()


In [21]:
test.head()

| | year | no_of_days_subscribed | minimum_daily_mins | maximum_daily_mins | videos_watched | maximum_days_inactive | customer_id |
|---|---|---|---|---|---|---|---|
| 0 | 2015.0 | 62.0 | 12.2 | 16.81 | 1.0 | 4.0 | 100198 |
| 1 | 2015.0 | 126.0 | 11.9 | 9.89 | 1.0 | 4.0 | 100756 |
| 2 | 2015.0 | 191.0 | 10.9 | 27.54 | 7.0 | 3.0 | 101653 |
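
Note that the online values come back as floats (`2015.0`, `62.0`), whereas `training_df.info()` reported `int64` for those columns. A tree model tolerates this, but it can be tidier to bring the online frame back to the training layout before predicting. A minimal sketch, where `align_to_training` is a hypothetical helper and only a subset of the columns is used:

```python
import pandas as pd


def align_to_training(online_df, training_columns, training_dtypes):
    """Reorder columns to the training order and cast them to the training dtypes."""
    aligned = online_df[list(training_columns)]
    return aligned.astype(training_dtypes)


# Column names and dtypes as reported for the training data (subset).
training_columns = ["year", "no_of_days_subscribed", "minimum_daily_mins"]
training_dtypes = {
    "year": "int64",
    "no_of_days_subscribed": "int64",
    "minimum_daily_mins": "float64",
}

# Values copied from the online output, deliberately in a different column order.
online = pd.DataFrame(
    {
        "customer_id": [100198, 100756],
        "minimum_daily_mins": [12.2, 11.9],
        "no_of_days_subscribed": [62.0, 126.0],
        "year": [2015.0, 2015.0],
    }
)

aligned = align_to_training(online, training_columns, training_dtypes)
print(aligned.dtypes)
```

The cast assumes there are no missing values; a customer without materialized features would produce `NaN`, which cannot be cast to `int64` and would need handling first.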


In [24]:
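# Load the model and make the predictions.

from joblib import load

model = load("subscription_pred.bin")

# The model was fitted on alphabetically sorted columns, so apply the same order here.
features = test.drop("customer_id", axis=1)
model.predict(features[sorted(features)])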

array([1, 1, 1])
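
A bare array is hard to act on, so it can help to pair each prediction with its `customer_id` and, where useful, the predicted churn probability. The sketch below uses a tiny made-up model and made-up rows so it runs on its own; the names prefixed with `toy_` are not part of this notebook.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data, only so the sketch runs on its own.
toy_X = pd.DataFrame(
    {"videos_watched": [1, 2, 8, 9], "maximum_days_inactive": [5, 4, 1, 0]}
)
toy_y = pd.Series([1, 1, 0, 0], name="churn")
toy_model = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Made-up online rows, keyed by customer.
toy_online = pd.DataFrame(
    {"customer_id": [11, 12], "videos_watched": [1, 9], "maximum_days_inactive": [5, 0]}
)
# Select the model inputs in the same column order that was used for fitting.
toy_features = toy_online[toy_X.columns]

# Position of the churn class (label 1) in the predict_proba output.
churn_index = list(toy_model.classes_).index(1)

result = pd.DataFrame(
    {
        "customer_id": toy_online["customer_id"],
        "churn_pred": toy_model.predict(toy_features),
        "churn_proba": toy_model.predict_proba(toy_features)[:, churn_index],
    }
)
print(result)
```

For a fully grown decision tree the probabilities are often exactly 0 or 1, so they are more informative with a depth limit or an ensemble model.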