# Katonic Feature Store

## Use case: House Price Prediction
### Overview 
#### Predicting a house's price from its properties. These estimates are useful in the real estate industry for suggesting an appropriate price to a customer.

In [1]:
# Importing the FeatureStore functionality from the Kfs package.
from katonic.fs.feature_store import FeatureStore

### Using Local Provider.

In [2]:
fs = FeatureStore(
    user_name = "user", # user name.
    project_name = "house_price_prediction", # give a name for your project.
    description = "using machine learning for house price prediction", # project description.
) 

In [3]:
# We've successfully initialized the FeatureStore with the Local Provider.
# Let's import a few more necessary classes.

from katonic.fs.entities import Entity, FeatureView
from katonic.fs.core.offline_stores import FileSource
from katonic.fs.value_type import ValueType

In [4]:
# Let's define the Entity key.
entity_key = Entity(name = "id", value_type=ValueType.INT64)

In [5]:
data_source = FileSource(
    path = "datasets/housing_data.csv", # Path to your data source file.
    file_format = "csv", # Format of your data source: CSV or PARQUET.
    event_timestamp_column = "event_timestamp"  # The column that records when each event occurred.
)

Feature views let users register their organization's data sources with the Feature Store as offline feature stores, and then use those offline stores for both training and online inference.

The feature view definition below tells Feature Store where to find the house features.

In [6]:
# Define the columns to include in the feature view.
cols = ['bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement']

In [7]:
# Feature View

house_price_prediction_view  = FeatureView(
    name="house_price_prediction_local", # Feature view name
    entities=["id"], # Entity Key
    ttl="2d", # Time to live (TTL) for the feature view, given in hours, days, or months.
    features=cols,  # Columns you want in Feature View.
    batch_source=data_source, # data source
)
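The `ttl` argument can be easier to reason about with a small example. The sketch below is not Katonic code: it assumes that `ttl` acts as a look-back window, as it does in Feast-style feature stores, so that a feature row is usable for a request only if it is no older than the TTL. The helper name `is_within_ttl` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_within_ttl(feature_ts: datetime, entity_ts: datetime, ttl: timedelta) -> bool:
    """Return True if a feature row observed at feature_ts may serve a request at entity_ts.

    Assumption: the TTL is a look-back window ending at the request time.
    """
    age = entity_ts - feature_ts
    # The row must not come from the future and must not be older than the TTL.
    return timedelta(0) <= age <= ttl


ttl = timedelta(days=2)  # mirrors ttl="2d" in the feature view above
request_time = datetime(2014, 1, 6, tzinfo=timezone.utc)

fresh = is_within_ttl(datetime(2014, 1, 5, tzinfo=timezone.utc), request_time, ttl)
stale = is_within_ttl(datetime(2014, 1, 1, tzinfo=timezone.utc), request_time, ttl)
future = is_within_ttl(datetime(2014, 1, 7, tzinfo=timezone.utc), request_time, ttl)
print(fresh, stale, future)
```

Under this assumption, a short TTL keeps stale values out of training and serving data, while a long TTL tolerates sources that are updated infrequently.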

Now that we have defined our first feature view, we can apply the changes to create our feature registry and configure our infrastructure:

## Register and Deploy feature definitions.

In [8]:
# Write the data to Offline Store.
%time

fs.write_table([entity_key, house_price_prediction_view])

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.25 µs
Registered entity id
Registered feature view house_price_prediction_local
Deploying infrastructure for house_price_prediction_local


The preceding `write_table` function will:

- Store all entity and feature view definitions in a local file called `registry.db`.
- Create an empty `Redis-Server` table for serving the house features.
- Ensure that the data sources defined with `FileSource` are available.

### Building a training dataset.

In [9]:
# Your entity DataFrame, which includes the entity key and the event timestamp column.
import pandas as pd
df = pd.read_csv("datasets/house_data_entity_df.csv")

# Make sure the timestamp column has a datetime data type.
df["event_timestamp"] = pd.to_datetime(df["event_timestamp"])

In [10]:
# Basic Information about your data.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   Unnamed: 0       20000 non-null  int64              
 1   id               20000 non-null  int64              
 2   event_timestamp  20000 non-null  datetime64[ns, UTC]
 3   price            20000 non-null  int64              
dtypes: datetime64[ns, UTC](1), int64(3)
memory usage: 625.1 KB


In [23]:
df.drop("Unnamed: 0",axis=1,inplace=True)

In [24]:
# Getting the historical features from Offline Store for Training a Model.

training_df = fs.get_historical_features(
    entity_df = df, # Your Entity Data Frame.
    feature_view = ["house_price_prediction_local"], # Feature View name.
    features = cols # The columns you want to retrieve
).to_df()

In [25]:
training_df.head()

|   | event_timestamp | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-06 00:00:00+00:00 | 3524039060 | 250000 | 1 | 1.0 | 750 | 4000 | 1.0 | 0 | 0 | 3 | 6 | 750 | 0 |
| 1 | 2014-01-06 00:00:00+00:00 | 1138010520 | 459000 | 3 | 1.75 | 1620 | 7330 | 1.0 | 0 | 0 | 4 | 7 | 1090 | 530 |
| 2 | 2014-01-06 00:00:00+00:00 | 8643200055 | 243000 | 3 | 1.75 | 1790 | 12000 | 1.0 | 0 | 0 | 3 | 7 | 1040 | 750 |
| 3 | 2014-01-06 00:00:00+00:00 | 540100056 | 843500 | 4 | 2.0 | 2630 | 16475 | 2.0 | 0 | 0 | 4 | 8 | 2630 | 0 |
| 4 | 2014-01-06 00:00:00+00:00 | 8655000070 | 1600000 | 5 | 3.0 | 3640 | 8239 | 2.0 | 0 | 3 | 3 | 10 | 2540 | 1100 |
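Historical retrieval of this kind is usually described as a point-in-time join: each entity row is paired with the most recent feature values recorded at or before that row's `event_timestamp`, so that no future information leaks into training. The sketch below illustrates the idea in plain Python. It is a conceptual illustration, not Katonic's implementation, and the function name `point_in_time_join` and the toy rows are hypothetical.

```python
from datetime import datetime

# Hypothetical feature rows: (id, event_timestamp, sqft_living)
feature_rows = [
    (1, datetime(2014, 1, 1), 700),
    (1, datetime(2014, 1, 5), 750),
    (1, datetime(2014, 1, 9), 800),
    (2, datetime(2014, 1, 3), 1620),
]

# Hypothetical entity rows: (id, event_timestamp, price)
entity_rows = [
    (1, datetime(2014, 1, 6), 250000),
    (2, datetime(2014, 1, 2), 459000),
]


def point_in_time_join(entity_rows, feature_rows):
    """Attach to each entity row the latest feature value not newer than the row itself."""
    joined = []
    for ent_id, ent_ts, price in entity_rows:
        # Keep only rows for the same id that were known at the entity timestamp.
        candidates = [
            (ts, value)
            for fid, ts, value in feature_rows
            if fid == ent_id and ts <= ent_ts
        ]
        # Pick the most recent candidate, or None when nothing was known yet.
        latest_value = max(candidates, key=lambda c: c[0])[1] if candidates else None
        joined.append((ent_id, ent_ts, price, latest_value))
    return joined


training_rows = point_in_time_join(entity_rows, feature_rows)
for row in training_rows:
    print(row)
```

In this toy data, the first entity row receives the value recorded on 2014-01-05 rather than the later one from 2014-01-09, and the second receives no value because its only feature row is newer than the entity timestamp.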


Once we have retrieved the complete training dataset, we can:

- Drop timestamp columns and the `id` column.
- Encode categorical features (if any).
- Split the training dataframe into a train, validation, and test set.
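The splitting step above can be sketched with the standard library alone. The helper `split_indices`, the 80/10/10 proportions, and the seed are assumptions chosen for illustration; the notebook itself trains on the full `training_df`.

```python
import random


def split_indices(n_rows, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle row positions and split them into train, validation, and test lists."""
    indices = list(range(n_rows))
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(indices)

    n_test = int(n_rows * test_fraction)
    n_val = int(n_rows * val_fraction)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


# 20000 matches the number of rows in the entity DataFrame shown earlier.
train_idx, val_idx, test_idx = split_indices(20000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a pandas DataFrame, these position lists could be passed to `training_df.iloc[train_idx]` and so on. scikit-learn's `train_test_split` is a common alternative when only two parts are needed per call.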

In [26]:
# Imports for training the model and saving it to disk.
from joblib import dump, load
from sklearn.linear_model import LinearRegression

In [27]:
# Remove the unnecessary columns and split the data into features (X) and target (y).

X_train = training_df.drop(["event_timestamp", "price", "id"], axis=1)
y_train = training_df["price"]

In [28]:
X_train.head()

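|   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 750 | 4000 | 1.0 | 0 | 0 | 3 | 6 | 750 | 0 |
| 1 | 3 | 1.75 | 1620 | 7330 | 1.0 | 0 | 0 | 4 | 7 | 1090 | 530 |
| 2 | 3 | 1.75 | 1790 | 12000 | 1.0 | 0 | 0 | 3 | 7 | 1040 | 750 |
| 3 | 4 | 2.0 | 2630 | 16475 | 2.0 | 0 | 0 | 4 | 8 | 2630 | 0 |
| 4 | 5 | 3.0 | 3640 | 8239 | 2.0 | 0 | 3 | 3 | 10 | 2540 | 1100 |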


In [29]:
# Fit a linear regression model on the training data and save it to disk.

regressor = LinearRegression()
regressor.fit(X_train, y_train)
dump(regressor, "house_price_predict_local.bin")

['house_price_predict_local.bin']

Before we can make online predictions with our house price model, we must populate our online store with feature values. To load features into the online store, we use the `publish_table` function:

In [30]:
# Populate the online store with the latest features.
# Features with timestamps between start_ts and end_ts will be materialized.

from datetime import datetime

fs.publish_table(
    start_ts = datetime(2014, 1, 1), # Give a start date
    end_ts = datetime(2016, 5, 1) # End date.
)

Materializing 1 feature views from 2014-01-01 00:00:00+00:00 to 2016-05-01 00:00:00+00:00 into the redis online store.



This function will load features from our offline store, covering the period from `start_ts` up to `end_ts`. The `publish_table` function can be called repeatedly as more data becomes available in order to keep the online store fresh.
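Conceptually, materialization keeps one row per entity key: the most recent feature values that fall inside the requested time window. The sketch below illustrates that idea with a plain dictionary standing in for the online store. It is not how Katonic or Redis implements it, the function `latest_per_entity` and the toy rows are hypothetical, and treating both window bounds as inclusive is an assumption.

```python
from datetime import datetime


def latest_per_entity(rows, start_ts, end_ts):
    """Return {id: features} holding the newest row per id within [start_ts, end_ts]."""
    newest = {}
    for ent_id, ts, features in rows:
        # Skip rows outside the materialization window.
        if not (start_ts <= ts <= end_ts):
            continue
        # Keep the row only if it is newer than what we have stored for this id.
        if ent_id not in newest or ts > newest[ent_id][0]:
            newest[ent_id] = (ts, features)
    return {ent_id: features for ent_id, (_, features) in newest.items()}


# Hypothetical offline rows: (id, event_timestamp, features)
offline_rows = [
    (1, datetime(2014, 1, 6), {"bedrooms": 4}),
    (1, datetime(2015, 3, 1), {"bedrooms": 5}),
    (1, datetime(2017, 1, 1), {"bedrooms": 6}),  # after end_ts, so ignored
    (2, datetime(2014, 1, 6), {"bedrooms": 3}),
]

online_store = latest_per_entity(
    offline_rows,
    start_ts=datetime(2014, 1, 1),
    end_ts=datetime(2016, 5, 1),
)
print(online_store)
```

A later call with a newer window would overwrite the stored values for any key that has fresher rows, which is why repeated calls keep the online store current.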

## Fetching a Feature Vector at low latency.
### Now we have the test data to make a prediction.

In [31]:
# Getting the Online features by using the entity keys.

ids = [540100056,1138010520,3524039060]
test = fs.get_online_features(
    entity_rows=[{"id": house_id} for house_id in ids], # Entity keys
    feature_view=["house_price_prediction_local"], # Feature View name
    features=cols, # Columns
).to_df()


In [32]:
# Test Dataset.
test.head()

|   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 2.0 | 2630.0 | 16475.0 | 2.0 | 0.0 | 0.0 | 4.0 | 8.0 | 2630.0 | 0.0 | 540100056 |
| 1 | 3.0 | 1.75 | 1620.0 | 7330.0 | 1.0 | 0.0 | 0.0 | 4.0 | 7.0 | 1090.0 | 530.0 | 1138010520 |
| 2 | 1.0 | 1.0 | 750.0 | 4000.0 | 1.0 | 0.0 | 0.0 | 3.0 | 6.0 | 750.0 | 0.0 | 3524039060 |


In [33]:
len(X_train.columns)

11

In [34]:
len(test.columns)

12

In [35]:
# Loading the model and making predictions.

model = load("house_price_predict_local.bin")
model.predict(test.drop("id", axis=1))

array([668436.10486792, 426522.67860638, 172935.5465954 ])
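The column counts checked above (11 for training, 12 for the online result) hint at a general precaution: a fitted model expects the same features, in the same order, as it saw during training. The sketch below shows one way to enforce that before calling `predict`. The helper `to_feature_vector` is hypothetical, and the feature list is shortened to three columns for brevity.

```python
def to_feature_vector(row, feature_order):
    """Return the row's values in training order, ignoring extra keys such as 'id'."""
    missing = [name for name in feature_order if name not in row]
    if missing:
        # Fail loudly instead of silently predicting on incomplete input.
        raise KeyError(f"missing features: {missing}")
    return [row[name] for name in feature_order]


# Shortened, illustrative feature order; the notebook uses the full `cols` list.
feature_order = ["bedrooms", "bathrooms", "sqft_living"]

# An online row may arrive with a different key order and extra keys.
online_row = {"sqft_living": 2630.0, "id": 540100056, "bedrooms": 4.0, "bathrooms": 2.0}

vector = to_feature_vector(online_row, feature_order)
print(vector)
```

With pandas, a similar effect can be achieved by selecting columns explicitly, for example `test[X_train.columns]`, which both drops `id` and fixes the column order.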