# Feature Store - FEAST


# What is the problem?

### Missing Values
### Label Encoder
### New features combining existing features, e.g. area = length x width




In [1]:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X))



[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]


In [2]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

list(le.classes_)


le.transform(["tokyo", "tokyo", "paris"])

list(le.inverse_transform([2, 2, 1]))

['tokyo', 'tokyo', 'paris']
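The third item in the problem list, a new feature derived from existing ones, has no cell of its own. A minimal sketch with pandas, assuming a toy table with hypothetical `length` and `width` columns (neither the names nor the values come from our data):

```python
import pandas as pd

# Toy data: column names and values are illustrative only.
rooms = pd.DataFrame({"length": [3.0, 4.5, 2.0], "width": [2.0, 2.0, 5.5]})

# Derive a new feature from two existing ones: area = length x width.
rooms["area"] = rooms["length"] * rooms["width"]

print(rooms)
```

Every team that needs `area` would otherwise repeat this line, which is the kind of duplication a feature store is meant to remove.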

# Why use Feature Store?
### Reuse features, compute once and use by many teams
### Easier data quality guarantees
### It reuses existing infra

![title](./images/why_use_fs.png)

# What is Feature Store?

![title](./images/first_screen.png)

# Ingestion Flow

![title](./images/simple_fs.jpg)


# Ingestion Flow-1


![title](./images/ingestion_1.jpg)

# Ingestion Flow-2
### Streaming (not required for us, but...)

![title](./images/simple_fs.jpg)

# How the FS Is Consumed PART-1
![title](./images/consumption_1.jpg)


# How the FS Is Consumed PART-2
![title](./images/consumption_2.jpg)

# What is inside FS?

![title](./images/inside_fs.png)

# What is a Feature Store?


### It is a new concept
### It is an interface between models and data
### It is not Pachyderm or MLflow
### Feature Store is not a database


![title](./images/feature-store-feature-image.png)

# Proposed Architecture

![title](./images/proposed_arch.jpg)

### A high-level architecture of a system using Feast

![title](./images/feast_fraudlent_architecture.png)

In [3]:
import feast
import pandas as pd
from datetime import datetime, timedelta
from google.protobuf.duration_pb2 import Duration
from feast import Entity, Feature, FeatureStore, FeatureView, FileSource, ValueType

In [4]:
!feast version

Feast SDK Version: "feast 0.12.0"


In [5]:
#!feast init business_matching_repo

In [6]:
!ls -R

business_match_feature_defn.py   feast_demo_business_entity.ipynb
data                             feature_store.yaml
feast_demo.ipynb                 images

./data:
business_profile.parquet online_store.db          registry.db

./images:
IMG-5846.jpg                     ingestion_1.jpg
IMG-6020.jpg                     ingestion_2.png
consumption_1.jpg                inside_fs.jpg
consumption_2.jpg                inside_fs.png
feast_fraudlent_architecture.png ooff_to_on.png
feast_logo.png                   proposed_arch.jpg
feastore_usage.jpg               simple_fs.jpg
feature-store-feature-image.png  why_use_fs.png
first_screen.png


In [32]:
from feast import FeatureStore
store = FeatureStore('/Users/pradeeppujari/projects/feast/examples/business_matching_repo/')

### Read offline raw data from a CSV file for now. Later on we could use Snowflake.

In [33]:
#reviews_df = pd.read_excel('/Users/pradeeppujari/projects/feast/examples/data/sds_training_raw_data.xlsx')
sds_raw_df = pd.read_csv('/Users/pradeeppujari/Downloads/sds_training_raw_data.csv')

In [34]:
sds_raw_df['event_timestamp'] = datetime.now()
sds_raw_df['created'] = datetime.now()

In [35]:
sds_raw_df.head(5)

Unnamed: 0,ROW_ID,ELASTICSEARCH_SCORE,BUSINESS_LOCATION_ID,BUSINESS_NAME,BUSINESS_ADDRESS,BUSINESS_CITY,BUSINESS_STATE,BUSINESS_ZIPCODE,BUSINESS_PHONE,BUSINESS_LON,...,PROFILE_LATITUDE,PROFILE_LONGITUDE,PROFILE_PHONE,PROFILE_EXTERNAL_ID,EXTERNAL_SERVICE_TYPE,EXTERNAL_URL,RANDOM,LABEL,event_timestamp,created
0,0,230,1009048158,New York State West Youth,11397 Lpga Dr,Corning,NY,14830,16079629923,-77.0475,...,42.149181,-77.058563,6079629923,6112754,YELLOW_PAGES,http://www.yellowpages.com/corning-ny/mip/new-...,-9223370807250347779,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
1,1,225,3910169,NEW YORK STATE WEST YOUTH INTERN,11397 LPGA DRIVE,CORNING,NY,14830,6079629923,-77.0475,...,42.149181,-77.058563,6079629923,6112754,YELLOW_PAGES,http://www.yellowpages.com/corning-ny/mip/new-...,-9223370807250347779,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
2,2,225,3910247,NEW YORK STATE WEST YOUTH MOTO,11397 LPGA DR,CORNING,NY,14830,6079629923,-77.0475,...,42.149181,-77.058563,6079629923,6112754,YELLOW_PAGES,http://www.yellowpages.com/corning-ny/mip/new-...,-9223370807250347779,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
3,3,28,1007547202,New York State Fish Hatchery,7169 Fish Hatchery Rd,Bath,NY,14810,16077767087,-77.3028,...,42.149181,-77.058563,6079629923,6112754,YELLOW_PAGES,http://www.yellowpages.com/corning-ny/mip/new-...,-9223370807250347779,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
4,4,28,1003465690,New York State United Teachers,100 W Church St # 200,Elmira,NY,14901,16077321928,-76.812,...,42.149181,-77.058563,6079629923,6112754,YELLOW_PAGES,http://www.yellowpages.com/corning-ny/mip/new-...,-9223370807250347779,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444


In [36]:
# Drop columns that are not needed for the match features.
sds_raw_df = sds_raw_df.drop(columns=[
    'PROFILE_ID_2', 'PROFILE_WEBSITE_URL', 'BUSINESS_LAT', 'PROFILE_EXTERNAL_ID',
    'EXTERNAL_SERVICE_TYPE', 'EXTERNAL_URL', 'BUSINESS_LON', 'RANDOM',
    'ELASTICSEARCH_SCORE', 'BUSINESS_LOCATION_ID', 'PROFILE_LATITUDE', 'PROFILE_LONGITUDE',
])

In [37]:
sds_raw_df['city'] = sds_raw_df['BUSINESS_CITY'].str.lower() == sds_raw_df['PROFILE_CITY'].str.lower()
sds_raw_df['city'] = sds_raw_df['city'].astype(int)

In [38]:
sds_raw_df['state'] = sds_raw_df['BUSINESS_STATE'].str.lower() == sds_raw_df['PROFILE_STATE_CODE'].str.lower()
sds_raw_df['state'] = sds_raw_df['state'].astype(int)

In [39]:
# Data clean-up: keep the 5-digit ZIP and compare both sides as strings.
sds_raw_df['BUSINESS_ZIP'] = sds_raw_df['BUSINESS_ZIPCODE'].str[:5]
sds_raw_df['PROFILE_ZIP'] = sds_raw_df['PROFILE_ZIP_CODE'].apply(str)
# Feature generation: 1 if the ZIP codes match, else 0.
sds_raw_df['zip_code'] = sds_raw_df['BUSINESS_ZIP'] == sds_raw_df['PROFILE_ZIP']
sds_raw_df['zip_code'] = sds_raw_df['zip_code'].astype(int)
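The city, state and zip_code flags all follow the same pattern: compare two columns and store the result as 0/1. A minimal sketch of how that could be factored out, assuming a hypothetical helper `match_flag` (not part of this notebook) and a two-row toy frame:

```python
import pandas as pd


def match_flag(df, left, right):
    """Return 1 where two text columns are equal ignoring case, else 0."""
    return (df[left].str.lower() == df[right].str.lower()).astype(int)


# Toy rows for illustration only.
toy = pd.DataFrame({
    "BUSINESS_CITY": ["Corning", "Bath"],
    "PROFILE_CITY": ["CORNING", "Corning"],
})
toy["city"] = match_flag(toy, "BUSINESS_CITY", "PROFILE_CITY")
print(toy["city"].tolist())
```

The helper keeps the same comparison as the cells above (lower-casing only), so it would not change the generated features.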

In [40]:
sds_raw_df.head(5)

Unnamed: 0,ROW_ID,BUSINESS_NAME,BUSINESS_ADDRESS,BUSINESS_CITY,BUSINESS_STATE,BUSINESS_ZIPCODE,BUSINESS_PHONE,PROFILE_ID,PROFILE_NAME,PROFILE_ADDRESS_1,...,PROFILE_ZIP_CODE,PROFILE_PHONE,LABEL,event_timestamp,created,city,state,BUSINESS_ZIP,PROFILE_ZIP,zip_code
0,0,New York State West Youth,11397 Lpga Dr,Corning,NY,14830,16079629923,-9069183504506314117,New York State West Youth Soc,41 Riverside Dr,...,14830,6079629923,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444,1,1,14830,14830,1
1,1,NEW YORK STATE WEST YOUTH INTERN,11397 LPGA DRIVE,CORNING,NY,14830,6079629923,-9069183504506314117,New York State West Youth Soc,41 Riverside Dr,...,14830,6079629923,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444,1,1,14830,14830,1
2,2,NEW YORK STATE WEST YOUTH MOTO,11397 LPGA DR,CORNING,NY,14830,6079629923,-9069183504506314117,New York State West Youth Soc,41 Riverside Dr,...,14830,6079629923,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444,1,1,14830,14830,1
3,3,New York State Fish Hatchery,7169 Fish Hatchery Rd,Bath,NY,14810,16077767087,-9069183504506314117,New York State West Youth Soc,41 Riverside Dr,...,14830,6079629923,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444,0,1,14810,14830,0
4,4,New York State United Teachers,100 W Church St # 200,Elmira,NY,14901,16077321928,-9069183504506314117,New York State West Youth Soc,41 Riverside Dr,...,14830,6079629923,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444,0,1,14901,14830,0


In [41]:
df = sds_raw_df[['BUSINESS_NAME','BUSINESS_ADDRESS','BUSINESS_CITY','ROW_ID','city','state','zip_code','event_timestamp','created']]

In [42]:
df.to_parquet('./data/business_profile.parquet')

In [43]:
pd.read_parquet('./data/business_profile.parquet')

Unnamed: 0,BUSINESS_NAME,BUSINESS_ADDRESS,BUSINESS_CITY,ROW_ID,city,state,zip_code,event_timestamp,created
0,New York State West Youth,11397 Lpga Dr,Corning,0,1,1,1,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
1,NEW YORK STATE WEST YOUTH INTERN,11397 LPGA DRIVE,CORNING,1,1,1,1,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
2,NEW YORK STATE WEST YOUTH MOTO,11397 LPGA DR,CORNING,2,1,1,1,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
3,New York State Fish Hatchery,7169 Fish Hatchery Rd,Bath,3,0,1,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
4,New York State United Teachers,100 W Church St # 200,Elmira,4,0,1,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
...,...,...,...,...,...,...,...,...,...
28709,"Samuel Mayeda, MD",1140 W La Veta Ave Ste 420,Orange,28709,0,1,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
28710,Mayeda Samuel MD,1140 W La Veta Ave Ste 420,Orange,28710,0,1,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
28711,SAMUEL A SILAO MD,2557 CHINO HILLS PKWY STE A,CHINO HILLS,28711,0,1,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444
28712,Samuel S Galley MD,Suite D,Fountain Valley,28712,0,1,0,2021-09-13 14:56:12.644761,2021-09-13 14:56:12.646444


### Create a new FeatureView

### Read data from Parquet files. Parquet is convenient for local development. For production, you can use your favorite DWH, such as Snowflake.


In [44]:
profile_stats = FileSource(
    path="/Users/pradeeppujari/projects/feast/examples/business_matching_repo/data/business_profile.parquet",
    event_timestamp_column="event_timestamp",
    created_timestamp_column="created",
)

### Define an entity for the business profile. You can think of an entity as a primary key used to fetch features.


In [45]:
business_entity = Entity(name="ROW_ID", value_type=ValueType.INT64, description="profile",)

In [46]:
profile_stats_view = FeatureView(
    name="profile_stats",
    entities=["ROW_ID"],
    ttl=Duration(seconds=86400 * 1),
    features=[
        Feature(name="BUSINESS_NAME", dtype=ValueType.STRING),
        Feature(name="BUSINESS_ADDRESS", dtype=ValueType.STRING),
        Feature(name="BUSINESS_CITY", dtype=ValueType.STRING),
        # The match flags are 0/1 integers in the parquet file.
        Feature(name="city", dtype=ValueType.INT64),
        Feature(name="state", dtype=ValueType.INT64),
        Feature(name="zip_code", dtype=ValueType.INT64),
    ],
    online=True,
    batch_source=profile_stats,
    tags={},
)


In [47]:
!feast apply

Registered entity ROW_ID
Registered feature view profile_stats
Deploying infrastructure for profile_stats


### Generate training data

In [51]:
# The entity dataframe is the dataframe we want to enrich with feature values
entity_df = pd.DataFrame.from_dict(
    {
        "ROW_ID": [0, 2, 4, 7],
        "event_timestamp": [
            datetime.now() - timedelta(minutes=0),
            datetime.now() - timedelta(minutes=1),
            datetime.now() - timedelta(minutes=3),
            datetime.now() - timedelta(minutes=3),
        ],
    }
)

store = FeatureStore(repo_path=".")

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "profile_stats:BUSINESS_NAME",
        "profile_stats:BUSINESS_ADDRESS",
        "profile_stats:BUSINESS_CITY",
        "profile_stats:city",
        "profile_stats:state",
        "profile_stats:zip_code",
    ],
).to_df()

print("----- Feature schema -----\n")
print(training_df.info())

print()
print("----- Example features -----\n")
print(training_df.head())

----- Feature schema -----

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   event_timestamp   4 non-null      datetime64[ns, UTC]
 1   ROW_ID            4 non-null      int64              
 2   BUSINESS_NAME     4 non-null      object             
 3   BUSINESS_ADDRESS  4 non-null      object             
 4   BUSINESS_CITY     4 non-null      object             
 5   city              4 non-null      int64              
 6   state             4 non-null      int64              
 7   zip_code          4 non-null      int64              
dtypes: datetime64[ns, UTC](1), int64(4), object(3)
memory usage: 288.0+ bytes
None

----- Example features -----

                   event_timestamp  ROW_ID  \
0 2021-09-13 15:04:33.684508+00:00       4   
1 2021-09-13 15:04:33.684509+00:00       7   
2 2021-09-13 15:06:33.684505+
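`get_historical_features` joins each entity row with the feature values that were valid at that row's `event_timestamp` (a point-in-time join). A minimal sketch of that idea using `pandas.merge_asof` on toy data; this only illustrates the semantics and is not how Feast implements it, and the one-day tolerance simply mirrors the `ttl` of the feature view:

```python
import pandas as pd

# Toy feature rows: two versions of one feature for ROW_ID 0 (illustrative values).
features = pd.DataFrame({
    "ROW_ID": [0, 0],
    "event_timestamp": pd.to_datetime(["2021-09-13 10:00", "2021-09-13 14:00"]),
    "zip_code": [0, 1],
})

# Entity rows: the keys and points in time we want feature values for.
entities = pd.DataFrame({
    "ROW_ID": [0, 0],
    "event_timestamp": pd.to_datetime(["2021-09-13 12:00", "2021-09-13 15:00"]),
})

# For each entity row, take the latest feature row at or before its timestamp,
# ignoring feature rows older than the tolerance (one day here).
joined = pd.merge_asof(
    entities.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="ROW_ID",
    direction="backward",
    tolerance=pd.Timedelta(days=1),
)
print(joined)
```

The entity row at 12:00 picks up the 10:00 value and the row at 15:00 picks up the 14:00 value, so a model trained on the result never sees feature values from its own future.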

### Model fit code here
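A placeholder sketch of what the model fit could look like, assuming scikit-learn's `LogisticRegression` and a toy frame. Note that `training_df` above does not contain `LABEL` (it is not part of the feature view), so the label would have to be joined in separately; the label values below are made up for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for training_df with a label column attached.
# All values, including LABEL, are illustrative and not taken from our data.
toy_training_df = pd.DataFrame({
    "city":     [1, 1, 0, 0, 1, 0],
    "state":    [1, 1, 1, 0, 1, 1],
    "zip_code": [1, 1, 0, 0, 1, 0],
    "LABEL":    [1, 1, 0, 0, 1, 0],
})

feature_cols = ["city", "state", "zip_code"]

# Fit a simple classifier on the three match flags.
model = LogisticRegression()
model.fit(toy_training_df[feature_cols], toy_training_df["LABEL"])

predictions = model.predict(toy_training_df[feature_cols])
print(predictions)
```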

### Step 5: Load features into your online store

We now serialize the latest values of features since the beginning of time to prepare for serving (note: materialize-incremental serializes all new features since the last materialize call).

In [49]:
!feast materialize-incremental {datetime.now().isoformat()}

Materializing 1 feature views to 2021-09-13 08:01:17-07:00 into the sqlite online store.

profile_stats from 2021-09-13 06:41:43-07:00 to 2021-09-13 08:01:17-07:00:
100%|███████████████████████████████████████████████████████| 28714/28714 [00:06<00:00, 4681.00it/s]


### Step 5b: Inspect materialized features
Note that now there are online_store.db and registry.db, which store the materialized features and schema information, respectively.

In [50]:
print("--- Data directory ---")
!ls data

import sqlite3
import pandas as pd
con = sqlite3.connect("data/online_store.db")
print("\n--- Schema of online store ---")
print(
    pd.read_sql_query(
        "select * FROM business_matching_repo_profile_stats", con).columns.tolist())
con.close()

--- Data directory ---
business_profile.parquet online_store.db          registry.db

--- Schema of online store ---
['entity_key', 'feature_name', 'value', 'event_ts', 'created_ts']
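The schema above is a long format: one row per entity and feature rather than one column per feature. A minimal sketch of how such rows can be gathered into one feature vector, using an in-memory SQLite table. The table name `toy_profile_stats` and the plain-text values are simplifications for illustration; the real online store is assumed to hold serialized keys and values rather than readable text.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Same column names as the schema printed above; table name is hypothetical.
con.execute(
    "CREATE TABLE toy_profile_stats "
    "(entity_key TEXT, feature_name TEXT, value TEXT, event_ts TEXT, created_ts TEXT)"
)

ts = "2021-09-13 14:56:12"
rows = [
    ("0", "city", "1", ts, ts),
    ("0", "state", "1", ts, ts),
    ("0", "zip_code", "1", ts, ts),
    ("3", "city", "0", ts, ts),
    ("3", "state", "1", ts, ts),
    ("3", "zip_code", "0", ts, ts),
]
con.executemany("INSERT INTO toy_profile_stats VALUES (?, ?, ?, ?, ?)", rows)


def feature_vector(connection, entity_key):
    """Collect all (feature_name, value) rows for one entity into a dict."""
    cursor = connection.execute(
        "SELECT feature_name, value FROM toy_profile_stats WHERE entity_key = ?",
        (entity_key,),
    )
    return dict(cursor.fetchall())


vector_3 = feature_vector(con, "3")
print(vector_3)
con.close()
```

A lookup by entity key touches only that entity's rows, which is what makes this layout suitable for low-latency serving.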


### Predict

# Next steps

### Connect to the Snowflake business entity data
### Direct connectivity to the training and evaluation platform