# Module 1: Introduction to SageMaker Feature Store

**Note:** Set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Load and explore datasets](#Load-and-explore-datasets)
1. [Create feature definitions and groups](#Create-feature-definitions-and-groups)
1. [Ingest data into feature groups](#Ingest-data-into-feature-groups)
1. [Get feature record from the Online feature store](#Get-feature-record-from-the-Online-feature-store)
1. [List feature groups](#List-feature-groups)

# Background

In this notebook, you will learn how to create **3** feature groups for the `customers`, `products` and `orders` datasets in the SageMaker Feature Store. You will then learn how to ingest the feature columns into the created feature groups (both the Online and the Offline store) using the SageMaker Python SDK. You will also see how to get an ingested feature record from the Online store. Finally, you will learn how to list all the feature groups created within the Feature Store and delete them.

**Note:** The feature groups created in this notebook will be used in the upcoming modules.


# Setup

#### Imports

In [35]:
from sagemaker.feature_store.feature_group import FeatureGroup
from time import gmtime, strftime, sleep
from random import randint
import pandas as pd
import numpy as np
import subprocess
import sagemaker
import importlib
import logging
import time
import sys

In [36]:
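# Compare versions numerically; comparing the raw strings would rank '2.100.0' below '2.48.1'
installed_version = tuple(int(part) for part in sagemaker.__version__.split('.')[:3])
if installed_version < (2, 48, 1):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker==2.48.1'])
    importlib.reload(sagemaker)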
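
Version strings compare character by character, so a string comparison can misorder releases once a component reaches three digits. The following is a minimal sketch of a numeric comparison; `version_tuple` is a hypothetical helper, not part of the SageMaker SDK, and it assumes versions of the form `major.minor.patch` with optional suffixes.

```python
import re

def version_tuple(version_string):
    """Turn '2.48.1' into (2, 48, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in version_string.split('.')[:3]:
        match = re.match(r'\d+', piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)

# Lexicographic comparison: '1' sorts before '4', so the newer release looks older.
print('2.100.0' < '2.48.1')                                # True (misleading)
# Numeric comparison: 100 is greater than 48.
print(version_tuple('2.100.0') < version_tuple('2.48.1'))  # False
```

If the `packaging` library is available in the kernel, `packaging.version.parse` would be a more complete alternative.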

In [37]:
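logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# Attach the handler only once; re-running this cell would otherwise duplicate every log line
if not logger.handlers:
    logger.addHandler(logging.StreamHandler())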
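
Each call to `addHandler` attaches another handler, so running the logging cell more than once is one likely cause of log lines being printed twice. A minimal sketch of an idempotent setup follows; `get_notebook_logger` and the logger name `fscw-demo` are hypothetical, and disabling propagation is an assumption about how the notebook environment configures the root logger.

```python
import io
import logging

def get_notebook_logger(name, stream=None):
    """Return a logger that has exactly one stream handler, however often this is called."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # a previous cell run may already have attached one
        logger.addHandler(logging.StreamHandler(stream))
    logger.propagate = False  # avoid a second copy through the root logger's handlers
    return logger

buffer = io.StringIO()
for _ in range(3):  # simulate running the cell three times
    demo_logger = get_notebook_logger('fscw-demo', stream=buffer)

demo_logger.info('Using SageMaker Feature Store')
print(buffer.getvalue().count('Using SageMaker Feature Store'))  # 1
```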

In [38]:
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')

Using SageMaker version: 2.70.0
Using Pandas version: 1.0.1


#### Essentials

In [39]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket()
logger.info(f'Default S3 bucket = {default_bucket}')
prefix = 'sagemaker-feature-store'

Default S3 bucket = sagemaker-us-west-2-119387606724


In [40]:
region = sagemaker_session.boto_region_name

# Load and explore datasets

In [41]:
customers_df = pd.read_csv('../data/transformed/customers.csv')
customers_df.head(5)

|   | customer_id | sex | is_married | event_time | age_18-29 | age_30-39 | age_40-49 | age_50-59 | age_60-69 | age_70-plus | n_days_active |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C1 | 1 | 1 | 2022-02-15T23:29:37.454Z | 0 | 0 | 0 | 1 | 0 | 0 | 0.521918 |
| 1 | C2 | 0 | 1 | 2022-02-15T23:29:37.456Z | 1 | 0 | 0 | 0 | 0 | 0 | 0.142466 |
| 2 | C3 | 0 | 1 | 2022-02-15T23:29:37.458Z | 0 | 0 | 0 | 0 | 1 | 0 | 0.141096 |
| 3 | C4 | 0 | 1 | 2022-02-15T23:29:37.459Z | 0 | 0 | 0 | 1 | 0 | 0 | 0.887671 |
| 4 | C5 | 0 | 1 | 2022-02-15T23:29:37.460Z | 0 | 1 | 0 | 0 | 0 | 0 | 0.265753 |


In [42]:
customers_df.dtypes

customer_id       object
sex                int64
is_married         int64
event_time        object
age_18-29          int64
age_30-39          int64
age_40-49          int64
age_50-59          int64
age_60-69          int64
age_70-plus        int64
n_days_active    float64
dtype: object

In [43]:
customers_df['customer_id'] = customers_df['customer_id'].astype('string')
customers_df['event_time'] = customers_df['event_time'].astype('string')

In [44]:
customers_df.dtypes

customer_id       string
sex                int64
is_married         int64
event_time        string
age_18-29          int64
age_30-39          int64
age_40-49          int64
age_50-59          int64
age_60-69          int64
age_70-plus        int64
n_days_active    float64
dtype: object
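
The `event_time` values in this dataset are ISO-8601 UTC timestamps with millisecond precision, such as `2022-02-15T23:29:37.454Z`, which is one of the string formats Feature Store is documented to accept for event times. A minimal sketch of producing such a string from a Python `datetime` follows; `to_event_time` is a hypothetical helper and is not taken from the dataset preparation code.

```python
from datetime import datetime, timezone

def to_event_time(moment):
    """Format a timezone-aware datetime as yyyy-MM-ddTHH:mm:ss.SSSZ in UTC."""
    utc = moment.astimezone(timezone.utc)
    # strftime has no millisecond directive, so reduce microseconds to three digits
    return utc.strftime('%Y-%m-%dT%H:%M:%S.') + f'{utc.microsecond // 1000:03d}Z'

sample = datetime(2022, 2, 15, 23, 29, 37, 454000, tzinfo=timezone.utc)
print(to_event_time(sample))  # 2022-02-15T23:29:37.454Z
```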

In [45]:
products_df = pd.read_csv('../data/transformed/products.csv')
products_df.head(5)

|   | product_id | event_time | category_baby_food_formula | category_baking_ingredients | category_candy_chocolate | category_chips_pretzels | category_cleaning_products | category_coffee | category_cookies_cakes | category_crackers | ... | category_hair_care | category_ice_cream_ice | category_juice_nectars | category_packaged_cheese | category_refrigerated | category_soup_broth_bouillon | category_spices_seasonings | category_tea | category_vitamins_supplements | category_yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P1 | 2022-02-15T23:29:52.850Z | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | P2 | 2022-02-15T23:29:52.850Z | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | P3 | 2022-02-15T23:29:52.850Z | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | P4 | 2022-02-15T23:29:52.850Z | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | P5 | 2022-02-15T23:29:52.850Z | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [46]:
products_df['product_id'] = products_df['product_id'].astype('string')
products_df['event_time'] = products_df['event_time'].astype('string')

In [47]:
products_df.dtypes

product_id                       string
event_time                       string
category_baby_food_formula        int64
category_baking_ingredients       int64
category_candy_chocolate          int64
category_chips_pretzels           int64
category_cleaning_products        int64
category_coffee                   int64
category_cookies_cakes            int64
category_crackers                 int64
category_energy_granola_bars      int64
category_frozen_meals             int64
category_hair_care                int64
category_ice_cream_ice            int64
category_juice_nectars            int64
category_packaged_cheese          int64
category_refrigerated             int64
category_soup_broth_bouillon      int64
category_spices_seasonings        int64
category_tea                      int64
category_vitamins_supplements     int64
category_yogurt                   int64
dtype: object

In [48]:
orders_df = pd.read_csv('../data/transformed/orders.csv')
orders_df

|   | order_id | customer_id | product_id | purchase_amount | is_reordered | event_time | n_days_since_last_purchase |
|---|---|---|---|---|---|---|---|
| 0 | O1 | C5731 | P16 | 0.913465 | 1 | 2022-02-15T23:29:53.387Z | 0.122093 |
| 1 | O2 | C3541 | P12802 | 0.663168 | 1 | 2022-02-15T23:29:53.387Z | 0.903101 |
| 2 | O3 | C7402 | P8320 | 0.629604 | 1 | 2022-02-15T23:29:53.387Z | 0.054264 |
| 3 | O4 | C7356 | P5165 | 0.618911 | 0 | 2022-02-15T23:29:53.387Z | 0.343023 |
| 4 | O5 | C5806 | P12940 | 0.053168 | 1 | 2022-02-15T23:29:53.387Z | 0.242248 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | O99996 | C7167 | P10590 | 0.896040 | 0 | 2022-02-15T23:29:59.292Z | 0.686047 |
| 99996 | O99997 | C3642 | P6210 | 0.129109 | 1 | 2022-02-15T23:29:59.292Z | 0.868217 |
| 99997 | O99998 | C6145 | P5740 | 0.825050 | 1 | 2022-02-15T23:29:59.292Z | 0.046512 |
| 99998 | O99999 | C7567 | P14942 | 0.602772 | 1 | 2022-02-15T23:29:59.292Z | 0.835271 |


In [49]:
orders_df['order_id'] = orders_df['order_id'].astype('string')
orders_df['customer_id'] = orders_df['customer_id'].astype('string')
orders_df['product_id'] = orders_df['product_id'].astype('string')
orders_df['event_time'] = orders_df['event_time'].astype('string')

In [56]:
orders_df.dtypes

order_id                       string
customer_id                    string
product_id                     string
purchase_amount               float64
is_reordered                    int64
event_time                     string
n_days_since_last_purchase    float64
dtype: object

In [51]:
customers_count = customers_df.shape[0]
%store customers_count
products_count = products_df.shape[0]
%store products_count
orders_count = orders_df.shape[0]
%store orders_count

Stored 'customers_count' (int)
Stored 'products_count' (int)
Stored 'orders_count' (int)


# Create feature definitions and groups

In [52]:
current_timestamp = strftime('%m-%d-%H-%M', gmtime())

In [53]:
# prefix to track all the feature groups created as part of feature store champions workshop (fscw)
fs_prefix = 'fscw-' 

In [54]:
customers_feature_group_name = f'{fs_prefix}customers-{current_timestamp}'
%store customers_feature_group_name
products_feature_group_name = f'{fs_prefix}products-{current_timestamp}'
%store products_feature_group_name
orders_feature_group_name = f'{fs_prefix}orders-{current_timestamp}'
%store orders_feature_group_name

Stored 'customers_feature_group_name' (str)
Stored 'products_feature_group_name' (str)
Stored 'orders_feature_group_name' (str)
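
The names above combine a fixed prefix, the dataset name and a UTC timestamp. A minimal sketch of that scheme with an added validity check follows; `build_feature_group_name` and `NAME_PATTERN` are hypothetical, and the pattern is an assumption about the naming rule (alphanumeric start and end, hyphens or underscores in between, at most 64 characters) that should be checked against the current API reference.

```python
import re
from time import gmtime, strftime

# Assumed naming rule, not copied from the notebook: verify against the API reference.
NAME_PATTERN = re.compile(r'^[a-zA-Z0-9]([-_]*[a-zA-Z0-9]){0,63}$')

def build_feature_group_name(dataset, prefix='fscw-', moment=None):
    """Return '<prefix><dataset>-<MM-DD-HH-MM>' for the given (or the current) UTC time."""
    stamp = strftime('%m-%d-%H-%M', gmtime() if moment is None else moment)
    name = f'{prefix}{dataset}-{stamp}'
    if not NAME_PATTERN.match(name):
        raise ValueError(f'Invalid feature group name: {name}')
    return name

# gmtime(0) is 1970-01-01 00:00 UTC, used here only to make the output reproducible
print(build_feature_group_name('customers', moment=gmtime(0)))  # fscw-customers-01-01-00-00
```

Because the timestamp has minute resolution and no year, two runs within the same minute would produce the same name.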


In [55]:
logger.info(f'Customers feature group name = {customers_feature_group_name}')
logger.info(f'Products feature group name = {products_feature_group_name}')
logger.info(f'Orders feature group name = {orders_feature_group_name}')

Customers feature group name = fscw-customers-02-15-23-31
Products feature group name = fscw-products-02-15-23-31
Orders feature group name = fscw-orders-02-15-23-31


In [57]:
customers_feature_group = FeatureGroup(name=customers_feature_group_name, sagemaker_session=sagemaker_session)
products_feature_group = FeatureGroup(name=products_feature_group_name, sagemaker_session=sagemaker_session)
orders_feature_group = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sagemaker_session)

In [60]:
customers_feature_group.load_feature_definitions(data_frame=customers_df)

[FeatureDefinition(feature_name='customer_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='sex', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='is_married', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='event_time', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='age_18-29', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_30-39', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_40-49', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_50-59', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_60-69', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='age_70-plus', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition

In [61]:
products_feature_group.load_feature_definitions(data_frame=products_df)

[FeatureDefinition(feature_name='product_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='event_time', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='category_baby_food_formula', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_baking_ingredients', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_candy_chocolate', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_chips_pretzels', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_cleaning_products', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_coffee', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='category_cookies_cakes', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinit

In [62]:
orders_feature_group.load_feature_definitions(data_frame=orders_df)

[FeatureDefinition(feature_name='order_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='customer_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='product_id', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='purchase_amount', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='is_reordered', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>),
 FeatureDefinition(feature_name='event_time', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='n_days_since_last_purchase', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>)]
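
The feature definitions listed above follow from the pandas dtypes: `int64` became `Integral`, `float64` became `Fractional` and `string` became `String`. The sketch below mirrors that mapping in plain Python to make the rule explicit. It is not the SDK's code; `DTYPE_TO_FEATURE_TYPE` and `infer_feature_types` are hypothetical names, and rejecting every other dtype (including `object`) is an assumption that would explain why the ID and timestamp columns were cast to `string` beforehand.

```python
# Mapping as observed in the outputs of this notebook
DTYPE_TO_FEATURE_TYPE = {
    'int64': 'Integral',
    'float64': 'Fractional',
    'string': 'String',
}

def infer_feature_types(dtypes):
    """Map {column: dtype name} to {column: feature type}, rejecting unknown dtypes."""
    feature_types = {}
    for column, dtype in dtypes.items():
        if dtype not in DTYPE_TO_FEATURE_TYPE:
            raise ValueError(f'Column {column!r} has unsupported dtype {dtype!r}')
        feature_types[column] = DTYPE_TO_FEATURE_TYPE[dtype]
    return feature_types

orders_dtypes = {
    'order_id': 'string',
    'purchase_amount': 'float64',
    'is_reordered': 'int64',
    'event_time': 'string',
}
print(infer_feature_types(orders_dtypes))
```

With a real DataFrame, the input dictionary could be built as `{column: str(dtype) for column, dtype in orders_df.dtypes.items()}`.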

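Now create the feature groups. Creation is asynchronous, so the helper below polls each group's status until it is no longer `Creating`.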

In [63]:
def wait_for_feature_group_creation_complete(feature_group):
    """Block until the feature group leaves the 'Creating' state; raise if creation failed."""
    status = feature_group.describe().get('FeatureGroupStatus')
    print(f'Initial status: {status}')
    while status == 'Creating':
        logger.info(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)  # poll every 5 seconds
        status = feature_group.describe().get('FeatureGroupStatus')
    if status != 'Created':
        # RuntimeError rather than SystemExit, so the failure surfaces as a normal traceback
        raise RuntimeError(f'Failed to create feature group {feature_group.name}: {status}')
    logger.info(f'FeatureGroup {feature_group.name} was successfully created.')
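
The polling loop above runs for as long as the status stays `Creating`. A minimal sketch of the same idea with an upper bound on the waiting time follows. `wait_for_status` is a hypothetical helper, the 600-second default is an arbitrary assumption, and the status source is passed in as a callable so that the example runs without AWS access.

```python
import time

def wait_for_status(get_status, pending='Creating', success='Created',
                    poll_seconds=5, timeout_seconds=600, sleep=time.sleep):
    """Poll get_status() until it leaves `pending`; raise on timeout or any other end state."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f'Still {pending} after {waited} seconds')
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    if status != success:
        raise RuntimeError(f'Unexpected final status: {status}')
    return status

# Stand-in for feature_group.describe().get('FeatureGroupStatus')
fake_statuses = iter(['Creating', 'Creating', 'Created'])
print(wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None))  # Created
```

In the notebook, the callable would be `lambda: feature_group.describe().get('FeatureGroupStatus')`.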

In [64]:
customers_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                               record_identifier_name='customer_id', 
                               event_time_feature_name='event_time', 
                               role_arn=role, 
                               enable_online_store=True)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-west-2:119387606724:feature-group/fscw-customers-02-15-23-31',
 'ResponseMetadata': {'RequestId': 'c2f3b638-e2e3-4e13-b2b0-6a011e34860d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'c2f3b638-e2e3-4e13-b2b0-6a011e34860d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '103',
   'date': 'Tue, 15 Feb 2022 23:36:54 GMT'},
  'RetryAttempts': 0}}

In [65]:
wait_for_feature_group_creation_complete(customers_feature_group)

Initial status: Creating
Waiting for feature group: fscw-customers-02-15-23-31 to be created ...
Waiting for feature group: fscw-customers-02-15-23-31 to be created ...
FeatureGroup fscw-customers-02-15-23-31 was successfully created.


In [66]:
products_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                              record_identifier_name='product_id', 
                              event_time_feature_name='event_time', 
                              role_arn=role, 
                              enable_online_store=True)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-west-2:119387606724:feature-group/fscw-products-02-15-23-31',
 'ResponseMetadata': {'RequestId': 'aaf76134-ec32-49c9-b025-dc273625399f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'aaf76134-ec32-49c9-b025-dc273625399f',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '102',
   'date': 'Tue, 15 Feb 2022 23:37:13 GMT'},
  'RetryAttempts': 0}}

In [67]:
wait_for_feature_group_creation_complete(products_feature_group)

Initial status: Creating
Waiting for feature group: fscw-products-02-15-23-31 to be created ...
Waiting for feature group: fscw-products-02-15-23-31 to be created ...
Waiting for feature group: fscw-products-02-15-23-31 to be created ...
Waiting for feature group: fscw-products-02-15-23-31 to be created ...
FeatureGroup fscw-products-02-15-23-31 was successfully created.


In [68]:
orders_feature_group.create(s3_uri=f's3://{default_bucket}/{prefix}', 
                            record_identifier_name='order_id', 
                            event_time_feature_name='event_time', 
                            role_arn=role, 
                            enable_online_store=True)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-west-2:119387606724:feature-group/fscw-orders-02-15-23-31',
 'ResponseMetadata': {'RequestId': '45de8587-f6df-41b4-83ca-3e73342819ff',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '45de8587-f6df-41b4-83ca-3e73342819ff',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '100',
   'date': 'Tue, 15 Feb 2022 23:37:33 GMT'},
  'RetryAttempts': 0}}

In [69]:
wait_for_feature_group_creation_complete(orders_feature_group)

Initial status: Creating
Waiting for feature group: fscw-orders-02-15-23-31 to be created ...
Waiting for feature group: fscw-orders-02-15-23-31 to be created ...
Waiting for feature group: fscw-orders-02-15-23-31 to be created ...
FeatureGroup fscw-orders-02-15-23-31 was successfully created.


# Ingest data into feature groups 

In [70]:
%%time

logger.info(f'Ingesting data into feature group: {customers_feature_group.name} ...')
customers_feature_group.ingest(data_frame=customers_df, max_processes=16, wait=True)
logger.info(f'{len(customers_df)} customer records ingested into feature group: {customers_feature_group.name}')

Ingesting data into feature group: fscw-customers-02-15-23-31 ...
10000 customer records ingested into feature group: fscw-customers-02-15-23-31


CPU times: user 253 ms, sys: 151 ms, total: 404 ms
Wall time: 15.7 s


In [71]:
%%time

logger.info(f'Ingesting data into feature group: {products_feature_group.name} ...')
products_feature_group.ingest(data_frame=products_df, max_processes=16, wait=True)
logger.info(f'{len(products_df)} product records ingested into feature group: {products_feature_group.name}')  

Ingesting data into feature group: fscw-products-02-15-23-31 ...
17001 product records ingested into feature group: fscw-products-02-15-23-31


CPU times: user 277 ms, sys: 152 ms, total: 429 ms
Wall time: 29.6 s


In [72]:
%%time

logger.info(f'Ingesting data into feature group: {orders_feature_group.name} ...')
orders_feature_group.ingest(data_frame=orders_df, max_processes=16, wait=True)
logger.info(f'{len(orders_df)} order records ingested into feature group: {orders_feature_group.name}')

Ingesting data into feature group: fscw-orders-02-15-23-31 ...
100000 order records ingested into feature group: fscw-orders-02-15-23-31


CPU times: user 1.91 s, sys: 227 ms, total: 2.14 s
Wall time: 2min 20s
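
The record counts and wall times reported by the three ingestion cells can be turned into a rough throughput figure. The sketch below does that arithmetic; `records_per_second` is a hypothetical helper, and the numbers are taken from the outputs of this particular run, so they will differ on other instances or with other `max_processes` settings.

```python
def records_per_second(record_count, wall_seconds):
    """Average ingestion rate over the whole call."""
    return record_count / wall_seconds

# (records ingested, wall time in seconds) as reported by the cells in this section
runs = {
    'customers': (10000, 15.7),
    'products': (17001, 29.6),
    'orders': (100000, 2 * 60 + 20),
}

for name, (count, seconds) in runs.items():
    print(f'{name}: {records_per_second(count, seconds):.0f} records/s')
```

For this run, that works out to roughly 637, 574 and 714 records per second respectively.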


# Get feature record from the Online feature store 

In [73]:
featurestore_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)

Retrieve a record from the customers feature group:

In [74]:
# Pick a random customer; this assumes IDs run from C1 to C<customers_count>
customer_id = f'C{randint(1, customers_count)}'
logger.info(f'customer_id={customer_id}')

customer_id=C1024


In [75]:
feature_record = featurestore_runtime_client.get_record(FeatureGroupName=customers_feature_group_name, 
                                                        RecordIdentifierValueAsString=customer_id)
feature_record

{'ResponseMetadata': {'RequestId': 'bbb6c417-40c9-4cbd-86f3-bd1c12b54fe2',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'bbb6c417-40c9-4cbd-86f3-bd1c12b54fe2',
   'content-type': 'application/json',
   'content-length': '588',
   'date': 'Tue, 15 Feb 2022 23:45:38 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'customer_id', 'ValueAsString': 'C1024'},
  {'FeatureName': 'sex', 'ValueAsString': '1'},
  {'FeatureName': 'is_married', 'ValueAsString': '1'},
  {'FeatureName': 'event_time', 'ValueAsString': '2022-02-15T23:29:39.143Z'},
  {'FeatureName': 'age_18-29', 'ValueAsString': '0'},
  {'FeatureName': 'age_30-39', 'ValueAsString': '0'},
  {'FeatureName': 'age_40-49', 'ValueAsString': '1'},
  {'FeatureName': 'age_50-59', 'ValueAsString': '0'},
  {'FeatureName': 'age_60-69', 'ValueAsString': '0'},
  {'FeatureName': 'age_70-plus', 'ValueAsString': '0'},
  {'FeatureName': 'n_days_active', 'ValueAsString': '0.9849315068493152'}]}
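
The Online store returns every feature as a `FeatureName`/`ValueAsString` pair, so numeric features arrive as strings and must be cast before use. A minimal sketch of flattening such a response follows. `record_to_dict` is a hypothetical helper, the sample is an abbreviated copy of the response above, and returning an empty dictionary when the `Record` key is absent is an assumption about how a missing record should be handled.

```python
def record_to_dict(response):
    """Flatten the 'Record' list of a get_record response into {feature_name: value_as_string}."""
    return {item['FeatureName']: item['ValueAsString'] for item in response.get('Record', [])}

# Abbreviated copy of the response shown above
sample_response = {'Record': [
    {'FeatureName': 'customer_id', 'ValueAsString': 'C1024'},
    {'FeatureName': 'sex', 'ValueAsString': '1'},
    {'FeatureName': 'age_40-49', 'ValueAsString': '1'},
    {'FeatureName': 'n_days_active', 'ValueAsString': '0.9849315068493152'},
]}

features = record_to_dict(sample_response)
# Values are strings; cast them according to the feature type
print(features['customer_id'], int(features['sex']), float(features['n_days_active']))
```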

# List feature groups 

In [76]:
sagemaker_client = sagemaker_session.boto_session.client('sagemaker', region_name=region)

In [77]:
response = sagemaker_client.list_feature_groups()
for fg in response['FeatureGroupSummaries']:
    fg_name = fg['FeatureGroupName']
    print(f'Found feature group: {fg_name}')

Found feature group: online-only-feature-group
Found feature group: new-feature-group
Found feature group: fscw-products-02-15-23-31
Found feature group: fscw-orders-02-15-23-31
Found feature group: fscw-customers-02-15-23-31
Found feature group: feature-group-a
Found feature group: customer-feature-group
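
A single `list_feature_groups` call returns one page of results, and the output above also includes groups that were not created by this notebook. A minimal sketch that walks all pages with a boto3 paginator and filters by name follows. `list_feature_group_names` is a hypothetical helper, and `StubClient` is a stand-in used only so that the example runs without AWS access.

```python
def list_feature_group_names(client, name_contains=None):
    """Collect feature group names across all result pages of list_feature_groups."""
    kwargs = {'NameContains': name_contains} if name_contains else {}
    names = []
    for page in client.get_paginator('list_feature_groups').paginate(**kwargs):
        names.extend(summary['FeatureGroupName'] for summary in page['FeatureGroupSummaries'])
    return names

class _StubPaginator:
    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        needle = kwargs.get('NameContains', '')
        for page in self._pages:
            matching = [s for s in page if needle in s['FeatureGroupName']]
            yield {'FeatureGroupSummaries': matching}

class StubClient:
    """Mimics only the part of the SageMaker client that this sketch uses."""
    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        assert operation_name == 'list_feature_groups'
        return _StubPaginator(self._pages)

stub = StubClient([
    [{'FeatureGroupName': 'new-feature-group'}, {'FeatureGroupName': 'fscw-products-02-15-23-31'}],
    [{'FeatureGroupName': 'fscw-orders-02-15-23-31'}, {'FeatureGroupName': 'feature-group-a'}],
])
print(list_feature_group_names(stub, name_contains='fscw-'))
```

In the notebook, the call would be `list_feature_group_names(sagemaker_client, name_contains=fs_prefix)`.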
