# Amazon SageMaker Feature Store - Load data and augment it with embeddings stored in Feature Store

The `conda-env-kernel` image with the `Python 3.9` kernel is required for this notebook.

In this notebook, we load outcomes data and augment it with `device_model` embeddings stored in Amazon SageMaker Feature Store.
Essentially, we implement a JOIN that adds a new column holding the embedding vector for each row's `device_model`.
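
To make the JOIN concrete, here is a minimal sketch on made-up data. The row values and the two-dimensional vectors are hypothetical; only the `device_model` key mirrors this notebook.

```python
import pandas as pd

# Hypothetical outcomes rows; only the device_model key mirrors the notebook.
toy_outcomes = pd.DataFrame({
    "device_id": ["a1", "b2", "c3"],
    "device_model": ["sm-g973f", "kyv48", "unknown-model"],
})

# Hypothetical embeddings table: one vector per device_model.
toy_embeddings = pd.DataFrame({
    "device_model": ["sm-g973f", "kyv48"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
})

# Inner join on device_model: each matching outcome row gains an embeddings column.
toy_joined = toy_outcomes.merge(toy_embeddings, on="device_model", how="inner")
print(toy_joined)
```

With an inner join, the row for `unknown-model` is dropped because no embedding exists for it.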

## Initial setup

### Fix <A HREF="https://github.com/hdmf-dev/hdmf/issues/617">pandas/NumPy incompatibility issues</A>

In [None]:
import sys

# Uncomment the lines below to upgrade the packages if you hit the incompatibility linked above.
#!{sys.executable} -m pip install --upgrade pip
#!{sys.executable} -m pip install wheel
#!{sys.executable} -m pip install sagemaker pandas numpy numba s3fs aiobotocore --upgrade
#!{sys.executable} -m pip install sagemaker pandas numpy numba --upgrade

### Set up the boto3 clients and the SageMaker Python SDK session

In [None]:
import boto3
import json
import sagemaker
from sagemaker.session import Session

region = boto3.Session().region_name
boto_client_s3 = boto3.client('s3', region_name=region)

boto_session = boto3.Session(region_name=region)

boto_client_sagemaker = boto_session.client(service_name='sagemaker', region_name=region)
boto_client_featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=boto_client_sagemaker,
    sagemaker_featurestore_runtime_client=boto_client_featurestore_runtime
)

## Load features data from the offline store

### Define the feature group

In [None]:

from time import gmtime, strftime, sleep
from sagemaker.feature_store.feature_group import FeatureGroup

# Features from different runs are stored in a single group and distinguished by timestamp.
device_model_feature_group_name = 'deviceid-feature-group'
device_model_feature_group = FeatureGroup(name=device_model_feature_group_name, sagemaker_session=feature_store_session)

# describe() returns the group's metadata, including the record identifier
# and event time feature names used in the Athena query later on.
d = device_model_feature_group.describe()
d

### Load the latest features through Athena

Note: if you pull embeddings from the online store, you always get the latest version of each record. See https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_feature_store_GetRecord.html

When using the online store, you also need to rename the record identifier column before it can be used to JOIN with the outcomes data, for example:
`a = my_sample_data2.rename(columns={'DeviceID': 'device_model'})`
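
As a rough illustration of the online-store path, the sketch below flattens a GetRecord-style response into a single row and applies the rename at the same time. The helper `record_to_row`, the sample response, and the feature names `embeddings` and `EventTime` are hypothetical; only `DeviceID` and `device_model` come from this notebook.

```python
def record_to_row(response, rename=None):
    """Flatten a GetRecord-style response into a {feature_name: value} dict.

    `rename` optionally maps feature names to new column names.
    """
    rename = rename or {}
    row = {}
    for feature in response.get("Record", []):
        name = feature["FeatureName"]
        row[rename.get(name, name)] = feature["ValueAsString"]
    return row


# Hypothetical response in the shape GetRecord returns:
# a "Record" list of FeatureName / ValueAsString pairs.
sample_response = {
    "Record": [
        {"FeatureName": "DeviceID", "ValueAsString": "kyv48"},
        {"FeatureName": "embeddings", "ValueAsString": "[0.1, 0.2]"},
        {"FeatureName": "EventTime", "ValueAsString": "1640995200"},
    ]
}

row = record_to_row(sample_response, rename={"DeviceID": "device_model"})
print(row)
```

In this notebook, the response would come from `boto_client_featurestore_runtime.get_record(FeatureGroupName=device_model_feature_group_name, RecordIdentifierValueAsString=...)`, and `pd.DataFrame([row])` would turn the result into a one-row DataFrame ready for the JOIN. Note that all values arrive as strings.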

In [None]:
device_model_query = device_model_feature_group.athena_query()
device_model_query

In [None]:
# Create a query to pull the latest snapshot without duplicates and deleted records in the offline store.
# For details see https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-athena-glue-integration.html
query_string = f"""
SELECT deviceid AS device_model, embeddings, eventtime
FROM
    (SELECT *,
            row_number() OVER (
                PARTITION BY {d["RecordIdentifierFeatureName"]}
                ORDER BY {d["EventTimeFeatureName"]} DESC
            ) AS row_num
     FROM "{device_model_query.table_name}")
WHERE row_num = 1 AND NOT is_deleted;
"""
query_string
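
For readers more comfortable with pandas than SQL, the query's logic can be sketched on a toy table: keep the newest row per record identifier, then drop it if it is marked deleted. The data below is invented for illustration; the column names follow the query above.

```python
import pandas as pd

# Hypothetical offline-store rows: repeated writes for the same record identifier.
toy_offline = pd.DataFrame({
    "deviceid": ["kyv48", "kyv48", "b50pro", "b50pro", "ptb10r"],
    "embeddings": ["[0.1]", "[0.2]", "[0.3]", "[0.4]", "[0.5]"],
    "eventtime": [1.0, 2.0, 1.0, 2.0, 1.0],
    "is_deleted": [False, False, False, True, False],
})

# Keep the newest row per deviceid (the row_num = 1 condition).
latest = (
    toy_offline
    .sort_values("eventtime")
    .drop_duplicates(subset="deviceid", keep="last")
)

# Drop records whose newest row is a deletion marker (the NOT is_deleted condition).
latest = latest[~latest["is_deleted"]]

# Match the SELECT list, including the deviceid -> device_model alias.
latest = latest.rename(columns={"deviceid": "device_model"})[
    ["device_model", "embeddings", "eventtime"]
]
print(latest)
```

Because the deletion filter runs after the newest row is selected, `b50pro` disappears entirely rather than falling back to its older, non-deleted row.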

In [None]:
import pandas as pd

default_s3_bucket_name = feature_store_session.default_bucket() # default S3 bucket defined during SageMaker domain creation.
prefix = 'query_results/'

# Run the Athena query, wait for it to finish, and load the output into a pandas DataFrame.
device_model_query.run(query_string=query_string, output_location=f"s3://{default_s3_bucket_name}/{prefix}")
device_model_query.wait()
latest_device_model_embeddings_feature = device_model_query.as_dataframe()
latest_device_model_embeddings_feature

## Load outcomes and run a join


In [None]:
import pandas as pd

# S3 location of the outcomes CSV; reading s3:// paths with pandas requires s3fs.
my_bucket = 'sagemaker-studio-ilya-test-20211221'
my_file = 'input_data/csv_outcomes_is_ilya2.csv'

outcomes = pd.read_csv(f"s3://{my_bucket}/{my_file}")
outcomes.head()

In [None]:
selected_columns = outcomes[["supply_app_bundle_id", "device_id", "device_model"]]
selected_columns


In [None]:

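# Inner join (the pandas default): outcome rows whose device_model has no embedding are dropped.
result = pd.merge(selected_columns, latest_device_model_embeddings_feature, on="device_model")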
result

In [None]:
# Spot-check the embeddings for a few device models.
models_to_check = ['sm-g973f', 'dammar', 'kyv48', 'b50pro', 'ptb10r']
latest_device_model_embeddings_feature.loc[latest_device_model_embeddings_feature['device_model'].isin(models_to_check)]
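
The lookup above checks a handful of models by hand. One way to check coverage for every outcome row is a left merge with `indicator=True`, sketched here on made-up data; in the notebook, the same call would take `selected_columns` and `latest_device_model_embeddings_feature`.

```python
import pandas as pd

# Hypothetical stand-ins for the outcomes and embeddings DataFrames.
toy_outcomes = pd.DataFrame({
    "device_id": ["a1", "b2", "c3"],
    "device_model": ["sm-g973f", "kyv48", "unknown-model"],
})
toy_embeddings = pd.DataFrame({
    "device_model": ["sm-g973f", "kyv48"],
    "embeddings": ["[0.1, 0.2]", "[0.3, 0.4]"],
})

# how="left" keeps every outcome row; indicator=True adds a _merge column
# that is "left_only" where no embedding was found.
checked = toy_outcomes.merge(
    toy_embeddings, on="device_model", how="left", indicator=True
)

missing_models = sorted(
    checked.loc[checked["_merge"] == "left_only", "device_model"].unique()
)
print(missing_models)
```

Any model listed in `missing_models` would be silently dropped by the inner join used for `result`.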

## Save the results

In [None]:
#
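
The cell above is left empty in this notebook. One possible way to save the joined data is sketched below, assuming CSV output and list-valued embeddings; the file name, the temporary directory, and the toy DataFrame are all hypothetical. If the embeddings are already strings, the JSON step can be skipped.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the joined `result` DataFrame.
toy_result = pd.DataFrame({
    "device_id": ["a1"],
    "device_model": ["sm-g973f"],
    "embeddings": [[0.1, 0.2]],
})

# Serialize list-valued embeddings to JSON strings so they round-trip through CSV.
to_save = toy_result.assign(embeddings=toy_result["embeddings"].map(json.dumps))

# Hypothetical output location; an s3://bucket/key URI could be used instead
# when s3fs is installed.
output_path = os.path.join(tempfile.mkdtemp(), "outcomes_with_embeddings.csv")
to_save.to_csv(output_path, index=False)

# Read the file back and restore the vectors to confirm the round trip.
reloaded = pd.read_csv(output_path)
reloaded["embeddings"] = reloaded["embeddings"].map(json.loads)
print(reloaded)
```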