### SageMaker Feature Store Notebook showing use of Time Travel

This notebook is part of an AWS blog post that shows how to use "Time Travel" with SageMaker Feature Store. This particular notebook (#2) generates aggregate data attributes (e.g. counts and averages) from the raw transaction data generated in the previous notebook (#1).

In [None]:
from sagemaker import get_execution_role
import sagemaker
import boto3
import time
import json
import sys

role = get_execution_role()
sm_client = boto3.Session().client(service_name='sagemaker')
smfs_runtime = boto3.Session().client(service_name='sagemaker-featurestore-runtime')

#### Start by Deleting Feature Groups that we will re-create

In [None]:
# Use SageMaker default bucket
BUCKET = sagemaker.Session().default_bucket()
BASE_PREFIX = "sagemaker-featurestore-blog"

FEATURE_GROUP = "cc-agg-batch-fg"

# Note that FeatureStore will append this pattern to base prefix -> "{account_id}/sagemaker/{region}/offline-store/"
OFFLINE_STORE_BASE_URI = f's3://{BUCKET}/{BASE_PREFIX}'

print(OFFLINE_STORE_BASE_URI)
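The comment in the cell above notes that Feature Store appends an account- and region-specific suffix to the base prefix. A minimal sketch of how that resolved prefix could be assembled is shown here for orientation only: `expected_offline_store_prefix` is a hypothetical helper, the bucket, account ID and region are placeholders, and the service is expected to write further path components beneath this prefix.

```python
def expected_offline_store_prefix(base_uri: str, account_id: str, region: str) -> str:
    """Join the base URI with the suffix pattern quoted in the comment above."""
    return f"{base_uri.rstrip('/')}/{account_id}/sagemaker/{region}/offline-store/"


# Placeholder values for illustration only (not a real bucket or account).
example = expected_offline_store_prefix(
    "s3://my-bucket/sagemaker-featurestore-blog", "123456789012", "us-east-1"
)
print(example)
```

Knowing this pattern makes it easier to locate the Offline Store data in S3 once ingestion has run.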

In [None]:
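# Delete feature group (in case the name already exists)
try:
    sm_client.delete_feature_group(FeatureGroupName=FEATURE_GROUP)
    print('deleted feature group')
except sm_client.exceptions.ResourceNotFound:
    print('feature group not found; nothing to delete')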

#### Recreate the Feature Groups using Schema definition files
Each feature group contains configuration parameters for Offline and Online stores. The feature group uses a schema definition file (JSON) that dictates the feature names and types. Below we display these local schema files.

#### Schema files in the local 'schema' folder

In [None]:
!pygmentize schema/cc-agg-batch-fg-schema.json
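The schema file's contents are not reproduced in this text, so the following is only a sketch of the shape such a file might take. The top-level keys match the ones read by `create_feature_group_from_schema` in the next cell; every value (description, tag, the choice of record identifier, and the column types) is an assumption made for illustration. The helper `to_feature_definitions` is hypothetical and mirrors the type mapping used by that function.

```python
# Hypothetical schema contents; the real file may differ.
example_schema = {
    "description": "Aggregated credit card features (illustrative)",
    "record_identifier_feature_name": "consumer_id",
    "event_time_feature_name": "event_time",
    "tags": [{"Key": "Environment", "Value": "DEV"}],
    "features": [
        {"name": "consumer_id", "type": "string"},
        {"name": "num_trans_last_7d", "type": "bigint"},
        {"name": "avg_amt_last_7d", "type": "double"},
        {"name": "event_time", "type": "string"},
    ],
}

# Same mapping as the notebook function:
# double -> Fractional, bigint -> Integral, anything else -> String.
TYPE_MAP = {"double": "Fractional", "bigint": "Integral"}


def to_feature_definitions(schema: dict) -> list:
    """Convert schema columns into Feature Store feature definitions."""
    return [
        {"FeatureName": col["name"], "FeatureType": TYPE_MAP.get(col["type"], "String")}
        for col in schema["features"]
    ]


feature_defs = to_feature_definitions(example_schema)
for definition in feature_defs:
    print(definition)
```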

In [None]:
def create_feature_group_from_schema(filename, fg_name, role_arn=None, s3_uri=None):
    # Use a context manager so the schema file is closed after reading
    with open(filename) as f:
        schema = json.load(f)
    
    feature_defs = []
    
    for col in schema['features']:
        feature = {'FeatureName': col['name']}
        if col['type'] == 'double':
            feature['FeatureType'] = 'Fractional'
        elif col['type'] == 'bigint':
            feature['FeatureType'] = 'Integral'
        else:
            feature['FeatureType'] = 'String'
        feature_defs.append(feature)

    record_identifier_name = schema['record_identifier_feature_name']
    event_time_name = schema['event_time_feature_name']

    if role_arn is None:
        role_arn = get_execution_role()

    if s3_uri is None:
        offline_config = {}
    else:
        print(f'Creating Offline Store at: {s3_uri}')
        offline_config = {'OfflineStoreConfig': {'S3StorageConfig': {'S3Uri': s3_uri}}}
        
    sm_client.create_feature_group(
        FeatureGroupName = fg_name,
        RecordIdentifierFeatureName = record_identifier_name,
        EventTimeFeatureName = event_time_name,
        FeatureDefinitions = feature_defs,
        Description = schema['description'],
        Tags = schema['tags'],
        OnlineStoreConfig = {'EnableOnlineStore': True},
        RoleArn = role_arn,
        **offline_config)

#### Create the new Feature Groups using the schema definition 
Now we will create the feature group as defined by the schema file. Since Feature Group creation can sometimes take a few minutes, we will wait below for status to change from `Creating`.

In [None]:
create_feature_group_from_schema('schema/cc-agg-batch-fg-schema.json', FEATURE_GROUP, 
                                 role_arn=role, s3_uri=OFFLINE_STORE_BASE_URI)

In [None]:
# Wait for status to change to 'Created'

def wait_for_feature_group_creation_complete(feature_group_name):
    response = sm_client.describe_feature_group(FeatureGroupName=feature_group_name)
    status = response['FeatureGroupStatus']
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        response = sm_client.describe_feature_group(FeatureGroupName=feature_group_name)
        status = response['FeatureGroupStatus']
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group_name}")
    print(f"FeatureGroup {feature_group_name} successfully created.")

wait_for_feature_group_creation_complete(feature_group_name=FEATURE_GROUP)
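The polling loop above waits indefinitely if the status never leaves `Creating`. One possible variant, sketched here under the assumption that a hard time limit is acceptable, takes the status lookup as a callable so it can be exercised without AWS access. The function name `wait_for_status` and its parameters are hypothetical, not part of any SDK.

```python
import time


def wait_for_status(get_status, pending="Creating", done="Created",
                    timeout_s=600.0, poll_s=5.0, sleep=time.sleep):
    """Poll get_status() until it leaves `pending`; raise on timeout or failure."""
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status still '{pending}' after {timeout_s} seconds")
        sleep(poll_s)
        status = get_status()
    if status != done:
        raise RuntimeError(f"Unexpected final status: {status}")
    return status


# Usage with the client from this notebook (not executed here):
# wait_for_status(lambda: sm_client.describe_feature_group(
#     FeatureGroupName=FEATURE_GROUP)['FeatureGroupStatus'])
```

Passing the lookup as a callable keeps the waiting logic separate from the SageMaker client, which also makes it straightforward to reuse for other status fields.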

#### Make sure the new Feature Groups exist

In [None]:
sm_client.list_feature_groups()

#### Describe configuration of feature group
Note that each feature group gets its own ARN, allowing you to manage IAM policies that control access to individual feature groups. The feature names and types are displayed, and the required record identifier and event time features are called out specifically. Notice that when we created the Feature Group above, we passed in the `s3_uri` parameter. This parameter dictates the base S3 location where the Offline Store data is written, and can be retrieved from the `describe_feature_group` output within the `OfflineStoreConfig` dictionary. 

In [None]:
sm_client.describe_feature_group(FeatureGroupName=FEATURE_GROUP)
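The fields mentioned above can be pulled out of the `describe_feature_group` response with a small helper. This is a sketch only: `summarize_feature_group` is a hypothetical name, and the sample response is a trimmed, made-up dictionary that uses the response keys discussed in the text so the block runs without AWS access.

```python
def summarize_feature_group(description: dict) -> dict:
    """Extract the ARN, key feature names and offline S3 URI from a describe response."""
    offline = description.get("OfflineStoreConfig", {})
    return {
        "arn": description.get("FeatureGroupArn"),
        "record_identifier": description.get("RecordIdentifierFeatureName"),
        "event_time": description.get("EventTimeFeatureName"),
        "offline_s3_uri": offline.get("S3StorageConfig", {}).get("S3Uri"),
    }


# Trimmed, made-up response used only to exercise the helper.
sample_response = {
    "FeatureGroupArn": "arn:aws:sagemaker:us-east-1:123456789012:feature-group/cc-agg-batch-fg",
    "RecordIdentifierFeatureName": "consumer_id",
    "EventTimeFeatureName": "event_time",
    "OfflineStoreConfig": {
        "S3StorageConfig": {"S3Uri": "s3://my-bucket/sagemaker-featurestore-blog"}
    },
}

summary = summarize_feature_group(sample_response)
print(summary)
```

In the notebook itself, the same helper could be applied to `sm_client.describe_feature_group(FeatureGroupName=FEATURE_GROUP)`.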

# Batch Ingestion
**This section of the notebook aggregates raw features into new derived features that are used for Fraud Detection model training/inference.**

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Create PySpark Processing Script](#Create-PySpark-Processing-Script)
1. [Run SageMaker Processing Job](#Run-SageMaker-Processing-Job)
1. [Explore Aggregated Features](#Explore-Aggregated-Features)
1. [Validate Feature Group for Records](#Validate-Feature-Group-for-Records)

### Background

- This notebook takes the raw credit card transaction data (CSV) generated by [notebook 1](./1_generate_creditcard_transactions.ipynb) and aggregates the raw features to create new features (multiple calculated ratios) by running a <b>SageMaker Processing</b> PySpark job. These aggregated features will be used alongside the original raw features in the historical query ("Time Travel") notebook in the last step (see [notebook 3](./3_featurestore_timetravel_historical_query.ipynb)).

- As part of the Spark job, we also select the latest weekly aggregated features - `num_trans_last_7d` and `avg_amt_last_7d` grouped by `consumer_id` and populate these features into the <b>SageMaker Online Feature Store</b> as a feature group. This feature group (`cc-agg-batch-fg`) was just created in this notebook in the cells above. When you configure the Feature Group for both online and offline modes, the data written to the Online store will be automatically synced to the Offline store in the background.

- [Amazon SageMaker Processing](https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-sagemaker-processing-now-supports-built-in-spark-containers-for-big-data-processing/) lets customers run analytics jobs for data engineering and model evaluation on Amazon SageMaker easily and at scale. It provides a fully managed Spark environment for data processing or feature engineering workloads.
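Once the Spark job has populated the Online Store, the latest features for a consumer can be read back with the runtime client's `get_record` call. The sketch below is illustrative: `record_to_dict`, `fetch_online_record` and `_StubClient` are hypothetical names, and the stub stands in for the boto3 `sagemaker-featurestore-runtime` client so the block runs without AWS access. The sample values are made up.

```python
def record_to_dict(record: list) -> dict:
    """Flatten a Feature Store record (list of name/value pairs) into a dict."""
    return {item["FeatureName"]: item["ValueAsString"] for item in record}


def fetch_online_record(client, feature_group_name: str, record_id: str) -> dict:
    """Read the latest record for one identifier from the Online Store."""
    response = client.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=record_id,
    )
    # Fall back to an empty record if the response carries no 'Record' entry.
    return record_to_dict(response.get("Record", []))


class _StubClient:
    """Stand-in for the boto3 runtime client, for illustration only."""

    def get_record(self, FeatureGroupName, RecordIdentifierValueAsString):
        return {"Record": [
            {"FeatureName": "consumer_id", "ValueAsString": RecordIdentifierValueAsString},
            {"FeatureName": "num_trans_last_7d", "ValueAsString": "10"},
            {"FeatureName": "avg_amt_last_7d", "ValueAsString": "284.51"},
        ]}


latest = fetch_online_record(_StubClient(), "cc-agg-batch-fg", "DRLB75328297313987")
print(latest)
```

In this notebook, the real client `smfs_runtime` and the `FEATURE_GROUP` constant would take the place of the stub and the literal group name.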

<img src="./images/batch_ingestion.png" />

### Setup

#### Imports 

In [None]:
from sagemaker.spark.processing import PySparkProcessor
import pandas as pd
import numpy as np
import sagemaker
import logging
import random
import boto3
import os

In [None]:
print(f'Using SageMaker version: {sagemaker.__version__}')

#### Setup Logger

In [None]:
logger = logging.getLogger('sagemaker')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())
logger.info('[Batch Aggregation using SageMaker PySpark Processing Job]')

#### Essentials

In [None]:
# Setup S3 prefixes for Spark Job

INPUT_KEY_PREFIX = os.path.join(BASE_PREFIX, 'raw')
OUTPUT_KEY_PREFIX = os.path.join(BASE_PREFIX, 'aggregated')
LOCAL_DIR = './data'

print(INPUT_KEY_PREFIX)
print(OUTPUT_KEY_PREFIX)

### Create PySpark Processing Script
This PySpark script does the following:

1. Parses the incoming arguments to set various parameters.
2. Defines the schema for the incoming raw dataset.
3. Builds aggregate features (ratios) derived from the original raw features using SparkSQL Window partition.
4. Saves the aggregate features plus raw features into a CSV file; writes this to S3.
5. Groups the aggregated features by `consumer_id` and selects certain features to write to SageMaker Feature Store (Online).
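To make step 3 concrete, here is a plain-Python sketch of the windowed aggregation the Spark job performs. It is not the Spark code: it uses a simple quadratic scan, assumes each window covers the interval from the window length before the transaction up to and including the transaction itself, and runs on made-up transactions. The function name `window_stats` is hypothetical.

```python
from datetime import datetime, timedelta


def window_stats(transactions, window):
    """For each transaction, count and average the same consumer's amounts
    whose event time falls in [t - window, t] (both ends inclusive)."""
    results = []
    for consumer_id, event_time, _amount in transactions:
        amounts = [
            amt for cid, ts, amt in transactions
            if cid == consumer_id and event_time - window <= ts <= event_time
        ]
        results.append((len(amounts), sum(amounts) / len(amounts)))
    return results


# Made-up transactions: (consumer_id, event_time, amount)
t0 = datetime(2021, 1, 31, 12, 0, 0)
txns = [
    ("A", t0 - timedelta(days=3), 100.0),
    ("A", t0 - timedelta(minutes=30), 20.0),
    ("A", t0, 60.0),
    ("B", t0, 5.0),
]

last_60m = window_stats(txns, timedelta(minutes=60))
last_7d = window_stats(txns, timedelta(days=7))

# Ratios for the most recent transaction of consumer "A"
n60, avg60 = last_60m[2]
n7d, avg7d = last_7d[2]
amt_ratio1 = avg60 / avg7d        # recent average vs. weekly average
amt_ratio2 = txns[2][2] / avg7d   # this amount vs. weekly average
count_ratio = n60 / n7d           # recent activity vs. weekly activity
print(amt_ratio1, amt_ratio2, count_ratio)
```

The ratios compare short-term behaviour against the weekly baseline for the same consumer, which is the signal the derived features are meant to capture.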


In [None]:
%%writefile batch_aggregation.py
from pyspark.sql.types import StructField, StructType, StringType, DoubleType, TimestampType, LongType
from pyspark.sql.functions import desc, dense_rank
from pyspark.sql import SparkSession, DataFrame
from argparse import Namespace, ArgumentParser
from pyspark.sql.window import Window
import argparse
import logging
import boto3
import time
import sys
import os

TOTAL_UNIQUE_USERS = 10000 
FEATURE_GROUP = 'cc-agg-batch-fg'

logger = logging.getLogger('sagemaker')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())


feature_store_client = boto3.client(service_name='sagemaker-featurestore-runtime')


def parse_args() -> Namespace:
    parser = ArgumentParser(description='Spark Job Input and Output Args')
    parser.add_argument('--s3_input_bucket', type=str, help='S3 Input Bucket')
    parser.add_argument('--s3_input_key_prefix', type=str, help='S3 Input Key Prefix')
    parser.add_argument('--s3_output_bucket', type=str, help='S3 Output Bucket')
    parser.add_argument('--s3_output_key_prefix', type=str, help='S3 Output Key Prefix')
    args = parser.parse_args()
    return args
    

def define_schema() -> StructType:
    schema = StructType([StructField('tid', StringType(), True),
                         StructField('event_time', StringType(), True),
                         StructField('cc_num', LongType(), True),
                         StructField('consumer_id', StringType(), True),
                         StructField('amount', DoubleType(), True),
                         StructField('fraud_label', StringType(), True)])
    return schema


def aggregate_features(args: Namespace, schema: StructType, spark: SparkSession) -> DataFrame:
    logger.info('[Read Raw Transactions Data as Spark DataFrame]')
    transactions_df = spark.read.csv(f's3a://{os.path.join(args.s3_input_bucket, args.s3_input_key_prefix)}', \
                                     header=False, \
                                     schema=schema)
    
    logger.info('[Aggregate Transactions to Derive New Features using Spark SQL]')
    
    query = """
    SELECT *, \
           avg_amt_last_60m/avg_amt_last_7d AS amt_ratio1, \
           amount/avg_amt_last_7d AS amt_ratio2, \
           num_trans_last_60m/num_trans_last_7d AS count_ratio \
    FROM \
        ( \
        SELECT *, \
               COUNT(*) OVER w1 as num_trans_last_60m, \
               AVG(amount) OVER w1 as avg_amt_last_60m, \
               COUNT(*) OVER w2 as num_trans_last_7d, \
               AVG(amount) OVER w2 as avg_amt_last_7d \
        FROM transactions_df \
        WINDOW \
               w1 AS (PARTITION BY consumer_id order by cast(event_time AS timestamp) RANGE INTERVAL 60 MINUTES PRECEDING), \
               w2 AS (PARTITION BY consumer_id order by cast(event_time AS timestamp) RANGE INTERVAL 7 DAYS PRECEDING) \
        ) 
    """
    
    # registerTempTable is deprecated since Spark 2.0; use a temp view instead
    transactions_df.createOrReplaceTempView('transactions_df')
    aggregated_features = spark.sql(query)
    return aggregated_features


def write_to_s3(args: Namespace, aggregated_features: DataFrame) -> None:
    logger.info('[Write Aggregated Features to S3]')
    aggregated_features.coalesce(1) \
                       .write.format('csv') \
                       .option('header', True) \
                       .mode('overwrite') \
                       .option('sep', ',') \
                       .save('s3a://' + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix))
    
def group_by_consumer(aggregated_features: DataFrame) -> DataFrame: 
    logger.info('[Group Aggregated Features by Consumer ID]')
    window = Window.partitionBy('consumer_id').orderBy(desc('event_time'))
    sorted_df = aggregated_features.withColumn('rank', dense_rank().over(window))
    grouped_df = sorted_df.filter(sorted_df.rank == 1).drop(sorted_df.rank)
    sliced_df = grouped_df.select('tid', 'consumer_id', 'cc_num', 'num_trans_last_7d', 'avg_amt_last_7d', 'event_time')
    return sliced_df


def transform_row(sliced_df: DataFrame) -> list:
    """ Builds array of Records like this:
    [{'ValueAsString': '052f9e42ef6600c65943cd5474a7794a', 'FeatureName': 'tid'}, 
    {'ValueAsString': 'DRLB75328297313987', 'FeatureName': 'consumer_id'}, 
    {'ValueAsString': '4109784347062086', 'FeatureName': 'cc_num'}, 
    {'ValueAsString': '10', 'FeatureName': 'num_trans_last_7d'}, 
    {'ValueAsString': '284.51', 'FeatureName': 'avg_amt_last_7d'}, 
    {'ValueAsString': '2021-01-31T23:57:16Z', 'FeatureName': 'event_time'}]
    """
    logger.info('[Transform Spark DataFrame Row to SageMaker Feature Store Record]')
    records = []
    for row in sliced_df.rdd.collect():
        record = []
        tid, consumer_id, cc_num, num_trans_last_7d, avg_amt_last_7d, event_time = row
        # Skip rows that lack a record identifier
        if consumer_id:
            record.append({'ValueAsString': str(tid), 'FeatureName': 'tid'})
            record.append({'ValueAsString': str(consumer_id), 'FeatureName': 'consumer_id'})
            record.append({'ValueAsString': str(cc_num), 'FeatureName': 'cc_num'})
            record.append({'ValueAsString': str(num_trans_last_7d), 'FeatureName': 'num_trans_last_7d'})
            record.append({'ValueAsString': str(round(avg_amt_last_7d, 2)), 'FeatureName': 'avg_amt_last_7d'})
            record.append({'ValueAsString': str(event_time), 'FeatureName': 'event_time'})
            records.append(record)
    return records


def write_to_feature_store(records: list) -> None:
    logger.info('[Write Grouped Features to SageMaker Online Feature Store]')
    success, fail = 0, 0
    for record in records:
        response = feature_store_client.put_record(FeatureGroupName=FEATURE_GROUP, Record=record)
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:
            success += 1
        else:
            fail += 1
    logger.info('Success = {}'.format(success))
    logger.info('Fail = {}'.format(fail))
    assert fail == 0


def run_spark_job():
    spark = SparkSession.builder.appName('PySparkJob').getOrCreate()
    args = parse_args()
    schema = define_schema()
    aggregated_features = aggregate_features(args, schema, spark)
    write_to_s3(args, aggregated_features)
    sliced_df = group_by_consumer(aggregated_features)
    records = transform_row(sliced_df)
    write_to_feature_store(records)
    
    
if __name__ == '__main__':
    run_spark_job()

### Run SageMaker Processing Job

In [None]:
spark_processor = PySparkProcessor(base_job_name='sagemaker-processing', 
                                   framework_version='2.4', # spark version
                                   role=role, 
                                   instance_count=1, 
                                   instance_type='ml.r5.4xlarge', 
                                   env={'AWS_DEFAULT_REGION': boto3.Session().region_name},
                                   max_runtime_in_seconds=1200)

In [None]:
%%time

spark_processor.run(submit_app='batch_aggregation.py', 
                    arguments=['--s3_input_bucket', BUCKET, 
                               '--s3_input_key_prefix', INPUT_KEY_PREFIX, 
                               '--s3_output_bucket', BUCKET, 
                               '--s3_output_key_prefix', OUTPUT_KEY_PREFIX],
                    spark_event_logs_s3_uri='s3://{}/logs'.format(BUCKET),
                    logs=True)