<a href="https://colab.research.google.com/github/nisha1365/TECHNICAL_TRAINING_CTS/blob/main/Nisha_2211566_AWS_Case_Study_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Detect Heart Failure From Clinical Records With SageMaker Feature Store




### Setting up SageMaker Feature Store

Set up the SageMaker Python SDK and the boto3 client.

* s3fs: a Pythonic file interface to Amazon S3. pandas relies on it to read `s3://` paths directly, as `pd.read_csv` does later in this notebook.

* boto3: the AWS SDK for Python, used to create, configure, and manage AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The SDK provides an object-oriented API as well as low-level access to AWS services. The main benefit of the low-level Boto3 client is that it maps 1:1 to the actual AWS service API.

* Session: the session object that manages interactions with SageMaker API operations and any other AWS services that the training job uses.


In [None]:
!pip install s3fs
 
import boto3
import sagemaker
from sagemaker.session import Session

Keyring is skipped due to an exception: 'keyring.backends'
Collecting botocore<1.27.60,>=1.27.59
  Using cached botocore-1.27.59-py3-none-any.whl (9.1 MB)
Installing collected packages: botocore
  Attempting uninstall: botocore
    Found existing installation: botocore 1.29.8
    Uninstalling botocore-1.29.8:
      Successfully uninstalled botocore-1.29.8
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
boto3 1.26.8 requires botocore<1.30.0,>=1.29.8, but you have botocore 1.27.59 which is incompatible.
awscli 1.27.8 requires botocore==1.29.8, but you have botocore 1.27.59 which is incompatible.
awscli 1.27.8 requires PyYAML<5.5,>=3.10, but you have pyyaml 6.0 which is incompatible.
awscli 1.27.8 requires rsa<4.8,>=3.1.2, but you have rsa 4.9 which is incompatible.
Successfully installed botocore-1.27.59

### Feature store setup

To start using Feature Store, first create a SageMaker session, boto3 session, and a Feature Store session. 

sagemaker_client (boto3.SageMaker.Client) – Client which makes Amazon SageMaker service calls other than InvokeEndpoint (default: None). Estimators created using this Session use this client. If not provided, one will be created using this instance’s boto_session.

sagemaker_runtime_client (boto3.SageMakerRuntime.Client) – Client which makes InvokeEndpoint calls to Amazon SageMaker (default: None). Predictors created using this Session use this client. If not provided, one will be created using this instance’s boto_session.

sagemaker_featurestore_runtime_client (boto3.SageMakerFeatureStoreRuntime.Client) – Client which makes SageMaker FeatureStore record related calls to Amazon SageMaker (default: None). If not provided, one will be created using this instance’s boto_session.

In [None]:
region = boto3.Session().region_name
 
boto_session = boto3.Session(region_name=region)
 
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)
 
feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

### Set up S3 bucket for the offline store

Set up the S3 bucket that will hold your features; this is your offline store. The following code uses the SageMaker default bucket and adds a custom prefix to it.
SageMaker Feature Store assumes an IAM role that has access to the bucket. The role is owned by you.
The same bucket can be reused across different feature groups. Data in the bucket is partitioned by feature group.

In [None]:

# change the bucket name to your desired bucket name 
default_s3_bucket_name = feature_store_session.default_bucket()
prefix = 'feature-store'
 
print(default_s3_bucket_name)



sagemaker-ap-northeast-1-555918697305


### Setting up the IAM role

In [None]:
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

arn:aws:iam::555918697305:role/service-role/AmazonSageMaker-ExecutionRole-20221121T170842


### Importing Necessary Libraries

In [None]:
import pandas as pd
from IPython.display import display

### Uploading data to the S3 bucket

In [None]:
!aws s3 cp  ./clinical_records_dataset.csv s3://$default_s3_bucket_name/$prefix/data/


The user-provided path ./clinical_records_dataset.csv does not exist.


### Loading the dataset 

In [None]:
clinical_data_file_name = 'clinical_records_dataset.csv'
clinical_data_path = "s3://{}/{}/data/{}".format(default_s3_bucket_name, prefix, clinical_data_file_name)
clinical = pd.read_csv(clinical_data_path)
pd.set_option('display.max_columns', 500)
clinical.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


### Checking for null values

In [None]:
print('Fraction of values missing in each column:')
clinical.isnull().sum() / len(clinical)

Fraction of values missing in each column:


age                         0.0
anaemia                     0.0
creatinine_phosphokinase    0.0
diabetes                    0.0
ejection_fraction           0.0
high_blood_pressure         0.0
platelets                   0.0
serum_creatinine            0.0
serum_sodium                0.0
sex                         0.0
smoking                     0.0
time                        0.0
DEATH_EVENT                 0.0
dtype: float64

### There are no missing values

### Prepare data for Feature Store

In the Amazon SageMaker Feature Store API, a feature is an attribute of a record. You can define a name and type for every feature stored in Feature Store. The name uniquely identifies a feature within a feature group. The type identifies the datatype for the values of the feature. Supported datatypes are String, Integral, and Fractional.

### Feature Store Concepts
The following terms are key to understanding the capabilities of Amazon SageMaker Feature Store:

* Feature store – Serves as the single source of truth to store, retrieve, remove, track, share, discover, and control access to features.

* Feature – A measurable property or characteristic that encapsulates an observed phenomenon. In the Amazon SageMaker Feature Store API, a feature is an attribute of a record. You can define a name and type for every feature stored in Feature Store. Name uniquely identifies a feature within a feature group. Type identifies the datatype for the values of the feature. Supported datatypes are: String, Integral and Fractional. 

* Feature group – A FeatureGroup is the main Feature Store resource that contains the metadata for all the data stored in Amazon SageMaker Feature Store. A feature group is a logical grouping of features, defined in the feature store, to describe records. A feature group’s definition is composed of a list of feature definitions, a record identifier name, and configurations for its online and offline store. 

* Feature definition – A FeatureDefinition consists of a name and one of the following data types: an Integral, String or Fractional. A FeatureGroup contains a list of feature definitions. 

* Record identifier name – Each feature group is defined with a record identifier name. The record identifier name must refer to one of the names of a feature defined in the feature group's feature definitions.

* Record – A Record is a collection of values for features for a single record identifier value. A combination of record identifier name and a timestamp uniquely identify a record within a feature group. 

* Event time – A point in time when a new event occurs that corresponds to the creation or update of a record in a feature group. All records in the feature group must have a corresponding EventTime. It can be used to track changes to a record over time. The online store contains the record corresponding to the last EventTime for a record identifier name, whereas the offline store contains all historic records. Event time values can be of either fractional or string type. Fractional values must be UNIX timestamps. Strings must follow the ISO 8601 standard. The following formats are supported: yyyy-MM-dd'T'HH:mm:ssZ and yyyy-MM-dd'T'HH:mm:ss.SSSZ, where yyyy, MM, and dd represent the year, month, and day respectively, and HH, mm, ss, and, if applicable, SSS represent the hour, minute, second, and milliseconds respectively. T and Z are constants.

* Online store – The low-latency, high-availability cache for a feature group that enables real-time lookup of records. The online store allows quick access to the latest value for a Record via the GetRecord API. A feature group contains an OnlineStoreConfig controlling where the data is stored.

* Offline store – The OfflineStore stores historical data in your S3 bucket. It is used when low (sub-second) latency reads are not needed, for example when you want to store and serve features for exploration, model training, and batch inference. A feature group contains an OfflineStoreConfig controlling where the data is stored.

* Ingestion – The act of populating feature groups in the feature store.
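To make the two event time encodings described above concrete, the following sketch converts one timestamp into a UNIX value and into both supported ISO 8601 string formats. The helper names `event_time_as_unix` and `event_time_as_iso` are illustrative and are not part of the SageMaker SDK; only the Python standard library is used.

```python
from datetime import datetime, timezone


def event_time_as_unix(dt):
    """Return the event time as fractional UNIX seconds."""
    return dt.timestamp()


def event_time_as_iso(dt, with_millis=False):
    """Format a timezone-aware datetime in UTC as yyyy-MM-dd'T'HH:mm:ssZ,
    or as yyyy-MM-dd'T'HH:mm:ss.SSSZ when with_millis is True."""
    dt = dt.astimezone(timezone.utc)
    if with_millis:
        millis = dt.microsecond // 1000
        return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


# Example timestamp chosen for illustration only.
example = datetime(2022, 12, 2, 8, 58, 39, 123000, tzinfo=timezone.utc)
print(event_time_as_unix(example))
print(event_time_as_iso(example))
print(event_time_as_iso(example, with_millis=True))
```

This notebook uses the fractional (UNIX timestamp) form for its EventTime column; the string forms would be an alternative if the column were stored as a String feature.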

## Create a unique id for each patient

Add an ID for each patient. Here we turn the DataFrame index into a column and rename it to patient_id.

In [None]:



clinical.reset_index(inplace = True)

clinical.rename(columns = {'index': 'patient_id'}, inplace = True)



### Checking the first 5 rows

In [None]:
clinical.head()

Unnamed: 0,patient_id,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


### Checking the datatypes of the features

In [None]:
clinical.dtypes

patient_id                    int64
age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
DEATH_EVENT                   int64
dtype: object

### Changing the datatype of the patient id column as we want it to be treated as a string id

In [None]:
clinical['patient_id']=clinical['patient_id'].astype(object)

### Importing time and appending the EventTime feature to our current dataset

In [None]:

import time
 
current_time_sec = int(round(time.time()))
# append EventTime feature
clinical['EventTime'] = pd.Series([current_time_sec]*len(clinical), dtype="float64")


In [None]:
clinical.head()

Unnamed: 0,patient_id,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT,EventTime
0,0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1,1669972000.0
1,1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1,1669972000.0
2,2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1,1669972000.0
3,3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1,1669972000.0
4,4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1,1669972000.0


### Check data types for each column. Cast object dtype to string. The SageMaker Feature Store Python SDK will then map the string dtype to String feature type.

In [None]:
def cast_object_to_string(data_frame):
    for label in data_frame.columns:
        if data_frame.dtypes[label] == 'object':
            data_frame[label] = data_frame[label].astype("str").astype("string")
 

cast_object_to_string(clinical)

### Checking the datatypes 

In [None]:
clinical.dtypes

patient_id                   string
age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
DEATH_EVENT                   int64
EventTime                   float64
dtype: object

### Assign a feature group name

In [None]:
from time import gmtime, strftime, sleep

clinical_feature_group_name = 'clinical-feature-group-' + strftime('%d-%H-%M-%S', gmtime())
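The `%d-%H-%M-%S` suffix above encodes only the day of the month and the time of day, so two runs a month apart could in principle produce the same name. Below is a minimal sketch of a variant that includes the full date and checks the 64-character limit on feature group names. `timestamped_name` is a hypothetical helper written for this illustration, not an SDK function.

```python
from time import gmtime, strftime, time


def timestamped_name(base, epoch_seconds=None, fmt="%Y-%m-%d-%H-%M-%S", max_length=64):
    """Append a UTC timestamp to base and enforce a maximum length (hypothetical helper)."""
    if epoch_seconds is None:
        epoch_seconds = time()
    name = f"{base}-{strftime(fmt, gmtime(epoch_seconds))}"
    if len(name) > max_length:
        raise ValueError(f"name is {len(name)} characters, limit is {max_length}")
    return name


# Uses the current time; the exact suffix depends on when this runs.
print(timestamped_name("clinical-feature-group"))
```

Passing `fmt="%d-%H-%M-%S"` reproduces the naming scheme used in this notebook.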

## Create a feature group

In [None]:

from sagemaker.feature_store.feature_group import FeatureGroup

clinical_feature_group = FeatureGroup(name=clinical_feature_group_name, sagemaker_session=feature_store_session)



### Define identifier

Assigning record identifier and event time feature names

In [None]:

record_identifier_feature_name = "patient_id"
event_time_feature_name = "EventTime"

## Loading feature

In [None]:
clinical_feature_group.load_feature_definitions(data_frame=clinical); # output is suppressed
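`load_feature_definitions` infers a feature type for each column from its pandas dtype. The sketch below mimics that mapping for the three dtypes present in this dataset, so the expected result is easy to check by eye. It is a simplified stand-in written for illustration, not the SDK's implementation, and `infer_feature_definitions` is a hypothetical name.

```python
# Simplified, assumed mapping covering only the dtypes used in this notebook.
DTYPE_TO_FEATURE_TYPE = {
    "int64": "Integral",
    "float64": "Fractional",
    "string": "String",
}


def infer_feature_definitions(column_dtypes):
    """Map {column name: dtype name} to a list of feature definitions."""
    definitions = []
    for name, dtype in column_dtypes.items():
        if dtype not in DTYPE_TO_FEATURE_TYPE:
            raise ValueError(f"column {name!r} has unsupported dtype {dtype!r}")
        definitions.append(
            {"FeatureName": name, "FeatureType": DTYPE_TO_FEATURE_TYPE[dtype]}
        )
    return definitions


# A few columns from the clinical dataset, with the dtypes shown above.
sample_dtypes = {
    "patient_id": "string",
    "age": "float64",
    "anaemia": "int64",
    "EventTime": "float64",
}
for definition in infer_feature_definitions(sample_dtypes):
    print(definition)
```

An `object` column is rejected here on purpose: it mirrors why the notebook casts `patient_id` to the `string` dtype before loading the definitions.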

### Create feature group

In this step, we use the create function to create the feature group. The following code shows the parameters used in this example. The online store is not created by default, so you must set enable_online_store to True if you want to enable it. The s3_uri is the S3 location of your offline store.

In [None]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")
 
clinical_feature_group.create(
    s3_uri=f"s3://{default_s3_bucket_name}/{prefix}", #offline feature store bucket
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True
)
wait_for_feature_group_creation_complete(feature_group=clinical_feature_group)

Waiting for Feature Group Creation
Waiting for Feature Group Creation
Waiting for Feature Group Creation
Waiting for Feature Group Creation
Waiting for Feature Group Creation
FeatureGroup clinical-feature-group-02-08-58-42 successfully created.
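The waiting loop above polls until the status leaves "Creating", which would never return if the status stayed there. A minimal sketch of the same polling pattern with a timeout follows. `wait_for_status` is a hypothetical helper, and the status source is faked with an iterator so the example runs without AWS access.

```python
import time


def wait_for_status(get_status, pending="Creating", success="Created",
                    poll_seconds=5.0, timeout_seconds=600.0):
    """Poll get_status() until it leaves the pending state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {pending!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = get_status()
    if status != success:
        raise RuntimeError(f"unexpected final status: {status!r}")
    return status


# Stand-in for repeated describe() calls: two pending polls, then success.
fake_statuses = iter(["Creating", "Creating", "Created"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

In this notebook, the callable could be `lambda: clinical_feature_group.describe().get("FeatureGroupStatus")`, which is the same call the loop above makes.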


### Describe a Feature Group
You can retrieve information about your feature group with the describe function

In [None]:
clinical_feature_group.describe()

{'FeatureGroupArn': 'arn:aws:sagemaker:ap-northeast-1:555918697305:feature-group/clinical-feature-group-02-08-58-42',
 'FeatureGroupName': 'clinical-feature-group-02-08-58-42',
 'RecordIdentifierFeatureName': 'patient_id',
 'EventTimeFeatureName': 'EventTime',
 'FeatureDefinitions': [{'FeatureName': 'patient_id', 'FeatureType': 'String'},
  {'FeatureName': 'age', 'FeatureType': 'Fractional'},
  {'FeatureName': 'anaemia', 'FeatureType': 'Integral'},
  {'FeatureName': 'creatinine_phosphokinase', 'FeatureType': 'Integral'},
  {'FeatureName': 'diabetes', 'FeatureType': 'Integral'},
  {'FeatureName': 'ejection_fraction', 'FeatureType': 'Integral'},
  {'FeatureName': 'high_blood_pressure', 'FeatureType': 'Integral'},
  {'FeatureName': 'platelets', 'FeatureType': 'Fractional'},
  {'FeatureName': 'serum_creatinine', 'FeatureType': 'Fractional'},
  {'FeatureName': 'serum_sodium', 'FeatureType': 'Integral'},
  {'FeatureName': 'sex', 'FeatureType': 'Integral'},
  {'FeatureName': 'smoking', 'Featu

### List Feature Groups
You can list all of your feature groups with the list_feature_groups function.

In [None]:
sagemaker_client.list_feature_groups() #use boto client to list FeatureGroups

{'FeatureGroupSummaries': [{'FeatureGroupName': 'clinical-feature-group-02-08-58-42',
   'FeatureGroupArn': 'arn:aws:sagemaker:ap-northeast-1:555918697305:feature-group/clinical-feature-group-02-08-58-42',
   'CreationTime': datetime.datetime(2022, 12, 2, 8, 58, 47, 929000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Created'},
  {'FeatureGroupName': 'clinical-feature-group-02-08-49-13',
   'FeatureGroupArn': 'arn:aws:sagemaker:ap-northeast-1:555918697305:feature-group/clinical-feature-group-02-08-49-13',
   'CreationTime': datetime.datetime(2022, 12, 2, 8, 49, 15, 492000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Created',
   'OfflineStoreStatus': {'Status': 'Active'}},
  {'FeatureGroupName': 'clinical-feature-group-02-08-42-02',
   'FeatureGroupArn': 'arn:aws:sagemaker:ap-northeast-1:555918697305:feature-group/clinical-feature-group-02-08-42-02',
   'CreationTime': datetime.datetime(2022, 12, 2, 8, 42, 2, 772000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Created',
   'OfflineS

### Put Records in a Feature Group
We can use the ingest function to load our feature data. We pass in a data frame of feature data, set the number of workers, and choose whether to wait for ingestion to finish. The following example demonstrates using the ingest function.

In [None]:
clinical_feature_group.ingest(
    data_frame=clinical, max_workers=3, wait=True
)

IngestionManagerPandas(feature_group_name='clinical-feature-group-02-08-58-42', sagemaker_fs_runtime_client_config=<botocore.config.Config object at 0x7f57e4a0c0d0>, max_workers=3, max_processes=1, profile_name=None, _async_result=<multiprocess.pool.MapResult object at 0x7f57e2e9cbd0>, _processing_pool=<pool ProcessPool(ncpus=1)>, _failed_indices=[])

### Get records from a feature group

We can use the get_record function to retrieve the data for a specific record by its record identifier. The following example uses an example identifier to retrieve the record.

In [None]:
record_identifier_value = str(200)
 
featurestore_runtime.get_record(FeatureGroupName=clinical_feature_group_name, RecordIdentifierValueAsString=record_identifier_value)

{'ResponseMetadata': {'RequestId': 'de7e9726-1388-42d7-a31b-24ceaed4145f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'de7e9726-1388-42d7-a31b-24ceaed4145f',
   'content-type': 'application/json',
   'content-length': '788',
   'date': 'Fri, 02 Dec 2022 08:59:16 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'patient_id', 'ValueAsString': '200'},
  {'FeatureName': 'age', 'ValueAsString': '63.0'},
  {'FeatureName': 'anaemia', 'ValueAsString': '1'},
  {'FeatureName': 'creatinine_phosphokinase', 'ValueAsString': '1767'},
  {'FeatureName': 'diabetes', 'ValueAsString': '0'},
  {'FeatureName': 'ejection_fraction', 'ValueAsString': '45'},
  {'FeatureName': 'high_blood_pressure', 'ValueAsString': '0'},
  {'FeatureName': 'platelets', 'ValueAsString': '73000.0'},
  {'FeatureName': 'serum_creatinine', 'ValueAsString': '0.7'},
  {'FeatureName': 'serum_sodium', 'ValueAsString': '137'},
  {'FeatureName': 'sex', 'ValueAsString': '1'},
  {'FeatureName': 'smoking', 'Value

### Generate Hive DDL Commands
The SageMaker Python SDK’s FeatureGroup class also provides functionality to generate Hive DDL commands. The schema of the table is generated from the feature definitions. Columns are named after the feature names, and data types are inferred from the feature types.

In [None]:
print(clinical_feature_group.as_hive_ddl())

CREATE EXTERNAL TABLE IF NOT EXISTS sagemaker_featurestore.clinical-feature-group-02-08-58-42 (
  patient_id STRING
  age FLOAT
  anaemia INT
  creatinine_phosphokinase INT
  diabetes INT
  ejection_fraction INT
  high_blood_pressure INT
  platelets FLOAT
  serum_creatinine FLOAT
  serum_sodium INT
  sex INT
  smoking INT
  time INT
  DEATH_EVENT INT
  EventTime FLOAT
  write_time TIMESTAMP
  event_time TIMESTAMP
  is_deleted BOOLEAN
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 's3://sagemaker-ap-northeast-1-555918697305/feature-store/555918697305/sagemaker/ap-northeast-1/offline-store/clinical-feature-group-02-08-58-42-1669971527/data'


### Now let's wait for the data to appear in our offline store before moving on to creating a dataset. This takes approximately 5 minutes. SageMaker Feature Store adds metadata for each record that is ingested into the offline store.

In [None]:
%%time
s3_client = boto3.client('s3', region_name=region)
 
account_id = boto3.client('sts').get_caller_identity()["Account"]
 
clinical_feature_group_table_name = clinical_feature_group.describe().get('OfflineStoreConfig').get('DataCatalogConfig').get('TableName')
 
print(account_id)
print(clinical_feature_group_table_name)
 
clinical_feature_group_s3_prefix = prefix + '/' + account_id + '/sagemaker/' + region + '/offline-store/' + clinical_feature_group_table_name + '/data'
 
offline_store_contents = None
while (offline_store_contents is None):
    objects_in_bucket = s3_client.list_objects(Bucket=default_s3_bucket_name, Prefix=clinical_feature_group_s3_prefix)
    if ('Contents' in objects_in_bucket and len(objects_in_bucket['Contents']) >= 1):
        offline_store_contents = objects_in_bucket['Contents']
    else:
        print('Waiting for data in offline store...\n')
        sleep(60)

print('Data available.')

555918697305
clinical-feature-group-02-08-58-42-1669971527
Waiting for data in offline store...

Waiting for data in offline store...

Waiting for data in offline store...

Waiting for data in offline store...

Waiting for data in offline store...

Waiting for data in offline store...

Data available.
CPU times: user 168 ms, sys: 17.5 ms, total: 186 ms
Wall time: 6min 1s
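The S3 key prefix polled above follows the same layout as the LOCATION in the Hive DDL output shown earlier. As a small sketch, the concatenation can be wrapped in a function so the layout is stated in one place. `offline_store_data_prefix` is a hypothetical name, and the layout is taken from this notebook's output rather than from a documented guarantee.

```python
def offline_store_data_prefix(prefix, account_id, region, table_name):
    """Build the S3 key prefix under which offline store data files are expected."""
    return "/".join(
        [prefix, account_id, "sagemaker", region, "offline-store", table_name, "data"]
    )


# Values copied from the outputs printed in this notebook.
example_prefix = offline_store_data_prefix(
    prefix="feature-store",
    account_id="555918697305",
    region="ap-northeast-1",
    table_name="clinical-feature-group-02-08-58-42-1669971527",
)
print(example_prefix)
```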


### Build a Training dataset


Feature Store automatically builds an AWS Glue data catalog when you create feature groups; you can turn this off if you want. The following describes how to create a training dataset from the clinical feature group created earlier by running an Amazon Athena query against the data stored in the offline store.

To start, create an Athena query using athena_query() for the clinical feature group. The `table_name` is the AWS Glue table that is autogenerated by Feature Store.

In [None]:
clinical_query = clinical_feature_group.athena_query()
clinical_table = clinical_query.table_name

### Write and Execute an Athena Query
You write your query in SQL against the feature group's table, execute it with the .run() command, and specify the S3 location where the result set is saved.

In [None]:
# Athena query
query_string = 'SELECT * FROM "' + clinical_table + '" LIMIT 290'

# Run the Athena query and load the result into a pandas DataFrame.
# The output location includes the prefix so that it matches the result path built below.
clinical_query.run(query_string=query_string, output_location='s3://' + default_s3_bucket_name + '/' + prefix + '/query_results/')
clinical_query.wait()
dataset = clinical_query.as_dataframe()

In [None]:
# Patient IDs that the query did not return are held out for testing.
trained_ids = set(dataset['patient_id'])
id_for_test = [i for i in range(len(clinical)) if i not in trained_ids]
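Because patient_id is stored as a String feature, the IDs in the query result may be integers or strings depending on how the result is parsed. A defensive sketch of the hold-out computation that compares IDs as strings is shown below; `held_out_ids` is a hypothetical helper, and the toy data stands in for the real query result.

```python
def held_out_ids(all_ids, selected_ids):
    """Return the IDs in all_ids that are absent from selected_ids, comparing as strings."""
    selected = {str(i) for i in selected_ids}
    return [i for i in all_ids if str(i) not in selected]


all_patient_ids = range(10)                          # toy stand-in for range(299)
queried_ids = ["0", "1", "2", "3", "5", "6", "8"]    # toy stand-in for dataset['patient_id']
print(held_out_ids(all_patient_ids, queried_ids))
```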

### Prepare dataset for training

In [None]:
# Prepare query results for training.
query_execution = clinical_query.get_query_execution()
query_result = 's3://'+default_s3_bucket_name+'/'+prefix+'/query_results/'+query_execution['QueryExecution']['QueryExecutionId']+'.csv'
print(query_result)

s3://sagemaker-ap-northeast-1-555918697305/feature-store/query_results/a4c83981-4e07-4abf-9777-57694ff89b81.csv


In [None]:
# Select useful columns for training with target column as the first.
dataset = dataset[["death_event", "age", 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']]
# Write to csv in S3 without headers and index column.
dataset.to_csv('dataset.csv', header=False, index=False)
s3_client.upload_file('dataset.csv', default_s3_bucket_name, prefix+'/training_input/dataset.csv')
dataset_uri_prefix = 's3://'+default_s3_bucket_name+'/'+prefix+'/training_input/';

In [None]:
dataset.head()

Unnamed: 0,death_event,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,1,82.0,1,379,0,50,0,47000.0,1.3,136,1,0,13
1,0,70.0,0,1202,0,50,1,358000.0,0.9,141,0,0,196
2,1,70.0,0,571,1,45,1,185000.0,1.2,139,1,1,33
3,0,60.0,0,2261,0,35,1,228000.0,0.9,136,1,0,115
4,0,72.0,0,127,1,50,1,218000.0,1.0,134,1,0,33
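The training file written above has the target in the first column and no header row or index. The following standard-library sketch shows that layout on two rows taken from the table above, using only three of the feature columns for brevity. `to_training_csv` is a hypothetical helper, not part of pandas or SageMaker.

```python
import csv
import io


def to_training_csv(rows, target, features):
    """Serialize rows as headerless CSV with the target value in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[target]] + [row[name] for name in features])
    return buffer.getvalue()


rows = [
    {"age": 82.0, "anaemia": 1, "time": 13, "death_event": 1},
    {"age": 70.0, "anaemia": 0, "time": 196, "death_event": 0},
]
csv_text = to_training_csv(rows, target="death_event", features=["age", "anaemia", "time"])
print(csv_text)
```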


### Train and Deploy the Model
For model training, we use a SageMaker built-in algorithm, XGBoost, to predict whether a patient is likely to have heart failure. SageMaker built-in algorithms provide highly optimized implementations of popular machine learning algorithms, which simplifies machine learning development and accelerates training and deployment. We retrieve the SageMaker XGBoost container image and construct a generic SageMaker Estimator.

In [None]:
training_image=sagemaker.image_uris.retrieve("xgboost", region, "1.0-1")


In [None]:
training_output_path='s3://' + default_s3_bucket_name+'/'+prefix + '/training_output'

role (str): An AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.

instance_count (int or PipelineVariable): Number of Amazon EC2 instances to use for training. Required if instance_groups is not set.

instance_type (str or PipelineVariable): Type of EC2 instance to use for training, for example, 'ml.m5.2xlarge'. Required if instance_groups is not set.

volume_size (int or PipelineVariable): Size in GB of the storage volume to use for storing input and output data during training (default: 30). Must be large enough to store training data if File mode is used, which is the default mode.

max_run (int or PipelineVariable): Timeout in seconds for training (default: 24 * 60 * 60). After this amount of time Amazon SageMaker terminates the job regardless of its current status.

input_mode (str or PipelineVariable): The input mode that the algorithm supports. Valid modes are File, Pipe, and FastFile (default: ‘File’). In File mode, Amazon SageMaker copies the training dataset from the S3 location to a local directory.

output_path (str or PipelineVariable) – S3 location for saving the training result (here, training_output_path). If not specified, results are stored to a default bucket.

sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed (here, feature_store_session). If not specified, the estimator creates one using the default AWS configuration chain.

* objective: Specifies the learning task and the corresponding learning objective.
* num_round : The number of rounds to run the training

In [None]:


from sagemaker.estimator import Estimator

training_model = Estimator(training_image,
                           role,
                           instance_count=1,
                           instance_type='ml.m5.2xlarge',
                           volume_size=5,
                           max_run=3600,
                           input_mode='File',
                           output_path=training_output_path,
                           sagemaker_session=feature_store_session)

training_model.set_hyperparameters(objective="binary:logistic",
                                   num_round=50)

Because of cost considerations, the goal of this example is to showcase Feature Store capabilities, not necessarily to achieve the best result. In this example, we skip hyperparameter tuning and, apart from objective and num_round, go with the default hyperparameters.

### Specifying the training data we just created

* distribution (str): Valid values: ‘FullyReplicated’, ‘ShardedByS3Key’ (default: ‘FullyReplicated’).
* content_type (str): MIME type of the input data (default: None).
* s3_data_type (str): Valid values: ‘S3Prefix’, ‘ManifestFile’, ‘AugmentedManifestFile’. If ‘S3Prefix’, s3_data defines a prefix of s3 objects to train on. All objects with s3 keys beginning with s3_data will be used to train.

In [None]:
train_data = sagemaker.inputs.TrainingInput(dataset_uri_prefix, distribution='FullyReplicated',
                                            content_type='text/csv', s3_data_type='S3Prefix')
data_channels = {'train': train_data}

### Fitting the model

In [None]:
training_model.fit(inputs=data_channels, logs= True)

2022-12-02 09:17:39 Starting - Starting the training job...
2022-12-02 09:18:03 Starting - Preparing the instances for training
ProfilerReport-1669972658: InProgress
............
2022-12-02 09:20:04 Downloading - Downloading input data
2022-12-02 09:20:04 Training - Training image download completed. Training in progress..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[09:20:06] 290x12 matrix with 3480 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
INFO:root:Sing

### Set up Hosting for the Model

Once training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This allows us to make predictions (inference) with the model. The endpoint deployment can be accomplished as follows.

In [None]:

predictor = training_model.deploy(initial_instance_count = 1, instance_type = 'ml.m5.xlarge')



-----!

### SageMaker Feature Store During Inference
SageMaker Feature Store can be useful for supplementing the data in an inference request because of its low-latency GetRecord functionality. For this demo, we are given a patient ID and query our online feature group to build the inference request.

From the patient IDs left out of training, we can choose one to test real-time inference. In this example, we choose patient 194 (any ID from the held-out list id_for_test can be used for testing).

In [None]:
patient_id = str(194)
 
# Helper to parse the feature value from the record.
def get_feature_value(record, feature_name):
    return str(list(filter(lambda r: r['FeatureName'] == feature_name, record))[0]['ValueAsString'])
 
clinical_response = featurestore_runtime.get_record(FeatureGroupName=clinical_feature_group_name, RecordIdentifierValueAsString=patient_id)
clinical_record = clinical_response['Record']
clinical_record

[{'FeatureName': 'patient_id', 'ValueAsString': '194'},
 {'FeatureName': 'age', 'ValueAsString': '45.0'},
 {'FeatureName': 'anaemia', 'ValueAsString': '0'},
 {'FeatureName': 'creatinine_phosphokinase', 'ValueAsString': '582'},
 {'FeatureName': 'diabetes', 'ValueAsString': '0'},
 {'FeatureName': 'ejection_fraction', 'ValueAsString': '20'},
 {'FeatureName': 'high_blood_pressure', 'ValueAsString': '1'},
 {'FeatureName': 'platelets', 'ValueAsString': '126000.0'},
 {'FeatureName': 'serum_creatinine', 'ValueAsString': '1.6'},
 {'FeatureName': 'serum_sodium', 'ValueAsString': '135'},
 {'FeatureName': 'sex', 'ValueAsString': '1'},
 {'FeatureName': 'smoking', 'ValueAsString': '0'},
 {'FeatureName': 'time', 'ValueAsString': '180'},
 {'FeatureName': 'DEATH_EVENT', 'ValueAsString': '1'},
 {'FeatureName': 'EventTime', 'ValueAsString': '1669971519.0'}]

GetRecord is used for online store serving from a feature store. Only the latest record stored in the online store can be retrieved. If no record with the given RecordIdentifierValue is found, an empty result is returned.
* FeatureGroupName: The name of the feature group from which you want to retrieve a record. Length Constraints: Minimum length of 1. Maximum length of 64.
* RecordIdentifierValueAsString: The value that corresponds to RecordIdentifier type and uniquely identifies the record in the FeatureGroup. Length Constraints: Maximum length of 358400.
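Since a record comes back as a list of name/value pairs, it can be convenient to flatten it into a dictionary before use. The sketch below does that on a hand-written response shaped like the output above. `record_to_dict` is a hypothetical helper, and treating a response without a `Record` entry as an empty record is an assumption about what the "empty result" looks like.

```python
def record_to_dict(response):
    """Flatten a GetRecord-style response into {feature name: value as string}."""
    return {
        item["FeatureName"]: item["ValueAsString"]
        for item in response.get("Record", [])
    }


# Hand-written sample shaped like the GetRecord output shown above.
sample_response = {
    "Record": [
        {"FeatureName": "patient_id", "ValueAsString": "194"},
        {"FeatureName": "age", "ValueAsString": "45.0"},
        {"FeatureName": "ejection_fraction", "ValueAsString": "20"},
    ]
}
features = record_to_dict(sample_response)
print(features)

# A response with no "Record" entry yields an empty dict instead of a KeyError.
print(record_to_dict({}))
```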


### Next, we pick the feature values from the retrieved record, excluding the record identifier, the event time, and the target variable, and build a list of values as the input to the predictor.

In [None]:
# Feature order must match the column order of the training data, target excluded.
feature_names = [
    'age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
    'ejection_fraction', 'high_blood_pressure', 'platelets',
    'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time'
]
inference_request = [get_feature_value(clinical_record, name) for name in feature_names]
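The order of the values in the request matters: it has to match the column order of the training CSV with the target column removed. A small sketch that builds the CSV payload from a dictionary of feature values and fails loudly if a feature is missing is shown below. `TRAINING_FEATURE_ORDER` and `build_csv_payload` are names introduced for this illustration; the values are those retrieved for patient 194 above.

```python
# Column order of the training data, without the target column.
TRAINING_FEATURE_ORDER = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
]


def build_csv_payload(features, order=TRAINING_FEATURE_ORDER):
    """Join feature values in training order; raise if any feature is missing."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return ",".join(str(features[name]) for name in order)


# Record for patient 194; the identifier, target, and event time are ignored.
patient_194 = {
    "patient_id": "194", "age": "45.0", "anaemia": "0",
    "creatinine_phosphokinase": "582", "diabetes": "0",
    "ejection_fraction": "20", "high_blood_pressure": "1",
    "platelets": "126000.0", "serum_creatinine": "1.6",
    "serum_sodium": "135", "sex": "1", "smoking": "0", "time": "180",
    "DEATH_EVENT": "1", "EventTime": "1669971519.0",
}
payload = build_csv_payload(patient_194)
print(payload)
```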




The predictor calls our hosted model and returns a prediction result. The model predicts the probability of heart failure for the patient.

### Parsing the prediction response as JSON

In [None]:
import json

results = predictor.predict(','.join(inference_request), initial_args = {"ContentType": "text/csv"})
prediction = json.loads(results)
print (prediction)

0.9829193353652954
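The value returned is a probability, so turning it into a yes/no label requires a decision threshold. The sketch below parses a CSV-style response body and applies a threshold of 0.5. Both helper names are hypothetical, and the threshold is an arbitrary illustrative choice rather than a clinically validated one.

```python
def parse_predictions(payload):
    """Parse a comma- or newline-separated response body into a list of floats."""
    if isinstance(payload, (bytes, bytearray)):
        payload = payload.decode("utf-8")
    tokens = payload.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


def to_label(probability, threshold=0.5):
    """Return 1 when the probability reaches the threshold, else 0."""
    return int(probability >= threshold)


# Response body copied from the prediction printed above.
probabilities = parse_predictions(b"0.9829193353652954")
labels = [to_label(p) for p in probabilities]
print(probabilities, labels)
```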


##### Deleting the model endpoint and the FeatureGroup after we are done with this demo due to cost consideration

In [None]:
predictor.delete_endpoint()
clinical_feature_group.delete()