* changed by nov05 on 2024-11-24  
* [Exercise](https://www.evernote.com/shard/s139/u/0/sh/d22b9fe5-9992-4dd0-9402-c623cdbc90b4/rJcBRGAXxAQfdkl3kqgZ1N2VIKFVDSOLDimPDJFMhwHEQmyRu0AHwQTqxw), [solution](https://www.evernote.com/shard/s139/u/0/sh/00654bdd-0c00-4525-a0e4-beeeccb17e18/UdvEWior2s6PYMczVqGFVPf1PM_g35bFRrPiVlm79SfmnVhyPY_BJQPsiw)    

# Exercises

This is the notebook containing the exercises for Feature Store, Model Monitor, and Clarify. Testing for these exercises was performed using a __2 vCPU + 4 GiB notebook instance with Python 3 (TensorFlow 2.1 Python 3.6 CPU Optimized) kernel__.

## Staging

We'll begin by initializing some variables. These are often assumed to be present in code samples you'll find in the AWS documentation.

In [2]:
import sagemaker # type: ignore
from sagemaker import get_execution_role # type: ignore
from sagemaker.session import Session # type: ignore

role_arn = get_execution_role()  ## get role ARN
if 'AmazonSageMaker-ExecutionRole' not in role_arn:
    role_arn = "arn:aws:iam::807711953667:role/service-role/AmazonSageMaker-ExecutionRole-20241121T213663"
print("Role ARN:", role_arn) ## If local, Role ARN: arn:aws:iam::807711953667:role/voclabs
session = sagemaker.Session()
region = session.boto_region_name
# bucket = session.default_bucket()
bucket = "sagemaker-studio-807711953667-mmx0am1bt28"

Role ARN: arn:aws:iam::807711953667:role/service-role/AmazonSageMaker-ExecutionRole-20241121T213663


## **👉 Feature Store**  
---

Feature Store is a special database to give ML systems a consistent data flow across training and inference workloads. It can ingest data in batches (for training) as well as serve input features to models with very low latency for real-time prediction.

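For this exercise we'll work with a wine dataset. Note that the code below loads scikit-learn's built-in `load_wine` data (the UCI Wine recognition dataset), which is not the same as the Wine Quality dataset linked and cited here: https://archive.ics.uci.edu/ml/datasets/wine+quality/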

```
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
```

In [3]:
import pandas as pd # type: ignore
from sklearn import datasets # type: ignore
import time
# import uuid

data = datasets.load_wine()
df = pd.DataFrame(data['data'])
df.columns = data['feature_names']
print(df.columns)

Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue',
       'od280/od315_of_diluted_wines', 'proline'],
      dtype='object')


If we leave the column names as-is, Feature Store won't be able to handle the `/` in `od280/od315_of_diluted_wines` (`/` is a delimiter Feature Store uses to manage how features are organized), so we rename that column.
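To guard against other problematic column names, one option is to normalize every name before loading the feature definitions. The sketch below uses a hypothetical helper, `sanitize_feature_name`, which is not part of the SageMaker SDK. It assumes feature names should contain only letters, digits, underscores, and hyphens, and be at most 64 characters long; check the Feature Store documentation for the authoritative rules.

```python
import re


def sanitize_feature_name(name: str, max_length: int = 64) -> str:
    """Replace runs of disallowed characters with '_' (assumed naming rules)."""
    # Anything that is not a letter, digit, underscore, or hyphen becomes '_'.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", name).strip("_-")
    # Truncate, then make sure the name does not end with a separator.
    cleaned = cleaned[:max_length].rstrip("_-")
    if not cleaned:
        raise ValueError(f"cannot derive a feature name from {name!r}")
    return cleaned


# A few of the wine column names, including the problematic one.
columns = ["alcohol", "od280/od315_of_diluted_wines", "proline"]
renamed = {c: sanitize_feature_name(c) for c in columns}
print(renamed["od280/od315_of_diluted_wines"])  # od280_od315_of_diluted_wines
```

With a DataFrame, the resulting mapping could be passed to `df.rename(columns=renamed, inplace=True)`.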

In [4]:
df.rename(columns={'od280/od315_of_diluted_wines':'od280_od315_of_diluted_wines'}, inplace=True)
## Add columns for feature group
df["EventTime"] = time.time()
df["ID"] = range(len(df))
df.sample(3)

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280_od315_of_diluted_wines | proline | EventTime | ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 12.08 | 1.13 | 2.51 | 24.0 | 78.0 | 2.0 | 1.58 | 0.4 | 1.4 | 2.2 | 1.31 | 2.72 | 630.0 | 1732473000.0 | 82 |
| 39 | 14.22 | 3.99 | 2.51 | 13.2 | 128.0 | 3.0 | 3.04 | 0.2 | 2.08 | 5.1 | 0.89 | 3.53 | 760.0 | 1732473000.0 | 39 |
| 109 | 11.61 | 1.35 | 2.7 | 20.0 | 94.0 | 2.74 | 2.92 | 0.29 | 2.49 | 2.65 | 0.96 | 3.26 | 680.0 | 1732473000.0 | 109 |


Once we have our data, we can create a feature group. Remember to attach event time and ID columns - Feature Store needs them.

In [5]:
from sagemaker.feature_store.feature_group import FeatureGroup # type: ignore
 
# TODO: Create feature group
feature_group_name = "wine-features"
feature_group = FeatureGroup(
    name=feature_group_name, 
    sagemaker_session=session
)
# TODO: Load Feature definitions
feature_group.load_feature_definitions(data_frame=df)

[FeatureDefinition(feature_name='alcohol', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=None),
 FeatureDefinition(feature_name='malic_acid', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=None),
 FeatureDefinition(feature_name='ash', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=None),
 FeatureDefinition(feature_name='alcalinity_of_ash', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=None),
 FeatureDefinition(feature_name='magnesium', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=None),
 FeatureDefinition(feature_name='total_phenols', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=None),
 FeatureDefinition(feature_name='flavanoids', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=None),
 FeatureDefinition(feature_name='nonflavanoid_phenols', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collec

The feature group is not created until we call the `create` method. Let's do that now:

In [6]:
# Create the feature store:
feature_group.create(
    s3_uri=f"s3://{bucket}/features",
    record_identifier_name='ID',
    event_time_feature_name="EventTime",
    role_arn=role_arn,
    enable_online_store=True,
)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:807711953667:feature-group/wine-features',
 'ResponseMetadata': {'RequestId': '9adc9df7-1db7-4c01-85ea-063dc7aa4f9d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '9adc9df7-1db7-4c01-85ea-063dc7aa4f9d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '90',
   'date': 'Sun, 24 Nov 2024 18:28:03 GMT'},
  'RetryAttempts': 0}}

🟢⚠️ Issue explained: When I ran `create` from a local notebook, I got the response below (HTTP 200), but the creation itself failed. To see the failure reason, go to `SageMaker Studio > Data > Feature Store`, click on the feature group, and open the “Details” tab. It seems the assumed role lacks certain permissions. Creating the feature group from a SageMaker notebook succeeded.      
```
{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:807711953667:feature-group/wine-features',
 'ResponseMetadata': {'RequestId': 'd49e3486-e1cb-414e-86fe-0f56d5fbf5fa',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd49e3486-e1cb-414e-86fe-0f56d5fbf5fa',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '90',
   'date': 'Sun, 24 Nov 2024 17:56:18 GMT'},
  'RetryAttempts': 0}}
```

In [7]:
feature_group = FeatureGroup(name=feature_group_name)
feature_group_status = feature_group.describe()['FeatureGroupStatus']
print(feature_group_status, type(feature_group_status))  ## When run locally: CreateFailed <class 'str'>

Created <class 'str'>
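
Because `create` returns before the feature group is ready, a single status check can still report `Creating`. The sketch below shows one way to poll until the status settles. `wait_for_feature_group` is a hypothetical helper, not an SDK function, and it takes a callable so it can be run offline with stand-in statuses.

```python
import time
from typing import Callable


def wait_for_feature_group(get_status: Callable[[], str],
                           poll_seconds: float = 5.0,
                           max_polls: int = 60) -> str:
    """Poll until the status is no longer 'Creating', then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status != "Creating":
            return status  # e.g. 'Created' or 'CreateFailed'
        time.sleep(poll_seconds)
    raise TimeoutError("feature group is still in 'Creating' status")


# Stand-in for feature_group.describe()['FeatureGroupStatus'] (made-up sequence).
fake_statuses = iter(["Creating", "Creating", "Created"])
final_status = wait_for_feature_group(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

Against a real feature group, the callable could be `lambda: feature_group.describe()["FeatureGroupStatus"]`.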


Lastly, ingest some data into your feature group:

In [8]:
# TODO
if feature_group_status=='Created':
    feature_group.ingest(data_frame=df, 
                        max_workers=5, 
                        wait=True)
else:
    print('⚠️ The feature group is not created.')

Great job! You've demonstrated your understanding of creating feature groups and ingesting data into them using Feature Store. Next up we'll cover Model Monitor!

## **👉 Model Monitor**  

In this exercise we'll create a monitoring schedule for a deployed model. We're going to provide code to help you deploy a model and get started, so that you can focus on Model Monitor for this exercise. __Remember to clean up your model before you end a work session__. We'll provide some code at the end to help you clean up your model. We'll begin by reloading our data from the previous exercise.



In [None]:
# data = datasets.load_wine()
# df = pd.DataFrame(data['data'])
# df.columns = data['feature_names']
# df.rename(columns = {'od280/od315_of_diluted_wines':'od280_od315_of_diluted_wines'}, inplace=True)

We also need to put the target variable in the first column per the docs for our chosen algorithm: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html  

* Tips:   
  * Ensuring the `TARGET` column is the first column:  
  *`df.pop('TARGET')` removes the column and returns it, `set_index` makes it the index, and `df.reset_index(inplace=True)` converts the index back into a regular column, which is inserted at the front of the DataFrame.*
  * Removing the column before resetting the index:  
  *Because `df.pop('TARGET')` removes `TARGET` from the DataFrame first, it is not duplicated when `reset_index()` adds the index back as a column.*

In [None]:
df["TARGET"] = data['target']
df.set_index(df.pop('TARGET'), inplace=True)
df.reset_index(inplace=True)
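
An alternative to the pop/set_index/reset_index sequence is to reorder the column labels directly. The sketch below uses a hypothetical helper, `target_first`, that works on any sequence of column names; it is an illustration, not the notebook's original approach.

```python
from typing import List, Sequence


def target_first(columns: Sequence[str], target: str) -> List[str]:
    """Return the column names with `target` moved to the front."""
    if target not in columns:
        raise KeyError(target)
    return [target] + [c for c in columns if c != target]


# Toy column list standing in for df.columns.
ordered = target_first(["alcohol", "ash", "TARGET"], "TARGET")
print(ordered)  # ['TARGET', 'alcohol', 'ash']
```

With a DataFrame this could be applied as `df = df[target_first(df.columns, "TARGET")]`.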

Now we'll upload the data to S3 as train and validation data:

In [None]:
delimiter = int(len(df)/2)
train, test = df.iloc[delimiter:], df.iloc[:delimiter]

train.to_csv("train.csv", header=False, index=False)
test.to_csv("validation.csv", header=False, index=False)

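## Upload to `bucket` explicitly so the files land where later cells expect them (s3://{bucket}/data/...)
val_location = session.upload_data('./validation.csv', bucket=bucket, key_prefix="data")
train_location = session.upload_data('./train.csv', bucket=bucket, key_prefix="data")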

s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')
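
The split above is positional: the second half of the rows becomes the training set and the first half the validation set. Since `load_wine` returns its 178 samples grouped by class, the two halves see different class mixes. The sketch below shows which row positions each half receives and, as an optional variation that is not part of the original exercise, a seeded shuffle. `split_positions` is a hypothetical helper.

```python
import random
from typing import List, Optional, Tuple


def split_positions(n_rows: int,
                    seed: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Return (train_positions, validation_positions) for a half/half split."""
    positions = list(range(n_rows))
    if seed is not None:
        # Optional: shuffle reproducibly before splitting.
        random.Random(seed).shuffle(positions)
    delimiter = int(n_rows / 2)
    return positions[delimiter:], positions[:delimiter]


# Same behaviour as the cell above: rows 89..177 train, rows 0..88 validation.
train_pos, val_pos = split_positions(178)
# Shuffled variation with a fixed seed.
shuffled_train, shuffled_val = split_positions(178, seed=0)
```

The shuffled positions could be used as `train, test = df.iloc[shuffled_train], df.iloc[shuffled_val]`.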

In [None]:
algo_image = sagemaker.image_uris.retrieve("xgboost", region, version='latest')
s3_output_location = f"s3://{bucket}/models/wine_model"

model=sagemaker.estimator.Estimator(
    image_uri=algo_image,
    role=role_arn,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    volume_size=5,
    output_path=s3_output_location,
    sagemaker_session=session
)
model.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    objective='reg:linear',
    early_stopping_rounds=10,
    num_round=200
)
model.fit({'train': s3_input_train, 
           'validation': s3_input_validation})
## go to "SageMaker - Training - Training jobs". Make sure the job is completed.

Now that your training job has finished, you can perform the first task in this exercise:   
* Creating a data capture config. Configure your model to sample `34%` of inferences.  

In [None]:
# TODO
from sagemaker.model_monitor import DataCaptureConfig # type: ignore

destination_s3_uri = f's3://{bucket}/data-capture'
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=34,
    destination_s3_uri=destination_s3_uri
)

Great! We'll use your config to deploy a model below:

In [None]:
xgb_predictor = model.deploy(
    initial_instance_count=1, 
    instance_type='ml.m4.xlarge',
    data_capture_config=data_capture_config
)
## go to "SageMaker - Inference - Endpoints" to check the result

Great! You should see an indicator like this when the deployment finishes:

```
-----------------!
```
We can test your deployment like so:

In [None]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()
inputs = test.copy()
# Drop the target variable
inputs = inputs.drop(columns=inputs.columns[0])
x_pred = xgb_predictor.predict(inputs.sample(5).values).decode('utf-8')
x_pred
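
The decoded response is a single string, so it usually needs to be converted to numbers before further use. The sketch below assumes the predictions arrive separated by commas and/or newlines; the exact separator may depend on the container version, and the sample payload is made up. `parse_csv_predictions` is a hypothetical helper.

```python
import re
from typing import List


def parse_csv_predictions(payload: str) -> List[float]:
    """Split a decoded CSV response on commas/whitespace and convert to floats."""
    tokens = re.split(r"[,\s]+", payload.strip())
    return [float(tok) for tok in tokens if tok]


# Made-up payload in the assumed format, not real endpoint output.
sample_payload = "1.02,0.13\n1.97"
parsed = parse_csv_predictions(sample_payload)
print(parsed)  # [1.02, 0.13, 1.97]
```

In the notebook this could be called as `parse_csv_predictions(x_pred)`.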

All systems go! To finish up the exercise, we're going to provide you with a `DefaultModelMonitor` and a suggested baseline. Combine the `xgb_predictor` and the provided `my_monitor` to configure the monitoring schedule for _hourly_ monitoring.

In [None]:
from sagemaker.model_monitor import DefaultModelMonitor # type: ignore
from sagemaker.model_monitor.dataset_format import DatasetFormat # type: ignore

my_monitor = DefaultModelMonitor(
    role=role_arn,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
my_monitor.suggest_baseline(
    baseline_dataset=f's3://{bucket}/data/train.csv',
    dataset_format=DatasetFormat.csv(header=False),
)

Below, provide the monitoring schedule:

In [None]:
# TODO
from sagemaker.model_monitor import CronExpressionGenerator # type: ignore

my_monitor.create_monitoring_schedule(
    monitor_schedule_name='wine-monitoring-schedule',
    endpoint_input=xgb_predictor.endpoint_name,
    statistics=my_monitor.baseline_statistics(),
    constraints=my_monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
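
The schedule is ultimately just a cron string. The sketch below uses local stand-in functions (not the SDK class) to show the strings that, to my understanding, `CronExpressionGenerator.hourly()` and `CronExpressionGenerator.daily()` produce. Print the real values in your environment to confirm.

```python
def hourly_cron() -> str:
    """Assumed equivalent of CronExpressionGenerator.hourly(): minute 0 of every hour."""
    return "cron(0 * ? * * *)"


def daily_cron(hour: int = 0) -> str:
    """Assumed equivalent of CronExpressionGenerator.daily(hour): once a day at `hour` UTC."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return f"cron(0 {hour} ? * * *)"


print(hourly_cron())
print(daily_cron())
```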

Great job! You can check that your schedule was created by going to `Amazon SageMaker > Model dashboard > <your model> > Monitor schedule`. In this exercise you configured Model Monitor to watch a simple model. Next, we'll monitor the same deployment for explainability.

__REMINDER:__ Don't leave your model deployed overnight. If you aren't going to follow up with the Clarify exercise within a few hours, use the code below to remove your model:

In [None]:
monitors = xgb_predictor.list_monitors()
for monitor in monitors:
    monitor.delete_monitoring_schedule()
xgb_predictor.delete_endpoint()

## **👉 Clarify**  

For the last exercise we'll deploy an explainability monitor using [`Clarify`](https://aws.amazon.com/sagemaker/clarify/). We're going to use the model that you deployed in the last exercise, but if you cleaned up your deployments from the previous exercise, that's ok! You can rerun the deployment from the previous exercise up to the point where we deployed our model. It'll look like this:

```python
xgb_predictor = model.deploy(
    initial_instance_count=1, instance_type='ml.m4.xlarge',
    data_capture_config=data_capture_config
)
```

Once your model is deployed, you can come back here. _REMINDER_: you need to clean up your deployment. Don't leave it running overnight. We'll provide some code at the end to delete your deployment.

*  Amazon SageMaker Examples:    
  [Fairness and Explainability with `SageMaker Clarify`](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.html)   

## Prep

We'll begin by reloading our data from the previous exercise.

In [None]:
# data = datasets.load_wine()
# df = pd.DataFrame(data['data'])
# df.columns = data['feature_names']
# df.rename(columns = {'od280/od315_of_diluted_wines':'od280_od315_of_diluted_wines'}, inplace=True)

We also need to put the target variable in the first column per the docs for our chosen algorithm: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

In [None]:
# df["TARGET"] = data['target']
# df.set_index(df.pop('TARGET'), inplace=True)
# df.reset_index(inplace=True)

Now we'll upload the data to S3 as train and validation data:

In [None]:
# delimiter = int(len(df)/2)
# train, test = df.iloc[delimiter:], df.iloc[:delimiter]

# train.to_csv("train.csv", header=False, index=False)
# test.to_csv("validation.csv", header=False, index=False)

# val_location = session.upload_data('./validation.csv', key_prefix="data")
# train_location = session.upload_data('./train.csv', key_prefix="data")

# s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='csv')
# s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='csv')

Great! Our data is staged and our model is deployed, so let's monitor it for explainability. We need to define three config objects: the `SHAPConfig`, the `ModelConfig`, and the `ExplainabilityAnalysisConfig`. Below, we provide the `SHAPConfig`.

In [121]:
shap_config = sagemaker.clarify.SHAPConfig(
    baseline=[train.mean().astype(int).to_list()[1:]],
    num_samples=int(train.size),
    agg_method="mean_abs",
    save_local_shap_values=False,
)
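
The `baseline` argument above is a single row built from the per-column means of the training data, truncated to integers, with the first entry (the target) dropped. The sketch below reproduces that computation on toy rows with the standard library only. `shap_baseline` is a hypothetical helper and the numbers are made up.

```python
from statistics import mean
from typing import List, Sequence


def shap_baseline(rows: Sequence[Sequence[float]]) -> List[List[int]]:
    """Column means, truncated to int, without the first (target) column."""
    column_means = [mean(col) for col in zip(*rows)]
    # int() truncates toward zero, like astype(int) in the cell above.
    return [[int(m) for m in column_means[1:]]]


# Toy data: first column is the target, the rest are features.
toy_rows = [[0, 1.5, 10.0], [1, 2.5, 20.9]]
baseline = shap_baseline(toy_rows)
print(baseline)  # [[2, 15]]
```

The result is a list containing one row, which matches the nested-list shape passed to `SHAPConfig` above.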

Next up, fill in the blanks to define the `ModelConfig` and `ExplainabilityAnalysisConfig`.

In [None]:
# TODO
model_config = sagemaker.clarify.ModelConfig(
    model_name=xgb_predictor.endpoint_name,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    content_type="text/csv",
    accept_type="text/csv",
)
analysis_config = sagemaker.model_monitor.ExplainabilityAnalysisConfig(
        explainability_config=shap_config,
        model_config=model_config,
        headers=train.columns.to_list()[1:],
)

Before we apply our config, we need to create the monitor object. This is what we'll apply all our config to.

In [None]:
## A single monitor object (not a list), so create_monitoring_schedule can be called on it
model_explainability_monitor = sagemaker.model_monitor.ModelExplainabilityMonitor(
    role=role_arn,
    sagemaker_session=session,
    max_runtime_in_seconds=1800,
)

Everything's ready! Below, create a monitoring schedule using the configs we created. Set the schedule to run _daily_.

In [None]:
# TODO 
from sagemaker.model_monitor import CronExpressionGenerator  # type: ignore

explainability_uri = f"s3://{bucket}/model_explainability"
model_explainability_monitor.create_monitoring_schedule(
    output_s3_uri=explainability_uri,
    analysis_config=analysis_config,
    endpoint_input=xgb_predictor.endpoint_name,
    schedule_cron_expression=CronExpressionGenerator.daily(),
)

Way to go! You can check that your schedule was created by going to `Amazon SageMaker > Model dashboard > <your model> > Monitor schedule`. In this exercise you deployed a monitor for explainability to your SageMaker endpoint. This is the last exercise. You'll apply these learnings again in your Project at the end of the course.



__REMINDER:__ Don't leave your model deployed overnight. Use the code below to remove your model:

In [None]:
monitors = xgb_predictor.list_monitors()
for monitor in monitors:
    monitor.delete_monitoring_schedule()
xgb_predictor.delete_endpoint()


Deleting Monitoring Schedule with name: monitoring-schedule-2021-09-13-17-25-08-560

Deleting Monitoring Schedule with name: wine-monitoring-schedule
