# SageMaker

# Data Processing

## Wrangler

- **Import**
    - Connect to and import data.
    - Sources: 
        - S3 (Amazon Simple Storage Service)
        - Athena
        - Redshift
        - Databricks / JDBC (Java Database Connectivity)
        - Snowflake
    - Formats:
        - CSV (comma-separated values)
        - Parquet
        - JSON (JavaScript Object Notation)
        - ORC (Optimized Row Columnar)
- **Data Flow**
    - Create a data flow to define a series of ML data prep steps. You can use a flow to combine datasets from different data sources, identify the number and types of transformations you want to apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline.
- **Transform**
    - Clean and transform your dataset using standard transforms like string, vector, and numeric data formatting tools. Featurize your data using transforms like text and date/time embedding and categorical encoding.
- **Analyze**
    - Analyze features in your dataset at any point in your flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation.
    - Our data is too big and varied to use the Data Wrangler analysis. We'll need to do this step separately using technology designed to handle big data.
- **Export**
    - Data Wrangler offers export options to other SageMaker services, including Data Wrangler jobs, Amazon SageMaker Feature Store, and pipelines, giving you the ability to integrate your data prep flow into your ML workflow. You can also export your Data Wrangler flow to Python code.
    
AWS recommends using PySpark for datasets over 2 GB. The Parquet file format is designed to work well with Spark. Pickle is not natively supported.
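
As a rough pre-flight check based on the notes above, a sketch like the following flags files that Data Wrangler cannot import as-is and datasets large enough that PySpark is the suggested route. The helper name `preflight`, the extension mapping, and reading "2 GB" as 2 × 1024³ bytes are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Formats listed above as importable by Data Wrangler.
SUPPORTED_EXTENSIONS = {
    ".csv": "CSV",
    ".parquet": "Parquet",
    ".json": "JSON",
    ".orc": "ORC",
}

# Assumption: "2 GB" is interpreted as 2 * 1024**3 bytes.
PYSPARK_THRESHOLD_BYTES = 2 * 1024**3


def preflight(path: str, size_bytes: int) -> dict:
    """Summarize how a file fits the guidance in these notes (hypothetical helper)."""
    suffix = PurePosixPath(path).suffix.lower()
    return {
        "format": SUPPORTED_EXTENSIONS.get(suffix),  # None if not importable as-is
        "importable": suffix in SUPPORTED_EXTENSIONS,
        "use_pyspark": size_bytes > PYSPARK_THRESHOLD_BYTES,
    }


# Hypothetical object keys and sizes.
print(preflight("raw/events.parquet", 5 * 1024**3))
print(preflight("raw/model_inputs.pkl", 10_000))
```

A pickle file would need to be converted (for example to Parquet) before import.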

### Resources

- [Prepare and Analyze Datasets](https://docs.aws.amazon.com/sagemaker/latest/dg/data-prep.html)
- [Prepare ML Data with Amazon SageMaker Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html)
- [Data Wrangler - Import](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-import.html)

## Feature Store

The processing logic for the data is authored only once, and features generated are used for both training and inference, reducing the training-serving skew. Feature Store is a centralized store for features and associated metadata so features can be easily discovered and reused. You can create an online or an offline store. The online store is used for low latency real-time inference use cases, and the offline store is used for training and batch inference.

Online store is primarily designed for supporting real-time predictions that need low millisecond latency reads and high throughput writes. Offline store is primarily intended for batch predictions and model training. Offline store is an append only store and can be used to store and access historical feature data. The offline store can help you store and serve features for exploration and model training. The online store retains only the latest feature data. Feature group definitions are immutable after they are created.

- **Online** – In online mode, features are read with low (millisecond) latency and used for high-throughput predictions. This mode requires a feature group to be stored in an online store.
- **Offline** – In offline mode, large streams of data are fed to an offline store, which can be used for training and batch inference. This mode requires a feature group to be stored in an offline store. The offline store uses your S3 bucket for storage and can also fetch data using Athena queries. 
- **Online and Offline** – This includes both online and offline modes.

To ingest features into Feature Store, you must first define the feature group and the feature definitions (feature name and data type) for all features that belong to the feature group. After they are created, feature groups are immutable. Feature group names are unique within an AWS Region and AWS account. When creating a feature group, you can also create the metadata for the feature group, such as a short description, storage configuration, features for identifying each record, and the event time, as well as tags to store information such as the author, data source, version, and more.
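
The pieces described above (feature definitions, record identifier, event time, storage configuration, tags) map onto the parameters of the `CreateFeatureGroup` API. The following is a minimal sketch that only builds the request; the helper name, feature names, role ARN, and bucket are hypothetical. The resulting dictionary could be passed to `boto3.client("sagemaker").create_feature_group(**request)`.

```python
ALLOWED_TYPES = {"String", "Integral", "Fractional"}


def build_feature_group_request(name, feature_types, record_id, event_time,
                                role_arn, offline_s3_uri=None, online=True,
                                description=None, tags=None):
    """Build keyword arguments for CreateFeatureGroup (hypothetical helper)."""
    for required in (record_id, event_time):
        if required not in feature_types:
            raise ValueError(f"{required!r} must be one of the defined features")
    bad = {t for t in feature_types.values() if t not in ALLOWED_TYPES}
    if bad:
        raise ValueError(f"unsupported feature types: {sorted(bad)}")

    request = {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": record_id,
        "EventTimeFeatureName": event_time,
        "FeatureDefinitions": [
            {"FeatureName": n, "FeatureType": t} for n, t in feature_types.items()
        ],
        "RoleArn": role_arn,
    }
    if online:
        request["OnlineStoreConfig"] = {"EnableOnlineStore": True}
    if offline_s3_uri:
        request["OfflineStoreConfig"] = {"S3StorageConfig": {"S3Uri": offline_s3_uri}}
    if description:
        request["Description"] = description
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


# All values below are placeholders.
request = build_feature_group_request(
    name="transactions",
    feature_types={"transaction_id": "String", "amount": "Fractional",
                   "event_time": "Fractional"},
    record_id="transaction_id",
    event_time="event_time",
    role_arn="arn:aws:iam::123456789012:role/ExampleFeatureStoreRole",
    offline_s3_uri="s3://example-bucket/feature-store",
    tags={"author": "data-team", "version": "1"},
)
print(request)
```

Because feature groups are immutable once created, validating the definitions before the call is cheaper than recreating the group.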

The offline store is an append-only store, enabling Feature Store to maintain a historical record of all feature values. Data is stored in the offline store in Parquet format for optimized storage and query access.

Feature Store supports combining data to produce, train, validate, and test data sets, and allows you to extract data at different points in time.
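
To illustrate what extracting data "at a point in time" means for an append-only history, here is a plain-Python sketch. It is not the Feature Store API; the field names `customer_id` and `event_time` and the sample rows are made up.

```python
from datetime import datetime


def as_of(history, cutoff, key="customer_id", time_field="event_time"):
    """Return, per record id, the latest row whose event time is <= cutoff."""
    latest = {}
    for row in history:
        if row[time_field] > cutoff:
            continue  # this value did not exist yet at the cutoff
        seen = latest.get(row[key])
        if seen is None or row[time_field] > seen[time_field]:
            latest[row[key]] = row
    return latest


# Append-only history: newer values are added, older ones are never overwritten.
history = [
    {"customer_id": "c1", "event_time": datetime(2023, 1, 1), "avg_spend": 10.0},
    {"customer_id": "c1", "event_time": datetime(2023, 3, 1), "avg_spend": 25.0},
    {"customer_id": "c2", "event_time": datetime(2023, 2, 1), "avg_spend": 7.5},
]

snapshot = as_of(history, datetime(2023, 2, 15))
print({k: v["avg_spend"] for k, v in snapshot.items()})
```

Choosing the cutoff at or before each training label's timestamp keeps future feature values out of the training set.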

Supported data types are String, Integral, and Fractional.
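
One way to derive feature definitions from a sample record is to map Python types onto these three feature types. This is a sketch under assumptions: the helper names are hypothetical, and storing booleans as Integral (0/1) is a choice made here, not something the notes prescribe.

```python
def infer_feature_type(value):
    """Map a Python value to one of the three Feature Store types."""
    if isinstance(value, bool):
        return "Integral"  # assumption: booleans are stored as 0/1
    if isinstance(value, int):
        return "Integral"
    if isinstance(value, float):
        return "Fractional"
    if isinstance(value, str):
        return "String"
    raise TypeError(f"no Feature Store type for {type(value).__name__}")


def infer_feature_definitions(sample):
    """Turn one sample record (a dict) into a FeatureDefinitions list."""
    return [
        {"FeatureName": name, "FeatureType": infer_feature_type(value)}
        for name, value in sample.items()
    ]


# Made-up sample record.
sample = {"transaction_id": "t-001", "item_count": 3, "amount": 19.99}
print(infer_feature_definitions(sample))
```

Values of other types (dates, lists, nested objects) have to be converted first, typically to a string representation.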

Feature Store automatically builds an AWS Glue Data Catalog when you create feature groups; you can turn this off if you want. With the catalog in place, you can build a single training dataset by running an Amazon Athena query that joins offline-store data from several feature groups, for example an identity feature group and a transaction feature group.

Feature Store offers a single API call for data ingestion called PutRecord that enables you to ingest data in batches or from streaming sources. You can use Amazon SageMaker Data Wrangler to engineer features and then ingest your features into your Feature Store. You can also use Amazon EMR for batch data ingestion through a Spark connector.
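
`PutRecord` takes one record at a time as a list of `FeatureName` / `ValueAsString` pairs. The sketch below shows the conversion and a simple ingestion loop. The client is passed in so the example runs without AWS access: `RecordingClient` is a stand-in written for this example, and in practice the client would be `boto3.client("sagemaker-featurestore-runtime")`. The feature group name and rows are hypothetical.

```python
def to_record(row):
    """Convert a dict into the list-of-features shape PutRecord expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # skip missing values instead of writing "None"
    ]


def ingest(client, feature_group_name, rows):
    """Send each row with one PutRecord call; returns the number of calls made."""
    count = 0
    for row in rows:
        client.put_record(FeatureGroupName=feature_group_name, Record=to_record(row))
        count += 1
    return count


class RecordingClient:
    """Stand-in for the runtime client; it only remembers the calls it received."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
rows = [
    {"transaction_id": "t-001", "amount": 19.99, "event_time": 1672531200.0},
    {"transaction_id": "t-002", "amount": None, "event_time": 1672531260.0},
]
print(ingest(client, "transactions", rows), "records sent")
```

For large backfills, the Spark connector mentioned above is usually a better fit than a record-by-record loop.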

After the feature group has been created, you can also select and join data across multiple feature groups to create new engineered features in Data Wrangler and then export your data set to an S3 bucket.

Amazon SageMaker Feature Store supports batch data ingestion with Spark, using your existing ETL pipeline or a pipeline on Amazon EMR. You can also use this functionality from an Amazon SageMaker notebook instance. Python developers can use the `sagemaker-feature-store-pyspark` Python library for local development, install it on Amazon EMR, or run it from Jupyter notebooks.

After your Feature Store has been created and populated with data in the offline store, you can write SQL queries that join offline-store data from different feature groups. To do this, use Amazon Athena to write and execute the queries. You can also set up an AWS Glue crawler to run on a schedule so that your catalog stays up to date.
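
A join across two feature groups can be sketched as a query string built in Python. The database, table, and column names below are hypothetical; the real Glue table names are generated when the feature groups are created and should be looked up in the catalog. The string could then be submitted with `boto3.client("athena").start_query_execution(...)`.

```python
def build_join_query(database, left_table, right_table, key, left_cols, right_cols):
    """Build an Athena SQL query joining two offline-store tables on a shared key.

    Identifiers are interpolated directly, so they must come from trusted
    configuration, never from user input.
    """
    columns = [f"l.{key}"]
    columns += [f"l.{c}" for c in left_cols]
    columns += [f"r.{c}" for c in right_cols]
    return (
        f"SELECT {', '.join(columns)}\n"
        f'FROM "{database}"."{left_table}" AS l\n'
        f'JOIN "{database}"."{right_table}" AS r\n'
        f"  ON l.{key} = r.{key}"
    )


# Hypothetical names for an identity and a transaction feature group.
query = build_join_query(
    database="sagemaker_featurestore",
    left_table="identity_fg",
    right_table="transaction_fg",
    key="customer_id",
    left_cols=["account_age_days"],
    right_cols=["amount", "event_time"],
)
print(query)
```

Because the offline store is append-only, a join like this returns every historical row; filtering on event time is needed to get one row per record.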

#### Limits

- Maximum number of feature groups per AWS account: Soft limit of 100.
- Maximum number of feature definitions per feature group: 2500.
- Maximum Transactions per second (TPS) per API per AWS account: Soft limit of 10000 TPS per API excluding the BatchGetRecord API call, which has a soft limit of 500 TPS.
- Maximum size of a record: 350KB.
- Maximum size of a record identifier: 2KB.
- Maximum size of a feature value: 350KB.
- Maximum number of concurrent feature group creation workflows: 4.
- BatchGetRecord API: Can contain as many as 100 records and can query up to 10 feature groups.
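
The `BatchGetRecord` limits above mean larger lookups have to be split into several calls. The following sketch groups record identifiers into requests that stay within both limits. It assumes the 100-record limit applies to the whole call; the helper names and feature group names are hypothetical. Each yielded list has the shape of the API's `Identifiers` parameter.

```python
MAX_RECORDS_PER_CALL = 100        # from the limits above
MAX_FEATURE_GROUPS_PER_CALL = 10  # from the limits above


def _to_identifiers(batch):
    return [
        {"FeatureGroupName": group, "RecordIdentifiersValueAsString": ids}
        for group, ids in batch.items()
    ]


def batch_get_requests(ids_by_group):
    """Yield Identifiers lists, each within the record and feature-group limits."""
    batch, count = {}, 0
    for group, ids in ids_by_group.items():
        for record_id in ids:
            batch_full = count == MAX_RECORDS_PER_CALL
            too_many_groups = (
                group not in batch and len(batch) == MAX_FEATURE_GROUPS_PER_CALL
            )
            if batch_full or too_many_groups:
                yield _to_identifiers(batch)
                batch, count = {}, 0
            batch.setdefault(group, []).append(str(record_id))
            count += 1
    if batch:
        yield _to_identifiers(batch)


requests = list(batch_get_requests({"identity": range(150), "transactions": range(30)}))
for identifiers in requests:
    sizes = {i["FeatureGroupName"]: len(i["RecordIdentifiersValueAsString"])
             for i in identifiers}
    print(sizes)
```

Each list can be passed as `Identifiers=` to `batch_get_record` on the `sagemaker-featurestore-runtime` client, keeping the lower TPS limit for that API in mind.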

## Autopilot

- Autopilot is capable of handling datasets up to 5 GB.
- Autopilot supports only tabular datasets in CSV format. Either all files should have a header row, or the first file of the dataset, when sorted in alphabetical/lexical order, is expected to have a header row.
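
The two constraints above can be checked before starting a job. This is a sketch with a hypothetical helper; it assumes "5 GB" means 5 × 1024³ bytes and that lexical order matches Python's default string ordering (which sorts uppercase before lowercase).

```python
MAX_DATASET_BYTES = 5 * 1024**3  # assumption: 5 GB read as 5 * 1024**3 bytes


def check_autopilot_input(files):
    """Check a dataset described as {file name: size in bytes}.

    Returns the file that is expected to carry the header row (the first one
    in lexical order) and a list of problems found.
    """
    problems = []
    non_csv = sorted(name for name in files if not name.lower().endswith(".csv"))
    if non_csv:
        problems.append(f"not CSV: {', '.join(non_csv)}")
    total = sum(files.values())
    if total > MAX_DATASET_BYTES:
        problems.append(f"dataset is {total} bytes, over the 5 GB limit")
    header_file = min(files) if files else None
    return {"header_file": header_file, "problems": problems}


# Hypothetical file listing.
print(check_autopilot_input({"part-b.csv": 10_000, "part-a.csv": 20_000}))
```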

### Run / Read

- [SageMaker Immersion Day](https://catalog.us-east-1.prod.workshops.aws/workshops/63069e26-921c-4ce1-9cc7-dd882ff62575/en-US/lab2)
- [aws/amazon-sagemaker-examples repo](https://github.com/aws/amazon-sagemaker-examples)
- [Amazon SageMaker Example Notebooks, readthedocs](https://sagemaker-examples.readthedocs.io/en/latest/)
- [Amazon SageMaker Python SDK, readthedocs](https://sagemaker.readthedocs.io/en/stable/)
- [Data Processing with Spark, readthedocs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html#data-processing-with-spark)
- [Batch Ingestion Spark Connector Setup](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-ingestion-spark-connector-setup.html)
- [Data Processing with Apache Spark, AWS docs](https://docs.aws.amazon.com/sagemaker/latest/dg/use-spark-processing-container.html)
- [AWS Modernization Week](https://onlinexperiences.com/scripts/Server.nxp)
- [Prepare Data at Scale with Studio Notebooks, AWS docs / EMR and Spark](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-emr-cluster.html)
- [Perform interactive data engineering and data science workflows from Amazon SageMaker Studio notebooks, EMR and Spark](https://aws.amazon.com/blogs/machine-learning/perform-interactive-data-engineering-and-data-science-workflows-from-amazon-sagemaker-studio-notebooks/)
- [sagemaker-feature-store-pyspark Python library](https://pypi.org/project/sagemaker-feature-store-pyspark/)
- [aws/sagemaker-spark, github repo](https://github.com/aws/sagemaker-spark#getting-sagemaker-spark)
- [Amazon SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)

## Clarify

- Detect data biases
- Can be done:
    - before training
    - after training
    - after deployment
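
To make "data bias" concrete, here is a plain-Python sketch of two pretraining metrics that Clarify reports: class imbalance, CI = (n_a - n_d) / (n_a + n_d), and difference in proportions of labels, DPL = q_a - q_d. This is not the Clarify API; the function names and the sample data are made up, and the facet is assumed to be binary.

```python
def class_imbalance(facets, disadvantaged):
    """CI = (n_a - n_d) / (n_a + n_d); ranges from -1 to 1, 0 means balanced."""
    n_d = sum(1 for f in facets if f == disadvantaged)
    n_a = len(facets) - n_d
    if n_a + n_d == 0:
        raise ValueError("no rows")
    return (n_a - n_d) / (n_a + n_d)


def difference_in_label_proportions(facets, labels, disadvantaged, positive=1):
    """DPL = q_a - q_d, where q is the share of positive labels within a facet."""
    pos_a = pos_d = n_a = n_d = 0
    for facet, label in zip(facets, labels):
        if facet == disadvantaged:
            n_d += 1
            pos_d += label == positive
        else:
            n_a += 1
            pos_a += label == positive
    if n_a == 0 or n_d == 0:
        raise ValueError("both facets need at least one row")
    return pos_a / n_a - pos_d / n_d


# Made-up data: 6 rows in facet "a", 4 rows in facet "d".
facets = ["a"] * 6 + ["d"] * 4
labels = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

print(class_imbalance(facets, "d"))
print(difference_in_label_proportions(facets, labels, "d"))
```

Values near zero suggest balance on that metric; which facet counts as disadvantaged is a modelling decision, not something the metric determines.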

### Resources

- [Detect Pretraining Data Bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-data-bias.html)
- [Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/fairness_and_explainability/fairness_and_explainability.html)
- [Measure Pretraining Bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html)

# Pipelines

## Permissions

The role for the SageMaker instance that is creating the pipeline must have the `iam:PassRole` permission for the pipeline execution role in order to pass it.

Your pipeline execution role requires the following permissions:

- To pass any role to a SageMaker job within a pipeline, the `iam:PassRole` permission for the role that is being passed. 
- `Create` and `Describe` permissions for each of the job types in the pipeline.
- Amazon S3 permissions to use the `JsonGet` function. You control access to your Amazon S3 resources using resource-based policies and identity-based policies. A resource-based policy is applied to your Amazon S3 bucket and grants SageMaker Pipelines access to the bucket. An identity-based policy gives your pipeline the ability to make Amazon S3 calls from your account. For more information on resource-based policies and identity-based policies, see [Identity-based policies and resource-based policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_identity-vs-resource.html).

Example identity-based policy statement:

```json
{
    "Action": [
        "s3:GetObject",
        "s3:HeadObject"
    ],
    "Resource": "arn:aws:s3:::<your-bucket-name>/*",
    "Effect": "Allow"
}
```
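
The three permission groups listed above can be assembled into one policy document. The sketch below builds it as a Python dictionary and prints it as JSON. The helper name, bucket name, role ARN, and the chosen job actions are placeholders, and `"Resource": "*"` on the job actions is a simplification that should be narrowed in practice.

```python
import json


def pipeline_execution_policy(bucket_name, passed_role_arns, job_actions):
    """Build an identity-based policy for a pipeline execution role (sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # S3 read access, mirroring the example statement above
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:HeadObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {   # allow passing specific roles to jobs in the pipeline
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": list(passed_role_arns),
            },
            {   # Create/Describe permissions for each job type in the pipeline
                "Effect": "Allow",
                "Action": sorted(job_actions),
                "Resource": "*",  # simplification: scope this down in practice
            },
        ],
    }


# Placeholder values for a pipeline with processing and training steps.
policy = pipeline_execution_policy(
    bucket_name="example-pipeline-bucket",
    passed_role_arns=["arn:aws:iam::123456789012:role/ExampleJobRole"],
    job_actions={
        "sagemaker:CreateProcessingJob",
        "sagemaker:DescribeProcessingJob",
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
    },
)
print(json.dumps(policy, indent=4))
```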

### Resources

- [Access Management - Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-access.html)

## Step Types

Amazon SageMaker Model Building Pipelines support the following step types:

- Processing
- Training
- Tuning
- CreateModel
- RegisterModel
- Transform
- Condition
- Callback
- Lambda
- ClarifyCheck
- QualityCheck
- EMR
- Fail
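
A pipeline is a set of typed steps connected by dependencies. As a plain-Python illustration (not the SageMaker SDK), the sketch below checks each step's type against the list above and derives an execution order from the dependencies. The step names and the dictionary layout with `Name`, `Type`, and `DependsOn` keys are hypothetical. It requires Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Step types from the list above.
SUPPORTED_STEP_TYPES = {
    "Processing", "Training", "Tuning", "CreateModel", "RegisterModel",
    "Transform", "Condition", "Callback", "Lambda", "ClarifyCheck",
    "QualityCheck", "EMR", "Fail",
}


def execution_order(steps):
    """Validate step types and return step names in dependency order."""
    names = {step["Name"] for step in steps}
    graph = {}
    for step in steps:
        if step["Type"] not in SUPPORTED_STEP_TYPES:
            raise ValueError(f"unsupported step type: {step['Type']}")
        depends_on = set(step.get("DependsOn", []))
        unknown = depends_on - names
        if unknown:
            raise ValueError(f"{step['Name']} depends on unknown steps: {sorted(unknown)}")
        graph[step["Name"]] = depends_on
    # TopologicalSorter raises CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical four-step pipeline.
steps = [
    {"Name": "BatchTransform", "Type": "Transform", "DependsOn": ["CreateModel"]},
    {"Name": "CreateModel", "Type": "CreateModel", "DependsOn": ["Train"]},
    {"Name": "Train", "Type": "Training", "DependsOn": ["Preprocess"]},
    {"Name": "Preprocess", "Type": "Processing"},
]
print(execution_order(steps))
```

In the SDK, most dependencies are inferred from one step's outputs being used as another step's inputs rather than declared by hand.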

### Resources

- [Pipeline Steps - Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html)
- [Pipelines - Read the Docs](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html)
- [Define a Pipeline - Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html)
- [Orchestrating Jobs with Amazon SageMaker Model Building Pipelines](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.html)