<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/MLOps/MLPipelineSagemaker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated Machine Learning Pipeline with Amazon SageMaker

## Overview

This notebook demonstrates how to build an automated ML pipeline using Amazon SageMaker. We'll showcase:
- Automated data processing using SageMaker Processing Jobs
- Model training using SageMaker Training Jobs
- Model deployment using SageMaker Endpoints
- Automated monitoring using Model Monitor
- Automated retraining using SageMaker Pipelines

![ML Pipeline](https://github.com/calmrocks/master-machine-learning-engineer/blob/main/MLOps/Diagrams/MLPipeline.png?raw=1)

## Introduction

This notebook implements an Amazon SageMaker ML pipeline from Google Colab. Colab does not come with an AWS identity, so before we build the pipeline we need to set up AWS credentials.

### Credential Setup Method
We'll use interactive forms to securely input AWS credentials. This method:
- Keeps credentials temporary (only for current session)
- Avoids storing sensitive information in the notebook
- Uses password fields to hide sensitive input
- Clears the form after credentials are set

### Required AWS Information
You'll need the following information ready:
1. **AWS Access Key ID**: Your AWS account access key
2. **AWS Secret Access Key**: Your AWS account secret key
3. **AWS Region**: The AWS region you want to work in (e.g., 'us-east-1')
4. **S3 Bucket**: The name of your S3 bucket for storing pipeline artifacts
5. **Role ARN**: The Amazon Resource Name of your IAM role with SageMaker permissions
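
Before sending these values to AWS, it can help to catch obvious typos locally. The following is a minimal sketch, not part of boto3 or the SageMaker SDK: `validate_inputs` and the three patterns are hypothetical helpers that only check the general shape of a region name, an S3 bucket name, and an IAM role ARN. Passing these checks does not prove that the resources exist.

```python
import re

# Hypothetical shape checks (assumptions for illustration, not an AWS API).
ROLE_ARN_RE = re.compile(r"^arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+$")
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")
REGION_RE = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d$")


def validate_inputs(region, bucket, role_arn):
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    if not REGION_RE.match(region):
        problems.append(f"Region does not look like 'us-east-1': {region!r}")
    if not BUCKET_RE.match(bucket):
        problems.append(
            f"Bucket name should be 3-63 lowercase letters, digits, dots, or hyphens: {bucket!r}"
        )
    if not ROLE_ARN_RE.match(role_arn):
        problems.append(
            f"Role ARN should look like arn:aws:iam::<account-id>:role/<name>: {role_arn!r}"
        )
    return problems
```

For example, `validate_inputs("us-east-1", "my-ml-bucket", "arn:aws:iam::123456789012:role/SageMakerRole")` returns an empty list, while a bucket name containing uppercase letters or underscores is reported as a problem.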

### Prerequisites
Make sure you have:
- An active AWS account
- IAM user with appropriate permissions
- S3 bucket created
- IAM role configured for SageMaker

Run the following cells to install the required packages:


In [None]:
!pip install boto3 sagemaker ipywidgets

In [None]:
!pip install s3fs

## Getting AWS Credentials

There are several ways to obtain and use AWS credentials depending on your setup:

### If Using Amazon SageMaker Notebook Instance

If you're running this notebook in a SageMaker notebook instance, you can leverage the instance's built-in credentials:

```python
import sagemaker
import boto3

# Get the default SageMaker session
sagemaker_session = sagemaker.Session()

# Get the role ARN
role = sagemaker.get_execution_role()

# Get the default bucket
default_bucket = sagemaker_session.default_bucket()

# Get the boto3 session
session = boto3.Session()

# Print details
print(f"Role ARN: {role}")
print(f"Default bucket: {default_bucket}")
```

### If Using Google Colab or Another Environment

If you're running this notebook outside SageMaker, enter your credentials manually in the following cell:

In [None]:
import os
import boto3
import sagemaker
import logging
from getpass import getpass
from IPython.display import clear_output

# Prompt for credentials; getpass hides the secret values as they are typed
access_key = input("AWS Access Key ID: ").strip()
secret_key = getpass("AWS Secret Access Key: ").strip()
session_token = getpass("AWS Session Token (press Enter if none): ").strip() or None
region = input("AWS Region (default: us-east-1): ").strip() or "us-east-1"
bucket = input("S3 Bucket Name: ").strip()
role_arn = input("Role ARN: ").strip()

# Build a boto3 session that holds the credentials for this notebook session only
session = boto3.Session(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    aws_session_token=session_token,
    region_name=region
)

# Clear the prompts from the cell output, then print a summary without secrets
clear_output()
print("Credentials set:")
print(f"Access Key: {access_key[:4]}...{access_key[-4:]}")
print(f"Region: {region}")
print(f"Bucket: {bucket}")
print(f"Role ARN: {role_arn}\n")

# Verify that the bucket exists and is reachable with these credentials
s3 = session.client('s3')
bucket_name = bucket
try:
    s3.head_bucket(Bucket=bucket_name)
    print(f"✓ Successfully accessed S3 bucket: {bucket_name}")
except Exception as e:
    print(f"❌ Error accessing S3 bucket: {str(e)}")
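
The `boto3.Session` above is only visible to clients created from it. Libraries that build their own clients (for example, s3fs behind pandas `s3://` paths) resolve credentials through the default AWS credential chain, which includes environment variables. One possible approach, sketched below, is to copy the values into the standard AWS environment variables for the current process. The helper name `export_aws_credentials` and its `environ` parameter are assumptions made for illustration.

```python
import os


def export_aws_credentials(access_key, secret_key, region,
                           session_token=None, environ=os.environ):
    """Copy credentials into the standard AWS environment variables.

    The values live only in this process, so they disappear when the
    notebook runtime is restarted.
    """
    environ["AWS_ACCESS_KEY_ID"] = access_key
    environ["AWS_SECRET_ACCESS_KEY"] = secret_key
    environ["AWS_DEFAULT_REGION"] = region
    if session_token:
        environ["AWS_SESSION_TOKEN"] = session_token
    else:
        # Remove a stale token so it is not paired with a different key
        environ.pop("AWS_SESSION_TOKEN", None)
    return environ
```

In this notebook the call would be `export_aws_credentials(access_key, secret_key, region, session_token)`. Passing a plain `dict` as `environ` lets you inspect the result without touching the real environment.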


# Wine Quality ML Pipeline with Amazon SageMaker

This notebook demonstrates how to build an end-to-end machine learning pipeline using Amazon SageMaker. We'll use the Wine Quality dataset to showcase:
- Data preprocessing
- Model training
- Model evaluation
- Automated retraining
- Model monitoring

The pipeline will automatically handle data preprocessing, model training, and evaluation, making it easy to retrain models when new data arrives.

## Download and Prepare Initial Dataset

First, let's download the Wine Quality dataset and upload it to our S3 bucket. We'll use this as our initial training data.

In [6]:
import pandas as pd
from datetime import datetime

# Download wine quality dataset
wine_data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',
    sep=';'
)

# Create a timestamp for versioning
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

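# Upload to S3. pandas writes s3:// paths through s3fs, which does not see the
# boto3 session created earlier, so the credentials are passed explicitly.
initial_data_path = f"s3://{bucket}/wine-quality/data/{timestamp}/winequality.csv"
wine_data.to_csv(
    initial_data_path,
    index=False,
    storage_options={"key": access_key, "secret": secret_key, "token": session_token},
)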

print(f"Data uploaded to: {initial_data_path}")
print(f"Dataset shape: {wine_data.shape}")
print("\nFeatures:")
print(wine_data.columns.tolist())
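
The timestamped path used above gives each upload its own version folder. As a rough sketch of how that convention could be reused when new data arrives, the hypothetical helpers below (`versioned_s3_uri` and `latest_version` are illustrative names, not SageMaker APIs) build the same path layout and pick the newest version from a list of URIs. They rely on the `%Y%m%d_%H%M%S` format sorting lexicographically in chronological order.

```python
from datetime import datetime


def versioned_s3_uri(bucket, when=None,
                     prefix="wine-quality/data", filename="winequality.csv"):
    """Build s3://<bucket>/<prefix>/<YYYYmmdd_HHMMSS>/<filename>."""
    when = when or datetime.now()
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"s3://{bucket}/{prefix}/{stamp}/{filename}"


def latest_version(uris):
    """Return the URI whose timestamp folder is the most recent."""
    # The timestamp folder is the second-to-last path component
    return max(uris, key=lambda uri: uri.rstrip("/").split("/")[-2])
```

For example, `versioned_s3_uri("my-bucket", datetime(2024, 1, 5, 9, 30, 0))` produces `s3://my-bucket/wine-quality/data/20240105_093000/winequality.csv`.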