# 6_aws.ipynb
### Team: *Team Yunus*
### Made By: *Yunus Eren Ertas*
## 1️⃣ Setup SageMaker Session & Imports — Technical Explanation

This block imports all necessary AWS and data-science libraries used for training an XGBoost model on SageMaker.

Modules included:

- sagemaker – top-level SDK for launching training jobs, creating models, and running batch transform.

- get_execution_role() – retrieves the IAM role assigned to the notebook/instance so SageMaker can access S3.

- TrainingInput – wrapper specifying how SageMaker should read training data (format, S3 path).

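- get_image_uri() – retrieves the XGBoost training container image URI for the selected AWS region. This helper is deprecated in SageMaker Python SDK v2 in favor of sagemaker.image_uris.retrieve(), but it still works.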

- boto3 – direct AWS SDK for interacting with S3 buckets.

- pandas / numpy – local processing utilities.

- json – useful for parameter serialization.

The next cell (described in the following section) initializes:

- session → primary SageMaker session object used for S3 uploads and job orchestration.

- bucket → your S3 bucket where data + model outputs will be stored.

- prefix → sub-folder for organizing project files.

- region → AWS region where SageMaker runs.

- role → IAM role assigned to SageMaker for permissions.

Printing the region and role confirms correct AWS configuration.

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput
from sagemaker.amazon.amazon_estimator import get_image_uri
import boto3
import pandas as pd
import numpy as np
import json


## AWS Session & Environment Setup — Technical Explanation

This block initializes the SageMaker environment and sets up all required AWS configuration values.

- session = sagemaker.Session()
Creates a SageMaker session, which manages S3 uploads, training jobs, and model storage.

- bucket = "my-cloud-ai-bucket"
Defines the S3 bucket where training data, test data, and output artifacts will be stored.

- prefix = "electricity"
Sets a subfolder inside the S3 bucket to keep all project-related files organized.

- region = session.boto_region_name
Fetches the AWS region in which the notebook and SageMaker services are running.

- role = get_execution_role()
Retrieves the IAM role that grants SageMaker permission to access S3 and run training jobs.

- Print statements
Display the detected region and IAM role, helping verify that AWS is configured correctly.

In [27]:
session = sagemaker.Session()

bucket = "my-cloud-ai-bucket"       # <-- your bucket
prefix = "electricity"              # folder
region = session.boto_region_name

role = get_execution_role()

print("Region:", region)
print("Role:", role)


Region: us-east-1
Role: arn:aws:iam::711696934160:role/LabRole


## Upload Training & Test Data to S3 — Technical Explanation

This block uploads the local training and test CSV files into your S3 bucket so SageMaker can access them during training and batch inference. Training and Batch Transform jobs run on managed instances that cannot read files from the notebook's local disk, so all data must be stored in S3 first.

- session.upload_data()
Uploads a local file to an S3 bucket under the specified key prefix (folder).

- Training dataset upload
The file electricity_train.csv is uploaded to:
s3://<bucket>/<prefix>/electricity_train.csv
This path will later be used as the training input for the XGBoost estimator.

- Test dataset upload
The file electricity_test.csv is uploaded in the same way. This raw copy is kept in S3 for reference: the Batch Transform input is a cleaned version prepared in a later step, and model evaluation reads the local file.

- Returned S3 paths
The function returns full S3 URIs, which SageMaker uses internally when launching jobs.

- Print statements
Display the final S3 locations of both files to confirm successful upload and provide visibility for later steps.

In [28]:
train_s3 = session.upload_data(
    path="electricity_train.csv",
    bucket=bucket,
    key_prefix=f"{prefix}"
)

test_s3_original = session.upload_data(
    path="electricity_test.csv",
    bucket=bucket,
    key_prefix=f"{prefix}"
)

print("Train S3:", train_s3)
print("Test S3 :", test_s3_original)


Train S3: s3://my-cloud-ai-bucket/electricity/electricity_train.csv
Test S3 : s3://my-cloud-ai-bucket/electricity/electricity_test.csv
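
The returned URIs follow a predictable pattern, which the sketch below reproduces so later cells can be checked against it. The helper name expected_upload_uri is hypothetical and not part of the SageMaker SDK. It only mirrors the bucket/prefix/filename layout visible in the output above, for a single-file upload.

```python
import posixpath


def expected_upload_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """Predict the S3 URI for a single uploaded file (hypothetical helper).

    Assumes the object key is "<key_prefix>/<file name>", as seen in the
    notebook output for session.upload_data().
    """
    # Only the file name is kept, not the local directory part.
    filename = posixpath.basename(local_path.replace("\\", "/"))
    key = posixpath.join(key_prefix, filename) if key_prefix else filename
    return f"s3://{bucket}/{key}"


print(expected_upload_uri("my-cloud-ai-bucket", "electricity", "electricity_train.csv"))
# s3://my-cloud-ai-bucket/electricity/electricity_train.csv
```

Comparing this value with the URI returned by upload_data() is a cheap way to catch a wrong prefix before a training job is launched.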


## Prepare Batch Transform Test File — Technical Explanation

This block prepares the test dataset so it can be used with Amazon SageMaker's Batch Transform service. For CSV inference, the XGBoost container expects rows with no header, no target column, and the features in the same order as in training, so the test file must be preprocessed before it is uploaded.

- Load the original test dataset
test_df = pd.read_csv("electricity_test.csv")
Reads the full test dataset, including the target column and all engineered features.

- Define the correct feature order
The model expects inputs in the same order used during training.
The feature_cols list ensures Batch Transform receives features in the correct sequence.

- Select only numeric model features
test_clean = test_df[feature_cols]
Removes the target column (demand_mw) because SageMaker Batch Transform requires only input features.

- Save the cleaned file without header or index
to_csv(..., header=False, index=False)
Batch Transform requires a pure CSV of numeric values:

- no column names

- no index

- no extra metadata

- Preview the cleaned test dataset
print(test_clean.head())
Displays the first rows to confirm correct formatting before uploading to S3.

In [29]:
# Load test
test_df = pd.read_csv("electricity_test.csv")

# Correct training feature order
feature_cols = [
    "hour", "day_of_week", "month", "week", "is_weekend",
    "lag_1", "lag_24", "lag_168", "roll_24", "roll_168"
]

# Keep numeric features only
test_clean = test_df[feature_cols]

# Save WITHOUT header / index (SageMaker requirement)
clean_test_path = "electricity_test_noml.csv"
test_clean.to_csv(clean_test_path, header=False, index=False)

print("Prepared batch transform CSV:")
print(test_clean.head())


Prepared batch transform CSV:
   hour  day_of_week  month  week  is_weekend    lag_1   lag_24  lag_168  \
0     0            0      1     1           0  19894.0  20527.0  17583.0   
1     1            0      1     1           0  19912.5  19851.5  17460.0   
2     2            0      1     1           0  19747.0  18983.0  16496.0   
3     3            0      1     1           0  18429.0  17948.5  15535.0   
4     4            0      1     1           0  17264.5  17436.5  15011.0   

        roll_24      roll_168  
0  23172.854167  22958.660714  
1  23147.250000  22972.526786  
2  23142.895833  22986.139881  
3  23119.812500  22997.645833  
4  23091.312500  23007.940476  
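
Before uploading, the cleaned file can be checked against the format rules listed above. The following is a minimal sketch using only the standard library. The function name check_batch_csv is hypothetical, and the two sample rows are copied from the preview above.

```python
import csv
import io


def check_batch_csv(text: str, n_features: int) -> int:
    """Return the row count if every row has n_features numeric fields.

    Raises ValueError on a wrong column count or a non-numeric value,
    which is what a leftover header row would trigger.
    """
    rows = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != n_features:
            raise ValueError(
                f"line {line_no}: expected {n_features} fields, got {len(row)}"
            )
        try:
            for value in row:
                float(value)
        except ValueError:
            raise ValueError(
                f"line {line_no}: non-numeric value (header row?)"
            ) from None
        rows += 1
    return rows


sample = (
    "0,0,1,1,0,19894.0,20527.0,17583.0,23172.854167,22958.660714\n"
    "1,0,1,1,0,19912.5,19851.5,17460.0,23147.25,22972.526786\n"
)
print(check_batch_csv(sample, n_features=10))  # 2
```

In the notebook this could be called as check_batch_csv(open(clean_test_path).read(), len(feature_cols)).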


## Upload Clean Test File to S3 — Technical Explanation

This block uploads the batch-ready test file to Amazon S3 so it can be used as input for the SageMaker Batch Transform job.

- session.upload_data()
Uploads the cleaned test file (electricity_test_noml.csv) to your S3 bucket under the project prefix.
SageMaker requires all Batch Transform inputs to be stored in S3.

- Why upload again?
The original test file contains:

- - the target column

- - a header row
The XGBoost container would read both as feature values, so they are not allowed in the Batch Transform input.
The cleaned version (also written without a pandas index) is uploaded instead.

- Returned S3 URI
The function returns the full S3 path (e.g.,
s3://my-cloud-ai-bucket/electricity/electricity_test_noml.csv),
which will be passed into the Transformer for inference.

- Print statement
Shows the S3 location to confirm the upload succeeded and provide a reference for later steps.

In [30]:
test_s3 = session.upload_data(
    path=clean_test_path,
    bucket=bucket,
    key_prefix=f"{prefix}"
)

print("Clean test S3:", test_s3)


Clean test S3: s3://my-cloud-ai-bucket/electricity/electricity_test_noml.csv


## Retrieve XGBoost Training Container Image — Technical Explanation

This block retrieves the correct Amazon SageMaker Docker container image for training an XGBoost model.

- get_image_uri(region, "xgboost", "1.5-1")
Requests the specific pre-built container image for:

- - your current AWS region

- - the XGBoost algorithm

- - the 1.5-1 version of the XGBoost training container

- Why this is required
SageMaker runs training jobs inside managed containers.
Each container includes:

- - the XGBoost training binary

- - dependencies

- - input/output handlers

- - logging tools

In [31]:
container = get_image_uri(region, "xgboost", "1.5-1")
container

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


'683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1'
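
The deprecation notice in the output above points to SageMaker Python SDK v2, where the same lookup is done with sagemaker.image_uris.retrieve(). The sketch below shows that call, assuming the sagemaker package is installed, and adds a small hypothetical helper, split_image_uri, that breaks the returned URI into its parts for inspection.

```python
def retrieve_xgboost_image(region: str, version: str = "1.5-1") -> str:
    """Look up the built-in XGBoost image URI with the SDK v2 API."""
    # Imported lazily so the parsing helper below works without the SDK.
    from sagemaker import image_uris

    return image_uris.retrieve(framework="xgboost", region=region, version=version)


def split_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    registry, _, rest = uri.partition("/")
    repository, _, tag = rest.partition(":")
    parts = registry.split(".")  # <account>.dkr.ecr.<region>.amazonaws.com
    return {
        "account": parts[0],
        "region": parts[3],
        "repository": repository,
        "tag": tag,
    }


info = split_image_uri(
    "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1"
)
print(info["region"], info["repository"], info["tag"])
# us-east-1 sagemaker-xgboost 1.5-1
```

In the notebook, container = retrieve_xgboost_image(region) would be a drop-in replacement for the get_image_uri() call.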

## Configure XGBoost Estimator for SageMaker — Technical Explanation

This block creates and configures the SageMaker Estimator object, which defines how the training job will run in the AWS environment.

#### Estimator configuration

- image_uri=container
Uses the XGBoost training container image retrieved earlier. This tells SageMaker which algorithm and runtime environment to use.

- role=role
IAM execution role that grants SageMaker permission to:

- - read input files from S3

- - write model artifacts back to S3

- - create training instances

- instance_count=1
Runs training on a single machine (single-node training).

- instance_type="ml.m5.xlarge"
Specifies the compute instance used for training.
ml.m5.xlarge provides 4 vCPUs and 16 GB RAM — suitable for medium-sized datasets.

- volume_size=10
Size of the attached storage (in GB) available during training for temporary files.

- output_path=f"s3://{bucket}/{prefix}/output"
Sets the S3 destination where the trained model artifact (model.tar.gz) will be saved.

- sagemaker_session=session
Links the estimator to your active SageMaker session for job orchestration.

#### Hyperparameter configuration

- objective="reg:squarederror"
Standard regression objective used for numeric prediction tasks.

- num_round=150
Number of boosting iterations (trees). More rounds increase model complexity.

- max_depth=8
Maximum depth of each decision tree — controls model flexibility.

- eta=0.1
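Learning rate: scales the contribution of each new tree. Smaller values make learning slower and more conservative, and usually need more boosting rounds.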

- subsample=0.9 and colsample_bytree=0.9
Random row/column sampling for each tree, improving generalization and reducing overfitting.

- gamma=0
Minimum loss reduction required to make a split; controls pruning.

Together with the container defaults for any parameter not set here, these hyperparameters define how the XGBoost model will be trained in the SageMaker environment.
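
To build intuition for how eta and num_round interact, here is a toy sketch. It is not XGBoost: instead of fitting trees, each round simply moves the prediction toward the target by eta times the remaining residual. The function name and the target value are illustrative assumptions.

```python
def boosted_constant_fit(target: float, eta: float, num_round: int) -> float:
    """Toy boosting loop: each round corrects a fraction eta of the residual."""
    prediction = 0.0
    for _ in range(num_round):
        residual = target - prediction
        prediction += eta * residual  # shrinkage: only part of the error is fixed
    return prediction


# After n rounds the prediction equals target * (1 - (1 - eta) ** n).
print(round(boosted_constant_fit(100.0, eta=0.1, num_round=10), 4))   # 65.1322
print(round(boosted_constant_fit(100.0, eta=0.1, num_round=150), 4))  # 100.0
```

With eta=0.1, ten rounds recover only about 65% of the target, while 150 rounds close the gap almost completely. This is why a small learning rate is typically paired with a larger number of rounds.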

In [32]:
from sagemaker.estimator import Estimator

xgb_estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size=10,
    output_path=f"s3://{bucket}/{prefix}/output",
    sagemaker_session=session,
)

xgb_estimator.set_hyperparameters(
    objective="reg:squarederror",
    num_round=150,
    max_depth=8,
    eta=0.1,
    subsample=0.9,
    colsample_bytree=0.9,
    gamma=0,
)


## Launch SageMaker Training Job — Technical Explanation

This block starts the remote XGBoost training job on Amazon SageMaker using the training dataset stored in S3.

- TrainingInput(train_s3, content_type="text/csv")
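Wraps the training dataset S3 path in a SageMaker TrainingInput object.
This tells SageMaker:

- - where the training data is stored

- - that the file format is CSV

The content type says nothing about the file layout. For CSV training input, the built-in XGBoost algorithm assumes that there is no header row and that the target variable is in the first column, so electricity_train.csv must already be in that layout before it is uploaded.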

- xgb_estimator.fit({"train": train_input})
Launches the actual SageMaker training job.
During this step, SageMaker will:

- - create a training instance

- - download the training CSV from S3

- - run XGBoost training inside the container

- - save model.tar.gz to the output S3 folder

- print("Training completed.")
Confirms that the job finished.
(The training job runs remotely, so this prints only after SageMaker returns control back to the notebook.)
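
For CSV input, the built-in XGBoost algorithm assumes a file with no header row and the target in the first column. The sketch below converts a headed CSV into that layout using only the standard library. It assumes the training file has a header containing demand_mw and the feature columns, which the notebook does not show. The function name and the sample demand values are made up for illustration.

```python
import csv
import io


def to_xgboost_training_csv(text: str, target: str, feature_cols: list) -> str:
    """Rewrite a headed CSV as headerless rows of [target, features...]."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Target first, then the features in the order used for inference.
        writer.writerow([row[target]] + [row[col] for col in feature_cols])
    return out.getvalue()


# Hypothetical two-row sample; the demand_mw values are illustrative only.
sample = "hour,lag_1,demand_mw\n0,19894.0,20000.0\n1,19912.5,19500.0\n"
print(to_xgboost_training_csv(sample, "demand_mw", ["hour", "lag_1"]))

# pandas equivalent, assuming a DataFrame train_df with the same columns:
# train_df[["demand_mw"] + feature_cols].to_csv(path, header=False, index=False)
```

Using the same feature_cols order for training and for the Batch Transform file keeps the two inputs aligned column by column.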

In [33]:
train_input = TrainingInput(train_s3, content_type="text/csv")

xgb_estimator.fit({"train": train_input})

print("Training completed.")


INFO:sagemaker.telemetry.telemetry_logging:SageMaker Python SDK will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features.
To opt out of telemetry, please disable via TelemetryOptOut parameter in SDK defaults config. For more information, refer to https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2025-11-23-19-07-20-371


2025-11-23 19:07:21 Starting - Starting the training job...
2025-11-23 19:07:36 Starting - Preparing the instances for training...
2025-11-23 19:08:24 Downloading - Downloading the training image......
  from pandas import MultiIndex, Int64Index
[2025-11-23 19:09:16.789 ip-10-0-70-181.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2025-11-23 19:09:16.810 ip-10-0-70-181.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2025-11-23:19:09:17:INFO] Imported framework sagemaker_xgboost_container.training
[2025-11-23:19:09:17:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2025-11-23:19:09:17:INFO] No GPUs detected (normal if no gpus installed)
[2025-11-23:19:09:17:INFO] Running XGBoost Sagemaker in algorithm mode
[2025-11-23:19:09:17:INFO] Determined 0 GPU(s) available on the instance.
[2025-11-23:19:0

## Retrieve Model Artifact Details — Technical Explanation

This block accesses the most recent SageMaker training job created by the XGBoost estimator and extracts the S3 location where the trained model artifact was saved.

- training_job = xgb_estimator.latest_training_job
Retrieves a handle to the last training job launched by the estimator.
This object provides metadata and status information for that specific job.

- desc = training_job.describe()
Calls the SageMaker API to obtain a full job description.
The returned dictionary includes:

- - training job configuration

- - input/output locations

- - hyperparameters

- - resource usage

- - model artifact paths

- model_artifact = desc["ModelArtifacts"]["S3ModelArtifacts"]
Navigates the job description to extract the S3 URI containing the trained model artifact.
SageMaker packages the trained model as model.tar.gz and stores it in this location.

- print("Model artifacts:", model_artifact)
Displays the extracted S3 path so it can be used in later steps such as model creation, deployment, or batch transform.

In [34]:
training_job = xgb_estimator.latest_training_job
desc = training_job.describe()

model_artifact = desc["ModelArtifacts"]["S3ModelArtifacts"]
print("Model artifacts:", model_artifact)


Model artifacts: s3://my-cloud-ai-bucket/electricity/output/sagemaker-xgboost-2025-11-23-19-07-20-371/output/model.tar.gz
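
If the artifact needs to be inspected locally, the S3 URI first has to be split into a bucket and a key. The sketch below shows one way to do that. The helper names are hypothetical, and the download step assumes boto3 is installed and the role can read the bucket.

```python
import tarfile
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split "s3://bucket/key" into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_model(uri: str, local_path: str = "model.tar.gz") -> str:
    """Download the model archive (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so split_s3_uri works without it

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


def list_model_archive(local_path: str) -> list:
    """Return the file names stored inside a downloaded model.tar.gz."""
    with tarfile.open(local_path, "r:gz") as archive:
        return archive.getnames()


bucket_name, key = split_s3_uri(
    "s3://my-cloud-ai-bucket/electricity/output/"
    "sagemaker-xgboost-2025-11-23-19-07-20-371/output/model.tar.gz"
)
print(bucket_name)  # my-cloud-ai-bucket
```

In the notebook, list_model_archive(download_model(model_artifact)) would list the archive contents.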


## Create a SageMaker Model Object — Technical Explanation

This block registers the trained model artifact with SageMaker by creating a Model object that links the trained model file, the inference container, and the execution role.

- from sagemaker.model import Model
Imports the Model class, which represents a deployable model entity inside SageMaker.

- model_name = "electricity-xgb-model"
Defines a unique name that identifies the model within the SageMaker environment.

- model = Model(...)
Constructs a SageMaker Model object with the following components:

- - name=model_name: Assigns the chosen model name.

- - model_data=model_artifact: Points to the S3 URI of the trained model.tar.gz.

- - image_uri=container: Specifies the XGBoost inference container image.

- - role=role: Provides SageMaker permission to load the model and run inference.

- model.create(instance_type="ml.m5.large")
Registers the model in SageMaker so it can be used for deployment or batch transform.
This call does not start an endpoint or any instance. The SDK uses the instance_type argument only to decide whether the image needs GPU support.

- print("Model created:", model_name)
Confirms successful model registration within the SageMaker environment.

In [35]:
from sagemaker.model import Model

model_name = "electricity-xgb-model"

model = Model(
    name=model_name,
    model_data=model_artifact,
    image_uri=container,
    role=role
)

model.create(instance_type="ml.m5.large")
print("Model created:", model_name)


INFO:sagemaker:Creating model with name: electricity-xgb-model


Model created: electricity-xgb-model


## Run SageMaker Batch Transform Job — Technical Explanation

This block creates and executes a Batch Transform job using the previously registered SageMaker model. Batch Transform performs large-scale offline inference by loading the model onto temporary compute instances and processing the test dataset stored in S3.

- from sagemaker.transformer import Transformer
Imports the Transformer class, which is used to configure and run batch inference jobs on SageMaker.

- transformer = Transformer(...)
Creates a Transformer object with the following parameters:

  - model_name=model_name: Specifies which registered SageMaker model to load for inference.

  - instance_count=1: Runs the batch job on a single compute instance.

  - instance_type="ml.m5.large": Defines the hardware used for inference.
 
  - output_path=f"s3://{bucket}/{prefix}/batch_output": Determines the S3 folder where prediction results will be stored.

- transformer.transform(...)
Launches the batch transform job and provides the job-specific configuration:

  - data=test_s3: S3 path to the cleaned test dataset.

  - content_type="text/csv": Indicates the input file format.

  - split_type="Line": Treats each line of the input file as one record, so SageMaker can split the file into smaller requests instead of sending it as a single payload.

- transformer.wait()
Blocks execution until the batch transform job completes.
During this process, SageMaker:

  - spins up an instance

  - pulls the model

  - runs inference on the dataset

  - stores predictions in the output S3 folder

- print("Batch transform done.")
Confirms that SageMaker successfully completed inference on the entire test dataset.

In [36]:
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name=model_name,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=f"s3://{bucket}/{prefix}/batch_output"
)

print("Running batch transform...")

transformer.transform(
    data=test_s3,
    content_type="text/csv",
    split_type="Line"
)

transformer.wait()
print("Batch transform done.")


INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2025-11-23-19-10-39-321


Running batch transform...
  from pandas import MultiIndex, Int64Index
[2025-11-23:19:16:14:INFO] No GPUs detected (normal if no gpus installed)
[2025-11-23:19:16:14:INFO] No GPUs detected (normal if no gpus installed)
[2025-11-23:19:16:14:INFO] nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      

## Download and Load Batch Transform Predictions — Technical Explanation

This block retrieves the prediction output generated by the SageMaker Batch Transform job, downloads it from S3, and loads it into a pandas DataFrame for inspection.

- s3 = boto3.client("s3")
  Creates a low-level S3 client using boto3, enabling direct file operations such as downloads.

- output_key = f"{prefix}/batch_output/{clean_test_path}.out"
  Constructs the exact S3 key where Batch Transform stored its output file.
  SageMaker automatically appends .out to the processed input filename.

- local_out = "predictions.csv"
  Specifies the local filename where the downloaded predictions will be saved.

- s3.download_file(bucket, output_key, local_out)
  Downloads the Batch Transform result from S3 to the local environment.
  The file contains one prediction per line, matching the order of the input test dataset.

- preds = pd.read_csv(local_out, header=None)
  Loads the downloaded predictions into a DataFrame.
  header=None is required because Batch Transform outputs raw values without header rows.

- preds.columns = ["prediction"]
  Assigns a meaningful column name to the predictions for easier interpretation and later evaluation.

- print(preds.head())
  Displays the first few predictions to verify that the Batch Transform output was loaded correctly.

In [37]:
# Download output
s3 = boto3.client("s3")

output_key = f"{prefix}/batch_output/{clean_test_path}.out"
local_out = "predictions.csv"

s3.download_file(bucket, output_key, local_out)

preds = pd.read_csv(local_out, header=None)
preds.columns = ["prediction"]

print(preds.head())


    prediction
0  1692.887939
1  1692.887939
2  1692.718262
3  1692.718262
4  1692.718262
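
The output key and the one-value-per-line file format described above can be captured in two small helpers. This is a sketch with hypothetical function names. It relies only on the ".out" naming rule and the output layout shown in this section.

```python
import posixpath


def batch_output_key(prefix: str, input_name: str, folder: str = "batch_output") -> str:
    """Build the S3 key of a Batch Transform result: "<input file name>.out"."""
    name = posixpath.basename(input_name)  # accepts a bare name or a full key
    return f"{prefix}/{folder}/{name}.out"


def parse_predictions(text: str) -> list:
    """Parse one prediction per line, skipping blank lines."""
    return [float(line) for line in text.splitlines() if line.strip()]


print(batch_output_key("electricity", "electricity_test_noml.csv"))
# electricity/batch_output/electricity_test_noml.csv.out
print(parse_predictions("1692.887939\n1692.718262\n"))
# [1692.887939, 1692.718262]
```

Comparing len(parse_predictions(...)) with the number of input rows is a quick check that no record was dropped.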


## Evaluate Predictions Using RMSE — Technical Explanation

This block loads the original test dataset containing the true target values and compares them to the predictions produced by the SageMaker Batch Transform job. The evaluation metric used here is RMSE (Root Mean Squared Error), which measures how far the predictions deviate from the actual electricity demand values.

- test_full = pd.read_csv("electricity_test.csv")
Loads the original test dataset from disk.
This dataset still contains the true demand values (demand_mw) that were removed before running Batch Transform.

- y_true = test_full["demand_mw"]
Extracts the actual target values so they can be compared against the predicted values.

- Compute RMSE
rmse = np.sqrt(np.mean((preds["prediction"] - y_true) ** 2))
RMSE penalizes large errors heavily, making it a useful metric for regression tasks.

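- rmse
Displays the computed RMSE, a quantitative measure of model performance on the test dataset.
Here the value (about 22,730) is as large as the demand itself: the lag features in the test set are roughly 15,000 to 20,000, while the first predictions are all near 1,693. An error of this size points to a data problem rather than a merely weak model. A likely cause is a training file that does not match the layout the XGBoost container expects (no header row, target in the first column).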

In [38]:
# Full test includes actual values
test_full = pd.read_csv("electricity_test.csv")
y_true = test_full["demand_mw"]

rmse = np.sqrt(np.mean((preds["prediction"] - y_true) ** 2))
rmse


22730.11164378932
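
The NumPy expression above does not raise an error when the prediction and target series differ in length. The following dependency-free sketch computes the same metric, RMSE = sqrt(mean((prediction - actual)^2)), with an explicit length check. The function name and the sample numbers are illustrative.

```python
import math


def rmse(y_true: list, y_pred: list) -> float:
    """Root mean squared error with an explicit length check."""
    if len(y_true) != len(y_pred):
        raise ValueError(
            f"length mismatch: {len(y_true)} targets vs {len(y_pred)} predictions"
        )
    if not y_true:
        raise ValueError("empty input")
    squared_errors = [(pred - true) ** 2 for true, pred in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5).
print(round(rmse([3.0, 0.0], [0.0, 4.0]), 4))  # 3.5355
```

In the notebook this could be called as rmse(list(y_true), list(preds["prediction"])).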