# Lab 4.4 - Overfitting and Underfitting

This lab simulates underfitting and overfitting by deliberately training models with poor hyperparameter choices. It reuses the preprocessed, realistic loan dataset from the Hyperparameter Tuning exercise in the previous lesson.

**Underfitting:**
-   Model is too simple 
-   High bias, low variance 
-   Poor performance on both training and testing data 
-   Example: Trying to fit a straight line to data that is clearly curved 

**Overfitting:** 
-   Model is too complex 
-   Low bias, high variance 
-   Excellent performance on training data, poor performance on new data 
-   Example: A model that memorizes the training data exactly, including noise
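
The two examples above can be reproduced locally without SageMaker. The following is a minimal sketch in plain Python, not part of the lab itself: it fits a straight line (too simple) and a 1-nearest-neighbour "memorizer" (too complex) to noisy samples of a curve. The function names, the curve `y = x^2`, and the noise level are all assumptions chosen for illustration.

```python
import math
import random


def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def fit_line(xs, ys):
    """Ordinary least-squares straight line: too simple for curved data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """1-nearest-neighbour lookup: memorizes every training point, noise included."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda point: abs(point[0] - x))[1]


def make_data(rng, n):
    """Noisy samples of y = x^2, a clearly curved relationship."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(rng, 50)
test_x, test_y = make_data(rng, 50)

results = {}
for name, fit in [("line (underfit)", fit_line), ("memorizer (overfit)", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (
        rmse([model(x) for x in train_x], train_y),
        rmse([model(x) for x in test_x], test_y),
    )
    print(f"{name}: train RMSE={results[name][0]:.2f}, test RMSE={results[name][1]:.2f}")
```

The line should show a high error on both sets (high bias), while the memorizer scores a training error of zero and a clearly worse error on the unseen points (high variance).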


(Excuse the code duplication)
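
If the duplication is a concern, one option is a small helper that builds the `create_training_job` request, since the three training jobs in this lab differ only in job name, hyperparameters, and output path. This is a sketch rather than the lab's own code: `build_training_params` and `csv_channel` are hypothetical names, and the placeholder strings in the example call stand in for the variables defined in the setup cell.

```python
def build_training_params(job_name, hyperparameters, image_uri, role_arn,
                          train_uri, validation_uri, output_uri):
    """Return a create_training_job request dict mirroring the cells in this lab."""

    def csv_channel(name, s3_uri):
        # One fully replicated, uncompressed CSV channel read from an S3 prefix.
        return {
            "ChannelName": name,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        }

    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # The API expects every hyperparameter value as a string.
        "HyperParameters": {key: str(value) for key, value in hyperparameters.items()},
        "InputDataConfig": [
            csv_channel("train", train_uri),
            csv_channel("validation", validation_uri),
        ],
        "OutputDataConfig": {"S3OutputPath": output_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# Placeholder values for illustration only; substitute the setup-cell variables.
example_params = build_training_params(
    job_name="xgb-over-under-underfit-example",
    hyperparameters={"num_round": 1, "eta": 0.2, "objective": "reg:squarederror",
                     "max_depth": 1, "subsample": 0.8, "eval_metric": "rmse"},
    image_uri="<xgboost-image-uri>",
    role_arn="<sagemaker-role-arn>",
    train_uri="s3://<bucket>/train.csv",
    validation_uri="s3://<bucket>/validate.csv",
    output_uri="s3://<bucket>/fitting-output/underfit/",
)
print(example_params["HyperParameters"])
```

With such a helper, each training cell would reduce to one call followed by `sagemaker_client.create_training_job(**params)`.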



In [1]:
import boto3
import time
from datetime import datetime
from sagemaker import Model
from sagemaker.predictor import Predictor
from sagemaker import image_uris
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Initialize Boto3 SageMaker client
sagemaker_client = boto3.client("sagemaker", region_name="us-east-1")

# SageMaker Execution Role ARN (Replace with your SageMaker role)
sagemaker_role = "arn:aws:iam::146868985163:role/service-role/AmazonSageMaker-ExecutionRole-20250307T112729"

# S3 paths/keys (Replace with actual values)
s3_bucket = "adgu-datasets"
s3_input_train = f"s3://{s3_bucket}/tuning-job-dataset/train.csv"
s3_input_validation = f"s3://{s3_bucket}/tuning-job-dataset/validate.csv"
# No trailing slash: each training cell appends "/<model>/" to this prefix
output_s3_uri_prefix = f"s3://{s3_bucket}/fitting-output"

start_time = datetime.now().isoformat()

# XGBoost Training Image URI (Region Specific)
# https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html

xgboost_image_uri = image_uris.retrieve(framework='xgboost',region='us-east-1', version='1.7-1')
print("XGBoost image uri: {}".format(xgboost_image_uri))



sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/nick/Library/Application Support/sagemaker/config.yaml


XGBoost image uri: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1



## Create a model that exhibits underfitting


In [None]:
# Underfit model
underfit_training_job_name = f"xgb-over-under-underfit-{int(time.time())}"
output_s3_uri = f"{output_s3_uri_prefix}/underfit/"

# Define Training Job Configuration
underfit_training_params = {
    "TrainingJobName": underfit_training_job_name,
    "AlgorithmSpecification": {
        "TrainingImage": xgboost_image_uri,
        "TrainingInputMode": "File",
    },
    "RoleArn": sagemaker_role,
    "HyperParameters": {
        "num_round": "1",
        "eta": "0.2",
        "objective": "reg:squarederror",
        "max_depth": "1",
        "subsample": "0.8",
        "eval_metric": "rmse",
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_train,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_validation,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        }
    ],
    "OutputDataConfig": {"S3OutputPath": output_s3_uri},
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

# Start SageMaker Training Job
print("Starting training job...")
sagemaker_client.create_training_job(**underfit_training_params)



Now create a model that addresses the underfitting by raising `num_round` (from 1 to 100) and `max_depth` (from 1 to 6). `subsample` stays at 0.8.

## Create a model that exhibits normal behavior

In [None]:
# Normal model
normal_training_job_name = f"xgb-over-under-normal-{int(time.time())}"
output_s3_uri = f"{output_s3_uri_prefix}/normal/"

# Define Training Job Configuration
normal_training_params = {
    "TrainingJobName": normal_training_job_name,
    "AlgorithmSpecification": {
        "TrainingImage": xgboost_image_uri,
        "TrainingInputMode": "File",
    },
    "RoleArn": sagemaker_role,
    "HyperParameters": {
        "num_round": "100",
        "eta": "0.2",
        "objective": "reg:squarederror",
        "max_depth": "6",
        "subsample": "0.8",
        "eval_metric": "rmse",
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_train,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_validation,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        }
    ],
    "OutputDataConfig": {"S3OutputPath": output_s3_uri},
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

# Start SageMaker Training Job
print("Starting training job...")
sagemaker_client.create_training_job(**normal_training_params)



Now create a model that exhibits overfitting by raising `num_round` to 500 and `max_depth` to 12. `subsample` again stays at 0.8.
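
To see exactly which settings separate the three jobs, the hyperparameter dictionaries from the training cells can be compared directly. The sketch below copies the values used in this lab; the names `configs` and `differing` are introduced here only for illustration.

```python
# Hyperparameters copied from the three training cells in this lab.
shared = {"eta": "0.2", "objective": "reg:squarederror",
          "subsample": "0.8", "eval_metric": "rmse"}
configs = {
    "underfit": {**shared, "num_round": "1", "max_depth": "1"},
    "normal": {**shared, "num_round": "100", "max_depth": "6"},
    "overfit": {**shared, "num_round": "500", "max_depth": "12"},
}

# Keep only the hyperparameters whose value is not identical across all jobs.
differing = {
    key: {name: hp[key] for name, hp in configs.items()}
    for key in configs["underfit"]
    if len({hp[key] for hp in configs.values()}) > 1
}

for key, values in differing.items():
    print(key, values)
```

Only the number of boosting rounds and the tree depth change between the jobs, so any difference in behavior can be attributed to model capacity rather than to the data or the learning rate.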

## Create a model that exhibits overfitting


In [None]:
# Overfit model
overfit_training_job_name = f"xgb-over-under-overfit-{int(time.time())}"
output_s3_uri = f"{output_s3_uri_prefix}/overfit/"

# Define Training Job Configuration
overfit_training_params = {
    "TrainingJobName": overfit_training_job_name,
    "AlgorithmSpecification": {
        "TrainingImage": xgboost_image_uri,
        "TrainingInputMode": "File",
    },
    "RoleArn": sagemaker_role,
    "HyperParameters": {
        "num_round": "500",
        "eta": "0.2",
        "objective": "reg:squarederror",
        "max_depth": "12",
        "subsample": "0.8",
        "eval_metric": "rmse",
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_train,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_validation,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            "ContentType": "csv",
            "CompressionType": "None",
        }
    ],
    "OutputDataConfig": {"S3OutputPath": output_s3_uri},
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

# Start SageMaker Training Job
print("Starting training job...")
sagemaker_client.create_training_job(**overfit_training_params)


## Monitor for training job completions

In [None]:
# Wait for underfit training to complete
print("Waiting for underfit training to complete...")
while True:
    underfit_response = sagemaker_client.describe_training_job(TrainingJobName=underfit_training_job_name)
    status = underfit_response["TrainingJobStatus"]
    if status in ["Completed", "Failed", "Stopped"]:
        print(f"Training Job Status: {status}")
        break
    time.sleep(10)

# Check if training was successful
if status != "Completed":
    raise Exception(f"Training failed with status: {status}")

print("\n\n ## Underfit model has completed training.")
underfit_fmdl = underfit_response["FinalMetricDataList"][0]
underfit_hp = underfit_response["HyperParameters"]

# Output hyperparameters and metrics
print("Underfit FinalMetricDataList:")
for key in underfit_fmdl:
    print(f"\t{key}: {underfit_fmdl[key]}")
    
print("Underfit Hyperparameters")
for key in underfit_hp:
    print(f"\t{key}: {underfit_hp[key]}")
    
# Wait for normal training to complete
print("Waiting for normal training to complete...")
while True:
    normal_response = sagemaker_client.describe_training_job(TrainingJobName=normal_training_job_name)
    status = normal_response["TrainingJobStatus"]
    if status in ["Completed", "Failed", "Stopped"]:
        print(f"Training Job Status: {status}")
        break
    time.sleep(10)

# Check if training was successful
if status != "Completed":
    raise Exception(f"Training failed with status: {status}")

print("\n\n ## Normal model has completed training.")
normal_fmdl = normal_response["FinalMetricDataList"][0]
normal_hp = normal_response["HyperParameters"]

# Output hyperparameters and metrics
print("Normal FinalMetricDataList:")
for key in normal_fmdl:
    print(f"\t{key}: {normal_fmdl[key]}")
    
print("Normal Hyperparameters")
for key in normal_hp:
    print(f"\t{key}: {normal_hp[key]}")
    
# Wait for overfit training to complete
print("Waiting for overfit training to complete...")
while True:
    overfit_response = sagemaker_client.describe_training_job(TrainingJobName=overfit_training_job_name)
    status = overfit_response["TrainingJobStatus"]
    if status in ["Completed", "Failed", "Stopped"]:
        print(f"Training Job Status: {status}")
        break
    time.sleep(10)

# Check if training was successful
if status != "Completed":
    raise Exception(f"Training failed with status: {status}")

print("\n\n ## Overfit model has completed training.")
overfit_fmdl = overfit_response["FinalMetricDataList"][0]
overfit_hp = overfit_response["HyperParameters"]

# Output hyperparameters and metrics
print("Overfit FinalMetricDataList:")
for key in overfit_fmdl:
    print(f"\t{key}: {overfit_fmdl[key]}")
    
print("Overfit Hyperparameters")
for key in overfit_hp:
    print(f"\t{key}: {overfit_hp[key]}")

print("\n\n ## Training jobs complete")

Wait for the previous cell to complete before continuing.
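
The three polling loops in the monitoring cell follow the same pattern, so they could be folded into one function. The sketch below is one possible shape for it, not the lab's code: `wait_for_training_job` is a hypothetical helper, and `FakeSageMakerClient` is a stand-in (not part of boto3) that lets the example run without AWS credentials.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=10):
    """Poll describe_training_job until the job reaches a terminal status."""
    while True:
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    if status != "Completed":
        raise RuntimeError(f"Training job {job_name} ended with status: {status}")
    return response


class FakeSageMakerClient:
    """Hypothetical stand-in that replays a fixed sequence of job statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_training_job(self, TrainingJobName):
        return {
            "TrainingJobName": TrainingJobName,
            "TrainingJobStatus": next(self._statuses),
        }


demo_response = wait_for_training_job(
    FakeSageMakerClient(["InProgress", "InProgress", "Completed"]),
    "demo-job",
    poll_seconds=0,
)
print(demo_response["TrainingJobStatus"])
```

In the notebook, the real client would be passed instead, for example `underfit_response = wait_for_training_job(sagemaker_client, underfit_training_job_name)`.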

## Comparisons of the models

Look for signs of **underfitting** (high RMSE on both training and validation data) and **overfitting** (low training RMSE with a noticeably higher validation RMSE).


In [None]:
print("Underfit metric: {}".format(underfit_fmdl["Value"]))
print("Normal metric: {}".format(normal_fmdl["Value"]))
print("Overfit metric: {}".format(overfit_fmdl["Value"]))
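
The cell above prints only the first entry of each `FinalMetricDataList`, which may be either the training or the validation metric. To compare both, the list can be indexed by metric name. This sketch assumes the entries are named `train:rmse` and `validation:rmse`, which should be checked against the actual `describe_training_job` output; the numbers in `sample_response` are invented for illustration and are not results from this lab.

```python
def metrics_by_name(describe_response):
    """Map each final metric name to its value."""
    return {
        metric["MetricName"]: metric["Value"]
        for metric in describe_response.get("FinalMetricDataList", [])
    }


def generalization_gap(metrics, train_key="train:rmse", validation_key="validation:rmse"):
    """Validation error minus training error; a large gap suggests overfitting."""
    return metrics[validation_key] - metrics[train_key]


# Invented values, shaped like a describe_training_job response.
sample_response = {
    "FinalMetricDataList": [
        {"MetricName": "train:rmse", "Value": 0.05},
        {"MetricName": "validation:rmse", "Value": 0.30},
    ]
}

sample_metrics = metrics_by_name(sample_response)
sample_gap = generalization_gap(sample_metrics)
print(sample_metrics)
print(f"gap: {sample_gap:.2f}")
```

Applied to `underfit_response`, `normal_response`, and `overfit_response`, a small gap with high errors points to underfitting, while a low training error with a large gap points to overfitting.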

## Create three models

In [None]:
underfit_model_artifact_s3 = underfit_response["ModelArtifacts"]["S3ModelArtifacts"]
normal_model_artifact_s3 = normal_response["ModelArtifacts"]["S3ModelArtifacts"]
overfit_model_artifact_s3 = overfit_response["ModelArtifacts"]["S3ModelArtifacts"]
print(underfit_model_artifact_s3)
print(normal_model_artifact_s3)
print(overfit_model_artifact_s3)


# Create underfit SageMaker model
underfit_model_name = "xgboost-model-underfit"
print("Creating underfit model in SageMaker...")
create_model_response = sagemaker_client.create_model(
    ModelName=underfit_model_name,
    PrimaryContainer={
        "Image": xgboost_image_uri,
        "ModelDataUrl": underfit_model_artifact_s3,
    },
    ExecutionRoleArn=sagemaker_role,
)
print(create_model_response)

# Create normal SageMaker model
normal_model_name = "xgboost-model-normal"
print("Creating normal model in SageMaker...")
create_model_response = sagemaker_client.create_model(
    ModelName=normal_model_name,
    PrimaryContainer={
        "Image": xgboost_image_uri,
        "ModelDataUrl": normal_model_artifact_s3,
    },
    ExecutionRoleArn=sagemaker_role,
)
print(create_model_response)

# Create overfit SageMaker model
overfit_model_name = "xgboost-model-overfit"
print("Creating overfit model in SageMaker...")
create_model_response = sagemaker_client.create_model(
    ModelName=overfit_model_name,
    PrimaryContainer={
        "Image": xgboost_image_uri,
        "ModelDataUrl": overfit_model_artifact_s3,
    },
    ExecutionRoleArn=sagemaker_role,
)
print(create_model_response)

## Deploy all three models

In [None]:
# Deploy all three models
# Create the endpoint configuration
underfit_endpoint_config_name = f"xgboost-endpoint-underfit-config-{int(time.time())}"
underfit_endpoint_name = f"xgboost-endpoint-underfit-{int(time.time())}"

normal_endpoint_config_name = f"xgboost-endpoint-normal-config-{int(time.time())}"
normal_endpoint_name = f"xgboost-endpoint-normal-{int(time.time())}"

overfit_endpoint_config_name = f"xgboost-endpoint-overfit-config-{int(time.time())}"
overfit_endpoint_name = f"xgboost-endpoint-overfit-{int(time.time())}"

# Create Underfit Endpoint Configuration
print("Creating Underfit endpoint configuration...")
sagemaker_client.create_endpoint_config(
    EndpointConfigName=underfit_endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "XGBoostVariant1",
            "ModelName": underfit_model_name,
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create Normal Endpoint Configuration
print("Creating Normal endpoint configuration...")
sagemaker_client.create_endpoint_config(
    EndpointConfigName=normal_endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "XGBoostVariant1",
            "ModelName": normal_model_name,
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create Overfit Endpoint Configuration
print("Creating Overfit endpoint configuration...")
sagemaker_client.create_endpoint_config(
    EndpointConfigName=overfit_endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "XGBoostVariant1",
            "ModelName": overfit_model_name,
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create Endpoint
print("Deploying underfit model as an endpoint...")
sagemaker_client.create_endpoint(
    EndpointName=underfit_endpoint_name, EndpointConfigName=underfit_endpoint_config_name
)
print("Deploying normal model as an endpoint...")
sagemaker_client.create_endpoint(
    EndpointName=normal_endpoint_name, EndpointConfigName=normal_endpoint_config_name
)
print("Deploying overfit model as an endpoint...")
sagemaker_client.create_endpoint(
    EndpointName=overfit_endpoint_name, EndpointConfigName=overfit_endpoint_config_name
)

## Wait for all three endpoints to be ready

In [None]:
# Wait for every endpoint to be ready; the next cell invokes all three,
# and the last endpoint created is not guaranteed to be the last one in service
for endpoint_name in [underfit_endpoint_name, normal_endpoint_name, overfit_endpoint_name]:
    print(f"Waiting for endpoint {endpoint_name} to be ready...")
    while True:
        response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status in ["InService", "Failed"]:
            print(f"Endpoint Status: {status}")
            break
        time.sleep(10)

    # Check if deployment was successful
    if status != "InService":
        raise Exception(f"Deployment failed with status: {status}")

    print(f"Model deployed successfully at endpoint: {endpoint_name}")


## Send the same query to each model

In [None]:
# Create a Predictor object
underfit_predictor = Predictor(
    endpoint_name=underfit_endpoint_name,
    serializer=CSVSerializer(),  # Ensures input is formatted as CSV
    deserializer=JSONDeserializer(),  # Parses JSON output
)

normal_predictor = Predictor(
    endpoint_name=normal_endpoint_name,
    serializer=CSVSerializer(),  # Ensures input is formatted as CSV
    deserializer=JSONDeserializer(),  # Parses JSON output
)

overfit_predictor = Predictor(
    endpoint_name=overfit_endpoint_name,
    serializer=CSVSerializer(),  # Ensures input is formatted as CSV
    deserializer=JSONDeserializer(),  # Parses JSON output
)

# Sample input data, first entry in the training data set.
sample_data = [[37, 2, 112082]]  

# Invoke the underfit endpoint
prediction = underfit_predictor.predict(sample_data)
print("Underfit prediction response:", prediction)

# Invoke the normal endpoint
prediction = normal_predictor.predict(sample_data)
print("Normal prediction response:", prediction)

# Invoke the overfit endpoint
prediction = overfit_predictor.predict(sample_data)
print("Overfit prediction response:", prediction)

stop_time = datetime.now().isoformat()
print(f"Start: {start_time}")
print(f"Stop: {stop_time}")

In [None]:
# A second sample input, different from the one used in the previous cell.
sample_data = [[25, 0, 138122]]  

# Invoke the underfit endpoint
prediction = underfit_predictor.predict(sample_data)
print("Underfit prediction response:", prediction)

# Invoke the normal endpoint
prediction = normal_predictor.predict(sample_data)
print("Normal prediction response:", prediction)

# Invoke the overfit endpoint
prediction = overfit_predictor.predict(sample_data)
print("Overfit prediction response:", prediction)

## Expected Results

Underfit model: uncertain  
Normal model: accurate  
Overfit model: exact  

The row in the training data set for the first query was `1, 37, 2, 112082`: the label (1) followed by the three features. The underfit model is unable to make an accurate prediction, while the overfit model produces an exact match for the training label.
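
A rough calculation suggests why the underfit model stays "uncertain". The sketch below assumes the XGBoost 1.7 defaults `base_score=0.5` and `lambda=1`, a squared-error objective with 0/1 labels, and ignores other regularization terms and row subsampling; `one_round_prediction` is a hypothetical helper, not a library function. Under those assumptions, a single round with `eta=0.2` can move a prediction only about 0.1 away from 0.5.

```python
def one_round_prediction(labels_in_leaf, eta=0.2, base_score=0.5, reg_lambda=1.0):
    """Prediction after one boosting round for samples that share a leaf.

    Assumes squared-error loss, where each sample has gradient
    (prediction - label) and hessian 1, and the defaults named above.
    """
    gradient_sum = sum(base_score - label for label in labels_in_leaf)
    hessian_sum = float(len(labels_in_leaf))
    leaf_weight = -gradient_sum / (hessian_sum + reg_lambda)
    return base_score + eta * leaf_weight


# Most favourable case: every sample in the leaf has the same label.
all_ones = one_round_prediction([1] * 100)
all_zeros = one_round_prediction([0] * 100)
print(f"leaf of all 1s: {all_ones:.3f}")
print(f"leaf of all 0s: {all_zeros:.3f}")
```

Even when a leaf contains only one class, the output stays close to 0.5, which is consistent with the underfit model's hesitant answers. The normal and overfit models apply many such corrections in sequence, so their predictions can reach the label values.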