# Leveraging Lakehouse data with Amazon SageMaker XGBoost and AutoML
_**Supervised learning with MLFlow logging of experiments**_

---

---

## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Data Preparation](#DataPreparation)
1. [Training XGBoost](#XGBoost)
1. [Training AutoML](#AutoML)
1. [Deployment and inference test](#Deployment_and_inference_test)
1. [Evaluation](#Evaluation)

---

## Background
One of the key advantages of the new SageMaker AI Unified Studio is its ability to integrate data from multiple sources. In this notebook, we'll walk through an example of bringing data from a Lakehouse to train models using XGBoost and AutoML. We'll also leverage the power of MLFlow servers to capture and analyze the training data.

This notebook demonstrates how to predict a customer's purchase potential based on a set of features. We'll go through the following steps:

* Setting up your Amazon SageMaker AI notebook
* Querying data sources using Athena
* Transforming the data to feed into Amazon SageMaker algorithms
* Training a model using the Gradient Boosting algorithm (XGBoost)
* Launching an AutoML task to target the same feature
* Utilizing MLFlow to capture and visualize experiment data

---

## Preparation

Let's start by bringing in the Python libraries that we'll use throughout the notebook:

In [None]:
import boto3
import pandas as pd
import numpy as np
import logging
import sagemaker
import mlflow
import os
from datetime import datetime, timezone
from sagemaker.modules import Session
from sagemaker_studio import Project

Now, let's set up our logging and specify the necessary configurations:

1. Configure the logging we'll use, including the ARN of the MLFlow server we've set up in the prerequisite
2. Specify the S3 bucket and prefix for storing training and model data
3. Set up the IAM role ARN to provide necessary permissions for training and hosting

In [None]:
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

### Copy MLFlow Tracking Server ARN

Copy and paste the ARN of your project's MLFlow Tracking Server. To find it, navigate to the Project->Compute page, select the "MLFlow Tracking Server" tab, and choose the `Copy ARN` button.

In [None]:
project = Project()
#mlflow_arn = project.mlflow_tracking_server_arn

# Cut/Paste the ARN from the Tracking Server instance
## mlflow_arn = "arn:aws:sagemaker:us-west-2:767398116961:mlflow-tracking-server/tracking-server-blogxwjruvstqo-cv0wvz63pbj11s-dev"
mlflow_arn = "COPY_TRACKING_SERVER_ARN_HERE"
print(f"ARN: {mlflow_arn}")

mlflow.set_tracking_uri(mlflow_arn)

One of the added benefits of SageMaker Unified Studio is the `Project` class, which brings in project resources such as the S3 bucket root and the IAM role:

In [None]:
# Initialize AWS session
session = boto3.Session()
bucket_root = project.s3.root
role = project.iam_role

# Parse the S3 URI
s3_parts = bucket_root.replace("s3://", "").split("/")
bucket = s3_parts[0]
prefix = "/".join(s3_parts[1:])

## If you prefer NOT using the new SageMaker AI Project framework, here is an alternative
#session = sagemaker.Session()
#bucket = session.default_bucket()
#from sagemaker import get_execution_role
#role = get_execution_role()
#sagemaker_client = boto3.Session().client(service_name='sagemaker',region_name=region)

In [None]:
print(f"Using Bucket: {bucket}")
print(f"Using prefix: {prefix}")
print(f"Using Role: {role}")
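
The bucket and prefix split above can also be written as a small reusable helper. The following is a minimal sketch using only the standard library; the function name `split_s3_uri` and the example URI are made up for illustration and are not part of the SageMaker SDK.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Strip slashes so the prefix can be joined with further path parts
    return parsed.netloc, parsed.path.strip("/")


# Hypothetical URI, used only to illustrate the call
example_bucket, example_prefix = split_s3_uri("s3://example-bucket/project/dev/")
print(example_bucket, example_prefix)
```

Unlike the plain `replace`/`split` approach, this version rejects strings that are not S3 URIs and tolerates a trailing slash.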

Now, let's retrieve the name of the project's database through the default catalog:

In [None]:
# A good example of the Project class is getting the name of the project's database through the default catalog
catalog = project.connection().catalog()
project_database = catalog.databases[0].name
project_database

In [None]:
# Note: If your catalog has more than one database, use this code to look up names
for index, db in enumerate(catalog.databases):
    print(f"Index {index}: {db}")

### Data Preparation

First, we need to upload the data file named "5000-sales-records.csv". If you are running this notebook from the **SageMaker Unified Studio Workshop**, this file can be downloaded from the instructions page. Next, upload the file using the S3 Browser on the Project->Data page. Once the file is successfully uploaded, open the S3 console and locate the file by navigating to the folder prefix where you uploaded it. (Note: file uploads can normally be found under the `local-uploads` prefix.)

From the S3 console, select the file "5000-sales-records.csv" and choose the "Copy S3 URI" button. Then paste the URI within the quotes in the `read_csv()` call below.

In [None]:
# Using pandas to read CSV directly from S3 URI

# Example:
# data = pd.read_csv("s3://csv-file-store-72f9fec0/dzd_d22v67c8i2tzv4/blogxwjruvstqo/dev/local-uploads/1756921917470/5000-sales-records.csv")
data = pd.read_csv("COPY_S3_URI_HERE")


In [None]:
# Rename columns to the lowercase names that Spark DataFrame schema inference would produce
data.rename(columns={
    "Region": "region",
    "Country": "country",
    "Item Type": "item type",
    "Sales Channel": "sales channel",
    "Order Priority": "order priority",
    "Order Date": "order date",
    "Order ID": "order id",
    "Ship Date": "ship date",
    "Units Sold": "units sold",
    "Unit Price": "unit price",
    "Unit Cost": "unit cost",
    "Total Revenue": "total revenue",
    "Total Cost": "total cost",
    "Total Profit": "total profit",
    }, 
    inplace=True)

In [None]:
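# Dump DataFrame metadata
import io

logger.info(f"DataFrame shape: {data.shape}")
# data.info() prints to stdout and returns None, so capture its output in a buffer
info_buffer = io.StringIO()
data.info(buf=info_buffer)
logger.info("DataFrame info:\n%s", info_buffer.getvalue())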

Now that we have our data queried and available, let's prepare it for our machine learning models. We'll perform the following steps:

1. Split the data into features (X) and target variable (y)
2. Handle any missing values
3. Encode categorical variables
4. Scale numerical features
5. Split the data into training and testing sets
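
One detail in the steps above is easy to get wrong: the scaling statistics should be estimated from the training split only and then reused for the other splits. The sketch below shows this in plain Python on made-up numbers. The helper names `fit_standardizer` and `standardize` are hypothetical, and the notebook itself uses scikit-learn's `StandardScaler` for this step.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Return (mean, std) estimated from the training values only."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, as StandardScaler uses
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant columns
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(v - mu) / sigma for v in values]


train_units = [10.0, 20.0, 30.0]  # hypothetical training column
val_units = [40.0]                # hypothetical validation column

mu, sigma = fit_standardizer(train_units)
train_scaled = standardize(train_units, mu, sigma)
val_scaled = standardize(val_units, mu, sigma)  # reuses the training statistics
print(train_scaled, val_scaled)
```

Fitting on the training split alone keeps information about the validation and test rows from leaking into the features.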

Let's start by preparing our feature matrix and target variable:

Amazon SageMaker's XGBoost container expects data in the libSVM or CSV data format.  For this example, we'll stick to CSV.  Note that the first column must be the target variable and the CSV should not include headers.  Also, notice that although repetitive it's easiest to do this after the train|validation|test split rather than before.  This avoids any misalignment issues due to random reordering.
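
To make the expected layout concrete, here is a minimal standard-library sketch of a header-less CSV with the target in the first column. The function name `to_xgboost_csv` and the sample numbers are made up for illustration; the notebook writes the real files with pandas later on.

```python
import csv
import io


def to_xgboost_csv(targets, feature_rows):
    """Serialize rows as header-less CSV with the target in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for target, features in zip(targets, feature_rows):
        writer.writerow([target, *features])
    return buffer.getvalue()


# Hypothetical values: one target and two features per row
csv_text = to_xgboost_csv([951410.5, 248406.36], [[9925, 255.28], [2804, 205.7]])
print(csv_text)
```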

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def process_data(data: pd.DataFrame):
    """
    Process and prepare data for modeling
    """
    # Create copy to avoid modifying original
    df = data.copy()
    
    # Drop 'order id' column
    df = df.drop('order id', axis=1)
    
    # Convert date columns to datetime and extract features
    date_columns = ['order date', 'ship date']
    for col in date_columns:
        df[col] = pd.to_datetime(df[col])
        df[f'{col}_year'] = df[col].dt.year
        df[f'{col}_month'] = df[col].dt.month
        df[f'{col}_quarter'] = df[col].dt.quarter
    
    # Drop original date columns
    df = df.drop(columns=date_columns)
    
    # Create lag features for 'total revenue'
    for i in range(1, 4):
        df[f'revenue_lag_{i}'] = df.groupby(['item type', 'sales channel'])['total revenue'].shift(i)
    
    # Drop rows with NaN values
    df = df.dropna()
    
    # Convert categorical variables to dummy variables
    categorical_columns = ['region', 'country', 'item type', 'sales channel', 'order priority']
    df_encoded = pd.get_dummies(df, columns=categorical_columns)
    
    # Prepare features and target
    target_column = 'total profit'  # Assuming 'total profit' is the target variable
    numeric_columns = ['units sold', 'unit price', 'unit cost', 'total revenue', 'total cost']
    feature_columns = [col for col in df_encoded.columns if col != target_column]
    X = df_encoded[feature_columns]
    y = df_encoded[target_column].astype(float)
    
    # Train-test-validation split
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=1729)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.33, random_state=1729)
    
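    # Scale numeric features (copy first so the assignments below do not write to slices of X)
    X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()
    scaler = StandardScaler()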
    numeric_features = [col for col in X_train.columns if col in numeric_columns + 
                        ['order date_year', 'order date_month', 'order date_quarter',
                         'ship date_year', 'ship date_month', 'ship date_quarter'] +
                        [f'revenue_lag_{i}' for i in range(1, 4)]]
    
    X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
    X_val[numeric_features] = scaler.transform(X_val[numeric_features])
    X_test[numeric_features] = scaler.transform(X_test[numeric_features])
    
    return X_train, X_val, X_test, y_train, y_val, y_test, feature_columns, scaler

# Process the data
X_train, X_val, X_test, y_train, y_val, y_test, feature_columns, scaler = process_data(data)

# Print some information about the processed data
print("\nProcessed data shape:", X_train.shape)
print("\nFirst few rows of processed data:")
print(X_train.head())
print(X_train.shape)
print(X_train.info())
print("\nColumn names:")
print(X_train.columns.tolist())

# Verify target variable
print("\nSummary statistics of the target variable:")
print(y_train.describe())
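
The two chained `train_test_split` calls in `process_data` produce roughly a 70/20/10 split. The arithmetic can be checked with a short sketch; `split_fractions` is a hypothetical helper, and the actual row counts can differ slightly because scikit-learn rounds to whole rows.

```python
def split_fractions(first_test_size, second_test_size):
    """Return the (train, validation, test) fractions of two chained splits."""
    train = 1.0 - first_test_size
    test = first_test_size * second_test_size  # share of the held-out part kept for testing
    validation = first_test_size - test
    return train, validation, test


train_frac, val_frac, test_frac = split_fractions(0.3, 0.33)
print(f"train={train_frac:.3f}, validation={val_frac:.3f}, test={test_frac:.3f}")
```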

---

## Training XGBoost

### Option 1: Using the SageMaker Decorator
Sales data like ours often has features with skewed distributions, features that are highly correlated with one another, and non-linear relationships with the target variable.  Also, for this prediction task, good predictive accuracy is preferred to being able to explain each individual prediction.  Taken together, these aspects make gradient boosted trees a good candidate algorithm.

There are several intricacies to understanding the algorithm, but at a high level, gradient boosted trees work by combining predictions from many simple models, each of which tries to address the weaknesses of the previous models. By doing this, the collection of simple models can actually outperform large, complex models.  Other Amazon SageMaker notebooks elaborate further on gradient boosted trees and how they differ from similar algorithms.
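
The idea of "many simple models, each correcting the previous ones" can be illustrated with a toy sketch. It is not how `xgboost` is implemented: it assumes one-dimensional inputs, squared-error loss, and single-split decision stumps, and the names `fit_stump`, `boost`, and `training_mse` are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, learning_rate=0.5):
    """Fit `rounds` stumps, each one to the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [x * x for x in toy_x]  # a non-linear target
for rounds in (1, 5, 25):
    print(rounds, round(training_mse(boost(toy_x, toy_y, rounds), toy_x, toy_y), 3))
```

Each stump on its own is a very weak model, yet the training error shrinks as more rounds are added, which is the behavior the paragraph above describes.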

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  

Let's train a first version of XGBoost using this open-source library and using SageMaker's @remote decorator. You can use the @remote decorator to annotate a function. SageMaker AI will transform the code inside the decorator into a SageMaker training job. 

Note how we log various parameters, metrics, tags, and artifacts to MLflow. When the training is finished, don't forget to open up MLflow to take a look at the experiment results.

In [None]:
import xgboost as xgb
import os
import joblib
from sagemaker.remote_function import remote


def train_model(X_train, y_train, X_val, y_val):
    """
    Train XGBoost model
    """
    # Initialize model
    model = xgb.XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        random_state=42
    )
    
    # Train model
    model.fit(
        X_train, 
        y_train,
        eval_set=[(X_val, y_val)],
        verbose=False
    )
    
    return model


@remote(job_name_prefix="xgboost-sales-forecast", 
        instance_type="ml.m5.large", 
        keep_alive_period_in_seconds=600,)
def model_train(X_train, y_train, X_val, y_val, mlflow_arn):
    """
    Main function to orchestrate the model training process
    """
    mlflow.set_tracking_uri(mlflow_arn)
    mlflow.set_experiment("XG-Boost")
    
    with mlflow.start_run(run_name=f"xgboost-decorator-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}"):
        # Log information about the data
        mlflow.log_param("train_samples", len(X_train))
        mlflow.log_param("val_samples", len(X_val))
        mlflow.log_param("features", X_train.shape[1])
        
        # Train model
        model = train_model(X_train, y_train, X_val, y_val)
        
        # Log model parameters
        params = model.get_params()
        mlflow.log_params(params)
        
        # Log per-round validation RMSE (eval_set contains only the validation data)
        results = model.evals_result()
        for epoch, rmse_value in enumerate(results['validation_0']['rmse']):
            mlflow.log_metric('validation_rmse', rmse_value, step=epoch)

        # Log final metrics
        final_rmse = results['validation_0']['rmse'][-1]
        best_rmse = min(results['validation_0']['rmse'])
        best_epoch = results['validation_0']['rmse'].index(best_rmse)
        mlflow.log_metrics({
            'final_rmse': final_rmse,
            'best_rmse': best_rmse,
            'best_epoch': best_epoch
        })

        # Set tags for the run
        mlflow.set_tag("model_type", "XGBoost")
        mlflow.set_tag("framework", "OSS")
        
        # Infer model signature from matching inputs and outputs, then register the model
        predictions = model.predict(X_val)
        signature = mlflow.models.infer_signature(X_val, predictions)
        mlflow.xgboost.log_model(model, "model", registered_model_name="xgboost-lib-regression", signature=signature)
        
        # Save model
        path = "/opt/ml/model"
        joblib.dump(model, os.path.join(path, 'revenue_forecast_model.joblib'))
    return model, predictions

# Run the training
xgb_model, output = model_train(X_train, y_train, X_val, y_val, mlflow_arn)

### Option 2: Using SageMaker's built-in algorithm

Amazon SageMaker also has a managed, distributed [training framework for XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html). This section shows how you can train this version of XGBoost. Instead of using the @remote decorator, the code below creates a training job with the SageMaker Python SDK's `Estimator` class; the [ModelTrainer](https://sagemaker.readthedocs.io/en/stable/api/training/model_trainer.html) interface is an alternative way to do the same.

First we must adjust the data to be suitable for this version of XGBoost.

In [None]:
def save_for_sagemaker_xgboost(X, y, filename):
    """
    Save data in a format compatible with SageMaker's XGBoost algorithm.
    """
    # Combine target and features
    data = pd.concat([y.reset_index(drop=True), X.reset_index(drop=True)], axis=1)
    
    # Convert boolean columns to int
    bool_columns = data.select_dtypes(include=['bool']).columns
    data[bool_columns] = data[bool_columns].astype(int)
    
    # Ensure all data is numeric
    data = data.apply(pd.to_numeric, errors='coerce')
    
    # Replace any remaining non-numeric values with 0
    data = data.fillna(0)
    
    # Save to csv without header and index
    data.to_csv(filename, header=False, index=False)
    print(f"Data saved to {filename}")

# Combine train and validation sets
X_train_full = pd.concat([X_train, X_val])
y_train_full = pd.concat([y_train, y_val])

# Save training data (including validation data)
save_for_sagemaker_xgboost(X_train_full, y_train_full, 'train.csv')

# Save test data
save_for_sagemaker_xgboost(X_test, y_test, 'test.csv')

# Print some information about the saved files
print("\nTrain file info:")
print(pd.read_csv('train.csv', header=None).info())

print("\nTest file info:")
print(pd.read_csv('test.csv', header=None).info())

# Verify first few rows of each file
print("\nFirst few rows of train.csv:")
print(pd.read_csv('train.csv', header=None).head())

print("\nFirst few rows of test.csv:")
print(pd.read_csv('test.csv', header=None).head())


In the cell above, we performed preprocessing on our dataset and produced output files "train.csv" and "test.csv". Next, we'll upload these local files to our S3 location.

In [None]:
session.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train_xgboost/train.csv')).upload_file('train.csv')
session.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test_xgboost/test.csv')).upload_file('test.csv')
session.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation_xgboost/test.csv')).upload_file('test.csv')

We'll need to specify the ECR container location for Amazon SageMaker's implementation of XGBoost.

In [None]:
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

Then, because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as a pointer to the files in S3, which also specify that the content type is CSV.

In [None]:
s3_input_train = sagemaker.inputs.TrainingInput(
    s3_data='s3://{}/{}/train_xgboost/'.format(bucket, prefix),
    content_type='csv'
)
s3_input_test = sagemaker.inputs.TrainingInput(
    s3_data='s3://{}/{}/test_xgboost/'.format(bucket, prefix),
    content_type='csv'
)
s3_input_validation = sagemaker.inputs.TrainingInput(
    s3_data='s3://{}/{}/validation_xgboost/'.format(bucket, prefix),
    content_type='csv'
)

First we'll need to specify training parameters to the estimator.  This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` function which specifies the input data.  In this case we have both a training and validation set which are passed in.

In [None]:
from mlflow.models import infer_signature

sm_session = sagemaker.Session()
xgb_estimator = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m5.xlarge',
                                    output_path=f's3://{bucket}/{prefix}/output',
                                    sagemaker_session=sm_session)

hyperparameters = {
    "max_depth": 6,
    "eta": 0.2,
    "gamma": 4,
    "min_child_weight": 8,
    "subsample": 0.6,
    "verbosity": 0,
    "objective": "reg:linear",
    "num_round": 75,
}
xgb_estimator.set_hyperparameters(**hyperparameters)

mlflow.set_experiment("XG-Boost")
with mlflow.start_run(run_name=f"xgboost-builtin-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}"):
    # Log the hyperparameters
    mlflow.log_params(hyperparameters)

    # Fit the model and capture training metrics
    xgb_estimator.fit({'train': s3_input_train, 'validation': s3_input_validation})
    
    # Get the training job name
    job_name = xgb_estimator.latest_training_job.job_name
    
    # Get the training job description
    client = sm_session.boto_session.client('sagemaker')
    training_job_description = client.describe_training_job(TrainingJobName=job_name)
    
    # Extract and log metrics
    for metric in training_job_description['FinalMetricDataList']:
        metric_name = metric['MetricName']
        metric_value = metric['Value']
        mlflow.log_metric(metric_name, metric_value)

    # Set tags for the run
    mlflow.set_tag("model_type", "XGBoost")
    mlflow.set_tag("framework", "SageMaker")

    # Register the model
    mlflow.register_model(f"runs:/{mlflow.active_run().info.run_id}/model", "xgboost-sm-regression")

    print(f"Model saved in run {mlflow.active_run().info.run_id}")

Now that you have successfully completed the training of the XGBoost model, you can deploy it to a real-time endpoint. Once the SageMaker Autopilot job later in this notebook has finished, you can also create a model from any of its candidates by using [Inference Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html).

In [None]:
xgb_predictor = xgb_estimator.deploy(initial_instance_count=1,
                           instance_type='ml.m5.xlarge')

First we need to determine how to pass data into and receive data from our endpoint.  Our test data will be held as NumPy arrays in the memory of our notebook instance.  To send it in an HTTP POST request, we serialize it as a CSV string and then decode the CSV that the endpoint returns.

*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*

In [None]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array
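
The batching and parsing logic in these steps can be tried without a live endpoint. In the sketch below, `fake_endpoint` is a hypothetical stand-in for `predictor.predict` that simply returns row sums, and the helper names are made up for illustration. The parser accepts commas or newlines as separators, which is an assumption about the response format rather than documented behavior.

```python
def make_csv_batches(rows, batch_size=500):
    """Yield CSV payload strings, one per mini-batch of feature rows."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        yield "\n".join(",".join(str(v) for v in row) for row in batch)


def parse_csv_predictions(body: bytes):
    """Parse a comma- or newline-separated response body into floats."""
    text = body.decode("utf-8").replace("\n", ",")
    return [float(token) for token in text.split(",") if token.strip()]


def fake_endpoint(payload: str) -> bytes:
    """Hypothetical stand-in for predictor.predict: returns the sum of each row."""
    sums = [sum(float(v) for v in line.split(",")) for line in payload.splitlines()]
    return ",".join(str(s) for s in sums).encode("utf-8")


feature_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # target column already removed
batch_predictions = []
for payload in make_csv_batches(feature_rows, batch_size=2):
    batch_predictions.extend(parse_csv_predictions(fake_endpoint(payload)))
print(batch_predictions)
```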

In [None]:
# Load the CSV file (it was written without a header row, so pass header=None)
test_data = pd.read_csv('test.csv', header=None)

def predict(data, predictor, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = []
    
    for array in split_array:
        # Convert numpy array to CSV string
        csv = '\n'.join([','.join(map(str, row)) for row in array])
        
        # Get predictions
        prediction = predictor.predict(csv)
        
        # Decode and convert to numpy array
        prediction_array = np.fromstring(prediction.decode('utf-8'), sep=',')
        
        predictions.append(prediction_array)
    
    # Concatenate all predictions
    return np.concatenate(predictions)

# Print column names and data info for debugging
print("Original columns:", test_data.columns)
print(test_data.info())

# Remove the first column (target variable)
X_test = test_data.iloc[:, 1:]

# Print shape to confirm
print("Shape of X_test:", X_test.shape)
print("Columns of X_test:", X_test.columns)

# Ensure we have the same number of features the model was trained on
expected_features = len(feature_columns)
if X_test.shape[1] != expected_features:
    print(f"Warning: You have {X_test.shape[1]} features. The model expects {expected_features}.")
    print("Current features:", X_test.columns.tolist())
    # If needed, select the expected columns manually before predicting

# Now you can use X_test in your predict function
predictions = predict(X_test.values, xgb_predictor)


Now we'll output a score for the predictions:

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Extract the actual values (first column)
y_true = test_data.iloc[:, 0]

# Ensure predictions are in the same format as y_true
y_pred = predictions  # This should be the predictions from your previous cell

# Make sure y_true and y_pred have the same length
assert len(y_true) == len(y_pred), "Mismatch in length between actual and predicted values"

# Calculate evaluation metrics
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"R-squared Score: {r2:.4f}")

# If you want to see a sample of the actual vs predicted values
comparison_df = pd.DataFrame({'Actual': y_true, 'Predicted': y_pred})
print("\nSample of Actual vs Predicted values:")
print(comparison_df.head(10))

# If you want to plot the actual vs predicted values
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(y_true, y_pred, alpha=0.5)
plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()

# Calculate and print additional statistics
print("\nAdditional Statistics:")
print(f"Mean of Actual Values: {y_true.mean():.4f}")
print(f"Mean of Predicted Values: {y_pred.mean():.4f}")
print(f"Standard Deviation of Actual Values: {y_true.std():.4f}")
print(f"Standard Deviation of Predicted Values: {y_pred.std():.4f}")
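
For reference, the metrics reported above follow from a few simple formulas. The sketch below recomputes them in plain Python on made-up numbers. `regression_metrics` is a hypothetical helper, and it assumes the actual values are not all identical (otherwise R-squared is undefined).

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE, and R-squared from two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse_value = sum(e * e for e in errors) / n
    mae_value = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    # R-squared: 1 minus residual sum of squares over total sum of squares
    r2_value = 1.0 - (n * mse_value) / total_ss
    return {"mse": mse_value, "rmse": math.sqrt(mse_value), "mae": mae_value, "r2": r2_value}


toy_metrics = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(toy_metrics)
```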

#### Optional: Hyperparameter Tuning Job

We can optionally run a Hyperparameter Optimization (HPO) Job to improve our results. This will run multiple training jobs with different values for the hyperparameters based on the ranges specified below. The HPO job automatically chooses new parameter values based on the results of previous jobs, eventually leading to better results.

In [None]:
from sagemaker.tuner import HyperparameterTuner
from sagemaker.parameter import ContinuousParameter, IntegerParameter


hyperparameter_ranges = {
    "max_depth": IntegerParameter(5, 10),
    "eta": ContinuousParameter(0.001, 0.3),
    "min_child_weight": IntegerParameter(5, 10),
    "subsample": ContinuousParameter(0.3, 0.8),
    "num_round": IntegerParameter(50, 100),
}

objective_metric_name = "validation:rmse"

tuner = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    objective_type="Minimize",
    strategy="Bayesian",
    max_jobs=10,
    max_parallel_jobs=5,
) 

When we run the HPO job, we can log the results of the hyperparameter tuning job and each child training job into MLflow. 

In [None]:
def format_param_range(param_range):
    formatted_param_range = {
        f"{param_range['Name']}_min_value": param_range['MinValue'],
        f"{param_range['Name']}_max_value": param_range['MaxValue']
    }
    return formatted_param_range


mlflow.set_experiment("XG-Boost-HPO")
with mlflow.start_run(run_name=f"xgboost-hpo-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}"):
    tuner.fit({"train": s3_input_train, "validation": s3_input_validation})
    tuner_descr = tuner.describe()
    
    # Log parameters relevant to overall tuning job
    mlflow.log_params(tuner_descr['HyperParameterTuningJobConfig']['ResourceLimits'])
    mlflow.log_param('Strategy', tuner_descr['HyperParameterTuningJobConfig']['Strategy'])
    mlflow.log_params(tuner_descr['TrainingJobDefinition']['StaticHyperParameters'])
    for integer_param in tuner_descr['HyperParameterTuningJobConfig']['ParameterRanges']['IntegerParameterRanges']:
        mlflow.log_params(format_param_range(integer_param))
    for continuous_param in tuner_descr['HyperParameterTuningJobConfig']['ParameterRanges']['ContinuousParameterRanges']:
        mlflow.log_params(format_param_range(continuous_param))
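    for categorical_param in tuner_descr['HyperParameterTuningJobConfig']['ParameterRanges']['CategoricalParameterRanges']:
        # Categorical ranges carry a list of 'Values' rather than 'MinValue'/'MaxValue'
        mlflow.log_param(f"{categorical_param['Name']}_values", categorical_param['Values'])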
    mlflow.log_param('BestTrainingJobName', tuner_descr['BestTrainingJob']['TrainingJobName'])

    # Set tags for the run
    mlflow.set_tag("model_type", "XGBoost")
    mlflow.set_tag("framework", "SageMaker")

    # Log parameters and metrics for each training job
    train_summaries = tuner.analytics().training_job_summaries()
    for train_results in train_summaries:
        with mlflow.start_run(run_name=train_results['TrainingJobName'], nested=True):
            mlflow.log_params(train_results['TunedHyperParameters'])
            mlflow.log_metric(train_results['FinalHyperParameterTuningJobObjectiveMetric']['MetricName'],
                             train_results['FinalHyperParameterTuningJobObjectiveMetric']['Value'])
            mlflow.set_tag("model_type", "XGBoost")
            mlflow.set_tag("framework", "SageMaker")

It is also easy to deploy the best model from the hyperparameter tuning job. 

In [None]:
# Keep the returned predictor so the endpoint can be invoked and deleted later
hpo_predictor = tuner.deploy(
    initial_instance_count=1, 
    instance_type='ml.m5.xlarge'
)

---
## AutoML Training

Amazon SageMaker Autopilot is an automated machine learning (commonly referred to as AutoML) solution for tabular datasets. You can use SageMaker Autopilot in different ways: on autopilot (hence the name) or with human guidance, without code through SageMaker Studio, or using the AWS SDKs. This notebook, as a first glimpse, will use the AWS SDKs to simply create and deploy a machine learning model.

This part of the notebook demonstrates how you can use Autopilot on this dataset to get the most accurate ML pipeline by exploring a number of potential options, or "candidates". Each candidate generated by Autopilot consists of two steps. The first step performs automated feature engineering on the dataset, and the second step trains and tunes an algorithm to produce a model. When you deploy this model, it follows similar steps: feature engineering followed by inference, here to predict the total profit of an order. The cells below show how to train the model and how to create a SageMaker model from the best candidate. Where possible, use the Amazon SageMaker Python SDK, a high-level SDK, to simplify the way you interact with Amazon SageMaker.

First, we need to upload the entire dataset to S3.

Caution: Before running the cell below, you must upload the data file "5000-sales-records.csv" to the local directory!

In [None]:
# Upload data for AutoML Job

session.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train_automl/train.csv')).upload_file('5000-sales-records.csv')
session.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test_automl/test.csv')).upload_file('5000-sales-records.csv')

### AutoML Configuration

You can specify the type of problem you want to solve with your dataset (`Regression, MulticlassClassification, BinaryClassification`). In case you are not sure, SageMaker Autopilot will infer the problem type based on statistics of the target column (the column you want to predict). 

You have the option to limit the running time of a SageMaker Autopilot job by providing either the maximum number of pipeline evaluations or candidates (one pipeline evaluation is called a `Candidate` because it generates a candidate model) or providing the total time allocated for the overall Autopilot job. Under default settings, this job takes about four hours to run. This varies between runs because of the nature of the exploratory process Autopilot uses to find optimal training parameters.

In [None]:
from time import gmtime, strftime, sleep
import json
import mlflow
import boto3

# Set up MLflow experiment
mlflow.set_experiment("AutoML-Job")

# Start MLflow run
with mlflow.start_run(run_name="AutoML-Job-Run"):

    input_data_config = [{
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': f's3://{bucket}/{prefix}/train_automl'
            }
        },
        'ContentType': 'text/csv;header=present',
        'TargetAttributeName': 'Total Profit'
    }]

    output_data_config = {
        'S3OutputPath': f's3://{bucket}/{prefix}/output_automl'
    }

    auto_ml_job_config = {
        'CompletionCriteria': {
            'MaxCandidates': 5
        }
    }

    autoMLJobObjective = {
        "MetricName": "MSE"  
    }

    # Log configurations to MLflow
    mlflow.log_dict(input_data_config, "input_data_config.json")
    mlflow.log_dict(output_data_config, "output_data_config.json")
    mlflow.log_dict(auto_ml_job_config, "auto_ml_job_config.json")
    mlflow.log_dict(autoMLJobObjective, "autoMLJobObjective.json")

    # Configuration
    timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
    auto_ml_job_name = 'demo' + timestamp_suffix
    print('AutoMLJobName: ' + auto_ml_job_name)
    mlflow.log_param("AutoMLJobName", auto_ml_job_name)

    # Create AutoML job
    sm = boto3.client('sagemaker')
    sm.create_auto_ml_job(
        AutoMLJobName=auto_ml_job_name,
        InputDataConfig=input_data_config,
        OutputDataConfig=output_data_config,
        AutoMLJobConfig=auto_ml_job_config,
        AutoMLJobObjective=autoMLJobObjective,
        ProblemType="Regression",  
        RoleArn=role  
    )

    # Wait for the AutoML job to complete
    while True:
        response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)
        status = response['AutoMLJobStatus']
        if status in ['Completed', 'Failed', 'Stopped']:
            break
        print(f"AutoML job status: {status}")
        sleep(60)

    # Log final job status
    mlflow.log_param("FinalJobStatus", status)

    if status == 'Completed':
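        # Log best candidate info (timestamps are converted to strings so the dict is JSON-serializable)
        best_candidate = response['BestCandidate']
        mlflow.log_dict(json.loads(json.dumps(best_candidate, default=str)), "best_candidate.json")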
        
        # Log objective metric
        objective_metric = best_candidate['FinalAutoMLJobObjectiveMetric']
        mlflow.log_metric(objective_metric['MetricName'], objective_metric['Value'])

        # Log other metrics if available
        candidate_metrics = best_candidate.get('CandidateProperties', {}).get('CandidateMetrics', [])
        for metric in candidate_metrics:
            mlflow.log_metric(metric['MetricName'], metric['Value'])

    print(f"AutoML job {auto_ml_job_name} finished with status: {status}")
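
The polling loop above waits indefinitely. A variant with an upper bound on the waiting time could look like the sketch below. `wait_for_status` is a hypothetical helper, and the lambda passed to it stands in for reading `'AutoMLJobStatus'` from `describe_auto_ml_job`, so the example runs without AWS access.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_status(describe, poll_seconds=60, max_wait_seconds=4 * 3600, sleep=time.sleep):
    """Poll describe() until it returns a terminal status or the time budget runs out."""
    waited = 0
    while True:
        status = describe()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= max_wait_seconds:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Hypothetical sequence of statuses replacing the real API call
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=1, sleep=lambda seconds: None)
print(final_status)
```

Passing `sleep` as a parameter keeps the helper easy to test, since the waiting can be replaced by a no-op.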
    

In [None]:
sagemaker_client = boto3.client('sagemaker')

best_candidate = sagemaker_client.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']
best_candidate_name = best_candidate['CandidateName']
print(best_candidate)
print('\n')

print("CandidateName: " + best_candidate_name)
print("FinalAutoMLJobObjectiveMetricName: " + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])
print("FinalAutoMLJobObjectiveMetricValue: " + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))

### Create Model for best candidate

When the AutoML job has finished running, you can easily create a SageMaker Model object using the best candidate.

In [None]:
from time import gmtime, strftime

timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())

model_name = 'demo-' + timestamp_suffix
print(f"Model name: {model_name}")

In [None]:
# Create Model
model = sagemaker_client.create_model(
    Containers=best_candidate['InferenceContainers'],
    ModelName=model_name,
    ExecutionRoleArn=role
)

print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))

### View other candidates
You can view all the candidates (pipeline evaluations with different hyperparameter combinations) that were explored by SageMaker Autopilot and sort them by their final performance metric.

In [None]:
candidates = sagemaker_client.list_candidates_for_auto_ml_job(
    AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']

for index, candidate in enumerate(candidates, start=1):
    print(f"{index}  {candidate['CandidateName']}  {candidate['FinalAutoMLJobObjectiveMetric']['Value']}")
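
When ranking candidates locally, keep in mind that for an error metric such as MSE a lower value is better. The sketch below sorts a list shaped like the `Candidates` response. The helper `rank_candidates` and the sample entries are made up for illustration.

```python
def rank_candidates(candidates, lower_is_better=True):
    """Return candidates sorted from best to worst by their final objective value."""
    return sorted(
        candidates,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=not lower_is_better,
    )


# Hypothetical candidates shaped like entries of the 'Candidates' list
sample_candidates = [
    {"CandidateName": "candidate-a", "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:mse", "Value": 12.5}},
    {"CandidateName": "candidate-b", "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:mse", "Value": 3.2}},
    {"CandidateName": "candidate-c", "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:mse", "Value": 7.9}},
]

for rank, ranked in enumerate(rank_candidates(sample_candidates), start=1):
    print(rank, ranked["CandidateName"], ranked["FinalAutoMLJobObjectiveMetric"]["Value"])
```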

## Cleanup

The Autopilot job creates many underlying artifacts such as dataset splits, preprocessing scripts, or preprocessed data, etc. This code, when un-commented, deletes them. This operation deletes all the generated models and the auto-generated notebooks as well. 

In [None]:
#s3 = boto3.resource('s3')
#bucket = s3.Bucket(bucket)

#job_outputs_prefix = '{}/output_automl/{}'.format(prefix,auto_ml_job_name)
#bucket.objects.filter(Prefix=job_outputs_prefix).delete()
# xgb_predictor.delete_endpoint(delete_endpoint_config=True)
# tuner.delete_endpoint(delete_endpoint_config=True)