<a href="https://colab.research.google.com/github/mlukan/GDA3B2021/blob/main/AWS/sagemaker_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <u>Introduction to Using SageMaker with Boto3</u>
Using the Titanic dataset from earlier (a binary classification problem: did a passenger survive?), we're going to train an XGBoost model and output predictions.

To be specific, here are the steps that we're going to take during this challenge:
- Datasets
    - Locally download the training/validation/test datasets
    - Upload the dataset to a newly-created S3 bucket
- Training
    - Use a pre-built AWS XGBoost image for training
    - Create a SageMaker session
    - Train the XGBoost model from the data you created in S3
    - Host the model on an endpoint (very basic!)
- Predictions
    - Make a prediction 
    - Put the prediction onto S3
- Delete the endpoint
- Delete your S3 Bucket

## <u>Datasets</u>

Before we can start using SageMaker, we're going to download the datasets locally and then upload them to a brand-new S3 bucket so that our SageMaker algorithm can read them easily.

You can download the datasets here: https://www.kaggle.com/c/titanic/data. After you download and extract the .zip file, you should have a "train.csv" and a "test.csv" file, which we're going to use quite extensively.

Before we upload these datasets to a newly-created S3 bucket, we need to make sure the dataset is formatted in a way that the AWS XGBoost model understands.

In [None]:
# Do the usual data preparation (i.e. get rid of unique features, clean the data and deal with missing values, and finally, encode categorical features)

# INSERT CODE HERE
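
As a rough sketch of the usual preparation steps, assuming the standard Kaggle Titanic column names and a pandas workflow: the helper name `clean_titanic`, the choice of dropped columns, and the median/mode imputation are illustrative assumptions, not the one correct answer to this exercise.

```python
import pandas as pd


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Return a numeric-only copy of a Titanic frame (illustrative choices)."""
    # Drop identifier-like columns that are (nearly) unique per passenger.
    out = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], errors="ignore")
    # Fill missing numeric values with the column median.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Fare"] = out["Fare"].fillna(out["Fare"].median())
    # Fill the missing port of embarkation with the most frequent value.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Encode the categorical features as numbers.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out = pd.get_dummies(out, columns=["Embarked"], dtype=int)
    return out


# Tiny made-up sample standing in for train.csv.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
})
cleaned = clean_titanic(sample)
print(cleaned.head())
```

Apply the same function to both train.csv and test.csv so that the two files end up with the same feature columns in the same order.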

After cleaning, you need to write the data back to train.csv and test.csv in the correct format. You can check the Input/Output Interface section of the AWS XGBoost documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

Most importantly, for CSV input the algorithm assumes that the target is in the first column and that the file contains only the values of the matrix (drop the header and the index).
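
Here is a minimal sketch of that layout, assuming the cleaned data lives in a pandas DataFrame whose target column is called `Survived`. The helper name `to_xgboost_csv` is hypothetical.

```python
import pandas as pd


def to_xgboost_csv(df: pd.DataFrame, target: str = "Survived") -> str:
    """Render df as header-less, index-less CSV with the target column first."""
    if target in df.columns:
        # Move the target to the front and keep the other columns in order.
        ordered = [target] + [c for c in df.columns if c != target]
        df = df[ordered]
    # With no path argument, to_csv returns the CSV text as a string.
    return df.to_csv(header=False, index=False)


# Tiny made-up frame standing in for the cleaned training data.
example = pd.DataFrame({"Pclass": [3, 1], "Survived": [0, 1], "Age": [22.0, 38.0]})
csv_text = to_xgboost_csv(example)
print(csv_text)
```

To write a file instead of a string, pass a path, for example `df.to_csv("train.csv", header=False, index=False)`. The Kaggle test set has no `Survived` column, so for test.csv only the header and index need to go.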

In [None]:
# Format and save the data to test.csv and train.csv

# INSERT CODE HERE

In [None]:
# Create a new S3 bucket

# INSERT CODE HERE
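
One possible way to create the bucket with Boto3 is sketched below. The helper names and the example bucket name are placeholders, and `boto3` is imported inside the function so that the argument-building part can be checked without AWS credentials. The one detail worth knowing: outside `us-east-1` you pass a `LocationConstraint`, while for `us-east-1` you leave it out.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build the keyword arguments for the S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        # us-east-1 is the default location and must not be named explicitly.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name: str, region: str):
    """Create the bucket; needs boto3 and valid AWS credentials."""
    import boto3  # lazy import: only required when actually calling AWS

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Placeholder name: bucket names are globally unique, so pick your own.
print(create_bucket_kwargs("my-titanic-sagemaker-bucket", "eu-central-1"))
```

Calling `create_bucket("my-titanic-sagemaker-bucket", "eu-central-1")` would then perform the actual request.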

In [None]:
# Upload the two files (train.csv and test.csv) to your newly-created S3 bucket

# INSERT CODE HERE
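
The upload can be sketched as follows, assuming both files sit in the current working directory. The helper names, the bucket name, and the `titanic` key prefix are all placeholders. `boto3` is again imported lazily, so the key and URI planning can be tried without credentials.

```python
def upload_plan(bucket: str, prefix: str, filenames: list) -> list:
    """Map each local file to a (local_path, s3_key, s3_uri) tuple."""
    plan = []
    for name in filenames:
        key = f"{prefix.strip('/')}/{name}" if prefix else name
        plan.append((name, key, f"s3://{bucket}/{key}"))
    return plan


def upload_files(bucket: str, prefix: str, filenames: list) -> list:
    """Upload the files and return their S3 URIs; needs boto3 and credentials."""
    import boto3  # lazy import: only required when actually calling AWS

    s3 = boto3.client("s3")
    uris = []
    for local_path, key, uri in upload_plan(bucket, prefix, filenames):
        s3.upload_file(local_path, bucket, key)
        uris.append(uri)
    return uris


plan = upload_plan("my-titanic-sagemaker-bucket", "titanic", ["train.csv", "test.csv"])
for _, _, uri in plan:
    print(uri)
```

The returned `s3://` URIs are what you will point the training inputs at in the next section.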

## <u>Training</u>

One extremely helpful feature of SageMaker is its repository of ready-to-use machine learning models, created by AWS developers and the AWS community. These models range from what you're learning right now (linear regression, random forests, XGBoost) to more complicated deep-learning models for object detection, natural language processing, or reinforcement learning. The repository can be found here: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html.

To use these algorithms, all that you need to do is specify the image URI.

In [None]:
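# Use a pre-built AWS XGBoost image for training

# Note: this import is the SageMaker Python SDK v1 API; SDK v2 replaces it with sagemaker.image_uris.retrieve
from sagemaker.amazon.amazon_estimator import get_image_uri
# container = get_image_uri(***insert correct arguments here***)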

In [None]:
# Create pointers to the S3 train and test datasets

# Replace each None with a sagemaker.s3_input(...) call (SDK v1; SDK v2 renamed it to sagemaker.inputs.TrainingInput)
s3_input_train = None  # TODO: point this at your training dataset on S3
s3_input_test = None  # TODO: point this at your testing dataset on S3

In [None]:
# Create a SageMaker Session

# INSERT CODE HERE

In [None]:
# Create an XGBoost Estimator

# INSERT CODE HERE

In [None]:
# Select your specific hyperparameters (Optional)

# INSERT CODE HERE
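
As a starting point, here is one plausible set of hyperparameters for a binary classification problem like this one. The specific values are untuned guesses, and `xgb` in the commented-out line is a hypothetical name for the estimator you created above.

```python
# Illustrative, untuned values for the built-in XGBoost algorithm.
hyperparameters = {
    "objective": "binary:logistic",  # output a survival probability per row
    "num_round": 100,                # number of boosting rounds (required)
    "max_depth": 5,                  # maximum depth of each tree
    "eta": 0.2,                      # learning rate
    "subsample": 0.8,                # fraction of rows sampled per tree
    "eval_metric": "auc",            # metric reported on the validation data
}

# With a real estimator in hand, the values would be applied like this:
# xgb.set_hyperparameters(**hyperparameters)
print(hyperparameters)
```

`binary:logistic` makes the model return probabilities rather than hard 0/1 labels, which matters when you interpret the endpoint's responses later.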

In [None]:
# Fit the model

# INSERT CODE HERE

In [None]:
# Deploy your model to an endpoint to perform predictions

xgb_predictor = name_of_your_model.deploy(
    initial_instance_count = 1, 
    instance_type = 'ml.t2.medium')

In [None]:
# Configure the predictor's serializer and deserializer

# INSERT CODE HERE

## <u>Predictions</u>

Now that we have a running endpoint, we're going to use this endpoint to make predictions and then upload these predictions to S3.


In [None]:
# Make a prediction using xgb_predictor

# INSERT CODE HERE
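
The sketch below covers the plain-Python part of this step: turning feature rows into a CSV payload and turning the endpoint's response into 0/1 labels. It assumes that the model was trained with a probability-style objective such as `binary:logistic` and that the endpoint answers with scores separated by commas or newlines. Both are assumptions about your setup, and all three helper names are hypothetical.

```python
import re


def rows_to_csv(rows) -> str:
    """Join feature rows into a CSV payload (no header, no target column)."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def parse_scores(body) -> list:
    """Parse comma- or newline-separated scores from a response body."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return [float(tok) for tok in re.split(r"[,\s]+", body.strip()) if tok]


def to_labels(scores, threshold: float = 0.5) -> list:
    """Turn probabilities into 0/1 predictions using the given threshold."""
    return [int(score >= threshold) for score in scores]


payload = rows_to_csv([[3, 0, 22.0], [1, 1, 38.0]])
# With a live endpoint: response = xgb_predictor.predict(payload)
response = b"0.12,0.91"  # made-up response, for illustration only
labels = to_labels(parse_scores(response))
print(payload)
print(labels)
```

If you configured a CSV serializer on the predictor in the previous cell, you can pass the rows directly and skip `rows_to_csv`.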

In [None]:
# Upload the predictions onto S3

# INSERT CODE HERE

In [None]:
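# Delete the endpoint

import sagemaker

# SDK v1 exposes the endpoint's name as `endpoint`; SDK v2 renamed it to `endpoint_name`
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)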

In [None]:
# Delete your S3 bucket

# INSERT CODE HERE
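
S3 refuses to delete a bucket that still contains objects, so the bucket has to be emptied first. A minimal sketch using the Boto3 resource API follows. The helper names and the bucket name are placeholders, and `boto3` is imported lazily so the ordering logic can be read and tried without credentials.

```python
def empty_and_delete(bucket) -> None:
    """Empty and delete a bucket.

    `bucket` is expected to be a boto3 `s3.Bucket` resource.
    """
    bucket.objects.all().delete()  # remove every object first
    bucket.delete()                # then remove the now-empty bucket


def delete_bucket_by_name(bucket_name: str) -> None:
    """Look up the bucket by name and delete it; needs boto3 and credentials."""
    import boto3  # lazy import: only required when actually calling AWS

    empty_and_delete(boto3.resource("s3").Bucket(bucket_name))


# delete_bucket_by_name("my-titanic-sagemaker-bucket")  # placeholder name
```

If versioning is enabled on the bucket, the old object versions have to be removed as well (for example with `bucket.object_versions.delete()`) before the bucket itself can be deleted.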

## <u>Overview</u>
Now that you've (hopefully) successfully created a SageMaker session to run an experiment, there are several other things that you could try in order to get a better understanding of the powerful capabilities of SageMaker. In particular, if you found this challenge too easy, feel free to try the following:
- Rather than using the pre-built XGBoost algorithm, use your own algorithm. For example, you can still use XGBoost, but package it yourself as a custom algorithm rather than using the pre-built image.
- Tune the XGBoost hyperparameters