# Tutorial 4: Train your model in an AWS Sagemaker training job

In the last tutorial, we trained and evaluated a first model on small dummy data. While it is possible to train a model within [Sagemaker Studio](https://www.youtube.com/watch?v=uQc8Itd4UTs&list=PLhr1KZpdzukcOr_6j_zmSrvYnLUtgqsZz), it's better to use a Sagemaker training job instead. Sagemaker training jobs have several advantages over a normal notebook. Namely, they

- provide you with a nice overview of all the trainings you ran
- automatically store the results of a training run (metrics, [logs](https://console.aws.amazon.com/cloudwatch) and models)
- do not automatically shut down after a few hours (which we enabled in the notebooks)
- allow you to run multiple training jobs in parallel (if you have sufficient GPUs allocated)

In this notebook, we will convert the approach from tutorial 3 into a Sagemaker training job and train our model on the full data set. During the training, the logs are sent to [Cloudwatch](https://console.aws.amazon.com/cloudwatch) - AWS's monitoring service. After the training has completed, the model is saved and automatically uploaded to S3. From there we'll retrieve the model and evaluate it.

**NOTE: We will NOT need a GPU for this tutorial notebook. Pick a non-GPU instance type to save costs.**

## Setup

As usual, we'll start by setting up the appropriate imports. Interacting with Sagemaker and starting a training job can be done in several ways (e.g. from the AWS UI). We'll use the [Sagemaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk), which is already installed in your studio environment, to start the training job from this notebook.

*Hint: if you want to use this notebook on your local machine, make sure to set up the AWS CLI with your challenge credentials and install the requirements.*

In [5]:
import boto3           # For interacting with S3
import pandas as pd
import sys             # Python system library needed to load custom functions

# Imports to run Sagemaker training jobs
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch
from sagemaker.session import Session


In [6]:
sys.path.append('../src')  # Add the source directory to the PYTHONPATH. This allows us to import local functions and modules.

In [7]:
from config import DEFAULT_BUCKET, DEFAULT_REGION  # The name of the S3 bucket that contains the training data
from detection_util import create_predictions
from gdsc_util import download_and_extract_model, set_up_logging, extract_hyperparams, create_encrypted_bucket, PROJECT_DIR
from tutorial_4_training import load_config

set_up_logging()  # Sets up logging to console and .log

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


# Converting Tutorial 3 into a Training Script

Looking back at tutorial 3, you'll find that the notebook contained three main components that were necessary for training a model:

- the definition of the dataset
- the configuration file of the training
- the logic for building the dataset and starting the training

How can we best turn this into a script? For starters, the definition of the dataset is not likely to change. Hence it makes sense to turn this into a separate module and import it where necessary. You can find the code for this in ```src/Dataset.py```. The configuration will probably be changed during different experiments. Similarly, the logic might change a bit once we start using different training and testing datasets. Therefore we put these two into a separate file called ```src/tutorial_4_training.py```. The intent is to have a separate file for each training job in order to make everything reproducible and easily comparable. Note that for a longer project you'll probably want to set up a proper experiment tracking tool like [Sagemaker Experiments](https://sagemaker-experiments.readthedocs.io/en/latest/index.html) or [MLFlow](https://mlflow.org/).

# Running the Training Script

The training job will run on a virtual machine (called an instance) in the AWS cloud. An overview of all your training jobs can be found in the [AWS console](https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs) (you may have to change the region) after you have logged into your account. You can also navigate directly to *Amazon SageMaker > Training > Training jobs* and click on the name of the latest training job.

To start, we need to set the name of our experiment. Keep in mind that every experiment should have a unique name. Since we'll use a separate python script for each training, we'll use the name of the training script as name of the experiment.

In [26]:
entry_point = 'tutorial_4_training.py'
exp_name = entry_point.split('.')[0].replace('_', '-')  # AWS does not allow . and _ in experiment names
exp_name

'tutorial-4-training'
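
The name derivation above can also be wrapped in a small helper that checks the result against the pattern AWS documents for training job names (letters, digits and hyphens, at most 63 characters). This is only a minimal sketch: `to_experiment_name` and `JOB_NAME_PATTERN` are hypothetical names and not part of the challenge code.

```python
import re
from pathlib import Path

# Pattern documented for SageMaker training job names: letters, digits and
# hyphens, starting with a letter or digit, at most 63 characters.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def to_experiment_name(script_name: str) -> str:
    """Derive an experiment name from a training script file name."""
    stem = Path(script_name).stem  # drop the '.py' extension
    # Replace every run of unsupported characters (e.g. '_' or '.') with '-'
    name = re.sub(r"[^a-zA-Z0-9]+", "-", stem).strip("-")
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"{name!r} is not a valid job name")
    return name


print(to_experiment_name("tutorial_4_training.py"))  # tutorial-4-training
```

Keep in mind that the SDK appends a timestamp to ```base_job_name```, so shorter names leave more room within the length limit.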

Next we need to define the AWS settings for the job

In [27]:
account_id = boto3.client('sts').get_caller_identity().get('Account')
role = get_execution_role()
sm_client = boto3.client("sagemaker", region_name=DEFAULT_REGION)
sess = Session(sagemaker_client=sm_client)

We will also need to download the images and training CSV in order to train a model. Luckily, Sagemaker has built-in functionality for this.
Via the ```input_channels``` dictionary we can specify multiple S3 locations. The contents are downloaded in the training job and made available under the provided name (dictionary key).
In the example below, Sagemaker will download the complete content of the training data bucket, store it on the instance, and save its location in an environment variable called ```SM_CHANNEL_TRAIN```.

In [28]:
input_channels = {    
    "train": f"s3://{DEFAULT_BUCKET}"    
}
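
Inside the training job, a script can look up where a channel was downloaded by reading the corresponding environment variable. The following is a minimal sketch of that lookup. `channel_dir` is a hypothetical helper (the actual training script may do this differently), and the fallback path is the default location SageMaker uses for channel data.

```python
import os
from pathlib import Path


def channel_dir(name: str, environ=os.environ) -> Path:
    """Return the local directory of an input channel inside a training job."""
    # SageMaker exports one SM_CHANNEL_<NAME> variable per input channel.
    # The fallback is the default location SageMaker uses for channel data.
    default = f"/opt/ml/input/data/{name}"
    return Path(environ.get(f"SM_CHANNEL_{name.upper()}", default))


# Simulate the environment of a training job with a "train" channel
fake_env = {"SM_CHANNEL_TRAIN": "/opt/ml/input/data/train"}
train_dir = channel_dir("train", fake_env)
print(train_dir)
```

Passing the environment as an argument keeps the helper easy to test outside of a training job.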

We also specify where Sagemaker should store the results of the training job, i.e. the weights of the trained model. You will also see the link at the bottom of the training job overview page.

In [29]:
# we need to create our own s3 bucket if it doesn't exist yet:
sagemaker_bucket = f"sagemaker-{DEFAULT_REGION}-{account_id}"
create_encrypted_bucket(sagemaker_bucket)
s3_output_location = f"s3://{sagemaker_bucket}/{exp_name}"

2022-07-04 12:07:09,366 - root - INFO - Successfully created encrypted bucket: sagemaker-us-east-1-314026059811


In order to track which configuration led to which result, we want to attach our specified configuration to the training job. Therefore we must first load the configuration from the training script. Afterwards, we will extract the configuration which we explicitly specified and save it as a dictionary. Later on, we will pass this dictionary as hyperparameters to the training job.

The hyperparameter dictionary will look like this: 

```json
{
    "dataset_type": "'OnchoDataset'",
    "data_root": "data_folder",
    "data.train.type": "OnchoDataset",
    ...
    "evaluation.metric": "mAP",
    "runner.max_epochs": 4
}
```
You will find this configuration later in the training job overview in the AWS Sagemaker UI.


In [30]:
data_folder = str(PROJECT_DIR / 'data')
cfg, base_file = load_config(data_folder)
hyperparameters = extract_hyperparams(entry_point) # custom function to parse the training script and extract config
hyperparameters['base_file'] = base_file

Finally, we need to specify which metrics we want Sagemaker to automatically track. For this we need to set up [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) that will be applied to the logs.
The corresponding values will then be stored and made visible in the training job.

In [31]:
# Output format is:
# INFO:mmdet:Epoch [4][50/497]#011lr: 5.000e-03, eta: 0:10:23, time: 1.638, data_time: 1.433, memory: 1863, loss_rpn_cls: 0.0897, loss_rpn_bbox: 0.0781, loss_cls: 0.2336, acc: 90.8691, loss_bbox: 0.3404, loss: 0.7418
metrics = [
    # Raw strings avoid invalid escape sequence warnings for "\." and "\["
    {"Name": "train:loss_rpn_cls", "Regex": r"loss_rpn_cls: ([0-9\.]+)"},
    {"Name": "train:loss_rpn_bbox", "Regex": r"loss_rpn_bbox: ([0-9\.]+)"},
    {"Name": "train:loss_cls", "Regex": r"loss_cls: ([0-9\.]+)"},
    {"Name": "train:loss_bbox", "Regex": r"loss_bbox: ([0-9\.]+)"},
    {"Name": "train:loss", "Regex": r"loss: ([0-9\.]+)"},
    {"Name": "train:accuracy", "Regex": r"acc: ([0-9\.]+)"},
    # Capture only the number inside the brackets so the value is numeric
    {"Name": "train:epoch", "Regex": r"Epoch \[([0-9]+)\]"},
    {"Name": "val:epoch", "Regex": r"Epoch\(val\) \[([0-9]+)\]"},
    {"Name": "val:mAP", "Regex": r"mAP: ([0-9\.]+)"},
]
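
Before starting a job, it can be worth checking such patterns locally against a sample log line. The sketch below assumes that the first capture group of each regex holds the metric value. `extract_metrics` and `example_metrics` are hypothetical names, and only a subset of the patterns is shown.

```python
import re

sample_log = (
    "INFO:mmdet:Epoch [4][50/497]#011lr: 5.000e-03, eta: 0:10:23, time: 1.638, "
    "data_time: 1.433, memory: 1863, loss_rpn_cls: 0.0897, loss_rpn_bbox: 0.0781, "
    "loss_cls: 0.2336, acc: 90.8691, loss_bbox: 0.3404, loss: 0.7418"
)

example_metrics = [
    {"Name": "train:loss_cls", "Regex": r"loss_cls: ([0-9\.]+)"},
    {"Name": "train:loss", "Regex": r"loss: ([0-9\.]+)"},
    {"Name": "train:accuracy", "Regex": r"acc: ([0-9\.]+)"},
]


def extract_metrics(line: str, definitions: list) -> dict:
    """Apply each regex to a log line and return the captured values as floats."""
    values = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            values[definition["Name"]] = float(match.group(1))
    return values


print(extract_metrics(sample_log, example_metrics))
# {'train:loss_cls': 0.2336, 'train:loss': 0.7418, 'train:accuracy': 90.8691}
```

Note that ```loss: ``` only matches the total loss here, because in the other loss names the word "loss" is followed by an underscore rather than a colon.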

With all the preparations done, we can now create the *estimator object* that defines all the relevant settings, in particular the libraries that need to be installed. A few things to note:

- Since our object detection library MMDetection is based on *PyTorch*, we will use the ```PyTorch``` estimator. With it, we are using a PyTorch-based docker image which we extended for the challenge, so that we don't have to worry about installing any of the prerequisite libraries. Any additional requirement will be installed on the training instance after startup. If you need additional libraries, you can just add them to the file ```src/requirements.txt```.
- The training job will run our predefined entry point, i.e. the python script ```tutorial_4_training.py```. 
- When we start the training by calling ```estimator.fit()```, the whole content of the *src* directory will be uploaded to the training instance. This becomes relevant for importing other modules.
- If you don't want your job to finish, you can stop it in the UI. Be advised that you can only run one job at a time during the challenge.
- When your training job is completed, your model will be stored as a .tar file in S3. The ```s3_output_location``` determines the location. You will find it in the folder ```<your-training-job>/output```. You can download your model from there and test it locally. If there is no "output" folder, make sure to check the logs in the AWS console, since your training job probably failed.

In [12]:
estimator = PyTorch(
    entry_point=entry_point,             # This script will be run by the training job
    source_dir="../src",                 # All code in this folder will be copied over
    image_uri=f"954362353459.dkr.ecr.{DEFAULT_REGION}.amazonaws.com/sm-training-custom:torch-1.8.1-cu111-noGPL",
    role=role,
    output_path=s3_output_location,
    container_log_level=20,             # 10=debug, 20=info
    base_job_name=exp_name,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",     # a GPU instance
    volume_size=45,
    metric_definitions=metrics,
    hyperparameters=hyperparameters,
)
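
On the training instance, the SageMaker Python SDK (in script mode) passes the hyperparameters to the entry point as ```--key value``` command-line arguments and exposes the model output directory via the ```SM_MODEL_DIR``` environment variable. How ```tutorial_4_training.py``` consumes them is not shown in this notebook, so the following is only a sketch under that assumption. `parse_hyperparameters` is a hypothetical helper.

```python
import os
import sys


def parse_hyperparameters(argv: list) -> dict:
    """Turn ['--a', '1', '--b.c', 'x'] into {'a': '1', 'b.c': 'x'}."""
    params = {}
    tokens = iter(argv)
    for token in tokens:
        if token.startswith("--"):
            # The value follows its key; all values arrive as strings
            params[token[2:]] = next(tokens, None)
    return params


if __name__ == "__main__":
    hyperparameters = parse_hyperparameters(sys.argv[1:])
    # Files written to this directory are packed into model.tar.gz when the job ends
    model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    print(hyperparameters, model_dir)
```

Since all values arrive as strings, numeric settings such as ```runner.max_epochs``` need to be converted before use.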

After creating the estimator, we need to call the ```.fit``` method to start the training job. As this might take a while, we set ```wait=False``` so our notebook will not wait for the training job to finish and we can continue working.

In [13]:
estimator.fit(
    input_channels,
    wait=False,           # Whether or not the notebook should wait for the job to finish. By setting it to False we can continue working while the job runs on another machine.
)

2022-06-06 18:22:25,252 - sagemaker.image_uris - INFO - Defaulting to the only supported framework/algorithm version: latest.
2022-06-06 18:22:25,273 - sagemaker.image_uris - INFO - Ignoring unnecessary instance type: None.
2022-06-06 18:22:25,576 - sagemaker - INFO - Creating training-job with name: tutorial-4-training-2022-06-06-18-22-25-249


In [None]:
# save the model location to the filesystem so that we can use it later
model_location = f'{s3_output_location}/{estimator._hyperparameters["sagemaker_job_name"]}/output/model.tar.gz'
print(model_location)

with open(f'{PROJECT_DIR}/model_location.txt', 'w+') as f:
    f.write(model_location)

One important question that we haven't answered so far is *How can I see how the training is doing?* After all, you need to see whether there are any issues or if the training is going well.
This can be done as follows:

1. Go to the AWS Console
2. Search for AWS Sagemaker and go to the Service
3. On the left pane click on Training -> Training Jobs
4. Your training job should be the one at the very top. Click it.
5. The details page has a link *View logs* in the *Monitor* section (scroll down). Click this to see the logs. 
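
If you prefer to check the job from code rather than the console, the status can also be queried through the SageMaker API. The sketch below uses the ```describe_training_job``` call of the boto3 SageMaker client. `job_status` and `FakeSageMakerClient` are hypothetical names, and the fake client only exists so that the example runs without AWS access.

```python
def job_status(sm_client, job_name: str) -> str:
    """Return 'InProgress', 'Completed', 'Failed', 'Stopping' or 'Stopped'."""
    response = sm_client.describe_training_job(TrainingJobName=job_name)
    return response["TrainingJobStatus"]


class FakeSageMakerClient:
    """Stand-in for boto3.client('sagemaker') so the sketch runs offline."""

    def describe_training_job(self, TrainingJobName: str) -> dict:
        return {"TrainingJobName": TrainingJobName, "TrainingJobStatus": "InProgress"}


print(job_status(FakeSageMakerClient(), "tutorial-4-training-2022-06-06-18-22-25-249"))
```

In the notebook you would pass the ```sm_client``` created earlier together with the job name printed in the log output of ```estimator.fit()```.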

The default configuration for this tutorial will train your model for 4 epochs. This should take around 3 hours and give you an indication of whether an idea is working or not.
Training for longer does not necessarily help. In our tests, results rarely improved after ~12 epochs, even with different configurations. **To save time and money, we suggest starting with a few epochs and only training models for longer when you are certain that there is a significant benefit.** After all, you only have a limited budget.
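
To get a feeling for the budget impact of longer runs, a rough estimate can be derived from the numbers above (4 epochs in about 3 hours). The sketch assumes that training time scales linearly with the number of epochs and ignores startup, data download and evaluation overhead. The hourly price is not given in this tutorial, so `hourly_rate` has to be looked up for your instance type. Both function names are hypothetical.

```python
def estimate_training_hours(epochs: int, hours_per_epoch: float = 3 / 4) -> float:
    """Linear estimate based on the tutorial default: 4 epochs in about 3 hours."""
    return epochs * hours_per_epoch


def estimate_cost(epochs: int, hourly_rate: float) -> float:
    """Cost estimate; hourly_rate is the price of your instance type per hour."""
    return estimate_training_hours(epochs) * hourly_rate


for epochs in (4, 8, 12):
    print(f"{epochs} epochs: ~{estimate_training_hours(epochs):.1f} h")
# 4 epochs: ~3.0 h
# 8 epochs: ~6.0 h
# 12 epochs: ~9.0 h
```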

# A new submission with our newly trained model!

After the training job is finished - which should take around 3 hours! - you can download the results, load the model and create a new submission!

*Note: Check the AWS training job console to see the status of your training job*

First we need to specify where the results were stored. Since we saved the model location to the local filesystem, we only need to read it. If that didn't work, make sure to check the Sagemaker Training section in the AWS console. The model location will look similar to this: ```s3://sagemaker-us-east-1-954362353459/tutorial-4-training/tutorial-4-training-2022-06-06-18-22-25-249/output/model.tar.gz```

In [12]:
# read the model location from the filesystem
with open(f'{PROJECT_DIR}/model_location.txt', 'r') as f:
    model_location = f.read()

We provide a custom function that downloads the results to our local environment:

In [13]:
local_model_dir = download_and_extract_model(model_uri=model_location, local_dir='data')
local_model_dir
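
The implementation of ```download_and_extract_model``` is not shown here. Conceptually, such a helper has to split the S3 URI into bucket and key, download the archive (for example with the ```download_file``` method of the boto3 S3 client) and unpack it. The sketch below covers only the two steps that work without AWS access. `split_s3_uri` and `extract_model` are hypothetical helpers, not the challenge code.

```python
import tarfile
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def extract_model(archive: Path, target_dir: Path) -> list:
    """Unpack a model.tar.gz and return the names of the extracted files."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    return sorted(p.name for p in target_dir.iterdir())


# Build a tiny archive so the sketch can run without AWS access
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "epoch_1.pth").write_text("dummy weights")
    archive = tmp / "model.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(tmp / "epoch_1.pth", arcname="epoch_1.pth")
    print(extract_model(archive, tmp / "extracted"))  # ['epoch_1.pth']
```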

Navigate to the *data* directory and verify that everything went well. You should see the following files:

- *epoch_1.pth*
- *epoch_2.pth*
- *epoch_3.pth*
- *epoch_4.pth*
- *model.tar.gz*
- *None.log.json*

The ```*.pth``` files contain the weights of the model. To use them, we need to

- load the corresponding config (we already did this above),
- specify which checkpoint we want to use, and
- load the names of the image files for which we want to create predictions.

In [14]:
checkpoint = f'{local_model_dir}/epoch_4.pth' # Select one of the model checkpoints to load in
file_names = pd.read_csv(f'{data_folder}/test_files.csv', sep=';', header=None)[0].values
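
Instead of hard-coding the checkpoint name, the available checkpoints can also be listed and sorted by epoch. This is a minimal sketch, assuming the files follow the ```epoch_<n>.pth``` naming shown above. `list_checkpoints` is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path


def list_checkpoints(model_dir) -> list:
    """Return the epoch_<n>.pth files in model_dir, sorted by epoch number."""
    found = []
    for path in Path(model_dir).glob("epoch_*.pth"):
        match = re.fullmatch(r"epoch_(\d+)\.pth", path.name)
        if match:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


# Demonstrate on dummy files; a plain string sort would put epoch_10 before epoch_2
with tempfile.TemporaryDirectory() as tmp:
    for epoch in (1, 2, 10):
        (Path(tmp) / f"epoch_{epoch}.pth").touch()
    print([p.name for p in list_checkpoints(tmp)])
# ['epoch_1.pth', 'epoch_2.pth', 'epoch_10.pth']
```

With the real model directory, ```list_checkpoints(local_model_dir)[-1]``` would give the checkpoint of the last epoch, and the other entries can be used for the exercise at the end of this tutorial.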

With this we are ready to create our second submission, similar to how we did it in tutorial 3:

In [15]:
prediction_df = create_predictions(file_names, cfg, checkpoint)

load checkpoint from local path: /root/data/AmazonSageMaker-gdsc-tutorials/data/tutorial-4-training-2022-05-27-15-05-00-955/epoch_4.pth
2022-05-27 19:47:51,431 - detection_util - INFO - Creating predictions for 73 files


  0%|          | 0/73 [00:00<?, ?it/s]

2022-05-27 19:47:51,437 - detection_util - INFO - Processing file: 100_D.jpg




  1%|▏         | 1/73 [00:07<08:24,  7.01s/it]

...

 99%|█████████▊| 72/73 [07:47<00:05,  5.94s/it]

2022-05-27 19:55:38,612 - detection_util - INFO - Processing file: 99_A.jpg


100%|██████████| 73/73 [07:53<00:00,  6.49s/it]


Note how the creation of the predictions takes a lot longer than in the last tutorial. This is due to the instance type. In tutorial 3, we were using an instance with a GPU. In this tutorial, we are not.

In [16]:
prediction_df.to_csv(f'{data_folder}/results_tutorial4_epoch_4.csv', sep=';')

After uploading them to the [GDSC website](https://gdsc.ce.capgemini.com/) you should get a score of around 75.

**Exercise:**
- Load the weights from a different checkpoint and use them to create a new submission. Do the results improve?

## Summary

In this tutorial, we took our knowledge from the last tutorial and used it to train a model in the cloud! ☁️☁️  <br>
Awesome, right? Let's quickly recap what we did. We

* converted tutorial 3 into a training job in the cloud, using the Sagemaker SDK.
* trained a model on the full data!
* downloaded the trained model and created a second submission.

And that's it for this tutorial already. Nice job 🚀 Next up: Tutorial 5 - Where to go from here