# Step 1: Experiment in a notebook
In this step you run data processing, model training, and evaluation locally in the notebook. The `sagemaker` package is used only to move data to and from S3 and to track experiments; training itself runs in the notebook kernel.

![](img/six-steps-1.png)

<div class="alert alert-info"> Make sure you are using the <code>Data Science 3.0</code> image in Studio for this notebook.</div>



In [1]:
%pip install -q transformers==4.35.2

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 0.8.2 requires transformers[sentencepiece]<4.32.0,>=4.31.0, but you have transformers 4.35.2 which is incompatible.
Note: you may need to restart the kernel to use updated packages.




In [1]:
import pandas as pd
import numpy as np 
import json
import joblib
import sagemaker
import boto3
import os
import matplotlib.pyplot as plt
from time import gmtime, strftime, sleep
from sagemaker.experiments.run import Run, load_run

sagemaker.__version__

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


'2.214.3'

In [2]:
%store -r 

%store

try:
    initialized
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN 00-start-here notebook   ")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")

Stored variables and their in-db values:
bucket_name                         -> 'sagemaker-ap-northeast-1-250506505253'
bucket_prefix                       -> 'blip-vqa'
dataset_file_local_path             -> 'data/bank-additional/bank-additional-full.csv'
domain_id                           -> 'd-rvigbtfoquob'
experiment_name                     -> 'mlops-blip-vqa-experiment-08-06-37-52'
initialized                         -> True
input_s3_url                        -> 's3://sagemaker-ap-northeast-1-250506505253/blip-v
region                              -> 'ap-northeast-1'
sm_role                             -> 'arn:aws:iam::250506505253:role/service-role/Amazo
target_col                          -> 'y'
user_profile_name                   -> None


In [3]:
session = sagemaker.Session()
sm = session.sagemaker_client

## Load data
The following cell carries the `parameters` cell tag, which enables parametrization when the notebook runs headlessly as a [SageMaker Notebook-based workflow](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html). Refer to the section **Run the notebook as a SageMaker job** for details and an example. You can ignore this for now.

In [4]:
# This cell is tagged with the `parameters` tag; its values are overridden if the notebook is executed headlessly
file_source = "EFS"
file_name = "IconDomainVQAData.zip"
input_path = "./data" 
output_path = "./data/processed_training_data"

In [5]:
# If you run the notebook as a job (non-interactively, or headlessly), it cannot access the Studio EFS volume, so download the dataset from S3 instead
# See the section "Run the notebook as a SageMaker job" for more details
if file_source != "EFS":
    session.download_data(
        path=os.path.join(input_path, ""), 
        bucket=bucket_name,
        key_prefix=f"{bucket_prefix}/input/{file_name}"
    )
    import zipfile
    with zipfile.ZipFile(os.path.join(input_path, file_name), "r") as z:
        print("Unzipping VQA data...")
        # Extract next to the downloaded archive so later cells find the data under input_path
        z.extractall(input_path)
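
The download-and-extract step can be factored into a small helper so that the archive always lands in a directory you choose, whether the notebook runs interactively or as a job. The following is a minimal sketch using only the standard library; the function name `extract_archive` is an assumption made for illustration and is not part of the workshop code or the SageMaker SDK.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract a zip archive into target_dir and return the member names.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    target = Path(target_dir)
    # Create the target directory if it does not exist yet
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target)
        return archive.namelist()
```

In this notebook it could be called as `extract_archive(os.path.join(input_path, file_name), input_path)`.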

## EDA
Let's do some exploratory data analysis on this dataset.

## Create an experiment
You can use [Amazon SageMaker Experiments Python SDK](https://sagemaker.readthedocs.io/en/stable/experiments/index.html) to organize all your model development work and track all model runs as `experiment runs`.

[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) automatically tracks the inputs, parameters, configurations, and results of your iterations as `runs`.

Experiments are organized into `runs`, and runs are grouped into `run groups`:

- `Experiment`: A collection of runs that are grouped together. An experiment can include runs of multiple types, and those runs can be initiated from anywhere using the SageMaker Python SDK.
- `Run`: Each execution step of a model training process. A run consists of all the inputs, parameters, configurations, and results for one iteration of model training. Custom parameters and metrics can be logged using the `log_parameter`, `log_parameters`, and `log_metric` functions. Custom input and output can be logged using the `log_file` function.
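
The logging calls listed above can be wrapped in a small function and tried out without an AWS connection. The sketch below is only an illustration: `RecordingRun` is a hypothetical in-memory stand-in that mimics `log_parameters` and `log_metric`, `log_epoch` is an assumed helper name, and the loss values are placeholders rather than results from this notebook.

```python
class RecordingRun:
    """Hypothetical in-memory stand-in that mimics two logging calls of a run."""

    def __init__(self):
        self.parameters = {}
        self.metrics = []

    def log_parameters(self, parameters):
        self.parameters.update(parameters)

    def log_metric(self, name, value, step=None):
        self.metrics.append((name, value, step))


def log_epoch(run, epoch, train_loss, val_loss):
    """Log per-epoch losses to any object exposing log_metric(name, value, step)."""
    run.log_metric(name="train_loss", value=train_loss, step=epoch)
    run.log_metric(name="val_loss", value=val_loss, step=epoch)


dry_run = RecordingRun()
dry_run.log_parameters({"batch_size": 2, "epochs": 1})
# Placeholder loss values, for illustration only
log_epoch(dry_run, epoch=0, train_loss=1.5, val_loss=1.7)
print(dry_run.metrics)
```

Because `log_epoch` only relies on the `log_metric` method, the same function should also accept the `run` object yielded by `with Run(...) as run:`.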

In [6]:
experiment_name = f"mlops-blip-vqa-experiment-{strftime('%d-%H-%M-%S', gmtime())}"

In [7]:
%store experiment_name

Stored 'experiment_name' (str)
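
Names built from a timestamp alone can collide when two experiments or runs start within the same second. One possible way to reduce that risk is to append a short random suffix. This is a sketch under that assumption; `make_run_name` is a hypothetical helper, not something the SageMaker SDK provides.

```python
import uuid
from time import gmtime, strftime


def make_run_name(prefix):
    """Return '<prefix>-<UTC timestamp>-<8 hex chars>'.

    Hypothetical helper; the random suffix keeps names distinct even when
    two runs start within the same second.
    """
    timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
    return f"{prefix}-{timestamp}-{uuid.uuid4().hex[:8]}"


print(make_run_name("training-epoch0"))
```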


## Split data

In [8]:
from pathlib import Path 

from datasets import load_dataset

INFO:datasets:PyTorch version 2.0.0.post101 available.
INFO:datasets:TensorFlow version 2.12.1 available.
INFO:datasets:JAX version 0.4.20 available.


In [9]:
ori_train_path = Path(input_path) / 'IconDomainVQAData/train.jsonl'
train_percent = 90

training_dataset = load_dataset("json", data_files=str(ori_train_path), split=f"train[:{train_percent}%]")
valid_dataset = load_dataset("json", data_files=str(ori_train_path), split=f"train[{train_percent}%:]")
print("Training sets: {} - Validating set: {}".format(len(training_dataset), len(valid_dataset)))

Training sets: 13096 - Validating set: 1455
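
Both splits are sliced from the same file: the first 90% of the rows go to training and the remainder to validation, so the two sets do not overlap. As a quick check of the row counts printed above, the arithmetic can be reproduced as follows, assuming the slice boundary is rounded to the nearest row (an assumption that is consistent with the printed counts).

```python
total_rows = 13096 + 1455   # rows reported for the two splits above
train_percent = 90

# Assumption: the percent boundary is rounded to the closest row index
boundary = round(total_rows * train_percent / 100)
train_rows = boundary
valid_rows = total_rows - boundary

print(train_rows, valid_rows)
```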


In [10]:
training_dataset

Dataset({
    features: ['question', 'answer', 'ques_type', 'grade', 'label', 'pid', 'unit', 'hint'],
    num_rows: 13096
})

In [11]:
valid_dataset

Dataset({
    features: ['question', 'answer', 'ques_type', 'grade', 'label', 'pid', 'unit', 'hint'],
    num_rows: 1455
})

In [12]:
# Save data to Studio filesystem
training_dataset.save_to_disk(Path(output_path) / 'train')
valid_dataset.save_to_disk(Path(output_path) / 'val')

Saving the dataset (0/1 shards):   0%|          | 0/13096 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1455 [00:00<?, ? examples/s]

## Model finetuning and validation

In [13]:
from transformers import BlipProcessor, BlipForQuestionAnswering
from datasets import load_dataset
import torch
from PIL import Image
from torch.utils.data import DataLoader
from tqdm import tqdm

torch.cuda.empty_cache()
torch.manual_seed(42)

2024-05-08 06:51:22.757828: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<torch._C.Generator at 0x7fa13c544db0>

In [14]:
class VQADataset(torch.utils.data.Dataset):
    """Wraps an IconDomainVQAData split: one image, question, and answer per item."""

    def __init__(self, dataset, processor, data_root):
        self.dataset = dataset
        self.processor = processor
        self.data_root = data_root  # folder containing one <pid>/image.png per example

    def __len__(self):
        # Capped at 4 examples to keep this demo run short;
        # return len(self.dataset) to use the full split.
        return 4

    def __getitem__(self, idx):
        # get image + text
        example = self.dataset[idx]
        question = example['question']
        answer = example['answer']
        image_id = example['pid']
        image_path = os.path.join(self.data_root, f"{image_id}/image.png")
        image = Image.open(image_path).convert("RGB")

        encoding = self.processor(image, question, padding="max_length", truncation=True, return_tensors="pt")
        # Pad or truncate the answer to a fixed length of 8 tokens
        labels = self.processor.tokenizer.encode(
            answer, max_length=8, padding="max_length", truncation=True, return_tensors='pt'
        )
        encoding["labels"] = labels
        # remove the batch dimension added by return_tensors="pt"
        for k, v in encoding.items():
            encoding[k] = v.squeeze(0)
        return encoding

In [15]:
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-capfilt-large")
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-capfilt-large")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

BlipForQuestionAnswering(
  (vision_model): BlipVisionModel(
    (embeddings): BlipVisionEmbeddings(
      (patch_embedding): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    )
    (encoder): BlipEncoder(
      (layers): ModuleList(
        (0-11): 12 x BlipEncoderLayer(
          (self_attn): BlipAttention(
            (dropout): Dropout(p=0.0, inplace=False)
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (projection): Linear(in_features=768, out_features=768, bias=True)
          )
          (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): BlipMLP(
            (activation_fn): GELUActivation()
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((768,), eps=1e-05, e

### Train a model
Use the [`Run`](https://sagemaker.readthedocs.io/en/stable/experiments/sagemaker.experiments.html#run) class to log model metrics.

In [16]:
train_dataset = VQADataset(dataset=training_dataset,
                          processor=processor,
                          data_root='data/IconDomainVQAData/train_fill_in_blank/train_fill_in_blank')
valid_dataset = VQADataset(dataset=valid_dataset,
                          processor=processor,
                          data_root='data/IconDomainVQAData/train_fill_in_blank/train_fill_in_blank')

batch_size = 2
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)


optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9, last_epoch=-1, verbose=False)

num_epochs = 1
patience = 10
min_eval_loss = float("inf")
early_stopping_hook = 0
tracking_information = []
scaler = torch.cuda.amp.GradScaler()
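
With an exponential schedule, the learning rate in epoch `t` is the initial rate multiplied by `gamma ** t`. The short sketch below works this out in plain Python for the values configured above (`lr=4e-5`, `gamma=0.9`), assuming `scheduler.step()` is called once at the end of every epoch. It only illustrates the arithmetic and does not call PyTorch.

```python
initial_lr = 4e-5   # lr passed to AdamW above
gamma = 0.9         # gamma passed to ExponentialLR above

# Learning rate in effect during each of the first five epochs,
# assuming the scheduler is stepped once per epoch.
schedule = [initial_lr * gamma ** epoch for epoch in range(5)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```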



In [20]:
hyper_params = {
    'batch_size': batch_size,
    'epochs': num_epochs,
    'patience': patience
}

In [None]:
# Include day and hour so that the suffix is a full timestamp
run_suffix = strftime('%Y-%m-%d-%H-%M-%S', gmtime())

for epoch in range(num_epochs):
    epoch_loss = 0
    model.train()

    run_name = f"training-epoch{epoch}-{run_suffix}"
    
    with Run(experiment_name=experiment_name,
             run_name=run_name,
             run_display_name="test-experiment-name",
             sagemaker_session=session) as run:

        for idx, batch in zip(tqdm(range(len(train_dataloader)), desc='Training batch: ...'), train_dataloader):
            input_ids = batch.pop('input_ids').to(device)
            pixel_values = batch.pop('pixel_values').to(device)
            attention_masked = batch.pop('attention_mask').to(device)
            labels = batch.pop('labels').to(device)
            
            with torch.amp.autocast(device_type='cuda', dtype=torch.float16):
                outputs = model(input_ids=input_ids,
                            pixel_values=pixel_values,
                            # attention_mask=attention_masked,
                            labels=labels)
                
            loss = outputs.loss
            epoch_loss += loss.item()
            # loss.backward()
            # optimizer.step()
            optimizer.zero_grad()
            
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        
        model.eval()
        eval_loss = 0
        # Gradients are not needed for validation; disabling them saves memory
        with torch.no_grad():
            for idx, batch in zip(tqdm(range(len(valid_dataloader)), desc='Validating batch: ...'), valid_dataloader):
                input_ids = batch.pop('input_ids').to(device)
                pixel_values = batch.pop('pixel_values').to(device)
                attention_masked = batch.pop('attention_mask').to(device)
                labels = batch.pop('labels').to(device)

                with torch.amp.autocast(device_type='cuda', dtype=torch.float16):
                    outputs = model(input_ids=input_ids,
                                pixel_values=pixel_values,
                                attention_mask=attention_masked,
                                labels=labels)

                loss = outputs.loss
                eval_loss += loss.item()
    
        # tracking_information.append((epoch_loss/len(train_dataloader), eval_loss/len(valid_dataloader), optimizer.param_groups[0]["lr"]))
        run.log_parameters(hyper_params)
        run.log_metric(name="train_loss", value = epoch_loss/len(train_dataloader), step=epoch)
        run.log_metric(name="val_loss", value = eval_loss/len(valid_dataloader), step=epoch)
        run.log_metric(name="lr", value = optimizer.param_groups[0]["lr"], step=epoch)
        print("Epoch: {} - Training loss: {} - Eval Loss: {} - LR: {}".format(epoch+1, epoch_loss/len(train_dataloader), eval_loss/len(valid_dataloader), optimizer.param_groups[0]["lr"]))
        scheduler.step()
        if eval_loss < min_eval_loss:
            model.save_pretrained("Model/blip-saved-model", from_pt=True) 
            print("Saved model to Model/blip-saved-model")
            min_eval_loss = eval_loss
            early_stopping_hook = 0
        else:
            early_stopping_hook += 1
            if early_stopping_hook > patience:
                break
    
# pickle.dump(tracking_information, open("tracking_information.pkl", "wb"))
print("The fine-tuning process is done!")

Training batch: ...: 100%|██████████| 2/2 [00:30<00:00, 15.38s/it]
Validating batch: ...:  50%|█████     | 1/2 [00:05<00:05,  5.08s/it]
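
The checkpoint-and-stop logic at the end of the loop can also be isolated into a small class, which makes it easier to test on its own. The following is a sketch only: `EarlyStopping` is a hypothetical helper that mirrors the counter logic above (stop once the number of non-improving epochs exceeds `patience`), and the loss values are placeholders, not results from this notebook.

```python
class EarlyStopping:
    """Track the best validation loss and signal when to stop.

    Hypothetical helper mirroring the counter logic in the training loop.
    """

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Return (improved, should_stop) for the latest validation loss."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs > self.patience


stopper = EarlyStopping(patience=1)
# Placeholder losses for illustration only
decisions = [stopper.update(loss) for loss in [0.9, 0.8, 0.85, 0.95]]
print(decisions)
```

In the loop, `improved` would trigger `model.save_pretrained(...)` and `should_stop` would trigger `break`.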

## Explore experiment runs with Studio UX
You can see all logged metrics, parameters, and artifacts in the Studio UX under **SageMaker Home** > **Experiments**.

For example, select your experiment:

![](img/experiment-and-runs.png)

In the experiment list, select the experiment to display a list of the runs in the experiment:

![](img/runs.png)

You can select runs you would like to analyse and click **Analyze**. A new window with selected runs opens:

![](img/run-analyze.png)

Now you can analyse the runs, compare the data, and create charts:

![](img/experiments-run-analysis.png)

Refer to the blog post [Next generation Amazon SageMaker Experiments – Organize, track, and compare your machine learning trainings at scale](https://aws.amazon.com/blogs/machine-learning/next-generation-amazon-sagemaker-experiments-organize-track-and-compare-your-machine-learning-trainings-at-scale/) for more examples and details on SageMaker Experiments.

## Use experiment analytics
You can use the [analytics features](https://sagemaker.readthedocs.io/en/stable/api/training/analytics.html#analytics) of the Experiment SDK to query and compare the runs and identify the best model produced by your experiments.

Refer to these [notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-experiments) for hands-on examples.

## Optional: Run the notebook as a SageMaker job
Sometimes you might want to run your notebooks as non-interactive, scheduled jobs. Studio provides fast and simple tools, built on the existing Amazon EventBridge, SageMaker Training, and SageMaker Pipelines services, to help you schedule your notebook jobs interactively. You don’t have to craft your own custom solution or enlist features from other services that may require additional overhead in time and costs to deploy.

You can run your notebook as a SageMaker job on demand or on any schedule you choose. You can also run multiple notebooks in parallel, and parametrize cells in your notebooks.

### Adapt the notebook to run headlessly
A headless notebook runs in a shell outside of the Studio environment. Therefore, your code in the notebook cannot depend on or access the Studio local storage, environment variables, or Python store. You must accordingly change any code which uses the local Studio environment.

### How to run
Follow the instructions in [Notebook-based Workflows](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html) in the Developer Guide to run this notebook in non-interactive mode as a SageMaker job:
1. [Configure](https://docs.aws.amazon.com/sagemaker/latest/dg/scheduled-notebook-policies.html) the trust policy and additional IAM permissions for the Studio execution role. If you run this notebook in the domain in the AWS-preprovisioned account, the required permissions are automatically deployed
2. Provide the parameters as specified below
3. Run the notebook on-demand or schedule a job
4. Explore the results

### Set parameters

In [35]:
# output the name of the S3 bucket used by SageMaker – you need this value as bucket_name parameter
print(bucket_name)

sagemaker-us-east-1-683373171484


In [36]:
# If running interactively, upload data to S3 to have it here for a headless run
if file_source == 'EFS':
    input_s3_url = session.upload_data(
        path=os.path.join(input_path, file_name),
        bucket=bucket_name,
        key_prefix=f"{bucket_prefix}/input"
    )
    
    print(input_s3_url)

s3://sagemaker-us-east-1-683373171484/from-idea-to-prod/xgboost/input/bank-additional-full.csv


To parameterize your notebook, you [set](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run-troubleshoot-override.html) the tag `parameters` on a single cell in your notebook, which marks it as the "parameter cell". At runtime, SageMaker notebook execution inserts a newly generated cell directly after the cell tagged with `parameters`. The generated cell contains code that sets the parameters to the values you specify when you start an execution job.
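
The effect of the inserted cell can be pictured as plain variable reassignment: the defaults in the tagged cell run first, and the generated cell then overrides them. The sketch below simulates this in a single block. The exact contents of the generated cell are an assumption; only the variable names and paths are taken from this notebook.

```python
# Cell tagged `parameters`: defaults used for interactive runs
file_source = "EFS"
input_path = "./data"
output_path = "./data/processed_training_data"

# Sketch of the generated cell inserted right after it (exact contents are an assumption)
file_source = "S3"
input_path = "/opt/ml/input/data/sagemaker_headless_execution"
output_path = "/opt/ml/output/data"

# Every later cell sees the overridden values
print(file_source, input_path, output_path)
```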

The notebook execution job has no access to the Studio EFS volume. Any data you need to pass to the notebook must be copied to an S3 bucket, where the notebook can access it.

To run this notebook as a SageMaker job, choose the **Create a notebook job** icon in the notebook taskbar: 

![](img/notebook-as-sm-job-run.png)

Complete the popup form.

![](img/notebook-as-sm-job-parameters.png)

Set the following parameters to the specified values in the **Parameter** section of the form:

```
file_source = S3
input_path = /opt/ml/input/data/sagemaker_headless_execution 
output_path = /opt/ml/output/data
bucket_name = SET TO YOUR SAGEMAKER BUCKET NAME
bucket_prefix = blip-vqa
```

Select **Run now** or **Run on a schedule** and choose **Create**.

You can also [create a notebook job programmatically with SageMaker Python SDK](https://docs.aws.amazon.com/sagemaker/latest/dg/create-notebook-auto-run-sdk.html). 

---

## Continue with step 2
Open the step 2 [notebook](02-sagemaker-containers.ipynb).

## Further development ideas for your real-world projects
- Try different models, for example some of the [SageMaker built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html), such as [CatBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/catboost.html), [AutoGluon-Tabular](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html), or [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html)
- Try [SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/) to automatically explore different solutions to find the best model. Refer to this hands-on tutorial: [Automatically Create Machine Learning Models](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-automatically-create-models/)
- Implement batch inference using [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html)

## Additional resources
- [Build and Train a Machine Learning Model Locally](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-build-model-locally/)
- [Amazon SageMaker XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html)
- [Automatically Create Machine Learning Models](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-automatically-create-models/)
- [Operationalize your Amazon SageMaker Studio notebooks as scheduled notebook jobs](https://aws.amazon.com/blogs/machine-learning/operationalize-your-amazon-sagemaker-studio-notebooks-as-scheduled-notebook-jobs/)
- [Dataset transformations](https://scikit-learn.org/stable/data_transforms.html)
- [Extracting, transforming and selecting features](https://spark.apache.org/docs/latest/ml-features.html)

# Shutdown kernel

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

# Step 1: Experiment in a notebook
In this step you run data processing, model training, and evaluation locally in the notebook. The `sagemaker` package is used only to move data to and from S3 and to track experiments; training itself runs in the notebook kernel.

![](img/six-steps-1.png)

<div class="alert alert-info"> Make sure you are using the <code>Data Science 3.0</code> image in Studio for this notebook.</div>



In [1]:
%pip install -q transformers==4.35.2

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 0.8.2 requires transformers[sentencepiece]<4.32.0,>=4.31.0, but you have transformers 4.35.2 which is incompatible.
Note: you may need to restart the kernel to use updated packages.




In [5]:
import pandas as pd
import numpy as np 
import json
import joblib
import sagemaker
import boto3
import os
import matplotlib.pyplot as plt
from time import gmtime, strftime, sleep
from sagemaker.experiments.run import Run, load_run

sagemaker.__version__

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


'2.214.3'

In [6]:
%store -r 

%store

try:
    initialized
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN 00-start-here notebook   ")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")

Stored variables and their in-db values:
bucket_name                         -> 'sagemaker-ap-northeast-1-250506505253'
bucket_prefix                       -> 'blip-vqa'
dataset_file_local_path             -> 'data/bank-additional/bank-additional-full.csv'
domain_id                           -> 'd-rvigbtfoquob'
experiment_name                     -> 'mlops-blip-vqa-experiment-08-04-34-09'
initialized                         -> True
input_s3_url                        -> 's3://sagemaker-ap-northeast-1-250506505253/blip-v
region                              -> 'ap-northeast-1'
sm_role                             -> 'arn:aws:iam::250506505253:role/service-role/Amazo
target_col                          -> 'y'
user_profile_name                   -> None


In [7]:
session = sagemaker.Session()
sm = session.sagemaker_client

## Load data
The following cell carries the `parameters` cell tag, which enables parametrization when the notebook runs headlessly as a [SageMaker Notebook-based workflow](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html). Refer to the section **Run the notebook as a SageMaker job** for details and an example. You can ignore this for now.

In [5]:
# This cell is tagged with the `parameters` tag; its values are overridden if the notebook is executed headlessly
file_source = "EF"
file_name = "IconDomainVQAData.zip"
input_path = "./data" 
output_path = "./data/processed_training_data"

In [6]:
# If you run the notebook as a job (non-interactively, or headlessly), it cannot access the Studio EFS volume, so download the dataset from S3 instead
# See the section "Run the notebook as a SageMaker job" for more details
if file_source != "EFS":
    session.download_data(
        path=os.path.join(input_path, ""), 
        bucket=bucket_name,
        key_prefix=f"{bucket_prefix}/input/{file_name}"
    )
    import zipfile
    with zipfile.ZipFile(os.path.join(input_path, file_name), "r") as z:
        print("Unzipping VQA data...")
        # Extract next to the downloaded archive so later cells find the data under input_path
        z.extractall(input_path)

Unzipping VQA data...


## EDA
Let's do some exploratory data analysis on this dataset.

## Create an experiment
You can use [Amazon SageMaker Experiments Python SDK](https://sagemaker.readthedocs.io/en/stable/experiments/index.html) to organize all your model development work and track all model runs as `experiment runs`.

[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) automatically tracks the inputs, parameters, configurations, and results of your iterations as `runs`.

Experiments are organized into `runs`, and runs are grouped into `run groups`:

- `Experiment`: A collection of runs that are grouped together. An experiment can include runs of multiple types, and those runs can be initiated from anywhere using the SageMaker Python SDK.
- `Run`: Each execution step of a model training process. A run consists of all the inputs, parameters, configurations, and results for one iteration of model training. Custom parameters and metrics can be logged using the `log_parameter`, `log_parameters`, and `log_metric` functions. Custom input and output can be logged using the `log_file` function.

In [7]:
experiment_name = f"mlops-blip-vqa-experiment-{strftime('%d-%H-%M-%S', gmtime())}"

In [8]:
%store experiment_name

Stored 'experiment_name' (str)


## Split data

In [9]:
from pathlib import Path 

from datasets import load_dataset

INFO:datasets:PyTorch version 2.0.0.post101 available.
INFO:datasets:TensorFlow version 2.12.1 available.
INFO:datasets:JAX version 0.4.20 available.


In [10]:
ori_train_path = Path(input_path) / 'IconDomainVQAData/train.jsonl'
train_percent = 90

training_dataset = load_dataset("json", data_files=str(ori_train_path), split=f"train[:{train_percent}%]")
valid_dataset = load_dataset("json", data_files=str(ori_train_path), split=f"train[{train_percent}%:]")
print("Training sets: {} - Validating set: {}".format(len(training_dataset), len(valid_dataset)))

Generating train split: 0 examples [00:00, ? examples/s]

Training sets: 13096 - Validating set: 1455


In [11]:
# Save data to Studio filesystem
training_dataset.save_to_disk(Path(output_path) / 'train')
valid_dataset.save_to_disk(Path(output_path) / 'val')

Saving the dataset (0/1 shards):   0%|          | 0/13096 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1455 [00:00<?, ? examples/s]

## Model finetuning and validation

In [1]:
from transformers import BlipProcessor, BlipForQuestionAnswering
from datasets import load_dataset
import torch
from PIL import Image
from torch.utils.data import DataLoader
from tqdm import tqdm

torch.cuda.empty_cache()
torch.manual_seed(42)

2024-05-08 04:40:21.929185: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<torch._C.Generator at 0x7f007938c3f0>

In [2]:
class VQADataset(torch.utils.data.Dataset):
    """VQA (v2) dataset."""

    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        # return len(self.dataset)
        return 100

    def __getitem__(self, idx):
        # get image + text
        question = self.dataset[idx]['question']
        answer = self.dataset[idx]['answer']
        image_id = self.dataset[idx]['pid']
        image_path = f"IconDomainVQAData/train_fill_in_blank/train_fill_in_blank/{image_id}/image.png"
        image = Image.open(image_path).convert("RGB")
        text = question
        
        encoding = self.processor(image, text, padding="max_length", truncation=True, return_tensors="pt")
        # Pad or truncate the answer to a fixed length of 8 tokens
        labels = self.processor.tokenizer.encode(
            answer, max_length=8, padding="max_length", truncation=True, return_tensors='pt'
        )
        encoding["labels"] = labels
        # remove batch dimension
        for k,v in encoding.items():  encoding[k] = v.squeeze()
        return encoding

In [3]:
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-capfilt-large")
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-capfilt-large")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

preprocessor_config.json:   0%|          | 0.00/445 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/524 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

BlipForQuestionAnswering(
  (vision_model): BlipVisionModel(
    (embeddings): BlipVisionEmbeddings(
      (patch_embedding): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    )
    (encoder): BlipEncoder(
      (layers): ModuleList(
        (0-11): 12 x BlipEncoderLayer(
          (self_attn): BlipAttention(
            (dropout): Dropout(p=0.0, inplace=False)
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (projection): Linear(in_features=768, out_features=768, bias=True)
          )
          (layer_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): BlipMLP(
            (activation_fn): GELUActivation()
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
          )
          (layer_norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (post_layernorm): LayerNorm((768,), eps=1e-05, e

### Create a run
Create a new run using the [`Run`](https://sagemaker.readthedocs.io/en/stable/experiments/sagemaker.experiments.html#run) class and call the [`log_parameters()`](https://sagemaker.readthedocs.io/en/stable/experiments/sagemaker.experiments.html#sagemaker.experiments.Run.log_parameters) and [`log_artifact()`](https://sagemaker.readthedocs.io/en/stable/experiments/sagemaker.experiments.html#sagemaker.experiments.Run.log_artifact) methods to record information to the run.

You can use the [`log_file()`](https://sagemaker.readthedocs.io/en/stable/experiments/sagemaker.experiments.html#sagemaker.experiments.Run.log_file) method to upload local files to S3 to persistently store all data for the run.

In [9]:
# Include day and hour so that the suffix is a full timestamp
run_suffix = strftime('%Y-%m-%d-%H-%M-%S', gmtime())

with Run(experiment_name=experiment_name,
         run_name=f"test-experiment-{run_suffix}",
         run_display_name="test-experiment-name",
         sagemaker_session=session) as run:
    run.log_parameters(
        {
            "train": 0.7,
            "validate": 0.2,
            "test": 0.1
        }
    )
    # # Log input dataset metadata and output
    # run.log_artifact(name="marketing-dataset", value="./data/bank-additional/bank-additional-full.csv", media_type="text/csv", is_output=False)
    # run.log_artifact(name="train-csv", value="./data/train.csv", media_type="text/csv")
    # run.log_artifact(name="validation-csv", value="./data/validation.csv", media_type="text/csv")
    # run.log_artifact(name="test-csv", value="./data/test.csv", media_type="text/csv")

### Train a model
Use the [`Run`](https://sagemaker.readthedocs.io/en/stable/experiments/sagemaker.experiments.html#run) class to log model metrics.

In [34]:
# In production code, use unique IDs for run names; here a full timestamp is used
run_suffix = strftime('%Y-%m-%d-%H-%M-%S', gmtime())

# Train the model for different max_depth values
for i, d in enumerate([2, 5, 10, 15, 20]):
    hyperparams["max_depth"] = d
    
    print(f"Fit estimator with max_depth={d}")
    run_name = f"training-{i}-{run_suffix}"
    
    with Run(experiment_name=experiment_name,
             run_name=run_name,
             run_display_name=f"max-depth-{d}",
             sagemaker_session=session) as run:
        # Train the model
        model = (
            xgb.train(
                params=hyperparams, 
                dtrain=dtrain, 
                evals = [(dtrain,'train'), (dtest,'eval')], 
                num_boost_round=num_boost_round, 
                early_stopping_rounds=early_stopping_rounds, 
                verbose_eval = 0
            )
        )

        # Calculate metrics
        test_auc = roc_auc_score(test_label, model.predict(dtest))
        train_auc = roc_auc_score(train_label, model.predict(dtrain))
        
        # Log hyperparameters and metrics to the run
        run.log_parameters(hyperparams)
        run.log_metric(name="test_auc", value = test_auc, step=d)
        run.log_metric(name="train_auc", value = train_auc, step=d)

        # sleep(8) # wait until resource tags are propagated to the run (sleep is imported from time)

        print(f"Test AUC: {test_auc:.4f} | Train AUC: {train_auc:.4f}")

Fit estimator with max_depth=2
Test AUC: 0.7739 | Train AUC: 0.7892
Fit estimator with max_depth=5
Test AUC: 0.7722 | Train AUC: 0.8090
Fit estimator with max_depth=10
Test AUC: 0.7579 | Train AUC: 0.8513
Fit estimator with max_depth=15
Test AUC: 0.7550 | Train AUC: 0.8763
Fit estimator with max_depth=20
Test AUC: 0.7574 | Train AUC: 0.8812

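The comment in the training cell notes that production code needs unique run names. A timestamp alone can repeat when two runs start within the same second, so one option is to append a short random token. The following is a minimal sketch and not part of the original notebook: `make_run_name` is a hypothetical helper, and the character check (letters, digits, and single hyphens) is an assumption about what run names accept that you should verify against your SDK version.

```python
import re
import uuid
from time import gmtime, strftime

# Assumed naming rule: alphanumeric groups separated by single hyphens
NAME_PATTERN = re.compile(r"[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*")


def make_run_name(prefix: str) -> str:
    """Return '<prefix>-<UTC timestamp>-<random token>' (hypothetical helper)."""
    timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
    token = uuid.uuid4().hex[:8]  # random part keeps names distinct within one second
    name = f"{prefix}-{timestamp}-{token}"
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid run name: {name!r}")
    return name


# One name per training iteration, mirroring the loop above
run_names = [make_run_name(f"training-{i}") for i in range(5)]
print(run_names)
```

A name built this way could be passed as `run_name` to `Run(...)` in place of the timestamp-only suffix.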

## Explore experiment runs with Studio UX
You can see all logged metrics, parameters, and artifacts in the Studio UX in the **SageMaker Home** > **Experiments** widget.

For example, select your experiment:

![](img/experiment-and-runs.png)

In the experiment list, select the experiment to display a list of the runs in the experiment:

![](img/runs.png)

Select the runs you want to analyze and choose **Analyze**. A new window with the selected runs opens:

![](img/run-analyze.png)

Now you can analyze the runs, compare the data, and create charts:

![](img/experiments-run-analysis.png)

Refer to the [Next generation Amazon SageMaker Experiments – Organize, track, and compare your machine learning trainings at scale](https://aws.amazon.com/blogs/machine-learning/next-generation-amazon-sagemaker-experiments-organize-track-and-compare-your-machine-learning-trainings-at-scale/) blog post for more examples and details on SageMaker Experiments.

## Use experiment analytics
You can use the [analytics features](https://sagemaker.readthedocs.io/en/stable/api/training/analytics.html#analytics) of the Experiment SDK to query and compare the runs and identify the best model produced by your experiments.

Refer to these [notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-experiments) for hands-on examples.

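As a rough illustration of the kind of comparison the analytics features support, the sketch below ranks the five training runs by test AUC and reports the gap between train and test AUC. It is plain Python over the values printed in the training output earlier in this notebook, not a call to the SageMaker API; the `runs` structure and the `best_run` name are hypothetical.

```python
# Metrics copied from the training output: one entry per max_depth value
runs = [
    {"max_depth": 2, "test_auc": 0.7739, "train_auc": 0.7892},
    {"max_depth": 5, "test_auc": 0.7722, "train_auc": 0.8090},
    {"max_depth": 10, "test_auc": 0.7579, "train_auc": 0.8513},
    {"max_depth": 15, "test_auc": 0.7550, "train_auc": 0.8763},
    {"max_depth": 20, "test_auc": 0.7574, "train_auc": 0.8812},
]

# Rank runs by test AUC, highest first
ranked = sorted(runs, key=lambda r: r["test_auc"], reverse=True)
best_run = ranked[0]

for r in ranked:
    gap = r["train_auc"] - r["test_auc"]  # a large gap suggests overfitting
    print(f"max_depth={r['max_depth']:>2}  test_auc={r['test_auc']:.4f}  gap={gap:.4f}")

print(f"Best run by test AUC: max_depth={best_run['max_depth']}")
```

With these values, the shallowest model (`max_depth=2`) has the highest test AUC, and the train/test gap widens as the depth grows.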
## Optional: Run the notebook as a SageMaker job
Sometimes you might want to run your notebooks as non-interactive, scheduled jobs. Studio provides fast and simple tools built on the existing Amazon EventBridge, SageMaker Training, and SageMaker Pipelines services to help you schedule your notebook jobs interactively. You don’t have to craft your own custom solution or enlist features from other services that may require additional overhead in time and costs to deploy.

You can run your notebook as a SageMaker job on demand or on any schedule you choose. You can also run multiple notebooks in parallel and parameterize cells in your notebooks.

### Adapt the notebook to run headlessly
A headless notebook runs in a shell outside of the Studio environment. Therefore, your code in the notebook cannot depend on or access the Studio local storage, environment variables, or Python store. You must change any code that uses the local Studio environment accordingly.

### How to run
Follow the instructions in [Notebook-based Workflows](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html) in the Developer Guide to run this notebook in non-interactive mode as a SageMaker job:
1. [Configure](https://docs.aws.amazon.com/sagemaker/latest/dg/scheduled-notebook-policies.html) the trust policy and additional IAM permissions for the Studio execution role. If you run this notebook in a domain in an AWS-preprovisioned account, the required permissions are deployed automatically
2. Provide the parameters as specified below
3. Run the notebook on-demand or schedule a job
4. Explore the results

### Set parameters

In [35]:
# output the name of the S3 bucket used by SageMaker – you need this value as bucket_name parameter
print(bucket_name)

sagemaker-us-east-1-683373171484


In [36]:
# If running interactively, upload data to S3 to have it here for a headless run
if file_source == 'EFS':
    input_s3_url = session.upload_data(
        path=os.path.join(input_path, file_name),
        bucket=bucket_name,
        key_prefix=f"{bucket_prefix}/input"
    )
    
    print(input_s3_url)

s3://sagemaker-us-east-1-683373171484/from-idea-to-prod/xgboost/input/bank-additional-full.csv


To parameterize your notebook, [set](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run-troubleshoot-override.html) the tag `parameters` on a single cell in your notebook to mark it as the "parameter cell". At runtime, SageMaker notebook execution inserts a new generated cell directly after the cell tagged with `parameters`. The generated cell contains code that sets the parameters to the values you specify when you start an execution job.

The notebook execution job has no access to the Studio EFS volume. Any data you need to pass to the notebook must be copied to an S3 bucket, where the notebook can access it.

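The mechanism can be pictured as two consecutive cells: the tagged cell holds defaults for interactive use, and the generated cell reassigns the same names, so every later cell sees the job values. The sketch below simulates both cells in one block. The default values are illustrative assumptions, the override values are the ones used for this notebook's headless run, and the file name is taken from the upload output shown earlier.

```python
import os

# --- Cell tagged `parameters`: defaults for interactive runs (illustrative values) ---
file_source = "EFS"
input_path = "./data"
output_path = "./data"
bucket_prefix = "from-idea-to-prod/xgboost"

# --- Generated cell (simulated here): inserted right after the tagged cell at runtime ---
file_source = "S3"
input_path = "/opt/ml/input/data/sagemaker_headless_execution"
output_path = "/opt/ml/output/data"

# --- Any later cell sees the overridden values ---
input_file = os.path.join(input_path, "bank-additional-full.csv")
print(f"file_source={file_source}")
print(f"input_file={input_file}")
print(f"output_path={output_path}")
```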
To run this notebook as a SageMaker job, choose the **Create a notebook job** icon in the notebook taskbar: 

![](img/notebook-as-sm-job-run.png)

Complete the popup form.

![](img/notebook-as-sm-job-parameters.png)

Set the following parameters to the specified values in the **Parameter** section of the form:

```
file_source = S3
input_path = /opt/ml/input/data/sagemaker_headless_execution
output_path = /opt/ml/output/data
bucket_name = SET TO YOUR SAGEMAKER BUCKET NAME
bucket_prefix = from-idea-to-prod/xgboost
```

Select **Run now** or **Run on a schedule** and choose **Create**.

You can also [create a notebook job programmatically with SageMaker Python SDK](https://docs.aws.amazon.com/sagemaker/latest/dg/create-notebook-auto-run-sdk.html). 

---

## Continue with step 2
Open the step 2 [notebook](02-sagemaker-containers.ipynb).

## Further development ideas for your real-world projects
- Try different models, for example some of the [SageMaker built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html), such as [CatBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/catboost.html), [AutoGluon-Tabular](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html), or [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html)
- Try [SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/) to automatically explore different solutions to find the best model. Refer to this hands-on tutorial: [Automatically Create Machine Learning Models](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-automatically-create-models/)
- Implement batch inference using [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html)

## Additional resources
- [Build and Train a Machine Learning Model Locally](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-build-model-locally/)
- [Amazon SageMaker XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html)
- [Automatically Create Machine Learning Models](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-automatically-create-models/)
- [Operationalize your Amazon SageMaker Studio notebooks as scheduled notebook jobs](https://aws.amazon.com/blogs/machine-learning/operationalize-your-amazon-sagemaker-studio-notebooks-as-scheduled-notebook-jobs/)
- [Dataset transformations](https://scikit-learn.org/stable/data_transforms.html)
- [Extracting, transforming and selecting features](https://spark.apache.org/docs/latest/ml-features.html)

# Shut down kernel

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>