<br>
<h1 style = "font-size:60px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;">BLIP Image Captioning Training</h1>
<br>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/45917/logos/thumb76_76.png?t=2023-02-08-17-53-48)

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">In this notebook, we fine-tune Salesforce's BLIP model for image captioning on the DiffusionDB dataset.</span> <br>
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Reference: https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb</span>

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Install Required Libraries</h1></span>

In [1]:
!pip install --upgrade wandb
!pip install --no-index --no-deps /kaggle/input/lavis-pretrained/salesforce-lavis/transformers* 
!pip install --no-index --no-deps /kaggle/input/lavis-pretrained/salesforce-lavis/hugging*

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Collecting wandb
  Downloading wandb-0.13.10-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 6.0 MB/s eta 0:00:00
Installing collected packages: wandb
  Attempting uninstall: wandb
    Found existing installation: wandb 0.12.21
    Uninstalling wandb-0.12.21:
      Successfully uninstalled wandb-0.12.21
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
allennlp 2.10.1 requires wandb<0.13.0,>=0.10.0, but you have wandb 0.13.10 which is incompatible.
Successfully installed wandb-0.13.10
/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Processing /kaggle/input/lavis-pretrained/salesforce-lavis/transformers-4.26.1-

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Import Required Libraries 📚</h1></span>

In [2]:
import os
import gc
import copy
import time
import random
import joblib

# For data manipulation
import numpy as np
import pandas as pd

# Pytorch Imports
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader

# Utils
from tqdm import tqdm
from collections import defaultdict

# For Transformer Models
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
from transformers import AutoProcessor
from transformers import BlipForConditionalGeneration

# For colored terminal text
from colorama import Fore, Back, Style
b_ = Fore.BLUE
y_ = Fore.YELLOW
sr_ = Style.RESET_ALL

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# For descriptive error messages
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
os.environ['TOKENIZERS_PARALLELISM'] = "False"

<img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" />

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;"> Weights & Biases (W&B) is a set of machine learning tools that helps you build better models faster. <strong>Kaggle competitions require fast-paced model development and evaluation</strong>. There are a lot of components: exploring the training data, training different models, combining trained models in different combinations (ensembling), and so on.</span>

> <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">⏳ Lots of components = Lots of places to go wrong = Lots of time spent debugging</span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">W&B can be useful in Kaggle competitions thanks to its lightweight and interoperable tools.</span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">To learn more about Weights & Biases, check out this <strong><a href="https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases">kernel</a></strong>.</span>

In [3]:
import wandb

try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=api_key)
    anony = None
except Exception:
    # Fall back to an anonymous W&B run when no API key is available
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as wandb_api. \nGet your W&B access token from here: https://wandb.ai/authorize')

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Training Configuration ⚙️</h1></span>

In [4]:
CONFIG = {"seed": 2023,
          "epochs": 5,
          "model_name": "Salesforce/blip-image-captioning-base",
          "train_batch_size": 4,
          "valid_batch_size": 8,
          "learning_rate": 1e-4,
          "scheduler": 'CosineAnnealingLR',
          "min_lr": 1e-6,
          "T_max": 500,
          "weight_decay": 1e-6,
          "n_accumulate": 1,
          "device": torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
          "competition": "SD",
          "_wandb_kernel": "deb",
          }

CONFIG["processor"] = AutoProcessor.from_pretrained(CONFIG['model_name'])


# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Set Seed for Reproducibility</h1></span>

In [5]:
def set_seed(seed=42):
    '''Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    
set_seed(CONFIG['seed'])

# <h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Data 📖</h1>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users.</span>
<br>
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">DiffusionDB is publicly available at <a href="https://huggingface.co/datasets/poloclub/diffusiondb">Hugging Face Dataset</a>.</span>
<br><hr>
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">We will use the first 5k images of the DiffusionDB 2M subset.</span>

In [6]:
from datasets import load_dataset

# Load the dataset with the `2m_first_5k` subset
dataset = load_dataset('poloclub/diffusiondb', '2m_first_5k')

Downloading and preparing dataset diffusion_db/2m_first_5k to /root/.cache/huggingface/datasets/poloclub___diffusion_db/2m_first_5k/0.9.1/547894e3a57aa647ead68c9faf148324098f47f2bc1ab6705d670721de9d89d1...

Dataset diffusion_db downloaded and prepared to /root/.cache/huggingface/datasets/poloclub___diffusion_db/2m_first_5k/0.9.1/547894e3a57aa647ead68c9faf148324098f47f2bc1ab6705d670721de9d89d1. Subsequent calls will reuse this data.

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['image', 'prompt', 'seed', 'step', 'cfg', 'sampler', 'width', 'height', 'user_name', 'timestamp', 'image_nsfw', 'prompt_nsfw'],
        num_rows: 5000
    })
})

In [8]:
dataset = dataset['train']
dataset = dataset.filter(lambda example: example["step"] == 50)
len(dataset)

4984

In [9]:
dataset[0]

{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x768>,
 'prompt': 'a renaissance portrait of dwayne johnson, art in the style of rembrandt!! intricate. ultra detailed, oil on canvas, wet - on - wet technique, pay attention to facial details, highly realistic, cinematic lightning, intricate textures, illusionistic detail, ',
 'seed': 2480545905,
 'step': 50,
 'cfg': 16.0,
 'sampler': 'k_euler_ancestral',
 'width': 512,
 'height': 768,
 'user_name': 'e9dfc969d22cb9c5621ad075b3826c28f18ef3840c6dda59c4ac7daa55241393',
 'timestamp': datetime.datetime(2022, 8, 20, 5, 28, tzinfo=<UTC>),
 'image_nsfw': 0.16348764300346375,
 'prompt_nsfw': 0.000792665290646255}

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Data Split</h1></span>

In [10]:
# Pass the seed so the train/validation split is reproducible across runs
dataset = dataset.train_test_split(test_size=0.1, seed=CONFIG['seed'])
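
As a rough sanity check on the split sizes, the sketch below reproduces the batch counts that appear in the training log (1122 training and 63 validation batches). It assumes the split rounds the test share up and gives the remainder to training, as scikit-learn's splitter does; `split_sizes` and `num_batches` are hypothetical helper names, not part of the `datasets` library.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

def num_batches(n_samples, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)

n_train, n_test = split_sizes(4984, 0.1)   # 4984 images remain after filtering
train_batches = num_batches(n_train, 4)    # train_batch_size = 4
valid_batches = num_batches(n_test, 8)     # valid_batch_size = 8
print(n_train, n_test, train_batches, valid_batches)  # 4485 499 1122 63
```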

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Dataset Class</h1></span>

In [11]:
class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        encoding = self.processor(images=item["image"], text=item["prompt"], 
                                  padding="max_length", return_tensors="pt")
        # remove only the batch dimension (a bare squeeze() would also drop any other size-1 axis)
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        return encoding

In [12]:
train_dataset = ImageCaptioningDataset(dataset['train'], CONFIG['processor'])
valid_dataset = ImageCaptioningDataset(dataset['test'], CONFIG['processor'])

In [13]:
train_dataset[0].keys()

dict_keys(['pixel_values', 'input_ids', 'attention_mask'])

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Model</h1></span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">BLIP is a vision-language model that can perform various multimodal tasks, including image captioning.</span> <br>
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Model documentation: https://huggingface.co/docs/transformers/model_doc/blip </span>

In [14]:
model = BlipForConditionalGeneration.from_pretrained(CONFIG['model_name'])


# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Training Function</h1></span>

In [15]:
def train_one_epoch(model, optimizer, scheduler, dataloader, device, epoch):
    model.train()
    
    dataset_size = 0
    running_loss = 0.0
    
    bar = tqdm(enumerate(dataloader), total=len(dataloader))
    for step, data in bar:
        input_ids = data['input_ids'].to(device)
        pixel_values = data['pixel_values'].to(device)
        
        batch_size = input_ids.size(0)

        outputs = model(input_ids=input_ids, 
                        pixel_values=pixel_values, 
                        labels=input_ids)
                
        loss = outputs.loss
        loss = loss / CONFIG['n_accumulate']
        loss.backward()
    
        if (step + 1) % CONFIG['n_accumulate'] == 0:
            optimizer.step()

            # zero the parameter gradients
            optimizer.zero_grad()

            if scheduler is not None:
                scheduler.step()
                
        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size
        
        epoch_loss = running_loss / dataset_size
        
        bar.set_postfix(Epoch=epoch, Train_Loss=epoch_loss,
                        LR=optimizer.param_groups[0]['lr'])
    gc.collect()
    
    return epoch_loss
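
The accumulation condition in the loop above can be sketched without PyTorch. With `n_accumulate = k`, the optimizer steps once every `k` batches, so the effective batch size is `train_batch_size * k`; batches left over at the end of an epoch have their gradients carried over rather than applied within that epoch. `count_optimizer_steps` is a hypothetical helper used only for illustration.

```python
def count_optimizer_steps(num_batches, n_accumulate):
    """Count how often `(step + 1) % n_accumulate == 0` fires in one epoch."""
    steps = 0
    for step in range(num_batches):
        if (step + 1) % n_accumulate == 0:
            steps += 1
    # Batches after the last full group: their gradients are not applied this epoch
    leftover = num_batches % n_accumulate
    return steps, leftover

print(count_optimizer_steps(1122, 1))  # (1122, 0): the setting used in this notebook
print(count_optimizer_steps(1122, 4))  # (280, 2): effective batch size 4 * 4 = 16
```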

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Validation Function</h1></span>

In [16]:
@torch.no_grad()
def valid_one_epoch(model, dataloader, device, epoch):
    model.eval()
    
    dataset_size = 0
    running_loss = 0.0
    
    bar = tqdm(enumerate(dataloader), total=len(dataloader))
    for step, data in bar:        
        input_ids = data['input_ids'].to(device)
        pixel_values = data['pixel_values'].to(device)
        
        batch_size = input_ids.size(0)

        outputs = model(input_ids=input_ids, 
                        pixel_values=pixel_values, 
                        labels=input_ids)
                
        loss = outputs.loss
        
        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size
        
        epoch_loss = running_loss / dataset_size
        
        # NOTE: `optimizer` is not a parameter here; this reads the global defined in the training cell
        bar.set_postfix(Epoch=epoch, Valid_Loss=epoch_loss,
                        LR=optimizer.param_groups[0]['lr'])
    
    gc.collect()
    
    return epoch_loss
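
The validation loop above, like the training loop, reports a sample-weighted running mean: each batch's mean loss is multiplied by its batch size before averaging. A minimal sketch with made-up loss values shows why this matters when the last batch is smaller than the rest; the numbers are illustrative and not taken from the run, and `weighted_epoch_loss` is a hypothetical helper.

```python
def weighted_epoch_loss(batch_losses, batch_sizes):
    """Average per-batch mean losses, weighting each by its number of samples."""
    running_loss = 0.0
    dataset_size = 0
    for loss, size in zip(batch_losses, batch_sizes):
        running_loss += loss * size
        dataset_size += size
    return running_loss / dataset_size

losses = [1.0, 1.0, 3.0]   # hypothetical per-batch mean losses
sizes = [8, 8, 2]          # the final batch holds only 2 samples

weighted = weighted_epoch_loss(losses, sizes)   # 22 / 18, about 1.22
unweighted = sum(losses) / len(losses)          # 5 / 3, about 1.67
print(weighted, unweighted)
```

The plain mean over batches lets the small final batch count as much as a full one, which is what the weighting avoids.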

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Run Training</h1></span>

In [17]:
def run_training(model, optimizer, scheduler, device, num_epochs):
    # To automatically log gradients
    wandb.watch(model, log_freq=100)
    
    if torch.cuda.is_available():
        print("[INFO] Using GPU: {}\n".format(torch.cuda.get_device_name()))
    
    start = time.time()
    best_model_wts = copy.deepcopy(model.state_dict())
    best_epoch_loss = np.inf
    history = defaultdict(list)
    
    for epoch in range(1, num_epochs + 1): 
        # `train_loader` and `valid_loader` are globals created in the training cell
        train_epoch_loss = train_one_epoch(model, optimizer, scheduler, 
                                           dataloader=train_loader, 
                                           device=device, epoch=epoch)
        
        val_epoch_loss = valid_one_epoch(model, valid_loader, device=device, 
                                         epoch=epoch)
    
        history['Train Loss'].append(train_epoch_loss)
        history['Valid Loss'].append(val_epoch_loss)
        
        # Log both metrics in one call so they share the same W&B step
        wandb.log({"Train Loss": train_epoch_loss,
                   "Valid Loss": val_epoch_loss})
        
        # deep copy the model
        if val_epoch_loss <= best_epoch_loss:
            print(f"{b_}Validation Loss Improved ({best_epoch_loss} ---> {val_epoch_loss})")
            best_epoch_loss = val_epoch_loss
            run.summary["Best Loss"] = best_epoch_loss
            best_model_wts = copy.deepcopy(model.state_dict())
            PATH = "BestLoss.bin"
            # Save the best weights to the current working directory
            torch.save(model.state_dict(), PATH)
            print(f"Model Saved{sr_}")
            
        print()
    
    end = time.time()
    time_elapsed = end - start
    print('Training complete in {:.0f}h {:.0f}m {:.0f}s'.format(
        time_elapsed // 3600, (time_elapsed % 3600) // 60, (time_elapsed % 3600) % 60))
    print("Best Loss: {:.4f}".format(best_epoch_loss))
    
    # load best model weights
    model.load_state_dict(best_model_wts)
    
    return model, history

In [18]:
def fetch_scheduler(optimizer):
    if CONFIG['scheduler'] == 'CosineAnnealingLR':
        scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=CONFIG['T_max'], 
                                                   eta_min=CONFIG['min_lr'])
    elif CONFIG['scheduler'] == 'CosineAnnealingWarmRestarts':
        # NOTE: this branch needs a 'T_0' entry, which CONFIG above does not define
        scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=CONFIG['T_0'], 
                                                             eta_min=CONFIG['min_lr'])
    elif CONFIG['scheduler'] is None:
        return None
    else:
        raise ValueError(f"Unsupported scheduler: {CONFIG['scheduler']}")
        
    return scheduler
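
Because `scheduler.step()` is called after every optimizer step rather than once per epoch, `T_max = 500` is a count of batches, and each 1122-batch epoch spans more than two half-cycles of the cosine. The sketch below evaluates the closed-form cosine annealing expression in plain Python (it does not call PyTorch, and `cosine_lr` is a hypothetical helper) and reproduces the learning rates shown in the training log at the end of each epoch.

```python
import math

def cosine_lr(step, base_lr=1e-4, min_lr=1e-6, t_max=500):
    """Closed-form cosine annealing; periodic with period 2 * t_max."""
    return min_lr + (base_lr - min_lr) * (1 + math.cos(math.pi * step / t_max)) / 2

steps_per_epoch = 1122  # training batches per epoch in the log
for epoch in range(1, 6):
    lr = cosine_lr(epoch * steps_per_epoch)
    print(f"epoch {epoch}: lr = {lr:.3g}")
# epoch 1: lr = 8.62e-05
# epoch 2: lr = 5.24e-05
# epoch 3: lr = 1.75e-05
# epoch 4: lr = 1.14e-06
# epoch 5: lr = 1.24e-05
```

Under this expression the rate reaches its minimum at step 500 and is back at `1e-4` by step 1000, which is why the logged value rises again between epochs 4 and 5.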

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Start Training</span>

In [19]:
run = wandb.init(project=CONFIG['competition'], 
                 config=CONFIG,
                 job_type='Train',
                 tags=[CONFIG['model_name']],
                 name="BLIP-baseline",
                 anonymous=anony)  # None when logged in, "must" otherwise (set in the login cell)

# Create Dataloaders
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=CONFIG['train_batch_size'])
valid_loader = DataLoader(valid_dataset, shuffle=False, batch_size=CONFIG['valid_batch_size'])

model.to(CONFIG['device'])

# Define Optimizer and Scheduler
optimizer = AdamW(model.parameters(), lr=CONFIG['learning_rate'], weight_decay=CONFIG['weight_decay'])
scheduler = fetch_scheduler(optimizer)

model, history = run_training(model, optimizer, scheduler,
                              device=CONFIG['device'],
                              num_epochs=CONFIG['epochs'])

run.finish()

del model, history, train_loader, valid_loader
_ = gc.collect()
print()
wandb.finish()

wandb: Currently logged in as: dchanda. Use `wandb login --relogin` to force relogin

[INFO] Using GPU: Tesla P100-PCIE-16GB

100%|██████████| 1122/1122 [18:02<00:00,  1.04it/s, Epoch=1, LR=8.62e-5, Train_Loss=1.74]
100%|██████████| 63/63 [00:41<00:00,  1.53it/s, Epoch=1, LR=8.62e-5, Valid_Loss=1.53]

Validation Loss Improved (inf ---> 1.533640047830188)
Model Saved

100%|██████████| 1122/1122 [17:57<00:00,  1.04it/s, Epoch=2, LR=5.24e-5, Train_Loss=1.48]
100%|██████████| 63/63 [00:41<00:00,  1.54it/s, Epoch=2, LR=5.24e-5, Valid_Loss=1.49]

Validation Loss Improved (1.533640047830188 ---> 1.4860287522505184)
Model Saved

100%|██████████| 1122/1122 [18:02<00:00,  1.04it/s, Epoch=3, LR=1.75e-5, Train_Loss=1.44]
100%|██████████| 63/63 [00:41<00:00,  1.53it/s, Epoch=3, LR=1.75e-5, Valid_Loss=1.46]

Validation Loss Improved (1.4860287522505184 ---> 1.4617982918848256)
Model Saved

100%|██████████| 1122/1122 [18:01<00:00,  1.04it/s, Epoch=4, LR=1.14e-6, Train_Loss=1.42]
100%|██████████| 63/63 [00:41<00:00,  1.53it/s, Epoch=4, LR=1.14e-6, Valid_Loss=1.46]

Validation Loss Improved (1.4617982918848256 ---> 1.4550576501475547)
Model Saved

100%|██████████| 1122/1122 [18:01<00:00,  1.04it/s, Epoch=5, LR=1.24e-5, Train_Loss=1.4]
100%|██████████| 63/63 [00:42<00:00,  1.50it/s, Epoch=5, LR=1.24e-5, Valid_Loss=1.45]

Validation Loss Improved (1.4550576501475547 ---> 1.4526172795133265)
Model Saved

Training complete in 1h 33m 48s
Best Loss: 1.4526


Run history:
Train Loss  █▃▂▁▁
Valid Loss  █▄▂▁▁

Run summary:
Best Loss   1.45262
Train Loss  1.40263
Valid Loss  1.45262

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Visualizations</h1></span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;"><a href="https://wandb.ai/dchanda/SD">View the Complete Dashboard Here ⮕</a></span>

In [20]:
# This is just to display the W&B run page in this interactive session.
from IPython import display

# we create an IFrame and set the width and height
iF = display.IFrame(run.url, width=1080, height=720)
iF
