# Visual Question Answering
The framework is set up so that it is easy to combine a vision model from the [timm](https://github.com/huggingface/pytorch-image-models/tree/main/timm/models) library with a language model from [huggingface](https://huggingface.co/). Both models can either use pre-trained weights or be trained together end-to-end as a composite.
In this example we use the [`RSVQAxBEN DataModule`](extra/rsvqaxben.ipynb) from [1] inside a [`pytorch lightning`](https://pytorch-lightning.readthedocs.io/en/stable/) trainer. The network is wrapped in a [`LightningModule`](https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html), which saves us from writing the training loop and similar boilerplate ourselves.
[1] [RSVQA Meets Bigearthnet: A New, Large-Scale, Visual Question Answering Dataset for Remote Sensing](https://ieeexplore.ieee.org/document/9553307)

We start by importing what we need from `torch` and `pytorch_lightning` to set up the `LightningModule`.

In [1]:
# remove-output
# remove-input
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch import optim

from configvlm import ConfigVLM



## Pytorch Lightning Module
The `LightningModule` we use to encapsulate the model divides the usual loop into functions that `pytorch_lightning` calls internally. Only `training_step` and `configure_optimizers` are strictly required, but to have a fully functional script we also add the validation and test steps and evaluate their results. All `_step` functions work on a single batch, while the `_epoch_end` functions are called after all batches have been processed and receive a list of all return values of their respective `_step` function.
For VQA we need one additional function, because the network works with three values (vision input, language input, output) instead of the usual two (input, output). We therefore add a function (here called `_disassemble_batch`) that splits the batch into input and output, where the _input contains both modalities_.
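
The relationship between the `_step` and `_epoch_end` functions can be sketched in plain Python. This is a deliberately simplified stand-in, not the actual implementation inside `pytorch_lightning`; `run_eval_epoch` and `ToyModule` are hypothetical names used only for illustration.

```python
def run_eval_epoch(step, epoch_end, batches):
    """Call `step` once per batch, then pass all results to `epoch_end`."""
    outputs = [step(batch, idx) for idx, batch in enumerate(batches)]
    epoch_end(outputs)
    return outputs


class ToyModule:
    """Hypothetical module whose 'loss' is just the mean of the batch values."""

    def __init__(self):
        self.logged = {}

    def validation_step(self, batch, batch_idx):
        # Works on a single batch and returns a dict, as in the module below.
        return {"loss": sum(batch) / len(batch)}

    def validation_epoch_end(self, outputs):
        # Receives the list of all dicts returned by validation_step.
        losses = [o["loss"] for o in outputs]
        self.logged["val/loss"] = sum(losses) / len(losses)


module = ToyModule()
outputs = run_eval_epoch(
    module.validation_step,
    module.validation_epoch_end,
    batches=[[1.0, 3.0], [2.0, 4.0]],
)
print(module.logged)
```

The real trainer additionally handles device placement, gradient tracking and logging; the sketch only shows the order in which the two kinds of functions are called.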

In [2]:
class LitVQAEncoder(pl.LightningModule):
    def __init__(
        self,
        config: ConfigVLM.VLMConfiguration,
        lr: float = 1e-3,
    ):
        super().__init__()
        self.lr = lr
        self.config = config
        self.model = ConfigVLM.ConfigVLM(config)

    def _disassemble_batch(self, batch):
        images, questions, labels = batch
        # For some reason questions come in here transposed as a list of Tensors
        # where the first elements of the question are in the first tensor (first
        # element of the list), all the second elements are in the second tensor
        # which is the second element of the list and so on.
        # So we first make it a list of lists and then a big tensor and then
        # transpose this tensor.
        # Now each tensor contains one question
        questions = torch.tensor(
            [x.tolist() for x in questions], device=self.device
        ).T.int()
        return (images, questions), labels

    def training_step(self, batch, batch_idx):
        x, y = self._disassemble_batch(batch)
        x_hat = self.model(x)
        loss = F.binary_cross_entropy_with_logits(x_hat, y)
        self.log("train/loss", loss)
        return {"loss": loss}

    def configure_optimizers(self):
        optimizer = optim.AdamW(self.parameters(), lr=self.lr, weight_decay=0.01)
        return optimizer

    # ============== NON-MANDATORY-FUNCTION ===============

    def validation_step(self, batch, batch_idx):
        x, y = self._disassemble_batch(batch)
        x_hat = self.model(x)
        loss = F.binary_cross_entropy_with_logits(x_hat, y)
        return {"loss": loss, "outputs": x_hat, "labels": y}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x["loss"] for x in outputs]).mean()
        self.log("val/loss", avg_loss)

    def test_step(self, batch, batch_idx):
        x, y = self._disassemble_batch(batch)
        x_hat = self.model(x)
        loss = F.binary_cross_entropy_with_logits(x_hat, y)
        return {"loss": loss, "outputs": x_hat, "labels": y}

    def test_epoch_end(self, outputs):
        avg_loss = torch.stack([x["loss"] for x in outputs]).mean()
        self.log("test/loss", avg_loss)
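
The transposition performed in `_disassemble_batch` can be illustrated in isolation. The sketch below assumes that each sample stores its question as a plain Python list of token ids and that the dataloader uses PyTorch's default collate function, which would produce the layout described in the code comments. The token ids are made up for illustration.

```python
import torch
from torch.utils.data.dataloader import default_collate

# Two hypothetical tokenized questions (made-up token ids), one per sample.
samples = [[101, 7, 102], [101, 8, 102]]

# default_collate zips list-valued samples position by position, so the
# result is a list with one tensor per token position, not per question.
collated = default_collate(samples)

# Same conversion as in _disassemble_batch: each row is one question again.
questions = torch.tensor([x.tolist() for x in collated]).T.int()

# Stacking along dim=1 yields the same layout without the list round trip.
stacked = torch.stack(collated, dim=1).int()

print(questions)
```

If the assumption about the collate function holds for the datamodule in use, `torch.stack(questions, dim=1)` would be a shorter alternative to the list conversion in `_disassemble_batch`.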

## Configuring
Now that we have our model, we will use the `pytorch_lightning.Trainer` to run our loops. Results are logged to `tensorboard`.

We start by importing the logger and the configuration class

In [3]:
from pytorch_lightning.loggers import TensorBoardLogger
from configvlm.ConfigVLM import VLMConfiguration

as well as defining our hyperparameters.

In [4]:
vision_model_name = "resnet18"
text_model_name = "prajjwal1/bert-tiny"
seed = 42
number_of_channels = 12
image_size = 120
epochs = 4
lr = 5e-4

Then we seed all random number generators and create the logger and the configuration that is later used to build the model.

In [5]:
# remove-output
# seed for pytorch, numpy, python.random, Dataloader workers, spawned subprocesses
pl.seed_everything(seed, workers=True)

model_config = VLMConfiguration(
    timm_model_name=vision_model_name,
    hf_model_name=text_model_name,  # different to pre-training
    classes=1000,  # different to pre-training
    image_size=image_size,
    channels=number_of_channels,
    network_type=ConfigVLM.VLMType.VQA_CLASSIFICATION  # different to pre-training
)

logger = TensorBoardLogger(
    save_dir="./tb_logs",
    name="VQA Test Model",
    version="testversion"
)

Global seed set to 42


We create a [Trainer](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html) and log the hyperparameters.

In [6]:
# remove-output
trainer = pl.Trainer(
    max_epochs=epochs,
    accelerator="auto",
    logger=logger,
    log_every_n_steps=1,
)

logger.log_hyperparams({
    "Model Name": "VQA Test Model",
    "Seed": seed,
    "Epochs": epochs,
    "Channels": number_of_channels,
    "Image Size": image_size,
    "GPU": torch.cuda.get_device_name() if torch.cuda.is_available() else "-",
    "Learning Rate": lr,
})

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


## Creating Model + Dataset
Finally, we create the model defined above and our datamodule. We use a datamodule that ships with this framework and is described in the Extra section.

In [7]:
# remove-input
# remove-output
import pathlib
my_data_path = str(pathlib.Path("").resolve().parent.joinpath("configvlm").joinpath("extra").joinpath("mock_data").resolve(strict=True))
# allow bfloat16 internally for float32 matmuls (faster on Ampere cards)
torch.set_float32_matmul_precision('medium')

In [8]:
# hide-output
from configvlm.extra.RSVQAxBEN_DataModule_LMDB_Encoder import RSVQAxBENDataModule
from configvlm.ConfigVLM import get_hf_model
model = LitVQAEncoder(config=model_config, lr=lr)
dm = RSVQAxBENDataModule(
    data_dir=my_data_path,
    img_size=(number_of_channels, image_size, image_size),
    num_workers_dataloader=4,
    tokenizer = get_hf_model(model_name=text_model_name)[0]
)

Some weights of the model checkpoint at /home/lhackel/.cache/configvlm/pretrained_models/huggingface_models/prajjwal1/bert-tiny were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Dataloader using 4 workers

HINT: pin_memory set to None


## Running
Now we just have to call the `fit()` and optionally the `test()` functions.

:::{note}
These calls generate quite a bit of output depending on the number of batches and epochs. The output is removed for readability.
:::

In [9]:
# hide-output
trainer.fit(model, datamodule=dm)

(11:07:59) Datamodule setup called
Loading split RSVQAxBEN data for train...
              25 QA-pairs indexed
              25 QA-pairs in reduced data set


Counting Answers: 100%|██████████| 25/25 [00:00<00:00, 346064.69it/s]



The 1000 most frequent answers cover about 100.00 % of the total answers.


Converting to NP arrays: 100%|██████████| 25/25 [00:00<00:00, 845625.81it/s]


Loading split RSVQAxBEN data for val...
              25 QA-pairs indexed
              25 QA-pairs in reduced data set


Converting to NP arrays: 100%|██████████| 25/25 [00:00<00:00, 474468.78it/s]
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")


setup took 0.01 seconds
  Total training samples:       25  Total validation samples:       25


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type      | Params
------------------------------------
0 | model | ConfigVLM | 16.6 M
------------------------------------
16.6 M    Trainable params
0         Non-trainable params
16.6 M    Total params
66.281    Total estimated model params size (MB)


Epoch 0:  50%|█████     | 2/4 [00:00<00:00,  3.15it/s, loss=0.691, v_num=sion]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/2 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Epoch 0:  75%|███████▌  | 3/4 [00:00<00:00,  3.35it/s, loss=0.691, v_num=sion]
Epoch 0: 100%|██████████| 4/4 [00:00<00:00,  4.41it/s, loss=0.691, v_num=sion]
Epoch 1:  50%|█████     | 2/4 [00:00<00:00,  5.09it/s, loss=0.683, v_num=sion]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/2 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Epoch 1:  75%|███████▌  | 3/4 [00:00<00:00,  4.69it/s, loss=0.683, v_num=sion]
Epoch 1: 100%|██████████| 4/4 [00:00<00:00,  6.15it/s, loss=0.683, v_num=sion]
Epoch 2:  50%|█████     | 2/4 [00:00<00:00,  6.18it/s, loss=0.666, v_num=sion]
Validation: 0it [00:00, ?it/s]
Validation:   0%|          | 0/2 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/2 [00:00<?

`Trainer.fit` stopped: `max_epochs=4` reached.


Epoch 3: 100%|██████████| 4/4 [00:00<00:00,  4.25it/s, loss=0.639, v_num=sion]


In [10]:
# hide-output
trainer.test(model, datamodule=dm)

(11:08:08) Datamodule setup called
Loading split RSVQAxBEN data for test...
              25 QA-pairs indexed
              25 QA-pairs in reduced data set


Converting to NP arrays: 100%|██████████| 25/25 [00:00<00:00, 421114.86it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


setup took 0.00 seconds
  Total test samples:       25
Testing DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 100.99it/s]


[{'test/loss': 0.5322797298431396}]
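
`trainer.test` returns a list with one dictionary of logged metrics per test dataloader. A minimal sketch of reading the value, using the result printed above as a literal stand-in for the return value:

```python
# Stand-in for `results = trainer.test(model, datamodule=dm)`,
# copied from the output shown above.
results = [{"test/loss": 0.5322797298431396}]

# One dict per test dataloader; this example has a single dataloader.
test_loss = results[0]["test/loss"]
print(f"test loss: {test_loss:.4f}")
```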