# Multihop Hybrid (Table + Text) Question Generation: Training Example
In this notebook, we fine-tune and evaluate a question generation model on the HybridQA dataset.

## Configuration

We start by setting some parameters to configure the process. Depending on the GPU being used, you may need to tune the batch size.

In [1]:
model_name_or_path="t5-small"
modality="hybrid"
dataset_name="hybrid_qa"
max_len=200
target_max_len=40
output_dir="/dccstor/cssblr/rbhat//models/"
learning_rate=0.0001
num_train_epochs=2
per_device_train_batch_size=8
per_device_eval_batch_size=32
evaluation_strategy='epoch'
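
If the training batch size above does not fit in GPU memory, one common option is to lower `per_device_train_batch_size` and raise `gradient_accumulation_steps` (a standard Hugging Face `TrainingArguments` parameter that this notebook leaves at its default of 1). The sketch below only does the arithmetic; the helper name `accumulation_steps` is hypothetical and is not part of PrimeQA or Transformers.

```python
import math


def accumulation_steps(target_batch_size, per_device_batch_size, num_devices=1):
    """Smallest number of accumulation steps such that
    per_device_batch_size * num_devices * steps >= target_batch_size."""
    if per_device_batch_size <= 0 or num_devices <= 0:
        raise ValueError("batch size and device count must be positive")
    return math.ceil(target_batch_size / (per_device_batch_size * num_devices))


# Keep an effective batch size of 8 when only 2 examples fit on the GPU per step
print(accumulation_steps(target_batch_size=8, per_device_batch_size=2))  # 4
```

The resulting value could then be passed as `gradient_accumulation_steps` when building the training arguments.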

In [2]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    num_train_epochs=num_train_epochs,
    # Reuse the value set in the configuration cell
    evaluation_strategy=evaluation_strategy,
    learning_rate=learning_rate,
    # Generate text during evaluation so that ROUGE can be computed
    predict_with_generate=True,
    # Return predictions as well as the loss during evaluation
    prediction_loss_only=False,
    # Do not drop dataset columns that the model's forward method does not accept
    remove_unused_columns=False,
)



---
## HybridQA data
Here we load one instance of HybridQA and visualize it. Note: this part of the code is not needed to train the model.

In [4]:
import json
from datasets import load_dataset

def print_hybridqa_instance(train_instance):
    print(json.dumps(train_instance, indent=4))

train_instance = load_dataset('hybrid_qa', split='train[1001:1002]')[0]
print_hybridqa_instance(train_instance)

Reusing dataset hybrid_qa (/dccstor/cssblr/rbhat/.cache/hybrid_qa/hybrid_qa/1.0.0/fabdc38783449dd6cb1acd25621af97b871e218fc3ab608191d492b408a93ab8)


{
    "question_id": "0424073b0d76fcb3",
    "question": "The tracks of what creature are found in the formation located in the largest country in Southern Europe ?",
    "table_id": "List_of_stratigraphic_units_with_ornithischian_tracks_5",
    "answer_text": "Pterosaur",
    "question_postag": "DT NNS IN WP NN VBP VBN IN DT NN VBN IN DT JJS NN IN NNP NNP .",
    "table": {
        "url": "https://en.wikipedia.org/wiki/List_of_stratigraphic_units_with_ornithischian_tracks",
        "title": "List of stratigraphic units with ornithischian tracks",
        "header": [
            "Name",
            "Location",
            "Description"
        ],
        "data": [
            {
                "value": "Aganane Formation",
                "urls": [
                    {
                        "url": "/wiki/Aganane_Formation",
                        "summary": "The Aganane Formation is a Pliensbachian geologic formation in Morocco . Fossil stegosaur and theropod tracks have been repor
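
The printed instance shows why the task is hybrid: each table cell has a `value` and a list of `urls`, and every linked entry has a passage `summary`. As a minimal sketch, the helper below (the name `linked_passages` is hypothetical) walks a table shaped like the output above and collects the passages attached to its cells. The example table is hand-built from the fields visible above, with the summary shortened to its first sentence and one unlinked cell added purely for illustration.

```python
def linked_passages(table):
    """Collect (cell value, passage URL, passage summary) triples from a table."""
    triples = []
    for cell in table["data"]:
        for link in cell.get("urls", []):
            triples.append((cell["value"], link["url"], link["summary"]))
    return triples


# Hand-built table mirroring only the fields visible in the printed instance
example_table = {
    "title": "List of stratigraphic units with ornithischian tracks",
    "header": ["Name", "Location", "Description"],
    "data": [
        {
            "value": "Aganane Formation",
            "urls": [
                {
                    "url": "/wiki/Aganane_Formation",
                    # Summary shortened to its first sentence for this sketch
                    "summary": "The Aganane Formation is a Pliensbachian geologic formation in Morocco .",
                }
            ],
        },
        # Hypothetical cell without a hyperlink, added for illustration only
        {"value": "(unlinked cell)", "urls": []},
    ],
}

for value, url, summary in linked_passages(example_table):
    print(f"{value} -> {url}: {summary}")
```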

---
## Loading the Model

Here we load the model based on the `model_name_or_path` and `modality` parameters set above. For HybridQA we keep `modality='hybrid'`. The other options are `modality='table'` and `modality='passage'`.

In [17]:
from primeqa.qg.models.qg_model import QGModel

qg_model = QGModel(model_name_or_path, modality=modality, lang='en')

loading configuration file config.json from cache at /dccstor/cssblr/rbhat/.cache/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5/config.json
Model config T5Config {
  "_name_or_path": "t5-small",
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_b

torch.Size([32128, 512]) torch.Size([32128, 512])


loading file spiece.model from cache at /dccstor/cssblr/rbhat/.cache/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5/spiece.model
loading file tokenizer.json from cache at /dccstor/cssblr/rbhat/.cache/models--t5-small/snapshots/d78aea13fa7ecd06c29e3e46195d6341255065d5/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None

torch.Size([32128, 512]) torch.Size([32100, 512])


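## Loading the Data

Here we process and load the data. This example uses only the first 100 training instances and the first 50 validation instances; after processing, 75 training and 42 validation examples remain.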

In [8]:
from primeqa.qg.processors.data_loader import QGDataLoader

qgdl = QGDataLoader(
    tokenizer=qg_model.tokenizer,
    modality=modality,
    dataset_name=dataset_name,
    input_max_len=max_len,
    target_max_len=target_max_len
    )

train_dataset = qgdl.create(dataset_split="train[:100]")
valid_dataset = qgdl.create(dataset_split="validation[:50]")
print(train_dataset)
print(valid_dataset)

Reusing dataset hybrid_qa (/dccstor/cssblr/rbhat/.cache/hybrid_qa/hybrid_qa/1.0.0/fabdc38783449dd6cb1acd25621af97b871e218fc3ab608191d492b408a93ab8)



Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

Dataset({
    features: ['label', 'input', 'input_ids', 'attention_mask', 'target_ids', 'target_attention_mask'],
    num_rows: 75
})
Dataset({
    features: ['label', 'input', 'input_ids', 'attention_mask', 'target_ids', 'target_attention_mask'],
    num_rows: 42
})


## Train using QGTrainer
Here we create a `QGTrainer` with the training arguments defined above and use it to train on the HybridQA training data (or on any custom data that follows the same format).

In [14]:
from primeqa.qg.trainers.qg_trainer import QGTrainer
from primeqa.qg.utils.data_collator import T2TDataCollator
from primeqa.qg.metrics.generation_metrics import rouge_metrics

# Build the ROUGE metric function used during evaluation
compute_metrics = rouge_metrics(qg_model.tokenizer)

trainer = QGTrainer(
    model=qg_model.model,
    tokenizer=qg_model.tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=T2TDataCollator(),
    compute_metrics=compute_metrics,
)

# Fine-tune, save the final model to output_dir, and print the training metrics
train_results = trainer.train()
trainer.save_model()
print(train_results.metrics)

***** Running training *****
  Num examples = 75
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 20


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 1 | No log | 3.373989 | 16.8598 | 3.3474 | 14.7241 | 14.834 |
| 2 | No log | 3.297106 | 17.3003 | 2.9068 | 14.7613 | 14.8174 |


***** Running Evaluation *****
  Num examples = 42
  Batch size = 32
***** Running Evaluation *****
  Num examples = 42
  Batch size = 32


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to /dccstor/cssblr/rbhat//models/
Configuration saved in /dccstor/cssblr/rbhat//models/config.json
Model weights saved in /dccstor/cssblr/rbhat//models/pytorch_model.bin
tokenizer config file saved in /dccstor/cssblr/rbhat//models/tokenizer_config.json
Special tokens file saved in /dccstor/cssblr/rbhat//models/special_tokens_map.json
Copy vocab file to /dccstor/cssblr/rbhat//models/spiece.model


{'train_runtime': 49.5407, 'train_samples_per_second': 3.028, 'train_steps_per_second': 0.404, 'total_flos': 10891100160000.0, 'train_loss': 3.7156166076660155, 'epoch': 2.0}
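
The training log reports 20 optimization steps. That number follows from the settings used here: 75 examples with a batch size of 8 give 10 batches per epoch (the last one partial), and 2 epochs give 20 steps. The sketch below reproduces that arithmetic. The function name is hypothetical, and the calculation is a simplification that assumes the last partial batch is kept.

```python
import math


def total_optimization_steps(num_examples, per_device_batch_size, num_epochs,
                             num_devices=1, gradient_accumulation_steps=1):
    """Approximate the number of optimizer updates for a training run."""
    # Batches per epoch, counting the final partial batch
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update is made every `gradient_accumulation_steps` batches
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_epochs


# Values taken from the training log above
print(total_optimization_steps(num_examples=75, per_device_batch_size=8, num_epochs=2))  # 20
```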


## Evaluation

Here we evaluate the trained model on the validation set.

In [15]:
metrics = trainer.evaluate()
print(metrics)

***** Running Evaluation *****
  Num examples = 42
  Batch size = 32


{'eval_loss': 3.2971057891845703, 'eval_rouge1': 17.3003, 'eval_rouge2': 2.9068, 'eval_rougeL': 14.7613, 'eval_rougeLsum': 14.8174, 'eval_runtime': 3.7786, 'eval_samples_per_second': 11.115, 'eval_steps_per_second': 0.529, 'epoch': 2.0}
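
To give a feel for what `eval_rouge1` measures, here is a simplified sketch of a ROUGE-1 F-score based on unigram overlap between a reference question and a generated one. It is not the implementation behind `rouge_metrics`: it assumes whitespace tokenization and lowercasing, with no stemming. The scores reported above appear to be on a 0-100 scale, so the example multiplies by 100. Both example strings are made up for illustration.

```python
from collections import Counter


def rouge1_f1(reference, prediction):
    """Unigram-overlap F1 between a reference and a prediction (0.0 to 1.0)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each token counts at most as often as it occurs in both texts
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and generated question sharing 4 of 5 tokens
score = rouge1_f1("what creature left the tracks", "what animal left the tracks")
print(round(100 * score, 1))  # 80.0
```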
