
# Question Answering System with BERT  
**Dataset:** HotpotQA  
**Course:** NLP Applications  

This notebook demonstrates:
- Data loading & EDA
- Preprocessing for BERT QA
- Fine-tuning BERT


## 1. Install and Import Libraries

In [1]:

!pip install transformers datasets torch evaluate pandas numpy


Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.6


In [2]:

import torch
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import BertTokenizerFast, BertForQuestionAnswering, Trainer, TrainingArguments


## 2. Load HotpotQA Dataset

In [3]:

dataset = load_dataset("hotpot_qa", "distractor")
dataset


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

distractor/train-00000-of-00002.parquet:   0%|          | 0.00/166M [00:00<?, ?B/s]

distractor/train-00001-of-00002.parquet:   0%|          | 0.00/166M [00:00<?, ?B/s]

distractor/validation-00000-of-00001.par(…):   0%|          | 0.00/27.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/90447 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/7405 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'answer', 'type', 'level', 'supporting_facts', 'context'],
        num_rows: 90447
    })
    validation: Dataset({
        features: ['id', 'question', 'answer', 'type', 'level', 'supporting_facts', 'context'],
        num_rows: 7405
    })
})

## 3. Merge Context Paragraphs

In [4]:

def merge_context(example):
    paragraphs = example["context"]["sentences"]
    example["merged_context"] = " ".join([" ".join(p) for p in paragraphs])
    return example

dataset = dataset.map(merge_context)


Map:   0%|          | 0/90447 [00:00<?, ? examples/s]

Map:   0%|          | 0/7405 [00:00<?, ? examples/s]

## 4. Exploratory Data Analysis

In [5]:

questions = dataset["train"]["question"]
pd.Series(questions).str.split().str[0].value_counts().head(10)


Unnamed: 0,count
What,19837
Which,11744
The,8195
Who,7480
Are,3492
In,3379
When,2309
Where,1622
How,1324
Were,608


In [6]:

answers = dataset["train"]["answer"]
answer_lengths = [len(a.split()) for a in answers]
pd.Series(answer_lengths).describe()


Unnamed: 0,0
count,90447.0
mean,2.226287
std,1.809021
min,1.0
25%,1.0
50%,2.0
75%,3.0
max,89.0


## 5. Sample Questions and Answers

In [7]:

for i in range(5):
    print("Question:", dataset["train"][i]["question"])
    print("Answer:", dataset["train"][i]["answer"])
    print("Context:", dataset["train"][i]["merged_context"][:300])
    print("-"*80)


Question: Which magazine was started first Arthur's Magazine or First for Women?
Answer: Arthur's Magazine
Context: Radio City is India's first private FM radio station and was started on 3 July 2001.  It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).  It plays Hindi, English and regional
--------------------------------------------------------------------------------
Question: The Oberoi family is part of a hotel company that has a head office in what city?
Answer: Delhi
Context: The Ritz-Carlton Jakarta is a hotel and skyscraper in Jakarta, Indonesia and 14th Tallest building in Jakarta.  It is located in city center of Jakarta, near Mega Kuningan, adjacent to the sister JW Marriott Hotel.  It is operated by The Ritz-Carlton Hotel Company.  The complex has two towers that c
--------------------------------------------------------------------------------
Quest

## 6. Tokenization and Preprocessing

In [8]:

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def preprocess(example):
    encoding = tokenizer(
        example["question"],
        example["merged_context"],
        truncation=True,
        padding="max_length",
        max_length=384,
        return_offsets_mapping=True
    )

    answer = example["answer"]
    start_char = example["merged_context"].find(answer)
    end_char = start_char + len(answer)

    start_pos, end_pos = 0, 0
    for i, (start, end) in enumerate(encoding["offset_mapping"]):
        if start <= start_char < end:
            start_pos = i
        if start < end_char <= end:
            end_pos = i

    encoding["start_positions"] = start_pos
    encoding["end_positions"] = end_pos
    encoding.pop("offset_mapping")
    return encoding

tokenized_dataset = dataset["train"].select(range(20000)).map(preprocess)


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

## 7. Load and Fine-tune BERT

## Hyperparameter Experiments
We experimented with:
- Learning rates: 2e-5, 3e-5
- Batch sizes: 8, 16
- Epochs: 2, 3

Based on training stability and runtime constraints, we selected:
learning_rate = 2e-5, batch_size = 8, epochs = 2

In [9]:

model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

training_args = TrainingArguments(
    output_dir="./bert-qa",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=500
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

trainer.train()


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  | |_| | '_ \/ _` / _` |  _/ -_)
[34m[1mwandb[0m: (1) Create a W&B account
[34m[1mwandb[0m: (2) Use an existing W&B account
[34m[1mwandb[0m: (3) Don't visualize my results
[34m[1mwandb[0m: Enter your choice:

 3


[34m[1mwandb[0m: You chose "Don't visualize my results"


Step,Training Loss
500,2.9082
1000,2.0675
1500,1.8063
2000,1.6573
2500,1.5444
3000,1.248
3500,1.1902
4000,1.1439
4500,1.1552
5000,1.1448


TrainOutput(global_step=5000, training_loss=1.5865746826171876, metrics={'train_runtime': 3344.975, 'train_samples_per_second': 11.958, 'train_steps_per_second': 1.495, 'total_flos': 7838902702080000.0, 'train_loss': 1.5865746826171876, 'epoch': 2.0})

## 8. Save Model

In [10]:

model.save_pretrained("./bert-qa-model")
tokenizer.save_pretrained("./bert-qa-model")


('./bert-qa-model/tokenizer_config.json',
 './bert-qa-model/special_tokens_map.json',
 './bert-qa-model/vocab.txt',
 './bert-qa-model/added_tokens.json',
 './bert-qa-model/tokenizer.json')

In [11]:
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="./bert-qa-model",
    tokenizer="./bert-qa-model"
)

context = """Ramanujan is a great Mathematician from India. """
question = "Who is Ramanujan?"

result = qa_pipeline(question=question, context=context)
print(result)


Device set to use cuda:0


{'score': 0.44675394892692566, 'start': 21, 'end': 34, 'answer': 'Mathematician'}


In [14]:
!zip -r "bert-qa-model.zip" "bert-qa-model"
from google.colab import files
files.download("bert-qa-model.zip")


  adding: bert-qa-model/ (stored 0%)
  adding: bert-qa-model/config.json (deflated 47%)
  adding: bert-qa-model/model.safetensors (deflated 7%)
  adding: bert-qa-model/special_tokens_map.json (deflated 42%)
  adding: bert-qa-model/vocab.txt (deflated 53%)
  adding: bert-qa-model/tokenizer_config.json (deflated 75%)
  adding: bert-qa-model/tokenizer.json (deflated 71%)


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>