# World Bank Financial Survey Q&A Model Project 

This project develops an NLP-powered question-answering system trained on World Bank survey data containing financial information gathered from various federal banks across the globe. This notebook walks the user through gathering and processing the data, then training and deploying the final model.

### Dataset Description
The World Bank survey dataset comprises structured financial questions sent to financial institutions worldwide. The dataset includes multi-dimensional survey responses, hierarchical question structures, and financial metrics. For this project, we will use the questions that have long-form textual answers to train an NLP model, rather than using binary response questions.

### Project Architecture

##### Phase 1 - Data Processing 

- Transform unstructured survey data into structured NLP training pairs
    - Parse all relevant sheets from the Excel file
    - Properly handle hierarchical question structures to ensure each question-answer pair is standalone
- Identify and flag PII in the dataset using an ML model

##### Phase 2 - Model Development & Fine Tuning

- Fine-tune a Google FLAN-T5-Base NLP model 
- Optimize the model's performance on this specific World Bank survey domain
- Evaluate model performance using validation and test sample sets

##### Phase 3 - Deployment

- Deploy the fine-tuned model to a production environment on Azure/Hugging Face

##### Future Steps (if time allows):
- Implement an API for interacting with the model
- Add some form of sentiment analysis to classify questions/answers (financial questions, admin questions, etc.)
- Get feedback on model performance (answer quality/hallucinations/knowledge gaps)
- Add additional survey questions to the knowledge base

In [None]:
# download data from World Bank Database
import requests

url = "https://datacatalogfiles.worldbank.org/ddh-published/0038632/2/DR0047737/2021_04_26_brss-public-release.xlsx"
response = requests.get(url, timeout=60)
response.raise_for_status()  # stop early if the download failed

with open("worldbank_data.xlsx", "wb") as f:
    f.write(response.content)

Now that the data is downloaded, it needs to be converted from an .xlsx workbook, laid out with one question per row and one country per column, into the question:answer pairs that T5 training expects.
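
As a rough illustration of this conversion, the sketch below runs the same idea on a tiny hand-made table instead of the real workbook. The `sheet` contents, the country names, and the `to_pairs` helper are invented for this example; the actual processing cell also handles parent questions and question-index parsing.

```python
# Toy stand-in for one survey sheet: row 0 is the header with country names,
# every later row is [question index, question text, one answer per country].
# All values are invented for illustration.
sheet = [
    [None, None, "Country A", "Country B"],
    ["Q1_1", "What body/agency grants banking licenses?", "Central Bank", None],
    ["Q1_2", "Is more than one license required?", "No", "Yes"],
]

def to_pairs(rows):
    """Turn a wide question-by-country table into input/target pairs."""
    countries = rows[0][2:]
    pairs = []
    for row in rows[1:]:
        question = row[1]
        for country, answer in zip(countries, row[2:]):
            if answer is None:  # skip countries that gave no answer
                continue
            pairs.append({
                "input": f"Answer this question about {country}: {question}",
                "target": str(answer),
            })
    return pairs

pairs = to_pairs(sheet)
for pair in pairs:
    print(pair)
```

Each non-empty cell becomes one training pair, so a question answered by many countries contributes many samples.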

In [1]:
## read and process data
import pandas as pd
import re

# Remove extra unnecessary information from question
# For example, "Select all that apply"
def simplify_question(qText):
    if pd.isna(qText):
        return ""
    
    text = str(qText).strip()
    
    # split on common instruction starters and take first part
    for splitter in [" Please ", " If ", " Include ", " Specify ", " Describe ", " List "]:
        if splitter in text:
            text = text.split(splitter)[0]
            break
    
    # if there's a question mark, take up to first one
    if "?" in text:
        text = text.split("?")[0] + "?"
    
    return text.strip()

# loads all sheets at once
allSheets = pd.ExcelFile("worldbank_data.xlsx")

# store samples
samples = []

# process all sheets except first 2 and last 1
process = allSheets.sheet_names[2:-1]

# read first sheet and extract countries
dfFirst = pd.read_excel(allSheets, sheet_name=process[0], header=None)
countries = [str(c) for c in dfFirst.iloc[0, 2:].values if not pd.isna(c)]

for sheet in process:
    # read current sheet
    df = pd.read_excel(allSheets, sheet_name=sheet, header=None)
    
    # create parent and base vars
    parent = None
    currBase = None
    
    # iterate through every row except header
    # get question index and question text
    for idx, row in df.iloc[1:].iterrows():
        qIndex = row[0]
        qText = row[1]
        
        # if the question index is null but text does exist 
        # then the question is a parent question
        # assign parent question and then clear prev base and move onto next row
        if pd.isna(qIndex) and not pd.isna(qText):
            parent = simplify_question(qText)
            currBase = None
            continue
        
        # regex starts with Q and captures groups delimited by _
        # group 1 is the main question number
        # group 2 is sub-question number
        # group 3 is for multi-part questions with extra text
        # non-capturing group is for sections of index which are unnecessary
        match = re.match(r'Q(\d+)_([0-9_]+?)([a-zA-Z_]+)?(?:_[A-Z]|_\d{4}|$)', str(qIndex))
        
        # if regex matched then process row, otherwise skip
        if match:
            baseNum = f"{match.group(1)}_{match.group(2)}"
            isMulti = bool(match.group(3)) or bool(re.search(r'_\d{4}', str(qIndex)))
            part = match.group(3) if match.group(3) else ""
        else:
            continue
        
        # if new base is different to current base, update base
        if baseNum and baseNum != currBase:
            # reset parent if new question isn't multi part
            if not isMulti:
                parent = None
            currBase = baseNum
        
        # simplify the question text once per row
        simplifiedQ = simplify_question(qText)

        # if question is multi-part, combine parent question and question text
        if isMulti and parent:
            completeQ = f"{parent} {simplifiedQ}"
        # otherwise use the question text on its own
        else:
            completeQ = simplifiedQ

        # loop through each country column
        for colIdx, country in enumerate(countries):

            # get answer for current column
            answer = row[colIdx + 2]

            # skip column if there's no answer
            if pd.isna(answer):
                continue
            # fill in sample entry
            sample = {
                "input": f"Answer this question about {country}: {completeQ}".strip(),
                "target": str(answer).strip()
            }
            
            # append sample to list
            samples.append(sample)
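
To see how the index regex in the cell above behaves, the sketch below wraps it in a small helper and runs it on a few index strings. The `parse_index` helper and the sample indices are hypothetical and chosen only to exercise each branch; they are not taken from the workbook.

```python
import re

# Same pattern as in the processing cell above.
INDEX_PATTERN = re.compile(r'Q(\d+)_([0-9_]+?)([a-zA-Z_]+)?(?:_[A-Z]|_\d{4}|$)')

def parse_index(q_index):
    """Return (base number, part label, is multi-part), or None if there is no match."""
    match = INDEX_PATTERN.match(str(q_index))
    if not match:
        return None
    base = f"{match.group(1)}_{match.group(2)}"
    part = match.group(3) or ""
    is_multi = bool(part) or bool(re.search(r'_\d{4}', str(q_index)))
    return base, part, is_multi

# Hypothetical index values, one per branch of the pattern.
results = {}
for q_index in ["Q1_1", "Q3_20_1", "Q12_1_1a", "Q4_5_2016", "Notes"]:
    results[q_index] = parse_index(q_index)
    print(q_index, "->", results[q_index])
```

Because `re.match` only anchors at the start of the string, a trailing year such as `_2016` is consumed by the final non-capturing group rather than by group 3, and the row is marked multi-part by the separate `_\d{4}` search.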

Now that the data is in proper training format, it needs to be checked for PII. We will use Microsoft's Presidio pre-trained ML library to detect PII (https://github.com/microsoft/presidio).

In [None]:
# install dependencies
# !pip install presidio_analyzer presidio_anonymizer
# !python -m spacy download en_core_web_lg

In [None]:
from presidio_analyzer import AnalyzerEngine
from tqdm import tqdm
import json

# initialize analyzer
analyzer = AnalyzerEngine()

# specific countries and years are necessary to the survey data
# do not flag these as PII
excludeWords = set(countries)
excludeWords.update(['2011', '2012', '2013', '2014', '2015', '2016'])

# only keep detections that the model scores at 0.7 confidence or higher
CONFIDENCE = 0.7

# only track unique PII values
seenPII = set()

# storage for PII
potentialPII = []

# iterate through every sample
for idx, sample in enumerate(tqdm(samples, desc='finding pii')):

    # get input question and target
    inputText = sample["input"]
    targetText = sample["target"]

    # analyze input and target
    inputRes = analyzer.analyze(text=inputText, language='en')
    targetRes = analyzer.analyze(text=targetText, language='en')

    # filter out exclude list from text matches
    inputRes = [r for r in inputRes 
                if r.score >= CONFIDENCE
                and not any(inputText[r.start:r.end] in word or word in inputText[r.start:r.end] for word in excludeWords)] 
    targetRes = [r for r in targetRes 
                 if r.score >= CONFIDENCE 
                 and not any(targetText[r.start:r.end] in word or word in targetText[r.start:r.end] for word in excludeWords)]

    # check whether any detection is a PII string that has not been seen before
    isNewPII = False
    for r in inputRes:
        if inputText[r.start:r.end] not in seenPII:
            isNewPII = True
            seenPII.add(inputText[r.start:r.end])
    for r in targetRes:
        if targetText[r.start:r.end] not in seenPII:
            isNewPII = True
            seenPII.add(targetText[r.start:r.end])

    if isNewPII:
        res = {
            "input": inputText,
            "target": targetText,
            "inputPII": [{"type": r.entity_type, "text": inputText[r.start:r.end], "score": r.score} for r in inputRes],
            "targetPII": [{"type": r.entity_type, "text": targetText[r.start:r.end], "score": r.score} for r in targetRes]
        }
        potentialPII.append(res)

# dump all potential flagged PII into a json file
with open('potentialPII.json', 'w', encoding='utf-8') as f:
    json.dump(potentialPII, f, indent=2, ensure_ascii=False)


finding pii: 100%|██████████| 107833/107833 [35:25<00:00, 50.73it/s] 


The code dumps all potential PII matches to a separate JSON file saved in the current directory (potentialPII.json). This file can now be checked manually to determine which flagged keywords are false positives and which are actually PII. Once all PII is removed from the dataset, the T5 model training can begin.
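
One possible way to apply the result of that manual review is sketched below. It assumes the reviewer ends up with a plain list of confirmed PII strings; `confirmed_pii`, the `[REDACTED]` placeholder, and both helper functions are assumptions made for this example and are not part of the notebook or of Presidio.

```python
# Hypothetical reviewed list: strings from potentialPII.json that a human
# confirmed as real PII. The values here are invented.
confirmed_pii = ["jane.doe@example.org", "Jane Doe"]

PLACEHOLDER = "[REDACTED]"  # assumed replacement token

def redact(text, pii_values, placeholder=PLACEHOLDER):
    """Replace every confirmed PII string in text with a placeholder."""
    # Replace longer strings first so shorter ones cannot leave fragments behind.
    for value in sorted(pii_values, key=len, reverse=True):
        text = text.replace(value, placeholder)
    return text

def redact_samples(samples, pii_values):
    """Return new samples with PII removed from both input and target."""
    return [
        {"input": redact(s["input"], pii_values),
         "target": redact(s["target"], pii_values)}
        for s in samples
    ]

toy_samples = [
    {"input": "Answer this question about Country A: Who is the contact person?",
     "target": "Jane Doe (jane.doe@example.org)"},
]
clean = redact_samples(toy_samples, confirmed_pii)
print(clean[0]["target"])
```

Plain string replacement is case-sensitive and only catches exact matches, so a stricter pipeline might drop affected samples entirely instead of redacting them.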

In [None]:
# install dependencies
# !pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
# !pip install transformers datasets accelerate
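
The training cell that follows pairs AdamW at `lr=5e-6` with `CosineAnnealingLR(optimizer, T_max=7)`, stepped once per epoch. As a sketch of what that schedule does, the snippet below evaluates the closed-form cosine annealing formula by hand, assuming the minimum learning rate is left at 0. The `cosine_annealing_lr` helper is written for this example and is not a PyTorch function.

```python
import math

def cosine_annealing_lr(base_lr, epoch, t_max, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr toward eta_min."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

base_lr, num_epochs = 5e-6, 7
schedule = [cosine_annealing_lr(base_lr, epoch, t_max=num_epochs)
            for epoch in range(num_epochs)]

for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.2e}")
```

The first epoch runs at the full learning rate and each later epoch uses a smaller one, with the steepest drop in the middle of training.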


In [None]:
import torch
from torch.utils.data import DataLoader
from torch.nn import CrossEntropyLoss
from torch.optim.lr_scheduler import CosineAnnealingLR
from tqdm import tqdm
from transformers import (
    AutoTokenizer, 
    AutoModelForSeq2SeqLM, 
    DataCollatorForSeq2Seq,
)
from datasets import Dataset
import random

# balances samples to significantly reduce training time for project constraints
# also helps prevent the model from learning to predict yes/no for every question
samplesSmall = [s for s in samples if len(s["target"].split()) < 3]
samplesLarge = [s for s in samples if len(s["target"].split()) >= 3]
random.seed(42)
samplesBalanced = (
    random.sample(samplesLarge, min(int(len(samples) * 0.7), len(samplesLarge))) + 
    random.sample(samplesSmall, min(int(len(samples) * 0.3), len(samplesSmall)))
)
random.shuffle(samplesBalanced)

# convert existing data to hugging face dataset
data = Dataset.from_list(samplesBalanced)

# Setup
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def preprocess(samples):
    modelInputs = tokenizer(
        samples["input"],
        max_length=512,
        truncation=True,
        padding=False
    )
    targets = tokenizer(
        samples["target"],
        max_length=128,
        truncation=True,
        padding=False
    )
    modelInputs["labels"] = targets["input_ids"]
    return modelInputs

trainValSplit = data.train_test_split(test_size=0.2)
valTestSplit = trainValSplit["test"].train_test_split(test_size=0.5)

splits = {
    "train": trainValSplit["train"],
    "validation": valTestSplit['train'],
    "test": valTestSplit["test"]
}

finalData = {
    "train": splits["train"].map(preprocess, batched=True, remove_columns=["input", "target"]),
    "validation": splits["validation"].map(preprocess, batched=True, remove_columns=["input", "target"]),
    "test": splits["test"].map(preprocess, batched=True, remove_columns=["input", "target"])
}

dataCollator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=0.01)

# learning rate scheduling - reduces the lr over all epochs
# helps the model converge by reducing oscillation
scheduler = CosineAnnealingLR(optimizer, T_max=7)

# Create dataloaders
train_dataloader = DataLoader(
    finalData["train"], 
    batch_size=4, 
    shuffle=True, 
    collate_fn=dataCollator
)

val_dataloader = DataLoader(
    finalData["validation"],
    batch_size=4,
    collate_fn=dataCollator
)

num_epochs = 7
device = "cuda"

for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    
    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
    for step, batch in enumerate(progress_bar):
        # move the batch tensors to the GPU
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        
        # forward pass; the loss is recomputed from the logits with label smoothing,
        # which makes the model less confident and improves generalization (performance on unseen data)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        logits = outputs.logits
        loss_fct = CrossEntropyLoss(label_smoothing=0.1, ignore_index=-100)
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        
        # clear old gradients, then backpropagate the new loss
        optimizer.zero_grad()
        loss.backward()

        # gradient clipping - caps the gradient norm at 1.0 so one large update cannot destabilize training
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        total_loss += loss.item()
        progress_bar.set_postfix({"loss": f"{loss.item():.4f}"})
            
    avg_train_loss = total_loss / len(train_dataloader)
    print(f"\nEpoch {epoch+1} - Avg Train Loss: {avg_train_loss:.4f}")

    scheduler.step()
    
    model.eval()
    total_val_loss = 0
    
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            total_val_loss += outputs.loss.item()
    
    avg_val_loss = total_val_loss / len(val_dataloader)
    print(f"Epoch {epoch+1} - Avg Val Loss: {avg_val_loss:.4f}")
    
model.save_pretrained("./flan-t5-small-label-smooth-balanced")
tokenizer.save_pretrained("./flan-t5-small-label-smooth-balanced")


Map: 100%|██████████| 30276/30276 [00:03<00:00, 8391.21 examples/s] 
Map: 100%|██████████| 3784/3784 [00:00<00:00, 16175.57 examples/s]
Map: 100%|██████████| 3785/3785 [00:00<00:00, 14886.58 examples/s]
Epoch 1/7:  15%|█▍        | 1114/7569 [03:43<18:23,  5.85it/s, loss=2.7522] 
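
The training loop above recomputes the loss with `label_smoothing=0.1`. The sketch below works through that idea by hand on a single prediction, assuming the usual definition in which the target distribution puts `1 - smoothing` on the true class and spreads `smoothing` evenly over all classes. The `smoothed_cross_entropy` helper and the logits are invented for illustration.

```python
import math

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution."""
    # log-softmax, computed stably by subtracting the largest logit
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    k = len(logits)
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        # every class gets smoothing / k; the true class gets 1 - smoothing on top
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        loss -= q * log_p
    return loss

logits = [4.0, 1.0, 0.5]  # invented scores over a 3-token vocabulary
plain = smoothed_cross_entropy(logits, target=0, smoothing=0.0)
smooth = smoothed_cross_entropy(logits, target=0, smoothing=0.1)
print(f"plain: {plain:.4f}  smoothed: {smooth:.4f}")
```

A confident, correct prediction is penalized slightly more under smoothing. Since the validation loop above reads `outputs.loss`, which is computed without smoothing, the training and validation losses it prints are not on exactly the same scale.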

Now that the model is fine-tuned on the initial dataset, it can be queried locally to check that it produces sensible predictions.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import random

model = AutoModelForSeq2SeqLM.from_pretrained("./flan-t5-bsae-CUSTOM-TRAINED").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("./flan-t5-bsae-CUSTOM-TRAINED")

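# sample from the held-out test split so the spot check uses examples the model was not trained on
testData = splits["test"]
test_indices = random.sample(range(len(testData)), 20)

for idx in test_indices:
    sample = testData[idx]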
    question = sample["input"]
    true_answer = sample["target"]
    
    inputs = tokenizer(question, return_tensors="pt", max_length=512, truncation=True).to("cuda")
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    predicted = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
    print(f"\nQ: {question[:70]}...")
    print(f"True: {true_answer[:60]}...")
    print(f"Pred: {predicted}")

# Test on custom questions
print("custom questions:")

custom_questions = [
    "Answer this question about United States: What body/agency grants banking licenses?",
    "Answer this question about France: What is the minimum capital requirement?",
    "Answer this question about Japan: Who regulates banks?"
]

for q in custom_questions:
    inputs = tokenizer(q, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\nQ: {q}")
    print(f"A: {pred}")


Q: Answer this question about Portugal: 14.1 What body/agency has the res...
True: The Central Bank for the retail banking sector. Comissão do...
Pred: The Financial Supervisory Authority of Portugal (Financial Supervisory Authority) is responsible for implementing, overseeing and enforcing any aspects of financial consumer protection laws and regulations. The Financial Supervisory Authority of Portugal (Financial Supervisory Authority) is responsible for implementing, overseeing and enforcing any aspects of financial consumer protection laws and regulations.

Q: Answer this question about Cayman Islands: 3.20.1 Which of the followi...
True: Limit of 1.25% of risk weighted assets...
Pred: Tier 2 capital is not recognised by the Cayman Islands Monetary Authority

Q: Answer this question about Lebanon: 12.1.1 a. Commercial banks...
True: Banking Control Commision of Lebanon and Special investigati...
Pred: Banking Control Commision of Lebanon and Special investigation Committee

Q: Ans
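
Spot checks like the ones above are hard to compare between runs. A common way to score generated answers is exact match plus token-level F1; the sketch below is a simplified version under that assumption, using only lowercasing and whitespace tokenization. The helper functions are written for this example. The first pair is the Cayman Islands sample from the output above and the second pair is invented.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace so comparisons ignore case and spacing."""
    return text.lower().split()

def exact_match(pred, truth):
    return normalize(pred) == normalize(truth)

def token_f1(pred, truth):
    """Harmonic mean of token precision and recall between prediction and truth."""
    pred_tokens, true_tokens = normalize(pred), normalize(truth)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

scored_pairs = [
    ("Tier 2 capital is not recognised by the Cayman Islands Monetary Authority",
     "Limit of 1.25% of risk weighted assets"),
    ("The Central Bank", "Central Bank"),  # invented pair
]
scores = [(exact_match(p, t), token_f1(p, t)) for p, t in scored_pairs]
for (pred, truth), (em, f1) in zip(scored_pairs, scores):
    print(f"EM={em}  F1={f1:.2f}  pred={pred!r}")
```

Averaging these scores over the whole test split would give a single number to track across training runs, although token overlap cannot tell whether a differently worded answer is factually right.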

Now that we have confirmed the model is working, we can upload it to a server. This project uses Hugging Face because it is free and allows for easy testing. For production we would use Azure/AWS/GCP.

In [None]:
# you may need to run the authentication command directly in your terminal 
!pip install huggingface_hub
!hf auth login



Now that you are logged in to Hugging Face, you can upload the trained model.

In [None]:
# upload the model and tokenizer to the Hugging Face Hub
model.push_to_hub("mian21/flan-t5-bsae-CUSTOM-TRAINED")
tokenizer.push_to_hub("mian21/flan-t5-bsae-CUSTOM-TRAINED")

model.safetensors: 100%|██████████| 308M/308M [03:58<00:00, 1.29MB/s]   
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


CommitInfo(commit_url='https://huggingface.co/mian21/flan-t5-bsae-CUSTOM-TRAINED/commit/9269846cf7bba0c68b074278b603102fde5356e8', commit_message='Upload tokenizer', commit_description='', oid='9269846cf7bba0c68b074278b603102fde5356e8', pr_url=None, repo_url=RepoUrl('https://huggingface.co/mian21/flan-t5-bsae-CUSTOM-TRAINED', endpoint='https://huggingface.co', repo_type='model', repo_id='mian21/flan-t5-bsae-CUSTOM-TRAINED'), pr_revision=None, pr_num=None)

The model can be queried directly from the Hugging Face server using API requests, or loaded into your code with the `transformers` Auto classes (`AutoTokenizer` and `AutoModelForSeq2SeqLM`).
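
A minimal sketch of the second option is shown below, reusing the repository name from the upload cell and the generation settings from the local test cell. The `build_prompt` and `answer` helpers are written for this example; `answer` needs network access and the `transformers` and `torch` packages, so the call is left commented out.

```python
def build_prompt(country, question):
    """Recreate the prompt format used for the training inputs."""
    return f"Answer this question about {country}: {question}"

def answer(country, question, repo_id="mian21/flan-t5-bsae-CUSTOM-TRAINED"):
    """Download the model from the Hub and generate an answer on CPU."""
    # imported lazily so build_prompt works without transformers installed
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    inputs = tokenizer(build_prompt(country, question), return_tensors="pt",
                       max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

prompt = build_prompt("Japan", "Who regulates banks?")
print(prompt)
# print(answer("Japan", "Who regulates banks?"))  # uncomment to download and run the model
```

Keeping the prompt construction in one function makes it harder for inference requests to drift away from the format the model was trained on.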

In [None]:
import requests

API_URL = "https://router.huggingface.co/hf-inference/models/mian21/flan-t5-bsae-CUSTOM-TRAINED"
headers = {"Authorization": "Bearer "}

payload = {
  # use the same prompt format as the training inputs
  "inputs": "Answer this question about United States: What body/agency grants banking licenses?",
  "parameters": {"max_new_tokens": 128, "temperature": 0.2}
}

resp = requests.post(API_URL, headers=headers, json=payload, timeout=120)
print(resp.status_code, resp.text)


404 Not Found
