***Cross-lingual Question Answering with BERT*** \
This notebook demonstrates how to use a pre-trained BERT model for question answering. We will:
1. Load a pre-trained BERT model and tokenizer.
2. Tokenize the inputs.
3. Perform inference to get the start and end positions of the answer.
4. Decode the tokens to get the final answer string.


In [2]:
# Import all the necessary libraries
import pandas as pd
import numpy as np
from googletrans import Translator

In [3]:
# import data set with all questions and answers in Hindi
df = pd.read_csv('squad_translated_to_hindi_5k.csv')

In [5]:
translator = Translator()

***Translate all contexts, questions, and answers from Hindi to English using Google Translate***

In [7]:
# Create a list to store the translated questions
translated_questions = []

for index, row in df.iterrows():
    translator = Translator()
    try:
        # Translate the 'question' column to English
        translated = translator.translate(row['question'], dest='en')
        translated_questions.append(translated.text)
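    except Exception as e:
        # Report the failure, as the answer and context loops do, instead of hiding it
        print(f"Error at index {index}: {e}")
        translated_questions.append(None)  # placeholder in case of error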

df['questions_en'] = translated_questions

In [8]:
# Create a list to store translated answers
translated_answers = []

for index, row in df.iterrows():
    translator = Translator()
    try:
        # Translate the 'answer' column to English
        translated = translator.translate(row['answer_text'], dest='en')
        translated_answers.append(translated.text)
    except Exception as e:
        print(f"Error at index {index}: {e}")
        translated_answers.append(None)  # or some placeholder in case of error

df['answers_en'] = translated_answers

In [138]:
# Translate context to English
translated_context = []

for index, row in df.iterrows():
    translator = Translator()
    try:
        # Translate the 'context' column to English
        translated = translator.translate(row['context'], dest='en')
        translated_context.append(translated.text)
    except Exception as e:
        print(f"Error at index {index}: {e}")
        translated_context.append(None)

df['context_en'] = translated_context
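
Several rows share the same context (rows 0 and 1 in the `df.head()` output, for example), so translating row by row sends the same text to the translator repeatedly. A minimal sketch of translating each distinct text only once is given here. `translate_unique` and `translate_fn` are hypothetical names introduced for illustration; `translate_fn` is assumed to be any callable that maps a string to its translation.

```python
from typing import Callable, Dict, List, Optional


def translate_unique(
    texts: List[str], translate_fn: Callable[[str], str]
) -> List[Optional[str]]:
    """Translate each distinct text once and reuse the result for duplicates."""
    cache: Dict[str, Optional[str]] = {}
    for text in texts:
        if text in cache:
            continue  # already translated (or already failed) for this text
        try:
            cache[text] = translate_fn(text)
        except Exception as e:
            print(f"Translation failed: {e}")
            cache[text] = None  # placeholder, same convention as the loops above
    # Rebuild a list aligned with the input order
    return [cache[text] for text in texts]


# Hypothetical usage with the translator from this notebook:
# df['context_en'] = translate_unique(
#     df['context'].tolist(),
#     lambda t: translator.translate(t, dest='en').text,
# )
```

Because the result is aligned with the input list, it can be assigned to a DataFrame column directly.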

In [9]:
df.head()

Unnamed: 0,context,question,answer_start,answer_text,language,questions_en,answers_en
0,"वास्तुशिल्प रूप से, स्कूल में एक कैथोलिक चरित्...",नोट्रे डेम में ग्रोटो क्या है?,332,प्रार्थना और प्रतिबिंब का एक मैरियन स्थान,hindi,What is the grotto at Notre Dame?,A Marian place of prayer and reflection
1,"वास्तुशिल्प रूप से, स्कूल में एक कैथोलिक चरित्...",नोट्रे डेम में मुख्य भवन के शीर्ष पर क्या बैठत...,82,वर्जिन मैरी की एक सुनहरी मूर्ति,hindi,What sits on top of the main building at Notre...,A golden statue of the Virgin Mary
2,"अधिकांश अन्य विश्वविद्यालयों के रूप में, नोट्र...",नोट्रे डेम की विद्वान पत्रिका कब प्रकाशित हुई?,211,सितंबर 1876,hindi,When was Notre Dame's scholarly magazine publi...,September 1876
3,"अधिकांश अन्य विश्वविद्यालयों के रूप में, नोट्र...",कितनी बार नोट्रे डेम का जुगलर प्रकाशित होता है?,429,दो बार,hindi,How often is The Juggler of Notre Dame published?,twice
4,"अधिकांश अन्य विश्वविद्यालयों के रूप में, नोट्र...",नोट्रे डेम में कितने छात्र समाचार पत्र पाए जात...,124,तीन,hindi,How many student newspapers are found at Notre...,Three


In [35]:
final_df = df[['context', 'questions_en', 'answer_start', 'answers_en']]

In [36]:
final_df

Unnamed: 0,context,questions_en,answer_start,answers_en
0,"वास्तुशिल्प रूप से, स्कूल में एक कैथोलिक चरित्...",What is the grotto at Notre Dame?,332,A Marian place of prayer and reflection
1,"वास्तुशिल्प रूप से, स्कूल में एक कैथोलिक चरित्...",What sits on top of the main building at Notre...,82,A golden statue of the Virgin Mary
2,"अधिकांश अन्य विश्वविद्यालयों के रूप में, नोट्र...",When was Notre Dame's scholarly magazine publi...,211,September 1876
3,"अधिकांश अन्य विश्वविद्यालयों के रूप में, नोट्र...",How often is The Juggler of Notre Dame published?,429,twice
4,"अधिकांश अन्य विश्वविद्यालयों के रूप में, नोट्र...",How many student newspapers are found at Notre...,124,Three
...,...,...,...,...
4729,"25 अप्रैल, 2007 को, इंटरनेट संग्रह और सुजैन खो...",Who said they had no desire to infringe on ind...,102,Internet Archive
4730,2013-14 में एक अश्लील अभिनेता खुद की संग्रहीत ...,What was the first method the actor tried to c...,108,DMCA request
4731,"16 वीं शताब्दी तक, निम्न देश - वर्तमान में नीद...",Which counties in the Low Countries were not r...,215,flanders
4732,अधिकांश निम्न देश बर्गंडी के घर के शासन में आए...,Who issued practical approval?,86,Holy Roman Emperor Charles V


In [42]:
import torch
from transformers import BertTokenizer, BertForQuestionAnswering, AdamW, AutoTokenizer


### Step 1: Initialize the tokenizer and model
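I am using the "bert-base-multilingual-cased" model, a multilingual BERT checkpoint. It is not fine-tuned for question answering: the QA head (`qa_outputs`) is newly initialized, as the warning below shows, so the model has to be fine-tuned before it can be used for English or cross-lingual question answering.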


In [58]:
model_name = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
df = pd.read_csv('final_df.csv')

In [12]:
!pip install fuzzywuzzy


Collecting fuzzywuzzy
  Using cached fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [34]:
df.iloc[4729,3]

3

In [41]:
df.iloc[1,0]

'Architecturally, the school has a Catholic character. The gold dome of the main building is topped by a golden statue of the Virgin Mary. In front of and facing the main building, there is a copper statue of Christ surmounted with the legend "Venit edi me omnes". Next to the main building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto in Lourdes, France where the Virgin Mary venerated Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a straight line that connects to the 3 statues and the gold dome), there is a simple, modern stone statue of Mary.'

***Once the questions, answers, and contexts are translated to English, the start position of each answer in its context changes. To find the new position, we use fuzzy string matching.***

In [67]:
# Use fuzzy matching to find the answer position
from fuzzywuzzy import process

def find_answer_positions(row):
    context = row['context_en'].lower()
    answer = row['answers_en'].lower()

    # Use fuzzy matching to find the single context word closest to the answer
    best_match = process.extractOne(answer, context.split())
    if not best_match:
        raise ValueError(f"Answer not found in the context. Context: {context} Answer: {answer}")

    start_pos = context.find(best_match[0])
    if start_pos == -1:
        raise ValueError(f"Answer not found in the context. Context: {context} Answer: {answer}")
    end_pos = start_pos + len(answer)

    return start_pos, end_pos

In [68]:
# Add start and end positions to the dataframe
df[['answer_start', 'answer_end']] = df.apply(lambda row: find_answer_positions(row), axis=1, result_type="expand")
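
`find_answer_positions` compares the whole answer against single words of the context, which can be imprecise for multi-word answers. As an alternative, here is a minimal sketch that tries an exact match first and otherwise scores word windows of the same length as the answer. It uses `difflib` from the standard library rather than fuzzywuzzy, and `find_span` is a hypothetical helper, not part of the original pipeline.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def find_span(context: str, answer: str) -> Tuple[int, int]:
    """Return (start, end) character offsets of the part of `context` closest to `answer`."""
    ctx, ans = context.lower(), answer.lower()

    # An exact match gives the true character span directly.
    exact = ctx.find(ans)
    if exact != -1:
        return exact, exact + len(ans)

    # Record the character span of every whitespace-separated word.
    word_spans: List[Tuple[int, int]] = []
    pos = 0
    for word in ctx.split():
        start = ctx.index(word, pos)
        word_spans.append((start, start + len(word)))
        pos = start + len(word)
    if not word_spans:
        raise ValueError("Context is empty.")

    # Slide a window with as many words as the answer and keep the most similar one.
    n = max(1, len(ans.split()))
    best, best_score = word_spans[0], -1.0
    for i in range(max(1, len(word_spans) - n + 1)):
        start = word_spans[i][0]
        end = word_spans[min(i + n, len(word_spans)) - 1][1]
        score = SequenceMatcher(None, ans, ctx[start:end]).ratio()
        if score > best_score:
            best, best_score = (start, end), score
    return best
```

Unlike `start_pos + len(answer)`, the end offset here comes from the matched window, so it stays inside the context even when the translated answer and the context wording differ in length.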



In [7]:
df = df[['context_en', 'questions_en', 'answers_en', 'answer_start', 'answer_end']]

In [8]:
df

Unnamed: 0,context_en,questions_en,answers_en,answer_start,answer_end
0,"Architecturally, the school has a Catholic cha...",What is the grotto at Notre Dame?,A Marian place of prayer and reflection,376,415
1,"Architecturally, the school has a Catholic cha...",What sits on top of the main building at Notre...,A golden statue of the Virgin Mary,104,138
2,"As at most other universities, Notre Dame stud...",When was Notre Dame's scholarly magazine publi...,September 1876,237,251
3,"As at most other universities, Notre Dame stud...",How often is The Juggler of Notre Dame published?,twice,287,292
4,"As at most other universities, Notre Dame stud...",How many student newspapers are found at Notre...,Three,120,125
...,...,...,...,...,...
4729,"On April 25, 2007, the Internet Archive and Su...",Who said they had no desire to infringe on ind...,Internet Archive,23,39
4730,In 2013–14 a pornographic actor was trying to ...,What was the first method the actor tried to c...,DMCA request,113,125
4731,"By the 16th century, the Low Countries – corre...",Which counties in the Low Countries were not r...,flanders,282,290
4732,Most of the Low Countries came under the rule ...,Who issued practical approval?,Holy Roman Emperor Charles V,125,153


In [9]:
df = df.dropna(subset=['context_en', 'questions_en', 'answers_en'])


In [75]:
df.to_csv('final_df.csv')

***Initialize the model and the tokenizer so that the training data can be converted into a format the BERT model understands***

***A class named QADataset is created to tokenize the data and return the inputs in a format that can be fed into the model***

***Using this class, a train data loader and a test data loader are created***

In [14]:
from sklearn.model_selection import train_test_split
df.reset_index(drop=True, inplace=True)
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

In [12]:
import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler
from transformers import BertTokenizer, BertForQuestionAnswering, AdamW

model_name = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)
class QADataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len=512):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, index):
        row = self.dataframe.iloc[index]
        context = row['context_en']
        question = row['questions_en']
        answer = row['answers_en']
        start_positions = row['answer_start']
        end_positions = row['answer_end']
        
        inputs = self.tokenizer.encode_plus(
            context,
            question,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        input_ids = inputs['input_ids'].squeeze()
        attention_mask = inputs['attention_mask'].squeeze()
        
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'start_positions': torch.tensor(start_positions, dtype=torch.long),
            'end_positions': torch.tensor(end_positions, dtype=torch.long)
        }

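# Prepare the datasets and dataloaders
# (separate names keep the train_data/test_data DataFrames available for later cells)
train_dataset = QADataset(train_data, tokenizer)
test_dataset = QADataset(test_data, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=8, sampler=RandomSampler(train_dataset))
# Evaluation does not need shuffling, so the default sequential order is used
test_loader = DataLoader(test_dataset, batch_size=8)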


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
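
The `answer_start` and `answer_end` columns hold character offsets into `context_en`, whereas the `start_positions` and `end_positions` arguments of `BertForQuestionAnswering` are token indices. The sketch below shows one way to convert between the two. It assumes `offsets` is a per-token list of `(char_start, char_end)` pairs for the context, such as a fast tokenizer returns with `return_offsets_mapping=True`, with non-context tokens set to `(0, 0)`. `char_span_to_token_span` is a hypothetical helper, not part of the original notebook.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Map the character span [start_char, end_char) to inclusive token indices.

    Returns None when no token overlaps the span, e.g. when the answer
    was cut off by truncation.
    """
    hits = [
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        # Skip empty (0, 0) entries and keep tokens that overlap the answer span
        if tok_end > tok_start and tok_start < end_char and tok_end > start_char
    ]
    if not hits:
        return None
    return hits[0], hits[-1]


# Example: [CLS] "The" "gold" "dome" [SEP]
example_offsets = [(0, 0), (0, 3), (4, 8), (9, 13), (0, 0)]
print(char_span_to_token_span(example_offsets, 4, 13))
```

The returned pair could then be used in place of the raw character offsets when building `start_positions` and `end_positions`.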


***Save the train and test splits to CSV***

In [15]:
train_data.to_csv('train_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

In [24]:
from transformers import AdamW, get_linear_schedule_with_warmup

# Initialize the model
model = BertForQuestionAnswering.from_pretrained('bert-base-multilingual-cased')

# Define the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Learning rate scheduler (epochs is defined here because total_steps depends on it)
epochs = 5
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
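
To make the scheduler's effect concrete, here is a plain-Python sketch of the multiplier a linear warmup and decay schedule applies to the base learning rate at each step. It is written for illustration, not copied from the library, and `linear_schedule_factor` is a hypothetical name.

```python
def linear_schedule_factor(
    step: int, num_warmup_steps: int, num_training_steps: int
) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# With no warmup, the learning rate halves by the midpoint of training
base_lr = 2e-5
print(base_lr * linear_schedule_factor(50, 0, 100))
```

With `num_warmup_steps=0`, as in the cell above, the learning rate starts at its full value and decays linearly to zero at `total_steps`.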


You will see an error in the training cell of this notebook. Training in the notebook was taking very long, so the model was trained on a Google Colab T4 GPU instead. The trained model was then loaded into the notebook.

***Train the model in batches for 5 epochs using the previously defined QADataset class***

In [25]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
epochs = 5

for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        
        total_loss += loss.item()
    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}, Average Training Loss: {avg_train_loss}")


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


KeyboardInterrupt: 

***Evaluate the model performance***

In [57]:
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, dataloader):
    model.eval()
    total_loss = 0
    start_acc = 0
    end_acc = 0
    start_f1 = 0
    end_f1 = 0

    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
            loss = outputs.loss
            total_loss += loss.item()
            
            start_preds = outputs.start_logits.argmax(dim=1)
            end_preds = outputs.end_logits.argmax(dim=1)
            
            start_acc += accuracy_score(start_positions.cpu().numpy(), start_preds.cpu().numpy())
            end_acc += accuracy_score(end_positions.cpu().numpy(), end_preds.cpu().numpy())
            
            start_f1 += f1_score(start_positions.cpu().numpy(), start_preds.cpu().numpy(), average='macro')
            end_f1 += f1_score(end_positions.cpu().numpy(), end_preds.cpu().numpy(), average='macro')
    
    avg_loss = total_loss / len(dataloader)
    start_acc /= len(dataloader)
    end_acc /= len(dataloader)
    start_f1 /= len(dataloader)
    end_f1 /= len(dataloader)
    
    return avg_loss, start_acc, end_acc, start_f1, end_f1

# Evaluate the model
avg_loss, start_acc, end_acc, start_f1, end_f1 = evaluate(model, test_loader)
print(f"Evaluation - Loss: {avg_loss}, Start Accuracy: {start_acc}, End Accuracy: {end_acc}, Start F1: {start_f1}, End F1: {end_f1}")
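
Accuracy on exact start and end positions is strict: a prediction that is off by one token counts as fully wrong. A common complement in SQuAD-style evaluation is to compare the predicted and reference answer strings by exact match and token-overlap F1. The sketch below uses simple lowercase whitespace tokenization, which is an assumption; `exact_match` and `token_f1` are hypothetical helpers, not part of the original notebook.

```python
from collections import Counter


def exact_match(prediction: str, truth: str) -> bool:
    """True when the two answers are identical after lowercasing and trimming."""
    return prediction.strip().lower() == truth.strip().lower()


def token_f1(prediction: str, truth: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("golden statue of Mary", "A golden statue of the Virgin Mary"))
```

Averaging these scores over the test set gives partial credit to answers that overlap with the reference.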


***Load the trained model***

In [27]:
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

# Define the path to your saved model files
model_load_path = "final_trained_model/"

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(model_load_path)

# Load the model
model = BertForQuestionAnswering.from_pretrained(model_load_path, from_tf=False, config="final_trained_model/config.json")


***A question-answering function that takes a question and a context, tokenizes them, passes them to the model, and returns the predicted answer***

In [52]:
def answer_question(question, context):
    inputs = tokenizer.encode_plus(
        question,
        context,
        add_special_tokens=True,
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )
    input_ids = inputs["input_ids"].tolist()[0]
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Run inference
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        start_scores = outputs.start_logits
        end_scores = outputs.end_logits

    # Get the most likely beginning and end of the answer span
    answer_start = torch.argmax(start_scores)
    answer_end = torch.argmax(end_scores) + 1

    # Convert token indices to tokens
    answer_tokens = tokens[answer_start:answer_end]

    # Convert tokens to string
    answer = tokenizer.convert_tokens_to_string(answer_tokens)

    return answer
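
Taking the argmax of the start and end scores independently can produce an end position that lies before the start, which yields an empty answer. Here is a minimal sketch of choosing the best valid span jointly. It works on plain lists of scores, for example `start_scores[0].tolist()`. `best_span` and its `max_answer_len` parameter are hypothetical names introduced for illustration.

```python
from typing import Sequence, Tuple


def best_span(
    start_scores: Sequence[float],
    end_scores: Sequence[float],
    max_answer_len: int = 30,
) -> Tuple[int, int]:
    """Return the (start, end) token pair with the highest combined score,
    restricted to start <= end and at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider end positions at or after the start
        last_end = min(start + max_answer_len, len(end_scores))
        for end in range(start, last_end):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Independent argmax would give start=1, end=0 here, which is not a valid span
print(best_span([0.1, 5.0, 0.2, 0.3], [6.0, 0.1, 4.0, 0.2]))
```

The returned end index is inclusive, so the answer tokens would be `tokens[start:end + 1]`.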


***An app to answer questions based on a context***

In [56]:
import gradio as gr
from transformers import pipeline
from googletrans import Translator

# Load the QA pipeline
qa_pipeline = pipeline("question-answering")

# Initialize the Google Translator
translator = Translator()

def answer_question(question, context):
    # Translate context from Hindi to English
    translated = translator.translate(context, src='hi', dest='en')
    context_en = translated.text
    
    # Use QA pipeline to get the answer
    result = qa_pipeline(question=question, context=context_en)
    return result["answer"]

iface = gr.Interface(
    fn=answer_question,
    inputs=["text", "text"],
    outputs="text",
    title="Crosslingual QA Model",
    description="Ask a question based on the given context",
)

iface.launch()


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Running on local URL:  http://127.0.0.1:7869

To create a public link, set `share=True` in `launch()`.


