# Lab 3: Multilingual BERT and Zero-Shot Transfer

## June 27, 2023

Welcome to the third lab of the course. In this assignment we will learn how to fine-tune a multilingual BERT (mBERT) model on a Natural Language Inference task, [XNLI](https://arxiv.org/abs/1809.05053). We will fine-tune the model on English training data and then evaluate the fine-tuned model on different languages, demonstrating the zero-shot capabilities of mBERT.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    data_dir = "gdrive/MyDrive/PlakshaNLP2023/Lab3a/data/xnli"
except ImportError:
    # Not running on Colab: fall back to a local copy of the data
    data_dir = "/datadrive/t-kabir/work/repos/PlakshaNLP/TLPNLP2023/source/Lab3a/data/xnli"

In [None]:
# Install required libraries
!pip install numpy
!pip install pandas
!pip install torch
!pip install tqdm
!pip install matplotlib
!pip install transformers
!pip install seaborn
!pip install scikit-learn

In [None]:
# We start by importing libraries that we will be making use of in the assignment.
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import copy
import tqdm

## XNLI: Task Description

XNLI is a multilingual benchmark for Natural Language Inference (NLI). Its training data is available in English and was obtained from the popular [MNLI](https://cims.nyu.edu/~sbowman/multinli/) dataset, while its test and dev sets are available in 15 different languages. In NLI, we are given two sentences, a premise and a hypothesis, and the task is to predict whether the hypothesis i) is entailed by the premise, ii) contradicts the premise, or iii) is neutral to the premise.

<img src="https://i.ibb.co/bd4P20K/nli-examples.jpg" alt="nli-examples" border="0">

This makes NLI a multi-class classification task where we want to predict the correct label out of the three possible classes. We start by loading the dataset into memory. The training set in XNLI is comparatively large, with around 400k examples, which can lead to long training times. Hence, for the purpose of this assignment, we will work with a fraction of the full data, i.e. ~40k examples.
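Since a classifier works with integer class ids rather than strings, the three labels need a fixed numeric encoding. The snippet below is a minimal sketch of one way to build such a mapping; it assumes the label strings are "contradiction", "entailment" and "neutral", and it sorts them alphabetically, which is the convention the dataset class later in this lab asks for.

```python
# The three NLI classes as label strings (assumed spelling), in arbitrary order.
labels = ["neutral", "entailment", "contradiction"]

# Sorting first makes the ids independent of the order in which
# the labels happen to be listed.
label2id = {label: idx for idx, label in enumerate(sorted(labels))}
id2label = {idx: label for label, idx in label2id.items()}

print(label2id)  # {'contradiction': 0, 'entailment': 1, 'neutral': 2}
```
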

In [None]:
def load_xnli_dataset(lang, split = "train"):
    filename = os.path.join(data_dir, f"{split}-{lang}.tsv")
    sentence1s = []
    sentence2s = []
    labels = []
    with open(filename, encoding="utf-8") as f:
        for i,line in enumerate(f):
            if i == 0:
                continue
            row = line.split("\t")
            sentence1 = row[0]
            sentence2 = row[1]
            label = row[2].split("\n")[0]
            sentence1s.append(sentence1)
            sentence2s.append(sentence2)
            labels.append(label)
    
    return pd.DataFrame({
        "premise": sentence1s,
        "hypothesis" : sentence2s,
        "label" : labels
    })
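To make the file layout that `load_xnli_dataset` relies on concrete, here is a small sketch that parses a made-up two-example file held in memory. The header names and sentences are hypothetical; the only assumptions taken from the loader above are a header row followed by tab-separated premise, hypothesis and label columns.

```python
import io

# A made-up file in the assumed layout:
# header row, then premise <TAB> hypothesis <TAB> label.
toy_tsv = (
    "premise\thypothesis\tlabel\n"
    "A man is sleeping.\tA person is awake.\tcontradiction\n"
    "Two dogs run.\tAnimals are moving.\tentailment\n"
)

rows = []
with io.StringIO(toy_tsv) as f:
    next(f)  # skip the header row, like the `i == 0` check in the loader
    for line in f:
        premise, hypothesis, label = line.rstrip("\n").split("\t")
        rows.append({"premise": premise, "hypothesis": hypothesis, "label": label})

print(rows)
```
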

In [None]:
# Load training data in English
train_en_data = load_xnli_dataset("en", "train")[:40000]

# As in the last assignment, we split the training data to get some validation examples as well
train_en_data, val_en_data = train_test_split(train_en_data, test_size=0.05)

print(f"Number of examples in training data: {len(train_en_data)}")
print(f"Number of examples in validation data: {len(val_en_data)}")

train_en_data.head()

In [None]:
# Load evaluation data (the XNLI dev split) for each language
test_langs = ["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"]

lang2test_df = {lang : load_xnli_dataset(lang, "dev") for lang in test_langs}

In [None]:
print(f"Number of Test examples: {len(lang2test_df['en'])}")
lang2test_df["en"].head()

In [None]:
for lang, test_df in lang2test_df.items():
    print(f"{lang} test set:")
    print(test_df.head())
    print("***************************\n")

## mBERT using HuggingFace's transformers library

mBERT is a multilingual variant of BERT, trained on Wikipedia articles in around 100 languages. As with monolingual BERT, the transformers library provides pre-trained models and tokenizers for multilingual BERT. To create an instance of one, we only need to pass `"bert-base-multilingual-cased"` or `"bert-base-multilingual-uncased"` to the `BertTokenizer.from_pretrained` and `BertModel.from_pretrained` methods, and that's it! See the examples below for a demonstration:

In [None]:
from transformers import BertTokenizer, BertModel

In [None]:
mbert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-uncased")

In [None]:
mbert_tokenizer.tokenize("thinking machines")

In [None]:
mbert_tokenizer.tokenize("maquinas de pensar")

In [None]:
mbert_tokenizer.tokenize("सोच मशीन")
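For intuition about what `tokenize` does with a word, the following toy sketch implements greedy longest-match-first subword splitting, the matching strategy used by BERT's WordPiece tokenizer. The vocabulary here is made up for illustration and has nothing to do with mBERT's real one; the real tokenizer also normalizes the text first (the uncased variant lowercases it and strips accents).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from mBERT.
toy_vocab = {"think", "##ing", "machine", "##s", "ma", "##quin", "##as"}

print(wordpiece_tokenize("thinking", toy_vocab))  # ['think', '##ing']
print(wordpiece_tokenize("machines", toy_vocab))  # ['machine', '##s']
print(wordpiece_tokenize("maquinas", toy_vocab))  # ['ma', '##quin', '##as']
print(wordpiece_tokenize("pensar", toy_vocab))    # ['[UNK]']
```

Because a single shared vocabulary of pieces covers many scripts, the same procedure applies to English, Spanish and Hindi input alike.
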

As you can see, mBERT's tokenizer works on different languages. We can similarly load a pre-trained mBERT model and feed it data in different languages.

In [None]:
mbert_model = BertModel.from_pretrained("bert-base-multilingual-uncased")

In [None]:
mbert_model

As you can see, the architecture is identical to the original BERT model. The only difference is the shape of `word_embeddings`, which is 105879 x 768, meaning that mBERT (uncased) has a vocabulary of 105879 unique tokens. In contrast, BERT (uncased) has 30522 tokens.
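To get a feel for what the larger vocabulary costs, the short calculation below multiplies the two vocabulary sizes quoted above by the hidden size of 768. It only counts the word-embedding matrix and ignores every other parameter in the model.

```python
hidden_size = 768           # width of each embedding vector
mbert_vocab_size = 105879   # mBERT (uncased), as quoted above
bert_vocab_size = 30522     # BERT (uncased), as quoted above

mbert_embedding_params = mbert_vocab_size * hidden_size
bert_embedding_params = bert_vocab_size * hidden_size

print(f"mBERT word embeddings: {mbert_embedding_params:,} parameters")  # 81,315,072
print(f"BERT word embeddings:  {bert_embedding_params:,} parameters")   # 23,440,896
print(f"Ratio: {mbert_embedding_params / bert_embedding_params:.2f}x")
```
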

In [None]:
en_sent = "thinking machines"
tokenizer_output = mbert_tokenizer(en_sent, return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]

mbert_model(input_ids, attention_mask = attn_mask)

In [None]:
es_sent = "maquinas de pensar"
tokenizer_output = mbert_tokenizer(es_sent, return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]

mbert_model(input_ids, attention_mask = attn_mask)

In [None]:
hi_sent = "सोच मशीन"
tokenizer_output = mbert_tokenizer(hi_sent, return_tensors="pt")
input_ids, attn_mask = tokenizer_output["input_ids"], tokenizer_output["attention_mask"]

mbert_model(input_ids, attention_mask = attn_mask)

Hence, we can very easily use mBERT for generating predictions on texts written in different languages.

## Task 1: Fine-tune mBERT on XNLI

We can now start fine-tuning mBERT on this dataset. We will start by defining the custom `Dataset` class for the task and then define the model and training loop.

## Task 1.1: Custom Dataset Class (15 minutes)

Like in the previous assignments, implement the `XNLImBertDataset` class below that processes and stores the data as well as provides a way to iterate through the dataset. The details about various methods in the class are mentioned in their docstrings.

In [None]:
class XNLImBertDataset(Dataset):
    
    def __init__(self, premises,
                 hypotheses,
                 labels,
                 max_length,
                mbert_variant = "bert-base-multilingual-uncased"):
        
        """
        Constructor for the `XNLImBertDataset` class. Stores the `premises`, `hypotheses` and `labels`
        which can then be used by other methods. Also initializes the tokenizer.
        
        Inputs:
            - premises (list) : A list of sentences constituting the premise in each example
            - hypotheses (list) : A list of sentences constituting the hypothesis in each example
            - labels (list) : A list of labels, one for each premise-hypothesis pair.
            - max_length (int): Maximum length of the encoded sequence.
                                If the number of tokens is lower than `max_length`, add padding; otherwise truncate.
            - mbert_variant (str): mBERT variant whose tokenizer should be used
        
        
        Note that labels are in the form of strings "entailment", "contradiction" and "neutral". For training the
        models we will want the labels in numeric form, so you should define a mapping from the text label
        to a numeric id. You should order the labels alphabetically while defining the mapping, i.e.
        "contradiction" -> 0, "entailment" -> 1, "neutral" -> 2 (so that the mapping is consistent across everyone).
        
        Also note that we have a `max_length` argument today. This is to ensure that all sequences in a batch have the same length.
        This way we will not need to define a separate collate_fn like we did in the previous assignments / labs.
        You may want to look up how to pad up to a maximum length (the `padding`, `max_length`, and `truncation` arguments of the tokenizer call might be useful).
        
        """
        
        self.premises = None
        self.hypotheses = None
        self.labels = None
        self.max_length = None
        self.tokenizer = None
        self.label2id = None # Define it as a dictionary
        
        # YOUR CODE HERE
        raise NotImplementedError()
        
    def __len__(self):
        """
        Returns the length of the dataset
        """
        length = None
        
        # YOUR CODE HERE
        raise NotImplementedError()
        
        return length
    
    def __getitem__(self, idx):
        """
        
        Returns the features and label corresponding to the `idx` entry in the dataset.
        
        Inputs:
            - idx (int): Index corresponding to the sentence_pair,label to be returned
        
        Returns:
            - input_ids (torch.tensor): Indices of the tokens in the sentence pair.
                                        Shape of the tensor should be (`seq_len`,)
            - mask (torch.tensor): Attention mask indicating which tokens are padded.
            - label (int): Label for the premise-hypothesis pair
            
        Hint: We have 2 sentences in a pair which must be concatenated using the [SEP] token before we tokenize and encode them
        
        """
        
        input_ids = None
        mask = None
        label = None
        
        # YOUR CODE HERE
        raise NotImplementedError()
        
        return input_ids.squeeze(0), mask.squeeze(0), label
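The shape of the encoded pair and the role of `max_length` can be illustrated without a tokenizer. The pure-Python sketch below assumes the token ids are already known and uses the special ids visible in the expected tensors of the sample test cases (101 for `[CLS]`, 102 for `[SEP]`, 0 for padding). The helper name `encode_pair` is hypothetical, and its truncation is deliberately naive; in the lab itself the tokenizer's `padding`, `truncation` and `max_length` arguments handle all of this.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in the expected tensors of the test cases


def encode_pair(premise_ids, hypothesis_ids, max_length):
    """Build [CLS] premise [SEP] hypothesis [SEP], then pad or cut to max_length."""
    ids = [CLS_ID] + premise_ids + [SEP_ID] + hypothesis_ids + [SEP_ID]
    ids = ids[:max_length]       # naive truncation, a simplification of what the tokenizer does
    mask = [1] * len(ids)        # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Token ids copied from the expected output for
# "A soccer game with multiple males playing." / "Some men are playing a sport."
premise_ids = [143, 20071, 11336, 10171, 18248, 19592, 14734, 119]
hypothesis_ids = [10970, 10562, 10320, 14734, 143, 13148, 119]

input_ids, mask = encode_pair(premise_ids, hypothesis_ids, max_length=32)
print(input_ids)
print(mask)
```

The result has 18 real positions followed by 14 padded ones, which is the pattern the sample test case for `idx = 2` expects.
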

In [None]:
print("Running Sample Test Cases")
sample_premises = ["A man inspects the uniform of a figure in some East Asian country.",
                    "An older and younger man smiling.",
                   "A soccer game with multiple males playing."
                    ]
sample_hypotheses = ["The man is sleeping.",
                     "Two men are smiling and laughing at the cats playing on the floor.",
                    "Some men are playing a sport."]
sample_labels = ["contradiction", "neutral", "entailment"]
sample_max_len = 32
sample_dataset = XNLImBertDataset(
    sample_premises,
    sample_hypotheses,
    sample_labels,
    sample_max_len
)
print(f"Sample Test Case 1: Checking if `__len__` is implemented correctly")
dataset_len= len(sample_dataset)
expected_len = len(sample_labels)
print(f"Dataset Length: {dataset_len}")
print(f"Expected Length: {expected_len}")
assert dataset_len == expected_len
print("Sample Test Case Passed!")
print("****************************************\n")

print(f"Sample Test Case 2: Checking if `__getitem__` is implemented correctly for `idx= 0`")
sample_idx = 0
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids =  torch.tensor([  101,   143, 10564, 15450, 84789, 10107, 10103, 38884, 10108,   143,
        16745, 10104, 10970, 11344, 17147, 11913,   119,   102, 10103, 10564,
        10127, 55860,   119,   102,     0,     0,     0,     0,     0,     0,
            0,     0])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0])
expected_label = 0
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")

print(f"Sample Test Case 3: Checking if `__getitem__` is implemented correctly for `idx= 1`")
sample_idx = 1
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids = torch.tensor([  101, 10144, 18585, 10110, 24392, 10564, 14965, 64581,   119,   102,
        10536, 10562, 10320, 14965, 64581, 10110, 18418, 82863, 10160, 10103,
        45670, 14734, 10125, 10103, 21005,   119,   102,     0,     0,     0,
            0,     0])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 0, 0, 0, 0, 0])
expected_label = 2
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")


print(f"Sample Test Case 4: Checking if `__getitem__` is implemented correctly for `idx= 2`")
sample_idx = 2
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids = torch.tensor([  101,   143, 20071, 11336, 10171, 18248, 19592, 14734,   119,   102,
        10970, 10562, 10320, 14734,   143, 13148,   119,   102,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
expected_label = 1
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")



sample_premises = ["एक आदमी किसी पूर्वी एशियाई देश में एक आकृति की वर्दी का निरीक्षण करता है।",
                    "एक बूढ़ा और छोटा आदमी मुस्कुरा रहा है।",
                   "एक फ़ुटबॉल खेल जिसमें कई पुरुष खेल रहे हैं।"
                    ]
sample_sentence2s = ["आदमी सो रहा है।",
                     "फर्श पर खेल रही बिल्लियों को देखकर दो आदमी मुस्कुरा रहे हैं और हंस रहे हैं।",
                    "कुछ पुरुष कोई खेल खेल रहे हैं।"
                    ]
sample_labels = ["contradiction", "neutral", "entailment"]
sample_max_len = 36
sample_dataset = XNLImBertDataset(
    sample_premises,
    sample_sentence2s,
    sample_labels,
    sample_max_len
)

print(f"Sample Test Case 5: Checking for Hindi")
sample_idx = 1
input_ids, mask, label = sample_dataset.__getitem__(sample_idx)
expected_input_ids =  torch.tensor([  101, 11384,   569, 30119, 10949, 11142, 74535, 10949,   533, 13764,
        25695,   571, 12114, 19086, 10949, 36335,   580,   591,   102,   568,
        11551, 17109, 12334, 56426, 52061,   569, 28393, 41790, 20106, 11483,
        91329, 19086, 29931,   533, 13764,   102])
expected_mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
expected_label = 2
print(f"input_ids:\n {input_ids}")
print(f"Expected input_ids:\n {expected_input_ids}")
assert (expected_input_ids == input_ids).all()

print(f"mask:\n {mask}")
print(f"Expected mask:\n {expected_mask}")
assert (expected_mask == mask).all()

print(f"label:\n {label}")
print(f"Expected label:\n {expected_label}")
assert expected_label == label

print("Sample Test Case Passed!")
print("****************************************\n")




Initialize the dataset and dataloaders for the English training and validation sets.

In [None]:
max_seq_len = 128
batch_size = 8

train_en_premises, train_en_hypotheses = train_en_data["premise"].values, train_en_data["hypothesis"].values
train_en_labels = train_en_data["label"].values

val_en_premises, val_en_hypotheses = val_en_data["premise"].values, val_en_data["hypothesis"].values
val_en_labels = val_en_data["label"].values

train_en_dataset = XNLImBertDataset(train_en_premises, train_en_hypotheses, train_en_labels, max_seq_len)
val_en_dataset = XNLImBertDataset(val_en_premises, val_en_hypotheses, val_en_labels, max_seq_len)

train_en_dataloader = DataLoader(train_en_dataset, batch_size = batch_size)
val_en_dataloader = DataLoader(val_en_dataset, batch_size = batch_size)

## Task 1.2: Implement mBERT Based Classifier for NLI (15 minutes)

Similar to the last assignment, implement a classifier with an mBERT module followed by a classification layer. Note that we have a 3-class classification problem this time and, unlike last time, we have fixed labels for all examples (Phew!). So we just need a linear layer (followed by log-softmax) on top of the [CLS] embedding to get log probabilities for each of the classes. The architecture will look something like this:

![architecture](images/mbert_xnli.png)

Implement the `mBERTNLIClassifierModel` below

In [None]:

class mBERTNLIClassifierModel(nn.Module):
    
    def __init__(self, d_hidden = 768, mbert_variant = "bert-base-multilingual-uncased"):
        
        """
        Constructor for the `mBERTNLIClassifierModel` class. Use this to define  the network architecture
        which should be: Input -> mBERT -> Linear Layer -> Log-Softmax
        
        Inputs:
            - d_hidden (int): Size of the hidden representations of mbert
            - mbert_variant (str): mBERT variant to use
        
        """
        super(mBERTNLIClassifierModel, self).__init__()
        
        self.mbert_layer = None
        self.output_layer = None
        self.log_softmax_layer = None
        
        # YOUR CODE HERE
        raise NotImplementedError()
        
        
    def forward(self, input_ids, attn_mask):
        
        """
        Passes the inputs through the network and obtains the prediction
        
        Inputs:
            - input_ids (torch.tensor): A torch tensor of shape [batch_size, seq_len]
                                        representing the sequence of token ids
            - attn_mask (torch.tensor): A torch tensor of shape [batch_size, seq_len]
                                        representing the attention mask such that padded tokens are 0 and rest 1
                                        
        Returns:
          - output (torch.tensor): A torch tensor of shape [batch_size, 3] containing (log) probabilities
          of each class 
                                                
        """
        
        output = None
        
        # YOUR CODE HERE
        raise NotImplementedError()
        
        return output

In [None]:
print(f"Running Sample Test Cases!")
torch.manual_seed(42)
model = mBERTNLIClassifierModel()

sample_premises = ["A man inspects the uniform of a figure in some East Asian country.",
                    "An older and younger man smiling.",
                   "A soccer game with multiple males playing."
                    ]
sample_hypotheses = ["The man is sleeping.",
                     "Two men are smiling and laughing at the cats playing on the floor.",
                    "Some men are playing a sport."]
sample_labels = ["contradiction", "neutral", "entailment"]
sample_max_len = 32
sample_dataset = XNLImBertDataset(
    sample_premises,
    sample_hypotheses,
    sample_labels,
    sample_max_len
)


print("Sample Test Case 1")
sample_idx = 0
input_ids, attn_mask, label = sample_dataset.__getitem__(sample_idx)
mbert_cls_out = model(input_ids.unsqueeze(0), attn_mask.unsqueeze(0)).detach().numpy()
expected_mbert_cls_out = np.array([[-0.9885041, -1.479876,  -0.915788 ]])
print(f"Model Output: {mbert_cls_out}")
print(f"Expected Output: {expected_mbert_cls_out}")

assert mbert_cls_out.shape == expected_mbert_cls_out.shape
assert np.allclose(mbert_cls_out, expected_mbert_cls_out, rtol=1e-4)
print("Test Case Passed! :)")
print("******************************\n")

print("Sample Test Case 2")
sample_idx = 1
input_ids, attn_mask, label = sample_dataset.__getitem__(sample_idx)
mbert_cls_out = model(input_ids.unsqueeze(0), attn_mask.unsqueeze(0)).detach().numpy()
expected_mbert_cls_out = np.array([[-0.97441876, -1.4775381,  -0.9304163 ]])
print(f"Model Output: {mbert_cls_out}")
print(f"Expected Output: {expected_mbert_cls_out}")

assert mbert_cls_out.shape == expected_mbert_cls_out.shape
assert np.allclose(mbert_cls_out, expected_mbert_cls_out, rtol=1e-4)
print("Test Case Passed! :)")
print("******************************\n")
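The model outputs above are log-probabilities, so exponentiating them should give three values that sum to 1. The sketch below shows a numerically stable log-softmax in plain Python (the logits are made up) and then checks that property on the expected output of Sample Test Case 1. It also shows how a negative log-likelihood loss reads such an output: it is simply the negated log-probability of the gold class.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


# Made-up logits for (contradiction, entailment, neutral).
log_probs = log_softmax([0.3, -0.2, 0.4])
print(sum(math.exp(lp) for lp in log_probs))  # very close to 1.0

# Expected output of Sample Test Case 1, copied from the cell above.
expected = [-0.9885041, -1.479876, -0.915788]
print(sum(math.exp(lp) for lp in expected))   # also very close to 1.0

# Negative log-likelihood of the gold label (contradiction -> 0 for this example).
gold = 0
nll = -expected[gold]
predicted = max(range(len(expected)), key=lambda c: expected[c])
print(f"NLL: {nll:.4f}, predicted class id: {predicted}")
```

Note that these outputs come from a classification layer that has not been trained yet, so the predicted class carries no meaning at this stage.
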


## Task 1.3: Training and Evaluating the Model (30 minutes)

Similar to previous assignments implement the `train` and `evaluate` functions below.

In [None]:
def evaluate(model, test_dataloader, device = "cpu"):
    
    """
    Evaluates `model` on test dataset

    Inputs:
        - model (mBERTNLIClassifierModel): mBERT based classifier model to be evaluated
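        - test_dataloader (torch.utils.data.DataLoader): A dataloader defined over the test dataset
        - device (str): Device to run the evaluation on. Can be either 'cuda' (for using gpu) or 'cpu'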

    Returns:
        - accuracy (float): Average accuracy over the test dataset 
    """
    
    
    accuracy = None
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return accuracy
    
    
    
def train(model, train_dataloader, val_dataloader,
          lr = 1e-5, num_epochs = 3,
          device = "cpu"):
    
    """
    Runs the training loop. Define the loss function as NLLLoss
    and optimizer as Adam and train for `num_epochs` epochs.

    Inputs:
        - model (mBERTNLIClassifierModel): mBERT based classifier model to be trained
        - train_dataloader (torch.utils.data.DataLoader): A dataloader defined over the training dataset
        - val_dataloader (torch.utils.data.DataLoader): A dataloader defined over the validation dataset
        - lr (float): The learning rate for the optimizer
        - num_epochs (int): Number of epochs to train the model for.
        - device (str): Device to train the model on. Can be either 'cuda' (for using gpu) or 'cpu'

    Returns:
        - best_model (mBERTNLIClassifierModel): model corresponding to the highest validation accuracy (checked at the end of each epoch)
        - best_val_accuracy (float): Validation accuracy corresponding to the best epoch
    """
        
    best_val_accuracy = float("-inf")
    best_model = None
    
    # YOUR CODE HERE
    raise NotImplementedError()
    best_model.zero_grad()
    return best_model, best_val_accuracy
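Two pieces of bookkeeping in these functions can be sketched without any deep learning library: turning per-class log-probabilities into an accuracy, and keeping a snapshot of the model from the best epoch. The example below uses made-up numbers and a plain dictionary as a stand-in for the model, so it illustrates the logic only and is not the implementation expected for the task.

```python
import copy


def accuracy_from_log_probs(log_probs, labels):
    """Fraction of examples whose highest-scoring class equals the gold label."""
    correct = 0
    for scores, gold in zip(log_probs, labels):
        predicted = max(range(len(scores)), key=lambda c: scores[c])
        correct += int(predicted == gold)
    return correct / len(labels)


# Made-up log-probabilities for four examples and their gold labels.
toy_log_probs = [
    [-0.2, -2.0, -2.5],
    [-1.5, -0.4, -2.0],
    [-1.0, -1.2, -1.1],
    [-2.0, -1.9, -0.3],
]
toy_labels = [0, 1, 2, 2]
acc = accuracy_from_log_probs(toy_log_probs, toy_labels)
print(acc)  # 0.75

# Hypothetical per-epoch validation accuracies and a stand-in "model".
model_state = {"epoch": None}
val_accuracies = [0.61, 0.74, 0.72]

best_val_accuracy = float("-inf")
best_model = None
for epoch, val_acc in enumerate(val_accuracies):
    model_state["epoch"] = epoch  # stands in for one epoch of weight updates
    if val_acc > best_val_accuracy:
        best_val_accuracy = val_acc
        # deepcopy so that later updates do not overwrite the snapshot
        best_model = copy.deepcopy(model_state)

print(best_model, best_val_accuracy)  # {'epoch': 1} 0.74
```

Without the `copy.deepcopy`, `best_model` would be just another name for the object that keeps being updated, and the snapshot from the best epoch would be lost.
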

In [None]:
torch.manual_seed(42)
print("Training on 100 data points for sanity check")

max_seq_len = 128
batch_size = 8

sample_premises, sample_hypotheses = train_en_data["premise"].values[:100], train_en_data["hypothesis"].values[:100]
sample_labels = train_en_data["label"].values[:100]

sample_dataset = XNLImBertDataset(sample_premises, sample_hypotheses, sample_labels, max_seq_len)
sample_dataloader = DataLoader(sample_dataset, batch_size = batch_size)


model = mBERTNLIClassifierModel()
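device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
best_model, best_val_acc = train(model, sample_dataloader, sample_dataloader, lr = 5e-5, num_epochs = 10, device = device)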
print(f"Best Validation Accuracy: {best_val_acc}")
print(f"Expected Best Validation Accuracy: {0.99}")

Since we just trained and evaluated on the same 100 examples, you should expect a nearly perfect accuracy of about 99%. Now let's train on the entire dataset.

In [None]:
model = mBERTNLIClassifierModel()
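device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
best_model, best_val_acc = train(model, train_en_dataloader, val_en_dataloader, lr = 1e-5, num_epochs = 2, device = device)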
print(f"Best Validation Accuracy: {best_val_acc}")
print(f"Expected Best Validation Accuracy: {0.7675}")

## Task 1.4: Zero-Shot Transfer (30 minutes)

Pre-trained multilingual models like mBERT have been shown to exhibit zero-shot transfer capabilities to new languages on which the model was never fine-tuned. You can read more about zero-shot transfer in mBERT in this [paper](https://arxiv.org/abs/1906.01502). We now test this phenomenon for ourselves by evaluating the mBERT classifier that we just trained on English data on the test sets in 15 different languages. Implement the `evaluate_on_diff_langs` function below, which does that.

In [None]:
def evaluate_on_diff_langs(model, lang2test_df, max_length = 128, batch_size = 8, device = "cpu"):
    
    """
    Evaluates the accuracy of the fine-tuned model on test data in different languages.
    
    Inputs:
        - model (mBERTNLIClassifierModel): mBERT based classifier model fine-tuned on English data
        - lang2test_df (dict): A dictionary with languages as keys and
                                their corresponding test sets (in the form of pandas dataframes)
                                as values
        - max_length (int): Maximum length of the encoded sequences
        - batch_size (int): Batch size to use during evaluation
        - device (str): Device to run the evaluation on. Can be either 'cuda' (for using gpu) or 'cpu'
                                
    Returns:
        - lang2acc (dict): A dictionary with language ids as keys and the accuracy on its test set as values
                            eg: {"en" : 0.8, "fr" : 0.77, "hi": 0.72, ...}
    
    """
    
    lang2acc = None
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return lang2acc
    

In [None]:
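device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
lang2acc = evaluate_on_diff_langs(best_model, lang2test_df, max_length = 128, batch_size = 8, device = device)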
expected_vals = {'ar': 0.5989583333333334,
 'bg': 0.6454326923076923,
 'de': 0.6698717948717948,
 'el': 0.6402243589743589,
 'en': 0.7263621794871795,
 'es': 0.6923076923076923,
 'fr': 0.6802884615384616,
 'hi': 0.5893429487179487,
 'ru': 0.6478365384615384,
 'sw': 0.53125,
 'th': 0.35136217948717946,
 'tr': 0.610176282051282,
 'ur': 0.5637019230769231,
 'vi': 0.6193910256410257,
 'zh': 0.6073717948717948}
print(f"Language to Accuracy:\n {lang2acc}")
print(f"Expected Values:\n {expected_vals}")

Don't worry if the values do not match exactly, but you can expect similar patterns, i.e. the model fine-tuned on English data performs reasonably well on other languages too, compared to its performance on the English test data. Performance on languages like German, French and Spanish is much closer to the performance on English. However, it is on the lower side for languages like Swahili, Urdu and Thai. Most values are still surprisingly high, considering that a random guess will fetch you an accuracy of about 33%; Thai, at roughly 35% in the expected values above, is the exception and sits barely above chance.
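One simple way to read these numbers is to compute, for each language, the gap to the English accuracy and the margin over chance. The sketch below does this on the expected values listed above, rounded to four decimals; your own `lang2acc` can be substituted for `acc`.

```python
# Expected accuracies from the cell above, rounded to four decimals.
acc = {
    "ar": 0.5990, "bg": 0.6454, "de": 0.6699, "el": 0.6402, "en": 0.7264,
    "es": 0.6923, "fr": 0.6803, "hi": 0.5893, "ru": 0.6478, "sw": 0.5313,
    "th": 0.3514, "tr": 0.6102, "ur": 0.5637, "vi": 0.6194, "zh": 0.6074,
}

chance = 1 / 3  # three balanced classes
en_acc = acc["en"]

# Transfer gap: how far each language falls below the English accuracy.
gaps = {lang: en_acc - a for lang, a in acc.items() if lang != "en"}
ranked = sorted(gaps, key=gaps.get)  # smallest gap first

for lang in ranked:
    print(f"{lang}: accuracy {acc[lang]:.3f}, "
          f"gap to English {gaps[lang]:.3f}, "
          f"above chance {acc[lang] - chance:+.3f}")
```

With these values, Spanish, French and German have the smallest gaps, while Urdu, Swahili and Thai have the largest.
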