## Goal of Study

### Context
- *Multilingual Models (XGLM/mBERT):* These models are not trained on separate English, Spanish, and Hindi datasets. They are trained on a single large corpus that mixes many languages. The hope is that the model learns abstract, language-agnostic linguistic concepts. For example, the concept of a "verb" or "subject" should be represented similarly, regardless of whether the word is "run" or "दौड़ना" (daudna, Hindi).
- *Shared Representation Space:* The tokenizer uses one subword vocabulary for all languages, and the same parameters process every language. This encourages (but does not guarantee) that tokens and sentences with similar meanings in different languages (e.g., "cat" and "gato") end up at nearby points in the embedding space. This shared space is what makes transfer possible.

### The Zero-Shot Hypothesis
If we fine-tune the model on an English task (e.g., sentiment classification), the model learns to move representations of sentences into "positive" or "negative" regions of its internal space. The hypothesis is that because a Spanish sentence with a positive sentiment is already represented near its English equivalent, it too will be correctly classified, even though the model has never seen a labeled Spanish example.
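
The hypothesis can be illustrated with a toy example. The sketch below does not use a real model: the 2-D points, the cluster centers, and the small "Spanish" offset are all made-up stand-ins for sentence embeddings in a shared space. A classifier fitted only on the "English" points is then applied to the "Spanish" points without ever seeing their labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_points(center, n=50, noise=0.3):
    """Sample n 2-D points around a center (stand-ins for sentence embeddings)."""
    return center + noise * rng.standard_normal((n, 2))

# Hypothetical shared space: positive sentences cluster near (+1, +1),
# negative ones near (-1, -1). "Spanish" points are shifted slightly
# to mimic imperfect cross-lingual alignment.
shift = np.array([0.2, -0.1])
pos_center, neg_center = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
en_pos, en_neg = make_points(pos_center), make_points(neg_center)
es_pos, es_neg = make_points(pos_center + shift), make_points(neg_center + shift)

# "Fine-tune" on English only: a nearest-centroid classifier.
# Row 0 = negative class, row 1 = positive class.
centroids = np.stack([en_neg.mean(axis=0), en_pos.mean(axis=0)])

def predict(points):
    """Assign each point to the class of its closest English centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Zero-shot evaluation: Spanish labels are used only for scoring.
es_points = np.concatenate([es_neg, es_pos])
es_labels = np.array([0] * len(es_neg) + [1] * len(es_pos))
zero_shot_acc = float((predict(es_points) == es_labels).mean())
print(f"Zero-shot accuracy on toy Spanish points: {zero_shot_acc:.2%}")
```

Increasing `shift` in this toy setup degrades the zero-shot accuracy, which is the same mechanism the later sections probe with real languages.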

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification
from src.finetune import train_short_run 
from src.eval import eval_perplexity, eval_classification_accuracy 

In [None]:
# --- Configuration ---
# Use XGLM for causal LM (perplexity) tasks
# Use mBERT or XLM-R for classification tasks
MODEL_NAME = "bert-base-multilingual-cased" # Let's use a classifier for a clearer signal
TRAIN_DATASET_NAME = "multi_nli" # English-only training data
EVAL_DATASET_NAME = "xnli"
EVAL_DATASET_CONFIG = "all_languages"
LANGS = ["en", "es", "fr", "hi"]
NUM_SAMPLES_PER_LANG = 1000 # Keep it small and fast
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# --- Load Model and Tokenizer ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# For classification, we need a different model head
# The NLI task has 3 labels: entailment (0), neutral (1), contradiction (2)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3).to(DEVICE)

# --- Load and Prepare Datasets ---
print("Loading English training data from MultiNLI...")
# We only need a small slice for this quick experiment
train_dataset = load_dataset(TRAIN_DATASET_NAME, split='train').shuffle(seed=42).select(range(NUM_SAMPLES_PER_LANG * 5)) # Use a bit more for training


# Load the multilingual validation data and create separate datasets for each language
print("Loading and filtering multilingual validation data from XNLI...")
eval_datasets = {}

for lang in LANGS:
    print(f"  -> Loading XNLI for language: {lang}")
    # The language code is passed as the second argument to load_dataset
    lang_dset = load_dataset(EVAL_DATASET_NAME, lang, split='validation')
    
    # Select a small sample and store it
    eval_datasets[lang] = lang_dset.shuffle(seed=42).select(range(NUM_SAMPLES_PER_LANG))

print(f"Loaded evaluation data for: {list(eval_datasets.keys())}")
# Example: train_dataset is ready, and eval_datasets['es'] is our Spanish test set.

#### Pre-trained Model Performance

In [None]:
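# `eval_classification_accuracy` tokenizes batches, feeds them to the model, takes the argmax of the logits, and compares it to the labels.
# Note: the classification head is still randomly initialized here, so expect roughly chance accuracy (about 33% for 3 classes).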

baseline_results = {}
for lang in LANGS:
    print(f"Evaluating baseline for language: {lang}")
    accuracy = eval_classification_accuracy(model, tokenizer, eval_datasets[lang], DEVICE)
    baseline_results[lang] = accuracy
    print(f"  -> Accuracy: {accuracy:.2%}")
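
The helper `eval_classification_accuracy` lives in `src.eval` and is not shown in this notebook. As a rough sketch of its final step only (argmax over logits, then comparison with the gold labels), the hypothetical function `accuracy_from_logits` below works on plain arrays. Tokenization and the model forward pass are deliberately left out, and the toy logits are invented for illustration.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose argmax logit matches the gold label."""
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    predictions = logits.argmax(axis=-1)
    return float((predictions == labels).mean())

# Toy batch: 4 examples, 3 NLI classes (entailment=0, neutral=1, contradiction=2).
toy_logits = [
    [2.0, 0.1, -1.0],   # predicts 0
    [0.2, 1.5, 0.3],    # predicts 1
    [-0.5, 0.0, 0.9],   # predicts 2
    [1.1, 1.0, 0.2],    # predicts 0
]
toy_labels = [0, 1, 2, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # 3 of 4 correct -> 0.75
```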

#### Visualizing Cross-Lingual Embeddings (t-SNE)

The central idea, explored in analyses of mBERT (the multilingual model released alongside "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding") and more explicitly in work on sentence embeddings such as LASER ("Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond"), is that these models learn a largely language-agnostic semantic space: sentences with equivalent meanings, even in different languages, are mapped to nearby vectors. To visualize this, we use t-SNE to project the high-dimensional embeddings into 2-D while trying to preserve the local neighborhood structure of the original embeddings. With only 12 points, treat the plot as a qualitative illustration; distances between far-apart clusters in a t-SNE plot are not meaningful.

We extract the sentence representation from the final layer's [CLS] token, a common way to get a sentence-level embedding from BERT-like models.

In [None]:
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
from transformers import AutoModel

# Use the base encoder here (no classification head), so the embeddings reflect pre-training only
base_model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True).to(DEVICE)
base_model.eval()

# 1. Define parallel sentences
parallel_sentences = [
    ("The cat is sleeping on the mat.", "en"), # English
    ("El gato está durmiendo en la alfombra.", "es"), # Spanish
    ("Le chat dort sur le tapis.", "fr"), # French
    ("बिल्ली चटाई पर सो रही है।", "hi"), # Hindi
    
    ("This restaurant has amazing food.", "en"),
    ("Este restaurante tiene comida increíble.", "es"),
    ("Ce restaurant propose une cuisine incroyable.", "fr"),
    ("इस रेस्टोरेंट का खाना अद्भुत है।", "hi"),

    ("Can you help me with this problem?", "en"),
    ("¿Puedes ayudarme con este problema?", "es"),
    ("Pouvez-vous m'aider avec ce problème ?", "fr"),
    ("क्या आप इस समस्या में मेरी मदद कर सकते हैं?", "hi"),
]

# 2. Extract [CLS] token embeddings from last layer (for getting a sentence-level embedding from BERT-like models)
sentence_embeddings = []
sentence_labels = []
lang_labels = []

with torch.no_grad():
    for sentence, lang in parallel_sentences:
        inputs = tokenizer(sentence, return_tensors='pt').to(DEVICE)
        outputs = base_model(**inputs)
        # Get the last hidden state and take the [CLS] token's representation (at index 0)
        cls_embedding = outputs.last_hidden_state[0, 0, :].cpu().numpy()
        sentence_embeddings.append(cls_embedding)
        sentence_labels.append(sentence)
        lang_labels.append(lang)

sentence_embeddings = np.array(sentence_embeddings)

# 3. Apply T-SNE
tsne = TSNE(n_components=2, perplexity=5, random_state=42, init='pca', learning_rate='auto')
embeddings_2d = tsne.fit_transform(sentence_embeddings)

# 4. Plot the results
plt.figure(figsize=(14, 10))
colors = {'en': 'blue', 'es': 'red', 'fr': 'green', 'hi': 'purple'}
lang_names = {'en': 'English', 'es': 'Spanish', 'fr': 'French', 'hi': 'Hindi'}

# Plot points, one colour per language
lang_array = np.array(lang_labels)
for lang in np.unique(lang_array):
    mask = lang_array == lang  # boolean mask selecting this language's rows
    plt.scatter(embeddings_2d[mask, 0], embeddings_2d[mask, 1], c=colors[lang], label=lang_names[lang], s=100)

# Annotate points
for i, txt in enumerate(sentence_labels):
    # Shorten long sentences for clarity in the plot
    display_txt = (txt[:50] + '...') if len(txt) > 50 else txt
    plt.annotate(display_txt, (embeddings_2d[i, 0], embeddings_2d[i, 1]), fontsize=9)

plt.title("T-SNE Visualization of Cross-Lingual Sentence Embeddings")
plt.xlabel("T-SNE Dimension 1")
plt.ylabel("T-SNE Dimension 2")
plt.legend()
plt.grid(True)
plt.show()
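
A 2-D projection can be hard to judge by eye, so a numeric check on the original vectors is a useful complement. The sketch below is an assumption about how one might do this, not part of the original experiment: the hypothetical helper `translation_retrieval_accuracy` finds, for each sentence, its most cosine-similar sentence in a different language and checks whether it is a translation of the same sentence. The toy vectors are invented for illustration.

```python
import numpy as np

def translation_retrieval_accuracy(embeddings, group_ids, lang_ids):
    """For each sentence, find the most similar sentence in a different language
    and check whether it belongs to the same translation group."""
    embeddings = np.asarray(embeddings, dtype=float)
    group_ids = np.asarray(group_ids)
    lang_ids = np.asarray(lang_ids)

    # Cosine similarity = dot product of L2-normalised vectors
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T

    # Exclude same-language candidates (this also excludes the sentence itself)
    same_lang = lang_ids[:, None] == lang_ids[None, :]
    sims[same_lang] = -np.inf

    nearest = sims.argmax(axis=1)
    return float((group_ids[nearest] == group_ids).mean())

# Toy data: 3 meanings x 2 languages, with translations placed close together.
toy_embeddings = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1],   # meaning 0 in "en", "es"
    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0],   # meaning 1 in "en", "es"
    [0.1, 0.0, 1.0], [0.0, 0.2, 0.9],   # meaning 2 in "en", "es"
]
toy_groups = [0, 0, 1, 1, 2, 2]
toy_langs = ["en", "es", "en", "es", "en", "es"]
print(translation_retrieval_accuracy(toy_embeddings, toy_groups, toy_langs))
```

With the notebook's variables, the call would be `translation_retrieval_accuracy(sentence_embeddings, [i // 4 for i in range(len(lang_labels))], lang_labels)`, since `parallel_sentences` lists four translations per meaning in order.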

#### Interpreting Cross-Lingual Attention Maps

Looking at a single attention head can be noisy. As shown in papers like "Analyzing Multi-Head Self-Attention" (Voita et al., 2019), different heads specialize: some track position, some handle syntactic dependencies, and so on. A simple, widely used way to get a clearer signal is to average the attention probabilities across all heads within a given layer. This smooths out head-specific noise and shows the dominant attention pattern at that layer, at the cost of hiding what individual specialized heads are doing.

We will compare the averaged attention maps for a parallel sentence pair to see if the model is "reasoning" about the sentences in a structurally similar way.
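
To make the averaging step concrete, here is a minimal NumPy sketch on synthetic data. The attention tensor is random rather than taken from a model, and the sizes are arbitrary. It shows that averaging over the head axis turns a `(num_heads, seq_len, seq_len)` tensor into one `(seq_len, seq_len)` map, and that the result is still a valid attention distribution with the causal (lower-triangular) shape.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, seq_len = 4, 5

# Random scores with a causal mask: position i may only attend to positions <= i.
scores = rng.standard_normal((num_heads, seq_len, seq_len))
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(causal_mask, scores, -np.inf)

# Row-wise softmax per head gives attention probabilities.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
attention = weights / weights.sum(axis=-1, keepdims=True)   # (heads, seq, seq)

# Average over the head axis (the PyTorch equivalent is .mean(dim=0)).
avg_attention = attention.mean(axis=0)                        # (seq, seq)

print(avg_attention.shape)
print(np.allclose(avg_attention.sum(axis=-1), 1.0))    # rows still sum to 1
print(np.allclose(np.triu(avg_attention, k=1), 0.0))   # still lower-triangular
```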

In [None]:
from transformers import AutoModelForCausalLM
import seaborn as sns

# Both mBERT and XGLM can return attention weights; we use XGLM here as a causal LM example.
# NOTE: This can be adapted for mBERT, which gives a full (bidirectional) square attention matrix.
# For causal LMs, the matrix is lower-triangular because tokens cannot attend to later positions.
attn_model = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M", output_attentions=True).to(DEVICE)
attn_tokenizer = AutoTokenizer.from_pretrained("facebook/xglm-564M")
attn_model.eval()

def plot_cross_lingual_attention(sent1, sent2, layer_idx, model, tokenizer):
    """
    Plots the average attention heatmaps for two sentences side-by-side.
    """
    fig, axes = plt.subplots(1, 2, figsize=(16, 7))
    
    for i, (ax, sent) in enumerate(zip(axes, [sent1, sent2])):
        inputs = tokenizer(sent, return_tensors='pt').to(DEVICE)
        with torch.no_grad():
            outputs = model(**inputs)
        
        # attentions is a tuple of (num_layers) tensors
        # Each tensor is (batch_size, num_heads, seq_len, seq_len)
        attentions = outputs.attentions[layer_idx]
        
        # Average across all heads
        avg_attention = attentions.squeeze(0).mean(dim=0).cpu().numpy()
        
        tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
        
        sns.heatmap(avg_attention, xticklabels=tokens, yticklabels=tokens, cmap='viridis', ax=ax)
        ax.set_title(f"Average Attention (Layer {layer_idx})\nSentence {i+1}: '{sent}'")

    plt.tight_layout()
    plt.show()

# Example: "The black cat" vs "El gato negro"
# We want to see if "cat" attends to "black" similarly to how "gato" attends to "negro"
eng_sent = "The black cat sat down."
esp_sent = "El gato negro se sentó."

# A mid-level layer is often good for semantic/syntactic relationships
plot_cross_lingual_attention(eng_sent, esp_sent, layer_idx=12, model=attn_model, tokenizer=attn_tokenizer)

#### Fine-Tune on an English-Only Dataset

In [None]:
from src.train import train_classifier

print("Fine-tuning on English...")
# This function should return the fine-tuned model
tuned_model = train_classifier(
    model,
    tokenizer,
    train_dataset,
    device=DEVICE,
    epochs=3,
    lr=2e-5,
)

#### Zero-shot evaluation of Models

In [None]:
zero_shot_results = {}
for lang in LANGS:
    print(f"Evaluating zero-shot for language: {lang}")
    accuracy = eval_classification_accuracy(tuned_model, tokenizer, eval_datasets[lang], DEVICE)
    zero_shot_results[lang] = accuracy
    print(f"  -> Accuracy: {accuracy:.2%}")

#### Questions to Study:
- Baseline Performance: Is the pretrained mBERT model better than random (about 33% for the 3 NLI classes)? Before fine-tuning, the classification head is randomly initialized, so accuracy should sit close to chance in every language. Treat this as a reference point rather than as evidence of cross-lingual understanding.
- Performance on English: How much did fine-tuning improve the English accuracy? This is your sanity check. `zero_shot_results['en']` should be much higher than `baseline_results['en']`.
- The Transfer Gap: Compare `zero_shot_results` for es, fr, hi to the tuned English performance. The difference is the "transfer gap."
- Linguistic Distance: This is the key insight. You will likely see that performance drops as the language gets more distant from English.
    - es, fr (Romance languages): Should transfer relatively well. English is a Germanic language, but it shares the Latin script, a large Latin-derived vocabulary, and subject-verb-object word order with them.
    - hi (Indo-Aryan language): Will likely see the biggest performance drop. It uses a different script (Devanagari) and different grammatical structures (e.g., subject-object-verb order), so far fewer subword tokens are shared with English.
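
The transfer gap can be computed directly from a results dictionary shaped like `zero_shot_results`. The helper `transfer_gaps` below is a hypothetical convenience function, and the accuracies in `example_results` are made up purely to show the arithmetic; they are not results from this notebook.

```python
def transfer_gaps(results, source_lang="en"):
    """Return source-language accuracy minus each target language's accuracy."""
    source_acc = results[source_lang]
    return {
        lang: source_acc - acc
        for lang, acc in results.items()
        if lang != source_lang
    }

# Made-up accuracies for illustration only.
example_results = {"en": 0.70, "es": 0.64, "fr": 0.63, "hi": 0.52}

gaps = transfer_gaps(example_results)
for lang, gap in gaps.items():
    print(f"{lang}: transfer gap = {gap:.2%}")
```

After running the evaluation cell, `transfer_gaps(zero_shot_results)` gives the per-language gaps relative to English.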