# Week 39 - Telugu Generative Question Answering

**Goal**: Compare three QA approaches BEFORE and AFTER fine-tuning

**Models**: All three start from the `mrm8488/mT5-small-finetuned-tydiqa-for-xqa` checkpoint  
**Data**: 50 training and 100 validation examples (Telugu subset of TyDi-XOR-RC)  

## Three Approaches:

### Model 1: Question (Telugu) + Context (English) → Answer (Telugu)
- **Input**: Telugu question + English context
- **Output**: Telugu answer
- **Use case**: Cross-lingual QA with context support

### Model 2: Question (Telugu) → Answer (Telugu)  
- **Input**: Telugu question only (no context)
- **Output**: Telugu answer
- **Use case**: Closed-book QA, tests model's knowledge

### Model 3: Question (Telugu) → Answer (English)
- **Input**: Telugu question only (no context)
- **Output**: English answer  
- **Use case**: Cross-lingual generation without context
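
The three setups differ only in what goes into the input string. A minimal sketch of that difference, assuming the `question: ... context: ...` format used in the cells further down; `build_prompt` and `max_context_chars` are hypothetical names, not part of the notebook:

```python
def build_prompt(question, context=None, model_type="m1", max_context_chars=400):
    """Return the input string for one of the three setups (hypothetical helper)."""
    if model_type == "m1":
        # Model 1 is the only setup that sees the English context passage.
        if context is None:
            raise ValueError("Model 1 needs an English context passage")
        return f"question: {question} context: {context[:max_context_chars]}"
    if model_type in ("m2", "m3"):
        # Models 2 and 3 share the same input; only the target language differs.
        return f"question: {question}"
    raise ValueError(f"unknown model_type: {model_type!r}")


print(build_prompt("2011 నాటికి గొరిగపూడి గ్రామ జనాభా ఎంత?", model_type="m2"))
```

Because Models 2 and 3 receive identical inputs, any difference between them comes from the fine-tuning targets and the reference answers, not from the prompt.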

## Experimental Design (per model):
1. **Zero-shot (base)**: Evaluate pre-trained model
2. **Few-shot (base)**: Evaluate with 3 examples in prompt
3. **Fine-tune**: Train on 50 Telugu examples (10 epochs)
4. **Zero-shot (fine-tuned)**: Re-evaluate after fine-tuning
5. **Few-shot (fine-tuned)**: Re-evaluate with examples
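
Steps 1, 2, 4 and 5 are evaluation runs, so each model yields four sets of predictions and the whole notebook yields twelve. A small sketch of that grid, assuming the run names follow the `"M1 Base Zero-shot"` pattern used later in the metrics cell; `experiment_grid` is an illustrative helper:

```python
from itertools import product

MODELS = ["M1", "M2", "M3"]
CHECKPOINTS = ["Base", "Fine-tuned"]
PROMPTS = ["Zero-shot", "Few-shot"]


def experiment_grid():
    """List every evaluation run as '<model> <checkpoint> <prompting>'."""
    return [f"{m} {c} {p}" for m, c, p in product(MODELS, CHECKPOINTS, PROMPTS)]


runs = experiment_grid()
print(len(runs))
for name in runs[:4]:
    print(name)
```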

## Analysis:
- Compare all 3 models across zero-shot, few-shot, and fine-tuned settings
- Analyze performance on answerable vs unanswerable questions
- Discuss whether models can answer without context
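
The answerable vs unanswerable comparison amounts to splitting predictions and references by the dataset's `answerable` flag and scoring each half separately. A hedged sketch of the split; exact match is used here only as a simple stand-in metric (the notebook itself reports BLEU and ROUGE), and both function names are hypothetical:

```python
def split_by_answerable(items, answerable_flags):
    """Split a list into (answerable, unanswerable) parts using boolean flags."""
    if len(items) != len(answerable_flags):
        raise ValueError("items and flags must have the same length")
    answerable = [x for x, flag in zip(items, answerable_flags) if flag]
    unanswerable = [x for x, flag in zip(items, answerable_flags) if not flag]
    return answerable, unanswerable


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after stripping whitespace."""
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)


# Toy data, not real model output.
preds = ["2229", "x", "నైజీరియా"]
refs = ["2229", "y", "నైజీరియా"]
flags = [True, False, False]

preds_ans, preds_unans = split_by_answerable(preds, flags)
refs_ans, refs_unans = split_by_answerable(refs, flags)
print(exact_match_rate(preds_ans, refs_ans), exact_match_rate(preds_unans, refs_unans))
```

With only a handful of answerable validation examples, scores on that subset will be noisy, so per-subset numbers are best read alongside the subset sizes.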

---

In [1]:
# 1. SETUP
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install -q transformers datasets torch pandas pyarrow evaluate sacrebleu rouge-score accelerate
    from google.colab import drive
    drive.mount('/content/drive')

print("✓ Setup complete")

Mounted at /content/drive
✓ Setup complete


In [2]:
# 2. IMPORTS
import pandas as pd
import torch
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from datasets import Dataset
from tqdm.auto import tqdm
import evaluate
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

Device: cuda


In [3]:
# 3. LOAD DATA
if IN_COLAB:
    BASE = Path("/content/drive/MyDrive/Colab_Notebooks/NLP/tydi_xor_rc")
else:
    BASE = Path("./tydi_xor_rc")

df_train = pd.read_parquet(BASE / "train.parquet")
df_val = pd.read_parquet(BASE / "validation.parquet")

# Filter Telugu with answers
df_train_te = df_train[(df_train['lang'] == 'te') & (df_train['answer_inlang'].notna())].copy()
df_val_te = df_val[(df_val['lang'] == 'te') & (df_val['answer_inlang'].notna())].copy()

print(f"✓ Loaded data")
print(f"  Train (Telugu): {len(df_train_te)}")
print(f"  Val (Telugu):   {len(df_val_te)}")

✓ Loaded data
  Train (Telugu): 50
  Val (Telugu):   100


In [4]:
# 4. DATA EXPLORATION
print("\n" + "="*60)
print("DATA EXPLORATION")
print("="*60)

print(f"\n1. Dataset sizes:")
print(f"   Training:   {len(df_train_te)} examples")
print(f"   Validation: {len(df_val_te)} examples")

print(f"\n2. Answerability distribution:")
print(f"   Training:")
print(f"     Answerable:   {df_train_te['answerable'].sum()} ({df_train_te['answerable'].sum()/len(df_train_te)*100:.1f}%)")
print(f"     Unanswerable: {(~df_train_te['answerable']).sum()} ({(~df_train_te['answerable']).sum()/len(df_train_te)*100:.1f}%)")
print(f"   Validation:")
print(f"     Answerable:   {df_val_te['answerable'].sum()} ({df_val_te['answerable'].sum()/len(df_val_te)*100:.1f}%)")
print(f"     Unanswerable: {(~df_val_te['answerable']).sum()} ({(~df_val_te['answerable']).sum()/len(df_val_te)*100:.1f}%)")

print(f"\n3. Answer length statistics (Telugu):")
df_train_te['answer_len'] = df_train_te['answer_inlang'].str.len()
df_val_te['answer_len'] = df_val_te['answer_inlang'].str.len()
print(f"   Training:   Mean={df_train_te['answer_len'].mean():.1f}, Max={df_train_te['answer_len'].max()}")
print(f"   Validation: Mean={df_val_te['answer_len'].mean():.1f}, Max={df_val_te['answer_len'].max()}")

print(f"\n4. Context length statistics (English):")
df_train_te['context_len'] = df_train_te['context'].str.len()
df_val_te['context_len'] = df_val_te['context'].str.len()
print(f"   Training:   Mean={df_train_te['context_len'].mean():.0f}, Max={df_train_te['context_len'].max()}")
print(f"   Validation: Mean={df_val_te['context_len'].mean():.0f}, Max={df_val_te['context_len'].max()}")

print(f"\n5. Sample examples:")
print(f"\n   [ANSWERABLE EXAMPLE]")
if df_train_te['answerable'].sum() > 0:
    sample = df_train_te[df_train_te['answerable'] == True].iloc[0]
    print(f"   Question: {sample['question']}")
    print(f"   Answer:   {sample['answer_inlang']}")

print(f"\n   [UNANSWERABLE EXAMPLE]")
if (~df_train_te['answerable']).sum() > 0:
    sample = df_train_te[df_train_te['answerable'] == False].iloc[0]
    print(f"   Question: {sample['question']}")
    print(f"   Answer:   {sample['answer_inlang']}")

print("\n" + "="*60)


DATA EXPLORATION

1. Dataset sizes:
   Training:   50 examples
   Validation: 100 examples

2. Answerability distribution:
   Training:
     Answerable:   5 (10.0%)
     Unanswerable: 45 (90.0%)
   Validation:
     Answerable:   7 (7.0%)
     Unanswerable: 93 (93.0%)

3. Answer length statistics (Telugu):
   Training:   Mean=15.8, Max=45
   Validation: Mean=14.7, Max=45

4. Context length statistics (English):
   Training:   Mean=731, Max=1083
   Validation: Mean=693, Max=942

5. Sample examples:

   [ANSWERABLE EXAMPLE]
   Question: 2011 నాటికి గొరిగపూడి గ్రామ జనాభా ఎంత?
   Answer:   2229

   [UNANSWERABLE EXAMPLE]
   Question: 1990 నాటికి ఆఫ్రికాలో అతిపెద్ద జనాభా కలిగిన దేశం ఏది?
   Answer:   నైజీరియా



In [5]:
# 5. LOAD BASE MODEL
MODEL = "mrm8488/mT5-small-finetuned-tydiqa-for-xqa"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model_base = AutoModelForSeq2SeqLM.from_pretrained(MODEL).to(device)

print(f"✓ Loaded {MODEL}")
print(f"  This model is already fine-tuned on TyDiQA (multilingual QA dataset)")
print(f"  We'll evaluate it BEFORE and AFTER fine-tuning on our Telugu data")


Some weights of the model checkpoint at mrm8488/mT5-small-finetuned-tydiqa-for-xqa were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.2.DenseReluDense.wi.weight', 'decoder.block.1.layer.2.DenseReluDense.wi.weight', 'decoder.block.2.layer.2.DenseReluDense.wi.weight', 'decoder.block.3.layer.2.DenseReluDense.wi.weight', 'decoder.block.4.layer.2.DenseReluDense.wi.weight', 'decoder.block.5.layer.2.DenseReluDense.wi.weight', 'decoder.block.6.layer.2.DenseReluDense.wi.weight', 'decoder.block.7.layer.2.DenseReluDense.wi.weight', 'encoder.block.0.layer.1.DenseReluDense.wi.weight', 'encoder.block.1.layer.1.DenseReluDense.wi.weight', 'encoder.block.2.layer.1.DenseReluDense.wi.weight', 'encoder.block.3.layer.1.DenseReluDense.wi.weight', 'encoder.block.4.layer.1.DenseReluDense.wi.weight', 'encoder.block.5.layer.1.DenseReluDense.wi.weight', 'encoder.block.6.layer.1.DenseReluDense.wi.weight', 'encoder.block.7.layer.1.DenseReluDense.wi.weight']
- This IS expected if yo

✓ Loaded mrm8488/mT5-small-finetuned-tydiqa-for-xqa
  This model is already fine-tuned on TyDiQA (multilingual QA dataset)
  We'll evaluate it BEFORE and AFTER fine-tuning on our Telugu data


In [6]:
# 6. HELPER FUNCTIONS
def evaluate_model(model, approach_name, model_type="m1", use_few_shot=False):
    """
    Evaluate model on validation set with zero-shot or few-shot prompting.

    Args:
        model: The model to evaluate
        approach_name: Name for progress display
        model_type: "m1" (Q+C→A_te), "m2" (Q→A_te), "m3" (Q→A_en)
        use_few_shot: Whether to use few-shot prompting
    """
    model.eval()
    predictions = []

    # Get few-shot examples if needed
    if use_few_shot:
        few_shot_examples = df_train_te.sample(min(3, len(df_train_te)), random_state=42)

    for i in tqdm(range(len(df_val_te)), desc=approach_name):
        row = df_val_te.iloc[i]

        if use_few_shot:
            # Build few-shot prompt based on model type
            prompt_parts = []
            for _, ex in few_shot_examples.iterrows():
                if model_type == "m1":
                    # Model 1: Q(te) + C(en) → A(te)
                    prompt_parts.append(f"question: {ex['question']} context: {ex['context'][:150]}")
                    prompt_parts.append(f"answer: {ex['answer_inlang']}")
                elif model_type == "m2":
                    # Model 2: Q(te) → A(te)
                    prompt_parts.append(f"question: {ex['question']}")
                    prompt_parts.append(f"answer: {ex['answer_inlang']}")
                elif model_type == "m3":
                    # Model 3: Q(te) → A(en)
                    prompt_parts.append(f"question: {ex['question']}")
                    # Use 'answer' column for English answers
                    answer_en = ex['answer'] if pd.notna(ex['answer']) else 'unknown'
                    prompt_parts.append(f"answer: {answer_en}")

            # Add current question
            if model_type == "m1":
                prompt_parts.append(f"question: {row['question']} context: {row['context'][:300]}")
            else:  # m2 or m3
                prompt_parts.append(f"question: {row['question']}")
            prompt_parts.append("answer:")
            prompt = " ".join(prompt_parts)
        else:
            # Zero-shot prompt based on model type
            if model_type == "m1":
                # Model 1: Q(te) + C(en) → A(te)
                prompt = f"question: {row['question']} context: {row['context'][:400]}"
            else:  # m2 or m3
                # Model 2 & 3: Q(te) → A
                prompt = f"question: {row['question']}"

        inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).to(device)

        with torch.no_grad():
            outputs = model.generate(**inputs, max_length=64, num_beams=4)

        pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
        predictions.append(pred)

    return predictions

# References for evaluation (all 100 validation examples)
references_te = df_val_te['answer_inlang'].tolist()  # Telugu answers
# For Model 3: use the 'answer' field for English answers
references_en = df_val_te['answer'].apply(lambda x: x if pd.notna(x) else "unknown").tolist()

print("✓ Helper functions defined")

✓ Helper functions defined
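
In `evaluate_model` the current question and the trailing `answer:` cue sit at the end of the few-shot prompt, while the tokenizer call truncates to 512 tokens. Hugging Face tokenizers truncate on the right by default, so an over-long prompt would lose exactly that final part. One possible safeguard, sketched under assumptions: drop demonstrations until the prompt fits. `fit_few_shot_prompt` and `count_tokens` are hypothetical, and the whitespace counter is only a placeholder for a real tokenizer length:

```python
def fit_few_shot_prompt(examples, query, max_tokens=512, count_tokens=None):
    """Build a few-shot prompt, dropping trailing demonstrations until it fits.

    examples: list of (input_text, answer_text) pairs
    query:    input text for the question being asked
    Returns (prompt, number_of_examples_kept).
    """
    if count_tokens is None:
        # Placeholder length function; swap in a tokenizer-based count in practice.
        count_tokens = lambda text: len(text.split())

    kept = list(examples)
    while True:
        parts = []
        for inp, ans in kept:
            parts.append(inp)
            parts.append(f"answer: {ans}")
        parts.append(query)
        parts.append("answer:")
        prompt = " ".join(parts)
        # Stop when the prompt fits, or when no demonstrations are left to drop.
        if count_tokens(prompt) <= max_tokens or not kept:
            return prompt, len(kept)
        kept.pop()


demos = [("question: a b c", "x")] * 3
prompt, n_kept = fit_few_shot_prompt(demos, "question: q", max_tokens=9)
print(n_kept, "|", prompt)
```

The query is always kept whole here; only the demonstrations are sacrificed, which keeps few-shot and zero-shot runs comparable on the same question.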


---
### MODEL 1: Question (Telugu) + Context (English) → Answer (Telugu)


In [7]:
# 7. MODEL 1 - EXPERIMENT 1: ZERO-SHOT (BASE)
print("\n" + "="*70)
print("MODEL 1: Q(te)+C(en)→A(te) | ZERO-SHOT (BASE)")
print("="*70)

predictions_m1_base_zeroshot = evaluate_model(model_base, "M1 Zero-shot (base)", model_type="m1", use_few_shot=False)

print(f"✓ Generated {len(predictions_m1_base_zeroshot)} predictions")
sentinel_count = sum(1 for p in predictions_m1_base_zeroshot if '<extra_id' in p)
print(f"  Sentinel tokens: {sentinel_count}/{len(predictions_m1_base_zeroshot)}")


MODEL 1: Q(te)+C(en)→A(te) | ZERO-SHOT (BASE)



✓ Generated 100 predictions
  Sentinel tokens: 0/100


In [8]:
# 8. MODEL 1 - EXPERIMENT 2: FEW-SHOT (BASE)
print("\n" + "="*70)
print("MODEL 1: Q(te)+C(en)→A(te) | FEW-SHOT (BASE)")
print("="*70)

predictions_m1_base_fewshot = evaluate_model(model_base, "M1 Few-shot (base)", model_type="m1", use_few_shot=True)

print(f"✓ Generated {len(predictions_m1_base_fewshot)} predictions")


MODEL 1: Q(te)+C(en)→A(te) | FEW-SHOT (BASE)



✓ Generated 100 predictions


In [9]:
# 9. MODEL 1 - EXPERIMENT 3: FINE-TUNING
print("\n" + "="*70)
print("MODEL 1: Q(te)+C(en)→A(te) | FINE-TUNING (10 EPOCHS)")
print("="*70)

# Prepare dataset for Model 1
def preprocess_m1(examples):
    inputs = [f"question: {q} context: {c[:400]}" for q, c in zip(examples['question'], examples['context'])]
    targets = examples['answer_inlang']

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
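    # Leave labels unpadded: DataCollatorForSeq2Seq pads them per batch with -100,
    # which the loss ignores. Padding to max_length here would count pad tokens in the loss.
    labels = tokenizer(targets, max_length=64, truncation=True)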

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Convert to HuggingFace Dataset
train_dataset_m1 = Dataset.from_pandas(df_train_te[['question', 'context', 'answer_inlang']])
train_dataset_m1 = train_dataset_m1.map(preprocess_m1, batched=True, remove_columns=train_dataset_m1.column_names)

# Create Model 1 for fine-tuning
model_m1_finetuned = AutoModelForSeq2SeqLM.from_pretrained(MODEL).to(device)

# Training arguments
training_args_m1 = Seq2SeqTrainingArguments(
    output_dir="./mt5_m1_finetuned",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=3e-4,
    save_strategy="no",
    logging_steps=10,
    predict_with_generate=True,
    fp16=torch.cuda.is_available()
)

# Trainer
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_m1_finetuned)

trainer_m1 = Seq2SeqTrainer(
    model=model_m1_finetuned,
    args=training_args_m1,
    train_dataset=train_dataset_m1,
    data_collator=data_collator
)

# Fine-tune
print("Training Model 1 on 50 Telugu examples...")
trainer_m1.train()
print("\n✓ Model 1 fine-tuning complete!")


MODEL 1: Q(te)+C(en)→A(te) | FINE-TUNING (10 EPOCHS)



Training Model 1 on 50 Telugu examples...




Step,Training Loss
10,164.5844
20,128.118
30,47.992
40,12.52
50,5.9603
60,3.8818
70,3.0039
80,1.9804
90,1.9203
100,1.643



✓ Model 1 fine-tuning complete!


In [10]:
# 10. MODEL 1 - EXPERIMENT 4: ZERO-SHOT (FINE-TUNED)
print("\n" + "="*70)
print("MODEL 1: Q(te)+C(en)→A(te) | ZERO-SHOT (FINE-TUNED)")
print("="*70)

predictions_m1_finetuned_zeroshot = evaluate_model(model_m1_finetuned, "M1 Zero-shot (fine-tuned)", model_type="m1", use_few_shot=False)

print(f"✓ Generated {len(predictions_m1_finetuned_zeroshot)} predictions")


MODEL 1: Q(te)+C(en)→A(te) | ZERO-SHOT (FINE-TUNED)



✓ Generated 100 predictions


In [11]:
# 11. MODEL 1 - EXPERIMENT 5: FEW-SHOT (FINE-TUNED)
print("\n" + "="*70)
print("MODEL 1: Q(te)+C(en)→A(te) | FEW-SHOT (FINE-TUNED)")
print("="*70)

predictions_m1_finetuned_fewshot = evaluate_model(model_m1_finetuned, "M1 Few-shot (fine-tuned)", model_type="m1", use_few_shot=True)

print(f"✓ Generated {len(predictions_m1_finetuned_fewshot)} predictions")


MODEL 1: Q(te)+C(en)→A(te) | FEW-SHOT (FINE-TUNED)



✓ Generated 100 predictions


---
### MODEL 2: Question (Telugu) → Answer (Telugu)


In [12]:
# 12. MODEL 2 - EXPERIMENT 1: ZERO-SHOT (BASE)
print("\n" + "="*70)
print("MODEL 2: Q(te)→A(te) | ZERO-SHOT (BASE)")
print("="*70)

predictions_m2_base_zeroshot = evaluate_model(model_base, "M2 Zero-shot (base)", model_type="m2", use_few_shot=False)

print(f"✓ Generated {len(predictions_m2_base_zeroshot)} predictions")


MODEL 2: Q(te)→A(te) | ZERO-SHOT (BASE)



✓ Generated 100 predictions


In [13]:
# 13. MODEL 2 - EXPERIMENT 2: FEW-SHOT (BASE)
print("\n" + "="*70)
print("MODEL 2: Q(te)→A(te) | FEW-SHOT (BASE)")
print("="*70)

predictions_m2_base_fewshot = evaluate_model(model_base, "M2 Few-shot (base)", model_type="m2", use_few_shot=True)

print(f"✓ Generated {len(predictions_m2_base_fewshot)} predictions")


MODEL 2: Q(te)→A(te) | FEW-SHOT (BASE)



✓ Generated 100 predictions


In [14]:
# 14. MODEL 2 - EXPERIMENT 3: FINE-TUNING
print("\n" + "="*70)
print("MODEL 2: Q(te)→A(te) | FINE-TUNING (10 EPOCHS)")
print("="*70)

# Prepare dataset for Model 2 (no context)
def preprocess_m2(examples):
    inputs = [f"question: {q}" for q in examples['question']]
    targets = examples['answer_inlang']

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
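    # Leave labels unpadded: DataCollatorForSeq2Seq pads them per batch with -100,
    # which the loss ignores. Padding to max_length here would count pad tokens in the loss.
    labels = tokenizer(targets, max_length=64, truncation=True)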

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Convert to HuggingFace Dataset
train_dataset_m2 = Dataset.from_pandas(df_train_te[['question', 'answer_inlang']])
train_dataset_m2 = train_dataset_m2.map(preprocess_m2, batched=True, remove_columns=train_dataset_m2.column_names)

# Create Model 2 for fine-tuning
model_m2_finetuned = AutoModelForSeq2SeqLM.from_pretrained(MODEL).to(device)

# Training arguments
training_args_m2 = Seq2SeqTrainingArguments(
    output_dir="./mt5_m2_finetuned",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=3e-4,
    save_strategy="no",
    logging_steps=10,
    predict_with_generate=True,
    fp16=torch.cuda.is_available()
)

# Trainer
data_collator_m2 = DataCollatorForSeq2Seq(tokenizer, model=model_m2_finetuned)

trainer_m2 = Seq2SeqTrainer(
    model=model_m2_finetuned,
    args=training_args_m2,
    train_dataset=train_dataset_m2,
    data_collator=data_collator_m2
)

# Fine-tune
print("Training Model 2 on 50 Telugu examples...")
trainer_m2.train()
print("\n✓ Model 2 fine-tuning complete!")


MODEL 2: Q(te)→A(te) | FINE-TUNING (10 EPOCHS)



Training Model 2 on 50 Telugu examples...


Step,Training Loss
10,171.0829
20,148.1423
30,118.6429
40,97.6399
50,84.0885
60,67.6524
70,58.7552
80,47.2794
90,51.236
100,48.9112



✓ Model 2 fine-tuning complete!


In [15]:
# 15. MODEL 2 - EXPERIMENT 4: ZERO-SHOT (FINE-TUNED)
print("\n" + "="*70)
print("MODEL 2: Q(te)→A(te) | ZERO-SHOT (FINE-TUNED)")
print("="*70)

predictions_m2_finetuned_zeroshot = evaluate_model(model_m2_finetuned, "M2 Zero-shot (fine-tuned)", model_type="m2", use_few_shot=False)

print(f"✓ Generated {len(predictions_m2_finetuned_zeroshot)} predictions")


MODEL 2: Q(te)→A(te) | ZERO-SHOT (FINE-TUNED)



✓ Generated 100 predictions


In [16]:
# 16. MODEL 2 - EXPERIMENT 5: FEW-SHOT (FINE-TUNED)
print("\n" + "="*70)
print("MODEL 2: Q(te)→A(te) | FEW-SHOT (FINE-TUNED)")
print("="*70)

predictions_m2_finetuned_fewshot = evaluate_model(model_m2_finetuned, "M2 Few-shot (fine-tuned)", model_type="m2", use_few_shot=True)

print(f"✓ Generated {len(predictions_m2_finetuned_fewshot)} predictions")


MODEL 2: Q(te)→A(te) | FEW-SHOT (FINE-TUNED)



✓ Generated 100 predictions


---
### MODEL 3: Question (Telugu) → Answer (English)


In [17]:
# 17. MODEL 3 - EXPERIMENT 1: ZERO-SHOT (BASE)
print("\n" + "="*70)
print("MODEL 3: Q(te)→A(en) | ZERO-SHOT (BASE)")
print("="*70)

predictions_m3_base_zeroshot = evaluate_model(model_base, "M3 Zero-shot (base)", model_type="m3", use_few_shot=False)

print(f"✓ Generated {len(predictions_m3_base_zeroshot)} predictions")


MODEL 3: Q(te)→A(en) | ZERO-SHOT (BASE)



✓ Generated 100 predictions


In [18]:
# 18. MODEL 3 - EXPERIMENT 2: FEW-SHOT (BASE)
print("\n" + "="*70)
print("MODEL 3: Q(te)→A(en) | FEW-SHOT (BASE)")
print("="*70)

predictions_m3_base_fewshot = evaluate_model(model_base, "M3 Few-shot (base)", model_type="m3", use_few_shot=True)

print(f"✓ Generated {len(predictions_m3_base_fewshot)} predictions")


MODEL 3: Q(te)→A(en) | FEW-SHOT (BASE)



✓ Generated 100 predictions


In [19]:
# 19. MODEL 3 - EXPERIMENT 3: FINE-TUNING
print("\n" + "="*70)
print("MODEL 3: Q(te)→A(en) | FINE-TUNING (10 EPOCHS)")
print("="*70)

# Prepare dataset for Model 3 (Telugu question → English answer)
def preprocess_m3(examples):
    inputs = [f"question: {q}" for q in examples['question']]
    # Use 'answer' column for English answers
    targets = [ans if pd.notna(ans) else "unknown" for ans in examples['answer']]

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
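    # Leave labels unpadded: DataCollatorForSeq2Seq pads them per batch with -100,
    # which the loss ignores. Padding to max_length here would count pad tokens in the loss.
    labels = tokenizer(targets, max_length=64, truncation=True)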

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Convert to HuggingFace Dataset
train_dataset_m3 = Dataset.from_pandas(df_train_te[['question', 'answer']])
train_dataset_m3 = train_dataset_m3.map(preprocess_m3, batched=True, remove_columns=train_dataset_m3.column_names)

# Create Model 3 for fine-tuning
model_m3_finetuned = AutoModelForSeq2SeqLM.from_pretrained(MODEL).to(device)

# Training arguments
training_args_m3 = Seq2SeqTrainingArguments(
    output_dir="./mt5_m3_finetuned",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=3e-4,
    save_strategy="no",
    logging_steps=10,
    predict_with_generate=True,
    fp16=torch.cuda.is_available()
)

# Trainer
data_collator_m3 = DataCollatorForSeq2Seq(tokenizer, model=model_m3_finetuned)

trainer_m3 = Seq2SeqTrainer(
    model=model_m3_finetuned,
    args=training_args_m3,
    train_dataset=train_dataset_m3,
    data_collator=data_collator_m3
)

# Fine-tune
print("Training Model 3 on 50 Telugu examples...")
trainer_m3.train()
print("\n✓ Model 3 fine-tuning complete!")


MODEL 3: Q(te)→A(en) | FINE-TUNING (10 EPOCHS)



Training Model 3 on 50 Telugu examples...


Step,Training Loss
10,179.6994
20,160.6426
30,106.8983
40,64.2995
50,30.6445
60,9.9647
70,5.6358
80,2.8828
90,2.4278
100,2.2582



✓ Model 3 fine-tuning complete!


In [20]:
# 20. MODEL 3 - EXPERIMENT 4: ZERO-SHOT (FINE-TUNED)
print("\n" + "="*70)
print("MODEL 3: Q(te)→A(en) | ZERO-SHOT (FINE-TUNED)")
print("="*70)

predictions_m3_finetuned_zeroshot = evaluate_model(model_m3_finetuned, "M3 Zero-shot (fine-tuned)", model_type="m3", use_few_shot=False)

print(f"✓ Generated {len(predictions_m3_finetuned_zeroshot)} predictions")


MODEL 3: Q(te)→A(en) | ZERO-SHOT (FINE-TUNED)



✓ Generated 100 predictions


In [21]:
# 21. MODEL 3 - EXPERIMENT 5: FEW-SHOT (FINE-TUNED)
print("\n" + "="*70)
print("MODEL 3: Q(te)→A(en) | FEW-SHOT (FINE-TUNED)")
print("="*70)

predictions_m3_finetuned_fewshot = evaluate_model(model_m3_finetuned, "M3 Few-shot (fine-tuned)", model_type="m3", use_few_shot=True)

print(f"✓ Generated {len(predictions_m3_finetuned_fewshot)} predictions")


MODEL 3: Q(te)→A(en) | FEW-SHOT (FINE-TUNED)



✓ Generated 100 predictions


---
### METRICS COMPARISON


In [22]:
# 22. METRICS COMPARISON - ALL MODELS
print("\n" + "="*80)
print("METRICS COMPARISON ACROSS ALL 3 MODELS")
print("="*80 + "\n")

# Load evaluation metrics
bleu_metric = evaluate.load("sacrebleu")
rouge_metric = evaluate.load("rouge")

# Prepare reference answers
references_te_list = [[ans] for ans in references_te]  # Telugu answers
references_en_list = [[ans] for ans in references_en]  # English answers

# Store results
results = {}

# Define all approaches with their corresponding references
approaches = {
    # Model 1: Q+C→A(te)
    "M1 Base Zero-shot": (predictions_m1_base_zeroshot, references_te_list),
    "M1 Base Few-shot": (predictions_m1_base_fewshot, references_te_list),
    "M1 Fine-tuned Zero-shot": (predictions_m1_finetuned_zeroshot, references_te_list),
    "M1 Fine-tuned Few-shot": (predictions_m1_finetuned_fewshot, references_te_list),

    # Model 2: Q→A(te)
    "M2 Base Zero-shot": (predictions_m2_base_zeroshot, references_te_list),
    "M2 Base Few-shot": (predictions_m2_base_fewshot, references_te_list),
    "M2 Fine-tuned Zero-shot": (predictions_m2_finetuned_zeroshot, references_te_list),
    "M2 Fine-tuned Few-shot": (predictions_m2_finetuned_fewshot, references_te_list),

    # Model 3: Q→A(en)
    "M3 Base Zero-shot": (predictions_m3_base_zeroshot, references_en_list),
    "M3 Base Few-shot": (predictions_m3_base_fewshot, references_en_list),
    "M3 Fine-tuned Zero-shot": (predictions_m3_finetuned_zeroshot, references_en_list),
    "M3 Fine-tuned Few-shot": (predictions_m3_finetuned_fewshot, references_en_list),
}

# Compute metrics for each approach
for approach_name, (predictions, refs) in approaches.items():
    # BLEU score
    bleu_result = bleu_metric.compute(predictions=predictions, references=refs)
    bleu_score = bleu_result['score']

    # ROUGE scores
    rouge_result = rouge_metric.compute(predictions=predictions, references=[r[0] for r in refs])
    rouge1 = rouge_result['rouge1']
    rouge2 = rouge_result['rouge2']
    rougeL = rouge_result['rougeL']

    results[approach_name] = {
        'BLEU': bleu_score,
        'ROUGE-1': rouge1,
        'ROUGE-2': rouge2,
        'ROUGE-L': rougeL
    }

# Display results by model
for model_num in ["M1", "M2", "M3"]:
    print(f"\n{'='*80}")
    if model_num == "M1":
        print(f"MODEL 1: Question (Telugu) + Context (English) → Answer (Telugu)")
    elif model_num == "M2":
        print(f"MODEL 2: Question (Telugu) → Answer (Telugu)")
    else:
        print(f"MODEL 3: Question (Telugu) → Answer (English)")
    print(f"{'='*80}")

    model_results = {k: v for k, v in results.items() if k.startswith(model_num)}

    for approach_name, metrics in model_results.items():
        short_name = approach_name.replace(model_num + " ", "")
        print(f"{short_name:30} BLEU: {metrics['BLEU']:6.2f}  ROUGE-1: {metrics['ROUGE-1']:.4f}  ROUGE-2: {metrics['ROUGE-2']:.4f}  ROUGE-L: {metrics['ROUGE-L']:.4f}")

# Overall comparison table
print(f"\n{'='*80}")
print("COMPLETE COMPARISON TABLE")
print(f"{'='*80}")
print(f"{'Approach':<30} {'BLEU':>8} {'ROUGE-1':>10} {'ROUGE-2':>10} {'ROUGE-L':>10}")
print("-"*80)
for approach_name, metrics in results.items():
    print(f"{approach_name:<30} {metrics['BLEU']:>8.2f} {metrics['ROUGE-1']:>10.4f} {metrics['ROUGE-2']:>10.4f} {metrics['ROUGE-L']:>10.4f}")
print("="*80)

# Highlight best overall scores
best_bleu = max(results.items(), key=lambda x: x[1]['BLEU'])
best_rouge1 = max(results.items(), key=lambda x: x[1]['ROUGE-1'])
print(f"\n✓ Best BLEU overall: {best_bleu[0]} ({best_bleu[1]['BLEU']:.2f})")
print(f"✓ Best ROUGE-1 overall: {best_rouge1[0]} ({best_rouge1[1]['ROUGE-1']:.4f})")


METRICS COMPARISON ACROSS ALL 3 MODELS





MODEL 1: Question (Telugu) + Context (English) → Answer (Telugu)
Base Zero-shot                 BLEU:   0.00  ROUGE-1: 0.0000  ROUGE-2: 0.0000  ROUGE-L: 0.0000
Base Few-shot                  BLEU:   0.02  ROUGE-1: 0.0000  ROUGE-2: 0.0000  ROUGE-L: 0.0000
Fine-tuned Zero-shot           BLEU:   0.00  ROUGE-1: 0.0000  ROUGE-2: 0.0000  ROUGE-L: 0.0000
Fine-tuned Few-shot            BLEU:   0.00  ROUGE-1: 0.0000  ROUGE-2: 0.0000  ROUGE-L: 0.0000

MODEL 2: Question (Telugu) → Answer (Telugu)
Base Zero-shot                 BLEU:   0.03  ROUGE-1: 0.0000  ROUGE-2: 0.0000  ROUGE-L: 0.0000
Base Few-shot                  BLEU:   0.00  ROUGE-1: 0.0000  ROUGE-2: 0.0000  ROUGE-L: 0.0000
Fine-tuned Zero-shot           BLEU:   0.00  ROUGE-1: 0.0000  ROUGE-2: 0.0000  ROUGE-L: 0.0000
Fine-tuned Few-shot            BLEU:   0.00  ROUGE-1: 0.0000  ROUGE-2: 0.0000  ROUGE-L: 0.0000

MODEL 3: Question (Telugu) → Answer (English)
Base Zero-shot                 BLEU:   0.00  ROUGE-1: 0.0000  ROUGE-2: 0.0000  RO
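
Almost every score in the table above is 0.00, so word-level BLEU and ROUGE give little signal about whether a prediction is close to the Telugu reference. One script-agnostic option is a character n-gram overlap, in the spirit of chrF. The sketch below is a minimal illustration only: `char_ngrams` and `char_ngram_f1` are hypothetical helpers, not part of the notebook and not sacrebleu's chrF implementation, and they count Unicode code points rather than full Telugu grapheme clusters.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams (code points), ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def char_ngram_f1(prediction, reference, n=2):
    """F1 over character n-grams; a rough stand-in for chrF, not the real metric."""
    pred = char_ngrams(prediction, n)
    ref = char_ngrams(reference, n)
    if not pred or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "హన్స్ ఆండర్సాగ్"
print(char_ngram_f1(reference, reference))   # identical strings
print(char_ngram_f1("హన్స్", reference))      # partial answer gets partial credit
print(char_ngram_f1("క కం కం", reference))    # no shared bigrams
```

A partial answer scores above zero here even though it would fail an exact-match check.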

In [23]:
# 23. ANSWERABLE vs UNANSWERABLE ANALYSIS
print("\n" + "="*80)
print("ANSWERABLE vs UNANSWERABLE QUESTIONS ANALYSIS")
print("="*80 + "\n")

# Split validation set by answerability
# Cast to bool so that ~ is a logical NOT even if the column is stored as 0/1
answerable_mask = df_val_te['answerable'].astype(bool).values
unanswerable_mask = ~answerable_mask

answerable_indices = [i for i, x in enumerate(answerable_mask) if x]
unanswerable_indices = [i for i, x in enumerate(unanswerable_mask) if x]

print(f"Validation set breakdown:")
print(f"  Answerable:   {len(answerable_indices)} examples ({len(answerable_indices)/len(df_val_te)*100:.1f}%)")
print(f"  Unanswerable: {len(unanswerable_indices)} examples ({len(unanswerable_indices)/len(df_val_te)*100:.1f}%)")

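# Compute BLEU and ROUGE-1 on a subset of the validation examples
def compute_subset_metrics(predictions, references, indices, subset_name):
    """Return BLEU/ROUGE-1 for the examples at `indices`, or None if the subset is empty.

    Only the first element of each entry in `references` is used as the reference.
    """
    if len(indices) == 0:
        return None

    pred_subset = [predictions[i] for i in indices]
    ref_subset = [references[i][0] for i in indices]

    # BLEU takes one list of references per prediction; ROUGE takes flat strings
    bleu = bleu_metric.compute(predictions=pred_subset, references=[[r] for r in ref_subset])
    rouge = rouge_metric.compute(predictions=pred_subset, references=ref_subset)

    return {
        'BLEU': bleu['score'],
        'ROUGE-1': rouge['rouge1'],
        'subset_name': subset_name,
        'count': len(indices)
    }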

# Analyze M1 fine-tuned few-shot on both subsets
# (every configuration scores near zero, so this is a representative pick, not a clear best)
print(f"\n{'='*80}")
print("MODEL 1 FINE-TUNED FEW-SHOT: Answerable vs Unanswerable")
print(f"{'='*80}")

metrics_answerable = compute_subset_metrics(
    predictions_m1_finetuned_fewshot,
    references_te_list,
    answerable_indices,
    "Answerable"
)

metrics_unanswerable = compute_subset_metrics(
    predictions_m1_finetuned_fewshot,
    references_te_list,
    unanswerable_indices,
    "Unanswerable"
)

if metrics_answerable:
    print(f"\nAnswerable ({metrics_answerable['count']} examples):")
    print(f"  BLEU:    {metrics_answerable['BLEU']:.2f}")
    print(f"  ROUGE-1: {metrics_answerable['ROUGE-1']:.4f}")

if metrics_unanswerable:
    print(f"\nUnanswerable ({metrics_unanswerable['count']} examples):")
    print(f"  BLEU:    {metrics_unanswerable['BLEU']:.2f}")
    print(f"  ROUGE-1: {metrics_unanswerable['ROUGE-1']:.4f}")

# Show sample predictions for unanswerable questions
print(f"\n{'='*80}")
print("SAMPLE PREDICTIONS FOR UNANSWERABLE QUESTIONS")
print(f"{'='*80}")

if len(unanswerable_indices) > 0:
    for idx in unanswerable_indices[:3]:  # Show first 3 unanswerable
        row = df_val_te.iloc[idx]
        print(f"\nQuestion: {row['question']}")
        print(f"Expected: {row['answer_inlang']}")
        print(f"M1 Prediction: {predictions_m1_finetuned_fewshot[idx]}")
        print(f"M2 Prediction: {predictions_m2_finetuned_fewshot[idx]} (no context)")
        print("-" * 80)

print("\n✓ Can models answer without context?")
print("  Compare M1 (with context) vs M2 (without context) on answerable questions")
print("  Also examine M2 performance on unanswerable questions")


ANSWERABLE vs UNANSWERABLE QUESTIONS ANALYSIS

Validation set breakdown:
  Answerable:   7 examples (7.0%)
  Unanswerable: 93 examples (93.0%)

MODEL 1 FINE-TUNED FEW-SHOT: Answerable vs Unanswerable

Answerable (7 examples):
  BLEU:    0.00
  ROUGE-1: 0.0000

Unanswerable (93 examples):
  BLEU:    0.00
  ROUGE-1: 0.0000

SAMPLE PREDICTIONS FOR UNANSWERABLE QUESTIONS

Question: మలేరియా వ్యాధి కి మందు కనిపెట్టిన శాస్త్రవేత్త ఎవరు?
Expected: హన్స్ ఆండర్సాగ్
M1 Prediction: క కం కం
M2 Prediction: <0xF7>సన్నసన్నసన్నన్నాడున్నాడున్నాడుsummarysummaryసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్నసన్న (no context)
--------------------------------------------------------------------------------

Question: ఈస్ట్ ఇండియా కంపెనీ భారతదేశంలోకి ఎప్పుడు వచ్చింది?
Expected: 1608
M1 Prediction: అ అ అ అం
M2 Prediction: సన్నసన్నసన్నసన్నసన్నసన్న
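
The M2 predictions above repeat a single fragment many times, which suggests the decoder is stuck in a loop rather than attempting an answer. A quick way to count how many predictions look like this is the share of distinct character n-grams. This is a sketch under assumptions: `distinct_ngram_ratio` and `looks_degenerate` are hypothetical helpers, and the n-gram size and threshold are arbitrary choices that would need tuning on real outputs.

```python
def distinct_ngram_ratio(text, n=4):
    """Share of unique character n-grams; values near 0 suggest a repetition loop."""
    chars = "".join(text.split())
    total = len(chars) - n + 1
    if total <= 0:
        return 1.0  # too short to judge, treat as non-repetitive
    ngrams = [chars[i:i + n] for i in range(total)]
    return len(set(ngrams)) / total

def looks_degenerate(text, n=4, threshold=0.5):
    """Flag outputs whose distinct n-gram ratio falls below an assumed threshold."""
    return distinct_ngram_ratio(text, n) < threshold

samples = ["సన్న" * 20, "హన్స్ ఆండర్సాగ్", "1608"]
for s in samples:
    print(f"{distinct_ngram_ratio(s):.2f}  degenerate={looks_degenerate(s)}  {s[:20]}")
```

Very short loops such as `క కం కం` fall below the n-gram length and are not flagged, so this check catches only the long repetitive outputs.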

In [None]:
# 24. BETTER EVALUATION METRICS (Exact Match & Token F1)
print("\n" + "="*80)
print("ALTERNATIVE EVALUATION: EXACT MATCH & TOKEN F1")
print("="*80 + "\n")

def exact_match(predictions, references):
    """Percentage of predictions equal to their reference after stripping whitespace (case-sensitive)"""
    if not predictions:
        return 0.0  # avoid dividing by zero on an empty subset
    matches = sum(1 for p, r in zip(predictions, references) if p.strip() == r.strip())
    return matches / len(predictions) * 100

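def token_f1(predictions, references):
    """Mean token-level F1 in percent, over sets of lowercased whitespace tokens.

    Because sets are used, a token repeated many times in a prediction counts once.
    """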
    scores = []
    for pred, ref in zip(predictions, references):
        pred_tokens = set(pred.lower().split())
        ref_tokens = set(ref.lower().split())
        if len(pred_tokens) == 0 and len(ref_tokens) == 0:
            scores.append(1.0)
        elif len(pred_tokens) == 0 or len(ref_tokens) == 0:
            scores.append(0.0)
        else:
            overlap = len(pred_tokens & ref_tokens)
            precision = overlap / len(pred_tokens)
            recall = overlap / len(ref_tokens)
            if precision + recall == 0:
                scores.append(0.0)
            else:
                f1 = 2 * precision * recall / (precision + recall)
                scores.append(f1)
    return sum(scores) / len(scores) * 100

# Compute EM and token F1 for the M1 and M2 configurations (M3 is handled separately below)
better_results = {}

model_predictions = {
    "M1 Base Zero-shot": predictions_m1_base_zeroshot,
    "M1 Base Few-shot": predictions_m1_base_fewshot,
    "M1 Fine-tuned Zero-shot": predictions_m1_finetuned_zeroshot,
    "M1 Fine-tuned Few-shot": predictions_m1_finetuned_fewshot,
    "M2 Base Zero-shot": predictions_m2_base_zeroshot,
    "M2 Base Few-shot": predictions_m2_base_fewshot,
    "M2 Fine-tuned Zero-shot": predictions_m2_finetuned_zeroshot,
    "M2 Fine-tuned Few-shot": predictions_m2_finetuned_fewshot,
}

for model_name, preds in model_predictions.items():
    # model_predictions holds only M1/M2 entries, so the Telugu references apply;
    # the English branch is a fallback in case M3 entries are ever added to the dict
    if "M1" in model_name or "M2" in model_name:
        refs = references_te
    else:
        refs = references_en

    em = exact_match(preds, refs)
    f1 = token_f1(preds, refs)
    better_results[model_name] = {'EM': em, 'Token_F1': f1}
    print(f"{model_name:30} EM: {em:5.1f}%  Token F1: {f1:5.1f}%")

# Model 3 results (English references)
print("\nModel 3 (Q→A_en):")
m3_predictions = {
    "M3 Base Zero-shot": predictions_m3_base_zeroshot,
    "M3 Base Few-shot": predictions_m3_base_fewshot,
    "M3 Fine-tuned Zero-shot": predictions_m3_finetuned_zeroshot,
    "M3 Fine-tuned Few-shot": predictions_m3_finetuned_fewshot,
}
for model_name, preds in m3_predictions.items():
    em = exact_match(preds, references_en)
    f1 = token_f1(preds, references_en)
    better_results[model_name] = {'EM': em, 'Token_F1': f1}
    print(f"{model_name:30} EM: {em:5.1f}%  Token F1: {f1:5.1f}%")


ALTERNATIVE EVALUATION: EXACT MATCH & TOKEN F1

M1 Base Zero-shot              EM:   0.0%  Token F1:   0.0%
M1 Base Few-shot               EM:   0.0%  Token F1:   0.1%
M1 Fine-tuned Zero-shot        EM:   0.0%  Token F1:   0.0%
M1 Fine-tuned Few-shot         EM:   0.0%  Token F1:   0.0%
M2 Base Zero-shot              EM:   0.0%  Token F1:   0.4%
M2 Base Few-shot               EM:   0.0%  Token F1:   0.0%
M2 Fine-tuned Zero-shot        EM:   0.0%  Token F1:   0.0%
M2 Fine-tuned Few-shot         EM:   0.0%  Token F1:   0.0%

Model 3 (Q→A_en):
M3 Base Zero-shot              EM:   0.0%  Token F1:   0.0%
M3 Base Few-shot               EM:   0.0%  Token F1:   0.0%
M3 Fine-tuned Zero-shot        EM:   0.0%  Token F1:   0.8%
M3 Fine-tuned Few-shot         EM:   0.0%  Token F1:   0.0%
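
The `token_f1` function in this cell compares sets of tokens, so a prediction that repeats a filler word dozens of times is penalised for it only once. The usual SQuAD-style token F1 counts tokens as a multiset instead. The sketch below contrasts the two on a single pair; `set_f1` mirrors the set logic of this cell for one example, and `multiset_f1` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
from collections import Counter

def set_f1(prediction, reference):
    """Set-based F1 for one pair, following the logic of token_f1 in this cell."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    if not pred or not ref:
        return float(pred == ref)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def multiset_f1(prediction, reference):
    """SQuAD-style F1 for one pair: repeated tokens count every time they appear."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A prediction that contains the right word once, padded with repeated filler
prediction = "hans " + "the " * 8
reference = "hans andersag"
print(f"set-based F1:      {set_f1(prediction, reference):.3f}")
print(f"multiset-based F1: {multiset_f1(prediction, reference):.3f}")
```

On repetitive generations like the ones in this notebook, the multiset variant gives the lower and arguably fairer score.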



In [None]:
# 25. SAMPLE PREDICTIONS ANALYSIS
print("\n" + "="*80)
print("SAMPLE PREDICTIONS: WHAT DID THE MODELS GENERATE?")
print("="*80 + "\n")

# Show the first 3 validation examples for each model's base zero-shot predictions
for model_type, preds, pred_source in [
    ("Model 1 (Q+C→A_te)", predictions_m1_base_zeroshot, "M1 Base Zero-shot"),
    ("Model 2 (Q→A_te)", predictions_m2_base_zeroshot, "M2 Base Zero-shot"),
    ("Model 3 (Q→A_en)", predictions_m3_base_zeroshot, "M3 Base Zero-shot"),
]:
    print(f"\n{model_type} - {pred_source}")
    print("-" * 80)
    for i in range(min(3, len(preds))):
        q = df_val_te.iloc[i]['question']
        # Cell 23 reads the Telugu reference from 'answer_inlang'; 'answer' may hold a different language
        ans_te = df_val_te.iloc[i]['answer']
        ans_en = df_val_te.iloc[i]['answer_xor'] if 'answer_xor' in df_val_te.columns else "N/A"
        pred = preds[i]
        answerable = not pd.isna(ans_te) and ans_te != ""

        print(f"\nExample {i+1}:")
        print(f"  Question (Te): {q[:60]}...")
        print(f"  Answerable: {answerable}")
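        print(f"  Ground Truth (Te): {ans_te}")
        # model_type reads "Model 3 ...", which never contains "M3"; pred_source does
        if "M3" in pred_source:
            print(f"  Ground Truth (En): {ans_en}")
        print(f"  Model Prediction: {pred}")
        # Compare against the reference in the model's target language
        target = ans_en if "M3" in pred_source else ans_te
        print(f"  Match: {pred.strip() == str(target).strip()}")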


SAMPLE PREDICTIONS: WHAT DID THE MODELS GENERATE?


Model 1 (Q+C→A_te) - M1 Base Zero-shot
--------------------------------------------------------------------------------

Example 1:
  Question (Te): మలేరియా వ్యాధి కి మందు కనిపెట్టిన శాస్త్రవేత్త ఎవరు?...
  Answerable: True
  Ground Truth (Te): Hans Andersag
  Model Prediction: రామ్ శ్రీత్త్ కి మందునింనిమె the the the the of theing the the the of the the the of the the the of the the the theని with the the of the the the the the the the the the the the the the the the the the the the the the the the
  Match: False

Example 2:
  Question (Te): మున్నా చిత్రానికి సంగీత దర్శకుడు ఎవరు?...
  Answerable: True
  Ground Truth (Te): Harish Jairaj
  Model Prediction: మీ మీరూఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠఠ
  Match: False

Example 3:
  Question (Te): ఈస్ట్ ఇండియా కంపెనీ భారతదేశంలోకి ఎప్పుడు వచ్చింది?...
  Answerable: True
  Ground Truth (Te): 1608
  Model Prediction: ండీ మీ మీ మీ మీ教えてください మీ教えてください教えてくださいడ్గూడ్డ్ింగ్
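
The sample predictions above mix Telugu with English words and even Japanese text, although Model 1 is meant to answer in Telugu. A simple way to quantify this is the share of characters that fall in the Unicode Telugu block (U+0C00 to U+0C7F). This is a minimal sketch; `telugu_char_fraction` is a hypothetical helper, and digits or punctuation in a legitimate answer such as `1608` count as non-Telugu here.

```python
def telugu_char_fraction(text):
    """Fraction of non-whitespace characters inside the Telugu block U+0C00..U+0C7F."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    telugu = sum(1 for ch in chars if "\u0c00" <= ch <= "\u0c7f")
    return telugu / len(chars)

samples = [
    "హన్స్ ఆండర్సాగ్",          # Telugu only
    "మీ the the the of the",   # mixed Telugu and English
    "the the the",             # English only
]
for s in samples:
    print(f"{telugu_char_fraction(s):.2f}  {s}")
```

For Model 3, whose target is English, the same function could be read the other way round: a high Telugu fraction would indicate the model ignored the target language.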

In [None]:
# 26. EVALUATION ON ANSWERABLE SUBSET ONLY
print("\n" + "="*80)
print("EVALUATION ON ANSWERABLE VALIDATION EXAMPLES ONLY")
print("="*80 + "\n")

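# Identify answerable examples as rows with a non-empty 'answer'.
# This is a different criterion from the 'answerable' column used in cell 23,
# so the two cells can report different splits.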
answerable_mask = (df_val_te['answer'].notna()) & (df_val_te['answer'] != "")
answerable_indices = np.where(answerable_mask)[0]
print(f"Validation split: {len(df_val_te)} total examples")
print(f"Answerable examples: {len(answerable_indices)} ({len(answerable_indices)/len(df_val_te)*100:.1f}%)")
print(f"Unanswerable examples: {len(df_val_te) - len(answerable_indices)} ({(len(df_val_te)-len(answerable_indices))/len(df_val_te)*100:.1f}%)\n")

if len(answerable_indices) > 0:
    # Compute metrics ONLY on answerable subset
    def exact_match_subset(predictions, references, indices):
        subset_preds = [predictions[i] for i in indices]
        subset_refs = [references[i] for i in indices]
        return exact_match(subset_preds, subset_refs)

    def token_f1_subset(predictions, references, indices):
        subset_preds = [predictions[i] for i in indices]
        subset_refs = [references[i] for i in indices]
        return token_f1(subset_preds, subset_refs)

    print("ANSWERABLE-ONLY RESULTS:")
    print("-" * 80)

    answerable_results = {}
    for model_name, preds in model_predictions.items():
        refs = references_te  # M1/M2 use Telugu
        em = exact_match_subset(preds, refs, answerable_indices)
        f1 = token_f1_subset(preds, refs, answerable_indices)
        answerable_results[model_name] = {'EM': em, 'Token_F1': f1}
        print(f"{model_name:30} EM: {em:5.1f}%  Token F1: {f1:5.1f}%")

    # Model 3 (English references)
    print("\nModel 3 (Q→A_en):")
    m3_predictions = {
        "M3 Base Zero-shot": predictions_m3_base_zeroshot,
        "M3 Base Few-shot": predictions_m3_base_fewshot,
        "M3 Fine-tuned Zero-shot": predictions_m3_finetuned_zeroshot,
        "M3 Fine-tuned Few-shot": predictions_m3_finetuned_fewshot,
    }
    for model_name, preds in m3_predictions.items():
        em = exact_match_subset(preds, references_en, answerable_indices)
        f1 = token_f1_subset(preds, references_en, answerable_indices)
        # Store M3 scores too, so answerable_results covers all three models
        answerable_results[model_name] = {'EM': em, 'Token_F1': f1}
        print(f"{model_name:30} EM: {em:5.1f}%  Token F1: {f1:5.1f}%")
else:
    print("No answerable examples in validation set!")


EVALUATION ON ANSWERABLE VALIDATION EXAMPLES ONLY

Validation split: 100 total examples
Answerable examples: 100 (100.0%)
Unanswerable examples: 0 (0.0%)

ANSWERABLE-ONLY RESULTS:
--------------------------------------------------------------------------------
M1 Base Zero-shot              EM:   0.0%  Token F1:   0.0%
M1 Base Few-shot               EM:   0.0%  Token F1:   0.1%
M1 Fine-tuned Zero-shot        EM:   0.0%  Token F1:   0.0%
M1 Fine-tuned Few-shot         EM:   0.0%  Token F1:   0.0%
M2 Base Zero-shot              EM:   0.0%  Token F1:   0.4%
M2 Base Few-shot               EM:   0.0%  Token F1:   0.0%
M2 Fine-tuned Zero-shot        EM:   0.0%  Token F1:   0.0%
M2 Fine-tuned Few-shot         EM:   0.0%  Token F1:   0.0%

Model 3 (Q→A_en):
M3 Base Zero-shot              EM:   0.0%  Token F1:   0.0%
M3 Base Few-shot               EM:   0.0%  Token F1:   0.0%
M3 Fine-tuned Zero-shot        EM:   0.0%  Token F1:   0.8%
M3 Fine-tuned Few-shot         EM:   0.0%  Token F1:   0.0%
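
Exact match is 0.0% in every row. The generations shown earlier are far from the references, so normalisation would not rescue them, but a strict string comparison can also fail on harmless differences such as trailing punctuation, casing, or Unicode composition. The sketch below shows one possible normalised variant. `normalize_answer` and `normalized_exact_match` are hypothetical helpers, and stripping zero-width joiners is an assumption that may not suit every Telugu use case.

```python
import unicodedata

def normalize_answer(text):
    """Normalise an answer string before comparison (assumed, simplified rules)."""
    text = unicodedata.normalize("NFC", str(text))
    # Assumption: zero-width (non-)joiners are not meaningful for answer matching
    text = text.replace("\u200c", "").replace("\u200d", "")
    # Drop characters in any Unicode punctuation category (Pc, Pd, Po, ...)
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Lowercase (no effect on Telugu) and collapse runs of whitespace
    return " ".join(text.lower().split())

def normalized_exact_match(predictions, references):
    """Percentage of predictions matching their reference after normalisation."""
    if not predictions:
        return 0.0
    matches = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return matches / len(predictions) * 100

preds = ["1608.", "  Hans   Andersag ", "క కం కం"]
refs = ["1608", "hans andersag", "హన్స్ ఆండర్సాగ్"]
print(f"Normalised EM: {normalized_exact_match(preds, refs):.1f}%")
```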