# Strong Baseline: Fine-tuned mBART (Complete)

**This notebook does EVERYTHING:**
1. ✅ Trains mBART on CS-Sum dataset
2. ✅ Generates predictions on test set
3. ✅ Evaluates with ROUGE, BERTScore, CMC
4. ✅ Shows results
5. ✅ Downloads predictions + scores

**Run on Google Colab with GPU enabled!**

**Time:** 2-4 hours (mostly training)

## Step 1: Setup & Installation

In [None]:
# Install all dependencies
!pip install -q transformers==4.36.0 datasets==2.16.0 accelerate==0.25.0
!pip install -q sentencepiece==0.1.99 rouge-score==0.1.2 bert-score==0.3.13
!pip install -q langdetect==1.0.9

print("✅ All dependencies installed!")

In [None]:
# Import libraries
import json
import torch
import numpy as np
from collections import Counter
from datasets import Dataset
from transformers import (
    MBartForConditionalGeneration,
    MBart50TokenizerFast,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from rouge_score import rouge_scorer
from bert_score import score as bert_score_fn
from langdetect import detect_langs, LangDetectException
from tqdm import tqdm

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ WARNING: No GPU detected! Training will be VERY slow.")
    print("   Enable GPU: Runtime → Change runtime type → GPU (T4)")

## Step 2: Upload Your Data

**Click the folder icon on the left →** Upload these files:
- `cs_sum_train.jsonl`
- `cs_sum_dev.jsonl`
- `cs_sum_test.jsonl`

Then run the cell below to verify.

In [None]:
import os

# Check if files exist
required_files = ['cs_sum_train.jsonl', 'cs_sum_dev.jsonl', 'cs_sum_test.jsonl']
all_present = True

for file in required_files:
    if os.path.exists(file):
        # Count lines inside a context manager so the file handle is closed
        with open(file, 'r', encoding='utf-8') as f:
            lines = sum(1 for _ in f)
        print(f"✅ {file}: {lines} examples")
    else:
        print(f"❌ {file}: NOT FOUND")
        all_present = False

if all_present:
    print("\n✅ All data files present! Ready to proceed.")
else:
    print("\n❌ Please upload the missing files before continuing.")

## Step 3: Load and Prepare Data
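
The loader in the next cell reads one JSON object per line and uses three fields: `messages` (a list of objects with a `text` key), `summary`, and `thread_id`. A minimal sketch of one such record and of the flattening step follows. The sample content is invented for illustration and is not taken from CS-Sum, and `flatten_record` is a hypothetical helper name.

```python
import json

# Hypothetical record: the field names match what prepare_data() reads,
# but the text itself is made up for illustration.
sample_line = json.dumps({
    "thread_id": "demo-001",
    "messages": [
        {"text": "我们明天 meeting 几点?"},
        {"text": "Let's do 3pm, 在会议室 B."},
    ],
    "summary": "They agree to meet at 3pm in room B.",
}, ensure_ascii=False)

def flatten_record(line):
    """Parse one JSONL line and join its message texts with spaces."""
    item = json.loads(line)
    thread = " ".join(msg["text"] for msg in item.get("messages", []))
    return {
        "thread_id": item.get("thread_id", ""),
        "thread": thread,
        "summary": item.get("summary", ""),
    }

record = flatten_record(sample_line)
print(record["thread"])
print(record["summary"])
```

If your files use different field names, adjust `prepare_data` accordingly; records with an empty thread or summary are dropped by the notebook.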

In [None]:
def load_jsonl(filepath):
    """Load JSONL file."""
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data

def prepare_data(data):
    """Convert to format for training."""
    prepared = []
    for item in data:
        # Concatenate thread messages
        messages = item.get('messages', [])
        thread_text = ' '.join([msg['text'] for msg in messages])
        
        # Get summary
        summary = item.get('summary', '')
        
        if thread_text and summary:
            prepared.append({
                'thread_id': item.get('thread_id', ''),
                'thread': thread_text,
                'summary': summary
            })
    
    return prepared

# Load data
print("Loading data...")
train_data = prepare_data(load_jsonl('cs_sum_train.jsonl'))
dev_data = prepare_data(load_jsonl('cs_sum_dev.jsonl'))
test_data = prepare_data(load_jsonl('cs_sum_test.jsonl'))

print(f"Train: {len(train_data)} examples")
print(f"Dev: {len(dev_data)} examples")
print(f"Test: {len(test_data)} examples")

# Show example
print("\nExample:")
print(f"Thread: {train_data[0]['thread'][:150]}...")
print(f"Summary: {train_data[0]['summary']}")

# Convert to HuggingFace Dataset
train_dataset = Dataset.from_list(train_data)
dev_dataset = Dataset.from_list(dev_data)
test_dataset = Dataset.from_list(test_data)

## Step 4: Initialize Model

In [None]:
# Load model and tokenizer
model_name = "facebook/mbart-large-50-many-to-many-mmt"

print(f"Loading {model_name}...")
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

print(f"✅ Model loaded: {model.num_parameters():,} parameters")
print(f"✅ Tokenizer vocab size: {len(tokenizer)}")

## Step 5: Preprocess Data
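
Seq2seq training conventionally marks label positions that should not contribute to the loss with `-100`, which is why `compute_metrics` in Step 6 maps `-100` back to the pad token before decoding. A minimal sketch of that convention in plain Python follows. `PAD_ID = 1`, the token ids, and the helper names are assumptions for illustration; in the notebook, read the real pad id from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
PAD_ID = 1           # assumed pad id for this sketch; use tokenizer.pad_token_id

def pad_labels(label_ids, max_len, fill=IGNORE_INDEX):
    """Right-pad a label sequence to max_len with the ignore index."""
    return label_ids + [fill] * (max_len - len(label_ids))

def unmask_for_decoding(label_ids, pad_id=PAD_ID):
    """Swap the ignore index back to the pad id so the ids can be decoded."""
    return [pad_id if t == IGNORE_INDEX else t for t in label_ids]

# Two label sequences of different lengths (arbitrary token ids)
batch = [[5, 87, 2], [5, 87, 11, 9, 2]]
width = max(len(seq) for seq in batch)

padded = [pad_labels(seq, width) for seq in batch]
restored = [unmask_for_decoding(seq) for seq in padded]
print(padded)
print(restored)
```

Padding labels with the pad token id instead of `-100` would make the model spend loss on predicting padding, so it is worth checking which value ends up in the `labels` tensor.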

In [None]:
# Tokenization parameters
max_input_length = 512
max_target_length = 128

# Set source and target languages
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "en_XX"

def preprocess_function(examples):
    """Tokenize inputs and targets."""
    # No padding here: DataCollatorForSeq2Seq pads each batch dynamically and
    # fills label padding with -100, so pad positions are excluded from the loss.
    inputs = tokenizer(
        examples['thread'],
        max_length=max_input_length,
        truncation=True
    )
    
    # text_target tokenizes with tgt_lang; it replaces the deprecated
    # as_target_tokenizer() context manager.
    targets = tokenizer(
        text_target=examples['summary'],
        max_length=max_target_length,
        truncation=True
    )
    
    inputs['labels'] = targets['input_ids']
    return inputs

# Tokenize datasets
print("Tokenizing datasets...")
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_dev = dev_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)
print("✅ Tokenization complete!")

## Step 6: Training Setup
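
The batch size, gradient accumulation, and epoch count in the cell below determine how many optimizer updates the run makes, which in turn bounds how often `eval_steps` fires and how far `warmup_steps` reaches. The sketch below approximates that arithmetic. The helper name is hypothetical, the rounding is an approximation of how `Trainer` counts steps on a single GPU, and 2,584 is the training-set size quoted at the end of this notebook.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Approximate number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

# per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 4 * 4
total = optimizer_steps(2584, per_device_batch=4, grad_accum=4, epochs=3)

print(f"Effective batch size: {effective_batch}")
print(f"Approximate optimizer steps: {total}")
print(f"Evaluations at eval_steps=200: {total // 200}")
```

Under these assumptions the run makes fewer than 500 updates, so a 500-step warmup would still be ramping the learning rate when training ends, and only two evaluations would happen. Compare against the step count shown in the training progress bar before changing anything.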

In [None]:
# Training hyperparameters
training_args = Seq2SeqTrainingArguments(
    output_dir="./mbart_checkpoints",
    
    # Training config
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    
    # Optimization
    learning_rate=3e-5,
    warmup_steps=500,
    weight_decay=0.01,
    
    # Evaluation
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    
    # Generation
    predict_with_generate=True,
    generation_max_length=128,
    
    # Misc
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    fp16=torch.cuda.is_available(),
    push_to_hub=False,
)

# ROUGE scorer for training evaluation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

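def compute_metrics(eval_pred):
    """Compute ROUGE during training."""
    predictions, labels = eval_pred
    
    # Some models return a tuple; the generated ids come first
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    
    # -100 marks ignored/padded positions and cannot be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)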
    
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    
    for pred, label in zip(decoded_preds, decoded_labels):
        scores = scorer.score(label, pred)
        rouge_scores['rouge1'].append(scores['rouge1'].fmeasure)
        rouge_scores['rouge2'].append(scores['rouge2'].fmeasure)
        rouge_scores['rougeL'].append(scores['rougeL'].fmeasure)
    
    return {
        'rouge1': np.mean(rouge_scores['rouge1']),
        'rouge2': np.mean(rouge_scores['rouge2']),
        'rougeL': np.mean(rouge_scores['rougeL'])
    }

# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Initialize trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("✅ Trainer initialized!")

## Step 7: TRAIN! 🚀

**This will take 2-4 hours. Go get coffee! ☕**

Monitor:
- Loss should decrease
- ROUGE scores should increase

In [None]:
print("="*60)
print("Starting training...")
print("This will take ~2-4 hours. Monitor the progress below.")
print("="*60)
print()

train_result = trainer.train()

print("\n" + "="*60)
print("✅ TRAINING COMPLETE!")
print("="*60)
print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"Final training loss: {train_result.metrics['train_loss']:.4f}")
print("="*60)

## Step 8: Generate Predictions on Test Set

In [None]:
print("Generating predictions on test set...")
print(f"Processing {len(test_data)} examples...\n")

predictions_list = []

# Switch off dropout; the model can still be in training mode after trainer.train()
model.eval()

# Generate predictions
for item in tqdm(test_data):
    inputs = tokenizer(
        item['thread'],
        return_tensors="pt",
        max_length=512,
        truncation=True
    ).to(model.device)
    
    with torch.no_grad():
        summary_ids = model.generate(
            **inputs,  # passes input_ids and attention_mask
            # mBART-50 expects the target language code as the first generated token
            forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
            max_length=128,
            num_beams=4,
            length_penalty=1.0,
            early_stopping=True,
            no_repeat_ngram_size=3
        )
    
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    predictions_list.append({
        'thread_id': item['thread_id'],
        'prediction': summary,
        'reference': item['summary'],
        'thread': item['thread']
    })

print(f"\n✅ Generated {len(predictions_list)} predictions")

## Step 9: Evaluate with All Metrics

Now we compute:
1. ROUGE (content selection)
2. BERTScore (semantic similarity)
3. CMC (code-mixing coverage)

In [None]:
# Extract predictions and references
predictions = [p['prediction'] for p in predictions_list]
references = [p['reference'] for p in predictions_list]
threads = [p['thread'] for p in predictions_list]

print("Computing evaluation metrics...\n")

In [None]:
# 1. ROUGE Scores
print("📊 Computing ROUGE...")

rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

for pred, ref in zip(predictions, references):
    if pred and ref:
        scores = scorer.score(ref, pred)
        rouge_scores['rouge1'].append(scores['rouge1'].fmeasure)
        rouge_scores['rouge2'].append(scores['rouge2'].fmeasure)
        rouge_scores['rougeL'].append(scores['rougeL'].fmeasure)

rouge_results = {
    'rouge1_f1': np.mean(rouge_scores['rouge1']),
    'rouge2_f1': np.mean(rouge_scores['rouge2']),
    'rougeL_f1': np.mean(rouge_scores['rougeL'])
}

print("✅ ROUGE computed")
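
ROUGE-1 F1 is the harmonic mean of unigram precision and recall between a prediction and its reference. The sketch below computes it with lowercase whitespace tokens and no stemming, so its numbers will not match `rouge_score` exactly. The function name and the sample sentences are illustrative.

```python
from collections import Counter

def rouge1_f1(reference, prediction):
    """Unigram-overlap F1 using lowercase whitespace tokens (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches)
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

ref = "they agree to meet at 3pm"
pred = "they will meet at 3pm tomorrow"
print(f"ROUGE-1 F1: {rouge1_f1(ref, pred):.4f}")
```

Here four of the six tokens on each side overlap, so precision and recall are both 4/6. Whitespace tokenization treats an unspaced Chinese run as one token, which is one reason ROUGE on code-mixed text should be read with care.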

In [None]:
# 2. BERTScore
print("🧠 Computing BERTScore (this takes ~1 minute)...")

P, R, F1 = bert_score_fn(
    predictions,
    references,
    lang='en',
    model_type='bert-base-multilingual-cased',
    verbose=False,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

bertscore_results = {
    'bertscore_precision': P.mean().item(),
    'bertscore_recall': R.mean().item(),
    'bertscore_f1': F1.mean().item()
}

print("✅ BERTScore computed")
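
BERTScore matches each token embedding in one text to its most similar token in the other by cosine similarity, then averages: over prediction tokens for precision and over reference tokens for recall. The sketch below shows only that greedy-matching step on hand-made 2-D vectors. Real BERTScore uses contextual embeddings from the chosen model (with optional IDF weighting), so the vectors and helper names here are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(pred_vecs, ref_vecs):
    """Return (precision, recall, f1) from best-match cosine similarities."""
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy "token embeddings": two prediction tokens, three reference tokens
pred_vecs = [(1.0, 0.0), (0.0, 1.0)]
ref_vecs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

p, r, f1 = greedy_match_scores(pred_vecs, ref_vecs)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Every prediction token has an exact match, so precision is 1.0, while the third reference token is only partly covered, which pulls recall below 1.0.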

In [None]:
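# 3. Code-Mixing Coverage (CMC)
from langdetect import DetectorFactory

# langdetect is randomized by default; fix the seed for repeatable scores
DetectorFactory.seed = 0

print("🌍 Computing Code-Mixing Coverage...")

def detect_language_distribution(text):
    """Estimate each detected language's share of whitespace-separated tokens."""
    if not text:
        return {}
    
    # Note: an unspaced Chinese run counts as a single token here
    words = text.split()
    lang_counts = Counter()
    total = 0
    
    for word in words:
        if len(word) < 2:
            continue
        try:
            langs = detect_langs(word)
            if langs:
                lang_counts[langs[0].lang] += 1
                total += 1
        except LangDetectException:
            # Raised when a token has no usable features (digits, punctuation)
            continue
    
    if total == 0:
        return {}
    
    return {lang: count / total for lang, count in lang_counts.items()}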

cmc_scores = []

for pred, thread in zip(predictions, threads):
    if not pred or not thread:
        cmc_scores.append(0.5)
        continue
    
    thread_langs = detect_language_distribution(thread)
    pred_langs = detect_language_distribution(pred)
    
    if not thread_langs or not pred_langs:
        cmc_scores.append(0.5)
        continue
    
    all_langs = set(list(thread_langs.keys()) + list(pred_langs.keys()))
    ratio_diff = sum(abs(thread_langs.get(l, 0) - pred_langs.get(l, 0)) for l in all_langs)
    cmc = max(0.0, 1.0 - (ratio_diff / 2.0))
    cmc_scores.append(cmc)

cmc_results = {
    'code_mixing_coverage': np.mean(cmc_scores)
}

print("✅ CMC computed")
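
The CMC score computed above is one minus half the L1 distance between the thread's and the prediction's language distributions, so 1.0 means an identical language mix and 0.0 means no shared language. A small worked example with hand-specified distributions (not langdetect output) illustrates the formula; `cmc_from_distributions` is a hypothetical helper name.

```python
def cmc_from_distributions(thread_langs, pred_langs):
    """1 - 0.5 * sum(|p - q|) over the union of languages, floored at 0."""
    all_langs = set(thread_langs) | set(pred_langs)
    l1 = sum(
        abs(thread_langs.get(lang, 0.0) - pred_langs.get(lang, 0.0))
        for lang in all_langs
    )
    return max(0.0, 1.0 - l1 / 2.0)

thread_mix = {"zh-cn": 0.5, "en": 0.5}  # evenly code-mixed thread
same_mix = {"zh-cn": 0.5, "en": 0.5}    # summary keeps the same mix
english_only = {"en": 1.0}              # summary collapses to English

cmc_same = cmc_from_distributions(thread_mix, same_mix)
cmc_collapsed = cmc_from_distributions(thread_mix, english_only)
print(f"Same mix: {cmc_same:.2f}")
print(f"English only: {cmc_collapsed:.2f}")
```

An English-only summary of an evenly mixed thread scores 0.5, which is also the fallback value the notebook assigns when language detection returns nothing. The two cases are therefore indistinguishable in the averaged score.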

## Step 10: Display Results

In [None]:
# Combine all scores
all_scores = {**rouge_results, **bertscore_results, **cmc_results}

# Display
print("\n" + "="*60)
print("EVALUATION RESULTS - mBART Fine-tuned")
print("="*60)

print("\n📊 Content Selection (ROUGE):")
print(f"  ROUGE-1 F1      : {all_scores['rouge1_f1']:.4f}")
print(f"  ROUGE-2 F1      : {all_scores['rouge2_f1']:.4f}")
print(f"  ROUGE-L F1      : {all_scores['rougeL_f1']:.4f}")

print("\n🧠 Semantic Similarity (BERTScore):")
print(f"  Precision       : {all_scores['bertscore_precision']:.4f}")
print(f"  Recall          : {all_scores['bertscore_recall']:.4f}")
print(f"  F1              : {all_scores['bertscore_f1']:.4f}")

print("\n🌍 Bilingual Faithfulness (Novel Metric):")
print(f"  CMC             : {all_scores['code_mixing_coverage']:.4f}")

print("\n" + "="*60)

# Interpretation
print("\nInterpretation:")
if all_scores['rougeL_f1'] > 0.30:
    print("  ✅ ROUGE-L > 0.30: Good content selection")
else:
    print("  ⚠️ ROUGE-L < 0.30: Below target")

if all_scores['bertscore_f1'] > 0.70:
    print("  ✅ BERTScore > 0.70: Strong semantic match")
else:
    print("  ⚠️ BERTScore < 0.70: Below target")

if all_scores['code_mixing_coverage'] > 0.70:
    print("  ✅ CMC > 0.70: Good language preservation")
else:
    print("  ⚠️ CMC < 0.70: Language collapse detected")

print("="*60)

## Step 11: Show Example Predictions

In [None]:
print("\n" + "="*70)
print("EXAMPLE PREDICTIONS")
print("="*70)

for i in range(min(3, len(predictions_list))):
    example = predictions_list[i]
    print(f"\nExample {i+1}:")
    print("-" * 70)
    print(f"Thread (first 150 chars): {example['thread'][:150]}...")
    print(f"\nPrediction: {example['prediction']}")
    print(f"\nReference:  {example['reference']}")
    print("-" * 70)

## Step 12: Save Results for Download
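
The predictions file written in the cell below has one JSON object per line, with `thread_id` and `prediction` keys. A minimal sketch of reading such a file back and checking it before submission follows. It writes a tiny stand-in file to a temporary directory, and the helper name and sample rows are illustrative.

```python
import json
import os
import tempfile

def read_predictions(path):
    """Load a JSONL predictions file into a {thread_id: prediction} dict."""
    preds = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                preds[row["thread_id"]] = row["prediction"]
    return preds

# Stand-in rows; the real file comes from predictions_list
rows = [
    {"thread_id": "demo-001", "prediction": "They meet at 3pm."},
    {"thread_id": "demo-002", "prediction": "订单 delayed until Friday."},
]
path = os.path.join(tempfile.mkdtemp(), "demo_predictions.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

loaded = read_predictions(path)
empty = sum(1 for text in loaded.values() if not text)
print(f"{len(loaded)} predictions, {empty} empty")
```

Writing with `ensure_ascii=False` and an explicit UTF-8 encoding keeps Chinese characters readable in the file instead of `\uXXXX` escapes.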

In [None]:
# Save predictions
with open('mbart_predictions.jsonl', 'w', encoding='utf-8') as f:
    for item in predictions_list:
        f.write(json.dumps({
            'thread_id': item['thread_id'],
            'prediction': item['prediction']
        }, ensure_ascii=False) + '\n')

print("✅ Saved: mbart_predictions.jsonl")

# Save scores
with open('mbart_scores.json', 'w', encoding='utf-8') as f:
    json.dump(all_scores, f, indent=2)

print("✅ Saved: mbart_scores.json")

# Save detailed results with references
with open('mbart_detailed_results.json', 'w', encoding='utf-8') as f:
    json.dump(predictions_list, f, indent=2, ensure_ascii=False)

print("✅ Saved: mbart_detailed_results.json")

print("\n📥 Download these files:")
print("  1. mbart_predictions.jsonl (for submission)")
print("  2. mbart_scores.json (metrics)")
print("  3. mbart_detailed_results.json (full results)")
print("\nClick the folder icon → Right-click each file → Download")

## Step 13: Create Summary Report

In [None]:
summary_report = f"""
{'='*60}
MILESTONE 2 - STRONG BASELINE SUMMARY
{'='*60}

Model: mBART-large-50 (611M parameters)
Dataset: CS-Sum (Chinese-English code-mixed)
Training: {len(train_data)} examples, 3 epochs
Test Set: {len(test_data)} examples

RESULTS:
{'-'*60}
ROUGE-1 F1      : {all_scores['rouge1_f1']:.4f}
ROUGE-2 F1      : {all_scores['rouge2_f1']:.4f}
ROUGE-L F1      : {all_scores['rougeL_f1']:.4f}

BERTScore P     : {all_scores['bertscore_precision']:.4f}
BERTScore R     : {all_scores['bertscore_recall']:.4f}
BERTScore F1    : {all_scores['bertscore_f1']:.4f}

CMC             : {all_scores['code_mixing_coverage']:.4f}
{'-'*60}

KEY FINDING:
Fine-tuned mBART achieves ROUGE-L = {all_scores['rougeL_f1']:.2f}
but shows {'language collapse' if all_scores['code_mixing_coverage'] < 0.70 else 'good language preservation'}
(CMC = {all_scores['code_mixing_coverage']:.2f})

Training completed in: {train_result.metrics['train_runtime']:.0f} seconds
                       ({train_result.metrics['train_runtime']/3600:.1f} hours)

{'='*60}
"""

print(summary_report)

# Save report
with open('training_summary.txt', 'w', encoding='utf-8') as f:
    f.write(summary_report)

print("\n✅ Saved: training_summary.txt")

## 🎉 DONE!

**You now have:**
1. ✅ Trained mBART model
2. ✅ Test predictions (`mbart_predictions.jsonl`)
3. ✅ Evaluation scores (`mbart_scores.json`)
4. ✅ Complete results (`mbart_detailed_results.json`)
5. ✅ Summary report (`training_summary.txt`)

**Next steps:**
1. Download all the files (click folder icon → right-click → download)
2. Use these results in your `strong-baseline.md`
3. Include in your final report

**For your report, you can say (replace each `{...}` placeholder with the value from `training_summary.txt` or `mbart_scores.json`, and check the example counts against the Step 3 output):**
> "We fine-tuned mBART-large-50 on 2,584 Chinese-English code-mixed conversations for 3 epochs (~{train_result.metrics['train_runtime']/3600:.1f} hours on a T4 GPU). On the test set of 325 examples, the model achieved ROUGE-L = {all_scores['rougeL_f1']:.3f}, BERTScore F1 = {all_scores['bertscore_f1']:.3f}, and CMC = {all_scores['code_mixing_coverage']:.3f}."