
# 🧪 AfriHealth-MultiBench Evaluation Notebook

This notebook provides **baseline evaluation scripts** for the AfriHealth-MultiBench project (or any similar multimodal benchmark).
It scores model outputs on three tasks:

- 🗣️ **Speech Recognition (WER)**
- 🌍 **Translation (BLEU, ChrF)**
- ❓ **Question Answering (Exact Match, F1)**

---


In [None]:

# ================================================================
# 1. Setup & Imports
# ================================================================

# %pip installs into the environment of the running kernel
%pip install jiwer sacrebleu evaluate pandas numpy tqdm --quiet

import pandas as pd
import numpy as np
from tqdm import tqdm
from jiwer import wer
import sacrebleu
import evaluate

# Load Hugging Face SQuAD evaluation metric
qa_metric = evaluate.load("squad")

print("✅ Environment ready.")


In [None]:

# ================================================================
# 2. Load Predictions & References
# ================================================================

# Expected CSV columns (comma-separated, with a header row):
# ASR: id, reference, hypothesis
# MT:  id, source, reference, prediction
# QA:  id, context, question, reference, prediction

# Example file paths (replace with your data paths)
asr_df = pd.read_csv("data/asr_results.csv")
mt_df = pd.read_csv("data/translation_results.csv")
qa_df = pd.read_csv("data/qa_results.csv")

print("ASR:", asr_df.shape)
print("MT:", mt_df.shape)
print("QA:", qa_df.shape)
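Before scoring, it can help to confirm that each file actually contains the columns listed above. The following is a minimal sketch, not part of the original notebook: the helper name `missing_columns` and the `REQUIRED_COLUMNS` mapping are hypothetical and simply restate the expected format.

```python
# Hypothetical helper (not part of the original notebook): report which
# expected columns are absent from a results table.
REQUIRED_COLUMNS = {
    "ASR": ["id", "reference", "hypothesis"],
    "MT": ["id", "source", "reference", "prediction"],
    "QA": ["id", "context", "question", "reference", "prediction"],
}


def missing_columns(task, columns):
    """Return the required columns for `task` that are not in `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS[task] if name not in present]


# Example with plain lists of column names.
print(missing_columns("ASR", ["id", "reference", "hypothesis"]))  # []
print(missing_columns("MT", ["id", "reference", "prediction"]))   # ['source']
```

In the notebook, one could pass `asr_df.columns`, `mt_df.columns`, and `qa_df.columns` and stop early if any returned list is non-empty.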


In [None]:

# ================================================================
# 3. Evaluation Functions
# ================================================================

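# 3a. Word Error Rate (WER)
def evaluate_asr(df):
    # Corpus-level WER: total word errors divided by total reference words.
    # Averaging per-utterance WERs instead would over-weight short utterances.
    references = df["reference"].tolist()
    hypotheses = df["hypothesis"].tolist()
    corpus_wer = wer(references, hypotheses)
    return {"WER": round(corpus_wer * 100, 2)}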


# 3b. Translation (BLEU, ChrF)
def evaluate_translation(df):
    predictions = df["prediction"].tolist()
    # sacrebleu expects a list of reference streams, each aligned one-to-one
    # with the predictions. With one reference per segment, that is a single
    # stream containing every reference.
    references = [df["reference"].tolist()]

    bleu = sacrebleu.corpus_bleu(predictions, references).score
    chrf = sacrebleu.corpus_chrf(predictions, references).score

    return {"BLEU": round(bleu, 2), "ChrF": round(chrf, 2)}


# 3c. Question Answering (Exact Match, F1)
def evaluate_qa(df):
    predictions = [{"id": str(r["id"]), "prediction_text": r["prediction"]} for _, r in df.iterrows()]
    references = [{"id": str(r["id"]), "answers": {"text": [r["reference"]], "answer_start": [0]}} for _, r in df.iterrows()]
    
    results = qa_metric.compute(predictions=predictions, references=references)
    return {"ExactMatch": round(results["exact_match"], 2), "F1": round(results["f1"], 2)}


In [None]:

# ================================================================
# 4. Run Evaluation
# ================================================================

asr_scores = evaluate_asr(asr_df)
mt_scores = evaluate_translation(mt_df)
qa_scores = evaluate_qa(qa_df)

# Combine into summary
summary = pd.DataFrame([asr_scores, mt_scores, qa_scores], index=["ASR", "MT", "QA"])

print("\n📊 Evaluation Summary:\n")
display(summary)

# Save results
summary.to_csv("evaluation_summary.csv", index=True)
print("\n✅ Saved evaluation summary to 'evaluation_summary.csv'")



---

## 📄 Example Input Format

**ASR**
```csv
id,reference,hypothesis
1,thank you for calling,thank you for calling
2,how are you doing today,how are doing today
```
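To make the WER figure concrete, the two ASR rows above can be scored by hand. The sketch below is an illustration only, assuming whitespace tokenization and a hypothetical helper `word_edit_distance`; it is not how `jiwer` is implemented. It also shows that pooling errors over the whole corpus and averaging per-utterance rates give different numbers.

```python
def word_edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


pairs = [
    ("thank you for calling", "thank you for calling"),
    ("how are you doing today", "how are doing today"),
]

errors = [word_edit_distance(r, h) for r, h in pairs]
ref_lengths = [len(r.split()) for r, _ in pairs]

# Pooled over the corpus: total errors / total reference words.
corpus_wer = sum(errors) / sum(ref_lengths)
# Mean of the per-utterance error rates.
mean_utterance_wer = sum(e / n for e, n in zip(errors, ref_lengths)) / len(pairs)

print(errors)                              # [0, 1]
print(round(corpus_wer * 100, 2))          # 11.11
print(round(mean_utterance_wer * 100, 2))  # 10.0
```

The second hypothesis drops one word out of five, so the pooled rate is 1/9 while the per-utterance mean is (0 + 1/5) / 2.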

**Translation**
```csv
id,source,reference,prediction
1,bonjour,hello,hi
2,merci beaucoup,thank you very much,thanks a lot
```
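`sacrebleu`'s corpus-level functions take references as a list of reference streams, where each stream holds one string per hypothesis. If a dataset stores references per segment instead, they need to be transposed first. A minimal sketch, assuming a hypothetical helper `to_reference_streams` and, in the second call, invented alternative references used purely for illustration:

```python
def to_reference_streams(refs_per_segment):
    """Transpose per-segment reference lists into per-stream lists.

    Input:  [[seg1_refA, seg1_refB], [seg2_refA, seg2_refB], ...]
    Output: [[seg1_refA, seg2_refA, ...], [seg1_refB, seg2_refB, ...]]
    """
    counts = {len(refs) for refs in refs_per_segment}
    if len(counts) > 1:
        raise ValueError("every segment needs the same number of references")
    return [list(stream) for stream in zip(*refs_per_segment)]


# One reference per segment, as in the example CSV.
single = to_reference_streams([["hello"], ["thank you very much"]])
print(single)  # [['hello', 'thank you very much']]

# Two references per segment (the second ones are invented for illustration).
double = to_reference_streams([["hello", "hi there"],
                               ["thank you very much", "many thanks"]])
print(double)  # [['hello', 'thank you very much'], ['hi there', 'many thanks']]
```

Either result has the layout that `sacrebleu.corpus_bleu(predictions, streams)` and `sacrebleu.corpus_chrf(predictions, streams)` accept as their second argument.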

**Question Answering**
```csv
id,context,question,reference,prediction
1,The capital of Nigeria is Abuja,What is the capital of Nigeria?,Abuja,Lagos
```
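For intuition about the QA scores, the sketch below approximates SQuAD-style Exact Match and token-level F1 for a single prediction. It is an illustration under stated assumptions, not the code behind the `squad` metric: the function names are hypothetical, scores are returned as fractions in [0, 1] (the `squad` metric reports percentages), and the normalization strips English articles, which may not suit other languages.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Lagos", "Abuja"), token_f1("Lagos", "Abuja"))  # 0.0 0.0
print(exact_match("the city of Abuja", "Abuja"))                  # 0.0
print(round(token_f1("the city of Abuja", "Abuja"), 2))           # 0.5
```

The example row (`Lagos` against `Abuja`) scores zero on both metrics, while a verbose but correct answer fails Exact Match and still earns partial F1 credit.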

---


