# 🩺 HealthTequity Voice Processing Pipeline

This notebook demonstrates an end-to-end **Voice-to-Insight** pipeline for the HealthTequity case study. It processes Spanish medical speech through transcription, translation, GPT-based analysis, and **automatic ASR evaluation** (WER, CER, SER).

---

### 📘 Companion Notebooks
Two supplementary notebooks support data generation for this pipeline:

| Notebook | Purpose | Output Folder |
|-----------|---------|---------------|
| **Synthetic Blood Pressure Generator** | Creates a 30-day synthetic blood pressure dataset | `data/synthetic_csv/` |
| **Spanish Audio Generator** | Generates Spanish-language health questions from the dataset | `data/Spanish_audio/` |

> Reviewers may also upload their own CSVs and audio recordings. To compute WER/CER/SER, provide **ground-truth transcriptions** for the input audio.
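
The input-side loader later in this notebook looks for `data/synthetic_csv/ground_truth.csv` with the columns `audio_file` and `ground_truth_text`. The sketch below shows one way to produce a file in that shape using only the standard library. The helper name `write_ground_truth`, the file names, and the transcripts are hypothetical placeholders, not part of the pipeline.

```python
import csv
from pathlib import Path


def write_ground_truth(rows, csv_path):
    """Write (audio_file, ground_truth_text) pairs to a CSV with the expected header."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings; UTF-8 keeps accented characters intact
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_file", "ground_truth_text"])
        writer.writerows(rows)
    return csv_path


# Hypothetical file names and transcripts, for illustration only
example_rows = [
    ("question_01.wav", "¿Cuál fue mi presión arterial ayer?"),
    ("question_02.wav", "¿Cuál es mi presión sistólica promedio?"),
]
```

Calling `write_ground_truth(example_rows, CSV_DIR / "ground_truth.csv")` would place the file where the loader expects it, provided each `audio_file` value matches the name of a `.wav` file in `data/Spanish_audio/`.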

---

### ⚙️ Pipeline Overview

| Component | Description | Technology |
|-----------|-------------|------------|
| **ASR (Speech-to-Text)** | Transcribes Spanish medical speech to text | Whisper |
| **Translation** | Translates Spanish text → English | GPT-4o-mini |
| **LLM Analysis** | Answers questions from the BP CSV | GPT-4o-mini |
| **(Optional) TTS** | Synthesizes Spanish answers to audio | Any TTS provider |
| **ASR Evaluation** | Computes WER/CER/SER for input & output | Whisper + JiWER |

---

### 💡 Notes for Reviewers
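- The default Whisper model is **"base"**; change `WHISPER_MODEL_SIZE` in the configuration cell to use another size (`"tiny"`, `"small"`, `"medium"`, or `"large"`).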
- This notebook auto-creates required folders under the project root.
- Results are saved as CSVs and a summary chart in `results/`.

---

### 🔗 Resources
- **GitHub Repository:** [HealthTequity-LLM](https://github.com/nattaran/HealthTequity-LLM)
- **Data Folders:**
  - `data/synthetic_csv/` → blood pressure datasets
  - `data/Spanish_audio/` → input audio files
  - `results/` → outputs, evaluations, and charts

> Designed for reproducibility in Google Colab.


In [ ]:
# ==========================================================
# 🚀 Quick Start: run the full pipeline in one call
# ==========================================================
# Estimated runtime: ~2-6 minutes with Whisper 'base' (depends on audio length)

from pipeline.main import run_full_pipeline
from pipeline.config import CSV_DIR, AUDIO_DIR

csv_path = CSV_DIR / "synthetic_bp_one_person.csv"
run_full_pipeline(csv_path, AUDIO_DIR)


In [ ]:
# ==========================================================
# (Optional) Mount Google Drive in Colab
# ==========================================================
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)


In [ ]:
# ==========================================================
# Whisper model configuration (adjustable)
# ==========================================================
# Options: "tiny", "base", "small", "medium", "large"
WHISPER_MODEL_SIZE = "base"  # reviewers can change this

# Ensure core folders exist (pipeline.config already does this at import time; repeated here as a safeguard)
from pipeline.config import BASE_DIR, LLM_OUT, EVAL_DIR, TTS_DIR
BASE_DIR.mkdir(parents=True, exist_ok=True)
LLM_OUT.mkdir(parents=True, exist_ok=True)
EVAL_DIR.mkdir(parents=True, exist_ok=True)
TTS_DIR.mkdir(parents=True, exist_ok=True)
print("Project root:", BASE_DIR)


In [ ]:
# ==========================================================
# Imports for evaluation and transcription
# ==========================================================
import pandas as pd
from pathlib import Path
import whisper
from pipeline.asr_module import transcribe_spanish_audio
from pipeline.evaluation_module import evaluate_asr_pair
from pipeline.config import CSV_DIR, AUDIO_DIR
print("Imports ready.")


## ⬛ Step 5 & 6 — Unified ASR Evaluation (Input & Output)

This section evaluates ASR quality for:
- **Input ASR**: original Spanish audio in `data/Spanish_audio/` vs. ground-truth Spanish text
- **Output ASR**: generated Spanish TTS in `results/tts_audio/` vs. Spanish answers from LLM results

Both use the same evaluation function and save WER/CER/SER per sample to CSV.
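
As a rough illustration of what the three metrics measure, the sketch below computes them in plain Python from a Levenshtein edit distance. It is not the code inside `evaluate_asr_pair` or JiWER, whose text normalization and aggregation may differ. The names `edit_distance`, `wer`, `cer`, and `ser` are hypothetical helpers for this example, and SER is assumed here to be the fraction of samples whose hypothesis is not an exact match of its reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def ser(references, hypotheses):
    """Sentence error rate: share of samples that are not an exact match."""
    if not references or len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be non-empty and equal in length")
    errors = sum(r != h for r, h in zip(references, hypotheses))
    return errors / len(references)


reference = "cuál fue mi presión arterial ayer"
hypothesis = "cuál fue mi presión arterial hoy"
print(f"WER: {wer(reference, hypothesis):.3f}")  # one substituted word out of six
print(f"CER: {cer(reference, hypothesis):.3f}")
print(f"SER: {ser([reference], [hypothesis]):.3f}")
```

No casing or punctuation normalization is applied in this sketch, so scores from a real evaluation library can differ for the same pair of strings.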


In [ ]:
# Helper: load ground-truth for INPUT side
# Looks for CSV at data/synthetic_csv/ground_truth.csv with columns:
#   audio_file, ground_truth_text
def load_input_ground_truth(csv_dir: Path = CSV_DIR):
    gt_csv = csv_dir / "ground_truth.csv"
    if not gt_csv.exists():
        print(f"⚠️ Ground-truth CSV not found: {gt_csv}\nCreate it with columns: audio_file, ground_truth_text")
        return None
    df = pd.read_csv(gt_csv)
    if not {"audio_file", "ground_truth_text"}.issubset(df.columns):
        raise ValueError("ground_truth.csv must include columns: audio_file, ground_truth_text")
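    # Return texts in the sorted order of the .wav files in AUDIO_DIR,
    # which is the same order evaluate_folder_against_texts transcribes them in
    audio_files = sorted(p.name for p in AUDIO_DIR.glob('*.wav'))
    mapping = dict(zip(df["audio_file"], df["ground_truth_text"]))
    missing = [name for name in audio_files if name not in mapping]
    if missing:
        print(f"⚠️ No ground truth for: {missing} (an empty reference will be used)")
    gt_texts = [mapping.get(name, "") for name in audio_files]
    return audio_files, gt_texts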


In [ ]:
# Core evaluation routine reused for input and output
def evaluate_folder_against_texts(audio_folder: Path, ground_truth_texts, label: str, out_csv: Path,
                                  whisper_model_size: str = WHISPER_MODEL_SIZE):
    audio_paths = sorted(audio_folder.glob('*.wav'))
    if not audio_paths:
        print(f"⚠️ No .wav files found in {audio_folder}")
        return None
    if ground_truth_texts is None:
        print("⚠️ No ground-truth texts provided; skipping evaluation.")
        return None
    if len(ground_truth_texts) != len(audio_paths):
        # References and hypotheses are paired by position, so unequal counts cannot be scored reliably
        print(f"⚠️ Mismatch: {len(ground_truth_texts)} GT vs {len(audio_paths)} audio files; skipping evaluation.")
        return None
    # Load the model only after the inputs are validated, since loading is slow
    model = whisper.load_model(whisper_model_size)
    # Transcribe each audio
    hyp_texts = []
    for p in audio_paths:
        text, _ = transcribe_spanish_audio(model, p)
        hyp_texts.append(text)
    # Evaluate
    df = evaluate_asr_pair(ground_truth_texts, hyp_texts, label=label, output_csv=out_csv)
    return df


In [ ]:
# ----------------------------------------------------------
# Input ASR Evaluation (AUDIO_DIR vs. ground_truth.csv)
# ----------------------------------------------------------
from pipeline.config import EVAL_DIR
input_csv_path = EVAL_DIR / "input_asr_results.csv"
loaded = load_input_ground_truth()
if loaded is not None:
    audio_names, gt_input_texts = loaded
    _ = evaluate_folder_against_texts(
        audio_folder=AUDIO_DIR,
        ground_truth_texts=gt_input_texts,
        label="Input ASR Evaluation",
        out_csv=input_csv_path,
    )


In [ ]:
# ----------------------------------------------------------
# Output ASR Evaluation (TTS_DIR vs. Spanish answers from LLM)
# ----------------------------------------------------------
from pipeline.config import TTS_DIR, LLM_OUT
gt_output_texts = None
final_results_csv = LLM_OUT / "final_pipeline_results.csv"
fallback_results_csv = LLM_OUT / "llm_results.csv"

# Use the first results file that exists and contains a "spanish_answer" column
for results_csv in (final_results_csv, fallback_results_csv):
    if not results_csv.exists():
        continue
    df_res = pd.read_csv(results_csv)
    if "spanish_answer" in df_res.columns:
        gt_output_texts = df_res["spanish_answer"].astype(str).tolist()
        break

output_csv_path = EVAL_DIR / "output_asr_results.csv"
if gt_output_texts is not None:
    _ = evaluate_folder_against_texts(
        audio_folder=TTS_DIR,
        ground_truth_texts=gt_output_texts,
        label="Output ASR Evaluation",
        out_csv=output_csv_path,
    )
else:
    print("⚠️ Could not locate Spanish answers in LLM outputs; skipping Output ASR evaluation.")


## 📈 Step 7 — Summaries & Visualization

We load both CSVs, print a quick summary, and plot a simple bar chart comparing mean WER/CER/SER for Input vs Output.


In [ ]:
import matplotlib.pyplot as plt
from pipeline.config import EVAL_DIR

def load_eval(path: Path):
    if not path.exists():
        print(f"⚠️ Missing: {path}")
        return None
    return pd.read_csv(path)

inp = load_eval(EVAL_DIR / "input_asr_results.csv")
outp = load_eval(EVAL_DIR / "output_asr_results.csv")

if inp is not None:
    print("\nInput ASR (first 5 rows):")
    display(inp.head())
if outp is not None:
    print("\nOutput ASR (first 5 rows):")
    display(outp.head())

if (inp is not None) and (outp is not None):
    inp_mean = inp[["WER","CER","SER"]].mean()
    out_mean = outp[["WER","CER","SER"]].mean()
    metrics_df = pd.DataFrame({"Input ASR": inp_mean, "Output ASR": out_mean})
    ax = metrics_df.plot(kind="bar", figsize=(8,5))
    ax.set_title("ASR Performance Comparison (Input vs Output)")
    ax.set_ylabel("Error Rate")
    ax.set_xticklabels(["WER","CER","SER"], rotation=0)
    ax.grid(axis="y", linestyle="--", alpha=0.7)
    plt.tight_layout()
    plot_path = EVAL_DIR / "asr_metrics_plot.png"
    plt.savefig(plot_path, dpi=300)
    plt.show()
    print(f"\n🖼️ Chart saved to: {plot_path}")


## 🗂️ Step 8 — Results Summary

Key output files generated by this pipeline (download from the `results/` folders):


In [ ]:
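from pipeline.config import EVAL_DIR, LLM_OUT

# Report each expected output and whether it actually exists on disk
outputs = {
    "Input ASR Evaluation CSV": EVAL_DIR / "input_asr_results.csv",
    "Output ASR Evaluation CSV": EVAL_DIR / "output_asr_results.csv",
    "LLM Q&A Results CSV": LLM_OUT / "llm_results.csv",
    "ASR Metrics Visualization": EVAL_DIR / "asr_metrics_plot.png",
}
print("\n📂 Results Directory Summary:")
print("─"*40)
for name, path in outputs.items():
    label = f"{name}:"
    status = "✅" if path.exists() else "⚠️ missing"
    print(f"{label:<27}{path}  {status}")
if all(path.exists() for path in outputs.values()):
    print("\n✅ All results saved successfully!")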
