<a href="https://colab.research.google.com/github/nattaran/HealthTequity-LLM/blob/main/HealthTequity_VoicePipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🩺 HealthTequity Voice Processing Pipeline

## 📘 Introduction
This notebook demonstrates an end-to-end **Voice-to-Insight** pipeline developed for the **HealthTequity Case Study**.  
It processes Spanish medical speech into actionable insights through **transcription**, **translation**, **LLM-based reasoning**, and **automated evaluation** using **WER**, **CER**, and **SER** metrics.  

The system connects **speech understanding (ASR)** with **data analytics (LLM)** and **speech synthesis (TTS)**, forming a reproducible and modular workflow for healthcare data analysis.

---

### 🧩 Companion Notebooks
Two supporting notebooks generate the datasets used in this pipeline:

| Notebook | Purpose | Output Folder |
|-----------|----------|---------------|
| **Synthetic Blood Pressure Generator** | Creates a 30-day synthetic blood pressure dataset | `data/synthetic_csv/` |
| **Spanish Audio Generator** | Produces Spanish-language health questions from the dataset | `data/Spanish_audio/` |

> 🔹 *Reviewers may also upload their own CSVs and audio recordings.*  
> To compute **WER/CER/SER**, ensure a `ground_truth.csv` file exists with reference text for each input audio file.
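
As a minimal sketch, a `ground_truth.csv` can be written with Python's standard `csv` module. The column names `audio_file` and `ground_truth_text` are the ones the evaluation function in Step 6 checks for. The helper name `write_ground_truth` and the reference sentences below are hypothetical placeholders, not part of the project.

```python
import csv
from pathlib import Path

def write_ground_truth(rows, csv_path):
    """Write (audio_file, ground_truth_text) pairs to a CSV file."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_file", "ground_truth_text"])
        writer.writerows(rows)
    return csv_path

# Placeholder rows for illustration only
example_rows = [
    ("question_1_es.wav", "¿Cuál fue mi presión arterial ayer?"),
    ("question_2_es.wav", "¿Cuál es mi presión promedio esta semana?"),
]
# e.g. write_ground_truth(example_rows, "data/synthetic_csv/ground_truth.csv")
```

Each `audio_file` value should match a file name in `data/Spanish_audio/` exactly, since the evaluation looks files up by name.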

---

### ⚙️ Pipeline Overview

| Step | Component | Description | Output |
|------|------------|-------------|---------|
| **1️⃣** | **ASR (Whisper)** | Transcribes Spanish medical audio into text | Spanish transcript |
| **2️⃣** | **Translation (GPT)** | Translates Spanish → English for question understanding | `audio_translations.csv` |
| **3️⃣** | **ASR Evaluation (Input)** | Compares transcriptions to ground truth → computes **WER**, **CER**, **SER** | `input_asr_metrics.csv` |
| **4️⃣** | **LLM Analysis (GPT-4o-mini)** | Answers English questions using the blood pressure CSV context | `final_pipeline_results.csv` |
| **5️⃣** | **Spanish TTS (gTTS)** | Translates English answers into Spanish and synthesizes them as speech | `tts_audio/*.wav` |
| **6️⃣** | **ASR Evaluation (Output)** | Evaluates TTS intelligibility via Whisper ASR (WER/CER/SER) | `output_asr_metrics.csv` |
| **7️⃣** | **Visualization** | Plots a bar chart comparing input vs. output ASR metrics | `asr_comparison_chart.png` |

---

### 🧠 Translation Toggle

This pipeline can use either **Whisper** or **GPT** for Spanish-to-English translation:

| Mode | How to Enable | Description |
|------|----------------|-------------|
| 🧩 **GPT Translation (Recommended)** | `USE_GPT_TRANSLATION = True` | Uses GPT-4o-mini for domain-accurate translations (preserves medical terms and numbers). Requires OpenAI API key. |
| ⚙️ **Whisper-Only Mode (Free)** | `USE_GPT_TRANSLATION = False` | Uses Whisper’s built-in `task="translate"` mode. Works offline but may reduce translation accuracy. |

You can change this toggle at the top of the ASR module.
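
The toggle amounts to one branch per audio file. The sketch below uses stand-in callables so it runs without Whisper or an API key. The function `translate_question` and the parameters `gpt_translate` and `whisper_translate` are hypothetical names chosen for illustration.

```python
from typing import Callable

def translate_question(
    audio_path: str,
    spanish_text: str,
    use_gpt: bool,
    gpt_translate: Callable[[str], str],
    whisper_translate: Callable[[str], str],
) -> str:
    """Pick the translation route based on the toggle."""
    if use_gpt:
        # GPT route: translates the Spanish transcript (text in, text out)
        return gpt_translate(spanish_text)
    # Whisper route: translates directly from the audio file
    return whisper_translate(audio_path)

# Stand-ins that only label which route was taken
fake_gpt = lambda text: f"[gpt] {text}"
fake_whisper = lambda path: f"[whisper] {path}"

print(translate_question("q1.wav", "¿Cómo estoy?", True, fake_gpt, fake_whisper))
print(translate_question("q1.wav", "¿Cómo estoy?", False, fake_gpt, fake_whisper))
```

One consequence of this design: the GPT route inherits any errors in the Spanish transcript, while the Whisper route works from the audio itself.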

---

### 📊 Evaluation Metrics

| Metric | Description | Interpretation |
|---------|-------------|----------------|
| **WER (Word Error Rate)** | Measures overall transcription accuracy (substitutions, deletions, insertions) | Lower = better |
| **CER (Character Error Rate)** | Captures fine-grained textual differences | Sensitive to short utterances |
| **SER (Sentence Error Rate)** | Binary metric — whether a sentence matches exactly | 0 = perfect, 1 = any error |

All metrics are computed for both **input** (Spanish audio → text) and **output** (Spanish TTS → text) stages.
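
To make the three definitions concrete, here is a dependency-free sketch that computes them for one sentence pair. It is an illustration only: the notebook itself relies on `jiwer` and the `Levenshtein` package, and the names `edit_distance` and `asr_metrics` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def asr_metrics(reference: str, hypothesis: str) -> dict:
    """Compute WER, CER, and SER for one already-normalized sentence pair."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    wer = edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
    cer = edit_distance(reference, hypothesis) / max(len(reference), 1)
    ser = 0 if reference == hypothesis else 1
    return {"WER": wer, "CER": cer, "SER": ser}

m = asr_metrics("cual es mi presion arterial", "cual es la presion arterial")
print(m)
```

Here one of five reference words is substituted, so WER is 0.2. Only two of 27 characters differ, so CER is much smaller, and SER is 1 because the sentences are not identical.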

---

### 🧰 System Notes
- Default Whisper model is `"base"` but can be changed to `"tiny"`, `"small"`, `"medium"`, or `"large"`.  
- All directories (`data/`, `results/`, etc.) are auto-created by the pipeline.  
- Results and charts are saved under `results/` for easy retrieval.  
- The notebook can run entirely on Google Colab. An OpenAI API key is required for GPT translation and for the LLM analysis step (GPT-4o-mini).

---

### 📂 Folder Structure

```text
/content/drive/MyDrive/HealthTequity-LLM/
│
├── data/
│   ├── synthetic_csv/              ← Synthetic BP data + ground truth
│   │   ├── synthetic_bp_one_person.csv
│   │   └── ground_truth.csv
│   │
│   └── Spanish_audio/              ← Input Spanish audio files
│       ├── question_1_es.wav
│       ├── question_2_es.wav
│       └── ...
│
├── results/
│   ├── llm_outputs/                ← Translations + LLM responses
│   ├── tts_audio/                  ← Spanish audio answers
│   └── evaluation_metrics/         ← WER/CER/SER + chart
│
├── BloodPressure_Generator.ipynb
├── SpanishAudio_Generator.ipynb
└── HealthTequity_VoicePipeline.ipynb
```


## 📁 Step 1 — Mount Google Drive and Sync Project Repository

This step connects your Colab environment to Google Drive and ensures you have the latest version of the **HealthTequity-LLM** project.

**What this does:**
1. Clears stale files left at the mount point (prevents the “mountpoint already contains files” error).
2. Mounts Google Drive at `/content/drive`.
3. Clones the `HealthTequity-LLM` GitHub repository into your Drive if it doesn’t exist,  
   or updates it with the latest version if it already exists.

**After this cell runs successfully:**
- You’ll be working directly from  
  `/content/drive/MyDrive/HealthTequity-LLM`
- All project data, code, and outputs will stay persistent in your Google Drive.
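
Outside a notebook, the clone-or-update decision could be expressed without the `!git` magics. The sketch below only builds the command; the helper name `sync_command` is hypothetical, and `--ff-only` is an added safety flag that the notebook cell does not use.

```python
from pathlib import Path

def sync_command(repo_url: str, repo_path: str) -> list:
    """Return the git command that brings repo_path up to date."""
    if not Path(repo_path).exists():
        # First run: no checkout yet, so clone it
        return ["git", "clone", repo_url, repo_path]
    # -C runs git inside the existing checkout;
    # --ff-only refuses to create merge commits on diverged history
    return ["git", "-C", repo_path, "pull", "--ff-only"]

# To actually run it:
#   import subprocess
#   subprocess.run(sync_command(REPO_URL, REPO_PATH), check=True)
```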


In [None]:
# ==========================================================
# 📁 STEP 1 — Mount Google Drive and Clone/Update Repository
# ==========================================================
from google.colab import drive
import os, shutil

MOUNT_POINT = '/content/drive'
REPO_URL = "https://github.com/nattaran/HealthTequity-LLM.git"
REPO_PATH = f"{MOUNT_POINT}/MyDrive/HealthTequity-LLM"

# --- Clear stale files at the mountpoint to prevent ValueError ---
# Never rmtree a live mount: that would delete the Drive contents themselves.
if os.path.ismount(MOUNT_POINT):
    print(f"ℹ️ Drive already mounted at {MOUNT_POINT}; it will be remounted.")
elif os.path.isdir(MOUNT_POINT) and os.listdir(MOUNT_POINT):
    print(f"⚙️ Clearing stale files at mountpoint: {MOUNT_POINT}")
    try:
        shutil.rmtree(MOUNT_POINT)
        os.makedirs(MOUNT_POINT)
    except Exception as e:
        print(f"⚠️ Warning: Could not fully clear mountpoint: {e}")

# --- Mount Google Drive ---
print("🔗 Mounting Google Drive...")
drive.mount(MOUNT_POINT, force_remount=True)

# --- Clone or update GitHub repo ---
if not os.path.exists(REPO_PATH):
    print(f"📦 Cloning repository into {REPO_PATH}...")
    !git clone {REPO_URL} {REPO_PATH}
else:
    print("🔄 Repository already exists — updating...")
    %cd {REPO_PATH}
    !git fetch origin
    !git pull

%cd {REPO_PATH}
print(f"✅ Environment ready. Working directory: {os.getcwd()}")


🔗 Mounting Google Drive...
Mounted at /content/drive
🔄 Repository already exists — updating...
/content/drive/MyDrive/HealthTequity-LLM
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 12.41 KiB | 181.00 KiB/s, done.
From https://github.com/nattaran/HealthTequity-LLM
   645c368..0dd138f  main       -> origin/main
Updating 645c368..0dd138f
Fast-forward
 HealthTequity_VoicePipeline.ipynb | 1354 +++++++++++++++++++++++--------------
 1 file changed, 837 insertions(+), 517 deletions(-)
/content/drive/MyDrive/HealthTequity-LLM
✅ Environment ready. Working directory: /content/drive/MyDrive/HealthTequity-LLM


## ⚙️ Step 2 — Install Project Dependencies

This step installs all the required Python packages for the HealthTequity-LLM pipeline.

**Notes:**
- Run this cell once per Colab session.
- Dependencies are listed in `requirements.txt` at the repository root.
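- You can add or pin package versions there (e.g., `openai-whisper==20231117`, `jiwer==3.0.2`). Note that the PyPI package named `whisper` is an unrelated project; OpenAI's Whisper is published as `openai-whisper`.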
- Colab may show warnings for already-installed packages — you can safely ignore them.
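
As an alternative to `pip list | grep`, installed versions can be checked from Python with the standard `importlib.metadata` module. This is a sketch: the helper name `installed_versions` is hypothetical, and the distribution names listed are assumptions that should be adjusted to match `requirements.txt`.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names are assumptions; adjust to match requirements.txt
for name, ver in installed_versions(["openai", "openai-whisper", "jiwer", "pandas"]).items():
    print(f"{name:15} {ver or 'NOT INSTALLED'}")
```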


In [None]:

# Install project requirements (run once per session)
# If you prefer to pin versions, ensure requirements.txt includes exact versions.
!pip install -r requirements.txt
# Verify installation of key modules
import sys
print("\n✅ Package installation complete.")
print("Python version:", sys.version)
!pip list | grep -E "openai|whisper|jiwer|pandas|torch|matplotlib"

Collecting git+https://github.com/openai/whisper.git (from -r requirements.txt (line 22))
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-w3s6_5tk
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-w3s6_5tk
  Resolved https://github.com/openai/whisper.git to commit c0d2f624c09dc18e709e37c2ad90c039a4eb72a2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting whisper (from -r requirements.txt (line 20))
  Downloading whisper-1.1.10.tar.gz (42 kB)
     42.8/42.8 kB 2.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting ffmpeg-python>=0.2.0 (from -r requirements.txt (line 25))
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting jiwer>=3.0.

## 📂 Step 3 — Project Paths and Data Dependencies

This step defines the directory structure used by the **HealthTequity Voice Pipeline**  
and ensures all necessary folders exist within your Google Drive.

By default, the companion notebooks produce:
- Spanish question audio files (`.wav`) under `data/Spanish_audio/`
- The corresponding `ground_truth.csv` file under `data/synthetic_csv/`
- A synthetic blood-pressure dataset (`synthetic_bp_one_person.csv`) under `data/synthetic_csv/`

These files are automatically detected when this pipeline runs.

---

### 🧩 Reviewer Flexibility
If you prefer to **test with your own data**, you can:
- Upload custom `.wav` files to `data/Spanish_audio/`
- Upload your own ground truth file to `data/synthetic_csv/ground_truth.csv`
- Upload a new blood pressure dataset to `data/synthetic_csv/`

Your files will be automatically used by the pipeline — no code modification needed.
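
When uploading custom data, it can help to confirm that audio files and reference rows line up before running the evaluation. The following is a sketch using only the standard library. The helper name `check_inputs` is hypothetical; the `audio_file` column name is the one the evaluation function expects.

```python
import csv
from pathlib import Path

def check_inputs(audio_dir, ground_truth_csv):
    """Return (wav files with no reference row, reference rows with no wav file)."""
    wav_names = {p.name for p in Path(audio_dir).glob("*.wav")}
    # utf-8-sig also reads plain UTF-8 and drops a byte-order mark if present
    with Path(ground_truth_csv).open(newline="", encoding="utf-8-sig") as f:
        ref_names = {row["audio_file"] for row in csv.DictReader(f)}
    return sorted(wav_names - ref_names), sorted(ref_names - wav_names)

# e.g. unreferenced, missing = check_inputs(AUDIO_DIR, CSV_DIR / "ground_truth.csv")
```

Audio files without a reference row are never scored, because the evaluation iterates over the rows of the ground truth CSV. Rows whose audio file is missing are skipped with a warning.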

---

### 🗂 Folder Overview

| Folder | Purpose |
|---------|----------|
| `data/synthetic_csv/` | Synthetic or user-provided CSV datasets and ground truth |
| `data/Spanish_audio/` | Input Spanish `.wav` question audio files |
| `results/llm_outputs/` | LLM-generated question–answer CSV outputs |
| `results/evaluation_metrics/` | ASR evaluation metrics (WER, CER, SER) |
| `results/tts_audio/` | Generated Spanish audio (TTS) responses |

All paths are created automatically under:
`/content/drive/MyDrive/HealthTequity-LLM`


In [None]:
# ==========================================================
# 📂 STEP 3 — Project Paths and Data Dependencies
# ==========================================================
from pathlib import Path
import os

# --- Define project root in Google Drive ---
PROJECT_ROOT = Path("/content/drive/MyDrive/HealthTequity-LLM")

# --- Define subdirectories ---
DATA_DIR    = PROJECT_ROOT / "data"
CSV_DIR     = DATA_DIR / "synthetic_csv"
AUDIO_DIR   = DATA_DIR / "Spanish_audio"
RESULTS_DIR = PROJECT_ROOT / "results"
LLM_OUT     = RESULTS_DIR / "llm_outputs"
EVAL_DIR    = RESULTS_DIR / "evaluation_metrics"
TTS_DIR     = RESULTS_DIR / "tts_audio"

# --- Create folders if missing (safe & repeatable) ---
for p in [DATA_DIR, CSV_DIR, AUDIO_DIR, RESULTS_DIR, LLM_OUT, EVAL_DIR, TTS_DIR]:
    p.mkdir(parents=True, exist_ok=True)

# --- Verify generated or uploaded data ---
ground_truth = CSV_DIR / "ground_truth.csv"
audio_files = list(AUDIO_DIR.glob("*.wav"))
bp_csv = CSV_DIR / "synthetic_bp_one_person.csv"

print("✅ Directory structure verified:")
print(f"📁 Project root:        {PROJECT_ROOT}")
print(f"📄 Ground truth file:   {'✅ Found' if ground_truth.exists() else '⚠️ Missing'}")
print(f"🎧 Spanish audio files: {len(audio_files)} found")
print(f"📊 Blood pressure CSV:  {'✅ Found' if bp_csv.exists() else '⚠️ Missing'}")
print(f"📊 Results directory:   {RESULTS_DIR}")
print(f"🧠 LLM outputs:         {LLM_OUT}")
print(f"🧮 Evaluation metrics:  {EVAL_DIR}")
print(f"🔉 TTS audio:           {TTS_DIR}")

if not ground_truth.exists() or not audio_files:
    print("\n⚠️ Note: Run the Audio Generation Notebook or upload your own data to the above folders.")



✅ Directory structure verified:
📁 Project root:        /content/drive/MyDrive/HealthTequity-LLM
📄 Ground truth file:   ✅ Found
🎧 Spanish audio files: 9 found
📊 Blood pressure CSV:  ✅ Found
📊 Results directory:   /content/drive/MyDrive/HealthTequity-LLM/results
🧠 LLM outputs:         /content/drive/MyDrive/HealthTequity-LLM/results/llm_outputs
🧮 Evaluation metrics:  /content/drive/MyDrive/HealthTequity-LLM/results/evaluation_metrics
🔉 TTS audio:           /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio


## 🔑 Step 4 — OpenAI API Key Initialization

This step initializes the OpenAI client for all LLM operations (e.g., translations, question answering).

**How it works:**
- The code first looks for your API key in the Colab environment (`os.environ["OPENAI_API_KEY"]`).
- If no key is found, you’ll be securely prompted to paste it (input remains hidden).
- Once entered, the key is stored in the current runtime session for later API calls.

> 💡 **Security note:**  
> Your key is **not saved permanently** — it will reset when the Colab runtime restarts.  
> Reviewers can safely run this section with their own OpenAI API keys.
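
The lookup order described above (environment first, hidden prompt second) can be sketched as a small function. The name `resolve_api_key` and its injectable `env` and `prompt` parameters are hypothetical, added so the logic can be exercised without a real key. Rejecting an empty key is an extra check that the notebook cell does not perform.

```python
import os
from getpass import getpass

def resolve_api_key(env=os.environ, prompt=getpass) -> str:
    """Return the key from the environment, prompting only when it is absent."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        key = prompt("Paste your OpenAI API key (input hidden): ").strip()
        if not key:
            raise ValueError("No API key provided.")
        env["OPENAI_API_KEY"] = key  # lives only as long as this process
    return key

# Demonstration with a fake environment and a fake prompt (placeholder key)
fake_env = {}
print(resolve_api_key(env=fake_env, prompt=lambda msg: " sk-placeholder "))
```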


In [None]:
# ==========================================================
# 🔑 STEP 4 — OpenAI API Key Initialization
# ==========================================================
import os
from getpass import getpass
from openai import OpenAI

# --- Check if API key exists; if not, prompt user securely ---
if not os.getenv("OPENAI_API_KEY"):
    print("⚠️ OpenAI API key not found in environment.")
    os.environ["OPENAI_API_KEY"] = getpass("🔐 Paste your OpenAI API key (input hidden): ").strip()
else:
    print("✅ Found existing OpenAI API key in environment.")

# --- Initialize client (key validity is only checked on the first API call) ---
try:
    client = OpenAI()
    print("✅ OpenAI client initialized successfully.")
except Exception as e:
    print(f"❌ Failed to initialize OpenAI client: {e}")


⚠️ OpenAI API key not found in environment.
🔐 Paste your OpenAI API key (input hidden): ··········
✅ OpenAI client initialized successfully.


## 🗣️ Step 5 — ASR (Whisper) and Spanish → English Translation

This section runs the **Automatic Speech Recognition (ASR)** and **translation** stage of the pipeline.  
It is used on both sides of the workflow:

1. **Input side** – transcribes Spanish question audio and translates it to English for LLM analysis.  
2. **Output side** – re-transcribes generated Spanish TTS responses for ASR evaluation.

---

### ⚙️ How It Works
- **Whisper ASR** performs transcription directly from audio.  
- **Whisper Translation Mode** (`task="translate"`) produces English text automatically — no API key required.  
- Optionally, reviewers can enable **OpenAI GPT translation** for more fluent medical phrasing.

---

### 🔧 Configuration
You can control translation behavior with this flag:
```python
USE_GPT_TRANSLATION = True  # True: GPT-based translation; False: Whisper's built-in translate mode
```


In [None]:
# ==========================================================
# 🗣️ STEP 5 — ASR (Whisper) and Spanish → English Translation
# ==========================================================
import whisper
import pandas as pd
import os
from pathlib import Path

# --- Toggle for optional GPT translation ---
USE_GPT_TRANSLATION = True   # True: GPT translation (needs API key); False: Whisper only (free)
WHISPER_MODEL_SIZE = "base"   # Adjustable model size ("tiny", "base", "small", "medium", "large")

def transcribe_spanish_audio(model, audio_path: Path):
    """
    Transcribe a single Spanish audio file using Whisper.
    Returns Spanish text and detected language.
    """
    result = model.transcribe(str(audio_path), language="es", task="transcribe", verbose=False)
    return result["text"].strip(), result.get("language", "unknown")

def translate_audio_whisper(model, audio_path: Path):
    """
    Use Whisper’s translation mode to directly translate Spanish → English.
    """
    result = model.transcribe(str(audio_path), task="translate", verbose=False)
    return result["text"].strip()

def translate_spanish_to_english_gpt(spanish_text: str) -> str:
    """
    Optional GPT translation for higher fluency (requires OpenAI API key).
    """
    prompt = (
        "Translate the following Spanish medical transcription into clear, faithful English:\n\n"
        + spanish_text
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content.strip()

def process_and_translate_audio(audio_folder: Path, output_csv: Path,
                                model_size: str = WHISPER_MODEL_SIZE) -> pd.DataFrame:
    """
    Run Whisper ASR on all .wav files in a folder,
    translate Spanish → English (Whisper or GPT),
    and save results to CSV.
    """
    print(f"🎧 Loading Whisper model: {model_size}")
    model = whisper.load_model(model_size)

    audio_files = sorted([f for f in os.listdir(audio_folder) if f.endswith(".wav")])
    if not audio_files:
        print(f"⚠️ No audio files found in {audio_folder}")
        return pd.DataFrame()

    results = []
    print(f"🔍 Processing {len(audio_files)} audio files...")

    for fname in audio_files:
        audio_path = audio_folder / fname
        # 1️⃣ Spanish transcription
        es_text, detected_lang = transcribe_spanish_audio(model, audio_path)
        # 2️⃣ Translation (Whisper → English or GPT optional)
        if USE_GPT_TRANSLATION:
            en_text = translate_spanish_to_english_gpt(es_text)
        else:
            en_text = translate_audio_whisper(model, audio_path)

        results.append({
            "audio_file": fname,
            "spanish_transcription": es_text,
            "english_translation": en_text,
            "language_detected": detected_lang
        })
        print(f"✅ {fname} → processed")

    df = pd.DataFrame(results)
    df.to_csv(output_csv, index=False, encoding="utf-8-sig")
    print(f"\n💾 Results saved to: {output_csv}")
    return df

# Example (do not auto-run):
# trans_csv = LLM_OUT / "audio_translations.csv"
# trans_df = process_and_translate_audio(AUDIO_DIR, trans_csv, model_size="base")



## 🧮 Step 6 — Unified ASR Evaluation (WER / CER / SER)

This section defines a **single reusable evaluation module** for computing ASR performance metrics on any
pair of *ground-truth* and *predicted* transcriptions.

It is used in two contexts:
1. **Input side evaluation** – compares Whisper’s transcribed Spanish questions against the ground-truth CSV.  
2. **Output side evaluation** – compares Whisper’s re-transcription of TTS-generated Spanish responses
   against the LLM-generated ground-truth Spanish text.

---

### 📏 Metrics Computed
| Metric | Description |
|---------|--------------|
| **WER (Word Error Rate)** | Fraction of words incorrectly predicted (insertions + deletions + substitutions) / total words |
| **CER (Character Error Rate)** | Character-level edit distance normalized by text length |
| **SER (Sentence Error Rate)** | Percentage of sentences that are not identical to ground truth |
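
Before scoring, the evaluation normalizes both reference and hypothesis so that case, accents, and punctuation are not counted as errors. The snippet below repeats the notebook's `normalize_text` steps on a made-up sentence to show the effect.

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Lowercase, strip accents, and remove punctuation."""
    text = text.lower()
    # NFD splits accented letters into base letter + combining mark (category Mn)
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("¿Cuál es mi presión arterial mañana?"))
# prints: cual es mi presion arterial manana
```

A side effect worth knowing: `ñ` is reduced to `n`, so words that differ only by that letter are treated as identical.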

---

### ⚙️ How to Use
Call the function below with:
```python
evaluate_asr_whisper(
    gt_csv=Path(...),          # CSV with ground_truth_text
    audio_folder=Path(...),    # folder of .wav files to evaluate
    output_csv=Path(...),      # where to save evaluation results
    model_size="base"          # whisper model size (default)
)
```


In [None]:

# ==========================================================
# 🧩 STEP 6 — UNIFIED ASR EVALUATION FUNCTION (FINAL)
# ==========================================================
import os
import pandas as pd
import whisper
import re, unicodedata, Levenshtein
from jiwer import process_words
from pathlib import Path

def normalize_text(text: str) -> str:
    """Lowercase, strip accents, and remove punctuation."""
    text = text.lower()
    text = ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if unicodedata.category(c) != 'Mn'
    )
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def evaluate_asr_whisper(
    gt_csv: Path,
    audio_folder: Path,
    output_csv: Path,
    model_size: str = "base",
    gt_text_col: str = "ground_truth_text"
):
    """
    Evaluate ASR performance using Whisper and compute WER, CER, SER.
    Works for both input and output sides of the pipeline.

    Parameters
    ----------
    gt_csv : Path
        CSV file containing at least columns [audio_file, ground_truth_text]
    audio_folder : Path
        Directory containing .wav audio files to evaluate
    output_csv : Path
        Output CSV for detailed ASR metrics
    model_size : str, optional
        Whisper model size to load (default: 'base')
        Options: 'tiny', 'base', 'small', 'medium', 'large'
    gt_text_col : str, optional
        Column name for the reference text (default: 'ground_truth_text')
    """
    if not Path(gt_csv).exists():
        raise FileNotFoundError(f"❌ Ground truth CSV not found at: {gt_csv}")
    if not Path(audio_folder).exists():
        raise FileNotFoundError(f"❌ Audio folder not found at: {audio_folder}")

    df_gt = pd.read_csv(gt_csv)
    if "audio_file" not in df_gt.columns or gt_text_col not in df_gt.columns:
        raise ValueError(f"❌ CSV must contain ['audio_file', '{gt_text_col}'] columns.")

    print(f"🎧 Loading Whisper model: {model_size}")
    model = whisper.load_model(model_size)

    results = []
    print(f"🔍 Evaluating {len(df_gt)} audio files...")

    for _, row in df_gt.iterrows():
        audio_name = row["audio_file"]
        gt_text = str(row[gt_text_col]).strip()
        audio_path = Path(audio_folder) / audio_name
        if not audio_path.exists():
            print(f"⚠️ Missing audio file: {audio_name}")
            continue

        # --- Transcribe ---
        result = model.transcribe(str(audio_path), language="es", task="transcribe", verbose=False)
        hyp_text = result["text"].strip()

        # --- Normalize & compute metrics ---
        gt_norm, hyp_norm = normalize_text(gt_text), normalize_text(hyp_text)
        measures = process_words(gt_norm, hyp_norm)
        wer = round(measures.wer, 4)
        cer = round(Levenshtein.distance(gt_norm, hyp_norm) / max(len(gt_norm), 1), 4)
        ser = 0 if gt_norm == hyp_norm else 1

        results.append({
            "audio_file": audio_name,
            "ground_truth": gt_text,
            "whisper_transcription": hyp_text,
            "WER": wer,
            "CER": cer,
            "SER": ser,
            "Substitutions": measures.substitutions,
            "Deletions": measures.deletions,
            "Insertions": measures.insertions
        })
        print(f"✅ {audio_name} → WER={wer}, CER={cer}, SER={ser}")

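    # --- Save results ---
    out_df = pd.DataFrame(results)
    out_df.to_csv(output_csv, index=False, encoding="utf-8-sig")
    print(f"\n💾 ASR metrics saved to: {output_csv}")
    if out_df.empty:
        # No audio file matched the ground truth, so there is nothing to average
        print("⚠️ No audio files were evaluated; averages unavailable.")
        return out_df
    print(f"📊 Average WER={out_df['WER'].mean():.3f}, CER={out_df['CER'].mean():.3f}, SER={out_df['SER'].mean():.3f}")
    return out_df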



## 📊 Step 7 — ASR Metrics Visualization (Input vs Output)

This step visualizes the ASR performance of the pipeline for both the **input** and **output** stages.

**Purpose:**
- The **Input ASR** evaluates how well Whisper transcribes Spanish question audios compared to ground truth.  
- The **Output ASR** evaluates Whisper’s accuracy in re-transcribing the TTS-generated Spanish answers.

The plot shows average **WER (Word Error Rate)**, **CER (Character Error Rate)**, and **SER (Sentence Error Rate)** side-by-side.

**Expected Input Files:**
- `results/evaluation_metrics/input_asr_metrics.csv`
- `results/evaluation_metrics/output_asr_metrics.csv`

**Output:**
- A bar chart saved as `results/evaluation_metrics/asr_comparison_chart.png`
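
The values plotted are plain column averages of the per-file metrics. The sketch below is a dependency-free equivalent that uses `csv` and `statistics` instead of pandas; the helper name `average_metrics` is hypothetical.

```python
import csv
from statistics import mean

def average_metrics(csv_path, metrics=("WER", "CER", "SER")):
    """Average each metric column of a per-file ASR metrics CSV."""
    # utf-8-sig drops the byte-order mark written by the pipeline's CSV export
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return {m: float("nan") for m in metrics}
    return {m: mean(float(r[m]) for r in rows) for m in metrics}

# e.g. average_metrics(EVAL_DIR / "input_asr_metrics.csv")
```

Because SER is 0 or 1 per file, its average is the fraction of utterances with at least one error.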


In [None]:
# ==========================================================
# 📊 STEP 7 — ASR Metrics Visualization (Input vs Output)
# ==========================================================
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

def load_asr_summary(csv_path: Path):
    """Load ASR metrics and compute average WER, CER, SER (NaN if the file is missing)."""
    if not csv_path.exists():
        print(f"⚠️ Missing file: {csv_path}")
        # NaN rather than None, so plotting skips the bar instead of raising
        nan = float("nan")
        return {"WER": nan, "CER": nan, "SER": nan}
    df = pd.read_csv(csv_path)
    return {
        "WER": df["WER"].mean(),
        "CER": df["CER"].mean(),
        "SER": df["SER"].mean()
    }

def plot_asr_comparison(
    input_asr_csv: Path,
    output_asr_csv: Path,
    output_dir: Path
):
    """
    Compare average ASR metrics between input and output sides,
    and save a side-by-side bar chart visualization.

    Parameters
    ----------
    input_asr_csv : Path
        CSV file containing input-side ASR metrics.
    output_asr_csv : Path
        CSV file containing output-side ASR metrics.
    output_dir : Path
        Directory where the visualization will be saved.
    """
    # --- Load average metrics ---
    input_metrics = load_asr_summary(input_asr_csv)
    output_metrics = load_asr_summary(output_asr_csv)

    # --- Prepare data for plotting ---
    metrics = ["WER", "CER", "SER"]
    input_vals = [input_metrics[m] for m in metrics]
    output_vals = [output_metrics[m] for m in metrics]

    # --- Plot chart ---
    fig, ax = plt.subplots(figsize=(7, 5))
    x = range(len(metrics))
    width = 0.35

    ax.bar([i - width/2 for i in x], input_vals, width, label="Input ASR", alpha=0.8)
    ax.bar([i + width/2 for i in x], output_vals, width, label="Output ASR", alpha=0.8)

    ax.set_xticks(x)
    ax.set_xticklabels(metrics)
    ax.set_ylabel("Error Rate")
    ax.set_ylim(0, 1)
    ax.set_title("ASR Performance Comparison — Input vs Output")
    ax.legend()
    ax.grid(axis="y", linestyle="--", alpha=0.6)

    plt.tight_layout()

    # --- Save chart ---
    save_path = Path(output_dir) / "asr_comparison_chart.png"
    plt.savefig(save_path, dpi=300)
    plt.close()

    print(f"📊 ASR comparison chart saved to: {save_path}")
    return save_path



## 🧠 Step 8 — LLM Question Answering (Blood Pressure Analysis)

This step defines the **LLM-based question-answering function** for the HealthTequity Voice Pipeline.  
The model uses the translated English questions from the Spanish audio and answers them using the provided synthetic blood-pressure dataset.

---

### ⚙️ How It Works
- The dataset is passed to the model as a **CSV text block** (context).  
- Each question (translated from Spanish → English) is analyzed in context.  
- The LLM returns a structured **JSON response** containing:
  - `"answer"` — a natural-language English explanation.
  - `"computed_fields"` — optional numeric values or summaries used in reasoning.

---

### 📚 Behavior Rules
- The LLM can only use the dataset for factual answers.  
- It can cite external information **only** when describing “normal” blood pressure ranges.  
- All outputs follow a conversational, user-friendly tone.
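
Chat models sometimes wrap a JSON reply in Markdown code fences, which a plain `json.loads` rejects. The following sketch shows one tolerant way to parse such replies. The helper name `parse_llm_json` is hypothetical, and the fallback shape mirrors the `answer` / `computed_fields` structure described above.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code-fence marker, built indirectly for readability

def parse_llm_json(raw: str) -> dict:
    """Parse a model reply as JSON, tolerating code fences around it."""
    text = raw.strip()
    # Drop an opening fence line (optionally tagged, e.g. "json") and a closing fence
    text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*\n", "", text)
    text = re.sub(r"\n" + FENCE + r"\s*$", "", text)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        # Fall back to treating the whole reply as the answer text
        return {"answer": raw.strip(), "computed_fields": {}}
    parsed.setdefault("computed_fields", {})
    return parsed

reply = FENCE + 'json\n{"answer": "Your average was 118/77."}\n' + FENCE
print(parse_llm_json(reply))
```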

---

### 🧩 Usage Example
```python
csv_text = (CSV_DIR / "synthetic_bp_one_person.csv").read_text()
q = "What is my average blood pressure this week?"
response = ask_gpt(q, csv_text)
print(response["answer"])
```

Example response:

```json
{
  "answer": "Your average systolic pressure this week was 118 mm Hg and your diastolic pressure was 77 mm Hg.",
  "computed_fields": {"systolic_avg": 118, "diastolic_avg": 77}
}
```



### 📥 Input / 📤 Output <a id="qa"></a>
This step queries the LLM with the English questions produced by the ASR and translation step (Step 5) and answers them from the tabular blood-pressure dataset.

**Inputs**: An English question and the CSV file with synthetic blood-pressure records.  
**Outputs**: English answers and associated computed fields.


In [None]:

# ==========================================================
# 🧠 STEP 8 — LLM Question Answering (Blood Pressure Analysis)
# ==========================================================
import json

def ask_gpt(question_en: str, csv_block: str) -> dict:
    """
    Query the LLM with an English question and the CSV dataset context.

    Parameters
    ----------
    question_en : str
        English question derived from Spanish transcription.
    csv_block : str
        CSV content as a text block for in-context grounding.

    Returns
    -------
    dict
        Dictionary containing:
        - "answer" : str — model's English response
        - "computed_fields" : dict — optional numeric details
    """

    system_prompt = """
You are a careful and detail-oriented data analyst.

You are given a synthetic blood pressure dataset in CSV format. It contains readings for one individual over the last 30 consecutive days, with the following columns:

- date
- age
- sex
- systolic_mmHg
- diastolic_mmHg

Use only the data in the CSV to answer all questions, except when normal blood pressure ranges are requested — in those cases, you may use external references but must cite your source.

---

🧠 Interpretation Guidelines:

- "Today" refers to the most recent date in the dataset.
- "Yesterday" means the date before "today" in the dataset.
- "Last week" or "last month" refer to 7- or 30-day windows before "today".
- If a date or range is unavailable, clearly say so.
- Use conversational date formats like “October 12” instead of numeric ones.

---

💬 Answer Style:
- Use natural, conversational English.
- Address the user as “you”.
- Respond clearly and concisely.

---

✅ Response Format:
Always return valid JSON with this structure:

{
  "answer": "<English answer>",
  "computed_fields": { "<field name>": <numeric value used> }
}
"""

    user_prompt = (
        f"CSV Dataset:\n{csv_block}\n\n"
        f"Question:\n{question_en}\n\n"
        "Please analyze the data and respond strictly in valid JSON format as defined above."
    )

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt.strip()},
                {"role": "user", "content": user_prompt.strip()},
            ],
        )
        answer_text = response.choices[0].message.content.strip()
    except Exception as e:
        print(f"❌ OpenAI API error: {e}")
        return {"answer": "Error: failed to retrieve response.", "computed_fields": {}}

    # Safely handle JSON parsing
    try:
        result = json.loads(answer_text)
        if not isinstance(result, dict):
            raise ValueError("Invalid JSON structure.")
    except Exception:
        result = {"answer": answer_text, "computed_fields": {}}

    return result



## 🔊 Step 9 — Spanish Translation & Text-to-Speech (Optimized)

This module converts each English answer from the LLM into natural-sounding **Spanish audio**.  
It provides two flexible layers:

1. **Translation (English → Spanish)** — by default uses **OpenAI GPT-4o-mini**,  
   but automatically falls back to `googletrans` if an API key is not available.
2. **Text-to-Speech (Spanish → Audio)** — uses **gTTS + pydub** (free and Colab-friendly).

All outputs are saved as `.wav` files in the `results/tts_audio/` directory.

---

### 📥 Input / 📤 Output

| Step | Input | Output |
|------|--------|---------|
| Translation | English text | Spanish text |
| TTS | Spanish text | Spoken Spanish `.wav` file |
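
The translation layer follows a "try in order, fall back on failure" pattern. The sketch below isolates that pattern with stand-in translators so it runs without network access. The name `first_successful` and the stub functions are hypothetical; the `"(translation unavailable)"` sentinel is the one the notebook returns when every option fails.

```python
def first_successful(text, translators):
    """Try each (name, function) pair in order; return (result, name) of the first that works."""
    for name, fn in translators:
        try:
            return fn(text), name
        except Exception as exc:
            print(f"{name} failed: {exc}")
    return "(translation unavailable)", None

def broken_gpt(text):
    # Stand-in for a failed API call
    raise RuntimeError("no API key")

def stub_fallback(text):
    # Stand-in for a free translator; returns a fixed placeholder
    return "Su presión es normal."

result, used = first_successful(
    "Your pressure is normal.",
    [("openai", broken_gpt), ("fallback", stub_fallback)],
)
print(result, used)
```

Returning the name of the translator that succeeded lets a caller detect the all-failed case (`None`) and skip speech synthesis rather than voicing the sentinel string.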


In [None]:
# ==========================================================
# 🔊 STEP 9 — SPANISH TRANSLATION & TEXT-TO-SPEECH (OPTIMIZED)
# ==========================================================
import os
from pathlib import Path

def translate_to_spanish(text_en: str) -> str:
    """
    Translate English text to Spanish.
    Uses OpenAI GPT if available, otherwise falls back to googletrans.

    Parameters
    ----------
    text_en : str
        English text to translate.

    Returns
    -------
    str
        Spanish translation.
    """
    # --- 1️⃣ Try OpenAI translation first ---
    try:
        if "client" in globals():
            prompt = (
                "Translate the following English medical answer into clear, "
                "natural Spanish:\n\n" + text_en
            )
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content.strip()
    except Exception as e:
        print(f"⚠️ OpenAI translation failed: {e}")

    # --- 2️⃣ Fallback: googletrans (free) ---
    try:
        from googletrans import Translator
        translator = Translator()
        return translator.translate(text_en, src="en", dest="es").text
    except Exception as e:
        print(f"⚠️ googletrans fallback failed: {e}")
        return "(translation unavailable)"

def text_to_speech_spanish(text_es: str, out_wav_path: Path):
    """
    Generate Spanish TTS audio (free fallback using gTTS + pydub).

    Parameters
    ----------
    text_es : str
        Spanish text to synthesize.
    out_wav_path : Path
        Destination path for the WAV file.
    """
    try:
        from gtts import gTTS
        from pydub import AudioSegment

        out_wav_path.parent.mkdir(parents=True, exist_ok=True)
        tmp_mp3 = out_wav_path.with_suffix(".mp3")

        # Generate MP3 and convert to WAV
        gTTS(text_es, lang="es").save(tmp_mp3)
        AudioSegment.from_mp3(tmp_mp3).export(out_wav_path, format="wav")
        os.remove(tmp_mp3)

        print(f"✅ Spanish TTS saved → {out_wav_path.name}")
    except Exception as e:
        # Chain the original exception so the underlying traceback is preserved
        raise RuntimeError(f"TTS generation failed: {e}") from e


## 🚀 Step 10 — Full Pipeline Integration

This step executes the **complete HealthTequity Voice Pipeline**, combining all previous components into a single, reproducible workflow.

The pipeline transcribes spoken Spanish questions, answers them with an LLM grounded in the tabular blood-pressure data, and returns each answer as spoken Spanish.

---

### 🧩 Workflow Overview

| Step | Process | Description | Output |
|------|----------|--------------|---------|
| 1️⃣ | **Input ASR + Translation** | Transcribes each Spanish question and translates it into English. | `audio_translations.csv` |
| 2️⃣ | **Input ASR Evaluation** | Compares Whisper transcriptions with ground-truth text to compute WER, CER, and SER. | `input_asr_metrics.csv` |
| 3️⃣ | **CSV Grounding** | Loads the synthetic blood-pressure dataset as the context for LLM reasoning. | In-memory CSV string |
| 4️⃣ | **LLM Question Answering** | Uses GPT to analyze the dataset and answer each English question. | English answer text |
| 5️⃣ | **Spanish Translation + TTS** | Converts each English answer into natural Spanish and generates speech audio. | `results/tts_audio/*.wav` |
| 6️⃣ | **Output ASR Evaluation** | Transcribes the TTS-generated Spanish audio and compares it to the Spanish answer text that was synthesized (`spanish_answer` column). | `output_asr_metrics.csv` |
| 7️⃣ | **Visualization** | Displays a side-by-side comparison of WER, CER, and SER for input vs. output ASR. | Matplotlib chart |
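
For reference, the three metrics in the table can be sketched in plain Python. This is an illustration under the standard definitions (edit distance divided by reference length), not the notebook's own `evaluate_asr_whisper`, which may normalize text differently. The function names are assumptions.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits / number of reference words (non-empty reference)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edits / reference length (non-empty reference)."""
    return edit_distance(ref, hyp) / len(ref)


def ser(ref: str, hyp: str) -> int:
    """Sentence error: 1 if the hypothesis differs from the reference at all, else 0."""
    return int(ref != hyp)


print(wer("la presión es alta", "la presion es alta"))  # 0.25
print(wer("120 80", "ciento veinte sobre ochenta"))      # 2.0
```

The second call shows that WER is not bounded by 1: when the hypothesis contains more words than the reference, insertions push the ratio above 1.0.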

---

### 📁 Input Requirements
- **`data/synthetic_csv/synthetic_bp_one_person.csv`** — Blood-pressure dataset  
- **`data/Spanish_audio/`** — Spanish question audio files  
- **`data/synthetic_csv/ground_truth.csv`** — Ground-truth transcription for ASR evaluation  

Reviewers may replace these files with their own data to test other datasets or audio samples.
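
When swapping in custom data, it can help to confirm that every file listed in the ground-truth CSV is present in the audio folder before running the pipeline. The sketch below is one way to do that. The helper name `find_missing_audio` and the `audio_file` column name are assumptions; adjust the column name to match your CSV.

```python
import csv
from pathlib import Path
from typing import List


def find_missing_audio(gt_csv: Path, audio_folder: Path, file_col: str = "audio_file") -> List[str]:
    """Return file names listed in the ground-truth CSV that are absent from audio_folder."""
    with open(gt_csv, newline="", encoding="utf-8") as f:
        listed = [row[file_col] for row in csv.DictReader(f)]
    return [name for name in listed if not (Path(audio_folder) / name).is_file()]
```

A call such as `find_missing_audio(CSV_DIR / "ground_truth.csv", AUDIO_DIR)` returns an empty list when the two inputs are consistent.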

---

### 💾 Output Artifacts
All generated files are automatically stored in the following directories:
- **Transcriptions & LLM results:** `results/llm_outputs/`
- **Evaluation metrics:** `results/evaluation_metrics/`
- **Spanish speech audio:** `results/tts_audio/`
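
A minimal sketch of how these output folders could be created with `pathlib`. The keys `LLM_OUT`, `EVAL_DIR`, and `TTS_DIR` match the names used in the pipeline code, but the helper `make_output_dirs` and the choice of base directory are assumptions, not the notebook's actual setup cell.

```python
from pathlib import Path
from typing import Dict


def make_output_dirs(base_dir: Path) -> Dict[str, Path]:
    """Create the three results sub-folders under base_dir and return their paths."""
    results = Path(base_dir) / "results"
    dirs = {
        "LLM_OUT": results / "llm_outputs",
        "EVAL_DIR": results / "evaluation_metrics",
        "TTS_DIR": results / "tts_audio",
    }
    for path in dirs.values():
        # exist_ok=True makes the call safe to repeat across notebook re-runs
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

For example, `make_output_dirs(Path("."))` creates `results/llm_outputs/`, `results/evaluation_metrics/`, and `results/tts_audio/` under the current working directory.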

---

### ⚙️ Configuration
You can select different Whisper models for accuracy/speed trade-offs using the `whisper_model_size` parameter:
```python
whisper_model_size="base"  # options: "tiny", "base", "small", "medium", "large"
```


In [None]:

import json
from pathlib import Path

import pandas as pd

def run_full_pipeline(csv_path: Path, audio_folder: Path, whisper_model_size: str = "base"):
    """
    Execute the full HealthTequity voice analysis pipeline.
    Includes:
      1️⃣ Input-side ASR & translation (Spanish → English)
      2️⃣ ASR evaluation (WER / CER / SER)
      3️⃣ LLM-driven Q&A on the blood pressure dataset
      4️⃣ Spanish translation + TTS output
      5️⃣ Output-side ASR evaluation (WER / CER / SER)

    Parameters
    ----------
    csv_path : Path
        Path to the synthetic blood-pressure CSV.
    audio_folder : Path
        Directory containing input Spanish .wav files.
    whisper_model_size : str, optional
        Whisper model size (default: "base").

    Returns
    -------
    dict
        Summary dictionary with key file paths for inspection.
    """

    print("\n==============================")
    print("🎙️ STEP 1: Input ASR + Translation (Spanish → English)")
    print("==============================")

    trans_csv = LLM_OUT / "audio_translations.csv"
    _ = process_and_translate_audio(audio_folder, trans_csv, model_size=whisper_model_size)

    print("\n==============================")
    print("📊 STEP 2: Evaluate Input ASR (WER / CER / SER)")
    print("==============================")

    gt_csv = CSV_DIR / "ground_truth.csv"
    asr_csv = EVAL_DIR / "input_asr_metrics.csv"
    _ = evaluate_asr_whisper(
        gt_csv=gt_csv,
        audio_folder=audio_folder,
        output_csv=asr_csv,
        model_size=whisper_model_size,
        gt_text_col="ground_truth"
    )

    print("\n==============================")
    print("📈 STEP 3: LLM Question Answering")
    print("==============================")

    df_bp = pd.read_csv(csv_path)
    csv_block = df_bp.to_csv(index=False)
    trans_df = pd.read_csv(trans_csv)

    results = []
    for i, row in trans_df.iterrows():
        q_en = row["english_translation"]
        ans = ask_gpt(q_en, csv_block)
        ans_en = str(ans.get("answer", "")).strip()  # coerce in case the model returns a non-string value
        ans_es = translate_to_spanish(ans_en)

        # Generate Spanish TTS output
        out_wav = TTS_DIR / f"answer_{i + 1}_es.wav"
        text_to_speech_spanish(ans_es, out_wav)

        results.append({
            "question_number": i + 1,
            "audio_file_in": row["audio_file"],
            "spanish_question": row["spanish_transcription"],
            "english_question": q_en,
            "english_answer": ans_en,
            "spanish_answer": ans_es,
            "audio_file": str(out_wav),  # ✅ unified column for ASR evaluation
            "computed_fields": json.dumps(ans.get("computed_fields", {}))
        })

    final_csv = LLM_OUT / "final_pipeline_results.csv"
    pd.DataFrame(results).to_csv(final_csv, index=False)
    print(f"✅ Saved final results to: {final_csv}")

    print("\n==============================")
    print("🧠 STEP 4: Evaluate Output ASR (WER / CER / SER)")
    print("==============================")

    output_asr_csv = EVAL_DIR / "output_asr_metrics.csv"
    _ = evaluate_asr_whisper(
        gt_csv=final_csv,
        audio_folder=TTS_DIR,
        output_csv=output_asr_csv,
        model_size=whisper_model_size,
        gt_text_col="spanish_answer"
    )

    print("\n==============================")
    print("📊 STEP 5: Visualize ASR Comparison")
    print("==============================")

    plot_asr_comparison(
        input_asr_csv=asr_csv,
        output_asr_csv=output_asr_csv,
        output_dir=EVAL_DIR
    )

    print("\n✅ Full pipeline completed successfully.")
    return {
        "transcriptions_csv": str(trans_csv),
        "input_asr_metrics_csv": str(asr_csv),
        "final_pipeline_csv": str(final_csv),
        "output_asr_metrics_csv": str(output_asr_csv)
    }


# Example (not auto-run in submission)
bp_csv = CSV_DIR / "synthetic_bp_one_person.csv"
_ = run_full_pipeline(bp_csv, AUDIO_DIR, whisper_model_size="base")



🎙️ STEP 1: Input ASR + Translation (Spanish → English)
🎧 Loading Whisper model: base




🔍 Processing 9 audio files...


100%|██████████| 1303/1303 [00:04<00:00, 325.25frames/s]


✅ q10_es.wav → processed


100%|██████████| 477/477 [00:03<00:00, 156.88frames/s]


✅ q2_es.wav → processed


100%|██████████| 427/427 [00:02<00:00, 166.57frames/s]


✅ q3_es.wav → processed


100%|██████████| 573/573 [00:02<00:00, 207.54frames/s]


✅ q4_es.wav → processed


100%|██████████| 328/328 [00:03<00:00, 83.21frames/s]


✅ q5_es.wav → processed


100%|██████████| 1461/1461 [00:04<00:00, 346.44frames/s]


✅ q6_es.wav → processed


100%|██████████| 739/739 [00:02<00:00, 269.37frames/s]


✅ q7_es.wav → processed


100%|██████████| 516/516 [00:02<00:00, 192.25frames/s]


✅ q8_es.wav → processed


100%|██████████| 458/458 [00:02<00:00, 176.23frames/s]


✅ q9_es.wav → processed

💾 Results saved to: /content/drive/MyDrive/HealthTequity-LLM/results/llm_outputs/audio_translations.csv

📊 STEP 2: Evaluate Input ASR (WER / CER / SER)
🎧 Loading Whisper model: base




🔍 Evaluating 10 audio files...
⚠️ Missing audio file: q1_es.wav


100%|██████████| 477/477 [00:03<00:00, 134.15frames/s]


✅ q2_es.wav → WER=0.0, CER=0.0, SER=0


100%|██████████| 427/427 [00:03<00:00, 141.22frames/s]


✅ q3_es.wav → WER=0.0, CER=0.0, SER=0


100%|██████████| 573/573 [00:02<00:00, 203.56frames/s]


✅ q4_es.wav → WER=0.0714, CER=0.0128, SER=1


100%|██████████| 328/328 [00:02<00:00, 129.30frames/s]


✅ q5_es.wav → WER=0.0, CER=0.0, SER=0


100%|██████████| 1461/1461 [00:04<00:00, 300.37frames/s]


✅ q6_es.wav → WER=0.0968, CER=0.0595, SER=1


100%|██████████| 739/739 [00:03<00:00, 191.49frames/s]


✅ q7_es.wav → WER=0.0, CER=0.0, SER=0


100%|██████████| 516/516 [00:02<00:00, 193.63frames/s]


✅ q8_es.wav → WER=0.0, CER=0.0, SER=0


100%|██████████| 458/458 [00:03<00:00, 149.34frames/s]


✅ q9_es.wav → WER=0.0, CER=0.0, SER=0


100%|██████████| 1303/1303 [00:04<00:00, 287.91frames/s]


✅ q10_es.wav → WER=0.0714, CER=0.0121, SER=1

💾 ASR metrics saved to: /content/drive/MyDrive/HealthTequity-LLM/results/evaluation_metrics/input_asr_metrics.csv
📊 Average WER=0.027, CER=0.009, SER=0.333

📈 STEP 3: LLM Question Answering
✅ Spanish TTS saved → answer_1_es.wav
✅ Spanish TTS saved → answer_2_es.wav
✅ Spanish TTS saved → answer_3_es.wav
✅ Spanish TTS saved → answer_4_es.wav
✅ Spanish TTS saved → answer_5_es.wav
✅ Spanish TTS saved → answer_6_es.wav
✅ Spanish TTS saved → answer_7_es.wav
✅ Spanish TTS saved → answer_8_es.wav
✅ Spanish TTS saved → answer_9_es.wav
✅ Saved final results to: /content/drive/MyDrive/HealthTequity-LLM/results/llm_outputs/final_pipeline_results.csv

🧠 STEP 4: Evaluate Output ASR (WER / CER / SER)
🎧 Loading Whisper model: base
🔍 Evaluating 9 audio files...


100%|██████████| 8366/8366 [00:28<00:00, 297.45frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_1_es.wav → WER=2.1892, CER=0.7607, SER=1


100%|██████████| 16216/16216 [00:51<00:00, 316.15frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_2_es.wav → WER=1.9787, CER=0.547, SER=1


100%|██████████| 6703/6703 [00:17<00:00, 377.60frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_3_es.wav → WER=0.4416, CER=0.1766, SER=1


100%|██████████| 3484/3484 [00:11<00:00, 296.51frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_4_es.wav → WER=1.0889, CER=0.2934, SER=1


100%|██████████| 2066/2066 [00:17<00:00, 121.52frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_5_es.wav → WER=0.375, CER=0.4762, SER=1


100%|██████████| 5090/5090 [00:46<00:00, 109.38frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_6_es.wav → WER=0.127, CER=0.1835, SER=1


100%|██████████| 6000/6000 [00:19<00:00, 301.75frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_7_es.wav → WER=0.2286, CER=0.1885, SER=1


100%|██████████| 1401/1401 [00:03<00:00, 418.31frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_8_es.wav → WER=0.1176, CER=0.1889, SER=1


100%|██████████| 5623/5623 [00:18<00:00, 296.32frames/s]


✅ /content/drive/MyDrive/HealthTequity-LLM/results/tts_audio/answer_9_es.wav → WER=1.4032, CER=0.3877, SER=1

💾 ASR metrics saved to: /content/drive/MyDrive/HealthTequity-LLM/results/evaluation_metrics/output_asr_metrics.csv
📊 Average WER=0.883, CER=0.356, SER=1.000

📊 STEP 5: Visualize ASR Comparison
📊 ASR comparison chart saved to: /content/drive/MyDrive/HealthTequity-LLM/results/evaluation_metrics/asr_comparison_chart.png

✅ Full pipeline completed successfully.
