<a href="https://colab.research.google.com/github/krishna11-dot/Talksynch/blob/main/TalkSync_Demo_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TalkSync Demo - Real-Time English to Hindi Translation System

## Problem Statement
Team members who are not fluent in English struggle to communicate during client calls on video conferencing platforms (Zoom, Google Meet, Microsoft Teams). This creates misunderstandings and reduces efficiency.

## Solution Overview
A real-time translation system that:
1. Captures English speech from client calls
2. Transcribes speech to text
3. Translates English text to Hindi
4. Synthesizes Hindi audio for team members

## Demo Scope (Phase 1)
- Standalone pipeline demonstration
- English to Hindi translation only
- File upload and microphone recording support
- Latency and quality metrics display

## Out of Scope (Phase 2)
- Video conferencing platform integration
- Real-time streaming
- Bidirectional translation
- Voice selection and customization

---
# System Architecture

```
                           TALKSYNC ARCHITECTURE
    
    INPUT                                                    OUTPUT
    -----                                                    ------
    English Audio                                            Hindi Audio
    (microphone/file)                                        (synthesized)
         |                                                        ^
         v                                                        |
+------------------------------------------------------------------------+
|                                                                        |
|                      DECISION BOX (Orchestrator)                       |
|                                                                        |
|  Responsibilities:                                                     |
|  - Route data through pipeline sequentially                            |
|  - Validate each module output before proceeding                       |
|  - Handle errors gracefully with meaningful messages                   |
|  - Log metrics for performance monitoring                              |
|                                                                        |
|  Design Principle: "LLM is a translator, not a controller"             |
|  The Decision Box controls flow; modules perform specific tasks.       |
+------------------------------------------------------------------------+
         |                    |                    |
         v                    v                    v
  +-------------+     +---------------+     +---------------+
  |  Module 1   |     |   Module 2    |     |   Module 3    |
  |     ASR     | --> |  Translation  | --> |     TTS       |
  +-------------+     +---------------+     +---------------+
  | Model:      |     | Model:        |     | Model:        |
  | Whisper     |     | IndicTrans2   |     | Chatterbox    |
  | (base)      |     | (200M)        |     | Multilingual  |
  +-------------+     +---------------+     +---------------+
  | Input:      |     | Input:        |     | Input:        |
  | Audio file  |     | English text  |     | Hindi text    |
  +-------------+     +---------------+     +---------------+
  | Output:     |     | Output:       |     | Output:       |
  | English     |     | Hindi text    |     | Hindi audio   |
  | text        |     |               |     | waveform      |
  +-------------+     +---------------+     +---------------+
  | Success     |     | Success       |     | Success       |
  | Criteria:   |     | Criteria:     |     | Criteria:     |
  | WER < 20%   |     | Semantic      |     | Clear         |
  |             |     | accuracy      |     | pronunciation |
  +-------------+     +---------------+     +---------------+
```

---
# Success Metrics

## Business Metrics

| Metric | Target | Measurement Method |
|--------|--------|--------------------|
| Communication Success | Team understands client message | Demo shows accurate translation |
| End-to-End Latency | Less than 5 seconds | Timestamp logging at each stage |
| Output Quality | Intelligible Hindi audio | Manual verification during demo |
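The "timestamp logging at each stage" measurement can be sketched with a small context manager. This is only an illustration under assumed names (`stage_timer` and `latencies` are hypothetical); the modules later in this notebook time themselves with `time.time()` instead, and the `time.sleep` calls stand in for real model calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name, latencies):
    """Record the wall-clock duration of one pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Store the elapsed time even if the stage raises
        latencies[name] = round((time.perf_counter() - start) * 1000, 2)

latencies = {}
with stage_timer("asr", latencies):
    time.sleep(0.01)  # stand-in for a real ASR call
with stage_timer("translation", latencies):
    time.sleep(0.01)  # stand-in for a real translation call

total_ms = sum(latencies.values())
within_target = total_ms < 5000  # 5-second end-to-end target from the table above
print(latencies, within_target)
```

Summing per-stage values slightly underestimates true end-to-end latency, since it ignores time spent between stages, so a separate outer timer is a reasonable addition.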

## Technical Metrics

| Module | Metric | Target | How to Measure | Fallback Strategy |
|--------|--------|--------|----------------|-------------------|
| ASR (Whisper) | Word Error Rate | Less than 20% | Compare transcription to source | Re-record with cleaner audio |
| Translation (IndicTrans2) | Semantic Accuracy | Meaning preserved | Manual review of output | Try simpler sentences |
| TTS (Chatterbox) | Intelligibility | Clear pronunciation | Listen test | Adjust CFG and exaggeration parameters |
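To make the ASR target concrete, Word Error Rate can be computed as the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal pure-Python version; the function name `word_error_rate` and the sample sentences are made up for illustration, and a real evaluation would also normalize punctuation and numerals before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)

reference = "please send the report by friday"
hypothesis = "please send a report by friday"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%} (target: < 20%)")
```

Here one of six reference words is substituted, so the transcript would fall within the 20% target.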



---
# Environment Setup
---

In [1]:
# Cell 1: System Check
# Verify Python version and GPU availability before proceeding

import sys
print(f"Python version: {sys.version}")

# GPU check - required for acceptable latency
!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

# Cell 1 (continued): Install PyTorch (compatible versions for Colab Python 3.12)
# Colab typically has PyTorch pre-installed, but let's ensure compatibility

# Uninstall existing PyTorch, Torchaudio, torchvision to prevent conflicts
!pip uninstall -y torch torchaudio torchvision

# Install PyTorch 2.6.0 and compatible torchaudio (2.6.0) and torchvision (0.21.0, latest available for cu124)
# Use --index-url to specify CUDA 12.4 compatible wheels
!pip install torch==2.6.0 torchaudio==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
name, memory.total [MiB], memory.free [MiB]
Tesla T4, 15360 MiB, 15095 MiB
Found existing installation: torch 2.9.0+cu126
Uninstalling torch-2.9.0+cu126:
  Successfully uninstalled torch-2.9.0+cu126
Found existing installation: torchaudio 2.9.0+cu126
Uninstalling torchaudio-2.9.0+cu126:
  Successfully uninstalled torchaudio-2.9.0+cu126
Found existing installation: torchvision 0.24.0+cu126
Uninstalling torchvision-0.24.0+cu126:
  Successfully uninstalled torchvision-0.24.0+cu126
Looking in indexes: https://download.pytorch.org/whl/cu124
Collecting torch==2.6.0
  Downloading https://download.pytorch.org/whl/cu124/torch-2.6.0%2Bcu124-cp312-cp312-linux_x86_64.whl.metadata (28 kB)
Collecting torchaudio==2.6.0
  Downloading https://download.pytorch.org/whl/cu124/torchaudio-2.6.0%2Bcu124-cp312-cp312-linux_x86_64.whl.metadata (6.6 kB)
Collecting torchvision==0.21.0
  Downloading https://download.pytorch.org/whl/cu124/torchvisio

In [None]:
# Cell 2: Install Dependencies
# Run once, then restart runtime if prompted

# Core ML dependencies
!pip install -q transformers accelerate sentencepiece
!pip install -q openai-whisper
!pip install -q scipy soundfile

# IndicTrans2 preprocessing toolkit (required for proper translation)
!pip install -q IndicTransToolkit


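# Chatterbox TTS
# Step 1: Check whether it is already installed (possibly under a different name)
!pip list | grep -i chatter

# Step 2: Try ONE installation method; uncomment an alternative only if the first fails.
# Note: the next cell clones the resemble-ai/chatterbox repo directly, which is
# the route used in this notebook for the multilingual (Hindi) model.
!pip install --upgrade chatterbox-tts --no-cache-dir
# !pip install git+https://github.com/resemble-ai/chatterbox.git
# !pip install chatterbox-tts==0.1.0  # Try a specific version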


# Gradio for interactive demo interface
!pip install -q gradio


print("\nAll dependencies installed.")
print("If this is your first run, restart the runtime: Runtime -> Restart runtime")

In [None]:
# Fallback that worked in this environment: clone the repo directly and add it to the Python path
!git clone https://github.com/resemble-ai/chatterbox.git /content/chatterbox_repo

# Install dependencies
!pip install torch torchaudio librosa safetensors huggingface_hub perth

# Add to Python path
import sys
sys.path.insert(0, '/content/chatterbox_repo/src')

# Now import
from chatterbox.mtl_tts import ChatterboxMultilingualTTS, SUPPORTED_LANGUAGES
print(f"✅ Success! Hindi available: {'hi' in SUPPORTED_LANGUAGES}")

In [2]:
# Cell 3: Verify Imports

import importlib

required_packages = [
    ('whisper', 'Whisper ASR'),
    ('transformers', 'Transformers'),
    ('torch', 'PyTorch'),
    ('torchaudio', 'TorchAudio'),
    ('gradio', 'Gradio UI'),
]

print("Package Verification:")
print("-" * 50)
all_ok = True
for pkg, name in required_packages:
    try:
        mod = importlib.import_module(pkg)
        version = getattr(mod, '__version__', 'installed')
        print(f"[OK] {name}: {version}")
    except ImportError:
        print(f"[FAIL] {name}: NOT INSTALLED")
        all_ok = False

# Check Chatterbox multilingual
try:
    from chatterbox.mtl_tts import ChatterboxMultilingualTTS
    print(f"[OK] Chatterbox Multilingual TTS: installed")
except ImportError as e:
    print(f"[FAIL] Chatterbox Multilingual TTS: {e}")
    all_ok = False

# Check IndicTransToolkit
try:
    from IndicTransToolkit.processor import IndicProcessor
    print(f"[OK] IndicTransToolkit: installed")
except ImportError as e:
    print(f"[FAIL] IndicTransToolkit: {e}")
    all_ok = False

print("-" * 50)
if all_ok:
    print("All packages verified successfully.")
else:
    print("Some packages failed. Re-run Cell 2 (Install Dependencies) and restart the runtime.")

Package Verification:
--------------------------------------------------
[OK] Whisper ASR: 20250625
[OK] Transformers: 4.46.3
[OK] PyTorch: 2.6.0+cu124
[OK] TorchAudio: 2.6.0+cu124
[OK] Gradio UI: 5.50.0
[OK] Chatterbox Multilingual TTS: installed
[OK] IndicTransToolkit: installed
--------------------------------------------------
All packages verified successfully.


In [3]:
# Cell 4: Hugging Face Token Setup (Secure Method)
# Store token in Colab Secrets: Left sidebar -> Key icon -> Add 'HF_TOKEN'

from google.colab import userdata

try:
    HF_TOKEN = userdata.get('HF_TOKEN')
    print("HF Token loaded from Colab Secrets")
except Exception as e:
    print("WARNING: HF_TOKEN not found in Colab Secrets")
    print("Instructions: Left sidebar -> Key icon -> Add 'HF_TOKEN' with your Hugging Face token")
    HF_TOKEN = None

HF Token loaded from Colab Secrets


---
# Module 1: Automatic Speech Recognition (ASR)

**Model**: OpenAI Whisper (base)  
**Input**: Audio file (WAV, MP3, etc.)  
**Output**: English text transcription  
**Success Criteria**: Word Error Rate less than 20%

---

In [None]:
# Cell 5: ASR Module

import whisper
import time

class ASRModule:
    """
    Automatic Speech Recognition Module
    Converts English audio to English text using OpenAI Whisper.

    Technical Metric: Word Error Rate (WER)
    Target: Less than 20% for clean audio
    """

    def __init__(self, model_size="base"):
        """
        Initialize Whisper model.

        Args:
            model_size: Model variant (tiny, base, small, medium, large)
                       'base' provides good balance of speed and accuracy
        """
        print(f"Loading Whisper {model_size} model...")
        self.model = whisper.load_model(model_size)
        self.model_size = model_size
        print(f"ASR Module initialized ({model_size})")

    def transcribe(self, audio_path):
        """
        Transcribe audio file to text.

        Args:
            audio_path: Path to audio file

        Returns:
            dict with keys: success, text, language, latency_ms, error
        """
        start_time = time.time()

        try:
            result = self.model.transcribe(audio_path, language="en")
            latency_ms = (time.time() - start_time) * 1000

            return {
                "success": True,
                "text": result["text"].strip(),
                "language": result["language"],
                "latency_ms": round(latency_ms, 2),
                "error": None
            }
        except Exception as e:
            return {
                "success": False,
                "text": "",
                "language": None,
                "latency_ms": round((time.time() - start_time) * 1000, 2),
                "error": str(e)
            }

    def get_metrics(self):
        return {
            "module": "ASR",
            "model": f"Whisper-{self.model_size}",
            "target_wer": "< 20%"
        }

# Initialize
asr_module = ASRModule(model_size="base")

---
# Module 2: Neural Machine Translation

**Model**: IndicTrans2 (ai4bharat/indictrans2-en-indic-dist-200M)  
**Input**: English text  
**Output**: Hindi text (Devanagari script)  
**Success Criteria**: Semantic meaning preserved

**Important**: IndicTrans2 requires IndicTransToolkit for preprocessing. Without it, the model will throw "Invalid source language tag" errors.
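As a rough illustration of why the preprocessing matters: to our understanding, the IndicTrans2 tokenizer expects every input sentence to begin with a source and a target language tag, and `IndicProcessor.preprocess_batch` adds these (alongside text normalization). The sketch below mimics only the tagging step in plain Python. `add_language_tags`, `has_valid_tags`, and `SUPPORTED_TAGS` are hypothetical names written for this explanation, not part of IndicTransToolkit, and the real preprocessing does more than this.

```python
# Subset of language tags used in this demo (assumption: the real model supports more)
SUPPORTED_TAGS = {"eng_Latn", "hin_Deva"}

def add_language_tags(sentences, src_lang, tgt_lang):
    """Prefix each sentence with '<src> <tgt> ', mimicking the tagging step only."""
    return [f"{src_lang} {tgt_lang} {s.strip()}" for s in sentences]

def has_valid_tags(sentence):
    """Check that the first two whitespace-separated tokens are known language tags."""
    parts = sentence.split(" ", 2)
    return (
        len(parts) == 3
        and parts[0] in SUPPORTED_TAGS
        and parts[1] in SUPPORTED_TAGS
    )

raw = ["The meeting starts at ten."]
tagged = add_language_tags(raw, "eng_Latn", "hin_Deva")

print(tagged[0])
print("raw input valid:   ", has_valid_tags(raw[0]))
print("tagged input valid:", has_valid_tags(tagged[0]))
```

Passing the raw sentence straight to the tokenizer would make it read "The" as the source language tag, which is consistent with the "Invalid source language tag" error mentioned above.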

---

In [None]:
# Cell 6: Translation Module

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit.processor import IndicProcessor
import time

class TranslationModule:
    """
    Neural Machine Translation Module
    Converts English text to Hindi using IndicTrans2.

    Technical Metric: Semantic accuracy
    Target: Meaning preserved accurately

    Note: Uses IndicProcessor for required preprocessing/postprocessing.
    """

    def __init__(self, hf_token=None):
        self.model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.src_lang = "eng_Latn"
        self.tgt_lang = "hin_Deva"

        print(f"Loading IndicTrans2 model on {self.device}...")

        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            trust_remote_code=True,
            token=hf_token
        )

        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            self.model_name,
            trust_remote_code=True,
            token=hf_token,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32
        ).to(self.device)

        # IndicProcessor handles required preprocessing
        self.processor = IndicProcessor(inference=True)

        print(f"Translation Module initialized (IndicTrans2-200M)")

    def translate(self, text, src_lang="eng_Latn", tgt_lang="hin_Deva"):
        """
        Translate text from English to Hindi.

        Args:
            text: English text to translate
            src_lang: Source language code (eng_Latn for English)
            tgt_lang: Target language code (hin_Deva for Hindi)

        Returns:
            dict with keys: success, source_text, translated_text, latency_ms, error
        """
        start_time = time.time()

        try:
            # Step 1: Preprocess with IndicProcessor (required)
            input_sentences = [text]
            batch = self.processor.preprocess_batch(
                input_sentences,
                src_lang=src_lang,
                tgt_lang=tgt_lang
            )

            # Step 2: Tokenize
            inputs = self.tokenizer(
                batch,
                truncation=True,
                padding="longest",
                return_tensors="pt",
                return_attention_mask=True
            ).to(self.device)

            # Step 3: Generate translation
            with torch.no_grad():
                generated_tokens = self.model.generate(
                    **inputs,
                    use_cache=True,
                    min_length=0,
                    max_length=256,
                    num_beams=5,
                    num_return_sequences=1
                )

            # Step 4: Decode tokens
            decoded = self.tokenizer.batch_decode(
                generated_tokens,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True
            )

            # Step 5: Postprocess with IndicProcessor (required)
            translations = self.processor.postprocess_batch(decoded, lang=tgt_lang)

            translated_text = translations[0]
            latency_ms = (time.time() - start_time) * 1000

            return {
                "success": True,
                "source_text": text,
                "translated_text": translated_text,
                "src_lang": src_lang,
                "tgt_lang": tgt_lang,
                "latency_ms": round(latency_ms, 2),
                "error": None
            }

        except Exception as e:
            return {
                "success": False,
                "source_text": text,
                "translated_text": "",
                "latency_ms": round((time.time() - start_time) * 1000, 2),
                "error": str(e)
            }

    def get_metrics(self):
        return {
            "module": "Translation",
            "model": "IndicTrans2-200M",
            "target": "Semantic accuracy"
        }

# Initialize
translation_module = TranslationModule(hf_token=HF_TOKEN)

---
# Module 3: Text-to-Speech Synthesis

**Model**: Chatterbox Multilingual (ResembleAI, 0.5B parameters)  
**Input**: Hindi text (Devanagari script)  
**Output**: Hindi audio waveform  
**Success Criteria**: Clear, intelligible pronunciation

**Note**: Chatterbox supports 23 languages including Hindi. The multilingual model must be installed from GitHub (not PyPI) for Hindi support.

---

In [None]:
# Cell 7: TTS Module

import torchaudio as ta
import torch
import time
from IPython.display import Audio, display

# Fix for perth watermarker compatibility issue
class MockWatermarker:
    """Mock watermarker for environments where perth version is incompatible."""
    def apply_watermark(self, wav, sample_rate):
        return wav

import perth
if not hasattr(perth, 'PerthImplicitWatermarker'):
    perth.PerthImplicitWatermarker = MockWatermarker
    print("Patched perth.PerthImplicitWatermarker for compatibility")

class TTSModule:
    """
    Text-to-Speech Synthesis Module
    Converts Hindi text to Hindi audio using Chatterbox Multilingual.

    Technical Metric: Intelligibility, naturalness
    Target: Clear pronunciation, natural prosody

    Supported languages: 23 including Hindi (hi), English (en), and others.
    """

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Loading Chatterbox Multilingual on {self.device}...")

        try:
            from chatterbox.mtl_tts import ChatterboxMultilingualTTS
            self.model = ChatterboxMultilingualTTS.from_pretrained(device=self.device)
            self.sample_rate = self.model.sr
            self.is_multilingual = True
            print(f"TTS Module initialized (Multilingual - 23 languages)")
        except Exception as e:
            print(f"Multilingual model failed: {e}")
            print("Attempting English-only fallback...")
            from chatterbox.tts import ChatterboxTTS
            self.model = ChatterboxTTS.from_pretrained(device=self.device)
            self.sample_rate = self.model.sr
            self.is_multilingual = False
            print(f"TTS Module initialized (English-only)")

    def synthesize(self, text, language="hi", exaggeration=0.5, cfg_weight=0.5):
        """
        Synthesize speech from text.

        Args:
            text: Text to synthesize
            language: Language code ('hi' for Hindi, 'en' for English)
            exaggeration: Emotion intensity (0.0-1.0, default 0.5)
            cfg_weight: Classifier-free guidance weight (0.0-1.0, default 0.5)

        Returns:
            dict with keys: success, audio, sample_rate, latency_ms, error
        """
        start_time = time.time()

        try:
            if self.is_multilingual:
                wav = self.model.generate(
                    text,
                    language_id=language,
                    exaggeration=exaggeration,
                    cfg_weight=cfg_weight
                )
            else:
                wav = self.model.generate(
                    text,
                    exaggeration=exaggeration,
                    cfg_weight=cfg_weight
                )

            latency_ms = (time.time() - start_time) * 1000

            return {
                "success": True,
                "audio": wav,
                "sample_rate": self.sample_rate,
                "latency_ms": round(latency_ms, 2),
                "language": language,
                "error": None
            }

        except Exception as e:
            return {
                "success": False,
                "audio": None,
                "sample_rate": self.sample_rate,
                "latency_ms": round((time.time() - start_time) * 1000, 2),
                "error": str(e)
            }

    def save_audio(self, wav, output_path):
        """Save audio tensor to file."""
        ta.save(output_path, wav, self.sample_rate)
        return output_path

    def play_audio(self, wav):
        """Play audio in notebook."""
        display(Audio(wav.squeeze().cpu().numpy(), rate=self.sample_rate))

    def get_metrics(self):
        return {
            "module": "TTS",
            "model": "Chatterbox-Multilingual-0.5B",
            "multilingual": self.is_multilingual,
            "target": "Clear intelligibility"
        }

# Initialize
tts_module = TTSModule()

---
# Central Orchestrator

**Role**: Coordinates all modules, validates outputs, handles errors

**Design Principle**: The Orchestrator controls flow based on deterministic rules. Each module performs its specific task without awareness of the overall pipeline.
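One way to picture "deterministic rules" is a small validation step applied to each module's result dict before the next module runs. The sketch below is an assumption about how such a check could look, not the notebook's implementation: `StageCheck` and `validate_stage` are hypothetical names, while the result keys (`success`, `text`, `latency_ms`, `error`) and the default thresholds (2 characters, 10 seconds) mirror the values used elsewhere in this notebook.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageCheck:
    ok: bool
    reason: Optional[str] = None

def validate_stage(result: dict, text_key: str,
                   min_text_length: int = 2,
                   max_latency_ms: float = 10000) -> StageCheck:
    """Apply fixed, rule-based checks to one module's result dict."""
    if not result.get("success"):
        return StageCheck(False, f"module error: {result.get('error')}")
    if len(result.get(text_key, "").strip()) < min_text_length:
        return StageCheck(False, "output text too short")
    if result.get("latency_ms", 0) > max_latency_ms:
        return StageCheck(False, "latency above per-module limit")
    return StageCheck(True)

# Example result dicts shaped like the ASR module's return value
good = {"success": True, "text": "Hello team", "latency_ms": 420.0, "error": None}
empty = {"success": True, "text": " ", "latency_ms": 380.0, "error": None}

print(validate_stage(good, "text"))
print(validate_stage(empty, "text"))
```

A check like this would stop the pipeline on silent or empty audio, where the ASR call succeeds but returns no usable text.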

---

In [7]:
# Cell 8: Decision Box

from dataclasses import dataclass
from typing import Optional, Any
import time

@dataclass
class PipelineResult:
    """Container for pipeline execution results."""
    success: bool
    english_text: str
    hindi_text: str
    audio: Optional[Any]
    total_latency_ms: float
    module_latencies: dict
    errors: list

class DecisionBox:
    """
    Central Orchestrator for the TalkSync Pipeline.

    Responsibilities:
    1. Route data through pipeline sequentially (ASR -> Translation -> TTS)
    2. Validate each module output before proceeding
    3. Handle errors gracefully with meaningful messages
    4. Log metrics for performance monitoring

    Design Principle:
    The Decision Box makes routing decisions based on module outputs.
    Each module is a separate concern with its own success criteria.
    """

    def __init__(self, asr, translation, tts):
        self.asr = asr
        self.translation = translation
        self.tts = tts

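        # Thresholds for validation (defined here, but not yet enforced
        # by process_audio/process_text, which only check each module's success flag)
        self.max_latency_ms = 10000  # 10 seconds max per module
        self.min_text_length = 2     # Minimum valid output length in characters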

        print("Decision Box initialized")
        self._print_architecture()

    def _print_architecture(self):
        print("\n" + "=" * 60)
        print("TALKSYNC ARCHITECTURE")
        print("=" * 60)
        print(f"Module 1 (ASR):         {self.asr.get_metrics()['model']}")
        print(f"Module 2 (Translation): {self.translation.get_metrics()['model']}")
        print(f"Module 3 (TTS):         {self.tts.get_metrics()['model']}")
        print("=" * 60 + "\n")

    def process_audio(self, audio_path, verbose=True):
        """
        Full pipeline: Audio -> Text -> Translation -> Speech

        Args:
            audio_path: Path to English audio file
            verbose: Print progress updates

        Returns:
            PipelineResult with all outputs and metrics
        """
        start_time = time.time()
        errors = []
        latencies = {}

        # MODULE 1: ASR
        if verbose:
            print("\n[1/3] ASR: Transcribing English audio...")

        asr_result = self.asr.transcribe(audio_path)
        latencies["asr"] = asr_result["latency_ms"]

        if not asr_result["success"]:
            errors.append(f"ASR failed: {asr_result['error']}")
            return PipelineResult(
                success=False, english_text="", hindi_text="", audio=None,
                total_latency_ms=(time.time() - start_time) * 1000,
                module_latencies=latencies, errors=errors
            )

        english_text = asr_result["text"]
        if verbose:
            print(f"    Transcribed: \"{english_text}\"")
            print(f"    Latency: {asr_result['latency_ms']:.0f}ms")

        # MODULE 2: TRANSLATION
        if verbose:
            print("\n[2/3] Translation: English -> Hindi...")

        trans_result = self.translation.translate(english_text)
        latencies["translation"] = trans_result["latency_ms"]

        if not trans_result["success"]:
            errors.append(f"Translation failed: {trans_result['error']}")
            return PipelineResult(
                success=False, english_text=english_text, hindi_text="", audio=None,
                total_latency_ms=(time.time() - start_time) * 1000,
                module_latencies=latencies, errors=errors
            )

        hindi_text = trans_result["translated_text"]
        if verbose:
            print(f"    Translated: \"{hindi_text}\"")
            print(f"    Latency: {trans_result['latency_ms']:.0f}ms")

        # MODULE 3: TTS
        if verbose:
            print("\n[3/3] TTS: Generating Hindi speech...")

        tts_result = self.tts.synthesize(hindi_text, language="hi")
        latencies["tts"] = tts_result["latency_ms"]

        if not tts_result["success"]:
            errors.append(f"TTS failed: {tts_result['error']}")
            return PipelineResult(
                success=False, english_text=english_text, hindi_text=hindi_text, audio=None,
                total_latency_ms=(time.time() - start_time) * 1000,
                module_latencies=latencies, errors=errors
            )

        if verbose:
            print(f"    Audio generated")
            print(f"    Latency: {tts_result['latency_ms']:.0f}ms")

        total_latency = (time.time() - start_time) * 1000

        if verbose:
            print("\n" + "=" * 50)
            print("PIPELINE COMPLETE")
            print(f"Total Latency: {total_latency:.0f}ms")
            print("=" * 50)

        return PipelineResult(
            success=True,
            english_text=english_text,
            hindi_text=hindi_text,
            audio=tts_result["audio"],
            total_latency_ms=total_latency,
            module_latencies=latencies,
            errors=[]
        )

    def process_text(self, english_text, verbose=True):
        """
        Text-only pipeline: Skip ASR, start from English text.
        Useful for testing Translation + TTS modules.

        Args:
            english_text: English text to translate
            verbose: Print progress updates

        Returns:
            PipelineResult with all outputs and metrics
        """
        start_time = time.time()
        latencies = {"asr": 0}
        errors = []

        if verbose:
            print(f"\nInput: \"{english_text}\"")
            print("\n[1/2] Translation: English -> Hindi...")

        trans_result = self.translation.translate(english_text)
        latencies["translation"] = trans_result["latency_ms"]

        if not trans_result["success"]:
            errors.append(f"Translation failed: {trans_result['error']}")
            return PipelineResult(
                success=False, english_text=english_text, hindi_text="", audio=None,
                total_latency_ms=(time.time() - start_time) * 1000,
                module_latencies=latencies, errors=errors
            )

        hindi_text = trans_result["translated_text"]
        if verbose:
            print(f"    Translated: \"{hindi_text}\"")
            print("\n[2/2] TTS: Generating Hindi speech...")

        tts_result = self.tts.synthesize(hindi_text, language="hi")
        latencies["tts"] = tts_result["latency_ms"]

        if not tts_result["success"]:
            errors.append(f"TTS failed: {tts_result['error']}")
            return PipelineResult(
                success=False, english_text=english_text, hindi_text=hindi_text, audio=None,
                total_latency_ms=(time.time() - start_time) * 1000,
                module_latencies=latencies, errors=errors
            )

        if verbose:
            print(f"    Audio generated")

        total_latency = (time.time() - start_time) * 1000
        if verbose:
            print(f"\nTotal Latency: {total_latency:.0f}ms")

        return PipelineResult(
            success=True,
            english_text=english_text,
            hindi_text=hindi_text,
            audio=tts_result["audio"],
            total_latency_ms=total_latency,
            module_latencies=latencies,
            errors=[]
        )

# Initialize Decision Box
decision_box = DecisionBox(
    asr=asr_module,
    translation=translation_module,
    tts=tts_module
)

Decision Box initialized

TALKSYNC ARCHITECTURE
Module 1 (ASR):         Whisper-base
Module 2 (Translation): IndicTrans2-200M
Module 3 (TTS):         Chatterbox-Multilingual-0.5B



---
# Interactive Demo Interface (Gradio)

Provides user-friendly interface for:
1. Recording audio from microphone
2. Uploading audio files
3. Text input for translation
4. Playing generated Hindi audio

---

In [None]:
# Cell 9: Gradio Interface with Clear Metrics

import gradio as gr
import numpy as np
import tempfile
import os

def format_metrics(result, include_asr=True):
    """
    Format business and technical metrics for display.
    """
    latency_target = 5000  # 5 seconds

    metrics_text = f"""
======================================================================
                    TALKSYNC METRICS DASHBOARD
======================================================================

BUSINESS METRICS
----------------------------------------------------------------------
  End-to-End Latency:     {result.total_latency_ms:.0f}ms (target: under 5000ms)
                          {"Within target - acceptable for demo" if result.total_latency_ms < latency_target else "Above target - user will notice delay"}

  Translation Completed:  {"Yes - Hindi text generated" if result.hindi_text else "No - translation failed"}

  Audio Generated:        {"Yes - Hindi speech ready to play" if result.audio is not None else "No - TTS failed to produce audio"}

TECHNICAL METRICS
----------------------------------------------------------------------"""

    # Per-module breakdown
    for module, latency in result.module_latencies.items():
        if latency > 0:
            # Add context for each module
            if module == "asr":
                context = "Time to convert speech to text"
            elif module == "translation":
                context = "Time to translate English to Hindi"
            elif module == "tts":
                context = "Time to generate Hindi audio"
            else:
                context = ""

            metrics_text += f"\n  {module.upper():12} {latency:>6.0f}ms    ({context})"

    # Latency breakdown as percentage
    total = result.total_latency_ms
    if total > 0:
        metrics_text += "\n\n  Latency Breakdown:"
        for module, latency in result.module_latencies.items():
            if latency > 0:
                percentage = (latency / total) * 100
                bar_length = int(percentage / 5)  # Scale to max 20 chars
                bar = "█" * bar_length
                metrics_text += f"\n    {module.upper():12} {bar} {percentage:.0f}%"

    # Architecture
    metrics_text += f"""

ARCHITECTURE
----------------------------------------------------------------------
  Module 1 (ASR):         {asr_module.get_metrics()['model']}
  Module 2 (Translation): {translation_module.get_metrics()['model']}
  Module 3 (TTS):         {tts_module.get_metrics()['model']}"""

    # Errors if any
    if result.errors:
        metrics_text += "\n\nERRORS"
        metrics_text += "\n----------------------------------------------------------------------"
        for error in result.errors:
            error_short = error[:80] + "..." if len(error) > 80 else error
            metrics_text += f"\n  {error_short}"

    # Recommendation
    if result.total_latency_ms > latency_target and result.module_latencies:
        # Find the slowest module (guarded above so max() never sees an empty dict)
        slowest_name, slowest_ms = max(result.module_latencies.items(), key=lambda x: x[1])
        metrics_text += f"""

RECOMMENDATION
----------------------------------------------------------------------
  Bottleneck: {slowest_name.upper()} module is taking {slowest_ms:.0f}ms
  Suggestion: Use shorter sentences to reduce processing time"""

    metrics_text += "\n\n======================================================================"

    return metrics_text


def process_audio_input(audio):
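    """
    Process audio input from microphone or file upload.

    `audio` is the (sample_rate, samples) tuple that gr.Audio(type="numpy")
    passes in, or None when nothing was recorded or uploaded.

    Returns a 4-tuple matching the Gradio outputs:
    (english_text, hindi_text, (sample_rate, hindi_audio) or None, metrics_text).
    """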
    if audio is None:
        return "No audio provided", "", None, "Error: No audio input"

    try:
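        sample_rate, audio_data = audio

        # Gradio delivers integer PCM; scale it to float32 in [-1.0, 1.0)
        if audio_data.dtype == np.int16:
            audio_data = audio_data.astype(np.float32) / 32768.0
        elif audio_data.dtype == np.int32:
            audio_data = audio_data.astype(np.float32) / 2147483648.0

        import soundfile as sf

        # Reserve a temp path, then close the handle before writing so the
        # file is never opened twice (a second open fails on Windows).
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            temp_path = f.name

        try:
            sf.write(temp_path, audio_data, sample_rate)
            result = decision_box.process_audio(temp_path, verbose=False)
        finally:
            # Remove the temp file even if writing or the pipeline raises
            os.unlink(temp_path)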

        metrics_text = format_metrics(result, include_asr=True)

        # Only build the audio tuple when TTS actually produced a tensor
        if result.success and result.audio is not None:
            hindi_audio = result.audio.squeeze().cpu().numpy()
            return result.english_text, result.hindi_text, (tts_module.sample_rate, hindi_audio), metrics_text

        return result.english_text or "Transcription failed", result.hindi_text or "", None, metrics_text

    except Exception as e:
        return f"Error: {str(e)}", "", None, f"Exception: {str(e)}"


def process_text_input(english_text):
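    """
    Process text input (skip ASR).

    Returns a 3-tuple matching the Gradio outputs:
    (hindi_text, (sample_rate, hindi_audio) or None, metrics_text).
    """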
    if not english_text or not english_text.strip():
        return "", None, "Error: No text provided"

    try:
        result = decision_box.process_text(english_text.strip(), verbose=False)

        metrics_text = format_metrics(result, include_asr=False)

        # Only build the audio tuple when TTS actually produced a tensor
        if result.success and result.audio is not None:
            hindi_audio = result.audio.squeeze().cpu().numpy()
            return result.hindi_text, (tts_module.sample_rate, hindi_audio), metrics_text

        return result.hindi_text or "", None, metrics_text

    except Exception as e:
        return "", None, f"Exception: {str(e)}"


# Build Gradio Interface
with gr.Blocks(title="TalkSync Demo") as demo:
    gr.Markdown("""
    # TalkSync - English to Hindi Real-Time Translation

    Translate English speech or text to Hindi audio.

    **Pipeline**: Audio/Text -> ASR (Whisper) -> Translation (IndicTrans2) -> TTS (Chatterbox)
    """)

    with gr.Tabs():
        with gr.TabItem("Audio Input"):
            gr.Markdown("Record from microphone or upload an audio file.")

            with gr.Row():
                with gr.Column():
                    audio_input = gr.Audio(
                        sources=["microphone", "upload"],
                        type="numpy",
                        label="English Audio Input"
                    )
                    audio_submit = gr.Button("Translate Audio", variant="primary")

                with gr.Column():
                    audio_english_out = gr.Textbox(label="Transcribed English Text", lines=2)
                    audio_hindi_out = gr.Textbox(label="Translated Hindi Text", lines=2)
                    audio_output = gr.Audio(label="Hindi Audio Output", type="numpy")

            audio_metrics = gr.Textbox(
                label="Metrics Dashboard",
                lines=25,
                max_lines=30
            )

            audio_submit.click(
                fn=process_audio_input,
                inputs=[audio_input],
                outputs=[audio_english_out, audio_hindi_out, audio_output, audio_metrics]
            )

        with gr.TabItem("Text Input"):
            gr.Markdown("Enter English text directly (skips ASR module).")

            with gr.Row():
                with gr.Column():
                    text_input = gr.Textbox(
                        label="English Text",
                        placeholder="Enter English text here...",
                        lines=3
                    )
                    text_submit = gr.Button("Translate Text", variant="primary")

                    gr.Markdown("**Example sentences:**")
                    gr.Examples(
                        examples=[
                            ["Hello, how are you?"],
                            ["The meeting is at 3 PM."],
                            ["Please send the report."],
                            ["Thank you for your help."]
                        ],
                        inputs=[text_input]
                    )

                with gr.Column():
                    text_hindi_out = gr.Textbox(label="Translated Hindi Text", lines=2)
                    text_audio_output = gr.Audio(label="Hindi Audio Output", type="numpy")

            text_metrics = gr.Textbox(
                label="Metrics Dashboard",
                lines=25,
                max_lines=30
            )

            text_submit.click(
                fn=process_text_input,
                inputs=[text_input],
                outputs=[text_hindi_out, text_audio_output, text_metrics]
            )

print("Launching Gradio interface...")
demo.launch(share=True, debug=False)



---
# Demo Summary

## What This Demo Shows

1. **Architecture**: Orchestration pattern (the Decision Box) coordinating three independent modules
2. **ASR**: Whisper converts English audio to text
3. **Translation**: IndicTrans2 converts English text to Hindi
4. **TTS**: Chatterbox synthesizes Hindi audio
5. **Metrics**: Business and technical metrics for evaluation
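
The orchestration pattern from item 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's actual Decision Box: `PipelineResult`, `run_pipeline`, and the `fake_*` stage functions are hypothetical names, and the stubs stand in for Whisper, IndicTrans2, and Chatterbox. It shows stages being called in sequence, each timed separately, with a failure recorded in the result instead of being raised.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class PipelineResult:
    """Hypothetical result container; field names are illustrative only."""
    outputs: Dict[str, Any] = field(default_factory=dict)
    module_latencies: Dict[str, float] = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)

    @property
    def success(self) -> bool:
        return not self.errors

    @property
    def total_latency_ms(self) -> float:
        return sum(self.module_latencies.values())


def run_pipeline(data: Any, stages: List[Tuple[str, Callable[[Any], Any]]]) -> PipelineResult:
    """Run each stage on the previous stage's output and time it in milliseconds."""
    result = PipelineResult()
    for name, stage in stages:
        start = time.perf_counter()
        try:
            data = stage(data)
            result.outputs[name] = data
        except Exception as exc:
            # Record the failure and stop; later stages have nothing to consume
            result.errors.append(f"{name}: {exc}")
            break
        finally:
            # Runs on success and on failure, so every attempted stage is timed
            result.module_latencies[name] = (time.perf_counter() - start) * 1000.0
    return result


# Stub stages standing in for the real ASR, translation, and TTS modules
def fake_asr(audio: Any) -> str:
    return "hello"


def fake_translate(text: str) -> str:
    return f"[hi] {text}"


def fake_tts(text: str) -> List[float]:
    return [0.0] * len(text)


demo_result = run_pipeline(
    b"raw-audio",
    [("asr", fake_asr), ("translation", fake_translate), ("tts", fake_tts)],
)
print(demo_result.success, list(demo_result.module_latencies))
```

Because the orchestrator only knows each stage by name and callable, a module can be swapped or tested in isolation, which is the separation of concerns this demo relies on.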

## Key Technical Decisions

| Decision | Rationale |
|----------|----------|
| Whisper base model | Balance of speed and accuracy for demo |
| IndicTrans2 with IndicProcessor | Required preprocessing for proper translation |
| Chatterbox from GitHub | PyPI version lacks multilingual support |
| Decision Box (orchestrator) pattern | Separation of concerns, easier debugging |
| Gradio interface | User-friendly demo without custom frontend |

## Phase 2 Roadmap

- Video conferencing platform integration (Zoom, Meet, Teams)
- Real-time streaming (chunk-based processing)
- Bidirectional translation (Hindi to English)
- Voice selection and customization
- Custom vocabulary/glossary support
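
As a rough idea of what the chunk-based processing item could involve, the sketch below splits a sample buffer into fixed-length chunks that a pipeline could process one at a time. It is an assumption about one possible approach, not a Phase 2 design: `iter_chunks` and the 2-second chunk length are hypothetical, and a real streaming system would likely also need overlap or silence detection so that words are not cut at chunk boundaries.

```python
from typing import Iterator, Sequence


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int,
    chunk_seconds: float = 2.0,  # hypothetical default, not a tuned value
) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length slices; the last one may be shorter."""
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * chunk_seconds must be at least 1 sample")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


# Five seconds of silence at 16 kHz, split into 2-second chunks
sample_rate = 16000
samples = [0.0] * (5 * sample_rate)
chunks = list(iter_chunks(samples, sample_rate, chunk_seconds=2.0))
print([len(c) for c in chunks])  # [32000, 32000, 16000]
```

The function only uses `len` and slicing, so it works unchanged on a NumPy array such as the one `gr.Audio(type="numpy")` provides.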

---