# Wav2Vec2 Hindi ASR Model - Interactive Notebook

This notebook demonstrates how to use the SPRING-INX Wav2Vec2 model for Hindi speech-to-text transcription.

## What we're doing:
- Loading a pre-trained Wav2Vec2 model for Hindi
- Transcribing audio files from Hindi speech to text
- Measuring inference time on CPU (Xeon)

Let's get started! 🎤→📝

## Step 1: Import Required Libraries

In [14]:
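import sys
# Environment-specific: puts this project's virtualenv packages on the import path.
# Adjust or remove this line for your own setup.
sys.path.insert(0, '/home/ubuntu/ASR/ASR/Wav2Vec2/.venv/lib/python3.10/site-packages')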

import torch
import fairseq
import librosa
import torch.nn.functional as F
import torchaudio.sox_effects as ta_sox
import time
import numpy as np
from pathlib import Path

print("✓ All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print("Device: CPU")

✓ All libraries imported successfully!
PyTorch version: 2.1.0+cpu
Device: CPU


## Step 2: Define the AudioToText Class

In [15]:
class AudioToText:
    """Wav2Vec2 based Hindi speech-to-text transcriber"""
    
    DEFAULT_SAMPLING_RATE = 16000

    def __init__(self, model_path="hindi.pt", warmup_iterations=1, sample_audio_path="./samples/hindi.wav"):
        """
        Initialize the ASR model.
        
        Args:
            model_path: Path to the .pt model file
            warmup_iterations: Number of warmup iterations
            sample_audio_path: Path to a sample audio for warmup
        """
        print("Loading model...")
        self.model, self.cfg, self.task = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_path])
        self.model = self.model[0]
        self.dtype = torch.float32
        self.model.to(self.dtype)
        self.model.eval()
        print(f"✓ Model loaded from {model_path}")

        self.effects = [["gain", "-n"]]
        self.token = self.task.target_dictionary
        self.warmup_audio_path = sample_audio_path
        
        print(f"Running {warmup_iterations} warmup iteration(s)...")
        self.warmup(warmup_iterations)
        print("✓ Model ready for inference!")
    
    def warmup(self, warmup_iters: int):
        """Run warmup iterations to optimize performance"""
        for i in range(warmup_iters):
            with torch.no_grad():
                _ = self.transcribe(self.warmup_audio_path)
            print(f"  Warmup {i+1}/{warmup_iters} complete")

    def transcribe(self, path, sample_rate=DEFAULT_SAMPLING_RATE):
        """
        Transcribe an audio file to text.
        
        Args:
            path: Path to audio file
            sample_rate: Rate the audio is resampled to on load (default: 16 kHz)
            
        Returns:
            transcriptions: List of transcribed text
            total_time: Seconds spent on preprocessing, inference, and decoding
                (loading the audio file is not included)
        """
        # Load audio
        audio, sr = librosa.load(path, sr=sample_rate)
        
        # Start timing
        st = time.perf_counter()
        
        # Apply effects
        input_sample, rate = ta_sox.apply_effects_tensor(
            torch.tensor(audio).unsqueeze(0), sample_rate, self.effects)
        input_sample = input_sample.to(self.dtype)
        
        # Normalize
        with torch.no_grad():
            input_sample = F.layer_norm(input_sample, input_sample.shape)

        # Get model predictions
        with torch.no_grad():
            logits = self.model(source=input_sample, padding_mask=None)['encoder_out']
        
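        # Greedy decoding: take the most likely token per frame, then collapse
        # consecutive repeats. The transpose puts the batch dimension first.
        predicted_ids = torch.argmax(logits, dim=-1)
        predicted_ids = torch.unique_consecutive(predicted_ids.T, dim=1).tolist()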

        # Convert token IDs to text: remove the spaces between tokens,
        # then turn each '|' word delimiter into a space
        transcriptions = []
        for ids in predicted_ids:
            transcription = self.token.string(ids)
            transcription = transcription.replace(' ', "").replace('|', " ").strip()
            transcriptions.append(transcription)
        
        total_time = time.perf_counter() - st
        
        return transcriptions, total_time

print("✓ AudioToText class defined successfully!")

✓ AudioToText class defined successfully!
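
The last part of `transcribe` follows the greedy CTC decoding pattern: pick the most likely token per frame, collapse consecutive repeats, and turn the `|` word delimiter into spaces. The sketch below walks through those steps in plain Python on a toy vocabulary. The vocabulary, the use of index 0 as the blank token, and the name `greedy_ctc_decode` are assumptions for illustration, not values read from `hindi.pt`. It also drops the blank explicitly, whereas the class above leaves special symbols to the dictionary's `string` method.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank
# and "|" marks a word boundary. A real model's dictionary will differ.
TOY_VOCAB = ["<blank>", "|", "h", "i", "o", "k"]
BLANK_ID = 0


def greedy_ctc_decode(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame labels, drop blanks, and join into text."""
    # Step 1: collapse runs of identical consecutive IDs.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token that separates genuine repeats.
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # Step 3: join characters and turn word delimiters into spaces.
    return "".join(tokens).replace("|", " ").strip()


# Per-frame argmax IDs spelling "hi ok", with repeats and blanks mixed in.
frames = [2, 2, 0, 3, 3, 1, 1, 0, 4, 0, 5, 5]
print(greedy_ctc_decode(frames))  # prints "hi ok"
```

Collapsing before removing blanks matters: a blank between two identical IDs keeps them as two separate characters, which is how a doubled letter survives decoding.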


## Step 3: Initialize the Model

In [10]:
# Create the transcriber instance
transcriber = AudioToText(
    model_path="hindi.pt",
    warmup_iterations=1,
    sample_audio_path="./samples/hindi.wav"
)

Loading model...
✓ Model loaded from hindi.pt
Running 1 warmup iteration(s)...
  Warmup 1/1 complete
✓ Model ready for inference!
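
Every call, including the warmup, runs the audio through two normalizations before the model sees it: the sox `gain -n` effect scales the signal to full peak level, and `F.layer_norm` over the whole tensor shifts it to zero mean and unit variance. The following is a rough pure-Python sketch of those two steps on a short list of samples. The function names and sample values are made up for illustration, and the real code operates on tensors rather than lists.

```python
import math


def peak_normalize(samples):
    """Scale samples so the largest absolute value is 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # Silence stays silence.
    return [s / peak for s in samples]


def standardize(samples, eps=1e-5):
    """Shift to zero mean and scale to unit variance over the whole signal.

    Assumes a non-empty list of samples.
    """
    n = len(samples)
    mean = sum(samples) / n
    # Biased variance (divide by n), matching layer normalization.
    variance = sum((s - mean) ** 2 for s in samples) / n
    return [(s - mean) / math.sqrt(variance + eps) for s in samples]


waveform = [0.1, -0.2, 0.4, -0.1]  # Made-up samples for illustration.
normalized = standardize(peak_normalize(waveform))
print([round(v, 3) for v in normalized])
```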


## Step 4: Transcribe Audio Files

In [16]:
# Transcribe the first sample
print("Transcribing: ./samples/hindi.wav")
print("=" * 50)
transcription, inference_time = transcriber.transcribe("./samples/hindi.wav")
print(f"\nTranscription:")
for text in transcription:
    print(f"  {text}")
print(f"\nInference Time: {inference_time:.4f} seconds")

Transcribing: ./samples/hindi.wav



Transcription:
  साथियों लोकल प्रोडक्ट को ग्लोबल बनाने में हमारे जम्मू-कश्मीर के लोग भी पीछे नहीं है पिछले महीने

Inference Time: 0.3407 seconds
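
A single timing like the one above can vary from run to run. One common way to get a steadier number is to discard a few warmup calls and report the median over several repeats. The helper below is a minimal sketch of that idea; `time_callable` and `fake_workload` are hypothetical names, and the workload is a stand-in so the example runs without the model.

```python
import statistics
import time


def time_callable(fn, warmup=1, repeats=5):
    """Return the median wall-clock seconds of fn() after warmup calls."""
    for _ in range(warmup):
        fn()  # Warmup results are discarded.
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def fake_workload():
    """Stand-in for a transcription call; sums a range to use a little CPU."""
    return sum(range(10_000))


median_seconds = time_callable(fake_workload, warmup=1, repeats=5)
print(f"Median time: {median_seconds:.6f} seconds")
```

In this notebook one could pass something like `lambda: transcriber.transcribe("./samples/hindi.wav")` as `fn`. Note that this measures the whole call, including audio loading, so it will read higher than the `total_time` value that `transcribe` returns.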


## Step 5: Transcribe Second Sample

In [17]:
# Transcribe the second sample
print("Transcribing: ./samples/hindi2.wav")
print("=" * 50)
transcription, inference_time = transcriber.transcribe("./samples/hindi2.wav")
print(f"\nTranscription:")
for text in transcription:
    print(f"  {text}")
print(f"\nInference Time: {inference_time:.4f} seconds")

Transcribing: ./samples/hindi2.wav



Transcription:
  जम्मू-कश्मीर ने जो कर दिखाया है वह देश भर के लोगों के लिए भी एक मिसाल है यहां के पुलवामा से।

Inference Time: 0.3313 seconds
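
Raw inference times are hard to compare across clips of different lengths. A common way to normalize them is the real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means faster than real time. The sketch below computes it with the standard-library `wave` module, which reads PCM WAV files only. The helper names, the demo file, and the 0.34 s figure are assumptions for illustration; the durations of the sample clips are not stated in this notebook.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(inference_seconds, audio_seconds):
    """Processing time divided by audio length."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


# Write one second of 16 kHz mono silence so the example runs on its own.
demo_path = os.path.join(tempfile.gettempdir(), "rtf_demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 16-bit samples
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

duration = wav_duration_seconds(demo_path)
# 0.34 s is a made-up inference time used only for this demonstration.
rtf = real_time_factor(0.34, duration)
print(f"Duration: {duration:.2f} s, RTF: {rtf:.2f}")
```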


## Step 6: Batch Transcription (Optional)

In [13]:
# Transcribe multiple files at once
from pathlib import Path

audio_files = [
    "./samples/hindi.wav",
    "./samples/hindi2.wav",
    "./samples/modi_speech.wav"
]

results = {}

print("Batch Transcription Results")
print("=" * 50)

for audio_file in audio_files:
    if Path(audio_file).exists():
        print(f"\nFile: {audio_file}")
        try:
            transcription, inference_time = transcriber.transcribe(audio_file)
            results[audio_file] = {
                'transcription': transcription,
                'time': inference_time
            }
            print(f"  Text: {transcription[0] if transcription else 'No text'}")
            print(f"  Time: {inference_time:.4f}s")
        except Exception as e:
            print(f"  Error: {e}")
    else:
        print(f"\nFile not found: {audio_file}")

print("\n" + "=" * 50)
print("Batch processing complete!")

Batch Transcription Results

File: ./samples/hindi.wav


  Text: साथियों लोकल प्रोडक्ट को ग्लोबल बनाने में हमारे जम्मू-कश्मीर के लोग भी पीछे नहीं है पिछले महीने
  Time: 0.4361s

File: ./samples/hindi2.wav
  Text: जम्मू-कश्मीर ने जो कर दिखाया है वह देश भर के लोगों के लिए भी एक मिसाल है यहां के पुलवामा से।
  Time: 0.4284s

File: ./samples/modi_speech.wav
  Text: स सा थि यो  लोकल प्रोडक्ट को ग्लोबल बनाने में हमारे जम्मू-कश्मीर के लोग भी पीछे नहीं है पिछले महीने जम्मू-कश्मीर ने जो कर दिखाया है वो देश भर के लोगों के ल
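
The `results` dictionary built in the loop can be reduced to a few summary numbers once the batch finishes. The following sketch assumes the same shape as in Step 6 (a `'time'` entry per file). The `summarize` helper is a hypothetical name, the two times are the ones printed above for `hindi.wav` and `hindi2.wav`, and the transcription text is a placeholder.

```python
# Example data shaped like the `results` dict built in Step 6.
# The times are copied from the batch output; the text is a placeholder.
results = {
    "./samples/hindi.wav": {"transcription": ["<text>"], "time": 0.4361},
    "./samples/hindi2.wav": {"transcription": ["<text>"], "time": 0.4284},
}


def summarize(results):
    """Return (file count, total seconds, mean seconds) for a results dict."""
    times = [entry["time"] for entry in results.values()]
    if not times:
        return 0, 0.0, 0.0
    total = sum(times)
    return len(times), total, total / len(times)


count, total, mean = summarize(results)
print(f"{count} files, total {total:.4f}s, mean {mean:.4f}s")
```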

## Summary

✅ **What we accomplished:**
- Loaded the Wav2Vec2 Hindi ASR model on CPU
- Transcribed audio files to text
- Measured inference performance

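📊 **Performance on Xeon CPU:**
- Measured inference time: 0.33 to 0.44 seconds per call for `hindi.wav` and `hindi2.wav` across the runs above; the timer starts after the audio is loaded, so file I/O is not included
- The model runs on CPU; no GPU is required
- Batch processing works by transcribing files one at a time in a loop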

🔧 **Next Steps:**
- Try with different Hindi audio files
- Download models for other Indian languages (Tamil, Telugu, Bengali, etc.)
- Integrate into a web service or API
- Deploy as a microservice

📚 **Resources:**
- [SPRING-INX Model Repository](https://asr.iitm.ac.in/models)
- [Fairseq Documentation](https://fairseq.readthedocs.io/)
- [Wav2Vec2 Paper](https://arxiv.org/abs/2006.11477)