# Researching Text-to-Speech Generation from SRT Files

This notebook implements a pipeline that converts subtitle files (SRT format) into natural-sounding speech audio. It uses the Coqui TTS library to generate speech for each subtitle segment, then places the segments on a single timeline so that each one starts and ends at its subtitle's timestamps.

## Overview

The notebook provides functionality to:
1. Parse SRT subtitle files
2. Generate speech for each subtitle segment using various TTS models
3. Adjust speech speed to match subtitle timing
4. Combine all segments into a continuous audio file

This approach is useful for creating voiceovers for videos, generating audio versions of subtitled content, or creating accessible versions of media.

## Dependencies

The implementation relies on the following libraries:
- `srt`: For parsing SRT subtitle files
- `torch`: PyTorch for deep learning operations
- `coqui-tts`: Text-to-Speech library for generating speech
- `pydub`: For audio processing and manipulation
- `IPython.display`: For audio playback in the notebook

## Key Components

### SRT Parsing
The notebook provides functionality to parse SRT subtitle files, extracting timing information and text content.
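To illustrate what parsing extracts, the sketch below reads a small SRT snippet using only the standard library. It is not how the `srt` package works internally; the helper name `parse_srt_text` and the timestamps in `SAMPLE` are made up for this example. It produces the same `(start, end, content)` tuple shape that the notebook's `parse_srt` returns.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)

def parse_srt_text(text):
    """Return a list of (start, end, content) tuples from SRT-formatted text."""
    segments = []
    # Subtitle blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for idx, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
                end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
                # Everything after the timing line is the subtitle text.
                content = "\n".join(lines[idx + 1:]).strip()
                segments.append((start, end, content))
                break
    return segments

# Illustrative input: the timestamps here are invented for the example.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hey Sarah, did you finish the project report?

2
00:00:04,000 --> 00:00:06,250
Almost done!
"""

segments = parse_srt_text(SAMPLE)
for start, end, content in segments:
    print(start.total_seconds(), end.total_seconds(), content)
```

For real files, the `srt` package is the safer choice because it handles malformed input that this sketch ignores.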

### Text-to-Speech Generation
Three model identifiers are defined:
- English: `tts_models/eng/fairseq/vits`
- Vietnamese: `tts_models/vie/fairseq/vits`
- Multilingual with selectable voices: `tts_models/multilingual/multi-dataset/xtts_v2`

### Audio Processing
The audio-processing step:
- Adjusts speech speed to match subtitle timing by changing the playback frame rate, which also changes pitch
- Handles silences between subtitles
- Combines multiple audio segments into a continuous track
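The timing arithmetic behind these steps can be sketched without any audio library. The function below follows the same formula the notebook uses (`speed_ratio = original_duration / target_duration`); the name `plan_segment` and the example numbers are assumptions made for illustration.

```python
def plan_segment(start_ms, end_ms, speech_ms, cursor_ms):
    """Compute the timing numbers for one subtitle segment.

    start_ms, end_ms: subtitle start and end on the timeline
    speech_ms: duration of the generated speech clip
    cursor_ms: current end of the combined track
    """
    target_ms = end_ms - start_ms
    if target_ms <= 0 or speech_ms <= 0:
        raise ValueError("durations must be positive")
    # Ratio > 1 means the clip must play faster to fit its slot.
    speed_ratio = speech_ms / target_ms
    # Silence needed before this segment so it starts on time.
    gap_ms = max(0, start_ms - cursor_ms)
    return {"target_ms": target_ms, "speed_ratio": speed_ratio, "gap_ms": gap_ms}

# A 3-second clip that must fit a 2-second subtitle starting at 1 s.
plan = plan_segment(start_ms=1000, end_ms=3000, speech_ms=3000, cursor_ms=0)
print(plan)
```

In this example the clip has to play 1.5 times faster, and one second of silence is inserted before it.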

## Experimentation Guidelines

This section provides guidelines on how to experiment with and extend this notebook for your text-to-speech projects.

1. **Set Up the Environment**:
   - Create a virtual environment: `python3 -m venv venv`
   - Activate it: `source venv/bin/activate` (on Windows: `venv\Scripts\activate`)
   - Install dependencies: `pip3 install -r requirements.txt`

2. **Prepare Your SRT Files**:
   - Place your SRT files in the `srt/` directory
   - Ensure they are well-formed and UTF-8 encoded (`parse_srt` opens files with `encoding='utf-8'`)

3. **Basic Usage**:
   - Run the example code cell to convert an SRT file to audio:
     ```python
     create_audio_from_srt("srt/your_file.srt", "output_filename.wav")
     ```
   - Listen to the generated audio to evaluate quality
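Because `parse_srt` reads files as UTF-8, a file in another encoding fails with a `UnicodeDecodeError`. The following is a minimal pre-flight check, assuming the files live in the `srt/` directory mentioned above; the helper names `utf8_problem` and `check_srt_files` are hypothetical.

```python
from pathlib import Path

def utf8_problem(data):
    """Return None if `data` (bytes) is valid UTF-8, else a short description."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"invalid UTF-8 at byte offset {exc.start}"
    return None

def check_srt_files(directory="srt"):
    """Map each .srt file name in `directory` to its problem (None if it is fine)."""
    return {
        path.name: utf8_problem(path.read_bytes())
        for path in sorted(Path(directory).glob("*.srt"))
    }
```

Calling `check_srt_files()` before running the pipeline lists every file that would fail to parse, rather than stopping at the first one.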

In [1]:
import srt
import torch
import os
from TTS.api import TTS
from IPython.display import Audio
from pydub import AudioSegment
import tempfile



In [2]:
def parse_srt(srt_file):
    with open(srt_file, 'r', encoding='utf-8') as file:
        subtitles = srt.parse(file.read())
        segments = [(sub.start, sub.end, sub.content) for sub in subtitles]
    return segments

In [3]:
model_vie = "tts_models/vie/fairseq/vits"
model_eng = "tts_models/eng/fairseq/vits"
model_multilang = "tts_models/multilingual/multi-dataset/xtts_v2"

In [4]:
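def text_to_speech(text, output_path, model_name):
    """Synthesize `text` with a single-speaker model and write it to `output_path`."""
    try:
        device = "cuda" if torch.cuda.is_available() else "cpu"
        # Note: the model is loaded on every call, which is slow for many segments.
        api = TTS(model_name=model_name, progress_bar=False).to(device)
        result = api.tts_to_file(text=text, file_path=output_path)
        print("result", result)
    except Exception as e:
        # Errors are reported, not raised; callers detect failure by the missing file.
        print(f"❌ TTS failed: {e}")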

In [5]:
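def text_to_speech_multilang(text, output_path, speaker, model_name, language="en"):
    """Synthesize `text` with a multi-speaker, multilingual model such as XTTS v2."""
    try:
        device = "cuda" if torch.cuda.is_available() else "cpu"
        tts = TTS(model_name=model_name, progress_bar=True).to(device)
        result = tts.tts_to_file(
            text=text,
            speaker=speaker,  # e.g. "Craig Gutsy"
            language=language,
            file_path=output_path,
        )
        print("result", result)
    except Exception as e:
        # Errors are reported, not raised; callers detect failure by the missing file.
        print(f"❌ TTS failed: {e}")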
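
Both functions above construct a new `TTS` object on every call, so each subtitle segment pays the model-loading cost again. One possible remedy is to cache loaded models by name. The sketch below shows the pattern with `functools.lru_cache`; `load_model` is a hypothetical stand-in for the expensive constructor (for example `TTS(model_name=...)`), used here so the example runs without downloading a model.

```python
from functools import lru_cache

def load_model(model_name):
    """Hypothetical stand-in for an expensive model load."""
    load_model.calls += 1
    return {"name": model_name}

load_model.calls = 0  # counts how often the expensive load actually runs

@lru_cache(maxsize=4)
def get_model(model_name):
    """Return a cached model, loading it only on the first request."""
    return load_model(model_name)

first = get_model("tts_models/eng/fairseq/vits")
second = get_model("tts_models/eng/fairseq/vits")
print(first is second, load_model.calls)
```

The second request returns the same object without reloading. The `maxsize` value is arbitrary; large models may call for a smaller cache to limit memory use.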

In [6]:
def combine_audio_segments(segments, model_name="tts_models/eng/fairseq/vits", temp_dir="audio"):
    """
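    Generate speech for each subtitle and place it on a shared timeline.

    Each clip is sped up or slowed down to fit its subtitle slot, and
    silence fills the gaps between subtitles.

    Args:
        segments: List of tuples (start_time, end_time, content), with
            start_time and end_time as datetime.timedelta
        model_name: TTS model to use
        temp_dir: Directory for temporary audio files

    Returns:
        pydub.AudioSegment containing the combined track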
    """
    
    
    # Create temp directory if it doesn't exist
    os.makedirs(temp_dir, exist_ok=True)
    
    combined = AudioSegment.silent(duration=0)
    current_time = 0  # Track current position in milliseconds
    
    for i, (start, end, content) in enumerate(segments):
        # Convert timedelta to milliseconds
        start_ms = start.total_seconds() * 1000
        end_ms = end.total_seconds() * 1000
        target_duration_ms = end_ms - start_ms
        
        # Skip empty content
        if not content.strip():
            continue
        
        # Generate TTS audio for this segment
        temp_audio_path = os.path.join(temp_dir, f"segment_{i}.wav")
        
        try:
            # Generate TTS audio
            text_to_speech(content.strip(), temp_audio_path, model_name)
            
            # Load the generated audio
            if os.path.exists(temp_audio_path):
                audio_segment = AudioSegment.from_wav(temp_audio_path)
                original_duration_ms = len(audio_segment)
                
                # Calculate speed ratio to fit target duration
                if original_duration_ms > 0 and target_duration_ms > 0:
                    speed_ratio = original_duration_ms / target_duration_ms
                    
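                    # Adjust playback speed by reinterpreting the samples at a scaled
                    # frame rate, then resampling back to the original rate.
                    # This also shifts pitch by the same factor.
                    # speed_ratio > 1: speed up (shorter duration)
                    # speed_ratio < 1: slow down (longer duration)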
                    adjusted_audio = audio_segment._spawn(
                        audio_segment.raw_data, 
                        overrides={"frame_rate": int(audio_segment.frame_rate * speed_ratio)}
                    ).set_frame_rate(audio_segment.frame_rate)
                    
                    # Ensure the duration matches exactly (trim or pad if needed)
                    if len(adjusted_audio) > target_duration_ms:
                        adjusted_audio = adjusted_audio[:int(target_duration_ms)]
                    elif len(adjusted_audio) < target_duration_ms:
                        padding = AudioSegment.silent(duration=int(target_duration_ms - len(adjusted_audio)))
                        adjusted_audio = adjusted_audio + padding
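                else:
                    # Fallback: empty audio or a non-positive subtitle duration, so use silence
                    speed_ratio = 1.0  # no speed change applied; keeps the log line below valid
                    adjusted_audio = AudioSegment.silent(duration=max(0, int(target_duration_ms)))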
                
                # Add silence gap if needed (between current position and start time)
                if start_ms > current_time:
                    silence_duration = start_ms - current_time
                    silence = AudioSegment.silent(duration=int(silence_duration))
                    combined += silence
                
                # Add the adjusted audio segment
                combined += adjusted_audio
                current_time = end_ms
                
                # Clean up temp file
                os.remove(temp_audio_path)
                
                print(f"✅ Processed segment {i+1}: '{content[:50]}...' (Speed ratio: {speed_ratio:.2f})")
                
            else:
                print(f"❌ Failed to generate audio for segment {i+1}")
                # Add silence for failed segments
                if start_ms > current_time:
                    silence_duration = start_ms - current_time
                    silence = AudioSegment.silent(duration=int(silence_duration))
                    combined += silence
                
                silence_segment = AudioSegment.silent(duration=int(target_duration_ms))
                combined += silence_segment
                current_time = end_ms
                
        except Exception as e:
            print(f"❌ Error processing segment {i+1}: {e}")
            # Add silence for failed segments
            if start_ms > current_time:
                silence_duration = start_ms - current_time
                silence = AudioSegment.silent(duration=int(silence_duration))
                combined += silence
            
            silence_segment = AudioSegment.silent(duration=int(target_duration_ms))
            combined += silence_segment
            current_time = end_ms
    
    # Remove the temp directory; os.rmdir only succeeds if it is empty
    try:
        os.rmdir(temp_dir)
    except OSError:
        pass  # Directory is not empty or no longer exists
    
    return combined

# End-to-end helper:
def create_audio_from_srt(srt_file, output_audio, model_name="tts_models/eng/fairseq/vits"):
    """
    Complete workflow: parse SRT -> generate TTS -> combine audio
    """
    # Parse the SRT file
    segments = parse_srt(srt_file)
    print(f"📄 Parsed {len(segments)} segments from {srt_file}")
    
    # Generate and combine audio
    combined_audio = combine_audio_segments(segments, model_name)
    
    # Export the final audio
    combined_audio.export(output_audio, format="wav")
    print(f"🎵 Audio saved to: {output_audio}")
    
    return combined_audio
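
`combine_audio_segments` inserts silence only when a segment starts after the current position. If a subtitle starts before the previous one ends, its audio is still appended in full, so every later segment lands after its subtitle time. A small pre-check can flag such files before synthesis. The helper name `find_overlaps` and the sample timings below are assumptions for illustration.

```python
from datetime import timedelta

def find_overlaps(segments):
    """Return the indices of segments that start before the previous one ends.

    `segments` is a list of (start, end, content) tuples, as returned by parse_srt.
    """
    overlaps = []
    for i in range(1, len(segments)):
        previous_end = segments[i - 1][1]
        start = segments[i][0]
        if start < previous_end:
            overlaps.append(i)
    return overlaps

# Illustrative data: the second subtitle starts 0.5 s before the first one ends.
sample = [
    (timedelta(seconds=0), timedelta(seconds=2), "first"),
    (timedelta(seconds=1.5), timedelta(seconds=3), "second"),
    (timedelta(seconds=3), timedelta(seconds=4), "third"),
]
overlapping = find_overlaps(sample)
print(overlapping)
```

How to resolve an overlap (trim the earlier clip, shift the later one, or mix both) is a design choice this sketch leaves open.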

In [11]:
create_audio_from_srt("srt/version1.srt", "version1.wav")

📄 Parsed 7 segments from srt/version1.srt


hey sarah, did you finish the project report?
Character ',' not found in the vocabulary. Discarding it.
hey sarah, did you finish the project report?
Character '?' not found in the vocabulary. Discarding it.


result audio/segment_0.wav
✅ Processed segment 1: 'Hey Sarah, did you finish the project report?...' (Speed ratio: 1.49)


almost done!
Character '!' not found in the vocabulary. Discarding it.
just need to review the final section.
Character '.' not found in the vocabulary. Discarding it.


result audio/segment_1.wav
✅ Processed segment 2: 'Almost done! Just need to review the final section...' (Speed ratio: 1.18)


great!
Character '!' not found in the vocabulary. Discarding it.
the deadline is tomorrow, right?
Character ',' not found in the vocabulary. Discarding it.
the deadline is tomorrow, right?
Character '?' not found in the vocabulary. Discarding it.


result audio/segment_2.wav
✅ Processed segment 3: 'Great! The deadline is tomorrow, right?...' (Speed ratio: 1.95)


yeah, but i think we're in good shape.
Character ',' not found in the vocabulary. Discarding it.
yeah, but i think we're in good shape.
Character '.' not found in the vocabulary. Discarding it.
how's your presentation coming along?
Character '?' not found in the vocabulary. Discarding it.


result audio/segment_3.wav
✅ Processed segment 4: 'Yeah, but I think we're in good shape. How's your ...' (Speed ratio: 1.27)


pretty well.
Character '.' not found in the vocabulary. Discarding it.


result audio/segment_4.wav
✅ Processed segment 5: 'Pretty well. I'm just polishing the slides now....' (Speed ratio: 1.61)


perfect.
Character '.' not found in the vocabulary. Discarding it.
want to grab coffee after we submit everything?
Character '?' not found in the vocabulary. Discarding it.


result audio/segment_5.wav
✅ Processed segment 6: 'Perfect. Want to grab coffee after we submit every...' (Speed ratio: 1.56)


absolutely!
Character '!' not found in the vocabulary. Discarding it.
i could really use a break by then.
Character '.' not found in the vocabulary. Discarding it.


result audio/segment_6.wav
✅ Processed segment 7: 'Absolutely! I could really use a break by then....' (Speed ratio: 1.78)
🎵 Audio saved to: version1.wav
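
Every speed ratio in this run is above 1, meaning each generated clip was longer than its subtitle slot and had to be sped up. Because the speed change is done through the frame rate, pitch rises by the same factor. The sketch below extracts the ratios from the log lines above and summarizes them; the regular expression assumes the exact `Speed ratio: X.XX` format printed by `combine_audio_segments`.

```python
import re
from statistics import mean

# Log lines copied from the run above.
LOG = """
✅ Processed segment 1: 'Hey Sarah, did you finish the project report?...' (Speed ratio: 1.49)
✅ Processed segment 2: 'Almost done! Just need to review the final section...' (Speed ratio: 1.18)
✅ Processed segment 3: 'Great! The deadline is tomorrow, right?...' (Speed ratio: 1.95)
✅ Processed segment 4: 'Yeah, but I think we're in good shape. How's your ...' (Speed ratio: 1.27)
✅ Processed segment 5: 'Pretty well. I'm just polishing the slides now....' (Speed ratio: 1.61)
✅ Processed segment 6: 'Perfect. Want to grab coffee after we submit every...' (Speed ratio: 1.56)
✅ Processed segment 7: 'Absolutely! I could really use a break by then....' (Speed ratio: 1.78)
"""

ratios = [float(value) for value in re.findall(r"Speed ratio: (\d+\.\d+)", LOG)]
summary = {
    "count": len(ratios),
    "min": min(ratios),
    "max": max(ratios),
    "mean": round(mean(ratios), 2),
}
print(summary)
```

A summary like this makes it easy to compare models or subtitle timings: ratios close to 1 mean the synthesized speech already fits its slot with little distortion.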
