# TTS Data Enhancement: Punctuation & Padding

This notebook demonstrates two preprocessing steps that improve TTS fine-tuning quality:

1. **Restore punctuation**  
   Many transcripts lack punctuation, which leads to unnatural pauses and robotic prosody.  
   We use a multilingual punctuation restoration model to add punctuation marks to the text, so the TTS model can learn where pauses belong and produce more natural-sounding speech.

2. **Add padding to audio**  
   During training, at around **60k steps**, glitches consistently appeared at the start of audio clips.  
   To mitigate this, we prepend a short stretch of silence to each audio file, which helps the model handle artifacts at the start of a clip and learn better prosody.


In [None]:
import pandas as pd

df = pd.read_csv("metadata.csv", sep="|", header=None, names=["id", "text", "phonemes"])

# remove a trailing ".wav" only (str.replace would also strip ".wav" from the middle of an id)
df["id"] = df["id"].str.removesuffix(".wav")


In [None]:
!pip install deepmultilingualpunctuation


In [None]:
from deepmultilingualpunctuation import PunctuationModel




In [None]:
model = PunctuationModel()





AttributeError: 'float' object has no attribute 'lower'
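
An `AttributeError` like the one above usually means a non-string value reached a string method. With pandas, an empty transcript field is parsed as `NaN`, which is a float. The sketch below shows one way to find such rows before running the model; the two-row sample and the names `sample_df` and `clean_df` are made up for illustration and are not part of the original dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same pipe-separated layout as metadata.csv;
# the second row has an empty transcript field.
sample = io.StringIO("clip_0001.wav|hello world|h@loU w3:ld\nclip_0002.wav||\n")
sample_df = pd.read_csv(sample, sep="|", header=None, names=["id", "text", "phonemes"])

# Empty fields are parsed as NaN (a float, not a str).
missing = sample_df["text"].isna()
print(sample_df.loc[missing, "id"].tolist())

# One option: drop rows without a transcript before restoring punctuation.
clean_df = sample_df[~missing].reset_index(drop=True)
```

Whether to drop such rows or fix the transcripts by hand depends on the dataset; dropping is only the simplest choice.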

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# The punctuated CSV is copied to Drive in a later cell (adjust the path as needed)



Mounted at /content/drive


In [None]:
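def restore_punct(text):
    # Empty transcript fields are read as NaN (a float); return an empty
    # string for them instead of the literal text "nan".
    if not isinstance(text, str) or not text.strip():
        return ""
    try:
        return model.restore_punctuation(text).lower()
    except Exception:
        return text.lower()  # fall back to the unpunctuated text if the model fails

df["text"] = df["text"].apply(restore_punct)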

# Save new CSV
df.to_csv("metadata_with_punct.csv", sep="|", index=False, header=False)
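
Before training on the new CSV, it can be worth confirming that punctuation restoration changed only punctuation and left the words alone. The helper names `words_only` and `only_punctuation_changed` below are hypothetical, and the check assumes ASCII punctuation only, so treat it as a rough sanity check rather than a guarantee.

```python
import string

def words_only(text):
    """Lowercase the text, split on whitespace, and strip ASCII punctuation from word edges."""
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in stripped if w]

def only_punctuation_changed(before, after):
    """Return True if both strings contain the same word sequence."""
    return words_only(before) == words_only(after)

# Example with made-up transcripts
print(only_punctuation_changed("hello world how are you", "hello world, how are you?"))
```

Applied row by row to the original and restored `text` columns, this flags transcripts where the word sequence differs and a manual look may be needed.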

In [None]:
!cp metadata_with_punct.csv /content/drive/MyDrive/metadata_with_punct.csv

In [2]:
from IPython.display import Audio

In [3]:
# Poor pausing
Audio("inference_output_paragraph_20k_steps.wav")

In [4]:
# Model re-learning the prosody
Audio("inference_output_paragraph_25k_steps.wav")

In [7]:
# Improves further, in both naturalness and pausing
Audio("inference_output_paragraph_30k_steps.wav")

In [None]:

!mv wavs wavs_original
!pip install pydub

In [None]:
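from pydub import AudioSegment
import os

orig_folder = "wavs_original"  # the folder renamed by `mv wavs wavs_original` above
out_folder = "wavs"
os.makedirs(out_folder, exist_ok=True)  # `wavs` no longer exists after the rename

padding_ms = 100  # length of the leading silence in milliseconds

for filename in os.listdir(orig_folder):
    if not filename.endswith(".wav"):
        continue
    path_in = os.path.join(orig_folder, filename)
    path_out = os.path.join(out_folder, filename)

    audio = AudioSegment.from_wav(path_in)
    # Match the clip's sample rate so concatenation does not resample anything
    silence = AudioSegment.silent(duration=padding_ms, frame_rate=audio.frame_rate)
    padded_audio = silence + audio
    padded_audio.export(path_out, format="wav")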
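
The same padding idea can also be sketched with only the standard-library `wave` module, which makes it easy to verify on a single file without pydub. This is an alternative illustration, not the code used above: `prepend_silence` is a hypothetical helper, and it assumes uncompressed PCM WAV input.

```python
import wave

def prepend_silence(path_in, path_out, padding_ms=100):
    """Write a copy of a PCM WAV file with leading digital silence.

    Returns the number of silent frames that were prepended.
    """
    with wave.open(path_in, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    n_pad = int(params.framerate * padding_ms / 1000)
    # 8-bit WAV samples are unsigned (silence is 0x80); wider samples are signed (silence is 0x00)
    zero = b"\x80" if params.sampwidth == 1 else b"\x00"
    silence = zero * (n_pad * params.nchannels * params.sampwidth)

    with wave.open(path_out, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is corrected on close
        dst.writeframes(silence + frames)
    return n_pad
```

For a 22,050 Hz clip, `padding_ms=100` adds 2,205 frames, so the padded file should be exactly 0.1 s longer than the original.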
