# Data collection and processing
## YouTube Audio Parsing & Preprocessing for TTS

This notebook demonstrates the complete pipeline to:

1. Download YouTube videos as audio
2. Crop and clean the audio
3. Separate vocals from the "accompaniment" (background music and noise)
4. Segment audio into chunks based on silence
5. Generate transcripts for fine-tuning using Whisper

This workflow produces a cleaned, segmented, and transcribed dataset for TTS fine-tuning.

In [None]:
!pip install yt-dlp
!pip install pydub
!apt install ffmpeg


Collecting yt-dlp
  Downloading yt_dlp-2025.8.11-py3-none-any.whl.metadata (175 kB)
Downloading yt_dlp-2025.8.11-py3-none-any.whl (3.3 MB)
Installing collected packages: yt-dlp
Successfully installed yt-dlp-2025.8.11
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [None]:
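from yt_dlp import YoutubeDL

# The URL carries a `list=` parameter, so yt-dlp fetches the whole playlist
# (16 items in the log below), not just the single video.
urls = [
    "https://www.youtube.com/watch?v=bQX1s5pVoiI&list=PLeitUu65aluiU6SzJWmfOFxwWqC9JeN-E&index=1",
]

ydl_opts = {
    'format': 'bestaudio/best',                # best available audio stream
    'outtmpl': 'downloads/%(title)s.%(ext)s',  # output path template
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',           # convert to WAV with ffmpeg after download
        'preferredcodec': 'wav',
        'preferredquality': '192',
    }],
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)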

[youtube:tab] Extracting URL: https://www.youtube.com/watch?v=bQX1s5pVoiI&list=PLeitUu65aluiU6SzJWmfOFxwWqC9JeN-E&index=1
[youtube:tab] Downloading playlist PLeitUu65aluiU6SzJWmfOFxwWqC9JeN-E - add --no-playlist to download just the video bQX1s5pVoiI
[youtube:tab] PLeitUu65aluiU6SzJWmfOFxwWqC9JeN-E: Downloading webpage
[youtube:tab] Extracting URL: https://www.youtube.com/playlist?list=PLeitUu65aluiU6SzJWmfOFxwWqC9JeN-E
[youtube:tab] PLeitUu65aluiU6SzJWmfOFxwWqC9JeN-E: Downloading webpage
[youtube:tab] PLeitUu65aluiU6SzJWmfOFxwWqC9JeN-E: Redownloading playlist API JSON with unavailable videos
[download] Downloading playlist: Quotidiano in classe - Regia Matteo Tabaro
[youtube:tab] PLeitUu65aluiU6SzJWmfOFxwWqC9JeN-E page 1: Downloading API JSON
[youtube:tab] Playlist Quotidiano in classe - Regia Matteo Tabaro: Downloading 16 items of 16
[download] Downloading item 1 of 16
[youtube] Extracting URL: https://www.youtube.com/watch?v=bQX1s5pVoiI
[youtube] bQX1s5pVoiI: Downloading webpage
[yo

## 2. Download Audio from YouTube

We define a list of URLs and download their audio in WAV format.

For training data, we use the voice of the narrator of il Resto del Carlino TG.


In [None]:
# The `list=` parameter makes yt-dlp fetch the whole playlist
# (27 items in the log below).
urls = [
    "https://www.youtube.com/watch?v=zIGKxi27L0Y&list=PLeitUu65aluhIQDrUblHtRQ0mO8sGGTg9&index=1",
]

ydl_opts = {
    'format': 'bestaudio/best',
    # Saved to a separate folder so the next cell can crop these files
    'outtmpl': 'downloads_to_crop/%(title)s.%(ext)s',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav',
        'preferredquality': '192',
    }],
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)

[youtube:tab] Extracting URL: https://www.youtube.com/watch?v=zIGKxi27L0Y&list=PLeitUu65aluhIQDrUblHtRQ0mO8sGGTg9&index=1
[youtube:tab] Downloading playlist PLeitUu65aluhIQDrUblHtRQ0mO8sGGTg9 - add --no-playlist to download just the video zIGKxi27L0Y
[youtube:tab] PLeitUu65aluhIQDrUblHtRQ0mO8sGGTg9: Downloading webpage
[youtube:tab] Extracting URL: https://www.youtube.com/playlist?list=PLeitUu65aluhIQDrUblHtRQ0mO8sGGTg9
[youtube:tab] PLeitUu65aluhIQDrUblHtRQ0mO8sGGTg9: Downloading webpage
[youtube:tab] PLeitUu65aluhIQDrUblHtRQ0mO8sGGTg9: Redownloading playlist API JSON with unavailable videos
[download] Downloading playlist: Euro 2016 - Regia Matteo Tabaro
[youtube:tab] PLeitUu65aluhIQDrUblHtRQ0mO8sGGTg9 page 1: Downloading API JSON
[youtube:tab] Playlist Euro 2016 - Regia Matteo Tabaro: Downloading 27 items of 27
[download] Downloading item 1 of 27
[youtube] Extracting URL: https://www.youtube.com/watch?v=zIGKxi27L0Y
[youtube] zIGKxi27L0Y: Downloading webpage
[youtube] zIGKxi27L0Y: Do
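
The log above shows yt-dlp expanding the URL into the full playlist and hinting at `--no-playlist`. If only the single video is wanted, the Python API accepts an equivalent `noplaylist` option. A minimal sketch, assuming the same output layout as the cells above; `build_ydl_opts` and `download_audio` are hypothetical helper names, not part of this notebook:

```python
def build_ydl_opts(out_dir, single_video=False):
    """Return yt-dlp options that extract WAV audio into out_dir."""
    return {
        "format": "bestaudio/best",
        "outtmpl": f"{out_dir}/%(title)s.%(ext)s",
        # noplaylist=True mirrors the --no-playlist CLI flag
        "noplaylist": single_video,
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "wav",
        }],
    }


def download_audio(urls, out_dir, single_video=False):
    # Imported lazily so the options can be inspected without yt-dlp installed
    from yt_dlp import YoutubeDL

    with YoutubeDL(build_ydl_opts(out_dir, single_video)) as ydl:
        ydl.download(urls)
```

Calling `download_audio(urls, "downloads_to_crop", single_video=True)` would then fetch only the video named in each URL and ignore its `list=` parameter.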

In [None]:
from pydub import AudioSegment
from pathlib import Path

input_folder = Path("downloads_to_crop")
output_folder = Path("downloads")
output_folder.mkdir(exist_ok=True)

for file_path in input_folder.glob("*.*"):
    if file_path.suffix.lower() not in [".wav", ".mp3"]:
        continue

    audio = AudioSegment.from_file(file_path)
    duration_ms = len(audio)

    # Drop the first 7 s and the last 10 s of each recording
    start_ms = 7_000
    end_ms = duration_ms - 10_000

    if end_ms <= start_ms:
        print(f"Skipping {file_path.name} (too short)")
        continue

    cropped = audio[start_ms:end_ms]
    output_path = output_folder / file_path.name
    cropped.export(output_path, format=file_path.suffix[1:])

    print(f"Cropped: {file_path.name} → {output_path.name}")


Cropped: Italia, l’attacco va e in panchina c’è un tesoro.wav → Italia, l’attacco va e in panchina c’è un tesoro.wav
Cropped: Esplode la Francia di Griezmann, ma ora c’è la Germania.wav → Esplode la Francia di Griezmann, ma ora c’è la Germania.wav
Cropped: L’Eire punisce Italia 2, con la Spagna tornano i titolari.wav → L’Eire punisce Italia 2, con la Spagna tornano i titolari.wav
Cropped: Berlusconi spalanca le porte ai cinesi, agli Europei Ronaldo contro Bale.wav → Berlusconi spalanca le porte ai cinesi, agli Europei Ronaldo contro Bale.wav
Cropped: Con l'Eire Conte scopre l’Italia b.wav → Con l'Eire Conte scopre l’Italia b.wav
Cropped: Italia-Germania l’ora della passione, e c’è anche De Rossi.wav → Italia-Germania l’ora della passione, e c’è anche De Rossi.wav
Cropped: Tavecchio sicuro： Conte lascia una grande eredità a Ventura.wav → Tavecchio sicuro： Conte lascia una grande eredità a Ventura.wav
Cropped: La Francia vuole il suo Europeo ma Ronaldo può ribaltare il pronostico.wav → L
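
The cropping rule in the cell above can be isolated into a small function, which makes the "too short" case easy to check without loading any audio. This is only a sketch: `crop_bounds` is a hypothetical helper, and its defaults copy the 7 s and 10 s values used above.

```python
def crop_bounds(duration_ms, head_ms=7_000, tail_ms=10_000):
    """Return (start_ms, end_ms) after trimming, or None if nothing would remain."""
    start_ms = head_ms
    end_ms = duration_ms - tail_ms
    if end_ms <= start_ms:
        return None
    return start_ms, end_ms


print(crop_bounds(60_000))  # (7000, 50000)
print(crop_bounds(15_000))  # None: the file is shorter than head + tail
```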

In [None]:
!pip install spleeter

Collecting spleeter
  Downloading spleeter-2.4.2-py3-none-any.whl.metadata (11 kB)
Collecting ffmpeg-python<0.3.0,>=0.2.0 (from spleeter)
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting httpx<0.20.0,>=0.19.0 (from httpx[http2]<0.20.0,>=0.19.0->spleeter)
  Downloading httpx-0.19.0-py3-none-any.whl.metadata (45 kB)
Collecting norbert<0.3.0,>=0.2.1 (from spleeter)
  Downloading norbert-0.2.1-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting numpy<2.0.0 (from spleeter)
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas<2.0.0,>=1.3.0 (from spleeter)
  Downloading pandas-1.5.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata 

In [None]:
import subprocess
from pydub import AudioSegment
from pathlib import Path

output_folder = Path("downloads")
filtered_folder = Path("downloads_filtered")
filtered_folder.mkdir(exist_ok=True)

for file_path in output_folder.glob("*.*"):
    if file_path.suffix.lower() not in [".wav", ".mp3"]:
        continue

    print(f"Separating: {file_path.name}")

    # a temp folder for spleeter output
    temp_out = Path("spleeter_output")
    temp_out.mkdir(exist_ok=True)

    # splitting sources (vocals + accompaniment)
    subprocess.run([
        "spleeter", "separate",
        "-p", "spleeter:2stems",
        "-o", str(temp_out),
        str(file_path)
    ], check=True)


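    # Spleeter writes <output dir>/<input file stem>/vocals.wav
    vocals_path = temp_out / file_path.stem / "vocals.wav"
    final_path = filtered_folder / file_path.with_suffix(".wav").name
    vocals_audio = AudioSegment.from_wav(vocals_path)
    vocals_audio.export(final_path, format="wav")
    print(f"Filtered vocals saved to: {final_path}")

    # Remove the per-file Spleeter output to save disk space
    for child in (temp_out / file_path.stem).iterdir():
        child.unlink()
    (temp_out / file_path.stem).rmdir()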


Separating: Ritorna ＂Quotidiano in classe＂： dite la vostra sui grandi temi dell'attualità.wav
Filtered vocals saved to: downloads_filtered/Ritorna ＂Quotidiano in classe＂： dite la vostra sui grandi temi dell'attualità.wav
Separating: Italia, l’attacco va e in panchina c’è un tesoro.wav
Filtered vocals saved to: downloads_filtered/Italia, l’attacco va e in panchina c’è un tesoro.wav
Separating: Esplode la Francia di Griezmann, ma ora c’è la Germania.wav
Filtered vocals saved to: downloads_filtered/Esplode la Francia di Griezmann, ma ora c’è la Germania.wav
Separating: Scuola, un solo mese di vacanze estive： che ne pensate？.wav
Filtered vocals saved to: downloads_filtered/Scuola, un solo mese di vacanze estive： che ne pensate？.wav
Separating: L’Eire punisce Italia 2, con la Spagna tornano i titolari.wav
Filtered vocals saved to: downloads_filtered/L’Eire punisce Italia 2, con la Spagna tornano i titolari.wav
Separating: Disastri ambientali, l'Italia paga un conto di 2,6 miliardi l'anno.wa
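
The path handling in the separation cell can be factored into small helpers, which makes it easier to see where the vocals stem is expected and to clean up in one call. A sketch under the same assumptions as the cell above (2-stem model, one output folder per input file); the three function names are hypothetical:

```python
import shutil
from pathlib import Path


def spleeter_command(audio_path, out_dir):
    """Build the same CLI call as the cell above, as an argument list."""
    return [
        "spleeter", "separate",
        "-p", "spleeter:2stems",
        "-o", str(out_dir),
        str(audio_path),
    ]


def vocals_path_for(audio_path, out_dir):
    """Path where the cell above expects the vocals stem."""
    return Path(out_dir) / Path(audio_path).stem / "vocals.wav"


def remove_stem_dir(audio_path, out_dir):
    """Delete the per-file output folder and everything in it."""
    shutil.rmtree(Path(out_dir) / Path(audio_path).stem, ignore_errors=True)
```

`shutil.rmtree` replaces the manual `unlink` and `rmdir` loop and also copes with nested folders.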

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [1]:
import shutil
from pathlib import Path

drive_target = Path("/content/drive/My Drive/tts_it_news_raw")
drive_target.mkdir(parents=True, exist_ok=True)

local_downloads = Path("downloads_filtered")

for file_path in local_downloads.glob("*.wav"):
    shutil.copy(file_path, drive_target / file_path.name)



In [None]:
import os
from pydub import AudioSegment, silence

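input_dir = "downloads"  # cropped files; point this at "downloads_filtered" to segment the vocals-only audio
output_dir = "downloads_segmented_by_pauses"
os.makedirs(output_dir, exist_ok=True)

max_chunk_len_ms = 10 * 1000  # maximum chunk length: 10 seconds
min_silence_len_ms = 300      # a pause must last at least 300 ms to count as a split point
silence_thresh_db = -40       # audio quieter than -40 dBFS counts as silence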

metadata_lines = []

file_counter = 0

for filename in os.listdir(input_dir):
    if not filename.endswith(".wav"):
        continue

    filepath = os.path.join(input_dir, filename)
    audio = AudioSegment.from_wav(filepath)

    # Detect silent intervals as [start_ms, end_ms] pairs
    silent_ranges = silence.detect_silence(
        audio,
        min_silence_len=min_silence_len_ms,
        silence_thresh=silence_thresh_db
    )

    # If no silence, split by fixed chunks
    if not silent_ranges:
        chunks = []
        for start_ms in range(0, len(audio), max_chunk_len_ms):
            end_ms = min(start_ms + max_chunk_len_ms, len(audio))
            chunks.append((start_ms, end_ms))
    else:
        # Use silence points to split, but limit chunk length to max_chunk_len_ms
        chunks = []
        prev_end = 0
        for start_sil, end_sil in silent_ranges:
            if (start_sil - prev_end) > max_chunk_len_ms:
                # split into fixed chunks if segment is too long
                segment_start = prev_end
                while segment_start + max_chunk_len_ms < start_sil:
                    chunks.append((segment_start, segment_start + max_chunk_len_ms))
                    segment_start += max_chunk_len_ms
                chunks.append((segment_start, start_sil))
            else:
                chunks.append((prev_end, start_sil))
            prev_end = end_sil
        if prev_end < len(audio):
            chunks.append((prev_end, len(audio)))


    for start_ms, end_ms in chunks:
        chunk_audio = audio[start_ms:end_ms]
        # Discard chunks shorter than 1 second
        if len(chunk_audio) < 1000:
            continue

        file_counter += 1
        out_filename = f"chunk_{file_counter:05d}.wav"
        out_path = os.path.join(output_dir, out_filename)
        chunk_audio.export(out_path, format="wav")

        # The transcript field stays empty here; transcripts come from the Whisper step
        metadata_lines.append(f"{out_filename}|")


metadata_path = os.path.join(output_dir, "metadata.csv")
with open(metadata_path, "w", encoding="utf-8") as f:
    for line in metadata_lines:
        f.write(line + "\n")

print(f"Done! Total {file_counter} chunks")
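
The boundary logic in the cell above can be tested without audio if it is written as a pure function over the silence intervals. The sketch below is a hypothetical variant, not the notebook's exact code: `chunk_bounds` is an assumed name, and unlike the cell above it also caps the trailing segment after the last silence at the maximum length.

```python
def chunk_bounds(silent_ranges, total_ms, max_len_ms=10_000, min_len_ms=1_000):
    """Split [0, total_ms) at silences, capping each chunk at max_len_ms."""
    chunks = []
    prev_end = 0
    # A zero-length "silence" at the end flushes the trailing speech segment
    for start_sil, end_sil in list(silent_ranges) + [(total_ms, total_ms)]:
        seg_start = prev_end
        # Cut over-long speech segments into fixed-size pieces
        while start_sil - seg_start > max_len_ms:
            chunks.append((seg_start, seg_start + max_len_ms))
            seg_start += max_len_ms
        chunks.append((seg_start, start_sil))
        prev_end = end_sil
    # Drop chunks shorter than min_len_ms
    return [(s, e) for s, e in chunks if e - s >= min_len_ms]


# Two pauses in a 36-second file; the 25.5 s stretch between them is cut into pieces
print(chunk_bounds([(4_000, 4_500), (30_000, 30_400)], 36_000))
```

With no silences at all, the same function falls back to fixed 10-second chunks, which matches the first branch of the cell above.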


In [None]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-kjopfjzp
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-kjopfjzp
  Resolved https://github.com/openai/whisper.git to commit c0d2f624c09dc18e709e37c2ad90c039a4eb72a2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.10.0 (from torch->openai-whisper==20250625)
  Downloading typing_extensions-4.14.1-py3-none-any.whl.metadata (3.0 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch->openai-whisper==20250625)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch->openai-whisper==20250625)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-man

In [None]:
import os
import whisper

# CONFIG
AUDIO_DIR = "downloads_segmented_by_pauses"  # folder with audio chunks of up to 10 s
OUTPUT_FILE = "metadata_upd.csv"  # file used for fine-tuning
LANGUAGE = "it"

# Load whisper model
model = whisper.load_model("medium")  # or "small", "large", etc.

lines = []

# Sort so that metadata rows follow the chunk numbering
for filename in sorted(os.listdir(AUDIO_DIR)):
    if filename.endswith(".wav"):
        filepath = os.path.join(AUDIO_DIR, filename)
        print(f"Transcribing {filename}...")

        # Transcribe, forcing Italian instead of relying on language detection
        result = model.transcribe(filepath, language=LANGUAGE)
        text = result["text"].strip()

        # Format: path|transcript
        lines.append(f"{filepath}|{text}")

# Save to file
with open(OUTPUT_FILE, "w", encoding="utf-8") as f:
    for line in lines:
        f.write(line + "\n")

print(f"Saved metadata to {OUTPUT_FILE}")
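
Because the metadata format uses `|` as its separator, a transcript containing `|` or a line break would corrupt a row, and chunks with no recognized speech produce empty transcripts. The sketch below shows one way to guard against both when writing and reading the file. `format_row` and `load_metadata` are hypothetical helpers, and dropping empty rows is an assumption about what the fine-tuning step wants.

```python
import csv


def format_row(path, text):
    """Build one `path|transcript` row with separators and newlines removed from the text."""
    clean = " ".join(text.replace("|", " ").split())
    return f"{path}|{clean}"


def load_metadata(csv_path):
    """Read `path|transcript` rows back, skipping rows with an empty transcript."""
    rows = []
    with open(csv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks in transcripts as literal characters
        for record in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if len(record) == 2 and record[1].strip():
                rows.append((record[0], record[1].strip()))
    return rows
```

In the transcription loop, `lines.append(format_row(filepath, text))` would take the place of the plain f-string.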
