
# HiFi-GAN Vocoder (RUSLAN, 22.05 kHz) — Demo

This notebook demonstrates:
1. Cloning the repository and installing dependencies
2. Downloading the trained vocoder checkpoint
3. Resynthesis mode: Audio → Mel → Vocoder → Audio
4. Resynthesis of the MOS test sentences (1.wav, 2.wav, 3.wav)
5. Inference speed benchmark

In [None]:
!git clone https://github.com/makarles/ttshifigan.git
%cd ttshifigan

In [None]:
!pip install -r requirements.txt

In [None]:
!pip install hydra-core==1.3.2 omegaconf==2.3.0

## Download checkpoint

Download the trained HiFi-GAN checkpoint (Phase 2) and save it as `checkpoints/last.pt`.

In [32]:
!pip install -q gdown
!mkdir -p checkpoints

In [None]:
FILE_ID = "1HZ-NSQ0c3f_ZUhCibULGdMXGn2HHcbzC"
# IPython expands {FILE_ID} inside shell commands, so the ID is defined once.
!gdown --fuzzy "https://drive.google.com/file/d/{FILE_ID}/view?usp=drive_link" -O checkpoints/last.pt
!ls -lh checkpoints/last.pt
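
Drive downloads can fail quietly and leave a small HTML page in place of the file. Below is a minimal sanity check, assuming the checkpoint is well over 1 MB; the threshold and the helper name `looks_like_checkpoint` are illustrative and not part of the repository.

```python
import os

def looks_like_checkpoint(path, min_bytes=1_000_000):
    """Return True if `path` exists, has at least `min_bytes` bytes
    (assumed threshold) and does not start like an HTML page."""
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) < min_bytes:
        return False
    # A failed download can be an HTML error page rather than a binary file.
    with open(path, "rb") as f:
        head = f.read(64).lstrip().lower()
    return not (head.startswith(b"<!doctype html") or head.startswith(b"<html"))

print("checkpoint looks valid:", looks_like_checkpoint("checkpoints/last.pt"))
```

If the check prints `False`, re-running the download cell is usually the first thing to try.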

## Download data

In [34]:
!mkdir -p data

In [None]:
RUSLAN_FOLDER_ID = "1sKwsSfRuW4ZsgIbnoG90uBPWz16s48Hh"

!gdown --folder --remaining-ok https://drive.google.com/drive/folders/{RUSLAN_FOLDER_ID} -O data/ruslan_50

In [None]:
MOS_FOLDER_ID = "1KrDyNdbGdOz32gmIrbFrV2RstpFxoQwe"

!gdown --folder https://drive.google.com/drive/folders/{MOS_FOLDER_ID} -O data/mos_gt

## Resynthesis demo (subset)

Run resynthesis (Audio → Mel → Vocoder → Audio) on a small subset to verify quality.

In [None]:
!python -m src.synthesize \
  mode=resynthesize \
  ckpt.path=checkpoints/last.pt \
  dataset.root=data/ruslan_50 \
  out_dir=ruslan_resynth_demo \
  +limit=50
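
For intuition about the round trip, the sketch below counts mel frames and output samples for one second of audio. It assumes the settings of the original HiFi-GAN recipe at 22.05 kHz (`n_fft=1024`, `hop_length=256`, padding of `(n_fft - hop_length) // 2` samples on each side, no centering) and a generator whose total upsampling factor equals `hop_length`. This repository's config may differ, and the function names are illustrative.

```python
def mel_frames(num_samples, n_fft=1024, hop_length=256):
    """Number of STFT frames without centering, after padding both ends
    by (n_fft - hop_length) // 2 samples (assumed convention)."""
    pad = (n_fft - hop_length) // 2
    padded = num_samples + 2 * pad
    if padded < n_fft:
        return 0
    return (padded - n_fft) // hop_length + 1

def vocoder_samples(num_frames, hop_length=256):
    """Assumed: the generator turns each mel frame into hop_length samples."""
    return num_frames * hop_length

frames = mel_frames(22050)              # one second at 22.05 kHz
print(frames, vocoder_samples(frames))  # 86 22016
```

Under these assumptions the resynthesized file can be up to one hop shorter than the input, so reference and generated audio may differ slightly in length.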

## Listen to 5 random pairs

In [None]:
import os
import random
from IPython.display import Audio, display

root = "ruslan_resynth_demo"

gen_files = [f for f in os.listdir(root) if f.endswith("_gen.wav")]

print("total pairs available:", len(gen_files))

# 5 random pairs
selected = random.sample(gen_files, k=min(5, len(gen_files)))

for gen_name in selected:
    base = gen_name.removesuffix("_gen.wav")  # strip only the trailing suffix
    ref_name = base + "_ref.wav"

    gen_path = os.path.join(root, gen_name)
    ref_path = os.path.join(root, ref_name)

    print("="*60)
    print("ID:", base)
    print("REF:")
    display(Audio(ref_path))
    print("GEN:")
    display(Audio(gen_path))
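
The loop above assumes every `*_gen.wav` has a matching `*_ref.wav`. A small sketch that keeps only complete pairs, using the same naming convention (the helper name `complete_pairs` is illustrative):

```python
import os

def complete_pairs(root):
    """Return sorted (ref_path, gen_path) tuples for IDs that have both files."""
    pairs = []
    for name in sorted(os.listdir(root)):
        if not name.endswith("_gen.wav"):
            continue
        base = name[: -len("_gen.wav")]
        ref_path = os.path.join(root, base + "_ref.wav")
        # Skip generated files whose reference is missing.
        if os.path.isfile(ref_path):
            pairs.append((ref_path, os.path.join(root, name)))
    return pairs
```

Calling `complete_pairs("ruslan_resynth_demo")` and sampling from the result would avoid a playback error on an incomplete pair.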

## MOS sentences (1.wav, 2.wav, 3.wav)

Run the vocoder on mel-spectrograms extracted from the ground-truth MOS audio.

In [None]:
!python -m src.make_mos \
  --ckpt checkpoints/last.pt \
  --in_dir data/mos_gt \
  --out_dir mos_outputs

In [None]:
from IPython.display import Audio, display
import os

for name in ["1.wav", "2.wav", "3.wav"]:
    ref = os.path.join("data/mos_gt", name)
    gen = os.path.join("mos_outputs", name)
    print("REF:", name)
    display(Audio(ref))
    print("GEN:", name)
    display(Audio(gen))
    print()

## Inference speed benchmark

Measure the approximate real-time factor (RTF):

RTF = (synthesis time) / (audio duration)

RTF < 1 means faster than real-time.
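
As a worked example of the formula, the sketch below uses made-up timings (2.5 s of compute for 10 s of audio at 22.05 kHz). The helper name `real_time_factor` is illustrative.

```python
def real_time_factor(synth_seconds, num_samples, sample_rate=22050):
    """RTF = synthesis time / audio duration, with duration = samples / sample rate."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synth_seconds / audio_seconds

# Hypothetical numbers, not a measurement of this model.
rtf = real_time_factor(2.5, num_samples=10 * 22050)
print(rtf)  # 0.25, i.e. four times faster than real time
```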

In [None]:
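import glob
import subprocess
import sys
import time

import torchaudio

# Time a fresh resynthesis run. The measured wall-clock time also covers
# interpreter start-up, checkpoint loading and mel extraction, so this RTF
# is an upper bound on the vocoder's own RTF.
cmd = [
    sys.executable, "-m", "src.synthesize",
    "mode=resynthesize",
    "ckpt.path=checkpoints/last.pt",
    "dataset.root=data/ruslan_50",
    "out_dir=ruslan_resynth_bench",
    "+limit=10",
]
t0 = time.perf_counter()
subprocess.run(cmd, check=True)
synth_time = time.perf_counter() - t0

# Total duration of the generated audio only (reference copies are excluded).
gen_files = sorted(glob.glob("ruslan_resynth_bench/*_gen.wav"))
total_audio_sec = 0.0
for fp in gen_files:
    wav, sr = torchaudio.load(fp)
    total_audio_sec += wav.shape[-1] / sr

rtf = synth_time / max(total_audio_sec, 1e-9)

print("files:", len(gen_files))
print("total audio (sec):", round(total_audio_sec, 2))
print("time (sec):", round(synth_time, 2))
print("RTF:", round(rtf, 4))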

## Source: https://github.com/makarles/ttshifigan