# 🎵 MOSS: Multi-Objective Sound Synthesis

## Finding Pareto-Optimal Spectrogram Mixtures

This notebook serves as a reference implementation and a self-contained demonstration of the MOSS system. The algorithms here correspond exactly to the web frontend at https://moss.app.kuchar.dpdns.org.

### 🎯 Problem Definition

We are working in the domain of multi-objective optimization. Our goal is to find a mapping that encodes semantic information from the visual domain into the acoustic domain while minimizing information loss in both modalities.

Let $I$ be the target image (e.g. the Mona Lisa) and $A$ the target sound (e.g. Tchaikovsky). We seek a spectrogram matrix $S$ that minimizes the vector-valued loss function:
$$\min_{S} F(S) = [L_\mathrm{visual}(S, I), L_\mathrm{audio}(S, A)]^T$$

This problem has no single global solution but rather a set of optimal trade-offs, which we call the Pareto front. A solution $S^*$ is Pareto-optimal if there is no other solution $S'$ for which:
$$\forall i: L_i(S') \leq L_i(S^*) \wedge \exists j: L_j(S') < L_j(S^*)$$

That is, we cannot improve visual similarity without degrading audio fidelity, and vice versa.
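
The dominance condition can be checked directly on loss tuples. The following is a minimal sketch in plain Python; the helper names `dominates` and `pareto_front` and the sample values are hypothetical and not part of MOSS:

```python
def dominates(a, b):
    """Return True if loss vector `a` Pareto-dominates `b` (minimization)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Toy (L_visual, L_audio) pairs, invented for illustration only.
candidates = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1), (0.6, 0.6), (0.5, 0.7)]
front = pareto_front(candidates)
print(front)
```

The pairwise comparison is quadratic in the number of candidates, which is acceptable for the population sizes used in this notebook.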

### 🧠 Key Concepts and Architecture

**Spectral masking:**
Instead of synthesizing pixels directly, we optimize a scalar field (mask) $M \in [0,1]^{F \times T}$ that controls a linear interpolation between the magnitudes of the source signals.
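
Written out per time-frequency bin, the interpolation is the blend below, which matches the `_compute_spectrogram` method of the encoder defined later in this notebook. The symbols $|S_\mathrm{image}|$ and $|S_\mathrm{audio}|$ are notation introduced here for the two source magnitudes:

$$|S_\mathrm{mixed}|(f,t) = M(f,t)\,|S_\mathrm{image}|(f,t) + \bigl(1 - M(f,t)\bigr)\,|S_\mathrm{audio}|(f,t)$$

so $M = 1$ selects the image magnitude and $M = 0$ the audio magnitude.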

**The spectral reconstruction problem:**
Human hearing is extremely sensitive to the temporal structure of a signal, which the STFT representation encodes in the phase. Visual reconstruction, however, operates only on the amplitude (magnitude). To keep the audio intelligible, we fix the phase of the target sound ($\phi_\mathrm{audio}$) and optimize only the magnitude. The resulting signal is reconstructed as $x(t) = \text{ISTFT}(|S_\mathrm{mixed}| \cdot e^{j\phi_\mathrm{audio}})$.
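
For a single STFT bin, keeping the phase and swapping the magnitude can be sketched with the standard library alone. The bin value and the new magnitude are made-up numbers; the notebook itself does this on whole tensors with `torch.polar` and `torch.istft`:

```python
import cmath
import math

# One complex STFT coefficient of the target audio (illustrative value).
audio_bin = complex(3.0, 4.0)           # magnitude 5.0
phase = cmath.phase(audio_bin)          # phi_audio, kept fixed

# Magnitude chosen by the optimizer for this bin (illustrative value).
mixed_magnitude = 2.0

# |S_mixed| * exp(j * phi_audio)
mixed_bin = cmath.rect(mixed_magnitude, phase)

print(abs(mixed_bin))
print(math.isclose(cmath.phase(mixed_bin), phase))
```

Only the length of the complex vector changes; its direction, and with it the timing information carried by the phase, stays that of the original audio.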

**Hybrid optimization:**
To map the search space efficiently, we use a two-phase approach:
1. **Gradient-based single-objective optimization (Adam)** for fast convergence along selected directions.
2. **Evolutionary algorithm (NSGA-II)** for global search and mapping of the Pareto front.

## 1. Basic Configuration and Signal Processing

Here we define the parameters of the Short-Time Fourier Transform (STFT). The choice of these constants is critical for balancing the Gabor limit, which states that arbitrarily high precision in time and in frequency cannot be achieved simultaneously:

$$\sigma_t \cdot \sigma_\omega \geq \frac{1}{2}$$

* **SAMPLE_RATE = 16000**: The Nyquist frequency is 8 kHz, which covers most of the semantically relevant spectrum of human speech and music.
* **N_FFT = 1024**: The window size. It sets the frequency resolution to $\approx 15.6$ Hz per bin.
* **HOP_LENGTH = 256**: The window hop (75% overlap). The dense overlap gives a smooth temporal reconstruction and helps suppress artifacts at frame boundaries; spectral leakage itself is limited by the Hann window used in the code.
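
The figures quoted in the list follow from simple arithmetic on the three constants. A short sketch, where the derived variable names are ours and do not appear elsewhere in the notebook:

```python
SAMPLE_RATE = 16000
N_FFT = 1024
HOP_LENGTH = 256

nyquist_hz = SAMPLE_RATE / 2                 # highest representable frequency
bin_width_hz = SAMPLE_RATE / N_FFT           # spacing between frequency bins
n_freq_bins = N_FFT // 2 + 1                 # one-sided spectrum, DC up to Nyquist
overlap = 1 - HOP_LENGTH / N_FFT             # fraction shared by adjacent windows
window_ms = 1000 * N_FFT / SAMPLE_RATE       # duration of one analysis window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE     # time step between frames

print(nyquist_hz, bin_width_hz, n_freq_bins, overlap, window_ms, hop_ms)
```

A 64 ms window with a 16 ms hop illustrates the trade-off: doubling `N_FFT` would halve the bin width but double the time span each frame averages over.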

To speed up computation we introduce **proxy optimization**. The optimization loop runs at a reduced resolution (`PROXY_HEIGHT = 129`), which corresponds to sampling the frequency axis at roughly one quarter of its full resolution. The result is then upsampled by bilinear interpolation.

In [None]:
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchaudio
import matplotlib.pyplot as plt
from PIL import Image
from IPython.display import HTML, Audio, display
import warnings
warnings.filterwarnings('ignore')

SAMPLE_RATE = 16000
N_FFT = 1024
HOP_LENGTH = 256
WIN_LENGTH = N_FFT
GRID_HEIGHT = 64
GRID_WIDTH = 128
N_PARAMS = GRID_HEIGHT * GRID_WIDTH

# For speed, the proxy optimization runs at a lower resolution
PROXY_HEIGHT = 129
DEVICE = 'cpu'

# Optimization parameters; the stopping criterion is always the iteration count
SINGLE_STEPS = 500
PARETO_SEED_STEPS = 200
PARETO_GENERATIONS = 50
PARETO_SEED_POP = 10
PARETO_EVOL_POP = 100

def plot_spectrogram(ax, mag_tensor, title=None, show_colorbar=False):
    """Plot a magnitude spectrogram in dB, showing 80 dB below the 99.5th percentile."""
    spec_db = 20 * torch.log10(mag_tensor + 1e-8).cpu().numpy()
    ref_max = np.percentile(spec_db, 99.5)
    vmin = ref_max - 80
    vmax = ref_max
    
    im = ax.imshow(spec_db, origin='lower', aspect='auto', cmap='magma',
                   vmin=vmin, vmax=vmax)
    if title:
        ax.set_title(title)
    if show_colorbar:
        plt.colorbar(im, ax=ax, label='dB')
    return im

print(f"PyTorch version: {torch.__version__}")
print(f"Device (hard-coded to CPU for compatibility): {DEVICE}")
print(f"Parameter space: {GRID_HEIGHT}×{GRID_WIDTH} = {N_PARAMS} parameters")


## 2. Definition of the Loss Functions

For the optimizer to "see" and "hear", we need sensible distance metrics in both domains.

### 👁️ Visual Loss: Mean Absolute Error (MAE)

For the visual component we use the $L_1$ norm. Unlike MSE ($L_2$), $L_1$ is less sensitive to outliers and produces sharper edges in the spectrogram, which is desirable for keeping the image recognizable.

$$\mathcal{L}_\mathrm{visual} = ||M_\mathrm{mixed} - M_\mathrm{image}||_1 = \frac{1}{N} \sum_{f,t} |M_\mathrm{mixed}(f,t) - M_\mathrm{image}(f,t)|$$

### 👂 Audio Loss: Log-Spectral Distance

For the audio component we cannot use a linear distance, because human loudness perception is approximately logarithmic (in line with the Weber-Fechner law). We therefore minimize the distance in the logarithmic domain:

$$\mathcal{L}_\mathrm{audio} = ||\log(M_\mathrm{mixed} + \epsilon) - \log(M_\mathrm{audio} + \epsilon)||_1$$

where $\epsilon = 10^{-8}$ ensures numerical stability.
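
The difference between the two metrics shows up on a pair of bins with the same absolute error at very different levels. A small sketch with invented magnitudes; `l1` and `log_l1` are illustrative helpers, not the notebook's loss functions:

```python
import math

EPS = 1e-8


def l1(a, b):
    """Absolute difference of two magnitudes."""
    return abs(a - b)


def log_l1(a, b, eps=EPS):
    """Absolute difference of the log-magnitudes."""
    return abs(math.log(a + eps) - math.log(b + eps))


# The same absolute error (0.009) at a loud bin and at a quiet bin.
loud = (1.000, 1.009)
quiet = (0.001, 0.010)

print(l1(*loud), l1(*quiet))
print(log_l1(*loud), log_l1(*quiet))
```

The linear metric rates both errors the same, while the logarithmic one penalizes the tenfold change in the quiet bin far more than the sub-percent change in the loud bin.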

### ⚖️ Nadir Normalization

To make the two criteria comparable, we normalize them to the interval [0,1] using an estimate of the nadir point. The loss values are scaled by:

*   $\mathrm{Scale}_\mathrm{vis} = (\mathcal{L}_\mathrm{vis}(A, I))^{-1}$ (the visual loss of the pure audio)
*   $\mathrm{Scale}_\mathrm{aud} = (\mathcal{L}_\mathrm{aud}(I, A))^{-1}$ (the audio loss of the pure image)
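
As a numeric illustration of the two scale factors, here is a sketch with made-up raw losses. The helper `nadir_scales` is hypothetical; its `1e-6` floor mirrors the division-by-zero guard used later in `ParetoManager.calculate_normalization`:

```python
def nadir_scales(worst_visual_loss, worst_audio_loss, floor=1e-6):
    """Inverse of the worst-case losses; the floor guards against division by zero."""
    scale_vis = 1.0 / max(worst_visual_loss, floor)
    scale_aud = 1.0 / max(worst_audio_loss, floor)
    return scale_vis, scale_aud


# Made-up worst cases: visual loss of the pure audio, audio loss of the pure image.
scale_vis, scale_aud = nadir_scales(worst_visual_loss=0.25, worst_audio_loss=4.0)

# Raw losses of some candidate mixture (also made up).
raw_vis, raw_aud = 0.05, 3.0
norm_vis = raw_vis * scale_vis
norm_aud = raw_aud * scale_aud
print(norm_vis, norm_aud)
```

After scaling, both objectives are expressed as a fraction of their respective worst case, so their raw units no longer matter.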

In [None]:
def calc_audio_mag_loss(mixed_mag, target_audio_mag):
    """Mean L1 distance between log-magnitudes; one value per batch item for 3D input."""
    mixed_log = torch.log(mixed_mag + 1e-8)
    target_log = torch.log(target_audio_mag + 1e-8)
    target_log = target_log.expand_as(mixed_log)
    
    loss = F.l1_loss(mixed_log, target_log, reduction='none')
    
    if mixed_mag.dim() == 3:
        return loss.mean(dim=(1, 2))
    return loss.mean()


def calc_visual_loss(mixed_mag, target_image_mag):
    """Mean absolute error between magnitudes; one value per batch item for 3D input."""
    diff = torch.abs(mixed_mag - target_image_mag)
    
    if mixed_mag.dim() == 3:
        return diff.mean(dim=(1, 2))
    return diff.mean()

## 3. Mask Topology and Gaussian Smoothing

We optimize a parameter matrix $\theta \in \mathbb{R}^{H \times W}$. To prevent sharp spectral discontinuities, which would show up as audio and visual artifacts, we convolve the mask with a Gaussian kernel.

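The transformation from parameters to mask, in the order the code applies it (sigmoid on the logits first, then blur, then upsampling), is
$$\text{mask} = \text{Upsample}\bigl(G_\sigma * \mathrm{sigmoid}(\theta)\bigr),$$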

where $G_\sigma$ is the 2D Gaussian kernel
$$G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}.$$

This operation acts as a regularizer that enforces spatial correlation between the parameters; put simply, it smooths. A higher $\sigma$ leads to a "smoother" sound and a more abstract image.
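
To get a feel for how $\sigma$ controls the amount of smoothing, here is a sketch of the normalized 1D kernel in plain Python. Because the 2D Gaussian is separable, the 2D kernel is the outer product of two such vectors. The helper `gaussian_kernel_1d` is ours; its support of $2\lceil 2\sigma\rceil + 1$ taps is taken from the `MaskProcessor` class defined later in this notebook:

```python
import math


def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian taps on a support of 2*ceil(2*sigma)+1 samples."""
    radius = math.ceil(2 * sigma)
    taps = [math.exp(-(i * i) / (2 * sigma**2)) for i in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


narrow = gaussian_kernel_1d(1.0)
wide = gaussian_kernel_1d(5.0)
print(len(narrow), len(wide))
print(max(narrow), max(wide))
```

The wider kernel spreads its weight over many more grid cells, so each mask value becomes an average of a larger neighbourhood.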

In [None]:
def gaussian_blur_2d(x, sigma=1.0):
    """Separable Gaussian blur for 2D, 3D (batched) or 4D input; a no-op when sigma < 0.5."""
    if sigma < 0.5:
        return x
    
    kernel_size = int(6 * sigma) | 1
    kernel_size = max(3, kernel_size)
    
    coords = torch.arange(kernel_size, dtype=x.dtype, device=x.device) - kernel_size // 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    g = g / g.sum()
    
    pad = kernel_size // 2
    original_dim = x.dim()
    
    if x.dim() == 2:
        x = x.unsqueeze(0).unsqueeze(0)
    elif x.dim() == 3:
        x = x.unsqueeze(1)
    
    x = F.conv2d(F.pad(x, (pad, pad, 0, 0), mode='replicate'), 
                 g.view(1, 1, 1, -1), padding=0)
    x = F.conv2d(F.pad(x, (0, 0, pad, pad), mode='replicate'), 
                 g.view(1, 1, -1, 1), padding=0)
    
    if original_dim == 2:
        return x.squeeze(0).squeeze(0)
    elif original_dim == 3:
        return x.squeeze(1)
    return x

## 4. Encoder Architecture (The Mask Encoder)

The `MaskEncoder` class is the heart of the whole system. It provides a differentiable pass from the mask parameter space to the audio space.

### Gain Staging and Dynamic Range

One of the technically most demanding aspects is matching the dynamic range of the image to that of the audio.
- Audio spectrograms have a huge dynamic range (often > 80 dB).
- Images (in the range 0 to 255) are linear and have low contrast.

We implement adaptive **gain staging**, which maps image brightness onto the decibel scale of the audio. The algorithm computes the "ceiling" and the "floor" (noise floor) of the target audio and linearly transforms the normalized image $I_{norm} \in [0,1]$ into the log-magnitude of the audio:

$$\log(M_{image}) = I_{norm} \cdot (dB_{max} - dB_{min}) + dB_{min}$$

This ensures that black corresponds to silence (the noise floor) and white to maximum loudness, which maximizes the perceptibility of the image in the audio spectrogram.
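
The linear map in the log domain can be sketched for single pixels as follows. The floor and ceiling values are illustrative; the encoder in this notebook derives them from quantiles of the audio, works in natural-log units rather than decibels, and applies a contrast exponent to the image first, which this sketch leaves out:

```python
import math


def image_to_magnitude(i_norm, log_floor, log_ceil):
    """Map brightness in [0, 1] linearly onto a log-magnitude range, then exponentiate."""
    log_mag = i_norm * (log_ceil - log_floor) + log_floor
    return math.exp(log_mag)


# Illustrative levels in natural-log units.
LOG_FLOOR, LOG_CEIL = -6.0, 2.0

black = image_to_magnitude(0.0, LOG_FLOOR, LOG_CEIL)
gray = image_to_magnitude(0.5, LOG_FLOOR, LOG_CEIL)
white = image_to_magnitude(1.0, LOG_FLOOR, LOG_CEIL)
print(black, gray, white)
```

Because the interpolation happens in the log domain, mid-gray lands on the geometric mean of the floor and ceiling magnitudes rather than on their arithmetic mean.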

### Signal Reconstruction

As mentioned in the introduction, we use the inverse STFT to convert back to the time domain (waveform).

In [None]:
class MaskProcessor(nn.Module):
    """Processes mask parameters by applying a Gaussian blur and interpolation."""

    def __init__(self, h, w, sigma, device):
        super().__init__()
        self.h = h
        self.w = w
        self.device = device
        self._init_gaussian_kernel(sigma)

    def _init_gaussian_kernel(self, sigma):
        """Initializes the Gaussian kernel used to smooth the mask."""
        kernel_size = int(2 * math.ceil(2 * sigma) + 1)
        x_cord = torch.arange(kernel_size)
        x_grid = x_cord.repeat(kernel_size).view(kernel_size, kernel_size)
        y_grid = x_grid.t()
        xy_grid = torch.stack([x_grid, y_grid], dim=-1).float()

        mean = (kernel_size - 1) / 2.0
        variance = sigma**2.0
        gaussian_kernel = (1.0 / (2.0 * math.pi * variance)) * torch.exp(
            -torch.sum((xy_grid - mean) ** 2.0, dim=-1) / (2 * variance)
        )
        gaussian_kernel = gaussian_kernel / torch.sum(gaussian_kernel)
        self.blur_kernel = gaussian_kernel.view(1, 1, kernel_size, kernel_size).to(
            self.device
        )
        self.kernel_padding = kernel_size // 2

    def forward(self, params, target_h, target_w):
        """Blurs the parameters and resizes them to the target resolution."""
        B = params.shape[0]
        grid = params.view(B, 1, self.h, self.w)

        # Convolution that smooths the mask edges
        grid_blurred = F.conv2d(grid, self.blur_kernel, padding=self.kernel_padding)

        # Interpolation to the target spectrogram size
        mask = F.interpolate(
            grid_blurred,
            size=(target_h, target_w),
            mode="bilinear",
            align_corners=False,
        )

        return mask.squeeze(1)


class MaskEncoder(nn.Module):
    """Encodes parameters into audio by mask-based blending of spectrograms."""

    def __init__(
        self,
        target_image: torch.Tensor,
        target_audio_path: str,
        grid_height: int = 128,
        grid_width: int = 256,
        smoothing_sigma: float = 1.0,
        device: str = DEVICE,
    ):
        super().__init__()
        self.device = device
        self.grid_height = grid_height
        self.grid_width = grid_width
        self.n_params = grid_height * grid_width
        self.smoothing_sigma = smoothing_sigma

        # Load and prepare the reference audio
        audio, sr = torchaudio.load(target_audio_path)
        if sr != SAMPLE_RATE:
            audio = torchaudio.functional.resample(audio, sr, SAMPLE_RATE)

        audio = audio.mean(dim=0, keepdim=True)  # Convert to mono

        self.mask_processor = MaskProcessor(
            grid_height, grid_width, smoothing_sigma, device
        )

        self.register_buffer("target_audio_waveform", audio)

        # Compute the STFT (Short-Time Fourier Transform)
        window = torch.hann_window(N_FFT).to(device)
        stft = torch.stft(
            audio.to(device),
            n_fft=N_FFT,
            hop_length=HOP_LENGTH,
            win_length=WIN_LENGTH,
            window=window,
            return_complex=True,
        )

        self.audio_mag = stft.abs() + 1e-8
        self.audio_phase = stft.angle()

        self.full_height = self.audio_mag.shape[1]
        self.full_width = self.audio_mag.shape[2]

        self.audio_log = torch.log(self.audio_mag)

        # Prepare the target image for the spectrogram
        img = target_image.to(device)
        if img.dim() == 3:
            img = img.unsqueeze(0)

        target_h_visual = self.full_height
        img_flipped = torch.flip(img, dims=[-2])  # Flip vertically so the frequency axis is oriented correctly
        img_full_freq = F.interpolate(
            img_flipped,
            size=(target_h_visual, self.full_width),
            mode="bilinear",
            align_corners=False,
        )
        img_resized = img_full_freq

        # Dynamic gain staging to match the image and audio levels
        audio_log_max = torch.quantile(self.audio_log, 0.98)
        audio_max_val = self.audio_log.max()
        audio_floor_val = torch.quantile(self.audio_log, 0.01)

        target_ceiling = audio_max_val - 0.5
        headroom_nat = (audio_log_max - target_ceiling).item()
        dynamic_range_nat = (target_ceiling - audio_floor_val).item() + 0.5
        dynamic_range_nat = max(4.0, min(dynamic_range_nat, 12.0))

        print("Dynamic gain staging:")
        print(f"  > Max audio: {audio_max_val:.2f}, Floor (q01): {audio_floor_val:.2f}")
        print(f"  > Target ceiling: {target_ceiling:.2f}")
        print(f"  > Adaptive dynamic range: {dynamic_range_nat:.2f}")

        audio_log_ceil = audio_log_max - headroom_nat
        audio_log_floor = audio_log_ceil - dynamic_range_nat

        self.audio_log = torch.clamp(self.audio_log, min=audio_log_floor)

        # Normalize the image and adjust its contrast
        img_01 = img_resized.squeeze(0)
        img_01 = (img_01 - img_01.min()) / (img_01.max() - img_01.min() + 1e-8)
        img_01 = img_01.pow(1.8)

        # Map the image into the log-magnitude domain of the audio
        self.image_log = img_01 * (audio_log_ceil - audio_log_floor) + audio_log_floor
        self.image_mag = torch.exp(self.image_log)
        self.audio_mag_static = torch.exp(self.audio_log)

        # Set the proxy resolution for faster previews
        self.proxy_height = PROXY_HEIGHT
        self.proxy_width = self.full_width // 2

        self.image_mag_proxy = F.interpolate(
            self.image_mag.unsqueeze(0),
            size=(self.proxy_height, self.proxy_width),
            mode="bilinear",
            align_corners=False,
        ).squeeze(0)

        self.audio_mag_proxy = F.interpolate(
            self.audio_mag_static.unsqueeze(0),
            size=(self.proxy_height, self.proxy_width),
            mode="bilinear",
            align_corners=False,
        ).squeeze(0)

        self.image_mag_ref = self.image_mag_proxy
        self.audio_mag = self.audio_mag_proxy

        self.image_mag_full = self.image_mag
        self.audio_mag_full = self.audio_mag_static

        print(f"Spectrogram: {self.full_height}×{self.full_width}")
        print(f"Proxy: {self.proxy_height}×{self.proxy_width}")

    def _compute_spectrogram(self, mask, img_mag, aud_mag):
        """Linearly interpolates between the image and audio magnitudes using the mask."""
        return mask * img_mag + (1 - mask) * aud_mag

    def _reconstruct_audio(self, mixed_mag, phase):
        """Reconstructs the audio signal from magnitude and phase via the inverse STFT."""
        complex_stft = torch.polar(mixed_mag, phase)

        audio_recon = torch.istft(
            complex_stft,
            n_fft=N_FFT,
            hop_length=HOP_LENGTH,
            win_length=WIN_LENGTH,
            window=torch.hann_window(N_FFT, device=mixed_mag.device),
        )

        # Peak-normalize the loudness
        max_val = audio_recon.abs().max(dim=-1, keepdim=True)[0].clamp(min=1e-8)
        audio_recon = audio_recon / max_val * 0.9
        return audio_recon

    def forward(self, params: torch.Tensor, return_wav: bool = True):
        """Main forward pass: mask generation, mixing, and optional audio reconstruction."""
        batch_size = params.shape[0]

        target_h = self.proxy_height if not return_wav else self.full_height
        target_w = self.proxy_width if not return_wav else self.full_width

        mask = self.mask_processor(params, target_h, target_w)

        if not return_wav:
            img_mag = self.image_mag_ref.expand(batch_size, -1, -1)
            aud_mag = self.audio_mag.expand(batch_size, -1, -1)
        else:
            img_mag = self.image_mag_full.expand(batch_size, -1, -1)
            aud_mag = self.audio_mag_full.expand(batch_size, -1, -1)

        phase = self.audio_phase.expand(batch_size, -1, -1) if return_wav else None

        mixed_mag = self._compute_spectrogram(mask, img_mag, aud_mag)

        audio_recon = None
        if return_wav:
            audio_recon = self._reconstruct_audio(mixed_mag, phase)

        return audio_recon, mixed_mag

## 5. Data Preparation
We load the input modalities.

**Audio**: Converted to mono and resampled to 16 kHz.

**Image**: Converted to grayscale (luma) and resized by interpolation so that its height matches the number of frequency bins up to the Nyquist frequency ($NFFT/2+1=513$).

The parameter `sigma=5.0` was chosen empirically.

In [None]:
img_path = 'monalisa.jpg'
img_pil = Image.open(img_path).convert('L')
img_tensor = torch.from_numpy(np.array(img_pil)).float() / 255.0
img_tensor = img_tensor.unsqueeze(0)

audio_path = 'tchaikovsky.mp3'
waveform, sr = torchaudio.load(audio_path)
duration_sec = waveform.shape[-1] / sr
raw_width = int(duration_sec * 4.0)
grid_width = ((raw_width + 15) // 16) * 16
if grid_width < 16:
    grid_width = 16
grid_height = 64

sigma = 5.0
encoder = MaskEncoder(
    img_tensor, 
    audio_path,
    grid_height=grid_height,
    grid_width=grid_width,
    smoothing_sigma=sigma,
    device=DEVICE
)

GRID_HEIGHT = grid_height
GRID_WIDTH = grid_width
N_PARAMS = GRID_HEIGHT * GRID_WIDTH

print(f"\nGrid: {GRID_HEIGHT}×{GRID_WIDTH} = {N_PARAMS} parameters")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].imshow(img_pil, cmap='gray')
axes[0].set_title('Target image: Mona Lisa', fontsize=12)
axes[0].axis('off')

plot_spectrogram(axes[1], encoder.image_mag_full[0], 'Spectrogram Mona Lisa')
axes[1].set_xlabel('Time (frames)')
axes[1].set_ylabel('Frequency (bins)')

plot_spectrogram(axes[2], encoder.audio_mag_full[0], 'Target audio: Tchaikovsky')
axes[2].set_xlabel('Time (frames)')

plt.tight_layout()
plt.show()

print("🔊 Original Tchaikovsky:")
display(Audio(encoder.target_audio_waveform[0].numpy(), rate=SAMPLE_RATE))

## 6. Pareto Optimization Manager (Pareto Manager)

This class implements the vector-optimization logic. Instead of searching for a single point, we manage a whole population of solutions.

### Scalarization of the Objective Function

Gradient methods (such as Adam) require a scalar loss function. To produce a population spread along the Pareto front, we use the weighted sum method.

Each individual $i$ in the population optimizes a unique combination of weights $w_\mathrm{vis}^{(i)}, w_\mathrm{aud}^{(i)}$, and its loss function is

$$\mathcal{L}_\mathrm{total}^{(i)} = \gamma \cdot w_\mathrm{vis}^{(i)} \cdot \hat{\mathcal{L}}_\mathrm{vis} + w_\mathrm{aud}^{(i)} \cdot \hat{\mathcal{L}}_\mathrm{aud},$$

where $\gamma = 20.0$ is an empirical weighting factor. We found that, owing to its logarithmic scaling, the audio loss is in a sense more "stubborn" than the visual loss. Without this factor, the optimization would almost always converge to the pure audio.

To compensate for the nonlinear relationship between the weight ratio and the perceived spectrogram, we additionally use a power-law distribution of the weights during optimization, which gives better coverage of the objective space.
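
The weight schedule and the scalarized loss can be previewed without PyTorch. This sketch follows the formulas used in the `ParetoManager` code in this section (exponent 4 on $1 - \alpha$, $\gamma = 20$); the function names `power_law_weights` and `scalarized_loss` and the sample losses are ours:

```python
def power_law_weights(pop_size, exponent=4.0):
    """Visual/audio weight pairs spread non-uniformly between the two extremes."""
    pairs = []
    for k in range(pop_size):
        alpha = k / (pop_size - 1) if pop_size > 1 else 0.0
        w_vis = (1.0 - alpha) ** exponent
        pairs.append((w_vis, 1.0 - w_vis))
    return pairs


def scalarized_loss(w_vis, w_aud, loss_vis, loss_aud, gamma=20.0):
    """Weighted sum of the normalized losses with the empirical visual boost gamma."""
    return gamma * w_vis * loss_vis + w_aud * loss_aud


weights = power_law_weights(5)
print([round(w_vis, 4) for w_vis, _ in weights])
print(scalarized_loss(0.5, 0.5, loss_vis=0.2, loss_aud=0.4))
```

Most individuals receive a small visual weight, which counteracts the factor $\gamma$ and keeps the population from clustering at the image end of the front.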

In [None]:
class ParetoManager(nn.Module):
    """
    Manages a population of masks for optimization.
    """
    
    def __init__(self, encoder, pop_size=10, learning_rate=0.05):
        super().__init__()
        self.encoder = encoder
        self.pop_size = pop_size
        self.grid_h = encoder.grid_height
        self.grid_w = encoder.grid_width
        
        # Initialize the population in logit space (unbounded; the sigmoid maps to [0,1])
        self.mask_logits = nn.Parameter(
            torch.randn(pop_size, self.grid_h * self.grid_w) * 0.5
        )
        
        # ====================================================================
        # Power-law weight distribution (compensates for the 20x visual boost)
        # ====================================================================
        alpha = torch.linspace(0, 1, pop_size)
        self.weights_img = (1.0 - alpha).pow(4.0)
        self.weights_aud = 1.0 - self.weights_img
        
        # Normalize to sum to 1
        total = self.weights_img + self.weights_aud
        self.weights_img = self.weights_img / total
        self.weights_aud = self.weights_aud / total
        
        # Enforce the extremes: pure image and pure audio
        if pop_size >= 2:
            with torch.no_grad():
                self.mask_logits[0].fill_(10.0)   # Index 0: pure image
                self.mask_logits[-1].fill_(-10.0)  # Index -1: pure audio
        
        # Initialize Adam
        self.optimizer = optim.Adam([self.mask_logits], lr=learning_rate)

        # Normalization factors (computed in calculate_normalization)
        self.scale_vis = 1.0
        self.scale_aud = 1.0
        
    def calculate_normalization(self):
        """
        Computes the worst-case losses used to normalize the objectives to [0, 1].
        """
        print("Computing normalization factors...")
        with torch.no_grad():
            # Worst visual: pure audio (mask = 0)
            max_vis_loss = torch.abs(
                self.encoder.audio_mag - self.encoder.image_mag_ref
            ).mean().item()

            # Worst audio: pure image (mask = 1)
            max_aud_loss = calc_audio_mag_loss(
                self.encoder.image_mag_ref, 
                self.encoder.audio_mag
            ).item()

            # Guard against division by zero
            self.scale_vis = 1.0 / max(max_vis_loss, 1e-6)
            self.scale_aud = 1.0 / max(max_aud_loss, 1e-6)

            print(f"  Max visual loss: {max_vis_loss:.4f} → scale: {self.scale_vis:.4f}")
            print(f"  Max audio loss:  {max_aud_loss:.4f} → scale: {self.scale_aud:.4f}")
    
    def optimize_step(self):
        """
        Performs one gradient-descent step for all members of the population.

        Returns:
            avg_vis: Mean normalized visual loss
            avg_aud: Mean normalized audio loss
        """
        self.optimizer.zero_grad()

        # Get the current masks
        masks = torch.sigmoid(self.mask_logits)

        # Forward pass: return_wav=False automatically uses the PROXY resolution
        _, mixed_mag = self.encoder(masks, return_wav=False)

        # Compute the losses (matches the backend: uses the proxy references)
        diff = torch.abs(mixed_mag - self.encoder.image_mag_ref)
        raw_loss_vis = diff.mean(dim=(1, 2))
        raw_loss_aud = calc_audio_mag_loss(mixed_mag, self.encoder.audio_mag)

        # Normalization
        loss_vis = raw_loss_vis * self.scale_vis
        loss_aud = raw_loss_aud * self.scale_aud

        # Scalarized loss with the 20x visual boost
        total_loss = torch.sum(
            self.weights_img * loss_vis * 20.0 + self.weights_aud * loss_aud
        )

        # Backward pass and optimizer step
        total_loss.backward()
        self.optimizer.step()

        # Re-apply the anchor constraints
        if self.pop_size >= 2:
            with torch.no_grad():
                self.mask_logits[0].fill_(20.0)
                self.mask_logits[-1].fill_(-20.0)

        return loss_vis.detach().mean().item(), loss_aud.detach().mean().item()
    
    def get_current_front(self):
        """Returns the current losses of the population for plotting."""
        with torch.no_grad():
            masks = torch.sigmoid(self.mask_logits)
            _, mixed_mag = self.encoder(masks, return_wav=False)
            
            diff = torch.abs(mixed_mag - self.encoder.image_mag_ref)
            loss_vis = diff.mean(dim=(1, 2)) * self.scale_vis
            loss_aud = calc_audio_mag_loss(mixed_mag, self.encoder.audio_mag) * self.scale_aud
            
            return loss_vis.numpy(), loss_aud.numpy()

## 7. Single-Objective Optimization (Gradient Descent)
Here we carry out a basic "probe" of the solution space for individual Pareto-optimal points.

The Adam algorithm (Adaptive Moment Estimation) acts here as a local search. Because all operations involved (including the STFT and the interpolations) are differentiable, we can efficiently find an optimum for a given weight combination.

Let us follow the loss functions over time. Note that, as expected, the visual and audio losses are antagonistic: a decrease in one correlates with an increase in the other.
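
To see the trade-off in isolation, consider a toy problem that is not the notebook's actual objective: a single mask value $m = \mathrm{sigmoid}(\theta)$ with made-up losses $(1-m)^2$ (visual) and $m^2$ (audio). Setting the derivative of the weighted sum to zero gives $m^* = w_\mathrm{vis}/(w_\mathrm{vis}+w_\mathrm{aud})$. The sketch below uses a hand-written Adam update with the usual defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ and should approach that value:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def adam_minimize(w_vis, w_aud, steps=500, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize w_vis*(1-m)^2 + w_aud*m^2 over the logit theta, with m = sigmoid(theta)."""
    theta, m1, m2 = 0.0, 0.0, 0.0            # logit 0 corresponds to mask value 0.5
    for t in range(1, steps + 1):
        m = sigmoid(theta)
        # Chain rule: d(loss)/dm times dm/dtheta.
        grad = (-2.0 * w_vis * (1.0 - m) + 2.0 * w_aud * m) * m * (1.0 - m)
        m1 = beta1 * m1 + (1.0 - beta1) * grad
        m2 = beta2 * m2 + (1.0 - beta2) * grad * grad
        m1_hat = m1 / (1.0 - beta1**t)       # bias-corrected first moment
        m2_hat = m2 / (1.0 - beta2**t)       # bias-corrected second moment
        theta -= lr * m1_hat / (math.sqrt(m2_hat) + eps)
    return sigmoid(theta)


for w_vis in (0.8, 0.5, 0.2):
    print(w_vis, round(adam_minimize(w_vis, 1.0 - w_vis), 3))
```

Raising the visual weight moves the optimum toward $m = 1$ (lower visual loss, higher audio loss) and vice versa, which is the antagonism visible in the convergence plots of this section.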

In [None]:
def single_objective_optimize(encoder, w_vis=0.5, w_aud=0.5, steps=SINGLE_STEPS, lr=0.05):
    """
    Single-objective optimization with the given weights.
    
    This corresponds exactly to the frontend's three presets:
    - "Better image": w_vis=0.8, w_aud=0.2
    - "Balanced":     w_vis=0.5, w_aud=0.5
    - "Better sound": w_vis=0.2, w_aud=0.8
    
    Args:
        encoder: MaskEncoder instance
        w_vis: Visual weight (0-1)
        w_aud: Audio weight (0-1)
        steps: Number of optimization steps (default: 500)
        lr: Learning rate
    
    Returns:
        mask: Tensor with the optimized mask
        history: List of (vis_loss, aud_loss) tuples
    """
    # Create a manager for a single individual
    manager = ParetoManager(encoder, pop_size=1, learning_rate=lr)
    manager.calculate_normalization()
    
    # Override the weights (normalized to sum to 1)
    total_weight = w_vis + w_aud
    w_vis_norm = w_vis / total_weight
    w_aud_norm = w_aud / total_weight
    
    manager.weights_img = torch.tensor([w_vis_norm])
    manager.weights_aud = torch.tensor([w_aud_norm])
    
    # Initialize the mask to a neutral value (logit = 0 → sigmoid = 0.5)
    with torch.no_grad():
        manager.mask_logits.data.zero_()
    
    history = []
    
    print(f"Optimizing with weights: Visual={w_vis_norm:.2f}, Audio={w_aud_norm:.2f}")
    print(f"Steps: {steps}, Learning rate: {lr}")
    
    for step in range(1, steps + 1):
        avg_vis, avg_aud = manager.optimize_step()
        history.append((avg_vis, avg_aud))
        
        if step % 100 == 0 or step == 1:
            print(f"  Step {step:4d}: Visual={avg_vis:.4f}, Audio={avg_aud:.4f}")
    
    # Get the final mask
    final_mask = torch.sigmoid(manager.mask_logits).detach()
    
    return final_mask, history

# Run the three frontend presets
import time

print("=" * 60)
print("🎨 IMAGE PRIORITY (80/20)")
print("=" * 60)
start = time.time()
mask_visual, hist_visual = single_objective_optimize(encoder, w_vis=0.8, w_aud=0.2)
print(f"Finished in {time.time() - start:.1f}s\n")

print("=" * 60)
print("⚖️ BALANCED (50/50)")
print("=" * 60)
start = time.time()
mask_balanced, hist_balanced = single_objective_optimize(encoder, w_vis=0.5, w_aud=0.5)
print(f"Finished in {time.time() - start:.1f}s\n")

print("=" * 60)
print("🔊 SOUND PRIORITY (20/80)")
print("=" * 60)
start = time.time()
mask_audio, hist_audio = single_objective_optimize(encoder, w_vis=0.2, w_aud=0.8)
print(f"Finished in {time.time() - start:.1f}s\n")

# Plot convergence
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, hist, title, color in [
    (axes[0], hist_visual, 'Visual priority (80/20)', 'purple'),
    (axes[1], hist_balanced, 'Balanced (50/50)', 'gray'),
    (axes[2], hist_audio, 'Audio priority (20/80)', 'green'),
]:
    vis_losses = [h[0] for h in hist]
    aud_losses = [h[1] for h in hist]
    
    ax.plot(vis_losses, label='Visual loss', color='purple', alpha=0.8)
    ax.plot(aud_losses, label='Audio loss', color='green', alpha=0.8)
    ax.set_xlabel('Step')
    ax.set_ylabel('Normalized loss')
    ax.set_title(title)
    ax.legend()
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Visualize the results
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

results = [
    (mask_visual, 'Visual priority (80/20)'),
    (mask_balanced, 'Balanced (50/50)'),
    (mask_audio, 'Audio priority (20/80)')
]

for i, (mask, title) in enumerate(results):
    # Generate the output at full resolution
    _, mixed_mag = encoder(mask, return_wav=True)  # return_wav=True selects full resolution
    
    # Visualize the mask
    axes[0, i].imshow(
        mask.view(GRID_HEIGHT, GRID_WIDTH).numpy(), 
        cmap='gray', vmin=0, vmax=1
    )
    axes[0, i].set_title(f'{title}\nMask (white=image, black=audio)')
    axes[0, i].axis('off')
    
    # Spectrogram (scaled to match the frontend)
    plot_spectrogram(axes[1, i], mixed_mag[0], 'Resulting spectrogram')
    axes[1, i].set_xlabel('Time')
    axes[1, i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Listen to the results

print("🎨 IMAGE PRIORITY (80/20):")
wav_visual, _ = encoder(mask_visual, return_wav=True)
display(Audio(wav_visual[0].numpy(), rate=SAMPLE_RATE))

print("⚖️  BALANCED SOLUTION (50/50):")
wav_balanced, _ = encoder(mask_balanced, return_wav=True)
display(Audio(wav_balanced[0].numpy(), rate=SAMPLE_RATE))

print("🔊 SOUND PRIORITY (20/80):")
wav_audio, _ = encoder(mask_audio, return_wav=True)
display(Audio(wav_audio[0].numpy(), rate=SAMPLE_RATE))

## 8. Mapping the Complete Pareto Front (NSGA-II)

To map points on the Pareto front efficiently, we use hybrid optimization.

### Phase 1: Gradient Seeding

First we run a short, intensive batch of parallel gradient descents with different weights. This finds several individuals close to the Pareto front. These points serve as high-quality genetic material (seeds) for the next phase.

### Phase 2: Evolutionary Expansion (NSGA-II)

We then switch to the NSGA-II algorithm (Non-dominated Sorting Genetic Algorithm II) [Deb et al., 2002]. This algorithm works not with gradients but with direct comparison of fitness vectors.

**Key mechanics:**

*   **Non-dominated Sorting:** Sorts the population into "layers" (fronts) by dominance. The first layer contains the solutions that no other solution dominates and is the current estimate of the Pareto front; each further layer contains the solutions that become non-dominated once the previous layers are removed.
*   **Crowding Distance:** A metric that favors solutions in less explored regions of the front. This prevents premature convergence to a single point and preserves the diversity of the population, and with it the coverage of the whole front.
*   **SBX Crossover & Polynomial Mutation:** Genetic operators that simulate crossover and mutation on real-valued chromosomes, applied to the mask parameters.

This process gradually "smooths" and fills in the gaps between the original gradient points, revealing the structure of the Pareto front.
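
The first two mechanics can be sketched in a few lines of plain Python. This is a simple illustrative version that peels off one front at a time, not the fast non-dominated sort of Deb et al. and not what pymoo runs internally; all names and sample points are ours:

```python
def dominates(a, b):
    """True if loss vector `a` Pareto-dominates `b` (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points):
    """Split point indices into fronts; front 0 is the current Pareto-front estimate."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(points, front):
    """Sum over objectives of the normalized gap between each point's two neighbours."""
    distance = {i: 0.0 for i in front}
    n_obj = len(points[front[0]])
    for k in range(n_obj):
        order = sorted(front, key=lambda i: points[i][k])
        lo, hi = points[order[0]][k], points[order[-1]][k]
        # Boundary points get infinite distance so they are always kept.
        distance[order[0]] = distance[order[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            distance[cur] += (points[nxt][k] - points[prev][k]) / (hi - lo)
    return distance


# Toy (L_visual, L_audio) values, invented for illustration.
pts = [(0.0, 1.0), (0.2, 0.5), (0.3, 0.45), (1.0, 0.0), (0.6, 0.8), (0.9, 0.9)]
fronts = non_dominated_sort(pts)
cd = crowding_distance(pts, fronts[0])
print(fronts)
print(cd)
```

Within a front, NSGA-II prefers individuals with a larger crowding distance, so isolated points and the two extremes survive selection ahead of points in dense clusters.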

In [None]:
def run_pareto_optimization(encoder, 
                            seed_steps=PARETO_SEED_STEPS,
                            seed_pop=PARETO_SEED_POP,
                            n_gen=PARETO_GENERATIONS,
                            evol_pop=PARETO_EVOL_POP):
    """
    Full Pareto front optimization, matching the frontend exactly.
    
    Args:
        encoder: MaskEncoder instance
        seed_steps: Gradient seeding steps (default: 200)
        seed_pop: Size of the seeding population (default: 10)
        n_gen: Number of NSGA-II generations (default: 50)
        evol_pop: Size of the evolutionary population (default: 100)
    
    Returns:
        final_X: (N, params) Pareto-optimal mask parameters
        final_F: (N, 2) Pareto-optimal objective values
        full_history: List of population snapshots for the animation
    """
    from pymoo.core.problem import Problem
    from pymoo.core.callback import Callback
    from pymoo.algorithms.moo.nsga2 import NSGA2
    from pymoo.optimize import minimize
    from pymoo.operators.crossover.sbx import SBX
    from pymoo.operators.mutation.pm import PM
    
    # ========================================================================
    # Phase 1: Gradient seeding
    # ========================================================================
    print("=" * 60)
    print("PHASE 1: GRADIENT SEEDING")
    print(f"Population: {seed_pop}, Steps: {seed_steps}")
    print("=" * 60)
    
    manager = ParetoManager(encoder, pop_size=seed_pop, learning_rate=0.05)
    manager.calculate_normalization()
    
    phase1_history = []
    
    for step in range(1, seed_steps + 1):
        manager.optimize_step()
        
        # Record history every 5 steps (for the animation)
        if step % 5 == 0:
            vis, aud = manager.get_current_front()
            phase1_history.append(np.column_stack([vis, aud]))
        
        if step % 50 == 0:
            vis, aud = manager.get_current_front()
            print(f"  Step {step:4d}: Vis range [{vis.min():.3f}, {vis.max():.3f}], "
                  f"Aud range [{aud.min():.3f}, {aud.max():.3f}]")
    
    # Extract the initial solutions (seeds)
    with torch.no_grad():
        seed_masks = torch.sigmoid(manager.mask_logits).cpu().numpy()
    
    print(f"\nGenerated {len(seed_masks)} initial solutions")
    
    # ========================================================================
    # Phase 2: Evolutionary expansion (NSGA-II)
    # ========================================================================
    print("\n" + "=" * 60)
    print("PHASE 2: EVOLUTIONARY ALGORITHM")
    print(f"Population: {evol_pop}, Generations: {n_gen}")
    print("=" * 60)
    
    class MOSSProblem(Problem):
        def __init__(self, enc, scale_vis, scale_aud):
            self.enc = enc
            self.scale_vis = scale_vis
            self.scale_aud = scale_aud
            super().__init__(
                n_var=enc.grid_height * enc.grid_width,
                n_obj=2,
                xl=0.0,
                xu=1.0
            )
        
        def _evaluate(self, x, out, *args, **kwargs):
            with torch.no_grad():
                masks = torch.from_numpy(x).float()
                _, mixed_mag = self.enc(masks, return_wav=False)  # proxy is used automatically
                
                vis = calc_visual_loss(mixed_mag, self.enc.image_mag_ref).numpy()
                aud = calc_audio_mag_loss(mixed_mag, self.enc.audio_mag).numpy()
                
                vis = vis * self.scale_vis
                aud = aud * self.scale_aud
                
                out['F'] = np.column_stack([vis, aud])
    
    class HistoryCallback(Callback):
        def __init__(self):
            super().__init__()
            self.history = []
        
        def notify(self, algorithm):
            # Record the surviving population
            self.history.append(algorithm.pop.get('F').copy())
    
    problem = MOSSProblem(encoder, manager.scale_vis, manager.scale_aud)
    
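    # Initialize with the seeds plus random individuals.
    # Flatten the seeds to (n_seeds, n_var) and cap them at evol_pop so the
    # stacked array always has exactly evol_pop rows.
    seed_masks = seed_masks.reshape(len(seed_masks), -1)[:evol_pop]
    n_random = evol_pop - len(seed_masks)
    X_init = np.vstack([
        seed_masks,
        np.random.rand(n_random, problem.n_var)
    ])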
    
    algorithm = NSGA2(
        pop_size=evol_pop,
        sampling=X_init,
        crossover=SBX(eta=15, prob=0.9),
        mutation=PM(eta=20),
        eliminate_duplicates=True
    )
    
    callback = HistoryCallback()
    
    result = minimize(
        problem,
        algorithm,
        ('n_gen', n_gen),
        callback=callback,
        verbose=True
    )
    
    full_history = phase1_history + callback.history
    
    print(f"\n✅ Found {len(result.F)} Pareto-optimal solutions")
    
    return result.X, result.F, full_history

# Run the full Pareto optimization
start = time.time()
pareto_X, pareto_F, pareto_history = run_pareto_optimization(encoder)
print(f"\nTotal time: {time.time() - start:.1f}s")

# Visualize the evolution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Final Pareto front
axes[0].scatter(
    pareto_F[:, 0], pareto_F[:, 1],
    c='#4ade80', s=60, edgecolors='white', linewidths=1,
    label='Final Pareto front', zorder=10
)

# Seeds for reference: the last snapshot recorded during phase 1
phase1_len = PARETO_SEED_STEPS // 5  # History is recorded every 5 steps
if pareto_history:
    seed_frame = max(min(phase1_len, len(pareto_history)) - 1, 0)
    seeds = pareto_history[seed_frame]
    axes[0].scatter(
        seeds[:, 0], seeds[:, 1],
        c='purple', s=40, alpha=0.3, label='Seeding'
    )

axes[0].set_xlabel('Visual loss (normalized) ↓')
axes[0].set_ylabel('Audio loss (normalized) ↓')
axes[0].set_title('Pareto front: visual vs. audio quality')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Evolution over time
n_hist = len(pareto_history)

for i, frame in enumerate(pareto_history):
    if i < phase1_len:
        color = 'purple'
        alpha = 0.1 + 0.3 * (i / phase1_len)
    else:
        color = '#4ade80'
        alpha = 0.2 + 0.6 * ((i - phase1_len) / (n_hist - phase1_len))
    
    axes[1].scatter(frame[:, 0], frame[:, 1], c=color, alpha=alpha, s=20)

axes[1].set_xlabel('Visual loss (normalized) ↓')
axes[1].set_ylabel('Audio loss (normalized) ↓')
axes[1].set_title('Evolution over time (purple = seeding, green = evolution)')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Animated Evolution

This animation shows the optimization process in two phases:
- **Phase 1 (purple)**: Gradient seeding, with smooth point transitions
- **Phase 2 (green)**: NSGA-II evolution, with fading trails

In [None]:
from scipy.spatial.distance import cdist
from matplotlib.animation import FuncAnimation

def create_pareto_animation(history, phase1_frames=40):
    """
    Create an animated visualization of the Pareto optimization.
    Matches the animation style of the frontend/backend (light mode).
    """
    if not history:
        print("No history to animate")
        return None
    
    # Set up the plot (light version)
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.set_title("Pareto optimization animation", fontsize=14, pad=10)
    ax.set_xlabel("Visual loss (normalized) ↓")
    ax.set_ylabel("Audio loss (normalized) ↓")
    ax.grid(True, alpha=0.3)
    
    # Find the global axis limits
    all_points = np.vstack(history)
    min_x, max_x = all_points[:, 0].min(), all_points[:, 0].max()
    min_y, max_y = all_points[:, 1].min(), all_points[:, 1].max()
    
    pad_x = (max_x - min_x) * 0.1 if max_x > min_x else 0.1
    pad_y = (max_y - min_y) * 0.1 if max_y > min_y else 0.1
    
    ax.set_xlim(min_x - pad_x, max_x + pad_x)
    ax.set_ylim(min_y - pad_y, max_y + pad_y)
    
    # Scatter plot for the current frame
    scat = ax.scatter([], [], c='#a855f7', alpha=0.8, s=60, 
                      edgecolors='white', linewidths=0.5)
    
    # Trail scatter plots for the fade-out effect
    trail_depth = 5
    trail_scats = [ax.scatter([], [], c='#4ade80', alpha=0, s=40) 
                   for _ in range(trail_depth)]
    
    text = ax.text(0.02, 0.98, '', transform=ax.transAxes, fontsize=10, va='top')
    
    prev_positions = [None]  # A list, so the closure can mutate it
    
    def match_points(prev, curr):
        """Greedily match points between frames for a smooth transition."""
        if prev is None or len(prev) != len(curr):
            return curr
        dist = cdist(prev, curr)
        matched = np.zeros_like(curr)
        used = set()
        for i in range(len(prev)):
            dists = dist[i].copy()
            dists[list(used)] = np.inf
            j = np.argmin(dists)
            matched[i] = curr[j]
            used.add(j)
        return matched
    
    def update(frame):
        if frame >= len(history):
            return (scat, text, *trail_scats)
        
        data = history[frame]
        
        # Smooth transition during the seeding phase
        if frame < phase1_frames:
            data = match_points(prev_positions[0], data)
        prev_positions[0] = data.copy()
        
        scat.set_offsets(data)
        
        # Color by phase
        if frame < phase1_frames:
            scat.set_facecolors('#a855f7')  # Purple for seeding
            # Snapshot `frame` was recorded after step (frame + 1) * 5
            text.set_text(f'Phase 1: Gradient seeding (step {(frame + 1) * 5})')
            for ts in trail_scats:
                ts.set_offsets(np.empty((0, 2)))
        else:
            scat.set_facecolors('#4ade80')  # Green for evolution
            gen = frame - phase1_frames + 1  # Generations are counted from 1
            text.set_text(f'Phase 2: Evolution (gen {gen})')
            
            # Update the trails with a fade-out effect
            for i, ts in enumerate(trail_scats):
                trail_frame = frame - (i + 1)
                if trail_frame >= phase1_frames and trail_frame < len(history):
                    ts.set_offsets(history[trail_frame])
                    alpha = 0.3 * (1 - (i / trail_depth))
                    ts.set_alpha(alpha)
                else:
                    ts.set_offsets(np.empty((0, 2)))
        
        return (scat, text, *trail_scats)
    
    ani = FuncAnimation(fig, update, frames=len(history), interval=100, blit=True)
    plt.close(fig)  # Prevent a duplicate static figure
    
    return ani

# Create and display the animation
phase1_len = PARETO_SEED_STEPS // 5
ani = create_pareto_animation(pareto_history, phase1_frames=phase1_len)
if ani is not None:
    display(HTML(ani.to_jshtml()))

## 9. Final Analysis and Summary

In this notebook we implemented an end-to-end system for synthesizing spectrograms that carry sound and image at the same time. The main result is a relatively fast way of finding the Pareto front of a multi-objective optimization problem with a very high-dimensional parameter space.

### Design decisions in the context of physics

*   **Time vs. frequency:** The choice of $N_{FFT}=1024$ at 16 kHz is a direct consequence of balancing temporal localization (for rhythm) against frequency resolution (for image detail).
*   **Phase invariance:** The decision to ignore the visual phase and impose the audio phase is what makes the result listenable. We assume that the semantics of the sound is carried by the temporal arrangement of phases.
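Both points can be made concrete with a little arithmetic. The sketch below uses only the values stated above ($N_{FFT}=1024$, 16 kHz); the random arrays are placeholders for the real audio STFT and mixed magnitude, and the ISTFT step is omitted.

```python
import numpy as np

sample_rate = 16_000  # Hz, as stated in the text
n_fft = 1024          # as stated in the text

# Time vs. frequency: a longer window sharpens frequency, blurs time.
freq_resolution = sample_rate / n_fft  # Hz per frequency bin
window_duration = n_fft / sample_rate  # seconds covered by one analysis window
n_bins = n_fft // 2 + 1                # one-sided spectrum of a real signal

print(freq_resolution)  # 15.625
print(window_duration)  # 0.064
print(n_bins)           # 513

# Phase imposition: keep the optimized magnitude, borrow the audio phase.
rng = np.random.default_rng(0)
audio_stft = rng.normal(size=(n_bins, 4)) + 1j * rng.normal(size=(n_bins, 4))  # placeholder
mixed_mag = rng.uniform(0.1, 1.0, size=(n_bins, 4))                            # placeholder

phi_audio = np.angle(audio_stft)
mixed_stft = mixed_mag * np.exp(1j * phi_audio)

# The result has exactly the requested magnitude and the audio's phase.
print(np.allclose(np.abs(mixed_stft), mixed_mag))  # True
```

Each spectrogram row therefore spans about 15.6 Hz and each analysis window about 64 ms, which is the trade-off the first bullet refers to.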

### Citations and References

*   **NSGA-II:** Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. *IEEE Transactions on Evolutionary Computation*.
*   **STFT Reconstruction:** Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. *IEEE Transactions on Acoustics, Speech, and Signal Processing*.
*   **Pareto Optimality:** Pareto, V. (1896). *Cours d'économie politique*.
*   **ADAM:** Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. *arXiv preprint arXiv:1412.6980*.
*   **L1 Loss in Image Synthesis:** Zhao, H., Gallo, O., Frosio, I., & Kautz, J. (2016). Loss functions for image restoration with neural networks. *IEEE Transactions on Computational Imaging*, 3(1), 47-57.
*   **Phase Reconstruction:** Allen, J. B. (1977). Short term spectral analysis, synthesis, and modification by discrete Fourier transform. *IEEE Transactions on Acoustics, Speech, and Signal Processing*, 25(3), 235-238.
*   **Psychoacoustics:** Fastl, H., & Zwicker, E. (2007). *Psychoacoustics: Facts and models*. Springer.
*   **Tacotron and Log L1 loss:** Wang, Y., et al. (2017). Tacotron: Towards end-to-end speech synthesis. *Interspeech*.

This system demonstrates that even relatively simple building blocks (STFT, linear interpolation), combined with robust optimization, can produce non-trivial and perhaps artistically interesting results.

In [None]:
# Explore the Pareto front

sorted_idx = np.argsort(pareto_F[:, 0])

# Pick up to 5 samples spread evenly along the whole front
n_samples = min(5, len(sorted_idx))
sample_indices = sorted_idx[
    np.linspace(0, len(sorted_idx) - 1, n_samples).round().astype(int)
]

# squeeze=False keeps `axes` 2-D even when there is only one sample
fig, axes = plt.subplots(2, n_samples, figsize=(4 * n_samples, 8), squeeze=False)

for col, idx in enumerate(sample_indices):
    mask = torch.from_numpy(pareto_X[idx:idx+1]).float()
    wav, mixed_mag = encoder(mask, return_wav=True)
    
    # Mask
    axes[0, col].imshow(
        mask.view(GRID_HEIGHT, GRID_WIDTH).numpy(),
        cmap='gray', vmin=0, vmax=1
    )
    axes[0, col].set_title(f'V:{pareto_F[idx,0]:.3f}, A:{pareto_F[idx,1]:.3f}')
    axes[0, col].axis('off')
    
    # Spectrogram (scaling matches the frontend)
    plot_spectrogram(axes[1, col], mixed_mag[0])
    axes[1, col].axis('off')

plt.suptitle('Samples across the Pareto front (left = best visual, right = best audio)')
plt.tight_layout()
plt.show()

# Listen to the samples
for i, idx in enumerate(sample_indices):
    mask = torch.from_numpy(pareto_X[idx:idx+1]).float()
    wav, _ = encoder(mask, return_wav=True)
    print(f"\n🎵 Sample {i+1}: Visual={pareto_F[idx,0]:.3f}, Audio={pareto_F[idx,1]:.3f}")
    display(Audio(wav[0].numpy(), rate=SAMPLE_RATE))

*Made with 💜 by [Vojtěch Kuchař](https://github.com/kuchar-one)*