## Prepare environment

In [None]:
!git clone https://github.com/podmabsterio/dla_avss.git
%cd dla_avss
!pip install -r requirements.txt

## Download pretrained weights

- **ConvTasNet** – baseline audio separation model.
- **RTFSNet** – our main speech separation network.
- **Video Encoder** – generates mouth embeddings used as input to RTFSNet.


In [1]:
out_dir = "weights"  # optional: change the target dir; later cells load weights/rtfsnet.pth, so update those paths too
!bash scripts/download_convtasnet.sh $out_dir
!bash scripts/download_rtfsnet.sh $out_dir
!bash scripts/download_video_encoder.sh $out_dir

Downloading weights...
Downloading...
From (original): https://drive.google.com/uc?id=18TEetAQ1212HoMBdnDWMd-1soRHJghA_
From (redirected): https://drive.google.com/uc?id=18TEetAQ1212HoMBdnDWMd-1soRHJghA_&confirm=t&uuid=3597e0e3-4d7e-4ad1-9d0d-e7677c60d531
To: /home/mabondarenko_4/dla_avss/weights/convtasnet.pth
100%|██████████████████████████████████████| 60.7M/60.7M [00:03<00:00, 18.2MB/s]
Download completed: weights/convtasnet.pth
Downloading weights...
Downloading...
From: https://drive.google.com/uc?id=15tHE1Obdn7GZZ6xGy2q9s11dOqq2DGrz
To: /home/mabondarenko_4/dla_avss/weights/rtfsnet.pth
100%|██████████████████████████████████████| 11.0M/11.0M [00:00<00:00, 63.8MB/s]
Download completed: weights/rtfsnet.pth
Downloading weights...
Downloading...
From (original): https://drive.google.com/uc?id=1TGFG0dW5M3rBErgU8i0N7M1ys9YMIvgm
From (redirected): https://drive.google.com/uc?id=1TGFG0dW5M3rBErgU8i0N7M1ys9YMIvgm&confirm=t&uuid=e62b8e73-39e6-4089-81e4-13bfc451cc95
To: /home/mabondarenko_

## Example: Running Inference


If your dataset already contains ground truth signals, you can run inference and automatically compute all metrics by specifying a metric configuration in `metrics`.  
Make sure that the dataset parameter `expect_target` is set to `True`.

If your dataset is not split into train/val/test partitions, pass `datasets.inf.partition=null` during inference, which sets `partition` to `None`.

Results are saved in the `predictions` directory. Before running inference, the cell below removes any previous `predictions` directory so that old results do not mix with new ones.

In [2]:
import os
import shutil

if os.path.exists("predictions"):
    shutil.rmtree("predictions")

!python inference.py \
    -cn=inf_rtfsnet.yaml \
    metrics=pit \
    datasets.inf.partition=train \
    datasets.inf.expect_target=True \
    datasets.inf.dataset_path=example_data \
    video_encoder.dataset_path=example_data \
    inferencer.save_path=predictions \
    inferencer.from_pretrained=weights/rtfsnet.pth

Generating video embeddings:   0%|                       | 0/20 [00:00<?, ?it/s]
Creating video embeddings directory: example_data/video_embeddings
Generating video embeddings: 100%|██████████████| 20/20 [00:05<00:00,  3.61it/s]
Video embeddings created and saved to: example_data/video_embeddings
Creating index...
Indexing files: 100%|██████████████████████████| 10/10 [00:00<00:00, 109.32it/s]
RTFSNet2SpeakersSeparation(
  (rtfs_net): RTFSNet(
    (encoder): AudioEncoder(
      (conv): Sequential(
        (0): Conv2d(2, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (1): ReLU()
        (2): GroupNorm(1, 256, eps=1e-05, affine=True)
      )
    )
    (audio_bottleneck): Sequential(
      (0): GroupNorm(1, 256, eps=1e-05, affine=True)
      (1): ReLU()
      (2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
    )
    (rtfs_block): RTFSBlock(
      (scaling_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), groups=256)
      (prelu): PReLU(num_pa

If your dataset contains only mixed audio (without ground-truth sources), you cannot compute separation metrics during inference.  
In this case, use the `empty` metric configuration (an empty list of metrics) and set `expect_target=False`.

In [3]:
if os.path.exists("predictions"):
    shutil.rmtree("predictions")

!python inference.py \
    -cn=inf_rtfsnet.yaml \
    metrics=empty \
    datasets.inf.partition=inf \
    datasets.inf.expect_target=False \
    datasets.inf.dataset_path=example_data \
    video_encoder.dataset_path=example_data \
    inferencer.save_path=predictions \
    inferencer.from_pretrained=weights/rtfsnet.pth

Video embeddings directory exists, skipping embeddings creation
Creating index...
Indexing files: 100%|█████████████████████████| 10/10 [00:00<00:00, 1177.58it/s]
RTFSNet2SpeakersSeparation(
  (rtfs_net): RTFSNet(
    (encoder): AudioEncoder(
      (conv): Sequential(
        (0): Conv2d(2, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (1): ReLU()
        (2): GroupNorm(1, 256, eps=1e-05, affine=True)
      )
    )
    (audio_bottleneck): Sequential(
      (0): GroupNorm(1, 256, eps=1e-05, affine=True)
      (1): ReLU()
      (2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
    )
    (rtfs_block): RTFSBlock(
      (scaling_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), groups=256)
      (prelu): PReLU(num_parameters=1)
      (downsampling): CompressionModule(
        (downsampling_conv): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
        (gln): GroupNorm(1, 64, eps=1e-05, affine=True)
        (prelu): PReLU(num_parameters=1)
   

If you have already saved model predictions to disk and later obtained the corresponding ground-truth sources, you can compute the separation metrics afterwards using the `calc_metrics.py` script.

In [4]:
!python calc_metrics.py \
    -cn=calc_metrics_rtfsnet.yaml \
    metric_calculator.pred_path=predictions/inf \
    metric_calculator.gt_path=example_data/audio/train

si_snri: 12.064942359924316
si_sdri: 12.064152717590332
si_snr: 12.145024299621582
pesq: 2.3135225772857666
stoi: 0.915973961353302
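
To make the numbers above easier to interpret, here is a minimal sketch of SI-SNR with permutation-invariant scoring in plain Python. It is not the code behind `calc_metrics.py` or the `pit` configuration: the function names `si_snr` and `pit_si_snr` are hypothetical, and the sketch assumes that `pit` stands for permutation-invariant evaluation, where estimates are matched to targets in whichever order gives the best average score.

```python
import math
from itertools import permutations


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target; the remainder counts as noise.
    dot = sum(e * t for e, t in zip(est, tgt))
    scale = dot / (sum(t * t for t in tgt) + eps)
    proj = [scale * t for t in tgt]
    noise = [e - p for e, p in zip(est, proj)]
    proj_energy = sum(p * p for p in proj)
    noise_energy = sum(x * x for x in noise)
    return 10 * math.log10((proj_energy + eps) / (noise_energy + eps))


def pit_si_snr(estimates, targets):
    """Return (best mean SI-SNR, permutation), where perm[i] is the index
    of the estimate assigned to targets[i]."""
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(len(targets))):
        scores = [si_snr(estimates[p], targets[i]) for i, p in enumerate(perm)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_score, best_perm = mean_score, perm
    return best_score, best_perm
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant. Under the usual convention, an improvement variant such as `si_snri` subtracts the SI-SNR of the unprocessed mixture; whether this repository follows that convention is an assumption.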


## Load a custom dataset

You can try the model with your **own dataset stored on Yandex Disk**.  
Paste a public link to your dataset (shared so that anyone with the link can access it); it is downloaded as `custom_dataset.zip` and extracted into `data/datasets`, ready for inference.


In [8]:
import os
import yadisk
import zipfile

!mkdir -p data/datasets/

dataset_link = input("Enter the link to your dataset (public Yandex Disk link): ")
y = yadisk.YaDisk()
y.download_public(dataset_link, "custom_dataset.zip")
with zipfile.ZipFile("custom_dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("data/datasets")

In [9]:
dataset_dir = "data/datasets/with_no_gt"  # change to your dataset folder name
# Ground-truth sources are detected by the presence of audio/s1;
# without them, metrics are skipped.
if os.path.isdir(os.path.join(dataset_dir, "audio", "s1")):
    expect_target = True
    metrics_conf = "pit"
else:
    expect_target = False
    metrics_conf = "empty"


In [10]:
if os.path.exists("predictions"):
    shutil.rmtree("predictions")

!python inference.py \
    -cn=inf_rtfsnet.yaml \
    metrics=$metrics_conf \
    datasets.inf.partition=null \
    datasets.inf.expect_target=$expect_target \
    datasets.inf.dataset_path=$dataset_dir \
    video_encoder.dataset_path=$dataset_dir \
    inferencer.save_path=predictions \
    inferencer.from_pretrained=weights/rtfsnet.pth

Generating video embeddings:   0%|                       | 0/20 [00:00<?, ?it/s]
Creating video embeddings directory: data/datasets/with_no_gt/video_embeddings
Generating video embeddings: 100%|██████████████| 20/20 [00:02<00:00,  9.70it/s]
Video embeddings created and saved to: data/datasets/with_no_gt/video_embeddings
Creating index...
Indexing files: 100%|█████████████████████████| 10/10 [00:00<00:00, 1294.90it/s]
RTFSNet2SpeakersSeparation(
  (rtfs_net): RTFSNet(
    (encoder): AudioEncoder(
      (conv): Sequential(
        (0): Conv2d(2, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (1): ReLU()
        (2): GroupNorm(1, 256, eps=1e-05, affine=True)
      )
    )
    (audio_bottleneck): Sequential(
      (0): GroupNorm(1, 256, eps=1e-05, affine=True)
      (1): ReLU()
      (2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
    )
    (rtfs_block): RTFSBlock(
      (scaling_conv): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), groups=256)
   