# EfficientTDNN

This tutorial shows how to load a subnet from the trained supernet and evaluate it on several test sets. The subnet has the same size as the model described in [ECAPA-TDNN](http://www.isca-speech.org/archive/Interspeech_2020/abstracts/2650.html), which is called Efficient-Base in [EfficientTDNN](https://arxiv.org/abs/2103.13581). The implementation steps are summarized as follows.

1. Prepare the weights of the supernet and the batchnorm of a subnet.
2. Define the architecture of the subnet and load the weights.
3. Profile the efficiency metrics of the subnet, such as memory.
4. Evaluate the subnet in EER and minDCF $_{0.01}$ on several test sets, such as Vox1-O, Vox1-E, Vox1-H.

In [None]:
import os, time, warnings
warnings.filterwarnings("ignore")
import pandas
import torch
from torch.utils.data import DataLoader
from sugar.models import WrappedModel, veri_validate
from sugar.database import Utterance
from sugar.data.voxceleb1 import veriset
from sugar.scores import score_cohorts, asnorm
from sugar.vectors import extract_vectors
from sugar.metrics import print_size_of_model, profile, latency, calculate_mindcf, calculate_eer

In [2]:
device = 'cuda:3'

## Overview

The largest architecture in the supernet is denoted as `(4, [512, 512, 512, 512, 512], [5, 5, 5, 5, 5], 1536)`. In the supernet, kernels of different sizes are transformed into each other via linear transformation matrices. The trained supernet at each progressive training stage can be summarized as follows.

- largest: a single network with the maximum architecture in the supernet.
- kernel: 243 subnets that are nested in the supernet with the kernel size `{1, 3, 5}` at different layers.
- depth: 351 subnets with the depth `{2, 3, 4}` based on the kernel stage.
- width 1: a large number of subnets whose number of channels lies within `[0.5, 1.0]` of the maximum.
- width 2: a huge number of subnets that support a minimum of `0.25` of the maximum channels.
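
The subnet counts for the kernel and depth stages can be reproduced by enumeration. The sketch below assumes that a subnet of depth `d` has `d + 1` layers with a selectable kernel size, which matches the architecture tuples used in this tutorial. The helper name `count_kernel_configs` is illustrative and not part of the `sugar` package.

```python
from itertools import product

KERNEL_SIZES = (1, 3, 5)
DEPTHS = (2, 3, 4)

def count_kernel_configs(depth):
    # Assumption: a subnet of depth d has d + 1 layers with a selectable kernel size.
    return sum(1 for _ in product(KERNEL_SIZES, repeat=depth + 1))

# Kernel stage: only the maximum depth is available.
kernel_stage = count_kernel_configs(max(DEPTHS))
# Depth stage: every depth is available, each with its own kernel choices.
depth_stage = sum(count_kernel_configs(d) for d in DEPTHS)
print(kernel_stage, depth_stage)  # 243 351
```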


The bounded subnets from the supernet are summarized as follows.

- largest: `(4, [512, 512, 512, 512, 512], [5, 5, 5, 5, 5], 1536)`
- Kmin: `(4, [512, 512, 512, 512, 512], [1, 1, 1, 1, 1], 1536)`
- Dmin: `(2, [512, 512, 512], [1, 1, 1], 1536)`
- C1min: `(2, [256, 256, 256], [1, 1, 1], 768)`
- C2min: `(2, [128, 128, 128], [1, 1, 1], 384)`
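
Read against these bounds, each tuple appears to list the depth, the per-layer channel widths, the per-layer kernel sizes, and the width of the final layer. The sketch below checks a tuple against the bounds above under that reading. `is_within_bounds` is a hypothetical helper, and it checks only the ranges, not any step size the supernet may impose on channel widths.

```python
def is_within_bounds(arch):
    # Assumed tuple layout: (depth, widths, kernels, last_width).
    depth, widths, kernels, last_width = arch
    return (
        depth in (2, 3, 4)
        and len(widths) == depth + 1
        and len(kernels) == depth + 1
        and all(128 <= w <= 512 for w in widths)
        and all(k in (1, 3, 5) for k in kernels)
        and 384 <= last_width <= 1536
    )

largest = (4, [512] * 5, [5] * 5, 1536)
c2min = (2, [128] * 3, [1] * 3, 384)
base = (3, [512] * 4, [5, 3, 3, 3], 1536)  # the subnet evaluated in this tutorial
print(all(is_within_bounds(a) for a in (largest, c2min, base)))  # True
print(is_within_bounds((5, [512] * 6, [5] * 6, 1536)))           # False
```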

More details can be found in [EfficientTDNN at arXiv](https://arxiv.org/abs/2103.13581).

## Prepare the weights of the supernet and the batchnorm of a subnet and load the subnet

- The supernet contains all the weights, including batchnorm.
- The batchnorm weights are stored as the weights of a subnet, since the subnet inherits its other weights from the supernet, whereas its batchnorm is calibrated on some training speech utterances.

Note that the weights are downloaded from [huggingface](https://huggingface.co/mechanicalsea/efficient-tdnn) as follows.

- `repo_id = "mechanicalsea/efficient-tdnn"`
- supernet:
  - `filename = "depth/depth.torchparams"`
- subnet:
  - `filename = "depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar"`
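
The subnet filename appears to encode the architecture: the depth, then the channel widths, the kernel sizes, and the final width. The sketch below decodes it under that assumption, which is inferred from this single filename rather than from any documentation. `parse_subnet_filename` is a hypothetical helper.

```python
import os

def parse_subnet_filename(filename):
    # Assumed layout: <stage>.ecapa-tdnn.<depth>.<widths...>.<kernels...>.<last width>.bn.tar
    name = os.path.basename(filename)
    fields = [int(tok) for tok in name.split(".") if tok.isdigit()]
    depth, rest = fields[0], fields[1:]
    n_layers = depth + 1
    if len(rest) != 2 * n_layers + 1:
        raise ValueError(f"unexpected number of fields in {name!r}")
    return depth, rest[:n_layers], rest[n_layers:2 * n_layers], rest[-1]

arch = parse_subnet_filename("depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar")
print(arch)  # (3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)
```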

Specifically, we load the subnet as follows.

1. Define the supernet and load the weights of extractors.
2. Clone the subnet and load the weights of batchnorms.
3. Add the input layer, i.e., log Mel-filterbanks.

Note that the head `AAMSoftmax(192, 5994, 0.2, 30)`, which is used to compute the loss function, is not added because it is not needed for extracting speaker embeddings.

We print the results recorded in the paper [EfficientTDNN at arXiv](https://arxiv.org/abs/2103.13581).

In [3]:
repo_id = "mechanicalsea/efficient-tdnn"
supernet_filename = "depth/depth.torchparams"
subnet_filename = "depth/depth.ecapa-tdnn.3.512.512.512.512.5.3.3.3.1536.bn.tar"
subnet, info = WrappedModel.from_pretrained(
    repo_id=repo_id, supernet_filename=supernet_filename, subnet_filename=subnet_filename)
sup_state_dict = info['supernet']
sub_state_dict = info['subnet']

In [4]:
print(f"Subnet: {sub_state_dict['subnet']}")
print("Performance:")
for key in sub_state_dict.keys():
    if "EER" in key or "minDCF" in key:
        print(f"\t{key}\t{sub_state_dict[key]}{'%' if 'EER' in key else ''}")

Subnet: (3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)
Performance:
	EER (%) w/o AS-Norm	1.14%
	minDCF w/o AS-Norm	0.106
	EER (%) w/ AS-Norm	0.94%
	minDCF w/ AS-Norm	0.089


## Profile the efficiency metrics of the subnet

After sampling a subnet, the next step is to profile its efficiency metrics such as memory, MACs, and parameters, where the MACs are estimated by taking a 3-second utterance as the input.

In [5]:
input_size = [1, 48000]
macs, params = profile(subnet, input_size, device=device)
print(f"Subnet: {sub_state_dict['subnet']}")
model_size = print_size_of_model(subnet)
avg_lat = latency(subnet, input_size, device=device)
print(f'MACs {macs} Params {params} Memory {model_size:.2f} MB Latency {avg_lat:.2f} ms on the {device}')

Subnet: (3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)
MACs 1.45G Params 5.79M Memory 22.34 MB Latency 11.48 ms on the cuda:3
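
For reference, `input_size = [1, 48000]` corresponds to one 3-second utterance at 16 kHz, the sampling rate of VoxCeleb audio. How `sugar.metrics.latency` measures time is not shown in this tutorial; a generic approach is to run a few warm-up calls and then average the wall-clock time over several runs. The sketch below illustrates both points with hypothetical helpers. On a GPU, the device also has to be synchronized before reading the clock, which this CPU-only sketch omits.

```python
import time

SAMPLE_RATE = 16000  # VoxCeleb audio is sampled at 16 kHz

def input_size_for(seconds, batch_size=1):
    # Shape of a raw-waveform batch: [batch, samples].
    return [batch_size, int(seconds * SAMPLE_RATE)]

def average_latency_ms(fn, warmup=3, runs=10):
    # Hypothetical helper: warm up first, then average wall-clock time over `runs` calls.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000 / runs

print(input_size_for(3))  # [1, 48000]
print(f"{average_latency_ms(lambda: sum(range(10000))):.3f} ms")
```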


## Evaluate the subnet 

1. Prepare the datasets for evaluation, including the cohort set and the test sets.
2. Conduct evaluation on the Vox1-O, Vox1-E, and Vox1-H test sets.

### Prepare dataset for evaluation

1. Dataset for evaluating.
2. Dataset for cohort-based score normalization.

Note that the VoxCeleb1 and VoxCeleb2 data must be downloaded manually and saved in `vox1_root` and `vox2_root`, respectively.

In [6]:
!unzip -o datalst.zip

Archive:  datalst.zip
  inflating: list_test_all2.txt      
  inflating: list_test_hard2.txt     
  inflating: veri_test2.txt          
  inflating: vox2.6000.txt           
  inflating: vox2_trainlst.txt       


Set `vox1_root` and `vox2_root` as the root directories of the VoxCeleb1 and VoxCeleb2 data.

In [7]:
vox1_root = "/workspace/datasets/voxceleb/voxceleb1/"
vox2_root = "/workspace/datasets/voxceleb/voxceleb2/"

Load the test sets in the form of verification trials.

In [8]:
veritesto = "veri_test2.txt"
veriteste = "list_test_all2.txt"
veritesth = "list_test_hard2.txt"
veri_testo, veri_teste, veri_testh, wav_files = veriset(
    test2=veritesto, all2=veriteste, hard2=veritesth, rootdir=vox1_root, num_samples=0, num_eval=1)
testo_loader = DataLoader(veri_testo, batch_size=1, shuffle=False, num_workers=0)
teste_loader = DataLoader(veri_teste, batch_size=1, shuffle=False, num_workers=0)
testh_loader = DataLoader(veri_testh, batch_size=1, shuffle=False, num_workers=0)
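
Each line of a VoxCeleb trial list holds a label (`1` for the same speaker, `0` for different speakers) followed by the enrollment and test utterance paths. The sketch below parses such lines and collects the unique utterances, so that each embedding needs to be extracted only once and can be reused across trials. The sample paths are placeholders, and `parse_trials` is a hypothetical helper, not the `veriset` implementation.

```python
def parse_trials(lines):
    labels, enrolls, tests = [], [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        labels.append(int(parts[0]))
        enrolls.append(parts[1])
        tests.append(parts[2])
    # Unique utterances across both sides of all trials.
    utterances = sorted(set(enrolls) | set(tests))
    return labels, enrolls, tests, utterances

sample = [
    "1 spk_a/utt1.wav spk_a/utt2.wav",
    "0 spk_a/utt1.wav spk_b/utt1.wav",
]
labels, enrolls, tests, utterances = parse_trials(sample)
print(labels, len(utterances))  # [1, 0] 3
```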

Load the cohort dataset from the prepared list.

In [9]:
cohort_path = 'vox2.6000.txt'
prefix_root = vox2_root
with open(cohort_path, 'r') as f:
    cohort_txt = f.readlines()
    cohortlst = [os.path.join(prefix_root, utt.replace('\n', '')) for utt in cohort_txt]

cohortset = Utterance(cohortlst, num_samples=0, mode_eval=True)
cohorts = extract_vectors(subnet, cohortset, device=device)
cohorts = torch.cat(list(cohorts.values()))

Extract Vectors: 100%|██████████| 6000/6000 [00:51<00:00, 116.11it/s]


### Evaluate on the Vox1-O test set

- EER and minDCF $_{0.01}$ without the cohort-based adaptive score normalization (AS-Norm).
- Applying the AS-Norm with the cohort set containing utterance-wise speaker embeddings.

Note that the cohort set is smaller than the one used in [ECAPA-TDNN](http://www.isca-speech.org/archive/Interspeech_2020/abstracts/2650.html), where all training utterances are applied.
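
As a reference for the two metrics, the sketch below computes EER and normalized minDCF from trial labels and scores by sweeping the decision threshold. It assumes unit miss and false-alarm costs and is a plain-Python illustration; the `calculate_eer` and `calculate_mindcf` functions in `sugar.metrics` may interpolate or handle ties differently.

```python
def error_curve(labels, scores):
    # Return (miss rate, false-alarm rate) pairs as the threshold is lowered.
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(1.0, 0.0)]  # threshold above every score: all trials rejected
    accepted_tar = accepted_non = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:  # tied scores move together
            accepted_tar += ranked[j][1]
            accepted_non += 1 - ranked[j][1]
            j += 1
        points.append((1 - accepted_tar / n_tar, accepted_non / n_non))
        i = j
    return points

def eer(labels, scores):
    # Operating point where miss and false-alarm rates are closest.
    miss, fa = min(error_curve(labels, scores), key=lambda p: abs(p[0] - p[1]))
    return (miss + fa) / 2

def min_dcf(labels, scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    best = min(c_miss * p_target * miss + c_fa * (1 - p_target) * fa
               for miss, fa in error_curve(labels, scores))
    # Normalize by the cost of the best trivial system (accept all or reject all).
    return best / min(c_miss * p_target, c_fa * (1 - p_target))

toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(round(eer(toy_labels, toy_scores) * 100, 2), round(min_dcf(toy_labels, toy_scores), 3))  # 33.33 0.333
```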

In [10]:
def eval_veri(test_loader, network, p_target=0.01, device="cpu", vectors=None):
    # Score all trials of the test set and compute EER and minDCF at `p_target`.
    eer, dcf, vec, scs = veri_validate(
        test_loader, network, p_target=p_target, device=device, ret_info=True, vectors=vectors)
    scs = pandas.DataFrame({'score': scs, 'enroll': test_loader.dataset.enrolls, 'test': test_loader.dataset.tests})
    eer = eer[0] * 100  # convert EER to percent
    dcf = dcf[0]
    return eer, dcf, vec, scs

def eval_asnorm(labs, vec, scs, cohorts, p_target=0.01):
    # Score the test vectors against the cohort, then normalize the trial scores with AS-Norm.
    cohorts_o = score_cohorts(cohorts, vec)
    asso = asnorm(scs, cohorts_o)
    eer_o_asnorm = calculate_eer(labs, asso)[0] * 100
    dcf_o_asnorm = calculate_mindcf(labs, asso, p_target=p_target)[0]
    return eer_o_asnorm, dcf_o_asnorm

Expected results:

|Metric|Result|
|:-----|-----:|
|EER (%) w/o AS-Norm|	1.14%|
|minDCF w/o AS-Norm|	0.106|

In [11]:
eero, dcfo, veco, scso = eval_veri(testo_loader, subnet, device=device)
print(f'Evaluate on Vox1-O: * EER/DCF {eero:.2f}%/{dcfo:.3f}')

Extract Vectors: 100%|██████████| 4708/4708 [01:05<00:00, 71.58it/s]
Compute Scores: 100%|██████████| 37611/37611 [00:07<00:00, 5300.22it/s]


Evaluate on Vox1-O: * EER/DCF 1.14%/0.106


AS-Norm improves the performance in both EER and minDCF.

Expected results:

|Metric|Result|
|:-----|-----:|
|EER (%) w/ AS-Norm|	0.94%|
|minDCF w/ AS-Norm|	0.089|

In [12]:
eer_o_asnorm, dcf_o_asnorm = eval_asnorm(testo_loader.dataset.labels, veco, scso, cohorts, p_target=0.01)
print(f'AS-Norm on Vox1-O: * EER/DCF {eer_o_asnorm:.2f}%/{dcf_o_asnorm:.3f}')

Score Cohorts: 100%|██████████| 4708/4708 [00:02<00:00, 1837.86it/s]
Cohort Statistics: 100%|██████████| 4708/4708 [00:00<00:00, 5545.38it/s]
Normalization Statistics: 100%|██████████| 37611/37611 [00:00<00:00, 284374.07it/s]


AS-Norm on Vox1-O: * EER/DCF 0.94%/0.089
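
For intuition, adaptive s-norm can be sketched as follows: each side of a trial is scored against the cohort, the top-K cohort scores give a mean and standard deviation, and the raw score is standardized against both sides and averaged. This is the commonly used formulation and is only an illustration. The value of K and the exact behavior of `score_cohorts` and `asnorm` in `sugar.scores` are not shown in this tutorial, so `top_k` below is an assumed parameter.

```python
import math
from statistics import mean, pstdev

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_stats(vec, cohort, top_k):
    # Mean and standard deviation of the top_k highest cohort scores for one embedding.
    top = sorted((cosine(vec, c) for c in cohort), reverse=True)[:top_k]
    return mean(top), max(pstdev(top), 1e-12)  # guard against a zero deviation

def as_norm(raw, enroll, test, cohort, top_k=2):
    mu_e, sd_e = top_k_stats(enroll, cohort, top_k)
    mu_t, sd_t = top_k_stats(test, cohort, top_k)
    # Average of the score standardized against each side's cohort statistics.
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)

cohort = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy cohort embeddings
enroll, test = [1.0, 0.2], [0.9, 0.3]
raw = cosine(enroll, test)
print(f"raw {raw:.3f} -> normalized {as_norm(raw, enroll, test, cohort):.3f}")
```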


### Evaluate on the Vox1-E test set

In [13]:
eere, dcfe, vece, scse = eval_veri(teste_loader, subnet, device=device)
print(f'Evaluate on Vox1-E: * EER/DCF {eere:.2f}%/{dcfe:.3f}')
time.sleep(1)
eer_e_asnorm, dcf_e_asnorm = eval_asnorm(teste_loader.dataset.labels, vece, scse, cohorts, p_target=0.01)
print(f'AS-Norm on Vox1-E: * EER/DCF {eer_e_asnorm:.2f}%/{dcf_e_asnorm:.3f}')

Extract Vectors:   0%|          | 520/145160 [00:07<34:57, 68.96it/s]

### Evaluate on the Vox1-H test set

In [None]:
eerh, dcfh, vech, scsh = eval_veri(testh_loader, subnet, device=device, vectors=vece)
print(f'Evaluate on Vox1-H: * EER/DCF {eerh:.2f}%/{dcfh:.3f}')
time.sleep(1)
eer_h_asnorm, dcf_h_asnorm = eval_asnorm(testh_loader.dataset.labels, vech, scsh, cohorts, p_target=0.01)
print(f'AS-Norm on Vox1-H: * EER/DCF {eer_h_asnorm:.2f}%/{dcf_h_asnorm:.3f}')

Compute Scores: 100%|██████████| 550894/550894 [01:44<00:00, 5290.34it/s]


Evaluate on Vox1-H: * EER/DCF 2.41%/0.238


Score Cohorts: 100%|██████████| 145160/145160 [02:05<00:00, 1158.00it/s]
Cohort Statistics: 100%|██████████| 145160/145160 [00:26<00:00, 5497.97it/s]
Normalization Statistics: 100%|██████████| 550894/550894 [00:02<00:00, 249554.50it/s]


AS-Norm on Vox1-H: * EER/DCF 2.18%/0.206


## Conclusion

The results of the subnet `(3, [512, 512, 512, 512], [5, 3, 3, 3], 1536)` are summarized as follows.

| Architecture | Vox1-O EER(%) | Vox1-O DCF $_{0.01}$ | Vox1-E EER(%) | Vox1-E DCF $_{0.01}$ | Vox1-H EER(%) | Vox1-H DCF $_{0.01}$ |
|:----|:---:|:---:|:---:|:---:|:---:|:---:|
| ECAPA-TDNN(512) reported | 1.01 | 0.1274 | 1.24 | 0.1418 | 2.32 | 0.2181 |
| EfficientTDNN-Base| 0.94 | 0.089 | 1.20 | 0.131 | 2.18 | 0.206 |

To conclude, this tutorial shows how to evaluate a subnet that inherits its weights from the trained supernet and loads its own batchnorm weights, in terms of MACs, parameters, memory, latency, EER, and DCF $_{0.01}$.

The implementation details about the search process can be found in [TDNN-NAS](./TDNN-NAS.ipynb).

## Referencing ECAPA-TDNN

```bibtex
@inproceedings{DBLP:conf/interspeech/DesplanquesTD20,
  author    = {Brecht Desplanques and
               Jenthe Thienpondt and
               Kris Demuynck},
  editor    = {Helen Meng and
               Bo Xu and
               Thomas Fang Zheng},
  title     = {{ECAPA-TDNN:} Emphasized Channel Attention, Propagation and Aggregation
               in {TDNN} Based Speaker Verification},
  booktitle = {Interspeech 2020},
  pages     = {3830--3834},
  publisher = {{ISCA}},
  year      = {2020},
}
```

## Citing EfficientTDNN

Please cite EfficientTDNN if you use it for your research or business.

```bibtex
@article{speechbrain,
  title={{EfficientTDNN}: Efficient Architecture Search for Speaker Recognition in the Wild},
  author={Rui Wang and Zhihua Wei and Haoran Duan and Shouling Ji and Zhen Hong},
  year={2021},
  eprint={2103.13581},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2103.13581}
}
```