# FULL EXPERIMENT

First, the data must go through Extraction, Transformation and Loading (ETL).

The notebook that performs it is: FP_ES_TTS_ExtraccionTransformacionLimpiezaDataset.ipynb
Once that notebook has been run, the corresponding folder (in this case datasets/datasetCastellanoReducido) contains the three files it creates: metadata_dev, metadata_test and metadata_train.

Next, the following steps are carried out with those files:

## 1. Creating the Data Manifests

A script named `scripts/dataset_processing/tts/thorsten_neutral/get_data.py` was created to generate the train/val/test splits in JSON manifest format with the following fields:
1. `audio_filepath`: location of the audio file (wav);
2. `duration`: duration of the audio file (wav);
3. `text`: original text;
4. `normalized_text`: text normalized by the normalization pipeline.

Once the script has run, the final manifests `train_manifest_text_normed.json`, `val_manifest_text_normed.json` and `test_manifest_text_normed.json` are obtained.
The script can take more than 1 hour to run.
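
As a rough illustration of the manifest layout described above, the sketch below writes and reads entries with the four listed fields. It assumes the manifests store one JSON object per line; the helper names (`wav_duration_seconds`, `make_entry`, `write_manifest`, `read_manifest`) are hypothetical and are not taken from the actual script, which computes durations differently.

```python
import json
import wave


def wav_duration_seconds(wav_path):
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_entry(wav_path, text, normalized_text):
    """Build one manifest entry with the four fields listed above."""
    return {
        "audio_filepath": str(wav_path),
        "duration": wav_duration_seconds(wav_path),
        "text": text,
        "normalized_text": normalized_text,
    }


def write_manifest(entries, manifest_path):
    """Write one JSON object per line (the layout assumed in this sketch)."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            # ensure_ascii=False keeps accented Spanish characters readable.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(manifest_path):
    """Read a manifest back into a list of dictionaries, skipping blank lines."""
    with open(manifest_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading a manifest back with `read_manifest` is a quick way to confirm that every entry carries all four fields before moving on to the next step.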

In [1]:
import os
os.chdir("/home/irene/notebooks")

!python es_get_data.py \
    --data-root /home/irene/datasets \
    --manifests-root ../datasets/NemoSpanishTTSEsMapa152Finetuning \
    --val-size 3 \
    --test-size 2 \
    --seed-for-ds-split 87 \
    --num-workers -1 \
    --normalize-text

Dataset directory found
[NeMo I 2023-09-11 17:15:40 es_get_data:148] Preparing JSON train/val/test splits.
130it [00:05, 25.95it/s]^C
130it [00:05, 24.32it/s]
Traceback (most recent call last):
  File "/home/irene/notebooks/es_get_data.py", line 277, in <module>
    main()
  File "/home/irene/notebooks/es_get_data.py", line 235, in main
    entries_train, entries_val, entries_test, not_found_wavs, wrong_duration_wavs = __process_data(
  File "/home/irene/notebooks/es_get_data.py", line 175, in __process_data
    duration = subprocess.check_output(f"soxi -D {wav_file}", shell=True)
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 503, in run


Required step for datasets that had to be converted to wav.

The following step is required for the datasets from the teacher from Segovia, because they were originally in a different format. A script then converted them from mp4 to wav: https://colab.research.google.com/drive/14izC7G5e-R2LlxvdRG8oCUH25QwZmlmH#scrollTo=OXVrAcJ0ebrQ. However, the channel layout still had to be changed, since the audio was in stereo. It is converted to mono so that the following steps work.

In [None]:
# This fixes a channel error: the supplementary data can only be extracted once the audio is converted from stereo to mono.
from pydub import AudioSegment
import os

folder_path = "/home/irene/datasets/datasetEsMapa152Finetuning/esmapa152"

# Iterate over the files in the folder
# for i in range(92):
    # file_name = f"esmapa{i}.wav"  # Plain numbering format (datasetBueno)
for i in range(153):
    file_name = f"esmapa{i:04d}.wav"  # Zero-padded numbering format (DatasetEsMapa)
    file_path = os.path.join(folder_path, file_name)

    # Check if the file exists before processing
    if os.path.exists(file_path):
        # Load the audio file
        sound = AudioSegment.from_wav(file_path)

        # Set the number of channels to 1
        sound = sound.set_channels(1)

        # Export the modified audio back to the same file path
        sound.export(file_path, format="wav")

        print(f"Converted file: {file_name}")
    else:
        print(f"File not found: {file_name}")
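
To confirm that the conversion above took effect, the channel count of each file can be read from its header. The following is a minimal sketch using only the standard-library `wave` module; the helper names `channel_count` and `find_non_mono` are hypothetical and not part of the original notebook.

```python
import wave


def channel_count(wav_path):
    """Return the number of channels stored in a PCM WAV file's header."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnchannels()


def find_non_mono(wav_paths):
    """Return the paths whose files are not single-channel."""
    return [path for path in wav_paths if channel_count(path) != 1]
```

An empty list from `find_non_mono` means every checked file is already mono.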

## 2. Extracting the Supplementary Data

To speed up and stabilize training, supplementary data must be extracted for each audio file, and the pitch statistics (mean, standard deviation, minimum and maximum) must be estimated. This requires iterating over the data once, using the `extract_sup_data.py` script.
Some parameters of the configuration file were adjusted: /home/irene/datasets/NemoSpanishTTS/ds_for_fastpitch_align.yaml

**Note**: This is an optional step, but it was carried out when creating the base model.
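
The pitch statistics mentioned above can be illustrated with a small sketch. It is not the implementation of `extract_sup_data.py`: the function name `pitch_statistics`, the convention that unvoiced frames are stored as `0.0`, and the use of the population standard deviation are all assumptions made for this example.

```python
import math


def pitch_statistics(pitch_contours):
    """Compute mean, std, min and max over the voiced frames of all contours.

    Assumption: unvoiced frames are encoded as 0.0 and are excluded.
    """
    voiced = [f0 for contour in pitch_contours for f0 in contour if f0 > 0.0]
    if not voiced:
        raise ValueError("no voiced frames found")
    mean = sum(voiced) / len(voiced)
    # Population standard deviation; the real script may use another estimator.
    std = math.sqrt(sum((f0 - mean) ** 2 for f0 in voiced) / len(voiced))
    return {
        "pitch_mean": mean,
        "pitch_std": std,
        "pitch_min": min(voiced),
        "pitch_max": max(voiced),
    }
```

The function takes one list of per-frame pitch values (in Hz) per utterance and pools all voiced frames, so long utterances weigh more than short ones.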


In [None]:
import glob, os
os.chdir("/home/irene/datasets/NemoSpanishTTSEsMapa152Finetuning")

!python extract_sup_data.py \
        --config-path ./ \
        --config-name ds_for_fastpitch_align.yaml \
        manifest_filepath=train_manifest_text_normed.json \
        sup_data_path=sup_data \
        ++dataloader_params.num_workers=4

## 3. Training

Before training the model, its configuration must be defined. Some settings were changed with respect to the original model configuration
`examples/tts/conf/de/fastpitch_align_22050_grapheme.yaml`.

In addition, the values of `pitch_mean` and `pitch_std` must be updated with the values estimated in the `extract_sup_data.py` step.

Wandb was used to track the results of the experiments [link](https://docs.wandb.ai/ref/cli/wandb-login).

In [1]:
!wandb login 7f8717dd64209b51a51493f579c375a7ca34fd2f

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


The model, FastPitch in this case, is now trained. The values of `PITCH_MEAN` and `PITCH_STD` are set in the following command.

In [None]:
!(cd /home/irene/datasets/NemoSpanishTTS && CUDA_VISIBLE_DEVICES=0 python fastpitch.py --config-path . --config-name fastpitch_align_44100 \
  model.train_ds.dataloader_params.batch_size=32 \
    model.validation_ds.dataloader_params.batch_size=32 \
    train_dataset=train_manifest_text_normed.json \
    validation_datasets=val_manifest_text_normed.json \
    sup_data_path=sup_data \
    exp_manager.exp_dir=resultSpanishTTS \
    trainer.check_val_every_n_epoch=1 \
    pitch_mean=126.73465728759766 \
    pitch_std=38.099849700927734 \
    +exp_manager.create_wandb_logger=true \
    +exp_manager.wandb_logger_kwargs.name="tutorial" \
    +exp_manager.wandb_logger_kwargs.project="SpanishTTS")

Note:
1. `CUDA_VISIBLE_DEVICES=0` is used to restrict training to a single GPU.
2. For debugging, the following environment variables are set: `HYDRA_FULL_ERROR=1`, `CUDA_LAUNCH_BLOCKING=1`
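
When the training command is launched from Python instead of a shell cell, the variables from the note can be collected in an environment dictionary. This is only a sketch; the function name `build_training_env` and its parameters are hypothetical.

```python
import os


def build_training_env(gpu_index=0, debug=False, base_env=None):
    """Return a copy of the environment with the variables from the note set."""
    env = dict(os.environ if base_env is None else base_env)
    # Expose only the selected GPU to the training process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    if debug:
        # Ask Hydra for the complete stack trace when it reports an error.
        env["HYDRA_FULL_ERROR"] = "1"
        # Make CUDA kernel launches synchronous so errors point at the failing call.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run`. The debug variables are best left off for normal runs, since synchronous kernel launches slow training down.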

After training, the model is evaluated. This is done in the notebook FP_ES_TTS_Evaluate.ipynb, which contains both the evaluation and the audio generation for the base FastPitch model and for the fine-tuned one.

## 4. Fine-tuning FastPitch
Improving speech quality by fine-tuning FastPitch.

Two further datasets are available for fine-tuning. Once the ETL process, the manifest creation and the supplementary data extraction have been completed for them, the fine-tuning is carried out. It is implemented in the script fptts-finetuningFastPitch.sh.
The model can be evaluated in the notebook FP_ES_TTS_Evaluate.ipynb.

## 5. Fine-tuning HiFi-GAN
Improving speech quality by fine-tuning HiFi-GAN on mel spectrograms synthesized by FastPitch.
 
The following notebook was created for this: FP_ES_TTS_Finetuning_HiFiGAN.ipynb.

To evaluate the resulting audio, the notebook FP_ES_TTS_Evaluate-FTHifiGAN.ipynb was created.

In [9]:
!pip install tts-scores

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: python -m pip install --upgrade pip


# BEST MODELS
The best models are selected so that they can then be evaluated.

The best models are:
- FastPitch base model: '/home/irene/datasets/NemoSpanishTTS/resultSpanishTTS/FastPitch/2023-06-28_18-47-07/checkpoints/FastPitch--val_loss=0.7140-epoch=265.ckpt'
- HiFiGAN base model: -
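
The checkpoint file name above encodes the validation loss and the epoch, so candidates can be ranked by parsing their names. The sketch below assumes every checkpoint follows the `...val_loss=<float>-epoch=<int>.ckpt` pattern seen in that path; `parse_checkpoint_name` and `best_checkpoint` are hypothetical helpers, not part of the project.

```python
import re

# Matches the tail of names such as "FastPitch--val_loss=0.7140-epoch=265.ckpt".
CKPT_PATTERN = re.compile(
    r"val_loss=(?P<val_loss>\d+(?:\.\d+)?)-epoch=(?P<epoch>\d+)\.ckpt$"
)


def parse_checkpoint_name(path):
    """Return (val_loss, epoch) parsed from a checkpoint file name."""
    match = CKPT_PATTERN.search(path)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {path}")
    return float(match.group("val_loss")), int(match.group("epoch"))


def best_checkpoint(paths):
    """Return the path with the lowest validation loss."""
    return min(paths, key=lambda path: parse_checkpoint_name(path)[0])
```

Names that do not match the pattern raise a `ValueError` instead of being skipped silently, so any other naming variant in the checkpoints folder has to be filtered out before calling `best_checkpoint`.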

# CLVP METRICS
The CLVP metrics were considered at first. However, initializing the CLVP metric raises an error:

RuntimeError: Error(s) in loading state_dict for CLVP:
   Missing key(s) in state_dict: "text_pos_emb.weight", "text_transformer.layers.layers.0.0.scale" ....
   .....  mismatch for to_speech_latent.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 512]).

   An Internet search turned up an open GitHub issue reporting that no solution had been found yet:
   https://github.com/neonbjb/tts-scores/issues/6
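
The error message groups the problem into missing keys, unexpected keys and shape mismatches. The sketch below reproduces that comparison on plain dictionaries mapping parameter names to shapes, which can help when checking whether a checkpoint belongs to a different model architecture. The function name `diff_state_dicts` is hypothetical, and this is a diagnostic aid only, not a fix for the issue.

```python
def diff_state_dicts(model_shapes, checkpoint_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns (missing, unexpected, mismatched):
      missing    - names the model expects but the checkpoint lacks
      unexpected - names the checkpoint has but the model does not
      mismatched - {name: (checkpoint_shape, model_shape)} for shared names
    """
    model_keys = set(model_shapes)
    checkpoint_keys = set(checkpoint_shapes)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    mismatched = {
        name: (checkpoint_shapes[name], model_shapes[name])
        for name in sorted(model_keys & checkpoint_keys)
        if tuple(checkpoint_shapes[name]) != tuple(model_shapes[name])
    }
    return missing, unexpected, mismatched
```

With PyTorch, such a mapping can be built from a state dict as `{name: tuple(tensor.shape) for name, tensor in state_dict.items()}`. A large number of missing and unexpected keys, as in the error above, suggests that the checkpoint and the instantiated model come from different model definitions.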

In [1]:
from tts_scores.clvp import CLVPMetric

In [31]:
pretrained_checkpoint = '/home/irene/notebooks/data/clvp.pth' 

In [3]:
from tts_scores.clvp import CLVPMetric

# Path to the clvp.pth file (relative to the notebook's working directory)
pretrained_path = './data/clvp.pth'  # Replace with the actual path to clvp.pth

# Initialize the CLVP metric with the correct device (e.g., 'cuda') and the path to the model
cv_metric = CLVPMetric(device='cuda', pretrained_path=pretrained_path)

# Now you can use cv_metric as intended

RuntimeError: Error(s) in loading state_dict for CLVP:
	Missing key(s) in state_dict: "text_pos_emb.weight", "text_transformer.layers.layers.0.0.scale", "text_transformer.layers.layers.0.0.fn.norm.weight", "text_transformer.layers.layers.0.0.fn.norm.bias", "text_transformer.layers.layers.0.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.0.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.0.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.0.1.scale", "text_transformer.layers.layers.0.1.fn.norm.weight", "text_transformer.layers.layers.0.1.fn.norm.bias", "text_transformer.layers.layers.0.1.fn.fn.net.0.weight", "text_transformer.layers.layers.0.1.fn.fn.net.0.bias", "text_transformer.layers.layers.0.1.fn.fn.net.3.weight", "text_transformer.layers.layers.0.1.fn.fn.net.3.bias", "text_transformer.layers.layers.1.0.scale", "text_transformer.layers.layers.1.0.fn.norm.weight", "text_transformer.layers.layers.1.0.fn.norm.bias", "text_transformer.layers.layers.1.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.1.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.1.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.1.1.scale", "text_transformer.layers.layers.1.1.fn.norm.weight", "text_transformer.layers.layers.1.1.fn.norm.bias", "text_transformer.layers.layers.1.1.fn.fn.net.0.weight", "text_transformer.layers.layers.1.1.fn.fn.net.0.bias", "text_transformer.layers.layers.1.1.fn.fn.net.3.weight", "text_transformer.layers.layers.1.1.fn.fn.net.3.bias", "text_transformer.layers.layers.2.0.scale", "text_transformer.layers.layers.2.0.fn.norm.weight", "text_transformer.layers.layers.2.0.fn.norm.bias", "text_transformer.layers.layers.2.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.2.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.2.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.2.1.scale", "text_transformer.layers.layers.2.1.fn.norm.weight", "text_transformer.layers.layers.2.1.fn.norm.bias", "text_transformer.layers.layers.2.1.fn.fn.net.0.weight", 
"text_transformer.layers.layers.2.1.fn.fn.net.0.bias", "text_transformer.layers.layers.2.1.fn.fn.net.3.weight", "text_transformer.layers.layers.2.1.fn.fn.net.3.bias", "text_transformer.layers.layers.3.0.scale", "text_transformer.layers.layers.3.0.fn.norm.weight", "text_transformer.layers.layers.3.0.fn.norm.bias", "text_transformer.layers.layers.3.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.3.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.3.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.3.1.scale", "text_transformer.layers.layers.3.1.fn.norm.weight", "text_transformer.layers.layers.3.1.fn.norm.bias", "text_transformer.layers.layers.3.1.fn.fn.net.0.weight", "text_transformer.layers.layers.3.1.fn.fn.net.0.bias", "text_transformer.layers.layers.3.1.fn.fn.net.3.weight", "text_transformer.layers.layers.3.1.fn.fn.net.3.bias", "text_transformer.layers.layers.4.0.scale", "text_transformer.layers.layers.4.0.fn.norm.weight", "text_transformer.layers.layers.4.0.fn.norm.bias", "text_transformer.layers.layers.4.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.4.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.4.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.4.1.scale", "text_transformer.layers.layers.4.1.fn.norm.weight", "text_transformer.layers.layers.4.1.fn.norm.bias", "text_transformer.layers.layers.4.1.fn.fn.net.0.weight", "text_transformer.layers.layers.4.1.fn.fn.net.0.bias", "text_transformer.layers.layers.4.1.fn.fn.net.3.weight", "text_transformer.layers.layers.4.1.fn.fn.net.3.bias", "text_transformer.layers.layers.5.0.scale", "text_transformer.layers.layers.5.0.fn.norm.weight", "text_transformer.layers.layers.5.0.fn.norm.bias", "text_transformer.layers.layers.5.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.5.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.5.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.5.1.scale", "text_transformer.layers.layers.5.1.fn.norm.weight", 
"text_transformer.layers.layers.5.1.fn.norm.bias", "text_transformer.layers.layers.5.1.fn.fn.net.0.weight", "text_transformer.layers.layers.5.1.fn.fn.net.0.bias", "text_transformer.layers.layers.5.1.fn.fn.net.3.weight", "text_transformer.layers.layers.5.1.fn.fn.net.3.bias", "text_transformer.layers.layers.6.0.scale", "text_transformer.layers.layers.6.0.fn.norm.weight", "text_transformer.layers.layers.6.0.fn.norm.bias", "text_transformer.layers.layers.6.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.6.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.6.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.6.1.scale", "text_transformer.layers.layers.6.1.fn.norm.weight", "text_transformer.layers.layers.6.1.fn.norm.bias", "text_transformer.layers.layers.6.1.fn.fn.net.0.weight", "text_transformer.layers.layers.6.1.fn.fn.net.0.bias", "text_transformer.layers.layers.6.1.fn.fn.net.3.weight", "text_transformer.layers.layers.6.1.fn.fn.net.3.bias", "text_transformer.layers.layers.7.0.scale", "text_transformer.layers.layers.7.0.fn.norm.weight", "text_transformer.layers.layers.7.0.fn.norm.bias", "text_transformer.layers.layers.7.0.fn.fn.to_qkv.weight", "text_transformer.layers.layers.7.0.fn.fn.to_out.0.weight", "text_transformer.layers.layers.7.0.fn.fn.to_out.0.bias", "text_transformer.layers.layers.7.1.scale", "text_transformer.layers.layers.7.1.fn.norm.weight", "text_transformer.layers.layers.7.1.fn.norm.bias", "text_transformer.layers.layers.7.1.fn.fn.net.0.weight", "text_transformer.layers.layers.7.1.fn.fn.net.0.bias", "text_transformer.layers.layers.7.1.fn.fn.net.3.weight", "text_transformer.layers.layers.7.1.fn.fn.net.3.bias", "speech_enc.weight", "speech_enc.bias", "speech_pos_emb.weight", "speech_transformer.layers.layers.0.0.scale", "speech_transformer.layers.layers.0.0.fn.norm.weight", "speech_transformer.layers.layers.0.0.fn.norm.bias", "speech_transformer.layers.layers.0.0.fn.fn.to_qkv.weight", 
"speech_transformer.layers.layers.0.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.0.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.0.1.scale", "speech_transformer.layers.layers.0.1.fn.norm.weight", "speech_transformer.layers.layers.0.1.fn.norm.bias", "speech_transformer.layers.layers.0.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.0.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.0.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.0.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.1.0.scale", "speech_transformer.layers.layers.1.0.fn.norm.weight", "speech_transformer.layers.layers.1.0.fn.norm.bias", "speech_transformer.layers.layers.1.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.1.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.1.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.1.1.scale", "speech_transformer.layers.layers.1.1.fn.norm.weight", "speech_transformer.layers.layers.1.1.fn.norm.bias", "speech_transformer.layers.layers.1.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.1.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.1.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.1.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.2.0.scale", "speech_transformer.layers.layers.2.0.fn.norm.weight", "speech_transformer.layers.layers.2.0.fn.norm.bias", "speech_transformer.layers.layers.2.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.2.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.2.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.2.1.scale", "speech_transformer.layers.layers.2.1.fn.norm.weight", "speech_transformer.layers.layers.2.1.fn.norm.bias", "speech_transformer.layers.layers.2.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.2.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.2.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.2.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.3.0.scale", 
"speech_transformer.layers.layers.3.0.fn.norm.weight", "speech_transformer.layers.layers.3.0.fn.norm.bias", "speech_transformer.layers.layers.3.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.3.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.3.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.3.1.scale", "speech_transformer.layers.layers.3.1.fn.norm.weight", "speech_transformer.layers.layers.3.1.fn.norm.bias", "speech_transformer.layers.layers.3.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.3.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.3.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.3.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.4.0.scale", "speech_transformer.layers.layers.4.0.fn.norm.weight", "speech_transformer.layers.layers.4.0.fn.norm.bias", "speech_transformer.layers.layers.4.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.4.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.4.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.4.1.scale", "speech_transformer.layers.layers.4.1.fn.norm.weight", "speech_transformer.layers.layers.4.1.fn.norm.bias", "speech_transformer.layers.layers.4.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.4.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.4.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.4.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.5.0.scale", "speech_transformer.layers.layers.5.0.fn.norm.weight", "speech_transformer.layers.layers.5.0.fn.norm.bias", "speech_transformer.layers.layers.5.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.5.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.5.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.5.1.scale", "speech_transformer.layers.layers.5.1.fn.norm.weight", "speech_transformer.layers.layers.5.1.fn.norm.bias", "speech_transformer.layers.layers.5.1.fn.fn.net.0.weight", 
"speech_transformer.layers.layers.5.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.5.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.5.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.6.0.scale", "speech_transformer.layers.layers.6.0.fn.norm.weight", "speech_transformer.layers.layers.6.0.fn.norm.bias", "speech_transformer.layers.layers.6.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.6.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.6.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.6.1.scale", "speech_transformer.layers.layers.6.1.fn.norm.weight", "speech_transformer.layers.layers.6.1.fn.norm.bias", "speech_transformer.layers.layers.6.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.6.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.6.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.6.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.7.0.scale", "speech_transformer.layers.layers.7.0.fn.norm.weight", "speech_transformer.layers.layers.7.0.fn.norm.bias", "speech_transformer.layers.layers.7.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.7.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.7.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.7.1.scale", "speech_transformer.layers.layers.7.1.fn.norm.weight", "speech_transformer.layers.layers.7.1.fn.norm.bias", "speech_transformer.layers.layers.7.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.7.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.7.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.7.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.8.0.scale", "speech_transformer.layers.layers.8.0.fn.norm.weight", "speech_transformer.layers.layers.8.0.fn.norm.bias", "speech_transformer.layers.layers.8.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.8.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.8.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.8.1.scale", 
"speech_transformer.layers.layers.8.1.fn.norm.weight", "speech_transformer.layers.layers.8.1.fn.norm.bias", "speech_transformer.layers.layers.8.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.8.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.8.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.8.1.fn.fn.net.3.bias", "speech_transformer.layers.layers.9.0.scale", "speech_transformer.layers.layers.9.0.fn.norm.weight", "speech_transformer.layers.layers.9.0.fn.norm.bias", "speech_transformer.layers.layers.9.0.fn.fn.to_qkv.weight", "speech_transformer.layers.layers.9.0.fn.fn.to_out.0.weight", "speech_transformer.layers.layers.9.0.fn.fn.to_out.0.bias", "speech_transformer.layers.layers.9.1.scale", "speech_transformer.layers.layers.9.1.fn.norm.weight", "speech_transformer.layers.layers.9.1.fn.norm.bias", "speech_transformer.layers.layers.9.1.fn.fn.net.0.weight", "speech_transformer.layers.layers.9.1.fn.fn.net.0.bias", "speech_transformer.layers.layers.9.1.fn.fn.net.3.weight", "speech_transformer.layers.layers.9.1.fn.fn.net.3.bias". 
	Unexpected key(s) in state_dict: "speech_emb.weight", "text_transformer.transformer.attn_layers.layers.0.0.0.g", "text_transformer.transformer.attn_layers.layers.0.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.0.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.0.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.0.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.0.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.1.0.0.g", "text_transformer.transformer.attn_layers.layers.1.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.1.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.1.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.1.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.2.0.0.g", "text_transformer.transformer.attn_layers.layers.2.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.2.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.2.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.2.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.2.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.3.0.0.g", "text_transformer.transformer.attn_layers.layers.3.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.3.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.3.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.3.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.4.0.0.g", "text_transformer.transformer.attn_layers.layers.4.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.4.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.4.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.4.1.wrap.to_out.weight", 
"text_transformer.transformer.attn_layers.layers.4.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.5.0.0.g", "text_transformer.transformer.attn_layers.layers.5.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.5.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.5.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.5.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.6.0.0.g", "text_transformer.transformer.attn_layers.layers.6.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.6.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.6.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.6.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.6.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.7.0.0.g", "text_transformer.transformer.attn_layers.layers.7.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.7.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.7.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.7.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.8.0.0.g", "text_transformer.transformer.attn_layers.layers.8.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.8.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.8.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.8.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.8.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.9.0.0.g", "text_transformer.transformer.attn_layers.layers.9.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.9.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.9.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.9.1.wrap.net.3.bias", 
"text_transformer.transformer.attn_layers.layers.10.0.0.g", "text_transformer.transformer.attn_layers.layers.10.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.10.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.10.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.10.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.10.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.11.0.0.g", "text_transformer.transformer.attn_layers.layers.11.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.11.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.11.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.11.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.12.0.0.g", "text_transformer.transformer.attn_layers.layers.12.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.12.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.12.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.12.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.12.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.13.0.0.g", "text_transformer.transformer.attn_layers.layers.13.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.13.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.13.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.13.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.14.0.0.g", "text_transformer.transformer.attn_layers.layers.14.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.14.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.14.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.14.1.wrap.to_out.weight", 
"text_transformer.transformer.attn_layers.layers.14.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.15.0.0.g", "text_transformer.transformer.attn_layers.layers.15.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.15.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.15.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.15.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.16.0.0.g", "text_transformer.transformer.attn_layers.layers.16.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.16.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.16.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.16.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.16.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.17.0.0.g", "text_transformer.transformer.attn_layers.layers.17.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.17.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.17.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.17.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.18.0.0.g", "text_transformer.transformer.attn_layers.layers.18.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.18.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.18.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.18.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.18.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.19.0.0.g", "text_transformer.transformer.attn_layers.layers.19.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.19.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.19.1.wrap.net.3.weight", 
"text_transformer.transformer.attn_layers.layers.19.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.20.0.0.g", "text_transformer.transformer.attn_layers.layers.20.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.20.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.20.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.20.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.20.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.21.0.0.g", "text_transformer.transformer.attn_layers.layers.21.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.21.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.21.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.21.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.layers.22.0.0.g", "text_transformer.transformer.attn_layers.layers.22.1.wrap.to_q.weight", "text_transformer.transformer.attn_layers.layers.22.1.wrap.to_k.weight", "text_transformer.transformer.attn_layers.layers.22.1.wrap.to_v.weight", "text_transformer.transformer.attn_layers.layers.22.1.wrap.to_out.weight", "text_transformer.transformer.attn_layers.layers.22.1.wrap.to_out.bias", "text_transformer.transformer.attn_layers.layers.23.0.0.g", "text_transformer.transformer.attn_layers.layers.23.1.wrap.net.0.proj.weight", "text_transformer.transformer.attn_layers.layers.23.1.wrap.net.0.proj.bias", "text_transformer.transformer.attn_layers.layers.23.1.wrap.net.3.weight", "text_transformer.transformer.attn_layers.layers.23.1.wrap.net.3.bias", "text_transformer.transformer.attn_layers.rotary_pos_emb.inv_freq", "text_transformer.transformer.norm.weight", "text_transformer.transformer.norm.bias", "speech_transformer.transformer.attn_layers.layers.0.0.0.g", "speech_transformer.transformer.attn_layers.layers.0.1.wrap.to_q.weight", 
"speech_transformer.transformer.attn_layers.layers.0.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.0.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.0.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.0.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.1.0.0.g", "speech_transformer.transformer.attn_layers.layers.1.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.1.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.1.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.1.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.2.0.0.g", "speech_transformer.transformer.attn_layers.layers.2.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.2.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.2.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.2.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.2.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.3.0.0.g", "speech_transformer.transformer.attn_layers.layers.3.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.3.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.3.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.3.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.4.0.0.g", "speech_transformer.transformer.attn_layers.layers.4.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.4.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.4.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.4.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.4.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.5.0.0.g", 
"speech_transformer.transformer.attn_layers.layers.5.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.5.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.5.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.5.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.6.0.0.g", "speech_transformer.transformer.attn_layers.layers.6.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.6.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.6.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.6.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.6.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.7.0.0.g", "speech_transformer.transformer.attn_layers.layers.7.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.7.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.7.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.7.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.8.0.0.g", "speech_transformer.transformer.attn_layers.layers.8.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.8.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.8.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.8.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.8.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.9.0.0.g", "speech_transformer.transformer.attn_layers.layers.9.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.9.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.9.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.9.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.10.0.0.g", 
"speech_transformer.transformer.attn_layers.layers.10.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.10.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.10.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.10.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.10.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.11.0.0.g", "speech_transformer.transformer.attn_layers.layers.11.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.11.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.11.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.11.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.12.0.0.g", "speech_transformer.transformer.attn_layers.layers.12.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.12.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.12.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.12.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.12.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.13.0.0.g", "speech_transformer.transformer.attn_layers.layers.13.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.13.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.13.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.13.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.14.0.0.g", "speech_transformer.transformer.attn_layers.layers.14.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.14.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.14.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.14.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.14.1.wrap.to_out.bias", 
"speech_transformer.transformer.attn_layers.layers.15.0.0.g", "speech_transformer.transformer.attn_layers.layers.15.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.15.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.15.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.15.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.16.0.0.g", "speech_transformer.transformer.attn_layers.layers.16.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.16.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.16.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.16.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.16.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.17.0.0.g", "speech_transformer.transformer.attn_layers.layers.17.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.17.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.17.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.17.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.18.0.0.g", "speech_transformer.transformer.attn_layers.layers.18.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.18.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.18.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.18.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.18.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.19.0.0.g", "speech_transformer.transformer.attn_layers.layers.19.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.19.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.19.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.19.1.wrap.net.3.bias", 
"speech_transformer.transformer.attn_layers.layers.20.0.0.g", "speech_transformer.transformer.attn_layers.layers.20.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.20.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.20.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.20.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.20.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.21.0.0.g", "speech_transformer.transformer.attn_layers.layers.21.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.21.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.21.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.21.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.layers.22.0.0.g", "speech_transformer.transformer.attn_layers.layers.22.1.wrap.to_q.weight", "speech_transformer.transformer.attn_layers.layers.22.1.wrap.to_k.weight", "speech_transformer.transformer.attn_layers.layers.22.1.wrap.to_v.weight", "speech_transformer.transformer.attn_layers.layers.22.1.wrap.to_out.weight", "speech_transformer.transformer.attn_layers.layers.22.1.wrap.to_out.bias", "speech_transformer.transformer.attn_layers.layers.23.0.0.g", "speech_transformer.transformer.attn_layers.layers.23.1.wrap.net.0.proj.weight", "speech_transformer.transformer.attn_layers.layers.23.1.wrap.net.0.proj.bias", "speech_transformer.transformer.attn_layers.layers.23.1.wrap.net.3.weight", "speech_transformer.transformer.attn_layers.layers.23.1.wrap.net.3.bias", "speech_transformer.transformer.attn_layers.rotary_pos_emb.inv_freq", "speech_transformer.transformer.norm.weight", "speech_transformer.transformer.norm.bias". 
	size mismatch for text_emb.weight: copying a param with shape torch.Size([256, 512]) from checkpoint, the shape in current model is torch.Size([148, 512]).

In [4]:
# score = cv_metric.compute_fd('<path_to_your_generated_audio>', '<path_to_your_real_audio>')
score = cv_metric.compute_fd('/home/irene/notebooks/audioFinetuned1.wav', '/home/irene/notebooks/audioOrig1.wav')

NameError: name 'cv_metric' is not defined

In [3]:
# import librosa
# import numpy as np
# from sklearn.metrics.pairwise import cosine_similarity

# # Load the generated (fine-tuned) and original audio files at their native sampling rates
# generated_audio, sr_generated = librosa.load('/home/irene/notebooks/audioFinetuned1.wav', sr=None)
# original_audio, sr_original = librosa.load('/home/irene/notebooks/audioOrig1.wav', sr=None)

# # Extract audio features (Mel-frequency cepstral coefficients, MFCCs)
# mfcc_generated = librosa.feature.mfcc(y=generated_audio, sr=sr_generated)
# mfcc_original = librosa.feature.mfcc(y=original_audio, sr=sr_original)

# # Calculate cosine similarity between the MFCC features (one row per frame)
# similarity_score = cosine_similarity(mfcc_generated.T, mfcc_original.T)

# # Print or use the similarity score as needed
# print(f"Cosine Similarity Score: {similarity_score[0][0]}")
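
The commented-out cell compares MFCC matrices with `cosine_similarity`, which returns a frame-by-frame matrix, so indexing `[0][0]` only compares the first frame of each file. A minimal sketch of one possible alternative, assuming both feature sequences are cut to the frames they share and the similarity is averaged frame by frame. The helpers `cosine` and `mean_frame_cosine` and the toy values are illustrative, not part of the notebook:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_frame_cosine(frames_a, frames_b):
    """Average cosine similarity over the frames both sequences share."""
    n = min(len(frames_a), len(frames_b))
    if n == 0:
        raise ValueError("both sequences need at least one frame")
    return sum(cosine(frames_a[i], frames_b[i]) for i in range(n)) / n

# Toy feature frames (one vector per frame), standing in for the rows of mfcc.T
generated = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
original = [[2.0, 0.0], [0.0, 1.0]]
score = mean_frame_cosine(generated, original)
print(f"Mean frame cosine similarity: {score:.3f}")
```

This pairs frames by index only; it does not time-align the two utterances, so it is meaningful mainly when both files contain the same sentence at a similar speaking rate.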

# METRICS COMPUTED WITH PYSEPM

In [4]:
!pip3 install https://github.com/schmiph2/pysepm/archive/master.zip

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting https://github.com/schmiph2/pysepm/archive/master.zip
  Downloading https://github.com/schmiph2/pysepm/archive/master.zip
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pesq@ https://github.com/ludlows/python-pesq/archive/master.zip#egg=pesq (from pysepm==0.1)
  Downloading https://github.com/ludlows/python-pesq/archive/master.zip
  Preparing metadata (setup.py) ... done
Collecting SRMRpy@ https://github.com/jfsantos/SRMRpy/archive/master.zip#egg=SRMRpy (from pysepm==0.1)
  Downloading https://github.com/jfsantos/SRMRpy/archive/master.zip

In [1]:
from scipy.io import wavfile
from scipy.signal import resample
import sys
sys.path.append("../") 
import pysepm
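
Before two signals can be compared sample by sample, they need the same sampling rate and the same length. A minimal sketch of that arithmetic, assuming a signal of `n_samples` recorded at `fs_in` is brought to `fs_out` and both signals are then cut to the shorter one. The helper names `resampled_length` and `trim_to_common_length` are hypothetical:

```python
def resampled_length(n_samples, fs_in, fs_out):
    """Number of samples the signal has after changing rate fs_in -> fs_out."""
    return int(round(n_samples * fs_out / fs_in))

def trim_to_common_length(a, b):
    """Return copies of both sequences cut to the shorter length."""
    n = min(len(a), len(b))
    return a[:n], b[:n]

# One second at 44.1 kHz stays one second at 22.05 kHz
target = resampled_length(44100, 44100, 22050)

# Toy signals of different lengths, standing in for two audio arrays
test_sig, ref_sig = trim_to_common_length(list(range(10)), list(range(7)))
print(target, len(test_sig), len(ref_sig))
```

The value returned by `resampled_length` is what would be passed as the `num` argument of `scipy.signal.resample`, so the duration of the signal is preserved.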

In [2]:
from scipy.io import wavfile
from scipy.signal import resample
import pysepm

# List of audio file paths
audio_files = [
    '/home/irene/notebooks/muestrasModelosBase/audio_modelo_base_1.wav',
    '/home/irene/notebooks/muestrasModelosBase/audio_modelo_base_2.wav',
    '/home/irene/notebooks/muestrasModelosBase/audio_modelo_base_3.wav',
    '/home/irene/notebooks/muestrasModelosFinetunedFastpitch/audio_modelo_Finetuned_FastPitch_1.wav',
    '/home/irene/notebooks/muestrasModelosFinetunedFastpitch/audio_modelo_Finetuned_FastPitch_2.wav',
    '/home/irene/notebooks/muestrasModelosFinetunedFastpitch/audio_modelo_Finetuned_FastPitch_3.wav',
    '/home/irene/notebooks/muestrasModelosFinetunedHifiGAN/audio_modelo_Finetuned_HifiGAN_1.wav',
    '/home/irene/notebooks/muestrasModelosFinetunedHifiGAN/audio_modelo_Finetuned_HifiGAN_2.wav',
    '/home/irene/notebooks/muestrasModelosFinetunedHifiGAN/audio_modelo_Finetuned_HifiGAN_3.wav',
    '/home/irene/notebooks/muestrasModelosBase/audioOrigBase.wav'
]

# Load the reference audio signal
fs_orig, base_speech = wavfile.read(audio_files[-1])  # Assuming the reference audio is the last one in the list

# Initialize dictionaries to store metrics for each audio file
metrics_dict = {}

# Loop through the audio files and compute metrics against the reference
for i, audio_file in enumerate(audio_files):
    fs, orig_speech = wavfile.read(audio_file)
    # Reload the reference so the trimming below never carries over between files
    fs_orig, base_speech = wavfile.read(audio_files[-1])

    # Check and ensure matching sampling frequencies
    if fs != fs_orig:
        # Resample to the number of samples the signal has at the reference rate
        orig_speech = resample(orig_speech, int(round(len(orig_speech) * fs_orig / fs)))
        fs = fs_orig

    # Check and ensure matching lengths
    if len(orig_speech) != len(base_speech):
        min_length = min(len(orig_speech), len(base_speech))
        orig_speech = orig_speech[:min_length]
        base_speech = base_speech[:min_length]

    # Compute the metrics for the current audio file.
    # Note: pysepm documents its arguments as (clean_speech, processed_speech, fs);
    # here the file under evaluation is passed first and the reference second.
    # fwSNRseg metric
    fwSNRseg = pysepm.fwSNRseg(orig_speech, base_speech, fs)
    # SNRseg metric
    SNRseg = pysepm.SNRseg(orig_speech, base_speech, fs)
    # LLR metric
    llr = pysepm.llr(orig_speech, base_speech, fs)
    # WSS metric
    wss = pysepm.wss(orig_speech, base_speech, fs)
    # Cepstrum Distance metric
    cepstrum_distance = pysepm.cepstrum_distance(orig_speech, base_speech, fs)
    # STOI metric
    stoi = pysepm.stoi(orig_speech, base_speech, fs)
    # CSII metric
    csii = pysepm.csii(orig_speech, base_speech, fs)
    # BSD metric
    bsd = pysepm.bsd(orig_speech, base_speech, fs)

    # Store the metrics in the dictionary
    metrics_dict[f'Audio_{i + 1}'] = {
        'fwSNRseg': fwSNRseg,
        'SNRseg': SNRseg,
        'LLR': llr,
        'WSS': wss,
        'Cepstrum Distance': cepstrum_distance,
        'STOI': stoi,
        'CSII': csii,
        'BSD': bsd
    }

    # Print the metrics for the current audio file
    print(f"Metrics for {audio_file}:")
    for metric_name, value in metrics_dict[f'Audio_{i + 1}'].items():
        print(f"{metric_name}: {value}")
    print("\n")

    # Stop after the ninth file (index 8): the last list entry is the reference itself
    # finBucle = len(audio_files)-1
    if i == 8:
        break


Metrics for /home/irene/notebooks/muestrasModelosBase/audio_modelo_base_1.wav:
fwSNRseg: 2.7873186064696553
SNRseg: -3.397573660436624
LLR: 1.9719273110635167
WSS: 98.31711705773556
Cepstrum Distance: 9.804188018349352
STOI: 0.05917911758771761
CSII: (0.0, 0.0, 0.0)
BSD: 2299267594.1963573


Metrics for /home/irene/notebooks/muestrasModelosBase/audio_modelo_base_2.wav:
fwSNRseg: 3.3267717360769318
SNRseg: -3.3908011467302246
LLR: 1.9556438871058446
WSS: 95.37886867601392
Cepstrum Distance: 9.797060584897759
STOI: 0.11277212699886188
CSII: (0.0, 4.822978402682473e-05, 0.007052310347891178)
BSD: 460804545.3298916


Metrics for /home/irene/notebooks/muestrasModelosBase/audio_modelo_base_3.wav:
fwSNRseg: 2.765908126817959
SNRseg: -3.414098847116286
LLR: 1.9505945042540762
WSS: 95.54108886873074
Cepstrum Distance: 9.824989104957096
STOI: 0.04685633646868701
CSII: (0.0, 0.00010091247406352087, 0.0)
BSD: 2772008744.813369


Metrics for /home/irene/notebooks/muestrasModelosFinetunedFastpitch/a
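
Per-file values are easier to compare once they are averaged per model. A sketch of that aggregation, assuming the metrics are grouped by hand into one list per model. Only three scalar metrics from the base-model output printed above are filled in, and `mean_metrics` is a hypothetical helper:

```python
from statistics import mean

def mean_metrics(per_file_metrics):
    """Average each scalar metric over a list of per-file metric dicts."""
    names = per_file_metrics[0].keys()
    return {name: mean(m[name] for m in per_file_metrics) for name in names}

# Values copied from the base-model output above (scalar metrics only)
base_model = [
    {"fwSNRseg": 2.7873186064696553, "SNRseg": -3.397573660436624, "STOI": 0.05917911758771761},
    {"fwSNRseg": 3.3267717360769318, "SNRseg": -3.3908011467302246, "STOI": 0.11277212699886188},
    {"fwSNRseg": 2.765908126817959, "SNRseg": -3.414098847116286, "STOI": 0.04685633646868701},
]

summary = mean_metrics(base_model)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The same helper could be applied to the fine-tuned FastPitch and HifiGAN groups once their values are available. Tuple-valued metrics such as CSII would need to be averaged component by component.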

Finally, the perceptual evaluation is carried out. The two best models are run on the 6 sentences of the perceptual evaluation, which gives 12 synthesized sentences in total, 2 for each test sentence. The notebook used to run it is evaluateEvaluacionPerceptual.ipynb
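
The 12 items are every combination of model and sentence. A minimal sketch of that enumeration, assuming placeholder model and sentence identifiers and a made-up file naming scheme (the real names are not given here):

```python
from itertools import product

# Placeholder identifiers; the actual models and sentences are assumptions
models = ["model_A", "model_B"]
sentences = [f"sentence_{k}" for k in range(1, 7)]

# One stimulus per (sentence, model) pair: 6 sentences x 2 models
stimuli = [
    {"sentence": s, "model": m, "wav": f"{m}_{s}.wav"}
    for s, m in product(sentences, models)
]
print(len(stimuli))
```

Iterating over sentences first keeps the two versions of each sentence next to each other, which is convenient when they are to be listened to as a pair.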