# TTS Evaluation QA

TTS training is severely affected by any inconsistency between audio clips and their transcripts. This notebook finds inconsistent audio-transcript pairs by comparing ASR-generated transcripts of the audio against the original transcripts.

### Download Data

For this tutorial, we will use a small part of the Hi-Fi Multi-Speaker English TTS (Hi-Fi TTS) dataset. You can read more about the dataset [here](https://arxiv.org/abs/2104.01497).

In [1]:
!wget https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz  # Contains 10MB of data
!tar -xzf 6097_5_mins.tar.gz

--2022-10-06 14:15:49--  https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz
Resolving nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)... 52.219.96.144
Connecting to nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)|52.219.96.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11002569 (10M) [application/x-gzip]
Saving to: ‘6097_5_mins.tar.gz.9’


2022-10-06 14:16:03 (790 KB/s) - ‘6097_5_mins.tar.gz.9’ saved [11002569/11002569]



In [2]:
!head -n 1 ./6097_5_mins/manifest.json

{"audio_filepath": "audio/presentpictureofnsw_02_mann_0532.wav", "text": "not to stop more than ten minutes by the way", "duration": 2.6, "text_no_preprocessing": "not to stop more than ten minutes by the way,", "text_normalized": "not to stop more than ten minutes by the way,"}


In [3]:
## Fix filepath in manifest.
audio_dir = "/home/siddhartht/tts/tutorials/6097_5_mins" # Change this to the location of your audio dir
! sed -i 's,audio/,{audio_dir}/audio/,g' 6097_5_mins/manifest.json
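
The `sed` call replaces every occurrence of `audio/`, so running the cell a second time would prefix the already absolute paths again. A minimal Python sketch of the same step that is safe to re-run, assuming each manifest line is a JSON object with an `audio_filepath` key as in the sample above. The helper name `absolutize_manifest_line` and the directory `/path/to/6097_5_mins` are placeholders for illustration, not part of NeMo:

```python
import json
import os


def absolutize_manifest_line(line: str, audio_dir: str) -> str:
    """Return the manifest line with a relative audio_filepath joined to audio_dir."""
    entry = json.loads(line)
    path = entry["audio_filepath"]
    # Leave absolute paths untouched so that re-running the step is harmless.
    if not os.path.isabs(path):
        entry["audio_filepath"] = os.path.join(audio_dir, path)
    return json.dumps(entry)


# Shortened sample entry; the directory below is a placeholder.
sample = '{"audio_filepath": "audio/presentpictureofnsw_02_mann_0532.wav", "duration": 2.6}'
fixed = absolutize_manifest_line(sample, "/path/to/6097_5_mins")
print(fixed)
```

Applying the helper to every line of `manifest.json` and writing the results back gives the same manifest as the `sed` command on a first run.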

### Transcribe the audio with ASR

In [4]:
! cd ~ && git clone https://github.com/NVIDIA/NeMo.git
### Change the dataset_manifest variable to the path of your manifest file.
### Change the output_filename variable to the path of your output file.
! cd /home/siddhartht/NeMo/examples/asr/ && python transcribe_speech.py pretrained_name=stt_en_conformer_ctc_large \
    dataset_manifest=/home/siddhartht/tts/tutorials/6097_5_mins/manifest.json \
    output_filename=/home/siddhartht/tts/tutorials/6097_5_mins/asr_pred.json \
    batch_size=32 +compute_langs=False cuda=0 amp=True

fatal: destination path 'NeMo' already exists and is not an empty directory.
[NeMo W 2022-10-06 14:16:05 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo I 2022-10-06 14:16:12 transcribe_speech:105] Hydra config: model_path: null
    pretrained_name: stt_en_conformer_ctc_large
    audio_dir: null
    dataset_manifest: /home/siddhartht/tts/tutorials/6097_5_mins/manifest.json
    output_filename: /home/siddhartht/tts/tutorials/6097_5_mins/asr_pred.json
    batch_size: 32
    num_workers: 0
    cuda: 0
    amp: true
    audio_type: wav
    overwrite_transcripts: true
    ctc_decoding:
      strategy: greedy
      preserve_alignments: null
      compute_timestamps: null
      word_seperator: ' '
      ctc_timestamp_type: all
      batch_dim_index: 0
      greedy:
        preserve_alignments: false
        compute_timestamps: false
    rnnt_decoding:
      strategy: greedy_batch
      compute_hypothesis_token_set: false
      preserve_alignment

In [5]:
!head -2 ~/tts/tutorials/6097_5_mins/asr_pred.json

{"audio_filepath": "/home/siddhartht/tts/tutorials/6097_5_mins/audio/presentpictureofnsw_02_mann_0532.wav", "text": "not to stop more than ten minutes by the way", "duration": 2.6, "text_no_preprocessing": "not to stop more than ten minutes by the way,", "text_normalized": "not to stop more than ten minutes by the way,", "pred_text": "not to stop more than ten minutes by the way"}
{"audio_filepath": "/home/siddhartht/tts/tutorials/6097_5_mins/audio/roots_19_morris_0269.wav", "text": "they were men having no country to go back to", "duration": 2.68, "text_no_preprocessing": "they were men having no country to go back to,", "text_normalized": "they were men having no country to go back to,", "pred_text": "they were men having no country to go back to"}


## Calculate distance.

Use the file generated above by `transcribe_speech.py` to calculate the [Levenshtein distance](https://pypi.org/project/editdistance/) between the ASR prediction and the ground-truth transcript, along with the corresponding character error rate (the distance divided by the transcript length). Choose appropriate thresholds and flag predictions whose distance or error rate exceeds them.
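
To make the two quantities concrete, here is a minimal pure-Python sketch of character-level Levenshtein distance and the error rate derived from it. It illustrates the metric and is not the `editdistance` implementation; the function names are hypothetical, and returning 0.0 for an empty reference is a choice made only for this sketch. The example pair is taken from this notebook's final output.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance via the classic dynamic-programming recurrence."""
    # prev[j] holds the distance between the current ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by the reference length (0.0 for an empty reference)."""
    return levenshtein(ref, hyp) / len(ref) if ref else 0.0


distance = levenshtein("hitherto", "hither two")
rate = char_error_rate("hitherto", "hither two")
print(distance, rate)  # 2 0.25
```

With the thresholds used in this notebook (5 and 0.5), this pair would not be flagged at the character level, because only a space and one letter differ.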

In [6]:
import editdistance
import ndjson
import string

In [7]:
distance_threshold = 5  # Maximum allowed character edit distance; tune as needed.
error_threshold = 0.5  # Maximum allowed character error rate; tune as needed.

In [8]:
## Translation table that strips punctuation.
punct_dict = str.maketrans('', '', string.punctuation)

## Load the newline-delimited JSON manifest written by transcribe_speech.py.
with open("6097_5_mins/asr_pred.json") as f:
    manifest = ndjson.load(f)

In [9]:
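for line in manifest:
    # Normalize the ground truth: lowercase and strip punctuation.
    transcript = line["text"].lower().translate(punct_dict)
    pred_text = line["pred_text"]
    try:
        # Character-level edit distance and the error rate relative to the transcript length.
        distance = editdistance.eval(transcript, pred_text)
        error_rate = distance / len(transcript)
    except Exception as e:
        # For example, an empty transcript raises ZeroDivisionError.
        print(f"Got error: {e} for line: {line}")
        distance = 0
        error_rate = 0
    # Flag the entry if either metric exceeds its threshold.
    if distance > distance_threshold or error_rate > error_threshold:
        print(f"Low confidence for {line}")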

## Calculate WER (word error rate)

The previous step listed the sentences with a high edit distance. Next, we list the sentences with a high word error rate, computed with the Python package [jiwer](https://github.com/jitsi/jiwer). Tune the threshold so that it flags the right predictions.

In [10]:
from jiwer import wer

In [11]:
wer_threshold = 0.8  # Maximum allowed word error rate; tune as needed.

In [12]:
for line in manifest:
    transcript = line["text"].lower().translate(punct_dict)
    pred_text = line["pred_text"]
    try:
        error_rate = wer(transcript, pred_text)
    except Exception as e:
        print(f"Got error: {e} for line: {line}")
        error_rate = 0
    if error_rate > wer_threshold:
        print(f"Low confidence for file: {line['audio_filepath']} --- Transcript: {transcript} --- Predicted text: {pred_text} --- Word error rate: {error_rate}")

Low confidence for file: /home/siddhartht/tts/tutorials/6097_5_mins/audio/hartmann_11_fawcett_0337.wav --- Transcript: hitherto --- Predicted text: hither two --- Word error rate: 2.0
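
A WER of 2.0 may look surprising, but WER divides the word-level edit distance by the number of words in the reference, so it exceeds 1 whenever the prediction needs more edits than the reference has words. Here the one-word reference `hitherto` against the prediction `hither two` costs one substitution and one insertion, giving 2 / 1 = 2.0. A minimal pure-Python sketch of that calculation follows; it illustrates the metric and is not jiwer's implementation, and `word_error_rate` is a hypothetical helper:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the current reference prefix and hyp_words[:j].
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref_words)


print(word_error_rate("hitherto", "hither two"))  # 2.0
```

Because short references inflate WER this way, a flagged one-word clip is worth listening to before it is discarded.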
