# Audio Quality
We will use **PESQ (Perceptual Evaluation of Speech Quality)**, **STOI (Short-Time Objective Intelligibility)**, and **DNSMOS (Deep Noise Suppression Mean Opinion Score)** to evaluate audio quality after watermark removal.

**PESQ** is an intrusive (reference-based) metric designed to approximate human perception of speech quality. It compares a reference signal with a degraded (processed) one, accounting for time alignment and perceptual distortions. PESQ is defined in ITU-T Recommendation P.862 and is widely used to evaluate audio codecs and speech processing systems. Scores typically range from 1.0 (bad quality) to 4.5 (excellent quality), with higher values indicating better perceived quality.

**STOI** is a metric that estimates how intelligible a piece of speech is to human listeners. Unlike PESQ, which focuses on overall perceptual quality, STOI is specifically designed to predict the understandability of speech. It operates on short-time overlapping windows and compares the clean and processed signals. The score ranges from 0 to 1, where higher values indicate better intelligibility.

**DNSMOS** is a non-intrusive, deep learning-based metric that estimates speech quality without requiring a reference signal. It predicts four scores: `ovrl_mos` (overall speech quality), `sig_mos` (speech signal quality), `bak_mos` (background noise quality), and `p808_mos` (an overall quality score from a model trained on ITU-T P.808 listening-test ratings). It is particularly useful in scenarios such as speech enhancement or watermark removal, where a clean reference is unavailable or hard to align. All scores typically range from 1 to 5, with higher being better.


Here is a reference table for interpreting PESQ, STOI and DNSMOS scores:

| Metric     | Range       | Measures                 | Bad    | Poor     | Fair     | Good     | Excellent |
|------------|-------------|--------------------------|--------|----------|----------|----------|-----------|
| PESQ       | 1.0 – 4.5   | Audio quality (ref-based)| <1.5   | 1.5–2.4  | 2.5–3.4  | 3.5–4.2  | 4.3–4.5   |
| STOI       | 0.0 – 1.0   | Speech intelligibility   | <0.60  | 0.60–0.75| 0.75–0.85| 0.85–0.95| 0.95–1.00 |
| DNSMOS     | 1.0 – 5.0   | Non-intrusive MOS scores | <2.0   | 2.0–2.9  | 3.0–3.5  | 3.6–4.3  | 4.4–5.0   |
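
The table can be turned into a small lookup for labelling scores. The sketch below is illustrative only: `BANDS` and `quality_label` are hypothetical names, not part of any library. It assumes that the lower bound of each band acts as the threshold, so a value that falls between two printed bands (for example a PESQ of 2.45) receives the lower label, and a value on a shared boundary (for example a STOI of 0.75) receives the higher one.

```python
import math

# Lower bound of each band above "Bad", copied from the reference table.
# Assumption: a score gets the label of the highest lower bound it reaches.
BANDS = {
    "pesq":   [(4.3, "Excellent"), (3.5, "Good"), (2.5, "Fair"), (1.5, "Poor")],
    "stoi":   [(0.95, "Excellent"), (0.85, "Good"), (0.75, "Fair"), (0.60, "Poor")],
    "dnsmos": [(4.4, "Excellent"), (3.6, "Good"), (3.0, "Fair"), (2.0, "Poor")],
}

def quality_label(metric: str, score: float) -> str:
    """Return the quality band for a score (hypothetical helper)."""
    if math.isnan(score):
        return "Unknown"  # e.g. a clip whose metric could not be computed
    for lower_bound, label in BANDS[metric.lower()]:
        if score >= lower_bound:
            return label
    return "Bad"

print(quality_label("pesq", 3.8))    # Good
print(quality_label("stoi", 0.58))   # Bad
print(quality_label("dnsmos", 3.2))  # Fair
```

Keeping the thresholds in one dictionary makes it easy to adjust the bands if a different interpretation scale is preferred.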

In [1]:
import os
from pesq import pesq
from scipy.io import wavfile
from pystoi.stoi import stoi
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import librosa
from speechmos import dnsmos
from concurrent.futures import ProcessPoolExecutor, as_completed

In [2]:
watermarked_path = "../Dataset/Watermarked Audio"
unwatermarked_path = "../Dataset/Unwatermarked Audio"
transcription_path = '../Dataset/Transcriptions/transcriptions_complete.csv'
results_path = '../Dataset/Results'

# Get all AudioSeal .mp3 filenames
unwatermarked_files = [
    f for f in os.listdir(unwatermarked_path)
    if f.endswith(".mp3") and "audioseal" in f
]

watermarked_files = [
    f for f in os.listdir(watermarked_path)
    if f.endswith(".mp3") and "audioseal" in f
]

# Helper to extract the ID from the filename
def extract_id(filename):
    match = re.search(r'common_voice_en_(\d+)', filename)
    return match.group(1) if match else None

# Build dicts by ID
un_dict = {extract_id(f): f for f in unwatermarked_files}
w_dict = {extract_id(f): f for f in watermarked_files}

# Find common IDs and build a dict with (un, w) tuples
matched = {id_: (un_dict[id_], w_dict[id_]) for id_ in un_dict.keys() & w_dict.keys()}

print(f"Audios to evaluate: {len(matched):0,.0f}")

# Find IDs already processed in earlier runs
processed_ids = []
last_file_num = 0
for file in os.listdir(results_path):
    # Only consider checkpoint files named results_audio_quality2_<N>.csv
    num_match = re.fullmatch(r'results_audio_quality2_(\d+)\.csv', file)
    if num_match is None:
        continue
    last_file_num = max(int(num_match.group(1)), last_file_num)
    # Read IDs as strings so they compare equal to the IDs parsed from filenames
    temp = pd.read_csv(os.path.join(results_path, file), usecols=["id"], dtype={"id": str})
    processed_ids.extend(temp["id"].dropna().tolist())

remaining_clips = list(matched.keys() - set(processed_ids))
print(f"Remaining clips {len(remaining_clips):0,.0f} ({len(remaining_clips)/len(matched.keys()):0.1%})")
print("Last file number: ", last_file_num)

Audios to evaluate: 14,124
Remaining clips 14,124 (100.0%)
Last file number:  2


In [3]:
target_sr = 16000
all_results = []
column_names = ["id", "pesq", "stoi"]

for n, i in tqdm(enumerate(remaining_clips[::-1]), total=len(remaining_clips)):
    un, w = matched[i]
    # Load the watermarked audio file
    w_path = os.path.join(watermarked_path, w)
    w_wav, sr = librosa.load(w_path, sr=target_sr)

    # Load the unwatermarked audio file
    un_path = os.path.join(unwatermarked_path, un)
    un_wav, sr = librosa.load(un_path, sr=target_sr)

    # Trim both signals to the shorter length so they can be compared sample by sample
    min_len = min(len(w_wav), len(un_wav))
    w_wav = w_wav[:min_len]
    un_wav = un_wav[:min_len]

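    # Scores: the watermarked clip is passed as the reference, the unwatermarked clip as the processed signal
    try:
        pesq_score = pesq(sr, w_wav, un_wav, 'wb')  # 'wb' expects sr=16000; 'nb' is the narrowband (8 kHz) mode
        stoi_score = stoi(w_wav, un_wav, sr, extended=True)
        result = [i, pesq_score, stoi_score]
    except Exception:
        # Keep the id so a failed clip stays identifiable instead of becoming an all-NaN row
        result = [i, np.nan, np.nan]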

    all_results.append(result)
    # Save a checkpoint after every 1000 clips and once more at the end
    if ((n + 1) % 1000 == 0) or (n == len(remaining_clips) - 1):
        last_file_num += 1
        results_df = pd.DataFrame(all_results, columns=column_names)
        output_filename = f"results_audio_quality2_{last_file_num}.csv"
        results_df.to_csv(os.path.join(results_path, output_filename), index=False)
        all_results = []  # start a fresh chunk

100%|██████████| 14124/14124 [1:20:52<00:00,  2.91it/s]
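
Because the loop writes its results in chunks named `results_audio_quality2_<N>.csv`, the scores have to be pooled before they can be summarized. The following is a minimal sketch using only the standard library. `summarize_results` is a hypothetical helper, not part of the notebook, and it assumes the `pesq` and `stoi` columns written above, with empty or NaN cells marking clips whose metrics could not be computed.

```python
import csv
import glob
import math
import os
import statistics

def summarize_results(results_dir, pattern="results_audio_quality2_*.csv"):
    """Pool the chunked CSVs and return count, mean and median per metric."""
    values = {"pesq": [], "stoi": []}
    for path in sorted(glob.glob(os.path.join(results_dir, pattern))):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                for metric in values:
                    try:
                        score = float(row[metric])
                    except (KeyError, ValueError, TypeError):
                        continue  # missing column or empty/non-numeric cell
                    if not math.isnan(score):  # skip clips that failed
                        values[metric].append(score)
    return {
        metric: {
            "n": len(scores),
            "mean": statistics.fmean(scores),
            "median": statistics.median(scores),
        }
        for metric, scores in values.items()
        if scores  # omit metrics with no valid scores
    }
```

Calling `summarize_results(results_path)` would then return a dictionary such as `{"pesq": {"n": ..., "mean": ..., "median": ...}, "stoi": {...}}`. The median is reported alongside the mean because a few badly degraded clips can skew the average.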
