# Introduction to PyDub

**Purpose.** Why PyDub matters in an ASR pipeline.

| Component          | Role                         | Why it matters                                                                   |
| ------------------ | ---------------------------- | -------------------------------------------------------------------------------- |
| PyDub              | Audio manipulation in Python | One class (`AudioSegment`) to load, inspect, edit, and export audio consistently |
| Consistent formats | Normalize inputs             | Transcription APIs expect specific sample rate, width, channels, and container   |
| ffmpeg             | Codec backend                | Enables non-WAV formats (e.g., MP3, M4A, FLAC)                                   |

**Setup.** How to install and enable codecs.

| Step                  | Command             | Note                          |
| --------------------- | ------------------- | ----------------------------- |
| Install PyDub         | `pip install pydub` | Pure Python wrapper           |
| Enable non-WAV codecs | Install ffmpeg      | Required for MP3/M4A/FLAC I/O |
| WAV-only usage        | none                | Works out of the box          |
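
Before loading non-WAV files, it can help to confirm that a codec backend is discoverable. The following is a minimal sketch using only the standard library; the helper name `has_ffmpeg` is an assumption for illustration, and it only checks the `PATH`, not whether the binary supports a given codec.

```python
import shutil

def has_ffmpeg() -> bool:
    """Return True if an ffmpeg (or legacy avconv) executable is on PATH."""
    return any(shutil.which(name) for name in ("ffmpeg", "avconv"))

if has_ffmpeg():
    print("ffmpeg found: MP3/M4A/FLAC I/O should be available")
else:
    print("ffmpeg not found: stick to WAV or install ffmpeg first")
```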

**Main abstraction.** `AudioSegment` is the workhorse.

| Task           | Method                   | Required args                            | Notes                                                       |
| -------------- | ------------------------ | ---------------------------------------- | ----------------------------------------------------------- |
| Load from file | `AudioSegment.from_file` | `file` (path), optional `format`         | Format inferred from extension if present                   |
| Play audio     | `pydub.playback.play`    | `AudioSegment`                           | Needs `simpleaudio` or `pyaudio` for WAV; ffmpeg for others |
| Export         | `segment.export`         | `out_f`, optional `format`, `parameters` | Write WAV/MP3/M4A, etc.                                     |

**Playback dependencies.** What you need to hear audio.

| Format       | Dependency                 | Install hint                                    |
| ------------ | -------------------------- | ----------------------------------------------- |
| WAV          | `simpleaudio` or `pyaudio` | `pip install simpleaudio`                       |
| MP3/M4A/FLAC | ffmpeg                     | Get binaries from ffmpeg.org or package manager |

**Introspection.** Inspect key audio parameters from an `AudioSegment`.

| Attribute/function | Meaning                        | Typical values           |
| ------------------ | ------------------------------ | ------------------------ |
| `channels`         | 1=mono, 2=stereo               | 1 or 2                   |
| `frame_rate`       | Sampling rate in Hz            | 8k–48k                   |
| `sample_width`     | Bytes per sample               | 1=8-bit, 2=16-bit        |
| `max`              | Max amplitude (loudest sample) | For normalization checks |
| `len(segment)`     | Duration in ms                 | e.g., 2000 for 2 s       |

**Normalization tools.** Set parameters to API-friendly targets.

| Goal         | PyDub method                         | Example target                |
| ------------ | ------------------------------------ | ----------------------------- |
| Sample width | `set_sample_width`                   | 2 bytes (16-bit PCM)          |
| Sample rate  | `set_frame_rate`                     | ≥16,000 Hz (often 16 kHz)     |
| Channels     | `set_channels`                       | 1 (mono)                      |
| Loudness     | scale by ratio or normalize by `max` | Keep headroom, avoid clipping |

**Minimal I/O.** Load, check, normalize, play, and export for ASR.


**Operational targets.** Defaults that keep most ASR APIs happy.

| Parameter    | Recommended value | Rationale                          |
| ------------ | ----------------- | ---------------------------------- |
| Container    | WAV (PCM16)       | Uncompressed, widely accepted      |
| Channels     | Mono (1)          | Deterministic input, half the data |
| Sample rate  | 16 kHz            | Standard for speech models         |
| Sample width | 16-bit (2 bytes)  | Quality vs size balance            |

**Pipeline.** Where PyDub sits in your speech stack.

| Step       | Action                                   | Output              |
| ---------- | ---------------------------------------- | ------------------- |
| Ingest     | `from_file`                              | `AudioSegment`      |
| Normalize  | `set_*` + optional gain                  | Canonical audio     |
| Validate   | Inspect attrs, quick play                | Sanity-checked clip |
| Export     | `export(..., format="wav")`              | ASR-ready file      |
| Transcribe | Feed to `SpeechRecognition` or cloud API | Text hypothesis     |


`PyDub` handles `.wav` files out of the box. To read or write `.mp3` and other compressed formats, `ffmpeg` must be installed.

In [1]:
# I/O and playback
from pydub import AudioSegment
from pydub.playback import play  # requires simpleaudio (WAV) or ffmpeg (others)



In [2]:
# 1) Load (format inferred from the extension; if the path has none, pass format="wav"/"mp3"/...)
file = "../data/audio.wav"  # any format supported by ffmpeg/libav
seg = AudioSegment.from_file(file)  # WAV works without ffmpeg

In [3]:
# 2) Inspect core parameters
print({
    "channels": seg.channels,
    "frame_rate": seg.frame_rate,      # Hz
    "sample_width": seg.sample_width,  # bytes per sample (1 = 8-bit, 2 = 16-bit)
    "duration_ms": len(seg),           # milliseconds
    "max_amplitude": seg.max           # loudest sample in this clip; the ceiling is seg.max_possible_amplitude
})

{'channels': 2, 'frame_rate': 44100, 'sample_width': 2, 'duration_ms': 23313, 'max_amplitude': 11912}
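
To put the printed peak in context, it can be compared with the full-scale value for the clip's sample width. This is a small worked calculation rather than PyDub code; `full_scale` and `peak_dbfs` are hypothetical helper names, and signed PCM is assumed.

```python
import math

def full_scale(sample_width: int) -> int:
    """Largest magnitude of a signed PCM sample with the given width in bytes."""
    return 2 ** (8 * sample_width - 1)

def peak_dbfs(peak: int, sample_width: int) -> float:
    """Peak level in dB relative to full scale (0 dBFS is the ceiling)."""
    return 20 * math.log10(peak / full_scale(sample_width))

# Values from the printed output above: peak 11912 at 2 bytes per sample.
print(full_scale(2))                  # 32768
print(round(peak_dbfs(11912, 2), 2))  # -8.79
```

A peak almost 9 dB below full scale means the clip has room for gain before it clips.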


In [4]:
# 3) Normalize to ASR-friendly PCM: mono, 16 kHz, 16-bit
target = (
    seg.set_channels(1)
       .set_frame_rate(16_000)
       .set_sample_width(2)  # 2 bytes = 16-bit PCM
)

In [5]:
# Optional: simple peak normalization that leaves headroom (peak at 0.95 of full scale)
import math

if target.max > 0:
    gain = 0.95 * target.max_possible_amplitude / target.max  # linear amplitude ratio
    target = target.apply_gain(20 * math.log10(gain))         # convert ratio to dB

In [None]:
# 4) Playback without extra packages (Windows only: winsound ships with the standard library)
from io import BytesIO
import winsound

seg = AudioSegment.from_file(r"..\data\audio.wav")

def play_winsound(seg):
    buf = BytesIO()
    seg.export(buf, format="wav")  # RIFF WAV bytes
    winsound.PlaySound(buf.getvalue(), winsound.SND_MEMORY | winsound.SND_NODEFAULT)


In [None]:
play_winsound(seg)

In [6]:
# 5) Export as WAV PCM16 for downstream ASR
target.export("normalized.wav", format="wav")

<_io.BufferedRandom name='normalized.wav'>

In [7]:
wav_file = AudioSegment.from_file(file)
# Create a new wav file with adjusted frame rate
wav_file_16k = wav_file.set_frame_rate(16000)
# Check the frame rate of the new wav file
print(f"Old frame rate: {wav_file.frame_rate}")
print(f"New frame rate: {wav_file_16k.frame_rate}")

Old frame rate: 44100
New frame rate: 16000


In [None]:
# Set number of channels to 1, i.e., mono
wav_file_1_ch = wav_file.set_channels(1)

# Check the number of channels
print(f"Old number of channels: {wav_file.channels}")
print(f"New number of channels: {wav_file_1_ch.channels}")

Old number of channels: 2
New number of channels: 1


In [21]:
# Set sample_width to 1
wav_file_sw_1 = wav_file.set_sample_width(1)

# Check new sample_width
print(f"Old sample width: {wav_file.sample_width}")
print(f"New sample width: {wav_file_sw_1.sample_width}")

Old sample width: 2
New sample width: 1


# Manipulating audio with PyDub

**Loudness and normalization.** Control perceived volume or even it out before transcription.

| Operation       | Code pattern             | Effect                        | Notes                       |
| --------------- | ------------------------ | ----------------------------- | --------------------------- |
| Reduce volume   | `seg - N`                | Lowers loudness by N dB       | Too quiet can break ASR     |
| Increase volume | `seg + N`                | Raises loudness by N dB       | Avoid clipping on hot audio |
| Normalize peak  | `effects.normalize(seg)` | Uniform gain so the peak sits just below full scale | Uses peak, not LUFS |



In [None]:
# Loudness: reduce, boost, normalize
from pydub import effects

seg = AudioSegment.from_file(file)
too_quiet = seg - 60        # 60 dB quieter
louder    = seg + 6         # +6 dB gain
balanced  = effects.normalize(seg)  # peak normalization

In [41]:
play_winsound(louder)

In [17]:
# Peak-normalize the boosted clip (this is peak normalization, not a LUFS target)
target = effects.normalize(louder)

# Compare loudness metrics
print(f"Original max amplitude: {seg.max}, dBFS: {seg.dBFS}, RMS: {seg.rms}")
print(f"Balanced max amplitude: {balanced.max}, dBFS: {balanced.dBFS}, RMS: {balanced.rms}")
print(f"Louder max amplitude: {louder.max}, dBFS: {louder.dBFS}, RMS: {louder.rms}")
print(f"Normalized (louder) max amplitude: {target.max}, dBFS: {target.dBFS}, RMS: {target.rms}")

Original max amplitude: 11912, dBFS: -24.052762174952594, RMS: 2055
Balanced max amplitude: 32393, dBFS: -15.360762541465892, RMS: 5590
Louder max amplitude: 23768, dBFS: -18.049085578547405, RMS: 4102
Normalized (louder) max amplitude: 32393, dBFS: -15.360762541465892, RMS: 5590
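
The printed RMS values are consistent with the usual decibel arithmetic for amplitudes, where a gain of N dB multiplies amplitude by 10^(N/20). A minimal check, with `db_to_ratio` and `ratio_to_db` as hypothetical helper names:

```python
import math

def db_to_ratio(db: float) -> float:
    """Amplitude ratio for a gain in dB."""
    return 10 ** (db / 20)

def ratio_to_db(ratio: float) -> float:
    """Gain in dB for an amplitude ratio."""
    return 20 * math.log10(ratio)

# RMS values from the printed output above (original clip vs. the +6 dB clip).
rms_original, rms_louder = 2055, 4102

print(round(db_to_ratio(6), 3))                           # amplitude factor for +6 dB
print(round(ratio_to_db(rms_louder / rms_original), 2))   # measured gain in dB
```

In other words, +6 dB roughly doubles the amplitude, which matches the jump from 2055 to 4102.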



**Slicing and remixing.** Trim silence/static and assemble parts deterministically.

| Task           | Syntax             | Output                    | Notes                       |
| -------------- | ------------------ | ------------------------- | --------------------------- |
| Drop first 5 s | `seg[5000:]`       | Audio without initial 5 s | Times are in ms             |
| Keep window    | `seg[a:b]`         | Segment a–b ms            | Half-open slice             |
| Concatenate    | `seg1 + seg2`      | End-to-end join           | Rate, width, channels synced |
| Join + gain    | `seg1 + seg2 + 10` | Join then +10 dB          | Operator precedence applies |



In [None]:
file1 = "../data/audio.wav"
file2 = "../data/audio2.wav"

seg   = AudioSegment.from_file(file1)
seg1  = AudioSegment.from_file(file2)

trimmed      = seg[5000:]           # drop first 5 s
mayonnaise   = seg[12_000:15_000]   # keep 12–15 s window
combo        = seg1 + trimmed       # concatenate two segments
mayonnaise21 = mayonnaise * 21      # repeat 21 times

In [34]:
play_winsound(mayonnaise)

In [None]:
play_winsound(mayonnaise21)

In [None]:
spedup_audio = mayonnaise21.speedup(playback_speed=15, chunk_size=150, crossfade=25)
play_winsound(spedup_audio)


**Channel handling.** Split stereo to per-speaker mono for cleaner ASR.

| Input                 | Method            | Output          | Use                                |
| --------------------- | ----------------- | --------------- | ---------------------------------- |
| Stereo `AudioSegment` | `split_to_mono()` | `[left, right]` | Transcribe each speaker separately |



In [None]:
stereo = AudioSegment.from_file(file)   # 2 channels
left, right = stereo.split_to_mono()            # list of mono segments

left_norm  = left.set_frame_rate(16_000).set_sample_width(2)
right_norm = right.set_frame_rate(16_000).set_sample_width(2)

In [22]:
play_winsound(left_norm) 

In [23]:
play_winsound(left)


**ASR-oriented fixes.** Map common issues to PyDub tools.

| Issue                    | Symptom                 | Fix                | PyDub tool                |
| ------------------------ | ----------------------- | ------------------ | ------------------------- |
| Too quiet                | ASR misses words        | Boost or normalize | `+N`, `effects.normalize` |
| Variable loudness        | Inconsistent confidence | Normalize          | `effects.normalize`       |
| Leading noise/static     | Garbage tokens at start | Trim head          | `seg[ms:]`                |
| Mixed speakers in stereo | Crosstalk in transcript | Split channels     | `split_to_mono()`         |


# Converting and saving

In [1]:
from pathlib import Path
from pydub import AudioSegment



In [None]:
file = "data/raw/wav/audio1.wav"  # any format supported by ffmpeg/libav

**Export goal.** Move edited `AudioSegment`s to disk in a consistent ASR-ready format.

| Component             | Role                                   | Why it matters                                 |
| --------------------- | -------------------------------------- | ---------------------------------------------- |
| `AudioSegment.export` | Write audio to file                    | Finalizes edits for downstream ASR             |
| `format` arg          | Target container (`"wav"`, `"mp3"`, …) | Default is `"mp3"`; specify `"wav"` explicitly |
| ffmpeg                | Codec backend for non-WAV              | Required for MP3/FLAC/M4A I/O                  |

**Single-file export.** One clip → boost volume → save as WAV PCM16.

In [14]:
# PyDub: load → gain → ensure PCM16 mono 16 kHz → export as WAV
seg = AudioSegment.from_file(file)      # WAV works without ffmpeg
proc = (
    (seg + 10)               # +10 dB gain
    .set_channels(1)         # mono
    .set_sample_width(2)     # 16-bit PCM
    .set_frame_rate(16_000)  # 16 kHz
    .normalize()             # peak normalization
)

out = Path("data/processed/audio1_louder.wav")
proc.export(out.as_posix(), format="wav")      # for MP3/FLAC you need ffmpeg installed
print(f"wrote: {out}")

wrote: data\processed\audio1_louder.wav



**Batch reformatting.** Convert a folder of MP3/FLAC to WAV in one pass.

| Step        | Python tool                         | Note                            |
| ----------- | ----------------------------------- | ------------------------------- |
| List files  | `pathlib.Path.glob` or `os.scandir` | Filter by suffix                |
| Derive name | `Path.stem`                         | Keep base name                  |
| Load        | `AudioSegment.from_file`            | Format inferred by extension    |
| Export      | `segment.export(..., format="wav")` | Needs ffmpeg for non-WAV inputs |
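
The listing and naming steps need nothing beyond `pathlib`. The sketch below pairs each input with its WAV output path and demonstrates it on empty placeholder files; `AUDIO_SUFFIXES` and `plan_conversions` are hypothetical names, and the actual conversion is shown only as a comment because it requires ffmpeg.

```python
import tempfile
from pathlib import Path

AUDIO_SUFFIXES = {".mp3", ".flac"}  # inputs to convert; adjust as needed

def plan_conversions(in_dir: Path, out_dir: Path) -> list:
    """Pair every matching input file with its WAV output path."""
    return sorted(
        (p, out_dir / f"{p.stem}.wav")
        for p in in_dir.glob("*")
        if p.suffix.lower() in AUDIO_SUFFIXES
    )

# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.mp3", "b.FLAC", "notes.txt"):
        (root / name).touch()
    plan = plan_conversions(root, root / "wav")
    for src, dst in plan:
        print(src.name, "->", dst.name)
        # With ffmpeg installed, the conversion itself would be:
        # AudioSegment.from_file(str(src)).export(str(dst), format="wav")
```

Lower-casing the suffix keeps files such as `b.FLAC` from being skipped on case-sensitive file systems.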


```bash
winget install --id Gyan.FFmpeg -e
```

In [12]:
# Convert .mp3/.flac → .wav across a directory tree
# Requires ffmpeg/libav for MP3/FLAC/etc.
# winget install --id Gyan.FFmpeg -e
from scripts import mp3_to_wav

mp3_to_wav.convert(r"data\raw\mp3\audio3.mp3", r"data\raw\wav\audio3.wav")

FFBIN = c:\Users\herie\OneDrive - Fundacion Universidad de las Americas Puebla\Proyectos\En Proceso\Paralinguistic Speech Classification for Human Vocalizations\scripts\ffmpeg-8.0-essentials_build\bin



**Batch manipulate then export.** Trim leading static, apply gain, standardize, save.

| Issue fixed         | Operation      | PyDub method                                       |
| ------------------- | -------------- | -------------------------------------------------- |
| Leading static      | Drop first 3 s | `seg[3000:]`                                       |
| Too quiet           | Add gain       | `seg + 10`                                         |
| Non-standard params | Canonicalize   | `set_channels / set_frame_rate / set_sample_width` |


In [15]:
# Suppose we have audios with inconsistent formats and quality.
def clean_and_export(in_dir, out_dir, head_ms=3000, gain_db=10):
    in_dir  = Path(in_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for p in in_dir.rglob("*"):
        if p.suffix.lower() in {".wav", ".mp3", ".flac", ".m4a"}:
            seg = AudioSegment.from_file(p.as_posix())
            proc = (
                seg[head_ms:]                # trim head
                .apply_gain(gain_db)         # boost
                .set_channels(1)             # mono
                .set_sample_width(2)         # 16-bit PCM
                .set_frame_rate(16_000)      # 16 kHz
            )
            out = out_dir / f"{p.stem}.wav"
            proc.export(out.as_posix(), format="wav")
            print("wrote", out)

clean_and_export("data/raw/mp3/", "data/processed/")

wrote data\processed\audio3.wav



**Operational defaults.** Safe targets for most ASR backends.

| Parameter      | Recommended value | Rationale                              |
| -------------- | ----------------- | -------------------------------------- |
| Container      | WAV (PCM16)       | Widely accepted, lossless              |
| Channels       | Mono (1)          | Simpler models, smaller files          |
| Sample rate    | 16 kHz            | Standard for speech                    |
| Sample width   | 16-bit            | Quality-size balance                   |
| Head/tail trim | Data-dependent    | Remove static/silence to reduce errors |

**Pipeline.** End-to-end path to transcription.

| Stage      | Action                             | Output          |
| ---------- | ---------------------------------- | --------------- |
| Gather     | Load originals                     | `AudioSegment`  |
| Fix        | Trim, gain, normalize, standardize | Canonical clips |
| Save       | `export(..., format="wav")`        | On-disk WAVs    |
| Transcribe | Feed to ASR (`recognize_*`)        | Text hypotheses |
