# Preprocessing Pipeline

This notebook collects the preprocessing commands used to build the IRMAS and Chinese instrument datasets.
Each section mirrors a Makefile target, so you can run the steps interactively, one at a time.


## Setup

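Define shared paths, mel-spectrogram parameters, and a helper `run` function so later cells stay concise. `PROJECT_ROOT` is taken from the current working directory, so start the notebook from the repository root.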


In [1]:
from __future__ import annotations

from pathlib import Path
from typing import Sequence
import os
import subprocess

PROJECT_ROOT = Path.cwd()

DATA_DIR = PROJECT_ROOT / "data"
IRMAS_TRAIN_DIR = DATA_DIR / "audio" / "IRMAS" / "IRMAS-TrainingData"
IRMAS_TEST_DIR = DATA_DIR / "audio" / "IRMAS" / "IRMAS-TestingData-Part1"
CHN_DIR = DATA_DIR / "audio" / "chinese_instruments"
CHN_SOURCES_DIR = CHN_DIR / "sources"

CACHE_DIR = PROJECT_ROOT / ".cache"
IRMAS_MELS_DIR = CACHE_DIR / "mels" / "irmas"
IRMAS_TEST_MELS_DIR = IRMAS_MELS_DIR / "test"

MANIFEST_DIR = DATA_DIR / "manifests"
IRMAS_TRAIN_MANIFEST = MANIFEST_DIR / "irmas_train.csv"
IRMAS_TRAIN_MELS_CSV = MANIFEST_DIR / "irmas_train_mels.csv"
IRMAS_TEST_MELS_CSV = MANIFEST_DIR / "irmas_test_mels.csv"

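SR = 44100      # sample rate in Hz (passed as --sr)
DUR = 3.0       # clip / window duration in seconds (passed as --dur)
N_MELS = 128    # number of mel bands (passed as --n_mels)
WIN_MS = 30.0   # analysis window length in milliseconds (passed as --win_ms)
HOP_MS = 10.0   # hop between analysis frames in milliseconds (passed as --hop_ms)
STRIDE_S = 3    # stride between test windows in seconds (passed as --stride_s)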

ENV = dict(os.environ)
ENV["PYTHONPATH"] = str(PROJECT_ROOT / "src")

def run(cmd: Sequence[str]) -> None:
    """Execute a command with the project PYTHONPATH set."""
    print("Running:", " ".join(str(item) for item in cmd))
    subprocess.run(cmd, check=True, cwd=PROJECT_ROOT, env=ENV)
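
The recorded outputs in this notebook show a project path that contains spaces, so the echoed command line cannot be pasted back into a shell as printed. A minimal sketch of a quoting formatter follows. `format_cmd` is a hypothetical helper, not part of the project, and it assumes POSIX-style quoting is acceptable for display purposes (on Windows the quoting rules of `cmd.exe` differ).

```python
from __future__ import annotations

import shlex
from typing import Sequence


def format_cmd(cmd: Sequence[object]) -> str:
    """Return a shell-quoted, copy-pasteable rendering of an argument list.

    Hypothetical helper: it only affects what is printed, not what is executed.
    """
    # shlex.join quotes every argument that contains spaces or shell metacharacters.
    return shlex.join(str(item) for item in cmd)


# Usage: an argument with a space is wrapped in single quotes.
print(format_cmd(["python", "-m", "scripts.summarise_data", "--root", "data/audio/my dir"]))
# python -m scripts.summarise_data --root 'data/audio/my dir'
```

Inside `run`, the `print` call could use `format_cmd(cmd)` in place of the plain `" ".join(...)`; the `subprocess.run` call itself already receives the arguments as a list and is unaffected by spaces.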


## Generate Train Manifests

Rebuild the manifest CSVs that list training audio files for IRMAS and Chinese instruments.


In [2]:
run([
    "python",
    "-m",
    "scripts.generate_train_manifests",
    "--irmas_dir",
    str(IRMAS_TRAIN_DIR),
    "--chinese_dir",
    str(CHN_DIR),
    "--out_dir",
    str(MANIFEST_DIR),
])


Running: python -m scripts.generate_train_manifests --irmas_dir e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\IRMAS\IRMAS-TrainingData --chinese_dir e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\chinese_instruments --out_dir e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\manifests
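
After the manifests are written, a quick look at their header and row count can confirm that the step produced something sensible. The sketch below makes no assumption about column names; it does assume the CSVs are UTF-8 encoded and start with a header row. `preview_manifest` is a hypothetical helper, not part of the project.

```python
from __future__ import annotations

import csv
from pathlib import Path


def preview_manifest(path: Path, n: int = 3) -> tuple[list[str], int, list[list[str]]]:
    """Return (header, number of data rows, first n data rows) of a CSV file.

    Assumption: the first row is a header and the file is UTF-8 encoded.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])  # empty list if the file has no rows at all
        rows = list(reader)
    return header, len(rows), rows[:n]


# Usage in this notebook (path taken from the Setup cell):
# header, count, head = preview_manifest(IRMAS_TRAIN_MANIFEST)
# print(header, count)
```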


## Generate IRMAS Train Mel Cache

Create or refresh the mel-spectrogram cache and manifest used for training.


In [3]:
run([
    "python",
    "-m",
    "scripts.generate_irmas_train_mels",
    "--irmas_train_dir",
    str(IRMAS_TRAIN_DIR),
    "--cache_root",
    str(IRMAS_MELS_DIR / "train"),
    "--mel_manifest_out",
    str(IRMAS_TRAIN_MELS_CSV),
    "--sr",
    str(SR),
    "--dur",
    str(DUR),
    "--n_mels",
    str(N_MELS),
    "--win_ms",
    str(WIN_MS),
    "--hop_ms",
    str(HOP_MS),
])


Running: python -m scripts.generate_irmas_train_mels --irmas_train_dir e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\IRMAS\IRMAS-TrainingData --cache_root e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\.cache\mels\irmas\train --mel_manifest_out e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\manifests\irmas_train_mels.csv --sr 44100 --dur 3.0 --n_mels 128 --win_ms 30.0 --hop_ms 10.0
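
To get a feel for the size of each cached spectrogram, the millisecond parameters can be converted to samples. The arithmetic below is only a sketch: how `scripts.generate_irmas_train_mels` rounds, pads, or centers frames is not shown in this notebook, so both common framing conventions are computed and neither is claimed to be the one the script uses.

```python
# Values copied from the Setup cell.
SR = 44100
DUR = 3.0
WIN_MS = 30.0
HOP_MS = 10.0


def ms_to_samples(ms: float, sr: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sr / 1000.0))


win_length = ms_to_samples(WIN_MS, SR)  # 1323 samples
hop_length = ms_to_samples(HOP_MS, SR)  # 441 samples
n_samples = int(round(DUR * SR))        # 132300 samples

# Assumption A: frames are centered with padding at both ends.
frames_centered = 1 + n_samples // hop_length               # 301 frames
# Assumption B: only frames that fit entirely inside the clip are kept.
frames_valid = 1 + (n_samples - win_length) // hop_length   # 298 frames

print(win_length, hop_length, frames_centered, frames_valid)
```

Under either assumption, a 3-second clip yields roughly 300 frames of `N_MELS = 128` bands each.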


## Generate IRMAS Test Windows

Produce sliding-window mel spectrograms and the corresponding manifest for the IRMAS test set.


In [4]:
run([
    "python",
    "-m",
    "scripts.generate_irmas_test_mels",
    "--irmas_test_dir",
    str(IRMAS_TEST_DIR),
    "--cache_root",
    str(IRMAS_TEST_MELS_DIR),
    "--mel_manifest_out",
    str(IRMAS_TEST_MELS_CSV),
    "--sr",
    str(SR),
    "--dur",
    str(DUR),
    "--n_mels",
    str(N_MELS),
    "--win_ms",
    str(WIN_MS),
    "--hop_ms",
    str(HOP_MS),
    "--stride_s",
    str(STRIDE_S),
])


Running: python -m scripts.generate_irmas_test_mels --irmas_test_dir e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\IRMAS\IRMAS-TestingData-Part1 --cache_root e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\.cache\mels\irmas\test --mel_manifest_out e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\manifests\irmas_test_mels.csv --sr 44100 --dur 3.0 --n_mels 128 --win_ms 30.0 --hop_ms 10.0 --stride_s 3
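
The effect of `--dur` and `--stride_s` can be illustrated by listing the window start times for a clip of a given length. This is a sketch under stated assumptions: `window_starts` is a hypothetical function, and dropping a trailing partial window is an assumption, since the script may instead pad the last window.

```python
from __future__ import annotations


def window_starts(total_s: float, dur_s: float, stride_s: float) -> list[float]:
    """Start times (seconds) of all full windows of length dur_s, stepped by stride_s.

    Assumption: a trailing window that does not fit entirely in the clip is dropped.
    """
    if dur_s <= 0 or stride_s <= 0:
        raise ValueError("dur_s and stride_s must be positive")
    starts: list[float] = []
    k = 0
    # The small tolerance guards against floating-point error at the clip end.
    while k * stride_s + dur_s <= total_s + 1e-9:
        starts.append(k * stride_s)
        k += 1
    return starts


# With the notebook's settings (DUR = 3.0, STRIDE_S = 3) windows do not overlap.
print(window_starts(10.0, 3.0, 3))    # [0, 3, 6]
# A stride smaller than the window length gives overlapping windows.
print(window_starts(6.0, 3.0, 1.5))   # [0.0, 1.5, 3.0]
```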


## Build Chinese Instrument Datasets

Generate the audio dataset for each Chinese instrument from its source JSON manifest.


In [5]:
CHINESE_JSON_SOURCES = [
    ("percussion", CHN_SOURCES_DIR / "percussion.json"),
    ("dizi", CHN_SOURCES_DIR / "dizi.json"),
    ("guzheng", CHN_SOURCES_DIR / "guzheng.json"),
    ("suona", CHN_SOURCES_DIR / "suona.json"),
]

for instrument, json_path in CHINESE_JSON_SOURCES:
    print(f"\n[{instrument}]")
    run([
        "python",
        "-m",
        "scripts.generate_data_from_json",
        "--input",
        str(json_path),
    ])



[percussion]
Running: python -m scripts.generate_data_from_json --input e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\chinese_instruments\sources\percussion.json

[dizi]
Running: python -m scripts.generate_data_from_json --input e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\chinese_instruments\sources\dizi.json

[guzheng]
Running: python -m scripts.generate_data_from_json --input e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\chinese_instruments\sources\guzheng.json

[suona]
Running: python -m scripts.generate_data_from_json --input e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\chinese_instruments\sources\suona.json
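
Because `run` uses `check=True`, the loop stops at the first source that fails. A pre-flight check can report every missing or unparsable source file up front. The sketch below only tests that each file exists and parses as JSON; it assumes nothing about the schema expected by `scripts.generate_data_from_json`. `check_sources` is a hypothetical helper, not part of the project.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterable


def check_sources(sources: Iterable[tuple[str, Path]]) -> list[tuple[str, str]]:
    """Return (name, problem) pairs for sources that are missing or not valid JSON."""
    problems: list[tuple[str, str]] = []
    for name, path in sources:
        if not path.is_file():
            problems.append((name, "missing"))
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            problems.append((name, f"invalid JSON: {exc}"))
    return problems


# Usage in this notebook, before the loop:
# for name, problem in check_sources(CHINESE_JSON_SOURCES):
#     print(f"[{name}] {problem}")
```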


## Summaries

Run the summary script on each audio directory to refresh the dataset statistics and inspect what was generated.


In [6]:
run([
    "python",
    "-m",
    "scripts.summarise_data",
    "--root",
    str(CHN_DIR),
])

run([
    "python",
    "-m",
    "scripts.summarise_data",
    "--root",
    str(IRMAS_TRAIN_DIR),
])


Running: python -m scripts.summarise_data --root e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\chinese_instruments
Running: python -m scripts.summarise_data --root e:\qingchaolaopian\Instrument Sound\Github\ML-based-analysis-of-sound\data\audio\IRMAS\IRMAS-TrainingData


## Clean Cache (Optional)

Deleting the cache directories is destructive. Uncomment the lines below only if you need to start over.


In [7]:
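# import shutil  # portable alternative to `rm -rf`, which is unavailable in a plain Windows shell
# shutil.rmtree(CACHE_DIR / "mels", ignore_errors=True)
# shutil.rmtree(CACHE_DIR / "mels_chinese", ignore_errors=True)
# shutil.rmtree(CACHE_DIR / "canonical", ignore_errors=True)
# shutil.rmtree(CACHE_DIR / "video_tmp", ignore_errors=True)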
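
Since deletion cannot be undone, it can help to route it through a helper that refuses to touch anything outside the cache directory and defaults to a dry run. The following is a minimal sketch; `remove_cache_subdir` is a hypothetical name and is not part of the project.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def remove_cache_subdir(cache_dir: Path, name: str, dry_run: bool = True) -> Path:
    """Delete cache_dir/name, refusing any target that is not strictly inside cache_dir.

    With dry_run=True (the default) nothing is deleted; the target is only printed.
    """
    root = cache_dir.resolve()
    target = (cache_dir / name).resolve()
    # Guard against names such as "" or "../data" that escape the cache directory.
    if root not in target.parents:
        raise ValueError(f"{target} is not inside {root}")
    if dry_run:
        print("Would remove:", target)
    elif target.exists():
        shutil.rmtree(target)
    return target


# Usage in this notebook (CACHE_DIR comes from the Setup cell):
# remove_cache_subdir(CACHE_DIR, "mels")                 # dry run, prints the target
# remove_cache_subdir(CACHE_DIR, "mels", dry_run=False)  # actually deletes
```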
