# Dataset Comparison: Original vs. Post-Augmentation

This notebook builds a table comparing the **Original (Raw)** training dataset with the **Post-Augmentation** dataset used for model training.

### Metrics Calculated:
1. **Total Audio Clips:** Number of files in the training split.
2. **Total Duration (min):** Sum of all clip durations, in minutes.
3. **Avg. Clip Duration (s):** Mean duration of a single clip, in seconds.
4. **Number of Classes:** Count of unique Surahs present.
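
As a rough illustration of how these four metrics relate, the sketch below computes them from a small list of `(label, duration)` pairs. The helper name `summarize` and the example values are made up for illustration; they are not part of the training code or the dataset.

```python
from statistics import mean

def summarize(clips):
    """clips: list of (surah_label, duration_in_seconds) pairs."""
    durations = [d for _, d in clips]
    return {
        "total_clips": len(clips),
        "total_duration_min": round(sum(durations) / 60, 2),
        "avg_clip_duration_s": round(mean(durations), 2) if durations else 0,
        "num_classes": len({label for label, _ in clips}),
    }

# Made-up example values, not taken from the dataset
example = [("001", 6.0), ("001", 4.0), ("002", 20.0)]
stats = summarize(example)
print(stats)
```

Three clips totalling 30 seconds give 0.5 minutes overall, a mean of 10 seconds per clip, and two distinct classes.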

--- 
## 1. Imports & Configuration

In [None]:
import os
import pandas as pd
import torchaudio
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm

# --- CONFIGURATION ---
# Must match the training notebook settings
DATASET_ROOT = "audio_data_processed"
SAMPLE_RATE = 16000
DURATION = 5  # Target duration in seconds (post-processing)

--- 
## 2. Replicate Data Split Logic
We reuse the training notebook's split (by reciter, 70/30, `random_state=42`) so that the analysis covers exactly the files in the training set.
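
Because the split is made over reciter folders rather than individual files, no reciter should appear on both sides. A minimal sanity check along these lines could be run after the split; `check_reciter_split` is a hypothetical helper and the reciter names are placeholders.

```python
def check_reciter_split(all_reciters, train, held_out):
    """Return True if train and held_out partition all_reciters with no overlap."""
    train_set, held_set = set(train), set(held_out)
    no_overlap = train_set.isdisjoint(held_set)
    covers_all = (train_set | held_set) == set(all_reciters)
    return no_overlap and covers_all

# Hypothetical reciter folder names, for illustration only
reciters = ["qari_a", "qari_b", "qari_c", "qari_d"]
ok = check_reciter_split(reciters, ["qari_a", "qari_c", "qari_d"], ["qari_b"])
leaky = check_reciter_split(reciters, ["qari_a", "qari_b"], ["qari_b", "qari_c", "qari_d"])
print(ok, leaky)
```

In this notebook the same check could be called as `check_reciter_split(all_qaris, train_qaris, test_val_qaris)`.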

In [None]:
if not os.path.exists(DATASET_ROOT):
    raise FileNotFoundError(f"Dataset root '{DATASET_ROOT}' not found. Please check the path.")

# Get all qari folders
all_qaris = [d for d in sorted(os.listdir(DATASET_ROOT)) if os.path.isdir(os.path.join(DATASET_ROOT, d))]
print(f"Found {len(all_qaris)} unique reciters.")

# Replicate the split (Seed 42)
train_qaris, test_val_qaris = train_test_split(all_qaris, test_size=0.30, random_state=42)
print(f"Analyzing Training Set ({len(train_qaris)} reciters)...")

--- 
## 3. Collect File Metadata
Scan the raw training files to compute the original clip count and total duration.
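
The file-name rule applied in the cell below (an `.mp3` file whose ID is exactly six digits, with the first three digits taken as the Surah label) can be sketched in isolation. `parse_surah_label` is a hypothetical helper and the file names are invented examples.

```python
def parse_surah_label(filename):
    """Return the 3-digit Surah label, or None if the name fails validation."""
    if not filename.endswith(".mp3"):
        return None
    file_id = filename.split(".")[0]
    if len(file_id) != 6 or not file_id.isdigit():
        return None
    return file_id[:3]

# Hypothetical file names
names = ["001007.mp3", "114006.mp3", "readme.txt", "12345.mp3", "00100a.mp3"]
labels = [parse_surah_label(n) for n in names]
for name, label in zip(names, labels):
    print(name, "->", label)
```

Only the first two names pass; the others are rejected for the wrong extension, the wrong length, or a non-digit character.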

In [None]:
train_file_paths = []
train_surah_labels = []

# Collect paths
for qari_folder in train_qaris:
    qari_path = os.path.join(DATASET_ROOT, qari_folder)
    for filename in os.listdir(qari_path):
        if filename.endswith(".mp3"):
            file_id = filename.split('.')[0]
            # Basic validation logic from training script
            if len(file_id) != 6 or not file_id.isdigit():
                continue
            
            full_path = os.path.join(qari_path, filename)
            train_file_paths.append(full_path)
            train_surah_labels.append(file_id[:3])

print(f"Total training files found: {len(train_file_paths)}")

# Calculate Original Stats
total_original_duration_sec = 0
valid_files_count = 0

print("Scanning file durations (Original)... ")
for path in tqdm(train_file_paths):
    try:
        # torchaudio.info is faster than loading the whole file
        metadata = torchaudio.info(path)
        duration = metadata.num_frames / metadata.sample_rate
        total_original_duration_sec += duration
        valid_files_count += 1
    except Exception as e:
        print(f"Skipping unreadable file: {path} ({e})")

print("Scan complete.")

--- 
## 4. Generate Comparison Table
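
The post-augmentation figures are derived rather than measured: they assume every clip is padded or cropped to exactly `DURATION` seconds. The sketch below shows why that makes the total duration equal to the clip count times `DURATION`. The training dataset class is not shown in this notebook, so `pad_or_crop`, the zero padding at the end, and the crop from the start are all assumptions made for illustration.

```python
def pad_or_crop(samples, target_len, pad_value=0.0):
    """Return a copy of samples with exactly target_len items."""
    if len(samples) >= target_len:
        return samples[:target_len]  # assumed: crop from the start
    return samples + [pad_value] * (target_len - len(samples))  # assumed: zero-pad the end

SAMPLE_RATE = 16000
DURATION = 5
target_len = SAMPLE_RATE * DURATION  # 80,000 samples per clip

short_clip = [0.1] * (2 * SAMPLE_RATE)  # a 2 s clip gets padded
long_clip = [0.1] * (9 * SAMPLE_RATE)   # a 9 s clip gets cropped
processed = [pad_or_crop(c, target_len) for c in (short_clip, long_clip)]

clip_seconds = [len(p) / SAMPLE_RATE for p in processed]
total_min = round(len(processed) * DURATION / 60, 2)
print(clip_seconds, total_min)
```

Both clips end up 5 seconds long regardless of their original length, so two clips always contribute 10 seconds in total.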

In [None]:
# --- CALCULATION ---

# 1. Original Stats
orig_clips = valid_files_count
orig_dur_min = round(total_original_duration_sec / 60, 2)
orig_avg_dur = round(total_original_duration_sec / valid_files_count, 2) if valid_files_count else 0
# Note: labels come from every collected path, including any files skipped as unreadable during the scan
orig_classes = len(set(train_surah_labels))

# 2. Post-Augmentation Stats
# These values are derived, not measured: the dataset class forces every clip
# to exactly DURATION (5 s) via padding or cropping.
# It does NOT increase the number of files (1-to-1 mapping during iteration).
aug_clips = valid_files_count
aug_dur_min = round((aug_clips * DURATION) / 60, 2)
aug_avg_dur = float(DURATION)
aug_classes = orig_classes

# --- DATAFRAME ---
stats_data = {
    "Dataset": ["Original (Raw Training Set)", "Post-Augmentation (Processed)"],
    "Total Audio Clips": [orig_clips, aug_clips],
    "Total Duration (min)": [orig_dur_min, aug_dur_min],
    "Avg. Clip Duration (s)": [orig_avg_dur, aug_avg_dur],
    "Number of Classes (Surah)": [orig_classes, aug_classes]
}

df_comparison = pd.DataFrame(stats_data)

print("\n=== DATASET COMPARISON TABLE ===")
display(df_comparison)

# Optional: Save to CSV
# df_comparison.to_csv("dataset_comparison_table.csv", index=False)