# Exploratory Data Analysis - RAVDESS Dataset
**Contributor:** Pratheek Sankeshi (psankesh9@berkeley.edu)

**Project:** Emotional Vocalization Classification

This notebook performs exploratory data analysis on the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset to understand:
- Dataset structure and size
- Emotion class distribution
- Audio feature characteristics (MFCCs, spectrograms, pitch, energy)
- Correlations between features and emotions
- Data quality and potential challenges

## 1. Setup and Imports

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
from pathlib import Path
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set style for better-looking plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

## 2. Dataset Loading and Metadata Extraction

RAVDESS filename format: `Modality-VocalChannel-Emotion-EmotionalIntensity-Statement-Repetition-Actor.wav`

- **Modality:** 01 = full-AV, 02 = video-only, 03 = audio-only
- **Vocal channel:** 01 = speech, 02 = song
- **Emotion:** 01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised
- **Emotional intensity:** 01 = normal, 02 = strong (there is no strong intensity for neutral)
- **Statement:** 01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door"
- **Repetition:** 01 = 1st repetition, 02 = 2nd repetition
- **Actor:** 01 to 24 (odd = male, even = female)
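
As a quick illustration of the naming scheme, the sketch below decodes one filename by hand. The filename `03-01-06-01-02-01-12.wav` is only an example, and `decode` and the lookup tables are hypothetical names introduced here, not part of any dataset tooling.

```python
# Lookup tables transcribed from the field list above.
MODALITY = {"01": "full-AV", "02": "video-only", "03": "audio-only"}
VOCAL_CHANNEL = {"01": "speech", "02": "song"}
EMOTION = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
           "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
INTENSITY = {"01": "normal", "02": "strong"}


def decode(name):
    """Decode a RAVDESS-style filename into readable fields (illustrative helper)."""
    stem = name.rsplit(".", 1)[0]
    parts = stem.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 hyphen-separated fields, got {len(parts)}")
    modality, channel, emotion, intensity, statement, repetition, actor = parts
    return {
        "modality": MODALITY[modality],
        "vocal_channel": VOCAL_CHANNEL[channel],
        "emotion": EMOTION[emotion],
        "intensity": INTENSITY[intensity],
        "statement": statement,
        "repetition": repetition,
        "actor": int(actor),
        "gender": "male" if int(actor) % 2 == 1 else "female",
    }


example = decode("03-01-06-01-02-01-12.wav")
print(example)
```

The length check makes a stray non-RAVDESS `.wav` file fail loudly rather than produce a misleading row.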

In [None]:
# TODO: Update this path to your RAVDESS dataset location
DATASET_PATH = Path('../data/RAVDESS')  # Adjust this path

# Emotion mapping
EMOTION_LABELS = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}

def parse_filename(filename):
    """Parse RAVDESS filename to extract metadata."""
    parts = filename.stem.split('-')
    return {
        'modality': parts[0],
        'vocal_channel': parts[1],
        'emotion_code': parts[2],
        'emotion': EMOTION_LABELS[parts[2]],
        'intensity': parts[3],
        'statement': parts[4],
        'repetition': parts[5],
        'actor': parts[6],
        'gender': 'male' if int(parts[6]) % 2 == 1 else 'female',
        'filepath': str(filename)
    }

def load_dataset_metadata(dataset_path):
    """Load all audio files and extract metadata."""
    audio_files = list(dataset_path.glob('**/*.wav'))
    metadata = [parse_filename(f) for f in audio_files]
    return pd.DataFrame(metadata)

# Load metadata
df = load_dataset_metadata(DATASET_PATH)
print(f"Total number of audio files: {len(df)}")
print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few entries:")
df.head()

## 3. Dataset Statistics

In [None]:
# Basic statistics
print("=== DATASET OVERVIEW ===")
print(f"Total files: {len(df)}")
print(f"Number of actors: {df['actor'].nunique()}")
print(f"Number of emotions: {df['emotion'].nunique()}")
print(f"\nEmotions: {sorted(df['emotion'].unique())}")
print(f"\nVocal channels: Speech={len(df[df['vocal_channel']=='01'])}, Song={len(df[df['vocal_channel']=='02'])}")
print(f"Gender distribution: Male={len(df[df['gender']=='male'])}, Female={len(df[df['gender']=='female'])}")
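
As a sanity check for the counts printed above, the expected totals can be derived from the recording design. The sketch assumes the published RAVDESS layout: neutral is recorded at normal intensity only, and the song portion covers six emotions and 23 actors. The helper name `expected_counts` is hypothetical.

```python
def expected_counts(emotions, n_actors, statements=2, repetitions=2):
    """Expected files per emotion; neutral has one intensity, the rest have two."""
    per_actor = {
        emotion: (1 if emotion == "neutral" else 2) * statements * repetitions
        for emotion in emotions
    }
    return {emotion: n * n_actors for emotion, n in per_actor.items()}


SPEECH_EMOTIONS = ["neutral", "calm", "happy", "sad", "angry",
                   "fearful", "disgust", "surprised"]
# Assumption: the song portion omits disgust and surprised and has 23 actors.
SONG_EMOTIONS = ["neutral", "calm", "happy", "sad", "angry", "fearful"]

speech = expected_counts(SPEECH_EMOTIONS, n_actors=24)
song = expected_counts(SONG_EMOTIONS, n_actors=23)

print("Speech:", speech, "total =", sum(speech.values()))
print("Song:  ", song, "total =", sum(song.values()))
```

If the notebook's totals differ from these, check whether `DATASET_PATH` contains speech only, song only, or both.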

## 4. Class Distribution Visualizations

In [None]:
# Emotion distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Bar plot
emotion_counts = df['emotion'].value_counts().sort_index()
axes[0].bar(emotion_counts.index, emotion_counts.values, color='steelblue', edgecolor='black')
axes[0].set_xlabel('Emotion', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Distribution of Emotions in RAVDESS Dataset', fontsize=14, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(axis='y', alpha=0.3)

# Pie chart
axes[1].pie(emotion_counts.values, labels=emotion_counts.index, autopct='%1.1f%%', 
            startangle=90, colors=sns.color_palette('Set2', len(emotion_counts)))
axes[1].set_title('Emotion Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n**Interpretation:**")
print("Seven of the eight emotion classes should appear in roughly equal numbers.")
print("Neutral has no strong-intensity recordings, so expect about half as many neutral files.")

In [None]:
# Emotion by gender and vocal channel
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Gender distribution per emotion
gender_emotion = pd.crosstab(df['emotion'], df['gender'])
gender_emotion.plot(kind='bar', ax=axes[0], color=['#FF6B6B', '#4ECDC4'], edgecolor='black')
axes[0].set_xlabel('Emotion', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Emotion Distribution by Gender', fontsize=14, fontweight='bold')
axes[0].legend(title='Gender', fontsize=10)
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(axis='y', alpha=0.3)

# Vocal channel distribution per emotion
# (stored in a new column so the parsed 'modality' code is not overwritten)
df['vocal_channel_label'] = df['vocal_channel'].map({'01': 'speech', '02': 'song'})
vocal_emotion = pd.crosstab(df['emotion'], df['vocal_channel_label'])
vocal_emotion.plot(kind='bar', ax=axes[1], color=['#95E1D3', '#F38181'], edgecolor='black')
axes[1].set_xlabel('Emotion', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Emotion Distribution by Vocal Channel (Speech vs Song)', fontsize=14, fontweight='bold')
axes[1].legend(title='Vocal channel', fontsize=10)
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n**Interpretation:**")
print("Gender should be close to balanced within each emotion (12 male and 12 female actors).")
print("Song recordings, if present, cover only six emotions: disgust and surprised are speech-only.")

## 5. Audio Feature Extraction

Extract features from every audio file:
- **MFCCs** (Mel-frequency cepstral coefficients): Capture timbral characteristics
- **Spectral features:** Centroid, bandwidth, rolloff
- **Prosodic features:** Pitch (F0), energy, zero-crossing rate
- **Temporal features:** Duration
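
To make two of these features concrete, the sketch below computes RMS energy and zero-crossing rate by hand for a synthetic sine tone. It is a whole-signal simplification in plain Python; `librosa` computes both per frame, so its numbers will not match exactly. The 220 Hz tone and the helper names are illustrative assumptions.

```python
import math

SR = 22050          # sample rate used elsewhere in this notebook
FREQ = 220.0        # illustrative tone frequency in Hz
DURATION = 1.0      # seconds

# Synthesize a unit-amplitude sine wave.
signal = [math.sin(2 * math.pi * FREQ * n / SR) for n in range(int(SR * DURATION))]


def rms(samples):
    """Root-mean-square amplitude over the whole signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


energy = rms(signal)
zcr = zero_crossing_rate(signal)

# A unit sine has an RMS of about 0.707 and crosses zero twice per cycle,
# so zcr * SR / 2 recovers roughly the tone frequency.
print(f"RMS energy: {energy:.3f}")
print(f"Zero-crossing rate: {zcr:.5f} (~{zcr * SR / 2:.0f} Hz)")
```

For a pure tone the zero-crossing rate tracks frequency; for speech it rises in noisy, unvoiced segments, which is why it is grouped with the prosodic features here.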

In [None]:
def extract_features(filepath, sr=22050):
    """Extract audio features from a single file."""
    try:
        # Load audio
        y, sr = librosa.load(filepath, sr=sr)
        
        # Duration
        duration = librosa.get_duration(y=y, sr=sr)
        
        # MFCCs (13 coefficients)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfccs_mean = np.mean(mfccs, axis=1)
        mfccs_std = np.std(mfccs, axis=1)
        
        # Spectral features
        spectral_centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
        spectral_bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr))
        spectral_rolloff = np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr))
        
        # Zero crossing rate
        zcr = np.mean(librosa.feature.zero_crossing_rate(y))
        
        # Energy (RMS)
        rms = np.mean(librosa.feature.rms(y=y))
        
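        # Pitch (F0) - using piptrack; keep only the strongest candidate per frame
        # so the mean is not dominated by weak high-frequency candidates
        pitches, magnitudes = librosa.piptrack(y=y, sr=sr)
        frame_pitch = pitches[magnitudes.argmax(axis=0), np.arange(pitches.shape[1])]
        pitch_mean = np.mean(frame_pitch[frame_pitch > 0]) if np.any(frame_pitch > 0) else 0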
        
        features = {
            'duration': duration,
            'spectral_centroid': spectral_centroid,
            'spectral_bandwidth': spectral_bandwidth,
            'spectral_rolloff': spectral_rolloff,
            'zcr': zcr,
            'rms_energy': rms,
            'pitch_mean': pitch_mean
        }
        
        # Add MFCC means and stds
        for i, (mean, std) in enumerate(zip(mfccs_mean, mfccs_std), 1):
            features[f'mfcc{i}_mean'] = mean
            features[f'mfcc{i}_std'] = std
        
        return features
    
    except Exception as e:
        print(f"Error processing {filepath}: {e}")
        return None

# Extract features from all files (this may take a while)
print("Extracting features from audio files...")
print("This may take several minutes depending on dataset size.\n")

features_list = []
for idx, row in df.iterrows():
    if idx % 100 == 0:
        print(f"Processing file {idx+1}/{len(df)}...")
    
    features = extract_features(row['filepath'])
    if features:
        features['emotion'] = row['emotion']
        features['gender'] = row['gender']
        features['actor'] = row['actor']
        features_list.append(features)

# Create features dataframe
features_df = pd.DataFrame(features_list)
print(f"\nFeature extraction complete!")
print(f"Extracted features from {len(features_df)} files.")
print(f"\nFeature columns: {list(features_df.columns)}")
features_df.head()

## 6. Feature Distribution Analysis

In [None]:
# Statistical summary of features
numeric_features = features_df.select_dtypes(include=[np.number])
print("=== FEATURE STATISTICS ===")
numeric_features.describe()

In [None]:
# Distribution of key prosodic features by emotion
prosodic_features = ['pitch_mean', 'rms_energy', 'spectral_centroid', 'zcr']

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for idx, feature in enumerate(prosodic_features):
    # Box plot
    features_df.boxplot(column=feature, by='emotion', ax=axes[idx], 
                        patch_artist=True, grid=False)
    axes[idx].set_xlabel('Emotion', fontsize=11)
    axes[idx].set_ylabel(feature.replace('_', ' ').title(), fontsize=11)
    axes[idx].set_title(f'{feature.replace("_", " ").title()} Distribution by Emotion', 
                        fontsize=12, fontweight='bold')
    # Rotate and right-align the x tick labels
    plt.setp(axes[idx].get_xticklabels(), rotation=45, ha='right')

plt.suptitle('')  # Remove default title
plt.tight_layout()
plt.show()

print("\n**Interpretation:**")
print("- **Pitch:** High-arousal emotions (angry, happy, surprised) show higher pitch.")
print("- **Energy (RMS):** Angry and surprised emotions have higher energy.")
print("- **Spectral Centroid:** Indicates brightness; higher for intense emotions.")
print("- **Zero Crossing Rate:** Higher for unvoiced/noisy sounds (anger, fear).")

In [None]:
# Violin plots for detailed distribution
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for idx, feature in enumerate(prosodic_features):
    sns.violinplot(data=features_df, x='emotion', y=feature, ax=axes[idx], 
                   palette='Set2', inner='quartile')
    axes[idx].set_xlabel('Emotion', fontsize=11)
    axes[idx].set_ylabel(feature.replace('_', ' ').title(), fontsize=11)
    axes[idx].set_title(f'{feature.replace("_", " ").title()} - Violin Plot', 
                        fontsize=12, fontweight='bold')
    axes[idx].tick_params(axis='x', rotation=45)
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n**Interpretation:**")
print("Violin plots reveal distribution shape and density.")
print("Some emotions show bimodal distributions (e.g., different intensities).")
print("Clear separation between calm/sad (low arousal) vs angry/happy (high arousal).")

## 7. MFCC Analysis

In [None]:
# Average MFCC values across emotions
mfcc_cols = [col for col in features_df.columns if 'mfcc' in col and 'mean' in col]
mfcc_by_emotion = features_df.groupby('emotion')[mfcc_cols].mean()

plt.figure(figsize=(14, 6))
sns.heatmap(mfcc_by_emotion.T, cmap='coolwarm', annot=False, cbar_kws={'label': 'MFCC Value'})
plt.xlabel('Emotion', fontsize=12)
plt.ylabel('MFCC Coefficient', fontsize=12)
plt.title('Average MFCC Coefficients by Emotion', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n**Interpretation:**")
print("MFCCs capture timbral texture of speech.")
print("Different emotions show distinct MFCC patterns, especially in lower coefficients.")
print("This suggests MFCCs will be useful features for classification.")

## 8. Correlation Analysis

In [None]:
# Correlation between prosodic and spectral features
# (bandwidth and rolloff are included so the spectral-feature correlations are visible)
corr_features = prosodic_features + ['spectral_bandwidth', 'spectral_rolloff', 'duration']
prosodic_df = features_df[corr_features]
correlation_matrix = prosodic_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            square=True, linewidths=1, cbar_kws={'label': 'Correlation'})
plt.title('Correlation Matrix - Prosodic and Spectral Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n**Interpretation:**")
print("Strong correlations indicate feature redundancy.")
print("Spectral features (centroid, bandwidth, rolloff) are highly correlated.")
print("May need dimensionality reduction (PCA) or feature selection.")
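
One simple way to act on redundant features is to drop one member of each highly correlated pair. The sketch below implements that greedy filter in plain Python on made-up numbers. The 0.9 threshold, the toy values, and the helper names are assumptions for illustration, not results from this dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep a feature only if |r| with every kept feature is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy values (not real measurements): rolloff is roughly a scaled centroid.
toy = {
    "spectral_centroid": [1500.0, 1800.0, 2100.0, 2600.0, 3000.0],
    "spectral_rolloff": [3100.0, 3650.0, 4300.0, 5150.0, 6050.0],
    "rms_energy": [0.05, 0.02, 0.06, 0.03, 0.04],
}

selected = drop_correlated(toy)
print(selected)
```

The result depends on column order, since the first feature of a correlated pair is the one kept. PCA avoids that arbitrary choice at the cost of less interpretable features.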

## 9. Spectrogram Visualization

Visualize spectrograms for sample files from each emotion to understand frequency patterns.

In [None]:
# Sample one file per emotion
sample_files = df.groupby('emotion').first().reset_index()

fig, axes = plt.subplots(4, 2, figsize=(16, 16))
axes = axes.flatten()

for idx, (_, row) in enumerate(sample_files.iterrows()):
    y, sr = librosa.load(row['filepath'], sr=22050)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    
    img = librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='hz', ax=axes[idx], cmap='viridis')
    axes[idx].set_title(f"Emotion: {row['emotion'].upper()}", fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Time (s)', fontsize=10)
    axes[idx].set_ylabel('Frequency (Hz)', fontsize=10)
    fig.colorbar(img, ax=axes[idx], format='%+2.0f dB')

plt.suptitle('Spectrogram Examples by Emotion', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

print("\n**Interpretation:**")
print("Spectrograms reveal time-frequency patterns unique to each emotion.")
print("High-arousal emotions show more energy in higher frequencies.")
print("Calm/sad emotions have more concentrated energy in lower frequencies.")

## 10. Mel-Spectrogram Visualization

In [None]:
# Mel-spectrograms (perceptually-scaled)
fig, axes = plt.subplots(4, 2, figsize=(16, 16))
axes = axes.flatten()

for idx, (_, row) in enumerate(sample_files.iterrows()):
    y, sr = librosa.load(row['filepath'], sr=22050)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    S_db = librosa.power_to_db(S, ref=np.max)
    
    img = librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel', ax=axes[idx], cmap='magma')
    axes[idx].set_title(f"Emotion: {row['emotion'].upper()}", fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Time (s)', fontsize=10)
    axes[idx].set_ylabel('Mel Frequency', fontsize=10)
    fig.colorbar(img, ax=axes[idx], format='%+2.0f dB')

plt.suptitle('Mel-Spectrogram Examples by Emotion', fontsize=16, fontweight='bold', y=1.0)
plt.tight_layout()
plt.show()

print("\n**Interpretation:**")
print("Mel-spectrograms better align with human auditory perception.")
print("These will be used as input for CNN/LSTM models.")
print("Clear visual differences suggest deep learning models can learn discriminative features.")
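
The perceptual scaling can be illustrated with the commonly used HTK-style mel formula, m = 2595 · log10(1 + f / 700). This is a sketch of the idea only: `librosa` uses a different (Slaney-style) mel definition by default, so these values will not match its mel axis exactly. The helper name `hz_to_mel_htk` is hypothetical.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK-style mel conversion: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# Equal 1000 Hz steps cover fewer and fewer mels as frequency rises,
# which is why a mel axis devotes more resolution to low frequencies.
for low in (0, 1000, 4000, 8000):
    width = hz_to_mel_htk(low + 1000) - hz_to_mel_htk(low)
    print(f"{low:>5}-{low + 1000:<5} Hz spans {width:7.1f} mel")
```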

## 11. Data Challenges and Observations

### Key Findings:
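1. **Mostly Balanced Dataset:** Seven of the eight emotions are equally represented in the speech recordings; neutral has half as many files because it has no strong-intensity version.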
2. **Feature Separability:** Some emotions show clear separation in prosodic features (e.g., angry vs calm).
3. **Feature Correlation:** High correlation among spectral features may require dimensionality reduction.

### Challenges:
1. **Limited Dataset Size:** The speech portion has only 1,440 files, which may lead to overfitting with complex models.
2. **Speaker Variability:** Only 24 actors; may not generalize to new speakers.
3. **Emotion Overlap:** Some emotions (calm vs neutral, fear vs sadness) have overlapping acoustic features.
4. **Intensity Levels:** Normal vs strong intensity adds complexity.
5. **Vocal Channel Differences:** Speech vs song may require separate models or careful preprocessing.

### Preprocessing Recommendations:
1. **Normalization:** Standardize features (z-score normalization).
2. **Augmentation:** Apply pitch shifting, time stretching, noise addition to increase dataset size.
3. **Feature Selection:** Use PCA or feature importance to reduce dimensionality.
4. **Stratified Splitting:** Ensure train/val/test splits maintain emotion and gender balance.
5. **Speaker Independence:** Consider leave-one-speaker-out validation for generalization testing.
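
Recommendations 1 and 5 interact: normalization statistics should come from the training speakers only, otherwise information about the held-out speaker leaks into the features. The sketch below shows leave-one-speaker-out splitting with train-only z-scoring in plain Python. The toy records and helper names are hypothetical, and a real pipeline would more likely use a library splitter.

```python
import statistics


def leave_one_speaker_out(records):
    """Yield (held_out_actor, train, test) with each actor held out once."""
    actors = sorted({r["actor"] for r in records})
    for held_out in actors:
        train = [r for r in records if r["actor"] != held_out]
        test = [r for r in records if r["actor"] == held_out]
        yield held_out, train, test


def zscore_with_train_stats(train, test, key):
    """Standardize `key` in both sets using mean/std estimated on train only."""
    values = [r[key] for r in train]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero variance

    def scale(rows):
        return [(r[key] - mean) / std for r in rows]

    return scale(train), scale(test)


# Toy records (made-up pitch values) for three hypothetical actors.
records = [
    {"actor": "01", "pitch_mean": 120.0}, {"actor": "01", "pitch_mean": 140.0},
    {"actor": "02", "pitch_mean": 210.0}, {"actor": "02", "pitch_mean": 230.0},
    {"actor": "03", "pitch_mean": 125.0}, {"actor": "03", "pitch_mean": 135.0},
]

for actor, train, test in leave_one_speaker_out(records):
    train_z, test_z = zscore_with_train_stats(train, test, "pitch_mean")
    print(f"held out actor {actor}: train={len(train)}, test={len(test)}, "
          f"test z-scores={[round(z, 2) for z in test_z]}")
```

Note that a purely stratified random split would place the same actor in both train and test, so stratification is best applied at the speaker level.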

## 12. Save Processed Features

In [None]:
# Save features for later use in modeling
output_path = '../data/processed_features.csv'
features_df.to_csv(output_path, index=False)
print(f"Features saved to {output_path}")
print(f"Total features: {len(features_df.columns)-3} (excluding emotion, gender, actor)")
print(f"Total samples: {len(features_df)}")

## Summary

This EDA notebook explored the RAVDESS dataset and extracted key insights:

**Dataset:**
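- 1,440 speech audio files (24 actors × 60 recordings each: 7 emotions × 2 intensities × 2 statements × 2 repetitions, plus neutral × 2 statements × 2 repetitions)
- Balanced across gender and across all emotions except neutral, which has half as many files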

**Key Features:**
- Prosodic: Pitch, energy, zero-crossing rate
- Spectral: Centroid, bandwidth, rolloff
- Timbral: MFCCs (13 coefficients)

**Insights:**
- High-arousal emotions (angry, happy, surprised) have higher pitch and energy
- Low-arousal emotions (calm, sad) show lower values for spectral features such as the spectral centroid
- MFCCs and spectrograms show discriminative patterns

**Next Steps:**
1. Data preprocessing and augmentation
2. Train/validation/test split (stratified)
3. Baseline model (SVM/Logistic Regression on handcrafted features)
4. Deep learning models (CNN/LSTM on spectrograms)
5. Transfer learning (wav2vec 2.0 fine-tuning)