# Speech Emotion Recognition with Classical ML

Humans communicate emotion not just through *what* we say, but through *how* we say it. The same sentence - "I'm fine" - can convey happiness, sadness, anger, or sarcasm depending on tone, pitch, and rhythm.

**Speech Emotion Recognition (SER)** is the task of automatically detecting emotion from speech audio. In this notebook, we'll build SER systems using classical machine learning - no neural networks required!

### What you'll learn:

1. **Downloading datasets** from Kaggle (RAVDESS, TESS, SAVEE, CREMA-D)
2. **Exploring** audio datasets with pandas
3. **Listening** to and **visualizing** emotional speech
4. **Data augmentation** - creating variations of audio to improve models
5. **Feature extraction** - turning audio into numbers that ML models can understand
6. **Classification** with scikit-learn and XGBoost (K-Nearest Neighbors, Logistic Regression, SVM, Random Forest, XGBoost)
7. **Evaluating** model performance with confusion matrices and classification reports
8. **Combining datasets** for better, more diverse training data
9. **Real-time** emotion recognition from your microphone

### Why classical ML?

Before diving into deep learning, it's important to understand the fundamentals. Classical ML models are:
- **Fast** to train (seconds, not hours)
- **Interpretable** (you can understand what they're doing)
- **Great baselines** to compare against more complex approaches

### Background Reading

- [The 7 Basic Emotions](https://www.humintell.com/2010/06/the-seven-basic-emotions-do-you-know-them/) - Paul Ekman's foundational research
- [On the Praxes and Politics of AI Speech Emotion Recognition](https://dl.acm.org/doi/10.1145/3593013.3594011) - Edward B. Kang (FAccT 2023)
- [EU AI Act restrictions on emotion recognition](https://ai-act-law.eu/recital/18/) - Often cited as a reason OpenAI's Advanced Voice Mode was initially unavailable in the EU

---

# Part 1: Setup and Installation

Before we begin, we need to install a few Python libraries. Install them with the command shown in the cell below, then run the cell to check that everything imports correctly.

**What are these libraries?**

| Library | What it does |
|---------|-------------|
| `librosa` | Audio analysis and feature extraction |
| `soundfile` | Reading and writing audio files |
| `sounddevice` | Recording audio from your microphone |
| `pandas` | Working with tabular data (like spreadsheets in Python) |
| `matplotlib` / `seaborn` | Creating charts and visualizations |
| `scikit-learn` | Machine learning models and tools |
| `xgboost` | Gradient boosting classifier (a powerful ML algorithm) |
| `kaggle` | Downloading datasets from Kaggle |
| `opendatasets` | Alternative way to download Kaggle datasets |
| `tqdm` | Progress bars for long loops (used later during feature extraction) |

In [None]:
# Install the required libraries first by running this in a terminal:
#   uv pip install -r requirements.txt
# Then run this cell to test that everything is installed correctly.
# If any of these imports fail, recreate your environment.

import librosa
import soundfile as sf
import sounddevice as sd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import xgboost

print("All libraries imported successfully!")
print(f"librosa version: {librosa.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"xgboost version: {xgboost.__version__}")
print(f"numpy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")

# Part 2: Downloading the RAVDESS Dataset from Kaggle

We'll start with the **RAVDESS** (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset.

### About RAVDESS
- **24 professional actors** (12 female, 12 male)
- **8 emotions**: neutral, calm, happy, sad, angry, fearful, disgust, surprised
- **1,440 audio files** total
- Each actor speaks two sentences with different emotions and intensities

### Setting up Kaggle

To download datasets from Kaggle, you need an API key:

1. Go to [kaggle.com](https://www.kaggle.com) and create an account (if you don't have one)
2. Go to your profile → Settings → API → "Create New Token"
3. This downloads a `kaggle.json` file
4. Place it in `~/.kaggle/kaggle.json` (Mac/Linux) or `C:\Users\<username>\.kaggle\kaggle.json` (Windows)

Alternatively, you can use the `opendatasets` library which will prompt you for your credentials.
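
If the download fails with an authentication error, it can help to check the credentials file before anything else. The sketch below is a hypothetical helper (`check_kaggle_credentials` is not part of the Kaggle tooling) that assumes the usual `kaggle.json` layout with `username` and `key` fields. It only inspects the file and never contacts Kaggle.

```python
import json
import os
import stat
from pathlib import Path

def check_kaggle_credentials(path=None):
    """Return a short status string describing the kaggle.json file at `path`."""
    # Default location described in the setup steps above
    path = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return f"missing: {path}"
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(creds, dict) or not {"username", "key"} <= creds.keys():
        return "missing 'username' or 'key' field"
    # On Mac/Linux the Kaggle CLI warns when other users can read the key file
    if os.name == "posix" and path.stat().st_mode & (stat.S_IRGRP | stat.S_IROTH):
        return "ok, but readable by other users (consider chmod 600)"
    return "ok"

print(check_kaggle_credentials())
```

A result of `ok` only means the file looks well-formed. Whether the key itself is still valid is something only Kaggle can confirm.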

**Attribution**: "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)" by Livingstone & Russo is licensed under CC BY-NC-SA 4.0.

In [None]:
# Option 1: Download using the kaggle CLI (requires kaggle.json to be set up)
# Uncomment the line below if you have kaggle.json configured

# !kaggle datasets download -d uwrfkaggler/ravdess-emotional-speech-audio -p datasets/

In [None]:
# Option 2: Download using opendatasets (will prompt for your Kaggle username and key)
# This is often easier for first-time setup

import opendatasets as od

od.download(
    'https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio',
    data_dir='datasets/'
)

In [None]:
# If you downloaded as a zip file, unzip it here

import zipfile
import os

zip_path = 'datasets/ravdess-emotional-speech-audio.zip'

if os.path.exists(zip_path):
    with zipfile.ZipFile(zip_path, 'r') as zf:
        zf.extractall('datasets/ravdess-emotional-speech-audio')
    print("Unzipped successfully!")
else:
    print("No zip file found - if you used opendatasets, the files are already extracted.")
    print("Check the datasets/ folder to see what's there.")

In [None]:
# Let's see what we downloaded!
# The RAVDESS dataset is organized into folders by actor

import os
import glob

# Adjust this path based on how your download was structured
# Try both common paths
ravdess_paths = [
    'datasets/ravdess-emotional-speech-audio/audio_speech_actors_01-24/',
    'datasets/ravdess-emotional-speech-audio/',
]

ravdess_dir = None
for path in ravdess_paths:
    if os.path.exists(path):
        ravdess_dir = path
        break

if ravdess_dir is None:
    print("Could not find RAVDESS dataset. Please check the datasets/ folder.")
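    if os.path.isdir('datasets/'):
        print("Contents of datasets/:")
        for item in os.listdir('datasets/'):
            print(f"  {item}")
    else:
        print("The datasets/ folder does not exist yet - run one of the download cells first.")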
else:
    # Count all wav files
    all_files = glob.glob(os.path.join(ravdess_dir, '**/*.wav'), recursive=True)
    print(f"Found {len(all_files)} audio files in {ravdess_dir}")
    print(f"\nFirst 5 files:")
    for f in all_files[:5]:
        print(f"  {f}")

# Part 3: Exploring the Dataset with Pandas

**Pandas** is a Python library for working with tabular data - think of it like a programmable spreadsheet.

We'll create a **DataFrame** (a table) from our audio files. Each row will represent one audio file, with columns for the file path and its metadata.

### RAVDESS Filename Convention

Each RAVDESS file has a 7-part numerical identifier, e.g. `03-01-06-01-02-01-12.wav`:

| Position | Meaning | Values |
|----------|---------|--------|
| [0] | Modality | 01=full-AV, 02=video-only, 03=audio-only |
| [1] | Vocal channel | 01=speech, 02=song |
| [2] | **Emotion** | 01=neutral, 02=calm, 03=happy, 04=sad, 05=angry, 06=fearful, 07=disgust, 08=surprised |
| [3] | Emotional intensity | 01=normal, 02=strong |
| [4] | Statement | 01="Kids are talking by the door", 02="Dogs are sitting by the door" |
| [5] | Repetition | 01=1st, 02=2nd |
| [6] | Actor | 01-24 (odd=male, even=female) |

So `03-01-06-01-02-01-12.wav` means: Audio-only, Speech, Fearful, Normal intensity, "Dogs" statement, 1st repetition, Actor 12 (Female).
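
The decoding above can be sketched as a small standalone function. This is a defensive variant for illustration: `parse_ravdess_name` is a hypothetical helper name, and returning `None` for names that do not fit the 7-part pattern is an assumption about how you might want to handle stray files.

```python
from pathlib import Path

EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_name(file_name):
    """Decode a RAVDESS file name into a dict, or return None if it does not fit the pattern."""
    parts = Path(file_name).stem.split("-")
    # A valid name has exactly seven numeric fields
    if len(parts) != 7 or not all(p.isdigit() for p in parts):
        return None
    emotion = EMOTIONS.get(parts[2])
    if emotion is None:
        return None
    actor = int(parts[6])
    return {
        "emotion": emotion,
        "intensity": "normal" if parts[3] == "01" else "strong",
        "statement": "Kids" if parts[4] == "01" else "Dogs",
        "repetition": int(parts[5]),
        "actor": actor,
        "gender": "male" if actor % 2 == 1 else "female",  # odd = male, even = female
    }

print(parse_ravdess_name("03-01-06-01-02-01-12.wav"))
print(parse_ravdess_name("README.txt"))
```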

In [None]:
import os
import glob
import pandas as pd

# Emotion mapping - maps the code number to a human-readable label
emotion_map = {
    '01': 'neutral',
    '02': 'calm',
    '03': 'happy',
    '04': 'sad',
    '05': 'angry',
    '06': 'fearful',
    '07': 'disgust',
    '08': 'surprised'
}

# Build a list of dictionaries - one for each file
data_rows = []

all_files = glob.glob(os.path.join(ravdess_dir, '**/*.wav'), recursive=True)

for file_path in all_files:
    # Get the filename without extension and split by '-'
    filename = os.path.basename(file_path)  # e.g., '03-01-06-01-02-01-12.wav'
    parts = filename.split('.')[0].split('-')  # ['03', '01', '06', '01', '02', '01', '12']
    
    # Extract info from filename
    actor_id = int(parts[6])
    
    data_rows.append({
        'file_path': file_path,
        'emotion_code': parts[2],
        'emotion': emotion_map[parts[2]],
        'intensity': 'normal' if parts[3] == '01' else 'strong',
        'statement': 'Kids are talking by the door' if parts[4] == '01' else 'Dogs are sitting by the door',
        'repetition': int(parts[5]),
        'actor': actor_id,
        'gender': 'male' if actor_id % 2 == 1 else 'female',
        'dataset': 'RAVDESS'
    })

# Create a pandas DataFrame
ravdess_df = pd.DataFrame(data_rows)

print(f"Created DataFrame with {len(ravdess_df)} rows and {len(ravdess_df.columns)} columns")
print(f"\nColumns: {list(ravdess_df.columns)}")
print(f"\nEmotions: {ravdess_df['emotion'].unique()}")

# Show the first few rows
ravdess_df.head(10)

In [None]:
# Let's look at the distribution of our data
# How many samples do we have per emotion? Per gender?

print("=" * 40)
print("Samples per emotion:")
print("=" * 40)
print(ravdess_df['emotion'].value_counts())

print(f"\n{'=' * 40}")
print("Samples per gender:")
print("=" * 40)
print(ravdess_df['gender'].value_counts())

print(f"\n{'=' * 40}")
print("Samples per intensity:")
print("=" * 40)
print(ravdess_df['intensity'].value_counts())

# Part 4: Listening to the Audio Files

Before building any models, we should **listen** to our data! This helps us develop intuition about:
- How different emotions sound
- How clear the emotional expression is
- Whether the dataset quality is good

We'll use `IPython.display.Audio` to create playable audio widgets right in the notebook.

In [None]:
# Listen to one random sample from each emotion

from IPython.display import Audio, display

print("Listening to one sample from each emotion:\n")

for emotion in sorted(ravdess_df['emotion'].unique()):
    # Get one random sample of this emotion
    sample = ravdess_df[ravdess_df['emotion'] == emotion].sample(1).iloc[0]
    
    print(f"Emotion: {emotion.upper()} | Actor: {sample['actor']} | Gender: {sample['gender']} | Intensity: {sample['intensity']}")
    print(f'  Saying: "{sample["statement"]}"')
    display(Audio(sample['file_path']))
    print()

In [None]:
# Helper function to play a random recording with all its info

def play_random_audio(df):
    """Play a random audio file from the DataFrame and display its info."""
    sample = df.sample(1).iloc[0]
    
    print(f"Emotion:   {sample['emotion']}")
    print(f"Gender:    {sample['gender']}")
    print(f"Actor:     {sample['actor']}")
    if 'intensity' in sample:
        print(f"Intensity: {sample['intensity']}")
    if 'statement' in sample:
        print(f"Statement: {sample['statement']}")
    print(f"Dataset:   {sample['dataset']}")
    print(f"File:      {sample['file_path']}")
    display(Audio(sample['file_path']))

# Try it out! Run this cell multiple times to hear different samples
play_random_audio(ravdess_df)

# Part 5: Visualizing the Data

Visualization helps us understand patterns in our data before building models. We'll look at:

1. **Bar charts** - How many samples per emotion, gender, etc.
2. **Waveforms** - The raw audio signal over time
3. **Spectrograms** - A visual representation of frequencies over time
4. **Mel spectrograms** - Spectrograms on a perceptual scale (how humans hear)

In [None]:
# Bar chart: number of samples per emotion

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Emotion distribution
emotion_counts = ravdess_df['emotion'].value_counts()
sns.barplot(x=emotion_counts.index, y=emotion_counts.values, ax=axes[0], palette='viridis')
axes[0].set_title('Samples per Emotion (RAVDESS)')
axes[0].set_xlabel('Emotion')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Gender distribution per emotion
gender_emotion = ravdess_df.groupby(['emotion', 'gender']).size().unstack(fill_value=0)
gender_emotion.plot(kind='bar', ax=axes[1], color=['#2196F3', '#E91E63'])
axes[1].set_title('Samples per Emotion by Gender')
axes[1].set_xlabel('Emotion')
axes[1].set_ylabel('Count')
axes[1].legend(title='Gender')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\nNotice: 'neutral' has fewer samples because it only has 'normal' intensity (no 'strong' version).")

## Waveforms and Spectrograms

**Waveform** (also called a "wave plot"): Shows the raw audio signal - amplitude (loudness) over time. Think of it as what the air pressure looks like as sound waves hit a microphone.

**Spectrogram**: Shows which **frequencies** are present at each point in time. It's like a heat map where:
- X-axis = Time
- Y-axis = Frequency (pitch)
- Color = How loud that frequency is
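
To make the idea concrete, here is a minimal NumPy-only sketch of how a spectrogram can be computed. It uses a synthetic 440 Hz tone rather than a dataset file, and the sample rate, window size, and hop size are arbitrary choices for the demo. librosa does the same kind of computation with more care (padding, scaling), so the numbers will not match its output exactly.

```python
import numpy as np

sr = 8000                              # sample rate in Hz (arbitrary for this demo)
t = np.arange(sr) / sr                 # one second of time stamps
signal = np.sin(2 * np.pi * 440 * t)   # a pure 440 Hz tone

n_fft, hop = 512, 128                  # window length and step between windows
n_frames = 1 + (len(signal) - n_fft) // hop

# Slice the signal into overlapping windows, one row per window
idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
frames = signal[idx] * np.hanning(n_fft)

# Magnitude spectrum of each window; transpose so rows = frequency, columns = time
spectrogram = np.abs(np.fft.rfft(frames, axis=1)).T
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)

# The loudest frequency bin, averaged over time, should sit next to 440 Hz
peak_hz = freqs[spectrogram.mean(axis=1).argmax()]
print("spectrogram shape (frequency bins, time frames):", spectrogram.shape)
print("loudest frequency bin (Hz):", peak_hz)
```

The peak lands on the bin closest to 440 Hz rather than on 440 Hz exactly, because the frequency axis is quantized in steps of `sr / n_fft`.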

**Mel Spectrogram**: A spectrogram where the frequency axis is scaled to match how humans perceive pitch. Lower frequencies get more detail (we're more sensitive to differences there).
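
A quick way to see the "more detail at low frequencies" effect is to convert a few frequencies to mels. The sketch below uses the HTK-style formula, which is one common variant. librosa's default uses a slightly different (Slaney-style) definition, so treat the exact numbers as illustrative.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# The same 100 Hz step covers far more of the mel axis at low frequencies
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(8100) - hz_to_mel(8000)

print(f"100 -> 200 Hz spans {low_step:.1f} mel")
print(f"8000 -> 8100 Hz spans {high_step:.1f} mel")
```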

Let's see how different emotions look!

In [None]:
# Visualize waveforms for different emotions

import librosa
import librosa.display
import numpy as np

def plot_waveform(file_path, emotion, ax):
    """Plot the waveform for an audio file."""
    y, sr = librosa.load(file_path, sr=None)
    librosa.display.waveshow(y, sr=sr, ax=ax)
    ax.set_title(f'{emotion.upper()}')
    ax.set_xlabel('Time (s)')
    ax.set_ylabel('Amplitude')

# Pick one sample from each emotion
emotions = sorted(ravdess_df['emotion'].unique())
fig, axes = plt.subplots(2, 4, figsize=(16, 6))
axes = axes.flatten()

for i, emotion in enumerate(emotions):
    sample = ravdess_df[ravdess_df['emotion'] == emotion].sample(1).iloc[0]
    plot_waveform(sample['file_path'], emotion, axes[i])

plt.suptitle('Waveforms for Each Emotion', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Can you see visual differences between the emotions?")
print("Angry speech tends to have higher amplitude (louder), while sad speech is often quieter.")

In [None]:
# Visualize mel spectrograms for different emotions

def plot_mel_spectrogram(file_path, emotion, ax):
    """Plot a mel spectrogram for an audio file."""
    y, sr = librosa.load(file_path, sr=None)
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    librosa.display.specshow(mel_spec_db, sr=sr, x_axis='time', y_axis='mel', ax=ax)
    ax.set_title(f'{emotion.upper()}')

fig, axes = plt.subplots(2, 4, figsize=(16, 6))
axes = axes.flatten()

for i, emotion in enumerate(emotions):
    sample = ravdess_df[ravdess_df['emotion'] == emotion].sample(1).iloc[0]
    plot_mel_spectrogram(sample['file_path'], emotion, axes[i])

plt.suptitle('Mel Spectrograms for Each Emotion', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Spectrograms show the 'fingerprint' of each emotion.")
print("Notice how angry speech often has more energy across all frequencies (brighter colors).")

In [None]:
# Detailed view: waveform + spectrogram + audio player for a single recording

def visualize_audio(file_path, title="Audio"):
    """Show waveform, mel spectrogram, and audio player for a file."""
    y, sr = librosa.load(file_path, sr=None)
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))
    
    # Waveform
    librosa.display.waveshow(y, sr=sr, ax=axes[0])
    axes[0].set_title(f'Waveform - {title}')
    axes[0].set_xlabel('Time (s)')
    axes[0].set_ylabel('Amplitude')
    
    # Mel Spectrogram
    mel_spec = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    img = librosa.display.specshow(mel_spec_db, sr=sr, x_axis='time', y_axis='mel', ax=axes[1])
    axes[1].set_title(f'Mel Spectrogram - {title}')
    fig.colorbar(img, ax=axes[1], format='%+2.0f dB')
    
    plt.tight_layout()
    plt.show()
    
    display(Audio(file_path))

# Visualize a random sample
sample = ravdess_df.sample(1).iloc[0]
print(f"Statement: \"{sample['statement']}\"")
visualize_audio(sample['file_path'], f"{sample['emotion']} ({sample['gender']}, {sample['intensity']})")

# Part 6: Data Augmentation

**Data augmentation** means creating new training examples by slightly modifying existing ones. This is important because:

1. **More data = better models** (usually)
2. **Variety helps generalization** - the model learns the *concept* of an emotion, not just specific recordings
3. **Simulates real-world conditions** - noise, different speaking speeds, etc.

### Common Audio Augmentations

| Augmentation | What it does | Why it helps |
|-------------|-------------|-------------|
| **Noise** | Adds random background noise | Simulates noisy environments |
| **Time Stretch** | Speeds up or slows down | Simulates different speaking paces |
| **Pitch Shift** | Raises or lowers pitch | Simulates voice variation |
| **Time Shift** | Shifts audio left/right | Simulates different recording starts |

We won't augment the entire dataset right now (that would take a while), but we'll use augmentation during feature extraction to create more training samples on-the-fly.
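
One design choice worth knowing about: the `add_noise` helper in the next cell uses a fixed `noise_factor`, so a quiet recording ends up proportionally noisier than a loud one. A possible alternative, sketched here under the assumption that you want to control the signal-to-noise ratio (SNR) directly, scales the noise relative to the signal's own power. `add_noise_at_snr` is a hypothetical name, not something the rest of the notebook uses.

```python
import numpy as np

def add_noise_at_snr(data, snr_db, rng=None):
    """Add white noise so that the result has the requested SNR in decibels."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(data))
    signal_power = np.mean(data ** 2)
    noise_power = np.mean(noise ** 2)
    # Power the noise must have for the requested signal-to-noise ratio
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise *= np.sqrt(target_noise_power / noise_power)
    return data + noise

# Demo on a synthetic tone instead of a dataset file
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = 0.1 * np.sin(2 * np.pi * 220 * t)
noisy = add_noise_at_snr(clean, snr_db=20, rng=rng)

measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(f"measured SNR: {measured:.2f} dB")
```

Lower `snr_db` values mean more noise. Around 20 dB the noise is audible but the speech remains clear.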

In [None]:
# Data augmentation functions

import numpy as np
import librosa

def add_noise(data, noise_factor=0.005):
    """Add random background noise to audio.
    
    noise_factor controls how much noise to add (higher = more noise).
    Think of it like adding static to a radio signal.
    """
    noise = np.random.randn(len(data))
    augmented_data = data + noise_factor * noise
    return augmented_data

def time_stretch(data, rate=1.0):
    """Speed up or slow down audio without changing pitch.
    
    rate > 1.0 = faster
    rate < 1.0 = slower
    """
    return librosa.effects.time_stretch(data, rate=rate)

def pitch_shift(data, sr, n_steps=0):
    """Shift pitch up or down by n_steps semitones.
    
    n_steps > 0 = higher pitch
    n_steps < 0 = lower pitch
    A semitone is the smallest interval in Western music (one piano key).
    """
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=n_steps)

def time_shift(data, shift_max=0.2):
    """Shift audio left or right in time.
    
    shift_max = maximum fraction of total length to shift.
    Note: np.roll wraps around, so samples pushed past one end reappear at the other.
    """
    shift = int(len(data) * shift_max * np.random.uniform(-1, 1))
    return np.roll(data, shift)

print("Augmentation functions defined!")

In [None]:
# Let's see and hear what each augmentation does!

# Pick a random sample
sample = ravdess_df.sample(1).iloc[0]
y, sr = librosa.load(sample['file_path'], sr=None)

print(f"Original: {sample['emotion']} ({sample['gender']})")

# Create augmented versions
augmentations = {
    'Original': y,
    'Added Noise': add_noise(y, noise_factor=0.01),
    'Time Stretched (faster)': time_stretch(y, rate=1.3),
    'Pitch Shifted (+3 semitones)': pitch_shift(y, sr, n_steps=3),
    'Time Shifted': time_shift(y, shift_max=0.2),
}

fig, axes = plt.subplots(len(augmentations), 1, figsize=(14, 3 * len(augmentations)))

for i, (name, audio) in enumerate(augmentations.items()):
    librosa.display.waveshow(audio, sr=sr, ax=axes[i])
    axes[i].set_title(name)
    axes[i].set_xlabel('Time (s)')
    axes[i].set_ylabel('Amplitude')

plt.tight_layout()
plt.show()

# Play each version
for name, audio in augmentations.items():
    print(f"\n{name}:")
    display(Audio(audio, rate=sr))

# Part 7: Feature Extraction

Machine learning models can't understand raw audio waveforms directly. We need to convert audio into **numerical features** - numbers that describe important characteristics of the sound.

Think of it like this: instead of giving someone a painting, you describe it: "It's mostly blue, has 3 people, is painted in oils, and measures 2x3 feet." That description (the features) is what the ML model works with.

### Audio Features We'll Extract

| Feature | What it measures | Why it matters for emotion |
|---------|-----------------|---------------------------|
| **MFCCs** (Mel-Frequency Cepstral Coefficients) | The "shape" of the sound spectrum | Captures the overall timbre/quality of voice |
| **Chroma** | Which musical pitches are present | Related to the melodic pattern of speech |
| **Zero Crossing Rate** | How often the signal crosses zero | Higher for noisy/percussive sounds (angry speech) |
| **RMS Energy** | Overall loudness | Angry = loud, sad = quiet |
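| **Spectral Centroid** | The "center of mass" of the spectrum | Higher centroid = brighter, sharper sound |
| **Spectral Bandwidth** | How widely energy is spread around the centroid | Describes how narrow or broad the frequency range of the voice is |
| **Spectral Rolloff** | Frequency below which 85% of the energy lies | Summarizes how much high-frequency energy is present |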
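
Two of these features are simple enough to compute by hand, which helps build intuition. The sketch below uses NumPy on synthetic signals (a pure tone and white noise) rather than real speech. Note that librosa computes these values per frame and the notebook averages them afterwards, whereas this simplified version works on the whole signal at once.

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ."""
    return np.mean(np.signbit(x[:-1]) != np.signbit(x[1:]))

def rms(x):
    """Root-mean-square amplitude, a simple loudness measure."""
    return np.sqrt(np.mean(x ** 2))

sr = 8000
t = np.arange(sr) / sr
# 100 Hz tone with amplitude 0.5; the small phase offset avoids exact-zero samples
tone = 0.5 * np.sin(2 * np.pi * 100 * t + 0.1)
noise = np.random.default_rng(0).standard_normal(sr)

zcr_tone, zcr_noise = zero_crossing_rate(tone), zero_crossing_rate(noise)
rms_tone = rms(tone)

print(f"ZCR tone:  {zcr_tone:.4f}")
print(f"ZCR noise: {zcr_noise:.4f}")
print(f"RMS tone:  {rms_tone:.4f}")
```

The tone crosses zero twice per cycle, so its rate is low, while noise flips sign roughly every other sample. The RMS of a sine wave is its amplitude divided by the square root of 2.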

### What are MFCCs?

MFCCs are the most commonly used features in speech processing. They work like this:
1. Break audio into small overlapping windows
2. Compute the frequency spectrum for each window
3. Map to the Mel scale (matches human hearing)
4. Take the logarithm of the Mel energies and apply a discrete cosine transform (DCT) to get a compact representation

The result is typically 13-40 numbers per time window that compactly describe what the voice sounds like. We'll take the **mean** across all windows to get one set of numbers per audio file.
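
The four steps can be sketched in plain NumPy. This is a simplified illustration, not a reimplementation of `librosa.feature.mfcc`: the window size, hop size, number of Mel bands, HTK-style Mel formula, and unnormalized DCT are all assumptions chosen for brevity, so the values will differ from librosa's.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose centres are equally spaced on the Mel scale."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def simple_mfcc(signal, sr, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    # 1. Break audio into small overlapping windows
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # 2. Frequency (power) spectrum of each window
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Map to the Mel scale, then take the logarithm
    log_mel = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    # 4. DCT-II along the Mel axis, keeping the first n_mfcc coefficients
    n = np.arange(n_mels)
    basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_mel @ basis.T            # shape: (n_frames, n_mfcc)

sr = 16000
signal = np.random.default_rng(0).standard_normal(sr)   # one second of synthetic audio
mfcc = simple_mfcc(signal, sr)
file_vector = mfcc.mean(axis=0)         # average over time -> one vector per file
print(mfcc.shape, file_vector.shape)
```

Averaging over the time axis is what lets recordings of different lengths produce feature vectors of the same size.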

In [None]:
# Feature extraction function

import librosa
import numpy as np

def extract_features(data, sr):
    """Extract audio features from a waveform.
    
    Args:
        data: numpy array of audio samples
        sr: sample rate
    
    Returns:
        numpy array of features
    """
    features = []
    
    # 1. MFCCs - 40 coefficients, take mean across time
    # These capture the overall "shape" of the sound
    mfccs = librosa.feature.mfcc(y=data, sr=sr, n_mfcc=40)
    mfccs_mean = np.mean(mfccs, axis=1)  # Average across time -> 40 values
    features.extend(mfccs_mean)
    
    # 2. Chroma features - 12 pitch classes, take mean
    # Related to musical notes present in the speech
    chroma = librosa.feature.chroma_stft(y=data, sr=sr)
    chroma_mean = np.mean(chroma, axis=1)  # 12 values
    features.extend(chroma_mean)
    
    # 3. Zero Crossing Rate - how often the signal crosses zero
    # Higher for noisy/breathy/aggressive sounds
    zcr = librosa.feature.zero_crossing_rate(data)
    zcr_mean = np.mean(zcr)  # 1 value
    features.append(zcr_mean)
    
    # 4. RMS Energy - overall loudness
    rms = librosa.feature.rms(y=data)
    rms_mean = np.mean(rms)  # 1 value
    features.append(rms_mean)
    
    # 5. Spectral Centroid - "brightness" of the sound
    # Higher centroid = brighter/sharper sound
    spectral_centroid = librosa.feature.spectral_centroid(y=data, sr=sr)
    spectral_centroid_mean = np.mean(spectral_centroid)  # 1 value
    features.append(spectral_centroid_mean)
    
    # 6. Spectral Bandwidth - range of frequencies
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=data, sr=sr)
    spectral_bandwidth_mean = np.mean(spectral_bandwidth)  # 1 value
    features.append(spectral_bandwidth_mean)
    
    # 7. Spectral Rolloff - frequency below which 85% of energy is concentrated
    spectral_rolloff = librosa.feature.spectral_rolloff(y=data, sr=sr)
    spectral_rolloff_mean = np.mean(spectral_rolloff)  # 1 value
    features.append(spectral_rolloff_mean)
    
    # Total: 40 + 12 + 1 + 1 + 1 + 1 + 1 = 57 features
    return np.array(features)

# Test it on one file
sample = ravdess_df.sample(1).iloc[0]
y, sr = librosa.load(sample['file_path'])
features = extract_features(y, sr)

print(f"Extracted {len(features)} features from one audio file")
print(f"Feature vector shape: {features.shape}")
print(f"\nFirst 10 features (MFCCs): {features[:10]}")

In [None]:
# Now let's extract features from ALL files in the RAVDESS dataset
# We'll also apply data augmentation to increase our training data

from tqdm.notebook import tqdm

def extract_features_with_augmentation(file_path, sr=22050):
    """Extract features from original audio AND augmented versions.
    
    For each audio file, we create:
    1. Original features
    2. Noisy version features
    3. Stretched version features  
    4. Pitched version features
    5. Shifted version features
    
    This gives us 5x the training data!
    """
    data, sample_rate = librosa.load(file_path, sr=sr)
    
    all_features = []
    
    # Original
    all_features.append(extract_features(data, sample_rate))
    
    # Augmented versions
    all_features.append(extract_features(add_noise(data), sample_rate))
    all_features.append(extract_features(time_stretch(data, rate=np.random.uniform(0.8, 1.2)), sample_rate))
    all_features.append(extract_features(pitch_shift(data, sample_rate, n_steps=np.random.randint(-3, 4)), sample_rate))
    all_features.append(extract_features(time_shift(data), sample_rate))
    
    return all_features

# Extract features from all RAVDESS files
print("Extracting features from RAVDESS dataset (with augmentation)...")
print("This may take a few minutes...\n")

X_ravdess = []  # Features
y_ravdess = []  # Labels (emotions)

for idx, row in tqdm(ravdess_df.iterrows(), total=len(ravdess_df)):
    try:
        features_list = extract_features_with_augmentation(row['file_path'])
        for features in features_list:
            X_ravdess.append(features)
            y_ravdess.append(row['emotion'])
    except Exception as e:
        print(f"Error processing {row['file_path']}: {e}")

X_ravdess = np.array(X_ravdess)
y_ravdess = np.array(y_ravdess)

print(f"\nFeature extraction complete!")
print(f"Features shape: {X_ravdess.shape}  ({X_ravdess.shape[0]} samples, {X_ravdess.shape[1]} features each)")
print(f"Labels shape: {y_ravdess.shape}")
print(f"\nOriginal files: {len(ravdess_df)}")
print(f"With augmentation: {len(X_ravdess)} (5x as many!)")

# Part 8: Classification with scikit-learn

Now for the exciting part - training ML models to recognize emotions!

### What is Classification?

**Classification** is a type of machine learning where the model learns to assign **categories** (classes) to inputs. In our case:
- **Input**: 57 audio features (numbers)
- **Output**: One of 8 emotions

### Train/Test Split

We split our data into two parts:
- **Training set (80%)**: The model learns from these examples
- **Test set (20%)**: We evaluate performance on examples the model has *never seen*

**Why split?** If we test on the same data we trained on, the model could just memorize the answers. The test set tells us how well the model *generalizes* to new data.
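
One caveat applies to this notebook in particular. Each recording contributes five feature vectors (the original plus four augmented copies), so a purely random split can put near-duplicates of a test sample into the training set and make the accuracy look better than it really is. A group-aware split keeps all copies of a recording on the same side. The sketch below shows the idea with NumPy on made-up data. `group_train_test_split` and the `groups` array are hypothetical, and the notebook's own cells do not build such an array. scikit-learn's `GroupShuffleSplit` provides the same behavior as a library call.

```python
import numpy as np

def group_train_test_split(groups, test_size=0.2, seed=42):
    """Return boolean train/test masks that never split a group across both sets."""
    rng = np.random.default_rng(seed)
    unique_groups = np.unique(groups)
    rng.shuffle(unique_groups)
    n_test = max(1, int(round(test_size * len(unique_groups))))
    test_groups = unique_groups[:n_test]
    test_mask = np.isin(groups, test_groups)
    return ~test_mask, test_mask

# Hypothetical setup: 10 recordings, each contributing 5 feature vectors
groups = np.repeat(np.arange(10), 5)
train_mask, test_mask = group_train_test_split(groups)

print("train rows:", train_mask.sum(), "| test rows:", test_mask.sum())
print("recordings in test set:", sorted(set(groups[test_mask])))
```

Grouping by actor instead of by recording would go one step further and test how well the model handles voices it has never heard. Note that this simple version does not stratify by emotion.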

### The Models We'll Try

| Model | How it works (simplified) | Complexity |
|-------|-------------------------|----------|
| **KNN** (K-Nearest Neighbors) | Looks at the K most similar training examples | Simplest - no real "training", just memorize data |
| **Logistic Regression** | Draws lines/boundaries between classes | Simple linear model, fast and interpretable |
| **SVM** (Support Vector Machine) | Finds the best separating boundary | More powerful, works well in high dimensions |
| **Random Forest** | Builds many decision trees and takes a vote | Ensemble method - robust, handles messy data |
| **XGBoost** | Builds trees sequentially, each fixing previous mistakes | Most sophisticated - often among the strongest classical models on tabular features |
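
To show how little machinery the simplest model needs, here is a from-scratch sketch of K-Nearest Neighbors in NumPy on toy 2-D data. `knn_predict` is a hypothetical helper written for illustration. The notebook itself uses scikit-learn's `KNeighborsClassifier`, which is faster and handles ties and edge cases more carefully.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict a label for each query row by majority vote among its k nearest training rows."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]            # labels of the k nearest neighbours
    # Labels must be non-negative integers (as produced by a label encoder)
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data: two well-separated clusters labelled 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.1], [5.0, 5.1]])

predictions = knn_predict(X_train, y_train, X_query, k=3)
print(predictions)
```

Because the prediction depends directly on distances, KNN is sensitive to feature scale, which is one reason the features are standardized before training.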

In [None]:
# Step 1: Prepare the data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Encode emotion labels as numbers (ML models need numbers, not strings)
# e.g., 'angry' -> 0, 'calm' -> 1, 'disgust' -> 2, etc.
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_ravdess)

print("Label mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {label} -> {i}")

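# Split into training and test sets
# test_size=0.2 means 20% of data is held out for testing
# random_state=42 ensures reproducible results (same split every time)
# Caveat: rows are split at random, so augmented copies of one recording can land
# in both sets, which tends to make the test accuracy look optimistic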
X_train, X_test, y_train, y_test = train_test_split(
    X_ravdess, y_encoded, 
    test_size=0.2, 
    random_state=42,
    stratify=y_encoded  # Keeps the proportion of each emotion the same in train and test
)

print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Normalize the features
# StandardScaler makes each feature have mean=0 and std=1
# This is important because features have very different scales
# (e.g., MFCCs might be -500 to 500, while ZCR is 0 to 0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on training data
X_test_scaled = scaler.transform(X_test)         # Apply same scaling to test data

print(f"\nFeatures normalized. Shape: {X_train_scaled.shape}")

In [None]:
# Step 2: Train multiple models and compare performance
# We'll go from simplest to most sophisticated!

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import time

# Define our models - ordered from simplest to most complex
models = {
    'K-Nearest Neighbors': KNeighborsClassifier(
        n_neighbors=5          # Look at 5 nearest training examples
    ),
    'Logistic Regression': LogisticRegression(
        max_iter=1000,         # Maximum iterations for convergence
        random_state=42
    ),
    'SVM (RBF Kernel)': SVC(
        kernel='rbf',          # Radial Basis Function kernel
        random_state=42
    ),
    'Random Forest': RandomForestClassifier(
        n_estimators=300,      # Number of trees in the forest
        max_depth=20,          # Maximum depth of each tree
        random_state=42
    ),
    'XGBoost': XGBClassifier(
        n_estimators=300,      # Number of boosting rounds
        learning_rate=0.1,     # How much each tree contributes
        max_depth=6,
        random_state=42,
        eval_metric='mlogloss'  # Multi-class log loss
    ),
}

# Train and evaluate each model
results = {}

print("Training and evaluating models (simplest -> most complex)...")
print("=" * 50)

for name, model in models.items():
    start_time = time.time()
    
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(X_test_scaled)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    elapsed = time.time() - start_time
    
    results[name] = {
        'accuracy': accuracy,
        'predictions': y_pred,
        'time': elapsed,
        'model': model
    }
    
    print(f"{name:25s} | Accuracy: {accuracy:.4f} ({accuracy*100:.1f}%) | Time: {elapsed:.2f}s")

print("=" * 50)

# Find the best model
best_model_name = max(results, key=lambda x: results[x]['accuracy'])
print(f"\nBest model: {best_model_name} with {results[best_model_name]['accuracy']*100:.1f}% accuracy")

In [None]:
# Visualize model comparison

# Fixed order: simplest to most complex
model_names = ['K-Nearest Neighbors', 'Logistic Regression', 'SVM (RBF Kernel)', 'Random Forest', 'XGBoost']
accuracies = [results[name]['accuracy'] * 100 for name in model_names]

plt.figure(figsize=(10, 5))
bars = plt.barh(model_names, accuracies, color=['#2196F3', '#4CAF50', '#FF9800', '#9C27B0', '#F44336'])
plt.xlabel('Accuracy (%)')
plt.title('Model Comparison - RAVDESS Dataset')
plt.xlim(0, 100)

# Add accuracy labels on bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2, 
             f'{acc:.1f}%', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

# Part 9: Model Evaluation

Accuracy alone doesn't tell the whole story. We need to understand:
- **Which emotions** does the model get right vs. wrong?
- **Are certain emotions confused** with each other? (e.g., calm vs. neutral)
- **Is the model biased** toward certain classes?

### Evaluation Tools

| Tool | What it shows |
|------|---------------|
| **Confusion Matrix** | A grid showing predicted vs. actual labels - reveals which emotions are confused |
| **Classification Report** | Precision, recall, and F1-score per emotion |
| **Precision** | Of all samples predicted as X, how many were actually X? |
| **Recall** | Of all actual X samples, how many did we correctly predict? |
| **F1-Score** | Harmonic mean of precision and recall (balanced metric) |
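
To make the definitions in the table concrete, here is a minimal sketch that derives per-class precision, recall, and F1 directly from a confusion matrix. The 3x3 matrix and the class names are invented for illustration; they are not results from this notebook.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted).
# These counts are made up for illustration only.
class_names = ['angry', 'calm', 'neutral']
cm_example = np.array([
    [8, 1, 1],   # actual angry
    [0, 6, 4],   # actual calm
    [1, 3, 6],   # actual neutral
])

true_positives = np.diag(cm_example)
precision = true_positives / cm_example.sum(axis=0)  # column sums = how often each class was predicted
recall = true_positives / cm_example.sum(axis=1)     # row sums = how often each class actually occurred
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f"{name:8s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy matrix "angry" has high precision and recall, while "calm" and "neutral" drag each other down because they are mistaken for one another.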

In [None]:
# Confusion matrix for the best model

from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

# Get predictions from the best model
best_predictions = results[best_model_name]['predictions']

# Create confusion matrix
cm = confusion_matrix(y_test, best_predictions)

# Plot it
fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=label_encoder.classes_
)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title(f'Confusion Matrix - {best_model_name}\n(RAVDESS Only)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("\nHow to read this matrix:")
print("- Rows = actual emotion, Columns = predicted emotion")
print("- Diagonal = correct predictions (higher is better)")
print("- Off-diagonal = mistakes (shows which emotions get confused)")

In [None]:
# Detailed classification report

print(f"Classification Report - {best_model_name}")
print("=" * 60)
print(classification_report(
    y_test, 
    best_predictions, 
    target_names=label_encoder.classes_
))

print("\nWhat do these metrics mean?")
print("-" * 40)
print("Precision: When the model says 'angry', how often is it right?")
print("Recall:    Of all truly angry samples, how many did we find?")
print("F1-score:  Balance between precision and recall")
print("Support:   Number of test samples for each emotion")

In [None]:
# Confusion matrices for ALL models side by side

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

# Fixed order: simplest to most complex
model_names = ['K-Nearest Neighbors', 'Logistic Regression', 'SVM (RBF Kernel)', 'Random Forest', 'XGBoost']

for i, name in enumerate(model_names):
    result = results[name]
    cm = confusion_matrix(y_test, result['predictions'])
    disp = ConfusionMatrixDisplay(
        confusion_matrix=cm,
        display_labels=label_encoder.classes_
    )
    disp.plot(ax=axes[i], cmap='Blues', values_format='d')
    axes[i].set_title(f'{name}\n({result["accuracy"]*100:.1f}%)', fontsize=11)
    axes[i].tick_params(axis='x', rotation=45)

# Hide any unused subplots (the 2x3 grid has one more slot than we have models)
for ax in axes[len(model_names):]:
    ax.set_visible(False)

plt.suptitle('Confusion Matrices - All Models (RAVDESS)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### Interpreting the Results

Take a moment to look at the confusion matrices above. You'll likely notice:

1. **"Calm" and "Neutral"** are often confused - they sound similar!
2. **"Happy" and "Surprised"** sometimes get mixed up
3. **"Angry"** is usually well-recognized - it has distinctive features (loud, fast, sharp)

This makes intuitive sense. Even humans sometimes struggle to distinguish calm from neutral speech.
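
Rather than eyeballing the grids, you can also rank the mistakes numerically. The sketch below lists the largest off-diagonal cells of a confusion matrix. The helper name `top_confusions` and the demo counts are hypothetical, introduced only for this example.

```python
import numpy as np

def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (actual, predicted, count)."""
    cm = np.asarray(cm).copy()
    np.fill_diagonal(cm, 0)                          # ignore correct predictions
    flat_order = np.argsort(cm, axis=None)[::-1][:k]  # indices of the largest cells
    rows, cols = np.unravel_index(flat_order, cm.shape)
    return [(labels[r], labels[c], int(cm[r, c])) for r, c in zip(rows, cols)]

# Invented counts, for illustration only
demo_labels = ['angry', 'calm', 'neutral']
demo_cm = [[8, 1, 1],
           [0, 5, 5],
           [1, 3, 6]]
print(top_confusions(demo_cm, demo_labels, k=2))
```

With the notebook's own variables, a call along the lines of `top_confusions(cm, list(label_encoder.classes_))` should work the same way.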

### Can we do better?

Our dataset only has **1,440 files** from **24 actors** speaking **2 sentences**. That's quite limited!

In the next section, we'll add more datasets to create a larger, more diverse training set. But as we'll discover, **more data doesn't automatically mean better accuracy** - it depends on the quality and consistency of the data. This is an important lesson in ML!

### Hearing the Model in Action

Numbers are great, but let's actually **listen** to test set samples and see what the model predicts!

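Below, we pick random samples from the **held-out test set** (data the model never saw during training) and show the model's prediction alongside the true label. Because the test set stores feature vectors rather than file paths, the audio we play is an example clip of the *true* emotion, not necessarily the exact clip being scored.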
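
If you want to play the exact clip behind each prediction, the file paths have to travel through the train/test split together with the features. Here is a minimal sketch on toy arrays; `toy_X`, `toy_y`, and `toy_paths` are hypothetical stand-ins, and it assumes you collected one path per feature row during feature extraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 10 feature vectors, their labels, and their (hypothetical) file paths
toy_X = np.arange(20, dtype=float).reshape(10, 2)
toy_y = np.array([0, 1] * 5)
toy_paths = np.array([f"clip_{i:02d}.wav" for i in range(10)])

# train_test_split applies the same shuffled indices to every array it receives
X_tr, X_te, y_tr, y_te, paths_tr, paths_te = train_test_split(
    toy_X, toy_y, toy_paths, test_size=0.3, random_state=42, stratify=toy_y
)

# Row i of X_te now corresponds to paths_te[i]
for features, label, path in zip(X_te, y_te, paths_te):
    print(path, label, features)
```

With augmentation, several feature rows share one source path, so the path array would simply repeat each path once per augmented version.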

In [None]:
# Listen to test set samples and see model predictions!
# The model has NEVER seen these samples during training.

from IPython.display import Audio, display
import random

# Use the best model from our RAVDESS training
best_model = results[best_model_name]['model']

# Get indices for a random sample of test examples
num_samples = 10
sample_indices = random.sample(range(len(X_test)), num_samples)

correct = 0

print(f"Model: {best_model_name}")
print(f"Evaluating on {num_samples} random test samples...")
print("=" * 60)

for idx in sample_indices:
    # Get the prediction
    features_scaled = X_test_scaled[idx].reshape(1, -1)
    prediction = best_model.predict(features_scaled)[0]
    predicted_emotion = label_encoder.inverse_transform([prediction])[0]
    true_emotion = label_encoder.inverse_transform([y_test[idx]])[0]
    
    is_correct = predicted_emotion == true_emotion
    correct += is_correct
    status = 'CORRECT' if is_correct else 'WRONG'
    
    # Play an example clip with the same TRUE emotion.
    # Note: X_test holds feature vectors without file paths, so this is a randomly
    # chosen RAVDESS file with the matching label, not necessarily the exact clip
    # the prediction above was made on.
    matching_files = ravdess_df[ravdess_df['emotion'] == true_emotion]
    if len(matching_files) > 0:
        sample_file = matching_files.sample(1).iloc[0]
        print(f"\nTrue: {true_emotion:12s} | Predicted: {predicted_emotion:12s} | {status}")
        if 'statement' in sample_file:
            print(f"  Statement: \"{sample_file['statement']}\"")
        display(Audio(sample_file['file_path']))
    else:
        print(f"\nTrue: {true_emotion:12s} | Predicted: {predicted_emotion:12s} | {status}")

print("\n" + "=" * 60)
print(f"Score: {correct}/{num_samples} correct ({100*correct/num_samples:.0f}%)")
print(f"\nRun this cell again to hear different samples!")

---

# Part 10: Building a Larger Dataset

One of the most effective ways to improve ML models is to give them **more diverse data**. We'll now download 3 additional speech emotion datasets and combine them with RAVDESS.

### Additional Datasets

| Dataset | Actors | Emotions | Total Files | Language |
|---------|--------|----------|-------------|----------|
| **RAVDESS** | 24 | 8 | 1,440 | English (North American) |
| **TESS** | 2 | 7 | 2,800 | English (Canadian) |
| **SAVEE** | 4 | 7 | 480 | English (British) |
| **CREMA-D** | 91 | 6 | 7,442 | English (Various) |
| **Combined** | 121 | varies | ~12,000+ | English (Multiple accents) |

### Why Combine Datasets?

- **More speakers** = model learns emotion patterns that are consistent across different voices
- **More accents** = better generalization to new speakers
- **More data** = more examples to learn from

### Important: Emotion Label Harmonization

Different datasets use slightly different emotion categories. We need to map them to a **common set**. We'll use these emotions that are shared across most datasets:

**neutral, happy, sad, angry, fearful, disgust, surprised**

(We'll drop "calm" since it's only in RAVDESS and is hard to distinguish from neutral.)
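
As a minimal sketch of what harmonization looks like in code: map each dataset's raw label onto the common set and drop anything that has no counterpart. The raw strings in `harmonize` are illustrative assumptions (the exact spelling each dataset uses may differ), and `to_common_label` is a hypothetical helper, not part of the notebook.

```python
# Raw label -> common label. The raw spellings here are illustrative assumptions.
harmonize = {
    'fear': 'fearful',
    'surprise': 'surprised',
    'ps': 'surprised',       # TESS "pleasant surprise"
}
TARGET_EMOTIONS = {'neutral', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised'}

def to_common_label(raw_label):
    """Map a raw label to the common set, or return None if it should be dropped."""
    label = raw_label.strip().lower()
    label = harmonize.get(label, label)
    return label if label in TARGET_EMOTIONS else None

raw = ['Fear', 'ps', 'calm', 'happy']
print([to_common_label(r) for r in raw])
```

Here "calm" maps to `None`, which mirrors the decision to drop it from the combined dataset.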

In [None]:
# Download additional datasets from Kaggle
# Option 2 (opendatasets) is active by default.
# To use the kaggle CLI instead, uncomment Option 1 and comment out Option 2.

# --- Option 1: kaggle CLI ---
# !kaggle datasets download -d ejlok1/toronto-emotional-speech-set-tess -p datasets/
# !kaggle datasets download -d ejlok1/surrey-audiovisual-expressed-emotion-savee -p datasets/
# !kaggle datasets download -d ejlok1/cremad -p datasets/

# --- Option 2: opendatasets ---
import opendatasets as od

od.download('https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess', data_dir='datasets/')
od.download('https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee', data_dir='datasets/')
od.download('https://www.kaggle.com/datasets/ejlok1/cremad', data_dir='datasets/')

In [None]:
# Unzip any datasets that need it

import glob
import os
import zipfile

zip_files = {
    'datasets/toronto-emotional-speech-set-tess.zip': 'datasets/tess/',
    'datasets/surrey-audiovisual-expressed-emotion-savee.zip': 'datasets/savee/',
    'datasets/cremad.zip': 'datasets/cremad/',
}

for zip_path, dest_dir in zip_files.items():
    if os.path.exists(zip_path):
        print(f"Unzipping {zip_path}...")
        with zipfile.ZipFile(zip_path, 'r') as zf:
            zf.extractall(dest_dir)
        print(f"  -> Extracted to {dest_dir}")
    else:
        print(f"No zip found for {zip_path} - may already be extracted")

# Show what we have in datasets/
print("\nContents of datasets/ folder:")
for item in sorted(os.listdir('datasets/')):
    full_path = os.path.join('datasets/', item)
    if os.path.isdir(full_path):
        file_count = sum(1 for _ in glob.glob(os.path.join(full_path, '**/*.wav'), recursive=True))
        print(f"  [DIR] {item} ({file_count} wav files)")
    else:
        print(f"  [FILE] {item}")

## Processing TESS (Toronto Emotional Speech Set)

**TESS**: Two actresses (aged 26 and 64) say 200 target words in 7 emotions.

File structure: The emotion label is in the **folder name** (e.g., `OAF_angry/`, `YAF_happy/`).
- `OAF` = Older Adult Female
- `YAF` = Young Adult Female

In [None]:
# Process TESS dataset

import os
import glob

# Find TESS directory (structure may vary by download method)
tess_candidates = [
    'datasets/tess/',
    'datasets/toronto-emotional-speech-set-tess/',
    'datasets/tess/TESS Toronto emotional speech set data/',
    'datasets/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/',
]

tess_dir = None
for path in tess_candidates:
    if os.path.exists(path):
        # Check if wav files exist here or in subdirectories
        wavs = glob.glob(os.path.join(path, '**/*.wav'), recursive=True)
        if len(wavs) > 0:
            tess_dir = path
            break

if tess_dir is None:
    print("Could not find TESS dataset. Please check the datasets/ folder.")
else:
    print(f"Found TESS at: {tess_dir}")
    
    # TESS emotion mapping (folder names contain the emotion)
    tess_emotion_map = {
        'angry': 'angry',
        'disgust': 'disgust',
        'fear': 'fearful',
        'happy': 'happy',
        'neutral': 'neutral',
        'ps': 'surprised',  # "pleasant surprise"
        'sad': 'sad',
    }
    
    tess_rows = []
    all_tess_files = glob.glob(os.path.join(tess_dir, '**/*.wav'), recursive=True)
    
    for file_path in all_tess_files:
        # Get emotion from the folder name or filename
        # TESS files are like: OAF_angry/OAF_back_angry.wav
        filename = os.path.basename(file_path).lower()
        parent_dir = os.path.basename(os.path.dirname(file_path)).lower()
        
        # Try to find the emotion label
        emotion_found = None
        for key, value in tess_emotion_map.items():
            if key in parent_dir or key in filename:
                emotion_found = value
                break
        
        if emotion_found:
            # Determine speaker from filename
            gender = 'female'  # TESS only has female speakers
            speaker = 'OAF' if 'oaf' in filename or 'oaf' in parent_dir else 'YAF'
            
            tess_rows.append({
                'file_path': file_path,
                'emotion': emotion_found,
                'gender': gender,
                'actor': speaker,
                'dataset': 'TESS'
            })
    
    tess_df = pd.DataFrame(tess_rows)
    print(f"\nTESS: {len(tess_df)} files")
    print(tess_df['emotion'].value_counts())

## Processing SAVEE (Surrey Audio-Visual Expressed Emotion)

**SAVEE**: 4 male English speakers, 7 emotions, 480 total files.

File naming: After the speaker prefix (e.g., `DC_`), a one- or two-letter code indicates the emotion:
- `a` = angry, `d` = disgust, `f` = fear, `h` = happy, `n` = neutral, `sa` = sad, `su` = surprise
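
Because the code can be one or two letters long, a regular expression is one tidy way to pull it out. This is only a sketch: it assumes filenames of the form `DC_a01.wav` (speaker, underscore, code, number), and `parse_savee_filename` is a hypothetical helper name.

```python
import re

SAVEE_CODES = {'a': 'angry', 'd': 'disgust', 'f': 'fearful', 'h': 'happy',
               'n': 'neutral', 'sa': 'sad', 'su': 'surprised'}

# Speaker prefix, underscore, 1-2 letter emotion code, digits, e.g. "DC_sa02.wav"
SAVEE_PATTERN = re.compile(r'^([A-Za-z]+)_([a-z]{1,2})(\d+)\.wav$')

def parse_savee_filename(filename):
    """Return (speaker, emotion) or None if the name does not fit the pattern."""
    match = SAVEE_PATTERN.match(filename)
    if match is None:
        return None
    speaker, code, _number = match.groups()
    emotion = SAVEE_CODES.get(code)
    return (speaker, emotion) if emotion else None

for name in ['DC_a01.wav', 'DC_sa02.wav', 'notes.txt']:
    print(name, '->', parse_savee_filename(name))
```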

In [None]:
# Process SAVEE dataset

savee_candidates = [
    'datasets/savee/',
    'datasets/surrey-audiovisual-expressed-emotion-savee/',
]

savee_dir = None
for path in savee_candidates:
    if os.path.exists(path):
        wavs = glob.glob(os.path.join(path, '**/*.wav'), recursive=True)
        if len(wavs) > 0:
            savee_dir = path
            break

if savee_dir is None:
    print("Could not find SAVEE dataset. Please check the datasets/ folder.")
else:
    print(f"Found SAVEE at: {savee_dir}")
    
    # SAVEE emotion codes (prefix of filename)
    savee_emotion_map = {
        'a': 'angry',
        'd': 'disgust',
        'f': 'fearful',
        'h': 'happy',
        'n': 'neutral',
        'sa': 'sad',
        'su': 'surprised',
    }
    
    savee_rows = []
    all_savee_files = glob.glob(os.path.join(savee_dir, '**/*.wav'), recursive=True)
    
    for file_path in all_savee_files:
        filename = os.path.basename(file_path)
        
        # SAVEE filenames: DC_a01.wav, DC_sa02.wav, etc.
        # The speaker is before _, emotion code is after _
        parts = filename.split('_')
        if len(parts) >= 2:
            speaker = parts[0]
            emotion_part = parts[1].split('.')[0]  # e.g., 'a01', 'sa02'
            
            # Extract emotion code
            emotion_found = None
            # Check two-letter codes (sa, su) before single-letter codes
            for code in sorted(savee_emotion_map.keys(), key=len, reverse=True):
                if emotion_part.startswith(code):
                    emotion_found = savee_emotion_map[code]
                    break
            
            if emotion_found:
                savee_rows.append({
                    'file_path': file_path,
                    'emotion': emotion_found,
                    'gender': 'male',  # SAVEE only has male speakers
                    'actor': speaker,
                    'dataset': 'SAVEE'
                })
    
    savee_df = pd.DataFrame(savee_rows)
    print(f"\nSAVEE: {len(savee_df)} files")
    print(savee_df['emotion'].value_counts())

## Processing CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)

**CREMA-D**: 91 actors (diverse ages, ethnicities), 6 emotions, 7,442 files.

This is the largest and most diverse dataset. File naming includes the emotion code:
- `ANG` = angry, `DIS` = disgust, `FEA` = fear, `HAP` = happy, `NEU` = neutral, `SAD` = sad

In [None]:
# Process CREMA-D dataset

cremad_candidates = [
    'datasets/cremad/',
    'datasets/cremad/AudioWAV/',
]

cremad_dir = None
for path in cremad_candidates:
    if os.path.exists(path):
        wavs = glob.glob(os.path.join(path, '**/*.wav'), recursive=True)
        if len(wavs) > 0:
            cremad_dir = path
            break

if cremad_dir is None:
    print("Could not find CREMA-D dataset. Please check the datasets/ folder.")
else:
    print(f"Found CREMA-D at: {cremad_dir}")
    
    # CREMA-D emotion codes
    cremad_emotion_map = {
        'ANG': 'angry',
        'DIS': 'disgust',
        'FEA': 'fearful',
        'HAP': 'happy',
        'NEU': 'neutral',
        'SAD': 'sad',
    }
    
    cremad_rows = []
    all_cremad_files = glob.glob(os.path.join(cremad_dir, '**/*.wav'), recursive=True)
    
    for file_path in all_cremad_files:
        filename = os.path.basename(file_path)
        
        # CREMA-D filenames: 1001_DFA_ANG_XX.wav
        # Format: ActorID_Sentence_Emotion_Level.wav
        parts = filename.split('_')
        if len(parts) >= 3:
            actor_id = parts[0]
            emotion_code = parts[2]
            
            if emotion_code in cremad_emotion_map:
                cremad_rows.append({
                    'file_path': file_path,
                    'emotion': cremad_emotion_map[emotion_code],
                    'gender': 'unknown',  # CREMA-D doesn't encode gender in filename
                    'actor': actor_id,
                    'dataset': 'CREMA-D'
                })
    
    cremad_df = pd.DataFrame(cremad_rows)
    print(f"\nCREMA-D: {len(cremad_df)} files")
    print(cremad_df['emotion'].value_counts())

In [None]:
# Combine all datasets into one big DataFrame!

# Start with RAVDESS (drop 'calm' to match other datasets)
ravdess_combined = ravdess_df[ravdess_df['emotion'] != 'calm'][['file_path', 'emotion', 'gender', 'actor', 'dataset']].copy()

# Collect all available DataFrames
all_dfs = [ravdess_combined]

if 'tess_df' in dir() and len(tess_df) > 0:
    all_dfs.append(tess_df)
    
if 'savee_df' in dir() and len(savee_df) > 0:
    all_dfs.append(savee_df)
    
if 'cremad_df' in dir() and len(cremad_df) > 0:
    all_dfs.append(cremad_df)

# Combine!
combined_df = pd.concat(all_dfs, ignore_index=True)

print(f"Combined dataset: {len(combined_df)} total files")
print(f"\nSamples per dataset:")
print(combined_df['dataset'].value_counts())
print(f"\nSamples per emotion:")
print(combined_df['emotion'].value_counts())
print(f"\nEmotions: {sorted(combined_df['emotion'].unique())}")

In [None]:
# Visualize the combined dataset

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Emotion distribution in combined dataset
emotion_by_dataset = combined_df.groupby(['emotion', 'dataset']).size().unstack(fill_value=0)
emotion_by_dataset.plot(kind='bar', stacked=True, ax=axes[0], 
                        color=['#2196F3', '#4CAF50', '#FF9800', '#E91E63'])
axes[0].set_title('Samples per Emotion (by Dataset)')
axes[0].set_xlabel('Emotion')
axes[0].set_ylabel('Count')
axes[0].legend(title='Dataset')
axes[0].tick_params(axis='x', rotation=45)

# Dataset sizes
dataset_counts = combined_df['dataset'].value_counts()
axes[1].pie(dataset_counts.values, labels=dataset_counts.index, autopct='%1.1f%%',
           colors=['#2196F3', '#4CAF50', '#FF9800', '#E91E63'])
axes[1].set_title('Dataset Composition')

plt.tight_layout()
plt.show()

print(f"\nTotal: {len(combined_df)} audio files from {combined_df['actor'].nunique()} unique actors")

In [None]:
# Listen to samples from different datasets for the same emotion

emotion_to_listen = 'angry'  # Change this to hear other emotions!

print(f"Listening to '{emotion_to_listen}' samples from each dataset:\n")

for dataset in combined_df['dataset'].unique():
    subset = combined_df[(combined_df['emotion'] == emotion_to_listen) & (combined_df['dataset'] == dataset)]
    if len(subset) > 0:
        sample = subset.sample(1).iloc[0]
        print(f"Dataset: {dataset} | Actor: {sample['actor']} | Gender: {sample['gender']}")
        display(Audio(sample['file_path']))
        print()

# Part 11: Feature Extraction on the Combined Dataset

Now let's extract features from our much larger combined dataset. This will take longer, and whether the extra data actually improves our models is exactly what we'll test afterwards.

**Note**: This cell may take 10-20 minutes depending on your computer. Go grab a coffee!
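
Since extraction is slow, it can be worth saving the resulting arrays so you only pay the cost once. Below is a minimal caching sketch using NumPy's `.npz` format. The helper names, the file location, and the demo arrays are all hypothetical; in the notebook you would pass `X_combined` and `y_combined` instead.

```python
import os
import tempfile
import numpy as np

def save_features(path, X, y):
    """Store features and string labels in one compressed .npz file."""
    np.savez_compressed(path, X=X, y=y)

def load_features(path):
    """Return (X, y) from a file written by save_features."""
    with np.load(path) as data:
        return data['X'], data['y']

# Demo with small random arrays standing in for the real features
demo_X = np.random.rand(6, 57)
demo_y = np.array(['angry', 'sad', 'happy', 'angry', 'sad', 'happy'])
cache_path = os.path.join(tempfile.gettempdir(), 'ser_features_demo.npz')  # hypothetical location

save_features(cache_path, demo_X, demo_y)
X_loaded, y_loaded = load_features(cache_path)
print(X_loaded.shape, y_loaded[:3])
```

A typical pattern is to check `os.path.exists(cache_path)` first and only run the extraction loop when the cache file is missing.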

In [None]:
# Extract features from the combined dataset
# Using augmentation to further increase training data

from tqdm.notebook import tqdm

print(f"Extracting features from {len(combined_df)} files (with augmentation)...")
print("This will take a while - each file generates 5 augmented versions.")
print(f"Expected total samples: ~{len(combined_df) * 5}\n")

X_combined = []  # Features
y_combined = []  # Labels
errors = []

for idx, row in tqdm(combined_df.iterrows(), total=len(combined_df)):
    try:
        features_list = extract_features_with_augmentation(row['file_path'])
        for features in features_list:
            X_combined.append(features)
            y_combined.append(row['emotion'])
    except Exception as e:
        errors.append((row['file_path'], str(e)))

X_combined = np.array(X_combined)
y_combined = np.array(y_combined)

print(f"\nFeature extraction complete!")
print(f"Features shape: {X_combined.shape}")
print(f"Labels shape: {y_combined.shape}")
if errors:
    print(f"Errors encountered: {len(errors)} files could not be processed")
    for path, err in errors[:5]:
        print(f"  {path}: {err}")

# Part 12: Classification on the Combined Dataset

Let's see how our models perform with the larger, more diverse dataset.

**Hypothesis**: More data from more speakers should help the model generalize better... but will it?

One important thing to keep in mind: RAVDESS is a very **controlled** dataset (professional actors, studio recording conditions, consistent setup). Our combined dataset mixes in data from different recording environments, different labeling conventions, and different speaker pools. This added **diversity** is realistic, but also makes the classification task **harder**.

Let's find out what happens!
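
One thing to watch for: each file contributes several augmented feature vectors, so a purely random row-level split can place variants of the same recording in both the training and test sets. A hedged alternative is to split by *group* (for example by source file or by actor). The sketch below uses scikit-learn's `GroupShuffleSplit` on toy data; the `toy_groups` array is hypothetical and would have to be collected during feature extraction.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 4 source files, each with 3 augmented feature vectors
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(12, 5))
toy_y = np.repeat([0, 1, 0, 1], 3)
toy_groups = np.repeat(['file_a', 'file_b', 'file_c', 'file_d'], 3)

# All rows that share a group label end up on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(toy_X, toy_y, groups=toy_groups))

train_files = {str(g) for g in toy_groups[train_idx]}
test_files = {str(g) for g in toy_groups[test_idx]}
print("Train files:", sorted(train_files))
print("Test files: ", sorted(test_files))
print("Overlap:    ", train_files & test_files)  # empty set: no file is on both sides
```

Note that `GroupShuffleSplit` does not stratify by emotion, so class proportions may drift slightly between train and test. Grouping by actor instead of by file gives a stricter test, because the model is then evaluated on voices it has never heard.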

In [None]:
# Prepare the combined dataset for training

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Encode labels
label_encoder_combined = LabelEncoder()
y_combined_encoded = label_encoder_combined.fit_transform(y_combined)

print("Label mapping:")
for i, label in enumerate(label_encoder_combined.classes_):
    print(f"  {label} -> {i}")

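# Split the data
# Caveat: X_combined holds several augmented versions of each file, so this random
# row-level split can put variants of the same recording in both train and test,
# which tends to inflate test accuracy.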
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_combined, y_combined_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_combined_encoded
)

# Scale features
scaler_combined = StandardScaler()
X_train_c_scaled = scaler_combined.fit_transform(X_train_c)
X_test_c_scaled = scaler_combined.transform(X_test_c)

print(f"\nTraining set: {X_train_c_scaled.shape[0]} samples")
print(f"Test set: {X_test_c_scaled.shape[0]} samples")
print(f"Number of features: {X_train_c_scaled.shape[1]}")
print(f"Number of emotions: {len(label_encoder_combined.classes_)}")

In [None]:
# Train all models on the combined dataset

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import time

models_combined = {
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'SVM (RBF Kernel)': SVC(kernel='rbf', random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=300, max_depth=20, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42, eval_metric='mlogloss'),
}

results_combined = {}

print("Training and evaluating models on COMBINED dataset (simplest -> most complex)...")
print("=" * 55)

for name, model in models_combined.items():
    start_time = time.time()
    
    model.fit(X_train_c_scaled, y_train_c)
    y_pred_c = model.predict(X_test_c_scaled)
    accuracy = accuracy_score(y_test_c, y_pred_c)
    elapsed = time.time() - start_time
    
    results_combined[name] = {
        'accuracy': accuracy,
        'predictions': y_pred_c,
        'time': elapsed,
        'model': model
    }
    
    print(f"{name:25s} | Accuracy: {accuracy:.4f} ({accuracy*100:.1f}%) | Time: {elapsed:.2f}s")

print("=" * 55)
best_combined_name = max(results_combined, key=lambda x: results_combined[x]['accuracy'])
print(f"\nBest model: {best_combined_name} with {results_combined[best_combined_name]['accuracy']*100:.1f}% accuracy")

In [None]:
# Compare RAVDESS-only vs Combined dataset performance

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Fixed order: simplest to most complex
model_names = ['K-Nearest Neighbors', 'Logistic Regression', 'SVM (RBF Kernel)', 'Random Forest', 'XGBoost']

# RAVDESS only
ravdess_accs = [results[name]['accuracy'] * 100 for name in model_names]
bars1 = axes[0].barh(model_names, ravdess_accs, color='#2196F3')
axes[0].set_xlabel('Accuracy (%)')
axes[0].set_title('RAVDESS Only')
axes[0].set_xlim(0, 100)
for bar, acc in zip(bars1, ravdess_accs):
    axes[0].text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2,
                 f'{acc:.1f}%', va='center', fontweight='bold')

# Combined
combined_accs = [results_combined[name]['accuracy'] * 100 for name in model_names]
bars2 = axes[1].barh(model_names, combined_accs, color='#4CAF50')
axes[1].set_xlabel('Accuracy (%)')
axes[1].set_title('Combined Dataset (RAVDESS + TESS + SAVEE + CREMA-D)')
axes[1].set_xlim(0, 100)
for bar, acc in zip(bars2, combined_accs):
    axes[1].text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2,
                 f'{acc:.1f}%', va='center', fontweight='bold')

plt.suptitle('Effect of More Data on Model Performance', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Print the change in accuracy (in percentage points; negative = combined did worse)
print("\nChange in accuracy from RAVDESS-only to Combined (percentage points):")
print("-" * 50)
for name in model_names:
    diff = results_combined[name]['accuracy'] - results[name]['accuracy']
    print(f"  {name:25s}: {diff*100:+.1f}")

In [None]:
# Detailed evaluation of the best model on combined dataset

from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

best_pred_combined = results_combined[best_combined_name]['predictions']

# Confusion matrix
cm_combined = confusion_matrix(y_test_c, best_pred_combined)

fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm_combined,
    display_labels=label_encoder_combined.classes_
)
disp.plot(ax=ax, cmap='Greens', values_format='d')
plt.title(f'Confusion Matrix - {best_combined_name}\n(Combined Dataset)', fontsize=14)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Classification report
print(f"\nClassification Report - {best_combined_name} (Combined Dataset)")
print("=" * 60)
print(classification_report(
    y_test_c,
    best_pred_combined,
    target_names=label_encoder_combined.classes_
))

### Interpreting the Results: Why More Data Didn't Always Help

You might have expected the combined dataset to always outperform RAVDESS-only. But the results tell a more nuanced story! Here's what's going on:

#### Why accuracy sometimes *decreased*:

| Factor | Impact |
|--------|--------|
| **Domain shift** | Each dataset was recorded in different studios, with different microphones, room acoustics, and signal quality. Our features (MFCCs, spectral centroid, etc.) capture not just emotion but also these recording characteristics. When you train and test on RAVDESS, the recording conditions match perfectly. The combined dataset mixes 4 different recording pipelines, so the features are partially encoding "which dataset is this from?" instead of "what emotion is this?" |
| **Label interpretation** | "Happy" performed by one of RAVDESS's 24 professional actors in a controlled session can sound quite different from "happy" performed by one of CREMA-D's 91 actors (in CREMA-D it is the emotion ratings, not the actors, that were crowd-sourced). The label is the same string, but the acoustic realization differs. The model is trying to learn one boundary for "happy" when there are really multiple overlapping distributions. |
| **Dataset dominance** | CREMA-D has ~7,400 files vs. RAVDESS's ~1,440. CREMA-D dominates the combined training data. If CREMA-D's recordings are noisier or its emotional expressions more ambiguous, the model shifts toward that distribution and gets *worse* on cleaner data. |
| **Measuring different things** | RAVDESS-only accuracy measures "can you classify emotions within a controlled lab setting?" Combined accuracy measures "can you classify emotions across wildly different recording conditions and acting styles?" The second task is genuinely much harder. The label sets differ too: the RAVDESS-only models include "calm", which the combined dataset drops. So it's not a fair apples-to-apples comparison. |

#### The RAVDESS advantage:

RAVDESS is a very **controlled** dataset - professional actors, studio conditions, consistent recording setup. When you train AND test on RAVDESS, all the "non-emotion" information (microphone quality, room reverb, noise floor) is consistent, so the model can focus entirely on the emotional signal. The combined dataset introduces real-world messiness.
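
One way to check how much "non-emotion" information your features carry is to train a small probe that predicts *which dataset* a sample came from. If that probe scores far above chance, the features are leaking recording conditions. The sketch below demonstrates the idea on synthetic data only: the two "datasets" are random features that differ by a constant offset, an assumption chosen to mimic a recording-condition difference.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two "datasets" whose features differ by a constant offset,
# mimicking a recording-condition difference that has nothing to do with emotion
features_a = rng.normal(loc=0.0, size=(200, 10))
features_b = rng.normal(loc=1.0, size=(200, 10))
probe_X = np.vstack([features_a, features_b])
probe_y = np.array(['dataset_a'] * 200 + ['dataset_b'] * 200)

# Probe: can a simple linear model tell the datasets apart?
probe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(probe, probe_X, probe_y, cv=5)
print(f"Dataset-identification accuracy: {scores.mean():.2f} (chance = 0.50)")
```

To try this on the real data you would need one dataset name per feature row as the probe's target, collected alongside the features during extraction.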

#### This is a real phenomenon: Distribution Shift

This shows up constantly in applied ML. A medical imaging model trained at one hospital often performs worse at another hospital, even with "more data." A self-driving car trained in sunny California struggles in rainy Seattle. The technical term is **distribution shift** - when the training data and test data come from different underlying distributions.

**Key takeaway**: Data quality and consistency matter just as much as data quantity. This is one of the most important lessons in applied ML!

---

### How Could We Actually Improve With More Data?

More data *can* help, but we need to be smarter about how we combine it. Here are concrete strategies:

#### 1. Per-Dataset Feature Normalization
Instead of extracting features and mixing everything together, normalize features **within each dataset** before combining. This reduces the recording-condition bias so the model focuses on emotion rather than microphone type.

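```python
# Example: normalize per dataset before combining.
# Assumes `dataset_labels` holds one dataset name per row of the float array X.
from sklearn.preprocessing import StandardScaler

for dataset_name in np.unique(dataset_labels):
    mask = (dataset_labels == dataset_name)
    X[mask] = StandardScaler().fit_transform(X[mask])
```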

#### 2. More Robust Features
Our hand-crafted features (MFCCs, spectral centroid) are sensitive to recording conditions. **Pre-trained deep learning embeddings** from models like [wav2vec2](https://huggingface.co/facebook/wav2vec2-base) or [HuBERT](https://huggingface.co/facebook/hubert-base-ls960) are pre-trained on large amounts of unlabeled speech (960 hours for the two linked base checkpoints, far more for larger variants) and tend to produce features that better capture *what's being said and how* rather than *what microphone was used*. You could extract these embeddings and feed them into the same sklearn classifiers we used here.

#### 3. Domain Adaptation
Train the model to explicitly *ignore* dataset-specific characteristics. Techniques include:
- **Adversarial training**: Train a second model to predict which dataset a sample came from, and penalize the main model for making that task easy
- **Dataset balancing**: Ensure equal representation from each dataset in every training batch
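
Classical models don't train in batches, but you can get a similar balancing effect with sample weights. The sketch below gives every dataset the same total weight regardless of its size. `dataset_balanced_weights` is a hypothetical helper, and the demo labels are made up.

```python
import numpy as np

def dataset_balanced_weights(dataset_labels):
    """Weight each sample by 1 / (size of its dataset), rescaled to mean 1."""
    dataset_labels = np.asarray(dataset_labels)
    names, counts = np.unique(dataset_labels, return_counts=True)
    per_dataset = dict(zip(names, 1.0 / counts))
    weights = np.array([per_dataset[name] for name in dataset_labels])
    return weights * len(weights) / weights.sum()

# Made-up example: one dataset is three times larger than the other
demo_datasets = ['CREMA-D'] * 6 + ['RAVDESS'] * 2
weights = dataset_balanced_weights(demo_datasets)
print(weights)
```

Several scikit-learn estimators, such as `RandomForestClassifier` and `LogisticRegression`, accept these through `model.fit(X, y, sample_weight=weights)`. `KNeighborsClassifier` does not, so this trick won't apply to every model in our comparison.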

#### 4. Consistent Preprocessing
Before combining, resample all audio to the same sample rate, apply the same noise reduction, and normalize volume levels. This reduces some of the surface-level differences between datasets.
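
Here is a minimal sketch of two of those steps, resampling and volume normalization, using NumPy and SciPy. The 16 kHz target rate and the 0.95 peak level are assumptions picked for the example, `standardize_audio` is a hypothetical helper, and noise reduction is left out.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

TARGET_SR = 16000  # assumed common sample rate; pick whatever suits your features

def standardize_audio(signal, sr, target_sr=TARGET_SR, peak=0.95):
    """Resample to target_sr and scale so the loudest sample has magnitude `peak`."""
    signal = np.asarray(signal, dtype=np.float64)
    if sr != target_sr:
        g = gcd(sr, target_sr)
        # Polyphase resampling by the rational factor target_sr / sr
        signal = resample_poly(signal, target_sr // g, sr // g)
    max_abs = np.max(np.abs(signal))
    if max_abs > 0:  # leave pure silence untouched
        signal = signal * (peak / max_abs)
    return signal, target_sr

# Demo: a quiet 1-second 440 Hz tone "recorded" at 48 kHz
sr_in = 48000
t = np.arange(sr_in) / sr_in
quiet_tone = 0.1 * np.sin(2 * np.pi * 440 * t)
clean, sr_out = standardize_audio(quiet_tone, sr_in)
print(len(clean), sr_out, round(float(np.max(np.abs(clean))), 2))
```

If you load audio with librosa, passing `sr=` to `librosa.load` already resamples on load, so only the normalization step would remain.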

#### 5. Evaluate Per-Dataset
Instead of one combined accuracy number, evaluate **per dataset** to understand where the model improves vs. degrades. Maybe the combined model is better on CREMA-D and SAVEE but worse on RAVDESS - that's useful information!
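
A per-dataset breakdown is a one-liner with pandas once you have a table with one row per test sample. The rows below are invented for illustration; building such a table for real requires knowing which dataset each test sample came from.

```python
import pandas as pd

# Hypothetical evaluation table: one row per test sample
eval_df = pd.DataFrame({
    'dataset':   ['RAVDESS', 'RAVDESS', 'CREMA-D', 'CREMA-D', 'CREMA-D', 'TESS'],
    'true':      ['angry',   'sad',     'angry',   'happy',   'sad',     'happy'],
    'predicted': ['angry',   'neutral', 'angry',   'sad',     'sad',     'happy'],
})

# Accuracy is just the mean of a boolean "was it right?" column, per group
eval_df['correct'] = eval_df['true'] == eval_df['predicted']
per_dataset = eval_df.groupby('dataset')['correct'].agg(accuracy='mean', n_samples='count')
print(per_dataset)
```

Keeping `n_samples` next to the accuracy helps you avoid over-reading a score that is based on only a handful of clips.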

#### 6. Use Neural Networks on Spectrograms
Instead of hand-crafting 57 features, feed mel spectrograms directly into a **Convolutional Neural Network (CNN)**. CNNs can learn which parts of the spectrogram are relevant for emotion and which are recording artifacts. This approach is widely used in deep-learning SER systems.

#### 7. Fine-tune Pre-trained Speech Models
The current best approach: take a model like wav2vec2 that already understands speech, and **fine-tune** it on emotion data. These models have seen so much speech data that they've already learned to separate content from recording conditions.

### Putting It in Perspective: Classical ML vs. Modern Approaches

It's worth remembering that in this notebook we're using **classical ML techniques** - algorithms like Random Forest, SVM, and XGBoost that work on hand-crafted features. These are great for learning the fundamentals, but they have a ceiling.

Modern speech emotion recognition systems use much more powerful approaches:

| Approach | What we did | What modern systems do |
|----------|------------|----------------------|
| **Features** | Hand-crafted (MFCCs, chroma, ZCR) - 57 numbers per audio clip | Learned automatically from raw audio by deep neural networks - thousands of dimensions |
| **Models** | Classical ML (Random Forest, SVM, XGBoost) | CNNs on spectrograms (like we saw in the previous class!), Transformers, and large pre-trained models |
| **Training data** | ~12,000 clips with augmentation | Hundreds of thousands of hours of speech |
| **Typical accuracy** | Roughly 50-85% depending on setup | Often reported at 80-95%+ on standard benchmarks |

The key difference is that neural networks can **learn their own features** directly from spectrograms or raw audio, rather than relying on us to decide which 57 numbers best describe emotion. CNNs can pick up on subtle patterns in spectrograms that we'd never think to extract by hand, and Transformers can capture long-range patterns across an entire utterance.

**Next week**, we'll bridge this gap by looking at **Whisper** and how to **fine-tune a pre-trained model** for emotion recognition. You'll see how a model that has already learned to understand speech from massive datasets can be adapted to our emotion task with much better results - and surprisingly little code!

---

### Hearing the Combined Model in Action

Let's run the best combined-dataset model on a few random test samples and, for each one, listen to an example clip of its true emotion.

In [None]:
# Listen to test set samples from the combined dataset and see predictions!

from IPython.display import Audio, display
import random

# Use the best model from the combined training
best_model_c = results_combined[best_combined_name]['model']

# Pick random test samples
num_samples = 10
sample_indices = random.sample(range(len(X_test_c)), num_samples)

correct = 0

print(f"Model: {best_combined_name} (trained on combined dataset)")
print(f"Evaluating on {num_samples} random test samples...")
print("=" * 65)

for idx in sample_indices:
    # Get the prediction
    features_scaled = X_test_c_scaled[idx].reshape(1, -1)
    prediction = best_model_c.predict(features_scaled)[0]
    predicted_emotion = label_encoder_combined.inverse_transform([prediction])[0]
    true_emotion = label_encoder_combined.inverse_transform([y_test_c[idx]])[0]
    
    is_correct = predicted_emotion == true_emotion
    correct += is_correct
    status = 'CORRECT' if is_correct else 'WRONG'
    
    # Play a random clip with the same true emotion as this test sample.
    # Note: this is an example of that emotion, not necessarily the exact
    # clip the model just scored (this cell does not link test rows to files).
    print(f"\nTrue: {true_emotion:12s} | Predicted: {predicted_emotion:12s} | {status}")
    matching_files = combined_df[combined_df['emotion'] == true_emotion]
    if len(matching_files) > 0:
        sample_file = matching_files.sample(1).iloc[0]
        print(f"  Dataset: {sample_file['dataset']}")
        if 'statement' in sample_file:
            print(f"  Statement: \"{sample_file['statement']}\"")
        display(Audio(sample_file['file_path']))

print("\n" + "=" * 65)
print(f"Score: {correct}/{num_samples} correct ({100*correct/num_samples:.0f}%)")
print(f"\nRun this cell again to hear different samples!")
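
Ten random samples give only a rough estimate of accuracy, so the score above can swing a lot between runs. As a minimal sketch (the helper name `accuracy_margin` is hypothetical, and the normal approximation it uses is itself crude for small samples), the following compares the uncertainty of a 7/10 score with the same rate measured over 1,000 samples:

```python
import math

def accuracy_margin(correct, total, z=1.96):
    """Approximate 95% margin of error for an accuracy estimate.

    Uses the normal approximation to the binomial distribution,
    which is rough for small sample sizes.
    """
    p = correct / total
    standard_error = math.sqrt(p * (1 - p) / total)
    return p, z * standard_error

for correct, total in [(7, 10), (700, 1000)]:
    p, margin = accuracy_margin(correct, total)
    print(f"{correct}/{total}: accuracy {p:.0%} +/- {margin:.0%}")
```

For a reliable number, prefer the accuracy computed on the full test set; the 10-sample score is mainly for listening.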

# Part 13: Real-time Emotion Recognition (Exercise)

Now it's your turn! Let's use the trained model to recognize emotion from your own voice in real-time.

### How it works:
1. Record a short clip from your microphone
2. Extract the same features we used for training
3. Feed features into our best model
4. Get an emotion prediction!

**Important**: The model was trained on acted emotions, which tend to be more exaggerated than natural speech. Try being a bit theatrical when you speak!

### Tips for best results:
- Speak a full sentence (e.g., "I can't believe this happened!")
- Exaggerate the emotion slightly
- Speak clearly and at a normal volume
- Try the same sentence with different emotions
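
To make "clearly and at a normal volume" checkable, you could test the level of a recording before trusting a prediction. This is a minimal sketch: the function name `check_recording_level` and both thresholds are illustrative assumptions, not tuned values, and the signals below are synthetic stand-ins for a microphone recording.

```python
import numpy as np

def check_recording_level(audio, silence_rms=0.01, clip_peak=0.99):
    """Classify a float recording in [-1, 1] as 'too quiet', 'clipped', or 'ok'.

    The thresholds are illustrative assumptions, not tuned values.
    """
    audio = np.asarray(audio, dtype=np.float32)
    rms = float(np.sqrt(np.mean(audio ** 2)))   # average loudness
    peak = float(np.max(np.abs(audio)))         # loudest single sample
    if rms < silence_rms:
        return 'too quiet'
    if peak >= clip_peak:
        return 'clipped'
    return 'ok'

# Synthetic stand-ins for a 1-second recording at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
tone = 0.3 * np.sin(2 * np.pi * 220 * t)

print(check_recording_level(tone))                          # moderate level
print(check_recording_level(0.001 * tone))                  # nearly silent
print(check_recording_level(np.clip(5 * tone, -1.0, 1.0)))  # saturated
```

A check like this could run on the flattened array returned by `sd.rec` before feature extraction.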

In [None]:
# Real-time emotion prediction function

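import sounddevice as sd
import soundfile as sf
import numpy as np
import librosa
import librosa.display  # needed for librosa.display.waveshow
import matplotlib.pyplot as plt
from IPython.display import Audio, display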

def predict_emotion(audio_data, sr, model, scaler, label_encoder):
    """Predict emotion from audio data.
    
    Args:
        audio_data: numpy array of audio samples
        sr: sample rate
        model: trained sklearn model
        scaler: fitted StandardScaler
        label_encoder: fitted LabelEncoder
    
    Returns:
        predicted emotion (string), confidence scores (dict)
    """
    # Extract features (same function we used for training!)
    features = extract_features(audio_data, sr)
    features = features.reshape(1, -1)  # Reshape for single prediction
    
    # Scale features (using the same scaler from training)
    features_scaled = scaler.transform(features)
    
    # Predict
    prediction = model.predict(features_scaled)[0]
    predicted_emotion = label_encoder.inverse_transform([prediction])[0]
    
    # Get prediction probabilities if the model supports it
    confidence = {}
    if hasattr(model, 'predict_proba'):
        proba = model.predict_proba(features_scaled)[0]
        for i, label in enumerate(label_encoder.classes_):
            confidence[label] = proba[i]
    
    return predicted_emotion, confidence

def record_and_predict(duration=3, sr=22050):
    """Record audio and predict emotion."""
    print(f"Recording for {duration} seconds... Speak now!")
    
    # Record audio
    recording = sd.rec(
        int(duration * sr),
        samplerate=sr,
        channels=1,
        dtype='float32'
    )
    sd.wait()
    recording = recording.flatten()
    
    print("Recording finished! Analyzing...\n")
    
    # Choose which model and scaler to use
    # Use the combined dataset model if available, otherwise RAVDESS-only
    # (check globals(): inside a function, dir() only lists local names)
    if 'results_combined' in globals() and best_combined_name in results_combined:
        model = results_combined[best_combined_name]['model']
        sc = scaler_combined
        le = label_encoder_combined
        dataset_name = 'Combined'
    else:
        model = results[best_model_name]['model']
        sc = scaler
        le = label_encoder
        dataset_name = 'RAVDESS'
    
    # Predict emotion
    predicted_emotion, confidence = predict_emotion(recording, sr, model, sc, le)
    
    # Display results
    print("=" * 40)
    print(f"  Predicted Emotion: {predicted_emotion.upper()}")
    print(f"  Model: {best_combined_name if dataset_name == 'Combined' else best_model_name}")
    print(f"  Dataset: {dataset_name}")
    print("=" * 40)
    
    # Show confidence scores if available
    if confidence:
        print("\nConfidence scores:")
        for emotion, score in sorted(confidence.items(), key=lambda x: x[1], reverse=True):
            bar = '#' * int(score * 40)
            print(f"  {emotion:12s} {score:6.1%} {bar}")
    
    # Show waveform
    plt.figure(figsize=(10, 3))
    librosa.display.waveshow(recording, sr=sr)
    plt.title(f'Your Recording - Predicted: {predicted_emotion.upper()}')
    plt.tight_layout()
    plt.show()
    
    # Play it back
    display(Audio(recording, rate=sr))
    
    return predicted_emotion, confidence

print("Ready! Use record_and_predict() to test emotion recognition.")
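
The confidence scores are most useful when no class clearly wins. As a sketch, the hypothetical helper below takes a dict shaped like the one `predict_emotion` returns and reports "uncertain" when the top probability is low. The 0.5 threshold and the example scores are made up for illustration.

```python
def describe_prediction(confidence, min_confidence=0.5):
    """Return the top label, or 'uncertain' when no class is confident enough.

    `confidence` maps emotion labels to probabilities.
    The 0.5 threshold is an illustrative assumption.
    """
    if not confidence:
        # Empty dict: the model did not expose predict_proba
        return 'unknown (no probabilities available)'
    top_label, top_score = max(confidence.items(), key=lambda item: item[1])
    if top_score < min_confidence:
        return f'uncertain (best guess: {top_label}, {top_score:.0%})'
    return f'{top_label} ({top_score:.0%})'

# Made-up scores for illustration only
print(describe_prediction({'happy': 0.72, 'sad': 0.18, 'angry': 0.10}))
print(describe_prediction({'happy': 0.36, 'sad': 0.34, 'angry': 0.30}))
```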

In [None]:
# Record yourself and get an emotion prediction!
# Run this cell, speak a sentence with emotion, and see what the model thinks.
#
# Try saying the same sentence with different emotions:
# "I can't believe this is happening right now."
# - Say it happily, angrily, sadly, fearfully, etc.

predicted, confidence = record_and_predict(duration=3)

In [None]:
# Interactive loop: Keep testing until you quit
# Press Enter to record, type 'q' to quit

print("=" * 50)
print("INTERACTIVE EMOTION RECOGNITION")
print("=" * 50)
print("Instructions:")
print("  1. Press Enter to start recording (3 seconds)")
print("  2. Speak a sentence with emotion")
print("  3. See the prediction!")
print("  4. Type 'q' to quit")
print("=" * 50)

try:
    while True:
        user_input = input("\nPress Enter to record (or 'q' to quit): ")
        if user_input.lower() == 'q':
            break
        record_and_predict(duration=3)
except KeyboardInterrupt:
    pass

print("\nThanks for testing!")

In [None]:
# BONUS: Compare predictions from RAVDESS-only model vs Combined model
# This shows how more data changes the model's behavior

def compare_models(duration=3, sr=22050):
    """Record once and compare predictions from both models."""
    print(f"Recording for {duration} seconds... Speak now!")
    
    recording = sd.rec(int(duration * sr), samplerate=sr, channels=1, dtype='float32')
    sd.wait()
    recording = recording.flatten()
    print("Recording finished!\n")
    
    # Extract features once
    features = extract_features(recording, sr).reshape(1, -1)
    
    print("=" * 55)
    print(f"{'Model':25s} | {'Predicted Emotion':20s}")
    print("=" * 55)
    
    # RAVDESS-only model
    # (check globals(): inside a function, dir() only lists local names)
    if 'results' in globals():
        for name in results:
            model = results[name]['model']
            features_scaled = scaler.transform(features)
            pred = model.predict(features_scaled)[0]
            emotion = label_encoder.inverse_transform([pred])[0]
            print(f"  RAVDESS {name:20s} | {emotion}")
    
    print("-" * 55)
    
    # Combined model
    if 'results_combined' in globals():
        for name in results_combined:
            model = results_combined[name]['model']
            features_scaled = scaler_combined.transform(features)
            pred = model.predict(features_scaled)[0]
            emotion = label_encoder_combined.inverse_transform([pred])[0]
            print(f"  Combined {name:19s} | {emotion}")
    
    print("=" * 55)
    
    # Play it back
    plt.figure(figsize=(10, 3))
    librosa.display.waveshow(recording, sr=sr)
    plt.title('Your Recording')
    plt.tight_layout()
    plt.show()
    display(Audio(recording, rate=sr))

# Uncomment and run to compare:
# compare_models(duration=3)
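
When several models disagree, one simple way to summarize them is a majority vote. The sketch below assumes you have collected the per-model predictions into a dict (`compare_models` as written only prints them); the helper name `majority_vote` and the example predictions are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return (winning_label, vote_count) from a {model_name: label} dict.

    Ties are broken in favor of the label that appears first.
    """
    counts = Counter(predictions.values())
    label, votes = counts.most_common(1)[0]
    return label, votes

# Made-up predictions for illustration only
example = {
    'Random Forest': 'angry',
    'XGBoost': 'angry',
    'Logistic Regression': 'sad',
    'SVM': 'angry',
}
label, votes = majority_vote(example)
print(f"Majority vote: {label} ({votes}/{len(example)} models)")
```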

---

# Exercises

Now it's your turn to explore! Try these exercises to deepen your understanding.

### Exercise 1: Feature Importance
Random Forest models can tell you which features were most important for classification. Which audio features matter most for emotion recognition?

### Exercise 2: Reduce Emotions
Try classifying only 4 emotions (happy, sad, angry, neutral) instead of 7. Does accuracy improve?

### Exercise 3: Gender Effects
Train separate models for male and female speakers. Does it help?

### Exercise 4: Hyperparameter Tuning
Try changing model parameters (number of trees, max depth, learning rate, etc.) and see how it affects performance.
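
One way to approach this systematically is a small grid search with cross-validation. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data standing in for the scaled feature matrix; the grid values are arbitrary starting points, not recommendations, and `X_demo`/`y_demo` are placeholders for your own training arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled feature matrix (57 features, 4 classes)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=57, n_informative=10,
    n_classes=4, random_state=42
)

# Every combination of these values is trained and scored
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

If augmented copies of the same clip end up in different folds, the cross-validated scores will look optimistic, so treat them as a guide and confirm on the held-out test set.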

### Exercise 5: Try It Without Augmentation
Re-run feature extraction without augmentation and compare. How much does augmentation help?

In [None]:
# Exercise 1: Feature Importance (starter code)
# 
# Random Forest can tell us which features were most important!
# This helps us understand what the model is "looking at"

# Get the Random Forest model (from combined results if available)
if 'results_combined' in dir() and 'Random Forest' in results_combined:
    rf_model = results_combined['Random Forest']['model']
    le = label_encoder_combined
else:
    rf_model = results['Random Forest']['model']
    le = label_encoder

# Get feature importances
importances = rf_model.feature_importances_

# Create feature names
feature_names = (
    [f'MFCC_{i}' for i in range(40)] +
    [f'Chroma_{i}' for i in range(12)] +
    ['ZCR', 'RMS', 'Spectral_Centroid', 'Spectral_Bandwidth', 'Spectral_Rolloff']
)

# Sort by importance
sorted_idx = np.argsort(importances)[::-1]

# Plot top 20 most important features
plt.figure(figsize=(12, 6))
top_n = 20
plt.barh(
    [feature_names[i] for i in sorted_idx[:top_n]][::-1],
    importances[sorted_idx[:top_n]][::-1],
    color='#4CAF50'
)
plt.xlabel('Feature Importance')
plt.title(f'Top {top_n} Most Important Features for Emotion Recognition')
plt.tight_layout()
plt.show()

print("\nNotice: MFCCs tend to dominate! They capture the overall vocal quality.")
print("But spectral features and ZCR also contribute.")
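
To answer "which kind of feature matters most", it can help to add up importances per feature family. The sketch below assumes the same 40 + 12 + 5 layout as `feature_names` above; `importance_by_family` is a hypothetical helper, and the random values stand in for `rf_model.feature_importances_`.

```python
import numpy as np

# Same layout as feature_names above: 40 MFCCs, 12 chroma, 5 scalar features
family_slices = {
    'MFCC': slice(0, 40),
    'Chroma': slice(40, 52),
    'Spectral/ZCR/RMS': slice(52, 57),
}

def importance_by_family(importances, slices):
    """Sum per-feature importances within each feature family."""
    importances = np.asarray(importances)
    return {name: float(importances[s].sum()) for name, s in slices.items()}

# Stand-in values; in the notebook, pass rf_model.feature_importances_ instead
rng = np.random.default_rng(0)
demo_importances = rng.random(57)
demo_importances /= demo_importances.sum()  # importances sum to 1

for family, total in importance_by_family(demo_importances, family_slices).items():
    print(f"{family:18s} {total:.3f}")
```

Keep in mind that a summed score favors families with more features (40 MFCCs versus 5 scalar features), so comparing the mean importance per feature is a useful cross-check.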

In [None]:
# Exercise 2: Reduce to 4 emotions (starter code)
# 
# Try: happy, sad, angry, neutral
# Does accuracy improve with fewer classes?

# YOUR CODE HERE
# Hint: Filter combined_df to only include 4 emotions, then re-extract features
# and re-train models. You can reuse the extract_features function.
#
# four_emotions = ['happy', 'sad', 'angry', 'neutral']
# filtered_df = combined_df[combined_df['emotion'].isin(four_emotions)]
# ...
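
If you already have a feature matrix with an aligned array of string labels, an alternative to re-extracting features is to filter the rows with a boolean mask. This is a sketch on toy data; `X_demo` and `labels_demo` are hypothetical stand-ins, and in the notebook you would still need to redo the train/test split and refit the label encoder and scaler afterwards.

```python
import numpy as np

four_emotions = ['happy', 'sad', 'angry', 'neutral']

# Toy stand-ins: 6 clips with 3 features each, plus string labels
X_demo = np.arange(18, dtype=float).reshape(6, 3)
labels_demo = np.array(['happy', 'fear', 'sad', 'disgust', 'angry', 'neutral'])

# Boolean mask: True where the clip's label is one of the four emotions
mask = np.isin(labels_demo, four_emotions)
X_four = X_demo[mask]
labels_four = labels_demo[mask]

print(X_four.shape)
print(labels_four)
```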

In [None]:
# Exercise 3: Gender-specific models (starter code)
#
# Train separate models for male and female speakers.
# Does this improve accuracy?

# YOUR CODE HERE
# Hint: Filter by gender, extract features for each subset,
# and train separate models.
#
# male_df = combined_df[combined_df['gender'] == 'male']
# female_df = combined_df[combined_df['gender'] == 'female']
# ...
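
Before training separate models, it can be informative to check whether the existing model already performs differently per gender. The sketch below assumes you have kept a gender array aligned with the test-set rows, which the notebook may not do by default; `accuracy_by_group` is a hypothetical helper and the labels are made up.

```python
import numpy as np

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for each distinct value in `groups`."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {
        str(g): float(np.mean(y_true[groups == g] == y_pred[groups == g]))
        for g in np.unique(groups)
    }

# Made-up labels for illustration only
y_true = ['happy', 'sad', 'angry', 'sad', 'happy', 'angry']
y_pred = ['happy', 'sad', 'angry', 'sad', 'angry', 'sad']
gender = ['male', 'male', 'male', 'female', 'female', 'female']

print(accuracy_by_group(y_true, y_pred, gender))
```

A large gap between groups suggests gender-specific models (or gender as an extra feature) might be worth trying.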

---

# Summary

In this notebook, you learned:

1. **Data Collection**: How to download and organize speech emotion datasets from Kaggle
2. **Data Exploration**: Using pandas to understand dataset structure and distribution
3. **Audio Visualization**: Waveforms, spectrograms, and mel spectrograms with librosa
4. **Data Augmentation**: Noise, time stretch, pitch shift, and time shift to create more training data
5. **Feature Extraction**: Converting audio to numerical features (MFCCs, chroma, ZCR, RMS, spectral features)
6. **Classification**: Training and comparing Random Forest, XGBoost, Logistic Regression, SVM, and KNN
7. **Evaluation**: Confusion matrices, classification reports, precision, recall, F1-score
8. **Dataset Combination**: Building a larger, more diverse dataset from multiple sources
9. **Real-time Inference**: Using trained models to recognize emotions from live microphone input

### Key Takeaways

- **More data helps**: Combining datasets improved performance across all models
- **Feature engineering matters**: The quality of your features determines your model's ceiling
- **Some emotions are harder**: Calm vs. neutral, happy vs. surprised are commonly confused
- **Classical ML can be effective**: You don't always need deep learning!
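
To see which emotions your own model confuses, you can read the largest off-diagonal entries of the confusion matrix. The sketch below assumes the row = true, column = predicted layout used by `sklearn.metrics.confusion_matrix`; `most_confused_pairs` is a hypothetical helper and the counts are made up for illustration.

```python
import numpy as np

def most_confused_pairs(cm, class_names, top_n=3):
    """List the largest off-diagonal entries of a confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    off_diagonal = np.array(cm, dtype=float)
    np.fill_diagonal(off_diagonal, 0)  # ignore correct predictions
    # Flattened indices of the largest remaining entries, biggest first
    order = np.argsort(off_diagonal, axis=None)[::-1][:top_n]
    rows, cols = np.unravel_index(order, off_diagonal.shape)
    return [
        (class_names[r], class_names[c], int(off_diagonal[r, c]))
        for r, c in zip(rows, cols)
    ]

# Made-up counts for illustration only
names = ['calm', 'happy', 'neutral', 'surprised']
cm = [
    [30,  1, 12,  0],
    [ 0, 28,  2,  9],
    [10,  1, 31,  1],
    [ 1,  7,  0, 35],
]
for true_label, predicted_label, count in most_confused_pairs(cm, names):
    print(f"{true_label} -> {predicted_label}: {count}")
```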

### What's Next?

**Next week**, we'll move beyond classical ML and into modern deep learning approaches:

- **Fine-tuning Whisper**: We'll take OpenAI's Whisper model - already trained on hundreds of thousands of hours of speech - and fine-tune it for different applications. You'll see how pre-trained models can dramatically improve performance with relatively little effort.
- **CNNs on Spectrograms**: As we explored in the previous class with digit recognition, we can feed mel spectrograms directly into convolutional neural networks and let them learn their own features.
- **Transfer Learning**: The key idea behind modern AI - start with a model that already understands speech, then teach it your specific task.
- **Ethics**: Consider the implications of emotion recognition technology and the [EU AI Act](https://ai-act-law.eu/recital/18/)

### References

- [RAVDESS Dataset](https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio)
- [TESS Dataset](https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess)
- [SAVEE Dataset](https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee)
- [CREMA-D Dataset](https://www.kaggle.com/datasets/ejlok1/cremad)
- [librosa Documentation](https://librosa.org/doc/latest/)
- [scikit-learn Documentation](https://scikit-learn.org/stable/)
- [Speech Emotion Recognition on Papers with Code](https://paperswithcode.com/task/speech-emotion-recognition)