<a href="https://colab.research.google.com/github/kodandachalla/Data_Analysis_and_Visualization__Music/blob/main/3_project_1_assignment(colab_version)_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***N.B. this notebook has been tested in Google Colab***

# Music Genre Classification

In this project, you will build a machine learning algorithm to classify music genres using audio features. Starting with the provided dataset, your task is to develop a model that effectively solves this multiclass classification problem. Use the baseline notebook as a starting point and improve upon it.

***Overall Goal: to design a complete pipeline that improves accuracy from the current 37% to at least 70%, ideally approaching 80%***


## Introduction

In this project, you will work with a dataset of music samples from various genres. The dataset has been purposely left a bit messy, with some entries missing labels and others containing empty audio files. To start with, your task is to clean and explore the dataset, turning it into a well-organized resource for analysis.

This notebook includes a basic, "weak" baseline to get you started. It serves as a simple starting point, but it is neither thorough nor accurate. You are expected to build upon it, applying your own strategies to improve the data science pipeline (including data cleaning, curation, feature engineering, etc.) before moving into model building, parameter tuning, and model evaluation.

The formal details of the assignment are provided at the end of the notebook. To start with, focus on understanding the dataset and planning your strategies to tackle its challenges.

**We expect you to submit a modified version of this notebook with your improvements. Please download a copy of this assignment to your private Python programming environment before making any changes.**

## Baseline

Let's install all required dependencies:

- **datasets**: Access to large-scale datasets.
- **librosa**: Tools for audio analysis.
- **pandas** & **numpy**: Tabular data manipulation and numerical operations.
- **scikit-learn**: Machine learning algorithms and tools.
- **tqdm**: Progress bar.

You might be familiar with most of these already.

In [None]:
%%capture
!pip install datasets==3.5.0 librosa pandas numpy scikit-learn tqdm

And import the necessary modules

In [None]:
import numpy as np
import pandas as pd
import librosa
from datasets import load_dataset
from IPython.display import Audio, display
from tqdm.notebook import tqdm

In [None]:
# Include here any additional modules that you might need
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Description

The dataset consists of music samples from various genres, including:

- **Genres**: `Blues`, `Classical`, `Country`, `Disco`, `HipHop`, `Jazz`, `Metal`, `Pop`, `Reggae`, and `Rock`.

The dataset is a bit messy and includes some **unlabeled data** and **empty audio files**. We have provided basic preprocessing, but more in-depth data cleaning, feature extraction, and preparation will be a part of your assignment.

Let's download the audio dataset using the Hugging Face datasets library.

In [None]:
dataset = load_dataset("unibz-ds-course/audio_assignment", split="train")

In [None]:
dataset

In [None]:
print(f"Num of samples in the dataset: {len(dataset)}")

Let's take a glance at a sample from the dataset

In [None]:
entry = dataset[10]

audio_array = entry['audio']['array']
sampling_rate = entry['audio']['sampling_rate']

print(f"Element: {entry}\n")
print(f"File Path: {entry['file']}")
print(f"Number of Samples: {len(audio_array)}")
print(f"Sampling Rate: {sampling_rate} Hz")

audio_length_seconds = len(audio_array) / sampling_rate
print(f"Audio Length: {audio_length_seconds:.2f} seconds")

genre_id = entry['genre']
genre_label = dataset.features['genre'].int2str(genre_id)
print(f"Genre (ID): {genre_id}")
print(f"Genre (Label): {genre_label}")

display(Audio(audio_array, rate=sampling_rate))

**Draw a plot with distribution of the classes (5 points)**

Create a visualization (e.g., bar chart or histogram) that shows how many samples belong to each class.
This helps identify whether the dataset is balanced or if some classes are underrepresented.

In [None]:
genre_counts = pd.Series(dataset['genre']).value_counts().sort_index()
genre_labels = [dataset.features['genre'].int2str(int(g_id)) for g_id in genre_counts.index]

display("The dataset contains balanced classes across different genres")
plt.figure(figsize=(15, 6))
ax = sns.barplot(x=genre_labels, y=genre_counts.values, palette='tab10', hue=genre_labels, legend=False)
plt.title('Distribution of Music Genres in the Dataset')
plt.xlabel('Genre')
plt.ylabel('Number of Samples')
plt.xticks(rotation=0, ha='center')  # unrotated labels read best when centered under their bars

total_samples = len(dataset)
for p in ax.patches:
    height = p.get_height()
    percentage = '{:.1f}%'.format(100 * height / total_samples)
    ax.annotate(percentage,
                (p.get_x() + p.get_width() / 2., height),
                ha='center', va='bottom', xytext=(0, 5),
                textcoords='offset points')

plt.tight_layout()
plt.show()

**Draw distribution of lengths of audios (5 points)**

Plot the distribution of audio lengths in the dataset to analyze how durations vary across samples.

In [None]:
audio_array_lengths = [len(sample['audio']['array']) for sample in dataset]
sampling_rates = [sample['audio']['sampling_rate'] for sample in dataset]

audio_length_seconds = [length / sampling_rate for length, sampling_rate in zip(audio_array_lengths, sampling_rates)]

print(f"Minimum Audio Length: {min(audio_length_seconds):.2f} seconds")
print(f"Maximum Audio Length: {max(audio_length_seconds):.2f} seconds")
plt.figure(figsize=(12, 6))
sns.histplot(audio_length_seconds, bins=50, kde=True)
plt.title('Distribution of Audio Lengths')
plt.xlabel('Audio Length (seconds)')
plt.ylabel('Number of Samples')
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()
plt.show()

**Delete empty samples (5 points)**

Implement the function to remove empty samples (audios with silence only)

In [None]:
def filter_empty_samples(entry):
    audio_array = entry['audio']['array']

    # Check if the audio array is truly empty (length 0)
    if len(audio_array) == 0:
        return False
    # Check if all elements in the audio array are zero (silence only)
    if np.all(audio_array == 0):
        return False

    return True

In [None]:
filtered_dataset = dataset.filter(filter_empty_samples)

In [None]:
assert len(filtered_dataset) == 970, "Your filtering function is wrong"

If the assertion above fails, please check your function for bugs.

**Delete unlabeled samples (5 points)**

Implement the function to remove unlabeled samples.

In [None]:
def filter_unlabeled_samples(entry):
    genre_id = entry['genre']

    if genre_id not in genre_counts.index: # 0 to 9
        return False

    return True

In [None]:
filtered_dataset = filtered_dataset.filter(filter_unlabeled_samples)

In [None]:
assert len(filtered_dataset) == 848, "Your filtering function is wrong"

If the assertion above fails, please check your function for bugs.

Now we can extract some features from the dataset.

### **Mel Frequency Cepstral Coefficients**

**[Mel Frequency Cepstral Coefficients (MFCCs)](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)** are commonly used in audio analysis to capture key features of sound. They help represent the important characteristics of an audio signal, making them ideal for tasks like music genre classification and speech recognition.

We're not going to dive deep into the complex details of audio processing, but it's useful to know that MFCCs help simplify raw audio data while retaining important information.

#### Basic Steps in MFCC Extraction:
1. **Frequency Domain Conversion**: The audio signal is split into short frames, and we apply the Fourier Transform to convert them from the time domain to the frequency domain.
2. **Mel Scale Mapping**: The frequency spectrum is converted to the Mel scale, which better represents how humans perceive sound, emphasizing lower frequencies.
3. **Logarithm and DCT**: After mapping to the Mel scale, we apply a logarithm and the Discrete Cosine Transform (DCT) to get the MFCCs. These summarize the "cepstral" information of the audio signal.

The parameter `n_mfcc` controls **how many MFCC coefficients** are extracted for each frame. For example, setting `n_mfcc=8` means we extract 8 coefficients, where lower coefficients capture broad audio features and higher coefficients capture finer details.
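
As a rough illustration of the Mel scale mapping step, the sketch below converts frequencies to Mel units with the HTK-style formula. This formula is an assumption chosen for simplicity (`librosa` uses a different, Slaney-style variant by default), and the helper name `hz_to_mel_htk` is hypothetical. It shows that the same 100 Hz gap covers far more Mel units at low frequencies than at high ones.

```python
import numpy as np

def hz_to_mel_htk(freq_hz):
    """Map frequency in Hz to Mel units with the HTK-style formula (illustrative assumption)."""
    freq_hz = np.asarray(freq_hz, dtype=float)
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

# The same 100 Hz gap, once at low and once at high frequencies
low_step = hz_to_mel_htk(200.0) - hz_to_mel_htk(100.0)
high_step = hz_to_mel_htk(4100.0) - hz_to_mel_htk(4000.0)

print(f"100 -> 200 Hz spans {low_step:.1f} mel")
print(f"4000 -> 4100 Hz spans {high_step:.1f} mel")
```

The larger low-frequency step is what "emphasizing lower frequencies" means in practice: more resolution is spent where human hearing is most sensitive to pitch differences.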

#### Why MFCCs Are Important:
MFCCs help capture the **tonal quality** of the sound and reduce the complexity of the raw audio signal. By summarizing the audio into a smaller set of features, they allow machine learning models to classify and recognize different types of sounds more effectively.

In this notebook, we'll use the **mean** and **variance** of the MFCCs over time to create a robust feature set for our classification model. Adjusting the `n_mfcc` parameter allows us to control the number of features extracted for each audio sample.
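
To make the mean/variance summary concrete, here is a minimal sketch that uses a random matrix as a stand-in for real MFCC output (the array `fake_mfcc` and its shape are assumptions for illustration only). Each row is one coefficient and each column is one frame, so summarizing along `axis=1` collapses the time dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mfcc, n_frames = 8, 100

# Stand-in for an MFCC matrix: one row per coefficient, one column per frame
fake_mfcc = rng.normal(size=(n_mfcc, n_frames))

# Summarize each coefficient over time
mfcc_mean = fake_mfcc.mean(axis=1)
mfcc_var = fake_mfcc.var(axis=1)

# One fixed-length feature vector per audio sample, regardless of its duration
features = np.concatenate([mfcc_mean, mfcc_var])
print(features.shape)
```

Whatever the clip length, the resulting vector always has `2 * n_mfcc` entries, which is what lets clips of different durations share one feature table.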

#### **Additional Features**
Consider exploring additional audio features to enhance your model's performance. There are various acoustic properties you could extract from the audio signals, such as zero crossings, harmonic-percussive separation, tempo, spectral centroids, spectral rolloff, chromagram, RMS energy, spectral bandwidth, etc. When working with these features, it's often useful to compute summary statistics like the mean and variance across the audio sample. These summary statistics can capture the overall characteristics and variability of the feature, reducing the dimensionality of your data while retaining important information. Experimenting with these features and their statistical summaries could potentially improve your model's accuracy and robustness in distinguishing between different audio characteristics.

#### **Feature Analysis**
Don't forget to optimize the use of features by identifying and handling irrelevant and redundant ones. Then use feature ranking to identify which features are more influential, and evaluate quantitatively how many top features to retain.

In [None]:
def extract_mfcc_features(dataset, n_mfcc):
    mfcc_features = []

    # Dataset.map could be used here, but it consumes extra memory and runs out of RAM in Colab
    for entry in tqdm(dataset, desc="Extracting MFCC Features"):
        audio_array = entry['audio']['array']
        sampling_rate = entry['audio']['sampling_rate']

        mfcc = librosa.feature.mfcc(y=audio_array, sr=sampling_rate, n_mfcc=n_mfcc)

        # print(mfcc.shape)
        mfcc_mean = np.mean(mfcc, axis=1)
        mfcc_var = np.var(mfcc, axis=1)

        feature_dict = {}

        for i in range(n_mfcc):
            feature_dict[f'mfcc_mean{i+1}'] = mfcc_mean[i]

        for i in range(n_mfcc):
            feature_dict[f'mfcc_var{i+1}'] = mfcc_var[i]

        feature_dict['genre'] = entry['genre']

        mfcc_features.append(feature_dict)

    return mfcc_features

Let's take a look at the output of the function. We will pass it just 2 samples from the dataset.

In [None]:
# extract_mfcc_features(filtered_dataset.select(range(2)), n_mfcc=5)

The function generates `n_mfcc * 2` features for each sample. Consider analyzing their correlation with a matrix and experimenting with different `n_mfcc` values to observe how feature relationships change. While MFCC features are effective for audio analysis, you might also improve performance by incorporating additional features such as RMS or Spectral Contrast. Once you've explored these options, proceed to training the model using the extracted features.

**Implement functions to extract RMS and Spectral Contrast (10 points in total)**

Write functions to extract these features, also add their description and discuss why they might be useful in this assignment.

In `librosa`, **RMS** and **Spectral Contrast** are essential features for audio analysis, but they measure very different things: one focuses on **power (volume)**, while the other focuses on **texture (clarity)**.


**1. Spectral Contrast**

Spectral Contrast measures the **spectral texture** of the sound by looking at the "gap" between peaks and valleys in the frequency spectrum.

* **Definition:** For each frequency sub-band, it calculates the difference between the peaks (high energy) and valleys (low energy).
* **Physical Meaning:**
  - **High Contrast:** Indicates a "clear" sound with distinct harmonics (like a piano or a violin).
  - **Low Contrast:** Indicates a "noisy" or "muddy" sound where the energy is spread flat (like white noise, rain, or a snare drum).
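
The peak-versus-valley idea can be sketched for a single frequency band as follows. This is a simplified illustration, not `librosa`'s implementation: the helper `band_contrast`, the `quantile` parameter, and the toy spectra are all assumptions made for the example.

```python
import numpy as np

def band_contrast(magnitudes, quantile=0.2):
    """Log-ratio between the strongest and weakest bins of one frequency band (simplified)."""
    sorted_mag = np.sort(np.asarray(magnitudes, dtype=float))
    k = max(1, int(round(quantile * sorted_mag.size)))
    valley = sorted_mag[:k].mean()   # average of the weakest bins
    peak = sorted_mag[-k:].mean()    # average of the strongest bins
    return float(np.log(peak + 1e-10) - np.log(valley + 1e-10))

# Toy band with energy spread evenly (noise-like)
flat_band = np.ones(50)

# Toy band with a few strong harmonics over a quiet floor (tone-like)
peaky_band = np.full(50, 0.01)
peaky_band[[5, 25, 45]] = 1.0

flat_contrast = band_contrast(flat_band)
peaky_contrast = band_contrast(peaky_band)
print(f"flat: {flat_contrast:.2f}, peaky: {peaky_contrast:.2f}")
```

The flat band yields a contrast of zero while the band with distinct harmonics yields a clearly positive value, which mirrors the high/low contrast distinction described above.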


**Key Uses:**
* **Music Genre Classification:** Jazz and Classical usually have higher contrast than heavy metal or distorted rock.
* **Music vs. Speech:** Helping algorithms distinguish between a person talking and an instrument playing.
* **Audio Fingerprinting:** Identifying specific sounds based on their unique "texture."


In [None]:
def extract_spectral_contrast(dataset):
    spectral_contrast_features = []

    for entry in tqdm(dataset, desc="Extracting Spectral Contrast Features"):
        audio_array = entry['audio']['array']
        sampling_rate = entry['audio']['sampling_rate']

        # Compute spectral contrast
        S = np.abs(librosa.stft(audio_array))
        spectral_contrast = librosa.feature.spectral_contrast(sr=sampling_rate, S=S)

        # print(spectral_contrast.shape)

        # Calculate mean and variance
        sc_mean = np.mean(spectral_contrast, axis=1)
        sc_var = np.var(spectral_contrast, axis=1)

        feature_dict = {}
        for i in range(len(sc_mean)):
            feature_dict[f'spectral_contrast_mean{i+1}'] = sc_mean[i]

        for i in range(len(sc_var)):
            feature_dict[f'spectral_contrast_var{i+1}'] = sc_var[i]

        feature_dict['genre'] = entry['genre']

        spectral_contrast_features.append(feature_dict)

    return spectral_contrast_features

In [None]:
# extract_spectral_contrast(filtered_dataset.select(range(2)))


**2. RMS (Root-Mean-Square) Energy**

RMS is the standard way to measure the **instantaneous loudness** or energy of an audio signal.

* **Definition:** It calculates the square root of the arithmetic mean of the squares of the signal values.
* **Physical Meaning:** It represents how "strong" the vibrations are. Unlike a simple peak-to-peak measurement, RMS correlates better with how humans perceive volume.
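
The definition above can be written directly in NumPy. The sketch below is an assumption-laden simplification: the helper `frame_rms` is hypothetical and slices frames without padding, whereas `librosa.feature.rms` centers and pads frames by default, so values near the clip edges will differ.

```python
import numpy as np

def frame_rms(signal, frame_length=2048, hop_length=512):
    """RMS per frame: square root of the mean of the squared samples (no padding)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(signal[i * hop_length:i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])

sr = 22050
t = np.arange(sr) / sr
loud_tone = np.sin(2 * np.pi * 440 * t)   # amplitude 1.0
quiet_tone = 0.1 * loud_tone              # same tone, ten times quieter

rms_loud = frame_rms(loud_tone)
rms_quiet = frame_rms(quiet_tone)
print(rms_loud.mean(), rms_quiet.mean())
```

A sine wave of amplitude 1 has an RMS close to 0.707, and scaling the signal by 0.1 scales the RMS by the same factor, which is why RMS statistics act as a loudness feature.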

**Key Uses:**
* **Silence Detection:** Identifying segments of audio where no sound is occurring.
* **Audio Normalization:** Adjusting different tracks to have the same average volume.
* **Event Detection:** Finding "hits" or "beats" in a track by looking for sudden rises in energy.

In [None]:
def extract_rms(dataset):
    rms_features = []

    for entry in tqdm(dataset, desc="Extracting RMS Features"):
        audio_array = entry['audio']['array']

        rms = librosa.feature.rms(y=audio_array)
        # print(rms.shape)
        # Calculate mean and variance
        rms_mean = np.mean(rms)
        rms_var = np.var(rms)

        feature_dict = {
            'rms_mean': rms_mean,
            'rms_var': rms_var,
            'genre': entry['genre']
        }
        rms_features.append(feature_dict)

    return rms_features

In [None]:
# extract_rms(filtered_dataset.select(range(2)))

**Explore and add other features useful for classification (10 points)**

This can be done in a single function or in separate functions. Select some features and provide a description for each one. You can choose as many features as you like.

**Zero Crossing Rate (ZCR)** is a simple yet powerful feature that describes the "tonal" versus "noisy" character of a sound.


The **Zero Crossing Rate** is the rate at which the signal changes sign—from positive to negative or vice versa—divided by the length of the frame.

* **Physical Meaning:** It essentially measures how many times the waveform "crosses" the zero-axis.
* **Correlation to Frequency:** For a simple sine wave, the ZCR correlates directly to the frequency (pitch). For complex signals, it acts as a proxy for the **dominant frequency** or the "brightness" of the sound.
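
A minimal sketch of the link between ZCR and frequency, computed over a whole signal rather than per frame (the helper `zero_crossing_rate` below is a hypothetical stand-in, not the `librosa` function). For a pure sine wave of frequency `f` sampled at `sr`, roughly `2 * f / sr` of consecutive sample pairs change sign.

```python
import numpy as np

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.asarray(signal) >= 0
    return float(np.mean(signs[1:] != signs[:-1]))

sr = 22050
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 220 * t)
high_tone = np.sin(2 * np.pi * 440 * t)

zcr_low = zero_crossing_rate(low_tone)
zcr_high = zero_crossing_rate(high_tone)
print(f"220 Hz: {zcr_low:.4f}, 440 Hz: {zcr_high:.4f}")
```

Doubling the pitch roughly doubles the rate, which is why ZCR works as a cheap proxy for brightness in more complex signals.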


**Key Applications**

1. **Voiced vs. Unvoiced Speech:**
    * **Voiced sounds** (like "a", "e", "o") have low ZCR because they are periodic and lower in frequency.
    * **Unvoiced sounds** (like "s", "sh", "f") have high ZCR because they are essentially white noise with high-frequency components.


2. **Percussive vs. Harmonic Separation:** It helps distinguish between sharp, noisy hits (high ZCR) and melodic notes (low ZCR).
3. **Genre Classification:** Rock or Metal music often has a higher average ZCR compared to Classical music due to distortion and heavy percussion.


In [None]:
def extract_zcr_features(dataset):

    zcr_features = []

    for entry in tqdm(dataset, desc="Extracting ZCR Features"):
        audio_array = entry['audio']['array']
        sampling_rate = entry['audio']['sampling_rate']

        # Compute Zero Crossing Rate
        zcr = librosa.feature.zero_crossing_rate(y=audio_array)
        # print(zcr.shape)
        # Calculate mean and variance
        zcr_mean = np.mean(zcr)
        zcr_var = np.var(zcr)

        feature_dict = {
            'zcr_mean': zcr_mean,
            'zcr_var': zcr_var,
            'genre': entry['genre']
        }
        zcr_features.append(feature_dict)

    return zcr_features

In [None]:
# extract_zcr_features(filtered_dataset.select(range(2)))

**Chroma STFT** (Short-Time Fourier Transform) is a powerful tool used to represent the **harmonic content** of an audio signal. It projects the entire spectrum into 12 bins representing the 12 distinct semitones (or "chroma") of the musical octave (C, C\#, D, D\#, E, F, F\#, G, G\#, A, A\#, B).


Chroma features are based on the concept of **octave equivalence**, meaning that notes separated by one or more octaves are treated as the same. For example, a "Middle C" and a "High C" are both collapsed into the single "C" bin.

* **Function:** `librosa.feature.chroma_stft(y=y, sr=sr)`
* **Input:** The raw audio or a power spectrogram.
* **Output:** A "Chromagram," which shows the intensity of each of the 12 semitones over time.
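
Octave equivalence can be illustrated by mapping a frequency to its pitch class. The sketch assumes equal temperament with A4 tuned to 440 Hz, and the helper `pitch_class` is hypothetical; it only shows the folding idea and is not how `librosa` builds a chromagram.

```python
import numpy as np

PITCH_CLASSES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to one of 12 pitch classes (assumes equal temperament, A4 = 440 Hz)."""
    midi_note = int(round(69 + 12 * np.log2(freq_hz / a4_hz)))
    return PITCH_CLASSES[midi_note % 12]  # modulo 12 discards the octave

middle_c = pitch_class(261.63)  # C4
high_c = pitch_class(523.25)    # C5, one octave higher
print(middle_c, high_c, pitch_class(440.0))
```

Both C notes land in the same bin because the modulo operation throws away the octave, which is exactly the information a chromagram ignores.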


**Key Characteristics**

1. **Pitch Class Profile:** It ignores which octave a note is in and focuses strictly on the **pitch class**. This makes it incredibly robust for analyzing melody and harmony.
2. **Timbre Invariance:** Because it collapses octaves, Chroma STFT is relatively "blind" to the specific instrument playing. A piano playing a C major chord and a guitar playing a C major chord will produce very similar chromagrams.
3. **Intensity Mapping:** The values in the chromagram represent the energy present in that specific pitch class at a specific time.


**Key Applications**

* **Chord Recognition:** Since chords are combinations of specific pitch classes (e.g., C Major is C, E, and G), Chroma STFT is the primary feature used to identify chords in a song.
* **Cover Song Identification:** It can identify that two songs are the same melody even if played in different octaves or by different instruments.
* **Music Alignment:** It helps sync audio files with MIDI files by matching the harmonic progressions.
* **Key Detection:** Analyzing the most dominant chroma bins over a whole track helps determine the musical key (e.g., G Minor).



In [None]:
def extract_chroma_features(dataset):

    chroma_features = []

    for entry in tqdm(dataset, desc="Extracting Chroma Features"):
        audio_array = entry['audio']['array']
        sampling_rate = entry['audio']['sampling_rate']

        # Compute Chromagram
        chroma = librosa.feature.chroma_stft(y=audio_array, sr=sampling_rate)
        # print(chroma.shape)

        # Calculate mean and variance for each chroma bin (12 bins)
        chroma_mean = np.mean(chroma, axis=1)
        chroma_var = np.var(chroma, axis=1)

        feature_dict = {}
        for i in range(12):
            feature_dict[f'chroma_mean{i+1}'] = chroma_mean[i]
        for i in range(12):
            feature_dict[f'chroma_var{i+1}'] = chroma_var[i]

        feature_dict['genre'] = entry['genre']

        chroma_features.append(feature_dict)

    return chroma_features

In [None]:
# extract_chroma_features(filtered_dataset.select(range(2)))

| Feature | Category | What it tells the AI |
| --- | --- | --- |
| **RMS** | Energy | "How loud is the sound?" |
| **ZCR** | Temporal | "Is it noisy/hissing or smooth?" |
| **Spectral Contrast** | Spectral | "Is it a clear note or muddy noise?" |
| **Chroma STFT** | Harmonic | "What musical notes/chords are being played?" |
| **MFCCs** | Cepstral | "What is the specific 'texture' or 'voice' of the sound?" |


Other features can be explored at the following link:

https://librosa.org/doc/latest/feature.html

**(ADV) Analyze correlation of the features (5 points)**

Plot a correlation diagram and conclude which features are too highly correlated and could be removed.

In [None]:
mfcc_features_list = extract_mfcc_features(filtered_dataset, n_mfcc=8)
spectral_contrast_features_list = extract_spectral_contrast(filtered_dataset)
rms_features_list = extract_rms(filtered_dataset)
zcr_features_list = extract_zcr_features(filtered_dataset)
chroma_features_list = extract_chroma_features(filtered_dataset)

In [None]:
# Convert lists of dictionaries to DataFrames
df_mfcc = pd.DataFrame(mfcc_features_list)
df_spectral_contrast = pd.DataFrame(spectral_contrast_features_list).drop(columns=['genre'])
df_rms = pd.DataFrame(rms_features_list).drop(columns=['genre'])
df_zcr = pd.DataFrame(zcr_features_list).drop(columns=['genre'])
df_chroma = pd.DataFrame(chroma_features_list).drop(columns=['genre'])

# Merge all feature DataFrames
# Assuming the order of entries is preserved across extraction functions
df0 = pd.concat([
    df_mfcc,
    df_spectral_contrast,
    df_rms,
    df_zcr,
    df_chroma
], axis=1)

df = df0.drop(columns=['genre'])
corr = df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

# Correlation threshold: pairs with |r| above this value are treated as redundant
threshold = 0.9

upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

to_drop = set()
pairs = []

for col in upper.columns:
    # features highly correlated with `col`
    high_corr = upper.index[upper[col].abs() > threshold].tolist()
    for other in high_corr:
        # decide which to drop (here: drop `col`, keep `other`)
        to_drop.add(col)
        pairs.append((col, other))

print("Highly correlated pairs (deleted, kept):")
for deleted, kept in pairs:
    print(f"delete: {deleted}  |  keep: {kept}")

to_drop = list(to_drop)
print("\nFinal features to delete:", to_drop)

df_reduced = df.drop(columns=to_drop)

**(ADV) Analyze feature importance (5 points)**

Analyze the importance of each feature used in the model to understand which variables have the greatest impact on predictions.

**Tree-Based Importance**

In machine learning (especially for audio classification using the features we discussed), **Tree-Based Importance** refers to techniques used by decision tree algorithms (like **Random Forest** or **XGBoost**) to rank which features were most useful in making a prediction.

If you have a dataset containing RMS, ZCR, Chroma, and MFCCs, Tree-Based Importance tells you which of those specific values actually helped the model distinguish between, for example, "Jazz" and "Rock."


**1. How It Works (The Mechanics)**

Decision trees make splits based on features that most effectively decrease **impurity** (usually Gini Impurity or Entropy).

* **Gini Importance (Mean Decrease Impurity):** Every time a feature is used to split a node, the algorithm calculates how much the "purity" of the samples increased. The more a feature improves purity across all trees in a forest, the higher its importance score.
* **Permutation Importance:** A more robust method where a single feature's values are randomly shuffled. If the model's accuracy drops significantly after shuffling "MFCC-1," then MFCC-1 is highly important.
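
To make "decrease in impurity" concrete, the following sketch computes Gini impurity for a node and the impurity decrease achieved by a split. The helper names and the toy labels are assumptions for illustration; real tree libraries additionally weight each decrease by the share of samples reaching the node and accumulate it across all trees.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / labels.size
    return float(1.0 - np.sum(proportions ** 2))

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children

parent = ['jazz', 'jazz', 'rock', 'rock']

# A split that separates the genres perfectly
good_split = impurity_decrease(parent, ['jazz', 'jazz'], ['rock', 'rock'])

# A split that leaves both children as mixed as the parent
useless_split = impurity_decrease(parent, ['jazz', 'rock'], ['jazz', 'rock'])
print(good_split, useless_split)
```

A feature that repeatedly produces splits like the first one accumulates a high importance score, while a feature that only produces splits like the second contributes nothing.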



**2. Interpreting Importance for Audio**

When you run a tree-based model on the librosa features we've covered, you often see patterns like these:

* **High MFCC Importance:** Usually means the model is relying on **timbre** (e.g., distinguishing a human voice from a guitar).
* **High Chroma Importance:** Usually means the model is relying on **musical keys or harmony** (e.g., distinguishing a Major key pop song from a Minor key blues song).
* **High ZCR Importance:** Usually means the model is focusing on **percussive vs. melodic** content (e.g., detecting drum-heavy sections).


**3. Why Use It? (Feature Selection)**

1. **Dimensionality Reduction:** If your `Chroma_STFT` features all have near-zero importance, you can remove them to make your model faster and less prone to overfitting.
2. **Model Explainability:** It allows you to tell a "story" about your data (e.g., "Our AI identifies bird species primarily based on Spectral Contrast and ZCR").

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(df_reduced, df0['genre'])
importances = model.feature_importances_
indices = np.argsort(importances)[::-1][:10]  # Top 10
# The model was fitted on df_reduced, so the indices must be looked up in its columns
print("Top Ten Features using Tree-Based Importance:", df_reduced.columns[indices])

**Permutation Importance**

While **Tree-Based Importance** (Gini Importance) looks at how the model was *built*, **Permutation Importance** looks at how the model *performs* when specific information is taken away.

It is often considered more reliable because it is "model-agnostic"—you can use it on Random Forests, SVMs, or even Neural Networks.



**1. How It Works: The "Shuffle" Test**

The core idea is to measure how much the model's accuracy drops when you "break" a specific feature.

1. **Train a model** as you normally would on your audio features (RMS, MFCC, etc.).
2. **Record the baseline accuracy** on a validation set.
3. **Shuffle one feature:** Take a single column (e.g., `ZCR`) and randomly reorder its values across the rows. This keeps the distribution of the data the same but destroys its relationship with the target label.
4. **Re-calculate accuracy:** If the accuracy drops significantly, that feature was **important**. If the accuracy barely changes, the model wasn't really using that feature to make decisions.
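
The four steps above can be sketched by hand in NumPy. Everything here is an illustrative assumption: the synthetic two-feature data, the tiny nearest-centroid "model", and the choice of 10 repeats. In practice `sklearn.inspection.permutation_importance` performs the same procedure for any fitted estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: feature 0 separates the classes, feature 1 is pure noise
n = 400
y = rng.integers(0, 2, size=n)
X = np.column_stack([4.0 * y + rng.normal(0, 0.5, size=n), rng.normal(0, 1.0, size=n)])
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

# Step 1: "train" a nearest-centroid model (one mean vector per class)
centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])

def predict(features):
    distances = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def accuracy(features, labels):
    return float(np.mean(predict(features) == labels))

# Step 2: baseline accuracy on the validation set
baseline = accuracy(X_val, y_val)

# Steps 3 and 4: shuffle one column at a time and record the accuracy drop
importances = []
for j in range(X_val.shape[1]):
    drops = []
    for _ in range(10):
        X_shuffled = X_val.copy()
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        drops.append(baseline - accuracy(X_shuffled, y_val))
    importances.append(float(np.mean(drops)))

print(f"baseline={baseline:.3f}, importances={importances}")
```

Shuffling the informative column should cost the model a large share of its accuracy, while shuffling the noise column should leave it essentially unchanged.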

**2. Why it’s better for Audio Features**

In audio analysis, features are often highly correlated. For example, **RMS** and **Spectral Centroid** might both increase during a loud, bright drum hit.

* **The Flaw in Gini Importance:** Standard tree-based importance can be biased toward "high-cardinality" features (features with many unique floating-point values) like MFCCs, even if they aren't the most predictive.
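* **The Permutation Advantage:** It measures how much the model actually depends on a feature. If shuffling `MFCC_1` kills your model's performance but shuffling `RMS` does nothing, you know your model is relying on timbre, not volume. One caveat: when two features are strongly correlated, shuffling one may barely hurt the model because the other still carries the same information, so both can look less important than they are.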

**3. Comparison Table**

| Feature | Gini Importance (Default Tree) | Permutation Importance |
| --- | --- | --- |
| **Speed** | Extremely fast (calculated during training). | Slower (requires multiple re-evaluations). |
| **Bias** | Biased toward high-cardinality/continuous features. | Not tied to cardinality, but can understate features that are strongly correlated with others. |
| **Data used** | Uses Training Data. | Usually uses Validation/Test Data (shows generalization). |
| **Reliability** | Good for a quick glance. | Generally the more trustworthy guide for feature selection. |




In [None]:
from sklearn.inspection import permutation_importance
# Note: the model was fitted on this same data, so these scores describe the training set.
# Scoring on a held-out split would better reflect which features help generalization.
result = permutation_importance(model, df_reduced, df0['genre'], n_repeats=10, random_state=42)
perm_mean = result.importances_mean
perm_indices = np.argsort(perm_mean)[::-1]  # Highest importance first
top_features = df_reduced.columns[perm_indices][:10]
print("Top Ten Features using Permutation Importance:", top_features)

### Train a classifier


Let's import all necessary functions and classes from sklearn

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [None]:
# include here any additional libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import learning_curve

Function to extract features dataset from initial audio dataset

In [None]:
def prepare_dataset(dataset, to_drop, n_mfcc):
    mfcc_features_list = extract_mfcc_features(dataset, n_mfcc=n_mfcc)

    # here you can insert feature extraction functions calls
    spectral_contrast_features_list = extract_spectral_contrast(dataset)
    rms_features_list = extract_rms(dataset)
    zcr_features_list = extract_zcr_features(dataset)
    chroma_features_list = extract_chroma_features(dataset)

    # Convert lists of dictionaries to DataFrames
    df_mfcc = pd.DataFrame(mfcc_features_list)
    df_spectral_contrast = pd.DataFrame(spectral_contrast_features_list).drop(columns=['genre'])
    df_rms = pd.DataFrame(rms_features_list).drop(columns=['genre'])
    df_zcr = pd.DataFrame(zcr_features_list).drop(columns=['genre'])
    df_chroma = pd.DataFrame(chroma_features_list).drop(columns=['genre'])

    # Merge all feature DataFrames
    # Assuming the order of entries is preserved across extraction functions
    df = pd.concat([
        df_mfcc,
        df_spectral_contrast,
        df_rms,
        df_zcr,
        df_chroma
    ], axis=1)

    # Ensure 'genre' is the target variable
    # The 'genre' column from mfcc_features_list is used as it's consistent
    df['genre'] = df['genre'].fillna(-1) # Handle missing target values if any

    X = df.drop(columns=['genre'])
    X = X.drop(columns=to_drop)
    y = df['genre']

    return X, y

Now let's prepare train and test data

In [None]:
X, y = prepare_dataset(filtered_dataset, to_drop, n_mfcc=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

The best practice is to use a pipeline because it allows us to streamline preprocessing steps (like scaling) and model training into a single workflow. This ensures that all steps are applied consistently during both training and testing, preventing data leakage and simplifying cross-validation and hyperparameter tuning.
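
As a small sketch of why this matters for cross-validation, the example below passes a whole pipeline to `cross_val_score`, so the scaler is refitted on the training portion of every fold and never sees the held-out portion. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions for illustration, not part of the assignment data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature table
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Scaling statistics are computed inside each fold, on the training part only
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Scaling the full dataset once before splitting would let statistics from the validation folds leak into training, which tends to make the reported scores slightly optimistic.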

In [None]:
# Logistic Regression Pipeline
pipeline_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

pipeline_lr.fit(X_train, y_train)
y_pred = pipeline_lr.predict(X_test)

accuracy_lr = accuracy_score(y_test, y_pred)
print(f"{pipeline_lr['classifier'].__class__.__name__}: {accuracy_lr:.5f}")

**Train other models and compare their performance (10 points)**

Try training different models (at least 3) from sklearn and boosting libraries (e.g. Random Forest, SVM, Gradient Boosting, etc)

In [None]:
# K-Nearest Neighbors Pipeline
pipeline_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

pipeline_knn.fit(X_train, y_train)
y_pred_knn = pipeline_knn.predict(X_test)

accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"{pipeline_knn['knn'].__class__.__name__} Accuracy Score: {accuracy_knn:.5f}")

In [None]:
# Gradient Boosting Pipeline
pipeline_gb = Pipeline([
    ('scaler', StandardScaler()),
    ('gb', GradientBoostingClassifier(random_state=42))
])

pipeline_gb.fit(X_train, y_train)
y_pred_gb = pipeline_gb.predict(X_test)

accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"{pipeline_gb['gb'].__class__.__name__} Accuracy Score: {accuracy_gb:.5f}")

In [None]:
# Random Forest Pipeline
pipeline_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42))
])

pipeline_rf.fit(X_train, y_train)
y_pred_rf = pipeline_rf.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"{pipeline_rf['rf'].__class__.__name__} Accuracy Score: {accuracy_rf:.5f}")

In [None]:
# SVM Pipeline
pipeline_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(gamma='auto'))
])

pipeline_svm.fit(X_train, y_train)
y_pred_svm = pipeline_svm.predict(X_test)

accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"{pipeline_svm['svc'].__class__.__name__} Accuracy Score: {accuracy_svm:.5f}")
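
The cells above repeat the same fit/predict/score steps for each model. One possible way to compare several models side by side is to loop over them and collect the scores in a table. The sketch below is only an illustration on synthetic data (an assumption; `candidate_models` and `comparison` are hypothetical names), not the notebook's own method.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the audio feature table (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

candidate_models = {
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=5),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    'SVC': SVC(gamma='auto'),
}

rows = []
for name, model in candidate_models.items():
    # Same preprocessing for every model so the comparison is like-for-like
    pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    pipe.fit(X_tr, y_tr)
    rows.append({'Model': name, 'Accuracy': accuracy_score(y_te, pipe.predict(X_te))})

# Best model first
comparison = (
    pd.DataFrame(rows)
    .sort_values('Accuracy', ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```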

**(ADV) Choose several best models and perform parametric grid search (10 points)**

Choose several of the best-performing models from your previous experiments and tune their hyperparameters using a parametric grid search. Compare the results and discuss which combination performs best.

*Logistic Regression, Random Forest, and SVM* are chosen for grid search.

In [None]:
# Logistic Regression grid search

# Define hyperparameter grid
param_grid = {
    'classifier__C': [1,2,3],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear']
}

# Perform grid search
grid_lr = GridSearchCV(pipeline_lr, param_grid, scoring='accuracy')
grid_lr.fit(X_train, y_train)

print("LogisticRegression grid search:")
print(f"\nBest parameters: {grid_lr.best_params_}")
print("Best score:", grid_lr.best_score_)


# Predict and evaluate
y_pred_lr = grid_lr.predict(X_test)
accuracy_grid_lr = accuracy_score(y_test, y_pred_lr)

print(f"\nAccuracy with grid search: {accuracy_grid_lr:.5f}")
print(f"Accuracy without grid search: {accuracy_lr:.5f}")
print(f'Improvement in accuracy = {accuracy_grid_lr - accuracy_lr:.5f}')

In [None]:
# Random Forest grid search
param_grid_rf = {
    'rf__n_estimators': [100, 300],
    'rf__max_depth': [None, 1,2],
    'rf__min_samples_split': [2, 3],
    'rf__min_samples_leaf': [1,2,3]
}

grid_rf = GridSearchCV(pipeline_rf, param_grid_rf, scoring='accuracy')
grid_rf.fit(X_train, y_train)

print("RandomForestClassifier grid search:")
print("\nBest params:", grid_rf.best_params_)
print("Best score:", grid_rf.best_score_)

y_pred_rf = grid_rf.predict(X_test)
accuracy_grid_rf = accuracy_score(y_test, y_pred_rf)

print(f"\nAccuracy with grid search: {accuracy_grid_rf:.5f}")
print(f"Accuracy without grid search: {accuracy_rf:.5f}")
print(f'Improvement in accuracy = {accuracy_grid_rf - accuracy_rf:.5f}')

In [None]:
# SVC grid search

param_grid_svm = {
    'svc__C': [1, 2, 3, 6],
    'svc__gamma': [0.015, 0.01, 0.005],
    'svc__kernel': ['rbf','sigmoid','poly','linear']
}

grid_svm = GridSearchCV(pipeline_svm, param_grid_svm, scoring='accuracy')
grid_svm.fit(X_train, y_train)

print("SVC grid search:")
print("\nBest params:", grid_svm.best_params_)
print("Best Grid search score:", grid_svm.best_score_)

y_pred_svm = grid_svm.predict(X_test)
accuracy_grid_svm = accuracy_score(y_test, y_pred_svm)

print(f"\nAccuracy with grid search: {accuracy_grid_svm:.5f}")
print(f"Accuracy without grid search: {accuracy_svm:.5f}")
print(f'Improvement in accuracy = {accuracy_grid_svm - accuracy_svm:.5f}')
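
To discuss which combination performs best, it can help to look at every combination and not only the winner. A minimal sketch, assuming synthetic data and a deliberately small hypothetical grid (`demo_grid`), shows how the `cv_results_` attribute of a fitted `GridSearchCV` can be turned into a ranked table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the audio feature table (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=6, n_classes=3, random_state=42
)

demo_pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
demo_grid = {'svc__C': [1, 3], 'svc__kernel': ['rbf', 'linear']}  # 4 combinations

search = GridSearchCV(demo_pipeline, demo_grid, scoring='accuracy', cv=3)
search.fit(X_demo, y_demo)

# One row per parameter combination, best rank first
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
ranking = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(ranking)
```

The `std_test_score` column gives a first impression of how stable each combination is across folds.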

**(ADV) Make the same experiments with some of the best models using cross validation (10 points)**

Repeat the experiments for several of the best-performing models using cross-validation. Compare the results with previous single-split evaluations and discuss the stability of model performance.

In [None]:
# Evaluate using the best models found above
best_lr = grid_lr.best_estimator_
best_rf = grid_rf.best_estimator_
best_svm = grid_svm.best_estimator_

# Single Split Score
ss_lr = best_lr.score(X_test, y_test)
ss_rf = best_rf.score(X_test, y_test)
ss_svm = best_svm.score(X_test, y_test)

# Cross-Validation Score (Stability Check)
cv_lr = cross_val_score(best_lr, X_train, y_train, cv=10)
cv_rf = cross_val_score(best_rf, X_train, y_train, cv=10)
cv_svm = cross_val_score(best_svm, X_train, y_train, cv=10)

# Comparison Table
results = pd.DataFrame({
    'Model': ['Logistic Regression','Random Forest', 'SVM'],
    'Single Split': [ss_lr,ss_rf, ss_svm],
    'CV Mean': [cv_lr.mean(),cv_rf.mean(), cv_svm.mean()],
    'CV Std Dev (Stability)': [cv_lr.std(),cv_rf.std(), cv_svm.std()]
})
print(results)
print("The Support Vector Machine (SVM) is the recommended model: it has high accuracy and stability")
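
Note that the cross-validation scores above are computed for estimators whose hyperparameters were already tuned on `X_train`, so the CV mean may be slightly optimistic. One common alternative, shown here as a hedged sketch on synthetic data and not as what the notebook does, is nested cross-validation: the grid search runs inside each outer fold. The names `inner_cv`, `outer_cv`, and `demo_grid` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the audio feature table (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=42
)

demo_pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
demo_grid = {'svc__C': [1, 3, 6]}

# Inner folds choose hyperparameters; outer folds estimate generalization
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

tuned_model = GridSearchCV(demo_pipeline, demo_grid, scoring='accuracy', cv=inner_cv)
nested_scores = cross_val_score(tuned_model, X_demo, y_demo, cv=outer_cv)

print(f"Nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```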

In [None]:
# Learning curve for the tuned Logistic Regression model
train_sizes, train_scores, test_scores = learning_curve(
    best_lr, X_train, y_train, cv=10, n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring='accuracy'
)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(10, 6))
plt.title("Learning Curve (Logistic Regression)")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.grid()

plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")

plt.legend(loc="best")
plt.show()

In [None]:
# Learning curve for the tuned Random Forest model
train_sizes, train_scores, test_scores = learning_curve(
    best_rf, X_train, y_train, cv=10, n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring='accuracy'
)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(10, 6))
plt.title("Learning Curve (Random Forest)")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.grid()

plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")

plt.legend(loc="best")
plt.show()

**The training score is 1 for Random Forest, which may indicate overfitting.**

In [None]:
# Learning curve for the tuned SVM model
train_sizes, train_scores, test_scores = learning_curve(
    best_svm, X_train, y_train, cv=10, n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring='accuracy'
)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(10, 6))
plt.title("Learning Curve (SVM)")
plt.xlabel("Training Examples")
plt.ylabel("Score")
plt.grid()

plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")

plt.legend(loc="best")
plt.show()
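
The three learning-curve cells differ only in the estimator and the title, so they could be folded into one helper. The sketch below is one possible refactoring, shown on synthetic data (an assumption); `plot_learning_curve` is a hypothetical helper, not part of sklearn. It also returns the final gap between training and cross-validation accuracy, which gives a rough numeric handle on the bias/variance discussion.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def plot_learning_curve(estimator, X, y, title, cv=5):
    """Plot mean +/- std of training and CV accuracy; return (figure, final gap)."""
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, scoring='accuracy',
        train_sizes=np.linspace(0.1, 1.0, 5), shuffle=True, random_state=42
    )
    train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
    test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.set_title(title)
    ax.set_xlabel("Training Examples")
    ax.set_ylabel("Score")
    ax.grid()
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                    alpha=0.1, color="r")
    ax.fill_between(train_sizes, test_mean - test_std, test_mean + test_std,
                    alpha=0.1, color="g")
    ax.plot(train_sizes, train_mean, 'o-', color="r", label="Training score")
    ax.plot(train_sizes, test_mean, 'o-', color="g", label="Cross-validation score")
    ax.legend(loc="best")

    # A large final gap hints at high variance; two low, close curves hint at high bias
    return fig, float(train_mean[-1] - test_mean[-1])


# Synthetic stand-in for the audio feature table (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=42
)
demo_model = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
])

fig, gap = plot_learning_curve(demo_model, X_demo, y_demo, "Learning Curve (demo)")
print(f"Final train/CV accuracy gap: {gap:.3f}")
```

In the notebook, the same helper could be called once each with `best_lr`, `best_rf`, and `best_svm` on `X_train, y_train`.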




## Assignment specifications and rubric

**Please use only the provided dataset: no external datasets are allowed.**

**The solution notebook MUST run without errors when executing all cells in Google Colab.**

For your analysis, focus only on classical machine learning algorithms available in *sklearn*, excluding artificial neural networks.

You are welcome to make models based on artificial neural networks too, but these will not be evaluated.

You are free to perform additional experiments or analyses beyond those explicitly mentioned in the exercises.

Exercises marked **(ADV)** are slightly more advanced and can be completed to earn extra points.
To achieve the minimum passing score, it is sufficient to complete only the base-level exercises.

### Main steps

1. **Data Visualization and Exploration**:
   - Visualize and analyze the dataset to gain insights into the distribution and characteristics of different features.

2. **Handle Unlabeled and Irrelevant Data**:
   - Investigate the dataset for unlabeled data.
   - Filter out irrelevant audios, especially those that are just zero signals or contain no meaningful information.

3. **Feature Engineering**:
   - Experiment with adding new features or refining existing ones.
   - Adjust feature extraction parameters to better capture the characteristics of the audio samples.
   - Rank the features and assess how many top-features to use.

4. **Apply Different Machine Learning Algorithms**:
   - Try various machine learning algorithms (e.g., Random Forest, SVM, Gradient Boosting\*) to improve performance.
   - Evaluate the models not only based on the **average accuracy**, but also consider the confusion matrix along with other **per-class** evaluation.
   - Explain which metrics are important for evaluating model performance, based on your findings from the exploratory data analysis.


\**Gradient Boosting is commonly implemented using specialized libraries like XGBoost, CatBoost, or LightGBM.*
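
For the per-class evaluation mentioned in step 4, a minimal sketch on synthetic data (an assumption; the class labels here are plain integers and not genre names) shows how a confusion matrix and a per-class report can be obtained with `sklearn.metrics`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the audio feature table (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=8, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

labels = np.unique(y_demo)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat, labels=labels)
print(cm)

# Precision, recall and F1 for each class, plus macro and weighted averages
print(classification_report(y_te, y_hat, labels=labels, zero_division=0))

# Per-class recall read directly off the confusion matrix
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```

When classes are imbalanced, macro-averaged scores and per-class recall tend to be more informative than overall accuracy alone.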

## Assignment Evaluation Criteria (maximum 100 points)

1. **Data Handling and Preprocessing (5 + 5 = 10 points)**
2. **Exploratory Data Analysis (5 + 5 = 10 points)**
3. **Feature Engineering (10 + 10 + 5 + 5 = 30 points)**
4. **Model Selection and Comparison (10 + 10 + 10 = 30 points)**
5. **Clarity, Creativity, and Originality (20 points)**



## Assignment submission instructions
Complete the assignment in your own copy of the notebook. Ensure that your notebook is runnable and free of errors. Once finished, test the notebook in Google Colab (to make sure it runs in that environment too). Then RUN-ALL the project and export both the **.ipynb** file and the **HTML** file. Finally, submit both files via the Teams Assignment section.

**The deadline is 14 days ahead of the oral exam.**

**Actual deadlines are updated in the Teams Assignment Portal.**

Calculate and plot the learning curves for the Logistic Regression, Random Forest, and Support Vector Machine (SVM) models using their respective best hyperparameters. Analyze these learning curves to understand each model's performance with varying training data sizes, discussing insights into bias, variance, and the impact of more data.