<a href="https://colab.research.google.com/github/kaiozwald/DV-Assignments/blob/main/Copy_of_music_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***N.B. this notebook has been tested in Google Colab***

# Music Genre Classification

In this project, you will build a machine learning algorithm to classify music genres using audio features. Starting with the provided dataset, your task is to develop a model that effectively solves this multiclass classification problem. Use the baseline notebook as a starting point and improve upon it.

***Overall Goal: to design a complete pipeline that improves accuracy from the current 37% to at least 70%, ideally approaching 80%***


## Introduction

In this project, you will work with a dataset of music samples from various genres. The dataset has been purposely left a bit messy, with some entries missing labels and others containing empty audio files. To start with, your task is to clean and explore the dataset, turning it into a well-organized resource for analysis.

This notebook includes a basic, "weak" baseline to get you started. It serves as a simple starting point, but it is neither thorough nor accurate. You are expected to build upon it, applying your own strategies to improve the data science pipeline (including data cleaning, curation, feature engineering, etc.) before moving into model building, parameter tuning, and model evaluation.

The formal details of the assignment are provided at the end of the notebook. To start with, focus on understanding the dataset and planning your strategies to tackle its challenges.

**We expect you to submit a modified version of this notebook with your improvements. Please download a copy of this assignment into your private Python programming environment before making any changes.**

## Baseline

Let's install all required dependencies:

- **datasets**: Access to large-scale datasets.
- **librosa**: Tools for audio analysis.
- **pandas** & **numpy**: Tabular data manipulation and numerical operations.
- **scikit-learn**: Machine learning algorithms and tools.
- **tqdm**: Progress bar.

You might be familiar with most of these already.

In [None]:
%%capture
!pip install datasets==3.5.0 librosa pandas numpy scikit-learn tqdm

And import the necessary modules:

In [None]:
import numpy as np
import pandas as pd
import librosa
from datasets import load_dataset
from IPython.display import Audio, display
# tqdm_notebook is deprecated; tqdm.notebook provides the notebook progress bar
from tqdm.notebook import tqdm


In [None]:
# Include here any additional modules that you might need
# ...
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

### Dataset Description

The dataset consists of music samples from various genres, including:

- **Genres**: `Blues`, `Classical`, `Country`, `Disco`, `HipHop`, `Jazz`, `Metal`, `Pop`, `Reggae`, and `Rock`.

The dataset is a bit messy and includes some **unlabeled data** and **empty audio files**. We have provided basic preprocessing, but more in-depth data cleaning, feature extraction, and preparation will be a part of your assignment.

Let's download the audio dataset using the Hugging Face datasets library.

In [None]:
dataset = load_dataset("unibz-ds-course/audio_assignment", split="train")

In [None]:
dataset

In [None]:
print(f"Num of samples in the dataset: {len(dataset)}")

Let's take a glance at a sample from the dataset

In [None]:
entry = dataset[10]

audio_array = entry['audio']['array']
sampling_rate = entry['audio']['sampling_rate']

print(f"Element: {entry}")
print(f"File Path: {entry['file']}")
print(f"Number of Samples: {len(audio_array)}")
print(f"Sampling Rate: {sampling_rate} Hz")

audio_length_seconds = len(audio_array) / sampling_rate
print(f"Audio Length: {audio_length_seconds:.2f} seconds")

genre_id = entry['genre']
genre_label = dataset.features['genre'].int2str(genre_id)
print(f"Genre (ID): {genre_id}")
print(f"Genre (Label): {genre_label}")

display(Audio(audio_array, rate=sampling_rate))

**Draw a plot with distribution of the classes (5 points)**

Create a visualization (e.g., bar chart or histogram) that shows how many samples belong to each class.
This helps identify whether the dataset is balanced or if some classes are underrepresented.

In [None]:
# write code to draw the distribution
# Collect a readable label for every sample; unlabeled entries are grouped under 'Unknown'
genre = []

for i in range(len(dataset)):
  entry = dataset[i]
  genre_id = entry['genre']
  genre_label = dataset.features['genre'].int2str(genre_id) if genre_id is not None else 'Unknown'
  genre.append(genre_label)

counts_dict = Counter(genre)
# Take labels and counts from the same Counter so that each bar is paired with its own count
unique_genres = list(counts_dict.keys())
counts_list = list(counts_dict.values())
colors = plt.cm.tab10_r(np.linspace(0, 1, len(unique_genres)))

plt.figure(figsize=(8, 5))
plt.bar(unique_genres, counts_list, color=colors)
plt.title('Number of samples per class')
plt.xlabel('Classes')
plt.ylabel('Number of Samples')
plt.xticks(rotation=45)
plt.show()

# The classes are generally balanced; no genre is clearly under-represented

**Draw distribution of lengths of audios (5 points)**

Plot the distribution of audio lengths in the dataset to analyze how durations vary across samples.

In [None]:
# write code to draw the distribution
audio_len_arr = []
for i in range(len(dataset)):
  entry = dataset[i]
  audio_array = entry['audio']['array']
  sampling_rate = entry['audio']['sampling_rate']

  audio_length_seconds = len(audio_array) / sampling_rate
  audio_len_arr.append(audio_length_seconds)

plt.figure(figsize=(8, 5))
plt.hist(audio_len_arr, color='skyblue')
plt.title('Distribution of Audio Lengths per sample')
plt.xlabel('Length (seconds)')
plt.ylabel('Number of Samples')
plt.show()

# Audio lengths mostly fall between 29.9 and 30.1 seconds; more than 500 samples are in this range

**Delete empty samples (5 points)**

Implement the function to remove empty samples (audios with silence only)

In [None]:
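def filter_empty_samples(entry):
    # Keep a sample only if at least one value exceeds the silence threshold
    threshold = 1e-6
    audio = np.asarray(entry['audio']['array'])
    # Vectorized check; much faster than a Python loop over every audio sample
    return bool(np.any(np.abs(audio) > threshold))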

In [None]:
filtered_dataset = dataset.filter(filter_empty_samples)

In [None]:
assert len(filtered_dataset) == 970, "Your filtering function is wrong"

If the assertion above fails, please check your function for bugs.

**Delete unlabeled samples (5 points)**

Implement the function to remove unlabeled samples.

In [None]:
def filter_unlabeled_samples(entry):
  return entry['genre'] is not None

In [None]:
filtered_dataset = filtered_dataset.filter(filter_unlabeled_samples)

In [None]:
assert len(filtered_dataset) == 848, "Your filtering function is wrong"

If the assertion above fails, please check your function for bugs.

Now we can extract some features from the dataset.

### **Mel Frequency Cepstral Coefficients**

**[Mel Frequency Cepstral Coefficients (MFCCs)](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum)** are commonly used in audio analysis to capture key features of sound. They help represent the important characteristics of an audio signal, making them ideal for tasks like music genre classification and speech recognition.

We're not going to dive deep into the complex details of audio processing, but it's useful to know that MFCCs help simplify raw audio data while retaining important information.

#### Basic Steps in MFCC Extraction:
1. **Frequency Domain Conversion**: The audio signal is split into short frames, and we apply the Fourier Transform to convert them from the time domain to the frequency domain.
2. **Mel Scale Mapping**: The frequency spectrum is converted to the Mel scale, which better represents how humans perceive sound, emphasizing lower frequencies.
3. **Logarithm and DCT**: After mapping to the Mel scale, we apply a logarithm and the Discrete Cosine Transform (DCT) to get the MFCCs. These summarize the "cepstral" information of the audio signal.

The parameter `n_mfcc` controls **how many MFCC coefficients** are extracted for each frame. For example, setting `n_mfcc=8` means we extract 8 coefficients, where lower coefficients capture broad audio features and higher coefficients capture finer details.
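
As a rough illustration of the three steps above, the following NumPy-only sketch computes simplified MFCCs for a synthetic 440 Hz tone. The helper names (`mel_filterbank`, `dct_ii`, `simple_mfcc`), the Mel formula variant, and all parameter values are assumptions made for this example. The numbers will not match `librosa.feature.mfcc`, which applies its own scaling and padding conventions.

```python
import numpy as np

def hz_to_mel(f):
    # One common Mel-scale formula (assumed here for illustration)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose centers are evenly spaced on the Mel scale
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct_ii(x, n_out):
    # Unnormalized DCT-II along axis 0, keeping the first n_out coefficients
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    t = np.arange(n)[None, :] + 0.5
    return np.cos(np.pi * k * t / n) @ x

def simple_mfcc(y, sr, n_mfcc=8, n_fft=1024, hop=512, n_mels=40):
    # Step 1: split into short windowed frames and move to the frequency domain
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 2: map the power spectrum onto the Mel scale
    mel_energy = mel_filterbank(n_mels, n_fft, sr) @ power.T
    # Step 3: logarithm followed by the DCT
    log_mel = np.log(mel_energy + 1e-10)
    return dct_ii(log_mel, n_mfcc)

sr = 22050
t = np.arange(sr) / sr                      # one second of audio
y = np.sin(2 * np.pi * 440.0 * t)           # synthetic 440 Hz tone
mfcc = simple_mfcc(y, sr, n_mfcc=8)
print(mfcc.shape)                           # (n_mfcc, number of frames)
```

Each column of the result describes one frame, so a longer clip gives more columns while the number of rows stays fixed at `n_mfcc`.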

#### Why MFCCs Are Important:
MFCCs help capture the **tonal quality** of the sound and reduce the complexity of the raw audio signal. By summarizing the audio into a smaller set of features, they allow machine learning models to classify and recognize different types of sounds more effectively.

In this notebook, we'll use the **mean** and **variance** of the MFCCs over time to create a robust feature set for our classification model. Adjusting the `n_mfcc` parameter allows us to control the number of features extracted for each audio sample.

#### **Additional Features**
Consider exploring additional audio features to enhance your model's performance. There are various acoustic properties you could extract from the audio signals, such as zero crossings, harmonic-percussive separation, tempo, spectral centroids, spectral rolloff, chromagram, RMS energy, spectral bandwidth, etc. When working with these features, it's often useful to compute summary statistics like the mean and variance across the audio sample. These summary statistics can capture the overall characteristics and variability of the feature, reducing the dimensionality of your data while retaining important information. Experimenting with these features and their statistical summaries could potentially improve your model's accuracy and robustness in distinguishing between different audio characteristics.
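
To make two of these features concrete, here is a small NumPy-only sketch that computes frame-wise RMS energy and zero-crossing rate by hand and then summarizes each with its mean and variance. The helper names, frame sizes, and the two synthetic signals are assumptions for illustration; in the notebook itself, librosa's feature functions would normally be used instead.

```python
import numpy as np

def frame_signal(y, frame_length=2048, hop_length=512):
    # Slice the signal into overlapping frames (rows)
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return y[idx]

def rms_per_frame(frames):
    # Root-mean-square amplitude of each frame: a proxy for loudness
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zcr_per_frame(frames):
    # Fraction of adjacent sample pairs whose signs differ
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def summarize(name, values):
    # Collapse a per-frame series into two fixed-size summary statistics
    return {f"{name}_mean": float(np.mean(values)), f"{name}_var": float(np.var(values))}

sr = 22050
t = np.arange(2 * sr) / sr
rng = np.random.default_rng(0)
signals = {
    "tone": 0.5 * np.sin(2 * np.pi * 220.0 * t),    # smooth, pitched signal
    "noise": 0.1 * rng.standard_normal(len(t)),     # quiet, noisy signal
}

results = {}
for label, y in signals.items():
    frames = frame_signal(y)
    results[label] = {
        **summarize("rms", rms_per_frame(frames)),
        **summarize("zcr", zcr_per_frame(frames)),
    }
    print(label, results[label])
```

The tone is louder but crosses zero rarely, while the noise is quieter but changes sign about every other sample, which is the kind of contrast these features can expose between genres.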

#### **Feature Analysis**
Don't forget to optimize the use of features by identifying and handling irrelevant and redundant features. Then use feature ranking to identify which features are more influential, and evaluate quantitatively how many top features to retain.
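
One possible way to evaluate how many top features to keep is to sweep the `k` of `SelectKBest` inside a pipeline and compare cross-validated accuracy. The sketch below runs on synthetic data from `make_classification`, not on the audio features; the candidate `k` values, the classifier, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table: 20 features, only some informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=5,
    n_classes=3, random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
k_scores = {}
for k in (2, 5, 10, 20):
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        # Feature ranking happens inside the pipeline, so it is refit on each training fold
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    k_scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy").mean()

for k, score in k_scores.items():
    print(f"k={k:2d}  mean CV accuracy={score:.3f}")

best_k = max(k_scores, key=k_scores.get)
print("Best k in this sweep:", best_k)
```

Keeping the selector inside the pipeline means the ranking never sees the validation fold, which gives a fairer estimate than selecting features once on the full dataset.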

In [None]:
# func to extract mfcc features
# describes how the sound sounds (tone/character)

def extract_mfcc_features(dataset, n_mfcc):
    mfcc_features = []

    # here we might have used Dataset.map method, unfortunately, it consumes extra memory and runs out of RAM in colab
    for entry in tqdm(dataset, desc="Extracting MFCC Features"):
        audio_array = entry['audio']['array']
        sampling_rate = entry['audio']['sampling_rate']

        mfcc = librosa.feature.mfcc(y=audio_array, sr=sampling_rate, n_mfcc=n_mfcc)

        mfcc_mean = np.mean(mfcc, axis=1)
        mfcc_var = np.var(mfcc, axis=1)

        feature_dict = {}

        for i in range(n_mfcc):
            feature_dict[f'mfcc_mean{i+1}'] = mfcc_mean[i]

        for i in range(n_mfcc):
            feature_dict[f'mfcc_var{i+1}'] = mfcc_var[i]

        feature_dict['genre'] = entry['genre']

        mfcc_features.append(feature_dict)

    return mfcc_features

Let's take a look at the output of the function. We will pass it just 2 samples from the dataset.

In [None]:
extract_mfcc_features(filtered_dataset.select(range(2)), n_mfcc=5)

The function generates `n_mfcc * 2` features for each sample. Consider analyzing their correlation with a matrix and experimenting with different `n_mfcc` values to observe how feature relationships change. While MFCC features are effective for audio analysis, you might also improve performance by incorporating additional features such as RMS or Spectral Contrast. Once you've explored these options, proceed to training the model using the extracted features.

**Implement functions to extract RMS and Spectral Contrast (10 points in total)**

Write functions to extract these features, add a description of each, and discuss why they might be useful in this assignment.

In [None]:
# func for extracting rms
# shows how loud the sound is
def extract_rms(entry):
    audio = entry["audio"]["array"]

    rms = librosa.feature.rms(y=audio)
    rms_mean = float(np.mean(rms))
    rms_var = float(np.var(rms))

    return {
        "rms_mean": rms_mean,
        "rms_var": rms_var
    }

In [None]:
# func for extracting spectral contrast
# measures the difference between spectral peaks and valleys in each frequency band

def extract_spectral_contrast(entry):
    audio = entry["audio"]["array"]
    sr = entry["audio"]["sampling_rate"]

    contrast = librosa.feature.spectral_contrast(y=audio, sr=sr)
    contrast_mean = np.mean(contrast, axis=1)
    contrast_var = np.var(contrast, axis=1)

    features = {}
    for i in range(contrast.shape[0]):
        features[f"contrast_mean{i+1}"] = contrast_mean[i]
        features[f"contrast_var{i+1}"] = contrast_var[i]

    return features

**Explore and add other features useful for classification (10 points)**

This can be done in a single function or in separate functions. Select some features and provide a description for each one. You can choose as many features as you like.

In [None]:
# write your code here

# func for extracting zcr
# how often the sound changes sign (roughness)
def extract_zcr(entry):
  audio = entry["audio"]["array"]
  zcr = librosa.feature.zero_crossing_rate(y=audio)
  zcr_mean = float(np.mean(zcr))
  zcr_var = float(np.var(zcr))
  return {
      "zcr_mean":zcr_mean,
      "zcr_var":zcr_var}

# func to extract hpss
# splits melody and beats
def extract_hpss_single(entry):
    audio = entry["audio"]["array"]

    harmonic, percussive = librosa.effects.hpss(audio)

    return {
        "harmonic_energy": float(np.mean(harmonic**2)),
        "percussive_energy": float(np.mean(percussive**2))
    }

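# func to extract tempo
# estimated speed of the music in beats per minute (BPM)
def extract_tempo_single(entry):
    audio = entry["audio"]["array"]
    sr = entry["audio"]["sampling_rate"]

    tempo, _ = librosa.beat.beat_track(y=audio, sr=sr)

    # Depending on the librosa version, tempo may be a scalar or a 1-element array;
    # np.atleast_1d handles both before converting to a plain float
    return {
        "tempo": float(np.atleast_1d(tempo)[0])
    }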

# func to extract spectral rolloff
# frequency below which most of the spectral energy lies (85% by default in librosa)
def extract_spectral_rolloff_single(entry):
    audio = entry["audio"]["array"]
    sr = entry["audio"]["sampling_rate"]

    rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr)[0]

    return {
        "rolloff_mean": float(np.mean(rolloff)),
        "rolloff_var": float(np.var(rolloff))
    }


**(ADV) Analyze correlation of the features (5 points)**

Plot correlation diagram and conclude which features are too much correlated and could be removed.

In [None]:
# write your code here
mfcc_features = extract_mfcc_features(filtered_dataset, n_mfcc=5)
feature_df = pd.DataFrame(mfcc_features)

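# Correlate the feature columns only; 'genre' is a class ID, not a numeric feature
corr = feature_df.drop(columns=['genre']).corr()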

plt.figure(figsize=(14,10))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.show()

**(ADV) Analyze feature importance (5 points)**

Analyze the importance of each feature used in the model to understand which variables have the greatest impact on predictions.

In [None]:
# write your code here

X = feature_df.drop('genre', axis=1)
y = feature_df['genre']

f_scores, p_values = f_classif(X, y)

anova_df = pd.DataFrame({
    'feature': X.columns,
    'f_score': f_scores,
    'p_value': p_values
}).sort_values('f_score', ascending=False)

print(anova_df)

# Interpretation:
# High F-score → feature distinguishes genres well → important
# High p-value (> 0.05) → not statistically useful → candidate for removal

In [None]:
# Use the feature columns only, so the 'genre' label can never be flagged and dropped
corr = feature_df.drop(columns=['genre']).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
df_reduced_corr = feature_df.drop(columns=to_drop)
df_reduced_corr
# Drops features with pairwise correlation above 0.9; in practice the features are not that highly correlated

In [None]:
# Same filter with a lower threshold, again computed on the feature columns only
corr_matrix = feature_df.drop(columns=['genre']).corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

to_drop = [col for col in upper.columns if any(upper[col] > 0.85)]
df_reduced_corr = feature_df.drop(columns=to_drop)
df_reduced_corr
# Repeated with a threshold of 0.85; the resulting feature set is the same

In [None]:
from sklearn.ensemble import RandomForestClassifier

X = df_reduced_corr.drop(columns=['genre'])
y = df_reduced_corr['genre']

# Fixed seed so that the importances are reproducible between runs
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
low_imp = importances[importances < 0.01].index.tolist()

df_final = df_reduced_corr.drop(columns=low_imp)

df_final
# Another approach: train a random forest on the features and drop those with importance below 0.01

In [None]:
# ANOVA → all features useful → keep all
# Correlation filtering → no redundancy → keep all
# RandomForest importance → no low-importance → keep all

### Train a classifier


Let's import all necessary functions and classes from sklearn

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [None]:
# include here any additional libraries
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import KFold, cross_val_score

Function to extract features dataset from initial audio dataset

In [None]:
# changed the func to one that suits my code
def prepare_dataset(dataset, n_mfcc):
    mfcc_list = extract_mfcc_features(dataset, n_mfcc)

    rows = []

    # Loop over dataset and MFCC features at the same time, so that we collect all features needed
    for entry, mfcc_feats in tqdm(zip(dataset, mfcc_list), total=len(dataset), desc="Preparing Dataset"):
        row = dict(mfcc_feats)   # start with MFCC features

        # Add RMS features
        rms_dict = extract_rms(entry)
        row.update(rms_dict)

        # Add Spectral Contrast
        contrast_dict = extract_spectral_contrast(entry)
        row.update(contrast_dict)

        # Add ZCR
        zcr_dict = extract_zcr(entry)
        row.update(zcr_dict)

        # Add HPSS
        hpss_dict = extract_hpss_single(entry)
        row.update(hpss_dict)

        # Add Tempo
        tempo_dict = extract_tempo_single(entry)
        row.update(tempo_dict)

        # Add Rolloff
        rolloff_dict = extract_spectral_rolloff_single(entry)
        row.update(rolloff_dict)

        rows.append(row)

    df = pd.DataFrame(rows)

    # Remove samples with missing labels instead of just replacing
    df = df.dropna(subset=["genre"])

    X = df.drop(columns=["genre"])
    y = df["genre"]

    return X, y


Now let's prepare train and test data

In [None]:
X, y = prepare_dataset(filtered_dataset, n_mfcc=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The best practice is to use a pipeline because it allows us to streamline preprocessing steps (like scaling) and model training into a single workflow. This ensures that all steps are applied consistently during both training and testing, preventing data leakage and simplifying cross-validation and hyperparameter tuning.
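
The leakage point can be checked directly on a small synthetic example: after fitting, the scaler inside the pipeline holds statistics computed from the training split only. The data, variable names, and the use of `stratify` below are assumptions for illustration and are not part of the baseline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted audio features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000, random_state=0)),
])
demo_pipeline.fit(X_tr, y_tr)

# The scaler was fitted inside the pipeline, so its mean comes from X_tr alone
scaler_mean = demo_pipeline.named_steps["scaler"].mean_
print("Matches training mean:", np.allclose(scaler_mean, X_tr.mean(axis=0)))
print("Matches full-data mean:", np.allclose(scaler_mean, X_demo.mean(axis=0)))

# At prediction time the same training statistics are reused on the test split
test_accuracy = demo_pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Passing `stratify=y_demo` keeps the class proportions similar in both splits, which may be worth considering when each genre has fewer than a hundred samples.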

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"{pipeline['classifier'].__class__.__name__}: {accuracy:.5f}")

**Train other models and compare their performance (10 points)**

Try training different models (at least 3) from sklearn and boosting libraries (e.g. Random Forest, SVM, Gradient Boosting, etc)

In [None]:
# write your code here

### RandomForestClassifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=1000, random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"{pipeline['classifier'].__class__.__name__}: {accuracy:.5f}")

### PCA and SVC
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=15)),
    ('classifier', SVC(kernel='rbf', C=10, gamma='scale'))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"{pipeline['classifier'].__class__.__name__}: {accuracy:.5f}")



**(ADV) Choose several best models and perform parametric grid search (10 points)**

Choose several of the best-performing models from your previous experiments and tune their hyperparameters using a parametric grid search. Compare the results and discuss which combination performs best.

In [None]:
# write your code here
### GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

param_grid = [
    {
        'clf': [SVC()],
        'clf__C': [0.1, 1, 10],
        'clf__kernel': ['rbf', 'linear']
    },
    {
        'clf': [RandomForestClassifier(random_state=42)],  # fixed seed for reproducible results
        'clf__n_estimators': [200, 500],
        'clf__max_depth': [10, 20]
    },
]

grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid.fit(X_train, y_train)
print("Best model:", grid.best_estimator_)
print("Best score:", grid.best_score_)

# After adding the new features, the best score reached 0.76

**(ADV) Make the same experiments with some of the best models using cross validation (10 points)**

Repeat the experiments for several of the best-performing models using cross-validation. Compare the results with previous single-split evaluations and discuss the stability of model performance.
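
A compact way to discuss stability is to report the mean and standard deviation of the fold scores for each candidate model. The following sketch uses synthetic data and two arbitrarily chosen models; the names `candidates` and `summary` and all hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the extracted audio features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_redundant=2,
    n_classes=3, random_state=0,
)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "svc": Pipeline([("scaler", StandardScaler()), ("clf", SVC(kernel="rbf", C=10))]),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}  std={scores.std():.3f}")
```

A small standard deviation suggests the model's accuracy does not depend much on which samples land in the validation fold, whereas a single train/test split gives only one such number.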

In [None]:
# write your code here
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize the RandomForestClassifier model
model = RandomForestClassifier(random_state=42)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(scores)

# With the new features, the per-fold accuracies of the random forest went from below 70% to above 70% in every fold




## Assignment specifications and rubric

**Please use only the provided dataset: no external datasets are allowed.**

**The solution notebook MUST run without errors when executing all cells in Google Colab.**

For your analysis, focus only on classical machine learning algorithms available in *sklearn*, not on artificial neural networks.

You are welcome to make models based on artificial neural networks too, but these will not be evaluated.

You are free to perform additional experiments or analyses beyond those explicitly mentioned in the exercises.

Exercises marked **(ADV)** are slightly more advanced and can be completed to earn extra points.
To achieve the minimum passing score, it is sufficient to complete only the base-level exercises.

### Main steps

1. **Data Visualization and Exploration**:
   - Visualize and analyze the dataset to gain insights into the distribution and characteristics of different features.

2. **Handle Unlabeled and Irrelevant Data**:
   - Investigate the dataset for unlabeled data.
   - Filter out irrelevant audios, especially those that are just zero signals or contain no meaningful information.

3. **Feature Engineering**:
   - Experiment with adding new features or refining existing ones.
   - Adjust feature extraction parameters to better capture the characteristics of the audio samples.
   - Rank the features and assess how many top-features to use.

4. **Apply Different Machine Learning Algorithms**:
   - Try various machine learning algorithms (e.g., Random Forest, SVM, Gradient Boosting\*) to improve performance.
   - Evaluate the models not only based on the **average accuracy**, but also consider the confusion matrix along with other **per-class** evaluation.
   - Explain which metrics are important for evaluating model performance, based on your findings from the exploratory data analysis.


\**Gradient Boosting is commonly implemented using specialized libraries like XGBoost, CatBoost, or LightGBM.*
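
The per-class evaluation mentioned in step 4 could be sketched as follows, using scikit-learn's `confusion_matrix` and `classification_report` on synthetic data. The data, the model, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted audio features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Precision, recall, and F1 for each class, plus macro and weighted averages
report = classification_report(y_te, y_hat, output_dict=True, zero_division=0)
for label in ("0", "1", "2"):
    print(label, {k: round(v, 3) for k, v in report[label].items()})
print("macro F1:", round(report["macro avg"]["f1-score"], 3))
```

Off-diagonal cells of the confusion matrix show which classes are mistaken for each other, which overall accuracy alone cannot reveal.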

## Assignment Evaluation Criteria (maximum 100%)

1. **Data Handling and Preprocessing (5 + 5 = 10 points)**
2. **Exploratory Data Analysis (5 + 5 = 10 points)**
3. **Feature Engineering (10 + 10 + 5 + 5 = 30 points)**
4. **Model Selection and Comparison (10 + 10 + 10 = 30 points)**
5. **Clarity, Creativity, and Originality (20 points)**



## Assignment submission instructions
Complete the assignment in your own copy of the notebook. Ensure that your notebook is runnable and free of errors. Once finished, test the notebook in Google Colab (to make sure it runs in that environment too). Then RUN-ALL the project and export both the **.ipynb** file and the **html** file. Finally, submit both files via the Teams Assignment section.

**The deadline is 14 days ahead of the oral exam.**

**Actual deadlines are updated in the Teams Assignment Portal.**