In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
from imblearn.pipeline import Pipeline as Pip
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LinearRegression
from scripts.spotify_cleaner import clean_data
from scripts.spotify_feature_engineering import create_binary_classification, GENRE_PRIORITY, assign_meta_genres, remove_meta_duplicates, engineer_features

warnings.simplefilter(action='ignore', category=FutureWarning)
sns.set_style('whitegrid')

# import the dataset only if it is not already present locally
if os.path.isfile(os.path.join('data', 'dataset.csv')):
    print('Dataset exists, skipping import.')
else:
    print('No dataset found. Importing dataset.')
    %run ./scripts/import_kaggle_dataset.py

# Danceability Prediction & Music Feature Analysis

## Project Overview

This project explores Spotify song data to analyse what makes a song danceable and build a machine learning model to predict danceability.

Using a snapshot dataset from October 2022, we investigate:
- How various audio features (e.g., tempo, valence, energy) impact danceability.
- Whether meta-genres and mood scores influence danceability.
- The performance of different regression and classification models in predicting danceability.

## Objectives

1. **Perform Exploratory Data Analysis (EDA)**  
   - Identify trends & correlations in song characteristics.  
   - Examine the impact of engineered features (e.g., tempo category, beat density).  
   - Visualize how meta-genres and mood scores relate to danceability.  

2. **Engineer & Select Features for Modelling**  
   - Create new features based on genre, rhythm, and energy.  
   - Evaluate which features contribute most to danceability predictions.  

3. **Build Machine Learning Models**  
   - Start with Linear Regression as a baseline.  
   - Improve predictions using Random Forest Classification.  
   - Tune hyperparameters to optimise performance (GridSearchCV).  

4. **Evaluate Model Performance & Feature Importance**  
   - Measure accuracy, precision, recall of models.  
   - Use feature importance analysis to understand what truly makes a song danceable.  

---

## **Dataset Overview**
The dataset consists of Spotify audio features for a large number of songs. Each song has:
- *Numerical features* (e.g., `tempo`, `valence`, `energy`, `loudness`).
- *Categorical features* (`track_genre`, later converted into `meta_genre`).
- *Engineered features* (`beat_density`, `mood_score`, etc.).

### **Sample Features**
| Feature | Description |
|---------|------------|
| `danceability` | Float (0-1) indicating danceability |
| `tempo` | Beats per minute (BPM) |
| `energy` | Song intensity (0-1) |
| `valence` | Positivity of the song (0-1) |
| `speechiness` | Amount of spoken words in the track |
| `instrumentalness` | Likelihood of the track being instrumental |
| `meta_genre` | Grouped genre category (e.g., Rock, Pop, Hip-Hop) |
| `mood_score` | Engineered feature combining valence, energy, and tempo |

# Exploratory Data Analysis (EDA)

## Dataset Overview

Before building our model, we first explore the dataset to:
- Understand the distribution of key features.
- Identify trends and correlations that may influence danceability.
- Evaluate the usefulness of engineered features (`mood_score`, `beat_density`, etc.).
- Examine how meta-genres impact danceability.

To begin, we clean the data and apply feature engineering to ensure a consistent dataset.

In [None]:
df = pd.read_csv('data/dataset.csv', index_col=0)
df = clean_data(df)

# assign meta-genre & remove duplicates
df = assign_meta_genres(df, GENRE_PRIORITY)
df = remove_meta_duplicates(df, GENRE_PRIORITY)

# further feature engineering
df = engineer_features(df)

# general DF information (df.info() prints directly and returns None)
df.info()
display(df.describe())
df.head()

## Feature Distributions

We first visualize the distributions of key numerical features to understand their spread. This helps identify outliers and assess whether transformations are necessary.
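
As a numeric companion to the histograms, the sketch below summarises skewness and the share of IQR outliers per feature. The helper name `distribution_summary` and the toy values are assumptions for illustration only; in the notebook it could be called on `df` with the `num_features` list.

```python
import pandas as pd

def distribution_summary(df, columns):
    """Return skewness and the share of IQR outliers for each column."""
    rows = []
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Classic 1.5 * IQR rule for flagging outliers.
        outside = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        rows.append({'feature': col,
                     'skew': df[col].skew(),
                     'outlier_share': outside.mean()})
    return pd.DataFrame(rows).set_index('feature')

# Toy data (made up): one heavily right-skewed column, one symmetric column.
toy = pd.DataFrame({
    'speechiness': [0.03, 0.04, 0.04, 0.05, 0.05, 0.06, 0.9],
    'valence': [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9],
})
summary = distribution_summary(toy, ['speechiness', 'valence'])
print(summary)
```

A strongly positive skew (as for the toy `speechiness` column) is a hint that a log or rank transformation may be worth testing before fitting a linear model.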


In [None]:
num_features = ['danceability', 'tempo', 'energy', 'valence', 'speechiness', 'acousticness', 
                'instrumentalness', 'liveness', 'loudness', 'mood_score', 'beat_density', 'popularity']

# Plot distributions
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 10))
axes = axes.flatten()

for i, feature in enumerate(num_features):
    sns.histplot(df[feature], bins=30, kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {feature}')
plt.tight_layout()

plt.savefig('visuals/feature_distribution.svg')

## Correlation Analysis

To understand relationships between features, we compute a correlation matrix. This allows us to identify:
- Which features are highly correlated.
- Whether any features are redundant.
- How well different features correlate with danceability.
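
The redundancy check in the list above can also be done numerically. The following is a minimal sketch, assuming a threshold of 0.8 (an arbitrary choice) and a hypothetical helper `highly_correlated_pairs`; the toy values are made up.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes(include='number').corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (made up): energy and loudness move together, valence does not.
toy = pd.DataFrame({
    'energy': [0.1, 0.4, 0.5, 0.9],
    'loudness': [-20.0, -12.0, -10.0, -3.0],
    'valence': [0.9, 0.1, 0.8, 0.2],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

Pairs returned by this check are candidates for dropping one of the two features, or for combining them into a single engineered feature.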

In [None]:
plt.figure(figsize=(12, 6))
sns.heatmap(df.select_dtypes(include=['number']).corr())
plt.title('Feature Correlations')
plt.tight_layout()

plt.savefig('visuals/full_feature_heatmap.svg')

## Meta-Genre Analysis

Next, we analyse meta-genres to see how different broad genre categories relate to danceability. This allows us to examine whether certain genres tend to produce more danceable music.

In [None]:
# Group by meta-genre and compute average danceability
genre_danceability = df.groupby('meta_genre')['danceability'].mean().sort_values()

plt.figure(figsize=(12, 6))
sns.barplot(x=genre_danceability.index, y=genre_danceability.values, palette='rocket', hue=genre_danceability.index)
plt.xticks(rotation=45, fontsize=12)
plt.xlabel('Meta-Genre', fontsize=16)
plt.ylabel('Average Danceability', fontsize=16)
plt.title('Average Danceability by Meta-Genre', fontsize=20)
plt.savefig('visuals/Genre_danceability.svg', dpi=150, format='svg', bbox_inches='tight')

## Insights from EDA

From the above analyses, we can extract key observations:
- Danceability tends to correlate with features like energy and valence, while acousticness and instrumentalness are negatively correlated.
- Some meta-genres have significantly higher average danceability than others, suggesting genre is a useful feature.
- The engineered features, such as mood score and beat density, show clear variation across songs, making them strong candidates for feature selection.

With these insights, we move forward to feature engineering and model building.

# Feature Engineering

## Handling Meta-Genres

Spotify provides detailed genre labels, but these are often too specific (e.g., 'alternative metal' vs. 'heavy metal'). To make our analysis more effective, we grouped similar genres into meta-genres. This helps:
- Reduce dimensionality and noise.
- Improve interpretability for analysis and modelling.

### **Meta-Genre Assignment**
Meta-genres were created based on market categories where possible, since Spotify is a commercial platform, and on musical history where the market category was unclear. They were then assigned using a priority mapping to ensure consistency.


In [None]:
df['meta_genre'].value_counts()

The distribution of meta-genres shows that some categories are much more common than others. To ensure better representation, we adjusted the duplicate-removal priorities in the `GENRE_PRIORITY` dictionary. Any song found in multiple meta-genres is now represented only in the most 'overarching' one. This removes some nuance from the dataset, but should allow a more precise analysis of the meta-genres.
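
The real logic lives in `remove_meta_duplicates` in `scripts/spotify_feature_engineering.py`, which is not shown here. The sketch below only illustrates the idea of priority-based duplicate removal. It assumes that the priority mapping is a dictionary of meta-genre to rank (lower rank wins) and that tracks are identified by a `track_id` column; both are assumptions, as are the helper name and the toy values.

```python
import pandas as pd

# Hypothetical priority mapping: a lower number wins when a track appears twice.
TOY_PRIORITY = {'Pop': 0, 'Rock': 1, 'Electronic': 2}

def keep_highest_priority(df, priority, id_col='track_id', genre_col='meta_genre'):
    """Keep one row per track, choosing the meta-genre with the best priority."""
    ranked = df.assign(_rank=df[genre_col].map(priority))
    ranked = ranked.sort_values([id_col, '_rank'])
    deduped = ranked.drop_duplicates(subset=id_col, keep='first')
    return deduped.drop(columns='_rank').reset_index(drop=True)

# Toy data (made up): tracks 'a' and 'c' are listed under two meta-genres.
toy = pd.DataFrame({
    'track_id': ['a', 'a', 'b', 'c', 'c'],
    'meta_genre': ['Rock', 'Pop', 'Electronic', 'Electronic', 'Rock'],
})
deduped = keep_highest_priority(toy, TOY_PRIORITY)
print(deduped)
```

Sorting by rank before dropping duplicates keeps the rule explicit: changing the numbers in the mapping is enough to change which meta-genre a multi-listed track ends up in.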

## Engineered Features

To enhance our models, we created several new features based on tempo, rhythm, energy, and acoustic properties. Below are the most important engineered features:

| Feature | Description |
|---------|------------|
| `tempo_category` | Binned tempo into 'slow,' 'mid,' and 'fast' |
| `mood_score` | Composite score of valence, energy, and tempo |
| `beat_density` | Ratio of tempo to time signature |
| `energy_tempo_interaction` | Interaction between energy and tempo |
| `party_factor` | Combination of valence and energy |
| `groove_score` | Danceability weighted by instrumentalness |

These features were designed to capture musical attributes beyond raw Spotify-provided data.
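
The actual definitions are in `engineer_features` in `scripts/spotify_feature_engineering.py` and are not reproduced in this notebook. The sketch below shows one plausible reading of the table above. The bin edges (90 and 120 BPM), the `time_signature` column name, and the exact `mood_score` formula are assumptions, not the project's definitions.

```python
import pandas as pd

def sketch_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of the engineered features (formulas are assumptions)."""
    out = df.copy()
    # Bin tempo into three categories; the 90/120 BPM edges are hypothetical.
    out['tempo_category'] = pd.cut(
        out['tempo'], bins=[0, 90, 120, float('inf')],
        labels=['slow', 'mid', 'fast'], include_lowest=True,
    )
    # Ratio of tempo to time signature.
    out['beat_density'] = out['tempo'] / out['time_signature']
    # Simple multiplicative interactions.
    out['energy_tempo_interaction'] = out['energy'] * out['tempo']
    out['party_factor'] = out['valence'] * out['energy']
    # Composite mood score: mean of valence, energy and min-max scaled tempo.
    tempo_range = out['tempo'].max() - out['tempo'].min()
    tempo_scaled = (out['tempo'] - out['tempo'].min()) / tempo_range
    out['mood_score'] = (out['valence'] + out['energy'] + tempo_scaled) / 3
    return out

# Toy data (made up).
toy = pd.DataFrame({
    'tempo': [80.0, 100.0, 140.0],
    'time_signature': [4, 4, 4],
    'energy': [0.5, 0.6, 0.9],
    'valence': [0.2, 0.5, 0.8],
})
engineered = sketch_engineered_features(toy)
print(engineered[['tempo_category', 'beat_density', 'party_factor', 'mood_score']])
```

Note that a `time_signature` of 0 would make `beat_density` infinite, so such rows would need to be handled during cleaning.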

In [None]:
df.select_dtypes(include=['number']).corr()['danceability'].sort_values(ascending=False)

### **Feature Performance: What Worked & What Didn’t**
After experimenting with multiple features, we evaluated their impact on danceability predictions. Because `groove_score` is derived from danceability itself (danceability weighted by instrumentalness), it would leak the target into the features, so it was left out of the analysis.

**Useful Features**:
- *Meta-genre* → Different genres showed clear trends in danceability.
- *Mood Score* → Captured energy-valence interactions well.
- *Beat Density* → Helped differentiate rhythmic structures.

**Less Useful Features**:
- *Energy-Tempo Interaction* → Did not provide additional predictive power beyond individual features.
- *Tempo Category (One-Hot Encoded)* → Only had a marginal impact on model performance.

Based on these insights, we selected only the most informative features for our final model.

# Modelling Approaches

## Overview

With our features engineered and cleaned, we now test two machine learning approaches to predict danceability:

1. **Linear Regression**  
   - Serves as a baseline model to understand feature relationships.  
   - Tests polynomial features to capture non-linear relationships.  
   
2. **Random Forest Classification**  
   - More robust for structured data with categorical features.  
   - Uses hyperparameter tuning (`GridSearchCV`) for optimisation.  
   - Handles class imbalance with `class_weight='balanced'`; the `imblearn` pipeline also leaves room for a resampler such as `SMOTE`.
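
The classifier cell further down passes `class_weight='balanced'` to the Random Forest. The sketch below shows how that weighting is derived, using the formula `n_samples / (n_classes * class_count)` from the scikit-learn documentation. The helper name and the 75/25 toy labels are illustrative assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as used by class_weight='balanced': n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (made up) mirroring a 75/25 split between the two classes.
labels = [0] * 75 + [1] * 25
weights = balanced_class_weights(labels)
print(weights)
```

With this split, each minority-class sample counts three times as much as a majority-class sample, which pushes the model towards higher recall on the danceable class.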

We evaluate models using accuracy, precision, recall, and analyse feature importance to understand what drives danceability.

## Linear Regression (Baseline Model)

To start, we apply Polynomial Features with a Linear Regression model.  
The goal is to establish whether danceability follows a simple polynomial relationship with other features.

In [None]:
# prepare modelling data (feature & target definition, train/test-split)
features_cols = [
    # Spotify-provided audio features
    'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'valence', 'tempo',
    # engineered features
    'beat_density', 'energy_tempo_interaction', 'party_factor', 'mood_score',
]

target_col = ['danceability']

features = df[features_cols]
target = df[target_col]

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.1, random_state=42)

poly_lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly_features', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ('linear_regression', LinearRegression())
])

poly_lr_pipeline.fit(features_train, target_train)

# Predict & evaluate
target_pred_poly = poly_lr_pipeline.predict(features_test)

rmse_poly = np.sqrt(mean_squared_error(target_test, target_pred_poly))
r2_poly = r2_score(target_test, target_pred_poly)

print('RMSE: {:.4f}'.format(rmse_poly))
print('R² Score: {:.4f}'.format(r2_poly))

### Results:
- **RMSE**: 0.1288 (on the 0-1 danceability scale)
- **R² Score**: 0.4765 (about 48% of the variance explained)

While the model provides insight into feature relationships, its predictive power is limited, suggesting a need for more complex modelling.
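
For reading these two metrics, it helps to see how they are computed. The sketch below reimplements RMSE and R² with NumPy on made-up values; the helper name is hypothetical, and the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Compute RMSE and R² from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # RMSE is in the same units as the target (here: danceability on a 0-1 scale).
    rmse = np.sqrt(np.mean(residuals ** 2))
    # R² compares the residual error with the error of always predicting the mean.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return rmse, 1.0 - ss_res / ss_tot

# Toy values (made up).
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.3, 0.3, 0.7, 0.7]
rmse, r2 = rmse_and_r2(y_true, y_pred)
print('RMSE: {:.4f}'.format(rmse))
print('R² Score: {:.4f}'.format(r2))
```

An R² of 0 would mean the model is no better than predicting the mean danceability for every song, which gives a reference point for the baseline's score.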

## Random Forest Classification

A Random Forest model is used for classification, where we define a binary target:
- **1 = Danceable** (top 25% of `danceability` scores).
- **0 = Not Danceable** (bottom 75%).
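
The labelling itself is done by `create_binary_classification`, whose implementation is not shown in this notebook. A minimal sketch of the same idea, assuming the threshold is taken as the 75th percentile of the base column (the helper name and toy values are made up):

```python
import pandas as pd

def label_top_quantile(df, base_col='danceability', new_col='danceable', quantile=0.75):
    """Label rows at or above the given quantile of base_col as 1, all others as 0."""
    threshold = df[base_col].quantile(quantile)
    out = df.copy()
    out[new_col] = (out[base_col] >= threshold).astype(int)
    return out, threshold

# Toy data (made up).
toy = pd.DataFrame({'danceability': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})
labelled, threshold = label_top_quantile(toy)
print('threshold:', threshold)
print(labelled)
```

The notebook passes a fixed cutoff instead of recomputing it. If the threshold is derived from data, computing it on the training split only avoids letting test rows influence the label definition.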

After running `GridSearchCV`, we chose the hyperparameters shown below because they gave the best balance between the scoring metrics and the best overall F1 score. Notably, with a `max_depth` over 20 and 400 or more `n_estimators`, the model began to degrade in all scores.
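
The search itself is not included in this notebook. The sketch below shows how such a search could be set up, assuming F1 as the scoring metric. It uses synthetic data from `make_classification` and a deliberately tiny, hypothetical grid so that it runs quickly; the grid actually searched for this project is not documented here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a roughly 75/25 class split.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.75, 0.25], random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(class_weight='balanced', random_state=42)),
])

# Hypothetical grid; step names are prefixed with 'classifier__' to reach the forest.
param_grid = {
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [5, 10],
}

search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Searching over the whole pipeline rather than the bare classifier means the preprocessing is refitted inside each fold, so no information from the validation fold reaches the scaler.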


In [None]:
# Classification Model
df = create_binary_classification(df, new_col_name='danceable', base_col='danceability', percentile_cutoff=0.693) #picked from top 25%

clf_num_cols = [
    'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness',
    'valence', 'tempo', 'beat_density', 'energy_tempo_interaction', 'party_factor'
]
clf_cat_cols = ['tempo_category', 'meta_genre']

clf_features = df[clf_num_cols+clf_cat_cols]
clf_target = df['danceable']

clf_features_train, clf_features_test, clf_target_train, clf_target_test = train_test_split(clf_features, clf_target, stratify=clf_target,
                                                                                            test_size=0.1, random_state=42)

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), clf_num_cols),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), clf_cat_cols)
])

#Random Forest Classification
rf_clf_pipeline = Pip([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=350, max_depth=20, 
                                          min_samples_split=5, min_samples_leaf=2, 
                                          class_weight='balanced', random_state=42))
])

rf_clf_pipeline.fit(clf_features_train, clf_target_train)

rf_clf_target_pred = rf_clf_pipeline.predict(clf_features_test)

accuracy = accuracy_score(clf_target_test, rf_clf_target_pred)
precision = precision_score(clf_target_test, rf_clf_target_pred)
recall = recall_score(clf_target_test, rf_clf_target_pred)
conf_matrix = confusion_matrix(clf_target_test, rf_clf_target_pred)

print('Accuracy:  {:.4f}'.format(accuracy))
print('Precision: {:.4f}'.format(precision))
print('Recall: {:.4f}'.format(recall))
print('Confusion Matrix:\n', conf_matrix)

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay.from_estimator(rf_clf_pipeline, clf_features_test, clf_target_test, ax=ax, cmap='rocket')
plt.title('Confusion Matrix - Random Forest Classifier')
plt.savefig('visuals/confusion_matrix.svg', format='svg')

### Results:
- **Accuracy**: 84.19%
- **Precision**: 66.42%
- **Recall**: 74.99%

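Because this is a classification task, these metrics are not directly comparable with the RMSE and R² of the Linear Regression baseline. Taken on their own, they show that the Random Forest identifies about three quarters of the danceable songs, and that roughly two thirds of its positive predictions are correct.  
Notably, recall improved by ~4.5% after encoding `tempo_category` and `meta_genre`.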
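
The three scores follow directly from the confusion matrix. The sketch below uses made-up counts, not the notebook's actual matrix, and assumes the scikit-learn layout `[[tn, fp], [fn, tp]]` for labels 0 and 1. The helper name is hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Made-up counts for 100 test songs, 20 of them danceable.
metrics = classification_metrics(tn=70, fp=10, fn=5, tp=15)
for name, value in metrics.items():
    print('{}: {:.4f}'.format(name, value))
```

With an imbalanced target, accuracy alone is easy to inflate: predicting "not danceable" for every song in this toy example would already reach 80% accuracy with a recall of 0.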

In [None]:
feature_importances = rf_clf_pipeline.named_steps['classifier'].feature_importances_

feature_names = (
    clf_num_cols + 
    rf_clf_pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(clf_cat_cols).tolist()
)

# importance df
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False).head(15)

plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Feature Importance Score', fontsize=14)
plt.ylabel('Features', fontsize=14)
plt.yticks(fontsize=12)
plt.title('Feature Importance for Danceability Prediction', fontsize=20)
plt.gca().invert_yaxis()
plt.savefig('visuals/feature_importance.svg', format='svg', bbox_inches='tight')

This plot shows the 15 most important features for danceability prediction. Since 'danceability' in this dataset appears to describe the kind of dancing done on a club dance floor, the relative importance of `valence` is unsurprising, whereas the importance of `speechiness` (the spoken-word content of a piece) is unexpected. Broadly speaking, danceability is not dependent on any single characteristic and is clearly the result of a complex combination of other values.

# Conclusion & Next Steps

## Summary of Findings

This project aimed to analyse what makes a song danceable and build a machine learning model to predict danceability. Through exploratory data analysis, feature engineering, and model experimentation, we arrived at the following key insights:

### **Danceability Trends**
- Energy, valence, and tempo were the strongest predictors of danceability.
- Meta-genre had a minor but noticeable impact, with genres like electronic and hip-hop tending to be more danceable.
- Mood score, which combined valence, energy, and tempo, was a meaningful composite feature.

### **Feature Engineering Outcomes**
- **Successful Features**:  
  - `beat_density` (tempo-to-time signature ratio)  
  - `mood_score` (composite of valence, energy, and tempo)  
  - `meta_genre` (broad genre categorisation)  

- **Less Effective Features**:  
  - `groove_score` (derived from danceability itself, so it was excluded to avoid leaking the target)  
  - `energy_tempo_interaction` (did not add significant predictive power)  

### **Model Performance**
- *Linear Regression* failed to accurately capture danceability, confirming that a purely linear approach was insufficient.
- *Random Forest Classification* provided the best results, achieving:
  - ~84% accuracy
  - Strong recall (~75%), ensuring that truly danceable songs were identified.
  - Feature importance analysis confirmed that energy, valence, and genre had the greatest influence.

## Next Steps & Potential Improvements

If we were to further refine this project, possible extensions include:

1. **Trying XGBoost**  
   - Random Forest performed well, but an optimised XGBoost model could provide additional improvements.

2. **Optimising Feature Selection**  
   - Some features (e.g., `tempo_category`) only provided minor gains. Further feature selection could enhance model efficiency.

3. **Expanding the Dataset**  
   - This analysis used a static snapshot from October 2022. A time-based dataset could allow us to explore if and how danceability trends evolve over time.

4. **Recommendation Function**  
   - Although not strictly part of modelling, a simple recommendation algorithm would illustrate the insights from this analysis, for example by building a list of songs that steadily increase or decrease in a selected feature.
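
As a rough illustration of the recommendation idea in item 4, the sketch below picks tracks spread evenly across the range of a chosen feature and orders them by it. The function name, the `track_name` column, and the toy values are all hypothetical; this is one possible approach, not part of the project.

```python
import pandas as pd

def build_ramp_playlist(df, feature='energy', n_tracks=5, ascending=True):
    """Pick n_tracks spread evenly across the feature's range, ordered by that feature."""
    if n_tracks < 2:
        raise ValueError('n_tracks must be at least 2')
    ordered = df.sort_values(feature, ascending=ascending).reset_index(drop=True)
    if n_tracks >= len(ordered):
        return ordered
    # Evenly spaced positions from the first to the last row of the sorted frame.
    step = (len(ordered) - 1) / (n_tracks - 1)
    positions = [round(i * step) for i in range(n_tracks)]
    return ordered.iloc[positions].reset_index(drop=True)

# Toy data (made up).
toy = pd.DataFrame({
    'track_name': ['t1', 't2', 't3', 't4', 't5', 't6', 't7'],
    'energy': [0.4, 0.1, 0.7, 0.3, 0.6, 0.2, 0.5],
})
playlist = build_ramp_playlist(toy, feature='energy', n_tracks=3)
print(playlist)
```

Passing `ascending=False` gives a playlist that winds down instead of building up.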

## Final Thoughts

This project showed how audio features relate to danceability and that a machine learning model can classify danceable songs with ~84% accuracy. The findings also reinforce the importance of rhythm, energy, and mood in making music danceable.

The Random Forest model, combined with carefully engineered features, provided a strong balance between interpretability and predictive power.

