# Hit Song Predictor: A Logistic Regression Approach

## Project Overview
This project uses logistic regression to predict whether a song will be a 'hit' based on audio features like danceability, energy, valence, and tempo. The model analyzes patterns in successful tracks to identify what makes a song popular.

**Key Skills Demonstrated:**
- Binary classification with logistic regression
- Feature engineering and selection
- Handling imbalanced datasets
- Model evaluation and interpretation
- Data visualization for music analytics

### Import Necessary Tools & Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

### Load the Dataset

In [None]:
df = pd.read_csv("songs_normalize.csv")

### Inspect Data

In [None]:
df.info()

In [None]:
df.head()

### Exploratory Data Analysis

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)
print("\nBasic Statistics:")
df.describe()

In [None]:
# Define a 'hit' as a song whose popularity is above the 75th percentile
# Change the quantile to make the definition stricter or looser
popularity_threshold = df['popularity'].quantile(0.75)
df['is_a_hit'] = (df['popularity'] > popularity_threshold).astype(int)

print(f"Popularity threshold for 'hit': {popularity_threshold}")
print(f"\nClass distribution:")
print(df['is_a_hit'].value_counts())
print(f"\nHit rate: {df['is_a_hit'].mean():.2%}")

### Visualize audio feature distributions for hits vs non-hits

#### What the Visualization Is Expected to Show
Features expected to separate hits from non-hits most clearly:

- Danceability ⭐⭐⭐
- Energy ⭐⭐⭐
- Valence ⭐⭐

Expected Weak Features:

- Tempo, Acousticness, Instrumentalness, Liveness, Speechiness, and Loudness

In [None]:
audio_features = ['danceability', 'energy', 'valence', 'tempo', 'acousticness', 
                  'instrumentalness', 'liveness', 'speechiness', 'loudness']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, feature in enumerate(audio_features):
    axes[idx].hist(df[df['is_a_hit']==1][feature], alpha=0.5, label='Hit', bins=30, color='green')
    axes[idx].hist(df[df['is_a_hit']==0][feature], alpha=0.5, label='Not a Hit', bins=30, color='red')
    axes[idx].set_title(f'{feature.capitalize()} Distribution')
    axes[idx].legend()
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('feature_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

### Correlation heatmap

In [None]:
plt.figure(figsize=(12, 10))
correlation_matrix = df[audio_features + ['is_a_hit']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            fmt='.2f', square=True, linewidths=1)
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold')
plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

## Correlation Analysis

The correlation heatmap reveals that no individual audio feature shows a strong correlation with hit status. This is expected: music is subjective, and many factors go into having a hit, such as marketing, the artist's popularity, timing, and cultural context. The correlations with hit status are:

- Danceability: -0.02
- Energy: -0.07  
- Valence: -0.09
- Acousticness: 0.07
- All other features: ≈ 0.00
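The values above can be read off the heatmap, but it is often easier to rank them directly. The sketch below is one way to do that; it uses a small synthetic table (`demo`, with made-up values) in place of the real song data, so only the technique carries over, not the numbers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the song table (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "danceability": rng.uniform(0, 1, 500),
    "energy": rng.uniform(0, 1, 500),
    "valence": rng.uniform(0, 1, 500),
})
demo["is_a_hit"] = (rng.uniform(0, 1, 500) > 0.75).astype(int)

# Correlation of each feature with the target, excluding the target itself
target_corr = demo.corr()["is_a_hit"].drop("is_a_hit")

# Order by absolute value so strong negative correlations are not buried
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked.round(2))
```

In the notebook, the same two steps could be applied to `correlation_matrix['is_a_hit']`.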

### My Solution: Feature Engineering

Since individual features lack predictive power, I will create interaction features that combine multiple attributes:

1. **Energy × Danceability**: Captures the "party factor" - songs that are both energetic & danceable
2. **Valence × Energy**: Represents upbeat, positive vibes
3. **Mood Score**: Average of valence and energy for overall emotional tone
4. **Normalized Tempo**: BPM min-max scaled to the 0-1 range so it is comparable to the other features

**Hypothesis**: These engineered features will capture the non-linear relationships and feature combinations that differentiate hits from non-hits, where individual features alone cannot.

### Feature Engineering

In [None]:
df['energy_danceability'] = df['energy'] * df['danceability']
df['valence_energy'] = df['valence'] * df['energy']

# Normalize tempo (convert BPM to a 0-1 scale)
df['tempo_normalized'] = (df['tempo'] - df['tempo'].min()) / (df['tempo'].max() - df['tempo'].min())

# Create a 'mood' feature combining valence and energy
df['mood_score'] = (df['valence'] + df['energy']) / 2

### Data Selection

In [None]:
# Select features for modeling
feature_columns = ['danceability', 'energy', 'valence', 'tempo_normalized', 
                   'acousticness', 'instrumentalness', 'liveness', 'speechiness', 
                   'loudness', 'energy_danceability', 'valence_energy', 'mood_score']

X = df[feature_columns]
y = df['is_a_hit']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"\nTraining set class distribution:")
print(y_train.value_counts(normalize=True))

### Standardizing Features for Logistic Regression

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier interpretation
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns)
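An alternative to scaling by hand is to wrap the scaler and the model in a scikit-learn pipeline, so that the scaler is refit on the training portion of every cross-validation fold. The sketch below shows the idea on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins); it is not the approach this notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 12 features, matching the number of model inputs
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=42)

# The scaler is fit inside each fold, so validation rows never influence the scaling
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"Mean ROC-AUC: {pipe_scores.mean():.4f}")
```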

### Train Model (Baseline)

In [None]:
baseline_model = LogisticRegression(random_state=42, max_iter=1000)
baseline_model.fit(X_train_scaled, y_train)

baseline_score = baseline_model.score(X_test_scaled, y_test)
print(f"Baseline Model Accuracy: {baseline_score:.4f}")
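Accuracy alone can look better than it is on this problem. With hits defined as songs above the 75th percentile of popularity, a model that always predicts "not a hit" is already right about three times in four. The sketch below shows that reference point using made-up labels (`y_demo` and `X_demo` are hypothetical, not the notebook's data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with one hit in four, mirroring a 75th-percentile cutoff
y_demo = np.array([1] * 25 + [0] * 75)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class ("not a hit")
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_demo, y_demo)
pred = majority.predict(X_demo)

majority_accuracy = accuracy_score(y_demo, pred)
majority_recall = recall_score(y_demo, pred)  # share of actual hits that were found
print(f"Majority-class accuracy: {majority_accuracy:.2f}")
print(f"Recall on hits: {majority_recall:.2f}")
```

A baseline accuracy close to the majority-class figure suggests the model is mostly predicting "not a hit", which is one reason the tuning step scores by ROC-AUC instead.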

### Hyperparameter Tuning

In [None]:
# GridSearchCV tries every combination of the settings below and keeps the best one
param_grid = {
    # Inverse regularization strength (smaller C = stronger regularization):
    #   0.001 -> simpler model that relies less on the training data
    #   100   -> model free to fit the training data in detail
    'C': [0.001, 0.01, 0.1, 1, 10, 100],

    # Type of regularization:
    #   l1: can shrink the coefficients of useless features to exactly zero
    #   l2: shrinks all coefficients, so every feature contributes a little
    'penalty': ['l1', 'l2'],

    # Optimization algorithm used to fit the model:
    #   liblinear: fast and well suited to smaller datasets
    #   saga: better suited to larger datasets
    'solver': ['liblinear', 'saga'],

    # Handling of class imbalance:
    #   None: every sample gets the same weight
    #   'balanced': weight classes inversely to their frequency, so hits count more
    'class_weight': [None, 'balanced']
}
# Search the grid with 5-fold cross-validation, scoring by ROC-AUC
grid_search = GridSearchCV(LogisticRegression(random_state=42, max_iter=2000),
                           param_grid, cv=5, scoring='roc_auc', n_jobs=-1, verbose=1)

grid_search.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation ROC-AUC score: {grid_search.best_score_:.4f}")
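To get a sense of how much work the search does, the grid can be expanded with `ParameterGrid`. The sketch below repeats the grid from the cell above and counts the combinations; the fold count of 5 is taken from `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear", "saga"],
    "class_weight": [None, "balanced"],
}

n_candidates = len(ParameterGrid(param_grid))  # 6 * 2 * 2 * 2 combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold
print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```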

### Model Performance Analysis

**Results:**
- Cross-validated ROC-AUC (mean over 5 folds): 0.5942
- Best Parameters: C=1, penalty='l2', solver='liblinear', class_weight=None

**Score Interpretation:**
- 0.50 = Random guessing
- 0.59 = Our model (barely better than random)
- 0.70+ = Acceptable
- 0.80+ = Good
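One way to read these numbers: ROC-AUC is the probability that a randomly chosen hit receives a higher predicted score than a randomly chosen non-hit. The sketch below checks that reading against `roc_auc_score` on six made-up songs (the labels and scores are hypothetical).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted hit probabilities for six songs
y_true = np.array([1, 1, 1, 0, 0, 0])
y_score = np.array([0.9, 0.6, 0.3, 0.7, 0.4, 0.2])

hits = y_score[y_true == 1]
non_hits = y_score[y_true == 0]

# Compare every (hit, non-hit) pair; a tie counts as half a win
wins = (hits[:, None] > non_hits[None, :]).sum()
ties = (hits[:, None] == non_hits[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(hits) * len(non_hits))

sklearn_auc = roc_auc_score(y_true, y_score)
print(f"Pairwise: {pairwise_auc:.4f}, roc_auc_score: {sklearn_auc:.4f}")
```

Under this reading, a score of 0.59 means the model ranks a hit above a non-hit in roughly 59% of such pairs.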

#### Why Performance is Low

This aligns with our correlation analysis, which showed that no individual audio feature strongly predicts hits (all absolute correlations are below 0.10).

**Key factors:**
- Audio features alone are insufficient: commercial success also depends on artist popularity, marketing budget, playlist placement, and timing
- A linear model may miss non-linear patterns in the data
- The hand-built interaction features may not capture more complex relationships

#### Cross-Validation Check
```
Scores: [0.623, 0.569, 0.572, 0.560, 0.647]
Mean: 0.5942 ± 0.0687
```

The fold scores range from 0.560 to 0.647. Performance is weak on every fold, which points to the limited predictive power of the audio features rather than to an unlucky split.
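The summary line can be reproduced from the fold scores. The sketch below uses the rounded scores listed above, so the last digit may differ slightly from the notebook output; it also shows that the ± figure matches two standard deviations, which is what the cross-validation cell prints (`cv_scores.std() * 2`).

```python
import numpy as np

# Fold scores as reported above (rounded to three decimals)
scores = np.array([0.623, 0.569, 0.572, 0.560, 0.647])

mean = scores.mean()
std = scores.std()  # population standard deviation, NumPy's default

print(f"mean = {mean:.4f}")
print(f"std = {std:.4f}, 2 * std = {2 * std:.4f}")
```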

#### Key Takeaway

Audio characteristics alone cannot reliably predict hit songs. Success appears to depend more on factors outside the audio, such as marketing, artist popularity, and timing. This finding is valuable for understanding the limits of audio-based recommendation systems.

### Train model with the current best parameters

In [None]:
best_model = grid_search.best_estimator_

# Cross-validation scores
cv_scores = cross_val_score(best_model, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-validation ROC-AUC scores: {cv_scores}")
print(f"Mean CV ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

## Cross-Validation Consistency Check

To ensure our model's performance is reliable and not due to a lucky/unlucky train-test split, we performed 5-fold cross-validation.

**Results:**
```
Cross-validation ROC-AUC scores: [0.623, 0.569, 0.572, 0.560, 0.647]
Mean CV ROC-AUC: 0.5942 ± 0.0687
```

**Interpretation:**

The scores show **moderate variance** across folds, ranging from 0.560 to 0.647. This indicates:

1. **Model performance is somewhat dependent on the data split** - with a relatively small dataset (1,600 training samples), different folds capture slightly different patterns

2. **The best fold (0.647) shows marginal predictive ability** - this suggests there is *some* signal in the audio features, but it's weak and inconsistent

3. **The spread of ±0.069 is acceptable** for a dataset of this size. Note that this figure is two standard deviations (one standard deviation is about 0.034), so the model isn't wildly unstable, just working with limited predictive features

**Key Takeaway:** The consistency of low scores across all folds confirms that the weak performance is due to **insufficient predictive power of audio features**, not model instability or a poor train-test split. This reinforces our earlier finding that commercial success depends more on non-audio factors (marketing, artist popularity, timing) than on audio characteristics alone.
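The notebook imports `classification_report` and `confusion_matrix`, but the cells shown stop at cross-validation. A final check on the held-out test set might look like the sketch below. It runs on synthetic, imbalanced data (`X_demo`, `y_demo`, and the plain `LogisticRegression` are stand-ins), so the printed numbers say nothing about the song model; in the notebook the same calls would take `best_model`, `X_test_scaled`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 25% positives); not the song dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=12, weights=[0.75, 0.25], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

y_pred = model.predict(X_te)               # hard 0/1 predictions
y_proba = model.predict_proba(X_te)[:, 1]  # predicted probability of class 1

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
test_auc = roc_auc_score(y_te, y_proba)

print(cm)
print(classification_report(y_te, y_pred, target_names=["Not a Hit", "Hit"]))
print(f"Test ROC-AUC: {test_auc:.4f}")
```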