# Hit Song Predictor: A Logistic Regression Approach

## Project Overview
This project uses logistic regression to predict whether a song will be a 'hit' based on audio features like danceability, energy, valence, and tempo. The model analyzes patterns in successful tracks to identify what makes a song popular.

**Key Skills Demonstrated:**
- Binary classification with logistic regression
- Feature engineering and selection
- Handling imbalanced datasets
- Model evaluation and interpretation
- Data visualization for music analytics

### Import Necessary Tools & Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.metrics import precision_recall_curve, average_precision_score
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

### Load the Dataset

In [None]:
df = pd.read_csv("songs_normalize.csv")

### Inspect Data

In [None]:
df.info()

In [None]:
df.head()

### Exploratory Data Analysis

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)
print("\nBasic Statistics:")
df.describe()

In [None]:
# Define what makes a 'hit': popularity above the dataset's 75th percentile
# (a data-driven threshold rather than a fixed cutoff such as 70)
popularity_threshold = df['popularity'].quantile(0.75)
df['is_a_hit'] = (df['popularity'] > popularity_threshold).astype(int)

print(f"Popularity threshold for 'hit': {popularity_threshold}")
print(f"\nClass distribution:")
print(df['is_a_hit'].value_counts())
print(f"\nHit rate: {df['is_a_hit'].mean():.2%}")

### Visualize audio feature distributions for hits vs non-hits

#### The Visualization shows
Expected Strong Features:

- Danceability ⭐⭐⭐
- Energy ⭐⭐⭐
- Valence ⭐⭐

Expected Weak Features:

- Tempo, Acousticness, Instrumentalness, Liveness, Speechiness, and Loudness

In [None]:
audio_features = ['danceability', 'energy', 'valence', 'tempo', 'acousticness', 
                  'instrumentalness', 'liveness', 'speechiness', 'loudness']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, feature in enumerate(audio_features):
    axes[idx].hist(df[df['is_a_hit']==1][feature], alpha=0.5, label='Hit', bins=30, color='green')
    axes[idx].hist(df[df['is_a_hit']==0][feature], alpha=0.5, label='Not a Hit', bins=30, color='red')
    axes[idx].set_title(f'{feature.capitalize()} Distribution')
    axes[idx].legend()
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('feature_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

### Correlation heatmap

In [None]:
plt.figure(figsize=(12, 10))
correlation_matrix = df[audio_features + ['is_a_hit']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            fmt='.2f', square=True, linewidths=1)
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold')
plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

## Correlation Analysis

The correlation heatmap reveals that no individual audio feature shows a strong correlation with hit status. This is expected: music is subjective, and many factors outside the audio go into a hit, such as marketing, the artist's popularity, timing, and cultural context. The correlations with hit status are:

- Danceability: -0.02
- Energy: -0.07  
- Valence: -0.09
- Acousticness: 0.07
- All other features: ≈ 0.00
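
A ranking like the one above can be produced with a short sketch. It assumes a DataFrame holding the audio feature columns and a binary `is_a_hit` column. The synthetic frame `demo` and the helper name `rank_feature_correlations` are made up for illustration and are not the project's data or code.

```python
import numpy as np
import pandas as pd

def rank_feature_correlations(frame, features, target):
    """Return each feature's Pearson correlation with the target, strongest first."""
    corr = frame[features].corrwith(frame[target])
    return corr.sort_values(key=np.abs, ascending=False)

# Synthetic stand-in data (hypothetical, not songs_normalize.csv)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "danceability": rng.uniform(0, 1, 500),
    "energy": rng.uniform(0, 1, 500),
    "valence": rng.uniform(0, 1, 500),
})
demo["is_a_hit"] = (rng.uniform(0, 1, 500) > 0.75).astype(int)

ranking = rank_feature_correlations(
    demo, ["danceability", "energy", "valence"], "is_a_hit"
)
print(ranking.round(3))
```

Sorting by absolute value keeps negative correlations in view, which matters here because most of the listed values are negative.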

### My Solution: Feature Engineering

Since individual features lack predictive power, I will create interaction features that combine multiple attributes:

1. **Energy × Danceability**: Captures the "party factor" - songs that are both energetic & danceable
2. **Valence × Energy**: Represents upbeat, positive vibes
3. **Mood Score**: Average of valence and energy for overall emotional tone
4. **Normalized Tempo**: BPM min-max scaled to the 0-1 range so it is comparable to the other features

**Hypothesis**: These engineered features will capture the non-linear relationships and feature combinations that differentiate hits from non-hits, where individual features alone cannot.
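
The idea behind the hypothesis can be illustrated on synthetic data. In this minimal sketch the label depends only on the product of two features, so a logistic regression on the raw features should score near chance, while adding the product term should let it separate the classes. The data, the noise level, and the helper name `holdout_auc` are assumptions for illustration, and say nothing about how the song data behaves.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Hypothetical label that depends on the interaction, not on x1 or x2 alone
y = (x1 * x2 + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_plain = np.column_stack([x1, x2])           # raw features only
X_inter = np.column_stack([x1, x2, x1 * x2])  # raw features plus product term

def holdout_auc(X, y):
    """Fit logistic regression on a train split and return ROC-AUC on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

auc_plain = holdout_auc(X_plain, y)
auc_inter = holdout_auc(X_inter, y)
print(f"AUC without interaction: {auc_plain:.3f}")
print(f"AUC with interaction:    {auc_inter:.3f}")
```

Product terms help because they are not linear in the original columns. An average such as (valence + energy) / 2, by contrast, is a linear combination of columns the model already sees, so it cannot add information to a linear model in the same way.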

### Feature Engineering

In [None]:
df['energy_danceability'] = df['energy'] * df['danceability']
df['valence_energy'] = df['valence'] * df['energy']

# Normalize tempo (convert BPM to a 0-1 scale)
df['tempo_normalized'] = (df['tempo'] - df['tempo'].min()) / (df['tempo'].max() - df['tempo'].min())

# Create a 'mood' feature combining valence and energy
df['mood_score'] = (df['valence'] + df['energy']) / 2

### Data Selection

In [None]:
# Select features for modeling
feature_columns = ['danceability', 'energy', 'valence', 'tempo_normalized', 
                   'acousticness', 'instrumentalness', 'liveness', 'speechiness', 
                   'loudness', 'energy_danceability', 'valence_energy', 'mood_score']

X = df[feature_columns]
y = df['is_a_hit']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"\nTraining set class distribution:")
print(y_train.value_counts(normalize=True))

### Standardizing Features for Logistic Regression

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier interpretation
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns)

### Train Model (Baseline)

In [None]:
baseline_model = LogisticRegression(random_state=42, max_iter=1000)
baseline_model.fit(X_train_scaled, y_train)

baseline_score = baseline_model.score(X_test_scaled, y_test)
print(f"Baseline Model Accuracy: {baseline_score:.4f}")

### Hyperparameter Tuning

In [None]:
# GridSearchCV tries every combination in the grid and keeps the best one by cross-validated score
param_grid = {
    # Inverse regularization strength:
        # 0.001 [strong regularization: don't trust the data too much]
        # 100   [weak regularization: fit the data closely]
    'C': [0.001, 0.01, 0.1, 1, 10, 100], 

    # Type of regularization:
        # L1: can shrink useless features' coefficients to exactly zero
        # L2: shrinks all coefficients, so every feature contributes a little
    'penalty': ['l1', 'l2'], 

    # Which algorithm trains the model?
        # liblinear: Fast and good for smaller datasets
        # saga: Better for larger datasets
    'solver': ['liblinear', 'saga'],

    # Address the class imbalance
        # None: weight every sample equally
        # 'balanced': weight classes inversely to their frequency, so hits count more
    'class_weight': [None, 'balanced']
}
# Search the grid for the best combination of settings
grid_search = GridSearchCV(LogisticRegression(random_state=42, max_iter=2000),
                           param_grid, cv=5, scoring='roc_auc', n_jobs=-1, verbose=1)

grid_search.fit(X_train_scaled, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation ROC-AUC score: {grid_search.best_score_:.4f}")

### Train model with the current best parameters

In [None]:
best_model = grid_search.best_estimator_

# Cross-validation scores
cv_scores = cross_val_score(best_model, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-validation ROC-AUC scores: {cv_scores}")
print(f"Mean CV ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

### Model Evaluation

In [None]:
# Make predictions
y_pred = best_model.predict(X_test_scaled)
y_pred_proba = best_model.predict_proba(X_test_scaled)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Non-Hit', 'Hit']))

# ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nROC-AUC Score: {roc_auc:.4f}")

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Non-Hit', 'Hit'], yticklabels=['Non-Hit', 'Hit'])
plt.title('Confusion Matrix', fontsize=16, fontweight='bold')
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

# Calculate additional metrics
tn, fp, fn, tp = cm.ravel()
specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)
print(f"\nSpecificity (True Negative Rate): {specificity:.4f}")
print(f"Sensitivity (True Positive Rate/Recall): {sensitivity:.4f}")

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize=16, fontweight='bold')
plt.legend(loc="lower right", fontsize=11)
plt.grid(alpha=0.3)
plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
avg_precision = average_precision_score(y_test, y_pred_proba)

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, color='blue', lw=2, 
         label=f'Precision-Recall curve (AP = {avg_precision:.4f})')
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve', fontsize=16, fontweight='bold')
plt.legend(loc="lower left", fontsize=11)
plt.grid(alpha=0.3)
plt.savefig('precision_recall_curve.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Feature Importance Analysis

In [None]:
# Extract and visualize feature coefficients
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': best_model.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)

plt.figure(figsize=(10, 8))
colors = ['green' if x > 0 else 'red' for x in feature_importance['Coefficient']]
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'], color=colors)
plt.xlabel('Coefficient Value', fontsize=12)
plt.title('Feature Importance (Logistic Regression Coefficients)', fontsize=16, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nFeature Importance Ranking:")
print(feature_importance)

This graph shows that tempo, loudness, acousticness, valence × energy, energy, and danceability have the largest coefficients, so they are the features the model relies on most when predicting a hit. Because the features were standardized, coefficient magnitudes are comparable across features. These are associations in this dataset, not causes: music is subjective, and the coefficients only indicate which audio characteristics listeners tend to gravitate towards in general.
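
To make the coefficient scale concrete, here is a minimal sketch of how a standardized logistic regression coefficient maps to an odds ratio. The coefficient values and feature names below are hypothetical and are not the fitted model's output.

```python
import numpy as np
import pandas as pd

# Hypothetical standardized coefficients (log-odds change per one standard deviation)
coefs = pd.Series({"feature_a": 0.30, "feature_b": -0.20, "feature_c": 0.0})

odds_ratios = np.exp(coefs)           # multiplicative change in the odds
pct_change = (odds_ratios - 1) * 100  # same change expressed as a percentage

summary = pd.DataFrame({
    "coefficient": coefs,
    "odds_ratio": odds_ratios,
    "pct_change_in_odds": pct_change,
})
print(summary.round(3))
```

A coefficient of zero gives an odds ratio of exactly 1 (no effect), positive coefficients give ratios above 1, and negative coefficients give ratios below 1.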

In [None]:
# Interpret coefficients (odds ratios)
odds_ratios = np.exp(feature_importance['Coefficient'])
feature_importance['Odds Ratio'] = odds_ratios
feature_importance['Impact'] = feature_importance['Coefficient'].apply(
    lambda x: 'Positive' if x > 0 else 'Negative'
)

print("\nFeature Interpretation (Odds Ratios):")
print(feature_importance[['Feature', 'Coefficient', 'Odds Ratio', 'Impact']])
print("\nInterpretation: An odds ratio > 1 means the feature increases the likelihood of a hit.")
print("An odds ratio < 1 means the feature decreases the likelihood of a hit.")

## 8. Prediction Examples and Insights

In [None]:
# A function to predict hit probability for new songs
def predict_hit_probability(song_features):
    """
    Predict the probability that a song will be a hit.
    
    Parameters:
    song_features: dict with audio features
    
    Returns:
    probability of being a hit
    """
    # Create DataFrame from input
    song_df = pd.DataFrame([song_features])
    
    # Add engineered features
    song_df['energy_danceability'] = song_df['energy'] * song_df['danceability']
    song_df['valence_energy'] = song_df['valence'] * song_df['energy']
    song_df['tempo_normalized'] = (song_df['tempo'] - df['tempo'].min()) / (df['tempo'].max() - df['tempo'].min())
    song_df['mood_score'] = (song_df['valence'] + song_df['energy']) / 2
    
    # Scale features, keeping column names to match how the model was trained
    song_scaled = pd.DataFrame(scaler.transform(song_df[feature_columns]),
                               columns=feature_columns)
    
    # Predict
    probability = best_model.predict_proba(song_scaled)[0, 1]
    
    return probability

# Example prediction
example_song = {
    'danceability': 0.95,
    'energy': 0.90,
    'valence': 0.45,
    'tempo': 155,
    'acousticness': 0.80,
    'instrumentalness': 0.1,
    'liveness': 0.15,
    'speechiness': 0.05,
    'loudness': -9.0  # dB; Spotify loudness values are typically negative
}

hit_prob = predict_hit_probability(example_song)
print(f"Example Song's Hit Probability: {hit_prob:.2%}")
print(f"Prediction: {'A Hit :)' if hit_prob > 0.5 else 'Not a Hit :('}")

In [None]:
# Analyze misclassifications to gain insights
test_results = X_test.copy()
test_results['actual'] = y_test.values
test_results['predicted'] = y_pred
test_results['probability'] = y_pred_proba

# False positives (predicted hit, actually not)
false_positives = test_results[(test_results['actual'] == 0) & (test_results['predicted'] == 1)]
print(f"\nFalse Positives Analysis (n={len(false_positives)}):")
print(false_positives[['danceability', 'energy', 'valence', 'probability']].describe())

# False negatives (predicted not hit, actually is)
false_negatives = test_results[(test_results['actual'] == 1) & (test_results['predicted'] == 0)]
print(f"\nFalse Negatives Analysis (n={len(false_negatives)}):")
print(false_negatives[['danceability', 'energy', 'valence', 'probability']].describe())

## 9. Key Insights and Business Recommendations

In [None]:
# Summary statistics for hits vs non-hits
print("=" * 80)
print("KEY INSIGHTS FROM THE MODEL")
print("=" * 80)

print("\n1. MODEL PERFORMANCE:")
print(f"   - ROC-AUC Score: {roc_auc:.4f}")
print(f"   - Accuracy: {best_model.score(X_test_scaled, y_test):.4f}")
print("   - ROC-AUC near 0.5 means chance-level ranking; judge the score above against that baseline")

print("\n2. TOP FEATURES THAT PREDICT HITS:")
top_positive = feature_importance.nlargest(3, 'Coefficient')
for idx, row in top_positive.iterrows():
    print(f"   - {row['Feature']}: {row['Coefficient']:.4f} (OR: {row['Odds Ratio']:.4f})")

print("\n3. FEATURES THAT DECREASE HIT PROBABILITY:")
top_negative = feature_importance.nsmallest(3, 'Coefficient')
for idx, row in top_negative.iterrows():
    print(f"   - {row['Feature']}: {row['Coefficient']:.4f} (OR: {row['Odds Ratio']:.4f})")

print("\n4. RECOMMENDATIONS FOR SPOTIFY:")
print("   - Focus on promoting songs with high danceability and energy")
print("   - The 'mood' of a song (combination of valence and energy) is crucial")
print("   - Songs with extreme instrumentalness or acousticness may need targeted marketing")
print("   - Consider these features in playlist curation algorithms")

print("\n" + "=" * 80)

## 10. Model Limitations and Future Work

### Limitations:
1. **Temporal bias**: The model doesn't account for changing music trends over time
2. **External factors**: Marketing budget, artist popularity, and playlist placement aren't included
3. **Genre specificity**: Different genres may have different success patterns
4. **Definition of 'hit'**: Using a popularity threshold (here, the 75th percentile) is somewhat arbitrary
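
One way to gauge how much the last limitation matters is to check how the class balance shifts as the threshold moves. The sketch below uses randomly generated integer popularity scores as a stand-in, so the numbers it prints are hypothetical and do not describe the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical integer popularity scores (not the real dataset)
rng = np.random.default_rng(1)
popularity = pd.Series(rng.integers(0, 90, size=2000), name="popularity")

rows = []
for q in (0.50, 0.75, 0.90):
    threshold = popularity.quantile(q)
    # Strictly greater than the threshold, matching the notebook's definition
    hit_rate = (popularity > threshold).mean()
    rows.append({"quantile": q, "threshold": threshold, "hit_rate": hit_rate})

sensitivity = pd.DataFrame(rows)
print(sensitivity)
```

Because popularity scores are integers with many ties, the share of songs strictly above a quantile can fall short of the nominal fraction, so the hit rate is worth checking directly rather than assuming it equals 25%.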

### Possible Future Improvements:
1. **Time-series analysis**: Incorporate release date and trend analysis
2. **Genre-specific models**: Build separate models for different genres
3. **Ensemble methods**: Combine logistic regression with tree-based models
4. **Additional features**: Include artist features, label information, social media metrics
5. **Multi-class classification**: Predict levels of success (flop, moderate, hit, mega-hit)
6. **A/B testing framework**: Test model predictions against actual playlist performance