# Hit Song Predictor: A Logistic Regression Approach

## Project Overview
This project uses logistic regression to classify songs as hits or non-hits from audio features such as danceability, energy, valence, and tempo. The goal is to see which of these features are associated with popularity, not to claim that they cause it.

**Key Skills Demonstrated:**
- Binary classification with logistic regression
- Feature engineering and selection
- Handling imbalanced datasets
- Model evaluation and interpretation
- Data visualization for music analytics

### Import Necessary Tools & Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

### Load the Dataset

In [None]:
df = pd.read_csv("songs_normalize.csv")

### Inspect Data

In [None]:
df.info()

In [None]:
df.head()

### Exploratory Data Analysis

In [None]:
# Check for missing values, data types, and summary statistics
print("Missing Values:")
print(df.isnull().sum())
print("\nData Types:")
print(df.dtypes)
print("\nBasic Statistics:")
df.describe()

In [None]:
# Define a 'hit' as a song whose popularity is above the 75th percentile
# Adjust the quantile (0.75) to suit the dataset
popularity_threshold = df['popularity'].quantile(0.75)
df['is_a_hit'] = (df['popularity'] > popularity_threshold).astype(int)

print(f"Popularity threshold for 'hit': {popularity_threshold}")
print("\nClass distribution:")
print(df['is_a_hit'].value_counts())
print(f"\nHit rate: {df['is_a_hit'].mean():.2%}")
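
The labeling rule above can be checked on a small example. The sketch below applies the same strict `>` comparison against the 75th percentile to made-up popularity scores. The helper name `label_hits` and all values are assumptions for illustration only and are not taken from `songs_normalize.csv`. It also shows one thing to watch for: when several songs tie at the cut-off value, the strict comparison can push the hit rate well below 25%.

```python
import pandas as pd


def label_hits(popularity: pd.Series, q: float = 0.75):
    """Return (threshold, labels); labels are 1 where popularity is strictly above the q-th quantile.

    Hypothetical helper mirroring the rule used in the cell above.
    """
    threshold = popularity.quantile(q)
    labels = (popularity > threshold).astype(int)
    return threshold, labels


# Made-up popularity scores with no ties near the cut-off
spread = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
spread_threshold, spread_labels = label_hits(spread)
print(f"threshold={spread_threshold}, hit rate={spread_labels.mean():.2%}")

# Made-up scores where several songs tie exactly at the cut-off value
tied = pd.Series([10, 20, 30, 40, 50, 80, 80, 80])
tied_threshold, tied_labels = label_hits(tied)
print(f"threshold={tied_threshold}, hit rate={tied_labels.mean():.2%}")
```

If the real data shows a hit rate noticeably under 25%, ties at the threshold are one plausible cause, and switching to `>=` or choosing a different quantile may be worth considering.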

### Visualize audio feature distributions for hits vs non-hits

#### What the visualizations are expected to show
Expected strong features:

- Danceability ⭐⭐⭐
- Energy ⭐⭐⭐
- Valence ⭐⭐

Expected Weak Features:

- Tempo, Acousticness, Instrumentalness, Liveness, Speechiness, and Loudness

In [None]:
audio_features = ['danceability', 'energy', 'valence', 'tempo', 'acousticness', 
                  'instrumentalness', 'liveness', 'speechiness', 'loudness']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

# Overlay the hit and non-hit histograms for each audio feature
for ax, feature in zip(axes, audio_features):
    ax.hist(df.loc[df['is_a_hit'] == 1, feature], alpha=0.5, label='Hit', bins=30, color='green')
    ax.hist(df.loc[df['is_a_hit'] == 0, feature], alpha=0.5, label='Not a Hit', bins=30, color='red')
    ax.set_title(f'{feature.capitalize()} Distribution')
    ax.legend()
    ax.set_xlabel(feature)
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('feature_distributions.png', dpi=300, bbox_inches='tight')
plt.show()
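
Overlaid histograms can be hard to compare by eye, especially when one class is much larger than the other. As a possible numeric complement, the sketch below ranks features by a standardized mean difference between hits and non-hits (difference in group means divided by a pooled standard deviation). This is not part of the original analysis: the helper `standardized_mean_diff` is hypothetical, and the data is synthetic, generated so that one feature separates the classes and the other does not.

```python
import numpy as np
import pandas as pd


def standardized_mean_diff(data, features, label_col="is_a_hit"):
    """Rank features by (mean of hits - mean of non-hits) / pooled standard deviation."""
    hits = data[data[label_col] == 1]
    non_hits = data[data[label_col] == 0]
    diffs = {}
    for feature in features:
        # Pooled std: square root of the average of the two sample variances
        pooled_std = np.sqrt((hits[feature].var() + non_hits[feature].var()) / 2)
        diffs[feature] = (hits[feature].mean() - non_hits[feature].mean()) / pooled_std
    result = pd.Series(diffs)
    # Order by absolute effect size, largest first
    order = result.abs().sort_values(ascending=False).index
    return result.reindex(order)


# Synthetic data (assumed values): danceability differs by class, tempo does not
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "is_a_hit": np.repeat([1, 0], n),
    "danceability": np.concatenate([rng.normal(0.7, 0.1, n), rng.normal(0.5, 0.1, n)]),
    "tempo": np.concatenate([rng.normal(120, 20, n), rng.normal(120, 20, n)]),
})

ranking = standardized_mean_diff(toy, ["danceability", "tempo"])
print(ranking)
```

On the real dataset, the same function could be called with `df` and `audio_features` to check whether the expected strong features actually separate the two classes more than the expected weak ones.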