# Song Post-Release Success Prediction Using Logistic Regression

This notebook explores a logistic regression model to predict whether a song will sustain high listener engagement after release, using core Spotify audio features as inputs. The project focuses on understanding the full machine learning workflow, including data inspection, feature selection, label engineering, model training, evaluation using classification metrics, and interpretation of results.

**Goal:** Predict post-release song success using logistic regression

**Tools:** Python, pandas, NumPy, scikit-learn, matplotlib

### Import Necessary Tools and Libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)
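
As a quick illustration of how the imported metrics relate to each other, the sketch below computes them on a small set of made-up labels and predictions. The arrays `y_true_demo` and `y_pred_demo` are invented for this example and have nothing to do with the Spotify data.

```python
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

# Hypothetical labels and predictions, used only to illustrate the metrics
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

acc_demo = accuracy_score(y_true_demo, y_pred_demo)     # (TP + TN) / total
prec_demo = precision_score(y_true_demo, y_pred_demo)   # TP / (TP + FP)
rec_demo = recall_score(y_true_demo, y_pred_demo)       # TP / (TP + FN)
f1_demo = f1_score(y_true_demo, y_pred_demo)            # harmonic mean of precision and recall

print(cm_demo)
print(acc_demo, prec_demo, rec_demo, f1_demo)
```

Here there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four scores come out to 0.75.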

### Load the Dataset

In [None]:
df = pd.read_csv("spotify_2015_2025_85k.csv")

### Inspect Data

In [None]:
df.info()

In [None]:
df.head()

### Create Copy of Dataset

In [None]:
df_model = df.copy()

### Create Binary Target Variable (success)
To obtain a clear binary classification target, post-release song success is defined from the distribution of stream_count. Songs at or above the 75th percentile of stream counts are labeled 1 (high sustained listener engagement), and songs at or below the 25th percentile are labeled 0 (low engagement). Songs in the middle 50% of the distribution are excluded because they do not clearly indicate either success or failure and may add noise to training.

Removing these ambiguous samples lets the logistic regression model learn from high-confidence examples, which should give a more stable and interpretable decision boundary. Finally, the success labels are cast to integers (0 or 1) so the target is compatible with scikit-learn classifiers and evaluation metrics.

In [None]:
# Define thresholds based on stream count distribution
high_threshold = df_model['stream_count'].quantile(0.75)
low_threshold = df_model['stream_count'].quantile(0.25)

# Create binary success label
df_model['success'] = df_model['stream_count'].apply(
    lambda x: 1 if x >= high_threshold else (0 if x <= low_threshold else None)
)

# Remove ambiguous middle cases and ensure integer labels
df_model = df_model.dropna(subset=['success'])
df_model['success'] = df_model['success'].astype(int)
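
The same quartile labeling can also be written without `apply`, using boolean masks. The sketch below is one possible vectorized variant, run on synthetic stream counts rather than the real dataset; the names `toy` and `outer` are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the stream_count column (not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"stream_count": rng.integers(1_000, 1_000_000, size=1_000)})

# 25th and 75th percentile thresholds
low, high = toy["stream_count"].quantile([0.25, 0.75])

# Keep only the two outer quartiles, then label the upper one as 1
outer = toy[(toy["stream_count"] <= low) | (toy["stream_count"] >= high)].copy()
outer["success"] = (outer["stream_count"] >= high).astype(int)

# The two classes should be roughly balanced by construction
label_counts = outer["success"].value_counts()
print(label_counts)
```

Because both classes are taken from quartiles of the same size, the resulting target is close to balanced, which makes accuracy easier to interpret later on.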


In [None]:
df_model.info()

### Clean & Select Relevant Data

In [None]:
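# Keep only the audio features and the target in the modeling DataFrame
# (assigning back to df_model keeps the original df untouched)
df_model = df_model[
    [
        'duration_ms',
        'danceability',
        'energy',
        'key',
        'loudness',
        'mode',
        'instrumentalness',
        'tempo',
        'success'
    ]
].copy()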


### Split into Features (X) and Target (y)

In [None]:
X = df_model[
    [
        'duration_ms',
        'danceability',
        'energy',
        'key',
        'loudness',
        'mode',
        'instrumentalness',
        'tempo'
    ]
]
y = df_model['success']

### Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
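
An optional variation on this split is to pass `stratify=y`, which keeps the class proportions the same in the training and test sets. The sketch below shows the effect on a small synthetic dataset; `X_toy` and `y_toy` are invented for the example and are not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and a perfectly balanced binary target (illustration only)
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "energy": rng.random(200),
    "tempo": rng.uniform(60, 180, 200),
})
y_toy = pd.Series(np.repeat([0, 1], 100), name="success")

# stratify=y_toy preserves the 50/50 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

With the quartile-based labels used here the classes are already close to balanced, so stratification mostly removes small random imbalances between the two subsets.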

### Create the Model

In [None]:
model = LogisticRegression()
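
The selected features sit on very different scales (`duration_ms` is in the hundreds of thousands, while `energy` lies between 0 and 1), which can slow down or prevent solver convergence with the default settings. One common option, shown here as a sketch and not as the notebook's method, is to standardize the features inside a pipeline and raise `max_iter`. The data below is synthetic, and `scaled_model`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales (illustration only)
rng = np.random.default_rng(0)
n = 400
duration_ms = rng.uniform(120_000, 360_000, n)
energy = rng.random(n)
X_demo = np.column_stack([duration_ms, energy])

# Made-up target that depends on energy plus a little noise
y_demo = (energy + rng.normal(0, 0.1, n) > 0.5).astype(int)

# Standardize features, then fit logistic regression with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

train_accuracy = scaled_model.score(X_demo, y_demo)
print(round(train_accuracy, 3))
```

Standardizing also puts the fitted coefficients on a comparable scale, which helps when interpreting which features matter most.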