# Binary Modeling: Identifying Short-Lived Tracks

Objective:  
Evaluate whether weak but real audio-feature signals can be combined to
identify tracks that are unlikely to sustain chart retention.

This is framed as a **risk-filtering** task rather than exact prediction.


In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)


In [2]:
DATA_PATH = r"D:\Data_analysis_BI\Project\Spotify\data\spotify_top_songs_audio_features.csv"

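df = pd.read_csv(DATA_PATH)

# Binary target: 1 if the track spent 5 or fewer weeks on the chart, else 0
df["short_lived"] = (df["weeks_on_chart"] <= 5).astype(int)

# Class balance as proportions
df["short_lived"].value_counts(normalize=True)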


short_lived
1    0.545217
0    0.454783
Name: proportion, dtype: float64
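
With about 55% of tracks labeled short-lived, a classifier that always predicts class 1 would already reach about 55% accuracy, which is a useful yardstick for the model results that follow. A minimal sketch of that baseline, using a small made-up label array rather than the project data (`y_demo` and the other names are illustrative assumptions):

```python
import numpy as np

# Hypothetical labels with a similar class balance (not the project data).
y_demo = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Majority-class baseline: always predict the most frequent label.
majority_label = np.bincount(y_demo).argmax()
baseline_pred = np.full_like(y_demo, majority_label)
baseline_accuracy = (baseline_pred == y_demo).mean()

print(majority_label, round(baseline_accuracy, 3))
```

Any model accuracy should be read against this number rather than against 50%.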

In [5]:
features = [
    "danceability",
    "energy",
    "valence",
    "loudness",
    "speechiness",
    "acousticness",
    "instrumentalness",
    "liveness",
    "tempo"
]

X = df[features]
y = df["short_lived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=42
)



In [6]:
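# Fit the scaler on the training split only, then apply it to the test split,
# so test-set statistics do not leak into training.
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)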

log_reg = LogisticRegression(
    class_weight="balanced",
    max_iter=1000,
    random_state=42
)

log_reg.fit(X_train_scaled, y_train)

y_pred = log_reg.predict(X_test_scaled)
y_prob = log_reg.predict_proba(X_test_scaled)[:, 1]
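
The cell above scales with statistics from the training split only. If the same model were cross-validated, one way to keep that property in every fold is to wrap the scaler and classifier in a pipeline. The sketch below assumes synthetic data from `make_classification` as a stand-in for the nine audio features; it is not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine audio features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=4, random_state=42
)

# The pipeline refits the scaler inside each training fold,
# so validation rows never influence the scaling statistics.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("Fold AUCs:", auc_scores.round(3))
print("Mean AUC:", auc_scores.mean().round(3))
```

The spread of the fold scores also gives a rough sense of how much a single train/test split can vary.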


In [7]:
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))


              precision    recall  f1-score   support

           0       0.52      0.62      0.56       741
           1       0.62      0.51      0.56       888

    accuracy                           0.56      1629
   macro avg       0.57      0.57      0.56      1629
weighted avg       0.57      0.56      0.56      1629

ROC AUC: 0.582627414864257


In [8]:
coef_df = pd.DataFrame({
    "feature": features,
    "coefficient": log_reg.coef_[0]
}).sort_values(by="coefficient")

coef_df


            feature  coefficient
2           valence    -0.186595
3          loudness    -0.179311
0      danceability    -0.110600
8             tempo     0.014286
5      acousticness     0.021499
6  instrumentalness     0.032827
1            energy     0.060317
7          liveness     0.098047
4       speechiness     0.272345


### Logistic Regression Results

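The logistic regression model achieves a ROC-AUC of ~0.58, indicating modest
but non-random discriminative ability. At the default 0.5 threshold, recall for
short-lived tracks is ~0.51 with precision ~0.62, so the model captures roughly
half of early-exit cases using audio features alone. Overall accuracy (0.56) is
only slightly above the ~0.55 share of the majority class.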

These results are consistent with prior statistical analysis: audio features
contain real but weak signal. While insufficient for precise prediction, they
are suitable for coarse risk filtering and prioritization tasks.
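
In a risk-filtering setting, the practical question is which probability cutoff to use and how much precision is traded for recall at each one. The sketch below illustrates that trade-off on made-up labels and weakly informative scores; `filter_stats` is a hypothetical helper, and none of the numbers come from the project data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical labels and weakly informative scores (not the project data).
y_true = rng.integers(0, 2, size=2000)
scores = np.clip(
    0.5 + 0.1 * (y_true - 0.5) + rng.normal(0, 0.15, size=2000), 0, 1
)

def filter_stats(y, p, threshold):
    """Precision, recall, and flagged share for the rule `p >= threshold`."""
    flagged = p >= threshold
    n_flagged = flagged.sum()
    true_pos = (flagged & (y == 1)).sum()
    precision = true_pos / n_flagged if n_flagged else float("nan")
    recall = true_pos / (y == 1).sum()
    return precision, recall, flagged.mean()

for t in (0.4, 0.5, 0.6, 0.7):
    precision, recall, share = filter_stats(y_true, scores, t)
    print(
        f"threshold={t:.1f}  precision={precision:.2f}  "
        f"recall={recall:.2f}  flagged={share:.2f}"
    )
```

Raising the threshold flags fewer tracks and lowers recall; whether precision rises enough to justify that depends on how the flagged list will be used.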


Coefficient inspection provides additional interpretability. Higher **valence**,
**loudness**, and **danceability** are associated with reduced short-lived risk,
while **speechiness** shows the strongest positive association with early chart
exit. **Liveness** and **energy** contribute smaller risk-increasing effects,
whereas **acousticness**, **instrumentalness**, and **tempo** exhibit negligible
impact.

These directions are consistent with earlier EDA and statistical tests,
reinforcing that observed effects are coherent but limited in magnitude.
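
Because the features were standardized before fitting, each coefficient is the change in log-odds of a short-lived outcome per one standard deviation of that feature. Exponentiating turns this into an odds ratio, which is often easier to read. A short sketch using the coefficient values from the output above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the logistic regression output above.
coef = pd.Series({
    "valence": -0.186595,
    "loudness": -0.179311,
    "danceability": -0.110600,
    "tempo": 0.014286,
    "acousticness": 0.021499,
    "instrumentalness": 0.032827,
    "energy": 0.060317,
    "liveness": 0.098047,
    "speechiness": 0.272345,
})

# exp(coefficient) = multiplicative change in the odds of being short-lived
# for a one-standard-deviation increase in that (standardized) feature.
odds_ratio = np.exp(coef).sort_values()

print(odds_ratio.round(3))
```

Under this reading, a one-standard-deviation increase in speechiness multiplies the odds of a short-lived outcome by roughly 1.31, while the same increase in valence multiplies them by roughly 0.83.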



### Non-Linear Model (Random Forest)

In [11]:
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=6,
    class_weight="balanced",
    random_state=42
)

# Tree ensembles are insensitive to feature scale, so the unscaled features are used.
rf.fit(X_train, y_train)

rf_pred = rf.predict(X_test)
rf_prob = rf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, rf_pred))
print("ROC AUC:", roc_auc_score(y_test, rf_prob))


              precision    recall  f1-score   support

           0       0.54      0.55      0.54       741
           1       0.62      0.60      0.61       888

    accuracy                           0.58      1629
   macro avg       0.58      0.58      0.58      1629
weighted avg       0.58      0.58      0.58      1629

ROC AUC: 0.6142926833716307


In [12]:
importances = pd.DataFrame({
    "feature": features,
    "importance": rf.feature_importances_
}).sort_values(by="importance", ascending=False)

importances


            feature  importance
4       speechiness    0.168370
3          loudness    0.159146
2           valence    0.146990
7          liveness    0.109897
0      danceability    0.108225
8             tempo    0.106298
5      acousticness    0.093167
1            energy    0.066185
6  instrumentalness    0.041722
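
The values above are impurity-based importances, which are computed from the training data and carry no sign. A common complement is permutation importance on a held-out split, which measures how much the score drops when a single feature is shuffled. The sketch below shows the idea on synthetic data with hypothetical feature names; it is an illustration, not something the notebook ran.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names (not the project data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

rf_demo = RandomForestClassifier(
    n_estimators=100, max_depth=6, class_weight="balanced", random_state=42
).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out split and record the AUC drop.
result = permutation_importance(
    rf_demo, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)

perm_df = pd.DataFrame({
    "feature": names,
    "impurity_importance": rf_demo.feature_importances_,
    "permutation_importance": result.importances_mean,
}).sort_values("permutation_importance", ascending=False)

print(perm_df)
```

If the two rankings agree, that adds some confidence that the ordering is not an artifact of how impurity-based importances are computed.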


### Random Forest Interpretation

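The Random Forest model provides a small performance improvement (ROC-AUC ~0.61
vs. ~0.58; recall for short-lived tracks 0.60 vs. 0.51), suggesting that
nonlinear effects and feature interactions add only limited signal. The top
three features by importance (speechiness, loudness, valence) match the
largest-magnitude logistic regression coefficients and the earlier statistical
results, reinforcing that the identified effects are stable but weak. Note that
these importances show how much a feature is used, not the direction of its
effect.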
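
A gap of a few hundredths in ROC-AUC can fall within sampling noise on a test set of this size. One rough check is a paired bootstrap over test rows, which resamples the same rows for both models and looks at the distribution of the AUC difference. The sketch below uses made-up labels and scores for two hypothetical models, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical test labels and two models' scores (not the project data).
n = 1500
y_true = rng.integers(0, 2, size=n)
score_a = 0.2 * y_true + rng.normal(0, 1, size=n)  # weaker model
score_b = 0.4 * y_true + rng.normal(0, 1, size=n)  # slightly stronger model

# Paired bootstrap: resample the same rows for both models each round.
diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    if y_true[idx].min() == y_true[idx].max():
        continue  # AUC is undefined when a resample has a single class
    diffs.append(
        roc_auc_score(y_true[idx], score_b[idx])
        - roc_auc_score(y_true[idx], score_a[idx])
    )

low, high = np.percentile(diffs, [2.5, 97.5])
print(f"AUC difference, 95% bootstrap interval: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the improvement is unlikely to be a sampling artifact; if it straddles zero, the two models are hard to separate on that test set.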


## Modeling Summary

Both linear and nonlinear models confirm that audio features provide limited but
consistent signal for identifying tracks at risk of early chart exit.

Logistic regression highlights interpretable effects: higher valence, loudness,
and danceability are associated with longer retention, while speechiness and
liveness increase early-exit risk.

A Random Forest model modestly improves discrimination (ROC-AUC ≈ 0.61),
suggesting that feature interactions add incremental value. Feature importance
rankings closely align with statistical and EDA findings.

Overall, the results support a **risk-filtering** use case rather than precise
prediction, consistent with real-world media and content analytics workflows.
