# Song Post-Release Success Prediction Using Logistic Regression

This notebook explores a logistic regression model to predict whether a song will sustain high listener engagement after release, using core Spotify audio features as inputs. The project focuses on understanding the full machine learning workflow, including data inspection, feature selection, label engineering, model training, evaluation using classification metrics, and interpretation of results.

**Goal:** Predict post-release song success using logistic regression

**Tools:** Python, pandas, scikit-learn, matplotlib
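
The workflow outlined above can be sketched end to end on synthetic data. This is a minimal sketch, not the notebook's actual pipeline: the feature values are randomly generated, and the success rule (popularity at or above a hypothetical `POPULARITY_THRESHOLD` of 60) is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for two audio features (not the real dataset).
toy = pd.DataFrame({
    "danceability": rng.uniform(0, 1, n),
    "energy": rng.uniform(0, 1, n),
})
# Synthetic popularity loosely tied to the features, plus noise.
toy["popularity"] = (
    40 + 30 * toy["danceability"] + 10 * toy["energy"] + rng.normal(0, 10, n)
)

# Label engineering: hypothetical threshold, chosen only for this sketch.
POPULARITY_THRESHOLD = 60
toy["success"] = (toy["popularity"] >= POPULARITY_THRESHOLD).astype(int)

# Popularity defines the label, so it is kept out of the feature matrix.
X = toy[["danceability", "energy"]]
y = toy["success"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate with the same kinds of classification metrics the notebook imports.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(cm)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```

Because the label here is derived from popularity, popularity is left out of the inputs. Including it would let the model read the answer directly from a feature.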

### Import Necessary Tools and Libraries

In [13]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score
)

### Load the Dataset

In [14]:
df = pd.read_csv("spotify_2015_2025_85k.csv")

### Inspect Data

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85000 entries, 0 to 84999
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_id          85000 non-null  object 
 1   track_name        84979 non-null  object 
 2   artist_name       85000 non-null  object 
 3   album_name        84954 non-null  object 
 4   release_date      85000 non-null  object 
 5   genre             85000 non-null  object 
 6   duration_ms       85000 non-null  int64  
 7   popularity        85000 non-null  int64  
 8   danceability      85000 non-null  float64
 9   energy            85000 non-null  float64
 10  key               85000 non-null  int64  
 11  loudness          85000 non-null  float64
 12  mode              85000 non-null  int64  
 13  instrumentalness  85000 non-null  float64
 14  tempo             85000 non-null  float64
 15  stream_count      85000 non-null  int64  
 16  country           85000 non-null  object
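
The non-null counts above show that `track_name` and `album_name` contain a few missing values (21 and 46 rows, respectively, out of 85,000). A minimal sketch of how such gaps could be counted directly follows. The small `sample` frame is a hypothetical stand-in for `df`, not data from the real file.

```python
import pandas as pd

# Hypothetical stand-in for df with a few missing text fields.
sample = pd.DataFrame({
    "track_name": ["Song A", None, "Song C"],
    "album_name": [None, None, "Album C"],
    "popularity": [55, 45, 51],
})

# Count missing values per column and keep only columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the real data the same two lines would be applied to `df` instead of `sample`.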

In [16]:
df.head()

| | track_id | track_name | artist_name | album_name | release_date | genre | duration_ms | popularity | danceability | energy | key | loudness | mode | instrumentalness | tempo | stream_count | country | explicit | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TRK-BEBD53DA84E1 | Agent every (0) | Noah Rhodes | Beautiful instead | 2016-04-01 | Pop | 234194 | 55 | 0.15 | 0.74 | 9 | -32.22 | 0 | 0.436 | 73.12 | 13000 | Brazil | 0 | Universal Music |
| 1 | TRK-6A32496762D7 | Night respond | Jennifer Cole | Table | 2022-04-15 | Metal | 375706 | 45 | 0.44 | 0.46 | 0 | -14.02 | 0 | 0.223 | 157.74 | 1000 | France | 1 | Island Records |
| 2 | TRK-47AA7523463E | Future choice whatever | Brandon Davis | Page southern | 2016-02-23 | Rock | 289191 | 55 | 0.62 | 0.8 | 8 | -48.26 | 1 | 0.584 | 71.03 | 1000 | Germany | 1 | XL Recordings |
| 3 | TRK-25ADA22E3B06 | Bad fall pick those | Corey Jones | Spring | 2015-10-12 | Pop | 209484 | 51 | 0.78 | 0.98 | 1 | -34.47 | 1 | 0.684 | 149.0 | 1000 | France | 0 | Warner Music |
| 4 | TRK-9245F2AD996A | Husband | Mark Diaz | Great prove | 2022-07-08 | Indie | 127435 | 39 | 0.74 | 0.18 | 10 | -17.84 | 0 | 0.304 | 155.85 | 2000 | United States | 0 | Independent |


### Create Copy of Dataset

In [17]:
df_model = df.copy()

### Drop Columns That Won't Be Used for Training

In [18]:
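# Remove identifiers, free-text names, and metadata that are not model inputs.
df_model = df_model.drop(
    columns=[
        'track_id',
        'track_name',
        'artist_name',
        'album_name',
        'release_date',
        'country',
        'label'  # record label (e.g. "Universal Music"), not a prediction target
    ]
)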


### Confirm Remaining Columns

In [19]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85000 entries, 0 to 84999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   genre             85000 non-null  object 
 1   duration_ms       85000 non-null  int64  
 2   popularity        85000 non-null  int64  
 3   danceability      85000 non-null  float64
 4   energy            85000 non-null  float64
 5   key               85000 non-null  int64  
 6   loudness          85000 non-null  float64
 7   mode              85000 non-null  int64  
 8   instrumentalness  85000 non-null  float64
 9   tempo             85000 non-null  float64
 10  stream_count      85000 non-null  int64  
 11  explicit          85000 non-null  int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 7.8+ MB
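
The summary above shows that `genre` is still an `object` column, and scikit-learn's `LogisticRegression` expects numeric inputs. One common option, sketched here on a hypothetical toy frame, is one-hot encoding with `pd.get_dummies`. Whether the notebook takes this route or drops `genre` instead is not shown in this section.

```python
import pandas as pd

# Hypothetical stand-in for df_model with one categorical column.
toy = pd.DataFrame({
    "genre": ["Pop", "Metal", "Rock", "Pop"],
    "energy": [0.74, 0.46, 0.80, 0.98],
})

# One-hot encode genre; drop_first=True omits one category so the
# remaining indicator columns are not redundant with the intercept.
encoded = pd.get_dummies(toy, columns=["genre"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded.dtypes)
```

After encoding, every column is numeric, so the frame can be passed to `train_test_split` and a scikit-learn estimator.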
