## Churn Prediction of Spotify Users by applying Logistic Regression and BorderlineSMOTE 

Dataset: https://www.kaggle.com/datasets/nabihazahid/spotify-dataset-for-churn-analysis

**Importing relevant Libraries**

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [13]:
import os
from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate Kaggle API
api = KaggleApi()
api.authenticate()

# Download the dataset
api.dataset_download_files('nabihazahid/spotify-dataset-for-churn-analysis', path='spotify_churn', unzip=True)

print("Dataset downloaded to 'spotify_churn' folder.")

Dataset URL: https://www.kaggle.com/datasets/nabihazahid/spotify-dataset-for-churn-analysis
Dataset downloaded to 'spotify_churn' folder.


**Loading Dataset**

In [14]:
df = pd.read_csv('spotify_churn/spotify_churn_dataset.csv')

In [15]:
df

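|  | user_id | gender | age | country | subscription_type | listening_time | songs_played_per_day | skip_rate | device_type | ads_listened_per_week | offline_listening | is_churned |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | 54 | CA | Free | 26 | 23 | 0.20 | Desktop | 31 | 0 | 1 |
| 1 | 2 | Other | 33 | DE | Family | 141 | 62 | 0.34 | Web | 0 | 1 | 0 |
| 2 | 3 | Male | 38 | AU | Premium | 199 | 38 | 0.04 | Mobile | 0 | 1 | 1 |
| 3 | 4 | Female | 22 | CA | Student | 36 | 2 | 0.31 | Mobile | 0 | 1 | 0 |
| 4 | 5 | Other | 29 | US | Family | 250 | 57 | 0.36 | Mobile | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7995 | 7996 | Other | 44 | DE | Student | 237 | 36 | 0.30 | Mobile | 0 | 1 | 1 |
| 7996 | 7997 | Male | 34 | AU | Premium | 61 | 64 | 0.59 | Mobile | 0 | 1 | 0 |
| 7997 | 7998 | Female | 17 | US | Free | 81 | 62 | 0.33 | Desktop | 5 | 0 | 0 |
| 7998 | 7999 | Female | 34 | IN | Student | 245 | 94 | 0.27 | Desktop | 0 | 1 | 0 |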


**Selecting relevant features**

In [16]:
df = df.iloc[:,1:12]
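
Positional slicing with `iloc[:, 1:12]` depends on the column order of the CSV. A minimal sketch of a name-based alternative, shown on a small hypothetical frame (`toy`) that reuses a few of the dataset's column names:

```python
import pandas as pd

# Hypothetical miniature frame that mimics a few of the dataset's columns
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "gender": ["Female", "Other", "Male"],
    "age": [54, 33, 38],
    "is_churned": [1, 0, 1],
})

# Drop the identifier by name instead of by position; it is a row label,
# not a behavioural feature, and should not reach the model
toy_features = toy.drop(columns=["user_id"])
print(toy_features.columns.tolist())
```

Dropping by name keeps working if the column order of the file changes.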

In [17]:
df

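|  | gender | age | country | subscription_type | listening_time | songs_played_per_day | skip_rate | device_type | ads_listened_per_week | offline_listening | is_churned |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 54 | CA | Free | 26 | 23 | 0.20 | Desktop | 31 | 0 | 1 |
| 1 | Other | 33 | DE | Family | 141 | 62 | 0.34 | Web | 0 | 1 | 0 |
| 2 | Male | 38 | AU | Premium | 199 | 38 | 0.04 | Mobile | 0 | 1 | 1 |
| 3 | Female | 22 | CA | Student | 36 | 2 | 0.31 | Mobile | 0 | 1 | 0 |
| 4 | Other | 29 | US | Family | 250 | 57 | 0.36 | Mobile | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7995 | Other | 44 | DE | Student | 237 | 36 | 0.30 | Mobile | 0 | 1 | 1 |
| 7996 | Male | 34 | AU | Premium | 61 | 64 | 0.59 | Mobile | 0 | 1 | 0 |
| 7997 | Female | 17 | US | Free | 81 | 62 | 0.33 | Desktop | 5 | 0 | 0 |
| 7998 | Female | 34 | IN | Student | 245 | 94 | 0.27 | Desktop | 0 | 1 | 0 |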


In [18]:
df = pd.get_dummies(df, columns=['gender', 'country', 'device_type','subscription_type'])
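
`pd.get_dummies` replaces each listed categorical column with one indicator column per category, named `<column>_<category>`. A small sketch on a hypothetical three-row frame (`toy`), assuming 0/1 integers are preferred over the boolean dtype that recent pandas versions produce by default:

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [54, 33, 38],
    "device_type": ["Desktop", "Web", "Mobile"],
})

# dtype=int yields 0/1 integers instead of True/False indicators
encoded = pd.get_dummies(toy, columns=["device_type"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1 within the `device_type_*` group.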

In [22]:
features = df.columns
features = features.drop('is_churned')
features

Index(['age', 'listening_time', 'songs_played_per_day', 'skip_rate',
       'ads_listened_per_week', 'offline_listening', 'gender_Female',
       'gender_Male', 'gender_Other', 'country_AU', 'country_CA', 'country_DE',
       'country_FR', 'country_IN', 'country_PK', 'country_UK', 'country_US',
       'device_type_Desktop', 'device_type_Mobile', 'device_type_Web',
       'subscription_type_Family', 'subscription_type_Free',
       'subscription_type_Premium', 'subscription_type_Student'],
      dtype='object')

In [23]:
target = ['is_churned']

**Training and Testing Splits**

In [24]:
X = df[features]
y = df[target]

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
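
With an imbalanced target, a stratified split keeps the churn rate the same in the training and test sets. A minimal sketch on placeholder features (`X_toy` is random noise, an assumption for illustration) with the same class counts as this dataset's target (5929 and 2071):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(8000, 3))            # placeholder features
y_toy = np.array([0] * 5929 + [1] * 2071)     # class counts of is_churned

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(round(y_toy.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```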

**Applying Logistic Regression Model**

In [28]:
# Fit on the training split only, so the test rows stay unseen during training
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train.values.ravel())  # ravel() passes the target as a 1-D array


In [29]:
# Predict churn labels for the held-out test set

y_pred = model.predict(X_test)

In [30]:
# Evaluate the model on the held-out test set
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.75

Classification Report:
              precision    recall  f1-score   support

           0       0.75      1.00      0.86      1200
           1       0.00      0.00      0.00       400

    accuracy                           0.75      1600
   macro avg       0.38      0.50      0.43      1600
weighted avg       0.56      0.75      0.64      1600
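
The report above shows recall 1.00 for class 0 and 0.00 for class 1, which is what a model that predicts "not churned" for every user would produce. A short sketch, using only the supports from the report (1200 and 400), reproduces the same accuracy with such a constant prediction:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Test-set labels rebuilt from the supports in the report above
y_true = np.array([0] * 1200 + [1] * 400)
y_all_zero = np.zeros_like(y_true)  # predict "not churned" for everyone

acc = accuracy_score(y_true, y_all_zero)
# zero_division=0 reports 0.0 instead of warning when no positives are predicted
prec = precision_score(y_true, y_all_zero, zero_division=0)
rec = recall_score(y_true, y_all_zero, zero_division=0)
print(acc, prec, rec)
```

An accuracy of 0.75 is therefore the majority-class baseline here, and it says nothing about how well churners are detected.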





**Addressing Class Imbalance**

***The dataset contains fewer records of churned users (target value 1) than of retained users (target value 0), so SMOTE is applied to oversample the churned class and improve predictions for it.***

In [32]:
y.value_counts()

is_churned
0             5929
1             2071
Name: count, dtype: int64

In [33]:
from imblearn.over_sampling import SMOTE

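# Oversample the minority (churned) class until both classes have equal counts.
# Resampling runs before the train/test split, so synthetic rows also land in the test set.
smote = SMOTE(sampling_strategy='minority')
X_resampled, y_resampled = smote.fit_resample(X, y)
y_resampled.value_counts()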

is_churned
0             5929
1             5929
Name: count, dtype: int64
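
SMOTE creates each synthetic minority row by interpolating between a minority sample and one of its minority-class nearest neighbours: `x_new = x_i + lam * (x_nn - x_i)` with `lam` drawn uniformly from [0, 1]. A minimal NumPy sketch of that single step, with two made-up rows (age, listening_time, skip_rate, and one one-hot indicator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical churned users; the last entry is a one-hot indicator
x_i = np.array([30.0, 120.0, 0.20, 1.0])
x_nn = np.array([34.0, 150.0, 0.30, 0.0])

lam = rng.uniform(0.0, 1.0)        # interpolation weight
x_new = x_i + lam * (x_nn - x_i)   # synthetic sample on the segment x_i -> x_nn
print(lam, x_new)
```

Note that the indicator of the synthetic row is fractional whenever the two parents disagree on it, which no real user can have. imbalanced-learn also provides `SMOTENC` for data that mixes numeric and categorical features; whether it would change the results here has not been tested in this notebook.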

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.20)

In [35]:
# max_iter is raised from the default of 100 to give lbfgs more room to converge
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train.values.ravel())  # ravel() passes the target as a 1-D array

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
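
The convergence warning above suggests scaling the data, and `StandardScaler` is imported at the top of the notebook but not used. A minimal sketch of how scaling could be added, shown on synthetic data from `make_classification` with deliberately mismatched feature scales (an assumption for illustration, not the Spotify data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=0)
X_syn = X_syn * np.array([1, 10, 100, 1000, 1, 0.01])  # mimic mixed feature scales

# The scaler is fitted inside the pipeline, so it only ever sees the data passed to fit()
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_syn, y_syn)

n_iter = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("lbfgs iterations used:", n_iter)
```

Standardised features usually let lbfgs converge well before `max_iter` is reached.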


In [36]:
y_pred = model.predict(X_test)

In [37]:
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.82

Classification Report:
              precision    recall  f1-score   support

           0       0.74      1.00      0.85      1196
           1       1.00      0.64      0.78      1176

    accuracy                           0.82      2372
   macro avg       0.87      0.82      0.82      2372
weighted avg       0.87      0.82      0.82      2372



**Applying BorderlineSMOTE to further improve the predictions.**

In [39]:
from imblearn.over_sampling import BorderlineSMOTE

blsmote = BorderlineSMOTE(sampling_strategy='minority', kind='borderline-1')
X_resampled, y_resampled = blsmote.fit_resample(X, y)
y_resampled.value_counts()

is_churned
0             5929
1             5929
Name: count, dtype: int64

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.15)

In [41]:
# max_iter is raised from the default of 100 to give lbfgs more room to converge
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train.values.ravel())  # ravel() passes the target as a 1-D array

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [42]:
y_pred = model.predict(X_test)

In [43]:
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.83

Classification Report:
              precision    recall  f1-score   support

           0       0.75      1.00      0.85       903
           1       1.00      0.65      0.79       876

    accuracy                           0.83      1779
   macro avg       0.87      0.82      0.82      1779
weighted avg       0.87      0.83      0.82      1779



## Project Summary: Spotify User Churn Prediction

### What I Did
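- **Data Preparation:** Loaded the Spotify churn dataset (8,000 users), dropped the `user_id` column, and one-hot encoded `gender`, `country`, `device_type`, and `subscription_type`.  
- **Class Imbalance Handling:** Applied both **SMOTE** and **BorderlineSMOTE (blSMOTE)** to balance churn vs. non-churn classes (5,929 rows each after resampling).  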
- **Model Development:** Built a **Logistic Regression model** (multinomial, lbfgs solver) to predict churn.  
- **Evaluation:** Assessed performance using **accuracy score, precision, recall, and F1 metrics**.  
- **Validation:** Generated classification reports to compare model effectiveness across sampling techniques.  

### Results

#### 🔹 SMOTE
- **Accuracy:** 0.82  
- **Classification Report:**

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.74      | 1.00   | 0.85     | 1196    |
| 1     | 1.00      | 0.64   | 0.78     | 1176    |
| **Accuracy** |       |        | **0.82** | 2372    |
| **Macro Avg** | 0.87 | 0.82   | 0.82     | 2372    |
| **Weighted Avg** | 0.87 | 0.82 | 0.82     | 2372    |


#### 🔹 BorderlineSMOTE (blSMOTE)
- **Accuracy:** 0.83  
- **Classification Report:**

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.75      | 1.00   | 0.85     | 903     |
| 1     | 1.00      | 0.65   | 0.79     | 876     |
| **Accuracy** |       |        | **0.83** | 1779    |
| **Macro Avg** | 0.87 | 0.82   | 0.82     | 1779    |
| **Weighted Avg** | 0.87 | 0.83 | 0.82     | 1779    |

### Achievements
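- Compared two **oversampling techniques (SMOTE vs. blSMOTE)** for churn prediction; they gave nearly identical results (accuracy 0.82 vs. 0.83).  
- Reached **accuracy of 0.82 to 0.83** on the resampled test splits, compared with 0.75 on the original test split without oversampling. For the churn class, precision was 1.00 and recall 0.64 to 0.65; for the non-churn class, precision was 0.74 to 0.75 and recall 1.00.  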
- Showed ability to **extract, clean, and model user behavior data** to inform retention strategies.  
- Highlighted how **machine learning techniques** can support **audience engagement and campaign success** in the music industry.  

---

### Key Takeaway
This project illustrates how predictive modeling can identify at-risk users, enabling platforms and labels to design **data-driven retention campaigns** that improve audience loyalty and reduce churn.
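
One way such a model could feed a retention campaign is to rank users by predicted churn probability rather than by hard 0/1 labels. A minimal sketch on synthetic data (`make_classification` stands in for the user table, and `top_k` is a hypothetical campaign size); for brevity the model is fitted and scored on the same rows, whereas in practice it would score users it has not seen:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_syn, y_syn = make_classification(
    n_samples=1000, n_features=6, weights=[0.74, 0.26], random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

# Column 1 of predict_proba is the estimated probability of class 1 (churn)
churn_prob = clf.predict_proba(X_syn)[:, 1]

top_k = 10  # hypothetical number of users the campaign can reach
at_risk = np.argsort(churn_prob)[::-1][:top_k]  # row indices, highest risk first
print(at_risk, churn_prob[at_risk].round(2))
```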


