# Exploratory Data Analysis: Spotify Bot vs. Human Behavior

This notebook covers Phase 2 of the MSDS 720 Final Project:
- Load and inspect the raw Spotify User Behavior dataset
- Clean the data and handle missing values
- Engineer continuous features from categorical survey responses
- Produce descriptive statistics and visualizations
- Export the cleaned dataset for modeling in Phase 3

## 1. Setup

In [None]:
import sys
import os

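# Add the project root (the parent of the current working directory) to sys.path
# so we can import from src/. This assumes the notebook runs from a subfolder of the project.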
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "..")))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_loader import load_raw_data, clean_data, engineer_features, load_and_clean_data
from src.eda import (
    descriptive_stats,
    plot_histograms,
    plot_boxplots,
    plot_scatterplot,
    plot_pairplot,
    plot_correlation_matrix,
    plot_class_balance,
)

%matplotlib inline
pd.set_option("display.max_columns", 30)

## 2. Raw Data Overview

In [None]:
RAW_PATH = os.path.join("..", "data", "raw", "Spotify_data.xlsx")
df_raw = load_raw_data(RAW_PATH)

print(f"Shape: {df_raw.shape[0]} rows x {df_raw.shape[1]} columns")
df_raw.head()

In [None]:
print("Data types:")
print(df_raw.dtypes)
print()
print("Missing values:")
missing = df_raw.isnull().sum()  # compute the per-column counts once
print(missing[missing > 0])

## 3. Data Cleaning

Steps performed by `clean_data()`:
- Standardize column names to snake_case
- Consolidate rare genre categories
- Fill missing values in podcast-related columns with "Unknown"
- Fill missing premium plan values with "None"
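
The source of `clean_data()` is not shown in this notebook, so the following is only a minimal sketch of how the four steps above could look in pandas. The helper names (`to_snake_case`, `clean_data_sketch`), the `min_count` threshold, the "Other" label for rare genres, and the `pod_` / `premium_plan` column names are assumptions made for illustration; only `fav_music_genre` appears in this notebook.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a label such as 'Fav Music Genre' to 'fav_music_genre'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()


def clean_data_sketch(df: pd.DataFrame, min_count: int = 2) -> pd.DataFrame:
    """Hypothetical stand-in for clean_data(); thresholds and labels are assumed."""
    out = df.copy()
    out.columns = [to_snake_case(c) for c in out.columns]

    # Consolidate genres seen fewer than min_count times into "Other" (assumed label).
    counts = out["fav_music_genre"].value_counts()
    rare = counts[counts < min_count].index
    genre = out["fav_music_genre"]
    out["fav_music_genre"] = genre.where(~genre.isin(rare), "Other")

    # Fill gaps: "Unknown" for podcast columns, the string "None" for the premium plan.
    podcast_cols = [c for c in out.columns if c.startswith("pod_")]
    out[podcast_cols] = out[podcast_cols].fillna("Unknown")
    out["premium_plan"] = out["premium_plan"].fillna("None")
    return out


# Toy data, not taken from the survey.
raw = pd.DataFrame({
    "Fav Music Genre": ["Pop", "Pop", "Rock", "Rock", "Polka"],
    "Pod Lis Frequency": ["Daily", None, "Never", None, "Weekly"],
    "Premium Plan": ["Family", None, "Student", None, None],
})
cleaned = clean_data_sketch(raw)
print(cleaned)
```

Note that "None" here is an ordinary string, not Python's `None`, which is why the missing-value count in the next cell can reach zero.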

In [None]:
df_clean = clean_data(df_raw)

print("Columns after cleaning:")
print(df_clean.columns.tolist())
print()
print("Missing values after cleaning:")
print(df_clean.isnull().sum().sum(), "total missing")

In [None]:
print("Genre distribution after consolidation:")
print(df_clean["fav_music_genre"].value_counts())

## 4. Feature Engineering

Because the raw dataset consists entirely of categorical and ordinal survey responses, we engineer continuous proxy variables for regression analysis, plus a binary `bot_like` label:

| Engineered Variable | Source Column(s) | Logic |
|---|---|---|
| `age_numeric` | `age` | Midpoint of age bracket |
| `listening_time` | `music_time_slot`, `music_lis_frequency` | Base hours by time slot, scaled by number of listening contexts |
| `skip_rate` | `music_recc_rating` | Inverse of recommendation satisfaction (1-5 rating scale, reversed) |
| `diversity_score` | `music_lis_frequency` | Number of listening contexts normalized to 0-1 |
| `streams` | Composite | Usage period + listening contexts + listening time + podcast engagement |
| `bot_like` | Derived | Binary flag based on high streams + low diversity + low recommendation engagement |
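
To make the table concrete, here is a rough sketch of three of the mappings (`age_numeric`, `skip_rate`, `diversity_score`). It is not the code in `engineer_features()`, which is not shown here. The age bracket labels, the comma-separated format of `music_lis_frequency`, the cap of five listening contexts, and the `6 - rating` inversion are all assumptions.

```python
import pandas as pd

# Assumed bracket labels and their midpoints; the real survey labels may differ.
AGE_MIDPOINTS = {"12-20": 16.0, "20-35": 27.5, "35-60": 47.5}
MAX_CONTEXTS = 5  # assumed upper bound on listening contexts per respondent


def count_contexts(value: str) -> int:
    """Count comma-separated listening contexts in a multi-select answer."""
    return len([part for part in value.split(",") if part.strip()])


def engineer_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical illustration of three rows of the table above."""
    out = df.copy()
    out["age_numeric"] = out["age"].map(AGE_MIDPOINTS)
    # Reverse the 1-5 satisfaction scale: a rating of 5 becomes 1, and 1 becomes 5.
    out["skip_rate"] = 6 - out["music_recc_rating"]
    n_contexts = out["music_lis_frequency"].map(count_contexts)
    out["diversity_score"] = (n_contexts / MAX_CONTEXTS).clip(upper=1.0)
    return out


# Toy responses, not taken from the dataset.
sample = pd.DataFrame({
    "age": ["12-20", "20-35", "35-60"],
    "music_recc_rating": [5, 3, 1],
    "music_lis_frequency": [
        "While Traveling",
        "Office hours, Workout session",
        "Study Hours, While Traveling, leisure time, Workout session, Office hours",
    ],
})
features = engineer_sketch(sample)
print(features[["age_numeric", "skip_rate", "diversity_score"]])
```

Because every value is a deterministic function of a survey answer, these proxies take only a handful of distinct values, which is worth remembering when reading the histograms later.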

In [None]:
df = engineer_features(df_clean)

engineered_cols = [
    "age_numeric", "listening_time", "skip_rate",
    "diversity_score", "streams", "bot_like",
]

print("Engineered features sample:")
df[engineered_cols].head(10)
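
The table describes `bot_like` as a flag for high streams, low diversity, and low recommendation engagement, but it does not give the cutoffs. One plausible way to build such a flag is with quantile thresholds, sketched below. The function name `flag_bot_like`, the quartile cutoffs, and the reading of "low recommendation engagement" as a high `skip_rate` are all assumptions, not the project's actual rule.

```python
import pandas as pd


def flag_bot_like(
    df: pd.DataFrame,
    streams_q: float = 0.75,
    diversity_q: float = 0.25,
    skip_q: float = 0.75,
) -> pd.Series:
    """Hypothetical rule: 1 only when all three conditions hold, else 0."""
    high_streams = df["streams"] >= df["streams"].quantile(streams_q)
    low_diversity = df["diversity_score"] <= df["diversity_score"].quantile(diversity_q)
    # Assumed: low engagement with recommendations shows up as a high skip_rate.
    low_engagement = df["skip_rate"] >= df["skip_rate"].quantile(skip_q)
    return (high_streams & low_diversity & low_engagement).astype(int)


# Toy accounts, not taken from the dataset.
accounts = pd.DataFrame({
    "streams": [100, 200, 300, 900, 1000],
    "diversity_score": [0.75, 0.5, 1.0, 0.25, 0.25],
    "skip_rate": [1, 2, 3, 5, 5],
})
flags = flag_bot_like(accounts)
print(accounts.assign(bot_like=flags))
```

Whatever the real cutoffs are, a label built this way is a deterministic function of the same features it will later be compared against.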

## 5. Descriptive Statistics

In [None]:
continuous_vars = [
    "age_numeric", "listening_time", "skip_rate",
    "diversity_score", "streams",
]

desc = descriptive_stats(df, continuous_vars)
desc
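
The body of `descriptive_stats()` lives in `src/eda.py` and is not reproduced here. As a rough idea of what such a helper might return, the sketch below extends `DataFrame.describe()` with the median and skewness. The function name `descriptive_stats_sketch` and the choice of extra columns are assumptions.

```python
import pandas as pd


def descriptive_stats_sketch(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical summary: one row per variable, describe() plus median and skew."""
    summary = df[columns].describe().T
    summary["median"] = df[columns].median()
    summary["skew"] = df[columns].skew()
    return summary


# Toy values, not taken from the dataset.
toy = pd.DataFrame({
    "listening_time": [1.0, 2.0, 3.0, 4.0, 10.0],
    "streams": [10.0, 20.0, 30.0, 40.0, 50.0],
})
summary = descriptive_stats_sketch(toy, ["listening_time", "streams"])
print(summary)
```

Comparing the mean with the median, or checking the sign of the skew, is a quick way to spot the long right tail in the toy `listening_time` column before looking at any plot.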

In [None]:
print("Bot-like class distribution:")
print(df["bot_like"].value_counts())
print()
print(f"Bot-like rate: {df['bot_like'].mean():.1%}")

## 6. Visualizations

### 6.1 Histograms of Continuous Variables

In [None]:
fig = plot_histograms(df, ["listening_time", "skip_rate", "diversity_score", "streams"])
plt.show()

### 6.2 Class Balance

In [None]:
fig = plot_class_balance(df, "bot_like")
plt.show()

### 6.3 Boxplots by Bot-like Status

In [None]:
for var in ["listening_time", "streams", "skip_rate", "diversity_score"]:
    fig = plot_boxplots(df, var, "bot_like")
    plt.show()

### 6.4 Scatterplots

In [None]:
fig = plot_scatterplot(df, "listening_time", "streams", hue_col="bot_like")
plt.show()

In [None]:
fig = plot_scatterplot(df, "diversity_score", "streams", hue_col="bot_like")
plt.show()

### 6.5 Correlation Matrix

In [None]:
corr_vars = [
    "listening_time", "skip_rate", "diversity_score",
    "streams", "age_numeric", "bot_like",
]
fig = plot_correlation_matrix(df, corr_vars)
plt.show()
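
To read the heatmap numerically, it can help to rank the features by their correlation with `bot_like`. The sketch below assumes `plot_correlation_matrix()` is based on Pearson correlation via `DataFrame.corr()`, which this notebook does not confirm. For a 0/1 column, Pearson's r is the same quantity as the point-biserial correlation. The data are toy values.

```python
import pandas as pd

# Toy accounts, not taken from the dataset.
toy = pd.DataFrame({
    "streams": [100, 200, 300, 900, 1000],
    "diversity_score": [0.75, 0.5, 1.0, 0.25, 0.25],
    "bot_like": [0, 0, 0, 1, 1],
})

corr = toy.corr(method="pearson")

# Rank the features by the strength of their linear association with the label.
ranked = corr["bot_like"].drop("bot_like").sort_values(key=abs, ascending=False)
print(ranked)
```

The sign shows the direction of the association and the absolute value shows its strength. Pearson's r only captures linear association, so a value near zero does not rule out a nonlinear relationship.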

### 6.6 Pairplot of Key Predictors

In [None]:
pair_vars = ["listening_time", "skip_rate", "diversity_score", "streams"]
fig = plot_pairplot(df, pair_vars, hue_col="bot_like")
plt.show()

## 7. Observations

Key findings from the EDA (to be filled in after reviewing the plots above):

1. **Class balance:** Proportion of bot-like vs. human accounts
2. **Listening time:** Distribution shape and differences by bot status
3. **Skip rate:** Whether bot-like accounts show different recommendation engagement
4. **Diversity score:** Whether low-diversity accounts cluster with bot-like labels
5. **Streams:** Whether bot-like accounts have inflated stream counts
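6. **Correlations:** Which predictors are most correlated with the outcome variables, keeping in mind that `bot_like` is derived from `streams`, diversity, and recommendation engagement, so some of its correlations hold by construction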
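
A per-class summary table can back up several of these observations with numbers rather than visual impressions alone. The sketch below shows one possible way to produce it. The helper name `summarize_by_label` is hypothetical and the data are toy values, so the printed figures say nothing about the real dataset.

```python
import pandas as pd


def summarize_by_label(df: pd.DataFrame, features: list, label: str = "bot_like"):
    """Return per-class mean and median for each feature, plus the positive-class rate."""
    by_class = df.groupby(label)[features].agg(["mean", "median"])
    rate = df[label].mean()
    return by_class, rate


# Toy accounts, not taken from the dataset.
toy = pd.DataFrame({
    "streams": [100, 200, 300, 900, 1000],
    "diversity_score": [0.75, 0.5, 1.0, 0.25, 0.25],
    "bot_like": [0, 0, 0, 1, 1],
})
by_class, rate = summarize_by_label(toy, ["streams", "diversity_score"])
print(by_class)
print(f"Bot-like rate: {rate:.1%}")
```

On the real data, the same call with `continuous_vars` as the feature list would give one row per class for items 2 through 5.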

## 8. Export Cleaned Dataset

In [None]:
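output_path = os.path.join("..", "data", "cleaned", "spotify_clean_v1.csv")
# to_csv does not create missing folders, so make sure the output directory exists.
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_csv(output_path, index=False)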
print(f"Saved cleaned dataset to {output_path}")
print(f"Final shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
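
One thing to check before Phase 3: Section 3 fills missing premium plan values with the string "None", and recent pandas versions list "None" among the default missing-value markers for `read_csv`. Reloading the exported CSV with default settings may therefore turn those entries back into NaN. The sketch below uses an in-memory buffer and toy column names (`premium_plan` and `pod_format` are assumed) to show a round trip that keeps the string intact.

```python
import io

import pandas as pd

# Toy frame with the two fill values used during cleaning.
toy = pd.DataFrame({
    "premium_plan": ["None", "Family"],
    "pod_format": ["Unknown", "Video"],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing: depending on the pandas version, "None" may be read as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)
print("NaN count with defaults:", int(default_read["premium_plan"].isna().sum()))

# Turning off the default NA markers keeps "None" as an ordinary string.
buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False)
print(safe_read)
```

With `keep_default_na=False`, empty cells are no longer read as NaN either. If the exported file can contain real gaps, passing `na_values=[""]` alongside it is one way to keep that behavior.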