# Rolling Stones Spotify — Cohort Analysis

**Goal:** create song cohorts to improve recommendations.

**Data:** Spotify audio features + popularity + metadata for Rolling Stones tracks.

**Deliverables (what this notebook produces):**
1. Cleaning (dedup by `id`, parse `release_date` → `year/decade`, numeric coercion, clip `loudness` to [-60, 0]).
2. EDA + album recommendation (≥60 popularity): charts + `album_popular_counts.csv`.
3. Correlations: global heatmap + yearly corr(popularity, feature) CSV and per-feature plots.
4. PCA: scree and PC1–PC2 scatter.
5. Clustering (KMeans on PCA): cluster scatter, **cluster_profiles_unscaled.csv**, **cluster_definitions.csv**.



In [1]:
# 0) Imports
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


## 1) Problem & Data
**Outputs directory:** All artifacts are saved to `rs_spotify_outputs/`.

**Assumptions & conventions**
- Popularity threshold for “popular tracks”: **≥ 60** (used for album picks).
- Loudness is clipped to a realistic Spotify range **[-60, 0] dB**.
- “Cohorts” = clusters of songs based on audio features.
- K selection: fixed `K=4`, chosen for musical interpretability; a data-driven alternative would be to choose K by **silhouette** score.

**What this section produced**
- Clean dataset in-memory (`df`).
- Summary stats saved to **`numeric_summary.csv`**.


In [2]:
from pathlib import Path

OUT = Path("rs_spotify_outputs")
OUT.mkdir(exist_ok=True, parents=True)
# 1) Load + cleaning
df = pd.read_csv('rolling_stones_spotify.csv')
if "track_number" in df.columns and "track number" not in df.columns:
    df = df.rename(columns={"track_number":"track number"})

num_cols = ["acousticness","danceability","energy","instrumentalness","liveness",
            "loudness","speechiness","tempo","valence","popularity","duration_ms","track number"]
for c in num_cols:
    if c in df.columns: df[c] = pd.to_numeric(df[c], errors="coerce")

df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
df["year"]   = df["release_date"].dt.year
df["decade"] = (df["year"]//10*10).astype("Int64")
if "loudness" in df.columns:
    df["loudness"] = df["loudness"].clip(-60, 0)

df = df.dropna(subset=["id","name","album","popularity"]).drop_duplicates(subset="id")
num_present = [c for c in num_cols if c in df.columns]  # skip columns missing from the CSV
df[num_present].describe().T.to_csv(OUT/"numeric_summary.csv")


## 2) EDA: Album Recommendation
- Uses the cleaned `df` from Section 1 (duplicate `id`s dropped, `release_date` parsed to `year` and `decade`, numeric types coerced, `loudness` clipped to [-60, 0] dB).
- Album recommendation rule: pick the albums with the most tracks where `popularity ≥ 60` (primary view), and show average popularity per album as a secondary view.
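
The primary and secondary views can also be put side by side in one table. The sketch below is a minimal illustration on made-up data: the album names `A`, `B`, `C` and their popularity values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy data: hypothetical albums and popularity scores, for illustration only.
toy = pd.DataFrame({
    "album": ["A", "A", "A", "B", "B", "C"],
    "popularity": [70, 65, 20, 61, 59, 90],
})
POP_TH = 60  # same threshold as the recommendation rule

album_views = (
    toy.assign(popular=toy["popularity"] >= POP_TH)
       .groupby("album")
       .agg(popular_tracks=("popular", "sum"),        # primary view: count of popular tracks
            avg_popularity=("popularity", "mean"))    # secondary view: mean popularity
       .sort_values(["popular_tracks", "avg_popularity"], ascending=False)
)
print(album_views)
```

In this toy table album `A` leads on the count of popular tracks while `C` leads on average popularity, which shows why the two views can disagree: one very popular track does not outrank an album with several popular ones under the primary rule.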

In [17]:
# 2) Album recommendation (≥60 popular threshold)
pop_th = 60
album_pop = (df.assign(popular=(df["popularity"]>=pop_th).astype(int))
               .groupby("album")["popular"]
               .sum()
               .sort_values(ascending=False))

album_pop.head(20).to_csv(OUT/"album_popular_counts.csv")

plt.figure(figsize=(12, 6))
album_pop.head(10).plot(kind="bar")
plt.title("Top 10 Albums by Number of Popular Tracks (≥60 Popularity)")
plt.xlabel("Album")
plt.ylabel("Number of Popular Tracks")
plt.xticks(rotation=45, ha="right")
plt.tight_layout(pad=2.0)            # add space for x-labels
plt.savefig(OUT / "top_albums_by_popular_tracks.png", dpi=150, bbox_inches="tight")
plt.close()

album_avg = df.groupby("album")["popularity"].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(12, 6))
album_avg.plot(kind="bar")
plt.title("Top albums by average popularity")
plt.xlabel("Album"); plt.ylabel("Average popularity")
plt.tight_layout(); plt.savefig(OUT/"top_albums_by_avg_popularity.png", dpi=150); plt.close()

print("Recommend these two:", list(album_pop.head(2).index))


Recommend these two: ['Sticky Fingers (Remastered)', 'Let It Bleed']


## 3) EDA: Feature Correlation
- Builds on the descriptive stats (`numeric_summary.csv`) and the album picks from the previous sections.
- **Correlation heatmap** of numeric features: pairwise Pearson correlations, saved to `correlations_numeric.csv` and `correlation_heatmap.png`.

In [8]:
# 3) Correlations (global) + heatmap
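# use only the numeric columns that exist in the CSV (avoids a KeyError)
corr = df[[c for c in num_cols if c in df.columns]].corr(numeric_only=True)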
corr.to_csv(OUT/"correlations_numeric.csv")

plt.figure()
im = plt.imshow(corr.values, aspect="auto")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.index)), corr.index)
plt.title("Correlation heatmap")
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.tight_layout(); plt.savefig(OUT/"correlation_heatmap.png", dpi=150); plt.close()


## 4) Popularity vs Features Over Time
- For each year, compute corr(popularity, feature) for: danceability, energy, valence, acousticness, speechiness, tempo.
- Plot correlation vs year (one chart per feature) to see trend changes.
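- Method: Pearson correlation per year; years with fewer than 5 tracks are skipped, and correlations from years with small n may still be noisy.
- Outputs: `yearly_corr_with_popularity.csv` and one `yearly_correlation_popularity_<feature>.png` per feature, saved to `rs_spotify_outputs/`.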
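
To give a feel for how noisy a small-n correlation can be, the sketch below draws repeated synthetic samples with a known true correlation and measures how much the sample Pearson r varies. The true correlation of 0.3, the sample sizes, and the helper name `simulate_r_spread` are assumptions chosen for illustration; nothing here uses the Spotify data.

```python
import numpy as np

def simulate_r_spread(n, true_r=0.3, trials=2000, seed=0):
    """Return the standard deviation of sample Pearson r over repeated draws of size n."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, true_r], [true_r, 1.0]]
    rs = np.empty(trials)
    for i in range(trials):
        # draw n correlated (x, y) pairs and compute their sample correlation
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        rs[i] = np.corrcoef(x, y)[0, 1]
    return rs.std()

spread_small = simulate_r_spread(n=5)
spread_large = simulate_r_spread(n=50)
print(f"std of r with n=5:  {spread_small:.2f}")
print(f"std of r with n=50: {spread_large:.2f}")
```

The spread at n=5 is much wider than at n=50, so a single year with only a handful of tracks can show a strong positive or negative correlation by chance alone.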

In [19]:
# 4) Popularity vs features over years
key = [c for c in ["danceability","energy","valence","acousticness","speechiness","tempo"] if c in df.columns]
rows = []

for y, g in df.dropna(subset=["year"]).groupby("year"):
    for f in key:
        # rows where both popularity and the feature are present
        pair = g[["popularity", f]].dropna()
        n = len(pair)
        if n < 5:  # too few tracks for a meaningful correlation
            continue
        c = pair["popularity"].corr(pair[f])  # Pearson by default
        if pd.notna(c):  # NaN when either column is constant within the year
            rows.append({"year": int(y), "feature": f, "corr_with_popularity": c, "n": n})

# explicit columns keep sort_values from failing when no year qualifies
trend = pd.DataFrame(rows, columns=["year", "feature", "corr_with_popularity", "n"]).sort_values(["feature","year"])
trend.to_csv(OUT/"yearly_corr_with_popularity.csv", index=False)

for f in key:
    sub = trend[trend["feature"]==f]
    if sub.empty: continue
    plt.figure()
    plt.plot(sub["year"], sub["corr_with_popularity"], marker="o")
    plt.title(f"Correlation (popularity vs {f}) over years")
    plt.xlabel("Year"); plt.ylabel("Correlation")
    plt.tight_layout(); plt.savefig(OUT/f"yearly_correlation_popularity_{f}.png", dpi=150); plt.close()


## 5) Dimensionality Reduction (PCA)
- Standardize audio features, fit PCA.
- Show scree (variance ratio) and 2D scatter (PC1 vs PC2).
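- Why PCA: the principal components are uncorrelated, so PCA removes multicollinearity among the audio features; dropping low-variance components reduces noise, and the compact representation is easier to cluster and plot.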
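
One common way to read a scree plot is through cumulative explained variance. The sketch below uses synthetic data (not the Spotify features) and a 90% target; the target, the data shape, and the variable names are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 6 observed features driven by 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize, then fit PCA with all components to get the full variance profile.
X_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(X_std)

cum_evr = np.cumsum(pca_demo.explained_variance_ratio_)
TARGET = 0.90  # assumed threshold, not taken from the notebook
# index of the first component at which cumulative variance reaches the target
n_keep = int(np.searchsorted(cum_evr, TARGET) + 1)
print("cumulative explained variance:", np.round(cum_evr, 3))
print("components needed for 90%:", n_keep)
```

This notebook keeps two components mainly so the clusters can be drawn on a single PC1–PC2 chart; a cumulative-variance check like this one shows how much information that choice leaves out.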

In [None]:
# 5) PCA
audio = [c for c in ["acousticness","danceability","energy","instrumentalness",
                     "liveness","loudness","speechiness","tempo","valence","duration_ms"]
         if c in df.columns]
pca_df = df.dropna(subset=audio)[audio].copy()

scaler = StandardScaler()
X = scaler.fit_transform(pca_df)

pca = PCA(n_components=min(10, X.shape[1]), random_state=42)
X_pca = pca.fit_transform(X)
evr = pca.explained_variance_ratio_

plt.figure()
plt.plot(range(1,len(evr)+1), evr, marker="o")
plt.title("PCA Scree Plot"); plt.xlabel("PC"); plt.ylabel("Explained variance ratio")
plt.tight_layout(); plt.savefig(OUT/"pca_scree.png", dpi=150); plt.close()

plt.figure()
plt.scatter(X_pca[:,0], X_pca[:,1], s=10)
plt.title("PCA (PC1 vs PC2)"); plt.xlabel("PC1"); plt.ylabel("PC2")
plt.tight_layout(); plt.savefig(OUT/"pca_scatter.png", dpi=150); plt.close()

# keep the first two components for 2D clustering (no need to refit the PCA)
X_pca_2d = X_pca[:, :2]



## 6) Clustering
- Run KMeans on **PCA(2D)** space (K=4).
- Visualize clusters on PC1–PC2 chart.
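
K is fixed at 4 in this notebook. As a sanity check on that choice, K could instead be compared by silhouette score. The sketch below is a minimal example on synthetic 2D points standing in for the PC1–PC2 coordinates; the data, the K range of 2 to 7, and the names `points`, `labels_k`, and `scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D points standing in for the PC1–PC2 coordinates.
points, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    # mean silhouette over all points: higher means tighter, better-separated clusters
    scores[k] = silhouette_score(points, labels_k)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}: silhouette={s:.3f}")
print("best K by silhouette:", best_k)
```

To apply this to the notebook's data, `points` would be replaced by the 2D PCA array. A silhouette-optimal K is not always the most interpretable one, so the score is best read alongside the cluster profiles.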

In [28]:
# 6) Clustering (K=4 for speed and stable musical groupings)

# align DF to PCA rows (do this once)
df = df.loc[pca_df.index].copy()

# prepare 2D PCA for clustering (once)
pca2 = X_pca[:, :2]

# fit KMeans once and attach labels
km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(pca2)
df["cluster"] = labels

# --- scatter on PC1–PC2 (optional centroid overlay) ---
plt.figure()
plt.scatter(pca2[:, 0], pca2[:, 1], c=labels, s=12)
plt.title("KMeans on PCA (K=4)"); plt.xlabel("PC1"); plt.ylabel("PC2")
cent_pc = np.vstack([pca2[labels == k].mean(axis=0) for k in range(km.n_clusters)])
plt.scatter(cent_pc[:, 0], cent_pc[:, 1], s=120, marker="X")
plt.tight_layout(); plt.savefig(OUT / "pca_cluster_scatter.png", dpi=150); plt.close()

# --- cluster profiles back to ORIGINAL units ---
cent_scaled = np.vstack([X[labels == k].mean(axis=0) for k in range(km.n_clusters)])
cent_unscaled = scaler.inverse_transform(cent_scaled)
profiles = pd.DataFrame(cent_unscaled, columns=audio)
profiles.insert(0, "cluster", range(km.n_clusters))
profiles.to_csv(OUT / "cluster_profiles_unscaled.csv", index=False)

# --- human-friendly definitions: top ±z features ---
Xz = pd.DataFrame(X, columns=audio); Xz["cluster"] = labels
rows = []
for k in range(km.n_clusters):
    mu = Xz[Xz["cluster"] == k][audio].mean().sort_values(ascending=False)
    rows.append({
        "cluster": k,
        "top_positive_z_features": list(mu.head(3).index),
        "top_negative_z_features": list(mu.tail(3).index),
    })
pd.DataFrame(rows).to_csv(OUT / "cluster_definitions.csv", index=False)

# optional: cluster sizes
pd.Series(labels).value_counts().sort_index().rename("size").to_csv(OUT / "cluster_sizes.csv")



## 7) Define Clusters
- Inverse-scale cluster centers to original units.
- Provide **cluster profiles** and **human-friendly definitions** by top ±z features (e.g., “high energy & tempo; low acousticness”).
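
The lists of top ±z features can be turned into short text labels like the example above. The helper below is a hypothetical sketch, not part of the notebook: `describe_cluster` and the sample z-scores are illustrative assumptions.

```python
import pandas as pd

def describe_cluster(mean_z, n=2):
    """Build a short text label from a Series of per-feature mean z-scores."""
    ordered = mean_z.sort_values(ascending=False)
    # the n largest means, kept only if actually above average
    high = [f for f in ordered.index[:n] if ordered[f] > 0]
    # the n smallest means (most negative first), kept only if below average
    low = [f for f in ordered.index[::-1][:n] if ordered[f] < 0]
    parts = []
    if high:
        parts.append("high " + " & ".join(high))
    if low:
        parts.append("low " + " & ".join(low))
    return "; ".join(parts) if parts else "near average"

# Hypothetical mean z-scores for one cluster (illustrative values only).
example = pd.Series({"energy": 1.2, "tempo": 0.8, "valence": 0.1,
                     "acousticness": -1.1, "liveness": -0.2})
label = describe_cluster(example)
print(label)  # high energy & tempo; low acousticness & liveness
```

Filtering on the sign keeps a feature from being labelled "high" when its cluster mean is still below the overall average, which can happen if a cluster sits low on most features.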

## 8) Insights & Next Steps
- Summarize which clusters map to “live/energetic rock,” “mellow/acoustic,” “speech-like/experimental,” etc.
- Suggest adding lyrics/sentiment or sub-genre tags to refine cohorts.


- **Cleaning:** removed duplicate `id`s, parsed dates, and enforced numeric types; clipped loudness to the [-60, 0] dB range.
- **Album picks:** recommended the top two albums by number of tracks with popularity ≥ 60 (see the bar chart and `album_popular_counts.csv`).
- **Correlations:** use the global heatmap to spot strong relationships (e.g., loudness, energy, or tempo vs popularity); the yearly trends then show how each feature's relationship with popularity changed over time.
- **PCA:** the scree plot shows how much variance each component explains and guides the component count; the PC1–PC2 scatter gives a visual check for groupings.
- **Clustering (K=4):** groups tracks into four musical cohorts; see `cluster_profiles_unscaled.csv` and `cluster_definitions.csv` to interpret each cluster (e.g., “energetic & loud”, “acoustic & low energy”).