# Project Stage I Report: Spotify Song Popularity Prediction

---

## Task 1: Problem Framing (10 points)

We aim to predict the **popularity** of songs on Spotify using track metadata and audio features.  
- **Why:** Popularity scores (0–100) reflect how often, and how recently, a song is streamed, making this an interesting and measurable target.  
- **Framing:** The task can be framed as **regression** (predicting the exact score) or **classification** (grouping songs into low/medium/high popularity).  
- **Why it matters:** Insights can benefit streaming platforms, producers, and artists by understanding what makes songs more appealing.  
- **Why this dataset:** It contains over 230k tracks with a wide range of features (danceability, energy, loudness, etc.), making it rich enough for predictive modeling.
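
For the classification framing, the 0–100 score has to be binned into classes. The sketch below shows one way to do this with `pd.cut`; the cut points (33 and 66), the column name `popularity_class`, and the helper `add_popularity_class` are illustrative assumptions, not choices made in this report.

```python
import pandas as pd

def add_popularity_class(df, bins=(-1, 33, 66, 100), labels=("low", "medium", "high")):
    """Return a copy of df with a categorical `popularity_class` column.

    The bin edges are assumed for illustration; intervals are right-inclusive,
    so 0-33 is "low", 34-66 is "medium", and 67-100 is "high".
    """
    out = df.copy()
    out["popularity_class"] = pd.cut(out["popularity"], bins=list(bins), labels=list(labels))
    return out

# Tiny made-up frame covering the boundary values.
toy = pd.DataFrame({"popularity": [0, 33, 34, 66, 67, 100]})
binned = add_popularity_class(toy)
print(binned["popularity_class"].tolist())
```

Equal-width bins are only a starting point; quantile-based bins (for example via `pd.qcut`) would give more balanced classes if the popularity distribution is skewed.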

---


## Task 2: Dataset Exploration (20 points)

### First 10 Rows

We display the first 10 rows to understand the structure and spot obvious irregularities.  
**Why:** A quick look at sample rows shows the format and value ranges and can reveal potential irregularities before deeper analysis.


In [3]:
import pandas as pd

df = pd.read_csv("SpotifyFeatures.csv")
df.head(10)  # first 10 rows


| | genre | artist_name | track_name | track_id | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Movie | Henri Salvador | C'est beau de faire un Show | 0BRjO6ga9RKCKjfDqeFgWV | 0 | 0.611 | 0.389 | 99373 | 0.91 | 0.0 | C# | 0.346 | -1.828 | Major | 0.0525 | 166.969 | 4/4 | 0.814 |
| 1 | Movie | Martin & les fées | Perdu d'avance (par Gad Elmaleh) | 0BjC1NfoEOOusryehmNudP | 1 | 0.246 | 0.59 | 137373 | 0.737 | 0.0 | F# | 0.151 | -5.559 | Minor | 0.0868 | 174.003 | 4/4 | 0.816 |
| 2 | Movie | Joseph Williams | Don't Let Me Be Lonely Tonight | 0CoSDzoNIKCRs124s9uTVy | 3 | 0.952 | 0.663 | 170267 | 0.131 | 0.0 | C | 0.103 | -13.879 | Minor | 0.0362 | 99.488 | 5/4 | 0.368 |
| 3 | Movie | Henri Salvador | Dis-moi Monsieur Gordon Cooper | 0Gc6TVm52BwZD07Ki6tIvf | 0 | 0.703 | 0.24 | 152427 | 0.326 | 0.0 | C# | 0.0985 | -12.178 | Major | 0.0395 | 171.758 | 4/4 | 0.227 |
| 4 | Movie | Fabien Nataf | Ouverture | 0IuslXpMROHdEPvSl1fTQK | 4 | 0.95 | 0.331 | 82625 | 0.225 | 0.123 | F | 0.202 | -21.15 | Major | 0.0456 | 140.576 | 4/4 | 0.39 |
| 5 | Movie | Henri Salvador | Le petit souper aux chandelles | 0Mf1jKa8eNAf1a4PwTbizj | 0 | 0.749 | 0.578 | 160627 | 0.0948 | 0.0 | C# | 0.107 | -14.97 | Major | 0.143 | 87.479 | 4/4 | 0.358 |
| 6 | Movie | Martin & les fées | Premières recherches (par Paul Ventimila, Lori... | 0NUiKYRd6jt1LKMYGkUdnZ | 2 | 0.344 | 0.703 | 212293 | 0.27 | 0.0 | C# | 0.105 | -12.675 | Major | 0.953 | 82.873 | 4/4 | 0.533 |
| 7 | Movie | Laura Mayne | Let Me Let Go | 0PbIF9YVD505GutwotpB5C | 15 | 0.939 | 0.416 | 240067 | 0.269 | 0.0 | F# | 0.113 | -8.949 | Major | 0.0286 | 96.827 | 4/4 | 0.274 |
| 8 | Movie | Chorus | Helka | 0ST6uPfvaPpJLtQwhE6KfC | 0 | 0.00104 | 0.734 | 226200 | 0.481 | 0.00086 | C | 0.0765 | -7.725 | Major | 0.046 | 125.08 | 4/4 | 0.765 |
| 9 | Movie | Le Club des Juniors | Les bisous des bisounours | 0VSqZ3KStsjcfERGdcWpFO | 10 | 0.319 | 0.598 | 152694 | 0.705 | 0.00125 | G | 0.349 | -7.79 | Major | 0.0281 | 137.496 | 4/4 | 0.718 |


### Dataset Dictionary
We create a dictionary describing each feature, its type, example, and missing %.

**Why:** This documents the dataset schema and helps identify data quality issues before modeling.

In [5]:
import pandas as pd

# Load dataset
df = pd.read_csv("SpotifyFeatures.csv")

# Build dataset dictionary
dataset_dict = pd.DataFrame({
    "Feature": df.columns,
    "Type": [df[col].dtype for col in df.columns],
    "Examples": [df[col].dropna().unique()[:3] for col in df.columns],  # first 3 unique values
    "Missing %": [df[col].isnull().mean() * 100 for col in df.columns]
})

# Display dataset dictionary
dataset_dict


| | Feature | Type | Examples | Missing % |
|---|---|---|---|---|
| 0 | genre | object | [Movie, R&B, A Capella] | 0.0 |
| 1 | artist_name | object | [Henri Salvador, Martin & les fées, Joseph Wil... | 0.0 |
| 2 | track_name | object | [C'est beau de faire un Show, Perdu d'avance (... | 0.00043 |
| 3 | track_id | object | [0BRjO6ga9RKCKjfDqeFgWV, 0BjC1NfoEOOusryehmNud... | 0.0 |
| 4 | popularity | int64 | [0, 1, 3] | 0.0 |
| 5 | acousticness | float64 | [0.611, 0.246, 0.952] | 0.0 |
| 6 | danceability | float64 | [0.389, 0.59, 0.663] | 0.0 |
| 7 | duration_ms | int64 | [99373, 137373, 170267] | 0.0 |
| 8 | energy | float64 | [0.91, 0.737, 0.131] | 0.0 |
| 9 | instrumentalness | float64 | [0.0, 0.123, 0.00086] | 0.0 |


Observation:  
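- The dataset has **almost no missing values**: among the features listed, only `track_name` is incomplete (about 0.00043%, which is consistent with a single row out of 232,725), so preprocessing effort for missing data is minimal.  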
- Features cover both **numeric audio descriptors** and **categorical metadata**.
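
Very small missing percentages are easier to interpret as row counts. The sketch below, run on a made-up four-row frame rather than the real file, lists only the columns that have missing values together with both the count and the percentage; `missing_summary` is a hypothetical helper name.

```python
import pandas as pd

def missing_summary(df):
    """Return missing-row counts and percentages for columns with any missing values."""
    summary = pd.DataFrame({
        "missing_rows": df.isnull().sum(),
        "missing_pct": df.isnull().mean() * 100,
    })
    return summary[summary["missing_rows"] > 0]

# Made-up data: one missing track name out of four rows.
toy = pd.DataFrame({
    "track_name": ["a", None, "c", "d"],
    "popularity": [1, 2, 3, 4],
})
result = missing_summary(toy)
print(result)
```

Applied to the full dataset, the same function would show directly how many rows a value like 0.00043% corresponds to.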

## Task 3: Feature Exploration (40 points)

We checked for outliers and unusual values in key features.

- **Duration (`duration_ms`)**  
  - Found 150 tracks shorter than 30 seconds and 626 tracks longer than 15 minutes.  
  - **Why remove:** Such songs are atypical (intros, errors, compilations) and add noise to popularity prediction.  

- **Loudness (`loudness`)**  
  - Normal range is [-60, 0] dB. Found 91 tracks outside this range.  
  - **Why clip:** Retaining rows while constraining values avoids losing data and keeps features realistic.  

- **Popularity (`popularity`)**  
  - Distribution: mean ~41, median 43, min 0, max 100.  
  - **Why keep:** Values are valid; no cleaning required.  

- **Other Features (`danceability`, `energy`, `valence`, etc.)**  
  - All scaled between 0 and 1.  
  - **Why keep:** No anomalies detected.  
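
The checks described above can be sketched as a single report function. This is a minimal illustration on made-up rows, not the exact code used for the counts in this section; the thresholds follow the text (30 seconds, 15 minutes, [-60, 0] dB, [0, 1]), while `outlier_report` and `UNIT_FEATURES` are hypothetical names.

```python
import pandas as pd

# Features the report describes as scaled between 0 and 1 (subset used for illustration).
UNIT_FEATURES = ["danceability", "energy", "valence"]

def outlier_report(df):
    """Count rows that fall outside the ranges discussed in the report."""
    duration = df["duration_ms"]
    # between() is inclusive, so exact boundary values count as in range.
    loudness_ok = df["loudness"].between(-60, 0)
    unit_ok = df[UNIT_FEATURES].apply(lambda col: col.between(0, 1))
    return {
        "shorter_than_30s": int((duration < 30_000).sum()),
        "longer_than_15min": int((duration > 900_000).sum()),
        "loudness_outside_range": int((~loudness_ok).sum()),
        "unit_feature_outside_range": int((~unit_ok.all(axis=1)).sum()),
    }

# Made-up rows: one too short, one too long, two loudness outliers, one bad danceability.
toy = pd.DataFrame({
    "duration_ms": [20_000, 200_000, 1_000_000, 180_000],
    "loudness": [-5.0, 1.2, -61.0, -10.0],
    "danceability": [0.5, 0.6, 0.7, 1.5],
    "energy": [0.1, 0.2, 0.3, 0.4],
    "valence": [0.9, 0.8, 0.7, 0.6],
})
report = outlier_report(toy)
print(report)
```

Note that `between()` returns `False` for missing values, so any `NaN` in these columns would also be counted as out of range.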

### Member Contributions
- *Saranya Pettela*: Investigated `duration_ms`, identified and removed extremely short or long tracks.  
- *VarunReddy Pakeru*: Analyzed `loudness`, clipped invalid values.  
- *Keerthi Devireddy*: Validated popularity and categorical features (`key`, `mode`, `time_signature`), ensured consistency.
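
A consistency check on the categorical features could look like the sketch below. The allowed vocabularies are assumptions inferred from the sample rows (sharp-notation keys, `Major`/`Minor` modes), and `unexpected_values` is a hypothetical helper; a `time_signature` entry could be added the same way once its valid values are known.

```python
import pandas as pd

# Assumed vocabularies; only some of these values appear in the sample rows.
ALLOWED = {
    "key": {"C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"},
    "mode": {"Major", "Minor"},
}

def unexpected_values(df, allowed=ALLOWED):
    """Map each checked column to the sorted list of values outside its allowed set."""
    report = {}
    for column, vocab in allowed.items():
        seen = set(df[column].dropna().unique())
        extra = sorted(seen - vocab)
        if extra:
            report[column] = extra
    return report

# Made-up rows containing one invalid key and one mis-cased mode.
toy = pd.DataFrame({
    "key": ["C#", "F#", "H"],
    "mode": ["Major", "minor", "Minor"],
})
issues = unexpected_values(toy)
print(issues)
```

An empty dictionary means every observed value is in the assumed vocabulary.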


## Task 4: Save Cleaned Data (10 points)

We applied the cleaning steps:  
1. Removed **776 tracks** with unrealistic durations (150 shorter than 30 seconds and 626 longer than 15 minutes).  
2. Clipped **91 loudness outliers** to [-60, 0].  

Final dataset size:  
- Before: 232,725 rows  
- After: 231,949 rows  
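
As a quick arithmetic check, the row counts reported in this document are consistent with each other. The variable names are illustrative; the numbers are the ones stated in Tasks 3 and 4.

```python
# Counts taken from the report.
short_tracks = 150      # shorter than 30 seconds
long_tracks = 626       # longer than 15 minutes
rows_before = 232_725

removed = short_tracks + long_tracks
rows_after = rows_before - removed

print(removed, rows_after)  # 776 231949
```

Clipping loudness changes values but not the number of rows, so only the duration filter affects the final size.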

**Why save:** A cleaned dataset (`newdata.csv`) ensures reproducibility and provides a consistent input for all team members.  


In [9]:
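# Cleaning code
import pandas as pd

df = pd.read_csv("SpotifyFeatures.csv")

# Keep tracks between 30 seconds and 15 minutes (inclusive); .copy() avoids
# pandas' SettingWithCopyWarning when the filtered frame is modified below.
df = df[(df["duration_ms"] >= 30000) & (df["duration_ms"] <= 900000)].copy()

# Clip loudness into the [-60, 0] dB range instead of dropping rows.
df["loudness"] = df["loudness"].clip(-60, 0)

# Save without the index so no extra unnamed column is written.
df.to_csv("newdata.csv", index=False, encoding="utf-8")

print("Final dataset shape:", df.shape)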

Final dataset shape: (231949, 18)
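
One way to make the cleaning rules reusable and checkable is to wrap them in a function and assert the post-conditions. The sketch below applies the same filter and clip rules to a tiny made-up frame; `clean` and `check_cleaned` are hypothetical helper names, not part of the notebook.

```python
import pandas as pd

def clean(df):
    """Drop tracks outside 30 s to 15 min and clip loudness to [-60, 0]."""
    kept = df[df["duration_ms"].between(30_000, 900_000)].copy()
    kept["loudness"] = kept["loudness"].clip(-60, 0)
    return kept

def check_cleaned(df):
    """Raise AssertionError if any cleaning rule is violated."""
    assert df["duration_ms"].between(30_000, 900_000).all(), "duration out of range"
    assert df["loudness"].between(-60, 0).all(), "loudness out of range"

# Made-up rows: the first and third have invalid durations and are dropped;
# the remaining two have loudness values that get clipped.
toy = pd.DataFrame({
    "duration_ms": [10_000, 60_000, 950_000, 240_000],
    "loudness": [-5.0, 2.5, -70.0, -65.0],
})
cleaned = clean(toy)
check_cleaned(cleaned)
print(cleaned)
```

Running `check_cleaned` on the frame reloaded from `newdata.csv` would give a simple regression test for later stages.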


## Summary
In this stage, we:  
- Selected a rich dataset (Spotify audio features).  
- Framed the prediction problem (song popularity).  
- Explored and described the dataset.  
- Identified and cleaned outliers.  
- Prepared a final cleaned dataset (`newdata.csv`) for modeling.  

This sets up a strong foundation for Stage II, where we will perform deeper feature analysis and predictive modeling.