# Music Recommendation System (Top-3)

## Project Overview
This project builds a **music recommender system** that generates **top-3 song recommendations per user** using **implicit feedback** (listens, likes, skips, playlist adds).

The goal is to simulate a realistic recommendation pipeline:
- start from **messy interaction logs**
- perform **data cleaning and preprocessing**
- train a **collaborative filtering model**
- evaluate recommendations using **ranking-based metrics**

## Objective
For each user, recommend **3 songs** they are most likely to positively engage with next.

Formally: Given a user’s historical interactions, rank unseen songs and return the top 3.
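
The ranking step in this definition can be sketched as follows. This is a minimal illustration, assuming the model has already produced one score per song; the helper name `top_k_unseen` and the toy scores are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_unseen(scores, seen_items, k=3):
    """Return indices of the k highest-scoring items the user has not seen."""
    scores = np.asarray(scores, dtype=float).copy()
    # Exclude already-seen songs by pushing their scores to -inf
    scores[list(seen_items)] = -np.inf
    # Sort descending (stable, so ties keep their original order)
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Toy example: 6 songs, the user has already interacted with songs 0 and 3
example_scores = [0.9, 0.2, 0.7, 0.95, 0.1, 0.5]
example_top3 = top_k_unseen(example_scores, seen_items={0, 3}, k=3)
print(example_top3)
```

Masking seen songs before sorting keeps the two highest raw scores (songs 0 and 3) out of the result, which is what "rank unseen songs" requires.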

## Dataset
- File: `music_interactions_messy.csv`
- Type: implicit user–song interaction data
- Characteristics:
  - duplicates
  - missing values
  - inconsistent labels
  - outliers
  - invalid timestamps

This dataset is intentionally **not clean** to reflect real-world data challenges. It is not a real dataset; I created it to simulate the process.
In the real world you are rarely given a clean dataset, so I decided to give myself a messy one to have fun with.


## Modeling Approach
- Feedback type: **Implicit**
- Primary model: **Matrix Factorization (ALS)**
- Library: `implicit`
- Recommendation type: **Top-N ranking (N = 3)**


## Evaluation
Offline evaluation using:
- **Recall@3**
- **Precision@3**
- **NDCG@3**
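
The three metrics above can be sketched in plain Python. This is a minimal illustration assuming binary relevance (a held-out song either counts as a hit or it does not); the function names and toy data are hypothetical and not taken from the notebook.

```python
import math

def precision_at_k(recommended, relevant, k=3):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k=3):
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k=3):
    """Discounted gain of the hits, normalized by the best achievable ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: one of the two held-out songs is recommended, at rank 2
recs = ["s1", "s2", "s3"]
held_out = {"s2", "s9"}
p3 = precision_at_k(recs, held_out)
r3 = recall_at_k(recs, held_out)
n3 = ndcg_at_k(recs, held_out)
print(p3, r3, n3)
```

Precision and recall ignore where a hit lands inside the top 3, while NDCG rewards hits that appear earlier, so reporting all three gives a fuller picture of ranking quality.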

Baseline comparison:
- Global popularity-based recommender
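
A global popularity baseline could look like the sketch below. It assumes popularity is measured as the number of distinct users per song, which is one reasonable choice among several; the function name and toy data are hypothetical.

```python
import pandas as pd

def popularity_top_k(train, k=3):
    """Rank songs by the number of distinct users who interacted with them."""
    counts = train.groupby("song_id")["user_id"].nunique()
    ranked = counts.reset_index(name="n_users").sort_values(
        ["n_users", "song_id"],      # break ties by song_id for determinism
        ascending=[False, True],
    )
    return ranked["song_id"].head(k).tolist()

# Toy training interactions
toy_train = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u1", "u2"],
    "song_id": ["s1", "s1", "s1", "s2", "s2", "s3", "s4"],
})
baseline_top3 = popularity_top_k(toy_train)
print(baseline_top3)
```

Every user receives the same list under this baseline. A stricter variant would remove songs each user has already heard before taking the top 3.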


## Deliverables
- Cleaned interaction dataset
- Trained recommender model
- Offline evaluation results
- `top3_recommendations.csv` containing:
  - user_id
  - top-3 recommended song_ids (and scores)


```python
# Core
import numpy as np
import pandas as pd

# Visualization (optional but useful for EDA)
import matplotlib.pyplot as plt
import seaborn as sns

# Sparse matrices
from scipy.sparse import coo_matrix, csr_matrix

# Recommender models
from implicit.als import AlternatingLeastSquares

# Evaluation helpers
from sklearn.preprocessing import MinMaxScaler

# Utilities
import warnings
warnings.filterwarnings("ignore")

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)
```


## 1. Environment Setup and Imports

This section prepares the Python environment for the recommender system project.  
No data is loaded and no modeling is performed here; this cell only configures the tools and settings used throughout the notebook.

### Library Imports

- **NumPy (`numpy`)**  
  Used for numerical operations, random number generation, and efficient array handling.

- **Pandas (`pandas`)**  
  Used for loading, cleaning, and manipulating tabular data such as the user–song interaction logs.

- **Matplotlib & Seaborn**  
  Visualization libraries used during exploratory data analysis (EDA) to inspect distributions, missing values, and outliers.

- **SciPy Sparse Matrices**  
  Recommender systems operate on large, mostly-empty user–item matrices.  
  Sparse matrix formats (`COO`, `CSR`) allow efficient storage and computation.

- **Implicit ALS (`implicit.als`)**  
  Provides the Alternating Least Squares matrix factorization model optimized for **implicit feedback** data.

- **Scikit-learn utilities**  
  Used later for scaling values and supporting evaluation tasks.
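
To make the sparse-matrix point concrete, here is a minimal sketch of turning a long-format interaction table into a CSR user–item matrix. The `strength` column and the toy rows are assumptions for illustration; the notebook derives its own interaction values later.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format interactions (hypothetical values)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "song_id": ["s1", "s2", "s2", "s3"],
    "strength": [1.0, 3.0, 2.0, 1.0],
})

# Map string IDs to contiguous integer indices (0..n-1)
user_codes = toy["user_id"].astype("category").cat.codes.to_numpy()
song_codes = toy["song_id"].astype("category").cat.codes.to_numpy()

# Rows are users, columns are songs; only observed pairs are stored
user_items = csr_matrix(
    (toy["strength"].to_numpy(), (user_codes, song_codes)),
    shape=(toy["user_id"].nunique(), toy["song_id"].nunique()),
)
print(user_items.shape, user_items.nnz)
```

Only the four observed interactions are stored, rather than all nine cells of the dense 3 × 3 matrix. The saving grows quickly as the number of users and songs increases.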

### Warning Management

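```python
warnings.filterwarnings("ignore")
```

Suppresses all warning messages to keep the notebook output readable.
Because this also hides deprecation and convergence warnings, it is worth re-enabling warnings when debugging.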

### Reproducibility

```python
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
```

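Setting a fixed random seed helps produce:
- consistent train/test splits
- repeatable model initialization
- reproducible evaluation results

This is essential for debugging and fair model comparison. Note that `np.random.seed` only controls NumPy's global generator; libraries that manage their own random state may need the seed passed to them explicitly.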

### Display Configuration

```python
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)
```

Configures how DataFrames are displayed in the notebook:
- all columns are shown
- tables are wider and easier to read

These settings **do not modify the data itself.**

### Summary

This setup cell initializes all dependencies and global settings required for the project.
Once this cell runs successfully, the notebook is ready to load and inspect the dataset.

```python
# Load the dataset
df = pd.read_csv("music_interactions_messy.csv")

# Basic inspection
print("Shape of dataset:", df.shape)
display(df.head())

# Column info and data types
df.info()
```

Shape of dataset: (5180, 13)


| | user_id | song_id | event_type | play_count | liked | added_to_playlist | skipped | listen_seconds | timestamp | device | country | artist_id | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | u46164 | S281353 | PLAY | 1 | 0 | 0 | 0 | 160 | 2025-11-10 22:57:00 | web | BR | a76415 | latin |
| 1 | u57763 | s393120 | play | 1 | 0 | 0 | 0 | 114 | 2025-09-10 18:02:32 | car | | a33775 | Indie |
| 2 | u45592 | s237113 | PLAY | 2 | 0 | 0 | 0 | 138 | 2025-11-10 22:18:12 | console | KR | a95594 | pop |
| 3 | u95570 | s174374 | add_to_playlist | 1 | 0 | 1 | 0 | 50 | 2025-11-30 00:22:26 | console | FR | a36976 | pop |
| 4 | u69598 | s205813 | Play | 2 | 0 | 0 | 0 | 147 | 2025-10-13 17:01:34 | android | | a48274 | jazz |


```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5180 entries, 0 to 5179
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   user_id            5173 non-null   object
 1   song_id            5171 non-null   object
 2   event_type         5180 non-null   object
 3   play_count         5155 non-null   object
 4   liked              5180 non-null   int64 
 5   added_to_playlist  5180 non-null   int64 
 6   skipped            5180 non-null   int64 
 7   listen_seconds     5180 non-null   int64 
 8   timestamp          5158 non-null   object
 9   device             4976 non-null   object
 10  country            5002 non-null   object
 11  artist_id          5085 non-null   object
 12  genre              4936 non-null   object
dtypes: int64(4), object(9)
memory usage: 526.2+ KB
```


## 2. Initial Data Inspection

After loading the dataset, we perform a high-level inspection to understand its structure, size, and obvious quality issues.


### Dataset Shape
The dataset contains 5,180 interaction records across 13 columns, with each row representing a single interaction event between a user and a song.

This confirms that the data is in a **long format**, which is appropriate for recommender systems.

### Columns Overview
From the preview and `df.info()` output, we observe the following types of variables:

- **Identifiers**
  - `user_id`
  - `song_id`
  - `artist_id`

- **Interaction Signals**
  - `event_type` (e.g., play, like, skip)
  - `play_count`
  - `liked`
  - `added_to_playlist`
  - `skipped`
  - `listen_seconds`

- **Contextual Metadata**
  - `timestamp`
  - `device`
  - `country`
  - `genre`

This structure provides enough information to construct an implicit feedback signal for recommendation.


### Data Quality Issues Observed
Based on the initial inspection:

- Several columns contain **missing or null values**
- `play_count` is not consistently numeric
- `event_type` uses **inconsistent labels and casing**
- `timestamp` includes invalid or non-standard values
- Categorical fields (e.g., `device`, `genre`) contain inconsistent formatting
- Duplicate rows may be present

These issues are expected and intentional, as the dataset is designed to simulate real-world logging data.


### Implications for Modeling
Before building a recommender model, the dataset will require:
- cleaning and standardization
- removal or correction of invalid records
- conversion into a numerical interaction strength
- filtering of sparse users and items

The next section focuses on **systematically identifying and addressing these data quality problems**.
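
As a rough preview of the last two requirements, the sketch below converts cleaned events into a single interaction strength per user–song pair and then filters out sparse users. The event weights, the threshold, and the function name are hypothetical values chosen for illustration, not settings used by the notebook.

```python
import pandas as pd

# Hypothetical weights: illustrative assumptions, not tuned values
EVENT_WEIGHTS = {"play": 1.0, "like": 3.0, "add_to_playlist": 4.0, "skip": 0.0}

def interaction_strength(events, min_user_songs=2):
    """Aggregate cleaned events into one strength value per (user, song) pair."""
    weighted = events.assign(
        weight=events["event_type"].map(EVENT_WEIGHTS).fillna(0.0)
    )
    strength = (
        weighted.groupby(["user_id", "song_id"], as_index=False)["weight"]
        .sum()
        .rename(columns={"weight": "strength"})
    )
    # Drop pairs with no positive signal (for example, skip-only pairs)
    strength = strength[strength["strength"] > 0]
    # Keep users who have at least min_user_songs distinct songs left
    songs_per_user = strength.groupby("user_id")["song_id"].transform("nunique")
    return strength[songs_per_user >= min_user_songs].reset_index(drop=True)

toy_events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "song_id": ["s1", "s1", "s2", "s1", "s3"],
    "event_type": ["play", "like", "play", "play", "skip"],
})
toy_strength = interaction_strength(toy_events)
print(toy_strength)
```

In this toy data, user `u2` is removed because only one song remains once the skip-only pair is dropped. The same idea can be applied to songs with too few listeners.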


```python
# 1. Check for duplicate rows
print("Duplicate rows:", df.duplicated().sum())

# 2. Missing values per column
missing_summary = df.isna().sum().sort_values(ascending=False)
display(missing_summary)

# 3. Basic statistics for numeric columns
display(df.describe(include=[np.number]))

# 4. Unique value inspection for key categorical columns
for col in ["event_type", "device", "genre"]:
    print(f"\nUnique values in {col}:")
    print(df[col].value_counts().head(15))

# 5. Check play_count type issues
print("\nNon-numeric play_count values:")
display(df[pd.to_numeric(df["play_count"], errors="coerce").isna()][["play_count"]].head())

# 6. Timestamp parsing issues
parsed_ts = pd.to_datetime(df["timestamp"], errors="coerce")
print("\nInvalid timestamps:", parsed_ts.isna().sum())

# 7. Unknown / blank user or song IDs
bad_users = df["user_id"].astype(str).str.strip().isin(["", "nan", "None", "guest"])
bad_songs = df["song_id"].astype(str).str.strip().isin(["", "nan", "None", "UNKNOWN"])

print("Rows with invalid user_id:", bad_users.sum())
print("Rows with invalid song_id:", bad_songs.sum())
```


Duplicate rows: 9


```
genre                244
device               204
country              178
artist_id             95
play_count            25
timestamp             22
song_id                9
user_id                7
event_type             0
liked                  0
listen_seconds         0
added_to_playlist      0
skipped                0
dtype: int64
```

| | liked | added_to_playlist | skipped | listen_seconds |
|---|---|---|---|---|
| count | 5180.0 | 5180.0 | 5180.0 | 5180.0 |
| mean | 0.085328 | 0.046525 | 0.120656 | 151.400772 |
| std | 0.279396 | 0.21064 | 0.325759 | 875.782625 |
| min | 0.0 | 0.0 | 0.0 | -118.0 |
| 25% | 0.0 | 0.0 | 0.0 | 49.0 |
| 50% | 0.0 | 0.0 | 0.0 | 102.0 |
| 75% | 0.0 | 0.0 | 0.0 | 146.0 |
| max | 1.0 | 1.0 | 1.0 | 19909.0 |



```
Unique values in event_type:
event_type
Play             917
listen           907
PLAY             898
play             892
skip             229
skipped          213
SKIP             183
thumbs_up        150
LIKE             148
like             144
AddToPlaylist     93
add2playlist      86
repeat            70
REPLAY            66
replay            64
Name: count, dtype: int64

Unique values in device:
device
android     799
ios         764
smart_tv    748
car         739
web         729
console     728
IOS          83
SMARTTV      83
ANDROID      82
CAR          77
CONSOLE      72
WEB          72
Name: count, dtype: int64

Unique values in genre:
genre
r&b          430
metal        424
country      421
hip-hop      420
rock         396
edm          392
classical    377
pop          366
k-pop        361
indie        309
jazz         301
latin        246
R&B           53
Hip-Hop       52
Rock          52
Name: count, dtype: int64
```

Non-numeric play_count values:


| | play_count |
|---|---|
| 72 | ten |
| 110 | |
| 137 | ten |
| 178 | one |
| 217 | ten |



```
Invalid timestamps: 100
Rows with invalid user_id: 24
Rows with invalid song_id: 27
```


## 3. Data Audit and Quality Assessment

Before performing any cleaning or preprocessing, we conduct a systematic audit of the dataset to identify data quality issues that could impact modeling and evaluation.

This step focuses on **measuring and confirming problems**, not fixing them yet.


### Duplicate Records
The audit finds 9 fully duplicated rows, indicating that the same interaction may have been logged multiple times.  
These duplicates must be addressed to avoid inflating interaction counts and biasing recommendation scores.
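
For reference, exact duplicates can be counted and removed with pandas as sketched below. The toy frame is hypothetical; how the notebook itself handles duplicates is left to the cleaning section.

```python
import pandas as pd

toy_log = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["s1", "s1", "s2"],
    "event_type": ["play", "play", "skip"],
})

# duplicated() flags every repeat of an earlier, fully identical row
n_dupes = int(toy_log.duplicated().sum())

# Keep the first copy of each row and rebuild a clean 0..n-1 index
deduped = toy_log.drop_duplicates(keep="first").reset_index(drop=True)
print(n_dupes, len(deduped))
```

This only removes rows that match on every column. Repeated plays of the same song at different timestamps are genuine events and are not affected.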



### Missing and Null Values
Several columns contain missing or null values, particularly in:
- contextual fields such as `genre` (244), `device` (204), and `country` (178)
- identifier fields such as `artist_id` (95) and, to a lesser extent, `song_id` (9) and `user_id` (7)

Missing values must be handled carefully, especially for the identifiers required to construct the user–item matrix.



### Inconsistent and Noisy Labels
Categorical columns exhibit inconsistent formatting:
- `event_type` contains multiple variants of the same action, differing in casing (`Play`, `PLAY`, `play`) or wording (`listen`, `thumbs_up`, `add2playlist`)
- `device` and `genre` values show inconsistent capitalization and formatting (e.g., `smart_tv` vs. `SMARTTV`, `r&b` vs. `R&B`)

These inconsistencies prevent reliable grouping and aggregation and will be standardized in the next section.
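
One way to standardize `event_type` is an explicit alias table, sketched below. The mapping is a hypothetical one built only from the variants visible in the audit output; in particular, treating `listen` as a play and `repeat` as a replay is an assumption, and the full dataset may contain labels not listed here.

```python
import pandas as pd

# Hypothetical alias table (keys are lowercased, stripped labels)
EVENT_ALIASES = {
    "play": "play", "listen": "play",
    "skip": "skip", "skipped": "skip",
    "like": "like", "thumbs_up": "like",
    "addtoplaylist": "add_to_playlist",
    "add2playlist": "add_to_playlist",
    "add_to_playlist": "add_to_playlist",
    "repeat": "replay", "replay": "replay",
}

def normalize_event_type(series):
    """Lowercase and strip labels, then map known variants to one canonical name."""
    key = series.astype(str).str.strip().str.lower()
    # Labels missing from the alias table become NaN so they can be reviewed
    return key.map(EVENT_ALIASES)

raw_labels = pd.Series(
    ["PLAY", "listen", " Skipped", "thumbs_up", "AddToPlaylist", "REPLAY", "mystery"]
)
clean_labels = normalize_event_type(raw_labels)
print(clean_labels.tolist())
```

Mapping unknown labels to NaN, rather than passing them through, makes unexpected variants easy to spot with `isna()`. Note that lowercasing alone would not unify `smart_tv` and `SMARTTV` in `device`, so that column would need its own alias entry.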



### Data Type Issues and Outliers
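- The `play_count` column is stored as text and contains non-numeric values (e.g., `ten`, `one`) as well as extreme outliers
- `listen_seconds` includes negative values (minimum -118) and unusually large durations (maximum 19,909 seconds, roughly 5.5 hours)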

These issues suggest logging errors and require conversion, correction, or clipping to reasonable ranges.
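
The conversion and clipping described above might be sketched like this. The word lookup and the one-hour cap are hypothetical choices for illustration; only `one` and `ten` are confirmed by the audit sample.

```python
import pandas as pd

# Hypothetical word-to-number lookup; extend it if other words turn up
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "ten": 10}

def clean_play_count(series):
    """Convert play_count to numbers, translating known number words."""
    text = series.astype(str).str.strip().str.lower()
    numeric = pd.to_numeric(text, errors="coerce")   # NaN for non-numeric text
    words = text.map(WORD_NUMBERS)                   # NaN unless a known word
    return numeric.fillna(words)

def clean_listen_seconds(series, max_seconds=3600):
    """Treat negative durations as missing and cap very long ones."""
    valid = series.where(series >= 0)
    return valid.clip(upper=max_seconds)

play_counts = clean_play_count(pd.Series(["1", "ten", None, " 2 ", "one", "abc"]))
durations = clean_listen_seconds(pd.Series([-118, 102, 19909]))
print(play_counts.tolist(), durations.tolist())
```

Values that cannot be interpreted stay as NaN rather than being guessed, which leaves the decision to drop or impute them to a later step.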



### Invalid Timestamps
A total of 100 rows have timestamps that are missing or cannot be parsed into valid datetime objects; 22 of these are null in the raw data.  
Since time-based splitting is required for realistic evaluation, invalid timestamps must be handled or removed.
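
A simple time-based split can be sketched as below. It assumes a single global cutoff and drops unparseable timestamps; a per-user "hold out the last interaction" split is a common alternative. The function name and toy rows are hypothetical.

```python
import pandas as pd

def temporal_split(events, test_fraction=0.2):
    """Put the most recent interactions in the test set, the rest in train."""
    parsed = events.assign(
        timestamp=pd.to_datetime(events["timestamp"], errors="coerce")
    )
    # Rows whose timestamp could not be parsed cannot be placed in time
    parsed = parsed.dropna(subset=["timestamp"]).sort_values("timestamp")
    n_test = max(1, round(len(parsed) * test_fraction))
    return parsed.iloc[:-n_test], parsed.iloc[-n_test:]

toy_times = pd.DataFrame({
    "song_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
    "timestamp": [
        "2025-09-10 18:02:32", "2025-10-13 17:01:34", "not a date",
        "2025-11-10 22:18:12", "2025-11-10 22:57:00", "2025-11-30 00:22:26",
    ],
})
train_part, test_part = temporal_split(toy_times)
print(len(train_part), len(test_part))
```

Because every training row is older than every test row, the model is never evaluated on interactions that happened before the ones it learned from.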



### Invalid or Unknown Identifiers
The audit also identifies rows with:
- blank or placeholder `user_id` values (24 rows, including missing values and `guest`)
- unknown or invalid `song_id` values (27 rows, including missing values and `UNKNOWN`)

Such rows cannot be used for collaborative filtering and will be removed during preprocessing.
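
The removal step could be sketched as follows, reusing the placeholder values from the audit cell. Lowercasing the IDs (so that `S281353` and `s281353` are treated as the same song) is an assumption about the data rather than something the audit confirms, and the function name is hypothetical.

```python
import pandas as pd

# Placeholder values from the audit cell, lowercased for comparison
BAD_USER_IDS = {"", "nan", "none", "guest"}
BAD_SONG_IDS = {"", "nan", "none", "unknown"}

def drop_invalid_ids(events):
    """Normalize ID formatting and drop rows with unusable user or song IDs."""
    user = events["user_id"].astype(str).str.strip().str.lower()
    song = events["song_id"].astype(str).str.strip().str.lower()
    keep = ~user.isin(BAD_USER_IDS) & ~song.isin(BAD_SONG_IDS)
    # Write the normalized IDs back onto the rows that survive
    return events.loc[keep].assign(user_id=user[keep], song_id=song[keep])

toy_ids = pd.DataFrame({
    "user_id": ["u1", "guest", None, "u2", " u3 "],
    "song_id": ["S1", "s2", "s3", "UNKNOWN", "s4"],
})
valid_rows = drop_invalid_ids(toy_ids)
print(valid_rows)
```

Only rows where both identifiers are usable remain, since a collaborative filtering model needs a known user and a known song for every interaction.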



### Summary
This audit confirms that the dataset reflects realistic, noisy interaction logs.  
The next section focuses on **cleaning and standardizing the data** so it can be transformed into a numerical implicit feedback representation suitable for recommendation modeling.
