In [1]:
import pandas as pd
import numpy as np

In [2]:
ratings = pd.read_csv("../data/rating.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [3]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 610.4+ MB


Filter Sparse Users and Movies

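This step is needed because:

1. Many users have rated only a few movies
2. Many movies have been rated only a few times

Collaborative filtering depends on overlap between users' ratings, so users and movies with very few ratings give it too little signal to compare them reliably.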

In [4]:
#Defining thresholds

MIN_USER_RATINGS = 50
MIN_MOVIE_RATINGS = 100
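
Before applying the thresholds, it can help to check how much of the data they would remove. The snippet below is a minimal sketch on a toy table; the helper `share_below` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the ratings table (hypothetical values, not the real data).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 3, 3],
    "movieId": [10, 20, 30, 10, 10, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

def share_below(df, column, threshold):
    """Fraction of distinct IDs in `column` with fewer than `threshold` ratings."""
    counts = df[column].value_counts()
    return float((counts < threshold).mean())

users_dropped = share_below(toy, "userId", 2)
movies_dropped = share_below(toy, "movieId", 2)
print(f"users below threshold: {users_dropped:.0%}")
print(f"movies below threshold: {movies_dropped:.0%}")
```

On the real table the same call would be `share_below(ratings, 'userId', MIN_USER_RATINGS)`.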

In [5]:
# Filter users: keep only those with at least MIN_USER_RATINGS ratings

user_counts = ratings['userId'].value_counts()  # number of ratings per user
active_users = user_counts[user_counts >= MIN_USER_RATINGS].index

ratings = ratings[ratings['userId'].isin(active_users)]

Here, we keep only the users in active_users, i.e. those with at least MIN_USER_RATINGS (50) ratings.

In [6]:
# Filter movies: keep only those with at least MIN_MOVIE_RATINGS ratings

movie_counts = ratings['movieId'].value_counts()  # number of ratings per movie
popular_movies = movie_counts[movie_counts >= MIN_MOVIE_RATINGS].index

ratings = ratings[ratings['movieId'].isin(popular_movies)]

Similarly, we keep only the movies in popular_movies, i.e. those with at least MIN_MOVIE_RATINGS (100) ratings from the remaining users.
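
One thing to be aware of: dropping movies after the user filter can push some users back under the user threshold. The notebook applies each filter once; the sketch below shows one possible alternative that repeats both filters until nothing changes. The function name `filter_until_stable` and the toy data are assumptions for illustration.

```python
import pandas as pd

def filter_until_stable(df, min_user, min_movie):
    """Repeat the user and movie filters until both thresholds hold at once."""
    while True:
        before = len(df)
        user_counts = df["userId"].value_counts()
        df = df[df["userId"].isin(user_counts[user_counts >= min_user].index)]
        movie_counts = df["movieId"].value_counts()
        df = df[df["movieId"].isin(movie_counts[movie_counts >= min_movie].index)]
        if len(df) == before:  # nothing was removed in this pass
            return df

# Toy data (hypothetical): user 3 survives the first user filter, but loses
# a rating when movie 30 is dropped and then falls below the threshold.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 10, 30],
})

stable = filter_until_stable(toy, min_user=2, min_movie=2)
print(sorted(stable["userId"].unique().tolist()))
```

Whether the extra passes matter depends on the data; with generous thresholds a single pass may already be close to stable.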

In [7]:
from scipy.sparse import csr_matrix

# Original userIds: [5, 100, 5, 200]
# After .cat.codes: [0, 1, 0, 2]

# Convert each ID column to a categorical once and reuse it below
user_cat = ratings['userId'].astype('category')
movie_cat = ratings['movieId'].astype('category')

user_ids = user_cat.cat.codes    # row index for each rating
movie_ids = movie_cat.cat.codes  # column index for each rating

# Create sparse matrix
user_movie_matrix = csr_matrix(
    (ratings['rating'], (user_ids, movie_ids))
)

# Keep mappings for later use (matrix index -> original ID)
user_mapping = dict(enumerate(user_cat.cat.categories))
movie_mapping = dict(enumerate(movie_cat.cat.categories))

print(f"Matrix shape: {user_movie_matrix.shape}")
# Note: data.nbytes counts only the stored rating values, not the CSR index arrays
print(f"Memory usage: {user_movie_matrix.data.nbytes / 1024**2:.2f} MB")

Matrix shape: (85307, 8473)
Memory usage: 137.65 MB


The dataset is large and most user-movie pairs have no rating, so we store the matrix in sparse form, which saves memory by keeping only the ratings that exist.

1. Converting IDs to Category Codes
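
The original IDs have gaps, so they cannot be used directly as row and column positions. Category codes replace each distinct ID with a contiguous integer starting at 0. A small sketch, reusing the toy IDs from the cell comment above:

```python
import pandas as pd

raw_ids = pd.Series([5, 100, 5, 200])  # toy IDs, same as in the cell comment
as_category = raw_ids.astype("category")

codes = as_category.cat.codes.tolist()            # one code per rating row
categories = as_category.cat.categories.tolist()  # distinct IDs, sorted

print(codes)       # [0, 1, 0, 2]
print(categories)  # [5, 100, 200]
```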

2. Creating a Sparse Matrix
**Structure:** `csr_matrix((data, (row_indices, col_indices)))`

- `ratings['rating']`: The actual rating values (e.g., [4.5, 3.0, 5.0, ...])
- `user_ids`: Row positions (which user)
- `movie_ids`: Column positions (which movie)

**What it creates:**
A matrix where:
- Rows = users
- Columns = movies
- Values = ratings
- **Empty cells are NOT stored** (that's the "sparse" part!)
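
As a minimal sketch of this structure, the snippet below builds a 3 x 4 matrix from four toy ratings. The values are made up for illustration, and the explicit `shape` argument is an addition that the notebook cell does not use.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy ratings (hypothetical): 3 users x 4 movies, 4 known ratings.
values = np.array([4.5, 3.0, 5.0, 4.0])
rows = np.array([0, 0, 1, 2])  # user index of each rating
cols = np.array([0, 3, 1, 2])  # movie index of each rating

toy_matrix = csr_matrix((values, (rows, cols)), shape=(3, 4))

print(toy_matrix.shape)      # (3, 4)
print(toy_matrix.nnz)        # 4 stored values out of 12 cells
print(toy_matrix.toarray())  # dense view, with 0.0 in the unstored cells
```

Internally, CSR keeps the values and their column indices plus one row-pointer array (`indptr`) that marks where each row starts, rather than one row index per value.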

**Visual Example:**
```
Regular matrix (wastes memory):
         Movie1  Movie2  Movie3  Movie4
User1      4.5     0.0     0.0     3.0
User2      0.0     5.0     0.0     0.0
User3      0.0     0.0     4.0     0.0

Sparse form (only non-zero entries are kept, shown here as the
value / row / column triplets passed to csr_matrix):
Stored data: [4.5, 3.0, 5.0, 4.0]
Row indices: [0, 0, 1, 2]
Col indices: [0, 3, 1, 2]
```

3. Creating Mapping Dictionaries

The mapping dictionaries convert matrix indices back to the original IDs:

- `.cat.categories`: gives the unique original IDs in sorted order
- `enumerate()`: pairs each ID with its index position

# If original userIds were: [5, 100, 200]
user_mapping = {
    0: 5,    # Matrix row 0 = userId 5
    1: 100,  # Matrix row 1 = userId 100
    2: 200   # Matrix row 2 = userId 200
}
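
A short sketch of how such a mapping might be used in both directions. The inverse dictionary `user_to_row` is a hypothetical helper that the notebook does not build.

```python
# Toy mapping with the same layout as user_mapping (hypothetical IDs).
user_mapping = {0: 5, 1: 100, 2: 200}

# Matrix row -> original userId
row = 1
original_id = user_mapping[row]

# Original userId -> matrix row (inverse lookup, built once)
user_to_row = {user_id: idx for idx, user_id in user_mapping.items()}
row_again = user_to_row[100]

print(original_id, row_again)  # 100 1
```

The inverse lookup is useful when a request arrives with an original userId and the matching matrix row is needed.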

4. Memory Comparison
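
The comparison can be sketched as follows. The helpers `csr_megabytes` and `dense_megabytes` are hypothetical names, and the demo matrix is random stand-in data rather than the real ratings. The dense size is computed from the shape alone, so nothing large is allocated.

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_megabytes(matrix):
    """Total size of the three CSR arrays (values, column indices, row pointers) in MB."""
    total = matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes
    return total / 1024**2

def dense_megabytes(shape, dtype=np.float64):
    """Size of a dense array of this shape in MB, computed without allocating it."""
    n_rows, n_cols = shape
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1024**2

# Small random stand-in for user_movie_matrix (hypothetical data).
rng = np.random.default_rng(0)
dense_demo = rng.random((1000, 500))
dense_demo[dense_demo > 0.02] = 0.0  # keep roughly 2% of the cells
demo = csr_matrix(dense_demo)

sparse_mb = csr_megabytes(demo)
dense_mb = dense_megabytes(demo.shape)
print(f"sparse: {sparse_mb:.2f} MB, dense: {dense_mb:.2f} MB")
```

For the shape printed earlier, (85307, 8473), a dense float64 array would need 85307 x 8473 x 8 bytes, about 5.4 GB, compared with the 137.65 MB reported for the stored rating values.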


In [8]:
# Save filtered ratings
ratings.to_csv("../data/ratings_processed.csv", index=False)

# Save sparse matrix
import pickle

with open("../data/user_movie_sparse.pkl", "wb") as f:
    pickle.dump(user_movie_matrix, f)

# Save mappings
with open("../data/user_mapping.pkl", "wb") as f:
    pickle.dump(user_mapping, f)

with open("../data/movie_mapping.pkl", "wb") as f:
    pickle.dump(movie_mapping, f)

print("Preprocessing artifacts saved successfully")


Preprocessing artifacts saved successfully
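
To use the artifacts in a later notebook they need to be read back. The sketch below round-trips a toy matrix and mapping through a temporary directory. It stores the matrix with SciPy's `save_npz` / `load_npz` instead of pickle, which is a swapped-in alternative and not what the cell above does; the toy values are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Toy artifacts standing in for the real ones (hypothetical values).
matrix = csr_matrix(np.array([[4.5, 0.0], [0.0, 5.0]]))
mapping = {0: 5, 1: 100}

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "user_movie_sparse.npz")
    mapping_path = os.path.join(tmp, "user_mapping.pkl")

    save_npz(matrix_path, matrix)  # SciPy's own sparse file format
    with open(mapping_path, "wb") as f:
        pickle.dump(mapping, f)

    restored_matrix = load_npz(matrix_path)
    with open(mapping_path, "rb") as f:
        restored_mapping = pickle.load(f)

# No differing cells means the matrix survived the round trip unchanged.
print((restored_matrix != matrix).nnz == 0, restored_mapping == mapping)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.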
