# Collaborative Filtering

Dataset: MovieLens 25M, from <https://grouplens.org/datasets/movielens/>

In [32]:
!wget -O ml-25m.zip http://files.grouplens.org/datasets/movielens/ml-25m.zip
# -o overwrites existing files, so re-running this cell does not stop at an interactive prompt
!unzip -o -q ml-25m.zip -d ml-25m

--2025-08-28 01:53:48--  http://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: ‘ml-25m.zip’


2025-08-28 01:53:54 (44.5 MB/s) - ‘ml-25m.zip’ saved [261978986/261978986]

# Preprocessing

In [33]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [59]:
# Load only needed columns to save memory
movies_df = pd.read_csv("/content/ml-25m/ml-25m/movies.csv", usecols=['movieId','title'])
ratings_df = pd.read_csv("/content/ml-25m/ml-25m/ratings.csv", usecols=['userId','movieId','rating'])

In [60]:
movies_df.head()

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


Remove the year from the title column and store it in its own year column.

In [61]:
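# Split "Title (1995)" into title and year; rows without a trailing year keep their original title
extracted = movies_df['title'].str.extract(r'^(.*)\s*\((\d{4})\)\s*$')
movies_df['year'] = extracted[1]
movies_df['title'] = extracted[0].fillna(movies_df['title']).str.strip()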
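
To see how the pattern behaves, here is a small sketch on a few sample titles (the `demo` frame is illustrative, and the last row is a made-up title with no year). It suggests that the greedy `(.*)` keeps earlier parentheses inside the title, and that rows without a trailing four-digit year come back as NaN from `str.extract`, so they need a fallback if the original title should be kept.

```python
import pandas as pd

demo = pd.DataFrame({'title': [
    'Toy Story (1995)',
    'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
    'Untitled Film',  # hypothetical row with no year
]})

# Same pattern as in the notebook: everything before the final "(dddd)" is the title
parts = demo['title'].str.extract(r'^(.*)\s*\((\d{4})\)\s*$')
parts.columns = ['title', 'year']
parts['title'] = parts['title'].str.strip()

print(parts)
```

The extracted year is a string, so it would need an explicit conversion before any numeric comparison.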

In [62]:
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [63]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
3,1,665,5.0
4,1,899,3.5


# Building the Collaborative Filter

Collaborative Filtering == User-User Filtering    
- Uses other users' ratings to recommend items to the input user

Similarity is based on the Pearson correlation coefficient.

Steps to build the user-based recommendation system:   
- Select a user along with the movies the user has watched
- Based on the user's ratings of those movies, find the top X neighbours
- Get each neighbour's record of watched movies
- Calculate a similarity score using some formula
- Recommend the items with the highest score

In [64]:
userInput = [
    {'title': 'Matrix, The', 'rating':5},
    {'title': 'Inception', 'rating':3},
    {'title': 'Amélie', 'rating':4},
    {'title': 'Fight Club', 'rating':1},
    {'title': 'Forrest Gump', 'rating':4},
    {'title': 'Shawshank Redemption, The', 'rating':5},
    {'title': 'Pulp Fiction', 'rating':5},
    {'title': 'Spirited Away', 'rating':5},
    {'title': 'Godfather, The', 'rating':5},
    {'title': 'Dark Knight, The', 'rating':5},
    {'title': 'Interstellar', 'rating':4},
    {'title': 'Gladiator', 'rating':2},
    {'title': 'Lord of the Rings: The Fellowship of the Ring, The', 'rating':5},
    {'title': 'Lord of the Rings: The Two Towers, The', 'rating':5},
    {'title': 'Lord of the Rings: The Return of the King, The', 'rating':5},
    {'title': 'Goodfellas', 'rating':3},
    {'title': 'Se7en', 'rating':2},
    {'title': 'Silence of the Lambs, The', 'rating':5},
    {'title': 'Saving Private Ryan', 'rating':5},
    {'title': 'Schindler\'s List', 'rating':5},
    {'title': 'Parasite', 'rating':5},
    {'title': 'Whiplash', 'rating':2},
    {'title': 'La La Land', 'rating':4},
    {'title': 'Avengers: Endgame', 'rating':5},
    {'title': 'Joker', 'rating':3}
]

inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,"Matrix, The",5
1,Inception,3
2,Amélie,4
3,Fight Club,1
4,Forrest Gump,4
5,"Shawshank Redemption, The",5
6,Pulp Fiction,5
7,Spirited Away,5
8,"Godfather, The",5
9,"Dark Knight, The",5


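Now that we have the user's preferences, we look up the movie id for each title. The lookup matches on the title alone, so titles that do not match movies_df exactly are dropped silently, and a title shared by several films matches all of them.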

In [65]:
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
inputMovies = pd.merge(inputId,inputMovies)
inputMovies = inputMovies.drop('year', axis=1)
inputMovies

Unnamed: 0,movieId,title,rating
0,296,Pulp Fiction,5
1,318,"Shawshank Redemption, The",5
2,356,Forrest Gump,4
3,527,Schindler's List,5
4,593,"Silence of the Lambs, The",5
5,858,"Godfather, The",5
6,1213,Goodfellas,3
7,2028,Saving Private Ryan,5
8,2256,Parasite,5
9,2571,"Matrix, The",5
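
Because the lookup matches on title alone, it can be worth checking which input titles found no match and which matched more than one movie. A minimal sketch on made-up data (the `catalog` and `wanted` frames and their ids are hypothetical), using `merge` with `how='left'` and `indicator=True`:

```python
import pandas as pd

# Hypothetical catalog in which two different films share one title
catalog = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['Parasite', 'Parasite', 'Inception'],
    'year': ['1982', '2019', '2010'],
})
wanted = pd.DataFrame({
    'title': ['Parasite', 'Inception', 'Amélie'],
    'rating': [5, 3, 4],
})

# A left merge keeps every requested title; `_merge` flags the ones with no match
check = wanted.merge(catalog, on='title', how='left', indicator=True)

unmatched = check.loc[check['_merge'] == 'left_only', 'title'].tolist()
match_counts = check[check['_merge'] == 'both'].groupby('title')['movieId'].nunique()
ambiguous = match_counts[match_counts > 1].index.tolist()

print('unmatched:', unmatched)
print('ambiguous:', ambiguous)
```

Ambiguous titles could then be resolved by also matching on the year, and unmatched ones by checking the exact spelling used in the catalog.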


To compare, we need the ratings from other users who have rated at least one of these movies.

In [66]:
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
36,1,5952,4.0
79,2,318,5.0
82,2,356,4.5
89,2,527,5.0


In [67]:
userSubsetGroup = userSubset.groupby(['userId'])

In [68]:
userSubsetGroup.get_group(1130)


Unnamed: 0,userId,movieId,rating
158211,1130,318,5.0
158212,1130,527,4.0
158213,1130,858,4.0
158226,1130,2571,4.5
158228,1130,2959,4.0
158237,1130,4993,5.0
158241,1130,5952,5.0
158245,1130,7153,5.0


In [69]:
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)

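After sorting, each element of userSubsetGroup is a pair:
- the group key, a one-element tuple holding the userId, e.g. (130333,)
- a DataFrame with that user's ratings of the input movies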
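
As a small illustration of this sort on made-up ratings (the `toy` frame is hypothetical), the sketch below groups by the column name passed as a plain string rather than a list. With a string key the group keys are plain scalars instead of the one-element tuples seen in the outputs here.

```python
import pandas as pd

toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 2, 3],
    'movieId': [296, 318, 296, 318, 356, 296],
    'rating':  [5.0, 4.0, 3.0, 4.5, 4.0, 2.0],
})

# Each item yielded by the groupby is a (key, DataFrame) pair;
# sort so that users with the most rated movies come first
groups = sorted(toy.groupby('userId'), key=lambda kv: len(kv[1]), reverse=True)

for user_id, frame in groups:
    print(user_id, len(frame))
```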


In [70]:
userSubsetGroup[0:3]

[((130333,),
            userId  movieId  rating
  20047795  130333      296     4.5
  20047802  130333      318     4.0
  20047811  130333      356     4.0
  20047846  130333      527     2.0
  20047860  130333      593     3.5
  20047900  130333      858     4.0
  20047988  130333     1213     1.0
  20048216  130333     2028     3.5
  20048249  130333     2256     0.5
  20048325  130333     2571     5.0
  20048408  130333     2959     5.0
  20048507  130333     3578     5.0
  20048708  130333     4993     5.0
  20048808  130333     5952     5.0
  20048943  130333     7153     5.0
  20048997  130333     8132     0.5
  20049486  130333    58559     5.0
  20049744  130333    79132     5.0
  20050162  130333   109487     5.0
  20050202  130333   112552     3.0
  20050785  130333   164909     3.5
  20051306  130333   202439     3.5
  20051319  130333   204698     3.0),
 ((158109,),
            userId  movieId  rating
  24332996  158109      296     5.0
  24333000  158109      318     3.5


## Similarity of users to input user


The Pearson correlation coefficient measures the strength of the linear association between two variables, that is, how closely the two sets of numbers vary together.
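
For reference, this is the coefficient in the computational form used by the code further down, where $x_i$ and $y_i$ are the two users' ratings of the $n$ movies they have in common. The symbols $S_{xx}$, $S_{yy}$ and $S_{xy}$ mirror the variable names `Sxx`, `Syy` and `Sxy` in that code.

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \quad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \quad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

The value of $r$ lies between $-1$ and $1$, and it is undefined when either $S_{xx}$ or $S_{yy}$ is zero, that is, when one user gave every common movie the same rating.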



---

  
### 💡 Why this is important in recommendation systems

Different users might have different rating habits:
- Some users rate everything high (e.g., mostly 4–5 stars).
- Some rate more conservatively (e.g., 2–4 stars).

Pearson correlation is unaffected by shifting or positively rescaling a user's ratings, so it can detect users who agree on what they like, even if they use different absolute numbers.
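
A quick way to convince yourself of this property is a sketch with invented ratings (all three arrays below are made up for illustration). The `strict` user has exactly the same preferences as the `generous` one, only shifted down and compressed, while `different` disagrees on the ordering.

```python
import numpy as np

# Hypothetical ratings of the same five movies
generous = np.array([5.0, 4.5, 4.0, 5.0, 3.5])
strict = 0.5 * generous + 0.5            # same tastes, lower and narrower scale
different = np.array([1.0, 5.0, 2.0, 3.0, 5.0])

r_same_taste = np.corrcoef(generous, strict)[0, 1]
r_other_taste = np.corrcoef(generous, different)[0, 1]

print(round(r_same_taste, 6), round(r_other_taste, 6))
```

The first correlation comes out at 1 (up to floating-point error) even though the two users never give the same number of stars, while the second is clearly lower.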

In [71]:
# Keep the 100 users who have rated the most movies in common with the input user
userSubsetGroup = userSubsetGroup[0:100]

In [72]:
pearsonCorrelationDict = {}

# Sort once so the input ratings line up with each group's ratings by movieId
inputMovies = inputMovies.sort_values(by='movieId')

for name, group in userSubsetGroup:
  group = group.sort_values(by='movieId')
  nRatings = len(group)

  # Input user's ratings restricted to the movies this neighbour has also rated
  temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
  tempRatingList = temp_df['rating'].tolist()
  tempGroupList = group['rating'].tolist()

  # Pearson correlation from sums of squares and cross-products
  Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
  Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
  Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

  # If either rating list is constant the correlation is undefined; store 0 instead
  if Sxx!=0 and Syy!=0:
    pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
  else:
    pearsonCorrelationDict[name] = 0
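
As a sanity check, the same sums-of-squares computation can be wrapped in a small helper and compared against `numpy.corrcoef`. This is only a sketch: the `pearson` function name and the sample ratings are assumptions for illustration, not part of the notebook.

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation via sums of squares; returns 0.0 if either input is constant."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings of five common movies
a = [5, 3, 4, 1, 4]
b = [4.5, 3.0, 5.0, 2.0, 3.5]

print(pearson(a, b), np.corrcoef(a, b)[0, 1])

# A user who gives every movie the same rating has zero variance
print(pearson([4, 4, 4], [1, 3, 5]))
```

Both calls on `a` and `b` should agree up to floating-point error, and the constant-rating case falls back to 0 instead of dividing by zero.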


In [73]:
pearsonCorrelationDict.items()

dict_items([((130333,), 0.1374204545864351), ((158109,), -0.13928450671420162), ((18602,), -0.06697177886332609), ((21410,), 0.13120364894349423), ((23535,), 0.07211758195112714), ((34047,), 0.29254452656700525), ((41657,), 0.1871918126406011), ((54137,), -0.24033232534229348), ((59398,), -0.036729159401702335), ((61803,), 0.12188038324127964), ((63752,), -0.13608276348795434), ((68778,), 0.12275379077928716), ((72315,), 0.466577668410434), ((92046,), 0.330588950065471), ((138024,), 0.25879559174166716), ((153983,), 0.5041418102873962), ((156844,), 0.0712881833265872), ((5990,), 0.024027837181195374), ((6115,), 0.03287720589157604), ((6497,), -0.01661927696363753), ((9463,), -0.2751706676328229), ((15735,), 0.35489953293025456), ((19122,), 0.19337311703843996), ((19379,), 0.1820090486747789), ((19997,), -0.30474413594255856), ((20179,), -0.15098670622088586), ((20917,), -0.08611693225534396), ((22175,), 0.3604957807067839), ((24441,), 0.16026820281515552), ((26279,), 0.083833838333928)

In [74]:
pearsonDf = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDf.columns = ['similarityIndex']
pearsonDf['userId'] = pearsonDf.index
pearsonDf.index = range(len(pearsonDf))
pearsonDf.head()

Unnamed: 0,similarityIndex,userId
0,0.13742,"(130333,)"
1,-0.139285,"(158109,)"
2,-0.066972,"(18602,)"
3,0.131204,"(21410,)"
4,0.072118,"(23535,)"


### Top 50 users

In [75]:
topUsers = pearsonDf.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
99,0.646496,"(11539,)"
49,0.557469,"(74972,)"
15,0.504142,"(153983,)"
36,0.469339,"(47355,)"
12,0.466578,"(72315,)"


In [76]:
topUsers.dtypes

Unnamed: 0,0
similarityIndex,float64
userId,object


In [77]:
ratings_df.dtypes

Unnamed: 0,0
userId,int64
movieId,int64
rating,float64


### Ratings of the selected users for all movies

The correlation serves as the weight of each user's ratings.

In [78]:
topUsers['userId'] = topUsers['userId'].apply(lambda x: int(x[0]) if isinstance(x, tuple) else int(x))

In [79]:
topUsers.head()

Unnamed: 0,similarityIndex,userId
99,0.646496,11539
49,0.557469,74972
15,0.504142,153983
36,0.469339,47355
12,0.466578,72315


In [80]:
topUsersRating = topUsers.merge(ratings_df, on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.646496,11539,1,3.0
1,0.646496,11539,7,3.0
2,0.646496,11539,11,4.0
3,0.646496,11539,14,4.0
4,0.646496,11539,17,3.0


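The recommendation score of a movie $m$ is the similarity-weighted average of the ratings given by the selected users $u$ who rated it:

$$
\text{score}(m) = \frac{\sum_{u} \text{similarityIndex}_u \cdot \text{rating}_{u,m}}{\sum_{u} \text{similarityIndex}_u}
$$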

In [81]:
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.646496,11539,1,3.0,1.939487
1,0.646496,11539,7,3.0,1.939487
2,0.646496,11539,11,4.0,2.585983
3,0.646496,11539,14,4.0,2.585983
4,0.646496,11539,17,3.0,1.939487


In [82]:
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,13.719961,52.909312
2,9.836641,32.926734
3,1.737467,4.176573
4,0.682364,1.318807
5,2.895931,7.869695


In [83]:
recommendation_df = pd.DataFrame()

recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

movieId (index),weighted average recommendation score,movieId
1,3.856375,1
2,3.347355,2
3,2.403829,3
4,1.932704,4
5,2.717501,5


### Getting the recommendations

In [84]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

movieId (index),weighted average recommendation score,movieId
6178,5.0,6178
31487,5.0,31487
159349,5.0,159349
44947,5.0,44947
185497,5.0,185497
88341,5.0,88341
88646,5.0,88646
89054,5.0,89054
47601,5.0,47601
202595,5.0,202595


In [85]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
6066,6178,"Patch of Blue, A",1965
9561,31487,"Devil and Miss Jones, The",1941
10758,44947,Lacombe Lucien,1974
11015,47601,7 Men from Now (Seven Men from Now),1956
16772,88341,Park Row,1952
16836,88646,"Cairo Station (a.k.a. Iron Gate, The) (Bab el ...",1958
16918,89054,Bolivia,2001
40408,159349,Together,2009
52532,185497,Brexitannia,2017
60154,202595,Ode to the Goose,2018
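
The top scores above are all exactly 5.0, which is what a weighted average produces when every contributing neighbour gave the movie 5 stars, typically because only one or two neighbours rated it. One possible refinement, sketched here on made-up numbers, is to keep only neighbours with positive similarity and to require a minimum number of raters per movie before ranking. The `min_raters` threshold and the `toy` frame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical neighbour ratings with each neighbour's similarity to the input user
toy = pd.DataFrame({
    'userId':          [1, 2, 3, 1, 2, 3, 4],
    'similarityIndex': [0.6, 0.5, 0.4, 0.6, 0.5, 0.4, -0.2],
    'movieId':         [100, 100, 100, 200, 300, 300, 300],
    'rating':          [4.0, 4.5, 4.0, 5.0, 3.0, 3.5, 5.0],
})

min_raters = 2  # assumed threshold; would need tuning on the real data

# Negative weights can distort a weighted average, so drop them first
positive = toy[toy['similarityIndex'] > 0].copy()
positive['weightedRating'] = positive['similarityIndex'] * positive['rating']

agg = positive.groupby('movieId').agg(
    sum_similarityIndex=('similarityIndex', 'sum'),
    sum_weightedRating=('weightedRating', 'sum'),
    n_raters=('userId', 'nunique'),
)
agg['score'] = agg['sum_weightedRating'] / agg['sum_similarityIndex']

# Movies rated by too few neighbours are excluded from the ranking
ranked = agg[agg['n_raters'] >= min_raters].sort_values('score', ascending=False)
print(ranked[['n_raters', 'score']])
```

In this toy data, movie 200 has a perfect score from a single neighbour and is filtered out, so the ranking is driven by movies that several similar users agree on.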


## Final Reflection
In this project, I worked with the larger MovieLens 25M dataset, which provided a richer and more up-to-date base for building a collaborative filtering system. Compared to earlier projects with smaller datasets, this scale required more careful preprocessing.

I had to adapt regex rules to correctly extract movie titles and years. After cleaning, I applied Pearson correlation to measure user-user similarity. One important property of Pearson correlation is that it is invariant to shifting and positive rescaling of the ratings, which makes it robust to users who rate on different scales.

One of the challenges was ensuring the interpretability of the recommendations, since not all of the top-ranked users produced relevant suggestions.

What worked well was the ability to capture similarity in rating patterns rather than absolute ratings, which made the recommendations feel more aligned with user preferences. However, this approach still suffers from cold-start problems and scalability limitations.

Overall, this project helped me deepen my understanding of collaborative filtering on real-world scale data and gave me practical skills in debugging, data preprocessing, and evaluating recommender systems.