In [31]:
import pandas as pd
import numpy as np


(a) Download the MovieLens 100K rating dataset from
https://grouplens.org/datasets/movielens/ (the small dataset recommended for
education and development). Read the dataset, display the first few rows to understand
it, and display the count of ratings (rows) in the dataset to be sure that you downloaded it
correctly.

In [32]:
links = pd.read_csv('ml-latest-small/links.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
ratings = pd.read_csv('ml-latest-small/ratings.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')

print("Links Dataset:")
display(links.head())

print("\nMovies Dataset:")
display(movies.head())

print("\nRatings Dataset:")
display(ratings.head())

print("\nTags Dataset:")
display(tags.head())

rating_count = ratings.shape[0]
print(f"\nTotal number of ratings: {rating_count}")

Links Dataset:


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0



Movies Dataset:


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy



Ratings Dataset:


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931



Tags Dataset:


Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200



Total number of ratings: 100836


(b) Implement the user-based collaborative filtering approach, using the Pearson
correlation function for computing similarities between users (4 points).

In [33]:

user_item_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')

centered_user_item_matrix = user_item_matrix.sub(user_item_matrix.mean(axis=1), axis=0)

def pearson(user1, user2):
    common_ratings = ~user1.isna() & ~user2.isna()
    user1_common = user1[common_ratings]
    user2_common = user2[common_ratings]
    
    if len(user1_common) == 0:
        return 0

    numerator = np.sum((user1_common - user1_common.mean()) * (user2_common - user2_common.mean()))
    denominator = np.sqrt(np.sum((user1_common - user1_common.mean()) ** 2) * np.sum((user2_common - user2_common.mean()) ** 2))
    
    if denominator == 0:
        return 0

    return numerator / denominator

user_similarity_matrix = pd.DataFrame(index=user_item_matrix.index, columns=user_item_matrix.index)

# For testing: build the pairwise comparison DataFrame showing how similar all users are to one another
for user_a in user_item_matrix.index:
    for user_b in user_item_matrix.index:
        if user_a == user_b:
            user_similarity_matrix.loc[user_a, user_b] = 1  
        elif pd.isna(user_similarity_matrix.loc[user_a, user_b]):
            similarity = pearson(centered_user_item_matrix.loc[user_a], centered_user_item_matrix.loc[user_b])
            user_similarity_matrix.loc[user_a, user_b] = similarity
            user_similarity_matrix.loc[user_b, user_a] = similarity

user_similarity_matrix = user_similarity_matrix.astype(float)

print(user_similarity_matrix.head())




userId       1    2         3         4         5             6         7    \
userId                                                                        
1       1.000000  0.0  0.079819  0.207983  0.268749 -2.916358e-01 -0.118773   
2       0.000000  1.0  0.000000  0.000000  0.000000  0.000000e+00 -0.991241   
3       0.079819  0.0  1.000000  0.000000  0.000000  7.850462e-17  0.000000   
4       0.207983  0.0  0.000000  1.000000 -0.336525  1.484982e-01  0.542861   
5       0.268749  0.0  0.000000 -0.336525  1.000000  4.316590e-02  0.158114   

userId       8         9         10   ...       601           602       603  \
userId                                ...                                     
1       0.469668  0.918559 -0.037987  ...  0.091574 -1.183502e-17 -0.061503   
2       0.000000  0.000000  0.037796  ... -0.387347  0.000000e+00 -1.000000   
3       0.000000  0.000000  0.000000  ...  0.000000  0.000000e+00  0.433200   
4       0.117851  0.000000  0.485794  ... -0.222113

(c) Implement the prediction function presented in class for predicting movie scores (4 points).
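
The formula presented in class is not reproduced in this notebook, so the sketch below assumes the common mean-centred weighted average: pred(a, p) = mean(r_a) + sum_b sim(a, b) * (r_bp - mean(r_b)) / sum_b |sim(a, b)|, where b ranges over the other users who rated item p. The function name `predict_rating`, its optional parameter `k`, and the toy data are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def predict_rating(user_item, similarity, user, item, k=None):
    """Predict one rating with a mean-centred weighted average (assumed formula)."""
    user_means = user_item.mean(axis=1)  # each user's mean over the items they rated
    # Candidate neighbours: every other user who has rated the item.
    item_ratings = user_item[item].dropna().drop(index=user, errors="ignore")
    sims = similarity.loc[user, item_ratings.index]
    sims = sims[sims != 0]  # zero-similarity neighbours contribute nothing
    if k is not None:
        # Keep the k neighbours with the largest absolute similarity.
        sims = sims.loc[sims.abs().sort_values(ascending=False).index[:k]]
    if sims.empty:
        return user_means.loc[user]  # fall back to the user's own mean
    deviations = item_ratings.loc[sims.index] - user_means.loc[sims.index]
    return user_means.loc[user] + (sims * deviations).sum() / sims.abs().sum()


# Toy data (hypothetical): u1 has not rated m3.
toy_ratings = pd.DataFrame(
    {"m1": [5.0, 4.0, 1.0], "m2": [3.0, 2.0, 5.0], "m3": [np.nan, 5.0, 2.0]},
    index=["u1", "u2", "u3"],
)
# Pearson similarity over co-rated items, computed by pandas.
toy_similarity = toy_ratings.T.corr(method="pearson")

pred = predict_rating(toy_ratings, toy_similarity, "u1", "m3")
print(round(pred, 3))
```

With the real data, the same call would take `user_item_matrix` and `user_similarity_matrix` from the cell above. In the toy case u2 agrees with u1 and rated m3 above their own mean, while u3 disagrees with u1 and rated m3 below their own mean, so both push the prediction upwards.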

(d) Design and implement a new similarity function for computing similarities between
users. Explain why this similarity function is useful for the collaborative filtering
approach. Hint: Exploiting ideas from related work is highly encouraged. (4 points)

(e) Use the user-based collaborative filtering approach to produce group
recommendations. Specifically, first compute the movie recommendations for each
user in the group, and then aggregate the lists of the individual users to produce a
single list of movies for the group. You will implement two well-established aggregation
methods for producing the group recommendations.

The second aggregation method is the least misery method, where one member can act
as a veto for the rest of the group. In this case, the rating of an item for a group of users
is computed as the minimum score assigned to that item in all group members'
recommendations. (3 points)

Use the MovieLens 100K rating dataset for checking the correctness of your solutions.
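
The least misery rule described above can be sketched as follows. The sketch assumes the individual predictions are already available as a DataFrame with one row per group member and one column per movie; the name `least_misery`, the parameter `top_n`, and the toy scores are hypothetical. Dropping movies that lack a prediction for some member is also an assumption, not something the task prescribes.

```python
import pandas as pd


def least_misery(individual_scores, top_n=10):
    """Group score of a movie = minimum predicted score among the members."""
    # Assumption: only movies with a prediction for every member are considered.
    complete = individual_scores.dropna(axis=1)
    group_scores = complete.min(axis=0)
    return group_scores.sort_values(ascending=False).head(top_n)


# Toy predicted scores (hypothetical) for a group of three users.
toy_scores = pd.DataFrame(
    {"m1": [4.5, 2.0, 5.0], "m2": [3.0, 3.5, 3.2], "m3": [4.0, 4.2, 3.9]},
    index=["u1", "u2", "u3"],
)
group_ranking = least_misery(toy_scores, top_n=3)
print(group_ranking)
```

Here m1 has the highest individual score (5.0 for u3) but ends up last, because u2's score of 2.0 acts as the veto.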


(f) Define a way for counting the disagreements between the users in a group and
propose a method that takes disagreements into account when computing suggestions
for the group. Implement your method and explain why it is useful when producing
group recommendations. Prepare also a short presentation (about 5 slides) to show
how your method works. (6 points)
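
One possible way to approach this, given purely as an illustrative assumption, is to measure the disagreement on a movie as the standard deviation of the members' predicted scores and to subtract a weighted penalty from the average score. The name `disagreement_aware_scores`, the parameter `weight`, and the toy scores are hypothetical; other disagreement measures (for example the mean pairwise absolute difference) would fit the same pattern.

```python
import pandas as pd


def disagreement_aware_scores(individual_scores, weight=1.0):
    """Average score penalised by the spread of the members' scores (a sketch)."""
    # Assumption: only movies with a prediction for every member are considered.
    complete = individual_scores.dropna(axis=1)
    average = complete.mean(axis=0)
    # Disagreement: population standard deviation of the members' scores.
    disagreement = complete.std(axis=0, ddof=0)
    group_scores = average - weight * disagreement
    return group_scores.sort_values(ascending=False), disagreement


# Toy predicted scores (hypothetical) for a group of three users.
toy_scores = pd.DataFrame(
    {"m1": [4.5, 2.0, 5.0], "m2": [3.0, 3.5, 3.2], "m3": [4.0, 4.2, 3.9]},
    index=["u1", "u2", "u3"],
)
ranked, disagreement = disagreement_aware_scores(toy_scores, weight=1.0)
average_ranking = toy_scores.mean(axis=0).sort_values(ascending=False)
print(ranked.round(3))
print(disagreement.round(3))
```

In the toy case a plain average ranks m1 above m2, while the penalised score ranks m2 above m1 because the members disagree strongly about m1. The `weight` parameter controls how far the result moves from the plain average towards consensus items.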