<h1>Assignment 1</h1>

In [3]:
import os
import pandas as pd
import numpy as np

LINKS_PATH = os.path.join(os.getcwd(), 'movie', 'links.csv')
TAGS_PATH = os.path.join(os.getcwd(), 'movie', 'tags.csv')
MOVIES_PATH = os.path.join(os.getcwd(), 'movie', 'movies.csv')
RATINGS_PATH = os.path.join(os.getcwd(), 'movie', 'ratings.csv')

def load_data(path):
    return pd.read_csv(path)

<h4>Loading and analysing data</h4>

In [4]:
movies = load_data(MOVIES_PATH)
ratings = load_data(RATINGS_PATH)
tags = load_data(TAGS_PATH)
links = load_data(LINKS_PATH)

In [5]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


<h2>Task 1</h2>

<p>Read the
dataset, display the first few rows to understand it, and display the count of ratings (rows)
in the dataset to be sure that you downloaded it correctly.</p>

In [14]:
ratings.shape

(100836, 4)

In [15]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


To prepare the data for further analysis, we need to transform it into an appropriate format. In this case, we will be transforming the data into a user-item matrix. This will allow us to calculate the similarity between users and recommend items to users based on their similarity to other users.
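As a small illustration of that transformation, here is a minimal sketch on made-up data. The `toy_ratings` frame and its values are hypothetical; only the column names follow `ratings.csv`.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 1.0],
})

# One row per user, one column per movie; unrated (user, movie) pairs become NaN
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(toy_matrix)
```

Most cells of the real matrix are empty in the same way, because each user rates only a small share of the movies.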

In [16]:
# Copying the ratings dataframe to a new dataframe for further processing
movie_ratings = ratings.copy()

In [10]:
# Making a pivot table to get the ratings of each movie by each user
ratings_by_users = movie_ratings.pivot_table(index='userId', columns='movieId', values='rating', aggfunc='first')
ratings_by_users.head()

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


<h2>Task 2 (p. 1)</h2>

<p> Implement the user-based collaborative filtering approach, using the Pearson
correlation function for computing similarities between users</p>
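For reference, the Pearson correlation restricted to co-rated items can be written out by hand. The sketch below is illustrative only: the helper name `pearson_on_common` and the toy Series are assumptions, and the means are taken over the co-rated items, which is also what `pandas.Series.corr` does after dropping incomplete pairs.

```python
import numpy as np
import pandas as pd

def pearson_on_common(u, v):
    """Pearson correlation of two rating Series over co-rated items only."""
    mask = u.notna() & v.notna()
    if mask.sum() < 2:
        return 0.0  # correlation needs at least two co-rated items
    a = u[mask].to_numpy(dtype=float)
    b = v[mask].to_numpy(dtype=float)
    a_c = a - a.mean()  # centre each user's ratings on the co-rated mean
    b_c = b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    if denom == 0:
        return 0.0  # constant ratings: the correlation is undefined
    return float((a_c * b_c).sum() / denom)

# Hypothetical ratings indexed by movie ID; NaN means "not rated"
u = pd.Series([4.0, 3.0, np.nan, 5.0, 2.0], index=[1, 2, 3, 4, 5])
v = pd.Series([5.0, 2.0, 4.0, np.nan, 3.0], index=[1, 2, 3, 4, 5])
r = pearson_on_common(u, v)
print(round(r, 4))
```

The zero-denominator guard matters in practice: when a user gives the same rating to every co-rated movie, the standard deviation is zero and the coefficient cannot be computed.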

In [20]:
# min_common_percentage = 0.1
min_common_items = 10

# Implementation of Pearson correlation
def get_similarity_between_two_items(user1, user2):
    # Find common items
    common_items = user1.notna() & user2.notna()
    
    common_items_count = common_items.sum()
    if common_items_count == 0:
        return 0  # No common items, no correlation
    
    # We implement a threshold to avoid meaningless correlations
    # Approach 1: Common items percentage
    # total_items_user1 = user1.count()
    # total_items_user2 = user2.count()
    # common_percentage_user1 = common_items_count / total_items_user1
    # common_percentage_user2 = common_items_count / total_items_user2
    # if common_percentage_user1 < min_common_percentage or common_percentage_user2 < min_common_percentage:
    #     return 0  # Not enough common items for a meaningful correlation

    # Approach 2: Common items count
    if common_items_count < min_common_items:
        return 0  # Not enough common items for a meaningful correlation
    
    # Get the common items
    user1_common = user1[common_items]
    user2_common = user2[common_items]
    
    # Pearson correlation requires at least 2 common items
    if len(user1_common) < 2:
        return 0 
    
    # Calculate the Pearson correlation coefficient
    correlation = user1_common.corr(user2_common)
    
    if np.isnan(correlation):
        return 0  # Handle NaN values
    
    return correlation

# Helper function to find similar users
# Returns a series of similarities between the target user and all other users
def find_similar_users(user_item_matrix, target_user):
    # Calculate the Pearson correlation between the target user and all other users
    similarities = user_item_matrix.apply(lambda user: get_similarity_between_two_items(user, target_user), axis=1)
    return similarities

def get_similar_users_by_user_id(user_item_matrix, user_id):
    # Get the target user
    target_user = user_item_matrix.loc[user_id]
    similarities = find_similar_users(user_item_matrix, target_user)
    
    # Sort users by similarity in descending order
    similar_users = similarities.sort_values(ascending=False)
    
    # Filter out the target user
    similar_users = similar_users[similar_users.index != target_user.name]
    
    # Return all other users, most similar first
    return similar_users

<p>Let's test the function and get the users most similar to user 1</p>

In [22]:
users_similar_to_user_1 = get_similar_users_by_user_id(ratings_by_users, 1)
users_similar_to_user_1.head()

userId
476    0.786936
210    0.767649
297    0.706281
44     0.684448
394    0.650600
dtype: float64

In [42]:

# Let's compare user 1's ratings with those of the most similar user (ID 476)
user_1 = ratings_by_users.loc[1]
user_2 = ratings_by_users.loc[476]

# Total number of items rated by the first user
total_items_user_1 = user_1.count()
print('Total items rated by user 1:', total_items_user_1)

# Total number of items rated by the second user
total_items_user_2 = user_2.count()
print('Total items rated by user 2:', total_items_user_2 , '\n')


# Get the items that both users have rated
common_items = user_1.notna() & user_2.notna()
common_items_count = common_items.sum()
print('Common items count:', common_items_count, '\n')

# Get the common items
user_1_common = user_1[common_items]
user_2_common = user_2[common_items]

# Printing the comparison
comparison = pd.DataFrame({'User 1': user_1_common, 'User 2': user_2_common})
comparison


Total items rated by user 1: 232
Total items rated by user 2: 69 

Common items count: 11 



movieId,User 1,User 2
1,4.0,4.0
296,3.0,3.0
349,4.0,3.0
356,4.0,5.0
362,5.0,5.0
457,5.0,5.0
480,4.0,4.0
500,3.0,3.0
590,4.0,5.0
592,4.0,4.0


<p>As can be seen, the two users rate their common movies very similarly: the ratings shown differ on only three movies (IDs 349, 356 and 590), each by a single point. So sorting by Pearson correlation seems to work quite well, and we can use it to find users with similar movie ratings.</p>

<h2>Task 2 (p. 2)</h2>

<p>The prediction function presented in class for predicting movie scores.</p>

In [74]:
def predict_rating(user_id, movie_id, ratings_by_users):
    target_ratings = ratings_by_users.loc[user_id]

    # If the user has already rated the movie, return the known rating
    if not np.isnan(target_ratings.loc[movie_id]):
        return target_ratings.loc[movie_id]

    # Get the users who rated the movie
    users_who_rated = ratings_by_users[ratings_by_users[movie_id].notna()].index

    # Similarity between the target user and each user who rated the movie
    similarities = [
        get_similarity_between_two_items(target_ratings, ratings_by_users.loc[other_user_id])
        for other_user_id in users_who_rated
    ]
    # Each rater's deviation from their own mean rating, weighted by similarity
    weighted_ratings = [
        similarity * (ratings_by_users.loc[other_user_id, movie_id] - ratings_by_users.loc[other_user_id].mean())
        for other_user_id, similarity in zip(users_who_rated, similarities)
    ]

    # If the similarities sum to zero (for example, nobody else rated the movie
    # or no rater passes the threshold), return the mean rating of the user
    if sum(similarities) == 0:
        return target_ratings.mean()

    # Return the weighted average deviation, shifted by the user's mean rating
    return (sum(weighted_ratings) / sum(similarities)) + target_ratings.mean()
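
A widely used variant of this mean-centred formula normalises by the sum of the absolute similarities, which keeps the denominator away from zero when positive and negative similarities cancel out. The sketch below shows that variant on made-up numbers. Whether it matches the exact formula presented in class is an assumption, and the helper name `predict_mean_centred` is hypothetical.

```python
import numpy as np

def predict_mean_centred(target_mean, neighbour_ratings, neighbour_means, similarities):
    """Target mean plus the similarity-weighted average of neighbour deviations.

    The weights are normalised by the sum of absolute similarities.
    """
    ratings = np.asarray(neighbour_ratings, dtype=float)
    means = np.asarray(neighbour_means, dtype=float)
    sims = np.asarray(similarities, dtype=float)

    norm = np.abs(sims).sum()
    if norm == 0:
        return float(target_mean)  # no usable neighbours: fall back to the mean

    deviations = ratings - means   # how far each neighbour strays from their mean
    return float(target_mean + (sims * deviations).sum() / norm)

# Hypothetical example: one similar and one dissimilar neighbour
prediction_abs = predict_mean_centred(
    target_mean=3.5,
    neighbour_ratings=[4.0, 2.0],
    neighbour_means=[3.0, 3.0],
    similarities=[0.8, -0.4],
)
print(prediction_abs)
```

With these numbers, dividing by the signed sum of similarities (0.4) instead would give 6.5, which is outside the rating scale.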


movieId
1         NaN
2         NaN
3         4.0
4         NaN
5         NaN
         ... 
193581    NaN
193583    NaN
193585    NaN
193587    NaN
193609    NaN
Name: 1, Length: 9724, dtype: float64

In [81]:
# Actual rating of user 1 for movie 3
actual_rating = ratings_by_users.loc[1, 3]

# Hide that rating so that it has to be predicted
ratings_by_users.loc[1, 3] = np.nan

# Predict the rating of user 1 for movie 3
prediction = predict_rating(1, 3, ratings_by_users)

# Restore the hidden rating so the matrix is unchanged for later cells
ratings_by_users.loc[1, 3] = actual_rating
print('Prediction:', prediction, '\n')
print('Actual rating:', actual_rating, '\n')


Prediction: 3.91121344428583 

Actual rating: 4.0 



<p>As can be seen, the predicted value (about 3.91) is close to the rating the user actually gave (4.0), which suggests that the function behaves as expected on this example.</p>
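
A single hidden rating is a narrow check. The same hide-and-predict idea can be repeated over several known ratings and summarised as a mean absolute error. The sketch below is illustrative only: `holdout_mae`, `user_mean_predictor` and the toy matrix are hypothetical, and the predictor argument is assumed to take the same arguments as `predict_rating`.

```python
import numpy as np
import pandas as pd

def holdout_mae(matrix, predict_fn, cells):
    """Hide each (user, movie) cell in turn, predict it, and return the mean absolute error."""
    errors = []
    for user_id, movie_id in cells:
        actual = matrix.loc[user_id, movie_id]
        hidden = matrix.copy()                  # leave the caller's matrix untouched
        hidden.loc[user_id, movie_id] = np.nan  # hide the known rating
        errors.append(abs(predict_fn(user_id, movie_id, hidden) - actual))
    return float(np.mean(errors))

# Hypothetical stand-in predictor: the user's mean over their remaining ratings
def user_mean_predictor(user_id, movie_id, matrix):
    return matrix.loc[user_id].mean()

# Hypothetical toy user-item matrix
toy = pd.DataFrame(
    [[4.0, 3.0, 5.0], [2.0, np.nan, 4.0], [5.0, 4.0, np.nan]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)
cells = [(1, 10), (2, 30), (3, 20)]
mae = holdout_mae(toy, user_mean_predictor, cells)
print(mae)
```

Passing `predict_rating` and `ratings_by_users` in place of the stand-ins would apply the same check to the notebook's own predictor, at the cost of one similarity pass per hidden rating.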