## Data Mining Assignment 2

Import the project's utility module codeUtils under the alias cu, together with the scikit-learn tools used later: StandardScaler for feature scaling, KMeans for clustering, and pairwise_distances_argmin_min for nearest-center lookups. The short alias keeps calls to the helper functions concise throughout the notebook.

In [22]:
import codeUtils as cu
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.cluster import KMeans

Loading our four datasets

In [3]:
movies = cu.load_data('data/movies.csv')
ratings = cu.load_data('data/ratings.csv')
tags = cu.load_data('data/tags.csv')
links = cu.load_data('data/links.csv')

In [4]:
print("Total number of Movies: "+str(len(movies)))
print("Total number of Users: "+str(ratings.userId.nunique()))


Total number of Movies: 9742
Total number of Users: 610


## Managing the dataset

### Note: From README.txt
_Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970_

The timestamp attribute plays no role in our study, so we drop it from both the tags and the ratings tables.

In [5]:
cu.drop_columns(tags, ['timestamp'])
cu.drop_columns(ratings, ['timestamp'])

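We merge the tables into a single "merged_data" dataset that consolidates movie ratings, movie details, and user-generated tags. Ratings are inner-joined with movies on movieId, and tags are then left-joined on (userId, movieId). Rows with missing values and duplicate rows are removed afterwards, so only ratings that have at least one tag remain.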

In [6]:
merged_data = cu.merge_data(ratings, movies, 'movieId','inner')

merged_data = cu.merge_data(merged_data, tags, ['userId','movieId'],'left')

cu.drop_na(merged_data)
cu.drop_duplicate(merged_data)
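The helpers in codeUtils are not shown in this notebook, so the following is only a sketch of what the merge step plausibly does, assuming cu.merge_data, cu.drop_na, and cu.drop_duplicate wrap the corresponding pandas calls. The toy tables and their values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the ratings, movies, and tags tables (illustrative values only).
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.5, 5.0]})
movies = pd.DataFrame({"movieId": [10, 20], "title": ["A", "B"], "genres": ["Comedy", "Drama|War"]})
tags = pd.DataFrame({"userId": [1, 1], "movieId": [10, 10], "tag": ["funny", "funny"]})

# Inner join: keep only ratings whose movieId exists in movies.
merged = ratings.merge(movies, on="movieId", how="inner")
# Left join: keep every rating row; rows without a tag get NaN in 'tag'.
merged = merged.merge(tags, on=["userId", "movieId"], how="left")

# Dropping NaN rows discards every rating that has no tag,
# and dropping duplicates collapses identical (rating, tag) rows.
merged = merged.dropna().drop_duplicates().reset_index(drop=True)
print(merged)
```

In this toy case only the single tagged rating survives, which illustrates why the merged dataset is much smaller than the ratings table.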


Next we compute the average rating of each movie over the rows of "merged_data" and merge it back on movieId, so that every row carries its movie's average rating.

In [7]:
average_rating = cu.calculate_average(merged_data,'movieId','rating')

merged_data = cu.merge_data(merged_data, average_rating, 'movieId','inner')
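As a rough illustration of the per-movie average, here is a minimal pandas sketch. It assumes cu.calculate_average behaves like a group-by mean that yields an average_rating column (the column name matches the table shown later; the implementation itself is an assumption).

```python
import pandas as pd

# Toy ratings: movie 10 is rated twice, movie 20 once.
df = pd.DataFrame({"movieId": [10, 10, 20], "rating": [4.0, 5.0, 3.0]})

# Mean rating per movie, as a two-column frame ready to merge back.
avg = (
    df.groupby("movieId", as_index=False)["rating"]
      .mean()
      .rename(columns={"rating": "average_rating"})
)

# Attach the per-movie average to every row of that movie.
df = df.merge(avg, on="movieId", how="inner")
print(df)
```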


We binarize the 'genres' attribute: the '|'-separated genre string is split into one 0/1 indicator column per genre, which makes the genres usable as numerical features.

In [8]:
new_merged_data = cu.transform_attribute_to_multiple(merged_data, 'genres', '|')
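The binarization can be sketched with pandas alone. This assumes cu.transform_attribute_to_multiple does something equivalent to Series.str.get_dummies, which is not confirmed by the notebook; the sample rows are invented.

```python
import pandas as pd

df = pd.DataFrame({"movieId": [1, 2], "genres": ["Comedy|Drama", "Drama|War"]})

# Split each string on '|' and create one 0/1 column per distinct genre.
indicators = df["genres"].str.get_dummies(sep="|")

# Append the indicator columns to the original frame.
df = pd.concat([df, indicators], axis=1)
print(df)
```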


We drop the 'genres', 'title', and '(no genres listed)' columns. They are no longer needed once 'genres' has been expanded into binary attributes, and removing them leaves only the relevant attributes in the dataset.

In [9]:
cu.drop_columns(new_merged_data, ['genres', 'title', '(no genres listed)'])

The 'tag' attribute contains free-text categorical values. It is not encoded as a feature here; the cell below drops it from the dataset.

In [10]:
cu.drop_columns(new_merged_data, ['tag'])



We add a new column total_genres to the new_merged_data DataFrame. This column counts the number of genres associated with each movie by summing the binary genre indicators across each row. 

In [11]:
new_merged_data['total_genres'] = new_merged_data.iloc[:, 4:].sum(axis=1)
new_merged_data


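| | userId | movieId | rating | average_rating | Comedy | Drama | Crime | Thriller | War | Mystery | ... | Adventure | Children | Fantasy | Animation | Horror | Romance | Musical | Western | Film-Noir | total_genres |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 60756 | 5.0 | 4.187500 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 60756 | 5.0 | 4.187500 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2 | 60756 | 5.0 | 4.187500 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 2 | 89774 | 5.0 | 5.000000 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 2 | 89774 | 5.0 | 5.000000 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3471 | 606 | 6107 | 4.0 | 4.000000 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3472 | 606 | 7382 | 4.5 | 4.166667 | 0 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 3473 | 610 | 3265 | 5.0 | 5.000000 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 3474 | 610 | 3265 | 5.0 | 5.000000 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |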


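We prepare the new_merged_data DataFrame by filling missing values with 0, collecting the genre column names, and standardizing the numerical features. Note that the standardization overwrites rating, average_rating, and total_genres in place, so from this point on the 'rating' column holds standardized values rather than the original ratings.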

In [12]:
new_merged_data.fillna(0, inplace=True)
genre_columns = new_merged_data.columns[4:-1].to_list()
numerical_features = ['rating', 'average_rating', 'total_genres']
scaler = StandardScaler()
new_merged_data[numerical_features] = scaler.fit_transform(new_merged_data[numerical_features])

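We define a function rule_based_rating that predicts a rating for a given user-movie pair with simple rule-based logic. It computes the user's genre preferences (the share of the user's rows that carry each genre), sums those preferences over the genres of the target movie, and adds the user's average (standardized) rating. It returns None when the user or the movie is not in the data.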

In [13]:
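def rule_based_rating(user_id, movie_id):
    # All rows for this user and for this movie.
    user_data = new_merged_data[new_merged_data['userId'] == user_id]
    movie_data = new_merged_data[new_merged_data['movieId'] == movie_id]

    # No prediction is possible for an unknown user or movie.
    if user_data.empty or movie_data.empty:
        return None

    # Share of the user's rows that carry each genre.
    user_genre_preferences = user_data[genre_columns].mean()
    # 0/1 genre indicators of the target movie.
    movie_genres = movie_data[genre_columns].iloc[0]

    # Sum the user's preferences over the genres the movie belongs to.
    score = (user_genre_preferences * movie_genres).sum()
    # Mean of the user's standardized ratings.
    average_user_rating = user_data['rating'].mean()

    rating = score + average_user_rating
    return rating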
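To make the arithmetic of the rule concrete, here is a small worked example on invented data. The user rows, genre set, and target movie are hypothetical; only the formula (preference-weighted genre sum plus mean rating) is taken from rule_based_rating.

```python
import pandas as pd

genre_columns = ["Comedy", "Drama", "War"]

# Hypothetical rows for one user: standardized rating plus 0/1 genre flags.
user_rows = pd.DataFrame({
    "rating": [0.5, -0.5, 1.0, 1.0],
    "Comedy": [1, 0, 1, 0],
    "Drama":  [0, 1, 1, 1],
    "War":    [0, 0, 0, 1],
})
# Hypothetical target movie tagged Drama and War.
movie_genres = pd.Series({"Comedy": 0, "Drama": 1, "War": 1})

# Share of the user's rows that carry each genre: 0.5, 0.75, 0.25.
preferences = user_rows[genre_columns].mean()
# Preferences summed over the movie's genres: 0.75 + 0.25 = 1.0.
score = (preferences * movie_genres).sum()
# Offset by the user's mean rating (0.5), giving 1.5.
prediction = score + user_rows["rating"].mean()
print(prediction)
```

The score grows with the number of genres a movie has, so it is a preference signal rather than a quantity in rating units.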

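The clustering_based_rating function fits K-means with 10 clusters on the numerical and genre features, looks up the cluster center assigned to each of the user's rows, and returns the rating of the row in the movie feature table that minimizes the 'cluster_distance' column computed against those centers. The movie_id argument is only used to check that the movie exists in the data; the candidate rows are not restricted to the user's cluster.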

In [14]:
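def clustering_based_rating(user_id, movie_id):
    # Distinct feature rows, keyed by userId.
    user_features = new_merged_data[['userId'] + numerical_features + genre_columns].drop_duplicates()

    if user_features.empty:
        return None

    # Cluster the rows on their numerical and genre features.
    # The model is refit on every call, which is slow when called once per test row.
    kmeans = KMeans(n_clusters=10, random_state=42)
    kmeans.fit(user_features.drop(columns=['userId']))

    user_data = user_features[user_features['userId'] == user_id]
    if user_data.empty:
        return None

    # One cluster label, and hence one center, per row of this user.
    user_cluster = kmeans.predict(user_data.drop(columns=['userId']))
    cluster_center = kmeans.cluster_centers_[user_cluster]

    # Distinct feature rows, keyed by movieId.
    movie_features = new_merged_data[['movieId'] + numerical_features + genre_columns].drop_duplicates()
    if movie_features.empty:
        return None

    # movie_id is only used to check that the movie exists.
    movie_data = movie_features[movie_features['movieId'] == movie_id]
    if movie_data.empty:
        return None

    # NOTE: element [0] of pairwise_distances_argmin_min is the index of the nearest
    # center for each row; the distances themselves are element [1].
    movie_features['cluster_distance'] = pairwise_distances_argmin_min(movie_features.drop(columns=['movieId']), cluster_center)[0]

    # Row with the smallest 'cluster_distance' value; its rating is the prediction.
    closest_movie = movie_features.loc[movie_features['cluster_distance'].idxmin()]
    predicted_rating = closest_movie['rating']
    return predicted_rating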
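Since the function relies on pairwise_distances_argmin_min, it helps to see what that call returns. The sketch below uses toy points and a single hypothetical center; it shows that the first element of the result holds indices of the nearest center and the second holds the distances, so a "closest row" lookup would normally use the second element.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

# Three toy points and one reference center.
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
center = np.array([[1.0, 1.0]])

# Element 0: index of the nearest center for each point.
# Element 1: Euclidean distance from each point to that nearest center.
nearest_idx, nearest_dist = pairwise_distances_argmin_min(points, center)

# Position of the point that lies closest to the center.
closest_point = int(np.argmin(nearest_dist))
print(nearest_idx, nearest_dist, closest_point)
```

With a single center, nearest_idx is all zeros and carries no ranking information, whereas nearest_dist identifies the third point as the closest one.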

The combined_rating function calculates a final rating for a user-movie pair by averaging the rule-based and clustering-based ratings, and then denormalizes this combined rating to return it in its original scale.

In [15]:
def combined_rating(user_id, movie_id):
    rule_rating = rule_based_rating(user_id, movie_id)
    clustering_rating = clustering_based_rating(user_id, movie_id)
    
    if rule_rating is None or clustering_rating is None:
        return None  
    
    combined_rating = (rule_rating + clustering_rating) / 2
    denormalized_rating = combined_rating * scaler.scale_[0] + scaler.mean_[0]
    return denormalized_rating
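The denormalization step inverts the standardization z = (x - mean) / scale for the first scaled feature, which is 'rating' in numerical_features. A minimal sketch on toy ratings, showing that the manual formula agrees with StandardScaler.inverse_transform:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy ratings as a single-feature matrix.
ratings = np.array([[3.0], [4.0], [5.0]])
scaler = StandardScaler()
z = scaler.fit_transform(ratings)

# Manual inverse for feature 0, the same formula combined_rating uses.
manual = z[:, 0] * scaler.scale_[0] + scaler.mean_[0]
# Built-in inverse for comparison.
builtin = scaler.inverse_transform(z)[:, 0]
print(manual, builtin)
```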

In [16]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_model():
    # Sample 20% of the rows as the evaluation set.
    test_data = new_merged_data.sample(frac=0.2, random_state=42)
    test_data = test_data[['userId', 'movieId', 'rating']]
    test_data['predicted_rating'] = test_data.apply(lambda x: combined_rating(x['userId'], x['movieId']), axis=1)

    # NOTE: 'rating' is standardized here, while predicted_rating is on the original scale.
    mae = mean_absolute_error(test_data['rating'], test_data['predicted_rating'])
    # Take the square root explicitly; squared=False is deprecated since scikit-learn 1.4.
    rmse = np.sqrt(mean_squared_error(test_data['rating'], test_data['predicted_rating']))

    return mae, rmse
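In evaluate_model, the 'rating' column taken from new_merged_data is standardized, whereas combined_rating returns values on the original rating scale. One way to compare like with like, sketched here on invented numbers, is to denormalize the true ratings before computing the metrics. The predictions below are hypothetical and this is not the evaluation the notebook ran.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler

# Toy true ratings, standardized the same way as in the notebook.
true_ratings = np.array([[3.0], [4.0], [5.0]])
scaler = StandardScaler()
z_true = scaler.fit_transform(true_ratings)[:, 0]

# Hypothetical predictions, already on the original rating scale.
pred = np.array([3.5, 4.0, 4.0])

# Bring the standardized targets back to the original scale before scoring.
y_true = z_true * scaler.scale_[0] + scaler.mean_[0]

mae = mean_absolute_error(y_true, pred)
rmse = np.sqrt(mean_squared_error(y_true, pred))
print(mae, rmse)
```

Both metrics are then expressed in rating units, which makes them directly interpretable.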



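We evaluate the recommender system with two common metrics: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Two caveats apply when reading the values below. First, evaluate_model compares the standardized 'rating' column against predictions that combined_rating has converted back to the original scale, so the errors are inflated by that scale mismatch. Second, the test rows are sampled from the same data the predictors read, so this is not a held-out evaluation.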

In [23]:
mae, rmse = evaluate_model()
print(f'Mean Absolute Error: {mae}')
print(f'Root Mean Squared Error: {rmse}')


Mean Absolute Error: 4.198124669464954
Root Mean Squared Error: 4.309352556356485




Testing the model for a certain user_id and movie_id to get the predicted rating

In [20]:
user_id = 336
movie_id = 176419
predicted_rating = combined_rating(user_id, movie_id)
print(f'Predicted rating for user {user_id} and movie {movie_id} is {predicted_rating}')

Predicted rating for user 336 and movie 176419 is 3.722089683639987
