## Data Mining Assignment 2

We import the codeUtils module under the alias cu. It contains the utility functions used throughout this data mining project, and the short alias keeps later calls concise.

In [1]:
import codeUtils as cu

Loading our four datasets

In [2]:
movies = cu.load_data('data/movies.csv')
ratings = cu.load_data('data/ratings.csv')
tags = cu.load_data('data/tags.csv')
links = cu.load_data('data/links.csv')
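The source of codeUtils is not shown in this notebook, so the following is only a guess at what a loader like cu.load_data might do, assuming it is a thin wrapper around pandas.read_csv. The function body and the in-memory CSV rows are illustrative and not taken from the module.

```python
import io

import pandas as pd


def load_data(path_or_buffer):
    """Hypothetical stand-in for cu.load_data: read a CSV into a DataFrame."""
    return pd.read_csv(path_or_buffer)


# In-memory CSV standing in for data/movies.csv (toy rows for illustration).
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)
sample_movies = load_data(sample_csv)
print(sample_movies.shape)
```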

In [3]:
print("Total number of Movies: "+str(len(movies)))
print("Total number of Users: "+str(ratings.userId.nunique()))


Total number of Movies: 9742
Total number of Users: 610


## Managing the dataset

### Note: From README.txt
_Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970_

This attribute is not relevant to our study, so we drop it from both tags and ratings.

In [4]:
cu.drop_columns(tags, ['timestamp'])
cu.drop_columns(ratings, ['timestamp'])
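Because the calls above do not reassign their result, cu.drop_columns presumably modifies the DataFrame in place. A minimal sketch under that assumption (the function body is hypothetical, not the module's actual code):

```python
import pandas as pd


def drop_columns(df, columns):
    """Hypothetical stand-in for cu.drop_columns: drop columns in place."""
    df.drop(columns=columns, inplace=True)


# Toy ratings table with a timestamp column (seconds since the Unix epoch).
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 2],
        "movieId": [10, 20],
        "rating": [4.0, 3.5],
        "timestamp": [964982703, 964981247],
    }
)
drop_columns(toy_ratings, ["timestamp"])
print(list(toy_ratings.columns))
```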

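We merge ratings with movies on movieId (inner join), then attach the user-generated tags on userId and movieId (left join). After dropping rows with missing values and duplicate rows, merged_data consolidates movie ratings, movie details, and user-generated tags, and is used for all further analysis and processing.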

In [5]:
merged_data = cu.merge_data(ratings, movies, 'movieId','inner')

merged_data = cu.merge_data(merged_data, tags, ['userId','movieId'],'left')

cu.drop_na(merged_data)
cu.drop_duplicate(merged_data)

merged_data


Unnamed: 0,userId,movieId,rating,title,genres,tag
241,2,60756,5.0,Step Brothers (2008),Comedy,funny
242,2,60756,5.0,Step Brothers (2008),Comedy,Highly quotable
243,2,60756,5.0,Step Brothers (2008),Comedy,will ferrell
252,2,89774,5.0,Warrior (2011),Drama,Boxing story
253,2,89774,5.0,Warrior (2011),Drama,MMA
...,...,...,...,...,...,...
99967,606,6107,4.0,Night of the Shooting Stars (Notte di San Lore...,Drama|War,World War II
100087,606,7382,4.5,I'm Not Scared (Io non ho paura) (2003),Drama|Mystery|Thriller,for katie
101553,610,3265,5.0,Hard-Boiled (Lat sau san taam) (1992),Action|Crime|Drama|Thriller,gun fu
101554,610,3265,5.0,Hard-Boiled (Lat sau san taam) (1992),Action|Crime|Drama|Thriller,heroic bloodshed
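The two joins and the cleanup can be reproduced with plain pandas on toy data. This sketch assumes cu.merge_data, cu.drop_na, and cu.drop_duplicate behave like DataFrame.merge, dropna, and drop_duplicates; the toy frames are invented for illustration.

```python
import pandas as pd

ratings_toy = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.0, 5.0]}
)
movies_toy = pd.DataFrame(
    {"movieId": [10, 20], "title": ["A", "B"], "genres": ["Comedy", "Drama"]}
)
tags_toy = pd.DataFrame(
    {"userId": [1, 1], "movieId": [10, 10], "tag": ["funny", "quotable"]}
)

# Inner join: keep only ratings whose movie exists in the movies table.
step1 = ratings_toy.merge(movies_toy, on="movieId", how="inner")

# Left join: keep every rating; untagged ratings get NaN in the tag column,
# and a rating with several tags is repeated once per tag.
step2 = step1.merge(tags_toy, on=["userId", "movieId"], how="left")

# Dropping NaN rows therefore keeps only ratings that have at least one tag.
tagged_only = step2.dropna().drop_duplicates()
print(len(step2), len(tagged_only))
```

If the helpers do work this way, the dropna step is what shrinks the dataset to tagged ratings only.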


We compute the average rating of each movie and merge it back on movieId, so that merged_data now includes an average_rating column.

In [6]:
average_rating = cu.calculate_average(merged_data,'movieId','rating')

merged_data = cu.merge_data(merged_data, average_rating, 'movieId','inner')

merged_data


Unnamed: 0,userId,movieId,rating,title,genres,tag,average_rating
0,2,60756,5.0,Step Brothers (2008),Comedy,funny,4.187500
1,2,60756,5.0,Step Brothers (2008),Comedy,Highly quotable,4.187500
2,2,60756,5.0,Step Brothers (2008),Comedy,will ferrell,4.187500
3,2,89774,5.0,Warrior (2011),Drama,Boxing story,5.000000
4,2,89774,5.0,Warrior (2011),Drama,MMA,5.000000
...,...,...,...,...,...,...,...
3471,606,6107,4.0,Night of the Shooting Stars (Notte di San Lore...,Drama|War,World War II,4.000000
3472,606,7382,4.5,I'm Not Scared (Io non ho paura) (2003),Drama|Mystery|Thriller,for katie,4.166667
3473,610,3265,5.0,Hard-Boiled (Lat sau san taam) (1992),Action|Crime|Drama|Thriller,gun fu,5.000000
3474,610,3265,5.0,Hard-Boiled (Lat sau san taam) (1992),Action|Crime|Drama|Thriller,heroic bloodshed,5.000000
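A per-movie average of this kind can be sketched with groupby. The helper below is a hypothetical stand-in for cu.calculate_average; the output column name average_rating is chosen to match the table above.

```python
import pandas as pd


def calculate_average(df, group_col, value_col):
    """Hypothetical stand-in for cu.calculate_average: mean of value_col per group."""
    return (
        df.groupby(group_col)[value_col]
        .mean()
        .rename("average_" + value_col)
        .reset_index()
    )


toy = pd.DataFrame({"movieId": [10, 10, 20], "rating": [4.0, 5.0, 3.0]})
avg = calculate_average(toy, "movieId", "rating")

# Merging back repeats the movie-level average on every row of that movie.
with_avg = toy.merge(avg, on="movieId", how="inner")
print(with_avg)
```

If the real helper works this way, note that merged_data holds one row per (user, movie, tag), so a rating that carries several tags contributes several times to the mean.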


We binarize the genres attribute: each genre becomes its own 0/1 column, which makes the data easier to analyze.

In [7]:
new_merged_data = cu.transform_attribute_to_multiple(merged_data, 'genres', '|')

new_merged_data

Unnamed: 0,userId,movieId,rating,title,genres,tag,average_rating,Comedy,Drama,Crime,...,Adventure,Children,Fantasy,Animation,Horror,Romance,Musical,Western,Film-Noir,(no genres listed)
0,2,60756,5.0,Step Brothers (2008),Comedy,funny,4.187500,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,60756,5.0,Step Brothers (2008),Comedy,Highly quotable,4.187500,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,60756,5.0,Step Brothers (2008),Comedy,will ferrell,4.187500,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,89774,5.0,Warrior (2011),Drama,Boxing story,5.000000,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,2,89774,5.0,Warrior (2011),Drama,MMA,5.000000,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3471,606,6107,4.0,Night of the Shooting Stars (Notte di San Lore...,Drama|War,World War II,4.000000,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3472,606,7382,4.5,I'm Not Scared (Io non ho paura) (2003),Drama|Mystery|Thriller,for katie,4.166667,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3473,610,3265,5.0,Hard-Boiled (Lat sau san taam) (1992),Action|Crime|Drama|Thriller,gun fu,5.000000,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3474,610,3265,5.0,Hard-Boiled (Lat sau san taam) (1992),Action|Crime|Drama|Thriller,heroic bloodshed,5.000000,0,1,1,...,0,0,0,0,0,0,0,0,0,0
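One way to obtain such indicator columns in plain pandas is Series.str.get_dummies. This is a sketch on two toy rows and may differ from how cu.transform_attribute_to_multiple is implemented (for instance in column order).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "movieId": [60756, 3265],
        "genres": ["Comedy", "Action|Crime|Drama|Thriller"],
    }
)

# Split on "|" and create one 0/1 column per distinct genre.
genre_flags = toy["genres"].str.get_dummies(sep="|")
toy_binarized = pd.concat([toy, genre_flags], axis=1)
print(toy_binarized)
```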


We drop the columns 'genres', 'title', and '(no genres listed)'. They are no longer needed once the 'genres' attribute has been transformed into binary attributes, and removing them leaves only the relevant attributes in the dataset.

In [8]:
cu.drop_columns(new_merged_data, ['genres', 'title', '(no genres listed)'])

The 'tag' attribute contains categorical values that are transformed into numerical labels.

In [9]:
new_merged_data = cu.transform_strings_to_numbers(new_merged_data, 'tag')

new_merged_data

Unnamed: 0,userId,movieId,rating,tag,average_rating,Comedy,Drama,Crime,Thriller,War,...,IMAX,Adventure,Children,Fantasy,Animation,Horror,Romance,Musical,Western,Film-Noir
0,2,60756,5.0,1,4.187500,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,60756,5.0,2,4.187500,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,60756,5.0,3,4.187500,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,89774,5.0,4,5.000000,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,89774,5.0,5,5.000000,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3471,606,6107,4.0,69,4.000000,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3472,606,7382,4.5,1540,4.166667,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3473,610,3265,5.0,1541,5.000000,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3474,610,3265,5.0,1542,5.000000,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
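The output above shows labels starting at 1 and a repeated tag reusing its earlier label. That behavior can be mimicked with pandas.factorize, as in this sketch; the actual implementation of cu.transform_strings_to_numbers is not shown and may differ.

```python
import pandas as pd

tag_values = pd.Series(["funny", "Highly quotable", "will ferrell", "funny"])

# factorize assigns 0, 1, 2, ... in order of first appearance;
# adding 1 makes the labels start at 1.
codes, uniques = pd.factorize(tag_values)
labels = codes + 1
print(list(labels))
```

Such integer labels are arbitrary identifiers: their numeric order carries no meaning, which matters for any later step that treats the column as a quantity.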


The user_movie_matrix is created by pivoting the dataset to have users as rows, movies as columns, and ratings as values. Any missing values are filled with 0.

The binary_user_movie_matrix is derived from the user_movie_matrix by applying a lambda function to each element. If the rating is greater than 0, the corresponding element in the binary matrix is set to 1; otherwise, it is set to 0.

These matrices represent the user-movie interactions in our dataset, with the binary matrix indicating whether a user has rated a movie or not. 

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

user_movie_matrix = new_merged_data.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
binary_user_movie_matrix = user_movie_matrix.applymap(lambda x: 1 if x > 0 else 0)
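The pivot and the binarization are easier to follow on a tiny example. The sketch below uses invented ratings and a vectorized comparison in place of the element-wise lambda; both give the same 0/1 matrix.

```python
import pandas as pd

toy = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [10, 20, 10], "rating": [4.0, 3.0, 5.0]}
)

# Users as rows, movies as columns; pivot_table averages duplicate
# (userId, movieId) pairs by default, and missing pairs become 0.
matrix = toy.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)

# 1 where the user rated the movie, 0 otherwise.
binary = (matrix > 0).astype(int)
print(binary)
```

Note that DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map, so the vectorized form also avoids a deprecation warning on recent versions.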


We split the new_merged_data dataset into training (80%) and testing (20%) subsets using the train_test_split function from scikit-learn. Fixing random_state=42 makes the split reproducible.

In [22]:
train_data, test_data = train_test_split(new_merged_data, test_size=0.2, random_state=42)
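As a quick illustration of what test_size=0.2 does, here is the same call on a ten-row toy frame (invented data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"userId": range(10), "rating": [3.0] * 10})

# 20% of the rows go to the test set; the rest stay in the training set.
train_part, test_part = train_test_split(toy, test_size=0.2, random_state=42)
print(len(train_part), len(test_part))
```

The split is made row by row, so the same user can appear in both subsets.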


We standardize user_movie_matrix with StandardScaler, so that each feature (here, each movie column) has a mean of 0 and a standard deviation of 1.

In [23]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
normalized_user_movie_matrix = scaler.fit_transform(user_movie_matrix)
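The effect of StandardScaler can be checked on a small array (invented values): after fit_transform, every column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three users (rows) and two movies (columns); 0.0 stands for "not rated".
X = np.array([[5.0, 0.0], [3.0, 4.0], [1.0, 2.0]])

scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Because the scaler works column by column, the zeros that fillna(0) introduced for unrated movies are treated as real values when the means and standard deviations are computed.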


We apply KMeans clustering from scikit-learn to the normalized_user_movie_matrix.

We instantiate a KMeans object with n_clusters=5, meaning that users are grouped into 5 clusters. The random_state=42 parameter fixes the random seed so that the clustering results are reproducible.

We then use the fit_predict method to fit the KMeans model to the normalized matrix and predict the cluster label of each user in a single step. The labels are stored in the user_clusters DataFrame, indexed by userId.

In [24]:
kmeans = KMeans(n_clusters=5, random_state=42)
user_clusters = pd.DataFrame(kmeans.fit_predict(normalized_user_movie_matrix), index=user_movie_matrix.index, columns=['Cluster'])
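A toy run makes the shape of the result clearer. The points and user ids below are invented; n_init=10 is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Four users described by two features, forming two obvious groups.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
user_ids = pd.Index([1, 2, 3, 4], name="userId")

model = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = pd.DataFrame(model.fit_predict(points), index=user_ids, columns=["Cluster"])
print(clusters)
```

The numeric value of a cluster label has no meaning in itself; only which users share a label matters.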


In [25]:
# Attach each user's cluster label to the training rows.
train_data = train_data.merge(user_clusters, on='userId')

# Do the same for the test rows; the left join keeps every test row.
test_data = test_data.merge(user_clusters, on='userId', how='left')


In [26]:
frequent_itemsets = apriori(binary_user_movie_matrix, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
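The rules rely on three standard measures: support, confidence, and lift. The sketch below computes them by hand with pandas for a single rule on an invented 4-user matrix, which is a way to see what the columns of the rules table mean; it does not call mlxtend.

```python
import pandas as pd

# Rows are users, columns are movie ids, True means "rated".
baskets = pd.DataFrame(
    {10: [1, 1, 1, 0], 20: [1, 1, 0, 0], 30: [0, 1, 1, 1]},
    index=pd.Index([1, 2, 3, 4], name="userId"),
).astype(bool)


def support(items):
    """Fraction of users who rated every movie in items."""
    return baskets[list(items)].all(axis=1).mean()


# Rule {10} -> {20}
sup_antecedent = support([10])          # 3 of 4 users
sup_consequent = support([20])          # 2 of 4 users
sup_both = support([10, 20])            # 2 of 4 users
confidence = sup_both / sup_antecedent  # share of movie-10 raters who also rated 20
lift = confidence / sup_consequent      # above 1 means positive association
print(sup_both, confidence, lift)
```

With min_support=0.05, an itemset is kept only if at least 5% of users rated all of its movies, and min_threshold=1 on lift keeps rules whose lift is at least 1.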


In [27]:
import numpy as np
def predict_rating(user_id, movie_id):
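    # Unknown user: there is no row in user_movie_matrix, so fall back to the global training mean.
    if user_id not in user_clusters.index:
        return train_data['rating'].mean()
    # Known user, unknown movie: fall back to the user's mean over all matrix columns (unrated = 0).
    if movie_id not in user_movie_matrix.columns:
        return user_movie_matrix.loc[user_id].mean()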

    user_cluster = user_clusters.loc[user_id, 'Cluster']
    cluster_avg_rating = train_data[(train_data['movieId'] == movie_id) & 
                                    (train_data['Cluster'] == user_cluster)]['rating'].mean()

    movie_rules = rules[rules['antecedents'].apply(lambda x: movie_id in x)]

    
    if movie_rules.empty:
        return cluster_avg_rating if not np.isnan(cluster_avg_rating) else 0

    def get_consequent_rating(row):
        consequent_movie_ids = list(row['consequents'])
        consequent_ratings = user_movie_matrix.loc[user_id, consequent_movie_ids].values
        
        valid_ratings = [rating for rating in consequent_ratings if rating > 0]
        valid_confidences = [row['confidence']] * len(valid_ratings)
        
        if len(valid_ratings) == 0 or len(valid_confidences) == 0:
            return 0  

        return np.sum(np.array(valid_ratings) * np.array(valid_confidences))

    weighted_ratings = movie_rules.apply(get_consequent_rating, axis=1)
    weighted_sum = np.sum(weighted_ratings)
    confidence_sum = movie_rules['confidence'].sum()

    weighted_avg_rating = weighted_sum / confidence_sum if confidence_sum > 0 else 0

    if pd.isna(cluster_avg_rating):
        return weighted_avg_rating
    
    return (cluster_avg_rating + weighted_avg_rating) / 2
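In predict_rating, the weighted sum only includes rules whose consequents the user has rated, while confidence_sum adds up the confidence of every matching rule, which pulls the weighted average toward 0. The following is a sketch of a variant that normalizes only over the contributing entries. It is not the notebook's function, the helper name is hypothetical, and using it would change the errors reported below.

```python
import numpy as np


def confidence_weighted_rating(ratings, confidences):
    """Confidence-weighted mean of ratings, ignoring unrated (0) entries."""
    ratings = np.asarray(ratings, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    rated = ratings > 0
    # Nothing usable: mirror the notebook's convention of returning 0.
    if not rated.any() or confidences[rated].sum() == 0:
        return 0.0
    return float(np.average(ratings[rated], weights=confidences[rated]))


# The middle entry is unrated, so its confidence of 0.9 is left out:
# (4.0 * 0.6 + 5.0 * 0.2) / (0.6 + 0.2) = 4.25
example = confidence_weighted_rating([4.0, 0.0, 5.0], [0.6, 0.9, 0.2])
print(example)
```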


In [28]:

predicted_ratings = []

for index, row in test_data.iterrows():
    predicted_rating = predict_rating(row['userId'], row['movieId'])
    predicted_ratings.append(predicted_rating)

mae = mean_absolute_error(test_data['rating'], predicted_ratings)
print(f'Mean Absolute Error: {mae}')

mse = mean_squared_error(test_data['rating'], predicted_ratings)
print(f'Mean Squared Error: {mse}')

Mean Absolute Error: 1.3409034658910708
Mean Squared Error: 5.020302660595675
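To make the two metrics concrete, this sketch computes them by hand on three invented predictions. The third prediction is 0, the value predict_rating returns when it has no information.

```python
import numpy as np

actual = np.array([5.0, 4.0, 3.0])
predicted = np.array([4.0, 4.5, 0.0])

errors = actual - predicted
mae_toy = np.mean(np.abs(errors))  # (1 + 0.5 + 3) / 3
mse_toy = np.mean(errors ** 2)     # (1 + 0.25 + 9) / 3
print(mae_toy, mse_toy)
```

Because MSE squares each error, a few fallback predictions of 0 weigh much more heavily on it than on MAE.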


## Normalization and Standardization

### 1. Normalization

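We apply min-max scaling to the rating, average_rating, and tag columns, which rescales each of them to the range [0, 1].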

In [10]:
normalized_data = cu.normalize_data(new_merged_data, ['rating', 'average_rating', 'tag'])

normalized_data

Unnamed: 0,userId,movieId,rating,tag,average_rating,Adventure,Animation,Children,Comedy,Fantasy,...,Action,Drama,War,Sci-Fi,Western,Horror,Musical,Film-Noir,IMAX,Documentary
0,336,1,0.777778,0.000000,0.740741,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,474,1,0.777778,0.000000,0.740741,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,567,1,0.666667,0.000649,0.740741,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,289,3,0.444444,0.001297,0.444444,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,289,3,0.444444,0.001946,0.444444,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3471,567,170945,0.666667,0.998054,0.666667,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3472,567,176419,0.555556,0.998703,0.555556,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3473,567,176419,0.555556,0.999351,0.555556,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3474,567,176419,0.555556,1.000000,0.555556,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
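Min-max scaling maps each value x to (x - min) / (max - min). The helper below is a hypothetical stand-in for cu.normalize_data, written only to illustrate the formula; the toy values are invented.

```python
import pandas as pd


def min_max(df, columns):
    """Hypothetical stand-in for cu.normalize_data: rescale columns to [0, 1]."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        # Assumes hi > lo; a constant column would divide by zero.
        out[col] = (out[col] - lo) / (hi - lo)
    return out


toy = pd.DataFrame({"rating": [0.5, 4.0, 5.0], "Comedy": [1, 0, 1]})
scaled_toy = min_max(toy, ["rating"])
print(scaled_toy)
```

With a minimum of 0.5 and a maximum of 5.0, a rating of 4.0 becomes 3.5 / 4.5, roughly 0.777778, the value shown in the first rows of the table above.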


### 2. Standardization

We apply StandardScaler to the same three columns, so that each has a mean of 0 and a standard deviation of 1.

In [11]:
standrized_data = cu.standardize_data(new_merged_data, ['rating', 'average_rating', 'tag'])

standrized_data

Unnamed: 0,userId,movieId,rating,tag,average_rating,Adventure,Animation,Children,Comedy,Fantasy,...,Action,Drama,War,Sci-Fi,Western,Horror,Musical,Film-Noir,IMAX,Documentary
0,336,1,-0.019642,-1.339026,-0.224722,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,474,1,-0.019642,-1.339026,-0.224722,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,567,1,-0.603208,-1.336675,-0.224722,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,289,3,-1.770339,-1.334324,-1.857613,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,289,3,-1.770339,-1.331973,-1.857613,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3471,567,170945,-0.603208,2.279183,-0.632945,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3472,567,176419,-1.186773,2.281534,-1.245279,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3473,567,176419,-1.186773,2.283885,-1.245279,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3474,567,176419,-1.186773,2.286236,-1.245279,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
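Standardization replaces each value x with (x - mean) / std. The helper below is a hypothetical stand-in for cu.standardize_data that applies the formula to selected columns only and leaves the binary genre flags untouched. It uses the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import pandas as pd


def standardize(df, columns):
    """Hypothetical stand-in for cu.standardize_data: z-score selected columns."""
    out = df.copy()
    for col in columns:
        # ddof=0 gives the population standard deviation, as in StandardScaler.
        out[col] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


toy = pd.DataFrame({"rating": [3.0, 4.0, 5.0], "Comedy": [1, 0, 1]})
z = standardize(toy, ["rating"])
print(z)
```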


## Split the data into training and test datasets

In [12]:
nrm_train, nrm_test = cu.split_data(normalized_data)

nrm_train.shape, nrm_test.shape

((2780, 24), (696, 24))

In [13]:
std_train, std_test = cu.split_data(standrized_data)

std_train.shape, std_test.shape

((2780, 24), (696, 24))