## DataMining Assignment 2

Import a Python module named codeUtils using the alias cu. This module contains utility functions and methods that we'll use throughout our data mining project. Using an alias like cu makes it easier to reference the functions within the module in your subsequent code.

In [1]:
!dvc pull

Everything is up to date.


In [2]:
import codeUtils as cu

Loading our four datasets

In [15]:
movies = cu.load_data('data/movies.csv')
ratings = cu.load_data('data/ratings.csv')
tags = cu.load_data('data/tags.csv')
links = cu.load_data('data/links.csv')

In [16]:
print("Total number of Movies: "+str(len(movies)))
print("Total number of Users: "+str(ratings.userId.nunique()))


Total number of Movies: 9742
Total number of Users: 610


### Note: From README.txt
_Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970_

In our case of study this attribute has no importance, so we will get rid of it

In [17]:
cu.drop_columns(tags, ['timestamp'])
cu.drop_columns(ratings, ['timestamp'])

### Merging Data

- Merges ratings with movies on movieId using an inner join.
- Then merges the resulting dataframe with tags on userId and movieId using a left join.
- Drops any rows with missing values and removes duplicate rows.

In [18]:
merged_data = cu.merge_data(ratings, movies, 'movieId','inner')

merged_data = cu.merge_data(merged_data, tags, ['userId','movieId'],'left')

cu.drop_na(merged_data)
cu.drop_duplicate(merged_data)


### Calculate average rating

- Calculate the average rating for each movie.
- Merge this average rating back into the main dataframe.

The resulting "merged_data" dataset now includes the average rating information

In [19]:
avreage_rating = cu.calculate_average(merged_data,'movieId','rating')

merged_data = cu.merge_data(merged_data, avreage_rating, 'movieId','inner')


### Transforming Genres

- Split the genres column into multiple binary columns (one for each genre) using the | separator.

In [20]:
new_merged_data = cu.transform_attribute_to_multiple(merged_data, 'genres', '|')


- The columns being dropped are 'genres', 'title', and '(no genres listed)'. These columns are no longer needed after transforming the 'genres' attribute into binary attributes and are therefore dropped from the dataset, in order to the dataset contains only the relevant attributes.

In [21]:
cu.drop_columns(new_merged_data, ['genres', 'title', '(no genres listed)'])

drop the tag column

In [22]:
cu.drop_columns(new_merged_data, ['tag'])

We add a new column total_genres to the new_merged_data DataFrame. This column counts the number of genres associated with each movie by summing the binary genre indicators across each row. 

In [23]:
new_merged_data['total_genres'] = new_merged_data.iloc[:, 4:].sum(axis=1)
new_merged_data


Unnamed: 0,userId,movieId,rating,average_rating,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Drama,War,Sci-Fi,Western,Horror,Musical,Film-Noir,IMAX,Documentary,total_genres
0,336,1,4.0,3.833333,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,5
1,474,1,4.0,3.833333,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,5
2,567,1,3.5,3.833333,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,5
3,289,3,2.5,2.500000,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,2
4,289,3,2.5,2.500000,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3471,567,170945,3.5,3.500000,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,3
3472,567,176419,3.0,3.000000,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,4
3473,567,176419,3.0,3.000000,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,4
3474,567,176419,3.0,3.000000,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,4


### Splitting Data and Scaling Features

- Split the data into training and test sets.
- Fill any missing values with 0.
- Identify the genre columns and numerical features to be scaled.
- Scale the numerical features using StandardScaler.

In [24]:
train_data, test_data = cu.split_data(new_merged_data)

train_data.fillna(0, inplace=True)
test_data.fillna(0, inplace=True)

genre_columns = new_merged_data.columns[4:-1].to_list()
numerical_features = ['rating', 'average_rating', 'total_genres']

scaler = cu.StandardScaler()
train_data[numerical_features] = scaler.fit_transform(train_data[numerical_features])
test_data[numerical_features] = scaler.fit_transform(test_data[numerical_features])



### Predict rating

In [25]:
user_id = 336
movie_id = 176419

predicted_rating = cu.combined_rating(user_id, movie_id, train_data, test_data, scaler, genre_columns, numerical_features)
print(f'Predicted rating for user {user_id} and movie {movie_id} is {predicted_rating}')

Predicted rating for user 336 and movie 176419 is 4.5


### Evaluating Recommender

Evaluate the recommender system using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) on the test data and prints the results.

In [26]:
mae, rmse = cu.evaluate_recommender(test_data, train_data,scaler, genre_columns, numerical_features)
print(f'Mean Absolute Error: {mae}')
print(f'Root Mean Squared Error: {rmse}')

Mean Absolute Error: 4.4213270136078044
Root Mean Squared Error: 4.519298347513183
