# Session 3: Assignment

```{contents}

```

## Movies Recommender System

### Download dataset

In [None]:
!wget "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
!unzip "ml-latest-small.zip"

In [None]:
import pandas as pd
import numpy as np

**Load movie dataset**

In [None]:
df_movie = pd.read_csv('ml-latest-small/movies.csv', encoding='latin-1')
df_movie.head()

**Load rating dataset**

In [None]:
df_rating = pd.read_csv('ml-latest-small/ratings.csv', encoding='latin-1')
df_rating.head()

## Data Analysis

Before modeling, we should check the following:
- What does each column in the dataset mean?
- What do a few basic statistics about the data look like?
- Are the data types reasonable? For example, are there columns that contain numeric values but have type `string`?
- Is there a column whose values should be unique but that actually contains repeated values?
- Does the DataFrame contain cells with empty values (`None`, `null`, `NaN`)?
- If there is more than one DataFrame, are the links between them consistent?
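
As a minimal sketch of how these checks might look in pandas, the block below runs them on two small made-up tables. The names `toy_movies` and `toy_ratings` and all their values are assumptions for illustration, not part of the MovieLens data.

```python
import pandas as pd

# Made-up tables for illustration only
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 3],
    "title": ["A", "B", None, "C"],
    "year": ["1995", "1996", "1997", "1998"],  # numbers stored as text
})
toy_ratings = pd.DataFrame({"movieId": [1, 2, 4], "rating": [4.0, 3.0, 5.0]})

# Non-numeric columns whose values can all be parsed as numbers
numeric_as_text = [
    col for col in toy_movies.columns
    if not pd.api.types.is_numeric_dtype(toy_movies[col])
    and pd.to_numeric(toy_movies[col], errors="coerce").notna().all()
]

# Number of empty cells in each column
missing_per_column = toy_movies.isna().sum()

# Rows whose supposedly unique id appears more than once
n_rows_with_repeated_id = int(
    toy_movies.duplicated(subset=["movieId"], keep=False).sum()
)

# Ratings that point to a movie missing from the movies table
n_orphan_ratings = int(
    (~toy_ratings["movieId"].isin(toy_movies["movieId"])).sum()
)

print(numeric_as_text, n_rows_with_repeated_id, n_orphan_ratings)
print(missing_per_column)
```

The cells that follow apply the same kind of checks to `df_movie` and `df_rating`.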

In [None]:
df_movie.info()

In the output above, the `title` and `genres` columns have `Dtype=object`, which in this DataFrame means they hold strings.

In [None]:
df_rating.info()

Does any `movieId` appear more than once in `df_movie`?

In [None]:
df_movie.duplicated(subset=["movieId"], keep=False)

The code above only returns a Series of `True`/`False` values. To display the matching rows, we use this Series as a boolean mask on `df_movie`.

In [None]:
df_movie[df_movie.duplicated(subset=["movieId"], keep=False)]

The result is an empty table (only the headers `movieId`, `title`, `genres` are shown), so no `movieId` appears more than once.


Do the same with the `title` column

In [None]:
df_movie[df_movie.duplicated(subset=["title"], keep=False)].sort_values(by=["title"])

In [None]:
df_movie[df_movie.duplicated(subset=["title"], keep="first")].sort_values(by=["title"])

In [None]:
df_movie[df_movie.duplicated(subset=["title"], keep="last")].sort_values(by=["title"])

The `keep` parameter (`first`, `last`, `False`) controls which occurrence of a duplicated value is treated as the original and therefore not returned:
- `keep="first"`: the first occurrence is not returned; all later occurrences are. Here the first occurrence is the row with the smallest `movieId`, so the rows with the larger `movieId` are returned.
- `keep="last"`: the last occurrence is not returned; all earlier occurrences are. Here the last occurrence is the row with the largest `movieId`, so the rows with the smaller `movieId` are returned.
- `keep=False`: no occurrence is treated as the original, so every row with a duplicated value is returned.
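
A small sketch on a made-up table (the ids and titles are assumptions for illustration) shows which rows each setting flags:

```python
import pandas as pd

# Made-up table: the title "Emma" appears twice
toy = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Emma", "Heat", "Emma"],
})

first = toy.duplicated(subset=["title"], keep="first").tolist()
last = toy.duplicated(subset=["title"], keep="last").tolist()
both = toy.duplicated(subset=["title"], keep=False).tolist()

print(first)  # [False, False, True]  -> only the later "Emma" is flagged
print(last)   # [True, False, False]  -> only the earlier "Emma" is flagged
print(both)   # [True, False, True]   -> both "Emma" rows are flagged
```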

In the results above, some films share the same title but have different values in the `genres` column. We could write code to keep the entry with the longer `genres` value, but in this step we will do it manually. As an exercise, can you write code that keeps, for each duplicated title, the movie with the longer `genres` value?
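
One possible approach, sketched on a made-up table (the ids, titles and genres are assumptions for illustration), is to count the genres in each row and keep, for every title, the row with the highest count. This is only one way to do it, and it does not build the `keep_dict` mapping used in the next cell.

```python
import pandas as pd

# Made-up table: "Movie A" appears twice with different genre lists
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie A", "Movie B"],
    "genres": ["Comedy|Drama|Romance", "Romance", "Action|Crime"],
})

# Number of genres = number of "|" separators + 1
n_genres = toy["genres"].str.count(r"\|") + 1

# For each title, the index label of the row with the most genres
keep_idx = n_genres.groupby(toy["title"]).idxmax()

# Keep those rows, in their original order
deduplicated = toy.loc[sorted(keep_idx)]
print(deduplicated)
```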

In [None]:
# key is the deleted ID, value is the retained ID
keep_dict = {
   64997: 34048,
   168358: 2851,
   32600: 147002,
   26958: 838,
   6003: 144606
}

We remove the duplicate movies from `df_movie` by negating `isin` with `~`

In [None]:
df_movie = df_movie[~ df_movie["movieId"].isin(list(keep_dict.keys()))]

In [None]:
df_movie[df_movie.duplicated(subset=["title"], keep=False)].sort_values(by=["title"])

Then, we use the `replace` function to update the `movieId` in `df_rating`

In [None]:
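# Assign the result back instead of calling replace(..., inplace=True) on a selected column
df_rating["movieId"] = df_rating["movieId"].replace(keep_dict)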

At this point, we have a new problem: after replacing `movieId`, `df_rating` has duplicate values in the `userId` and `movieId` columns.

In [None]:
df_rating[df_rating.duplicated(subset=["userId", "movieId"], keep=False)]

There are 2 cases here:
1. The user-movie pair `68-34048` has 2 reviews with different scores $\to$ we will keep the row with the larger `timestamp`
2. For the remaining pairs, we just need to delete 1 of the 2 rows

For simplicity's sake, we'll do step 1 manually instead of writing code
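
If you prefer to do step 1 in code, a minimal sketch is to sort by `timestamp` and keep the last row of each user-movie pair. The ratings and timestamps below are made up for illustration; only the pair `68-34048` comes from the text above.

```python
import pandas as pd

# Made-up rows: user 68 rated movie 34048 twice
toy = pd.DataFrame({
    "userId": [68, 68, 7],
    "movieId": [34048, 34048, 1],
    "rating": [2.5, 4.0, 3.0],
    "timestamp": [200, 100, 50],
})

latest = (
    toy.sort_values("timestamp")  # oldest first
       .drop_duplicates(subset=["userId", "movieId"], keep="last")  # keep newest
       .sort_index()  # restore the original row order
)
print(latest)
```

This handles both cases at once, because identical duplicates also collapse to a single row.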

In [None]:
# Another way to delete a row from a DataFrame is the drop function, which takes the index label of the row to delete
df_rating.drop([11241], axis=0, inplace=True)

# Print again to check
df_rating[df_rating.duplicated(subset=["userId", "movieId"], keep=False)]

In [None]:
# Delete the remaining duplicate rows with keep="last"
df_rating.drop_duplicates(subset=["userId", "movieId"], keep="last", inplace=True)

In [None]:
# Print again to check
df_rating[df_rating.duplicated(subset=["userId", "movieId"], keep=False)]

We only keep movies and users with more than 50 reviews as model training data.

Steps to follow:

1. We use the `value_counts` function to count how many times each `movieId` occurs. The result of the `value_counts` function has
  - `index`: the `movieId`
  - `values`: the number of occurrences of the `movieId`
2. From the result of the `value_counts` function we can find out which `movieId`s have `> 50` occurrences and keep them
3. Do the same with the `userId`
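
The steps above can be sketched on a made-up column as follows. The variable `threshold` and its value are assumptions for illustration. Note that choosing `>` or `>=` decides whether ids with exactly `threshold` occurrences are kept.

```python
import pandas as pd

# Made-up ratings: movie 1 has 3 ratings, movie 2 has 2, movie 3 has 1
toy = pd.DataFrame({"movieId": [1, 1, 1, 2, 2, 3]})
threshold = 1  # hypothetical threshold, much smaller than in the real data

# index = movieId, values = number of occurrences
counts = toy["movieId"].value_counts()

# ids that occur strictly more than `threshold` times
kept_ids = counts[counts > threshold].index

filtered = toy[toy["movieId"].isin(kept_ids)]
print(sorted(kept_ids))  # [1, 2]
print(len(filtered))     # 5
```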

In [None]:
# Keep movies with more than 50 ratings
# (named `mask` rather than `filter` to avoid shadowing the Python built-in)
mask = df_rating["movieId"].value_counts().values > 50
indices = df_rating["movieId"].value_counts().index[mask]
df_rating = df_rating[df_rating["movieId"].isin(indices)]

# Keep users with more than 50 ratings
mask = df_rating["userId"].value_counts().values > 50
indices = df_rating["userId"].value_counts().index[mask]
df_rating = df_rating[df_rating["userId"].isin(indices)]

# Keep only the movies that still appear in df_rating
df_movie = df_movie[df_movie["movieId"].isin(df_rating["movieId"].values)]

We check the number of movies and users remaining after filtering

In [None]:
len(df_rating['movieId'].unique())

In [None]:
len(df_rating['userId'].unique())

We use the `merge` function of pandas to combine the 2 DataFrames `df_rating` and `df_movie`
- By default, `merge` joins on the columns that the 2 DataFrames have in common (in this case, **movieId**)
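
To make the join explicit rather than relying on the default, the key and join type can be passed by name. The sketch below uses two made-up tables (names and values are assumptions for illustration).

```python
import pandas as pd

# Made-up tables: movie 30 has a rating but no entry in `movies`
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.5, 5.0],
})
movies = pd.DataFrame({"movieId": [10, 20], "title": ["A", "B"]})

# on=...       : name the join column explicitly
# how="inner"  : keep only movieIds present in both tables (the default)
# validate=... : raise an error if a movieId is repeated in `movies`
merged = pd.merge(ratings, movies, on="movieId", how="inner",
                  validate="many_to_one")
print(merged)
```

With an inner join, the rating for movie 30 is dropped because it has no matching row in `movies`.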

In [None]:
df = pd.merge(df_rating, df_movie)
df.head()

In [None]:
df.shape

Next, we use `LabelEncoder` to transform `userId` and `movieId` into consecutive integer indices starting at 0

In [None]:
from sklearn.preprocessing import LabelEncoder

user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()

user_encoder.fit(df["userId"].unique())
movie_encoder.fit(df["movieId"].unique())

df["userId"] = user_encoder.transform(df["userId"].values)
df["movieId"] = movie_encoder.transform(df["movieId"].values)

df.head()

Number of users and movies

In [None]:
df["userId"].unique().shape

In [None]:
df["movieId"].unique().shape

## Content-based Recommender System

With the Content-based Recommender System model, we need to build **n models** for **n users** in the system (each user has 1 model of their own)

This method works well if we have 1 standard set of features for each item and each user.

In this dataset, we will use the `genres` column as the features of each movie.
- We have 19 film genres $\to$ 19 features, each of which is either 0 or 1 (the movie does not or does belong to that genre)
- We have 260 users $\to$ 260 models; each model consists of 19 weights and 1 bias, representing the user's preference for each genre of film.
- The model receives the 19 features of a movie and predicts a rating for that movie.
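
To make the "weights and bias" idea concrete, the sketch below computes one prediction by hand, using 3 genres instead of 19. The genre names, weights and bias are made-up values for illustration, not learned from the data.

```python
import numpy as np

genre_names = ["Action", "Comedy", "Drama"]  # 3 genres instead of 19

# Hypothetical model of one user: one weight per genre plus a bias
weights = np.array([1.5, -0.5, 0.5])
bias = 3.0

# A movie that is Action and Drama, but not Comedy
movie_features = np.array([1, 0, 1])

# Linear model: rating = weights . features + bias
predicted_rating = float(weights @ movie_features + bias)
print(predicted_rating)  # 1.5 + 0.5 + 3.0 = 5.0
```

A positive weight means the user tends to rate that genre higher, and a negative weight means lower.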





**Split the genres column into values 0, 1**

In [None]:
# Take the values of the genres column as a NumPy array
genres = df["genres"].values
unique_genre = []

for genre in genres:
  temp = genre.split('|') # split the string on the | character
  for g in temp:
    if g not in unique_genre:
      unique_genre.append(g)


unique_genre = sorted(unique_genre) # sort in alphabetical order

print(unique_genre)
print(len(unique_genre))

Create a dictionary with `movieId` as **key** and a 0/1 vector over the genres as **value**

In [None]:
# Take the values of the movieId column as a NumPy array
ids = df["movieId"].values
movie_id_genre_mapping = {}


for id, genre in zip(ids, genres):

  temp = genre.split('|')
  movie_id_genre_mapping[id] = np.zeros(len(unique_genre), dtype=int)  # Add key-value pair

  for g in temp:
    genre_index = unique_genre.index(g) # determine the index of the genre
    movie_id_genre_mapping[id][genre_index] = 1 # assign value 1

print(movie_id_genre_mapping[1])

To create a DataFrame from a dictionary, the dictionary needs the following structure
```
my_dict = {
  'column_name_1': [],  # contains all values of column 1
  'column_name_2': [],  # contains all values of column 2
  'column_name_3': [],  # contains all values of column 3
  ...
}
```
We convert the dict `movie_id_genre_mapping` into the above format

In [None]:
genre_data = {'movieId': []}
for genre in unique_genre:
  genre_data[genre] = []
print(genre_data)

In [None]:
for key, value in movie_id_genre_mapping.items():
  genre_data['movieId'].append(key)
  for i, v in enumerate(value):
    genre_data[unique_genre[i]].append(v)
print(genre_data)

Create a DataFrame to hold the genre columns

In [None]:
df_genre = pd.DataFrame(data=genre_data)
df_genre.head()
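
As a cross-check, pandas can build the same kind of 0/1 genre table directly with `Series.str.get_dummies`. This is an alternative to the loops above, not the method used in this notebook, and the table below is made up for illustration.

```python
import pandas as pd

# Made-up table with the same column layout as the movie data
toy = pd.DataFrame({
    "movieId": [0, 1],
    "genres": ["Action|Comedy", "Drama"],
})

# One 0/1 column per genre, with column names in alphabetical order
one_hot = toy["genres"].str.get_dummies(sep="|")

# Put movieId next to the genre columns
toy_genre = pd.concat([toy[["movieId"]], one_hot], axis=1)
print(toy_genre)
```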

We use the pandas `merge` function to combine the two DataFrames `df_genre` and `df`
- `merge` automatically finds the columns that the 2 DataFrames have in common (**movieId** in this case) and joins the two DataFrames on them

In [None]:
df = pd.merge(df, df_genre)
df.head()

### Steps to build the Content-based Recommender System
- Iterate through all users
  - Get the list of movies that the user has rated, together with the ratings
  - Train a separate Linear Regression model for that user (please use the `sklearn` library to train Linear Regression for convenience)
  - Save the model for that user.

First, for each user we need to know which movies they rated, which rating they gave, and what the features of those movies are

We build a `get_user_training_data` function to do this
- `database` in this case is `df`
- Returns 3 NumPy arrays containing the movieIds, ratings and features respectively

In [None]:
def get_user_training_data(database, user_id):
  # boolean mask selecting the rows of this user
  mask = (database["userId"] == user_id)
  movie_ids = database[mask]["movieId"].values
  ratings = database[mask]["rating"].values
  # the genre feature columns start at column index 6
  features = database[mask].iloc[:, 6:].values
  return movie_ids, ratings, features

Call the function to test it

In [None]:
test = get_user_training_data(df, user_id=2)
print(test[0].shape) # movieId
print(test[1].shape) # rating
print(test[2].shape) # list of movies' features

#### TODO 1 (5 pts)

Implement the Content-based Recommender System following the steps described above (use `LinearRegression` from `sklearn`)

In [None]:
# YOUR SOLUTION

We now have a model that predicts ratings for each user.

To suggest movies for a given user, we do the following steps:
- Find the ids of the movies that the user has not rated
- Take the feature vectors of those movies
- Use the model stored under that user's key in `all_model` to predict the ratings
- If a movie gets a high predicted rating $\to$ suggest that movie to the user
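
The steps above can be sketched for a single user on made-up data. Everything here is an assumption for illustration: the 2 genre features, the ratings, the cut-off `threshold`, and the layout of `all_model` as a dict from user id to fitted model. It is not the reference solution of TODO 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 5 movies described by 2 genre features
features = np.array([[1, 0], [0, 1], [0, 0], [1, 0], [0, 1]])
rated_ids = np.array([0, 1, 2])      # movies this user has rated
ratings = np.array([5.0, 1.0, 3.0])  # the ratings they gave

# Assumed layout: one fitted model per user id
user_id = 0
all_model = {user_id: LinearRegression().fit(features[rated_ids], ratings)}

# 1. ids of the movies the user has not rated
unrated_ids = np.setdiff1d(np.arange(len(features)), rated_ids)

# 2 + 3. take their features and predict with this user's model
predicted = all_model[user_id].predict(features[unrated_ids])

# 4. suggest the movies whose predicted rating is high enough
threshold = 4.0  # hypothetical cut-off
suggested = unrated_ids[predicted >= threshold]
print(suggested)
```

Instead of a fixed cut-off, one could also sort the predictions and suggest the top few movies.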

In practice:
- If predicting the exact rating is too difficult, we can instead predict whether the user will like the movie or not (use `rating` to create a Like/Dislike column and then use LogisticRegression)
- We should not train the model on all the movies that the user has rated. Instead, split them into Train and Test sets in an 80-20 ratio, then train on Train and evaluate on Test
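
A minimal sketch of both ideas for one user, on made-up data. The features, the ratings and the rule "a rating of 4 or more means Like" are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up data: 40 rated movies with 3 binary genre features each
features = np.array([[i % 2, (i // 2) % 2, (i // 4) % 2] for i in range(40)])
ratings = np.tile([1.0, 2.0, 3.0, 4.0, 5.0], 8)

# Like/Dislike label (assumed rule: rating >= 4 means Like)
liked = (ratings >= 4.0).astype(int)

# 80-20 split into Train and Test
X_train, X_test, y_train, y_test = train_test_split(
    features, liked, test_size=0.2, random_state=0
)

# Train on Train, evaluate on Test
clf = LogisticRegression().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(X_train.shape, X_test.shape, accuracy)
```

Evaluating on movies the model has not seen gives a fairer estimate of how well the suggestions will work.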

## Item-based Collaborative Filtering

Study the image below carefully to understand the logic

![](https://i.imgur.com/HEqxtJF.png)

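We create the Utility Matrix using the `pivot_table` function
- `index`: the column whose values become the rows (here `movieId`)
- `columns`: the column whose values become the columns (here `userId`)
- `values`: the column whose values fill the cells (here `rating`)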
In [None]:
utility_matrix = df.pivot_table(index=['movieId'], columns=['userId'], values='rating').reset_index(drop=True)
utility_matrix

In [None]:
# create utility matrix
utility_matrix = df.pivot_table(index=['movieId'], columns=['userId'], values='rating').reset_index(drop=True)

# fill empty values with 0
utility_matrix.fillna(0, inplace=True)

# convert DataFrame to numpy array
utility_matrix = utility_matrix.values
print(utility_matrix.shape)

(436, 260)


Function to compute the cosine similarity of two vectors

In [None]:
def cosine(a, b):
  # to avoid a zero denominator, add a very small number (epsilon) to it
  return a.dot(b) / ((np.linalg.norm(a) * np.linalg.norm(b)) + np.finfo(np.float64).eps)
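
A quick usage sketch of the function on made-up vectors. The function is repeated here so the block runs on its own, and the input values are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    # epsilon keeps the denominator away from zero
    eps = np.finfo(np.float64).eps
    return a.dot(b) / ((np.linalg.norm(a) * np.linalg.norm(b)) + eps)

# Same direction: similarity close to 1
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))

# Perpendicular vectors: similarity 0
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Empty vectors (two movies with no user in common): 0 instead of a division error
no_overlap = cosine(np.array([]), np.array([]))

print(same_direction, orthogonal, no_overlap)
```

The last case is why epsilon matters: without it, two movies with no common raters would give `0 / 0`.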

Compute item-to-item similarity matrix

In [None]:
from tqdm.notebook import tqdm

movie_len = df["movieId"].unique().shape[0]

# Create a square matrix of zeros whose side length equals the number of movies
item_to_item_similarity_matrix = np.zeros((movie_len, movie_len))

for i in tqdm(range(movie_len)):
  for j in range(movie_len):

    # Take one pair of items
    item_1 = utility_matrix[i]
    item_2 = utility_matrix[j]

    # Keep only the users who rated both items (rating > 0 for both)
    index_not_zero = (item_1 > 0) & (item_2 > 0)

    # Compute the cosine similarity and store it in the similarity matrix
    item_to_item_similarity_matrix[i,j] = cosine(item_1[index_not_zero], item_2[index_not_zero])

Check that the diagonal of the `item_to_item_similarity_matrix` is equal to 1 (up to floating-point rounding).

In [None]:
item_to_item_similarity_matrix.diagonal()

We take the positions in `utility_matrix` where `rating=0`

In [None]:
zero_rating_indices = np.where(utility_matrix == 0)
zero_rating_indices

The variable `zero_rating_indices` above holds 2 arrays: the row indices and the column indices of the cells in `utility_matrix` whose value is 0
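
A small sketch on a made-up matrix shows how the 2 arrays pair up into (row, column) positions:

```python
import numpy as np

# Made-up 2 x 2 utility matrix (rows = movies, columns = users)
toy = np.array([[5.0, 0.0],
                [0.0, 3.0]])

# rows[k] and cols[k] together give the position of the k-th zero
rows, cols = np.where(toy == 0)
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)  # [(0, 1), (1, 0)]
```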

#### TODO 2 (5 pts)
- Think about how to do step 3 of the Item-based Collaborative Filtering method. After calculating the score, you need to assign it back to the corresponding cell in `utility_matrix`
- Hints:
  - Variables in use include `zero_rating_indices`, `item_to_item_similarity_matrix`
  - When applying the formula in step 3, to avoid the denominator being zero, we add a very small number, `np.finfo(np.float64).eps`, to the denominator


In [None]:
# YOUR SOLUTION


We now have the complete Utility Matrix. We can use a heatmap to view it

In [None]:
# YOUR SOLUTION

At this point, we can reapply the Content-based method above to find a separate suggestion model for each user.

Refer to the code above to try the User-based method.


In [None]:
# YOUR SOLUTION