# Lab 10

In [None]:
FIRST_NAME = "Leng"
LAST_NAME = "Her"
STUDENT_ID = "5445877"

## Introduction

This lab has two sections:

1. *Exploratory Data Analysis* - to understand the structure, distributions, and statistics about the data set
2. *Recommendations* - using a collaborative filtering approach to recommend new items to a user.



## The Data Set

This lab uses the [MovieLens](https://grouplens.org/datasets/movielens/) data set, specifically the 100K release: a user-item ratings matrix of 100,000 ratings in which people have rated movies from 1 to 5 stars. It is a benchmark data set for recommender systems that was created by researchers at the University of Minnesota (go gophers).

For more information, check out the [README](https://files.grouplens.org/datasets/movielens/ml-100k-README.txt) file for the data set.
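As a rough sense of scale, the sketch below works out how much of the user-item matrix is actually filled. The counts are the ones reported for the ml-100k release (943 users, 1,682 movies, 100,000 ratings); the variable names are only illustrative.

```python
# Counts reported for the ml-100k release
num_users = 943
num_items = 1682
num_ratings = 100_000

# Every (user, item) pair is a possible rating
possible_ratings = num_users * num_items
density = num_ratings / possible_ratings

print(f"{density:.2%} of the user-item matrix is filled")
```

Most cells in the matrix are empty, which is why the later steps have to decide how to treat missing ratings.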

## Setup

Run the following command to download the data set zip file.

In [1]:
! curl https://files.grouplens.org/datasets/movielens/ml-100k.zip --output movielens.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4808k  100 4808k    0     0  5786k      0 --:--:-- --:--:-- --:--:-- 5779k


Run the following command to unzip the data.

In [2]:
! unzip movielens.zip

Archive:  movielens.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         


Run the following blocks of code to read the data into Pandas [Data Frames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [3]:
import pandas as pd
import numpy as np

In [4]:
column_names = ['userId','itemId', 'rating','timestamp']
ratings_df = pd.read_table('ml-100k/u.data', sep='\t', names=column_names)
userIds = ratings_df.sort_values('userId').userId.unique()
itemIds = ratings_df.sort_values('itemId').itemId.unique()

In [5]:
ratings_df.head()

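|   | userId | itemId | rating | timestamp |
|---|--------|--------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |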


## Exploratory Data Analysis

Print out the answers to the following questions. You can use Pandas, PySpark SQL, or any other Python functions or packages to calculate the answers.

##### Question 1
How many users and how many items are in the data set?

In [6]:
num_users = len(ratings_df['userId'].unique())
num_items = len(ratings_df['itemId'].unique())

print(f"There are {num_users} unique users and {num_items} unique items in the Movie Lens data set.")


There are 943 unique users and 1682 unique items in the Movie Lens data set.


##### Question 2
What is the overall mean and standard deviation of all the ratings?



In [7]:
mean_rating = ratings_df['rating'].mean()
std_rating = ratings_df['rating'].std()

print(f"The overall mean rating is {mean_rating:.2f} with a standard deviation of {std_rating:.2f}.")


The overall mean rating is 3.53 with a standard deviation of 1.13.


##### Question 3
What is the distribution of the ratings? (how many 1s, 2s, 3s, 4s, and 5s are there)


In [8]:
rating_counts = ratings_df['rating'].value_counts().sort_index()

print("Distribution of ratings:")
print(rating_counts)


Distribution of ratings:
1     6110
2    11370
3    27145
4    34174
5    21201
Name: rating, dtype: int64


##### Question 4
List the top 5 items based on their mean rating. Only consider items that have been rated at least 50 times.


In [9]:
# Group the ratings data frame by item ID and calculate the mean rating and number of ratings for each item
item_stats = ratings_df.groupby('itemId').agg({'rating': ['mean', 'size']})

# Filter out items with fewer than 50 ratings
item_stats = item_stats[item_stats['rating']['size'] >= 50]

# Sort the resulting data frame by mean rating in descending order
item_stats = item_stats.sort_values([('rating', 'mean')], ascending=False)

# Print out the top 5 items
print("Top 5 items by mean rating (with at least 50 ratings):")
print(item_stats['rating']['mean'].head())


Top 5 items by mean rating (with at least 50 ratings):
itemId
408    4.491071
318    4.466443
169    4.466102
483    4.456790
114    4.447761
Name: mean, dtype: float64


##### Question 5
List the top 5 users by the total number of items they have rated.

In [10]:
# Group the ratings data frame by user ID and count the number of items each user has rated
user_ratings_count = ratings_df.groupby('userId').size()

# Sort the resulting Series by number of ratings in descending order and print out the top 5 users
print("Top 5 users by number of items rated:")
print(user_ratings_count.sort_values(ascending=False).head())


Top 5 users by number of items rated:
userId
405    737
655    685
13     636
450    540
276    518
dtype: int64


## Recommendations

The following questions each calculate one component of the equation below, which predicts movie ratings for the user with `userId=1`. At the end, we will print out the 5 movies with the highest predicted ratings among those the user has not already rated.

$$
pred(user_1,\ i) = \overline{r}_{1} + \frac{\sum_{u \in neighbors} sim(user_1,\ user_u)\ \cdot\ (r_{ui} - \overline{r_u})}{\sum_{u \in neighbors} sim(user_1,\ user_u)}
\qquad \forall\ i \in Items
$$
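To make the equation concrete, here is a minimal worked example for a single item with two neighbors. All of the numbers and variable names are hypothetical and are not taken from the MovieLens data.

```python
# Hypothetical toy values for one target user and one item
target_mean = 3.5               # average rating of the target user
neighbor_sims = [0.8, 0.5]      # sim(target, neighbor) for each neighbor
neighbor_ratings = [5.0, 2.0]   # each neighbor's rating of the item
neighbor_means = [4.0, 3.0]     # each neighbor's average rating

# Numerator: similarity-weighted sum of the neighbors' centered ratings
numerator = sum(
    sim * (rating - mean)
    for sim, rating, mean in zip(neighbor_sims, neighbor_ratings, neighbor_means)
)

# Denominator: sum of the neighbors' similarities
denominator = sum(neighbor_sims)

prediction = target_mean + numerator / denominator
print(round(prediction, 3))
```

The first neighbor rated the item one star above their own average and the second one star below, so the more similar neighbor pulls the prediction slightly above the target user's average.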

First, run the following code to transform the raw data into a user-item matrix named `user_item`.

In [11]:
# Create user item matrix
user_item = pd.pivot_table(ratings_df, index='userId', values='rating', columns='itemId', fill_value=np.nan)

In [12]:
# Set the userId to find recommendations for
userId = 1

#### Question 6

Find the average rating of every user. Save the results in a Pandas Series named `user_averages`. The index should be the userId, and the values should be that user's average rating.

This is calculating the $\overline{r}_{1}$ and $\overline{r}_u$ values from the equation above.

The DataFrame [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method is the easiest way to do this.

 * Group by the `'userId'` column of `ratings_df`.
 * Take the `.mean()` of the `'rating'` column.

In [13]:
# Group the ratings data frame by user ID and calculate the mean rating for each user
user_averages = ratings_df.groupby('userId')['rating'].mean()

print("Average ratings for each user:")
print(user_averages)


Average ratings for each user:
userId
1      3.610294
2      3.709677
3      2.796296
4      4.333333
5      2.874286
         ...   
939    4.265306
940    3.457944
941    4.045455
942    4.265823
943    3.410714
Name: rating, Length: 943, dtype: float64


#### Question 7

Subtract each user's average rating from each of that user's ratings. Then fill all the missing values with `0`. Save the result in a DataFrame named `centered_by_user`.

This is calculating the $(r_{ui} - \overline{r_u})$ part of the equation above.

The DataFrame [`subtract`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.subtract.html) and [`fillna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) methods are the easiest way to do this.

* Run the subtract method on the `user_item` DataFrame created above
* Subtract the `user_averages` Series created in the previous question
    * _NOTE: use the argument `axis='index'` in the subtract method to subtract by row instead of by column_
* Then use the `fillna()` method to replace the missing values with 0's.
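The steps above can be sketched on a tiny matrix to show what `axis='index'` does. The 2-user, 3-item data and the `toy_` names are hypothetical, and the user means are taken as row means here only to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-user, 3-item matrix; NaN marks an unrated item
toy_user_item = pd.DataFrame(
    [[5.0, 3.0, np.nan],
     [np.nan, 2.0, 4.0]],
    index=pd.Index([1, 2], name='userId'),
    columns=pd.Index([10, 20, 30], name='itemId'),
)

# Row-wise mean per user; NaN entries are skipped
toy_user_means = toy_user_item.mean(axis='columns')

# axis='index' aligns the Series with the row labels, so each user's mean
# is subtracted from every entry in that user's row
toy_centered = toy_user_item.subtract(toy_user_means, axis='index').fillna(0)
print(toy_centered)
```

After centering, a `0` means either "rated exactly at the user's average" or "not rated", so unrated items carry no weight in the later sums.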


In [14]:
# Subtract each user's average rating from all their other ratings
centered_by_user = user_item.subtract(user_averages, axis='index')

# Fill all the missing values with 0
centered_by_user = centered_by_user.fillna(0)

print("User-item matrix with ratings centered by user:")
print(centered_by_user)


User-item matrix with ratings centered by user:
itemId      1         2         3         4         5         6         7     \
userId                                                                         
1       1.389706 -0.610294  0.389706 -0.610294 -0.610294  1.389706  0.389706   
2       0.290323  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
3       0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
4       0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
5       1.125714  0.125714  0.000000  0.000000  0.000000  0.000000  0.000000   
...          ...       ...       ...       ...       ...       ...       ...   
939     0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
940     0.000000  0.000000  0.000000 -1.457944  0.000000  0.000000  0.542056   
941     0.954545  0.000000  0.000000  0.000000  0.000000  0.000000 -0.045455   
942     0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00

#### Question 8
Calculate the cosine similarity between every pair of users. Save the result into an array named `user_similarity_matrix`.

This is calculating the $sim(user_1,\ user_u)$ parts of the equation above.


The easiest way to do this is with the [`pairwise.cosine_similarity()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity) function from Scikit-Learn on the `centered_by_user` DataFrame created in the previous question. 
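For intuition about what that function returns for one pair of users, the sketch below computes the cosine similarity of two hypothetical centered rating vectors by hand with NumPy.

```python
import numpy as np

# Hypothetical mean-centered rating rows for two users
user_a = np.array([1.0, -1.0, 0.0])
user_b = np.array([0.0, -1.0, 1.0])

# cos(a, b) = (a . b) / (||a|| * ||b||)
cosine = np.dot(user_a, user_b) / (np.linalg.norm(user_a) * np.linalg.norm(user_b))
print(cosine)
```

Entries that were filled with `0` add nothing to the dot product or to the norms, so only items a user actually rated away from their average influence the result.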

In [15]:
from sklearn.metrics import pairwise

In [16]:
# Calculate the cosine similarity between every pair of users
user_similarity_matrix = pairwise.cosine_similarity(centered_by_user)

print("User similarity matrix:")
print(user_similarity_matrix)


User similarity matrix:
[[ 1.00000000e+00  4.34108342e-02  1.10508910e-02 ...  2.87902189e-02
  -3.12704963e-02  3.21234686e-02]
 [ 4.34108342e-02  1.00000000e+00  1.36577422e-02 ... -1.73443333e-02
   1.20678821e-02  3.91732571e-02]
 [ 1.10508910e-02  1.36577422e-02  1.00000000e+00 ...  3.44144054e-02
  -9.18699063e-03  1.48879883e-03]
 ...
 [ 2.87902189e-02 -1.73443333e-02  3.44144054e-02 ...  1.00000000e+00
  -1.90550456e-02  6.66444773e-04]
 [-3.12704963e-02  1.20678821e-02 -9.18699063e-03 ... -1.90550456e-02
   1.00000000e+00  4.03538399e-02]
 [ 3.21234686e-02  3.91732571e-02  1.48879883e-03 ...  6.66444773e-04
   4.03538399e-02  1.00000000e+00]]


#### Question 9
Run the following blocks of code to transform the user similarity matrix into a long format DataFrame.
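As a small illustration of the same reshaping steps, the sketch below applies them to a hypothetical 2x2 similarity matrix. The `toy_` names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 symmetric similarity matrix
toy_sim = np.array([[1.0, 0.3],
                    [0.3, 1.0]])

# unstack() turns the square matrix into one row per pair of positions
toy_long = pd.DataFrame(toy_sim).unstack().reset_index()
toy_long.columns = ['user1', 'user2', 'similarity']

# Shift 0-based matrix positions to 1-based user ids
toy_long[['user1', 'user2']] = toy_long[['user1', 'user2']] + 1

# Drop the diagonal, where a user is compared with themself
toy_long = toy_long[toy_long.user1 != toy_long.user2]
print(toy_long)
```

The `+ 1` shift only recovers the real ids when they run contiguously from 1, as they do for the 943 users in this data set.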

In [17]:
# Unstack the similarity matrix into a long format
user_sim_df = pd.DataFrame(user_similarity_matrix).unstack().reset_index()
user_sim_df.columns = ['user1', 'user2', 'similarity']

In [18]:
# Correct for the matrix being indexed from 0 while userIds start at 1
user_sim_df['user1'] = user_sim_df['user1'] + 1
user_sim_df['user2'] = user_sim_df['user2'] + 1

In [19]:
# Remove rows comparing a user to themself
user_sim_df = user_sim_df[user_sim_df.user1 != user_sim_df.user2]

In [20]:
# View the results
user_sim_df.head()

|   | user1 | user2 | similarity |
|---|-------|-------|------------|
| 1 | 1 | 2 | 0.043411 |
| 2 | 1 | 3 | 0.011051 |
| 3 | 1 | 4 | 0.059303 |
| 4 | 1 | 5 | 0.134514 |
| 5 | 1 | 6 | 0.103373 |


#### Question 10
From the `user_sim_df` created in the previous question, find the top 20 users who are most closely related to `userId=1` based on cosine similarity. Save the results in a DataFrame named `top20`.


This is filtering the $sim(user_1,\ user_u)$ for only the relevant neighbors in the equation above.


_Hints_:
* Filter `user_sim_df` to the rows where the `'user1'` column equals 1
* Use the `.sort_values()` method to sort the remaining values by the `similarity` column with `ascending=False`
* Take the top 20 rows with the `.head()` method

In [22]:
# Filter the user similarity data frame by where the 'user1' column = 1
top_similar_users = user_sim_df[user_sim_df['user1'] == 1]

# Sort the resulting data frame by similarity in descending order
top_similar_users = top_similar_users.sort_values('similarity', ascending=False)

# Take the top 20 rows
top20 = top_similar_users.head(20)

In [23]:
print("Top 20 users most closely related to userId=1:")
print(top20)


Top 20 users most closely related to userId=1:
     user1  user2  similarity
772      1    773    0.204792
867      1    868    0.202321
591      1    592    0.196592
879      1    880    0.195801
428      1    429    0.190661
275      1    276    0.187476
915      1    916    0.186358
221      1    222    0.182415
456      1    457    0.182253
7        1      8    0.180891
659      1    660    0.179206
343      1    344    0.176834
756      1    757    0.176318
362      1    363    0.175709
478      1    479    0.174467
302      1    303    0.173798
565      1    566    0.173653
549      1    550    0.173542
12       1     13    0.171911
660      1    661    0.171152


#### Question 11
Filter the `centered_by_user` DataFrame to only include the ratings from the top 20 neighbors. Save the resulting DataFrame into a variable named `regularized_ratings`.

This is filtering the $(r_{ui} - \overline{r_u})$ for only the relevant neighbors in the equation above.

Use the `.loc[]` attribute of the DataFrame. You can select the rows you want by passing an array or list of the userIds that you want in the resulting DataFrame.
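One detail worth seeing on a small example: `.loc[]` with a list of labels returns the rows in the order of that list. The frame and the neighbor ids below are hypothetical.

```python
import pandas as pd

# Hypothetical centered ratings for four users
toy_centered = pd.DataFrame(
    {'item_a': [0.5, -1.0, 0.0, 1.5], 'item_b': [0.0, 2.0, -0.5, 1.0]},
    index=pd.Index([1, 2, 3, 4], name='userId'),
)

# Hypothetical neighbor ids, most similar first
neighbor_ids = [4, 2]

# Rows come back in the order given by neighbor_ids, not in index order
toy_neighbors = toy_centered.loc[neighbor_ids]
print(toy_neighbors)
```

Keeping the rows in the same order as the similarity values matters for any later step that pairs the two by position rather than by label.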

In [24]:
# Extract the userIds of the top 20 most similar users to userId=1
top20_userIds = top20['user2'].values

# Filter the centered_by_user DataFrame to only include the ratings from the top 20 neighbors
regularized_ratings = centered_by_user.loc[top20_userIds]

print("Centered by user ratings of the top 20 neighbors:")
print(regularized_ratings)


Centered by user ratings of the top 20 neighbors:
itemId      1         2         3         4         5         6         7     \
userId                                                                         
773    -0.279503 -0.279503  0.000000  0.000000  0.000000 -0.279503 -1.279503   
868     1.048077 -0.951923  0.000000  0.000000  0.000000  0.000000  2.048077   
592     0.186111  0.000000  0.186111  0.186111  0.000000  0.000000  1.186111   
880     0.573370 -0.426630 -2.426630  0.573370 -0.426630  0.000000 -0.426630   
429    -0.393720 -0.393720 -1.393720  0.606280  0.000000  0.000000 -1.393720   
276     1.534749  0.534749 -0.465251  0.534749 -0.465251  0.000000  1.534749   
916     0.634069 -0.365931 -0.365931  0.634069 -0.365931  0.000000  0.634069   
222     0.950904 -0.049096  0.000000 -0.049096  0.000000  0.000000  1.950904   
457    -0.025271  0.000000  0.000000 -0.025271  0.000000  0.000000 -0.025271   
8       0.000000  0.000000  0.000000  0.000000  0.000000  0.000000 -0.

#### Question 12

Get the similarity-weighted sum of ratings for `userId=1` by multiplying each neighbor's centered movie ratings by that neighbor's similarity to user 1, then taking the sum of all the scaled ratings for each movie.

The result should be a Pandas Series with length equal to 1682, one value for each movie. Save the resulting Series into a variable named `numerator`. 

This is calculating a vector as the numerator for the above equation:

$\sum_{u \in neighbors} sim(user_1,\ user_u)\ \cdot\ (r_{ui} - \overline{r_u})$.

Use the [`np.matmul()`](https://numpy.org/doc/stable/reference/generated/numpy.matmul.html) function to matrix multiply the `top20` similarity values by the `regularized_ratings` DataFrame.
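A minimal sketch of that matrix product on hypothetical values, showing how a vector of similarities times a neighbors-by-items matrix yields one weighted sum per item:

```python
import numpy as np

# Hypothetical similarities for 2 neighbors, shape (2,)
sims = np.array([0.5, 0.25])

# Hypothetical centered ratings of 3 items by those 2 neighbors, shape (2, 3)
centered = np.array([[1.0, -1.0, 0.0],
                     [0.0,  2.0, 4.0]])

# (2,) @ (2, 3) -> (3,): one similarity-weighted sum per item
weighted_sums = np.matmul(sims, centered)
print(weighted_sums)
```

The product pairs the k-th similarity with the k-th row, so the similarity values and the rating rows need to be in the same neighbor order.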

In [25]:
# Calculate the numerator of the weighted average equation
numerator = np.matmul(top20['similarity'].values, regularized_ratings)

print("Weighted average numerator:")
print(numerator)


Weighted average numerator:
itemId
1       1.296710
2      -0.198754
3      -1.207261
4       1.594897
5      -1.316110
          ...   
1678    0.000000
1679    0.000000
1680    0.000000
1681    0.000000
1682   -0.068194
Length: 1682, dtype: float64


#### Question 13

Calculate the sum of all the similarity values for the `top20` users. Save the result into a variable named `denominator`.

This is calculating the denominator of the equation above.

$\sum_{u \in neighbors} sim(user_1,\ user_u)$
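The equation sums the raw similarities. A common variant sums their absolute values instead, which gives the same result whenever every neighbor's similarity is positive, as is the case for the top 20 neighbors found in Question 10. The hypothetical values below sketch how the two differ once a negative similarity is present.

```python
# Hypothetical similarities, including one negatively correlated neighbor
sims = [0.5, -0.5, 0.25]

# Plain sum: positive and negative weights cancel each other out
plain_sum = sum(sims)

# Absolute sum: every neighbor adds to the total weight
abs_sum = sum(abs(s) for s in sims)

print(plain_sum, abs_sum)
```

With mixed signs the plain sum can shrink toward zero and inflate the prediction, which is one reason to normalize by the absolute values.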

In [27]:
# Calculate the sum of all the similarity values for the top20 users
denominator = top20['similarity'].abs().sum()

#### Question 14

Calculate the projected ratings for every movie for $user_1$. Save the results into a Pandas Series named `ratings`. The index should be the `itemId` of each movie, and the values should be user1's projected rating of that movie.

First, divide the numerator by the denominator. This gives the projected difference from user1's average rating.

Then, add in user1's average rating from the `user_averages` Series calculated in Question 6.

In [28]:
# Calculate the projected ratings for every movie for user1
projected_diff = numerator / denominator
ratings = user_averages.loc[userId] + projected_diff
ratings.index.name = 'itemId'
ratings.name = 'predicted_rating'


#### Question 15 

Find the movies with the top 5 highest predicted ratings that `userId=1` *__has not yet rated__*.
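One possible way to do the filtering, sketched on hypothetical data: build a boolean mask of the items the user has not rated, apply it to the predictions, and take the largest values. The `toy_` names and numbers are made up for the example.

```python
import numpy as np
import pandas as pd

item_ids = [10, 20, 30, 40, 50]

# Hypothetical predicted ratings for five items
toy_predictions = pd.Series([4.8, 4.5, 4.1, 3.9, 3.2], index=item_ids)

# Hypothetical ratings the user has already given; NaN marks an unrated item
toy_user_row = pd.Series([5.0, np.nan, np.nan, 4.0, np.nan], index=item_ids)

# Keep only predictions for unrated items, then take the 2 highest
unrated_mask = toy_user_row.isna()
toy_top2 = toy_predictions[unrated_mask].nlargest(2)
print(toy_top2)
```

Item 10 has the highest prediction overall but is excluded because the user has already rated it.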

In [29]:
# Find the movies with the top 5 highest predicted ratings that userId=1 has not yet rated
rated_items = user_item.loc[1].dropna().index
unrated_items = user_item.columns.difference(rated_items)
top5_unrated_movies = ratings[ratings.index.isin(unrated_items)].nlargest(5)

In [30]:
print("Top 5 highest predicted ratings for user 1:")
print(top5_unrated_movies)

Top 5 highest predicted ratings for user 1:
itemId
318    4.385344
651    4.291911
408    4.276558
483    4.180320
357    4.143762
Name: predicted_rating, dtype: float64
