# Movie Recommendations HW

**Name:** Thanapoom Phatthanaphan

**ID:** 20011296

**Collaboration Policy:** Homeworks will be done individually: each student must hand in their own answers. Use of partial or entire solutions obtained from others or online is strictly prohibited.

**Late Policy:** Late submission have a penalty of 2\% for each passing hour. 

**Submission format:** Successfully complete the Movie Lens recommender as described in this jupyter notebook. Submit a `.py` and an `.ipynb` file for this notebook. You can go to `File -> Download as ->` to download a .py version of the notebook. 

**Only submit one `.ipynb` file and one `.py` file.** The `.ipynb` file should have answers to all the questions. Do *not* zip any files for submission. 

**Download the dataset from here:** https://grouplens.org/datasets/movielens/1m/

In [1]:
# Import all the required libraries
import numpy as np
import pandas as pd

## Reading the Data
Now that we have downloaded the files from the link above and placed them in the same directory as this Jupyter Notebook, we can load each of the tables of data as a CSV into Pandas. Execute the following, provided code.

In [2]:
# Read the dataset from the two files into ratings_data and movies_data
#NOTE: if you are getting a decode error, add "encoding='ISO-8859-1'" as an additional argument
#      to the read_csv function
column_list_ratings = ["UserID", "MovieID", "Ratings","Timestamp"]
ratings_data  = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings, engine='python')
column_list_movies = ["MovieID","Title","Genres"]
movies_data = pd.read_csv('movies.dat',sep = '::',names = column_list_movies, engine='python', encoding = 'latin-1')
column_list_users = ["UserID","Gender","Age","Occupation","Zixp-code"]
user_data = pd.read_csv("users.dat",sep = "::",names = column_list_users, engine='python')

`ratings_data`, `movies_data`, `user_data` corresponds to the data loaded from `ratings.dat`, `movies.dat`, and `users.dat` in Pandas.

## Data analysis

We now have all our data in Pandas - however, it's as three separate datasets! To make some more sense out of the data we have, we can use the Pandas `merge` function to combine our component data-frames. Run the following code:

In [3]:
data=pd.merge(pd.merge(ratings_data,user_data),movies_data)
data

Unnamed: 0,UserID,MovieID,Ratings,Timestamp,Gender,Age,Occupation,Zixp-code,Title,Genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...,...,...,...,...
1000204,5949,2198,5,958846401,M,18,17,47901,Modulations (1998),Documentary
1000205,5675,2703,3,976029116,M,35,14,30030,Broken Vessels (1998),Drama
1000206,5780,2845,1,958153068,M,18,17,92886,White Boys (1999),Drama
1000207,5851,3607,5,957756608,F,18,20,55410,One Little Indian (1973),Comedy|Drama|Western


Next, we can create a pivot table to match the ratings with a given movie title. Using `data.pivot_table`, we can aggregate (using the average/`mean` function) the reviews and find the average rating for each movie. We can save this pivot table into the `mean_ratings` variable. 

In [4]:
mean_ratings=data.pivot_table('Ratings','Title',aggfunc='mean')
mean_ratings

Unnamed: 0_level_0,Ratings
Title,Unnamed: 1_level_1
"$1,000,000 Duck (1971)",3.027027
'Night Mother (1986),3.371429
'Til There Was You (1997),2.692308
"'burbs, The (1989)",2.910891
...And Justice for All (1979),3.713568
...,...
"Zed & Two Noughts, A (1985)",3.413793
Zero Effect (1998),3.750831
Zero Kelvin (Kjærlighetens kjøtere) (1995),3.500000
Zeus and Roxanne (1997),2.521739


Now, we can take the `mean_ratings` and sort it by the value of the rating itself. Using this and the `head` function, we can display the top 15 movies by average rating.

In [5]:
mean_ratings=data.pivot_table('Ratings',index=["Title"],aggfunc='mean')
top_15_mean_ratings = mean_ratings.sort_values(by = 'Ratings',ascending = False).head(15)
top_15_mean_ratings

Unnamed: 0_level_0,Ratings
Title,Unnamed: 1_level_1
Ulysses (Ulisse) (1954),5.0
Lured (1947),5.0
Follow the Bitch (1998),5.0
Bittersweet Motel (2000),5.0
Song of Freedom (1936),5.0
One Little Indian (1973),5.0
Smashing Time (1967),5.0
Schlafes Bruder (Brother of Sleep) (1995),5.0
"Gate of Heavenly Peace, The (1995)",5.0
"Baby, The (1973)",5.0


Let's adjust our original `mean_ratings` function to account for the differences in gender between reviews. This will be similar to the same code as before, except now we will provide an additional `columns` parameter which will separate the average ratings for men and women, respectively.

In [6]:
mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean')
mean_ratings

Gender,F,M
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
"$1,000,000 Duck (1971)",3.375000,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
...,...,...
"Zed & Two Noughts, A (1985)",3.500000,3.380952
Zero Effect (1998),3.864407,3.723140
Zero Kelvin (Kjærlighetens kjøtere) (1995),,3.500000
Zeus and Roxanne (1997),2.777778,2.357143


We can now sort the ratings as before, but instead of by `Rating`, but by the `F` and `M` gendered rating columns. Print the top rated movies by male and female reviews, respectively.

In [7]:
data=pd.merge(pd.merge(ratings_data,user_data),movies_data)

mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean')
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
print(top_female_ratings.head(15))

top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)
print(top_male_ratings.head(15))

Gender                                               F         M
Title                                                           
Clean Slate (Coup de Torchon) (1981)               5.0  3.857143
Ballad of Narayama, The (Narayama Bushiko) (1958)  5.0  3.428571
Raw Deal (1948)                                    5.0  3.307692
Bittersweet Motel (2000)                           5.0       NaN
Skipped Parts (2000)                               5.0  4.000000
Lamerica (1994)                                    5.0  4.666667
Gambler, The (A Játékos) (1997)                    5.0  3.166667
Brother, Can You Spare a Dime? (1975)              5.0  3.642857
Ayn Rand: A Sense of Life (1997)                   5.0  4.000000
24 7: Twenty Four Seven (1997)                     5.0  3.750000
Twice Upon a Yesterday (1998)                      5.0  3.222222
Woman of Paris, A (1923)                           5.0  2.428571
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                5.0  4.750000
Gate of Heavenly Peace, T

In [8]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

Gender,F,M,diff
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"James Dean Story, The (1957)",4.0,1.0,-3.0
Country Life (1994),5.0,2.0,-3.0
"Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)",4.0,1.0,-3.0
Babyfever (1994),3.666667,1.0,-2.666667
"Woman of Paris, A (1923)",5.0,2.428571,-2.571429
Cobra (1925),4.0,1.5,-2.5
"Other Side of Sunday, The (Søndagsengler) (1996)",5.0,2.928571,-2.071429
"To Have, or Not (1995)",4.0,2.0,-2.0
For the Moment (1994),5.0,3.0,-2.0
Phat Beach (1996),3.0,1.0,-2.0


Let's try grouping the data-frame, instead, to see how different titles compare in terms of the number of ratings. Group by `Title` and then take the top 10 items by number of reviews. We can see here the most popularly-reviewed titles.

In [9]:
ratings_by_title=data.groupby('Title').size()
ratings_by_title.sort_values(ascending=False).head(10)

Title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
Saving Private Ryan (1998)                               2653
Terminator 2: Judgment Day (1991)                        2649
Matrix, The (1999)                                       2590
Back to the Future (1985)                                2583
Silence of the Lambs, The (1991)                         2578
dtype: int64

Similarly, we can filter our grouped data-frame to get all titles with a certain number of reviews. Filter the dataset to get all movie titles such that the number of reviews is >= 2500.

## Question 1

Create a ratings matrix using Numpy. This matrix allows us to see the ratings for a given movie and user ID. The element at location $[i,j]$ is a rating given by user $i$ for movie $j$. Print the **shape** of the matrix produced.  

Additionally, choose 3 users that have rated the movie with MovieID "**1377**" (Batman Returns). Print these ratings, they will be used later for comparison.


**Notes:**
- Do *not* use `pivot_table`.
- A ratings matrix is *not* the same as `ratings_data` from above.
- The ratings of movie with MovieID $i$ are stored in the ($i$-1)th column (index starts from 0)  
- Not every user has rated every movie. Missing entries should be set to 0 for now.
- If you're stuck, you might want to look into `np.zeros` and how to use it to create a matrix of the desired shape.
- Every review lies between 1 and 5, and thus fits within a `uint8` datatype, which you can specify to numpy.

In [10]:
# Create the desired matrix which
# the number of rows is the number of users
# and the number of columns is the number of movies
ratings = np.zeros((max(user_data['UserID']), max(movies_data['MovieID'])), dtype=np.uint8)

# Add movie rating of each user in the matrix
for i in data.itertuples():
    ratings[i.UserID - 1, i.MovieID - 1] = i.Ratings

print(ratings)

[[5 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [3 0 0 ... 0 0 0]]


In [11]:
# Print the shape
print(ratings.shape)

(6040, 3952)


In [12]:
# Store and print ratings for Batman Returns
ratings_batman = np.array([ratings[9, 1376],ratings[12, 1376], ratings[17, 1376] ])
print(ratings_batman)

[3 3 2]


## Question 2

Normalize the ratings matrix (created in **Question 1**) using Z-score normalization. While we can't use `sklearn`'s `StandardScaler` for this step, we can do the statistical calculations ourselves to normalize the data.

Before you start:
- Your first step should be to get the average of every *column* of the ratings matrix (we want an average by title, not by user!).
- Make sure that the mean is calculated considering only non-zero elements. If there is a movie which is rated only by 10 users, we get its mean rating using (sum of the 10 ratings)/10 and **NOT** (sum of 10 ratings)/(total number of users)
- All of the missing values in the dataset should be replaced with the average rating for the given movie. This is a complex topic, but for our case replacing empty values with the mean will make it so that the absence of a rating doesn't affect the overall average, and it provides an "expected value" which is useful for computing correlations and recommendations in later steps.
- In our matrix, 0 represents a missing rating.
- Next, we want to subtract the average from the original ratings thus allowing us to get a mean of 0 in every *column*. It may be very close but not exactly zero because of the limited precision `float`s allow.
- Lastly, divide this by the standard deviation of the *column*.

- Not every MovieID is used, leading to zero columns. This will cause a divide by zero error when normalizing the matrix. Simply replace any NaN values in your normalized matrix with 0.

In [13]:
# Find average of every column of the ratings matrix
column_mean = []
for i in range(ratings.shape[1]):
    mean = np.mean(ratings[:, i][ratings[:, i] != 0])
    column_mean.append(mean)

# Replace missing values (0 value) with mean in each column of the matrix
normalized_matrix = np.array(ratings, dtype=float)
normalized_matrix = np.where(normalized_matrix == 0, column_mean, normalized_matrix)

# Subtract the average from the original ratings
normalized_matrix -= column_mean

# Divide by the standard deviation of the column
ratings = ratings.astype('float64')
ratings[ratings == 0] = np.nan
column_std = np.nanstd(ratings, axis=0)
normalized_matrix /= column_std

# Replace NaN value as 0 value
normalized_matrix[np.isnan(normalized_matrix)] = 0
print(normalized_matrix)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


[[ 1.00118491  0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [-1.3458366   0.          0.         ...  0.          0.
   0.        ]]


  var = nanvar(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  normalized_matrix /= column_std


In [21]:
column_mean

[4.146846413095811,
 3.20114122681883,
 3.01673640167364,
 2.7294117647058824,
 3.0067567567567566,
 3.8787234042553194,
 3.410480349344978,
 3.014705882352941,
 2.656862745098039,
 3.5405405405405403,
 3.7938044530493706,
 2.3625,
 3.2626262626262625,
 3.542483660130719,
 2.458904109589041,
 3.7932551319648096,
 4.027544910179641,
 3.337579617834395,
 2.480719794344473,
 2.5375,
 3.6238938053097347,
 3.3492063492063493,
 2.857142857142857,
 3.1794871794871793,
 3.6510204081632653,
 3.53,
 2.9344262295081966,
 4.055865921787709,
 4.062034739454094,
 3.6486486486486487,
 3.1134751773049647,
 3.945731303772336,
 3.0,
 3.8914905768132497,
 3.3142857142857145,
 3.9579741379310347,
 3.5,
 2.8214285714285716,
 3.6233480176211454,
 3.933333333333333,
 3.958677685950413,
 2.8687782805429864,
 3.4457831325301207,
 2.787781350482315,
 3.4246323529411766,
 3.108433734939759,
 4.106420404573439,
 2.9764397905759163,
 3.740740740740741,
 4.517106001121705,
 nan,
 3.640371229698376,
 4.75,
 2.560975

In [20]:
column_std

array([0.85214387, 0.98247053, 1.07059021, ..., 1.03637545, 1.04403065,
       0.93386856])

## Question 3

We're now going to perform Singular Value Decomposition (SVD) on the normalized ratings matrix from the previous question. Perform the process using numpy, and along the way print the shapes of the $U$, $S$, and $V$ matrices you calculated.

In [14]:
# Compute the SVD of the normalised matrix
U, s, VT = np.linalg.svd(normalized_matrix)
S = np.zeros((normalized_matrix.shape[0], normalized_matrix.shape[1]))
S[:normalized_matrix.shape[1], :normalized_matrix.shape[1]] = np.diag(s)
svd = U@S@VT
print(svd)

[[ 1.00118491e+00 -8.55136003e-15  2.27560483e-15 ... -3.28648784e-17
   4.37258736e-16 -8.98173408e-16]
 [ 2.18141477e-15 -7.79099550e-14 -2.48745116e-14 ... -1.34168664e-15
  -1.92445886e-16 -7.08905590e-16]
 [-2.60580962e-15 -1.57805626e-15 -8.45938987e-14 ... -1.56591320e-15
   1.22270900e-16 -1.63570871e-15]
 ...
 [ 2.06825117e-16  1.37632690e-15 -4.85641258e-16 ...  6.61498850e-17
  -5.12285526e-17 -6.41929001e-16]
 [ 9.20894220e-16 -7.85053852e-16  1.18504653e-15 ...  8.25619954e-16
   4.59837246e-16  1.51883172e-16]
 [-1.34583660e+00 -1.35855276e-15  9.51631352e-16 ... -1.37840043e-15
   2.73381578e-16  1.82276069e-15]]


In [15]:
# Print the shapes
print(U.shape)
print(S.shape)
print(VT.shape)

(6040, 6040)
(6040, 3952)
(3952, 3952)


## Question 4

Reconstruct four rank-k rating matrix $R_k$, where $R_k = U_kS_kV_k^T$ for k = [100, 1000, 2000, 3000]. Using each of $R_k$ make predictions for the 3 users selected in Question 1, for the movie with ID 1377 (Batman Returns). Compare the original ratings with the predicted ratings.

In [16]:
# Reconstruct four rank-k rating matrix
for k in [100, 1000, 2000, 3000]:
    r_k = U[:, :k]@S[:k, :k]@VT[:k, :]
    
    # Rescale the reconstructed data matrix back to the original scale
    r_k *= column_std
    r_k += column_mean
    
    # Compare predicted ratings with the original ratings    
    ratings_batman_predict = r_k[[9, 12, 17], 1376]
    print(f"\nPredictions for the 3 users rated Batman Returns for k = {k}")
    print(ratings_batman_predict)
    print(ratings_batman)


Predictions for the 3 users rated Batman Returns for k = 100
[3.52484139 3.01403391 2.53928225]
[3 3 2]

Predictions for the 3 users rated Batman Returns for k = 1000
[2.93981501 3.03086574 2.06777061]
[3 3 2]

Predictions for the 3 users rated Batman Returns for k = 2000
[3.01116523 3.00914311 1.99047967]
[3 3 2]

Predictions for the 3 users rated Batman Returns for k = 3000
[2.99932658 2.99925765 1.99920346]
[3 3 2]


## Question 5

### Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is high if the angle between two vectors is 0, and the output value ranges within $cosine(x,y) \in [0,1]$. $0$ means there is no similarity (perpendicular), where $1$ (parallel) means that both the items are 100% similar.

$$ cosine(x,y) = \frac{x^T y}{||x|| ||y||}  $$

**Based on the reconstruction rank-1000 rating matrix $R_{1000}$ and the cosine similarity,** sort the movies which are most similar. You will have a function `top_movie_similarity` which sorts data by its similarity to a movie with ID `movie_id` and returns the top $n$ items, and a second function `print_similar_movies` which prints the titles of said similar movies. Return the top 5 movies for the movie with ID `1377` (*Batman Returns*)

Note: While finding the cosine similarity, there are a few empty columns which will have a magnitude of **zero** resulting in NaN values. These should be replaced by 0, otherwise these columns will show most similarity with the given movie. 

In [17]:
# Sort the movies based on cosine similarity
def top_movie_similarity(data, movie_id, top_n=5):
    # Movie id starts from 1
    # Use the calculation formula above to compute Cosine Similarity
    x = data
    y = data[:, movie_id - 1]
    magnitude_x = np.linalg.norm(x)
    magnitude_y = np.linalg.norm(y)
    dot_product = np.dot(x.T, y)
    cosine_similarity = dot_product/(magnitude_x * magnitude_y)
    
    # descending sort
    top_indices_similar_movies = np.argsort(-cosine_similarity)[0:top_n + 1]
    
    return top_indices_similar_movies


# Print the top 5 movies for Batman Returns
def print_similar_movies(movie_titles, top_indices_similar_movies):
    print('Most Similar movies: ')
    for i, movie in enumerate(top_indices_similar_movies):
        print(f"{i+1}. {movie_titles[movie_titles['MovieID'] == movie + 1]['Title'].values[0]}")

        
movie_id = 1377
r_1000 = U[:, :1000]@S[:1000, :1000]@VT[:1000, :]

# Compute Cosine Similarity based on rank-1000 rating matrix
top_indices_similar_movies = top_movie_similarity(r_1000, movie_id)
print_similar_movies(movies_data, top_indices_similar_movies)

Most Similar movies: 
1. Batman Returns (1992)
2. Batman (1989)
3. Batman Forever (1995)
4. Men in Black (1997)
5. Back to the Future Part II (1989)
6. True Lies (1994)


## Question 6

### Movie Recommendations
Using the same process from Question 5, write `top_user_similarity` which sorts data by its similarity to a user with ID `user_id` and returns the top result. Then find the MovieIDs of the movies that this similar user has rated most highly, but that `user_id` has not yet seen. Find at least 5 movie recommendations for the user with ID `5954` and print their titles.

Hint: To check your results, find the genres of the movies that the user likes and compare with the genres of the recommended movies.

In [18]:
#Sort users based on cosine similarity
def top_user_similarity(data, user_id):
    # Use the calculation formula above to compute Cosine Similarity
    x = data
    y = data[user_id - 1, :]
    magnitude_x = np.linalg.norm(x)
    magnitude_y = np.linalg.norm(y)
    dot_product = np.dot(x, y.T)
    cosine_similarity = dot_product/(magnitude_x * magnitude_y)
    
    # descending sort
    top_indices_similar_user = np.argsort(-cosine_similarity)[1]
    return top_indices_similar_user


user_id = 5954

# Create array of movies that user has never seen
movie_user_seen = data[data['UserID'] == user_id]['MovieID'].values
movie_user_not_seen = movies_data[~movies_data['MovieID'].isin(movie_user_seen)]['MovieID'].values

# Compute Cosine Similarity based on rank-1000 rating matrix
top_indices_similar_user = top_user_similarity(r_1000, user_id)

# Find top rated movies of similar user
top_rated_movie = data[(data['UserID'] == top_indices_similar_user) & data['MovieID'].isin(movie_user_not_seen)][['Title', 'Ratings']].values
top_rated_movie = top_rated_movie[np.argsort(-top_rated_movie[:, 1])]

# Print movie recommendations
print('Movie recommendations: ')
for i, movie in enumerate(top_rated_movie[0:10]):
    print(f"{i+1}. {movie[0]}")

Movie recommendations: 
1. And the Band Played On (1993)
2. Far and Away (1992)
3. Cop Land (1997)
4. High Fidelity (2000)
5. Leaving Las Vegas (1995)
6. Nell (1994)
7. Man Without a Face, The (1993)
8. Magnolia (1999)
9. Swingers (1996)
10. Taxi Driver (1976)
