<img align="right" style="padding-left:10px; height: 20%; width: 20%" src="figures/projector-300x300.jpg" >

## Case Study: Movie Suggestion

### The Movies Dataset

Collaborative datasets for movies (and other products) can be large! Here is a [small (1 MB) subset of the MovieLens dataset](https://grouplens.org/datasets/movielens/latest/), downloaded and unzipped for your convenience.

The dataset consists of 9742 movies.

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame
movies_directory = '../../04-analysis-and-visualization/04-05-recommendations/ml-latest-small/'
movies = pd.read_csv(movies_directory+'movies.csv',header = 0) 

movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


### Movie Ratings

The 9742 movies above were rated by 610 users, which works out to about 165 movies rated per user on average. The ratings are available in the `ratings.csv` file, sampled in the DataFrame below.

In [2]:
ratings = pd.read_csv(movies_directory+'ratings.csv',header = 0) 
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [3]:
# Ratings Group By userId:
# print (type(ratings.groupby(["userId"])["userId"].count())) # prints <class 'pandas.core.series.Series'>

# Convert Series to DataFrame
#     Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reset_index.html#pandas.Series.reset_index

counts = ratings.groupby(["userId"])["userId"].count().reset_index(name="Count")
counts

Unnamed: 0,userId,Count
0,1,232
1,2,29
2,3,39
3,4,216
4,5,44
...,...,...
605,606,1115
606,607,187
607,608,831
608,609,37


<a href="https://en.wikipedia.org/wiki/Collaborative_filtering"><img align="right" style="padding-left:10px; height: 40%; width: 40%" src="https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif" ></a>

## General Approach

As discussed in [04-05-recommendations](../../04-analysis-and-visualization/04-05-recommendations/04-05-recommendations.ipynb), a generalized version of collaborative filtering, implied by the adjoining image, is a three-step process:

1. A user expresses their preferences by rating items (e.g. books, movies or CDs) of the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain. _The ratings have been collected by MovieLens and imported into the `ratings` DataFrame._
2. The system matches this user's ratings against other users' and finds the people with the most "similar" tastes. For the purpose of this case study, we shall determine the recommendations for the user with **userId = 607**.
3. The system recommends items that the similar users have rated highly but that this user has not yet rated (the absence of a rating is presumed to mean the user is unfamiliar with the item).
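
The three steps above can be sketched on a toy example. This is a generic illustration, not the method developed later in this notebook: the user names, items, ratings, the `mean_abs_diff` similarity measure, and the `max_diff` and `min_rating` cut-offs are all assumptions made up for the sketch.

```python
# Step 1: users express preferences as ratings (all values are made up).
toy_ratings = {
    "ann": {"A": 5.0, "B": 4.0},
    "bob": {"A": 5.0, "B": 4.5, "C": 5.0},
    "cat": {"A": 1.0, "B": 2.0, "D": 5.0},
}


def mean_abs_diff(r1, r2):
    """Mean absolute rating difference over co-rated items (inf if none)."""
    shared = r1.keys() & r2.keys()
    if not shared:
        return float("inf")
    return sum(abs(r1[i] - r2[i]) for i in shared) / len(shared)


def recommend(target, ratings, max_diff=1.0, min_rating=4.0):
    mine = ratings[target]
    # Step 2: find users whose ratings are close to the target user's.
    similar = [u for u in ratings
               if u != target and mean_abs_diff(mine, ratings[u]) <= max_diff]
    # Step 3: items those users rated highly that the target has not rated.
    picks = {item
             for u in similar
             for item, rating in ratings[u].items()
             if rating >= min_rating and item not in mine}
    return sorted(picks)


print(recommend("ann", toy_ratings))  # ['C']
```

Here "bob" agrees closely with "ann" on the items they both rated, so his highly rated item "C" is suggested, while "cat" disagrees too strongly to count as similar.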

<span style="color:blue">

### Solution Development
</span>

We proceed with the calculations as outlined above, but first create a **tiny dataset** so that we can develop the solution and _verify the calculations manually._

In step 2 of the algorithm, the system matches this user's ratings against other users' and finds the people with the most "similar" tastes. We shall use **userId = 5** for this tiny dataset.

Once the solution has been developed, we will write functions and classes to package the developed code and use it for the given dataset.

In [4]:
# Cell 4
# Initial Parameters
given_userId = 5
threshold_distance = 3.5

In [5]:
# Cell 5
# Data Location
tiny_directory = '../../04-analysis-and-visualization/04-05-recommendations/ml-latest-tiny/'
tiny_movies = pd.read_csv(tiny_directory+'movies.csv',header = 0) 
tiny_movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance


In [6]:
# Cell 6
tiny_ratings = pd.read_csv(tiny_directory+'ratings.csv',header = 0).drop(columns=['timestamp'])
tiny_ratings

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,5,1,4.0
3,6,2,4.0
4,6,3,5.0
5,6,4,3.0
6,7,1,4.5
7,8,2,4.0
8,15,1,2.5


### Libraries to use

We will use numpy and scipy for most of the calculations, and `pandas` mostly for pretty printing.

Variable names will be chosen such that:

1. Variables ending in `_np` will be used for numpy arrays.
2. Variables ending in `_2d` will be used for 2-D numpy arrays.
3. Variables ending in `_df` will be used for pandas DataFrames.
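
As a small illustration of the naming convention, the sketch below builds one object of each kind. The `demo_*` names and values are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# _df: a pandas DataFrame, used mainly for display
demo_df = pd.DataFrame({"userId": [1, 5], "rating": [4.0, 4.5]})

# _np: a numpy array (here 1-D, shape (2,))
demo_np = demo_df["rating"].to_numpy()

# _2d: a 2-D numpy array (one column, shape (2, 1)), the shape scipy's
# distance functions expect for a set of one-dimensional points
demo_2d = demo_np.reshape(-1, 1)

print(demo_np.shape, demo_2d.shape)  # (2,) (2, 1)
```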

In [7]:
# Cell 7
tiny_ratings_np = tiny_ratings.to_numpy(dtype=np.float32)
tiny_ratings_np

array([[ 1. ,  1. ,  4. ],
       [ 1. ,  3. ,  4. ],
       [ 5. ,  1. ,  4. ],
       [ 6. ,  2. ,  4. ],
       [ 6. ,  3. ,  5. ],
       [ 6. ,  4. ,  3. ],
       [ 7. ,  1. ,  4.5],
       [ 8. ,  2. ,  4. ],
       [15. ,  1. ,  2.5]], dtype=float32)

In [8]:
# Find the rating for our user
x = tiny_ratings_np
the_x_2d = x[np.where(x[:,0] == given_userId)][:, [2]]
the_x_2d = the_x_2d[0].reshape(1,1) # pick the first one in case we have multiple ratings records for user.
the_x = the_x_2d.reshape(the_x_2d.shape[0])
the_x

array([4.], dtype=float32)

In [9]:
# Find the ratings for all users
all_x_2d = tiny_ratings_np[:, [2]]
all_x = all_x_2d.reshape(all_x_2d.shape[0])
print (all_x)

all_u_2d = tiny_ratings_np[:, [0]]
all_u = all_u_2d.reshape(all_u_2d.shape[0])

all_m_2d = tiny_ratings_np[:, [1]]
all_m = all_m_2d.reshape(all_m_2d.shape[0])

all_u, all_m, all_x

[4.  4.  4.  4.  5.  3.  4.5 4.  2.5]


(array([ 1.,  1.,  5.,  6.,  6.,  6.,  7.,  8., 15.], dtype=float32),
 array([1., 3., 1., 2., 3., 4., 1., 2., 1.], dtype=float32),
 array([4. , 4. , 4. , 4. , 5. , 3. , 4.5, 4. , 2.5], dtype=float32))

In [10]:
from scipy.spatial.distance import cdist, euclidean
from scipy.spatial import distance_matrix
print (the_x, all_x)
dm = distance_matrix(all_x_2d, all_x_2d)

[4.] [4.  4.  4.  4.  5.  3.  4.5 4.  2.5]


In [11]:
# Manually verifying the euclidean calculation.
# The numbers produced by the previous cell and by this cell should match!
dists = cdist(the_x_2d, all_x_2d)
distances = list(dists.reshape(dists.shape[1]))
print (distances)

[0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.5, 0.0, 1.5]


In [12]:
# This cell will throw an exception if the results of euclidean calculation don't match
from math import sqrt
assert(np.isclose(euclidean(the_x, all_x), 
                  sqrt(sum([_*_ for _ in distances]))))
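
The check above relies on the Euclidean distance being the square root of the sum of squared differences. The following standalone sketch repeats that arithmetic with the tiny dataset's ratings typed in by hand; the names `target_2d`, `ratings_2d`, `pairwise` and `overall` are introduced here for illustration only.

```python
import numpy as np
from scipy.spatial.distance import cdist

target_2d = np.array([[4.0]])  # the rating given by userId = 5
ratings_2d = np.array([[4.0], [4.0], [4.0], [4.0], [5.0],
                       [3.0], [4.5], [4.0], [2.5]])

# For one-dimensional points, the Euclidean distance is just |4 - r|.
pairwise = cdist(target_2d, ratings_2d)[0]
by_hand = np.abs(ratings_2d[:, 0] - 4.0)
assert np.allclose(pairwise, by_hand)

# sqrt(0 + 0 + 0 + 0 + 1 + 1 + 0.25 + 0 + 2.25) = sqrt(4.5)
overall = np.sqrt(np.sum(pairwise ** 2))
print(round(float(overall), 5))  # 2.12132
```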

In [13]:
# Cell 13
dists = np.stack([all_u,
                  all_m,
                  np.apply_along_axis(lambda x: sqrt(sum([_*_ for _ in x])), 0, dm)])
dists_df = DataFrame(dists.transpose(), columns=['u', 'm', 'x']) \
           .astype({'u': 'int32', 'm': 'int32', 'x': 'float32'})
dists_df   #  [['u', 'x']]

Unnamed: 0,u,m,x
0,1,1,2.12132
1,1,3,2.12132
2,5,1,2.12132
3,6,2,2.12132
4,6,3,3.937004
5,6,4,3.391165
6,7,1,2.783882
7,8,2,2.12132
8,15,1,4.66369
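
The `x` column produced by Cell 13 is the root-sum-of-squares of each column of the distance matrix `dm`, that is, how far each rating lies from all the other ratings taken together. As a sketch on a made-up three-rating example, the same quantity can be obtained with `np.linalg.norm(..., axis=0)`; the variable names below are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import distance_matrix

ratings_2d = np.array([[4.0], [5.0], [3.0]])   # made-up ratings
dm = distance_matrix(ratings_2d, ratings_2d)   # dm[i, j] = |r_i - r_j|

# Column-wise Euclidean norm, vectorised
col_norms = np.linalg.norm(dm, axis=0)

# The same computation written in the style of Cell 13
loop_norms = np.apply_along_axis(lambda col: np.sqrt(np.sum(col * col)), 0, dm)

assert np.allclose(col_norms, loop_norms)
print(col_norms)
```

The vectorised form avoids calling a Python function once per column, which may matter on the full dataset.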


In [14]:
# Cell 14
dists_df = dists_df[dists_df['x'] < threshold_distance]
dists_df

Unnamed: 0,u,m,x
0,1,1,2.12132
1,1,3,2.12132
2,5,1,2.12132
3,6,2,2.12132
5,6,4,3.391165
6,7,1,2.783882
7,8,2,2.12132


In **step 3 of the algorithm**, the system recommends items that the similar users have rated highly but that this user has not yet rated (the absence of a rating is presumed to mean the user is unfamiliar with the item).
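
One way to sketch the "not yet rated by this user" part of this step is a set difference. The arrays below copy the `u` and `m` columns of the thresholded table from Cell 14 by hand; `np.setdiff1d` is a swapped-in alternative to the list comprehension used in Cell 15, and like that cell it does not filter on how highly the movies were rated.

```python
import numpy as np

# 'u' and 'm' columns that survive the threshold in Cell 14
similar_u = np.array([1, 1, 5, 6, 6, 7, 8])
similar_m = np.array([1, 3, 1, 2, 4, 1, 2])
given_userId = 5

# Movies our user has already rated
already_rated = similar_m[similar_u == given_userId]

# Sorted unique movies in similar_m that are not in already_rated
candidates = np.setdiff1d(similar_m, already_rated)
print(candidates)  # [2 3 4]
```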

In [15]:
# Cell 15
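# What movies has the user picked already? Don't recommend those!
candidates_df = dists_df[dists_df['u'] == given_userId]  # rows for our user
candidates_np = candidates_df['m']                        # movies our user has already rated
dists_np = dists_df['m']                                  # movies left after thresholding
rated_movies = set(candidates_np)
candidate_movies = [i for i in set(dists_np) if i not in rated_movies]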
print (candidate_movies)
# To refresh our memory of tiny_movies,
tiny_movies

[2, 3, 4]


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance


In [16]:
# Candidate Movies
# Boolean mask over tiny_movies: True where the movieId is a candidate
tiny_movies[np.isin(tiny_movies['movieId'], candidate_movies)]

Unnamed: 0,movieId,title,genres
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance


## TO-DO

Your assignment is to package the code from <span style="color:green"><em># Initial Parameters</em></span> to the <span style="color:green"><em># Candidate Movies</em></span> cells.

1. Create a function `recommend_movies(uid, threshold, movies_directory)` that takes a `userId`, a `threshold` and a `movies_directory` as parameters and produces recommendations for the user. Test the code first with userId = 607. Try various values of threshold until the user gets at least 6 movie recommendations.
2. Back in the <span style="color:green"><em># Find the rating for our user</em></span> cell (cell 8), we picked the first record for our user. Modify the code to begin instead with the movie our user liked the most!
3. Time your code for various values of `userId` and `threshold`. What accounts for the variation in timing?
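
For the timing task, a minimal harness might look like the sketch below. `time_call` and `placeholder_recommender` are hypothetical names; the placeholder only exists so that the sketch runs on its own and should be replaced with your own `recommend_movies`.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def placeholder_recommender(uid, threshold):
    # Stand-in workload so the sketch runs; swap in your recommend_movies.
    return sum(i * threshold for i in range(10_000 * (uid % 3 + 1)))


for uid, threshold in [(5, 3.5), (607, 3.5)]:
    elapsed = time_call(placeholder_recommender, uid, threshold)
    print(uid, threshold, f"{elapsed:.6f} s")
```

Taking the best of several runs reduces noise from other processes on the machine; inside Jupyter, the `%timeit` magic serves a similar purpose.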