## Recommendations with MovieTweetings: Collaborative Filtering

One of the most popular methods for making recommendations is **collaborative filtering**.  In collaborative filtering, you are using the collaboration of user-item recommendations to assist in making new recommendations.  

There are two main methods of performing collaborative filtering:

1. **Neighborhood-Based Collaborative Filtering**, which is based on the idea that we can either correlate items that are similar to provide recommendations or we can correlate users to one another to provide recommendations.

2. **Model Based Collaborative Filtering**, which is based on the idea that we can use machine learning and other mathematical models to understand the relationships that exist amongst items and users to predict ratings and provide ratings.


In this notebook, you will be working on performing **neighborhood-based collaborative filtering**.  There are two main methods for performing collaborative filtering:

1. **User-based collaborative filtering:** In this type of recommendation, users related to the user you would like to make recommendations for are used to create a recommendation.

2. **Item-based collaborative filtering:** In this type of recommendation, first you need to find the items that are most related to each other item (based on similar ratings).  Then you can use the ratings of an individual on those similar items to understand if a user will like the new item.

In this notebook you will be implementing **user-based collaborative filtering**.  However, it is easy to extend this approach to make recommendations using **item-based collaborative filtering**.  First, let's read in our data and necessary libraries.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t
import progressbar # You need to go to the terminal and pip install this!
from scipy.sparse import csr_matrix
from IPython.display import HTML


%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

print(reviews.head())

   user_id  movie_id  rating   timestamp                 date  month_1  \
0        1     68646      10  1381620027  2013-10-12 23:20:27        0   
1        1    113277      10  1379466669  2013-09-18 01:11:09        0   
2        2    422720       8  1412178746  2014-10-01 15:52:26        0   
3        2    454876       8  1394818630  2014-03-14 17:37:10        0   
4        2    790636       7  1389963947  2014-01-17 13:05:47        0   

   month_2  month_3  month_4  month_5    ...      month_9  month_10  month_11  \
0        0        0        0        0    ...            0         1         0   
1        0        0        0        0    ...            0         0         0   
2        0        0        0        0    ...            0         1         0   
3        0        0        0        0    ...            0         0         0   
4        0        0        0        0    ...            0         0         0   

   month_12  year_2013  year_2014  year_2015  year_2016  year_2017  

In [43]:
#movies = pd.read_csv('movies_clean.csv')

### Measures of Similarity

When using **neighborhood** based collaborative filtering, it is important to understand how to measure the similarity of users or items to one another.  

There are a number of ways in which we might measure the similarity between two vectors (which might be two users or two items).  In this notebook, we will look specifically at two measures used to compare vectors:

* **Pearson's correlation coefficient**

Pearson's correlation coefficient is a measure of the strength and direction of a linear relationship. The value for this coefficient is a value between -1 and 1 where -1 indicates a strong, negative linear relationship and 1 indicates a strong, positive linear relationship. 

If we have two vectors **x** and **y**, we can define the correlation between the vectors as:


$$CORR(x, y) = \frac{\text{COV}(x, y)}{\text{STDEV}(x)\text{ }\text{STDEV}(y)}$$

where 

$$\text{STDEV}(x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

and 

$$\text{COV}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

where n is the length of the vector, which must be the same for both x and y and $\bar{x}$ is the mean of the observations in the vector.  

We can use the correlation coefficient to indicate how alike two vectors are to one another, where the closer to 1 the coefficient, the more alike the vectors are to one another.  There are some potential downsides to using this metric as a measure of similarity.  You will see some of these throughout this workbook.


* **Euclidean distance**

Euclidean distance is a measure of the straightline distance from one vector to another.  Because this is a measure of distance, larger values are an indication that two vectors are different from one another (which is different than Pearson's correlation coefficient).

Specifically, the euclidean distance between two vectors **x** and **y** is measured as:

$$ \text{EUCL}(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

Different from the correlation coefficient, no scaling is performed in the denominator.  Therefore, you need to make sure all of your data are on the same scale when using this metric.

**Note:** Because measuring similarity is often based on looking at the distance between vectors, it is important in these cases to scale your data or to have all data be in the same scale.  If some measures are on a 5 point scale, while others are on a 100 point scale, you are likely to have non-optimal results due to the difference in variability of your features.  In this case, we will not need to scale data because they are all on a 10 point scale, but it is always something to keep in mind!

------------

### User-Item Matrix

In order to calculate the similarities, it is common to put values in a matrix.  In this matrix, users are identified by each row, and items are represented by columns.  


![alt text](images/userxitem.png "User Item Matrix")


In the above matrix, you can see that **User 1** and **User 2** both used **Item 1**, and **User 2**, **User 3**, and **User 4** all used **Item 2**.  However, there are also a large number of missing values in the matrix for users who haven't used a particular item.  A matrix with many missing values (like the one above) is considered **sparse**.

Our first goal for this notebook is to create the above matrix with the **reviews** dataset.  However, instead of 1 values in each cell, you should have the actual rating.  

The users will indicate the rows, and the movies will exist across the columns. To create the user-item matrix, we only need the first three columns of the **reviews** dataframe, which you can see by running the cell below.

In [4]:
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_items.head()

Unnamed: 0,user_id,movie_id,rating
0,1,68646,10
1,1,113277,10
2,2,422720,8
3,2,454876,8
4,2,790636,7


### Creating the User-Item Matrix

In order to create the user-items matrix (like the one above), I personally started by using a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html). 

However, I quickly ran into a memory error (a common theme throughout this notebook).  I will help you navigate around many of the errors I had, and achieve useful collaborative filtering results! 

_____

`1.` Create a matrix where the users are the rows, the movies are the columns, and the ratings exist in each cell, or a NaN exists in cells where a user hasn't rated a particular movie. If you get a memory error (like I did), [this link here](https://stackoverflow.com/questions/39648991/pandas-dataframe-pivot-memory-error) might help you!

In [5]:
# Create user-by-item matrix
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

user_by_movie.head(10)

movie_id,8,10,12,25,91,417,439,443,628,833,...,8144778,8144868,8206708,8289196,8324578,8335880,8342748,8342946,8402090,8439854
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
10,,,,,,,,,,,...,,,,,,,,,,


Check your results below to make sure your matrix is ready for the upcoming sections.

In [6]:
assert movies.shape[0] == user_by_movie.shape[1], "Oh no! Your matrix should have {} columns, and yours has {}!".format(movies.shape[0], user_by_movie.shape[1])
assert reviews.user_id.nunique() == user_by_movie.shape[0], "Oh no! Your matrix should have {} rows, and yours has {}!".format(reviews.user_id.nunique(), user_by_movie.shape[0])
print("Looks like you are all set! Proceed!")
HTML('<img src="images/greatjob.webp">')

Looks like you are all set! Proceed!


`2.` Now that you have a matrix of users by movies, use this matrix to create a dictionary where the key is each user and the value is an array of the movies each user has rated.

In [13]:
movies_test = user_by_movie.loc[1]
movies_test.head()

movie_id
8    NaN
10   NaN
12   NaN
25   NaN
91   NaN
Name: 1, dtype: float64

In [8]:
movies_test = movies[movies.isnull() == False].index.values
movies_test

array([ 68646, 113277])

In [14]:
# Create a dictionary with users and corresponding movies seen

def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    # get list of all movies for the user
    movies = user_by_movie.loc[user_id]
    
    # remove movies with nan values
    movies = movies[movies.isnull() == False].index.values
    
    return movies


def create_user_movie_dict():
    '''
    INPUT: None
    OUTPUT: movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    Creates the movies_seen dictionary
    '''
    
    # Do things - hint this may take some time, so you might want to set up a progress bar to watch things progress
    num_users =  user_by_movie.shape[0]
    movies_seen = dict()
    
    # set up progress bar
    counter = 0
    bar = progressbar.ProgressBar(maxval=num_users+1, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    
    for i in range(1, num_users+1):
        
        # update progress bar
        counter += 1
        bar.update(counter)
        
        movies_seen[i] = movies_watched(i)
    
    bar.finish()
    
    return movies_seen


# Use your function to return dictionary
movies_seen = create_user_movie_dict()



In [15]:
movies_seen

{1: array([ 68646, 113277]),
 2: array([ 422720,  454876,  790636,  816711, 1091191, 1103275, 1322269,
        1390411, 1398426, 1431045, 1433811, 1454468, 1535109, 1675434,
        1798709, 2017038, 2024544, 2294629, 2361509, 2381249, 2726560,
        2883512, 3079380]),
 3: array([1790864, 2170439, 2203939]),
 4: array([1300854]),
 5: array([ 54953, 120863]),
 6: array([2103281]),
 7: array([1764234, 1790885, 2053463]),
 8: array([ 385002, 1220198, 1462900, 1512685, 1631707, 1986994, 1999995]),
 9: array([ 65207, 363163, 985699]),
 10: array([1253863]),
 11: array([3294634]),
 12: array([1255953]),
 13: array([2649554]),
 14: array([452623]),
 15: array([1980929]),
 16: array([800241]),
 17: array([68646, 71562, 99674]),
 18: array([2258858]),
 19: array([390384]),
 20: array([1549589, 1605783]),
 21: array([4094724]),
 22: array([ 359950,  770828,  831387, 1091191, 1170358, 1210819, 1430132,
        1490017, 1535108, 1663662, 1843866, 1872181, 2103281, 2179136,
        2209764, 2267

`3.` If a user hasn't rated more than 2 movies, we consider these users "too new".  Create a new dictionary that only contains users who have rated more than 2 movies.  This dictionary will be used for all the final steps of this workbook.

In [16]:
# Remove individuals who have watched 2 or fewer movies - don't have enough data to make recs

def create_movies_to_analyze(movies_seen, lower_bound=2):
    '''
    INPUT:  
    movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    lower_bound - (an int) a user must have more movies seen than the lower bound to be added to the movies_to_analyze dictionary

    OUTPUT: 
    movies_to_analyze - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    The movies_seen and movies_to_analyze dictionaries should be the same except that the output dictionary has removed 
    
    '''
    
    # Do things to create updated dictionary
    movies_to_analyze = dict()
    
    for user_key, movies_values in movies_seen.items():
        if len(movies_values) > lower_bound:
            movies_to_analyze[user_key] = movies_values
    
    return movies_to_analyze


# Use your function to return your updated dictionary
movies_to_analyze = create_movies_to_analyze(movies_seen)

In [17]:
# Run the tests below to check that your movies_to_analyze matches the solution
assert len(movies_to_analyze) == 23512, "Oops!  It doesn't look like your dictionary has the right number of individuals."
assert len(movies_to_analyze[2]) == 23, "Oops!  User 2 didn't match the number of movies we thought they would have."
assert len(movies_to_analyze[7])  == 3, "Oops!  User 7 didn't match the number of movies we thought they would have."
print("If this is all you see, you are good to go!")

If this is all you see, you are good to go!


In [18]:
movies_to_analyze

{2: array([ 422720,  454876,  790636,  816711, 1091191, 1103275, 1322269,
        1390411, 1398426, 1431045, 1433811, 1454468, 1535109, 1675434,
        1798709, 2017038, 2024544, 2294629, 2361509, 2381249, 2726560,
        2883512, 3079380]),
 3: array([1790864, 2170439, 2203939]),
 7: array([1764234, 1790885, 2053463]),
 8: array([ 385002, 1220198, 1462900, 1512685, 1631707, 1986994, 1999995]),
 9: array([ 65207, 363163, 985699]),
 17: array([68646, 71562, 99674]),
 22: array([ 359950,  770828,  831387, 1091191, 1170358, 1210819, 1430132,
        1490017, 1535108, 1663662, 1843866, 1872181, 2103281, 2179136,
        2209764, 2267998, 2333784, 2334879, 2713180, 2872732]),
 24: array([ 816692, 2245084, 2267998, 2488496, 2802144, 2820852, 3300542]),
 25: array([  88847,  327056,  482571,  790724,  795421, 1130884, 1231583,
        1232829, 1647668, 2119532, 2294449, 2404463, 3230660, 3774114]),
 26: array([3954660, 5222918, 7291268]),
 30: array([ 111161,  454921, 1454468, 2024469, 2231

### Calculating User Similarities

Now that you have set up the **movies_to_analyze** dictionary, it is time to take a closer look at the similarities between users.  Below is the pseudocode for how I thought about determining the similarity between users:

```
for user1 in movies_to_analyze
    for user2 in movies_to_analyze
        see how many movies match between the two users
        if more than two movies in common
            pull the overlapping movies
            compute the distance/similarity metric between ratings on the same movies for the two users
            store the users and the distance metric
```

However, this took a very long time to run, and other methods of performing these operations did not fit on the workspace memory!

Therefore, your task for this question is to look at a few specific examples of the correlation between ratings given by two users.  For this question consider you want to compute the [correlation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) between users.

`4.` Using the **movies_to_analyze** dictionary and **user_by_movie** dataframe, create a function that computes the correlation between the ratings of similar movies for two users.  Then use your function to compare your results to ours using the tests below.  

In [None]:
def compute_correlation(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the correlation between the matching ratings between the two users
    '''
    movies_user1 = movies_to_analyze[user1]
    movies_user2 = movies_to_analyze[user2]
    
    # Find Similar Movies. Will return movie ids
    sim_movs = np.intersect1d(movies1, movies2, assume_unique=True)
    
    # get similar movies ratings
    df = user_by_movie.loc[(user1, user2), sim_movs]
    corr = df.transpose().corr().iloc[0,1]
  
    return corr #return the correlation

In [None]:
# Read in solution correlations - this will take some time to read in
import pickle
corrs_import = pickle.load(open("data/Term2/recommendations/lesson1/data/corrs.p", "rb"))
df_corrs = pd.DataFrame(corrs_import)
df_corrs.columns = ['user1', 'user2', 'movie_corr']

In [None]:
# Test your function against the solution
assert compute_correlation(2,2) == df_corrs.query("user1 == 2 and user2 == 2")['movie_corr'][0], "Oops!  The correlation between a user and itself should be 1.0."
assert round(compute_correlation(2,66), 2) == round(df_corrs.query("user1 == 2 and user2 == 66")['movie_corr'][1], 2), "Oops!  The correlation between user 2 and 66 should be about 0.76."
assert np.isnan(compute_correlation(2,104)) == np.isnan(df_corrs.query("user1 == 2 and user2 == 104")['movie_corr'][4]), "Oops!  The correlation between user 2 and 104 should be a NaN."

print("If this is all you see, then it looks like your function passed all of our tests!")

### Why the NaN's?

If the function you wrote passed all of the tests, then you have correctly set up your function to calculate the correlation between any two users.  The **df_corrs** dataframe created in the cell leading up to the tests holds combinations of users along with their corresponding correlation.  

`5.` But one question is, why are we still obtaining **NaN** values?  Look at the header below for users 2 and 104 - they have a correlation of **NaN** - why?

In [None]:
df_corrs.head()

Leave your thoughts here about why the NaN exists, and use the cells below to validate your thoughts.  These Nan's ultimately make the correlation coefficient a less than optimal measure of similarity between two users.

In [21]:
# Which movies did both user 2 and user 1044 see?
movies_set_2 = set(movies_to_analyze[2])
movies_set_104 = set(movies_to_analyze[104])
sim_movies = movies_set_2.intersection(movies_set_104)
sim_movies

{454876, 816711, 1454468, 1535109}

In [22]:
# What were the ratings for each user for those movies?
print(user_by_movie.loc[2, sim_movies])
print(user_by_movie.loc[104, sim_movies])

movie_id
454876     8.0
1454468    8.0
1535109    8.0
816711     8.0
Name: 2, dtype: float64
movie_id
454876     9.0
1454468    7.0
1535109    9.0
816711     7.0
Name: 104, dtype: float64


Since the rating for user 2 is all the same (8.0), the standard deviation will be 0. Since correlation is has standard deviation of user 2 in the denominator, we have a divide by 0 situation. Hence, the correlation is NA.

`6.` Because the correlation coefficient proved to be less than optimal for relating user ratings to one another, we could instead calculate the euclidean distance between the ratings.  I found [this post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) particularly helpful when I was setting up my function.  This function should be very similar to your previous function.  When you feel confident with your function, test it against our results.

In [32]:
def compute_euclidean_dist(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the euclidean distance between user1 and user2
    '''
    # get id of movies to analyze for both users
    movies_set_1 = movies_to_analyze[user1]
    movies_set_2 = movies_to_analyze[user2]
    
    # find id of similar movies
    sim_movies = np.intersect1d(movies_set_1, movies_set_2, assume_unique=True)
    
    # get subset of user_by_movies for the two users
    subset_df = user_by_movie.loc[(user1,user2), sim_movies]
    
    # calculate euclidean distance
    dist = np.linalg.norm(subset_df.loc[user1] - subset_df.loc[user2])
   
    return dist #return the euclidean distance

In [26]:
# Read in solution euclidean distances - this will take some time to read in
#df_dists = pickle.load(open("data/Term2/recommendations/lesson1/data/dists.p", "rb"))
df_dists = pd.read_pickle('data/Term2/recommendations/lesson1/data/dists.p')
df_dists.head()

Unnamed: 0,user1,user2,eucl_dist
0,2,2,0.0
1,2,66,2.236068
2,2,90,5.385165
3,2,99,2.828427
4,2,104,2.0


In [33]:
# Test your function against the solution
assert compute_euclidean_dist(2,2) == df_dists.query("user1 == 2 and user2 == 2")['eucl_dist'][0], "Oops!  The distance between a user and itself should be 0.0."
assert round(compute_euclidean_dist(2,66), 2) == round(df_dists.query("user1 == 2 and user2 == 66")['eucl_dist'][1], 2), "Oops!  The distance between user 2 and 66 should be about 2.24."
assert np.isnan(compute_euclidean_dist(2,104)) == np.isnan(df_dists.query("user1 == 2 and user2 == 104")['eucl_dist'][4]), "Oops!  The distance between user 2 and 104 should be 2."

print("If this is all you see, then it looks like your function passed all of our tests!")

If this is all you see, then it looks like your function passed all of our tests!


### Using the Nearest Neighbors to Make Recommendations

In the previous questions, you read in **df_corrs** and **df_dists**. Therefore, you have a measure of distance and similarity for each user to every other user. These dataframes hold every possible combination of users, as well as the corresponding correlation or euclidean distance, respectively.

Because of the **NaN** values that exist within **df_corrs**, we will proceed using **df_dists**. You will want to find the users that are 'nearest' each user.  Then you will want to find the movies the closest neighbors have liked to recommend to each user.

I made use of the following objects:

* df_dists (to obtain the neighbors)
* user_items (to obtain the movies the neighbors and users have rated)
* movies (to obtain the names of the movies)

`7.` Complete the functions below, which allow you to find the recommendations for any user.  There are five functions which you will need:

* **find_closest_neighbors** - this returns a list of user_ids from closest neighbor to farthest neighbor using euclidean distance


* **movies_liked** - returns an array of movie_ids


* **movie_names** - takes the output of movies_liked and returns a list of movie names associated with the movie_ids


* **make_recommendations** - takes a user id and goes through closest neighbors to return a list of movie names as recommendations


* **all_recommendations** = loops through every user and returns a dictionary of with the key as a user_id and the value as a list of movie recommendations

In [34]:
# find_closest_neighbors
user = 2
df_dists[df_dists['user1'] == user].sort_values(by='eucl_dist').head(10)

Unnamed: 0,user1,user2,eucl_dist
0,2,2,0.000000
35,2,755,0.000000
1161,2,22915,0.000000
1836,2,35310,0.000000
1808,2,34706,0.000000
1729,2,33207,0.000000
481,2,9356,0.000000
458,2,8944,0.000000
656,2,12856,0.000000
1719,2,32951,0.000000


In [35]:
df_dists[df_dists['user1'] == user].sort_values(by='eucl_dist')['user2'].head(10)

0           2
35        755
1161    22915
1836    35310
1808    34706
1729    33207
481      9356
458      8944
656     12856
1719    32951
Name: user2, dtype: int64

In [36]:
# ignore ownself, so take from rows 2 onwards
df_dists[df_dists['user1'] == user].sort_values(by='eucl_dist')['user2'].iloc[1:]

35        755
1161    22915
1836    35310
1808    34706
1729    33207
481      9356
458      8944
656     12856
1719    32951
2311    44688
1029    20390
2047    39336
79       1885
2800    53793
92       2138
1590    30884
2053    39417
1293    25270
1364    26605
1362    26585
2012    38636
85       2023
751     14774
1702    32575
738     14494
628     12284
2027    38932
1329    25961
2375    45808
2406    46415
        ...  
355      6947
1141    22546
606     11804
2695    51917
1318    25766
1746    33621
159      3425
703     13775
727     14276
2780    53447
1098    21695
1867    35832
351      6901
931     18327
1167    22978
49       1063
680     13275
1281    25052
2346    45275
1381    26870
761     14927
818     15981
1545    30036
143      3064
513      9914
2575    49739
165      3514
1915    36807
1699    32494
2743    52737
Name: user2, Length: 2808, dtype: int64

In [40]:
# movies_liked
user_id = 2
min_rating = 7
movies_liked = user_items.query("user_id == @user_id and rating >= @min_rating")['movie_id']
movies_liked

2      422720
3      454876
4      790636
5      816711
6     1091191
7     1103275
8     1322269
9     1390411
10    1398426
11    1431045
12    1433811
13    1454468
14    1535109
15    1675434
16    1798709
17    2017038
18    2024544
19    2294629
20    2361509
22    2726560
23    2883512
24    3079380
Name: movie_id, dtype: int64

In [45]:
movies[movies['movie_id'].isin(movies_liked)]['movie']

13196           Marie Antoinette (2006)
13676                 Life of Pi (2012)
14545         Dallas Buyers Club (2013)
14754                World War Z (2013)
15885              Lone Survivor (2013)
15946                 Two Lovers (2008)
17094       August: Osage County (2013)
17408    In the Heart of the Sea (2015)
17447     Straight Outta Compton (2015)
17622                   Deadpool (2016)
17637                 Disconnect (2012)
17747                    Gravity (2013)
18168           Captain Phillips (2013)
19010           The Intouchables (2011)
19779                        Her (2013)
20857                All Is Lost (2013)
20877           12 Years a Slave (2013)
22307                     Frozen (2013)
22683                 The Intern (2015)
24068           The Longest Ride (2015)
24411                       Chef (2014)
24985                        Spy (2015)
Name: movie, dtype: object

In [52]:
def find_closest_neighbors(user):
    '''
    INPUT:
        user - (int) the user_id of the individual you want to find the closest users
    OUTPUT:
        closest_neighbors - an array of the id's of the users sorted from closest to farthest away
    '''
    # I treated ties as arbitrary and just kept whichever was easiest to keep using the head method
    # You might choose to do something less hand wavy - order the neighbors
    
    closest_neighbors= df_dists[df_dists['user1'] == user].sort_values(by='eucl_dist')['user2'].iloc[1:]
    
    closest_neighbors = np.array(closest_neighbors)   
    
    return closest_neighbors
    
    
def movies_liked(user_id, min_rating=7):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    min_rating - the minimum rating considered while still a movie is still a "like" and not a "dislike"
    OUTPUT:
    movies_liked - an array of movies the user has watched and liked
    '''
    movies_liked = user_items.query("user_id == @user_id and rating >= @min_rating")['movie_id']
    movies_liked = np.array(movies_liked)
  
    return movies_liked


def movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    
    '''
    movie_list = movies[movies['movie_id'].isin(movie_ids)]['movie']
    movie_list = list(movie_list)

    return movie_list
    
    
def make_recommendations(user, num_recs=10):
    '''
    INPUT:
        user - (int) a user_id of the individual you want to make recommendations for
        num_recs - (int) number of movies to return
    OUTPUT:
        recommendations - a list of movies - if there are "num_recs" recommendations return this many
                          otherwise return the total number of recommendations available for the "user"
                          which may just be an empty list
    '''
    # find movies watched by user. Won't recommend this
    movies_seen = movies_watched(user)
    
    # find closest neighbours of users
    closest_neighbours = find_closest_neighbors(user)
    
    rec_movie_id = []
    
    for neighbour in closest_neighbours:
        # find movies liked by neighbour
        movies_neigh_liked = movies_liked(neighbour)
        
        # Obtain recommendations for each neighbor. This is whatever movies liked by neighbour but user has not watched them
        new_rec_movie_neigh = np.setdiff1d(movies_neigh_liked, movies_seen, assume_unique=True)
        
        # add these movies to list of recommended movies
        rec_movie_id.extend(new_rec_movie_neigh)
        
        # if we enough movies, break
        if len(rec_movie_id) >= 10:
            break
    
    # restrict to just num_recs movies
    rec_movie_id = rec_movie_id[:num_recs]
    
    # get only unique movies
    rec_movie_id = np.array(rec_movie_id)
    rec_movie_id = np.unique(rec_movie_id)
    
    recommendations = movie_names(rec_movie_id)
    
    return recommendations

def all_recommendations(num_recs=10):
    '''
    INPUT 
        num_recs (int) the (max) number of recommendations for each user
    OUTPUT
        all_recs - a dictionary where each key is a user_id and the value is an array of recommended movie titles
    '''   
    # Apply make recs for each user - 
    # hint this may take some time, so you might want to set up a progress bar to watch things progress
    
    # users to analyze
    users = np.unique(df_dists['user1'])
    
    # number of uers to analyze
    n_users = len(users)
    
    # Set up a progress bar
    cnter = 0
    bar = progressbar.ProgressBar(maxval=n_users+1, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    
    # store recommendations in this dictionary
    all_recs = dict()
    
    # make recommendations for each user
    for user in users:
        # append recommended movie names as values of dictionary
        all_recs[user] = make_recommendations(user)
        
        # update progress bar
        cnter+=1 
        bar.update(cnter)
    
    return all_recs

all_recs = all_recommendations(10)



In [49]:
# This make some time - it loads our solution dictionary so you can compare results
all_recs_sol = pickle.load(open("data/Term2/recommendations/lesson1/data/all_recs.p", "rb"))

In [53]:
assert all_recs[2] == make_recommendations(2), "Oops!  Your recommendations for user 2 didn't match ours."
assert all_recs[26] == make_recommendations(26), "Oops!  It actually wasn't possible to make any recommendations for user 26."
assert all_recs[1503] == make_recommendations(1503), "Oops! Looks like your solution for user 1503 didn't match ours."
print("If you made it here, you now have recommendations for many users using collaborative filtering!")
HTML('<img src="images/greatjob.webp">')

If you made it here, you now have recommendations for many users using collaborative filtering!


### Now What?

If you made it this far, you have successfully implemented a solution to making recommendations using collaborative filtering. 

`8.` Let's do a quick recap of the steps taken to obtain recommendations using collaborative filtering.  

In [56]:
# Check your understanding of the results by correctly filling in the dictionary below
a = "pearson's correlation and spearman's correlation"
b = 'item based collaborative filtering'
c = "there were too many ratings to get a stable metric"
d = 'user based collaborative filtering'
e = "euclidean distance and pearson's correlation coefficient"
f = "manhatten distance and euclidean distance"
g = "spearman's correlation and euclidean distance"
h = "the spread in some ratings was zero"
i = 'content based recommendation'

sol_dict = {
    'The type of recommendation system implemented here was a ...': d,
    'The two methods used to estimate user similarity were: ': e,
    'There was an issue with using the correlation coefficient.  What was it?': h
}

t.test_recs(sol_dict)

KeyError: 'The two methods used to estimate user similarity were: '

Additionally, let's take a closer look at some of the results.  There are three objects that you read in to check your results against the solution:

* **df_corrs** - a dataframe of user1, user2, pearson correlation between the two users
* **df_dists** - a dataframe of user1, user2, euclidean distance between the two users
* **all_recs_sol** - a dictionary of all recommendations (key = user, value = list of recommendations)

Looping your results from the correlation and euclidean distance functions through every pair of users could have been used to create the first two objects (I don't recommend doing this given how long it will take).  

`9.`Use these three objects along with the cells below to correctly fill in the dictionary below and complete this notebook!

In [None]:
a = 567
b = 1503
c = 1319
d = 1325
e = 2526710
f = 0
g = 'Use another method to make recommendations - content based, knowledge based, or model based collaborative filtering'

sol_dict2 = {
    'For how many pairs of users were we not able to obtain a measure of similarity using correlation?': # letter here,
    'For how many pairs of users were we not able to obtain a measure of similarity using euclidean distance?': f,
    'For how many users were we unable to make any recommendations for using collaborative filtering?': # letter here,
    'For how many users were we unable to make 10 recommendations for using collaborative filtering?': # letter here,
    'What might be a way for us to get 10 recommendations for every user?': g   
}

t.test_recs2(sol_dict2)

In [None]:
#Use the below cells for any work you need to do!

In [None]:
# NaN correlation values
df_corrs['movie_corr'].isnull().sum()

In [57]:
# NaN euclidean distance values
df_dists['eucl_dist'].isnull().sum()

0

In [58]:
# Users without recs
users_without_recs = []

for user, movie_recs in all_recs.items():
    if len(movie_recs) == 0:
        users_without_recs.append(user)

len(users_without_recs)

1294

In [59]:
# Users with less than 10 recs
users_with_less_than_10recs = []
for user, movie_recs in all_recs.items():
    if len(movie_recs) < 10:
        users_with_less_than_10recs.append(user)
    
len(users_with_less_than_10recs)

1720