# 1 Introduction

Recommendation engines are used to suggest everything from movies to music to friends to new destinations. There are three main methods for implementing recommendations that you will become familiar with throughout this lesson:
- Knowledge Based Recommendations
- Collaborative Filtering Based Recommendations
- Content Based Recommendations

After completing this lesson, you will be ready for the upcoming lessons where you will
- Learn about more advanced techniques.
- Deploy your recommendations in a web application.

These three lessons will aim to be extremely practical. The lessons will require that you write code to implement a number of different recommendation techniques.

# 2 Example Recommendation Engines

### Example Recommendations
- LinkedIn and Facebook

Both LinkedIn and Facebook have recommendations for connections (business or friends) similar to what is shown below.

- AirBnB Experiences and Destinations

AirBnB uses recommendations to determine experiences and destinations for their users.

- Walmart, Amazon, and Other Retailers

As humans on the Internet, we all get pinged with constant recommendations from retailers.

# 3 What's Ahead

### Types of Recommendations
In this lesson, you will be working with the MovieTweetings data to apply each of the three methods of recommendations:
1. Knowledge Based Recommendations
2. Collaborative Filtering Based Recommendations
3. Content Based Recommendations

Within Collaborative Filtering, there are two main branches:
1. Model Based Collaborative Filtering
2. Neighborhood Based Collaborative Filtering

In this lesson, you will implement Neighborhood Based Collaborative Filtering. In the next lesson, you will implement Model Based Collaborative Filtering.

### Similarity Metrics
In order to implement Neighborhood Based Collaborative Filtering, you will learn about some common ways to measure the similarity between two users (or two items) including:
1. Pearson's correlation coefficient
2. Spearman's correlation coefficient
3. Kendall's Tau
4. Euclidean Distance
5. Manhattan Distance

You will learn why sometimes one metric works better than another by looking at a specific situation where one metric provides more information than another.

### Business Cases For Recommendations
Finally, you will look at the four ideas needed for businesses to implement successful recommendations to drive revenue, which include:
1. Relevance
2. Novelty
3. Serendipity
4. Increased Diversity

At the end of this lesson, you will have gained a ton of skills to build upon or to start creating your own recommendations in practice.

# 4 Introduction to MovieTweetings

### MovieTweetings
If you would like additional information about the MovieTweetings data, you can find more information at the links provided here:
- [The MovieTweetings white paper.](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf)
- [A Github account set up for MovieTweetings](https://github.com/sidooms/MovieTweetings)
- [A slide deck by Simon Dooms about MovieTweetings.](https://www.slideshare.net/simondooms/movie-tweetings-a-movie-rating-dataset-collected-from-twitter)

# 5 MovieTweeting Data

### Recommendations with MovieTweetings: Getting to Know The Data

Throughout this lesson, you will be working with the [MovieTweetings Data](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014).  To get started, you can read more about this project and the dataset from the [publication here](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf).

**Note:** There are solutions to each of the notebooks available by hitting the orange jupyter logo in the top left of this notebook.  Additionally, you can watch me work through the solutions on the screencasts that follow each workbook. 

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.

 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline

# Read in the datasets
movies = pd.read_csv('https://raw.githubusercontent.com/sidooms/MovieTweetings/master/latest/movies.dat', delimiter='::', header=None, names=['movie_id', 'movie', 'genre'], dtype={'movie_id': object}, engine='python')
reviews = pd.read_csv('https://raw.githubusercontent.com/sidooms/MovieTweetings/master/latest/ratings.dat', delimiter='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'], dtype={'movie_id': object, 'user_id': object, 'timestamp': object}, engine='python')

#### 1. Take a Look At The Data 

Take a look at the data and use your findings to fill in the dictionary below with the correct responses to show your understanding of the data.

In [2]:
# number of movies
print("The number of movies is {}.".format(movies.shape[0]))

# number of ratings
print("The number of ratings is {}.".format(reviews.shape[0]))

# unique users
print("The number of unique users is {}.".format(reviews.user_id.nunique()))

# missing ratings
print("The number of missing reviews is {}.".format(reviews.rating.isnull().sum()))

# the average, min, and max ratings given
print("The average, minimum, and max ratings given are {}, {}, and {}, respectively.".format(np.round(reviews.rating.mean(), 0), reviews.rating.min(), reviews.rating.max()))

The number of movies is 35216.
The number of ratings is 854283.
The number of unique users is 66438.
The number of missing reviews is 0.
The average, minimum, and max ratings given are 7.0, 0, and 10, respectively.


In [3]:
# number of different genres
genres = []
for val in movies.genre:
    try:
        genres.extend(val.split('|'))
    except AttributeError:
        pass

# we end up needing this later
genres = set(genres)
print("The number of genres is {}.".format(len(genres)))

The number of genres is 28.


In [4]:
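# Use your findings to match each variable to the correct statement in the dictionary
# Note: these option values appear to come from an earlier snapshot of the data; the
# 'latest' files read above are updated over time, so the counts printed earlier differ.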
a = 53968
b = 10
c = 7
d = 31245
e = 15
f = 0
g = 4
h = 712337
i = 28

dict_sol1 = {
'The number of movies in the dataset': d, 
'The number of ratings in the dataset': h,
'The number of different genres': i, 
'The number of unique users in the dataset': a, 
'The number missing ratings in the reviews dataset': f, 
'The average rating given across all ratings': c,
'The minimum rating given across all ratings': f,
'The maximum rating given across all ratings': b
}

print(dict_sol1)

{'The number of movies in the dataset': 31245, 'The number of ratings in the dataset': 712337, 'The number of different genres': 28, 'The number of unique users in the dataset': 53968, 'The number missing ratings in the reviews dataset': 0, 'The average rating given across all ratings': 7, 'The minimum rating given across all ratings': 0, 'The maximum rating given across all ratings': 10}


#### 2. Data Cleaning

Next, we need to pull some additional relevant information out of the existing columns. 

For each of the datasets, there are a couple of cleaning steps we need to take care of:

#### Movies
* Pull the release year from the title and store it in a new `date` column
* Dummy the date column with 1's and 0's for each century of a movie (1800's, 1900's, and 2000's)
* Create a dummy column for each genre with 1's and 0's

#### Reviews
* Create a date from the timestamp

You can check your results against the header of my solution by running the cell below with the **show_clean_dataframes** function.
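
The cleaning steps listed above can also be sketched with vectorized pandas string methods. This is a minimal illustration on a made-up dataframe (the titles and genres in `toy_movies` are hypothetical), not the solution used in this lesson:

```python
import pandas as pd

# Toy stand-in for the movies dataframe (titles and genres are made up for illustration)
toy_movies = pd.DataFrame({
    'movie': ['Old Film (1894)', 'Mid Film (1994)', 'New Film (2015)', 'No Year Film'],
    'genre': ['Documentary|Short', 'Drama', 'Music|Drama', None],
})

# Pull a four-digit year from a trailing "(YYYY)"; titles without one become NaN
toy_movies['date'] = toy_movies['movie'].str.extract(r'\((\d{4})\)$', expand=False)

# One 0/1 column per century, based on the first two digits of the year
for prefix in ['18', '19', '20']:
    toy_movies[prefix + "00's"] = toy_movies['date'].str.startswith(prefix, na=False).astype(int)

# One 0/1 column per genre; get_dummies splits on '|' and matches whole genre names
genre_dummies = toy_movies['genre'].str.get_dummies(sep='|')
toy_movies = toy_movies.join(genre_dummies)

print(toy_movies)
```

Rows with a missing year or genre simply receive 0 in every dummy column, so no special handling is needed for them.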

In [5]:
# pull date if it exists
create_date = lambda val: val[-5:-1] if val[-1] == ')' else np.nan

# apply the function to pull the date
movies['date'] = movies['movie'].apply(create_date)

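# Return century of movie as a dummy column
def add_movie_year(val, yr):
    # val is NaN (a float) when no year was found in the title, so check the type first
    if isinstance(val, str) and val[:2] == yr:
        return 1
    else:
        return 0

# Apply function, passing the century prefix explicitly instead of relying on a global
for yr in ['18', '19', '20']:
    movies[yr + "00's"] = movies['date'].apply(add_movie_year, yr=yr)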


In [6]:
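# Function to split and return values for columns
def split_genres(val, gene):
    try:
        # compare whole genre names so that 'Music' does not also match 'Musical'
        if gene in val.split('|'):
            return 1
        else:
            return 0
    except AttributeError:
        # missing genres are NaN (a float), which has no split method
        return 0

# Apply function for each genre, passing the genre explicitly instead of relying on a global
for gene in genres:
    movies[gene] = movies['genre'].apply(split_genres, gene=gene)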

In [7]:
movies.head() #Check what it looks like

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Documentary,Biography,Sci-Fi,...,Action,Music,Film-Noir,Musical,Adult,Family,Fantasy,Thriller,Comedy,Western
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
import datetime

change_timestamp = lambda val: datetime.datetime.fromtimestamp(int(val)).strftime('%Y-%m-%d %H:%M:%S')

reviews['date'] = reviews['timestamp'].apply(change_timestamp)
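
Note that `datetime.datetime.fromtimestamp` converts to the local time zone of the machine running the notebook, so the resulting strings can differ between machines. As a hedged alternative sketch, `pd.to_datetime` with `unit='s'` converts a whole column at once and can be pinned to UTC. The `toy_reviews` frame below is an illustrative stand-in that uses two timestamps from this dataset:

```python
import pandas as pd

# Two timestamps from the reviews data, stored as strings as in the lesson
toy_reviews = pd.DataFrame({'timestamp': ['1381006850', '1579057827']})

# Interpret the values as seconds since the Unix epoch, expressed in UTC
toy_reviews['date_utc'] = pd.to_datetime(
    toy_reviews['timestamp'].astype('int64'), unit='s', utc=True)

print(toy_reviews)
```

The result is a true datetime column rather than a column of strings, which makes later comparisons and sorting by date more reliable.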

In [9]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date
0,1,114508,8,1381006850,2013-10-05 16:00:50
1,2,358273,9,1579057827,2020-01-14 21:10:27
2,2,10039344,5,1578603053,2020-01-09 14:50:53
3,2,6751668,9,1578955697,2020-01-13 16:48:17
4,2,7131622,8,1579559244,2020-01-20 16:27:24


In [10]:
# now reviews and movies are the final dataframes with the necessary columns
# reviews.to_csv('./reviews_clean.csv')
# movies.to_csv('./movies_clean.csv')

# 6 Solution MovieTweeting Data

# 7 Ways to Recommend: Knowledge Based

### Knowledge Based Recommendations
A knowledge based recommendation is one in which knowledge about the item or user preferences is used to make a recommendation.

Knowledge based recommendations are pretty common when purchasing luxury items. Take a look at the filters available on Zillow in the image below. This is an example of building in a knowledge based recommendation, as users can add their own preferences to the items that are provided.

![Zillow](https://video.udacity-data.com/topher/2018/August/5b6a4153_screen-shot-2018-08-07-at-6.02.41-pm/screen-shot-2018-08-07-at-6.02.41-pm.png)

Often a rank based algorithm is provided along with knowledge based recommendations to bring the most popular items in particular categories to the user's attention.

In the next concept, you will get some practice implementing this type of recommendation for the MovieTweetings dataset.

# 8 Notebook: Knowledge Based

### Recommendations with MovieTweetings: Most Popular Recommendation

Now that you have created the necessary columns we will be using throughout the rest of the lesson on creating recommendations, let's get started with the first of our recommendations.

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.


In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline

# Read in the datasets
# movies = pd.read_csv('movies_clean.csv')
# reviews = pd.read_csv('reviews_clean.csv')

try:
    del movies['Unnamed: 0']
except Exception as e:
    print(e)
    
try:
    del reviews['Unnamed: 0']
except Exception as e:
    print(e)


'Unnamed: 0'
'Unnamed: 0'


#### 1. How To Find The Most Popular Movies

For this notebook, we have a single task.  The task is that no matter the user, we need to provide a list of the recommendations based on simply the most popular items.

For this task, we will consider what is "most popular" based on the following criteria:

* A movie with the highest average rating is considered best
* With ties, movies that have more ratings are better
* A movie must have a minimum of 5 ratings to be considered among the best movies
* If movies are tied in their average rating and number of ratings, the movie with the most recent rating ranks higher

With these criteria, the goal for this notebook is to take a **user_id** and provide back the **n_top** recommendations.  Use the function below as the scaffolding that will be used for all the future recommendations as well.
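
To see how the criteria above interact, here is a minimal sketch on made-up data. The ids, ratings, and dates in `toy_reviews` are hypothetical and chosen so that each tie-breaking rule is exercised once:

```python
import pandas as pd

# Hypothetical toy reviews: a, b, c all average 9; d averages 10 but has only 2 ratings
toy_reviews = pd.DataFrame({
    'movie_id': ['a'] * 5 + ['b'] * 5 + ['c'] * 6 + ['d'] * 2,
    'rating': [9] * 5 + [9] * 5 + [9] * 6 + [10] * 2,
    'date': pd.to_datetime(
        ['2020-01-01'] * 5 + ['2020-02-01'] * 5 + ['2019-01-01'] * 6 + ['2020-03-01'] * 2),
})

# One row per movie: average rating, number of ratings, most recent rating date
stats = toy_reviews.groupby('movie_id').agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'count'),
    last_rating=('date', 'max'),
)

# Keep movies with at least 5 ratings, then sort by the three ranking keys in order
ranked = (stats[stats['num_ratings'] >= 5]
          .sort_values(['avg_rating', 'num_ratings', 'last_rating'], ascending=False))

print(ranked.index.tolist())
```

Movie `d` is dropped for having too few ratings, `c` wins the three-way tie on average rating because it has more ratings, and `b` beats `a` because its last rating is more recent.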

In [12]:
def create_ranked_df(movies, reviews):
    '''
    INPUT
    movies - the movies dataframe
    reviews - the reviews dataframe

    OUTPUT
    ranked_movies - a dataframe with movies that are sorted by highest avg rating, more reviews,
                    then most recent rating, and must have at least 5 ratings
    '''

    # Pull the average rating, number of ratings, and most recent rating date for each movie
    movie_ratings = reviews.groupby('movie_id')['rating']
    avg_ratings = movie_ratings.mean()
    num_ratings = movie_ratings.count()
    # select the date column before aggregating so only that column is scanned
    last_rating = reviews.groupby('movie_id')['date'].max().rename('last_rating')

    # Combine the summaries into one dataframe indexed by movie_id
    rating_count_df = pd.DataFrame({'avg_rating': avg_ratings, 'num_ratings': num_ratings})
    rating_count_df = rating_count_df.join(last_rating)

    # merge with the movies dataset
    movie_recs = movies.set_index('movie_id').join(rating_count_df)

    # sort by top avg rating, then number of ratings, then most recent rating
    ranked_movies = movie_recs.sort_values(['avg_rating', 'num_ratings', 'last_rating'], ascending=False)

    # for edge cases - subset the movie list to those with only 5 or more reviews
    ranked_movies = ranked_movies[ranked_movies['num_ratings'] > 4]

    return ranked_movies
    

def popular_recommendations(user_id, n_top, ranked_movies):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time

    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''

    top_movies = list(ranked_movies['movie'][:n_top])

    return top_movies

Using the four criteria above, you should be able to put together the above function.  If you feel confident in your solution, check the results of your function against our solution. On the next page, you can see a walkthrough, and you can also find the solution in the solution notebook available in this workspace.

In [13]:
# Top 20 movies recommended for id 1
ranked_movies = create_ranked_df(movies, reviews) # only run this once - it is not fast

recs_20_for_1 = popular_recommendations('1', 20, ranked_movies)

# Top 5 movies recommended for id 53968
recs_5_for_53968 = popular_recommendations('53968', 5, ranked_movies)

# Top 100 movies recommended for id 70000
recs_100_for_70000 = popular_recommendations('70000', 100, ranked_movies)

# Top 35 movies recommended for id 43
recs_35_for_43 = popular_recommendations('43', 35, ranked_movies)

In [14]:
print(recs_20_for_1,"\n")
print(recs_5_for_53968,"\n")
print(recs_35_for_43,"\n")

['MSG 2 the Messenger (2015)', 'Avengers: Age of Ultron Parody (2015)', 'Five Minutes (2017)', 'Selam (2013)', 'Let There Be Light (2017)', "Quiet Riot: Well Now You're Here, There's No Way Back (2014)", 'Crawl Bitch Crawl (2012)', 'Chasing Happiness (2019)', 'Make Like a Dog (2015)', 'Pandorica (2016)', 'Third Contact (2011)', 'Romeo Juliet (2009)', 'Be Somebody (2016)', 'Birlesen Gonuller (2014)', 'Kitbull (2019)', 'Agnelli (2017)', 'Sátántangó (1994)', 'Foster (2011)', 'CM101MMXI Fundamentals (2013)', 'Crystal Lake Memories: The Complete History of Friday the 13th (2013)'] 

['MSG 2 the Messenger (2015)', 'Avengers: Age of Ultron Parody (2015)', 'Five Minutes (2017)', 'Selam (2013)', 'Let There Be Light (2017)'] 

['MSG 2 the Messenger (2015)', 'Avengers: Age of Ultron Parody (2015)', 'Five Minutes (2017)', 'Selam (2013)', 'Let There Be Light (2017)', "Quiet Riot: Well Now You're Here, There's No Way Back (2014)", 'Crawl Bitch Crawl (2012)', 'Chasing Happiness (2019)', 'Make Like a 

**Notice:** This wasn't the only way we could have determined the "top rated" movies.  You can imagine that in keeping track of trending news or trending social events, you would likely want to create a time window from the current time, and then pull the articles in the most recent time frame.  There are always going to be some subjective decisions to be made.  
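
The time-window idea mentioned above could be sketched as follows. This is only an illustration under assumed values: the reference time `now`, the 30 day `window`, and the `toy_reviews` data are all hypothetical.

```python
import pandas as pd

# Hypothetical toy reviews with already parsed dates
toy_reviews = pd.DataFrame({
    'movie_id': ['a', 'a', 'b', 'b', 'c'],
    'rating': [9, 8, 7, 10, 10],
    'date': pd.to_datetime(
        ['2020-01-19', '2020-01-20', '2019-06-01', '2020-01-18', '2018-03-01']),
})

# Assumed reference time and window length; in practice "now" would be the current time
now = pd.Timestamp('2020-01-21')
window = pd.Timedelta(days=30)

# Keep only ratings inside the window, then rank by average rating and rating count
recent = toy_reviews[toy_reviews['date'] >= now - window]
trending = (recent.groupby('movie_id')['rating']
            .agg(['mean', 'count'])
            .sort_values(['mean', 'count'], ascending=False))

print(trending)
```

Movie `c` drops out entirely because its only rating is old, and the older low rating for `b` no longer pulls its average down.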

If you find that no one is paying any attention to your most popular recommendations, then it might be time to find a new way to recommend, which is what the next parts of the lesson should prepare us to do!


### Part II: Adding Filters

Now that you have created a function to give back the **n_top** movies, let's make it a bit more robust.  Add arguments that will act as filters for the movie **year** and **genre**.  

Use the cells below to adjust your existing function to accept **year** and **genre** arguments as **lists** of **strings**.  The final results should then be filtered to only the movies that match any of the provided years and any of the provided genres (the values within each list are combined as `or` conditions).  If no list is provided, no filter should be applied.

You can adjust other inputs as needed to retrieve the final results you are looking for!
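
As a small illustration of how such filters behave, the sketch below applies a year list and a genre list to a made-up ranked dataframe. The `toy_ranked` frame and its values are hypothetical; only the column layout (a `date` string column plus 0/1 genre columns) follows the cleaned movies data.

```python
import pandas as pd

# Hypothetical ranked movies with a year column and 0/1 genre dummy columns
toy_ranked = pd.DataFrame({
    'movie': ['M1 (2015)', 'M2 (2016)', 'M3 (2016)', 'M4 (1999)'],
    'date': ['2015', '2016', '2016', '1999'],
    'History': [1, 0, 0, 1],
    'News': [0, 0, 1, 0],
})

years = ['2015', '2016']
genres = ['History', 'News']

# A movie passes the year filter if its year is any of the listed years
in_years = toy_ranked['date'].isin(years)

# A movie passes the genre filter if at least one listed genre column is 1
in_genres = toy_ranked[genres].sum(axis=1) > 0

filtered = toy_ranked[in_years & in_genres]
print(filtered['movie'].tolist())
```

Here `M2` is removed because it matches neither genre, and `M4` is removed because its year is not in the list.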

In [15]:
def popular_recs_filtered(user_id, n_top, ranked_movies, years=None, genres=None):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time
    years - a list of strings with years of movies
    genres - a list of strings with genres of movies
    
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    # Filter movies based on year and genre
    if years is not None:
        ranked_movies = ranked_movies[ranked_movies['date'].isin(years)]

    if genres is not None:
        num_genre_match = ranked_movies[genres].sum(axis=1)
        ranked_movies = ranked_movies.loc[num_genre_match > 0, :]
            
            
    # create top movies list 
    top_movies = list(ranked_movies['movie'][:n_top])

    return top_movies


In [16]:
# Top 20 movies recommended for id 1 with years=['2015', '2016', '2017', '2018'], genres=['History']
recs_20_for_1_filtered = popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])

# Top 5 movies recommended for id 53968 with no genre filter but years=['2015', '2016', '2017', '2018']
recs_5_for_53968_filtered = popular_recs_filtered('53968', 5, ranked_movies, years=['2015', '2016', '2017', '2018'])

# Top 100 movies recommended for id 70000 with no year filter but genres=['History', 'News']
recs_100_for_70000_filtered = popular_recs_filtered('70000', 100, ranked_movies, genres=['History', 'News'])



In [17]:
print(recs_20_for_1_filtered,"\n")
print(recs_5_for_53968_filtered,"\n")

["Hillary's America: The Secret History of the Democratic Party (2016)", 'I Believe in Miracles (2015)', 'O.J.: Made in America (2016)', 'Ayla: The Daughter of War (2017)', 'Hacksaw Ridge (2016)', 'They Shall Not Grow Old (2018)', 'Namhansanseong (2017)', 'The Farthest (2017)', 'Kono sekai no katasumi ni (2016)', 'Sado (2015)', 'Silicon Cowboys (2016)', '13th (2016)', 'Ethel &amp; Ernest (2016)', 'Paul, Apostle of Christ (2018)', 'Kincsem (2017)', 'LA 92 (2017)', 'Straight Outta Compton (2015)', 'Nise - O Coração da Loucura (2015)', 'Under sandet (2015)', 'Only the Dead (2015)'] 

['MSG 2 the Messenger (2015)', 'Avengers: Age of Ultron Parody (2015)', 'Five Minutes (2017)', 'Let There Be Light (2017)', 'Make Like a Dog (2015)'] 



# 9 Solution: Knowledge Based

# 10 More Personalized Recommendations

In some cases, we need to be able to send recommendations without a user telling us exactly what they want or in a more personalized way than simply the top items. Imagine you want to send an email of recommendations or place recommendations within a web page (the side of a blog or as a banner advertisement); in these cases, it is often useful to use information that we know about users or items to make these recommendations. This leads to some additional recommendation methods!

# 11 Ways to Recommend: Collaborative Filtering

### Collaborative Filtering
- A method of making recommendations based on using the collaboration of user-item interactions

# 12 Collaborative Filtering

### Collaborative Filtering and Content Based Recommendations
**Collaborative filtering** is a method of making recommendations based only on the interactions between users and items.

Alternatively, **content based recommendations** are when we use information about the users or items to assist in our recommendations.

### Examples
- When a user is inputting her/his information (location input), this is an example of knowledge based recommending.
- When we use connections between users and items (connecting Mike and Pradeep as similar), this is an example of collaborative filtering. 
- When we use information about the items or users to recommend new items (items related to robotics), this is an example of content based recommending.
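
Because collaborative filtering relies only on user-item interactions, a common starting point is a user-by-item matrix of ratings. The following is a minimal sketch with made-up ids and ratings (`toy_reviews` is hypothetical), using the same column names as the reviews data:

```python
import pandas as pd

# Hypothetical interactions: which user gave which rating to which movie
toy_reviews = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u2', 'u3'],
    'movie_id': ['m1', 'm2', 'm1', 'm3', 'm2'],
    'rating': [8, 6, 9, 7, 5],
})

# Rows are users, columns are movies, and cells hold ratings (NaN where unrated)
user_item = toy_reviews.pivot(index='user_id', columns='movie_id', values='rating')

print(user_item)
```

No information about the movies themselves (such as genre or year) appears in this matrix, which is what separates collaborative filtering from content based methods.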

# 13 Measuring Similarity

### Collaborative Filtering
There are two main ways to implement collaborative filtering:
1. Model Based Collaborative Filtering
2. Neighborhood Based Collaborative Filtering

In this lesson, we will cover Neighborhood Based Collaborative Filtering, which is used to identify items or users that are "neighbors" with one another.

There are a number of ways we might go about finding an individual's closest neighbors - the metrics we will take a closer look at include:
1. Pearson's correlation coefficient
2. Spearman's correlation coefficient
3. Kendall's Tau
4. Euclidean Distance
5. Manhattan Distance

On the next page, you will work through a few examples to get more familiar with how each of these metrics is computed, and why you might use one over another.
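
For a quick preview, all five metrics can be computed with NumPy and SciPy. The sketch below uses two hypothetical users (`user_a` and `user_b`) who rated the same five movies; the ratings are made up for illustration:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

# Hypothetical ratings from two users on the same five movies
user_a = np.array([8, 6, 9, 4, 7])
user_b = np.array([7, 5, 10, 3, 8])

# Correlation-style measures: larger values mean more similar
pearson = np.corrcoef(user_a, user_b)[0, 1]
spearman = spearmanr(user_a, user_b)[0]
kendall = kendalltau(user_a, user_b)[0]

# Distance-style measures: smaller values mean more similar
euclidean = np.sqrt(np.sum((user_a - user_b) ** 2))
manhattan = np.sum(np.abs(user_a - user_b))

print(pearson, spearman, kendall, euclidean, manhattan)
```

Note the difference in direction: the correlation coefficients grow as users become more alike, while the two distances shrink.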

# 14 Notebook: Measuring Similarity

### How to Find Your Neighbor?

In neighborhood based collaborative filtering, it is incredibly important to be able to identify an individual's neighbors. Let's look at a small dataset in order to understand how we can use different metrics to identify close neighbors.

In [18]:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr, kendalltau
import matplotlib.pyplot as plt
%matplotlib inline

play_data = pd.DataFrame({'x1': [-3, -2, -1, 0, 1, 2, 3], 
               'x2': [9, 4, 1, 0, 1, 4, 9],
               'x3': [1, 2, 3, 4, 5, 6, 7],
               'x4': [2, 5, 15, 27, 28, 30, 31]
})

#create play data dataframe
play_data = play_data[['x1', 'x2', 'x3', 'x4']]

### Measures of Similarity

The first metrics we will look at have similar characteristics:

1. Pearson's Correlation Coefficient
2. Spearman's Correlation Coefficient
3. Kendall's Tau

Let's take a look at each of these individually.

### Pearson's Correlation

First, **Pearson's correlation coefficient** is a measure related to the strength and direction of a **linear** relationship.  

If we have two vectors x and y, we can compare their individual elements in the following way to calculate Pearson's correlation coefficient:


$$CORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum\limits_{i=1}^{n}(y_i-\bar{y})^2}} $$

where 

$$\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}x_i$$
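
As a worked example of the formula, take the `x1` and `x2` columns of `play_data`, where each `x2` value is the square of the matching `x1` value. The means are

$$\bar{x} = \frac{-3-2-1+0+1+2+3}{7} = 0, \qquad \bar{y} = \frac{9+4+1+0+1+4+9}{7} = 4$$

so the numerator becomes

$$\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = (-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5) = 0$$

and therefore $CORR(\textbf{x}, \textbf{y}) = 0$, even though the two columns are perfectly related by a quadratic rule. The relationship is simply not linear.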

`1.` Write a function that takes in two vectors and returns the Pearson correlation coefficient.  You can then compare your answer to the built-in function in NumPy by using the assert statements in the following cell.

In [19]:
def pearson_corr(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    corr - the pearson correlation coefficient for comparing x and y
    '''
    # Compute Mean Values
    mean_x, mean_y = np.sum(x)/len(x), np.sum(y)/len(y) 
    
    x_diffs = x - mean_x
    y_diffs = y - mean_y
    numerator = np.sum(x_diffs*y_diffs)
    denominator = np.sqrt(np.sum(x_diffs**2))*np.sqrt(np.sum(y_diffs**2))
        
    corr = numerator/denominator
                            
    return corr                    

In [20]:
# This cell will test your function against the built in numpy function
# np.isclose is used because floating point results rarely match exactly
assert np.isclose(pearson_corr(play_data['x1'], play_data['x2']), np.corrcoef(play_data['x1'], play_data['x2'])[0][1]), 'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'.format(pearson_corr(play_data['x1'], play_data['x2']))
assert np.isclose(pearson_corr(play_data['x1'], play_data['x3']), np.corrcoef(play_data['x1'], play_data['x3'])[0][1]), 'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'.format(np.corrcoef(play_data['x1'], play_data['x3'])[0][1], pearson_corr(play_data['x1'], play_data['x3']))
assert np.isclose(pearson_corr(play_data['x3'], play_data['x4']), np.corrcoef(play_data['x3'], play_data['x4'])[0][1]), 'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'.format(np.corrcoef(play_data['x3'], play_data['x4'])[0][1], pearson_corr(play_data['x3'], play_data['x4']))
print("If this is all you see, it looks like you are all set!  Nice job coding up Pearson's correlation coefficient!")


If this is all you see, it looks like you are all set!  Nice job coding up Pearson's correlation coefficient!


In [21]:
a = True
b = False
c = "We can't be sure."


pearson_dct = {"If when x increases, y always increases, Pearson's correlation will always be 1.": b,
               "If when x increases by 1, y always increases by 3, Pearson's correlation will always be 1.": a,
               "If when x increases by 1, y always decreases by 5, Pearson's correlation will always be -1.": a,
               "If when x increases by 1, y increases by 3 times x, Pearson's correlation will always be 1.": b
}

print(pearson_dct, "\n")
print("That's right!  Pearson's correlation relates to a linear relationship.  The second and third cases are examples of perfect linear relationships, where we would receive values of 1 and -1.  Only having an increase or decrease that are directly related will not lead to a Pearson's correlation coefficient of 1 or -1.  You can see this by testing out your function using the examples above without using assert statements.")

{"If when x increases, y always increases, Pearson's correlation will always be 1.": False, "If when x increases by 1, y always increases by 3, Pearson's correlation will always be 1.": True, "If when x increases by 1, y always decreases by 5, Pearson's correlation will always be -1.": True, "If when x increases by 1, y increases by 3 times x, Pearson's correlation will always be 1.": False} 

That's right!  Pearson's correlation relates to a linear relationship.  The second and third cases are examples of perfect linear relationships, where we would receive values of 1 and -1.  Only having an increase or decrease that are directly related will not lead to a Pearson's correlation coefficient of 1 or -1.  You can see this by testing out your function using the examples above without using assert statements.


### Spearman's Correlation

Now, let's look at **Spearman's correlation coefficient**.  Spearman's correlation is what is known as a [non-parametric](https://en.wikipedia.org/wiki/Nonparametric_statistics) statistic, which is a statistic whose distribution doesn't depend on parameters (statistics that follow normal distributions or binomial distributions are examples of parametric statistics).  

Frequently non-parametric statistics are based on the ranks of data rather than the original values collected.  This happens to be the case with Spearman's correlation coefficient, which is calculated similarly to Pearson's correlation.  However, instead of using the raw data, we use the rank of each value.

You can quickly change from the raw data to the ranks using the **.rank()** method as shown here:

In [22]:
print("The ranked values for the variable x1 are: {}".format(np.array(play_data['x1'].rank())))
print("The raw data values for the variable x1 are: {}".format(np.array(play_data['x1'])))

The ranked values for the variable x1 are: [1. 2. 3. 4. 5. 6. 7.]
The raw data values for the variable x1 are: [-3 -2 -1  0  1  2  3]
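
The `x1` column has no repeated values, so its ranks are simply 1 through 7. When values repeat, `.rank()` by default assigns each tied value the average of the ranks it spans. A minimal sketch, where `tied` is a made-up series and `x2` copies the values of the `x2` column in `play_data`:

```python
import pandas as pd

# Made-up series with one tie: the two 20s share ranks 2 and 3, so each gets 2.5
tied = pd.Series([10, 20, 20, 30])
print(tied.rank().tolist())

# Same values as the x2 column of play_data, which contains three pairs of ties
x2 = pd.Series([9, 4, 1, 0, 1, 4, 9])
print(x2.rank().tolist())
```

This averaging is why ranked values are not always whole numbers.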


If we map each of our data to ranked data values as shown above:

$$\textbf{x} \rightarrow \textbf{x}^{r}$$
$$\textbf{y} \rightarrow \textbf{y}^{r}$$

Here, we let the **r** indicate these are ranked values (this is not raising any value to the power of r).  Then we compute Spearman's correlation coefficient as:

$$SCORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x^{r}_i - \bar{x}^{r})(y^{r}_i - \bar{y}^{r})}{\sqrt{\sum\limits_{i=1}^{n}(x^{r}_i-\bar{x}^{r})^2}\sqrt{\sum\limits_{i=1}^{n}(y^{r}_i-\bar{y}^{r})^2}} $$

where 

$$\bar{x}^r = \frac{1}{n}\sum\limits_{i=1}^{n}x^r_i$$

`3.` Write a function that takes in two vectors and returns the Spearman correlation coefficient.  You can then compare your answer to the built-in `spearmanr` function in `scipy.stats` by using the assert statements in the following cell.

In [23]:
def corr_spearman(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    corr - the spearman correlation coefficient for comparing x and y
    '''
    # Change each vector to ranked values
    x = x.rank()
    y = y.rank()
    
    # Compute Mean Values
    mean_x, mean_y = np.sum(x)/len(x), np.sum(y)/len(y) 
    
    x_diffs = x - mean_x
    y_diffs = y - mean_y
    numerator = np.sum(x_diffs*y_diffs)
    denominator = np.sqrt(np.sum(x_diffs**2))*np.sqrt(np.sum(y_diffs**2))
        
    corr = numerator/denominator
                            
    return corr  

In [24]:
# This cell will test your function against the built in scipy function
# np.isclose is used because floating point results rarely match exactly
assert np.isclose(corr_spearman(play_data['x1'], play_data['x2']), spearmanr(play_data['x1'], play_data['x2'])[0]), 'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'.format(corr_spearman(play_data['x1'], play_data['x2']))
assert np.isclose(corr_spearman(play_data['x1'], play_data['x3']), spearmanr(play_data['x1'], play_data['x3'])[0]), 'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'.format(spearmanr(play_data['x1'], play_data['x3'])[0], corr_spearman(play_data['x1'], play_data['x3']))
assert np.isclose(corr_spearman(play_data['x3'], play_data['x4']), spearmanr(play_data['x3'], play_data['x4'])[0]), 'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'.format(spearmanr(play_data['x3'], play_data['x4'])[0], corr_spearman(play_data['x3'], play_data['x4']))
print("If this is all you see, it looks like you are all set!  Nice job coding up Spearman's correlation coefficient!")


If this is all you see, it looks like you are all set!  Nice job coding up Spearman's correlation coefficient!


`4.` Now that you have computed **Spearman's correlation coefficient**, use the below dictionary to identify statements that are true about **this** measure.

In [25]:
a = True
b = False
c = "We can't be sure."


spearman_dct = {"If when x increases, y always increases, Spearman's correlation will always be 1.": a,
               "If when x increases by 1, y always increases by 3, Spearman's correlation will always be 1.": a,
               "If when x increases by 1, y always decreases by 5, Spearman's correlation will always be -1.": a,
               "If when x increases by 1, y increases by 3 times x, Spearman's correlation will always be 1.": a
}

print(spearman_dct, "\n")
print("That's right!  Unlike Pearson's correlation, Spearman's correlation can have perfect relationships (1 or -1 values) that aren't linear relationships.  You will notice that neither Spearman nor Pearson correlation values suggest a relation when there are quadratic relationships.")

{"If when x increases, y always increases, Spearman's correlation will always be 1.": True, "If when x increases by 1, y always increases by 3, Spearman's correlation will always be 1.": True, "If when x increases by 1, y always decreases by 5, Spearman's correlation will always be -1.": True, "If when x increases by 1, y increases by 3 times x, Spearman's correlation will always be 1.": True} 

That's right!  Unlike Pearson's correlation, Spearman's correlation can have perfect relationships (1 or -1 values) that aren't linear relationships.  You will notice that neither Spearman nor Pearson correlation values suggest a relation when there are quadratic relationships.
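
The contrast between the two coefficients can be checked on a small made-up example. The sketch below uses plain Python rather than the notebook's `play_data`, and the helper names (`pearson`, `average_ranks`, `spearman`) are illustrative only. It assumes ties are given the average of the ranks they span, which is one common convention.

```python
from math import sqrt

def pearson(x, y):
    # Pearson's correlation written out from its definition
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    x_diffs = [xi - mean_x for xi in x]
    y_diffs = [yi - mean_y for yi in y]
    numerator = sum(dx * dy for dx, dy in zip(x_diffs, y_diffs))
    denominator = sqrt(sum(dx ** 2 for dx in x_diffs)) * sqrt(sum(dy ** 2 for dy in y_diffs))
    return numerator / denominator

def average_ranks(values):
    # Rank 1 is the smallest value; tied values share the average of their positions
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

def spearman(x, y):
    # Spearman's correlation is Pearson's correlation applied to the ranks
    return pearson(average_ranks(x), average_ranks(y))

x = [-3, -2, -1, 0, 1, 2, 3]
cubic = [v ** 3 for v in x]      # always increasing, but not linear
quadratic = [v ** 2 for v in x]  # decreases, then increases

print(round(spearman(x, cubic), 2), round(pearson(x, cubic), 2))
print(round(spearman(x, quadratic), 2), round(pearson(x, quadratic), 2))
```

For the cubic data Spearman's coefficient is 1 while Pearson's is about 0.93. For the symmetric quadratic data both coefficients are 0.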


### Kendall's Tau

Kendall's tau is quite similar to Spearman's correlation coefficient.  Both of these measures are nonparametric measures of a relationship.  Specifically both Spearman and Kendall's coefficients are calculated based on ranking data and not the raw data.  

Similar to both of the previous measures, Kendall's Tau is always between -1 and 1, where -1 suggests a strong, negative relationship between two variables and 1 suggests a strong, positive relationship between two variables.

Though Spearman's and Kendall's measures are very similar, there are statistical advantages to choosing Kendall's measure in that Kendall's Tau has smaller variability when using larger sample sizes.  However, Spearman's measure is more computationally efficient: computing Kendall's Tau directly from its pairwise definition is O(n^2), while Spearman's correlation is O(n log n), since its cost is dominated by sorting the data into ranks. You can find more on this topic in [this thread](https://www.researchgate.net/post/Does_Spearmans_rho_have_any_advantage_over_Kendalls_tau).

Let's take a closer look at exactly how this measure is calculated.  Again, we want to map our data to ranks:

$$\textbf{x} \rightarrow \textbf{x}^{r}$$
$$\textbf{y} \rightarrow \textbf{y}^{r}$$

Then we calculate Kendall's Tau as:

$$TAU(\textbf{x}, \textbf{y}) = \frac{2}{n(n -1)}\sum_{i < j}sgn(x^r_i - x^r_j)sgn(y^r_i - y^r_j)$$

Where $sgn$ takes the sign associated with the difference in the ranked values.  An alternative way to write 

$$sgn(x^r_i - x^r_j)$$ 

is in the following way:

$$
 \begin{cases} 
      -1  & x^r_i < x^r_j \\
      0 & x^r_i = x^r_j \\
      1 & x^r_i > x^r_j 
   \end{cases}
$$

Therefore the possible results of 

$$sgn(x^r_i - x^r_j)sgn(y^r_i - y^r_j)$$

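are only 1, -1, or 0.  A pair contributes 1 when **x** and **y** order the two observations the same way, -1 when they order them in opposite ways, and 0 when either vector has a tie.  Summing these values and dividing by the number of pairs, $n(n-1)/2$, gives an idea of the proportion of times the ranks of **x** and **y** point in the same direction.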
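
To make the pair counting concrete, here is a minimal sketch on two made-up vectors of length 4. The names `sign` and `kendall_tau_pairs` are illustrative. Because ranking preserves order, the sign of a difference in ranks equals the sign of the difference in raw values, so this sketch skips the ranking step.

```python
from itertools import combinations

def sign(value):
    # Returns -1, 0, or 1, matching the sgn cases above
    return (value > 0) - (value < 0)

def kendall_tau_pairs(x, y):
    n = len(x)
    same_direction = opposite_direction = 0
    # combinations yields every pair of positions with i < j exactly once
    for i, j in combinations(range(n), 2):
        product = sign(x[i] - x[j]) * sign(y[i] - y[j])
        if product > 0:
            same_direction += 1
        elif product < 0:
            opposite_direction += 1
    tau = 2 * (same_direction - opposite_direction) / (n * (n - 1))
    return same_direction, opposite_direction, tau

x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(kendall_tau_pairs(x, y))
```

Of the 6 pairs, 5 agree in direction and 1 (the second and third observations) disagrees, so the sum is 4 and tau is 2 * 4 / (4 * 3), or about 0.67.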

`5.` Write a function that takes in two vectors and returns Kendall's Tau.  You can then compare your answer to the built in function in scipy stats by using the assert statements in the following cell.

In [26]:
def kendalls_tau(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    tau - the kendall's tau for comparing x and y
    '''    
    # Change each vector to ranked values
    x = x.rank()
    y = y.rank()
    n = len(x)
     
    sum_vals = 0
    # Sum the sign products over every pair of observations with i < j
    for i, (x_i, y_i) in enumerate(zip(x, y)):
        for j, (x_j, y_j) in enumerate(zip(x, y)):
            if i < j:
                sum_vals += np.sign(x_i - x_j)*np.sign(y_i - y_j)
                        
    tau = 2*sum_vals/(n*(n-1))
    
    return tau

In [27]:
# This cell will test your function against the built in scipy function
assert kendalls_tau(play_data['x1'], play_data['x2']) == kendalltau(play_data['x1'], play_data['x2'])[0], 'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'.format(kendalls_tau(play_data['x1'], play_data['x2']))
assert round(kendalls_tau(play_data['x1'], play_data['x3']), 2) == kendalltau(play_data['x1'], play_data['x3'])[0], 'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'.format(kendalltau(play_data['x1'], play_data['x3'])[0], kendalls_tau(play_data['x1'], play_data['x3']))
assert round(kendalls_tau(play_data['x3'], play_data['x4']), 2) == round(kendalltau(play_data['x3'], play_data['x4'])[0], 2), 'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'.format(kendalltau(play_data['x3'], play_data['x4'])[0], kendalls_tau(play_data['x3'], play_data['x4']))
print("If this is all you see, it looks like you are all set!  Nice job coding up Kendall's Tau!")


If this is all you see, it looks like you are all set!  Nice job coding up Kendall's Tau!


`6.` Use your functions (and/or your knowledge of each of the above coefficients) to accurately identify each of the below statements as True or False.  **Note:** There may be some rounding differences due to the way numbers are stored, so it is recommended that you consider comparisons to 4 or fewer decimal places.

In [28]:
a = True
b = False
c = "We can't be sure."


corr_comp_dct = {"For all columns of play_data, Spearman and Kendall's measures match.": a,
                "For all columns of play_data, Spearman and Pearson's measures match.": b, 
                "For all columns of play_data, Pearson and Kendall's measures match.": b}

print(corr_comp_dct, "\n")
print("That's right!  Pearson does not match the other two measures, as it looks specifically for linear relationships.  However, Spearman and Kendall's measures are exactly equal to one another in the cases related to play_data.")

{"For all columns of play_data, Spearman and Kendall's measures match.": True, "For all columns of play_data, Spearman and Pearson's measures match.": False, "For all columns of play_data, Pearson and Kendall's measures match.": False} 

That's right!  Pearson does not match the other two measures, as it looks specifically for linear relationships.  However, Spearman and Kendall's measures are exactly equal to one another in the cases related to play_data.


### Distance Measures

Each of the above measures is considered a measure of correlation.  Similarly, there are distance measures (of which there are many).  [This is a great article](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/) on some popular distance metrics.  In this notebook, we will be looking specifically at two of these measures.  

1. Euclidean Distance
2. Manhattan Distance

Unlike the three correlation measures you built functions for, these two measures take on values between 0 and potentially infinity.  Values closer to 0 imply that two vectors are more similar to one another.  The larger these values become, the more dissimilar two vectors are to one another.

Choosing one of these two `distance` metrics vs. one of the three `similarity` measures above is often a matter of personal preference, audience, and data specificities.  You will see in a bit a case where one of these measures (euclidean or manhattan distance) is preferable to Pearson's correlation coefficient.

### Euclidean Distance

Euclidean distance can also just be considered as straight-line distance between two vectors.

For two vectors **x** and **y**, we can compute this as:

$$ EUC(\textbf{x}, \textbf{y}) = \sqrt{\sum\limits_{i=1}^{n}(x_i - y_i)^2}$$

### Manhattan Distance

Different from euclidean distance, Manhattan distance is a 'manhattan block' distance from one vector to another.  Therefore, you can imagine this distance as a way to compute the distance between two points when you are not able to go through buildings.

Specifically, this distance is computed as:

$$ MANHATTAN(\textbf{x}, \textbf{y}) = \sum\limits_{i=1}^{n}|x_i - y_i|$$

Using the formulas above, write one function for each measure that takes two vectors and computes the euclidean or manhattan distance between them.


<img src="https://view3f484599.udacity-student-workspaces.com/notebooks/images/distances.png">

You can see in the above image, the **blue** line gives the **Manhattan** distance, while the **green** line gives the **Euclidean** distance between two points.
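
As a quick check of the two definitions, the sketch below computes both distances between two made-up points, (0, 0) and (3, 4). The function names `euclidean` and `manhattan` are illustrative and are not the ones the exercise asks you to write.

```python
from math import sqrt

def euclidean(x, y):
    # Straight-line distance: square root of the summed squared differences
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # Block distance: sum of the absolute differences, with no square root
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

p, q = [0, 0], [3, 4]
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

The straight line covers 5 units, while walking 3 blocks across and 4 blocks up covers 7, so the Manhattan distance is never smaller than the Euclidean distance between the same two points.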

`7.` Use the below cell to complete a function for each distance metric.  Then test your functions against the built in values using the below.

In [29]:
def eucl_dist(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    euc - the euclidean distance between x and y
    '''  
    return np.linalg.norm(x - y)
    
def manhat_dist(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    manhat - the manhattan distance between x and y
    '''  
    return sum(abs(e - s) for s, e in zip(x, y))

In [30]:
# Test your functions
eucl_dist(play_data['x1'], play_data['x2'])

14.966629547095765

In [31]:
manhat_dist(play_data['x2'], play_data['x3'])

24

# 15 Solution: Measuring Similarity

# 16 Identifying Recommendation Engines

### Making Recommendations
To finalize our neighborhood based recommendations, we need to use the ratings from a user's neighbors to decide which movies to recommend to that user.

There are a few ways to do this, but a simple method would be to:
1. Remove movies our user has already seen.
2. Find ratings of the neighbors that are high.
3. Recommend movies to each user where both 1 and 2 above hold.

Other methods for making recommendations using collaborative filtering are based on weighting of the neighbors' ratings based on the 'closeness' of the neighbors.

You can use each of the following two papers to learn more about this technique:
1. [Domino Data Lab Paper](https://blog.dominodatalab.com/recommender-systems-collaborative-filtering/)
2. [Semantic Scholar Paper On Weighted Ratings](https://pdfs.semanticscholar.org/3e9e/bcd9503ef7375c7bb334511804d1e45127e9.pdf)
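
As a rough illustration of weighting by closeness, the sketch below scores a single movie by averaging the neighbors' ratings with weights that shrink as distance grows. The weight `1 / (1 + distance)` is only one possible choice and is an assumption of this sketch, not the scheme from either reference above. The neighbor names, distances, and ratings are made up.

```python
def weighted_score(neighbor_ratings, neighbor_distances):
    """Inverse-distance weighted average of the neighbors' ratings for one movie."""
    weighted_sum = 0.0
    weight_total = 0.0
    for neighbor, rating in neighbor_ratings.items():
        # Hypothetical weighting: closer neighbors (smaller distance) count more
        weight = 1.0 / (1.0 + neighbor_distances[neighbor])
        weighted_sum += weight * rating
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None

# Made-up distances from the target user to three neighbors
distances = {'ann': 0.0, 'bob': 1.0, 'cat': 3.0}
# Made-up ratings those neighbors gave to one movie
ratings_for_movie = {'ann': 9, 'bob': 7, 'cat': 2}

score = weighted_score(ratings_for_movie, distances)
print(round(score, 2))  # 7.43
```

The unweighted mean of the three ratings is 6.0. The weighted score is higher because the closest neighbor rated the movie highly and the lowest rating came from the farthest neighbor.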

In the next notebook, you will implement the three-step process above to make recommendations for every user in the dataset. For computational reasons, you will notice that iterating this approach through all users has been done for you. But you will go through the process of implementing for individual pairs of users, which could easily be extended via looping to all users.

# 17 Notebook: Collaborative Filtering

## Recommendations with MovieTweetings: Collaborative Filtering

One of the most popular methods for making recommendations is **collaborative filtering**.  In collaborative filtering, you use the interactions between users and items (here, the ratings users have given to movies) to assist in making new recommendations.  

There are two main methods of performing collaborative filtering:

1. **Neighborhood-Based Collaborative Filtering**, which is based on the idea that we can either correlate items that are similar to provide recommendations or we can correlate users to one another to provide recommendations.

2. **Model Based Collaborative Filtering**, which is based on the idea that we can use machine learning and other mathematical models to understand the relationships that exist amongst items and users to predict ratings and provide recommendations.


In this notebook, you will be working on performing **neighborhood-based collaborative filtering**.  There are two main methods for performing neighborhood-based collaborative filtering:

1. **User-based collaborative filtering:** In this type of recommendation, users related to the user you would like to make recommendations for are used to create a recommendation.

2. **Item-based collaborative filtering:** In this type of recommendation, first you need to find the items that are most related to each other item (based on similar ratings).  Then you can use the ratings of an individual on those similar items to understand if a user will like the new item.

In this notebook you will be implementing **user-based collaborative filtering**.  However, it is easy to extend this approach to make recommendations using **item-based collaborative filtering**.  First, let's read in our data and necessary libraries.

**NOTE**: Because of the size of the datasets, some of your code cells here will take a while to execute, so be patient!

In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t
from scipy.sparse import csr_matrix
from IPython.display import HTML


%matplotlib inline

# Read in the datasets
# movies = pd.read_csv('movies_clean.csv')
# reviews = pd.read_csv('reviews_clean.csv')

# del movies['Unnamed: 0']
# del reviews['Unnamed: 0']

print(reviews.head())

  user_id  movie_id  rating   timestamp                 date
0       1   0114508       8  1381006850  2013-10-05 16:00:50
1       2   0358273       9  1579057827  2020-01-14 21:10:27
2       2  10039344       5  1578603053  2020-01-09 14:50:53
3       2   6751668       9  1578955697  2020-01-13 16:48:17
4       2   7131622       8  1579559244  2020-01-20 16:27:24


### Measures of Similarity

When using **neighborhood** based collaborative filtering, it is important to understand how to measure the similarity of users or items to one another.  

There are a number of ways in which we might measure the similarity between two vectors (which might be two users or two items).  In this notebook, we will look specifically at two measures used to compare vectors:

* **Pearson's correlation coefficient**

Pearson's correlation coefficient is a measure of the strength and direction of a linear relationship. The value for this coefficient is a value between -1 and 1 where -1 indicates a strong, negative linear relationship and 1 indicates a strong, positive linear relationship. 

If we have two vectors x and y, we can define the correlation between the vectors as:


$$CORR(x, y) = \frac{\text{COV}(x, y)}{\text{STDEV}(x)\text{ }\text{STDEV}(y)}$$

where 

$$\text{STDEV}(x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

and 

$$\text{COV}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

where n is the length of the vector, which must be the same for both x and y and $\bar{x}$ is the mean of the observations in the vector.  

We can use the correlation coefficient to indicate how alike two vectors are to one another, where the closer to 1 the coefficient, the more alike the vectors are to one another.  There are some potential downsides to using this metric as a measure of similarity.  You will see some of these throughout this workbook.


* **Euclidean distance**

Euclidean distance is a measure of the straight-line distance from one vector to another.  Because this is a measure of distance, larger values are an indication that two vectors are different from one another (the opposite of Pearson's correlation coefficient, where larger values indicate more similar vectors).

Specifically, the euclidean distance between two vectors x and y is measured as:

$$ \text{EUCL}(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

Different from the correlation coefficient, no scaling is performed in the denominator.  Therefore, you need to make sure all of your data are on the same scale when using this metric.

**Note:** Because measuring similarity is often based on looking at the distance between vectors, it is important in these cases to scale your data or to have all data be in the same scale.  In this case, we will not need to scale data because they are all on a 10 point scale, but it is always something to keep in mind!
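
The effect of scale can be seen on two made-up rating vectors, where the second user gives exactly half the rating of the first on every item. This is a plain-Python sketch with illustrative function names. In `pearson`, the $\frac{1}{n-1}$ factors from COV and STDEV cancel, so they are left out.

```python
from math import sqrt

def pearson(x, y):
    # COV(x, y) / (STDEV(x) * STDEV(y)); the 1/(n-1) factors cancel
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator = sqrt(sum((xi - mean_x) ** 2 for xi in x)) * sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return numerator / denominator

def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

user_a = [2, 4, 6, 8]  # made-up ratings on a 10 point scale
user_b = [1, 2, 3, 4]  # same pattern, but on a scale half the size

print(round(pearson(user_a, user_b), 6))    # 1.0
print(round(euclidean(user_a, user_b), 3))  # 5.477

# Putting user_b on the same scale as user_a removes the distance entirely
rescaled_b = [2 * rating for rating in user_b]
print(euclidean(user_a, rescaled_b))        # 0.0
```

The correlation treats the two users as identical in taste regardless of scale, while the Euclidean distance only does so once both vectors are on the same scale.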

------------

### User-Item Matrix

In order to calculate the similarities, it is common to put values in a matrix.  In this matrix, users are identified by each row, and items are represented by columns.  


![alt text](https://view3f484599.udacity-student-workspaces.com/notebooks/images/userxitem.png)


In the above matrix, you can see that **User 1** and **User 2** both used **Item 1**, and **User 2**, **User 3**, and **User 4** all used **Item 2**.  However, there are also a large number of missing values in the matrix for users who haven't used a particular item.  A matrix with many missing values (like the one above) is considered **sparse**.

Our first goal for this notebook is to create the above matrix with the **reviews** dataset.  However, instead of 1 values in each cell, you should have the actual rating.  

The users will indicate the rows, and the movies will exist across the columns. To create the user-item matrix, we only need the first three columns of the **reviews** dataframe, which you can see by running the cell below.

In [33]:
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_items.head()
len(user_items)

854283

### Creating the User-Item Matrix

In order to create the user-items matrix (like the one above), I personally started by using a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html). 

However, I quickly ran into a memory error (a common theme throughout this notebook).  I will help you navigate around many of the errors I had, and achieve useful collaborative filtering results! 

_____

`1.` Create a matrix where the users are the rows, the movies are the columns, and the ratings exist in each cell, or a NaN exists in cells where a user hasn't rated a particular movie. If you get a memory error (like I did), [this link here](https://stackoverflow.com/questions/39648991/pandas-dataframe-pivot-memory-error) might help you!

In [35]:
# Create user-by-item matrix
try:
    user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
except Exception as e:
    print(e)

Unstacked DataFrame is too big, causing int32 overflow
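
The dense matrix needs one cell for every (user, movie) pair even though most of those cells would be NaN, which is likely why the unstack overflows here. One memory-friendly alternative, sketched below in plain Python, is to store only the ratings that exist in a dictionary of dictionaries. This is an illustration with made-up rows and a hypothetical helper name (`build_user_ratings`), not the approach the notebook's solution takes.

```python
from collections import defaultdict

def build_user_ratings(rows):
    """Map each user_id to a {movie_id: rating} dict, storing only observed ratings."""
    user_ratings = defaultdict(dict)
    for user_id, movie_id, rating in rows:
        previous = user_ratings[user_id].get(movie_id)
        # Keep the highest rating if a user rated the same movie more than once,
        # mirroring the .max() in the groupby above
        if previous is None or rating > previous:
            user_ratings[user_id][movie_id] = rating
    return dict(user_ratings)

# Made-up (user_id, movie_id, rating) rows in the same column order as user_items
toy_rows = [(1, 101, 8), (1, 102, 6), (2, 101, 7), (2, 101, 9)]
toy_ratings = build_user_ratings(toy_rows)
print(toy_ratings)  # {1: {101: 8, 102: 6}, 2: {101: 9}}
```

With this structure, the movies a user has rated are simply the keys of that user's inner dictionary, and a missing key plays the role of NaN.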


In [None]:
# assert movies.shape[0] == user_by_movie.shape[1], "Oh no! Your matrix should have {} columns, and yours has {}!".format(movies.shape[0], user_by_movie.shape[1])
# assert reviews.user_id.nunique() == user_by_movie.shape[0], "Oh no! Your matrix should have {} rows, and yours has {}!".format(reviews.user_id.nunique(), user_by_movie.shape[0])


`2.` Now that you have a matrix of users by movies, use this matrix to create a dictionary where the key is each user and the value is an array of the movies each user has rated.

In [None]:
# Create a dictionary with users and corresponding movies seen

def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    movies = user_by_movie.loc[user_id].dropna().index.values

    return movies


def create_user_movie_dict():
    '''
    INPUT: None
    OUTPUT: movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    Creates the movies_seen dictionary
    '''
    n_users = user_by_movie.shape[0]
    movies_seen = dict()

    for user1 in range(1, n_users+1):
        
        # assign list of movies to each user key
        movies_seen[user1] = movies_watched(user1)
    
    return movies_seen
    
movies_seen = create_user_movie_dict()

In [15]:
for user in [1,2,3,4]:
    print(user,":",movies_seen[user])

1 : [ 68646 113277]
2 : [ 422720  454876  790636  816711 1091191 1103275 1322269 1390411 1398426
 1431045 1433811 1454468 1535109 1675434 1798709 2017038 2024544 2294629
 2361509 2381249 2726560 2883512 3079380]
3 : [1790864 2170439 2203939]
4 : [1300854]


`3.` If a user hasn't rated more than 2 movies, we consider these users "too new".  Create a new dictionary that only contains users who have rated more than 2 movies.  This dictionary will be used for all the final steps of this workbook.

In [None]:
# Remove individuals who have watched 2 or fewer movies - don't have enough data to make recs

def create_movies_to_analyze(movies_seen, lower_bound=2):
    '''
    INPUT:  
    movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    lower_bound - (an int) a user must have more movies seen than the lower bound to be added to the movies_to_analyze dictionary

    OUTPUT: 
    movies_to_analyze - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    The movies_seen and movies_to_analyze dictionaries should be the same except that the output dictionary has removed users who have rated lower_bound or fewer movies
    
    '''
    movies_to_analyze = dict()

    for user, movies in movies_seen.items():
        if len(movies) > lower_bound:
            movies_to_analyze[user] = movies
    return movies_to_analyze

movies_to_analyze = create_movies_to_analyze(movies_seen)

In [None]:
# Run the tests below to check that your movies_to_analyze matches the solution
# assert len(movies_to_analyze) == 23512, "Oops!  It doesn't look like your dictionary has the right number of individuals."
# assert len(movies_to_analyze[2]) == 23, "Oops!  User 2 didn't match the number of movies we thought they would have."
# assert len(movies_to_analyze[7])  == 3, "Oops!  User 7 didn't match the number of movies we thought they would have."
# print("If this is all you see, you are good to go!")

### Calculating User Similarities

Now that you have set up the **movies_to_analyze** dictionary, it is time to take a closer look at the similarities between users. Below is the pseudocode for how I thought about determining the similarity between users:

```
for user1 in movies_to_analyze
    for user2 in movies_to_analyze
        see how many movies match between the two users
        if more than two movies in common
            pull the overlapping movies
            compute the distance/similarity metric between ratings on the same movies for the two users
            store the users and the distance metric
```

However, this took a very long time to run, and other methods of performing these operations did not fit on the workspace memory!
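
For reference, the pseudocode can be turned into a small runnable sketch on made-up data. It uses plain Python dictionaries rather than the notebook's dataframes, and the names `toy_ratings` and `pairwise_distances` are hypothetical. Unlike the pseudocode, it visits each pair of users once instead of twice, since the distance from user1 to user2 equals the distance from user2 to user1. The work still grows with the square of the number of users, which is why the full dataset is slow.

```python
from itertools import combinations
from math import sqrt

# Made-up ratings: user -> {movie: rating}
toy_ratings = {
    'u1': {'m1': 8, 'm2': 7, 'm3': 9, 'm4': 6},
    'u2': {'m1': 8, 'm2': 9, 'm3': 9},
    'u3': {'m1': 5, 'm4': 6},
}

def pairwise_distances(user_ratings, min_common=2):
    results = []
    for user1, user2 in combinations(sorted(user_ratings), 2):
        # See which movies both users have rated
        common = sorted(set(user_ratings[user1]) & set(user_ratings[user2]))
        # Only compare users with more than min_common movies in common
        if len(common) > min_common:
            dist = sqrt(sum((user_ratings[user1][m] - user_ratings[user2][m]) ** 2
                            for m in common))
            results.append((user1, user2, dist))
    return results

print(pairwise_distances(toy_ratings))  # [('u1', 'u2', 2.0)]
```

Only `u1` and `u2` share more than two movies, so theirs is the only pair stored.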

Therefore, rather than creating a dataframe with all possible pairings of users in our data, your task for this question is to look at a few specific examples of the correlation between ratings given by two users.  For this question, you will compute the [correlation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) between pairs of users.

`4.` Using the **movies_to_analyze** dictionary and **user_by_movie** dataframe, create a function that computes the correlation between the ratings two users gave to the movies they have both rated.  Then use your function to compare your results to ours using the tests below.  

In [None]:
def compute_correlation(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the correlation between the matching ratings between the two users
    '''
    # Pull movies for each user
    movies1 = movies_to_analyze[user1]
    movies2 = movies_to_analyze[user2]
    
    
    # Find Similar Movies
    sim_movs = np.intersect1d(movies1, movies2, assume_unique=True)
    
    # Calculate correlation between the users
    df = user_by_movie.loc[(user1, user2), sim_movs]
    corr = df.transpose().corr().iloc[0,1]
    
    return corr #return the correlation

In [None]:
# Test your function against the solution
# assert compute_correlation(2,2) == 1.0, "Oops!  The correlation between a user and itself should be 1.0."
# assert round(compute_correlation(2,66), 2) == 0.76, "Oops!  The correlation between user 2 and 66 should be about 0.76."
# assert np.isnan(compute_correlation(2,104)), "Oops!  The correlation between user 2 and 104 should be a NaN."

# print("If this is all you see, then it looks like your function passed all of our tests!")

### Why the NaN's?

If the function you wrote passed all of the tests, then you have correctly set up your function to calculate the correlation between any two users.  

`5.` But one question is, why are we still obtaining **NaN** values?  As you can see in the code cell above, users 2 and 104 have a correlation of **NaN**. Why?

Think and write your ideas here about why these NaNs exist, and use the cells below to do some coding to validate your thoughts. You can check other pairs of users and see that there are actually many NaNs in our data - 2,526,710 of them in fact. These NaN's ultimately make the correlation coefficient a less than optimal measure of similarity between two users.

```
In the denominator of the correlation coefficient, we calculate the standard deviation for each user's ratings.  The ratings for user 2 are all the same rating on the movies that match with user 104.  Therefore, the standard deviation is 0.  Because a 0 is in the denominator of the correlation coefficient, we end up with a **NaN** correlation coefficient.  Therefore, a different approach is likely better for this particular situation.
```

In [10]:
# Which movies did both user 2 and user 104 see?
set_2 = set(movies_to_analyze[2])
set_104 = set(movies_to_analyze[104])
set_2.intersection(set_104)

{454876, 816711, 1454468, 1535109}

In [11]:
# What were the ratings for each user on those movies?
common_movies = list(set_2.intersection(set_104))  # recent pandas versions reject a set as an indexer
print(user_by_movie.loc[2, common_movies])
print(user_by_movie.loc[104, common_movies])

movie_id
454876     8.0
1454468    8.0
1535109    8.0
816711     8.0
Name: 2, dtype: float64
movie_id
454876     9.0
1454468    7.0
1535109    9.0
816711     7.0
Name: 104, dtype: float64
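
The ratings printed above are enough to reproduce the problem by hand. The sketch below uses plain Python with illustrative function names. Plain Python would raise a `ZeroDivisionError` on 0/0, so the helper returns NaN explicitly to mimic what the dataframe correlation reports.

```python
from math import sqrt

# Ratings on the four shared movies, taken from the output above
ratings_user_2 = [8.0, 8.0, 8.0, 8.0]
ratings_user_104 = [9.0, 7.0, 9.0, 7.0]

def pearson_or_nan(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    x_diffs = [xi - mean_x for xi in x]
    y_diffs = [yi - mean_y for yi in y]
    numerator = sum(dx * dy for dx, dy in zip(x_diffs, y_diffs))
    denominator = sqrt(sum(dx ** 2 for dx in x_diffs)) * sqrt(sum(dy ** 2 for dy in y_diffs))
    if denominator == 0:
        # One of the users has no spread in their ratings, so the ratio is 0/0
        return float('nan')
    return numerator / denominator

def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

corr = pearson_or_nan(ratings_user_2, ratings_user_104)
dist = euclidean(ratings_user_2, ratings_user_104)
print(corr)  # nan
print(dist)  # 2.0
```

User 2's ratings never move away from their mean of 8, so the correlation is undefined, while the Euclidean distance between the same two sets of ratings is a well-defined 2.0.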


`6.` Because the correlation coefficient proved to be less than optimal for relating user ratings to one another, we could instead calculate the euclidean distance between the ratings.  I found [this post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) particularly helpful when I was setting up my function.  This function should be very similar to your previous function.  When you feel confident with your function, test it against our results.

In [None]:
def compute_euclidean_dist(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the euclidean distance between user1 and user2
    '''
    # Pull movies for each user
    movies1 = movies_to_analyze[user1]
    movies2 = movies_to_analyze[user2]
    
    
    # Find Similar Movies
    sim_movs = np.intersect1d(movies1, movies2, assume_unique=True)
    
    # Calculate euclidean distance between the users
    df = user_by_movie.loc[(user1, user2), sim_movs]
    dist = np.linalg.norm(df.loc[user1] - df.loc[user2])
    
    return dist #return the euclidean distance

In [None]:
# Read in solution euclidean distances
import pickle
df_dists = pd.read_pickle("data/Term2/recommendations/lesson1/data/dists.p")

### Using the Nearest Neighbors to Make Recommendations

In the previous question, you read in **df_dists**. Therefore, you have a measure of distance between each user and every other user. This dataframe holds every possible pairing of users, as well as the corresponding euclidean distance.

Because of the **NaN** values that exist within the correlations of the matching ratings for many pairs of users, as we discussed above, we will proceed using **df_dists**. You will want to find the users that are 'nearest' each user.  Then you will want to find the movies the closest neighbors have liked to recommend to each user.

I made use of the following objects:

* df_dists (to obtain the neighbors)
* user_items (to obtain the movies the neighbors and users have rated)
* movies (to obtain the names of the movies)

`7.` Complete the functions below, which allow you to find the recommendations for any user.  There are five functions which you will need:

* **find_closest_neighbors** - this returns a list of user_ids from closest neighbor to farthest neighbor using euclidean distance


* **movies_liked** - returns an array of movie_ids


* **movie_names** - takes the output of movies_liked and returns a list of movie names associated with the movie_ids


* **make_recommendations** - takes a user id and goes through closest neighbors to return a list of movie names as recommendations


* **all_recommendations** - loops through every user and returns a dictionary with the key as a user_id and the value as a list of movie recommendations

In [None]:
def find_closest_neighbors(user):
    '''
    INPUT:
        user - (int) the user_id of the individual you want to find the closest users
    OUTPUT:
        closest_neighbors - an array of the id's of the users sorted from closest to farthest away
    '''
    # Sort by distance and drop the first row, which is presumably the user paired with themselves
    # I treated ties as arbitrary and just kept whatever order the sort returned
    # You might choose to do something less hand wavy
    
    closest_users = df_dists[df_dists['user1']==user].sort_values(by='eucl_dist').iloc[1:]['user2']
    closest_neighbors = np.array(closest_users)
    
    return closest_neighbors
    
    
    
def movies_liked(user_id, min_rating=7):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    min_rating - the minimum rating a movie needs to count as a "like" rather than a "dislike"
    OUTPUT:
    movies_liked - an array of movies the user has watched and liked
    '''
    movies_liked = np.array(user_items.query('user_id == @user_id and rating >= @min_rating')['movie_id'])
    
    return movies_liked


def movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    
    '''
    movie_lst = list(movies[movies['movie_id'].isin(movie_ids)]['movie'])
   
    return movie_lst
    
    
def make_recommendations(user, num_recs=10):
    '''
    INPUT:
        user - (int) a user_id of the individual you want to make recommendations for
        num_recs - (int) number of movies to return
    OUTPUT:
        recommendations - a list of movies - if there are "num_recs" recommendations return this many
                          otherwise return the total number of recommendations available for the "user"
                          which may just be an empty list
    '''
    # I wanted to make recommendations by pulling different movies than the user has already seen
    # Go in order from closest to farthest to find movies you would recommend
    # I also only considered movies the neighbor liked (rated at least min_rating in movies_liked, 7 by default)
    
    # movies_seen by user (we don't want to recommend these)
    movies_seen = movies_watched(user)
    closest_neighbors = find_closest_neighbors(user)
    
    # Keep the recommended movies here
    recs = np.array([])
    
    # Go through the neighbors and identify movies they like the user hasn't seen
    for neighbor in closest_neighbors:
        neighbs_likes = movies_liked(neighbor)
        
        #Obtain recommendations for each neighbor
        new_recs = np.setdiff1d(neighbs_likes, movies_seen, assume_unique=True)
        
        # Update recs with new recs
        recs = np.unique(np.concatenate([new_recs, recs], axis=0))
        
        # If we have enough recommendations exit the loop
        if len(recs) > num_recs-1:
            break
    
    # Pull movie titles using movie ids
    recommendations = movie_names(recs)
    
    return recommendations

def all_recommendations(num_recs=10):
    '''
    INPUT 
        num_recs (int) the (max) number of recommendations for each user
    OUTPUT
        all_recs - a dictionary where each key is a user_id and the value is an array of recommended movie titles
    '''
    
    # All the users we need to make recommendations for
    users = np.unique(df_dists['user1'])
    n_users = len(users)
    
    #Store all recommendations in this dictionary
    all_recs = dict()
    
    # Make the recommendations for each user
    for user in users:
        all_recs[user] = make_recommendations(user, num_recs)
    
    return all_recs

all_recs = all_recommendations(10)

### Now What?

If you made it this far, you have successfully implemented a solution to making recommendations using collaborative filtering. 

`8.` Let's do a quick recap of the steps taken to obtain recommendations using collaborative filtering.  

In [None]:
a = "pearson's correlation and spearman's correlation"
b = 'item based collaborative filtering'
c = "there were too many ratings to get a stable metric"
d = 'user based collaborative filtering'
e = "euclidean distance and pearson's correlation coefficient"
f = "manhattan distance and euclidean distance"
g = "spearman's correlation and euclidean distance"
h = "the spread in some ratings was zero"
i = 'content based recommendation'

sol_dict = {
    'The type of recommendation system implemented here was a ...': d,
    'The two methods used to estimate user similarity were: ': e,
    'There was an issue with using the correlation coefficient.  What was it?': h
}

Additionally, let's take a closer look at some of the results.  There are two solution files that you read in to check your results, and you created these objects:

* **df_dists** - a dataframe of user1, user2, euclidean distance between the two users
* **all_recs_sol** - a dictionary of all recommendations (key = user, value = list of recommendations)  

`9.` Use these two objects along with the cells below to correctly fill in the dictionary below and complete this notebook!

In [None]:
a = 567
b = 1503
c = 1319
d = 1325
e = 2526710
f = 0
g = 'Use another method to make recommendations - content based, knowledge based, or model based collaborative filtering'

sol_dict2 = {
    'For how many pairs of users were we not able to obtain a measure of similarity using correlation?': e,
    'For how many pairs of users were we not able to obtain a measure of similarity using euclidean distance?': f,
    'For how many users were we unable to make any recommendations for using collaborative filtering?': c,
    'For how many users were we unable to make 10 recommendations for using collaborative filtering?': d,
    'What might be a way for us to get 10 recommendations for every user?': g   
}


In [21]:
# Users without recs
users_without_recs = []
for user, movie_recs in all_recs.items():
    if len(movie_recs) == 0:
        users_without_recs.append(user)
    
len(users_without_recs)

1319

In [22]:
# NaN euclidean distance values
df_dists['eucl_dist'].isnull().sum()

0

In [23]:
# Users with fewer than 10 recs
users_with_less_than_10recs = []
for user, movie_recs in all_recs.items():
    if len(movie_recs) < 10:
        users_with_less_than_10recs.append(user)
    
len(users_with_less_than_10recs)

1325

# 18 Solution: Collaborative Filtering

# 19 Solution: Collaborative Filtering

# 20 Ways to Recommend: Content Based


### Content Based Recommendations
In the previous notebook, you implemented collaborative filtering to make recommendations based on finding similar users. However, there were a number of users who did not receive recommendations when using this technique. For this reason, we might use other recommendation techniques to find items to recommend to users.

Another popular technique for making recommendations is called content based recommendations. In this recommendation technique, we use information that is known about the user or item to make recommendations. This method of making recommendations is particularly useful when we do not have a lot of user-item connections available in our dataset.

It might be the case that content based and collaborative filtering based techniques come up with similar recommendations, but the methods by which data scientists approach these recommendations are very different. In collaborative filtering, you are using the connections of users and items (as you did before). In content based techniques, you are using information about the users and items, but not connections (hence the usefulness when you do not have a lot of internal data already available to use).

Let's take a look at how we might implement this method with the MovieTweetings data.

# 21 Notebook: Content Based

### Content Based Recommendations

In the previous notebook, you were introduced to a way to make recommendations using collaborative filtering.  However, this technique left a large number of users without any recommendations at all.  Other users received fewer than the ten recommendations our function was set up to retrieve.

In order to help these users out, let's try another technique: **content based** recommendations.  Let's start off where we were in the previous notebook.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from IPython.display import HTML
import progressbar
import tests as t
import pickle


%matplotlib inline

path = "data/"

# Read in the datasets
movies = pd.read_csv(path + 'movies_clean.csv')
reviews = pd.read_csv(path + 'reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']


# Use a context manager so the pickle file is closed after loading
with open(path + "all_recs.p", "rb") as f:
    all_recs = pickle.load(f)



### Datasets

From the above, you now have access to three important items that you will be using throughout the rest of this notebook.  

`a.` **movies** - a dataframe of all of the movies in the dataset along with other content related information about the movies (genre and date)


`b.` **reviews** - this was the main dataframe used before for collaborative filtering, as it contains all of the interactions between users and movies.


`c.` **all_recs** - a dictionary where each key is a user, and the value is a list of movie recommendations based on collaborative filtering

We don't really need to worry about the individuals in **all_recs** who did receive 10 recommendations using collaborative filtering.  However, there were a number of individuals in our dataset who did not receive any recommendations.

-----

`1.` Let's start with finding all of the users in our dataset who didn't get all 10 recommendations we would have liked them to have using collaborative filtering.  

In [3]:
users_with_all_recs = []
for user, movie_recs in all_recs.items():
    if len(movie_recs) > 9:
        users_with_all_recs.append(user)

print("There are {} users with all recommendations from collaborative filtering.".format(len(users_with_all_recs)))

users = np.unique(reviews['user_id'])
users_who_need_recs = np.setdiff1d(users, users_with_all_recs)

print("There are {} users who still need recommendations.".format(len(users_who_need_recs)))
print("This means that only {}% of users received all 10 of their recommendations using collaborative filtering".format(round(len(users_with_all_recs)/len(np.unique(reviews['user_id'])), 4)*100))   

There are 22187 users with all recommendations from collaborative filtering.
There are 31781 users who still need recommendations.
This means that only 41.11% of users received all 10 of their recommendations using collaborative filtering


### Content Based Recommendations

You will be doing a bit of a mix of content and collaborative filtering to make recommendations for the users this time.  This will allow you to obtain recommendations in many cases where we didn't make recommendations earlier.     

`2.` Before finding recommendations, rank the user's ratings from highest to lowest. You will move through the movies in this order looking for other similar movies.

In [4]:
# create a dataframe similar to reviews, but ranked by rating for each user
ranked_reviews = reviews.sort_values(by=['user_id', 'rating'], ascending=False)
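
To see what this ordering does, here is a minimal sketch on a toy dataframe. The values are made up for illustration, and `toy_reviews` and `toy_ranked` are hypothetical names that are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the reviews dataframe (values are made up for illustration)
toy_reviews = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2],
    'movie_id': [10, 11, 12, 10, 13],
    'rating':   [7, 10, 8, 9, 6],
})

# Same call as in the cell above: sort by user, then by rating, both descending
toy_ranked = toy_reviews.sort_values(by=['user_id', 'rating'], ascending=False)

# Within each user, movies now appear from highest to lowest rating
user1_order = toy_ranked[toy_ranked['user_id'] == 1]['movie_id'].tolist()
print(user1_order)  # prints [11, 12, 10]
```

Because both columns are sorted in descending order, the users themselves also appear from the largest `user_id` to the smallest. That does not matter here, since you only ever look at one user's rows at a time.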

### Similarities

In the collaborative filtering sections, you became quite familiar with different methods of determining the similarity (or distance) of two users.  We can perform similarities based on content in much the same way.  

In many cases, it turns out that one of the fastest ways we can find out how similar items are to one another (when our matrix isn't totally sparse like it was in the earlier section) is by simply using matrix multiplication.  If you are not familiar with this, an explanation is available [here by 3blue1brown](https://www.youtube.com/watch?v=LyGKycYT2v0) and another quick explanation is provided [on the post here](https://math.stackexchange.com/questions/689022/how-does-the-dot-product-determine-similarity).

For us to pull out a matrix that describes the movies in our dataframe in terms of content, we might just use the indicator variables related to **year** and **genre** for our movies.  

Then we can obtain a matrix of how similar movies are to one another by taking the dot product of this matrix with itself.  Notice in the below that the dot product where our 1 values overlap gives a value of 2 indicating higher similarity.  In the second dot product, the 1 values don't match up.  This leads to a dot product of 0 indicating lower similarity.

<img src="https://view3f484599.udacity-student-workspaces.com/notebooks/images/dotprod1.png" alt="Dot Product" height="500" width="500">
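
The same idea can be sketched in a few lines of numpy. The indicator vectors below are hypothetical (the feature order `[comedy, drama, 1900s, 2000s]` is an assumption for illustration, not the actual column layout of **movies**).

```python
import numpy as np

# Hypothetical indicator vectors: [comedy, drama, 1900s, 2000s]
movie_a = np.array([1, 0, 0, 1])
movie_b = np.array([1, 0, 0, 1])   # same genre and century as movie_a
movie_c = np.array([0, 1, 1, 0])   # no features in common with movie_a

# Each matching pair of 1's contributes 1 to the dot product
overlap_ab = int(movie_a.dot(movie_b))
overlap_ac = int(movie_a.dot(movie_c))
print(overlap_ab, overlap_ac)  # prints 2 0
```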

We can perform the dot product on a matrix of movies with content characteristics to provide a movie by movie matrix where each cell is an indication of how similar two movies are to one another.  In the below image, you can see that movies 1 and 8 are most similar, movies 2 and 8 are most similar and movies 3 and 9 are most similar for this subset of the data.  The diagonal elements of the matrix contain the similarity of a movie with itself, which is the largest possible similarity for that movie (and equals the number of 1's in the movie's row of the original movie content matrix).

<img src="https://view3f484599.udacity-student-workspaces.com/notebooks/images/moviemat.png" alt="Dot Product" height="500" width="500">
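
Here is a minimal sketch of the movie by movie matrix on a made-up content matrix. `toy_content` and `toy_sim` are hypothetical names, and the four rows are not real movies from the dataset.

```python
import numpy as np

# Hypothetical content matrix: rows are movies, columns are indicator features
toy_content = np.array([
    [1, 0, 1, 0],   # movie 0
    [1, 0, 1, 0],   # movie 1: identical content to movie 0
    [0, 1, 1, 0],   # movie 2: shares one feature with movies 0 and 1
    [0, 1, 0, 1],   # movie 3
])

# Dot product of the matrix with its own transpose gives movie x movie similarity
toy_sim = toy_content.dot(toy_content.T)

# The diagonal holds each movie's similarity with itself,
# which equals the number of 1's in that movie's row
assert np.array_equal(np.diag(toy_sim), toy_content.sum(axis=1))
print(toy_sim)
```

One practical note: a 31245 x 31245 matrix has roughly 976 million entries. Assuming 8-byte integers (the actual dtype depends on how the dummy columns were read in), that is close to 8 GB of memory, so you may need a machine with plenty of RAM or a smaller dtype.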


`3.` Create a numpy array that is a matrix of indicator variables related to year (by century) and movie genres by movie.  Perform the dot product of this matrix with itself (transposed) to obtain a similarity matrix of each movie with every other movie.  The final matrix should be 31245 x 31245.

In [5]:
# Subset so movie_content is only using the dummy variables for each genre and the 3 century based year dummy columns
movie_content = np.array(movies.iloc[:,4:])

# Take the dot product to obtain a movie x movie matrix of similarities
dot_prod_movies = movie_content.dot(np.transpose(movie_content))

### For Each User...


You now have a dataframe in which each user's ratings are ordered from highest to lowest.  You also have a matrix where movies are on each axis, and the entries are larger where two movies are more similar and smaller where they are dissimilar.  This matrix is a measure of content similarity.  Now it is time to get to the fun part.

For each user, we will perform the following:

    i. For each movie, find the movies that are most similar that the user hasn't seen.

    ii. Continue through the available, rated movies until 10 recommendations or until there are no additional movies.

As a final note, you may need to adjust the criteria for 'most similar' to obtain 10 recommendations.  As a first pass, I used only movies with the highest possible similarity to one another as similar enough to add as a recommendation.
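
If the strictest criterion leaves too many users short of 10 recommendations, one way to loosen it is to take the `k` most similar movies instead of only those tied for the maximum. The sketch below is an assumption about how that could look, not the approach used in the solution; `top_k_similar` is a hypothetical helper and `toy_sim` is a made-up similarity matrix.

```python
import numpy as np

def top_k_similar(sim_matrix, movie_idx, k=3):
    """Return indices of the k movies most similar to movie_idx, excluding itself."""
    # astype makes a float copy, so the original matrix is left untouched
    scores = sim_matrix[movie_idx].astype(float)
    scores[movie_idx] = -np.inf          # never recommend the movie itself
    # Stable sort on negated scores: highest similarity first, ties by lower index
    order = np.argsort(-scores, kind='stable')
    return order[:k]

# Made-up movie x movie similarity matrix
toy_sim = np.array([
    [2, 2, 1, 0],
    [2, 2, 1, 0],
    [1, 1, 2, 1],
    [0, 0, 1, 2],
])

print(top_k_similar(toy_sim, 0, k=2))
```

The trade-off is that a larger `k` gives more recommendations but lets in movies that share fewer genres or a different time period with the one the user liked.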

`3.` In the below cell, complete each of the functions needed for making content based recommendations.

In [6]:
def find_similar_movies(movie_id):
    '''
    INPUT
    movie_id - a movie_id 
    OUTPUT
    similar_movies - an array of the most similar movies by title
    '''
    # find the row of each movie id
    movie_idx = np.where(movies['movie_id'] == movie_id)[0][0]
    
    # find the most similar movie indices - to start I said they need to be the same for all content
    similar_idxs = np.where(dot_prod_movies[movie_idx] == np.max(dot_prod_movies[movie_idx]))[0]
    
    # pull the movie titles based on the indices
    similar_movies = np.array(movies.iloc[similar_idxs, ]['movie'])
    
    return similar_movies
    
    
def get_movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    
    '''
    movie_lst = list(movies[movies['movie_id'].isin(movie_ids)]['movie'])
   
    return movie_lst

def make_recs():
    '''
    INPUT
    None
    OUTPUT
    recs - a dictionary with keys of the user and values of the recommendations
    '''
    # Create dictionary to return with users and recommendations
    recs = defaultdict(set)
    # How many users for progress bar
    n_users = len(users)

    
    # Create the progressbar
    cnter = 0
    bar = progressbar.ProgressBar(maxval=n_users+1, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    
    # For each user
    for user in users:
        
        # Update the progress bar
        cnter+=1 
        bar.update(cnter)

        # Pull only the reviews the user has seen
        reviews_temp = ranked_reviews[ranked_reviews['user_id'] == user]
        movies_temp = np.array(reviews_temp['movie_id'])
        movie_names = np.array(get_movie_names(movies_temp))

        # Look at each of the movies (highest ranked first), 
        # pull the movies the user hasn't seen that are most similar
        # These will be the recommendations - continue until 10 recs 
        # or you have depleted the movie list for the user
        for movie in movies_temp:
            rec_movies = find_similar_movies(movie)
            temp_recs = np.setdiff1d(rec_movies, movie_names)
            recs[user].update(temp_recs)

            # If there are at least 10 recommendations, stop looking
            if len(recs[user]) > 9:
                break

    bar.finish()
    
    return recs

In [7]:
recs = make_recs()



### How Did We Do?

Now that you have made the recommendations, how did we do in providing everyone with a set of recommendations?

`4.` Use the cells below to see how many individuals you were able to make recommendations for, as well as explore characteristics about individuals who you were not able to make recommendations for.  

In [8]:
# Explore recommendations
users_without_all_recs = []
users_with_all_recs = []
no_recs = []
for user, movie_recs in recs.items():
    if len(movie_recs) < 10:
        users_without_all_recs.append(user)
    if len(movie_recs) > 9:
        users_with_all_recs.append(user)
    if len(movie_recs) == 0:
        no_recs.append(user)

In [9]:
# Some characteristics of my content based recommendations
print("There were {} users without all 10 recommendations we would have liked to have.".format(len(users_without_all_recs)))
print("There were {} users with all 10 recommendations we would like them to have.".format(len(users_with_all_recs)))
print("There were {} users with no recommendations at all!".format(len(no_recs)))

There were 2179 users without all 10 recommendations we would have liked to have.
There were 51789 users with all 10 recommendations we would like them to have.
There were 174 users with no recommendations at all!


In [10]:
# Closer look at individual user characteristics
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    movies = user_by_movie.loc[user_id][user_by_movie.loc[user_id].notnull()].index.values

    return movies


movies_watched(189)

array([457430])

In [11]:
cnter = 0
print("Some of the movie lists for users without any recommendations include:")
for user_id in no_recs:
    print(user_id)
    print(get_movie_names(movies_watched(user_id)))
    cnter+=1
    if cnter > 10:
        break

Some of the movie lists for users without any recommendations include:
189
['El laberinto del fauno (2006)']
797
['The 414s (2015)']
1603
['Beauty and the Beast (2017)']
2056
['Brimstone (2016)']
2438
['Baby Driver (2017)']
3322
['Rosenberg (2013)']
3925
['El laberinto del fauno (2006)']
4325
['Beauty and the Beast (2017)']
4773
['The Frozen Ground (2013)']
4869
['Beauty and the Beast (2017)']
4878
['American Made (2017)']


### Now What?  

Well, if you were really strict with your criteria for how similar two movies are (like I was initially), then you still have some users that don't have all 10 recommendations (and a small group of users who have no recommendations at all). 

As stated earlier, recommendation engines are a bit of an **art** and a **science**.  There are a number of things we still could look into - how do our collaborative filtering and content based recommendations compare to one another? How could we incorporate user input along with collaborative filtering and/or content based recommendations to improve any of our recommendations?  How can we truly gain recommendations for every user?

`5.` In this last step feel free to explore any last ideas you have with the recommendation techniques we have looked at so far.  You might choose to make the final needed recommendations using the first technique with just top ranked movies.  You might also loosen up the strictness in the similarity needed between movies.  Be creative and share your insights with your classmates!
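
As one possible starting point for the first idea, here is a minimal sketch that tops up a user's list with top ranked movies they have not seen. `fill_with_popular` is a hypothetical helper, the titles are made up, and it assumes you already have a list of titles ordered from most to least popular.

```python
def fill_with_popular(user_recs, popular_titles, seen_titles, num_recs=10):
    """Top up a user's recommendations with popular titles they have not seen."""
    filled = list(user_recs)
    for title in popular_titles:
        # Stop as soon as the user has enough recommendations
        if len(filled) >= num_recs:
            break
        # Skip titles already recommended or already watched
        if title not in filled and title not in seen_titles:
            filled.append(title)
    return filled

# Made-up titles for illustration, ordered from most to least popular
popular = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
result = fill_with_popular(['Movie C'], popular, seen_titles={'Movie A'}, num_recs=3)
print(result)  # prints ['Movie C', 'Movie B', 'Movie D']
```

The existing content based recommendations stay at the front of the list, so the fallback only affects the slots that would otherwise be empty.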

# 22 Solution: Content Based

# 23 Three Types of Recommendation Systems

### Three Main Branches of Recommendations
- Knowledge Based
- Collaborative Filtering
- Content Based

### Knowledge Based Recommendations
Use knowledge about items that meet user specifications to recommend items.

Common for luxury items.

### Collaborative Filtering
A method of making recommendations based on using the collaboration of user-item interactions.

Most common recommendation method

### Content Based
Use information about items to find item similarities. Often the similarities are related to item descriptions or purpose.

# 24 More Recommendation Techniques

### Three Types of Recommendations
There are three methods that you have now implemented for making recommendations. These are the three most recognized methods in industry:

`1.` Knowledge Based Recommendations

Knowledge based recommendations frequently are implemented using filters, and are extremely common amongst luxury based goods. Filters that you might see when purchasing items like cars or homes are examples of knowledge based recommendations. In knowledge based recommendations, users provide information about the types of recommendations they would like back.

`2.` Collaborative Filtering Based Recommendations

Collaborative filtering uses the connections between users and items to make recommendations. Even the content based recommendation you just implemented used some collaborative filtering techniques, as you were not treating items and users independently of one another. In this lesson, you used neighborhood based collaborative filtering to find users who were alike and then recommend new movies based on these similar users.

Even in the content based recommendation, you were using collaborative filtering. You were finding items that were similar and making recommendations of new items based on the highest ratings of a user. Because you were still using the user ratings of an item, this was an example of a blend between content and collaborative filtering based techniques.

`3.` Content Based Recommendations

In the previous notebook, you created a matrix of similarities between items (movies) based only on the content related to those movies (year and genre). The similarity matrix was created using only the items (movies); no information about the users was used. For any movie, you would be able to determine the most related additional movies based only on the genre and the year of the movie. This is the premise of how a completely content based recommendation would be made.

Often blended techniques of all three types are used in practice to provide the best recommendation for a particular circumstance.
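
Of the three methods summarized above, the knowledge based filter is the simplest to sketch. The example below is a minimal illustration under assumptions: the catalogue, its column names (`genre`, `year`, `avg_rating`) and the `knowledge_based_recs` helper are all made up and are not the lesson's data or functions.

```python
import pandas as pd

# Made-up catalogue for illustration; column names are assumptions
toy_movies = pd.DataFrame({
    'movie':      ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'genre':      ['Comedy', 'Drama', 'Comedy', 'Comedy'],
    'year':       [1999, 2015, 2016, 2017],
    'avg_rating': [8.1, 9.0, 7.2, 8.8],
})

def knowledge_based_recs(catalogue, genre, min_year, num_recs=2):
    """Filter by the user's stated preferences, then rank by average rating."""
    matches = catalogue[(catalogue['genre'] == genre) & (catalogue['year'] >= min_year)]
    ranked = matches.sort_values('avg_rating', ascending=False)
    return ranked['movie'].head(num_recs).tolist()

kb_recs = knowledge_based_recs(toy_movies, genre='Comedy', min_year=2010)
print(kb_recs)  # prints ['Movie D', 'Movie C']
```

Notice that no ratings from the requesting user are needed at all, which is why a filter like this can serve brand new users.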

There are still more advanced techniques that are related to the methods that you learned about here, and they will most likely fall in one of the three buckets below.

- [AirBnB uses embeddings in their recommendation, which you can read more about here.](https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e)

![Cali](https://video.udacity-data.com/topher/2018/August/5b6f3e96_screen-shot-2018-08-11-at-12.52.21-pm/screen-shot-2018-08-11-at-12.52.21-pm.png)

- [As our smart phones become more addictive every day, it is easy to see why location based recommendations will be more and more popular. You can read more about these types of recommendations here.](https://link.springer.com/referenceworkentry/10.1007%2F978-3-319-17885-1_1580)

![Location](https://video.udacity-data.com/topher/2018/August/5b6f3eb7_screen-shot-2018-08-11-at-12.52.03-pm/screen-shot-2018-08-11-at-12.52.03-pm.png)

- [Many companies are also exploring deep learning use cases in recommendation systems.](https://ebaytech.berlin/deep-learning-for-recommender-systems-48c786a20e1a)

![Usernet](https://video.udacity-data.com/topher/2018/August/5b6f3f18_screen-shot-2018-08-11-at-12.54.48-pm/screen-shot-2018-08-11-at-12.54.48-pm.png)


# 25 Quiz: Recommendation Methods

### Self Check
Now that you have gained exposure to each of the different types of recommendation techniques, let's see if you can tell what type of recommendation engine is being used in each of the below cases.

You are interested in purchasing a new shirt. You go through and look at a number of shirts online. The next day while you are surfing the web, an advertisement pops up that says other users who liked the shirts you were looking at also liked these shirts (with new shirts shown). This is an example of `Collaborative Filtering`

You are interested in purchasing a new shirt. You go through and look at a number of shirts online. The next day while you are surfing the web, an advertisement pops up that says here are some other items that were like the items you viewed yesterday. This is an example of `Content Based`

You are interested in purchasing a new shirt. You know you want a red shirt. You use a filter to select that you only want red shirts. This is an example of `Knowledge Based`

# 26 Types of Ratings


### Different Scales
If you are in control of choosing your rating scale, think of what might be most beneficial to your scenario. If you are working alongside a team to design the interfaces for how data will be collected, there are a number of ideas to keep in mind.
- Do you need to ask questions of your user or can you collect data about their interactions with items?
- If you need to ask questions, how many do you ask?
- How do you word the questions?
- And finally, the question in the above video: what type of scale should you use?

In general, I suggest using the simplest rating that allows you to get whatever questions of interest you have, but there are some important ideas to keep in mind when choosing a particular type of rating. Ratings are a necessary part of working with different recommendation systems, but they aren't a central part of our focus. A good overview of types of ratings and when to use them is also provided [here](https://conversionxl.com/blog/survey-response-scales/).

Most of these ideas are specific to your use case, and are easy to notice in hindsight. It is simply important to think of this in advance and not completely gloss over possible issues with the data you are collecting and how it connects to the questions you want answered.

# 27 Goals of Recommendation Systems

### Business Goals of Recommendations
In general, recommendations are important because they are often central to driving revenue for a company. In the new world of the Internet, you don't always need a friend to personally recommend you an item you never heard of - instead, the recommendations can do this.

There are ultimately 4 goals to keep in mind when performing recommendations:
1. Relevance
2. Novelty
3. Serendipity
4. Diversity

Often people only think of the first goal, and it is a good reminder that focusing only on **relevance** can lead to drops in use, and therefore loss of revenue. We also need to consider the other three goals, so that recommendations can be most effective in driving sales. In this [article](https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff), Lab41 shows how they worked to quantify these metrics.
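
To make one of these goals concrete, here is a minimal sketch of a diversity score for a single recommendation list. It uses the mean pairwise Jaccard distance between item indicator vectors, which is one possible way to quantify diversity and is not claimed to be the metric used in the linked article. `intra_list_diversity` is a hypothetical name and the feature vectors are made up.

```python
import numpy as np
from itertools import combinations

def intra_list_diversity(item_features):
    """Mean pairwise Jaccard distance between the items in one recommendation list."""
    distances = []
    for a, b in combinations(item_features, 2):
        union = np.logical_or(a, b).sum()
        inter = np.logical_and(a, b).sum()
        # Two items with no features at all are treated as identical (distance 0)
        distances.append(0.0 if union == 0 else 1.0 - inter / union)
    # A list with fewer than two items has no pairs, so report 0.0
    return float(np.mean(distances)) if distances else 0.0

# Rows are recommended items, columns are hypothetical genre indicators
same_genre = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0]])
mixed = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

diversity_same = intra_list_diversity(same_genre)
diversity_mixed = intra_list_diversity(mixed)
print(diversity_same, diversity_mixed)  # prints 0.0 1.0
```

A list that scores near 0 may be highly relevant but repetitive, which is the situation the paragraph above warns about.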

# 28 Quiz: Types of Ratings & Goals of Recommendation Systems

You want to make sure that when a user gives you an item rating, you are able to clearly identify whether they like the item or not. Which of the following are possible rating scales for this situation?
- 1 for like, 0 for dislike
- Scale 1 - 4
- With odd valued scales, a user may choose to be undecided.



# 29 Outro

# 30 Recap

### Types of Recommendations
In this lesson, you worked with the MovieTweetings data to apply each of the three methods of recommendations:
1. Knowledge Based Recommendations
2. Collaborative Filtering Based Recommendations
3. Content Based Recommendations

Within Collaborative Filtering, there are two main branches:
1. Model Based Collaborative Filtering
2. Neighborhood Based Collaborative Filtering

In this lesson, you implemented Neighborhood Based Collaborative Filtering. In the next lesson, you will implement Model Based Collaborative Filtering.

### Similarity Metrics
In order to implement Neighborhood Based Collaborative Filtering, you were introduced to and applied a few techniques to assess how similar or distant two users were from one another:
1. Pearson's correlation coefficient
2. Spearman's correlation coefficient
3. Kendall's Tau
4. Euclidean Distance
5. Manhattan Distance
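
As a quick refresher, several of these metrics can be sketched with numpy alone. The ratings below are made up, `ranks` is a hypothetical helper that assumes no tied ratings, and Kendall's Tau is left out to keep the sketch short. The last lines reproduce the issue you ran into with correlation: when a user's ratings have zero spread, the coefficient is undefined.

```python
import numpy as np

# Made-up ratings from two users on the same five movies
user_x = np.array([10.0, 8.0, 6.0, 9.0, 7.0])
user_y = np.array([9.0, 7.0, 5.0, 10.0, 6.0])

def ranks(values):
    """Rank values from 1..n (assumes no ties, which keeps the sketch short)."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

pearson = np.corrcoef(user_x, user_y)[0, 1]
spearman = np.corrcoef(ranks(user_x), ranks(user_y))[0, 1]   # Pearson on the ranks
euclidean = np.linalg.norm(user_x - user_y)
manhattan = np.abs(user_x - user_y).sum()

print(round(pearson, 3), round(spearman, 3), round(euclidean, 3), manhattan)

# A user who gives every movie the same rating has zero spread,
# so the correlation divides by zero and comes back as NaN
flat_user = np.array([8.0, 8.0, 8.0, 8.0, 8.0])
with np.errstate(invalid='ignore', divide='ignore'):
    zero_spread_corr = np.corrcoef(flat_user, user_y)[0, 1]
print(np.isnan(zero_spread_corr))
```

Distances such as Euclidean and Manhattan are still defined for the flat user, which is one reason the notebook was able to obtain a value for every pair of users with euclidean distance.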

### Types of Ratings
We took a quick look at different types of ratings:
1. Did the user interact with an item or not.
2. Did the user like an item or not.
3. More granular scales 1-7, 1-10, etc.

It is important to understand what the data might be used for, and what type of granularity might be important for a particular case. One of the main considerations is whether you want to have neutrality available, in which case an odd number of possible values in your scale will provide a value in the middle. Another common question is, how many levels do you really need to understand how much a user likes a particular product? Again, this is largely up to individual preference and specific use cases.

### Business Cases For Recommendations
Finally, you looked at the four ideas needed for businesses to implement successful recommendations to drive revenue, which include:
1. Relevance
2. Novelty
3. Serendipity
4. Increased Diversity

At the end of this lesson, you will have gained a ton of skills to build upon or to start creating your own recommendations in practice.

### Next Lesson
In the upcoming lesson, we will take a closer look at model based collaborative filtering, different methods for dealing with the cold start problem, and how to assess how well our model is performing. Then as a final touch, you will have the opportunity to deploy your recommendations to the web!