# Knowledge-based recommendation

A knowledge-based recommendation is one in which knowledge about the items or the user's preferences is used to make a recommendation.

Knowledge-based recommendations are common when purchasing luxury items. Take a look at the filters available on Zillow in the image below. This is an example of a knowledge-based recommendation, as users can apply their own preferences to narrow down the items that are shown.

<img src="https://video.udacity-data.com/topher/2018/August/5b6a4153_screen-shot-2018-08-07-at-6.02.41-pm/screen-shot-2018-08-07-at-6.02.41-pm.png" width=200 height=200>

Often a **rank based algorithm** is provided along with knowledge based recommendations to bring the most popular items in particular categories to the user's attention.

# Recommendations with MovieTweetings: Most Popular Recommendation

Now that you have created the necessary columns we will be using throughout the rest of the lesson on creating recommendations, let's get started with the first of our recommendations.

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.


In [87]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline
%config Completer.use_jedi = False

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')
del movies['Unnamed: 0']
del reviews['Unnamed: 0']

## Part I: How To Find The Most Popular Movies

For this notebook, we have a single task.  The task is that **no matter the user**, we need to provide a list of the recommendations based on simply the most popular items.

For this task, we will consider what is "most popular" based on the following criteria:

* A movie with the highest average rating is considered best
* With ties, movies that have more ratings are better
* A movie must have a minimum of 5 ratings to be considered among the best movies
* If movies are tied in both their average rating and number of ratings, the movie with the most recent rating ranks higher
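
As a minimal sketch of how these four criteria can be combined, the example below ranks a small made-up reviews table. The toy data and the column names `avg_rating`, `num_ratings`, and `last_rating` are illustrative assumptions and are not part of the lesson's datasets.

```python
import pandas as pd

# Toy data, made up for illustration:
#   movie 10: 5 ratings, average 8
#   movie 20: 6 ratings, average 8  (ties movie 10 on average, has more ratings)
#   movie 30: 2 ratings, average 10 (too few ratings to qualify)
#   movie 40: 5 ratings, average 7
#   movie 50: 5 ratings, average 7  (ties movie 40 on both, rated more recently)
toy_reviews = pd.DataFrame({
    'movie_id':  [10] * 5 + [20] * 6 + [30] * 2 + [40] * 5 + [50] * 5,
    'rating':    [8] * 5 + [8] * 6 + [10] * 2 + [7] * 5 + [7] * 5,
    'timestamp': list(range(1, 24)),  # later rows are more recent
})

# One row per movie: average rating, number of ratings, latest rating time
summary = toy_reviews.groupby('movie_id').agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'count'),
    last_rating=('timestamp', 'max'),
)

# Minimum of 5 ratings to be considered
eligible = summary[summary['num_ratings'] >= 5]

# Highest average first, then most ratings, then most recent rating
ranked = eligible.sort_values(
    ['avg_rating', 'num_ratings', 'last_rating'], ascending=False
)
ranked_ids = ranked.index.tolist()
print(ranked_ids)
```

Sorting on all three columns at once lets each later column act only as a tie-breaker for the ones before it.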

With these criteria, the goal for this notebook is to take a **user_id** and return the **n_top** recommendations.  The function below will also serve as the scaffolding for all future recommendations.

### Sample trial

In [42]:
# Create a column with number of ratings for each movie
ratings_count = reviews.groupby('movie_id').size().to_dict()
reviews['ratings_count'] = reviews.movie_id.map(ratings_count)

In [46]:
# Keep only reviews of movies that have at least 5 ratings
reviews_best = reviews.query('ratings_count >= 5')

In [47]:
# Get reviews by user_id
reviews_user2 = reviews_best.query('user_id == 2')
reviews_user2.head()

   user_id  movie_id  rating   timestamp                 date  ratings_count
1        2    208092       5  1586466072  2020-04-10 05:01:12            395
2        2    358273       9  1579057827  2020-01-15 11:10:27            213
3        2  10039344       5  1578603053  2020-01-10 04:50:53            163
4        2   6751668       9  1578955697  2020-01-14 06:48:17           1715
5        2   7131622       8  1579559244  2020-01-21 06:27:24           1541


In [75]:
# Order : rating -> count -> date (recent)
reviews_user2_sorted = reviews_user2.sort_values(by=['rating', 'ratings_count', 'timestamp'], ascending=[False, False, False])

# Get top 5 
print(reviews_user2_sorted.iloc[:5])

   user_id  movie_id  rating   timestamp                 date  ratings_count
9        2   8579674      10  1579261830  2020-01-17 19:50:30           2784
4        2   6751668       9  1578955697  2020-01-14 06:48:17           1715
2        2    358273       9  1579057827  2020-01-15 11:10:27            213
5        2   7131622       8  1579559244  2020-01-21 06:27:24           1541
8        2   8367814       8  1586436354  2020-04-09 20:45:54           1139


### Function
The trial above requires preprocessing the `reviews` dataframe, and it ranks only the movies that a single user has rated, so it is not the user-independent ranking that this rank-based system requires. Let's write a function that does not need this preprocessing.

In [90]:
# Read in the datasets again
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')
del movies['Unnamed: 0']
del reviews['Unnamed: 0']

In [91]:
reviews.head()

   user_id  movie_id  rating   timestamp                 date
0        1    114508       8  1381006850  2013-10-06 05:00:50
1        2    208092       5  1586466072  2020-04-10 05:01:12
2        2    358273       9  1579057827  2020-01-15 11:10:27
3        2  10039344       5  1578603053  2020-01-10 04:50:53
4        2   6751668       9  1578955697  2020-01-14 06:48:17


In [130]:
def popular_recommendations(user_id, n_top):
    '''
    INPUT:
    user_id - the user_id of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    # Create a new dataframe grouped by movie id
    movie_ratings = reviews.groupby('movie_id')['rating']   
    # Highest average rating    
    avg_ratings = movie_ratings.mean()
    # Tie - More ratings
    count_ratings = movie_ratings.count()
    # Tie - More recent ratings
    recent_ratings = reviews.groupby('movie_id')['timestamp'].max()
    
    # Combine the ratings
    rating_summary = pd.DataFrame({'avg_rating': avg_ratings, 
                                   'count_rating': count_ratings, 
                                   'recent_rating': recent_ratings})
    
    # Merge with movies dataframe
    movies_merged = movies.merge(rating_summary, on='movie_id')
    
    # Rank movies by ratings
    ranked_movies = movies_merged.sort_values(['avg_rating', 'count_rating', 'recent_rating'], ascending=False)
        
    # Keep only movies with at least 5 ratings
    ranked_movies = ranked_movies.query('count_rating >= 5')
    
    # List top movies with user input 'n_top'
    top_movies = list(ranked_movies['movie'][:n_top])
    
    return top_movies # a list of the n_top movies as recommended

In [131]:
popular_recommendations('1', 5)

['MSG 2 the Messenger (2015)',
 'Avengers: Age of Ultron Parody (2015)',
 'Five Minutes (2017)',
 'Selam (2013)',
 'Let There Be Light (2017)']
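
Note that the function above rebuilds the full ranking on every call, even though the ranking does not depend on `user_id`. One possible refactor, sketched below on made-up data, computes the ranked table once and then slices it for each request. The helper names `build_ranked_movies` and `top_n_titles` are hypothetical and are not the functions in the lesson's `tests` module.

```python
import pandas as pd

def build_ranked_movies(movies, reviews, min_ratings=5):
    """Return movies sorted best to worst; meant to be computed once and reused."""
    summary = reviews.groupby('movie_id').agg(
        avg_rating=('rating', 'mean'),
        count_rating=('rating', 'count'),
        recent_rating=('timestamp', 'max'),
    ).reset_index()
    merged = movies.merge(summary, on='movie_id')
    # Drop movies with too few ratings before ranking
    merged = merged[merged['count_rating'] >= min_ratings]
    return merged.sort_values(
        ['avg_rating', 'count_rating', 'recent_rating'], ascending=False
    )

def top_n_titles(ranked_movies, n_top):
    """Cheap lookup: take the first n_top titles from a precomputed ranking."""
    return ranked_movies['movie'].head(n_top).tolist()

# Toy data, made up for illustration
toy_movies = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'movie': ['A (2000)', 'B (2001)', 'C (2002)'],
})
toy_reviews = pd.DataFrame({
    'movie_id':  [1, 1, 2, 2, 2, 3],
    'rating':    [9, 9, 9, 9, 9, 10],
    'timestamp': [1, 2, 3, 4, 5, 6],
})

# min_ratings=2 only so that the tiny toy table has eligible movies
ranked_toy = build_ranked_movies(toy_movies, toy_reviews, min_ratings=2)
top_two = top_n_titles(ranked_toy, 2)
print(top_two)
```

Separating the expensive ranking step from the lookup means many requests can share a single ranked dataframe.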

In [132]:
# Put your solutions for each of the cases here

# Top 20 movies recommended for id 1
recs_20_for_1 = popular_recommendations('1', 20)

# Top 5 movies recommended for id 53968
recs_5_for_53968 = popular_recommendations('53968', 5)

# Top 100 movies recommended for id 70000
recs_100_for_70000 = popular_recommendations('70000', 100) 

# Top 35 movies recommended for id 43
recs_35_for_43 = popular_recommendations('43', 35) 


In [133]:
### You Should Not Need To Modify Anything In This Cell
ranked_movies = t.create_ranked_df(movies, reviews) # only run this once - it is not fast

# check 1 
assert t.popular_recommendations('1', 20, ranked_movies) == recs_20_for_1,  "The first check failed..."
# check 2
assert t.popular_recommendations('53968', 5, ranked_movies) == recs_5_for_53968,  "The second check failed..."
# check 3
assert t.popular_recommendations('70000', 100, ranked_movies) == recs_100_for_70000,  "The third check failed..."
# check 4
assert t.popular_recommendations('43', 35, ranked_movies) == recs_35_for_43,  "The fourth check failed..."

print("If you got here, looks like you are good to go!  Nice job!")

If you got here, looks like you are good to go!  Nice job!


**Notice:** This wasn't the only way we could have determined the "top rated" movies.  You can imagine that in keeping track of trending news or trending social events, you would likely want to create a time window from the current time, and then pull the articles in the most recent time frame.  There are always going to be some subjective decisions to be made.  
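
As a rough sketch of the time-window idea, the example below keeps only ratings from the most recent period before ranking. It assumes `timestamp` holds Unix seconds, as in the `reviews` output shown earlier. The function name `recent_window`, the 30-day default, and the toy data are illustrative assumptions.

```python
import pandas as pd

def recent_window(reviews, days=30, now=None):
    """Keep only reviews whose timestamp is within `days` of `now` (Unix seconds)."""
    if now is None:
        # Assumption: treat the latest review in the data as the current time
        now = reviews['timestamp'].max()
    cutoff = now - days * 24 * 60 * 60
    return reviews[reviews['timestamp'] >= cutoff]

# Toy data, made up for illustration: the first review is years older than the rest
toy_reviews = pd.DataFrame({
    'movie_id':  [1, 1, 2, 2],
    'rating':    [9, 8, 10, 7],
    'timestamp': [1_500_000_000, 1_586_000_000, 1_586_400_000, 1_586_466_072],
})

windowed = recent_window(toy_reviews, days=30)

# Rank by average rating using only the reviews inside the window
trending = windowed.groupby('movie_id')['rating'].mean().sort_values(ascending=False)
print(trending)
```

The window length is one of the subjective decisions mentioned above: a shorter window reacts faster to new items but leaves fewer ratings per movie.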

If you find that no one is paying any attention to your most popular recommendations, then it might be time to find a new way to recommend, which is what the next parts of the lesson should prepare us to do!


## Part II: Adding Filters

Now that you have created a function to give back the **n_top** movies, let's make it a bit more robust.  Add arguments that will act as filters for the movie **year** and **genre**.  

Use the cells below to adjust your existing function to accept **year** and **genre** arguments as **lists** of **strings**.  Your final results should then be limited to movies that fall within the provided lists of years and genres (treating the values in each list as `or` conditions).  If no list is provided, no filter should be applied.

You can adjust other inputs as needed to retrieve the final results you are looking for!

Try writing a few tests against the filtered function in our `tests` module.  The call below returns the top 20 movies for user 1 based on the specified year and genre filters.  Does yours return the same?

```
t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])
```
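
One way the filtering step might look is sketched below. It assumes, hypothetically, that the ranked dataframe stores the release year as a string column named `date` and each genre as a 0/1 indicator column; check these against the actual columns of `ranked_movies` before reusing it. The function name `filter_ranked` is made up for this sketch, and it is not the implementation in `tests`.

```python
import pandas as pd

def filter_ranked(ranked_movies, n_top, years=None, genres=None):
    """Return the top n_top titles from an already ranked dataframe.

    years  - list of year strings; a movie passes if its year is in the list
    genres - list of genre column names; a movie passes if any flag is 1
    A filter is skipped when its list is None.
    """
    filtered = ranked_movies
    if years is not None:
        filtered = filtered[filtered['date'].isin(years)]
    if genres is not None:
        # Row-wise sum of the selected 0/1 columns is positive if any genre matches
        filtered = filtered[filtered[genres].sum(axis=1) > 0]
    return filtered['movie'].head(n_top).tolist()

# Toy ranked table, made up for illustration (already in best-to-worst order)
toy_ranked = pd.DataFrame({
    'movie':   ['A (2015)', 'B (2016)', 'C (2010)', 'D (2017)'],
    'date':    ['2015', '2016', '2010', '2017'],
    'History': [1, 0, 1, 0],
    'Drama':   [0, 1, 0, 0],
})

history_recs = filter_ranked(
    toy_ranked, 20, years=['2015', '2016', '2017', '2018'], genres=['History']
)
print(history_recs)
```

Because the input is already ranked, filtering preserves the order, so the first `n_top` rows that survive the filters are the recommendations.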