### Recommendations with MovieTweetings: Most Popular Recommendation

Now that you have created the necessary columns we will be using throughout the rest of the lesson on creating recommendations, let's get started with the first of our recommendations.

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')
del movies['Unnamed: 0']
del reviews['Unnamed: 0']

#### 1. How To Find The Most Popular Movies

For this notebook, we have a single task.  The task is that no matter the user, we need to provide a list of the recommendations based on simply the most popular items.

For this task, we will consider what is "most popular" based on the following criteria:

* A movie with the highest average rating is considered best
* With ties, movies that have more ratings are better
* A movie must have a minimum of 5 ratings to be considered among the best movies
* If movies are tied in their average rating and number of ratings, the ranking is determined by the movie that is the most recent rating

With these criteria, the goal for this notebook is to take a **user_id** and provide back the **n_top** recommendations.  Use the function below as the scaffolding that will be used for all the future recommendations as well.

In [39]:
def movie_rankings(movies, reviews):
    '''
     INPUT
        movies - the movies dataframe
        reviews - the reviews dataframe
        
        OUTPUT
        ranked_movies - a dataframe with movies that are sorted by highest avg rating, more reviews, 
                        then time, and must have more than 4 ratings
    '''
    # Do stuff
    
    movie_ratings = reviews.groupby('movie_id')['rating']
    
    # get average ratings for each movie
    mean_ratings = movie_ratings.mean()
    
    # get number of ratings for each movie
    num_ratings = movie_ratings.count()
    
    # get last rating
    last_rating = pd.DataFrame(reviews.groupby('movie_id')['date'].max())
    last_rating.rename(columns = {'date': 'last_ratings'}, inplace=True)
    
    
    rec_movies_df = pd.DataFrame({'mean_ratings': mean_ratings, 'num_ratings': num_ratings})
    rec_movies_df = rec_movies_df.join(last_rating)
    movies_df = movies.join(rec_movies_df)
    
    # sort movies by highest average rating, then number of ratings, then latest rating
    ranked_movies = movies_df.sort_values(['mean_ratings', 'num_ratings', 'last_ratings'], ascending=False)
    
    # keep ony novies with 5 or more reviews
    ranked_movies = ranked_movies[ranked_movies['num_ratings'] > 4]
    
    return ranked_movies # a list of the n_top movies as recommended

def popular_recommendations(user_id, n_top, ranked_movies):
    '''
    INPUT:
    user_id - the user_id of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    top_movies = list(ranked_movies['movie'][:n_top])
    
    return top_movies

Usint the three criteria above, you should be able to put together the above function.  If you feel confident in your solution, check the results of your function against our solution. On the next page, you can see a walkthrough and you can of course get the solution by looking at the solution notebook available in this workspace.  

In [37]:
ranked_movies = movie_rankings(movies, reviews)
ranked_movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Action,Documentary,Animation,Comedy,Short,Western,Thriller,mean_ratings,num_ratings,last_ratings
12364,368578,Are We There Yet? (2005),Adventure|Comedy|Family,2005,0,0,1,0,0,0,...,0,0,0,1,0,0,0,9.625,8.0,2016-09-30 18:20:04
29843,5641282,India in a Day (2016),Documentary,2016,0,0,1,0,0,0,...,0,1,0,0,0,0,0,9.5,10.0,2016-04-23 19:34:05
19254,1714209,In the Land of Blood and Honey (2011),Drama|Romance|War,2011,0,0,1,0,0,0,...,0,0,0,0,0,0,0,9.238095,21.0,2018-05-15 14:40:03
15064,889136,Kings (2007),Drama,2007,0,0,1,0,0,0,...,0,0,0,0,0,0,0,9.0,8.0,2016-01-27 20:50:25
18742,1637706,Our Idiot Brother (2011),Comedy|Drama,2011,0,0,1,0,0,0,...,0,0,0,1,0,0,0,9.0,8.0,2015-12-06 14:32:21


In [42]:
# Put your solutions for each of the cases here

# Top 20 movies recommended for id 1

recs_20_for_1 = popular_recommendations('1', 20, ranked_movies)# Your solution list here

# Top 5 movies recommended for id 53968
recs_5_for_53968 = popular_recommendations('53968', 5, ranked_movies)# Your solution list here

# Top 100 movies recommended for id 70000
recs_100_for_70000 = popular_recommendations('70000', 100, ranked_movies)# Your solution list here

# Top 35 movies recommended for id 43
recs_35_for_43 = popular_recommendations('43', 35, ranked_movies)# Your solution list here

In [43]:
### You Should Not Need To Modify Anything In This Cell
ranked_movies = t.create_ranked_df(movies, reviews) # only run this once - it is not fast

# check 1 
assert t.popular_recommendations('1', 20, ranked_movies) == recs_20_for_1,  "The first check failed..."
# check 2
assert t.popular_recommendations('53968', 5, ranked_movies) == recs_5_for_53968,  "The second check failed..."
# check 3
assert t.popular_recommendations('70000', 100, ranked_movies) == recs_100_for_70000,  "The third check failed..."
# check 4
assert t.popular_recommendations('43', 35, ranked_movies) == recs_35_for_43,  "The fourth check failed..."

print("If you got here, looks like you are good to go!  Nice job!")

If you got here, looks like you are good to go!  Nice job!


**Notice:** This wasn't the only way we could have determined the "top rated" movies.  You can imagine that in keeping track of trending news or trending social events, you would likely want to create a time window from the current time, and then pull the articles in the most recent time frame.  There are always going to be some subjective decisions to be made.  

If you find that no one is paying any attention to your most popular recommendations, then it might be time to find a new way to recommend, which is what the next parts of the lesson should prepare us to do!


### Part II: Adding Filters

Now that you have created a function to give back the **n_top** movies, let's make it a bit more robust.  Add arguments that will act as filters for the movie **year** and **genre**.  

Use the cells below to adjust your existing function to allow for **year** and **genre** arguments as **lists** of **strings**.  Then your ending results are filtered to only movies within the lists of provided years and genres (as `or` conditions).  If no list is provided, there should be no filter applied.

You can adjust other necessary inputs as necessary to retrieve the final results you are looking for!

Try writing a few tests against the test function in our test function.  Below returns the top 20 movies for user 1 based on the specified year and genre filters.  Does yours return the same? 

```
t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])
```

In [48]:
num_genre_match = ranked_movies[['History', 'News']].sum(axis=1)
num_genre_match.head()

movie_id
4921860    0
5262972    0
5688932    0
2737018    0
2560840    0
dtype: int64

In [49]:
num_genre_match > 0

movie_id
4921860    False
5262972    False
5688932    False
2737018    False
2560840    False
2219210    False
4448444    False
5131914    False
2059318    False
1431149    False
5512872    False
4148400     True
6798422    False
111341     False
423176     False
1629443    False
2592910    False
58888      False
2396421    False
12364      False
57565      False
6054758    False
5134588    False
363473     False
2265179    False
5323386    False
29843      False
45274      False
2357788    False
6781498    False
           ...  
2088923    False
3203620    False
3727824    False
2769184    False
466342     False
2006801    False
3283792    False
2325518    False
185183     False
1540767    False
3551400    False
2357489    False
110978     False
5988370    False
829176     False
3104304    False
2622826    False
3561236    False
3036740    False
1098327    False
2147365    False
105585     False
2573750    False
5669936    False
1213644    False
60666      False
3108604    False
31873

In [55]:
def popular_recs_filtered(user_id, n_top, ranked_movies, years=None, genres=None):
    
    # filter for years
    if years is not None:
        # only keep movies from filtered year
        ranked_movies = ranked_movies[ranked_movies['date'].isin(years)]
        
    # filter for genres
    if genres is not None:
        # find number of genre matches for every movie
        num_genre_match = ranked_movies[genres].sum(axis=1)
        # only keep movies from filtered genres
        ranked_movies = ranked_movies.loc[num_genre_match > 0, :]
    
    top_movies = list(ranked_movies['movie'][:n_top])
    
    return top_movies

In [57]:
# Top 20 movies recommended for id 1
rec_20_for_1_filtered = popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])

# Top 5 movies recommended for id 53968
recs_5_for_53968 = popular_recs_filtered('53968', 5, ranked_movies, years=['1998', '1999', '2000', '2001'], genres=['Animation', 'Comedy'])

In [59]:
assert t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])

assert t.popular_recs_filtered('53968', 5, ranked_movies, years=['1998', '1999', '2000', '2001'], genres=['Animation', 'Comedy'])

print("If you got here, looks like you are good to go!  Nice job!")

If you got here, looks like you are good to go!  Nice job!
