### Recommendations with MovieTweetings: Most Popular Recommendation

Now that you have created the columns used throughout the rest of this lesson, let's get started with the first of our recommendation methods.

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.


In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tests as t

%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')
del movies['Unnamed: 0']
del reviews['Unnamed: 0']

In [10]:
movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Mystery,Film-Noir,Crime,...,Adult,Romance,News,Horror,Fantasy,Documentary,Sci-Fi,Talk-Show,Game-Show,Biography
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [11]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date
0,1,111161,10,1373234211,2013-07-07 21:56:51
1,1,117060,7,1373415231,2013-07-10 00:13:51
2,1,120755,6,1373424360,2013-07-10 02:46:00
3,1,317919,6,1373495763,2013-07-10 22:36:03
4,1,454876,10,1373621125,2013-07-12 09:25:25


#### 1. How To Find The Most Popular Movies

This notebook has a single task: no matter the user, provide a list of recommendations based simply on the most popular items.

For this task, we will consider what is "most popular" based on the following criteria:

* A movie with the highest average rating is considered best
* With ties, movies that have more ratings are better
* A movie must have a minimum of 5 ratings to be considered among the best movies
* If movies are tied in both their average rating and number of ratings, the movie with the most recent rating ranks higher

With these criteria, the goal for this notebook is to take a **user_id** and provide back the **n_top** recommendations.  Use the function below as the scaffolding that will be used for all the future recommendations as well.
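
As a rough illustration of how the criteria above combine, here is a minimal sketch on made-up data. The `toy_reviews` frame and the names `stats`, `ranked`, and `ranked_ids` are hypothetical and not part of the lesson's datasets; the named-aggregation syntax assumes pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical toy reviews; the column names mirror reviews_clean.csv.
toy_reviews = pd.DataFrame({
    'movie_id': [1] * 5 + [2] * 6 + [3] * 4 + [4] * 5,
    'rating':   [9] * 5 + [9] * 6 + [10] * 4 + [8] * 5,
    'date': pd.to_datetime(
        ['2019-01-01'] * 5 + ['2019-02-01'] * 6
        + ['2019-03-01'] * 4 + ['2019-04-01'] * 5),
})

# One row per movie: average rating, number of ratings, most recent rating date.
stats = toy_reviews.groupby('movie_id').agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'count'),
    last_rating=('date', 'max'),
)

# Keep movies with at least 5 ratings, then sort by the three criteria in order.
ranked = (stats[stats['num_ratings'] >= 5]
          .sort_values(['avg_rating', 'num_ratings', 'last_rating'],
                       ascending=False))

ranked_ids = ranked.index.tolist()
print(ranked_ids)
```

Movie 3 has the highest average but only 4 ratings, so it is dropped. Movies 1 and 2 tie on average rating, and movie 2 ranks first because it has more ratings.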

In [6]:
# reviews rating mean 
reviews.groupby('movie_id')['rating'].transform('mean')

0         9.400612
1         7.310078
2         6.056604
3         7.271845
4         8.148620
5         7.033573
6         6.821429
7         7.491429
8         6.649560
9         7.684343
10        7.009471
11        6.148649
12        7.121711
13        5.894928
14        5.958932
15        7.196149
16        6.583871
17        5.243187
18        7.071158
19        6.766667
20        8.148620
21        8.057917
22        7.305000
23        7.886233
24        7.421053
25        7.433255
26        7.374790
27        8.310030
28        8.110266
29        7.875479
            ...   
798270    5.271186
798271    5.000000
798272    4.833333
798273    5.431034
798274    4.565217
798275    7.079585
798276    7.669043
798277    6.428571
798278    7.081818
798279    5.250000
798280    5.774648
798281    6.261538
798282    6.542553
798283    8.096774
798284    6.000000
798285    7.230088
798286    6.389937
798287    8.667057
798288    5.607143
798289    6.792453
798290    6.992701
798291    7.

In [7]:
# movie number of ratings
reviews.groupby('movie_id')['rating'].transform('count')

0          981
1          129
2          106
3          103
4          942
5          834
6          336
7          175
8          682
9         2376
10        1795
11         666
12         608
13         276
14         487
15         831
16         310
17         477
18        1321
19          30
20         942
21        1623
22        2400
23        1046
24          38
25         427
26         595
27         658
28        1841
29         522
          ... 
798270      59
798271       1
798272      12
798273      58
798274      23
798275     289
798276     982
798277     140
798278     110
798279      16
798280      71
798281      65
798282      94
798283      31
798284      12
798285     113
798286     159
798287     853
798288      28
798289      53
798290     274
798291     352
798292      35
798293     408
798294     349
798295     224
798296     189
798297    2400
798298     227
798299       7
Name: rating, Length: 798300, dtype: int64

In [16]:
# latest date for each movie 
reviews.groupby('movie_id')['date'].transform('max')

0         2019-08-20 06:47:37
1         2019-06-02 09:29:32
2         2019-03-09 15:06:28
3         2019-03-09 23:06:32
4         2019-08-03 20:40:02
5         2019-07-29 03:13:07
6         2019-02-14 23:48:01
7         2019-07-05 05:48:23
8         2019-06-26 22:44:00
9         2019-08-20 00:29:22
10        2019-08-13 15:57:06
11        2018-05-25 15:04:55
12        2019-02-03 02:57:59
13        2018-07-22 07:51:47
14        2019-08-14 14:04:38
15        2019-08-11 14:21:03
16        2019-08-06 05:15:06
17        2019-03-20 05:53:44
18        2019-06-23 13:20:16
19        2019-03-24 12:05:40
20        2019-08-03 20:40:02
21        2019-08-16 20:39:05
22        2019-07-29 08:03:49
23        2019-06-25 14:40:50
24        2018-12-21 14:35:52
25        2019-08-21 16:26:28
26        2019-07-12 23:04:10
27        2019-06-21 19:46:45
28        2019-08-06 22:00:14
29        2019-05-03 12:03:36
                 ...         
798270    2019-03-17 22:32:10
798271    2019-02-03 21:02:06
798272    

In [36]:
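def create_ranked_df(movies, reviews):
    '''
    INPUT:
    movies - the movies dataframe
    reviews - the reviews dataframe (three per-movie statistic columns are added in place)
    OUTPUT:
    top_movies_df - movies with at least 5 ratings, ranked best to worst
    '''
    # attach each movie's average rating, number of ratings, and most recent
    # rating date to every review row
    reviews['movie_avg_rating'] = reviews.groupby('movie_id')['rating'].transform('mean')
    reviews['reviews_count'] = reviews.groupby('movie_id')['rating'].transform('count')
    reviews['most_recent_date'] = reviews.groupby('movie_id')['date'].transform('max')

    # merge the per-movie statistics onto movies, then collapse the repeated
    # rows (one per review) down to one row per movie
    movies_reviews_stats_df = pd.merge(movies, 
                               reviews[['movie_id','movie_avg_rating',
                                       'reviews_count','most_recent_date']], 
                                      on=['movie_id'])
    movies_reviews_stats_df.drop_duplicates(inplace=True)

    # rank by average rating, then number of ratings, then most recent rating
    top_movies_df = movies_reviews_stats_df.sort_values(by=['movie_avg_rating','reviews_count','most_recent_date'], ascending=False)

    # keep only movies with at least 5 ratings
    top_movies_df = top_movies_df[top_movies_df.reviews_count > 4]

    return top_movies_df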

top_movies_df = create_ranked_df(movies, reviews)

top_movies_df.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Mystery,Film-Noir,Crime,...,Horror,Fantasy,Documentary,Sci-Fi,Talk-Show,Game-Show,Biography,movie_avg_rating,reviews_count,most_recent_date
740295,4921860,MSG 2 the Messenger (2015),Comedy|Drama|Fantasy|Horror,2015,0,0,1,0,0,0,...,1,1,0,0,0,0,0,10.0,48,2016-08-14 17:16:50
755749,5262972,Avengers: Age of Ultron Parody (2015),Short|Comedy,2015,0,0,1,0,0,0,...,0,0,0,0,0,0,0,10.0,28,2016-01-08 00:44:43
783891,6662050,Five Minutes (2017),Short|Comedy,2017,0,0,1,0,0,0,...,0,0,0,0,0,0,0,10.0,22,2019-04-20 22:29:19
611629,2737018,Selam (2013),Drama|Romance,2013,0,0,1,0,0,0,...,0,0,0,0,0,0,0,10.0,10,2015-05-10 22:56:01
594400,2560840,"Quiet Riot: Well Now You're Here, There's No W...",Documentary|Music,2014,0,0,1,0,0,0,...,0,0,1,0,0,0,0,10.0,6,2016-01-23 00:30:44
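
The ranking above merges every review row onto `movies` and then drops duplicates. One possible alternative, sketched here under assumptions, is to aggregate to one row per movie first and merge once. The function name `create_ranked_df_agg`, the `min_ratings` parameter, and the toy frames are hypothetical; named aggregation assumes pandas 0.25 or later. The row index of the result will differ from the output shown above, and the order among fully tied movies is not guaranteed to match.

```python
import pandas as pd

def create_ranked_df_agg(movies, reviews, min_ratings=5):
    """Hypothetical variant: aggregate per movie first, then merge once."""
    # One row per movie, without adding any columns to `reviews`.
    stats = (reviews.groupby('movie_id')
             .agg(movie_avg_rating=('rating', 'mean'),
                  reviews_count=('rating', 'count'),
                  most_recent_date=('date', 'max'))
             .reset_index())
    ranked = movies.merge(stats, on='movie_id')
    ranked = ranked[ranked['reviews_count'] >= min_ratings]
    return ranked.sort_values(
        ['movie_avg_rating', 'reviews_count', 'most_recent_date'],
        ascending=False)

# Tiny made-up frames to exercise the function.
toy_movies = pd.DataFrame({'movie_id': [1, 2],
                           'movie': ['A (2000)', 'B (2001)']})
toy_reviews = pd.DataFrame({
    'movie_id': [1] * 5 + [2] * 5,
    'rating': [7] * 5 + [8] * 5,
    'date': pd.to_datetime(['2019-01-01'] * 10),
})
toy_ranked = create_ranked_df_agg(toy_movies, toy_reviews)
print(toy_ranked['movie'].tolist())
```

Because the statistics are computed on a separate frame, the input `reviews` dataframe is left unchanged.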


In [33]:
def popular_recommendations(user_id, n_top):
    '''
    INPUT:
    user_id - the user_id of the individual you are making recommendations for
    n_top - an integer of the number of recommendations you want back
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    # top_movies_df (created above) is already ranked, so the same titles are
    # returned for every user_id
    top_movies = list(top_movies_df['movie'][:n_top])
    return top_movies # a list of the n_top movies as recommended

Using the criteria above, you should be able to put together the function above.  If you feel confident in your solution, check the results of your function against our solution.  On the next page you can see a walkthrough, and you can of course get the solution from the solution notebook available in this workspace.

In [34]:
# Put your solutions for each of the cases here

# Top 20 movies recommended for id 1
recs_20_for_1 = popular_recommendations(1, 20)

# Top 5 movies recommended for id 53968
recs_5_for_53968 = popular_recommendations(53968, 5)

# Top 100 movies recommended for id 70000
recs_100_for_70000 = popular_recommendations(70000, 100)

# Top 35 movies recommended for id 43
recs_35_for_43 = popular_recommendations(43, 35)



In [35]:
### You Should Not Need To Modify Anything In This Cell
ranked_movies = t.create_ranked_df(movies, reviews) # only run this once - it is not fast

# check 1 
assert t.popular_recommendations('1', 20, ranked_movies) == recs_20_for_1,  "The first check failed..."
# check 2
assert t.popular_recommendations('53968', 5, ranked_movies) == recs_5_for_53968,  "The second check failed..."
# check 3
assert t.popular_recommendations('70000', 100, ranked_movies) == recs_100_for_70000,  "The third check failed..."
# check 4
assert t.popular_recommendations('43', 35, ranked_movies) == recs_35_for_43,  "The fourth check failed..."

print("If you got here, looks like you are good to go!  Nice job!")

If you got here, looks like you are good to go!  Nice job!


**Notice:** This wasn't the only way we could have determined the "top rated" movies.  You can imagine that in keeping track of trending news or trending social events, you would likely want to create a time window from the current time, and then pull the articles in the most recent time frame.  There are always going to be some subjective decisions to be made.  
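
To make the time-window idea concrete, here is a minimal sketch. The helper `recent_reviews`, its `days` and `now` parameters, and the toy data are hypothetical; using the newest review as the reference point when `now` is not given is an assumption made for illustration only.

```python
import pandas as pd

def recent_reviews(reviews, days=30, now=None):
    """Hypothetical helper: keep only reviews from the last `days` days."""
    dates = pd.to_datetime(reviews['date'])
    # Assumption: with no explicit `now`, use the newest review as the reference point.
    now = dates.max() if now is None else pd.Timestamp(now)
    return reviews[dates >= now - pd.Timedelta(days=days)]

# Made-up reviews in the same layout as reviews_clean.csv.
toy = pd.DataFrame({
    'movie_id': [1, 1, 2],
    'rating': [8, 9, 7],
    'date': ['2019-08-20 06:47:37', '2019-01-02 10:00:00',
             '2019-08-01 12:00:00'],
})
recent = recent_reviews(toy, days=30)
print(recent)
```

The filtered frame could then be passed to the same ranking step, so that only recent ratings count toward popularity.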

If you find that no one is paying any attention to your most popular recommendations, then it might be time to find a new way to recommend, which is what the next parts of the lesson should prepare us to do!


### Part II: Adding Filters

Now that you have created a function to give back the **n_top** movies, let's make it a bit more robust.  Add arguments that will act as filters for the movie **year** and **genre**.  

Use the cells below to adjust your existing function to accept **year** and **genre** arguments as **lists** of **strings**.  The final results should then be filtered to only movies within the provided lists of years and genres (as `or` conditions within each list).  If no list is provided, no filter should be applied.

You can adjust other inputs as needed to retrieve the final results you are looking for!

Try writing a few tests against the corresponding function in our `tests` module.  The call below returns the top 20 movies for user 1 based on the specified year and genre filters.  Does yours return the same?

```
t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])
```

In [40]:
def popular_recommendations_filter_based(user_id, n_top, ranked_movies, years=None, genres=None):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
    n_top - an integer of the number of recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time
    years - a list of strings with years of movies
    genres - a list of strings with genres of movies
    
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    # filter on year: keep movies whose year is in the provided list
    if years is not None:
        ranked_movies = ranked_movies[ranked_movies['date'].isin(years)]

    # filter on genre: keep movies flagged with at least one of the provided genres
    if genres is not None:
        num_genre_match = ranked_movies[genres].sum(axis=1)
        ranked_movies = ranked_movies.loc[num_genre_match > 0, :]

    # create top movies list
    top_movies = list(ranked_movies['movie'][:n_top])

    return top_movies
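
One thing worth checking is the dtype of the `date` column: if `read_csv` inferred it as integers, comparing against a list of year strings with `isin` may not match anything. The sketch below shows a defensive variant that casts to strings before comparing. The toy frame and the names `year_mask`, `genre_mask`, and `filtered` are hypothetical, and the integer `date` column is an assumption about how the data might load, not something this notebook confirms.

```python
import pandas as pd

# Made-up ranked frame with the same kind of columns as the ranked movies.
toy_ranked = pd.DataFrame({
    'movie': ['A (2015)', 'B (2016)', 'C (1999)'],
    'date': [2015, 2016, 1999],   # integers, as read_csv might infer
    'History': [1, 0, 1],
    'News': [0, 0, 0],
})

years = ['2015', '2016']
genres = ['History', 'News']

# Compare as strings so integer-typed years still match the string filter.
year_mask = toy_ranked['date'].astype(str).isin(years)
# A movie passes the genre filter if any of the requested genre flags is 1.
genre_mask = toy_ranked[genres].sum(axis=1) > 0

filtered = toy_ranked[year_mask & genre_mask]['movie'].tolist()
print(filtered)
```

Within each list the match is an `or` (any listed year, any listed genre), while the two filters together act as an `and`, which mirrors the function above.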

    

In [42]:
# Top 20 movies recommended for id 1 with years=['2015', '2016', '2017', '2018'], genres=['History']
recs_20_for_1_filtered = popular_recommendations_filter_based('1', 20, top_movies_df, years=['2015', '2016', '2017', '2018'], genres=['History'])

# Top 5 movies recommended for id 53968 with no genre filter but years=['2015', '2016', '2017', '2018']
recs_5_for_53968_filtered = popular_recommendations_filter_based('53968', 5, top_movies_df, years=['2015', '2016', '2017', '2018'])

# Top 100 movies recommended for id 70000 with no year filter but genres=['History', 'News']
recs_100_for_70000_filtered = popular_recommendations_filter_based('70000', 100, top_movies_df, genres=['History', 'News'])



In [43]:
### You Should Not Need To Modify Anything In This Cell

# check 1 
assert t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History']) == recs_20_for_1_filtered,  "The first check failed..."
# check 2
assert t.popular_recs_filtered('53968', 5, ranked_movies, years=['2015', '2016', '2017', '2018']) == recs_5_for_53968_filtered,  "The second check failed..."
# check 3
assert t.popular_recs_filtered('70000', 100, ranked_movies, genres=['History', 'News']) == recs_100_for_70000_filtered,  "The third check failed..."

print("If you got here, looks like you are good to go!  Nice job!")

If you got here, looks like you are good to go!  Nice job!
