### Building a Recommender with MovieTweetings: Most Popular Recommendations

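Now that you have created the columns needed for the recommenders built in the rest of this lesson, let's take the first step in building a recommendation system.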

First, run the code below to load the libraries and the two datasets used throughout this lesson.

In [138]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import project_tests as t

pd.set_option('display.max_rows', 300)

%matplotlib inline

In [120]:
# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

# Drop the leftover 'Unnamed: 0' column with del (to be looked into later)
del movies['Unnamed: 0']
del reviews['Unnamed: 0']

In [121]:
display(movies.head(1))
display(reviews.head(1))

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,History,News,Horror,...,Fantasy,Romance,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
0,1,68646,10,1381620027,2013-10-12 23:20:27,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0


#### 1. How to Find the Most Popular Movies

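This notebook has a single task: regardless of who the user is, provide a list of recommendations based on the most popular items.

For this task, "most popular" is defined by the following criteria:

* Movies with the highest average rating are considered the best
* If average ratings are tied, the movie with more ratings is better
* A movie with fewer than 5 ratings cannot be considered among the best
* If movies are tied on both average rating and number of ratings, the one with the most recent rating ranks higher

Given these criteria, the goal of this notebook is to take a **user_id** and return the **n_top** recommendations. The functions below will serve as the framework for all future recommenders.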
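
The four criteria can be tried out on a tiny example before touching the full dataset. The sketch below is only an illustration under stated assumptions: `toy_reviews` is made-up data, a plain integer `timestamp` stands in for the rating time, and named aggregation requires pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical toy data (not taken from MovieTweetings)
toy_reviews = pd.DataFrame({
    'movie_id':  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3],
    'rating':    [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10],
    'timestamp': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7, 8],
})

# One row per movie: average rating, number of ratings, most recent rating time
toy_stats = toy_reviews.groupby('movie_id').agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'count'),
    last_rating=('timestamp', 'max'),
)

# Movies with fewer than 5 ratings are not eligible
toy_stats = toy_stats[toy_stats['num_ratings'] >= 5]

# Sort by average rating, then number of ratings, then recency (all descending)
toy_ranked = toy_stats.sort_values(
    ['avg_rating', 'num_ratings', 'last_rating'], ascending=False)
toy_ranking = list(toy_ranked.index)
```

Movie 3 has the highest average but only two ratings, so it is dropped; movies 1 and 2 tie on average rating, and movie 2 ranks first because it has more ratings.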

In [122]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
0,1,68646,10,1381620027,2013-10-12 23:20:27,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
1,1,113277,10,1379466669,2013-09-18 01:11:09,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,2,422720,8,1412178746,2014-10-01 15:52:26,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3,2,454876,8,1394818630,2014-03-14 17:37:10,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,2,790636,7,1389963947,2014-01-17 13:05:47,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [124]:
# Quick test of groupby
testdf = reviews.groupby('movie_id')
# https://kite.com/blog/python/pandas-groupby-count-value-count/
# https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
testdf.get_group(23037)

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
129691,9970,23037,4,1431806091,2015-05-16 19:54:51,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
190551,14678,23037,8,1440606412,2015-08-26 16:26:52,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
266288,20177,23037,3,1515015066,2018-01-03 21:31:06,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
288099,21750,23037,8,1442501785,2015-09-17 14:56:25,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
319705,24253,23037,8,1408106216,2014-08-15 12:36:56,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [125]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712337 entries, 0 to 712336
Data columns (total 23 columns):
user_id      712337 non-null int64
movie_id     712337 non-null int64
rating       712337 non-null int64
timestamp    712337 non-null int64
date         712337 non-null object
month_1      712337 non-null int64
month_2      712337 non-null int64
month_3      712337 non-null int64
month_4      712337 non-null int64
month_5      712337 non-null int64
month_6      712337 non-null int64
month_7      712337 non-null int64
month_8      712337 non-null int64
month_9      712337 non-null int64
month_10     712337 non-null int64
month_11     712337 non-null int64
month_12     712337 non-null int64
year_2013    712337 non-null int64
year_2014    712337 non-null int64
year_2015    712337 non-null int64
year_2016    712337 non-null int64
year_2017    712337 non-null int64
year_2018    712337 non-null int64
dtypes: int64(22), object(1)
memory usage: 125.0+ MB


In [126]:
len(testdf),len(reviews)

(31245, 712337)

In [133]:
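# Exploration
rmean = reviews.groupby('movie_id')['rating'].mean().mean()
rmean1 = reviews.rating.mean()
rmax = reviews.groupby('movie_id')['rating'].max().mean()
rmean, rmax, rmean1
# Averaging each movie's max rating gives a much higher figure than averaging each movie's mean rating
# groupby returns a GroupBy object; common aggregations that can be applied to it are
## count, sum, mean, median, std, var, min, max, prod, first, last

## Note that rmean and rmean1 differ (the same weighting effect that underlies Simpson's paradox):
## the groups are not equally sized, so averaging the per-group means with equal weight
## is not the same as averaging all values directly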

(6.7118390700619885, 7.7941750680108814, 7.302481830931146)
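
The gap between `rmean` and `rmean1` above comes purely from group sizes. A minimal sketch on made-up data (the frame `toy` and its values are assumptions for illustration) shows the effect and how weighting by group size recovers the overall mean:

```python
import pandas as pd

# Hypothetical data: movie 1 has four ratings, movie 2 has one
toy = pd.DataFrame({
    'movie_id': [1, 1, 1, 1, 2],
    'rating':   [10, 10, 10, 10, 2],
})

# Every rating counts equally: (4 * 10 + 2) / 5
overall_mean = toy['rating'].mean()

# Every movie counts equally: (10 + 2) / 2
grouped = toy.groupby('movie_id')['rating']
mean_of_means = grouped.mean().mean()

# Weighting each movie's mean by its number of ratings recovers the overall mean
weighted_mean = (grouped.mean() * grouped.count()).sum() / grouped.count().sum()
```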

In [140]:
# explain - groupby max() 

# testing df
df = pd.DataFrame([('bird', 'Falconiformes', 389.0, 100),
                   ('bird', 'Psittaciformes', 24.0, 90),
                   ('mammal', 'Carnivora', 80.2, 80),
                   ('mammal', 'Primates', np.nan, 70),
                   ('mammal', 'Carnivora', 58, 60)],
                  index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
                  columns=('class', 'order', 'max_speed','weight'))

# groupby max()
grouped = df.groupby('class')
display(grouped.max())
display(grouped.max()['weight'])

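# By default max() is applied to every non-grouping column (including the string column 'order'); a single column can then be selected with ['column']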

Unnamed: 0_level_0,order,max_speed,weight
class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bird,Psittaciformes,389.0,100
mammal,Primates,80.2,80


class
bird      100
mammal     80
Name: weight, dtype: int64
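
One detail worth noting in the output above is that `max()` works column by column, so the values in one result row may come from different source rows. The sketch below rebuilds the same animals frame (under the name `animals`, chosen here for illustration) and uses `idxmax` to pull the full row of the heaviest animal per class instead.

```python
import numpy as np
import pandas as pd

animals = pd.DataFrame(
    [('bird', 'Falconiformes', 389.0, 100),
     ('bird', 'Psittaciformes', 24.0, 90),
     ('mammal', 'Carnivora', 80.2, 80),
     ('mammal', 'Primates', np.nan, 70),
     ('mammal', 'Carnivora', 58.0, 60)],
    index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
    columns=['class', 'order', 'max_speed', 'weight'])

# Column-wise maxima: for birds, 'order' comes from the parrot row
# while 'weight' comes from the falcon row
colwise = animals.groupby('class').max()

# Row-wise alternative: find the index label of the heaviest animal
# in each class, then look up those complete rows
heaviest_idx = animals.groupby('class')['weight'].idxmax()
heaviest = animals.loc[heaviest_idx]
```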

In [144]:
# explain - set_index
## set_index turns a column into the index (its name is displayed one row lower in the header)
movies.set_index('movie_id').head(1)

Unnamed: 0_level_0,movie,genre,date,1800's,1900's,2000's,History,News,Horror,Musical,...,Fantasy,Romance,Game-Show,Action,Documentary,Animation,Comedy,Short,Western,Thriller
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


In [145]:
def create_ranked_df(movies, reviews):
    '''
    INPUT
    movies - the movies dataframe
    reviews - the reviews dataframe

    OUTPUT
    ranked_movies - a dataframe with movies that are sorted by highest avg rating, more reviews,
                    then time, and must have more than 4 ratings
    '''

    # Pull the average ratings and number of ratings for each movie
    movie_ratings = reviews.groupby('movie_id')['rating']
    avg_ratings = movie_ratings.mean()
    num_ratings = movie_ratings.count()
    ## group by 'movie_id', then take the mean and the count
    last_rating = pd.DataFrame(reviews.groupby('movie_id')['date'].max())
    ## the latest 'date' per movie is the time of its most recent rating
    ## last_rating is only a tie-breaker (avg_rating is still the first sort key below)
    last_rating.columns = ['last_rating']

    # Add Dates
    rating_count_df = pd.DataFrame({'avg_rating': avg_ratings, 'num_ratings': num_ratings})
    ## pd.DataFrame turns the grouped results into a dataframe
    rating_count_df = rating_count_df.join(last_rating)
    ## rating_count_df now holds three columns: avg_rating, num_ratings, last_rating

    # merge with the movies dataset
    movie_recs = movies.set_index('movie_id').join(rating_count_df)
    ## add the three new columns to movies; rating_count_df is already indexed by movie_id,
    ## so movies needs the same index (via set_index) for the join to line up

    # sort by top avg rating and number of ratings
    ranked_movies = movie_recs.sort_values(['avg_rating', 'num_ratings', 'last_rating'], ascending=False)
    ## sort by avg_rating, then num_ratings, then last_rating

    # for edge cases - subset the movie list to those with only 5 or more reviews
    ranked_movies = ranked_movies[ranked_movies['num_ratings'] > 4]
    ## drop movies with fewer than 5 ratings

    return ranked_movies
    ## return the ranked movies
    

def popular_recommendations(user_id, n_top, ranked_movies):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time

    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''

    top_movies = list(ranked_movies['movie'][:n_top])

    return top_movies
    ## user_id is not actually used here: the output depends only on ranked_movies, so this is the simplest recommender (everyone gets the same list; later sections extend it)

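With the criteria above, you should be able to write the functions above. Once you are confident in your answer, check your functions against our solution. A walkthrough is provided on the next page, and you can also look at the solution notebook in this workspace.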

In [146]:
# Top 20 movies recommended for id 1
ranked_movies = create_ranked_df(movies, reviews) # only run this once - it is not fast

recs_20_for_1 = popular_recommendations('1', 20, ranked_movies)

# Top 5 movies recommended for id 53968
recs_5_for_53968 = popular_recommendations('53968', 5, ranked_movies)

# Top 100 movies recommended for id 70000
recs_100_for_70000 = popular_recommendations('70000', 100, ranked_movies)

# Top 35 movies recommended for id 43
recs_35_for_43 = popular_recommendations('43', 35, ranked_movies)

In [158]:
recs_20_for_1[:3]

['MSG 2 the Messenger (2015)',
 'Avengers: Age of Ultron Parody (2015)',
 'Sorry to Bother You (2018)']

In [159]:
recs_5_for_53968[:3]

['MSG 2 the Messenger (2015)',
 'Avengers: Age of Ultron Parody (2015)',
 'Sorry to Bother You (2018)']

In [160]:
recs_35_for_43[:3]

['MSG 2 the Messenger (2015)',
 'Avengers: Age of Ultron Parody (2015)',
 'Sorry to Bother You (2018)']

In [150]:
### You Should Not Need To Modify Anything In This Cell
ranked_movies = t.create_ranked_df(movies, reviews) # only run this once - it is not fast

# check 1 
assert t.popular_recommendations('1', 20, ranked_movies) == recs_20_for_1,  "The first check failed..."
# check 2
assert t.popular_recommendations('53968', 5, ranked_movies) == recs_5_for_53968,  "The second check failed..."
# check 3
assert t.popular_recommendations('70000', 100, ranked_movies) == recs_100_for_70000,  "The third check failed..."
# check 4
assert t.popular_recommendations('43', 35, ranked_movies) == recs_35_for_43,  "The fourth check failed..."

print("If you got here, looks like you are good to go!  Nice job!")

If you got here, looks like you are good to go!  Nice job!


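**Note:** This is not the only way to determine the "top rated" movies. For trending news or social events, you would probably define a time window ending at the current time and pull only items from within that recent window. Which approach is best is a judgment call.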
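
As a rough illustration of the time-window idea, the sketch below ranks only the ratings that fall inside a recent window. The function name `recent_top`, its parameters, and the toy data are all hypothetical, and the `date` column is assumed to be a real datetime column rather than a string.

```python
import pandas as pd

# Hypothetical toy data with a datetime 'date' column
toy_reviews = pd.DataFrame({
    'movie_id': [1, 1, 2, 2, 3],
    'rating':   [9, 8, 7, 10, 10],
    'date': pd.to_datetime(['2018-01-01', '2018-01-20', '2017-06-01',
                            '2018-01-25', '2016-03-01']),
})

def recent_top(reviews_df, now, window_days, n_top):
    '''Return the n_top movie ids ranked on ratings from the last window_days days.'''
    cutoff = now - pd.Timedelta(days=window_days)
    # Keep only ratings made inside the window
    recent = reviews_df[reviews_df['date'] >= cutoff]
    stats = recent.groupby('movie_id')['rating'].agg(['mean', 'count'])
    # Rank by average rating, then by number of ratings
    ranked = stats.sort_values(['mean', 'count'], ascending=False)
    return list(ranked.index[:n_top])

recent_ids = recent_top(toy_reviews, pd.Timestamp('2018-01-31'), 30, 2)
```

Movie 3 drops out entirely because its only rating is older than the window, and movie 2 is ranked using just its one recent rating.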

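If you find that no one is paying attention to your most popular recommendations, it is time to look for a new way to recommend, which is what the rest of this lesson covers.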


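### Part II: Adding Filters

Now that you have a function that returns the **n_top** movies, let's refine it by adding filters for the movie **year** and **genre**.

In the cell below, adjust the existing function so that the **years** and **genres** parameters are **lists of strings**. Then filter the final results to the years and genres in the provided lists (treated as `or` conditions). If a list is not provided, no filter is applied for it.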
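
The `or` behaviour within each list can be sketched on a toy frame. The data below is made up, and the `date` column is assumed to hold year strings; the one-hot genre columns mirror the layout of the movies dataframe.

```python
import pandas as pd

# Hypothetical toy movies with one-hot genre columns
toy_movies = pd.DataFrame({
    'movie':   ['A (2015)', 'B (2016)', 'C (1999)', 'D (2017)'],
    'date':    ['2015', '2016', '1999', '2017'],
    'History': [1, 0, 1, 0],
    'News':    [0, 0, 0, 1],
})

toy_years = ['2015', '2016', '2017']
toy_genres = ['History', 'News']

# A movie passes the year filter if it matches ANY listed year
year_mask = toy_movies['date'].isin(toy_years)
# A movie passes the genre filter if it carries ANY listed genre
genre_mask = toy_movies[toy_genres].sum(axis=1) > 0

filtered_titles = list(toy_movies.loc[year_mask & genre_mask, 'movie'])
```

Within each list the match is an `or`, while the two filters are combined with `and`, which is also what applying them one after the other does.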

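Feel free to adjust other inputs as needed to retrieve the final results you want.

Write a few tests against our test function. The line below returns the top 20 popular movies for user 1 with the specified year and genre filters. Does your code return the same results?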

```
t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])
```

In [152]:
def popular_recs_filtered(user_id, n_top, ranked_movies, years=None, genres=None):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time
    years - a list of strings with years of movies
    genres - a list of strings with genres of movies
    
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    # Filter movies based on year and genre
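    if years is not None:
        # Compare as strings so the filter works whether 'date' was read as int or str
        ranked_movies = ranked_movies[ranked_movies['date'].astype(str).isin(years)]
        ## if years are given, keep only movies released in one of those years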

    if genres is not None:
        num_genre_match = ranked_movies[genres].sum(axis=1)
        ranked_movies = ranked_movies.loc[num_genre_match > 0, :]
        ## if genres are given, keep only movies tagged with at least one of them
            
    # create top movies list 
    top_movies = list(ranked_movies['movie'][:n_top])
    ## return the first n_top titles as the recommendation list

    return top_movies

In [153]:
# Top 20 movies recommended for id 1 with years=['2015', '2016', '2017', '2018'], genres=['History']
recs_20_for_1_filtered = popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])

# Top 5 movies recommended for id 53968 with no genre filter but years=['2015', '2016', '2017', '2018']
recs_5_for_53968_filtered = popular_recs_filtered('53968', 5, ranked_movies, years=['2015', '2016', '2017', '2018'])

# Top 100 movies recommended for id 70000 with no year filter but genres=['History', 'News']
recs_100_for_70000_filtered = popular_recs_filtered('70000', 100, ranked_movies, genres=['History', 'News'])

In [163]:
display(recs_20_for_1_filtered[:3])
display(recs_5_for_53968_filtered[:3])
display(recs_100_for_70000_filtered[:3])
## once year and genre filters are supplied, the output can differ from the unfiltered list

['Ayla: The Daughter of War (2017)',
 'I Believe in Miracles (2015)',
 'The Farthest (2017)']

['MSG 2 the Messenger (2015)',
 'Avengers: Age of Ultron Parody (2015)',
 'Sorry to Bother You (2018)']

['Birlesen Gonuller (2014)',
 'Seppuku (1962)',
 "La passion de Jeanne d'Arc (1928)"]

In [11]:
### You Should Not Need To Modify Anything In This Cell

# check 1 
assert t.popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History']) == recs_20_for_1_filtered,  "The first check failed..."
# check 2
assert t.popular_recs_filtered('53968', 5, ranked_movies, years=['2015', '2016', '2017', '2018']) == recs_5_for_53968_filtered,  "The second check failed..."
# check 3
assert t.popular_recs_filtered('70000', 100, ranked_movies, genres=['History', 'News']) == recs_100_for_70000_filtered,  "The third check failed..."

print("If you got here, looks like you are good to go!  Nice job!")

If you got here, looks like you are good to go!  Nice job!
