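### Cold Start Problem

In the previous notebook, you first encountered the **cold start problem**: when a new user or a new movie appears, collaborative filtering methods cannot make a prediction.

To cover those cases, you will need the techniques from the previous lessons, such as content-based recommendations for new items or rank-based recommendations for new users.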
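As a rough illustration of what "new" means here, the sketch below flags which rows of a held-out set can be handled by collaborative filtering. The dataframes `toy_train` and `toy_val` and their values are made up for the example; only the `user_id`, `movie_id`, and `rating` column names follow this notebook.

```python
import pandas as pd

# Hypothetical toy data: ratings seen during training and ratings to predict
toy_train = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [10, 20, 10],
                          'rating': [8, 6, 9]})
toy_val = pd.DataFrame({'user_id': [1, 3, 2],
                        'movie_id': [20, 10, 30],
                        'rating': [7, 5, 8]})

known_users = set(toy_train['user_id'])
known_movies = set(toy_train['movie_id'])

# A pair is predictable by collaborative filtering only if both ids were seen in training
toy_val['new_user'] = ~toy_val['user_id'].isin(known_users)
toy_val['new_movie'] = ~toy_val['movie_id'].isin(known_movies)
toy_val['cf_possible'] = ~(toy_val['new_user'] | toy_val['new_movie'])

print(toy_val)
```

Rows flagged as `new_user` would need a rank-based fallback, and rows flagged as `new_movie` a content-based one.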

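In this final step of building the recommender system, we will handle these edge cases. To begin, run the cell below.

### Matrix Factorization - Collaborative Filtering Where Possible

Running the cell below gives you the following:

`1.` **reviews** - a dataframe of reviews

`2.` **movies** - a dataframe of movies

`3.` **create_train_test** - a function for creating the training and validation datasets

`4.` **predict_rating** - a function that takes a user and a movie and returns a prediction using FunkSVD

`5.` **train_df** and **val_df** - the training and validation sets used in the previous notebook

`6.` **user_mat** and **movie_mat** - the u and v matrices obtained from FunkSVD

`7.` **train_data_df** - the user-movie matrix of ratings on which FunkSVD was performed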

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import pickle

# Read in the datasets
movies = pd.read_csv('data/movies_clean.csv')
reviews = pd.read_csv('data/reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

def create_train_test(reviews, order_by, training_size, testing_size):
    '''    
    INPUT:
    reviews - (pandas df) dataframe to split into train and test
    order_by - (string) column name to sort by
    training_size - (int) number of rows in training set
    testing_size - (int) number of rows in the test set
    
    OUTPUT:
    training_df -  (pandas df) dataframe of the training set
    validation_df - (pandas df) dataframe of the test set
    '''
    reviews_new = reviews.sort_values(order_by)
    training_df = reviews_new.head(training_size)
    validation_df = reviews_new.iloc[training_size:training_size+testing_size]
    
    return training_df, validation_df

def predict_rating(user_matrix, movie_matrix, user_id, movie_id):
    '''
    INPUT:
    user_matrix - user by latent factor matrix
    movie_matrix - latent factor by movie matrix
    user_id - the user_id from the reviews df
    movie_id - the movie_id according the movies df
    
    OUTPUT:
    pred - the predicted rating for user_id-movie_id according to FunkSVD
    '''
    # Create series of users and movies in the right order
    user_ids_series = np.array(train_data_df.index)
    movie_ids_series = np.array(train_data_df.columns)
    
    # User row and Movie Column
    user_row = np.where(user_ids_series == user_id)[0][0]
    movie_col = np.where(movie_ids_series == movie_id)[0][0]
    
    # Take dot product of that row and column in U and V to make prediction
    pred = np.dot(user_matrix[user_row, :], movie_matrix[:, movie_col])
    
    return pred

# Use our function to create training and test datasets
train_df, val_df = create_train_test(reviews, 'date', 8000, 2000)

# Create user-by-item matrix - this will keep track of order of users and movies in u and v
train_user_item = train_df[['user_id', 'movie_id', 'rating', 'timestamp']]
train_data_df = train_user_item.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
train_data_np = np.array(train_data_df)

# Read in user and movie matrices (context managers close the files automatically)
with open("user_matrix", 'rb') as user_file:
    user_mat = pickle.load(user_file)

with open("movie_matrix", 'rb') as movie_file:
    movie_mat = pickle.load(movie_file)
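To make the mechanics of `predict_rating` concrete, here is a minimal sketch on made-up factor matrices. The names `toy_user_mat`, `toy_movie_mat`, `toy_user_ids`, `toy_movie_ids`, and `toy_predict` are hypothetical stand-ins; the id arrays play the role of `train_data_df.index` and `train_data_df.columns`.

```python
import numpy as np

# Hypothetical factor matrices: 2 users x 2 latent factors, 2 latent factors x 3 movies
toy_user_mat = np.array([[1.0, 2.0],
                         [0.5, 1.5]])
toy_movie_mat = np.array([[2.0, 1.0, 0.0],
                          [3.0, 0.5, 1.0]])

# Labels that map ids to rows of U and columns of V
toy_user_ids = np.array([101, 102])
toy_movie_ids = np.array([11, 12, 13])

def toy_predict(user_id, movie_id):
    # Locate the user's row and the movie's column, then take the dot product
    user_row = np.where(toy_user_ids == user_id)[0][0]
    movie_col = np.where(toy_movie_ids == movie_id)[0][0]
    return np.dot(toy_user_mat[user_row, :], toy_movie_mat[:, movie_col])

print(toy_predict(101, 11))  # 1*2 + 2*3 = 8.0

# An id that is missing from the labels has no row to look up
try:
    toy_predict(999, 11)
except IndexError:
    print("user 999 is not in the factor matrix")
```

The `IndexError` in the second call is the cold start problem in miniature: an id that never appeared in training has no latent factors to multiply.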

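### Validating Predictions

Unfortunately, you cannot make a prediction for every user-movie combination in the test set, because some of those users or movies are new.

However, you can make predictions for the user-movie pairs that do exist in the user_mat and movie_mat matrices.

`1.` Complete the function below to see how far off the predicted ratings are, on average, from the actual ratings.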

In [None]:
def validation_comparison(val_df, user_mat=user_mat, movie_mat=movie_mat):
    '''
    INPUT:
    val_df - the validation dataset created in the third cell above
    user_mat - U matrix in FunkSVD
    movie_mat - V matrix in FunkSVD
        
    OUTPUT:
    rmse - RMSE of how far off each value is from its predicted value
    perc_rated - percent of predictions out of all possible that could be rated
    actual_v_pred - a 10 x 10 grid with counts for actual vs predicted values
    preds - (list) predictions for any user-movie pairs where it was possible to make a prediction
    acts - (list) actual values for any user-movie pairs where it was possible to make a prediction
    '''

    return rmse, perc_rated, actual_v_pred, preds, acts
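The function above is left as an exercise. As a hedged sketch of the two scalar outputs only, the snippet below computes an RMSE and the percentage of predictable pairs on made-up data. All `toy_*` names and values are hypothetical, and the 10 x 10 actual-versus-predicted grid is omitted.

```python
import numpy as np

# Hypothetical toy factorization: 2 known users, 2 known movies, 2 latent factors
toy_user_ids = [1, 2]
toy_movie_ids = [10, 20]
toy_u = np.array([[1.0, 2.0], [2.0, 1.0]])
toy_v = np.array([[3.0, 1.0], [2.0, 4.0]])

# (user_id, movie_id, actual rating) triples standing in for rows of val_df
toy_val = [(1, 10, 8), (2, 20, 6), (3, 10, 5), (1, 30, 9)]

preds, acts = [], []
for user_id, movie_id, actual in toy_val:
    # Skip pairs that involve a user or movie missing from the factor matrices
    if user_id not in toy_user_ids or movie_id not in toy_movie_ids:
        continue
    row = toy_user_ids.index(user_id)
    col = toy_movie_ids.index(movie_id)
    preds.append(float(np.dot(toy_u[row, :], toy_v[:, col])))
    acts.append(actual)

# Root mean squared error over the pairs that could be predicted
rmse = np.sqrt(np.mean((np.array(acts) - np.array(preds)) ** 2))

# Share of validation pairs for which a prediction was possible
perc_rated = 100 * len(preds) / len(toy_val)

print(rmse, perc_rated)
```

Here two of the four pairs are skipped because user 3 and movie 30 never appeared in training, so the RMSE describes only half of the validation set.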

In [None]:
# How well did we do? Make some plots and calculate some statistics to
# understand how well this technique is working


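`2.` Our predictions were not that bad. But for how many user-movie pairs were we unable to make a prediction? Answer this question in the cell below.

### Content-Based Recommendations for New Movies

If everything above went well, you will notice that there are still edge cases to handle. For these new users and movies, we need to apply some of the code written in the previous lessons. The code below makes content-based recommendations by finding movies that are similar to a given movie. It comes from **5_Content_Based_Recommendations** in the previous lesson.

The function **find_similar_movies** below returns the movies most similar to any given movie, based on content alone.

Run the cell below to load the content-based similarity functions.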

In [None]:
# Subset so movie_content is only using the dummy variables for each genre and the 3 century based year dummy columns
movie_content = np.array(movies.iloc[:,4:])

# Take the dot product to obtain a movie x movie matrix of similarities
dot_prod_movies = movie_content.dot(np.transpose(movie_content))


def find_similar_movies(movie_id):
    '''
    INPUT
    movie_id - a movie_id 
    OUTPUT
    similar_movies - an array of the most similar movies by title
    '''
    # find the row of each movie id
    movie_idx = np.where(movies['movie_id'] == movie_id)[0][0]
    
    # find the most similar movie indices - to start I said they need to be the same for all content
    similar_idxs = np.where(dot_prod_movies[movie_idx] == np.max(dot_prod_movies[movie_idx]))[0]
    
    # pull the movie titles based on the indices
    similar_movies = np.array(movies.iloc[similar_idxs, ]['movie'])
    
    return similar_movies
    
    
def get_movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    
    '''
    movie_lst = list(movies[movies['movie_id'].isin(movie_ids)]['movie'])
   
    return movie_lst
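The dot-product similarity used by `find_similar_movies` can be traced by hand on a small example. The sketch below uses a made-up `toy_movies` dataframe with three genre dummy columns; `toy_similar` is a hypothetical helper that follows the same steps as the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical movies with binary genre dummies
toy_movies = pd.DataFrame({
    'movie_id': [1, 2, 3, 4],
    'movie': ['A', 'B', 'C', 'D'],
    'Comedy': [1, 1, 0, 1],
    'Drama': [0, 0, 1, 0],
    'Action': [1, 1, 0, 0],
})

# movie x movie matrix: each entry counts the content flags two movies share
toy_content = np.array(toy_movies[['Comedy', 'Drama', 'Action']])
toy_dot = toy_content.dot(toy_content.T)

def toy_similar(movie_id):
    # Row of the requested movie, then every movie that ties for the top score
    idx = np.where(toy_movies['movie_id'] == movie_id)[0][0]
    best = np.where(toy_dot[idx] == np.max(toy_dot[idx]))[0]
    return np.array(toy_movies.iloc[best]['movie'])

print(toy_similar(1))  # movies sharing both Comedy and Action with A
print(toy_similar(4))  # movies sharing Comedy with D
```

Two properties are worth noticing. With binary features a movie always ties with itself for the top score, so the query movie appears in its own results. The measure is also asymmetric: A and B count as most similar to D because they contain all of D's genres, but D is not among the most similar movies to A.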

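### Rank-Based Recommendations for New Users

With the two code cells above, we can make recommendations for user-movie pairs that have ratings somewhere in the user-movie matrix. We can also use movie similarity to handle movies that have never received a rating.

In this last part, we need to make recommendations for new users. We can reuse the functions created in **2_Most_Popular_Recommendations** from the first lesson.

Run the cell below to load the rank-based functions.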

In [None]:
def create_ranked_df(movies, reviews):
        '''
        INPUT
        movies - the movies dataframe
        reviews - the reviews dataframe
        
        OUTPUT
        ranked_movies - a dataframe with movies that are sorted by highest avg rating, more reviews, 
                        then time, and must have more than 4 ratings
        '''
        
        # Pull the average ratings and number of ratings for each movie
        movie_ratings = reviews.groupby('movie_id')['rating']
        avg_ratings = movie_ratings.mean()
        num_ratings = movie_ratings.count()
        # Select the date column before aggregating so only that column is reduced
        last_rating = pd.DataFrame(reviews.groupby('movie_id')['date'].max())
        last_rating.columns = ['last_rating']

        # Add Dates
        rating_count_df = pd.DataFrame({'avg_rating': avg_ratings, 'num_ratings': num_ratings})
        rating_count_df = rating_count_df.join(last_rating)

        # merge with the movies dataset
        movie_recs = movies.set_index('movie_id').join(rating_count_df)

        # sort by top avg rating and number of ratings
        ranked_movies = movie_recs.sort_values(['avg_rating', 'num_ratings', 'last_rating'], ascending=False)

        # for edge cases - subset the movie list to those with only 5 or more reviews
        ranked_movies = ranked_movies[ranked_movies['num_ratings'] > 4]
        
        return ranked_movies
    

def popular_recommendations(user_id, n_top, ranked_movies):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time

    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''

    top_movies = list(ranked_movies['movie'][:n_top])

    return top_movies
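The ranking logic in `create_ranked_df` can be followed on a few rows. This is a simplified sketch on made-up reviews: it leaves out the date tie-breaker and uses a minimum of 2 ratings instead of the notebook's 5, purely so the toy data stays small.

```python
import pandas as pd

# Hypothetical reviews: movie 3 has the best average but only one rating
toy_reviews = pd.DataFrame({
    'movie_id': [1, 1, 2, 2, 2, 3],
    'rating': [9, 7, 8, 8, 8, 10],
})

# Average rating and number of ratings per movie
toy_stats = toy_reviews.groupby('movie_id')['rating'].agg(
    avg_rating='mean', num_ratings='count')

# Drop thinly rated movies, then sort by average rating and rating count
toy_ranked = (toy_stats[toy_stats['num_ratings'] >= 2]
              .sort_values(['avg_rating', 'num_ratings'], ascending=False))

print(toy_ranked)
```

Movies 1 and 2 tie on average rating, so the rating count breaks the tie in favor of movie 2. Movie 3 is excluded despite its perfect score, which is the purpose of the minimum-count filter.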
        

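### Your Task

The cells above set up everything needed to make predictions. Your task is to write a function that uses this information, as necessary, to provide recommendations for every user in the **val_df** dataframe. There is no single right answer, but combining the three methods is probably the most appropriate solution.

You can see how I combined these methods in the video on the next page, but feel free to come up with your own creative approach.

`3.` Use the function and docstring below to complete the task for this notebook.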

In [None]:
def make_recommendations(_id, _id_type='movie', train_data=train_data_df, 
                         train_df=train_df, movies=movies, rec_num=5,
                         user_mat=user_mat, movie_mat=movie_mat):
    '''
    INPUT:
    _id - either a user or movie id (int)
    _id_type - "movie" or "user" (str)
    train_data - dataframe of data as user-movie matrix
    train_df - dataframe of training data reviews
    movies - movies df
    rec_num - number of recommendations to return (int)
    user_mat - the U matrix of matrix factorization
    movie_mat - the V matrix of matrix factorization
    
    OUTPUT:
    rec_ids - (array) a list or numpy array of recommended movies by id                  
    rec_names - (array) a list or numpy array of recommended movies by name
    '''

    
    return rec_ids, rec_names
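One possible way to arrange the three methods is a simple dispatch on whether the id is known. The sketch below is not the course solution; it runs on made-up stand-ins (`toy_u`, `toy_v`, `toy_popular`, `toy_similar`) rather than the notebook's objects, and `toy_recommend` is a hypothetical name.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's objects
toy_user_ids = [1, 2]                       # users present in the factor matrix
toy_movie_ids = [10, 20, 30]                # movies present in the factor matrix
toy_u = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_v = np.array([[5.0, 3.0, 1.0], [2.0, 4.0, 6.0]])
toy_popular = [20, 10, 30]                  # movie ids ranked best to worst
toy_similar = {10: [20], 20: [10, 30], 30: [20]}  # content-based neighbours

def toy_recommend(_id, _id_type='user', rec_num=2):
    if _id_type == 'user':
        if _id in toy_user_ids:
            # Known user: rank every movie by its predicted rating
            preds = np.dot(toy_u[toy_user_ids.index(_id), :], toy_v)
            order = np.argsort(preds)[::-1][:rec_num]
            return [toy_movie_ids[i] for i in order]
        # New user: fall back to the rank-based list
        return toy_popular[:rec_num]
    # Movie id: fall back to content-based neighbours
    return toy_similar.get(_id, [])[:rec_num]

print(toy_recommend(1))            # known user, matrix factorization
print(toy_recommend(99))           # new user, most popular movies
print(toy_recommend(20, 'movie'))  # movie id, similar movies
```

A fuller version would also exclude movies the user has already rated and map ids to titles, for example with `get_movie_names`.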

In [None]:
# Use these cells to see that you can truly predict for everyone in the test set
# Do you see anything insightful?


**Describe your findings in this cell.**

