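### Evaluating Performance

In the previous notebook, you created an SVD function that still works when many values are missing. Great! The question is: how well does it actually perform?

In this notebook, we will simulate a real-world situation as closely as possible and fine-tune our recommender system.

Run the cell below to read in the data and get started.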

In [1]:
import numpy as np
import pandas as pd

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

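`1.` Perform the following tasks on the **reviews** dataframe to create a training set and a validation set, then use an **offline** validation technique to test how well the SVD algorithm works.

 * Sort the reviews dataframe from earliest to most recent
 * Take the first 10000 reviews from the dataset
 * Use the first 8000 of those 10000 reviews as training data
 * Use the last 2000 of those 10000 reviews as test data
 * Return the training and test datasets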

In [2]:
def create_train_test(reviews, order_by, training_size, testing_size):
    '''    
    INPUT:
    reviews - (pandas df) dataframe to split into train and test
    order_by - (string) column name to sort by
    training_size - (int) number of rows in training set
    testing_size - (int) number of rows in the test set
    
    OUTPUT:
    training_df -  (pandas df) dataframe of the training set
    validation_df - (pandas df) dataframe of the test set
    '''
    reviews_new = reviews.sort_values(order_by)
    training_df = reviews_new.head(training_size)
    validation_df = reviews_new.iloc[training_size:training_size+testing_size]
    
    return training_df, validation_df

In [3]:
# Nothing to change in this or the next cell
# Use our function to create training and test datasets
train_df, val_df = create_train_test(reviews, 'date', 8000, 2000)

In [4]:
# Make sure the dataframes we are using are the right shape
assert train_df.shape[0] == 8000, "The number of rows doesn't look right in the training dataset."
assert val_df.shape[0] == 2000, "The number of rows doesn't look right in the validation dataset"
assert str(train_df.tail(1)['date']).split()[1] == '2013-03-15', "The last date in the training dataset doesn't look like what we expected."
assert str(val_df.tail(1)['date']).split()[1] == '2013-03-18', "The last date in the validation dataset doesn't look like what we expected."
print("Nice job!  Looks like you have written a function that provides training and validation dataframes for you to use in the next steps.")

Nice job!  Looks like you have written a function that provides training and validation dataframes for you to use in the next steps.


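In practice, we would probably use all data up to some cutoff date as training data, then see how well we predict each new rating that shows up in the test dataset.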
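A date-based split like the one described here could be sketched as follows. This is a minimal illustration, not part of the exercise: the helper name `split_by_cutoff` and the toy data are assumptions made up for the example, and the cutoff date is arbitrary.

```python
import pandas as pd

def split_by_cutoff(reviews, date_col, cutoff):
    """Split reviews into rows dated on or before `cutoff` and rows dated after it.

    Hypothetical helper for illustration; it is not defined elsewhere in this notebook.
    """
    dates = pd.to_datetime(reviews[date_col])
    cutoff = pd.Timestamp(cutoff)
    train = reviews[dates <= cutoff]
    test = reviews[dates > cutoff]
    return train, test

# Toy data, made up for illustration only
toy_reviews = pd.DataFrame({
    'user_id': [1, 2, 1, 3],
    'movie_id': [10, 10, 20, 30],
    'rating': [8, 6, 9, 4],
    'date': ['2013-03-14', '2013-03-15', '2013-03-16', '2013-03-18'],
})

# Everything up to the cutoff is training data; later reviews are held out
toy_train, toy_test = split_by_cutoff(toy_reviews, 'date', '2013-03-15')
```

Unlike the fixed 8000/2000 row counts used in this notebook, a cutoff date lets the training set grow as new reviews arrive.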

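Below is a working function from the earlier exercises. You can use it as is or swap in your own.

`2.` Fit the function to the training data with the following hyperparameters: 15 latent features, a learning rate of 0.005, and 250 iterations. This takes a while to run. To speed it up, you can use fewer latent features, a higher learning rate, or fewer iterations.

**Note:** This is a good time to take a walk, take a break, or make a phone call. You don't need to change the code below unless you want results sooner.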

In [5]:
def FunkSVD(ratings_mat, latent_features=12, learning_rate=0.001, iters=50):
    '''
    This function performs matrix factorization using a basic form of FunkSVD with no regularization
    
    INPUT:
    ratings_mat - (numpy array) a matrix with users as rows, movies as columns, and ratings as values
    latent_features - (int) the number of latent features used
    learning_rate - (float) the learning rate 
    iters - (int) the number of iterations
    
    OUTPUT:
    user_mat - (numpy array) a user by latent feature matrix
    movie_mat - (numpy array) a latent feature by movie matrix
    '''
    
    # Set up useful values to be used through the rest of the function
    n_users = ratings_mat.shape[0]
    n_movies = ratings_mat.shape[1]
    num_ratings = np.count_nonzero(~np.isnan(ratings_mat))
    
    # initialize the user and movie matrices with random values
    user_mat = np.random.rand(n_users, latent_features)
    movie_mat = np.random.rand(latent_features, n_movies)
    
    # initialize sse at 0 for first iteration
    sse_accum = 0
    
    # keep track of iteration and MSE
    print("Optimization Statistics")
    print("Iterations | Mean Squared Error ")
    
    # for each iteration
    for iteration in range(iters):

        # update our sse
        old_sse = sse_accum
        sse_accum = 0
        
        # For each user-movie pair
        for i in range(n_users):
            for j in range(n_movies):
                
                # if the rating exists
                if ratings_mat[i, j] > 0:
                    
                    # compute the error as the actual minus the dot product of the user and movie latent features
                    diff = ratings_mat[i, j] - np.dot(user_mat[i, :], movie_mat[:, j])
                    
                    # Keep track of the sum of squared errors for the matrix
                    sse_accum += diff**2
                    
                    # update the values in each matrix in the direction of the gradient
                    for k in range(latent_features):
                        user_mat[i, k] += learning_rate * (2*diff*movie_mat[k, j])
                        movie_mat[k, j] += learning_rate * (2*diff*user_mat[i, k])

        # print results
        print("%d \t\t %f" % (iteration+1, sse_accum / num_ratings))
        
    return user_mat, movie_mat 

In [6]:
# Create user-by-item matrix - nothing to do here
train_user_item = train_df[['user_id', 'movie_id', 'rating', 'timestamp']]
train_data_df = train_user_item.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
train_data_np = np.array(train_data_df)

# Fit FunkSVD with the specified hyper parameters to the training data
user_mat, movie_mat = FunkSVD(train_data_np, latent_features=15, learning_rate=0.005, iters=250)

Optimization Statistics
Iterations | Mean Squared Error 
1 		 10.733528
2 		 6.007291
3 		 4.202513
4 		 3.146491
5 		 2.452975
6 		 1.964293
7 		 1.603168
8 		 1.327189
9 		 1.110957
10 		 0.938229
11 		 0.798115
12 		 0.683055
13 		 0.587637
14 		 0.507882
15 		 0.440783
16 		 0.384014
17 		 0.335745
18 		 0.294512
19 		 0.259137
20 		 0.228666
21 		 0.202319
22 		 0.179459
23 		 0.159561
24 		 0.142188
25 		 0.126979
26 		 0.113629
27 		 0.101883
28 		 0.091524
29 		 0.082367
30 		 0.074256
31 		 0.067056
32 		 0.060653
33 		 0.054946
34 		 0.049850
35 		 0.045291
36 		 0.041206
37 		 0.037539
38 		 0.034242
39 		 0.031271
40 		 0.028592
41 		 0.026171
42 		 0.023982
43 		 0.021998
44 		 0.020198
45 		 0.018563
46 		 0.017077
47 		 0.015723
48 		 0.014490
49 		 0.013364
50 		 0.012336
51 		 0.011396
52 		 0.010536
53 		 0.009749
54 		 0.009027
55 		 0.008364
56 		 0.007756
57 		 0.007196
58 		 0.006682
59 		 0.006209
60 		 0.005772
61 		 0.005370
62 		 0.004999
63 		 0.004657
64 		 

Now that we have **user_mat** and **movie_mat**, we can predict a user's rating for a movie by taking the dot product of the user's row and the movie's column.
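To make the dot product concrete, here is a small sketch with made-up factor matrices. The names `toy_user_mat` and `toy_movie_mat` and all values are assumptions for illustration; they are not the matrices fitted above.

```python
import numpy as np

# Toy factor matrices: 2 users, 3 latent features, 2 movies (values made up)
toy_user_mat = np.array([[1.0, 0.5, 2.0],
                         [0.0, 1.0, 1.0]])
toy_movie_mat = np.array([[2.0, 1.0],
                          [4.0, 0.0],
                          [1.0, 3.0]])

# Prediction for the user in row 0 and the movie in column 1:
# 1.0*1.0 + 0.5*0.0 + 2.0*3.0
single_pred = np.dot(toy_user_mat[0, :], toy_movie_mat[:, 1])

# All user-movie predictions at once, as a users-by-movies matrix
all_preds = np.dot(toy_user_mat, toy_movie_mat)
```

The full product gives a prediction for every user-movie pair that appeared in training, including pairs with no observed rating.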

`3.` Use the comments below to complete the **predict_rating** function.

In [8]:
def predict_rating(user_matrix, movie_matrix, user_id, movie_id):
    '''
    INPUT:
    user_matrix - user by latent factor matrix
    movie_matrix - latent factor by movie matrix
    user_id - the user_id from the reviews df
    movie_id - the movie_id according to the movies df
    
    OUTPUT:
    pred - the predicted rating for user_id-movie_id according to FunkSVD
    '''
    # Use the training data to create a series of users and movies that matches the ordering in training data
    user_ids_series = np.array(train_data_df.index)
    movie_ids_series = np.array(train_data_df.columns)
    
    # User row and Movie Column
    user_row = np.where(user_ids_series == user_id)[0][0]
    movie_col = np.where(movie_ids_series == movie_id)[0][0]
    
    # Take dot product of that row and column in U and V to make prediction
    pred = np.dot(user_matrix[user_row, :], movie_matrix[:, movie_col])
    
    return pred

In [9]:
# Test your function with the first user-movie in the user-movie matrix (notice this is a nan)
pred_val = predict_rating(user_mat, movie_mat, 8, 2844)
pred_val

6.934642271979579

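Now you can make predictions. It would be even better to get a description of the user, the movie, and the rating.

`4.` Use the comments below to complete the **print_prediction_summary** function.

**Note:** The movie title comes back in a slightly messy format, so the solution adjusts the code a little to clean it up.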

In [10]:
def print_prediction_summary(user_id, movie_id, prediction):
    '''
    INPUT:
    user_id - the user_id from the reviews df
    movie_id - the movie_id according to the movies df
    prediction - the predicted rating for user_id-movie_id
    '''
    # Look up the title directly instead of slicing the string form of the Series
    movie_name = movies.loc[movies['movie_id'] == movie_id, 'movie'].iloc[0]
    print("For user {} we predict a {} rating for the movie {}.".format(user_id, round(prediction, 2), movie_name))

In [11]:
# Test your function with the results of the previous function
print_prediction_summary(8, 2844, pred_val)

For user 8 we predict a 6.93 rating for the movie Fantômas - À l'ombre de la guillotine (1913).


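Now that we can predict ratings, let's check how well the function predicts ratings that already exist. This tells us how well we captured the latent features and how well we can use them to make predictions going forward.

`5.` For each user-movie rating in the **val_df** dataset, compare the actual rating to the predicted rating. How do the predictions look? Do you run into any problems? If so, what are they? Use the docstring and comments below to answer these questions.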

In [12]:
def validation_comparison(val_df, num_preds):
    '''
    INPUT:
    val_df - the validation dataset created in the third cell above
    num_preds - (int) the number of rows (going in order) you would like to make predictions for
    
    OUTPUT:
    Nothing returned - print a statement about the prediction made for each row of val_df from row 0 to num_preds
    '''
    val_users = np.array(val_df['user_id'])
    val_movies = np.array(val_df['movie_id'])
    val_ratings = np.array(val_df['rating'])
    
    
    for idx in range(num_preds):
        pred = predict_rating(user_mat, movie_mat, val_users[idx], val_movies[idx])
        print("The actual rating for user {} on movie {} is {}.\n While the predicted rating is {}.".format(val_users[idx], val_movies[idx], val_ratings[idx], round(pred))) 

        
# Perform the predicted vs. actual for the first 6 rows.  How does it look?
validation_comparison(val_df, 6)        

The actual rating for user 49056 on movie 1598822 is 8.
 While the predicted rating is 6.0.
The actual rating for user 49056 on movie 289879 is 9.
 While the predicted rating is 8.0.
The actual rating for user 49056 on movie 1563738 is 9.
 While the predicted rating is 7.0.
The actual rating for user 49056 on movie 1458175 is 4.
 While the predicted rating is 7.0.
The actual rating for user 28599 on movie 103639 is 8.
 While the predicted rating is 8.0.
The actual rating for user 50593 on movie 1560985 is 4.
 While the predicted rating is 5.0.


In [14]:
# Perform the predicted vs. actual for the first 7 rows.  What happened?
validation_comparison(val_df, 7)
# This raises an error; the explanation is at the end

The actual rating for user 49056 on movie 1598822 is 8.
 While the predicted rating is 6.0.
The actual rating for user 49056 on movie 289879 is 9.
 While the predicted rating is 8.0.
The actual rating for user 49056 on movie 1563738 is 9.
 While the predicted rating is 7.0.
The actual rating for user 49056 on movie 1458175 is 4.
 While the predicted rating is 7.0.
The actual rating for user 28599 on movie 103639 is 8.
 While the predicted rating is 8.0.
The actual rating for user 50593 on movie 1560985 is 4.
 While the predicted rating is 5.0.


IndexError: index 0 is out of bounds for axis 0 with size 0

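**Explain why this happened.**

**The movie in the 7th row has no ratings in the training data, so it has no column in the user-item matrix. `np.where` returns an empty array for it in `predict_rating`, and indexing that array with `[0][0]` raises the `IndexError`. We cannot make a prediction for this user-movie pair.**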
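One way to keep validation running past such pairs is to check whether the lookup found anything before indexing. The sketch below is an assumption about how that might be done, not the notebook's solution: `safe_predict` and `validation_rmse` are hypothetical names, and the toy matrices and ratings are made up. In this notebook, `user_ids` and `movie_ids` would correspond to `np.array(train_data_df.index)` and `np.array(train_data_df.columns)`.

```python
import numpy as np

def safe_predict(user_matrix, movie_matrix, user_ids, movie_ids, user_id, movie_id):
    """Return the dot-product prediction, or None if the user or movie was not in training."""
    user_rows = np.where(user_ids == user_id)[0]
    movie_cols = np.where(movie_ids == movie_id)[0]
    if user_rows.size == 0 or movie_cols.size == 0:
        return None  # no latent features were learned for this id
    return np.dot(user_matrix[user_rows[0], :], movie_matrix[:, movie_cols[0]])

def validation_rmse(triples, user_matrix, movie_matrix, user_ids, movie_ids):
    """Compute RMSE over the predictable (user, movie, rating) triples and count the skipped ones."""
    sq_errors, skipped = [], 0
    for user_id, movie_id, actual in triples:
        pred = safe_predict(user_matrix, movie_matrix, user_ids, movie_ids, user_id, movie_id)
        if pred is None:
            skipped += 1
        else:
            sq_errors.append((actual - pred) ** 2)
    rmse = np.sqrt(np.mean(sq_errors)) if sq_errors else float('nan')
    return rmse, skipped

# Toy data, made up for illustration only
toy_user_ids = np.array([8, 49056])
toy_movie_ids = np.array([2844, 103639])
toy_user_mat = np.array([[1.0, 2.0], [2.0, 1.0]])
toy_movie_mat = np.array([[3.0, 1.0], [2.0, 2.0]])

# The last triple refers to a movie id that never appeared in training
toy_triples = [(8, 2844, 8), (49056, 103639, 4), (8, 999999, 7)]
rmse, skipped = validation_rmse(toy_triples, toy_user_mat, toy_movie_mat,
                                toy_user_ids, toy_movie_ids)
```

Reporting the number of skipped pairs alongside the error matters, because a low error over only a small share of the validation rows says little about the pairs the model cannot score.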
