### Singular Value Decomposition

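So far, you have learned a bit about singular value decomposition (SVD). In this notebook, you will get some hands-on practice with the technique.

Let's get started by reading in the libraries and setting up the data that will be used throughout this notebook.

`1.` Run the cell below to create the **user_movie_subset** DataFrame. You will use this DataFrame in the first part of this notebook.

**Note: the cell below can take about 10 minutes to run.**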

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import svd_tests as t
%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

# Create user-by-item matrix
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

user_movie_subset = user_by_movie[[73486, 75314,  68646, 99685]].dropna(axis=0)
print(user_movie_subset)

movie_id  73486  75314  68646  99685
user_id                             
265        10.0   10.0   10.0   10.0
1023       10.0    4.0    9.0   10.0
1683        8.0    9.0   10.0    5.0
6571        9.0    8.0   10.0   10.0
11639      10.0    5.0    9.0    9.0
13006       6.0    4.0   10.0    6.0
14076       9.0    8.0   10.0    9.0
14725      10.0    5.0    9.0    8.0
23548       7.0    8.0   10.0    8.0
24760       9.0    5.0    9.0    7.0
28713       9.0    8.0   10.0    8.0
30685       9.0   10.0   10.0    9.0
34110      10.0    9.0   10.0    8.0
34430       5.0    8.0    5.0    8.0
35150      10.0    8.0   10.0   10.0
43294       9.0    9.0   10.0   10.0
46849       9.0    8.0    8.0    8.0
50556      10.0    8.0    1.0   10.0
51382       5.0    6.0   10.0   10.0
51410       8.0    7.0   10.0    7.0
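
To make the `groupby(...).max().unstack()` step above more concrete, here is a minimal sketch on a made-up table of reviews. The ids and ratings are invented for illustration and do not come from `reviews_clean.csv`. It shows how long-format reviews become a user-by-movie matrix, and how `dropna(axis=0)` keeps only the users who rated every selected movie.

```python
import pandas as pd

# Hypothetical toy reviews in long format (one row per rating)
toy_reviews = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'movie_id': [10, 20, 10, 20, 10],
    'rating':   [8, 9, 7, 10, 6],
})

# Pivot to a user-by-movie matrix; ratings that do not exist become NaN
toy_user_by_movie = (
    toy_reviews.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
)

# Keep only the users who rated both movies
toy_subset = toy_user_by_movie[[10, 20]].dropna(axis=0)

print(toy_user_by_movie)
print(toy_subset)
```

User 3 rated only movie 10, so that row contains a NaN and is dropped from `toy_subset`.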


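`2.` Now that you have the **user_movie_subset** matrix, use it to match each key in the dictionary below to the correct value. Use the cell below the dictionary for your work.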

In [2]:
# match each letter to the best statement in the dictionary below - each will be used at most once
a = 20
b = 68646
c = 'The Godfather'
d = 'Goodfellas'
e = 265
f = 30685
g = 4

sol_1_dict = {
    'the number of users in the user_movie_subset': a,
    'the number of movies in the user_movie_subset': g,
    'the user_id with the highest average ratings given': e,
    'the movie_id with the highest average ratings received': b,
    'the name of the movie that received the highest average rating': c
}


#test dictionary here
t.test1(sol_1_dict)

That's right!  There are 20 users in the dataset, which is given by the number of rows. There are 4 movies in the dataset given by the number of columns.  You can find the movies or users with the highest average ratings by taking the mean of each row or column.  Using the movies table, you can find the movie names associated with each id.  This shows the top rated movie is The Godfather!


In [3]:
# Cell for work
# user with the highest average rating
print(user_movie_subset.mean(axis=1))

# movie with highest average rating
print(user_movie_subset.mean(axis=0))

# list of movie names
for movie_id in [73486, 75314,  68646, 99685]:
    print(movies[movies['movie_id'] == movie_id]['movie'])
    
# users by movies
user_movie_subset.shape

user_id
265      10.00
1023      8.25
1683      8.00
6571      9.25
11639     8.25
13006     6.50
14076     9.00
14725     8.00
23548     8.25
24760     7.50
28713     8.75
30685     9.50
34110     9.25
34430     6.50
35150     9.50
43294     9.50
46849     8.25
50556     7.25
51382     7.75
51410     8.00
dtype: float64
movie_id
73486    8.60
75314    7.35
68646    9.00
99685    8.50
dtype: float64
4187    One Flew Over the Cuckoo's Nest (1975)
Name: movie, dtype: object
4361    Taxi Driver (1976)
Name: movie, dtype: object
3706    The Godfather (1972)
Name: movie, dtype: object
6917    Goodfellas (1990)
Name: movie, dtype: object


(20, 4)
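
Instead of reading the highest mean off the printed output, `idxmax` can return the matching label directly. The sketch below uses only the first three rows of the matrix, copied from the printout above, so its result for the top movie is for this small sample and not for the full 20 users.

```python
import pandas as pd

# First three rows of user_movie_subset, copied from the printout above
sample = pd.DataFrame(
    [[10.0, 10.0, 10.0, 10.0],
     [10.0, 4.0, 9.0, 10.0],
     [8.0, 9.0, 10.0, 5.0]],
    index=pd.Index([265, 1023, 1683], name='user_id'),
    columns=pd.Index([73486, 75314, 68646, 99685], name='movie_id'),
)

top_user = sample.mean(axis=1).idxmax()   # row label with the highest mean rating
top_movie = sample.mean(axis=0).idxmax()  # column label with the highest mean rating
print(top_user, top_movie)
```

On the full matrix the same two calls would be `user_movie_subset.mean(axis=1).idxmax()` and `user_movie_subset.mean(axis=0).idxmax()`.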

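Now that you have a feel for the matrix we will be decomposing with SVD, let's perform the decomposition. First, recall the dimensions of each matrix we expect to get back. We are splitting the **user_movie_subset** matrix into three matrices: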

$$ U \Sigma V^T $$


`3.` Using what you learned earlier in this lesson, fill in the dictionary below with the dimensions of each of the matrices above.

In [4]:
# match each letter in the dictionary below - a letter may appear more than once.
a = 'a number that you can choose as the number of latent features to keep'
b = 'the number of users'
c = 'the number of movies'
d = 'the sum of the number of users and movies'
e = 'the product of the number of users and movies'

sol_2_dict = {
    'the number of rows in the U matrix': b, 
    'the number of columns in the U matrix': a, 
    'the number of rows in the V transpose matrix': a, 
    'the number of columns in the V transpose matrix': c
}

#test dictionary here
t.test2(sol_2_dict)

That's right!  We will now put this to use, so you can see how the dot product of these matrices comes together to create our user item matrix.  The number of latent features will control the sigma matrix as well, and this will be a square matrix that will at most be the minimum of the number of users and number of movies (in our case the minimum is the 4 movies).


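Now perform SVD on the user-movie matrix and verify the dimensions above.

`4.` Below is the code to perform SVD with NumPy. To learn more about this function, see [the documentation](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.linalg.svd.html). What do you notice about the shapes of these matrices? If you try to take the dot product of the three objects you get back, can you directly recover the user-movie matrix?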

In [5]:
u, s, vt = np.linalg.svd(user_movie_subset) # perform svd here on user_movie_subset
s.shape, u.shape, vt.shape

((4,), (20, 20), (4, 4))

In [6]:
# Run this cell for our thoughts on the questions posted above
t.question4thoughts()

Looking at the dimensions of the three returned objects, we can see the following:

 1. The u matrix is a square matrix with the number of rows and columns equaling the number of users. 

 2. The v transpose matrix is also a square matrix with the number of rows and columns equaling the number of items.

 3. The sigma matrix is actually returned as just an array with 4 values, but should be a diagonal matrix.  Numpy has a diag method to help with this.  

 In order to set up the matrices in a way that they can be multiplied together, we have a few steps to perform: 

 1. Turn sigma into a square matrix with the number of latent features we would like to keep. 

 2. Change the columns of u and the rows of v transpose to match this number of dimensions. 

 If we would like to exactly re-create the user-movie matrix, we could choose to keep all of the latent features.
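
As an alternative to slicing `u` by hand, `np.linalg.svd` accepts `full_matrices=False`, which returns the reduced factors directly. The sketch below uses a random 20 x 4 stand-in matrix (not the course data) to show the resulting shapes and the role of `np.diag`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in 20 x 4 ratings matrix with integer ratings from 1 to 10
A = rng.integers(1, 11, size=(20, 4)).astype(float)

# full_matrices=False returns the reduced ("economy") factors
u_r, s_r, vt_r = np.linalg.svd(A, full_matrices=False)
print(u_r.shape, s_r.shape, vt_r.shape)

# np.diag turns the 1-D array of singular values into a diagonal matrix
A_rebuilt = u_r @ np.diag(s_r) @ vt_r
print(np.allclose(A_rebuilt, A))
```

With the reduced factors the three shapes already line up for multiplication, so the product reproduces the input matrix up to floating-point error.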


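`5.` Use the thoughts from the previous question to create U, S, and V transpose matrices with 4 latent features. Once you have created all three matrices correctly, run the test below, which checks that the dot product of the three matrices reproduces the original user-movie matrix. The matrices should have the following dimensions:

$$ U_{n \times k} $$

$$ \Sigma_{k \times k} $$

$$ V^T_{k \times m} $$

where:

1. n is the number of users
2. k is the number of latent features to keep (4 for this case)
3. m is the number of movies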

In [7]:
# Change the dimensions of u, s, and vt as necessary to use four latent features
# update the shape of u and store in u_new
u_new = u[:, :len(s)]

# update the shape of s and store in s_new
s_new = np.zeros((len(s), len(s)))
s_new[:len(s), :len(s)] = np.diag(s) 

# Because we are using 4 latent features and there are only 4 movies, 
# vt and vt_new are the same
vt_new = vt

In [8]:
# Check your matrices against the solution
assert u_new.shape == (20, 4), "Oops!  The shape of the u matrix doesn't look right. It should be 20 by 4."
assert s_new.shape == (4, 4), "Oops!  The shape of the sigma matrix doesn't look right.  It should be 4 x 4."
assert vt_new.shape == (4, 4), "Oops! The shape of the v transpose matrix doesn't look right.  It should be 4 x 4."
assert np.allclose(np.dot(np.dot(u_new, s_new), vt_new), user_movie_subset), "Oops!  Something went wrong with the dot product.  Your result didn't reproduce the original movie_user matrix."
print("That's right! The dimensions of u should be 20 x 4, and both v transpose and sigma should be 4 x 4.  The dot product of the three matrices now equals the original user-movie matrix!")

That's right! The dimensions of u should be 20 x 4, and both v transpose and sigma should be 4 x 4.  The dot product of the three matrices now equals the original user-movie matrix!


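The Σ matrix tells us how much of the original variability in the user-movie matrix is captured by each latent feature. The total variability to be explained is the sum of the squared diagonal elements. The variability explained by the first component is the square of the first value on the diagonal, and the variability explained by the second component is the square of the second value.

`6.` Using the information above, can you determine how much of the variability in the original user-movie matrix is explained by using only the first two components? Try it in the cell below, then check your answer against the solution in the following cell.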

In [10]:
total_var = np.sum(s**2)
var_exp_comp1_and_comp2 = s[0]**2 + s[1]**2
perc_exp = round(var_exp_comp1_and_comp2/total_var*100, 2)
print("The total variance in the original matrix is {}.".format(total_var))
print("The percentage of variability captured by the first two components is {}%.".format(perc_exp))

The total variance in the original matrix is 5877.0.
The percentage of variability captured by the first two components is 98.55%.


In [11]:
assert np.round(perc_exp, 2) == 98.55, "Oops!  That doesn't look quite right.  You should have total variability as the sum of all the squared elements in the sigma matrix.  Then just the sum of the squared first two elements is the amount explained by the first two latent features.  Try again."
print("Yup!  That all looks good!")

Yup!  That all looks good!
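
The calculation above extends to any number of components with a cumulative sum. The sketch below runs on a made-up 20 x 4 matrix (not the course data) and also checks a standard identity: the sum of the squared singular values equals the sum of the squared matrix entries. In other words, the "total variance" used here is the sum of squared ratings, not a mean-centered variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up ratings matrix, used only for illustration
A_demo = rng.integers(1, 11, size=(20, 4)).astype(float)

# compute_uv=False returns only the singular values, sorted in descending order
s_demo = np.linalg.svd(A_demo, compute_uv=False)

# Total variability: sum of squared singular values
total = np.sum(s_demo ** 2)
print(np.isclose(total, np.sum(A_demo ** 2)))

# Cumulative share captured by the first 1, 2, ... components
cum_share = np.cumsum(s_demo ** 2) / total
print(np.round(cum_share, 4))
```

The entry `cum_share[1]` plays the same role as `perc_exp / 100` in the cell above.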


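`7.` Similar to the previous question, change the shapes of the U, Σ, and V transpose matrices. This time, reproduce the user-movie matrix using only the first two components instead of all four. After you have your matrices set up, run the tests and check your matrices against the solution. The matrices should have the following dimensions:

$$ U_{n \times k} $$

$$ \Sigma_{k \times k} $$

$$ V^T_{k \times m} $$

where:

1. n is the number of users
2. k is the number of latent features to keep (2 for this case)
3. m is the number of movies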

In [12]:
# Change the dimensions of u, s, and vt as necessary to use two latent features
# update the shape of u and store in u_2
k = 2
u_2 = u[:, :k]

# update the shape of s and store in s_2 (k x k diagonal matrix)
s_2 = np.diag(s[:k])

# Because we are using 2 latent features, we need to update vt this time
vt_2 = vt[:k, :]

In [13]:
# Check that your matrices are the correct shapes
assert u_2.shape == (20, 2), "Oops!  The shape of the u matrix doesn't look right. It should be 20 by 2."
assert s_2.shape == (2, 2), "Oops!  The shape of the sigma matrix doesn't look right.  It should be 2 x 2."
assert vt_2.shape == (2, 4), "Oops! The shape of the v transpose matrix doesn't look right.  It should be 2 x 4."
print("That's right! The dimensions of u should be 20 x 2, sigma should be 2 x 2, and v transpose should be 2 x 4. \n\n The question is now that we don't have all of the latent features, how well can we really re-create the original user-movie matrix?")

That's right! The dimensions of u should be 20 x 2, sigma should be 2 x 2, and v transpose should be 2 x 4. 

 The question is now that we don't have all of the latent features, how well can we really re-create the original user-movie matrix?


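`8.` When using all 4 latent features, we were able to exactly reproduce the user-movie matrix. With only 2 latent features, we can measure how well we reproduce the original matrix by looking at the sum of squared errors between the ratings produced by the dot product and the actual ratings. Compute the sum of squared errors using only the two latent features, then test your answer against the solution in the cell below.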

In [14]:
# Compute the dot product
pred_ratings = np.dot(np.dot(u_2, s_2), vt_2)

# Compute the squared error for each predicted vs. actual rating
sum_square_errs = np.sum(np.sum((user_movie_subset - pred_ratings)**2))

In [15]:
# Check against the solution
assert np.round(sum_square_errs, 2) == 85.34, "Oops!  That doesn't look quite right.  You should return a single number for the whole matrix."
print("That looks right!  Nice job!")

That looks right!  Nice job!
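
The error above is tied to the singular values that were dropped: for a rank-k truncation, the sum of squared errors equals the sum of the squared discarded singular values. This matches the numbers in this notebook, since 85.34 is about 1.45% of 5877.0, the share left unexplained by the first two components. The sketch below checks the identity for every k on a made-up 20 x 4 matrix (not the course data).

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up ratings matrix, used only for illustration
A_demo = rng.integers(1, 11, size=(20, 4)).astype(float)

u_d, s_d, vt_d = np.linalg.svd(A_demo, full_matrices=False)

sse_by_k = []
discarded_by_k = []
for k in range(1, len(s_d) + 1):
    # Rank-k reconstruction from the first k latent features
    approx = u_d[:, :k] @ np.diag(s_d[:k]) @ vt_d[:k, :]
    sse_by_k.append(np.sum((A_demo - approx) ** 2))
    # Squared singular values that were left out
    discarded_by_k.append(np.sum(s_d[k:] ** 2))

for k, (sse, dropped) in enumerate(zip(sse_by_k, discarded_by_k), start=1):
    print(k, round(sse, 4), round(dropped, 4))
```

The two columns agree for each k, and the error falls to zero (up to floating-point error) once all four features are kept.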


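At this point, you may be thinking: why would we ever choose a k that does not give back the full user-movie matrix with all of the original ratings? This is a good question. One reason is computational: we certainly want to reduce the dimensionality of the data. But that is not the main reason we want k to be smaller than both the number of movies and the number of users.

Let's take a step back for a second. In the example we just looked at, the matrix was very clean. Every user-movie combination had a rating. **There were no missing values.** But we know from the previous lesson that the user-movie matrix is full of missing values.

A matrix similar to the one we just performed SVD on: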

![image.png](attachment:image.png)

In reality:

![image.png](attachment:image.png)


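Therefore, if we keep all k latent features, it is likely that the latent features with smaller values in the Σ matrix explain variability that is due to noise rather than signal. Furthermore, if we use these "noisy" latent features to reconstruct the original user-movie matrix, we may end up with worse ratings than if we used only the latent features associated with signal.

`9.` Let's try SVD on a matrix with missing values to make the data more realistic. The cell below replaces the first user's first rating with NaN, so that user no longer has ratings for all four movies. Try performing SVD on the new matrix. What happens?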
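
The noise argument can be illustrated with a small simulation. The sketch below is built on assumptions that are not part of this notebook's data: the "true" ratings are a made-up rank-1 matrix, and the observed ratings add Gaussian noise with standard deviation 0.5. It compares how close the rank-1 and full reconstructions of the noisy matrix come to the true signal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical rank-1 "signal": each user scales one shared taste profile
signal = np.outer(np.linspace(0.6, 1.0, 20), [9.0, 7.0, 10.0, 8.0])
# Observed ratings = signal + noise (assumed noise level, for illustration only)
noisy = signal + rng.normal(scale=0.5, size=signal.shape)

u_n, s_n, vt_n = np.linalg.svd(noisy, full_matrices=False)

def rebuild(k):
    """Rank-k reconstruction of the noisy matrix."""
    return u_n[:, :k] @ np.diag(s_n[:k]) @ vt_n[:k, :]

err_k1 = np.sum((signal - rebuild(1)) ** 2)    # keep only the dominant feature
err_full = np.sum((signal - rebuild(4)) ** 2)  # keep everything, noise included
print(round(err_k1, 2), round(err_full, 2))
```

Keeping all four features reproduces the noisy matrix exactly, so all of the noise stays in the reconstruction. Keeping one feature discards most of that noise, which brings the reconstruction closer to the underlying signal in this setup.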

In [16]:
# This line adds one nan value as the very first entry in our matrix
user_movie_subset.iloc[0, 0] = np.nan

# Try svd with this new matrix
u, s, vt = np.linalg.svd(user_movie_subset)

LinAlgError: SVD did not converge

**Write your response here.**

Even with just one nan value we cannot perform SVD! This is going to be a huge problem, because our real dataset has nan values everywhere! This is where FunkSVD comes in to help.
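
One way to fail earlier and with a clearer message is to count the missing values before calling SVD. The helper below, `svd_if_complete`, is a hypothetical name introduced for this sketch and is not part of NumPy, pandas, or this course's code. It works on a small made-up DataFrame.

```python
import numpy as np
import pandas as pd

def svd_if_complete(ratings):
    """Run SVD only when the DataFrame has no missing values (hypothetical helper)."""
    n_missing = int(ratings.isna().sum().sum())
    if n_missing > 0:
        raise ValueError(
            f"{n_missing} missing rating(s); np.linalg.svd needs a complete matrix"
        )
    return np.linalg.svd(ratings.to_numpy(), full_matrices=False)

# Made-up complete matrix: SVD runs normally
toy = pd.DataFrame([[10.0, 4.0], [8.0, 9.0], [9.0, 8.0]])
u_t, s_t, vt_t = svd_if_complete(toy)

# Same matrix with one rating removed: the guard raises before SVD is attempted
toy_missing = toy.copy()
toy_missing.iloc[0, 0] = np.nan
try:
    svd_if_complete(toy_missing)
    message = None
except ValueError as err:
    message = str(err)
    print(message)
```

This check only reports the problem. It does not solve it, which is why a method that can work with missing ratings is needed.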
