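## 1. Applications of SVD

> Pros: simplifies data, removes noise, and can improve algorithm results

> Cons: the transformed data can be hard to interpret

> Works with: numeric data

With SVD we can represent the original dataset with a much smaller one. In effect, this removes noise and redundant information.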
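
In matrix notation this is the truncated SVD. As a sketch, using the conventional symbols (the names $A$, $U$, $\Sigma$, $V$, $m$, $n$, and $k$ are assumptions here, not taken from the text above), an $m \times n$ matrix is factored exactly and then approximated by keeping only the $k$ largest singular values:

$$
A_{m \times n} = U_{m \times m}\,\Sigma_{m \times n}\,V^{T}_{n \times n} \approx U_{m \times k}\,\Sigma_{k \times k}\,V^{T}_{k \times n}
$$

The smaller $k$ is relative to $m$ and $n$, the fewer numbers are needed to store the three factors on the right.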

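### 1.1 Latent Semantic Indexing

One of the earliest applications of SVD was information retrieval. The SVD-based approach is called Latent Semantic Indexing (LSI).

In LSI, a matrix is built from documents and terms. Applying SVD to that matrix yields a set of singular values that represent concepts or topics in the documents, which can be used for more efficient document search.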
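
A minimal sketch of the idea, assuming a made-up four-term, four-document corpus (the terms, the counts, and the choice of `k = 2` are all hypothetical). Documents that share a topic end up pointing in the same direction in the reduced concept space:

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["svd", "matrix", "pizza", "pasta"]
term_doc = np.array([
    [2, 1, 0, 0],   # "svd"
    [1, 2, 0, 0],   # "matrix"
    [0, 0, 3, 1],   # "pizza"
    [0, 0, 1, 2],   # "pasta"
], dtype=float)

U, sigma, VT = np.linalg.svd(term_doc, full_matrices=False)

k = 2  # number of concepts to keep (an assumption for this toy example)
# Column j of doc_concepts holds document j's coordinates in concept space.
doc_concepts = np.diag(sigma[:k]) @ VT[:k, :]


def cosine(a, b):
    """Plain cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Documents 0 and 1 are both about linear algebra; document 2 is about food.
same_topic = cosine(doc_concepts[:, 0], doc_concepts[:, 1])
different_topic = cosine(doc_concepts[:, 0], doc_concepts[:, 2])
print(same_topic, different_topic)
```

A search query can be compared against documents in the same reduced space, which is cheaper than comparing full term vectors.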

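### 1.2 Recommender Systems

Another application of SVD is recommender systems. A simple recommender computes similarity between items or between people. More advanced approaches first use SVD to build a topic space from the data and then compute similarity within that space.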

## 2. Matrix Factorization

In [1]:
import numpy as np

In [2]:
U, Sigma, VT = np.linalg.svd([[1, 1], [7, 7]])
U

array([[-0.14142136, -0.98994949],
       [-0.98994949,  0.14142136]])

In [3]:
Sigma

array([ 10.,   0.])

In [4]:
VT

array([[-0.70710678, -0.70710678],
       [-0.70710678,  0.70710678]])

In [2]:
def loadExData():
    return[[1, 1, 1, 0, 0],
           [2, 2, 2, 0, 0],
           [1, 1, 1, 0, 0],
           [5, 5, 5, 0, 0],
           [1, 1, 0, 2, 2],
           [0, 0, 0, 2, 2],
           [0, 0, 0, 3, 3],
           [0, 0, 0, 1, 1]]

In [8]:
data = loadExData()
U, Sigma, VT = np.linalg.svd(data)
Sigma

array([  9.72254240e+00,   5.99678956e+00,   7.14621710e-01,
         1.33226763e-15,   2.96636264e-31])

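The first three values are much larger than the rest (the last two are effectively zero), so we can approximately reconstruct the original matrix using only those three.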

In [9]:
# Diagonal matrix of the three largest singular values
# (np.asmatrix replaces np.mat, which was removed in NumPy 2.0)
Sig3 = np.asmatrix(np.diag(Sigma[:3]))
reConstruct = U[:, :3] * Sig3 * VT[:3, :]
reConstruct

matrix([[  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          -7.43654270e-16,  -7.08092439e-16],
        [  2.00000000e+00,   2.00000000e+00,   2.00000000e+00,
          -5.61616725e-16,  -4.90493063e-16],
        [  1.00000000e+00,   1.00000000e+00,   1.00000000e+00,
          -8.38738801e-16,  -8.03176969e-16],
        [  5.00000000e+00,   5.00000000e+00,   5.00000000e+00,
           4.47124976e-16,   6.11490025e-16],
        [  1.00000000e+00,   1.00000000e+00,   2.22044605e-16,
           2.00000000e+00,   2.00000000e+00],
        [  2.77555756e-17,   5.55111512e-17,   2.77555756e-17,
           2.00000000e+00,   2.00000000e+00],
        [  4.16333634e-17,   6.93889390e-17,   2.77555756e-17,
           3.00000000e+00,   3.00000000e+00],
        [  1.38777878e-17,   2.77555756e-17,   1.38777878e-17,
           1.00000000e+00,   1.00000000e+00]])
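
One way to check how close the rank-3 reconstruction is to the original is to look at the largest absolute difference between the two. The sketch below repeats the reconstruction with plain NumPy arrays instead of `matrix` objects; the variable names are illustrative only:

```python
import numpy as np

# Same values as loadExData() above.
data = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, sigma, VT = np.linalg.svd(data)

k = 3  # number of singular values kept
approx = U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]

# Largest element-wise difference between the original and the reconstruction.
max_error = float(np.abs(approx - data).max())
print(max_error)
```

The error is at the level of floating-point round-off because this matrix has rank 3, so nothing is lost by dropping the remaining singular values.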

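## 3. Recommendation Engine Based on Collaborative Filtering

### 3.1 Similarity Computation

There are many ways to compute similarity:
1. Euclidean distance, normalized to a similarity via 1/(1+distance)
2. Pearson correlation
3. Cosine similarity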
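
Written as formulas, and matching the way the functions in this section rescale each measure to the range $[0, 1]$, the three similarities for two rating vectors $a$ and $b$ are (the symbols $a$, $b$, and $\rho$ are notation introduced here for illustration):

$$
\mathrm{sim}_{\mathrm{euclid}}(a, b) = \frac{1}{1 + \lVert a - b \rVert_2}, \qquad
\mathrm{sim}_{\mathrm{pearson}}(a, b) = 0.5 + 0.5\,\rho(a, b), \qquad
\mathrm{sim}_{\cos}(a, b) = 0.5 + 0.5\,\frac{a^{T} b}{\lVert a \rVert_2\,\lVert b \rVert_2}
$$

Here $\rho(a, b)$ is the Pearson correlation coefficient. Both $\rho$ and the raw cosine lie in $[-1, 1]$, which is why they are shifted and halved.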

In [3]:
def norm(inX):
    return np.sqrt(np.sum(np.power(inX, 2)))

In [4]:
def eulidSim(inA, inB):
    return 1 / (1 + np.sqrt(np.sum(np.power(inA - inB, 2))))

In [5]:
def pearsSim(inA, inB):
    if len(inA) < 3:
        return 1.0
    return 0.5 + 0.5 * np.corrcoef(inA, inB, rowvar=0)[0][1]

In [6]:
def cosSim(inA, inB):
    num = float(inA.T * inB)
    denom = norm(inA) * norm(inB)
    return 0.5 + 0.5 * (num/denom)

In [21]:
myMat = np.asmatrix(loadExData())
eulidSim(myMat[:, 0], myMat[:, 4])

0.12849622184722817

In [22]:
eulidSim(myMat[:, 0], myMat[:, 0])

1.0

In [24]:
cosSim(myMat[:, 0], myMat[:, 4])

0.54166666666666663

In [25]:
cosSim(myMat[:, 0], myMat[:, 0])

0.99999999999999989

In [26]:
pearsSim(myMat[:, 0], myMat[:, 4])

0.21355405038422687

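### 3.2 Item-Based or User-Based Similarity?

Similarity can be computed in two ways: item-based or user-based. Comparing rows against each other gives user-based similarity; comparing columns against each other gives item-based similarity.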
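
The difference comes down to which axis is sliced. A small sketch, assuming rows are users and columns are items as in `loadExData`, and using a standalone cosine helper on plain arrays (the helper name `rescaled_cosine` is made up for this example):

```python
import numpy as np

# Same values as loadExData(): rows are users, columns are items.
ratings = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)


def rescaled_cosine(a, b):
    """Cosine similarity mapped from [-1, 1] to [0, 1], like cosSim."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(0.5 + 0.5 * cos)


# User-based: compare two rows (user 0 against user 1).
user_sim = rescaled_cosine(ratings[0, :], ratings[1, :])

# Item-based: compare two columns (item 0 against item 4).
item_sim = rescaled_cosine(ratings[:, 0], ratings[:, 4])
print(user_sim, item_sim)
```

The item-based value agrees with the `cosSim(myMat[:, 0], myMat[:, 4])` result shown earlier.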

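### 3.3 Evaluating a Recommendation Engine

We use cross-validation: remove some of the known ratings, predict them, and then measure the difference between the predicted and actual values. The usual metric is root mean squared error (RMSE).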
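
As a minimal sketch of the metric itself, assuming the held-out ratings and their predictions are already available as two equal-length lists (the numbers below are invented for illustration):

```python
import numpy as np


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Hypothetical held-out ratings and the values predicted for them.
actual = [4, 2, 5, 1]
predicted = [3, 2, 4, 3]

error = rmse(predicted, actual)
print(error)
```

An RMSE of 1.0 means the predictions are off by about one star on average, in the same units as the ratings.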

## 4. Example: A Restaurant Dish Recommendation Engine

### 4.1 Recommending Dishes the User Has Not Tried

In [9]:
# Rating estimate based on item similarity
def standEst(dataMat, user, simMeans, item):
    m, n = dataMat.shape
    simTotal = 0.0
    ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0:
            continue
        overLap = np.nonzero(np.logical_and(dataMat[:, item].A > 0,\
                                            dataMat[:, j].A > 0))[0]
        if len(overLap) == 0:
            similarity = 0
        else:
            similarity = simMeans(dataMat[overLap, item], dataMat[overLap, j])
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        return ratSimTotal / simTotal

In [15]:
def recommend(dataMat, user, N=3, simMeans=cosSim, estMethod=standEst):
    unratedItems = np.nonzero(dataMat[user, :].A == 0)[1]
    if len(unratedItems) == 0:
        return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeans, item)
        itemScores.append((item, estimatedScore))
    return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]
        

In [14]:
myMat = np.asmatrix(loadExData())
# Modify a few entries
myMat[0, 1] = myMat[0, 0] = myMat[1, 0] = myMat[2, 0] = 4
myMat[3, 3] = 2
myMat

matrix([[4, 4, 1, 0, 0],
        [4, 2, 2, 0, 0],
        [4, 1, 1, 0, 0],
        [5, 5, 5, 2, 0],
        [1, 1, 0, 2, 2],
        [0, 0, 0, 2, 2],
        [0, 0, 0, 3, 3],
        [0, 0, 0, 1, 1]])

In [16]:
recommend(myMat, 2)

[(4, 2.5), (3, 1.9703483892927431)]

### 4.2 Using SVD to Improve Recommendations

In [17]:
def loadExData2():
    return[[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
           [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
           [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
           [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
           [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
           [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
           [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
           [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
           [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
           [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
           [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]

In [18]:
U, Sigma, VT = np.linalg.svd(np.asmatrix(loadExData2()))
Sigma

array([ 15.77075346,  11.40670395,  11.03044558,   4.84639758,
         3.09292055,   2.58097379,   1.00413543,   0.72817072,
         0.43800353,   0.22082113,   0.07367823])

In [19]:
Sig2 = Sigma ** 2

In [20]:
sum(Sig2)

541.99999999999943

In [21]:
sum(Sig2) * 0.9

487.7999999999995

In [22]:
sum(Sig2[:3])

500.50028912757926

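The first three singular values carry more than 90% of the total energy (500.5 > 487.8), so we can reduce the 11-dimensional matrix to a 3-dimensional one.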
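
The manual steps above (square the singular values, sum them, compare against 90% of the total) can be wrapped in a small helper. This is only a sketch; the function name `rank_for_energy` and its `fraction` parameter are made up for this example:

```python
import numpy as np


def rank_for_energy(sigma, fraction=0.9):
    """Smallest k whose first k squared singular values reach `fraction` of the total."""
    energy = np.asarray(sigma, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # Index of the first cumulative share that is >= fraction, converted to a count.
    k = int(np.searchsorted(cumulative, fraction)) + 1
    # Guard against round-off when fraction is very close to 1.
    return min(k, len(energy))


# Singular values printed above for loadExData2().
sigma = [15.77075346, 11.40670395, 11.03044558, 4.84639758,
         3.09292055, 2.58097379, 1.00413543, 0.72817072,
         0.43800353, 0.22082113, 0.07367823]

k = rank_for_energy(sigma, 0.9)
print(k)
```

Raising or lowering `fraction` trades reconstruction accuracy against the number of dimensions kept.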

In [31]:
def svdEst(dataMat, user, simMeans, item):
    m, n = dataMat.shape
    simTotal = 0.0
    ratSimTotal = 0.0
    U, Sigma, VT = np.linalg.svd(dataMat)
    Sig4 = np.asmatrix(np.eye(4) * Sigma[:4])
    # Project the items into the low-dimensional space
    xformedItems = dataMat.T * U[:, :4] * Sig4.I
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item:
            continue
        similarity = simMeans(xformedItems[item, :].T, xformedItems[j, :].T)
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    else:
        return ratSimTotal / simTotal

In [33]:
myMat = np.asmatrix(loadExData2())

In [34]:
recommend(myMat, 1, estMethod=svdEst)

[(4, 3.3447149384692283), (7, 3.3294020724526963), (9, 3.328100876390069)]
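
It may help to see what the projection inside `svdEst` actually produces. Since the data matrix factors as U·Σ·Vᵀ, multiplying its transpose by the first k columns of U and then by the inverse of the k×k singular value matrix gives back the first k columns of V, provided those k singular values are nonzero. The sketch below checks this numerically with plain arrays; the variable names are illustrative:

```python
import numpy as np

# Same values as loadExData2() above.
ratings = np.array([
    [0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
    [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
    [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
    [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
    [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
    [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
    [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
    [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
    [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
    [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0],
], dtype=float)

U, sigma, VT = np.linalg.svd(ratings)
k = 4  # same number of singular values as svdEst uses

# Same projection as in svdEst, written with ndarray operations.
projected = ratings.T @ U[:, :k] @ np.linalg.inv(np.diag(sigma[:k]))

# The projected items should match the first k right singular vectors.
same_as_V = bool(np.allclose(projected, VT[:k, :].T))
print(projected.shape, same_as_V)
```

Each row of `projected` is one item described by 4 numbers instead of 11 user ratings, and the similarity in `svdEst` is computed between those short rows.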

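### 4.3 Challenges in Building a Recommendation Engine

1. Keep in mind that SVD is expensive to compute. In large systems it runs once a day or even less often, and it runs offline.
2. Recommendation engines face many other scalability challenges. One is how to represent the matrix, since most entries are 0. Another potential waste of computing resources is the similarity scores, which can sometimes be reused.
3. Another problem is how to make good recommendations when data is scarce, known as the cold-start problem. One option is to switch to content-based recommendation.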
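
Two of these points can be sketched in a few lines: storing only the nonzero ratings, and computing item similarities once so they can be reused. This is an illustration under simplifying assumptions, not how a production system would do it. The names `sparse_ratings` and `build_similarity_cache` are made up, and the cache uses whole-column cosine similarity rather than the overlap-only comparison in `standEst`:

```python
import numpy as np
from itertools import combinations

# Same values as loadExData(): rows are users, columns are items.
ratings = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

# Sparse representation: keep only the nonzero ratings, keyed by (user, item).
sparse_ratings = {
    (int(user), int(item)): float(ratings[user, item])
    for user, item in zip(*np.nonzero(ratings))
}


def build_similarity_cache(matrix):
    """Compute rescaled cosine similarity once for every pair of item columns."""
    norms = np.linalg.norm(matrix, axis=0)
    cache = {}
    for i, j in combinations(range(matrix.shape[1]), 2):
        if norms[i] == 0 or norms[j] == 0:
            continue  # an item nobody rated has no defined similarity
        cos = matrix[:, i] @ matrix[:, j] / (norms[i] * norms[j])
        cache[(i, j)] = float(0.5 + 0.5 * cos)
    return cache


similarity_cache = build_similarity_cache(ratings)
print(len(sparse_ratings), len(similarity_cache))
```

The cache could be rebuilt offline on the same schedule as the SVD and then looked up at recommendation time instead of being recomputed for every request.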

## 5. Example: SVD-Based Image Compression

In [42]:
def printMat(inMat, thresh=0.8):
    for i in range(32):
        for k in range(32):
            if float(inMat[i, k]) > thresh:
                print(1, end='')
            else:
                print(0, end='')
        print(' ')

In [46]:
def imgCompress(numSV=3, thresh=0.8):
    myl = []
    # Each line of the file holds one row of 32 binary pixels
    with open('0_5.txt') as f:
        for line in f:
            myl.append([int(line[i]) for i in range(32)])
    myMat = np.asmatrix(myl)
    print('****original matrix****')
    printMat(myMat, thresh)
    U, Sigma, VT = np.linalg.svd(myMat)
    # Keep only the first numSV singular values on the diagonal
    SigRecon = np.asmatrix(np.diag(Sigma[:numSV]))
    reconMat = U[:, :numSV] * SigRecon * VT[:numSV, :]
    print('****reconstructed matrix using {} singular values****'.format(numSV))
    printMat(reconMat, thresh)

In [47]:
imgCompress(3)

****original matrix****
00000000000000110000000000000000 
00000000000011111100000000000000 
00000000000111111110000000000000 
00000000001111111111000000000000 
00000000111111111111100000000000 
00000001111111111111110000000000 
00000000111111111111111000000000 
00000000111111100001111100000000 
00000001111111000001111100000000 
00000011111100000000111100000000 
00000011111100000000111110000000 
00000011111100000000011110000000 
00000011111100000000011110000000 
00000001111110000000001111000000 
00000011111110000000001111000000 
00000011111100000000001111000000 
00000001111100000000001111000000 
00000011111100000000001111000000 
00000001111100000000001111000000 
00000001111100000000011111000000 
00000000111110000000001111100000 
00000000111110000000001111100000 
00000000111110000000001111100000 
00000000111110000000011111000000 
00000000111110000000111111000000 
00000000111111000001111110000000 
00000000011111111111111110000000 
00000000001111111111111110000000 
000000000011111111111111

In [48]:
imgCompress(2)

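> Summary: SVD is a powerful dimensionality-reduction tool. We can use it to approximate a matrix and extract its important features. Keeping 80%-90% of the matrix's energy is enough to retain the important features while removing noise.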
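
To put a number on the compression in the image example, one can count how many values have to be stored. This is a back-of-the-envelope sketch; the helper name `svd_storage` is made up, and it assumes the image is kept as the three truncated factors used in `imgCompress`:

```python
def svd_storage(rows, cols, num_sv):
    """Count of numbers in U[:, :num_sv], the singular values, and VT[:num_sv, :]."""
    return rows * num_sv + num_sv + num_sv * cols


original = 32 * 32                   # one value per pixel
compressed = svd_storage(32, 32, 2)  # two singular values, as in imgCompress(2)
ratio = original / compressed
print(original, compressed, round(ratio, 2))
```

With two singular values the 32×32 image needs 130 numbers instead of 1024, which is close to an eightfold reduction.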