# Chapter 2 Making Recommendations

The cell below imports the critics' rating data from the source code (the `recommendations` module).

In [6]:
import recommendations
from pprint import pprint
pprint(recommendations.critics)

{'Claudia Puig': {'Just My Luck': 3.0,
                  'Snakes on a Plane': 3.5,
                  'Superman Returns': 4.0,
                  'The Night Listener': 4.5,
                  'You, Me and Dupree': 2.5},
 'Gene Seymour': {'Just My Luck': 1.5,
                  'Lady in the Water': 3.0,
                  'Snakes on a Plane': 3.5,
                  'Superman Returns': 5.0,
                  'The Night Listener': 3.0,
                  'You, Me and Dupree': 3.5},
 'Jack Matthews': {'Lady in the Water': 3.0,
                   'Snakes on a Plane': 4.0,
                   'Superman Returns': 5.0,
                   'The Night Listener': 3.0,
                   'You, Me and Dupree': 3.5},
 'Lisa Rose': {'Just My Luck': 3.0,
               'Lady in the Water': 2.5,
               'Snakes on a Plane': 3.5,
               'Superman Returns': 3.5,
               'The Night Listener': 3.0,
               'You, Me and Dupree': 2.5},
 'Michael Phillips': {'Lady in the Water': 2.5,
    

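## 2.1 Finding Similar Users
Here we mainly use the **Euclidean distance** and the **Pearson correlation** to measure how closely two users' ratings agree, and then rank users by that score to find the most similar ones.
### 2.1.1 Euclidean Distance
The [Euclidean distance](https://zh.wikipedia.org/zh-hans/%E6%AC%A7%E5%87%A0%E9%87%8C%E5%BE%97%E8%B7%9D%E7%A6%BB) used here is the familiar straight-line distance between two points in a Cartesian coordinate system, as taught in school. Each person is treated as a point whose coordinates are their ratings of the items both people have rated.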
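Written out, the quantity computed in the cells that follow can be sketched as below. The symbols are introduced here for illustration only: $S$ is the set of items rated by both people, and $r_{p,i}$ is person $p$'s rating of item $i$.

$$
d(p_1, p_2) = \sqrt{\sum_{i \in S} \left(r_{p_1,i} - r_{p_2,i}\right)^2},
\qquad
\mathrm{sim}(p_1, p_2) = \frac{1}{1 + d(p_1, p_2)}
$$

The second expression maps the distance onto a score in $(0, 1]$: identical ratings give 1, and the score falls toward 0 as the distance grows.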

In [23]:
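from math import sqrt
# Straight-line distance between the points (4.5, 1) and (4, 2)
sqrt(pow(4.5-4, 2) + pow(1-2,2))
# Map the distance to a similarity score in (0, 1]; closer points score higher
1/(1+sqrt(pow(4.5-4, 2) + pow(1-2,2)))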

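def sim_distance(prefs, person1, person2):
    # Items rated by both people
    share_items = set(prefs[person1])&set(prefs[person2])
    # Nothing in common: treat the pair as having no similarity
    if not share_items:
        return 0
    # Sum of squared rating differences over the shared items
    sum_of_squares = sum(pow(prefs[person1][item]-prefs[person2][item],2) for item in share_items)
    return 1/(1+sqrt(sum_of_squares))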

In [24]:
sim_distance(recommendations.critics, "Lisa Rose", "Gene Seymour")

0.29429805508554946
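To see where this value comes from, the calculation can be reproduced by hand. The sketch below is self-contained: it copies the two critics' ratings from the output shown earlier, and the variable names (`lisa`, `gene`, `squared_diffs`) are chosen here for illustration.

```python
from math import sqrt

# Ratings copied from the critics data printed earlier in this chapter.
lisa = {'Just My Luck': 3.0, 'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.5, 'Superman Returns': 3.5,
        'The Night Listener': 3.0, 'You, Me and Dupree': 2.5}
gene = {'Just My Luck': 1.5, 'Lady in the Water': 3.0,
        'Snakes on a Plane': 3.5, 'Superman Returns': 5.0,
        'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}

# Items both critics have rated (all six films in this case).
shared = sorted(set(lisa) & set(gene))

# Squared difference per film: 2.25, 0.25, 0.0, 2.25, 0.0, 1.0
squared_diffs = {item: (lisa[item] - gene[item]) ** 2 for item in shared}

total = sum(squared_diffs.values())   # 5.75
distance = sqrt(total)                # about 2.398
similarity = 1 / (1 + distance)       # about 0.294

for item in shared:
    print(f"{item:20s} {squared_diffs[item]:.2f}")
print("sum of squares:", total)
print("similarity:", similarity)
```

The two films on which the critics agree exactly contribute nothing to the distance; most of it comes from 'Just My Luck' and 'Superman Returns'.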

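### 2.1.2 Pearson Correlation
The Pearson correlation coefficient measures how well two sets of data fit a straight line. It is more involved to compute than the Euclidean distance above, but it gives better results when the ratings deviate substantially from the average level. This is because the Pearson correlation corrects for "grade inflation": a critic who consistently rates higher than another can still correlate well with them, whereas in the Euclidean calculation the inflated scores increase the distance. Further reading on the Pearson correlation coefficient: [Link 1](https://www.zhihu.com/question/19734616), [Link 2](https://segmentfault.com/q/1010000000094674).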
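For reference, the computational form used by the `sim_pearson` function that follows can be written as below. The symbols are chosen here for illustration: $x_i$ and $y_i$ are the two people's ratings of shared item $i$, and $n$ is the number of shared items.

$$
r = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}
{\sqrt{\left(\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right)\left(\sum y_i^2 - \dfrac{\left(\sum y_i\right)^2}{n}\right)}}
$$

In the code, $\sum x_i$ and $\sum y_i$ correspond to `sum1` and `sum2`, $\sum x_i^2$ and $\sum y_i^2$ to `sum1Sq` and `sum2Sq`, and $\sum x_i y_i$ to `pSum`. The result lies between $-1$ and $1$.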

In [33]:
def sim_pearson(prefs, p1, p2):
    share_items = set(prefs[p1])&set(prefs[p2])
    # No shared items means no evidence of agreement, so return 0 rather than a perfect score
    if len(share_items) == 0:
        return 0
    
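    # Number of shared items
    n = len(share_items)
    # Sum of each person's ratings over the shared items
    sum1 = sum([prefs[p1][it] for it in share_items])
    sum2 = sum([prefs[p2][it] for it in share_items])
    
    # Sum of the squared ratings
    sum1Sq = sum([pow(prefs[p1][it],2) for it in share_items])
    sum2Sq = sum([pow(prefs[p2][it],2) for it in share_items])
    
    # Sum of the products of the paired ratings
    pSum = sum([prefs[p1][it]*prefs[p2][it] for it in share_items])
    
    # Pearson score: co-variation term divided by the product of the spread terms
    num = pSum - (sum1*sum2/n)
    den = sqrt( (sum1Sq-pow(sum1,2)/n) * (sum2Sq-pow(sum2,2)/n) )
    # Zero spread (one person rated every shared item the same): correlation is undefined
    if den == 0:
        return 0
    return num/den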

In [34]:
sim_pearson(recommendations.critics, "Lisa Rose", "Gene Seymour")

0.39605901719066977
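The "grade inflation" point made at the start of this subsection can be checked with a small experiment. The sketch below is self-contained and uses made-up data: `baseline` and `inflated` are hypothetical users, where the second rates every item exactly 1.0 higher than the first. The helper names `euclidean_similarity` and `pearson` are also introduced here for illustration. They take two rating dictionaries directly and otherwise follow the formulas used in this section.

```python
from math import sqrt

def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance) over the items rated in both dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0
    return 1 / (1 + sqrt(sum((a[k] - b[k]) ** 2 for k in shared)))

def pearson(a, b):
    """Pearson correlation over the items rated in both dicts."""
    shared = set(a) & set(b)
    n = len(shared)
    if n == 0:
        return 0
    sum_a = sum(a[k] for k in shared)
    sum_b = sum(b[k] for k in shared)
    sum_a_sq = sum(a[k] ** 2 for k in shared)
    sum_b_sq = sum(b[k] ** 2 for k in shared)
    p_sum = sum(a[k] * b[k] for k in shared)
    num = p_sum - sum_a * sum_b / n
    den = sqrt((sum_a_sq - sum_a ** 2 / n) * (sum_b_sq - sum_b ** 2 / n))
    return num / den if den != 0 else 0

# Hypothetical data: same taste, but "inflated" scores everything 1.0 higher.
baseline = {'A': 2.0, 'B': 3.0, 'C': 4.0, 'D': 2.5}
inflated = {item: score + 1.0 for item, score in baseline.items()}

print("Euclidean similarity:", euclidean_similarity(baseline, inflated))
print("Pearson correlation: ", pearson(baseline, inflated))
```

With this data the Pearson correlation is 1 (the two users rank the items identically), while the Euclidean similarity is only about 0.33, because the constant offset of 1.0 on each of the four items adds up to a distance of 2.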