## Recommendation System

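- Manhattan / Euclidean Distance: if the data is dense (nearly every attribute has a value) and the magnitude of those values matters, use Euclidean or Manhattan distance.
- Pearson Correlation: if the data suffers from "grade inflation" (different users rate on different scales), use the Pearson correlation coefficient.
- Cosine Similarity: if the data is sparse, use cosine similarity.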


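The first step is to find similar users. The simplest way to describe this is with a two-dimensional model. Suppose users rate books on a website with up to five stars:
- Zero stars means the book is terrible; five stars means it is excellent.
- Because the model is two-dimensional, only two books are rated: Neal Stephenson's *Snow Crash* (vertical axis) and Stieg Larsson's *The Girl with the Dragon Tattoo* (horizontal axis).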

![Recommend.png](attachment:image.png)

In [6]:
# Code file for the book Programmer's Guide to Data Mining
# DataSet
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
        }

In [8]:
users["Bill"]

{'Blues Traveler': 2.0,
 'Broken Bells': 3.5,
 'Deadmau5': 4.0,
 'Phoenix': 2.0,
 'Slightly Stoopid': 3.5,
 'Vampire Weekend': 3.0}

![User.png](attachment:image.png)

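## Manhattan Distance
The simplest distance measure is the Manhattan distance. In the two-dimensional model each person is a point (x, y); subscripts distinguish the people:
- (x1, y1) represents Amy
- (x2, y2) represents Mr. X

The Manhattan distance between them is |x1 - x2| + |y1 - y2|, which is 4 here.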

![image.png](attachment:image.png)

In [10]:
def manhattan(rating1, rating2):
    distance = 0
    commonRatings = False 
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])  # sum of absolute differences
            commonRatings = True
    if commonRatings:
        return distance
    else:
        return -1 #Indicates no ratings in common

manhattan(users["Bill"],users["Dan"])  # Bill and Dan's dist

5.0

In [12]:
def computeNearestNeighbor(username, users):
    """creates a sorted list of users based on their distance to username"""
    distances = []
    for user in users:
        if user != username:
            distance = manhattan(users[user], users[username])
            distances.append((distance, user))
    distances.sort()
    return distances

computeNearestNeighbor("Bill",users) # whose tastes are closest to Bill's (smallest distance)? Veronica

[(4.0, 'Veronica'),
 (5.0, 'Dan'),
 (5.5, 'Hailey'),
 (6.0, 'Jordyn'),
 (8.0, 'Sam'),
 (9.0, 'Angelica'),
 (14.0, 'Chan')]
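
One caveat about the sentinel value: `manhattan` returns -1 when two users share no ratings, and because -1 sorts before every real distance, such a user would land at the front of the sorted list. In the sample data every pair of users shares at least one band, so the output above is unaffected. A minimal sketch of a guard follows; the helper name `nearest_neighbors_filtered` and the toy data are hypothetical and not part of the original code.

```python
def manhattan(rating1, rating2):
    """Manhattan distance over shared keys; -1 when nothing is shared."""
    common = [key for key in rating1 if key in rating2]
    if not common:
        return -1
    return sum(abs(rating1[key] - rating2[key]) for key in common)


def nearest_neighbors_filtered(username, users):
    """Hypothetical variant of computeNearestNeighbor that skips the -1 sentinel."""
    distances = []
    for other, ratings in users.items():
        if other == username:
            continue
        distance = manhattan(ratings, users[username])
        if distance >= 0:  # drop users with no ratings in common
            distances.append((distance, other))
    distances.sort()
    return distances


# Toy data made up for illustration: B shares no item with A.
demo_users = {
    "A": {"x": 1.0},
    "B": {"y": 2.0},
    "C": {"x": 3.0},
}
neighbors = nearest_neighbors_filtered("A", demo_users)
print(neighbors)
```

Without the `distance >= 0` check, B would be listed first with a distance of -1 even though nothing is known about how similar A and B are.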

In [16]:
def recommend(username, users):
    """Give list of recommendations"""
    # first find nearest neighbor
    nearest = computeNearestNeighbor(username, users)[0][1]
    recommendations = []
    
    # now find bands neighbor rated that user didn't
    neighborRatings = users[nearest]
    userRatings = users[username]
    for artist in neighborRatings:
        if not artist in userRatings:
            recommendations.append((artist, neighborRatings[artist]))
    return sorted(recommendations, 
                  key=lambda artistTuple: artistTuple[1], 
                  reverse = True)

print(recommend('Hailey', users))   # top recommendation: Phoenix
print(recommend('Chan', users))     # top recommendation: The Strokes
print(recommend('Angelica', users)) # no recommendation (empty list)


[('Phoenix', 4.0), ('Blues Traveler', 3.0), ('Slightly Stoopid', 2.5)]
[('The Strokes', 4.0), ('Vampire Weekend', 1.0)]
[]


![image.png](attachment:image.png)

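## Fix Problem : Minkowski Distance
Angelica has already rated every band that Veronica, her nearest neighbor, rated, so there is nothing left to recommend and the result is an empty list. The Minkowski distance function, which generalizes the Manhattan and Euclidean distances, can be used to compute the distance instead.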

In [18]:
print(recommend('Angelica', users))      # no recommendation (empty list)
computeNearestNeighbor("Angelica",users) # whose tastes are closest to Angelica's (smallest distance)? Veronica

[]


[(3.5, 'Veronica'),
 (4.5, 'Chan'),
 (5.0, 'Hailey'),
 (8.0, 'Sam'),
 (9.0, 'Bill'),
 (9.0, 'Dan'),
 (9.5, 'Jordyn')]

![List.png](attachment:image.png)
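
The Minkowski distance mentioned above can be sketched as follows. This is an illustrative version, not code from the original notebook: the function name `minkowski` and its parameter `r` are assumptions. It follows the same conventions as `manhattan` (only shared keys are compared, and -1 signals no overlap). With `r = 1` it reduces to the Manhattan distance and with `r = 2` to the Euclidean distance.

```python
def minkowski(rating1, rating2, r):
    """Minkowski distance over the items both users rated.

    r = 1 gives the Manhattan distance, r = 2 the Euclidean distance.
    Returns -1 when there are no ratings in common.
    """
    distance = 0
    commonRatings = False
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key]) ** r
            commonRatings = True
    if commonRatings:
        return distance ** (1 / r)
    return -1


# Bill's and Dan's ratings, copied from the users dictionary above.
bill = {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0,
        "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0}
dan = {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5,
       "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0,
       "Vampire Weekend": 2.0}

print(minkowski(bill, dan, 1))  # same value as manhattan(bill, dan): 5.0
print(minkowski(bill, dan, 2))  # Euclidean distance, the square root of 4.5
```

A larger `r` gives more weight to a single large difference between two ratings.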

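## Fix Problem : Pearson Correlation Coefficient

Users rate on different scales. A closer look at the band ratings shows that each user's standards differ widely: Bill avoids extremes and keeps all his ratings between 2 and 4; Jordyn seems to like every band, with ratings between 4 and 5; Hailey is an interesting case whose ratings are all either 1 or 4. How can such users be compared? For example, is Hailey's 4 equivalent to Jordyn's 4 or to Jordyn's 5? Arguably it is closer to a 5. Differences like this reduce the accuracy of the recommendation system.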

![image-3.png](attachment:image-3.png)

In [20]:
def pearson(rating1, rating2):
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0
    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += pow(x, 2)
            sum_y2 += pow(y, 2)
    if n == 0:
        return 0  # no ratings in common; avoids division by zero below
    # now compute denominator
    denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)
    if denominator == 0:
        return 0
    else:
        return (sum_xy - (sum_x * sum_y) / n) / denominator    

pearson(users['Angelica'], users['Bill'])    # not displayed: a notebook cell shows only its last expression
pearson(users['Angelica'], users['Hailey'])


0.42008402520840293
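
To see why Pearson helps with differing rating scales, consider two hypothetical users who rank four items in the same order: one uses the range 1 to 4, the other squeezes everything into 4 to 5. The sketch below uses made-up ratings and a list-based helper, `pearson_lists`, which is an assumption for illustration; it applies the same single-pass formula as `pearson` above.

```python
from math import sqrt


def pearson_lists(xs, ys):
    """Same single-pass Pearson formula as above, for two equal-length lists."""
    n = len(xs)
    if n == 0:
        return 0
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    if denominator == 0:
        return 0
    return (sum_xy - sum_x * sum_y / n) / denominator


# Made-up ratings: same preference order, very different scales.
strict = [1.0, 2.0, 3.0, 4.0]
generous = [4.0, 4.25, 4.5, 4.75]

manhattan_gap = sum(abs(x - y) for x, y in zip(strict, generous))
correlation = pearson_lists(strict, generous)
print(manhattan_gap)  # 7.5: the two users look far apart
print(correlation)    # approximately 1.0: their tastes agree
```

The Manhattan distance treats these two users as dissimilar, while the Pearson coefficient recognizes that their ratings are linearly related.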

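## Fix Problem : Cosine Similarity

iTunes has about 15 million tracks, yet I have listened to only 4,000 of them. A single user's data is therefore sparse: the non-zero values are far fewer than the total. When two users are compared across 15 million tracks, they may well have nothing in common, in which case there is no basis for computing a distance between them.

A similar situation arises when computing document similarity. Suppose we want to find a book similar to *The Space Pioneers*. One approach uses word frequencies, that is, the number of times each word appears in the book as a share of all words in the book: "the" accounts for 6.13%, "Tom" for 0.89%, and "space" for 0.25%. These figures can be used to find a similar book, but the same sparsity problem appears. *The Space Pioneers* contains 6,629 distinct words, whereas English has more than one million words, so non-zero values are rare and a distance between two books cannot be computed in the usual way.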

![image.png](attachment:image.png)

![image.png](attachment:image.png)
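
A minimal sketch of cosine similarity on the same dictionary-of-ratings layout is given below. The function name `cosine_similarity` is an assumption, and the sketch treats a missing rating as 0, so items that neither user rated contribute nothing to the result.

```python
from math import sqrt


def cosine_similarity(rating1, rating2):
    """Cosine of the angle between two rating vectors.

    Missing ratings count as 0, so only shared keys add to the dot product,
    while each vector's length uses all of that user's ratings.
    """
    dot = sum(rating1[key] * rating2[key] for key in rating1 if key in rating2)
    norm1 = sqrt(sum(value * value for value in rating1.values()))
    norm2 = sqrt(sum(value * value for value in rating2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Made-up ratings for illustration.
a = {"song1": 3.0, "song2": 4.0}
b = {"song1": 6.0, "song2": 8.0}   # same direction as a, twice the length
c = {"song3": 5.0}                 # nothing in common with a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```

The result ranges from -1 to 1, and for non-negative ratings from 0 to 1, with larger values meaning more similar. A metric plugged into `computeNearestNeighbor` therefore needs descending order for a similarity and ascending order for a distance.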


## Example code
Data: download from http://guidetodatamining.com/assets/data/BX-Dump.zip
- BX-Book-Ratings  28M
- BX-Books  73M
- BX-Users  11M

In [21]:
import codecs 
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
                      "Norah Jones": 4.5, "Phoenix": 5.0,
                      "Slightly Stoopid": 1.5,
                      "The Strokes": 2.5, "Vampire Weekend": 2.0},
         
         "Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
                 "Deadmau5": 4.0, "Phoenix": 2.0,
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
         
         "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
                  "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},
         
         "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
                 "Deadmau5": 4.5, "Phoenix": 3.0,
                 "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                 "Vampire Weekend": 2.0},
         
         "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
                    "Norah Jones": 4.0, "The Strokes": 4.0,
                    "Vampire Weekend": 1.0},
         
         "Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
                     "Norah Jones": 5.0, "Phoenix": 5.0,
                     "Slightly Stoopid": 4.5, "The Strokes": 4.0,
                     "Vampire Weekend": 4.0},
         
         "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
                 "Norah Jones": 3.0, "Phoenix": 5.0,
                 "Slightly Stoopid": 4.0, "The Strokes": 5.0},
         
         "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
                      "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                      "The Strokes": 3.0}
        }



class recommender:

    def __init__(self, data, k=1, metric='pearson', n=5):
        """ initialize recommender
        currently, if data is dictionary the recommender is initialized
        to it.
        For all other data types of data, no initialization occurs
        k is the k value for k nearest neighbor
        metric is which distance formula to use
        n is the maximum number of recommendations to make"""
        self.k = k
        self.n = n
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # for some reason I want to save the name of the metric
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        #
        # if data is dictionary set recommender data to it
        #
        if isinstance(data, dict):
            self.data = data

    def convertProductID2name(self, id):
        """Given product id number return product name"""
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id


    def userRatings(self, id, n):
        """Return n top ratings for user with id"""
        print ("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # finally sort and return
        ratings.sort(key=lambda artistTuple: artistTuple[1],
                     reverse = True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))


    def loadBookDB(self, path=''):
        """loads the BX book dataset. Path is where the BX files are
        located"""
        self.data = {}
        i = 0
        #
        # First load book ratings into self.data
        #
        f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        #
        # Now load books into self.productid2name
        # Books contains isbn, title, and author among other fields
        #
        f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        #
        #  Now load user info into both self.userid2name and
        #  self.username2id
        #
        f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)
                
        
    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator


    def computeNearestNeighbor(self, username):
        """creates a sorted list of users based on their distance to
        username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username],
                                   self.data[instance])
                distances.append((instance, distance))
        # sort based on distance -- closest first
        distances.sort(key=lambda artistTuple: artistTuple[1],
                       reverse=True)
        return distances

    def recommend(self, user):
        """Give list of recommendations"""
        recommendations = {}
        # first get list of users ordered by nearness
        nearest = self.computeNearestNeighbor(user)
        # now get the ratings for the user
        userRatings = self.data[user]
        # determine the total distance
        totalDistance = 0.0
        for i in range(self.k):
            totalDistance += nearest[i][1]
        # now iterate through the k nearest neighbors
        # accumulating their ratings
        for i in range(self.k):
            # compute slice of pie
            weight = nearest[i][1] / totalDistance
            # get the name of the person
            name = nearest[i][0]
            # get the ratings for this person
            neighborRatings = self.data[name]
            # now find bands neighbor rated that user didn't
            for artist in neighborRatings:
                if artist not in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist]
                                                   * weight)
                    else:
                        recommendations[artist] = (recommendations[artist]
                                                   + neighborRatings[artist]
                                                   * weight)
        # now make list from dictionary
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v)
                           for (k, v) in recommendations]
        # finally sort and return
        recommendations.sort(key=lambda artistTuple: artistTuple[1],
                             reverse=True)
        # Return the first n items
        return recommendations[:self.n]

In [23]:
r = recommender(users)
r.recommend('Jordyn')

[('Blues Traveler', 5.0)]

In [24]:
r.recommend('Hailey')

[('Phoenix', 5.0), ('Slightly Stoopid', 4.5)]

In [31]:
r.loadBookDB('input/')

1700018


In [32]:
r.recommend('171118')

[("A Swiftly Tilting Planet by Madeleine L'Engle", 10.0),
 ("The Godmother's Apprentice by Elizabeth Ann Scarborough", 10.0),
 ("The Godmother's Web by Elizabeth Ann Scarborough", 10.0),
 ("The Irrational Season (The Crosswicks Journal, Book 3) by Madeleine L'Engle",
  10.0),
 ('The Girl Who Loved Tom Gordon by Stephen King', 9.0)]
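
The "slice of pie" weighting inside `recommend` is easier to follow in isolation. The sketch below uses made-up Pearson scores and ratings for `k = 3` hypothetical neighbors; none of these names or numbers come from the datasets above.

```python
# (neighbor name, Pearson score with the target user, neighbor's rating of one band)
# All values are hypothetical, chosen only to illustrate the arithmetic.
neighbors = [
    ("neighbor_a", 0.8, 3.5),
    ("neighbor_b", 0.7, 5.0),
    ("neighbor_c", 0.5, 4.5),
]

# Each neighbor's weight is its share of the total Pearson score.
total = sum(score for _, score, _ in neighbors)
weights = {name: score / total for name, score, _ in neighbors}

# The projected rating is the weighted sum of the neighbors' ratings.
projected = sum(rating * weights[name] for name, _, rating in neighbors)

print(weights)
print(round(projected, 3))  # 4.275
```

With the default `k=1` the single neighbor gets weight 1, which is why the recommendations above carry that neighbor's ratings unchanged. The normalization divides by the sum of the scores, so it fails if the `k` Pearson scores sum to zero; a guard for that case may be worth adding.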

In [None]:
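# Homework
# 1. Implement a method that computes the Manhattan distance and the Euclidean distance.
# 2. The book's website has a dataset containing ratings for 25 movies; implement a recommendation algorithm for it.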