### First recommendation model
This is a version 1.0.0 of the recommendation engine using user-based collabrative filtering

**Advantages:**
- Simple and mathematical reliance.
- Ability to handle missing values.
- Easy to deploy.

**Drawbacks:**
- Less scalable and slow.

### Mathematical concept:
**Overcome Average Rating**
- Calculate the similarity of the users' preference simply calulate the average rating.
- Then cluster users that has similar average rating.
$$
rating(u, i) = \frac{ \sum_{u^{'} \in \Omega_{i}} r_{u^{'}i} }{|\Omega_{i}|}
$$

Explanation:
The rating of user $u$ to the item $i$ can be expressed as average of total ratings from each user $u^{'}$ to item $i$ then divide over the number of ratings.

This calculation makes the recommendation less accurate as it will predict the rating even though the predicted user has opposite rating with other users.
Hence, we need to add weight into the formula:

$$
rating(u, i) = \overline{r}_i + \frac{ \sum_{u^{'} \in \Omega_{i}} w_{uu^{'}} \times \overline{y}_{u^{'}i}  }{\sum_{u^{'} \in \Omega_{i}} |w_{uu^{'} }|}
$$

Symbols Notes:
$\overline{y}_{u^{'}i} = r_{u^{'}i} - \overline{r}_{u^{'}}$

This is the key point of our recommendation system.

Problems occurs:


### The recommendation steps:
1. Calculate the cosine similarity among all the users.
2. Train and predict the missing rating in the matrix.
3. Recommend by inference from the precalculated step.


In [1]:
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.utils import shuffle
from collections import Counter

In [2]:
books = pd.read_csv("./books/Books.csv", low_memory=False)
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [3]:
book_dataset = pd.read_csv("./books/Ratings.csv")
book_dataset.rename(columns = {'User-ID':'user_id', 'ISBN':'item_id', 'Book-Rating':'rating'}, inplace = True)
book_dataset.head()
book_dataset.shape

(1149780, 3)

In [4]:
# Preprocess data
book_dataset.user_id = book_dataset.user_id - 1
# map old movie index to new movie index
unique_item_ids = set(book_dataset.item_id.values)
item2idx = {}
count = 0
for item_id in unique_item_ids:
    item2idx[item_id] = count
    count += 1


# add them to the data frame
# takes awhile
book_dataset['item_idx'] = book_dataset.apply(lambda row: item2idx[row.item_id], axis=1)
book_dataset.head()

Unnamed: 0,user_id,item_id,rating,item_idx
0,276724,034545104X,0,89834
1,276725,0155061224,5,134147
2,276726,0446520802,0,61227
3,276728,052165615X,3,93169
4,276728,0521795028,6,318767


In [5]:
user_ids_count = Counter(book_dataset.user_id)
movie_ids_count = Counter(book_dataset.item_idx)

n = 20000
m = 2000

user_ids= [u for u, _ in user_ids_count.most_common(n)]
item_ids = [m for m, _ in movie_ids_count.most_common(m)]

df_small = book_dataset[book_dataset.user_id.isin(user_ids) & book_dataset.item_idx.isin(item_ids)].copy()

new_user_id_map = {}
i = 0 
for old in user_ids:
  new_user_id_map[old] = i
  i += 1
print("i:", i)

new_item_id_map = {}
j = 0
for old in item_ids:
  new_item_id_map[old] = j
  j += 1
print("j:", j)

print("Setting new ids")
df_small.loc[:, 'user_idx'] = df_small.apply(lambda row: new_user_id_map[row.user_id], axis=1)
df_small.loc[:, 'new_item_idx'] = df_small.apply(lambda row: new_item_id_map[row.item_idx], axis=1)

i: 20000
j: 2000
Setting new ids


In [6]:
df_small.head()
# df_small.shape

Unnamed: 0,user_id,item_id,rating,item_idx,user_idx,new_item_idx
10,276745,0425115801,0,59632,17383,443
11,276745,0449006522,0,44179,17383,602
12,276745,0553561618,0,162916,17383,424
13,276745,055356451X,0,294423,17383,280
16,276746,0060517794,9,287146,14270,1411


In [7]:
columns_titles = ["user_id", "item_id", "item_idx", "user_idx", "new_item_idx", "rating"]
df_small = df_small.reindex(columns=columns_titles)
df = df_small.iloc[:,3:].copy()
rating_matrix = df.values
rating_matrix


array([[17383,   443,     0],
       [17383,   602,     0],
       [17383,   424,     0],
       ...,
       [ 8395,   597,     0],
       [ 8395,   648,     7],
       [ 8395,   114,     0]])

In [43]:
class Recommender:
    """
    User-user collaborative filtering recommendation engine implementation
    """
    def __init__(self, data_matrix, k, distance=cosine_similarity):
        self.data = data_matrix
        self.y_pred = None
        self.k = k
        self.distance = distance
        # M x N matrix
        self.m_users = int(np.max(self.data[:, 0]))  + 1
        self.n_items = int(np.max(self.data[:, 1]))  + 1

    def fit(self):
        self.normalize()
        self.similarity()
    
    def normalize(self):
        users = self.data[:, 0]
        self.mu = np.zeros((self.m_users,))
        self.y_bar = self.data.copy()
        for u in range(self.m_users):
            idx = np.where(users == u)[0].astype(np.int32)
            # items = self.data[idx, 1]
            ratings = self.data[idx, 2]
            m = np.mean(ratings)
            if np.isnan(m):
                m = 0
            self.mu[u] = m
            self.y_bar[idx, 2] = ratings - self.mu[u]
        
        self.Y_bar = sparse.coo_matrix((self.y_bar[:, 2],
                                       (self.y_bar[:, 1], self.y_bar[:, 0])), (self.n_items, self.m_users))
        self.Y_bar = self.Y_bar.tocsr()

    def similarity(self):
        self.S = self.distance(self.Y_bar.T, self.Y_bar.T)
    
    def predict(self, u, i, normalized=1):
        """
        Predict rating of users to each item
        """
        # Find users that have already rated item i
        idx = np.where(self.data[:, 1] == i)[0].astype(np.int32)
        users_rated_i = (self.data[idx, 0]).astype(np.int32)
        sim = self.S[u, users_rated_i]
        a = np.argsort(sim)[-self.k:]
        nearest_s = sim[a]
        r = self.Y_bar[i, users_rated_i[a]]
        if normalized:
            return (r * nearest_s)[0] / (np.abs(nearest_s).sum() + 1e-8)

        return (r * nearest_s)[0] / (np.abs(nearest_s).sum() + 1e-8) + self.mu[u]

    def recommend(self, u):
        """
        Determine all items should be recommended for user u.
        The decision is made based on all i such that:
        self.prediction(u, i) > 0. Suppose we are considering items which
        have not been rated by u yet.
        """
        ids = np.where(self.data[:, 0] == u)[0]
        items_rated_by_u = self.data[ids, 1].tolist()
        recommended_items = []
        for i in range(self.n_items):
            if i not in items_rated_by_u:
                rating = self.predict(u, i)
                if rating > 0:
                    recommended_items.append(i)

        return recommended_items

    def recommend(self, u):
        """
        Determine top 10 items should be recommended for user u. 
        Suppose we are considering items which
        have not been rated by u yet.
        """
        ids = np.where(self.data[:, 0] == u)[0]
        items_rated_by_u = self.data[ids, 1].tolist()
        item = {'item_id': None, 'similar': None}
        list_items = []

        def take_similar(elem):
            return elem['similar']

        for i in range(self.n_items):
            if i not in items_rated_by_u:
                rating = self.predict(u, i)
                item['item_id'] = i
                item['similar'] = rating
                list_items.append(item.copy())

        sorted_items = sorted(list_items, key=take_similar, reverse=True)
        return sorted_items

    def print_recommendation(self, u, top_k_recommendations=10):
        """
        print all items which should be recommended for each user
        """
        print('Recommendation: ')
        items = self.recommend(u)
        top_k_items = items[:top_k_recommendations]

        idx = [i["item_id"] for i in top_k_items]
        recommend_list = []
        for i in idx:
            isbn = df_small.loc[df_small["new_item_idx"] == i]["item_id"].tolist()[0]
            book_titles = books.loc[books["ISBN"] == isbn, "Book-Title"].tolist()
            if len(book_titles) > 0:
                recommend_list.append(book_titles[0])
        for b in recommend_list:
            print(b)

In [44]:
recommender = Recommender(rating_matrix, k = 15)
recommender.fit()

user_id = 10
recommender.print_recommendation(u = user_id)

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


Recommendation: 
The Lord of the Rings (Movie Art Cover)
Pride and Prejudice (Penguin Popular Classics)
The Talisman
Charlotte's Web (Trophy Newbery)
The Vagina Monologues: The V-Day Edition
Harry Potter and the Prisoner of Azkaban (Book 3)
Fear and Loathing in Las Vegas : A Savage Journey to the Heart of the American Dream
The Hobbit : The Enchanting Prelude to The Lord of the Rings
Chapterhouse Dune (Dune Chronicles, Book 6)


In [32]:
print(recommender.recommend_top(u = user_id))

[{'item_id': 1578, 'similar': 4.360990117604245}, {'item_id': 1952, 'similar': 3.709310817824069}, {'item_id': 1595, 'similar': 3.5885971844943203}, {'item_id': 695, 'similar': 3.4839592959630377}, {'item_id': 482, 'similar': 3.427109636389919}, {'item_id': 1765, 'similar': 3.4229481732362124}, {'item_id': 200, 'similar': 3.3265498992063143}, {'item_id': 1790, 'similar': 3.3253066230351904}, {'item_id': 94, 'similar': 3.323977315716412}, {'item_id': 1990, 'similar': 3.0537566754700043}, {'item_id': 677, 'similar': 3.025836679567945}, {'item_id': 1524, 'similar': 3.013920227037299}, {'item_id': 1497, 'similar': 2.9855066436319904}, {'item_id': 1334, 'similar': 2.8576743304768346}, {'item_id': 1671, 'similar': 2.8569385001258443}, {'item_id': 1527, 'similar': 2.8137161123518264}, {'item_id': 1620, 'similar': 2.7929203823431896}, {'item_id': 1088, 'similar': 2.7779213112908874}, {'item_id': 737, 'similar': 2.716017788446766}, {'item_id': 854, 'similar': 2.6752865693490215}, {'item_id': 16