# Intro

In this notebook we will implement a recommendation algorithm for books.

As our method, we will use "collaborative filtering": we find users with common interests and recommend books based on those similarities.

# Group Members:
- Kaleb Alebachew 1539/13
- Natnael Malike  2166/13
- Kalkidan Tadesse  1559/13
- Tewodros Million  2675/13
- Mikiyas Mesfin    4731/13

In [1]:
import numpy as np
import pandas as pd



In [76]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
books = pd.read_csv("/content/BX_Books.csv", encoding='latin-1', on_bad_lines='skip', sep=';')
ratings = pd.read_csv("/content/BX-Book-Ratings.csv", encoding='latin-1', on_bad_lines='skip', sep=';')

books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

ratings.columns = ['userID', 'ISBN', 'bookRating']

In [77]:
ratings = ratings.set_index("ISBN")

books = books.set_index("ISBN")

In [78]:
books.head()

ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [79]:
ratings.tail()

ISBN,userID,bookRating
1563526298,276704,9
679447156,276706,0
515107662,276709,10
590442449,276721,10
5162443314,276723,8


---

# Distance between users

First we must implement a distance function in order to understand how close the interests of two users are.

For this, we will use NumPy's `linalg.norm`, which by default returns the Euclidean (L2) norm of the difference between two rating vectors.

In [9]:
def array_distance(a, b):
  # Euclidean (L2) distance between two equally sized rating vectors
  return np.linalg.norm(a - b)

In [14]:
# Testing function

a = np.array([3, 4, 5, 3, 2, 4])
b = np.array([4, 4, 4, 3, 2, 2])

array_distance(a, b)

2.449489742783178

The output above is the Euclidean distance between the two arrays, treated as points in six-dimensional space.

This is the core idea behind measuring the distance between two users' book ratings.
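
To make explicit what `linalg.norm` computes here, the same number can be derived by hand as the square root of the summed squared differences. This is only a small check on the two test arrays above, not part of the recommender; the name `manual_distance` is introduced purely for illustration.

```python
import numpy as np

a = np.array([3, 4, 5, 3, 2, 4])
b = np.array([4, 4, 4, 3, 2, 2])

# Squared differences per position are [1, 0, 1, 0, 0, 4], which sum to 6
manual_distance = np.sqrt(np.sum((a - b) ** 2))

# Both routes give sqrt(6)
print(manual_distance, np.linalg.norm(a - b))
```

Each position contributes equally, so a single large disagreement on one book weighs more than several small ones.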

**That being said, let's grab the users' ratings.**

In [92]:
def ratings_from_user(userID):
  # Select every rating given by this user; the result stays indexed by ISBN
  user_ratings = ratings.query("userID==%d" % userID)
  return user_ratings[["bookRating"]]

In [93]:
#Testing
ratings_from_user(276704)

ISBN,bookRating
0152022597,0
0312873115,0
0345386108,6
0380796155,5
0395404258,0
0425060772,0
0440206529,0
0441007813,0
0446353957,0
0446605409,0


**Now, let's use the above functions to get the distance between two users.**

In [26]:
distance_test = ratings_from_user(276729).join(ratings_from_user(276729), rsuffix="_A", lsuffix="_B")

In [28]:
array_distance(distance_test["bookRating_B"], distance_test['bookRating_A'])

0.0

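The result is trivially 0.0 here, because this test joins user 276729's ratings with themselves. It still shows that **the distance function works** on joined rating columns. Keep in mind that a distance of 0.0 can be misleading: two users who have nothing in common can also end up with 0.0, an issue that will be addressed soon.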


Let's define a function that outputs the distance between two users.

In [30]:
def distance_between_users(user1: int, user2: int):
  # Build one ratings dataframe per user
  ratings_from_user1 = ratings_from_user(user1)
  ratings_from_user2 = ratings_from_user(user2)

  # Join both dataframes on ISBN and keep only the books rated by both users
  both_ratings = ratings_from_user1.join(ratings_from_user2, lsuffix="_A", rsuffix="_B").dropna()

  # Return the distance between the two rating columns
  distance = array_distance(both_ratings["bookRating_A"], both_ratings["bookRating_B"])

  return [user1, user2, distance]

In [60]:
#Test
distance_between_users(276729, 276704)

[276729, 276704, 0.0]

It works! Same output as before.
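
A distance of 0.0 deserves some caution. The sketch below uses two hypothetical toy users (`user_a` and `user_b`, with made-up ISBN keys) who share no books, and shows what the join produces with and without `dropna()`. It illustrates the mechanics only and says nothing about specific users in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, indexed by made-up ISBN keys
user_a = pd.DataFrame({"bookRating": [8, 5]}, index=pd.Index(["b1", "b2"], name="ISBN"))
user_b = pd.DataFrame({"bookRating": [7, 9]}, index=pd.Index(["b3", "b4"], name="ISBN"))

# Left join: no ISBN matches, so user_b's column is entirely NaN
joined = user_a.join(user_b, lsuffix="_A", rsuffix="_B")
distance_without_dropna = np.linalg.norm(joined["bookRating_A"] - joined["bookRating_B"])

# After dropna() nothing is left, and the norm of an empty vector is 0.0
shared = joined.dropna()
distance_with_dropna = np.linalg.norm(shared["bookRating_A"] - shared["bookRating_B"])

print(len(shared), distance_without_dropna, distance_with_dropna)
```

In this toy case the pair looks identical (0.0) even though there is no evidence about their tastes at all, which is why zero distances need special handling later on.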

---

# Find most similar user

Analyzing random pairs of users one at a time is very time-consuming. **It is more valuable for us to find the K most similar users.**

(Or our "k-nearest neighbors", if you will.)

In [80]:
print("We have %d users" % ratings["userID"].nunique())

We have 105283 users


In [172]:
def distance_from_all(targetID: int):
  # With 105k+ users, comparing against everyone is too slow, so we only use the first 3000 user IDs
  all_users = ratings["userID"].unique()[:3000]

  distances = [distance_between_users(targetID, user) for user in all_users]

  distances = pd.DataFrame(distances, columns=["targetID", "otherUserID", "distance"])

  # distance>0 drops zero distances, which includes users with no books in common
  return distances.set_index("otherUserID").sort_values("distance").query("distance>0")

In [116]:
distance_from_all(276704)

otherUserID,targetID,distance
685,276704,3.0
277984,276704,4.0
3167,276704,6.0
278781,276704,7.0
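
Filtering on `distance>0` removes users with no shared books, but it also removes any user whose shared ratings match the target exactly. One possible refinement, sketched here under assumptions, is to return the number of shared books alongside the distance so the two cases can be told apart. The helper `distance_with_overlap` and the toy data are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def distance_with_overlap(ratings_a, ratings_b):
  # Hypothetical helper: the inner join keeps only ISBNs rated by both users
  both = ratings_a.join(ratings_b, lsuffix="_A", rsuffix="_B", how="inner")
  distance = np.linalg.norm(both["bookRating_A"] - both["bookRating_B"])
  return distance, len(both)

# Made-up toy users
target = pd.DataFrame({"bookRating": [8, 5]}, index=pd.Index(["b1", "b2"], name="ISBN"))
twin = pd.DataFrame({"bookRating": [8, 5]}, index=pd.Index(["b1", "b2"], name="ISBN"))
stranger = pd.DataFrame({"bookRating": [7]}, index=pd.Index(["b9"], name="ISBN"))

twin_distance, twin_shared = distance_with_overlap(target, twin)
stranger_distance, stranger_shared = distance_with_overlap(target, stranger)

# Both distances are 0.0, but only the twin actually shares books with the target
print(twin_distance, twin_shared, stranger_distance, stranger_shared)
```

With the overlap count available, a filter such as "at least one shared book" could replace `distance>0` without discarding perfect matches.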


---

# Suggest books based on closest users

The function below calls `distance_from_all` and keeps the three closest matches. It then collects those users' ratings and suggests books based on their average rating.

In [170]:
def suggest_to(userID: int):
  #user_ratings = ratings_from_user(userID)
  #books_read_by_user = user_ratings.index

  # Keep the 3 nearest users
  similar_users = distance_from_all(userID).head(3)
  similar_users_list = similar_users.index

  ratings_from_similar_users = ratings[ratings["userID"].isin(similar_users_list)]

  # Average each book's rating across the similar users, best first
  suggestions = ratings_from_similar_users.groupby("ISBN")[["bookRating"]].mean()
  suggestions = suggestions.sort_values("bookRating", ascending=False)

  return suggestions.join(books[["bookTitle", "bookAuthor", "yearOfPublication"]])

In [173]:
suggest_to(276704)

ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication
780451524201,10,,,
0553574353,10,Helter Skelter: The True Story of the Manson M...,Vincent Bugliosi,1996.0
0440998050,10,A Wrinkle in Time,Madeleine L'Engle,1976.0
0446611778,10,Last Man Standing,David Baldacci,2002.0
0515087491,10,The Corps: Semper Fi/Bk 1 (Corps (Paperback)),W. E. B. Griffin,1988.0
...,...,...,...,...
0553203630,0,Big Sky,Alfred Bertram Jr. Guthrie,1982.0
0345347951,0,Childhood's End,Arthur C. Clarke,1987.0
0345347536,0,A Spell for Chameleon (Xanth Novels (Paperback)),Piers Anthony,1987.0
0553239813,0,Little Drummer Girl,John Lecarre,1984.0
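
The commented-out lines in `suggest_to` hint at a natural next step: leaving out books the target user has already rated. A minimal sketch of that filter follows, assuming a suggestions table and a user-ratings table that are both indexed by ISBN; the toy data and the name `unread_suggestions` are hypothetical.

```python
import pandas as pd

# Hypothetical suggestions table, shaped like the output of suggest_to
suggestions = pd.DataFrame(
  {"bookRating": [10.0, 9.0, 7.0]},
  index=pd.Index(["b1", "b2", "b3"], name="ISBN"),
)

# Hypothetical ratings of the target user, shaped like the output of ratings_from_user
user_ratings = pd.DataFrame({"bookRating": [6]}, index=pd.Index(["b2"], name="ISBN"))

# Drop every suggested ISBN that the user has already rated
books_read_by_user = user_ratings.index
unread_suggestions = suggestions[~suggestions.index.isin(books_read_by_user)]

print(unread_suggestions)
```

Because the filter only compares index values, it can be applied before the final join with `books` without changing the rest of the function.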
