<a href="https://colab.research.google.com/github/irishka-learns/freecodecamp/blob/main/fcc_book_recommendation_knn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Note: You are currently reading this using Google Colaboratory which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will create a book recommendation algorithm using **K-Nearest Neighbors**.

You will use the [Book-Crossings dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). This dataset contains 1.1 million ratings (scale of 1-10) of 270,000 books by 90,000 users. 

After importing and cleaning the data, use `NearestNeighbors` from `sklearn.neighbors` to develop a model that shows books that are similar to a given book. The Nearest Neighbors algorithm measures distance to determine the “closeness” of instances.
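
As a rough illustration of what "closeness" means here, the sketch below fits `NearestNeighbors` with the cosine metric on a tiny hand-made matrix. The matrix `toy_ratings` and its values are hypothetical and are not taken from the Book-Crossings data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy ratings: rows are books, columns are users (0 = no rating).
toy_ratings = np.array([
    [10.0, 8.0, 0.0, 0.0],   # book A
    [ 9.0, 7.0, 0.0, 1.0],   # book B: rated much like book A
    [ 0.0, 0.0, 9.0, 10.0],  # book C: rated by different users
])

model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(toy_ratings)

# Ask for the 2 rows nearest to book A; the query row itself is part of the fitted data.
distances, indices = model.kneighbors(toy_ratings[[0]], n_neighbors=2)
print(indices[0])    # row positions of the nearest books
print(distances[0])  # cosine distances, smallest first
```

Because book A is itself in the fitted data, it comes back as its own nearest neighbor, which is why a recommender usually asks for one neighbor more than it needs.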

Create a function named `get_recommends` that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances from the book argument.

This code:

`get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")`

should return:

```
[
  'The Queen of the Damned (Vampire Chronicles (Paperback))',
  [
    ['Catch 22', 0.793983519077301], 
    ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
    ['Interview with the Vampire', 0.7345068454742432],
    ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
    ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
  ]
]
```

Notice that the data returned from `get_recommends()` is a list. The first element in the list is the book title passed in to the function. The second element in the list is a list of five more lists. Each of the five lists contains a recommended book and the distance from the recommended book to the book passed in to the function.
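
A minimal sketch of how a caller might unpack that structure. The `result` value is copied from the expected output above and shortened to two entries for brevity.

```python
# Sample result copied from the expected output above (shortened to two entries).
result = [
    'The Queen of the Damned (Vampire Chronicles (Paperback))',
    [
        ['Catch 22', 0.793983519077301],
        ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479],
    ],
]

# First element: the queried title. Second element: [title, distance] pairs.
query_title, neighbors = result
for title, distance in neighbors:
    print(f"{distance:.3f}  {title}")
```

In the expected output the pairs are listed from the largest distance to the smallest.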

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with fewer than 200 ratings and books with fewer than 100 ratings.
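
One possible way to apply such thresholds, sketched on a hypothetical miniature table. The column names match the ratings file, but the rows and the tiny stand-in thresholds are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature ratings table; the real thresholds are 200 and 100.
toy = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 3],
    "isbn":   ["a", "b", "c", "a", "b", "a"],
    "rating": [5.0, 7.0, 0.0, 8.0, 0.0, 10.0],
})
MIN_USER_RATINGS = 2  # stand-in for 200
MIN_BOOK_RATINGS = 2  # stand-in for 100

# Count how many ratings each user gave and each book received.
user_counts = toy["user"].value_counts()
book_counts = toy["isbn"].value_counts()

keep_users = user_counts[user_counts >= MIN_USER_RATINGS].index
keep_books = book_counts[book_counts >= MIN_BOOK_RATINGS].index

# Keep only rows whose user and book both pass their threshold.
filtered = toy[toy["user"].isin(keep_users) & toy["isbn"].isin(keep_books)]
print(filtered)
```

Both counts are taken on the unfiltered table here; counting books only after dropping users would be a different, equally defensible reading of the task.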

The first three cells import libraries you may need and the data to use. The final cell is for testing. Write all your code in between those cells.

In [None]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [None]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2021-03-10 10:49:40--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.3.33, 104.26.2.33, 172.67.70.149, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.3.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘book-crossings.zip’

book-crossings.zip      [  <=>               ]  24.88M  3.54MB/s    in 6.5s    

2021-03-10 10:49:47 (3.80 MB/s) - ‘book-crossings.zip’ saved [26085508]

Archive:  book-crossings.zip
  inflating: BX-Book-Ratings.csv     
  inflating: BX-Books.csv            
  inflating: BX-Users.csv            


In [None]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

The code below is written to pass the challenge as stated. However, I disagree with the approach that produces the expected values, because it ignores several points that the task description leaves unclear. After the test code, I provide an analysis and alternative code that makes more sense to me, even though it does not produce the expected results.

In [None]:
# keep only users with at least 200 ratings and books (by isbn) with at least 100 ratings
userCounts = df_ratings.user.value_counts()
usersList = userCounts[userCounts >= 200].index.values
booksCounts = df_ratings.isbn.value_counts()
booksList = booksCounts[booksCounts >= 100].index.values
df_ratings_set = df_ratings[(df_ratings.user.isin(usersList)) & (df_ratings.isbn.isin(booksList))]
# attach titles and authors, then keep one rating per (user, title) pair
df_ratings_set = df_ratings_set.merge(df_books, on = 'isbn')
df_ratings_set.drop_duplicates(subset=['user', 'title'], inplace=True)
df_ratings_set.head()

|   | user | isbn | rating | title | author |
|---|------|------|--------|-------|--------|
| 0 | 277427 | 002542730X | 10.0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner |
| 1 | 3363 | 002542730X | 0.0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner |
| 2 | 11676 | 002542730X | 6.0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner |
| 3 | 12538 | 002542730X | 10.0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner |
| 4 | 13552 | 002542730X | 0.0 | Politically Correct Bedtime Stories: Modern Ta... | James Finn Garner |


In [None]:
# function to return recommended books - this will be tested
def get_recommends(book = ""):
  recommended_books = [book]
  # build the title x user rating matrix (dense; missing ratings become 0)
  piv = df_ratings_set.pivot(index= 'title', columns= 'user', values='rating')
  piv = piv.fillna(0)
  rating_matrix = piv.values
  # create the model
  neighbors = NearestNeighbors(metric = 'cosine', algorithm='brute')
  neighbors.fit(rating_matrix)
  # convert the function input into the input for the model
  req = np.array(piv.loc[book]).reshape(1,-1)
  # query 6 neighbors: the nearest is normally the book itself, so keep the last 5
  dist, ids = neighbors.kneighbors(req, 6)
  books = list(piv.iloc[ids[0]].index.values)[-5:]
  dist = list(dist[0])[-5:]
  # pair titles with distances and order them from farthest to nearest
  recommended_zip = zip(books, dist)
  recommended = [[x[0], x[1]] for x in recommended_zip]
  recommended_books.append(recommended[::-1])
  return recommended_books
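
The pivot above is handed to the model as a dense array, although `csr_matrix` is already imported. A sparse variant could look like the sketch below. It assumes a hypothetical toy table in the same long format and is not the code used to pass the test.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data in the same long format as df_ratings_set.
toy = pd.DataFrame({
    "title":  ["A", "A", "B", "B", "C", "C"],
    "user":   [1, 2, 1, 2, 3, 4],
    "rating": [10.0, 8.0, 9.0, 7.0, 9.0, 10.0],
})
piv = toy.pivot(index="title", columns="user", values="rating").fillna(0)
sparse_ratings = csr_matrix(piv.values)  # stores only the non-zero ratings

model = NearestNeighbors(metric="cosine", algorithm="brute").fit(sparse_ratings)

# Look up the row position of the query title, then query with that sparse row.
row = piv.index.get_loc("A")
dist, ids = model.kneighbors(sparse_ratings[row], n_neighbors=2)
print(list(piv.index[ids[0]]), dist[0])
```

With the real data most cells of the pivot are zero, so the sparse form mainly saves memory; the distances themselves should not change.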

Use the cell below to test your function. The `test_book_recommendation()` function will inform you if you passed the challenge or need to keep trying.

In [None]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2): 
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge! 🎉🎉🎉🎉🎉")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075], ['The Lovely Bones: A Novel', 0.7234864]]]
You passed the challenge! 🎉🎉🎉🎉🎉


Why do I disagree with the approach above?
1. The duplicate search does not take into account that a number of books share the same title but were written by different authors.
2. The same book can appear under different ISBNs, so the title+author combination has to be used when excluding books that were rated fewer than 100 times.
3. Consequently, the user+title+author combination has to be used to remove duplicates.
4. It is unclear which rating should be kept when removing duplicates: the first found, the first non-zero, the mean, the mean of non-zero values, the maximum, etc.
5. It is unclear what a rating of 0.0 means, since the task description states that ratings range from 1 to 10. If a zero means that the book was not actually rated, this should be taken into account when selecting books for the resulting dataset. If instead the mere fact that a user read the book is useful for finding similarities, then such books should not be treated as unrated and should not be candidates for dropping; in that case some adjustment is needed to distinguish these entries from the zeros that fill the final sparse matrix.
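
To make point 4 concrete, the sketch below computes several of the candidate rules for one hypothetical user who rated three editions of the same title. The table `dupes` is invented for illustration.

```python
import pandas as pd

# Hypothetical case: one user rated three editions of the same title.
dupes = pd.DataFrame({
    "user":   [7, 7, 7],
    "title":  ["Some Book", "Some Book", "Some Book"],
    "rating": [0.0, 8.0, 6.0],
})
grouped = dupes.groupby(["user", "title"])["rating"]

first_found = grouped.first().iloc[0]  # whichever row happens to come first
mean_all = grouped.mean().iloc[0]      # mean including the zero
highest = grouped.max().iloc[0]        # maximum rating

# Mean of non-zero ratings only: drop the zeros before grouping.
nonzero = dupes[dupes["rating"] > 0]
mean_nonzero = nonzero.groupby(["user", "title"])["rating"].mean().iloc[0]

print(first_found, mean_all, mean_nonzero, highest)
```

Each rule gives a different value for the same user and book, so the choice directly changes the rating matrix and therefore the distances.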

In [None]:
#df_books.info() #27139 records
# double-check that all isbns are unique
df_books.isbn.unique().shape
# check for duplicates by title
df_books.duplicated(subset = ['title']).sum() #29225
# compare with duplicates by title+author
df_books.duplicated(subset=['title', 'author']).sum() #20175 => the title+author combination has to be used for the duplicate search; there is a many-to-one relationship between isbn and title+author

# check whether there are users who rated the same book more than once
#df_ratings_set.duplicated(subset = ['user', 'title', 'author']).sum() #3435 => remove duplicates, turning the relationship between isbn and title+author into 1-to-1 for simplicity
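
The difference between the two duplicate counts can be seen on a small hypothetical catalogue. The titles, authors, and ISBNs below are placeholders, not rows from `df_books`.

```python
import pandas as pd

# Hypothetical catalogue: one title shared by two authors, one book with two isbns.
toy_books = pd.DataFrame({
    "isbn":   ["1", "2", "3", "4"],
    "title":  ["Shared Title", "Shared Title", "Shared Title", "Other Title"],
    "author": ["Author One", "Author One", "Author Two", "Author Three"],
})

# Rows flagged as duplicates when only the title is compared.
by_title = toy_books.duplicated(subset=["title"]).sum()
# Rows flagged as duplicates when title and author are compared together.
by_title_author = toy_books.duplicated(subset=["title", "author"]).sum()

print(by_title, by_title_author)
```

Comparing by title alone also flags the book by the second author, which is a different work; comparing by title+author flags only the second ISBN of the same book.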

In [None]:
# merge tables
df_ratings_adv = df_ratings.merge(df_books, on = 'isbn')
# identify each book by title+author so that different isbns of the same book are merged
df_ratings_adv['full_name'] = df_ratings_adv['title'] + '//' + df_ratings_adv['author']
df_ratings_adv.drop(columns=['isbn', 'author', 'title'], inplace=True)
# final rating: the mean, assuming that users could reassess a rating; this aims at the 'average impression'
df_ratings_adv = df_ratings_adv.groupby(['user', 'full_name'], as_index=False)['rating'].mean() # table with no duplicates
# drop users who provided fewer than 200 ratings
userCounts = df_ratings_adv.user.value_counts()
usersList = userCounts[userCounts >= 200].index.values
# drop books which were rated fewer than 100 times
booksCounts = df_ratings_adv.full_name.value_counts()
booksList = booksCounts[booksCounts >= 100].index.values
# .copy() avoids modifying a view of the unfiltered table in the next step
df_ratings_adv = df_ratings_adv[(df_ratings_adv.user.isin(usersList)) & (df_ratings_adv.full_name.isin(booksList))].copy()
# handle 0.0 ratings: assuming that the fact that a book was read should count, add 1 to every rating
df_ratings_adv['rating'] = df_ratings_adv['rating'] + 1
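
The effect of adding 1 to every rating can be checked on a small hypothetical table: after pivoting, an explicit 0.0 rating and a missing rating are indistinguishable unless the ratings are shifted first. The helper `to_matrix` and the rows below are illustrative only.

```python
import pandas as pd

# Hypothetical ratings: user 1 gave book A an explicit 0, user 2 never rated book B.
toy = pd.DataFrame({
    "full_name": ["A//x", "A//x", "B//y"],
    "user":      [1, 2, 1],
    "rating":    [0.0, 8.0, 5.0],
})

def to_matrix(df):
    """Pivot to a full_name x user matrix, filling missing ratings with 0."""
    return df.pivot(index="full_name", columns="user", values="rating").fillna(0)

plain = to_matrix(toy)
shifted = to_matrix(toy.assign(rating=toy["rating"] + 1))

# Without the shift, the explicit 0 and the missing rating look identical.
print(plain.loc["A//x", 1], plain.loc["B//y", 2])
# With the shift, the explicit 0 becomes 1 while the missing entry stays 0.
print(shifted.loc["A//x", 1], shifted.loc["B//y", 2])
```

The shift is only one possible adjustment; it changes every cosine distance, which is part of why the results differ from the expected values of the challenge.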

In [None]:
# function
def get_recommends_advanced(book = ""):
  recommended_books = [book]
  # build the full_name x user rating matrix (dense; missing ratings become 0)
  piv = df_ratings_adv.pivot(index= 'full_name', columns= 'user', values='rating')
  piv = piv.fillna(0)
  rating_matrix = piv.values
  # create the model
  neighbors = NearestNeighbors(metric = 'cosine', algorithm='brute')
  neighbors.fit(rating_matrix)
  # convert the function input into the input for the model:
  # match on 'title//' and take the first hit, since several authors may share a title
  matches = piv.loc[piv.index.str.startswith(book + '//')]
  req = matches.iloc[[0]].values
  # query 6 neighbors: the nearest is normally the book itself, so keep the last 5
  dist, ids = neighbors.kneighbors(req, 6)
  f_names = list(piv.iloc[ids[0]].index.values)[-5:]
  # strip the author part to return plain titles
  books = [x.split('//')[0] for x in f_names]
  dist = list(dist[0])[-5:]
  recommended_zip = zip(books, dist)
  recommended = [[x[0], x[1]] for x in recommended_zip]
  recommended_books.append(recommended[::-1])
  return recommended_books

In [None]:
get_recommends_advanced('The Queen of the Damned (Vampire Chronicles (Paperback))')

['The Queen of the Damned (Vampire Chronicles (Paperback))',
 [['The Witching Hour (Lives of the Mayfair Witches)', 0.7079966933191927],
  ['Cry to Heaven', 0.7060470249304717],
  ['Interview with the Vampire', 0.6849759223267521],
  ['The Tale of the Body Thief (Vampire Chronicles (Paperback))',
   0.5039896286131911],
  ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.48604703760195067]]]