*Note: You are reading this in Google Colaboratory, a cloud-hosted version of Jupyter Notebook. The document contains both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will create a book recommendation algorithm using **K-Nearest Neighbors**.

You will use the [Book-Crossings dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). This dataset contains 1.1 million ratings (scale of 1-10) of 270,000 books by 90,000 users. 

After importing and cleaning the data, use `NearestNeighbors` from `sklearn.neighbors` to develop a model that shows books that are similar to a given book. The Nearest Neighbors algorithm measures distance to determine the “closeness” of instances.
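
To make "closeness" concrete, here is a minimal sketch on a made-up ratings matrix. The three toy books and their ratings are illustrative assumptions, not values from the dataset, and cosine distance is just one of the metrics `NearestNeighbors` accepts; the challenge text does not prescribe a metric.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: rows are books, columns are users, values are ratings.
toy_titles = ["Book A", "Book B", "Book C"]
toy_ratings = np.array([
    [10.0, 8.0, 0.0, 0.0],   # Book A
    [9.0, 7.0, 0.0, 1.0],    # Book B: rated much like Book A
    [0.0, 0.0, 10.0, 9.0],   # Book C: rated by a different set of users
])

# Cosine distance compares the direction of two rating vectors.
toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(toy_ratings)

# Ask for the 2 rows nearest to Book A; the first hit is Book A itself.
toy_distances, toy_indices = toy_model.kneighbors(toy_ratings[[0]], n_neighbors=2)
nearest_other = toy_titles[toy_indices[0][1]]
print(nearest_other, toy_distances[0][1])
```

Because the queried book is part of the fitted data, it comes back as its own nearest neighbor, so a recommender usually skips the first result.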

Create a function named `get_recommends` that takes a book title (from the dataset) as an argument and returns a list of 5 similar books with their distances from the book argument.

This code:

`get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))")`

should return:

```
[
  'The Queen of the Damned (Vampire Chronicles (Paperback))',
  [
    ['Catch 22', 0.793983519077301], 
    ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479], 
    ['Interview with the Vampire', 0.7345068454742432],
    ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
    ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386]
  ]
]
```

Notice that the data returned from `get_recommends()` is a list. The first element in the list is the book title passed in to the function. The second element in the list is a list of five more lists. Each of the five lists contains a recommended book and the distance from the recommended book to the book passed in to the function.
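
As a small illustration of that nested structure, the sketch below unpacks the expected return value shown above. The variable names are hypothetical and the data is copied from the example, not computed by a model.

```python
# Example return value copied from the expected output above.
example_result = [
    'The Queen of the Damned (Vampire Chronicles (Paperback))',
    [
        ['Catch 22', 0.793983519077301],
        ['The Witching Hour (Lives of the Mayfair Witches)', 0.7448656558990479],
        ['Interview with the Vampire', 0.7345068454742432],
        ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.5376338362693787],
        ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.5178412199020386],
    ],
]

# First element: the queried title. Second element: [title, distance] pairs.
query_title, recommendations = example_result

for rec_title, rec_distance in recommendations:
    print(f"{rec_distance:.3f}  {rec_title}")

# A smaller distance means a more similar book.
closest_title, closest_distance = min(recommendations, key=lambda pair: pair[1])
```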

If you graph the dataset (optional), you will notice that most books are not rated frequently. To ensure statistical significance, remove from the dataset users with fewer than 200 ratings and books with fewer than 100 ratings.
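
The filtering step can be sketched on a tiny, made-up table. The data and the thresholds of 2 are stand-ins for the real thresholds of 200 and 100, and counting both users and books on the unfiltered table is one reasonable reading of the instruction, not the only one.

```python
import pandas as pd

# Hypothetical toy ratings table with the same column names as the real data.
toy = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "isbn": ["a", "b", "c", "a", "b", "a"],
    "rating": [5.0, 0.0, 7.0, 8.0, 0.0, 9.0],
})

MIN_USER_RATINGS = 2  # stand-in for 200
MIN_BOOK_RATINGS = 2  # stand-in for 100

# Count ratings per user and per book on the unfiltered table.
toy_user_counts = toy["user"].value_counts()
toy_book_counts = toy["isbn"].value_counts()

toy_keep_users = toy_user_counts[toy_user_counts >= MIN_USER_RATINGS].index
toy_keep_books = toy_book_counts[toy_book_counts >= MIN_BOOK_RATINGS].index

# Keep only rows whose user and book both meet their thresholds.
toy_filtered = toy[toy["user"].isin(toy_keep_users) & toy["isbn"].isin(toy_keep_books)]
print(toy_filtered)
```

Here user 3 (one rating) and book "c" (one rating) are dropped, which leaves four rows.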

The first three cells import libraries you may need and the data to use. The final cell is for testing. Write all your code in between those cells.

In [15]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [16]:
# get data files
!wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

!unzip book-crossings.zip

books_filename = 'BX-Books.csv'
ratings_filename = 'BX-Book-Ratings.csv'

--2023-01-17 19:38:30--  https://cdn.freecodecamp.org/project-data/books/book-crossings.zip
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 104.26.3.33, 104.26.2.33, 172.67.70.149, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|104.26.3.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26085508 (25M) [application/zip]
Saving to: ‘book-crossings.zip.1’


2023-01-17 19:38:30 (152 MB/s) - ‘book-crossings.zip.1’ saved [26085508/26085508]

Archive:  book-crossings.zip
replace BX-Book-Ratings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace BX-Books.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace BX-Users.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [17]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

In [18]:
# add your code here - consider creating a new cell for each section of code

#print(df_ratings)

# Optional: plot how often each rating value occurs
#plt.rc("font", size=15)
#df_ratings['rating'].value_counts().sort_index().plot(kind='bar')
#plt.title('Rating Distribution\n')
#plt.xlabel('Rating')
#plt.ylabel('Count')
#plt.savefig('system1.png', bbox_inches='tight')
#plt.show()

In [19]:
# Merging the 2 datasets through the isbn
df_all = pd.merge(df_ratings,df_books,on='isbn')

In [20]:
print(df_all)

           user        isbn  rating  \
0        276725  034545104X     0.0   
1          2313  034545104X     5.0   
2          6543  034545104X     0.0   
3          8680  034545104X     5.0   
4         10314  034545104X     9.0   
...         ...         ...     ...   
1031170  276688  0517145553     0.0   
1031171  276688  1575660792     7.0   
1031172  276690  0590907301     0.0   
1031173  276704  0679752714     0.0   
1031174  276704  0806917695     5.0   

                                                     title           author  
0                                     Flesh Tones: A Novel       M. J. Rose  
1                                     Flesh Tones: A Novel       M. J. Rose  
2                                     Flesh Tones: A Novel       M. J. Rose  
3                                     Flesh Tones: A Novel       M. J. Rose  
4                                     Flesh Tones: A Novel       M. J. Rose  
...                                                    ...     

Cleaning up the datasets

In [21]:
# Get list of users with 200 ratings or more
user_ratings_count = (df_ratings.groupby(by = ['user'])['rating'].count().reset_index().rename(columns = {'rating': 'totalRatings'})[['user', 'totalRatings']])
keep_users = user_ratings_count.query('totalRatings > 199').user.tolist()

In [22]:
# Get list of books with 100 ratings or more
book_ratings_count = (df_all.groupby(by = ['title'])['rating'].count().reset_index().rename(columns = {'rating': 'totalRatings'})[['title', 'totalRatings']])
keep_books = book_ratings_count.query('totalRatings > 99').title.tolist()

In [23]:
# Remove the books with fewer than 100 ratings and users with fewer than 200 ratings
df_all = df_all[df_all['title'].isin(keep_books)]
df_all = df_all[df_all['user'].isin(keep_users)]

In [24]:
print(df_all)

           user        isbn  rating                   title           author
63       278418  0446520802     0.0            The Notebook  Nicholas Sparks
65         3363  0446520802     0.0            The Notebook  Nicholas Sparks
66         7158  0446520802    10.0            The Notebook  Nicholas Sparks
69        11676  0446520802    10.0            The Notebook  Nicholas Sparks
74        23768  0446520802     6.0            The Notebook  Nicholas Sparks
...         ...         ...     ...                     ...              ...
1028816  271284  0440910927     0.0           The Rainmaker     John Grisham
1029109  271705  B0001PIOX4     0.0          Fahrenheit 451     Ray Bradbury
1030402  274808  0449701913     0.0              Homecoming    Cynthia Voigt
1030863  275970  0865714215     0.0          Stormy Weather      Guy Dauncey
1030907  275970  1586210661     9.0  Me Talk Pretty One Day    David Sedaris

[68365 rows x 5 columns]


In [25]:
# Pivot to a book-by-user ratings matrix (unrated = 0) and store it as a sparse matrix for the model
book_features = df_all.pivot_table(index='title',columns='user',values='rating').fillna(0)
book_features_matrix = csr_matrix(book_features.values)

In [26]:
# Create the KNN model (cosine distance between book rating vectors)
model = NearestNeighbors(metric = 'cosine', n_neighbors=5, algorithm='auto')
model.fit(book_features_matrix)

NearestNeighbors(metric='cosine')

In [27]:
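# function to return recommended books - this will be tested
def get_recommends(book = ""):
  # Look up the row for this title; raises KeyError if the title was filtered out or misspelled
  index = book_features.index.get_loc(book)

  # The model uses n_neighbors=5 and the nearest neighbor is normally the book itself,
  # so this yields four recommendations; pass n_neighbors=6 to kneighbors() to get five
  distances, indices = model.kneighbors(book_features.iloc[index, :].values.reshape(1, -1))
  distances, indices = distances.flatten(), indices.flatten()

  # Skip position 0 (the query book) and list the rest from farthest to nearest
  recommended_books = [book, []]
  for i in range(1, len(distances)):
    recommended_books[1].insert(0, [book_features.index[indices[i]], distances[i]])
  return recommended_books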

Use the cell below to test your function. The `test_book_recommendation()` function will inform you if you passed the challenge or need to keep trying.

In [28]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
  test_pass = True
  recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
  if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
    test_pass = False
  recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
  recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
  for i in range(2): 
    if recommends[1][i][0] not in recommended_books:
      test_pass = False
    if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
      test_pass = False
  if test_pass:
    print("You passed the challenge!")
  else:
    print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", [['The Weight of Water', 0.77085835], ['I Know This Much Is True', 0.7529293], ['The Lovely Bones: A Novel', 0.7234864], ['Blue Diary', 0.71828747]]]
You passed the challenge!
