# freeCodeCamp Challenge - Book Recommendation Engine
## with KNN
Nearest-neighbor methods come in two kinds:
- unsupervised neighbors-based learning methods
- supervised neighbors-based learning methods

The unsupervised kind underlies methods such as manifold learning and spectral clustering, while the latter includes:
> 1. Nearest Neighbors Classification with discrete labels
> 2. Nearest Neighbors Regression with continuous labels

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.

In this challenge, <b>Unsupervised Nearest Neighbors</b> is applied to find the books most similar to any given input book.
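
To make the principle concrete before touching the book data, here is a minimal sketch on made-up 2-D points. The array `points` and the query point are hypothetical, and the metric is left at scikit-learn's default (Euclidean) rather than the cosine metric used later in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: four unlabeled points in 2-D
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

nn = NearestNeighbors(n_neighbors=2)  # default metric is Euclidean
nn.fit(points)

# Ask for the 2 training points closest to a new point
distances, indices = nn.kneighbors(np.array([[0.9, 0.1]]))
print(indices[0])    # row positions of the neighbors, nearest first
print(distances[0])  # the matching distances
```

No labels are involved: the model only stores the samples and answers "which stored rows are closest to this one?", which is exactly what a book-to-book recommender needs.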

In [1]:
# import libraries (you may add additional imports but you may not have to)
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

In [2]:
# get data files
# !wget https://cdn.freecodecamp.org/project-data/books/book-crossings.zip

# !unzip book-crossings.zip

books_filename = 'book-crossings/BX-Books.csv'
ratings_filename = 'book-crossings/BX-Book-Ratings.csv'

In [3]:
# import csv data into dataframes
df_books = pd.read_csv(
    books_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author'],
    usecols=['isbn', 'title', 'author'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding = "ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

In [4]:
df_books

Unnamed: 0,isbn,title,author
0,0195153448,Classical Mythology,Mark P. O. Morford
1,0002005018,Clara Callan,Richard Bruce Wright
2,0060973129,Decision in Normandy,Carlo D'Este
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata
4,0393045218,The Mummies of Urumchi,E. J. W. Barber
...,...,...,...
271374,0440400988,There's a Bat in Bunk Five,Paula Danziger
271375,0525447644,From One to One Hundred,Teri Sloat
271376,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker
271377,0192126040,Republic (World's Classics),Plato


In [5]:
df_ratings

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0.0
1,276726,0155061224,5.0
2,276727,0446520802,0.0
3,276729,052165615X,3.0
4,276729,0521795028,6.0
...,...,...,...
1149775,276704,1563526298,9.0
1149776,276706,0679447156,0.0
1149777,276709,0515107662,10.0
1149778,276721,0590442449,10.0


## Data Preprocessing 
Both dataframes contain duplicates and unprocessed data, and combining them as they are produces a dataframe that is too large to train the model on comfortably. We therefore use <b>user_rating_frequency</b> and <b>unit_book_total_ratings</b> to filter out some of the data, remove duplicates, and combine the datasets into a smaller, processed dataframe to feed the model.
## 1. User_Rating_Frequency
### I. Getting the Total Number of Ratings per User

> The same user can give two ratings to two different books.

This is why we count the number of ratings per user.
First, we build a new dataframe that holds, for every user, the number of ratings that user has given.

In [6]:
user_rating_frequency = df_ratings.groupby(by='user')['rating'].count().reset_index().rename(columns={'rating':'user_rating_frequency'})
user_rating_frequency

Unnamed: 0,user,user_rating_frequency
0,2,1
1,7,1
2,8,18
3,9,3
4,10,2
...,...,...
105278,278846,2
105279,278849,4
105280,278851,23
105281,278852,1


### II. Sorting the Rating Frequency in the Ascending Order
Then, we sort the dataframe in ascending order of <b>user_rating_frequency</b>.

Applying the pandas method <span style="background:#BEBEBE; color:black; border-radius:3px; padding: 2px 4px">.value_counts()</span> to <span style="background:#BEBEBE; color:black; border-radius:3px; padding: 2px 4px">df_ratings['user']</span> does step I and step II in a single line. It builds the same counts as the previous step and also sorts them (descending by default, ascending with `ascending=True`), with no need to convert anything to a numpy array.

Here, numpy is used for the sort only to show the manual route I worked out while researching. Keeping the sorted values in a dataframe object is preferable, because the indices then stay attached to the rows for comparison, no matter how the data is sorted.

In [7]:
U = user_rating_frequency.to_numpy() # change it to numpy array
U = U[U[:,1].argsort()] # U[:,1].argsort() returns the indices in sorted order
user_rating_frequency = pd.DataFrame(U, columns=['user','user_rating_frequency']) # rebuild the dataframe object from numpy array
user_rating_frequency

Unnamed: 0,user,user_rating_frequency
0,2,1
1,159665,1
2,159662,1
3,159655,1
4,159651,1
...,...,...
105278,35859,5850
105279,98391,5891
105280,153662,6109
105281,198711,7550
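
The one-line alternative mentioned in this step can be sketched on a toy table. The dataframe `toy_ratings` below is made up for illustration; only `value_counts`, `groupby`, and `sort_values` are real pandas calls.

```python
import pandas as pd

# Hypothetical toy ratings: user 1 rates three books, user 2 one, user 3 two
toy_ratings = pd.DataFrame({
    "user":   [1, 1, 1, 2, 3, 3],
    "isbn":   ["a", "b", "c", "a", "b", "c"],
    "rating": [5.0, 0.0, 8.0, 7.0, 0.0, 9.0],
})

# Steps I and II in one call: count ratings per user, sorted ascending
freq = toy_ratings["user"].value_counts(ascending=True)
print(freq)

# The same counts via groupby, sorted without leaving pandas
freq_gb = (toy_ratings.groupby("user")["rating"].count()
           .sort_values()
           .rename("user_rating_frequency"))
print(freq_gb)
```

In both results the user IDs form the index, so they stay attached to their counts however the rows are ordered.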


### III. Filtering the Rating Frequency
If the model considers only the significant values, training should be faster and the model's results more precise. Thus, we rebuild our dataframe after removing the insignificant values.

Here, we filter the dataframe by removing every user who has given fewer than 200 ratings.

> Do not lose the values of user!!!

It is important to keep those for step IV.

In [8]:
U = user_rating_frequency[user_rating_frequency['user_rating_frequency']>=200] # remove users who rate less than 200 times
U

Unnamed: 0,user,user_rating_frequency
104378,252827,200
104379,36554,200
104380,83671,200
104381,99955,200
104382,225595,200
...,...,...
105278,35859,5850
105279,98391,5891
105280,153662,6109
105281,198711,7550


### IV. Building the new DataFrame
With the values under 'user', it is easy to retrieve the <b>ISBN</b> values for the filtered users from the raw <b>df_ratings</b>. ISBN is the key on which the dataframes are joined later. In this step, we build the filtered <b>user_rating_frequency</b> dataframe with ISBN values.

If the user IDs are the row index of the frequency table (for example, when it was built with `.value_counts()`), <span style="background:#BEBEBE; color:black; border-radius:3px; padding: 2px 4px">U.index</span> can be passed into <b>original_idx</b>.

If the user IDs are an ordinary column, as in the dataframe U above, we can use <span style="background:#BEBEBE; color:black; border-radius:3px; padding: 2px 4px">U['user'].values</span>. 

In [9]:
original_idx = U['user'].values # get the users of the filtered data
df_URF = df_ratings[df_ratings['user'].isin(original_idx)] # keep only the ratings made by those users (isbn values included)
df_URF

Unnamed: 0,user,isbn,rating
1456,277427,002542730X,10.0
1457,277427,0026217457,0.0
1458,277427,003008685X,8.0
1459,277427,0030615321,0.0
1460,277427,0060002050,0.0
...,...,...,...
1147612,275970,3829021860,0.0
1147613,275970,4770019572,0.0
1147614,275970,896086097,0.0
1147615,275970,9626340762,8.0
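
The two ways of collecting the user IDs can be compared on a toy table. Everything below is a sketch: `toy_ratings` and the threshold `MIN_RATINGS` are hypothetical stand-ins for `df_ratings` and the 200-rating cutoff. Note that in this notebook the index of U holds row positions left over from step II, so the column-based form is the one that applies there.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user":   [1, 1, 1, 2, 3, 3],
    "isbn":   ["a", "b", "c", "a", "b", "c"],
    "rating": [5.0, 0.0, 8.0, 7.0, 0.0, 9.0],
})
MIN_RATINGS = 2  # hypothetical threshold standing in for 200

# Case 1: user IDs are the index (result of value_counts)
counts = toy_ratings["user"].value_counts()
ids_from_index = counts[counts >= MIN_RATINGS].index

# Case 2: user IDs are an ordinary column (groupby + reset_index)
table = (toy_ratings.groupby("user")["rating"].count()
         .reset_index()
         .rename(columns={"rating": "user_rating_frequency"}))
ids_from_column = table.loc[
    table["user_rating_frequency"] >= MIN_RATINGS, "user"
].values

# Either collection selects the same rows of the raw ratings
kept_a = toy_ratings[toy_ratings["user"].isin(ids_from_index)]
kept_b = toy_ratings[toy_ratings["user"].isin(ids_from_column)]
print(kept_a.equals(kept_b))
```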


## 2. Unit_Book_Total_Ratings
> The same book can have three different ratings.

To handle this, the same four steps are repeated to build a new dataframe, this time keeping only the books that have at least 100 ratings. Here, steps I and II are done with the previously mentioned pandas method.

In [10]:
count = df_ratings['isbn'].value_counts()
original_indices = count[count>=100].index
df_UBTR = df_ratings[df_ratings['isbn'].isin(original_indices)]  # isbn values will be used as a joint to connect dataframes
df_UBTR

Unnamed: 0,user,isbn,rating
2,276727,0446520802,0.0
8,276744,038550120X,7.0
10,276746,0425115801,0.0
11,276746,0449006522,0.0
12,276746,0553561618,0.0
...,...,...,...
1149749,276690,0064400557,0.0
1149761,276704,0345386108,6.0
1149768,276704,0446605409,0.0
1149771,276704,0743211383,7.0


### V. Combining DataFrames and Removing Duplicates
By combining the two previous dataframes (the dataframes for user_rating_frequency and unit_book_total_ratings), we obtain a processed, filtered, and smaller dataframe of ratings.

This dataframe is then merged with the original books dataframe to create the final dataframe. Rows that repeat the same title and user are removed from the final dataframe, as it will be the model input.

In [11]:
df_processed_ratings = pd.merge(df_URF, df_UBTR) # new df_ratings 
df = pd.merge(df_books, df_processed_ratings) # final df 
df = df.drop_duplicates(['title','user']) # removing duplicates
df

Unnamed: 0,isbn,title,author,user,rating
0,0440234743,The Testament,John Grisham,277478,0.0
1,0440234743,The Testament,John Grisham,2977,0.0
2,0440234743,The Testament,John Grisham,3363,0.0
3,0440234743,The Testament,John Grisham,7346,9.0
4,0440234743,The Testament,John Grisham,9856,0.0
...,...,...,...,...,...
49512,0515135739,Eleventh Hour: An FBI Thriller (FBI Thriller (...,Catherine Coulter,236283,0.0
49513,0515135739,Eleventh Hour: An FBI Thriller (FBI Thriller (...,Catherine Coulter,251613,0.0
49514,0515135739,Eleventh Hour: An FBI Thriller (FBI Thriller (...,Catherine Coulter,252071,0.0
49515,0515135739,Eleventh Hour: An FBI Thriller (FBI Thriller (...,Catherine Coulter,256407,0.0
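
Because no `on=` argument is passed to `pd.merge` here, it may help to see what pandas does by default: it joins on every column the two frames share and keeps only matching rows (`how="inner"`). The sketch below uses two made-up frames, `by_user` and `by_book`, as stand-ins for `df_URF` and `df_UBTR`.

```python
import pandas as pd

# Hypothetical stand-ins for df_URF and df_UBTR
by_user = pd.DataFrame({"user": [1, 1, 3],
                        "isbn": ["a", "b", "b"],
                        "rating": [5.0, 0.0, 9.0]})
by_book = pd.DataFrame({"user": [1, 3, 4],
                        "isbn": ["b", "b", "b"],
                        "rating": [0.0, 9.0, 7.0]})

# No `on=` given: join on all shared columns, inner join
implicit = pd.merge(by_user, by_book)
# The same merge with the defaults spelled out
explicit = pd.merge(by_user, by_book,
                    on=["user", "isbn", "rating"], how="inner")

print(implicit)
print(implicit.equals(explicit))
```

Only the rows present in both frames survive, so the merged ratings come from users who pass the user filter and books that pass the book filter.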


### VI. Pivoting the DataFrame
Pivoting reshapes the dataframe: each row becomes a book title, each column a user, and each cell that user's rating of that book, so that the rating values can be passed into the model. 
NaNs (missing ratings) are replaced by zeros for computational purposes.

In [12]:
piv=df.pivot(index='title', columns='user', values='rating').fillna(0)
piv 

title / user,254,2276,2766,2977,3363,4017,4385,6242,6251,6323,...,274004,274061,274301,274308,274808,275970,277427,277478,277639,278418
1984,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1st to Die: A Novel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2nd Chance,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 Blondes,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Without Remorse,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Year of Wonders,0.0,0.0,0.0,7.0,0.0,0.0,0.0,7.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
You Belong To Me,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
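
The pivot also shows why the duplicate removal in step V matters: `DataFrame.pivot` raises a `ValueError` when the same (index, column) pair occurs twice. A small sketch on a made-up frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "title":  ["Book A", "Book A", "Book B", "Book B"],
    "user":   [1, 2, 1, 1],          # user 1 rated "Book B" twice
    "rating": [8.0, 0.0, 5.0, 6.0],
})

try:
    toy.pivot(index="title", columns="user", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True  # duplicate (title, user) pairs cannot be reshaped
print(pivot_failed)

# drop_duplicates keeps the first occurrence of each (title, user) pair
deduped = toy.drop_duplicates(["title", "user"])
toy_piv = deduped.pivot(index="title", columns="user", values="rating").fillna(0)
print(toy_piv)
```

After deduplication, the cell for "Book B" and user 2 has no rating and is filled with 0, just like the unrated cells in the real pivot.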


### VII. Creating a Matrix
Since the KNN model takes a sample array or matrix as input, the dataframe object must be converted into a matrix (a numpy array).

In [13]:
book_titles = piv.index
matrix = piv.values
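
The notebook imports `csr_matrix` but goes on to use the dense array. As an optional variation, not the approach taken here, a mostly-zero rating matrix could be stored in sparse form; the sketch below uses a hypothetical toy matrix `dense` to show that a brute-force cosine model gives the same neighbors either way.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy title-by-user ratings, mostly zeros like the real pivot
dense = np.array([
    [9.0, 0.0, 0.0, 7.0],
    [8.0, 0.0, 0.0, 6.0],
    [0.0, 5.0, 0.0, 0.0],
])
sparse = csr_matrix(dense)  # stores only the non-zero entries
print(sparse.nnz, "stored values out of", dense.size)

dense_model = NearestNeighbors(algorithm="brute", metric="cosine").fit(dense)
sparse_model = NearestNeighbors(algorithm="brute", metric="cosine").fit(sparse)

# Query with the first row in both representations
d_dense, i_dense = dense_model.kneighbors(dense[:1], n_neighbors=2)
d_sparse, i_sparse = sparse_model.kneighbors(sparse[:1], n_neighbors=2)
print(i_dense[0], i_sparse[0])
```

Whether the sparse form pays off depends on how many zeros the real pivot contains; for a small filtered dataset the dense array is simpler.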

## Model Training
NearestNeighbors offers three search algorithms, each with its own pros and cons:
> - brute algorithm
> - kd_tree algorithm
> - ball_tree algorithm

Here, the brute algorithm works best, together with the cosine metric. Note that the cosine metric cannot be used with the "ball_tree" and "kd_tree" algorithms.

In [14]:
model = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="cosine") # build the model
model.fit(matrix) # train the model
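
For intuition about the numbers the model returns, cosine distance is one minus the cosine of the angle between two rating vectors. The helper `cosine_distance` and the three vectors below are hypothetical, written only to illustrate the formula.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity of two rating vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

book_a = np.array([10.0, 0.0, 8.0, 0.0])
book_b = np.array([5.0, 0.0, 4.0, 0.0])   # same readers, same proportions
book_c = np.array([0.0, 9.0, 0.0, 7.0])   # no readers in common with book_a

print(cosine_distance(book_a, book_b))  # vectors point the same way
print(cosine_distance(book_a, book_c))  # vectors are orthogonal
```

Books rated by the same users in the same proportions have a distance near 0, and books with no raters in common have a distance of 1. The metric ignores vector length, so a book with many ratings is not automatically far from one with few.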

## Return Function

In [15]:
# function to return recommended books - this will be tested
def get_recommends(book = ""):
    recommended_books = [book] # the input book is always the 1st entry
    similar_books = []

    if book in book_titles:
        # ratings row of the input book, reshaped to 1 row with unknown columns
        rating_array = piv.loc[book].to_numpy().reshape(1, -1)
        # get (the book itself + 5 closest neighbors') distances and indices
        distances, indices = model.kneighbors(rating_array, n_neighbors=6)

        # distances[0] and indices[0] are aligned, so walk them together
        for distance, index in zip(distances[0], indices[0]):
            similar_book_name = book_titles[index]
            if similar_book_name == book: # skip the book itself
                continue
            # store [book_name, cosine distance]; a smaller distance means more similar
            similar_books.append([similar_book_name, distance])
        similar_books.reverse() # farthest neighbor first, closest last

    recommended_books.append(similar_books) # add similar_books as 2nd entry
    return recommended_books # [book, []] if the book is empty or unknown

## Predictions

In [16]:
books = get_recommends("The Queen of the Damned (Vampire Chronicles (Paperback))") # test the model
print(books)

['The Queen of the Damned (Vampire Chronicles (Paperback))', [['Catch 22', 0.7939835], ['The Witching Hour (Lives of the Mayfair Witches)', 0.74486566], ['Interview with the Vampire', 0.73450685], ['The Tale of the Body Thief (Vampire Chronicles (Paperback))', 0.53763384], ['The Vampire Lestat (Vampire Chronicles, Book II)', 0.51784116]]]


In [17]:
books = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
print(books)

def test_book_recommendation():
    test_pass = True
    recommends = get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))")
    if recommends[0] != "Where the Heart Is (Oprah's Book Club (Paperback))":
        test_pass = False
    recommended_books = ["I'll Be Seeing You", 'The Weight of Water', 'The Surgeon', 'I Know This Much Is True']
    recommended_books_dist = [0.8, 0.77, 0.77, 0.77]
    for i in range(2): 
        if recommends[1][i][0] not in recommended_books:
            test_pass = False
        if abs(recommends[1][i][1] - recommended_books_dist[i]) >= 0.05:
            test_pass = False
    if test_pass:
        print("You passed the challenge! 🎉🎉🎉🎉🎉")
    else:
        print("You haven't passed yet. Keep trying!")

test_book_recommendation()

["Where the Heart Is (Oprah's Book Club (Paperback))", [["I'll Be Seeing You", 0.8016211], ['The Weight of Water', 0.77085835], ['The Surgeon', 0.7699411], ['I Know This Much Is True', 0.7677075], ['The Lovely Bones: A Novel', 0.7234864]]]
You passed the challenge! 🎉🎉🎉🎉🎉


In [18]:
def run():
    books = get_recommends(input("Please insert the name of the book: "))
    print()
    print(books)
run()

Please insert the name of the book:  I'll Be Seeing You



["I'll Be Seeing You", [['Loves Music, Loves to Dance', 0.6195198], ["Daddy's Little Girl", 0.60819477], ['Before I Say Good-Bye', 0.6000526], ['Let Me Call You Sweetheart', 0.5416254], ['You Belong To Me', 0.48801863]]]
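
Since the second value in each pair is a cosine distance, the closest match is the last item of the returned list. A small hypothetical helper, `nearest_first`, can reorder a result for reading without changing `get_recommends` itself (the challenge test depends on the original order):

```python
def nearest_first(recommendation):
    """Return [title, neighbors] with neighbors sorted by ascending distance."""
    title, neighbors = recommendation
    return [title, sorted(neighbors, key=lambda pair: pair[1])]

# Output copied from the run above
result = ["I'll Be Seeing You", [
    ['Loves Music, Loves to Dance', 0.6195198],
    ["Daddy's Little Girl", 0.60819477],
    ['Before I Say Good-Bye', 0.6000526],
    ['Let Me Call You Sweetheart', 0.5416254],
    ['You Belong To Me', 0.48801863],
]]

for name, distance in nearest_first(result)[1]:
    print(f"{distance:.3f}  {name}")
```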
