In [4]:
import pandas as pd

df = pd.read_csv('BX-Books.csv', on_bad_lines="skip", encoding='latin-1', sep=';', low_memory=False)
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'BX-Books.csv'

In [None]:
df.info()

## Removing Duplicates
#### First, let us check whether there are any duplicate book titles. Duplicates are redundant to the algorithm and must be removed.

In [2]:
print(df.duplicated(subset='Book-Title').sum()) # 29,225
# Keep only the first occurrence of each title
df = df.drop_duplicates(subset='Book-Title')

NameError: name 'df' is not defined
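As a small illustration of the duplicate check, here is a sketch on a made-up three-row dataframe (the titles and publishers are hypothetical, not taken from the dataset). It counts repeated titles with `duplicated` and keeps the first occurrence of each with `drop_duplicates`:

```python
import pandas as pd

# Hypothetical toy data: "Dune" appears twice with different publishers
toy_books = pd.DataFrame({
    "Book-Title": ["Dune", "Dune", "Emma"],
    "Publisher": ["Ace", "Chilton", "Penguin"],
})

# Rows whose title already appeared earlier are flagged as duplicates
n_dupes = int(toy_books.duplicated(subset="Book-Title").sum())

# drop_duplicates keeps the first row for each title by default
deduped = toy_books.drop_duplicates(subset="Book-Title")

print(n_dupes)
print(deduped)
```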

## Random Sampling
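#### As mentioned in the previous section, we need to randomly sample 15,000 rows from the dataframe to avoid running into memory errors. The similarity matrix we build later has one row and one column per book, so its size grows with the square of the number of rows: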



In [None]:
sample_size = 15000
df = df.sample(n=sample_size, replace=False, random_state=490)

# Renumber the rows from 0 and discard the old index
df = df.reset_index(drop=True)
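To get a rough feel for why sampling helps, the sketch below estimates the size of a dense square similarity matrix. It assumes 8-byte floating-point values; the helper name `similarity_matrix_gib` and the larger row count are hypothetical and used only for comparison:

```python
def similarity_matrix_gib(n_rows: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 1024**3

# Size for the 15,000-row sample used in this notebook
sample_gib = similarity_matrix_gib(15_000)
print(f"15,000 rows: {sample_gib:.2f} GiB")

# Hypothetical larger input: ten times the rows needs a hundred times the memory
larger_gib = similarity_matrix_gib(150_000)
print(f"150,000 rows: {larger_gib:.2f} GiB")
```

Under these assumptions the 15,000-row sample needs roughly 1.7 GiB, which is why the full dataset is not used directly.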

## Processing Text Data
##### Now, let us print the head of the dataframe again:



In [None]:
df.head()

#### Data Cleaning

Before converting the data into a word vector, we need to clean it. First, let’s remove whitespace from the “Book-Author” column. If we do not do this, CountVectorizer will count each author’s first and last name as separate words.
For instance, if one author is named James Clear and another James Patterson, the vectorizer will count the word James in both cases, and the recommender system might consider their books highly similar even though they are unrelated: James Clear writes self-help books, while James Patterson is known for his mystery novels.
Run the following lines of code to lowercase the author names and join each author’s first and last name into a single token:



In [None]:
def clean_text(author):
    # Lowercase the name and remove spaces so each author becomes one token
    result = str(author).lower()
    return result.replace(' ', '')

df['Book-Author'] = df['Book-Author'].apply(clean_text)

df.head()
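To see the effect of joining the names, here is a minimal sketch that fits CountVectorizer on the two author names from the example above, once with spaces and once without, and compares the vocabularies it learns:

```python
from sklearn.feature_extraction.text import CountVectorizer

raw = ["james clear", "james patterson"]
joined = [name.replace(" ", "") for name in raw]

# vocabulary_ maps each learned token to its column index; we only need the tokens
raw_vocab = sorted(CountVectorizer().fit(raw).vocabulary_)
joined_vocab = sorted(CountVectorizer().fit(joined).vocabulary_)

print(raw_vocab)     # ['clear', 'james', 'patterson']
print(joined_vocab)  # ['jamesclear', 'jamespatterson']
```

With spaces, both authors share the token `james`; once joined, they have no tokens in common.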

In [None]:
df['Book-Title'] = df['Book-Title'].str.lower()
df['Publisher'] = df['Publisher'].str.lower()

In [None]:
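# Drop the columns that are not used as text features
df2 = df.drop(['ISBN','Image-URL-S','Image-URL-M','Image-URL-L','Year-Of-Publication'],axis=1)

# Join every remaining column after the first into one string per book
# (columns[1:] skips the first column, 'Book-Title')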

df2['data'] = df2[df2.columns[1:]].apply(
    lambda x: ' '.join(x.dropna().astype(str)),
    axis=1
)

print(df2['data'].head())

#### Vectorize the Dataframe
###### Finally, we can apply Scikit-Learn’s CountVectorizer() on the combined text data:



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized = vectorizer.fit_transform(df2['data'])
#print(vectorized)
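To make the output of `fit_transform` less abstract, here is a sketch on three made-up strings (the author and publisher tokens are hypothetical). Each row of the resulting matrix is one document and each column counts one token:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical combined strings, in the same spirit as df2['data']
docs = [
    "jamesclear penguin",
    "jamespatterson penguin",
    "jamesclear avery",
]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)  # sparse matrix of token counts

# Recover the token for each column by sorting the vocabulary by column index
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(columns)
print(toy_matrix.toarray())
```

The result is stored as a sparse matrix, so `toarray()` is only practical on small examples like this one.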


### Building the Recommendation System
###### Now, we will use a measure called cosine similarity to find the resemblance between each pair of bag-of-words vectors. Cosine similarity is the cosine of the angle between two vectors, which tells us whether they point in the same direction.
###### For count vectors like ours, which have no negative entries, cosine similarity ranges between 0 and 1. A value of 0 indicates that the two vectors share no words at all, while 1 tells us that they point in exactly the same direction.
##### Run the following lines of code to apply cosine similarity to the vectors we created:



In [None]:
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(vectorized)
# Now, print the "similarities" variable:

print(similarities)
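As a worked example of the metric, the sketch below computes cosine similarity by hand for two small count vectors (made up for illustration) and checks the result against scikit-learn's `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical count vectors that share exactly one token (last position)
a = np.array([0, 1, 0, 1])
b = np.array([0, 0, 1, 1])

# cos(theta) = (a . b) / (|a| * |b|)
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn expects 2-D input: one row per vector
from_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual)        # 0.5
print(from_sklearn)
```

The dot product is 1 and each vector has length √2, so the similarity is 1 / 2 = 0.5.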

In [None]:
df = pd.DataFrame(similarities, columns=df['Book-Title'], index=df['Book-Title']).reset_index()

df.head()


### Displaying User Recommendations
###### Finally, let’s use the dataframe above to display book recommendations. Given a book title as input, the 10 most similar books should be returned.

###### Let us do this using “The General Prologue to the Canterbury Tales” as input:



In [None]:
input_book = "the general prologue to the canterbury tales"
# Take 11 rows: the input book matches itself, so dropping it leaves the top 10
recommendations = pd.DataFrame(df.nlargest(11,input_book)['Book-Title'])
recommendations = recommendations[recommendations['Book-Title']!=input_book]
print(recommendations)
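The lookup can also be wrapped in a small helper. The sketch below is one possible way to do it, not part of the original notebook: the function name `recommend`, its `top_n` parameter, and the toy similarity values are all hypothetical. It assumes a dataframe shaped like the one built above, with a 'Book-Title' column plus one similarity column per title:

```python
import pandas as pd

def recommend(sim_df: pd.DataFrame, title: str, top_n: int = 10) -> list:
    """Return up to top_n titles most similar to `title`."""
    if title not in sim_df.columns:
        return []  # unknown title: nothing to recommend
    # Ask for one extra row because the book matches itself with similarity 1
    ranked = sim_df.nlargest(top_n + 1, title)["Book-Title"]
    return [t for t in ranked if t != title][:top_n]

# Hypothetical 3x3 similarity matrix, laid out like the dataframe above
titles = ["book a", "book b", "book c"]
toy = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]],
    columns=titles,
    index=pd.Index(titles, name="Book-Title"),
).reset_index()

print(recommend(toy, "book a", top_n=2))
```

Returning an empty list for a title that is not in the sample avoids the `KeyError` that `nlargest` would otherwise raise.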