# A Simple Book Recommendation System

This is a simple book recommender that analyzes the summary of a given book and recommends books similar to it. It uses the CMU Book Summaries Dataset. Because that dataset contains fewer than 20,000 books, the quality of the recommendations will be relatively limited.

## 1. Data Cleaning

### 1.1. Importing Data from the TXT file

In [1]:
import numpy as np
import pandas as pd
import nltk
import json
import re
import csv
from tqdm import tqdm
pd.set_option('display.max_colwidth', 300)

data = []

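# Read the tab-separated file; the encoding is set explicitly so the result does not depend on the platform default
with open("/kaggle/input/cmu-book-summary-dataset/booksummaries.txt", 'r', encoding='utf-8') as f: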
    reader = csv.reader(f, dialect='excel-tab')
    for row in tqdm(reader):
        data.append(row)

16559it [00:01, 10585.39it/s]
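
Summaries in this file can contain literal double quotes, which the `excel-tab` dialect treats as field quoting. The following is a minimal sketch, not the notebook's method, of reading the same tab-separated layout with quoting disabled so that every quote character is kept verbatim. The two sample rows are made-up placeholders in the seven-column layout used above, and `read_rows` is a hypothetical helper.

```python
import csv
import io

# Made-up sample rows in the seven-column, tab-separated layout (placeholder values)
sample = (
    "1\tplaceholder\tFirst Book\tA. Author\tplaceholder\t"
    '{"/m/02xlf": "Fiction"}\tA plain summary.\n'
    "2\tplaceholder\tSecond Book\tB. Author\tplaceholder\t"
    '{"/m/06nbt": "Satire"}\t"Quoted" words open this summary.\n'
)

def read_rows(handle):
    """Parse tab-separated rows without treating double quotes as special."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]

rows = read_rows(io.StringIO(sample))
for row in rows:
    # Column 2 is the title and column 6 is the summary
    print(row[2], "|", row[6])
```

With quoting disabled, the second summary keeps its opening `"Quoted"` exactly as written in the file.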


### 1.2. Converting Data into a Dataframe

In [2]:
book_index = []
book_id = []
book_author = []
book_name = []
summary = []
genre = []
# a is a 1-based running index over the rows
for a, i in enumerate(tqdm(data), start=1):
    book_index.append(a)
    book_id.append(i[0])
    book_name.append(i[2])
    book_author.append(i[3])
    genre.append(i[5])
    summary.append(i[6])

df = pd.DataFrame({'Index': book_index, 'ID': book_id, 'BookTitle': book_name, 'Author': book_author,
                       'Genre': genre, 'Summary': summary})
df.head()

100%|██████████| 16559/16559 [00:00<00:00, 439757.12it/s]


Unnamed: 0,Index,ID,BookTitle,Author,Genre,Summary
0,1,620,Animal Farm,George Orwell,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"": ""Satire"", ""/m/0dwly"": ""Children's literature"", ""/m/014dfn"": ""Speculative fiction"", ""/m/02xlf"": ""Fiction""}","Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a p..."
1,2,843,A Clockwork Orange,Anthony Burgess,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""Novella"", ""/m/014dfn"": ""Speculative fiction"", ""/m/0c082"": ""Utopian and dystopian fiction"", ""/m/06nbt"": ""Satire"", ""/m/02xlf"": ""Fiction""}","Alex, a teenager living in near-future England, leads his gang on nightly orgies of opportunistic, random ""ultra-violence."" Alex's friends (""droogs"" in the novel's Anglo-Russian slang, Nadsat) are: Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and..."
2,3,986,The Plague,Albert Camus,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fiction"", ""/m/0pym5"": ""Absurdist fiction"", ""/m/05hgj"": ""Novel""}","The text of The Plague is divided into five parts. In the town of Oran, thousands of rats, initially unnoticed by the populace, begin to die in the streets. A hysteria develops soon afterward, causing the local newspapers to report the incident. Authorities responding to public pressure order t..."
3,4,1756,An Enquiry Concerning Human Understanding,David Hume,,"The argument of the Enquiry proceeds by a series of incremental steps, separated into chapters which logically succeed one another. After expounding his epistemology, Hume explains how to apply his principles to specific topics. In the first section of the Enquiry, Hume provides a rough introdu..."
4,5,2080,A Fire Upon the Deep,Vernor Vinge,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90"": ""Science Fiction"", ""/m/014dfn"": ""Speculative fiction"", ""/m/01hmnh"": ""Fantasy"", ""/m/02xlf"": ""Fiction""}","The novel posits that space around the Milky Way is divided into concentric layers called Zones, each being constrained by different laws of physics and each allowing for different degrees of biological and technological advancement. The innermost, the ""Unthinking Depths"", surrounds the galacti..."


### 1.3. Cleaning up Genres

In [3]:
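# Empty fields were read from the file as empty strings, not NaN, so isna() does not count them
df.isna().sum()

# Drop books that have no genre or no summary
df = df.drop(df[df['Genre'] == ''].index)
df = df.drop(df[df['Summary'] == ''].index)

# Each Genre value is a JSON object mapping an ID to a genre name; keep only the names
genres_cleaned = []
for i in df['Genre']:
    genres_cleaned.append(list(json.loads(i).values()))
df['Genres'] = genres_cleaned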
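
The genre column stores a JSON object that maps an identifier to a genre name, and `json.loads` raises an error on an empty string, which is why the empty rows have to be dropped before parsing. As an alternative sketch (the helper name `parse_genres` is hypothetical), blank values could instead be mapped to an empty list so that those books are kept:

```python
import json

def parse_genres(raw):
    """Return the genre names from a JSON object string, or [] if it is blank."""
    if not raw or not raw.strip():
        return []
    return list(json.loads(raw).values())

print(parse_genres('{"/m/06nbt": "Satire", "/m/02xlf": "Fiction"}'))
print(parse_genres(""))
```

Whether keeping genre-less books is desirable depends on how much the genres are meant to contribute to the similarity score.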



### 1.4. Cleaning up the Summaries

In [4]:
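def clean_summary(text):
    # Remove apostrophes so that "Alex's" becomes "Alexs" rather than "Alex s"
    text = re.sub(r"'", "", text)
    # Replace every character that is not an ASCII letter with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    text = text.lower()
    return text

df['clean_summary'] = df['Summary'].apply(clean_summary)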
df.head(2)

Unnamed: 0,Index,ID,BookTitle,Author,Genre,Summary,Genres,clean_summary
0,1,620,Animal Farm,George Orwell,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"": ""Satire"", ""/m/0dwly"": ""Children's literature"", ""/m/014dfn"": ""Speculative fiction"", ""/m/02xlf"": ""Fiction""}","Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a p...","[Roman à clef, Satire, Children's literature, Speculative fiction, Fiction]",old major the old boar on the manor farm calls the animals on the farm for a meeting where he compares the humans to parasites and teaches the animals a revolutionary song beasts of england when major dies two young pigs snowball and napoleon assume command and turn his dream into a philosophy t...
1,2,843,A Clockwork Orange,Anthony Burgess,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""Novella"", ""/m/014dfn"": ""Speculative fiction"", ""/m/0c082"": ""Utopian and dystopian fiction"", ""/m/06nbt"": ""Satire"", ""/m/02xlf"": ""Fiction""}","Alex, a teenager living in near-future England, leads his gang on nightly orgies of opportunistic, random ""ultra-violence."" Alex's friends (""droogs"" in the novel's Anglo-Russian slang, Nadsat) are: Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and...","[Science Fiction, Novella, Speculative fiction, Utopian and dystopian fiction, Satire, Fiction]",alex a teenager living in near future england leads his gang on nightly orgies of opportunistic random ultra violence alexs friends droogs in the novels anglo russian slang nadsat are dim a slow witted bruiser who is the gangs muscle georgie an ambitious second in command and pete who mostly pla...
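
As a small worked example of the cleaning rules, the function can be run on a single sentence. The block below repeats the definition so that it is self-contained; the sample sentence is a shortened, made-up variant of the summary text shown above.

```python
import re

def clean_summary(text):
    text = re.sub(r"'", "", text)            # drop apostrophes
    text = re.sub(r"[^a-zA-Z]", " ", text)   # non-letters become spaces
    text = " ".join(text.split())            # collapse whitespace
    return text.lower()

sample = 'Alex\'s friends ("droogs") are: Dim, a slow-witted bruiser.'
print(clean_summary(sample))
# alexs friends droogs are dim a slow witted bruiser

# Non-ASCII letters are also replaced, which splits accented names
print(clean_summary("Eugène"))
# eug ne
```

The second call shows a side effect worth knowing about: accented characters are treated as non-letters, so names such as "Eugène" are split into two tokens.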


## 2. Model

**STEPS:**
1. First, I create a combined text field that joins the cleaned book summary, the author's name, and the associated genres.
2. I apply a vectorizer to it to build a term matrix. The code below uses the Tfidf Vectorizer; the earlier Count Vectorizer version is kept in the cell but disabled.
3. I calculate the cosine similarity between every pair of books.
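
Steps 2 and 3 can be illustrated on a toy scale with the standard library alone. This is a sketch for intuition, not the scikit-learn implementation: it uses raw counts, `count_vector` and `cosine_sim` are hypothetical helpers, and the three short texts are made up.

```python
import math
from collections import Counter

def count_vector(text):
    """Map each lowercase whitespace-separated token to its count."""
    return Counter(text.lower().split())

def cosine_sim(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

texts = {
    "book_a": "satire fiction orwell",
    "book_b": "satire fiction burgess",
    "book_c": "hard science space",
}
vectors = {name: count_vector(t) for name, t in texts.items()}

for name in ("book_b", "book_c"):
    print(name, round(cosine_sim(vectors["book_a"], vectors[name]), 3))
# book_b 0.667
# book_c 0.0
```

Two of the three tokens are shared between `book_a` and `book_b`, giving 2 / (√3 · √3) ≈ 0.667, while `book_c` shares none and scores 0.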

NOTE: I initially intended to use the million-book dataset from Goodreads. However, both my PC and Google Colab kept crashing while trying to calculate the cosine similarities, so I settled for a smaller dataset.
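
A dense similarity matrix over N books holds N² values. At 8 bytes per float64 value, N = 1,000,000 would need about 8 × 10¹² bytes (roughly 8 TB), which is consistent with the crashes described in the note. One possible way around this, sketched here under the assumption that recommendations are only needed for one book at a time, is to compute a single row of similarities on demand. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product returned by `linear_kernel` equals the cosine similarity. The toy documents and the `similarity_row` helper are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up toy corpus standing in for the combined text of three books
docs = [
    "a farm where animals rebel against humans",
    "animals on a farm overthrow the farmer",
    "a spaceship crew explores a distant galaxy",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

def similarity_row(matrix, position):
    """Cosine similarities between one row and every row of an L2-normalized matrix."""
    return linear_kernel(matrix[position], matrix).ravel()

scores = similarity_row(tfidf, 0)
print(scores)
```

This keeps memory use proportional to N per query instead of N² overall, at the cost of recomputing a row each time a recommendation is requested.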

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords


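# Join each book's genre list into one space-separated string
df['GenreString'] = df['Genres'].apply(lambda x: ' '.join(x))

# Build a combined text that includes the author's name and associated genres
df["combined_text"] = df["clean_summary"] + " " + df["Author"] + " " + df["GenreString"]

# Earlier CountVectorizer version, kept for reference (disabled):
# stop_words = stopwords.words('english')
# df['text_without_stopwords'] = df['combined_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
# cv = CountVectorizer()
# count_matrix = cv.fit_transform(df['text_without_stopwords'])

# TF-IDF over unigrams and bigrams. min_df=1 (the default) keeps every term, as min_df=0 did,
# and avoids the integer 0 that newer scikit-learn releases may reject.
tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), min_df=1, stop_words='english')

tfidf_matrix = tf.fit_transform(df['combined_text'])

# Dense N x N matrix of pairwise cosine similarities between books
cosine = cosine_similarity(tfidf_matrix)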





I define a simple function that extracts the books that are most similar to the entered book based on their cosine similarities. 

In [6]:
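# Look up a title by its 'Index' value, and an 'Index' value by its title
def get_title_from_index(Index):
    return df[df.Index == Index]["BookTitle"].values[0]
def get_index_from_title(BookTitle):
    return df[df.BookTitle == BookTitle]["Index"].values[0]

def get_recommendations(book):
    book_index = get_index_from_title(book)
    # Caveat: 'Index' is a 1-based label assigned before rows were dropped in 1.3,
    # while rows of cosine are 0-based positions in the filtered df, so the two can disagree
    similar_books = list(enumerate(cosine[book_index]))
    # Sort by similarity, highest first, and skip the top entry
    sortedbooks = sorted(similar_books, key = lambda x:x[1], reverse=True)[1:]
    i = 0
    for book in sortedbooks:
        # df.Author[...] is a Series, so this prints a Series rather than a plain string
        print(get_title_from_index(book[0]) + " by " + df.Author[df["Index"] == book[0]])
        i = i+1
        if i>10:
            break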

In [7]:
print(get_recommendations("The Stand"))

1881    Dune: The Machine Crusade by Kevin J. Anderson
Name: Author, dtype: object
2606    Adrian Mole and the Weapons of Mass Destruction by Sue Townsend
Name: Author, dtype: object
3232    The Secret of the Lost Tunnel by Franklin W. Dixon
Name: Author, dtype: object
2331    High Time to Kill by Raymond Benson
Name: Author, dtype: object
1926    The Food of the Gods and How It Came to Earth by H. G. Wells
Name: Author, dtype: object


IndexError: index 0 is out of bounds for axis 0 with size 0

In [8]:
print(get_recommendations("A Clockwork Orange"))

5660    Fire on the Mountain by Terry Bisson
Name: Author, dtype: object
5869    The Last of the Jedi: Underworld by Judy Blundell
Name: Author, dtype: object
7590    Storming Heaven by Dale Brown
Name: Author, dtype: object
8493    Sir Percy Leads the Band by Baroness Emma Orczy
Name: Author, dtype: object
5516    The Doorbell Rang by Rex Stout
Name: Author, dtype: object
5957    The Masters of Darkness by Joe Dever
Name: Author, dtype: object
6363    Running Out of Time by Margaret Haddix
Name: Author, dtype: object
4977    High Five by Janet Evanovich
Name: Author, dtype: object
1584    Rob Roy by Walter Scott
Name: Author, dtype: object
1506    Billy Bathgate by E. L. Doctorow
Name: Author, dtype: object
8495    A Modern Instance by William Dean Howells
Name: Author, dtype: object
None


In [9]:
print(get_recommendations("Dune"))

8250    The Memory Keeper's Daughter by Kim Edwards
Name: Author, dtype: object
168    The Magician's Nephew by C. S. Lewis
Name: Author, dtype: object
2161    Not This August by Cyril M. Kornbluth
Name: Author, dtype: object
667    Trainspotting by Irvine Welsh
Name: Author, dtype: object
11272    My Booky Wook by Russell Brand
Name: Author, dtype: object
171    Double Star by Robert A. Heinlein
Name: Author, dtype: object
10013    The Summer That Never Was by Peter Robinson
Name: Author, dtype: object
5256    The Sleeper Awakes by H. G. Wells
Name: Author, dtype: object
703    The Blind Assassin by Margaret Atwood
Name: Author, dtype: object


IndexError: index 0 is out of bounds for axis 0 with size 0

In [10]:
print(get_recommendations("Oliver Twist"))

429    Son Excellence Eugène Rougon by Émile Zola
Name: Author, dtype: object
712    The Bridge by Iain Banks
Name: Author, dtype: object
698    From the Earth to the Moon by Jules Verne
Name: Author, dtype: object
1032    The Once and Future King by T. H. White
Name: Author, dtype: object
697    Oscar and Lucinda by Peter Carey
Name: Author, dtype: object
704    Red Harvest by Dashiell Hammett
Name: Author, dtype: object
721    Effi Briest by Theodor Fontane
Name: Author, dtype: object
1071    Billy Budd by Herman Melville
Name: Author, dtype: object
803    Empire of the Sun by J. G. Ballard
Name: Author, dtype: object
1078    The Red and the Black by Stendhal
Name: Author, dtype: object
1041    The Story of the Stone by Barry Hughart
Name: Author, dtype: object
None


In [11]:
print(get_recommendations('White Noise'))

595    Heidi by Johanna Spyri
Name: Author, dtype: object
4434    The Island on Bird Street by Uri Orlev
Name: Author, dtype: object
6113    The Da Vinci Code by Dan Brown
Name: Author, dtype: object
1477    In Dubious Battle by John Steinbeck
Name: Author, dtype: object
3185    Zaynab by Muhammad Husayn Haykal
Name: Author, dtype: object


IndexError: index 0 is out of bounds for axis 0 with size 0
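
The IndexError and the unrelated-looking recommendations above are consistent with a mismatch between labels and positions: the Index column is 1-based and was assigned before rows were dropped in section 1.3, whereas the rows of `cosine` follow the 0-based positions of the filtered dataframe. The trailing "Name: Author, dtype: object" lines appear because `df.Author[...]` yields a Series rather than a string. A minimal sketch of a position-based lookup follows; the function name `recommend`, its parameters, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def recommend(title, df, sim, n=10):
    """Return up to n (title, author) pairs most similar to `title`.

    Assumes row k of `sim` corresponds to the k-th row of `df` by position.
    """
    titles = df["BookTitle"].to_numpy()
    authors = df["Author"].to_numpy()
    positions = np.flatnonzero(titles == title)
    if positions.size == 0:
        return []  # unknown title
    pos = positions[0]
    order = np.argsort(-sim[pos], kind="stable")  # highest similarity first
    order = order[order != pos][:n]               # skip the query book itself
    return [(titles[i], authors[i]) for i in order]

# Toy data: the index labels 0, 2, 3 mimic a dataframe after a row was dropped
toy = pd.DataFrame(
    {"BookTitle": ["A", "B", "C"], "Author": ["Ann", "Bob", "Cy"]},
    index=[0, 2, 3],
)
toy_sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])

for book_title, author in recommend("A", toy, toy_sim, n=2):
    print(book_title + " by " + author)
```

Returning a list instead of printing inside the function also avoids the stray `None` that appears when the result of a printing function is itself passed to `print`.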

## 3. Extensions and Improvements

This is just the first draft of the system. I plan on improving the model, first by trying the Tfidf Vectorizer and then by finding a way to increase the relative importance of the Author and Genres compared to the text of the summary itself. Any suggestions would be greatly welcomed.
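
One simple way to raise the relative importance of the author and genres, sketched here as an assumption rather than a tested improvement, is to repeat those strings in the combined text before vectorizing, which raises their term frequency. The helper name `build_weighted_text` and its default weights are hypothetical.

```python
def build_weighted_text(summary, author, genres, author_weight=3, genre_weight=2):
    """Repeat the author and genre strings so they count more than once in the text."""
    parts = [summary]
    parts += [author] * author_weight
    parts += [" ".join(genres)] * genre_weight
    return " ".join(parts)

text = build_weighted_text(
    "old major the old boar on the manor farm",
    "George Orwell",
    ["Satire", "Fiction"],
)
print(text)
```

With a TF-IDF vectorizer, the effect of the repetition is further scaled by each term's inverse document frequency, so the weights would need tuning against actual recommendations.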