# A Simple Book Recommendation System

This is a simple book recommender that analyzes the summary of a given book and recommends books similar to it. It uses the CMU Book Summaries Dataset; because that dataset contains fewer than 20,000 books, the recommendations it can produce are relatively limited.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/cmu-book-summary-dataset/booksummaries.txt


## 1. Data Cleaning

### 1.1. Importing Data from the TXT file

In [2]:
import json
import re
import csv
from tqdm import tqdm
pd.set_option('display.max_colwidth', 300)

data = []

with open("/kaggle/input/cmu-book-summary-dataset/booksummaries.txt", 'r') as f:
    reader = csv.reader(f, dialect='excel-tab')
    for row in tqdm(reader):
        data.append(row)

16559it [00:01, 10454.90it/s]


### 1.2. Converting Data into a Dataframe

In [3]:
book_index = []
book_id = []
book_author = []
book_name = []
summary = []
genre = []
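a = 1  # 1-based running index; it stops matching row positions once rows are dropped
for i in tqdm(data):
    book_index.append(a)
    a = a+1
    book_id.append(i[0])      # Wikipedia article ID
    book_name.append(i[2])    # book title
    book_author.append(i[3])  # author
    genre.append(i[5])        # genres as a JSON string (Freebase ID -> genre name)
    summary.append(i[6])      # plot summary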

df = pd.DataFrame({'Index': book_index, 'ID': book_id, 'BookTitle': book_name, 'Author': book_author,
                       'Genre': genre, 'Summary': summary})
df.head()

100%|██████████| 16559/16559 [00:00<00:00, 436317.65it/s]


|  | Index | ID | BookTitle | Author | Genre | Summary |
|---|---|---|---|---|---|---|
| 0 | 1 | 620 | Animal Farm | George Orwell | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt": "Satire", "/m/0dwly": "Children's literature", "/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"} | Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a p... |
| 1 | 2 | 843 | A Clockwork Orange | Anthony Burgess | {"/m/06n90": "Science Fiction", "/m/0l67h": "Novella", "/m/014dfn": "Speculative fiction", "/m/0c082": "Utopian and dystopian fiction", "/m/06nbt": "Satire", "/m/02xlf": "Fiction"} | Alex, a teenager living in near-future England, leads his gang on nightly orgies of opportunistic, random "ultra-violence." Alex's friends ("droogs" in the novel's Anglo-Russian slang, Nadsat) are: Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and... |
| 2 | 3 | 986 | The Plague | Albert Camus | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fiction", "/m/0pym5": "Absurdist fiction", "/m/05hgj": "Novel"} | The text of The Plague is divided into five parts. In the town of Oran, thousands of rats, initially unnoticed by the populace, begin to die in the streets. A hysteria develops soon afterward, causing the local newspapers to report the incident. Authorities responding to public pressure order t... |
| 3 | 4 | 1756 | An Enquiry Concerning Human Understanding | David Hume |  | The argument of the Enquiry proceeds by a series of incremental steps, separated into chapters which logically succeed one another. After expounding his epistemology, Hume explains how to apply his principles to specific topics. In the first section of the Enquiry, Hume provides a rough introdu... |
| 4 | 5 | 2080 | A Fire Upon the Deep | Vernor Vinge | {"/m/03lrw": "Hard science fiction", "/m/06n90": "Science Fiction", "/m/014dfn": "Speculative fiction", "/m/01hmnh": "Fantasy", "/m/02xlf": "Fiction"} | The novel posits that space around the Milky Way is divided into concentric layers called Zones, each being constrained by different laws of physics and each allowing for different degrees of biological and technological advancement. The innermost, the "Unthinking Depths", surrounds the galacti... |


### 1.3. Cleaning up Genres

In [4]:
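# csv.reader yields strings, so missing values are empty strings rather than NaN.
# Drop every row whose Genre or Summary field is empty.
df = df.drop(df[df['Genre'] == ''].index)
df = df.drop(df[df['Summary'] == ''].index)

# Each Genre cell is a JSON object mapping Freebase IDs to genre names;
# keep only the names.
genres_cleaned = []
for i in df['Genre']:
    genres_cleaned.append(list(json.loads(i).values()))
df['Genres'] = genres_cleaned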
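
As a small illustration of the genre parsing step, the sketch below applies the same `json.loads(...).values()` idea to a single cell. The helper name `parse_genres` is hypothetical and not part of the notebook; it also returns an empty list for an empty cell, which is an assumption about how one might keep such rows instead of dropping them.

```python
import json

def parse_genres(raw):
    """Return the genre names from one raw Genre cell (hypothetical helper)."""
    if not raw:
        # Empty cell: no genre information available
        return []
    # The cell is a JSON object of the form {"<Freebase ID>": "<genre name>", ...}
    return list(json.loads(raw).values())

example = '{"/m/06nbt": "Satire", "/m/02xlf": "Fiction"}'
print(parse_genres(example))
print(parse_genres(""))
```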



### 1.4. Cleaning up the Summaries

In [5]:
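def clean_summary(text):
    text = re.sub("'", "", text)            # drop apostrophes, so "Alex's" ends up as "alexs"
    text = re.sub("[^a-zA-Z]", " ", text)   # replace every non-letter character with a space
    text = ' '.join(text.split())           # collapse runs of whitespace
    text = text.lower()
    return text

df['clean_summary'] = df['Summary'].apply(clean_summary)
df.head(2)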

|  | Index | ID | BookTitle | Author | Genre | Summary | Genres | clean_summary |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 620 | Animal Farm | George Orwell | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt": "Satire", "/m/0dwly": "Children's literature", "/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"} | Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a p... | [Roman à clef, Satire, Children's literature, Speculative fiction, Fiction] | old major the old boar on the manor farm calls the animals on the farm for a meeting where he compares the humans to parasites and teaches the animals a revolutionary song beasts of england when major dies two young pigs snowball and napoleon assume command and turn his dream into a philosophy t... |
| 1 | 2 | 843 | A Clockwork Orange | Anthony Burgess | {"/m/06n90": "Science Fiction", "/m/0l67h": "Novella", "/m/014dfn": "Speculative fiction", "/m/0c082": "Utopian and dystopian fiction", "/m/06nbt": "Satire", "/m/02xlf": "Fiction"} | Alex, a teenager living in near-future England, leads his gang on nightly orgies of opportunistic, random "ultra-violence." Alex's friends ("droogs" in the novel's Anglo-Russian slang, Nadsat) are: Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and... | [Science Fiction, Novella, Speculative fiction, Utopian and dystopian fiction, Satire, Fiction] | alex a teenager living in near future england leads his gang on nightly orgies of opportunistic random ultra violence alexs friends droogs in the novels anglo russian slang nadsat are dim a slow witted bruiser who is the gangs muscle georgie an ambitious second in command and pete who mostly pla... |


## 2. Model

**STEPS:**
1. I build a combined text field from the cleaned book summary, the author's name, and the associated genres.
2. I apply `CountVectorizer` to it to create a count matrix.
3. I calculate the cosine similarity between every pair of books from that count matrix.

NOTE: I initially intended to use the million-book dataset from Goodreads. However, both my PC and Google Colab kept crashing while trying to calculate the cosine similarities, so I settled for a smaller dataset.
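
A full pairwise similarity matrix grows quadratically with the number of books, which is a plausible reason for the crashes on the larger dataset. One possible workaround, sketched here on a made-up four-book corpus, is to keep the count matrix sparse and compute similarities only for the row being queried. The function `similar_rows` and the sample texts are illustrative assumptions, not part of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-ins for the combined text of four books
texts = [
    "pigs farm revolution satire",
    "farm animals pigs rebellion",
    "space empire desert planet",
    "desert planet spice empire war",
]

matrix = CountVectorizer().fit_transform(texts)  # stays sparse

def similar_rows(matrix, row, top_n=3):
    """Return (row position, score) pairs for the rows most similar to `row`."""
    # Compare one row against all rows: a 1 x n result instead of n x n
    scores = cosine_similarity(matrix[[row]], matrix).ravel()
    scores[row] = -1.0  # exclude the query row itself
    order = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]

print(similar_rows(matrix, 0, top_n=2))
```

Each query then needs memory proportional to the number of books rather than to its square, at the cost of recomputing similarities per request.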

In [6]:
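from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Flatten each genre list into a space-separated string
df['GenreString'] = df['Genres'].apply(lambda x: ' '.join(x))

# One text field per book: cleaned summary + author + genres
df["combined_text"] = df["clean_summary"] + " " + df["Author"] + " " + df["GenreString"]
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_text"])  # sparse (n_books x vocabulary) count matrix
cosine = cosine_similarity(count_matrix)              # dense (n_books x n_books) matrix, rows in df row order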
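
To make the count-matrix and cosine-similarity steps concrete, here is a minimal pure-Python sketch of the same computation on two short strings. It is a simplification: tokens are split on whitespace, whereas `CountVectorizer` uses its own tokenizer, and the names `count_vector` and `cosine_sim` are hypothetical.

```python
import math
from collections import Counter

def count_vector(text):
    """Map each whitespace-separated token to its number of occurrences."""
    return Counter(text.lower().split())

def cosine_sim(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

book_a = count_vector("satire fiction dystopia")
book_b = count_vector("satire fiction novel")
book_c = count_vector("space opera")

print(cosine_sim(book_a, book_b))  # two of three tokens shared
print(cosine_sim(book_a, book_c))  # no tokens shared
```

The score depends only on shared tokens and vector lengths, so two texts with no words in common always score 0 regardless of how long they are.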

I define a simple function that extracts the books that are most similar to the entered book based on their cosine similarities. 

In [7]:
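# NOTE: `cosine` is addressed by 0-based row position, whereas the `Index` column is
# 1-based and has gaps after the rows dropped in section 1.3. The helpers below treat
# the two as interchangeable, so a lookup can land on the wrong book or find no row
# at all, which raises the IndexError seen in the outputs that follow.
def get_title_from_index(Index):
    return df[df.Index == Index]["BookTitle"].values[0]
def get_index_from_title(BookTitle):
    return df[df.BookTitle == BookTitle]["Index"].values[0]

def get_recommendations(book):
    book_index = get_index_from_title(book)
    similar_books = list(enumerate(cosine[book_index]))
    # Sort by similarity, highest first, and skip the top entry (expected to be the book itself)
    sortedbooks = sorted(similar_books, key = lambda x:x[1], reverse=True)[1:]
    i = 0
    for book in sortedbooks:
        print(get_title_from_index(book[0]))
        i = i+1
        if i>10:  # stops after 11 titles
            break
    # Titles are printed, not returned, so print(get_recommendations(...)) also prints None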

In [8]:
print(get_recommendations("The Stand"))

The Haunted Mask
At Bertram's Hotel


IndexError: index 0 is out of bounds for axis 0 with size 0

In [9]:
print(get_recommendations("A Clockwork Orange"))

The Silencers
Inside Mr. Enderby
Danny Dunn Scientific Detective
At Bertram's Hotel
The Nutmeg of Consolation
A Separate Peace
Roll of Thunder, Hear My Cry
Mixed Blessings
The Infiltrators
Out of Order
Serpent's Reach
None


In [10]:
print(get_recommendations("Dune"))

Dawn


IndexError: index 0 is out of bounds for axis 0 with size 0

In [15]:
print(get_recommendations("Oliver Twist"))

Son Excellence Eugène Rougon
The Silencers
Inside Mr. Enderby
Danny Dunn Scientific Detective
At Bertram's Hotel
Dark Universe
Crash


IndexError: index 0 is out of bounds for axis 0 with size 0
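
The `IndexError` and the seemingly unrelated titles above look consistent with an indexing mismatch: `cosine` is addressed by row position, while the `Index` column is 1-based and has gaps wherever rows were dropped during cleaning. A minimal sketch of a position-based lookup follows, using a toy frame and a toy similarity matrix with invented values. The function `recommend_by_position`, `toy_df`, and `toy_cosine` are hypothetical names, and this is one possible fix rather than a tested replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `df` and `cosine`; Index 2 is missing, as after a dropped row.
toy_df = pd.DataFrame({"Index": [1, 3, 4],
                       "BookTitle": ["Book A", "Book B", "Book C"]})
toy_cosine = np.array([[1.0, 0.2, 0.9],
                       [0.2, 1.0, 0.4],
                       [0.9, 0.4, 1.0]])  # invented similarity values

def recommend_by_position(title, frame, similarity, top_n=10):
    """Return up to top_n titles most similar to `title`, using row positions only."""
    titles = frame["BookTitle"].tolist()  # list order matches the matrix rows
    if title not in titles:
        raise KeyError(f"{title!r} is not in the dataset")
    pos = titles.index(title)
    # Positions sorted by similarity, highest first, without the query book
    order = [i for i in np.argsort(similarity[pos])[::-1] if i != pos]
    return [titles[i] for i in order[:top_n]]

print(recommend_by_position("Book A", toy_df, toy_cosine, top_n=2))
```

Because the function returns a list instead of printing inside a loop, wrapping the call in `print` no longer produces a trailing `None`.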

## 3. Extensions and Improvements

This is only a first draft of the system. I plan to improve the model, first by trying `TfidfVectorizer` and then by finding a way to increase the relative importance of Author and Genres compared with the text of the summary itself. Suggestions are very welcome.
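
One way this could be approached, sketched under assumptions: vectorize summary, author, and genres separately with `TfidfVectorizer`, scale the author and genre blocks by a weight, and stack the blocks side by side before computing cosine similarity. The function `weighted_features`, the weights, and the three toy books are all illustrative and have not been evaluated on the real dataset.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (invented summaries; authors and genres chosen for illustration)
summaries = ["a boy joins a gang in a bleak future city",
             "a farm is taken over by its animals",
             "a poet struggles with his landlady and his muse"]
authors = ["Anthony Burgess", "George Orwell", "Anthony Burgess"]
genres = ["Satire Fiction", "Satire Fiction", "Comedy Fiction"]

def weighted_features(summaries, authors, genres, w_author=2.0, w_genre=1.5):
    """Stack TF-IDF blocks, scaling the author and genre blocks by their weights."""
    s = TfidfVectorizer().fit_transform(summaries)
    a = TfidfVectorizer().fit_transform(authors)
    g = TfidfVectorizer().fit_transform(genres)
    # Each block has unit-length rows by default, so the weights set
    # how much each block contributes to the final cosine similarity.
    return hstack([s, w_author * a, w_genre * g]).tocsr()

author_heavy = cosine_similarity(weighted_features(summaries, authors, genres,
                                                   w_author=3.0, w_genre=1.0))
no_author = cosine_similarity(weighted_features(summaries, authors, genres,
                                                w_author=0.0, w_genre=1.0))

# Book 0 shares an author with book 2 and both genres with book 1
print(author_heavy[0, 2] > author_heavy[0, 1])
print(no_author[0, 1] > no_author[0, 2])
```

In this toy setup the weights decide whether the shared author or the shared genres dominates the ranking; suitable values for the real data would have to be found by experiment.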