**Introduction:**
This notebook implements a book recommendation system that combines TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with cosine similarity. Given a book title, the system recommends other books whose plot summaries are most similar to that book's summary.

**1. Import necessary libraries**

In [1]:
import numpy as np
import pandas as pd
import spacy
import re
import csv
from tqdm import tqdm

pd.set_option("display.max_colwidth", 300) # Set display options for Pandas DataFrame
# Import modules for text processing and similarity calculation
from sklearn.feature_extraction.text import TfidfVectorizer  # For TF-IDF vectorization
from sklearn.metrics.pairwise import cosine_similarity   # For calculating cosine similarity

# Import matplotlib.pyplot for plotting graphs
import matplotlib.pyplot as plt

# Configure matplotlib to display the output inline
%matplotlib inline

**2. Data Acquisition**: For our book recommendation system, we used the CMU Book Summary Dataset from Kaggle (https://www.kaggle.com/datasets/ymaricar/cmu-book-summary-dataset). It contains plot summaries for 16,559 books extracted from Wikipedia, along with their metadata.


In [2]:
# Read the rows of the tab-separated summaries file
data = []
with open("booksummaries.txt", "r", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, dialect="excel-tab")
    for row in tqdm(reader):
        data.append(row)

16559it [00:00, 29658.46it/s]
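The same file could also be loaded in one step with pandas. The sketch below is only an illustration: the column names are hypothetical labels (the raw file has no header row), the names given to the two columns the notebook does not use are assumptions, and the two sample rows are made up so the block runs without the dataset.

```python
import io
import pandas as pd

# Hypothetical column labels; columns 0, 2, 3, 5 and 6 match the fields
# this notebook uses, the other two names are assumptions.
COLUMNS = ["WikipediaID", "FreebaseID", "BookTitle", "Author",
           "PublicationDate", "Genre", "Summary"]

def load_summaries(source):
    """Read a tab-separated summaries file (path or file object) into a DataFrame."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,            # the file has no header row
        names=COLUMNS,
        dtype=str,              # keep IDs and dates as text
        keep_default_na=False,  # leave missing fields as empty strings
    )

# Tiny in-memory stand-in for booksummaries.txt (made-up rows)
sample = io.StringIO(
    "1\t/m/aaa\tFirst Book\tA. Author\t1999\t\tA short plot.\n"
    "2\t/m/bbb\tSecond Book\tB. Author\t\t\tAnother plot.\n"
)
toy_df = load_summaries(sample)
print(toy_df[["BookTitle", "Author", "Summary"]])
```

Passing `"booksummaries.txt"` instead of the in-memory sample should give a comparable table, though quoting edge cases may be handled slightly differently from the `csv.reader` approach.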


In [3]:
# Extract relevant information from the data and create a DataFrame
book_index = []
book_id = []
book_author = []
book_name = []
summary = []
genre = []
for position, row in enumerate(tqdm(data), start=1):
    book_index.append(position)  # 1-based running index
    book_id.append(row[0])
    book_name.append(row[2])
    book_author.append(row[3])
    genre.append(row[5])
    summary.append(row[6])

book_df = pd.DataFrame(
    {
        "Index": book_index,
        "ID": book_id,
        "BookTitle": book_name,
        "Author": book_author,
        "Genre": genre,
        "Summary": summary,
    }
)
book_df.head()

100%|█████████████████████████████████| 16559/16559 [00:00<00:00, 500453.81it/s]


Unnamed: 0,Index,ID,BookTitle,Author,Genre,Summary
0,1,620,Animal Farm,George Orwell,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"": ""Satire"", ""/m/0dwly"": ""Children's literature"", ""/m/014dfn"": ""Speculative fiction"", ""/m/02xlf"": ""Fiction""}","Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a p..."
1,2,843,A Clockwork Orange,Anthony Burgess,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""Novella"", ""/m/014dfn"": ""Speculative fiction"", ""/m/0c082"": ""Utopian and dystopian fiction"", ""/m/06nbt"": ""Satire"", ""/m/02xlf"": ""Fiction""}","Alex, a teenager living in near-future England, leads his gang on nightly orgies of opportunistic, random ""ultra-violence."" Alex's friends (""droogs"" in the novel's Anglo-Russian slang, Nadsat) are: Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and..."
2,3,986,The Plague,Albert Camus,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fiction"", ""/m/0pym5"": ""Absurdist fiction"", ""/m/05hgj"": ""Novel""}","The text of The Plague is divided into five parts. In the town of Oran, thousands of rats, initially unnoticed by the populace, begin to die in the streets. A hysteria develops soon afterward, causing the local newspapers to report the incident. Authorities responding to public pressure order t..."
3,4,1756,An Enquiry Concerning Human Understanding,David Hume,,"The argument of the Enquiry proceeds by a series of incremental steps, separated into chapters which logically succeed one another. After expounding his epistemology, Hume explains how to apply his principles to specific topics. In the first section of the Enquiry, Hume provides a rough introdu..."
4,5,2080,A Fire Upon the Deep,Vernor Vinge,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90"": ""Science Fiction"", ""/m/014dfn"": ""Speculative fiction"", ""/m/01hmnh"": ""Fantasy"", ""/m/02xlf"": ""Fiction""}","The novel posits that space around the Milky Way is divided into concentric layers called Zones, each being constrained by different laws of physics and each allowing for different degrees of biological and technological advancement. The innermost, the ""Unthinking Depths"", surrounds the galacti..."
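As the preview shows, the Genre column holds a JSON object that maps identifiers to genre names, and it is empty for some books. A minimal sketch of turning that string into a plain list of names follows; the helper name `parse_genres` is hypothetical and is not used elsewhere in this notebook.

```python
import json

def parse_genres(raw):
    """Turn the raw Genre string into a sorted list of genre names."""
    if not raw:  # empty string or None: no genre recorded
        return []
    return sorted(json.loads(raw).values())

example = '{"/m/06nbt": "Satire", "/m/02xlf": "Fiction"}'
print(parse_genres(example))  # ['Fiction', 'Satire']
print(parse_genres(""))       # []
```

It could be applied with `book_df["Genre"].apply(parse_genres)` if genre information were needed later, for example to sanity-check recommendations.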


**3. Text Preprocessing**: Plot summaries are tokenized and lemmatized using the spaCy library.

In [4]:
# Load the small English spaCy model
nlp = spacy.load("en_core_web_sm")

In [5]:
# Clean the summary text using spaCy and create a new column for cleaned summaries
def clean_summary(text):
    doc = nlp(text)  # Process the input text using spaCy
    # Keep the lemma of each purely alphabetic token (drops punctuation and numbers)
    cleaned_tokens = [token.lemma_ for token in doc if token.is_alpha]
    cleaned_text = " ".join(cleaned_tokens)  # Join the list of cleaned tokens into a single string
    return cleaned_text

book_df["clean_summary"] = book_df["Summary"].apply(clean_summary)

# Save the DataFrame with cleaned summaries to a CSV file
book_df.to_csv("book_clean.csv")

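**4. Feature Extraction**: TF-IDF vectorization represents each summary as a numerical feature vector. Note that `min_df=0.2` discards every term that appears in fewer than 20% of the summaries, so the resulting vocabulary is small (59 terms in the run shown here).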

In [6]:
# Instantiate TF-IDF vectorizer object with specified parameters
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8,  # Ignore terms that appear in more than 80% of documents
    max_features=200000,  # Limit the number of features to 200,000
    min_df=0.2,  # Ignore terms that appear in fewer than 20% of documents
    stop_words="english",  # Exclude common English stop words
    use_idf=True,  # Use inverse document frequency weighting
    ngram_range=(1, 3),  # Include unigrams, bigrams, and trigrams
)
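For reference, and assuming scikit-learn's defaults for the settings not overridden above (smoothed IDF and L2 row normalization), the weight of term $t$ in summary $d$ can be written as follows. The symbols are introduced here for illustration: $n$ is the number of summaries, $\mathrm{df}(t)$ the number of summaries containing $t$, and $\mathrm{tf}(t, d)$ the count of $t$ in $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$

Each row of the resulting matrix is then divided by its Euclidean norm, so every non-empty summary vector has length 1.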

In [7]:
# Fit and transform the TF-IDF vectorizer with the cleaned summaries
tfidf_matrix = tfidf_vectorizer.fit_transform(book_df["clean_summary"])

# Print the shape of the TF-IDF matrix
print(tfidf_matrix.shape)

(16559, 59)
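The second dimension of the matrix is the vocabulary size, which depends strongly on `min_df`. The following sketch illustrates the effect on a made-up five-document corpus; the documents, the helper name `vocabulary_size`, and the thresholds are all assumptions chosen for the example, not values from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus of five "summaries"
docs = [
    "dragon attacks the castle",
    "knight defends the castle",
    "dragon sleeps in a cave",
    "detective solves a murder",
    "detective chases a thief",
]

def vocabulary_size(min_df):
    """Fit a unigram TF-IDF model and return the number of kept terms."""
    vectorizer = TfidfVectorizer(min_df=min_df, stop_words="english")
    return vectorizer.fit_transform(docs).shape[1]

loose = vocabulary_size(1)     # keep every term that appears at least once
strict = vocabulary_size(0.4)  # keep terms found in at least 40% of documents
print(loose, strict)
```

With the strict threshold only the few terms shared by several documents survive, which is the same mechanism that leaves 59 features in the full dataset. A lower `min_df` would likely give the recommender more to work with, at the cost of a larger matrix.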


**5. Similarity Calculation**: Cosine similarity is computed between the TF-IDF vectors to measure the similarity between book summaries.

In [8]:
# Compute the pairwise cosine distance matrix (1 - cosine similarity)
similarity_distance = 1 - cosine_similarity(tfidf_matrix)
print(similarity_distance)

[[-2.22044605e-16  5.06531107e-01  5.94108812e-01 ...  8.47722571e-01
   1.00000000e+00  7.15262047e-01]
 [ 5.06531107e-01 -2.22044605e-16  4.67487753e-01 ...  8.37637308e-01
   1.00000000e+00  6.46287613e-01]
 [ 5.94108812e-01  4.67487753e-01  0.00000000e+00 ...  7.98069840e-01
   1.00000000e+00  5.64933919e-01]
 ...
 [ 8.47722571e-01  8.37637308e-01  7.98069840e-01 ...  0.00000000e+00
   1.00000000e+00  6.90172240e-01]
 [ 1.00000000e+00  1.00000000e+00  1.00000000e+00 ...  1.00000000e+00
   1.00000000e+00  1.00000000e+00]
 [ 7.15262047e-01  6.46287613e-01  5.64933919e-01 ...  6.90172240e-01
   1.00000000e+00 -2.22044605e-16]]
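Two details of the output above are worth illustrating: the tiny negative values on the diagonal come from floating-point rounding, and a row consisting entirely of 1.0 corresponds to a summary whose TF-IDF vector is all zeros (it contains none of the vocabulary terms). A minimal sketch with made-up vectors, where clipping is one possible way to tidy the rounding noise:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature vectors; the last row is all zeros, as happens when a
# summary contains none of the vocabulary terms.
vectors = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 0.0],
])

# Cosine distance is 1 minus cosine similarity; clipping removes tiny
# negative values caused by floating-point rounding.
distance = np.clip(1 - cosine_similarity(vectors), 0.0, 2.0)
print(distance.round(3))
```

The zero vector ends up at distance 1 from every row, including itself, so such a book can never receive meaningful recommendations from this representation.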


**6. Defining recommendation function**

In [9]:
# Define a function to get the nearest books based on cosine distance
def get_nearest(title, top_n=3):
    matches = np.where(book_df["BookTitle"].to_numpy() == title)[0]
    if len(matches) == 0:
        print(f"Book {title} not found. Try again :)")
        return
    idx = matches[0]  # Use the first book with this exact title
    b = similarity_distance[idx, :]  # Select row of pairwise distances
    order = b.argsort()  # Indices sorted from nearest to farthest
    result_indices = order[order != idx][:top_n]  # Drop the input book itself, keep top n
    print("You might want to read:")
    for book_title in book_df.iloc[result_indices]["BookTitle"].values:
        print(f"- {book_title}")
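To see the lookup logic in isolation, the sketch below runs the same idea end to end on a made-up four-book catalogue. Every name and summary here is hypothetical, and the helper returns a list instead of printing so that its result is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up miniature catalogue standing in for book_df
toy_books = pd.DataFrame({
    "BookTitle": ["Dragon Keep", "Castle Siege", "Night Detective", "Thief Hunt"],
    "clean_summary": [
        "dragon guard castle treasure",
        "army attack castle dragon",
        "detective hunt thief city",
        "detective chase thief night",
    ],
})

toy_matrix = TfidfVectorizer().fit_transform(toy_books["clean_summary"])
toy_distance = 1 - cosine_similarity(toy_matrix)

def nearest_titles(title, top_n=1):
    """Return the top_n titles closest to `title`, or [] if it is unknown."""
    matches = np.flatnonzero(toy_books["BookTitle"].to_numpy() == title)
    if matches.size == 0:
        return []
    idx = matches[0]
    order = toy_distance[idx].argsort()
    order = order[order != idx]  # never recommend the query book itself
    return toy_books["BookTitle"].iloc[order[:top_n]].tolist()

print(nearest_titles("Dragon Keep"))    # ['Castle Siege']
print(nearest_titles("Missing Title"))  # []
```

Filtering out the query index explicitly, rather than skipping the first sorted position, keeps the result correct even when several books are tied at distance zero.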

**7. Recommendation Generation**: Nearest neighbors are identified based on similarity scores, and relevant book recommendations are provided to the user.

In [None]:
# Prompt the user to enter the title of a book and get nearest recommendations
# (submit an empty title to stop the loop)
while True:
    user_book = input("Enter the title of the book: ").strip()
    if not user_book:
        break
    get_nearest(user_book, top_n=3)

Enter the title of the book:  Dune


You might want to read:
- Charon's Claw
- Area 7
- Star Wars: Darth Bane: Rule of Two


Enter the title of the book:  Harry Potter and the Prisoner of Azkaban


You might want to read:
- Death of a Ghost
- The Yellow "M"
- Dombey and Son


Enter the title of the book:  The Ugly Duckling


You might want to read:
- White Oleander
- Time for Bed
- A White Heron


Enter the title of the book:  Oblomov


You might want to read:
- The Hours
- The Cancer Ward
- Anglo-Saxon Attitudes


Enter the title of the book:  The Autobiography of Alice B. Toklas


You might want to read:
- The Song of the Lark
- Return to the Clans
- Heart of Glass


Enter the title of the book:  Into the Wild


You might want to read:
- The Chequer Board
- Things as They Are or The Adventures of Caleb Williams
- A Romance of the Halifax Disaster
