# Meet Cute: A Romance Book Recommender
Author: Mackenzie Ross

In this notebook, I clean the synopsis text in my book dataset, perform feature extraction, and build the recommendation engine.

## Load Dataset

In [1]:
import os
os.getcwd()

'/Users/mackenzieross/Documents/17th Grade/Fall 2022/Intro to NLP/Meet Cute/meet_cute'

In [2]:
os.chdir('/Users/mackenzieross/Documents/17th Grade/Fall 2022/Intro to NLP/Meet Cute/meet_cute/Data')
os.getcwd()

'/Users/mackenzieross/Documents/17th Grade/Fall 2022/Intro to NLP/Meet Cute/meet_cute/Data'

In [3]:
import pandas as pd

book_df = pd.read_csv('books.csv')
book_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1242 entries, 0 to 1241
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         1242 non-null   int64  
 1   title              1242 non-null   object 
 2   author             1242 non-null   object 
 3   release year       1242 non-null   int64  
 4   synopsis           1242 non-null   object 
 5   book length        1242 non-null   int64  
 6   rating             1242 non-null   float64
 7   number of ratings  1242 non-null   int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 77.8+ KB


In [4]:
book_df.head()

|   | Unnamed: 0 | title | author | release year | synopsis | book length | rating | number of ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Pride and Prejudice (Paperback) | Jane Austen | 1813 | Alternate cover edition of ISBN 9780679783268S... | 279 | 4.28 | 3732237 |
| 1 | 1 | The Fault in Our Stars (Hardcover) | John Green | 2012 | Despite the tumor-shrinking medical miracle th... | 313 | 4.16 | 4501032 |
| 2 | 2 | Red, White & Royal Blue (Paperback) | Casey McQuiston | 2019 | Original cover edition of ASIN B07J4LPZRN here... | 448 | 4.16 | 607767 |
| 3 | 3 | Twilight (The Twilight Saga, #1) | Stephenie Meyer | 2005 | About three things I was absolutely positive.F... | 498 | 3.63 | 5901197 |
| 4 | 4 | The Hating Game (Paperback) | Sally Thorne | 2016 | Nemesis (n.) 1) An opponent or rival whom a pe... | 365 | 3.98 | 537504 |


## Preprocess Text Data
- Import necessary libraries
- Import English stopwords
- Create function to clean text

In [5]:
import nltk
import re
import numpy as np

In [6]:
en_stopwords = nltk.corpus.stopwords.words('english')

In [7]:
# function modified from Sarkar
def normalize_document(doc):
    # remove special characters (flags must be passed by keyword;
    # the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # lower case and strip surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [t for t in tokens if t not in en_stopwords]
    # recreate document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
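
To make the cleaning steps concrete, here is a minimal stdlib-only sketch of the same pipeline on a single sentence. It is only an illustration: `TOY_STOPWORDS` and `normalize_toy` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and `str.split` stands in for `nltk.word_tokenize`.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption: tiny on purpose).
TOY_STOPWORDS = {"the", "a", "it", "is"}

def normalize_toy(doc):
    # Drop everything except letters, digits, and whitespace.
    # Flags are passed by keyword; the fourth positional argument of re.sub is `count`.
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower().strip()
    # Whitespace split stands in for nltk.word_tokenize (assumption).
    tokens = doc.split()
    # Filter stopwords and rebuild the document.
    return ' '.join(t for t in tokens if t not in TOY_STOPWORDS)

print(normalize_toy("The Hating Game: a Rom-Com, isn't it?"))
```

Because punctuation is removed before tokenizing, hyphenated words and contractions are merged into single tokens (for example, "Rom-Com" becomes "romcom").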

In [8]:
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(book_df['synopsis'])
len(norm_corpus)

1242

In [9]:
# function to remove parentheticals from book titles
def normalize_titles(doc):
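    # raw string so that \( and \[ are passed to the regex engine unchanged
    doc = re.sub(r'[\(\[].*?[\)\]]', '', doc)
    # the space left before a removed parenthetical is kept,
    # so cleaned titles end with a trailing space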
    return doc
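
As a quick check of the regular expression, the sketch below applies the same pattern to titles from the dataset. `strip_parenthetical` is a hypothetical helper name used only here. Note that removing the parenthetical leaves a trailing space, and `str.strip()` only removes it if its return value is used.

```python
import re

def strip_parenthetical(title):
    # Remove any (...) or [...] group, matching non-greedily.
    return re.sub(r'[\(\[].*?[\)\]]', '', title)

raw = "Twilight (The Twilight Saga, #1)"
print(repr(strip_parenthetical(raw)))          # trailing space is still there
print(repr(strip_parenthetical(raw).strip()))  # strip() returns a new, trimmed string
```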

In [10]:
# apply the title cleaner to the whole column at once
book_df['title'] = book_df['title'].apply(normalize_titles)

In [11]:
book_df.head()

|   | Unnamed: 0 | title | author | release year | synopsis | book length | rating | number of ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Pride and Prejudice | Jane Austen | 1813 | Alternate cover edition of ISBN 9780679783268S... | 279 | 4.28 | 3732237 |
| 1 | 1 | The Fault in Our Stars | John Green | 2012 | Despite the tumor-shrinking medical miracle th... | 313 | 4.16 | 4501032 |
| 2 | 2 | Red, White & Royal Blue | Casey McQuiston | 2019 | Original cover edition of ASIN B07J4LPZRN here... | 448 | 4.16 | 607767 |
| 3 | 3 | Twilight | Stephenie Meyer | 2005 | About three things I was absolutely positive.F... | 498 | 3.63 | 5901197 |
| 4 | 4 | The Hating Game | Sally Thorne | 2016 | Nemesis (n.) 1) An opponent or rival whom a pe... | 365 | 3.98 | 537504 |


## Feature Extraction

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

(1242, 6914)
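
For intuition about what the matrix holds, here is a small pure-Python sketch of the weighting that, as far as I understand the scikit-learn defaults, `TfidfVectorizer` applies: raw term counts times a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy corpus and the helper name `tfidf_rows` are made up for illustration, and the sketch covers unigrams only, without `min_df`.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so every document vector has length 1.
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

vocab, rows = tfidf_rows(["duke scandal marriage", "vampire duke", "fake marriage office"])
print(vocab)
print([round(w, 3) for w in rows[1]])
```

In the second toy document, "vampire" gets a larger weight than "duke" because it appears in fewer documents.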

## Main Functionality
We will use cosine similarity between the TF-IDF vectors to measure how similar two book synopses are.

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

doc_similarity = cosine_similarity(tfidf_matrix)
doc_similarity_df = pd.DataFrame(doc_similarity)
doc_similarity_df.head()

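|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1232 | 1233 | 1234 | 1235 | 1236 | 1237 | 1238 | 1239 | 1240 | 1241 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.053536 | 0.0 | 0.0 | 0.018914 | 0.005503 | 0.004386 | 0.015429 | 0.014146 | ... | 0.037887 | 0.0 | 0.0 | 0.009019 | 0.054977 | 0.029931 | 0.0 | 0.011928 | 0.045286 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.017747 | 0.0 | 0.0 | 0.0 | 0.049478 | 0.0 | 0.0 | ... | 0.0 | 0.023378 | 0.0 | 0.0 | 0.0 | 0.050286 | 0.008859 | 0.0 | 0.0 | 0.0 |
| 2 | 0.053536 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.037943 | 0.0 | 0.0 | 0.020088 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.049408 | 0.0 | 0.0 | 0.020578 | 0.051659 | 0.0 |
| 3 | 0.0 | 0.017747 | 0.0 | 1.0 | 0.0 | 0.013005 | 0.0 | 0.0 | 0.0 | 0.015674 | ... | 0.0 | 0.013241 | 0.0 | 0.014726 | 0.03382 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.009644 | 0.0532 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.022652 | 0.024015 | 0.0 | 0.0 | 0.0 | 0.041415 |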
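
Cosine similarity itself is the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up vectors (the function name `cosine` is hypothetical and is not the scikit-learn implementation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat an all-zero vector as having no similarity to anything.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # no shared terms with a
print(cosine(a, b), cosine(a, c))
```

This is why every book scores 1.0 against itself on the diagonal, and why two synopses that share no terms score 0.0.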


In [28]:
book_list = book_df['title'].values
author_list = book_df['author'].values

# adapted from Week 8 Coding Practice, this will return the top 3 similar books for a given book
def book_recommender(book_title, books=book_list, authors=author_list, doc_sims=doc_similarity_df): 
    book_index = np.where(books == book_title)[0][0]
    book_similarities = doc_sims.iloc[book_index].values 
    similar_book_indices = np.argsort(-book_similarities)[1:4]
    similar_books = books[similar_book_indices]
    similar_authors = authors[similar_book_indices]
    
    books_with_authors = []
    for i in range(len(similar_books)):
        b_title = similar_books[i]
        b_author = similar_authors[i]
        # strip the title so the spacing does not depend on a trailing space
        books_with_authors.append(b_title.strip() + ' by ' + b_author)
    return books_with_authors
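
One possible refinement, which is not built into the function above, is to skip candidates by the same author as the query book. Here is a minimal sketch with plain Python lists and a made-up 4x4 similarity matrix; `recommend_other_authors` and all of the toy data are hypothetical.

```python
def recommend_other_authors(title, titles, authors, sims, k=3):
    idx = titles.index(title)
    # Rank every other book by similarity to the query, highest first.
    ranked = sorted((j for j in range(len(titles)) if j != idx),
                    key=lambda j: sims[idx][j], reverse=True)
    # Keep only books whose author differs from the query book's author.
    picks = [j for j in ranked if authors[j] != authors[idx]][:k]
    return [f"{titles[j]} by {authors[j]}" for j in picks]

titles = ["Book A", "Book B", "Book C", "Book D"]
authors = ["Author X", "Author X", "Author Y", "Author Z"]
sims = [
    [1.0, 0.9, 0.2, 0.4],
    [0.9, 1.0, 0.1, 0.3],
    [0.2, 0.1, 1.0, 0.5],
    [0.4, 0.3, 0.5, 1.0],
]
print(recommend_other_authors("Book A", titles, authors, sims, k=2))
```

Excluding the query index explicitly, rather than slicing off the first sorted result, also avoids relying on the query book always ranking first.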

### Randomly Select 5 Books and Generate Recommendations 

In [18]:
import random
# random.sample draws without replacement, so the same entry is never picked twice
random_books = random.sample(list(book_list), k=5)
print(random_books)

['The Fine Print ', 'When He Was Wicked ', 'Definitely Dead ', 'To Have and to Hoax ', 'The Marriage Game ']


In [29]:
for book in random_books:
    for b in range(len(book_list)):
        if book_list[b] == book:
            print('Book: ' + book.strip() + ' by ' + author_list[b])
            print('Top 3 Recommended Books:', book_recommender(book_title=book))
            print()

Book: The Fine Print by Lauren Asher
Top 3 Recommended Books: ['Idol by Kristen Callihan', 'This Heart of Mine by Lisa Kleypas', 'The Kiss Thief by L.J. Shen']

Book: When He Was Wicked by Julia Quinn
Top 3 Recommended Books: ['Rule by Jay Crownover', "The Lover's Dictionary by David Levithan", 'Entwined with You by Sylvia Day']

Book: Definitely Dead by Charlaine Harris
Top 3 Recommended Books: ['All Together Dead by Charlaine Harris', 'Dead Until Dark by Kate Stayman-London', 'From Dead to Worse by Charlaine Harris']

Book: To Have and to Hoax by Martha Waters
Top 3 Recommended Books: ['Him by Sarina Bowen', "Lady Isabella's Scandalous Marriage by Jennifer Ashley", 'Ten Tiny Breaths by K.A. Tucker']

Book: The Marriage Game by Sara Desai
Top 3 Recommended Books: ['Layla by Colleen Hoover', 'White Hot Kiss by Jennifer L. Armentrout', 'Wallbanger by Alice Clayton']



## Personal Contribution Statement
Working on this section of the project gave me several ideas for improving the recommendation engine. The first is to add columns for two or three subgenres per book, which would add another dimension to the similarity calculation. The second is to filter out books by the same author as the selected book, because reading another book by the same author seems like a given. By eliminating books by the same author, I could offer the user a wider range of recommendations. The biggest element of the project I still need to nail down is how the user will enter the book they want recommendations for. Ideally, the user would be able to enter the name of any romance book. However, as the recommendation engine is set up now, it can only provide recommendations for books that exist in the dataset. I am brainstorming a way to search Goodreads for information about the user's selected book and compare it to the books in the dataset.
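
As a small first step toward friendlier input, a lookup could match titles case-insensitively and suggest near matches instead of failing when a title is not in the dataset. The sketch below uses `difflib.get_close_matches` from the standard library; `find_title` and the sample titles are hypothetical, and it does not attempt the Goodreads search.

```python
import difflib

def find_title(query, titles, n=3, cutoff=0.6):
    # Map cleaned, lowercase titles back to the titles as stored.
    lookup = {t.strip().lower(): t for t in titles}
    key = query.strip().lower()
    if key in lookup:
        return lookup[key], []
    # No exact match: offer the closest stored titles as suggestions.
    suggestions = difflib.get_close_matches(key, list(lookup), n=n, cutoff=cutoff)
    return None, [lookup[s] for s in suggestions]

titles = ["The Hating Game ", "Twilight ", "Pride and Prejudice "]
print(find_title("the hating game", titles))
print(find_title("Pride and Prejudise", titles))
```

The function returns the stored title when there is an exact match, so the result can be passed directly to a recommender that looks titles up by their stored form.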