# Meet Cute: A Romance Book Recommender
Author: Mackenzie Ross

In this notebook, I clean the text in my book dataset, perform feature extraction, and build the recommendation engine.

## Load Dataset

In [1]:
import os
os.getcwd()

'/Users/mackenzieross/Documents/17th Grade/Fall 2022/Intro to NLP/Meet Cute/meet_cute/jupyter_notebook'

In [2]:
os.chdir('/Users/mackenzieross/Documents/17th Grade/Fall 2022/Intro to NLP/Meet Cute/meet_cute')
os.getcwd()

'/Users/mackenzieross/Documents/17th Grade/Fall 2022/Intro to NLP/Meet Cute/meet_cute/Data'

In [3]:
import pandas as pd

book_df = pd.read_csv('books.csv')
book_df.info()

FileNotFoundError: [Errno 2] No such file or directory: 'books.csv'

In [None]:
book_df.head()

## Preprocess Text Data
- Import necessary libraries
- Import English stopwords
- Create function to clean text

In [None]:
import nltk
import re
import numpy as np

In [None]:
en_stopwords = nltk.corpus.stopwords.words('english')

In [None]:
# function modified from Sarkar
def normalize_document(doc):
    # remove special characters, then lower-case and trim whitespace
    # (flags must be passed by keyword; the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [t for t in tokens if t not in en_stopwords]
    # recreate document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

In [None]:
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(book_df['synopsis'])
len(norm_corpus)

In [None]:
# function to remove parentheticals from book titles
def normalize_titles(doc):
    doc = re.sub(r'[\(\[].*?[\)\]]', '', doc)
    # strip() returns a new string, so the result has to be reassigned
    doc = doc.strip()
    return doc
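
A quick sanity check of the title cleanup, as a minimal sketch. The helper name `strip_parentheticals` and the sample title are made up for illustration; the regex is the same one used in `normalize_titles`, and the sketch assumes the stripped string is what gets returned.

```python
import re

def strip_parentheticals(title):
    # hypothetical stand-in for normalize_titles: drop (...) and [...] groups
    cleaned = re.sub(r'[\(\[].*?[\)\]]', '', title)
    # trim the space left behind where the parenthetical used to be
    return cleaned.strip()

# made-up title in the "Title (Series, #1)" style
print(strip_parentheticals('A Made-Up Title (Series, #1)'))
```

Without the final `strip()`, the cleaned title would keep a trailing space.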

In [None]:
# apply the title cleanup to every row at once
book_df['title'] = book_df['title'].apply(normalize_titles)

In [None]:
book_df.head()

## Feature Extraction

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

## Main Functionality
We will use cosine similarity between the TF-IDF vectors to measure how similar two book synopses are.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

doc_similarity = cosine_similarity(tfidf_matrix)
doc_similarity_df = pd.DataFrame(doc_similarity)
doc_similarity_df.head()
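
To make the similarity matrix easier to read, here is a small sketch on three made-up synopses (not from my dataset). It uses the default `TfidfVectorizer` settings rather than the `ngram_range` and `min_df` values above, since `min_df=2` would drop most terms from a corpus this small. All variable names are for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# three invented synopses: two romance-like, one unrelated
toy_synopses = [
    'enemies become lovers at a rival bakery',
    'rival bakery owners become reluctant lovers',
    'a detective hunts a killer in the city',
]

# each row of the matrix is one synopsis, each column one term
toy_matrix = TfidfVectorizer().fit_transform(toy_synopses)

# entry [i, j] is the cosine similarity between synopsis i and synopsis j
toy_sims = cosine_similarity(toy_matrix)
print(toy_sims.round(2))
```

Each synopsis has similarity 1 with itself, which is why the recommender skips the first index after sorting. The two bakery synopses share several terms and score well above zero, while the detective synopsis shares none with the first and scores 0 against it.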

In [None]:
book_list = book_df['title'].values
author_list = book_df['author'].values

# adapted from Week 8 Coding Practice, this will return the top 3 similar books for a given book
def book_recommender(book_title, books=book_list, authors=author_list, doc_sims=doc_similarity_df): 
    book_index = np.where(books == book_title)[0][0]
    book_similarities = doc_sims.iloc[book_index].values 
    similar_book_indices = np.argsort(-book_similarities)[1:4]
    similar_books = books[similar_book_indices]
    similar_authors = authors[similar_book_indices]
    
    books_with_authors = []
    for i in range(len(similar_books)):
        b_title = similar_books[i]
        b_author = similar_authors[i]
        # strip the title so the separator is always exactly ' by '
        books_with_authors.append(f'{b_title.strip()} by {b_author}')
    return books_with_authors

### Randomly Select 5 Books and Generate Recommendations 

In [None]:
import random
# random.sample draws without replacement, so the 5 books are distinct entries
random_books = random.sample(list(book_list), k=5)
print(random_books)

In [None]:
for book in random_books:
    for b in range(len(book_list)):
        if book_list[b] == book:
            print('Book: ' + book.strip() + ' by ' + author_list[b])
            print('Top 3 Recommended Books:', book_recommender(book_title=book))
            print()

## Personal Contribution Statement
Working on this section of the project gave me several ideas for improving the recommendation engine. The first is to add columns for 2-3 subgenres per book, which would add another dimension for determining the similarity between books. The second is to filter out books by the same author when giving a recommendation, because reading another book by the same author seems like a given. By eliminating books by the same author, I could give the user more varied recommendations. The biggest element of the project I still need to nail down is how the user will enter the book they want recommendations for. Ideally, the user would be able to enter the name of any romance book. However, as the recommendation engine is set up now, it can only provide recommendations for books that exist in the dataset. I am brainstorming a way to search Goodreads for information about the user's selected book and compare it to the books in the dataset.
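
As a rough sketch of the same-author idea, the ranking step could skip any candidate whose author matches the query book's author. This is not part of the current engine: the function name `recommend_other_authors`, its parameters, and the toy arrays below are all hypothetical, and the similarity values are invented.

```python
import numpy as np

def recommend_other_authors(book_title, books, authors, sims, top_n=3):
    """Return up to top_n similar titles written by a different author."""
    book_index = np.where(books == book_title)[0][0]
    # indices ordered from most to least similar
    ranked = np.argsort(-sims[book_index])
    # drop the query book itself and anything by the same author
    keep = [i for i in ranked
            if i != book_index and authors[i] != authors[book_index]]
    return [f'{books[i]} by {authors[i]}' for i in keep[:top_n]]

# made-up data: Book A and Book B share an author
toy_books = np.array(['Book A', 'Book B', 'Book C', 'Book D'])
toy_authors = np.array(['Author X', 'Author X', 'Author Y', 'Author Z'])
toy_sims = np.array([
    [1.0, 0.9, 0.4, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.4, 0.3, 1.0, 0.5],
    [0.2, 0.1, 0.5, 1.0],
])

print(recommend_other_authors('Book A', toy_books, toy_authors, toy_sims))
# ['Book C by Author Y', 'Book D by Author Z']
```

Book B is the closest match to Book A in this toy matrix, but it is skipped because it has the same author, so the list can come back shorter than `top_n` when few other authors are available.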