# DS 4300: Book Recommendation Engine
## Sara Adra, Anika Das, Mirah Gordon, Genny Jawor

This notebook is meant to serve as the home of all user facing interactions. 

The code here includes:
* asking a user for four books inputs (which requires a book title and author name)
* finding the records of those books (if they exist) in our mongo books collection
* finding the top 5 reviews of each book in our mongo reviews collections
* running sentiment analysis on reviews to find most common words
* connecting to neo4j graph amd querying to find a book based on the common words

### Connecting to Mongo

In [1]:
# import pymonogo to connect to the database and our collections
import pymongo
from pymongo import MongoClient
import pprint
import operator
# create a mongo client
client = MongoClient()
# connect the client to the local host
client = MongoClient('localhost', 27017)

**Setting up Mongo Client**

In [2]:
# find and use the demo database
db = client.demo
# find and use the books collection
all_books = db.all_books
# find and use the authors collection
authors = db.authors
# find and use the reviews collection
reviews = db.reviews

### Connecting to Neo4j

In [None]:
# import neo4j driver to connect to the database 
from neo4j import GraphDatabase

uri = "neo4j://localhost:7474"
driver = GraphDatabase.driver(uri, auth=("neo4j", "neo4jj"))

### Finding a Book and It's Reviews

We re-uploaded the books json in a new mongo collection that would be unedited to use in looking up book entries from user inputs.
* /Users/mirahgordon/documents/MongoDB/bin/mongodb-tools/bin/mongoimport --db demo --collection all_books --file $HOME/data/all_books/goodreads_books.json

We created a new index for faster querying.
* db.all_books.createIndex( { 'title': -1 } )

**Cleaning User Input**

In [3]:
# imports
import pandas as pd
import numpy as np

In [4]:
"""
capitalizes each word in the given statement if necessary
"""
def capitalizeWords(statement):
    
    # words that should stay lowercase
    # source: https://whenyouwrite.com/what-words-do-you-not-capitalize-in-a-title/
    lowercase_words = ['and', 'as', 'at', 'but', 'by', 'for', 'from', 'if', 'in', 'into', 'like', 'near', 'nor', 'of', 'off', 'on', 'once', 'onto', 'or', 'over', 'past', 'so', 'than', 'that', 'till', 'to', 'up', 'upon', 'with', 'when', 'yet']
    
    # intialize list for fixed words
    fixed_words_list = []
    
    # iterate through each word in the given statement
    for word in statement.split(' '):
        # make the word all lowercase letters
        fixed_word = word.lower()
        # if word is not in the list of words that should stay lowercase, capitalize it
        if (fixed_word not in lowercase_words):
            fixed_word = fixed_word.capitalize()
            
        # append the fixed word to the list of fixed words
        fixed_words_list.append(fixed_word)
       
    # return the inputted statement capitalized as necessary
    # (by joining the items in the fixed_words_list with a space)
    return ' '.join(fixed_words_list)
            
            
capitalizeWords('percY JaCksOn')

'Percy Jackson'

**Asking for User Input**

In [5]:
# get inputs from user (favorite books)
num_of_books = 4
print("Please provide your " + str(num_of_books) + " favorite books below:")

inputted_books = []
for book_i in range(1, num_of_books + 1):
    print("\nBook " + str(book_i) + ":\n")
    book_title = input("     Title: ")
    book_author = input("     Author: ")
    
    # fix the capitalization of user inputs
    book_title_fixed_capitalization = capitalizeWords(book_title)
    book_author_fixed_capitalization = capitalizeWords(book_author)
    
    inputted_books.append((book_title_fixed_capitalization, book_author_fixed_capitalization))

Please provide your 4 favorite books below:

Book 1:

     Title: 1984
     Author: George Orwell

Book 2:

     Title: The Martian
     Author: Andy Weir

Book 3:

     Title: Wool
     Author: Hugh Howey

Book 4:

     Title: Crime and Punishment
     Author: Fyodor Dostoevsky


**Querying Mongo based on Input**

In [6]:
# empty list to hold entries
entries = []

# find a book entry in the books collection
for book in inputted_books:
    # get only the title of the book
    title = book[0]
    # mongo command to find the book entry of a given book using the title
    book_entry = all_books.find( {'title': title, 'language_code': 'eng' }, { 'book_id':1, 'title':1, 'description':1, 'num_pages':1, '_id':0}).limit(1)
    
    for b in book_entry:
        entries.append(b)        

In [7]:
# empty list to hold book ids
book_ids = []

# get each book's id from the list of entries
for book in entries:
    book_id = book['book_id']
    book_ids.append(book_id)

In [8]:
# get reviews for each book

# create dataframe to hold all the top reviews
top_reviews = pd.DataFrame()

# iterate through 
for book in book_ids:

    # mongo command to find the reviews with a specific book id
    book_reviews = reviews.find( { 'book_id' : str(book) }, { 'book_id':1, 'review_text':1, 'n_votes':1, '_id':0 } )

    # collect all reviews as a list
    top5 = []

    for r in book_reviews:
        top5.append(r)
    
    # sort reviews by number of votes and keep the top 5
    sorted_by_votes = sorted(top5, key=lambda d: d['n_votes'])
    sorted_by_votes = sorted_by_votes[0:5]
        
    # append the top 5 reviews to the dataframe 
    top_reviews = top_reviews.append(sorted_by_votes, ignore_index=True, sort=False)

In [9]:
# set the index as the book id
top_reviews.set_index('book_id', inplace=True)
# drop any row with an empty review
top_reviews['review_text'] = top_reviews['review_text'].replace('', np.nan)
top_reviews.dropna(axis=0, subset=['review_text'], inplace=True)

In [10]:
top_reviews.head()

Unnamed: 0_level_0,review_text,n_votes
book_id,Unnamed: 1_level_1,Unnamed: 2_level_1
17678435,One more angle of thinking how world is going ...,1
26853362,A fantastic book. Mark is a great character wh...,0
26853362,4.5 stars. THIS BOOK WAS SO AMAZING I CRIED SO...,0
26853362,"No, I haven't seen the movie (yet). \n In fact...",0
26853362,Wow. Possibly my favourite read of the year.,0


### Analyzing Common Words in Reviews

In [11]:
# imports for stop words and counter
from collections import Counter
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
set_stopwords = set(stopwords.words('english'))
set_stopwords.update(['book', 'books', 'author', 'story', 'read', "i've"])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mirahgordon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
# compile all reviews for books 
reviews_list = top_reviews['review_text'].tolist()

In [13]:
"""
function to clean the given word (by making it all lower case
letters + removing any trailing punctuation if present)
"""
def clean_word(word):
    # make the word all lowercase letters
    cleaned_word = word.lower()
    
    # check if the last character in the word is a letter or not
    # remove last character if not a letter
    # (ex. ',' '.' '-' ' ')
    if (cleaned_word[-1].isalpha() == False):
         cleaned_word = cleaned_word[:-1]
    
    return cleaned_word

In [14]:
most_freq_word_column_labels = []
for idx in range(1, 11):
    most_freq_word_column_labels.append('most_freq_word_' + str(idx))

In [15]:
most_freq_words_all_books = dict()
most_freq_words_all_books_df = pd.DataFrame(columns=most_freq_word_column_labels)

for book_id in top_reviews.index.unique():
    total_count_Counter = Counter()
    for index, review in top_reviews[top_reviews.index == book_id].iterrows():
        words_in_review_list = str(review['review_text']).split(' ')
        cleaned_words_in_review = filter(lambda word: (word not in set_stopwords) and (word != ''), map(clean_word, words_in_review_list))
        cleaned_review_word_count_Counter = Counter(cleaned_words_in_review)
        total_count_Counter = Counter(total_count_Counter) + Counter(cleaned_review_word_count_Counter)

    most_freq_words = total_count_Counter.most_common(10)
    most_freq_words_all_books[book_id] = most_freq_words
    
    most_freq_words_list = [word_to_freq[0] for word_to_freq in most_freq_words]
    while len(most_freq_words_list) < 10: most_freq_words_list.append(np.nan)

    most_freq_words_all_books_df.loc[book_id] = most_freq_words_list

most_freq_words_all_books_df

Unnamed: 0,most_freq_word_1,most_freq_word_2,most_freq_word_3,most_freq_word_4,most_freq_word_5,most_freq_word_6,most_freq_word_7,most_freq_word_8,most_freq_word_9,most_freq_word_10
17678435,going,one,world,person,angle,thinking,expect,future,..,possibly
26853362,character,makes,feel,fact,interesting,would,fantastic,mark,great,can't
17164686,silo,juliette,people,outside,next,point,really,well,done,post-apocalyptic
23398716,novel,yang,great,dostoyevsky,one,think,felt,crime,manusia,literature


In [16]:
set_freq_words = set()
for col in most_freq_word_column_labels:
    set_freq_words = set_freq_words.union(set(most_freq_words_all_books_df[col].unique()))
    
    
all_most_freq_words_df = pd.DataFrame(list(set_freq_words), columns=['most_freq_words'])
all_most_freq_words_df.dropna(inplace=True)

In [17]:
# represent frequent words in a list to give to neo4j
frequent_words = all_most_freq_words_df['most_freq_words'].tolist()
frequent_words

['angle',
 'literature',
 'really',
 'point',
 'fact',
 'yang',
 "can't",
 'crime',
 'interesting',
 'novel',
 'thinking',
 'makes',
 'juliette',
 'would',
 'mark',
 'future',
 'possibly',
 'silo',
 'think',
 'post-apocalyptic',
 'dostoyevsky',
 'well',
 'going',
 'one',
 'fantastic',
 'character',
 'people',
 'feel',
 'world',
 'manusia',
 'person',
 'felt',
 'done',
 'great',
 'next',
 '..',
 'expect',
 'outside']

### Cypher Query and Recommendation

In [None]:
def get_books(tx, words):
    books = []
    result = tx.run("MATCH (book:Title)-[r:Assoc]->(cw:CommonWords) WHERE cw.CommonWords IN $words RETURN book LIMIT 4", words=frequen_words)
    for record in result:
        books.append(record["Title"])
    return books

with driver.session() as session:
    books = session.read_transaction(get_books, frequent_words)

driver.close()

In [21]:
# present recommended books to the user
print("\033[1mBased on your favorite books:\n \033[0m")

for fav_book in inputted_books:
    print(fav_book[0] + " by: " + fav_book[1])
    
print("\033[1m\nOur engine recommends the following books:\n  \033[0m")

for book in books:
    print(book[0] + " by: " + book[1])

[1mBased on your favorite books:
 [0m
1984 by: George Orwell
The Martian by: Andy Weir
Wool by: Hugh Howey
Crime and Punishment by: Fyodor Dostoevsky
[1m
Our engine recommends the following books:
  [0m
Charlotte by: David Foenkinos
M Is For Malice by: Sue Grafton
A Working Man (Men of Manhattan, #4) by: Sandrine Gasq-Dion
Holy Hustler by: P.L. Wilson
