<a href="https://colab.research.google.com/github/jessie1111101/mais202/blob/master/Book_Recommendation_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Problem Statement
I will make a book recommendation generator by using bag of words model and kaggle good books 10k data.

2. the dataset I will work with is kaggle goodbooks-10k. It contains information on 10,000 books. I preprocess the data by importing the csv file with pandas and accessing columns of information that I need. For example, for each book I accessed title, author, book id, and book tags to generate recommendations.

3. Machine learning model
I will use bag of words model. Specifically, I use TfidVectorizer from sklearn.feature_extraction.text and linear_kernel from sklearn.metrics.pairwise to vectorize my data and generate cosine similarity (to calculate numeric similarity) respectively. 

In [1]:
import numpy as np
import pandas as pd
import sklearn.feature_extraction.text
import sklearn.metrics.pairwise

TfidVectorizer = sklearn.feature_extraction.text.TfidfVectorizer
linear_kernel = sklearn.metrics.pairwise.linear_kernel

book_csv_url = 'https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv'
books = pd.read_csv(book_csv_url)

ratings_url = 'https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv'
ratings = pd.read_csv(ratings_url)

book_tags_url = 'https://github.com/zygmuntz/goodbooks-10k/blob/master/book_tags.csv?raw=true'
book_tags = pd.read_csv(book_tags_url)

tags_url = 'https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/tags.csv'
tags = pd.read_csv(tags_url)

tags_join_DF = pd.merge(book_tags, tags, left_on='tag_id', right_on='tag_id', how='inner')

to_read_url='https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/to_read.csv'
to_read = pd.read_csv(to_read_url)

'''
books.head()
ratings.head()
book_tags.head()
tags.tail()
tags_join_DF.head()
to_read.head()
'''


'\nbooks.head()\nratings.head()\nbook_tags.head()\ntags.tail()\ntags_join_DF.head()\nto_read.head()\n'

TfidVectorizer transforms text to feature vectors that is used as input to estimate

cosine simlarity calculates numeric value denoting similarity between two books

In [2]:
tf = TfidVectorizer(analyzer = 'word', ngram_range=(1,2), min_df=0, stop_words = 'english')
tfidf_matrix = tf.fit_transform(books['authors'])
cosine_sim=linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim

#build 1d dimensional array w book titles
titles = books['title']
indices = pd.Series(books.index, index=books['title'])

#function that gets book recs based on cosine similarity score of book authors
def authors_recommendations(title):
  idx = indices[title]
  sim_scores = list(enumerate(cosine_sim[idx]))
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
  sim_scores = sim_scores[1:21]
  book_indices = [i[0] for i in sim_scores]
  return titles.iloc[book_indices]

authors_recommendations('The Hobbit').head(20)

18      The Fellowship of the Ring (The Lord of the Ri...
154            The Two Towers (The Lord of the Rings, #2)
160     The Return of the King (The Lord of the Rings,...
188     The Lord of the Rings (The Lord of the Rings, ...
963     J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
4975         Unfinished Tales of Númenor and Middle-Earth
2308                                The Children of Húrin
610              The Silmarillion (Middle-Earth Universe)
8271                   The Complete Guide to Middle-Earth
1128     The History of the Hobbit, Part One: Mr. Baggins
465                             The Hobbit: Graphic Novel
0                 The Hunger Games (The Hunger Games, #1)
1       Harry Potter and the Sorcerer's Stone (Harry P...
2                                 Twilight (Twilight, #1)
3                                   To Kill a Mockingbird
4                                        The Great Gatsby
5                                  The Fault in Our Stars
7             

recommend books using tags provided to books

In [3]:
books_with_tags = pd.merge(books, tags_join_DF, left_on='book_id', right_on='goodreads_book_id', how='inner')
# books_with_tags[(books_with_tags.goodreads_book_id==18710190)]['tag_name']

tf1 = TfidVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
tfidf_matrix1 = tf1.fit_transform(books_with_tags['tag_name'].head(10000))
cosine_sim1 = linear_kernel(tfidf_matrix1, tfidf_matrix1)

cosine_sim1

#build 1d array w book titles
titles1 = books['title']
indices1 = pd.Series(books.index, index=books['title'])

# Function that get book recommendations based on the cosine similarity score of books tags
def tags_recommendations(title):
    idx = indices1[title]
    sim_scores = list(enumerate(cosine_sim1[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    book_indices = [i[0] for i in sim_scores]
    return titles.iloc[book_indices]

tags_recommendations('The Hobbit').head(20)

106                                    A Walk to Remember
206                One for the Money (Stephanie Plum, #1)
306     The Wise Man's Fear (The Kingkiller Chronicle,...
404                                Breakfast of Champions
506     The Hunger Games Trilogy Boxset (The Hunger Ga...
606     City of Heavenly Fire (The Mortal Instruments,...
2805                              The Rules of Attraction
54                                        Brave New World
136                             Outlander (Outlander, #1)
255     Three Cups of Tea: One Man's Mission to Promot...
354                       Graceling (Graceling Realm, #1)
449                   Storm Front (The Dresden Files, #1)
542                  Last Sacrifice (Vampire Academy, #6)
647               Inheritance (The Inheritance Cycle, #4)
571                        Oryx and Crake (MaddAddam, #1)
680      Little House in the Big Woods (Little House, #1)
99                                   The Poisonwood Bible
168           

Recommendation of books using the authors and tags attributes for better results. Creating corpus of features and calculating the TF-IDF on the corpus of attributes for gettings better recommendations.

In [4]:
temp_df = books_with_tags.groupby('book_id')['tag_name'].apply(' '.join).reset_index()
temp_df.head()

books = pd.merge(books, temp_df, left_on='book_id', right_on='book_id', how='inner')
books.head()

books['corpus'] = (pd.Series(books[['authors', 'tag_name']].fillna('').values.tolist()).str.join(' '))
tf_corpus = TfidVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
tfidf_matrix_corpus = tf_corpus.fit_transform(books['corpus'])
cosine_sim_corpus = linear_kernel(tfidf_matrix_corpus, tfidf_matrix_corpus)

# Build a 1-dimensional array with book titles
titles = books['title']
indices = pd.Series(books.index, index=books['title'])

# Function that get book recommendations based on the cosine similarity score of books tags
def corpus_recommendations(title):
    idx = indices1[title]
    sim_scores = list(enumerate(cosine_sim_corpus[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    book_indices = [i[0] for i in sim_scores]
    return titles.iloc[book_indices]

#corpus_recommendations("The Hobbit")
#corpus_recommendations("Twilight (Twilight, #1)")
#corpus_recommendations("Romeo and Juliet")
#corpus_recommendations("The Perks of Being a Wallflower")
#corpus_recommendations("The Glass Castle")
corpus_recommendations("Gone with the Wind")



155             Faithful Place (Dublin Murder Squad, #3)
108                               The Accidental Tourist
150    Dealing with Dragons (Enchanted Forest Chronic...
802                             The Last Days of Dogtown
173                  Sea Swept (Chesapeake Bay Saga, #1)
60                        Where She Went (If I Stay, #2)
221                          Sarum: The Novel of England
195                  A Suitable Boy (A Suitable Boy, #1)
144                                        Perfect Match
206                              Saga, Vol. 3 (Saga, #3)
142            A Bear Called Paddington (Paddington, #1)
62     The Restaurant at the End of the Universe (Hit...
102    90 Minutes in Heaven: A True Story of Death an...
105                                   The Birth of Venus
74                       Around the World in Eighty Days
576    The Declaration of Independence and The Consti...
114    The Road Less Traveled: A New Psychology of Lo...
573                            

4. Preliminary results
The model works by returning 20 books as recommendation for each book input. 

5. Next steps
The pros are that I can recommend books given a book title, but I am not sure how to test the accuracy of my predictions. I can also try to incorporate sentiment analysis, and create a function to recommend authors.