# Recommender System
I will now build a basic recommender system for the books that have been scraped.  The idea is that you give it a single book and it returns books you are likely to also enjoy, based on their similarity to the book you provided.

## Type
While there are many types of recommender systems, the two most common are *collaborative filters* and *content filters*.

At a high level, collaborative filtering works at the user level.  It takes individual statistics like ratings, which items were viewed, etc., and draws similarities between users based on these values.  If one user has interacted with content that a similar user has not, that content becomes a potential suggestion for the second user.
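
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering.  The users, titles, and ratings are made up for illustration, and `cosine` and `suggest_for` are hypothetical helper names, not part of this notebook's pipeline.

```python
from math import sqrt

# Hypothetical ratings: user -> {book: rating}. A missing key means "not rated".
ratings = {
    "ann": {"Dune": 5, "Emma": 4},
    "bob": {"Dune": 5, "Emma": 5, "Ulysses": 4},
    "cat": {"Ulysses": 5, "Walden": 4},
}

def cosine(a, b):
    # Cosine similarity between two sparse rating vectors stored as dicts.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def suggest_for(user):
    # Find the most similar other user, then offer what they rated that `user` has not.
    others = [u for u in ratings if u != user]
    neighbour = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    return neighbour, sorted(ratings[neighbour].keys() - ratings[user].keys())

neighbour, suggestions = suggest_for("ann")
print(neighbour, suggestions)
```

Here "ann" and "bob" rate the same two books similarly, so the one extra book "bob" has rated becomes the suggestion for "ann".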

On the other hand, content filters ignore the user and focus on the similarities between the actual content of the data, such as weighted ratings, similarity of authors, frequency of topics appearing in the description, and so on.  This method requires a direct 'similarity score' between items in order to compute how related they are.

I'm going to go with the **content filtering** method because the data that I scraped best fits this - it has book content, not user interaction data.

In [None]:
pip install rake-nltk

In [1]:
import pandas as pd
import numpy as np

# NLP stuff.
import string
from rake_nltk import Rake
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


In [None]:
book_data = pd.read_csv('./scraper/output/pages-1-100.tsv', sep='\t')

## Remove duplicates
I have read on the user forum about duplicates and eyeballed a few myself.  I will remove them by common title.  The disadvantage is that some removed entries may contain information that is missing from the first occurrence (which is the one kept by default).
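
As a rough illustration of that trade-off, the sketch below uses a made-up three-row frame (the column values are hypothetical).  It compares keeping the first row per title with a possible alternative, `groupby(...).first()`, which takes the first non-null value in each column per title.  I have not applied the alternative to the scraped data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the same title scraped twice, each copy missing a different field.
toy = pd.DataFrame({
    "title": ["Dune", "Dune", "Emma"],
    "language": [np.nan, "English", "English"],
    "series": ["Dune Chronicles", np.nan, np.nan],
})

# What this notebook does: keep the first row per title, gaps and all.
kept_first = toy.drop_duplicates(subset="title").reset_index(drop=True)

# Possible alternative: per title, take the first non-null value in each column.
merged = toy.groupby("title", sort=False, as_index=False).first()

print(kept_first)
print(merged)
```

In `kept_first` the language of the first title stays missing, while `merged` fills it in from the second copy.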

In [None]:
book_data.drop_duplicates(subset='title', inplace=True)

# Resetting the index is VERY important!
# We rely on positional row numbers later, and dropping rows leaves gaps in the index.
# drop=True discards the old index instead of keeping it as an extra 'index' column.
book_data = book_data.reset_index(drop=True)

## Weighted rating & top books
We cannot compare raw rating scores directly because the number of ratings behind each score varies enormously.  A single user rating a book 5/5 should not outrank 50,000 people rating a book 4.5 on average.  We need some kind of algorithm to weight the rating values.

[IMDB's FAQ](https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV?ref_=helpms_helpart_inline#calculatetop) describes the algorithm that they use to weight the rank of movies and TV shows for the top rated lists.  It reads:

$\text{Weighted Rating (WR)} = (\frac{v}{v+m} \cdot R) + (\frac{m}{v+m} \cdot C)$

where

* $R$ is the average rating for the movie (mean).
* $v$ is the number of votes for the movie.
* $m$ is the minimum number of votes to be listed (25,000 in their case)
* $C$ is the mean vote across the whole report.

We already have access to $R$ and $v$ in the columns directly.  $C$ is something we can compute from the data.  $m$ is something we can configure and tweak.  I'll begin with the 10th percentile, essentially chopping off the bottom part of the data.
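
To see what the formula does, here is a small worked sketch using the two cases mentioned earlier.  The values of `m` and `C` are hypothetical, chosen only to show the effect, and this standalone function takes plain numbers rather than a DataFrame row.

```python
def weighted_rating(R, v, m, C):
    # Blend the book's own mean R with the global mean C, weighted by vote count v.
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical threshold and global mean, for illustration only.
m, C = 1000, 4.0

one_vote = weighted_rating(R=5.0, v=1, m=m, C=C)
many_votes = weighted_rating(R=4.5, v=50_000, m=m, C=C)
print(round(one_vote, 3), round(many_votes, 3))
```

With these numbers the single 5/5 vote is pulled almost all the way down to the global mean (about 4.0), while the heavily rated book keeps a score close to its own average (about 4.49).  A book with very few votes mostly inherits $C$, and a book with many votes mostly keeps $R$.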

In [None]:
C = book_data['avg_rating'].mean()
C

In [None]:
m = book_data['num_ratings'].quantile(0.1)
m

In [None]:
def weighted_rating(book, m, C):
    # Average rating for the book.
    R = book['avg_rating']
    # Total number of votes for the book.
    v = book['num_ratings']
    # IMDB formula.
    return (v / (v+m) * R) + (m / (m+v) * C)

# Calculate the weighted rating for books that are within our threshold.
book_data.loc[book_data.num_ratings > m, 'weighted_rating'] = book_data.loc[book_data.num_ratings > m].apply(lambda x: weighted_rating(x, m, C), axis=1)

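# Fill the NaN values (i.e., books at or below our threshold) with a zero score.
# Assign the result back: fillna(inplace=True) on a selected column may not update book_data in newer pandas.
book_data['weighted_rating'] = book_data['weighted_rating'].fillna(0)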

Using this method, let's eyeball the top and bottom 5 entries (sorted by `weighted_rating`).  These books are 'similar' only in that they sit next to each other when ordered by weighted rating: books with close scores were rated similarly.  However, this is too simple and doesn't consider what the actual books are about, who wrote them, and so on.

In [None]:
book_data.sort_values('weighted_rating', ascending=False).head(5)

In [None]:
book_data.sort_values('weighted_rating', ascending=False).tail(5)

In [None]:
# A little cleanup.
del C
del m

## Content-Based Recommender System
Now let's get to building the recommender.  It will be based on the content, so we will be creating an amalgam of features per book that will be used to calculate the similarity score between books.

Values I'm thinking of using include the title, series that it belongs to (if any), language, author(s), genres, and of course we can identify keywords from the book's description.

Instead of treating each entry equally, we can add weight to them by mentioning the words multiple times in the vector that we will use to calculate similarity.
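
The effect of repeating words can be sketched with plain word counts.  This is a hand-rolled stand-in for what a count-based vectorizer produces, using hypothetical tokens; `soup` and `cosine` here are illustrative helpers, not the functions used in this notebook.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

def soup(author, genre, author_importance=1):
    # Repeating a token raises its count, and so its weight in the dot product.
    return Counter([author] * author_importance + [genre])

# Two hypothetical books by the same author in different genres.
sims = {}
for weight in (1, 3):
    a = soup("douglasadams", "sciencefiction", weight)
    b = soup("douglasadams", "mystery", weight)
    sims[weight] = cosine(a, b)

print(sims)
```

Mentioning the author once gives a similarity of 0.5 for this pair, and mentioning the author three times raises it to 0.9, so the shared author dominates the differing genres.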

Problems with the approach I have taken below include:

* Genres and languages can overlap (for example, "English" can appear as both a language and a genre), which inflates the importance of that word.
* The text processing is fairly simple and has not been tested much yet.
* All authors are included blindly.  They could be filtered based on their (Role).
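
As a sketch of the last point, authors could be filtered by role before they go into the soup.  This assumes each entry looks like `Name` or `Name (Role)` separated by commas, which I have not verified against every scraped row.  The names below are placeholders, and `primary_authors` is a hypothetical helper.

```python
import re

# Assumed format: comma-separated names, each optionally followed by "(Role)".
AUTHOR_RE = re.compile(r"^\s*(?P<name>[^()]+?)\s*(?:\((?P<role>[^()]*)\))?\s*$")

def primary_authors(authors, keep_roles=("author",)):
    kept = []
    for entry in authors.split(","):
        match = AUTHOR_RE.match(entry)
        if not match:
            continue  # unexpected shape: skip rather than guess
        # Entries without an explicit role are assumed to be plain authors.
        role = (match.group("role") or "author").strip().lower()
        if role in keep_roles:
            kept.append(match.group("name"))
    return kept

print(primary_authors("Jane Doe, John Roe (Illustrator), Ann Poe (Translator)"))
```

Only "Jane Doe" survives here, since the other two entries carry roles outside `keep_roles`.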

In [None]:
# Takes a string and returns an array of its processed words.
def clean_string(s):
    # Remove stopwords and punctuation.
    stop = stopwords.words('english') + list(string.punctuation)
    return [n for n in wordpunct_tokenize(s.lower()) if n not in stop]

def create_soup(x):
    title_importance = 1
    language_importance = 1
    series_importance = 1
    authors_importance = 1
    genres_importance = 1

    soup = ''
    
    # Keywords from description.
    desc = x['description']
    if pd.notna(desc):
        rake = Rake()
        rake.extract_keywords_from_text(desc)
        desc_soup = ' '.join(list(rake.get_word_degrees().keys()))
        soup = ' '.join(filter(None, [soup, desc_soup]))
    
    # Title.
    title_soup = ' '.join(clean_string(x['title']) * title_importance)
    soup = ' '.join(filter(None, [soup, title_soup]))
    
    # Language.
    language = x['language']
    if pd.notna(language):
        language_soup = ' '.join(clean_string(language) * language_importance)
        soup = ' '.join(filter(None, [soup, language_soup]))
    
    # Series.
    series = x['series']
    if pd.notna(series):
        series_soup = ' '.join(clean_string(series) * series_importance)
        soup = ' '.join(filter(None, [soup, series_soup]))

    # Authors.
    authors = x['authors']
    if pd.notna(authors):
        # Lowercase and strip spaces so each author collapses into one string, with any (Role) still attached.
        # Note: CountVectorizer's default tokenizer still splits on punctuation such as parentheses and periods.
        author_soup = ' '.join([a.lower().replace(' ', '') for a in authors.split(',')] * authors_importance)
        soup = ' '.join(filter(None, [soup, author_soup]))
    
    # Genres.
    genres = x['genres']
    if pd.notna(genres):
        # Almost the same treatment as authors (strip spaces to make matching a bit more likely).
        genre_soup = ' '.join([g.lower().replace(' ', '') for g in genres.split(',')] * genres_importance)
        soup = ' '.join(filter(None, [soup, genre_soup]))
    
    return soup

book_data['soup'] = book_data.apply(create_soup, axis=1)

In [None]:
book_data.soup.head()

Now it's time to create the similarity matrix between all books based on our lovely steaming soup.
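
Before running this on the full data, here is a tiny sketch of what the matrix looks like, using three made-up soups in place of `book_data['soup']`.  The result has one row and one column per book, so its size grows with the square of the number of books.  `dense_output=False` is shown as one option that may reduce memory when most pairs share no words, and I have not measured it on this dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical soups standing in for book_data['soup'].
soups = [
    "space comedy douglasadams sciencefiction",
    "space robots isaacasimov sciencefiction",
    "regency romance janeausten classics",
]

counts = CountVectorizer().fit_transform(soups)  # sparse matrix, shape (n_books, n_terms)
sim = cosine_similarity(counts)                  # dense array, shape (n_books, n_books)

# Keep the result sparse instead of materialising every pair.
sim_sparse = cosine_similarity(counts, dense_output=False)

print(sim.round(2))
```

The diagonal is 1 (every book is identical to itself), the matrix is symmetric, and the third soup scores 0 against the other two because it shares no words with them.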

In [None]:
count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(book_data['soup'])

# Passing a single matrix computes the similarity of every row against every other row.
cos_sim = cosine_similarity(count_matrix)

In [None]:
# Reverse lookup of title vs. index.
title_to_index = pd.Series(book_data.index, index=book_data['title'])

def get_recommendation(title, n=10):
    idx = title_to_index[title]
    print(idx)
    print(book_data.loc[idx].soup)

    # Rank all other books by similarity. Drop the query book explicitly rather than
    # skipping the first result, since a tie at 1.0 could push it out of first place.
    scores = pd.Series(cos_sim[idx]).drop(idx).sort_values(ascending=False)
    top_scores = scores.iloc[:n]
    print(top_scores)

    # cos_sim positions match book_data rows because the index was reset earlier.
    return book_data.iloc[top_scores.index]

# get_recommendation('Harry Potter and the Chamber of Secrets')
get_recommendation("The Hitchhiker's Guide to the Galaxy")

I'm going to now output the data to some pickle files for loading elsewhere (since it has been processed a little).
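
For reference, the round trip (writing a pickle and reading it back elsewhere) can be sketched as below.  The payload and the file name are stand-ins, and in practice the object would be `book_data` or `cos_sim`.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; in this notebook it would be book_data or cos_sim.
payload = {"titles": ["A", "B"], "similarity": [[1.0, 0.5], [0.5, 1.0]]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pickle"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    # Elsewhere: load the object back exactly as it was saved.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```

Pickle files can execute arbitrary code when loaded, so only load ones you created yourself or otherwise trust.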

In [None]:
import pickle

should_export = False

if should_export:
    # Book data.
    print('Exporting book data...', end='')
    with open('book_data.pickle', 'wb') as f:
        pickle.dump(book_data, f)
    print('done!')
    
    # Cosine similarity (warning: this will be huge).
    print('Exporting similarity matrix...', end='')
    with open('cossim.pickle', 'wb') as f:
        pickle.dump(cos_sim, f)
    print('done!')

In [None]:
# DEBUG: Easy way to find the rows of books I know.
book_data.loc[book_data.title.str.contains('Hitchhiker')]