In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

from ipywidgets import interact, widgets, Layout, Box
from IPython.display import display, clear_output, HTML

In [2]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

# Book Recommendation Model #

##### Data Exploration #####

Source: GoodBooks-10K

Link: https://github.com/zygmuntz/goodbooks-10k

Book and user preference data is sourced from the link above. The dataset details book characteristics (such as titles and authors) and the ratings individual users gave to those books.

In [3]:
# read in book characteristics and user rating data
books = pd.read_csv('data/books.csv')
ratings = pd.read_csv('data/ratings.csv')

In [4]:
# filtered list of unique books and their id
title_id = books[['book_id','original_title']]
unique_books = books['original_title'].sort_values().replace(' ', np.nan).dropna()
title_id = title_id[title_id['original_title'].isin(unique_books)].drop_duplicates()
ratings = ratings[ratings['book_id'].isin(title_id['book_id'])]

## Matrix Creation ##

Consider an individual user, \\(i\\), and their rating for a distinct book, \\(j\\). Then \\(x_i\\) = (\\(x_i^1, x_i^2, \dots, x_i^n\\)) represents the vector of ratings that user \\(i\\) gave to the \\(n\\) books. In a similar manner, for an individual book we can define \\(y_j\\) = (\\(y_j^1, y_j^2, \dots, y_j^m\\)), where \\(y_j\\) is the vector of ratings that book \\(j\\) received from the \\(m\\) users. Collecting these vectors as rows, we obtain the matrices \\(X\\) and \\(Y\\), where

\\[ X = \begin{bmatrix}
    \vdots & \vdots & \cdots & \vdots \\\
    x_i^1 & x_i^2& \dots & x_i^n  \\\
    \vdots & \vdots & \cdots & \vdots
\end{bmatrix}
\space Y = \begin{bmatrix}
    \vdots & \vdots & \cdots & \vdots \\\
    y_j^1 & y_j^2& \dots & y_j^m  \\\
    \vdots & \vdots & \cdots & \vdots
\end{bmatrix}\\]

Using matrix multiplication, we can estimate the matrix of ratings for the full set of users and books via the equation

\\[\check{R} = XY^T\\]

Here, \\(\check{R} \approx R\\), that is, \\(\check{R}\\) approximates \\(R\\), where \\(R\\) is the true matrix of ratings between users and books. In reality, we only have partial user preference data, hence the need for a recommendation system. Because of this, an alternative method is needed to infer a user's missing ratings.
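
To make the shapes in the product concrete, here is a minimal sketch under one common reading of \\(\check{R} = XY^T\\): each user row and each book row has the same length `k`, so the product has one entry per user-book pair. The sizes `n_users`, `n_books`, `k` and the random values are purely illustrative assumptions and are not drawn from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_books, k = 4, 3, 2   # hypothetical sizes for illustration only

X_factors = rng.random((n_users, k))   # one row per user
Y_factors = rng.random((n_books, k))   # one row per book

# each entry (i, j) is the dot product of user row i and book row j
R_hat = X_factors @ Y_factors.T
print(R_hat.shape)
```

The only requirement for the product to be defined is that the rows of both matrices share the same length.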

In [5]:
# filter ratings for easy computation
ratings = ratings.loc[(ratings['user_id']<2000) & (ratings['book_id']<2000)]

# book-by-user (X) and user-by-book (Y) rating matrices; unrated pairs become NaN
X = ratings.pivot(index='book_id', columns='user_id', values='rating')
Y = ratings.pivot(index='user_id', columns='book_id', values='rating')
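
As a small illustration of what the two `pivot` calls produce, the sketch below applies them to a hypothetical four-row ratings table (`toy_ratings` is invented for this example). Pairs with no rating show up as `NaN`, and the two results are transposes of each other.

```python
import pandas as pd

# hypothetical ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'book_id': [10, 20, 10, 20],
    'rating':  [5, 3, 4, 2],
})

# rows are books, columns are users
toy_X = toy_ratings.pivot(index='book_id', columns='user_id', values='rating')
# rows are users, columns are books
toy_Y = toy_ratings.pivot(index='user_id', columns='book_id', values='rating')

print(toy_X)
print(toy_Y)
```

Note that `pivot` raises an error if the same user-book pair appears more than once, so duplicates would need to be resolved beforehand.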

## Nearest-Neighbor Collaborative Filtering ##

Collaborative filtering (CF) is a recommendation-system method that uses historical data to predict future preferences. An important assumption of this model is that past actions are indicative of future actions. Because it relies only on past ratings, CF doesn't require the feature density that may be required in a content-based approach, i.e., one that uses descriptive characteristics of users and items. Within collaborative filtering, we will use several comparison methods to uncover user and item similarity.

### Item-Based ###

At a high level, an item-based approach relates products to other products and recommends novels that are similar to the novels a user rated well. There is also user-based collaborative filtering, which relates users to other users and predicts preferences based on that relation. For the purposes of this project, however, I will only explore item-based collaborative filtering.
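
One way to picture the item-based idea is a similarity-weighted average: a book the user has not rated is scored from the user's ratings of similar books. The sketch below is an assumption about how such a score could be computed, not a step the notebook performs; the similarity values and ratings are made up, and `0` is taken to mean "not rated".

```python
import numpy as np

# hypothetical book-by-book similarity matrix (symmetric, ones on the diagonal)
sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])
# one user's ratings for the three books; 0 marks "not rated"
user_ratings = np.array([5.0, 0.0, 2.0])

rated = user_ratings > 0
# for every book, keep only its similarities to the books the user rated
weights = sim[:, rated]
# similarity-weighted average of the user's known ratings
scores = weights @ user_ratings[rated] / weights.sum(axis=1)
predicted_unrated = scores[~rated].copy()

# rank only the books the user has not rated yet
scores[rated] = -np.inf
best = int(np.argmax(scores))
print(best, predicted_unrated)
```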

### Pearson Correlation ###

The Pearson correlation measures the linear association between two vectors, in our case the two sequences of ratings that readers gave to a pair of distinct books. The resulting correlation value ranges from -1 to 1, where -1 indicates a perfect negative correlation between the two books and 1 a perfect positive correlation. Positive in this sense means that readers who rate one book highly tend to rate the other highly as well.


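\\[ pCor(x_i, x_k) = \frac{\sum_{u}(x_i^u-\overline{x_i})(x_k^u-\overline{x_k})}{\sqrt{\sum_u(x_i^u-\overline{x_i})^2}\sqrt{\sum_u(x_k^u-\overline{x_k})^2}}\\]

Cosine similarity, the second measure used here, is the cosine of the angle between the two rating vectors:

\\[cosSim(x_i, x_k) = \frac{x_i \cdot x_k}{\Vert x_i \Vert \Vert x_k \Vert} = \frac{\sum_{u} x_i^u x_k^u}{\sqrt{\sum_u(x_i^u)^2}\sqrt{\sum_u(x_k^u)^2}}\\]

In both formulas, \\(u\\) indexes the entries of the two rating vectors.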


In the code, null values are imputed with zeros. Mathematically, this treats a book's rating vector as having a zero component in every dimension where the book was not rated.
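
The two measures can be checked on a pair of short vectors. This is a minimal sketch with made-up rating vectors (`x_i`, `x_k`) and two hypothetical helpers, `p_cor` and `cos_sim`, that follow the definitions of Pearson correlation and cosine similarity term by term; `np.corrcoef` is used only as a cross-check.

```python
import numpy as np

# hypothetical rating vectors for two books; 0 stands for an imputed missing rating
x_i = np.array([5.0, 3.0, 0.0, 4.0])
x_k = np.array([4.0, 2.0, 0.0, 5.0])

def p_cor(a, b):
    # center each vector on its own mean, then normalize the dot product
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c @ b_c) / (np.sqrt(a_c @ a_c) * np.sqrt(b_c @ b_c))

def cos_sim(a, b):
    # dot product divided by the product of the Euclidean norms
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(p_cor(x_i, x_k), np.corrcoef(x_i, x_k)[0, 1])
print(cos_sim(x_i, x_k))
```

Pearson correlation is the cosine similarity of the mean-centered vectors, which is why the imputed zeros affect the two measures differently: they shift the means used by Pearson but only add zero terms to the cosine sums.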

In [6]:
# fill nulls in dataframe
X = X.fillna(0)

In [7]:
from numpy import cov

In [None]:
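def pearson_correlation(A):
    '''Takes a DataFrame and returns the Pearson correlation between its row vectors'''
    np_A = A.to_numpy(dtype=float)
    # center each row on its own mean
    centered = np_A - np_A.mean(axis=1, keepdims=True)
    # norms of the centered rows; a constant row has norm 0 and yields NaN
    norms = np.linalg.norm(centered, axis=1)
    with np.errstate(divide='ignore', invalid='ignore'):
        corr = (centered @ centered.T) / np.outer(norms, norms)
    return pd.DataFrame(corr, index=A.index, columns=A.index)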

In [None]:
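def pearson_correlation(A):
    '''Takes a DataFrame and returns a nested list of Pearson correlations between its row vectors'''
    mat = []
    for i in A.values:
        row = []
        for j in A.values:
            mean_i = sum(i) / len(i)
            mean_j = sum(j) / len(j)
            # population covariance (divide by n) to match np.var's default ddof=0
            cov = sum((a - mean_i) * (b - mean_j) for (a, b) in zip(i, j)) / len(i)
            # a constant row has zero variance, so its correlations come out as NaN
            pcorr = cov / np.sqrt(np.var(i) * np.var(j))
            row.append(pcorr)
        mat.append(row)
    return mat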

In [None]:
np_A = X.to_numpy()
#np.cov(np_A[0], np_A[1])

x = np_A[0]
y = np_A[1]
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

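# population covariance (divide by n) to match np.var's default ddof=0
cov = sum((a - mean_x) * (b - mean_y) for (a,b) in zip(x,y)) / len(x)

# Pearson correlation between the first two book vectors
cov / np.sqrt(np.var(x) * np.var(y))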

In [None]:
np.cov(x,y)

In [None]:
pearson_correlation(X)

In [None]:
np.full((1, len(X.columns)), np.average(X.to_numpy()[0]))

In [8]:
def cos_similarity(A):
    '''Takes a DataFrame and returns a similarity matrix containing the cosines of the angles between the row vectors'''
    cols = A.index
    sim_mat = []
    for i in A.values:
        row = []
        for j in A.values:
            dot_prod = np.dot(i, j)
            norm_i = np.linalg.norm(i)
            norm_j = np.linalg.norm(j)
            row.append(dot_prod / (norm_i * norm_j))
        sim_mat.append(row)
    # label both axes with the original row index (book ids)
    return pd.DataFrame(sim_mat, index=cols, columns=cols)
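
The nested loops above grow quadratically in Python-level work, so a vectorized form may be worth considering. The following is a sketch of an alternative, not the notebook's own function: `cos_similarity_vectorized` and the `toy` frame are hypothetical names introduced here, and the function computes all pairwise dot products with a single matrix product.

```python
import numpy as np
import pandas as pd

def cos_similarity_vectorized(A):
    '''Cosine similarity between the rows of DataFrame A (sketch).'''
    values = A.to_numpy(dtype=float)
    # Euclidean norm of every row
    norms = np.linalg.norm(values, axis=1)
    # all pairwise dot products at once, divided by the matching norm products;
    # an all-zero row has norm 0 and produces NaN in its row and column
    with np.errstate(divide='ignore', invalid='ignore'):
        sim = (values @ values.T) / np.outer(norms, norms)
    return pd.DataFrame(sim, index=A.index, columns=A.index)

# hypothetical book-by-user ratings with missing values already set to 0
toy = pd.DataFrame([[5, 0, 3], [4, 0, 2], [0, 1, 0]],
                   index=[10, 20, 30], dtype=float)
toy_sim = cos_similarity_vectorized(toy)
print(toy_sim)
```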

In [9]:
# generate cosine similarity and pearson's correlation dataframe
cos_X = cos_similarity(X)
pear_X = X.T.fillna(0).corr(method='pearson')

In [17]:
from IPython.display import FileLink, FileLinks

cos_X.to_csv('cos_x.csv', index=True)
pear_X.to_csv('pear_x.csv', index=True)

#FileLinks('/path/to/')

In [13]:
# widget creation 
text = widgets.Text(placeholder='Filter book title dropdown w/ text', description='Filter:', disabled=False)
dropdown = widgets.Dropdown(options=title_id['original_title'][:100], description='Books:')
button = widgets.Button(description="Find Matches!")
output = widgets.Output()

# filterBooks regenerates the dropdown options from the user-specified filter text
def filterBooks(wdgt):
    # escape the input so characters such as '(' or '+' are matched literally
    r = re.compile(re.escape(wdgt.value), re.IGNORECASE)
    matches = list(filter(r.search, title_id['original_title']))
    matches = matches if len(matches) > 0 else ['No Matches']
    dropdown.options = matches

# findResults searches correlation and similarity dataframes for user search and returns a comparison table
def findResults(b):
    with output:
        clear_output()
        selected_title = title_id[title_id['original_title'] == dropdown.value]['book_id'].iloc[0]
        
        # select top ten books per method
        pear_ten = pear_X.loc[selected_title].sort_values(ascending=False)[1:11]
        cos_ten = cos_X.loc[selected_title].sort_values(ascending=False)[1:11]
        
        # get book titles and authors from books dataset
        pear_out = pd.DataFrame(pear_ten).merge(books[['book_id','original_title','authors']], on='book_id', how='inner')
        pear_out['rank'] = pear_out.index + 1
        pear_out = pear_out.drop('book_id', axis=1)
        pear_out = pear_out.rename(columns={'original_title': 'pearson_title', 'authors': 'p_author(s)'})
        
        # get book titles and authors from books dataset
        cos_out = pd.DataFrame(cos_ten).merge(books[['book_id','original_title','authors']], on='book_id', how='inner')
        cos_out['rank'] = cos_out.index + 1
        cos_out = cos_out.drop('book_id', axis=1)
        cos_out = cos_out.rename(columns={'original_title': 'cos_title', 'authors': 'c_author(s)'})

        # merge and display results
        results = pear_out[['p_author(s)','pearson_title','rank']].merge(cos_out[['rank','cos_title','c_author(s)']], on='rank', how='outer')
        display(HTML('<center><h1>Pearsons Correlation - Cosine Similarity</h1></center>'))
        display(HTML('<center>' + results.to_html(index=False) + '</center>'))

# output interactive widgets
display(text)
display(dropdown)
display(button, output)

# event handlers for filter submission and recommendation searches
text.on_submit(filterBooks)
button.on_click(findResults)
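
The `[1:11]` slices in `findResults` rely on the selected book sorting first in its own similarity row. If another book ties with it at a similarity of 1, that assumption can fail. A minimal sketch of a more explicit selection follows; `top_n_similar` and `toy_sim` are hypothetical names used only for this illustration.

```python
import pandas as pd

def top_n_similar(sim_df, book_id, n=10):
    '''Return the n most similar books to book_id, excluding the book itself (sketch).'''
    # drop the book's own entry by label instead of by position
    scores = sim_df.loc[book_id].drop(labels=book_id)
    return scores.sort_values(ascending=False).head(n)

# hypothetical similarity matrix in which books 10 and 20 are perfectly similar
toy_sim = pd.DataFrame(
    [[1.0, 1.0, 0.2], [1.0, 1.0, 0.5], [0.2, 0.5, 1.0]],
    index=[10, 20, 30], columns=[10, 20, 30],
)
print(top_n_similar(toy_sim, 10, n=2))
```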


### Singular Value Decomposition ###

SVD is a dimension-reduction technique that factors a matrix into three matrices and lets us approximate the original with lower-rank matrices built from its largest singular values. This is useful because our current dataset is sparse: most user-book pairs have no rating. The following is the singular value decomposition formula.

\\[ A = UDV^T\\]
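
As a quick check of the decomposition, the sketch below factors a small random matrix (a stand-in, not the ratings data), rebuilds it from all singular values, and then keeps only the `k` largest to form a lower-rank approximation. The matrix `A`, its size, and the choice `k = 2` are assumptions made for illustration. Note that `np.linalg.svd` returns \\(V^T\\) rather than \\(V\\) as its third output.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical 6x4 matrix of ratings between 0 and 5
A = rng.integers(0, 6, size=(6, 4)).astype(float)

# D holds the singular values in descending order; Vt is V transposed
U, D, Vt = np.linalg.svd(A, full_matrices=False)

# using every singular value recovers A up to floating-point error
A_full = U @ np.diag(D) @ Vt

# keeping only the k largest singular values gives a rank-k approximation
k = 2
A_k = U[:, :k] @ np.diag(D[:k]) @ Vt[:k, :]

# share of total squared singular values captured by the first 1..n components
explained = np.cumsum(D**2) / np.sum(D**2)
print(explained)
```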


In [None]:
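# NOTE: A in the formula is assumed here to be the book-by-user matrix X
# np.linalg.svd returns V transposed as its third output
U, D, Vt = np.linalg.svd(X.fillna(0))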

In [None]:
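# cumulative share of variance explained by the leading singular values
cum_var = np.cumsum(D**2) / np.sum(D**2)
plt.plot(range(1, len(D) + 1), cum_var, label='cumulative variance explained')
plt.xlabel('number of singular values')
plt.ylabel('cumulative variance explained')
plt.show()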

In [None]:
len(ratings['user_id'].unique())