## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import random

# For method 1
from random import randint
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# For method 2
from numpy import dot
from numpy.linalg import norm 

# For method 3
from sklearn.metrics.pairwise import cosine_similarity
from surprise import SVD, Reader, Dataset, accuracy
from surprise.model_selection import train_test_split

# For method 4
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, LongType


# Generating data

In [2]:
def generate_data(n_books, n_genres, n_authors, n_publishers, n_readers, dataset_size):
    '''
    Generate a synthetic dataset of book ratings. Each row is one rating of
    a book by a reader. The dataset has the following columns:
        - book_id (Integer) : Identifier for the book, between 1 and n_books
        - book_rating (Integer) : A value between 1 and 10
        - reader_id (Integer) : Identifier for the reader, between 1 and n_readers
        - book_genre (Integer) : An integer representing the genre of the book,
                                 between 1 and n_genres. Each row has exactly
                                 one genre
        - author_id (Integer) : Identifier for the author, between 1 and n_authors
        - num_pages (Integer) : Random value between 75 and 700
        - publisher_id (Integer) : Identifier for the publisher, between 1 and
                                   n_publishers
        - publish_year (Integer) : The year of publication, between 2000 and 2021
        - book_price (Integer) : The sale price of the book, between 1 and 200
        - text_lang (Integer) : The language of the book, an integer between 1
                                and 7 which is mapped to some language
        
    Every column is drawn independently, so the same book_id can appear with
    different authors, genres, etc. across rows.
        
    params:
        n_books (Integer) : The number of books you want the dataset to have
        n_genres (Integer) : Number of genres to be chosen from
        n_authors (Integer) : Number of authors to be generated
        n_publishers (Integer) : Number of publishers for the dataset
        n_readers (Integer) : Number of readers for the dataset
        dataset_size (Integer) : The number of rows to generate before exact
                                 duplicate rows are dropped
        
    example:
        data = generate_data(n_books = 3000, n_genres = 10, n_authors = 450,
                             n_publishers = 50, n_readers = 30000,
                             dataset_size = 100000)
    '''
    
    d = pd.DataFrame(
        {
            'book_id' : [randint(1, n_books) for _ in range(dataset_size)],
            'author_id' : [randint(1, n_authors) for _ in range(dataset_size)],
            'book_genre' : [randint(1, n_genres) for _ in range(dataset_size)],
            'reader_id' : [randint(1, n_readers) for _ in range(dataset_size)],
            'num_pages' : [randint(75, 700) for _ in range(dataset_size)],
            'book_rating' : [randint(1, 10) for _ in range(dataset_size)],
            'publisher_id' : [randint(1, n_publishers) for _ in range(dataset_size)],
            'publish_year' : [randint(2000, 2021) for _ in range(dataset_size)],
            'book_price' : [randint(1, 200) for _ in range(dataset_size)],
            'text_lang' : [randint(1,7) for _ in range(dataset_size)]
        }
    ).drop_duplicates()
    return d

In [3]:
d = generate_data(n_books = 3000, n_genres = 10, n_authors = 450, n_publishers = 50, n_readers = 30000, dataset_size = 100000)
d.to_csv('./data/data.csv', index = False)
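
`generate_data` draws from Python's `random` module without a seed, so each run produces a different dataset and the outputs shown in this notebook will not be reproduced exactly. A minimal sketch, assuming reproducibility is wanted: seeding the module-level generator before drawing makes `randint` return the same sequence. The seed value and the helper `draw_ids` are illustrative and not part of the original notebook.

```python
import random
from random import randint

def draw_ids(n, upper, seed):
    """Draw n integers in [1, upper] after seeding the global generator."""
    random.seed(seed)  # randint shares the module-level generator
    return [randint(1, upper) for _ in range(n)]

first = draw_ids(5, 3000, seed=42)
second = draw_ids(5, 3000, seed=42)
print(first == second)
```

Calling `random.seed(...)` once at the top of the notebook, before `generate_data`, would have the same effect on the generated dataset.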

# Method 1. Collaborative Filtering System

**Intuition**

Collaborative filtering predicts the interests of a user from the preferences of many users. It filters data for patterns using techniques that involve collaboration among multiple agents, data sources, etc. The underlying intuition is that if users A and B have similar taste in one product, then they are likely to have similar taste in other products as well.

There are two common approaches to collaborative filtering: memory based and model based.

**Memory based approaches**, also referred to as neighbourhood collaborative filtering, predict the rating of a user-item combination from its neighbourhood. They can be further split into user based and item based collaborative filtering. User based filtering assumes that like-minded users yield strong and similar recommendations. Item based filtering recommends items based on the similarity between items, calculated from the ratings users gave those items.

**Model based approaches** are predictive models built with machine learning. Features of the dataset are used as inputs to a model that is fitted by solving an optimization problem. Model based approaches include decision trees, rule based approaches, latent factor models, etc.
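
To make the memory based idea concrete, here is a minimal sketch of user based neighbourhood filtering on a toy rating matrix. The matrix, the helper names and the choice of a similarity-weighted mean are assumptions for illustration; they are not the method implemented later in this notebook.

```python
import numpy as np

# Toy rating matrix: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 3.0, 1.0],
    [1.0, 1.0, 5.0, 5.0],
])

def cosine(a, b):
    """Cosine similarity of two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_user_based(ratings, user, item):
    """Similarity-weighted mean of the other users' ratings for `item`."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue  # skip the target user and users without a rating
        w = cosine(ratings[user], ratings[other])
        num += w * ratings[other, item]
        den += abs(w)
    return num / den if den else 0.0

pred = predict_user_based(ratings, user=0, item=2)
print(round(pred, 2))
```

User 0 resembles user 1 far more than user 2, so the prediction lands closer to user 1's rating of 3 than to user 2's rating of 5.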

**Advantages**

The main advantages of collaborative filtering models are that they are simple to implement and provide a high level of coverage. They also capture subtle characteristics (especially latent factor models) and do not require any understanding of the item content.

**Disadvantages**

The main disadvantage of this model is that it cannot readily recommend new items, because no user has interacted with them yet. This is referred to as the cold start problem. Memory based algorithms are also known to perform poorly on highly sparse datasets.

**Examples**

Some examples of collaborative filtering algorithms:
- YouTube content recommendation: recommending videos based on other users who have subscribed to or watched similar videos.
- Coursera course recommendation: recommending courses based on other individuals who have finished the same courses as you.

**Implementation**

- Import data from generate_data function (function provided above)
- Generate a pivot table with readers on the index, books on the columns and ratings as the values
- Factorize the pivot table with a truncated SVD (`svds`) and rebuild it from the factors to obtain predicted ratings
- Generate item recommendations based on user_id
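
The steps above can be sketched end to end on a toy dataset. This is a minimal illustration under assumed data: the eight ratings below are invented, and only the column names mirror the generated dataset.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy long-format ratings; the values are made up for illustration.
toy = pd.DataFrame({
    'reader_id':   [1, 1, 2, 2, 3, 3, 4, 4],
    'book_id':     [10, 20, 10, 30, 20, 30, 10, 40],
    'book_rating': [8, 6, 7, 9, 5, 10, 6, 4],
})

# Readers on the index, books on the columns, missing ratings filled with 0.
pt = toy.pivot_table(index='reader_id', columns='book_id',
                     values='book_rating').fillna(0)

# Rank-2 truncated SVD; k must satisfy 1 <= k < min(pt.shape).
u, s, vt = svds(csr_matrix(pt.values), k=2)
approx = u @ np.diag(s) @ vt  # low-rank reconstruction, same shape as pt

# Wrap the scores so that books can be ranked per reader.
scores = pd.DataFrame(approx, index=pt.index, columns=pt.columns)
top_for_reader_1 = scores.loc[1].sort_values(ascending=False).head(2)
print(top_for_reader_1)
```

The reconstruction assigns a non-trivial score to every reader-book pair, including pairs with no observed rating, which is what makes ranking unseen books possible.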

## Functions

In [4]:
def normalize_pred_ratings(pred_ratings):
    '''
    Min-max normalize the predicted ratings so that they lie between 0 and 1.
    
    params:
        pred_ratings (ndarray) : 2-D array of predicted ratings
    '''
    return (pred_ratings - pred_ratings.min()) / (pred_ratings.max() - pred_ratings.min())

In [5]:
def generate_prediction_df(mat, pt_df, n_factors):
    '''
    This function will calculate the truncated singular value decomposition of the
    input matrix given n_factors. It will then rebuild the matrix from the factors
    and normalize the resulting user rating predictions.
    
    params:
        mat (CSR Matrix) : scipy csr matrix corresponding to the pivot table (pt_df)
        pt_df (DataFrame) : pandas dataframe which is a pivot table
        n_factors (Integer) : Number of singular values and vectors to compute. 
                              Must be 1 <= n_factors < min(mat.shape). 
    
    returns:
        DataFrame with books on the index and readers on the columns
    '''
    
    if not 1 <= n_factors < min(mat.shape):
        raise ValueError("Must be 1 <= n_factors < min(mat.shape)")
        
    # matrix factorization
    u, s, v = svds(mat, k = n_factors)
    s = np.diag(s)

    # calculate pred ratings
    pred_ratings = np.dot(np.dot(u, s), v) 
    pred_ratings = normalize_pred_ratings(pred_ratings)
    
    # convert to df
    pred_df = pd.DataFrame(
        pred_ratings,
        columns = pt_df.columns,
        index = list(pt_df.index)
    ).transpose()
    return pred_df

In [6]:
def recommend_items(pred_df, usr_id, n_recs):
    '''
    Given a usr_id and pred_df this function will recommend
    items to the user.
    
    params:
        pred_df (DataFrame) : generated from `generate_prediction_df` function
        usr_id (Integer) : The user you wish to get item recommendations for
        n_recs (Integer) : The number of recommendations you want for this user
    '''
    
    # sort the predicted scores for this user in decreasing order
    usr_pred = pred_df[usr_id].sort_values(ascending = False).reset_index().rename(columns = {usr_id : 'sim'})
    rec_df = usr_pred.head(n_recs)
    return rec_df
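
`recommend_items` ranks every book, including ones the reader has already rated. A minimal sketch of one way to skip those, assuming the caller can supply the set of already rated book ids; the helper `recommend_unseen` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def recommend_unseen(pred_df, rated_books, usr_id, n_recs):
    """Rank books for `usr_id`, skipping the ones listed in `rated_books`."""
    scores = pred_df[usr_id].drop(index=list(rated_books), errors='ignore')
    return scores.sort_values(ascending=False).head(n_recs)

# Toy prediction frame: books on the index, readers on the columns.
pred_df = pd.DataFrame(
    {5: [0.9, 0.4, 0.7, 0.2]},
    index=pd.Index([101, 102, 103, 104], name='book_id'),
)
recs = recommend_unseen(pred_df, rated_books={101}, usr_id=5, n_recs=2)
print(recs)
```

With the real data, the rated set for a reader could be read from the original long-format dataframe before calling the helper.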

## Implementation

### Reading data

In [7]:
PATH = './data/data.csv'

In [8]:
data = pd.read_csv(PATH)
print(data.shape)


(100000, 10)


In [9]:
data.head()

,book_id,author_id,book_genre,reader_id,num_pages,book_rating,publisher_id,publish_year,book_price,text_lang
0,1575,439,4,13777,103,2,45,2017,6,2
1,823,304,7,11446,295,6,6,2000,120,7
2,283,204,1,23959,460,2,35,2004,169,6
3,2309,192,6,25156,681,6,36,2011,14,7
4,1872,376,2,15986,212,3,3,2014,10,6


### Generating a pivot table with readers on the index and books on the column and values being the ratings

In [10]:
# Creating a copy
df = data.copy()

# Pivot table
pt_df = df.pivot_table(
    columns = 'book_id',
    index = 'reader_id',
    values = 'book_rating'
).fillna(0)

pt_df

reader_id,1,2,3,4,5,6,7,8,9,10,...,2991,2992,2993,2994,2995,2996,2997,2998,2999,3000
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Converting to a csr matrix

In [11]:
mat = pt_df.values
mat = csr_matrix(mat)
mat

<28975x3000 sparse matrix of type '<class 'numpy.float64'>'
	with 99944 stored elements in Compressed Sparse Row format>
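
The repr above gives enough information to quantify how sparse the rating matrix is. A short worked calculation using only the figures it reports:

```python
# Figures taken from the sparse-matrix repr above.
n_readers, n_books, n_stored = 28975, 3000, 99944

# For a scipy matrix this is mat.nnz / (mat.shape[0] * mat.shape[1]).
density = n_stored / (n_readers * n_books)
print(f"density  = {density:.4%}")
print(f"sparsity = {1 - density:.4%}")
```

Only about 0.1% of the reader-book pairs carry a rating, which is the highly sparse regime where memory based methods are said to struggle.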

In [12]:
pred_df = generate_prediction_df(mat, pt_df, 10)
pred_df

book_id,1,2,3,4,5,6,7,8,9,10,...,29991,29992,29993,29994,29995,29996,29997,29998,29999,30000
1,0.173206,0.172777,0.170402,0.172853,0.172851,0.173631,0.173030,0.173532,0.173203,0.172891,...,0.173617,0.173183,0.173941,0.174044,0.173267,0.172897,0.172766,0.173220,0.172990,0.173047
2,0.173217,0.172777,0.173187,0.172871,0.172985,0.173539,0.172887,0.172937,0.173001,0.172950,...,0.173391,0.172946,0.173513,0.174294,0.173071,0.172891,0.172757,0.173141,0.172799,0.173021
3,0.173165,0.172772,0.173695,0.172817,0.172905,0.173658,0.172956,0.173078,0.173016,0.172910,...,0.173277,0.172899,0.173521,0.173962,0.172966,0.172844,0.172755,0.173007,0.172740,0.172997
4,0.172382,0.172770,0.176367,0.172858,0.172789,0.174660,0.172867,0.172991,0.172967,0.173285,...,0.173157,0.173268,0.174006,0.171410,0.173054,0.173223,0.172757,0.172427,0.173066,0.172843
5,0.173049,0.172775,0.172286,0.172866,0.172790,0.173542,0.172922,0.172687,0.172970,0.173037,...,0.172864,0.172903,0.173246,0.171719,0.172944,0.172914,0.172745,0.173148,0.172743,0.173197
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2996,0.173445,0.172768,0.173437,0.172832,0.172876,0.173273,0.172871,0.173107,0.173084,0.172996,...,0.173512,0.173076,0.173971,0.174350,0.173281,0.172951,0.172764,0.173297,0.172933,0.172994
2997,0.172922,0.172793,0.171540,0.172820,0.172924,0.174258,0.173271,0.173696,0.173269,0.172742,...,0.173421,0.173126,0.173514,0.173860,0.172838,0.172810,0.172759,0.172866,0.172790,0.173131
2998,0.172991,0.172772,0.173331,0.172872,0.172909,0.173234,0.172794,0.172884,0.172956,0.173012,...,0.173348,0.173103,0.173540,0.173556,0.173120,0.173002,0.172757,0.173003,0.172967,0.172905
2999,0.172803,0.172763,0.173916,0.172829,0.172825,0.173711,0.172860,0.173068,0.172970,0.172987,...,0.173151,0.173027,0.173590,0.172945,0.173058,0.172939,0.172755,0.172833,0.172862,0.172947


### Generating recommendations

In [13]:
recommend_items(pred_df, usr_id=5, n_recs=5)

,book_id,sim
0,2221,0.178577
1,1178,0.17742
2,1030,0.176645
3,934,0.176443
4,1470,0.176194


# Method 2. Content Based Systems

**Intuition**

Content based systems generate recommendations based on the user’s preferences and profile. They try to match users to items similar to those they have liked previously. The similarity between items is generally established from the attributes of the items the user liked. Unlike most collaborative filtering models, which leverage ratings from both the target user and other users, content based models focus on the ratings provided by the target user themselves. In essence, the content based approach leverages different sources of data to generate recommendations.

The simplest forms of content based systems require the following sources of data (these requirements can increase with the complexity of the system you’re trying to build):

1. Item level data source: you need a strong source of data on the attributes of the item. For our scenario, we have things like book_price, num_pages, publish_year, etc. The more information you have about the item, the more beneficial it will be for your system.
2. User level data source: you need some sort of user feedback on the items you’re providing recommendations for. This feedback can be either implicit or explicit. In our sample data, we’re working with user ratings of books they’ve read. The more user feedback you can track, the more beneficial it will be for your system.

**Advantages**

Content based models are most advantageous for recommending items when there is an insufficient amount of rating data available. This is because other items with similar attributes might have been rated by the user. Hence, a model should be able to leverage the ratings along with the item attributes to generate recommendations even when there isn’t a lot of data.

**Disadvantages**

There are two main disadvantages of content based systems.
1. The recommendations provided are “obvious” given the items / content the user has consumed. If the user has never interacted with a particular type of item, that item will never be recommended to them. For example, if you’ve never read mystery books, then through this approach you will never be recommended mystery books. This is because the model is user specific and doesn’t leverage knowledge from similar users. It reduces the diversity of the recommendations, which is a negative outcome for many businesses.
2. They’re ineffective at providing recommendations for new users. Building a model requires a history of explicit / implicit user level data for the items. It’s generally important to have a large dataset of ratings available to make robust predictions without overfitting.

**Examples**

Some examples of content based systems are:
- Amazon product feed (you’re being recommended products similar to what you’ve previously purchased)
- Spotify music recommendations

There are many excellent content based systems which are built algorithmically without depending on a model based approach. For example, companies like Hacker Rank and Reddit have been known to use algorithmic approaches to recommend new posts on their platforms to users. The key to building an algorithmic content based recommender lies in defining a set of rules for your business which can be used to rank items. In the case of Reddit, recommendations depend on the time of the post, number of likes, number of dislikes, number of comments, etc. These can be factored into a formula that generates a score for a post; a high score yields a high recommendation and vice versa.
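
As a rough illustration of such a rule based score, the sketch below combines likes, dislikes, comments and post age into one number. The function `post_score`, its weights and the `gravity` exponent are hypothetical; this is not the formula used by any of the platforms named above.

```python
def post_score(likes, dislikes, comments, age_hours, gravity=1.5):
    """Hypothetical rule-based score: engagement decayed by post age."""
    engagement = (likes - dislikes) + 0.5 * comments
    return engagement / (age_hours + 2) ** gravity

posts = {
    'fresh_post': post_score(likes=40, dislikes=5, comments=10, age_hours=1),
    'old_popular': post_score(likes=400, dislikes=50, comments=100, age_hours=72),
}

# Rank post names from highest to lowest score.
ranked = sorted(posts, key=posts.get, reverse=True)
print(ranked)
```

The age term in the denominator lets a recent post with modest engagement outrank an older post with ten times the engagement, which is the kind of business rule the paragraph describes.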

**Implementation**

- Import data from generate_data function (function provided above)
- Normalize book_price, book_rating, num_pages
- One hot encode publish_year, book_genre, text_lang
- Given a book_id input, calculate the cosine similarity and return the top n books similar to the input
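
The steps above can be sketched on a toy table. This is a minimal illustration under stated assumptions: there is exactly one row per book, identifier columns are left out of the feature vector, and min-max scaling stands in for the normalization step. The helper `most_similar` is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'book_id':    [1, 2, 3, 4],
    'num_pages':  [100, 110, 600, 620],
    'book_price': [10, 12, 150, 140],
    'book_genre': [1, 1, 2, 2],
})

# Scale numeric columns to [0, 1] with min-max normalization.
for col in ['num_pages', 'book_price']:
    lo, hi = toy[col].min(), toy[col].max()
    toy[col] = (toy[col] - lo) / (hi - lo)

# One-hot encode the categorical column; the prefix keeps names unique.
features = pd.get_dummies(toy, columns=['book_genre'], prefix='genre', dtype=float)
features = features.set_index('book_id')

def most_similar(features, book_id, n):
    """Return the n books whose feature vectors are closest in cosine terms."""
    x = features.to_numpy(dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)  # row-normalize
    target = unit[features.index.get_loc(book_id)]
    sims = pd.Series(unit @ target, index=features.index)
    return sims.drop(index=book_id).nlargest(n)

top = most_similar(features, book_id=1, n=1)
print(top)
```

Because each row is scaled to unit length first, every similarity stays between -1 and 1, and the input book is dropped so it cannot recommend itself.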

## Functions

In [14]:
def normalize_input_data(data):
    '''
    This function will normalize the input data to be between 0 and 1
    
    params:
        data (List) : The list of values you want to normalize
    
    returns:
        The input data normalized between 0 and 1
    '''
    min_val = min(data)
    if min_val < 0:
        data = [x + abs(min_val) for x in data]
    max_val = max(data)
    return [x/max_val for x in data]

In [15]:
def ohe(df, enc_col):
    '''
    This function will one hot encode the specified column and add it back
    onto the input dataframe
    
    params:
        df (DataFrame) : The dataframe you wish for the results to be appended to
        enc_col (String) : The column you want to OHE
    
    returns:
        The OHE columns added onto the input dataframe
    '''
    
    ohe_df = pd.get_dummies(df[enc_col])
    ohe_df.reset_index(drop = True, inplace = True)
    return pd.concat([df, ohe_df], axis = 1)

In [16]:
class CBRecommend():
    def __init__(self, df):
        self.df = df
        
    def cosine_sim(self, v1,v2):
        '''
        This function will calculate the cosine similarity between v1 and v2.
        
        Note: `sum` assumes v1 is 2-D, which happens when several rows share
        the input book_id. The result is then a sum over those rows and can
        exceed 1.
        '''
        return sum(dot(v1,v2)/(norm(v1)*norm(v2)))
    
    def recommend(self, book_id, n_rec):
        """
        Return the n_rec rows of self.df most similar to the input book.
        A 'similarity' column is added to self.df in place.
        
        params:
            book_id (int): The book to find similar books for
            n_rec (int): The number of recommendations to return
        """
        
        # calculate similarity of input book_id vector w.r.t all other vectors
        inputVec = self.df.loc[book_id].values
        self.df['similarity']= self.df.apply(lambda x: self.cosine_sim(inputVec, x.values), axis=1)

        # returns top n user specified books
        return self.df.nlargest(columns='similarity',n=n_rec)

## Implementation

### Reading data

In [17]:
PATH = './data/data.csv'

In [18]:
data = pd.read_csv(PATH)
print(data.shape)


(100000, 10)


In [19]:
data.head()

,book_id,author_id,book_genre,reader_id,num_pages,book_rating,publisher_id,publish_year,book_price,text_lang
0,1575,439,4,13777,103,2,45,2017,6,2
1,823,304,7,11446,295,6,6,2000,120,7
2,283,204,1,23959,460,2,35,2004,169,6
3,2309,192,6,25156,681,6,36,2011,14,7
4,1872,376,2,15986,212,3,3,2014,10,6


### Transformations

In [20]:
# Creating a copy
df = data.copy()

# Normalizing the num_pages, ratings, price columns
df['num_pages_norm'] = normalize_input_data(df['num_pages'].values)
df['book_rating_norm'] = normalize_input_data(df['book_rating'].values)
df['book_price_norm'] = normalize_input_data(df['book_price'].values)

# OHE on publish_year and genre
df = ohe(df = df, enc_col = 'publish_year')
df = ohe(df = df, enc_col = 'book_genre')
df = ohe(df = df, enc_col = 'text_lang')

# Drop redundant columns
cols = ['publish_year', 'book_genre', 'num_pages', 'book_rating', 'book_price', 'text_lang']
df.drop(columns = cols, inplace = True)
df.set_index('book_id', inplace = True)

df.head()

book_id,author_id,reader_id,publisher_id,num_pages_norm,book_rating_norm,book_price_norm,2000,2001,2002,2003,...,8,9,10,1,2,3,4,5,6,7
1575,439,13777,45,0.147143,0.2,0.03,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
823,304,11446,6,0.421429,0.6,0.6,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
283,204,23959,35,0.657143,0.2,0.845,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2309,192,25156,36,0.972857,0.6,0.07,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1872,376,15986,3,0.302857,0.3,0.05,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


### Run on a sample as an example

In [21]:
t = df.copy()
cbr = CBRecommend(df = t)
cbr

<__main__.CBRecommend at 0x7fc0db5a8be0>

In [22]:
cbr.recommend(book_id = 1042, n_rec = 5)[['author_id', 'book_rating_norm', 'similarity']].sort_values(by=['book_rating_norm'], ascending = False)

book_id,author_id,book_rating_norm,similarity
1964,261,0.8,4.315669
2550,344,0.8,4.315669
2186,363,0.6,4.315669
1771,376,0.4,4.315669
1769,346,0.2,4.315669
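
The similarity values above are greater than 1, which the cosine similarity of two single vectors cannot be. A likely explanation, assuming `book_id` repeats in the index: `df.loc[book_id]` then returns several rows, `norm` becomes the Frobenius norm of that 2-D block, and `sum` adds up one term per row. The toy arrays below are invented to show the effect in isolation.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

x = np.array([1.0, 2.0, 2.0])
one_row = x                     # shape (3,): a single book vector
four_rows = np.tile(x, (4, 1))  # shape (4, 3): the same book listed four times

# Ordinary cosine similarity of a vector with itself.
scalar_sim = dot(one_row, x) / (norm(one_row) * norm(x))

# Same expression as in cosine_sim, applied to the 2-D block.
summed_sim = sum(dot(four_rows, x) / (norm(four_rows) * norm(x)))

print(scalar_sim, summed_sim)
```

With k identical rows the summed value grows like the square root of k, so the scores are no longer comparable to a true cosine similarity. Reducing each book to one vector before comparing would keep the result bounded.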


# Method 3. Hybrid Recommendation System

**Intuition**

Each recommendation method has its own benefits and flaws. Many of these methods can seem restrictive when used in isolation, especially when multiple sources of data are available for the problem. Hybrid recommender systems are designed to use the different available data sources to generate robust inferences.

Hybrid recommendation systems have two predominant designs: parallel and sequential. The parallel design provides the input to multiple recommendation systems and combines their recommendations into one output. The sequential design provides the input to a single recommendation engine, whose output is passed on to the next recommender in the sequence.
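
The two designs can be contrasted with a small sketch. The score dictionaries stand in for the outputs of two hypothetical recommenders, and the equal blending weight is an assumption chosen for illustration.

```python
# Toy scores in [0, 1] from two hypothetical recommenders, keyed by book_id.
content_scores = {1: 0.9, 2: 0.8, 3: 0.3, 4: 0.1}
collab_scores = {1: 0.2, 2: 0.7, 3: 0.9, 4: 0.4}

def parallel_hybrid(a, b, weight_a=0.5):
    """Parallel design: run both recommenders and blend their scores."""
    return {k: weight_a * a[k] + (1 - weight_a) * b[k] for k in a}

def sequential_hybrid(first, second, n_candidates):
    """Sequential design: the first model shortlists, the second re-ranks."""
    shortlist = sorted(first, key=first.get, reverse=True)[:n_candidates]
    return sorted(shortlist, key=second.get, reverse=True)

blended = parallel_hybrid(content_scores, collab_scores)
best_parallel = max(blended, key=blended.get)
reranked = sequential_hybrid(content_scores, collab_scores, n_candidates=2)
print(best_parallel, reranked)
```

In the sequential variant a book that the first model does not shortlist can never be recommended, whereas the parallel blend lets a strong score from either model count.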

**Advantages**

Hybrid systems combine different models so that the strengths of one offset the weaknesses of another. This yields more robust and personalized recommendations for users.

**Disadvantages**

These models generally have high computational complexity and require a large database of ratings and other attributes that must be kept up to date. Without up to date metrics (user engagement, ratings, etc.), it is difficult to retrain the models and provide new recommendations that reflect updated items and ratings.

**Examples**

Netflix uses a hybrid recommendation system. It generates recommendations based on the watch and search habits of similar users (collaborative filtering) in conjunction with movies that have been rated by users and share similar characteristics (content based).

**Implementation**

- Import data from generate_data function (function provided above)
- Use a content-based model (cosine_similarity) to compute the 50 most similar books
- Compute the predicted ratings that the user might give these 50 books using a collaborative filtering model (SVD)
- Return the top n books with the highest predicted rating

## Functions

In [23]:
def hybrid(reader_id, book_id, n_recs, df, cosine_sim, svd_model):
    '''
    This function represents a hybrid recommendation system, it will have the following flow:
        1. Use a content-based model (cosine_similarity) to compute the 50 most similar books
        2. Compute the predicted ratings that the user might give these 50 books using a collaborative filtering model (SVD)
        3. Return the top n books with the highest predicted rating
        
    params:
        reader_id (Integer) : The reader_id 
        book_id (Integer) : The book_id 
        n_recs (Integer) : The number of recommendations you want
        df (DataFrame) : Original dataframe with all book information 
        cosine_sim (DataFrame) : The cosine similarity dataframe
        svd_model (Model) : SVD model
    '''
    
    # sort similarity values in decreasing order, skip the first entry
    # (the self-match) and keep the next 50 results
    similarity = list(enumerate(cosine_sim[int(book_id)]))
    similarity = sorted(similarity, key=lambda x: x[1], reverse=True)
    similarity = similarity[1:51]
    
    # get book metadata
    book_idx = [i[0] for i in similarity]
    books = df.iloc[book_idx][['book_id', 'book_rating', 'num_pages', 'publish_year', 'book_price', 'reader_id']]
    
    # predict using the svd_model
    books['est'] = books.apply(lambda x: svd_model.predict(reader_id, x['book_id'], x['book_rating']).est, axis = 1)
    
    # sort predictions in decreasing order and return top n_recs
    books = books.sort_values('est', ascending=False)
    return books.head(n_recs)

## Implementation

### Reading data

In [24]:
PATH = './data/data.csv'

In [25]:
data = pd.read_csv(PATH)
print(data.shape)


(100000, 10)


In [26]:
data.head()

,book_id,author_id,book_genre,reader_id,num_pages,book_rating,publisher_id,publish_year,book_price,text_lang
0,1575,439,4,13777,103,2,45,2017,6,2
1,823,304,7,11446,295,6,6,2000,120,7
2,283,204,1,23959,460,2,35,2004,169,6
3,2309,192,6,25156,681,6,36,2011,14,7
4,1872,376,2,15986,212,3,3,2014,10,6


### Content based

In [27]:
# Creating a copy
df = data.copy()

# Pivot table
rmat = df.pivot_table(
    columns = 'book_id',
    index = 'reader_id',
    values = 'book_rating'
).fillna(0)

### Compute the cosine similarity matrix 

In [28]:
cosine_sim = cosine_similarity(rmat, rmat)
cosine_sim = pd.DataFrame(cosine_sim, index=rmat.index, columns=rmat.index)
cosine_sim

reader_id,1,2,3,4,5,6,7,8,9,10,...,29991,29992,29993,29994,29995,29996,29997,29998,29999,30000
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
29997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
29998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
29999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
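
`cosine_similarity(rmat, rmat)` compares the rows of `rmat`, which are readers, so the matrix above is reader by reader. A lookup by `book_id` needs a book by book matrix instead, which can be obtained by transposing the pivot table first. The sketch below shows the difference on an invented pivot table; it uses plain NumPy, and `cosine_matrix` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy pivot table: readers on the index, books on the columns.
rmat = pd.DataFrame(
    [[8.0, 0.0, 6.0],
     [7.0, 0.0, 5.0],
     [0.0, 9.0, 0.0],
     [0.0, 4.0, 1.0]],
    index=pd.Index([1, 2, 3, 4], name='reader_id'),
    columns=pd.Index([10, 20, 30], name='book_id'),
)

def cosine_matrix(frame):
    """Pairwise cosine similarity between the rows of `frame`."""
    x = frame.to_numpy(dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return pd.DataFrame(unit @ unit.T, index=frame.index, columns=frame.index)

reader_sim = cosine_matrix(rmat)   # 4 x 4, compares readers
book_sim = cosine_matrix(rmat.T)   # 3 x 3, compares books
print(book_sim.round(2))
```

Books 10 and 30 come out as similar because the same readers rated both, while book 20 shares no reader with book 10.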


### Collaborative filtering

In [29]:
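# Note: Reader() assumes a 1-5 rating scale by default. book_rating spans 1-10,
# so estimates are clipped at 5.0 unless Reader(rating_scale=(1, 10)) is used.
reader = Reader()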
data = Dataset.load_from_df(df[['reader_id', 'book_id', 'book_rating']], reader)
data

<surprise.dataset.DatasetAutoFolds at 0x7fc0db5d6610>

### Modeling

In [30]:
# split data into train test
trainset, testset = train_test_split(data, test_size=0.3,random_state=10)

# train model
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fc0dc28c340>

In [31]:
# run the trained model against the testset
test_pred = svd.test(testset)

# get RMSE
accuracy.rmse(test_pred, verbose=True)

RMSE: 2.9346


2.9346061189637904
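
RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand and shows how clipping estimates to a narrower scale than the data inflates the error. The four ratings and estimates are invented for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

true_ratings = [2, 6, 9, 10]
raw_estimates = [3.0, 5.5, 8.0, 9.0]

# Clipping estimates to a 1-5 scale caps every high estimate at 5.
clipped = np.clip(raw_estimates, 1, 5)

rmse_raw = rmse(true_ratings, raw_estimates)
rmse_clipped = rmse(true_ratings, clipped)
print(round(rmse_raw, 3), round(rmse_clipped, 3))
```

An RMSE close to 3 on a 1 to 10 scale is about what a constant guess near the middle of the range would achieve on uniformly random ratings, which is expected here because the generated ratings carry no signal.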

### Generate recommendations

In [32]:
hybrid(reader_id = df['reader_id'].values[0], 
       book_id = df['book_id'].values[0], 
       n_recs = 10, 
       df=df, 
       cosine_sim=cosine_sim, 
       svd_model=svd
      ).sort_values(by=['book_rating'], ascending = False)

,book_id,book_rating,num_pages,publish_year,book_price,reader_id,est
12841,1163,10,630,2009,167,15346,5.0
13758,791,10,502,2011,176,25835,5.0
22452,1720,10,447,2001,28,29612,5.0
23230,1788,8,251,2006,30,16523,5.0
13430,1439,7,489,2021,96,13369,5.0
21960,2262,6,396,2011,121,24351,5.0
4496,133,6,126,2019,175,4062,5.0
11955,1247,5,175,2007,157,23910,5.0
2558,1431,4,90,2004,135,586,5.0
12388,2836,4,442,2002,33,3550,5.0


In [33]:
df.head()

,book_id,author_id,book_genre,reader_id,num_pages,book_rating,publisher_id,publish_year,book_price,text_lang
0,1575,439,4,13777,103,2,45,2017,6,2
1,823,304,7,11446,295,6,6,2000,120,7
2,283,204,1,23959,460,2,35,2004,169,6
3,2309,192,6,25156,681,6,36,2011,14,7
4,1872,376,2,15986,212,3,3,2014,10,6


# Method 4. FP-Growth with PySpark, using a different type of data (products bought together in each purchase)

**Intuition**

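FP-Growth mines frequent itemsets, i.e. sets of products that appear together in at least a minimum fraction of purchases (`minSupport`), without enumerating every candidate set. Association rules of the form antecedent -> consequent are then derived from those itemsets and kept when their confidence reaches `minConfidence`. When a purchase contains the whole antecedent of a rule, the consequent of that rule is suggested as a recommendation.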

## Implementation

### FP-Growth

In [37]:
spark = SparkSession\
  .builder\
  .appName("test_import")\
  .enableHiveSupport()\
  .getOrCreate()

In [38]:
def generate_data_spark(n_purchases, n_products, dataset_size):
    '''
    Generate a Spark dataframe of purchases, one row per purchase, with the
    following columns:
        - purchase_id (int) : Unique identifier for the purchase
        - product_id (list of int) : Sorted, distinct product identifiers
                                     bought in the purchase (1 to 15 products)
                
    params:
        n_purchases (Integer) : The number of purchases you want the dataset to have
        n_products (Integer) : Exclusive upper bound for product ids, which are
                               drawn from 1 to n_products - 1. Must be at least 16
        dataset_size (Integer) : The number of rows to generate before duplicate
                                 purchase_ids are dropped
        
    example:
        data = generate_data_spark(n_purchases = 1200, n_products = 20, dataset_size = 10000)
    '''
    
    df = pd.DataFrame(
        {
            'purchase_id' : [randint(1, n_purchases) for _ in range(dataset_size)],
            'product_id': [sorted(random.sample(range(1, n_products), randint(1, 15))) for _ in range(dataset_size)],  # 15 is maximum number of purchased items per customer
            #'product_id' : [list(np.random.randint(n_products, size = randint(1, 15))) for _ in range(dataset_size)],   # Using numpy. It has some problems with spark
        }
    ).drop_duplicates(subset='purchase_id', keep='first')
    
    # Creating schema for converting to spark df
    mySchema = StructType([
        StructField('purchase_id', LongType(), True),
        StructField('product_id', ArrayType(LongType()),True),  
    ])
    sparkDF = spark.createDataFrame(df, schema=mySchema) 
    return sparkDF

In [49]:
df = generate_data_spark(n_purchases = 1200, n_products = 20, dataset_size = 10000)
df.head(20)
df

DataFrame[purchase_id: bigint, product_id: array<bigint>]

In [50]:
df.printSchema()

root
 |-- purchase_id: long (nullable = true)
 |-- product_id: array (nullable = true)
 |    |-- element: long (containsNull = true)



In [51]:
df.show(truncate=False)

+-----------+-----------------------------------------------------+
|purchase_id|product_id                                           |
+-----------+-----------------------------------------------------+
|1078       |[1, 2, 3, 6, 10, 12, 13, 14, 17, 18]                 |
|422        |[1, 2, 4, 5, 6, 7, 8, 9, 10, 13, 14, 19]             |
|357        |[2, 5, 6, 19]                                        |
|432        |[1, 5, 9, 13]                                        |
|734        |[1, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 19]           |
|916        |[1, 2, 3, 4, 5, 7, 9, 10, 11, 12, 15, 16, 19]        |
|1063       |[1, 5, 9]                                            |
|1091       |[1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 18, 19] |
|773        |[5, 6, 12]                                           |
|1148       |[1, 4, 5, 9, 10, 11, 12, 13, 17]                     |
|114        |[1, 7, 8, 11, 12, 14, 15, 16, 19]                    |
|164        |[19]                               

In [52]:
fp = FPGrowth(itemsCol='product_id', minSupport=0.1, minConfidence=0.6)
model = fp.fit(df)
model

FPGrowthModel: uid=FPGrowth_01ad03d6bc22, numTrainingRecords=1200

In [53]:
# Display frequent itemsets.
model.freqItemsets.show(10)

+-----------+----+
|      items|freq|
+-----------+----+
|       [11]| 482|
|    [11, 7]| 250|
| [11, 7, 9]| 135|
|[11, 7, 14]| 152|
| [11, 7, 4]| 141|
| [11, 7, 5]| 155|
|[11, 7, 18]| 144|
| [11, 7, 1]| 148|
|[11, 7, 19]| 132|
| [11, 7, 2]| 156|
+-----------+----+
only showing top 10 rows



In [54]:
# Display generated association rules.
model.associationRules.show(10)

+----------+----------+------------------+------------------+-------------------+
|antecedent|consequent|        confidence|              lift|            support|
+----------+----------+------------------+------------------+-------------------+
|  [15, 14]|       [9]| 0.610655737704918|1.4256554187663455|0.12416666666666666|
|  [15, 14]|       [2]|0.6065573770491803|1.3479052823315119|0.12333333333333334|
|  [15, 14]|       [8]| 0.610655737704918| 1.492437648158659|0.12416666666666666|
|    [4, 2]|       [9]| 0.637065637065637|1.4873127713594638|             0.1375|
|    [4, 2]|       [8]|0.6216216216216216|1.5192381791159795|0.13416666666666666|
|    [4, 2]|      [17]|0.6023166023166023| 1.481106399139186|               0.13|
|  [16, 14]|       [1]|0.6271186440677966|1.4669441966498167|0.12333333333333334|
|  [16, 14]|       [5]|0.6059322033898306|1.4600775985297123|0.11916666666666667|
|  [16, 14]|       [8]|0.6186440677966102|1.5119610618247092|0.12166666666666667|
|  [16, 11]|    
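
The `confidence`, `lift` and `support` columns can be reproduced by hand on a small example. The sketch below uses five invented transactions and plain Python sets; it mirrors the standard definitions rather than Spark's implementation.

```python
# Toy transactions: each set is one purchase.
transactions = [
    {1, 2, 3},
    {1, 2},
    {2, 3},
    {1, 2, 3},
    {1, 3},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def rule_metrics(antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    joint = support(antecedent | consequent)
    confidence = joint / support(antecedent)
    lift = confidence / support(consequent)
    return joint, confidence, lift

sup, conf, lift = rule_metrics({1, 2}, {3})
print(sup, round(conf, 3), round(lift, 3))
```

A lift above 1 means the consequent is bought more often alongside the antecedent than it is overall, which is the case for every rule shown in the table above.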

In [55]:
# transform checks the items of each purchase against all the association rules
# and collects the consequents of the matching rules as the prediction
model.transform(df).show(10)

+-----------+--------------------+--------------------+
|purchase_id|          product_id|          prediction|
+-----------+--------------------+--------------------+
|       1078|[1, 2, 3, 6, 10, ...|[15, 4, 5, 8, 9, ...|
|        422|[1, 2, 4, 5, 6, 7...|[17, 15, 11, 3, 1...|
|        357|       [2, 5, 6, 19]|   [17, 1, 11, 9, 8]|
|        432|       [1, 5, 9, 13]|   [2, 15, 6, 11, 8]|
|        734|[1, 4, 5, 6, 7, 8...|[9, 2, 11, 18, 3,...|
|        916|[1, 2, 3, 4, 5, 7...|[8, 17, 14, 18, 1...|
|       1063|           [1, 5, 9]|              [2, 6]|
|       1091|[1, 2, 3, 5, 6, 7...|         [14, 4, 12]|
|        773|          [5, 6, 12]|    [9, 14, 8, 2, 1]|
|       1148|[1, 4, 5, 9, 10, ...|[14, 2, 8, 15, 6, 3]|
+-----------+--------------------+--------------------+
only showing top 10 rows
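
The logic behind the `prediction` column can be sketched in a few lines of plain Python: a rule applies when its whole antecedent is in the basket, and its consequent is added unless already bought. The rules below are hypothetical and are not the ones learned by the model above.

```python
# Hypothetical rules as (antecedent, consequent) pairs of item sets.
rules = [
    ({1, 5}, {2}),
    ({5, 9}, {6}),
    ({4, 2}, {8}),
    ({1}, {5}),
]

def predict_items(basket, rules):
    """Union of consequents of every rule whose antecedent is in the basket."""
    predicted = set()
    for antecedent, consequent in rules:
        if antecedent <= basket:              # rule applies to this basket
            predicted |= consequent - basket  # skip items already bought
    return sorted(predicted)

prediction = predict_items({1, 5, 9}, rules)
print(prediction)
```

Baskets that match no rule get an empty prediction, so lowering `minSupport` or `minConfidence` is the usual lever for widening coverage at the cost of weaker rules.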

