# Content-Based Recommender System

This notebook builds a content-based recommender system using PySpark. The system focuses on three book features: popular_shelves, genre and title. Term Frequency-Inverse Document Frequency (TF-IDF) is used to convert the set of strings for each book into numerical values. Term frequency measures how often a word occurs in a document. Inverse document frequency measures how common or rare a word is across the entire dataset. Combined, the TF-IDF score increases in proportion to the occurrences of a word in a document but is offset by the number of documents that contain that word. The score therefore indicates the importance of each word within the document relative to the corpus.

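Below we build an ML Pipeline (a set of high-level APIs built on top of DataFrames) that applies TF-IDF vectorization to all strings passed. This transforms the strings to numerical data so we can compare books and calculate their similarity using a similarity score (cosine similarity). The same pipeline also produces Word2Vec embeddings (word_vec), and these are the vectors that the similarity computations later in the notebook compare.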
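
To make the TF-IDF definition concrete, here is a minimal plain-Python sketch on a toy corpus. The three "books" and their tokens are invented for illustration, and the smoothed IDF formula log((N + 1) / (df + 1)) is the variant documented for Spark MLlib; other libraries use slightly different formulas.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each "document" is the token list for one book.
docs = {
    "book_a": ["dog", "dog", "picture", "children"],
    "book_b": ["cat", "picture", "children"],
    "book_c": ["history", "war", "children"],
}

def tf_idf(docs):
    """Return {doc_id: {term: tf * idf}} with raw counts as tf and a smoothed idf."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    return {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }

scores = tf_idf(docs)
# Print the terms of book_a from most to least distinctive.
for term, score in sorted(scores["book_a"].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s}{score:.3f}")
```

In this toy corpus "children" occurs in every document, so its score is zero, while "dog" is both frequent in book_a and absent elsewhere, so it scores highest.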
 
Overview:

*   Imports
*   Data Loading
*   Text Processing and Featurization
*   Recommendations based on book list
*   Recommendations for users
*   Evaluation

## Import packages and libraries

In [1]:
import pandas as pd
import numpy as np
from ast import literal_eval

In [2]:
%%capture
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.3-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.3-bin-hadoop2.7"

In [4]:
import findspark
findspark.init()
import pyspark # only run after findspark.init()

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row, SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
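# import only the names the notebook uses, to avoid shadowing Python built-ins
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType
from pyspark.sql.functions import col, lit, monotonically_increasing_id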

from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.feature import StopWordsRemover, VectorAssembler
from pyspark.ml.feature import Word2Vec, Word2VecModel
from pyspark.ml.feature import IDF
from pyspark.ml import Pipeline, PipelineModel

# Create new config
conf = SparkConf().set("spark.driver.maxResultSize", "20g")

sc = SparkContext(appName="PythonKMeans", sparkHome="/content/spark-3.0.3-bin-hadoop2.7", conf=conf)    
sqlContext = SQLContext(sc)

spark = SparkSession.builder.master("local[*]").getOrCreate()

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Loading

In [6]:
books_df = spark.read.option("inferSchema",True) \
                .option("delimiter","|") \
                .option("header",True) \
  .csv("drive/MyDrive/ca4022_data/books_delimeter_fixed.csv")
books_df.printSchema()

root
 |-- isbn: string (nullable = true)
 |-- text_reviews_count: integer (nullable = true)
 |-- series: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- popular_shelves: string (nullable = true)
 |-- asin: string (nullable = true)
 |-- is_ebook: boolean (nullable = true)
 |-- average_rating: double (nullable = true)
 |-- kindle_asin: string (nullable = true)
 |-- similar_books: string (nullable = true)
 |-- description: string (nullable = true)
 |-- format: string (nullable = true)
 |-- link: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- num_pages: double (nullable = true)
 |-- publication_day: double (nullable = true)
 |-- isbn13: double (nullable = true)
 |-- publication_month: double (nullable = true)
 |-- edition_information: string (nullable = true)
 |-- publication_year: double (nullable = true)
 |-- url: string (nullable = true)
 |-- image_url: s

In [7]:
books_df.show(5)

+---------+------------------+--------------------+------------+-------------+--------------------+----+--------+--------------+-----------+--------------------+--------------------+---------+--------------------+--------------------+----------------+---------+---------------+----------+-----------------+-------------------+----------------+--------------------+--------------------+-------+-------------+-------+--------------------+--------------------+--------+
|     isbn|text_reviews_count|              series|country_code|language_code|     popular_shelves|asin|is_ebook|average_rating|kindle_asin|       similar_books|         description|   format|                link|             authors|       publisher|num_pages|publication_day|    isbn13|publication_month|edition_information|publication_year|                 url|           image_url|book_id|ratings_count|work_id|               title|title_without_series|   genre|
+---------+------------------+--------------------+------------+--

In [8]:
interactions_df = spark.read.option("inferSchema",True) \
                .option("delimiter",",") \
                .option("header",True) \
  .csv("drive/MyDrive/ca4022_data/new_interactions.csv")
interactions_df.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- book_id: integer (nullable = true)
 |-- is_read: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- is_reviewed: double (nullable = true)
 |-- user_count: integer (nullable = true)



In [9]:
interactions_df.show(5)

+-------+-------+-------+------+-----------+----------+
|user_id|book_id|is_read|rating|is_reviewed|user_count|
+-------+-------+-------+------+-----------+----------+
|      0|    915|      1|     5|        1.0|        15|
|      0|    873|      1|     4|        0.0|        15|
|      0|    871|      1|     2|        0.0|        15|
|      0|    870|      1|     3|        0.0|        15|
|      0|    824|      1|     5|        1.0|        15|
+-------+-------+-------+------+-----------+----------+
only showing top 5 rows



In [10]:
# create SQL view for later queries
interactions_df.createOrReplaceTempView("interactions")

## Text Processing and Featurization

**Prepare the data**

In [11]:
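# shelf-name words that we consider uninformative about a book's content
stop_words = set(['reading', 'read', 'currently', 'find', 'awaiting', 'on', 'books', 'book', 'owned', 'own', 'library', 'to', 'bookshelf', 'shelf', 'i', 'my', 'did', 'not', 'didn', 't',  'finish', 'finished', 'and', 'favorites', 'favorite', 's', 'recommend'])

def shelves_to_string(shelves_dict):
  """Convert a book's popular_shelves string into a space-separated string of shelf words."""
  shelves_list = literal_eval(shelves_dict)
  string_of_shelves = 'nothing '  # placeholder token so the string is never empty
  for shelf_dict in shelves_list:
    cnt = shelf_dict['count']
    shelf = shelf_dict['name']
    # keep 'non-fiction' as the single token 'nonfiction' rather than splitting it
    if shelf.strip() == 'non-fiction':
      shelf = ''.join(shelf.split('-'))
    # split the shelf name into words and drop the uninformative ones
    to_add = set(shelf.split('-')).difference(stop_words)
    # repeat each remaining word once per shelf count so popular shelves weigh more
    string_of_shelves += ' '.join(int(cnt) * list(to_add)) + ' '
  return string_of_shelves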

Popular_shelves are stored as a list of dictionaries of the form: [{'count': c, 'name': n}, ...].
The above function takes the list of dictionaries for a single book and produces a single string of the shelf names combined. We repeat each shelf's name as many times as its count, because the string will soon be vectorized and we want it to capture the popularity of each shelf. We also remove 'stop_words' (terms) that we deem uninformative.
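
The count weighting can be checked on a small input. The sketch below is a simplified stand-in for the function above, not a replacement: the `raw` value, the shortened stop-word set and the helper name `weighted_shelf_terms` are all made up for illustration, and it returns counts rather than a string so the result does not depend on set ordering.

```python
from ast import literal_eval
from collections import Counter

# Hypothetical raw popular_shelves value for one book (a stringified list of dicts).
raw = ("[{'count': '3', 'name': 'picture-books'}, "
       "{'count': '2', 'name': 'to-read'}, "
       "{'count': '1', 'name': 'animals'}]")

# Shortened stop-word set for illustration; the notebook's set is longer.
stop_words = {"to", "read", "books", "book"}

def weighted_shelf_terms(shelves_str):
    """Count how many times each informative shelf word would be repeated."""
    counts = Counter()
    for shelf in literal_eval(shelves_str):
        words = set(shelf["name"].split("-")) - stop_words
        for word in words:
            counts[word] += int(shelf["count"])
    return counts

terms = weighted_shelf_terms(raw)
# Rebuild the repeated-word string in sorted order so the output is deterministic.
repeated = " ".join(" ".join([word] * n) for word, n in sorted(terms.items()))
print(terms)
print(repeated)
```

The 'to-read' shelf contributes nothing because both of its words are stop words, while 'picture' is repeated three times to reflect its shelf count.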

**DataFrame with book_id and string to be vectorized**

In [12]:
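# concatenate genre, title and shelf words per book; with a bare StringType schema the column is named 'value'
shelves_df = spark.createDataFrame(books_df.rdd.map(lambda x: x.genre + ' ' + x.title + ' ' + shelves_to_string(x.popular_shelves)), StringType())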

In [13]:
shelves_df.head(1)

[Row(value='Children Dog Heaven nothing  picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture picture animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals animals children children children children children children children children children children children children children children children children children children children children children children childrens childrens childrens childrens c

In [14]:
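# attach book_id to each string by joining on a generated row id
# this relies on both DataFrames having the same partitioning and row order, since
# monotonically_increasing_id() derives ids from the partition index and row position
DF1 = books_df.withColumn("row_id", monotonically_increasing_id())
DF2 = shelves_df.withColumn("row_id", monotonically_increasing_id())
books_expanded_df = DF1.join(DF2, ("row_id")).drop("row_id")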

In [15]:
# create SQL view for later queries
books_expanded_df.createOrReplaceTempView("books")

In [16]:
book_shelves = spark.sql("SELECT book_id, value FROM books")
book_shelves.show(3)

+-------+--------------------+
|book_id|               value|
+-------+--------------------+
|  89378|Children Dog Heav...|
|  38565|Children Bootsie ...|
| 821430|Children Hug noth...|
+-------+--------------------+
only showing top 3 rows



Now we have the strings prepared for the next stage: model building.

**Create the pipeline, fit the model and transform the data**

Creating the necessary processing pipeline is a resource-intensive process. To save on computational time we only fit the pipeline once and then save the model. The saved model can then be loaded in all future runs.

In [17]:
# create text processing pipeline -- this is a lengthy, resource-intensive process (we only need to do it once)

# # Build the pipeline 
# regexTokenizer = RegexTokenizer(gaps = False, pattern = r'\w+', inputCol = 'value', outputCol = 'token')
# stopWordsRemover = StopWordsRemover(inputCol = 'token', outputCol = 'nostopwrd')
# countVectorizer = CountVectorizer(inputCol="nostopwrd", outputCol="rawFeature")
# iDF = IDF(inputCol="rawFeature", outputCol="idf_vec")
# word2Vec = Word2Vec(vectorSize = 100, minCount = 5, inputCol = 'nostopwrd', outputCol = 'word_vec', seed=123)
# vectorAssembler = VectorAssembler(inputCols=['idf_vec', 'word_vec'], outputCol='comb_vec')
# pipeline = Pipeline(stages=[regexTokenizer, stopWordsRemover, countVectorizer, iDF, word2Vec, vectorAssembler])

# # fit the model
# pipeline_mdl = pipeline.fit(book_shelves)

# #save the pipeline model
# pipeline_mdl.write().overwrite().save('drive/MyDrive/content_model/' + 'pipe_txt')

In [19]:
# load the text transformation pipeline trained model

pipeline_mdl = PipelineModel.load('drive/MyDrive/content_model/pipe_txt')

In [20]:
# transform the data

book_shelves_trf_df = pipeline_mdl.transform(book_shelves)

In [21]:
# show the transformed data

book_shelves_trf_df.select( 'value', 'nostopwrd', 'idf_vec', 'word_vec', 'comb_vec').show(10)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|               value|           nostopwrd|             idf_vec|            word_vec|            comb_vec|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|Children Dog Heav...|[children, dog, h...|(49019,[1,4,5,6,8...|[0.15958723609852...|(49119,[1,4,5,6,8...|
|Children Bootsie ...|[children, bootsi...|(49019,[0,1,5,6,1...|[0.15040656500409...|(49119,[0,1,5,6,1...|
|Children Hug noth...|[children, hug, n...|(49019,[1,5,6,8,1...|[0.15815774049439...|(49119,[1,5,6,8,1...|
|Children Our Tree...|[children, tree, ...|(49019,[1,3,5,6,8...|[0.15787570753524...|(49119,[1,3,5,6,8...|
|Children The Libr...|[children, librar...|(49019,[0,1,5,6,1...|[0.10311328777520...|(49119,[0,1,5,6,1...|
|Children The Libr...|[children, librar...|(49019,[1,3,4,5,6...|[0.22209633309716...|(49119,[1,3,4,5,6...|
|Children Nightmar...|[children, nigh

In [22]:
all_books_vecs = book_shelves_trf_df.select('book_id', 'word_vec').rdd.map(lambda x: (x[0], x[1])).collect()

In [23]:
# peek at one row

all_books_vecs[1]

(38565,
 DenseVector([0.1504, 0.0364, 0.0404, 0.0191, 0.0294, -0.0917, -0.138, 0.0872, 0.0097, 0.0633, 0.1165, 0.3216, 0.0251, 0.0311, 0.1541, 0.1191, -0.1177, -0.113, -0.0326, 0.2084, -0.1283, 0.1932, 0.0088, 0.1231, 0.1393, -0.0231, -0.0865, -0.0287, 0.1111, -0.0551, -0.0094, 0.092, 0.0835, 0.0421, 0.0066, -0.0133, 0.0281, -0.0514, 0.0228, -0.1691, 0.1419, 0.0556, -0.206, -0.0195, 0.0221, -0.1686, -0.2071, -0.0801, 0.2746, 0.075, 0.0156, -0.1161, -0.0239, 0.1799, -0.0831, -0.3074, -0.0577, 0.1436, -0.1707, -0.1385, 0.1471, 0.1858, -0.0246, -0.3295, -0.1627, 0.2134, 0.1844, 0.0038, -0.1225, -0.0963, -0.0857, 0.2226, -0.1052, 0.1876, 0.2017, 0.1356, -0.0656, 0.0589, -0.1337, -0.0966, -0.0294, -0.1284, 0.0884, -0.0023, 0.1276, -0.128, 0.0869, 0.1858, 0.2188, -0.1719, 0.0514, -0.0187, 0.0187, -0.1296, -0.1485, -0.0248, -0.0526, 0.2668, -0.2566, -0.281]))

## Get Similar Books

**Define a similarity measure**  
We will use cosine similarity to assess the similarity of any two books.

In [24]:
def CosineSim(vec1, vec2): 
    return np.dot(vec1, vec2) / np.sqrt(np.dot(vec1, vec1)) / np.sqrt(np.dot(vec2, vec2)) 
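
As a quick sanity check of the formula, here is a dependency-free sketch of the same measure. The name `cosine_sim` and the zero-vector guard are additions for illustration; `CosineSim` above divides by zero when either vector has zero length.

```python
import math

def cosine_sim(vec1, vec2):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm1 * norm2)

print(cosine_sim([1, 0], [0, 1]))    # orthogonal vectors
print(cosine_sim([1, 2], [2, 4]))    # same direction, different length
print(cosine_sim([1, 2], [-1, -2]))  # opposite direction
```

The score depends only on direction, not magnitude, so a book whose words are repeated many times is not favoured over one with the same mix of words repeated fewer times.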

**Function to display book information**

In [25]:
def getBookDetails(in_book):
    
    a = in_book.alias("a")
    b = books_df.alias("b")
    
    return a.join(b, col("a.book_id") == col("b.book_id"), 'inner').select([col('a.'+xx) for xx in a.columns] + [col('b.title'), col('b.publication_year'), col('b.genre')])

**Function to get similar books**  
This function takes a list of book IDs as input and returns, for each input book, the most similar books ranked by cosine similarity (10 by default, set by sim_books_limit).

In [26]:
def getSimilarBooks(b_ids, sim_books_limit=10):
    
    schema = StructType([   
                            StructField("book_id", IntegerType(), True)
                            ,StructField("score", DoubleType(), True)
                            ,StructField("input_book_id", IntegerType(), True)
                        ])
    
    similar_books_df = spark.createDataFrame([], schema)
    
    for b_id in b_ids:
        
        input_vec = [(r[1]) for r in all_books_vecs if r[0] == b_id][0]

        similar_book_rdd = sc.parallelize((i[0], float(CosineSim(input_vec, i[1]))) for i in all_books_vecs)

        similar_book_df = spark.createDataFrame(similar_book_rdd) \
            .withColumnRenamed('_1', 'book_id') \
            .withColumnRenamed('_2', 'score') \
            .orderBy("score", ascending = False)

            
        similar_book_df = similar_book_df.filter(col("book_id") != b_id).dropDuplicates(['book_id']).orderBy("score", ascending = False).limit(sim_books_limit)

        similar_book_df = similar_book_df.withColumn('input_book_id', lit(b_id))

        similar_books_df = similar_books_df \
                                    .union(similar_book_df)
        
    
    return similar_books_df
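
The core of the function (score every other book against the input, then keep the best few) can be sketched without Spark. This is an illustrative alternative, not the notebook's implementation: `book_vecs` holds made-up vectors standing in for `all_books_vecs`, and `top_similar` is a hypothetical helper name.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical (book_id, vector) pairs standing in for all_books_vecs.
book_vecs = [
    (1, [1.0, 0.0, 0.2]),
    (2, [0.9, 0.1, 0.3]),
    (3, [0.0, 1.0, 0.0]),
    (4, [0.8, 0.0, 0.1]),
]

def top_similar(book_id, book_vecs, k=2):
    """Return the k (book_id, score) pairs most similar to book_id, best first."""
    target = dict(book_vecs)[book_id]
    scored = (
        (cosine(target, vec), other_id)
        for other_id, vec in book_vecs
        if other_id != book_id  # never recommend the input book itself
    )
    # nlargest keeps only k candidates in memory instead of sorting everything.
    return [(other_id, score) for score, other_id in heapq.nlargest(k, scored)]

print(top_similar(1, book_vecs))
```

Because the vectors are already collected on the driver, a local top-k like this avoids creating a new RDD and DataFrame for every input book; whether that is faster in practice depends on the size of the catalogue.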

**Example 1**

In [27]:
books_df.select('book_id','title', 'publication_year', 'genre') \
    .filter(books_df.book_id.isin([412348]) == True).show(1, truncate=False)

getBookDetails(getSimilarBooks([412348])).toPandas()

+-------+---------------------------------------------------+----------------+--------+
|book_id|title                                              |publication_year|genre   |
+-------+---------------------------------------------------+----------------+--------+
|412348 |Nightmare in Death Valley (Sweet Valley High, #116)|1995.0          |Children|
+-------+---------------------------------------------------+----------------+--------+
only showing top 1 row



Unnamed: 0,book_id,score,input_book_id,title,publication_year,genre
0,12771,0.993135,412348,"A Kiss Before Dying (Sweet Valley High, #122)",1996.0,Young Adult
1,12771,0.993135,412348,"A Kiss Before Dying (Sweet Valley High, #122)",1996.0,Children
2,412190,0.99313,412348,"The Pom-Pom Wars (Sweet Valley High, #113)",1995.0,Young Adult
3,412353,0.992544,412348,"Elizabeth Betrayed (Sweet Valley High, #89)",1992.0,Young Adult
4,412353,0.992544,412348,"Elizabeth Betrayed (Sweet Valley High, #89)",1992.0,Children
5,412262,0.992341,412348,"A Date with a Werewolf (Sweet Valley High, #105)",1994.0,Young Adult
6,292386,0.991571,412348,"""V"" Is for Victory (Sweet Valley High, #114)",1995.0,Young Adult
7,292389,0.991285,412348,"Almost Married (Sweet Valley High, #102)",1995.0,Young Adult
8,412359,0.990674,412348,"Beware the Wolfman (Sweet Valley High, #106)",1994.0,Young Adult
9,412356,0.989994,412348,"Enid's Story (Sweet Valley High, Super Star #3)",1990.0,Young Adult


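These seem to be reasonable recommendations. All of the suggested books come from the same series (Sweet Valley High) and are listed in descending order of similarity score. Volumes close to the input book (#116), such as #113, #114 and #122, rank near the top, although the ordering does not strictly follow position in the series (#89 ranks above #114). Some books appear twice because they are listed under more than one genre.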

**Example 2**

In [28]:
books_df.select('book_id','title', 'publication_year', 'genre') \
    .filter(books_df.book_id.isin([6882]) == True).show(truncate=False)

getBookDetails(getSimilarBooks([6882])).toPandas()

+-------+--------+----------------+-----------------+
|book_id|title   |publication_year|genre            |
+-------+--------+----------------+-----------------+
|6882   |Papillon|2006.0          |History/Biography|
+-------+--------+----------------+-----------------+



Unnamed: 0,book_id,score,input_book_id,title,publication_year,genre
0,6883,0.964291,6882,Banco: The Further Adventures of Papillon,1985.0,History/Biography
1,7753,0.959523,6882,Fear and Loathing: The Strange and Terrible Sa...,2004.0,History/Biography
2,163258,0.954871,6882,Once in a House on Fire,2004.0,History/Biography
3,77344,0.952868,6882,Angela's Ashes,1996.0,History/Biography
4,856991,0.952868,6882,Angela's Ashes,1999.0,History/Biography
5,856990,0.952866,6882,Angela's Ashes: A Memoir of a Childhood,1997.0,History/Biography
6,344717,0.952399,6882,The Boy Who Fell Out of the Sky,,History/Biography
7,166562,0.950635,6882,Between a Rock and a Hard Place,2005.0,History/Biography
8,386990,0.949432,6882,Burned Alive,2005.0,History/Biography
9,23202,0.949048,6882,The Last American Man,,History/Biography


These also appear to be relevant recommendations. The number one recommended book (Banco: The Further Adventures of Papillon) is the sequel to the inputted book (Papillon). The remainder of the recommendations are all of the same genre (History/Biography).

## Recommendations for users

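The function below takes a user_id, selects up to five of that user's highest-rated read books (rating 3 or above), finds books similar to each of them, removes the selected books themselves from the results, and returns the top sim_books_limit matches with their details.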

In [29]:
def getContentRecoms(u_id, sim_books_limit=10,display=True):
    
    # select up to 5 of the user's highest-rated read books (rating 3 or above)
    query = """
    SELECT book_id FROM(
    SELECT distinct book_id, rating FROM interactions  
    where is_read = 1 and rating >= 3
    and user_id = {}
    order by rating desc
    limit 5) 
    """.format(int(u_id))

    usr_rev_books = sqlContext.sql(query)
    
    # from these get sample of 10 books
    # usr_rev_books = usr_rev_books.limit(10)

    usr_rev_books_det = getBookDetails(usr_rev_books)
    
    if display:
      # show the sample details
      print('\nBooks previously read and reviewed by user:')
      usr_rev_books_det.select(['book_id', 'title', 'publication_year', 'genre']).show(truncate = False)

    book_list = [i.book_id for i in usr_rev_books.collect()]

    # get books similar to the sample
    sim_books_df = getSimilarBooks(book_list, sim_books_limit)

    # filter out the input books themselves (the user's top-rated books selected above)
    s = sim_books_df.alias("s")
    r = usr_rev_books.alias("r")
    j = s.join(r, col("s.book_id") == col("r.book_id"), 'left_outer') \
         .where(col("r.book_id").isNull()) \
         .select([col('s.book_id'),col('s.score')])

    a = j.orderBy("score", ascending = False).limit(sim_books_limit)

    return getBookDetails(a).orderBy("score", ascending = False)

**Example 1**

In [41]:
content_recom_df = getContentRecoms(16)

print("Books recommended to user based on previously read and reviewed books:")
content_recom_df.toPandas()


Books previously read and reviewed by user:
+-------+-----------------------------------------------------+----------------+-----------------+
|book_id|title                                                |publication_year|genre            |
+-------+-----------------------------------------------------+----------------+-----------------+
|13119  |The Dark Side Of Genius: The Life Of Alfred Hitchcock|1999.0          |History/Biography|
|13231  |Lewis Carroll: A Biography                           |1996.0          |History/Biography|
|13140  |Jack & Jill (Alex Cross, #3)                         |2003.0          |Mystery          |
|13215  |Life Doesn't Frighten Me                             |1996.0          |Poetry           |
|13223  |The Complete Poems                                   |1996.0          |Poetry           |
+-------+-----------------------------------------------------+----------------+-----------------+

Books recommended to user based on previously read and reviewed books:

Unnamed: 0,book_id,score,title,publication_year,genre
0,21436,0.999662,"Cat and Mouse (Alex Cross, #4)",2007.0,Mystery
1,79378,0.998365,"Roses are Red (Alex Cross, #6)",2001.0,Mystery
2,581274,0.99404,"London Bridges (Alex Cross, #10)",2004.0,Mystery
3,79379,0.992481,"Violets Are Blue (Alex Cross, #7)",2002.0,Mystery
4,772163,0.977474,Notorious: The Life of Ingrid Bergman,1997.0,History/Biography
5,247921,0.97601,Katharine Hepburn,1996.0,History/Biography
6,847106,0.975287,The Million Dollar Mermaid,1999.0,History/Biography
7,474514,0.974411,Howard Hughes: The Secret Life,2004.0,History/Biography
8,782385,0.973951,Ginger: My Story,1991.0,History/Biography
9,33014,0.973891,Thomas Hardy,,History/Biography


Here, the top four recommendations are books in the Alex Cross series, which are clearly based on the input book Jack & Jill (Alex Cross, #3). The model favours books from the same series quite heavily. This is unsurprising because the title, which includes the series name, is part of the text being vectorized. It is not necessarily a bad thing: a user who read a book from a series and rated it 3 or above is likely to be interested in other books in that series. The remaining recommendations are biographies, which seem to be in line with the reader's interests.

**Example 2**

In [42]:
content_recom_df = getContentRecoms(22)

print("Books recommended to user based on previously read and reviewed books:")
content_recom_df.toPandas()


Books previously read and reviewed by user:
+-------+------------------------------------------------------------------------------------+----------------+-----------------+
|book_id|title                                                                               |publication_year|genre            |
+-------+------------------------------------------------------------------------------------+----------------+-----------------+
|14653  |Encyclopedia Brown and the Case of the Disgusting Sneakers (Encyclopedia Brown, #18)|null            |Children         |
|14652  |Encyclopedia Brown: Boy Detective (Books 1-4)                                       |2002.0          |Children         |
|14656  |Encyclopedia Brown Solves Them All (Encyclopedia Brown, #5)                         |1992.0          |Children         |
|1067   |1776                                                                                |null            |History/Biography|
|14653  |Encyclopedia Brown and the Case of t

Unnamed: 0,book_id,score,title,publication_year,genre
0,77347,1.0,1776,2005.0,History/Biography
1,864286,1.0,Faust,1999.0,Poetry
2,14704,1.0,Faust,2000.0,Poetry
3,406373,1.0,Faust,,Poetry
4,789345,0.9939,Encyclopedia Brown and the Case of the Midnigh...,1982.0,Children
5,871145,0.993244,Encyclopedia Brown Saves the Day,1982.0,Children
6,14655,0.99138,Encyclopedia Brown Finds the Clues (Encycloped...,,Children
7,14655,0.99138,Encyclopedia Brown Finds the Clues (Encycloped...,,Mystery
8,871145,0.988393,Encyclopedia Brown Saves the Day,1982.0,Children
9,24159,0.98821,Encyclopedia Brown Takes the Cake! (Encycloped...,1991.0,Children


In this example we can see one of the limitations of the dataset. The same book can be published by multiple publishers and in different formats, each with its own book_id. For this reason, one book (1776) appears in both the read list and the recommendation list, and another (Faust) is repeated three times. Looking beyond that, the rest of the suggestions seem highly relevant. The system has identified Encyclopedia Brown books as similar to each other and has picked up on this person's interest in history. The system appears to be performing as intended.
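
One possible way to soften this limitation is to collapse editions by title before presenting the list. The sketch below is only an illustration under a strong assumption, namely that rows with the same normalised title are the same work, which would wrongly merge distinct books that share a title. The rows are copied from the output above and `dedupe_by_title` is a hypothetical helper.

```python
# (book_id, title, score) rows taken from the recommendation output above.
recs = [
    (77347, "1776", 1.0),
    (864286, "Faust", 1.0),
    (14704, "Faust", 1.0),
    (406373, "Faust", 1.0),
    (871145, "Encyclopedia Brown Saves the Day", 0.993244),
    (871145, "Encyclopedia Brown Saves the Day", 0.988393),
]

def dedupe_by_title(recs, read_titles=()):
    """Keep the highest-scoring row per title and skip titles the user already read."""
    seen = {title.strip().lower() for title in read_titles}
    kept = []
    # sorted() is stable, so rows with equal scores keep their original order.
    for book_id, title, score in sorted(recs, key=lambda row: -row[2]):
        key = title.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append((book_id, title, score))
    return kept

for row in dedupe_by_title(recs, read_titles=["1776"]):
    print(row)
```

With the rows above this leaves one Faust edition and one Encyclopedia Brown Saves the Day entry, and drops 1776 because the user has already read it.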

## Evaluation

To evaluate our model we will use mean average precision (MAP). In order to make recommendations for a user we need at least one interaction present in the training set. To ensure that each user's interactions are split between the training and test sets, we apply a stratified split with a sampling fraction of 0.75 for every user. Accordingly, approximately 75% of each user's interactions will be in the training set and the remaining 25% in the test set.

In [43]:
fractions = interactions_df.select("user_id").distinct().withColumn("fraction", lit(0.75)).rdd.collectAsMap()
train = interactions_df.sampleBy("user_id", fractions, seed=10)

# Subtracting 'train' from original df to get test set 
test = interactions_df.subtract(train)
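
Because sampleBy draws each row independently, the per-user split is only approximately 75/25, and a user with very few interactions can end up with none in the training set. For comparison, here is a plain-Python sketch of an exact per-user split that always keeps at least one interaction per user for training. It is a different technique from the sampling used above; `split_per_user` and the toy `interactions` list are invented for illustration.

```python
import random
from collections import defaultdict

def split_per_user(interactions, train_fraction=0.75, seed=10):
    """Split (user_id, book_id) pairs so each user keeps ~train_fraction for training."""
    by_user = defaultdict(list)
    for user_id, book_id in interactions:
        by_user[user_id].append(book_id)

    rng = random.Random(seed)
    train, test = [], []
    for user_id, books in by_user.items():
        books = books[:]  # copy so the caller's data is not shuffled in place
        rng.shuffle(books)
        # Always keep at least one interaction per user in the training set.
        cut = max(1, round(len(books) * train_fraction))
        train += [(user_id, b) for b in books[:cut]]
        test += [(user_id, b) for b in books[cut:]]
    return train, test

# Toy data: user 0 has 8 interactions, user 1 has 4, user 2 has only 1.
interactions = [(0, b) for b in range(8)] + [(1, b) for b in range(4)] + [(2, 100)]
train, test = split_per_user(interactions)
print(len(train), len(test))
```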

In [44]:
# Get unique values in the grouping column
groups = [x[0] for x in test.select("user_id").distinct().collect()]

# Create a filtered DataFrame for each group in a list comprehension
groups_list = [test.filter(col('user_id')==x) for x in groups]

In [45]:
precision = []
# to save time, we only compute the score for the first 50 users
for group in groups[0:50]:
  user = group
  recs = getContentRecoms(user,display=False).select('book_id').collect()
  acc = test.filter(col('user_id')==user).select('book_id').collect()
  # fraction of the user's held-out books that appear in the recommendations
  prec = len(set(acc).intersection(set(recs)))/len(acc)
  precision.append(prec)

In [46]:
print('MAP: ' + str(np.mean(precision)))

MAP: 0.019385964912280697
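
Note that the per-user score computed in the loop above is the number of hits divided by the number of held-out books, without regard to rank. For reference, here is a plain-Python sketch of average precision at k as it is commonly defined, which rewards hits that appear earlier in the list. The normalisation by min(number of relevant items, k) is one common convention and an assumption here, the function names are illustrative, and the recommended list is assumed to contain no duplicates.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """Average of precision@rank over the ranks where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)

def mean_average_precision(recs_by_user, relevant_by_user, k=10):
    """Mean of average_precision_at_k over all users with recommendations."""
    users = list(recs_by_user)
    total = sum(
        average_precision_at_k(recs_by_user[u], relevant_by_user.get(u, ()), k)
        for u in users
    )
    return total / len(users)

# Hypothetical example: user "a" has hits at ranks 2 and 4, user "b" has none.
recs_by_user = {"a": [10, 20, 30, 40], "b": [1, 2, 3]}
relevant_by_user = {"a": [20, 40, 99], "b": [7]}
print(mean_average_precision(recs_by_user, relevant_by_user))
```

For user "a" the precision is 1/2 at rank 2 and 2/4 at rank 4, giving (0.5 + 0.5) / 3; averaging with user "b", who has no hits, halves that value.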
