# Hybrid Model

This notebook combines the collaborative filtering system and the content-based system developed earlier. We take a simple approach: for a given user, we compute the top 500 recommendations from each system, sum the two scores for every book that appears in both lists, and rank the books by the combined score. The books with the highest combined scores are recommended.
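The combination step can be sketched in plain Python before looking at the Spark version. This is a minimal illustration, not the notebook's implementation: the function name `combine_scores` and the scores are hypothetical, and it assumes a book must appear in both candidate lists to be ranked, which matches the inner join used by `hybrid_mod` in this notebook.

```python
def combine_scores(collab_scores, content_scores, top_n=10):
    """Sum two {book_id: score} dictionaries over shared books and rank the result."""
    # Keep only books present in both candidate lists (inner-join behaviour).
    shared = collab_scores.keys() & content_scores.keys()
    combined = {b: collab_scores[b] + content_scores[b] for b in shared}
    # Highest combined score first; ties are broken by book id for determinism.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical scores for illustration only (not taken from the dataset).
collab = {101: 0.60, 102: 0.40, 103: 0.90}
content = {101: 0.95, 102: 0.97, 104: 0.99}

top = combine_scores(collab, content, top_n=2)
print(top)
```

Books 103 and 104 are dropped because each is scored by only one system. Since the two scores are simply added, whichever system produces the wider spread of scores has more influence on the final order.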

## Imports

In [None]:
%%capture
import pandas as pd
import numpy as np
from ast import literal_eval

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.3-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.3-bin-hadoop2.7"

import findspark
findspark.init()
import pyspark 

# ! pip install pyspark
from pyspark.ml.evaluation import RegressionEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
# Note: `round` and `abs` shadow the Python built-ins with the Spark column functions
from pyspark.sql.functions import explode, col, round, abs, when, lit
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, TrainValidationSplit
from pyspark.sql import SparkSession

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import sys
sys.path.append('drive/MyDrive/')

from content_based import *
from collaborative_filtering import *

Mounted at /content/drive


## Build Model

In [None]:
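def hybrid_mod(u_id, sim_books_limit=10):
  """Return the top `sim_books_limit` books for user `u_id`, ranked by combined score."""
  # `spark` and `df` are assumed to be provided by the modules imported above.
  # Collaborative-filtering scores for this user; columns are renamed to avoid clashes in the join.
  collab_recs = (spark.createDataFrame(df[df.user_id == u_id])
                 .withColumnRenamed("score", "collab_score")
                 .withColumnRenamed("book_id", "book_id_col"))
  # Top 500 content-based recommendations (True also prints the user's reading history).
  content_recs = getContentRecoms(u_id, 500, True).withColumnRenamed("score", "cont_score")

  # Inner join: keep only books scored by both systems.
  joined_top = content_recs.join(collab_recs, content_recs.book_id == collab_recs.book_id_col, 'inner')
  joined_top = joined_top.select(col("book_id"), col("title"), col("genre"), col("cont_score"), col("collab_score"))
  joined_top = joined_top.withColumn('comb_score', col("cont_score") + col("collab_score"))
  # Deduplicate, then keep the books with the highest combined scores.
  joined_top = joined_top.dropDuplicates(['book_id']).orderBy("comb_score", ascending=False).limit(sim_books_limit)

  return joined_top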

**Examples**

In [None]:
hybrid_mod(190).toPandas()


Books previously read and reviewed by user:
+-------+-----------------------+----------------+-----------+
|book_id|title                  |publication_year|genre      |
+-------+-----------------------+----------------+-----------+
|50041  |Olivia Forms a Band    |2006.0          |Children   |
|7992   |The Great Redwall Feast|2000.0          |Children   |
|1226   |Life of Pi             |null            |Fantasy    |
|1387   |The Odyssey            |1999.0          |Poetry     |
|50053  |The Contender          |1996.0          |Young Adult|
+-------+-----------------------+----------------+-----------+



| | book_id | title | genre | cont_score | collab_score | comb_score |
|---|---|---|---|---|---|---|
| 0 | 37736 | Forever... | Young Adult | 0.928679 | 0.591533 | 1.520212 |
| 1 | 189442 | Corgiville Fair | Children | 0.969146 | 0.43579 | 1.404936 |
| 2 | 293724 | The Lottery | Young Adult | 0.96291 | 0.42972 | 1.392629 |
| 3 | 258171 | Nailed | Young Adult | 0.959539 | 0.430251 | 1.38979 |
| 4 | 8073 | Cloudy With a Chance of Meatballs | Children | 0.964862 | 0.420132 | 1.384995 |
| 5 | 190670 | Curious George Flies a Kite | Children | 0.948533 | 0.433191 | 1.381724 |
| 6 | 87784 | To Kill a Mockingbird | History/Biography | 0.933619 | 0.422761 | 1.356379 |
| 7 | 32531 | Gilgamesh | Poetry | 0.972073 | 0.376475 | 1.348549 |
| 8 | 259068 | Shug | Young Adult | 0.938798 | 0.409415 | 1.348213 |
| 9 | 240142 | Big Mouth and Ugly Girl | Young Adult | 0.958178 | 0.383951 | 1.342129 |


This set of recommendations appears relevant: nine of the ten books fall in genres the user has already read (Children, Young Adult, and Poetry), so they could plausibly interest this user.

In [None]:
hybrid_mod(23227).toPandas()


Books previously read and reviewed by user:
+-------+-------------------------------------------------------+----------------+-----------------+
|book_id|title                                                  |publication_year|genre            |
+-------+-------------------------------------------------------+----------------+-----------------+
|13194  |Green Arrow, Vol. 1: Quiver                            |2008.0          |Comics           |
|1067   |1776                                                   |null            |History/Biography|
|305825 |Born of a Woman                                        |1994.0          |History/Biography|
|1376   |The Iliad                                              |2003.0          |Poetry           |
|24603  |The Writer of Modern Life: Essays on Charles Baudelaire|2006.0          |Poetry           |
+-------+-------------------------------------------------------+----------------+-----------------+



| | book_id | title | genre | cont_score | collab_score | comb_score |
|---|---|---|---|---|---|---|
| 0 | 59984 | Daredevil: Love and War | Comics | 0.961979 | 0.801475 | 1.763454 |
| 1 | 333731 | Supreme Power, Volume 1: Contact | Comics | 0.981548 | 0.722443 | 1.703992 |
| 2 | 22373 | Kill Your Boyfriend | Comics | 0.982968 | 0.640662 | 1.62363 |
| 3 | 43612 | Runaways, Vol. 2: Teenage Wasteland (Runaways... | Comics | 0.959915 | 0.637246 | 1.597161 |
| 4 | 160706 | Secret Six: Six Degrees of Devastation | Comics | 0.98904 | 0.593619 | 1.582659 |
| 5 | 43555 | The Best American Comics 2006 | Comics | 0.9678 | 0.610608 | 1.578409 |
| 6 | 232005 | Starman, Vol. 4: Times Past | Comics | 0.982057 | 0.590588 | 1.572644 |
| 7 | 6655 | The Divine Comedy | Poetry | 0.958185 | 0.612632 | 1.570817 |
| 8 | 542826 | Identity Crisis | Comics | 0.982127 | 0.574605 | 1.556731 |
| 9 | 145136 | Fury MAX | Comics | 0.975236 | 0.575458 | 1.550694 |


These recommendations capture the user's interest in comics and poetry: nine of the ten books are Comics titles and one is Poetry.

## Evaluation

In [None]:
fractions = interactions_df.select("user_id").distinct().withColumn("fraction", lit(0.75)).rdd.collectAsMap()
train = interactions_df.sampleBy("user_id", fractions, seed=10)

# Subtracting 'train' from original df to get test set 
test = interactions_df.subtract(train)
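The split above can be illustrated without Spark. `sampleBy` keeps each row independently with the probability given for its stratum, so the 75% share per user is approximate rather than exact, and `subtract` behaves like a set difference. The sketch below mimics that logic on hypothetical `(user_id, book_id)` tuples; the helper name `stratified_split` is an assumption made for illustration.

```python
import random


def stratified_split(rows, fractions, seed=10):
    """Split (user_id, book_id) rows into train/test, sampling each row independently."""
    rng = random.Random(seed)
    # Keep a row with the probability assigned to its user; unknown users get 0.
    train = [r for r in rows if rng.random() < fractions.get(r[0], 0.0)]
    # Set difference mirrors `subtract`: rows not in train, with duplicates collapsed.
    test = sorted(set(rows) - set(train))
    return train, test


# Hypothetical interactions: 2 users with 100 books each.
rows = [(u, b) for u in (1, 2) for b in range(100)]
fractions = {u: 0.75 for u in (1, 2)}

train, test = stratified_split(rows, fractions)
print(len(train), len(test))
```

Because the sampling is per row, a user with very few interactions can end up with no rows in the test set, which is worth checking before computing per-user metrics.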

In [None]:
# Get unique values in the grouping column
groups = [x[0] for x in test.select("user_id").distinct().collect()]

# Create a filtered DataFrame for each group in a list comprehension
groups_list = [test.filter(col('user_id')==x) for x in groups]

In [None]:
precision = []
# To save time, we compute precision for only the first 25 users
for user in groups[0:25]:
  recs = getContentRecoms(user, display=False).select('book_id').collect()
  acc = test.filter(col('user_id') == user).select('book_id').collect()
  # Share of the user's held-out books that appear among the recommendations
  prec = len(set(acc).intersection(set(recs))) / len(acc)
  precision.append(prec)
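The loop above divides the number of hits by the number of held-out books, which is usually called recall; precision in the strict sense divides by the number of recommendations instead. The following sketch shows both metrics for a single user. The function name `precision_recall` and the book ids are hypothetical and only serve as an illustration.

```python
def precision_recall(recommended, relevant):
    """Return (precision, recall) for one user's recommendations."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    # Precision: share of recommended books that the user actually interacted with.
    precision = hits / len(recommended) if recommended else 0.0
    # Recall: share of the user's held-out books that were recommended.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical book ids for illustration only.
recs = [10, 11, 12, 13]
held_out = [11, 13, 99]

p, r = precision_recall(recs, held_out)
print(p, r)
```

Averaging either value over the sampled users gives a single summary figure for the model.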