## Modeling 2: BERT Recommender Systems

In [5]:
#Import Libraries
import pandas as pd
import tensorflow as tf 
import tensorflow_hub as hub 
import numpy as np
from tensorflow.keras import layers 
import bert
import re
from sklearn.metrics.pairwise import cosine_similarity
from keras.preprocessing.sequence import pad_sequences
import pickle 

#Load Data Frame
df = pd.read_csv('final_hbo_data_3.csv', index_col=0)

The code block below contains the pre-processing steps needed to use BERT. The first function removes punctuation, numbers, single characters, and extra spaces from the plot summaries. The second function removes hyphens and spaces from the MPAA/TV ratings (the rating column). The code is adapted from this [link](https://stackabuse.com/text-classification-with-bert-tokenizer-and-tf-2-0-in-python/).

In [9]:
#Remove punctuation, numbers, single characters, and extra spaces from plot summaries
def text_processing(sen):
  text = re.sub('[^a-zA-Z]', ' ', sen)
  text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
  text = re.sub(r'\s+', ' ', text)
  return text

#Remove hyphens and spaces from the MPAA/TV rating
def rating_preprocesing(text):
  text =  re.sub('-', '', text)
  text = re.sub(' ', '', text)
  return text

#Prepare BertTokenizer. 
#See link https://stackabuse.com/text-classification-with-bert-tokenizer-and-tf-2-0-in-python/ for explanation
bert_tokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert_tokenizer(vocabulary_file, to_lower_case)

#Tokenize and Vectorize plot column
df['clean_plot'] = df['plot'].map(text_processing)
df['token_plot'] = df['clean_plot'].map(tokenizer.tokenize)
df['id_plot'] = df['token_plot'].map(tokenizer.convert_tokens_to_ids)

#Tokenize and Vectorize genre column
df['clean_genre'] = df['genre'].map(text_processing)
df['token_genre'] = df['clean_genre'].map(tokenizer.tokenize)
df['id_genre'] = df['token_genre'].map(tokenizer.convert_tokens_to_ids)

#Tokenize and Vectorize rating column  
df['token_rating'] = df['rating'].map(tokenizer.tokenize)
df['id_rating'] = df['token_rating'].map(tokenizer.convert_tokens_to_ids)
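
To make the effect of the two cleaning functions concrete, here is a small sketch that applies them to sample strings. The functions are copied from the cell above; the sample sentence and the rating strings are made up for illustration and are not taken from the data set.

```python
import re

def text_processing(sen):
    # Replace every non-letter character with a space
    text = re.sub('[^a-zA-Z]', ' ', sen)
    # Drop single letters that are surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # Collapse runs of whitespace into one space
    text = re.sub(r'\s+', ' ', text)
    return text

def rating_preprocesing(text):
    # Remove hyphens, then spaces
    text = re.sub('-', '', text)
    text = re.sub(' ', '', text)
    return text

# Hypothetical inputs, for illustration only
clean_sentence = text_processing("In 1997, a boy's dog ran 5 miles!")
clean_ratings = [rating_preprocesing(r) for r in ['TV-MA', 'PG-13', 'Not Rated']]

print(repr(clean_sentence))  # 'In boy dog ran miles '
print(clean_ratings)         # ['TVMA', 'PG13', 'NotRated']
```

Note that the cleaned sentence keeps a trailing space, because the final substitution collapses whitespace but does not strip it; calling `.strip()` on the result would remove it if that matters downstream.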

### Recommender 4: Content-Based Recommender using BERT (based on vectorized genre, plot, and MPAA/TV ratings)

Although Recommender 3 does a good job of surfacing relevant content, I was curious to see whether changing the vectorizer from TfidfVectorizer to BERT would affect the results.

To start, I tokenized and vectorized the plot, genre, and MPAA/TV ratings. The resulting id sequences are then concatenated into a single matrix, which is used as the input for calculating the cosine similarity.
- Note: The arrays need to be the same size. To rectify this, I padded each sequence to a maximum length that depends on the column (genre = 3, rating = 2, and plot = 30 in Recommender 4 or 50 in Recommender 5).
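
The padding and concatenation steps described above can be sketched on toy data as follows. This is only an illustration: the token ids are invented, `pad_pre` is a hypothetical NumPy stand-in that mimics the default behavior of Keras `pad_sequences` (zeros added at the front, long sequences truncated from the front), and the cosine similarity is computed by hand instead of with scikit-learn.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Hypothetical stand-in for pad_sequences with default settings:
    left-pad with zeros and keep only the last `maxlen` ids."""
    out = np.zeros((len(seqs), maxlen), dtype=np.int64)
    for row, seq in enumerate(seqs):
        kept = list(seq)[-maxlen:]
        if kept:
            out[row, maxlen - len(kept):] = kept
    return out

# Invented token ids for three titles
genre_ids = [[101, 102], [101], [300, 301, 302, 303]]
rating_ids = [[7], [7, 8], [9]]

genre = pad_pre(genre_ids, maxlen=3)    # shape (3, 3)
rating = pad_pre(rating_ids, maxlen=2)  # shape (3, 2)

# Concatenate the feature blocks column-wise into one matrix
bert_matrix = np.concatenate([rating, genre], axis=1)  # shape (3, 5)

# Cosine similarity between every pair of rows
unit = bert_matrix / np.linalg.norm(bert_matrix, axis=1, keepdims=True)
cosine_sim = unit @ unit.T

print(bert_matrix)
print(cosine_sim.round(3))
```

Every row ends up with the same number of columns, which is what allows the pairwise similarity matrix to be computed in one step.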

In [18]:
def recommender_4(title, num=5):
  data = df.copy()

  #Pad the vectorized ids for genre, rating, and plot. 
  genre = pad_sequences(data['id_genre'], maxlen = 3)
  rating = pad_sequences(data['id_rating'], maxlen = 2)
  plot = pad_sequences(data['id_plot'], maxlen= 30)
  
  #Combine the rating, genre, and plot arrays into a single matrix
  bert_matrix = np.append(rating, genre, axis=1)
  bert_matrix = np.append(bert_matrix, plot, axis=1)
  
  #Find the cosine similarity of bert_matrix
  #Set up a Series that maps lowercase content titles to their row indices
  cosine_sim = cosine_similarity(bert_matrix, bert_matrix)
  indices = pd.Series(data.index, index=data.title.str.lower())
  
  #Sort the similarity scores and isolate the top n content indices
  #Filter data with the isolated indices and return the final recommendations
  score = sorted(list(enumerate(cosine_sim[indices[title.lower()]])), key=lambda x: x[1], reverse=True)
  titles_index = [i[0] for i in score[1:num+1]]
  sort_recom =  data.iloc[titles_index]
  return sort_recom[['id', 'title', 'year', 'plot', 'genre', 'rating', 'imdb_rating', 'type']]

recommender_4('south park')

| | id | title | year | plot | genre | rating | imdb_rating | type |
|---|---|---|---|---|---|---|---|---|
| 276 | 16069 | Batman Forever | 1995 | The Dark Knight of Gotham City confronts a das... | Fantasy ActionandAdventure | PG-13 | 5.4 | movie |
| 1471 | 76011 | My Brilliant Career | 1979 | A young woman who is determined to maintain he... | Drama Romance | G | 7.1 | movie |
| 1485 | 109142 | Godzilla Raids Again | 1955 | Two fishing scout pilots make a startling disc... | ScienceFiction Horror KidsandFamily ActionandA... | Approved | 5.9 | movie |
| 1253 | 132147 | For All Mankind | 1989 | A testament to NASA's Apollo program of the 19... | Documentary History | Not Rated | 8.2 | movie |
| 404 | 204163 | War Dogs | 2016 | Based on the true story of two young men, Davi... | Crime Drama Comedy WarandMilitary | R | 7.1 | movie |

**Analysis:** This recommender performs worse than Recommender 3. Its recommendations appear essentially random, with no clear basis in genre or MPAA/TV rating, and none of the recommended titles are similar to South Park. A possible explanation is that the similarity scores are computed directly on the padded token ids, so they reflect how the raw id values line up position by position rather than what the text means. Since the plot contributes more ids than the other features, it also carries more weight in the similarity calculation.
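
One way to probe this explanation is to look at how cosine similarity behaves on raw id vectors. The sketch below uses three invented id sequences (not real BERT vocabulary entries from this data set): `b` shares no ids with `a` but has nearly the same numeric values, while `c` contains exactly the same ids as `a` in reverse order.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented id sequences, for illustration only
a = [2000, 3000, 4000]
b = [2001, 3001, 4001]  # no ids in common with a, but numerically close
c = [4000, 3000, 2000]  # same ids as a, different order

sim_different_tokens = cosine(a, b)
sim_same_tokens_reordered = cosine(a, c)

print(round(sim_different_tokens, 4))
print(round(sim_same_tokens_reordered, 4))
```

In this toy case the pair with entirely different ids scores higher than the pair with identical ids in a different order, which suggests that similarity on raw ids tracks numeric closeness and position more than shared vocabulary.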

### Recommender 5: Content-Based Recommender using BERT (based on vectorized genre, plot, and numerically encoded MPAA/TV ratings)

To improve the results, I combined the methodologies used for Recommender 3 and Recommender 4, with the aim of giving the MPAA/TV rating a more significant part in determining the similarity scores.

In [17]:
def recommender_5(title, num=5):
  data = df.copy()
  ratings = {'Not Rated': 0,'Approved': 0, 'Passed': 0,'TV-Y': 1, 'TV-Y7': 2,'TV-G': 3, 'G': 3, 'TV-PG': 4,
               'PG': 4,'PG-13': 8,'TV-14': 9,'R': 13,'TV-MA': 14,'NC-17': 15}
    
  #Convert MPAA/TV ratings to numerical equivalent and average the IMDB and TMDB scores. 
  #Isolate the two features into one data frame
  data['rating_score'] = data['rating'].map(ratings)
  data['average'] = ((data['imdb_rating'] + data['tmdb_rating'])/2)
  num_info = data[['average', 'rating_score']]

  #Pad the genre and plot ids 
  genre = pad_sequences(data['id_genre'], maxlen = 3)
  plot = pad_sequences(data['id_plot'], maxlen= 50)

  #Combine the padded genre and plot arrays with the isolated features (num_info)
  bert_matrix = np.append(genre, plot, axis=1)
  bert_matrix = np.append(bert_matrix, num_info, axis=1)
  
  #Find the cosine similarity, setup content series, sort similarity scores, and isolate similar content
  cosine_sim = cosine_similarity(bert_matrix, bert_matrix)
  indices = pd.Series(data.index, index=data.title.str.lower())
  score = sorted(list(enumerate(cosine_sim[indices[title.lower()]])), key=lambda x: x[1], reverse=True)
  #enumerate() yields row positions, so select by position with iloc (as in recommender_4)
  recomm_content = data.iloc[[i[0] for i in score[1:num+1]]]
  return recomm_content[['id', 'title', 'year', 'plot', 'genre', 'rating', 'imdb_rating', 'type']]

recommender_5('south park')

| | id | title | year | plot | genre | rating | imdb_rating | type |
|---|---|---|---|---|---|---|---|---|
| 1471 | 76011 | My Brilliant Career | 1979 | A young woman who is determined to maintain he... | Drama Romance | G | 7.1 | movie |
| 404 | 204163 | War Dogs | 2016 | Based on the true story of two young men, Davi... | Crime Drama Comedy WarandMilitary | R | 7.1 | movie |
| 306 | 449608 | The Art of Racing in the Rain | 2019 | A family dog—with a near-human soul and a phil... | Comedy Drama Romance Sport | PG | 7.5 | movie |
| 143 | 363037 | Pokémon Detective Pikachu | 2019 | In a world where people collect pocket-size mo... | ActionandAdventure Fantasy Comedy KidsandFamil... | PG | 6.6 | movie |
| 545 | 31383 | Yes Man | 2008 | Carl Allen has stumbled across a way to shake ... | Comedy Romance | PG-13 | 6.8 | movie |

**Analysis:** Recommender 5 does a better job than Recommender 4. Some of the recommended content is similar, such as Yes Man. However, there are still some errors. As with Recommender 4, the limitations of this model lie in the vectorized matrix. The genre is most likely being overshadowed by the plot, which reduces its effect. In the future, I would explore other methods of incorporating genre into the recommender. I could also add more features, such as directors, actors, and runtime, to improve the results.
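
One possible way to keep the plot from overshadowing the genre, offered only as a sketch and not something evaluated in this notebook, is to normalize each feature block separately and give each block an explicit weight before concatenating. The helper names (`unit_rows`, `combine_blocks`, `cosine_matrix`), the toy ids, and the weights below are all hypothetical.

```python
import numpy as np

def unit_rows(block):
    """Scale every row to unit L2 norm (all-zero rows are left as zeros)."""
    block = np.asarray(block, dtype=float)
    norms = np.linalg.norm(block, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return block / norms

def combine_blocks(blocks, weights):
    """Normalize each block row-wise, scale it by sqrt(weight), and concatenate."""
    scaled = [np.sqrt(w) * unit_rows(b) for b, w in zip(blocks, weights)]
    return np.concatenate(scaled, axis=1)

def cosine_matrix(matrix):
    """Pairwise cosine similarity between the rows of a matrix."""
    unit = unit_rows(matrix)
    return unit @ unit.T

# Invented padded ids for three titles
genre = [[0, 101, 102], [0, 0, 101], [301, 302, 303]]
plot = [[5, 6, 7, 8], [5, 6, 7, 9], [900, 1, 2, 3]]

# Hypothetical weights: genre and plot contribute equally
combined = combine_blocks([genre, plot], weights=[0.5, 0.5])
cosine_sim = cosine_matrix(combined)
print(cosine_sim.round(3))
```

With weights that sum to 1 and no all-zero rows, the combined similarity equals the weighted average of the per-block similarities, so the influence of each feature no longer depends on how many columns it occupies. This still operates on raw token ids, so it addresses only the balance between features, not the quality of the representation itself.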