## Simple and NLP-Based Recommender Systems

**Note:** Code for the recommenders can be found in `models.py` in the `script` directory. Each recommender is labeled and arranged in the same order as this notebook. The recommendations are evaluated against my own judgment and Google's recommendations.

In [2]:
# Import libraries (autoreload is an IPython extension, loaded via the magic below)
import script.functions as func
import script.models as model
%load_ext autoreload
%autoreload 2

In [3]:
import pandas as pd 
df = pd.read_csv('../data/final_hbo_data_2.csv')

### Simple Recommender 

This recommender aims to build a generalized system that shows the top movies/shows per genre. The idea behind it is that people tend to like popular movies/shows with high ratings. The setup includes engineering a new feature that multiplies the popularity score by the IMDB rating. This feature lowers the score of movies/shows with low popularity and low IMDB ratings while magnifying those with high popularity and high ratings. It is a straightforward model, since it primarily sorts the data according to the arguments provided and returns only titles within the 95th percentile.
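The scoring idea above can be sketched as follows. This is a minimal illustration, not the actual `model.top_content` code: the `popularity` column name, the `quantile` argument, and the order in which the percentile cutoff and the genre filter are applied are all assumptions.

```python
import pandas as pd


def top_content_sketch(df, genre=None, content_type=None, rank=5, quantile=0.95):
    """Rank titles by popularity * IMDB rating (illustrative, not model.top_content)."""
    scored = df.copy()
    # Engineered feature: popularity multiplied by the IMDB rating.
    scored["score"] = scored["popularity"] * scored["imdb_rating"]
    # Assumed cutoff: keep titles at or above the given quantile of the score.
    scored = scored[scored["score"] >= scored["score"].quantile(quantile)]
    if genre is not None:
        # Genres are stored as one space-separated string, e.g. "Drama Comedy".
        scored = scored[scored["genre"].str.contains(genre, case=False, regex=False)]
    if content_type is not None:
        scored = scored[scored["type"] == content_type]
    return scored.sort_values("score", ascending=False).head(rank)


# Toy data; the `popularity` values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "popularity": [90, 80, 10, 70, 5, 60],
    "imdb_rating": [9.0, 8.5, 9.5, 7.0, 6.0, 8.0],
    "genre": ["Drama", "Comedy", "Drama", "Drama Crime", "Comedy", "Animation Comedy"],
    "type": ["show", "show", "movie", "movie", "show", "show"],
})

# A lower quantile is used here only because the toy table is tiny.
top_drama = top_content_sketch(toy, genre="drama", quantile=0.5)
print(top_drama[["title", "score"]])
```

Note that title C has the highest IMDB rating in the toy table but is filtered out because its low popularity pulls its score down, which is the behavior described above.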

In [4]:
#Returns the top 5 movies/shows from HBO Max library
model.top_content(rank=5)

| Unnamed: 0 | title | year | plot | genre | rating | imdb_rating | type |
|---|---|---|---|---|---|---|---|
| 14 | Friends | 1994 | The misadventures of a group of friends as the... | Drama Comedy Romance | TV-14 | 8.9 | show |
| 7 | South Park | 1997 | Follows the misadventures of four irreverent g... | Animation Comedy | TV-MA | 8.7 | show |
| 6 | Doctor Who | 2005 | The Doctor is a Time Lord: a 900 year old alie... | Drama ScienceFiction ActionandAdventure Kidsan... | TV-PG | 8.7 | show |
| 39 | Game of Thrones | 2011 | Seven noble families fight for control of the ... | Drama ScienceFiction ActionandAdventure Fantas... | TV-MA | 9.5 | show |
| 24 | Joker | 2019 | During the 1980s, a failed stand-up comedian i... | Crime MysteryandThriller Drama | R | 8.5 | movie |


In [5]:
#Returns the top 5 movies/shows in the drama category
model.top_content(genre='drama', rank=5)

| Unnamed: 0 | title | year | plot | genre | rating | imdb_rating | type |
|---|---|---|---|---|---|---|---|
| 14 | Friends | 1994 | The misadventures of a group of friends as the... | Drama Comedy Romance | TV-14 | 8.9 | show |
| 6 | Doctor Who | 2005 | The Doctor is a Time Lord: a 900 year old alie... | Drama ScienceFiction ActionandAdventure Kidsan... | TV-PG | 8.7 | show |
| 39 | Game of Thrones | 2011 | Seven noble families fight for control of the ... | Drama ScienceFiction ActionandAdventure Fantas... | TV-MA | 9.5 | show |
| 24 | Joker | 2019 | During the 1980s, a failed stand-up comedian i... | Crime MysteryandThriller Drama | R | 8.5 | movie |
| 2 | Lovecraft Country | 2020 | The anthology horror series follows 25-year-ol... | Drama MysteryandThriller ScienceFiction Fantas... | TV-MA | 7.5 | show |


In [6]:
#Top 5 drama shows
model.top_content(genre='drama', rank=5, content_type='show')

| Unnamed: 0 | title | year | plot | genre | rating | imdb_rating | type |
|---|---|---|---|---|---|---|---|
| 14 | Friends | 1994 | The misadventures of a group of friends as the... | Drama Comedy Romance | TV-14 | 8.9 | show |
| 6 | Doctor Who | 2005 | The Doctor is a Time Lord: a 900 year old alie... | Drama ScienceFiction ActionandAdventure Kidsan... | TV-PG | 8.7 | show |
| 39 | Game of Thrones | 2011 | Seven noble families fight for control of the ... | Drama ScienceFiction ActionandAdventure Fantas... | TV-MA | 9.5 | show |
| 2 | Lovecraft Country | 2020 | The anthology horror series follows 25-year-ol... | Drama MysteryandThriller ScienceFiction Fantas... | TV-MA | 7.5 | show |
| 322 | InuYasha | 2000 | Kagome Higurashi is a modern day young girl wh... | Drama Animation ScienceFiction ActionandAdvent... | TV-14 | 7.9 | show |


**Analysis:** The recommender above has serious limitations. First, it lacks user personalization: it returns the same recommendations to everyone. Second, the associations among the recommended movies/shows are relatively shallow, since it only considers genre, IMDB rating, and popularity.

### Recommender 1: Content-Based Recommender using TfidfVectorizer (based on genre and plot)

To resolve the limitations mentioned above and add diversity to the recommendations, a content-based recommender system is implemented. The model aims to provide mixed results in which the suggested titles are similar in plot summary and genre. This approach adds a layer of complexity to the recommender, since it can group together content that is more similar. For this recommender, the plot and genre are vectorized separately. The generated document-term matrices are then combined into one, which is used to calculate the cosine similarity.
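The procedure described above (vectorize plot and genre separately, stack the matrices, compute cosine similarity) might look roughly like the sketch below. The function name `recommend_sketch`, the toy titles, and the vectorizer settings are assumptions for illustration; the actual `model.recommender_1` may differ.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up toy catalogue (not rows from the HBO Max data).
toy = pd.DataFrame({
    "title": ["Cartoon Town", "Space Dad", "Court Drama", "Tiny Explorers", "Iron Throne"],
    "plot": [
        "Four foul-mouthed kids cause chaos in a small mountain town",
        "A reckless scientist drags his grandson on chaotic adventures across space",
        "A young lawyer fights a corrupt judge in a tense courtroom battle",
        "Kids learn counting and colors with friendly animals",
        "Noble families wage war for control of a kingdom",
    ],
    "genre": [
        "Animation Comedy",
        "Animation Comedy ScienceFiction",
        "Drama Crime",
        "Animation KidsandFamily",
        "Drama Fantasy",
    ],
})


def recommend_sketch(df, title, n=3):
    """Return the n titles most similar to `title` by plot and genre."""
    df = df.reset_index(drop=True)
    # Vectorize plot and genre separately, then stack the two matrices side by side.
    plot_matrix = TfidfVectorizer(stop_words="english").fit_transform(df["plot"])
    genre_matrix = TfidfVectorizer().fit_transform(df["genre"])
    combined = hstack([plot_matrix, genre_matrix]).tocsr()
    similarity = cosine_similarity(combined)
    # Locate the query title (case-insensitive) and rank all other titles.
    query = df.index[df["title"].str.lower() == title.lower()][0]
    ranked = [i for i in similarity[query].argsort()[::-1] if i != query][:n]
    return df.loc[ranked, ["title", "genre"]]


print(recommend_sketch(toy, "cartoon town", n=2))
```

In this toy example the children's title is ranked near the adult cartoon because both share the Animation genre and the word "kids", which mirrors the audience mismatch discussed in the analysis of this recommender.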

In [7]:
model.recommender_1('south park', 10)

| Unnamed: 0 | title | year | plot | genre | rating | imdb_rating | type |
|---|---|---|---|---|---|---|---|
| 201 | Aqua Teen Hunger Force | 2000 | The surreal adventures of three anthropomorphi... | Animation Comedy | TV-MA | 7.7 | show |
| 176 | Food Wars! Shokugeki no Soma | 2015 | Yukihira Souma's dream is to become a full-tim... | Animation Drama Comedy | TV-MA | 8.2 | show |
| 496 | Close Enough | 2020 | A surreal take on transitioning from 20-someth... | Drama Animation Comedy | TV-14 | 8.4 | show |
| 1491 | Animals. | 2016 | An animated comedy focusing on the downtrodden... | Animation Comedy Drama | TV-MA | 7.2 | show |
| 1761 | Crashbox | 1999 | Crashbox is an educational children's televisi... | Animation | TV-Y7 | 8.4 | show |
| 902 | Turu, the Wacky Hen | 2020 | Like a love song to differences, Turu, the Wac... | Animation | Not Rated | 5.5 | movie |
| 1977 | El Perro y el Gato | 2011 | El Perro y el Gato is a Spanish language anima... | Animation | Not Rated | 8.1 | show |
| 1507 | Tig n' Seek | 2020 | Follow 8-year-old Tiggy and his gadget-buildin... | Animation | Not Rated | 7.5 | show |
| 1596 | Lego Batman: The Movie - DC Super Heroes Unite | 2013 | Joker teams up with Lex Luthor to destroy the ... | Animation | G | 6.5 | movie |
| 1883 | DC Super Hero Girls: Intergalactic Games | 2017 | Super Hero High is facing off against Korugar ... | Animation | Not Rated | 5.4 | movie |


**Analysis:** Compared to the Simple Recommender, this one does a better job of personalizing content. Based on the results, the top 4 recommendations are similar to South Park. Past that point, however, we start to observe discrepancies in the output. For instance, the titles from Crashbox onward are children's shows. Their content is wholesome compared to South Park, which contains stronger language and more explicit, mature content. A limitation of this recommender is that it doesn't consider MPAA/TV ratings. As the results show, the discrepancies arise from differences in the content's intended audience.

### Recommender 2: Content-Based Recommender using TfidfVectorizer (based on genre, plot, and predicted probabilities of MPAA/TV ratings generated from a Logistic Regression)

A logistic regression was used to add complexity to the recommender. The model was set up to perform multi-class classification of MPAA/TV ratings. The features used to train the model are plot summary, genre, IMDB rating, TMDB rating, popularity score, year released, and type of content (show or movie). After adding the vectorized plot and genre to the numerical and binarized features, I ended up with over 60,000 columns, so PCA is employed to reduce dimensionality. The logic of this step is that the predicted probabilities carry information from all of the features used to train the model, so appending them adds weight to those features when the similarity score is calculated; in turn, this should better fine-tune the recommendations.

After isolating the predicted probabilities, they are appended to the document-term matrix created by vectorizing the plot and genre with TfidfVectorizer. The combined matrix is then used to calculate the cosine similarity scores.
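A rough sketch of this pipeline on toy data is shown below. It is a simplification under stated assumptions, not the code in `model.log_reg` or `model.recommender_2`: only the text features are used (the numerical and binarized features are left out), there is no train/test split, and the toy plots, the number of PCA components, and `max_iter` are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# Made-up toy titles: plot, genre, and MPAA/TV rating (the classification target).
plots = [
    "Foul-mouthed kids cause chaos in a mountain town",
    "A reckless scientist drags his grandson across space",
    "Friendly animals teach counting and colors",
    "A bear steals picnic baskets in a national park",
    "A failed comedian turns to violent crime",
    "A detective hunts a killer through a dark city",
]
genres = [
    "Animation Comedy",
    "Animation Comedy ScienceFiction",
    "Animation KidsandFamily",
    "Animation Comedy KidsandFamily",
    "Crime Drama",
    "Crime MysteryandThriller",
]
ratings = ["TV-MA", "TV-MA", "TV-G", "TV-G", "R", "R"]

# Document-term matrix from separately vectorized plot and genre.
text_matrix = hstack([
    TfidfVectorizer(stop_words="english").fit_transform(plots),
    TfidfVectorizer().fit_transform(genres),
]).tocsr()

# Reduce dimensionality before classification (PCA needs a dense array).
reduced = PCA(n_components=3, random_state=42).fit_transform(text_matrix.toarray())

# Multi-class logistic regression predicting the MPAA/TV rating.
clf = LogisticRegression(max_iter=1000).fit(reduced, ratings)
probabilities = clf.predict_proba(reduced)  # one column per rating class

# Append the class probabilities to the document-term matrix, then score similarity.
combined = hstack([text_matrix, csr_matrix(probabilities)]).tocsr()
similarity = cosine_similarity(combined)
print(similarity.shape, clf.classes_)
```

Because the probability columns sit alongside the TF-IDF columns, two titles that the classifier assigns to the same rating class gain similarity even when their plots share no words.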

In [137]:
#Baseline Score is 14%
lr, pca, Z_train, Z_test, y_train, y_test, X = model.log_reg(df)
print(f'Training accuracy: {lr.score(Z_train, y_train)}')
print(f'Testing accuracy: {lr.score(Z_test, y_test)}')

Training accuracy: 0.5831649831649832
Testing accuracy: 0.5353535353535354


In [182]:
model.recommender_2('south park', 10)

| Unnamed: 0 | title | year | plot | genre | rating | imdb_rating | type |
|---|---|---|---|---|---|---|---|
| 201 | Aqua Teen Hunger Force | 2000 | The surreal adventures of three anthropomorphi... | Animation Comedy | TV-MA | 7.7 | show |
| 176 | Food Wars! Shokugeki no Soma | 2015 | Yukihira Souma's dream is to become a full-tim... | Animation Drama Comedy | TV-MA | 8.2 | show |
| 496 | Close Enough | 2020 | A surreal take on transitioning from 20-someth... | Drama Animation Comedy | TV-14 | 8.4 | show |
| 1491 | Animals. | 2016 | An animated comedy focusing on the downtrodden... | Animation Comedy Drama | TV-MA | 7.2 | show |
| 242 | The Boondocks | 2005 | When Robert “Granddad” Freeman becomes legal g... | Animation ActionandAdventure Comedy Drama | TV-MA | 8.4 | show |
| 1361 | Yogi Bear | 2010 | A documentary filmmaker travels to Jellystone ... | Comedy Animation KidsandFamily | PG | 4.6 | movie |
| 894 | Keep Your Hands Off Eizouken! | 2020 | Asakusa Midori wants to create an anime, but s... | Animation ActionandAdventure Comedy Drama | TV-14 | 8.2 | show |
| 1507 | Tig n' Seek | 2020 | Follow 8-year-old Tiggy and his gadget-buildin... | Animation | Not Rated | 7.5 | show |
| 420 | Metalocalypse | 2006 | Metalocalypse is an American animated televisi... | Animation Comedy MusicandMusical | TV-MA | 8.3 | show |
| 1024 | The Yogi Bear Show | 1961 | From his home in Jellystone Park, Yogi Bear dr... | Animation Comedy KidsandFamily | TV-G | 6.6 | show |


**Analysis:** Compared to Recommender 1, the output has fewer discrepancies. There are more shows similar to South Park and fewer that aren't. However, the improvement still comes with some errors; some children's titles, such as Yogi Bear, are still being recommended. The recommender's weakness is that the complexity introduced by the logistic regression might be adding noise, which would explain the remaining errors.

### Recommender 3: Content-Based Recommender using TfidfVectorizer (based on genre, plot, MPAA/TV rating, and IMDB rating)

Since the added complexity still leaves a degree of error, I decided to simplify how the MPAA/TV ratings are incorporated. I encoded them into numerical values (the conversion is based on personal judgment) and combined them with the IMDB ratings. The recommender then follows the same procedure as the two systems above (appending the values to the vectorized matrix and calculating the similarity scores). The idea behind this approach is that it magnifies the effect of MPAA/TV ratings on the similarity scores.
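One way to read this step is sketched below. The `RATING_CODES` mapping is hypothetical (the actual conversion in `models.py` is based on personal judgment and is not shown in this notebook), and "combined" is interpreted here as appending the encoded rating and the IMDB rating as two extra columns; the original may combine them differently.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ordinal encoding of MPAA/TV ratings by intended audience.
RATING_CODES = {
    "Not Rated": 0,
    "G": 1, "TV-G": 1, "TV-Y7": 1,
    "PG": 2, "TV-PG": 2,
    "PG-13": 3, "TV-14": 3,
    "R": 4, "TV-MA": 4,
}

# Made-up toy titles that share a genre but differ in intended audience.
titles = ["Adult Cartoon A", "Adult Cartoon B", "Kids Cartoon"]
genres = ["Animation Comedy", "Animation Comedy", "Animation Comedy"]
ratings = ["TV-MA", "TV-MA", "TV-G"]
imdb_ratings = [8.7, 8.4, 8.5]

genre_matrix = TfidfVectorizer().fit_transform(genres)

# Two extra columns per title: encoded MPAA/TV rating and IMDB rating.
extra = csr_matrix([[RATING_CODES[r], score] for r, score in zip(ratings, imdb_ratings)])

combined = hstack([genre_matrix, extra]).tocsr()
similarity = cosine_similarity(combined)

# The two TV-MA titles end up closer to each other than to the TV-G title.
print(similarity[0, 1] > similarity[0, 2])
```

Since the appended values are much larger than the TF-IDF weights, they dominate the cosine similarity; whether and how to rescale them is a design choice this sketch does not settle.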

In [184]:
model.recommender_3('south park', 10)

| Unnamed: 0 | title | year | plot | genre | rating | imdb_rating | type |
|---|---|---|---|---|---|---|---|
| 242 | The Boondocks | 2005 | When Robert “Granddad” Freeman becomes legal g... | Animation ActionandAdventure Comedy Drama | TV-MA | 8.4 | show |
| 176 | Food Wars! Shokugeki no Soma | 2015 | Yukihira Souma's dream is to become a full-tim... | Animation Drama Comedy | TV-MA | 8.2 | show |
| 420 | Metalocalypse | 2006 | Metalocalypse is an American animated televisi... | Animation Comedy MusicandMusical | TV-MA | 8.3 | show |
| 3 | Rick and Morty | 2013 | Rick is a mentally-unbalanced but scientifical... | ActionandAdventure Animation ScienceFiction Co... | TV-MA | 9.3 | show |
| 201 | Aqua Teen Hunger Force | 2000 | The surreal adventures of three anthropomorphi... | Animation Comedy | TV-MA | 7.7 | show |
| 62 | Curb Your Enthusiasm | 2000 | The off-kilter, unscripted comic vision of Lar... | Comedy | TV-MA | 8.7 | show |
| 27 | Harley Quinn | 2019 | Harley Quinn has finally broken things off onc... | Animation ScienceFiction Comedy Crime Fantasy ... | TV-MA | 8.5 | show |
| 314 | Last Week Tonight with John Oliver | 2014 | A half-hour satirical look at the week in news... | Comedy | TV-MA | 9.0 | show |
| 154 | Silicon Valley | 2014 | In the high-tech gold rush of modern Silicon V... | Comedy | TV-MA | 8.5 | show |
| 430 | Oz | 1997 | The daily lives of prisoners in Emerald City, ... | Drama MysteryandThriller Crime Animation Scien... | TV-MA | 8.7 | show |


**Analysis:** This recommender performs the best. All recommended shows are similar to South Park in content, premise, genre, and MPAA/TV rating. However, I am curious to see whether switching to a more sophisticated vectorizer will give a different result.

**NOTE:** 04_modeling.ipynb contains Recommender 4 and Recommender 5, which use BERT instead of TfidfVectorizer. These recommenders were run in Google Colab because my computer lacks the processing power.