# RECOMMENDATION SYSTEM

Disclaimer: I am building this recommendation system as if I were the target audience, so I am going to focus on what I would like a movie recommendation to be based on, given the limited information I have. I would be happier with this dataset if it at least included a column with the name of the film's director or the cast.



So, I am going to pre-process some of the data to make it easier for the model to work with, and then I am going to train it.

First, I am importing the libraries I will use and loading the DataFrame.

In [83]:
import pandas as pd
import numpy as np
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

from IPython.display import Image

In [35]:
# Raw string so the backslash in the Windows path is not read as an escape sequence
df=pd.read_csv(r'Datasets\Movies_ETL_EDA.csv', index_col=0)

In [36]:
df.head()

Unnamed: 0,budget,id,overview,release_date,revenue,title,release_year,return,collection_name,genres_name,pcompany_name,pcountry_name
0,30000000.0,862,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,373554033.0,Toy Story,1995,12.451801,Toy Story Collection,"['Animation', 'Comedy', 'Family']",['Pixar Animation Studios'],['United States of America']
1,65000000.0,8844,When siblings Judy and Peter discover an encha...,1995-12-15,262797249.0,Jumanji,1995,4.043035,,"['Adventure', 'Fantasy', 'Family']","['TriStar Pictures', 'Teitler Film', 'Intersco...",['United States of America']
2,0.0,15602,A family wedding reignites the ancient feud be...,1995-12-22,0.0,Grumpier Old Men,1995,0.0,Grumpy Old Men Collection,"['Romance', 'Comedy']","['Warner Bros.', 'Lancaster Gate']",['United States of America']
3,16000000.0,31357,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,81452156.0,Waiting to Exhale,1995,5.09076,,"['Comedy', 'Drama', 'Romance']",['Twentieth Century Fox Film Corporation'],['United States of America']
4,0.0,11862,Just when George Banks has recovered from his ...,1995-02-10,76578911.0,Father of the Bride Part II,1995,0.0,Father of the Bride Collection,['Comedy'],"['Sandollar Productions', 'Touchstone Pictures']",['United States of America']


The only columns I am going to use for the model are overview, title and genres_name, because I feel they hold enough information to make a decent recommendation without being redundant.

In [37]:
model_data=df[['title','overview','genres_name']]

In [38]:
model_data.head()

Unnamed: 0,title,overview,genres_name
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...","['Animation', 'Comedy', 'Family']"
1,Jumanji,When siblings Judy and Peter discover an encha...,"['Adventure', 'Fantasy', 'Family']"
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,"['Romance', 'Comedy']"
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","['Comedy', 'Drama', 'Romance']"
4,Father of the Bride Part II,Just when George Banks has recovered from his ...,['Comedy']


I am going to select a random sample of the data, because I already tried to run the algorithm on the complete dataset and cosine_similarity threw a MemoryError every time. I even tried running it on Colab to see if I could use the full set there, but the same thing happened (attaching a screenshot of the Colab error).

In [86]:
Image(url="Screenshot 2023-05-13 011133.png")

In [39]:
model_data=model_data.sample(n=20000, random_state=42)
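
To see why the full dataset does not fit in memory, here is a rough estimate. It assumes the similarity matrix is dense, with one row and one column per movie and 8 bytes (float64) per value. The helper name `dense_matrix_gib` and the 45,000-row figure are only illustrative assumptions, not values taken from the dataset.

```python
def dense_matrix_gib(n_rows, bytes_per_value=8):
    """Approximate size in GiB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 1024**3


# 20,000 is the sample size used here; 45,000 is a hypothetical full-size count
for n in (20_000, 45_000):
    print(f"{n} rows -> about {dense_matrix_gib(n):.1f} GiB")
```

Under these assumptions the 20,000-row sample already needs roughly 3 GiB for the matrix alone, and the memory grows with the square of the number of rows.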

Now, to make my model lighter, I will not be using the entire descriptions in the overview column. Instead, I am going to use RAKE (Rapid Automatic Keyword Extraction), an NLP tool I found, to extract keywords from the text. I am going to assign those keywords to a new column and then drop the overview column. First, I will convert this column to lowercase to avoid duplicates.

In [40]:
model_data['overview']=model_data['overview'].str.lower()

In [41]:
#creating the new column
model_data['keywords'] = ""

In [42]:
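r = Rake()  # one extractor, reused for every overview

def extract_keywords(plot):
    # Return the distinct words RAKE keeps for one overview
    r.extract_keywords_from_text(plot)
    key_words_dict_scores = r.get_word_degrees()
    return list(key_words_dict_scores.keys())

# iterrows() yields copies of each row, so assigning to `row` is not guaranteed
# to update the DataFrame; apply() writes the results back explicitly
model_data['keywords'] = model_data['overview'].apply(extract_keywords)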
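
For intuition about what the keyword extraction does, here is a simplified sketch of the RAKE idea in plain Python: the text is split into candidate phrases at stop words and punctuation, and each word gets a "degree" equal to the total length of the phrases it appears in. This is not the rake_nltk implementation. The tiny `STOPWORDS` set and the function names `candidate_phrases` and `word_degrees` are assumptions made only for this illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (assumption); rake_nltk uses a much larger one
STOPWORDS = {"a", "an", "the", "of", "in", "and", "to", "is", "by", "his", "her", "when"}


def candidate_phrases(text):
    """Split text into runs of content words, breaking at stop words and punctuation."""
    phrases, current = [], []
    for token in re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower()):
        is_word = re.match(r"[a-z0-9]", token) is not None
        if token in STOPWORDS or not is_word:
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases


def word_degrees(phrases):
    """Sum, for every word, the lengths of the phrases it occurs in."""
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            degree[word] += len(phrase)
    return dict(degree)


phrases = candidate_phrases("A family wedding reignites the ancient feud")
print(phrases)
print(word_degrees(phrases))
```

The keys of the degree dictionary play the same role as the keyword list stored in the keywords column.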


In [43]:
model_data.drop(columns=['overview'],inplace=True)

In [44]:
model_data.head()

Unnamed: 0,title,genres_name,keywords
41579,Sketches of Kaitan City,['Drama'],"[seaside, city, kaitan, happy, place, –, shipy..."
43750,Secret Defense,"['Crime', 'Drama']","[sylvie, scientist, aged, 30, dig, deeper, bac..."
2494,The Love Letter,"['Comedy', 'Drama', 'Romance']","[romantic, comedy, mysterious, love, letter, t..."
8932,Cadence,['Drama'],"[punishment, drunken, rebellious, behavior, yo..."
37779,Ajab Prem Ki Ghazab Kahani,"['Drama', 'Comedy', 'Romance', 'Foreign']","[prem, ajab, kind, guy, life, president, happy..."


Since the values in genres_name and keywords are wrapped in brackets that do not really serve a purpose, I am going to remove those brackets and the apostrophes, and just leave the values separated by commas.

I am converting these values to strings so that both columns have the same structure.

In [45]:
model_data['genres_name']=model_data['genres_name'].map(str)
model_data['keywords']=model_data['keywords'].map(str)

Now, I am replacing the characters I do not want in these columns and keeping the columns as strings.

In [46]:
for col in ['genres_name','keywords']:
    for val in ['[',']','\'']:
        # regex=False treats '[' and ']' as literal characters, not as a pattern
        model_data[col]=model_data[col].str.replace(val,'',regex=False)
    model_data[col]=model_data[col].astype(str)
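
As an alternative to stripping characters one by one, a value such as `"['Animation', 'Comedy', 'Family']"` could be parsed back into a list and then joined. This is only a sketch under the assumption that every genres_name value is a valid Python list literal; the helper name `genres_to_text` is hypothetical and is not used elsewhere in this notebook.

```python
import ast


def genres_to_text(value):
    """Turn the string "['A', 'B']" into the plain text "A, B"."""
    # literal_eval only parses Python literals, so it does not execute code
    genres = ast.literal_eval(value)
    return ", ".join(genres)


print(genres_to_text("['Animation', 'Comedy', 'Family']"))
```

It could be applied with something like `model_data['genres_name'].apply(genres_to_text)`, but the character replacement above gives the same result for this column.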


This is what I ended up with:

In [47]:
model_data.head()

Unnamed: 0,title,genres_name,keywords
41579,Sketches of Kaitan City,Drama,"seaside, city, kaitan, happy, place, –, shipya..."
43750,Secret Defense,"Crime, Drama","sylvie, scientist, aged, 30, dig, deeper, back..."
2494,The Love Letter,"Comedy, Drama, Romance","romantic, comedy, mysterious, love, letter, tu..."
8932,Cadence,Drama,"punishment, drunken, rebellious, behavior, you..."
37779,Ajab Prem Ki Ghazab Kahani,"Drama, Comedy, Romance, Foreign","prem, ajab, kind, guy, life, president, happy,..."


I want all of the values in lowercase, so I am now lowercasing the title and genres_name columns. This avoids mismatches later on, for example when looking up a movie by its title.

In [48]:
model_data['genres_name']=model_data['genres_name'].str.lower()
model_data['title']=model_data['title'].str.lower()

In [49]:
model_data.head()

Unnamed: 0,title,genres_name,keywords
41579,sketches of kaitan city,drama,"seaside, city, kaitan, happy, place, –, shipya..."
43750,secret defense,"crime, drama","sylvie, scientist, aged, 30, dig, deeper, back..."
2494,the love letter,"comedy, drama, romance","romantic, comedy, mysterious, love, letter, tu..."
8932,cadence,drama,"punishment, drunken, rebellious, behavior, you..."
37779,ajab prem ki ghazab kahani,"drama, comedy, romance, foreign","prem, ajab, kind, guy, life, president, happy,..."


For the model I am using CountVectorizer, so I am going to create a plain-text column for it to work with. I am also going to reset the index so that each row's position matches its row in the similarity matrix.

I created a function that takes all of the values in each row and joins them into one string, replacing the commas with spaces. I am applying it to each row in the data and putting the results in a new column.

In [50]:
def create_text(row):
    # Join the title, genres and keywords of one row into a single string
    text = row['title']
    for value in row.iloc[1:]:
        text = text + ' ' + str(value).replace(',',' ')
    return text

In [51]:
model_data['text']=model_data.apply(create_text,axis=1)

This is the final result of the data

In [52]:
model_data.head()

Unnamed: 0,title,genres_name,keywords,text
41579,sketches of kaitan city,drama,"seaside, city, kaitan, happy, place, –, shipya...",sketches of kaitan city drama seaside city k...
43750,secret defense,"crime, drama","sylvie, scientist, aged, 30, dig, deeper, back...",secret defense crime drama sylvie scientist ...
2494,the love letter,"comedy, drama, romance","romantic, comedy, mysterious, love, letter, tu...",the love letter comedy drama romance romanti...
8932,cadence,drama,"punishment, drunken, rebellious, behavior, you...",cadence drama punishment drunken rebellious ...
37779,ajab prem ki ghazab kahani,"drama, comedy, romance, foreign","prem, ajab, kind, guy, life, president, happy,...",ajab prem ki ghazab kahani drama comedy roma...


In [53]:
model_data=model_data.reset_index(drop=True)

I am going to drop the genres_name and keywords columns, since I no longer need them. The column I am going to vectorize is the text column. Then I am computing the cosine similarity matrix to get the similarity score between every pair of movies.

In [55]:
model_data.drop(columns=['genres_name','keywords'],inplace=True)

In [57]:
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(model_data['text'])
cosine_sim = cosine_similarity(cv_matrix,cv_matrix)
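
To make the two steps above more concrete, here is a small sketch in plain Python of the underlying idea: each text becomes a vector of word counts, and the similarity between two texts is the cosine of the angle between their vectors. It splits on whitespace only and ignores stop words, so it is a simplification and not what CountVectorizer does internally. The example texts and the function names `count_vector` and `cosine` are made up for illustration.

```python
import math
from collections import Counter


def count_vector(text):
    """Map each whitespace-separated word to how often it appears."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up texts, not rows from the dataset
texts = [
    "batman crime gotham joker",
    "batman gotham knight",
    "love letter romance",
]
vectors = [count_vector(t) for t in texts]

print(round(cosine(vectors[0], vectors[1]), 3))  # two shared words
print(cosine(vectors[0], vectors[2]))            # no shared words
```

Texts that share more words get a score closer to 1, and texts with no words in common get 0.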

Defining a function to get recommendations

In [79]:

def recomendaciones(titulo, cosine_sim = cosine_sim):
    # Getting the index of the movie that matches the title
    idx = model_data[model_data['title'] == str(titulo).lower()].index[0]
    # Getting the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    #Sorting the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Getting the top 5 recommendations
    sim_scores = sim_scores[1:6]
    movie_indices = [i[0] for i in sim_scores]
    recommendations=list(model_data['title'].iloc[movie_indices].str.title())
    return {'lista recomendada': recommendations} 
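
The function above skips the first sorted result on the assumption that the movie itself always ranks first. A possible variation, sketched here with a hypothetical helper named `top_matches`, removes the queried movie by its index instead, which also covers the case where another row has an identical score. The scores in the example are made up.

```python
def top_matches(scores, idx, k=5):
    """Return the indices of the k highest scores, leaving out position idx."""
    others = [(i, score) for i, score in enumerate(scores) if i != idx]
    ranked = sorted(others, key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in ranked[:k]]


# Made-up similarity row for the movie at index 2; index 0 ties with it at 1.0
example_scores = [1.0, 0.2, 1.0, 0.7]
print(top_matches(example_scores, idx=2, k=2))
```

With these example scores, slicing the sorted list from position 1 would return the queried movie itself, because the tied row at index 0 sorts ahead of it.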

Testing the model

In [80]:
recomendaciones('batman')

{'lista recomendada': ['Batman Beyond: Return Of The Joker',
  'Batman: The Dark Knight Returns, Part 1',
  'The Dark Knight Rises',
  'Batman Vs Dracula',
  'Batman: Mask Of The Phantasm']}

In [81]:
recomendaciones('the love letter')

{'lista recomendada': ['Beautiful Lies',
  'Sex, Love & Therapy',
  'All Relative',
  'A Bela E O Paparazzo',
  'Love At First Hiccup']}

In [89]:
recomendaciones('minions')

{'lista recomendada': ['Minions: Orientation Day',
  'Despicable Me 2',
  'Banana',
  'Despicable Me 3',
  'Veggietales: Dave And The Giant Pickle']}