# RECOMMENDATION SYSTEM

Disclaimer: I am building this recommendation system as if I were the target audience, so I am focusing on what I would like a movie recommendation to be based on, given the limited information I have. I would be happier with this dataset if it at least included a column with the name of the film's director or the cast.



So, I am going to pre-process some of the data to make it more readable for the model and then I am going to train it.

First, I am importing the libraries I will use and loading the DataFrame.

In [1]:
import pandas as pd
import numpy as np
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df=pd.read_csv('Datasets/Movies_ETL_EDA.csv', index_col=0)  # forward slash avoids the invalid '\M' escape sequence

In [3]:
df.shape

(44373, 12)

In [4]:
df.head()

Unnamed: 0,budget,id,overview,release_date,revenue,title,release_year,return,collection_name,genres_name,pcompany_name,pcountry_name
0,30000000.0,862,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,373554033.0,Toy Story,1995,12.451801,Toy Story Collection,"['Animation', 'Comedy', 'Family']",['Pixar Animation Studios'],['United States of America']
1,65000000.0,8844,When siblings Judy and Peter discover an encha...,1995-12-15,262797249.0,Jumanji,1995,4.043035,,"['Adventure', 'Fantasy', 'Family']","['TriStar Pictures', 'Teitler Film', 'Intersco...",['United States of America']
2,0.0,15602,A family wedding reignites the ancient feud be...,1995-12-22,0.0,Grumpier Old Men,1995,0.0,Grumpy Old Men Collection,"['Romance', 'Comedy']","['Warner Bros.', 'Lancaster Gate']",['United States of America']
3,16000000.0,31357,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,81452156.0,Waiting to Exhale,1995,5.09076,,"['Comedy', 'Drama', 'Romance']",['Twentieth Century Fox Film Corporation'],['United States of America']
4,0.0,11862,Just when George Banks has recovered from his ...,1995-02-10,76578911.0,Father of the Bride Part II,1995,0.0,Father of the Bride Collection,['Comedy'],"['Sandollar Productions', 'Touchstone Pictures']",['United States of America']


I know some film titles appear more than once because they are remakes of the same plot. I am going to drop those duplicates (keeping the first occurrence of each title): this model does not really consider the release date of the movies, so they only take up space.

In [5]:
df.drop_duplicates(subset=['title'],inplace=True)
df=df.reset_index(drop=True)
df.shape

(41278, 12)

The only columns I am going to use for the model are overview, title and genres_name, because I feel they hold enough information to make a decent recommendation without being redundant.

In [6]:
model_data=df[['title','overview','genres_name']]

In [7]:
model_data.head()

Unnamed: 0,title,overview,genres_name
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...","['Animation', 'Comedy', 'Family']"
1,Jumanji,When siblings Judy and Peter discover an encha...,"['Adventure', 'Fantasy', 'Family']"
2,Grumpier Old Men,A family wedding reignites the ancient feud be...,"['Romance', 'Comedy']"
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","['Comedy', 'Drama', 'Romance']"
4,Father of the Bride Part II,Just when George Banks has recovered from his ...,['Comedy']


Now, to make my model lighter, I will not be using the entire descriptions in the overview column. Instead, I am going to use RAKE (Rapid Automatic Keyword Extraction), an NLP tool I found, to extract keywords from the text. I am going to assign those keywords to a new column and then drop the overview column. First, I will convert the overview column to lowercase to avoid duplicated keywords.

In [8]:
model_data['overview']=model_data['overview'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  model_data['overview']=model_data['overview'].str.lower()
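
The warning above appears because model_data was created by selecting columns from df, so pandas cannot tell whether the assignment is also meant to change df. A minimal sketch of one common way to avoid it, assuming an explicit copy of the selected columns is acceptable (the tiny DataFrame below is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
df = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "overview": ["Led by Woody", "When siblings"],
    "budget": [30000000.0, 65000000.0],
})

# .copy() makes model_data an independent DataFrame, so later
# assignments are unambiguous and leave df untouched
model_data = df[["title", "overview"]].copy()
model_data["overview"] = model_data["overview"].str.lower()
```

With the copy in place, the lowercase conversion only affects model_data; df keeps its original text.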


In [9]:
#creating the new column
model_data['keywords'] = ""


In [10]:
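keywords = []
for plot in model_data['overview']:
    # a fresh extractor for each overview
    r = Rake()
    # split the text into candidate keyword phrases
    r.extract_keywords_from_text(plot)
    # dictionary mapping each word to its degree score; only the words are kept
    key_words_dict_scores = r.get_word_degrees()
    keywords.append(list(key_words_dict_scores.keys()))

# assign the whole column at once: writing to the row copies that iterrows() yields
# is not guaranteed to update the DataFrame
model_data['keywords'] = pd.Series(keywords, index=model_data.index)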


In [11]:
model_data.drop(columns=['overview'],inplace=True)


In [12]:
model_data.head()

Unnamed: 0,title,genres_name,keywords
0,Toy Story,"['Animation', 'Comedy', 'Family']","[led, woody, andy, toys, live, happily, room, ..."
1,Jumanji,"['Adventure', 'Fantasy', 'Family']","[siblings, judy, peter, discover, enchanted, b..."
2,Grumpier Old Men,"['Romance', 'Comedy']","[family, wedding, reignites, ancient, feud, ne..."
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']","[cheated, mistreated, stepped, women, holding,..."
4,Father of the Bride Part II,['Comedy'],"[george, banks, recovered, daughter, wedding, ..."
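
To make the keyword step more concrete, here is a simplified pure-Python sketch of the idea behind RAKE's word degrees. It is not the rake_nltk implementation: the stop word list is a tiny hypothetical one (rake_nltk relies on NLTK's much larger English list), and the helper names candidate_phrases and word_degrees are made up for this example.

```python
import re
from collections import defaultdict

# Tiny hypothetical stop word list, just enough for the example sentence
STOPWORDS = {"a", "an", "and", "by", "his", "in", "s", "the", "their", "to"}

def candidate_phrases(text):
    """Split text into runs of consecutive non-stopword words."""
    phrases, current = [], []
    # Words and punctuation are separate tokens; punctuation also ends a phrase
    for token in re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower()):
        if token.isalnum() and token not in STOPWORDS:
            current.append(token)
        elif current:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases

def word_degrees(phrases):
    """Degree of a word: how many words share its phrases, itself included."""
    degrees = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            degrees[word] += len(phrase)
    return dict(degrees)

overview = "led by woody, andy's toys live happily in his room"
degrees = word_degrees(candidate_phrases(overview))
keywords = list(degrees.keys())
```

Since the notebook keeps only the dictionary keys, the degree scores themselves are discarded: the keywords column ends up holding every word of the overview that is not a stop word.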


Since the values in genres_name and keywords are wrapped in brackets that do not really serve a purpose, I am going to remove those brackets and the apostrophes, and just leave the values separated by commas.

I am converting these values into strings first so that both columns have the same structure.

In [13]:
model_data['genres_name']=model_data['genres_name'].map(str)
model_data['keywords']=model_data['keywords'].map(str)


Now, I am replacing the characters I do not want in these columns and keeping the columns as strings.

In [14]:
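for col in ['genres_name','keywords']:
    for val in ['[',']','\'']:
        # regex=False treats each character literally ('[' and ']' are special in regular expressions)
        model_data[col]=model_data[col].str.replace(val,'',regex=False)
    model_data[col]=model_data[col].astype(str)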


This is what I ended up with:

In [15]:
model_data.head()

Unnamed: 0,title,genres_name,keywords
0,Toy Story,"Animation, Comedy, Family","led, woody, andy, toys, live, happily, room, b..."
1,Jumanji,"Adventure, Fantasy, Family","siblings, judy, peter, discover, enchanted, bo..."
2,Grumpier Old Men,"Romance, Comedy","family, wedding, reignites, ancient, feud, nex..."
3,Waiting to Exhale,"Comedy, Drama, Romance","cheated, mistreated, stepped, women, holding, ..."
4,Father of the Bride Part II,Comedy,"george, banks, recovered, daughter, wedding, r..."
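
The bracket and quote removal can be checked on a single value. This is a small plain-Python sketch of the same replacement logic; strip_list_markup is a hypothetical helper name, not something defined in the notebook.

```python
def strip_list_markup(value):
    """Remove the brackets and single quotes from a stringified list."""
    for char in ["[", "]", "'"]:
        value = value.replace(char, "")
    return value

cleaned = strip_list_markup("['Animation', 'Comedy', 'Family']")
print(cleaned)
```

The commas and the spaces after them are left in place, which is why the values in the table above are still comma-separated.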


I want all of the values to be in lowercase, so I am now converting the title and genres_name columns as well. This keeps the text consistent and lets me look titles up later regardless of how they are capitalized.

In [16]:
model_data['genres_name']=model_data['genres_name'].str.lower()
model_data['title']=model_data['title'].str.lower()


In [17]:
model_data.head()

Unnamed: 0,title,genres_name,keywords
0,toy story,"animation, comedy, family","led, woody, andy, toys, live, happily, room, b..."
1,jumanji,"adventure, fantasy, family","siblings, judy, peter, discover, enchanted, bo..."
2,grumpier old men,"romance, comedy","family, wedding, reignites, ancient, feud, nex..."
3,waiting to exhale,"comedy, drama, romance","cheated, mistreated, stepped, women, holding, ..."
4,father of the bride part ii,comedy,"george, banks, recovered, daughter, wedding, r..."


For the model I am using CountVectorizer, so I am going to create a single plain-text field for it to work with.

I created a function that takes all of the values in a row and joins them, replacing the commas with spaces. I am applying it to each row in the data and putting the results in a new column.

In [18]:
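def create_text(row):
    # start with the title, then append every other column of the row
    text = row['title']
    for value in row.iloc[1:]:
        # commas become spaces so the result is plain space-separated text
        text = text + ' ' + str(value).replace(',',' ')
    return text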

In [19]:
model_data['text']=model_data.apply(create_text,axis=1)


This is the final state of the data:

In [20]:
model_data.head()

Unnamed: 0,title,genres_name,keywords,text
0,toy story,"animation, comedy, family","led, woody, andy, toys, live, happily, room, b...",toy story animation comedy family led woody...
1,jumanji,"adventure, fantasy, family","siblings, judy, peter, discover, enchanted, bo...",jumanji adventure fantasy family siblings j...
2,grumpier old men,"romance, comedy","family, wedding, reignites, ancient, feud, nex...",grumpier old men romance comedy family weddi...
3,waiting to exhale,"comedy, drama, romance","cheated, mistreated, stepped, women, holding, ...",waiting to exhale comedy drama romance cheat...
4,father of the bride part ii,comedy,"george, banks, recovered, daughter, wedding, r...",father of the bride part ii comedy george ban...


In [21]:
model_data=model_data.reset_index(drop=True)

I am going to drop the genres_name and keywords columns, since I do not need them anymore. The column I am going to vectorize is the text column. Then I compute the cosine similarity matrix, which holds a similarity score for every pair of films.

In [22]:
model_data.drop(columns=['genres_name','keywords'],inplace=True)

In [23]:
cv = CountVectorizer(stop_words='english')
cv_matrix = cv.fit_transform(model_data['text'])
cosine_sim = cosine_similarity(cv_matrix,cv_matrix)
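
As a rough illustration of what the two steps above compute, here is a pure-Python sketch that counts words and measures the cosine of the angle between the resulting count vectors. It is a simplification, not what scikit-learn does internally: there is no stop word removal or token filtering, and the three texts are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical 'text' values, shortened for the example
texts = {
    "toy story": "toy story animation comedy family toys woody",
    "toy story 2": "toy story 2 animation comedy family woody rescue",
    "jumanji": "jumanji adventure fantasy family board game",
}
counts = {title: Counter(text.split()) for title, text in texts.items()}

sim_sequel = cosine(counts["toy story"], counts["toy story 2"])
sim_jumanji = cosine(counts["toy story"], counts["jumanji"])
```

Films that share many words (title words, genres, keywords) score close to 1, while films that share only one genre score close to 0. This is also why sequels tend to dominate the recommendations: the title is part of the text.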

Defining a function to get recommendations

In [55]:

def recomendaciones(titulo, cosine_sim = cosine_sim):
    # Getting the index of the movie that matches the title
    idx = model_data[model_data['title'] == str(titulo).lower()].index[0]
    # Getting the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    #Sorting the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Getting the top 5 recommendations
    sim_scores = sim_scores[1:6]
    movie_indices = [i[0] for i in sim_scores]
    recommendations=list(model_data['title'].iloc[movie_indices].str.title())
    return {'lista recomendada': recommendations} 
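
One detail worth noting: slicing with [1:6] assumes the searched film is always first after sorting. If another film ties with it at the top score, the searched film could end up in its own recommendations. A minimal sketch of a ranking that excludes the query index explicitly (top_k_excluding_self is a hypothetical helper and the scores are made up):

```python
def top_k_excluding_self(scores, idx, k=5):
    """Indices of the k highest scores, never including idx itself."""
    candidates = (i for i in range(len(scores)) if i != idx)
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Made-up similarity row for the film at index 2; index 0 ties with it
scores = [1.0, 0.2, 1.0, 0.7, 0.0, 0.5]

robust = top_k_excluding_self(scores, idx=2, k=3)

# The slicing approach, for comparison
by_score = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
sliced = [i for i, _ in by_score[1:4]]
```

Here the slicing version returns index 2 (the searched film) and loses index 0, while the explicit version keeps index 0 and skips the query.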

Testing the model

In [56]:
recomendaciones('batman')

{'lista recomendada': ['Batman Beyond: Return Of The Joker',
  'Batman: The Dark Knight Returns, Part 1',
  'The Dark Knight Rises',
  'Batman & Robin',
  'Batman Begins']}

In [27]:
recomendaciones('the love letter')

{'lista recomendada': ['E Aí... Comeu?',
  'Beautiful Lies',
  'Sex, Love & Therapy',
  'All Relative',
  'Love At First Hiccup']}

In [28]:
recomendaciones('minions')

{'lista recomendada': ['Minions: Orientation Day',
  'Despicable Me 2',
  'Banana',
  'One Hundred And One Dalmatians',
  'Mower Minions']}

In [29]:
recomendaciones('the hunger games')

{'lista recomendada': ['The Hunger Games: Mockingjay - Part 2',
  'The Hunger Games: Catching Fire',
  'The Hunger Games: Mockingjay - Part 1',
  'Arena',
  'The Fifth Element']}

In [63]:
recomendaciones('toy story')

{'lista recomendada': ['Toy Story 2',
  'Toy Story 3',
  'Toy Story Of Terror!',
  "Family Guy Presents: Seth And Alex'S Almost Live Comedy Show",
  'Botsman I Popugay']}

In [66]:
recomendaciones('Pride And Prejudice')

{'lista recomendada': ['Bride & Prejudice',
  'Pride & Prejudice',
  'Pride And Prejudice And Zombies',
  'Invitation To Happiness',
  'Tiny Times']}

This works just fine on my PC, which, to be fair, has a lot of resources. However, I probably will not be able to run this algorithm on the full data in the free deployment, since the RAM I get there is much less than what I have on my local machine. So, for the API, I am going to take a random sample of half the data and use only that. I am not changing my algorithm because I think the recommendations it gives are pretty much spot on; if you want to try it at its full potential, download this file and run the previous code on a computer with at least 16 GB of RAM. In the API, I am going to give the function a default: if the movie you search for is not in the data, it returns the top 5 most popular movies as the recommendation (I am going to get this information from the EDA).

In [31]:
model_data.shape

(41278, 2)

Since I am taking half of the data as a sample, n = 41278 / 2 = 20639.
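
A rough back-of-the-envelope estimate of why halving the data helps so much. It assumes the similarity matrix is stored densely with one 64-bit float per pair of films (to my knowledge this is what cosine_similarity returns by default) and ignores everything else held in memory; dense_matrix_gb is a hypothetical helper.

```python
def dense_matrix_gb(n_rows, bytes_per_value=8):
    """Approximate size in GB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 1e9

full_rows = 41278
sample_rows = full_rows // 2

full_gb = dense_matrix_gb(full_rows)
sample_gb = dense_matrix_gb(sample_rows)

print(f"full data:   {full_gb:.1f} GB")
print(f"half sample: {sample_gb:.1f} GB")
```

Because the matrix has one entry per pair of films, half the rows means roughly a quarter of the memory, not half.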

In [47]:
sample_md=model_data.sample(n=20639, random_state=42)

Resetting the index to avoid problems:

In [48]:
sample_md=sample_md.reset_index(drop=True)

In [49]:
sample_md

Unnamed: 0,title,text
0,sleepless in seattle,sleepless in seattle comedy drama romance yo...
1,mission to lars,mission to lars documentary kate spicer brot...
2,war for the planet of the apes,war for the planet of the apes drama science ...
3,disconnect,disconnect drama thriller disconnect interwe...
4,birdman of alcatraz,birdman of alcatraz drama killing prison gua...
...,...,...
20634,noobz,noobz comedy adventure four friends hit ro...
20635,dracula vs. frankenstein,dracula vs. frankenstein horror science ficti...
20636,jaws of satan,jaws of satan horror mystery thriller preach...
20637,kids world,kids world would wish eleven could anythi...


I am going to export this as a CSV file to avoid repeating all these transformations in the API.

In [50]:
sample_md.to_csv('ML_Data.csv')

Now, I am going to wrap the model, including the vectorization steps, inside a single function.

In [67]:
def recomendaciones1(titulo):
    try:
        cv1 = CountVectorizer(stop_words='english')
        cv_matrix1 = cv1.fit_transform(sample_md['text'])
        cosine_sim1 = cosine_similarity(cv_matrix1,cv_matrix1)
        # Getting the index of the movie that matches the title
        idx = sample_md[sample_md['title'] == str(titulo).lower()].index[0]
        # Getting the similarity scores
        sim_scores = list(enumerate(cosine_sim1[idx]))
        #Sorting the movies based on the similarity scores
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        # Getting the top 5 recommendations
        sim_scores = sim_scores[1:6]
        movie_indices = [i[0] for i in sim_scores]
        recommendations=list(sample_md['title'].iloc[movie_indices].str.title())
        return {'lista recomendada': recommendations} 
    except IndexError:
        # no row matches the title, so fall back to the default list
        return {'lista recomendada': ['Minions', 'Wonder Woman', 'Beauty and the Beast', 'Baby Driver', 'Big Hero 6']}

In [68]:
#testing it with a movie in the sample data
recomendaciones1('disconnect')

{'lista recomendada': ['11 Minutes',
  'The Dead Girl',
  'Even Money',
  'Vips',
  'Toronto Stories']}

In [62]:
#testing it with a movie that is not in the sample data
recomendaciones1('barbie')

{'lista recomendada': ['Minions',
  'Wonder Woman',
  'Beauty and the Beast',
  'Baby Driver',
  'Big Hero 6']}
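
A possible refinement for the API, sketched under assumptions rather than taken from the notebook: fit the vectorizer once, and compare only the searched film's row against all the others instead of building the full square matrix on every call. The four titles and texts are made up, and recommend is a hypothetical function name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical miniature version of the sample data
titles = ["toy story", "toy story 2", "jumanji", "heat"]
texts = [
    "toy story animation comedy family toys woody",
    "toy story 2 animation comedy family woody rescue",
    "jumanji adventure fantasy family board game",
    "heat action crime drama thriller detective",
]

# Fitted once, outside the function
cv = CountVectorizer(stop_words="english")
cv_matrix = cv.fit_transform(texts)

def recommend(title, k=2):
    title = str(title).lower()
    if title not in titles:
        return []  # the caller can substitute a default list here
    idx = titles.index(title)
    # One row against all rows: a 1 x n result instead of n x n
    row = cosine_similarity(cv_matrix[idx], cv_matrix).ravel()
    ranked = [i for i in row.argsort()[::-1] if i != idx]
    return [titles[i] for i in ranked[:k]]

result = recommend("Toy Story")
```

With this layout the memory needed per request grows with the number of films rather than with its square, which may make the full dataset viable on a small deployment. I have not measured this on the real data.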