<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:35%"><img src='https://drive.google.com/uc?export=view&id=1n7iMRKM-p2Grue1x_QkLA5V59b-Rab19' style="width: 350px; height: 100px; "></th>
        <th style="text-align:center;"><br /><h2>IS 215 - Analytics in Python - Practical 2</h2> <br />(Students)</th>
    </tr>
</table>

### Content-based recommender system for movies

This recommender is built using content from the IMDB top 250 English movies, read either from a local CSV file or from https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7. The metadata used includes the movie director, main actors and plot.

Make sure the rake_nltk library (Rake stands for Rapid Automatic Keyword Extraction) is installed. If not, install it with "pip install rake_nltk". Refer to https://pypi.org/project/rake-nltk/ for more information.

This Python script is adapted from https://towardsdatascience.com/how-to-build-from-scratch-a-content-based-movie-recommender-with-natural-language-processing-25ad400eb243

In [1]:
# If you haven't set up nltk and rake_nltk, uncomment the following and run once
# !pip install nltk
# !pip install rake_nltk

In [2]:
#Uncomment and run the following once if this is the first time you are using nltk
# import nltk
# nltk.download('stopwords')
# nltk.download('punkt')

In [53]:
import pandas as pd
from rake_nltk import Rake
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

### Step 1: Read in and analyse the data

In [54]:
df = pd.read_csv('IMDB_Top250Engmovies2_OMDB_Detailed.csv')

# if you want to access the csv file from the Internet with its URL:
# url = 'https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7'
# df = pd.read_csv(url)

In [55]:
df.head()

Unnamed: 0.1,Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,...,tomatoConsensus,tomatoUserMeter,tomatoUserRating,tomatoUserReviews,tomatoURL,DVD,BoxOffice,Production,Website,Response
0,1,The Shawshank Redemption,1994,R,14 Oct 1994,142 min,"Crime, Drama",Frank Darabont,"Stephen King (short story ""Rita Hayworth and S...","Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",...,,,,,http://www.rottentomatoes.com/m/shawshank_rede...,27 Jan 1998,,Columbia Pictures,,True
1,2,The Godfather,1972,R,24 Mar 1972,175 min,"Crime, Drama",Francis Ford Coppola,"Mario Puzo (screenplay), Francis Ford Coppola ...","Marlon Brando, Al Pacino, James Caan, Richard ...",...,,,,,http://www.rottentomatoes.com/m/godfather/,09 Oct 2001,,Paramount Pictures,http://www.thegodfather.com,True
2,3,The Godfather: Part II,1974,R,20 Dec 1974,202 min,"Crime, Drama",Francis Ford Coppola,"Francis Ford Coppola (screenplay), Mario Puzo ...","Al Pacino, Robert Duvall, Diane Keaton, Robert...",...,,,,,http://www.rottentomatoes.com/m/godfather_part...,24 May 2005,,Paramount Pictures,http://www.thegodfather.com/,True
3,4,The Dark Knight,2008,PG-13,18 Jul 2008,152 min,"Action, Crime, Drama",Christopher Nolan,"Jonathan Nolan (screenplay), Christopher Nolan...","Christian Bale, Heath Ledger, Aaron Eckhart, M...",...,,,,,http://www.rottentomatoes.com/m/the_dark_knight/,09 Dec 2008,"$533,316,061",Warner Bros. Pictures/Legendary,http://thedarkknight.warnerbros.com/,True
4,5,12 Angry Men,1957,APPROVED,01 Apr 1957,96 min,"Crime, Drama",Sidney Lumet,"Reginald Rose (story), Reginald Rose (screenplay)","Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",...,,,,,http://www.rottentomatoes.com/m/1000013-12_ang...,06 Mar 2001,,Criterion Collection,http://www.criterion.com/films/27871-12-angry-men,True


In [56]:
df.shape

(250, 38)

<img align="left" src='https://drive.google.com/uc?export=view&id=0B08uY8vosNfoeUJ4NUxtMlVNNnM' style="width: 60px; height: 60px;"><br />
If you want to do recommendations, do you need all the features?<br />
Which features do you think should be used?

Base the recommendations on the following input features.

In [57]:
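# .copy() gives an independent DataFrame, so later column assignments
# do not trigger a SettingWithCopyWarning
df = df[['Title','Director','Actors','Plot']].copy()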
df.head()

Unnamed: 0,Title,Director,Actors,Plot
0,The Shawshank Redemption,Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Two imprisoned men bond over a number of years...
1,The Godfather,Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Richard ...",The aging patriarch of an organized crime dyna...
2,The Godfather: Part II,Francis Ford Coppola,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",The early life and career of Vito Corleone in ...
3,The Dark Knight,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as the Joker emerges fro...
4,12 Angry Men,Sidney Lumet,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",A jury holdout attempts to prevent a miscarria...


In [58]:
df.shape

(250, 4)

### Step 2a: Data pre-processing
Transform the full names of <b>actors</b> and <b>directors</b> into single words so that each person is treated as a <b>unique value</b>. (<b>Genres</b> get the same treatment in the Practice section.)
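
To see why the names are merged, here is a minimal sketch in plain Python. The helper names `tokens` and `merge_name` and the second person's name are illustrative assumptions, not part of the notebook. A word-level tokenizer would find a shared token between two different people with the same first name, while the merged names share nothing.

```python
def tokens(text):
    # split on whitespace after lowercasing, as a word-level tokenizer would
    return set(text.lower().split())

def merge_name(full_name):
    # same transformation as in this notebook: lowercase and drop spaces
    return full_name.lower().replace(' ', '')

a, b = 'Christopher Nolan', 'Christopher Lloyd'  # second name is illustrative

shared_before = tokens(a) & tokens(b)
shared_after = tokens(merge_name(a)) & tokens(merge_name(b))

print(shared_before)  # {'christopher'}
print(shared_after)   # set()
```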

#### Transforming `Actors`

In [59]:
df['Actors'].head()

0    Tim Robbins, Morgan Freeman, Bob Gunton, Willi...
1    Marlon Brando, Al Pacino, James Caan, Richard ...
2    Al Pacino, Robert Duvall, Diane Keaton, Robert...
3    Christian Bale, Heath Ledger, Aaron Eckhart, M...
4    Martin Balsam, John Fiedler, Lee J. Cobb, E.G....
Name: Actors, dtype: object

In [60]:
# We will be getting only the first three names,
# discarding the commas between the actors' full names and 
# putting the actors in a list of words
df['Actors'] = df['Actors'].map(lambda x: x.split(',')[:3])

In [61]:
df['Actors'].head()

0        [Tim Robbins,  Morgan Freeman,  Bob Gunton]
1           [Marlon Brando,  Al Pacino,  James Caan]
2         [Al Pacino,  Robert Duvall,  Diane Keaton]
3    [Christian Bale,  Heath Ledger,  Aaron Eckhart]
4       [Martin Balsam,  John Fiedler,  Lee J. Cobb]
Name: Actors, dtype: object

In [62]:
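# merging first and last name for each actor into one word 
# to ensure no mix up between people sharing a first name
# (assigning to the row inside iterrows() is not guaranteed to update df,
# so the column is rebuilt with map instead)
df['Actors'] = df['Actors'].map(lambda names: [x.lower().replace(' ','') for x in names])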

In [63]:
df['Actors'].head()

0        [timrobbins, morganfreeman, bobgunton]
1           [marlonbrando, alpacino, jamescaan]
2         [alpacino, robertduvall, dianekeaton]
3    [christianbale, heathledger, aaroneckhart]
4        [martinbalsam, johnfiedler, leej.cobb]
Name: Actors, dtype: object

#### Transforming `Director`

In [64]:
df['Director'].head()

0          Frank Darabont
1    Francis Ford Coppola
2    Francis Ford Coppola
3       Christopher Nolan
4            Sidney Lumet
Name: Director, dtype: object

In [65]:
# putting the directors in a list of words
df['Director'] = df['Director'].map(lambda x: x.split(' '))

In [66]:
df['Director'].head()

0           [Frank, Darabont]
1    [Francis, Ford, Coppola]
2    [Francis, Ford, Coppola]
3        [Christopher, Nolan]
4             [Sidney, Lumet]
Name: Director, dtype: object

In [67]:
# merging first and last name for each director into one word
# (map is used because assigning to the row inside iterrows() is not guaranteed to update df)
df['Director'] = df['Director'].map(lambda x: ''.join(x).lower())

In [68]:
df['Director'].head()

0         frankdarabont
1    francisfordcoppola
2    francisfordcoppola
3      christophernolan
4           sidneylumet
Name: Director, dtype: object

In [69]:
# Finally
df.head()

Unnamed: 0,Title,Director,Actors,Plot
0,The Shawshank Redemption,frankdarabont,"[timrobbins, morganfreeman, bobgunton]",Two imprisoned men bond over a number of years...
1,The Godfather,francisfordcoppola,"[marlonbrando, alpacino, jamescaan]",The aging patriarch of an organized crime dyna...
2,The Godfather: Part II,francisfordcoppola,"[alpacino, robertduvall, dianekeaton]",The early life and career of Vito Corleone in ...
3,The Dark Knight,christophernolan,"[christianbale, heathledger, aaroneckhart]",When the menace known as the Joker emerges fro...
4,12 Angry Men,sidneylumet,"[martinbalsam, johnfiedler, leej.cobb]",A jury holdout attempts to prevent a miscarria...


### Step 2b: Data pre-processing on plot

Extracting the key words from the plot description.
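
The idea behind this step can be sketched in plain Python. This is a simplified illustration, not the rake_nltk implementation: the stopword list is a tiny made-up subset, and the function name `word_degrees` is hypothetical. Stopwords and punctuation split the text into candidate phrases, and each word's degree grows with the length of every phrase it appears in.

```python
import re
from collections import defaultdict

# tiny illustrative stopword list; rake_nltk uses the full NLTK English list
STOPWORDS = {'a', 'an', 'the', 'of', 'over', 'and', 'in', 'to'}

def word_degrees(text):
    # tokens are either words or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())

    # stopwords and punctuation act as phrase boundaries
    phrases, current = [], []
    for tok in tokens:
        if tok in STOPWORDS or not tok[0].isalnum():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)

    # a word's degree grows with the length of every phrase it appears in
    degrees = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            degrees[word] += len(phrase)
    return dict(degrees)

# the opening words of the first plot in the dataset
degrees = word_degrees('Two imprisoned men bond over a number of years.')
print(list(degrees.keys()))
# ['two', 'imprisoned', 'men', 'bond', 'number', 'years']
```

Only the dictionary keys are used by the recommender, so the scores themselves do not affect the result.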

In [70]:
def extract_key_words(plot):
    # instantiating a Rake object
    # by default it uses English stopwords from NLTK (Natural Language Toolkit)
    # and discards all punctuation characters
    r = Rake()

    # extracting the keywords from the text by passing plot 
    r.extract_keywords_from_text(plot)

    # getting the dictionary with key words and their degree scores,
    # then keeping only the key words
    return list(r.get_word_degrees().keys())

# creating the new column with the list of key words for every row
# (map is used because assigning to the row inside iterrows() is not guaranteed to update df)
df['Key_words'] = df['Plot'].map(extract_key_words)

In [71]:
df['Key_words'].head()

0    [two, imprisoned, men, bond, number, years, fi...
1    [aging, patriarch, organized, crime, dynasty, ...
2    [early, life, career, vito, corleone, 1920s, n...
3    [menace, known, joker, emerges, mysterious, pa...
4    [jury, holdout, attempts, prevent, miscarriage...
Name: Key_words, dtype: object

In [72]:
# dropping the Plot column
df.drop(columns = ['Plot'], inplace = True)
# if this raises an error (older pandas), use df.drop('Plot', axis=1, inplace=True)

In [73]:
# check all the columns now
df.head()

Unnamed: 0,Title,Director,Actors,Key_words
0,The Shawshank Redemption,frankdarabont,"[timrobbins, morganfreeman, bobgunton]","[two, imprisoned, men, bond, number, years, fi..."
1,The Godfather,francisfordcoppola,"[marlonbrando, alpacino, jamescaan]","[aging, patriarch, organized, crime, dynasty, ..."
2,The Godfather: Part II,francisfordcoppola,"[alpacino, robertduvall, dianekeaton]","[early, life, career, vito, corleone, 1920s, n..."
3,The Dark Knight,christophernolan,"[christianbale, heathledger, aaroneckhart]","[menace, known, joker, emerges, mysterious, pa..."
4,12 Angry Men,sidneylumet,"[martinbalsam, johnfiedler, leej.cobb]","[jury, holdout, attempts, prevent, miscarriage..."


### Step 3: Create word representation via a bag of words 

Combine the values of the `Director`, `Actors` and `Key_words` columns into one string of words per movie.
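
As a small illustration of what this step produces, the sketch below builds the bag of words for a single movie held in a plain dictionary. The function name `bag_of_words` is hypothetical and the key word list is shortened for the example.

```python
# one movie record, using values shown earlier in this notebook
movie = {
    'Director': 'frankdarabont',
    'Actors': ['timrobbins', 'morganfreeman', 'bobgunton'],
    'Key_words': ['two', 'imprisoned', 'men', 'bond'],  # shortened for the example
}

def bag_of_words(record):
    parts = []
    for value in record.values():
        # list-valued fields are flattened; plain strings are used as-is
        parts.append(' '.join(value) if isinstance(value, list) else value)
    return ' '.join(parts)

bag = bag_of_words(movie)
print(bag)
# frankdarabont timrobbins morganfreeman bobgunton two imprisoned men bond
```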

In [74]:
# Title should be omitted from bag of words creation
columns = df.columns[1:]

def build_bag_of_words(row):
    words = ''
    for col in columns:
        if col != 'Director':
            # to convert the list into a string of words separated by a space
            words = words + ' '.join(row[col])+ ' '
        else:
            words = words + row[col]+ ' '
    return words

# apply builds the column directly; assigning to the row inside iterrows()
# is not guaranteed to update df
df['bag_of_words'] = df.apply(build_bag_of_words, axis=1)

# let's keep only the title and the bag of words in the dataframe
df = df[['Title','bag_of_words']]

In [75]:
df.head()

Unnamed: 0,Title,bag_of_words
0,The Shawshank Redemption,frankdarabont timrobbins morganfreeman bobgunt...
1,The Godfather,francisfordcoppola marlonbrando alpacino james...
2,The Godfather: Part II,francisfordcoppola alpacino robertduvall diane...
3,The Dark Knight,christophernolan christianbale heathledger aar...
4,12 Angry Men,sidneylumet martinbalsam johnfiedler leej.cobb...


### Step 4: Create cosine similarity matrix 

In [76]:
# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
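
For intuition, a count matrix can be sketched in plain Python. This is a simplified stand-in for `CountVectorizer`, not its implementation; the two shortened bags of words and the helper name `tokenize` are assumptions made for the example. The regular expression mirrors the vectorizer's default token pattern, which keeps runs of two or more word characters.

```python
import re
from collections import Counter

# shortened bags of words for The Godfather and The Godfather: Part II
docs = [
    'francisfordcoppola marlonbrando alpacino',
    'francisfordcoppola alpacino robertduvall',
]

def tokenize(text):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# the vocabulary is the sorted set of all tokens; each token gets a column
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# one row per document, one count per vocabulary entry
count_rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_rows.append([counts[tok] for tok in vocabulary])

print(vocabulary)
# ['alpacino', 'francisfordcoppola', 'marlonbrando', 'robertduvall']
print(count_rows)
# [[1, 1, 1, 0], [1, 1, 0, 1]]
print(tokenize('leej.cobb'))
# ['leej', 'cobb']
```

The last line shows a side effect worth knowing: with this token pattern a merged name that still contains a period, such as `leej.cobb`, is split into two tokens.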

In [77]:
# creating a Series for the movie titles so they are associated to an ordered numerical
# list that can be used to match the indexes
indices = pd.Series(df['Title'])
indices[:5]

0    The Shawshank Redemption
1               The Godfather
2      The Godfather: Part II
3             The Dark Knight
4                12 Angry Men
Name: Title, dtype: object

In [78]:
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
print(cosine_sim)

[[1.         0.         0.         ... 0.         0.         0.        ]
 [0.         1.         0.22537447 ... 0.         0.         0.        ]
 [0.         0.22537447 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
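
Each entry of this matrix is the cosine of the angle between two rows of the count matrix: the dot product of the two count vectors divided by the product of their lengths. A minimal plain-Python sketch follows, with a hypothetical helper `cosine` and shortened bags of words. Because the real bags also contain plot key words, the values here differ from those in the matrix above.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    # word-count vectors for the two texts
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # dot product over the words the two texts share
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

godfather = 'francisfordcoppola marlonbrando alpacino'
godfather_2 = 'francisfordcoppola alpacino robertduvall'
shawshank = 'frankdarabont timrobbins morganfreeman'

print(round(cosine(godfather, godfather_2), 4))  # 0.6667
print(cosine(godfather, shawshank))              # 0.0
print(round(cosine(godfather, godfather), 4))    # 1.0
```

Movies that share no words score 0, and every movie scores 1 against itself, which is why the diagonal of the matrix is all ones.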


### Step 5: Create and Run the model (recommender)

In [79]:
# function that takes in movie title as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim = cosine_sim):
    
    recommended_movies = []
    
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df['Title'])[i])
        
    return recommended_movies

In [80]:
recommendations('The Dark Knight')

['The Dark Knight Rises',
 'Batman Begins',
 'The Prestige',
 'Guardians of the Galaxy Vol. 2',
 'Terminator 2: Judgment Day',
 'The Avengers',
 'The Green Mile',
 'Sin City',
 'The Lord of the Rings: The Fellowship of the Ring',
 'Out of the Past']
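
The function above skips the first entry of the sorted scores on the assumption that it is the query movie itself. A slightly more defensive variant, sketched here in plain Python, excludes the query by its index instead, which would still behave if another movie happened to have a similarity of 1.0. The function name `top_n`, the titles and the similarity values are all made up for this illustration.

```python
def top_n(similarities, titles, query_title, n=3):
    # position of the query movie in the title list
    query_idx = titles.index(query_title)
    # pair every other movie with its score, skipping the query itself
    scored = [(score, title)
              for i, (score, title) in enumerate(zip(similarities[query_idx], titles))
              if i != query_idx]
    # sort by score, highest first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:n]]

# hypothetical 4x4 similarity matrix
titles = ['A', 'B', 'C', 'D']
sim = [
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]

print(top_n(sim, titles, 'A', n=2))  # ['C', 'B']
```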

<img align="left" src='https://drive.google.com/uc?export=view&id=1gvji0A564aEIZRmSzm4J2H5euNV3kY4Q' style="width: 200px; height: 150px;">

### You have built your first recommender!!!

<img align="left" src='https://drive.google.com/uc?export=view&id=0B08uY8vosNfobDBuOXVXQWVxMFE' style="width: 60px; height: 60px;"><br /><br /><br />
- Now can you build another movie recommender, this time considering `Genre` as an additional feature?
- Does the new recommender provide the same recommendations?


# Practice

In [103]:
practice_df = pd.read_csv('IMDB_Top250Engmovies2_OMDB_Detailed.csv')
practice_df.head()

Unnamed: 0.1,Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,...,tomatoConsensus,tomatoUserMeter,tomatoUserRating,tomatoUserReviews,tomatoURL,DVD,BoxOffice,Production,Website,Response
0,1,The Shawshank Redemption,1994,R,14 Oct 1994,142 min,"Crime, Drama",Frank Darabont,"Stephen King (short story ""Rita Hayworth and S...","Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",...,,,,,http://www.rottentomatoes.com/m/shawshank_rede...,27 Jan 1998,,Columbia Pictures,,True
1,2,The Godfather,1972,R,24 Mar 1972,175 min,"Crime, Drama",Francis Ford Coppola,"Mario Puzo (screenplay), Francis Ford Coppola ...","Marlon Brando, Al Pacino, James Caan, Richard ...",...,,,,,http://www.rottentomatoes.com/m/godfather/,09 Oct 2001,,Paramount Pictures,http://www.thegodfather.com,True
2,3,The Godfather: Part II,1974,R,20 Dec 1974,202 min,"Crime, Drama",Francis Ford Coppola,"Francis Ford Coppola (screenplay), Mario Puzo ...","Al Pacino, Robert Duvall, Diane Keaton, Robert...",...,,,,,http://www.rottentomatoes.com/m/godfather_part...,24 May 2005,,Paramount Pictures,http://www.thegodfather.com/,True
3,4,The Dark Knight,2008,PG-13,18 Jul 2008,152 min,"Action, Crime, Drama",Christopher Nolan,"Jonathan Nolan (screenplay), Christopher Nolan...","Christian Bale, Heath Ledger, Aaron Eckhart, M...",...,,,,,http://www.rottentomatoes.com/m/the_dark_knight/,09 Dec 2008,"$533,316,061",Warner Bros. Pictures/Legendary,http://thedarkknight.warnerbros.com/,True
4,5,12 Angry Men,1957,APPROVED,01 Apr 1957,96 min,"Crime, Drama",Sidney Lumet,"Reginald Rose (story), Reginald Rose (screenplay)","Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",...,,,,,http://www.rottentomatoes.com/m/1000013-12_ang...,06 Mar 2001,,Criterion Collection,http://www.criterion.com/films/27871-12-angry-men,True


In [104]:
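# .copy() gives an independent DataFrame, so later column assignments
# do not trigger a SettingWithCopyWarning
practice_df = practice_df[['Title','Director','Actors','Plot','Genre']].copy()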
practice_df.head()

Unnamed: 0,Title,Director,Actors,Plot,Genre
0,The Shawshank Redemption,Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Two imprisoned men bond over a number of years...,"Crime, Drama"
1,The Godfather,Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Richard ...",The aging patriarch of an organized crime dyna...,"Crime, Drama"
2,The Godfather: Part II,Francis Ford Coppola,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",The early life and career of Vito Corleone in ...,"Crime, Drama"
3,The Dark Knight,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as the Joker emerges fro...,"Action, Crime, Drama"
4,12 Angry Men,Sidney Lumet,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",A jury holdout attempts to prevent a miscarria...,"Crime, Drama"


In [105]:
# keep the first three actors and merge each full name into one lowercase word
# (map is used because assigning to the row inside iterrows() is not guaranteed to update the DataFrame)
practice_df['Actors'] = practice_df['Actors'].map(
    lambda x: [name.lower().replace(' ','') for name in x.split(',')[:3]])

# merge each director's full name into one lowercase word
practice_df['Director'] = practice_df['Director'].map(lambda x: x.replace(' ','').lower())

# keep up to three genres as lowercase words
practice_df['Genre'] = practice_df['Genre'].map(
    lambda x: [genre.lower().replace(' ','') for genre in x.split(',')[:3]])
    
def extract_key_words(plot):
    # instantiating a Rake object
    # by default it uses English stopwords from NLTK (Natural Language Toolkit)
    # and discards all punctuation characters
    r = Rake()

    # extracting the keywords from the text by passing plot 
    r.extract_keywords_from_text(plot)

    # getting the dictionary with key words and their degree scores,
    # then keeping only the key words
    return list(r.get_word_degrees().keys())

# creating the new column with the list of key words for every row
practice_df['Key_words'] = practice_df['Plot'].map(extract_key_words)
    
practice_df.drop(columns = ['Plot'], inplace = True)
practice_df

Unnamed: 0,Title,Director,Actors,Genre,Key_words
0,The Shawshank Redemption,frankdarabont,"[timrobbins, morganfreeman, bobgunton]","[crime, drama]","[two, imprisoned, men, bond, number, years, fi..."
1,The Godfather,francisfordcoppola,"[marlonbrando, alpacino, jamescaan]","[crime, drama]","[aging, patriarch, organized, crime, dynasty, ..."
2,The Godfather: Part II,francisfordcoppola,"[alpacino, robertduvall, dianekeaton]","[crime, drama]","[early, life, career, vito, corleone, 1920s, n..."
3,The Dark Knight,christophernolan,"[christianbale, heathledger, aaroneckhart]","[action, crime, drama]","[menace, known, joker, emerges, mysterious, pa..."
4,12 Angry Men,sidneylumet,"[martinbalsam, johnfiedler, leej.cobb]","[crime, drama]","[jury, holdout, attempts, prevent, miscarriage..."
...,...,...,...,...,...
245,The Lost Weekend,billywilder,"[raymilland, janewyman, phillipterry]","[drama, film-noir]","[desperate, life, chronic, alcoholic, followed..."
246,Short Term 12,destindanielcretton,"[brielarson, johngallagherjr., stephaniebeatriz]",[drama],"[20, something, supervising, staff, member, re..."
247,His Girl Friday,howardhawks,"[carygrant, rosalindrussell, ralphbellamy]","[comedy, drama, romance]","[newspaper, editor, uses, every, trick, book, ..."
248,The Straight Story,davidlynch,"[sissyspacek, janegallowayheitz, josepha.carpe...","[biography, drama]","[old, man, makes, long, journey, lawn, mover, ..."


In [106]:
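# Title should be omitted from bag of words creation
columns = practice_df.columns[1:]

def build_bag_of_words(row):
    words = ''
    for col in columns:
        if col != 'Director':
            # to convert the list into a string of words separated by a space
            words = words + ' '.join(row[col])+ ' '
        else:
            words = words + row[col]+ ' '
    return words

# apply builds the column directly; assigning to the row inside iterrows()
# is not guaranteed to update the DataFrame
practice_df['bag_of_words'] = practice_df.apply(build_bag_of_words, axis=1)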

# let's keep only the title and the bag of words in the dataframe
new_df = practice_df[['Title','bag_of_words']]
new_df.head()

Unnamed: 0,Title,bag_of_words
0,The Shawshank Redemption,frankdarabont timrobbins morganfreeman bobgunt...
1,The Godfather,francisfordcoppola marlonbrando alpacino james...
2,The Godfather: Part II,francisfordcoppola alpacino robertduvall diane...
3,The Dark Knight,christophernolan christianbale heathledger aar...
4,12 Angry Men,sidneylumet martinbalsam johnfiedler leej.cobb...


In [107]:
new_df['bag_of_words'][0]

'frankdarabont timrobbins morganfreeman bobgunton crime drama two imprisoned men bond number years finding solace eventual redemption acts common decency '

In [108]:
count_2 = CountVectorizer()
count_matrix_2 = count_2.fit_transform(practice_df['bag_of_words'])
cosine_sim_2 = cosine_similarity(count_matrix_2, count_matrix_2)
print(cosine_sim_2)

[[1.         0.15789474 0.13764944 ... 0.05263158 0.05263158 0.05564149]
 [0.15789474 1.         0.36706517 ... 0.05263158 0.05263158 0.05564149]
 [0.13764944 0.36706517 1.         ... 0.04588315 0.04588315 0.04850713]
 ...
 [0.05263158 0.05263158 0.04588315 ... 1.         0.05263158 0.05564149]
 [0.05263158 0.05263158 0.04588315 ... 0.05263158 1.         0.05564149]
 [0.05564149 0.05564149 0.04850713 ... 0.05564149 0.05564149 1.        ]]


In [109]:
def recommendations_2(title, cosine_sim = cosine_sim_2):
    
    recommended_movies = []
    
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df['Title'])[i])
        
    return recommended_movies

In [110]:
recommendations_2('The Dark Knight')

['The Dark Knight Rises',
 'Batman Begins',
 'The Prestige',
 'The Green Mile',
 'Witness for the Prosecution',
 'Out of the Past',
 'Rush',
 'The Godfather',
 'V for Vendetta',
 'Reservoir Dogs']

In [111]:
recommendations('The Dark Knight')

['The Dark Knight Rises',
 'Batman Begins',
 'The Prestige',
 'Guardians of the Galaxy Vol. 2',
 'Terminator 2: Judgment Day',
 'The Avengers',
 'The Green Mile',
 'Sin City',
 'The Lord of the Rings: The Fellowship of the Ring',
 'Out of the Past']
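
One simple way to answer the question of whether the two recommenders agree is to measure the overlap of the two lists printed above. The sketch below copies both outputs into plain Python lists; the variable names and the choice of Jaccard similarity as the measure are assumptions for illustration.

```python
with_genre = [
    'The Dark Knight Rises', 'Batman Begins', 'The Prestige', 'The Green Mile',
    'Witness for the Prosecution', 'Out of the Past', 'Rush', 'The Godfather',
    'V for Vendetta', 'Reservoir Dogs',
]
without_genre = [
    'The Dark Knight Rises', 'Batman Begins', 'The Prestige',
    'Guardians of the Galaxy Vol. 2', 'Terminator 2: Judgment Day', 'The Avengers',
    'The Green Mile', 'Sin City',
    'The Lord of the Rings: The Fellowship of the Ring', 'Out of the Past',
]

# titles recommended by both models, in the order of the original recommender
common = [title for title in without_genre if title in with_genre]

# Jaccard similarity: size of the intersection divided by size of the union
jaccard = len(common) / len(set(with_genre) | set(without_genre))

print(common)
print(round(jaccard, 3))  # 0.333
```

Five of the ten titles appear in both lists and the top three are identical, so adding `Genre` mainly changes the lower-ranked recommendations for this movie.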