# Building a movie recommendation system
In this assignment you will be building a movie recommendation system which recommends movies that are similar to the movie you search for. This is knew as the content based filtering approach when building recommendation systems. There are also other methods, but in this assignment we only practice the content based filter method and you will be using what you have learned like k-nn, k-mean to implement the system. We will use the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata) dataset which includes data about 5000 movies for our movie recommendation system.

## The dataset
You will not use the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata) dataset directly instead you use the smaller version that we have prepared for you. The smaller version includes fewer columns than the original, has some pre-processing and contains only one file. Let's read the data using pandas using the cell below.

In [1]:
# In this assignment, we contrast 2 type of distances: cosine and euclidean
# so we should use cosine if observations have different length. Use these distant matrix
# to get similar movies. we also contrast tfidf and countevectorizer
# At the second session we use kmeans and use tfidf and countvectorizer to feed our model
import pandas as pd
df = pd.read_csv("data/tmdb5000.csv")

df.head(5)

Unnamed: 0,title,cast,director,keywords,genres,overview
0,Avatar,"['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",James Cameron,"['culture clash', 'future', 'space war', 'spac...","['Action', 'Adventure', 'Fantasy', 'Science Fi...","In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"['Johnny Depp', 'Orlando Bloom', 'Keira Knight...",Gore Verbinski,"['ocean', 'drug abuse', 'exotic island', 'east...","['Adventure', 'Fantasy', 'Action']","Captain Barbossa, long believed to be dead, ha..."
2,Spectre,"['Daniel Craig', 'Christoph Waltz', 'Léa Seydo...",Sam Mendes,"['spy', 'based on novel', 'secret agent', 'seq...","['Action', 'Adventure', 'Crime']",A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,"['Christian Bale', 'Michael Caine', 'Gary Oldm...",Christopher Nolan,"['dc comics', 'crime fighter', 'terrorist', 's...","['Action', 'Crime', 'Drama', 'Thriller']",Following the death of District Attorney Harve...
4,John Carter,"['Taylor Kitsch', 'Lynn Collins', 'Samantha Mo...",Andrew Stanton,"['based on novel', 'mars', 'medallion', 'space...","['Action', 'Adventure', 'Science Fiction']","John Carter is a war-weary, former military ca..."


### Filling na values
As you can see our dataset has six columns. First, let's check if any columns that have na values and fill those values.

In [2]:
print (df.columns.tolist())
print (df.dtypes)
print ("has na columns: ", df.columns[df.isna().any()].tolist())

['title', 'cast', 'director', 'keywords', 'genres', 'overview']
title       object
cast        object
director    object
keywords    object
genres      object
overview    object
dtype: object
has na columns:  ['director', 'overview']


Fill the na values in the **director** and **overview** columns with empty strings.

In [3]:
# YOUR CODE HERE
df["director"] = df['director'].fillna('')
df["overview"] = df['overview'].fillna('')
# YOUR CODE HERE

df[df["director"] == ""].head(5)

Unnamed: 0,title,cast,director,keywords,genres,overview
3661,Flying By,"['Billy Ray Cyrus', 'Heather Locklear', 'Ahnai...",,[],['Drama'],A real estate developer goes to his 25th high ...
3670,Running Forever,[],,[],['Family'],After being estranged since her mother's death...
3729,Paa,"['Amitabh Bachchan', 'Abhishek Bachchan', 'Vid...",,[],"['Drama', 'Family', 'Foreign']",He suffers from a progeria like syndrome. Ment...
3977,Boynton Beach Club,"['Brenda Vaccaro', 'Dyan Cannon', 'Joseph Bolo...",,['independent film'],"['Comedy', 'Drama', 'Romance']",A handful of men and women of a certain age pi...
4068,Sharkskin,[],,[],[],The Post War II story of Manhattan born Mike E...


### Convert string array to array
The **cast**, **keywords**, **genres** columns are currently string type. We need to convert them into their actual array type. 

In [4]:
print(type(df["cast"][0]))

# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'keywords', 'genres']
# YOUR CODE HERE
for feature in features:
    df[feature] = df[feature].apply(lambda x: literal_eval(x))
# YOUR CODE HERE
    
print (type(df["cast"][0]))

<class 'str'>
<class 'list'>


## Recommend movies using overview
Let's use the **overview** column from the dataset to recommend movies. It is pretty reasonable to use the overview plot to recommend movies beacuse people might like movies with similar plot.

### Represent movie overviews as vectors
We use **tf-idf** to transform our movie reviews into vectors then we can use them to recommend similar movies. Let's use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from sklearn to get a matrix of vectors for each movie overview in our dataset. Complete to code below.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# YOUR CODE HERE
tfidf_matrix = tfidf.fit_transform(df['overview'])
# YOUR CODE HERE

# You shoud get (4803, 20978)
tfidf_matrix.shape

(4803, 20978)

Now, we compute the similarities between each pair of movies then save them for the recommendation later. There are several methods for computing the similarites like consine similarity, euclidean, ... We do not know immediately what is the wright method for this so let's try them out. We use the [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), [euclidean_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html) from sklearn to compute them. Complete the code below to compute the cosine similarity and euclidean distance for each pair of movies. Because euclidean distance is kind of opposite of consine similarity so for convenience, we will multiple euclidan distances with -1 to flip them into similarities. 

In [6]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

cosine_sims = cosine_similarity(tfidf_matrix, tfidf_matrix) # YOUR CODE HERE
euclid_distances = euclidean_distances(tfidf_matrix, tfidf_matrix) # YOUR CODE HERE
# Here we use a simple trick to convert euclidean distances into similarity scores
# we just negative the distances so, higher distances become lower scores
euclid_sims = -euclid_distances

# you should get (4803, 4803)
print (cosine_sims.shape)
print (euclid_sims.shape)

(4803, 4803)
(4803, 4803)


Let's create a look up a movie indexes table from the movie titles to use later.

In [7]:
title2index = pd.Series(df.index, index=df['title']).drop_duplicates()

Complete the **get_recommendations** function below

In [8]:
def get_recommendations(title, sims, title2index, df):
    """
        Get recommendation movies from a movie title
        
        :param title: The query movie title
        :param sims: The similarity matrix for each pair of movie
        :param title2index: Title to index look up table
        :param df: The dataset dataframe
        
        :return: The top 10 recommendation movies
    """
    
    # get the movie index by the title
    idx = title2index[title]

    # get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(sims[idx]))

    # sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1]) # YOUR CODE HERE
   
    # get the scores of the 10 most similar movies
    # we ignore the first one beacause it is the same movie
    sim_scores = sim_scores[1:11]

    # get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # return the top 10 most similar movies
    return df[:].iloc[movie_indices]

In [9]:
get_recommendations('The Dark Knight', cosine_sims, title2index, df)

Unnamed: 0,title,cast,director,keywords,genres,overview
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Gore Verbinski,"[ocean, drug abuse, exotic island, east india ...","[Adventure, Fantasy, Action]","Captain Barbossa, long believed to be dead, ha..."
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",Sam Mendes,"[spy, based on novel, secret agent, sequel]","[Action, Adventure, Crime]",A cryptic message from Bond’s past sends him o...
5,Spider-Man 3,"[Tobey Maguire, Kirsten Dunst, James Franco, T...",Sam Raimi,"[dual identity, amnesia, sandstorm, love of on...","[Fantasy, Action, Adventure]",The seemingly invincible Spider-Man goes up ag...
8,Harry Potter and the Half-Blood Prince,"[Daniel Radcliffe, Rupert Grint, Emma Watson, ...",David Yates,"[witch, magic, broom, school of witchcraft]","[Adventure, Fantasy, Family]","As Harry begins his sixth year at Hogwarts, he..."
10,Superman Returns,"[Brandon Routh, Kevin Spacey, Kate Bosworth, J...",Bryan Singer,"[saving the world, dc comics, invulnerability,...","[Adventure, Fantasy, Action, Science Fiction]",Superman returns to discover his 5-year absenc...
11,Quantum of Solace,"[Daniel Craig, Olga Kurylenko, Mathieu Amalric...",Marc Forster,"[killing, undercover, secret agent, british se...","[Adventure, Action, Thriller, Crime]",Quantum of Solace continues the adventures of ...
12,Pirates of the Caribbean: Dead Man's Chest,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Gore Verbinski,"[witch, fortune teller, bondage, exotic island]","[Adventure, Fantasy, Action]",Captain Jack Sparrow works his way out of a bl...
13,The Lone Ranger,"[Johnny Depp, Armie Hammer, William Fichtner, ...",Gore Verbinski,"[texas, horse, survivor, texas ranger]","[Action, Adventure, Western]",The Texas Rangers chase down a gang of outlaws...
14,Man of Steel,"[Henry Cavill, Amy Adams, Michael Shannon, Kev...",Zack Snyder,"[saving the world, dc comics, superhero, based...","[Action, Adventure, Fantasy, Science Fiction]",A young boy learns that he has extraordinary p...
17,Pirates of the Caribbean: On Stranger Tides,"[Johnny Depp, Penélope Cruz, Ian McShane, Kevi...",Rob Marshall,"[sea, captain, mutiny, sword]","[Adventure, Action, Fantasy]",Captain Jack Sparrow crosses paths with a woma...


In [10]:
get_recommendations('The Avengers', cosine_sims, title2index, df)

Unnamed: 0,title,cast,director,keywords,genres,overview
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Gore Verbinski,"[ocean, drug abuse, exotic island, east india ...","[Adventure, Fantasy, Action]","Captain Barbossa, long believed to be dead, ha..."
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux, R...",Sam Mendes,"[spy, based on novel, secret agent, sequel]","[Action, Adventure, Crime]",A cryptic message from Bond’s past sends him o...
5,Spider-Man 3,"[Tobey Maguire, Kirsten Dunst, James Franco, T...",Sam Raimi,"[dual identity, amnesia, sandstorm, love of on...","[Fantasy, Action, Adventure]",The seemingly invincible Spider-Man goes up ag...
6,Tangled,"[Zachary Levi, Mandy Moore, Donna Murphy, Ron ...",Byron Howard,"[hostage, magic, horse, fairy tale]","[Animation, Family]",When the kingdom's most wanted-and most charmi...
10,Superman Returns,"[Brandon Routh, Kevin Spacey, Kate Bosworth, J...",Bryan Singer,"[saving the world, dc comics, invulnerability,...","[Adventure, Fantasy, Action, Science Fiction]",Superman returns to discover his 5-year absenc...
11,Quantum of Solace,"[Daniel Craig, Olga Kurylenko, Mathieu Amalric...",Marc Forster,"[killing, undercover, secret agent, british se...","[Adventure, Action, Thriller, Crime]",Quantum of Solace continues the adventures of ...
12,Pirates of the Caribbean: Dead Man's Chest,"[Johnny Depp, Orlando Bloom, Keira Knightley, ...",Gore Verbinski,"[witch, fortune teller, bondage, exotic island]","[Adventure, Fantasy, Action]",Captain Jack Sparrow works his way out of a bl...
13,The Lone Ranger,"[Johnny Depp, Armie Hammer, William Fichtner, ...",Gore Verbinski,"[texas, horse, survivor, texas ranger]","[Action, Adventure, Western]",The Texas Rangers chase down a gang of outlaws...
15,The Chronicles of Narnia: Prince Caspian,"[Ben Barnes, William Moseley, Anna Popplewell,...",Andrew Adamson,"[based on novel, fictional place, brother sist...","[Adventure, Family, Fantasy]",One year after their incredible adventures in ...
19,The Hobbit: The Battle of the Five Armies,"[Martin Freeman, Ian McKellen, Richard Armitag...",Peter Jackson,"[corruption, elves, dwarves, orcs]","[Action, Adventure, Fantasy]",Immediately after the events of The Desolation...


In [11]:
get_recommendations('The Dark Knight', euclid_sims, title2index, df)

Unnamed: 0,title,cast,director,keywords,genres,overview
24,King Kong,"[Naomi Watts, Jack Black, Adrien Brody, Thomas...",Peter Jackson,"[film business, screenplay, show business, fil...","[Adventure, Drama, Action]","In 1933 New York, an overly ambitious movie pr..."
32,Alice in Wonderland,"[Mia Wasikowska, Johnny Depp, Anne Hathaway, H...",Tim Burton,"[based on novel, fictional place, queen, fantasy]","[Family, Fantasy, Adventure]","Alice, an unpretentious and individual 19-year..."
34,Monsters University,"[Billy Crystal, John Goodman, Steve Buscemi, H...",Dan Scanlon,"[monster, dormitory, games, animation]","[Animation, Family]",A look at the relationship between Mike and Su...
38,The Amazing Spider-Man 2,"[Andrew Garfield, Emma Stone, Jamie Foxx, Dane...",Marc Webb,"[obsession, marvel comic, sequel, based on com...","[Action, Adventure, Fantasy]","For Peter Parker, life is busy. Between taking..."
66,Up,"[Ed Asner, Christopher Plummer, Jordan Nagai, ...",Pete Docter,"[age difference, central and south america, ba...","[Animation, Comedy, Family, Adventure]",Carl Fredricksen spent his entire life dreamin...
81,Maleficent,"[Angelina Jolie, Elle Fanning, Sharlto Copley,...",Robert Stromberg,"[fairy tale, villain, sleeping beauty, dark fa...","[Fantasy, Adventure, Action, Family]",The untold story of Disney's most iconic villa...
89,Wreck-It Ralph,"[John C. Reilly, Sarah Silverman, Jack McBraye...",Rich Moore,"[support group, product placement, bullying, r...","[Family, Animation, Comedy, Adventure]","Wreck-It Ralph is the 9-foot-tall, 643-pound v..."
146,Madagascar 3: Europe's Most Wanted,"[Ben Stiller, Sacha Baron Cohen, David Schwimm...",Conrad Vernon,"[madagascar, 3d]","[Animation, Family]","Alex, Marty, Gloria and Melman are still tryin..."
163,Watchmen,"[Malin Åkerman, Billy Crudup, Carla Gugino, Je...",Zack Snyder,"[dc comics, secret identity, mass murder, reti...","[Action, Mystery, Science Fiction]",In a gritty and alternate 1985 the glory days ...
198,R.I.P.D.,"[Jeff Bridges, Ryan Reynolds, Kevin Bacon, Ste...",Robert Schwentke,"[gold, police operation, partner, revenge]","[Fantasy, Action, Comedy, Crime]",A recently slain cop joins a team of undead po...


In [12]:
get_recommendations('The Avengers', euclid_sims, title2index, df)

Unnamed: 0,title,cast,director,keywords,genres,overview
34,Monsters University,"[Billy Crystal, John Goodman, Steve Buscemi, H...",Dan Scanlon,"[monster, dormitory, games, animation]","[Animation, Family]",A look at the relationship between Mike and Su...
77,Inside Out,"[Amy Poehler, Phyllis Smith, Richard Kind, Bil...",Pete Docter,"[dream, cartoon, imaginary friend, animation]","[Drama, Comedy, Animation, Family]","Growing up can be a bumpy road, and it's no ex..."
89,Wreck-It Ralph,"[John C. Reilly, Sarah Silverman, Jack McBraye...",Rich Moore,"[support group, product placement, bullying, r...","[Family, Animation, Comedy, Adventure]","Wreck-It Ralph is the 9-foot-tall, 643-pound v..."
122,X-Men Origins: Wolverine,"[Hugh Jackman, Liev Schreiber, Danny Huston, L...",Gavin Hood,"[corruption, mutant, boxer, army]","[Adventure, Action, Thriller, Science Fiction]","After seeking to live a normal life, Logan set..."
140,White House Down,"[Channing Tatum, Jamie Foxx, Joey King, Maggie...",Roland Emmerich,"[usa president, conspiracy, secret service, th...","[Action, Drama, Thriller]",Capitol Policeman John Cale has just been deni...
163,Watchmen,"[Malin Åkerman, Billy Crudup, Carla Gugino, Je...",Zack Snyder,"[dc comics, secret identity, mass murder, reti...","[Action, Mystery, Science Fiction]",In a gritty and alternate 1985 the glory days ...
232,The Wolverine,"[Hugh Jackman, Hiroyuki Sanada, Famke Janssen,...",James Mangold,"[japan, samurai, mutant, world war i]","[Action, Science Fiction, Adventure, Fantasy]",Wolverine faces his ultimate nemesis - and tes...
257,Real Steel,"[Hugh Jackman, Dakota Goyo, Evangeline Lilly, ...",Shawn Levy,"[father son relationship, fight, sport, robot]","[Action, Science Fiction, Drama]","In the near-future, Charlie Kenton is a washed..."
281,American Gangster,"[Denzel Washington, Russell Crowe, Chiwetel Ej...",Ridley Scott,"[underdog, black people, drug traffic, drug sm...","[Drama, Crime]",Following the death of his employer and mentor...
282,True Lies,"[Arnold Schwarzenegger, Jamie Lee Curtis, Tom ...",James Cameron,"[spy, terrorist, florida, gun]","[Action, Thriller]",Harry Tasker is a secret agent for the United ...


If you implement correctly, for cosine similarity case you should get **The Dark Knight Rises**, **Batman Returns** in top 2 for **The Dark Knight** movie and **Avengers: Age of Ultron**, **Plastic** for the **The Avengers** movie.
<br>
But in the ecludian case, you can see the top 4 recommendation movies are nothing similar to the query movie. There is a problem here, because the **overview** for the top 4 movie is empty so we can ignore the top 4 movies and use from the 5th movies which are much similar to the query movie.

## Recommend movies using multiple features
Previously, we only use the overview plot of the movies to recommend similar movies. Many people might like other movies beacause they are the same genres or from the same director, actors, actresses.

We will use the **cast**, **director**, **keywords**, **genres** features from our dataset to recommend movies. First, let's preprocess the data. Complete the code below to apply the **clean_data** function to the features. Here, we lower case the people name and replace the space character with the underscore character so that people name are preserved.

In [13]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "_")) for i in x]
    else:
        # check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
        
# apply clean_data function to your features.
features = ['cast', 'director', 'keywords', 'genres']

for feature in features:
    df[feature] = df[feature].apply(clean_data) # YOUR CODE HERE

In [14]:
df.head(5)

Unnamed: 0,title,cast,director,keywords,genres,overview
0,Avatar,"[sam_worthington, zoe_saldana, sigourney_weave...",jamescameron,"[culture_clash, future, space_war, space_colony]","[action, adventure, fantasy, science_fiction]","In the 22nd century, a paraplegic Marine is di..."
1,Pirates of the Caribbean: At World's End,"[johnny_depp, orlando_bloom, keira_knightley, ...",goreverbinski,"[ocean, drug_abuse, exotic_island, east_india_...","[adventure, fantasy, action]","Captain Barbossa, long believed to be dead, ha..."
2,Spectre,"[daniel_craig, christoph_waltz, léa_seydoux, r...",sammendes,"[spy, based_on_novel, secret_agent, sequel]","[action, adventure, crime]",A cryptic message from Bond’s past sends him o...
3,The Dark Knight Rises,"[christian_bale, michael_caine, gary_oldman, a...",christophernolan,"[dc_comics, crime_fighter, terrorist, secret_i...","[action, crime, drama, thriller]",Following the death of District Attorney Harve...
4,John Carter,"[taylor_kitsch, lynn_collins, samantha_morton,...",andrewstanton,"[based_on_novel, mars, medallion, space_travel]","[action, adventure, science_fiction]","John Carter is a war-weary, former military ca..."


Let's combine the four features into one feature. Complete the code below.

In [15]:
print(df[['cast', 'director', 'keywords', 'genres']].head())

                                                cast  ...                                         genres
0  [sam_worthington, zoe_saldana, sigourney_weave...  ...  [action, adventure, fantasy, science_fiction]
1  [johnny_depp, orlando_bloom, keira_knightley, ...  ...                   [adventure, fantasy, action]
2  [daniel_craig, christoph_waltz, léa_seydoux, r...  ...                     [action, adventure, crime]
3  [christian_bale, michael_caine, gary_oldman, a...  ...               [action, crime, drama, thriller]
4  [taylor_kitsch, lynn_collins, samantha_morton,...  ...           [action, adventure, science_fiction]

[5 rows x 4 columns]


In [16]:
def combine(x):
    return ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['keywords']) + ' ' + ' '.join(x['genres'])
df['combined'] = df[['cast', 'director', 'keywords', 'genres']].apply(combine, axis=1) # YOUR CODE HERE

In [17]:
df.head(5)

Unnamed: 0,title,cast,director,keywords,genres,overview,combined
0,Avatar,"[sam_worthington, zoe_saldana, sigourney_weave...",jamescameron,"[culture_clash, future, space_war, space_colony]","[action, adventure, fantasy, science_fiction]","In the 22nd century, a paraplegic Marine is di...",sam_worthington zoe_saldana sigourney_weaver s...
1,Pirates of the Caribbean: At World's End,"[johnny_depp, orlando_bloom, keira_knightley, ...",goreverbinski,"[ocean, drug_abuse, exotic_island, east_india_...","[adventure, fantasy, action]","Captain Barbossa, long believed to be dead, ha...",johnny_depp orlando_bloom keira_knightley stel...
2,Spectre,"[daniel_craig, christoph_waltz, léa_seydoux, r...",sammendes,"[spy, based_on_novel, secret_agent, sequel]","[action, adventure, crime]",A cryptic message from Bond’s past sends him o...,daniel_craig christoph_waltz léa_seydoux ralph...
3,The Dark Knight Rises,"[christian_bale, michael_caine, gary_oldman, a...",christophernolan,"[dc_comics, crime_fighter, terrorist, secret_i...","[action, crime, drama, thriller]",Following the death of District Attorney Harve...,christian_bale michael_caine gary_oldman anne_...
4,John Carter,"[taylor_kitsch, lynn_collins, samantha_morton,...",andrewstanton,"[based_on_novel, mars, medallion, space_travel]","[action, adventure, science_fiction]","John Carter is a war-weary, former military ca...",taylor_kitsch lynn_collins samantha_morton wil...


### Represent the combined feature as vectors
We do not use **tf-idf** for this feature because we do not want for example a director appears in many movies which does not mean that director is less important so we use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) instead.
<br>
Complete the code below to get the **count_matrix**

In [18]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['combined']) # YOUR CODE HERE

Complete the code below to calculate the cosine similarities, euclidean distances using the **count_matrix**

In [19]:
cosine_sims2 = cosine_similarity(count_matrix, count_matrix) # YOUR CODE HERE
euclid_distances2 = euclidean_distances(count_matrix, count_matrix) # YOUR CODE HERE

# Here we use a simple trick to convert euclidean distances into similarity scores
# we just negative the distances so, higher distances become lower scores
euclid_sims2 = -euclid_distances2

# you should get (4803, 4803)
print (cosine_sims2.shape)
print (euclid_sims2.shape)

(4803, 4803)
(4803, 4803)


In [20]:
df = df.reset_index()
indices = pd.Series(df.index, index=df['title'])

Let's check out the recommendations

In [21]:
get_recommendations('The Dark Knight', cosine_sims2, title2index, df)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
8,8,Harry Potter and the Half-Blood Prince,"[daniel_radcliffe, rupert_grint, emma_watson, ...",davidyates,"[witch, magic, broom, school_of_witchcraft]","[adventure, fantasy, family]","As Harry begins his sixth year at Hogwarts, he...",daniel_radcliffe rupert_grint emma_watson tom_...
15,15,The Chronicles of Narnia: Prince Caspian,"[ben_barnes, william_moseley, anna_popplewell,...",andrewadamson,"[based_on_novel, fictional_place, brother_sist...","[adventure, family, fantasy]",One year after their incredible adventures in ...,ben_barnes william_moseley anna_popplewell ska...
22,22,The Hobbit: The Desolation of Smaug,"[martin_freeman, ian_mckellen, richard_armitag...",peterjackson,"[elves, dwarves, orcs, hobbit]","[adventure, fantasy]","The Dwarves, Bilbo and Gandalf have successful...",martin_freeman ian_mckellen richard_armitage k...
23,23,The Golden Compass,"[dakota_blue_richards, nicole_kidman, daniel_c...",chrisweitz,"[england, compass, experiment, lordship]","[adventure, fantasy]","After overhearing a shocking secret, precociou...",dakota_blue_richards nicole_kidman daniel_crai...
32,32,Alice in Wonderland,"[mia_wasikowska, johnny_depp, anne_hathaway, h...",timburton,"[based_on_novel, fictional_place, queen, fantasy]","[family, fantasy, adventure]","Alice, an unpretentious and individual 19-year...",mia_wasikowska johnny_depp anne_hathaway helen...
34,34,Monsters University,"[billy_crystal, john_goodman, steve_buscemi, h...",danscanlon,"[monster, dormitory, games, animation]","[animation, family]",A look at the relationship between Mike and Su...,billy_crystal john_goodman steve_buscemi helen...
37,37,Oz: The Great and Powerful,"[james_franco, mila_kunis, rachel_weisz, miche...",samraimi,"[circus, witch, magic, hope]","[fantasy, adventure, family]","Oscar Diggs, a small-time circus illusionist a...",james_franco mila_kunis rachel_weisz michelle_...
42,42,Toy Story 3,"[tom_hanks, tim_allen, ned_beatty, joan_cusack]",leeunkrich,"[hostage, college, toy, barbie]","[animation, family, comedy]","Woody, Buzz, and the rest of Andy's toys haven...",tom_hanks tim_allen ned_beatty joan_cusack lee...
54,54,The Good Dinosaur,"[raymond_ochoa, jack_bright, jeffrey_wright, f...",petersohn,"[tyrannosaurus_rex, friends, alternate_history...","[adventure, animation, family]",An epic journey into the world of dinosaurs wh...,raymond_ochoa jack_bright jeffrey_wright franc...
55,55,Brave,"[kelly_macdonald, julie_walters, billy_connoll...",brendachapman,"[scotland, rebel, bravery, kingdom]","[animation, adventure, comedy, family]",Brave is set in the mystical Scottish Highland...,kelly_macdonald julie_walters billy_connolly e...


In [22]:
get_recommendations('The Avengers', cosine_sims2, title2index, df)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
25,25,Titanic,"[kate_winslet, leonardo_dicaprio, frances_fish...",jamescameron,"[shipwreck, iceberg, ship, panic]","[drama, romance, thriller]","84 years later, a 101-year-old woman named Ros...",kate_winslet leonardo_dicaprio frances_fisher ...
34,34,Monsters University,"[billy_crystal, john_goodman, steve_buscemi, h...",danscanlon,"[monster, dormitory, games, animation]","[animation, family]",A look at the relationship between Mike and Su...,billy_crystal john_goodman steve_buscemi helen...
42,42,Toy Story 3,"[tom_hanks, tim_allen, ned_beatty, joan_cusack]",leeunkrich,"[hostage, college, toy, barbie]","[animation, family, comedy]","Woody, Buzz, and the rest of Andy's toys haven...",tom_hanks tim_allen ned_beatty joan_cusack lee...
49,49,The Great Gatsby,"[leonardo_dicaprio, tobey_maguire, carey_mulli...",bazluhrmann,"[based_on_novel, infidelity, obsession, hope]","[drama, romance]",An adaptation of F. Scott Fitzgerald's Long Is...,leonardo_dicaprio tobey_maguire carey_mulligan...
57,57,WALL·E,"[ben_burtt, elissa_knight, jeff_garlin, fred_w...",andrewstanton,[romantic_comedy],"[animation, family]",WALL·E is the last robot left on an Earth that...,ben_burtt elissa_knight jeff_garlin fred_willa...
60,60,A Christmas Carol,"[gary_oldman, jim_carrey, steve_valentine, dar...",robertzemeckis,"[holiday, based_on_novel, victorian_england, m...","[animation, drama]",Miser Ebenezer Scrooge is awakened on Christma...,gary_oldman jim_carrey steve_valentine daryl_s...
73,73,Evan Almighty,"[steve_carell, lauren_graham, john_goodman, ji...",tomshadyac,"[father_son_relationship, daily_life, married_...","[fantasy, comedy, family]",God contacts Congressman Evan Baxter and tells...,steve_carell lauren_graham john_goodman jimmy_...
77,77,Inside Out,"[amy_poehler, phyllis_smith, richard_kind, bil...",petedocter,"[dream, cartoon, imaginary_friend, animation]","[drama, comedy, animation, family]","Growing up can be a bumpy road, and it's no ex...",amy_poehler phyllis_smith richard_kind bill_ha...
100,100,The Curious Case of Benjamin Button,"[cate_blanchett, brad_pitt, tilda_swinton, jul...",davidfincher,"[diary, navy, funeral, tea]","[fantasy, drama, thriller, mystery]","Tells the story of Benjamin Button, a man who ...",cate_blanchett brad_pitt tilda_swinton julia_o...
105,105,Alice Through the Looking Glass,"[johnny_depp, mia_wasikowska, anne_hathaway, h...",jamesbobin,"[based_on_novel, clock, queen, sequel]",[fantasy],"In the sequel to Tim Burton's ""Alice in Wonder...",johnny_depp mia_wasikowska anne_hathaway helen...


In [23]:
get_recommendations('The Dark Knight', euclid_sims2, title2index, df)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
358,358,Atlantis: The Lost Empire,"[michael_j._fox, corey_burton, claudia_christi...",garytrousdale,"[sea, atlantis, animation, underwater]","[animation, family, adventure, science_fiction]",The world's most highly qualified crew of arch...,michael_j._fox corey_burton claudia_christian ...
2877,2877,The Host,"[song_kang-ho, park_hae-il, bae_doona, ko_ah-s...",bongjoon-ho,"[river, mobile_phone, bravery, archer]","[horror, drama, science_fiction]",Gang-du is a dim-witted man working at his fat...,song_kang-ho park_hae-il bae_doona ko_ah-sung ...
66,66,Up,"[ed_asner, christopher_plummer, jordan_nagai, ...",petedocter,"[age_difference, central_and_south_america, ba...","[animation, comedy, family, adventure]",Carl Fredricksen spent his entire life dreamin...,ed_asner christopher_plummer jordan_nagai bob_...
254,254,The Smurfs,"[hank_azaria, neil_patrick_harris, jayma_mays,...",rajagosnell,"[moon, magic, based_on_comic_book, animation]","[animation, family, adventure, comedy]",When the evil wizard Gargamel chases the tiny ...,hank_azaria neil_patrick_harris jayma_mays sof...
503,503,The Adventures of Rocky & Bullwinkle,"[rene_russo, jason_alexander, piper_perabo, ra...",desmcanuff,"[adventure, cartoon, comedy, breaking_the_four...","[action, adventure, animation, comedy]",Rocky and Bullwinkle have been living off the ...,rene_russo jason_alexander piper_perabo randy_...
572,572,Hook,"[robin_williams, dustin_hoffman, julia_roberts...",stevenspielberg,"[flying, swordplay, sword, fantasy]","[adventure, fantasy, comedy, family]",The boy who wasn't supposed to grow up—Peter P...,robin_williams dustin_hoffman julia_roberts bo...
661,661,Zathura: A Space Adventure,"[jonah_bobo, josh_hutcherson, dax_shepard, kri...",jonfavreau,"[adventure, house, alien, giant_robot]","[family, fantasy, science_fiction, adventure]","After their father is called into work, two yo...",jonah_bobo josh_hutcherson dax_shepard kristen...
1016,1016,Kate & Leopold,"[meg_ryan, hugh_jackman, liev_schreiber, breck...",jamesmangold,"[lover_(female), love_of_one's_life, time_trav...","[comedy, fantasy, romance, science_fiction]",When her scientist ex-boyfriend discovers a po...,meg_ryan hugh_jackman liev_schreiber breckin_m...
1086,1086,Aliens in the Attic,"[carter_jenkins, austin_butler, kevin_nealon, ...",johnschultz,"[alien, comedy, duringcreditsstinger, beforecr...","[adventure, comedy, family, fantasy]","It's summer vacation, but the Pearson family k...",carter_jenkins austin_butler kevin_nealon robe...
1165,1165,Back to the Future Part III,"[michael_j._fox, christopher_lloyd, mary_steen...",robertzemeckis,"[railroad_robber, california, delorean, indian...","[adventure, comedy, family, science_fiction]",The final installment of the Back to the Futur...,michael_j._fox christopher_lloyd mary_steenbur...


In [24]:
get_recommendations('The Avengers', euclid_sims2, title2index, df)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
2877,2877,The Host,"[song_kang-ho, park_hae-il, bae_doona, ko_ah-s...",bongjoon-ho,"[river, mobile_phone, bravery, archer]","[horror, drama, science_fiction]",Gang-du is a dim-witted man working at his fat...,song_kang-ho park_hae-il bae_doona ko_ah-sung ...
3391,3391,Dom Hemingway,"[jude_law, demián_bichir, richard_e._grant, ma...",richardshepard,"[growing_up, money, crime, fatherhood]","[comedy, crime, drama]",After spending 12 years in prison for keeping ...,jude_law demián_bichir richard_e._grant matthe...
3530,3530,Don Jon,"[joseph_gordon-levitt, scarlett_johansson, jul...",josephgordon-levitt,"[pornography, sex, sex_addiction, male_female_...","[romance, comedy, drama]","A New Jersey guy dedicated to his family, frie...",joseph_gordon-levitt scarlett_johansson julian...
3940,3940,Oldboy,"[choi_min-sik, yoo_ji-tae, kang_hye-jung, kim_...",parkchan-wook,"[sushi_restaurant, rage_and_hate, notebook, da...","[drama, thriller, mystery, action]","With no clue how he came to be imprisoned, dru...",choi_min-sik yoo_ji-tae kang_hye-jung kim_byeo...
77,77,Inside Out,"[amy_poehler, phyllis_smith, richard_kind, bil...",petedocter,"[dream, cartoon, imaginary_friend, animation]","[drama, comedy, animation, family]","Growing up can be a bumpy road, and it's no ex...",amy_poehler phyllis_smith richard_kind bill_ha...
855,855,Gods and Generals,"[stephen_lang, jeff_daniels, robert_duvall, ke...",ronaldf.maxwell,"[war, battle, union_soldier, confederate_soldier]","[drama, history, war]",The film centers mostly around the personal an...,stephen_lang jeff_daniels robert_duvall kevin_...
875,875,Moulin Rouge!,"[nicole_kidman, ewan_mcgregor, john_leguizamo,...",bazluhrmann,"[duke, musical, writer's_block, music]","[drama, music, romance]",A celebration of love and creative inspiration...,nicole_kidman ewan_mcgregor john_leguizamo jim...
1078,1078,The Ant Bully,"[julia_roberts, meryl_streep, nicolas_cage, pa...",johna.davis,"[ant, child_hero, shrinking, ant-hill]","[fantasy, adventure, animation, comedy]",Fed up with being targeted by the neighborhood...,julia_roberts meryl_streep nicolas_cage paul_g...
1304,1304,The Grandmaster,"[tony_leung_chiu-wai, zhang_ziyi, song_hye-kyo...",wongkar-wai,"[martial_arts, kung_fu, biography, kung_fu_mas...","[action, drama, history]",Ip Man's peaceful life in Foshan changes after...,tony_leung_chiu-wai zhang_ziyi song_hye-kyo ch...
1647,1647,Wicker Park,"[josh_hartnett, rose_byrne, matthew_lillard, d...",paulmcguigan,"[love_of_one's_life, leave, look-alike, intrigue]","[drama, mystery, romance, thriller]","Matthew, a young advertising executive in Chic...",josh_hartnett rose_byrne matthew_lillard diane...


If you implement correctly, you can see the results are pretty reasonable. We get **The Dark Knight Rises**, **Batman Begins**, **The Prestige** for the **The Dark Night** movie, they are all directed by **Christopher Nolan**. We get **Avengers: Age of Ultron**, **Captain America: Civil War**, **Iron Man 2** for the **The Avengers** movie; they all have the **Robert Downey Jr** actor in the cast who is a very popular actor. 

## Clustering movies using k-mean

We have used k-nn to find similar movies using cosine similarity and ecludian distance in previous sections. In this section let's use the k-mean algorithm to cluster the movies. We will use the [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) class from sklearn.

Complete the **kmeans_tfidf** and **kmeans_count** variable using the **KMeans** class with the following parameters:
<br>
**n_clusters**: **500**
<br>
**random_state**: **96**
<br>
**max_iter**: **1000**
<br>
Remember to use the **fit** function on the **tfidf_matrix** and **count_matrix** variables.

In [25]:
from sklearn.cluster import KMeans

kmeans_tfidf = KMeans(n_clusters=500, random_state=96, max_iter=1000).fit(tfidf_matrix) # YOUR CODE HERE
kmeans_count = KMeans(n_clusters=500, random_state=96, max_iter=1000).fit(count_matrix) # YOUR CODE HERE

Read the **get_cluster_movies** and **get_recommendations_kmean** functions below and complete them.

In [26]:
def get_cluster_movies(labels):
    """
        Create movies look up table by label
        :param labels: Array of labels with indexes are movie indexes
        :return: Return a dictionary with cluster labels are keys and items are array movie indexes
    """
    res_dict = {}
    for i, l in enumerate(labels):
        if l not in res_dict:
            res_dict[l] = []
        res_dict[l].append(i)
    return res_dict

def get_recommendations_kmean(title, title2index, cluster_dict, labels):
    """
        Get the movies in the same cluster give a movie title
        
        :param title: The title of the query movie
        :param title2index: Title to index look up table
        :param cluster_dict: A dictionary for looking up movies in the same cluster
        :param labels: Array of labels with indexes are movie indexes
    """
    # get the movie index by the title
    idx = title2index[title]
    
    # Get the movie indices with maximun of 10
    movie_indices =  cluster_dict[labels[idx]][:10] # YOUR CODE HERE
    
    # Return the top 10 most similar movies
    return df[:].iloc[movie_indices]

Let's check the resutls

In [27]:
tfidf_clusters = get_cluster_movies(kmeans_tfidf.labels_)
count_clusters = get_cluster_movies(kmeans_count.labels_)

In [28]:
get_recommendations_kmean('The Dark Knight', title2index, tfidf_clusters, kmeans_tfidf.labels_)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
3,3,The Dark Knight Rises,"[christian_bale, michael_caine, gary_oldman, a...",christophernolan,"[dc_comics, crime_fighter, terrorist, secret_i...","[action, crime, drama, thriller]",Following the death of District Attorney Harve...,christian_bale michael_caine gary_oldman anne_...
9,9,Batman v Superman: Dawn of Justice,"[ben_affleck, henry_cavill, gal_gadot, amy_adams]",zacksnyder,"[dc_comics, vigilante, superhero, based_on_com...","[action, adventure, fantasy]",Fearing the actions of a god-like Super Hero l...,ben_affleck henry_cavill gal_gadot amy_adams z...
65,65,The Dark Knight,"[christian_bale, heath_ledger, aaron_eckhart, ...",christophernolan,"[dc_comics, crime_fighter, secret_identity, sc...","[drama, action, crime, thriller]",Batman raises the stakes in his war on crime. ...,christian_bale heath_ledger aaron_eckhart mich...
299,299,Batman Forever,"[val_kilmer, tommy_lee_jones, jim_carrey, nico...",joelschumacher,"[riddle, dc_comics, rose, gotham_city]","[action, crime, fantasy]",The Dark Knight of Gotham City confronts a das...,val_kilmer tommy_lee_jones jim_carrey nicole_k...
428,428,Batman Returns,"[michael_keaton, danny_devito, michelle_pfeiff...",timburton,"[holiday, corruption, double_life, dc_comics]","[action, fantasy]","Having defeated the Joker, Batman now faces th...",michael_keaton danny_devito michelle_pfeiffer ...
3854,3854,"Batman: The Dark Knight Returns, Part 2","[peter_weller, ariel_winter, david_selby, mich...",jayoliva,"[dc_comics, future, joker, robin]","[action, animation]",Batman has stopped the reign of terror that Th...,peter_weller ariel_winter david_selby michael_...


In [29]:
get_recommendations_kmean('The Avengers', title2index, tfidf_clusters, kmeans_tfidf.labels_)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
16,16,The Avengers,"[robert_downey_jr., chris_evans, mark_ruffalo,...",josswhedon,"[new_york, shield, marvel_comic, superhero]","[science_fiction, action, adventure]",When an unexpected enemy emerges and threatens...,robert_downey_jr. chris_evans mark_ruffalo chr...
370,370,Now You See Me 2,"[jesse_eisenberg, woody_harrelson, mark_ruffal...",jonm.chu,"[london_england, china, magic, secret_society]","[action, adventure, comedy, crime]",One year after outwitting the FBI and winning ...,jesse_eisenberg woody_harrelson mark_ruffalo d...
519,519,Now You See Me,"[jesse_eisenberg, mark_ruffalo, woody_harrelso...",louisleterrier,"[paris, bank, secret, fbi]","[thriller, crime]",An FBI agent and an Interpol detective track a...,jesse_eisenberg mark_ruffalo woody_harrelson m...
658,658,Death Race,"[jason_statham, joan_allen, ian_mcshane, tyres...",paulw.s.anderson,"[car_race, dystopia, matter_of_life_and_death,...","[action, thriller, science_fiction]","Terminal Island, New York: 2020. Overcrowding ...",jason_statham joan_allen ian_mcshane tyrese_gi...
1443,1443,American Outlaws,"[colin_farrell, scott_caan, ali_larter, gabrie...",lesmayfield,"[sheriff, horse, outlaw, jesse_james]","[action, western]",When a Midwest town learns that a corrupt rail...,colin_farrell scott_caan ali_larter gabriel_ma...
1847,1847,GoodFellas,"[robert_de_niro, ray_liotta, joe_pesci, lorrai...",martinscorsese,"[prison, based_on_novel, florida, 1970s]","[drama, crime]","The true story of Henry Hill, a half-Irish, ha...",robert_de_niro ray_liotta joe_pesci lorraine_b...
2212,2212,Triple 9,"[casey_affleck, chiwetel_ejiofor, woody_harrel...",johnhillcoat,"[heist, betrayal, dirty_cop]","[action, thriller]",A gang of criminals and corrupt cops plan the ...,casey_affleck chiwetel_ejiofor woody_harrelson...
4124,4124,This Thing of Ours,"[james_caan, james_caan, frank_vincent, vincen...",dannyprovenzano,[heist_mafia_internet],"[drama, action, thriller]","Using the Internet and global satellites, a gr...",james_caan james_caan frank_vincent vincent_pa...
4256,4256,Checkmate,"[danny_glover, sean_astin, vinnie_jones, misch...",timothywoodwardjr.,[],"[thriller, action, crime]",Six people are thrown together during an elabo...,danny_glover sean_astin vinnie_jones mischa_ba...
4268,4268,"Lock, Stock and Two Smoking Barrels","[jason_flemyng, dexter_fletcher, nick_moran, j...",guyritchie,"[ambush, alcohol, shotgun, tea]","[comedy, crime]",A card sharp and his unwillingly-enlisted frie...,jason_flemyng dexter_fletcher nick_moran jason...


In [30]:
get_recommendations_kmean('The Dark Knight', title2index, count_clusters, kmeans_count.labels_)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
3,3,The Dark Knight Rises,"[christian_bale, michael_caine, gary_oldman, a...",christophernolan,"[dc_comics, crime_fighter, terrorist, secret_i...","[action, crime, drama, thriller]",Following the death of District Attorney Harve...,christian_bale michael_caine gary_oldman anne_...
65,65,The Dark Knight,"[christian_bale, heath_ledger, aaron_eckhart, ...",christophernolan,"[dc_comics, crime_fighter, secret_identity, sc...","[drama, action, crime, thriller]",Batman raises the stakes in his war on crime. ...,christian_bale heath_ledger aaron_eckhart mich...
119,119,Batman Begins,"[christian_bale, michael_caine, liam_neeson, k...",christophernolan,"[himalaya, martial_arts, dc_comics, crime_figh...","[action, crime, drama]","Driven by tragedy, billionaire Bruce Wayne ded...",christian_bale michael_caine liam_neeson katie...


In [31]:
get_recommendations_kmean('The Avengers', title2index, count_clusters, kmeans_count.labels_)

Unnamed: 0,index,title,cast,director,keywords,genres,overview,combined
7,7,Avengers: Age of Ultron,"[robert_downey_jr., chris_hemsworth, mark_ruff...",josswhedon,"[marvel_comic, sequel, superhero, based_on_com...","[action, adventure, science_fiction]",When Tony Stark tries to jumpstart a dormant p...,robert_downey_jr. chris_hemsworth mark_ruffalo...
16,16,The Avengers,"[robert_downey_jr., chris_evans, mark_ruffalo,...",josswhedon,"[new_york, shield, marvel_comic, superhero]","[science_fiction, action, adventure]",When an unexpected enemy emerges and threatens...,robert_downey_jr. chris_evans mark_ruffalo chr...
174,174,The Incredible Hulk,"[edward_norton, liv_tyler, tim_roth, william_h...",louisleterrier,"[new_york, rio_de_janeiro, marvel_comic, super...","[science_fiction, action, adventure]",Scientist Bruce Banner scours the planet for a...,edward_norton liv_tyler tim_roth william_hurt ...


As you can see the result are pretty reasonable. We encourage you to fine tune the k-mean paramters to see different results.