# Notebook Summary

A proof of concept (POC) to test whether crowd-sourced IMDB tags can be used to find similar movies, which could serve as the basis for a content-based recommendation system.

In [1]:
import os
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import requests

In [9]:
# IMDB does not have an API, so we have to scrape their webpages manually
# I've selected a few major film studios to conduct this POC

MOVIE_LIST_URLS = {
    'Walt Disney Pictures': {
        'url': 'https://www.imdb.com/search/title?companies=co0008970&sort=boxoffice_gross_us,desc'
    },
    '20th Century Fox': {
        'url': 'https://www.imdb.com/search/title?companies=co0000756&sort=boxoffice_gross_us,desc'
    },
    'A24': {
        'url': 'https://www.imdb.com/search/title?companies=co0390816&sort=boxoffice_gross_us,desc'
    },
    'Sony Pictures': {
        'url': 'https://www.imdb.com/search/title?companies=co0026545&sort=boxoffice_gross_us,desc'
    }
}

MINIMUM_REVENUE = 1000000 # Only include movies that grossed at least 1 million

In [10]:
# Scrape the URLs above to get a list of movies and their IMDB ids

movies = []
for studio, val in MOVIE_LIST_URLS.items():
    has_next_page = True
    url = val['url']
    while has_next_page:
        result = requests.get(url)
        soup = bs(result.content, 'lxml')
        list_items = soup.find_all("div", "lister-item-content")
        for item in list_items:
            gross_span = item.find("span", string='Gross:')
            if gross_span is None:
                # Results are sorted by gross (descending), so a missing gross means we can stop paging
                has_next_page = False
            else:
                revenue = int(gross_span.next_sibling.next_sibling['data-value'].replace(',', ''))
                if revenue >= MINIMUM_REVENUE:
                    header = item.find("h3", "lister-item-header")
                    movie_title = header.find("a").contents[0]
                    movie_year = header.find("span", "lister-item-year").contents[0]
                    movie_href = header.find("a")['href']
                    movie_id = movie_href.split("/")[2]
                    movies.append({
                        'Title': movie_title + " " + movie_year,
                        'id': movie_id
                    })
                else:
                    # Below the revenue threshold, so later pages are not needed
                    has_next_page = False
        # If there's a "Next" page, we scrape that page as well
        next_links = soup.find_all("a", "lister-page-next next-page")
        if next_links:
            url = 'https://www.imdb.com' + next_links[0]['href']
        else:
            has_next_page = False
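
The two string manipulations inside the scraping loop are easy to check in isolation. Below is a minimal sketch, assuming the `href` values look like `/title/tt3890160/?ref_=adv_li_tt` and the gross values look like `107,825,862`. Both sample strings are illustrative rather than scraped, and the helper names `parse_gross` and `parse_movie_id` are hypothetical.

```python
def parse_gross(data_value):
    """Convert a comma-separated gross string such as '107,825,862' to an int."""
    return int(data_value.replace(',', ''))


def parse_movie_id(href):
    """Extract the IMDB id from an href of the assumed form '/title/<id>/...'."""
    # '/title/tt3890160/...'.split('/') -> ['', 'title', 'tt3890160', ...]
    return href.split('/')[2]


sample_gross = '107,825,862'                      # illustrative value
sample_href = '/title/tt3890160/?ref_=adv_li_tt'  # illustrative value

gross = parse_gross(sample_gross)
movie_id = parse_movie_id(sample_href)
print(gross, movie_id)
```

If IMDB changes the shape of its links, the index `[2]` is the part that would need adjusting.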

In [11]:
# Let's remove any duplicates
movies = [dict(t) for t in {tuple(movie.items()) for movie in movies}]
len(movies)

1406
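
The deduplication above works because each movie dict is turned into a hashable tuple of `(key, value)` pairs before being placed in a set. A small sketch on made-up data follows; the id `tt0000001` is a placeholder, not a real lookup. It also shows an alternative that keys on the `id` field, which is an assumption about what should count as a duplicate.

```python
sample_movies = [
    {'Title': 'Baby Driver (2017)', 'id': 'tt3890160'},
    {'Title': 'Baby Driver (2017)', 'id': 'tt3890160'},  # exact duplicate
    {'Title': 'Enemy (2013)', 'id': 'tt0000001'},        # placeholder id
]

# Same technique as the cell above: dict -> tuple of items -> set -> dict.
# The resulting order is arbitrary because sets are unordered.
unique_movies = [dict(t) for t in {tuple(m.items()) for m in sample_movies}]

# Alternative sketch: key on the id, which keeps first-seen order.
unique_by_id = list({m['id']: m for m in sample_movies}.values())

print(len(unique_movies), [m['id'] for m in unique_by_id])
```

The tuple-based approach only works while every value is hashable, so it has to run before a `tags` list is attached to each movie. The id-keyed variant does not have that restriction.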

In [12]:
# Let's use the movie id value to get the url for the tags page and then scrape the tags
for movie in movies:
    tag_url = "https://www.imdb.com/title/{}/keywords?ref_=tt_stry_kw".format(movie['id'])
    result = requests.get(tag_url)
    soup = bs(result.content, 'lxml')
    movie_tags = []
    tag_tds = soup.find_all("td", "soda sodavote")
    for td in tag_tds:
        movie_tags.append(td['data-item-keyword'])
    movie['tags'] = movie_tags

In [58]:
# Example of IMDB crowdsourced tags
movies[13]

{'Title': 'Baby Driver (2017)',
 'id': 'tt3890160',
 'tags': ['thin woman',
  'pig',
  'short skirt',
  'bacon',
  'bare midriff',
  'mini skirt with heels',
  'mini skirt',
  'girl wearing a miniskirt',
  'hard body',
  "camera shot of a woman's legs",
  'long haired woman',
  'tattooed trash',
  'mini dress',
  'miniskirt',
  'hearing impairment',
  'listening to music on headphones',
  'sunglasses',
  'short dress',
  'getaway driver',
  'hot wiring a car',
  'woman wears a miniskirt',
  'casing a robbery',
  'elderly man',
  'driving in reverse',
  'getaway car',
  'face mask',
  'reference to mike myers',
  'sign language',
  'waitress',
  'halloween mask',
  'singing in a car',
  'heist',
  "close up of a woman's butt",
  'ipod',
  'deaf man',
  'ear bud',
  'reference to queen',
  'assisted senior living center',
  'telephone',
  'ear exam',
  'shooting a police officer',
  'deafness',
  'listening to music',
  'imitating a violin player',
  'reference to egyptian reggae',
  're

In [14]:
# Let's get all of the tags in one list so we can count their occurrences
all_tags = []
for movie in movies:
    all_tags.extend(movie['tags'])

In [15]:
len(all_tags)

185929

In [16]:
# Let's reduce to tags that appear more than x times
# TODO: Improve performance of this task

MIN_TAG_COUNT = 5

common_tags = set([tag for tag in all_tags if all_tags.count(tag) > MIN_TAG_COUNT])
common_tags = sorted(list(common_tags))
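
One possible way to address the performance TODO: `all_tags.count(tag)` rescans the whole list for every element, which is quadratic in the number of tags. `collections.Counter` counts everything in a single pass. The sketch below uses a toy list named `sample_tags` (hypothetical) in place of the scraped `all_tags`.

```python
from collections import Counter

MIN_TAG_COUNT = 5

# Toy stand-in for the scraped tag list
sample_tags = ['heist'] * 7 + ['getaway car'] * 6 + ['bacon'] * 5 + ['pig'] * 2

tag_counts = Counter(sample_tags)  # one pass over the list

# Keep tags that appear strictly more than MIN_TAG_COUNT times, sorted alphabetically
sample_common_tags = sorted(
    tag for tag, count in tag_counts.items() if count > MIN_TAG_COUNT
)
print(sample_common_tags)
```

Note that `'bacon'` appears exactly 5 times and is excluded, which matches the strict `>` comparison in the cell above.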

In [17]:
# We'll use these as features in our dataset
len(common_tags)

5665

In [18]:
movie_titles = [str(movie['Title']) for movie in movies]
movie_features = pd.DataFrame(0, index = movie_titles, columns=common_tags)

In [19]:
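# Our movie features dataframe has each movie as a row, and a 1 in a column if it has that tag
common_tag_set = set(common_tags)  # build once instead of on every iteration
for movie in movies:
    # Recent pandas versions reject a set as a .loc indexer, so convert to a list
    movie_tags = list(set(movie['tags']) & common_tag_set)
    movie_features.loc[movie['Title'], movie_tags] = 1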

In [20]:
movie_features.head()

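| | 007 | 10 year old | 11 year old | 12 year old | 13 year old | 15 year old | 16 year old | 1600s | 17 year old | 17th century | ... | young love | younger version of character | youtube | yuppie | zebra | zero gravity | zip line | zippo lighter | zombie | zoo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Monsters University (2013) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Meet the Spartans (2008) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| War for the Planet of the Apes (2017) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Enemy (2013) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The Day After Tomorrow (2004) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |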


In [21]:
# save this movie features dataframe
movie_features.to_pickle('movie_features.pkl')

If you just want to use the `movie_features` dataframe I've created instead of running the code above, run the first cell (the imports) and then the following cell to load the pickle file.

In [None]:
# load the dataframe from a pickle file
movie_features = pd.read_pickle('movie_features.pkl')

In [33]:
# Use cosine similarity to find similar movies

from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_similar_vectors(df, index, n):
    """
    Finds the positional indices of the n rows in df that are most
    cosine-similar to the row labelled `index`
    """
    similarities = cosine_similarity(df.loc[[index]], df)[0]
    sorted_indices = np.argsort(similarities)[::-1] # reversed so the sorted indices are in descending order
    # Drops the first entry, assumed to be the movie itself (similarity 1), and keeps the top n
    top_n_similar = sorted_indices[1:n + 1]
    return top_n_similar

def get_similar_movies(movie, num_similar=3):
    neighbors = get_cosine_similar_vectors(movie_features, movie, num_similar)
    print("Similar to " + movie + ":")
    print("")
    neighbor_titles = []
    # Let's print out the common tags between the movie and its neighbor so we can see why
    # it's considered similar
    for neighbor in neighbors:
        neighbor = movie_features.iloc[neighbor].name
        neighbor_titles.append(neighbor)
        movie_tags = set(movie_features.loc[movie][movie_features.loc[movie] == 1].index)
        neighbor_tags = set(movie_features.loc[neighbor][movie_features.loc[neighbor] == 1].index)
        shared_tags = movie_tags & neighbor_tags
        print("{}: {}".format(neighbor, ", ".join(shared_tags)))
        print("")
    return neighbor_titles
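
Because every row of `movie_features` contains only 0s and 1s, the cosine similarity between two movies reduces to the number of shared tags divided by the geometric mean of the two tag counts. The sketch below illustrates that identity with plain Python sets. The function name `binary_cosine` and the tag sets are made up for illustration, and returning 0 for an empty tag set is a choice made here.

```python
import math


def binary_cosine(tags_a, tags_b):
    """Cosine similarity of two 0/1 vectors, computed from their tag sets."""
    if not tags_a or not tags_b:
        return 0.0  # choice made in this sketch for movies with no tags
    shared = len(tags_a & tags_b)  # dot product of the two binary vectors
    return shared / math.sqrt(len(tags_a) * len(tags_b))


ocean_a = {'fish', 'ocean', 'shark', 'whale'}
ocean_b = {'fish', 'ocean', 'aquarium', 'crab'}
heist = {'heist', 'getaway car'}

sim_ocean = binary_cosine(ocean_a, ocean_b)  # 2 shared / sqrt(4 * 4)
sim_unrelated = binary_cosine(ocean_a, heist)  # no shared tags
sim_self = binary_cosine(ocean_a, ocean_a)  # identical sets
print(sim_ocean, sim_unrelated, sim_self)
```

The denominator is what keeps heavily tagged movies from dominating: a long list of shared tags counts for less when both movies have hundreds of tags each.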

In [None]:
# Examples of Results

In [28]:
get_similar_movies('Black Panther (2018)', num_similar = 3)

Similar to Black Panther (2018):

Captain America: Civil War (2016): news report, scene during end credits, hostage, spy, undercover, fistfight, car crash, mission, fire, statue, masked superhero, severed arm, strong female character, airplane, redemption, marvel comics, car accident, laboratory, held at gunpoint, henchman, secret laboratory, gunfight, sequel, cameo, ak 47, opening action scene, weapons fire, handcuffs, surprise ending, battlefield, shootout, suspense, jumping from height, knocked out, tough guy, shot to death, slow motion scene, shot in the head, bombardment, pistol, one man army, surprise during end credits, female agent, bomb, female warrior, guilt, bar, armored car, surprise after end credits, snow, death of father, disarming someone, security guard, kicked in the face, haunted by the past, king, shared universe, apartment, moral dilemma, blockbuster, betrayal, waterfall, fighting, ambulance, robbery, fighting in the air, anger, mercenary, final battle, falling fro

['Captain America: Civil War (2016)',
 'Thor: Ragnarok (2017)',
 'Captain America: The Winter Soldier (2014)']

In [29]:
get_similar_movies('Finding Nemo (2003)')

Similar to Finding Nemo (2003):

Finding Dory (2016): friend, no opening credits, shark, starfish, crab, aquarium, ocean, two word title, teamwork, overprotective father, squid, underwater, seagull, talking animal, animal character name in title, computer animation, fish, father son relationship, character's point of view camera shot, whale, cgi animation

Monster House (2006): friendship, scene during end credits, flashback, 3 dimensional, subjective camera, cgi animation, no opening credits, death of wife, explosion, computer animation

Ice Age: The Meltdown (2006): friendship, friend, flashback, turtle, subjective camera, no opening credits, talking animal, cgi film, blockbuster, cgi animation, computer animation



['Finding Dory (2016)', 'Monster House (2006)', 'Ice Age: The Meltdown (2006)']

In [32]:
get_similar_movies('Coco (I) (2017)')

Similar to Coco (I) (2017):

Rachel Getting Married (2008): thief, hammock, death, mother daughter relationship, family relationships, title spoken by character, police, policeman, bridge, dog, musician, microphone, guitar, singer, dancer, candle, photograph, theft, boy, contest, memory, father son relationship, father daughter relationship, singing

Dumbo (2019): boy, one word title, father daughter relationship, photograph, title spoken by character, singer, father son relationship, disability, dog, framed photograph, disney, death, singing

Ferdinand (2017): son, father daughter relationship, flower, stealing, moustached man, title spoken by character, canine, thief, guitar, mustache, father son relationship, bridge, moustache, dog, man's best friend, cgi animation, theft, death, framed photograph



['Rachel Getting Married (2008)', 'Dumbo (2019)', 'Ferdinand (2017)']

In [34]:
get_similar_movies('Hereditary (2018)')

Similar to Hereditary (2018):

Krampus (I) (2015): apology, nickname, husband wife relationship, knocking on a door, grandmother granddaughter relationship, running, fire, death, mother daughter relationship, reference to god, blood, anger, family relationships, sleeping, mother son relationship, dog, dysfunctional family, looking out a window, food, computer, cellphone, one word title, promise, fear, teenage boy, surprise ending, girl, f word, candle, boyfriend girlfriend relationship, grandmother grandson relationship, toy, demon, eating, supernatural power, attic, father son relationship, overhead camera shot, nightmare, telephone call, reference to jesus christ, fireplace, father daughter relationship, brother sister relationship

The VVitch: A New-England Folktale (2015): barking dog, husband wife relationship, crying woman, terror, grave, death, rain, mother daughter relationship, reference to god, death of mother, blood, family relationships, sleeping, burial, dog, mother son re

['Krampus (I) (2015)',
 'The VVitch: A New-England Folktale (2015)',
 'It Comes at Night (2017)']

In [35]:
get_similar_movies('Moonlight (I) (2016)')

Similar to Moonlight (I) (2016):

Mid90s (2018): one word title, coming of age, mother son relationship, homophobia, dysfunctional family, gay slur

Get on the Bus (1996): african american, absent father, diner, gay couple, gay african american, homophobia, gay slur, homosexual

Go (1999): drug dealer, one word title, male objectification, title spoken by character, punched in the face, gay couple, joint, gay kiss, revenge, gay slur, homosexual



['Mid90s (2018)', 'Get on the Bus (1996)', 'Go (1999)']

In [38]:
get_similar_movies('Ex Machina (2014)')

Similar to Ex Machina (2014):

Morgan (2016): independent film, character repeating someone else's dialogue, bed, chase, mirror, moral dilemma, betrayal, panic, bare chested male, danger, death, close up of eyes, paranoia, stabbed to death, blood, anger, artificial intelligence, bunker, punched in the face, directorial debut, murder, blood splatter, sabotage, laboratory, humanoid, river, computer, double cross, corporation, high tech, secret laboratory, distrust, corpse, flashback, knife, fear, woods, looking at oneself in a mirror, scientist, suspicion, surprise ending, interview, deception, f word, android, beard, security camera, isolation, revenge, science, stabbed in the chest, suspense, ambush, knocked out, surveillance, escape, stabbed in the back, electronic music score, research facility, shower, forest

Alien: Covenant (2017): robot as pathos, female rear nudity, chase, waterfall, moral dilemma, betrayal, panic, bare chested male, danger, death, close up of eyes, paranoia, bl

['Morgan (2016)', 'Alien: Covenant (2017)', "Assassin's Creed (2016)"]

In [39]:
get_similar_movies('Spider-Man: Into the Spider-Verse (2018)')

Similar to Spider-Man: Into the Spider-Verse (2018):

Big Hero 6 (2014): cemetery, final showdown, superhero, self sacrifice, funeral, teenage superhero, based on comic book, no title at beginning, brother brother relationship, supervillain, scene after end credits, father son relationship, masked man, stan lee cameo, secret identity, marvel comics, based on comic, good versus evil, cgi animation, teenager

The Kid Who Would Be King (2019): good versus evil, final showdown, final battle

The Rocketeer (1991): superhero, based on comic book, evil man, based on comic, origin of hero, good versus evil



['Big Hero 6 (2014)',
 'The Kid Who Would Be King (2019)',
 'The Rocketeer (1991)']

In [40]:
get_similar_movies('The Martian (2015)')

Similar to The Martian (2015):

SpaceCamp (1986): spacecraft, space exploration, trapped in space, space travel, outer space, nasa, astronaut, spaceship, space shuttle, spacesuit, commander

Independence Day: Resurgence (2016): news report, press conference, china, race against time, no opening credits, earth viewed from space, humor, fire, mission, map, near death experience, evacuation, rescue, survival, satellite, family relationships, desert, laboratory, computer, explosion, high tech, chinese, space travel, pilot, scientist, spacesuit, presumed dead, surprise ending, helmet, watching tv, disobeying orders, photograph, boyfriend girlfriend relationship, aerial shot, friendship, spacecraft, flashlight, surveillance, escape, outer space, spaceship, subtitled scene, london england, bomb, crater, hope

Alien: Covenant (2017): impalement, race against time, video recording, moral dilemma, lightning, fire, mission, near death experience, evacuation, laptop, rescue, survival, two word tit

['SpaceCamp (1986)',
 'Independence Day: Resurgence (2016)',
 'Alien: Covenant (2017)']

In [41]:
get_similar_movies('Mr. & Mrs. Smith (2005)')

Similar to Mr. & Mrs. Smith (2005):

Once Upon a Time in Mexico (2003): bar, sunglasses, husband wife relationship, hostage, motorcycle, disarming someone, knife throwing, nightclub, assassination attempt, machine gun, fistfight, fbi, black comedy, death, doll, gun, pump action shotgun, shotgun, standoff, convoy, drunkenness, blood, shooting, assassin, eyeglasses, bicycle, drink, hit by a car, shot in the back, dog, martial arts, falling from height, car accident, murder, tied up, mass murder, explosion, wedding, silencer, marriage, cia, bazooka, rifle, party, ak 47, knife, kiss, cafe, taxi, dual wield, target practice, singer, cult film, handcuffs, underwear, surprise ending, fight, watching tv, candle, shot in the arm, shootout, dancer, waitress, money, boy, tough guy, dancing, escape, elevator, pistol, gambling, telephone call, car chase, father daughter relationship, drinking, bomb, hand to hand combat, singing

Spy (2015): bar, hostage, woman murders a man, spy, motorcycle, touris

['Once Upon a Time in Mexico (2003)', 'Spy (2015)', 'Knight and Day (2010)']

In [46]:
get_similar_movies('The Florida Project (2017)')

Similar to The Florida Project (2017):

Rachel Getting Married (2008): friend, bathtub, grandmother granddaughter relationship, running, fire, mother daughter relationship, police, policeman, food, vomiting, tattoo, best friend, fight, dancer, theft, cigarette smoking, friendship, female protagonist, bath, eating, father son relationship, cigarette lighter, crying

One Hour Photo (2002): voyeurism, birthday, security guard, running, little boy, police, female nudity, nudity, tattoo, fight, waitress, hotel, arrest, theft, cigarette smoking, toy, voyeur, father son relationship, restaurant, crying, child

Whatever It Takes (2000): friendship, friend, voyeurism, little girl, swimming pool, nudity, voyeur, police, policeman, best friend, waitress, bikini, restaurant, crying, arrest, vomiting, taking a photograph



['Rachel Getting Married (2008)',
 'One Hour Photo (2002)',
 'Whatever It Takes (2000)']