# Movie Recommender System Challenge

In this notebook, I, Joshua Zingale, build a movie recommender system from a subset of a [Kaggle dataset](https://www.kaggle.com/datasets/utkarshx27/movies-dataset?resource=download), which I have also included in this repository.

## Data Loading and Cleaning

In [1]:
import pandas as pd
pd.options.mode.copy_on_write = True

# Load only a smaller subset of the movies, per the instructions
df = pd.read_csv("data/movie_dataset.csv").sample(500, random_state = 115)

# View a small random sample to get a feel for the data
df.sample(3, random_state = 935)

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
1879,1879,25000000,Comedy,,57431,babysitter duringcreditsstinger,en,The Sitter,"Noah, is not your typical entertain-the-kids-n...",19.428994,...,81.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Worst. Babysitter. Ever.,The Sitter,5.4,325,Sam Rockwell Jonah Hill Max Records Ari Grayno...,"[{'name': 'Michael De Luca', 'gender': 2, 'dep...",David Gordon Green
2672,2672,14000000,Comedy,,864,winter trainer olympic games jamaica training ...,en,Cool Runnings,When a Jamaican sprinter is disqualified from ...,22.409117,...,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,One dream. Four Jamaicans. Twenty below zero.,Cool Runnings,6.8,491,Leon Robinson Doug E. Doug Rawle D. Lewis Mali...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Jon Turteltaub
651,651,65000000,Comedy Drama Family,,196867,musical orphan foster child,en,Annie,"Ever since her parents left her as a baby, lit...",33.439187,...,119.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,It's a Hard Knock Life,Annie,6.0,466,Quvenzhané Wallis Jamie Foxx Rose Byrne C...,"[{'name': 'Will Smith', 'gender': 2, 'departme...",Will Gluck


In [2]:
df.columns

Index(['index', 'budget', 'genres', 'homepage', 'id', 'keywords',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'vote_average', 'vote_count', 'cast', 'crew', 'director'],
      dtype='object')

### Choosing Columns
Judging by the column names and a few examples, the most immediately relevant columns for this challenge are "genres", "keywords", "original_title", and "overview", though other columns could be used to improve relevancy. I also filter the data to include only movies whose original language is English.

In [3]:
# Load the most relevant columns for all movies in the English language, dropping any rows with NaN values
df = df[df["original_language"] == "en"][["index", "genres", "keywords", "original_title", "overview"]].dropna()

print(len(df))

438


In [4]:
df.sample(3, random_state=115)

Unnamed: 0,index,genres,keywords,original_title,overview
2980,2980,Action Horror Thriller,dystopia sequel legalized murder,The Purge: Election Year,Two years after choosing not to kill the man w...
3812,3812,Drama,corruption sex adultery television profit,Network,A TV network cynically exploits a deranged ex-...
4178,4178,Drama Thriller,baby wife husband relationship christian faith...,Higher Ground,A chronicle of one woman's lifelong struggle w...


### Combining Columns
Fortunately, most of the sampled movies are in English, so enough data remain: 438 of the 500 sampled rows survive the filtering. To finish cleaning the data, I concatenate the title, genres, keywords, and overview into a single column, which will later be turned into a vector.

In [5]:
# Add a composite "description" column to the DataFrame
df["description"] = df["original_title"] + " " + df["genres"] + " " + df["keywords"] + " " + df["overview"]

In [6]:
df.sample(3, random_state = 935)

Unnamed: 0,index,genres,keywords,original_title,overview,description
867,867,Crime Drama Thriller,italy christianity new york assassination ital...,The Godfather: Part III,In the midst of trying to legitimize his busin...,The Godfather: Part III Crime Drama Thriller i...
2402,2402,Horror Drama Mystery Thriller,nanny haunted house channel islands parallel w...,The Others,Grace is a religious woman who lives in an old...,The Others Horror Drama Mystery Thriller nanny...
2455,2455,Comedy Romance Drama,new york wife husband relationship restaurant ...,When Harry Met Sally...,"During their travels from Chicago to New York,...",When Harry Met Sally... Comedy Romance Drama n...


## Vectorized Movie Storage

I created a vectorized database wherein each movie has an associated vector. A movie's vector is a TF-IDF vector of the text in the composite "description" field created above.

To generate a TF-IDF vector for a piece of text, I implemented a tokenizer. The tokenizer removes punctuation, lowercases everything, drops stop words, and then attempts to reduce each word to its stem, e.g. by removing verb endings or plural markers. Once the text is in this normalized ("stemified") form, each stem is a token, and a vector is assigned to the text based on the frequency of each token in the text and on the inverse document frequency of each token.
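
As a minimal sketch of this weighting, the snippet below computes TF-IDF by hand on a toy corpus. The three documents and the names `documents`, `tfidf`, and `weights` are made up for illustration, and the tokens are plain whitespace splits, so the stemming and stop-word removal described above are left out. The inverse document frequency is taken as log(N / df), the same form the `Tokenizer` class below uses.

```python
import math
from collections import Counter

# Toy corpus standing in for the movie descriptions (hypothetical data).
documents = [
    "space adventure alien robot",
    "space drama nasa",
    "family comedy drama",
]

# Whitespace tokens only; no punctuation removal, stop words, or stemming here.
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: the number of documents that contain each token.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Inverse document frequency, log(N / df).
idf = {token: math.log(len(documents) / doc_freq[token]) for token in vocabulary}


def tfidf(text: str) -> list[float]:
    """Return one weight per vocabulary entry: raw term count times idf."""
    counts = Counter(text.split())
    return [counts[token] * idf[token] for token in vocabulary]


weights = dict(zip(vocabulary, tfidf("space drama nasa")))
```

In this toy corpus "nasa" appears in one document while "space" and "drama" each appear in two, so "nasa" receives the largest weight, and tokens absent from the text receive zero.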

I downloaded a list of English stop words from [here](https://gist.github.com/larsyencken/1440509), which I used to remove stop words from the descriptions during the tokenization stage. I added "movie" and "movies" to the list of stop words because queries of the form "I like movies that..." were returning irrelevant movies simply because their descriptions matched "movie".

When vectorizing a query, any word immediately following "I" is removed. This prevents "love" in constructions like "I love horrific war films" from biasing the results toward romantic comedies.
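
The rule can be sketched on its own, without the tokenizer. The helper name `drop_word_after_i` is hypothetical and exists only for this illustration; it lowercases the query and skips each word whose predecessor is "i".

```python
def drop_word_after_i(text: str) -> str:
    """Lowercase the text and drop each word that directly follows "i"."""
    kept = []
    previous = None
    for word in text.lower().split():
        # Keep the word unless the word right before it was "i".
        if previous != "i":
            kept.append(word)
        previous = word
    return " ".join(kept)


cleaned = drop_word_after_i("I love horrific war films")
```

Here `cleaned` is `"i horrific war films"`: the verb after "I" is gone, and the leftover "i" is expected to be filtered out later if it is in the stop-word list.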


In [7]:
from nltk.stem import PorterStemmer
import numpy as np
ps = PorterStemmer()

In [8]:
# Load in the stopwords list
with open("stopwords.txt") as f:
    stopwords = f.read().split()
stopwords = set(stopwords)

In [9]:
class Tokenizer():
    def __init__(self, documents):
        """
        Initializes a tokenizer for a set of documents.
        The vocabulary for the Tokenizer is determined by the documents used to initialize it.
        """

        ## Get the vocabulary
        self.vocabulary = set()
        for text in documents:
            self.vocabulary.update(self._stemify(text))

        self.vocabulary_size = len(self.vocabulary)
        

        ## Get the stem to token id mappings
        self.stem_to_id = dict()
        self.id_to_stem = dict()

        for i, word in enumerate(self.vocabulary):
            self.stem_to_id[word] = i
            self.id_to_stem[i] = word

        ## Get the inverse document frequencies for each token
        self.idf = np.zeros(self.vocabulary_size)
        for text in documents:
            self.idf += self.frequency_vectorize(text).clip(0, 1)

        self.idf = np.log(len(documents)/self.idf)

    def tokenize(self, text: str) -> list:
        """Tokenizes input text"""
        return [self.stem_to_id[stem] for stem in self._stemify(text) if stem in self.vocabulary]

    def vectorize(self, text: str, smoothing = 0.0) -> np.ndarray:
        """Returns a tf-idf vector for the input text, where each index contains the tf-idf of a token in the input text.
        smoothing is the amount of smoothing added to the vector."""
        vec = np.zeros(self.vocabulary_size) + smoothing
        for token_id in self.tokenize(text):
            vec[token_id] += 1
        return vec * self.idf
    
    def frequency_vectorize(self, text: str) -> np.ndarray:
        """Returns a frequency vector for the input text, where each index contains the number of appearances of a token in the input text."""
        vec = np.zeros(self.vocabulary_size)
        for token_id in self.tokenize(text):
            vec[token_id] += 1
        return vec
        
    def _remove_punctuation(self, text: str) -> str:
        """ Removes common punctuation from a string """
        for mark in ["!", "(", ")", ";", ":", "\"", ",", ".", "?"]:
            text = text.replace(mark, "")
        return text
    
    def _stemify(self, text: str) -> list:
        """ Converts text into a list of stems without punctuation"""
        text = self._remove_punctuation(text)
        words = text.lower().split()
        words = [ps.stem(word) for word in words if word not in stopwords]
        return words

In [10]:
class VectorDB():
    """
    Stores documents in a vector database, wherein lookups use a vector similarity metric, i.e. cosine similarity.
    """
    def __init__(self, data, embedded_row, embedding_function):
        """
        Initializes a vector database for a set of data.

        :param data: pandas DataFrame around which this database wraps
        :param embedded_row: name of the column in data whose text is embedded
        :param embedding_function: function that maps a document to its embedding
        """

        ## Build the vector database
        embedding_size = embedding_function(data[embedded_row].iloc[0]).size
        
        
        self.db = np.empty((len(data), embedding_size))
        self.data = data
        
        for i, document in enumerate(data[embedded_row]):
            self.db[i] = embedding_function(document)

        # normalize each db row
        self.db /= np.linalg.norm(self.db, axis = 1, keepdims = True)

    def search(self, x, k = 1, return_similarities = False):
        """Returns the top k closest matches for input vector x"""
        # normalize x, leaving an all-zero query as-is to avoid dividing by zero
        norm = np.linalg.norm(x)
        if norm > 0:
            x = x / norm

        # Get top k
        scores = self.db @ x
        top_idc = np.argpartition(scores, -k)[-k:]

        # Sort top k
        top_idc = sorted(top_idc, key = lambda i: -scores[i])
        if return_similarities:
            return self.data.iloc[top_idc], scores[top_idc]
        return self.data.iloc[top_idc]
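
To make the lookup concrete, here is a small pure-Python sketch of cosine-similarity ranking. The function names `cosine_similarity` and `top_k` and the toy vectors are assumptions for illustration only. Unlike the class above, which normalizes every stored row once so that a search reduces to a matrix-vector product, this sketch recomputes the norms on each comparison to keep the arithmetic visible.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=1):
    """Return (row index, similarity) pairs for the k rows most similar to query."""
    scored = [(i, cosine_similarity(query, row)) for i, row in enumerate(rows)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stored vectors and query (hypothetical values).
rows = [[1, 0, 0], [1, 1, 0], [0, 0, 2]]
matches = top_k([2, 2, 0], rows, k=2)
```

The query points in the same direction as the second row, so that row ranks first with a similarity of about 1, followed by the first row; the third row shares no nonzero component with the query and scores 0.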

In [11]:
# Get a Tokenizer for the data
tokenizer = Tokenizer(df.loc[:, "description"])

print(f"The vocabulary has {tokenizer.vocabulary_size} words")

The vocabulary has 5124 words


In [12]:
# Create the vectorized database
db = VectorDB(df, embedded_row = "description", embedding_function = tokenizer.vectorize)

In [13]:
def vectorize_query(text):
    """Vectorizes a search query"""

    words = text.lower().split()
    # Keep the first word, if there is one (an empty query has none)
    new_words = words[:1]
    # Remove any word that follows "I"
    for prev_word, word in zip(words[:-1], words[1:]):
        if prev_word != "i":
            new_words.append(word)

    text = " ".join(new_words)
    
    return tokenizer.vectorize(text)

## Testing

In [14]:
def search(text, k = 5):
    rows, scores = db.search(vectorize_query(text), k = k, return_similarities = True)

    print("Results the following query:", text)
    for (_, row), score in zip(rows.iterrows(), scores):
        title = row["original_title"]
        index = row["index"]
        overview = row["overview"]
        keywords = row["keywords"]
        genres = row["genres"]
        print(f"Title: {title} ({index})\nCosine Similarity {score}\nKeywords & Genres: {keywords}, {genres}\nOverview: {overview}\n")

In [15]:
search("I love thrilling action movies set in space, with a comedic twist.")

Results the following query: I love thrilling action movies set in space, with a comedic twist.
Title: Zathura: A Space Adventure (661)
Cosine Similarity 0.13740214238710163
Keywords & Genres: adventure house alien giant robot outer space, Family Fantasy Science Fiction Adventure
Overview: After their father is called into work, two young boys, Walter and Danny, are left in the care of their teenage sister, Lisa, and told they must stay inside. Walter and Danny, who anticipate a boring day, are shocked when they begin playing Zathura, a space-themed board game, which they realize has mystical powers when their house is shot into space. With the help of an astronaut, the boys attempt to return home.

Title: Hard Rain (603)
Cosine Similarity 0.1253841897040928
Keywords & Genres: sheriff rain evacuation armored car crook, Thriller
Overview: Get swept up in the action as an armored car driver (Christian Slater) tries to elude a gang of thieves (led by Morgan Freeman) while a flood ravages 

In [16]:
search("I like action movies set in space")

Results the following query: I like action movies set in space
Title: Zathura: A Space Adventure (661)
Cosine Similarity 0.2652030078167088
Keywords & Genres: adventure house alien giant robot outer space, Family Fantasy Science Fiction Adventure
Overview: After their father is called into work, two young boys, Walter and Danny, are left in the care of their teenage sister, Lisa, and told they must stay inside. Walter and Danny, who anticipate a boring day, are shocked when they begin playing Zathura, a space-themed board game, which they realize has mystical powers when their house is shot into space. With the help of an astronaut, the boys attempt to return home.

Title: Capricorn One (3668)
Cosine Similarity 0.2176617345136039
Keywords & Genres: helicopter nasa texas spacecraft beguilement, Drama Action Thriller Science Fiction
Overview: In order to protect the reputation of the American space program, a team of scientists stages a phony Mars landing. Willingly participating in the 

In [17]:
search("I like movies that are informative and teach me something")

Results the following query: I like movies that are informative and teach me something
Title: Black Mass (877)
Cosine Similarity 0.131107487118041
Keywords & Genres: boston based on true story organized crime, Crime Drama
Overview: The true story of Whitey Bulger, the brother of a state senator and the most infamous violent criminal in the history of South Boston, who became an FBI informant to take down a Mafia family invading his turf.

Title: Crazy, Stupid, Love. (925)
Cosine Similarity 0.10819608049792716
Keywords & Genres: soulmates midlife crisis marriage crisis womanizer law school, Comedy Drama Romance
Overview: Cal Weaver is living the American dream. He has a good job, a beautiful house, great children and a beautiful wife, named Emily. Cal's seemingly perfect life unravels, however, when he learns that Emily has been unfaithful and wants a divorce. Over 40 and suddenly single, Cal is adrift in the fickle world of dating. Enter, Jacob Palmer, a self-styled player who takes Ca

In [18]:
search("I like calm documentaries about nature.")

Results the following query: I like calm documentaries about nature.
Title: Give Me Shelter (4660)
Cosine Similarity 0.1681818266225291
Keywords & Genres: helping animals, Documentary
Overview: Give Me Shelter is a documentary to raise awareness for important animal issues around the world. This film uncovers the most prevalent issues in the animal world through the eyes of individuals dedicating their lives to them daily.

Title: The Horse Whisperer (713)
Cosine Similarity 0.16392059011182014
Keywords & Genres: love triangle new york montana attachment to nature confidence, Drama Romance
Overview: Based on the novel by the same name from Nicholas Evans, the talented Robert Redford presents this meditative family drama set in the country side. Redford not only directs but also stars in the roll of a cowboy with a magical talent for healing.

Title: Wordplay (4520)
Cosine Similarity 0.15831009727168782
Keywords & Genres: competition documentary contest crossword puzzle, Documentary
Over

In [19]:
search("I like calm documentaries about war.")

Results the following query: I like calm documentaries about war.
Title: Give Me Shelter (4660)
Cosine Similarity 0.24095242118775184
Keywords & Genres: helping animals, Documentary
Overview: Give Me Shelter is a documentary to raise awareness for important animal issues around the world. This film uncovers the most prevalent issues in the animal world through the eyes of individuals dedicating their lives to them daily.

Title: Wordplay (4520)
Cosine Similarity 0.22680929326386495
Keywords & Genres: competition documentary contest crossword puzzle, Documentary
Overview: From the masters who create the mind-bending diversions to the tense competition at the American Crossword Puzzle Tournament, Patrick Creadon's documentary reveals a fascinating look at a decidedly addictive pastime. Creadon captures New York Times editor Will Shortz at work, talks to celebrity solvers -- including Bill Clinton and Ken Burns -- and presents an intimate look at the national tournament and its competitor

In [20]:
search("I love comedies for the family")

Results the following query: I love comedies for the family
Title: August: Osage County (1885)
Cosine Similarity 0.184359662874574
Keywords & Genres: suicide drug addiction funeral dysfunctional family based on play, Comedy Drama
Overview: A look at the lives of the strong-willed women of the Weston family, whose paths have diverged until a family crisis brings them back to the Midwest house they grew up in, and to the dysfunctional woman who raised them.

Title: Post Grad (2619)
Cosine Similarity 0.13244780221747893
Keywords & Genres: career family unemployment woman director graduation speech, Comedy
Overview: Ryden Malby has a master plan. Graduate college, get a great job, hang out with her best friend and find the perfect guy. But her plan spins hilariously out of control when she’s forced to move back home with her eccentric family.

Title: The Royal Tenenbaums (1710)
Cosine Similarity 0.1313249277804484
Keywords & Genres: forgiveness child prodigy terminal illness dysfunctional 

In [21]:
search("I love comedies")

Results the following query: I love comedies
Title: Dry Spell (4781)
Cosine Similarity 0.13474827472071382
Keywords & Genres: dating divorce sex scene sex comedy anti romantic comedy, Comedy Romance
Overview: Sasha tries to get her soon-to-be ex husband Kyle laid so she can move on with her sex life guilt-free.

Title: Dumb and Dumber To (1171)
Cosine Similarity 0.10993042811701818
Keywords & Genres: friendship sequel road movie buddy comedy, Comedy
Overview: 20 years after the dimwits set out on their first adventure, they head out in search of one of their long lost children in the hope of gaining a new kidney.

Title: The Salon (4241)
Cosine Similarity 0.09109911344825801
Keywords & Genres: independent film, Comedy
Overview: A Beauty shop owner finds romance as she struggles to save her business.

Title: Bandits (534)
Cosine Similarity 0.08048798673807012
Keywords & Genres: prison, Action Comedy Crime Romance
Overview: Two bank robbers fall in love with the girl they've kidnapped.



In [22]:
search("I like horror films that take place in the wild")

Results the following query: I like horror films that take place in the wild
Title: Black Snake Moan (2608)
Cosine Similarity 0.2214723689312577
Keywords & Genres: southern usa blues military service independent film, Drama
Overview: A God-fearing bluesman takes to a wild young woman who, as a victim of childhood sexual abuse, is looking everywhere for love, but never quite finding it.

Title: The Theory of Everything (2547)
Cosine Similarity 0.12317056138960344
Keywords & Genres: wife husband relationship biography physicist based on memoir stephen hawking, Drama Romance
Overview: The Theory of Everything is the extraordinary story of one of the world’s greatest living minds, the renowned astrophysicist Stephen Hawking, who falls deeply in love with fellow Cambridge student Jane Wilde.

Title: Where the Wild Things Are (293)
Cosine Similarity 0.11762624358179284
Keywords & Genres: children's book igloo wolf costume swallowed whole hit with a rock, Family Fantasy
Overview: Max imagines