# Movie Search 
This is a simple search engine over Wikipedia movie plot summaries. The data for this project is from https://www.kaggle.com/jrobischon/wikipedia-movie-plots. We are going to look at a subset of the data, specifically American films from 1972-2017.

## Loading Data
The first thing we need to do (after downloading the data and moving it to the appropriate directory) is to load and process the data. We can use the pandas library to read the csv:

In [2]:
import pandas as pd
data = pd.read_csv("archive/wiki_movie_plots_deduped.csv", nrows=17378-8796, 
            usecols=["Title", "Plot"], skiprows=[i for i in range(1, 8796)])

The number 8796 is the row of The Godfather, which is the first movie in our dataset. We are reading rows 8796 to 17378 because 17378 is the final American movie. To find similar movies, only the title and plot matter, so only those columns will be loaded. Here is some of the data, before preprocessing:

In [3]:
data.head()

|   | Title | Plot |
|---|-------|------|
| 0 | The Godfather | In 1945, at his daughter Connie's wedding, Vit... |
| 1 | Grave of the Vampire | Several years after his death by electrocution... |
| 2 | The Great Northfield Minnesota Raid | In the mid-1870s, outlaws Jesse James, Cole Yo... |
| 3 | Hammer | B.J. Hammer is a boxer who rises up the ranks ... |
| 4 | Hammersmith Is Out | Billy Breedlove (Beau Bridges) is an orderly a... |
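
The `skiprows`/`nrows` arithmetic in the loading cell can be checked on a tiny in-memory CSV. The sketch below uses made-up data and hypothetical `first`/`last` values (they are not from the real dataset). It suggests that file lines `first` through `last - 1` are read, so the line numbered `last` itself is not included:

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV: line 0 is the header, lines 1-6 are data.
csv_text = """Title,Plot
A,plot a
B,plot b
C,plot c
D,plot d
E,plot e
F,plot f
"""

first, last = 3, 5  # hypothetical line numbers, playing the role of 8796 and 17378

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Title", "Plot"],
    skiprows=range(1, first),  # skip data lines 1 .. first-1, keep the header
    nrows=last - first,        # then read exactly last - first rows
)

print(subset["Title"].tolist())  # ['C', 'D'] -> lines 3 and 4; line 5 is not read
```

If the row at `last` should be included as well, `nrows=last - first + 1` would do that.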


## Transforming Data
Now that we have some data loaded, we can preprocess it to remove unnecessary information. NLTK is a good library for NLP, so we will use it to remove stopwords (words that don't add much meaning, like prepositions) and to lemmatize words (reduce words to a base form, like changing dogs -> dog, or ran -> run).

In [4]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download)
stop_words = set(stopwords.words('english'))
lem = WordNetLemmatizer()

def preprocess(text):
    """Lowercase, split on non-word characters, drop stopwords, lemmatize."""
    words = re.split(r'\W+', text.lower())
    # skip stopwords and the empty strings the split leaves at the edges
    return ' '.join(lem.lemmatize(w) for w in words
                    if w and w not in stop_words)

We will use this to process the Plot fields of our data entries, and we will also strip excess spaces from all titles:

In [5]:
data["Plot"] = data["Plot"].apply(preprocess)
data["Title"] = data["Title"].str.strip()

Here is an example of our processed data:

In [6]:
data.head()

|   | Title | Plot |
|---|-------|------|
| 0 | The Godfather | 1945 daughter connie wedding vito corleone hea... |
| 1 | Grave of the Vampire | several year death electrocution late 1930s gh... |
| 2 | The Great Northfield Minnesota Raid | mid 1870s outlaw jesse james cole younger brot... |
| 3 | Hammer | b j hammer boxer rise rank help mafia however ... |
| 4 | Hammersmith Is Out | billy breedlove beau bridge orderly texas psyc... |
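
The processed plot for Hammer starts with `b j hammer` because the tokenizer splits on every run of non-word characters, including the periods in "B.J.". Here is a minimal sketch of just that splitting step, using only the standard library. The helper name `tokenize` is hypothetical, and stopword removal and lemmatization are left out:

```python
import re

def tokenize(text):
    """Lowercase and split on runs of non-word characters, dropping empty pieces."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

# The raw split leaves an empty string where the text ends in punctuation.
print(re.split(r"\W+", "B.J. Hammer is a boxer."))
# ['B', 'J', 'Hammer', 'is', 'a', 'boxer', '']

print(tokenize("B.J. Hammer is a boxer."))
# ['b', 'j', 'hammer', 'is', 'a', 'boxer']
```

This is why initials and possessives show up as single letters in the processed text.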


## Computing TF-IDF and top words
With processed data, we can begin to analyze significant words from the movie plots. We will use TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to do this. It scores a word highly when it occurs often in one document (a movie plot in this case) but in few documents overall. The scikit-learn library can do this for us:

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorize the data, limited to the top 20k words to prevent movie 
# descriptions from just being names, and also to improve performance
vectorizer = TfidfVectorizer(max_features=20000)
tfidf = vectorizer.fit_transform(data["Plot"])
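# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from 1.0 onward
vocab = vectorizer.get_feature_names_out().tolist()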

This gives us tfidf, which is a matrix of the TF-IDF value of each word in our vocabulary. Its shape is (num movies, vocabulary size). The vocab variable is a list of all vocabulary words that correspond to the columns in tfidf.

In [8]:
tfidf.shape

(8582, 20000)

We will iterate through the movies in data, retrieving the k words with the highest TF-IDF for each one:

In [9]:
import numpy as np
from collections import defaultdict
from tqdm import tqdm

movies = defaultdict(set)

# iterate through movies and find the top words from the TF-IDF
for i in tqdm(range(data.shape[0]), 
              desc="Finding most significant words from each movie"):
    ind = np.argpartition(tfidf[i,].toarray()[0], kth=-25)[-25:]
    movies[data.at[i, 'Title']] = set([vocab[j] for j in ind])

Finding most significant words from each movie: 100%|████████████████████████████| 8582/8582 [00:01<00:00, 5729.39it/s]


Now we can use the movies dictionary to get the top k (25 in our case) words for a specific movie:

In [10]:
movies['The Godfather']

{'bruno',
 'business',
 'capo',
 'carlo',
 'connie',
 'corleone',
 'family',
 'fish',
 'five',
 'fredo',
 'godfather',
 'greene',
 'hagen',
 'kay',
 'la',
 'luca',
 'michael',
 'moe',
 'murder',
 'role',
 'sicily',
 'sonny',
 'tom',
 'vega',
 'vito'}
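
The top-k selection relies on `np.argpartition`, which returns the indices of the k largest values without fully sorting the array. A small sketch with made-up weights shows the behavior. The selected indices come back in no particular order, which is fine here because they go straight into a set:

```python
import numpy as np

# Made-up TF-IDF weights for a six-word vocabulary.
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9, 0.2])
k = 3

# Indices of the k largest weights, in arbitrary order.
top = np.argpartition(weights, kth=-k)[-k:]
print(sorted(top.tolist()))  # [1, 3, 4]

# If a ranking is needed, sort just those k indices by weight.
ordered = top[np.argsort(weights[top])[::-1]]
print(ordered.tolist())  # [4, 1, 3]
```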

## Search
Using our movies dictionary we can search for specific keywords or titles. We will iterate through all of our movies, finding the intersection between the words of a user's query and the words of each movie's title, and between the query and each movie's top plot words. The top 5 movies by title overlap and by plot overlap will be printed for each query. After the first query, the queries will find matches very quickly.

In [16]:
while True:
    text = input("\nEnter a movie title or a description of a movie, or 'quit' to stop: ")
    if text == 'quit':
        break
    plot = set(preprocess(text).split()) # preprocess the query to match plots (lemmatized)
    title = set(text.lower().split()) - stop_words # not lemmatized to match exact wording of title
    share_plot = []
    share_title = []

    for m in movies:
        s = set(m.lower().split())
        shared = len(s & title)
        if shared: # add titles that share words
            share_title.append((shared, m))

        shared = len(movies[m] & plot)
        if shared: # add plots that share words
            share_plot.append((shared, m))

    share_title.sort(reverse=True)
    print(f"Movie titles most similar to '{text}':")
    for i in range(min(5, len(share_title))):
        print(f"\t{i+1}. {share_title[i][1]}")
        
    share_plot.sort(reverse=True)
    print(f"Movie plots most similar to '{text}':")
    for i in range(min(5, len(share_plot))):
        print(f"\t{i+1}. {share_plot[i][1]}")


Enter a movie title or a description of a movie, or 'quit' to stop: star wars luke leia skywalker yoda darth vader anakin fight jedi
Movie titles most similar to 'star wars luke leia skywalker yoda darth vader anakin fight jedi':
	1. Star Wars: The Last Jedi
	2. Star Wars: The Clone Wars
	3. Star Wars Episode IV: A New Hope (aka Star Wars)
	4. Rogue One: A Star Wars Story (film)
	5. Wish Upon a Star
Movie plots most similar to 'star wars luke leia skywalker yoda darth vader anakin fight jedi':
	1. Star Wars: Episode III – Revenge of the Sith
	2. Return of the Jedi
	3. Star Wars Episode IV: A New Hope (aka Star Wars)
	4. The Empire Strikes Back
	5. Star Wars: The Last Jedi

Enter a movie title or a description of a movie, or 'quit' to stop: quit
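
The matching step in the loop above boils down to counting the size of a set intersection. Here is a compact sketch of that idea as a function, run on a made-up dictionary. The name `rank_by_overlap` and the toy data are hypothetical and only meant to illustrate the scoring:

```python
def rank_by_overlap(query_words, word_sets, top_n=5):
    """Rank keys of word_sets by how many words they share with query_words."""
    scored = [(len(words & query_words), name)
              for name, words in word_sets.items()]
    # Keep only entries that share at least one word, highest count first.
    scored = sorted((s for s in scored if s[0] > 0), reverse=True)
    return [name for _, name in scored[:top_n]]

# Made-up top words per movie.
toy_movies = {
    "Alpha": {"space", "war", "robot"},
    "Beta": {"love", "paris"},
    "Gamma": {"space", "love", "robot"},
    "Delta": {"ocean", "shark"},
}

print(rank_by_overlap({"space", "robot", "love"}, toy_movies))
# ['Gamma', 'Alpha', 'Beta']
```

As in the loop above, ties in the count are broken by reverse alphabetical order of the title, because the tuples are sorted in reverse.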


## Conclusion
This project was very fun to make and is pretty good for finding similar movies. It could be improved by weighting words by their rarity. The current search counts the number of words shared by the query and the description. This approach is simple and effective for short to medium queries, but the signal is drowned out by common words when queries get longer. Using weighted words would ensure that matches on rare words from queries/descriptions count for more.
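
One possible way to add that weighting, sketched here on made-up plots rather than the real data, is to turn the query into a TF-IDF vector with the same vectorizer and rank movies by cosine similarity. This is not the method used above. The function name `weighted_search` and the toy plots are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up plots for illustration only.
toy_plots = {
    "Heist": "crew robs casino vault",
    "Space": "pilot joins the rebel fleet in space",
    "Boxing": "boxer trains for the title fight",
}
toy_titles = list(toy_plots)

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_plots.values())

def weighted_search(query, top_n=5):
    """Rank titles by cosine similarity between TF-IDF vectors."""
    q = toy_vectorizer.transform([query])
    # Rows are L2-normalized by default, so the dot product is the cosine similarity.
    scores = (toy_tfidf @ q.T).toarray().ravel()
    order = scores.argsort()[::-1][:top_n]
    return [(toy_titles[i], float(scores[i])) for i in order if scores[i] > 0]

for title, score in weighted_search("rebel pilot fight"):
    print(f"{title}: {score:.3f}")
```

With this scoring, a match on a rare word contributes more than a match on a word that appears in many plots, and movies sharing no words with the query are dropped.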