### Wikipedia Movies: Recommendations Using TF-IDF

Welcome to this Kaggle notebook, which explores how to build the kind of movie recommendations you see on Netflix. We'll use a technique called **Term Frequency-Inverse Document Frequency (TF-IDF)** to recommend movies based on their plot summaries.

TF-IDF is a widely used technique in information retrieval and natural language processing (NLP). It quantifies the importance of a word in a given document by considering its frequency in the document (Term Frequency) and its rarity across the entire dataset (Inverse Document Frequency).
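To make the definition above concrete, here is a minimal sketch that computes TF-IDF by hand for three made-up one-line plots. It uses the textbook formula tf × log(N / df). The toy documents and the `tf_idf` helper are illustrative assumptions, not part of the dataset or of any library. scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy "plots" for illustration only (not from the dataset)
docs = [
    "pilot flies jet",
    "pilot trains pilot",
    "detective solves murder",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# One {term: score} dictionary per document
scores = [
    {term: tf_idf(term, tokens) for term in set(tokens)}
    for tokens in tokenized
]

for doc, doc_scores in zip(docs, scores):
    print(doc, {term: round(score, 3) for term, score in doc_scores.items()})
```

In the first document, "jet" scores higher than "pilot" even though both appear once, because "pilot" also occurs in another document and is therefore less distinctive.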

By applying TF-IDF to the plot column of our movie dataset, we can extract the most distinctive terms from each movie's plot and recommend movies whose plots are textually similar.

In this notebook, we'll walk through the steps of building a recommendation system with TF-IDF: load and lightly preprocess the movie plot data, compute TF-IDF scores, and use cosine similarity to identify movies with similar plots. The result is a ranked list of movies that share thematic elements or storylines with a chosen film, which can help users discover new films that match their interests.

### Load the Wikipedia Movies dataset

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(columns=['title', 'image', 'plot'])

In [3]:
for dirname, _, filenames in os.walk('../DataSets/Movies'):
    for filename in filenames:
        if filename.endswith(".csv"):
            df_new = pd.read_csv(os.path.join(dirname, filename))
            df = pd.concat([df, df_new], ignore_index=True)
            print(f"Loaded {filename}")

print("Done!")

Loaded 1980s-movies.csv
Done!


In [4]:
# Reset to a clean sequential index (drop=True discards the old index
# instead of adding it as a new column)
df = df.reset_index(drop=True)

# Truncate each plot to its first 2,000 characters
df['plot'] = df['plot'].str[:2000]

# Show record counts
df.count()

title    2325
image    2325
plot     2325
dtype: int64

### View a sample of movies

In [5]:
df.sample(5)

| | title | image | plot |
|---|---|---|---|
| 1923 | Spaceballs | https://upload.wikimedia.org/wikipedia/en/4/45... | Planet Spaceball, led by the incompetent Presi... |
| 1013 | Reform School Girls | https://upload.wikimedia.org/wikipedia/en/d/df... | The film is a satire of the women in prison fi... |
| 1249 | Death Nurse | https://upload.wikimedia.org/wikipedia/en/3/30... | From their suburban home, Doctor Gordon Mortle... |
| 528 | Big Trouble (1986 film) | https://upload.wikimedia.org/wikipedia/en/e/e2... | Leonard Hoffman is a Los Angeles insurance age... |
| 175 | The Lost Empire (1984 film) | https://upload.wikimedia.org/wikipedia/en/5/5f... | The film opens at a jewelry shop in Chinatown,... |


### Compute TF-IDF scores and create similarity matrix

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create a TF-IDF vectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Fit and transform the combined title and plot text to TF-IDF vectors
tfidf_matrix = tfidf.fit_transform(df['title'] + ' ' + df['plot'])

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
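Cosine similarity compares the direction of two vectors and ignores their length, so a long plot and a short plot that use the same distinctive words can still score as close matches. The following is a minimal plain-Python sketch of the calculation for a single pair of vectors. The `cosine` helper and the three-term vectors are made up for illustration; in the cell above, `cosine_similarity` performs the equivalent computation for every pair of rows in `tfidf_matrix`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # An all-zero vector has no direction; treat it as dissimilar
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF vectors over a three-term vocabulary
a = [0.5, 0.0, 0.5]
b = [1.0, 0.0, 1.0]  # same direction as a, twice the length
c = [0.0, 1.0, 0.0]  # shares no terms with a

sim_ab = cosine(a, b)  # close to 1: identical direction
sim_ac = cosine(a, c)  # 0: no overlap

print(round(sim_ab, 6), round(sim_ac, 6))
```

Because `TfidfVectorizer` L2-normalizes each row by default, every TF-IDF vector in the notebook already has length 1, so the cosine similarity there reduces to a plain dot product.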

### Define helper functions

In [7]:
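def get_index_from_title(title):
    # Return the index of the first movie whose title starts with `title`
    # (case-insensitive), or -1 if no title matches
    try:
        return df[df["title"].str.lower().str.startswith(title.lower())].index[0]
    except IndexError:
        return -1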

def get_title_from_index(index):
    return df.iloc[index]["title"]

### Recommend similar movies

In [8]:
# Find similar movies to this one
title = "top gun"

# Get the index of the movie
movie_index = get_index_from_title(title)

if movie_index > -1:

    # Create a list of (index, cosine similarity) pairs for the specified film
    similar_movies = list(enumerate(cosine_sim[movie_index]))

    # Sort the list, highest values to lowest
    sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)

    # Display the data
    # Note the best match (and first in the list) will be the specified film itself
    print("Similar movies:")
    for movie in sorted_similar_movies[:20]:
        print(get_title_from_index(movie[0]))

else:
    print("Movie not found.")

Similar movies:
Top Gun
Deadbeat at Dawn
The Final Countdown (film)
Bird (1988 film)
Slaughter High
An Officer and a Gentleman
Always (1989 film)
Black Rain (1989 American film)
Mike's Murder
Iceman (1984 film)
Bring Me the Head of Charlie Brown
The Pope of Greenwich Village
Skyward (film)
Rain Man
Feds
Judgement Day (1988 film)
The Principal
Private Benjamin (1980 film)
The Big Picture (1989 film)
Something Wild (1986 film)
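Since the top match is always the query film itself, it can be convenient to wrap the ranking step in a function that skips it and also returns the scores. The sketch below is one possible way to do that; the `recommend` function, its parameters, and the toy 3x3 matrix are assumptions introduced here for illustration, not part of the original notebook.

```python
def recommend(movie_index, sim_matrix, titles, top_n=5):
    """Return the top_n (title, score) pairs most similar to the movie at
    movie_index, excluding the movie itself."""
    scores = [
        (i, score)
        for i, score in enumerate(sim_matrix[movie_index])
        if i != movie_index
    ]
    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(titles[i], score) for i, score in scores[:top_n]]

# Hypothetical similarity matrix and titles, standing in for cosine_sim and df["title"]
toy_titles = ["Movie A", "Movie B", "Movie C"]
toy_sim = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
]

top = recommend(0, toy_sim, toy_titles, top_n=2)
print(top)  # [('Movie C', 0.7), ('Movie B', 0.2)]
```

With the notebook's variables, a call along the lines of `recommend(movie_index, cosine_sim, df["title"].tolist(), top_n=20)` should give the same ranking as the cell above, minus the query film.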
