## 1. Data Input
We use a dataset containing the names, keywords, and overviews of 4803 movies. We exclude the 3 movies with invalid overviews, which leaves 4800 movie names, keywords, and overviews. To keep the data easy to handle, we then reduce the set to 439 valid entries by dropping rows with a missing value in any of the three columns and keeping every tenth remaining row.

In [2]:
import kagglehub
# Download latest version
path = kagglehub.dataset_download("harshshinde8/movies-csv")



In [3]:
#!pip install pandas
import pandas as pd

# Import the data, drop rows with missing values, and keep 1 row out of every 10.
data = pd.read_csv(path + '/movies.csv', usecols=['original_title', 'overview', 'keywords'])
df_cleaned = data.dropna()
df_cleaned = df_cleaned.reset_index(drop=True)
df = df_cleaned.iloc[::10]
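
The effect of `dropna()` followed by `iloc[::10]` can be checked on a small, made-up table. The sketch below is only an illustration: the 25 rows and their values are hypothetical, not taken from the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 25 rows, two of which have a missing field.
toy = pd.DataFrame({
    "original_title": [f"Movie {i}" for i in range(25)],
    "overview": [f"Overview {i}" for i in range(25)],
    "keywords": [f"keyword{i}" for i in range(25)],
})
toy.loc[3, "overview"] = np.nan
toy.loc[12, "keywords"] = np.nan

# Drop every row with at least one missing value, then renumber the rows.
cleaned = toy.dropna().reset_index(drop=True)

# Keep the rows at positions 0, 10, 20, ... as an independent copy.
sample = cleaned.iloc[::10].copy()

print(len(cleaned), len(sample))
print(sample["original_title"].tolist())
```

Taking `.copy()` of the slice is a design choice of this sketch: it makes the sample an independent DataFrame, so that adding columns to it later does not raise a `SettingWithCopyWarning`.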

### 1.1 Data Cleaning

We further reorganize the data as follows: to prepare the input for a TF-IDF algorithm, we combine the movie title, overview, and keywords into a single combined feature.

In [5]:
# Work on an independent copy with a fresh 0..n-1 index, so that adding a
# column does not trigger pandas' SettingWithCopyWarning.
df = df.copy().reset_index(drop=True)
# Add a new column that concatenates title, overview, and keywords.
df['combined_features'] = df['original_title'] + ' ' + df['overview'] + ' ' + df['keywords']
df


| | keywords | original_title | overview | combined_features |
|---|---|---|---|---|
| 0 | culture clash future space war space colony so... | Avatar | In the 22nd century, a paraplegic Marine is di... | Avatar In the 22nd century, a paraplegic Marin... |
| 1 | saving the world dc comics invulnerability seq... | Superman Returns | Superman returns to discover his 5-year absenc... | Superman Returns Superman returns to discover ... |
| 2 | loss of father vigilante serum marvel comic sc... | The Amazing Spider-Man | Peter Parker is an outcast high schooler aband... | The Amazing Spider-Man Peter Parker is an outc... |
| 3 | dual identity love of one's life pizza boy mar... | Spider-Man 2 | Peter Parker is going through a major identity... | Spider-Man 2 Peter Parker is going through a m... |
| 4 | car race sequel comedy anthropomorphism best f... | Cars 2 | Star race car Lightning McQueen and his pal Ma... | Cars 2 Star race car Lightning McQueen and his... |
| ... | ... | ... | ... | ... |
| 434 | london england countryside rape sex trauma | Straightheads | There is instant chemistry between Alice (Gill... | Straightheads There is instant chemistry betwe... |
| 435 | transplantation experiment mutant brain fianc\... | The Brain That Wouldn't Die | Dr. Bill Cortner (Jason Evers) and his fiancée... | The Brain That Wouldn't Die Dr. Bill Cortner (... |
| 436 | independent film | George Washington | A delicately told and deceptively simple story... | George Washington A delicately told and decept... |
| 437 | office love independent film secretary misogynist | In the Company of Men | Two business executives--one an avowed misogyn... | In the Company of Men Two business executives-... |
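
One reason the rows with missing values are dropped before this step is that string concatenation in pandas propagates missing values: if any of the three fields is missing, the whole combined feature becomes missing. A minimal sketch on hypothetical data (the two rows below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one has no overview.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "overview": ["A short overview.", np.nan],
    "keywords": ["space war", "comedy"],
})

# Concatenating with a missing value yields a missing result for that row.
combined = toy["original_title"] + " " + toy["overview"] + " " + toy["keywords"]
print(combined.isna().tolist())

# An alternative to dropping rows (not what the notebook does): treat
# missing fields as empty strings before concatenating.
filled = toy.fillna("")
combined_filled = filled["original_title"] + " " + filled["overview"] + " " + filled["keywords"]
print(combined_filled.tolist())
```

Filling with empty strings would keep more movies in the dataset at the cost of shorter, less informative documents; the notebook drops those rows instead.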


### 1.2 (Optional) Storing the dataframe to a local csv

In [7]:
import os

os.makedirs('Data', exist_ok=True)
df.to_csv("./Data/my_data.csv", index=False)  


#### 1.2.1 Loading the dataframe from a local csv

In [9]:
df = pd.read_csv("./Data/my_data.csv", usecols=['original_title', 'overview', 'keywords'])
# Rebuild the combined feature, since only the three source columns are loaded.
df['combined_features'] = df['original_title'] + ' ' + df['overview'] + ' ' + df['keywords']
df

| | keywords | original_title | overview | combined_features |
|---|---|---|---|---|
| 0 | culture clash future space war space colony so... | Avatar | In the 22nd century, a paraplegic Marine is di... | Avatar In the 22nd century, a paraplegic Marin... |
| 1 | saving the world dc comics invulnerability seq... | Superman Returns | Superman returns to discover his 5-year absenc... | Superman Returns Superman returns to discover ... |
| 2 | loss of father vigilante serum marvel comic sc... | The Amazing Spider-Man | Peter Parker is an outcast high schooler aband... | The Amazing Spider-Man Peter Parker is an outc... |
| 3 | dual identity love of one's life pizza boy mar... | Spider-Man 2 | Peter Parker is going through a major identity... | Spider-Man 2 Peter Parker is going through a m... |
| 4 | car race sequel comedy anthropomorphism best f... | Cars 2 | Star race car Lightning McQueen and his pal Ma... | Cars 2 Star race car Lightning McQueen and his... |
| ... | ... | ... | ... | ... |
| 434 | london england countryside rape sex trauma | Straightheads | There is instant chemistry between Alice (Gill... | Straightheads There is instant chemistry betwe... |
| 435 | transplantation experiment mutant brain fianc\... | The Brain That Wouldn't Die | Dr. Bill Cortner (Jason Evers) and his fiancée... | The Brain That Wouldn't Die Dr. Bill Cortner (... |
| 436 | independent film | George Washington | A delicately told and deceptively simple story... | George Washington A delicately told and decept... |
| 437 | office love independent film secretary misogynist | In the Company of Men | Two business executives--one an avowed misogyn... | In the Company of Men Two business executives-... |


## 2. Method 1: TF-IDF via sklearn

First, we create a TF-IDF vectorizer that removes scikit-learn's built-in English stop words and fit it on the 'combined_features' column. This yields a TF-IDF matrix of shape (439, 6151): 439 documents (each a string concatenating a movie's name, overview, and keywords) and 6151 distinct terms.

### 2.1 Converting data to a TF-IDF matrix

In [13]:
#!pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['combined_features'])

print(tfidf_matrix.shape)

(439, 6151)


### 2.2 Building the pipeline for recommendation retrieval

In [15]:
from sklearn.metrics.pairwise import linear_kernel

# Lookup from movie title to row index (not used by the function below).
indices = pd.Series(df.index, index=df['original_title']).drop_duplicates()

def recommend_by_description(description, top_n=5):
    """
    Given a user description (e.g., "I like action movies set in space"),
    return the top_n most similar movies.
    """
    query_vector = tfidf.transform([description])
    
    # Compute similarity between the query vector and all movie vectors.
    # TfidfVectorizer L2-normalizes rows by default, so this dot product
    # equals cosine similarity.
    similarities = linear_kernel(query_vector, tfidf_matrix).flatten()
    
    # Return the indices of the top_n movies (descending order of similarity)
    top_indices = similarities.argsort()[::-1][:top_n]
    
    # Build a DataFrame of the recommended movies
    recommended = df.iloc[top_indices].copy()
    recommended['similarity'] = similarities[top_indices]
    
    return recommended[['original_title', 'similarity', 'overview']]
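
The function above scores movies with `linear_kernel`, a plain dot product. Because `TfidfVectorizer` L2-normalizes its output by default, that dot product should coincide with cosine similarity. The sketch below checks this on a hypothetical toy corpus and query (none of these strings come from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus standing in for the movie descriptions.
docs = ["space war hero", "space love story", "love and death"]

vec = TfidfVectorizer(stop_words="english")
doc_matrix = vec.fit_transform(docs)

# The query is projected onto the vocabulary learned from the corpus.
query = vec.transform(["a love story in space"])

dot = linear_kernel(query, doc_matrix).flatten()
cos = cosine_similarity(query, doc_matrix).flatten()
print(np.allclose(dot, cos))

# Rank documents from most to least similar and keep the best two.
top = dot.argsort()[::-1][:2]
print(top, dot[top])
```

Here the query contains exactly the non-stop-word terms of the second document, so that document ranks first with a similarity of 1.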

### 2.3 Example output

In [17]:
recommend_by_description("A movie about life, love, and death.")

| | original_title | similarity | overview |
|---|---|---|---|
| 144 | The Fountain | 0.232185 | Spanning over one thousand years, and three pa... |
| 372 | My Big Fat Independent Movie | 0.221189 | This film is a spoof along the lines of "Scary... |
| 335 | Jackass: The Movie | 0.217145 | Johnny Knoxville and his crazy friends appear ... |
| 319 | Veer-Zaara | 0.185271 | The story of the love between Veer Pratap Sing... |
| 268 | Lars and the Real Girl | 0.166271 | Sometimes you find love where you'd least expe... |
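
Note that `recommend_by_description` always returns `top_n` rows, even when a query shares no term with the vocabulary and every similarity is zero. One possible refinement, sketched below, is to keep only the results above a threshold. The helper `top_n_nonzero` and its `min_similarity` parameter are hypothetical names introduced for this illustration and are not part of the notebook.

```python
import numpy as np

def top_n_nonzero(similarities, top_n=5, min_similarity=0.0):
    """Return up to top_n indices, best first, whose score exceeds min_similarity."""
    # Sort all indices by descending similarity.
    order = np.argsort(similarities)[::-1]
    # Keep only the indices whose score is strictly above the threshold.
    keep = [int(i) for i in order if similarities[i] > min_similarity]
    return keep[:top_n]

# Hypothetical similarity scores for five movies.
scores = np.array([0.0, 0.23, 0.0, 0.17, 0.0])
print(top_n_nonzero(scores, top_n=3))

# A query with no overlapping terms yields no recommendations at all.
print(top_n_nonzero(np.zeros(4)))
```

Inside the function, the same filter could be applied to `similarities` before building the result DataFrame.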


In [18]:
#!pip freeze > requirements.txt
!python -V
#!pipreqsnb --force .

Python 3.11.11


## Salary requirements:

Expected salary: $36,000+ per month for a 40-hour week.