<br>

<br>

<br>

# 🎬 **MOVIE RECOMMENDATION** 🎬

**K-NEAREST NEIGHBORS**

<br>

## **INDEX**

- **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**
- **STEP 2: DATA EXPLORATION AND CLEANING**
- **STEP 3: DATA PROCESSING**
- **STEP 4: FEATURE ENGINEERING**
- **STEP 5: MODEL DEVELOPMENT**
- **STEP 6: RECOMMENDATION SYSTEM IMPLEMENTATION**
- **STEP 7: MODEL SAVING**
- **STEP 8: CONCLUSION**

<br>

## **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**

- 1.1. Problem definition
- 1.2. Library Importing
- 1.3. Data Collection

<br>

**1.1. PROBLEM DEFINITION**


The goal of this project is to create a movie recommendation system that predicts which movies might be of interest to a user based on the similarity to a given movie. The project leverages the K-Nearest Neighbors (KNN) algorithm to calculate the similarity between movies by processing and analyzing metadata from two datasets: `tmdb_5000_movies` and `tmdb_5000_credits`.

<br>

**Datasets and Interrelation**
- **`tmdb_5000_movies.csv`**: Contains information like `id`, `title`, `overview`, `genres`, and `keywords`.
- **`tmdb_5000_credits.csv`**: Includes cast and crew details for each movie.
- Both datasets share the `title` column, which is used to join them and create a unified dataset for analysis.
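
Because titles are not guaranteed to be unique, joining on `title` can pair one movie with the credits of another film that has the same name, which adds rows. The sketch below uses made-up toy frames (not the TMDB data) to show the effect. As an assumption about the schema, it also shows a join on `id` and `movie_id`, which the `head()` outputs in section 1.3 suggest hold the same identifier.

```python
import pandas as pd

# Toy data (hypothetical): two different films share the title "Batman"
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["cast A", "cast B", "cast C"],
})

# Joining on title matches each "Batman" with both credit rows
by_title = movies.merge(credits, on="title", how="left")

# Joining on the identifier keeps exactly one row per movie
by_id = movies.merge(
    credits.drop(columns="title"), left_on="id", right_on="movie_id", how="left"
)

print(len(movies), len(by_title), len(by_id))
```

In this toy case the title join returns more rows than there are movies, while the identifier join does not.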

<br>

**Methodology: K-Nearest Neighbors (KNN)**

- **KNN** is a non-parametric, instance-based learning algorithm used for classification and regression.
- "Non-parametric" means that **KNN** does not assume a fixed functional form for the data; it bases its decisions directly on the stored instances and defers the work to query time, as if it "learned on the fly" every time it needs to make a prediction.
- The methodology involves vectorizing movie metadata into numerical representations, enabling comparisons between movies.
- Using cosine distance (1 minus cosine similarity) as the metric, **KNN** identifies the closest neighbors in the feature space, where a smaller distance signifies greater similarity.
- By leveraging metadata such as **genres**, **keywords**, **cast**, and **crew**, KNN directly aligns with the project's goal: to recommend movies based on their resemblance to a given input.
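
For reference, the cosine distance that a cosine-based nearest-neighbor search minimizes can be written as follows, where $\mathbf{a}$ and $\mathbf{b}$ are the feature vectors of two movies (the symbols are introduced here for illustration):

```latex
d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
```

A distance of 0 means the two vectors point in the same direction. For non-negative vectors such as TF-IDF weights, a distance of 1 means the two movies share no terms.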

<br>

**1.2. LIBRARY IMPORTING**

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import json
import pickle
import warnings
warnings.filterwarnings('ignore')

from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer

**1.3. DATA COLLECTION**

In [2]:

movies_data_url = "https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv"
credits_data_url = "https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv"

In [3]:
# Load datasets
movies_df = pd.read_csv(movies_data_url)
credits_df = pd.read_csv(credits_data_url)

In [4]:
movies_df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [5]:
credits_df.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [6]:
# Merge datasets
df = movies_df.merge(credits_df, on='title', how='left')
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


<br>

## **STEP 2: DATA EXPLORATION AND CLEANING**

- 2.1. Exploration
- 2.2. Cleaning

<br>

**2.1. EXPLORATION**

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

<br>

**2.2. CLEANING**

In [8]:
# Keep only the columns needed for the recommender and drop the rest
df = df[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]
df.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [9]:
# Check for missing values.
print(df.isnull().sum())

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64


In [10]:
# Drop rows with missing values in the columns used to build the tags
df = df.dropna(subset=['overview', 'genres', 'keywords', 'cast', 'crew'])

<br>

<br>

## **STEP 3: DATA PROCESSING**

- 3.1. Helper Functions for JSON Processing
- 3.2. Text Normalization Function
- 3.3. Applying the Functions to the Dataset

In this step, the goal is to clean and transform the raw data into a structured and meaningful format. This involves extracting relevant information from JSON-like columns, normalizing text data, and preparing it for further analysis and feature engineering.


Summary of objectives
- **Extract Relevant Information**: Convert JSON strings into usable lists of names (e.g., `genres`, `keywords`, `cast`, `director`).
- **Normalize Data**: Standardize the format of text columns to avoid inconsistencies.
- **Prepare Data for Feature Engineering**: Ensure the processed data is ready to be combined and transformed in the next steps.
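
As a small illustration of the extraction objective, the sketch below parses a JSON array string shaped like the `genres` column and keeps only the names. The sample string and the function name `names_from_json` are hypothetical and used only for this example.

```python
import json

# Hypothetical sample in the same shape as the `genres` column
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

def names_from_json(json_str, key="name"):
    """Return the `key` value of every object in a JSON array string."""
    try:
        return [item[key] for item in json.loads(json_str)]
    except (TypeError, json.JSONDecodeError):
        # Missing or malformed input yields an empty list
        return []

genre_names = names_from_json(raw_genres)
print(genre_names)
print(names_from_json(None))
```

Returning an empty list for bad input keeps a later `" ".join(...)` call from failing on rows with missing data.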


<br>

**3.1. HELPER FUNCTIONS FOR JSON PROCESSING**

In [11]:
# Define helper functions to process JSON strings
def extract_names_from_json(json_str, key="name"):
    try:
        data = json.loads(json_str)
        return [item[key] for item in data]
    except (TypeError, json.JSONDecodeError):
        return []

def extract_top_n_names_from_json(json_str, n=3, key="name"):
    try:
        data = json.loads(json_str)
        return [item[key] for item in data[:n]]
    except (TypeError, json.JSONDecodeError):
        return []

def extract_director(json_str, key="job", value="Director", name_key="name"):
    try:
        data = json.loads(json_str)
        for item in data:
            if item.get(key) == value:
                return item.get(name_key, "")
        return ""
    except (TypeError, json.JSONDecodeError):
        return ""


**3.2. TEXT NORMALIZATION FUNCTION**

In [12]:
def remove_spaces(text):
    if isinstance(text, str):
        return text.replace(" ", "")
    return text

**3.3. APPLYING THE FUNCTIONS TO THE DATASET**

In [13]:
# Apply processing functions to clean data
df["genres"] = df["genres"].apply(lambda x: " ".join(extract_names_from_json(x)))
df["keywords"] = df["keywords"].apply(lambda x: " ".join(extract_names_from_json(x)))
df["cast"] = df["cast"].apply(lambda x: " ".join(extract_top_n_names_from_json(x, n=3)))
df["crew"] = df["crew"].apply(lambda x: extract_director(x))
df["overview"] = df["overview"].apply(lambda x: x.split() if isinstance(x, str) else [])

# Remove spaces from processed columns
df["genres"] = df["genres"].apply(remove_spaces)
df["keywords"] = df["keywords"].apply(remove_spaces)
df["cast"] = df["cast"].apply(remove_spaces)
df["crew"] = df["crew"].apply(remove_spaces)
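
One thing to be aware of: joining names with a space and then removing every space fuses all the names in a cell into one long token (for example `ActionAdventureFantasyScienceFiction`). A possible alternative, shown here as a sketch rather than what the notebook does, strips the spaces inside each name first and only then joins, so each genre or person remains a separate token. The helper name `join_compact` and the sample list are hypothetical.

```python
def join_compact(names):
    """Remove spaces inside each name, then join the names with single spaces."""
    return " ".join(name.replace(" ", "") for name in names)

sample_genres = ["Action", "Adventure", "Science Fiction"]

fused = " ".join(sample_genres).replace(" ", "")  # join first, then strip spaces
separate = join_compact(sample_genres)            # strip per name, then join

print(fused)
print(separate)
```

Adopting this variant would change the `tags` values and therefore the recommendations the model returns.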


<br>

## **STEP 4: FEATURE ENGINEERING**

The **Feature Engineering** step involves creating a new feature that consolidates all relevant information into a single column. This step is critical for preparing the dataset for vectorization and similarity calculations in the next steps.

1. The goal is to combine data from multiple processed columns (`genres`, `keywords`, `cast`, `crew`, and `overview`) into a single column called `tags`. This column represents a concise summary of all the important attributes of a movie, making it easier to calculate text similarity later.

2. Combine Relevant Columns into the `tags` Column:

     - `genres`: Contains the names of movie genres (e.g., `Action`, `Comedy`).
     - `keywords`: Contains descriptive keywords related to the movie.
     - `cast`: Contains the names of the top 3 cast members.
     - `crew`: Contains the director's name.
     - `overview`: Contains a list of words from the movie's description.

3. Delimiter:
     - The `|` delimiter is used to clearly separate the different elements in the `tags` column.

4. Row-wise Combination:
     - The `apply` function processes each row individually to combine the selected columns into a single string.



In [14]:
# Combine relevant columns into a single 'tags' column with a clear delimiter
df["tags"] = df.apply(lambda row: "|".join(map(str, [
    row["genres"], row["keywords"], row["cast"], row["crew"], " ".join(row["overview"])
])), axis=1)


df=df[["movie_id","title","tags"]]

In [15]:
df.head()

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,ActionAdventureFantasyScienceFiction|culturecl...
1,285,Pirates of the Caribbean: At World's End,AdventureFantasyAction|oceandrugabuseexoticisl...
2,206647,Spectre,ActionAdventureCrime|spybasedonnovelsecretagen...
3,49026,The Dark Knight Rises,ActionCrimeDramaThriller|dccomicscrimefightert...
4,49529,John Carter,ActionAdventureScienceFiction|basedonnovelmars...


<br>

<br>

## **STEP 5: MODEL DEVELOPMENT**

- 5.1. Vectorize the 'tags' column
- 5.2. Fit the KNN model (cosine distance)

The **Model Development** step focuses on converting the text data in the `tags` column into numerical representations (vectors) and calculating similarity between movies based on these vectors.

- Purpose:
   - To prepare the `tags` column for mathematical operations by vectorizing it.
   - To fit a nearest-neighbors model that uses cosine distance to measure how similar movies are.
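
As background, scikit-learn's `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm='l2'`) weights a term $t$ in a document $d$ roughly as follows, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain $t$ (the notation is introduced here for illustration):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \left( \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right)
```

Each document vector is then scaled to unit Euclidean length, so terms that appear in few movies carry more weight than terms that appear in most of them.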

<br>

**5.1. VECTORIZE THE `tags` COLUMN**

In [16]:
# Vectorize the 'tags' column
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(token_pattern=r'\b\w+\b', lowercase=True)



In [17]:
matrix = vect.fit_transform(df['tags'])
knn_model = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine')
knn_model.fit(matrix)
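
To get a feel for what the vectorizer works with, the following sketch approximates its tokenization step using the standard `re` module, with the same `token_pattern` and lowercasing. The sample `tags` string is made up for illustration, and the sketch mimics only tokenization, not the TF-IDF weighting.

```python
import re

TOKEN_PATTERN = r"\b\w+\b"  # same pattern passed to TfidfVectorizer above

# Hypothetical tags string in the genres|keywords|cast|crew|overview layout
sample_tags = "ActionAdventure|spacetravel|SamWorthington|JamesCameron|A Marine on an alien moon"

# Lowercase first, then collect every run of word characters
tokens = re.findall(TOKEN_PATTERN, sample_tags.lower())
print(tokens)
```

The `|` delimiter is not a word character, so it only separates tokens and never becomes part of the vocabulary.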

<br>

<br>

## **STEP 6: RECOMMENDATION SYSTEM IMPLEMENTATION**


The **Recommendation System Implementation** step uses the fitted KNN model to find and suggest movies similar to a given input movie.

1. Purpose:
   - To build a function that leverages cosine distance to recommend the movies most similar to the input movie.

2. Code:
   - `get_movie_recommendations` function:
     - Finds the position of the input movie.
     - Queries the KNN model for its 5 nearest neighbors in the TF-IDF matrix.
     - Drops the first neighbor (the input movie itself) and returns the remaining 4 with their distances.

In [18]:
def recommend_similar_movies(movie_title, model, vectorizer, data):
    # Vectorize the query text and look up its nearest neighbors
    input_vector = vectorizer.transform([movie_title])
    distances, indices = model.kneighbors(input_vector)
    # kneighbors returns row positions, so index positionally into `data`
    recommended_movies = [data["title"].iloc[i] for i in indices[0]]
    return recommended_movies


In [19]:
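def get_movie_recommendations(movie_title):
    # Work with row positions (not index labels) so lookups line up with `matrix`,
    # whose rows follow the order of df after rows with missing values were dropped
    titles = df["title"].reset_index(drop=True)
    movie_pos = titles[titles == movie_title].index[0]
    distances, indices = knn_model.kneighbors(matrix[movie_pos])
    similar_movies = [(titles[i], distances[0][j]) for j, i in enumerate(indices[0])]
    # The first neighbor is the input movie itself, so skip it
    return similar_movies[1:]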

In [20]:
input_movie_title = 'Avatar'
recommended_movies = get_movie_recommendations(input_movie_title)
print("Recommended movies for: {}".format(input_movie_title))
for movie in recommended_movies:
    print(movie)

Recommended movies for: Avatar
('Lone Wolf McQuade', np.float64(0.842325164001489))
('Tears of the Sun', np.float64(0.8644061210172322))
('The American', np.float64(0.8807373542628273))
('The Inhabited Island', np.float64(0.8891495853774054))
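
Because the query movie is itself part of the fitted data, its nearest neighbor is normally the movie itself, which is why a model fitted with `n_neighbors=5` yields four recommendations once the first result is dropped. The pure-Python sketch below illustrates the idea on toy vectors. The titles, the vectors, and the helper `nearest_excluding_self` are hypothetical and stand in for the TF-IDF matrix and the scikit-learn model.

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_excluding_self(query_pos, vectors, k):
    """Return positions of the k rows closest to vectors[query_pos], skipping the query."""
    distances = [
        (cosine_distance(vectors[query_pos], vec), pos)
        for pos, vec in enumerate(vectors)
        if pos != query_pos
    ]
    return [pos for _, pos in sorted(distances)[:k]]

titles = ["Space War", "Space Peace", "Love Story", "Star Battle"]
vectors = [
    [1.0, 1.0, 0.0],  # Space War
    [1.0, 0.0, 0.0],  # Space Peace
    [0.0, 0.0, 1.0],  # Love Story
    [0.9, 1.0, 0.1],  # Star Battle
]

neighbors = nearest_excluding_self(0, vectors, k=2)
print([titles[pos] for pos in neighbors])
```

With scikit-learn, a similar effect can be obtained by requesting one extra neighbor through the `n_neighbors` argument of `kneighbors` and then dropping the first result.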


<br>

<br>

## **STEP 7: MODEL SAVING**

- 7.1. SAVE THE VECTORIZER
- 7.2. SAVE THE KNN MODEL

<br>

-  **Purpose**:
   - Save the `TfidfVectorizer` object, which transforms text into numerical vectors.
   - Save the fitted KNN model (cosine distance), which forms the core of the recommendation system.

-  **Method Used**:
   - **Pickle (PKL)**: A Python-native serialization library used to store and retrieve Python objects.
   - The `vectorizer.pkl` file stores the trained `TfidfVectorizer`.
   - The `knn_model.pkl` file stores the fitted KNN model.
   - The `base_movies.csv` file stores the `movie_id`, `title`, and `tags` columns.


**7.1. SAVE THE VECTORIZER**

In [21]:
# Save the vectorizer
with open("vectorizer.pkl", "wb") as file:
    pickle.dump(vect, file)

**7.2. SAVE THE KNN MODEL**

In [26]:
# Save the fitted KNN model
with open("knn_model.pkl", "wb") as file:
    pickle.dump(knn_model, file)

print("Model saved successfully!")

Model saved successfully!


In [27]:
df.to_csv('base_movies.csv', index=False)
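
To reuse the saved artifacts later, they can be read back with `pickle.load`. The sketch below shows the round trip with a stand-in dictionary written to a temporary directory so that it runs on its own; the same pattern would apply to `vectorizer.pkl` and `knn_model.pkl`. The object and file name used here are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be `vect` or `knn_model`
artifact = {"name": "demo-vectorizer", "vocabulary_size": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifact.pkl")

    # Save the object in binary mode
    with open(path, "wb") as file:
        pickle.dump(artifact, file)

    # Load it back in binary mode
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == artifact)
```

Unpickling can execute arbitrary code, so it is safest to load only files that come from a trusted source.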


<br>

<br>

<br>

## **STEP 8: CONCLUSION**

   - The objective was to create a content-based movie recommendation system using a dataset of movies and their attributes.
   - The system identifies movies similar to a given input based on textual similarity of features like genres, keywords, cast, crew, and overview.

<br>

----

<br>

### **Solution Developed**:
   - **Data Collection**: We used two datasets (`tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`) and merged them to combine relevant information.
   - **Data Processing**: JSON-formatted data was processed to extract meaningful attributes (e.g., genres, top 3 cast members, director) and normalized for consistency.
   - **Feature Engineering**: A consolidated `tags` column was created to summarize all key attributes of each movie into a single feature for comparison.
   - **Model Development**:
     - The `TfidfVectorizer` was used to transform the `tags` column into numerical vectors.
     - A KNN model with cosine distance was fitted to measure how closely related two movies are based on their vectorized features.
   - **Recommendation System Implementation**:
     - A function was built that queries the KNN model and recommends the movies nearest to the input movie (4 with the current `n_neighbors=5`, since the input movie itself is excluded).
   - **Model Saving**:
     - The `TfidfVectorizer` and the KNN model were stored as `.pkl` files for reuse in future projects.

<br>

---

<br>


### **Outcome**:
   - The recommendation system is functional, and its saved components can be loaded for integration into a web application.
   - The approach ensures scalability and reusability by saving critical components for later deployment.

<br>

---

<br>

### **Limitations and Future Work**:
   - The current system is content-based, so it cannot handle collaborative recommendations (user preferences).
   - Further optimization and integration into a web framework (e.g., Flask) can be developed in the next phase of the project.
