# 📌 Hybrid Recommendation System (Movies & Anime)

## **Unsupervised Clustering & Content-Based Approach**  

This project aims to build a **movie & anime recommendation system** using **unsupervised machine learning (clustering) and content-based filtering**.  


**Key Objectives:**  

- Cluster Netflix movies & anime based on features  
- Use content similarity to recommend relevant titles  
- Explore data to understand key patterns and trends  


### 📖 Notebook Overview  


This notebook focuses on **loading and exploring the dataset** to understand its structure, quality, and key characteristics.  


**What we are doing:**  

✔ Loading Netflix and anime datasets 
✔ Performing an initial inspection (shape, columns, missing values)  
✔ Generating summary statistics and sample data  

**Why we are doing this:**  

Understanding the dataset is crucial before feature engineering and modeling. We need to ensure data consistency and identify potential issues like missing values or incorrect formats.  

**What we expect:**  

By the end of this notebook, we should have:

✔ A clear understanding of dataset structure
✔ Insights into missing or inconsistent data  
✔ A plan for preprocessing steps  


## 🛠 **Importing Libraries & Loading Data**

In [1]:
import sys
import os

In [2]:
# Get absolute path to the project's root directory
project_root = os.path.abspath("..")

# Add project root to sys.path (so src/ and utils/ are accessible)
if project_root not in sys.path:
    sys.path.append(project_root)

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
from utils.config import MOVIES_PATH, ANIME_PATH
from utils.data_analysis import analyze_dataset
from src.data_preprocessing import create_dataframe
from summaries.register_summaries import summary_factory

In [5]:
pd.set_option('display.float_format', '{:,.2f}'.format)

### Load the Data

In [6]:
# Load datasets
movies_df = create_dataframe(MOVIES_PATH)
anime_df = create_dataframe(ANIME_PATH)

### First Look - Movies

In [7]:
result = analyze_dataset(movies_df, exclude_columns=['id'])

In [8]:
print(summary_factory.generate_summary("overview", result))
print("\n" + "="*80 + "\n")
print(summary_factory.generate_summary("observations", result))

The dataset contains 4803 rows and 20 columns.

There are 3941 missing values across 5 columns.
Missing values account for 4.10% of the dataset.
Columns with missing values and their counts:
  - homepage: 3091 missing values
  - overview: 3 missing values
  - release_date: 1 missing values
  - runtime: 2 missing values
  - tagline: 844 missing values

There are no duplicate rows in the dataset.

Data Types:

              Column DataType
              budget    int64
              genres   object
            homepage   object
                  id    int64
            keywords   object
   original_language   object
      original_title   object
            overview   object
          popularity  float64
production_companies   object
production_countries   object
        release_date   object
             revenue    int64
             runtime  float64
    spoken_languages   object
              status   object
             tagline   object
               title   object
        vote_avera

### Handling Missing Values

The dataset contains 4803 rows and 20 columns.

There are 3941 missing values across 5 columns.

  - homepage: 3091 missing values (64.36% of rows)
  - overview: 3 missing values
  - release_date: 1 missing values
  - runtime: 2 missing values
  - tagline: 844 missing values
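
The per-column counts and percentages above can be reproduced with plain pandas. The following is a minimal sketch, not the project's `analyze_dataset` implementation: the helper name `missing_value_report` and the toy frame are hypothetical and only stand in for `movies_df`.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that have gaps."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        # Share of rows (not of all cells) that are missing in each column
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy data standing in for movies_df
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [None, None, "http://example.com", None],
    "runtime": [90.0, np.nan, 120.0, 101.0],
})

report = missing_value_report(toy_df)
print(report)
```

Applied to `movies_df`, the same helper should list the five columns reported above, with `homepage` first.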

In [9]:
movies_df[movies_df['release_date'].isna()]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
4553,0,[],,380097,[],en,America Is Still the Place,1971 post civil rights San Francisco seemed li...,0.0,[],[],,0,0.0,[],Released,,America Is Still the Place,0.0,0


In [10]:
# Drop the single row without a release_date (its other fields are empty or zero as well)
movies_df = movies_df.dropna(subset=['release_date'])

In [11]:
movies_df[movies_df['overview'].isna()]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
2656,15000000,"[{""id"": 18, ""name"": ""Drama""}]",,370980,"[{""id"": 717, ""name"": ""pope""}, {""id"": 5565, ""na...",it,Chiamatemi Francesco - Il Papa della gente,,0.74,"[{""name"": ""Taodue Film"", ""id"": 45724}]","[{""iso_3166_1"": ""IT"", ""name"": ""Italy""}]",2015-12-03,0,,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,,Chiamatemi Francesco - Il Papa della gente,7.3,12
4140,2,"[{""id"": 99, ""name"": ""Documentary""}]",,459488,"[{""id"": 6027, ""name"": ""music""}, {""id"": 225822,...",en,"To Be Frank, Sinatra at 100",,0.05,"[{""name"": ""Eyeline Entertainment"", ""id"": 60343}]","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""}]",2015-12-12,0,,[],Released,,"To Be Frank, Sinatra at 100",0.0,0
4431,913000,"[{""id"": 99, ""name"": ""Documentary""}]",,292539,[],de,Food Chains,,0.8,[],[],2014-04-26,0,83.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,Food Chains,7.4,8


In [12]:
movies_df[movies_df['runtime'].isna()]

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
2656,15000000,"[{""id"": 18, ""name"": ""Drama""}]",,370980,"[{""id"": 717, ""name"": ""pope""}, {""id"": 5565, ""na...",it,Chiamatemi Francesco - Il Papa della gente,,0.74,"[{""name"": ""Taodue Film"", ""id"": 45724}]","[{""iso_3166_1"": ""IT"", ""name"": ""Italy""}]",2015-12-03,0,,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,,Chiamatemi Francesco - Il Papa della gente,7.3,12
4140,2,"[{""id"": 99, ""name"": ""Documentary""}]",,459488,"[{""id"": 6027, ""name"": ""music""}, {""id"": 225822,...",en,"To Be Frank, Sinatra at 100",,0.05,"[{""name"": ""Eyeline Entertainment"", ""id"": 60343}]","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""}]",2015-12-12,0,,[],Released,,"To Be Frank, Sinatra at 100",0.0,0


In [13]:
movies_df.drop(columns=['homepage'], inplace=True)

In [14]:
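# Assign the result back: calling fillna(inplace=True) on a column selection may not update movies_df
movies_df['tagline'] = movies_df['tagline'].fillna('no tagline')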

In [15]:
movies_df.loc[2656, 'overview'] = "On his path to becoming Pope Francis, Father Jorge Bergoglio pursues his religious vocation in a country ravaged by a brutal military dictatorship."
movies_df.loc[4140, 'overview'] = "An exploration of how singer and actor Frank Sinatra became one of the biggest stars of the 20th century while remaining, in his heart, a normal person."
movies_df.loc[4431, 'overview'] = "In America, farm labor has always been one of the most difficult, poorly paid jobs and has relied on some of the nation's most vulnerable people. While the legal restrictions that keep people bound to farms have been abolished, exploitation still exists. Ranging from wage theft to modern-day slavery, this exploitation is perpetuated by the corporations at the top of the food chain: supermarkets."

In [16]:
movies_df.loc[2656, 'runtime'] = 113
movies_df.loc[4140, 'runtime'] = 81

In [17]:
# Convert release_date to datetime; values that cannot be parsed become NaT
movies_df['release_date'] = pd.to_datetime(movies_df['release_date'], errors='coerce')

In [18]:
movies_df['release_year'] = movies_df['release_date'].dt.year
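
To illustrate what `errors='coerce'` does, here is a small sketch on a hypothetical series (not rows from the dataset). Any value that fails to parse becomes `NaT`, and `.dt.year` then returns floats with `NaN` in those positions; casting to the nullable `Int64` dtype is one possible way to keep whole-number years.

```python
import pandas as pd

# Hypothetical inputs: one valid date, one unparseable string, one missing value
dates = pd.Series(["2015-12-03", "not a date", None])

# Unparseable and missing entries are coerced to NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")

# With NaT present, the extracted year is a float column containing NaN
years = parsed.dt.year

# Nullable integer dtype keeps years as integers and marks gaps as <NA>
years_int = years.astype("Int64")

print(parsed.isna().sum())
print(years_int.tolist())
```

In this notebook the row with a missing `release_date` was dropped beforehand, so checking `movies_df['release_date'].isna().sum()` after the conversion is a quick way to confirm that nothing was silently coerced.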

In [19]:

# Convert budget and revenue to millions for better readability
movies_df['budget_million'] = movies_df['budget'] / 1e6
movies_df['revenue_million'] = movies_df['revenue'] / 1e6


In [20]:
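import ast  # literal_eval parses the list literals without executing arbitrary code, unlike eval
def process_list_columns(column):
    # Turn each stringified list of {"id", "name"} dicts into a plain list of names
    return movies_df[column].fillna('[]').apply(lambda x: [item['name'] for item in ast.literal_eval(x)] if isinstance(x, str) and x.startswith('[') else [])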

In [21]:
list_columns = ['genres', 'keywords', 'production_companies', 'production_countries']

In [22]:
for col in list_columns:
    movies_df[col] = process_list_columns(col)
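
The parsing step can also be written as a standalone function that is easy to test on a few values. This is a sketch under the assumption that the list columns hold valid JSON, as the sample rows shown earlier suggest; `extract_names` and the toy series are hypothetical names, not part of the project code.

```python
import json

import pandas as pd


def extract_names(cell):
    """Parse a JSON-encoded list of dicts and return the 'name' of each entry."""
    if not isinstance(cell, str):
        return []  # missing values (None/NaN) become an empty list
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        return []  # malformed strings are treated as empty
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


# Hypothetical values mimicking the 'genres' column
toy = pd.Series([
    '[{"id": 18, "name": "Drama"}, {"id": 99, "name": "Documentary"}]',
    '[]',
    None,
])

parsed = toy.apply(extract_names)
print(parsed.tolist())
```

Because the function takes a single cell rather than a column name, it does not depend on the global `movies_df` and can be applied with `movies_df[col].apply(extract_names)` to any of the list columns.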

In [23]:

# Outlier detection using IQR method
numeric_cols = ['budget_million', 'revenue_million', 'popularity', 'vote_average']
for col in numeric_cols:
    Q1 = movies_df[col].quantile(0.25)
    Q3 = movies_df[col].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    lower_bound = Q1 - 1.5 * IQR
    print(col, upper_bound, lower_bound)

budget_million 98.80000000000001 -58.00000000000001
revenue_million 232.2979875 -139.3787925
popularity 63.822440875000005 -30.818690125000003
vote_average 8.6 3.7999999999999994
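
The loop above only prints the bounds. A possible next step, sketched here on hypothetical toy values, is to use them to select the rows that fall outside the range. The helper name `iqr_bounds` is an assumption for illustration.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds using the IQR rule with multiplier k."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical values with one obvious outlier
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

lower, upper = iqr_bounds(toy)

# Values strictly outside [lower, upper] are flagged as outliers
outliers = toy[(toy < lower) | (toy > upper)]

print(lower, upper)
print(outliers.tolist())
```

Note that for non-negative columns such as `budget_million`, `revenue_million`, and `popularity`, the lower bounds printed above are negative, so in practice only the upper bound can flag any rows.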
