# Netflix Movie Recommendation System using Content-Based Filtering

## Project Overview
This notebook implements a content-based recommendation system for Netflix movies and TV shows. The system analyzes movie metadata (genre, cast, director, description) to recommend similar titles.

## Objectives
1. Understand content-based filtering algorithms
2. Perform text preprocessing using NLP techniques
3. Convert metadata to numerical feature vectors
4. Build a similarity-based recommendation model
5. Evaluate and visualize recommendations

## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
import re
import warnings
warnings.filterwarnings('ignore')

nltk.download('stopwords')

print("All libraries imported successfully!")

## 2. Load and Explore the Dataset

In [None]:
df = pd.read_csv('../data/netflix_sample.csv')
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()

In [None]:
print("\nDataset Info:")
print(df.info())
print("\nMissing Values:")
print(df.isnull().sum())

In [None]:
print("\nBasic Statistics:")
df.describe()

In [None]:
print("\nContent Types:")
print(df['type'].value_counts())
print("\nSample Genres:")
print(df['listed_in'].value_counts().head(10))

## 3. Data Preprocessing

In [None]:
print("Missing values before preprocessing:")
print(df.isnull().sum())

df['listed_in'] = df['listed_in'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['director'] = df['director'].fillna('Unknown')
df['description'] = df['description'].fillna('Unknown')

print("\nMissing values after preprocessing:")
print(df.isnull().sum())

## 4. Text Cleaning Function

In [None]:
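# Build the stopword set once; rebuilding it inside the function repeats the work for every row.
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase, strip non-letters, and drop stopwords and tokens of one or two characters."""
    if pd.isna(text):
        return ""

    text = str(text).lower()
    # Replace every character that is not a letter or whitespace with a space
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)

    words = [word for word in text.split() if word not in STOP_WORDS and len(word) > 2]
    return ' '.join(words)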

print("Text cleaning function defined.")
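
To make the effect of the cleaning step concrete, the sketch below mirrors the same steps on two sample strings. It is an illustration only: `clean_text_demo`, the sample strings, and the tiny hard-coded stopword set are assumptions introduced here so the snippet runs without the NLTK download.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list,
# used only so this snippet runs without downloading NLTK data.
DEMO_STOP_WORDS = {"the", "and", "a", "of", "in", "to", "is"}

def clean_text_demo(text):
    """Mirror the notebook's cleaning steps on a single string."""
    text = str(text).lower()
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Drop stopwords and tokens of one or two characters.
    words = [w for w in text.split() if w not in DEMO_STOP_WORDS and len(w) > 2]
    return " ".join(words)

genre_example = clean_text_demo("Sci-Fi & Fantasy, TV Shows")
cast_example = clean_text_demo("Robert De Niro, Al Pacino")
print(genre_example)
print(cast_example)
```

Two side effects are worth keeping in mind: short tokens such as "tv" and "fi" are dropped by the length filter, and multi-word names are split into independent tokens, so two different people who share a first name contribute a common term.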

## 5. Feature Engineering - Create Metadata Soup

In [None]:
print("Creating metadata soup...")

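# Missing fields were filled with 'Unknown' in Section 3. Blank them out here so the
# placeholder does not become a shared token that makes unrelated titles look similar.
df['metadata_soup'] = (
    df['listed_in'].replace('Unknown', '').fillna('') + ' ' +
    df['cast'].replace('Unknown', '').fillna('') + ' ' +
    df['director'].replace('Unknown', '').fillna('') + ' ' +
    df['description'].replace('Unknown', '').fillna('')
)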

print(f"\nSample raw metadata soup:\n{df['metadata_soup'].iloc[0][:300]}...")

In [None]:
print("\nCleaning metadata soup...")
df['metadata_soup'] = df['metadata_soup'].apply(clean_text)

print(f"Sample cleaned metadata soup:\n{df['metadata_soup'].iloc[0]}")

## 6. Feature Vectorization with TF-IDF

In [None]:
print("Vectorizing features using TF-IDF...")

tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.8
)

tfidf_matrix = tfidf_vectorizer.fit_transform(df['metadata_soup'])
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Number of features created: {len(tfidf_vectorizer.get_feature_names_out())}")

In [None]:
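print("\nTop 20 features by total TF-IDF weight:")
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
# get_feature_names_out() is ordered alphabetically, so rank by summed weight instead
total_weights = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
print(feature_names[total_weights.argsort()[::-1][:20]])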

## 7. Compute Cosine Similarity Matrix

In [None]:
print("Computing cosine similarity matrix...")
similarity_matrix = cosine_similarity(tfidf_matrix)
print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"Similarity matrix type: {type(similarity_matrix)}")

In [None]:
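print("\nSample similarity values:")
print(similarity_matrix[0][:10])
# These statistics include the diagonal (each title compared with itself), so the
# maximum is trivially 1.0; Section 10 recomputes them on off-diagonal pairs only.
print(f"\nMin similarity: {similarity_matrix.min():.4f}")
print(f"Max similarity: {similarity_matrix.max():.4f}")
print(f"Mean similarity: {similarity_matrix.mean():.4f}")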
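
For reference, the value computed for two TF-IDF rows, written here as $\mathbf{a}$ and $\mathbf{b}$ (symbols introduced for illustration), is the standard cosine similarity:

$$
\operatorname{sim}(\mathbf{a}, \mathbf{b})
= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert_2 \, \lVert \mathbf{b} \rVert_2}
= \frac{\sum_{k} a_k b_k}{\sqrt{\sum_{k} a_k^2}\,\sqrt{\sum_{k} b_k^2}}
$$

Because TF-IDF weights are non-negative, the result lies between 0 (no shared terms) and 1 (same direction). With scikit-learn's default `norm='l2'`, every non-empty TF-IDF row already has unit length, so the denominator equals 1 and the similarity reduces to the dot product.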

## 8. Build Recommendation Function

In [None]:
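def get_recommendations(title, num_recommendations=10):
    # Prefer an exact (case-insensitive) title match; fall back to a literal substring match.
    titles = df['title'].fillna('').astype(str)
    matches = np.flatnonzero((titles.str.lower() == title.lower()).to_numpy())
    if len(matches) == 0:
        # regex=False so characters such as '(' or '?' in a title are matched literally
        matches = np.flatnonzero(titles.str.contains(title, case=False, regex=False).to_numpy())

    if len(matches) == 0:
        print(f"No movies found matching '{title}'")
        return None

    # Positional row number, which is what similarity_matrix is indexed by
    movie_index = matches[0]

    similarity_scores = similarity_matrix[movie_index]
    # Sort by descending similarity and drop the query title explicitly;
    # with tied scores it is not guaranteed to be the first element.
    ranked = similarity_scores.argsort()[::-1]
    similar_indices = ranked[ranked != movie_index][:num_recommendations]

    recommendations = df.iloc[similar_indices][['title', 'type', 'listed_in', 'description', 'release_year']].copy()
    recommendations['similarity_score'] = similarity_scores[similar_indices]
    recommendations = recommendations.reset_index(drop=True)
    recommendations.index = recommendations.index + 1

    return recommendations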

print("Recommendation function defined.")

## 9. Test Recommendations

In [None]:
print("Available titles in dataset:")
print(df['title'].unique()[:20])

In [None]:
test_title = df['title'].iloc[0]
print(f"\nGetting recommendations for: '{test_title}'")
print(f"Type: {df[df['title'] == test_title]['type'].values[0]}")
print(f"Genre: {df[df['title'] == test_title]['listed_in'].values[0]}")
print(f"\nTop 5 Recommendations:")

recommendations = get_recommendations(test_title, num_recommendations=5)
recommendations

In [None]:
test_title2 = df['title'].iloc[5]
print(f"Getting recommendations for: '{test_title2}'")
print(f"Type: {df[df['title'] == test_title2]['type'].values[0]}")
print(f"Genre: {df[df['title'] == test_title2]['listed_in'].values[0]}")
print(f"\nTop 10 Recommendations:")

recommendations2 = get_recommendations(test_title2, num_recommendations=10)
recommendations2

## 10. Evaluation and Analysis

In [None]:
print("Analyzing Similarity Scores Distribution:")

similarity_scores_flat = similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(similarity_scores_flat, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Similarity Score')
plt.ylabel('Frequency')
plt.title('Distribution of Similarity Scores')
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.boxplot(similarity_scores_flat)
plt.ylabel('Similarity Score')
plt.title('Similarity Scores Box Plot')
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Min similarity: {similarity_scores_flat.min():.4f}")
print(f"Max similarity: {similarity_scores_flat.max():.4f}")
print(f"Mean similarity: {similarity_scores_flat.mean():.4f}")
print(f"Median similarity: {np.median(similarity_scores_flat):.4f}")
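
The score distribution describes the similarity matrix, not the recommendations themselves. One possible sanity check, sketched below, is the average genre overlap between a query title and its recommendations. The helper names and sample values are hypothetical. Because genres are part of the metadata soup, this check is partly circular: treat it as a consistency check, not an independent measure of quality.

```python
def genre_set(listed_in):
    """Split a comma-separated 'listed_in' string into a set of genre names."""
    return {g.strip().lower() for g in listed_in.split(",") if g.strip()}

def mean_genre_jaccard(query_genres, recommended_genres):
    """Average Jaccard overlap between the query's genres and each recommendation's."""
    query = genre_set(query_genres)
    scores = []
    for rec in recommended_genres:
        rec_set = genre_set(rec)
        union = query | rec_set
        scores.append(len(query & rec_set) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical values; in the notebook these would come from df['listed_in']
# for the query title and from the 'listed_in' column of get_recommendations().
query = "Dramas, International Movies"
recs = ["Dramas, International Movies", "Dramas, Thrillers", "Comedies"]
score = mean_genre_jaccard(query, recs)
print(f"Mean genre Jaccard overlap: {score:.3f}")
```

A score near 1 means the recommendations stay within the query's genres, and a score near 0 means they rarely share any.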

## 11. Genre-Based Analysis

In [None]:
print("Genre Distribution:")

genres = df['listed_in'].str.split(',').explode().str.strip()
genre_counts = genres.value_counts()

plt.figure(figsize=(12, 6))
genre_counts.head(15).plot(kind='barh', color='steelblue')
plt.xlabel('Number of Shows/Movies')
plt.title('Top 15 Genres in Dataset')
plt.tight_layout()
plt.show()

print(f"\nTotal unique genres: {len(genre_counts)}")
print(f"\nTop 10 genres:\n{genre_counts.head(10)}")

## 12. Content Type Distribution

In [None]:
print("Content Type Distribution:")

type_counts = df['type'].value_counts()

plt.figure(figsize=(8, 6))
colors = ['#FF6B6B', '#4ECDC4']
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', colors=colors, startangle=90)
plt.title('Distribution of Content Type')
plt.tight_layout()
plt.show()

print(f"\n{type_counts}")

## 13. Release Year Trends

In [None]:
print("Release Year Trends:")

year_counts = df['release_year'].value_counts().sort_index()

plt.figure(figsize=(12, 6))
plt.plot(year_counts.index, year_counts.values, marker='o', linewidth=2, color='steelblue')
plt.xlabel('Release Year')
plt.ylabel('Number of Releases')
plt.title('Content Releases Over Years')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## 14. Model Summary Statistics

In [None]:
print("Model Summary:")
print("="*50)
print(f"Total movies/shows in database: {len(df)}")
print(f"TF-IDF features generated: {tfidf_matrix.shape[1]}")
print(f"Sparsity of TF-IDF matrix: {(1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%")
# Exclude the diagonal (self-similarity) so it does not inflate the statistics
pairwise_scores = similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]
print(f"\nSimilarity Matrix Statistics (off-diagonal pairs):")
print(f"  - Shape: {similarity_matrix.shape}")
print(f"  - Min value: {pairwise_scores.min():.4f}")
print(f"  - Max value: {pairwise_scores.max():.4f}")
print(f"  - Mean value: {pairwise_scores.mean():.4f}")
print(f"  - Std deviation: {pairwise_scores.std():.4f}")

## 15. Conclusions and Insights

### Key Findings:

1. **Dataset Composition**: The Netflix dataset contains a mix of Movies and TV Shows with diverse genres.

2. **Feature Engineering Success**: By combining genre, cast, director, and description, we created meaningful metadata that captures the essence of each show.

3. **TF-IDF Vectorization**: Converted text to numerical features with up to 5,000 dimensions (`max_features=5000` is an upper bound; `min_df=2` and `max_df=0.8` can leave fewer), covering both unigrams and bigrams.

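4. **Similarity Distribution**: Pairwise cosine similarity scores fall between 0 and 1. Most title pairs share few terms and are expected to score near 0, so a small set of clearly higher scores is what lets the model separate similar from dissimilar content; the plots in Section 10 show whether that holds for this dataset.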

5. **Recommendation Quality**: This notebook uses no ground-truth relevance labels, so quality is judged by inspecting the results in Section 9. Recommendations are driven by:
   - Genre similarity
   - Cast overlap
   - Director similarity
   - Plot description similarity

### How the System Works:

1. **Data Preprocessing**: Handle missing values and combine relevant features
2. **Text Cleaning**: Lowercase the text and remove stopwords, special characters, and very short tokens
3. **Vectorization**: Convert text to TF-IDF vectors
4. **Similarity Computation**: Calculate cosine similarity between all pairs
5. **Ranking**: Return top-N most similar items
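
The steps above can be sketched end to end on a miniature catalogue. The four titles and their metadata strings are hypothetical, and `top_n` is an illustrative helper, not the notebook's `get_recommendations`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical, already-cleaned metadata strings for four titles.
titles = ["Heist A", "Heist B", "Romcom C", "Romcom D"]
soups = [
    "crime thriller heist bank robbery",
    "crime thriller heist casino robbery",
    "romantic comedy wedding love",
    "romantic comedy love city",
]

# Steps 3-4: vectorize, then compare every pair of titles.
vectors = TfidfVectorizer().fit_transform(soups)
sims = cosine_similarity(vectors)

def top_n(query_pos, n=1):
    """Step 5: rank all other titles by similarity to the query title."""
    scores = sims[query_pos].copy()
    scores[query_pos] = -np.inf  # exclude the query itself explicitly
    best = np.argsort(scores)[::-1][:n]
    return [titles[i] for i in best]

print(top_n(0))
print(top_n(2))
```

Each heist title is paired with the other heist title and each romantic comedy with the other romantic comedy, because titles from different groups share no terms and score 0.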

### Advantages of Content-Based Filtering:
- No item cold-start problem: a new title can be recommended as soon as its metadata is available
- Transparent recommendations based on features
- Works well for niche content
- No user data required
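
The cold-start point can be illustrated with a small sketch: a title that has just been added, with no viewing history, is compared against the existing catalogue using only its metadata. The catalogue, titles, and variable names below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue of cleaned metadata strings.
catalogue_titles = ["Space Saga", "Baking Contest", "Alien Invasion"]
catalogue_soups = [
    "science fiction space adventure crew",
    "reality cooking baking competition",
    "science fiction alien invasion earth",
]

vectorizer = TfidfVectorizer()
catalogue_matrix = vectorizer.fit_transform(catalogue_soups)

# A brand-new title with no viewing history: only its metadata is needed.
# transform() reuses the fitted vocabulary; words it has not seen are ignored.
new_soup = "baking cooking pastry challenge"
new_vector = vectorizer.transform([new_soup])

scores = cosine_similarity(new_vector, catalogue_matrix).ravel()
closest = catalogue_titles[scores.argmax()]
print(closest)
```

One design consequence: words that appear only in the new title ("pastry", "challenge") contribute nothing until the vectorizer is refitted, so periodic refitting is still needed as the catalogue grows.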

### Limitations:
- Cannot discover completely new genres for users
- Relies on quality of metadata
- May create "filter bubbles" by recommending similar content

### Future Improvements:
- Incorporate user ratings and viewing history
- Hybrid approach combining content-based and collaborative filtering
- Deep learning for feature extraction
- User preference learning