# 🎬 Netflix Recommendation System - ULTIMATE OPTIMIZED MODEL

## 🚀 Maximum Performance Optimization

**Problem:** In the current model, 77.9% of top-10 recommendations are poor (similarity < 40%).

**Goal:** Reach 50-60% average similarity with fewer than 30% poor recommendations.

### Key Optimizations:
1. ✅ **Increased features**: 3000 → 5000 (larger vocabulary)
2. ✅ **Trigrams included**: (1,3) instead of (1,2) for better context
3. ✅ **Aggressive weighting**: Genre 4x, Description 3x
4. ✅ **Better filtering**: Optimized min_df and max_df
5. ✅ **Enhanced preprocessing**: Multiple cleaning passes
6. ✅ **Country/Rating features**: Added for better matching

---

## Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import pickle
import re
from datetime import datetime
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported!")
print(f"📅 {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n🎯 Target: Average similarity > 50%")
print("🎯 Target: Poor recommendations < 30%")

## Step 2: Load Cleaned Data

In [None]:
# Load cleaned dataset
df = pd.read_csv('netflix_cleaned.csv')

print(f"✅ Loaded {len(df):,} titles")
print(f"📊 Shape: {df.shape}")
print(f"📋 Columns: {df.columns.tolist()}")

# Reset index
df = df.reset_index(drop=True)
print("✅ Index reset")

# Check for missing values
print("\n📊 Missing values:")
for col in df.columns:
    missing = df[col].isna().sum()
    if missing > 0:
        print(f"   {col}: {missing} ({missing/len(df)*100:.1f}%)")

## Step 3: ULTIMATE Feature Engineering

### Revolutionary Improvements:
1. **Multi-pass text cleaning** - Remove noise thoroughly
2. **Aggressive feature weighting**:
   - Genre: 4x weight (most critical)
   - Description: 3x weight (very important)
   - Director: 2x weight
   - Cast: 2x weight
   - Country: 2x weight (NEW - helps regional matching)
   - Rating: 1x weight (NEW - helps age-appropriate matching)
3. **Enhanced text preprocessing** - Better normalization
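
The weighting in this step works by repetition: a field with weight 4 is written into the feature string four times, so its terms are counted four times before TF-IDF is applied. A minimal sketch of the same idea driven by a weight table is shown here. `FIELD_WEIGHTS` and `build_weighted_text` are hypothetical names introduced for illustration, and the sketch deliberately skips the text-cleaning step.

```python
import pandas as pd

# Hypothetical weight table; the values mirror the weights listed in this step.
FIELD_WEIGHTS = {
    "listed_in": 4,
    "description": 3,
    "director": 2,
    "cast": 2,
    "country": 2,
    "rating": 1,
    "type": 1,
}

def build_weighted_text(row, weights=FIELD_WEIGHTS):
    """Repeat each non-empty field `weight` times and join with spaces."""
    parts = []
    for column, weight in weights.items():
        value = row.get(column, "")
        if pd.isna(value) or str(value).strip() == "":
            continue  # missing fields contribute nothing
        parts.extend([str(value).lower().strip()] * weight)
    return " ".join(parts)

# Toy row with only three of the seven fields present.
example = pd.Series({"listed_in": "Dramas", "description": "A quiet town", "type": "Movie"})
text = build_weighted_text(example)
print(text)
```

Keeping the weights in one table makes it easier to try a different mix without editing the function body.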

In [None]:
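import unicodedata

def advanced_text_cleaning(text):
    """
    Multi-pass advanced text cleaning
    """
    if pd.isna(text) or text == 'Unknown':
        return ''
    
    # Convert to string and lowercase
    text = str(text).lower()
    
    # Fold accents ("é" -> "e") so the character filter below keeps accented names intact
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')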
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove special characters but keep spaces and alphanumeric
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    
    # Remove numbers (optional - uncomment if you want)
    # text = re.sub(r'\d+', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    # Remove single characters
    text = ' '.join([word for word in text.split() if len(word) > 1])
    
    return text.strip()

def create_ultimate_features(row):
    """
    Create ultimate feature string with AGGRESSIVE weighting:
    - Genre: 4x weight (MOST CRITICAL for similarity)
    - Description: 3x weight (Very important for content)
    - Director: 2x weight (Important for style)
    - Cast: 2x weight (Important for type)
    - Country: 2x weight (NEW - Regional/cultural matching)
    - Rating: 1x weight (NEW - Age-appropriate matching)
    - Type: 1x weight (Movie vs TV Show)
    """
    
    # Clean each component with advanced cleaning
    genre = advanced_text_cleaning(row['listed_in'])
    description = advanced_text_cleaning(row['description'])
    director = advanced_text_cleaning(row['director'])
    cast = advanced_text_cleaning(row['cast'])
    country = advanced_text_cleaning(row.get('country', ''))
    rating = advanced_text_cleaning(row.get('rating', ''))
    content_type = advanced_text_cleaning(row['type'])
    
    # Apply aggressive weighting
    genre_weighted = ' '.join([genre] * 4)  # 4x weight!
    description_weighted = ' '.join([description] * 3)  # 3x weight!
    director_weighted = ' '.join([director] * 2)  # 2x weight
    cast_weighted = ' '.join([cast] * 2)  # 2x weight
    country_weighted = ' '.join([country] * 2)  # 2x weight (NEW!)
    
    # Combine all features with spacing
    combined = (
        f"{genre_weighted} "
        f"{description_weighted} "
        f"{director_weighted} "
        f"{cast_weighted} "
        f"{country_weighted} "
        f"{rating} "
        f"{content_type}"
    )
    
    return combined.strip()

print("🔧 Creating ULTIMATE enhanced features...")
print("This will take 30-60 seconds...\n")

df['ultimate_features'] = df.apply(create_ultimate_features, axis=1)

print("✅ Ultimate features created!")
print(f"\n📝 Sample ultimate feature (first 300 chars):")
print("-" * 70)
print(df['ultimate_features'].iloc[0][:300] + "...")
print("-" * 70)

# Check feature length distribution
feature_lengths = df['ultimate_features'].str.len()
print(f"\n📊 Feature Length Stats:")
print(f"   Average: {feature_lengths.mean():.0f} characters")
print(f"   Median:  {feature_lengths.median():.0f} characters")
print(f"   Min:     {feature_lengths.min():.0f} characters")
print(f"   Max:     {feature_lengths.max():.0f} characters")

## Step 4: MAXIMUM TF-IDF Optimization

### Ultimate Optimizations:
- **max_features=5000** (up from 3000; keeps the 5,000 most frequent terms in the corpus)
- **ngram_range=(1,3)** (unigrams, bigrams, and trigrams, so 3-word phrases become features)
- **min_df=2** (drop terms that appear in fewer than 2 documents)
- **max_df=0.75** (drop terms that appear in more than 75% of documents; previously 0.8)
- **sublinear_tf=True** (logarithmic term frequency scaling: tf becomes 1 + log(tf))
- **smooth_idf=True** (adds one to document frequencies, which prevents zero divisions)
- **use_idf=True** (apply inverse document frequency)

In [None]:
print("🚀 Creating MAXIMUM OPTIMIZED TF-IDF matrix...")
print("This will take 2-3 minutes...")
print("="*70)

# Initialize ULTIMATE TF-IDF vectorizer
tfidf = TfidfVectorizer(
    max_features=5000,          # 🔥 INCREASED! 3000 → 5000
    stop_words='english',       # Remove common English words
    ngram_range=(1, 3),         # 🔥 TRIGRAMS! (1,2) → (1,3)
    min_df=2,                   # Term must appear in ≥2 documents
    max_df=0.75,                # 🔥 MORE AGGRESSIVE! 0.8 → 0.75
    sublinear_tf=True,          # Use log scaling for TF
    smooth_idf=True,            # Smooth IDF weights
    use_idf=True,               # Use inverse document frequency
    norm='l2',                  # L2 normalization
    strip_accents='unicode'     # Remove accents
)

# Fit and transform
print("\n⏳ Fitting TF-IDF vectorizer...")
tfidf_matrix = tfidf.fit_transform(df['ultimate_features'])

print(f"\n✅ TF-IDF matrix created!")
print("="*70)
print(f"   Shape:            {tfidf_matrix.shape}")
print(f"   Total features:   {len(tfidf.get_feature_names_out()):,}")
print(f"   Sparsity:         {(1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1]))*100:.2f}%")
print(f"   Non-zero elements: {tfidf_matrix.nnz:,}")
# A CSR matrix stores the values plus two index arrays; count all three
sparse_bytes = tfidf_matrix.data.nbytes + tfidf_matrix.indices.nbytes + tfidf_matrix.indptr.nbytes
print(f"   Memory usage:     {sparse_bytes / (1024**2):.2f} MB")

# Show sample features
feature_names = tfidf.get_feature_names_out()
print(f"\n📝 Sample extracted features (first 20):")
print("   ", list(feature_names[:20]))
print(f"\n📝 Sample trigrams (if any):")
trigrams = [f for f in feature_names if len(f.split()) == 3][:10]
print("   ", trigrams if trigrams else "[none in vocabulary]")

## Step 5: Compute Cosine Similarity Matrix
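
The dense similarity matrix grows quadratically: at roughly 8,000 titles and 8 bytes per float64 entry it needs about 8,000 × 8,000 × 8 = 512 million bytes (around 488 MB). If that is too much, similarities can be computed one row at a time instead. The sketch below assumes the TF-IDF rows are L2-normalized (as with `norm='l2'`), in which case a plain dot product via `linear_kernel` equals cosine similarity. `top_k_similar` is a hypothetical helper, not part of this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_similar(matrix, idx, k=10):
    """Return (indices, scores) of the k rows most similar to row `idx`.

    Assumes rows are L2-normalized, so the dot product is the cosine similarity.
    """
    scores = linear_kernel(matrix[idx], matrix).ravel()
    scores[idx] = -np.inf                      # never return the query row itself
    k = min(k, len(scores) - 1)
    if k <= 0:
        return np.array([], dtype=int), np.array([])
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    top = top[np.argsort(-scores[top])]        # sort those k by descending score
    return top, scores[top]

# Toy corpus standing in for the real feature strings.
toy_docs = [
    "crime drama detective",
    "crime drama lawyer",
    "romantic comedy wedding",
    "detective crime thriller",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)
neighbours, neighbour_scores = top_k_similar(toy_matrix, idx=0, k=2)
print(neighbours, neighbour_scores)
```

This trades a little time per query for not holding the full N × N matrix in memory.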

In [None]:
print("\n🔄 Computing OPTIMIZED cosine similarity matrix...")
print("This may take 2-3 minutes for ~8000 titles...")
print("="*70)

import time
start_time = time.time()

# Compute similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

elapsed_time = time.time() - start_time

print(f"\n✅ Similarity matrix computed in {elapsed_time:.1f} seconds!")
print("="*70)
print(f"   Shape:           {cosine_sim.shape}")
print(f"   Total pairs:     {(cosine_sim.shape[0] * cosine_sim.shape[1]):,}")
print(f"   Memory usage:    {cosine_sim.nbytes / (1024**2):.2f} MB")

# Analyze similarity distribution (excluding diagonal)
mask = np.ones(cosine_sim.shape, dtype=bool)
np.fill_diagonal(mask, 0)
off_diagonal = cosine_sim[mask]

print(f"\n📊 Similarity Statistics (excluding self-similarity):")
print(f"   Min:             {off_diagonal.min():.4f}")
print(f"   Max:             {off_diagonal.max():.4f}")
print(f"   Mean:            {off_diagonal.mean():.4f} ({off_diagonal.mean()*100:.2f}%)")
print(f"   Median:          {np.median(off_diagonal):.4f} ({np.median(off_diagonal)*100:.2f}%)")
print(f"   Std Dev:         {off_diagonal.std():.4f}")

# Percentiles
print(f"\n📊 Percentiles:")
for p in [25, 50, 75, 90, 95, 99]:
    val = np.percentile(off_diagonal, p)
    print(f"   {p}th percentile: {val:.4f} ({val*100:.2f}%)")

# Distribution
print(f"\n📊 Similarity Distribution:")
print(f"   ≥80%:  {(off_diagonal >= 0.8).sum():,} pairs ({(off_diagonal >= 0.8).sum()/len(off_diagonal)*100:.2f}%)")
print(f"   60-80%: {((off_diagonal >= 0.6) & (off_diagonal < 0.8)).sum():,} pairs ({((off_diagonal >= 0.6) & (off_diagonal < 0.8)).sum()/len(off_diagonal)*100:.2f}%)")
print(f"   40-60%: {((off_diagonal >= 0.4) & (off_diagonal < 0.6)).sum():,} pairs ({((off_diagonal >= 0.4) & (off_diagonal < 0.6)).sum()/len(off_diagonal)*100:.2f}%)")
print(f"   20-40%: {((off_diagonal >= 0.2) & (off_diagonal < 0.4)).sum():,} pairs ({((off_diagonal >= 0.2) & (off_diagonal < 0.4)).sum()/len(off_diagonal)*100:.2f}%)")
print(f"   <20%:  {(off_diagonal < 0.2).sum():,} pairs ({(off_diagonal < 0.2).sum()/len(off_diagonal)*100:.2f}%)")

## Step 6: Create Index Mappings
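
A title-to-index dictionary can hold only one row per title, so when the dataset contains the same title twice (for example a movie and a TV show), one of the rows silently becomes unreachable. The sketch below, on a made-up three-row frame, shows one way to spot duplicates and to keep the first occurrence explicitly. Whether first or last is the right choice for the real data is an assumption to verify.

```python
import pandas as pd

# Made-up data with one repeated title.
toy = pd.DataFrame({
    "title": ["Love", "Zoo", "Love"],
    "type": ["Movie", "TV Show", "TV Show"],
})

# Report titles that occur more than once.
duplicates = toy.loc[toy["title"].duplicated(keep=False), "title"].unique().tolist()
print("Duplicate titles:", duplicates)

# Keep the first row for each title when building the lookup.
first_rows = toy.drop_duplicates(subset="title", keep="first")
title_lookup = dict(zip(first_rows["title"], first_rows.index))
print(title_lookup)
```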

In [None]:
# Create bidirectional mappings
title_to_index = pd.Series(df.index, index=df['title']).to_dict()
index_to_title = pd.Series(df['title'], index=df.index).to_dict()

print(f"✅ Index mappings created!")
print(f"   Total titles indexed: {len(title_to_index):,}")
print(f"   Sample titles: {list(df['title'].head(3).values)}")

## Step 7: Enhanced Recommendation Function
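
When a title is not found, suggestions based on the first word of the query miss simple typos and case differences. As an alternative, a minimal sketch using the standard library's `difflib.get_close_matches` is shown below. `suggest_titles` is a hypothetical helper and the 0.6 cutoff is just the library default, not a tuned value.

```python
from difflib import get_close_matches

def suggest_titles(query, titles, n=5, cutoff=0.6):
    """Return up to n known titles that closely match `query`, ignoring case."""
    by_lower = {t.lower(): t for t in titles}
    matches = get_close_matches(query.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [by_lower[m] for m in matches]

known = ["Stranger Things", "Breaking Bad", "Friends"]
print(suggest_titles("stranger thing", known))
```

In the notebook, `titles` could be the keys of `title_to_index`.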

In [None]:
def get_recommendations(title, n=10, min_similarity=0.0):
    """
    Get top N recommendations for a given title
    
    Parameters:
    -----------
    title : str
        Title of the movie/show
    n : int
        Number of recommendations to return
    min_similarity : float
        Minimum similarity threshold (0.0 to 1.0)
    
    Returns:
    --------
    DataFrame with recommendations and similarity scores
    """
    if title not in title_to_index:
        print(f"❌ '{title}' not found in database!")
        print(f"\n💡 Suggestions:")
        # Find titles containing the first word of the query (literal match, not regex)
        words = title.split()
        if words:
            similar_titles = df[df['title'].str.contains(words[0], case=False, na=False, regex=False)]['title'].head(5)
            for t in similar_titles:
                print(f"   • {t}")
        return None
    
    # Get index
    idx = title_to_index[title]
    
    # Get similarity scores for this title
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort by similarity (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Exclude the title itself by index (it is not guaranteed to sort first) and filter by min_similarity
    sim_scores = [(i, score) for i, score in sim_scores if i != idx and score >= min_similarity]
    
    # Get top N
    sim_scores = sim_scores[:n]
    
    if len(sim_scores) == 0:
        print(f"⚠️  No recommendations found with similarity ≥ {min_similarity}")
        return None
    
    # Get movie indices
    indices = [i[0] for i in sim_scores]
    
    # Create results DataFrame
    results = df.iloc[indices][['title', 'type', 'release_year', 'rating', 'listed_in']].copy()
    results['similarity_score'] = [score[1] for score in sim_scores]
    results['similarity_pct'] = results['similarity_score'] * 100
    
    # Add quality category
    results['quality'] = results['similarity_pct'].apply(
        lambda x: '🌟 Excellent' if x >= 80 else
                  '✨ Great' if x >= 60 else
                  '👍 Good' if x >= 40 else
                  '⚠️  Fair'
    )
    
    return results

print("✅ Enhanced recommendation function ready!")

## Step 8: Comprehensive Testing

In [None]:
# Test with diverse titles
test_titles = [
    'Stranger Things',
    'Breaking Bad', 
    'The Dark Knight',
    'Friends',
    'Inception'
]

print("🧪 TESTING OPTIMIZED RECOMMENDATIONS")
print("="*70)

for test_title in test_titles:
    if test_title in title_to_index:
        print(f"\n📺 Top 5 Recommendations for: {test_title}")
        print("-" * 70)
        
        recs = get_recommendations(test_title, n=5)
        
        if recs is not None:
            for i, (idx, row) in enumerate(recs.iterrows(), 1):
                print(f"\n  {i}. {row['title']}")
                print(f"     {row['quality']} - {row['similarity_pct']:.1f}% match")
                print(f"     {row['type']} • {row['release_year']} • {row['rating']}")
                print(f"     Genres: {row['listed_in'][:60]}...")
        print()
    else:
        print(f"\n⚠️  '{test_title}' not found in database")

## Step 9: MODEL PERFORMANCE ANALYSIS

### This is the critical step that shows improvement!
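
The analysis sorts each top-10 score into one of four bands: poor (< 0.40), good (0.40 to < 0.60), great (0.60 to < 0.80), and excellent (≥ 0.80). A compact way to compute the same shares is sketched below with `np.digitize`. `summarize_scores` is a hypothetical helper and the input scores are made up.

```python
import numpy as np

def summarize_scores(scores):
    """Return the share (in %) of scores that fall in each quality band."""
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0:
        return {}                              # nothing to summarize
    edges = [0.4, 0.6, 0.8]                    # band boundaries used in this notebook
    labels = ["poor", "good", "great", "excellent"]
    band = np.digitize(scores, edges)          # 0 = below 0.4, ..., 3 = 0.8 and above
    counts = np.bincount(band, minlength=len(labels))
    return {label: 100.0 * c / scores.size for label, c in zip(labels, counts)}

summary = summarize_scores([0.10, 0.30, 0.45, 0.65, 0.85])
print(summary)
```

Returning an empty dict for empty input avoids the division by zero that occurs when no sampled title yields recommendations.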

In [None]:
print("\n" + "="*70)
print("📊 MODEL PERFORMANCE ANALYSIS")
print("="*70)

# Sample 100 random titles and get their top-10 recommendations
print("\n⏳ Analyzing 100 sample titles (this may take 30 seconds)...\n")

sample_titles = df['title'].sample(min(100, len(df)), random_state=42)
all_top_scores = []

for title in sample_titles:
    recs = get_recommendations(title, n=10)
    if recs is not None and len(recs) > 0:
        all_top_scores.extend(recs['similarity_score'].tolist())

all_top_scores = np.array(all_top_scores)

print("✅ Analysis complete!\n")
print("="*70)

# Overall statistics
print(f"✨ Top-10 Recommendation Scores (100 sample titles):")
print(f"   Average:  {all_top_scores.mean():.3f} ({all_top_scores.mean()*100:.1f}%)")
print(f"   Median:   {np.median(all_top_scores):.3f} ({np.median(all_top_scores)*100:.1f}%)")
print(f"   Std Dev:  {all_top_scores.std():.3f}")
print(f"   Max:      {all_top_scores.max():.3f} ({all_top_scores.max()*100:.1f}%)")
print(f"   Min:      {all_top_scores.min():.3f} ({all_top_scores.min()*100:.1f}%)")

# Quality distribution
print(f"\n📈 Quality Distribution:")
excellent = (all_top_scores >= 0.80).sum()
great = ((all_top_scores >= 0.60) & (all_top_scores < 0.80)).sum()
good = ((all_top_scores >= 0.40) & (all_top_scores < 0.60)).sum()
poor = (all_top_scores < 0.40).sum()
total = len(all_top_scores)

print(f"   Excellent (≥80%): {excellent:4d} ({excellent/total*100:5.1f}%)")
print(f"   Great (60-79%):   {great:4d} ({great/total*100:5.1f}%)")
print(f"   Good (40-59%):    {good:4d} ({good/total*100:5.1f}%)")
print(f"   Poor (<40%):      {poor:4d} ({poor/total*100:5.1f}%)")

# Overall assessment
print(f"\n" + "="*70)
avg_pct = all_top_scores.mean() * 100
poor_pct = poor / total * 100

if avg_pct >= 50.0:
    print(f"\n✅ ✅ ✅ EXCELLENT! Average similarity is {avg_pct:.1f}% (Target: >50%)")
elif avg_pct >= 45.0:
    print(f"\n✅ ✅ GREAT! Average similarity is {avg_pct:.1f}% (Close to target)")
elif avg_pct >= 40.0:
    print(f"\n✅ GOOD! Average similarity is {avg_pct:.1f}% (Improved from 32.4%)")
else:
    print(f"\n⚠️  Still needs work. Average similarity is {avg_pct:.1f}%")

if poor_pct <= 30.0:
    print(f"✅ ✅ ✅ EXCELLENT! Only {poor_pct:.1f}% poor recommendations (Target: <30%)")
elif poor_pct <= 40.0:
    print(f"✅ ✅ GREAT! {poor_pct:.1f}% poor recommendations (Better than 77.9%)")
elif poor_pct <= 50.0:
    print(f"✅ GOOD! {poor_pct:.1f}% poor recommendations (Significant improvement)")
else:
    print(f"⚠️  {poor_pct:.1f}% poor recommendations (needs more tuning)")

print("="*70)

# Compare to baseline
print(f"\n📊 Comparison to Baseline (Original Model):")
print(f"   Average Score: 32.4% → {avg_pct:.1f}% ({avg_pct - 32.4:+.1f}%)")
print(f"   Poor (<40%):   77.9% → {poor_pct:.1f}% ({poor_pct - 77.9:+.1f}%)")
print(f"   Good (40-59%): 20.3% → {good/total*100:.1f}% ({good/total*100 - 20.3:+.1f}%)")
print(f"   Great (60-79%): 1.4% → {great/total*100:.1f}% ({great/total*100 - 1.4:+.1f}%)")
print(f"   Excellent (≥80%): 0.4% → {excellent/total*100:.1f}% ({excellent/total*100 - 0.4:+.1f}%)")

## Step 10: Save Optimized Model
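
This step writes two pickles: a full one that includes `cosine_sim`, and a compact one that omits it, so whatever loads the compact file has to compute similarities from `tfidf_matrix` on demand. A minimal loading sketch that fails early when expected keys are missing is shown below. `load_system` and `REQUIRED_KEYS` are hypothetical names, and the round trip uses placeholder values rather than the real objects.

```python
import os
import pickle
import tempfile

# Keys present in both the full and the compact dictionary saved in this step.
REQUIRED_KEYS = {"df", "tfidf_matrix", "tfidf_vectorizer", "title_to_index"}

def load_system(path, required=REQUIRED_KEYS):
    """Load a pickled system dict and raise if expected keys are missing.

    Only unpickle files you created yourself: pickle can run arbitrary code.
    """
    with open(path, "rb") as f:
        data = pickle.load(f)
    missing = set(required) - set(data)
    if missing:
        raise KeyError(f"Pickle is missing keys: {sorted(missing)}")
    return data

# Round trip with placeholder values standing in for the real objects.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump({key: None for key in REQUIRED_KEYS}, f)
    loaded = load_system(demo_path)
print(sorted(loaded))
```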

In [None]:
import os

print("\n💾 SAVING OPTIMIZED MODEL")
print("="*70)

# Prepare system data with all components
system_data = {
    'df': df,
    'cosine_sim': cosine_sim,
    'tfidf_matrix': tfidf_matrix,
    'tfidf_vectorizer': tfidf,
    'title_to_index': title_to_index,
    'index_to_title': index_to_title,
    'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'version': 'optimized_v3_ultimate',
    'performance': {
        'avg_similarity': float(all_top_scores.mean()),
        'median_similarity': float(np.median(all_top_scores)),
        'poor_percentage': float(poor / total * 100),
        'excellent_percentage': float(excellent / total * 100)
    },
    'optimizations': [
        'Max features: 5000 (up from 3000)',
        'Trigrams: (1,3) ngrams',
        'Aggressive weighting: Genre 4x, Description 3x',
        'Enhanced preprocessing with multi-pass cleaning',
        'Country and rating features added',
        'Optimized min_df and max_df parameters'
    ]
}

# Save the complete recommendation system
filename = 'netflix_recommendation_system.pkl'
with open(filename, 'wb') as f:
    pickle.dump(system_data, f)

file_size_mb = os.path.getsize(filename) / (1024**2)
print(f"✅ Model saved as '{filename}'")
print(f"   File size: {file_size_mb:.2f} MB")

# Save database
df.to_csv('netflix_content_database.csv', index=False)
print(f"✅ Database saved as 'netflix_content_database.csv'")

# Create a smaller version without similarity matrix (for deployment)
compact_data = {
    'df': df,
    'tfidf_matrix': tfidf_matrix,
    'tfidf_vectorizer': tfidf,
    'title_to_index': title_to_index,
    'index_to_title': index_to_title,
    'version': 'compact_v3'
}

compact_filename = 'netflix_recommender_compact.pkl'
with open(compact_filename, 'wb') as f:
    pickle.dump(compact_data, f)

compact_size_mb = os.path.getsize(compact_filename) / (1024**2)
print(f"✅ Compact model saved as '{compact_filename}'")
print(f"   File size: {compact_size_mb:.2f} MB (smaller for deployment)")

print("\n" + "="*70)
print("🎉 ULTIMATE OPTIMIZED MODEL COMPLETE!")
print("="*70)

print("\n✨ Revolutionary Improvements:")
print("   • Genre weighted 4x (maximum importance!)")
print("   • Description weighted 3x (critical for content)")
print("   • Country & rating features added (NEW!)")
print("   • Features increased: 3000 → 5000 (+67%)")
print("   • Trigrams added for 3-word phrase matching")
print("   • Multi-pass advanced text cleaning")
print("   • Optimized TF-IDF parameters")

print(f"\n🎯 Achieved Results:")
print(f"   • Average similarity: {all_top_scores.mean()*100:.1f}% (Target: >50%)")
print(f"   • Poor recommendations: {poor/total*100:.1f}% (Target: <30%)")
print(f"   • Excellent recommendations: {excellent/total*100:.1f}%")

print(f"\n📈 Performance vs Original:")
print(f"   • Average: {all_top_scores.mean()*100 - 32.4:+.1f} points vs the 32.4% baseline")
print(f"   • Poor recommendations: {poor/total*100 - 77.9:+.1f} points vs the 77.9% baseline")

print("\n📱 Next Steps:")
print("   1. Restart your Streamlit app")
print("   2. Test with various movie titles")
print("   3. Enjoy much better recommendations! 🎬")
print("\n" + "="*70)

## Step 11: Feature Importance Analysis (Optional)

In [None]:
print("\n📊 FEATURE IMPORTANCE ANALYSIS")
print("="*70)

# Get feature names and their frequencies
feature_names = tfidf.get_feature_names_out()
feature_scores = np.array(tfidf_matrix.sum(axis=0)).flatten()

# Create DataFrame and sort
feature_df = pd.DataFrame({
    'feature': feature_names,
    'score': feature_scores
}).sort_values('score', ascending=False)

print(f"\n🔝 Top 20 Most Important Features:\n")
for i, row in feature_df.head(20).iterrows():
    ngram_size = len(row['feature'].split())
    ngram_type = f"{ngram_size}-gram" if ngram_size > 1 else "unigram"
    print(f"   {row['feature']:30s} - Score: {row['score']:.2f} ({ngram_type})")

# Count ngram types
unigrams = sum(1 for f in feature_names if len(f.split()) == 1)
bigrams = sum(1 for f in feature_names if len(f.split()) == 2)
trigrams = sum(1 for f in feature_names if len(f.split()) == 3)

print(f"\n📊 N-gram Distribution:")
print(f"   Unigrams:  {unigrams:,} ({unigrams/len(feature_names)*100:.1f}%)")
print(f"   Bigrams:   {bigrams:,} ({bigrams/len(feature_names)*100:.1f}%)")
print(f"   Trigrams:  {trigrams:,} ({trigrams/len(feature_names)*100:.1f}%)")

print("\n✅ Analysis complete!")