## **Building A Complex Recommender System**

We are building a content-based movie recommender: each movie is described by its own attributes, and the system suggests the titles whose attributes are closest to those of a movie the user names. The implementation is a Scikit-learn pipeline that preprocesses the features and passes them to a nearest-neighbors model, wrapped in a `recommend_movie` function. The features used in this notebook are main genre, top-billed cast, director, release year, popularity, vote average, and vote count. Because recommendations come from item-to-item comparison, they need no user rating history, only the title of a movie the user already likes.

### Steps Needed for Coding the Recommender System:
1.   Load and Merge Datasets
2.   Data Cleaning 
3.   Feature Engineering 
4.   EDA 
5.   Build the Recommender System using KNN
6.   Model Validation and Evaluation
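
To make "closest attributes" concrete, here is a minimal sketch of content-based matching on three made-up movies, using cosine distance computed by hand. The titles, the feature vectors, and the helper names `cosine_distance` and `nearest` are hypothetical; the notebook itself builds its vectors with a Scikit-learn pipeline rather than by hand.

```python
import math

# Hypothetical feature vectors: [is_action, is_romance, scaled_release_year]
toy_movies = {
    "Action A": [1.0, 0.0, 0.9],
    "Action B": [1.0, 0.0, 0.7],
    "Romance C": [0.0, 1.0, 0.8],
}

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest(title, k=1):
    """Return the k titles closest to `title`, excluding the title itself."""
    query = toy_movies[title]
    others = [(cosine_distance(query, vec), name)
              for name, vec in toy_movies.items() if name != title]
    return [name for _, name in sorted(others)[:k]]

print(nearest("Action A"))
```

The two action films point in nearly the same direction, so they end up closest to each other even though their release-year values differ.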

## Imports

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder


import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import NearestNeighbors


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Loading & Merging Datasets

In [None]:
# Load the movies dataset (it is merged with the credits dataset further down)

movies = pd.read_csv("content/5000_movies.csv")

In [None]:
movies.columns

In [None]:
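# Load the credits dataset. The filename below is an assumption: point it at
# your credits file, not at the movies file loaded above.
credits = pd.read_csv("content/5000_credits.csv")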

In [None]:
credits.columns

In [None]:
# merge them according to title
movies_all = pd.merge(movies, credits, on='title')
movies_all.head()
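
Merging on `title` can silently multiply rows when two different films share a title. The sketch below shows the effect on toy data and one possible alternative, merging on an ID column. The frames and the `id` / `movie_id` column names are assumptions for illustration; check which key columns your files actually contain.

```python
import pandas as pd

# Hypothetical toy frames; two different films share the title "Remake"
toy_movies = pd.DataFrame({"id": [1, 2, 3],
                           "title": ["Original", "Remake", "Remake"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Original", "Remake", "Remake"],
                            "cast": ["a", "b", "c"]})

# Merging on title pairs every "Remake" row with every "Remake" row
by_title = pd.merge(toy_movies, toy_credits, on="title")

# Merging on the ID keeps one row per film; validate raises if a key repeats
by_id = pd.merge(toy_movies, toy_credits.drop(columns="title"),
                 left_on="id", right_on="movie_id", validate="one_to_one")

print(len(by_title), len(by_id))
```

Comparing `len(movies_all)` with `len(movies)` after the merge is a quick way to see whether duplicate titles inflated the result.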

## Data Cleaning 

In [None]:
# Remove rows where 'release_date' is missing since it's crucial for feature engineering
movies_all = movies_all.dropna(subset=['release_date'])

# Drop rows where the categorical features used later (genres, cast, crew) are missing
movies_all = movies_all.dropna(subset=['genres', 'cast', 'crew'])

# Drop columns that are not useful for recommendation
movies_all = movies_all.drop(columns=['homepage', 'status', 'tagline', 'overview'])

In [None]:
# check the dataframe
movies_all.info()
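
Before dropping rows, it can help to count how many values are missing per column so the cost of each `dropna` is visible. This is a minimal sketch on a hypothetical three-row frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy frame with one missing release_date and one missing genres value
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_date": ["2001-05-01", None, "2010-07-16"],
    "genres": ["[]", "[]", None],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Drop rows that lack any of the required columns
cleaned = toy.dropna(subset=["release_date", "genres"])

print(missing_per_column)
print(f"Rows kept: {len(cleaned)} of {len(toy)}")
```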

## Feature Engineering

In [None]:
import ast

# Extracting year from release_date
movies_all['release_year'] = pd.to_datetime(movies_all['release_date']).dt.year

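# Extracting the main genre from the genres column
def get_main_genre(genres):
    # genres is a string holding a list of dicts; return the first genre name
    try:
        genres_list = ast.literal_eval(genres)
    except (ValueError, SyntaxError, TypeError):
        return None
    if genres_list:
        return genres_list[0]['name']
    return None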

movies_all['main_genre'] = movies_all['genres'].apply(get_main_genre)

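# Preprocess Cast and Crew
def get_top_cast(cast, top_n=3):
    # Return the names of the first top_n cast members, or [] if unparsable
    try:
        cast = ast.literal_eval(cast)
        return [member['name'] for member in cast[:top_n]]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []

def get_director(crew):
    # Return the first crew member whose job is 'Director', or '' if none
    try:
        crew = ast.literal_eval(crew)
        for member in crew:
            if member['job'] == 'Director':
                return member['name']
        return ''
    except (ValueError, SyntaxError, TypeError, KeyError):
        return ''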
# Add new features (columns) to dataframe
movies_all['top_cast'] = movies_all['cast'].apply(get_top_cast)
movies_all['director'] = movies_all['crew'].apply(get_director)
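
The helpers above assume each of `genres`, `cast`, and `crew` is stored as a string containing a list of dictionaries. The sketch below shows what `ast.literal_eval` does with such a string. The sample value is made up for illustration and is not taken from the dataset.

```python
import ast

# Hypothetical sample in the list-of-dicts string format the helpers assume
sample_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns the string into real Python lists and dicts
parsed = ast.literal_eval(sample_genres)
main_genre = parsed[0]["name"] if parsed else None
print(main_genre)

# An empty list parses fine but has no first element to read
empty = ast.literal_eval("[]")
print(empty)

# Malformed input raises an exception instead of returning a value
try:
    ast.literal_eval("[{'id': 28, 'name': ")
    error_name = None
except (ValueError, SyntaxError) as exc:
    error_name = type(exc).__name__
print(error_name)
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it is a safer choice for parsing data columns.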

In [None]:

# Flatten the 'top_cast' list for preprocessing
movies_all['top_cast'] = movies_all['top_cast'].apply(lambda x: ' '.join(x))

# Update numerical and categorical columns
numerical_cols = ['popularity', 'vote_average', 'vote_count', 'release_year']
categorical_cols = ['main_genre', 'top_cast', 'director']

# Ensure correct data types for numerical columns
for col in numerical_cols:
    movies_all[col] = pd.to_numeric(movies_all[col], errors='coerce')
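
With `errors='coerce'`, values that cannot be parsed as numbers become `NaN` instead of raising, which means bad entries turn into missing values that still need handling afterwards. A small sketch with made-up input:

```python
import pandas as pd

# Hypothetical raw column mixing numeric strings with an unparsable entry
raw = pd.Series(["7.5", "not a number", "120"])

# Unparsable entries become NaN instead of raising an error
coerced = pd.to_numeric(raw, errors="coerce")

print(coerced)
print(f"Missing after coercion: {coerced.isna().sum()}")
```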


## Preprocessing - Defining and Creating Pipelines

In [None]:
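# Select the feature columns used for similarity
movies_features = movies_all[numerical_cols + categorical_cols]

# Preprocessor (an assumed, typical setup; adjust the strategies as needed):
# impute and scale numerical columns, impute and one-hot encode categorical ones
preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numerical_cols),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols)
])

# Define the pipeline with preprocessing and KNN
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('knn', NearestNeighbors(n_neighbors=5, metric='cosine'))
])

# Fit the pipeline directly to the original features
pipeline.fit(movies_features)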
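
To see what a preprocessing-plus-KNN pipeline produces, the sketch below fits the same kind of pipeline on a hypothetical four-row frame and inspects the encoded matrix. The data and the `toy_` names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical feature
toy = pd.DataFrame({
    "vote_average": [8.0, 7.9, 6.1, 6.0],
    "main_genre": ["Action", "Action", "Drama", "Drama"],
})

# Scale the numerical column and one-hot encode the categorical one
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["vote_average"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["main_genre"]),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("knn", NearestNeighbors(n_neighbors=2, metric="cosine")),
])
toy_pipeline.fit(toy)

# One scaled column plus one column per distinct genre
encoded = toy_pipeline.named_steps["preprocessor"].transform(toy)
print(encoded.shape)

# Neighbors of the first row, as positions in the fitted data
distances, indices = toy_pipeline.named_steps["knn"].kneighbors(encoded[:1])
print(indices)
```

Note that `kneighbors` returns row positions in the fitted data, not index labels of the DataFrame.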

## EDA 

In [None]:
# Numerical columns distribution
for col in numerical_cols:
    plt.figure(figsize=(10, 4))
    sns.histplot(movies_all[col].dropna(), kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
# Plot the popularity vs vote average
plt.figure(figsize=(10, 6))
sns.scatterplot(x='popularity', y='vote_average', data=movies_all)
plt.title('Popularity vs Vote Average')
plt.show()
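
The plots can be complemented with numeric summaries, such as pairwise correlations between the numerical columns and counts per genre. This is a sketch on a hypothetical four-row frame; on the notebook's data the same calls would be applied to `movies_all`.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned movie table
toy = pd.DataFrame({
    "popularity": [150.0, 40.0, 12.0, 95.0],
    "vote_average": [7.2, 6.8, 5.9, 7.9],
    "vote_count": [11800, 2400, 310, 7600],
    "main_genre": ["Action", "Drama", "Drama", "Action"],
})

# Pearson correlation between each pair of numerical columns
correlations = toy[["popularity", "vote_average", "vote_count"]].corr()

# Number of movies per main genre
genre_counts = toy["main_genre"].value_counts()

print(correlations.round(2))
print(genre_counts)
```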

## Creating a Recommendation Function

In [None]:
# Function to get similar movies
def recommend_movie(movie_title, n_neighbors=5):
    # Handle case where movie is not found
    if movie_title not in movies_all['title'].values:
        return f"Movie '{movie_title}' not found in the dataset."

    # Get the index label of the first movie with this title
    movie_idx = movies_all[movies_all['title'] == movie_title].index[0]

    # Select the row by label: rows were dropped during cleaning, so labels
    # no longer match positions. Double brackets keep a one-row DataFrame.
    movie_data = movies_features.loc[[movie_idx]]

    # Transform the input movie data
    movie_data_transformed = pipeline.named_steps['preprocessor'].transform(movie_data)

    # Find similar movies; request one extra neighbor because the query
    # movie is normally its own nearest neighbor
    distances, indices = pipeline.named_steps['knn'].kneighbors(
        movie_data_transformed, n_neighbors=n_neighbors + 1)
    similar_movie_indices = indices.flatten()

    # kneighbors returns positions in the fitted data, so use iloc here
    similar_movies = movies_all.iloc[similar_movie_indices]['title']

    # Filter out the input movie and keep at most n_neighbors titles
    similar_movies = similar_movies[similar_movies != movie_title].head(n_neighbors)

    return similar_movies

# Test the recommender system
print(recommend_movie('The Matrix'))
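
It can be useful to return the distances together with the titles, so that weak matches are visible. The sketch below does this on hypothetical data with a standalone `NearestNeighbors` model. The function name `recommend_with_scores`, the titles, and the feature values are assumptions and not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical titles and already-encoded feature vectors
titles = pd.Series(["Film A", "Film B", "Film C", "Film D"])
features = np.array([
    [1.0, 0.0, 0.9],
    [1.0, 0.0, 0.7],
    [0.0, 1.0, 0.8],
    [0.9, 0.1, 0.9],
])

knn = NearestNeighbors(metric="cosine").fit(features)

def recommend_with_scores(title, n_neighbors=2):
    """Return the n_neighbors closest titles with their cosine distances."""
    position = int(np.flatnonzero(titles.values == title)[0])
    # Ask for one extra neighbor because the query movie matches itself
    distances, indices = knn.kneighbors(features[[position]],
                                        n_neighbors=n_neighbors + 1)
    result = pd.DataFrame({
        "title": titles.iloc[indices[0]].values,
        "distance": distances[0],
    })
    return result[result["title"] != title].head(n_neighbors)

scored = recommend_with_scores("Film A")
print(scored)
```

A smaller cosine distance means a closer match, so the rows come back ordered from most to least similar.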

## Model Evaluation 

In [None]:
# Evaluate the model by manual inspection
test_movies = ['The Matrix', 'Titanic', 'Avatar']
for movie in test_movies:
    print(f"Recommendations for {movie}:")
    print(recommend_movie(movie))
    print("\n")

# Testing the recommender system with a movie name
movie_name = 'Inception'
print(f"Recommendations for {movie_name}:")
print(recommend_movie(movie_name))

In [None]:
movie_name = 'The Avengers'
print(f"Recommendations for {movie_name}:")
print(recommend_movie(movie_name))

In [None]:
movie_name = 'Sherlock Holmes'
print(f"Recommendations for {movie_name}:")
print(recommend_movie(movie_name))

In [None]:
movie_name = 'After'
print(f"Recommendations for {movie_name}:")
print(recommend_movie(movie_name))
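
Manual inspection can be supplemented with a rough numeric check. One possible proxy, offered here only as an assumption and not as an established metric for this system, is the share of recommendations whose main genre matches the query movie's. The function name `genre_agreement` and the results below are hypothetical.

```python
def genre_agreement(query_genre, recommended_genres):
    """Return the fraction of recommendations that share the query's main genre."""
    if not recommended_genres:
        return 0.0
    matches = sum(genre == query_genre for genre in recommended_genres)
    return matches / len(recommended_genres)

# Hypothetical results: query genre and the genres of its recommendations
toy_results = {
    "Query 1": ("Action", ["Action", "Action", "Thriller", "Action"]),
    "Query 2": ("Drama", ["Drama", "Romance"]),
}

# Score each query, then average across queries
scores = {title: genre_agreement(genre, recs)
          for title, (genre, recs) in toy_results.items()}
mean_score = sum(scores.values()) / len(scores)

print(scores)
print(mean_score)
```

Because main genre is itself one of the input features, a high score mostly confirms that the model uses its inputs. It is a sanity check, not evidence that users would like the recommendations.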