## **Building A Simple Recommender System**

For this project we will build an item-similarity based movie recommender system. It suggests movies based on how similarly movies were rated: two movies count as similar when the users who rated both tend to give them similar scores.

Because the data loaded below contains only user ratings and movie titles, this is item-based collaborative filtering rather than a content-based system. A content-based recommender would instead compare item attributes such as genre, actors, directors, plot keywords, release year, language, and runtime, and recommend items whose attributes are alike.

### Steps Needed for Coding the Recommender System:
1.   Import libraries
2.   Import our data from the selected Kaggle dataset
3.   Create dataframes that contain the parameters of interest
4.   Create visualizations
5.   Build the recommender system using pandas
6.   Add K-Nearest Neighbors (KNN)


## Imports

In [None]:
import numpy as np
import pandas as pd

## Loading & Merging Datasets

In [None]:
# ratings
columns_name = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('/content/sample_data/u.data', sep='\t', names=columns_name)

In [None]:
df.head()

In [None]:
# titles
movie_titles = pd.read_csv('/content/sample_data/Movie_Id_Titles')
movie_titles.head()

In [None]:
# merge the ratings with the movie titles on item_id
df = pd.merge(df, movie_titles, on='item_id')
df.head()
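
`pd.merge` performs an inner join by default, so any rating whose `item_id` has no matching title would be dropped silently. The sketch below shows one way to check for that on toy data. The frames `toy_ratings` and `toy_titles` and their values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-ins for the ratings and titles tables (hypothetical values).
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})
toy_titles = pd.DataFrame({
    "item_id": [10, 20],
    "title": ["Movie A", "Movie B"],
})

# how="left" keeps every rating, validate raises an error if an item_id
# appears more than once in the titles table, and indicator adds a
# "_merge" column recording where each row came from.
checked = pd.merge(
    toy_ratings, toy_titles, on="item_id",
    how="left", validate="many_to_one", indicator=True,
)

# Rows marked "left_only" are ratings that the default inner join would drop.
unmatched = checked[checked["_merge"] == "left_only"]
print(len(checked), "ratings,", len(unmatched), "without a title")
```

If `unmatched` is empty for the real data, the default inner join used above loses nothing.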

## EDA

In [None]:
# basic data analysis: plotting libraries and style
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

In [None]:
df.groupby('title')['rating'].mean().sort_values(ascending=False).head()

In [None]:
df.groupby('title')['rating'].count().sort_values(ascending=False).head()

In [None]:
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())

In [None]:
ratings.head()

## Creating DataFrames

In [None]:
# add the number of ratings per title to the ratings dataframe
# (assigning the Series aligns on the shared 'title' index)
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()
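
The mean rating and the rating count can also be computed in a single `groupby` pass. This is a minimal sketch on toy data, assuming the same column names as above; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy ratings table (hypothetical values).
toy = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie C", "Movie C", "Movie C"],
    "rating": [5, 3, 4, 2, 4, 3],
})

# Compute both aggregates at once, then rename the columns to match
# the names used in this notebook.
summary = (
    toy.groupby("title")["rating"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "rating", "count": "num of ratings"})
)
print(summary)
```

The result has the same shape as the `ratings` dataframe built above: one row per title with its average rating and number of ratings.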

## Visualizations

In [None]:
ratings['num of ratings'].hist(bins=70)

In [None]:
ratings['rating'].hist(bins=70)

In [None]:
sns.jointplot(x='rating', y='num of ratings', data=ratings, alpha=0.5)

In [None]:
# Build the user-item matrix: one row per user, one column per movie title
# Each cell holds the rating that user gave that movie (NaN if unrated)
moviemat = df.pivot_table(index='user_id', columns='title', values='rating')

In [None]:
# Preview the user-item matrix
moviemat.head()
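
Most cells of a user-item matrix are empty, because each user rates only a small share of the movies. The toy example below shows what `pivot_table` produces and how the share of missing cells could be measured. The data is hypothetical.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, title) pair (hypothetical values).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "title": ["A", "B", "A", "B", "C"],
    "rating": [5, 3, 4, 2, 5],
})

# Reshape to wide format: rows are users, columns are titles.
# Pairs that never occur in the data become NaN.
mat = toy.pivot_table(index="user_id", columns="title", values="rating")
print(mat)

# Fraction of user/title pairs with no rating.
missing_share = mat.isna().to_numpy().mean()
print(f"{missing_share:.0%} of cells are empty")
```

Running the same `isna()` check on `moviemat` gives a sense of how sparse the real matrix is, which matters for the correlation and KNN steps that follow.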

In [None]:
# sort the ratings dataframe by 'num of ratings' to see the most-rated movies
ratings.sort_values('num of ratings', ascending=False).head(10)

## Handling Null values and Joining Dataframes 

In [None]:
# We'll look at the user ratings for two of the most-rated movies: Star Wars and Fargo
starwars_user_ratings = moviemat['Star Wars (1977)']
fargo_user_ratings = moviemat['Fargo (1996)']

In [None]:
starwars_user_ratings.head()

In [None]:
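# Pearson correlation between every movie's ratings and the Star Wars ratings,
# computed for each movie over the users who rated both
similar_to_starwars = moviemat.corrwith(starwars_user_ratings)

In [None]:
# Same correlation, this time against the Fargo ratings
similar_to_fargo = moviemat.corrwith(fargo_user_ratings)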

In [None]:
# Convert the Series to a DataFrame and drop movies with no computable correlation (NaN)
corr_starwars = pd.DataFrame(similar_to_starwars, columns=['Correlation'])
corr_starwars.dropna(inplace=True)
corr_starwars.head()

In [None]:
corr_starwars = corr_starwars.join(ratings['num of ratings'])

In [None]:
corr_starwars.head()

In [None]:
corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation', ascending=False).head()
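
The `num of ratings` filter matters because `corrwith` computes each correlation only over the users who rated both movies. When that overlap is tiny, the correlation can be perfect by accident. The toy matrix below illustrates this; the titles and ratings are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: rows are users, columns are titles (hypothetical values).
mat = pd.DataFrame(
    {
        "Target": [5, 4, 1, 2, np.nan],
        "Popular": [4, 5, 2, 1, 3],
        "Obscure": [np.nan, np.nan, 1, 2, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Correlate every column with "Target"; NaN pairs are skipped per column.
corr = mat.corrwith(mat["Target"])

# Number of users each title shares with "Target".
overlap = mat[mat["Target"].notna()].notna().sum()

print(pd.DataFrame({"Correlation": corr, "shared users": overlap}))
```

Here "Obscure" gets a correlation of 1.0 from only two shared users, while "Popular" gets a more trustworthy 0.8 from four. Requiring a minimum number of ratings is a simple way to keep such accidental matches out of the top results.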

In [None]:
corr_fargo = pd.DataFrame(similar_to_fargo, columns=['Correlation'])
corr_fargo.dropna(inplace=True)
corr_fargo = corr_fargo.join(ratings['num of ratings'])

In [None]:
corr_fargo.head()

In [None]:
corr_fargo[corr_fargo['num of ratings']>100].sort_values('Correlation', ascending=False).head()
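
The Star Wars and Fargo cells repeat the same steps, so they could be wrapped in one function. The sketch below is one possible way to do that. The name `similar_titles` and its parameters are hypothetical, and unlike the cells above it also removes the query movie from its own results.

```python
import numpy as np
import pandas as pd

def similar_titles(moviemat, ratings, title, min_ratings=100, top_n=5):
    """Rank titles by rating correlation with `title` (hypothetical helper)."""
    # Correlate every movie with the query movie and drop undefined results.
    corr = moviemat.corrwith(moviemat[title]).dropna()
    corr = corr.rename("Correlation").to_frame()
    # Attach the rating counts and keep only sufficiently rated movies.
    corr = corr.join(ratings["num of ratings"])
    corr = corr[corr["num of ratings"] > min_ratings]
    # Assumption: the query movie itself is not a useful recommendation.
    corr = corr.drop(index=title, errors="ignore")
    return corr.sort_values("Correlation", ascending=False).head(top_n)

# Toy data to exercise the helper (hypothetical values).
toy_moviemat = pd.DataFrame(
    {
        "Target": [5, 4, 1, 2, np.nan],
        "Popular": [4, 5, 2, 1, 3],
        "Obscure": [np.nan, np.nan, 1, 2, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)
toy_ratings = pd.DataFrame({"num of ratings": toy_moviemat.notna().sum()})

result = similar_titles(toy_moviemat, toy_ratings, "Target", min_ratings=3)
print(result)
```

With the notebook's own data, a call such as `similar_titles(moviemat, ratings, 'Fargo (1996)')` would be expected to reproduce the Fargo cell above, minus the Fargo row itself.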

## Adding K-Nearest Neighbors (KNN)

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Reload the ratings and titles, this time naming the key column 'movie_id' in both files
ratings_df = pd.read_csv('/content/sample_data/u.data', sep='\t', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'])
movies_df = pd.read_csv('/content/sample_data/Movie_Id_Titles', sep=',', header=0, names=['movie_id', 'movie_title'])

In [None]:
print("Ratings Data:")
print(ratings_df.head())

In [None]:
print("\nMovies Data:")
print(movies_df.head())

In [None]:
ratings_movies_df = pd.merge(ratings_df, movies_df, on='movie_id')

print("MovieLens Data:")
print(ratings_movies_df.head())

In [None]:
# Import necessary libraries
from sklearn.neighbors import NearestNeighbors
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd


# Create the user-item matrix: one row per user, one column per movie_id, holding that user's rating (NaN if unrated)
user_item_matrix = ratings_movies_df.pivot_table(index='user_id', columns='movie_id', values='rating')

# Transpose to get the item-user matrix: one row per movie, one column per user
item_user_matrix = user_item_matrix.T

# Handle NaN values: SimpleImputer fills each missing entry with the mean of its column,
# which in the item-user matrix is the average rating given by that user
imputer = SimpleImputer(strategy='mean')
item_user_matrix_imputed = imputer.fit_transform(item_user_matrix)

# Convert the imputed matrix back to a DataFrame
item_user_matrix = pd.DataFrame(item_user_matrix_imputed, index=item_user_matrix.index, columns=item_user_matrix.columns)

# Initialize the KNN model:
# 'n_neighbors=5' means we look up the 5 nearest items by default.
# 'metric='cosine'' uses cosine distance (1 - cosine similarity), so a smaller distance means more similar items.
knn = NearestNeighbors(n_neighbors=5, metric='cosine')

# Fit the KNN model on the item-user matrix:
# This stores the item vectors so that nearest neighbors can be looked up from user ratings.
knn.fit(item_user_matrix)

# Return the IDs of the n_neighbors movies whose rating vectors are closest to the given movie
def get_similar_items(movie_id, n_neighbors=5):
    # Check that the movie_id exists in the item-user matrix index
    if movie_id not in item_user_matrix.index:
        return []  # Return an empty list if the movie_id is not found

    # Query with a one-row DataFrame, since kneighbors expects 2D input.
    # Ask for one extra neighbor because the movie itself comes back at distance 0.
    query = item_user_matrix.loc[[movie_id]]
    n_query = min(n_neighbors + 1, len(item_user_matrix))
    distances, indices = knn.kneighbors(query, n_neighbors=n_query)

    # Convert the row positions to movie IDs, dropping the queried movie itself
    similar_movie_ids = [item_user_matrix.index[i] for i in indices.flatten()]
    similar_movie_ids = [m for m in similar_movie_ids if m != movie_id]

    return similar_movie_ids[:n_neighbors]
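
To make the `metric='cosine'` setting concrete, the sketch below computes cosine distance by hand with NumPy on a toy item-user matrix. It is an illustration of the distance itself, not of scikit-learn's internals. The function `cosine_distances_to` and the ratings are hypothetical.

```python
import numpy as np

# Toy item-user matrix: rows are items, columns are users (hypothetical values).
items = np.array([
    [5.0, 4.0, 1.0],   # item 0
    [2.5, 2.0, 0.5],   # item 1: same rating pattern as item 0 at half the scale
    [1.0, 2.0, 5.0],   # item 2: roughly the opposite pattern
])

def cosine_distances_to(query, matrix):
    """Return 1 - cosine similarity between `query` and every row of `matrix`."""
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return 1.0 - dots / norms

query_index = 0
dist = cosine_distances_to(items[query_index], items)
print(np.round(dist, 4))

# Rank items by distance and skip the query item, which is at distance 0 from itself.
neighbors = [int(i) for i in np.argsort(dist) if i != query_index]
print(neighbors)
```

Two points carry over to the KNN model: cosine distance ignores the overall scale of the ratings, so items 0 and 1 are treated as identical, and an item is always among its own nearest neighbors unless it is filtered out.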

## Testing KNN

In [None]:
# Test the KNN model using Fargo as the query movie

# Step 1: Search for movies with 'Fargo' in the title to find the movie_id
fargo_movie = movies_df[movies_df['movie_title'].str.contains('Fargo', case=False)]
print(fargo_movie)

In [None]:
# Step 2: Find movies similar to movie_id 100 (Fargo)
similar_items = get_similar_items(100)
print(f"Similar movies to 'Fargo': {similar_items}")

In [None]:
# Step 3: Movie IDs mean little to an end user, so print the movie titles instead
similar_items = get_similar_items(100)
# Look the IDs up by index so the rows stay in the order returned by the KNN model
similar_movies_df = movies_df.set_index('movie_id').loc[similar_items].reset_index()
print(similar_movies_df)


#### Visualizations and Outputs will be put in a separate page :) 