- <b> Content-based systems </b> recommend items similar to those the customer has previously rated highly. For example, if the item is a movie, then its actors, director, release year and genre are its important properties.
- To create an <b> item profile </b>, we first apply a TF-IDF vectorizer. Term frequency (TF) is a key idea in information retrieval and NLP; it measures how often a term occurs in a document. Inverse document frequency (IDF) is used in text analysis and information retrieval to weigh how significant a term is across a set of documents: terms that appear in few documents get a higher weight.
- To create a <b> user profile </b>, we use a utility matrix that records the rating each user gave to each item. From it we can infer which items the user likes.
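
As a rough illustration of the item-profile and user-profile ideas above, the sketch below applies scikit-learn's TfidfVectorizer to the genre strings of a few movies and then scores each movie against one user. The toy catalogue, the toy_ratings vector and the rating-weighted average used as the user profile are assumptions made for illustration; they are not part of this notebook's pipeline, which works from the ratings matrix instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue; the genre strings mirror the first rows of movies.csv
toy_genres = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
    "Father of the Bride Part II (1995)": "Comedy",
}
titles = list(toy_genres)

# Item profiles: one TF-IDF row per movie, one column per genre token
vectorizer = TfidfVectorizer(
    tokenizer=lambda s: s.split("|"), token_pattern=None, lowercase=False
)
item_profiles = vectorizer.fit_transform(list(toy_genres.values())).toarray()

# User profile (assumed scheme): rating-weighted average of the rated item profiles
toy_ratings = np.array([5.0, 4.0, 0.0, 0.0])  # made-up ratings, 0 means unrated
user_profile = (toy_ratings @ item_profiles) / toy_ratings.sum()

# Score every movie by cosine similarity between its profile and the user profile
scores = cosine_similarity(user_profile.reshape(1, -1), item_profiles).ravel()

for title, score in sorted(zip(titles, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {title}")
```

In a real system the already-rated movies would be filtered out before ranking; they are kept here only to show that the profile scores them highest.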

In [1]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [5]:
# loading the ratings dataset
ratings = pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv")

In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
# loading the movies dataset
movies =  pd.read_csv("https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv")

In [8]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Two datasets are imported here for a movie recommendation study. The first contains user ratings for movies; the second, the movies dataset, contains the titles and genres.

<h3> Statistical Analysis of Ratings </h3>

In [17]:
ratings.shape #number of ratings
len(ratings)

100836

In [18]:
movies.shape # number of movies
len(movies)

9742

In [19]:
n_users =  len(ratings['userId'].unique())
print(f"Number of unique users: {n_users}")

Number of unique users: 610


In [21]:
n_movies = len(movies['movieId'].unique())
print(f"Number of unique movies: {n_movies}")

Number of unique movies: 9742


<h3> User Rating Frequency </h3>

In [22]:
user_freq = ratings[['userId', 'movieId']].groupby('userId').count().reset_index()

In [23]:
user_freq.columns = ['userId', 'n_ratings']
user_freq.head()

Unnamed: 0,userId,n_ratings
0,1,232
1,2,29
2,3,39
3,4,216
4,5,44


<h3> Movie Rating Analysis </h3>

In [39]:
mean_rating = ratings.groupby('movieId')['rating'].mean()
mean_rating

movieId
1         3.920930
2         3.431818
3         3.259615
4         2.357143
5         3.071429
            ...   
193581    4.000000
193583    3.500000
193585    3.500000
193587    3.500000
193609    4.000000
Name: rating, Length: 9724, dtype: float64

In [38]:
mean_rating.min()

0.5

In [31]:
mean_rating.max()

5.0

In [34]:
ratings['movieId'].max() # largest movieId that appears in the ratings

193609

In [35]:
ratings['movieId'].min()

1

In [36]:
movie_stats = ratings.groupby('movieId')[['rating']].agg(['count','mean'])

In [37]:
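# droplevel() returns a new index, so assign it back to flatten the columns
movie_stats.columns = movie_stats.columns.droplevel()
movie_stats.columns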

Index(['count', 'mean'], dtype='object')

<h2> User-Item Matrix Creation </h2>

A user-item matrix is a basic data structure in recommendation systems. The create_matrix function below builds it as follows:
- N and M are computed as the number of unique users and unique movies in the dataset.
- Four dictionaries are produced:
    1. user_mapper: Maps distinct user IDs to indices. User ID 1 becomes index 0.
    2. movie_mapper: Maps distinct movie IDs to indices. Movie ID 1 becomes index 0 too.
    3. user_inv_mapper: Reverses user_mapper by mapping indices back to user IDs.
    4. movie_inv_mapper: Reverses movie_mapper by mapping indices back to movie IDs.
- The lists user_index and movie_index are generated to map the actual user and movie IDs in the dataset to their matching indices.
- A sparse matrix X is created using the scipy function csr_matrix. The movie and user indices give the row and column of each rating value in the dataset. Its shape is (M, N), where M is the number of distinct movies and N is the number of distinct users.
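
The steps above can be tried on a very small example first. The sketch below uses a made-up toy DataFrame with deliberately non-contiguous IDs (the names toy, user_to_col, movie_to_row and X_toy are hypothetical) to show how raw IDs become row and column positions in a movies-by-users sparse matrix.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings; the IDs are deliberately non-contiguous
toy = pd.DataFrame({
    "userId":  [10, 10, 42, 42, 99],
    "movieId": [1, 50, 1, 296, 50],
    "rating":  [4.0, 5.0, 3.0, 4.5, 2.0],
})

user_ids = np.unique(toy["userId"])    # sorted unique users:  [10, 42, 99]
movie_ids = np.unique(toy["movieId"])  # sorted unique movies: [1, 50, 296]

# Map each raw ID to a zero-based position
user_to_col = {uid: j for j, uid in enumerate(user_ids)}
movie_to_row = {mid: i for i, mid in enumerate(movie_ids)}

rows = [movie_to_row[m] for m in toy["movieId"]]
cols = [user_to_col[u] for u in toy["userId"]]

# Rows are movies and columns are users, matching the (M, N) layout
X_toy = csr_matrix(
    (toy["rating"].to_numpy(), (rows, cols)),
    shape=(len(movie_ids), len(user_ids)),
)

print(X_toy.toarray())
# [[4.  3.  0. ]
#  [5.  0.  2. ]
#  [0.  4.5 0. ]]
```

Unrated (movie, user) pairs are simply not stored, which is why a sparse format suits a dataset where most users rate only a small fraction of the movies.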

In [40]:
from scipy.sparse import csr_matrix

In [41]:
def create_matrix(df):
    N = len(df['userId'].unique())
    M = len(df['movieId'].unique())
    # map Ids to indices
    user_mapper = dict(zip(np.unique(df['userId']), list(range(N))))
    movie_mapper = dict(zip(np.unique(df['movieId']), list(range(M))))
    
    # Map Indices to Ids
    user_inv_mapper = dict(zip(list(range(N)), np.unique(df['userId'])))
    movie_inv_mapper = dict(zip(list(range(M)), np.unique(df['movieId'])))
    
    user_index = [user_mapper[i] for i in df['userId']]
    movie_index = [movie_mapper[i] for i in df['movieId']]
    
    X = csr_matrix((df['rating'], (movie_index, user_index)), shape=(M,  N))
    
    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper

    

In [42]:
X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper = create_matrix(ratings)

<h3> Movie Similarity Analysis </h3>

In [43]:
# Find similar movies using KNN
from sklearn.neighbors import NearestNeighbors

In [48]:
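def find_similar_movies(movie_id, X, k, metric='cosine', show_distance=False):
    # Relies on the global movie_mapper / movie_inv_mapper built by create_matrix
    neighbor_ids = []
    movie_ind = movie_mapper[movie_id]
    movie_vec = X[movie_ind]
    k += 1  # ask for one extra neighbor: the closest match is the movie itself
    kNN = NearestNeighbors(n_neighbors=k, algorithm='brute', metric=metric)
    kNN.fit(X)
    movie_vec = movie_vec.reshape(1, -1)
    # Always request distances so the result can be unpacked the same way
    distances, indices = kNN.kneighbors(movie_vec, return_distance=True)
    for i in range(0, k):
        n = indices.item(i)
        neighbor_ids.append(movie_inv_mapper[n])
    neighbor_ids.pop(0)  # drop the target movie
    if show_distance:
        return neighbor_ids, distances[0][1:].tolist()
    return neighbor_ids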


The function above uses the k-Nearest Neighbors algorithm to identify movies that are similar to a given movie. It takes as inputs the target movie ID, a user-item matrix (X), the number of neighbors to consider (k), a similarity metric (default is cosine), and an option to show distances between movies.

The function begins by initializing an empty list to hold the IDs of similar movies. It looks up the target movie's index in the movie_mapper dictionary and uses the user-item matrix to get the corresponding feature vector, which is that movie's row of ratings. Next, the kNN model is configured with the given parameters and fitted on X. Because k is incremented by one, the model returns k + 1 neighbors.

Using the movie_inv_mapper dictionary, the loop maps these neighbor indices back to movie IDs. Since it matches the target movie itself, the first item in the list is removed. The function returns the list of similar movie IDs; the cells that follow map them to titles and print them together with the title of the target film.
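
To see why the first neighbor is discarded, here is a minimal sketch on a hypothetical 4-movie by 3-user matrix (the values in X_toy are invented). Querying the fitted model with a row that was part of the training data returns that same row as the closest match, at cosine distance 0.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical rating matrix: rows are movies, columns are users
X_toy = csr_matrix(np.array([
    [5.0, 4.0, 0.0],   # movie 0
    [4.0, 5.0, 0.0],   # movie 1: rated much like movie 0
    [0.0, 0.0, 5.0],   # movie 2: rated only by a different user
    [5.0, 0.0, 1.0],   # movie 3: shares one rater with movie 0
]))

knn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric="cosine")
knn.fit(X_toy)

# Query with movie 0's own row; cosine distance = 1 - cosine similarity
distances, indices = knn.kneighbors(X_toy[0], return_distance=True)

print(indices[0].tolist())  # [0, 1, 3]: the query itself, then movies 1 and 3
print(np.round(distances[0], 3))
```

Cosine distance compares the direction of the rating vectors rather than their magnitude, so movie 1 ranks as the closest real neighbor because the same users rated it in similar proportions.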

In [49]:
movie_titles = dict(zip(movies['movieId'], movies['title']))

In [50]:
movie_id = 3

In [51]:
similar_ids = find_similar_movies(movie_id, X, k=10)

In [52]:
movie_title = movie_titles[movie_id]

In [55]:
print(f"Since you watched {movie_title}")

print("You might also like.....")
for i in similar_ids:
    print(movie_titles[i])

Since you watched Grumpier Old Men (1995)
You might also like.....
Grumpy Old Men (1993)
Striptease (1996)
Nutty Professor, The (1996)
Twister (1996)
Father of the Bride Part II (1995)
Broken Arrow (1996)
Bio-Dome (1996)
Truth About Cats & Dogs, The (1996)
Sabrina (1995)
Birdcage, The (1996)


<h3> Movie Recommendation Based on User Preferences </h3>

In [56]:
def recommend_movies_for_user(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10):
    df1 = ratings[ratings['userId'] == user_id]
    if df1.empty:
        print(f"User with ID {user_id} does not exist.")
        return 
    movie_id = df1[df1['rating'] == max(df1['rating'])]['movieId'].iloc[0]
    movie_titles = dict(zip(movies['movieId'], movies['title']))
    similar_ids=find_similar_movies(movie_id, X, k)
    movie_title = movie_titles.get(movie_id, "Movie not found")
    if movie_title == "Movie not found":
        print(f"Movie with Id {movie_id} not found.")
        return
    print(f"Since you watched {movie_title}, you might also like:")
    for i in similar_ids:
        print(movie_titles.get(i, "Movie not found"))

The function accepts the following inputs: the user ID for which recommendations are desired; a user-item matrix X representing movie ratings; dictionaries (user_mapper, movie_mapper, and movie_inv_mapper) for mapping user and movie IDs to matrix indices; and an optional parameter k for the number of recommended movies (default is 10). It also reads the global ratings and movies DataFrames.

It first filters the ratings dataset for rows that belong to the given user ID. If there are none, it reports that the user does not exist and returns. Otherwise, it selects the movie that received the highest rating from that user and takes its movieId; if several movies share the highest rating, the first one in dataset order is used.

A dictionary called movie_titles is created to map movie IDs to their titles. The function then calls find_similar_movies to locate movies in the user-item matrix that are comparable to the user's highest rated movie, identified by movie_id. It gets back a list of comparable movie IDs.

The code looks up the title of the highest rated film in the movie_titles dictionary. If the lookup falls back to "Movie not found", the film is not present in the movies dataset, so the function reports this and returns. If the movie is found, the user is presented with recommendations based on that film: the list of comparable movie IDs is iterated over and the titles are printed, with "Movie not found" as the default for any ID missing from the dataset.

The function handles situations where the user or movie does not exist in the dataset and is intended to suggest movies for a particular user based on their highest rated film.
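
The seed-selection step described above can be isolated in a few lines. The sketch below uses an invented toy DataFrame and a hypothetical helper, pick_seed_movie, that returns None for an unknown user instead of printing a message; it is meant only to show how ties and missing users behave.

```python
import pandas as pd

# Hypothetical ratings: user 7 gave two movies the same top rating
toy = pd.DataFrame({
    "userId":  [7, 7, 7, 8],
    "movieId": [32, 296, 50, 1],
    "rating":  [5.0, 5.0, 3.5, 4.0],
})

def pick_seed_movie(df, user_id):
    """Return the first top-rated movieId for user_id, or None if the user has no ratings."""
    user_rows = df[df["userId"] == user_id]
    if user_rows.empty:
        return None
    top = user_rows[user_rows["rating"] == user_rows["rating"].max()]
    # iloc[0] keeps the first tied row in dataset order
    return int(top["movieId"].iloc[0])

print(pick_seed_movie(toy, 7))     # 32, the first of the two 5.0 ratings
print(pick_seed_movie(toy, 1500))  # None, no such user
```

Because only one seed movie is used, a user with many top ratings gets recommendations that reflect just the first of them.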

<h3> Recommend the Movies </h3>

In [58]:
user_id = 150
recommend_movies_for_user(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10)

Since you watched Twelve Monkeys (a.k.a. 12 Monkeys) (1995), you might also like:
Pulp Fiction (1994)
Terminator 2: Judgment Day (1991)
Independence Day (a.k.a. ID4) (1996)
Seven (a.k.a. Se7en) (1995)
Fargo (1996)
Fugitive, The (1993)
Usual Suspects, The (1995)
Jurassic Park (1993)
Star Wars: Episode IV - A New Hope (1977)
Heat (1995)


In [59]:
user_id = 415
recommend_movies_for_user(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10)

Since you watched Pulp Fiction (1994), you might also like:
Silence of the Lambs, The (1991)
Shawshank Redemption, The (1994)
Seven (a.k.a. Se7en) (1995)
Forrest Gump (1994)
Usual Suspects, The (1995)
Braveheart (1995)
Fight Club (1999)
Fargo (1996)
Terminator 2: Judgment Day (1991)
Reservoir Dogs (1992)


In [60]:
user_id = 467
recommend_movies_for_user(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10)

Since you watched Postman, The (Postino, Il) (1994), you might also like:
Dead Man Walking (1995)
Like Water for Chocolate (Como agua para chocolate) (1992)
Smoke (1995)
Secrets & Lies (1996)
Leaving Las Vegas (1995)
Mighty Aphrodite (1995)
Last Emperor, The (1987)
Three Colors: White (Trzy kolory: Bialy) (1994)
Cinema Paradiso (Nuovo cinema Paradiso) (1989)
Fargo (1996)


In [61]:
user_id = 1500
recommend_movies_for_user(user_id, X, user_mapper, movie_mapper, movie_inv_mapper, k=10)

User with ID 1500 does not exist.


<h2> Conclusion</h2>

In conclusion, developing a recommendation system in Python makes it possible to produce tailored content recommendations that take user preferences into account and improve the user experience.