# Zee Recommender System

## Objective
Build a recommender system that shows personalized movie recommendations, based on the ratings given by a user and by other users similar to them, in order to improve the user experience.

Recommender systems are now common in fields such as e-commerce and OTT platforms. They enhance the user experience on a platform and can also increase its revenue: the better the recommender system, the better the user experience and, typically, the higher the revenue.

In this notebook, we work with the data provided by Zee and try several approaches to tackle the problem statement.

In [1]:
import pandas as pd
import numpy as np
import re

## Reading data

In [2]:
movies = pd.read_fwf("data/zee-movies.dat", encoding="ISO-8859-1")
users = pd.read_fwf("data/zee-users.dat", encoding="ISO-8859-1")
ratings = pd.read_fwf("data/zee-ratings.dat", encoding="ISO-8859-1")

In [3]:
movies.head()

Unnamed: 0,Movie ID::Title::Genres,Unnamed: 1,Unnamed: 2
0,1::Toy Story (1995)::Animation|Children's|Comedy,,
1,2::Jumanji (1995)::Adventure|Children's|Fantasy,,
2,3::Grumpier Old Men (1995)::Comedy|Romance,,
3,4::Waiting to Exhale (1995)::Comedy|Drama,,
4,5::Father of the Bride Part II (1995)::Comedy,,


In [4]:
users.head()

Unnamed: 0,UserID::Gender::Age::Occupation::Zip-code
0,1::F::1::10::48067
1,2::M::56::16::70072
2,3::M::25::15::55117
3,4::M::45::7::02460
4,5::M::25::20::55455


In [5]:
ratings.head()

Unnamed: 0,UserID::MovieID::Rating::Timestamp
0,1::1193::5::978300760
1,1::661::3::978302109
2,1::914::3::978301968
3,1::3408::4::978300275
4,1::2355::5::978824291


The dataframes are not read properly: `read_fwf` expects fixed-width columns, so each `::`-delimited row lands in a single column. We need to preprocess them by removing the unwanted columns and splitting the main column on `::`.
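As an aside, the `::` delimiter can also be parsed at read time. The sketch below is a hedged alternative, not what this notebook does: it uses `pd.read_csv` with a multi-character separator (which requires `engine="python"`) on a small in-memory sample that mimics the header and rows shown above. The names `sample` and `movies_alt` are illustrative only.

```python
import io

import pandas as pd

# In-memory stand-in for a "::"-delimited file with a header row (illustrative data).
sample = io.StringIO(
    "Movie ID::Title::Genres\n"
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A separator longer than one character is treated as a regular expression,
# which is only supported by the slower Python parsing engine.
movies_alt = pd.read_csv(sample, sep="::", engine="python")

print(movies_alt)
```

With the real files, the same call would presumably also need `encoding="ISO-8859-1"`, and `dtype=str` for columns such as the zip code where leading zeros matter.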

### Preprocessing dataframes

In [6]:
movies = movies.drop(columns=['Unnamed: 1', 'Unnamed: 2'])
cols = movies.columns[0].split('::')
movies = movies['Movie ID::Title::Genres'].str.split('::', expand=True)
movies.columns = cols

In [7]:
movies.head()

Unnamed: 0,Movie ID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
cols = users.columns[0].split('::')
users = users['UserID::Gender::Age::Occupation::Zip-code'].str.split('::', expand=True)
users.columns = cols

In [9]:
users.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [10]:
cols = ratings.columns[0].split('::')
ratings = ratings['UserID::MovieID::Rating::Timestamp'].str.split('::', expand=True)
ratings.columns = cols

In [11]:
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Now that the dataframes are properly processed, we can carry on with EDA.

# Exploratory Data Analysis

In [12]:
# shape
print(f"Shape of users: {users.shape}")
print(f"Shape of movies: {movies.shape}")
print(f"Shape of ratings: {ratings.shape}")

Shape of users: (6040, 5)
Shape of movies: (3883, 3)
Shape of ratings: (1000209, 4)


In [13]:
# missing values
print(f"Null values in users:\n {users.isna().sum()}")
print(f"Null values in movies:\n {movies.isna().sum()}")
print(f"Null values in ratings:\n {ratings.isna().sum()}")

Null values in users:
 UserID        0
Gender        0
Age           0
Occupation    0
Zip-code      0
dtype: int64
Null values in movies:
 Movie ID     0
Title        0
Genres      25
dtype: int64
Null values in ratings:
 UserID       0
MovieID      0
Rating       0
Timestamp    0
dtype: int64


In [14]:
# duplicates
print(f"Duplicates in users: {users.duplicated().sum()}")
print(f"Duplicates in movies: {movies.duplicated().sum()}")
print(f"Duplicates in ratings: {ratings.duplicated().sum()}")

Duplicates in users: 0
Duplicates in movies: 0
Duplicates in ratings: 0


In [15]:
users.info()
print("\n","-"*30,"\n")
movies.info()
print("\n","-"*30,"\n")
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   UserID      6040 non-null   object
 1   Gender      6040 non-null   object
 2   Age         6040 non-null   object
 3   Occupation  6040 non-null   object
 4   Zip-code    6040 non-null   object
dtypes: object(5)
memory usage: 236.1+ KB

 ------------------------------ 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Movie ID  3883 non-null   object
 1   Title     3883 non-null   object
 2   Genres    3858 non-null   object
dtypes: object(3)
memory usage: 91.1+ KB

 ------------------------------ 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------ 

In [16]:
users.describe(include='all')

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
count,6040,6040,6040,6040,6040
unique,6040,2,7,21,3439
top,6040,M,25,4,48104
freq,1,4331,2096,759,19


In [17]:
movies.describe(include='all')

Unnamed: 0,Movie ID,Title,Genres
count,3883,3883,3858
unique,3883,3883,360
top,3952,"Contender, The (2000)",Drama
freq,1,1,830


In [18]:
ratings.describe(include='all')

Unnamed: 0,UserID,MovieID,Rating,Timestamp
count,1000209,1000209,1000209,1000209
unique,6040,3706,5,458455
top,4169,2858,4,975528402
freq,2314,3428,348971,30


In [19]:
# converting timestamp (epoch seconds stored as strings) to datetime objects
ratings['Timestamp'] = pd.to_datetime(ratings['Timestamp'].astype(int), unit='s')


There are null values in the movies dataframe, specifically in the `Genres` column (25 rows). We can use a simple mode imputation to fill these missing values.

In [164]:
movies['Genres'] = movies['Genres'].fillna(movies['Genres'].mode()[0]).copy()

## Feature Engineering

We can extract the year of release from the movie title.

In [20]:
# extracting the release year of the movie from the movie name
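def extract_year(name):
    """Returns the last parenthesized number in the title as a string, or '-1' if none is found."""
    try:
        pattern = r".*\((\d+)\)"
        return re.findall(pattern, name)[0]
    except (IndexError, TypeError):
        # no "(year)" in the title, or the title is not a string
        return '-1'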

extract_year = np.vectorize(extract_year)

movies['release_year'] = extract_year(movies['Title'])
movies = movies.rename(columns={'Movie ID':'MovieID'})
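The same extraction can be sketched without `np.vectorize`, using the pandas string accessor. This is an illustrative alternative on made-up titles, assuming the year is a four-digit number in parentheses at the end of the title; `titles` and `years` are hypothetical names.

```python
import pandas as pd

# Illustrative titles; the last one has no year.
titles = pd.Series(["Toy Story (1995)", "Bug's Life, A (1998)", "No year here"])

# Capture a four-digit number in parentheses at the end of the title.
# Titles without a match yield NaN, which is replaced by the same '-1' sentinel.
years = titles.str.extract(r"\((\d{4})\)\s*$", expand=False).fillna("-1")

print(years.tolist())
```

The anchored pattern is slightly stricter than the one used above, so titles with trailing text after the year would fall back to `'-1'`.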

## Merging the datasets

In [65]:
data = pd.merge(ratings,users,on='UserID',how='left')
data = pd.merge(data,movies,on='MovieID',how='left')
data['Rating'] = data['Rating'].astype(float)
data['UserID'] = data['UserID'].astype(int)
data['MovieID'] = data['MovieID'].astype(int)

In [66]:
data.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Gender,Age,Occupation,Zip-code,Title,Genres,release_year
0,1,1193,5.0,2000-12-31 22:12:40,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama,1975
1,1,661,3.0,2000-12-31 22:35:09,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical,1996
2,1,914,3.0,2000-12-31 22:32:48,F,1,10,48067,My Fair Lady (1964),Musical|Romance,1964
3,1,3408,4.0,2000-12-31 22:04:35,F,1,10,48067,Erin Brockovich (2000),Drama,2000
4,1,2355,5.0,2001-01-06 23:38:11,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy,1998


# Recommender Systems

To work with collaborative-filtering or factorization-based recommender systems, we need to convert the data into a user-item ratings matrix.

In [68]:
pivot = data.pivot_table(values='Rating', index='UserID', columns='MovieID')
pivot = pivot.fillna(0.0).copy()

In [69]:
pivot.head()

MovieID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
UserID,,,,,,,,,,,,,,,,,,,,,
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
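In this matrix a 0 means "not rated" rather than a rating of zero, and most cells are empty: with 1,000,209 ratings over 6040 × 3706 cells, roughly 4.5% of the matrix is filled, if each rating row is a distinct user-movie pair. The toy sketch below shows the same pivot step and a density calculation; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up ratings in the same long format as the merged data.
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3],
    "MovieID": [10, 20, 10, 20, 30],
    "Rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# One row per user, one column per movie; unrated cells become 0.0.
toy_pivot = toy.pivot_table(values="Rating", index="UserID", columns="MovieID").fillna(0.0)

# Fraction of cells that hold an actual rating.
density = (toy_pivot > 0).to_numpy().mean()

print(toy_pivot)
print(f"density = {density:.3f}")
```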


In [99]:
# creating a mapping of movie id to movie name
movie_id_to_name = dict(data[['MovieID','Title']].drop_duplicates().values)

## RS with Pearson Correlation Coefficient

In [93]:
# defining the similarity function
def pearson_correlation(v1, v2):
    """Calculates the Pearson correlation coefficient between two vectors."""
    if len(v1) != len(v2):
        raise ValueError("Vectors must have the same length")

    v1_centered = v1 - np.mean(v1)
    v2_centered = v2 - np.mean(v2)

    covariance = np.sum(v1_centered * v2_centered)/len(v1)

    stddev1 = np.std(v1)
    stddev2 = np.std(v2)

    if stddev1 * stddev2 == 0:
        return 0.0

    correlation = covariance / (stddev1 * stddev2)

    return correlation

In [94]:
pearson_correlation(pivot.loc[1,:],pivot.loc[6,:])

0.16591180862627458
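As a sanity check, a hand-written Pearson coefficient of this form should agree with NumPy's built-in `np.corrcoef`. The sketch below re-implements the formula compactly so that it is self-contained; the function name `pearson` and the vectors `a` and `b` are illustrative.

```python
import numpy as np

def pearson(v1, v2):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    c1 = v1 - v1.mean()
    c2 = v2 - v2.mean()
    denom = np.sqrt((c1 ** 2).sum() * (c2 ** 2).sum())
    if denom == 0:
        return 0.0
    return float((c1 * c2).sum() / denom)

a = np.array([5.0, 0.0, 3.0, 4.0])
b = np.array([4.0, 0.0, 2.0, 5.0])

r_manual = pearson(a, b)
r_numpy = np.corrcoef(a, b)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix

print(r_manual, r_numpy)
```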

In [111]:
# create a function to recommend movies to the users
def rs_pearson(user_id, k=5):
    """Recommends k items for the given user using pearson correlation"""
    user_vec = pivot.loc[user_id,:]
    rated_movie_ids = user_vec.loc[user_vec>0.0].index.tolist()
    top_movies = dict()

    for i in rated_movie_ids:
        vec1 = pivot.loc[:,i]
        for j in pivot.columns:
            if j not in rated_movie_ids:
                vec2 = pivot.loc[:,j]
                coef = pearson_correlation(vec1,vec2)
            else:
                continue
            if j in top_movies:
                top_movies[j]+=coef
            else:
                top_movies[j] = coef

    recommendation_ids = [i[0] for i in sorted(top_movies.items(), key=lambda x:x[1], reverse=True)[:k]]

    print("Recommendations:")
    for i in range(len(recommendation_ids)):
        print(f"{i+1}. {movie_id_to_name[recommendation_ids[i]]}")
    

#### Inferencing

In [107]:
rs_pearson(user_id=1,k=5)

Recommendations:
1. Lion King, The (1994)
2. Little Mermaid, The (1989)
3. Sleeping Beauty (1959)
4. Lady and the Tramp (1955)
5. Peter Pan (1953)


In [108]:
rs_pearson(user_id=562,k=5)

Recommendations:
1. Die Hard (1988)
2. Untouchables, The (1987)
3. Batman (1989)
4. Predator (1987)
5. Robocop (1987)
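The nested loops in `rs_pearson` recompute every column correlation on each call. One possible speed-up, sketched below under assumptions, is to compute the full item-item correlation matrix once with `np.corrcoef` and then sum the rows that belong to the user's rated items. `recommend_item_based` and the toy matrix `R` are hypothetical, and the function works on positional indices rather than on the labels of `pivot`.

```python
import numpy as np

def recommend_item_based(R, user_idx, k=2):
    """Top-k unrated item positions for one user, scored by summed item-item correlation."""
    R = np.asarray(R, dtype=float)
    # Correlation between columns (items); constant columns produce NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.corrcoef(R, rowvar=False)
    sim = np.nan_to_num(sim, nan=0.0)  # treat undefined correlations as 0

    rated = R[user_idx] > 0
    scores = sim[rated].sum(axis=0)    # sum similarities to every rated item
    scores[rated] = -np.inf            # never recommend already-rated items
    order = np.argsort(-scores)
    return [int(j) for j in order[:k] if np.isfinite(scores[j])]

# Toy matrix: 4 users x 4 items, 0 meaning "not rated".
R = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [1, 0, 4, 5],
])

recs = recommend_item_based(R, user_idx=0, k=2)
print(recs)
```

In a real setting the similarity matrix would be computed once and cached, so that each request only needs the row sum and the sort.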


## RS with Cosine Similarity

In [112]:
# define the similarity function
def cosine_similarity(v1, v2):
    """Calculates the cosine similarity between two vectors."""
    if len(v1) != len(v2):
        raise ValueError("Vectors must have the same length")

    dot_product = np.dot(v1, v2)

    magnitude_v1 = np.linalg.norm(v1)
    magnitude_v2 = np.linalg.norm(v2)

    if magnitude_v1 * magnitude_v2 == 0:
        return 0.0

    similarity = dot_product / (magnitude_v1 * magnitude_v2)

    return similarity

In [113]:
cosine_similarity(pivot.loc[1,:],pivot.loc[6,:])

0.1792221924905797
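The two measures are closely related: the Pearson coefficient is the cosine similarity of the mean-centered vectors, which is why the values for the same pair of vectors are close but not identical. A minimal sketch on made-up vectors (the names `cosine`, `a` and `b` are illustrative):

```python
import numpy as np

def cosine(v1, v2):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return 0.0 if norm == 0 else float(np.dot(v1, v2) / norm)

a = np.array([5.0, 0.0, 3.0, 4.0, 0.0])
b = np.array([4.0, 1.0, 0.0, 5.0, 0.0])

cos_raw = cosine(a, b)                            # similarity of the raw vectors
cos_centered = cosine(a - a.mean(), b - b.mean()) # similarity after mean-centering
pearson_np = np.corrcoef(a, b)[0, 1]

print(cos_raw, cos_centered, pearson_np)
```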

In [114]:
# create a function to recommend movies to the users
def rs_cosine(user_id, k=5):
    """Recommends k items for the given user using cosine similarity"""
    user_vec = pivot.loc[user_id,:]
    rated_movie_ids = user_vec.loc[user_vec>0.0].index.tolist()
    top_movies = dict()

    for i in rated_movie_ids:
        vec1 = pivot.loc[:,i]
        for j in pivot.columns:
            if j not in rated_movie_ids:
                vec2 = pivot.loc[:,j]
                coef = cosine_similarity(vec1,vec2)
            else:
                continue
            if j in top_movies:
                top_movies[j]+=coef
            else:
                top_movies[j] = coef

    recommendation_ids = [i[0] for i in sorted(top_movies.items(), key=lambda x:x[1], reverse=True)[:k]]

    print("Recommendations:")
    for i in range(len(recommendation_ids)):
        print(f"{i+1}. {movie_id_to_name[recommendation_ids[i]]}")
    

#### Inferencing

In [115]:
rs_cosine(user_id=1,k=5)

Recommendations:
1. Star Wars: Episode V - The Empire Strikes Back (1980)
2. Lion King, The (1994)
3. Raiders of the Lost Ark (1981)
4. Groundhog Day (1993)
5. Shawshank Redemption, The (1994)


In [116]:
rs_cosine(user_id=562,k=5)

Recommendations:
1. Die Hard (1988)
2. Terminator 2: Judgment Day (1991)
3. Star Wars: Episode VI - Return of the Jedi (1983)
4. Matrix, The (1999)
5. Total Recall (1990)


## RS with Matrix Factorization

In [126]:
from sklearn.decomposition import NMF

def decompose_ratings_matrix(R, num_factors):
    """Decomposes the ratings matrix into user and item matrices with the given number of latent factors."""

    model = NMF(n_components=num_factors, init='random', random_state=0)
    model.fit(R)

    V = model.components_.T
    U = model.transform(R)

    return U, V

In [141]:
U,V = decompose_ratings_matrix(pivot, num_factors=100)



In [142]:
print(pivot.shape)
print(U.shape)
print(V.shape)

(6040, 3706)
(6040, 100)
(3706, 100)
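In symbols, and as a sketch of what scikit-learn's `NMF` optimizes with its default Frobenius loss (regularization terms, which default to zero, are omitted), the factorization looks for non-negative matrices $U$ and $V$ whose product approximates the ratings matrix $R$. The shapes below are the ones printed above; $\hat{r}_{um}$ denotes the predicted rating of user $u$ for movie $m$.

$$
\min_{U \ge 0,\; V \ge 0} \; \tfrac{1}{2}\,\lVert R - U V^{\top} \rVert_F^2,
\qquad U \in \mathbb{R}^{6040 \times 100},\quad V \in \mathbb{R}^{3706 \times 100},
\qquad \hat{r}_{um} = \sum_{f=1}^{100} U_{uf}\, V_{mf}
$$

Because unrated cells were filled with 0 before fitting, this loss treats them as observed ratings of 0 rather than as missing values.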


In [159]:
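def rs_factorization(user_id, k=5):
    """Recommends the k movies with the highest predicted rating for the given user."""
    # look up the user's row position in the pivot index rather than assuming contiguous IDs
    user_vec = U[pivot.index.get_loc(user_id)]
    pred_ratings = []
    for movie_vec in V:
        pred = np.dot(user_vec,movie_vec.T)
        pred_ratings.append(pred)
    # note: movies the user has already rated are not filtered out here
    recom = sorted(enumerate(pred_ratings), key=lambda x: x[1], reverse=True)[:k]

    print("Recommended Movies:")
    for i,j in recom:
        print(f"{movie_id_to_name[pivot.columns[i]]} || Predicted Rating = {j:.2f}")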

#### Inferencing

In [160]:
rs_factorization(user_id=1)

Recommended Movies:
One Flew Over the Cuckoo's Nest (1975) || Predicted Rating = 5.01
Schindler's List (1993) || Predicted Rating = 4.99
Saving Private Ryan (1998) || Predicted Rating = 4.94
Christmas Story, A (1983) || Predicted Rating = 4.39
Beauty and the Beast (1991) || Predicted Rating = 4.32


In [161]:
rs_factorization(user_id=562)

Recommended Movies:
Godfather, The (1972) || Predicted Rating = 5.42
Saving Private Ryan (1998) || Predicted Rating = 4.89
Braveheart (1995) || Predicted Rating = 4.84
Raiders of the Lost Ark (1981) || Predicted Rating = 4.82
Alien (1979) || Predicted Rating = 4.73
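The lists above can include movies the user has already rated; for example, user 1 gave "One Flew Over the Cuckoo's Nest (1975)" a rating of 5. A possible variant, sketched here on toy matrices and not part of the notebook's results, masks rated items before ranking. `top_k_unseen` and the toy `U`, `V` and `R` are hypothetical.

```python
import numpy as np

def top_k_unseen(U, V, R, user_idx, k=2):
    """Top-k (item position, predicted rating) pairs among items the user has not rated."""
    scores = V @ U[user_idx]                              # predicted rating for every item
    scores = np.where(R[user_idx] > 0, -np.inf, scores)   # mask already-rated items
    order = np.argsort(-scores)[:k]
    return [(int(j), float(scores[j])) for j in order if np.isfinite(scores[j])]

# Toy factors: 2 users, 4 items, 2 latent dimensions.
U = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V = np.array([[5.0, 0.0],
              [4.0, 1.0],
              [3.0, 0.0],
              [0.0, 5.0]])
# Toy ratings: user 0 has rated item 0, user 1 has rated item 3.
R = np.array([[5.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 4.0]])

recs = top_k_unseen(U, V, R, user_idx=0, k=2)
print(recs)
```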


# Conclusion

- We read the data and preprocessed it into a workable format.
- We handled missing values, checked for duplicates, and performed EDA.
- We built three recommender systems, based on the Pearson correlation coefficient, cosine similarity, and matrix factorization respectively.
- The three systems produce different recommendation lists, although the two similarity-based systems overlap partly (for example, both recommend "Lion King, The (1994)" to user 1 and "Die Hard (1988)" to user 562).
- Models 1 and 2 need no training, since they are lazy (memory-based) algorithms, but inference is slow because item similarities are computed at request time. As implemented here, they are therefore not suitable for real-time recommendations.
- Model 3 takes longer to train, but inference is very fast, so it can be used for real-time recommendations.
- Matrix factorization is also expected to be more accurate, since it is explicitly trained to minimize reconstruction error, which the first two models are not. This notebook does not measure accuracy, so that expectation remains untested here.
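
One way to put the accuracy claim to the test would be to compare predicted and actual ratings on the cells that hold a real rating, ideally on ratings held out from training. The sketch below shows only the metric, on made-up matrices; `rmse_observed`, `R_true` and `R_pred` are hypothetical names, and no such evaluation was run in this notebook.

```python
import numpy as np

def rmse_observed(R, R_hat):
    """Root-mean-square error over the cells of R that hold a rating (value > 0)."""
    R = np.asarray(R, dtype=float)
    R_hat = np.asarray(R_hat, dtype=float)
    mask = R > 0                      # ignore unrated cells
    return float(np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2)))

# Toy example: two observed ratings (5 and 3) and their predictions (4 and 3).
R_true = np.array([[5.0, 0.0],
                   [0.0, 3.0]])
R_pred = np.array([[4.0, 2.0],
                   [1.0, 3.0]])

score = rmse_observed(R_true, R_pred)
print(f"RMSE on observed cells: {score:.4f}")
```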