# Zee Recommender System

## Objective
Create a recommender system that shows personalized movie recommendations, based on the ratings given by a user and by other users similar to them, in order to improve the user experience.

Recommender systems are now common in fields such as e-commerce and OTT platforms. They improve the user experience on a platform and can also increase its revenue. In general, the better the recommendations, the better the user experience and the higher the revenue.

In this notebook, we will work with the data provided by Zee and try several approaches to tackle the problem statement.

In [1]:
import pandas as pd
import numpy as np
import re

## Reading data

In [2]:
movies = pd.read_fwf("data/zee-movies.dat", encoding="ISO-8859-1")
users = pd.read_fwf("data/zee-users.dat", encoding="ISO-8859-1")
ratings = pd.read_fwf("data/zee-ratings.dat", encoding="ISO-8859-1")

In [3]:
movies.head()

Unnamed: 0,Movie ID::Title::Genres,Unnamed: 1,Unnamed: 2
0,1::Toy Story (1995)::Animation|Children's|Comedy,,
1,2::Jumanji (1995)::Adventure|Children's|Fantasy,,
2,3::Grumpier Old Men (1995)::Comedy|Romance,,
3,4::Waiting to Exhale (1995)::Comedy|Drama,,
4,5::Father of the Bride Part II (1995)::Comedy,,


In [4]:
users.head()

Unnamed: 0,UserID::Gender::Age::Occupation::Zip-code
0,1::F::1::10::48067
1,2::M::56::16::70072
2,3::M::25::15::55117
3,4::M::45::7::02460
4,5::M::25::20::55455


In [5]:
ratings.head()

Unnamed: 0,UserID::MovieID::Rating::Timestamp
0,1::1193::5::978300760
1,1::661::3::978302109
2,1::914::3::978301968
3,1::3408::4::978300275
4,1::2355::5::978824291
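The `head()` outputs above show that `read_fwf` leaves each record packed into a single `::`-delimited column. As a hedged alternative, and not what this notebook does, the sketch below parses the delimiter at read time with `pd.read_csv`. The in-memory sample is an assumption that stands in for `data/zee-movies.dat`; for the real files the same `encoding="ISO-8859-1"` argument would be passed.

```python
import io
import pandas as pd

# Small in-memory sample in the same "::"-delimited layout as the movies file
# (assumed stand-in for data/zee-movies.dat)
sample = io.StringIO(
    "Movie ID::Title::Genres\n"
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A multi-character separator requires the Python parsing engine
movies_sample = pd.read_csv(sample, sep="::", engine="python")
print(movies_sample.columns.tolist())
print(movies_sample.dtypes)
```

Read this way, the columns are split on load and `Movie ID` is parsed as an integer, so the manual split step would not be needed.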


### Preprocessing dataframes

In [6]:
movies = movies.drop(columns=['Unnamed: 1', 'Unnamed: 2'])
cols = movies.columns[0].split('::')
movies = movies['Movie ID::Title::Genres'].str.split('::', expand=True)
movies.columns = cols

In [7]:
movies.head()

Unnamed: 0,Movie ID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
cols = users.columns[0].split('::')
users = users['UserID::Gender::Age::Occupation::Zip-code'].str.split('::', expand=True)
users.columns = cols

In [9]:
users.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455


In [10]:
cols = ratings.columns[0].split('::')
ratings = ratings['UserID::MovieID::Rating::Timestamp'].str.split('::', expand=True)
ratings.columns = cols

In [11]:
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
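The three preprocessing cells above repeat the same pattern: take the packed column, split it on `::`, and name the parts from the packed header. A minimal sketch of that pattern as a reusable helper follows. The function name `split_packed_column` and the sample frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def split_packed_column(df, sep="::"):
    """Split the first column of df on sep, naming the parts from that column's header."""
    packed = df.columns[0]
    out = df[packed].str.split(sep, expand=True)
    out.columns = packed.split(sep)
    return out

# Hypothetical sample mimicking the raw users frame
raw_users = pd.DataFrame({"UserID::Gender::Age": ["1::F::1", "2::M::56"]})
users_sample = split_packed_column(raw_users)
print(users_sample)
```

Note that every resulting column is still a string, as in the cells above.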


## Exploratory Data Analysis

In [12]:
# shape
print(f"Shape of users: {users.shape}")
print(f"Shape of movies: {movies.shape}")
print(f"Shape of ratings: {ratings.shape}")

Shape of users: (6040, 5)
Shape of movies: (3883, 3)
Shape of ratings: (1000209, 4)


In [13]:
# missing values
print(f"Null values in users:\n {users.isna().sum()}")
print(f"Null values in movies:\n {movies.isna().sum()}")
print(f"Null values in ratings:\n {ratings.isna().sum()}")

Null values in users:
 UserID        0
Gender        0
Age           0
Occupation    0
Zip-code      0
dtype: int64
Null values in movies:
 Movie ID     0
Title        0
Genres      25
dtype: int64
Null values in ratings:
 UserID       0
MovieID      0
Rating       0
Timestamp    0
dtype: int64
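The output above reports 25 movies with no `Genres` value. Before dropping or filling them, it can help to look at the affected rows. One possible cause, stated here as an assumption rather than a confirmed finding, is that the fixed-width parse pushed part of some long records into the `Unnamed` columns that were dropped. The sketch below uses a hypothetical three-row sample to show the inspection and one way of filling the gap.

```python
import pandas as pd

# Hypothetical sample with one missing genre, mimicking the movies frame
movies_sample = pd.DataFrame({
    "Movie ID": ["1", "2", "3"],
    "Title": ["Toy Story (1995)", "Jumanji (1995)", "Some Long Title"],
    "Genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy", None],
})

# Rows whose Genres value is missing
missing_genres = movies_sample[movies_sample["Genres"].isna()]
print(missing_genres[["Movie ID", "Title"]])

# One option: keep the rows and mark the genre with an explicit placeholder
movies_filled = movies_sample.assign(Genres=movies_sample["Genres"].fillna("Unknown"))
```

Whether a placeholder, a re-parse of the raw file, or dropping the rows is appropriate depends on what the inspection shows.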


In [14]:
# duplicates
print(f"Duplicates in users: {users.duplicated().sum()}")
print(f"Duplicates in movies: {movies.duplicated().sum()}")
print(f"Duplicates in ratings: {ratings.duplicated().sum()}")

Duplicates in users: 0
Duplicates in movies: 0
Duplicates in ratings: 0


In [15]:
users.info()
print("\n","-"*30,"\n")
movies.info()
print("\n","-"*30,"\n")
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   UserID      6040 non-null   object
 1   Gender      6040 non-null   object
 2   Age         6040 non-null   object
 3   Occupation  6040 non-null   object
 4   Zip-code    6040 non-null   object
dtypes: object(5)
memory usage: 236.1+ KB

 ------------------------------ 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Movie ID  3883 non-null   object
 1   Title     3883 non-null   object
 2   Genres    3858 non-null   object
dtypes: object(3)
memory usage: 91.1+ KB

 ------------------------------ 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------ 

In [16]:
users.describe(include='all')

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
count,6040,6040,6040,6040,6040
unique,6040,2,7,21,3439
top,6040,M,25,4,48104
freq,1,4331,2096,759,19


In [17]:
movies.describe(include='all')

Unnamed: 0,Movie ID,Title,Genres
count,3883,3883,3858
unique,3883,3883,360
top,3952,"Contender, The (2000)",Drama
freq,1,1,830


In [18]:
ratings.describe(include='all')

Unnamed: 0,UserID,MovieID,Rating,Timestamp
count,1000209,1000209,1000209,1000209
unique,6040,3706,5,458455
top,4169,2858,4,975528402
freq,2314,3428,348971,30
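Because every column is stored as `object` (string), `describe(include='all')` reports only count, unique, top and freq. A hedged sketch of converting the numeric-looking columns with `pd.to_numeric` follows, so that `describe()` can return means and quantiles. The sample frame is an assumption standing in for `ratings`.

```python
import pandas as pd

# Hypothetical sample mimicking the string-typed ratings frame
ratings_sample = pd.DataFrame({
    "UserID": ["1", "1", "2"],
    "MovieID": ["1193", "661", "1193"],
    "Rating": ["5", "3", "4"],
})

# Convert each numeric-looking column from string to a numeric dtype
numeric_cols = ["UserID", "MovieID", "Rating"]
ratings_sample[numeric_cols] = ratings_sample[numeric_cols].apply(pd.to_numeric)

print(ratings_sample.dtypes)
print(ratings_sample["Rating"].describe())
```

If the key columns (`UserID`, `MovieID`) are converted, the same conversion should be applied in every frame that is later merged on them, since pandas does not match a string key against an integer key.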


In [19]:
# converting timestamp (Unix epoch seconds, stored as strings) to datetime;
# casting to int first avoids pandas' deprecation warning for string input with `unit`
ratings['Timestamp'] = pd.to_datetime(ratings['Timestamp'].astype('int64'), unit='s')


## Feature Engineering

In [20]:
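# extracting the release year of the movie from the movie name
def extract_year(name):
    # the greedy ".*" makes the match land on the last "(digits)" group in the title
    pattern = r".*\((\d+)\)"
    try:
        return re.findall(pattern, name)[0]
    except (TypeError, IndexError):
        # non-string title (e.g. NaN) or no "(digits)" group found
        return '-1'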

extract_year = np.vectorize(extract_year)

movies['release_year'] = extract_year(movies['Title'])
movies = movies.rename(columns={'Movie ID':'MovieID'})
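The same extraction can also be written with pandas' vectorized string methods. The sketch below is an alternative, not the notebook's method, and it assumes the year is a four-digit number in parentheses at the end of the title. Titles that do not match get `'-1'`, mirroring the fallback above.

```python
import pandas as pd

# Hypothetical sample titles, the last one without a year
titles = pd.Series(["Toy Story (1995)", "Bug's Life, A (1998)", "Untitled"])

# Capture a four-digit year in parentheses at the end of the title
years = titles.str.extract(r"\((\d{4})\)\s*$", expand=False).fillna("-1")
print(years.tolist())
```

This pattern is stricter than `extract_year`, which accepts any run of digits in the last parenthesized group, so the two can differ on unusual titles.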

## Merging the datasets

In [24]:
# left joins keep every rating, even if a user or movie has no match
data = pd.merge(ratings, users, on='UserID', how='left')
data = pd.merge(data, movies, on='MovieID', how='left')

In [25]:
data.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Gender,Age,Occupation,Zip-code,Title,Genres,release_year
0,1,1193,5,2000-12-31 22:12:40,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama,1975
1,1,661,3,2000-12-31 22:35:09,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical,1996
2,1,914,3,2000-12-31 22:32:48,F,1,10,48067,My Fair Lady (1964),Musical|Romance,1964
3,1,3408,4,2000-12-31 22:04:35,F,1,10,48067,Erin Brockovich (2000),Drama,2000
4,1,2355,5,2001-01-06 23:38:11,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy,1998
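A left merge silently leaves `NaN` in the joined columns when a key has no match, and it duplicates rows when the right-hand key is not unique. As a hedged sketch of how this could be checked, the example below uses the `validate` and `indicator` parameters of `pd.merge` on small hypothetical frames (`ratings_s` and `movies_s` are illustrative names).

```python
import pandas as pd

# Hypothetical samples: the rating for MovieID "99" has no matching movie
ratings_s = pd.DataFrame({
    "UserID": ["1", "1", "2"],
    "MovieID": ["10", "20", "99"],
    "Rating": ["5", "3", "4"],
})
movies_s = pd.DataFrame({
    "MovieID": ["10", "20"],
    "Title": ["A (1990)", "B (1991)"],
})

merged = pd.merge(
    ratings_s, movies_s,
    on="MovieID", how="left",
    validate="many_to_one",  # raises MergeError if MovieID repeats in movies_s
    indicator=True,          # adds a "_merge" column describing each row's match
)

# Ratings that found no movie on the right-hand side
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

With `validate="many_to_one"` the row count of the result equals the row count of the left frame, which gives a quick sanity check after each merge.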
