<a href="https://colab.research.google.com/github/manola1109/Recommender-system-with-Python/blob/main/Non_Personalised_Recommender_Systems_Movielens_100k.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Movielens - 100K Dataset

MovieLens 100K dataset has been a standard dataset used for benchmarking recommender systems for more than 20 years now and hence this provides a good point to start our learning journey for recommender systems. For non commercial personalised recommendations for movies you can check out the website: https://movielens.org/

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up - users who had less than 20 ratings or did not have complete demographic information were removed from this data set.

## Data Description


**Ratings**    -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1. This is a comma separated list of
	         user id | item id | rating | timestamp.
              The time stamps are unix seconds since 1/1/1970 UTC   


**Movie Information**   -- Information about the items (movies); this is a comma separated
              list of
              movie id | movie title | release date | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.


**User Demographics**    -- Demographic information about the users; this is a comma
              separated list of
              user id | age | gender | occupation | zip code

## 1. Reading Dataset <a class="anchor" id="Reading-Dataset"></a>


In [1]:
import pandas as pd
import numpy as np

In [2]:
#Reading users file:
users = pd.read_csv('user_demographics.csv')

#Reading ratings file:
ratings= pd.read_csv('ratings.csv')

#Reading items file:
movie_info = pd.read_csv('movie_info.csv')

## 2. Basic Exploration <a class="anchor" id="Basic-Exploration"></a>

Let us look at each table to understand what we are dealing with here

### Exploring user data

In [3]:
# shape of the users data
print(users.shape)
# view the users data
users.head()

(943, 5)


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [4]:
pd.isnull(users).sum()

Unnamed: 0,0
user_id,0
age,0
sex,0
occupation,0
zip_code,0


So, we have 943 users in the dataset and each user has 5 features, i.e. user_ID, age, sex, occupation and zip_code. We have no missing values in the user data. Now let’s look at the ratings file.

### Exploring ratings data

In [5]:
# shape of the data
print(ratings.shape)
# view the ratings data
ratings.head()

(100000, 4)


Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [6]:
ratings[(ratings['user_id'] == 1)&(ratings['movie_id'] == 100)]

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
17672,1,100,5,878543541


In [7]:
pd.isnull(ratings).sum()

Unnamed: 0,0
user_id,0
movie_id,0
rating,0
unix_timestamp,0


We have 100k ratings for different user and movie combinations. Again there are no missing values here. Now lets examine the items file.

### Exploring Movie Information data

In [8]:
# shape of the data
print(movie_info.shape)
# view the items file
movie_info.head()

(1682, 22)


Unnamed: 0,movie id,movie title,release date,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-95,0,0,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-95,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-95,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-95,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-95,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


In [9]:
# Check missing values in movie information
pd.isnull(movie_info).sum()

Unnamed: 0,0
movie id,0
movie title,0
release date,1
unknown,0
Action,0
Adventure,0
Animation,0
Children's,0
Comedy,0
Crime,0


This dataset contains attributes of 1682 movies. There are 24 columns out of which last 19 columns specify the genre of a particular movie. These are binary columns, i.e., a value of 1 denotes that the movie belongs to that genre, and 0 otherwise.

We have release date missing for only 1 movie in the dataset and rest of the variables do not have any missing value

## 3.  Merging Movie information to ratings dataframe <a class="anchor" id="merge"></a>

The movie names are contained in a separate file. Let's merge that data with ratings and store it in ratings dataframe. The idea is to bring movie title information in ratings dataframe as it would be useful later on

In [10]:
ratings = ratings.merge(movie_info[['movie id','movie title']], how='left', left_on = 'movie_id', right_on = 'movie id')

In [None]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,movie id,movie title
0,196,242,3,881250949,242,Kolya (1996)
1,186,302,3,891717742,302,L.A. Confidential (1997)
2,22,377,1,878887116,377,Heavyweights (1994)
3,244,51,2,880606923,51,Legends of the Fall (1994)
4,166,346,1,886397596,346,Jackie Brown (1997)


Lets also combine movie id and movie title separated by ': ' and store it in a new column named movie

In [11]:
ratings['movie'] = ratings['movie_id'].map(str) + str(': ') + ratings['movie title'].map(str)

In [12]:
ratings.columns

Index(['user_id', 'movie_id', 'rating', 'unix_timestamp', 'movie id',
       'movie title', 'movie'],
      dtype='object')

Keeping the columns movie, user_id and rating in the ratings dataframe and drop all others

In [13]:
ratings = ratings.drop(['movie id', 'movie title', 'movie_id','unix_timestamp'], axis = 1)

In [14]:
ratings = ratings[['user_id','movie','rating']]

For using non personalised recommender systems we are only interested in popular movies so we keep movies with atleast 100 ratings in the dataframe and drop the rest

In [15]:
movie_counts = ratings['movie'].value_counts()
ratings = ratings[(ratings['movie'].isin(movie_counts[movie_counts >= 100].index))]

Next, we create a user item matrix using Pandas Pivot Function such that users are in the index and each movie is represented by a separate column**
- Merge user data with ratings data
- Create user movie matrix using user ids as rows and movies as columns & name it 'user_movie_matrix'

User|Star Wars|Fargo|Contact
-|-|-|-
User 1|2|3|5
User 2|4|NA|NA
User 3|5|4|5
User 4|3|NA|2

In [16]:
n_users = ratings.user_id.unique().shape[0]
n_items = ratings.movie.unique().shape[0]

In [17]:
n_users, n_items

(943, 338)

In [18]:
user_movie_matrix = ratings.pivot(index = 'user_id', columns = 'movie', values = 'rating')

In [19]:
user_movie_matrix

movie,100: Fargo (1996),1012: Private Parts (1997),1016: Con Air (1997),1028: Grumpier Old Men (1995),1047: Multiplicity (1996),109: Mystery Science Theater 3000: The Movie (1996),"111: Truth About Cats & Dogs, The (1996)",116: Cold Comfort Farm (1995),"117: Rock, The (1996)",118: Twister (1996),...,"928: Craft, The (1996)",92: True Romance (1993),93: Welcome to the Dollhouse (1995),94: Home Alone (1990),95: Aladdin (1992),96: Terminator 2: Judgment Day (1991),97: Dances with Wolves (1990),"98: Silence of the Lambs, The (1991)",99: Snow White and the Seven Dwarfs (1937),9: Dead Man Walking (1995)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,,,,,5.0,5.0,3.0,3.0,3.0,...,,3.0,5.0,2.0,4.0,5.0,3.0,4.0,3.0,5.0
2,5.0,,,,,,4.0,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,5.0,,,,,5.0,,,,,...,,,,3.0,4.0,,,3.0,3.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,5.0,,,,,,5.0,...,,,,,,,,,,5.0
940,3.0,,,,,,,2.0,,,...,,,,,5.0,5.0,,4.0,,3.0
941,,,,,,,,,5.0,,...,,,,,,,,,,
942,,,,4.0,,,,,4.0,,...,,,,,5.0,,5.0,,5.0,


In [20]:
ratings = ratings.merge(users[['user_id','sex']], how = 'left', on = 'user_id')
ratings = ratings[['user_id','sex','movie','rating']]

## 4. Non Personalised Recommender Systems using average ratings <a class="anchor" id="average"></a>

Here we calculate the mean rating for each movie, order with the highest rating listed first, and find the top five movies

In [21]:
user_movie_matrix.mean(axis=0).sort_values(ascending=False).head(5)

Unnamed: 0_level_0,0
movie,Unnamed: 1_level_1
"408: Close Shave, A (1995)",4.491071
318: Schindler's List (1993),4.466443
"169: Wrong Trousers, The (1993)",4.466102
483: Casablanca (1942),4.45679
"64: Shawshank Redemption, The (1994)",4.44523


Interestingly average rating placed close shave which is a short animated movie at the top altough it is not very popular

## 5. Non Personalised Recommender Systems using number of ratings or rating count <a class="anchor" id="ratingcount"></a>

Here we count the number of ratings for each movie, order with the most number of ratings first, and find the top five.

In [22]:
user_movie_matrix.count(axis=0).sort_values(ascending=False).head(5)

Unnamed: 0_level_0,0
movie,Unnamed: 1_level_1
50: Star Wars (1977),583
258: Contact (1997),509
100: Fargo (1996),508
181: Return of the Jedi (1983),507
294: Liar Liar (1997),485


We see here that average rating listed very different movies from when we tried to use rating count instead. Here we have more common movies as expected such as Star Wars, Fargo, Return of the Jedi etc.

## 6. Non Personalised Recommender Systems using count of ratings 4 and above <a class="anchor" id="4ratings"></a>

Here we calculate the percentage of ratings for each movie that are 4 or higher and order with the highest percentage first. Notice that the three different measures of "best" reflect different priorities and give different results; this should help you see why you need to be thoughtful about what metrics you use.

In [23]:
user_movie_matrix.apply(pd.value_counts)

  user_movie_matrix.apply(pd.value_counts)


movie,100: Fargo (1996),1012: Private Parts (1997),1016: Con Air (1997),1028: Grumpier Old Men (1995),1047: Multiplicity (1996),109: Mystery Science Theater 3000: The Movie (1996),"111: Truth About Cats & Dogs, The (1996)",116: Cold Comfort Farm (1995),"117: Rock, The (1996)",118: Twister (1996),...,"928: Craft, The (1996)",92: True Romance (1993),93: Welcome to the Dollhouse (1995),94: Home Alone (1990),95: Aladdin (1992),96: Terminator 2: Judgment Day (1991),97: Dances with Wolves (1990),"98: Silence of the Lambs, The (1991)",99: Snow White and the Seven Dwarfs (1937),9: Dead Man Walking (1995)
1.0,14,8,8,10,14,8,14,5,9,20,...,9,6,6,7,2,6,6,6,1,11
2.0,18,7,11,40,33,16,18,9,37,59,...,16,4,9,35,15,20,26,10,12,18
3.0,70,22,50,47,54,39,92,25,92,87,...,45,30,20,46,58,43,57,30,59,59
4.0,179,50,46,36,27,46,118,50,163,92,...,22,48,52,37,91,123,93,163,64,114
5.0,227,13,22,15,6,21,30,36,77,35,...,12,16,25,12,53,103,74,181,36,97


In [24]:
df_four = user_movie_matrix.apply(lambda x: x[x>=4]).count(axis=0) / user_movie_matrix.apply(lambda x: x).count(axis=0)
df_four.sort_values(ascending = False).head(5)

Unnamed: 0_level_0,0
movie,Unnamed: 1_level_1
479: Vertigo (1958),0.905028
"64: Shawshank Redemption, The (1994)",0.90106
"408: Close Shave, A (1995)",0.892857
"169: Wrong Trousers, The (1993)",0.889831
318: Schindler's List (1993),0.889262


## 7. Weak Personalisation using Gender Information <a class="anchor" id="weakratings"></a>

Till now we have seen absolute non personalised recommendations, the first step towards personalisation can be taken here by using the gender of the user.
We first recompute the mean rating for each movie separately for males and for females and then calculate the overall mean rating (across all ratings) for males and females.
Then we find out the 5 movies that have the greatest differences (one where men are most above women, and one where women are most above men) along with the differences in average.
The steps to use here are:
- Add the gender information from user dataframe
- Calculate Difference in average of men and women

In [25]:
user_movie_matrix['sex'] = list(users['sex'])

In [27]:
## Subseting all the male users within the user movie matrix
df_m = user_movie_matrix[user_movie_matrix['sex']=='M']
# Drop the 'sex' column before calculating the mean
df_m_mean = df_m.drop(columns=['sex']).mean(axis=0).sort_values(ascending=False)

df_f = user_movie_matrix[user_movie_matrix['sex']=='F']
# Drop the 'sex' column before calculating the mean
df_f_mean = df_f.drop(columns=['sex']).mean(axis=0).sort_values(ascending=False)

dif_g = df_f_mean - df_m_mean
dif_g.sort_values(ascending=False)

Unnamed: 0_level_0,0
movie,Unnamed: 1_level_1
"476: First Wives Club, The (1996)",0.748951
485: My Fair Lady (1964),0.635686
29: Batman Forever (1995),0.635452
"38: Net, The (1995)",0.625616
451: Grease (1978),0.612793
...,...
"199: Bridge on the River Kwai, The (1957)",-0.621978
554: Waterworld (1995),-0.664553
156: Reservoir Dogs (1992),-0.693785
92: True Romance (1993),-0.727273


Here we have done simple comparison for average ratings for male and female users for each movie and calculated the difference. This is personalisation at some level as we are taking the demographic of the user into consideration.

You could also compute the % of ratings 4+ separately for males and females and again find out the movies with the largest difference in both directions positive and negative

In [28]:
df_m = df_m.drop(['sex'],axis = 1)
df_f = df_f.drop(['sex'],axis = 1)

In [29]:
df_m_four = df_m.apply(lambda x: x[x>=4]).count(axis=0).sort_values(ascending=False) / df_m.apply(lambda x: x).count(axis=0)
df_f_four = df_f.apply(lambda x: x[x>=4]).count(axis=0).sort_values(ascending=False) / df_f.apply(lambda x: x).count(axis=0)

In [30]:
dif_g = df_f_four - df_m_four
dif_g.sort_values(ascending=False)

Unnamed: 0_level_0,0
movie,Unnamed: 1_level_1
"476: First Wives Club, The (1996)",0.344353
"38: Net, The (1995)",0.306934
225: 101 Dalmatians (1996),0.298450
485: My Fair Lady (1964),0.279403
402: Ghost (1990),0.274242
...,...
523: Cool Hand Luke (1967),-0.297573
"331: Edge, The (1997)",-0.331281
156: Reservoir Dogs (1992),-0.355367
92: True Romance (1993),-0.357955


Clearly some new movies have popped up here. We saw how easy it was to create non personalised recommender systems for movies and we did not need to remove old movies as when it comes to movies users are not that worried about how old is a particular movie rather they are interested in actors, genre etc.

Using Occupation for Recommendations Average Ratings by Occupation

In [31]:
# Merge occupation data with ratings data
ratings = ratings.merge(users[['user_id', 'occupation']], how='left', on='user_id')

# Create occupation-based user-movie matrix
occupation_movie_matrix = ratings.pivot(index='user_id', columns='movie', values='rating')

# Calculate average ratings for each movie by occupation
occupation_avg_ratings = occupation_movie_matrix.groupby(users['occupation']).mean()

# Get top 5 recommendations for a specific occupation (e.g., 'student')
top_5_student = occupation_avg_ratings.loc['student'].sort_values(ascending=False).head(5)
print("Top 5 recommendations for students based on average ratings:\n", top_5_student)

Top 5 recommendations for students based on average ratings:
 movie
483: Casablanca (1942)             4.688889
178: 12 Angry Men (1957)           4.600000
169: Wrong Trousers, The (1993)    4.550000
603: Rear Window (1954)            4.447368
318: Schindler's List (1993)       4.388889
Name: student, dtype: float64


4+ Ratings by Occupation

In [32]:
# Calculate the percentage of 4+ ratings for each movie by occupation
occupation_4plus_ratings = occupation_movie_matrix.groupby(users['occupation']).apply(
    lambda x: x[x >= 4].count(axis=0) / x.count(axis=0)
)

# Get top 5 recommendations for a specific occupation (e.g., 'student')
top_5_student_4plus = occupation_4plus_ratings.loc['student'].sort_values(ascending=False).head(5)
print("Top 5 recommendations for students based on 4+ ratings:\n", top_5_student_4plus)

Top 5 recommendations for students based on 4+ ratings:
 movie
483: Casablanca (1942)                   0.977778
657: Manchurian Candidate, The (1962)    0.961538
86: Remains of the Day, The (1993)       0.928571
479: Vertigo (1958)                      0.926829
603: Rear Window (1954)                  0.921053
Name: student, dtype: float64


Using Age for Recommendations
Average Ratings by Age Group

In [33]:
# Define age groups
age_bins = [0, 18, 25, 35, 45, 55, 100]
age_labels = ['<18', '18-24', '25-34', '35-44', '45-54', '55+']
ratings['age_group'] = pd.cut(users['age'], bins=age_bins, labels=age_labels)

# Create age group-based user-movie matrix
age_movie_matrix = ratings.pivot(index='user_id', columns='movie', values='rating')

# Calculate average ratings for each movie by age group
age_avg_ratings = age_movie_matrix.groupby(ratings['age_group']).mean()

# Get top 5 recommendations for a specific age group (e.g., '18-24')
top_5_young_adults = age_avg_ratings.loc['18-24'].sort_values(ascending=False).head(5)
print("Top 5 recommendations for young adults (18-24) based on average ratings:\n", top_5_young_adults)

Top 5 recommendations for young adults (18-24) based on average ratings:
 movie
178: 12 Angry Men (1957)                4.555556
483: Casablanca (1942)                  4.491228
64: Shawshank Redemption, The (1994)    4.466667
134: Citizen Kane (1941)                4.459459
12: Usual Suspects, The (1995)          4.447368
Name: 18-24, dtype: float64


  age_avg_ratings = age_movie_matrix.groupby(ratings['age_group']).mean()


4+ Ratings by Age Group

In [34]:
# Calculate the percentage of 4+ ratings for each movie by age group
age_4plus_ratings = age_movie_matrix.groupby(ratings['age_group']).apply(
    lambda x: x[x >= 4].count(axis=0) / x.count(axis=0)
)

# Get top 5 recommendations for a specific age group (e.g., '18-24')
top_5_young_adults_4plus = age_4plus_ratings.loc['18-24'].sort_values(ascending=False).head(5)
print("Top 5 recommendations for young adults (18-24) based on 4+ ratings:\n", top_5_young_adults_4plus)

Top 5 recommendations for young adults (18-24) based on 4+ ratings:
 movie
223: Sling Blade (1996)                 1.000000
479: Vertigo (1958)                     0.945946
64: Shawshank Redemption, The (1994)    0.933333
603: Rear Window (1954)                 0.921053
427: To Kill a Mockingbird (1962)       0.914286
Name: 18-24, dtype: float64


  age_4plus_ratings = age_movie_matrix.groupby(ratings['age_group']).apply(


Explanation:

Merging Data: We merge the relevant demographic information (occupation or age) from the users dataframe into the ratings dataframe.
Creating User-Movie Matrix: We create a new user-movie matrix based on the chosen demographic.
Calculating Ratings: We calculate either the average ratings or the percentage of 4+ ratings for each movie, grouped by the demographic.
Getting Recommendations: We retrieve the top 5 recommendations for a specific demographic group based on the chosen rating metric.
This approach allows you to explore recommendations tailored to different user demographics, providing a more personalized experience. You can experiment with different demographic groups and rating metrics to see how the recommendations vary.