# Assignment 05
# Recommendation Systems

Build a collaborative-filtering-based recommendation system on joke ratings.

[Dataset](https://drive.google.com/file/d/1xCJdjbnythAcc8f3_0cIlEWuUiO3PqsR/view?usp=sharing)

Importing necessary libraries and loading the dataset

In [33]:
import pandas as pd
import numpy as np

In [34]:
data = pd.read_csv('/content/drive/MyDrive/Data/jokes-data.csv')

In [35]:
data.head()

Unnamed: 0,id,user_id,joke_id,Rating
0,31030_110,31030,110,2.75
1,16144_109,16144,109,5.094
2,23098_6,23098,6,-6.438
3,14273_86,14273,86,4.406
4,18419_134,18419,134,9.375


Analyzing the dataset

In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1092059 entries, 0 to 1092058
Data columns (total 4 columns):
 #   Column   Non-Null Count    Dtype  
---  ------   --------------    -----  
 0   id       1092059 non-null  object 
 1   user_id  1092059 non-null  int64  
 2   joke_id  1092059 non-null  int64  
 3   Rating   1092059 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 33.3+ MB


Checking for null entries

In [37]:
data.isna().sum()

id         0
user_id    0
joke_id    0
Rating     0
dtype: int64

In [38]:
data['user_id'].nunique() # Number of Users

40863

In [39]:
data['joke_id'].nunique() # Number of Jokes

139

In [40]:
print("Minimum: ", data['Rating'].min())  # Rating range
print("Maximum: ", data['Rating'].max())

Minimum:  -10.0
Maximum:  10.0
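
Before building a recommender, it helps to know how sparse the user-joke matrix is. The sketch below uses only the counts reported above (1092059 ratings, 40863 users, 139 jokes). The variable names are illustrative, and it assumes each (user, joke) pair appears at most once in the data.

```python
# Counts taken from the outputs above
n_ratings = 1_092_059
n_users = 40_863
n_jokes = 139

# Fraction of all possible (user, joke) pairs that actually carry a rating
# (assumes no duplicate user/joke pairs in the dataset)
density = n_ratings / (n_users * n_jokes)

# Average number of jokes rated per user
avg_ratings_per_user = n_ratings / n_users

print(f"Density: {density:.1%}")
print(f"Average ratings per user: {avg_ratings_per_user:.1f}")
```

Under that assumption, roughly 19% of the matrix is filled and a user has rated about 27 jokes on average, so most user-joke pairs have no rating.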


1. Popularity Based Recommendation - Count Based

* Assumption : The joke with the most ratings is the most popular.
* Based on simple count statistics.
* Cannot produce personalized results.

In [41]:
rating_count = pd.DataFrame(data.groupby('joke_id')['Rating'].count())
rating_count = rating_count.sort_values('Rating',ascending = False)
rating_count.head(10)

joke_id,Rating
6,27498
8,27485
5,27402
3,27369
4,27368
2,27361
7,27325
9,27125
79,17097
104,17082


Under the assumption that the joke with the most ratings is the most popular, the joke with joke_id = 6 ranks first. The table above lists the top 10 jokes by number of ratings.

The following steps check the average rating received by joke_id = 6.

In [42]:
joke_id_6_data = data[data['joke_id']==6]
joke_id_6_data

Unnamed: 0,id,user_id,joke_id,Rating
2,23098_6,23098,6,-6.438
26,18_6,18,6,-8.250
54,32451_6,32451,6,0.562
150,16020_6,16020,6,2.375
210,20518_6,20518,6,-4.219
...,...,...,...,...
1091979,28561_6,28561,6,-1.406
1091983,29396_6,29396,6,-3.781
1091987,13802_6,13802,6,-5.062
1092005,29911_6,29911,6,-7.500


In [43]:
joke_id_6_data['Rating'].mean()  # Average rating of joke_id = 6

-1.6126007709651609

In [44]:
rating_count.tail(10)

joke_id,Rating
17,259
21,158
51,155
63,126
33,118
70,113
41,112
42,109
106,104
90,100


The table above lists the 10 least popular jokes by number of ratings.

In [45]:
joke_id_90_data = data[data['joke_id']==90]
joke_id_90_data

Unnamed: 0,id,user_id,joke_id,Rating
45660,348_90,348,90,-1.219
51749,520_90,520,90,-3.656
100495,185_90,185,90,-4.906
125813,26_90,26,90,0.594
139050,94_90,94,90,-1.781
...,...,...,...,...
1016876,216_90,216,90,-9.500
1044578,262_90,262,90,1.812
1046746,100_90,100,90,2.062
1053115,464_90,464,90,-0.344


In [46]:
joke_id_90_data['Rating'].mean()

1.1824500000000002

In [47]:
rating_count.describe()

Unnamed: 0,Rating
count,139.0
mean,7856.539568
std,6120.858527
min,100.0
25%,4032.0
50%,6467.0
75%,10097.5
max,27498.0


From the summary statistics above, the most popular joke received 27498 ratings and the least popular joke received 100 ratings.
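
The count-based ranking says nothing about how much a joke is liked: joke 6 has the most ratings but an average of about -1.61, while joke 90 has the fewest ratings and an average of about 1.18. A minimal sketch of viewing both statistics side by side is shown below. The toy DataFrame and the `min_ratings` threshold are hypothetical and only mirror the column names of the dataset.

```python
import pandas as pd

# Hypothetical toy ratings using the same column names as the dataset
toy = pd.DataFrame({
    "joke_id": [6, 6, 6, 6, 90, 90, 43, 43, 43],
    "Rating":  [-6.0, -8.0, 1.0, 2.0, 1.0, 2.0, 7.0, 5.0, 3.0],
})

# Count and mean per joke in a single pass
summary = (
    toy.groupby("joke_id")["Rating"]
       .agg(n_ratings="count", mean_rating="mean")
       .sort_values("n_ratings", ascending=False)
)
print(summary)

# Rank by mean rating, keeping only jokes with enough ratings
min_ratings = 3  # hypothetical threshold, not taken from the original analysis
ranked = (
    summary[summary["n_ratings"] >= min_ratings]
    .sort_values("mean_rating", ascending=False)
)
print(ranked)
```

In this toy data, joke 6 leads by count but falls behind joke 43 once jokes are ranked by mean rating.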

2. Popularity Based Recommendation - Average Rating

* Assumption : The joke with the highest average rating is the most popular.
* Based on the simple mean of each joke's ratings.
* Cannot produce personalized results.

In [48]:
rating_mean = pd.DataFrame(data.groupby('joke_id')['Rating'].mean())
rating_mean = rating_mean.sort_values('Rating', ascending=False)
rating_mean.head(10)

joke_id,Rating
43,3.733667
95,3.707069
79,3.637524
119,3.576558
25,3.557615
62,3.533092
22,3.498394
94,3.450939
58,3.379338
96,3.326277


Joke 43 is the top-rated joke by average rating.

In [49]:
rating_mean.tail(10)

joke_id,Rating
65,-0.940397
34,-0.951875
14,-1.019207
5,-1.415667
48,-1.527532
6,-1.612601
2,-1.933251
114,-2.153604
1,-2.24677
131,-2.734815


Joke 131 is the least liked joke by average rating.

In [50]:
joke_id_43_data = data[data['joke_id'] == 43]
joke_id_43_data.head(10)

Unnamed: 0,id,user_id,joke_id,Rating
171,33076_43,33076,43,7.156
195,37089_43,37089,43,6.594
258,36030_43,36030,43,-0.438
266,25445_43,25445,43,5.875
361,26714_43,26714,43,4.719
369,28436_43,28436,43,2.469
425,24598_43,24598,43,7.844
446,25821_43,25821,43,9.906
448,30148_43,30148,43,3.375
485,32831_43,32831,43,2.844


In [51]:
joke_id_43_data.describe()

Unnamed: 0,user_id,joke_id,Rating
count,15427.0,15427.0,15427.0
mean,21563.892202,43.0,3.733667
std,11739.982575,0.0,4.343023
min,1.0,43.0,-10.0
25%,11498.5,43.0,1.562
50%,22030.0,43.0,4.344
75%,32011.0,43.0,6.969
max,40863.0,43.0,10.0


In [52]:
joke_id_43_data['Rating'].mean()

3.733666688273806

# Collaborative Recommendation System

In [53]:
data.head()

Unnamed: 0,id,user_id,joke_id,Rating
0,31030_110,31030,110,2.75
1,16144_109,16144,109,5.094
2,23098_6,23098,6,-6.438
3,14273_86,14273,86,4.406
4,18419_134,18419,134,9.375


In [54]:
data.drop('id',axis=1,inplace=True)

In [55]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3162731 sha256=667a6aa6ba256f106e97e4e31e036d9a2a434597cbda5454cd4b7673dfad7c4a
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [56]:
from surprise import Dataset, Reader, KNNWithMeans
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse

In [57]:
reader = Reader(rating_scale=(-10, 10))

In [58]:
data = Dataset.load_from_df(data[['user_id', 'joke_id', 'Rating']], reader)

In [59]:
# Splitting the data into training and testing sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [60]:
# Using KNNWithMeans algorithm
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNWithMeans(sim_options=sim_options)
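
With `'user_based': False`, similarities are computed between jokes rather than between users. The sketch below shows a plain Pearson correlation between two jokes over the users who rated both. It is a simplification: `pearson_baseline` centers ratings on baseline estimates instead of means and shrinks similarities that rest on few common users. The function name and toy ratings are hypothetical.

```python
from math import sqrt

def pearson(ratings_i, ratings_j):
    """Pearson correlation between two jokes over their common raters.

    Each argument maps user -> rating for one joke.
    """
    common = sorted(set(ratings_i) & set(ratings_j))
    if len(common) < 2:
        return 0.0  # not enough overlap to measure a correlation
    xi = [ratings_i[u] for u in common]
    xj = [ratings_j[u] for u in common]
    mean_i = sum(xi) / len(xi)
    mean_j = sum(xj) / len(xj)
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(xi, xj))
    var_i = sum((a - mean_i) ** 2 for a in xi)
    var_j = sum((b - mean_j) ** 2 for b in xj)
    if var_i == 0 or var_j == 0:
        return 0.0  # constant ratings carry no correlation information
    return cov / sqrt(var_i * var_j)

# Hypothetical toy ratings on the -10..10 scale
joke_a = {"u1": 2.0, "u2": 4.0, "u3": 6.0}
joke_b = {"u1": 1.0, "u2": 2.0, "u3": 3.0, "u4": 9.0}  # u4 is ignored (not in joke_a)
joke_c = {"u1": 3.0, "u2": 2.0, "u3": 1.0}

sim_ab = pearson(joke_a, joke_b)
sim_ac = pearson(joke_a, joke_c)
print(sim_ab, sim_ac)
```

Jokes that the same users rate in the same direction get a similarity near 1, and jokes rated in opposite directions get a similarity near -1.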

In [61]:
# Fitting the algorithm on the training set
algo.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7e88eecad120>

In [62]:
# Making predictions on the test set
predictions = algo.test(testset)
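
For an item-based KNN with means, each estimate starts from the target joke's mean rating and adds the similarity-weighted deviations of the user's ratings on neighboring jokes from those jokes' means. A minimal pure-Python sketch of that idea follows. The toy ratings, the similarity values, and the function name are hypothetical, and this is not Surprise's implementation.

```python
from statistics import mean

# Hypothetical toy data: ratings[joke][user] on the -10..10 scale
ratings = {
    "A": {"u1": 4.0, "u2": 6.0, "u3": 2.0},
    "B": {"u1": 1.0, "u2": 3.0},
    "C": {"u1": -2.0, "u3": -6.0},
}
# Hypothetical similarities of joke B to the other jokes
sim_to_B = {"A": 0.9, "C": 0.5}

def predict_with_means(user, target, ratings, sims, k=40):
    """Joke mean plus similarity-weighted deviations of the user's other ratings."""
    item_mean = {j: mean(r.values()) for j, r in ratings.items()}
    # Neighbors: jokes the user has rated that have positive similarity to the target
    candidates = [(s, j) for j, s in sims.items() if s > 0 and user in ratings[j]]
    top_k = sorted(candidates, reverse=True)[:k]
    if not top_k:
        return item_mean[target]  # no usable neighbors: fall back to the joke's mean
    num = sum(s * (ratings[j][user] - item_mean[j]) for s, j in top_k)
    den = sum(s for s, _ in top_k)
    return item_mean[target] + num / den

estimate = predict_with_means("u3", "B", ratings, sim_to_B)
print(estimate)
```

Here user u3 rated jokes A and C two points below their means, so the estimate for joke B lands two points below B's mean of 2.0, that is, close to 0.0 up to floating-point rounding.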

In [63]:
# Evaluate the model using RMSE (verbose=False stops rmse() from printing the value a second time)
print("RMSE : ", rmse(predictions, verbose=False))

RMSE :  4.12090811317429
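
RMSE is the square root of the mean squared difference between true and estimated ratings, so a value of about 4.12 is measured in rating points on the -10 to 10 scale. A minimal sketch of the computation on hypothetical (true, estimated) pairs is given below. The function name `rmse_manual` is illustrative.

```python
from math import sqrt

def rmse_manual(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy pairs: errors of 1, -1, 2 and 0 rating points
toy_pairs = [(5.0, 4.0), (-3.0, -2.0), (2.0, 0.0), (7.0, 7.0)]
toy_rmse = rmse_manual(toy_pairs)
print(toy_rmse)
```

With Surprise predictions, the same pairs could be built as `[(p.r_ui, p.est) for p in predictions]`.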


In [64]:
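# Predicting the likely rating that user_id 31030 would give to joke_id 43.
# Caution: the DataFrame stores user_id and joke_id as int64, so raw IDs passed to
# predict() should be integers (31030, 43). String IDs are unknown to the trainset,
# in which case predict() falls back to the global mean of the training ratings.
user_id = '31030'  # Example user ID
joke_id = '43'    # Example joke ID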

prediction_ex = algo.predict(user_id, joke_id)
predicted_rating = prediction_ex.est
print(f"Predicted rating for user {user_id} and joke {joke_id}: {predicted_rating}")

Predicted rating for user 31030 and joke 43: 1.7564069206441502
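
The type of the raw IDs matters when calling `predict`. The dataset was loaded with integer `user_id` and `joke_id` columns, so string IDs such as `'31030'` do not match any known user or joke, and Surprise then returns its default prediction (the training set's global mean) instead of a neighborhood-based estimate. The sketch below models the lookup with a plain dictionary. The mapping and the helper name are hypothetical simplifications, not Surprise internals.

```python
# Simplified, hypothetical model of a raw-ID -> inner-ID table.
# Keys are integers because user_id was an int64 column in the DataFrame.
raw_to_inner_user = {31030: 0, 16144: 1, 23098: 2}

def lookup(raw_id):
    """Return the inner id, or None when the raw id is unknown."""
    return raw_to_inner_user.get(raw_id)

as_string = lookup('31030')  # string key does not match the integer key
as_int = lookup(31030)       # integer key matches

print(as_string)  # None
print(as_int)     # 0
```

In Surprise, the returned prediction carries a `details` dictionary whose `'was_impossible'` entry shows whether the fallback was used, so printing `prediction_ex.details` is a quick way to check.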
