# Popularity recommender
Popularity recommenders are like a music chart that tells you the most popular songs everyone is listening to. Just as the chart reflects the songs loved by a large number of people, popularity recommenders suggest items based on what's trending and widely liked by a broad audience. They aim to give recommendations that are in line with what's currently popular.

---
##1.&nbsp;Import libraries and files 💾
The dataset we're working with for this project is a smaller portion of the [BookCrossing dataset](http://www2.informatik.uni-freiburg.de/~cziegler/BX/). The BookCrossing (BX) dataset was collected by Cai-Nicolas Ziegler during a 4-week data collection period (August/September 2004) from the Book-Crossing community. It includes information from 278,858 users (with their identities anonymized but with demographic details) and consists of 1,149,780 ratings (both explicit and implicit) for 271,379 books. Because this dataset is massive, we decided to use a smaller chunk of it for our project.

In [None]:
import pandas as pd

In [None]:
url = 'https://drive.google.com/file/d/1yFwxNVF0MuAsiFTAZMfoVGt1nIOatByg/view?usp=sharing'
path = 'https://drive.google.com/uc?id='+url.split('/')[-2]
df = pd.read_csv(path)
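
The URL handling in the cell above can be read as a small helper. This is only a sketch: the function name `drive_share_to_direct` is made up for illustration, and it assumes the share link has the `.../file/d/<FILE_ID>/view?...` shape used in this notebook.

```python
def drive_share_to_direct(share_url: str) -> str:
    """Turn a Google Drive share link into the direct link used above (hypothetical helper)."""
    # Share links look like .../file/d/<FILE_ID>/view?usp=sharing,
    # so the file ID is the second-to-last segment when splitting on '/'.
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?id=' + file_id


share_url = 'https://drive.google.com/file/d/1yFwxNVF0MuAsiFTAZMfoVGt1nIOatByg/view?usp=sharing'
direct_url = drive_share_to_direct(share_url)
print(direct_url)
```

The resulting string is what gets passed to `pd.read_csv` in the cell above.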

---
##2.&nbsp;Explore the data 👩‍🚀

In [None]:
df.shape

(47905, 18)

Even though this is a reduced dataset, we still have almost 48,000 rows across 18 columns.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47905 entries, 0 to 47904
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   user_id                   47905 non-null  int64  
 1   user_location             47905 non-null  object 
 2   user_age                  47905 non-null  float64
 3   book_isbn                 47905 non-null  object 
 4   book_rating               47905 non-null  int64  
 5   book_title                47905 non-null  object 
 6   book_author               47905 non-null  object 
 7   book_year_of_publication  47905 non-null  float64
 8   book_publisher            47905 non-null  object 
 9   img_s                     47905 non-null  object 
 10  img_m                     47905 non-null  object 
 11  img_l                     47905 non-null  object 
 12  book_summary              47905 non-null  object 
 13  book_language             47905 non-null  object 
 14  book_c

Not all of these columns are particularly helpful. The `img` columns, for example, probably won't be useful for our analysis. However, `user_id`, `book_isbn`, and `book_rating` will be extremely valuable, and for now we'll focus on these three. It's worth mentioning that many recommendation systems divide users into 'neighbourhoods' to improve recommendations and speed up calculations. If you'd like to explore this in the future, columns such as `user_age` and `user_location` could come in handy for creating these neighbourhoods.
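
As a rough illustration of the neighbourhood idea, users could be grouped into age bands with `pd.cut`. This is a minimal sketch on made-up toy data: the bin edges, the labels, and the `age_neighbourhood` column name are all arbitrary assumptions, not something taken from the dataset.

```python
import pandas as pd

# Toy users standing in for the real DataFrame (values are made up)
toy_users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'user_age': [15.0, 24.0, 34.0, 51.0, 70.0],
})

# Hypothetical age bands; the cut points are arbitrary choices for illustration
age_bins = [0, 18, 30, 45, 60, 120]
age_labels = ['under_18', '18_29', '30_44', '45_59', '60_plus']

# right=False makes each band include its lower edge and exclude its upper edge
toy_users['age_neighbourhood'] = pd.cut(
    toy_users['user_age'], bins=age_bins, labels=age_labels, right=False
)
print(toy_users)
```

Recommendations could then be computed separately within each band, at the cost of having fewer ratings per group.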

In [None]:
df.describe()

Unnamed: 0,user_id,user_age,book_rating,book_year_of_publication
count,47905.0,47905.0,47905.0,47905.0
mean,138137.811878,35.638379,7.854504,1997.611398
std,80685.985628,9.952574,1.780345,5.484309
min,9.0,7.0,1.0,1959.0
25%,68185.0,30.0,7.0,1995.0
50%,136240.0,34.7439,8.0,1999.0
75%,209272.0,38.0,9.0,2002.0
max,278854.0,99.0,10.0,2004.0


Based on `.describe()`, user ages span from 7 to 99, with the middle half (25th to 75th percentile) falling between 30 and 38. Book ratings are measured on a scale of 1 to 10, and users appear to be generous: the middle half of the ratings lie between 7 and 9. The books in our dataset have publication years ranging from 1959 to 2004, with the middle half published between 1995 and 2002.

In [None]:
df.head()

Unnamed: 0,user_id,user_location,user_age,book_isbn,book_rating,book_title,book_author,book_year_of_publication,book_publisher,img_s,img_m,img_l,book_summary,book_language,book_category,publisher_city,publisher_state,publisher_country
0,3329,"grantsville, utah, usa",34.7439,440234743,8,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],grantsville,utah,usa
1,7346,"sunnyvale, california, usa",49.0,440234743,9,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],sunnyvale,california,usa
2,7352,"houston, texas, usa",53.0,440234743,8,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],houston,texas,usa
3,9419,"somewhere, texas, usa",34.7439,440234743,5,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],somewhere,texas,usa
4,11224,"tumwater, washington, usa",51.0,440234743,6,The Testament,John Grisham,1999.0,Dell,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,http://images.amazon.com/images/P/0440234743.0...,"A suicidal billionaire, a burnt-out Washington...",en,['Fiction'],tumwater,washington,usa


How many individual books do we have in the DataFrame?

In [None]:
df["book_isbn"].nunique()

500

How many individual users do we have in the DataFrame?

In [None]:
df["user_id"].nunique()

20366

Let's also have a look at the full distribution of ratings:

In [None]:
df["book_rating"].value_counts(normalize=True)

8     0.253543
10    0.203173
9     0.190627
7     0.161445
5     0.080430
6     0.070494
4     0.017180
3     0.012713
2     0.006304
1     0.004091
Name: book_rating, dtype: float64

---
##3.&nbsp;How should we build a popularity recommender? 📚

###3.1.&nbsp;Highest rated books
Let's look at the most popular books by average rating.

In [None]:
rating_count_df = df.groupby('book_isbn')['book_rating'].agg(['mean', 'count']).reset_index()
rating_count_df.nlargest(5, ['mean', 'count'])

Unnamed: 0,book_isbn,mean,count
113,0345339738,9.402597,77
234,0439139597,9.262774,137
237,043936213X,9.207547,53
112,0345339711,9.120482,83
233,0439136369,9.082707,133


Book with the highest mean score

In [None]:
highest_rating_isbn = rating_count_df.nlargest(1, 'mean')['book_isbn'].values[0]

highest_rated_isbn_mask = df['book_isbn'] == highest_rating_isbn
book_info_columns = ['book_isbn', 'book_title', 'book_author', 'book_year_of_publication']

df.loc[highest_rated_isbn_mask, book_info_columns].drop_duplicates()

Unnamed: 0,book_isbn,book_title,book_author,book_year_of_publication
31206,345339738,"The Return of the King (The Lord of the Rings,...",J.R.R. TOLKIEN,1986.0


###3.2.&nbsp;Most rated books
But are the most highly rated books also the most widely read books?

In [None]:
rating_count_df.sort_values(by=['count', 'mean'], ascending=False).head()

Unnamed: 0,book_isbn,mean,count
91,316666343,8.18529,707
481,971880107,4.390706,581
196,385504209,8.435318,487
61,312195516,8.182768,383
13,60928336,7.8875,320


Book with the most reviews

In [None]:
most_rated_isbn = rating_count_df.nlargest(1, 'count')['book_isbn'].values[0]
most_rated_isbn_mask = df['book_isbn'] == most_rated_isbn

df.loc[most_rated_isbn_mask, book_info_columns].drop_duplicates()

Unnamed: 0,book_isbn,book_title,book_author,book_year_of_publication
4254,316666343,The Lovely Bones: A Novel,Alice Sebold,2002.0


Looks like some books are well loved and some books are well read. We'll need to strike a balance between the two to find the overall top 10 most popular books.

---
##4.&nbsp;Challenge: build a popularity recommender 😃
Find a hybrid system to sort books, so that you can recommend the "best" books that are both highly rated and popular.

In [None]:
# sample solution

Menna

In [None]:
rating_count_df = df.groupby('book_isbn')['book_rating'].agg(['mean', 'count'])
rating_count_df

Unnamed: 0_level_0,mean,count
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1
002542730X,7.805195,77
0060096195,8.132075,53
006016848X,6.947368,57
0060173289,7.610169,59
0060175400,8.384615,78
...,...,...
1573229326,6.673077,104
1573229571,7.910714,56
1592400876,8.500000,56
1844262553,8.600000,50


In [None]:
weight_rating_count = 0.5
weight_average_rating = 0.5

In [None]:
# Apply each weight to the column it is named after
rating_count_df['popularity_score'] = (
    weight_average_rating * rating_count_df['mean'] +
    weight_rating_count * rating_count_df['count']
)
rating_count_df

Unnamed: 0_level_0,mean,count,popularity_score
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
002542730X,7.805195,77,42.402597
0060096195,8.132075,53,30.566038
006016848X,6.947368,57,31.973684
0060173289,7.610169,59,33.305085
0060175400,8.384615,78,43.192308
...,...,...,...
1573229326,6.673077,104,55.336538
1573229571,7.910714,56,31.955357
1592400876,8.500000,56,32.250000
1844262553,8.600000,50,29.300000


In [None]:
popular_books = rating_count_df.sort_values(by='popularity_score', ascending=False)
popular_books

Unnamed: 0_level_0,mean,count,popularity_score
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0316666343,8.185290,707,357.592645
0971880107,4.390706,581,292.695353
0385504209,8.435318,487,247.717659
0312195516,8.182768,383,195.591384
0060928336,7.887500,320,163.943750
...,...,...,...
0425083837,8.333333,48,28.166667
1400031346,8.229167,48,28.114583
0842329250,8.574468,47,27.787234
0156007754,8.425532,47,27.712766


In [None]:
num_recommendations = 10
top_recommendations = popular_books.head(num_recommendations)
print(top_recommendations)

                mean  count  popularity_score
book_isbn                                    
0316666343  8.185290    707        357.592645
0971880107  4.390706    581        292.695353
0385504209  8.435318    487        247.717659
0312195516  8.182768    383        195.591384
0060928336  7.887500    320        163.943750
059035342X  8.939297    313        160.969649
0142001740  8.452769    307        157.726384
0446672211  8.142373    295        151.571186
044023722X  7.338078    281        144.169039
0452282152  7.982014    278        142.991007
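
One thing to notice in this ranking: `mean` lives on a 1 to 10 scale while `count` runs into the hundreds, so the count term dominates the score, and a book with a mean rating of 4.39 ends up in second place. A possible remedy is to rescale both columns to the 0-1 range before blending them. The sketch below uses four rows copied from the outputs in this notebook; the `min_max` helper is a made-up name, and the scaling is done over these four rows only, so the numbers will differ from a run on the full data.

```python
import pandas as pd

# Four rows copied from the outputs in this notebook (mean rating, rating count)
toy = pd.DataFrame(
    {
        'mean': [8.185290, 4.390706, 8.939297, 9.402597],
        'count': [707, 581, 313, 77],
    },
    index=pd.Index(['0316666343', '0971880107', '059035342X', '0345339738'], name='book_isbn'),
)


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a Series to the 0-1 range (hypothetical helper)."""
    return (series - series.min()) / (series.max() - series.min())


# Raw blend: the count term swamps the mean term
toy['raw_score'] = 0.5 * toy['mean'] + 0.5 * toy['count']

# Rescaled blend: both terms contribute on the same 0-1 scale
toy['scaled_score'] = 0.5 * min_max(toy['mean']) + 0.5 * min_max(toy['count'])

raw_order = toy.sort_values('raw_score', ascending=False).index.tolist()
scaled_order = toy.sort_values('scaled_score', ascending=False).index.tolist()
print(raw_order)
print(scaled_order)
```

With the rescaled score, the low-rated but heavily rated book drops to the bottom of this small sample.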


Aarti

In [None]:
# Calculate average rating for each book
average_ratings = df.groupby('book_title')['book_rating'].mean().reset_index()
average_ratings.columns = ['book_title', 'average_rating']

# Popularity inputs: the number of reviews and the book's latest publication year
popularity = df.groupby('book_title').agg({'user_id': 'count', 'book_year_of_publication': 'max'}).reset_index()
popularity.columns = ['book_title', 'review_count', 'max_publication_year']

# Popularity score = review factor * recency factor
# review factor: 1 for the most-reviewed book, closer to 0 for books with fewer reviews
# recency factor: 1 for the most recently published book, 0 for the oldest (min-max scaled)
year_min = popularity['max_publication_year'].min()
year_max = popularity['max_publication_year'].max()
popularity['popularity_score'] = (popularity['review_count'] / popularity['review_count'].max()) * \
                                  ((popularity['max_publication_year'] - year_min) / (year_max - year_min))

# average rating and popularity
book_data = pd.merge(average_ratings, popularity[['book_title', 'popularity_score']], on='book_title')

Merle

In [None]:
# Compute the average rating for each book
average_ratings = df.groupby('book_title')['book_rating'].mean()
average_ratings

book_title
1984                                                                 8.772277
1st to Die: A Novel                                                  7.711864
2nd Chance                                                           7.771812
A Bend in the Road                                                   7.486239
A Case of Need                                                       6.950820
                                                                       ...   
Wish You Well                                                        8.019608
Without Remorse                                                      7.830508
Year of Wonders                                                      8.318182
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values    7.653333
"O" Is for Outlaw                                                    7.333333
Name: book_rating, Length: 463, dtype: float64

In [None]:
# Compute the number of ratings (reviews) each book has received
num_reviews = df.groupby('book_title')['user_id'].count()
num_reviews

book_title
1984                                                                 101
1st to Die: A Novel                                                  236
2nd Chance                                                           149
A Bend in the Road                                                   109
A Case of Need                                                        61
                                                                    ... 
Wish You Well                                                         51
Without Remorse                                                       59
Year of Wonders                                                       88
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values     75
"O" Is for Outlaw                                                     66
Name: user_id, Length: 463, dtype: int64

In [None]:
# Normalize the average ratings
normalized_rating = (average_ratings - average_ratings.min()) / (average_ratings.max() - average_ratings.min())
normalized_rating

book_title
1984                                                                 0.874235
1st to Die: A Novel                                                  0.662656
2nd Chance                                                           0.674617
A Bend in the Road                                                   0.617638
A Case of Need                                                       0.510808
                                                                       ...   
Wish You Well                                                        0.724058
Without Remorse                                                      0.686328
Year of Wonders                                                      0.783631
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values    0.650977
"O" Is for Outlaw                                                    0.587129
Name: book_rating, Length: 463, dtype: float64

In [None]:
# Normalize the number of ratings (reviews)
normalized_num_reviews = num_reviews / num_reviews.max()
normalized_num_reviews

book_title
1984                                                                 0.142857
1st to Die: A Novel                                                  0.333805
2nd Chance                                                           0.210750
A Bend in the Road                                                   0.154173
A Case of Need                                                       0.086280
                                                                       ...   
Wish You Well                                                        0.072136
Without Remorse                                                      0.083451
Year of Wonders                                                      0.124470
Zen and the Art of Motorcycle Maintenance: An Inquiry into Values    0.106082
"O" Is for Outlaw                                                    0.093352
Name: user_id, Length: 463, dtype: float64

In [None]:
# rating_for_scaling = rating_count_df.copy().set_index('book_isbn')

In [None]:
# from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler().set_output(transform='pandas').fit_transform(rating_for_scaling)

In [None]:
# Set the alpha value (you can adjust this as needed)
alpha = 0.35

# Compute the popularity score
popularity_score = (1 - alpha) * normalized_rating + alpha * normalized_num_reviews

# Adding the popularity_score to a new DataFrame for better visualization
result_df = pd.DataFrame({
    'book_title': popularity_score.index,
    'popularity_score': popularity_score.values,
    'normalized_rating': normalized_rating,
    'normalized_num_reviews': normalized_num_reviews
})

result_df.sort_values(by='popularity_score', ascending=False)
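
To get a feel for what `alpha` does, the blend above can be wrapped in a function and run with different values. This is a sketch, not a replacement for the cells above: `blend_popularity` is a made-up name, and the three books use average ratings and review counts copied from the outputs earlier in this section, so the normalisation is over these three rows only.

```python
import pandas as pd


def blend_popularity(avg_rating: pd.Series, n_reviews: pd.Series, alpha: float) -> pd.Series:
    """Blend min-max scaled ratings with max-scaled review counts (hypothetical helper)."""
    norm_rating = (avg_rating - avg_rating.min()) / (avg_rating.max() - avg_rating.min())
    norm_reviews = n_reviews / n_reviews.max()
    score = (1 - alpha) * norm_rating + alpha * norm_reviews
    return score.sort_values(ascending=False)


titles = ['1984', '1st to Die: A Novel', 'A Case of Need']
avg_rating = pd.Series([8.772277, 7.711864, 6.950820], index=titles)
n_reviews = pd.Series([101, 236, 61], index=titles)

by_rating_only = blend_popularity(avg_rating, n_reviews, alpha=0.0)
by_count_only = blend_popularity(avg_rating, n_reviews, alpha=1.0)
blended = blend_popularity(avg_rating, n_reviews, alpha=0.35)
print(blended)
```

With `alpha=0` the ranking follows the average rating alone, with `alpha=1` it follows the review count alone, and values in between trade one off against the other.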

Hana

In [None]:
test_df = df.groupby('book_isbn').agg({'book_rating':'mean','user_id':'count'})
test_df.head()

Unnamed: 0_level_0,book_rating,user_id
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1
002542730X,7.805195,77
0060096195,8.132075,53
006016848X,6.947368,57
0060173289,7.610169,59
0060175400,8.384615,78


In [None]:
test_df['book_rating_normalized'] = (test_df['book_rating'] - test_df['book_rating'].min()) / (test_df['book_rating'].max() - test_df['book_rating'].min())
test_df.head()

Unnamed: 0_level_0,book_rating,user_id,book_rating_normalized
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
002542730X,7.805195,77,0.681278
0060096195,8.132075,53,0.746499
006016848X,6.947368,57,0.510119
0060173289,7.610169,59,0.642365
0060175400,8.384615,78,0.796887


In [None]:
test_df['user_count_normalized'] = (test_df['user_id'] - test_df['user_id'].min()) / (test_df['user_id'].max() - test_df['user_id'].min())
test_df.head()

Unnamed: 0_level_0,book_rating,user_id,book_rating_normalized,user_count_normalized
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
002542730X,7.805195,77,0.681278,0.046899
0060096195,8.132075,53,0.746499,0.01059
006016848X,6.947368,57,0.510119,0.016641
0060173289,7.610169,59,0.642365,0.019667
0060175400,8.384615,78,0.796887,0.048411


In [None]:
rating_weight = 0.75
count_weight = 0.25

In [None]:
test_df['book_rating_weighted'] = test_df['book_rating_normalized'] * rating_weight
test_df.head()

Unnamed: 0_level_0,book_rating,user_id,book_rating_normalized,user_count_normalized,book_rating_weighted
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
002542730X,7.805195,77,0.681278,0.046899,0.510958
0060096195,8.132075,53,0.746499,0.01059,0.559874
006016848X,6.947368,57,0.510119,0.016641,0.382589
0060173289,7.610169,59,0.642365,0.019667,0.481774
0060175400,8.384615,78,0.796887,0.048411,0.597665


In [None]:
test_df['count_weighted'] = test_df['user_count_normalized'] * count_weight
test_df.head()

Unnamed: 0_level_0,book_rating,user_id,book_rating_normalized,user_count_normalized,book_rating_weighted,count_weighted
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
002542730X,7.805195,77,0.681278,0.046899,0.510958,0.011725
0060096195,8.132075,53,0.746499,0.01059,0.559874,0.002648
006016848X,6.947368,57,0.510119,0.016641,0.382589,0.00416
0060173289,7.610169,59,0.642365,0.019667,0.481774,0.004917
0060175400,8.384615,78,0.796887,0.048411,0.597665,0.012103


In [None]:
test_df['final_rating'] = test_df['book_rating_weighted'] + test_df['count_weighted']
test_df.head()

Unnamed: 0_level_0,book_rating,user_id,book_rating_normalized,user_count_normalized,book_rating_weighted,count_weighted,final_rating
book_isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
002542730X,7.805195,77,0.681278,0.046899,0.510958,0.011725,0.522683
0060096195,8.132075,53,0.746499,0.01059,0.559874,0.002648,0.562521
006016848X,6.947368,57,0.510119,0.016641,0.382589,0.00416,0.38675
0060173289,7.610169,59,0.642365,0.019667,0.481774,0.004917,0.486691
0060175400,8.384615,78,0.796887,0.048411,0.597665,0.012103,0.609768


In [None]:
test_df = test_df.sort_values('final_rating',ascending=False)

In [None]:
test_df = test_df.reset_index()
test_df

Unnamed: 0,book_isbn,book_rating,user_id,book_rating_normalized,user_count_normalized,book_rating_weighted,count_weighted,final_rating
0,0316666343,8.185290,707,0.757116,1.000000,0.567837,0.250000,0.817837
1,059035342X,8.939297,313,0.907560,0.403933,0.680670,0.100983,0.781653
2,0385504209,8.435318,487,0.807003,0.667171,0.605252,0.166793,0.772045
3,0439139597,9.262774,137,0.972102,0.137670,0.729076,0.034418,0.763494
4,0345339738,9.402597,77,1.000000,0.046899,0.750000,0.011725,0.761725
...,...,...,...,...,...,...,...,...
495,0380730138,6.685393,89,0.457849,0.065053,0.343386,0.016263,0.359650
496,0385511612,6.648649,74,0.450517,0.042360,0.337888,0.010590,0.348478
497,042516098X,6.640625,64,0.448916,0.027231,0.336687,0.006808,0.343495
498,0140244824,6.530303,66,0.426904,0.030257,0.320178,0.007564,0.327742


In [None]:
test_df.merge(df[['book_title','book_isbn']],how='left',on='book_isbn').drop_duplicates()

Unnamed: 0,book_isbn,book_rating,user_id,book_rating_normalized,user_count_normalized,book_rating_weighted,count_weighted,final_rating,book_title
0,0316666343,8.185290,707,0.757116,1.000000,0.567837,0.250000,0.817837,The Lovely Bones: A Novel
707,059035342X,8.939297,313,0.907560,0.403933,0.680670,0.100983,0.781653,Harry Potter and the Sorcerer's Stone (Harry P...
1020,0385504209,8.435318,487,0.807003,0.667171,0.605252,0.166793,0.772045,The Da Vinci Code
1507,0439139597,9.262774,137,0.972102,0.137670,0.729076,0.034418,0.763494,Harry Potter and the Goblet of Fire (Book 4)
1644,0345339738,9.402597,77,1.000000,0.046899,0.750000,0.011725,0.761725,"The Return of the King (The Lord of the Rings,..."
...,...,...,...,...,...,...,...,...,...
47031,0380730138,6.685393,89,0.457849,0.065053,0.343386,0.016263,0.359650,Vinegar Hill (Oprah's Book Club (Paperback))
47120,0385511612,6.648649,74,0.450517,0.042360,0.337888,0.010590,0.348478,Bleachers
47194,042516098X,6.640625,64,0.448916,0.027231,0.336687,0.006808,0.343495,Hornet's Nest
47258,0140244824,6.530303,66,0.426904,0.030257,0.320178,0.007564,0.327742,Songs in Ordinary Time (Oprah's Book Club (Pap...
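
The index in the output above jumps from 0 to 707 to 1020 because the merge first creates one row per rating (707 rows for the first book) and only then drops the duplicates. An alternative is to reduce the title lookup to one row per ISBN before merging. The sketch below uses made-up toy data; `ratings`, `scores`, and `titles` are hypothetical names standing in for `df`, `test_df`, and the lookup table.

```python
import pandas as pd

# Toy ratings table: one row per rating, so titles repeat (values are made up)
ratings = pd.DataFrame({
    'book_isbn': ['A', 'A', 'A', 'B', 'B'],
    'book_title': ['Book A', 'Book A', 'Book A', 'Book B', 'Book B'],
    'book_rating': [8, 9, 7, 10, 9],
})

# Toy scores table: one row per book, already sorted by score
scores = pd.DataFrame({'book_isbn': ['B', 'A'], 'final_rating': [0.9, 0.6]})

# Keep a single title per ISBN before merging, so no duplicate rows are created
titles = ratings[['book_isbn', 'book_title']].drop_duplicates(subset='book_isbn')

# validate='one_to_one' makes pandas raise if either side has repeated ISBNs
ranked = scores.merge(titles, how='left', on='book_isbn', validate='one_to_one')
print(ranked)
```

The result keeps the sorted order of the scores table and has a clean 0, 1, 2, ... index.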


Another option is a weighted rating (WR):

$$WR = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C$$

where,  

v is the number of votes (ratings) for the book  
m is the minimum number of votes required to be listed in the chart  
R is the average rating of the book  
C is the mean vote across the whole dataset  
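
The weighted rating formula above can be sketched as a small function. The three books below use mean ratings and counts copied from the outputs in this notebook, and `C` is the overall mean of `book_rating` reported by `df.describe()`. The threshold `m = 100` and the function name `weighted_rating` are assumptions chosen purely for illustration.

```python
import pandas as pd


def weighted_rating(R: pd.Series, v: pd.Series, m: float, C: float) -> pd.Series:
    """WR = v/(v+m) * R + m/(v+m) * C, computed per book."""
    return (v / (v + m)) * R + (m / (v + m)) * C


# R = average rating, v = number of ratings (values copied from earlier outputs)
books = pd.DataFrame(
    {'R': [9.402597, 8.185290, 4.390706], 'v': [77, 707, 581]},
    index=pd.Index(['0345339738', '0316666343', '0971880107'], name='book_isbn'),
)

C = 7.854504  # overall mean rating, taken from the df.describe() output
m = 100       # hypothetical threshold, not a value from the notebook

books['WR'] = weighted_rating(books['R'], books['v'], m, C)
print(books.sort_values('WR', ascending=False))
```

Books with few ratings are pulled towards the overall mean `C`, while books with many ratings stay close to their own average `R`. Raising `m` strengthens that pull.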