# Recommender systems
## From "Exercises in Recommender systems.ipynb" February 26, 2025
(use venv_requirements.txt)

The hand-in exercise for this topic is Exercise 1 from the notebook “Exercises in
Recommender systems.ipynb”. 

## Exercise 1

Use the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, to do the following:

3. Load in the `Ratings.csv` file (on moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.
4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

### 1. Create a Content-based filtering recommender system based on the Course Descriptions.

In [1]:
import kagglehub
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
import os

# dataset_download returns the local directory that holds the downloaded files
dataset_dir = kagglehub.dataset_download("khusheekapoor/coursera-courses-dataset-2021")
file_path = os.path.join(dataset_dir, "Coursera.csv")
data = pd.read_csv(file_path)



In [3]:
data['Course Name']

0       Write A Feature Length Screenplay For Film Or ...
1       Business Strategy: Business Model Canvas Analy...
2                           Silicon Thin Film Solar Cells
3                                    Finance for Managers
4            Retrieve Data using Single-Table SQL Queries
                              ...                        
3517    Capstone: Retrieving, Processing, and Visualiz...
3518                     Patrick Henry: Forgotten Founder
3519    Business intelligence and data analytics: Gene...
3520                                  Rigid Body Dynamics
3521    Architecting with Google Kubernetes Engine: Pr...
Name: Course Name, Length: 3522, dtype: object

In [4]:
#Not Calibrated     82
data['Course Rating'].value_counts()



Course Rating
4.7               740
4.6               623
4.8               598
4.5               389
4.4               242
4.9               180
4.3               165
4.2               121
5                  90
4.1                85
Not Calibrated     82
4                  51
3.8                24
3.9                20
3.6                18
3.7                18
3.5                17
3.4                13
3                  12
3.2                 9
3.3                 6
2.9                 6
2.6                 2
2.8                 2
2.4                 2
1                   2
2                   1
2.5                 1
3.1                 1
1.9                 1
2.3                 1
Name: count, dtype: int64

In [5]:
#Not Calibrated      50
data['Difficulty Level'].value_counts()

Difficulty Level
Beginner          1444
Advanced          1005
Intermediate       837
Conversant         186
Not Calibrated      50
Name: count, dtype: int64

In [6]:
# Drop the rows whose difficulty level or rating is "Not Calibrated"
data = data[(data['Difficulty Level'] != 'Not Calibrated') & (data['Course Rating'] != 'Not Calibrated')]
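
Because of the "Not Calibrated" entries, the rating column holds strings rather than numbers. The following is a minimal sketch, on a made-up three-row frame (`toy` is hypothetical and not part of the dataset), of how the same filter could be followed by a numeric conversion if the ratings are needed as numbers later on:

```python
import pandas as pd

# Hypothetical toy frame mimicking the two columns filtered above.
toy = pd.DataFrame({
    "Course Rating": ["4.7", "Not Calibrated", "5"],
    "Difficulty Level": ["Beginner", "Advanced", "Not Calibrated"],
})

# Same filter as in the cell above: keep rows where neither column is "Not Calibrated".
mask = (toy["Difficulty Level"] != "Not Calibrated") & (toy["Course Rating"] != "Not Calibrated")
clean = toy[mask].copy()

# Once the placeholder strings are gone, the ratings can be parsed as floats.
clean["Course Rating"] = pd.to_numeric(clean["Course Rating"])
print(clean)
print(clean["Course Rating"].dtype)
```

The conversion is not needed for the TF-IDF steps below, which only use text columns.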

In [7]:

from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer object that ignores English stop words
tfidf = TfidfVectorizer(stop_words='english')

# Build the TF-IDF matrix by fitting and transforming the course descriptions. This matrix is the basis for the cosine similarity computed later on.
CD_tfidf_matrix = tfidf.fit_transform(data['Course Description'])
CD_tfidf_matrix.shape

(3392, 19782)

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix for the course descriptions. Each entry measures how similar two TF-IDF vectors (here, two course descriptions) are.
CD_cosine_sim = cosine_similarity(CD_tfidf_matrix, CD_tfidf_matrix)
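
To make the measure concrete, the sketch below computes cosine similarity by hand for three made-up vectors and compares it with `cosine_similarity`. The helper `cosine_manual` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])  # same direction as a
c = np.array([[0.0, 0.0, 3.0]])  # orthogonal to a

def cosine_manual(u, v):
    """Dot product divided by the product of the vector lengths."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine_manual(a, b)  # vectors pointing the same way
sim_ac = cosine_manual(a, c)  # vectors with no shared component

# The library call returns the full pairwise matrix for the stacked rows.
sk = cosine_similarity(np.vstack([a, b, c]))
print(sim_ab, sim_ac)
print(sk)
```

Vectors that point in the same direction score 1 regardless of their length, and vectors with no terms in common score 0.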

In [9]:
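# Reverse map from course title to row position in the similarity matrix, so a title can be looked up.
# Positions (not data.index labels) are used because rows were dropped above without resetting the index.
# Duplicate titles keep their first occurrence.
indices = pd.Series(range(len(data)), index=data['Course Name'])
indices = indices[~indices.index.duplicated(keep='first')]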
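
The reverse map is what turns a course title into a row of the similarity matrix, and that matrix is addressed by position. After rows have been filtered out, index labels and positions no longer coincide. A small sketch on a made-up frame (`toy`, `label_map` and `position_map` are hypothetical names) shows the difference:

```python
import pandas as pd

# Hypothetical frame with four courses, one of which is filtered out.
toy = pd.DataFrame({
    "Course Name": ["A", "B", "C", "D"],
    "keep": [True, False, True, True],
})
toy = toy[toy["keep"]]  # remaining index labels: 0, 2, 3

# Map built from index labels versus map built from row positions.
label_map = pd.Series(toy.index, index=toy["Course Name"])
position_map = pd.Series(range(len(toy)), index=toy["Course Name"])

print(label_map["D"])     # 3: original label, out of range for a 3x3 similarity matrix
print(position_map["D"])  # 2: row position of course "D" after filtering
```

For the first row of the dataset the label and the position are both 0, so a lookup for that course gives the same result either way.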


In [10]:
def recommend_courses(course_tit, cosine_sim):
    #get the index that matches the course title provided
    idx = indices[course_tit]
    
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    #top 3 results
    sim_scores = sim_scores[1:4]
    course_indices = [i[0] for i in sim_scores]
    return data['Course Name'].iloc[course_indices]




In [11]:
recommendations = recommend_courses('Write A Feature Length Screenplay For Film Or Television', CD_cosine_sim)
print(recommendations)

1481    Script Writing: Write a Pilot Episode for a TV...
1629                               Write Your First Novel
3481                                   Transmedia Writing
Name: Course Name, dtype: object
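
The slice `[1:4]` in `recommend_courses` relies on the queried course sorting first in its own similarity row. If another course has an identical description, that is not guaranteed. As a sketch of an alternative (the helper `top_n_similar` is hypothetical, not part of the notebook), the query position can be excluded explicitly:

```python
import numpy as np

def top_n_similar(sim_row, self_pos, n=3):
    """Return the positions of the n most similar items, excluding self_pos."""
    # Sort positions by descending similarity.
    order = np.argsort(sim_row, kind="stable")[::-1]
    # Drop the queried item itself, wherever it landed in the ordering.
    order = order[order != self_pos]
    return order[:n].tolist()

# Made-up similarity row: position 3 is an exact duplicate of position 0.
sim_row = np.array([1.0, 0.2, 0.9, 1.0, 0.5])
print(top_n_similar(sim_row, self_pos=0, n=3))  # [3, 2, 4]
```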


### 2. Create a Content-based filtering recommender system based on the Skills.
We'll still use `data`, as it is already cleaned.
This recommender suggests the courses whose skills are most similar to those of a given course, identified by its title. Similarity is the cosine similarity computed from the TF-IDF matrix of the `Skills` column.

In [12]:
# Refit the vectorizer on the Skills column and build a separate, skills-based similarity matrix
Sk_tfidf_matrix = tfidf.fit_transform(data['Skills'])
Sk_tfidf_matrix.shape
Sk_cosine_sim = cosine_similarity(Sk_tfidf_matrix, Sk_tfidf_matrix)

def recommend_courses_based_on_skills(course_tit, cosine_sim):
    # get the index that matches the course title provided
    idx = indices[course_tit]

    # pair every course position with its similarity to the given course, most similar first
    sk_sim_scores = list(enumerate(cosine_sim[idx]))
    sk_sim_scores = sorted(sk_sim_scores, key=lambda x: x[1], reverse=True)

    # top 3 results (position 0 is skipped, as it is expected to be the course itself)
    sk_sim_scores = sk_sim_scores[1:4]
    course_indices = [i[0] for i in sk_sim_scores]
    return data['Course Name'].iloc[course_indices]

recommend_courses_based_on_skills('Write A Feature Length Screenplay For Film Or Television', Sk_cosine_sim)

1451    Creative Writing: The Craft of Setting and Des...
1481    Script Writing: Write a Pilot Episode for a TV...
3462                 Creative Writing: The Craft of Style
Name: Course Name, dtype: object

### 3. Load in the `Ratings.csv` file (on moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated most books. Filter the rating data to only contain the 200 users that rated most books.


In [13]:
import os

# dataset_download returns the local directory that holds the downloaded files
book_dir = kagglehub.dataset_download('arashnic/book-recommendation-dataset')

br = pd.read_csv(os.path.join(book_dir, "Ratings.csv"))
book = pd.read_csv(os.path.join(book_dir, "Books.csv"))





In [None]:
# Count the ratings per user, most active users first.
user_review_count = br.groupby('User-ID').size().reset_index(name='Book-Count').sort_values(by='Book-Count', ascending=False)
# Top 200 users
dedicated_reviewers = user_review_count.head(200)

# Keep only the ratings made by those 200 users.
dedicated_only_br = br[br['User-ID'].isin(dedicated_reviewers['User-ID'])]

# An ISBN-10 can legitimately end in 'X' (its check digit). For simplicity, every ISBN containing a non-digit character is dropped here.
dedicated_only_br = dedicated_only_br[~dedicated_only_br['ISBN'].str.contains(r'\D')]

# Build the user-item matrix; books a user has not rated are filled with 0.
user_matrix = dedicated_only_br.pivot(index='User-ID', columns='ISBN', values='Book-Rating').fillna(0)

corr = user_matrix.T.corr()
corr

User-ID,3363,6251,6575,7346,11601,11676,12538,13552,15408,16634,...,264321,265115,265313,266226,269566,271284,274061,274308,275970,278418
3363,1.000000,-0.000996,0.028205,0.012924,-0.000584,0.009200,0.009346,0.007621,0.012848,-0.001219,...,0.004416,0.004701,-0.000733,-0.000747,-0.000941,-0.000341,-0.000995,-0.000456,-0.000796,-0.000694
6251,-0.000996,1.000000,0.018040,0.008508,0.019093,0.024525,-0.001215,0.009299,0.005216,0.015704,...,0.011880,0.005970,0.015119,0.003761,0.017974,-0.000524,-0.001529,0.013143,0.011469,0.019364
6575,0.028205,0.018040,1.000000,0.045450,0.012103,0.043056,0.019963,0.011093,0.019665,0.006381,...,0.016384,0.024903,0.003385,0.015302,0.005613,-0.000522,-0.001523,0.026729,0.027509,-0.001063
7346,0.012924,0.008508,0.045450,1.000000,0.006689,0.030234,0.025496,0.025259,0.003982,0.003277,...,0.009306,0.039262,-0.001356,0.016822,0.024799,-0.000631,0.012068,0.006823,0.008816,-0.001284
11601,-0.000584,0.019093,0.012103,0.006689,1.000000,0.010210,0.025775,-0.000888,0.003508,-0.001099,...,0.005377,-0.000890,0.010454,-0.000673,-0.000849,-0.000308,-0.000897,-0.000411,0.006764,0.019108
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271284,-0.000341,-0.000524,-0.000522,-0.000631,-0.000308,0.006238,-0.000416,0.015546,-0.000400,-0.000641,...,-0.000572,-0.000519,-0.000386,-0.000393,-0.000495,1.000000,0.012392,-0.000240,-0.000419,-0.000365
274061,-0.000995,-0.001529,-0.001523,0.012068,-0.000897,0.008738,0.004162,0.042235,0.005432,0.018517,...,0.005835,-0.001514,0.025986,0.002817,0.003878,0.012392,1.000000,-0.000700,-0.001221,0.006543
274308,-0.000456,0.013143,0.026729,0.006823,-0.000411,0.033068,-0.000557,0.042741,-0.000535,0.010102,...,0.006980,0.023160,0.012483,-0.000526,-0.000663,-0.000240,-0.000700,1.000000,-0.000560,-0.000489
275970,-0.000796,0.011469,0.027509,0.008816,0.006764,0.015672,0.005497,-0.001209,0.010413,-0.001496,...,0.008029,-0.001211,0.019971,-0.000917,0.007384,-0.000419,-0.001221,-0.000560,1.000000,-0.000852
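
The matrix above was computed after `fillna(0)`, so every unrated book counts as a rating of 0 when two users are compared, which helps explain why most correlations are close to zero. The following sketch, on made-up ratings (`ratings` is a hypothetical toy frame), contrasts this with a correlation computed only over the books both users rated. It illustrates the effect and is not a claim about which variant the exercise expects.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are users, columns are books, NaN means "not rated".
ratings = pd.DataFrame(
    {
        "book_a": [8, 9, np.nan],
        "book_b": [3, 4, 5],
        "book_c": [np.nan, 7, 6],
        "book_d": [6, np.nan, 2],
    },
    index=pd.Index([1, 2, 3], name="User-ID"),
)

# Pearson correlation over co-rated books only (at least 2 shared books required).
corr_nan = ratings.T.corr(min_periods=2)

# Pearson correlation after treating missing ratings as 0, as in the cell above.
corr_zero = ratings.fillna(0).T.corr()

print(corr_nan.loc[1, 2])   # users 1 and 2 agree on the two books they share
print(corr_zero.loc[1, 2])  # the zero-filled gaps pull the value down
```

With only two shared books the pairwise value is always plus or minus 1, so in practice a larger `min_periods` would be sensible.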


### 4. Create a Collaborative filtering recommender system based on the user ratings from 3 together with the `Books.csv` dataset.

In [26]:
input_user_df = user_matrix.loc[274061]
#keep only books rated by user (rating>0)
book_list = input_user_df[input_user_df > 0].index.tolist()
books_rated_df = user_matrix[book_list]



In [36]:
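# Note: user_matrix was built with fillna(0), so notnull() is True for every cell
# and book_count equals the number of selected books for every user.
dedicated_reviewers = books_rated_df.T.notnull().sum()
dedicated_reviewers = dedicated_reviewers.reset_index()
dedicated_reviewers.columns = ["User-ID", "book_count"]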



In [37]:
dedicated_reviewers

     User-ID  book_count
0       3363         200
1       6251         200
2       6575         200
3       7346         200
4      11601         200
..       ...         ...
195   271284         200
196   274061         200
197   274308         200
198   275970         200


In [42]:
# The function takes a user ID, the predefined user-item matrix, the books dataframe and (as in the function in "Recommender systems.ipynb") the number of recommendations to return.

def user_based_recommender(input_user, user_item_matrix, books_df, num_recommendations=5):

    # all ratings of the input user (0 means "not rated")
    input_user_df = user_item_matrix.loc[input_user]
    # keep only the books the user actually rated (rating > 0)
    book_list = input_user_df[input_user_df > 0].index.tolist()

    # restrict the matrix to the books in book_list
    books_rated_df = user_item_matrix[book_list]

    # Count, per user, how many of those books they rated.
    # The matrix is zero-filled, so compare with 0; notnull() would count every cell.
    dedicated_reviewers = (books_rated_df > 0).sum(axis=1)
    dedicated_reviewers = dedicated_reviewers.reset_index()
    # restore the column names lost by reset_index
    dedicated_reviewers.columns = ["User-ID", "book_count"]

    # keep the (up to) 200 users with the highest book_count
    dedicated_only_br = dedicated_reviewers.sort_values(by="book_count", ascending=False).head(200)["User-ID"]

    # ratings of those users for the selected books
    final_df = books_rated_df[books_rated_df.index.isin(dedicated_only_br)]
    # user-user correlation matrix based on those ratings
    corr_df = final_df.T.corr()
    
    # correlation of every other user with the input user, highest first
    user_corr = corr_df[input_user].reset_index().rename(columns={input_user: 'corr'})
    user_corr = user_corr.sort_values(by="corr", ascending=False).loc[user_corr["User-ID"] != input_user].reset_index(drop=True)

    # weight each correlated user's ratings by their correlation with the input user
    # (note: this step reads the global ratings dataframe `br`)
    top_users_ratings = user_corr.merge(br, left_on="User-ID", right_on="User-ID")
    top_users_ratings["weighted_rating"] = top_users_ratings["corr"] * top_users_ratings["Book-Rating"]

    # Creating a recommendation dataframe
    recommendation_df = top_users_ratings.groupby("ISBN").agg({"weighted_rating": "mean"}).sort_values(by="weighted_rating", ascending=False)
    recommendation_df = recommendation_df.reset_index()

    # Creating the final recommendations
    books_to_be_recommended = recommendation_df.merge(books_df, on="ISBN")
    books_to_be_recommended = books_to_be_recommended.head(num_recommendations)

    return books_to_be_recommended[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']]


In [41]:
recommendations = user_based_recommender(274061, user_matrix, book)
print(recommendations)


         ISBN                                         Book-Title  \
0  0373005938                                    Lucifer's Angel   
1  0373023146                       Midnight Sun's Magic (#2314)   
2  1550743252  Easy Braids, Barrettes and Bows (Kids Can Do I...   
3  0451450299                         Echoes of the Fourth Magic   
4  0140367659                  The Magic World (Puffin Classics)   

       Book-Author Year-Of-Publication              Publisher  
0  Violet Winspear                1980              Harlequin  
1      Betty Neels                1979         Mills and Boon  
2  Judy Ann Sadler                1997           Disney Press  
3   R.A. Salvatore                1991  New Amer Library (Mm)  
4        E. Nesbit                1996           Puffin Books  
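
The weighting step inside `user_based_recommender` can be followed by hand on a small, made-up example. In the sketch below all user IDs, ISBNs, correlations and ratings are hypothetical. The last step, which removes books the target user has already rated, is an assumption added for illustration and is not part of the function above.

```python
import pandas as pd

# Hypothetical correlations of three other users with the target user.
user_corr = pd.DataFrame({"User-ID": [10, 11, 12], "corr": [0.9, 0.5, 0.1]})

# Hypothetical ratings given by those users.
ratings = pd.DataFrame({
    "User-ID": [10, 10, 11, 12, 12],
    "ISBN": ["111", "222", "111", "222", "333"],
    "Book-Rating": [10, 4, 8, 10, 6],
})

# Same steps as in the function: merge, weight by correlation, average per book.
merged = user_corr.merge(ratings, on="User-ID")
merged["weighted_rating"] = merged["corr"] * merged["Book-Rating"]
scores = merged.groupby("ISBN")["weighted_rating"].mean().sort_values(ascending=False)
print(scores)

# Assumption for illustration: skip books the target user has already rated.
already_rated = {"222"}
new_scores = scores[~scores.index.isin(already_rated)]
print(new_scores)
```

Book "111" ranks first because it was rated highly by the two most strongly correlated users, while a high rating from the weakly correlated user 12 contributes little.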
