## Exercise 1

Using the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Using the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, do the following:

3. Load the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.
4. Create a Collaborative filtering recommender system based on the user ratings from step 3 together with the `Books.csv` dataset.

### Using the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, do the following:

In [1]:
import kagglehub
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



In [2]:
path = kagglehub.dataset_download("khusheekapoor/coursera-courses-dataset-2021")
print("Path to dataset files:", path)


Path to dataset files: C:\Users\Jlo\.cache\kagglehub\datasets\khusheekapoor\coursera-courses-dataset-2021\versions\1


In [3]:
dataset_path = path

files = os.listdir(dataset_path)
print(files)

['Coursera.csv']


In [4]:
file_path = os.path.join(dataset_path, "Coursera.csv")

df_coursera = pd.read_csv(file_path)
df_coursera.head()

| | Course Name | University | Difficulty Level | Course Rating | Course URL | Course Description | Skills |
|---|---|---|---|---|---|---|---|
| 0 | Write A Feature Length Screenplay For Film Or ... | Michigan State University | Beginner | 4.8 | https://www.coursera.org/learn/write-a-feature... | Write a Full Length Feature Film Script In th... | Drama Comedy peering screenwriting film D... |
| 1 | Business Strategy: Business Model Canvas Analy... | Coursera Project Network | Beginner | 4.8 | https://www.coursera.org/learn/canvas-analysis... | By the end of this guided project, you will be... | Finance business plan persona (user experien... |
| 2 | Silicon Thin Film Solar Cells | École Polytechnique | Advanced | 4.1 | https://www.coursera.org/learn/silicon-thin-fi... | This course consists of a general presentation... | chemistry physics Solar Energy film lambda... |
| 3 | Finance for Managers | IESE Business School | Intermediate | 4.8 | https://www.coursera.org/learn/operational-fin... | When it comes to numbers, there is always more... | accounts receivable dupont analysis analysis... |
| 4 | Retrieve Data using Single-Table SQL Queries | Coursera Project Network | Beginner | 4.6 | https://www.coursera.org/learn/single-table-sq... | In this course you'll learn how to effectively... | Data Analysis select (sql) database manageme... |


In [5]:
print(df_coursera.columns)

Index(['Course Name', 'University', 'Difficulty Level', 'Course Rating',
       'Course URL', 'Course Description', 'Skills'],
      dtype='object')


In [6]:
df_coursera.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB


In [7]:
# Clean the text fields: fill missing values with empty strings and lower-case the text.
df_coursera['Course Description'] = df_coursera['Course Description'].fillna("").str.lower()
df_coursera['Skills'] = df_coursera['Skills'].fillna("").str.lower()
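The cell above only fills missing values and lower-cases the text. If punctuation removal is also wanted, a helper along the following lines could be used. This is a minimal sketch: `clean_text` is a hypothetical name, not part of the original notebook, and it only handles ASCII punctuation.

```python
import re
import string

# Hypothetical helper (an assumption, not from the notebook): lower-case the text,
# replace ASCII punctuation with spaces, and collapse repeated whitespace.
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def clean_text(text):
    text = _PUNCT_RE.sub(" ", str(text).lower())
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Write a Full-Length   Feature Film Script!"))
```

It could then be applied with `df_coursera['Course Description'].fillna("").apply(clean_text)`. Note that `TfidfVectorizer` already lower-cases its input and ignores punctuation with its default settings, so this step mainly makes the printed text cleaner rather than changing the similarity scores much.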

### 1. Create a Content-based filtering recommender system based on the Course Descriptions.

In [8]:
tfidf_desc = TfidfVectorizer(stop_words='english')
desc_matrix = tfidf_desc.fit_transform(df_coursera['Course Description'])

cosine_sim_desc = cosine_similarity(desc_matrix, desc_matrix)
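To make the similarity step concrete, here is a small from-scratch sketch of what `cosine_similarity` computes for each pair of rows. As a simplifying assumption it uses raw term counts rather than TF-IDF weights, and the three toy documents and the `cosine` helper are made up for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy corpus (illustrative only): raw term counts stand in for TF-IDF weights.
docs = ["write a film script", "write a novel", "solar cell physics"]
vectors = [Counter(doc.split()) for doc in docs]

# Pairwise similarity matrix, analogous in shape to cosine_sim_desc.
sim_matrix = [[cosine(u, v) for v in vectors] for u in vectors]
for row in sim_matrix:
    print([round(value, 3) for value in row])
```

Documents that share terms get a score between 0 and 1, documents with no terms in common get 0, and each document has similarity 1 with itself, which is why the recommender functions exclude the query course.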

In [9]:
def recommend_courses_by_description(course_index, top_n=5):
    # Enumerate similarity scores, sort, and return top_n courses (excluding itself)
    sim_scores = list(enumerate(cosine_sim_desc[course_index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [score for score in sim_scores if score[0] != course_index]
    recommended_indices = [i for i, score in sim_scores[:top_n]]
    return df_coursera.iloc[recommended_indices][['Course Name', 'University', 'Course Description']]


In [10]:
print("Recommendations based on Course Description for course index 0:")
print(recommend_courses_by_description(0, top_n=3))

Recommendations based on Course Description for course index 0:
                                            Course Name  \
1481  Script Writing: Write a Pilot Episode for a TV...   
1629                             Write Your First Novel   
3481                                 Transmedia Writing   

                     University  \
1481  Michigan State University   
1629  Michigan State University   
3481  Michigan State University   

                                     Course Description  
1481  what you'll achieve:   in this project-centere...  
1629  write your first novel  if you've ever had the...  
3481  do you have a desire to write a novel, write a...  
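The ranking step inside `recommend_courses_by_description` sorts all similarity scores and then slices off the first `top_n`. An alternative sketch, assuming only the top few items are needed, uses `heapq.nlargest` to avoid sorting the full row. The name `top_n_similar` and the toy scores are hypothetical.

```python
import heapq


def top_n_similar(sim_row, course_index, top_n=5):
    """Indices of the top_n highest scores in sim_row, skipping course_index itself."""
    candidates = (
        (idx, score) for idx, score in enumerate(sim_row) if idx != course_index
    )
    # nlargest with a key behaves like sorted(..., key=key, reverse=True)[:top_n].
    best = heapq.nlargest(top_n, candidates, key=lambda pair: pair[1])
    return [idx for idx, _ in best]


# Toy similarity row for course 0 (illustrative values only).
scores = [1.0, 0.2, 0.7, 0.0, 0.4]
print(top_n_similar(scores, course_index=0, top_n=2))  # prints [2, 4]
```

The returned indices could be passed to `df_coursera.iloc[...]` in the same way as `recommended_indices` in the function above.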


### 2. Create a Content-based filtering recommender system based on the Skills.

In [11]:
# Vectorize the Skills column using TF-IDF.
tfidf_skills = TfidfVectorizer(stop_words='english')
skills_matrix = tfidf_skills.fit_transform(df_coursera['Skills'])

# Compute cosine similarity for skills.
cosine_sim_skills = cosine_similarity(skills_matrix, skills_matrix)



In [12]:
def recommend_courses_by_skills(course_index, top_n=5):
    sim_scores = list(enumerate(cosine_sim_skills[course_index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = [score for score in sim_scores if score[0] != course_index]
    recommended_indices = [i for i, score in sim_scores[:top_n]]
    return df_coursera.iloc[recommended_indices][['Course Name', 'University', 'Skills']]

In [13]:
# Test recommender for course index 0.
print("Recommendations based on Skills for course index 0:")
print(recommend_courses_by_skills(0, top_n=3))

Recommendations based on Skills for course index 0:
                                            Course Name  \
1451  Creative Writing: The Craft of Setting and Des...   
1481  Script Writing: Write a Pilot Episode for a TV...   
3462               Creative Writing: The Craft of Style   

                     University  \
1451        Wesleyan University   
1481  Michigan State University   
3462        Wesleyan University   

                                                 Skills  
1451  copywriting  storytelling  fiction writing  hu...  
1481  bible  film  film studies  cinematography  wri...  
3462  creative writing  fiction writing  film  copyw...  


### Using the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, do the following:


### 3. Load the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID` and sort by `Book-Rating` in descending order to get the users who rated the most books. Filter the rating data to contain only the 200 users who rated the most books.

In [14]:
path = kagglehub.dataset_download("arashnic/book-recommendation-dataset")
print("Path to dataset files:", path)

Path to dataset files: C:\Users\Jlo\.cache\kagglehub\datasets\arashnic\book-recommendation-dataset\versions\3


In [15]:
dataset_path = path

files = os.listdir(dataset_path)
print(files)

['Books.csv', 'classicRec.png', 'DeepRec.png', 'Ratings.csv', 'recsys_taxonomy2.png', 'Users.csv']


In [16]:
df_ratings = pd.read_csv(os.path.join(dataset_path, "Ratings.csv"))
df_ratings.head()

| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


In [17]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   User-ID      1149780 non-null  int64 
 1   ISBN         1149780 non-null  object
 2   Book-Rating  1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


In [18]:
user_rating_counts = df_ratings.groupby('User-ID').size().reset_index(name='rating_count')

In [19]:
user_rating_counts = user_rating_counts.sort_values('rating_count', ascending=False)
user_rating_counts.head()

| | User-ID | rating_count |
|---|---|---|
| 4213 | 11676 | 13602 |
| 74815 | 198711 | 7550 |
| 58113 | 153662 | 6109 |
| 37356 | 98391 | 5891 |
| 13576 | 35859 | 5850 |


In [20]:
top_users = user_rating_counts.head(200)['User-ID']

In [21]:
ratings_filtered = df_ratings[df_ratings['User-ID'].isin(top_users)]
ratings_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 288349 entries, 4330 to 1147616
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   User-ID      288349 non-null  int64 
 1   ISBN         288349 non-null  object
 2   Book-Rating  288349 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 8.8+ MB


In [22]:
# Create a pivot table with User-ID as rows, ISBN as columns, and Book-Rating as values.
rating_matrix = ratings_filtered.pivot_table(index='User-ID', columns='ISBN', values='Book-Rating').fillna(0)

# Inspect the rating matrix shape and a snippet of the data.
print("User-Item Rating Matrix shape:", rating_matrix.shape)
print(rating_matrix.head())

User-Item Rating Matrix shape: (200, 142394)
ISBN      9022906116  *0515128325  0 7336 1053 6  0000000000  00000000000  \
User-ID                                                                     
3363             0.0          0.0            0.0         0.0          0.0   
6251             0.0          0.0            0.0         0.0          0.0   
6575             0.0          0.0            0.0         0.0          0.0   
7346             0.0          0.0            0.0         0.0          0.0   
11601            0.0          0.0            0.0         0.0          0.0   

ISBN     0000000051  0000001481  0000913154  0001046438  000104687X  ...  \
User-ID                                                              ...   
3363            0.0         0.0         0.0         0.0         0.0  ...   
6251            0.0         0.0         0.0         0.0         0.0  ...   
6575            0.0         0.0         0.0         0.0         0.0  ...   
7346            0.0         0.0    

In [23]:
rating_matrix.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 3363 to 278418
Columns: 142394 entries,  9022906116 to b00005wz75
dtypes: float64(142394)
memory usage: 217.3 MB
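The memory figure above follows directly from the matrix shape, and a quick back-of-the-envelope check shows how sparse the matrix is. The numbers below are copied from the outputs printed earlier; the variable names are just for illustration.

```python
n_users, n_items = 200, 142394   # rating_matrix.shape
n_ratings = 288349               # rows in ratings_filtered

# A dense float64 matrix stores 8 bytes per cell, whether or not it holds a rating.
dense_bytes = n_users * n_items * 8
print(f"dense size: {dense_bytes / 2**20:.1f} MiB")

# Each rating row fills at most one cell, so this is an upper bound on the
# fraction of cells that carry any rating record at all.
fill_ratio = n_ratings / (n_users * n_items)
print(f"filled cells (upper bound): {fill_ratio:.2%}")
```

Roughly 1% of the cells can hold a rating record, and many of those records are themselves 0, so almost all of the 217.3 MB is spent on zeros produced by `fillna(0)`. A sparse representation would likely be worth considering for larger user sets.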


In [24]:


# Compute cosine similarity between users based on their rating vectors.
user_similarity = cosine_similarity(rating_matrix)
user_similarity_df = pd.DataFrame(user_similarity, index=rating_matrix.index, columns=rating_matrix.index)

print("User similarity matrix shape:", user_similarity_df.shape)
print(user_similarity_df.head())


User similarity matrix shape: (200, 200)
User-ID    3363      6251      6575      7346      11601     11676     12538   \
User-ID                                                                         
3363     1.000000  0.000000  0.026157  0.012896  0.000000  0.016370  0.009182   
6251     0.000000  1.000000  0.032732  0.009438  0.018611  0.032008  0.000000   
6575     0.026157  0.032732  1.000000  0.052928  0.011933  0.051280  0.018886   
7346     0.012896  0.009438  0.052928  1.000000  0.007270  0.042078  0.024509   
11601    0.000000  0.018611  0.011933  0.007270  1.000000  0.014299  0.024588   

User-ID    13552     15408     16634   ...    264321    265115    265313  \
User-ID                                ...                                 
3363     0.007846  0.012360  0.000000  ...  0.005057  0.005149  0.000000   
6251     0.009843  0.005789  0.019615  ...  0.012436  0.006767  0.015009   
6575     0.011312  0.018628  0.010907  ...  0.018278  0.029977  0.004107   
7346     0.

### 4. Create a Collaborative filtering recommender system based on the user ratings from step 3 together with the `Books.csv` dataset.

In [25]:
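# Load the Books.csv dataset from the downloaded dataset folder to get book details.
# low_memory=False reads the file in one pass so pandas infers a single dtype per column.
df_books = pd.read_csv(os.path.join(dataset_path, "Books.csv"), low_memory=False)

# Inspect the books data; expected columns include 'ISBN', 'Book-Title', 'Book-Author', etc.
print(df_books.head())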




         ISBN                                         Book-Title  \
0  0195153448                                Classical Mythology   
1  0002005018                                       Clara Callan   
2  0060973129                               Decision in Normandy   
3  0374157065  Flu: The Story of the Great Influenza Pandemic...   
4  0393045218                             The Mummies of Urumchi   

            Book-Author Year-Of-Publication                   Publisher  \
0    Mark P. O. Morford                2002     Oxford University Press   
1  Richard Bruce Wright                2001       HarperFlamingo Canada   
2          Carlo D'Este                1991             HarperPerennial   
3      Gina Bari Kolata                1999        Farrar Straus Giroux   
4       E. J. W. Barber                1999  W. W. Norton &amp; Company   

                                         Image-URL-S  \
0  http://images.amazon.com/images/P/0195153448.0...   
1  http://images.amazon.com/



In [29]:
df_books.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   ISBN                 271360 non-null  object
 1   Book-Title           271360 non-null  object
 2   Book-Author          271358 non-null  object
 3   Year-Of-Publication  271360 non-null  object
 4   Publisher            271358 non-null  object
 5   Image-URL-S          271360 non-null  object
 6   Image-URL-M          271360 non-null  object
 7   Image-URL-L          271357 non-null  object
dtypes: object(8)
memory usage: 16.6+ MB


In [30]:
def recommend_books(user_id, rating_matrix, user_similarity_df, df_books, top_n=5, n_neighbors=5):
    # Similarity of the target user to every other user (the user themself is dropped).
    sim_scores = user_similarity_df.loc[user_id].drop(user_id)
    # Keep the n_neighbors most similar users.
    similar_users = sim_scores.sort_values(ascending=False).head(n_neighbors).index

    # Unweighted mean rating per book among the neighbors.
    # Unrated books were filled with 0, so they pull the mean down.
    similar_ratings = rating_matrix.loc[similar_users]
    avg_ratings = similar_ratings.mean(axis=0)

    # Books with a 0 entry for the target user (unrated, or an explicit rating of 0).
    target_user_ratings = rating_matrix.loc[user_id]
    unrated_books = target_user_ratings[target_user_ratings == 0].index

    # Keep the top_n unrated books by mean neighbor rating.
    recommendations = avg_ratings.loc[unrated_books].sort_values(ascending=False).head(top_n)

    # Look up book details. Rows come back in df_books order, not ranked by score,
    # and ISBNs that are missing from Books.csv are silently dropped.
    recommended_books = df_books[df_books['ISBN'].isin(recommendations.index)]
    return recommended_books
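
`recommend_books` averages the neighbors' ratings without weights and counts unrated books as 0. One possible refinement is a similarity-weighted mean that skips 0 entries. The sketch below shows the idea on plain dictionaries rather than on `rating_matrix`; `predict_scores` and the toy data are hypothetical, and treating 0 as "not rated" is an assumption about the data.

```python
def predict_scores(neighbor_ratings, neighbor_sims):
    """Similarity-weighted mean rating per item, ignoring 0 (treated as not rated).

    neighbor_ratings: {user_id: {isbn: rating}}
    neighbor_sims:    {user_id: similarity to the target user}
    """
    weighted_sum = {}
    weight_total = {}
    for user, ratings in neighbor_ratings.items():
        sim = neighbor_sims.get(user, 0.0)
        if sim <= 0.0:
            continue  # neighbors with no positive similarity contribute nothing
        for isbn, rating in ratings.items():
            if rating == 0:
                continue  # assumption: 0 means "no explicit rating"
            weighted_sum[isbn] = weighted_sum.get(isbn, 0.0) + sim * rating
            weight_total[isbn] = weight_total.get(isbn, 0.0) + sim
    return {isbn: weighted_sum[isbn] / weight_total[isbn] for isbn in weighted_sum}


# Toy neighbors (illustrative only).
neighbor_ratings = {
    "u1": {"book_a": 8, "book_b": 0},
    "u2": {"book_a": 4, "book_b": 10},
}
neighbor_sims = {"u1": 0.75, "u2": 0.25}

scores = predict_scores(neighbor_ratings, neighbor_sims)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Here `book_a` scores 7.0 because the more similar neighbor's rating of 8 dominates, while `book_b` scores 10.0 because only one neighbor rated it. A score backed by a single neighbor is less reliable, so a minimum number of raters per book may be a sensible extra filter.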



In [31]:
# Test the recommender for a sample user (for example, the first user in top_users)
test_user = top_users.iloc[0]
recommended_books = recommend_books(test_user, rating_matrix, user_similarity_df, df_books, top_n=5)
print("Recommended Books for User", test_user)
print(recommended_books[['Book-Title', 'Book-Author']])

Recommended Books for User 11676
                                         Book-Title           Book-Author
4269   White Oleander : A Novel (Oprah's Book Club)           Janet Fitch
10950                              The Runaway Jury          JOHN GRISHAM
15611                   Little House on the Prairie  Laura Ingalls Wilder
24362                The Long Winter (Little House)  Laura Ingalls Wilder
26127                                      Palomino        DANIELLE STEEL
