# Exercises in Recommender systems

This notebook contains exercises in recommender systems.

## Exercise 1

Use the "Coursera Courses Dataset 2021", available on Kaggle ([https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021](https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021)) or on Moodle, to do the following:

1. Create a Content-based filtering recommender system based on the Course Descriptions.
2. Create a Content-based filtering recommender system based on the Skills.

Use the "Book Recommendation Dataset", available on Kaggle ([https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset](https://www.kaggle.com/datasets/arashnic/book-recommendation-dataset)) or on Moodle, to do the following:

3. Load the `Ratings.csv` file (on Moodle, it is called `Books_Ratings.csv`). Group by `User-ID`, count the ratings per user, and sort the resulting `Book-Rating` counts in descending order to find the users who rated the most books. Filter the rating data so that it contains only the 200 users who rated the most books.
4. Create a Collaborative filtering recommender system based on the user ratings from step 3 together with the `Books.csv` dataset.
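
Step 3 can be sketched as follows. This is a minimal illustration on a toy frame, not the reference solution: the column names `User-ID`, `ISBN` and `Book-Rating` are assumed to match `Ratings.csv`, and the helper `keep_most_active_users` is a hypothetical name introduced here.

```python
import pandas as pd

# Toy stand-in for Ratings.csv (assumed columns: User-ID, ISBN, Book-Rating).
ratings = pd.DataFrame({
    "User-ID": [1, 1, 1, 2, 2, 3],
    "ISBN": ["a", "b", "c", "a", "d", "b"],
    "Book-Rating": [5, 0, 8, 7, 9, 10],
})


def keep_most_active_users(ratings, n_users):
    """Keep only the rows belonging to the n_users users with the most ratings."""
    # Count ratings per user and sort the counts, largest first.
    counts = (
        ratings.groupby("User-ID")["Book-Rating"]
        .count()
        .sort_values(ascending=False)
    )
    # The index of the first n_users entries holds the most active user IDs.
    top_users = counts.head(n_users).index
    return ratings[ratings["User-ID"].isin(top_users)]


# With the real data, n_users would be 200.
top_ratings = keep_most_active_users(ratings, n_users=2)
print(top_ratings)
```

On the real file the same call with `n_users=200` would give the filtered ratings used in step 4.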

## Exercise 2

Use the "Coursera Courses Dataset 2021" from Exercise 1 to do the following:

1. [Optional] Create a Content-based filtering recommender system based on both the Course Descriptions and the Skills.
2. [Optional] Can you come up with a way of including Difficulty Level and Course Rating in your recommender system?
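
One possible way to approach both optional tasks is to stack several feature blocks side by side before computing similarities. The sketch below is only an illustration on a toy frame: the block weights (`0.5`), the division of the rating by 5, and the handling of non-numeric ratings are all assumptions, not part of the exercise.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for Coursera.csv with the four columns used here.
courses = pd.DataFrame({
    "Course Description": ["learn python programming",
                           "write a film script",
                           "python for data analysis"],
    "Skills": ["python coding", "screenwriting film", "python pandas"],
    "Difficulty Level": ["Beginner", "Beginner", "Advanced"],
    "Course Rating": [4.8, 4.5, 4.6],
})

# One TF-IDF block per text column.
desc = TfidfVectorizer(stop_words="english").fit_transform(
    courses["Course Description"].fillna(""))
skills = TfidfVectorizer(stop_words="english").fit_transform(
    courses["Skills"].fillna(""))

# One-hot encode the difficulty level.
difficulty = pd.get_dummies(courses["Difficulty Level"]).to_numpy(dtype=float)

# Coerce ratings to numbers (assumption: non-numeric entries become the mean)
# and rescale to [0, 1] by dividing by the maximum rating of 5.
rating = pd.to_numeric(courses["Course Rating"], errors="coerce")
rating = rating.fillna(rating.mean()).to_numpy(dtype=float).reshape(-1, 1) / 5.0

# Stack the blocks; the 0.5 weights are arbitrary and meant to be tuned.
features = hstack([
    desc,
    skills,
    csr_matrix(0.5 * difficulty),
    csr_matrix(0.5 * rating),
]).tocsr()

combined_sim = cosine_similarity(features)
print(combined_sim.round(3))
```

A simpler alternative for the first task alone is to concatenate the two text columns into one string per course and fit a single `TfidfVectorizer` on it.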

### 1. Create a Content-based filtering recommender system based on the Course Descriptions.

In [44]:
import kagglehub
import pandas as pd
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# Download latest version
path = kagglehub.dataset_download("khusheekapoor/coursera-courses-dataset-2021")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Bruger\.cache\kagglehub\datasets\khusheekapoor\coursera-courses-dataset-2021\versions\1





In [7]:
# Build the file location from the download path returned by kagglehub
# instead of hard-coding a user-specific directory
df = pd.read_csv(f"{path}/Coursera.csv")

In [8]:
df

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,�cole Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you�ll learn how to effectively...,Data Analysis select (sql) database manageme...
...,...,...,...,...,...,...,...
3517,"Capstone: Retrieving, Processing, and Visualiz...",University of Michigan,Beginner,4.6,https://www.coursera.org/learn/python-data-vis...,"In the capstone, students will build a series ...",Databases syntax analysis web Data Visuali...
3518,Patrick Henry: Forgotten Founder,University of Virginia,Intermediate,4.9,https://www.coursera.org/learn/henry,"�Give me liberty, or give me death:� Rememberi...",retirement Causality career history of the ...
3519,Business intelligence and data analytics: Gene...,Macquarie University,Advanced,4.6,https://www.coursera.org/learn/business-intell...,�Megatrends� heavily influence today�s organis...,analytics tableau software Business Intellig...
3520,Rigid Body Dynamics,Korea Advanced Institute of Science and Techno...,Beginner,4.6,https://www.coursera.org/learn/rigid-body-dyna...,"This course teaches dynamics, one of the basic...",Angular Mechanical Design fluid mechanics F...


In [88]:
def feature_fiting(df, feature):
    """Return the TF-IDF matrix for the text column `feature` of `df`."""
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # First replace missing values with an empty string
    df[feature] = df[feature].fillna('')

    # Define a TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
    tfidf = TfidfVectorizer(stop_words='english')

    # Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix = tfidf.fit_transform(df[feature])

    return tfidf_matrix

In [87]:
# First replace missing values with an empty string
df['Course Description'] = df['Course Description'].fillna('')

# Define a TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['Course Description'])

# Output the shape of tfidf_matrix: (number of courses, vocabulary size)
tfidf_matrix.shape

(3522, 20074)

In [77]:
%%time 
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

CPU times: total: 281 ms
Wall time: 270 ms


In [78]:
cosine_sim

array([[1.00000000e+00, 3.12366523e-02, 1.97603991e-02, ...,
        3.17538002e-02, 3.33859933e-02, 1.96231367e-02],
       [3.12366523e-02, 1.00000000e+00, 8.58915185e-03, ...,
        3.13671991e-02, 4.88239107e-03, 4.56033552e-02],
       [1.97603991e-02, 8.58915185e-03, 1.00000000e+00, ...,
        3.45669421e-03, 1.65197252e-02, 6.37237740e-03],
       ...,
       [3.17538002e-02, 3.13671991e-02, 3.45669421e-03, ...,
        1.00000000e+00, 5.07544593e-04, 6.72367274e-03],
       [3.33859933e-02, 4.88239107e-03, 1.65197252e-02, ...,
        5.07544593e-04, 1.00000000e+00, 1.14068789e-03],
       [1.96231367e-02, 4.56033552e-02, 6.37237740e-03, ...,
        6.72367274e-03, 1.14068789e-03, 1.00000000e+00]])

In [79]:
cosine_sim.shape

(3522, 3522)
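
The matrix is square because it compares every course with every other course, and its diagonal is 1 because each course is identical to itself. As a side note, `TfidfVectorizer` L2-normalises each row by default, so for such a matrix the cosine similarity equals the plain dot product. The sketch below checks this on toy documents (made up for illustration) using scikit-learn's `linear_kernel`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "introduction to screenwriting for film",
    "writing your first novel",
    "single table SQL queries",
    "film script writing workshop",
]

# Rows are L2-normalised by default (norm='l2').
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

cos = cosine_similarity(tfidf, tfidf)
dot = linear_kernel(tfidf, tfidf)

# Both matrices agree, are symmetric, and have ones on the diagonal.
print(np.allclose(cos, dot))
print(cos.shape)
```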

In [None]:
indices = df

In [108]:
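# Function that takes a course's value in a column (e.g. its description) and outputs the most similar courses
def get_recommendations(feature_value, feature_return, cosine_sim=cosine_sim):
    # Note: the default for cosine_sim is bound when this function is defined,
    # so pass cosine_sim explicitly after recomputing the similarity matrix

    # Map each value of the chosen column to its row index, keeping the first
    # row when the same value occurs more than once
    indices = pd.Series(df.index, index=df[feature_return])
    indices = indices[~indices.index.duplicated(keep='first')]

    # Get the index of the course that matches the given value
    idx = indices[feature_value]

    # Get the pairwise similarity scores of all courses with that course
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the courses based on the similarity scores, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar courses, skipping the first
    # entry, which is normally the course itself
    sim_scores = sim_scores[1:11]

    # Get the course indices
    course_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar courses
    return df[feature_return].iloc[course_indices]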

In [109]:
get_recommendations(df.iloc[0]["Course Description"], "Course Description")

1451    In this course aspiring writers will be introd...
1481    What you�ll achieve:   In this project-centere...
3462    Your style is as unique and distinctive as you...
2424    In this course, creative nonfiction writers wi...
3005    This class is the chance to create your person...
339     The blank page can be the most daunting obstac...
3481    Do you have a desire to write a novel, write a...
535     If you have always wanted to tell your own sto...
1629    WRITE YOUR FIRST NOVEL  If you�ve ever had the...
3255    How well do you think you know tango? This two...
Name: Course Description, dtype: object

In [85]:
get_recommendations(df.iloc[1]["Course Description"], "Course Description")

3311    By the end of this guided project, you will be...
3232    By the end of this 2.5 project, you will be fl...
1636    By the end of this project, you will be fluent...
954     By the end of this project, you will be fluent...
10      By the end of this guided project, you will be...
2147    By the end of this guided project, you will be...
1915    By the end of this project, you will be fluent...
3400    By the end of this 2 hour-long guided project,...
422     This guided project was developed to engage an...
969     By the end of this guided project, you will be...
Name: Course Description, dtype: object

### 2. Create a Content-based filtering recommender system based on the Skills.

In [89]:
tfidf_matrix = feature_fiting(df, "Skills")

In [90]:
%%time 
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

CPU times: total: 109 ms
Wall time: 123 ms


In [99]:
get_recommendations(df.iloc[0]["Skills"], "Skills")

KeyError: 'Drama  Comedy  peering  screenwriting  film  Document Review  dialogue  creative writing  Writing  unix shells arts-and-humanities music-and-art'
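
A `KeyError` like the one above would be consistent with the lookup Series having been indexed by a different column than the one being queried, for example if an earlier definition of `get_recommendations` was still active in the kernel; this is a guess, since the notebook does not show that definition. One way to avoid both this and the definition-time default for `cosine_sim` is to look courses up by row position and compute the similarities inside the function. The sketch below is a hypothetical alternative (`recommend_by_position` is not part of the notebook), shown on a toy frame.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend_by_position(frame, feature, row_position, top_n=10):
    """Return the top_n rows most similar to the row at row_position."""
    # Replace missing values with empty strings and build the TF-IDF matrix.
    text = frame[feature].fillna("")
    matrix = TfidfVectorizer(stop_words="english").fit_transform(text)

    # Similarity of the chosen row to every row, as a flat array.
    scores = cosine_similarity(matrix[row_position], matrix).ravel()

    # Positions sorted by descending score, excluding the query row itself.
    order = [i for i in scores.argsort()[::-1] if i != row_position][:top_n]
    return frame[feature].iloc[order]


toy = pd.DataFrame(
    {"Skills": ["python pandas", "film screenwriting", "python numpy", None]}
)
result = recommend_by_position(toy, "Skills", row_position=0, top_n=2)
print(result)
```

With the course data, a call such as `recommend_by_position(df, "Skills", 0)` would play the role of the failing cell, at the cost of refitting the vectorizer on every call.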