# Course Recommender: Content-Based Methods

This project implements and deploys an AI course recommender system using [Streamlit](https://streamlit.io/). It was inspired by the [IBM Machine Learning Professional Certificate](https://www.coursera.org/professional-certificates/ibm-machine-learning) offered by IBM & Coursera. In the last course/module of the Specialization, Machine Learning Capstone, a similar application is built; check my [class notes](https://github.com/mxagar/machine_learning_ibm/tree/main/06_Capstone_Project/06_Capstone_Recommender_System.md) for more information.

This notebook researches and implements different **Content-Based** recommender systems. You can open it in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mxagar/course_recommender_streamlit/blob/main/notebooks/03_Content_RecSys.ipynb)

In this notebook, the following content-based recommender system is built:

- We assume that the course-genre weights are known.
- We have a user profile, i.e., a matrix that contains the weight each user gives to each genre/feature. These weights are not normalized.
- The rating a user would give to a new item can be estimated as the dot product of the user's profile vector and the item's feature vector. Since there is no normalization, we call that estimate a score.
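
To make the score concrete, here is a minimal sketch that applies the dot product to one user and one course. The two vectors are copied from outputs shown later in this notebook (the profile of user 37465 and the genre row of course RP0105EN); the variable names are illustrative and not part of the notebook's code.

```python
import numpy as np

# Genre order as it appears in the course-genre and user-profile tables
genres = ["Database", "Python", "CloudComputing", "DataAnalysis", "Containers",
          "MachineLearning", "ComputerVision", "DataScience", "BigData",
          "Chatbot", "R", "BackendDev", "FrontendDev", "Blockchain"]

# Unnormalized profile weights of user 37465 (copied from the notebook output)
user_vector = np.array([12, 0, 3, 3, 0, 0, 0, 6, 12, 0, 0, 3, 0, 0], dtype=float)

# Binary genre vector of course RP0105EN (copied from the notebook output)
course_vector = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0], dtype=float)

# Score = sum over genres of (user weight * course genre indicator)
score = float(np.dot(user_vector, course_vector))

# Per-genre contributions show where the score comes from
contributions = {
    g: float(u * c)
    for g, u, c in zip(genres, user_vector, course_vector)
    if u * c > 0
}

print(score)          # 27.0
print(contributions)  # {'Database': 12.0, 'DataAnalysis': 3.0, 'BigData': 12.0}
```

The value 27.0 matches the score listed for this user and course in the results table at the end of the notebook. Because the profile weights are not normalized, scores are best read as a ranking within one user; scores of different users are likely not directly comparable.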

Table of contents:

- [1. Load and Understand New Datasets](#1.-Load-and-Understand-New-Datasets)
- [2. Get Recommendations for One User](#2.-Get-Recommendations-for-One-User)
- [3. Compute Expected Scores](#3.-Compute-Expected-Scores)


In [52]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [9]:
# Set a random state
rs = 123

# 1. Load and Understand New Datasets

In [4]:
test_users_df = pd.read_csv('../data/rs_content_test.csv')
profile_df = pd.read_csv('../data/user_profile.csv')
course_genres_df = pd.read_csv('../data/course_genre.csv')

In [10]:
# Course vs Genre: (307, 16)
course_genres_df.head()

Unnamed: 0,COURSE_ID,TITLE,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
0,ML0201EN,robots are coming build iot apps with watson ...,0,0,0,0,0,0,0,0,0,0,0,1,1,0
1,ML0122EN,accelerating deep learning with gpu,0,1,0,0,0,1,0,1,0,0,0,0,0,0
2,GPXX0ZG0EN,consuming restful services using the reactive ...,0,0,0,0,0,0,0,0,0,0,0,1,1,0
3,RP0105EN,analyzing big data in r using apache spark,1,0,0,1,0,0,0,0,1,0,1,0,0,0
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,0,0,0,0,1,0,0,0,0,0,0,1,0,0


In [11]:
course_genres_df.shape

(307, 16)

In [13]:
# User vs Genre/Topic weight/preference: (33901, 15)
profile_df.head()

Unnamed: 0,user,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
0,2,52.0,14.0,6.0,43.0,3.0,33.0,0.0,29.0,41.0,2.0,18.0,34.0,9.0,6.0
1,4,40.0,2.0,4.0,28.0,0.0,14.0,0.0,20.0,24.0,0.0,6.0,6.0,0.0,2.0
2,5,24.0,8.0,18.0,24.0,0.0,30.0,0.0,22.0,14.0,2.0,14.0,26.0,4.0,6.0
3,7,2.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
4,8,6.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,6.0,0.0,2.0,0.0,0.0,0.0


In [14]:
profile_df.shape

(33901, 15)

In [17]:
# User - course - rating: (9402, 3)
test_users_df.head()

Unnamed: 0,user,item,rating
0,1502801,RP0105EN,3.0
1,1609720,CNSC02EN,2.0
2,1347188,CO0301EN,3.0
3,755067,ML0103EN,3.0
4,538595,BD0115EN,3.0


In [16]:
test_users_df.shape

(9402, 3)

In [19]:
# There are 1000 unique users in test_users_df
len(test_users_df['user'].unique())

1000

# 2. Get Recommendations for One User

In [20]:
test_users = test_users_df.groupby(['user']).max().reset_index(drop=False)
test_user_ids = test_users['user'].to_list()

In [21]:
test_user_ids[:5]

[37465, 50348, 52091, 70434, 85625]

In [24]:
test_user_profile = profile_df[profile_df['user'] == test_user_ids[0]] # user 37465
test_user_profile

Unnamed: 0,user,Database,Python,CloudComputing,DataAnalysis,Containers,MachineLearning,ComputerVision,DataScience,BigData,Chatbot,R,BackendDev,FrontendDev,Blockchain
210,37465,12.0,0.0,3.0,3.0,0.0,0.0,0.0,6.0,12.0,0.0,0.0,3.0,0.0,0.0


In [28]:
# Vector of user profile: (14,)
test_user_vector = test_user_profile.iloc[0, 1:].values
test_user_vector # (14,)

array([12.,  0.,  3.,  3.,  0.,  0.,  0.,  6., 12.,  0.,  0.,  3.,  0.,
        0.])

In [25]:
enrolled_courses = test_users_df[test_users_df['user'] == test_user_ids[0]]['item'].to_list()
enrolled_courses = set(enrolled_courses) # 7 courses

In [26]:
# The selected user has enrolled in 7 courses,
# so 300 ratings are missing, i.e., these can be estimated/predicted
enrolled_courses

{'BD0101EN',
 'BD0111EN',
 'BD0115EN',
 'BD0211EN',
 'DS0101EN',
 'DS0201EN',
 'SC0101EN'}

In [31]:
all_courses = set(course_genres_df['COURSE_ID'].values) # 307 courses

In [32]:
unknown_courses = all_courses.difference(enrolled_courses)

In [35]:
# Get the subset of the course-genre table which contains the unknown courses
unknown_course_genres = course_genres_df[course_genres_df['COURSE_ID'].isin(unknown_courses)]
# Matrix
course_matrix = unknown_course_genres.iloc[:, 2:].values
course_matrix # (300, 14)

array([[0, 0, 0, ..., 1, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       ...,
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 1, 0]])

In [39]:
# Estimated rating/score for each of the 300 courses
# Note that this rating is not normalized...
np.dot(course_matrix,test_user_vector) # (300,)

array([ 3.,  6.,  3., 27.,  3.,  3.,  6.,  0.,  0., 12.,  3., 15.,  6.,
        6.,  0.,  0.,  3.,  3., 24.,  0., 15.,  0.,  3.,  0.,  6.,  6.,
        6.,  3., 27.,  6., 12.,  3.,  3.,  0., 12., 12., 27.,  9.,  6.,
       27.,  3.,  0.,  0.,  6.,  6., 12.,  3.,  3.,  6.,  3.,  0., 24.,
        6.,  3., 24.,  6.,  6.,  6., 18.,  6.,  0.,  0.,  3.,  6.,  0.,
        3.,  3.,  6.,  0.,  3., 12.,  3.,  3.,  0.,  3.,  0., 12., 12.,
        6.,  3.,  3., 12.,  3., 12., 15.,  3., 12.,  3.,  0.,  0.,  3.,
       15., 24., 15., 15., 24.,  6.,  3.,  3.,  3.,  3.,  3.,  3.,  3.,
        3.,  3., 15.,  0.,  3.,  3.,  0.,  3., 12.,  3.,  6.,  0.,  3.,
        3., 15.,  3., 12., 12.,  3.,  3., 24.,  0.,  3.,  6.,  6.,  6.,
        0.,  6.,  6.,  3.,  3.,  3.,  3.,  6.,  3.,  0.,  3.,  3.,  0.,
        3.,  3.,  6.,  0.,  0.,  3.,  0.,  9.,  6.,  0.,  0.,  0.,  3.,
        3.,  3.,  3.,  3.,  3.,  0.,  3.,  3., 24., 24.,  0.,  3.,  3.,
        0.,  3.,  3., 12.,  3.,  3.,  3.,  0., 12.,  3., 18., 15

# 3. Compute Expected Scores

In this section, the computations from the previous section are packed into a function that processes every user in the test set. For each course a user has not enrolled in, a score is estimated; only scores at or above a threshold are kept.
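
The same scores can also be computed for all users at once with a single matrix product. The following is a sketch under assumptions, not the notebook's implementation: it assumes the column layout shown above (`user` plus genre columns in the profile table; `COURSE_ID`, `TITLE`, then genre columns in the course table), that every test user has exactly one profile row, and it runs on small made-up data frames. The function name `score_all_users` is hypothetical.

```python
import numpy as np
import pandas as pd

def score_all_users(profile_df, course_genres_df, test_users_df, score_threshold=10):
    """Hypothetical vectorized variant: one matrix product scores every user-course pair."""
    genre_cols = list(course_genres_df.columns[2:])  # assumes COURSE_ID, TITLE come first
    user_ids = test_users_df['user'].unique()
    course_ids = course_genres_df['COURSE_ID'].to_numpy()
    # (n_users, n_genres); selecting by name keeps the genre order aligned
    profiles = profile_df.set_index('user').loc[user_ids, genre_cols].to_numpy()
    # (n_courses, n_genres)
    course_matrix = course_genres_df[genre_cols].to_numpy()
    # (n_users, n_courses): entry [u, c] is the score of course c for user u
    scores = profiles @ course_matrix.T
    # Long format, row-major: one row per (user, course) pair
    long_df = pd.DataFrame({
        'USER': np.repeat(user_ids, len(course_ids)),
        'COURSE_ID': np.tile(course_ids, len(user_ids)),
        'SCORE': scores.ravel(),
    })
    # Flag already-enrolled pairs with a left merge and drop them
    enrolled = (test_users_df[['user', 'item']]
                .drop_duplicates()
                .rename(columns={'user': 'USER', 'item': 'COURSE_ID'}))
    merged = long_df.merge(enrolled, on=['USER', 'COURSE_ID'], how='left', indicator=True)
    keep = (merged['_merge'] == 'left_only') & (merged['SCORE'] >= score_threshold)
    return merged.loc[keep, ['USER', 'COURSE_ID', 'SCORE']].reset_index(drop=True)

# Made-up toy data with three genres (A, B, C), two users, and three courses
toy_profiles = pd.DataFrame({'user': [1, 2], 'A': [2, 1], 'B': [0, 3], 'C': [1, 0]})
toy_courses = pd.DataFrame({'COURSE_ID': ['C1', 'C2', 'C3'],
                            'TITLE': ['t1', 't2', 't3'],
                            'A': [1, 0, 1], 'B': [0, 1, 0], 'C': [0, 0, 1]})
toy_enrollments = pd.DataFrame({'user': [1, 2], 'item': ['C1', 'C2'], 'rating': [3.0, 3.0]})

toy_result = score_all_users(toy_profiles, toy_courses, toy_enrollments, score_threshold=1)
print(toy_result)
```

With about 1000 test users and 307 courses, the full score matrix is small enough to hold in memory, which is what makes this variant practical here; for much larger catalogs a per-user loop or batching may be preferable.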

In [48]:
def generate_recommendation_scores(profile_df, course_genres_df, test_users_df, score_threshold=10):
    """Generate recommendation scores
    for the users/students in the test set.
    A score is computed for each course the user has not enrolled in;
    only scores >= score_threshold are returned.
    """
    users = []
    courses = []
    scores = []
    # All course ids, computed here so that the function does not rely on a global variable
    all_courses = set(course_genres_df['COURSE_ID'].values)
    test_users = test_users_df.groupby(['user']).max().reset_index(drop=False)
    test_user_ids = test_users['user'].to_list()
    for user_id in test_user_ids:
        test_user_profile = profile_df[profile_df['user'] == user_id] # df (1,15)
        # Get the user vector for the current user id
        test_user_vector = test_user_profile.iloc[:,1:].values[0] # np.array (14,)

        # Get the unknown course ids for the current user id
        enrolled_courses = test_users_df[test_users_df['user'] == user_id]['item'].to_list()
        unknown_courses = all_courses.difference(enrolled_courses)
        unknown_course_df = course_genres_df[course_genres_df['COURSE_ID'].isin(unknown_courses)]
        unknown_course_ids = unknown_course_df['COURSE_ID'].values
        unknown_course_matrix = unknown_course_df.iloc[:, 2:].values # np.array (n_unknown,14)

        # Use np.dot() to get the recommendation scores for each course
        recommendation_scores = np.dot(unknown_course_matrix, test_user_vector)

        # Append the results to the users, courses, and scores lists
        for course_id, score in zip(unknown_course_ids, recommendation_scores):
            # Only keep the courses with a high recommendation score
            if score >= score_threshold:
                users.append(user_id)
                courses.append(course_id)
                scores.append(score)

    return users, courses, scores

In [49]:
# Return users, courses, and scores lists for the dataframe
users, courses, scores = generate_recommendation_scores(profile_df, course_genres_df, test_users_df)

In [50]:
res_dict = dict()
res_dict['USER'] = users
res_dict['COURSE_ID'] = courses
res_dict['SCORE'] = scores
res_df = pd.DataFrame(res_dict, columns=['USER', 'COURSE_ID', 'SCORE'])
# Save the dataframe 
res_df.to_csv("../data/profile_rs_results.csv", index=False)

In [51]:
res_df.head()

Unnamed: 0,USER,COURSE_ID,SCORE
0,37465,RP0105EN,27.0
1,37465,GPXX06RFEN,12.0
2,37465,CC0271EN,15.0
3,37465,BD0145EN,24.0
4,37465,DE0205EN,15.0
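
The scores become recommendations once they are ranked. As a minimal sketch (the helper name `top_k_per_user` and the choice of `k` are illustrative assumptions, not part of the notebook), the following keeps the `k` highest-scoring courses per user, using the five rows printed above as sample data:

```python
import pandas as pd

# Sample rows copied from the res_df.head() output above
res_sample = pd.DataFrame({
    'USER': [37465] * 5,
    'COURSE_ID': ['RP0105EN', 'GPXX06RFEN', 'CC0271EN', 'BD0145EN', 'DE0205EN'],
    'SCORE': [27.0, 12.0, 15.0, 24.0, 15.0],
})

def top_k_per_user(res_df, k=3):
    """Hypothetical helper: return the k highest-scoring courses for each user."""
    # Sort by user, then by descending score; COURSE_ID breaks ties deterministically
    ranked = res_df.sort_values(['USER', 'SCORE', 'COURSE_ID'],
                                ascending=[True, False, True])
    # head(k) per group keeps the first k rows of each user in sorted order
    return ranked.groupby('USER').head(k).reset_index(drop=True)

top3 = top_k_per_user(res_sample, k=3)
print(top3)
```

For this sample the three rows kept are RP0105EN (27.0), BD0145EN (24.0), and CC0271EN (15.0); the tie at 15.0 is resolved alphabetically by course id, which is an arbitrary choice made only to keep the output reproducible.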
