# Assignment 2: SoC Module Recommender System

## Instructions to submit the assignment

- Name your Jupyter notebook as `Assignment2_[StudentID].ipynb`. For instance: `Assignment2_A0123873A.ipynb`
- Your solution notebook must contain the Python code that we can run to verify the answers.
- Late submissions are penalised as follows:
  - late within 1 hour: 10% reduction in grade
  - late within 6 hours: 30% reduction in grade
  - late within 12 hours: 50% reduction in grade
  - late within 1 day: 70% reduction in grade
  - after 1 day: zero mark
- **This is an individual assessment. Refrain from working in groups.**

In this assignment we design a recommendation engine (*Don't worry about the effectiveness of the system. It may be very bad. The idea is just to offer you a proof of concept!*). The recommendation engine suggests to a student modules that closely match the modules the student has already taken. The dataset comprises two files:
- List of modules in the School of Computing 
- List of graduated students and the modules they had taken during their studies

A schematic diagram of the entire process is shown below:

![Schema]()

# Loading the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from nltk.corpus import stopwords

# suppress all warnings (the filter below is not limited to UserWarning)
import warnings
warnings.filterwarnings("ignore")

'''
    YOU MUST USE THE RANDOM SEED WHEREVER NEEDED OR RANDOM_STATE as 42.
'''
rng = np.random.default_rng(seed=42)

courses = pd.read_csv("courses.tsv", sep='\t')
students = pd.read_csv("students.tsv", sep='\t')

# Question 1: Creating the preprocessing pipeline (10 marks)

We want to create a sklearn pipeline to efficiently preprocess the data and prepare it for training a model. We use three different features in the `courses` data: `specialisation`, `info` and `workload`. We want to represent every feature in a numeric form and merge them to form a feature vector for every course. We do so in the following way:
- `specialisation` represents one of the six levels of the module. For instance: CS2103 is a Software Engineering (SE) specialisation module. Encode this categorical feature into a vector. The decision of handling missing values is left to you! *(Hint: You can use `MultiLabelBinarizer` to do so.)*
- `info` provides a short description of the module. We want to convert it into a vector using `CountVectorizer`. *Don't forget to remove the stopwords* while doing so.
- `workload` states the intended distribution of workload over lectures, tutorials, labs and self study. We want to find the workload as the sum of the individual workloads. For instance: a 3-1-1-3-2 workload transforms to 10 hours.
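
As a minimal sketch of the three encodings described above, the snippet below works on a toy frame rather than the real `courses.tsv`. The column values and the `"None"` placeholder for missing specialisations are illustrative assumptions. Note that `MultiLabelBinarizer` expects an iterable of label collections, so each specialisation string is wrapped in a one-element list here; a bare string would be read as a collection of single-character labels.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data standing in for courses.tsv (values are made up for illustration).
toy = pd.DataFrame({
    "specialisation": ["SE", None, "AI"],
    "info": ["Introduction to software testing",
             "The basics of computer networks",
             "Search and learning in intelligent agents"],
    "workload": ["3-1-1-3-2", "4-1-0-3-4", "2-1-0-3-2"],
})

# workload: split on '-' and sum the parts, e.g. "3-1-1-3-2" -> 10.0
workload = (toy["workload"].str.split("-")
            .apply(lambda parts: sum(map(float, parts)))
            .to_numpy().reshape(-1, 1))

# info: bag-of-words counts with English stop words removed
info = CountVectorizer(stop_words="english").fit_transform(toy["info"]).toarray()

# specialisation: one label per course, missing values mapped to "None"
labels = [[s] for s in toy["specialisation"].fillna("None")]
mlb = MultiLabelBinarizer()
spec = mlb.fit_transform(labels)

# Merge the three blocks column-wise into one feature matrix
features = np.hstack([workload, info, spec])
print(workload.ravel())   # [10. 12.  8.]
print(list(mlb.classes_)) # ['AI', 'None', 'SE']
print(features.shape[0])  # 3
```

Each row of `features` is the feature vector of one course; the number of columns depends on the vocabulary that `CountVectorizer` learns.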

Provide implementation for three classes that help us build the pipeline. `transformed_courses` should be a numpy array of shape `[n_courses X n_features]`.

                                                                                                   (6 marks)

In [2]:
class WorkloadTransformer:        
    def fit(self, X, y = None, **fit_params):
        return self
    
    def transform(self, X, y = None, **fit_params):
        
        # split on '-', sum the parts, and turn the result into a column vector
        tmp = X['workload'].str.strip().str.split('-').apply(lambda x : sum(map(float,x))).to_numpy().reshape(-1,1)

        return tmp
    

In [3]:
class InfoTransformer:        
    def fit(self, X, y = None, **fit_params):
        stop = stopwords.words('english')
        self.count_vect = CountVectorizer(stop_words = stop).fit(X['info'])
        return self
     
    def transform(self, X, y = None, **fit_params):
        
        info_count = self.count_vect.transform(X['info']).toarray()
        
        return info_count

In [4]:
class SpecTransformer:       
        
    def fit(self, X, y = None, **fit_params):    
        
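        # NOTE: each specialisation is passed as a plain string, so
        # MultiLabelBinarizer treats every character as a separate label.
        # Wrapping each value in a list (e.g. [[s] for s in ...]) would
        # encode whole specialisation names instead.
        self.mlb = MultiLabelBinarizer()
        self.mlb.fit(X['specialisation'].fillna('None').tolist())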
        return self
    
    def transform(self, X, y = None, **fit_params):
        

        return self.mlb.transform(X['specialisation'].fillna('None').tolist())

In [5]:
featureTransformer = FeatureUnion([
    ('workload_processing', Pipeline([('wrkld', WorkloadTransformer())])),
    ('info_processing', Pipeline([('info', InfoTransformer())])),
    ('spec_processing', Pipeline([('spec', SpecTransformer())]))
])

featureTransformer.fit(courses)
transformeed_courses = featureTransformer.transform(courses)
transformeed_courses.shape

(184, 2217)

Now we prepare our testing data in the same way we preprocessed the courses. The `students` data comprises 1000 students, each with a list of the modules they have taken.

Create `Xtest` and `Ytest` as two matrices. `Xtest`, of size `1000*5`, comprises the first five modules for every student in the list. `Ytest`, of size `1000*[remaining_modules]`, comprises the rest of the modules for every student in the list.
We do so in order to assess the performance of the recommender. We assess the recommender on how well it predicts the remaining modules given a list of five modules as the input.

For instance: 
- `Xtest[0] = ['CS2105', 'CS4222', 'CS6270', 'CS6205', 'CS4226']`
- `Ytest[0] = ['CS3282', 'CS6204', 'CS5223', 'CS3281', 'CS4344', 'CS5422', 'CS3237', 'CS5233']`.
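
A minimal sketch of this split on a toy frame, assuming each student's modules are stored as one comma-separated string in a `courses` column. The toy split point of 2 stands in for the 5 used in the assignment, and the module lists are made up.

```python
import pandas as pd

toy_students = pd.DataFrame({
    "courses": ["CS1010,CS2030,CS2040,CS3230", "CS1010,CS2100,CS2106"],
})

n_input = 2  # the assignment uses 5

# expand=True gives one column per module; shorter lists are padded with missing values
split = toy_students["courses"].str.split(",", expand=True)
X_toy = split.values[:, :n_input]
Y_toy = split.values[:, n_input:]

# Drop the padding before using the remaining modules as ground truth
y_clean = [[m for m in row if pd.notna(m)] for row in Y_toy]
print(X_toy.shape, Y_toy.shape)  # (2, 2) (2, 2)
print(y_clean)                   # [['CS2040', 'CS3230'], ['CS2106']]
```

Because students take different numbers of modules, the rows of the target matrix are padded to a common width, so the padding has to be filtered out before any metric is computed.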

<div align="right">(2 marks)</div>

In [6]:
# Write your code here
Xtest = students['courses'].str.split("," ,expand=True).values[:,:5]
print(Xtest.shape)


Ytest = students['courses'].str.split(",", expand=True).values[:,5:]
print(Ytest.shape)
Xtest[0]


(1000, 5)
(1000, 21)


array(['CS5422', 'CS5223', 'CS4237', 'CS3281', 'CS6213'], dtype=object)

For every student in `Xtest`, we need to transform the list of 5 modules to the feature space using the `featureTransformer` fit on the training data. For every module we will get a feature vector of size `n_features`. We *add* these feature vectors to get an aggregate feature vector for every student.

Write a function `getFeatureVector` that takes in the list of modules and `featureTransformer`. It returns the feature vector for the specified list of courses. For instance, `getFeatureVector(Xtest[0], featureTransformer)` will return a vector of size `n_features`.
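
The aggregation step can be sketched in isolation as follows. The feature matrix, the module codes and the helper name `aggregate` are hypothetical; in the assignment the rows would come from `featureTransformer.transform`.

```python
import numpy as np

# Hypothetical feature matrix: one row per module, three features each.
codes = np.array(["CS1010", "CS2030", "CS2040"])
features = np.array([[10, 1, 0],
                     [ 8, 0, 1],
                     [12, 1, 1]])

def aggregate(modules, codes, features):
    """Sum the feature rows of the given modules (unknown codes are ignored)."""
    mask = np.isin(codes, modules)
    return features[mask].sum(axis=0)

vec = aggregate(["CS1010", "CS2040"], codes, features)
print(vec)        # [22  2  1]
print(vec.shape)  # (3,)
```

A membership test like this ignores the order of the modules and counts a repeated code only once, which is acceptable when each student takes a module at most once.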

<div align="right">(2 marks)</div>

In [7]:
def getFeatureVector(modules, featureTransformer):
    # Look up the requested modules in the `courses` table, then sum their feature vectors
    modules_in_courses = courses[courses['code'].isin(modules)]
    vector = featureTransformer.transform(modules_in_courses).sum(axis = 0)
    return vector

getFeatureVector(Xtest[0],featureTransformer).shape

(2217,)

# Question 2: Content based recommender (4 marks)

We can use a model as simple as K-nearest neighbours (KNN) to perform content-based recommendation. If we provide a list of 5 modules to the recommender, it provides us with a list of modules that are similar to the specified modules.

`sklearn` provides `NearestNeighbors` as well as `KNeighborsClassifier`, both of which have similar functionality. `NearestNeighbors` provides an easy way to retrieve a list of the K nearest neighbours. Therefore, we prefer it over `KNeighborsClassifier`. If we want to find the K nearest points to a datapoint `d` that is itself in the training data, we need to use `n_neighbors` as K + 1 because the list includes `d` itself.
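
The K + 1 behaviour can be checked on a toy example. The four points below are made up; the sketch only illustrates that a query taken from the fitted data comes back as its own nearest neighbour at distance 0.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up points on a line
points = np.array([[0.0], [1.0], [3.0], [7.0]])

K = 2
nn = NearestNeighbors(n_neighbors=K + 1, algorithm="brute").fit(points)

# Query with a point that is part of the fitted data (index 1)
distances, indices = nn.kneighbors(points[[1]])
print(indices[0])    # [1 0 2] -> the first entry is the query point itself
print(distances[0])  # [0. 1. 2.]

neighbours = indices[0][1:]  # drop the query point to keep the K true neighbours
print(neighbours)    # [0 2]
```

When the query is not one of the fitted points, for example an aggregate of several course vectors, nothing is guaranteed to sit at distance 0, and all returned indices are genuine neighbours.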

You can now train the model using the training data, which comprises `transformed_courses` with the module codes as the labels. 
<div align="right">(1 mark)</div>

In [8]:
K = 5
model = NearestNeighbors(algorithm = "brute", n_neighbors = K + 1).fit(transformeed_courses)
## Write your code here
#model.fit(transformeed_courses)

It is time to see our model in action. Let's see what modules our model recommends based on the modules taken by a student.

Write a function that takes in a *pre-trained* model of your choice as input and the list of modules. It returns the top-K recommendations of the model. Print the top 6 recommendations for the first student. 
<div align="right">(3 marks)</div>

In [9]:
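def recommend(model, modulesTaken, k):
    # Aggregate feature vector of the modules taken, as a single-row 2-D array
    modulesTaken_vector = getFeatureVector(modulesTaken, featureTransformer).reshape(1, -1)
    # Row positions of the k nearest courses in the fitted feature space
    distances, idx = model.kneighbors(modulesTaken_vector, k)
    idx = idx.ravel()
    # Map row positions back to module codes
    recommendations = courses.loc[idx, 'code']
    return recommendations.values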

print(recommend(model, Xtest[0], 6))

['CS3203' 'CS3205' 'CS5223' 'CS2020' 'CS3216' 'CS6213']


# Question 3: Recommender evaluation (6 marks)

Is this model any good? To assess the performance of the model, we use **precision** and **recall** as our metrics. `Ytest` consists of the true labels for every student. Using those labels as the ground truth, compute the precision and recall for every student. Write code that prints the average precision and recall for a specific value of `K` over the `students` dataset. Print the average precision and average recall for `K = 10`.
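
For a single student, precision is the fraction of recommended modules that the student actually took, and recall is the fraction of the student's remaining modules that were recommended. A minimal sketch for one student follows; the helper name `precision_recall` and the two module lists are illustrative, not taken from the dataset.

```python
def precision_recall(recommended, relevant):
    """Precision and recall of one recommendation list against the true modules."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["CS3203", "CS3205", "CS5223", "CS2020"]
relevant = ["CS5223", "CS3281", "CS3203", "CS4344", "CS5422"]

p, r = precision_recall(recommended, relevant)
print(p, r)  # 0.5 0.4
```

Two of the four recommendations are correct (precision 2/4), and they cover two of the five true modules (recall 2/5). Averaging these per-student values over all students gives the reported metrics.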

                                                                                                     (2 marks)

In [10]:
# Write your code here
def recom_evaluation(model,X,Y, k):
    precision_list = []
    recall_list = []
    for i in range(X.shape[0]):
        y_pred = recommend(model, X[i], k)
        y =[w for w in Y[i] if w is not None]  # filter out the None value
        precision = len(set(y_pred)& set(y))/len(y_pred)
        precision_list.append(precision)

        recall = len(set(y_pred)& set(y))/len(y)
        recall_list.append(recall)
    
    precision_list = np.array(precision_list)
    precision = precision_list.mean()

    recall_list = np.array(recall_list)
    recall = recall_list.mean()
    return precision, recall

pre, rec = recom_evaluation(model,Xtest,Ytest,10)
print("The average precision is: ",pre)
print("The average recall is: ",rec)

The average precision is:  0.056799999999999996
The average recall is:  0.05652216743954361


We observe that neither precision nor recall is great. The reason might be the high feature dimension, which may also be noisy. Extend the existing `featureTransformer` with a PCA step to reduce the dimension. 

Print the value of average precision and recall for `K= 10` after the introduction of PCA.
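
A minimal sketch of chaining a feature step with PCA, using random data and an identity `FunctionTransformer` as a stand-in for the real feature step. The choice of 5 components is an arbitrary assumption for illustration. When `n_components` is left unset, PCA keeps `min(n_samples, n_features)` components, so setting it explicitly is what actually shrinks the feature space.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(seed=42)
X = rng.normal(size=(50, 20))  # made-up feature matrix: 50 samples, 20 features

reducer = Pipeline([
    ("feats", FunctionTransformer()),               # identity placeholder for the feature step
    ("pca", PCA(n_components=5, random_state=42)),  # keep 5 principal components
])

X_reduced = reducer.fit_transform(X)
print(X_reduced.shape)  # (50, 5)
```

Any query vector must go through the same fitted pipeline via `reducer.transform` before it is compared with the reduced training data.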

                                                                                                     (2 marks)

In [11]:
pca = PCA()
featureTransformer = Pipeline([('feats',featureTransformer),('pca',pca)])
pca_courses = featureTransformer.fit_transform(courses)
print(pca_courses.shape)

model2 = NearestNeighbors(algorithm = "brute", n_neighbors = 10).fit(pca_courses)
pre, rec = recom_evaluation(model2,Xtest,Ytest,10)
print("The average precision after the introduction of PCA is: ",pre)
print("The average recall after the introduction of PCA is: ",rec)

(184, 184)
The average precision after the introduction of PCA is:  0.1195
The average recall after the introduction of PCA is:  0.12975396668604872


Can you provide some **concrete** (something that you can implement) suggestions to improve the performance of the system? The improvement does not have to be very significant.

                                                                                                    (2 marks)

In [12]:
# Write your code here
missing_number = courses['specialisation'].isna().sum()
total_number = len(courses['specialisation'])
print('the number of courses with missing specialisation: ',missing_number)
print('the total number of courses: ',total_number)
print('the missing percentage of specialisation: ',missing_number/total_number)

the number of courses with missing specialisation:  30
the total number of courses:  184
the missing percentage of specialisation:  0.16304347826086957


The specialisation is missing for 30 courses, which account for 16% of all courses. As a result, these courses cannot be matched effectively by specialisation. To improve this, the missing values can be tracked down manually and filled in with the correct specialisation.

In [13]:
CountVectorizer(stop_words = 'english').fit(courses['info']).vocabulary_

{'module': 1224,
 'introduces': 1047,
 'fundamental': 838,
 'concepts': 373,
 'problem': 1469,
 'solving': 1788,
 'computing': 371,
 'programming': 1490,
 'using': 2026,
 'imperative': 961,
 'language': 1079,
 'foremost': 812,
 'introductory': 1050,
 'course': 446,
 'series': 1729,
 'includes': 977,
 'cs1020': 476,
 'cs2010': 478,
 'topics': 1951,
 'covered': 450,
 'include': 975,
 'writing': 2104,
 'pseudo': 1520,
 'codes': 308,
 'basic': 169,
 'formulation': 821,
 'program': 1485,
 'development': 566,
 'coding': 309,
 'testing': 1920,
 'debugging': 512,
 'constructs': 405,
 'variables': 2037,
 'types': 1989,
 'expressions': 755,
 'assignments': 128,
 'functions': 837,
 'control': 420,
 'structures': 1844,
 'data': 495,
 'arrays': 114,
 'strings': 1840,
 'simple': 1758,
 'file': 789,
 'processing': 1475,
 'recursion': 1584,
 'appropriate': 96,
 'soc': 1778,
 'students': 1847,
 'equivalent': 697,
 'cs1010': 473,
 'cs1010s': 475,
 'cs1010e': 474,
 'methodology': 1200,
 'taught': 1894,
 

Although we have removed the stopwords from the info column, many uninformative words such as "first", "also" and "using" still appear frequently. These words make our clustering less effective. To improve the model's performance, we need to clean the info text further, removing these uninformative words while keeping the specific nouns.
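
One way to implement this suggestion is to extend the stop-word list passed to `CountVectorizer`. The sketch below uses scikit-learn's built-in English list plus a small hand-picked set; the extra words and the two sample descriptions are assumptions for illustration, and the right list for the real `info` column would need to be chosen by inspecting its vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

docs = [
    "This module introduces programming using Python",
    "This module covers networks using simulation",
]

# Hypothetical extra words judged uninformative for this corpus
extra_stop_words = {"module", "using", "introduces", "covers"}
stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

vectorizer = CountVectorizer(stop_words=stop_words).fit(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # ['networks', 'programming', 'python', 'simulation']
```

Only the topic-specific nouns survive, so two course descriptions share a feature only when they share actual subject matter.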