# Recommendation System

Simon Chen

In this mini-LA assignment, you will design and build a recommender system for an online learning system with a limited number of learning modules. The recommender system will suggest which module(s) a person should register for. It should be able to (a) make a reasonable suggestion to a brand-new learner (with or without background information), and (b) make a reasonable suggestion to a learner based on their prior history.

This is an open-ended project, and here are a couple of things you may want to keep in mind:

- The design of the recommender system can be hypothetical but you will need to work with some data to work out the implementation of your algorithm. You can use the data I provide (see below), simulate some data, or both.
- The main purpose of this assignment is to get you started on what it takes to build a recommender system. It does not have to be bound to the context in which the data below were collected.
- You have a lot of different choices for algorithms. To design the recommender system, you may need to compare/contrast/combine different techniques depending on different contexts.
- You may need to consider what “reasonable” means here in the context of making recommendations.
- Besides working out the implementation of the recommender system, you need to think about the meaning and limitations of the recommendations your system makes. You can draw on your own experience in this discussion (e.g., treating the modules as the courses you need to take here at Teachers College).

## Datasets

#### studentRegistration.csv
A data frame with 32593 rows and 5 variables:

- code_module
> Course name for which the student registered
- code_presentation
> Semester name for which the student registered
- id_student
> Unique student identifier; links to the student information dataset
- date_registration
> Date of the student's registration for the course, in days relative to the official start. It can be negative, meaning the student registered before the course started.
- date_unregistration
> Date the student unregistered from the course, in days relative to the official start. It can be negative, meaning the student unregistered before the course started. An NA value means the student finished the course.

#### studentInfo.csv
A data frame with 32593 rows and 12 variables:

- code_module
> Name of the course for which the student registered
- code_presentation
> Name of the semester for which the student registered
- id_student
> Unique integer identifying each student
- gender
> Student's gender
- region
> UK region in which the student lives
- highest_education
> Highest education level the student achieved before taking the course
- imd_band
> Index of Multiple Deprivation (see https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015) percentile band; students with an imd_band below 20% come from the most deprived regions
- age_band
> Age band of the student
- num_of_prev_attempts
> Number of previous attempts the student has made at the selected course
- studied_credits
> Total credits the student is studying at the Open University during the period of the course
- disability
> Whether the student has declared a disability of any type (logical)
- final_result
> Student's final result in the course

## Research Problem

Design and create a recommender system for an online learning system with a limited number of learning modules. The recommender system will suggest which module(s) a person should register for.

The recommender system should be able to:
- (a) make a reasonable suggestion to a brand-new learner (with or without background information)
- (b) make a reasonable suggestion to a learner based on their prior history
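
As a point of reference for these two requirements, the simplest baseline is probably a popularity ranking: a brand-new learner gets the most-registered module, and a returning learner gets the most-registered module they have not already taken. The sketch below is only an illustration on made-up data; `toy_history` and `recommend` are hypothetical names, not part of the provided datasets or of the approach developed later.

```python
import pandas as pd

# Made-up registration history (illustrative values only).
toy_history = pd.DataFrame({
    "id_student": [1, 1, 2, 3, 3, 4],
    "code_module": ["AAA", "BBB", "BBB", "BBB", "CCC", "CCC"],
})

def recommend(student_id, history, k=1):
    """Return up to k modules, most popular first, skipping modules already taken."""
    # value_counts sorts modules by number of registrations, descending.
    popularity = history["code_module"].value_counts()
    taken = set(history.loc[history["id_student"] == student_id, "code_module"])
    candidates = [m for m in popularity.index if m not in taken]
    return candidates[:k]

print(recommend(99, toy_history))  # unknown learner: falls back to overall popularity
print(recommend(1, toy_history))   # known learner: excludes AAA and BBB
```

A baseline like this ignores background information and outcomes entirely, which makes it a useful yardstick for judging whether a more elaborate design is actually "reasonable".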


## Import Datasets

Import the two datasets and preview them.

In [1]:
import numpy as np
import pandas as pd
info = pd.read_csv('studentInfo.csv')
info.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass


In [2]:
registration = pd.read_csv('studentRegistration.csv')
registration.head()

Unnamed: 0,code_module,code_presentation,id_student,date_registration,date_unregistration
0,AAA,2013J,11391,-159.0,
1,AAA,2013J,28400,-53.0,
2,AAA,2013J,30268,-92.0,12.0
3,AAA,2013J,31604,-52.0,
4,AAA,2013J,32885,-176.0,


## Data Processing

Merge the datasets and clean the missing data. 

In [3]:
# merge the two datasets on the keys they share
student = info.merge(registration, on=['code_module', 'code_presentation', 'id_student'])
student.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn,-92.0,12.0
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,
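
Before relying on a merged frame, it can help to confirm that each key combination appears once on both sides and that no rows are dropped. The sketch below shows one way to do that with the `validate` and `indicator` parameters of `DataFrame.merge`. The frames `info_toy` and `registration_toy` are made-up stand-ins, not the real files.

```python
import pandas as pd

# Made-up stand-ins for studentInfo.csv and studentRegistration.csv.
info_toy = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2014B"],
    "id_student": [1, 2, 1],
    "final_result": ["Pass", "Withdrawn", "Fail"],
})
registration_toy = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2014B"],
    "id_student": [1, 2, 1],
    "date_registration": [-159.0, -92.0, -30.0],
})

keys = ["code_module", "code_presentation", "id_student"]

# validate="one_to_one" raises a MergeError if a key combination repeats on either side.
# indicator=True adds a _merge column saying which side(s) each row came from.
merged_toy = info_toy.merge(
    registration_toy, on=keys, how="outer", validate="one_to_one", indicator=True
)

# Rows present in only one of the two frames.
unmatched = merged_toy[merged_toy["_merge"] != "both"]
print(len(merged_toy), len(unmatched))
```

With an outer join, any row that exists in only one file shows up in `unmatched` instead of silently disappearing, as it would with the default inner join.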


In [4]:
# Get the overview of the datasets.
student.describe()

Unnamed: 0,id_student,num_of_prev_attempts,studied_credits,date_registration,date_unregistration
count,32593.0,32593.0,32593.0,32548.0,10072.0
mean,706687.7,0.163225,79.758691,-69.4113,49.757645
std,549167.3,0.479758,41.0719,49.260522,82.46089
min,3733.0,0.0,30.0,-322.0,-365.0
25%,508573.0,0.0,60.0,-100.0,-2.0
50%,590310.0,0.0,60.0,-57.0,27.0
75%,644453.0,0.0,120.0,-29.0,109.0
max,2716795.0,6.0,655.0,167.0,444.0


In [5]:
# show the first rows that contain at least one missing value
student[student.isnull().any(axis=1)].head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,
5,AAA,2013J,38053,M,Wales,A Level or Equivalent,80-90%,35-55,0,60,N,Pass,-110.0,


In [7]:
# fill missing values with 0; note that fillna(0) applies to every column, not only date_unregistration
student_filled = student.fillna(0)
student_filled

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,0.0
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,0.0
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn,-92.0,12.0
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,0.0
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32588,GGG,2014J,2640965,F,Wales,Lower Than A Level,10-20,0-35,0,30,N,Fail,-4.0,0.0
32589,GGG,2014J,2645731,F,East Anglian Region,Lower Than A Level,40-50%,35-55,0,30,N,Distinction,-23.0,0.0
32590,GGG,2014J,2648187,F,South Region,A Level or Equivalent,20-30%,0-35,0,30,Y,Pass,-129.0,0.0
32591,GGG,2014J,2679821,F,South East Region,Lower Than A Level,90-100%,35-55,0,30,N,Withdrawn,-49.0,101.0
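
The summary statistics above show that `date_registration` also has missing entries (32548 non-null values out of 32593), so a frame-wide `fillna(0)` fills those too. If the intent is to fill only `date_unregistration`, a dictionary can be passed to `fillna`. This is a minimal sketch on made-up rows; `toy` and `toy_filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in two different columns.
toy = pd.DataFrame({
    "id_student": [1, 2, 3],
    "date_registration": [-159.0, np.nan, -52.0],
    "date_unregistration": [np.nan, 12.0, np.nan],
})

# A dict restricts the fill to the named column; other columns keep their NaN.
toy_filled = toy.fillna({"date_unregistration": 0})
print(toy_filled.isna().sum())
```

Keep in mind that 0 is also a valid day offset for `date_unregistration`, so after filling, "never unregistered" and "unregistered on day 0" can no longer be told apart.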


In [8]:
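# create a new column finished from date_unregistration to represent whether the student finished the course
# note: the map only matches the sentinel value 10000; missing dates were filled with 0 earlier,
# so no row matches and finished ends up 0 for every row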
d = {10000: 1}
student_filled['finished'] = student_filled['date_unregistration'].map(d)
student_filled = student_filled.fillna(0)
student_filled.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration,finished
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,0.0,0.0
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,0.0,0.0
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn,-92.0,12.0,0.0
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,0.0,0.0
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,0.0,0.0
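
In the output above, `finished` is 0.0 in every row, because no `date_unregistration` value equals the mapped key 10000. According to the dataset description, a missing `date_unregistration` means the student finished the course, so the flag could instead be derived from the missing values before they are filled. A minimal sketch on made-up rows (`toy` is an illustrative name):

```python
import numpy as np
import pandas as pd

# Made-up rows mirroring the first registrations: two finished, one unregistered on day 12.
toy = pd.DataFrame({
    "id_student": [11391, 28400, 30268],
    "date_unregistration": [np.nan, np.nan, 12.0],
})

# A missing unregistration date is read as "finished" (1); any recorded date as "not finished" (0).
toy["finished"] = toy["date_unregistration"].isna().astype(int)
print(toy)
```

This must run on the frame before `fillna`, while the missing values are still present.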


In [9]:
# check for the distribution of values for num_of_prev_attempts
student_filled['num_of_prev_attempts'].value_counts(normalize=True)

0    0.871997
1    0.101218
2    0.020710
3    0.004357
4    0.001197
5    0.000399
6    0.000123
Name: num_of_prev_attempts, dtype: float64

In [10]:
# show the unique values of final_result
student_filled['final_result'].unique()

array(['Pass', 'Withdrawn', 'Fail', 'Distinction'], dtype=object)

In [11]:
# check for the distribution of values for final_result
student_filled['final_result'].value_counts(normalize=True)

Pass           0.379253
Withdrawn      0.311601
Fail           0.216365
Distinction    0.092781
Name: final_result, dtype: float64

In [12]:
# create a new column final_grade to represent each student's final result numerically
r = {'Pass': 1, 'Withdrawn': 0, 'Fail': -1, 'Distinction': 2}
student_mapped = student_filled.copy()
student_mapped['final_grade'] = student_filled['final_result'].map(r)
student_mapped.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration,finished,final_grade
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,0.0,0.0,1
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,0.0,0.0,1
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn,-92.0,12.0,0.0,0
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,0.0,0.0,1
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,0.0,0.0,1


In [13]:
# show the unique values of highest_education
student_mapped['highest_education'].unique()

array(['HE Qualification', 'A Level or Equivalent', 'Lower Than A Level',
       'Post Graduate Qualification', 'No Formal quals'], dtype=object)

In [14]:
# create a new column highestEducation to represent each student's highest education numerically
e = {'HE Qualification': 3, 'A Level or Equivalent': 2, 'Lower Than A Level': 1, 'Post Graduate Qualification': 4, 'No Formal quals': 0}
student_mapped['highestEducation'] = student_filled['highest_education'].map(e)
student_mapped.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration,finished,final_grade,highestEducation
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,0.0,0.0,1,3
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,0.0,0.0,1,3
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn,-92.0,12.0,0.0,0,2
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,0.0,0.0,1,2
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,0.0,0.0,1,1
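
An alternative to a hand-written dictionary is an ordered categorical, which keeps the ranking of education levels in one list and yields the same integer codes. The sketch below assumes the same ordering as the mapping above; `levels` and `toy` are illustrative names.

```python
import pandas as pd

# Education levels from lowest to highest; the list position becomes the integer code.
levels = [
    "No Formal quals",
    "Lower Than A Level",
    "A Level or Equivalent",
    "HE Qualification",
    "Post Graduate Qualification",
]

toy = pd.Series(["HE Qualification", "A Level or Equivalent", "Lower Than A Level"])

# ordered=True records that the categories have a meaningful order.
encoded = pd.Categorical(toy, categories=levels, ordered=True)
codes = encoded.codes
print(list(codes))
```

A label that is not in `levels` receives the code -1, which makes unexpected values easy to spot.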


In [15]:
# create a new column Gender to represent each student's gender numerically
s = {'M': 0, 'F': 1}
student_mapped['Gender'] = student_mapped['gender'].map(s)
student_mapped.head()

Unnamed: 0,code_module,code_presentation,id_student,gender,region,highest_education,imd_band,age_band,num_of_prev_attempts,studied_credits,disability,final_result,date_registration,date_unregistration,finished,final_grade,highestEducation,Gender
0,AAA,2013J,11391,M,East Anglian Region,HE Qualification,90-100%,55<=,0,240,N,Pass,-159.0,0.0,0.0,1,3,0
1,AAA,2013J,28400,F,Scotland,HE Qualification,20-30%,35-55,0,60,N,Pass,-53.0,0.0,0.0,1,3,1
2,AAA,2013J,30268,F,North Western Region,A Level or Equivalent,30-40%,35-55,0,60,Y,Withdrawn,-92.0,12.0,0.0,0,2,1
3,AAA,2013J,31604,F,South East Region,A Level or Equivalent,50-60%,35-55,0,60,N,Pass,-52.0,0.0,0.0,1,2,1
4,AAA,2013J,32885,F,West Midlands Region,Lower Than A Level,50-60%,0-35,0,60,N,Pass,-176.0,0.0,0.0,1,1,1


In [16]:
# per-course summary: one row per module (seven in total), with the mean number of previous attempts and the mean of the finished flag
course = student_filled.groupby('code_module')[['num_of_prev_attempts', 'finished']].mean()
course

code_module,num_of_prev_attempts,finished
AAA,0.054813,0.0
BBB,0.211025,0.0
CCC,0.053,0.0
DDD,0.248087,0.0
EEE,0.054192,0.0
FFF,0.202654,0.0
GGG,0.034333,0.0


In [17]:
# per-course summary: one row per module (seven in total), with the mean number of previous attempts and the mean final grade
course2 = student_mapped.groupby('code_module')[['num_of_prev_attempts', 'final_grade']].mean()
course2

code_module,num_of_prev_attempts,final_grade
AAA,0.054813,0.647059
BBB,0.211025,0.336831
CCC,0.053,0.314614
DDD,0.248087,0.252073
EEE,0.054192,0.492161
FFF,0.202654,0.335867
GGG,0.034333,0.466456
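
The mean of `final_grade` depends on the numeric codes chosen for each outcome (for instance, Fail is coded below Withdrawn). A complementary view that does not depend on that coding is the share of each outcome per module. The following is a minimal sketch using `pd.crosstab` on made-up rows; `outcomes_toy` and `shares` are illustrative names.

```python
import pandas as pd

# Made-up outcomes for two modules.
outcomes_toy = pd.DataFrame({
    "code_module": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "final_result": ["Pass", "Pass", "Fail", "Distinction", "Withdrawn", "Pass"],
})

# normalize="index" divides each row by its total, so every module's shares sum to 1.
shares = pd.crosstab(
    outcomes_toy["code_module"], outcomes_toy["final_result"], normalize="index"
)
print(shares.round(2))
```

Comparing, say, the withdrawal share across modules can then sit alongside the single averaged score.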


In [18]:
course3 = student_mapped.groupby('code_module')[['num_of_prev_attempts', 'final_grade', 'finished', 'Gender', 'highestEducation']].mean()
# sort the dataset based on students' highest education
course3.sort_values(by='highestEducation', ascending=False)

code_module,num_of_prev_attempts,final_grade,finished,Gender,highestEducation
AAA,0.054813,0.647059,0.0,0.419786,2.057487
CCC,0.053,0.314614,0.0,0.248309,1.957375
EEE,0.054192,0.492161,0.0,0.11486,1.858214
DDD,0.248087,0.252073,0.0,0.401467,1.804528
FFF,0.202654,0.335867,0.0,0.18217,1.706905
BBB,0.211025,0.336831,0.0,0.88393,1.635352
GGG,0.034333,0.466456,0.0,0.806235,1.388713


# Content-Based Recommender
