# User-user collaborative filtering experiments
### 7/22/23
Source: https://medium.com/grabngoinfo/recommendation-system-user-based-collaborative-filtering-a2e76e3e15c4

### Imports

In [15]:
import pandas as pd
import numpy as np
import scipy as sp

import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity

import sys

### Data

In [3]:
student_data = pd.read_excel("dataset/StudentInformationTable.xlsx")
course_data = pd.read_excel("dataset/CourseInformationTable.xlsx")
career_data = pd.read_excel("dataset/CourseSelectionTable.xlsx")

In [4]:
display(student_data.describe())
display(student_data.head(5))

Unnamed: 0,StudentId,EnrollmentYear
count,4568.0,4568.0
mean,2284.5,2018.295972
std,1318.812344,1.191886
min,1.0,2000.0
25%,1142.75,2018.0
50%,2284.5,2019.0
75%,3426.25,2019.0
max,4568.0,2020.0


Unnamed: 0,StudentId,EnrollmentYear,Education,Major
0,1115,2018,Undergraduate,Biological Science
1,1108,2018,Undergraduate,Biological Science
2,1192,2018,Undergraduate,Urban and Rural Planning
3,1193,2018,Undergraduate,Urban and Rural Planning
4,1293,2018,Undergraduate,World History


In [5]:
display(course_data.describe())
display(course_data.head(5))

Unnamed: 0,CourseId,Grade
count,5591.0,5225.0
mean,2796.0,2.436842
std,1614.127009,0.939362
min,1.0,0.0
25%,1398.5,2.0
50%,2796.0,2.0
75%,4193.5,3.0
max,5591.0,12.0


Unnamed: 0,CourseId,CourseName,College,Type,Grade,Prerequisite,Introduction
0,362,Fascinating Robot,College of Engineering,Whole school optional,2.0,,This course is open to all students in the sch...
1,1045,Introduction to Seismology,School of Earth and Space Sciences,General elective course,2.0,,This course is a quality education general cou...
2,1647,Speeches and oral cultures in China,Department of Chinese Language and Literature,Whole school optional,2.0,,The course is based on the introduction and re...
3,1830,Modern Chinese History,Department of History,Required major,4.0,ancient Chinese history,This course is based on a large number of orig...
4,1834,Chinese Historiography,Department of History,optional,3.0,,This course is a compulsory course for undergr...


In [6]:
display(career_data.describe())
display(career_data.head(5))

Unnamed: 0,StudentId,Semester,CourseId,Score
count,208949.0,208941.0,208949.0,149223.0
mean,1878.77259,1.505344,2578.111147,81.154792
std,1245.936537,0.532999,1732.925391,13.84162
min,1.0,1.0,1.0,0.0
25%,778.0,1.0,750.0,78.0
50%,1695.0,1.0,2569.0,84.0
75%,2914.0,2.0,4151.0,90.0
max,4568.0,3.0,5591.0,100.0


Unnamed: 0,StudentId,AcademicYear,Semester,CourseId,CourseName,CourseCollege,Score
0,1115,18-19,1.0,146,Advanced Mathematics (B) (1),National School of Development,81.0
1,1115,18-19,1.0,148,Problem-solving on Higher Mathematics (B),School of Economics,
2,1115,18-19,1.0,654,General Chemistry Practice,College of Engineering,
3,1115,18-19,1.0,681,General Chemistry (B),Department of Medicine Teaching office,72.0
4,1115,18-19,1.0,684,General Chemistry Lab.（B）,Department of Medicine Teaching office,83.5


Note: We suspect the rows with a missing Score come from pass/fail courses, but this is speculation that we have not verified.
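
One rough way to probe the pass/fail guess is to look at the share of missing scores per course: if missing values are concentrated in courses where no student ever has a score, that is at least consistent with pass/fail grading. The sketch below uses a small made-up table (`toy`) in place of `career_data`, so the IDs and scores are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for career_data; the values here are invented for illustration
toy = pd.DataFrame({
    "StudentId": [1, 1, 2, 2, 3, 3],
    "CourseId":  [146, 148, 146, 148, 681, 148],
    "Score":     [81.0, np.nan, 90.0, np.nan, 72.0, np.nan],
})

# Fraction of rows with a missing Score, computed per course
missing_rate = toy["Score"].isna().groupby(toy["CourseId"]).mean()

# Courses where no enrolled student has a recorded score
never_scored = missing_rate[missing_rate == 1.0].index.tolist()

print(missing_rate)
print("Courses with no recorded scores:", never_scored)
```

On the real data, a missing rate that is close to 0 or 1 for most courses would support the guess, while many courses with intermediate rates would suggest another cause, such as withdrawals or scores not yet entered.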

### Data cleaning/merging

In [7]:
career_data_clean = career_data.dropna()

In [8]:
print("Number of students in data:  ", career_data_clean.StudentId.nunique())
print("Range of scores:             ", career_data_clean.Score.min(), career_data_clean.Score.max())
print("Unique scores in dataset:    ")
print(np.array(sorted(career_data_clean.Score.unique())))

Number of students in data:   4546
Range of scores:              0.0 100.0
Unique scores in dataset:    
[  0.    1.    1.5   2.    2.5   3.    3.5   4.    5.    5.5   6.    7.
   7.5   8.    9.    9.5  10.   11.   12.   12.5  13.   14.   14.5  15.
  16.   16.5  17.   18.   19.   19.5  20.   20.5  21.   22.   22.5  23.
  23.5  23.6  24.   25.   25.5  26.   27.   27.5  28.   28.5  29.   30.
  30.5  31.   32.   33.   34.   34.5  35.   35.5  36.   36.5  37.   37.5
  38.   38.5  39.   39.5  40.   40.5  41.   41.5  42.   42.5  43.   43.5
  44.   44.5  45.   45.5  46.   46.5  47.   47.5  48.   48.5  49.   49.5
  50.   50.5  51.   51.5  52.   52.5  53.   53.5  54.   54.5  55.   55.5
  56.   56.5  57.   57.5  58.   59.   59.5  60.   60.5  61.   61.5  62.
  62.5  63.   63.5  64.   64.5  65.   65.5  66.   66.5  66.6  67.   67.5
  68.   68.5  69.   69.5  70.   70.5  70.7  71.   71.5  72.   72.5  73.
  73.5  74.   74.5  75.   75.5  76.   76.5  77.   77.5  78.   78.5  79.
  79.5  80.   80.5  80.8  

In [9]:
career_student_data = pd.merge(career_data_clean, student_data, how='inner', on='StudentId')

In [10]:
display(career_student_data.describe())
display(career_student_data.head(5))

Unnamed: 0,StudentId,Semester,CourseId,Score,EnrollmentYear
count,149021.0,149021.0,149021.0,149021.0,149021.0
mean,1631.926044,1.430362,2723.731306,81.136139,2017.849874
std,1163.655105,0.527811,1690.240923,13.840801,1.213044
min,1.0,1.0,2.0,0.0,2000.0
25%,617.0,1.0,1103.0,78.0,2017.0
50%,1411.0,1.0,2740.0,84.0,2018.0
75%,2592.0,2.0,4152.0,90.0,2019.0
max,4568.0,3.0,5591.0,100.0,2020.0


Unnamed: 0,StudentId,AcademicYear,Semester,CourseId,CourseName,CourseCollege,Score,EnrollmentYear,Education,Major
0,1115,18-19,1.0,146,Advanced Mathematics (B) (1),National School of Development,81.0,2018,Undergraduate,Biological Science
1,1115,18-19,1.0,681,General Chemistry (B),Department of Medicine Teaching office,72.0,2018,Undergraduate,Biological Science
2,1115,18-19,1.0,684,General Chemistry Lab.（B）,Department of Medicine Teaching office,83.5,2018,Undergraduate,Biological Science
3,1115,18-19,1.0,748,Physiology,College of Life Sciences,85.0,2018,Undergraduate,Biological Science
4,1115,18-19,1.0,844,Physiology Lab.,College of Life Sciences,75.0,2018,Undergraduate,Biological Science


### Data partitioning
We should partition the data by student rather than by individual course record, so that all of a student's rows land in the same split. Instead of sampling random rows, we sample random students.

In [11]:
all_student_ids = career_student_data.StudentId.unique()
training_students = np.random.choice(all_student_ids, size=int(all_student_ids.size * .8), replace=False)
# Every student not drawn for training goes to the test set
testing_students = np.setdiff1d(all_student_ids, training_students)
training_data = career_student_data[career_student_data["StudentId"].isin(training_students)]
testing_data = career_student_data[career_student_data["StudentId"].isin(testing_students)]
print("Number of total students:    ", all_student_ids.size)
print("Number of training students: ", training_students.size)
print("Number of testing students:  ", testing_students.size)
display(training_data.describe())
display(testing_data.describe())

Number of total students:     4546
Number of training students:  3636
Number of testing students:   910


Unnamed: 0,StudentId,Semester,CourseId,Score,EnrollmentYear
count,118823.0,118823.0,118823.0,118823.0,118823.0
mean,1633.262045,1.429875,2717.31621,81.070651,2017.853682
std,1166.152414,0.527264,1693.641101,13.889531,1.201761
min,1.0,1.0,2.0,0.0,2000.0
25%,612.0,1.0,1100.0,78.0,2017.0
50%,1405.0,1.0,2736.0,84.0,2018.0
75%,2597.0,2.0,4152.0,90.0,2019.0
max,4568.0,3.0,5591.0,100.0,2020.0


Unnamed: 0,StudentId,Semester,CourseId,Score,EnrollmentYear
count,30198.0,30198.0,30198.0,30198.0,30198.0
mean,1626.66915,1.43228,2748.973409,81.393821,2017.83489
std,1153.780391,0.529959,1676.58455,13.644545,1.256368
min,4.0,1.0,10.0,0.0,2000.0
25%,631.0,1.0,1191.25,78.0,2017.0
50%,1416.0,1.0,2766.0,84.0,2018.0
75%,2561.0,2.0,4152.0,90.0,2019.0
max,4567.0,3.0,5591.0,100.0,2020.0
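
The split in cell In [11] draws students without a fixed seed, so the exact partition changes on every run. A minimal sketch of a reproducible variant is shown here on a made-up table (`toy`); the seed value 42 and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for career_student_data: 10 students with 3 course rows each
toy = pd.DataFrame({
    "StudentId": np.repeat(np.arange(1, 11), 3),
    "Score": np.linspace(60, 100, 30),
})

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for repeatability
student_ids = toy["StudentId"].unique()

# Sample 80% of the students (not rows) for training
train_ids = rng.choice(student_ids, size=int(student_ids.size * 0.8), replace=False)
test_ids = np.setdiff1d(student_ids, train_ids)

train = toy[toy["StudentId"].isin(train_ids)]
test = toy[toy["StudentId"].isin(test_ids)]

# No student should appear in both splits
assert set(train["StudentId"]).isdisjoint(test["StudentId"])
print(len(train_ids), "training students,", len(test_ids), "testing students")
```

Because whole students are assigned to one side, the row counts of the two splits will not be exactly 80/20 when students have taken different numbers of courses.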


The means and standard deviations of the training and testing sets look comparable. A two-sample t-test could check this more formally: it cannot affirm the null hypothesis, but a large p-value would mean we fail to find a significant difference between the two sets at the 1% level. I will leave this for later because we don't really need it.
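
If the check is ever needed, it could look roughly like the sketch below, which runs Welch's two-sample t-test from `scipy.stats`. The two arrays are synthetic stand-ins for `training_data.Score` and `testing_data.Score`, generated with a mean and spread loosely matching the tables above; they are assumptions for illustration, not the real scores.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two Score columns (not the real data)
train_scores = rng.normal(loc=81.0, scale=13.9, size=5000)
test_scores = rng.normal(loc=81.0, scale=13.9, size=1200)

# Welch's t-test: does not assume the two groups have equal variances
result = stats.ttest_ind(train_scores, test_scores, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

alpha = 0.01  # significance level matching the "99%" mentioned above
if p_value < alpha:
    print(f"t={t_stat:.3f}, p={p_value:.4f}: the means differ significantly")
else:
    print(f"t={t_stat:.3f}, p={p_value:.4f}: no significant difference detected")
```

Rows from the same student are not independent observations, so on the real data this is only a rough sanity check rather than a rigorous test.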

### Student-Course Matrix

In [22]:
n_students, n_courses = training_data.StudentId.nunique(), training_data.CourseId.nunique()
print("Our training matrix will be a", n_students, "by", n_courses, "table")
# A dense float64 array uses 8 bytes per cell; bytes / 1e6 gives MB, not GB
print("Assuming we store a float64 at each point for the score our table will occupy:", n_students * n_courses * np.dtype(np.float64).itemsize / 1e6, "MB")

Our training matrix will be a 3636 by 3474 table
Assuming we store a float64 at each point for the score our table will occupy: 101.051712 MB


This is feasible to hold in memory!
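
A minimal sketch of how such a student-course matrix could be built and fed to `cosine_similarity` is shown below. It uses a made-up table (`toy`) in place of `training_data`, and filling untaken courses with 0 is a simplifying assumption: it treats "not taken" like a score of 0, which a real experiment may want to replace with mean-centering or another scheme.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for training_data; IDs and scores are invented for illustration
toy = pd.DataFrame({
    "StudentId": [1, 1, 2, 2, 3, 3],
    "CourseId":  [146, 681, 146, 681, 146, 748],
    "Score":     [81.0, 72.0, 90.0, 80.0, 60.0, 85.0],
})

# Rows are students, columns are courses, cells are scores (NaN if not taken).
# aggfunc="mean" averages the scores if a student took a course more than once.
matrix = toy.pivot_table(index="StudentId", columns="CourseId",
                         values="Score", aggfunc="mean")

# Assumption: treat courses a student has not taken as 0
filled = matrix.fillna(0.0)

# Pairwise student-student cosine similarity, labelled by StudentId
similarity = pd.DataFrame(cosine_similarity(filled),
                          index=filled.index, columns=filled.index)

print(matrix.shape)
print(similarity.round(3))
```

In this toy example students 1 and 2 took the same courses with similar scores, so they come out as more similar to each other than either is to student 3.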