# Lecture 6 - Student Notebook

ASSISTments is a free tool for assigning and assessing math problems and homework. Teachers can select and assign problem sets. Once they get an assignment, students can complete it at their own pace with the help of hints, multiple chances, and immediate feedback. Teachers get instant results broken down by individual student or for the whole class. The dataset involves 4,217 middle-school students practicing with an electronic tutor that teaches and evaluates students in grade-school math, with a total of 525,534 trials. The student data are in a comma-delimited text file with one row per trial. The columns correspond to a trial's user id, the order id (timestamp), the skill name, and whether the student produced a correct response in the trial. More information on the platform can be found [here](https://www.commonsense.org/education/website/assistments).

The ASSISTments data sets are often used for benchmarking knowledge tracing models. We will play with a simplified data set that contains the following columns:

| Name | Description |
| ---------------------- | ------------------------------------------------------------ |
| user_id | The ID of the student who is solving the problem. |
| order_id | The temporal ID (timestamp) associated with the student's answer to the problem. |
| skill_name | The name of the skill associated with the problem. |
| correct | The student's performance on the problem: 1 if the problem's answer is correct at the first attempt, 0 otherwise. |

We first load the data set.

In [9]:
# Principal package imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy as sc

# Scikit-learn package imports
from sklearn import feature_extraction, model_selection
from sklearn.metrics import root_mean_squared_error, mean_squared_error, roc_auc_score

# PyBKT package imports
import random
old_randint = random.randint
random.randint = lambda a, b: old_randint(a, int(b))
from pyBKT.models import Model
random.randint = old_randint

DATA_DIR = "./../../data/"

In [3]:
assistments = pd.read_csv(DATA_DIR + 'assistments.csv', low_memory=False).dropna()
assistments.head()

Unnamed: 0,user_id,order_id,skill_name,correct
0,64525,33022537,Box and Whisker,1
1,64525,33022709,Box and Whisker,1
2,70363,35450204,Box and Whisker,0
3,70363,35450295,Box and Whisker,1
4,70363,35450311,Box and Whisker,0


Next, we print the number of unique students and skills in this data set.

In [4]:
print("Number of unique students in the dataset:", len(set(assistments['user_id'])))
print("Number of unique skills in the dataset:", len(set(assistments['skill_name'])))

Number of unique students in the dataset: 4151
Number of unique skills in the dataset: 110


To keep things simpler for demonstration purposes, we will focus on the following 6 skills in this lecture:  
`'Circle Graph', 'Venn Diagram', 'Mode', 'Division Fractions', 'Finding Percents', 'Area Rectangle'`

In [5]:
skills_subset = ['Circle Graph', 'Venn Diagram', 'Mode', 'Division Fractions', 'Finding Percents', 'Area Rectangle']
data = assistments[assistments['skill_name'].isin(skills_subset)]

print("Skill set:", set(data['skill_name']))
print("Number of unique students in the subset:", len(set(data['user_id'])))
print("Number of unique skills in the subset:", len(set(data['skill_name'])))

Skill set: {'Area Rectangle', 'Division Fractions', 'Mode', 'Venn Diagram', 'Finding Percents', 'Circle Graph'}
Number of unique students in the subset: 1527
Number of unique skills in the subset: 6


## BKT Models - Training & Prediction

We will use a train-test setting (20% of students in the test set). The `create_iterator` function creates an iterator that splits the students' interactions in `data` into a single train/test split (`n_splits=1`), such that the same student does not appear in both the training and the test set. To do so, we initialize a scikit-learn `GroupShuffleSplit` iterator with an 80% training set size, using the user id as the group, and return the iterator.

In [6]:
def create_iterator(data):
    '''
    Create an iterator that splits the interactions in data into train and test sets, such that no student appears in both sets.
    :param data:        Dataframe with students' interactions.
    :return:            An iterator.
    '''
    # Passing either a matrix with the raw data or just an array of indices works
    X = np.arange(len(data.index))
    # Groups of interactions are identified by the user id (we do not want the same user appearing in both sets)
    groups = data['user_id'].values
    return model_selection.GroupShuffleSplit(n_splits=1, train_size=.8, test_size=0.2, random_state=0).split(X, groups=groups)
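
To see what the group-based split guarantees, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the ASSISTments data; only the `GroupShuffleSplit` call mirrors the one used in `create_iterator`.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Toy interaction log (hypothetical data): 5 students, 4 interactions each
toy = pd.DataFrame({
    'user_id': np.repeat([1, 2, 3, 4, 5], 4),
    'correct': [0, 1, 1, 1] * 5,
})

# Same splitter configuration as in create_iterator
splitter = model_selection.GroupShuffleSplit(n_splits=1, train_size=0.8, test_size=0.2, random_state=0)
train_index, test_index = next(splitter.split(np.arange(len(toy)), groups=toy['user_id'].values))

# Collect the students that ended up on each side of the split
train_users = set(toy.iloc[train_index]['user_id'])
test_users = set(toy.iloc[test_index]['user_id'])

print("Train students:", sorted(train_users))
print("Test students:", sorted(test_users))
print("Students in both sets:", train_users & test_users)
```

Because the split is made over students rather than over rows, all interactions of a given student land on the same side, so the set of students shared by both sides should be empty.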

Next, we train a BKT model for each skill on the training data set and then predict on the test data set.
We obtain `df_preds`, a data frame containing the predictions for each user and skill in the test data set. We output the overall RMSE and AUC scores.

In [10]:
rmse_bkt, auc_bkt = [], []
df_preds = pd.DataFrame()
# Train a BKT model for each skill
for skill in skills_subset:
    print("--", skill, "--")
    skill_data = data[data['skill_name'] == skill]
    for iteration, (train_index, test_index) in enumerate(create_iterator(skill_data)):
        # Split data in training and test sets
        X_train, X_test = skill_data.iloc[train_index], skill_data.iloc[test_index]
        # Initialize and fit the model
        model = Model(seed=0)
        %time model.fit(data=X_train) 
        # Compute predictions
        preds = model.predict(data=X_test)[['user_id', 'skill_name', 'correct', 'correct_predictions']]
        df_preds = pd.concat([df_preds, preds], ignore_index=True)
        
# Print the resulting dataframe
display(df_preds)

# Compute overall RMSE and AUC
rmse = root_mean_squared_error(df_preds.correct, df_preds.correct_predictions)
AUC = roc_auc_score(df_preds.correct, df_preds.correct_predictions)
print('RMSE:', rmse, 'AUC:', AUC)

-- Circle Graph --
CPU times: user 3.78 s, sys: 4.18 ms, total: 3.78 s
Wall time: 1.86 s
-- Venn Diagram --
CPU times: user 4.1 s, sys: 5.78 ms, total: 4.1 s
Wall time: 2.08 s
-- Mode --
CPU times: user 3.07 s, sys: 4.83 ms, total: 3.07 s
Wall time: 1.49 s
-- Division Fractions --
CPU times: user 2.63 s, sys: 6.05 ms, total: 2.64 s
Wall time: 1.28 s
-- Finding Percents --
CPU times: user 3.67 s, sys: 7.27 ms, total: 3.68 s
Wall time: 1.87 s
-- Area Rectangle --
CPU times: user 2.09 s, sys: 2.65 ms, total: 2.1 s
Wall time: 1.08 s


Unnamed: 0,user_id,skill_name,correct,correct_predictions
0,64525,Circle Graph,1,0.46445
1,64525,Circle Graph,1,0.63961
2,64525,Circle Graph,1,0.69391
3,64525,Circle Graph,0,0.70517
4,64525,Circle Graph,1,0.70061
...,...,...,...,...
9546,96264,Area Rectangle,1,0.89293
9547,96264,Area Rectangle,1,0.97977
9548,96270,Area Rectangle,1,0.89293
9549,96292,Area Rectangle,1,0.89293


RMSE: 0.3566992765057055 AUC: 0.8634020384988604
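
As a reminder of what these two numbers measure, the following sketch recomputes both metrics by hand on a handful of made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not taken from `df_preds`). RMSE is the square root of the mean squared difference between labels and predicted probabilities; AUC can be read as the fraction of (correct, incorrect) answer pairs in which the correct answer received the higher predicted probability, with ties counted as one half.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (hypothetical values)
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.3, 0.6, 0.4, 0.5])

# RMSE: square root of the mean squared difference
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# AUC: fraction of (positive, negative) pairs ranked correctly, ties count 0.5
pos = y_pred[y_true == 1]
neg = y_pred[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc_manual = np.mean((diff > 0) + 0.5 * (diff == 0))

print("RMSE (manual):", rmse_manual)
print("AUC (manual):", auc_manual, "AUC (sklearn):", roc_auc_score(y_true, y_pred))
```

Lower RMSE is better, while higher AUC is better; an AUC of 0.5 corresponds to ranking correct and incorrect answers no better than chance.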


## Your Turn - Training & Prediction

The overall RMSE and AUC may hide differences between skills. Your task is to:
1. Compute one of the metrics (RMSE or AUC) separately for each skill. 
2. Compute the mean of the selected metric (+ standard deviation) over all skills.
3. Create a visualization that displays: the mean of the metric (+ standard deviation) over all skills *and* the metric per skill.
4. Discuss your findings.
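
As a hint for steps 1 and 2, one possible pattern is sketched below on a toy data frame. The frame `toy_preds`, the skill names `A` and `B`, and the helper `rmse` are hypothetical; only the column names follow `df_preds`. This is one way to approach it, not the reference solution, and the same pattern should carry over to AUC.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions with the same column names as df_preds
toy_preds = pd.DataFrame({
    'skill_name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'correct': [1, 0, 1, 1, 0, 0, 1, 1],
    'correct_predictions': [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.5],
})

def rmse(group):
    # Root mean squared error between labels and predicted probabilities
    return np.sqrt(np.mean((group['correct'] - group['correct_predictions']) ** 2))

# One metric value per skill
rmse_per_skill = pd.Series({skill: rmse(group) for skill, group in toy_preds.groupby('skill_name')})

print(rmse_per_skill)
print("Mean:", rmse_per_skill.mean(), "Std:", rmse_per_skill.std())
```

Note that `Series.std()` in pandas computes the sample standard deviation (`ddof=1`) by default, whereas `np.std` uses `ddof=0`; with only six skills the two can differ noticeably, so state which one you report.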

In [None]:
import requests

exec(requests.get("https://courdier.pythonanywhere.com/get-send-code").content)

npt_config = {
    'session_name': 'lecture-06',
    'session_owner': 'mlbd',
    'sender_name': input("Your name: "),
}

In [None]:
# YOUR TURN: Your code for computing the metrics goes here

### Share your metric visualization plot with us
plt.hist(...)
send(plt, 1)
plt.show()

### Share your analysis of the metric 
metric_discussion = ""
send(metric_discussion, 2)