## Model Training and Evaluation

The following sections contain the code for training and evaluating several models. Each model uses a combination of the following features:

- level
- difficulty
- learning_stage
- gender
- user_grade
- has_teacher_cnt
- is_self_coach
- has_student_cnt
- belongs_to_class_cnt
- has_class_cnt
- m_level4_proficiency matrix
- m_concept_proficiency matrix
- v_upid_acc matrix
- v_ucid_acc matrix

In each model, the output variable is `is_correct`, i.e., whether the student answered the given problem correctly. Each subsection contains the code that creates the training and test data, and we report accuracy for training and test sets of different sizes.

***

In [56]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [46]:
# Path
PATH_FEATURE_STORE = '../data/feature_store'
PATH_PREPROCESSED_INPUT = '../data/experiment'

# Files
FILE_LOG = os.path.join(PATH_FEATURE_STORE ,'df_log_with_upid_acc.parquet.gzip')
FILE_USER_PROCESSED = os.path.join(PATH_PREPROCESSED_INPUT ,'Processed_Info_UserData_train.parquet.gzip')
FILE_CONTENT_PROCESSED = os.path.join(PATH_PREPROCESSED_INPUT ,'Processed_Info_Content_train.parquet.gzip')

# Feature files
FILE_M_CONCEPT_PROFICIENCY = os.path.join(PATH_FEATURE_STORE, 'm_concept_proficiency.npz')

In [3]:
# Load data
df_user = pd.read_parquet(FILE_USER_PROCESSED)
df_content = pd.read_parquet(FILE_CONTENT_PROCESSED)
df_log = pd.read_parquet(FILE_LOG)

# Join the log with the user table (on uuid and user_grade) and with the content table (on ucid)
df1 = pd.merge(df_log, df_user, how='inner', on=['uuid', 'user_grade'])  # NOTE: user_grade appears in both tables
df2 = pd.merge(df1, df_content, on='ucid')
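
The two joins above assume that each log row matches exactly one user row and one content row. A minimal sketch of how that assumption could be checked, using small made-up tables (`log`, `user`, and `content` below are illustrative stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for df_log, df_user and df_content (values are made up).
log = pd.DataFrame({"uuid": ["u1", "u1", "u2"],
                    "ucid": ["c1", "c2", "c1"],
                    "is_correct": [1, 0, 1]})
user = pd.DataFrame({"uuid": ["u1", "u2"], "gender": ["male", "female"]})
content = pd.DataFrame({"ucid": ["c1", "c2"], "difficulty": ["easy", "hard"]})

# validate="many_to_one" raises MergeError if a key is duplicated on the right side,
# which would otherwise silently multiply log rows.
merged = log.merge(user, on="uuid", how="inner", validate="many_to_one")
merged = merged.merge(content, on="ucid", how="inner", validate="many_to_one")

# An inner join drops log rows without a match, so compare row counts.
print(len(log), len(merged))
```

If the row counts differ, some log rows had no matching user or content record.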

In [4]:
df = df2.copy()

In [7]:
# Assign integer category labels to the gender, difficulty and learning_stage columns.
df.loc[:, 'gender'] = df.loc[:, 'gender'].replace({'unspecified': 0, 'male': 1, 'female': 2})
df.loc[:, 'difficulty'] = df.loc[:, 'difficulty'].replace({'unset': 0, 'easy': 1, 'normal': 2, 'hard': 3})
df.loc[:, 'learning_stage'] = df.loc[:, 'learning_stage'].replace({'elementary': 0, 'junior': 1, 'senior': 2})
df = df.dropna()
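
As an alternative to `replace`, the same encoding can be sketched with `Series.map`, which turns any value missing from the dictionary into `NaN` so that unexpected categories are caught by the subsequent `dropna()`. The frame `toy` below is an illustrative stand-in, not the real data:

```python
import pandas as pd

# Toy frame standing in for the merged table (values are made up).
toy = pd.DataFrame({
    "gender": ["male", "female", "unspecified", None],
    "difficulty": ["easy", "hard", "unset", "normal"],
})

gender_codes = {"unspecified": 0, "male": 1, "female": 2}
difficulty_codes = {"unset": 0, "easy": 1, "normal": 2, "hard": 3}

# Values that are not keys of the dictionary (including None) become NaN.
toy["gender"] = toy["gender"].map(gender_codes)
toy["difficulty"] = toy["difficulty"].map(difficulty_codes)

# Drop rows with a missing or unmapped category, then store compact integer codes.
toy = toy.dropna().astype({"gender": "int8", "difficulty": "int8"})
print(toy)
```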

### Model 1: Benchmark model - Logistic Regression

#### Input Features
- level
- difficulty
- learning_stage
- gender
- user_grade
- has_teacher_cnt
- is_self_coach
- has_student_cnt
- belongs_to_class_cnt
- has_class_cnt

#### Output Feature
- is_correct

In [8]:
# Select only required columns 
required_columns = ['is_correct', 'total_attempt_cnt', 'user_grade',
                    'used_hint_cnt', 'level', 'difficulty', 'learning_stage', 
                    'gender',  'has_teacher_cnt', 'is_self_coach', 
                    'has_student_cnt', 'belongs_to_class_cnt', 'has_class_cnt']
                    # ['total_sec_taken', 'is_hint_used'] not in index
df_logistic = df[required_columns]

In [10]:
# Convert DataFrame to numpy array
input_data = df_logistic.to_numpy()
n = input_data.shape[0]

In [35]:
n

3414657

In [None]:
# Split the data 80/20 into training and evaluation sets

num_samples = int(n * 0.8)
samples = np.random.choice(range(n), num_samples, replace=False)
mask = np.ones(n, dtype=bool)
mask[samples] = False

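# Columns 5 onward are the inputs: difficulty, learning_stage, gender, has_teacher_cnt,
# is_self_coach, has_student_cnt, belongs_to_class_cnt, has_class_cnt (8 features).
# NOTE: this slice leaves out level (column 4) and user_grade (column 2).
X_train = input_data[samples, 5:]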
y_train = input_data[samples, 0]
y_train = np.reshape(y_train, (num_samples, 1))
y_train = y_train.astype('int')

X_eval = input_data[mask, 5:]
y_eval = input_data[mask, 0]
y_eval = np.reshape(y_eval, (n - num_samples, 1))
y_eval = y_eval.astype('int')

print('X_train shape is = ', np.shape(X_train))
print('y_train shape is = ', np.shape(y_train))
print('X_eval shape is = ', np.shape(X_eval))
print('y_eval shape is = ', np.shape(y_eval))

X_train shape is =  (2731725, 8)
y_train shape is =  (2731725, 1)
X_eval shape is =  (682932, 8)
y_eval shape is =  (682932, 1)
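
The split above draws indices without a fixed seed, so the exact partition changes between runs. A minimal sketch of a reproducible 80/20 split, assuming scikit-learn's `train_test_split` is acceptable here; the arrays `X_demo` and `y_demo` are synthetic placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels (8 features, binary target).
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 8))
y_demo = rng.integers(0, 2, size=1000)

# random_state fixes the partition; stratify keeps the class ratio equal in both parts.
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, y_tr.shape, X_ev.shape, y_ev.shape)
```

Keeping the labels one-dimensional, as here, also avoids the `column_or_1d` warning that scikit-learn emits for labels of shape `(n, 1)`.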


In [33]:
X_train_scaled = MinMaxScaler().fit_transform(X_train)
model = LogisticRegression(random_state=0).fit(X_train_scaled, y_train)

  y = column_or_1d(y, warn=True)


In [34]:
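# NOTE: this re-fits the scaler on the evaluation set; reusing the scaler fitted on
# X_train (via transform) would apply the same scaling to both sets.
X_eval_scaled = MinMaxScaler().fit_transform(X_eval)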
model.score(X_eval_scaled, y_eval)

0.6938040683406255
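
In the two cells above, a separate `MinMaxScaler` is fitted on each set. A minimal sketch of the more common pattern, where the scaler is fitted on the training data only and reused for evaluation, is shown below on synthetic data. The pipeline and the toy arrays are assumptions for illustration, and the score it produces is unrelated to the result reported here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: the label depends on the first feature only.
rng = np.random.default_rng(0)
X_tr = rng.random((800, 8)) * 10
y_tr = (X_tr[:, 0] > 5).astype(int)
X_ev = rng.random((200, 8)) * 10
y_ev = (X_ev[:, 0] > 5).astype(int)

# The pipeline fits the scaler on X_tr only; score() applies that same
# fitted scaler to X_ev before predicting.
clf = make_pipeline(MinMaxScaler(), LogisticRegression(random_state=0))
clf.fit(X_tr, y_tr)
acc = clf.score(X_ev, y_ev)
print(f"evaluation accuracy = {acc:.3f}")
```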

***

- Accuracy (n ≈ 3.4M) = 69.4 %
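
An accuracy figure is easier to interpret next to the accuracy of always predicting the most frequent class. A minimal sketch of that comparison, assuming scikit-learn's `DummyClassifier`; the 70/30 label split below is made up and does not reflect the real class balance of `is_correct`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 70 correct answers, 30 incorrect answers.
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(f"majority-class accuracy = {baseline_acc:.3f}")
```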

***

### Model 2: Full model


- For how the features were engineered, see the section [Feature Engineering](#Feature-Engineering)
- Labels (y) [#logs x 1]:
    - Whether the new problem was answered correctly (problem-level)
- Features (X):
    - Demographics [#logs x 4] [From **df_user**]
        - grade (#logs x 1)
        - gender (#logs x 3)
    - Difficulty features  [#logs x 1]:
        - upid accuracy [**From v_upid_acc**]
        - ~difficulty (only 3 levels, not quite informative)~
        - ~learning_stage (only elementary vs. junior, not quite informative)~
    - History features [#logs x 3]: 
        - most recent 'Level' of this ucid [From **df_log**]
        - 'problem_number' of this 'ucid' [From **df_log**]
        - 'exercise_problem_repeat_session' of this 'upid' [From **df_log**]        
    - One-hot encoding matrix [#logs x #level4 id]:  [**m_level4_id**]
        - one-hot encoding of the content ID of the new problem
    - Proficiency matrix [#logs x #level4 id]: [**m_proficiency**]
        - encodes the student’s performance on each content (i.e., level)
- Model:
    - Decision Tree
    - Logistic Regression
        - With L2 penalty
        - With L1 penalty
    - SVM
        - With rbf kernel
        - With linear kernel
- Evaluate accuracy:
    - Hold-out 20% test set
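
The model list above can be sketched as a dictionary of scikit-learn estimators that are all scored on the same hold-out set. This is a rough illustration on synthetic data from `make_classification`; the hyperparameters (for example `solver="liblinear"` for the L1 penalty and `max_iter=1000`) are assumptions, not settings taken from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "logreg_l2": LogisticRegression(max_iter=1000),  # L2 is the default penalty
    "logreg_l1": LogisticRegression(penalty="l1", solver="liblinear"),
    "svm_rbf": SVC(kernel="rbf"),
    "svm_linear": SVC(kernel="linear"),
}

# Fit each candidate on the training part and score it on the 20% hold-out part.
scores = {name: est.fit(X_tr, y_tr).score(X_te, y_te) for name, est in candidates.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Kernel SVMs scale poorly with the number of samples, so on data of the size used here they would likely need a subset.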

***


In [50]:
# Load proficiency matrices
m_concept_proficiency = np.load(FILE_M_CONCEPT_PROFICIENCY)["arr_0"]
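
The key `"arr_0"` is the name NumPy gives to the first array passed positionally to `np.savez`. A small round-trip sketch, using a temporary file and a made-up matrix, shows where that key comes from:

```python
import os
import tempfile

import numpy as np

m_demo = np.arange(6, dtype=float).reshape(2, 3)  # made-up matrix

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "m_demo.npz")
    np.savez(path, m_demo)  # positional arrays are stored as arr_0, arr_1, ...

    # np.load on an .npz file returns a lazy archive; indexing reads the array.
    with np.load(path) as archive:
        keys = list(archive.files)
        loaded = archive["arr_0"]

print(keys, loaded.shape)
```

Saving with a keyword, as in `np.savez(path, proficiency=m_demo)`, would store the array under `"proficiency"` instead.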

#### Split the data 80/20 for training and testing

In [54]:
# Set n_subset to the number of rows in df_log to use the full data, or to a small number for quick testing.
# n_subset = 10000000 exceeds the RAM limit (at the `np.concatenate()` step)
n_subset = 10000
# n_subset = df_log.shape[0]

num_samples = int(df_log.head(n_subset).shape[0])
num_train_samples = int(num_samples * 0.8)

np.random.seed(760)
samples_train = np.random.choice(range(num_samples), num_train_samples, replace=False)

# True: training set/ False: test set
mask_train = np.zeros(num_samples, dtype=bool)
mask_train[samples_train] = True

X_train = np.concatenate((
        # grade
        df_log.head(n_subset).loc[mask_train,"user_grade"].to_numpy()[:,np.newaxis],
        # gender
        df_log.head(n_subset).loc[mask_train,["female","male","unspecified"]].to_numpy(),
        # Difficulty features 
        df_log.head(n_subset).loc[mask_train,["v_upid_acc"]].to_numpy(),
        # History features
        df_log.head(n_subset).loc[mask_train,"level"].to_numpy()[:,np.newaxis],    
        df_log.head(n_subset).loc[mask_train,"problem_number"].to_numpy()[:,np.newaxis],
        df_log.head(n_subset).loc[mask_train,"exercise_problem_repeat_session"].to_numpy()[:,np.newaxis],    
        # # one-hot matrix
        # m_level4_id[:n_subset,:][mask_train,:],
        # proficiency matrix
        m_concept_proficiency[:n_subset,:][mask_train,:]
#         # interaction between one-hot matrix and proficiency matrix
#         m_inter_level4_proficiency[:n_subset,:][mask_train,:]
    ),axis=1)

y_train = df_log.head(n_subset).loc[mask_train,"is_correct"].to_numpy(dtype = bool)

X_test = np.concatenate((
        # grade    
        df_log.head(n_subset).loc[~mask_train,"user_grade"].to_numpy()[:,np.newaxis],
        # gender
        df_log.head(n_subset).loc[~mask_train,["female","male","unspecified"]].to_numpy(),
        # Difficulty features 
        df_log.head(n_subset).loc[~mask_train,["v_upid_acc"]].to_numpy(),
        # History features
        df_log.head(n_subset).loc[~mask_train,"level"].to_numpy()[:,np.newaxis],        
        df_log.head(n_subset).loc[~mask_train,"problem_number"].to_numpy()[:,np.newaxis],
        df_log.head(n_subset).loc[~mask_train,"exercise_problem_repeat_session"].to_numpy()[:,np.newaxis],    
        # one-hot matrix
        # m_level4_id[:n_subset,:][~mask_train,:],
        # proficiency matrix
        m_concept_proficiency[:n_subset,:][~mask_train,:]
#         # interaction between one-hot matrix and proficiency matrix
#         m_inter_level4_proficiency[:n_subset,:][~mask_train,:]    
    ),axis=1)
y_test = df_log.head(n_subset).loc[~mask_train,"is_correct"].to_numpy(dtype = bool)


print('X_train shape is = ', np.shape(X_train))
print('y_train shape is = ', np.shape(y_train))

print('X_test shape is = ', np.shape(X_test))
print('y_test shape is = ', np.shape(y_test))

X_train shape is =  (8000, 1334)
y_train shape is =  (8000,)
X_test shape is =  (2000, 1334)
y_test shape is =  (2000,)
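
The training and test matrices above are built by two nearly identical `np.concatenate` calls. One way to avoid that duplication is a small helper that is called once per mask. The function `build_features` and the toy inputs below are hypothetical, written only to illustrate the idea:

```python
import numpy as np
import pandas as pd


def build_features(df, extra, mask, columns):
    """Stack the selected DataFrame columns with the matching rows of `extra`."""
    tabular = df.loc[mask, columns].to_numpy(dtype=float)
    return np.hstack((tabular, extra[mask, :]))


# Toy stand-ins for df_log and the proficiency matrix (values are made up).
rng = np.random.default_rng(760)
toy_log = pd.DataFrame({
    "user_grade": rng.integers(1, 10, size=10),
    "level": rng.integers(0, 5, size=10),
    "is_correct": rng.integers(0, 2, size=10).astype(bool),
})
toy_proficiency = rng.random((10, 4))

# True: training set / False: test set
mask = np.zeros(10, dtype=bool)
mask[rng.choice(10, size=8, replace=False)] = True

cols = ["user_grade", "level"]
X_tr = build_features(toy_log, toy_proficiency, mask, cols)
X_te = build_features(toy_log, toy_proficiency, ~mask, cols)
print(X_tr.shape, X_te.shape)
```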


In [55]:
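# Overwrite the raw data matrices with their scaled versions to reduce RAM usage
# NOTE: the scaler is fitted separately on each set; fitting it on X_train only and
# calling transform() on X_test would apply the same scaling to both sets.
X_train = MinMaxScaler().fit_transform(X_train)
X_test = MinMaxScaler().fit_transform(X_test)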

In [59]:
dc_full = DecisionTreeClassifier(criterion="entropy",random_state=0).fit(X_train, y_train)
print("# n_subset = " + str(n_subset),": ",end = "")
print("train = " + str(dc_full.score(X_train, y_train))+" ; ",end = "")
print("test = " + str(dc_full.score(X_test, y_test)))

# n_subset = 10000 : train = 0.909375 ; test = 0.6665
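
The gap between training accuracy (0.909) and test accuracy (0.667) suggests that the unrestricted tree overfits. A minimal sketch of how limiting `max_depth` changes that gap, on synthetic noisy data; the depth of 3 and the dataset are assumptions for illustration and say nothing about the best setting for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) so that a deep tree can memorize noise.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in (None, 3):  # None = grow until all leaves are pure
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f} test={results[depth][1]:.3f}")
```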
