<div style="text-align: center; background-color: #559cff; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  FINAL PROJECT – Intelligent Data Analysis
</div>

| Name              | ID       |
|-------------------|----------|
| Trương Công Gia Phát |21127667|

### Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

### Bring in the previously defined feature engineering function

In [2]:
CATEGORICAL = ['event_name', 'name','fqid', 'room_fqid', 'text_fqid']
NUMERICAL = ['elapsed_time','level','page','room_coor_x', 'room_coor_y', 
        'screen_coor_x', 'screen_coor_y', 'hover_duration']

def feature_engineer(dataset_df):
    """
    Perform feature engineering by aggregating categorical and numerical features 
    at the session-level grouped by level_group.
    """
    # Initialize a list to store aggregation results
    agg_results = []
    # Process categorical features: count unique values
    for col in CATEGORICAL:
        agg = dataset_df.groupby(['session_id', 'level_group'])[col].nunique()
        agg.name = f"{col}_nunique"
        agg_results.append(agg)

    # Process numerical features: mean and standard deviation
    for col in NUMERICAL:
        mean_agg = dataset_df.groupby(['session_id', 'level_group'])[col].mean()
        mean_agg.name = f"{col}_mean"
        agg_results.append(mean_agg)
        
        std_agg = dataset_df.groupby(['session_id', 'level_group'])[col].std()
        std_agg.name = f"{col}_std"
        agg_results.append(std_agg)

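    # Combine all aggregates; missing values (e.g. std of a single-row group) become -1
    features_df = pd.concat(agg_results, axis=1)
    features_df.fillna(-1, inplace=True)
    # Keep level_group as a regular column and index the frame by session_id
    features_df.reset_index(inplace=True)
    features_df.set_index('session_id', inplace=True)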
    
    return features_df
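
To make the aggregation concrete, here is a minimal sketch that applies the same three statistics (`nunique`, `mean`, `std`) to a toy event log. The data values are hypothetical, and only one categorical and one numerical column are used; it is an illustration of the idea, not a replacement for `feature_engineer`.

```python
import pandas as pd

# Toy event log (hypothetical values): one session, two level groups.
events = pd.DataFrame({
    "session_id":   [1, 1, 1, 1],
    "level_group":  ["0-4", "0-4", "0-4", "5-12"],
    "event_name":   ["click", "hover", "click", "click"],
    "elapsed_time": [10.0, 20.0, 30.0, 40.0],
})

grouped = events.groupby(["session_id", "level_group"])

# Same statistics as feature_engineer: unique counts for categorical columns,
# mean and standard deviation for numerical columns.
features = pd.concat(
    [
        grouped["event_name"].nunique().rename("event_name_nunique"),
        grouped["elapsed_time"].mean().rename("elapsed_time_mean"),
        grouped["elapsed_time"].std().rename("elapsed_time_std"),
    ],
    axis=1,
)

# The "5-12" group has a single row, so its std is NaN and gets filled with -1.
features = features.fillna(-1).reset_index().set_index("session_id")
print(features)
```

The result has one row per (session, level group) pair, which is the shape the per-question models later rely on.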

## 1. Model Building

First, we read the preprocessed, feature-engineered training and validation files that were uploaded.

In [3]:
train_x = pd.read_csv("/kaggle/input/training-data/train_x.csv",index_col=0)
valid_x = pd.read_csv("/kaggle/input/validating-data/valid_x.csv",index_col=0)

Then we cast `level_group` to a categorical dtype in both dataframes.

In [4]:
train_x["level_group"] = train_x["level_group"].astype("category")
valid_x["level_group"] = valid_x["level_group"].astype("category")
train_x.head(5)

session_id,level_group,event_name_nunique,name_nunique,fqid_nunique,room_fqid_nunique,text_fqid_nunique,elapsed_time_mean,elapsed_time_std,level_mean,level_std,...,room_coor_x_mean,room_coor_x_std,room_coor_y_mean,room_coor_y_std,screen_coor_x_mean,screen_coor_x_std,screen_coor_y_mean,screen_coor_y_std,hover_duration_mean,hover_duration_std
20090312431273200,0-4,10,3,30,7,17,85793.56,49246.539458,1.945455,1.230975,...,7.701275,399.29605,-71.41375,129.2924,448.41025,214.871,383.04486,104.08274,2389.5,3227.370757
20090312431273200,13-22,10,3,49,12,35,1040601.0,126666.129584,17.402381,2.358652,...,-130.34717,622.0614,-162.0043,230.37088,442.4898,240.28021,379.30103,99.06786,899.9259,1305.088265
20090312431273200,5-12,10,3,39,11,24,357205.2,80175.676658,8.054054,2.096919,...,14.306062,357.2277,-57.26932,137.40947,451.95096,203.26855,378.7849,120.255455,969.3333,1316.408315
20090312433251036,0-4,11,4,22,6,11,97633.42,67372.714092,1.870504,1.232616,...,-84.04596,445.98004,-53.67108,156.18625,358.22308,252.5547,370.72308,121.06293,1378.75,2114.876406
20090312433251036,13-22,11,6,73,16,43,2498852.0,777382.529186,17.762529,1.825923,...,-30.762283,529.5757,-142.8619,234.27959,462.85248,259.28885,387.93008,133.34569,720.38495,1990.705518


We also need to read `train_labels.csv` again and preprocess it, splitting each label ID into its session and question number.

In [5]:
labels = pd.read_csv("/kaggle/input/predict-student-performance-from-game-play/train_labels.csv")
labels['session'] = labels.session_id.apply(lambda x: int(x.split('_')[0]) )
labels['q'] = labels.session_id.apply(lambda x: int(x.split('_')[-1][1:]) )

labels.head(5)

,session_id,correct,session,q
0,20090312431273200_q1,1,20090312431273200,1
1,20090312433251036_q1,0,20090312433251036,1
2,20090312455206810_q1,1,20090312455206810,1
3,20090313091715820_q1,0,20090313091715820,1
4,20090313571836404_q1,1,20090313571836404,1
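
The two `apply` calls above split each label ID of the form `<session>_q<number>` into its parts. A minimal pure-Python sketch of the same parsing follows; the helper name `parse_label_id` is hypothetical and used only for illustration.

```python
def parse_label_id(label_id):
    """Split an ID such as '20090312431273200_q1' into (session, question)."""
    session, question = label_id.split("_")
    # Drop the leading 'q' before converting the question number.
    return int(session), int(question[1:])


print(parse_label_id("20090312431273200_q1"))
print(parse_label_id("20090312433251036_q18"))
```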


Then we do the following:

- Create a dataframe for storing the predictions of each question for all validation users. It is indexed by the users' `session_id` values and has one column per question (18 in total).
- Create an empty dictionary to store the models created for each question.

In [6]:
VALID_USER_LIST = valid_x.index.unique()


prediction_df = pd.DataFrame(data=np.zeros((len(VALID_USER_LIST),18)), index=VALID_USER_LIST)

models = {}
accuracies = []
# Create an empty dictionary to store the evaluation score for each question.
evaluation_dict ={}

Then we define our model's parameters:

In [7]:
xgb_params = {
        'booster': 'gbtree',
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'learning_rate': 0.02,
        'max_depth': 4,
        'alpha': 4,
        'n_estimators': 10000,
        'early_stopping_rounds': 100,
        # 'tree_method': 'gpu_hist',
        'subsample': 0.8,
        'colsample_bytree': 0.2,
        'use_label_encoder': False,
        'n_jobs': 8,
        'seed': 42,
} 

## 2. Experiment

We build a separate XGBoost model (`XGBClassifier`) for each of the 18 questions. For each question, the training and validation rows are restricted to the `level_group` the question belongs to (`0-4`, `5-12`, or `13-22`), and the model is evaluated on validation accuracy. Each fitted model is stored in the `models` dictionary defined above for later use.
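
The mapping from question number to level group can be sketched as a small lookup. The ranges mirror the `limits` dictionary used later in this notebook; the function name `level_group_for_question` is hypothetical.

```python
# (start, stop) question ranges per level group, stop exclusive.
LIMITS = {"0-4": (1, 4), "5-12": (4, 14), "13-22": (14, 19)}


def level_group_for_question(q_no):
    """Return the level group whose checkpoint contains question q_no."""
    for grp, (start, stop) in LIMITS.items():
        if start <= q_no < stop:
            return grp
    raise ValueError(f"question number out of range: {q_no}")


for q in (1, 3, 4, 13, 14, 18):
    print(q, level_group_for_question(q))
```

Raising on an out-of-range question avoids silently reusing a group from a previous iteration.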

In [None]:
for q_no in range(1,19):

    # Questions 1-3 belong to levels 0-4, 4-13 to levels 5-12, 14-18 to levels 13-22
    if q_no<=3: grp = '0-4'
    elif q_no<=13: grp = '5-12'
    else: grp = '13-22'
    
    # Keep only the feature rows for this question's level group
    train_loop_df = train_x.loc[train_x.level_group == grp].copy()
    train_users = train_loop_df.index.values
    valid_loop_df = valid_x.loc[valid_x.level_group == grp].copy()
    valid_users = valid_loop_df.index.values

    # Look up this question's labels for the same sessions
    train_labels = labels.loc[labels.q==q_no].set_index('session').loc[train_users]
    valid_labels = labels.loc[labels.q==q_no].set_index('session').loc[valid_users]

    # Attach the labels; assignment aligns on the session index
    train_loop_df["correct"] = train_labels["correct"]
    valid_loop_df["correct"] = valid_labels["correct"]

    X_train = train_loop_df.drop(columns=["correct"])
    y_train = train_loop_df["correct"]
    X_valid = valid_loop_df.drop(columns=["correct"])
    y_valid = valid_loop_df["correct"]

    # level_group is constant within the subset, so it carries no information
    X_train = X_train.drop(columns=["level_group"])
    X_valid = X_valid.drop(columns=["level_group"])

    # Fit with early stopping monitored on the validation set
    model = XGBClassifier(**xgb_params)
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],verbose=0)

    # Hard 0/1 predictions (0.5 cutoff)
    y_pred = model.predict(X_valid)

    accuracy = accuracy_score(y_valid, y_pred)
    accuracies.append(accuracy)
    print(f"Question {q_no}: Validation Accuracy = {accuracy:.4f}")

    # Store the fitted model and its validation predictions for later use
    models[f'{grp}_{q_no}'] = model
    prediction_df.loc[valid_users, q_no-1] = y_pred.flatten() 


Question 1: Validation Accuracy = 0.7290
Question 2: Validation Accuracy = 0.9756
Question 3: Validation Accuracy = 0.9351
Question 4: Validation Accuracy = 0.7946
Question 5: Validation Accuracy = 0.6293
Question 6: Validation Accuracy = 0.7899
Question 7: Validation Accuracy = 0.7462
Question 8: Validation Accuracy = 0.6378
Question 9: Validation Accuracy = 0.7662
Question 10: Validation Accuracy = 0.6085
Question 11: Validation Accuracy = 0.6558
Question 12: Validation Accuracy = 0.8701
Question 13: Validation Accuracy = 0.7229
Question 14: Validation Accuracy = 0.7373
Question 15: Validation Accuracy = 0.6130
Question 16: Validation Accuracy = 0.7494
Question 17: Validation Accuracy = 0.7036
Question 18: Validation Accuracy = 0.9516
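
The per-question accuracies are collected in `accuracies` but not summarized. A minimal sketch of a summary is shown below; the numbers are copied from the printed output above so the block runs on its own, and the variable names `lowest_q` and `highest_q` are hypothetical.

```python
from statistics import mean

# Validation accuracies copied from the printed output (questions 1-18).
accuracies = [
    0.7290, 0.9756, 0.9351, 0.7946, 0.6293, 0.7899,
    0.7462, 0.6378, 0.7662, 0.6085, 0.6558, 0.8701,
    0.7229, 0.7373, 0.6130, 0.7494, 0.7036, 0.9516,
]

mean_accuracy = mean(accuracies)
# Question numbers are 1-based, list positions are 0-based.
lowest_q = accuracies.index(min(accuracies)) + 1
highest_q = accuracies.index(max(accuracies)) + 1

print(f"Mean validation accuracy: {mean_accuracy:.4f}")
print(f"Lowest accuracy: question {lowest_q}, highest: question {highest_q}")
```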


Create a dataframe that stores the true label for each question and validation user, with the same layout as `prediction_df`.

In [9]:
true_df = pd.DataFrame(data=np.zeros((len(VALID_USER_LIST),18)), index=VALID_USER_LIST)
for i in range(18):
    tmp = labels.loc[labels.q == i+1].set_index('session').loc[VALID_USER_LIST]
    true_df[i] = tmp.correct.values
true_df

session_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
22000320020067784,1,1,0,1,1,1,1,1,0,1,1,1,0,1,0,1,1,1
22000321083750010,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,1,1
22000401381351532,1,1,1,1,1,1,1,0,1,1,0,1,0,1,1,1,1,1
22000407142860316,1,1,1,1,1,1,0,1,1,0,1,1,0,1,1,0,1,1
22000407572357990,1,1,1,0,1,1,0,1,1,1,1,1,0,0,0,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22100215342220508,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1
22100215460321130,0,1,1,1,0,1,1,0,1,0,1,1,0,1,0,1,1,1
22100217104993650,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,0,1,1
22100219442786200,0,1,1,1,1,1,1,0,1,0,1,1,0,1,0,1,1,1
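
As an alternative to filling the columns in a loop, the same wide layout can be obtained with a pivot. This is a sketch on toy labels (the session IDs and answers are hypothetical), not the method used in this notebook.

```python
import pandas as pd

# Toy long-format labels (hypothetical): two sessions, two questions.
toy_labels = pd.DataFrame({
    "session": [101, 102, 101, 102],
    "q":       [1, 1, 2, 2],
    "correct": [1, 0, 0, 1],
})

# One row per session, one column per question.
wide = toy_labels.pivot(index="session", columns="q", values="correct")
# Shift to 0-based column labels to match prediction_df.
wide.columns = [q - 1 for q in wide.columns]
print(wide)
```

On the real data, the rows would still need to be reordered to match `VALID_USER_LIST`, for example with `.loc[VALID_USER_LIST]`.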


In [14]:
prediction_df

session_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
22000320020067784,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
22000321083750010,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
22000401381351532,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
22000407142860316,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0
22000407572357990,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22100215342220508,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
22100215460321130,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
22100217104993650,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
22100219442786200,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0


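Find the threshold that maximizes the F1 score over all questions and validation users. Note that `prediction_df` holds the hard 0/1 labels returned by `model.predict`, so every threshold in the swept range produces the same predictions. For the sweep to be informative, the training loop would need to store `model.predict_proba(X_valid)[:, 1]` instead.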

In [10]:
max_score = 0
best_threshold = 0

y_true = true_df.values.reshape(-1) 
y_pred_probs = prediction_df.values.reshape(-1)  

# Loop through thresholds to find the best one
for threshold in np.arange(0.5, 0.81, 0.005):
    y_pred = (y_pred_probs > threshold).astype(int)

    f1 = f1_score(y_true, y_pred)

    if f1 > max_score:
        max_score = f1
        best_threshold = threshold

print(f"Best threshold: {best_threshold:.2f}\tF1 Score: {max_score:.4f}")

Best threshold: 0.50	F1 Score: 0.8453
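
To illustrate the difference between sweeping over probabilities and sweeping over hard 0/1 labels, here is a minimal pure-Python sketch. The toy labels and scores are hypothetical, and `f1` and `sweep_thresholds` are illustrative helpers rather than part of the notebook.

```python
def f1(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_thresholds(y_true, scores, thresholds):
    """Return (threshold, f1) for the first threshold with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        preds = [1 if s > t else 0 for s in scores]
        score = f1(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.62, 0.58, 0.3, 0.55]   # predicted probabilities
hard = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0]       # the same scores cut at 0.5
thresholds = [0.50, 0.55, 0.60, 0.65]

print(sweep_thresholds(y_true, probs, thresholds))
print(sweep_thresholds(y_true, hard, thresholds))
```

With probabilities, the sweep can move the cutoff past the borderline negatives. With hard labels, every threshold gives the same predictions, so the first threshold tried is returned.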


In [11]:
import jo_wilder_310
env = jo_wilder_310.make_env()
iter_test = env.iter_test()


In [12]:
# Question ranges (start inclusive, stop exclusive) answered after each level group
limits = {'0-4':(1,4), '5-12':(4,14), '13-22':(14,19)}
for (test, sample_submission) in iter_test:
    test_df = feature_engineer(test)
    test_df["level_group"] = test_df["level_group"].astype("category")
    # The batch is assumed to cover a single level group
    grp = test_df.level_group.values[0]
    a,b = limits[grp]
    # The features are identical for every question in the group
    X_test = test_df.drop(columns=["level_group"])
    for t in range(a, b):
        model = models[f'{grp}_{t}']

        proba = model.predict_proba(X_test)
        predictions = (proba[:, 1] >= best_threshold).astype(int)

        # Match the exact question suffix so that 'q1' cannot also match 'q10'-'q18'
        mask = sample_submission.session_id.str.endswith(f'_q{t}')
        sample_submission.loc[mask,'correct'] = predictions
    # Submit predictions for the current test batch
    env.predict(sample_submission)

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
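
When selecting submission rows by question, a substring test on `q{t}` is looser than a suffix test. The sketch below shows the difference on IDs built from the session shown in the submission file; it is a pure-Python illustration and assumes a batch that contains all 18 questions, which may not occur in practice.

```python
# Submission-style IDs for one session, questions 1 through 18.
ids = [f"20090109393214576_q{q}" for q in range(1, 19)]

# Substring match: 'q1' also occurs inside 'q10' ... 'q18'.
substring_hits = [s for s in ids if "q1" in s]

# Suffix match: only the row for question 1.
suffix_hits = [s for s in ids if s.endswith("_q1")]

print(len(substring_hits), len(suffix_hits))
```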


In [13]:
! head submission.csv

session_id,correct
20090109393214576_q1,1
20090109393214576_q2,1
20090109393214576_q3,1
20090109393214576_q4,1
20090109393214576_q5,1
20090109393214576_q6,1
20090109393214576_q7,1
20090109393214576_q8,1
20090109393214576_q9,1
