Here is a list of all the features and their descriptions:
1. **Unique-id**: A unique identifier for each student.
2. **namea**: Name or identifier of the student.
3. **OffTask**: Indicates whether the student was off-task (N for No, and possibly Y for Yes).
4. **Avgright**: Average correctness of right actions.
5. **Avgbug**: Average correctness of bug-related actions.
6. **Avghelp**: Average correctness of help actions.
7. **Avgchoice**: Average correctness of choice-related actions.
8. **Avgstring**: Average correctness of string-related actions.
9. **Avgnumber**: Average correctness of number-related actions.
10. **Avgpoint**: Average correctness of point-related actions.
11. **Avgpchange**: Average correctness of actions related to changes or updates.
12. **Avgtime**: Average time taken per action.
13. **AvgtimeSDnormed**: Normalized average time per action with standard deviation.
14. **Avgtimelast3SDnormed**: Normalized average time per action with the last 3 actions' standard deviation.
15. **Avgtimelast5SDnormed**: Normalized average time per action with the last 5 actions' standard deviation.
16. **Avgnotright**: Average correctness of actions that are not categorized as "right."
17. **Avghowmanywrong-up**: Average correctness of actions related to increasing the count of wrong actions.
18. **Avghelppct-up**: Average correctness of actions related to help percentage updates.
19. **Avgwrongpct-up**: Average correctness of actions related to wrong percentage updates.
20. **Avgtimeperact-up**: Average correctness of actions related to time per action updates.
21. **AvgPrev3Count-up**: Average correctness of actions related to the count-up of previous 3 actions.
22. **AvgPrev5Count-up**: Average correctness of actions related to the count-up of previous 5 actions.
23. **Avgrecent8help**: Average correctness of actions that are recent and related to help.
24. **Avgrecent5wrong**: Average correctness of actions that are recent and categorized as wrong.
25. **Avgmanywrong-up**: Average correctness of actions where many wrong actions were taken.
26. **AvgasymptoteA-up**: Average correctness of actions where asymptote A was updated.
27. **AvgasymptoteB-up**: Average correctness of actions where asymptote B was updated.

Here is a list of the features from the new dataset that were added to ca1-dataset.csv to form ca1_df_newfeatures.csv. Features 28-30 are taken directly from the new dataset; features 31-40 are the 10 new features that I have added.

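28. **help**: Help indicator for each action, used below to compute the rolling help ratio.
29. **Pknow-2**: Knowledge estimate for the student, used below to compute Cumulative_Knowledge_Change.
30. **time**: Numeric time value for each action, used below to compute time differences and averages.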
31. **Avg_Time_Per_Action**: Average time taken per action.
32. **Help_Ratio_rolling**: Rolling mean of `help` over a window of 5 actions within each Unique-id.
33. **Time_Diff**: Sequential time difference between actions.
34. **Percentage_Correct**: Percentage of correct actions.
35. **Cumulative_Knowledge_Change**: Running total of the Pknow-2 values within each Unique-id.
36. **Total_Actions**: Total number of actions by each student.
37. **Total_Correct_Actions**: Total number of correct actions by each student.
38. **Help_Request_Frequency**: Frequency of help requests.
39. **Help_Request_Ratio**: Ratio of help requests to total actions.
40. **Avg_Time_Between_Actions**: Average time between actions.


In [1]:
#Import All Packages
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score, cohen_kappa_score, f1_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [2]:
# Load the original and new datasets
original_data = pd.read_csv('ca1-dataset.csv')
new_data = pd.read_csv('ca2-dataset.csv')

In [3]:
# Merge the datasets on 'Unique-id'. This is the DataFrame I work with before adding the new features to the original one.
df = pd.merge(original_data, new_data, on='Unique-id') 

**Feature 1: Ratio of help requests to total actions over a rolling window**
Calculates how often a student requests help relative to all of their actions, using a sliding window of the last 5 actions within each 'Unique-id' so that changes in help-seeking over time are captured. Windows with fewer than 5 actions produce NaN, which is replaced with 0. This shows whether there are periods where students seek more or less help relative to their overall activity, which is useful for analyzing student behavior in educational contexts.

In [4]:
rolling_window = 5 
df['Help_Ratio_rolling'] = df.groupby('Unique-id')['help'].rolling(rolling_window).mean().fillna(0).reset_index(0, drop=True)
df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,Prev3Count-up,Prev5Count-up,recent8help,recent5wrong,manywrong-up,asymptoteA-up,asymptoteB-up,Behaviour,Coder,Help_Ratio_rolling
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,0,0,1,0,0,0,ON TASK,awagner,0.0
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,0,0,1,0,0,0,ON TASK,awagner,0.0
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,ON TASK,awagner,0.0
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,ON TASK,awagner,0.0
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,0,0,0,1,0,0,0,ON TASK,awagner,0.0
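The rolling mean in Feature 1 is row-based: any window with fewer than 5 rows becomes NaN and then 0. The sketch below uses a small made-up DataFrame (the values are hypothetical, not taken from the dataset) to contrast that behavior with a `min_periods=1` variant, which averages over the rows seen so far. Whether the variant suits this analysis is an assumption.

```python
import pandas as pd

# Toy data (hypothetical values): id "a" has 3 actions, id "b" has 2.
toy = pd.DataFrame({
    "Unique-id": ["a", "a", "a", "b", "b"],
    "help":      [1,   0,   1,   0,   1],
})

window = 5

# Same pattern as the cell above: short windows give NaN, which fillna turns into 0.
strict = (
    toy.groupby("Unique-id")["help"]
       .rolling(window).mean()
       .fillna(0)
       .reset_index(level=0, drop=True)
)

# Variant (assumption): min_periods=1 averages over the rows available so far.
partial = (
    toy.groupby("Unique-id")["help"]
       .rolling(window, min_periods=1).mean()
       .reset_index(level=0, drop=True)
)

toy["Help_Ratio_strict"] = strict
toy["Help_Ratio_partial"] = partial
print(toy)
```

With groups shorter than the window, the strict version is 0 everywhere, which matches the zeros in the output above.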


**Feature 2: Sequential Time Difference**
This code calculates the sequential time difference for each row within the same 'Unique-id' group. It computes the time elapsed (in seconds or another time unit) between the current action and the previous action for each student ('Unique-id'). The result is stored in the 'Time_Diff' column.

In [5]:
df['time'] = pd.to_numeric(df['time'], errors='coerce')
df.sort_values(['Unique-id', 'time'], inplace=True)

# Calculate the sequential time difference
df['Time_Diff'] = df.groupby('Unique-id')['time'].diff()

# Handle missing values in the Time_Diff column (e.g., replace with 0)
df['Time_Diff'] = df['Time_Diff'].fillna(0)

df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,Prev5Count-up,recent8help,recent5wrong,manywrong-up,asymptoteA-up,asymptoteB-up,Behaviour,Coder,Help_Ratio_rolling,Time_Diff
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,0,1,0,0,0,ON TASK,awagner,0.0,0.0
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,0,1,0,0,0,ON TASK,awagner,0.0,0.0
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,ON TASK,awagner,0.0,0.0
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,ON TASK,awagner,0.0,1.0
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,0,0,1,0,0,0,ON TASK,awagner,0.0,0.0


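**Feature 3: Calculate the percentage of correct actions out of total actions for each student**
The code divides each row's 'Avgright' value by the number of rows for that 'Unique-id' and multiplies by 100. As a result, an id with two rows that are both fully correct gets 50.0 rather than 100.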


In [6]:
df['Percentage_Correct'] =round(((df['Avgright'] / df.groupby('Unique-id')['Unique-id'].transform('count')) * 100),2)
df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,recent8help,recent5wrong,manywrong-up,asymptoteA-up,asymptoteB-up,Behaviour,Coder,Help_Ratio_rolling,Time_Diff,Percentage_Correct
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,1,0,0,0,ON TASK,awagner,0.0,0.0,50.0
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,1,0,0,0,ON TASK,awagner,0.0,0.0,50.0
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,0,0,ON TASK,awagner,0.0,0.0,50.0
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,0,0,ON TASK,awagner,0.0,1.0,50.0
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,0,1,0,0,0,ON TASK,awagner,0.0,0.0,33.33
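If the intent of Feature 3 is the share of correct rows per id, one possible alternative is the group mean of 'Avgright' times 100. The sketch below uses made-up data and assumes 'Avgright' is a per-row correctness value between 0 and 1. The column name `Percentage_Correct_alt` is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): id "a" has 2 of 2 rows correct, id "b" has 1 of 3.
toy = pd.DataFrame({
    "Unique-id": ["a", "a", "b", "b", "b"],
    "Avgright":  [1.0, 1.0, 1.0, 0.0, 0.0],
})

grp = toy.groupby("Unique-id")["Avgright"]

# Sum of correct rows divided by the row count is the group mean; scale to a percentage.
toy["Percentage_Correct_alt"] = (grp.transform("mean") * 100).round(2)
print(toy)
```

Swapping this in would change the feature values and therefore the model scores reported further down.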


**Feature 4: Accumulates the knowledge estimates (Pknow-2) over time**
The 'Cumulative_Knowledge_Change' column stores the running total of the 'Pknow-2' values for each 'Unique-id' as the student progresses through the dataset. Because it is a cumulative sum of the estimates themselves rather than of their step-to-step differences, it grows with every action; it can still be useful for tracking learning patterns and trends.

In [7]:
df['Cumulative_Knowledge_Change'] = df.groupby('Unique-id')['Pknow-2'].cumsum()
df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,recent5wrong,manywrong-up,asymptoteA-up,asymptoteB-up,Behaviour,Coder,Help_Ratio_rolling,Time_Diff,Percentage_Correct,Cumulative_Knowledge_Change
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,1,0,0,0,ON TASK,awagner,0.0,0.0,50.0,0.888287
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,1,0,0,0,ON TASK,awagner,0.0,0.0,50.0,1.776573
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,0,ON TASK,awagner,0.0,0.0,50.0,0.888287
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,0,ON TASK,awagner,0.0,1.0,50.0,1.876751
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,1,0,0,0,ON TASK,awagner,0.0,0.0,33.33,0.979706
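To make the difference between a running total and an accumulated change concrete, here is a small sketch on made-up values. It assumes the rows are already in time order within each id. The column names `Cumulative_Sum` and `Cumulative_Change` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical knowledge estimates).
toy = pd.DataFrame({
    "Unique-id": ["a", "a", "a", "b", "b"],
    "Pknow-2":   [0.2, 0.5, 0.4, 0.9, 0.95],
})

# Running total of the estimates (what cumsum on 'Pknow-2' produces).
toy["Cumulative_Sum"] = toy.groupby("Unique-id")["Pknow-2"].cumsum()

# Step-to-step change within each id; the first row of a group has no predecessor.
step = toy.groupby("Unique-id")["Pknow-2"].diff().fillna(0)

# Accumulated change (assumption: this may be closer to what the feature name suggests).
toy["Cumulative_Change"] = step.groupby(toy["Unique-id"]).cumsum()
print(toy)
```

The accumulated change is simply the current estimate minus the first estimate for that id, so it can go down as well as up.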



**Feature 5: Total Actions Per Student**
The 'Total_Actions' column now contains the total number of actions (rows) associated with each student. This feature is useful for understanding and analyzing the overall activity level or engagement of each student in the dataset.

In [8]:
df['Total_Actions'] = df.groupby('Unique-id')['Unique-id'].transform('count')
df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,manywrong-up,asymptoteA-up,asymptoteB-up,Behaviour,Coder,Help_Ratio_rolling,Time_Diff,Percentage_Correct,Cumulative_Knowledge_Change,Total_Actions
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,0,0,ON TASK,awagner,0.0,0.0,50.0,0.888287,2
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,0,0,ON TASK,awagner,0.0,0.0,50.0,1.776573,2
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,ON TASK,awagner,0.0,0.0,50.0,0.888287,2
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,0,ON TASK,awagner,0.0,1.0,50.0,1.876751,2
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,0,0,0,ON TASK,awagner,0.0,0.0,33.33,0.979706,3


**Feature 6: Total Correct Actions per Student**

In [9]:
df['Total_Correct_Actions'] = df.groupby('Unique-id')['Avgright'].transform('sum')
df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,asymptoteA-up,asymptoteB-up,Behaviour,Coder,Help_Ratio_rolling,Time_Diff,Percentage_Correct,Cumulative_Knowledge_Change,Total_Actions,Total_Correct_Actions
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,0,ON TASK,awagner,0.0,0.0,50.0,0.888287,2,2.0
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,0,ON TASK,awagner,0.0,0.0,50.0,1.776573,2,2.0
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,ON TASK,awagner,0.0,0.0,50.0,0.888287,2,2.0
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,0,ON TASK,awagner,0.0,1.0,50.0,1.876751,2,2.0
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,0,0,ON TASK,awagner,0.0,0.0,33.33,0.979706,3,3.0


**Feature 7: Help Request Frequency**
The 'Help_Request_Frequency' column now contains the total frequency of help requests made by each student. This feature provides insight into how often each student seeks help during their interactions with the system or platform.

In [10]:
#Feature 7
df['Help_Request_Frequency'] = df.groupby('Unique-id')['Avghelp'].transform('sum')
df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,asymptoteB-up,Behaviour,Coder,Help_Ratio_rolling,Time_Diff,Percentage_Correct,Cumulative_Knowledge_Change,Total_Actions,Total_Correct_Actions,Help_Request_Frequency
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,ON TASK,awagner,0.0,0.0,50.0,0.888287,2,2.0,0
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0,ON TASK,awagner,0.0,0.0,50.0,1.776573,2,2.0,0
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,ON TASK,awagner,0.0,0.0,50.0,0.888287,2,2.0,0
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0,ON TASK,awagner,0.0,1.0,50.0,1.876751,2,2.0,0
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,0,ON TASK,awagner,0.0,0.0,33.33,0.979706,3,3.0,0


**Feature 8: Help Request Ratio**
The 'Help_Request_Ratio' column now contains the percentage of help requests relative to the total number of actions for each student. This feature provides an indication of how frequently a student requests help in relation to their overall actions, expressed as a percentage.

In [11]:
#Feature 8 
df['Help_Request_Ratio'] = (df['Help_Request_Frequency'] / df['Total_Actions']) * 100
df.head()


Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,Behaviour,Coder,Help_Ratio_rolling,Time_Diff,Percentage_Correct,Cumulative_Knowledge_Change,Total_Actions,Total_Correct_Actions,Help_Request_Frequency,Help_Request_Ratio
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,ON TASK,awagner,0.0,0.0,50.0,0.888287,2,2.0,0,0.0
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,ON TASK,awagner,0.0,0.0,50.0,1.776573,2,2.0,0,0.0
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,ON TASK,awagner,0.0,0.0,50.0,0.888287,2,2.0,0,0.0
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,ON TASK,awagner,0.0,1.0,50.0,1.876751,2,2.0,0,0.0
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,ON TASK,awagner,0.0,0.0,33.33,0.979706,3,3.0,0,0.0


**Feature 9: Average Time Between Actions per Student**
The feature Avg_Time_Between_Actions captures the average time between consecutive actions for each student. It can provide insight into a student's pacing and behavior while using the educational platform or system.

In [12]:
df['Avg_Time_Between_Actions'] = df.groupby('Unique-id')['time'].diff().groupby(df['Unique-id']).transform('mean').fillna(0)
df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,Coder,Help_Ratio_rolling,Time_Diff,Percentage_Correct,Cumulative_Knowledge_Change,Total_Actions,Total_Correct_Actions,Help_Request_Frequency,Help_Request_Ratio,Avg_Time_Between_Actions
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,awagner,0.0,0.0,50.0,0.888287,2,2.0,0,0.0,0.0
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,awagner,0.0,0.0,50.0,1.776573,2,2.0,0,0.0,0.0
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,awagner,0.0,0.0,50.0,0.888287,2,2.0,0,0.0,1.0
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,awagner,0.0,1.0,50.0,1.876751,2,2.0,0,0.0,1.0
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,awagner,0.0,0.0,33.33,0.979706,3,3.0,0,0.0,35.0


**Feature 10: Average time per action per student** 
The feature Avg_Time_Per_Action calculates the average time taken per action for each student in the dataset. It shows how much time, on average, a student spends on each individual action or task within the educational platform or system.

In [13]:
df['Avg_Time_Per_Action'] = df.groupby('Unique-id')['time'].transform('mean')
df.head()

Unnamed: 0,Unique-id,namea_x,OffTask,Avgright,Avgbug,Avghelp,Avgchoice,Avgstring,Avgnumber,Avgpoint,...,Help_Ratio_rolling,Time_Diff,Percentage_Correct,Cumulative_Knowledge_Change,Total_Actions,Total_Correct_Actions,Help_Request_Frequency,Help_Request_Ratio,Avg_Time_Between_Actions,Avg_Time_Per_Action
0,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0.0,0.0,50.0,0.888287,2,2.0,0,0.0,0.0,12.0
1,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZgy46jl,N,1.0,0.0,0,0,0,0,0,...,0.0,0.0,50.0,1.776573,2,2.0,0,0.0,0.0,12.0
3,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0.0,0.0,50.0,0.888287,2,2.0,0,0.0,1.0,7.5
2,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ77be0l,N,1.0,0.0,0,0,0,0,0,...,0.0,1.0,50.0,1.876751,2,2.0,0,0.0,1.0,7.5
4,awagner-closeloop-ins_h1zaz4-03.30.2011_at_13:...,stuZ5lp7k7,N,1.0,0.0,0,0,0,0,0,...,0.0,0.0,33.33,0.979706,3,3.0,0,0.0,35.0,25.333333


In [157]:
# Save the merged dataset to a new CSV file
#df.to_csv('ca1_dataset-newfeatures.csv', index=False)

# Compare the original dataset with the new merged CSV file
new_df = pd.read_csv('ca1_dataset-newfeatures.csv')
original_data.columns


Index(['Unique-id', 'namea', 'OffTask', 'Avgright', 'Avgbug', 'Avghelp',
       'Avgchoice', 'Avgstring', 'Avgnumber', 'Avgpoint', 'Avgpchange',
       'Avgtime', 'AvgtimeSDnormed', 'Avgtimelast3SDnormed',
       'Avgtimelast5SDnormed', 'Avgnotright', 'Avghowmanywrong-up',
       'Avghelppct-up', 'Avgwrongpct-up', 'Avgtimeperact-up',
       'AvgPrev3Count-up', 'AvgPrev5Count-up', 'Avgrecent8help',
       'Avg recent5wrong', 'Avgmanywrong-up', 'AvgasymptoteA-up',
       'AvgasymptoteB-up'],
      dtype='object')

In [163]:
# Add the new features to the original dataset by merging on 'Unique-id'. Columns from the new dataset that were not needed for the features are left out.
r_df = pd.merge(original_data, new_df[['Unique-id','help','Pknow-2', 'time',
       'Avg_Time_Per_Action', 'Help_Ratio_rolling', 'Time_Diff',
       'Percentage_Correct', 'Cumulative_Knowledge_Change', 'Total_Actions',
       'Total_Correct_Actions', 'Help_Request_Frequency', 'Help_Request_Ratio',
       'Avg_Time_Between_Actions']], on = 'Unique-id', how = 'left')
# r_df.to_csv('ca1_df_newfeatures.csv', index=False)
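A left merge repeats left-hand rows whenever the right-hand table has several rows for the same key. The sketch below uses made-up frames to show how to check row counts after the merge. As one option, it collapses the right table to one row per key first and passes `validate="one_to_one"` to `pd.merge`. Taking the mean is an assumed aggregation, chosen only for illustration.

```python
import pandas as pd

# Toy frames (hypothetical): the right table has two rows for key "a".
left = pd.DataFrame({"Unique-id": ["a", "b"], "Avgright": [1.0, 0.0]})
right = pd.DataFrame({"Unique-id": ["a", "a", "b"], "help": [0, 1, 0]})

# A plain left merge duplicates the left row for "a".
merged = pd.merge(left, right, on="Unique-id", how="left")
print(len(left), len(merged))

# Collapse to one row per key (assumption: the mean is an acceptable summary),
# then let pandas verify that the keys are unique on both sides.
right_one = right.groupby("Unique-id", as_index=False).mean(numeric_only=True)
checked = pd.merge(left, right_one, on="Unique-id", how="left",
                   validate="one_to_one")
print(checked)
```

If the keys were not unique, `validate="one_to_one"` would raise `pandas.errors.MergeError` instead of silently adding rows.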


In [190]:
r_df.columns
# Added the following columns: 'help', 'Pknow-2', 'time', 'Avg_Time_Per_Action', 'Help_Ratio_rolling', 'Time_Diff', 'Percentage_Correct', 'Cumulative_Knowledge_Change', 'Total_Actions', 'Total_Correct_Actions', 'Help_Request_Frequency', 'Help_Request_Ratio', 'Avg_Time_Between_Actions'

Index(['Unique-id', 'namea', 'OffTask', 'Avgright', 'Avgbug', 'Avghelp',
       'Avgchoice', 'Avgstring', 'Avgnumber', 'Avgpoint', 'Avgpchange',
       'Avgtime', 'AvgtimeSDnormed', 'Avgtimelast3SDnormed',
       'Avgtimelast5SDnormed', 'Avgnotright', 'Avghowmanywrong-up',
       'Avghelppct-up', 'Avgwrongpct-up', 'Avgtimeperact-up',
       'AvgPrev3Count-up', 'AvgPrev5Count-up', 'Avgrecent8help',
       'Avg recent5wrong', 'Avgmanywrong-up', 'AvgasymptoteA-up',
       'AvgasymptoteB-up', 'help', 'Pknow-2', 'time', 'Avg_Time_Per_Action',
       'Help_Ratio_rolling', 'Time_Diff', 'Percentage_Correct',
       'Cumulative_Knowledge_Change', 'Total_Actions', 'Total_Correct_Actions',
       'Help_Request_Frequency', 'Help_Request_Ratio',
       'Avg_Time_Between_Actions'],
      dtype='object')

In [5]:
# classifier_competition takes the working dataset and a classifier, runs grouped cross-validation, and prints the mean Kappa, F1, ROC AUC and accuracy so the models can be compared.
def classifier_competition(data, classifier):
    # Convert 'OffTask' to binary labels (1 for 'Y', 0 for 'N')
    data['OffTask'] = data['OffTask'].replace({'Y': 1, 'N': 0})

    # Define your features and labels
    X = data.drop(columns=['OffTask', 'Unique-id', 'namea'], axis=1)
    y = data['OffTask']

    # Initialize a 10 Group K-Fold cross-validator
    gkf = GroupKFold(n_splits=10)

    # Initialize lists to store evaluation metrics
    kappa_values = []
    f1_values = []
    roc_auc_values = []
    acc_values = []

    # Standardize features (optional, but can help with some algorithms)
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Initialize your classifier
    clf = classifier

    # Perform Group K-Fold cross-validation, grouping by student ('namea')
    for train_idx, test_idx in gkf.split(X, y, groups=data['namea']):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Skip folds whose test set contains only one class
        if len(np.unique(y_test)) == 1:
            continue  # Skip this split

        # Train the classifier
        clf.fit(X_train, y_train)

        # Make predictions
        y_pred = clf.predict(X_test)

        # Calculate evaluation metrics
        kappa = cohen_kappa_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        # Handle ROC AUC calculation when both classes are present
        if len(np.unique(y_test)) == 2:
            roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        else:
            roc_auc = np.nan

        acc = accuracy_score(y_test, y_pred)

        # Append metrics to lists
        kappa_values.append(kappa)
        f1_values.append(f1)
        roc_auc_values.append(roc_auc)
        acc_values.append(acc)

    # Calculate mean metrics
    mean_kappa = round(np.nanmean(kappa_values),2)
    mean_f1 = round(np.mean(f1_values),2)
    mean_roc_auc = round(np.nanmean(roc_auc_values),2)
    mean_acc = round(np.mean(acc_values),2)

    # Print the mean metrics
    print(f'Mean Kappa: {mean_kappa}')
    print(f'Mean F1 Score: {mean_f1}')
    print(f'Mean ROC AUC: {mean_roc_auc}')
    print(f'Mean Accuracy: {mean_acc}')

#Using each classifier to see what shows the best Kappa, f1, auc-roc, accuracy scores
data = pd.read_csv('ca1_df_newfeatures.csv')
data.head()

print('Using GaussianNB Classifier:')
classifier_competition(data, GaussianNB()) 
print('')

print('Using RandomForestClassifier:')
classifier_competition(data, RandomForestClassifier()) 
print('')

print('Using XGBClassifier')
classifier_competition(data, XGBClassifier()) 

Using GaussianNB Classifier:
Mean Kappa: 0.14
Mean F1 Score: 0.17
Mean ROC AUC: 0.83
Mean Accuracy: 0.84

Using RandomForestClassifier:
Mean Kappa: 0.22
Mean F1 Score: 0.22
Mean ROC AUC: 0.9
Mean Accuracy: 0.98

Using XGBClassifier
Mean Kappa: 0.27
Mean F1 Score: 0.28
Mean ROC AUC: 0.91
Mean Accuracy: 0.98
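In `classifier_competition` the scaler is fitted on all rows before the folds are split, so each test fold contributes to the scaling statistics. A common alternative is to fit the scaler inside every fold with a scikit-learn pipeline. The sketch below shows this on synthetic data. The data, group sizes and 5 splits are all made up for illustration, and this is not the setup that produced the scores above.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import GroupKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 hypothetical students with 10 rows each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
groups = np.repeat(np.arange(20), 10)

kappas = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    # A fresh pipeline per fold: the scaler only ever sees the training rows.
    model = make_pipeline(StandardScaler(), GaussianNB())
    model.fit(X[train_idx], y[train_idx])
    kappas.append(cohen_kappa_score(y[test_idx], model.predict(X[test_idx])))

print(round(float(np.mean(kappas)), 2))
```

Tree-based models such as random forests and XGBoost are largely insensitive to feature scaling, so the effect on them is expected to be small.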




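Early stopping is a regularization technique used when training iterative models, including gradient-boosted models such as XGBoost. Training stops once the score on a validation set has not improved for a set number of rounds, which helps limit overfitting and improve generalization. I therefore trained XGBoost with early stopping in the cell below. The same cell also applies random oversampling to address the class imbalance, so its metrics are computed on oversampled folds and are not directly comparable with the scores above.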

In [28]:
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, cohen_kappa_score, accuracy_score
from sklearn.model_selection import GroupKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from imblearn.over_sampling import RandomOverSampler

data = pd.read_csv('ca1_df_newfeatures.csv')
data['OffTask'] = data['OffTask'].replace({'Y': 1, 'N': 0})

# Define features and labels
X = data.drop(columns=['OffTask', 'Unique-id', 'namea'], axis=1)
y = data['OffTask']

# Apply oversampling to address class imbalance
oversampler = RandomOverSampler()
X_resampled, y_resampled = oversampler.fit_resample(X, y)

# Standardize features
scaler = StandardScaler()
X_resampled = scaler.fit_transform(X_resampled)

# Initialize GroupKFold cross-validator
gkf = GroupKFold(n_splits=10)

# Initialize lists to store evaluation metrics
kappa_values = []
f1_values = []
roc_auc_values = []
acc_values = []

# Initialize your XGBoost classifier with early stopping
clf = XGBClassifier(n_estimators=1000, eval_metric="logloss", early_stopping_rounds=10)

# Get the 'namea' values after oversampling
namea_resampled = data['namea'].iloc[oversampler.sample_indices_]

# Perform GroupKFold Cross-Validation
for train_idx, test_idx in gkf.split(X_resampled, y_resampled, groups=namea_resampled):
    X_train, X_test = X_resampled[train_idx], X_resampled[test_idx]
    y_train, y_test = y_resampled[train_idx], y_resampled[test_idx]

    # Check if both classes are present in the current split
    if len(np.unique(y_test)) == 1:
        continue  # Skip this split

    # Hold out part of the training fold as a validation set for early stopping
    # (test_size=0.7 puts 70% of the fold in validation and 30% in training)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.7, random_state=42)

    # Train classifier with early stopping using validation dataset
    clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

    # Make predictions
    y_pred = clf.predict(X_test)

    # Calculate evaluation metrics
    kappa = cohen_kappa_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    acc = accuracy_score(y_test, y_pred)

    # Append metrics to lists
    kappa_values.append(kappa)
    f1_values.append(f1)
    roc_auc_values.append(roc_auc)
    acc_values.append(acc)

# Calculate mean metrics
mean_kappa = round(np.nanmean(kappa_values), 2)
mean_f1 = round(np.mean(f1_values), 2)
mean_roc_auc = round(np.nanmean(roc_auc_values), 2)
mean_acc = round(np.mean(acc_values), 2)

# Print the mean metrics
print(f'Mean Kappa: {mean_kappa}')
print(f'Mean F1 Score: {mean_f1}')
print(f'Mean ROC AUC: {mean_roc_auc}')
print(f'Mean Accuracy: {mean_acc}')



Mean Kappa: 0.47
Mean F1 Score: 0.64
Mean ROC AUC: 0.88
Mean Accuracy: 0.74
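Because the oversampling above is applied before the folds are created, the test folds also contain duplicated minority rows. One alternative is to oversample only the training part of each fold and leave the test part untouched. The sketch below uses plain NumPy rather than imblearn. `oversample_minority` is a hypothetical helper written for this illustration, the fold indices are made up, and it assumes binary labels coded 0 and 1.

```python
import numpy as np

def oversample_minority(X, y, rng):
    """Randomly duplicate minority-class rows until both classes have equal counts."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hypothetical fold indices, standing in for one GroupKFold split.
train_idx = np.array([0, 1, 2, 3, 4, 5, 8])
test_idx = np.array([6, 7, 9])

# Balance the training rows only; the test rows keep their original class mix.
X_train, y_train = oversample_minority(X[train_idx], y[train_idx], rng)
X_test, y_test = X[test_idx], y[test_idx]

print(np.bincount(y_train), np.bincount(y_test))
```

Metrics computed this way reflect the original class balance, which makes them easier to compare with the scores from `classifier_competition`.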
