# M4 | Research Investigation Notebook

In this notebook, you will carry out a research investigation of your chosen dataset in teams. You will formally select your research question (task 0), process your data (task 1), create a predictive model (task 2), evaluate your model's results (task 3), and describe the contributions of each team member (task 4).

For grading, please make sure your notebook has all cells run and is stored in your team's [Github Classroom repository](https://classroom.github.com/a/CNxME27U). You will also need to write a short, 2-page report about your design decisions as a team, to be stored in your repository. The Milestone 4 submission will be the contents of your repository at the due date (April 28 at 23:59 CET).

## Brief overview of Calcularis
[Calcularis](https://school.alemira.com/de/calcularis/) by Alemira School is a mathematics learning program developed with neuroscientists and computer scientists from ETH Zurich. It promotes the development and interaction of the different areas of the brain that are responsible for processing numbers and quantities and solving mathematical tasks. Calcularis can be used from 1st grade to high school. Children with dyscalculia also benefit in the long term and overcome their arithmetic weakness.

The Calcularis dataset has three main tables:
* ***users***: meta information about users (e.g. total time spent learning with Calcularis, geographic location).
* ***events***: events done by the users in the platform (e.g. playing a game, selecting a new animal in the zoo simulation).
* ***subtasks***: sub-tasks with answer attempts solved by users, primarily in the context of game events.

These tables and useful metadata information are described in detail in the [Milestone 2 data exploration notebook](https://github.com/epfl-ml4ed/mlbd-2023/blob/main/project/milestone-02/m2_calcularis_sciper.ipynb).

We have provided access to the [full dataset](https://moodle.epfl.ch/mod/forum/discuss.php?d=88179) (~65k users) and a randomly selected subset (~1k users from M2). We have also provided access to a [test account to experiment with Calcularis](https://moodle.epfl.ch/mod/forum/discuss.php?d=88094). You should provide arguments and justifications for all of your design decisions throughout this investigation. You can use your M3 responses as the basis for this discussion.

In [1]:
# Import the tables of the data set as dataframes.
import time
start = time.time()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DATA_DIR = './data' # You may change the directory

# You can use the nrows=X argument in pd.read_csv to truncate your data
users_small = pd.read_csv(f'{DATA_DIR}/calcularis_small_users.csv', index_col=0)
events_small = pd.read_csv(f'{DATA_DIR}/calcularis_small_events.csv', index_col=0)
subtasks_small = pd.read_csv(f'{DATA_DIR}/calcularis_small_subtasks.csv', index_col=0)
users_full = pd.read_csv(f'{DATA_DIR}/full_calcularis_users.csv', index_col=0)
events_full = pd.read_csv(f'{DATA_DIR}/full_calcularis_events.csv', index_col=0)
subtasks_full = pd.read_csv(f'{DATA_DIR}/full_calcularis_subtasks.csv', index_col=0)

## Task 0: Research Question

**Research question:**
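For this milestone, we focus on detecting wheel-spinning behaviour among Calcularis users, which we treat as a time-series problem. Following the labeling used in Task 1, a user is considered to be wheel-spinning on a skill if they do not reach three correct responses in a row within their first 10 practice opportunities on that skill. We rely on features that several scientific papers, whose goal was to detect wheel-spinning on other datasets, found useful.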

## Task 1: Data Preprocessing

In this section, you are asked to preprocess your data in a way that is relevant for the model. Please include 1-2 visualizations of features / data explorations that are related to your downstream prediction task.

In [2]:
# Your code for data processing goes here
events = events_small

subtasks = subtasks_small
processed_df = events.copy()
processed_df.drop_duplicates(inplace=True)
processed_df = processed_df[processed_df.type == 'task']
processed_df = processed_df[['user_id', 'skill_id', 'learning_time_ms', 'start']]
processed_df = processed_df.reset_index()
processed_df['correct'] = processed_df.apply(
    lambda row: subtasks[subtasks.event_id == row.event_id].iloc[0].correct, axis=1
)
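
The row-wise lookup above scans `subtasks` once per event, which becomes slow on the full dataset. A minimal sketch of an equivalent join is shown below, assuming the first subtask of each event carries the label, as `.iloc[0]` implies. The toy frames and their values are made up for illustration. Unlike the lookup above, which raises an error when an event has no subtasks, a left merge leaves `correct` empty for such an event.

```python
import pandas as pd

# Toy stand-ins for the notebook's tables (values are illustrative only).
events_toy = pd.DataFrame(
    {"user_id": [1, 1, 2], "skill_id": [7.0, 7.0, 3.0]},
    index=pd.Index([10, 11, 12], name="event_id"),
)
subtasks_toy = pd.DataFrame(
    {"event_id": [10, 10, 11, 12], "correct": [True, False, False, True]}
)

# Keep only the first subtask per event, mirroring `.iloc[0]` in the cell above.
first_subtask = subtasks_toy.drop_duplicates(subset="event_id", keep="first")

# One left merge replaces the per-row filter on `subtasks`.
merged = events_toy.reset_index().merge(
    first_subtask[["event_id", "correct"]], on="event_id", how="left"
)
print(merged)
```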



In [3]:
processed_df['po'] = processed_df.apply(
    lambda row: processed_df[(processed_df.user_id == row.user_id) & (processed_df.skill_id == row.skill_id) & (processed_df.start <= row.start)]['event_id'].count(),
    axis=1
)
processed_df.head()

Unnamed: 0,event_id,user_id,skill_id,learning_time_ms,start,correct,po
0,0,1,1.0,8835.0,2022-11-02T08:39:12.355Z,True,1
1,1,1,4.0,21167.0,2022-11-11T10:26:27.893Z,True,1
2,2,1,7.0,11182.0,2022-11-18T10:34:01.044Z,True,1
3,3,1,19.0,6823.0,2022-11-25T10:32:43.428Z,False,1
4,4,1,7.0,9107.0,2022-12-02T10:44:40.555Z,True,2
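
The practice-opportunity counter `po` can also be sketched with a sort and a grouped running count instead of a row-wise filter. The toy frame below is illustrative. One assumption to keep in mind: if two events of the same user and skill share an identical `start`, the cell above gives both the same `po`, while `cumcount` gives them consecutive values.

```python
import pandas as pd

# Toy data: user 1 practises skill 7 twice and skill 4 once; user 2 practises skill 7 once.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "skill_id": [7.0, 4.0, 7.0, 7.0],
    "start": [
        "2022-11-18T10:34:01.044Z",
        "2022-11-11T10:26:27.893Z",
        "2022-12-02T10:44:40.555Z",
        "2022-11-02T08:39:12.355Z",
    ],
})

# ISO 8601 strings in a uniform format sort chronologically as plain strings.
toy = toy.sort_values("start", kind="stable")

# cumcount() numbers the rows of each (user, skill) group from 0, so add 1.
toy["po"] = toy.groupby(["user_id", "skill_id"]).cumcount() + 1
print(toy.sort_index())
```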


In [4]:
# Does not include result from the current practice opportunity
processed_df['correct_response_count'] = processed_df.apply(
    lambda row: processed_df[(processed_df.user_id == row.user_id) & (processed_df.skill_id == row.skill_id) & (processed_df.start < row.start)]['correct'].sum(),
    axis=1
)
processed_df.head()

Unnamed: 0,event_id,user_id,skill_id,learning_time_ms,start,correct,po,correct_response_count
0,0,1,1.0,8835.0,2022-11-02T08:39:12.355Z,True,1,0
1,1,1,4.0,21167.0,2022-11-11T10:26:27.893Z,True,1,0
2,2,1,7.0,11182.0,2022-11-18T10:34:01.044Z,True,1,0
3,3,1,19.0,6823.0,2022-11-25T10:32:43.428Z,False,1,0
4,4,1,7.0,9107.0,2022-12-02T10:44:40.555Z,True,2,1


In [5]:
processed_df['correct_response_percentage'] = processed_df.apply(
    lambda row: row.correct_response_count / (row.po - 1) if row.po > 1 else 0,
    axis=1
)
processed_df.head()

Unnamed: 0,event_id,user_id,skill_id,learning_time_ms,start,correct,po,correct_response_count,correct_response_percentage
0,0,1,1.0,8835.0,2022-11-02T08:39:12.355Z,True,1,0,0.0
1,1,1,4.0,21167.0,2022-11-11T10:26:27.893Z,True,1,0,0.0
2,2,1,7.0,11182.0,2022-11-18T10:34:01.044Z,True,1,0,0.0
3,3,1,19.0,6823.0,2022-11-25T10:32:43.428Z,False,1,0,0.0
4,4,1,7.0,9107.0,2022-12-02T10:44:40.555Z,True,2,1,1.0
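
The two features above (correct responses before the current opportunity, and their share among prior opportunities) can be sketched without `apply` using a grouped cumulative sum. The toy rows assume one user and skill with four opportunities, already ordered by `po`.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [41, 41, 41, 41],
    "skill_id": [95.0, 95.0, 95.0, 95.0],
    "po": [1, 2, 3, 4],
    "correct": [True, True, True, False],
})

# Running total of correct answers per (user, skill), including the current row.
correct_int = toy["correct"].astype(int)
running = correct_int.groupby([toy["user_id"], toy["skill_id"]]).cumsum()

# Subtract the current answer so only earlier opportunities are counted.
toy["correct_response_count"] = running - correct_int

# Divide by the number of prior opportunities; the first opportunity has none, so use 0.
prior = toy["po"] - 1
toy["correct_response_percentage"] = (
    toy["correct_response_count"] / prior.where(prior > 0)
).fillna(0.0)
print(toy)
```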


In [6]:
processed_df[(processed_df.user_id == 41) & (processed_df.skill_id == 95)]

Unnamed: 0,event_id,user_id,skill_id,learning_time_ms,start,correct,po,correct_response_count,correct_response_percentage
1216,1387,41,95.0,3758.0,2018-03-29T16:01:02.701Z,True,1,0,0.0
1233,1409,41,95.0,5861.0,2018-04-06T17:19:13.891Z,True,2,1,1.0
1234,1410,41,95.0,3039.0,2018-04-06T18:09:55.968Z,True,3,2,1.0
1283,1460,41,95.0,5589.0,2018-11-06T11:18:47.955Z,False,4,3,1.0


In [7]:
processed_df = processed_df.sort_values(by='po')
for index, row in processed_df.iterrows():
    if row.po == 1:
        processed_df.loc[index, 'correct_response_in_a_row_count'] = 0
    else:
        
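        # Look up this user's previous practice opportunity on the same skill.
        last_response = processed_df[
            (processed_df.user_id == row.user_id)
            & (processed_df.skill_id == row.skill_id)
            & (processed_df.po == row.po - 1)
        ]
        # Extend the streak if the previous answer was correct; otherwise reset it to 0.
        processed_df.loc[index, 'correct_response_in_a_row_count'] = (
            last_response.correct_response_in_a_row_count.values[0] + 1
            if last_response.correct.values[0] else 0
        )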

processed_df.head()

Unnamed: 0,event_id,user_id,skill_id,learning_time_ms,start,correct,po,correct_response_count,correct_response_percentage,correct_response_in_a_row_count
0,0,1,1.0,8835.0,2022-11-02T08:39:12.355Z,True,1,0,0.0,0.0
12329,13756,341,36.0,4185.0,2020-12-09T07:50:48.457Z,True,1,0,0.0,0.0
12330,13757,341,97.0,14024.0,2021-01-04T08:54:29.429Z,True,1,0,0.0,0.0
12331,13762,341,110.0,6709.0,2021-01-04T18:20:19.807Z,True,1,0,0.0,0.0
12332,13763,341,111.0,7653.0,2021-01-05T06:55:37.085Z,True,1,0,0.0,0.0
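
The streak feature counts the consecutive correct answers immediately before the current opportunity. A possible loop-free sketch is given below. The helper `streak_before` is a hypothetical name introduced here, and the toy data is illustrative. It assumes rows are ordered by `po` within each user and skill.

```python
import pandas as pd

def streak_before(correct: pd.Series) -> pd.Series:
    """Consecutive correct answers immediately before each row (hypothetical helper)."""
    # Shift by one so each row sees the previous answer; the first row sees False.
    prev = correct.shift(1, fill_value=False).astype(int)
    # A new run starts every time the previous answer was wrong.
    run_id = (prev == 0).cumsum()
    # Within a run, the cumulative sum of previous correct answers is the streak length.
    return prev.groupby(run_id).cumsum()

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "skill_id": [7.0] * 6,
    "po": [1, 2, 3, 4, 1, 2],
    "correct": [True, True, False, True, False, True],
})
toy = toy.sort_values(["user_id", "skill_id", "po"])
toy["correct_response_in_a_row_count"] = (
    toy.groupby(["user_id", "skill_id"])["correct"].transform(streak_before)
)
print(toy)
```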


In [8]:
processed_df['correct_response_in_a_row_percentage'] = processed_df.apply(
    lambda row: row.correct_response_in_a_row_count / (row.po - 1) if row.po > 1 else 0,
    axis=1
)
processed_df.head()

Unnamed: 0,event_id,user_id,skill_id,learning_time_ms,start,correct,po,correct_response_count,correct_response_percentage,correct_response_in_a_row_count,correct_response_in_a_row_percentage
0,0,1,1.0,8835.0,2022-11-02T08:39:12.355Z,True,1,0,0.0,0.0,0.0
12329,13756,341,36.0,4185.0,2020-12-09T07:50:48.457Z,True,1,0,0.0,0.0,0.0
12330,13757,341,97.0,14024.0,2021-01-04T08:54:29.429Z,True,1,0,0.0,0.0,0.0
12331,13762,341,110.0,6709.0,2021-01-04T18:20:19.807Z,True,1,0,0.0,0.0,0.0
12332,13763,341,111.0,7653.0,2021-01-05T06:55:37.085Z,True,1,0,0.0,0.0,0.0


In [9]:
processed_df['time_on_current_skill_ms'] = processed_df.apply(
    lambda row: processed_df[
        (processed_df.user_id == row.user_id) &
        (processed_df.skill_id == row.skill_id) & 
        (processed_df.start <= row.start)
    ]['learning_time_ms'].sum(),
    axis=1
)
processed_df.head()

Unnamed: 0,event_id,user_id,skill_id,learning_time_ms,start,correct,po,correct_response_count,correct_response_percentage,correct_response_in_a_row_count,correct_response_in_a_row_percentage,time_on_current_skill_ms
0,0,1,1.0,8835.0,2022-11-02T08:39:12.355Z,True,1,0,0.0,0.0,0.0,8835.0
12329,13756,341,36.0,4185.0,2020-12-09T07:50:48.457Z,True,1,0,0.0,0.0,0.0,4185.0
12330,13757,341,97.0,14024.0,2021-01-04T08:54:29.429Z,True,1,0,0.0,0.0,0.0,14024.0
12331,13762,341,110.0,6709.0,2021-01-04T18:20:19.807Z,True,1,0,0.0,0.0,0.0,6709.0
12332,13763,341,111.0,7653.0,2021-01-05T06:55:37.085Z,True,1,0,0.0,0.0,0.0,7653.0
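
The cumulative time on a skill, including the current opportunity, is a running sum per user and skill. A short sketch on illustrative values, assuming rows are ordered by `po` within each group:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [865, 865, 865, 1],
    "skill_id": [48.0, 48.0, 48.0, 7.0],
    "po": [1, 2, 3, 1],
    "learning_time_ms": [6215.0, 6170.0, 6129.0, 8835.0],
})
toy = toy.sort_values(["user_id", "skill_id", "po"])

# Running total of learning time within each (user, skill) group.
toy["time_on_current_skill_ms"] = (
    toy.groupby(["user_id", "skill_id"])["learning_time_ms"].cumsum()
)
print(toy.sort_index())
```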


In [10]:
PO_CUTOFF = 10

In [11]:
processed_df = processed_df[processed_df.po <= PO_CUTOFF]

In [12]:
processed_df['pessimistic_wheelspinning'] = processed_df.apply(
    lambda row: len(processed_df[
        (processed_df.user_id == row.user_id) & 
        (processed_df.skill_id == row.skill_id) & 
        (processed_df.correct_response_in_a_row_count >= 3)
    ]) == 0,
    axis=1
)

processed_df.head()

Unnamed: 0,event_id,user_id,skill_id,learning_time_ms,start,correct,po,correct_response_count,correct_response_percentage,correct_response_in_a_row_count,correct_response_in_a_row_percentage,time_on_current_skill_ms,pessimistic_wheelspinning
0,0,1,1.0,8835.0,2022-11-02T08:39:12.355Z,True,1,0,0.0,0.0,0.0,8835.0,True
12329,13756,341,36.0,4185.0,2020-12-09T07:50:48.457Z,True,1,0,0.0,0.0,0.0,4185.0,True
12330,13757,341,97.0,14024.0,2021-01-04T08:54:29.429Z,True,1,0,0.0,0.0,0.0,14024.0,True
12331,13762,341,110.0,6709.0,2021-01-04T18:20:19.807Z,True,1,0,0.0,0.0,0.0,6709.0,True
12332,13763,341,111.0,7653.0,2021-01-05T06:55:37.085Z,True,1,0,0.0,0.0,0.0,7653.0,True
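
The pessimistic label marks every row of a user and skill as wheel-spinning unless a streak of at least three was observed within the cutoff. A hedged sketch of the same rule with a grouped `transform` follows. `MASTERY_STREAK` is a name introduced here for the threshold of 3 used above, and the toy data is illustrative.

```python
import pandas as pd

MASTERY_STREAK = 3  # assumed name for the streak threshold used in the cell above

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "skill_id": [7.0] * 6,
    "po": [1, 2, 3, 4, 1, 2],
    "correct_response_in_a_row_count": [0, 1, 2, 3, 0, 0],
})

# Longest streak observed per (user, skill), broadcast back to every row of the group.
best_streak = toy.groupby(["user_id", "skill_id"])[
    "correct_response_in_a_row_count"
].transform("max")

# Wheel-spinning (pessimistic): the mastery streak was never reached.
toy["pessimistic_wheelspinning"] = best_streak < MASTERY_STREAK
print(toy)
```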


In [13]:
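# Users who reached the cutoff on at least one skill.
# Note: this check is per user, not per (user, skill) pair.
users_with_sufficient_po = processed_df[processed_df.po == PO_CUTOFF].user_id.unique()
processed_df['optimistic_wheelspinning'] = processed_df.user_id.isin(users_with_sufficient_po) & processed_df.pessimistic_wheelspinning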
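
The cell above treats a user as having enough data once they reach the cutoff on any skill. If the intent is to require enough opportunities on the same skill, one possible variant checks the cutoff per (user, skill) pair, as sketched below. This is an assumption about the intended definition, not the notebook's current behaviour. The toy cutoff of 3 (`PO_CUTOFF_TOY`) only keeps the example short.

```python
import pandas as pd

PO_CUTOFF_TOY = 3  # illustrative; the notebook uses PO_CUTOFF = 10

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "skill_id": [7.0, 7.0, 7.0, 4.0],
    "po": [1, 2, 3, 1],
    "pessimistic_wheelspinning": [True, True, True, True],
})

# Highest practice opportunity reached per (user, skill), broadcast to each row.
max_po = toy.groupby(["user_id", "skill_id"])["po"].transform("max")

# Optimistic label: no mastery AND the cutoff was reached on this very skill.
toy["optimistic_wheelspinning"] = toy["pessimistic_wheelspinning"] & (
    max_po >= PO_CUTOFF_TOY
)
print(toy)
```

In this toy example, skill 4 stays indeterminate because the user only attempted it once, whereas a per-user check would label it as wheel-spinning.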

In [18]:
processed_df.drop(columns=[
    'event_id', 'learning_time_ms', 'start', 'correct'
], inplace=True)
print(time.time() - start)

385.8427538871765


In [44]:
backup_df = processed_df.copy()
temp_df = processed_df[processed_df.correct_response_in_a_row_count == 3]
mastery_achieved = pd.DataFrame(temp_df.groupby(['user_id', 'skill_id'])['po'].min())

for index, row in processed_df.iterrows():
    if (
        ((row.user_id, row.skill_id) in mastery_achieved.index) and
        row.po >= mastery_achieved.loc[(row.user_id, row.skill_id)].po
    ):
        processed_df.drop(index=index, inplace=True)

processed_df.head()

Unnamed: 0,user_id,skill_id,po,correct_response_count,correct_response_percentage,correct_response_in_a_row_count,correct_response_in_a_row_percentage,time_on_current_skill_ms,pessimistic_wheelspinning,optimistic_wheelspinning
0,1,1.0,1,0,0.0,0.0,0.0,8835.0,True,False
12329,341,36.0,1,0,0.0,0.0,0.0,4185.0,True,False
12330,341,97.0,1,0,0.0,0.0,0.0,14024.0,True,False
12331,341,110.0,1,0,0.0,0.0,0.0,6709.0,True,False
12332,341,111.0,1,0,0.0,0.0,0.0,7653.0,True,False
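
Dropping rows inside `iterrows` works but is slow on larger frames. A sketch of the same filter with a join and a boolean mask is shown below: rows at or after the first opportunity where the streak reaches three are removed. The toy values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [865] * 7 + [1, 1],
    "skill_id": [48.0] * 7 + [7.0, 7.0],
    "po": [1, 2, 3, 4, 5, 6, 7, 1, 2],
    "correct_response_in_a_row_count": [0, 1, 2, 3, 4, 5, 6, 0, 0],
})

# First practice opportunity at which a streak of three is observed, per (user, skill).
mastery_po = (
    toy.loc[toy["correct_response_in_a_row_count"] >= 3]
    .groupby(["user_id", "skill_id"])["po"]
    .min()
    .rename("mastery_po")
)

# Attach the mastery point to every row; pairs without mastery get NaN.
joined = toy.join(mastery_po, on=["user_id", "skill_id"])

# Keep rows before mastery, and all rows of pairs that never reached mastery.
keep = joined["mastery_po"].isna() | (joined["po"] < joined["mastery_po"])
kept = joined.loc[keep].drop(columns="mastery_po")
print(kept)
```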


In [42]:
backup_df[backup_df.user_id == 865]

Unnamed: 0,user_id,skill_id,po,correct_response_count,correct_response_percentage,correct_response_in_a_row_count,correct_response_in_a_row_percentage,time_on_current_skill_ms,pessimistic_wheelspinning,optimistic_wheelspinning
29015,865,19.0,1,0,0.0,0.0,0.0,7041.0,True,True
29014,865,28.0,1,0,0.0,0.0,0.0,27446.0,True,True
29013,865,4.0,1,0,0.0,0.0,0.0,24348.0,True,True
29012,865,7.0,1,0,0.0,0.0,0.0,51015.0,True,True
29011,865,1.0,1,0,0.0,0.0,0.0,11372.0,True,True
...,...,...,...,...,...,...,...,...,...,...
29084,865,48.0,7,6,1.0,6.0,1.0,43028.0,False,False
29058,865,0.0,7,6,1.0,6.0,1.0,45362.0,False,False
29059,865,0.0,8,7,1.0,7.0,1.0,48031.0,False,False
29063,865,0.0,9,8,1.0,8.0,1.0,50415.0,False,False


In [45]:
processed_df[(processed_df.user_id == 865) & (processed_df.skill_id == 48)]

Unnamed: 0,user_id,skill_id,po,correct_response_count,correct_response_percentage,correct_response_in_a_row_count,correct_response_in_a_row_percentage,time_on_current_skill_ms,pessimistic_wheelspinning,optimistic_wheelspinning
29035,865,48.0,1,0,0.0,0.0,0.0,6215.0,False,False
29040,865,48.0,2,1,1.0,1.0,1.0,12385.0,False,False
29041,865,48.0,3,2,1.0,2.0,1.0,18514.0,False,False


In [46]:
no_indeterminate_df = processed_df[processed_df.optimistic_wheelspinning == processed_df.pessimistic_wheelspinning]
no_indeterminate_df.head()

Unnamed: 0,user_id,skill_id,po,correct_response_count,correct_response_percentage,correct_response_in_a_row_count,correct_response_in_a_row_percentage,time_on_current_skill_ms,pessimistic_wheelspinning,optimistic_wheelspinning
24447,712,180.0,1,0,0.0,0.0,0.0,6052.0,False,False
24446,712,172.0,1,0,0.0,0.0,0.0,1457.0,False,False
12361,341,166.0,1,0,0.0,0.0,0.0,1202.0,False,False
24456,712,175.0,1,0,0.0,0.0,0.0,12858.0,False,False
24459,712,166.0,1,0,0.0,0.0,0.0,945.0,False,False


In [47]:
print(len(processed_df))
print(len(no_indeterminate_df))

29469
9508
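
The comparison of the two boolean columns above separates determinate from indeterminate rows. As an optional sketch, the same information can be held in a single label column, which makes the class counts easy to inspect. The label names below are hypothetical and the toy rows are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pessimistic_wheelspinning": [False, True, True],
    "optimistic_wheelspinning": [False, True, False],
})

# Conditions are checked in order; rows matching neither are indeterminate.
toy["label"] = np.select(
    [~toy["pessimistic_wheelspinning"], toy["optimistic_wheelspinning"]],
    ["mastered", "wheel_spinning"],
    default="indeterminate",
)

# Equivalent to keeping rows where both boolean columns agree.
no_indeterminate = toy[toy["label"] != "indeterminate"]
print(toy["label"].value_counts())
```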


*Your discussion about your processing decisions goes here*

## Task 2: Model Building

Train a model for your research question. 

In [15]:
# Your code for training a model goes here


*Your discussion about your model training goes here*

## Task 3: Model Evaluation
In this task, you will use metrics to evaluate your model.

In [16]:
# Your code for model evaluation goes here

*Your discussion/interpretation about your model's behavior goes here*

## Task 4: Team Reflection
Please describe the contributions of each team member to Milestone 4. Reflect on how you worked as a team: what went well, and what can be improved for the next milestone?

*Your discussion about team responsibilities goes here*