<a href="https://colab.research.google.com/github/jlingohr/riiid/blob/main/riiid_splits.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import files
from google.colab import drive
import os

import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm
from pyarrow import feather 


In [None]:
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/kaggle"

In [None]:
%cd /content/gdrive/My Drive/kaggle/riiid/
!pwd

/content/gdrive/My Drive/kaggle/riiid
/content/gdrive/My Drive/kaggle/riiid


In [None]:
random.seed(1234)
np.random.seed(1234)

## First load the data from feather format

In [None]:
%%time

dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "boolean"
}



data = pd.read_feather("data/no_lectures/train.feather")
# Filter out lectures
data = data[data.content_type_id == False]
data.reset_index(drop=True, inplace=True)
data['row_id'] = data.index

print("Train size:", data.shape)

Train size: (99270702, 10)
CPU times: user 6.77 s, sys: 3.99 s, total: 10.8 s
Wall time: 24.8 s


In [None]:
data.head()

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,0,115,5692,False,1,3,1,,
1,1,56943,115,5716,False,2,2,1,37000.0,False
2,2,118363,115,128,False,0,0,1,55000.0,False
3,3,131167,115,7860,False,3,0,1,19000.0,False
4,4,137965,115,7922,False,4,1,1,11000.0,False


In [None]:
usecols = [
    'row_id',
    'user_id',
    'content_id'
]
data = data.loc[:, usecols]


In [None]:
data.head()

Unnamed: 0,row_id,user_id,content_id
0,0,115,5692
1,1,115,5716
2,2,115,128
3,3,115,7860
4,4,115,7922


### Look at distribution of questions

In [None]:
data.content_id.value_counts().sort_values(ascending=True).head(20)

1486      1
5823      1
10033     1
1485      1
10008     1
10005     1
10007     1
1484      1
10006     1
3557      3
3572      5
7547     11
7548     11
7549     11
7546     11
7550     12
7567     14
7566     14
7568     15
4741     18
Name: content_id, dtype: int64

There are 9 questions that appear only once in the entire dataset and 10 questions that appear fewer than 5 times. We need to make sure that these questions are trained on in *every* fold so that when they appear in test we can correctly embed them. To make sure they appear in all folds we can:
1. For each user who saw one of these rare questions, just use all of their rows for training.
2. For each such user, make sure that the smallest fold extends at least to the timestamp of that question.
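
The second option is not implemented in this notebook, but it could be sketched roughly as follows. The function name `train_cutoff` and its `last_rare_row` argument are hypothetical, and the sketch assumes each user's rows are contiguous in `row_id` order.

```python
def train_cutoff(row_min, row_max, fraction, last_rare_row=None):
    """Return the first row_id that goes to validation for one user.

    row_min and row_max bound the user's contiguous rows, fraction is the
    share of rows kept for training, and last_rare_row (hypothetical) is the
    row_id of the user's last rare question, if the user has one.
    """
    length = row_max - row_min + 1
    cutoff = row_min + int(fraction * length)
    if last_rare_row is not None:
        # Move the boundary past the rare question so it always lands in train.
        cutoff = max(cutoff, last_rare_row + 1)
    return cutoff
```

Rows below the returned value would go to train and the rest to validation, so a user whose rare question sits late in the sequence simply contributes fewer validation rows.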

In [None]:
target_questions = data.content_id.value_counts().sort_values(ascending=True)
target_questions = target_questions[target_questions < 5].index.to_list()
target_questions

[1486, 5823, 10033, 1485, 10008, 10005, 10007, 1484, 10006, 3557]

In [None]:
user_ids_with_target_questions = data[data.content_id.isin(target_questions)]
user_ids_with_target_questions

Unnamed: 0,row_id,user_id,content_id
15715993,15715993,343784114,3557
15716248,15716248,343784114,1485
15716249,15716249,343784114,1486
15716250,15716250,343784114,1484
61536271,61536271,1333688829,10033
72695488,72695488,1576785630,3557
87529236,87529236,1896513376,3557
87529237,87529237,1896513376,5823
95688100,95688100,2070144393,10008
95688101,95688101,2070144393,10007


So only 5 users account for every occurrence of these rare questions.

In [None]:
# Get highest row id for question for each user
last_question_row_id = user_ids_with_target_questions.groupby(['user_id']).row_id.max()
last_question_row_id

user_id
343784114     15716250
1333688829    61536271
1576785630    72695488
1896513376    87529237
2070144393    95688103
Name: row_id, dtype: int64

In [None]:
users_with_target_questions = data[data.user_id.isin(user_ids_with_target_questions.user_id.unique())]
users_with_target_questions

Unnamed: 0,row_id,user_id,content_id
15708825,15708825,343784114,128
15708826,15708826,343784114,7860
15708827,15708827,343784114,7922
15708828,15708828,343784114,156
15708829,15708829,343784114,51
...,...,...,...
95688146,95688146,2070144393,4441
95688147,95688147,2070144393,6196
95688148,95688148,2070144393,9194
95688149,95688149,2070144393,5560


In [None]:
row_min, row_max = users_with_target_questions.groupby(['user_id']).row_id.min(), users_with_target_questions.groupby(['user_id']).row_id.max()
print(row_min)
print(row_max)

user_id
343784114     15708825
1333688829    61533590
1576785630    72691190
1896513376    87524303
2070144393    95686414
Name: row_id, dtype: int64
user_id
343784114     15717301
1333688829    61538672
1576785630    72696741
1896513376    87529571
2070144393    95688150
Name: row_id, dtype: int64


Since some of the questions appear only at the end of these users' sequences, the naive approach for these users is to train on all of their rows and not bother splitting them as we did earlier.

In [None]:
# If last_question_row_id lies past the 70th percentile of the user's series, keep the series
# beyond it for training (i.e. the whole series). Otherwise the user can be treated as normal.

(last_question_row_id - row_min) / (row_max - row_min)

user_id
343784114     0.876003
1333688829    0.527548
1576785630    0.774275
1896513376    0.936598
2070144393    0.972926
Name: row_id, dtype: float64

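We can treat user 1576785630 as normal because question 3557 is also seen by two other users, both of whom later see questions that no one else sees, so their full sequences are kept for training anyway. We can also treat user 1333688829 as normal, since its rare question sits at about 53% of its sequence, well before any fold cutoff. For the rest, it is easiest to just take the whole sequence.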
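
As a rough sketch, the positional part of this rule can be written as a small helper. The name `users_to_keep_whole` and the default threshold of 0.7 are assumptions for illustration, and the row bounds are copied from the outputs above.

```python
import pandas as pd

def users_to_keep_whole(last_rare_row, row_min, row_max, threshold=0.7):
    """Return user ids whose last rare question lies past `threshold` of their sequence."""
    position = (last_rare_row - row_min) / (row_max - row_min)
    return position[position > threshold].index.tolist()

# Row bounds copied from the outputs above, indexed by user_id.
users = [343784114, 1333688829, 1576785630, 1896513376, 2070144393]
last_rare_row = pd.Series([15716250, 61536271, 72695488, 87529237, 95688103], index=users)
row_min = pd.Series([15708825, 61533590, 72691190, 87524303, 95686414], index=users)
row_max = pd.Series([15717301, 61538672, 72696741, 87529571, 95688150], index=users)

flagged = users_to_keep_whole(last_rare_row, row_min, row_max)
```

With a 0.7 threshold this also flags user 1576785630, which is excluded by hand here because question 3557 is covered by other users.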

In [None]:
users_with_target_questions = users_with_target_questions[~users_with_target_questions.user_id.isin([1576785630, 1333688829])]
users_with_target_questions

Unnamed: 0,row_id,user_id,content_id
15708825,15708825,343784114,128
15708826,15708826,343784114,7860
15708827,15708827,343784114,7922
15708828,15708828,343784114,156
15708829,15708829,343784114,51
...,...,...,...
95688146,95688146,2070144393,4441
95688147,95688147,2070144393,6196
95688148,95688148,2070144393,9194
95688149,95688149,2070144393,5560


## Splitting Strategy

Typically we would want something like an 80-20 train-validation split, but doing it the typical way won't work in this competition. We want a validation set that contains users who have not been seen during training, but we don't know what percentage of users that should be.

I think it would be reasonable to take 90% of users for train and the rest for validation. Then, for the users in that 90%, only use the first 0.7, 0.8, and 0.9 of their observations for training, and put the rest in validation. This way we get multiple folds and can see how the model improves with more data. (The split cells further down end up holding out 1% of users and using cutoffs of 0.85, 0.9, and 0.95.)
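
A minimal vectorised sketch of the per-user part of this idea is shown below. The helper name `split_by_fraction` and its `min_len` default are assumptions, and it relies on the rows of each user already being in chronological order.

```python
import pandas as pd

def split_by_fraction(df, fraction, min_len=10):
    """Send the first `fraction` of each user's rows to train and the rest to validation.

    Users with fewer than `min_len` rows are kept entirely in train.
    """
    grouped = df.groupby("user_id")
    position = grouped.cumcount()                    # 0-based position within the user
    length = grouped["row_id"].transform("count")    # number of rows for the user
    cutoff = (fraction * length).astype(int)
    in_train = (length < min_len) | (position < cutoff)
    return df[in_train], df[~in_train]

# Toy data: user 1 has 20 rows, user 2 has only 4.
toy = pd.DataFrame({"row_id": range(24), "user_id": [1] * 20 + [2] * 4})
toy_train, toy_val = split_by_fraction(toy, 0.5)
```

Calling it once per fraction would give one fold per cutoff without looping over users in Python.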

**Alternatively**, would it be better to compute the mean number of events per user and place a user in validation if their count falls outside some distance from the mean? This way we test on users with very few or very many events, but at the cost of not training on them.
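
One possible reading of "some distance from the mean" is a multiple of the standard deviation of the per-user event counts. That choice, the helper name `users_far_from_mean`, and the parameter `k` are all assumptions in this sketch.

```python
import pandas as pd

def users_far_from_mean(user_ids, k=1.0):
    """Return ids of users whose event count is more than k standard deviations from the mean."""
    counts = user_ids.value_counts()
    distance = (counts - counts.mean()).abs()
    return sorted(counts.index[distance > k * counts.std()].tolist())

# Toy data: four users with 2 events each and one user with 20.
toy_user_ids = pd.Series([1] * 2 + [2] * 2 + [3] * 2 + [4] * 2 + [5] * 20)
far_users = users_far_from_mean(toy_user_ids)
```

Because event counts per user tend to be heavily skewed, a rule like this may mostly pick up very active users, so a quantile-based cutoff could be worth comparing.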

**Finally**, we can calculate the proportion of unique user ids overall and, since we know we are testing on 2.5 million examples, use that proportion to estimate the number of new ids (unseen during training). Then we can also create the batches such that if a user has a sequence longer than our max sequence length, we partition it into equal-sized sequences.
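
The partitioning step could look roughly like this. It is only a sketch: the helper name `partition_sequence` is hypothetical, and "equal-sized" is interpreted as chunks whose lengths differ by at most one.

```python
import math
import numpy as np

def partition_sequence(row_ids, max_len):
    """Split a sequence into the fewest near-equal chunks of at most max_len items."""
    n_chunks = max(math.ceil(len(row_ids) / max_len), 1)
    # np.array_split allows uneven splits, so chunk lengths differ by at most one.
    return [chunk.tolist() for chunk in np.array_split(np.asarray(row_ids), n_chunks)]

chunks = partition_sequence(list(range(10)), max_len=4)
```

Here a sequence of 10 rows with a maximum length of 4 is split into three chunks of 4, 3, and 3 rows rather than 4, 4, and 2.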

In [None]:
print(len(data))
print(data.user_id.nunique())
print(data.user_id.nunique() / len(data))

99270702
393401
0.003962911433828684


In [None]:
keep_all_ids = user_ids_with_target_questions.user_id.unique().tolist()
keep_all_ids = list(filter(lambda x: x not in [1576785630, 1333688829], keep_all_ids))
keep_all_ids

[343784114, 1896513376, 2070144393]

In [None]:
# Split on user IDs first
user_ids = data.user_id.unique()
user_ids = list(filter(lambda x: x not in keep_all_ids, user_ids))
print(len(user_ids))
train_ids, val_ids = train_test_split(user_ids, test_size=0.01)
print(len(train_ids))
print(len(val_ids))

393398
389464
3934


In [None]:
assert len(set(train_ids).intersection(val_ids)) == 0
assert len(set(train_ids).intersection(keep_all_ids)) == 0

In [None]:
# save users who only appear in val set
pd.DataFrame({'user_id': val_ids}).sort_values(by='user_id').reset_index(drop=True).to_feather('data/val_unseen_users.feather')

In [None]:
# split into train and val where val contains users that are not in train
# Then we have to further partition the rows of train
train, val = data[data.user_id.isin(train_ids)], data[data.user_id.isin(val_ids)]
print(train.shape)
print(val.shape)
print("Got ", len(train) + len(val), " rows, expected ", len(data))

(98265538, 3)
(989681, 3)
Got  99255219  rows, expected  99270702


In [None]:
train.user_id.isin(val.user_id).value_counts() # Should be all false

False    98265538
Name: user_id, dtype: int64

In [None]:
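# Users kept entirely for training are in neither train nor val, so count them separately
assert len(train) + len(val) + len(users_with_target_questions) == len(data)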

In [None]:
assert (train.index == train.row_id).sum() == len(train)

In [None]:
assert (val.index == val.row_id).sum() == len(val)

In [None]:
val.head()

Unnamed: 0,row_id,user_id,content_id
34043,34043,952772,7900
34044,34044,952772,7876
34045,34045,952772,175
34046,34046,952772,1278
34047,34047,952772,2064


In [None]:
# Fraction of each training user's rows that goes to train in each fold
FOLD_PERCENTILES = {0: 0.85, 1: 0.9, 2: 0.95}

def create_fold(train_df, val_df, train_all_df, fold_idx):
    """Return (train_row_ids, val_row_ids) for one fold.

    Relies on each user's rows being contiguous in row_id order.
    """
    grouped = train_df.groupby(['user_id'])
    row_id_min = grouped.row_id.min()
    row_id_max = grouped.row_id.max()
    fraction = FOLD_PERCENTILES[fold_idx]

    train_row_ids, val_row_ids = [], []

    for row_min, row_max in tqdm(zip(row_id_min, row_id_max), total=len(row_id_min)):
        row_ids = np.arange(row_min, row_max+1).tolist()
        length = len(row_ids)

        if length < 10:
            # Too short to split: keep the whole sequence in train
            train_row_ids += row_ids
        else:
            # Earliest rows go to train, the remainder to validation
            cutoff = int(fraction * length)
            train_row_ids += row_ids[:cutoff]
            val_row_ids += row_ids[cutoff:]

    # Add rows from users seen only in validation and from users kept entirely in train
    val_row_ids += val_df.row_id.tolist()
    train_row_ids += train_all_df.row_id.tolist()
    print('Expect {} rows'.format(len(data)))
    print("Found {} rows. {} train and {} val".format(len(train_row_ids) + len(val_row_ids), len(train_row_ids), len(val_row_ids)))

    return train_row_ids, val_row_ids



In [None]:
train_row_ids, val_row_ids = create_fold(train, val, users_with_target_questions, 0)
assert len(set(train_row_ids).intersection(val_row_ids)) == 0
pd.DataFrame({'row_id': train_row_ids}).to_feather('data/no_lectures/fold_0/train_rows.feather')
pd.DataFrame({'row_id': val_row_ids}).to_feather('data/no_lectures/fold_0/val_rows.feather')


Expect 99270702 rows
Found 99270702 rows. 83365054 train and 15905648 val


In [None]:
train_row_ids, val_row_ids = create_fold(train, val, users_with_target_questions, 1)
assert len(set(train_row_ids).intersection(val_row_ids)) == 0
pd.DataFrame({'row_id': train_row_ids}).to_feather('data/no_lectures/fold_1/train_rows.feather')
pd.DataFrame({'row_id': val_row_ids}).to_feather('data/no_lectures/fold_1/val_rows.feather')


Expect 99270702 rows
Found 99270702 rows. 88314713 train and 10955989 val


In [None]:
train_row_ids, val_row_ids = create_fold(train, val, users_with_target_questions, 2)
assert len(set(train_row_ids).intersection(val_row_ids)) == 0
pd.DataFrame({'row_id': train_row_ids}).to_feather('data/no_lectures/fold_2/train_rows.feather')
pd.DataFrame({'row_id': val_row_ids}).to_feather('data/no_lectures/fold_2/val_rows.feather')


Expect 99270702 rows
Found 99270702 rows. 93197329 train and 6073373 val
