In [1]:
import os
import numpy as np
import pandas as pd

### Load datasets

In [2]:
train = pd.read_csv('input/raw/data-science-bowl-2019/train.csv')

In [3]:
train_labels = pd.read_csv('input/raw/data-science-bowl-2019/train_labels.csv')

In [4]:
test = pd.read_csv('input/raw/data-science-bowl-2019/test.csv')

In [5]:
specs = pd.read_csv('input/raw/data-science-bowl-2019/specs.csv')

### Filter out unusable data

In [6]:
keep_id = train[train.type == "Assessment"][['installation_id']].drop_duplicates()
train = pd.merge(train, keep_id, on="installation_id", how="inner")

In [7]:
train.shape

(8294138, 11)

In [8]:
keep_id.shape

(4242, 1)

Are there installation_ids that took an assessment (those that never took one have already been removed) but have no results in train_labels? As shown below, yes: there are 628 of them.

In [9]:
discard_id = train[~train.installation_id.isin(train_labels.installation_id.unique())].installation_id.unique()

In [10]:
discard_id.shape

(628,)

In [11]:
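# .copy() makes the result an independent frame, so adding columns later does not trigger SettingWithCopyWarning
train = train[~train.installation_id.isin(discard_id)].copy()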

In [12]:
train.shape

(7734558, 11)
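The two filtering steps above can be reproduced on a toy frame to check the logic. This is only a sketch: `toy_train` and `toy_labels` are hypothetical stand-ins with invented values, not the competition data.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy_train = pd.DataFrame({
    "installation_id": ["a", "a", "b", "c", "c"],
    "type": ["Clip", "Assessment", "Clip", "Assessment", "Game"],
})
toy_labels = pd.DataFrame({"installation_id": ["a"]})

# Step 1: keep installations that have at least one Assessment row.
has_assessment = toy_train.loc[toy_train["type"] == "Assessment", "installation_id"].unique()
step1 = toy_train[toy_train["installation_id"].isin(has_assessment)]

# Step 2: of those, keep only installations that also appear in the labels.
step2 = step1[step1["installation_id"].isin(toy_labels["installation_id"])].copy()

print(sorted(step1["installation_id"].unique()))  # ['a', 'c']
print(sorted(step2["installation_id"].unique()))  # ['a']
```

Installation `b` drops out in step 1 (no assessment) and `c` drops out in step 2 (assessment taken, but no label), which corresponds to the 628 discarded ids in the real data.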

In [13]:
train.head()

Unnamed: 0,event_id,game_session,timestamp,event_data,installation_id,event_count,event_code,game_time,title,type,world
0,27253bdc,34ba1a28d02ba8ba,2019-08-06T04:57:18.904Z,"{""event_code"": 2000, ""event_count"": 1}",0006a69f,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE
1,27253bdc,4b57c9a59474a1b9,2019-08-06T04:57:45.301Z,"{""event_code"": 2000, ""event_count"": 1}",0006a69f,1,2000,0,Magma Peak - Level 1,Clip,MAGMAPEAK
2,77261ab5,2b9d5af79bcdb79f,2019-08-06T04:58:14.538Z,"{""version"":""1.0"",""event_count"":1,""game_time"":0...",0006a69f,1,2000,0,Sandcastle Builder (Activity),Activity,MAGMAPEAK
3,b2dba42b,2b9d5af79bcdb79f,2019-08-06T04:58:14.615Z,"{""description"":""Let's build a sandcastle! Firs...",0006a69f,2,3010,29,Sandcastle Builder (Activity),Activity,MAGMAPEAK
4,1325467d,2b9d5af79bcdb79f,2019-08-06T04:58:16.680Z,"{""coordinates"":{""x"":273,""y"":650,""stage_width"":...",0006a69f,3,4070,2137,Sandcastle Builder (Activity),Activity,MAGMAPEAK




What we need to do is build aggregated features for each game session that has a label in train_labels.

In [14]:
print(f'Number of rows in train_labels: {train_labels.shape[0]}')
print(f'Number of unique game_sessions in train_labels: {train_labels.game_session.nunique()}')

Number of rows in train_labels: 17690
Number of unique game_sessions in train_labels: 17690
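Because every row of train_labels is a distinct game_session, per-session features can be joined to the labels one-to-one. A minimal sketch of that idea follows, assuming toy data; the feature names `n_events` and `max_game_time` are hypothetical, and `accuracy_group` is assumed to be the label column.

```python
import pandas as pd

# Hypothetical toy events and labels; values are invented for illustration.
events = pd.DataFrame({
    "game_session": ["s1", "s1", "s2", "s3"],
    "event_code": [2000, 4100, 2000, 2000],
    "game_time": [0, 1500, 0, 0],
})
labels = pd.DataFrame({"game_session": ["s1", "s2"], "accuracy_group": [3, 0]})

# One row of aggregated features per game_session.
features = (
    events.groupby("game_session")
    .agg(n_events=("event_code", "size"), max_game_time=("game_time", "max"))
    .reset_index()
)

# validate="one_to_one" raises if game_session is duplicated on either side.
labelled = labels.merge(features, on="game_session", how="left", validate="one_to_one")
print(labelled)
```

Session `s3` has no label, so it does not appear in `labelled`.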


### Fix num correct and incorrect variables 

Taken from data-science-bowl-2019-data-exploration.

From Kaggle: The file train_labels.csv has been provided to show how these groups would be computed on the assessments in the training set. Assessment attempts are captured in event_code 4100 for all assessments except for Bird Measurer, which uses event_code 4110. If the attempt was correct, it contains "correct":true.

However, in the first version I noticed that I had one attempt too many for this installation_id when matching the rows against train_labels. It turns out that there are in fact also assessment attempts for Bird Measurer with event_code 4100, which should not count (see below). In this case that also makes sense, as this installation_id had already passed on the first attempt.

In [15]:
# Credit for this code chunk goes to Andrew Lukyanenko.
# Flag assessment attempts: event_code 4110 for Bird Measurer, 4100 for every other assessment.
train['attempt'] = 0
train.loc[(train['title'] == 'Bird Measurer (Assessment)') & (train['event_code'] == 4110),
          'attempt'] = 1
train.loc[(train['type'] == 'Assessment')
          & (train['title'] != 'Bird Measurer (Assessment)')
          & (train['event_code'] == 4100),
          'attempt'] = 1

# Record the outcome of each attempt; rows that are not attempts keep None.
train['correct'] = None
train.loc[(train['attempt'] == 1) & (train['event_data'].str.contains('"correct":true', regex=False)), 'correct'] = True
train.loc[(train['attempt'] == 1) & (train['event_data'].str.contains('"correct":false', regex=False)), 'correct'] = False
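To see how the attempt rule turns into per-session counts, here is a small sketch on invented events. It assumes the counts are meant to correspond to columns named `num_correct` and `num_incorrect`; the toy frame and the `summary` name are hypothetical.

```python
import pandas as pd

# Hypothetical toy events; column names follow the notebook, values are invented.
events = pd.DataFrame({
    "game_session": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "title": ["Bird Measurer (Assessment)"] * 3 + ["Cart Balancer (Assessment)"] * 3,
    "type": ["Assessment"] * 6,
    "event_code": [4100, 4110, 4110, 4100, 4100, 4100],
    "event_data": ['{"correct":true}', '{"correct":false}', '{"correct":true}',
                   '{"correct":false}', '{"correct":false}', '{"correct":true}'],
})

# Same rule as above: 4110 counts for Bird Measurer, 4100 for other assessments.
is_bird = events["title"] == "Bird Measurer (Assessment)"
is_attempt = (is_bird & (events["event_code"] == 4110)) | (
    (events["type"] == "Assessment") & ~is_bird & (events["event_code"] == 4100)
)
is_correct = events["event_data"].str.contains('"correct":true', regex=False)

# Keep attempt rows only, then count correct and incorrect attempts per session.
attempts = events.loc[is_attempt, ["game_session"]].assign(
    num_correct=is_correct[is_attempt].astype(int),
    num_incorrect=(~is_correct[is_attempt]).astype(int),
)
summary = attempts.groupby("game_session").sum()
print(summary)
```

The Bird Measurer row with event_code 4100 in session `s1` is ignored, so it does not inflate the attempt count for that session.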

### Save datasets

In [16]:
train.to_pickle('input/processed/X.pkl')

In [17]:
train_labels.to_pickle('input/processed/y.pkl')

In [18]:
test.to_pickle('input/processed/submission.pkl')