The features we engineer were inspired by [this](https://www.kaggle.com/mhviraf/a-new-baseline-for-dsb-2019-catboost-model) Kaggle kernel. The idea is to look at a player's activity before an assessment: the number and types of clips they watched, the number and types of games they played, the number and types of assessments they took, their accuracy on those assessments, and so on.

Recall that specs.csv describes the various events in any game_session. We will not use it to engineer our features, so to save memory we will not load specs.csv for the time being.

**List of features to create**


1) number of clips watched before attempting the assessment: there are 20 different titles of type 'Clip'. We create one entry for each title, plus one entry for the total number of clips watched before the assessment irrespective of title => 21 features.

2) number of activities done before the assessment: 8 different activities + 1 for all the activities = 9 features. 

3) number of games played before the assessment: 11 different games + 1 for all the games = 12 features

4) number of assessments done before the current assessment: 5 different assessments + 1 for all the assessments = 6 features

5) total (accumulated) amount of time spent watching different clips before the assessment: 20 different clips + 1 for all the clips = 21 features

6) total time spent on activities: 8 different activities + 1 for all the activities = 9 features

7) total time spent on games: 11 different games + 1 for all the games = 12 features

8) total time spent on previous assessments: 5 different assessments + 1 for all the assessments = 6 features

9) average event_count in the clips watched (before the assessment): 20+1= 21 features

10) average event_count in activities (before the assessment): 8+1 = 9 features

11) average event_count in games (before the assessment): 11+1=12 features

12) average event_count in previous assessments: 5+1=6 features

13) average accuracy in previous assessments: 5+1 = 6 features


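In total, the list above therefore gives 48 x 3 + 6 = 150 features: 48 (= 21 + 9 + 12 + 6) for each of the three statistics (session count, total time, average event_count), plus 6 for average accuracy.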
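The count can be checked with a few lines of arithmetic. The title counts per type are the ones quoted in the list above; the names `n_titles`, `per_statistic` and `n_features` are only for illustration.

```python
# Number of distinct titles per session type, as quoted in the list above.
n_titles = {'Clip': 20, 'Activity': 8, 'Game': 11, 'Assessment': 5}

# One feature per title plus one aggregate feature per type: 21 + 9 + 12 + 6.
per_statistic = sum(n + 1 for n in n_titles.values())

# Three statistics (session count, total time, average event_count) cover all
# four types; average accuracy is only defined for assessments.
n_features = 3 * per_statistic + (n_titles['Assessment'] + 1)

print(per_statistic, n_features)
```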

Also note that we should only consider game_sessions that lasted for a non-zero time interval. Some sessions were abandoned immediately after starting, for example the session with id = '45bb1e1b6b50c07b'. These should not be included in the statistics assembled above.
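One possible way to identify such sessions is to compute every session's duration in a single grouped pass and keep only those with a positive duration. This is a minimal sketch on a toy dataframe, not the approach taken in the cells below: the column names match `train` after the timestamp conversion, but the data and the helper name `nonzero_sessions` are made up for illustration.

```python
import pandas as pd

def nonzero_sessions(df):
    """Per-session duration (ms) and event count, without zero-duration sessions."""
    stats = df.groupby('game_session').agg(
        start=('datetime', 'min'),
        end=('datetime', 'max'),
        n_events=('event_count', 'max'),
    )
    # Dividing by a one-millisecond Timedelta yields plain floats in ms.
    stats['duration_ms'] = (stats['end'] - stats['start']) / pd.Timedelta(milliseconds=1)
    return stats.loc[stats['duration_ms'] > 0, ['duration_ms', 'n_events']]

# Toy data: session 'a' has a single event (zero duration), session 'b' has three.
toy = pd.DataFrame({
    'game_session': ['a', 'b', 'b', 'b'],
    'event_count': [1, 1, 2, 3],
    'datetime': pd.to_datetime([
        '2019-09-06 17:53:46.937',
        '2019-09-06 17:54:56.302',
        '2019-09-06 17:54:56.387',
        '2019-09-06 17:55:03.253',
    ]),
})

good = nonzero_sessions(toy)
print(good)
```

Computing the statistics once per dataframe, rather than once per session, avoids rescanning the full table for every session id.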

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import time

In [2]:
print('The current working directory is: {}'.format(os.getcwd()))

The current working directory is: C:\Users\agarw\Dropbox\Kaggle data-science-bowl 2019\Prarit-data-science-bowl-2019


In [3]:
train=pd.read_csv('train.csv')

In [4]:
train.shape

(11341042, 11)

In [5]:
trainlbls=pd.read_csv('train_labels.csv')

In [6]:
trainlbls.shape

(17690, 7)

In [7]:
# merging the data in train and trainlbls according to game_session, installation_id and title
train=train.merge(trainlbls, how='left', on=['game_session', 'installation_id', 'title'])
train.shape

(11341042, 15)
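The row count is unchanged (11341042 before and after), so the left merge did not duplicate any rows of `train`. As an aside, `merge` accepts a `validate` argument that raises an error when the right-hand keys are not unique, which turns this check into an explicit guard. The sketch below uses tiny made-up frames (`events`, `labels`) rather than the competition data.

```python
import pandas as pd

# Hypothetical stand-ins for train and train_labels.
events = pd.DataFrame({
    'game_session': ['s1', 's1', 's2'],
    'installation_id': ['i1', 'i1', 'i1'],
    'title': ['title A', 'title A', 'title B'],
})
labels = pd.DataFrame({
    'game_session': ['s1'],
    'installation_id': ['i1'],
    'title': ['title A'],
    'accuracy_group': [3],
})

keys = ['game_session', 'installation_id', 'title']
# 'many_to_one' requires each key combination to appear at most once in
# `labels`, so the left merge cannot increase the number of rows.
merged = events.merge(labels, how='left', on=keys, validate='many_to_one')

print(merged.shape)
```

Rows without a matching label (here the event of session 's2') keep NaN in the label columns, as in the `train.head(10)` output further down.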

In [8]:
train.keys()

Index(['event_id', 'game_session', 'timestamp', 'event_data',
       'installation_id', 'event_count', 'event_code', 'game_time', 'title',
       'type', 'world', 'num_correct', 'num_incorrect', 'accuracy',
       'accuracy_group'],
      dtype='object')

In [9]:
# for testing purposes, we first work with only a small amount of data:
# the data for just 10 installation ids.

id_chosen=train.installation_id.unique()[0:10]

In [10]:
# the installation_ids chosen above
id_chosen

array(['0001e90f', '000447c4', '0006a69f', '0006c192', '0009a5a9',
       '0011edc8', '00129856', '0016b7cc', '00195df7', '001d0ed0'],
      dtype=object)

In [11]:
# reducing the train dataframe size by restricting to only 10 installation ids as chosen above
# use the pd.Series.isin() to choose rows based on a given list
# this was suggested in the following stackexchange post:
# https://stackoverflow.com/questions/12096252/use-a-list-of-values-to-select-rows-from-a-pandas-dataframe
train=train.loc[train.installation_id.isin(id_chosen)]

In [12]:
# shape of the reduced dataframe
train.shape

(11091, 15)

In [13]:
# converting the timestamp string to datetime format
# for documentation on pandas datetime functionality, see
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
train['datetime']=pd.to_datetime(train.timestamp)
train.drop(columns=['timestamp'], inplace=True)

In [14]:

train.head(10)

Unnamed: 0,event_id,game_session,event_data,installation_id,event_count,event_code,game_time,title,type,world,num_correct,num_incorrect,accuracy,accuracy_group,datetime
0,27253bdc,45bb1e1b6b50c07b,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE,,,,,2019-09-06 17:53:46.937
1,27253bdc,17eeb7f223665f53,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Magma Peak - Level 1,Clip,MAGMAPEAK,,,,,2019-09-06 17:54:17.519
2,77261ab5,0848ef14a8dc6892,"{""version"":""1.0"",""event_count"":1,""game_time"":0...",0001e90f,1,2000,0,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-09-06 17:54:56.302
3,b2dba42b,0848ef14a8dc6892,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,2,3010,53,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-09-06 17:54:56.387
4,1bb5fbdb,0848ef14a8dc6892,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,3,3110,6972,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-09-06 17:55:03.253
5,1325467d,0848ef14a8dc6892,"{""coordinates"":{""x"":583,""y"":605,""stage_width"":...",0001e90f,4,4070,9991,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-09-06 17:55:06.279
6,1325467d,0848ef14a8dc6892,"{""coordinates"":{""x"":601,""y"":570,""stage_width"":...",0001e90f,5,4070,10622,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-09-06 17:55:06.913
7,1325467d,0848ef14a8dc6892,"{""coordinates"":{""x"":250,""y"":665,""stage_width"":...",0001e90f,6,4070,11255,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-09-06 17:55:07.546
8,1325467d,0848ef14a8dc6892,"{""coordinates"":{""x"":279,""y"":629,""stage_width"":...",0001e90f,7,4070,11689,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-09-06 17:55:07.979
9,1325467d,0848ef14a8dc6892,"{""coordinates"":{""x"":839,""y"":654,""stage_width"":...",0001e90f,8,4070,12272,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-09-06 17:55:08.566


In [84]:
# function to compute the total time (in ms) spent and number of events in a game_session

def game_session_time_events(game_session=None, dataframe = train):
    if not game_session:
        print('no game_session was passed to the function')
        return 
    sess=dataframe.loc[dataframe.game_session==game_session]
    
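    # time spent (in ms): Timedelta.delta is not available in recent pandas versions,
    # so the elapsed time is divided by a one-millisecond Timedelta instead
    session_time= sess.datetime 
    start_time=session_time.min()
    end_time=session_time.max()
    time_interval=(end_time-start_time)/pd.Timedelta(milliseconds=1)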
    
    # number of events
    num_events=sess.event_count.max()
    
    
    return time_interval, num_events

In [90]:
game_session_time_events()

no game_session was passed to the function


In [91]:
game_session_time_events('0848ef14a8dc6892')

(194307.0, 267)

In [87]:
# function to count previous game_sessions of a given type and title

def count_prev_sessions(dataframe, current_time, typ, title):
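    '''dataframe: the dataframe containing the record of a particular player
       current_time: the time at which the assessment (whose statistics we are looking for) started
       typ: the session type to count ('Clip', 'Activity', 'Game' or 'Assessment')
       title: the title of the sessions to count'''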
    
    # locating game_sessions of the specified type and title that occurred before current_time
    earlier_sessions=dataframe.loc[(dataframe.datetime<current_time) 
                                   & (dataframe.type==typ) & (dataframe.title==title)].game_session.unique()
    
    # selecting sessions with non-zero session_time; the player's dataframe is passed
    # explicitly so that the lookup does not fall back to the global train dataframe
    session_time_events=[game_session_time_events(session, dataframe) for session in earlier_sessions]
    good_sessions=[(session, time_events[0], time_events[1])
                   for session, time_events in zip(earlier_sessions, session_time_events)
                   if time_events[0]>0]
    
    num_good_sessions=len(good_sessions)
    
    # The zip(*...) idiom below, which transposes a list of tuples so that the n-th
    # element of every tuple can be collected, is based on the following stackexchange post
    # https://stackoverflow.com/questions/12142133/how-to-get-first-element-in-a-list-of-tuples
    cumulative_time=0
    avg_n_events=0
    if num_good_sessions>0: 
        _, times, n_events=zip(*good_sessions)
        cumulative_time=np.sum(times)
        avg_n_events=np.mean(n_events)
    
    return earlier_sessions, good_sessions, num_good_sessions, cumulative_time, avg_n_events

In [20]:
train.type.unique()

array(['Clip', 'Activity', 'Game', 'Assessment'], dtype=object)

In [29]:
train.loc[(train.installation_id=='0006a69f') & (train.type=='Assessment')].game_session.unique()

array(['901acc108f55a5a1', '77b8ee947eb84b4e', '6bdf9623adc94d89',
       'e7e7db2a241eadcc', '9501794defd84e4d', 'a9ef3ecb3d1acc6a'],
      dtype=object)

In [31]:
stTm=train.loc[train.game_session=='901acc108f55a5a1'].datetime.min()
print(stTm)

2019-08-06 05:22:01.344000


In [89]:
count_prev_sessions(dataframe=train.loc[train.installation_id=='0006a69f'],
                    current_time=stTm, typ='Activity', title='Sandcastle Builder (Activity)')

(array(['2b9d5af79bcdb79f'], dtype=object),
 [('2b9d5af79bcdb79f', 89774.0, 102)],
 1,
 89774.0,
 102.0)
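Calling `count_prev_sessions` once per title yields the per-title entries of the feature list; the "all titles" entries then follow by aggregating over titles. Below is a rough sketch of that assembly step for a single type, starting from (title, time, events) tuples for earlier sessions that already passed the non-zero-duration filter. The function name `features_for_type`, the feature-name format and the example titles are assumptions made for illustration, and the overall average is taken over sessions rather than over titles, which is only one possible reading of the list.

```python
import numpy as np

def features_for_type(typ, titles, sessions):
    """sessions: list of (title, time_ms, n_events) for good earlier sessions of type `typ`."""
    feats = {}
    # 'ALL' is a hypothetical label for the aggregate entry over every title.
    for title in titles + ['ALL']:
        rows = [s for s in sessions if title == 'ALL' or s[0] == title]
        feats[f'{typ}_{title}_count'] = len(rows)
        feats[f'{typ}_{title}_time_ms'] = float(np.sum([r[1] for r in rows])) if rows else 0.0
        feats[f'{typ}_{title}_avg_events'] = float(np.mean([r[2] for r in rows])) if rows else 0.0
    return feats

# Made-up sessions: two of title 'T1', one of 'T2', none of 'T3'.
sessions = [('T1', 1000.0, 10), ('T1', 3000.0, 30), ('T2', 500.0, 5)]
feats = features_for_type('Game', ['T1', 'T2', 'T3'], sessions)
print(feats)
```

Titles with no earlier sessions get zeros here, mirroring the defaults returned by `count_prev_sessions` when no good session is found.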

In [33]:
train.loc[train.installation_id=='0006a69f']

Unnamed: 0,event_id,game_session,event_data,installation_id,event_count,event_code,game_time,title,type,world,num_correct,num_incorrect,accuracy,accuracy_group,datetime
1538,27253bdc,34ba1a28d02ba8ba,"{""event_code"": 2000, ""event_count"": 1}",0006a69f,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE,,,,,2019-08-06 04:57:18.904
1539,27253bdc,4b57c9a59474a1b9,"{""event_code"": 2000, ""event_count"": 1}",0006a69f,1,2000,0,Magma Peak - Level 1,Clip,MAGMAPEAK,,,,,2019-08-06 04:57:45.301
1540,77261ab5,2b9d5af79bcdb79f,"{""version"":""1.0"",""event_count"":1,""game_time"":0...",0006a69f,1,2000,0,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-08-06 04:58:14.538
1541,b2dba42b,2b9d5af79bcdb79f,"{""description"":""Let's build a sandcastle! Firs...",0006a69f,2,3010,29,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-08-06 04:58:14.615
1542,1325467d,2b9d5af79bcdb79f,"{""coordinates"":{""x"":273,""y"":650,""stage_width"":...",0006a69f,3,4070,2137,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-08-06 04:58:16.680
1543,1325467d,2b9d5af79bcdb79f,"{""coordinates"":{""x"":863,""y"":237,""stage_width"":...",0006a69f,4,4070,3937,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-08-06 04:58:18.474
1544,1325467d,2b9d5af79bcdb79f,"{""coordinates"":{""x"":817,""y"":617,""stage_width"":...",0006a69f,5,4070,4820,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-08-06 04:58:19.365
1545,1bb5fbdb,2b9d5af79bcdb79f,"{""description"":""Let's build a sandcastle! Firs...",0006a69f,6,3110,6954,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-08-06 04:58:21.490
1546,1325467d,2b9d5af79bcdb79f,"{""coordinates"":{""x"":809,""y"":180,""stage_width"":...",0006a69f,7,4070,8187,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-08-06 04:58:22.732
1547,5e812b27,2b9d5af79bcdb79f,"{""size"":0,""coordinates"":{""x"":782,""y"":207,""stag...",0006a69f,8,4030,8745,Sandcastle Builder (Activity),Activity,MAGMAPEAK,,,,,2019-08-06 04:58:23.295
