The features we engineer were inspired by [this](https://www.kaggle.com/mhviraf/a-new-baseline-for-dsb-2019-catboost-model) Kaggle kernel. The idea is to look at a person's activity before an assessment: the number and types of clips they watched, the number and types of games they played, the number and types of assessments they took, their accuracy on those assessments, and so on.

Recall that specs.csv describes the various events in any game_session. We will not use it to engineer our features, so to save memory we do not load specs.csv for the time being.

 **List of features to create**


1) number of clips watched before attempting the assessment: there are 20 different titles of type 'Clip'. We create one entry for each title, plus one entry for the total number of clips watched before the assessment irrespective of title => 21 features.

2) number of activities done before the assessment: 8 different activities + 1 for all the activities = 9 features. 

3) number of games played before the assessment: 11 different games + 1 for all the games = 12 features

4) number of assessments done before the current assessment: 5 different assessments + 1 for all the assessments = 6 features

5) total (accumulated) amount of time spent watching different clips before the assessment: 20 different clips + 1 for all the clips = 21 features

6) total time spent on activities: 8 different activities + 1 for all the activities = 9 features

7) total time spent on games: 11 different games + 1 for all the games = 12 features

8) total time spent on previous assessments: 5 different assessments + 1 for all the assessments = 6 features

9) average event_count in the clips watched (before the assessment): 20+1= 21 features

10) average event_count in activities (before the assessment): 8+1 = 9 features

11) average event_count in games (before the assessment): 11+1=12 features

12) average event_count in previous assessments: 5+1=6 features

13) average accuracy in previous assessments: 5+1 = 6 features

14) average accuracy_grp of previous assessments: 5+1 = 6 features

15) current assessment's title: 1 feature

16) target variable, i.e. the current assessment's accuracy group: 1 feature


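In total, we therefore have a dataframe with 48 x 3 + 6 + 6 + 1 + 1 = 158 columns based on the above list. Here 48 = 21 + 9 + 12 + 6 is the number of columns for each of the three statistics (count, total time, average event_count) that apply to every type.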
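
The column count can be double-checked with a few lines of arithmetic. This is only a sanity check: the title counts are the ones quoted in the list above, and the variable names are illustrative rather than part of the notebook.

```python
# Number of distinct titles per type, as quoted in the feature list above
titles_per_type = {'Clip': 20, 'Activity': 8, 'Game': 11, 'Assessment': 5}

# Each statistic gets one column per title plus one "all titles" column per type
cols_per_stat = sum(n + 1 for n in titles_per_type.values())  # 21 + 9 + 12 + 6

# Three statistics (count, total time, average event_count) cover every type
general_cols = 3 * cols_per_stat

# Two extra statistics (accuracy, accuracy_grp) exist only for assessments
assessment_only_cols = 2 * (titles_per_type['Assessment'] + 1)

# Current assessment's title + target variable
extra_cols = 2

total_cols = general_cols + assessment_only_cols + extra_cols
print(cols_per_stat, total_cols)
```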

Also note that we should only consider game_sessions that lasted for a non-zero time interval. Some game_sessions were abandoned immediately after starting, for example the session with id = '45bb1e1b6b50c07b'. These should not be included in the statistics assembled above.
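
A minimal sketch of how such zero-duration sessions could be identified, assuming the timestamps have already been converted to a `datetime` column. The toy frame, its session ids, and the names `session_summary` and `good_sessions` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train: session 'a' has a single event (zero duration),
# session 'b' spans several seconds. The ids are made up.
toy = pd.DataFrame({
    'game_session': ['a', 'b', 'b', 'b'],
    'datetime': pd.to_datetime(['2019-09-06 17:53:46.937',
                                '2019-09-06 17:54:56.302',
                                '2019-09-06 17:54:56.387',
                                '2019-09-06 17:55:03.253']),
    'event_count': [1, 1, 2, 3],
})

# One row per session: first and last timestamp, and the largest event_count
session_summary = toy.groupby('game_session').agg(
    start=('datetime', 'min'),
    end=('datetime', 'max'),
    n_events=('event_count', 'max'),
)

# Duration in milliseconds
session_summary['duration_ms'] = (
    (session_summary['end'] - session_summary['start']) / pd.Timedelta(milliseconds=1)
)

# Keep only sessions that lasted for a non-zero time interval
good_sessions = session_summary[session_summary['duration_ms'] > 0]
print(good_sessions)
```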

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import time

In [3]:
print('The current working directory is: {}'.format(os.getcwd()))

The current working directory is: C:\Users\agarw\Dropbox\Kaggle data-science-bowl 2019\Prarit-data-science-bowl-2019


In [4]:
train=pd.read_csv('train.csv')

In [5]:
train.shape

(11341042, 11)

In [6]:
train.head()

Unnamed: 0,event_id,game_session,timestamp,event_data,installation_id,event_count,event_code,game_time,title,type,world
0,27253bdc,45bb1e1b6b50c07b,2019-09-06T17:53:46.937Z,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE
1,27253bdc,17eeb7f223665f53,2019-09-06T17:54:17.519Z,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Magma Peak - Level 1,Clip,MAGMAPEAK
2,77261ab5,0848ef14a8dc6892,2019-09-06T17:54:56.302Z,"{""version"":""1.0"",""event_count"":1,""game_time"":0...",0001e90f,1,2000,0,Sandcastle Builder (Activity),Activity,MAGMAPEAK
3,b2dba42b,0848ef14a8dc6892,2019-09-06T17:54:56.387Z,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,2,3010,53,Sandcastle Builder (Activity),Activity,MAGMAPEAK
4,1bb5fbdb,0848ef14a8dc6892,2019-09-06T17:55:03.253Z,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,3,3110,6972,Sandcastle Builder (Activity),Activity,MAGMAPEAK


In [7]:
trainlbls=pd.read_csv('train_labels.csv')

In [8]:
trainlbls.shape

(17690, 7)

In [9]:
trainlbls.head()

Unnamed: 0,game_session,installation_id,title,num_correct,num_incorrect,accuracy,accuracy_group
0,6bdf9623adc94d89,0006a69f,Mushroom Sorter (Assessment),1,0,1.0,3
1,77b8ee947eb84b4e,0006a69f,Bird Measurer (Assessment),0,11,0.0,0
2,901acc108f55a5a1,0006a69f,Mushroom Sorter (Assessment),1,0,1.0,3
3,9501794defd84e4d,0006a69f,Mushroom Sorter (Assessment),1,1,0.5,2
4,a9ef3ecb3d1acc6a,0006a69f,Bird Measurer (Assessment),1,0,1.0,3


In [10]:
# for testing purposes, we first work with only a small amount of data:
# the data for just 10 installation ids.

#id_chosen=train.installation_id.unique()[0:10]

In [11]:
# the installation_ids chosen above
#id_chosen

In [12]:
# reducing the train dataframe size by restricting to only 10 installation ids as chosen above
# use the pd.Series.isin() to choose rows based on a given list
# this was suggested in the following stackexchange post:
# https://stackoverflow.com/questions/12096252/use-a-list-of-values-to-select-rows-from-a-pandas-dataframe
#train=train.loc[train.installation_id.isin(id_chosen)]

In [13]:
# shape of the reduced dataframe
#train.shape

In [14]:
# converting the timestamp string to datetime format
# for documentation on pandas datetime functionality, see
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
train['datetime']=pd.to_datetime(train.timestamp)
train.drop(columns=['timestamp'], inplace=True)

In [15]:

train.head()

Unnamed: 0,event_id,game_session,event_data,installation_id,event_count,event_code,game_time,title,type,world,datetime
0,27253bdc,45bb1e1b6b50c07b,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE,2019-09-06 17:53:46.937
1,27253bdc,17eeb7f223665f53,"{""event_code"": 2000, ""event_count"": 1}",0001e90f,1,2000,0,Magma Peak - Level 1,Clip,MAGMAPEAK,2019-09-06 17:54:17.519
2,77261ab5,0848ef14a8dc6892,"{""version"":""1.0"",""event_count"":1,""game_time"":0...",0001e90f,1,2000,0,Sandcastle Builder (Activity),Activity,MAGMAPEAK,2019-09-06 17:54:56.302
3,b2dba42b,0848ef14a8dc6892,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,2,3010,53,Sandcastle Builder (Activity),Activity,MAGMAPEAK,2019-09-06 17:54:56.387
4,1bb5fbdb,0848ef14a8dc6892,"{""description"":""Let's build a sandcastle! Firs...",0001e90f,3,3110,6972,Sandcastle Builder (Activity),Activity,MAGMAPEAK,2019-09-06 17:55:03.253


In [16]:
# function to compute the total time (in ms) spent and number of events in a game_session

def game_session_time_events(game_session=None, dataframe = train):
    if not game_session:
        print('no game_session was passed to the function')
        return 
    sess=dataframe.loc[dataframe.game_session==game_session]
    
    # time spent (in ms)
    session_time= sess.datetime 
    start_time=session_time.min()
    end_time=session_time.max()
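    # Timedelta.delta was removed in pandas 2.0; dividing by a 1 ms Timedelta gives the same value in ms
    time_interval=(end_time-start_time)/pd.Timedelta(milliseconds=1)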
    
    # number of events
    num_events=sess.event_count.max()
    
    
    return time_interval, num_events

In [17]:
# function to count previous game_sessions of a given type and title

def count_prev_sessions(dataframe, current_time, typ, title):
    '''dataframe: the dataframe containing the record of a particular player
       current_time: the time at which the assessment (whose statistics we are looking for) started'''
    
    # locating game_sessions of the specified type and title that occurred before current_time
    earlier_sessions=dataframe.loc[(dataframe.datetime<current_time) 
                                   & (dataframe.type==typ) & (dataframe.title==title)].game_session.unique()
    
    # selecting sessions with non-zero session_time
    session_time_events=[game_session_time_events(session) for session in earlier_sessions]
    good_sessions=[ (session, time_events[0], time_events[1]) for session, time_events in zip(earlier_sessions, session_time_events) if time_events[0]>0]
    
    num_good_sessions=len(good_sessions)
    
    # The following methodology to obtain the n-th tuple element in a list of tuples is
    # based on suggestion given in the following stackexchange post
    # https://stackoverflow.com/questions/12142133/how-to-get-first-element-in-a-list-of-tuples
    cummulative_time=0
    avg_n_events=0
    if num_good_sessions>0: 
        cummulative_time=np.sum(list(list(zip(*good_sessions))[1]))
        avg_n_events=np.mean(list(list(zip(*good_sessions))[2]))
    
    
    avg_accuracy=0
    mean_accuracy_group=0
    
    if typ=='Assessment' and num_good_sessions>0:
        # good_sess_results_dat is the dataset containing statistics for assessments that are considered good
        # basically we are choosing the rows in trainlbls whose game_session is in good_sessions
        # use the pd.Series.isin() to choose rows based on a given list
        # this was suggested in the following stackexchange post:
        # https://stackoverflow.com/questions/12096252/use-a-list-of-values-to-select-rows-from-a-pandas-dataframe
        good_sess_results_dat= trainlbls.loc[
            trainlbls.game_session.isin(list(list(zip(*good_sessions))[0]))]
        avg_accuracy=good_sess_results_dat.accuracy.mean()
        mean_accuracy_group=good_sess_results_dat.accuracy_group.mean()
    
    
    return earlier_sessions, good_sessions, num_good_sessions, cummulative_time, avg_n_events, avg_accuracy, mean_accuracy_group

Some important comments: 

1) I need to decide what to do with 'Assessment' sessions in which an attempt was never made. One option is to remove them from the data completely. For the training set this is not a problem, because trainlbls only contains the results of completed assessments; since we take the assessment results directly from trainlbls, incomplete assessments don't even appear there. However, they still contribute to other statistics, such as the number of assessments, the avg. number of events per assessment, and the cumulative time spent on assessments. Perhaps this is how it should be done.

2) The results of assessments in the test data are not given explicitly. I will have to check whether these can also be obtained from trainlbls or need to be extracted from event_data.

3) In the function count_prev_sessions, when I compute avg. accuracy or accuracy_group_mode, I need to choose only one row per good session. Selecting all the rows of a good session would count the accuracy and accuracy_group of that session n_events times, where n_events is the number of events in the session. This might not be a problem for avg. accuracy, but the accuracy_group_mode can and will be incorrectly swayed by sessions with a high event_count. Again, because I am using trainlbls directly to obtain the assessment results, this is not a problem: trainlbls contains only one row per session.
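
Regarding comment 2: if the results have to be rebuilt from counts of correct and incorrect attempts, a helper along the following lines could be used. This is only a sketch. The grouping rule is an assumption based on the competition's description of accuracy_group and is not confirmed anywhere in this notebook; only groups 3, 2 and 0 can be checked against the rows shown in trainlbls.head(). The function name `accuracy_and_group` is hypothetical, and how the attempt counts would be extracted from event_data is not covered here.

```python
def accuracy_and_group(num_correct, num_incorrect):
    """Return (accuracy, accuracy_group) from attempt counts.

    Assumed rule (not confirmed in this notebook):
      3 -> solved on the first attempt
      2 -> solved on the second attempt
      1 -> solved after three or more attempts
      0 -> never solved
    """
    attempts = num_correct + num_incorrect
    accuracy = num_correct / attempts if attempts > 0 else 0.0
    if num_correct == 0:
        group = 0
    elif num_incorrect == 0:
        group = 3
    elif num_incorrect == 1:
        group = 2
    else:
        group = 1
    return accuracy, group

# (num_correct, num_incorrect) pairs taken from trainlbls.head() above
for counts in [(1, 0), (0, 11), (1, 1)]:
    print(counts, accuracy_and_group(*counts))
```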

In [18]:
# list of various different clips, activities etc in the app

clip_list=[('Clip', title) for title in train.loc[train.type=='Clip'].title.unique()]

activity_list=[('Activity', title) for title in train.loc[train.type=='Activity'].title.unique()]

game_list=[('Game', title) for title in train.loc[train.type=='Game'].title.unique()]

assessment_list=[('Assessment', title) 
                 for title in train.loc[train.type=='Assessment'].title.unique()]

type_title_list=clip_list+activity_list+game_list+assessment_list

print('There are a total of {} different clips/games/activities/assessment in the app'.
      format(len(type_title_list)))

There are a total of 44 different clips/games/activities/assessment in the app


In [19]:
num_features=len(type_title_list)*3 + 2*len(assessment_list) + 4*3 + 2 + 2
print('We expect to build {} features through the above function'.format(num_features))

We expect to build 158 features through the above function


Now apply count_prev_sessions to each session in trainlbls. 

In [20]:
# The code in this cell is a previous implementation, which I replaced with slightly better looking code in the next cell


# for testing purposes only doing this for first 2 rows of trainlbls
# finally will need to run the loop for the whole length of trainlbls
# length of trainlbls can be obtained from trainlbls.shape[0]

# features is the array that will contain the features for each assessment session
#features=[]

#for idx in range(trainlbls.shape[0]):
#    row=trainlbls.iloc[idx]
#    session=row.game_session
#    player=row.installation_id
#    acc_grp=row.accuracy_group
#    dataframe=train.loc[train.installation_id==player]
#    current_time=dataframe.loc[dataframe.game_session==session].datetime.min()
#    session_title=row.title
#    
#    # getting statistics for all the earlier sessions
#    stats=[player, session, session_title,  acc_grp]
#    for typ, title in type_title_list:
#        _, _,num_good_sess, cumm_time, avg_n_ev, avg_acc, avg_acc_grp=count_prev_sessions(dataframe, current_time, typ, title)
#        lis=[typ, title, num_good_sess, cumm_time, avg_n_ev, avg_acc, avg_acc_grp]
#        stats.append(lis)
#    
#    features.append(stats)

In [21]:
# The following function enables us to apply count_prev_sessions to each row of trainlbls

# features is the array that will contain the features for each assessment session
features=[]

def apply_count_prev_sess(row):
    session=row.game_session
    player=row.installation_id
    acc_grp=row.accuracy_group
    dataframe=train.loc[train.installation_id==player]
    current_time=dataframe.loc[dataframe.game_session==session].datetime.min()
    session_title=row.title
    
    # getting statistics for all the earlier sessions
    stats=[player, session, session_title,  acc_grp]
    for typ, title in type_title_list:
        _, _,num_good_sess, cumm_time, avg_n_ev, avg_acc, avg_acc_grp=count_prev_sessions(dataframe, current_time, typ, title)
        lis=[typ, title, num_good_sess, cumm_time, avg_n_ev, avg_acc, avg_acc_grp]
        stats.append(lis)
    
    features.append(stats)

In [22]:
%%time
# for testing purposes only applying 'apply_count_prev_sess' to the first 2 rows of trainlbls
# finally will need to run the loop for the whole length of trainlbls
# length of trainlbls can be obtained from trainlbls.shape[0]

# This is really slow, need to find a way to speed it up. 
trainlbls.iloc[0:2].apply(lambda x: apply_count_prev_sess(x), axis=1)

Wall time: 46.8 s
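
A large part of this cost probably comes from rescanning the dataframe for every session and every title. One possible speed-up, sketched below on toy data, is to summarise each session once with a groupby and then filter that much smaller summary for each assessment. The sketch assumes that a game_session id belongs to a single installation_id, type and title; the toy frame and the names `sessions` and `prev_session_stats` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for train (made-up ids and times): one player, three sessions
toy = pd.DataFrame({
    'installation_id': ['p1'] * 6,
    'game_session': ['s1', 's1', 's2', 's2', 's3', 's3'],
    'type': ['Clip', 'Clip', 'Game', 'Game', 'Assessment', 'Assessment'],
    'title': ['c', 'c', 'g', 'g', 'a', 'a'],
    'event_count': [1, 2, 1, 4, 1, 3],
    'datetime': pd.to_datetime(['2019-09-06 10:00:00', '2019-09-06 10:00:30',
                                '2019-09-06 10:05:00', '2019-09-06 10:06:00',
                                '2019-09-06 10:10:00', '2019-09-06 10:11:00']),
})

# Summarise every session once instead of once per (assessment, title) pair
sessions = toy.groupby(['installation_id', 'game_session', 'type', 'title']).agg(
    start=('datetime', 'min'),
    end=('datetime', 'max'),
    n_events=('event_count', 'max'),
).reset_index()
sessions['time_ms'] = (sessions['end'] - sessions['start']) / pd.Timedelta(milliseconds=1)

# Drop sessions that were abandoned immediately
sessions = sessions[sessions['time_ms'] > 0]

def prev_session_stats(sessions, player, current_time):
    """Per (type, title): count, total time and mean events of earlier sessions."""
    earlier = sessions[(sessions['installation_id'] == player)
                       & (sessions['start'] < current_time)]
    return earlier.groupby(['type', 'title']).agg(
        num_good_sess=('game_session', 'count'),
        cumm_time=('time_ms', 'sum'),
        avg_n_ev=('n_events', 'mean'),
    )

# Statistics of everything that started before the assessment session 's3'
current_time = sessions.loc[sessions['game_session'] == 's3', 'start'].iloc[0]
stats = prev_session_stats(sessions, 'p1', current_time)
print(stats)
```

Titles the player never touched are simply absent from the result, so it would still have to be reindexed against type_title_list (filling with zeros) to reproduce the layout built above.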


In [22]:
# arr is an array containing a specified set of stats for all assessment sessions
# e.g. if we want the stats corresponding to the 'Clip' titled 'Magma Peak - Level 1' for all assessments,
# we notice that the corresponding list is the 6th entry in each row of 'features', so we use: np.array(list(zip(*features))[5])

feat=pd.DataFrame([])

for itr in range(4,48):
    # Note: each entry of list(zip(*features))[itr] is a list of mixed types. Its first two
    # elements (type and title) are strings and the rest are numeric. Because of the strings,
    # np.array converts every element to a string, so the numeric columns have to be
    # type-cast back to float before being passed to the dataframe.
    arr=np.array(list(zip(*features))[itr])
    typ=arr[0,0]
    title=arr[0,1]
    if not typ=='Assessment':
        dat=pd.DataFrame(arr[::, 2:5].astype(float), columns=['{}_{}_num_good_sess'.format(typ, title),
                                                            '{}_{}_cumm_time'.format(typ, title),
                                                            '{}_{}_avg_n_ev'.format(typ, title)])
    else:
        dat=pd.DataFrame(arr[::, 2:].astype(float), columns=['{}_{}_num_good_sess'.format(typ, title),
                                                           '{}_{}_cumm_time'.format(typ, title),
                                                           '{}_{}_avg_n_ev'.format(typ, title),
                                                           '{}_{}_avg_acc'.format(typ, title),
                                                           '{}_{}_avg_acc_grp'.format(typ, title)])
        
    feat=pd.concat([feat, dat], axis = 1)
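
The string conversion mentioned in the comments of the cell above can be seen on a small example. The rows below are shaped like the lists appended to `features`, but their numeric values are made up.

```python
import numpy as np

# Entries shaped like 'lis' above: [typ, title, numbers...]; values are made up
rows = [['Clip', 'Magma Peak - Level 1', 2, 1500.0, 1.0],
        ['Clip', 'Magma Peak - Level 1', 0, 0.0, 0.0]]

arr = np.array(rows)
print(arr.dtype.kind)  # 'U': every element has become a string

# Slice away the two string columns, then cast the rest back to float
numeric = arr[:, 2:].astype(float)
print(numeric)
```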

In [23]:
feat

Unnamed: 0,Clip_Welcome to Lost Lagoon!_num_good_sess,Clip_Welcome to Lost Lagoon!_cumm_time,Clip_Welcome to Lost Lagoon!_avg_n_ev,Clip_Magma Peak - Level 1_num_good_sess,Clip_Magma Peak - Level 1_cumm_time,Clip_Magma Peak - Level 1_avg_n_ev,Clip_Magma Peak - Level 2_num_good_sess,Clip_Magma Peak - Level 2_cumm_time,Clip_Magma Peak - Level 2_avg_n_ev,Clip_Tree Top City - Level 1_num_good_sess,...,Assessment_Cart Balancer (Assessment)_num_good_sess,Assessment_Cart Balancer (Assessment)_cumm_time,Assessment_Cart Balancer (Assessment)_avg_n_ev,Assessment_Cart Balancer (Assessment)_avg_acc,Assessment_Cart Balancer (Assessment)_avg_acc_grp,Assessment_Chest Sorter (Assessment)_num_good_sess,Assessment_Chest Sorter (Assessment)_cumm_time,Assessment_Chest Sorter (Assessment)_avg_n_ev,Assessment_Chest Sorter (Assessment)_avg_acc,Assessment_Chest Sorter (Assessment)_avg_acc_grp
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
# Column layout of 'feat': clips occupy columns 0-59 (20 titles x 3 statistics), activities 60-83 (8 x 3),
# games 84-116 (11 x 3) and assessments 117-141 (5 titles x 5 statistics).
# Within each title the order is num_good_sess, cumm_time, avg_n_ev (followed by avg_acc and avg_acc_grp for assessments).

# total number of clips watched in previous sessions:
# sum of num_good_sess over all clip titles, i.e. every 3rd column from column 0 up to (but excluding) column 60
feat['Clips_num']=feat.iloc[::,0:60:3].sum(axis=1)

# total accumulated time on all the clips: sum of cumm_time over all clip titles,
# i.e. every 3rd column from column 1 up to column 60
feat['Clips_time']=feat.iloc[::,1:60:3].sum(axis=1)

# average events over all clips: mean of avg_n_ev over all clip titles,
# i.e. every 3rd column from column 2 up to column 60
feat['Clips_avg_ev']=feat.iloc[::,2:60:3].mean(axis=1)


# total number of activities done in previous sessions:
# sum of num_good_sess over all activity titles, i.e. every 3rd column from column 60 up to column 84
feat['Acts_num']=feat.iloc[::,60:84:3].sum(axis=1)

# total accumulated time on all the activities: sum of cumm_time over all activity titles,
# i.e. every 3rd column from column 61 up to column 84
feat['Acts_time']=feat.iloc[::,61:84:3].sum(axis=1)

# average events over all the activities: mean of avg_n_ev over all activity titles,
# i.e. every 3rd column from column 62 up to column 84
feat['Acts_avg_ev']=feat.iloc[::,62:84:3].mean(axis=1)


# total number of games played in previous sessions:
# sum of num_good_sess over all game titles, i.e. every 3rd column from column 84 up to column 117
feat['Games_num']=feat.iloc[::,84:117:3].sum(axis=1)

# total accumulated time on all the games: sum of cumm_time over all game titles,
# i.e. every 3rd column from column 85 up to column 117
feat['Games_time']=feat.iloc[::,85:117:3].sum(axis=1)

# average events over all the games: mean of avg_n_ev over all game titles,
# i.e. every 3rd column from column 86 up to column 117
feat['Games_avg_ev']=feat.iloc[::,86:117:3].mean(axis=1)


# total number of assessments done in previous sessions:
# sum of num_good_sess over all assessment titles, i.e. every 5th column from column 117 up to column 142
feat['Assess_num']=feat.iloc[::,117:142:5].sum(axis=1)

# total accumulated time on all the assessments: sum of cumm_time over all assessment titles,
# i.e. every 5th column from column 118 up to column 142
feat['Assess_time']=feat.iloc[::,118:142:5].sum(axis=1)

# average events over all the assessments: mean of avg_n_ev over all assessment titles,
# i.e. every 5th column from column 119 up to column 142
feat['Assess_avg_ev']=feat.iloc[::,119:142:5].mean(axis=1)

# average accuracy over all the assessments: mean of avg_acc over all assessment titles,
# i.e. every 5th column from column 120 up to column 142
feat['Assess_avg_acc']=feat.iloc[::,120:142:5].mean(axis=1)

# average accuracy_grp over all the assessments: mean of avg_acc_grp over all assessment titles,
# i.e. every 5th column from column 121 up to column 142
feat['Assess_avg_acc_grp']=feat.iloc[::,121:142:5].mean(axis=1)
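
The positional slices above depend on the exact column order of `feat`. As a hedged alternative, the same aggregates could be computed by selecting columns by name, relying on the `{typ}_{title}_{statistic}` naming used when `feat` was built. The toy frame and the helper `cols` below are illustrative only.

```python
import pandas as pd

# Toy stand-in for 'feat' with two clip titles and one game title (made-up values)
toy_feat = pd.DataFrame({
    'Clip_A_num_good_sess': [1.0, 0.0], 'Clip_A_cumm_time': [100.0, 0.0],
    'Clip_A_avg_n_ev': [1.0, 0.0],
    'Clip_B_num_good_sess': [2.0, 1.0], 'Clip_B_cumm_time': [300.0, 50.0],
    'Clip_B_avg_n_ev': [1.0, 1.0],
    'Game_G_num_good_sess': [4.0, 0.0], 'Game_G_cumm_time': [900.0, 0.0],
    'Game_G_avg_n_ev': [30.0, 0.0],
})

def cols(frame, typ, suffix):
    """Names of the per-title columns of one type and one statistic."""
    return [c for c in frame.columns
            if c.startswith(typ + '_') and c.endswith('_' + suffix)]

# Same aggregations as above, but independent of the column positions
clips_num = toy_feat[cols(toy_feat, 'Clip', 'num_good_sess')].sum(axis=1)
clips_time = toy_feat[cols(toy_feat, 'Clip', 'cumm_time')].sum(axis=1)
clips_avg_ev = toy_feat[cols(toy_feat, 'Clip', 'avg_n_ev')].mean(axis=1)
print(clips_num.tolist(), clips_time.tolist(), clips_avg_ev.tolist())
```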

In [57]:
feat['assessment_title']=np.array(list(zip(*features))[2])

In [58]:
feat['accuracy_grp']=list(list(zip(*features))[3])

In [59]:
feat

Unnamed: 0,Clip_Welcome to Lost Lagoon!_num_good_sess,Clip_Welcome to Lost Lagoon!_cumm_time,Clip_Welcome to Lost Lagoon!_avg_n_ev,Clip_Magma Peak - Level 1_num_good_sess,Clip_Magma Peak - Level 1_cumm_time,Clip_Magma Peak - Level 1_avg_n_ev,Clip_Magma Peak - Level 2_num_good_sess,Clip_Magma Peak - Level 2_cumm_time,Clip_Magma Peak - Level 2_avg_n_ev,Clip_Tree Top City - Level 1_num_good_sess,...,Games_num,Games_time,Games_avg_ev,Assess_num,Assess_time,Assess_avg_ev,Assess_avg_acc,Assess_avg_acc_grp,assessment_title,accuracy_grp
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,552501.0,38.454545,2.0,132551.0,27.0,0.2,0.6,Mushroom Sorter (Assessment),3
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,552501.0,38.454545,1.0,39803.0,9.6,0.2,0.6,Bird Measurer (Assessment),0


In [61]:
features_file='train_features.csv'
feat.to_csv(features_file)