# Introduction

The fifth annual Data Science Bowl will analyze digital gameplay to help build more effective educational media tools for children. The competition will look at advancements in early childhood education. The results are intended to lead to better-designed games and improved learning outcomes, giving children, parents, caregivers, and educators across the globe insight into how young children learn through media and which approaches best help them build on foundational learning skills.
To better understand these challenges and develop the most effective approaches to high-quality early educational media, Booz Allen Hamilton and Kaggle launched the fifth annual Data Science Bowl, the world's largest data science competition focused on social good.
Participants will be provided with anonymous gameplay data from the PBS KIDS Measure Up! app, which was developed as a part of the CPB-PBS Ready to Learn Initiative with funding from the U.S. Department of Education. They will be tasked with **creating algorithms that utilize information about how players use the app to determine what they know and are learning from the experience**, in order to discover important relationships between their engagement with educational media and learning. The insights gleaned from these solutions will help PBS KIDS and other organizations create new solutions, content and products that help ensure each and every user has the best chance to learn important skills, helping improve childhood learning access and achievement.

**Learning Path**

Exposure → Exploration → Practice → Demonstration (of knowledge)

In the PBS KIDS Measure Up! app, **children ages 3 to 5** learn early **STEM concepts focused on length, width, capacity, and weight** while going on an adventure through Treetop City, Magma Peak, and Crystal Caves. Joined by their favorite PBS KIDS characters from Dinosaur Train, Peg + Cat, and Sid the Science Kid, children can also collect rewards and unlock digital toys as they play. At the same time, caregivers can monitor and expand upon what their child is learning using a free companion app: PBS KIDS Super Vision.
Parents can track the skills in which their kids excel, and the skills where they may need more practice. The app also provides tips and related activity ideas to extend learning into daily activities and family time.
In the PBS KIDS Measure Up! app, children navigate a map and complete various levels (media types):
  1. Clip (Exposure)
     - Interstitials/Introductory
     - Longer (2-3 minutes)/Familiar with the problem
  2. Activities (Practice): no set objective; based on cause and effect
  3. Games (Practice): the goal is to solve problems; replay is available
  4. Assessments (measure the player's knowledge/skills): scored by the number of incorrect attempts and the accuracy group
       * Bird Measurer
       * Cart Balancer
       * Cauldron Filler
       * Chest Sorter
       * Mushroom Sorter

`world`: the section of the application the game or video belongs to. It helps identify the educational curriculum goals of the media. Possible values are 'NONE' (at the app's start screen), 'TREETOPCITY' (Length/Height), 'MAGMAPEAK' (Capacity/Displacement), and 'CRYSTALCAVES' (Weight).
The intent of the competition is to use the gameplay data to forecast how many attempts a child will take to pass a given assessment (an incorrect answer is counted as an attempt).
Each application install is represented by an installation_id. This will typically correspond to one child, but you should expect noise from issues such as shared devices.
Note that the training set contains many installation_ids which never took assessments, whereas every installation_id in the test set made an attempt on at least one assessment.
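
Installs that never took an assessment carry no label. The notebook's feature function later keeps only assessment sessions, so these installs drop out on their own; filtering them up front is an optional shortcut. The sketch below shows that idea on a toy frame. The column names match the competition files, but the rows are made up and the filtering step itself is an assumption.

```python
import pandas as pd

# Toy stand-in for train.csv: three installs, only two of which took an assessment.
toy = pd.DataFrame({
    "installation_id": ["a", "a", "b", "c", "c"],
    "type": ["Clip", "Assessment", "Game", "Activity", "Assessment"],
})

# installation_ids that have at least one Assessment row
assessed_ids = toy.loc[toy["type"] == "Assessment", "installation_id"].unique()

# Keep every event of those installs (not just the assessment rows),
# because earlier clips and games are the history the features are built from.
toy_kept = toy[toy["installation_id"].isin(assessed_ids)]

print(sorted(toy_kept["installation_id"].unique()))
```

Install `b` never reaches an assessment, so all of its rows are dropped, while the non-assessment rows of `a` and `c` are retained.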
The outcomes in this competition are grouped into 4 groups (labeled accuracy_group in the data):

3: the assessment was solved on the first attempt

2: the assessment was solved on the second attempt

1: the assessment was solved after 3 or more attempts

0: the assessment was never solved
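
The four groups above can be written as a small function of the attempt counts in one assessment session. This is a minimal sketch; the function name `accuracy_group` and its arguments are illustrative and do not come from the competition files.

```python
def accuracy_group(num_correct: int, num_incorrect: int) -> int:
    """Map the attempt counts of one assessment session to the 0-3 label."""
    if num_correct == 0:
        return 0  # never solved
    if num_incorrect == 0:
        return 3  # solved on the first attempt
    if num_incorrect == 1:
        return 2  # solved on the second attempt
    return 1      # solved after three or more attempts


# (correct, incorrect) pairs for four hypothetical sessions
for correct, incorrect in [(1, 0), (1, 1), (1, 4), (0, 3)]:
    print(correct, incorrect, accuracy_group(correct, incorrect))
```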

# Load the required libraries

In [177]:
import numpy as np
from tqdm import tqdm
import json
import pandas as pd
import os
import gc
from sklearn.model_selection import KFold
#import lightgbm as lgb
from training import *
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Download the data using Kaggle API

In [6]:
# Download the dataset
#!kaggle competitions list
#!kaggle competitions download -c data-science-bowl-2019

# Read the input Data

In [39]:
# Read each CSV file and report its shape
print('Reading train.csv file....')
train = pd.read_csv('data/train.csv')
print('Training.csv file have {} rows and {} columns'.format(train.shape[0], train.shape[1]))

print('Reading test.csv file....')
test = pd.read_csv('data/test.csv')
print('Test.csv file have {} rows and {} columns'.format(test.shape[0], test.shape[1]))

print('Reading train_labels.csv file....')
train_labels = pd.read_csv('data/train_labels.csv')
print('Train_labels.csv file have {} rows and {} columns'.format(train_labels.shape[0], train_labels.shape[1]))

print('Reading specs.csv file....')
specs = pd.read_csv('data/specs.csv')
print('Specs.csv file have {} rows and {} columns'.format(specs.shape[0], specs.shape[1]))

print('Reading sample_submission.csv file....')
sample_submission = pd.read_csv('data/sample_submission.csv')
print('Sample_submission.csv file have {} rows and {} columns'.format(sample_submission.shape[0], sample_submission.shape[1]))

Reading train.csv file....
Training.csv file have 11341042 rows and 11 columns
Reading test.csv file....
Test.csv file have 1156414 rows and 11 columns
Reading train_labels.csv file....
Train_labels.csv file have 17690 rows and 7 columns
Reading specs.csv file....
Specs.csv file have 386 rows and 3 columns
Reading sample_submission.csv file....
Sample_submission.csv file have 1000 rows and 2 columns


# Data Processing 

In [12]:
# Memory usage
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11341042 entries, 0 to 11341041
Data columns (total 11 columns):
event_id           object
game_session       object
timestamp          object
event_data         object
installation_id    object
event_count        int64
event_code         int64
game_time          int64
title              object
type               object
world              object
dtypes: int64(3), object(8)
memory usage: 8.1 GB


In [33]:
test.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1156414 entries, 0 to 1156413
Data columns (total 11 columns):
event_id           1156414 non-null object
game_session       1156414 non-null object
timestamp          1156414 non-null object
event_data         1156414 non-null object
installation_id    1156414 non-null object
event_count        1156414 non-null int64
event_code         1156414 non-null int64
game_time          1156414 non-null int64
title              1156414 non-null object
type               1156414 non-null object
world              1156414 non-null object
dtypes: int64(3), object(8)
memory usage: 852.0 MB


In [65]:
# Build integer codes for the string columns. Train and test are combined
# so that both frames share the same mapping.
# pd.concat is used because Series.append was removed in pandas 2.0.
all_game_session = pd.concat([train['game_session'], test['game_session']]).unique()
session_dict = dict(zip(all_game_session, np.arange(len(all_game_session))))

all_installs = pd.concat([train['installation_id'], test['installation_id']]).unique()
installation_dict = dict(zip(all_installs, range(len(all_installs))))

all_titles = pd.concat([train['title'], test['title']]).unique()
title_dict = dict(zip(all_titles, range(len(all_titles))))

all_types = pd.concat([train['type'], test['type']]).unique()
type_dict = dict(zip(all_types, range(len(all_types))))

all_world = pd.concat([train['world'], test['world']]).unique()
world_dict = dict(zip(all_world, range(len(all_world))))

all_events = pd.concat([train['event_id'], test['event_id']]).unique()
event_dict = dict(zip(all_events, range(len(all_events))))

In [66]:
for df in [train, test]:
    df['game_session'] = df['game_session'].map(session_dict)
    df['installation_id'] = df['installation_id'].map(installation_dict)
    df['world'] = df['world'].map(world_dict)
    df['type'] = df['type'].map(type_dict)
    df['title'] = df['title'].map(title_dict)
    df['event_id'] = df['event_id'].map(event_dict)

In [68]:
train.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11341042 entries, 0 to 11341041
Data columns (total 11 columns):
event_id           int64
game_session       int64
timestamp          object
event_data         object
installation_id    int64
event_count        int64
event_code         int64
game_time          int64
title              int64
type               int64
world              int64
dtypes: int64(9), object(2)
memory usage: 4.3 GB
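
The drop from 8.1 GB to 4.3 GB comes from replacing long string IDs with integer codes. The sketch below reproduces the effect on a toy column with `pd.factorize`, which returns the codes and the unique values in one call. It is an alternative to the hand-built dictionaries, not what the notebook itself uses, and the toy frame simply repeats two session IDs.

```python
import pandas as pd

# Two session IDs repeated many times, standing in for the real column.
toy = pd.DataFrame(
    {"game_session": ["901acc108f55a5a1", "77b8ee947eb84b4e"] * 5000}
)

before = toy["game_session"].memory_usage(deep=True)

# factorize returns integer codes plus the unique values;
# the unique values play the role of session_dict in the cells above.
codes, uniques = pd.factorize(toy["game_session"])
toy["game_session"] = codes

after = toy["game_session"].memory_usage(deep=True)
print(f"{before} bytes as strings, {after} bytes as integer codes")
```

The codes are still 64-bit integers here, as in the notebook; a narrower integer type would shrink the column further when the number of distinct values allows it.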


In [72]:
# Event code that marks an attempt: 4100 for every title,
# except Bird Measurer (Assessment), which uses 4110.
activities_map = {title_code: 4100 for title_code in title_dict.values()}
activities_map[title_dict['Bird Measurer (Assessment)']] = 4110

In [75]:
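def extracting_duration(durations):
    """Return (mean gap, total time, std of gaps) for one session.

    `durations` holds the cumulative game_time of each selected event,
    so the last value is the total and the differences between
    consecutive values are the gaps between events.
    """
    dur_std = 0
    dur_sum = 0
    dur_mean = 0
    if len(durations) != 0:
        dur_sum = durations.iloc[-1]
        duration_norm = durations.diff().dropna()
        # Mean and std are only computed when there are at least two gaps
        if len(duration_norm) >= 2:
            dur_std = duration_norm.std()
            dur_mean = duration_norm.mean()
    return dur_mean, dur_sum, dur_std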


def feature_engineering(user_sample, test_data=False):
    output = []
    Cum_Assess, Cum_Activity, Cum_Clip, Cum_Game = 0, 0, 0, 0
    cum_corr, cum_incorr, cum_acc = 0, 0, 0
    cum_dur_assess, cum_dur_clip, cum_dur_game, cum_dur_activity = 0, 0, 0, 0
    counter = 0
    cum_acc_group = []
    # Iterate through each session of one installation_id
    for session_name, session in user_sample.groupby('game_session', sort=False):

        # Start a dict to hold the features of this session
        features = {'Clip': 0, 'Activity': 0,
                    'Assess': 0, 'Game': 0,
                    'Cum_Clip': Cum_Clip, 'Cum_Activity': Cum_Activity,
                    'Cum_Assess': Cum_Assess, 'Cum_Game': Cum_Game,
                    'cum_dur_clip': cum_dur_clip, 'cum_dur_asses': cum_dur_assess,
                    'cum_dur_activity': cum_dur_activity, 'cum_dur_game': cum_dur_game}

        features['installation_id'] = session['installation_id'].unique()[0]
        features['game_session'] = session['game_session'].unique()[0]
        # event_counter includes all event codes and all types
        features['event_counter'] = session.iloc[-1]['event_count']

        # session type
        features['type'] = session['type'].unique()[0]
        # session title
        features['title'] = session['title'].unique()[0]

        # World
        features['world'] = session['world'].unique()[0]

        # Keep only the attempt events (event code 4100, or 4110 for Bird Measurer)
        all_attempts = session.query(
            f'event_code == {activities_map[features["title"]]}')
#        all_attempts = session

        if (features['type'] == type_dict['Assessment']):
            # if we consider all event codes,
            # actions should be the same as event counter
            features['Assess'] += len(all_attempts['event_data'])
            Cum_Assess += features['Assess']

            # Durations
            features['assess_dur_mean'], features['assess_dur_sum'], \
                features['assess_dur_std'] = extracting_duration(
                    all_attempts['game_time'])
            cum_dur_assess += features['assess_dur_sum']

            # Count the correct attempts
            features['cum_corr'] = cum_corr
            features['correct'] = all_attempts['event_data'].str.contains(
                'true').sum()
            cum_corr += features['correct']

            # Count the incorrect attempts
            features['cum_incorrect'] = cum_incorr
            features['incorrect'] = all_attempts['event_data'].str.contains(
                'false').sum()
            cum_incorr += features['incorrect']

            # To compute accuracy
            features['cum_acc'] = cum_acc / counter if counter > 0 else 0
            features['mean_acc_group'] = sum(cum_acc_group) / counter if counter > 0 else 0
            counter += 1
            features['acc'] = features['correct'] / (features['Assess'])\
                if features['Assess'] != 0 else 0
            cum_acc += features['acc']

            # To find the accuracy group
            if features['acc'] == 0:
                features['acc_group'] = 0
            elif features['acc'] == 1:
                features['acc_group'] = 3
            elif features['acc'] == 0.5:
                features['acc_group'] = 2
            else:
                features['acc_group'] = 1
            cum_acc_group.append(features['acc_group'])

        elif features['type'] == type_dict['Clip']:
            # check the total number of clips
            features['Clip'] += len(all_attempts['event_data'])
            Cum_Clip += features['Clip']

            # Durations
            features['clip_dur_mean'], features['clip_dur_sum'], \
                features['clip_dur_std'] = extracting_duration(
                    all_attempts['game_time'])
            cum_dur_clip += features['clip_dur_sum']

        elif features['type'] == type_dict['Activity']:
            # check the total number of activities
            features['Activity'] += len(all_attempts['event_data'])
            Cum_Activity += features['Activity']

            # Durations
            features['activity_dur_mean'], features['activity_dur_sum'], \
                features['activity_dur_std'] = extracting_duration(
                    all_attempts['game_time'])
            cum_dur_activity += features['activity_dur_sum']

        elif features['type'] == type_dict['Game']:
            # check the total number of Games
            features['Game'] += len(all_attempts['event_data'])
            Cum_Game += features['Game']

            # Durations
            features['game_dur_mean'], features['game_dur_sum'], \
                features['game_dur_std'] = extracting_duration(
                    all_attempts['game_time'])
            cum_dur_game += features['game_dur_sum']

        if features.get('Assess', 0) > 0 or test_data:
            output.append(features)
    if test_data:
        return output[-1]
    return output
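
`str.contains('true')` counts any event whose JSON text contains the substring `true`, whichever field it appears in, and the same holds for `'false'`. A stricter alternative is to parse `event_data` and read a single field. The sketch below assumes each attempt event carries a boolean `correct` key; the helper name `count_attempts`, the `misc_flag` key, and the sample events are hypothetical.

```python
import json


def count_attempts(event_data_rows):
    """Return (num_correct, num_incorrect) from JSON-encoded attempt events."""
    num_correct = 0
    num_incorrect = 0
    for raw in event_data_rows:
        event = json.loads(raw)
        # Assumption: attempt events expose a boolean "correct" field.
        if "correct" not in event:
            continue
        if event["correct"]:
            num_correct += 1
        else:
            num_incorrect += 1
    return num_correct, num_incorrect


sample_events = [
    '{"correct": false, "event_code": 4100}',
    '{"correct": true, "event_code": 4100}',
    # "true" appears in this event, but not in the "correct" field
    '{"correct": false, "misc_flag": true, "event_code": 4100}',
]

correct, incorrect = count_attempts(sample_events)
total = correct + incorrect
accuracy = correct / total if total else 0
print(correct, incorrect, accuracy)
```

With substring matching, the third event would be counted as both correct and incorrect; parsing the field counts it once, as incorrect.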

In [82]:
# groups_train = train.groupby('installation_id', sort = False)
# #g_train = groups_train.get_group('0006a69f')
# g_train = groups_train.get_group(installation_dict['0006a69f'])
# ss = pd.DataFrame(feature_engineering(g_train, False))

# groups = test.groupby('installation_id', sort = False)
# g_test = groups.get_group(installation_dict['00abaee7'])
# ss = pd.DataFrame(feature_engineering(g_test, True), index=[0])

## Process train set

In [262]:
# Apply feature_engineering to each installation_id in the train dataset
groups = train.groupby('installation_id', sort=False)
temp_out = []
for ins_id, user_sample in tqdm(groups):
    temp_out += feature_engineering(user_sample)
df_train = pd.DataFrame(temp_out)
print(df_train.shape)
# Check that the rows are in the same order as train_labels
df_train['installation_id'].equals(train_labels['installation_id'])

100%|██████████| 17000/17000 [19:25<00:00, 17.29it/s]  


(17690, 29)


True

In [263]:
df_train.head()

| | Activity | Assess | Clip | Cum_Activity | Cum_Assess | Cum_Clip | Cum_Game | Game | acc | acc_group | ... | cum_dur_game | cum_incorrect | event_counter | game_session | incorrect | installation_id | mean_acc_group | title | type | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 3 | ... | 0 | 0 | 48 | 901acc108f55a5a1 | 0 | 0006a69f | 0.0 | Mushroom Sorter (Assessment) | Assessment | TREETOPCITY |
| 1 | 0 | 11 | 0 | 0 | 1 | 0 | 4 | 0 | 0.0 | 0 | ... | 185103 | 0 | 87 | 77b8ee947eb84b4e | 11 | 0006a69f | 3.0 | Bird Measurer (Assessment) | Assessment | TREETOPCITY |
| 2 | 0 | 1 | 0 | 0 | 12 | 0 | 4 | 0 | 1.0 | 3 | ... | 185103 | 11 | 35 | 6bdf9623adc94d89 | 0 | 0006a69f | 1.5 | Mushroom Sorter (Assessment) | Assessment | TREETOPCITY |
| 3 | 0 | 2 | 0 | 0 | 13 | 0 | 4 | 0 | 0.5 | 2 | ... | 185103 | 11 | 42 | 9501794defd84e4d | 1 | 0006a69f | 1.5 | Mushroom Sorter (Assessment) | Assessment | TREETOPCITY |
| 4 | 0 | 1 | 0 | 0 | 15 | 0 | 8 | 0 | 1.0 | 3 | ... | 320634 | 12 | 32 | a9ef3ecb3d1acc6a | 0 | 0006a69f | 1.6 | Bird Measurer (Assessment) | Assessment | TREETOPCITY |


## Process test set

In [83]:
temp_data = []
for ins_id, user_sample in tqdm(test.groupby('installation_id', sort=False)):
    a = feature_engineering(user_sample, test_data = True)
    temp_data.append(a)
    
df_test = pd.DataFrame(temp_data)
del temp_data
print(df_test.shape)
df_test['installation_id'].equals(sample_submission['installation_id'].map(installation_dict))

100%|██████████| 1000/1000 [01:55<00:00,  8.67it/s]


(1000, 29)


False

In [85]:
# df_test.to_csv('data_compiled/df_test.csv', index = False)
# df_train.to_csv('data_compiled/df_train.csv', index = False)
del train, test

# Read the clean data

In [204]:
#Shape of data 
print('Reading df_train.csv file....')
df_train = pd.read_csv('data_compiled/df_train.csv')
print('df_train.csv file have {} rows and {} columns'\
      .format(df_train.shape[0], df_train.shape[1]))

#Shape of data 
print('Reading df_test.csv file....')
df_test = pd.read_csv('data_compiled/df_test.csv')
print('df_test.csv file have {} rows and {} columns'\
      .format(df_test.shape[0], df_test.shape[1]))

Reading df_train.csv file....
df_train.csv file have 17690 rows and 29 columns
Reading df_test.csv file....
df_test.csv file have 1000 rows and 29 columns


In [155]:
df_train.columns

Index(['Activity', 'Assess', 'Clip', 'Cum_Activity', 'Cum_Assess', 'Cum_Clip',
       'Cum_Game', 'Game', 'acc', 'acc_group', 'assess_dur_mean',
       'assess_dur_std', 'assess_dur_sum', 'correct', 'cum_acc', 'cum_corr',
       'cum_dur_activity', 'cum_dur_asses', 'cum_dur_clip', 'cum_dur_game',
       'cum_incorrect', 'event_counter', 'game_session', 'incorrect',
       'installation_id', 'mean_acc_group', 'title', 'type', 'world'],
      dtype='object')

# Training Step

In [170]:
x_cols = [col for col in df_train.columns if col not in 
          ['correct', 'incorrect', 'acc_group', 
           'installation_id', 'game_session' ,'type']]
y_col = ['acc_group']
x_encoder = ['title', 'world']

## Label-encode the categorical variables

In [205]:
print(df_train.shape, df_test.shape)
df_train, df_test = Convert_LabelEncoder(df_train, df_test, x_encoder)
print(df_train.shape, df_test.shape)

(17690, 29) (1000, 29)
(17690, 29) (1000, 29)
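
`Convert_LabelEncoder` comes from the local `training` module, whose source is not shown here. The unchanged column count (29 before and after) suggests that each categorical column is replaced in place by integer codes rather than expanded into indicator columns. The function below is a hypothetical stand-in that illustrates this idea: it fits one mapping on the union of train and test values, so both splits share the same codes.

```python
def fit_label_codes(train_values, test_values):
    """Build a value -> integer code mapping from both splits together."""
    categories = sorted(set(train_values) | set(test_values))
    return {value: code for code, value in enumerate(categories)}


# Toy columns; the world names are the ones listed in the introduction.
train_world = ["TREETOPCITY", "MAGMAPEAK", "TREETOPCITY"]
test_world = ["CRYSTALCAVES", "MAGMAPEAK"]

world_codes = fit_label_codes(train_world, test_world)
encoded_train = [world_codes[w] for w in train_world]
encoded_test = [world_codes[w] for w in test_world]

print(world_codes)
print(encoded_train, encoded_test)
```

Fitting on the union avoids the case where a value seen only in the test split has no code at prediction time.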


In [200]:
df_train.shape

(17690, 29)

## Train random forest model

In [234]:
RF_mdl = random_forest_param_selection(df_train[x_cols], 
                                       df_train[y_col].values.ravel(),
                                       nfolds = 5, n_jobs = None)

The training roc_auc_score is: 1.0
The best parameters are: {'n_estimators': 1620, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': False}


In [235]:
RF_mdl.score(df_train[x_cols], df_train[y_col])

1.0

In [237]:
y_pred = RF_mdl.predict(df_test[x_cols])
y_pred.shape

(1000,)

In [244]:
np.unique(y_pred, return_counts = True)

(array([0]), array([1000]))
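
Every test row is predicted as group 0 here, even though the score on the training rows is 1.0. A score computed on the fitting data cannot reveal that kind of failure, so a held-out check with an agreement measure suited to ordinal labels may help. The sketch below is a plain-Python quadratic weighted kappa, offered as one possible measure and not as the notebook's own evaluation; the function name and the toy labels are illustrative.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=4):
    """Agreement between two integer label lists, penalizing distant errors more."""
    n = len(y_true)
    # Confusion matrix: observed[t][p] counts rows with true label t, prediction p
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both label lists are the same single class: trivially perfect agreement
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


labels = [0, 1, 2, 3]
print(quadratic_weighted_kappa(labels, labels))        # identical labels
print(quadratic_weighted_kappa(labels, [0, 0, 0, 0]))  # constant prediction
```

A constant prediction agrees with the labels no better than chance, which is what the second call reports.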

In [231]:
df_train['acc_group'].value_counts(normalize = True)

3    0.500000
0    0.239062
1    0.136292
2    0.124647
Name: acc_group, dtype: float64

In [248]:
df_train[x_cols].head()

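| | Activity | Assess | Clip | Cum_Activity | Cum_Assess | Cum_Clip | Cum_Game | Game | acc | assess_dur_mean | ... | cum_corr | cum_dur_activity | cum_dur_asses | cum_dur_clip | cum_dur_game | cum_incorrect | event_counter | mean_acc_group | title | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 48 | 0.0 | 4 | 2 |
| 1 | 0 | 11 | 0 | 0 | 1 | 0 | 4 | 0 | 0.0 | 5426.1 | ... | 1 | 0 | 31011 | 0 | 185103 | 0 | 87 | 3.0 | 0 | 2 |
| 2 | 0 | 1 | 0 | 0 | 12 | 0 | 4 | 0 | 1.0 | 0.0 | ... | 1 | 0 | 121043 | 0 | 185103 | 11 | 35 | 1.5 | 4 | 2 |
| 3 | 0 | 2 | 0 | 0 | 13 | 0 | 4 | 0 | 0.5 | 0.0 | ... | 2 | 0 | 139069 | 0 | 185103 | 11 | 42 | 1.5 | 4 | 2 |
| 4 | 0 | 1 | 0 | 0 | 15 | 0 | 8 | 0 | 1.0 | 0.0 | ... | 3 | 0 | 162112 | 0 | 320634 | 12 | 32 | 1.6 | 0 | 2 |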
