<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#General-information" data-toc-modified-id="General-information-1"><span class="toc-item-num">1&nbsp;&nbsp;</span><strong>General information</strong></a></span></li><li><span><a href="#Notebook-setup" data-toc-modified-id="Notebook-setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><strong>Notebook setup</strong></a></span></li><li><span><a href="#Feature-engineering-functions" data-toc-modified-id="Feature-engineering-functions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span><strong>Feature engineering functions</strong></a></span><ul class="toc-item"><li><span><a href="#Reading-data" data-toc-modified-id="Reading-data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span><strong>Reading data</strong></a></span></li><li><span><a href="#Filter-instalation_id-with-no-assessments" data-toc-modified-id="Filter-instalation_id-with-no-assessments-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span><strong>Filter <code>instalation_id</code> with no assessments</strong></a></span></li><li><span><a href="#Encoding-text-data" data-toc-modified-id="Encoding-text-data-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span><strong>Encoding text data</strong></a></span></li><li><span><a href="#Determining-start-and-final-events-for-each-training-instance" data-toc-modified-id="Determining-start-and-final-events-for-each-training-instance-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span><strong>Determining start and final events for each training instance</strong></a></span></li><li><span><a href="#Generate-missing-assessment-labels" data-toc-modified-id="Generate-missing-assessment-labels-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span><strong>Generate missing assessment labels</strong></a></span></li><li><span><a href="#Get-general-information-about-the-performance-of-children-in-each-assessment" 
data-toc-modified-id="Get-general-information-about-the-performance-of-children-in-each-assessment-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span><strong>Get general information about the performance of children in each assessment</strong></a></span></li><li><span><a href="#Obtaining-features-for-each-answered-assessment-(from-both-train-and-test-sets)" data-toc-modified-id="Obtaining-features-for-each-answered-assessment-(from-both-train-and-test-sets)-3.7"><span class="toc-item-num">3.7&nbsp;&nbsp;</span><strong>Obtaining features for each answered assessment (from both train and test sets)</strong></a></span></li><li><span><a href="#Generating-the-final-train-and-test-sets" data-toc-modified-id="Generating-the-final-train-and-test-sets-3.8"><span class="toc-item-num">3.8&nbsp;&nbsp;</span><strong>Generating the final train and test sets</strong></a></span></li><li><span><a href="#Saving-final-train-and-test-sets" data-toc-modified-id="Saving-final-train-and-test-sets-3.9"><span class="toc-item-num">3.9&nbsp;&nbsp;</span><strong>Saving final train and test sets</strong></a></span></li></ul></li><li><span><a href="#Final-dataset-generation" data-toc-modified-id="Final-dataset-generation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span><strong>Final dataset generation</strong></a></span><ul class="toc-item"><li><span><a href="#train_labels_full-generation" data-toc-modified-id="train_labels_full-generation-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>train_labels_full generation</a></span></li><li><span><a href="#X_train,-Y_train,-X_test-generation" data-toc-modified-id="X_train,-Y_train,-X_test-generation-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>X_train, Y_train, X_test generation</a></span></li></ul></li></ul></div>

# **General information**

This notebook defines and runs the feature engineering functions that prepare the training and test data for input to the machine learning model.

# **Notebook setup**

**Library import**

In [None]:
import pandas as pd
import numpy as np
from numba import jit #high performance python compiler
from tqdm.notebook import tqdm #fast, extensible progress bar
from xgboost import XGBClassifier
from joblib import Parallel, delayed #provides lightweight pipelining in Python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
#Python Standard Library
import os #miscellaneous operating system interfaces
import copy #generic shallow and deep copy operations
import gc #interface to the optional garbage collector
import time #time access and conversions
import datetime #basic date and time types
import json #JSON encoder and decoder
import re #provides regular expression matching operations
from typing import Any, List #support for type hints
import warnings #warning control
warnings.filterwarnings("ignore")
from itertools import product #cartesian product, equivalent to a nested for-loop
from collections import Counter, defaultdict #Counter: dict subclass for counting hashable objects
                                             #defaultdict: dict subclass that calls a factory function to supply missing values
pd.options.display.precision = 15
pd.set_option('display.max_rows', 500) #the full option name avoids an ambiguous match in recent pandas versions
    
#Others
from IPython.display import HTML #Create a display object given raw data
import networkx as nx #creation, manipulation, and study of complex networks

#Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import altair as alt #declarative statistical visualization library based on Vega

sns.set_style('whitegrid')
my_pal = sns.color_palette(n_colors=10)

# **Feature engineering functions**

## **Reading data**

In [None]:
# Read data from CSV files
def read_data(read_train_labels_full = True):
    print('Reading train.csv file....')
    train = pd.read_csv('train.csv')
    print('Training.csv file has {} rows and {} columns'.format(train.shape[0], train.shape[1]))

    print('Reading test.csv file....')
    test = pd.read_csv('test.csv')
    print('Test.csv file has {} rows and {} columns'.format(test.shape[0], test.shape[1]))

    if read_train_labels_full:
        print('Reading train_labels_full.csv file....')
        train_labels = pd.read_csv('clean_data/train_labels_full.csv')
        print('Train_labels_full.csv file has {} rows and {} columns'.format(train_labels.shape[0], train_labels.shape[1]))
    else:
        print('Reading train_labels.csv file....')
        train_labels = pd.read_csv('train_labels.csv')
        print('Train_labels.csv file has {} rows and {} columns'.format(train_labels.shape[0], train_labels.shape[1]))
        
    print('Reading specs.csv file....')
    specs = pd.read_csv('specs.csv')
    print('Specs.csv file has {} rows and {} columns'.format(specs.shape[0], specs.shape[1]))

    print('Reading sample_submission.csv file....')
    ss = pd.read_csv('sample_submission.csv')
    print('Sample_submission.csv file has {} rows and {} columns\n'.format(ss.shape[0], ss.shape[1]))
    
    return train, test, train_labels, specs, ss

## **Filter ``instalation_id`` with no assessments**

In [None]:
def get_installation_ids_with_assessment(train):
    """All 'installation_id' from the test set have at least one assessment,
    but in the train set some don't have any.
    """
    print('Filtering \'installation_id\' without assessments....')
    installation_id_with_assesment = train.groupby('installation_id')[['type']].agg(lambda x: np.sum(x == 'Assessment') > 0).type.to_dict()
    train = train[train['installation_id'].map(installation_id_with_assesment)].reset_index(drop=True)
    print('train has {} rows and {} columns\n'.format(train.shape[0], train.shape[1]))
    return train
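
As a rough illustration of the filter above, the following sketch applies the same idea to a toy event log. The data values are made up, and `isin` is used here instead of the `groupby`/`map` approach of the function, so it only shows the expected outcome rather than the exact implementation.

```python
import pandas as pd

# Toy event log (hypothetical values): installation 'b' never starts an assessment
events = pd.DataFrame({
    'installation_id': ['a', 'a', 'b', 'b', 'c'],
    'type': ['Game', 'Assessment', 'Clip', 'Game', 'Assessment'],
})

# Collect the installation_ids that have at least one 'Assessment' event
ids_with_assessment = events.loc[events['type'] == 'Assessment', 'installation_id'].unique()

# Keep only the rows that belong to those installation_ids
filtered = events[events['installation_id'].isin(ids_with_assessment)].reset_index(drop=True)
print(filtered)
```

All rows of installation `b` are dropped, including its `Game` and `Clip` events, because the filter works per installation and not per row.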

## **Encoding text data**

In [None]:
def encode_text_data(train, test):
    print('Encoding text data....\n')
    #Make a list with all the unique 'titles' from the train and test set
    list_of_title = sorted(list(set(train['title'].unique()).union(set(test['title'].unique()))))
    #Make a list with all the unique 'event_code' from the train and test set
    list_of_event_code = sorted(list(set(train['event_code'].unique()).union(set(test['event_code'].unique()))))
    #Make a list with all the unique 'event_id' from the train and test set
    list_of_event_id = sorted(list(set(train['event_id'].unique()).union(set(test['event_id'].unique()))))
    #Make a list with all the unique 'world' from the train and test set
    list_of_world = sorted(list(set(train['world'].unique()).union(set(test['world'].unique()))))
    short_list_of_world = list_of_world.copy()
    short_list_of_world.pop(2) #drop the entry at sorted index 2 (expected to be the 'NONE' placeholder world)
    #Make a list with all the 'title' that are assessments
    list_of_assessment_title = sorted(list(set(train[train['type'] == 'Assessment']['title'].value_counts().index).union(set(test[test['type'] == 'Assessment']['title'].value_counts().index))))
    #Make a list with all the 'game_session' that are assessments
    list_of_assessment_session = sorted(list(set(train[train['type'] == 'Assessment']['game_session'].unique()).union(set(test[test['type'] == 'Assessment']['game_session'].unique()))))
    
    #Create a dictionary encoding the 'title' 
    title_map = dict(zip(list_of_title, np.arange(len(list_of_title))))    #(keys: str | values: int)
    title_map_labels = dict(zip(np.arange(len(list_of_title)), list_of_title)) #reversed dictionary (keys: int | values: str)
    #Create a dictionary encoding the 'world' 
    world_map = dict(zip(list_of_world, np.arange(len(list_of_world))))
    
    #Replace the text data with the numerical data from the dictionaries
    train['title'] = train['title'].map(title_map)
    test['title'] = test['title'].map(title_map)
    train['world'] = train['world'].map(world_map)
    test['world'] = test['world'].map(world_map)
    
    #Create a dictionary with the 'event_code' that has the information of the correct answers
    correct_event_code_map = dict(zip(title_map.values(), (4100*np.ones(len(title_map))).astype('int')))
    correct_event_code_map[title_map['Bird Measurer (Assessment)']] = 4110 #Change the one for 'Bird Measurer (Assessment)' to 4110
    
    #Convert 'timestamp' into datetime class
    train['timestamp'] = pd.to_datetime(train['timestamp'])
    test['timestamp'] = pd.to_datetime(test['timestamp'])
    
    return train, test, correct_event_code_map, list_of_title, list_of_world, short_list_of_world,list_of_event_code, title_map, title_map_labels, world_map, list_of_assessment_title, list_of_assessment_session, list_of_event_id
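
The encoding step can be pictured with a small sketch: the labels of both sets are merged, sorted, and numbered, so that train and test share one mapping. The titles and the names `title_to_code` and `code_to_title` below are illustrative only.

```python
import pandas as pd

# Toy title columns (hypothetical values)
train_titles = pd.Series(['Chow Time', 'Bird Measurer (Assessment)', 'Chow Time'])
test_titles = pd.Series(['Air Show', 'Chow Time'])

# Sorted union of the labels seen in either set, so both sets share one encoding
all_titles = sorted(set(train_titles) | set(test_titles))

# Forward map (str -> int) and reverse map (int -> str)
title_to_code = {title: code for code, title in enumerate(all_titles)}
code_to_title = {code: title for title, code in title_to_code.items()}

encoded_train = train_titles.map(title_to_code)
encoded_test = test_titles.map(title_to_code)
print(encoded_train.tolist(), encoded_test.tolist())
```

Building the map from the union matters because a label that appears only in the test set would otherwise be mapped to `NaN`.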

## **Determining start and final events for each training instance**

In [None]:
def identify_start_and_final_events(train,test):
    print('Determining start and final events for each training instance....\n')
    # Sort values
    train = train.sort_values(by=['installation_id', 'timestamp'])
    test = test.sort_values(by=['installation_id', 'timestamp'])
    
    # Identify start events
    train_min = train.groupby('installation_id')[['timestamp']].min()
    train_min['initial_time'] = 1
    train_min = train_min.reset_index()
    train = train.merge(train_min, on=['installation_id', 'timestamp'], how='left')
    train['initial_time'] = train['initial_time'].fillna(0) #assignment instead of inplace on a column selection
    
    test_min = test.groupby('installation_id')[['timestamp']].min()
    test_min['initial_time'] = 1
    test_min = test_min.reset_index()
    test = test.merge(test_min, on=['installation_id', 'timestamp'], how='left')
    test['initial_time'] = test['initial_time'].fillna(0)
    
    # Identify final events (first event of each assessment)
    train_max = train[train['type'] == 'Assessment']
    train_max = train_max.groupby('game_session')[['timestamp']].min()
    train_max['final_time'] = 1
    train_max = train_max.reset_index()
    train = train.merge(train_max, on=['game_session', 'timestamp'], how='left')
    train['final_time'] = train['final_time'].fillna(0)
    
    test_max = test[test['type'] == 'Assessment']
    test_max = test_max.groupby('game_session')[['timestamp']].min()
    test_max['final_time'] = 1
    test_max = test_max.reset_index()
    test = test.merge(test_max, on=['game_session', 'timestamp'], how='left')
    test['final_time'] = test['final_time'].fillna(0)
    
    return train, test
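
The two flags can also be illustrated on a toy log. The sketch below uses `groupby(...).transform('min')` instead of the merge used in the function, which is an alternative way to reach the same flags under the assumption that timestamps are unique within each group; the data values are made up.

```python
import pandas as pd

# Toy log (hypothetical values): one installation, a game session followed by an assessment session
events = pd.DataFrame({
    'installation_id': ['a', 'a', 'a', 'a'],
    'game_session': ['s1', 's1', 's2', 's2'],
    'type': ['Game', 'Game', 'Assessment', 'Assessment'],
    'timestamp': pd.to_datetime(['2019-09-01 10:00', '2019-09-01 10:05',
                                 '2019-09-01 10:10', '2019-09-01 10:12']),
})

# initial_time: 1 on the earliest event of each installation_id
first_in_installation = events.groupby('installation_id')['timestamp'].transform('min')
events['initial_time'] = (events['timestamp'] == first_in_installation).astype(int)

# final_time: 1 on the first event of each assessment session
first_in_session = events.groupby('game_session')['timestamp'].transform('min')
is_assessment = events['type'] == 'Assessment'
events['final_time'] = (is_assessment & (events['timestamp'] == first_in_session)).astype(int)
print(events[['game_session', 'initial_time', 'final_time']])
```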

## **Generate missing assessment labels**

In [None]:
def get_assessment_accuracy(df,assessment_session):
    assessment_df = df[df['game_session']== assessment_session]
    assessment_info = {'game_session' : assessment_session}
    assessment_info.update({'installation_id' : assessment_df.iloc[-1].installation_id})
    assessment_title = assessment_df.iloc[-1].title
    assessment_info.update({'title' : title_map_labels[assessment_title]})
    
    #accumulated_correct_attempts & accumulated_incorrect_attempts
    all_attempts = assessment_df.query(f'event_code == {correct_event_code_map[assessment_title]}')
    assessment_info.update({'num_correct' : all_attempts['event_data'].str.contains('true').sum()})
    assessment_info.update({'num_incorrect' : all_attempts['event_data'].str.contains('false').sum()})
    assessment_info.update({'accuracy' : assessment_info['num_correct']/(assessment_info['num_correct']+assessment_info['num_incorrect']) if (assessment_info['num_correct']+assessment_info['num_incorrect']) != 0 else 0})
    #accuracy_group
    if assessment_info['accuracy'] == 0:
        assessment_info.update({'accuracy_group' : 0})
    elif assessment_info['accuracy'] == 1:
        assessment_info.update({'accuracy_group' : 3})
    elif assessment_info['accuracy'] == 0.5:
        assessment_info.update({'accuracy_group' : 2})
    else:
        assessment_info.update({'accuracy_group' : 1})
        
    return assessment_info
    
def get_train_labels_full(train_labels,train,test):
    print('Generating missing assessment labels...')
    list_of_session_with_label = train_labels['game_session'].unique()
    print('{} sessions with labels: {}...'.format(len(list_of_session_with_label),list_of_session_with_label[:5]))
    list_of_session_without_label = np.setdiff1d(list_of_assessment_session,list_of_session_with_label)
    print('{} sessions without labels: {}...'.format(len(list_of_session_without_label),list_of_session_without_label[:5]))
    new_labels = [] #label rows generated for the sessions that have none
    list_of_sessions_to_be_predicted = []
    for assessment_session in tqdm(list_of_session_without_label): #tqdm takes the total from the array length
        if (train['game_session'] == assessment_session).any():
            new_labels.append(get_assessment_accuracy(train,assessment_session))
        elif (test['game_session'] == assessment_session).sum() > 1:
            new_labels.append(get_assessment_accuracy(test,assessment_session))
        else:
            #A test session with a single event is one the model has to predict
            list_of_sessions_to_be_predicted.append(assessment_session)
    #DataFrame.append was removed in pandas 2.0, so the new rows are concatenated in a single step
    train_labels_full = pd.concat([train_labels, pd.DataFrame(new_labels)], ignore_index=True)

    print('{} sessions to be predicted by model: {}...'.format(len(list_of_sessions_to_be_predicted),list_of_sessions_to_be_predicted[:5]))
    print('Result: {} labels generated\n'.format(train_labels_full.shape[0]-train_labels.shape[0]))
    train_labels_full.to_csv('train_labels_full.csv', index=False)
    return train_labels_full
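
The rule that turns attempt counts into an `accuracy_group` can be isolated as a small helper. This is a minimal sketch of the same branching used in `get_assessment_accuracy`; the function name `accuracy_group_from_counts` is hypothetical and does not exist elsewhere in this notebook.

```python
def accuracy_group_from_counts(num_correct, num_incorrect):
    """Map attempt counts to an accuracy_group in {0, 1, 2, 3} (hypothetical helper)."""
    attempts = num_correct + num_incorrect
    accuracy = num_correct / attempts if attempts else 0.0
    if accuracy == 0:
        return 0  # no correct attempt (or no attempt at all)
    if accuracy == 1:
        return 3  # every attempt was correct
    if accuracy == 0.5:
        return 2  # half of the attempts were correct
    return 1      # any other ratio

for counts in [(1, 0), (1, 1), (1, 2), (0, 3), (0, 0)]:
    print(counts, accuracy_group_from_counts(*counts))
```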

## **Get general information about the performance of children in each assessment**

In [None]:
def get_assessment_info(train_labels_full):
    """Gets the mean and median of both accuracy and accuracy_group for each assessment title.
    """
    print('Getting general information about the performance of children in each assessment...\n')
    mean_acc_map = {title: train_labels_full[train_labels_full['title'] == title]['accuracy'].mean() for title in list_of_assessment_title}
    mean_acc_group_map = {title: train_labels_full[train_labels_full['title'] == title]['accuracy_group'].mean() for title in list_of_assessment_title}
    
    median_acc_map = {title: train_labels_full[train_labels_full['title'] == title]['accuracy'].median() for title in list_of_assessment_title}
    median_acc_group_map = {title: train_labels_full[train_labels_full['title'] == title]['accuracy_group'].median() for title in list_of_assessment_title}
    
    return mean_acc_map, mean_acc_group_map, median_acc_map, median_acc_group_map
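
The same per-title statistics can be obtained with a single `groupby` aggregation. The sketch below is one possible alternative on toy labels (the titles `A` and `B` and all values are made up); it is not the approach used by `get_assessment_info`, which filters the DataFrame once per title.

```python
import pandas as pd

# Toy label table (hypothetical values)
labels = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'B'],
    'accuracy': [1.0, 0.0, 0.5, 0.5, 1.0],
    'accuracy_group': [3, 0, 2, 2, 3],
})

# One pass over the data: mean and median of both columns for every title
stats = labels.groupby('title')[['accuracy', 'accuracy_group']].agg(['mean', 'median'])

# The columns form a MultiIndex, so each statistic is selected with a (column, statistic) tuple
mean_acc = stats[('accuracy', 'mean')].to_dict()
median_acc_group = stats[('accuracy_group', 'median')].to_dict()
print(mean_acc, median_acc_group)
```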

## **Obtaining features for each answered assessment (from both train and test sets)**

In [None]:
def extract_correct_pct(series, subs=False):
    tot, num_cor = 0, 0
    for s in series:
        dict_s = json.loads(s)
        if 'correct' in dict_s.keys():
            tot += 1
            if dict_s['correct']:
                num_cor += 1
    if subs and num_cor > 0:
        num_cor -= 1
        tot -= 1
    return num_cor / tot if tot > 0 else 0.0

def observation(df, k, i,test_set=False):
    # Select the time series corresponding to installation_id k and Assessment i:
    # every event that precedes the start of assessment i, plus that start event itself
    df2 = df[(df['final_time_sum'] <= i) | ((df['final_time_sum'] == i + 1) & (df['final_time'] == 1))]
    df = df2.reset_index(drop=True)
    df.loc[:, 'date'] = df.loc[:, 'timestamp'].dt.date
    
    #Getting the features for the assessment
    features = {}
    #assessment-dependent features
    assessment_row = df.iloc[-1]
    features.update({'installation_id': assessment_row['installation_id']})
    features.update({'game_session': assessment_row['game_session']})
    features.update({'is_' + assessment_title.replace(' ', '_')[:-13]: 1 if assessment_row['title'] == title_map[assessment_title] else 0 for assessment_title in list_of_assessment_title})
    features.update({'is_' + world: 1 if assessment_row['world'] == world_map[world] else 0 for world in short_list_of_world})
    features.update({'hour': assessment_row['timestamp'].hour})
    features.update({'dayofweek': assessment_row['timestamp'].dayofweek})
    
    features.update({'num_unique_dates': df['date'].nunique()})
    features.update({'mean_acc': mean_acc_map[title_map_labels[assessment_row['title']]]})
    features.update({'mean_acc_group': mean_acc_group_map[title_map_labels[assessment_row['title']]]})
    features.update({'median_acc': median_acc_map[title_map_labels[assessment_row['title']]]})
    features.update({'median_acc_group': median_acc_group_map[title_map_labels[assessment_row['title']]]})
    
    #counter features
    #title_count
    title_count = df.groupby('game_session')['title'].max().value_counts().to_dict()
    for title in title_map.values():
        features.update({'num_' + str(title): title_count[title] if title in title_count else 0})
    features['num_'+ str(assessment_row['title'])] -= 1
    features.update({'num_total_title': sum(title_count.values())})
    features.update({'previous_completions': features['num_'+ str(assessment_row['title'])]})
    
    #title_type_count
    title_type_count = df.groupby('game_session')['type'].max().value_counts().to_dict()
    title_type_keys = ['Assessment', 'Game', 'Clip', 'Activity']
    for title_type in title_type_keys:
        features.update({'num_' + title_type: title_type_count[title_type] if title_type in title_type_count else 0})
    features['num_Assessment'] -= 1
    
    #title_type_same_world_count
    title_type_same_world_count = df[df.world == assessment_row['world']].groupby('game_session')['type'].max().value_counts().to_dict()
    for title_type in title_type_keys:
        features.update({'num_' + title_type + '_same_world': title_type_same_world_count[title_type] if title_type in title_type_same_world_count else 0})
    features['num_Assessment_same_world'] -= 1
    features.update({'num_total_title_same_world': sum(title_type_same_world_count.values())})
    
    #event_code_count
    event_code_count = df.groupby('event_code')['event_code'].count().to_dict()
    for event_code in list_of_event_code:
        features.update({str(event_code): event_code_count[event_code] if event_code in event_code_count else 0})
    features.update({'total_event_count': sum(event_code_count.values())})             
    
    #min_count
    min_count_by_type = df.groupby('type')['game_time'].agg(lambda x: sum(x)/(60 * 1000)).to_dict()
    for title_type in title_type_keys:
        if title_type == 'Clip':
            pass
        else:
            features.update({'mins_' + title_type: min_count_by_type[title_type] if title_type in min_count_by_type else 0})
    features.update({'mins_total': sum(min_count_by_type.values())})
    
    #avg_min_count
    avg_min_count_by_type = df.groupby('type')['game_time'].agg(lambda x: sum(x)/(60 * 1000)).to_dict()
    for title_type in title_type_keys:
        if title_type == 'Clip':
            pass
        else:
            features.update({'avg_mins_' + title_type: avg_min_count_by_type[title_type]/features['num_' + title_type] if title_type in min_count_by_type and features['num_' + title_type]!= 0 else 0})
    features.update({'avg_mins_total': sum(avg_min_count_by_type.values())/features['num_total_title']})
    
    #correct_pct's
    #(pd.isna is used because an identity check against np.nan is not guaranteed to catch every NaN value)
    game_pct = df[df['type'] == 'Game'].groupby('game_session')['event_data'].agg(lambda x: extract_correct_pct(x)).mean()
    features.update({'game_pct': 0 if pd.isna(game_pct) else game_pct})
    ass_pct = df[df['type'] == 'Assessment'].groupby('game_session')['event_data'].agg(lambda x: extract_correct_pct(x,True)).mean()
    features.update({'ass_pct': 0 if pd.isna(ass_pct) else ass_pct})
    
    #accumulated_correct/incorrect_attempts and accuracies
    accumulated_correct_attempts = 0
    accumulated_incorrect_attempts = 0
    accumulated_accuracy = 0
    accumulated_accuracy_group = 0
    last_accuracy_same_title = -1
    last_accuracy_group_same_title = -1
    
    for assessment_session in df[df['type'] == 'Assessment']['game_session'].unique():
        if assessment_session == features['game_session']: #skipping the assessment to be labeled
            pass
        else:
            accumulated_correct_attempts += train_labels_full[train_labels_full['game_session'] == assessment_session]['num_correct'].values[0]
            accumulated_incorrect_attempts += train_labels_full[train_labels_full['game_session'] == assessment_session]['num_incorrect'].values[0]
            accuracy = train_labels_full[train_labels_full['game_session'] == assessment_session]['accuracy'].values[0]
            accumulated_accuracy += accuracy
            last_accuracy_same_title = accuracy if title_map[train_labels_full[train_labels_full['game_session'] == assessment_session]['title'].values[0]] == assessment_row['title'] else last_accuracy_same_title 
            accuracy_group = train_labels_full[train_labels_full['game_session'] == assessment_session]['accuracy_group'].values[0]
            accumulated_accuracy_group +=  accuracy_group
            last_accuracy_group_same_title = accuracy_group if title_map[train_labels_full[train_labels_full['game_session'] == assessment_session]['title'].values[0]] == assessment_row['title'] else last_accuracy_group_same_title 
            
    features.update({'accumulated_correct_attempts': accumulated_correct_attempts})
    features.update({'accumulated_incorrect_attempts': accumulated_incorrect_attempts})
    features.update({'accumulated_accuracy': accumulated_accuracy})
    features.update({'accumulated_accuracy_group': accumulated_accuracy_group/features['num_Assessment'] if features['num_Assessment'] > 0 else 0})
    features.update({'last_accuracy_same_title': last_accuracy_same_title})
    features.update({'last_accuracy_group_same_title': last_accuracy_group_same_title})
    
    #Getting the label for the assessment if it is not an incomplete one from the test set
    if not test_set:
        #Look the label up by the session being labeled rather than by the leftover loop variable
        features.update({'accuracy_group': train_labels_full[train_labels_full['game_session'] == features['game_session']]['accuracy_group'].values[0]})

    return features
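
The counter features follow one recurring pattern: reduce the events to one value per `game_session`, count the values, and fill the categories that never occur with 0. Below is a minimal sketch of that pattern on toy data; it uses `first()` and `reindex` where the function uses `max()` and a dictionary lookup, and all values are made up.

```python
import pandas as pd

# Toy event log (hypothetical values): four sessions, several events in some of them
events = pd.DataFrame({
    'game_session': ['s1', 's1', 's2', 's3', 's3', 's4'],
    'type': ['Game', 'Game', 'Clip', 'Game', 'Game', 'Assessment'],
})
type_keys = ['Assessment', 'Game', 'Clip', 'Activity']

# One type per session, then the number of sessions of each type
sessions_per_type = events.groupby('game_session')['type'].first().value_counts()

# reindex adds the types that never occurred, with a count of 0
counts = sessions_per_type.reindex(type_keys, fill_value=0)
features = {'num_' + title_type: int(counts[title_type]) for title_type in type_keys}
print(features)
```

Counting sessions instead of rows keeps a long session from weighing more than a short one.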

## **Generating the final train and test sets**

In [None]:
def get_final_data(train, test):
    train['final_time_sum'] = train['final_time'].groupby(train['installation_id']).transform('cumsum')
    test['final_time_sum'] = test['final_time'].groupby(test['installation_id']).transform('cumsum')
    obs_per_id_train = train.groupby('installation_id')[['final_time_sum']].max().astype('int32').to_dict()['final_time_sum']
    obs_per_id_test = test.groupby('installation_id')[['final_time_sum']].max().astype('int32').to_dict()['final_time_sum']
    final_train = []
    final_test = []    
    print('Obtaining train features:')
    #final_train = parallelize_dataframe(obs_list,final_train,n_cores = 4)
    for k, v in tqdm(obs_per_id_train.items(), total=len(obs_per_id_train)):
        df_ = train[train['installation_id'] == k]
        for i in range(v):
            final_train.append(observation(df_, k, i))

    print('Obtaining test features:')
    for k, v in tqdm(obs_per_id_test.items(), total=len(obs_per_id_test)):
        df_ = test[test['installation_id'] == k]
        for i in range(v-1):
            final_train.append(observation(df_, k, i))
        final_test.append(observation(df_, k, v - 1,test_set=True))
    
    final_train = pd.DataFrame(final_train)
    final_train = final_train.reset_index(drop=True)
    X_train = final_train.iloc[:, 2:-1]
    Y_train = final_train.iloc[:, -1]
    
    final_test = pd.DataFrame(final_test)
    final_test = final_test.reset_index(drop=True)
    X_test = final_test.iloc[:, 2:]
    
    print('Final train (\'X_train\') information:\nN = {}\np= {}\nlen(y)={}\n\n'.format(X_train.shape[0],X_train.shape[1],Y_train.shape[0]))
    print('Final test (\'X_test\') information:\nN = {}\np= {}'.format(X_test.shape[0],X_test.shape[1]))
    
    return X_train, Y_train, X_test, final_train, final_test
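
The way `final_time_sum` drives the selection of each observation can be seen on a toy log. In the sketch below the `event` column and the helper `history_for` are hypothetical names used only for illustration; the mask is the one applied at the start of `observation`.

```python
import pandas as pd

# Toy log (hypothetical values): one installation with two assessments, A1 and A2
events = pd.DataFrame({
    'installation_id': ['a'] * 6,
    'event': ['g1', 'A1_start', 'A1_next', 'g2', 'A2_start', 'A2_next'],
    'final_time': [0, 1, 0, 0, 1, 0],
})

# Running count of assessment starts per installation: 0, 1, 1, 1, 2, 2
events['final_time_sum'] = events.groupby('installation_id')['final_time'].cumsum()

def history_for(df, i):
    """Events before the start of assessment i (0-based), plus that start event."""
    mask = (df['final_time_sum'] <= i) | ((df['final_time_sum'] == i + 1) & (df['final_time'] == 1))
    return df[mask]

print(history_for(events, 0)['event'].tolist())
print(history_for(events, 1)['event'].tolist())
```

The later events of the target assessment are left out, so the features only use information available at the moment the assessment starts.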

## **Saving final train and test sets**

In [None]:
def save_data(X_train, Y_train, X_test,final_train,final_test):
    os.makedirs('clean_data', exist_ok=True) #make sure the output folder exists
    X_train.to_csv('clean_data/X_train.csv', index=False)
    Y_train.to_csv('clean_data/Y_train.csv', index=False)
    X_test.to_csv('clean_data/X_test.csv', index=False)
    final_train.to_csv('clean_data/final_train.csv', index=False)
    final_test.to_csv('clean_data/final_test.csv', index=False)

# **Final dataset generation**


## train_labels_full generation

In [None]:
#If train_labels_full has not been generated yet, run this cell

#Read data
#train, test, train_labels, specs, ss = read_data(read_train_labels_full = False)
#Filter 'installation_id' without assessments from train
#train = get_installation_ids_with_assessment(train)
#Get useful dictionaries with encoded titles, worlds, types...
#train, test, correct_event_code_map, list_of_title, list_of_world, short_list_of_world, list_of_event_code, title_map, title_map_labels, world_map, list_of_assessment_title, list_of_assessment_session, list_of_event_id = encode_text_data(train, test)
#Generate missing assessment labels
#train_labels_full = get_train_labels_full(train_labels,train,test)
#Storing the new data set
#train_labels_full.to_csv('clean_data/train_labels_full.csv', index=False)

## X_train, Y_train, X_test generation

In [None]:
#Read data
train, test, train_labels_full, specs, ss = read_data()
#Filter 'installation_id' without assessments from train
train = get_installation_ids_with_assessment(train)
#Get useful dictionaries with encoded titles, worlds, types...
train, test, correct_event_code_map, list_of_title, list_of_world,short_list_of_world, list_of_event_code, title_map, title_map_labels, world_map, list_of_assessment_title, list_of_assessment_session, list_of_event_id = encode_text_data(train, test)
#Identify start and final events for each assessment in both the train and test sets
train, test= identify_start_and_final_events(train,test)
#Get general metrics about the performance of children in each assessment
mean_acc_map, mean_acc_group_map, median_acc_map, median_acc_group_map = get_assessment_info(train_labels_full)
#Generate final training and test set
X_train,Y_train, X_test, final_train, final_test = get_final_data(train,test)
#Saving data 
save_data(X_train, Y_train, X_test,final_train,final_test)