## A LightGBM model to predict students' success

This LightGBM model for classification includes 45 features:

#### - Features on questions' characteristics:

answered_correctly_mean: Difficulty of the current question (share of correct answers by all users) <br>
previous_questions_difficulty: Difficulty of previous questions answered by a user <br>
answered_correctly_mean_std: Standard deviation of the question's share of correct answers by all users <br>
answered_correctly_mean_bundle: Share of questions answered correctly in a specific question bundle <br>

tag1: First tag of the current question <br>
tag2: Second tag of the current question <br>
tag3: Third tag of the current question <br>
part: Category of the current question <br>

question_clusters: Clustering of questions according to conditional probabilities with respect to the 20 questions most frequently asked <br>
question_clusters_2: Clustering of questions with respect to other features (independently of the question's share of correct answers)

#### - Features on user's history:

previous_questions: Number of previously answered questions by the user <br>
answered_correctly_lagged_sum: Number of previous questions answered correctly by the user <br>

previous_answered_correctly: User's knowledge (User's share of correct answers in the past) <br>
previous_answered_correctly_rolling: Rolling window of user's share of correct answers (last 350 interactions) <br>
previous_answered_correctly_rolling_200: Rolling window of user's share of correct answers (last 200 interactions) <br>
previous_answered_correctly_rolling_100: Rolling window of user's share of correct answers (last 100 interactions) <br>
previous_answered_correctly_rolling_50: Rolling window of user's share of correct answers (last 50 interactions) <br>
previous_answered_correctly_rolling_20: Rolling window of user's share of correct answers (last 20 interactions) <br>
previous_answered_correctly_rolling_10: Rolling window of user's share of correct answers (last 10 interactions)

prev_que_answ_corr: Difficulty of previous questions answered correctly <br>
prev_que_answ_incorr: Difficulty of previous questions answered incorrectly <br>
prev_que_answ_corr_min: Max. difficulty (lowest share of correct answers) of previous questions answered correctly <br>
prev_que_answ_incorr_max: Min. difficulty (highest share of correct answers) of previous questions answered incorrectly <br>

time_between_questions: Time between current and last question <br>
time_between_questions_2lag: Time between current and second-last question <br>
time_between_questions_3lag: Time between current and third-last question <br>
time_between_questions_4lag: Time between current and fourth-last question <br>
time_between_questions_5lag: Time between current and fifth-last question <br>
time_between_questions_6lag: Time between current and sixth-last question <br>
time_between_questions_7lag: Time between current and seventh-last question <br>
timediff_last_corr_answ: Time between current and last question answered correctly <br>
timediff_last_incorr_answ: Time between current and last question answered incorrectly <br>
timediff_2_last_incorr_answ: Time between current and second-last question answered incorrectly <br>
timediff_3_last_incorr_answ: Time between current and third-last question answered incorrectly <br>
timediff_4_last_incorr_answ: Time between current and fourth-last question answered incorrectly <br>

prior_question_elapsed_time: (Average) time spent on each question of the previous task container <br>
prior_question_elapsed_time_sum: Sum of the time spent on answering questions in the past <br>
prior_question_elapsed_time_mean: Average time spent per question in the past (sum divided by the number of questions answered) <br>
prior_question_had_explanation_mean: Average of whether previous questions had an explanation <br>
 
repeated_question_corr: Current question seen in the past and answered correctly <br>
repeated_question_exp: Current question seen in the past and it had an explanation <br>
repeated_question_time: Time between the current question and when it has been seen in the past <br>

answer_share: Share of the user's past answers in which they chose the option that is the correct answer to the current question <br>
 
repeated_tags_correct_share: Share of questions of the respective tag answered correctly in the past <br>
tag_timediff: Time between the current question and a question with the respective tag in the past <br>
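
As an illustration of the rolling-window features listed above, here is a minimal sketch using only the standard library. The function name `rolling_correct_share` and the toy answer sequence are made up for this example; the notebook itself stores the history in a `bitarray` and uses windows of 350, 200, 100, 50, 20 and 10 interactions.

```python
from collections import deque

def rolling_correct_share(answers, window):
    """Share of correct answers among the previous `window` interactions.

    Returns one value per interaction; None where the user has no history yet.
    """
    history = deque(maxlen=window)  # only the most recent answers are kept
    shares = []
    for answered_correctly in answers:
        # The feature is computed BEFORE the current answer is recorded,
        # so the label of the current row never leaks into its own feature.
        shares.append(sum(history) / len(history) if history else None)
        history.append(answered_correctly)
    return shares

print(rolling_correct_share([1, 0, 1, 1, 0], window=2))
# [None, 1.0, 0.5, 0.5, 1.0]
```

While the history is shorter than the window, the share is taken over whatever history exists, which mirrors how the feature engineering loop further down falls back to the full history.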


### Importing the required modules

In [None]:
import numpy as np
import pandas as pd
import random
import gc
from collections import defaultdict, deque
from bitarray import bitarray

from tqdm import tqdm

import lightgbm as lgb

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

### Loading the data

In [None]:
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "answered_correctly": "int8",
    "user_answer": "int8",
    "prior_question_elapsed_time": "float32", 
    "prior_question_had_explanation": "int8"
    }

data = pd.read_csv('train.csv', dtype=dtypes, usecols=["row_id","user_id","task_container_id","timestamp",
                    "content_id","content_type_id","answered_correctly","user_answer",
                        "prior_question_elapsed_time","prior_question_had_explanation"], nrows=10**5)

In [None]:
data = data[data['content_type_id']==0]

### Question clustering

In [None]:
most_freq_questions = data['content_id'].value_counts().index.tolist()[:20]

In [None]:
users = data['user_id'].values
questions = data['content_id'].values
y = data['answered_correctly'].values

In [None]:
# Conditional statistics per (anchor question, question, outcome on the anchor question)
sim_question_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(int))))
# Latest outcome of each user on each anchor question (1 = correct, -1 = incorrect, 0 = not seen).
# Kept in a separate dict so that user ids cannot collide with question ids.
user_anchor_outcome = defaultdict(lambda: defaultdict(int))

for user_id, content_id, answ_corr in zip(tqdm(users), questions, y):

    for que in most_freq_questions:
        if user_anchor_outcome[user_id][que]==1:
            sim_question_dict[que][content_id]['correct']['answered_correctly'] += answ_corr
            sim_question_dict[que][content_id]['correct']['previous_questions'] += 1

        elif user_anchor_outcome[user_id][que]==-1:
            sim_question_dict[que][content_id]['incorrect']['answered_correctly'] += answ_corr
            sim_question_dict[que][content_id]['incorrect']['previous_questions'] += 1

    if content_id in most_freq_questions:
        user_anchor_outcome[user_id][content_id] = 1 if answ_corr == 1 else -1

In [None]:
question_dict = defaultdict(lambda: defaultdict(int))

for i, (content_id, answ_corr) in enumerate(zip(tqdm(questions), y)):
    question_dict[content_id]['previous_questions_content'] += 1
    question_dict[content_id]['answered_correctly_content_sum'] += answ_corr
    question_dict[content_id]['answered_correctly_mean'] = question_dict[content_id][
        'answered_correctly_content_sum'] / question_dict[content_id]['previous_questions_content']

In [None]:
cluster_question_corr = np.zeros((13523,20), dtype=np.float32)
cluster_question_incorr = np.zeros((13523,20), dtype=np.float32)

for i, content_id in enumerate(tqdm(questions)):
    for j, que in enumerate(most_freq_questions):
        if sim_question_dict[que][content_id]['correct']['previous_questions']>0:
            cluster_question_corr[content_id,j] = (sim_question_dict[que][
                content_id]['correct']['answered_correctly']/sim_question_dict[que][content_id]['correct'][
                'previous_questions']) - question_dict[content_id]['answered_correctly_mean'] 
        else:
            cluster_question_corr[content_id,j] = np.nan
        if sim_question_dict[que][content_id]['incorrect']['previous_questions']>0:
            cluster_question_incorr[content_id,j] = (sim_question_dict[que][content_id][
                'incorrect']['answered_correctly']/sim_question_dict[que][content_id]['incorrect'][
                'previous_questions']) - question_dict[content_id]['answered_correctly_mean']
        else:
            cluster_question_incorr[content_id,j] = np.nan

In [None]:
X = np.nan_to_num(np.concatenate((cluster_question_corr,cluster_question_incorr), axis=1))

clustering = KMeans(30, n_init=100, max_iter=1000)
cluster_groups_questions = clustering.fit_predict(X)

question_clusters_dict = pd.Series(cluster_groups_questions).to_dict()
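
The clustering input above is, for each question, how much its share of correct answers shifts once a user has answered one of the 20 most frequent questions correctly or incorrectly. A minimal standard-library sketch of that idea for a single anchor question, conditioning on a correct answer only, could look like this; the function name `conditional_deviation` and the toy interactions are made up for illustration.

```python
from collections import defaultdict

def conditional_deviation(interactions, anchor):
    """interactions: (user_id, question_id, answered_correctly) in chronological order.

    Returns {question_id: P(correct | user last answered `anchor` correctly) - P(correct)}.
    """
    anchor_correct_users = set()
    overall = defaultdict(lambda: [0, 0])      # question -> [n_correct, n_seen]
    conditional = defaultdict(lambda: [0, 0])  # same counts, restricted to anchor-correct users
    for user_id, question_id, correct in interactions:
        overall[question_id][0] += correct
        overall[question_id][1] += 1
        if user_id in anchor_correct_users:
            conditional[question_id][0] += correct
            conditional[question_id][1] += 1
        if question_id == anchor:
            # Only the latest outcome on the anchor question counts.
            if correct:
                anchor_correct_users.add(user_id)
            else:
                anchor_correct_users.discard(user_id)
    return {q: c[0] / c[1] - overall[q][0] / overall[q][1] for q, c in conditional.items()}

toy = [(1, 'A', 1), (1, 'B', 1), (2, 'A', 0), (2, 'B', 0), (3, 'A', 1), (3, 'B', 1)]
print(conditional_deviation(toy, anchor='A'))  # 'B' is answered better after a correct 'A'
```

A positive value means users who got the anchor question right do better than average on that question. Stacking these deviations for all anchors, plus the matching "answered incorrectly" deviations, gives one row per question, which is what `KMeans` receives.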

### Random user sampling

In [None]:
user_list = data['user_id'].unique()
random.seed(52)
user_list = random.sample(list(user_list), len(user_list))

train_split = int(len(user_list)*0.5)
user_list = user_list[train_split:]

data = data.loc[data['user_id'].isin(user_list),:]

del user_list
gc.collect()

### Data preprocessing

In [None]:
questions = pd.read_csv('questions.csv')

tags_len_max = questions['tags'].apply(lambda x: len(str(x).split())).max()

for i in range(1, tags_len_max + 1):
    questions['tag{}'.format(i)] = questions['tags'][questions['tags'].isnull()==0].apply(lambda x: int(str(x
                                                        ).split()[i-1]) if len(str(x).split()) >= i else 0)

questions = questions.rename(columns={'question_id':'content_id'})
questions.set_index('content_id', inplace=True)

data = data.merge(data.loc[data['content_type_id']==0,:].merge(questions, on='content_id', how='left')[[
    'row_id','bundle_id','correct_answer','part','tag1','tag2','tag3']], on='row_id', how='left')
del questions
gc.collect()

In [None]:
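# Select several columns with a list (double brackets); tuple-style selection
# after groupby is no longer supported by recent pandas versions.
data_cont = data.groupby(['user_id','task_container_id'])[
    ['prior_question_had_explanation','prior_question_elapsed_time']].mean()
data_cont = data_cont.groupby('user_id')[['prior_question_had_explanation','prior_question_elapsed_time']].shift(
    -1)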
data_cont.rename(columns={'prior_question_had_explanation':'question_had_explanation',
                     'prior_question_elapsed_time':'question_elapsed_time'}, inplace=True)
data = data.join(data_cont, on=['user_id','task_container_id'], how='left')

del data_cont
gc.collect()

### CV strategy

The CV strategy is based on: https://www.kaggle.com/its7171/cv-strategy

In [None]:
random.seed(52)

users_random_start=defaultdict(int)

timestamp_max_user = data.groupby('user_id')['timestamp'].max().reset_index()
timestamp_max = data['timestamp'].max()

for user_id, timestamp in zip(tqdm(timestamp_max_user['user_id'].values), timestamp_max_user['timestamp'
    ].values):
    users_random_start[user_id] = random.randint(0, timestamp_max - timestamp)

est_time = np.empty(len(data))
for i, (timestamp, user_id) in enumerate(zip(tqdm(data['timestamp'].values), data['user_id'].values)):
    est_time[i] = timestamp + users_random_start[user_id]
data['est_time'] = est_time

del timestamp_max_user
del users_random_start
del est_time
gc.collect()

data.sort_values('est_time', inplace=True)
data.reset_index(drop=True, inplace=True)

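# The last 5% of interactions in virtual time form the validation set (1); the rest is training (0).
split_idx = int(len(data)*0.95)
data['train_val_split'] = 0
data.loc[split_idx:,'train_val_split'] = 1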

data.sort_values(['user_id','timestamp'], inplace=True)

train_val_split = data['train_val_split'].values
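
The idea behind this split is that timestamps in the data are relative to each user's first interaction, so every user is given a random start offset on a shared "virtual" time axis and the latest interactions on that axis are held out. A minimal standard-library sketch under those assumptions follows; the function name `virtual_time_split` and the toy rows are made up for illustration.

```python
import random

def virtual_time_split(rows, val_fraction=0.05, seed=52):
    """rows: list of (user_id, timestamp), timestamps relative to each user's start.

    Returns a list of 0/1 flags aligned with `rows` (1 = validation).
    """
    rng = random.Random(seed)
    max_ts = max(ts for _, ts in rows)
    last_ts = {}
    for user_id, ts in rows:
        last_ts[user_id] = max(ts, last_ts.get(user_id, 0))
    # Random start per user, chosen so the shifted history still ends by max_ts.
    offset = {u: rng.randint(0, max_ts - t) for u, t in last_ts.items()}
    order = sorted(range(len(rows)), key=lambda i: rows[i][1] + offset[rows[i][0]])
    n_val = round(len(rows) * val_fraction)
    flags = [0] * len(rows)
    for i in order[len(rows) - n_val:]:
        flags[i] = 1  # the latest rows in virtual time go to validation
    return flags

rows = [(1, 0), (1, 10), (1, 20), (2, 0), (2, 5), (3, 0)]
print(virtual_time_split(rows, val_fraction=0.5))
```

Because the offset is constant per user, a user's validation rows always come after that user's training rows, which is closer to how new interactions arrive at inference time than a purely random row split.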

### Feature engineering

In [None]:
feature_position = dict(zip(['answered_correctly_mean', 'previous_answered_correctly','previous_questions',
        'previous_questions_difficulty','prev_que_answ_corr','prev_que_answ_incorr','tag1','tag2','tag3',
        'part','time_between_questions','prior_question_elapsed_time','prior_question_elapsed_time_mean',
            'prior_question_had_explanation_mean','answered_correctly_lagged_sum','repeated_question_corr',
            'repeated_question_exp','repeated_question_time','answer_share','prior_question_elapsed_time_sum',
            'prev_que_answ_corr_min','prev_que_answ_incorr_max','previous_answered_correctly_rolling',
            'answered_correctly_mean_std','answered_correctly_mean_bundle','timediff_last_corr_answ',
            'timediff_last_incorr_answ','time_between_questions_2lag','timediff_2_last_incorr_answ',
            'time_between_questions_3lag','timediff_3_last_incorr_answ','time_between_questions_4lag',
            'timediff_4_last_incorr_answ','time_between_questions_5lag','time_between_questions_6lag',
            'time_between_questions_7lag','previous_answered_correctly_rolling_200',
            'previous_answered_correctly_rolling_100','previous_answered_correctly_rolling_50',
            'previous_answered_correctly_rolling_20','previous_answered_correctly_rolling_10',
            'repeated_tags_correct_share','tag_timediff','question_clusters','question_clusters_2'
            ],list(range(45))))
            
features = np.zeros([len(data),45], dtype=np.float32)

In [None]:
users = data['user_id'].values
questions = data['content_id'].values
bundles= data['bundle_id'].values
y = data['answered_correctly'].values
user_answer = data['user_answer'].values
timestamp = data['timestamp'].values
prior_question_had_explanation = data['prior_question_had_explanation'].values
question_had_explanation = data['question_had_explanation'].values
question_elapsed_time = data['question_elapsed_time'].values

task_container = data['task_container_id'].values

correct_answer = data['correct_answer'].values
tag1 = data['tag1'].fillna(-1).astype('int').values

user_lookup = pd.Series(data['user_id'].unique()).to_dict()
user_lookup = {k: v for v, k in user_lookup.items()}

tag1_lookup = pd.Series(data['tag1'].fillna(-1).astype('int').unique()).to_dict()
tag1_lookup = {k: v for v, k in tag1_lookup.items()}

features[:,feature_position['tag1']] = data['tag1'].values
features[:,feature_position['tag2']] = data['tag2'].values
features[:,feature_position['tag3']] = data['tag3'].values
features[:,feature_position['part']] = data['part'].values
features[:,feature_position['prior_question_elapsed_time']] = data['prior_question_elapsed_time'].values

features = features.astype(np.float32)

del data
gc.collect()

In [None]:
question_dict = defaultdict(lambda: defaultdict(int))

for i, (content_id, answ_corr, elap_time) in enumerate(zip(tqdm(questions), y, question_elapsed_time)):
    question_dict[content_id]['previous_questions_content'] += 1
    question_dict[content_id]['answered_correctly_content_sum'] += answ_corr
    question_dict[content_id]['answered_correctly_mean'] = question_dict[content_id][
        'answered_correctly_content_sum'] / question_dict[content_id]['previous_questions_content']
    features[i,feature_position['question_clusters']] = question_clusters_dict[content_id]+1

for i, (content_id, answ_corr) in enumerate(zip(tqdm(questions), y)):
    features[i,feature_position['answered_correctly_mean']]=question_dict[content_id]['answered_correctly_mean']
    question_dict[content_id]['answered_correctly_mean_std_to_norm'] += (answ_corr - question_dict[content_id]['answered_correctly_mean'])**2

for i, content_id in enumerate(tqdm(questions)):
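    # Square root of the mean squared deviation, so the feature is a standard deviation as documented
    features[i,feature_position['answered_correctly_mean_std']]=np.sqrt(question_dict[content_id][
        'answered_correctly_mean_std_to_norm'] / question_dict[content_id]['previous_questions_content'])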
del question_dict
gc.collect()
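
One property worth keeping in mind for the spread feature computed above: for 0/1 outcomes, the population variance equals p * (1 - p), where p is the share of correct answers. The spread of a question's outcomes is therefore a fixed function of `answered_correctly_mean`, whether it is stored as a variance or as a standard deviation. A small check of that identity, with a made-up helper name and toy data:

```python
import math

def mean_and_std(outcomes):
    """Population mean and standard deviation of a list of 0/1 outcomes."""
    n = len(outcomes)
    p = sum(outcomes) / n
    variance = sum((o - p) ** 2 for o in outcomes) / n
    return p, math.sqrt(variance)

p, std = mean_and_std([1, 1, 0, 1])
# For binary outcomes the standard deviation is sqrt(p * (1 - p)).
assert math.isclose(std, math.sqrt(p * (1 - p)))
print(p, std)
```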

In [None]:
bundle_dict = defaultdict(lambda: defaultdict(int))

for i, (bundle_id, answ_corr) in enumerate(zip(tqdm(bundles), y)):
    bundle_dict[bundle_id]['previous_questions_content'] += 1
    bundle_dict[bundle_id]['answered_correctly_content_sum'] += answ_corr
    bundle_dict[bundle_id]['answered_correctly_mean'] = bundle_dict[bundle_id]['answered_correctly_content_sum'] / bundle_dict[bundle_id][
        'previous_questions_content']
    
for i, (bundle_id, answ_corr) in enumerate(zip(tqdm(bundles), y)):
    features[i,feature_position['answered_correctly_mean_bundle']]=bundle_dict[bundle_id]['answered_correctly_mean']
    
del bundle_dict

In [None]:
user_content_dict = defaultdict(lambda: defaultdict(int))

for i, (user_id, content_id, answ_corr, exp, time) in enumerate(zip(tqdm(users),questions, y, question_had_explanation,
                                                             timestamp)):  
    if user_id not in user_content_dict:
        user_content_dict[user_id]['answ_corr'] = bitarray(13523, endian='little')
        user_content_dict[user_id]['answ_corr'].setall(0)
        user_content_dict[user_id]['exp'] = bitarray(13523, endian='little')
        user_content_dict[user_id]['exp'].setall(0)

    features[i,feature_position['repeated_question_corr']] = user_content_dict[user_id]['answ_corr'][content_id]
    features[i,feature_position['repeated_question_exp']] = user_content_dict[user_id]['exp'][content_id]
    
    if user_content_dict[user_id][content_id] > 0:
        features[i,feature_position['repeated_question_time']] = time - user_content_dict[user_id][content_id]
    else:
        features[i,feature_position['repeated_question_time']] = np.nan

    user_content_dict[user_id]['answ_corr'][content_id] = answ_corr
    user_content_dict[user_id]['exp'][content_id] = exp
    user_content_dict[user_id][content_id] = time

del user_content_dict
gc.collect()

In [None]:
user_tag1_matrix = np.zeros((len(user_lookup), len(tag1_lookup)), dtype=np.int16)
user_tag1_corr_matrix = np.zeros((len(user_lookup), len(tag1_lookup)), dtype=np.int16)
user_tag1_time_matrix = np.zeros((len(user_lookup), len(tag1_lookup)), dtype=np.int64)

for i, (user_id, t1, answ_corr, time) in enumerate(zip(tqdm(users),tag1, y, timestamp)):
    features[i,feature_position['repeated_tags_correct_share']] = user_tag1_corr_matrix[user_lookup[user_id
                    ],tag1_lookup[t1]] / user_tag1_matrix[user_lookup[user_id],tag1_lookup[t1]]
    
    if user_tag1_time_matrix[user_lookup[user_id],tag1_lookup[t1]] > 0:
        features[i,feature_position['tag_timediff']] = time - user_tag1_time_matrix[user_lookup[user_id
                                                        ],tag1_lookup[t1]]
    else: 
        features[i,feature_position['tag_timediff']] = np.nan
        
    user_tag1_matrix[user_lookup[user_id],tag1_lookup[t1]] += 1
    user_tag1_corr_matrix[user_lookup[user_id],tag1_lookup[t1]] += answ_corr
    user_tag1_time_matrix[user_lookup[user_id],tag1_lookup[t1]] = time
    
del user_tag1_matrix
del user_tag1_corr_matrix
del user_tag1_time_matrix
gc.collect()

In [None]:
user_dict = defaultdict(lambda: defaultdict(int))

for i, (user_id, content_id, answ_corr, que_corr, time, exp, elap_time, user_answ, corr_answ, task_id
       ) in enumerate(zip(tqdm(users), questions, y, features[:,feature_position['answered_correctly_mean'
        ]], timestamp, prior_question_had_explanation, features[:,feature_position[
        'prior_question_elapsed_time']],user_answer,correct_answer, task_container)):
           
    user_dict[user_id]['previous_questions'] += 1
    user_dict[user_id]['answer_{}'.format(user_answ)] += 1
    user_dict[user_id]['answered_correctly'] =answ_corr
    
    user_dict[user_id]['answered_correctly_sum'] += answ_corr
    user_dict[user_id]['answered_incorrectly_sum'] += (1-answ_corr)
    
    if user_dict[user_id]['previous_questions'] == 1:
        features[i,feature_position['time_between_questions']] = np.nan
        features[i,feature_position['timediff_last_corr_answ']] = np.nan
        features[i,feature_position['timediff_last_incorr_answ']] = np.nan
        
        for l in range(2,8):
            features[i,feature_position['time_between_questions_{}lag'.format(l)]] = np.nan
        for l in range(2,5):    
            features[i,feature_position['timediff_{}_last_incorr_answ'.format(l)]] = np.nan
        
        features[i,feature_position['previous_questions']]=0
        features[i,feature_position['answered_correctly_lagged_sum']]=np.nan
        features[i,feature_position['previous_answered_correctly']]=np.nan
        features[i,feature_position['previous_questions_difficulty']]=np.nan
        features[i,feature_position['prev_que_answ_corr']]=np.nan
        features[i,feature_position['prev_que_answ_incorr']]=np.nan
        
        features[i,feature_position['prev_que_answ_corr_min']] = np.nan
        features[i,feature_position['prev_que_answ_incorr_max']] = np.nan
        
        features[i,feature_position['answer_share']]=np.nan
        
        features[i,feature_position['previous_answered_correctly_rolling']] = np.nan
        features[i,feature_position['previous_answered_correctly_rolling_200']] = np.nan
        features[i,feature_position['previous_answered_correctly_rolling_100']] = np.nan
        features[i,feature_position['previous_answered_correctly_rolling_50']] = np.nan
        features[i,feature_position['previous_answered_correctly_rolling_20']] = np.nan
        features[i,feature_position['previous_answered_correctly_rolling_10']] = np.nan
        
        user_dict[user_id]['prior_question_elapsed_time_sum'] = 0
        
        user_dict[user_id]['prior_question_elapsed_time_mean'] = np.nan
        user_dict[user_id]['prior_question_had_explanation_mean'] = np.nan
        
        user_dict[user_id]['answered_correctly_history']=bitarray()
        user_dict[user_id]['answered_correctly_history'].append(answ_corr)
        
        user_dict[user_id]['timestamp_lst'] = deque([], maxlen=7)
        user_dict[user_id]['timestamp_answ_corr']=deque([],maxlen=1)
        user_dict[user_id]['timestamp_answ_incorr']=deque([],maxlen=4)
              
    else:
        features[i,feature_position['time_between_questions']] = time - user_dict[user_id]['timestamp']
        try:
            features[i,feature_position['timediff_last_corr_answ']] = time - user_dict[user_id][
            'timestamp_answ_corr'][-1]
        except IndexError:  # no correct answer recorded yet (empty deque)
            features[i,feature_position['timediff_last_corr_answ']] = np.nan
        try:
            features[i,feature_position['timediff_last_incorr_answ']] = time - user_dict[user_id][
                'timestamp_answ_incorr'][-1]
        except IndexError:  # no incorrect answer recorded yet (empty deque)
            features[i,feature_position['timediff_last_incorr_answ']] = np.nan
        
        for l in range(2,8):
            if user_dict[user_id]['previous_questions'] <= l:
                features[i,feature_position['time_between_questions_{}lag'.format(l)]] = np.nan
            else:
                features[i,feature_position['time_between_questions_{}lag'.format(l)]] = time - user_dict[
                    user_id]['timestamp_lst'][-l]
 
        for l in range(2,5):
            if len(user_dict[user_id]['timestamp_answ_incorr']) < l:
                features[i,feature_position['timediff_{}_last_incorr_answ'.format(l)]] = np.nan
            else:
                features[i,feature_position['timediff_{}_last_incorr_answ'.format(l)]] = time - user_dict[
                    user_id]['timestamp_answ_incorr'][-l]
          
        features[i,feature_position['previous_questions']]=previous_questions_temp
        features[i,feature_position['answered_correctly_lagged_sum']]=answered_correctly_lagged_sum_temp
        features[i,feature_position['previous_answered_correctly'
                                   ]] = answered_correctly_lagged_sum_temp / previous_questions_temp
        
        features[i,feature_position['previous_questions_difficulty']]=previous_questions_difficulty_temp
        features[i,feature_position['prev_que_answ_corr']]=previous_questions_answ_corr_temp
        features[i,feature_position['prev_que_answ_incorr']]=previous_questions_answ_incorr_temp
        
        features[i,feature_position['prev_que_answ_corr_min']] = prev_que_answ_corr_min_temp
        features[i,feature_position['prev_que_answ_incorr_max']] = prev_que_answ_incorr_max_temp
        
        features[i,feature_position['answer_share']]=user_dict[user_id]['answer_{}_share'.format(int(
            corr_answ))]
        
        if len(user_dict[user_id]['answered_correctly_history'])>350:
            features[i,feature_position['previous_answered_correctly_rolling']] = sum(
                answered_correctly_history_temp[-350:]) / 350
        else:
            features[i,feature_position['previous_answered_correctly_rolling']] = sum(
                answered_correctly_history_temp) / previous_questions_temp
        
        for r_w in [200, 100, 50, 20, 10]:
            if len(user_dict[user_id]['answered_correctly_history'])>r_w:
                features[i,feature_position['previous_answered_correctly_rolling_{}'.format(r_w)]] = sum(
                answered_correctly_history_temp[-r_w:]) / r_w
            else:
                features[i,feature_position['previous_answered_correctly_rolling_{}'.format(r_w)]] = sum(
                answered_correctly_history_temp) / previous_questions_temp
        
        user_dict[user_id]['prior_question_elapsed_time_sum'] += elap_time
        
        user_dict[user_id]['prior_question_elapsed_time_mean'] = user_dict[user_id][
            'prior_question_elapsed_time_sum'] / previous_questions_temp
        
        user_dict[user_id]['prior_question_had_explanation_mean'] = user_dict[user_id][
            'prior_question_had_explanation_sum'] / user_dict[user_id]['previous_questions']
        
        user_dict[user_id]['answered_correctly_history'].append(answ_corr)
    
    user_dict[user_id]['previous_answered_correctly'] = user_dict[user_id]['answered_correctly_sum'
                                                            ] / user_dict[user_id]['previous_questions']
    user_dict[user_id]['answered_correctly_mean_sum'] += que_corr
    user_dict[user_id]['answered_correctly_mean_weight'] += (que_corr * answ_corr)
    user_dict[user_id]['answered_incorrectly_mean_weight'] += (que_corr * (1-answ_corr))
        
    user_dict[user_id]['previous_questions_difficulty'] =  user_dict[user_id]['answered_correctly_mean_sum'
                                                        ] / user_dict[user_id]['previous_questions']
    user_dict[user_id]['prev_que_answ_corr'] =  user_dict[user_id]['answered_correctly_mean_weight'
                                                ] / user_dict[user_id]['answered_correctly_sum']
    user_dict[user_id]['prev_que_answ_incorr'] =  user_dict[user_id]['answered_incorrectly_mean_weight'
                                                    ] / user_dict[user_id]['answered_incorrectly_sum']
    
    # A stored value of 0 (the defaultdict default) means "no such question yet"; it is replaced by
    # the first qualifying question. It must not be set to NaN, because comparisons with NaN are
    # always False and the value could then never be updated.
    if ((que_corr < user_dict[user_id]['prev_que_answ_corr_min']) or (user_dict[user_id][
        'prev_que_answ_corr_min']==0)) and (answ_corr==1):
        user_dict[user_id]['prev_que_answ_corr_min'] = que_corr
        
    if ((que_corr > user_dict[user_id]['prev_que_answ_incorr_max']) or (user_dict[user_id][
        'prev_que_answ_incorr_max']==0)) and (answ_corr==0):
        user_dict[user_id]['prev_que_answ_incorr_max'] = que_corr
        
    user_dict[user_id]['prior_question_had_explanation_sum'] += exp
    
    user_dict[user_id]['answer_{}_share'.format(user_answ)] = user_dict[user_id][
        'answer_{}'.format(user_answ)] / user_dict[user_id]['previous_questions']
    
    features[i,feature_position['prior_question_had_explanation_mean']] = user_dict[user_id][
        'prior_question_had_explanation_mean']
    features[i,feature_position['prior_question_elapsed_time_mean']] = user_dict[user_id][
        'prior_question_elapsed_time_mean']
    features[i,feature_position['prior_question_elapsed_time_sum']] = user_dict[user_id][
        'prior_question_elapsed_time_sum']
    
    # Snapshot of the user's state after this interaction. It is read as the "previous" state on
    # the next row, which works because the rows are sorted by user_id and timestamp.
    previous_questions_temp = user_dict[user_id]['previous_questions']
    answered_correctly_lagged_sum_temp = user_dict[user_id]['answered_correctly_sum']
    previous_questions_difficulty_temp = user_dict[user_id]['previous_questions_difficulty']
    previous_questions_answ_corr_temp = user_dict[user_id]['prev_que_answ_corr']
    previous_questions_answ_incorr_temp = user_dict[user_id]['prev_que_answ_incorr']
    prev_que_answ_corr_min_temp = user_dict[user_id]['prev_que_answ_corr_min']
    prev_que_answ_incorr_max_temp = user_dict[user_id]['prev_que_answ_incorr_max']
    answered_correctly_history_temp = user_dict[user_id]['answered_correctly_history']
    
    user_dict[user_id]['timestamp'] = time
    user_dict[user_id]['timestamp_lst'].append(time)

    if answ_corr == 1:
        user_dict[user_id]['timestamp_answ_corr'].append(time)
    else:
        user_dict[user_id]['timestamp_answ_incorr'].append(time)
    
del user_dict
gc.collect()
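
The lagged time features in the loop above rely on a bounded `deque` per user that holds the most recent timestamps. A stripped-down sketch of that pattern for a single user, using only the standard library, is shown below; the function name `lagged_time_diffs` and the toy timestamps are made up for illustration.

```python
from collections import deque

def lagged_time_diffs(timestamps, max_lag=3):
    """For each interaction, time since the 1st, 2nd, ... max_lag-th previous interaction.

    Entries are None where the history is too short for that lag.
    """
    recent = deque(maxlen=max_lag)  # old timestamps drop out automatically
    rows = []
    for t in timestamps:
        rows.append([t - recent[-lag] if len(recent) >= lag else None
                     for lag in range(1, max_lag + 1)])
        recent.append(t)  # recorded only after the features were computed
    return rows

print(lagged_time_diffs([0, 10, 25, 45], max_lag=2))
# [[None, None], [10, None], [15, 25], [20, 35]]
```

Bounding the deque with `maxlen` keeps memory per user constant regardless of how long the user's history is.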

### Second question clustering

In [None]:
X_clustering = features[:,[1,2,3,4,5,15,16,17,37,39]]
X_clustering = np.concatenate((questions.reshape(-1,1),np.nan_to_num(X_clustering), np.nan_to_num(
    question_had_explanation.reshape(-1,1)), np.nan_to_num(question_elapsed_time.reshape(-1,1)
    ),np.nan_to_num(timestamp.reshape(-1,1))), axis=1)
X_clustering = pd.DataFrame(X_clustering).groupby(0).mean()
scaler = StandardScaler()
X_clustering = pd.DataFrame(scaler.fit_transform(X_clustering))
clustering = KMeans(20, n_init=30, max_iter=1000)
X_clustering['clusters_que'] = clustering.fit_predict(X_clustering)
X_clustering.set_index(np.unique(questions), inplace=True)
que_cluster_dict = X_clustering[['clusters_que']].to_dict(orient='index')
for i, content_id in enumerate(tqdm(questions)):
    features[i,feature_position['question_clusters_2']] = que_cluster_dict[content_id]['clusters_que']+1

### Training the model

In [None]:
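# Categorical columns: tag1, tag2, tag3, part, question_clusters, question_clusters_2
categorical_features = [6, 7, 8, 9, 43, 44]

train_data = lgb.Dataset(features[train_val_split==0], label=y[train_val_split==0],
                         categorical_feature=categorical_features)
val_data = lgb.Dataset(features[train_val_split==1], label=y[train_val_split==1], reference=train_data)

# Note: with 'bagging_freq': 0 bagging is disabled, so 'bagging_fraction' has no effect.
param = {'num_leaves':200, 'objective':'binary', 'learning_rate':.1, 'bagging_fraction':0.5, 'bagging_freq':0,
         'lambda_l2': 0.1, 'zero_as_missing':True}
param['metric'] = ['auc']
# The verbose_eval and early_stopping_rounds arguments were removed from lgb.train() in LightGBM 4.0;
# the callbacks below are the equivalent.
lgbm = lgb.train(param, train_data, num_boost_round=1500, valid_sets=[train_data, val_data],
                 callbacks=[lgb.early_stopping(stopping_rounds=20), lgb.log_evaluation(period=10)])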

### Saving the model for inference

In [None]:
lgbm.save_model('lgbm_model.txt', num_iteration=lgbm.best_iteration)