# Riiid! Answer Correctness Prediction
## Introduction
In this competition you will predict whether each student answers each question correctly. The test set is served as a series of question batches: you receive a batch, submit your predictions for it, and only then move on to the next batch.

This competition is different from most Kaggle Competitions in that:
* You can only submit from Kaggle Notebooks
* You must use our custom **`riiideducation`** Python module.  The purpose of this module is to control the flow of information to ensure that you are not using future data to make predictions.  If you do not use this module properly, your code may fail.

In this Starter Notebook, we'll show how to use the **`riiideducation`** module to get the test features and make predictions.

## TL;DR: End-to-End Usage Example
```
import pandas as pd
import riiideducation

env = riiideducation.make_env()

# Training data is in the competition dataset as usual
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', low_memory=False)
train_my_model(train_df)

iter_test = env.iter_test()
for (test_df, sample_prediction_df) in iter_test:
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
```
Note that `train_my_model` is a function you need to write for the above example to work.
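
As a rough idea of what such a function could look like, here is a minimal sketch of a constant baseline. The body of `train_my_model` and the toy frame are assumptions made for illustration only, not part of the competition API; the sketch also assumes that lecture rows are marked with `content_type_id == 1` and `answered_correctly == -1`.

```python
import pandas as pd


def train_my_model(train_df: pd.DataFrame) -> float:
    """Hypothetical baseline: return the overall fraction of correct answers."""
    # Keep question rows only (assumption: lectures have content_type_id == 1).
    questions = train_df[train_df['content_type_id'] == 0]
    return float(questions['answered_correctly'].mean())


# Tiny hand-made frame standing in for train.csv
toy_train = pd.DataFrame({
    'content_type_id': [0, 0, 1, 0, 0],
    'answered_correctly': [1, 0, -1, 1, 1],
})
baseline = train_my_model(toy_train)
print(baseline)
```

The returned constant could then replace the hard-coded `0.5` in the prediction loop.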

## In-depth Introduction
First let's import the module and create an environment.

In [1]:
import riiideducation
import pandas as pd
import os

# You can only call make_env() once, so don't lose it!
env = riiideducation.make_env()

In [2]:
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

In [3]:
# The full train.csv is too large to load in one go, so this cell reads only the first 10**6 rows

train_data = pd.read_csv(
    '/kaggle/input/riiid-test-answer-prediction/train.csv', 
    low_memory=False, 
    nrows=10**6, 
    dtype={
        'row_id': 'int64', 
        'timestamp': 'int64', 
        'user_id': 'int32', 
        'content_id': 'int16', 
        'content_type_id': 'int8',
        'task_container_id': 'int16', 
        'user_answer': 'int8', 
        'answered_correctly': 'int8', 
        'prior_question_elapsed_time': 'float32', 
        'prior_question_had_explanation': 'boolean'
    }
)

train_data

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,0,0,115,5692,0,1,3,1,,
1,1,56943,115,5716,0,2,2,1,37000.0,False
2,2,118363,115,128,0,0,0,1,55000.0,False
3,3,131167,115,7860,0,3,0,1,19000.0,False
4,4,137965,115,7922,0,4,1,1,11000.0,False
...,...,...,...,...,...,...,...,...,...,...
999995,999995,26482248,20949024,8803,0,29,1,1,14000.0,True
999996,999996,26516686,20949024,4664,0,30,3,1,17000.0,True
999997,999997,26537967,20949024,4108,0,31,1,0,18000.0,True
999998,999998,26590240,20949024,5014,0,32,3,0,6000.0,True
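
The cell above limits memory use with `nrows` and narrow dtypes. Another option, sketched here on a small in-memory CSV rather than the real file, is to stream the file with `chunksize` and accumulate statistics chunk by chunk. The toy data and the variable names are assumptions for illustration.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for train.csv
csv_text = (
    "row_id,user_id,answered_correctly\n"
    "0,115,1\n"
    "1,115,0\n"
    "2,124,1\n"
    "3,124,1\n"
    "4,124,0\n"
)

total_rows = 0
correct_sum = 0
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2,
                         dtype={'answered_correctly': 'int8'}):
    total_rows += len(chunk)
    correct_sum += int(chunk['answered_correctly'].sum())

print(total_rows, correct_sum / total_rows)
```

Only one chunk is held in memory at a time, so the same loop could in principle cover the whole file.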


In [4]:
#print('Fill missing values instead of dropping rows')
#train_data['prior_question_elapsed_time'] = train_data['prior_question_elapsed_time'].fillna(train_data['prior_question_elapsed_time'].mean())
#train_data['prior_question_had_explanation'] = train_data['prior_question_had_explanation'].fillna(False)

In [5]:
train_df_dna = train_data.dropna()
print(train_df_dna.isnull().sum() / len(train_df_dna))
train_df_dna.shape

row_id                            0.0
timestamp                         0.0
user_id                           0.0
content_id                        0.0
content_type_id                   0.0
task_container_id                 0.0
user_answer                       0.0
answered_correctly                0.0
prior_question_elapsed_time       0.0
prior_question_had_explanation    0.0
dtype: float64


(976277, 10)

In [6]:
# value_counts() returns a Series (index: user_id, values: number of rows),
# so there are no columns to rename
ds = train_df_dna['user_id'].value_counts()
ds

7171715     10796
1283420      7475
18122922     7412
9418512      7260
4421282      6959
            ...  
10855907        6
4280793         6
1946295         2
2148001         1
15960740        1
Name: user_id, Length: 3822, dtype: int64

In [7]:
train_questions_only_df = train_df_dna[train_df_dna['answered_correctly']!=-1]
train_questions_only_df

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
1,1,56943,115,5716,0,2,2,1,37000.0,False
2,2,118363,115,128,0,0,0,1,55000.0,False
3,3,131167,115,7860,0,3,0,1,19000.0,False
4,4,137965,115,7922,0,4,1,1,11000.0,False
5,5,157063,115,156,0,5,2,1,5000.0,False
...,...,...,...,...,...,...,...,...,...,...
999995,999995,26482248,20949024,8803,0,29,1,1,14000.0,True
999996,999996,26516686,20949024,4664,0,30,3,1,17000.0,True
999997,999997,26537967,20949024,4108,0,31,1,0,18000.0,True
999998,999998,26590240,20949024,5014,0,32,3,0,6000.0,True


In [8]:
grouped_by_user_df = train_questions_only_df.groupby('user_id')
user_answers_df = grouped_by_user_df.agg({'answered_correctly': ['mean', 'count', 'std', 'skew']}).copy()
user_answers_df.columns = ['mean_user_accuracy', 'questions_answered', 'std_user_accuracy', 'skew_user_accuracy']
user_answers_df

user_id,mean_user_accuracy,questions_answered,std_user_accuracy,skew_user_accuracy
115,0.688889,45,0.468179,-0.844439
124,0.206897,29,0.412251,1.527297
2746,0.611111,18,0.501631,-0.498374
5382,0.669355,124,0.472354,-0.728823
8623,0.638889,108,0.482562,-0.586492
...,...,...,...,...
20913319,0.632242,397,0.482804,-0.550582
20913864,0.300000,20,0.470162,0.945300
20938253,0.609943,523,0.488230,-0.452101
20948951,0.620000,50,0.490314,-0.509877


In [9]:
grouped_by_content_df = train_questions_only_df.groupby('content_id')
content_answers_df = grouped_by_content_df.agg({'answered_correctly': ['mean', 'count', 'std', 'skew'] }).copy()
content_answers_df.columns = ['mean_accuracy', 'question_asked', 'std_accuracy', 'skew_accuracy']

content_answers_df

content_id,mean_accuracy,question_asked,std_accuracy,skew_accuracy
0,0.863014,73,0.346212,-2.156130
1,0.927273,55,0.262082,-3.383648
2,0.560811,444,0.496848,-0.245894
3,0.798995,199,0.401763,-1.503527
4,0.602606,307,0.490158,-0.421410
...,...,...,...,...
13518,0.750000,8,0.462910,-1.440165
13519,0.555556,9,0.527046,-0.271052
13520,0.700000,10,0.483046,-1.035098
13521,0.857143,7,0.377964,-2.645751
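
To make the two aggregation cells easier to follow, here is a small sketch of the same per-group statistics on a toy frame. It uses pandas named aggregation instead of renaming the columns afterwards; the toy data is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2, 3],
    'answered_correctly': [1, 1, 0, 1, 0, 1, 1],
})

# One output column per keyword: new_name='aggregation'
user_stats = toy.groupby('user_id')['answered_correctly'].agg(
    mean_user_accuracy='mean',
    questions_answered='count',
    std_user_accuracy='std',
    skew_user_accuracy='skew',
)
print(user_stats)
```

User 3 has a single answer, so the sample standard deviation is undefined and comes back as `NaN`. Gaps like this are why the merged frame needs a fill value before training.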


In [10]:
del grouped_by_user_df
del grouped_by_content_df

In [11]:
features = [
    'mean_user_accuracy', 
    'questions_answered',
    'std_user_accuracy', 
    'skew_user_accuracy',
    'mean_accuracy', 
    'question_asked',
    'std_accuracy', 
    'prior_question_elapsed_time', 
    'prior_question_had_explanation',
    'skew_accuracy'
]
target = 'answered_correctly'

In [12]:
train_df_merged = train_df_dna.merge(user_answers_df, how='left', on='user_id')
train_df_merged = train_df_merged.merge(content_answers_df, how='left', on='content_id')

# Fill any NaN left after the merges (e.g. std or skew for groups with very few rows) with 0.5
train_df_merged = train_df_merged.fillna(0.5)
print(train_df_merged.isnull().sum() / len(train_df_merged))

train_df_merged

row_id                            0.0
timestamp                         0.0
user_id                           0.0
content_id                        0.0
content_type_id                   0.0
task_container_id                 0.0
user_answer                       0.0
answered_correctly                0.0
prior_question_elapsed_time       0.0
prior_question_had_explanation    0.0
mean_user_accuracy                0.0
questions_answered                0.0
std_user_accuracy                 0.0
skew_user_accuracy                0.0
mean_accuracy                     0.0
question_asked                    0.0
std_accuracy                      0.0
skew_accuracy                     0.0
dtype: float64


Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation,mean_user_accuracy,questions_answered,std_user_accuracy,skew_user_accuracy,mean_accuracy,question_asked,std_accuracy,skew_accuracy
0,1,56943,115,5716,0,2,2,1,37000.0,False,0.688889,45,0.468179,-0.844439,0.768595,242,0.422605,-1.281734
1,2,118363,115,128,0,0,0,1,55000.0,False,0.688889,45,0.468179,-0.844439,0.959184,49,0.199915,-4.789271
2,3,131167,115,7860,0,3,0,1,19000.0,False,0.688889,45,0.468179,-0.844439,0.926108,203,0.262241,-3.282080
3,4,137965,115,7922,0,4,1,1,11000.0,False,0.688889,45,0.468179,-0.844439,0.959064,171,0.198723,-4.674816
4,5,157063,115,156,0,5,2,1,5000.0,False,0.688889,45,0.468179,-0.844439,0.940887,203,0.236420,-3.766807
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
976272,999995,26482248,20949024,8803,0,29,1,1,14000.0,True,0.297872,47,0.462267,0.913373,0.688312,77,0.466221,-0.829364
976273,999996,26516686,20949024,4664,0,30,3,1,17000.0,True,0.297872,47,0.462267,0.913373,0.772871,317,0.419639,-1.308758
976274,999997,26537967,20949024,4108,0,31,1,0,18000.0,True,0.297872,47,0.462267,0.913373,0.493776,723,0.500307,0.024950
976275,999998,26590240,20949024,5014,0,32,3,0,6000.0,True,0.297872,47,0.462267,0.913373,0.839080,87,0.369587,-1.878090


In [13]:
from sklearn.model_selection import train_test_split

# Cast the nullable boolean column before splitting, so that no value is set on
# a slice of the frame (which would trigger pandas' SettingWithCopyWarning)
train_df_merged['prior_question_had_explanation'] = train_df_merged['prior_question_had_explanation'].astype(bool)
train_df, test_df = train_test_split(train_df_merged, random_state=666, test_size=0.2)


In [14]:
#import optuna
#from lightgbm import LGBMClassifier
#from sklearn.metrics import roc_auc_score
#from optuna.samplers import TPESampler

#sampler = TPESampler(seed=666)

#def create_model(trial):
#    num_leaves = trial.suggest_int("num_leaves", 2, 31)
#    n_estimators = trial.suggest_int("n_estimators", 50, 300)
#    max_depth = trial.suggest_int('max_depth', 3, 8)
#    min_child_samples = trial.suggest_int('min_child_samples', 100, 1200)
#    learning_rate = trial.suggest_uniform('learning_rate', 0.0001, 0.99)
#    min_data_in_leaf = trial.suggest_int('min_data_in_leaf', 5, 90)
#    bagging_fraction = trial.suggest_uniform('bagging_fraction', 0.0001, 1.0)
#    feature_fraction = trial.suggest_uniform('feature_fraction', 0.0001, 1.0)
#    model = LGBMClassifier(
#        num_leaves=num_leaves,
#        n_estimators=n_estimators, 
#        max_depth=max_depth, 
#        min_child_samples=min_child_samples, 
#        min_data_in_leaf=min_data_in_leaf,
#        learning_rate=learning_rate,
#        feature_fraction=feature_fraction,
#        random_state=666
#)
#    return model

#def objective(trial):
#    model = create_model(trial)
#    model.fit(train_df[features], train_df[target])
#    score = roc_auc_score(test_df[target].values, model.predict_proba(test_df[features])[:,1])
#    return score



In [15]:
import optuna
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from optuna.samplers import TPESampler

sampler = TPESampler(seed=666)

def create_model(trial):
    # Search space for one Optuna trial; suggest_float replaces the deprecated suggest_uniform
    max_depth = trial.suggest_int('max_depth', 3, 8)
    learning_rate = trial.suggest_float('learning_rate', 0.0001, 0.99)
    min_child_weight = trial.suggest_int('min_child_weight', 1, 10)
    subsample = trial.suggest_float('subsample', 0.5, 1.0)
    model = XGBClassifier(
        max_depth=max_depth,
        learning_rate=learning_rate,
        min_child_weight=min_child_weight,
        subsample=subsample,
        random_state=666
    )
    return model


def objective(trial):
    model = create_model(trial)
    model.fit(train_df[features], train_df[target])
    score = accuracy_score(test_df[target].values, model.predict(test_df[features]))
    return score

In [16]:
# Use optuna to optimize parameters
#study = optuna.create_study(direction="maximize", sampler=sampler)
#study.optimize(objective, n_trials=100)
#params = study.best_params
#params['random_state'] = 666

In [17]:
#params = {
#    'bagging_fraction': 0.5817242323514327,
#    'feature_fraction': 0.6884588361650144,
#    'learning_rate': 0.42887924851375825, 
#    'max_depth': 6,
#    'min_child_samples': 946, 
#    'min_data_in_leaf': 47, 
#    'n_estimators': 169,
#    'num_leaves': 29,
#    'random_state': 666
#}

#model = LGBMClassifier(**params)
#model.fit(train_df[features], train_df[target])
#print('LGB score: ', roc_auc_score(test_df[target].values, model.predict_proba(test_df[features])[:,1]))

In [18]:
params = {
    'learning_rate': 0.0930381630776193, 
    'max_depth': 7,
    'min_child_weight': 8, 
    'subsample': 0.8470702875072027, 
    'random_state': 666
}

model = XGBClassifier(**params)
model.fit(train_df[features], train_df[target])
print('Accuracy score: ', accuracy_score(test_df[target].values, model.predict(test_df[features])))

Accuracy score:  0.7252734871143525
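
The commented-out LightGBM cells score with `roc_auc_score`, whereas the XGBoost cell scores hard labels with accuracy. Since the submission loop sends probabilities, a ranking metric may be the more informative check. The sketch below contrasts the two on made-up labels and probabilities (both arrays are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up ground truth and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.6, 0.7, 0.9, 0.4, 0.1])

# Accuracy needs hard labels, so the probabilities are thresholded at 0.5
acc = accuracy_score(y_true, (y_prob >= 0.5).astype(int))
# ROC AUC works on the probabilities directly and ignores the threshold
auc = roc_auc_score(y_true, y_prob)

print(f'accuracy={acc:.4f}  auc={auc:.4f}')
```

With the fitted model, the equivalent call would be `roc_auc_score(test_df[target].values, model.predict_proba(test_df[features])[:, 1])`.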


## `iter_test` function

Generator which loops through each batch of questions in the test set. You have direct access to the example test rows for your convenience, but your code will only be able to get rows from the real test set via the API. Once you call **`predict`** you can continue on to the next batch.

Yields:
* While there are more batch(es) and `predict` was called successfully since the last yield, yields a tuple of:
    * `test_df`: DataFrame with the test features for the next batch, and user responses for the previous batch.
    * `sample_prediction_df`: DataFrame with an example prediction.  Intended to be filled in and passed back to the `predict` function.
* If `predict` has not been called successfully since the last yield, prints an error and yields `None`.
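
Because `make_env()` can only be called once, it can help to rehearse the loop locally first. The class below is a hypothetical stand-in written for this purpose; `MockEnv` and its internals are assumptions that only imitate the contract described above, not the actual `riiideducation` implementation.

```python
import pandas as pd


class MockEnv:
    """Hypothetical local stand-in that imitates the iter_test/predict contract."""

    def __init__(self, batches):
        self._batches = batches
        self._predicted = True  # nothing is pending before the first batch
        self.predictions = []

    def iter_test(self):
        for test_df in self._batches:
            if not self._predicted:
                print('You must call `predict()` before moving on to the next batch.')
                yield None
                return
            self._predicted = False
            sample_prediction_df = test_df[['row_id']].assign(answered_correctly=0.5)
            yield test_df, sample_prediction_df

    def predict(self, prediction_df):
        self.predictions.append(prediction_df)
        self._predicted = True


# Two toy batches; the second row of the first batch is a lecture
batches = [
    pd.DataFrame({'row_id': [0, 1], 'content_type_id': [0, 1]}),
    pd.DataFrame({'row_id': [2], 'content_type_id': [0]}),
]
mock_env = MockEnv(batches)

for batch_df, sample_prediction_df in mock_env.iter_test():
    batch_df = batch_df.copy()
    batch_df['answered_correctly'] = 0.5
    # Only question rows are submitted
    mock_env.predict(batch_df.loc[batch_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])

submitted = pd.concat(mock_env.predictions, ignore_index=True)
print(submitted)
```

Skipping the `predict` call inside the loop makes the mock print an error and yield `None` on the next step, mirroring the behaviour listed above.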

In [19]:
# You can only iterate through a result from `env.iter_test()` once
# so be careful not to lose it once you start iterating.
iter_test = env.iter_test()

In [20]:
%%time

#submission = pd.DataFrame(columns=["row_id", "answered_correctly"])

for (test_df, sample_prediction_df) in iter_test:
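    # Attach the per-user and per-question statistics computed from the training data
    test_df = test_df.merge(user_answers_df, how='left', on='user_id')
    test_df = test_df.merge(content_answers_df, how='left', on='content_id')
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(False).astype(bool)
    # Users and questions not seen in training have no statistics; mark them with -1
    # (note that the training frame filled its gaps with 0.5 instead)
    test_df = test_df.fillna(-1)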

    test_df['answered_correctly'] = model.predict_proba(test_df[features])[:,1]
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
    
    #submission = submission.append(test_df[['row_id', 'answered_correctly']], ignore_index=True)
    
#submission.to_csv('submission.csv')

CPU times: user 1.02 s, sys: 62.9 ms, total: 1.08 s
Wall time: 484 ms


In [21]:
test_df

Unnamed: 0,row_id,timestamp,user_id,content_id,content_type_id,task_container_id,prior_question_elapsed_time,prior_question_had_explanation,prior_group_answers_correct,prior_group_responses,mean_user_accuracy,questions_answered,std_user_accuracy,skew_user_accuracy,mean_accuracy,question_asked,std_accuracy,skew_accuracy,answered_correctly
0,74,75311,275030867,8308,0,3,15000.0,False,"[1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, ...","[0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 3, 0, 0, ...",-1.0,-1.0,-1.0,-1.0,0.603774,53,0.493793,-0.436795,0.469266
1,75,31220886463,1305988022,396,0,4163,19000.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,0.721739,345,0.448793,-0.993919,0.55112
2,76,48613916248,1310228392,11869,0,1458,26333.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,0.555556,27,0.50637,-0.236981,0.440913
3,77,48613916248,1310228392,11871,0,1458,26333.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,1.0,27,0.0,0.0,0.995732
4,78,48613916248,1310228392,11870,0,1458,26333.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,0.518519,27,0.509175,-0.078558,0.400605
5,79,48613916248,1310228392,11872,0,1458,26333.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,0.222222,27,0.423659,1.416232,0.169121
6,80,48613916248,1310228392,11868,0,1458,26333.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,0.814815,27,0.395847,-1.717834,0.690569
7,81,4693192735,1637273633,5935,0,3149,19000.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,0.413043,46,0.497821,0.365228,0.445527
8,82,1254131274,674533997,6000,0,1046,10000.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,0.459119,159,0.4999,0.165638,0.412192
9,84,69704234415,2093197291,12611,0,5448,28750.0,True,-1,-1,-1.0,-1.0,-1.0,-1.0,1.0,1,-1.0,-1.0,0.995581


In [22]:
##Check with example_test
#example_test = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv', low_memory=False)
#example_test = example_test.merge(user_answers_df, how = 'left', on = 'user_id')
#example_test = example_test.merge(content_answers_df, how = 'left', on = 'content_id')
#example_test['prior_question_had_explanation'] = example_test['prior_question_had_explanation'].fillna(value = False).astype(bool)
#example_test.fillna(value = 0.5, inplace = True)

In [23]:
#example_test['answered_correctly'] = model.predict_proba(example_test[features])[:,1]
#example_test

In [24]:
## Example submission
#for (test_df, sample_prediction_df) in iter_test:
#    test_df['answered_correctly'] = 0.5
#    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])