# Answer correctness prediction with SparkXshards on Orca

Copyright 2016 The BigDL Authors.

SparkXshards in Orca allows users to process large-scale datasets with existing Python code in a distributed, data-parallel fashion. This notebook is an example of feature engineering for [LightGBM](https://github.com/intel-analytics/BigDL/blob/main/python/dllib/src/bigdl/dllib/nnframes/tree_model.py) using SparkXshards on Orca.

This notebook is adapted from [Riiid! Answer Correctness Prediction EDA. Modeling](https://www.kaggle.com/code/isaienkov/riiid-answer-correctness-prediction-eda-modeling).


In [None]:
# import necessary libraries
from bigdl.orca import init_orca_context, stop_orca_context
import bigdl.orca.data.pandas
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Start an OrcaContext
sc = init_orca_context(memory="8g")

## Load data in parallel and get general information

Load the data into `data_shards`, a SparkXshards that can be operated on in parallel. Each element of `data_shards` is a pandas DataFrame read from a file on the cluster. Users can distribute the local code `pd.read_csv(dataFile)` by calling `bigdl.orca.data.pandas.read_csv(datapath)` instead. The full dataset is larger than 5 GB, so you can sample a small portion to run the flow quickly.

In [None]:
path = './answer_correctness/train.csv'
used_data_types_list = [
    'timestamp',
    'user_id',
    'content_id',
    'answered_correctly',
    'prior_question_elapsed_time',
    'prior_question_had_explanation'
]
data_shards = bigdl.orca.data.pandas.read_csv(path,
                                             usecols=used_data_types_list,
                                             index_col=0)
# sample data_shards if needed
data_shards = data_shards.sample(0.01)

In [4]:
# show the first couple of rows in the data_shards
data_shards.head(5)

Unnamed: 0,timestamp,user_id,content_id,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
1693389,2877721343,35865606,1900,1,26333.0,True
1978897,8385893042,41563837,183,1,26666.0,True
245353,3720820653,4663746,6152,0,13000.0,True
2103373,22086593201,43827287,1113,1,17000.0,True
647272,1027906533,13149581,6799,1,57750.0,True


In [5]:
# count total number of rows in the data_shards
len(data_shards)

                                                                                

1012301

## Feature engineering

Keep the first 90% of the rows in each shard for training.

In [6]:
def get_feature(df):
    feature_df = df.iloc[:int(9/10 * len(df))]
    return feature_df
feature_shards = data_shards.transform_shard(get_feature)
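
The per-shard behavior of this step can be illustrated locally. The following is a minimal sketch, assuming a plain Python list of pandas DataFrames as a stand-in for a SparkXshards and a list comprehension as a stand-in for `transform_shard`; the toy data is hypothetical. It shows that the 90% cut is taken from each shard separately, not from the dataset as a whole.

```python
import pandas as pd


def get_feature(df):
    # Keep the first 90% of the rows of a single shard.
    return df.iloc[:int(9 / 10 * len(df))]


# Two in-memory DataFrames stand in for two shards (hypothetical toy data).
shards = [
    pd.DataFrame({"user_id": range(10)}),
    pd.DataFrame({"user_id": range(10, 30)}),
]

# Local analogue of transform_shard: apply the function to every shard.
feature_shards = [get_feature(df) for df in shards]
kept = [len(df) for df in feature_shards]
print(kept)
```

Because the slice is positional, the rows that are kept depend on how the rows are ordered inside each shard.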

                                                                                

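`answered_correctly` is the target. Keep only the rows that carry a valid label by dropping rows where it is -1 (in the Riiid dataset, -1 marks lecture rows rather than questions).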

In [7]:
def get_train_questions_only(df):
    train_questions_only_df = df[df['answered_correctly'] != -1]
    return train_questions_only_df
train_questions_only_shards = feature_shards.transform_shard(get_train_questions_only)

                                                                                

Extract per-user statistical features of `answered_correctly`: mean, count, standard deviation, and skewness.

In [8]:
train_questions_only_shards = \
    train_questions_only_shards.group_by(columns='user_id', agg={"answered_correctly": ['mean',
                                                                                       'count',
                                                                                       'stddev',
                                                                                       'skewness']
                                                                }, join=True)
train_questions_only_shards.head(5)

                                                                                

Unnamed: 0,user_id,timestamp,content_id,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation,avg(answered_correctly),count(answered_correctly),stddev_samp(answered_correctly),skewness(answered_correctly)
0,921758,808038,7218,1,24000.0,False,0.5,2,0.707107,0.0
1,921758,39606020,6475,0,15000.0,True,0.5,2,0.707107,0.0
2,1263038,55294,7963,1,16000.0,False,1.0,2,0.0,
3,1263038,438857,1249,1,14000.0,True,1.0,2,0.0,
4,2706752,685394944,5669,1,17000.0,True,1.0,1,,
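
The empty cells in the output above come from groups where a statistic is undefined. The following is a minimal sketch in plain Python, assuming that the skewness column is the population skewness `m3 / m2**1.5` and that the standard deviation column is the sample standard deviation; both helper functions are illustrative and are not part of the Orca API.

```python
import statistics


def skewness(values):
    """Population skewness m3 / m2**1.5, or None when the variance is zero."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return None if m2 == 0 else m3 / m2 ** 1.5


def sample_stddev(values):
    """Sample standard deviation, or None when fewer than two values exist."""
    return statistics.stdev(values) if len(values) > 1 else None


mixed = [1, 0]        # like user 921758: one correct, one incorrect answer
all_correct = [1, 1]  # like user 1263038: two correct answers
single = [1]          # like user 2706752: a single answer

for name, group in [("mixed", mixed), ("all_correct", all_correct), ("single", single)]:
    print(name, sample_stddev(group), skewness(group))
```

A group with identical answers has zero variance, so its skewness is undefined, and a group with a single answer has no sample standard deviation either. These are the cells that the next step fills.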


Fill missing values with 0.5, and convert the binary feature `prior_question_had_explanation` to an integer.

In [9]:
def transform(df, val):
    train_df = df.fillna(value=val)
    train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].astype(int)
    return train_df
train_shards = train_questions_only_shards.transform_shard(transform, 0.5)
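
To see what `transform` does to a single shard, the following minimal sketch applies the same function to a small pandas DataFrame. The three rows are hypothetical and only mimic the shape of the grouped output.

```python
import numpy as np
import pandas as pd


def transform(df, val):
    # Replace every missing value with val, then cast the boolean flag to 0/1.
    train_df = df.fillna(value=val)
    train_df["prior_question_had_explanation"] = (
        train_df["prior_question_had_explanation"].astype(int)
    )
    return train_df


# Hypothetical rows: NaN marks statistics that were undefined for a user.
df = pd.DataFrame({
    "prior_question_had_explanation": [False, True, True],
    "stddev_samp(answered_correctly)": [0.707107, 0.0, np.nan],
    "skewness(answered_correctly)": [0.0, np.nan, np.nan],
})

out = transform(df, 0.5)
print(out)
```

Note that `fillna` with a scalar applies to every column, so any missing value in any column becomes 0.5.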


                                                                                

Assemble the feature columns into a single `features` column and rename the label column to `label`.

In [10]:
feature_cols = ['prior_question_elapsed_time', 'prior_question_had_explanation', 'avg(answered_correctly)', 'count(answered_correctly)', 'stddev_samp(answered_correctly)', 'skewness(answered_correctly)']

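def assembly(df):
    # Rename the target column to 'label' and keep it as the label Series.
    y = df.rename({'answered_correctly': 'label'}, axis=1)['label']
    # Pack the feature columns of each row into one list.
    df['features'] = df[feature_cols].values.tolist()
    df = df['features']
    # Combine features and label into a two-column DataFrame.
    df1 = pd.concat([df, y], axis=1)
    return df1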

train_shards = train_shards.transform_shard(assembly)
train_shards.head(5)

                                                                                

Unnamed: 0,features,label
0,"[24000.0, 0.0, 0.5, 2.0, 0.7071067811865476, 0.0]",1
1,"[15000.0, 1.0, 0.5, 2.0, 0.7071067811865476, 0.0]",0
2,"[16000.0, 0.0, 1.0, 2.0, 0.0, 0.5]",1
3,"[14000.0, 1.0, 1.0, 2.0, 0.0, 0.5]",1
4,"[17000.0, 1.0, 1.0, 1.0, 0.5, 0.5]",1
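
The integer columns appear as floats (`0.0`, `1.0`, `2.0`) in the assembled lists because `.values` on mixed integer and float columns yields a single float array. The following is a minimal sketch on a small pandas DataFrame with hypothetical rows and a shortened feature list; it builds the result without modifying the input frame, which is a variation on the cell above rather than the same code.

```python
import pandas as pd

# Shortened, illustrative feature list.
feature_cols = [
    "prior_question_elapsed_time",
    "prior_question_had_explanation",
    "count(answered_correctly)",
]

df = pd.DataFrame({
    "answered_correctly": [1, 0],
    "prior_question_elapsed_time": [24000.0, 15000.0],
    "prior_question_had_explanation": [0, 1],
    "count(answered_correctly)": [2, 2],
})

# Mixed int/float columns are upcast to float64 before being turned into lists.
features = df[feature_cols].values.tolist()
assembled = pd.DataFrame({"features": features, "label": df["answered_correctly"]})
print(assembled)
```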


LightGBM currently requires the training data to be a Spark DataFrame whose feature column holds pyspark `DenseVector` values, so the SparkXshards must be converted to a Spark DataFrame.

In [11]:
from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.sql.functions import udf, col
sdf = train_shards.to_spark_df()
sdf = sdf.withColumn('features', udf(lambda x: DenseVector(x), VectorUDT())(col('features')))

## Train a LightGBM model

Build a LightGBMClassifier

In [12]:
from bigdl.dllib.nnframes.tree_model import LightGBMClassifier, LightGBMClassifierModel

params = {"boosting_type": "gbdt", "num_leaves": 70, "learning_rate": 0.3,
          "min_data_in_leaf": 20, "objective": "binary",
          'num_iterations': 1000,
          'max_depth': 14,
          'lambda_l1': 0.01,
          'lambda_l2': 0.01,
          'bagging_freq': 5,
          'max_bin': 255,
          'early_stopping_round': 20
          }

estimator = LightGBMClassifier(params)

Train a model

In [13]:
model = estimator.fit(sdf)


Make predictions

In [14]:
predictions = model.transform(sdf)
predictions.show(5, False)

2022-11-23 23:11:40 WARN  DAGScheduler:69 - Broadcasting large task binary with size 7.2 MiB
+--------------------------------------------+-----+------------------------------------------+-------------------------------------------+----------+
|features                                    |label|rawPrediction                             |probability                                |prediction|
+--------------------------------------------+-----+------------------------------------------+-------------------------------------------+----------+
|[24000.0,0.0,0.5,2.0,0.7071067811865476,0.0]|1    |[0.18786715530061926,-0.18786715530061926]|[0.5468291372132481,0.45317086278675195]   |0.0       |
|[15000.0,1.0,0.5,2.0,0.7071067811865476,0.0]|0    |[-0.22410400024965002,0.22410400024965002]|[0.44420730923050533,0.5557926907694947]   |1.0       |
|[16000.0,0.0,1.0,2.0,0.0,0.5]               |1    |[-27.236102957272454,27.236102957272454]  |[1.4843681839238343E-12,0.9999999999985156]|1.0       |
|

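
The displayed numbers suggest how the three output columns relate. The following is a minimal sketch in plain Python, assuming that each `probability` entry is the logistic sigmoid of the matching `rawPrediction` entry and that `prediction` is the index of the larger probability. This relationship is inferred from the first row shown above, not taken from the library documentation.

```python
import math


def sigmoid(x):
    # Logistic function mapping a raw score to a value in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


# rawPrediction of the first row in the output above.
raw = [0.18786715530061926, -0.18786715530061926]

prob = [sigmoid(r) for r in raw]
prediction = float(prob.index(max(prob)))
print(prob, prediction)
```

Under this assumption the two probabilities sum to 1, and the first row is classified as 0.0 although its label is 1.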