<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#GBTClassifier" data-toc-modified-id="GBTClassifier-1">GBTClassifier</a></span><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-1.1">Load Data</a></span></li><li><span><a href="#Train-test-split" data-toc-modified-id="Train-test-split-1.2">Train-test split</a></span></li><li><span><a href="#Build-Pipeline" data-toc-modified-id="Build-Pipeline-1.3">Build Pipeline</a></span></li><li><span><a href="#Train-Models" data-toc-modified-id="Train-Models-1.4">Train Models</a></span></li><li><span><a href="#Predict-&amp;-Validate" data-toc-modified-id="Predict-&amp;-Validate-1.5">Predict &amp; Validate</a></span></li><li><span><a href="#Feature-Importances" data-toc-modified-id="Feature-Importances-1.6">Feature Importances</a></span></li></ul></li></ul></div>

In [20]:
# Standard lib
import time

# SparkML
from pyspark.ml.classification import GBTClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

# SparkSQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

In [21]:
# Create (or reuse) the Spark session
ss = SparkSession.builder.getOrCreate()

------------------

# GBTClassifier

## Load Data

In [22]:
# Data path
data_path = "../../data/processed/lite_300.csv"

In [23]:
df_pills = ss.read.csv(data_path, header=True, inferSchema=True)

In [24]:
# Number of observations
df_pills.count()

300

In [25]:
# Encode pill counts less than or equal to the threshold as 1, else 0.
# A native column expression avoids a Python UDF and leaves null labels as null.
thresh = 15
df_pills = df_pills.withColumn('label', (df_pills['label'] <= thresh).cast('int'))

-------------

## Train-test split

In [26]:
# 80/20 train-test split; no seed is set, so the split (and the metrics below) vary between runs
df_train, df_test = df_pills.randomSplit(weights=[0.80, 0.20])
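
`randomSplit` assigns rows by random sampling rather than cutting exact proportions, which is consistent with the 238 training rows counted below instead of exactly 240. It also accepts an optional `seed` argument for reproducible splits. The pure-Python sketch below illustrates the idea only: `bernoulli_split` is a hypothetical helper, not Spark's implementation, and the seed value 42 is arbitrary.

```python
import random


def bernoulli_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative, normalized upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(300))  # stand-in for the 300 observations
train, test = bernoulli_split(rows, [0.80, 0.20], seed=42)
print(len(train), len(test))  # sizes are close to 240/60 but not guaranteed to be exact
```

Running it again with the same seed reproduces the same split, while a different seed generally yields different split sizes.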

-------

## Build Pipeline

In [27]:
# Feature columns: every column except the target "label"
feature_cols = [c for c in df_train.columns if c != "label"]

In [28]:
# Transformer; assembles the feature columns into a single vector
va = VectorAssembler(outputCol="features", inputCols=feature_cols)

# Estimators
gbt = GBTClassifier(maxIter=200, maxDepth=3)
lr = LogisticRegression(regParam=0, maxIter=1000, fitIntercept=True)

In [29]:
# Assemble features
train_lpoints = va.transform(df_train).select("features", "label").cache()
test_lpoints = va.transform(df_test).select("features", "label").cache()

In [30]:
train_lpoints.groupBy("label").count().show(2)

+-----+-----+
|label|count|
+-----+-----+
|    1|  159|
|    0|   79|
+-----+-----+
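
The classes are imbalanced, so it helps to know how well a trivial "always predict the majority class" rule would do. The sketch below reuses the training counts from the table above. The class mix of the test split is not shown in this notebook, so the share computed here is only a rough reference point for the test accuracies reported later.

```python
# Label counts copied from the training-set table above
train_counts = {1: 159, 0: 79}

n_train = sum(train_counts.values())
majority_label = max(train_counts, key=train_counts.get)
majority_share = train_counts[majority_label] / n_train

print(f"n_train={n_train}, majority label={majority_label}, share={majority_share:.3f}")
# prints: n_train=238, majority label=1, share=0.668
```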



---------------

## Train Models

In [31]:
# Train Gradient Boosted Tree Model
start = time.time()
gbt_model = gbt.fit(train_lpoints)
end = time.time()

# Training time (seconds)
print(end-start)

51.556219816207886


In [32]:
# Train Logistic Regression
start = time.time()
lr_model = lr.fit(train_lpoints)
end = time.time()

# Training time (seconds)
print(end-start)

11.790694952011108
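
The start/stop timing pattern is repeated for each model. One possible way to factor it out is a small context manager, sketched below. `timed` and `timings` are hypothetical names, and `time.perf_counter` is used because it is intended for measuring elapsed intervals. The body of the `with` block is a placeholder workload. In the notebook it would be a call such as `gbt.fit(train_lpoints)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    sum(range(100_000))  # placeholder workload

print(timings)
```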


-----------

## Predict & Validate

In [33]:
# Predict
gbt_predict = gbt_model.transform(test_lpoints)
lr_predict = lr_model.transform(test_lpoints)

In [34]:
# Validate; note that the "f1" metric here is the weighted-average F1 across classes
metrics = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")

metrics.setMetricName("accuracy")
print(f"Gradient Boosted Accuracy: {metrics.evaluate(gbt_predict):.3f}")
print(f"Logistic Reg Accuracy: {metrics.evaluate(lr_predict):.3f}\n")


metrics.setMetricName("f1")
print(f"Gradient Boosted F1: {metrics.evaluate(gbt_predict):.3f}")
print(f"Logistic Reg F1: {metrics.evaluate(lr_predict):.3f}")

Gradient Boosted Accuracy: 0.742
Logistic Reg Accuracy: 0.790

Gradient Boosted F1: 0.729
Logistic Reg F1: 0.786
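
For reference, the `"f1"` metric of `MulticlassClassificationEvaluator` is a weighted F1: a per-class F1 averaged with weights equal to each class's share of the true labels. The pure-Python sketch below shows that calculation on a tiny example. `weighted_f1` is a hypothetical helper, and `y_true` / `y_pred` are made-up values, not the notebook's predictions.

```python
from collections import Counter


def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of true labels."""
    n = len(labels)
    support = Counter(labels)
    score = 0.0
    for cls, count in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * count / n
    return score


# Hypothetical labels and predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# Class 1 has F1 = 0.75 (weight 4/6) and class 0 has F1 = 0.50 (weight 2/6)
print(f"{weighted_f1(y_true, y_pred):.3f}")  # prints: 0.667
```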


-----------

## Feature Importances

In [35]:
# Feature importances: (feature index, importance) pairs, top 10 by importance
importances = gbt_model.featureImportances
top_tups_gbt = sorted(
    zip(importances.indices, importances.values),
    key=lambda x: x[1],
    reverse=True,
)[:10]

In [36]:
top_indices = [x[0] for x in top_tups_gbt[:5]]

In [37]:
top_indices

[5, 26, 42, 85, 20]

In [38]:
# Map vector indices back to the assembler's input columns (the order used to build "features")
[va.getInputCols()[index] for index in top_indices]

['0_max_gyro_y',
 '1_avg_gyro_y',
 '2_avg_gyro_x',
 '4_max_gyro_y',
 '1_min_gyro_x']
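
The ranking and name lookup above can also be combined so that each feature name stays next to its importance value. The sketch below is one way to do that in plain Python. `top_features` is a hypothetical helper, and the indices, values, and names are made-up stand-ins for `featureImportances.indices`, `featureImportances.values`, and the assembler's input columns.

```python
def top_features(indices, values, names, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(indices, values), key=lambda pair: pair[1], reverse=True)
    return [(names[i], v) for i, v in ranked[:k]]


# Made-up sparse importances: positions 0, 2 and 3 are non-zero
example_names = ["f0", "f1", "f2", "f3"]
example_indices = [0, 2, 3]
example_values = [0.1, 0.6, 0.3]

print(top_features(example_indices, example_values, example_names, k=2))
# prints: [('f2', 0.6), ('f3', 0.3)]
```

The `names` list must follow the same order as the columns passed to the `VectorAssembler`, otherwise the indices point at the wrong columns.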

--------------------