
# Area Under the Curve (AUC) in PySpark

Source code links:
- https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/mllib/src/test/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetricsSuite.scala
- https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/mllib/src/main/scala/org/apache/spark/mllib/evaluation/AreaUnderCurve.scala

by Jason Jung

## Setup

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Jason's Spark App") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

print('Spark Version:', spark.version)

Spark Version: 2.3.0


In [5]:
import pandas as pd 
import numpy as np 

import pyspark 
from pyspark.sql.functions import pandas_udf,udf, col
from pyspark.sql.types import *
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors, DenseVector

## Dataset

In [6]:
# create pandas dataframe 
df = pd.DataFrame({'yhat':[.9,.8,.7,.5,.2,.1,.9],'y':[1,1,1,0,0,0,0]})
df

   yhat  y
0   0.9  1
1   0.8  1
2   0.7  1
3   0.5  0
4   0.2  0
5   0.1  0
6   0.9  0


In [8]:
# convert pandas dataframe to spark dataframe 
sdf = spark.createDataFrame(df)
sdf.show()

+----+---+
|yhat|  y|
+----+---+
| 0.9|  1|
| 0.8|  1|
| 0.7|  1|
| 0.5|  0|
| 0.2|  0|
| 0.1|  0|
| 0.9|  0|
+----+---+



## Evaluator

`metricName` is optional and defaults to `areaUnderROC`; the other supported value is `areaUnderPR`.

In [9]:
evaluator = BinaryClassificationEvaluator(labelCol='y', rawPredictionCol="yhat", metricName="areaUnderROC")

In [43]:
roc_auc = evaluator.evaluate(sdf, {evaluator.metricName: "areaUnderROC"})
pr_auc = evaluator.evaluate(sdf, {evaluator.metricName: "areaUnderPR"})

In [52]:
print('roc auc:',roc_auc)
print('pr auc:',pr_auc)

roc auc: 0.7916666666666666
pr auc: 0.5972222222222221
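
The `areaUnderPR` value can be reproduced by hand. The following is a minimal pure-Python sketch, not Spark's implementation. It assumes, from a reading of the linked `AreaUnderCurve.scala` source, that the area is a trapezoidal sum, and that in Spark 2.3 the curve starts at recall 0 with the precision of the highest threshold. The helper names `pr_points` and `trapezoid_area` are made up for this illustration.

```python
# Scores and labels from the Dataset section.
yhat = [0.9, 0.8, 0.7, 0.5, 0.2, 0.1, 0.9]
y = [1, 1, 1, 0, 0, 0, 0]


def pr_points(y_true, y_score):
    """(recall, precision) at each distinct threshold, highest threshold first."""
    n_pos = sum(y_true)
    points = []
    for t in sorted(set(y_score), reverse=True):
        # Labels of every row predicted positive at threshold t.
        predicted_pos = [label for label, s in zip(y_true, y_score) if s >= t]
        tp = sum(predicted_pos)
        points.append((tp / n_pos, tp / len(predicted_pos)))
    return points


def trapezoid_area(points):
    """Sum of the trapezoids between consecutive (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


curve = pr_points(y, yhat)
# Assumption: anchor the curve at recall 0 with the first threshold's precision.
spark_like_pr_auc = trapezoid_area([(0.0, curve[0][1])] + curve)
print('PR AUC (anchored at first precision):', spark_like_pr_auc)
```

Under these assumptions the result agrees with the 0.5972 reported by the evaluator above.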


## sklearn auc score

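Let's check whether scikit-learn gives the same numbers. ROC AUC matches Spark exactly (0.7917). PR AUC does not: Spark reports 0.5972, while scikit-learn gives 0.6389 for `average_precision_score` and 0.6806 for the trapezoidal area under `precision_recall_curve`. The values are similar but not the same because each one summarizes the precision-recall curve differently. Average precision is a step-wise sum with no interpolation between points. The two trapezoidal areas differ in the point used at recall 0: `precision_recall_curve` ends its curve at precision 1, while Spark 2.3 appears to use the precision at the highest threshold (0.5 for this data).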

In [71]:
from sklearn.metrics import roc_auc_score, recall_score, auc, make_scorer
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.metrics import average_precision_score

print('ROC AUC:', roc_auc_score(df.y, df.yhat))
print('Avg Precision:', average_precision_score(df.y,df.yhat))

ROC AUC: 0.7916666666666666
Avg Precision: 0.6388888888888888
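
Average precision is a step-wise sum: each threshold contributes its precision weighted by the increase in recall since the previous threshold, with no interpolation between points. A minimal pure-Python sketch of that definition follows; the function name `average_precision` is chosen for this illustration and is not a scikit-learn API.

```python
# Scores and labels from the Dataset section.
yhat = [0.9, 0.8, 0.7, 0.5, 0.2, 0.1, 0.9]
y = [1, 1, 1, 0, 0, 0, 0]


def average_precision(y_true, y_score):
    """Sum over thresholds of (recall increase) * (precision at that threshold)."""
    n_pos = sum(y_true)
    ap, prev_recall = 0.0, 0.0
    for t in sorted(set(y_score), reverse=True):
        # Labels of every row predicted positive at threshold t.
        predicted_pos = [label for label, s in zip(y_true, y_score) if s >= t]
        tp = sum(predicted_pos)
        recall, precision = tp / n_pos, tp / len(predicted_pos)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


ap_by_hand = average_precision(y, yhat)
print('Avg Precision (by hand):', ap_by_hand)
```

For this data the sum is 1/3 * 1/2 + 1/3 * 2/3 + 1/3 * 3/4, which agrees with the 0.6389 printed above.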


In [78]:
# custom pr auc score 
# Reference: https://github.com/scikit-learn/scikit-learn/issues/5992
def jj_pr_auc_score(y_true, y_pred):
    """
    Input: 
        y_true - 1s and 0s
        y_pred - probabilities 
    Returns
        PR AUC Score 
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred,pos_label=1)
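    # `reorder` is omitted: recall is already monotonic here, and later
    # scikit-learn releases no longer accept that argument
    return auc(recall, precision)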

print('PR AUC:',jj_pr_auc_score(df.y,df.yhat))

PR AUC: 0.6805555555555556
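
The gap between this value and Spark's 0.5972 can be traced to the first trapezoid alone. The sketch below hard-codes the (recall, precision) points for thresholds 0.9, 0.8 and 0.7 of the data above and compares two anchors at recall 0: precision 1, which is the point `precision_recall_curve` appends, and the first threshold's precision, which is assumed here to be what Spark 2.3 uses. The variable names are illustrative.

```python
# (recall, precision) at thresholds 0.9, 0.8 and 0.7 for the data above.
# Lower thresholds stay at recall 1, so they add no area.
curve = [(1 / 3, 1 / 2), (2 / 3, 2 / 3), (1.0, 3 / 4)]


def trapezoid_area(points):
    """Sum of the trapezoids between consecutive (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Anchor at (recall 0, precision 1).
anchored_at_one = trapezoid_area([(0.0, 1.0)] + curve)
# Assumed Spark 2.3 anchor: (recall 0, precision at the first threshold).
anchored_at_first = trapezoid_area([(0.0, curve[0][1])] + curve)

gap = anchored_at_one - anchored_at_first
print('anchored at 1.0:  ', anchored_at_one)
print('anchored at first:', anchored_at_first)
print('gap:              ', gap)
```

The gap works out to 1/3 * (1 - 1/2) / 2 = 1/12, which is the difference between 0.6806 and 0.5972.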


In [77]:
# custom roc auc score 
def jj_roc_auc_score(y_true, y_pred):
    """
    Input: 
        y_true - 1s and 0s 
        y_pred - probabilities 
    Returns
        ROC AUC Score 
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_pred)
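    # `reorder` is omitted: fpr is already non-decreasing, and later
    # scikit-learn releases no longer accept that argument
    return auc(fpr, tpr)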

print('ROC AUC:',jj_roc_auc_score(df.y,df.yhat))

ROC AUC: 0.7916666666666666
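
ROC AUC also has a ranking interpretation that helps explain why Spark and scikit-learn agree on it: it is the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. A minimal pure-Python sketch follows; the function name `pairwise_roc_auc` is made up for this illustration.

```python
# Scores and labels from the Dataset section.
yhat = [0.9, 0.8, 0.7, 0.5, 0.2, 0.1, 0.9]
y = [1, 1, 1, 0, 0, 0, 0]


def pairwise_roc_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; a tie counts 0.5."""
    pos = [s for label, s in zip(y_true, y_score) if label == 1]
    neg = [s for label, s in zip(y_true, y_score) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


roc_auc_by_pairs = pairwise_roc_auc(y, yhat)
print('ROC AUC (pairwise):', roc_auc_by_pairs)
```

Here 9.5 of the 12 pairs are ranked correctly (the two 0.9 scores tie), giving the same 0.7917 as above.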
