# Efficient Implementation of ROC AUC Score

## Introduction

This post is a continuation of the [ROC and AUC Interpretation](https://maitbayev.github.io/posts/roc-auc). 
Please make sure that you understand that post before reading this one. 

In this post, we will implement an efficient ROC AUC Score in Python with $O(n\log n)$ runtime complexity.


## Explanation

In [1]:
#| echo: false
#| output: false

import numpy as np
import pandas as pd
import plotly.graph_objects as go

LIGHT_RED = "#ff8886"
LIGHT_GREEN = "lightgreen"
DARK_RED = "#8b0000"

def make_dataframe(positives, negatives, threshold: float = 0):
    positives = np.array(positives)
    negatives = np.array(negatives)
    df_positives = pd.DataFrame({
        "score": positives,
        "target": np.ones(len(positives)),
        "color": np.full(len(positives), LIGHT_GREEN),
        "stroke_width": (positives > threshold) * 3,
        "stroke_color": np.full(len(positives), "green"),
    })
    df_negatives = pd.DataFrame({
        "score": negatives,
        "target": np.zeros(len(negatives)),
        "color": np.full(len(negatives), LIGHT_RED),
        "stroke_width": (negatives > threshold) * 3,
        "stroke_color": np.full(len(negatives), DARK_RED),
    })
    df = pd.concat([df_positives, df_negatives])
    df.sort_values("score", ascending=False, inplace=True)
    return df


def figure_auc1d(positives, negatives, threshold: float = 0.5, reverse=True):
    df = make_dataframe(positives, negatives, threshold)
    return go.Figure(data=[
        go.Scatter(
            x=df["score"],
            y=np.full(len(df), 0.5),
            mode="markers",
            marker=dict(
                size=df["score"] * 40,
                color=df["color"],
                opacity=1,
                line=dict(
                    width=df["stroke_width"],
                    color=df["stroke_color"]
                ),
            )
        ),
        go.Scatter(
            x=[threshold, threshold],
            y=[0, 1],
            mode="lines",
            line=dict(
                color="black",
                dash="dot",
            ),
            visible=(0 <= threshold <= 1)
        )
    ], layout=go.Layout(
        plot_bgcolor="#ffffff",
        height=100,
        margin=dict(l=5, r=5, t=20, b=20),
        xaxis=dict(
            linecolor="#cccccc",
            autorange="reversed" if reverse else True,
            nticks=10,
            range=[0, 1],
        ),
        yaxis=dict(visible=False),
        legend=dict(visible=False)
    ))

Let's say we have:

- A dataset with positive and negative items
- An ML model that predicts a probability score from 0 to 1, representing the probability that the input belongs to the positive class.

As in the previous post, we can show the dataset and its predicted probabilities in a single plot:

In [8]:
#| echo: false
positives = [0.25, 0.35, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
negatives = [0.1, 0.15, 0.2, 0.3, 0.4, 0.6, 0.75, 0.95]

figure_auc1d(positives, negatives, threshold=1.1, reverse=False).show(renderer="iframe")

Positive items in the dataset are shown in green, and negative items are shown in red. The sizes of circles represent the predicted probability scores, with smaller circles representing scores close to 0 and larger circles representing scores close to 1.

Each choice of threshold then gives us a point on the ROC curve, with x representing the false positive rate (fpr) and y representing the true positive rate (tpr). For example:

In [9]:
figure_auc1d(positives, negatives, threshold=0.5, reverse=False).show(renderer="iframe")

With the sample data above, and counting an item as predicted positive when its score is strictly greater than the threshold, a threshold of 0.5 gives us the point $(\frac{3}{8}, \frac{5}{8})$, whereas a threshold of 0.4 gives us the point $(\frac{3}{8}, \frac{6}{8})$. To get the ROC AUC score, we need to sum the areas of all trapezoids defined by adjacent points:

![trapezoids](images/trapezoids.jpg)

The sum of the areas of the 5 trapezoids shown above is the ROC AUC score.
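As a sanity check, the ROC point for any threshold can be computed directly from the sample scores. This is a minimal sketch: the helper name `roc_point` is hypothetical, and it assumes an item is predicted positive when its score is strictly greater than the threshold, matching the highlighting in the plots.

```python
import numpy as np

# Sample scores from the plots above
positives = np.array([0.25, 0.35, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
negatives = np.array([0.1, 0.15, 0.2, 0.3, 0.4, 0.6, 0.75, 0.95])


def roc_point(threshold):
    """Return (fpr, tpr), treating score > threshold as a positive prediction."""
    # Fraction of negatives wrongly predicted positive
    fpr = float(np.mean(negatives > threshold))
    # Fraction of positives correctly predicted positive
    tpr = float(np.mean(positives > threshold))
    return fpr, tpr


print(roc_point(0.5))  # (0.375, 0.625), i.e. (3/8, 5/8)
print(roc_point(0.4))  # (0.375, 0.75), i.e. (3/8, 6/8)
```

Lowering the threshold from 0.5 to 0.4 picks up one more positive and no new negatives, so the curve moves straight up.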

### Setup

Let's set up our environment:

In [3]:
import numpy as np

np.random.seed(0)
n = 100
target = np.random.randint(0, 2, n)
predicted = np.random.rand(n)

In [4]:
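# Import the submodule explicitly; `import sklearn` alone may not load `sklearn.metrics`
from sklearn.metrics import roc_auc_score

roc_auc_score(target, predicted)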

np.float64(0.4277597402597403)

### Trapezoid Area

![Area of trapezoid](images/trapezoid.jpg)

We want to find the area of the trapezoid defined by the points $(x_0, y_0)$ and $(x_1, y_1)$, as shown in the picture above. Its area is the sum of the areas of the rectangle and the right triangle:

$$
\begin{align}
\text{Area}&=(x_1-x_0) \times y_0+\frac{1}{2}(x_1-x_0) \times (y_1-y_0)\\
&= \frac{1}{2}(x_1-x_0) \times (2y_0+y_1 - y_0)\\
&= \frac{1}{2}(x_1-x_0) \times (y_0 + y_1)
\end{align}
$$

Let's implement the formula in Python:

In [5]:
def trapezoid_area(p0, p1):
    return (p1[0] - p0[0]) * (p0[1] + p1[1]) / 2.0
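A few hand-checkable cases help confirm the formula. The sketch below repeats the helper so that it runs on its own; the example points are chosen for illustration and are not taken from the dataset.

```python
def trapezoid_area(p0, p1):
    # Points are (x, y) pairs; the width is x1 - x0 and the mean height is (y0 + y1) / 2
    return (p1[0] - p0[0]) * (p0[1] + p1[1]) / 2.0


# Right triangle under the diagonal from (0, 0) to (1, 1)
print(trapezoid_area((0, 0), (1, 1)))  # 0.5

# Rectangle of width 0.5 and height 1
print(trapezoid_area((0.5, 1), (1, 1)))  # 0.5

# Vertical step: fpr unchanged, tpr increases, so the width and the area are zero
print(trapezoid_area((0.25, 0.5), (0.25, 0.75)))  # 0.0
```

The last case is why a threshold change that only adds true positives contributes nothing to the area by itself.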

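Now we sweep the threshold from the highest score to the lowest. Each unique score yields a new point (fpr, tpr) on the ROC curve, and we accumulate the area of the trapezoid between consecutive points. Sorting dominates the cost, so the total runtime is $O(n\log n)$:

In [6]:
def fast_roc_auc_score(target, predicted):
    n = target.shape[0]
    num_positive = np.sum(target == 1)
    num_negative = n - num_positive

    # Indices of the predictions, sorted from the highest to the lowest score
    order = np.argsort(predicted)[::-1]
    last = [0, 0]
    num_true_positive = 0
    num_false_positive = 0
    score = 0
    for index in range(n):
        # Make sure that the new threshold is unique
        if index == 0 or predicted[order[index]] != predicted[order[index - 1]]:
            # True positive rate
            tpr = num_true_positive / num_positive
            # False positive rate
            fpr = num_false_positive / num_negative
            # New point on the ROC curve: x is fpr, y is tpr
            cur = [fpr, tpr]

            score += trapezoid_area(last, cur)
            last = cur

        if target[order[index]] == 1:
            num_true_positive += 1
        else:
            num_false_positive += 1
    # Close the curve at the top-right corner (1, 1)
    score += trapezoid_area(last, [1, 1])

    return score

In [7]:
fast_roc_auc_score(target, predicted)

np.float64(0.4277597402597403)

This agrees with the scikit-learn score computed earlier, although the last few digits may differ because of floating-point rounding.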
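One way to gain confidence in a sweep like this, including its handling of tied scores, is to compare it with a brute-force pairwise count. The sketch below assumes the standard reading of AUC as the probability that a random positive scores above a random negative, with ties counted as one half. The names `pairwise_auc` and `sweep_auc` are illustrative, and `sweep_auc` is a compact vectorized restatement of the sweep rather than the function above.

```python
import numpy as np


def pairwise_auc(target, predicted):
    """O(n^2) reference: compare every positive score with every negative score."""
    pos = predicted[target == 1]
    neg = predicted[target != 1]
    diff = pos[:, None] - neg[None, :]
    # A win counts 1, a tie counts 1/2
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


def sweep_auc(target, predicted):
    """O(n log n) sweep: sort once, then sum trapezoids between ROC points."""
    order = np.argsort(predicted)[::-1]
    scores = predicted[order]
    labels = target[order]
    tps = np.cumsum(labels == 1)  # true positives after each item
    fps = np.cumsum(labels != 1)  # false positives after each item
    # Keep only the last index of each run of tied scores
    last_of_run = np.r_[np.flatnonzero(np.diff(scores)), len(scores) - 1]
    tpr = np.r_[0, tps[last_of_run]] / tps[-1]
    fpr = np.r_[0, fps[last_of_run]] / fps[-1]
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)


rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
p = np.round(rng.random(200), 1)  # rounding forces many tied scores

print(np.isclose(pairwise_auc(y, p), sweep_auc(y, p)))
```

Both functions assume that each class appears at least once; otherwise the rates are undefined.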