# Implementation of ROC AUC Score

## Introduction

This post is a continuation of the [ROC and AUC Interpretation](https://maitbayev.github.io/posts/roc-auc). 
Please make sure that you understand that post before reading this one. 

In this post, we will implement the ROC AUC score in Python with $O(n\log n)$ runtime complexity.

[Subscribe](https://maitbayev.substack.com/subscribe) to get a notification about future posts. 

## Explanation

In [1]:
#| echo: false
#| output: false

import numpy as np
import pandas as pd
import plotly.graph_objects as go

LIGHT_RED = "#ff8886"
LIGHT_GREEN = "lightgreen"
DARK_RED = "#8b0000"

def make_dataframe(positives, negatives, threshold: float = 0):
    positives = np.array(positives)
    negatives = np.array(negatives)
    df_positives = pd.DataFrame({
        "score": positives,
        "target": np.ones(len(positives)),
        "color": np.full(len(positives), LIGHT_GREEN),
        "stroke_width": (positives > threshold) * 3,
        "stroke_color": np.full(len(positives), "green"),
    })
    df_negatives = pd.DataFrame({
        "score": negatives,
        "target": np.zeros(len(negatives)),
        "color": np.full(len(negatives), LIGHT_RED),
        "stroke_width": (negatives > threshold) * 3,
        "stroke_color": np.full(len(negatives), DARK_RED),
    })
    df = pd.concat([df_positives, df_negatives])
    df.sort_values("score", ascending=False, inplace=True)
    return df


def figure_auc1d(positives, negatives, threshold: float = 0.5, reverse=True):
    df = make_dataframe(positives, negatives, threshold)
    return go.Figure(data=[
        go.Scatter(
            x=df["score"],
            y=np.full(len(df), 0.5),
            mode="markers",
            marker=dict(
                size=df["score"] * 40,
                color=df["color"],
                opacity=1,
                line=dict(
                    width=df["stroke_width"],
                    color=df["stroke_color"]
                ),
            )
        ),
        go.Scatter(
            x=[threshold, threshold],
            y=[0, 1],
            mode="lines",
            line=dict(
                color="black",
                dash="dot",
            ),
            visible=(0 <= threshold <= 1)
        )
    ], layout=go.Layout(
        plot_bgcolor="#ffffff",
        height=100,
        margin=dict(l=5, r=5, t=20, b=20),
        xaxis=dict(
            linecolor="#cccccc",
            autorange="reversed" if reverse else True,
            nticks=10,
            range=[0, 1],
        ),
        yaxis=dict(visible=False),
        legend=dict(visible=False)
    ))

As in the [previous post](https://maitbayev.github.io/posts/roc-auc), we have:

- A dataset with positive and negative items
- An ML model that predicts a probability score from 0 to 1, representing the probability that the input belongs to the positive class

We want to compute the ROC AUC score of our model predictions. The algorithm that we are going to implement is explained more easily with a visualization (press the play button):

In [2]:
#| echo: false
from IPython.display import HTML


HTML('''
<video controls style='width: 94%; max-width: 600px; display: block; margin: auto; padding: 10px;'>
<source src='images/algo.mp4' type='video/mp4'>
</video>
''')

This is a slightly modified visualization from [the other post](https://maitbayev.github.io/posts/roc-auc). A few observations from the animation:

- The ROC AUC score is the sum of the areas of trapezoids formed by two adjacent points on the ROC curve
- Some trapezoids have zero area
- We process the dataset items in order of their probability scores, from the highest to the lowest
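
To make these observations concrete, here is a minimal sketch on a made-up four-item dataset (the scores and labels are hypothetical, chosen only for illustration, and have no tied scores). It walks the items from the highest score to the lowest, records one ROC point per item, and sums the trapezoid areas using the usual formula: width times average height.

```python
import numpy as np

# Hypothetical toy data, already sorted by score (highest first)
scores = np.array([0.9, 0.8, 0.6, 0.3])
labels = np.array([1, 0, 1, 0])

num_positive = labels.sum()
num_negative = len(labels) - num_positive

# One ROC point per processed item, starting from (0, 0)
tpr = np.concatenate([[0.0], np.cumsum(labels) / num_positive])
fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / num_negative])

# Area of the trapezoid between each pair of adjacent ROC points
areas = (fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2

print(list(zip(fpr, tpr)))
# The four areas are 0, 0.25, 0 and 0.5; they sum to 0.75
print(areas, areas.sum())
```

The two zero-area trapezoids come from the positive items: a positive item moves the curve straight up, so the trapezoid has zero width.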

## Implementation

Let's set up our environment:

In [3]:
import numpy as np

np.random.seed(0)
n = 100
target = np.random.randint(0, 2, n)
predicted = np.random.rand(n)

We randomly generated targets and predicted probability scores. Let's check the result of `sklearn.metrics.roc_auc_score`:

In [4]:
import sklearn.metrics
sklearn.metrics.roc_auc_score(target, predicted)

np.float64(0.4277597402597403)

Our implementation should have the same score. 

### Trapezoid Area

First, let's implement a helper function that finds the area of the trapezoid defined by two points $(x_0, y_0)$ and $(x_1, y_1)$.

![Area of trapezoid](images/trapezoid.jpg){width=300}

To achieve this, we can add the area of the rectangle and the area of the right triangle, which is:

$$
\begin{align}
\text{Area}&=(x_1-x_0) \times y_0+\frac{1}{2}(x_1-x_0) \times (y_1-y_0)\\
&= \frac{1}{2}(x_1-x_0) \times (2y_0+y_1 - y_0)\\
&= \frac{1}{2}(x_1-x_0) \times (y_0 + y_1)\\
\end{align}
$$

Let's implement the formula in Python:

In [5]:
def trapezoid_area(p0, p1):
    return (p1[0] - p0[0]) * (p0[1] + p1[1]) / 2.0
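
As a quick sanity check, we can try the helper on a few shapes whose areas are easy to compute by hand. The function is repeated here only so that the snippet runs on its own; the test points are arbitrary.

```python
def trapezoid_area(p0, p1):
    # Same formula as above: width times the average of the two heights
    return (p1[0] - p0[0]) * (p0[1] + p1[1]) / 2.0

# Right triangle under the diagonal from (0, 0) to (1, 1)
print(trapezoid_area((0, 0), (1, 1)))      # 0.5
# Rectangle of width 0.5 and height 1
print(trapezoid_area((0.5, 1), (1, 1)))    # 0.5
# Vertical step: zero width, so zero area
print(trapezoid_area((0.5, 0), (0.5, 1)))  # 0.0
```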

### ROC AUC Score 

Now our main implementation:

In [6]:
def roc_auc_score(target, predicted):
    n = target.shape[0]
    num_positive = np.sum(target == 1)
    num_negative = n - num_positive 
    # argsort in reverse order
    order = np.argsort(predicted)[::-1]
    last = [0, 0]
    num_true_positive = 0
    num_false_positive = 0
    score = 0
    for index in range(n):
        # Emit a new ROC point only when the score changes,
        # so that items with tied scores share a single threshold
        if index == 0 or predicted[order[index]] != predicted[order[index - 1]]:
            # True positive rate
            tpr = num_true_positive / num_positive
            # False positive rate
            fpr = num_false_positive / num_negative
            # New point on the ROC curve
            cur = [fpr, tpr]
            
            score += trapezoid_area(last, cur)
            last = cur
        
        if target[order[index]] == 1:
            num_true_positive += 1
        else:
            num_false_positive += 1
    score += trapezoid_area(last, [1, 1])

    return score 

Let's verify the result:

In [7]:
roc_auc_score(target, predicted)

np.float64(0.4277597402597403)

Nice, we got exactly the same result as `sklearn`. 
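
Another way to cross-check the result, independent of `sklearn`, is the probabilistic interpretation of AUC: the fraction of (positive, negative) pairs in which the positive item gets the higher score, with ties counted as one half. Below is a minimal brute-force sketch of that idea. The name `pairwise_auc` and the toy arrays are made up for this example, and the function takes $O(n^2)$ time and memory, so it is only suitable for small inputs.

```python
import numpy as np

def pairwise_auc(target, predicted):
    """Brute-force AUC: share of (positive, negative) pairs ranked correctly."""
    target = np.asarray(target)
    predicted = np.asarray(predicted)
    pos = predicted[target == 1]
    neg = predicted[target == 0]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 3 of the 4 pairs are ordered correctly
toy_target = np.array([1, 0, 1, 0])
toy_predicted = np.array([0.9, 0.8, 0.6, 0.3])
print(pairwise_auc(toy_target, toy_predicted))  # 0.75
```

Calling `pairwise_auc(target, predicted)` on the random data above should agree with our `roc_auc_score` up to floating-point rounding.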

The code is the most precise description, but roughly our algorithm is:

1. Sort items by their predicted scores, from largest to smallest
2. Process the sorted items one by one in a loop
    1. If the current score differs from the previous one (or this is the first item):
        - Form the current point on the ROC curve: $(\frac{\text{num\_false\_positive}}{\text{num\_negative}}, \frac{\text{num\_true\_positive}}{\text{num\_positive}})$
        - Add the trapezoid area formed by the previous point and the current one
    2. If the current item is positive, then increase `num_true_positive` by one
    3. If the current item is negative, then increase `num_false_positive` by one
3. After the loop, add the trapezoid area formed by the last point and $(1, 1)$

Sorting takes $O(n\log n)$ and the loop takes $O(n)$, so the overall runtime is $O(n\log n)$.
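
The steps above can also be sketched without an explicit Python loop, using cumulative sums for the running counts. This is only one possible variant, not the implementation from this post; the name `roc_auc_score_vectorized` is made up for this example, and it assumes that both classes are present in `target`.

```python
import numpy as np

def roc_auc_score_vectorized(target, predicted):
    target = np.asarray(target)
    predicted = np.asarray(predicted)
    # Sort by predicted score, largest first
    order = np.argsort(predicted)[::-1]
    sorted_target = target[order]
    sorted_predicted = predicted[order]
    # Running counts of true and false positives after each item
    tp = np.cumsum(sorted_target == 1)
    fp = np.cumsum(sorted_target != 1)
    # Keep one ROC point per distinct score: the last item of each tied group
    keep = np.append(sorted_predicted[1:] != sorted_predicted[:-1], True)
    # Prepend the starting point (0, 0); the last kept point is always (1, 1)
    tpr = np.concatenate([[0.0], tp[keep] / tp[-1]])
    fpr = np.concatenate([[0.0], fp[keep] / fp[-1]])
    # Sum of the trapezoid areas between adjacent ROC points
    return np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)

# Hypothetical toy data
print(roc_auc_score_vectorized(np.array([1, 0, 1, 0]),
                               np.array([0.9, 0.8, 0.6, 0.3])))  # 0.75
```

The sort still dominates the cost, so this variant is also $O(n\log n)$.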

## The End

I hope you enjoyed this post.
