## Beta distributions for CTR regularization



In [None]:
import pandas as pd
import numpy as np
import plotnine as p9
import statistics as stats

### Some example data

This data has impressions and clicks for each query and resource, which we will convert to a click-through rate (`ctr`).

In [None]:
# read in the downloaded data
df = pd.read_csv("click_log.csv")
df

In [None]:
# create a new column for CTR
df['ctr'] = df['clicks'] / df['impressions']

In [None]:
# calculate the average CTR as a variable
mean_ctr = stats.mean(df['ctr'])

In [None]:
# calculate the sample variance of CTR
var_ctr = stats.variance(df['ctr'])

In [None]:
# calculate alpha + beta by matching the mean and variance of a Beta distribution (method of moments)
alpha_beta = mean_ctr * (1 - mean_ctr)  / var_ctr - 1
alpha_beta
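
The expression above comes from matching moments. As a short derivation, using the standard moments of a $\mathrm{Beta}(\alpha, \beta)$ distribution with mean $\mu$ and variance $\sigma^2$:

$$
\mu = \frac{\alpha}{\alpha + \beta}, \qquad
\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} = \frac{\mu(1 - \mu)}{\alpha + \beta + 1}
$$

Solving the variance equation for the sum, and then using the mean to split it:

$$
\alpha + \beta = \frac{\mu(1 - \mu)}{\sigma^2} - 1, \qquad
\alpha = \mu(\alpha + \beta), \qquad
\beta = (1 - \mu)(\alpha + \beta)
$$

In this notebook $\mu$ and $\sigma^2$ are replaced by the sample values `mean_ctr` and `var_ctr`. The estimate is only usable when $\sigma^2 < \mu(1 - \mu)$; otherwise $\alpha + \beta$ is not positive.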

In [None]:
# calculate alpha
alpha = mean_ctr * alpha_beta
alpha

In [None]:
# calculate beta
beta = (1-mean_ctr) * alpha_beta
beta

In [None]:
# calculate expected CTR (the prior mean; equals mean_ctr by construction)
expected_ctr = alpha / (alpha + beta)
expected_ctr
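
The moment-matching steps above can be wrapped into one function and sanity-checked against data with a known answer. The sketch below is illustrative only: `fit_beta_prior` is a hypothetical helper name, and the CTRs are synthetic draws from a Beta(2, 50) distribution rather than values from `click_log.csv`.

```python
import random
import statistics


def fit_beta_prior(ctrs):
    """Method-of-moments estimate of (alpha, beta) from a list of CTRs."""
    mean = statistics.mean(ctrs)
    var = statistics.variance(ctrs)  # sample variance, as in the cells above
    if not 0 < var < mean * (1 - mean):
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    total = mean * (1 - mean) / var - 1  # this is alpha + beta
    return mean * total, (1 - mean) * total


# Synthetic CTRs with a known prior (assumption: for illustration only).
rng = random.Random(0)
sample = [rng.betavariate(2.0, 50.0) for _ in range(10_000)]

alpha_hat, beta_hat = fit_beta_prior(sample)
print(f"alpha ~ {alpha_hat:.2f}, beta ~ {beta_hat:.2f}")

# The prior mean reproduces the sample mean by construction.
print(alpha_hat / (alpha_hat + beta_hat), statistics.mean(sample))
```

With enough rows the recovered parameters should land near the true 2 and 50. On real click logs the fit is rougher, because each observed CTR also carries sampling noise from its own impression count.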

### Focus on the query 'duck'

To show how this adjusts our CTR metric, let's look at just a single query and a handful of documents. You should notice that the adjusted values are all closer to the global average than they were as raw CTR values. The magnitude of this shift depends on the number of observations.

The concept is to shrink each resource's CTR towards the global average CTR. The less data we have on a given resource, the more we trust the global average. The more data we have for a resource, the more we trust its own CTR.

This process gives robust estimates when we only have a few observations, which is often the case for search.
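
A small numeric sketch can make the shrinkage concrete. The prior parameters here are hypothetical round numbers chosen for illustration (prior mean 0.02), not the values fitted from the data, and `adjusted_ctr` is a helper name introduced only for this example.

```python
# Hypothetical prior, chosen only for illustration:
# prior mean = 2 / (2 + 98) = 0.02
prior_alpha = 2.0
prior_beta = 98.0


def adjusted_ctr(clicks, impressions, alpha, beta):
    """Shrunken CTR: add alpha pseudo-clicks and (alpha + beta) pseudo-impressions."""
    return (alpha + clicks) / (alpha + beta + impressions)


# Both resources have a raw CTR of 0.10 but very different sample sizes.
sparse = adjusted_ctr(clicks=1, impressions=10, alpha=prior_alpha, beta=prior_beta)
dense = adjusted_ctr(clicks=1000, impressions=10000, alpha=prior_alpha, beta=prior_beta)

print(f"sparse resource: {sparse:.4f}")  # 0.0273
print(f"dense resource:  {dense:.4f}")   # 0.0992
```

The resource with 10 impressions is pulled most of the way to the prior mean, while the one with 10,000 impressions barely moves from its raw CTR.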


In [None]:
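# keep the first 100 rows for the query 'duck'
ducks = df.loc[df['term'] == 'duck']
# copy so that adding a column later does not write into a slice of df
ducks = ducks.head(100).copy()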
ducks

In [None]:
# calculate adjusted CTR: add alpha pseudo-clicks and (alpha + beta) pseudo-impressions
ducks['adjusted_ctr'] = (alpha + ducks['clicks']) / (alpha + beta + ducks['impressions'])
ducks
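
One way to read the update above is as a weighted average of the raw CTR and the prior mean. Writing $c$ for clicks and $n$ for impressions (symbols introduced here for illustration, with $n > 0$):

$$
\frac{\alpha + c}{\alpha + \beta + n}
= w \, \frac{c}{n} + (1 - w) \, \frac{\alpha}{\alpha + \beta},
\qquad w = \frac{n}{\alpha + \beta + n}
$$

When $n$ is small relative to $\alpha + \beta$, the weight $w$ is close to 0 and the estimate sits near the global average. As $n$ grows, $w$ approaches 1 and the raw CTR dominates. In this sense $\alpha + \beta$ behaves like a count of pseudo-impressions contributed by the prior.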

In [None]:
# visualize it to see the shift

ducks_long = pd.melt(ducks[['resource_id', 'ctr', 'adjusted_ctr', 'impressions']], id_vars=['resource_id', 'impressions'], value_vars=['ctr', 'adjusted_ctr'])

# wrap the expression in parentheses (not braces) so the plot object is built and displayed
(
    p9.ggplot(ducks_long, p9.aes('value', 'resource_id', color = 'variable', size = 'impressions')) +
    p9.geom_vline(xintercept = expected_ctr, linetype='dashed') +
    p9.geom_point(alpha = .5) +
    p9.scale_x_log10()
)