Three types of methods for list ranking:
- **pointwise**<br>we model the relevance score of a single document
- **pairwise**<br>we model the difference between the scores of two documents, for all pairs
- **listwise**<br>we model the relevance score of the whole list

Ground-truth relevance (the target) can be obtained from two sources of data
- (offline data) assessors' estimates
- (online data) behavioral statistics

What's the problem with online data?<br>Behavior does not reflect relevance alone - it depends on many more factors:
- position of a document in the list
- the generated list itself (which can fall into a feedback loop)
- clickbaitness of a document (not obvious to the user, but noticeable to assessors)

# RankNet

Introduced in 2005 by researchers at Microsoft. Reviewed in 2010 during the rise in popularity of learning-to-rank models.

A list is perfectly ranked if, for any pair of documents, the more relevant one has the larger score. We will train our ML model to make the correct comparison for any given pair of documents.

Plan:
- feed arbitrary pairs of documents (all combinations or a subset) to a model
- get an independent relevance score for each of the two documents
- estimate the difference between the two documents as the delta of their scores
- compute a loss function that measures how close we are to the correct comparison output
- iteratively update model weights using loss function gradients
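
The first step of the plan can be sketched as follows. This is only an illustration: the helper name `make_pairs`, the label encoding (`y = 1` when the first document is more relevant, `y = 0` otherwise) and the label values are assumptions, and ties are skipped because they carry no ordering information.

```python
from itertools import combinations

def make_pairs(labels):
    """Return (i, j, y) triples for all document pairs with different labels.

    y = 1 means document i is more relevant than document j, y = 0 the opposite.
    Pairs with equal labels are skipped.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue
        pairs.append((i, j, 1 if labels[i] > labels[j] else 0))
    return pairs

# Graded relevance labels for the documents of one query (made-up values).
labels = [2, 0, 1, 2]
pairs = make_pairs(labels)
print(pairs)
```

Using all combinations grows quadratically with the list length, which is why a subset of pairs is often sampled instead.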

### Model
The RankNet model consists of two parts
- the inner part is some ML model that computes a pointwise score for each of the two documents
- the outer part combines the two scores into a single output

<img src="img/ranknet.png" width=500>

The first step of RankNet is model-agnostic => we can use any ML model. The only requirements: it has to be differentiable and output a score that correlates with the relevance of a document.

On the second step RankNet does the following
1. takes the difference of scores $s_i - s_j$
2. maps it to $(0,1)$ using the smooth sigmoid function:
<img src="img/prob.png" width=300>

Now, instead of optimizing a complex list ranking, we can model a simple binary comparison $U_i > U_j$ (the first document is more relevant than the second).
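
A minimal sketch of this second step, assuming the sigmoid has the form $1/(1+e^{-\sigma (s_i - s_j)})$ with a scale parameter `sigma`. The function name `pair_probability` and the score values are made up for illustration.

```python
import math

def pair_probability(s_i, s_j, sigma=1.0):
    """Modeled probability that document i is more relevant than document j."""
    return 1.0 / (1.0 + math.exp(-sigma * (s_i - s_j)))

p_equal = pair_probability(1.5, 1.5)   # equal scores: the model is undecided
p_ij = pair_probability(2.0, 0.5)      # document i scored higher
p_ji = pair_probability(0.5, 2.0)      # same pair, opposite order
print(p_equal, p_ij, p_ji)
```

Swapping the two documents gives the complementary probability, so the model never contradicts itself on a pair.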

### Cost function
Next we need a function that measures how far our current model's predictions are from the perfect solution. Since we are working with probabilities, a good option is **Negative LogLoss** aka **binary cross-entropy**.

#### Negative Logloss

As a quick reminder, logloss is a function that "smoothly" penalizes wrong predictions - the further the predicted probability is from the true answer, the larger the penalty.

The function is symmetric - it penalizes errors irrespective of class. The red line is the penalty for "0" class instances, the blue line for "1" class instances.

<img src="img/logloss.png" width=300>
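
To make the shape of the curves concrete, here is a small sketch that evaluates the binary logloss at a few predicted probabilities. The function name `log_loss` and the probability values are chosen only for illustration.

```python
import math

def log_loss(y, p):
    """Binary cross-entropy for one example.

    y is the true label (0 or 1), p is the predicted probability of class 1.
    """
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Penalty for a "1" instance and a "0" instance at several predictions.
for p in (0.1, 0.5, 0.9):
    print(p, round(log_loss(1, p), 3), round(log_loss(0, p), 3))
```

The penalty for class "1" at probability `p` equals the penalty for class "0" at `1 - p`, which is the symmetry mentioned above.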



---

The total Logloss of the model is the sum over all compared pairs $(i,j)$ of every query $q$:
$$L = \sum_{i,j,q} L^{i,j,q}$$ 

The loss of a single comparison is:
$$L^{i,j,q} = - y \cdot \log{(P)} - (1-y) \cdot \log{(1-P)}$$

where the label $y$ is
$$y = \begin{cases} 0, \text{   if   } U_i > U_j \text{   was labeled as False} \\ 1, \text{   if   } U_i > U_j \text{   was labeled as True} \end{cases}$$

The modeled probability was defined above as (with $\Delta s = s_i - s_j$):
$$P=\frac{1}{1+e^{-\sigma \Delta s}}$$

The respective logarithms are:
$$\log P=-\log{(1+e^{-\sigma \Delta s})}$$
$$\log(1-P)=-\sigma \Delta s - \log{(1+e^{-\sigma \Delta s})}$$

After plugging it in:
$$L = -\sum_{i,j,q} \bigg( y \cdot \big(-\log{(1+e^{-\sigma \Delta s})}\big) + (1-y) \cdot \big(-\sigma \Delta s - \log{(1+e^{-\sigma \Delta s})}\big)\bigg)$$

for a single pair the expression reduces to:
$$L = \begin{cases} \sigma (s_i - s_j) + \log{\big(1+e^{-\sigma (s_i - s_j)}\big)} \text{,      if      } y=0 \\ \log{\big(1+e^{-\sigma (s_i - s_j)}\big)} \text{,      if      } y=1 \end{cases}$$

We know that Logloss is a symmetric function, but at first glance what we got does not look like one. The second term is the same for both classes, while the $y=0$ class carries an extra linear term. Let's rewrite this expression further to show that it is indeed symmetric.

The function $\log \big(1+e^{x}\big)$ is called **softplus**. It is often considered a "smooth" version of the ReLU activation function (for those familiar with neural network design). The function $\log \big(1+e^{-x}\big)$ is its mirror image about the vertical axis, and either of them can be used as a loss function depending on the direction of the axis.

Softplus heavily penalizes predictions on the $R^{+}$ side of the axis and is far more tolerant of predictions on the $R^{-}$ side.

<img src="img/softplus.png" width=300>

In some sense it can be seen as an extension of the logloss function to a continuous axis. Compare it to the binary Logloss.



It has a very nice property:
$$log \big( 1 + e^{-x} \big) = -x + log \big(1 + e^{x}\big)$$
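
For readers who want to check it, the identity follows from factoring $e^{-x}$ out of the argument of the logarithm:

$$\log \big(1+e^{-x}\big) = \log \Big(e^{-x}\big(e^{x}+1\big)\Big) = \log e^{-x} + \log \big(1+e^{x}\big) = -x + \log \big(1+e^{x}\big)$$

Applied with $x = \sigma (s_i - s_j)$ to $-\log(1-P) = \sigma \Delta s + \log\big(1+e^{-\sigma \Delta s}\big)$, the linear term cancels and only $\log\big(1+e^{\sigma \Delta s}\big)$ remains.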

Which allows us to further rewrite the expression and get a completely symmetric representation:
$$L = \begin{cases} log{\big(1+e^{-\sigma (s_j - s_i)}\big)} \text{,      if      } y=0 \\ log{\big(1+e^{-\sigma (s_i - s_j)}\big)} \text{,      if      } y=1  \end{cases}$$

This gives us the loss function
<img src="img/loss_function.png" width=500>
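
The symmetric form is also convenient numerically. The sketch below is one possible implementation, not a reference one: the function names and sample scores are assumptions, and `np.logaddexp(0, x)` is used to evaluate $\log(1+e^{x})$ without overflow. A naive cross-entropy version is included only to check that both forms agree.

```python
import numpy as np

def ranknet_loss(s_i, s_j, y, sigma=1.0):
    """Pairwise loss in the symmetric softplus form.

    y = 1: log(1 + exp(-sigma * (s_i - s_j)))
    y = 0: log(1 + exp(-sigma * (s_j - s_i)))
    """
    s_i, s_j, y = np.asarray(s_i, float), np.asarray(s_j, float), np.asarray(y)
    delta = np.where(y == 1, s_i - s_j, s_j - s_i)
    return np.logaddexp(0.0, -sigma * delta)

def cross_entropy_loss(s_i, s_j, y, sigma=1.0):
    """The same loss written directly as binary cross-entropy (naive form)."""
    s_i, s_j, y = np.asarray(s_i, float), np.asarray(s_j, float), np.asarray(y)
    p = 1.0 / (1.0 + np.exp(-sigma * (s_i - s_j)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

s_i = np.array([2.0, 0.3, -1.0])
s_j = np.array([0.5, 0.3, 1.0])
y = np.array([1, 0, 1])
print(ranknet_loss(s_i, s_j, y))
print(cross_entropy_loss(s_i, s_j, y))
```

For very large score differences the naive form produces `log(0)`, while the softplus form stays finite.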



## Model Training

The purpose of training is to find the model weights (we will denote them $w_k$) that minimize the loss function. <br>To do so we use the standard SGD approach
$$w_k = w_k - \eta \frac{\partial{L}}{\partial{w_k}}$$

The loss function depends directly on two variables, $L=L(s_i,s_j)$ => we can unroll its derivative with the chain rule as follows (below the cost is written as $C$; it is the same quantity as $L$):

$$\frac{\partial C}{\partial w_k} = \frac{\partial C}{\partial s_i} \cdot  \frac{\partial s_i}{\partial w_k} + \frac{\partial C}{\partial s_j} \cdot \frac{\partial s_j}{\partial w_k}$$

Let's write the first derivative (here $S_{ij} \in \{-1, 1\}$ is the true label) <br>It shows how a change in the output score $s_i$ affects the cost value
<img src="img/first_derivative.png" width=300>

Notice that the derivatives with respect to the two scores differ only in sign<br>The same change in the cost can be achieved either by raising the first score or by lowering the second one, which is straightforward since the cost depends only on the difference.

$$\frac{\partial C}{\partial w_k} = \frac{\partial C}{\partial s_i} \cdot \bigg( \frac{\partial s_i}{\partial w_k} - \frac{\partial s_j}{\partial w_k} \bigg)$$
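
The sign symmetry can be checked numerically. The sketch below assumes the 0/1 label $y$ from the cost function section (rather than $S_{ij}$), for which differentiating the cross-entropy gives $\frac{\partial C}{\partial s_i} = \sigma (P - y)$. The function names and sample values are made up, and central finite differences serve as an independent check.

```python
import numpy as np

def pair_loss(s_i, s_j, y, sigma=1.0):
    """Pair loss in the stable softplus form (y is 0 or 1)."""
    delta = s_i - s_j if y == 1 else s_j - s_i
    return np.logaddexp(0.0, -sigma * delta)

def loss_grad_scores(s_i, s_j, y, sigma=1.0):
    """Analytic derivatives of the pair loss with respect to s_i and s_j."""
    p = 1.0 / (1.0 + np.exp(-sigma * (s_i - s_j)))
    d_si = sigma * (p - y)
    return d_si, -d_si  # the two derivatives differ only in sign

s_i, s_j, y, eps = 0.4, 1.1, 1, 1e-6
d_si, d_sj = loss_grad_scores(s_i, s_j, y)

# Central finite differences, computed without using the analytic formula.
num_si = (pair_loss(s_i + eps, s_j, y) - pair_loss(s_i - eps, s_j, y)) / (2 * eps)
num_sj = (pair_loss(s_i, s_j + eps, y) - pair_loss(s_i, s_j - eps, y)) / (2 * eps)
print(d_si, num_si)
print(d_sj, num_sj)
```

Shifting both scores by the same amount leaves the loss unchanged, which is another way to see that only the difference matters.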

Interpretation of $\frac{\partial L}{\partial s_i}$

This illustrates the fact that the change in the loss function depends on the difference between the two score derivatives<br>If, for example, a weight update improves $s_i$ but pulls $s_j$ in the same direction by the same amount, there will be no improvement in L

Let's write the second derivative - how a change in a weight affects the output score. It's quite straightforward and depends on the model type, which can be any differentiable model
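
As one concrete case, assume a linear scorer $s(x) = w \cdot x$, so that $\frac{\partial s}{\partial w_k} = x_k$. The sketch below puts the pieces together into a plain SGD loop on synthetic data. The linear model, the toy data, the learning rate and all variable names are assumptions made for illustration, not part of RankNet itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for one query: 6 documents with 3 features each (synthetic).
X = rng.normal(size=(6, 3))
true_w = np.array([1.0, -2.0, 0.5])   # hidden scorer, used only to create labels
relevance = X @ true_w

# Ordered pairs (i, j) where i is more relevant than j, so y = 1 for each pair.
pairs = [(i, j) for i in range(6) for j in range(6) if relevance[i] > relevance[j]]

def total_loss(w, sigma=1.0):
    """Sum of pair losses log(1 + exp(-sigma * (s_i - s_j))) over all pairs."""
    return sum(np.logaddexp(0.0, -sigma * (X[i] @ w - X[j] @ w)) for i, j in pairs)

w = np.zeros(3)
eta, sigma = 0.1, 1.0
loss_before = total_loss(w)

for epoch in range(200):
    for i, j in pairs:
        s_i, s_j = X[i] @ w, X[j] @ w
        p = 1.0 / (1.0 + np.exp(-sigma * (s_i - s_j)))
        dC_dsi = sigma * (p - 1.0)           # y = 1 for every pair
        # Linear model: ds_i/dw = X[i] and ds_j/dw = X[j].
        grad = dC_dsi * (X[i] - X[j])
        w = w - eta * grad                   # SGD update

loss_after = total_loss(w)
correct = sum((X[i] @ w) > (X[j] @ w) for i, j in pairs)
print(loss_before, loss_after)
print(correct, "of", len(pairs), "pairs ordered correctly")
```

With a neural network in place of the linear scorer, `X[i] - X[j]` would be replaced by the difference of the backpropagated score gradients.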



References:
- <a href="https://www.researchgate.net/publication/221345726_Learning_to_Rank_using_Gradient_Descent">Learning to Rank using Gradient Descent</a>

BTW even when the predicted probability exactly equals the observed positive/negative ratio for a pair, the penalty is non-zero (unless that ratio is 0 or 1) => the loss reaches zero only on unambiguous cases.

*Logloss is often called cross-entropy loss. In the next paragraph there is a quick aside on why Logloss is a case of cross-entropy.

<img src="img/cross_entropy.png" width=500>