# Week 4 - Feedback in IR

## Statistical language model

- A distribution over word sequences.
- A generative distribution for sequences of words.

### Unigram language model (LM)

Assumes words in a sentence are generated **independently**.

\begin{align}
P(w_1, \ldots, w_n \mid \theta) &= \prod_{i=1}^n P(w_i \mid \theta)\\ 
\sum_{w \in \mathcal{V}} P(w \mid \theta) &= 1
\end{align}

The second line is the normalization constraint over the vocabulary $\mathcal{V}$.

#### Maximum likelihood (ML) estimator

The following is the probability of generating a word $w$ based on a unigram model learned from a document $d$.
\begin{align}
P(w \mid \theta) = P(w \mid d) = \frac{c(w, d)}{\lvert d \rvert}
\end{align}

#### Background LM

A language model for a "background" topic: a model trained on a general collection of documents rather than on documents about a specific topic.

\begin{equation}
P(w \mid B)
\end{equation}

#### Collection LM

\begin{equation}
P(w \mid C)
\end{equation}

#### Document LM

\begin{equation}
P(w \mid d)
\end{equation}

### Maximum-likelihood estimation

We can estimate the above probability distribution using maximum-likelihood (ML) estimation simply by finding the fraction of words in the respective cases that matches a particular word $w$. E.g.,

\begin{align}
P(w \mid d) = \frac{c(w, d)}{\lvert d \rvert}
\end{align}
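
As a minimal sketch of the estimate above, assuming the document is already tokenized into a list of words (the helper name `ml_unigram` and the toy document are illustrative, not from the course material):

```python
from collections import Counter

def ml_unigram(tokens):
    """Maximum-likelihood unigram model: P(w | d) = c(w, d) / |d|."""
    counts = Counter(tokens)   # c(w, d) for every word in the document
    total = len(tokens)        # |d|, the document length in tokens
    return {w: c / total for w, c in counts.items()}

# Hypothetical toy document, already tokenized.
doc = "the cat sat on the mat".split()
p_d = ml_unigram(doc)
```

Words that never occur in the document are simply absent from `p_d`, which corresponds to an estimated probability of zero.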

### Normalized topic model

We can use the background language model to down-weight words that are uninformative for a given topic. 
For example, words that appear frequently in **any** context have a high probability under the background LM, whereas topic-specific words do not.

One way to down-weight uninformative words is to divide the probability of each word under a document or collection LM by its probability under the background LM. In the expression below, the collection LM $P(w \mid C)$ plays the role of the background model.

\begin{equation}
 P_{\text{normalized}}(w) = \frac{P(w \mid d)}{P(w \mid C)}
\end{equation} 
 
or, if we only need to compare relative magnitudes, we can work with the log of this ratio,

\begin{equation}
 \log P_{\text{normalized}}(w) = \log P(w \mid d) - \log P(w \mid C)
\end{equation}
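
The log-ratio can be sketched as follows. The toy document and collection are made up for illustration, and both models are plain maximum-likelihood estimates, so this is only an assumption-laden example of the idea:

```python
import math
from collections import Counter

def ml_unigram(tokens):
    """Maximum-likelihood unigram model from a list of tokens."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

# Hypothetical toy data: one topical document and a small "collection"
# that contains the document plus some unrelated text.
doc = "the retrieval model ranks the documents".split()
collection = ("the cat sat on the mat the dog ate the food "
              "the retrieval model ranks the documents").split()

p_d = ml_unigram(doc)
p_c = ml_unigram(collection)

# log P(w | d) - log P(w | C) for every word that occurs in the document.
log_ratio = {w: math.log(p_d[w]) - math.log(p_c[w]) for w in p_d}

# Words sorted from most to least topic-specific.
ranked = sorted(log_ratio, key=log_ratio.get, reverse=True)
```

In this toy data the common word "the" ends up with a negative log-ratio and is ranked last, while the topical words get positive values.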



## Unigram query likelihood

\begin{equation}
P(q = (w_1, \ldots, w_N) \mid d) = \prod_{i=1}^N P(w_i \mid d)
\end{equation}


## Improved model: sampling words from a document model

If we assume that the words in a query are sampled from an existing document, then any query containing a word that does not appear in that document has zero probability of being generated from it under the independent sampling assumption.
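
The zero-probability problem is easy to reproduce. The sketch below assumes unsmoothed maximum-likelihood estimates and a made-up toy document; the function name `query_likelihood_ml` is illustrative:

```python
import math
from collections import Counter

def query_likelihood_ml(query, doc):
    """P(q | d) as a product of unsmoothed ML unigram probabilities."""
    counts = Counter(doc)  # missing words have count 0
    return math.prod(counts[w] / len(doc) for w in query)

# Hypothetical toy document, already tokenized.
doc = "the cat sat on the mat".split()

p_hit = query_likelihood_ml(["cat", "mat"], doc)   # every query word is in doc
p_miss = query_likelihood_ml(["cat", "dog"], doc)  # "dog" is not in doc
```

A single unseen query word ("dog") forces the whole product to zero, regardless of how well the remaining words match.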

A potential fix is to assume that the query is sampled from a hypothetical document in the user's mind. This document is represented by a document language model, which we estimate and then use to compute the necessary probabilities.


## Scoring function for ranking documents based on query likelihood

\begin{align}
\text{For } q &= (w_1, \ldots, w_n) , \\ 
 f(q, d) & = \log P(q \mid d) \\
 &= 
\underbrace{\sum_{i=1}^n}_{ \substack{ \text{Sum over words} \\ \text{in query} } } 
\log P(w_i \mid d) \\
 & =   
\underbrace{ \sum_{ w \in \mathcal{V} } }
_{ \substack{ \text{Sum over words} \\ \text{in vocabulary} } } 
c(w, q) \log P(w \mid d)
\end{align}
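
The last step, moving from a sum over query positions to a sum over the vocabulary weighted by $c(w, q)$, can be checked numerically. The document model below is a made-up distribution used only for illustration:

```python
import math
from collections import Counter

# Hypothetical document model P(w | d) over a tiny vocabulary (sums to 1).
p_w_given_d = {"cat": 0.5, "sat": 0.3, "mat": 0.2}
query = ["cat", "mat", "cat"]

# Sum over query positions: one log-probability per query word occurrence.
score_positions = sum(math.log(p_w_given_d[w]) for w in query)

# Sum over the vocabulary, each word weighted by its query count c(w, q).
c_q = Counter(query)
score_vocab = sum(c_q[w] * math.log(p) for w, p in p_w_given_d.items())
```

Both sums agree because words that do not occur in the query have $c(w, q) = 0$ and contribute nothing.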

Ranking therefore reduces to estimating $P(w \mid d)$.

Different estimates of $P(w \mid d)$ give different ranking functions.

## Estimating $P(w \mid d)$

We need a way to assign probability to unseen words so that they do not end up with zero probability.

Let the probability of an unseen word be proportional to its probability under a reference language model (usually the collection language model).

\begin{align}
P(w \mid d) = 
\begin{cases}
P_{\text{Seen}}(w \mid d), & \text{if } w \text{ is in } d\\
\alpha_d \underbrace{P(w \mid C)}_{ \substack{ \text{Background probability} \\ \text{as reference LM} } }, 
& \text{otherwise}
\end{cases}
\end{align}

Note that **we still need to specify how to estimate $P_{\text{Seen}}(w \mid d)$ and how to set $\alpha_d$.** This will be explained below.
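
One general constraint on $\alpha_d$ follows directly from the definition. As a sketch, assuming only that $P(w \mid d)$ must sum to one over the vocabulary and that $P(w \mid C)$ is itself a normalized distribution:

\begin{align}
\sum_{w:\, c(w, d) > 0} P_{\text{Seen}}(w \mid d) + \alpha_d \sum_{w:\, c(w, d) = 0} P(w \mid C) &= 1 \\
\alpha_d &= \frac{1 - \sum_{w:\, c(w, d) > 0} P_{\text{Seen}}(w \mid d)}{1 - \sum_{w:\, c(w, d) > 0} P(w \mid C)}
\end{align}

The second line uses $\sum_{w:\, c(w, d) = 0} P(w \mid C) = 1 - \sum_{w:\, c(w, d) > 0} P(w \mid C)$. So once $P_{\text{Seen}}(w \mid d)$ is chosen, $\alpha_d$ is determined by normalization.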

For now, we can substitute the expression for $P(w \mid d)$ into the expression for $f(q, d)$
and split the sum according to the two cases of $P(w \mid d)$ 
to obtain a smoothed ranking function.


## Ranking function with smoothing


\begin{align}
\log P(q \mid d) 
& = 
\overbrace { 
\sum_{w \in \mathcal{V}, c(w, d) > 0} 
c(w, q) \log P_{\text{Seen}}(w \mid d) 
}^{\text{Contribution from words in doc}} 
&+& 
\overbrace {
\sum_{w \in \mathcal{V}, c(w, d) = 0} 
c(w, q) \log \alpha_d P(w \mid C)
}^{ \text{Contribution from words } \textbf{not } \text{in doc} } \\
& =
\sum_{w \in \mathcal{V}, c(w, d) > 0} 
c(w, q) \log P_{\text{Seen}}(w \mid d) 
&+& 
\sum_{w \in \mathcal{V}} c(w, q) \log \alpha_d P(w \mid C) -
\sum_{w \in \mathcal{V}, c(w, d) > 0} c(w, q) 
\log \alpha_d P(w \mid C) \\
& = 
\sum_{w \in \mathcal{V}, c(w, d) > 0} 
c(w, q) \log \frac{P_{\text{Seen}}(w \mid d)}{\alpha_d P(w \mid C)}
&+& 
\sum_{w \in \mathcal{V}} c(w, q) \log \alpha_d P(w \mid C) \\
& = 
\underbrace {
\sum_{w \in q \cap d} 
}
_{ \substack{ \text{Common words} \\ \text{in query and doc} } }
c(w, q) \log 
\frac{
   \overbrace{P_{\text{Seen}}(w \mid d)}^{\text{TF weighting}}
}
{ 
  \underbrace{ \alpha_d P(w \mid C) }_{\text{IDF weighting}}
}
&+& 
\underbrace { 
n\log \alpha_d 
}_{ \substack{ \text{Document normalization.} \\
\text{Precompute } \because \\ \text{ independent of query}
} }
+
\underbrace{
\sum_{w \in \mathcal{V}} c(w, q) \log P(w \mid C)
}_{ \substack{ \text{Ignore } \because \\ \text{independent of doc} } }.
\end{align}

We then ignore the sum that is independent of the document and define the scoring function to be

\begin{align}
f(q, d) 
&= 
\sum_{w \in q \cap d} c(w, q) \log \frac{P_{\text{Seen}}(w \mid d)}{\alpha_d P(w \mid C)}
+
n \log \alpha_d.
\end{align}

This form of the smoothed ranking function has two advantages:

1. **Efficient computation**, because the sum runs only over words that occur in both the query and the document, so the number of terms is at most the number of distinct query words.
2. **A better understanding of TF-IDF weighting**, because the terms of the function play roles similar to TF weighting, IDF weighting and document length normalization.

In particular, these expressions follow from a principled approach in which the probabilistic modeling assumptions are stated up front. The same intuitive properties may not appear if we simply design the ranking function from heuristics.



## Estimate $P_{\text{Seen}}(w \mid d)$ and $\alpha_d$

### Linear interpolation (Jelinek-Mercer) smoothing

One way to smooth the seen-word distribution is to linearly interpolate the ML-estimated model with the background model:

\begin{align}
P_{\text{Seen}}(w \mid d) 
&= (1 - \lambda) P_{\text{ML}}(w \mid d) + \lambda P(w \mid C), \ \lambda \in [0, 1]\\
&= (1 - \lambda) \frac{c(w, d)}{\lvert d \rvert} + \lambda P(w \mid C)
\end{align}

This ensures that the smoothed estimate is nonzero even for words not in the document $d$, provided $\lambda > 0$ and $P(w \mid C) > 0$.
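
A minimal sketch of this interpolation, assuming a tokenized toy document and a hand-written collection model (the function name `jm_smoothed`, the value `lam=0.5` and the numbers in `p_collection` are all illustrative assumptions):

```python
from collections import Counter

def jm_smoothed(w, doc_counts, doc_len, p_collection, lam=0.5):
    """Jelinek-Mercer estimate: (1 - lam) * c(w, d) / |d| + lam * P(w | C)."""
    p_ml = doc_counts[w] / doc_len  # 0 for words not in the document
    return (1 - lam) * p_ml + lam * p_collection[w]

doc = "the cat sat on the mat".split()
doc_counts = Counter(doc)

# Hypothetical collection model; a real one would be estimated from all documents.
p_collection = {"the": 0.4, "cat": 0.1, "sat": 0.1, "on": 0.1, "mat": 0.1, "dog": 0.2}

p_cat = jm_smoothed("cat", doc_counts, len(doc), p_collection)  # seen word
p_dog = jm_smoothed("dog", doc_counts, len(doc), p_collection)  # unseen word
```

The unseen word "dog" now receives `lam * P(dog | C)` instead of zero, and the smoothed probabilities still sum to one over the vocabulary.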

### Dirichlet (Bayesian) smoothing

Another way to smooth $P_{\text{Seen}}(w \mid d)$ is, for a given $\mu \in [0, \infty)$, to let

\begin{align}
P_{\text{Seen}}(w \mid d) 
&= \frac{ \lvert d \rvert P_{\text{ML}}(w \mid d) + \mu P(w \mid C) }
{\lvert d \rvert + \mu} \\
&= 
\underbrace{ \frac{ \lvert d \rvert }{\lvert d \rvert + \mu} }
_{ \substack{ \text{Doc length dependent} \\ \text{interpolating weights}} }
\frac{c(w, d)}{\lvert d \rvert} + 
\underbrace {
\frac{\mu}{\lvert d \rvert + \mu} 
}_{ \substack{\text{For fixed } \mu \\ \text{longer doc } \\ \rightarrow \text{ less weight} }}
P(w \mid C) \\
&= \frac{c(w, d) + 
\overbrace {
\mu P(w \mid C) }^{\text{Pseudo-word counts}} }
{ \lvert d \rvert + \underbrace{\mu}_{ \substack{ \text{ Total pseudo} \\ \text{-word counts}} } }
\end{align}

Similar to Jelinek-Mercer smoothing, 
the second expression 
above tells us that we are also doing a linear interpolation between $P_{\text{ML}}(w \mid d)$
and the collection background model $P(w \mid C)$,
but in this case the weights are dependent on the length of each document, $\lvert d \rvert$, 
and the parameter $\mu$.
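
The pseudo-count form and the interpolation form can be compared numerically. This is a sketch under assumed toy data; `dirichlet_smoothed`, the collection model and the value of `mu` are illustrative choices, not recommended settings:

```python
from collections import Counter

def dirichlet_smoothed(w, doc_counts, doc_len, p_collection, mu):
    """Dirichlet estimate: (c(w, d) + mu * P(w | C)) / (|d| + mu)."""
    return (doc_counts[w] + mu * p_collection[w]) / (doc_len + mu)

doc = "the cat sat on the mat".split()
doc_counts = Counter(doc)

# Hypothetical collection model; a real one would be estimated from all documents.
p_collection = {"the": 0.4, "cat": 0.1, "sat": 0.1, "on": 0.1, "mat": 0.1, "dog": 0.2}

mu = 6.0  # illustrative only: equal to |d| here, so both weights are 0.5
p_cat = dirichlet_smoothed("cat", doc_counts, len(doc), p_collection, mu)

# Same quantity written as a document-length-dependent interpolation.
weight_doc = len(doc) / (len(doc) + mu)
p_cat_interp = (weight_doc * doc_counts["cat"] / len(doc)
                + (1 - weight_doc) * p_collection["cat"])
```

With a fixed `mu`, a longer document gives a larger `weight_doc`, so its own counts are trusted more and the collection model less.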

## Determine exact form of ranking function

We now need to determine $\alpha_d$ for each of the smoothing methods.

### JM smoothing

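For a word that does not occur in $d$, $P_{\text{ML}}(w \mid d) = 0$, so the interpolated estimate reduces to $\lambda P(w \mid C)$. Matching this against $\alpha_d P(w \mid C)$ gives $\alpha_d = \lambda$. We then substitute $P_{\text{Seen}}(w \mid d)$ and $\alpha_d = \lambda$ into the ratio in the expression for $f(q, d)$:

\begin{align}
\frac{P_{\text{Seen}}(w \mid d)}{\alpha_d P(w \mid C)}
&= 
\frac {(1 - \lambda) P_{\text{ML}}(w \mid d) + \lambda P(w \mid C)}
      {\lambda P(w \mid C)} \\
&= 
1 + \frac{1 - \lambda }{\lambda} \frac{c(w, d)}{\lvert d \rvert P(w \mid C)}, \text{ where } \lambda \in (0, 1].
\end{align}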

Plugging this into the expression for $f(q, d)$, we get
\begin{align}
f(q, d) 
&= 
\sum_{w \in q \cap d} c(w, q) 
\log \left( 1 + \frac{1 - \lambda }{\lambda} \frac{c(w, d)}{\lvert d \rvert P(w \mid C)} \right)
+
n \log \alpha_d,
\end{align}

where $\alpha_d = \lambda$.

Because the term $n \log \alpha_d = n \log \lambda$ is independent of the document, we can ignore it and obtain
the ranking function

\begin{align}
f(q, d) 
&= 
\sum_{w \in q \cap d} c(w, q) 
\log 
\left( 
1 + 
\frac{1 - \lambda }{\lambda} 
\overbrace {
\frac{c(w, d)}{ \underbrace{ \lvert d \rvert P(w \mid C) }_{\text{Expected count of } w} } 
}^{\text{Ratio of actual vs expected count}}
\right).
\end{align}

This has the form of a vector space model: it is a dot product between the query count vector and a document vector whose weights combine a TF-like term, an IDF-like term and document length normalization.
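
The JM ranking function can be sketched in a few lines. The toy documents, the ML-estimated collection model and the name `jm_score` are assumptions made for illustration:

```python
import math
from collections import Counter

def jm_score(query, doc, p_collection, lam=0.5):
    """Sum over matched words of c(w, q) * log(1 + (1 - lam)/lam * c(w, d) / (|d| P(w | C)))."""
    c_q, c_d = Counter(query), Counter(doc)
    score = 0.0
    for w, cq in c_q.items():
        if c_d[w] > 0:  # only words in both the query and the document contribute
            ratio = c_d[w] / (len(doc) * p_collection[w])  # actual vs expected count
            score += cq * math.log(1 + (1 - lam) / lam * ratio)
    return score

# Hypothetical toy collection of two tokenized documents.
d1 = "the cat sat on the mat".split()
d2 = "the dog ate the food".split()
all_tokens = d1 + d2
p_collection = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

query = ["cat", "mat"]
s1 = jm_score(query, d1, p_collection)
s2 = jm_score(query, d2, p_collection)
```

Here `d2` shares no words with the query, so its score is zero, while `d1` gets a positive score from both matched words.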

### Dirichlet smoothing

\begin{align}
f(q, d) 
&= 
\sum_{w \in q \cap d} c(w, q) 
\log 
\left( 
\frac{ \frac{ c(w, d) + \mu P(w \mid C) }
            { \lvert d \rvert + \mu     } }
     { \frac{\mu P(w \mid C) }{\lvert d \rvert + \mu } } 
\right) 
& + &  
n \log \alpha_d, \text{ where } \alpha_d = \frac{\mu}{\lvert d \rvert + \mu} \\
&=   
\sum_{w \in q \cap d} c(w, q) 
\log 
\left( 
\frac{ c(w, d) + \mu P(w \mid C) }
     { \mu P(w \mid C)           } 
\right) 
& + &  n \log \alpha_d \\
&= 
\sum_{w \in q \cap d} c(w, q) 
\log 
\left(
1 +
\frac{ c(w, d) }
     { \mu P(w \mid C) } 
\right) 
& + &  n \log \alpha_d,
\end{align}

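where again we see a ratio of actual vs expected word counts inside the $\log$ function. The added 1 comes from the smoothing itself and keeps every term in the sum non-negative. Unlike in JM smoothing, the term $n \log \alpha_d$ cannot be dropped here, because $\alpha_d = \frac{\mu}{\lvert d \rvert + \mu}$ depends on the document length; it acts as a length normalization that penalizes longer documents.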
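
A minimal sketch of the Dirichlet ranking function, including the length-dependent term. The documents, the ML-estimated collection model, the value of `mu` and the name `dirichlet_score` are illustrative assumptions:

```python
import math
from collections import Counter

def dirichlet_score(query, doc, p_collection, mu):
    """Matched-word sum of c(w, q) * log(1 + c(w, d) / (mu P(w | C))) plus n * log(alpha_d)."""
    c_q, c_d = Counter(query), Counter(doc)
    matched = sum(
        cq * math.log(1 + c_d[w] / (mu * p_collection[w]))
        for w, cq in c_q.items()
        if c_d[w] > 0  # only words in both the query and the document
    )
    alpha_d = mu / (len(doc) + mu)  # depends on the document length |d|
    return matched + len(query) * math.log(alpha_d)

# Hypothetical toy documents: d_long contains d_short plus unrelated text.
d_short = "the cat sat on the mat".split()
d_long = d_short + "the dog ate the food".split()
all_tokens = d_short + d_long
p_collection = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

query = ["cat", "mat"]
mu = 10.0  # illustrative value only
score_short = dirichlet_score(query, d_short, p_collection, mu)
score_long = dirichlet_score(query, d_long, p_collection, mu)
```

Both documents match the query words equally often, but `d_long` has a smaller `alpha_d` and therefore a lower score, which is the length normalization at work.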