# Week 4 - Feedback in IR

## Statistical language model

- A distribution over word sequences.
- A generative distribution for sequences of words.

### Unigram language model (LM)

Assumes words in a sentence are generated **independently**.

\begin{align}
P(w_1, \ldots, w_n \mid \theta) &= \prod_{i=1}^n P(w_i \mid \theta)\\ 
\sum_{w \in \mathcal{V}} P(w \mid \theta) &= 1
\end{align}

#### Maximum likelihood (ML) estimator

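The probability of generating a word $w$ under a unigram model learned from a document $d$ is the relative frequency of $w$ in $d$, where $c(w, d)$ is the number of times $w$ occurs in $d$ and $\lvert d \rvert$ is the total number of words in $d$.
\begin{align}
P(w \mid \theta) = P(w \mid d) = \frac{c(w, d)}{\lvert d \rvert}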
\end{align}
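
As a minimal sketch of the ML estimator, the snippet below counts words in a toy document and divides by the document length. The function name `unigram_mle` and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter

def unigram_mle(tokens):
    """Return P(w | d) = c(w, d) / |d| for every word seen in the document."""
    counts = Counter(tokens)   # c(w, d)
    total = len(tokens)        # |d|
    return {w: c / total for w, c in counts.items()}

# Toy document (hypothetical), tokenized by whitespace.
doc = "text mining is fun and text retrieval is fun".split()
p = unigram_mle(doc)

print(p["text"])        # "text" occurs 2 times out of 9 tokens
print(sum(p.values()))  # probabilities over the seen vocabulary sum to 1 (up to rounding)
```

Words that never occur in the document are simply absent from the dictionary, which corresponds to a probability of 0 under this estimator.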

#### Background LM

A language model for a general "background" topic: one trained on a general collection of documents rather than on documents about a specific topic.

\begin{equation}
P(w \mid B)
\end{equation}

#### Collection LM

\begin{equation}
P(w \mid C)
\end{equation}

#### Document LM

\begin{equation}
P(w \mid d)
\end{equation}

### Normalized topic model

We can use the background language model to down-weight words that are uninformative for a given topic.
For example, words that appear frequently in **any** context have a high probability under the background LM, whereas topic-specific words do not.

One way to down-weight uninformative words is to divide each word's probability under a document or collection LM by its probability under the background LM. In the formula below, the collection LM $P(w \mid C)$ plays the role of the background LM.

\begin{equation}
 P_{\text{normalized}}(w) = \frac{P(w \mid d)}{P(w \mid C)}
\end{equation} 
 
or, if we only need to compare relative magnitudes, we can work with the log of this ratio,

\begin{equation}
 \log P_{\text{normalized}}(w) = \log P(w \mid d) - \log P(w \mid C)
\end{equation}
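
The log-ratio above can be sketched as follows. The helper names `unigram_mle` and `log_normalized` and the toy texts are assumptions; the collection is built to contain the document so that every document word has a non-zero collection probability.

```python
import math
from collections import Counter

def unigram_mle(tokens):
    """Relative-frequency estimate of a unigram LM."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def log_normalized(doc_tokens, collection_tokens):
    """log P(w | d) - log P(w | C) for every word seen in the document."""
    p_d = unigram_mle(doc_tokens)
    p_c = unigram_mle(collection_tokens)
    return {w: math.log(p_d[w]) - math.log(p_c[w]) for w in p_d}

# Toy data (hypothetical). The collection includes the document itself.
doc = "the retrieval model ranks the documents".split()
other = "the cat sat on the mat and the dog saw the cat".split()
collection = doc + other

scores = log_normalized(doc, collection)
for w, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{w:10s} {s:+.3f}")
```

In this toy setup "the" is equally frequent in the document and in the collection, so its score is about 0, while a topical word such as "retrieval" gets a positive score.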



## Unigram query likelihood

\begin{equation}
P(q = (w_1, \ldots, w_N) \mid d) = \prod_{i=1}^N P(w_i \mid d)
\end{equation}


## Improved model: sampling words from a document model

If we assume that the words in a query must be sampled from existing documents, then any query containing a word that does not occur in a document has probability 0 of being generated from that document under the independent sampling assumption.
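
The zero-probability problem can be seen in a small sketch that multiplies unsmoothed relative frequencies. The function name `query_likelihood` and the toy document are assumptions for illustration.

```python
from collections import Counter

def query_likelihood(query_tokens, doc_tokens):
    """Unsmoothed P(q | d): product of relative frequencies c(w, d) / |d|."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    prob = 1.0
    for w in query_tokens:
        prob *= counts[w] / total   # counts[w] is 0 for unseen words
    return prob

# Toy document (hypothetical).
doc = "text mining and text retrieval".split()

p_match = query_likelihood(["text", "retrieval"], doc)     # both words occur in doc
p_unseen = query_likelihood(["text", "clustering"], doc)   # "clustering" is unseen

print(p_match, p_unseen)
```

A single unseen query word drives the whole product to 0, no matter how well the remaining words match.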

A potential fix is to assume that there is a hypothetical document that exists in the user's mind and this document is represented by a document model that is to be estimated.


## Scoring function for ranking documents based on query likelihood

\begin{align}
\text{For } q = (w_1, \ldots, w_n) , \ 
 f(q, d) & = \log P(q \mid d) \\
 &= 
\underbrace{\sum_{i=1}^n}_{ \substack{ \text{Sum over words} \\ \text{in query} } } 
\log P(w_i \mid d) \\
 & =   
\underbrace{ \sum_{ w \in \mathcal{V} } }
_{ \substack{ \text{Sum over words} \\ \text{in vocabulary} } } 
c(w, q) \log P(w \mid d)
\end{align}
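
The last step above replaces a sum over query positions with a sum over the vocabulary weighted by the query counts $c(w, q)$. A minimal sketch checking that both forms agree, assuming a hypothetical toy document model `p_d` with non-zero probability for every vocabulary word:

```python
import math
from collections import Counter

def score_by_position(query, p_d):
    """Sum of log P(w_i | d) over the query positions i = 1..n."""
    return sum(math.log(p_d[w]) for w in query)

def score_by_vocab(query, p_d):
    """Sum of c(w, q) * log P(w | d) over every word w in the vocabulary."""
    q_counts = Counter(query)   # c(w, q); 0 for words not in the query
    return sum(q_counts[w] * math.log(p) for w, p in p_d.items())

# Toy document model over a three-word vocabulary (made-up values).
p_d = {"text": 0.5, "mining": 0.3, "retrieval": 0.2}
query = ["text", "mining", "text"]

a = score_by_position(query, p_d)
b = score_by_vocab(query, p_d)
print(a, b)
```

Vocabulary words that are absent from the query contribute nothing because their count $c(w, q)$ is 0.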

## Estimating $P(w \mid d)$

We need to smooth the language model so that it does not assign probability 0 to words that appear in the query but not in the document.

\begin{align}
P(w \mid d) = 
\begin{cases}
P_{\text{Seen}}(w \mid d), & \text{if } w \text{ is in } d\\
\underbrace{\alpha_d P(w \mid C)}_{\text{Background probability}}, & \text{otherwise}
\end{cases}
\end{align}
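
The notes leave $P_{\text{Seen}}$ and $\alpha_d$ unspecified. As one possible instance, the sketch below assumes Jelinek-Mercer (linear interpolation) smoothing, where $P_{\text{Seen}}(w \mid d) = (1 - \lambda) \frac{c(w, d)}{\lvert d \rvert} + \lambda P(w \mid C)$ and $\alpha_d = \lambda$. The names `smoothed_prob`, `p_collection`, and `lam`, as well as the toy numbers, are hypothetical.

```python
from collections import Counter

def smoothed_prob(w, doc_tokens, p_collection, lam=0.5):
    """P(w | d) under Jelinek-Mercer interpolation (an assumed smoothing choice)."""
    counts = Counter(doc_tokens)
    if counts[w] > 0:
        p_ml = counts[w] / len(doc_tokens)
        return (1 - lam) * p_ml + lam * p_collection[w]   # P_seen(w | d)
    return lam * p_collection[w]                          # alpha_d * P(w | C)

# Toy collection LM over a four-word vocabulary (made-up values that sum to 1).
p_collection = {"text": 0.2, "mining": 0.1, "retrieval": 0.1, "the": 0.6}
doc = "text mining text".split()

probs = {w: smoothed_prob(w, doc, p_collection) for w in p_collection}
print(probs)
print(sum(probs.values()))  # still a proper distribution (up to rounding)
```

With this choice the unseen word "retrieval" receives a small non-zero probability, and the smoothed probabilities still sum to 1 over the vocabulary.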

## Ranking function with smoothing

\begin{align}
\log P(q \mid d) 
& = 
\overbrace { 
\sum_{w \in \mathcal{V}, c(w, d) > 0} 
c(w, q) \log P_{\text{Seen}}(w \mid d) 
}^{\text{Contribution from words in doc}} 
&+& 
\overbrace {
\sum_{w \in \mathcal{V}, c(w, d) = 0} 
c(w, q) \log \alpha_d P(w \mid C)
}^{ \text{Contribution from words } \textbf{not } \text{in doc} } \\
& =
\sum_{w \in \mathcal{V}, c(w, d) > 0} 
c(w, q) \log P_{\text{Seen}}(w \mid d) 
&+& 
\sum_{w \in \mathcal{V}} c(w, q) \log \alpha_d P(w \mid C) -
\sum_{w \in \mathcal{V}, c(w, d) > 0} c(w, q) 
\log \alpha_d P(w \mid C) \\
& = 
\sum_{w \in \mathcal{V}, c(w, d) > 0} 
c(w, q) \log \frac{P_{\text{Seen}}(w \mid d)}{\alpha_d P(w \mid C)}
&+& 
\sum_{w \in \mathcal{V}} c(w, q) \log \alpha_d P(w \mid C) \\
& = 
\underbrace {
\sum_{w \in q \cap d} 
}
_{ \substack{ \text{Sum over matched words;} \\ \text{at most } n \text{ terms} } }
c(w, q) \log 
\frac{
   \overbrace{P_{\text{Seen}}(w \mid d)}^{\text{TF weighting}}
}
{ 
  \underbrace{ \alpha_d P(w \mid C) }_{\text{IDF weighting}}
}
&+& 
\underbrace { 
n\log \alpha_d 
}_{ \substack{ \text{Document normalization.} \\
\text{Precompute } \because \\ \text{ independent of query}
                } }
+
\underbrace{
\sum_{w \in \mathcal{V}} c(w, q) \log P(w \mid C)
}_{ \substack{ \text{Ignore } \because \\ \text{independent of doc} } }
\end{align}

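Rewriting the ranking function this way has two advantages:

1. **Efficient computation**, because we only need to sum over the words that occur in both the query and the document, so the number of terms is bounded by the size of the query.
2. **A better understanding of TF-IDF weighting**, because the ratio inside the sum resembles it: the numerator grows with the term frequency in the document, and dividing by $P(w \mid C)$ penalizes words that are common across the collection.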
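
As a sanity check on the derivation, the sketch below scores a query in two ways: directly from the smoothed $P(w \mid d)$, and with the rewritten form (matched-word sum, plus $n \log \alpha_d$, plus the document-independent term). It assumes Jelinek-Mercer smoothing with $\alpha_d = \lambda$, which the notes do not prescribe; all function names and toy values are hypothetical.

```python
import math
from collections import Counter

def p_seen(w, d_counts, d_len, p_c, lam):
    """Assumed P_seen(w | d): Jelinek-Mercer interpolation."""
    return (1 - lam) * d_counts[w] / d_len + lam * p_c[w]

def score_direct(query, doc, p_c, lam=0.5):
    """log P(q | d) summed word by word over the query."""
    d_counts = Counter(doc)
    total = 0.0
    for w in query:
        if d_counts[w] > 0:
            total += math.log(p_seen(w, d_counts, len(doc), p_c, lam))
        else:
            total += math.log(lam * p_c[w])   # alpha_d * P(w | C)
    return total

def score_rewritten(query, doc, p_c, lam=0.5):
    """Same score via the rewritten form of the ranking function."""
    d_counts, q_counts = Counter(doc), Counter(query)
    # Sum over words that occur in both the query and the document.
    matched = sum(
        c * math.log(p_seen(w, d_counts, len(doc), p_c, lam) / (lam * p_c[w]))
        for w, c in q_counts.items() if d_counts[w] > 0
    )
    doc_norm = len(query) * math.log(lam)   # n * log(alpha_d)
    query_only = sum(c * math.log(p_c[w]) for w, c in q_counts.items())
    return matched + doc_norm + query_only

# Toy data (made-up values).
p_c = {"text": 0.2, "mining": 0.1, "retrieval": 0.1, "the": 0.6}
doc = "text mining text".split()
query = ["text", "retrieval", "text"]

direct = score_direct(query, doc, p_c)
rewritten = score_rewritten(query, doc, p_c)
print(direct, rewritten)
```

Both functions return the same value up to floating-point error. When ranking documents for a fixed query, `query_only` is identical for every document and can be dropped without changing the order.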

