
CS7545_Sp24_Lecture_25: Hallucination


CS 7545: Machine Learning Theory -- Spring 2024

Instructor: Tyler LaBonte

Lecturer: Santosh S. Vempala

Notes based on Lecture 25 and the paper "Calibrated Language Models Must Hallucinate"

April 24, 2024

Scribes: Hoang Ho, Yinan Huang

1. What is a Language Model?

A Language Model is simply a probability distribution over sequences of tokens, i.e., words or other character sequences. More formally, let $X$ be the set of all documents (i.e., strings of text) and $\Delta(X)$ be the set of all probability distributions over $X$. A Language Model is then defined as some $D_{LM} \in \Delta(X)$.
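To make the definition concrete, here is a minimal Python sketch (not from the lecture) of a language model as an explicit distribution $D_{LM} \in \Delta(X)$ over a tiny document set; the documents and probabilities are hypothetical, chosen only for illustration.

```python
# A minimal sketch: a "language model" as an explicit probability
# distribution D_LM over a tiny, hypothetical document set X.
import random

# Hypothetical toy documents and probabilities D_LM(x); illustrative only.
D_LM = {
    "the cat sat on the mat": 0.5,
    "the cat sat on the hat": 0.3,
    "colorless green ideas sleep furiously": 0.2,
}

assert abs(sum(D_LM.values()) - 1.0) < 1e-9  # D_LM is an element of Delta(X)

def sample_document(dist):
    """Sample one document x ~ D_LM."""
    docs, probs = zip(*dist.items())
    return random.choices(docs, weights=probs, k=1)[0]

print(sample_document(D_LM))
```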

2. What is hallucination?

Hallucination is "a plausible but false or misleading response generated by an artificial intelligence algorithm." - Merriam-Webster (2023) dictionary.

Why is hallucination not a major problem for a trigram LM? A trigram model predicts the next token's probability using only the previous two tokens; it never captures the semantic content of a document as a probability distribution over pieces of information.
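As a concrete illustration, here is a minimal sketch of a trigram model: the next-token distribution depends only on the previous two tokens. The toy corpus is hypothetical.

```python
# A minimal sketch of a trigram LM: the next token is predicted only from
# the previous two tokens, via empirical counts on a toy corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat sat on the hat".split()

counts = defaultdict(Counter)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    counts[(a, b)][c] += 1          # count occurrences of token c after (a, b)

def next_token_dist(a, b):
    """P(next token | previous two tokens), from counts."""
    ctr = counts[(a, b)]
    total = sum(ctr.values())
    return {tok: n / total for tok, n in ctr.items()} if total else {}

# The model only ever looks two tokens back; it has no notion of which
# "pieces of information" a whole document asserts.
print(next_token_dist("cat", "sat"))   # {'on': 1.0}
print(next_token_dist("on", "the"))    # {'mat': 0.5, 'hat': 0.5}
```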

3. Simplified settings

We define two types of facts: arbitrary facts and systematic facts. An arbitrary fact is a truth that typically cannot be determined from the training set. A systematic fact is predictable from the training set by learning the underlying rules of correctness.

We also define factoids, which are arbitrary pieces of information that are either true (facts) or false (hallucinations) and whose truth is statistically hard to determine from the training data. We assume that each document contains at most one factoid and that all training documents are factoids.

Assumptions on factoids:

  1. One factoid per document: $f: X \rightarrow Y$, where $X$ is the set of documents and $Y$ is the set of factoids. Let $D_{L}\in\Delta(X)$ be the ground-truth distribution and denote the induced factoid distribution by $p(y):=\sum_{x:f(x)=y}D_{L}(x)$. Correspondingly, let $g(y):=\sum_{x:f(x)=y}D_{LM}(x)$ be the factoid distribution induced by the language model. (A small sketch of these induced distributions appears after this list.)
  2. Good training data: $F := supp(p) \cup \{ \perp \}$, where $F$ is the set of facts and $supp(p) = \{y \in Y \mid p(y) > 0 \}$. That is, the training dataset only contains facts. The set of hallucinations is $H := Y \setminus F$.
  3. Sparse facts: There exists $s \geq 0$ such that, with probability 1 over $D_L \sim D_{world}$: $$|F| \leq e^{-s}|H|$$
  4. Regularity: $D_{world}$ has regular facts if, for all $x_{train} \in supp(D_{train})$: $$Pr[y \in F \mid x_{train}] = Pr[y' \in F \mid x_{train}], \quad \forall y, y' \in U,$$ where $U$ denotes the set of factoids not observed in the training data (and $O$ the set of observed factoids); both reappear in the proof of the main theorem.

For $r \geq 1$, $D_{world}$ has $r$-regular facts if, for all $x_{train} \in supp(D_{train})$: $$Pr[y \in F \mid x_{train}] \leq \frac{r}{|U|}\, \mathbb{E}\big[\,|F \cap U|\,\big\vert\, x_{train}\big], \quad \forall y \in U.$$
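To make Assumptions 1 and 2 concrete, the following sketch computes the induced factoid distributions $p$ and $g$ and the fact/hallucination split $F$ and $H$; all documents, factoids, and probabilities below are hypothetical.

```python
# A minimal sketch, on hypothetical toy data, of the induced factoid
# distributions p and g (Assumption 1) and the fact/hallucination split
# (Assumption 2).
from collections import defaultdict

# f maps each document to its (single) factoid, or "⊥" if it has none.
f = {
    "doc_a": "fact_1",
    "doc_b": "fact_1",
    "doc_c": "fact_2",
    "doc_d": "⊥",
}

D_L  = {"doc_a": 0.4, "doc_b": 0.3, "doc_c": 0.2, "doc_d": 0.1}   # ground truth
D_LM = {"doc_a": 0.5, "doc_b": 0.1, "doc_c": 0.1, "doc_d": 0.3}   # language model

def induced(dist):
    """p(y) = sum over documents x with f(x) = y of dist(x)."""
    out = defaultdict(float)
    for x, prob in dist.items():
        out[f[x]] += prob
    return dict(out)

p = induced(D_L)    # ground-truth factoid distribution
g = induced(D_LM)   # LM factoid distribution

Y = {"fact_1", "fact_2", "hallucination_1"}           # all factoids (toy)
F = {y for y, prob in p.items() if prob > 0} | {"⊥"}  # facts = supp(p) ∪ {⊥}
H = Y - F                                             # hallucinations
print(p, g, F, H)
```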

4. Calibration

(Coarsening) Given any set $Y$, for a partition $\Pi \in P(Y)$ and $D \in \Delta(Y)$, $D^{\Pi} \in \Delta(Y)$ is the $\Pi$-coarsening of $D$ if $$\forall B \in \Pi,\ \forall y \in B:\quad D^{\Pi}(y)= \frac{D(B)}{|B|}.$$

(Calibration) The language model $g$ is calibrated with respect to $p$ if the average of $p(y)$ over each bin of factoids sharing a common value of $g(y)$ equals that value, which can be written as $$\forall z \in [0, 1]:\quad \mathbb{E}[p(y) \mid g(y) = z] = z.$$
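A minimal sketch of $\Pi$-coarsening, assuming a hypothetical toy factoid set and a two-block partition:

```python
# A minimal sketch of Π-coarsening: inside each block B of the partition Π,
# the coarsened distribution spreads D(B) uniformly over the |B| factoids.
# All factoid names and probabilities are hypothetical.

D = {"y1": 0.1, "y2": 0.3, "y3": 0.4, "y4": 0.2}
Pi = [{"y1", "y2"}, {"y3", "y4"}]        # a partition of the factoid set

def coarsen(dist, partition):
    """Return the Π-coarsening D^Π: D^Π(y) = D(B)/|B| for y in B."""
    out = {}
    for B in partition:
        mass = sum(dist[y] for y in B)
        for y in B:
            out[y] = mass / len(B)
    return out

print(coarsen(D, Pi))   # y1, y2 each get 0.2; y3, y4 each get 0.3
```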

(Approx Calibration) Define the bins $V_b(g) := \{B_{[0, t_1]}, B_{(t_1, t_2]}, \ldots, B_{(t_{b-1}, t_b]}\}$, where $B_I := \{y \in Y \mid g(y) \in I\}$ and $t_i = \sup\{z \in [0, 1] \mid \sum_{y:\, g(y) \leq z} g(y) \leq i/b\}$, so that each bin carries roughly $1/b$ of the probability mass under $g$.

(Miscalibration) The miscalibration of $g$ with respect to $p$ is $$Mis_{b}(g, p) := \|p^{V_{b}(g)} - g\|_{TV}=\frac{1}{2}\sum_{B\in V_{b}(g)}\sum_{y\in B}\left|\frac{p(B)}{|B|}-g(y)\right|.$$
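The sketch below computes an approximate version of $V_b(g)$ (using a simple greedy cut at cumulative $g$-mass $i/b$ rather than the exact supremum in the definition) and the resulting $Mis_b(g, p)$; the two distributions are hypothetical.

```python
# A minimal sketch of the binning V_b(g) and the miscalibration Mis_b(g, p):
# factoids are split by g-value into b bins of (roughly) equal g-mass, then
# Mis_b is the TV distance between the V_b(g)-coarsening of p and g.
# The distributions below are hypothetical toy examples.

p = {"y1": 0.10, "y2": 0.20, "y3": 0.30, "y4": 0.40}   # ground truth
g = {"y1": 0.05, "y2": 0.25, "y3": 0.30, "y4": 0.40}   # language model

def bins(g_dist, b):
    """Greedy sketch of V_b(g): sort factoids by g(y) and cut at g-mass i/b."""
    ys = sorted(g_dist, key=g_dist.get)
    out, current, mass, i = [], [], 0.0, 1
    for y in ys:
        current.append(y)
        mass += g_dist[y]
        if mass >= i / b and i < b:
            out.append(set(current))
            current, i = [], i + 1
    if current:
        out.append(set(current))
    return out

def miscalibration(g_dist, p_dist, b):
    """Mis_b(g, p) = (1/2) * sum_B sum_{y in B} |p(B)/|B| - g(y)|."""
    total = 0.0
    for B in bins(g_dist, b):
        pB = sum(p_dist[y] for y in B)
        total += sum(abs(pB / len(B) - g_dist[y]) for y in B)
    return total / 2

print(miscalibration(g, p, b=2))
```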

5. Main theorem

Fix $\theta \in (0, 1], b, n \in \mathbb{N}, r, s \in \mathbb{R}$. Then for any algorithm $A: X^n \rightarrow \Delta(X)$, with probability $\geq 1 - \theta$ over $D_L \sim D_{world}$ and $x_{train} \sim D_{L}^{\times n}$, $$g(H) \geq \widehat{MF} - Mis_b(g, p) - \frac{3rne^{-s}}{\theta} - \sqrt{\frac{6\ln(6/\theta)}{n}},$$ where $D_{LM} = A(x_{train})$, $g(H)$ is the LM hallucination rate, and $\widehat{MF}$ is the monofact estimate of the missing-fact rate $\Rightarrow$ the language model MUST hallucinate at a nontrivial rate once $\widehat{MF}$ exceeds the miscalibration and the remaining (small) terms.
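To see the scale of the bound, here is a small numeric sketch that simply evaluates the right-hand side of the theorem; all parameter values are hypothetical.

```python
# A minimal numeric sketch of the theorem's lower bound on the hallucination
# rate g(H); all parameter values below are hypothetical, chosen only to show
# the shape of the bound.
import math

def hallucination_lower_bound(mf_hat, mis, r, n, s, theta):
    """g(H) >= MF_hat - Mis_b(g, p) - 3*r*n*e^{-s}/theta - sqrt(6*ln(6/theta)/n)."""
    return (mf_hat
            - mis
            - 3 * r * n * math.exp(-s) / theta
            - math.sqrt(6 * math.log(6 / theta) / n))

# Hypothetical values: a large monofact estimate, a well-calibrated model,
# and facts that are very sparse relative to hallucinations.
print(hallucination_lower_bound(mf_hat=0.6, mis=0.05, r=1, n=10_000,
                                s=25, theta=0.1))   # roughly 0.5
```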

Proof. Recall Markov's inequality: for a non-negative random variable $W$ and any $\delta \in (0,1]$, $Pr\!\left(W\ge \frac{\mathbb{E}[W]}{\delta}\right)\le \delta$. In our case, let $W:=(p(U)-\|p^{\Pi}-g\|_{TV}-g(H))_+$. Applying Theorem 1 of the paper (which bounds $\mathbb{E}[W]$) together with Markov's inequality, we obtain that with probability at most $2\delta/3$, $$p(U)-\|p^{\Pi}-g\|_{TV}-g(H)\ge \frac{3}{2\delta}\left(\max_{y\in U}Pr_{p\sim \nu}[y\in F]+|O|\max_{y\in U}\mathbb{E}_{\nu}[p(y)]\right).$$ Rearranging terms, we get: with probability at least $1-2\delta/3$, $$g(H)\ge p(U)-\|p^{\Pi}-g\|_{TV}- \frac{3}{2\delta}\left(\max_{y\in U}Pr_{p\sim \nu}[y\in F]+|O|\max_{y\in U}\mathbb{E}_{\nu}[p(y)]\right).$$

Also, Corollary 6 of Appendix A with $\delta \rightarrow \delta /3$ implies that for any $\delta \in (0, 1]$ and any $D_{L}$, $$Pr_{x_{train} \sim D_L^{\times n}}\left[p(U) \geq \widehat{MF}-\sqrt{\frac{6 \ln (6 / \delta)}{n}}\right] \geq 1-\frac{\delta}{3} .$$

Now, suppose $D_{\text{world}}$ has 1-regular facts and 1-regular probabilities. Then with probability 1 it satisfies $$\max_{y \in U} Pr_{p \sim \nu}[y \in F]+|O| \max_{y \in U} \mathbb{E}_{\nu}[p(y)] \leq \frac{\mathbb{E}[|F \cap U|]}{|U|}+\frac{|O|}{|U|} \mathbb{E}[p(U)] \leq 2 \frac{|F|}{|U|} \leq 2 e^{-s},$$ where we have used the facts that $O \subseteq F$ and $U\supseteq H$. The desired lower bound on $g(H)$ follows immediately from the three displays above and a union bound, taking $\Pi = V_b(g)$. The proof for $D_{\text{world}}$ with only $r$-regular facts follows in the same way, additionally using the fact that $|O|\le n$.
