# Ch2-Lecture 7

## Entropy and Entanglement Distillation

The theme of the last two lectures has been of a quantum information theoretic nature — we have studied cloning (or rather, lack thereof), entanglement, and non-local correlations. Before progressing to our next main theme of quantum algorithms, we now give a brief taste of more advanced ideas in quantum information. In the process, we will continue getting used to working with quantum states in both the state vector and density operator formalisms.

The main questions we ask in this lecture are the following:

* How can we quantify the “amount” of entanglement in a composite quantum system?
* Under what conditions can “less entangled” states be “converted” to “more entangled” states?

The first question will require the foundational concept of entropy, introduced in the context of classical information theory by Claude Shannon in 1948. The entropy is worthy of a lecture in itself, being an extremely important quantity. The second question above is tied to the discovery of entanglement distillation, in which, similar to the age-old process of distilling potable water from salt water (or more fittingly for our analogy, “pure” water from “dirty” water), one can “distill” pure entanglement from noisy entanglement.

## Entropy

One of the most influential scientific contributions of the 20th century was the 1948 paper of Claude Shannon, “A Mathematical Theory of Communication”, which single-handedly founded the field of information theory. Roughly, the aim of information theory is to study information transmission and compression. For this, Shannon introduced the notion of entropy, which intuitively quantifies “how random” a data source is, or the “average information content” of the source. It turns out that a quantum generalization of entropy will be vital to quantifying entanglement; as such, we begin by defining and motivating the classical Shannon entropy.

### Shannon entropy

Let $X$ be a discrete random variable taking values from set $ \left\{x_1,\dots,x_n\right\} $, where $ \text{Pr}\left(x_i\right) :=  \text{Pr}\left(X=x_i\right)$ denotes the probability that $X$ takes value $x_i$ . Then, the Shannon entropy $H(X)$ is defined as

$$
H(X)= \sum_{i=1}^{n}- \text{Pr}\left(x_i\right)\log \left( \text{Pr}\left(x_i\right) \right)
$$

Here, the logarithm is taken base 2, and we define $0 \cdot \log\left(0\right)=0$.

Before getting our hands dirty understanding $H(X)$, let us step back and motivate it via data compression. Suppose we have a data source whose output we wish to transmit from (say) Germany to Canada. Naturally, we may wish to first _compress_ the data, so that we need to transmit as few bits as possible between the two countries. Furthermore, a compression scheme is useless unless we are later able to _recover_ or _uncompress_ the data in Canada. This raises the natural question: **How few bits can one transmit, so as to ensure recovery of the data on the receiving end?** Remarkably, Shannon’s noiseless coding theorem says that this quantity is given by the entropy! Roughly, the theorem says that in order to reliably transmit $N$ i.i.d. (independently and identically distributed) random variables $X_i$ from a random source $X$, it is necessary and sufficient to instead send about $N \cdot H(X)$ bits of communication, i.e. $H(X)$ bits per variable on average.

 We now explore the sense in which $H(X)$ indeed quantifies the “randomness” or “uncertainty” of $X$ by considering two boundary cases. In the first boundary case, $X$ models a fair coin flip, i.e. $X$ takes value HEADS or TAILS with probability 1/2 each. Then,

$$
H(X)=- \frac{1}{2}  \log\left( \frac{1}{2} \right)- \frac{1}{2}  \log\left( \frac{1}{2} \right)= \frac{1}{2} + \frac{1}{2} =1
$$

Therefore, we interpret a fair coin as encoding, on average, _one_ bit of information. Alternatively, in the information transmission setting, we would need to transmit a single bit to convey the answer of the coin flip from Germany to Canada. This intuitively makes sense — since the outcome of the coin flip is completely random, there is no way to guess its outcome in Canada with success probability greater than 1/2 (i.e. a random guess).

#### Send it after class 1
Suppose random variable $Y$ models a biased coin, e.g. takes value HEADS and TAILS with probability 1 and 0, respectively. What is $H(Y)$?

In the above exercise, there is no “uncertainty”; we know the outcome will be HEADS. Thus, in the communication setting, one can interpret this as saying zero bits of communication are required to transmit the outcome of the coin flip from Germany to Canada (assuming both Germany and Canada know the probabilities of obtaining HEADS and TAILS beforehand). Indeed, the answer to the exercise above is $H(Y) = 0$.
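The two boundary cases above can be checked numerically. The following is a minimal sketch, not part of the original notes: the function name `shannon_entropy` is our own choice, and the code simply evaluates the defining sum with base-2 logarithms and the convention $0 \cdot \log(0) = 0$.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a finite probability distribution."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0):
        raise ValueError("probs must be non-negative and sum to 1")
    # Terms with p == 0 are skipped, which implements the convention 0*log(0) = 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])     # the fair coin X
certain_coin = shannon_entropy([1.0, 0.0])  # the coin Y that always lands HEADS
print(fair_coin, certain_coin)
```

The fair coin evaluates to one bit and the certain coin to zero bits, in agreement with the discussion above.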

#### Send it after class 2

Let random variable $Z$ take values in set $\{0, 1, 2\}$ with probabilities $\{1/4, 1/4, 1/2\}$, respectively. What is $H(Z)$?

The entropy formula is odd-looking; to understand how it is derived, the key observation is the intuition behind the coin flip examples, which says that “when an event is _less_ likely to happen, it reveals _more_ information”. To capture this intuition, Shannon started with a formula for _information content_ $I(x_i)$, which for any possible event $x_i$ for random variable $X$, is given by

$$
I  \left(x_i\right)= \log\left( \frac{1}{ \text{Pr}\left(x_i\right) } \right)=- \log\left( \text{Pr}\left(x_i\right) \right)
$$

Since the log function is strictly monotonically increasing (i.e. $\log(x) > \log(y)$ if $x > y$ for $x, y  \in  (0, \infty)$), $I(x_i) = \log\left(1/\text{Pr}\left(x_i\right)\right)$ grows as $\text{Pr}\left(x_i\right)$ shrinks, which captures the idea that “rare events yield more information”. But $I(x)$ also has three other important properties we expect of an “information measure”; here are the first two:
1. (Information is non-negative) $I(x)  \geq  0$
2. (Certain events convey no information) If $\text{Pr}\left(x\right) = 1$, then $I(x) = 0$.


For the third important property, we ask — why did we use the log function? Why not any other monotonically increasing function satisfying properties (1) and (2) above? Recall that, by definition, two random variables $X$ and $Y$ are independent if

$$
 \text{Pr}\left(X = x \text{ and } Y = y\right)=  \text{Pr}\left(X = x\right)\text{Pr}\left(Y = y\right) .
$$

Moreover, if $X$ and $Y$ are independent, then intuitively one expects the information conveyed by the joint event $z := (X = x \text{ and } Y = y)$ to be _additive_, i.e. $I(z) = I(x) + I(y)$. But this is precisely what the information content $I(x)$ satisfies, due to its use of the log function.
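As a quick numerical illustration of additivity, here is a small sketch. The probabilities `p_x` and `p_y` are hypothetical values picked for the example, and `information_content` is our own helper name for $I(x) = -\log(\text{Pr}(x))$.

```python
import math

def information_content(p):
    """Information content I = -log2(p), in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)

# Hypothetical probabilities of two independent events (illustration only).
p_x, p_y = 0.25, 0.5
p_joint = p_x * p_y  # independence: Pr(X = x and Y = y) = Pr(x) * Pr(y)

i_joint = information_content(p_joint)
i_sum = information_content(p_x) + information_content(p_y)
print(i_joint, i_sum)
```

The two printed values agree because the logarithm turns the product of probabilities into a sum. The same helper also reflects the other properties listed above: it is non-negative on $(0, 1]$, vanishes at probability 1, and is larger for rarer events.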

#### Send it after class 3

Let $X$ and $Y$ be independent random variables. Then, for $z := (X = x \text{ and } Y =y)$, express $I(z)$ in terms of $I(x)$ and $I(y)$.

We can now use the information content to derive the formula for entropy: $H(X)$ is simply the _expected_ information content over all possible events $ \left\{  x_1, \dots,  x_n \right\} $, i.e. $H(X) = E[I(X)] = \sum_{i} \text{Pr}\left(x_i\right) I\left(x_i\right)$. (Recall here that for random variable $X$ taking values $x  \in  \{x_i\}$, its expectation $E[X]$ is given by $E[X] =  \sum_{i} \text{Pr}\left(x_i\right) \cdot x_i$.) This is precisely why at the start of this section, we referred to $H(X)$ as the **average** information content of a data source.

### Von Neumann Entropy

Recall the first aim of this lecture was to use entropy to measure entanglement. For this, we shall require a quantum generalization of the Shannon entropy $H(X)$, called the *von Neumann entropy* $S( \rho )$, for density operator $ \rho $. To motivate this definition, let us recall the “hierarchy of matrix classes” we introduced in discussing measurements:
* Hermitian operators, $\text{Herm}  \left(\mathbb{C}^d\right) $ , which generalize the real numbers.
* Positive semidefinite operators, $\text{Pos} \left(\mathbb{C}^d\right)$ , which generalize the non-negative real numbers.
* Orthogonal projection operators, $\Pi\left(\mathbb{C}^d\right)$, which generalize the set $\{0, 1\}$.

Note that $\Pi\left(\mathbb{C}^d\right) \subseteq \text{Pos} \left(\mathbb{C}^d\right) \subseteq \text{Herm}  \left(\mathbb{C}^d\right)$ , and that the notion of “generalize” above means the eigenvalues of the operators fall into the respective class the operators generalize. (For example, matrices in $\text{Pos}  \left(\mathbb{C}^d\right)$ have non-negative eigenvalues.) Applying this same interpretation to the set of *density operators* acting on $\mathbb{C}^d$, $D  \left(\mathbb{C}^d\right)$ , we thus have that density operators generalize the notion of a *probability distribution*. Indeed, any probability distribution can be embedded into a diagonal density matrix.

#### Send it after class 4

Let $ \left\{p_i\right\}^d_{i=1} $ denote a probability distribution. Define diagonal matrix $ \rho  \in  \mathcal{L}  \left( \mathbb{C}^d\right) $ such that $ \rho  \left(i,i\right)=p_i $ . Show that $ \rho $ is a density matrix.

Since the eigenvalues $ \lambda_i \left( \rho \right) $ of a density operator $ \rho  \in D \left( \mathbb{C} ^d\right) $ form a probability distribution, the natural approach for defining a quantum entropy formula is to apply the classical Shannon entropy to the spectrum of $ \rho $:

$$
S \left( \rho \right):=H \left( \left\{ \lambda _i \left( \rho \right) \right\}_{i=1}^d \right)= \sum_{i=1}^{d}- \lambda_i \left(\rho\right) \log\left(\lambda_i \left(\rho\right)\right)
$$
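To make the definition concrete, the sketch below evaluates $S(\rho)$ for a single qubit. It is an illustration under stated assumptions rather than a general implementation: it handles only $2 \times 2$ density matrices, for which the eigenvalues have a closed form, and the function name `qubit_von_neumann_entropy` is our own.

```python
import math

def qubit_von_neumann_entropy(rho):
    """Von Neumann entropy, in bits, of a 2x2 density matrix.

    rho is a nested list [[a, b], [c, d]], assumed to be Hermitian,
    positive semidefinite, and of trace 1 (this is not checked here).
    """
    (a, b), (c, d) = rho
    trace = (a + d).real
    det = (a * d - b * c).real
    # Eigenvalues of a 2x2 Hermitian matrix: (trace +/- sqrt(trace^2 - 4*det)) / 2.
    disc = max(trace * trace - 4.0 * det, 0.0)  # clamp small rounding errors
    root = math.sqrt(disc)
    eigenvalues = [(trace + root) / 2.0, (trace - root) / 2.0]
    # Apply the Shannon formula to the spectrum, using 0*log(0) = 0.
    return sum(-lam * math.log2(lam) for lam in eigenvalues if lam > 1e-15)

maximally_mixed = [[0.5, 0.0], [0.0, 0.5]]  # I/2
plus_state = [[0.5, 0.5], [0.5, 0.5]]       # the pure state |+><+|
print(qubit_von_neumann_entropy(maximally_mixed))
print(qubit_von_neumann_entropy(plus_state))
```

The maximally mixed state has spectrum $\{1/2, 1/2\}$ and hence one bit of entropy, like the fair coin, while the pure state has spectrum $\{1, 0\}$ and zero entropy. For a diagonal $\rho$, the result coincides with the Shannon entropy of the diagonal entries.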

**Operator functions**. It is important to pause now and take stock of what we have done in defining $S( \rho )$ : In order to apply a function $f :  \mathbb{R}  \mapsto  \mathbb{R} $ to a Hermitian matrix $H  \in  \text{Herm}  \left(\mathbb{C}^d\right)$ , we instead applied $f$ to the *eigenvalues* of $H$. Why does this “work”? Let us look at the Taylor series expansion of $f$ , which for e.g. $f(x) = e^x$ is (the series converges for all $x$)

$$
e^x= \sum_{n=0}^{\infty} \frac{x^n}{n!}=1+x+ \frac{x^2}{2!}+\frac{x^3}{3!}+\dots
$$

This suggests an idea — to define $e^H$ , perhaps we can substitute $H$ in the right hand side of the Taylor series expansion of $e^x$ :

$$
e^H:= I+H+ \frac{H^2}{2!}+\frac{H^3}{3!}+\dots
$$

Indeed, this leads to our desired definition: to generalize the function $f(x) = e^x$ to Hermitian matrices, we apply $f$ to the eigenvalues of $H$.
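The agreement between the series definition and the eigenvalue definition can be checked numerically on a small example. The sketch below is ours, not from the lecture: it uses the Pauli $X$ matrix as an illustrative Hermitian $H$ (eigenvalues $+1$ and $-1$, with eigenvectors $(1,1)/\sqrt{2}$ and $(1,-1)/\sqrt{2}$), truncates the Taylor series after an arbitrarily chosen 30 terms, and uses helper names (`matmul`, `exp_taylor`, `exp_spectral`) that are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def exp_taylor(h, terms=30):
    """Approximate e^H by the truncated series I + H + H^2/2! + ..."""
    n = len(h)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in result]  # H^0 / 0! = I
    for k in range(1, terms):
        power = matmul(power, h)
        power = [[entry / k for entry in row] for row in power]  # now H^k / k!
        result = [[result[i][j] + power[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def exp_spectral(eigenvalues, eigenvectors):
    """Build sum_i e^{lambda_i} |v_i><v_i| from a real orthonormal eigenbasis."""
    n = len(eigenvectors[0])
    out = [[0.0] * n for _ in range(n)]
    for lam, v in zip(eigenvalues, eigenvectors):
        for i in range(n):
            for j in range(n):
                out[i][j] += math.exp(lam) * v[i] * v[j]
    return out

pauli_x = [[0.0, 1.0], [1.0, 0.0]]
s = 1.0 / math.sqrt(2.0)
via_series = exp_taylor(pauli_x)
via_spectrum = exp_spectral([1.0, -1.0], [[s, s], [s, -s]])
print(via_series)
print(via_spectrum)
```

Both computations give the same matrix up to rounding, with $\cosh(1)$ on the diagonal and $\sinh(1)$ off the diagonal. This is a sanity check on one example, not a proof of the general statement.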

#### Send it after class 5

Let $H$ have spectral decomposition $H =  \sum_{i=1}^{d} \lambda_i \left| \lambda_i \right\rangle\!\left\langle \lambda_i \right|$. Show that

$$
e^H=\sum_{i=1}^{d} e^{\lambda_i} \left| \lambda_i \right\rangle\!\left\langle \lambda_i \right|
$$

This idea of applying functions $f :  \mathbb{R}  \mapsto  \mathbb{R} $ to the eigenvalues of Hermitian operators is used so frequently in quantum information that we give such “generalized $f$ ” a name — *operator functions*. In the case of $S( \rho )$, by setting $f(x) =  \log\left(x\right)$, we can rewrite $S( \rho )$ as

$$
S( \rho )=- \text{Tr}\left( \rho  \log\left( \rho \right) \right)
$$
