Working on concavity of entropy and convexity of KL divergence
gnthibault committed Feb 22, 2024
1 parent 94328b8 commit 897b37a
Showing 2 changed files with 33 additions and 2 deletions.
31 changes: 31 additions & 0 deletions InformationTheoryOptimization.ipynb
@@ -177,6 +177,37 @@
" return p, -np.dot(p,SafeLog2(p))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Interesting property of entropy\n",
"### Concavity of entropy and convexity of KL-divergence\n",
"The entropy is concave in the space of probability mass function, more formally, this reads:\n",
"\\begin{align*}\n",
" H[\\lambda p_1 + (1-\\lambda p_2)] \\geq \\lambda H[p_1] + (1-\\lambda p_2) H[p_2]\n",
"\\end{align*}\n",
"where $p_1$ and $p_2$ are probability mass functions and $\\lambda \\in [0,1]$\n",
"\n",
"Proof: Let $X$ be a discrete random variable with possible outcomes $\\mathcal{X} := {x_i, i \\in 0,1,\\dots N-1}$ and let $u(x)$ be the probability mass function of a discrete uniform distribution on $X \\in \\mathcal{X}$. Then, the entropy of an arbitrary probability mass function $p(x)$ can be rewritten as\n",
"\n",
"\\begin{align*}\n",
" H(X) &= - \\sum_{i=0}^{N-1} p(x_i)log(p(x_i)) \\\\\n",
" &= - \\sum_{i=0}^{N-1} p(x_i)log\\left(\\frac{p(x_i)}{u(x_i)} u(x_i)\\right) \\\\\n",
" &= - \\sum_{i=0}^{N-1} p(x_i)log\\left(\\frac{p(x_i)}{u(x_i)}\\right) - \\sum_{i=0}^{N-1} p(x_i)log(u(x_i)) \\\\\n",
" &= -KL[p\\|u] - \\sum_{i=0}^{N-1} p(x_i)log(u(x_i)) \\\\\n",
" &= -KL[p\\|u] - log \\left(\\frac{1}{N} \\right) \\sum_{i=0}^{N-1} p(x_i) \\\\\n",
" &= log(N) - KL[p\\|u]\n",
" log(N) - H(X) &= KL[p\\|u]\n",
"\\end{align*}\n",
"\n",
"Where $KL[p\\|u]$ is the Kullback-Leibler divergence between $p$ and the discrete uniform distriution $u$ over $\\mathcal{X}$, a concept we will explain more in detail later on this page. \n",
"Note that the KL divergence is convex in the space of the pair of probability distributions $(p,q)$:\n",
"\\begin{align*}\n",
" KL[\\lambda p_1 + (1-\\lambda p_2) \\| \\lambda q_1 + (1-\\lambda q_2)] \\geq \\lambda KL[p_1\\|q_1] + (1-\\lambda p_2) KL[p_2\\|q_2]\n",
"\\end{align*}\n"
]
},
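{
"cell_type": "markdown",
"metadata": {},
"source": [
 "Below is a quick numerical sanity check of the statements above (a sketch added for illustration, not part of the original derivation): it samples a few probability mass functions with numpy and verifies the identity $\\log(N) - H(p) = KL[p\\|u]$, the concavity inequality for the entropy, and the joint convexity inequality for the KL divergence. The helpers `entropy` and `kl_div` are defined here only for this check and use natural logarithms.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
 "import numpy as np\n",
 "\n",
 "rng = np.random.default_rng(0)\n",
 "\n",
 "def entropy(p):\n",
 "    # Shannon entropy in nats, with the convention 0*log(0) = 0\n",
 "    p = p[p > 0]\n",
 "    return -np.sum(p * np.log(p))\n",
 "\n",
 "def kl_div(p, q):\n",
 "    # KL divergence KL[p||q] in nats; assumes q > 0 wherever p > 0\n",
 "    mask = p > 0\n",
 "    return np.sum(p[mask] * np.log(p[mask] / q[mask]))\n",
 "\n",
 "N, lam = 5, 0.3\n",
 "u = np.full(N, 1.0 / N)  # discrete uniform distribution on N outcomes\n",
 "p1, p2 = rng.dirichlet(np.ones(N)), rng.dirichlet(np.ones(N))\n",
 "q1, q2 = rng.dirichlet(np.ones(N)), rng.dirichlet(np.ones(N))\n",
 "\n",
 "# Identity derived above: log(N) - H(p) = KL[p||u]\n",
 "print(np.log(N) - entropy(p1), kl_div(p1, u))\n",
 "\n",
 "# Concavity of the entropy\n",
 "mix_H = entropy(lam * p1 + (1 - lam) * p2)\n",
 "print(mix_H >= lam * entropy(p1) + (1 - lam) * entropy(p2))\n",
 "\n",
 "# Joint convexity of the KL divergence\n",
 "lhs = kl_div(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)\n",
 "rhs = lam * kl_div(p1, q1) + (1 - lam) * kl_div(p2, q2)\n",
 "print(lhs <= rhs)"
]
},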
{
"cell_type": "markdown",
"metadata": {},
Expand Down
4 changes: 2 additions & 2 deletions OptimalTransportWasserteinDistance.ipynb
@@ -559,7 +559,7 @@
"metadata": {},
"source": [
"### OT and statistical concepts\n",
"Some of the basics to understand the following statements can be found in the notebook \"InformationTheoryOptimization\"\n",
"Some of the basics to understand the following statements can be found in the notebook \"InformationTheoryOptimization\" this part is also partly a direct reproduction of Marco Cuturi famous article \"Sinkhorn Distances: Lightspeed Computation of Optimal Transport\"\n",
"\n",
"I would like to stop and mention that as we now interpret $P$ as a joint probability matrix, we can define its entropy, the marginal probabiilty entropy, and KL-divergence between two different transportation matrix. These takes the form of\n",
"\n",
@@ -585,7 +585,7 @@
 KL(P\|rc^T) = h(r) + h(c) - h(P)
"\\end{align*}\n",
"\n",
"This quantity is also the mutual information $I(X\\|Y)$ of two random variables $(X, Y)$ should they follow the joint probability $P$ (Cover and Thomas, 1991, §2). Hence, the set of tables P whose Kullback-Leibler divergence to rcT is constrained to lie below a certain threshold can be interpreted as the set of joint probabilities P in U (r, c) which have sufficient entropy with respect to h(r) and h(c), or small enough mutual information. For reasons that will become clear in Section 4, we call the quantity below the Sinkhorn distance of r and c:"
"This quantity is also the mutual information $I(X\\|Y)$ of two random variables $(X, Y)$ should they follow the joint probability $P$ . Hence, the set of tables P whose Kullback-Leibler divergence to rcT is constrained to lie below a certain threshold can be interpreted as the set of joint probabilities P in U (r, c) which have sufficient entropy with respect to h(r) and h(c), or small enough mutual information. For reasons that will become clear in Section 4, we call the quantity below the Sinkhorn distance of r and c:"
]
},
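{
"cell_type": "markdown",
"metadata": {},
"source": [
 "As a quick illustration (a small sketch added here, not taken from Cuturi's paper), the cell below draws an arbitrary joint probability matrix P, computes its marginals r and c, and checks numerically that $KL(P\|rc^T) = h(r) + h(c) - h(P)$, i.e. that this quantity is the mutual information of the coupling. The helper h is defined only for this check and uses natural logarithms.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
 "import numpy as np\n",
 "\n",
 "rng = np.random.default_rng(0)\n",
 "\n",
 "# An arbitrary joint probability matrix P (non-negative entries summing to 1)\n",
 "P = rng.random((4, 6))\n",
 "P /= P.sum()\n",
 "r, c = P.sum(axis=1), P.sum(axis=0)  # its marginals\n",
 "\n",
 "def h(p):\n",
 "    # Shannon entropy (in nats) of a probability vector or matrix\n",
 "    p = np.ravel(p)\n",
 "    p = p[p > 0]\n",
 "    return -np.sum(p * np.log(p))\n",
 "\n",
 "# KL(P || r c^T); all entries of P are strictly positive here\n",
 "kl = np.sum(P * np.log(P / np.outer(r, c)))\n",
 "\n",
 "print(kl, h(r) + h(c) - h(P))  # the two values coincide"
]
},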
},
{
Expand Down
