This is a tutorial on exponential families. It is primarily based on Chapter 3 of [Graphical Models, Exponential Families, and Variational Inference](http://dx.doi.org/10.1561/2200000001) (called GMEFV below), as well as parts of Section 2.4 of PRML and Chapter 9 of MLAPP.

## Notation

The following notations are commonly used in definitions of the exponential family.

* **sufficient statistics** $u(x)$ (PRML), $\phi(x)$ (GMEFV), or $T(x)$.
* **natural parameters** $\eta$ or $\theta$.
* **partition function**: there are several versions of it.
    1. $Z(\theta)$. This is the integral of the unnormalized density. Boltzmann machines usually use this notation, see <https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine>.
    2. $g(\eta)$. This is $1/Z$. PRML uses this notation.
    3. $A(\theta)$. This is $\log Z$ (the log-partition function).
* **base measure**: either $h(x)$ or $\nu$. In the latter case it is simply absorbed into the integral, as in GMEFV. Essentially, this term weights different $x$ differently. It does not seem to affect most of the analysis or properties of exponential families, which may be why few people discuss it.
    * When using $\nu$ (for example in Eq. (3.6) of GMEFV), we write $\nu(dx)$ instead of $dx$, reflecting that we integrate with respect to the measure $\nu$ rather than the (usual) Lebesgue measure. For more background, see <https://www.youtube.com/user/mathematicalmonk/>, which has video lectures on measure theory.
    * Before knowing about measure theory, I also thought about the meaning of this base measure. My question is copied below from <https://math.stackexchange.com/questions/735916/whats-the-role-of-hx-base-measure-in-the-definition-of-exponential-family>.

### My thoughts on base measure

While the correct definition of an exponential family is

$$
f_X(x\mid\theta) = h(x) \exp \left (\eta(\theta) \cdot T(x) -A(\theta)\right ),
$$

many of the materials I have read don't pay much attention to $h(x)$. Sometimes, authors just drop this term.

Can somebody tell me more about the meaning of this $h(x)$ term? The most detailed description about this term I've found is "simply reflects the underlying measure w.r.t. which $p(x\mid\theta)$ is a density." (See http://stat-www.berkeley.edu/pub/users/mjwain/Fall2012_Stat241a/reader_ch8.pdf). Also, some people call it "base measure".

I kind of understand that $h(x)$ may not be important because it's trivial in most cases (1 or a constant like $1/\sqrt{2\pi}$), but for some distributions, e.g. Poisson, this term is nontrivial ($1/x!$), and it damps the exponential term considerably.

My understanding of this term is: 1) it assigns a base probability to each element in the space of $x$, and the exponential term modifies this base probability; 2) it's there so that we don't get an unbounded log-partition function $A(\theta)$ in cases like Poisson.

Can anybody tell me more about this $h(x)$? Any comments on my understanding of it are welcome.
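
As a quick numerical check of the Poisson case mentioned above, here is a minimal sketch (function and variable names are my own, chosen for illustration). It assumes the usual exponential family form of the Poisson with $h(x) = 1/x!$, $T(x) = x$, $\theta = \log\lambda$ and $A(\theta) = e^\theta$, and compares it with the textbook pmf. It also shows that the normalizer converges with $h(x)$, while the partial sums without $h(x)$ keep growing when $\theta \ge 0$.

```python
import math

def poisson_pmf(x, lam):
    """Textbook Poisson pmf: lam^x * exp(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x, theta):
    """Same pmf written as h(x) * exp(theta * T(x) - A(theta)), assuming
    h(x) = 1/x!, T(x) = x, theta = log(lam), A(theta) = exp(theta)."""
    h = 1.0 / math.factorial(x)
    return h * math.exp(theta * x - math.exp(theta))

lam = 3.5                      # illustrative rate
theta = math.log(lam)
max_gap = max(abs(poisson_pmf(x, lam) - poisson_expfam(x, theta))
              for x in range(30))

# With h(x) = 1/x!, sum_x h(x) exp(theta * x) converges to exp(exp(theta)).
Z_with_h = sum(math.exp(theta * x) / math.factorial(x) for x in range(100))

# Without h(x), the partial sums of sum_x exp(theta * x) keep growing for theta >= 0.
partial_sums_without_h = [sum(math.exp(theta * x) for x in range(n))
                          for n in (10, 20, 40)]
```

This is consistent with both readings in the question: $h(x)$ acts as a base weight on each $x$, and here it is also what keeps $A(\theta)$ finite for every real $\theta$.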

## The subset of exponential families with all good properties.

According to pp. 40 of GMEFV, in practice we focus on exponential families with the following properties.

1. Regular: the feasible natural parameters form an open set. **ALL LATER ANALYSIS ASSUMES REGULARITY**.
2. Minimal: no nontrivial linear combination of the sufficient statistics is constant over all choices of $x$. If this holds, it is easy to show by contradiction that different natural parameters must give different distributions.
    * One common example of a non-minimal exponential family is the multinomial distribution over $N$ categories with $N$ parameters (additionally requiring $\sum_k \exp(\eta_k) = 1$, as in pp. 114 of PRML), instead of $N-1$.
        * There are two problems with this. First, the natural parameters are not independent, and this dependency makes partial differentiation impossible (you can't fix the others and perturb one). It also means that the feasible natural parameters do not form an open set. But this constraint is purely arbitrary: even if you don't force the $\exp(\eta_k)$ to sum to 1, the distribution can still be normalized. So the partition function is really not a constant. See pp. 5 of <https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf> for more detail.
        * The second problem is that it's actually not minimal. By the definition of minimality (using the notation in pp. 40 of GMEFV), we can pick $\alpha$ to be the all-ones vector. This is the same redundancy as in the softmax function used in deep learning frameworks.
        * One tricky part about the multinomial distribution is that although it's multidimensional, the dimensions are not independent: one of them must be 1, and the others zero. But this is not a problem, since we can choose the base measure $\nu$ to put nonzero mass only on these points.
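
The softmax redundancy mentioned above can be seen in a few lines. This is only an illustrative sketch (the parameter values are made up). Shifting all $N$ natural parameters along the all-ones direction leaves the distribution unchanged, and pinning one parameter to 0 gives a representation with $N-1$ free parameters.

```python
import math

def softmax(eta):
    """Category probabilities exp(eta_k) / sum_j exp(eta_j)."""
    m = max(eta)                      # subtract the max for numerical stability
    w = [math.exp(e - m) for e in eta]
    s = sum(w)
    return [v / s for v in w]

eta = [0.3, -1.2, 2.0]                # illustrative natural parameters
shifted = [e + 5.0 for e in eta]      # move along the all-ones direction
p, p_shifted = softmax(eta), softmax(shifted)

# Minimal version: pin the last parameter to 0, leaving N - 1 free parameters.
eta_minimal = [e - eta[-1] for e in eta]
p_minimal = softmax(eta_minimal)
```

All three parameter vectors give the same probabilities, which is exactly the failure of "different natural parameters give different distributions" in the non-minimal representation.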

### About the power of high order Ising model

In Example 3.1 of GMEFV, the authors claim that the Ising model (for arbitrary order $k$) is minimal. Suppose this is true. Then clearly, as claimed at the end of Example 3.1, when $k=n$ the Ising model can represent all distributions (in which no configuration has zero probability), since we then have $2^n-1$ natural parameters to solve for (count with binomial coefficients, or expand $\prod_i (x_i + 1)$ to see this), and $2^n$ equations corresponding to the $2^n$ configurations. The Ising model being minimal implies that the $2^n \times (2^n-1)$ matrix $A$, holding the $2^n-1$ sufficient statistics evaluated at the $2^n$ configurations, has full column rank. In fact, by the definition of minimality, you can also concatenate a column of all ones (along `axis=1` in numpy) and the result still has full rank. Suppose we want to model a distribution $p$ with $\sum_{i=1}^{2^n} p_i = 1$. Then we can find natural parameters by solving $[A, 1] [\theta; \theta'] = \log(p)$, and this $\theta$ is the one we want. Notice that in this case the log of the partition function is $-\theta'$, the negative of the extra natural parameter for the additional column. This follows easily from the definition of the partition function.
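
The construction above can be sketched with numpy for a small case. This is only an illustration under my own assumptions: $n = 3$, nodes take values in $\{0, 1\}$, the target $p$ is a random strictly positive distribution, and all variable names are made up.

```python
import itertools
import numpy as np

n = 3
configs = list(itertools.product([0, 1], repeat=n))        # 2^n configurations
subsets = [s for k in range(1, n + 1)
           for s in itertools.combinations(range(n), k)]   # 2^n - 1 non-empty subsets

# A[c, s] = product of x_i over i in subset s, evaluated at configuration c.
A = np.array([[np.prod([x[i] for i in s]) for s in subsets] for x in configs],
             dtype=float)
A_aug = np.concatenate([A, np.ones((len(configs), 1))], axis=1)  # append all-ones column

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(len(configs)))   # random strictly positive target distribution

sol = np.linalg.solve(A_aug, np.log(p))    # stacked vector [theta; theta']
theta, theta_prime = sol[:-1], sol[-1]

log_Z = np.log(np.exp(A @ theta).sum())    # log partition function of the fitted model
p_model = np.exp(A @ theta - log_Z)
```

If the reasoning above is right, `p_model` should reproduce `p` and `log_Z` should equal `-theta_prime` up to floating point error.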

#### Why full order Ising model is minimal, even with an additional constant bias sufficient statistic.

You can show this by induction. For any order-$n$ Ising model with $n$ nodes, we claim that its sufficient statistics, plus a constant bias 1, are always linearly independent (by the definition of minimality above this is the same as the model being minimal, and it is stronger than linear independence of the sufficient statistics alone, without the bias term).

1. The base case is $n=1$, which can be verified by inspection.
2. Suppose the claim holds for the $(n-1)$-node Ising model of order $n-1$, i.e. its augmented sufficient statistics are linearly independent. Then consider an $n$-node Ising model of order $n$.
    * Notice that we can write all the sufficient statistics (plus the one) as the expansion of $\prod_i (x_i + 1) = (x_n + 1) \prod_{i=1}^{n-1} (x_i + 1) = x_n\prod_{i=1}^{n-1} (x_i + 1) + \prod_{i=1}^{n-1} (x_i + 1)$.
    * Therefore, the (augmented) sufficient statistics can be decomposed into two parts: the $2^{n-1}$ original terms of the $(n-1)$-node model, and $2^{n-1}$ terms obtained by multiplying the original terms by the additional node $x_n$. Suppose we can find coefficients $\alpha_i$ that make all $2^{n}$ terms sum to zero for all $2^n$ configurations. Setting $x_n=0$, the second half vanishes, so the first half of the $\alpha_i$ linearly combines the original terms to zero; by the induction hypothesis, this half of the $\alpha_i$ must be zero. Setting $x_n=1$, what remains is again a linear combination of the original terms, so by the induction hypothesis the other half of the $\alpha_i$ must be zero as well.
    
Thus the configuration matrix $A$, and also $[A, 1]$, of a full-order Ising model has full rank. Since the configuration matrix of a lower-order Ising model is a subset of the columns of the full one, it must have full column rank as well, which implies minimality.
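
The rank claim can also be checked by brute force for small models. The sketch below is my own (hypothetical helper `ising_features`, nodes in $\{0, 1\}$). It builds the augmented configuration matrix for every order up to $n$ and tests whether it has full column rank. It only covers $n \le 4$, so it is a sanity check rather than a proof.

```python
import itertools
import numpy as np

def ising_features(n, order):
    """Rows: all 2^n configurations in {0,1}^n. Columns: products of x_i over
    every non-empty subset of at most `order` nodes, plus a final all-ones column."""
    configs = list(itertools.product([0, 1], repeat=n))
    subsets = [s for k in range(1, order + 1)
               for s in itertools.combinations(range(n), k)]
    rows = [[float(all(x[i] for i in s)) for s in subsets] + [1.0]
            for x in configs]
    return np.array(rows)

full_rank = {}
for n in range(1, 5):
    for order in range(1, n + 1):
        F = ising_features(n, order)
        # Full column rank means no nontrivial combination of the columns vanishes.
        full_rank[(n, order)] = bool(np.linalg.matrix_rank(F) == F.shape[1])
```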


#### Further understanding of high order Ising model

Wenhao suggests having a look at [Information geometry on hierarchy of probability distributions](http://dx.doi.org/10.1109/18.930911), which talks about orthogonal decomposition of dependencies of different orders.

## Properties of partition function.

The most important property of the partition function $A(\theta)$ is its derivatives. See Proposition 3.1 of GMEFV, or Eq. (2.226) of PRML. For parameter estimation, the corresponding equations are Eq. (3.38) of GMEFV, or Eq. (2.228) of PRML.

Another property is that the partition function is convex, and strictly convex for a minimal representation. Notice that a function being convex requires its domain (here $\Omega$) to be convex. See <https://www.quora.com/Does-the-domain-of-a-convex-function-have-to-be-a-convex-set>.
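
To make the derivative property concrete, here is a small numerical sketch for the Bernoulli family, where $A(\theta) = \log(1 + e^\theta)$ and $T(x) = x$. It uses finite differences (the step size `eps` and the test point are arbitrary choices of mine) to compare the first derivative of $A$ with the mean and the second derivative with the variance.

```python
import math

def A(theta):
    """Log-partition function of the Bernoulli family: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

theta, eps = 0.7, 1e-4                      # arbitrary test point and step size
mean = 1.0 / (1.0 + math.exp(-theta))       # E[T(x)] = E[x]
variance = mean * (1.0 - mean)              # Var[x]

# Central finite differences for the first and second derivatives of A.
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps ** 2
```

The second derivative being a variance (hence nonnegative) is the one-dimensional version of the convexity statement, and it is strictly positive here because this Bernoulli representation is minimal.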

## Correspondence between natural parameters and mean parameters.

Proposition 3.2 and Theorem 3.3 basically establish the **one-to-one correspondence** between the **interior** of the set of mean parameters and all natural parameters. This also tells us that, as long as the mean parameters computed from a collected sample are in the interior, we can always find a unique member of the exponential family matching those mean parameters.

Indeed, we can only achieve the interior. For example, a Bernoulli distribution with mean 0 or 1 can't be expressed as an exponential family distribution.

Notice that the space of all realizable mean parameters $\mathcal{M}$ is convex, as discussed in pp. 54 (or after Example 3.7) of GMEFV.

This one-to-one correspondence is also the reason we can use mean parameters to specify a GLM.
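
For the Bernoulli example, the two directions of the correspondence are the logit and the logistic sigmoid. The sketch below (function names are mine) checks the round trip on interior means and shows the natural parameter growing without bound as the mean approaches the boundary value 1.

```python
import math

def natural_from_mean(mu):
    """Bernoulli: theta = log(mu / (1 - mu)), defined only for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))

def mean_from_natural(theta):
    """Bernoulli: mu = dA/dtheta = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))

interior_means = (0.1, 0.5, 0.9)
round_trip = [mean_from_natural(natural_from_mean(mu)) for mu in interior_means]

# Approaching the boundary mu -> 1 pushes theta towards +infinity.
thetas_near_boundary = [natural_from_mean(1.0 - 10.0 ** (-k)) for k in (2, 4, 8)]
```

At exactly $\mu = 1$ the logit is undefined (division by zero), matching the statement that no finite natural parameter realizes a boundary mean.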

## Conjugate Duality

In GMEFV, we can see that (3.42) is the same as (3.38), which means that when we do maximum likelihood estimation, we are also computing the conjugate. The optimal value of this conjugate is also related to the entropy of the distribution.

We can write $A^*$ in a variational manner (I mean in some optimization form), and Theorem 3.4 shows that we can write $A$ in such a manner as well.

In Eq. (3.45) of GMEFV, the supremum is taken over $\mu \in \mathcal{M}$ rather than $\mu \in \overline{\mathcal{M}}$. I believe this is fine: $A^*$ is convex and thus continuous in the interior, and the text below Eq. (3.44) says that all boundary points are limits of interior points, so the supremum (which is what we need for computing the conjugate of the conjugate) is the same whether we search only the interior or the whole domain. In addition, Eq. (3.46) explicitly says that the supremum is attained at interior points. For the continuity of a 1D convex function on an open domain, see <http://math.stackexchange.com/questions/258511/proof-of-every-convex-function-is-continuous>. Essentially, using the inequality there, we first prove that the left and right derivatives both exist (one by constructing a bounded decreasing sequence, the other by a bounded increasing sequence); the function is then left- and right-continuous, and thus continuous.

$A$ and $A^*$ both have nice properties: both are convex (thus having convex domains) and differentiable, and their derivatives are the mappings between mean parameters ($\mathcal{M}^\circ$) and natural parameters ($\Omega$), as shown in Figure 3.8 of GMEFV.
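
As a sanity check of the link between the conjugate and entropy, here is a sketch for the Bernoulli family (my own choice of example, with made-up names). It computes $A^*(\mu) = \sup_\theta \{\theta\mu - A(\theta)\}$ in two ways: by plugging in the $\theta$ whose mean is $\mu$, and by a crude grid search. It then compares both with the negative entropy $\mu\log\mu + (1-\mu)\log(1-\mu)$.

```python
import math

def A(theta):
    """Bernoulli log-partition function."""
    return math.log1p(math.exp(theta))

def neg_entropy(mu):
    """Negative entropy of a Bernoulli distribution with mean mu."""
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

def conjugate_by_grid(mu, lo=-20.0, hi=20.0, steps=40001):
    """Approximate sup_theta { theta * mu - A(theta) } by a dense grid search."""
    best = -math.inf
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        best = max(best, theta * mu - A(theta))
    return best

mu = 0.3                                      # an interior mean parameter
theta_star = math.log(mu / (1 - mu))          # maximizer: the theta whose mean is mu
dual_closed_form = theta_star * mu - A(theta_star)
dual_grid = conjugate_by_grid(mu)
```

The grid search is only there to show that the closed-form value really is the supremum. It is not meant as a practical way to compute $A^*$.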

## Why exponential family is great.

Section 9.2 of MLAPP gives a lot of reasons why it's great.

> * It can be shown that, under certain regularity conditions, the exponential family is the only family of distributions with finite-sized sufficient statistics, meaning that we can compress the data into a fixed-sized summary without loss of information. This is particularly useful for online learning, as we will see later.
> * The exponential family is the only family of distributions for which conjugate priors exist, which simplifies the computation of the posterior (see Section 9.2.5).
> * The exponential family can be shown to be the family of distributions that makes the least set of assumptions subject to some user-chosen constraints (see Section 9.2.6).
> * The exponential family is at the core of generalized linear models, as discussed in Section 9.3.
> * The exponential family is at the core of variational inference, as discussed in Section 21.2.

The first of these properties (finite-sized sufficient statistics) is elaborated in Section 9.2.4 of MLAPP. Notice that it holds "under certain regularity conditions", one of them being fixed support, as stated in [Wikipedia](https://en.wikipedia.org/wiki/Sufficient_statistic#Exponential_family).

> According to the Pitman–Koopman–Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases.

9.2.4 of MLAPP also gives an example on this.
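
To illustrate what a fixed-size summary buys us for online learning, here is a minimal sketch for a univariate Gaussian (this is my own example, not the one in MLAPP; the class name and data are made up). The summary is just the count, $\sum x$ and $\sum x^2$, and the maximum likelihood estimates are recovered from it without revisiting the data.

```python
class GaussianSummary:
    """Fixed-size summary (n, sum of x, sum of x^2) of a stream of scalars."""

    def __init__(self):
        self.n, self.s1, self.s2 = 0, 0.0, 0.0

    def update(self, x):
        """Fold one observation into the summary."""
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def mle(self):
        """Maximum likelihood mean and (biased) variance from the summary alone."""
        mean = self.s1 / self.n
        var = self.s2 / self.n - mean ** 2
        return mean, var

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative stream
summary = GaussianSummary()
for x in data:          # single pass; the summary never stores the raw data
    summary.update(x)
mean, var = summary.mle()
```

The variance formula used here can lose precision when the mean is large relative to the spread, so a production version would probably use a more stable update. The point is only that the summary size does not grow with the sample size.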

## Some distributions are not exponential.

According to [Wikipedia](https://en.wikipedia.org/wiki/Exponential_family#Examples) and many other sources, such as Section 9.2.2.4 of MLAPP, the support of an exponential family distribution must not depend on the parameters. Quoted from <http://www.math.uah.edu/stat/special/GeneralExponential.html>:

> Many of the special distributions studied in this chapter are general exponential families, at least with respect to some of their parameters. On the other hand, most commonly, a parametric family fails to be a general exponential family because the support set depends on the parameter. The following theorems give a number of examples. Proofs will be provided in the individual sections.

Notice that we are talking about whether a "family of distributions" is exponential or not. I think if you go to the extreme and treat each individual uniform distribution as a singleton family, then each of them is an exponential family.

Wikipedia gives additional counterexamples.

> Examples of common distributions that are not exponential families are Student's t, most mixture distributions, and even the family of uniform distributions when the bounds are not fixed.

## Multiple forms of conjugate prior

Here, two additional notes from Michael I. Jordan are useful.

* [8 The exponential family: Basics](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf) [local copy](./exponential_families/chapter8.pdf)
* [9 The exponential family: Conjugate priors](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter9.pdf) [local copy](./exponential_families/chapter9.pdf)

As shown in Eq. (9.5) of Section 9.2.1 of MLAPP, or Eq. (8.4) of Jordan's Chapter 8 note, we can parameterize an exponential family distribution differently, using a one-to-one mapping (bijection) between the canonical parameters and our "normal" parameters. Special cases include the mean parameterization and the natural parameterization.

As shown in 9.2.5.2 of MLAPP, you can define conjugate priors for both parameterizations. But are they different? Based on Jordan's Chapter 9 note, it seems that in most useful cases they are the same. See my notes on pp. 15 and pp. 17 below.

Another important thing to note is that conjugate priors are not unique. See note on pp. 2 below.

### Some other notes about conjugate prior, in Chapter 9 note of Jordan.

#### pp. 2

> In general these two goals are in conflict. For example, the goal of invariance of prior-to-posterior updating (i.e., asking that the posterior remains in the same family of distributions of the prior) can be acheived vacuously by defining the family of all probability distributions, but this would not yield tractable integrals.

So conjugate priors are not unique. For example, in Section 5.4 of MLAPP, the author mentions mixtures of conjugate priors. More examples are given at the end of pp. 2 of <https://www.stat.ubc.ca/~bouchard/courses/stat520-sp2014-15/images/handout_1_expfam.pdf> [local copy](./exponential_families/handout_1_expfam.pdf). Setting the prior family to be all possible distributions makes conjugacy hold vacuously, but it's not tractable.

Notice that here Jordan defines a conjugate prior as both being tractable and having the same form for prior and posterior. I think most definitions only require the latter.

> we can define conjugate priors by mimicking the form of the likelihood.

> From the objective perspective, however, conjugate priors are decidedly dangerous; objective priors aim to maximize the impact of the data on the posterior.

> The general point to be made is that one should take care with conjugate priors.

> it is particularly important to do sensitivity analysis to assess how strongly the posterior is influenced by the prior

Essentially, a conjugate prior is just a mathematical convenience, and may be very wrong in practice.

#### pp. 3

Section 9.0.1 gives an example of how to "define conjugate priors by mimicking the form of the likelihood".
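
Here is a sketch of what "mimicking the form of the likelihood" means in general. The hyperparameter symbols $\tau$ and $n_0$ are my own notation and may not match Jordan's exactly. For $N$ i.i.d. observations from a family in canonical form, the likelihood as a function of $\eta$ is

$$
p(x_{1:N}\mid\eta) \propto \exp \left (\eta \cdot \sum_{n=1}^{N} T(x_n) - N A(\eta)\right ),
$$

so a prior with the same functional form in $\eta$,

$$
p(\eta\mid\tau, n_0) \propto \exp \left (\eta \cdot \tau - n_0 A(\eta)\right ),
$$

gives a posterior of the same form with updated hyperparameters:

$$
p(\eta\mid x_{1:N}) \propto \exp \left (\eta \cdot \Big(\tau + \sum_{n=1}^{N} T(x_n)\Big) - (n_0 + N) A(\eta)\right ).
$$

In this reading, $\tau$ plays the role of the summed sufficient statistics of $n_0$ pseudo-observations.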

#### pp. 15

> We can also obtain conjugate priors for non-canonical representations. In particular, if we replace $\eta$ by $\phi(\theta)$ in Eq. (9.89), then we obtain a conjugate prior for $\theta$ by simply replacing $\eta$ in Eq. (9.91) with $\phi(\theta)$ ... Note that this is not the same prior as would be obtained by applying the change-of-variables procedure to Eq. (9.92). Such a procedure generally yields a well-defined prior, but that prior is not generally a conjugate prior for $\theta$. (We discuss the relationship between these two priors further in Section 9.0.6 below).

"not the same prior as would ... to Eq. (9.92)" should refer to **Eq. (9.91)**. The change-of-variables procedure is guaranteed to work, since it's a proper change of variable and doesn't affect the distribution of our random variable of interest ($X$) at all.

Replacing $\eta$ in Eq. (9.91) amounts to constructing a conjugate prior for $\theta$ directly. This may or may not retain the nice properties of the canonical parameterization (but I guess most of the time it does, and it is then exactly equivalent to the canonical parameterization; see pp. 17).

#### pp. 17

> We thus see that the posterior expectation of the mean is a convex combination of the prior expectation and the maximum likelihood estimate in general for conjugate priors.

This is a very good property of (the default) conjugate prior for the canonical parameterization. I say "default" because conjugate priors are not unique. See my comments for pp. 2.

Here the mean is defined as the expectation of the sufficient statistics vector.
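
The convex combination property is easy to verify in the Beta-Bernoulli case, which I use here purely as an illustration (the prior pseudo-counts and data below are made up). With a Beta($a$, $b$) prior and $k$ successes in $n$ trials, the posterior mean should be a weighted average of the prior mean and the maximum likelihood estimate, with weight $(a+b)/(a+b+n)$ on the prior.

```python
def posterior_mean(a, b, k, n):
    """Posterior mean of the Bernoulli mean under a Beta(a, b) prior,
    after observing k successes in n trials."""
    return (a + k) / (a + b + n)

a, b = 2.0, 3.0          # hypothetical prior pseudo-counts
k, n = 7, 10             # hypothetical data: 7 successes in 10 trials

prior_mean = a / (a + b)
mle = k / n
weight = (a + b) / (a + b + n)                  # weight on the prior, in (0, 1)
combo = weight * prior_mean + (1 - weight) * mle
```

As $n$ grows the weight on the prior shrinks, so the posterior mean moves from the prior mean towards the maximum likelihood estimate.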

However, in practice we often set conjugate prior in terms of other normal parameterizations. Does this property still hold? In addition, what's the relationship between two conjugate priors under two parameterizations?

Jordan also mentions two ways to think about designing a prior for a non-canonical parameterization.

> In choosing a prior for $\theta$ we have two choices. First, we can retain the standard conjugate prior and use the change-of-variables formula ... Alternatively, we can place a standard conjugate prior on $\theta$ directly.

The first approach is the "applying the change-of-variables procedure to Eq. (9.92)" (or "such a procedure") in my quote for pp. 15. This approach may not give a conjugate prior for $\theta$, but it guarantees that the posterior expectation of the mean is a convex combination, since we have only changed the parameterization, which doesn't affect the distribution of the sufficient statistics vector.

The second approach is the one actually taken by many. This is what I mean by "in practice we often set conjugate prior in terms of other normal parameterizations." above. Jordan claims that this approach is equivalent to the canonical one (transforming between the two by a change of variables) if and only if that linear combination property holds.

> By the Diaconis and Ylvisaker (1979) results, we know that we obtain a linear posterior expectation if and only if the resulting prior is the standard conjugate prior.

I think this is probably true for all the cases presented so far and for those in many ML books, which usually don't use the canonical space when defining conjugate priors.