This is a tutorial on exponential families. It is primarily based on Chapter 3 of [Graphical Models, Exponential Families, and Variational Inference](http://dx.doi.org/10.1561/2200000001) (called GMEFV below), as well as parts of Section 2.4 of PRML.

## Notation

The following notations are commonly used in definitions of the exponential family.

* **sufficient statistics** $u(x)$ (PRML), $T(x)$, or $\phi(x)$ (GMEFV).
* **natural parameters** $\eta$ or $\theta$.
* **partition function** appears in several versions.
    1. $Z(\theta)$. This is just the integral (or sum) of the unnormalized density. The Boltzmann machine literature usually uses this notation; see <https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine>.
    2. $g(\eta)$. This is $1/Z$. PRML uses this notation.
    3. $A(\theta)$. This is $\log Z$, the log-partition function.
* **base measure** is written either as $h(x)$ or as $\nu$. In the latter case it is simply absorbed into the integral, as in GMEFV. Essentially, this term weights different $x$ differently. It does not seem to affect most of the analysis or properties of exponential families, which may be why few people discuss it.
    * When using $\nu$ (for example in Eq. (3.6) of GMEFV), we write $\nu(dx)$ instead of $dx$, reflecting that we integrate with respect to the measure $\nu$ rather than the usual Lebesgue measure. For more background, check <https://www.youtube.com/user/mathematicalmonk/>, which has video lectures on measure theory.
    * Before learning about measure theory, I also thought about the meaning of this base measure. My question at the time is copied below from <https://math.stackexchange.com/questions/735916/whats-the-role-of-hx-base-measure-in-the-definition-of-exponential-family>.

### My thoughts on the base measure

While the full definition of an exponential family is

$$
f_X(x\mid\theta) = h(x) \exp \left (\eta(\theta) \cdot T(x) -A(\theta)\right ),
$$

it seems that many materials I have read pay little attention to $h(x)$. Sometimes authors just drop this term.

Can somebody tell me more about the meaning of this $h(x)$ term? The most detailed description of this term I have found is that it "simply reflects the underlying measure w.r.t. which $p(x\mid\theta)$ is a density" (see <http://stat-www.berkeley.edu/pub/users/mjwain/Fall2012_Stat241a/reader_ch8.pdf>). Some people also call it the "base measure".

I kind of understand that $h(x)$ may not be important because it is trivial in most cases (1, or a constant like $1/\sqrt{2\pi}$), but for some distributions, e.g. Poisson, this term is nontrivial ($1/x!$) and it damps the exponential term very strongly.

My understanding of this term is: 1) it assigns a base probability to each element in the space of $x$, and the exponential term modifies this base probability; 2) it is there so that the log-partition function $A(\theta)$ does not become unbounded, in cases like Poisson.

Can anybody tell me more about this $h(x)$? Any comment on my understanding of it is welcome.
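
As a concrete illustration of the Poisson case mentioned above, here is a minimal sketch that writes the Poisson pmf in exponential family form and compares the three normalizer notations ($Z$, $g$, $A$). It assumes the standard parameterization $h(x) = 1/x!$, $T(x) = x$, $\theta = \log \lambda$, $A(\theta) = e^\theta$; the function names and the rate value are made up for illustration.

```python
import math

def poisson_pmf_expfam(x, theta):
    """Poisson pmf written as h(x) * exp(theta * T(x) - A(theta))."""
    h = 1.0 / math.factorial(x)   # base measure h(x) = 1/x!
    T = x                         # sufficient statistic T(x) = x
    A = math.exp(theta)           # log-partition A(theta) = exp(theta), i.e. the rate
    return h * math.exp(theta * T - A)

def poisson_pmf_direct(x, rate):
    """Textbook form: rate^x * exp(-rate) / x!."""
    return rate ** x * math.exp(-rate) / math.factorial(x)

rate = 3.5                        # illustrative value, not from the text
theta = math.log(rate)            # natural parameter

# The three normalizer notations, computed from a truncated sum over x.
Z = sum(math.exp(theta * x) / math.factorial(x) for x in range(100))
g = 1.0 / Z
A = math.log(Z)

# Largest disagreement between the two ways of writing the pmf.
max_gap = max(abs(poisson_pmf_expfam(x, theta) - poisson_pmf_direct(x, rate))
              for x in range(30))

# Dropping h(x): partial sums of exp(theta * x) keep growing when theta >= 0,
# so the normalizer would not be finite without the 1/x! weighting.
partial_sums_no_h = [sum(math.exp(theta * x) for x in range(n)) for n in (10, 20, 40)]
```

With $h(x)$ included, `A` matches $e^\theta$ and `g` matches $e^{-\lambda}$; without it, the partial sums grow without bound, which is point 2) of the question above.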

## The subset of exponential families with all good properties.

According to p. 40 of GMEFV, in practice we focus on exponential families with the following properties.

1. Regular. The set of feasible natural parameters $\Omega$ is open. **ALL LATER ANALYSIS ASSUMES REGULARITY**.
2. Minimal. No nontrivial linear combination of the sufficient statistics is constant over all choices of $x$. When this holds, it is easy to show by contradiction that different natural parameters must give different distributions.
    * One common example of a non-minimal exponential family is the multinomial distribution over $N$ outcomes using $N$ parameters instead of $N-1$ (with natural parameters $\eta_k = \ln \mu_k$ and the constraint $\sum_k \mu_k = 1$, as on p. 114 of PRML).
        * There are two problems with this. First, the natural parameters are not independent, and this dependency makes partial differentiation impossible (you cannot fix the others and perturb one). It also means that the feasible natural parameters do not form an open set. But the constraint is purely arbitrary: even if the $\mu_k$ do not sum to 1, the distribution can still be normalized, so the partition function is really not a constant. See p. 5 of <https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter8.pdf> for more detail.
        * The second problem is that the representation is not minimal. Using the definition of minimality (in the notation of p. 40 of GMEFV), we can pick $\alpha$ to be the all-one vector, since the one-hot sufficient statistics always sum to 1. This is the same redundancy as in the softmax function used in deep learning frameworks.
        * One tricky part of the multinomial distribution is that, although $x$ is multidimensional, its dimensions are not independent: exactly one of them must be 1 and the others 0. But this is not a problem, since we can choose the base measure $\nu$ to put nonzero mass only on these points.
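
The softmax redundancy can be checked numerically. The sketch below assumes an $N$-state categorical distribution with one-hot sufficient statistics and unconstrained natural parameters; the helper name `categorical_probs` and the parameter values are hypothetical.

```python
import numpy as np

def categorical_probs(theta):
    """Probabilities of an N-state categorical with N natural parameters (softmax)."""
    shifted = theta - np.max(theta)        # subtract the max for numerical stability
    w = np.exp(shifted)
    return w / w.sum()

theta = np.array([0.3, -1.2, 2.0, 0.5])    # illustrative natural parameters
c = 7.0                                    # arbitrary shift

p_original = categorical_probs(theta)
p_shifted = categorical_probs(theta + c)   # move along the all-one direction

# One-hot sufficient statistics: row i is T(x) for outcome i.
T = np.eye(len(theta))
alpha = np.ones(len(theta))
combo = T @ alpha                          # <alpha, T(x)> for every outcome

# A minimal version: pin the last parameter to 0, leaving N-1 free parameters.
theta_minimal = (theta - theta[-1])[:-1]
p_minimal = categorical_probs(np.append(theta_minimal, 0.0))
```

Here `combo` is 1 for every outcome, which is exactly the constant linear combination that violates minimality, and `p_shifted` and `p_minimal` both coincide with `p_original`.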

### About the power of high order Ising model

In Example 3.1 of GMEFV, the authors claim that the Ising model (for arbitrary order $k$) is minimal. If this is true then, as claimed at the end of Example 3.1, for $k=n$ the Ising model can represent every distribution in which no configuration has zero probability. In that case we have $2^n-1$ natural parameters to solve for (check the binomial coefficients, or the expansion of $\prod_i (x_i + 1)$, to see this), and $2^n$ equations corresponding to the $2^n$ configurations. The Ising model being minimal implies that the $2^n \times (2^n-1)$ matrix $A$ of the $2^n-1$ sufficient statistics evaluated at the $2^n$ configurations has full column rank. Even better, since minimality rules out any combination of the columns being constant, you can append an all-one column (concatenate with `axis=1` in numpy) and the result $[A, 1]$ is still full rank.

Suppose we want to model a distribution $p$ with $p_i > 0$ and $\sum_{i=1}^{2^n} p_i = 1$. We can find a set of natural parameters by solving $[A, 1] [\theta; \theta'] = \log(p)$, and the resulting $\theta$ is the one we want. Notice that in this case the log-partition function equals $-\theta'$, the negative of the additional parameter for the all-one column. You can show this easily from the definition of the partition function.
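
The linear solve described above can be sketched in numpy for a small case. This assumes $x_i \in \{0, 1\}$ and takes the sufficient statistics to be the products of $x_i$ over every nonempty subset of nodes; the function name `full_order_statistics`, the choice $n=3$, and the random target distribution are illustrative assumptions.

```python
import itertools
import numpy as np

def full_order_statistics(n):
    """Rows: the 2^n configurations in {0,1}^n. Columns: products of x_i over
    every nonempty subset of nodes, i.e. 2^n - 1 sufficient statistics."""
    configs = np.array(list(itertools.product([0, 1], repeat=n)))
    subsets = [s for r in range(1, n + 1)
               for s in itertools.combinations(range(n), r)]
    A = np.stack([configs[:, list(s)].prod(axis=1) for s in subsets], axis=1)
    return configs, A

n = 3
configs, A = full_order_statistics(n)

rng = np.random.default_rng(0)
p = rng.random(2 ** n) + 0.1     # strictly positive target distribution
p /= p.sum()

# Solve [A, 1] [theta; theta'] = log(p).
A_aug = np.concatenate([A, np.ones((2 ** n, 1))], axis=1)
solution = np.linalg.solve(A_aug, np.log(p))
theta, theta_extra = solution[:-1], solution[-1]

# Recompute the log-partition function from its definition and rebuild p.
log_partition = np.log(np.exp(A @ theta).sum())
p_model = np.exp(A @ theta - log_partition)
```

If the argument holds, `p_model` reproduces `p` and `log_partition` equals `-theta_extra`.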

#### Why full order Ising model is minimal, even with an additional constant bias sufficient statistic.

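This can be shown by induction. For any order-$n$ Ising model with $n$ nodes, we claim that the sufficient statistics, plus a constant bias 1, are always linearly independent. By the definition of minimality above this is exactly what minimality requires, since a combination of sufficient statistics that equals a constant becomes a linear dependency once the bias term is included; it is stronger than linear independence of the sufficient statistics alone.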

1. The base case is $n=1$, which can be checked by inspection: the augmented statistics $(x_1, 1)$ equal $(0, 1)$ at $x_1=0$ and $(1, 1)$ at $x_1=1$.
2. Suppose the claim holds for the $(n-1)$-node Ising model of order $n-1$. Then consider an $n$-node Ising model of order $n$.
    * Notice that we can write all the sufficient statistics (plus the one) as the terms in the expansion of $\prod_i (x_i + 1) = (x_n + 1) \prod_{i=1}^{n-1} (x_i + 1) =  x_n\prod_{i=1}^{n-1} (x_i + 1) + \prod_{i=1}^{n-1} (x_i + 1)$.
    * Therefore the (augmented) sufficient statistics decompose into two parts: the $2^{n-1}$ original terms of the $(n-1)$-node model, and $2^{n-1}$ terms obtained by multiplying the original terms by the additional node $x_n$. Suppose we can find coefficients $\alpha_i$ that make all $2^{n}$ terms sum to zero for all $2^n$ configurations of $x_i$. Setting $x_n=0$ removes every term containing $x_n$, so the half of the $\alpha_i$ attached to the original terms combines them to zero, and by the induction hypothesis this half must be zero. Setting $x_n=1$ then turns the remaining terms into the original terms, so by the induction hypothesis again the other half of the $\alpha_i$ must be zero.
    
Thus the configuration matrix $[A, 1]$ (and hence $A$) of a full-order Ising model has full rank. Since the configuration matrix of a lower-order Ising model consists of a subset of the columns of the full one, it must have full column rank as well, which implies minimality.
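
A brute-force rank check is a cheap sanity test of the induction argument for small models. The sketch below again assumes $x_i \in \{0,1\}$; `augmented_statistics` is a hypothetical helper that builds $[A, 1]$ for interactions up to a given order.

```python
import itertools
import numpy as np

def augmented_statistics(n, order):
    """[A, 1] for an n-node Ising model with interactions up to `order`."""
    configs = np.array(list(itertools.product([0, 1], repeat=n)))
    columns = [configs[:, list(s)].prod(axis=1)
               for r in range(1, order + 1)
               for s in itertools.combinations(range(n), r)]
    columns.append(np.ones(2 ** n, dtype=int))   # constant bias column
    return np.stack(columns, axis=1)

# Check full column rank for every order up to n, for n = 1..4.
full_column_rank = {}
for n in range(1, 5):
    for order in range(1, n + 1):
        M = augmented_statistics(n, order)
        full_column_rank[(n, order)] = bool(np.linalg.matrix_rank(M) == M.shape[1])
```

This only covers the sizes it enumerates, so it complements the proof rather than replacing it.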


#### Further understanding of high order Ising model

Wenhao suggests having a look at [Information geometry on hierarchy of probability distributions](http://dx.doi.org/10.1109/18.930911), which talks about orthogonal decomposition of dependencies of different orders.

## Properties of partition function.

The most important property of the log-partition function $A(\theta)$ is its derivatives: the gradient gives the mean of the sufficient statistics and the Hessian gives their covariance. See Proposition 3.1 of GMEFV, or Eq. (2.226) of PRML. For parameter estimation, the corresponding equations are Eq. (3.38) of GMEFV and Eq. (2.228) of PRML.

Another property is that $A$ is convex, and strictly convex for a minimal representation. Notice that convexity of a function requires its domain (here $\Omega$) to be convex. See <https://www.quora.com/Does-the-domain-of-a-convex-function-have-to-be-a-convex-set>.
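
Both properties are easy to check numerically on a one-dimensional example. The sketch below uses the Bernoulli distribution with $T(x) = x$, for which $A(\theta) = \log(1 + e^\theta)$, and compares finite-difference derivatives of $A$ with the mean and variance of $x$. The evaluation point and step size are arbitrary choices.

```python
import numpy as np

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for a Bernoulli with T(x) = x."""
    return np.logaddexp(0.0, theta)

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

theta = 0.7     # arbitrary evaluation point
eps = 1e-4      # finite-difference step

# Central finite differences for A'(theta) and A''(theta).
dA = (log_partition(theta + eps) - log_partition(theta - eps)) / (2 * eps)
d2A = (log_partition(theta + eps) - 2 * log_partition(theta)
       + log_partition(theta - eps)) / eps ** 2

mean = sigmoid(theta)              # E[T(x)]
variance = mean * (1.0 - mean)     # Var[T(x)]
```

The first derivative matches the mean, and the second derivative matches the variance, which is positive; that positivity is the one-dimensional version of strict convexity.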

## Correspondence between natural parameters and mean parameters.

Proposition 3.2 and Theorem 3.3 of GMEFV establish, for a minimal representation, the **one-to-one correspondence** between the **interior** of the set of mean parameters and the set of all natural parameters. This also tells us that, as long as the mean parameters computed from a collected sample lie in the interior, we can always find a unique member of the exponential family matching those mean parameters.

Indeed, we can only reach the interior. For example, a Bernoulli distribution with mean 0 or 1 cannot be expressed as a member of the exponential family, since it would require an infinite natural parameter.
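
The Bernoulli example can be made concrete. In the sketch below (function names are illustrative), the map from mean to natural parameter is the logit and its inverse is the sigmoid; the natural parameter grows without bound as the mean approaches 0 or 1.

```python
import numpy as np

def mean_to_natural(mu):
    """Inverse of the gradient map for a Bernoulli: theta = log(mu / (1 - mu))."""
    return np.log(mu) - np.log1p(-mu)

def natural_to_mean(theta):
    """Gradient map: mu = A'(theta) = sigmoid(theta)."""
    return 1.0 / (1.0 + np.exp(-theta))

# Means approaching the boundary of [0, 1] from the interior.
mus = np.array([1e-9, 1e-3, 0.5, 1 - 1e-3, 1 - 1e-9])
thetas = mean_to_natural(mus)
round_trip = natural_to_mean(thetas)
```

Every interior mean round-trips through a finite `theta`, while the boundary values 0 and 1 are only approached as `theta` tends to minus or plus infinity.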

Notice that the set $\mathcal{M}$ of all realizable mean parameters is convex, as discussed on p. 54 (after Example 3.7) of GMEFV.

This one-to-one correspondence is also the reason we can use the mean parameter to specify a GLM.

## Conjugate Duality

In GMEFV, Eq. (3.42) is the same as Eq. (3.38), which means that when we do maximum likelihood estimation we are also computing the conjugate $A^*$. The optimal value of this conjugate is also related to the entropy of the distribution: it is the negative entropy.

On one hand, we can write $A^*$ in a variational manner (that is, as the solution of an optimization problem). On the other hand, Theorem 3.4 shows that we can write $A$ in such a manner as well.
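
Both variational forms can be approximated by a crude grid search in the Bernoulli case, where $A(\theta) = \log(1 + e^\theta)$ and the conjugate has the closed form $A^*(\mu) = \mu \log \mu + (1-\mu)\log(1-\mu)$, the negative entropy. The grids and test points below are arbitrary, and grid search is only a stand-in for a proper optimizer.

```python
import numpy as np

def A(theta):
    """Bernoulli log-partition function."""
    return np.logaddexp(0.0, theta)

def neg_entropy(mu):
    """Closed form of A*(mu) for mu in (0, 1)."""
    return mu * np.log(mu) + (1.0 - mu) * np.log1p(-mu)

theta_grid = np.linspace(-20.0, 20.0, 40001)
mu_grid = np.linspace(1e-6, 1.0 - 1e-6, 40001)

mu = 0.3
# A*(mu) = sup_theta { mu * theta - A(theta) }, approximated on the grid.
objective = mu * theta_grid - A(theta_grid)
A_star_numeric = np.max(objective)
theta_hat = theta_grid[np.argmax(objective)]   # maximizer, close to logit(mu)

theta = 1.1
# A(theta) = sup_mu { theta * mu - A*(mu) }, approximated on the grid.
A_numeric = np.max(theta * mu_grid - neg_entropy(mu_grid))
```

`A_star_numeric` agrees with the negative entropy at `mu`, `theta_hat` agrees with the maximum likelihood solution $\log(\mu / (1-\mu))$, and `A_numeric` recovers $A(\theta)$ from its conjugate.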

In Eq. (3.45) of GMEFV, we don't write $\mu \in \overline{\mathcal{M}}$, but $\mu \in \mathcal{M}$. I believe this is OK, as $A^*$ is convex and thus continuous in the interior, so the supremum (which is what we need for computing the conjugate of the conjugate) is not affected. For continuity of a 1D convex function on an open domain, check <http://math.stackexchange.com/questions/258511/proof-of-every-convex-function-is-continuous>. Essentially, using that inequality, we first prove that the left and right derivatives both exist, one by constructing a bounded decreasing sequence and the other by a bounded increasing sequence; the function is then left and right continuous, and thus continuous.

$A$ and $A^*$ both have nice properties: both are convex (thus having convex domains) and differentiable, and their derivatives are the mappings between the mean parameters ($\mathcal{M}^\circ$) and the natural parameters ($\Omega$), as shown in Figure 3.8 of GMEFV.