# First

[Linear regression](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm001_multivariable_linear_regression_gradient_descent/multivariable_linear_regression_gradient_descent.html), [logistic regression](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm007_logistic_regression%28binomial_regression%29_and_regularization/logistic_regression%28binomial_regression%29_and_regularization.html), and some other linear models such as softmax regression are all special cases of the [GLM (Generalized Linear Model)](https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function).

When we build a model, we usually start by analyzing which distribution the target variable follows. For a GLM, that distribution should belong to the [exponential family](https://en.wikipedia.org/wiki/Exponential_family). We then convert that distribution's parameters into the GLM's parameters and proceed with that specialized form of the GLM, that is:

<img src="./modeling.svg">

In [linear regression](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm001_multivariable_linear_regression_gradient_descent/multivariable_linear_regression_gradient_descent.html) we assume: $y|x;\theta \sim \mathcal{N}(\mu, \sigma^2)$, and in [logistic regression](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm007_logistic_regression%28binomial_regression%29_and_regularization/logistic_regression%28binomial_regression%29_and_regularization.html#How-to-estimate-the-$\theta$:-MLE-(Maximum-Likelihood-Estimation)) we assume: $y|x;\theta \sim Bernoulli(p)$.

# Exponential family

### [Members of exponential family distributions](https://en.wikipedia.org/wiki/Exponential_family#Examples_of_exponential_family_distributions)

### Exponential family distributions can be expressed in a generic form as below:

$$
  f(y|x; \eta) = h(y) exp\big(\eta(\theta) \cdot T(y) - A(\eta)\big)
$$

- $\eta = \eta(\theta)$ is the natural parameter
- $h(y)$ is the base measure
- $T(y)$ is the [sufficient statistic](https://en.wikipedia.org/wiki/Sufficient_statistic) (usually $T(y) = y$)
- $A(\eta)$ is the log-[partition function](https://en.wikipedia.org/wiki/Partition_function_(mathematics)) ($A(\eta)$ plays the normalization role: it makes the distribution sum, or integrate, to 1, i.e. $\sum\limits_y f(y; \eta) = 1$)

That is: $h(y)$, $T(y)$ and $A(\eta)$ determine a specific distribution, and the transformed parameter $\eta = \eta(\theta)$ is that distribution's parameter.

### Bernoulli distribution in GLM form

The probability mass function $f$ of the Bernoulli distribution over possible outcomes $x$ is:

$$
\begin{align*}
  f(x;p) &= p^x(1-p)^{1-x} \\
  &= exp\big(xlnp + (1-x)ln(1-p)\big) \\
  &= exp\big(xln\frac{p}{1-p} + ln(1-p)\big) \enspace \text{for } x \in \{0, 1\}
\end{align*}
$$

$$
\eta = ln\frac{p}{1-p} \Rightarrow e^\eta = \frac{p}{1-p} \Rightarrow p = \frac{1}{1 + e^{-\eta}}
$$

that is, when:

$$
\begin{align*}
  h(x) &= 1 \\
  T(x) &= x \\
  A(\eta) &= -ln(1-p) = ln(1+e^\eta)
\end{align*}
$$

the exponential family form is exactly the Bernoulli distribution.
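As a quick numerical check of the derivation above, the sketch below compares the direct Bernoulli PMF with the form $h(x)exp\big(\eta T(x) - A(\eta)\big)$ using $h(x) = 1$, $T(x) = x$ and $A(\eta) = ln(1+e^\eta)$. The function names are illustrative only and are not taken from any library.

```python
import math


def bernoulli_pmf(x, p):
    """Direct form: p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)


def bernoulli_expfam(x, p):
    """Exponential family form h(x) * exp(eta * T(x) - A(eta))."""
    eta = math.log(p / (1 - p))          # natural parameter (log-odds)
    a = math.log(1 + math.exp(eta))      # log-partition function A(eta)
    return 1.0 * math.exp(eta * x - a)   # h(x) = 1, T(x) = x


def inverse_mapping(eta):
    """Inverse parameter mapping p = 1 / (1 + e^{-eta})."""
    return 1 / (1 + math.exp(-eta))


# Both forms should agree for every outcome, and the mapping should invert eta.
for p in (0.1, 0.5, 0.9):
    for x in (0, 1):
        assert math.isclose(bernoulli_pmf(x, p), bernoulli_expfam(x, p))
    assert math.isclose(inverse_mapping(math.log(p / (1 - p))), p)
```

The same pattern (write the density both ways and compare) can be reused for any other member of the exponential family.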

### Gaussian distribution (normal distribution) in GLM form

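The probability density function $f$ of the normal distribution with mean $\mu$ and standard deviation $\sigma$ (with $\sigma$ treated as known) is:

$$
\begin{align*}
  f(x| \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}}exp\big(-\frac{(x-\mu)^2}{2\sigma^2}\big) \\
  &= \frac{1}{\sqrt{2\pi\sigma^2}}exp\big(-\frac{x^2}{2\sigma^2}\big)exp\big(\frac{\mu}{\sigma} \cdot \frac{x}{\sigma} - \frac{\mu^2}{2\sigma^2}\big)
\end{align*}
$$

$$
\eta = \frac{\mu}{\sigma} \Rightarrow \mu = \sigma\eta
$$

that is, when:

$$
\begin{align*}
  h(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}exp\big(-\frac{x^2}{2\sigma^2}\big) \\
  T(x) &= \frac{x}{\sigma} \\
  A(\eta) &= \frac{\mu^2}{2\sigma^2} = \frac{\eta^2}{2}
\end{align*}
$$

the exponential family form is exactly the normal distribution. With $\sigma = 1$ this simplifies to $\eta = \mu$, $T(x) = x$ and $A(\eta) = \frac{\eta^2}{2}$.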
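The factorization of the normal density can be checked numerically. The sketch below assumes the known-variance parameterization $\eta = \mu/\sigma$, $T(x) = x/\sigma$, $A(\eta) = \eta^2/2$ (the one listed in the Wikipedia table of exponential family distributions); for $\sigma = 1$ it reduces to $\eta = \mu$ and $T(x) = x$. The function names are illustrative.

```python
import math


def normal_pdf(x, mu, sigma):
    """Direct form of the normal density."""
    norm = math.sqrt(2 * math.pi * sigma ** 2)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / norm


def normal_expfam(x, mu, sigma):
    """Exponential family form h(x) * exp(eta * T(x) - A(eta)), sigma known."""
    eta = mu / sigma                      # natural parameter
    t = x / sigma                         # sufficient statistic T(x)
    a = eta ** 2 / 2                      # log-partition function A(eta)
    h = math.exp(-x ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    return h * math.exp(eta * t - a)


# The two forms should agree for any x, mu and sigma > 0.
for mu, sigma in ((0.0, 1.0), (1.5, 1.0), (-2.0, 0.5), (3.0, 2.0)):
    for x in (-1.0, 0.0, 0.7, 2.5):
        assert math.isclose(normal_pdf(x, mu, sigma), normal_expfam(x, mu, sigma))
```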

# Modeling with GLM

When modeling with a GLM we rely on the following three hypotheses:

1. $y|x;\theta \sim ExponentialFamily(\eta)$, that is: the conditional distribution of $y$ given $x$ belongs to the exponential family;

2. **The target of modeling with GLM** is to find the fitting function $f_\theta(x)$ such that $f_\theta(x) = E\big[T(y)|x;\theta\big]$; since in most cases $T(y) = y$, the target becomes finding the fitting function $f_\theta(x)$ such that $f_\theta(x) = E\big[y|x;\theta\big]$;

3. The natural parameter $\eta$ is linear in $x$: $\eta = \theta^Tx$;

### With above three hypotheses, GLM $\Rightarrow$ linear regression

- With above hypothesis 1:

$$
y|x;\theta \sim \mathcal{N}(\mu, \sigma^2) 
$$


- With hypothesis 2 above, and looking up the row [normal distribution known variance](https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions) of the table for the 'Natural parameter(s) $\eta$', 'Sufficient statistic $T(x)$', and 'Inverse parameter mapping', we can derive the fitting function:

$$
\begin{align*}
  f_\theta(x) &= E\big[y|x;\theta\big] = \mu \\
  &= \sigma\eta \enspace \text{(Inverse parameter mapping)} \\
  &\text{refer to above 'Bernoulli distribution in GLM form' section to see how to calculate 'Inverse parameter mapping'} \\
  &\text{and since variance here doesn't affect the accuracy of our model, we make: } \sigma = 1 \\
  \Rightarrow &f_\theta(x) = \eta
\end{align*}
$$


- With above hypothesis 3:

$$
\eta = \theta^Tx
$$

Finally, for linear regression we conclude that the fitting equation is: $f_\theta(x) = \theta^Tx = \theta_0 + \theta_1x_1 + \dots + \theta_mx_m$, as stated in the previous post [Multivariable linear regression(gradient descent)](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm001_multivariable_linear_regression_gradient_descent/multivariable_linear_regression_gradient_descent.html#And-we-want-to-find-out-the-fitting-equation:).
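A minimal sketch of this fitting equation follows, assuming the usual convention that a constant feature $x_0 = 1$ is prepended so that $\theta_0$ acts as the intercept. The helper names and the numbers are made up for illustration.

```python
def linear_predictor(theta, x):
    """eta = theta^T x for two equally long sequences of numbers."""
    if len(theta) != len(x):
        raise ValueError("theta and x must have the same length")
    return sum(t * v for t, v in zip(theta, x))


def predict_mean(theta, features):
    """f_theta(x) = E[y | x; theta] = eta for the Gaussian case with sigma = 1."""
    x = [1.0] + list(features)   # prepend x_0 = 1 for the intercept theta_0
    return linear_predictor(theta, x)


# theta_0 + theta_1 * x_1 + theta_2 * x_2 = 1 + 2 * 4 + 3 * 5
prediction = predict_mean([1.0, 2.0, 3.0], [4.0, 5.0])
assert prediction == 24.0
```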

### With above three hypotheses, GLM $\Rightarrow$ logistic regression

- With above hypothesis 1:

$$
y|x;\theta \sim Bernoulli(p)
$$


- With hypothesis 2 above, and looking up the row [Bernoulli distribution](https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions) of the table for the 'Natural parameter(s) $\eta$', 'Sufficient statistic $T(x)$', and 'Inverse parameter mapping', we can derive the fitting function:

$$
\begin{align*}
  f_\theta(x) &= E\big[y|x;\theta\big] = p \\
  &= \frac{1}{1 + e^{-\eta}} \enspace \text{(Inverse parameter mapping)} \\
  &\text{refer to above 'Bernoulli distribution in GLM form' section to see how to calculate 'Inverse parameter mapping'}
\end{align*}
$$


- With above hypothesis 3:

$$
\begin{align*}
  \eta &= \theta^Tx \\
  \Rightarrow f_\theta(x) &= \frac{1}{1 + e^{-\theta^Tx}}
\end{align*}
$$

Finally, for logistic regression we conclude that the fitting equation is: $f_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}$, as stated in the previous post [Logistic regression (binomial regression) and regularization](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm007_logistic_regression%28binomial_regression%29_and_regularization/logistic_regression%28binomial_regression%29_and_regularization.html#Modeling).
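The logistic fitting equation can be sketched in the same way: compute $\eta = \theta^Tx$ and pass it through the inverse parameter mapping. The sigmoid below is written in a numerically stable way (a common implementation choice, not something the derivation requires), and the helper names are illustrative.

```python
import math


def sigmoid(eta):
    """Inverse parameter mapping p = 1 / (1 + e^{-eta}), safe for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)            # eta < 0, so exp(eta) cannot overflow
    return e / (1.0 + e)


def predict_probability(theta, features):
    """f_theta(x) = E[y | x; theta] = p for the Bernoulli case."""
    x = [1.0] + list(features)   # prepend x_0 = 1 for the intercept theta_0
    eta = sum(t * v for t, v in zip(theta, x))
    return sigmoid(eta)


# With theta = 0 the natural parameter is 0, so both classes are equally likely.
assert predict_probability([0.0, 0.0, 0.0], [4.0, 5.0]) == 0.5
```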

### Softmax regression ([categorical distribution (variant 3)](https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions)) in GLM form

Let's say $y$ can be classified into $k$ classes: $y \in \{1, 2, \dots, k\}$, with probabilities $p_1, p_2, \dots, p_k$ respectively. The last class $k$ has a special meaning: any sample point that does not belong to any of the first $k-1$ classes falls into class $k$, that is:

$$
\begin{cases}
  p_i \quad i \in \{1, 2, \dots, k-1\} \\
  p_k = 1 - \sum\limits_{i=1}^{k-1}p_i
\end{cases}
$$

Let's define $T(y) \in \mathbb{R}^{k-1}$:

$$
T(1) = \begin{pmatrix}1 \\ 0 \\ 0 \\ \vdots \\ 0\end{pmatrix}, \
T(2) = \begin{pmatrix}0 \\ 1 \\ 0 \\ \vdots \\ 0\end{pmatrix}, \
T(3) = \begin{pmatrix}0 \\ 0 \\ 1 \\ \vdots \\ 0\end{pmatrix}, \
\dots, \
T(k-1) = \begin{pmatrix}0 \\ 0 \\ 0 \\ \vdots \\ 1\end{pmatrix}, \
T(k) = \begin{pmatrix}0 \\ 0 \\ 0 \\ \vdots \\ 0\end{pmatrix}
$$

To express the above $T(y)$, $y \in \{1,2,\dots,k\}$, more compactly, let's introduce the [indicator function](https://en.wikipedia.org/wiki/Indicator_function) $1\{\cdot\}$: $1\{true\} = 1$ and $1\{false\} = 0$, so that the $i$-th component of $T(y)$ is $T_i(y) = 1\{y=i\}$.

That is, $T(y)|x; p \sim CategoricalDistribution(x_1,x_2,\dots,x_k;1;p_1,p_2,\dots,p_k)$, thus: $E\big[T_i(y)|x; p\big] = 1\cdot p_i = p_i$ (the mean of the [multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution) with $n=1$, or equivalently the mean of the [categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution)).
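The encoding $T(y)$ and its expectation can be sketched in a few lines. The function names and the example probabilities are made up for illustration; classes are labelled $1, \dots, k$ as in the text.

```python
def sufficient_statistic(y, k):
    """T(y) in R^{k-1}: component i is the indicator 1{y = i}, for i = 1..k-1."""
    if not 1 <= y <= k:
        raise ValueError("y must be one of the classes 1..k")
    return [1 if y == i else 0 for i in range(1, k)]


def expected_statistic(probs):
    """E[T(y)] = sum over y of p_y * T(y), for probs = [p_1, ..., p_k]."""
    k = len(probs)
    mean = [0.0] * (k - 1)
    for y, p in enumerate(probs, start=1):
        t = sufficient_statistic(y, k)
        mean = [m + p * ti for m, ti in zip(mean, t)]
    return mean


assert sufficient_statistic(1, 4) == [1, 0, 0]
assert sufficient_statistic(4, 4) == [0, 0, 0]   # class k maps to the zero vector
```

Calling `expected_statistic([0.2, 0.3, 0.5])` gives the first $k-1$ probabilities back, which is the statement $E\big[T_i(y)\big] = p_i$.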


***
***
#### Why does the PMF have no coefficient?

The [categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution) can be treated as a special case of the [multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution) with $k>2$ and $n=1$. Since $n=1$, the coefficient $\frac{n!}{x_1! x_2! \dots x_k!}$ in the multinomial distribution's [PMF](https://en.wikipedia.org/wiki/Multinomial_distribution#Probability_mass_function) is always 1, which is why the 'PMF of $T(y)|x; p$' below has the form $p_1^{x_1} p_2^{x_2} \dots p_k^{x_k}$ only, **WITHOUT** that coefficient. In more detail:

$$
\begin{cases}
  \text{class 1 is chosen: } \frac{1!}{1! \cdot 0! \cdot 0! \dots 0!} = 1 \\
  \text{class 2 is chosen: } \frac{1!}{0! \cdot 1! \cdot 0! \dots 0!} = 1 \\
  \text{class 3 is chosen: } \frac{1!}{0! \cdot 0! \cdot 1! \dots 0!} = 1 \\
  \vdots
\end{cases}
$$
***
***
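The claim about the coefficient is easy to confirm by brute force. The sketch below computes $\frac{n!}{x_1! x_2! \dots x_k!}$ for every one-hot count vector; the helper name is illustrative.

```python
from math import factorial


def multinomial_coefficient(counts):
    """n! / (x_1! * x_2! * ... * x_k!) with n = sum of the counts."""
    denominator = 1
    for c in counts:
        denominator *= factorial(c)
    return factorial(sum(counts)) // denominator


# With n = 1 exactly one count is 1 and the rest are 0, so the coefficient is 1.
k = 5
for chosen in range(k):
    one_hot = [1 if i == chosen else 0 for i in range(k)]
    assert multinomial_coefficient(one_hot) == 1

# For n > 1 the coefficient is generally not 1, e.g. 3! / (2! * 1!) = 3.
assert multinomial_coefficient([2, 1]) == 3
```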

#### Reference Point
PMF of $T(y)|x; p$

$$
\begin{align*}
  f\big(T(y)|x;p\big) &= p_1^{1\{y=1\}} p_2^{1\{y=2\}} \dots p_{k-1}^{1\{y=k-1\}} p_k^{1\{y=k\}} \\
  &= p_1^{1\{y=1\}} p_2^{1\{y=2\}} \dots p_{k-1}^{1\{y=k-1\}} p_k^{1 - \sum\limits_{i=1}^{k-1}1\{y=i\}} \\
  &\text{note: exactly one of the indicators equals 1, so } 1\{y=k\} = 1 - \sum\limits_{i=1}^{k-1}1\{y=i\} \text{ (a scalar)} \\
  &= p_1^{T_1(y)} p_2^{T_2(y)} \dots p_{k-1}^{T_{k-1}(y)} p_k^{1 - \sum\limits_{i=1}^{k-1}T_i(y)} \\
  &=  exp\Big( T_1(y)lnp_1 + T_2(y)lnp_2 + \dots + T_{k-1}(y)lnp_{k-1} + \big(1 - \sum\limits_{i=1}^{k-1}T_i(y)\big)lnp_k \Big) \\
  &= exp\Big( T_1(y)ln\frac{p_1}{p_k} + T_2(y)ln\frac{p_2}{p_k} + \dots + T_{k-1}(y)ln\frac{p_{k-1}}{p_k} + lnp_k \Big) \\
  &= h(y) exp\big( \eta^T T(y) - A(\eta) \big)
\end{align*}
$$

where

$$
\begin{align*}
  \eta &= \begin{pmatrix} ln\frac{p_1}{p_k} \\ ln\frac{p_2}{p_k} \\ \vdots \\ ln\frac{p_{k-1}}{p_k} \end{pmatrix} \
  = \begin{pmatrix} \theta_1^Tx \\ \theta_2^Tx \\ \vdots \\ \theta_{k-1}^Tx \end{pmatrix} \\
  A(\eta) &= -lnp_k = ln\Big(1 + \sum\limits_{i=1}^{k-1} exp(\eta_i)\Big) \\
  h(y) &= 1
\end{align*}
$$

calculate $p_i$ (note that $\eta_k = ln\frac{p_k}{p_k} = 0$, i.e. $\theta_k = 0$, so the sums below can run up to $k$):

$$
\begin{align*}
  \eta_i &= ln\frac{p_i}{p_k} \\
  &\Rightarrow exp(\eta_i) = \frac{p_i}{p_k} \\
  &\Rightarrow p_kexp(\eta_i) = p_i \\
  &\because \sum\limits_{j=1}^k p_j = 1 = \sum\limits_{j=1}^k p_k exp(\eta_j) = p_k \sum\limits_{j=1}^k exp(\eta_j) \\
  &\therefore p_k = 1 \Big/ \sum\limits_{j=1}^k exp(\eta_j) = 1 \Big/ \sum\limits_{j=1}^k exp(\theta_j^Tx) \\
  \Rightarrow p_i &= p_k exp(\eta_i) = \frac{exp(\eta_i)}{\sum\limits_{j=1}^k exp(\eta_j)}
\end{align*}
$$

that is:

$$
\begin{align*}
  p(y=i|x;\theta) &= p_i \\
  &= \frac{exp(\eta_i)}{\sum\limits_{j=1}^k exp(\eta_j)} \\
  &= \frac{exp(\theta_i^T x)}{\sum\limits_{j=1}^k exp(\theta_j^T x)}
\end{align*}
$$
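The mapping from natural parameters to class probabilities can be sketched as follows, assuming $\eta_k = 0$ for the last class (which follows from $\eta_k = ln\frac{p_k}{p_k}$). Subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the result unchanged; the function names are illustrative.

```python
import math


def softmax_probabilities(etas):
    """Map the k-1 natural parameters to [p_1, ..., p_k], with eta_k = 0."""
    full = list(etas) + [0.0]                  # append eta_k = 0
    m = max(full)
    exps = [math.exp(e - m) for e in full]     # shift by max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


def natural_parameters(probs):
    """Inverse direction: eta_i = ln(p_i / p_k) for i = 1..k-1."""
    return [math.log(p / probs[-1]) for p in probs[:-1]]


# Round trip: probabilities -> natural parameters -> probabilities.
p = [0.2, 0.3, 0.5]
recovered = softmax_probabilities(natural_parameters(p))
assert all(math.isclose(a, b) for a, b in zip(p, recovered))
assert math.isclose(sum(recovered), 1.0)
```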

With hypothesis 2 above, let's define the fitting equation for $T(y)|x; \theta$:

$$
\begin{align*}
  f_\theta(x) &= E\Big[ T(y)|x; \theta \Big] \\
  &= E \left[ \
    \begin{array}{c|c}
      1\{y=1\} \\
      1\{y=2\} \\
      \vdots & x; \theta \\
      1\{y=k-2\} \\
      1\{y=k-1\}
    \end{array}
    \right] \\
  &= \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{k-2} \\ p_{k-1} \end{bmatrix} \\
  &= \begin{bmatrix} \
        \frac{exp(\theta_1^T x)}{\sum\limits_{j=1}^k exp(\theta_j^T x)} \\
        \frac{exp(\theta_2^T x)}{\sum\limits_{j=1}^k exp(\theta_j^T x)} \\
        \vdots \\
        \frac{exp(\theta_{k-2}^T x)}{\sum\limits_{j=1}^k exp(\theta_j^T x)} \\
        \frac{exp(\theta_{k-1}^T x)}{\sum\limits_{j=1}^k exp(\theta_j^T x)}
    \end{bmatrix}
\end{align*}
$$

PMF of $T(y)|x; \theta$

$$
  f\big( T(y)|x; \theta \big) = \prod_{i=1}^k \Bigg( \frac{exp(\theta_i^T x)}{\sum\limits_{j=1}^k exp(\theta_j^T x)} \Bigg)^{1\{y=i\}}
$$

***
***
with sample dataset:

$$
\begin{align*}
& \
(x_1^{(1)}, x_2^{(1)}, \dots, x_m^{(1)}, y^{(1)}),
(x_1^{(2)}, x_2^{(2)}, \dots, x_m^{(2)}, y^{(2)}),
\dots,
(x_1^{(n)}, x_2^{(n)}, \dots, x_m^{(n)}, y^{(n)})
\\
& \
x_i^{(j)} \
\Big(
  \begin{aligned}
    i = 1, 2, \dots, m \\
    j = 1, 2, \dots, n
  \end{aligned}
\Big) \
\text{represents the value of the feature } x_i \
\text{of the } j^{th} \
\text{sample record}
\\
& \
y^{(i)} \
\Big(
  \begin{aligned}
    i = 1, 2, \dots, n
  \end{aligned}
\Big) \
\text{represents the target value of the } i^{th} \
\text{sample record}
\end{align*}
$$
***
***

Take the **log** and apply MLE (over all the samples) to find $\theta$:

$$
\ell(\theta) = \sum\limits_{c=1}^n log \prod_{i=1}^k \Bigg( \frac{exp(\theta_i^T x^{(c)})}{\sum\limits_{j=1}^k exp(\theta_j^T x^{(c)})} \Bigg)^{1\{y^{(c)}=i\}}
$$

Find the $\theta$ that maximizes $\ell(\theta)$ with gradient ascent (equivalently, gradient descent on $-\ell(\theta)$).
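A minimal pure-Python sketch of this last step is shown below. It assumes that $\theta_k$ is held fixed at 0 (so only the first $k-1$ parameter vectors are updated), that each $x$ already contains the constant feature $x_0 = 1$, and it uses the standard softmax gradient $\frac{\partial \ell}{\partial \theta_i} = \sum_c \big(1\{y^{(c)}=i\} - p_i^{(c)}\big) x^{(c)}$. The toy dataset, learning rate and function names are made up for illustration.

```python
import math


def class_probabilities(thetas, x):
    """Softmax over the scores theta_i^T x; thetas has k rows, the last all zero."""
    scores = [sum(t * v for t, v in zip(theta, x)) for theta in thetas]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(thetas, xs, ys):
    """ell(theta) = sum over samples of log p(y | x; theta), classes labelled 1..k."""
    return sum(math.log(class_probabilities(thetas, x)[y - 1]) for x, y in zip(xs, ys))


def gradient_ascent_step(thetas, xs, ys, learning_rate=0.1):
    """One ascent step on ell(theta); the last row (theta_k = 0) is never updated."""
    k, m = len(thetas), len(thetas[0])
    grads = [[0.0] * m for _ in range(k)]
    for x, y in zip(xs, ys):
        probs = class_probabilities(thetas, x)
        for i in range(k - 1):
            error = (1.0 if y == i + 1 else 0.0) - probs[i]
            for d in range(m):
                grads[i][d] += error * x[d]
    return [[t + learning_rate * g for t, g in zip(theta, grad)]
            for theta, grad in zip(thetas, grads)]


# Toy data (hypothetical): x = (x_0 = 1, x_1), two classes.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1, 1, 2, 2]
thetas = [[0.0, 0.0], [0.0, 0.0]]
initial = log_likelihood(thetas, xs, ys)
for _ in range(200):
    thetas = gradient_ascent_step(thetas, xs, ys)
final = log_likelihood(thetas, xs, ys)
assert final > initial
```

In practice a vectorized implementation (for example with NumPy) and a convergence check would replace the fixed number of iterations used here.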

We are all done, cheers!

# References

- [Generalized Linear Model (in Chinese)](https://zhuanlan.zhihu.com/p/22876460)

- [How should the sigmoid function in logistic regression be interpreted, and why is it used? (in Chinese)](https://www.zhihu.com/question/23666587/answer/462453898)

- [Generalized linear model](https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function)

- [Exponential family](https://en.wikipedia.org/wiki/Exponential_family)

- [Sufficient statistics](https://en.wikipedia.org/wiki/Sufficient_statistic)

- [Partition function (mathematics)](https://en.wikipedia.org/wiki/Partition_function_(mathematics))

- [Canonical form](https://en.wikipedia.org/wiki/Canonical_form)

- [Cumulant generating function](https://en.wikipedia.org/wiki/Cumulant_generating_function)

- [Multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution)

- [Categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution)

- [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution)

- [Joint probability distribution](https://en.wikipedia.org/wiki/Joint_probability_distribution)

- [Indicator function](https://en.wikipedia.org/wiki/Indicator_function)

