# First

[Linear regression](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm001_multivariable_linear_regression_gradient_descent/multivariable_linear_regression_gradient_descent.html), [logistic regression](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm007_logistic_regression%28binomial_regression%29_and_regularization/logistic_regression%28binomial_regression%29_and_regularization.html), and other linear models such as softmax regression are all special cases of the [GLM (Generalized Linear Model)](https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function).

When we build a model, we usually start by identifying which distribution the target variable follows. For a GLM, that distribution should belong to the [exponential family](https://en.wikipedia.org/wiki/Exponential_family). We then map the distribution's parameters to the GLM's parameters and work with that specialized form of the GLM, that is:

<img src="./modeling.svg">

In [linear regression](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm001_multivariable_linear_regression_gradient_descent/multivariable_linear_regression_gradient_descent.html) we assume: $y|x;\theta \sim \mathcal{N}(\mu, \sigma^2)$, and in [logistic regression](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm007_logistic_regression%28binomial_regression%29_and_regularization/logistic_regression%28binomial_regression%29_and_regularization.html#How-to-estimate-the-$\theta$:-MLE-(Maximum-Likelihood-Estimation)) we assume: $y|x;\theta \sim Bernoulli(p)$.

# Exponential family

### [Members of exponential family distributions](https://en.wikipedia.org/wiki/Exponential_family#Examples_of_exponential_family_distributions)

### Exponential family distributions can be expressed in a generic form as below:

$$
  f(y|x; \eta) = h(y) \exp\big(\eta(\theta) \cdot T(y) - A(\eta)\big)
$$

- $\eta = \eta(\theta)$ is the natural parameter
- $T(y)$ is the [sufficient statistic](https://en.wikipedia.org/wiki/Sufficient_statistic) (usually $T(y) = y$)
- $A(\eta)$ is the log-[partition function](https://en.wikipedia.org/wiki/Partition_function_(mathematics)); it acts as the normalizer, ensuring that $\sum_y f(y; \eta) = 1$ (or $\int f(y; \eta)\,dy = 1$ in the continuous case)

That is, $h(y)$, $T(y)$ and $A(\eta)$ together determine a family of distributions, and the transformed parameter $\eta = \eta(\theta)$ selects a particular member of that family.
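To make the normalizing role of $A(\eta)$ concrete, here is a minimal Python sketch for a discrete distribution with finite support and a scalar $\eta$. The helper names `log_partition` and `exp_family_pmf`, and the chosen support and value of $\eta$, are illustrative assumptions rather than part of any library.

```python
import math

def log_partition(eta, support, h, T):
    """A(eta) = log of the sum over the support of h(y) * exp(eta * T(y))."""
    return math.log(sum(h(y) * math.exp(eta * T(y)) for y in support))

def exp_family_pmf(y, eta, support, h, T):
    """f(y; eta) = h(y) * exp(eta * T(y) - A(eta)) for a scalar eta."""
    A = log_partition(eta, support, h, T)
    return h(y) * math.exp(eta * T(y) - A)

# Illustrative choice: support {0, 1}, h(y) = 1, T(y) = y
support = [0, 1]
h = lambda y: 1.0
T = lambda y: y
eta = 0.7  # arbitrary natural parameter, chosen only for the demo

probs = [exp_family_pmf(y, eta, support, h, T) for y in support]
total = sum(probs)  # sums to 1 up to floating-point error
```

Subtracting $A(\eta)$ inside the exponent is what makes the probabilities sum to one, whatever value of $\eta$ is chosen.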

### Bernoulli distribution in GLM form

The probability mass function $f$ of the Bernoulli distribution over possible outcomes $x$ is:

$$
\begin{align*}
  f(x;p) &= p^x(1-p)^{1-x} \\
  &= \exp\big(x\ln p + (1-x)\ln(1-p)\big) \\
  &= \exp\big(x\ln\frac{p}{1-p} + \ln(1-p)\big) \enspace \text{for } x \in \{0, 1\}
\end{align*}
$$

$$
\eta = \ln\frac{p}{1-p} \Rightarrow e^\eta = \frac{p}{1-p} \Rightarrow p = \frac{1}{1 + e^{-\eta}}
$$

that is, when:

$$
\begin{align*}
  h(x) &= 1 \\
  T(x) &= x \\
  A(\eta) &= -\ln(1-p) = \ln(1+e^\eta)
\end{align*}
$$

the exponential family form reduces to the Bernoulli distribution.
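The identity above can be checked numerically. The sketch below compares the usual Bernoulli probability mass function with its exponential family form ($h(x) = 1$, $T(x) = x$, $A(\eta) = \ln(1 + e^\eta)$) and confirms that the sigmoid inverts the log-odds. The function names and the value $p = 0.3$ are assumptions chosen for illustration.

```python
import math

def bernoulli_pmf(x, p):
    """Standard form: p^x * (1 - p)^(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

def bernoulli_exp_family(x, p):
    """Exponential family form with h(x) = 1 and T(x) = x."""
    eta = math.log(p / (1 - p))      # natural parameter (log-odds)
    A = math.log(1 + math.exp(eta))  # log-partition function
    return math.exp(eta * x - A)

def sigmoid(eta):
    """Inverse parameter mapping: p = 1 / (1 + e^(-eta))."""
    return 1 / (1 + math.exp(-eta))

p = 0.3  # arbitrary success probability for the demo
max_gap = max(abs(bernoulli_pmf(x, p) - bernoulli_exp_family(x, p)) for x in (0, 1))
recovered_p = sigmoid(math.log(p / (1 - p)))  # should give back p
```

Both forms agree on $x \in \{0, 1\}$ up to floating-point error, and applying the sigmoid to $\eta$ recovers $p$.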

### Gaussian distribution (normal distribution) in GLM form

The probability density function $f$ of the normal distribution with mean $\mu$ and standard deviation $\sigma$ (with $\sigma$ treated as known) is:

$$
\begin{align*}
  f(x| \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big) \\
  &= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{x^2}{2\sigma^2}\Big)\exp\Big(\frac{\mu}{\sigma} \cdot \frac{x}{\sigma} - \frac{\mu^2}{2\sigma^2}\Big)
\end{align*}
$$

$$
\eta = \frac{\mu}{\sigma} \Rightarrow \mu = \sigma\eta
$$

that is, when:

$$
\begin{align*}
  h(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{x^2}{2\sigma^2}\Big) \\
  T(x) &= \frac{x}{\sigma} \\
  A(\eta) &= \frac{\mu^2}{2\sigma^2} = \frac{\eta^2}{2}
\end{align*}
$$

the exponential family form reduces to the normal distribution.
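As a sanity check, the sketch below compares the normal density with an exponential family form under the known-variance parameterization listed in the Wikipedia table of distributions: $\eta = \mu/\sigma$, $T(x) = x/\sigma$, $A(\eta) = \eta^2/2$. The function names and the test values are assumptions made for this example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Standard normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_exp_family(x, mu, sigma):
    """Exponential family form with known sigma."""
    eta = mu / sigma   # natural parameter
    T = x / sigma      # sufficient statistic
    A = eta ** 2 / 2   # log-partition function
    h = math.exp(-x ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    return h * math.exp(eta * T - A)

mu, sigma = 1.5, 2.0  # arbitrary parameters for the demo
max_gap = max(
    abs(normal_pdf(x, mu, sigma) - normal_exp_family(x, mu, sigma))
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0, 4.0)
)
```

The two expressions agree up to floating-point error, including for $\sigma \neq 1$.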

# Modeling with GLM

When modeling with GLM we rely on the following three hypotheses:

1. $y|x;\theta \sim ExponentialFamily(\eta)$, that is, the conditional distribution of $y$ given $x$ belongs to the exponential family;

2. **The target of modeling with GLM** is to find a fitting function $f_\theta(x)$ such that $f_\theta(x) = E\big[T(y)|x;\theta\big]$. Since in most cases $T(y) = y$, the target becomes finding $f_\theta(x)$ such that $f_\theta(x) = E\big[y|x;\theta\big]$;

3. The natural parameter $\eta$ is linear in $x$, that is, $\eta = \theta^Tx$.
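The three hypotheses can be sketched as a small prediction routine: compute $\eta = \theta^Tx$, then map $\eta$ to the mean of the chosen distribution. The function `glm_predict` and the example numbers are hypothetical. The identity and sigmoid mean functions correspond to the Gaussian (with $\sigma = 1$) and Bernoulli cases discussed in this post.

```python
import math

def glm_predict(theta, x, mean_function):
    """Return E[y | x; theta] = mean_function(theta^T x)."""
    eta = sum(t * xi for t, xi in zip(theta, x))  # hypothesis 3: eta is linear in x
    return mean_function(eta)                     # hypothesis 2: predict the mean

identity = lambda eta: eta                       # Gaussian case with sigma = 1
sigmoid = lambda eta: 1 / (1 + math.exp(-eta))   # Bernoulli case

theta = [0.5, -1.0, 2.0]  # made-up parameters
x = [1.0, 2.0, 0.25]      # x[0] = 1 plays the role of the intercept term

linear_prediction = glm_predict(theta, x, identity)
logistic_prediction = glm_predict(theta, x, sigmoid)
```

Only the mean function changes between the two models. The linear part $\theta^Tx$ is shared.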

### With above three hypotheses, GLM $\Rightarrow$ linear regression

- With above hypothesis 1:

$$
y|x;\theta \sim \mathcal{N}(\mu, \sigma^2) 
$$


- With hypothesis 2 above, and looking up the row [normal distribution known variance](https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions) in the table for the 'Natural parameter(s) $\eta$', 'Sufficient statistic $T(x)$', and 'Inverse parameter mapping', we can derive the fitting function:

$$
\begin{align*}
  f_\theta(x) &= E\big[y|x;\theta\big] = \mu \\
  &= \sigma\eta \enspace \text{(Inverse parameter mapping)}
\end{align*}
$$

The 'Bernoulli distribution in GLM form' section above shows how an inverse parameter mapping is calculated. The variance does not change which $\theta$ maximizes the likelihood, so we can set $\sigma = 1$ without loss of generality:

$$
f_\theta(x) = \eta
$$


- With above hypothesis 3:

$$
\eta = \theta^Tx
$$

Finally, for linear regression we can conclude that the fitting equation is $f_\theta(x) = \theta^Tx = \theta_0 + \theta_1x_1 + \dots + \theta_mx_m$, as stated in the previous post [Multivariable linear regression(gradient descent)](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm001_multivariable_linear_regression_gradient_descent/multivariable_linear_regression_gradient_descent.html#And-we-want-to-find-out-the-fitting-equation:).
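The choice $\sigma = 1$ can be checked numerically: under a Gaussian likelihood, the value of $\theta$ that minimizes the negative log-likelihood is the same for any fixed $\sigma$, and it matches the least-squares solution. The sketch below uses a single feature without an intercept, made-up toy data, and a coarse grid search. All of these are simplifying assumptions for illustration.

```python
import math

def gaussian_nll(theta, xs, ys, sigma):
    """Negative log-likelihood of y_i ~ N(theta * x_i, sigma^2)."""
    n = len(xs)
    squared_error = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return n * math.log(sigma * math.sqrt(2 * math.pi)) + squared_error / (2 * sigma ** 2)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]                  # made-up toy data
grid = [i / 100 for i in range(100, 301)]  # candidate theta values 1.00 .. 3.00

def best_theta(sigma):
    """Grid point with the smallest negative log-likelihood for a fixed sigma."""
    return min(grid, key=lambda t: gaussian_nll(t, xs, ys, sigma))

best_sigma_1 = best_theta(1.0)
best_sigma_5 = best_theta(5.0)

# Closed-form least-squares solution for the no-intercept model
least_squares = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

Changing $\sigma$ rescales and shifts the objective but leaves its minimizer in $\theta$ unchanged, which is why fixing $\sigma = 1$ does not affect the fitted model.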

### With above three hypotheses, GLM $\Rightarrow$ logistic regression

- With above hypothesis 1:

$$
y|x;\theta \sim Bernoulli(p)
$$


- With hypothesis 2 above, and looking up the row [Bernoulli distribution](https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions) in the table for the 'Natural parameter(s) $\eta$', 'Sufficient statistic $T(x)$', and 'Inverse parameter mapping', we can derive the fitting function:

$$
\begin{align*}
  f_\theta(x) &= E\big[y|x;\theta\big] = p \\
  &= \frac{1}{1 + e^{-\eta}} \enspace \text{(Inverse parameter mapping)}
\end{align*}
$$

See the 'Bernoulli distribution in GLM form' section above for how this inverse parameter mapping is calculated.


- With above hypothesis 3:

$$
\begin{align*}
  \eta &= \theta^Tx \\
  \Rightarrow f_\theta(x) &= \frac{1}{1 + e^{-\theta^Tx}}
\end{align*}
$$

Finally, for logistic regression we can conclude that the fitting equation is $f_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}$, as stated in the previous post [Logistic regression (binomial regression) and regularization](https://lnshi.github.io/ml-exercises/ml_basics_in_html/rdm007_logistic_regression%28binomial_regression%29_and_regularization/logistic_regression%28binomial_regression%29_and_regularization.html#Modeling).

### With above three hypotheses, GLM $\Rightarrow$ softmax regression ([categorical distribution (variant 3)](https://en.wikipedia.org/wiki/Exponential_family#Table_of_distributions))

- With above hypothesis 1:

$y$ can take one of multiple classes, say $k$ classes: $y \in \{1, 2, \dots, k\}$, with probabilities $p_1, p_2, \dots, p_k$ respectively. Because $p_1 + p_2 + \dots + p_k = 1$, only $k-1$ of these parameters are free, that is:

$$
\begin{cases}
  p_i \quad i \in \{1, 2, \dots, k-1\} \\
  p_k = 1 - \sum\limits_{i=1}^{k-1}p_i
\end{cases}
$$

and we define $T(y|x) \in \mathbb{R}^{k-1}$ as:

$$
T(1) = \begin{pmatrix}1 \\ 0 \\ 0 \\ \vdots \\ 0\end{pmatrix}, \
T(2) = \begin{pmatrix}0 \\ 1 \\ 0 \\ \vdots \\ 0\end{pmatrix}, \
T(3) = \begin{pmatrix}0 \\ 0 \\ 1 \\ \vdots \\ 0\end{pmatrix}, \
\dots, \
T(k-1) = \begin{pmatrix}0 \\ 0 \\ 0 \\ \vdots \\ 1\end{pmatrix}, \
T(k) = \begin{pmatrix}0 \\ 0 \\ 0 \\ \vdots \\ 0\end{pmatrix}
$$

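Each component $T_i(y|x)$ then follows a $Bernoulli(p_i)$ distribution, so $E\big[T_i(y|x)\big] = p_i$. The components are not independent of each other, since at most one of them equals 1 for any outcome. Writing $T_i$ for $T_i(y|x)$ and using $1 - \sum_{i=1}^{k-1}T_i$ as the indicator of class $k$, the [joint probability distribution](https://en.wikipedia.org/wiki/Joint_probability_distribution) of $T(y|x)$ is:

$$
\begin{align*}
  J\big(T(y|x);p\big) &= p_1^{T_1} p_2^{T_2} \dots p_{k-1}^{T_{k-1}} p_k^{1 - \sum_{i=1}^{k-1}T_i} \quad \text{where } p_k = 1 - \sum\limits_{i=1}^{k-1}p_i \\
  &= \exp\Big(\sum\limits_{i=1}^{k-1}T_i \ln p_i + \big(1 - \sum\limits_{i=1}^{k-1}T_i\big)\ln p_k\Big) \\
  &= \exp\Big(\sum\limits_{i=1}^{k-1}T_i \ln\frac{p_i}{p_k} + \ln p_k\Big)
\end{align*}
$$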
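For reference, here is a small numerical sketch of the categorical distribution in exponential family form, assuming the parameterization from the Wikipedia table (variant 3): $\eta_i = \ln\frac{p_i}{p_k}$ for $i < k$ with class $k$ as the reference, $T(y)$ as the one-hot vector defined above, and $A(\eta) = \ln\big(1 + \sum_{i=1}^{k-1}e^{\eta_i}\big)$. The function names and the example probabilities are illustrative assumptions.

```python
import math

def natural_params(p):
    """eta_i = ln(p_i / p_k) for i = 1..k-1 (class k is the reference class)."""
    return [math.log(pi / p[-1]) for pi in p[:-1]]

def log_partition(eta):
    """A(eta) = ln(1 + sum_i exp(eta_i)), which equals -ln(p_k)."""
    return math.log(1 + sum(math.exp(e) for e in eta))

def sufficient_statistic(y, k):
    """T(y): one-hot vector in R^(k-1); class k maps to the zero vector."""
    return [1.0 if y == i else 0.0 for i in range(1, k)]

def categorical_exp_family(y, p):
    """P(y) written as exp(eta . T(y) - A(eta)) with h(y) = 1."""
    eta = natural_params(p)
    T = sufficient_statistic(y, len(p))
    return math.exp(sum(e * t for e, t in zip(eta, T)) - log_partition(eta))

def softmax_from_eta(eta):
    """Inverse mapping: p_i = exp(eta_i) / (1 + sum_j exp(eta_j)), with eta_k = 0."""
    exps = [math.exp(e) for e in eta] + [1.0]
    total = sum(exps)
    return [e / total for e in exps]

p = [0.2, 0.5, 0.3]  # made-up class probabilities, k = 3
pmf_values = [categorical_exp_family(y, p) for y in (1, 2, 3)]
recovered_p = softmax_from_eta(natural_params(p))
```

The exponential family form reproduces each $p_i$, and the inverse mapping from $\eta$ back to $p$ is the softmax function, which is where softmax regression gets its name once $\eta_i$ is modeled as linear in $x$.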

# References

- [Generalized Linear Model (Zhihu column, in Chinese)](https://zhuanlan.zhihu.com/p/22876460)

- [How should the sigmoid function in logistic regression be explained in machine learning, and why use it? (Zhihu, in Chinese)](https://www.zhihu.com/question/23666587/answer/462453898)

- [Generalized linear model](https://en.wikipedia.org/wiki/Generalized_linear_model#Link_function)

- [Exponential family](https://en.wikipedia.org/wiki/Exponential_family)

- [Sufficient statistics](https://en.wikipedia.org/wiki/Sufficient_statistic)

- [Partition function (mathematics)](https://en.wikipedia.org/wiki/Partition_function_(mathematics))

- [Canonical form](https://en.wikipedia.org/wiki/Canonical_form)

- [Cumulant generating function](https://en.wikipedia.org/wiki/Cumulant_generating_function)