# 1. Maximum likelihood estimation (MLE)

## 1.1. Likelihood function

In statistics, the likelihood function (often simply called the likelihood) measures the goodness of fit of a statistical model to a sample of data $\boldsymbol{\mathscr{X}} = (\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, ..., \mathbf{x}^{(m)})$ for given values of the unknown parameters of the statistical model.

Let ${\displaystyle \boldsymbol{\mathsf{X}}}$ be a discrete random variable modeled using a probability mass function ${\displaystyle p_{\boldsymbol{\theta}}}$ that depends on the vector of parameters ${\displaystyle \boldsymbol{\theta}}$. Then the function

\begin{equation}
{\displaystyle \mathcal{L}(\boldsymbol{\theta} \mid \boldsymbol{\mathscr{X}}):\boldsymbol{\theta} \mapsto p_{\boldsymbol{\theta}}(\boldsymbol{\mathscr{X}})=P_{\boldsymbol{\theta}}(\boldsymbol{\mathsf{X}}^m=\boldsymbol{\mathscr{X}} )}
\end{equation}

considered as a function of $\boldsymbol{\theta}$, is the likelihood function, given the sample outcome $\boldsymbol{\mathscr{X}} = (\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, ..., \mathbf{x}^{(m)})$ of the random variable ${\displaystyle \boldsymbol{\mathsf{X}}}$. 

The likelihood ${\displaystyle {\mathcal {L}}(\boldsymbol{\theta} \mid \boldsymbol{\mathscr{X}})}$ is equal to the probability that the particular outcome $\boldsymbol{\mathscr{X}}$ is observed when the true value of the parameter is $\boldsymbol{\theta}$.

To summarize:

* For a sample of data $\boldsymbol{\mathscr{X}} = (\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, ..., \mathbf{x}^{(m)})$
* We fit a statistical model $p_{\boldsymbol{\theta}}$
* The likelihood function ${\displaystyle {\mathcal {L}}(\boldsymbol{\theta} \mid \boldsymbol{\mathscr{X}}):\boldsymbol{\theta}\mapsto P_{\boldsymbol{\theta}}(\boldsymbol{\mathsf{X}}^m=\boldsymbol{\mathscr{X}})}$ then measures the goodness of fit of the statistical model $p_{\boldsymbol{\theta}}$.
* $p_{\boldsymbol{\theta}}(\boldsymbol{\mathscr{X}})$ is the probability that this particular outcome $\boldsymbol{\mathscr{X}}$ is observed under the statistical model $p_{\boldsymbol{\theta}}$.
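
To make the summary concrete, here is a minimal Python sketch that evaluates the likelihood of a small sample for a few parameter values. The Bernoulli model, the sample, and the function name `bernoulli_likelihood` are assumptions chosen for illustration; they do not come from the text above.

```python
def bernoulli_likelihood(theta, sample):
    """Likelihood L(theta | sample) of an i.i.d. Bernoulli sample.

    Each outcome is 1 with probability theta and 0 with probability 1 - theta,
    so the likelihood is the product of the per-outcome probabilities.
    """
    likelihood = 1.0
    for x in sample:
        likelihood *= theta if x == 1 else (1.0 - theta)
    return likelihood


# Hypothetical sample: 7 ones and 3 zeros.
sample = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# The sample is fixed; the likelihood is a function of theta only.
for theta in (0.3, 0.5, 0.7, 0.9):
    print(f"theta={theta:.1f}  L={bernoulli_likelihood(theta, sample):.6f}")
```

Among the four candidate values, the one closest to the observed fraction of ones gives the largest likelihood, which is the sense in which the likelihood measures goodness of fit.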

## 1.2. Maximum likelihood estimation

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.

From a statistical standpoint, a given set of observations is a random sample from an unknown population. The goal of maximum likelihood estimation is to make inferences about the population that is most likely to have generated the sample, specifically the joint probability distribution of the random variables ${\displaystyle \left\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\ldots \right\}}$. Associated with each probability distribution is a unique vector ${\displaystyle \boldsymbol{\theta} =\left[\theta _{0},\,\theta _{1},\ldots \,,\theta _{n}\right]^{\mathsf {T}}}$ of parameters that indexes the probability distribution within a parametric family ${\displaystyle \{f(\cdot \,;\boldsymbol{\theta} )\mid \boldsymbol{\theta} \in \Phi \}}$, where ${\displaystyle \Phi }$ is called the parameter space, a finite-dimensional subset of Euclidean space. Evaluating the joint density at the observed data sample $\boldsymbol{\mathscr{X}} = (\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, ..., \mathbf{x}^{(m)})$ gives a real-valued function $f_{m}(\boldsymbol{\mathscr{X}}; \boldsymbol{\theta})$.

The goal of maximum likelihood estimation is to find the values of the model parameters that maximize the likelihood function over the parameter space, that is

$$
{\displaystyle {\hat {\boldsymbol{\theta}}}={\underset {\boldsymbol{\theta} \in \Phi }{\operatorname {arg\;max} }}\ {\mathcal{L}}(\boldsymbol{\theta} \,\mid\boldsymbol{\mathscr{X}}) = {\underset {\boldsymbol{\theta} \in \Phi }{\operatorname {arg\;max} }}\ f_{m}(\boldsymbol{\mathscr{X}} ;\boldsymbol{\theta})}
$$

For independent and identically distributed random variables, ${\displaystyle f_{m}(\boldsymbol{\mathscr{X}} ;\boldsymbol{\theta})}$ will be the product of univariate density functions.

$$
{\displaystyle {\hat {\boldsymbol{\theta} }} = {\underset {\boldsymbol{\theta} \in \Phi }{\operatorname {arg\;max} }} \prod_{i=1}^{m} f({\mathbf{x}}^{(i)} ;\boldsymbol{\theta})}
$$

Intuitively, this selects the parameter values that make the observed data most probable. The specific value ${\displaystyle {\hat {\boldsymbol{\theta} }} \in \Phi }$ that maximizes the likelihood function ${\displaystyle \mathcal{L}}$ is called the maximum likelihood estimate. 
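
The arg max over the parameter space can be sketched numerically. The example below assumes a Bernoulli model and a hypothetical sample, and replaces the exact maximization with a simple grid search over candidate values of $\theta$; for this model the maximizer is known in closed form (the sample mean), which gives something to compare against.

```python
import math


def likelihood(theta, sample):
    """Product of per-observation Bernoulli probabilities f(x; theta)."""
    return math.prod(theta if x == 1 else (1.0 - theta) for x in sample)


# Hypothetical i.i.d. sample: 7 ones and 3 zeros.
sample = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# Candidate parameter values on a grid strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]

# Grid-search stand-in for the arg max over the parameter space.
theta_hat = max(grid, key=lambda t: likelihood(t, sample))

# Closed-form Bernoulli MLE, used here only as a check.
sample_mean = sum(sample) / len(sample)
print(theta_hat, sample_mean)
```

A grid search is only workable for one or two parameters; it is used here because it mirrors the arg max definition directly.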

## 1.3. Likelihood of a logistic regression model

In the context of classification, the likelihood function measures how well a model fits the data. Suppose we have a dataset $\boldsymbol{\mathscr{X}}$ consisting of $m$ datapoints and $n$ features. The class variable $\boldsymbol{\mathscr{y}}$ is a vector of length $m$ whose entries $y^{(i)}$ each take one of two values, $1$ or $0$. Under the classification model, the probability that the class variable takes the value $y^{(i)}=1$, for $i=1,2,...,m$, can be modelled as follows:

\begin{equation}
p(y^{(i)}=1|\mathbf{x}^{(i)};\boldsymbol{\theta}) = h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})
\end{equation}

So $y^{(i)}=1$ with probability $h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$ and $y^{(i)}=0$ with probability $1-h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$. In other words, $y^{(i)}$ follows a **Bernoulli** distribution, and the two cases can be combined into a single equation:

\begin{equation}
p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta}) = h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})^{y^{(i)}}(1-h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}))^{1-y^{(i)}}
\end{equation}

This is also the likelihood of a single data point $(\mathbf{x}^{(i)}, y^{(i)})$, i.e. the probability of observing the label $y^{(i)}$ given the features $\mathbf{x}^{(i)}$ under the assumption of our model $p_{\boldsymbol{\theta}}$. It is the conditional probability $p_{\boldsymbol{\theta}}(y^{(i)}|\mathbf{x}^{(i)})$.

Given that the data points are independently and identically distributed, the likelihood of the entire dataset $\boldsymbol{\mathscr{X}}$ is the product of the individual data point likelihoods. Thus,

$$
{\mathcal {L}}(\boldsymbol{\theta}) = P(\boldsymbol{\mathscr{y}}|\boldsymbol{\mathscr{X}};\boldsymbol{\theta}) = \prod_{i=1}^{m} p(y^{(i)} | \mathbf{x}^{(i)};\boldsymbol{\theta}) = \prod_{i=1}^{m} h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})^{y^{(i)}} (1 - h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}))^{1-y^{(i)}}
$$

By the principle of maximum likelihood, the optimal parameters $\boldsymbol{\theta}$ of our model are found by maximizing the likelihood $P(\boldsymbol{\mathscr{y}}|\boldsymbol{\mathscr{X}};\boldsymbol{\theta})$. Because the likelihood is a product, we take its natural logarithm: the logarithm converts products into sums, which are easier to manipulate, and because it is monotonically increasing, maximizing the log-likelihood is the same as maximizing the likelihood. The log-likelihood $\log {\mathcal{L}}(\boldsymbol{\theta})$ is:

$$
\log {\mathcal {L}}(\boldsymbol{\theta}) = \log P(\boldsymbol{\mathscr{y}}|\boldsymbol{\mathscr{X}};\boldsymbol{\theta}) =  \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})) + (1-y^{(i)}) \log(1 - h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})) \right]
$$

In linear regression we found the $\boldsymbol{\theta}$ that minimizes a cost function. For consistency, we would like a minimization problem here too, and we want the average cost over all data points. Maximizing $\log {\mathcal {L}}(\boldsymbol{\theta})$ is equivalent to minimizing $-\log {\mathcal {L}}(\boldsymbol{\theta})$, and dividing by $m$ does not change the minimizer. The cost function for logistic regression is therefore:

$$
J(\boldsymbol{\theta}) = - \dfrac{1}{m} \log \mathcal{L}(\boldsymbol{\theta})
$$

$$
= - \dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log (h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})) +  (1-y^{(i)}) \log (1 - h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})) \right]
$$
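
The relation between the likelihood and the cost can be checked numerically. The sketch below takes the predicted probabilities $h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$ as given numbers, so it makes no assumption about the form of the hypothesis function; the values in `probs` and `labels` and the function names are hypothetical.

```python
import math


def likelihood(labels, probs):
    """Product over data points of h^y * (1 - h)^(1 - y)."""
    return math.prod(h ** y * (1.0 - h) ** (1 - y) for y, h in zip(labels, probs))


def cost(labels, probs):
    """J = -(1/m) * sum_i [ y log h + (1 - y) log(1 - h) ]."""
    m = len(labels)
    total = sum(
        y * math.log(h) + (1 - y) * math.log(1.0 - h)
        for y, h in zip(labels, probs)
    )
    return -total / m


# Hypothetical labels y^(i) and model outputs h_theta(x^(i)).
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.8, 0.6]

J = cost(labels, probs)
# exp(-m * J) recovers the likelihood, since J = -(1/m) log L.
print(J, math.exp(-len(labels) * J), likelihood(labels, probs))
```

Note that the cost is undefined when a prediction is exactly 0 or 1 for the wrong class, because the logarithm of zero diverges; practical implementations usually guard against this, which the sketch does not.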

# 2. Cross-Entropy

## 2.1. Entropy

The information entropy is a basic quantity in information theory associated with any random variable. It can be interpreted as the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes. Entropy can be seen as the amount of information (for example, the **number of bits**) required to transmit a randomly selected event from a probability distribution. The more bits you need, the more uncertainty (information) there is. A skewed distribution has a low entropy, whereas a distribution where events have equal probability has a larger entropy.

Shannon defined the entropy $H$ of a discrete random variable ${\textstyle Y}$ with possible values ${\textstyle \left\{y_{1},\ldots ,y_{C}\right\}}$ and probability mass function ${\textstyle \mathrm{q}(y)}$ as:

$$
{\displaystyle \mathrm {H} (y)=\operatorname {\mathbb{E}} [\operatorname {I} (y)]=\operatorname {\mathbb{E}} [-\log(\mathrm {q} (y))].}
$$

Here ${\displaystyle \operatorname {\mathbb{E}} }$ is the expected value operator, and $I$ is the information content of $y$. ${\displaystyle I(y)}$ is itself a random variable.

The entropy can explicitly be written as:

$$
{\displaystyle \mathrm {H} (y)=-\sum _{c=1}^{C}{\mathrm {q} (y_{c})\log _{b}\mathrm {q} (y_{c})}}
$$

where b is the base of the logarithm used. Common values of b are 2, Euler's number e, and 10, and the corresponding units of entropy are the bits for b = 2, nats for b = e, and bans for b = 10.

In the case of $\mathrm{q}(y_c) = 0$ for some $c$, the value of the corresponding summand $0 \log_b(0)$ is taken to be 0, which is consistent with the limit ${\displaystyle \lim _{q\to 0^{+}}q\log(q)=0}$.

#### Example

Let's assume we have a set of observations of 2 models of cars: BMW and AUDI. These are our labels. This can be modelled as a Bernoulli process of probabilities $(p, 1-p)$.

What if all of our observations were BMW? What would be the uncertainty of that distribution? ZERO, right? After all, there would be no doubt about the model of a car: it is always BMW! So, the entropy is zero! In this case, the probability mass function of $Y$ can be written as $y_{c} \in \{\mathtt{BMW}\} \sim \{1\}$, and the associated entropy can be calculated using:

\begin{align*}
\mathrm{H}(y) 
&= -\sum_{c=1}^{C}{\mathrm{q}(y_{c})\log_{b}\mathrm{q}(y_{c})} \newline
&= - \mathrm{Pr}(\mathtt{BMW})\log_{2}\mathrm{Pr}(\mathtt{BMW}) \newline
&= -  (1)\log_{2}(1) = 0
\end{align*}
    
In other words, we need **ZERO bits** to represent this random variable.

On the other hand, what if we knew exactly half of the cars were BMW and the other half AUDI? That’s the worst case scenario, right? We would have absolutely no edge on guessing the model of a car: it is totally random! This is the situation of maximum uncertainty, as it is most difficult to predict the model of the next car. The probability mass function of $Y$ can be written as $y_{c} \in \{\mathtt{BMW}, \mathtt{AUDI}\} \sim \{0.5, 0.5\}$, and the corresponding entropy is then:

\begin{align*}
\mathrm {H} (y) 
&= -\sum_{c=1}^{C}{\mathrm{q}(y_{c})\log_{b}\mathrm{q}(y_{c})} \newline
&= - \left[\mathrm{Pr}(\mathtt{BMW})\log_{2}\mathrm{Pr}(\mathtt{BMW}) + \mathrm{Pr}(\mathtt{AUDI})\log_{2}\mathrm{Pr}(\mathtt{AUDI})\right]\newline
&= - \left[ (0.5)\log_{2}(0.5) + (0.5)\log_{2}(0.5) \right] = 1
\end{align*}
    
We need exactly **ONE bit** to represent this random variable. 
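
The two cases above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `entropy` and the third, skewed distribution are illustrative assumptions, and summands with zero probability are skipped to implement the convention $0 \log_b(0) = 0$.

```python
import math


def entropy(q, base=2):
    """Shannon entropy -sum_c q(y_c) log_b q(y_c) of a probability mass function.

    q is a list of probabilities that sum to 1. Zero-probability outcomes
    are skipped, which implements the convention 0 * log(0) = 0.
    """
    return sum(-qc * math.log(qc, base) for qc in q if qc > 0)


all_bmw = entropy([1.0])        # every car is a BMW
fifty_fifty = entropy([0.5, 0.5])  # half BMW, half AUDI
skewed = entropy([0.9, 0.1])    # hypothetical skewed case, for comparison

print(all_bmw, fifty_fifty, skewed)
```

The skewed distribution lands strictly between the two extremes, in line with the statement that a skewed distribution has lower entropy than a uniform one.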

The next figure shows the entropy of a Bernoulli($p$) distribution as a function of the probability $p$:

<img src="figures/entropy.png" alt="entropy" style="width: 300px;"/>

## 2.2. Cross-Entropy

Cross-entropy builds upon the idea of entropy from information theory and calculates the number of bits required to represent or transmit an average event from one distribution compared to another distribution.

> ... the cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q ... — Page 57, Machine Learning: A Probabilistic Perspective, 2012.

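Note that the notation used below swaps the roles of the two symbols in the quote: here $q$ denotes the target (underlying) probability distribution and $p$ an approximation of it. The cross-entropy $H(q,p)$ is then the average number of bits needed to represent an event drawn from $q$ when the code is built for $p$ instead of $q$. It is never smaller than the entropy of $q$; the excess over the entropy is the number of additional bits incurred by using $p$ instead of $q$.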

For a discrete random variable ${\textstyle Y}$ with possible values ${\textstyle \left\{y_{1},\ldots ,y_{C}\right\}}$ and probability mass function ${\textstyle \mathrm{q}(y)}$ approximated by ${\textstyle \mathrm{p}(y)}$, this means:

$$
{\displaystyle H(q,p) = -\operatorname{E}_{q}[\log p] =- \sum_{c=1}^{C} q(y_c)\,\log_b p(y_c)}
$$
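
A short numerical sketch of this formula follows. The two distributions and the function name `cross_entropy` are assumptions made for illustration; the code also assumes $p(y_c) > 0$ wherever $q(y_c) > 0$, since otherwise the cross-entropy is infinite and `math.log` would raise an error.

```python
import math


def cross_entropy(q, p, base=2):
    """H(q, p) = -sum_c q(y_c) log_b p(y_c).

    q is the target distribution and p the approximation, both given as
    lists of probabilities over the same outcomes. Outcomes with
    q(y_c) = 0 contribute nothing and are skipped.
    """
    return sum(-qc * math.log(pc, base) for qc, pc in zip(q, p) if qc > 0)


q = [0.5, 0.5]   # hypothetical target distribution
p = [0.9, 0.1]   # hypothetical approximation

h_q = cross_entropy(q, q)    # cross-entropy of q with itself is its entropy
h_qp = cross_entropy(q, p)   # bits needed when coding with p instead of q

print(h_q, h_qp, h_qp - h_q)
```

The difference printed last is non-negative: using the wrong distribution never reduces the average number of bits.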

## 2.3. Binary cross-entropy of logistic regression model

Suppose we have a dataset $\boldsymbol{\mathscr{X}}$ consisting of $m$ datapoints and $n$ features. The class variable $\boldsymbol{\mathscr{y}}$ is a vector of length $m$ whose entries each take one of two values, $1$ or $0$.

Let $q(y \mid \mathbf{x})$ be the real, unknown distribution of our data (the **data distribution**), which we model using $p_{\theta}(y=1 \mid \mathbf{x}) = h_{\theta}(\mathbf{x})$ (the **model distribution**).

Logistic regression typically optimizes the loss for all the observations on which it is trained, which is the same as optimizing the average cross-entropy in the sample. For example, suppose we have ${\displaystyle m}$ samples with each sample indexed by ${\displaystyle i=1,\dots ,m}$. The average of the cross-entropy is then given by:

\begin{align*}
H(q,p) 
&= \frac {1}{m} \sum _{i=1}^{m} H(q_{i},p_{i}) \newline
&= -\frac{1}{m}\sum_{i=1}^{m} \sum_{y^{(i)} \in \{0,1\}} q(y^{(i)}|\mathbf{x}^{(i)}) \log p_{\theta}(y^{(i)}|\mathbf{x}^{(i)}) \newline
&=-\frac{1}{m}\sum_{i=1}^{m} \left[ q(y^{(i)}=1|\mathbf{x}^{(i)}) \log p_{\theta}(y^{(i)}=1|\mathbf{x}^{(i)}) + q(y^{(i)}=0|\mathbf{x}^{(i)}) \log p_{\theta}(y^{(i)}=0|\mathbf{x}^{(i)})\right]
\end{align*}

Because each label $y^{(i)}$ is observed, the data distribution puts all of its mass on the observed class:

${q(y^{(i)}=1|\mathbf{x}^{(i)})={\begin{cases}1&{\text{if }}y^{(i)}=1,\\0&{\text{if }}y^{(i)}=0.\end{cases}}}$, so we can write $q(y^{(i)}=1|\mathbf{x}^{(i)}) = y^{(i)}$.

${q(y^{(i)}=0|\mathbf{x}^{(i)})={\begin{cases}0&{\text{if }}y^{(i)}=1,\\1&{\text{if }}y^{(i)}=0.\end{cases}}}$, so we can write $q(y^{(i)}=0|\mathbf{x}^{(i)}) = 1-y^{(i)}$.

Therefore, the cross-entropy of our logistic model can be written as:

$$
H(q,p) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(\mathbf{x}^{(i)})) + (1-y^{(i)})\log(1-h_{\theta}(\mathbf{x}^{(i)})) \right]
$$
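
The reduction from the general cross-entropy to the binary formula can be verified numerically. In the sketch below the data distribution for each sample is built explicitly as a point mass on the observed label, and the result is compared with the closed-form expression; the labels, the predicted probabilities, and both function names are hypothetical.

```python
import math


def average_cross_entropy(labels, probs):
    """(1/m) * sum_i H(q_i, p_i) with q_i a point mass on the observed label."""
    total = 0.0
    for y, h in zip(labels, probs):
        q = {1: float(y == 1), 0: float(y == 0)}   # data distribution
        p = {1: h, 0: 1.0 - h}                     # model distribution
        # Classes with q = 0 contribute nothing and are skipped.
        total += sum(-q[c] * math.log(p[c]) for c in (0, 1) if q[c] > 0)
    return total / len(labels)


def binary_cross_entropy(labels, probs):
    """Closed form: -(1/m) * sum_i [ y log h + (1 - y) log(1 - h) ]."""
    m = len(labels)
    return -sum(
        y * math.log(h) + (1 - y) * math.log(1.0 - h)
        for y, h in zip(labels, probs)
    ) / m


labels = [1, 0, 1, 0]            # hypothetical y^(i)
probs = [0.9, 0.2, 0.8, 0.6]     # hypothetical h_theta(x^(i))

print(average_cross_entropy(labels, probs), binary_cross_entropy(labels, probs))
```

Both numbers agree, and they also match the cost $J(\boldsymbol{\theta})$ derived from maximum likelihood in Section 1.3.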

# 3. Hypothesis function

In the previous sections, we explored 2 methods that allow us to define our cost function. The choice of the cost function is tightly coupled with the choice of the hypothesis function. Most of the time, we simply use cross-entropy between the data distribution and model distribution. The choice of how to represent the hypothesis function then determines the form of the cross-entropy function.

Our hypothesis function needs to predict only $p_\theta(y=1\mid \mathbf{x})$. For this number to be a valid probability, it must lie in the interval $[0,1]$. Satisfying this constraint requires some careful design effort. Suppose we were to use a linear function and threshold its value to obtain a valid probability:

$$
p_\theta(y=1 \mid \mathbf{x}) = \max \{0, \min \{1, \boldsymbol{\theta}^\top \mathbf{x}\}\}
$$

This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time $\boldsymbol{\theta}^\top \mathbf{x}$ strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters. Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer.
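
The vanishing gradient of the thresholded linear model can be observed directly. The sketch below estimates the gradient with central finite differences, which is an illustrative stand-in for an analytic gradient; the parameter vectors, the input, and the function names are hypothetical.

```python
def clipped_linear(theta, x):
    """max{0, min{1, theta^T x}}: a linear score clipped to [0, 1]."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return max(0.0, min(1.0, z))


def numerical_gradient(f, theta, x, eps=1e-6):
    """Central finite-difference estimate of d f(theta, x) / d theta."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((f(up, x) - f(down, x)) / (2 * eps))
    return grad


x = [1.0, 2.0]
theta_inside = [0.1, 0.2]    # theta^T x = 0.5, inside the unit interval
theta_outside = [1.0, 1.0]   # theta^T x = 3.0, clipped to 1

grad_inside = numerical_gradient(clipped_linear, theta_inside, x)
grad_outside = numerical_gradient(clipped_linear, theta_outside, x)
print(grad_inside, grad_outside)
```

Inside the interval the gradient is approximately $\mathbf{x}$; once the score is clipped, every component is zero, so a gradient step would leave the parameters unchanged no matter how wrong the prediction is.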

One such approach is to build the hypothesis function from an unnormalized probability distribution $\tilde{p}(y)$, which does not sum to 1. We begin with the assumption that the unnormalized log probabilities ($\log \tilde{p}$) are linear in $\boldsymbol{\theta}^\top \mathbf{x}$. We can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of $\boldsymbol{\theta}^\top \mathbf{x}$.

Let 

\begin{equation}
\log \tilde{p}(y=c) = c \ \boldsymbol{\theta}^\top \mathbf{x} \ \ \ \ \ c \in \{0,1\}
\end{equation}

Then we can write:

\begin{align*}
\tilde{p}(y=c) & = e^{c \ \boldsymbol{\theta}^\top \mathbf{x}} \ \ \ \ \ c \in \{0,1\} \newline
{p}(y=c) & = \frac{e^{c \ \boldsymbol{\theta}^\top \mathbf{x}}}{\sum_{c^\prime=0}^1 e^{c^\prime \ \boldsymbol{\theta}^\top \mathbf{x}}} \ \ \ \ \ c \in \{0,1\} 
\end{align*}

The class probabilities are therefore:

$${{p}(y=c) = {
\begin{cases} 
\frac{e^{ \boldsymbol{\theta}^\top \mathbf{x}}}{1+e^{\boldsymbol{\theta}^\top \mathbf{x}}} = \frac{1}{1+e^{-\boldsymbol{\theta}^\top \mathbf{x}}} = \sigma(\boldsymbol{\theta}^\top \mathbf{x}) & {\text{if }}c=1, \\
\frac{1}{1+e^{ \boldsymbol{\theta}^\top \mathbf{x}}} = \sigma(-\boldsymbol{\theta}^\top \mathbf{x}) = 1 - \sigma(\boldsymbol{\theta}^\top \mathbf{x}) & {\text{if }}c=0. 
\end{cases}}
}
$$

Here, $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function.
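
The exponentiate-then-normalize construction can be checked against the sigmoid numerically. In this sketch `z` stands for the score $\boldsymbol{\theta}^\top \mathbf{x}$; the function names are illustrative, and the two-branch form of `sigmoid` is a common numerical-stability variant rather than something taken from the derivation above.

```python
import math


def normalized_probability(c, z):
    """p(y = c) obtained by normalizing exp(c * z) over c in {0, 1}."""
    unnormalized = [math.exp(k * z) for k in (0, 1)]
    return unnormalized[c] / sum(unnormalized)


def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)   # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


for z in (-3.0, -0.5, 0.0, 0.5, 3.0):
    # p(y=1) should equal sigma(z) and p(y=0) should equal sigma(-z).
    print(z, normalized_probability(1, z), sigmoid(z),
          normalized_probability(0, z), sigmoid(-z))
```

The naive normalization overflows for large positive scores because `math.exp` raises `OverflowError`; the branch on the sign of `z` in `sigmoid` is one way to avoid that.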

# 4. Generalizing to multinomial logistic regression

Now we will approach the classification of data when we have more than two categories. Instead of $y \in \{0,1\}$ we expand our definition so that $y \in \{1,2,\ldots,C\}$. In this case, the model parameters form a two-dimensional matrix $\boldsymbol{\Theta}$ of size $n \times C$, with one column of parameters per class. The probability of the class variable value $y^{(i)}=c$, for $c \in \{1, 2, \ldots, C\}$ and $i=1,2,\ldots,m$, can be modelled as follows:


\begin{equation}
p(y^{(i)}=c|\mathbf{x}^{(i)};\boldsymbol{\Theta}) = h^{(c)}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)}), \ \ \ \ \ \ \ \  c \in \{1, 2, ..., C\}
\end{equation}

where the matrix of parameters $\Theta$ can be written as:

\begin{equation}
\Theta = \left[\begin{array}{cccc}| & | & | & | \\
\boldsymbol{\theta}^{(1)} & \boldsymbol{\theta}^{(2)} & \cdots & \boldsymbol{\theta}^{(C)} \\
| & | & | & |
\end{array}\right], \ \ \ \ \ \ \ \  \boldsymbol{\theta}^{(c)} = [\theta_{1}^{(c)}, \theta_{2}^{(c)}, ..., \theta_{n}^{(c)}] 
\end{equation}

Using an indicator function, the likelihood of a single data point $(\mathbf{x}^{(i)}, y^{(i)})$, i.e. the probability of the label $y^{(i)}$ given $\mathbf{x}^{(i)}$, can be written as:

\begin{equation}
p(y^{(i)}|\mathbf{x}^{(i)};{\Theta}) = \prod_{c=1}^C h_{\Theta}^{(c)} (\mathbf{x}^{(i)})^{\mathbb{1}_{y^{(i)}=c}} \ \ \ \text{where} \ \ \ {\mathbb{1}_{y^{(i)}=c}={\begin{cases}1&{\text{if }}y^{(i)}=c,\\0&{\text{otherwise.}}\end{cases}}} 
\end{equation}

Given that the data points are independently and identically distributed, the likelihood of the entire dataset $\boldsymbol{\mathscr{X}}$ is the product of the individual data point likelihoods. Thus,

$$
{\mathcal {L}}({\Theta}) = P(\boldsymbol{\mathscr{y}}|\boldsymbol{\mathscr{X}};\Theta) = \prod_{i=1}^{m} p(y^{(i)} | \mathbf{x}^{(i)};\Theta)
$$

As in the binary case, we maximize the log-likelihood instead of the likelihood. The log-likelihood $\log {\mathcal{L}}(\Theta)$ is:

\begin{align*}
\log {\mathcal {L}}({\Theta}) = \log P(\boldsymbol{\mathscr{y}}|\boldsymbol{\mathscr{X}};\Theta) & =  \sum_{i=1}^{m} \log\left(\prod_{c=1}^C h_{\Theta}^{(c)}(\mathbf{x}^{(i)}) ^ {\mathbb{1}_{y^{(i)}=c}}\right) \newline
& =  \sum_{i=1}^{m} \sum_{c=1}^C {\mathbb{1}_{y^{(i)}=c}} \log( h_{\Theta}^{(c)}(\mathbf{x}^{(i)}) )
\end{align*}

Using the average cost over all data points, our cost function for multinomial logistic regression comes out to be:

$$
J({\Theta}) = - \dfrac{1}{m} \log \mathcal{L}({\Theta}) = - \dfrac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^C {\mathbb{1}_{y^{(i)}=c}} \log( h_{\Theta}^{(c)}(\mathbf{x}^{(i)}) ) 
$$

This can also be easily obtained using a multi-class cross-entropy:

\begin{align*}
H(q,p) 
&= \frac {1}{m} \sum _{i=1}^{m} H(q_{i},p_{i}) \newline
&= -\frac{1}{m}\sum_{i=1}^{m} \sum_{c =1}^C q(y^{(i)}=c|\mathbf{x}^{(i)}) \log p(y^{(i)}=c|\mathbf{x}^{(i)}; \Theta) \newline
&= -\frac{1}{m}\sum_{i=1}^{m} \sum_{c =1}^C {\mathbb{1}_{y^{(i)}=c}}  \log h^{(c)}_{\boldsymbol{\Theta}}(\mathbf{x}^{(i)})
\end{align*}
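
The multi-class cost can be sketched in a few lines. The example below computes the double sum with the indicator literally and then the shortcut that simply picks the probability of the observed class; the labels, the predicted class probabilities, and the function names are hypothetical.

```python
import math


def cost_with_indicator(labels, probs):
    """-(1/m) * sum_i sum_c 1[y_i = c] * log h^(c)(x_i), classes numbered 1..C."""
    m = len(labels)
    total = 0.0
    for y, p in zip(labels, probs):
        for c, p_c in enumerate(p, start=1):
            indicator = 1.0 if y == c else 0.0
            total += indicator * math.log(p_c)
    return -total / m


def cost_picking_true_class(labels, probs):
    """Same cost: only the observed class of each sample contributes."""
    m = len(labels)
    return -sum(math.log(p[y - 1]) for y, p in zip(labels, probs)) / m


# Hypothetical data: 3 samples, C = 3 classes; each row of probs sums to 1.
labels = [1, 3, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

print(cost_with_indicator(labels, probs), cost_picking_true_class(labels, probs))
```

Because the indicator is zero for every class except the observed one, the inner sum collapses to a single term, which is why implementations usually index the true class directly.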


Similarly, we can define the hypothesis function for the multi-class case from unnormalized probabilities. We begin with the assumption that the unnormalized log probability ($\log \tilde{p}(y=c)$) is linear in $\boldsymbol{\theta}^{(c)}$ for every $c \in \{1, 2, ..., C\}$. We can exponentiate to obtain the unnormalized probabilities. We then normalize to obtain the softmax function.

Let 

\begin{equation}
\log \tilde{p}(y=c) = \boldsymbol{\theta}^{(c)\top} \mathbf{x} \ \ \ \ \ c \in \{1, 2,..., C\}
\end{equation}

Then we can write:

\begin{align*}
\tilde{p}(y=c) & = e^{\boldsymbol{\theta}^{(c)\top} \mathbf{x}} \ \ \ \ \ c \in \{1, 2,..., C\} \newline
{p}(y=c) & = \frac{e^{\boldsymbol{\theta}^{(c)\top} \mathbf{x}}}{\sum_{c^\prime=1}^{C} e^{\boldsymbol{\theta}^{(c^\prime)\top} \mathbf{x}}} \ \ \ \ \ c \in \{1, 2,..., C\}
\end{align*}
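
A minimal sketch of the resulting softmax hypothesis is shown below. It stores $\Theta$ as a list of $C$ parameter vectors (one per class), and subtracts the largest score before exponentiating; that shift is a standard numerical-stability trick that leaves the probabilities unchanged and is not part of the derivation above. All names and values are hypothetical.

```python
import math


def softmax(scores):
    """Normalize exp(score_c) over all classes.

    Subtracting max(scores) first avoids overflow and cancels out in the
    ratio, so the returned probabilities are unchanged.
    """
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(Theta, x):
    """p(y = c | x) for c = 1..C, where Theta[c-1] is the vector theta^(c)."""
    scores = [sum(t * xj for t, xj in zip(theta_c, x)) for theta_c in Theta]
    return softmax(scores)


# Hypothetical parameters for C = 3 classes and n = 2 features.
Theta = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]
x = [1.0, 2.0]

probs = class_probabilities(Theta, x)
print(probs, sum(probs))

# With two classes and scores (0, z), softmax reduces to the sigmoid of z.
z = 0.3
print(softmax([0.0, z])[1], 1.0 / (1.0 + math.exp(-z)))
```

The last two printed values agree, which connects the multi-class model back to the binary case: fixing one class score at zero recovers the sigmoid from Section 3.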

# References

Wikipedia contributors. (2020, March 31). Likelihood function. In Wikipedia, The Free Encyclopedia. Retrieved 14:56, April 3, 2020, from https://en.wikipedia.org/w/index.php?title=Likelihood_function&oldid=948338993

Wikipedia contributors. (2020, March 4). Maximum likelihood estimation. In Wikipedia, The Free Encyclopedia. Retrieved 14:57, April 3, 2020, from https://en.wikipedia.org/w/index.php?title=Maximum_likelihood_estimation&oldid=943960597


Wikipedia contributors. (2020, March 27). Entropy (information theory). In Wikipedia, The Free Encyclopedia. Retrieved 14:54, April 3, 2020, from https://en.wikipedia.org/w/index.php?title=Entropy_(information_theory)&oldid=947602458


Wikipedia contributors. (2020, March 12). Cross entropy. In Wikipedia, The Free Encyclopedia. Retrieved 14:53, April 3, 2020, from https://en.wikipedia.org/w/index.php?title=Cross_entropy&oldid=945192009

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.