# Introduction to Deep Learning


**References**
- [Understanding Deep Learning](https://udlbook.github.io/udlbook/)
- [Probabilistic Machine Learning](https://probml.github.io/pml-book/)
- [Dive into Deep Learning](https://d2l.ai/)
- [Deep Learning Fundamentals](https://lightning.ai/courses/deep-learning-fundamentals/)
- [Practical Deep Learning](https://course.fast.ai/)

## Deep Learning

- **Artificial intelligence (AI)** is concerned with building systems that simulate intelligent behavior.
- **Machine learning** is a subset of AI that learns to make decisions by fitting mathematical models to observed data.
- A **deep neural network** is a type of machine learning model, and the process of fitting these models to data is referred to as **deep learning**.

## Loss functions

- When we train these models, we seek the parameters that produce the best possible mapping from input to output for the task we are considering.
- Supervised learning models define a mapping from input data to an output prediction.

We have a training dataset $\{x_i, y_i\}$ of input/output pairs.
Consider a model $f_{\phi}(x)$ with parameters $\phi$ that computes an output from input $x$.
We often think that the model directly computes a prediction $y$.
We now shift perspective and consider the model as computing a conditional probability distribution $p(y|x)$ over possible outputs $y$ given input $x$.
A **loss function** $L(\phi)$ returns a single number that describes the mismatch between the model predictions and their corresponding ground-truth outputs.
The loss encourages each training output $y_i$ to have a high probability under the distribution $p(y_i|x_i)$ computed from the corresponding input $x_i$.

### Negative log-likelihood criterion
We choose a parametric distribution $p(y|\theta)$ defined on the output domain $y$. Then we use the neural network to compute one or more of the parameters $\theta$ of this distribution.

The model now computes different distribution parameters $\theta_i = f_\phi(x_i)$ for each training input $x_i$. Each observed training output $y_i$ should have high probability under its corresponding distribution $p(y_i|\theta_i)$. Hence, we choose the model parameters $\phi$ so that they maximize the combined probability across all $I$ training examples.

We assume that
- the data are identically distributed (the form of the probability distribution over the outputs $y_i$ is the same for each data point).
- the conditional distributions $p(y_i|x_i)$ of the outputs given the inputs are independent, so the total likelihood of the training data decomposes as:
$$
p(y_1, y_2, ..., y_I | x_1, x_2, ..., x_I) = \prod_{i=1}^{I}p(y_i|x_i)
$$
In other words, we assume the data are **independent and identically distributed (i.i.d.)**.

Then, the model parameter $\hat{\phi}$ we want to find is:

$$
\begin{align}
\hat{\phi} & = \underset{\phi}{\mathrm{argmax}}\left[ \prod_{i=1}^{I} p(y_i|x_i)  \right] \\
& = \underset{\phi}{\mathrm{argmax}}\left[ \prod_{i=1}^{I} p(y_i|\theta_i)  \right] \\
& = \underset{\phi}{\mathrm{argmax}}\left[ \prod_{i=1}^{I} p(y_i|f_\phi(x_i))  \right] \\
\end{align}
$$

The combined probability term is the **likelihood** of the parameters and this equation is known as the **maximum likelihood** criterion.

A conditional probability $p(z|\psi)$ can be considered in two ways.
- As a function of $z$, it is a probability distribution that sums to one.
- As a function of $\psi$, it is a likelihood and does not generally sum to one.

The maximum likelihood criterion is not very practical. Each term $p(y_i|f_\phi(x_i))$ can be small, so the product of many of these terms can be tiny. It may be difficult to represent this quantity with finite precision arithmetic. Fortunately, we can equivalently maximize the logarithm of the likelihood:

$$
\begin{align}
\hat{\phi} & = \underset{\phi}{\mathrm{argmax}}\left[ \prod_{i=1}^{I} p(y_i|f_\phi(x_i))  \right] \\
& = \underset{\phi}{\mathrm{argmax}}\left[ \log \prod_{i=1}^{I} p(y_i|f_\phi(x_i))  \right] \\
& = \underset{\phi}{\mathrm{argmax}}\left[ \sum_{i=1}^{I} \log p(y_i|f_\phi(x_i))  \right] \\
\end{align}
$$

This **log-likelihood** criterion is equivalent because the logarithm is a monotonically increasing function. The log-likelihood criterion has the practical advantage of using a sum of terms, not a product, so representing it with finite precision isn't problematic.
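The finite-precision problem is easy to reproduce. The following is a minimal sketch, assuming 500 training examples that each have a (made-up) likelihood of 0.01; the direct product underflows to zero in double precision, while the sum of logarithms stays a moderate negative number.

```python
import math

# Hypothetical per-example likelihoods p(y_i | f_phi(x_i)); the values are made up.
probs = [0.01] * 500

# Direct product of the likelihoods: underflows to 0.0 in double precision.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Sum of log-likelihoods: a moderate negative number that is easy to represent.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)
print(log_likelihood)
```

Once the product has underflowed, every nearby parameter setting looks equally bad, so the product gives no usable signal for optimization, whereas the log-likelihood still does.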

By convention, model fitting problems are framed in terms of minimizing a loss. To convert the maximum log-likelihood criterion to a minimization
problem, we multiply by minus one, which gives us the **negative log-likelihood criterion**:

$$
\begin{align}
\hat{\phi} & = \underset{\phi}{\mathrm{argmin}}\left[-\sum_{i=1}^{I}\log p(y_i|f_\phi(x_i))\right] \\
& = \underset{\phi}{\mathrm{argmin}}[L(\phi)]
\end{align}
$$
which is what forms the final loss function $L(\phi)$.

The network no longer directly predicts the outputs $y$ but instead determines a probability distribution over $y$. When we perform inference, we often want a point estimate rather than a distribution, so we return the maximum of the distribution:

$$
\hat{y} = \underset{y}{\mathrm{argmax}}[p(y|f_{\hat{\phi}}(x))]
$$


#### Recipe for constructing loss functions

The recipe for constructing loss functions for training data $\{x_i, y_i\}$ using the maximum likelihood approach is:

1. Choose a suitable probability distribution $p(y|\theta)$ defined over the domain of the predictions $y$ with distribution parameters $\theta$.
1. Set the machine learning model $f_\phi(x)$ to predict one or more of these parameters, so $\theta=f_\phi(x)$ and $p(y|\theta)=p(y|f_\phi(x))$.
1. To train the model, find the network parameters $\hat{\phi}$ that minimize the negative log-likelihood loss function over the training dataset pairs $\{x_i, y_i\}$:
$$
\hat{\phi} = \underset{\phi}{\mathrm{argmin}}[L(\phi)] = \underset{\phi}{\mathrm{argmin}}\left[-\sum_{i=1}^{I}\log p(y_i|f_\phi(x_i))\right]
$$
1. To perform inference for a new test example $x$, return either the full distribution $p(y|f_{\hat{\phi}}(x))$ or the value where the distribution is maximized.
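The recipe can be sketched for one concrete choice. The example below assumes a normal distribution with fixed standard deviation $\sigma = 1$ for step 1 and a hypothetical linear model $f_\phi(x) = a x + b$ that predicts its mean for step 2; neither choice comes from the text above, and the data values are made up. Under these assumptions the negative log-likelihood equals half the sum of squared errors plus a constant, so minimizing it is equivalent to least squares.

```python
import math

# Toy training pairs (made-up values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

def model(x, a, b):
    """Step 2: a hypothetical linear model that predicts the mean of the normal distribution."""
    return a * x + b

def gaussian_nll(a, b, sigma=1.0):
    """Step 3: negative log-likelihood under Normal(y | f_phi(x), sigma^2)."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = model(x, a, b)
        log_p = -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)
        total -= log_p
    return total

def sum_squared_error(a, b):
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

# With sigma = 1, the NLL is half the squared error plus a constant independent of (a, b).
const = 0.5 * len(xs) * math.log(2 * math.pi)
for a, b in [(1.0, 0.0), (0.5, 0.5)]:
    print(gaussian_nll(a, b), 0.5 * sum_squared_error(a, b) + const)
```

For step 4, the maximum of a normal distribution is its mean, so the point estimate under these assumptions is simply `model(x, a, b)` evaluated at the fitted parameters.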

## Cross-entropy criterion

The cross-entropy loss is based on the idea of finding parameters $\theta$ that minimize the distance between the empirical distribution $q(y)$ of the observed data $y$ and a model distribution $p(y|\theta)$. The distance between two probability distributions $q(z)$ and $p(z)$ can be evaluated using the Kullback-Leibler (KL) divergence:

$$
D_{KL}[q||p] = \int_{-\infty}^{\infty} q(z) \log \frac{q(z)}{p(z)} dz
$$

Consider an empirical data distribution at points $\{ y_i \}_{i=1}^{I}$. We can describe this as a weighted sum of point masses:

$$
q(y) = \frac{1}{I} \sum_{i=1}^{I} \delta(y-y_i)
$$

where $\delta$ is the Dirac delta function. We want to minimize the KL divergence between the model distribution $p(y|\theta)$ and this empirical distribution $q(y)$:

$$
\begin{align}
\hat{\theta} & = \underset{\theta}{\mathrm{argmin}}\left[
  \int_{-\infty}^{\infty} q(y) \log q(y) dy - \int_{-\infty}^{\infty} q(y) \log p(y|\theta) dy
  \right] \\
  & = \underset{\theta}{\mathrm{argmin}}\left[- \int_{-\infty}^{\infty} q(y) \log p(y|\theta) dy
  \right]
\end{align}
$$

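where the first term disappears, as it has no dependence on $\theta$. The remaining second term is known as the **cross-entropy**. It can be interpreted as the amount of uncertainty that remains in one distribution after taking into account what we already know from the other. Substituting the empirical distribution for $q(y)$ and dropping the constant factor $1/I$, which does not change the position of the minimum, gives: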

$$
\begin{align}
\hat{\theta}
  & = \underset{\theta}{\mathrm{argmin}}\left[- \int_{-\infty}^{\infty} \frac{1}{I} \sum_{i=1}^{I} \delta(y-y_i) \log p(y|\theta) dy
  \right] \\
  & = \underset{\theta}{\mathrm{argmin}}\left[- \frac{1}{I} \sum_{i=1}^{I} \log p(y_i|\theta)
  \right] \\
  & = \underset{\theta}{\mathrm{argmin}}\left[- \sum_{i=1}^{I} \log p(y_i|\theta)
  \right]
\end{align}
$$

In machine learning, the distribution parameters $\theta$ are computed by the model $f_\phi(x_i)$, so we have:

$$
\hat{\phi} = \underset{\phi}{\mathrm{argmin}}\left[- \sum_{i=1}^{I} \log p(y_i|f_\phi(x_i))
  \right]
$$

This is precisely the negative log-likelihood criterion.

**It follows that the negative log-likelihood criterion (from maximizing the data likelihood) and the cross-entropy criterion (from minimizing the distance between the model and empirical data distributions) are equivalent.**
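The equivalence can be checked numerically in a discrete setting. This is a small sketch with made-up class labels and a hypothetical three-class model distribution; sums over classes take the place of the integral over Dirac deltas. The negative log-likelihood should equal $I$ times the cross-entropy between the empirical and model distributions.

```python
import math
from collections import Counter

# Made-up class labels y_i and a hypothetical model distribution p(y | theta) over 3 classes.
labels = [0, 1, 1, 2, 1]
p_model = [0.2, 0.5, 0.3]
I = len(labels)

# Negative log-likelihood: one term per training example.
nll = -sum(math.log(p_model[y]) for y in labels)

# Empirical distribution q(y): fraction of training examples in each class.
counts = Counter(labels)
q = [counts[k] / I for k in range(len(p_model))]

# Cross-entropy between the empirical distribution and the model distribution.
cross_entropy = -sum(q_k * math.log(p_k) for q_k, p_k in zip(q, p_model) if q_k > 0)

print(nll, I * cross_entropy)
```

The two printed numbers agree up to floating-point rounding, and the factor $I$ does not affect which parameters minimize the loss.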

## Cross entropy

- [Entropy (for data science) Clearly Explained!!!](https://www.youtube.com/watch?v=YtebGVx-Fxw)
- [Neural Networks Part 6: Cross Entropy](https://www.youtube.com/watch?v=6ArSys5qHAU)
- [A Short Introduction to Entropy, Cross-Entropy and KL-Divergence](https://www.youtube.com/watch?v=ErfnhcEV1O8)
- [KL divergence](https://www.youtube.com/watch?v=SxGYPqCgJWM)
- [The KL Divergence : Data Science Basics](https://www.youtube.com/watch?v=q0AkK8aYbLY)
- [A Gentle Introduction to Cross-Entropy for Machine Learning](https://machinelearningmastery.com/cross-entropy-for-machine-learning/)

The **information** quantifies the number of bits required to encode and transmit an event. Lower probability events have more information, higher probability events have less information.

In information theory, we like to describe the “surprise” of an event. An event is more surprising the less likely it is, meaning it contains more information.

- Low Probability Event (surprising): More information.
- Higher Probability Event (unsurprising): Less information.

Information $h(x)$ can be calculated for an event $x$, given the probability of the event $P(x)$ as follows:

$$
h(x) = -\log P(x)
$$

The **entropy** is the number of bits required to transmit a randomly selected event from a probability distribution. A skewed distribution has a low entropy, whereas a distribution where events have equal probability has a larger entropy.

A skewed probability distribution has less “surprise” and in turn a low entropy because likely events dominate. Balanced distributions are more surprising and in turn have a higher entropy because events are equally likely.

- Skewed Probability Distribution (unsurprising): Low entropy.
- Balanced Probability Distribution (surprising): High entropy.

Entropy $H(P)$ is the expected information for a probability distribution $P(x)$.

$$
H(P) = \sum_x P(x)h(x) = -\sum_x P(x) \log P(x)
$$
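As a quick illustration of the formula, the sketch below compares a balanced and a skewed distribution over four events. The probabilities are made up, and the logarithm is taken to base 2 so that the result is in bits (an assumption, since the formula above leaves the base unspecified).

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(P) = -sum_x P(x) log2 P(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

balanced = [0.25, 0.25, 0.25, 0.25]   # all events equally likely
skewed = [0.97, 0.01, 0.01, 0.01]     # one event dominates

print(entropy_bits(balanced))  # 2.0 bits for four equally likely events
print(entropy_bits(skewed))    # much smaller than 2 bits
```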

The **cross entropy** is the average number of bits needed to encode data coming from a source with distribution $P$ when we use model $Q$:

$$
H(P, Q) = -\sum_x P(x) \log Q(x) = H(P) + D_{KL}(P || Q)
$$

The **Kullback-Leibler (KL) divergence** is the average number of extra bits needed to encode the data, due to the fact that we used distribution $Q$ to encode the data instead of the true distribution $P$.

- Cross-Entropy: Average number of total bits to represent an event from $P$ using a code based on $Q$.
- Relative Entropy (KL Divergence): Average number of extra bits to represent an event from $P$ using a code based on $Q$ instead of $P$.

$$
D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}
$$
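The relation $H(P, Q) = H(P) + D_{KL}(P || Q)$ can be verified directly. The following is a minimal sketch with two made-up distributions over three events, using base-2 logarithms so that all quantities are in bits.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits: data from P encoded with a code based on Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits: the extra cost of using Q instead of P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]   # "true" distribution (made-up values)
Q = [0.8, 0.1, 0.1]     # model distribution (made-up values)

print(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))
print(kl_divergence(P, Q), kl_divergence(Q, P))
```

The second line of output shows that the KL divergence is not symmetric in its arguments, which is why it is not a true distance.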

## Entropy

- [Concepts in Thermal Physics](https://students.aiu.edu/submissions/profiles/resources/onlineBook/W5S8i2_Thermal_Physics.pdf)

### Information and Shannon entropy

One property we can notice is that the greater the probability of a statement being true in the absence of any prior information, the less the information content of the statement. Since the probability of two independent statements both being true is the product of their individual probabilities, and since it is natural to assume that information content is additive, one is motivated to adopt the definition of information proposed by Claude Shannon (1916-2001):

The **information** content $Q$ of a statement is defined by

$$
Q = -k \log P
$$

where $P$ is the probability of the statement and $k$ is a positive constant. (We need $k$ to be a positive constant so that as $P$ goes up, $Q$ goes down.)

If we use $\log_2$ (log to the base 2) for the logarithm in this expression and also $k = 1$, then the information $Q$ is measured in **bits**.

If we have a set of statements with probability $P_i$, with corresponding information $Q_i = -k \log P_i$,  then the average information content $S$ is given by

$$
S = \langle Q \rangle = \sum_i Q_i P_i = -k \sum_i P_i \log P_i
$$

The average information is called the **Shannon entropy**.

The Shannon entropy quantifies how much information we gain, on
average, following a measurement of a particular quantity. (Another way
of looking at it is to say the Shannon entropy quantifies the amount of uncertainty we have about a quantity before we measure it.)

### Information and thermodynamics
If instead
we use $\ln \equiv \log_e$ and choose $k = k_B$, then we have a definition that will match what we have found in thermodynamics (Gibbs' entropy). This gives us a useful
perspective on what thermodynamic entropy is. It is a measure of our
uncertainty of a system, based on our limited knowledge of its properties and ignorance about which of its microstates it is in.

In making inferences on the basis of partial information, we can assign probabilities on the basis that we maximize entropy subject to the constraints provided by what is known about the system. When we maximize the Gibbs entropy of an isolated system subject to the constraint that the total energy $U$ is constant, we recover the Boltzmann probability distribution. With this viewpoint, one can begin to understand thermodynamics from an information theory viewpoint.

### Ensembles

We are using probability to describe thermal systems and our approach
is to imagine repeating an experiment to measure a property of a system again and again because we cannot control the microscopic properties
(as described by the system’s microstates). In an attempt to formalize this, Josiah Willard Gibbs in 1878 introduced a concept known as an
**ensemble**.

This is an idealization in which one considers making a
large number of mental "photocopies" of the system, each one of which
represents a possible state the system could be in. There are three main
ensembles that tend to be used in thermal physics:

- (1) The **microcanonical ensemble**: an ensemble of systems that each have the same fixed energy
- (2) The **canonical ensemble**: an ensemble of systems, each of which
can exchange its energy with a large reservoir of heat.
- (3) The **grand canonical ensemble**: an ensemble of systems, each of which can exchange both energy and particles with a large reservoir.

### A statistical definition of temperature

Two bodies in thermal contact are said to be in **thermal equilibrium** when their energy content and temperatures no longer change with time. We would expect two bodies in thermal equilibrium to be at the same temperature.

- The system could be described by a very large number of equally likely microstates.
- What you actually measure is a property of the macrostate of the system. The macrostates are not equally likely, because different
macrostates correspond to different numbers of microstates.

Consider two large systems
that can exchange energy with each other, but not with anything else. In other words, the two systems are in thermal contact with each other, but thermally isolated from their surroundings. The first
system has energy $E_1$ and the second system has energy $E_2$. The total
energy $E = E_1 + E_2$ is therefore assumed fixed since the two systems
cannot exchange energy with anything else. Hence the value of $E_1$ is
enough to determine the macrostate of this joint system.

Let us assume that the first system can be in any one of $\Omega_1(E_1)$ microstates and the second system can be in any one of $\Omega_2(E_2)$ microstates. Thus the whole system can be in any one of $\Omega_1(E_1)\Omega_2(E_2)$ microstates.

A system will appear to choose
a macroscopic configuration that maximizes the number of microstates. This idea is based on the following assumptions:
- (1) Each one of the possible microstates of a system is equally likely to occur;
- (2) The system’s internal dynamics are such that the microstates of
the system are continually changing;
- (3) Given enough time, the system will explore all possible microstates
and spend an equal time in each of them. (This is the so-called ergodic hypothesis.)

These assumptions imply that the system will most likely be found in a
configuration that is represented by the most microstates.

For our problem of two connected systems, the most probable division of energy between the two systems is the one that maximizes
$\Omega_1(E_1)\Omega_2(E_2)$, because this will correspond to the greatest number of possible microstates. Our systems are large and hence we can use calculus to study their properties; we can therefore consider making infinitesimal changes to the energy of one of the systems and seeing what
happens. Therefore, we can maximize this expression with respect to $E_1$ by writing

$$
\frac{d}{dE_1}\Omega_1(E_1)\Omega_2(E_2) = 0
$$

and hence, using standard rules for the differentiation of a product,

$$
\Omega_2(E_2)\frac{d \Omega_1(E_1)}{dE_1} + \Omega_1(E_1)\frac{d \Omega_2(E_2)}{dE_2}\frac{d E_2}{d E_1} = 0
$$

Since the total energy $E = E_1 + E_2$ is assumed fixed, this implies that

$$
dE_1 =  - dE_2
$$

and hence

$$
\frac{1}{\Omega_1}\frac{d \Omega_1}{dE_1} - \frac{1}{\Omega_2}\frac{d \Omega_2}{dE_2} = 0
$$

$$
\frac{d \ln \Omega_1}{dE_1} = \frac{d \ln \Omega_2}{dE_2}
$$

This condition defines the most likely division of energy between the two systems if they are allowed to exchange energy, since it maximizes the total number of microstates. This division of energy is, of course, more usually called “being at the same temperature”, and so we identify $d\ln\Omega / dE$ with a quantity that depends only on the temperature $T$ (so that $T_1 = T_2$). We define the **temperature** $T$ by

$$
\frac{1}{k_B T} = \frac{d \ln \Omega}{d E}
$$

where $k_B$ is the **Boltzmann constant**.
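The equal-slope condition can be illustrated numerically. The sketch below assumes a specific model that is not part of the text above: two Einstein solids, for which the number of ways of sharing $q$ energy quanta among $n$ oscillators is $\binom{q+n-1}{q}$. The system sizes and the total energy are illustrative. It finds the division of energy that maximizes $\Omega_1\Omega_2$ and then estimates $d\ln\Omega/dE$ for each system at that point by finite differences.

```python
import math

def multiplicity(n_osc, q):
    """Microstates of a hypothetical Einstein solid: q quanta shared among n_osc oscillators."""
    return math.comb(q + n_osc - 1, q)

N1, N2 = 300, 200   # sizes of the two systems (illustrative)
Q_TOTAL = 100       # fixed total energy E = E1 + E2, measured in quanta

def total_microstates(q1):
    """Omega_1(E1) * Omega_2(E2) with E2 = E - E1."""
    return multiplicity(N1, q1) * multiplicity(N2, Q_TOTAL - q1)

# Most probable division of energy: the q1 that maximizes the total number of microstates.
best_q1 = max(range(Q_TOTAL + 1), key=total_microstates)

def dlnomega_dE(n_osc, q):
    """Central finite-difference estimate of d ln(Omega) / dE at energy q."""
    return (math.log(multiplicity(n_osc, q + 1)) - math.log(multiplicity(n_osc, q - 1))) / 2

slope1 = dlnomega_dE(N1, best_q1)
slope2 = dlnomega_dE(N2, Q_TOTAL - best_q1)
print(best_q1, slope1, slope2)
```

At the maximum the two slopes agree closely, which is the discrete counterpart of $d\ln\Omega_1/dE_1 = d\ln\Omega_2/dE_2$; the small remaining difference comes from the energy being quantized.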

### Canonical ensemble

Consider two systems coupled as before in such a way that they
can exchange energy. This time, we will make one of them enormous, and call it the **reservoir** (also known as a **heat bath**).  It
is so large that you can take quite a lot of energy out of it and yet it
can remain at essentially the same temperature.  The other system is small and will be known as the **system**.

We will assume that for each allowed energy of the system there is only
a single microstate, and therefore the system always has a value of $\Omega$ equal to one. Once again, we fix the total energy of the system plus reservoir to be $E$. The energy of the reservoir is taken to be $E-\epsilon$ while the energy of the system is taken to be $\epsilon$. This situation of a system in
thermal contact with a large reservoir is very important and is known
as the **canonical ensemble**.

The probability $P(\epsilon)$  that the system has energy $\epsilon$ is proportional to
the number of microstates that are accessible to the reservoir multiplied
by the number of microstates that are accessible to the system.

$$
P(\epsilon) \propto \Omega(E-\epsilon) \times 1
$$

Since we have an expression for temperature in terms of the logarithm
of $\Omega$, and since $\epsilon \ll E$, we can perform a Taylor expansion of $\ln \Omega(E-\epsilon)$ around $\epsilon=0$, so that

$$
\ln \Omega(E-\epsilon) = \ln \Omega(E) - \frac{d \ln \Omega(E)}{dE} \epsilon + \cdots
$$

and so we have

$$
\ln \Omega(E-\epsilon) = \ln \Omega(E) - \frac{\epsilon}{k_B T} + \cdots
$$

where $T$ is the temperature of the reservoir. In fact, we can neglect
the further terms in the Taylor expansion and hence this equation becomes

$$
\Omega(E-\epsilon) = \Omega(E)e^{-\epsilon/k_B T}
$$

We thus arrive at the following result for the probability
distribution describing the system, which is given by

$$
P(\epsilon) \propto e^{-\epsilon/k_B T}
$$

Since the system is now in equilibrium with the reservoir, it must also have the same temperature as the reservoir. But notice that although the system therefore has fixed temperature $T$, its energy $\epsilon$ is not a constant but is governed by the probability distribution in this equation. This is known as the **Boltzmann distribution** and also as the **canonical distribution**. The term $e^{-\epsilon/k_B T}$ is known as a **Boltzmann factor**.

If a system is in contact with a reservoir and has a microstate $r$ with energy $E_r$, then

$$
P(\text{microstate } r) = \frac{e^{-E_r /k_B T}}{\sum_i e^{-E_i /k_B T}}
$$

where the sum in the denominator makes sure that the probability is
normalized. The sum in the denominator is called the **partition function** and is given the symbol $Z$.

We have derived the Boltzmann distribution on the basis of statistical arguments that show that this distribution of energy maximizes the number of microstates.
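The normalized Boltzmann distribution is straightforward to evaluate. The following is a minimal sketch for a system with three illustrative energy levels at room temperature; the function name and the level spacing are assumptions chosen for the example.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def boltzmann_probabilities(energies, temperature):
    """Return P(r) = exp(-E_r / k_B T) / Z for each microstate energy E_r (in joules)."""
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shifting all energies leaves the probabilities unchanged
    factors = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(factors)  # partition function for the shifted energies
    return [f / z for f in factors]

T = 300.0
energies = [0.0, K_B * T, 2 * K_B * T]  # illustrative levels spaced by k_B T
probs = boltzmann_probabilities(energies, T)
print(probs)
```

Subtracting the lowest energy before exponentiating is a numerical convenience: it multiplies every Boltzmann factor and $Z$ by the same constant, so the probabilities are the same.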

### The second law of thermodynamics
- No process is possible whose sole result is the transfer of heat from
a colder to a hotter body. (Clausius’ statement)
- No process is possible whose sole result is the complete conversion of heat into work. (Kelvin’s statement)
- Of all the heat engines working between two given temperatures,
none is more efficient than a Carnot engine. (Carnot’s theorem)
- For any closed cycle, $\displaystyle \oint \frac{\delta Q}{T} \le 0$ where equality necessarily holds for a reversible cycle. (Clausius' theorem)

#### The thermodynamic definition of entropy

According to Clausius' theorem, for a reversible cycle

$$
\oint \frac{\delta Q_{\text{rev}}}{T} = 0
$$

so the integral $\displaystyle \int_{A}^{B} \frac{\delta Q_{\text{rev}}}{T}$ is path independent. Therefore the quantity $\displaystyle \frac{\delta Q_{\text{rev}}}{T}$ is an exact differential and we can write down a new state function which we call entropy. We define the **entropy** $S$ by

$$
dS = \frac{\delta Q_{\text{rev}}}{T}
$$

so that

$$
S(B) - S(A) = \int_{A}^{B} \frac{\delta Q_{\text{rev}}}{T}
$$

Consider a loop which contains an irreversible section (A→B)
and a reversible section (B→A). The Clausius inequality implies that, integrating around this loop, we have
that

$$
\oint \frac{\delta Q}{T} \le 0
$$

$$
\int_{A}^{B} \frac{\delta Q}{T} + \int_{B}^{A} \frac{\delta Q_{\text{rev}}}{T} \le 0
$$

$$
\int_{A}^{B} \frac{\delta Q}{T} \le \int_{A}^{B} \frac{\delta Q_{\text{rev}}}{T}
$$

This is true however close A and B get to each other, so in general we
can write that the change in entropy $dS$ is given by

$$
dS = \frac{\delta Q_{\text{rev}}}{T} \ge \frac{\delta Q}{T}
$$

Consider a thermally isolated system. In such a system $\delta Q = 0$ for any process, so that the above inequality becomes

$$
dS \ge 0
$$

This is a very important equation and is, in fact, another statement of
the second law of thermodynamics. It shows that any change for this
thermally isolated system always results in the entropy either staying the same (for a reversible change) or increasing (for an irreversible change). This gives us yet another statement of the second law, namely that: **"the
entropy of an isolated system tends to a maximum."**

### The first law of thermodynamics

- Energy is conserved and heat and work are both forms of energy.

The first law is given by
$$
dU = \delta Q + \delta W
$$

- $\delta Q$ is the heat supplied to the system. (positive for heat supplied to the system, negative for heat extracted from the system)
- $\delta W$ is the work done on the system. (positive for work done on the system, negative for work done by the system)

For a reversible change only, we have that $\delta Q = T dS$ and $\delta W = -p dV$.

Combining these, we find that

$$
dU = T dS - p dV
$$

Constructing this equation, we stress, has assumed that the change is reversible. However, since all the quantities in this equation are functions of state, and are therefore path independent, this equation holds for irreversible processes as well!

For an irreversible change, $\delta Q \le T dS$ and also $\delta W \ge -p dV$, but with $\delta Q$ being smaller than for the reversible case and $\delta W$ being larger than for the reversible case so that $dU$ is the same whether the change is reversible or irreversible.

Therefore, we always have that
$$
dU = TdS - pdV
$$

#### The Joule expansion

An ideal gas (pressure $p_i$, temperature $T_i$) is confined to the left-hand side of a thermally isolated container and occupies a volume $V_0$. The right-hand side of the container (also volume $V_0$) is evacuated. The tap between the two parts of the container is then suddenly opened and the gas fills the entire container of volume $2V_0$ (and has new temperature $T_f$ and pressure $p_f$). Both containers are assumed to be thermally isolated from their surroundings.

For the initial state, the ideal gas law implies that

$$
p_i V_0 = Nk_B T_i
$$

and for the final state that

$$
p_f (2V_0) = Nk_B T_f
$$

Since the system is thermally isolated from its surroundings and the gas does no work when expanding into a vacuum, $\Delta U = 0$. Also, since $U$ is only a function of $T$ for an ideal gas, $\Delta T=0$ and hence $T_i=T_f$. This implies that $p_i V_0 = p_f (2 V_0)$, so that the pressure halves, i.e.,

$$
p_f = \frac{p_i}{2}
$$

It is hard to calculate directly the change of entropy of a gas in a
Joule expansion along the route that it takes from its initial state to
the final state. The pressure and volume of the system are undefined
during the process immediately after the partition is removed since the
gas is in a non-equilibrium state.

However, entropy is a function of state and therefore for the purposes of the calculation, we can take another route from the initial state to the final state since changes of functions of state are independent of the route taken.

Let us calculate the change in entropy for a reversible isothermal expansion of the gas from volume $V_0$ to volume $2V_0$.

Since the internal energy is constant in the isothermal expansion of an ideal gas, $dU = 0$, and hence $TdS = pdV$, so that

$$
\Delta S = \int_{i}^{f} dS = \int_{V_0}^{2 V_0} \frac{p dV}{T} = \int_{V_0}^{2 V_0} \frac{Nk_B dV}{V} = Nk_B \ln 2.
$$
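The integral can also be evaluated numerically as a sanity check. The sketch below assumes one mole of gas and an illustrative initial volume, and integrates $dS = N k_B \, dV / V$ along the reversible isothermal path with the midpoint rule; the function name and step count are choices made for the example.

```python
import math

K_B = 1.380649e-23    # Boltzmann constant in J/K
N_A = 6.02214076e23   # number of particles in one mole

def entropy_change_isothermal(n_particles, v_initial, v_final, steps=100_000):
    """Midpoint-rule integration of dS = N k_B dV / V along a reversible isothermal path."""
    dv = (v_final - v_initial) / steps
    total = 0.0
    for i in range(steps):
        v_mid = v_initial + (i + 0.5) * dv
        total += n_particles * K_B / v_mid * dv
    return total

V0 = 1.0e-3  # initial volume in m^3 (illustrative)
delta_s_numeric = entropy_change_isothermal(N_A, V0, 2 * V0)
delta_s_exact = N_A * K_B * math.log(2)
print(delta_s_numeric, delta_s_exact)  # both in J/K
```

The result depends only on the ratio of final to initial volume, not on $V_0$ itself, as the closed-form answer $N k_B \ln 2$ suggests.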

### Boltzmann's entropy

The first law $dU = TdS - pdV$ implies that

$$
T = \left( \frac{\partial U}{\partial S} \right)_{V}
$$

or equivalently

$$
\frac{1}{T} = \left( \frac{\partial S}{\partial U} \right)_{V}
$$

We know that

$$
\frac{1}{k_B T} = \frac{d \ln \Omega}{d E}
$$

Comparing these last two equations motivates the identification of $S$ with $k_B \ln \Omega$, i.e.,

$$
S = k_B \ln \Omega
$$

This is the expression for the entropy of a system that is in a particular macrostate in terms of $\Omega$, the number of microstates associated
with that macrostate. We are assuming that the system is in a particular macrostate with fixed energy, and this situation is known as the microcanonical ensemble.

### The Joule expansion

Following a Joule expansion, each molecule can be either on the left-hand side or the right-hand side of the container. For each molecule there are therefore two ways of placing it.

For $N$ particles, there are $2^N$ ways of placing them. The number of microstates associated with the gas being in a container twice as big as the initial volume is larger by a multiplicative factor $2^N$, so that the additional entropy is

$$
\Delta S = k_B \ln 2^N =  N k_B \ln2
$$

### Gibbs' entropy

Suppose that a system can have $N$ different, equally likely microstates. These microstates are divided into various groups (we will call these groups macrostates) with $n_i$ microstates contained in the $i$th macrostate. We must have that
the sum of all the microstates in each macrostate is equal to the total
number of microstates, so that

$$
\sum_i n_i = N
$$

The probability $P_i$ of finding the system in the $i$th macrostate is then given by

$$
P_i = \frac{n_i}{N}
$$

The total entropy is of course $S_{\text{tot}} = k_B \ln N$, though we can’t measure that directly
(having no information about the microstates which is easily accessible). Nevertheless, $S_{\text{tot}}$ is equal to the sum of the entropy associated with the freedom of being able to be in different macrostates, which is our measured entropy $S$, and the entropy $S_{\text{micro}}$ associated with it being able to
be in different microstates within a macrostate.

$$
S_{\text{tot}} = S + S_{\text{micro}}
$$

 The entropy associated with being able
to be in different microstates (the aspect we can’t measure) is given by

$$
S_{\text{micro}} = \langle S_i \rangle = \sum_i P_i S_i
$$

where $S_i = k_B \ln n_i$ is the entropy of the microstates in the $i$th macrostate and $P_i$ is the probability of a particular macrostate being occupied. Hence

$$
\begin{align}
S & = S_\text{tot} - S_\text{micro} \\
& = k_B \left( \ln N - \sum_i P_i \ln n_i \right) \\
& = k_B \sum_i P_i (\ln N - \ln n_i)
\end{align}
$$

and using $\ln N - \ln n_i = - \ln (n_i / N) = - \ln P_i$ yields **Gibbs' expression for the entropy**:

$$
S = -k_B \sum_i P_i \ln P_i
$$
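The decomposition $S_{\text{tot}} = S + S_{\text{micro}}$ can be checked with a small example. The sketch below assumes $N = 10$ equally likely microstates grouped into three macrostates of made-up sizes, and works in units where $k_B = 1$.

```python
import math

# Illustrative grouping: N = 10 equally likely microstates in macrostates of sizes n_i.
n = [1, 3, 6]
N = sum(n)
P = [n_i / N for n_i in n]   # probability of each macrostate

# All entropies are in units of k_B.
s_tot = math.log(N)                                            # ln N
s_micro = sum(p_i * math.log(n_i) for p_i, n_i in zip(P, n))   # sum_i P_i ln n_i
s_gibbs = -sum(p_i * math.log(p_i) for p_i in P)               # -sum_i P_i ln P_i

print(s_tot, s_gibbs + s_micro)
```

The two printed values agree up to rounding, so the Gibbs expression accounts for exactly the part of the total entropy that is not hidden inside the macrostates.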

### Gibbs' entropy to Boltzmann's entropy

Find the entropy for a system with $\Omega$ macrostates, each with probability $P_i = 1/\Omega$ (i.e., assuming the microcanonical ensemble).

Using Gibbs' entropy, substituting $P_i = 1/\Omega$ yields

$$
S = -k_B \sum_i P_i \ln P_i = -k_B \sum_{i=1}^{\Omega}\frac{1}{\Omega}\ln \frac{1}{\Omega} = -k_B \ln \frac{1}{\Omega} = k_B \ln \Omega
$$

### Gibbs' entropy to Boltzmann probability
Maximize $S=-k_B \sum_i P_i \ln P_i$ subject to the constraints that $\sum_i P_i = 1$ and $\sum_i P_i E_i = U$.

Use the method of Lagrange multipliers, in which we maximize

$$
\frac{S}{k_B} - \alpha \times (\text{constraint 1}) - \beta \times (\text{constraint 2})
$$

where $\alpha$ and $\beta$ are Lagrange multipliers. Thus we vary this expression with respect to one of the probabilities $P_j$ and get

$$
\frac{\partial }{\partial P_j} \left( \sum_{i} -P_i \ln P_i - \alpha P_i - \beta P_i E_i \right) = 0
$$

so that

$$
-\ln P_j - 1 - \alpha - \beta E_j = 0
$$

This can be rearranged to give

$$
P_j = \frac{e^{-\beta E_j}}{Z}
$$

with $Z = e^{1+\alpha}$.

This is the expression for the Boltzmann probability.
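The maximization result can be probed numerically. The sketch below assumes three illustrative energy levels and $\beta = 1$, builds the Boltzmann distribution, and then perturbs it along a direction that preserves both constraints ($\sum_i P_i = 1$ and $\sum_i P_i E_i = U$). If the Boltzmann distribution maximizes the entropy, every such perturbation should lower it.

```python
import math

def entropy(probs):
    """S / k_B = -sum_i P_i ln P_i."""
    return -sum(p * math.log(p) for p in probs if p > 0)

energies = [0.0, 1.0, 2.0]   # illustrative levels, in units where beta = 1
beta = 1.0
weights = [math.exp(-beta * e) for e in energies]
Z = sum(weights)
boltzmann = [w / Z for w in weights]

# For these levels, this direction changes neither sum_i P_i nor sum_i P_i E_i.
direction = [1.0, -2.0, 1.0]

def perturbed(t):
    """Move a distance t away from the Boltzmann distribution while keeping both constraints."""
    return [p + t * d for p, d in zip(boltzmann, direction)]

for t in (-0.05, -0.01, 0.01, 0.05):
    print(t, entropy(perturbed(t)), entropy(boltzmann))
```

With three levels and two constraints there is only one free direction, so this scan covers every admissible perturbation for this particular example.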