## Weight decay
This additional penalty term is L2 regularisation of the neural network's weights. It keeps the weights from growing too large, which limits the model's capacity to overfit. At every update the penalty also shrinks each weight slightly towards zero, hence the name weight decay. We do not use an L1 regulariser (as in Lasso regression) here because we do not want to simplify the network by promoting sparsity.
$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
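
As a minimal sketch of how this penalty shows up in practice, the snippet below compares adding $\frac{\lambda}{2}\|\mathbf{w}\|^2$ to the loss by hand with passing `weight_decay` to `torch.optim.SGD`. The values of `lambd` and `lr`, the toy `data_loss`, and the variable names are assumptions made only for illustration.

```python
import torch

lambd, lr = 0.1, 0.5  # hypothetical values, chosen only for illustration

# Two copies of the same starting weights
w_manual = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
w_builtin = w_manual.detach().clone().requires_grad_(True)

def data_loss(w):
    # Stand-in for L(w, b); any differentiable loss would do
    return ((w - 1.0) ** 2).sum()

# (a) Add the penalty (lambda / 2) * ||w||^2 to the loss explicitly
opt_manual = torch.optim.SGD([w_manual], lr=lr)
opt_manual.zero_grad()
(data_loss(w_manual) + lambd / 2 * w_manual.pow(2).sum()).backward()
opt_manual.step()

# (b) Let the optimizer add lambda * w to the gradient instead
opt_builtin = torch.optim.SGD([w_builtin], lr=lr, weight_decay=lambd)
opt_builtin.zero_grad()
data_loss(w_builtin).backward()
opt_builtin.step()

# Both routes should produce the same updated weights
same_update = torch.allclose(w_manual, w_builtin)
```

With plain SGD the two routes coincide because the gradient of the penalty is simply $\lambda \mathbf{w}$.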

At the start of training we usually want large updates, and as we get closer to a minimum we want smaller ones. This can be achieved with learning rate decay.
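
One simple way to picture this is an exponential schedule, where the learning rate is multiplied by a constant factor after every step. This is only one of several possible schedules; the function name `exponential_decay` and the constants below are hypothetical.

```python
def exponential_decay(initial_lr, decay_rate, step):
    """Learning rate after `step` decay steps (one possible schedule)."""
    return initial_lr * decay_rate ** step

# Hypothetical settings: start at 0.1 and shrink by 10% per step
schedule = [exponential_decay(0.1, 0.9, t) for t in range(5)]
```

The resulting list starts at the initial learning rate and decreases monotonically, so early updates are large and later ones are small.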

In PyTorch, the parameters of a `torch.nn` layer such as `nn.Linear` (here called `network`) can be accessed with `network.weight` and `network.bias`.
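
A short sketch of this access pattern, assuming `network` is a single `nn.Linear` layer (the layer sizes are arbitrary):

```python
import torch
from torch import nn

# Hypothetical layer: 4 inputs, 2 outputs
network = nn.Linear(in_features=4, out_features=2)

weight_shape = tuple(network.weight.shape)  # (out_features, in_features)
bias_shape = tuple(network.bias.shape)      # (out_features,)

# named_parameters() lists every learnable tensor together with its name
param_names = [name for name, _ in network.named_parameters()]
```

For a model built from several layers, the same attributes are reached by first selecting the layer, for example by indexing into an `nn.Sequential`.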

## Softmax regression
In classification with one-hot labels, each component of the target is either 0 or 1. Softmax makes every output nonnegative by using exponentials, and the denominator normalises the outputs so that each lies between 0 and 1 and they sum to 1.
$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \textrm{where}\quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$$

##  Information Theory Basics
The central idea in information theory is to quantify the amount of information contained in data. This places a limit on our ability to compress data.
### Entropy
For a distribution $P$, its entropy $H[P]$ is defined as:

$$H[P] = \sum_j - P(j) \log P(j)$$

If the data carries more information, for example when every single sample shows a new pattern, then the entropy is high. Low entropy means the data is more predictable and less information-rich.
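
To make the definition concrete, here is a small sketch that evaluates it for three coin-like distributions. It uses the natural logarithm (so the unit is nats) and skips zero-probability outcomes, following the usual convention $0 \log 0 = 0$; the function name `entropy` is chosen here for illustration.

```python
import math

def entropy(p):
    """H[P] = sum_j -P(j) log P(j), skipping outcomes with P(j) = 0."""
    return sum(-pj * math.log(pj) for pj in p if pj > 0)

fair = entropy([0.5, 0.5])     # least predictable two-outcome case, log 2
biased = entropy([0.9, 0.1])   # more predictable, lower entropy
certain = entropy([1.0, 0.0])  # fully predictable, zero entropy
```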

###  Surprisal
This is closely related to entropy. A high-probability outcome has low surprisal and vice versa. For example, when flipping a fair coin, getting heads in 90% of the flips has high surprisal. Mathematically, it is measured as
$$\text{Surprisal} = \log \frac{1}{P(j)} = -\log P(j)$$
where $j$ is the outcome of an event.
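
The coin example can be worked through numerically. The sketch below assumes 10 flips of a fair coin and compares the surprisal of seeing exactly 5 heads with that of seeing exactly 9 heads; the helper names `surprisal` and `prob_heads` are hypothetical.

```python
import math

def surprisal(p):
    """Surprisal -log P(j) of an outcome with probability p, in nats."""
    return -math.log(p)

def prob_heads(k, n):
    """Probability of exactly k heads in n flips of a fair coin."""
    return math.comb(n, k) / 2 ** n

typical = surprisal(prob_heads(5, 10))  # the most likely count of heads
rare = surprisal(prob_heads(9, 10))     # 90% heads from a fair coin
```

Here `rare` comes out larger than `typical`, matching the intuition that unlikely outcomes are more surprising.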


###  Cross-Entropy
Entropy is the level of surprise experienced by someone who knows the true probability. The cross-entropy *from* $P$ *to* $Q$, denoted $H(P, Q)$,
is the expected surprisal of an observer with subjective probabilities $Q$ upon seeing data that was actually generated according to probabilities $P$.
This is given by $H(P, Q) \stackrel{\textrm{def}}{=} \sum_j - P(j) \log Q(j)$. The lowest possible cross-entropy is achieved when $P=Q$. In this case, the cross-entropy from $P$ to $Q$ equals the entropy: $H(P, P) = H[P]$.

The cross-entropy classification objective can be viewed in two ways:
* maximizing the likelihood of the observed data
* minimizing our surprisal
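
Both readings can be checked on a small example. The sketch below uses made-up distributions `P` and `Q` and a made-up one-hot label; it shows that $H(P, Q)$ is larger than $H(P, P)$, and that with a one-hot label the cross-entropy reduces to the negative log-likelihood of the observed class.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = sum_j -P(j) log Q(j), skipping outcomes with P(j) = 0."""
    return sum(-pj * math.log(qj) for pj, qj in zip(p, q) if pj > 0)

# Hypothetical true distribution P and model distribution Q
P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]

h_pp = cross_entropy(P, P)  # equals the entropy of P
h_pq = cross_entropy(P, Q)  # larger, because Q differs from P

# With a one-hot label, only the observed class contributes
one_hot = [0.0, 1.0, 0.0]
nll = cross_entropy(one_hot, Q)  # equals -log Q(observed class)
```

Minimizing `nll` over the model's predictions is therefore the same as maximizing the probability the model assigns to the observed class.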

### Implementing Softmax

```python
import torch

def softmax(X):
    X_exp = torch.exp(X)
    partition = X_exp.sum(1, keepdims=True)
    return X_exp / partition  # The broadcasting mechanism is applied here
```

```python
X = torch.rand((2, 5))
X_prob = softmax(X)
X_prob, X_prob.sum(1)
```

```
(tensor([[0.2417, 0.2327, 0.1925, 0.1858, 0.1473],
         [0.2757, 0.2109, 0.1464, 0.1675, 0.1995]]),
 tensor([1., 1.]))
```

## Modified Softmax
Although softmax produces outputs between 0 and 1, computing the numerator and denominator separately can overflow because the exponential grows so quickly. Softmax can therefore be modified to shrink the individual values in the numerator and denominator:

$$\hat y_j = \frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})}$$
where $\bar{o} = \max_k o_k$, i.e. the maximum over all entries for that specific sample. By construction, $o_j - \bar{o} \leq 0$ for every $j$, so no exponential exceeds 1.
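
A minimal sketch of this modification, assuming 2-D inputs with one sample per row. The function name `stable_softmax` and the example logits are chosen for illustration.

```python
import torch

def stable_softmax(X):
    # Subtract each row's maximum so every exponent is <= 0
    X_max = X.max(dim=1, keepdim=True).values
    X_exp = torch.exp(X - X_max)
    return X_exp / X_exp.sum(dim=1, keepdim=True)

# Hypothetical logits that are large enough to overflow exp in float32
X = torch.tensor([[1000.0, 1001.0, 1002.0]])

# Naive version: exp overflows to inf, and inf / inf gives nan
naive = torch.exp(X) / torch.exp(X).sum(dim=1, keepdim=True)

# Shifted version: finite probabilities that sum to 1
stable = stable_softmax(X)
```

Subtracting the maximum does not change the result mathematically, since the common factor $\exp(-\bar{o})$ cancels between numerator and denominator.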

### RealSoftMax
In modified softmax, $o_j - \bar{o}$ can still be a very large negative number, so $\exp(o_j - \bar{o})$ underflows to zero and $\log \hat{y}_j$ becomes $-\infty$. To address this, we use LogSumExp (LSE), also known as RealSoftMax or multivariable softplus. In essence, we combine the computation of softmax and cross-entropy. The essential component of cross-entropy, $\log\hat{y}$, can be computed directly as

$$\log \hat{y}_j =
\log \frac{\exp(o_j - \bar{o})}{\sum_k \exp (o_k - \bar{o})} =
o_j - \bar{o} - \log \sum_k \exp (o_k - \bar{o}).$$

Here we used basic logarithm manipulation to compute $\log\hat{y}$ without separately calculating the softmax output probabilities.
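
The difference can be seen on a small example. The sketch below compares taking the log of the softmax output with evaluating the formula above directly; the function name `log_softmax` and the example logits are assumptions made for illustration.

```python
import torch

def log_softmax(O):
    # log y_j = o_j - o_max - log sum_k exp(o_k - o_max)
    O_max = O.max(dim=1, keepdim=True).values
    shifted = O - O_max
    return shifted - torch.log(torch.exp(shifted).sum(dim=1, keepdim=True))

# Hypothetical logits with one entry far below the maximum
O = torch.tensor([[0.0, -200.0, 5.0]])

# Two-step route: the small probability underflows to 0, so its log is -inf
two_step = torch.log(torch.softmax(O, dim=1))

# Direct route: every entry stays finite
direct = log_softmax(O)
```

PyTorch provides the same combined computation as `torch.log_softmax`, which is usually preferable to a hand-written version.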

## Distribution shift
Sometimes models appear to perform marvelously as measured by test set accuracy but fail catastrophically in deployment when the distribution of data suddenly shifts. Say we trained a model that predicts a student's exam marks from the distance between their home and the school. Such a correlation may hold in the training data and still break once the data distribution changes. In cases like this, we need to step outside the realm of purely statistical prediction.

### Types of Distribution Shift
We assume that our training data was sampled from some distribution $p_S(\mathbf{x},y)$ and the test data was sampled from some distribution $p_T(\mathbf{x},y)$. Without any assumption about how the two distributions relate, there is no way to build a robust classifier. For example, if we train a model to recognise the season from outdoor photos and it only ever sees photos taken in summer, it will perform poorly on photos taken in winter.

#### 1. Covariate Shift
