### Derivation of Least Squares

# Softmax Exercise

Recall that the formula for Softmax is:

$$\text{Prob}(i) = \dfrac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}$$

$$\log\text{Prob}(i) = z_i - \log \sum_{j=1}^{N} \exp(z_j)$$

Consider a classification task with three classes 1, 2, 3. Suppose a particular input is presented, producing outputs:

$$z_1 = 1$$
$$z_2 = 2$$
$$z_3 = 3$$

and that the correct class is 2.

### Question 1
Compute each of the following to two decimal places:

- Prob(1)
- Prob(2)
- Prob(3)

In [None]:
import torch as T

In [None]:
device = T.device("cpu")

In [None]:
t1 = T.tensor([1.0, 2.0, 3.0], dtype=T.float32).to(device)
sm = T.nn.functional.softmax(t1, dim=0)
# lsm = T.nn.functional.log_softmax(t1, dim=0)
# l_sm = T.log(T.nn.functional.softmax(t1, dim=0))

In [None]:
T.set_printoptions(precision=4)
print("tensor t1        = ", end=""); print(t1)
print("softmax(t1)      = ", end=""); print(sm)
# print("log_softmax(t1)  = ", end=""); print(lsm)
# print("log(softmax(t1)) = ", end=""); print(l_sm)
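As a cross-check on the PyTorch result, the same probabilities can be computed directly from the Softmax formula using only the standard library. This is a minimal sketch; the helper name `softmax` and the max-subtraction step (a common numerical-stability trick that does not change the result) are additions for illustration, not part of the original exercise.

```python
import math

def softmax(zs):
    """Return softmax probabilities for a list of outputs z_j."""
    m = max(zs)  # subtracting the max leaves the ratios unchanged
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 2) for p in probs])  # [0.09, 0.24, 0.67]
```

The three values sum to 1, and the largest output $z_3 = 3$ receives the largest probability.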

### Question 2
Compute each of the following, to two decimal places:

$\dfrac{d \log \text{Prob}(2)}{dz_1}$

$\dfrac{d \log \text{Prob}(2)}{dz_2}$

$\dfrac{d \log \text{Prob}(2)}{dz_3}$

In [31]:
from sympy import *

In [32]:
z = 2.718  # approximation of e = exp(1), so z**k approximates exp(k)

In [33]:
expr = -z / (z + z**2 + z**3)
expr  # d(log Prob(2))/$dz_1$

-0.09004527836777446

In [34]:
expr = 1 - (z**2 / (z + z**2 + z**3))
expr  # d(log Prob(2))/$dz_2$

0.755256933396389

In [35]:
expr = -z**3 / (z + z**2 + z**3)
expr  # d(log Prob(2))/$dz_3$

-0.6652116550286146
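The three expressions above are instances of the general result $\dfrac{d \log \text{Prob}(k)}{dz_i} = \delta_{ik} - \text{Prob}(i)$, which follows from differentiating $\log \text{Prob}(k) = z_k - \log \sum_j \exp(z_j)$. The sketch below evaluates that formula with exact exponentials and compares it against central finite differences. The function names and the step size `h` are illustrative choices.

```python
import math

def log_prob(zs, k):
    """log Prob(k) = z_k - log(sum_j exp(z_j)); k is a 0-based index."""
    return zs[k] - math.log(sum(math.exp(z) for z in zs))

def grad_log_prob(zs, k):
    """Analytic gradient: d log Prob(k) / d z_i = [i == k] - Prob(i)."""
    total = sum(math.exp(z) for z in zs)
    return [(1.0 if i == k else 0.0) - math.exp(z) / total
            for i, z in enumerate(zs)]

zs = [1.0, 2.0, 3.0]
k = 1  # class 2 in the exercise's 1-based numbering
analytic = grad_log_prob(zs, k)

# Central finite differences as an independent numerical check.
h = 1e-6
numeric = []
for i in range(len(zs)):
    up = zs.copy()
    up[i] += h
    down = zs.copy()
    down[i] -= h
    numeric.append((log_prob(up, k) - log_prob(down, k)) / (2 * h))

print([round(g, 2) for g in analytic])  # [-0.09, 0.76, -0.67]
```

The derivatives sum to zero, as expected: adding the same constant to every $z_j$ leaves the Softmax probabilities unchanged.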

### Question 3
Consider a degenerate case of supervised learning where the training set consists of just a single
input, repeated 100 times. 

In 80 of the 100 cases, the target output value is 1; in the other 20, it is 0.

What will a back-propagation neural network predict for this example, assuming that it has been
trained and reaches a global minimum? Does it make a difference whether the loss function is sum
squared error or cross entropy?

(**Hint**: to find the global minimum, differentiate the loss function and
set the derivative to zero.)

Calculate the output that minimises the SSE and the Cross Entropy.

### 2-3a: Sum of Squared Errors
$E = \dfrac{1}{2} \sum_{i}(t_i - z_i)^2$

In [None]:
from sympy import *

In [None]:
z = symbols('z')

In [None]:
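expr = Rational(1, 2) * (80 * (1 - z)**2 + 20 * (0 - z)**2)  # SSE: 80 targets of 1, 20 targets of 0
expr

In [None]:
expr_1 = diff(expr, z)
expr_1

In [None]:
solve(expr_1)  # this is equivalent to 0.8

### 2-3b: Cross Entropy
$E = \sum_{i} (-t_i\log(z_i) - (1-t_i)\log(1-z_i))$

In [None]:
expr = -80 * log(z) - 20 * log(1-z)  # cross entropy for the same 100 cases
expr_2 = diff(expr, z)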
expr_2

In [None]:
solve(expr_2)  # this is equivalent to 0.8
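Setting the derivative to zero gives $100z - 80 = 0$ for sum squared error and $-\dfrac{80}{z} + \dfrac{20}{1-z} = 0$ for cross entropy, so both losses are minimised at $z = 0.8$. A brute-force check over a grid of candidate outputs, sketched below with only the standard library, agrees. The grid resolution and the function names are arbitrary illustrative choices.

```python
import math

def sse(z, n_one=80, n_zero=20):
    """Sum squared error for n_one targets of 1 and n_zero targets of 0."""
    return 0.5 * (n_one * (1 - z) ** 2 + n_zero * (0 - z) ** 2)

def cross_entropy(z, n_one=80, n_zero=20):
    """Cross entropy for the same 100 training cases."""
    return -n_one * math.log(z) - n_zero * math.log(1 - z)

# Candidate outputs strictly inside (0, 1), so both logs are defined.
grid = [k / 1000 for k in range(1, 1000)]
best_sse = min(grid, key=sse)
best_ce = min(grid, key=cross_entropy)
print(best_sse, best_ce)  # 0.8 0.8
```

So the choice of loss function makes no difference here: in both cases the network predicts the proportion of cases whose target is 1.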

### Question 1
Explain the difference between the following paradigms, in terms of what is presented to the system, and what it aims to achieve:

- Supervised learning
- Reinforcement Learning
- Unsupervised Learning

### Answer:
**Supervised Learning**: The system is presented with training items consisting of an input and a target output. The aim is to predict the output given the input (for the training set as well as an unseen test set).

**Reinforcement Learning**: The system chooses actions in a simulated environment, observing its state and receiving rewards along the way. The aim is to maximise the cumulative reward.

**Unsupervised Learning**: The system is presented with training items consisting of only an input (no target value). The aim is to extract hidden features or other structure from these data.

### Question 2
Explain what is meant by Overfitting in neural networks, and list four different methods for avoiding it.

### Answer
Overfitting is where the training set error continues to reduce but the test set error stalls or increases. This can be avoided by:
- limiting the number of neurons or connections in the network
- early stopping, with a validation set
- dropout
- weight decay (this can avoid overfitting by limiting the size of the weights)
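Of the methods listed, early stopping is the easiest to illustrate in a few lines. The sketch below assumes a hypothetical list of per-epoch validation losses and a `patience` parameter (how many non-improving epochs to tolerate). Both are made up for illustration and are not taken from any particular training run.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch

# Made-up validation losses: improving at first, then rising (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(val_losses))  # 3
```

The weights saved at epoch 3 are kept, even though the training error would typically keep falling in later epochs.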

### Question 3
Explain how Dropout is used for neural networks, in both the training and testing phase.

### Answer
During each minibatch of training, a fixed percentage (usually one half) of nodes are chosen to be inactive. In the testing phase, all nodes are active but the activation of each node is multiplied by the fraction of nodes that were kept active during training (one half in the usual case, where this equals the dropout percentage).
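The two phases can be sketched on a plain list of activations as follows. The function names and the `drop_fraction` parameter are hypothetical. This version deactivates an exact fraction of nodes, as described above, whereas many library implementations instead drop each node independently at random.

```python
import random

def dropout_train(activations, drop_fraction=0.5, rng=random):
    """Training: set a fixed fraction of randomly chosen nodes to zero."""
    n_drop = round(len(activations) * drop_fraction)
    dropped = set(rng.sample(range(len(activations)), n_drop))
    return [0.0 if i in dropped else a for i, a in enumerate(activations)]

def dropout_test(activations, drop_fraction=0.5):
    """Testing: keep every node, scaled by the fraction kept in training."""
    keep_fraction = 1.0 - drop_fraction
    return [a * keep_fraction for a in activations]

rng = random.Random(0)  # seeded only to make the sketch repeatable
acts = [1.0, 2.0, 3.0, 4.0]
train_out = dropout_train(acts, rng=rng)
test_out = dropout_test(acts)
print(train_out)
print(test_out)  # [0.5, 1.0, 1.5, 2.0]
```

The scaling at test time keeps the expected input to the next layer the same as it was during training.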

### Question 4
Write the formulas for these Loss functions: Squared Error, Cross Entropy, Softmax, Weight Decay (remember to define any variables you use)

### Answer:
Assume $z_i$ is the actual output, $t_i$ is the target output and $w_j$ are the weights.

Squared error: $E = \dfrac{1}{2} \sum_i (z_i - t_i)^2$

Cross Entropy: $E = \sum_i (-t_i \log z_i - (1 - t_i) \log(1-z_i))$

Softmax: $E = -( z_i - \log \sum_{j=1}^{N} \exp(z_j))$, where $i$ is the correct class

Weight Decay: $E = \dfrac{1}{2} \sum_j w_j^2$
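The four formulas translate directly into code. The sketch below uses plain Python lists. The function names and the sample values are illustrative, and `i` in `softmax_loss` is the 0-based index of the correct class.

```python
import math

def squared_error(z, t):
    """E = 1/2 * sum_i (z_i - t_i)^2"""
    return 0.5 * sum((zi - ti) ** 2 for zi, ti in zip(z, t))

def cross_entropy(z, t):
    """E = sum_i (-t_i log z_i - (1 - t_i) log(1 - z_i)), with 0 < z_i < 1"""
    return sum(-ti * math.log(zi) - (1 - ti) * math.log(1 - zi)
               for zi, ti in zip(z, t))

def softmax_loss(z, i):
    """E = -(z_i - log sum_j exp(z_j)) for correct class i"""
    return -(z[i] - math.log(sum(math.exp(zj) for zj in z)))

def weight_decay(w):
    """E = 1/2 * sum_j w_j^2"""
    return 0.5 * sum(wj ** 2 for wj in w)

print(round(squared_error([0.8, 0.3], [1, 0]), 3))  # 0.065
print(round(cross_entropy([0.8, 0.3], [1, 0]), 3))  # 0.58
print(round(softmax_loss([1.0, 2.0, 3.0], 1), 3))   # 1.408
print(round(weight_decay([1.0, -2.0]), 3))          # 2.5
```

In practice the weight decay term is usually multiplied by a small coefficient and added to one of the other losses.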

### Question 5
In the context of supervised learning, explain the difference between Maximum Likelihood Estimation (MLE) and Bayesian inference.

### Answer
In MLE, the hypothesis $h \in H$ is chosen which maximises the conditional probability $P(D|h)$ of the observed data $D$ conditioned on $h$.

In Bayesian inference, the hypothesis $h \in H$ is chosen which maximises $P(D|h)P(h)$, where $P(h)$ is the prior probability of $h$.
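A toy example makes the difference concrete. Suppose each hypothesis $h$ is a candidate probability that the target is 1, and the data $D$ contain 8 targets of 1 and 2 targets of 0. The hypothesis set, the prior and the data below are all invented for illustration.

```python
def likelihood(h, n_one, n_zero):
    """P(D|h) for independent binary targets with P(target = 1) = h."""
    return h ** n_one * (1 - h) ** n_zero

hypotheses = [0.5, 0.7, 0.8, 0.9]
prior = {0.5: 0.1, 0.7: 0.6, 0.8: 0.2, 0.9: 0.1}  # made-up prior P(h)
n_one, n_zero = 8, 2

# MLE: maximise P(D|h) alone.
mle = max(hypotheses, key=lambda h: likelihood(h, n_one, n_zero))

# Bayesian choice as defined above: maximise P(D|h) * P(h).
bayes = max(hypotheses, key=lambda h: likelihood(h, n_one, n_zero) * prior[h])

print(mle, bayes)  # 0.8 0.7
```

MLE picks the hypothesis that best matches the observed frequency, while the prior pulls the Bayesian choice towards $h = 0.7$. With a uniform prior the two choices coincide.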

### Question 6
Briefly explain the concept of Momentum, as an enhancement for Gradient Descent.

### Answer

A running average of the derivatives for each weight is maintained and used to update the weights as follows:

$$\delta w = \alpha \delta w - \eta \dfrac{dE}{dw}$$

$$w = w + \delta w$$

The constant $\alpha$, with $0 \leq \alpha < 1$, is called the momentum, and $\eta$ is the learning rate.
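The update rule can be sketched for a single weight as follows. The quadratic error $E(w) = (w - 3)^2$, the hyperparameter values and the function name are assumptions chosen only to give a runnable example.

```python
def gradient_descent_momentum(grad, w, eta=0.1, alpha=0.9, steps=500):
    """Minimise a one-dimensional error function using momentum.

    grad  : function returning dE/dw at w
    eta   : learning rate
    alpha : momentum constant, 0 <= alpha < 1
    """
    delta_w = 0.0
    for _ in range(steps):
        delta_w = alpha * delta_w - eta * grad(w)  # running average of gradients
        w = w + delta_w
    return w

# Example error E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3); minimum at w = 3.
w_final = gradient_descent_momentum(lambda w: 2 * (w - 3), w=0.0)
print(round(w_final, 3))  # 3.0
```

Setting `alpha=0` recovers plain gradient descent. Larger values let earlier gradients keep influencing the step, which damps sideways oscillations and speeds up progress along directions where the gradient is consistent.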

### Derivation of Least Squares

### Compute Softmax

### Compute Weight Decay

### Compute Momentum

### Code Simple PyTorch Operations

### Analyse the Geometry of Hidden Unit Activations in Neural Networks