# Various ML questions

### Why use Cross Entropy over, say, Mean Squared Error for a Classification problem?

https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/

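The linked post works through an example (reproduced below) in which two networks have the same classification error even though one is clearly closer to the targets. Cross-entropy and MSE both tell them apart, but MSE, compared with cross-entropy, puts more emphasis on the outputs for the incorrect classes.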

Cross-entropy is used when node activations can be understood as representing the probability that each hypothesis might be true, i.e. when the output is a probability distribution. It is commonly used to quantify the difference between two probability distributions.

The `ln()` function in cross-entropy takes into account the closeness of a prediction and is a more granular way to compute error.

| computed | targets | correct? |
| --- | --- | --- |
| 0.3  0.3  0.4 | 0  0  1 (democrat) | yes |
| 0.3  0.4  0.3 | 0  1  0 (republican) | yes |
| 0.1  0.2  0.7 | 1  0  0 (other) | no |

| computed | targets | correct? |
| --- | --- | --- |
| 0.1  0.2  0.7 | 0  0  1 (democrat) | yes |
| 0.1  0.7  0.2 | 0  1  0 (republican) | yes |
| 0.3  0.4  0.3 | 1  0  0 (other) | no |
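The three error measures can be checked on the two sets of predictions above. This is a minimal sketch in plain Python; the function names and the `first`/`second` labels are illustrative assumptions, and MSE is taken here as the squared error summed over classes and averaged over rows.

```python
import math

# Predicted probabilities and one-hot targets copied from the two tables above.
targets = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
first = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
second = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]


def classification_error(preds, targets):
    """Fraction of rows whose arg-max does not match the target class."""
    wrong = sum(p.index(max(p)) != t.index(1) for p, t in zip(preds, targets))
    return wrong / len(preds)


def mean_cross_entropy(preds, targets):
    """Average of -ln(probability assigned to the true class)."""
    return sum(-math.log(p[t.index(1)]) for p, t in zip(preds, targets)) / len(preds)


def mean_squared_error(preds, targets):
    """Squared error summed over classes, averaged over rows."""
    per_row = [sum((pi - ti) ** 2 for pi, ti in zip(p, t)) for p, t in zip(preds, targets)]
    return sum(per_row) / len(per_row)


for name, preds in [("first", first), ("second", second)]:
    print(
        name,
        round(classification_error(preds, targets), 3),
        round(mean_cross_entropy(preds, targets), 3),
        round(mean_squared_error(preds, targets), 3),
    )
```

Both sets have the same classification error (1/3), while the mean cross-entropy is roughly 1.38 versus 0.64 and the MSE roughly 0.81 versus 0.34, so only the last two measures register that the second set is closer to the targets.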

### Why use softmax for classification instead of, say, standard normalization?

![Softmax Formula](https://jamesmccaffrey.files.wordpress.com/2016/03/softmaxequation.jpg?w=279&h=&zoom=2)

The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied.
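Written out, with notation assumed here rather than taken from the source ($z$ for the vector of raw class scores, $K$ for the number of classes, $y$ for a one-hot target), the softmax output and the cross-entropy loss applied to it are:

$$
p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad L = -\sum_{i=1}^{K} y_i \ln p_i
$$

Each $p_i$ is positive and the $p_i$ sum to one, which is what lets $\ln p_i$ be read as a log-probability.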

An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family
https://arxiv.org/pdf/1511.05042.pdf

> Our experiments showed that for several low dimensional problems, the log-softmax is surprisingly
outperformed by certain losses of the spherical family, in particular the log-Taylor softmax. On the
other hand, in higher dimensional problems, the log-softmax yields better results. The reasons of
this qualitative shift remain unclear and further research should be carried out to understand it.

From StackOverflow, describing softmax:

> Reacts to low stimulation (think blurry image) of your neural net with rather uniform distribution and to high stimulation (ie. large numbers, think crisp image) with probabilities close to 0 and 1.

> While standard normalisation does not care as long as the proportion are the same.

> Have a look what happens when soft max has 10 times larger input, ie your neural net got a crisp image and a lot of neurones got activated

Softmax versus standard normalization on the same scores:

| input | softmax | standard normalization |
| --- | --- | --- |
| `[1, 2]` (blurry image of a ferret) | `[0.26894142, 0.73105858]` (it is a cat perhaps!?) | `[0.3333333333333333, 0.6666666666666666]` (it is a cat perhaps!?) |
| `[10, 20]` (crisp image of a cat) | `[0.0000453978687, 0.999954602]` (it is definitely a CAT!) | `[0.3333333333333333, 0.6666666666666666]` (it is a cat perhaps!?) |
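The numbers above can be reproduced with a few lines of plain Python. This is a sketch rather than the StackOverflow author's code: `std_norm` is assumed to mean "divide each score by the sum", and the max-subtraction in `softmax` is a common numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Exponentiate, then normalize; subtracting the max avoids overflow."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def std_norm(scores):
    """Divide each score by the sum (assumes non-negative scores, positive sum)."""
    total = sum(scores)
    return [s / total for s in scores]


for scores in ([1, 2], [10, 20]):
    print(scores, softmax(scores), std_norm(scores))
```

Softmax depends on the differences between scores, so scaling the inputs by 10 sharpens the distribution, while standard normalization depends only on their ratios and returns the same answer for both inputs.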


#### From the Deep Learning Book: 
> The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss.
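The saturation effect can be seen in the gradients of a single sigmoid unit. The sketch below is an illustration under stated assumptions, not something taken from the book: it uses one output `a = sigmoid(z)`, squared error `0.5 * (a - y)**2`, and binary cross-entropy, and the function names are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2; keeps the factor a * (1 - a)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)


def grad_cross_entropy(z, y):
    """d/dz of -(y*ln(a) + (1-y)*ln(1-a)) with a = sigmoid(z); a*(1-a) cancels."""
    return sigmoid(z) - y


# A confidently wrong unit: the target is 1 but the pre-activation is negative.
for z in (-2.0, -5.0, -10.0):
    print(z, grad_mse(z, 1.0), grad_cross_entropy(z, 1.0))
```

With squared error, the gradient shrinks toward zero as the unit saturates (about 4.5e-5 in magnitude at `z = -10`), so a badly wrong unit learns slowly. With cross-entropy, the gradient stays close to -1 in the same situation.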

### What are skip connections?



![Skip Connection Diagram](https://i.stack.imgur.com/UDvbg.png)

Skip connections are extra connections between nodes in different layers of a neural network that
skip one or more layers of nonlinear processing.

Their main purpose is to improve gradient flow through the network, which makes deeper networks trainable and so, in effect, increases usable capacity without increasing the number of parameters.
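A toy calculation can make the gradient-flow point concrete. This is only a sketch under simplifying assumptions: each "layer" is a scalar function `tanh(w * h)`, every layer shares the same small weight, and `forward_and_grad` is a hypothetical helper written for this note.

```python
import math


def forward_and_grad(x, weights, skip):
    """Run a stack of scalar layers; return (output, d output / d x)."""
    h, grad = x, 1.0
    for w in weights:
        t = math.tanh(w * h)
        local = w * (1.0 - t * t)  # derivative of tanh(w * h) with respect to h
        if skip:
            h = h + t              # identity skip connection: y = x + F(x)
            grad *= 1.0 + local    # the "+ 1" is the path through the skip
        else:
            h = t                  # plain layer: y = F(x)
            grad *= local
    return h, grad


weights = [0.5] * 20  # 20 layers with the same small weight (assumed for illustration)
_, plain_grad = forward_and_grad(1.0, weights, skip=False)
_, skip_grad = forward_and_grad(1.0, weights, skip=True)
print(f"plain stack gradient:    {plain_grad:.3e}")
print(f"residual stack gradient: {skip_grad:.3e}")
```

In the plain stack every factor is at most 0.5, so the gradient after 20 layers is at most 0.5^20 (about 1e-6). With identity skips every factor is at least 1, so the gradient reaches the input undiminished. The skip itself adds no weights.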

From the ResNet authors' follow-up paper, *Identity Mappings in Deep Residual Networks*:

> In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings.

Another answer: maybe no one knows for certain. One of the papers submitted to ICLR (International Conference on Learning Representations), https://openreview.net/pdf?id=HkwBEMWCZ, puts it this way:

> We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes, by reducing the possibility of node elimination and by making the nodes less linearly dependent. Moreover, for typical initializations, skip connections move the network away from the “ghosts” of these singularities and sculpt the landscape around them to alleviate the learning slow-down.