# Terms & Notes

## Data preprocessing
4 steps:
1. Cleaning
2. Transformation
    * Feature normalization
        * Min-max scaling $$X'=a+{\frac {\left(X-X_{\min }\right)\left(b-a\right)}{X_{\max }-X_{\min }}}$$ where $a$ and $b$ are boundary values
    * One-hot encoding
3. Dimensionality reduction
    * Principal Component Analysis (PCA)
        * Normalize the data
        * Compute the covariance matrix $$\Sigma=\frac{1}{n-1}\left((X-\bar x)^T(X-\bar x)\right)$$
        * Eigen decomposition
        * Create projection matrix
        * Squash features with projection matrix
    * t-SNE
    * LDA (Linear Discriminant Analysis)
4. Data randomization
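
The transformation and PCA steps listed above can be sketched with NumPy. This is a minimal sketch rather than a reference implementation: the function names `min_max_scale`, `one_hot`, and `pca` and the sample matrix `X` are assumptions made for illustration, and "normalize" is taken here to mean centering each feature only.

```python
import numpy as np

def min_max_scale(X, a=0.0, b=1.0):
    """Rescale each column of X into [a, b] (assumes no constant columns)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return a + (X - x_min) * (b - a) / (x_max - x_min)

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    return np.eye(num_classes)[np.asarray(labels)]

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    # Normalize the data: subtract the per-feature mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix: 1/(n-1) * (X - mean)^T (X - mean).
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    # Eigen decomposition; eigh is meant for symmetric matrices
    # and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Projection matrix: eigenvectors of the k largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    W = eigvecs[:, order]
    # Squash the features with the projection matrix.
    return X_centered @ W, W

# Hypothetical sample data: 4 samples, 2 correlated features.
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]])
X_scaled = min_max_scale(X)        # every column now spans [0, 1]
X_reduced, W = pca(X, k=1)         # 2 features squashed into 1
```

The variance of the projected data along each kept component equals the corresponding eigenvalue, which is why the largest eigenvalues are kept.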

## Activation functions
* linear $$Wx+b$$
* rectified linear unit (ReLU) $$f(x) = \max(x, 0)$$
* sigmoid $$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}} = 1 - \frac{1}{1 + e^x}$$ derivative $$\frac{d}{dx}\text{sigmoid}(x)=\frac{e^x}{(1+e^x)^2}=\frac{1}{1 + e^x}\left(1 - \frac{1}{1 + e^x}\right)=\text{sigmoid}(x)\left(1-\text{sigmoid}(x)\right)$$
* softmax $$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K{e^{z_k}}} \quad \text{for } j = 1, \dots, K$$ where $z$ is a $K$-dimensional vector and $\sigma(z)$ is a "squashed" vector of the same dimension
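
The four activations above can be written in a few lines of NumPy. This is a minimal sketch; the function names are assumptions, and subtracting the maximum inside `softmax` is a common numerical-stability trick that is not part of the formula above (it leaves the result unchanged).

```python
import numpy as np

def linear(W, x, b):
    """Linear activation: Wx + b."""
    return W @ x + b

def relu(x):
    """Rectified linear unit: max(x, 0), applied element-wise."""
    return np.maximum(x, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def softmax(z):
    """Squash a K-dimensional vector into K positive values that sum to 1."""
    # Shifting by max(z) avoids overflow in exp and cancels out in the ratio.
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

z = np.array([1.0, 2.0, 3.0])
probs = softmax(z)  # same shape as z, entries sum to 1
```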

## Cost functions
* sum of squared errors (SSE)
* cross entropy - for classification with one-hot encoded labels $$\hat y = \left\lgroup \matrix{0.1\cr 0.5\cr 0.4}\right\rgroup \quad y = \left\lgroup \matrix{0\cr 1\cr 0}\right\rgroup$$  $$D(\hat y, y)=-\sum_j{y_j\ln{\hat y_j}}$$
Loss function: $$L=\frac{1}{N} \sum_i{D(S(Wx_i + b), L_i)}$$ where $D$ is the cross entropy loss, $S$ is softmax, $L_i$ is the one-hot label of sample $x_i$ and $N$ is the number of samples
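
As a worked check of the cross entropy example above: with a one-hot label only the true class contributes, so $D(\hat y, y) = -\ln 0.5 \approx 0.693$. The sketch below reproduces that number and the averaged loss. The helper names and the toy inputs `X`, `labels`, `W`, `b` are assumptions made for illustration.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax for a (N, K) matrix of scores."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(y_hat, y):
    """D(y_hat, y) = -sum_j y_j * ln(y_hat_j); y is one-hot."""
    return -np.sum(y * np.log(y_hat), axis=-1)

def mean_loss(W, b, X, labels):
    """L = 1/N * sum_i D(S(W x_i + b), L_i)."""
    probs = softmax_rows(X @ W.T + b)
    return np.mean(cross_entropy(probs, labels))

# The vectors from the example above.
y_hat = np.array([0.1, 0.5, 0.4])
y = np.array([0.0, 1.0, 0.0])
d = cross_entropy(y_hat, y)  # equals -ln(0.5)

# Hypothetical toy batch: 2 samples, 2 features, 3 classes.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
labels = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
W = np.zeros((3, 2))
b = np.zeros(3)
loss = mean_loss(W, b, X, labels)  # zero weights give uniform predictions: ln(3)
```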

## Learning optimization
* weight initialization with random values from a truncated normal distribution (standard deviation $\sigma$)
* back propagation $$w \leftarrow w - \alpha \nabla_w L$$ $$b \leftarrow b - \alpha \nabla_b L$$
* gradient descent
* stochastic gradient descent (SGD)
    * momentum - running average of gradients for SGD: $$M \leftarrow 0.9M + \nabla L$$
    * learning rate decay - lowering learning rate during SGD
* AdaGrad - SGD variant that adapts the learning rate per parameter, which acts as a built-in learning rate decay
* Mini-batching (lowers memory consumption, but is computationally inefficient)
* early termination
* regularization
    * L2 Regularization $$L' = L + \beta\frac{1}{2}\|w\|^2_2$$ where $\beta$ is a small constant (hyperparameter), and $$\frac{1}{2}\|w\|^2_2 = \frac{1}{2}(w^2_1 + w^2_2 + \dots + w^2_n)$$
    * dropout
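
The update rule, momentum, and L2 regularization from the list above combine into a single step. This is a minimal sketch under stated assumptions: the function `sgd_momentum_step`, the toy quadratic loss, and the values of `alpha` and `beta` are made up for illustration, and the gradient is computed exactly instead of from a mini-batch.

```python
import numpy as np

def sgd_momentum_step(w, m, grad, alpha=0.05, beta=0.0, mu=0.9):
    """One update: M <- mu*M + grad(L'), then w <- w - alpha*M.

    L' = L + beta * 0.5 * ||w||^2, so its gradient is grad(L) + beta * w.
    """
    grad_total = grad + beta * w
    m = mu * m + grad_total
    w = w - alpha * m
    return w, m

# Hypothetical toy loss L(w) = 0.5 * ||w - target||^2 with gradient w - target.
target = np.array([1.0, -2.0, 3.0])

def grad_L(w):
    return w - target

def train(beta, steps=300):
    w = np.zeros_like(target)
    m = np.zeros_like(target)
    for _ in range(steps):
        w, m = sgd_momentum_step(w, m, grad_L(w), beta=beta)
    return w

w_plain = train(beta=0.0)  # converges to target
w_l2 = train(beta=0.1)     # the L2 penalty shrinks the solution to target / 1.1
```

With the penalty switched on, the minimizer moves toward zero, which is the intended effect of L2 regularization: large weights are discouraged.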

For convolutional:
* pooling
    * max pooling
    * average pooling
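
Both pooling variants slide a window over the input and reduce each window to a single value. The sketch below assumes a single-channel 2D input and no padding; the function `pool2d` and its parameters are names chosen for illustration.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2D array with a size x size window moved by `stride`."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]], dtype=float)

max_pooled = pool2d(x, mode="max")      # [[4, 8], [12, 16]]
avg_pooled = pool2d(x, mode="average")  # [[2.5, 6.5], [10.5, 14.5]]
```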

## Techniques
* Weight sharing (statistical invariance)
* Word2Vec
    * Skip-gram
    * CBOW (Continuous Bag-Of-Words)

## Hyperparameters
* learning rate $\alpha$
* number of layers and neurons in each layer
* batch size
* number of epochs
* $\beta$ (L2 Regularization constant)
* stride (convolutional networks)
* depth of CNN filter $k$
* pooling region size (convolutional networks)
* pooling region stride (convolutional networks)

## Neural network layer calculation
* convolutional layer output size (per spatial dimension): $$(W-F+2P)/S+1$$ where $W$ is the input height or width, $F$ is the filter height or width, $S$ is the stride and $P$ is the padding; the output depth equals the number of filters
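
A quick way to sanity-check the formula is to evaluate it for concrete numbers. The sketch below assumes $W$ and $F$ are measured along one spatial dimension (height or width); the function name `conv_output_size` is an assumption for illustration.

```python
def conv_output_size(w, f, s=1, p=0):
    """Output size along one spatial dimension: (W - F + 2P) / S + 1."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        # The filter does not tile the padded input evenly with this stride.
        raise ValueError("filter, padding and stride do not fit the input")
    return span // s + 1

# 32-wide input, 8-wide filter, stride 2, padding 1: (32 - 8 + 2) / 2 + 1 = 14
out = conv_output_size(32, 8, s=2, p=1)
```

Frameworks differ in how they handle strides that do not divide evenly (some round down instead of raising), so the strict check here is a design choice of this sketch.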