# Deep Learning Notes 

These notes assume some prior knowledge; this is mostly a note-taking space for me. 

## Loss Functions 

Loss functions are a metric for the network's performance and come in a variety of flavors for different purposes. The core purpose of a loss function is to measure the distance between the ground truth and the model's outputs. Depending on the type of problem you want to solve, you will choose a specific loss function. Here are the most common ones: 

| **Loss Function**             | **Purpose**                | **Keras**                      |
|---------------------------|------------------------|----------------------------|
| **Binary Cross-Entropy**      | Binary Prediction      | `binary_crossentropy`      |
| **Categorical Cross-Entropy** | Multi-class Pred - OneHot | `categorical_crossentropy` |
| **Sparse Categorical Cross-Entropy** | Multi-class Pred - Int | `sparse_categorical_crossentropy` |
| **Mean Squared Error**        | Continuous Regression  | `mean_squared_error`       |
| **Cosine Proximity**          | Vector Orientation     | `cosine_proximity`         |


Cross-entropy is a quantity from information theory that measures the distance between probability distributions (ground truth vs. output). This makes it a good fit for models that output probabilities. Categorical Cross-Entropy requires one-hot-encoded labels. Sparse Categorical Cross-Entropy avoids this and takes integer class labels directly. 
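
To make the one-hot vs. integer distinction concrete, here is a minimal NumPy sketch. The two helper functions and the toy probabilities are made up for illustration and are not the Keras implementations; the point is only that both label formats give the same loss value.

```python
import numpy as np

def categorical_crossentropy(y_onehot, probs, eps=1e-12):
    """Mean cross-entropy for one-hot labels (illustrative helper, not Keras)."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_onehot * np.log(probs), axis=1)))

def sparse_categorical_crossentropy(y_int, probs, eps=1e-12):
    """Mean cross-entropy for integer labels (illustrative helper, not Keras)."""
    probs = np.clip(probs, eps, 1.0)
    picked = probs[np.arange(len(y_int)), y_int]  # probability of the true class
    return float(-np.mean(np.log(picked)))

# Toy model outputs: 2 samples, 3 classes, each row sums to 1
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_int = np.array([0, 1])        # integer labels
y_onehot = np.eye(3)[y_int]     # same labels, one-hot encoded

loss_onehot = categorical_crossentropy(y_onehot, probs)
loss_sparse = sparse_categorical_crossentropy(y_int, probs)
print(loss_onehot, loss_sparse)
```

Only the label format differs, so the choice between the two losses comes down to how the targets are stored.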

Mean Squared Error (MSE) averages the squared differences (residuals) between predictions and targets. Sum of Squared Errors (SSE) is another option, though it grows with the number of samples and can explode more easily. Root Mean Squared Error (RMSE) is just the square root of the MSE. It is probably the most easily interpreted of the three, since it has the same units as the target (the quantity plotted on the vertical axis for a linear regression). 
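
A quick NumPy sketch of how the three quantities relate, using made-up toy values:

```python
import numpy as np

# Toy targets and predictions (made up for illustration)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

residuals = y_true - y_pred            # [1, 0, -2, -1]
sse = float(np.sum(residuals ** 2))    # sum of squared errors
mse = float(np.mean(residuals ** 2))   # SSE divided by the number of samples
rmse = float(np.sqrt(mse))             # back in the units of the target

print(sse, mse, rmse)
```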

Cosine Proximity, or Cosine Similarity, is a measure of how close two vectors are in terms of orientation rather than magnitude. This is useful in models like Word2Vec. 
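
The "orientation, not magnitude" point can be checked with a small NumPy sketch. The `cosine_similarity` helper here is written for illustration and is not the Keras loss (which may differ in sign convention and reduction).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
same_direction = cosine_similarity(v, 10 * v)           # scaling does not matter
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
opposite = cosine_similarity(v, -v)                     # pointing the other way

print(same_direction, orthogonal, opposite)
```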

More information on the [math here](https://isaacchanghau.github.io/2017/06/07/Loss-Functions-in-Artificial-Neural-Networks/), or in the [Keras Docs](https://keras.io/losses/)

## Metrics 

In addition to loss, there are other metrics to validate the training of a network. Metrics differ for classification and regression problems. 

### Accuracy 

Accuracy (acc) is the simple percentage of correct predictions for a classification problem. 

### Mean Absolute Error 

Mean Absolute Error (MAE) is the average of the absolute differences between the predictions and the targets. It only applies to regression. It can be read as how far off you are, on average, in the units of your target. 
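
Both metrics are easy to compute by hand. A minimal NumPy sketch with made-up labels and predictions:

```python
import numpy as np

# Classification: fraction of predictions that match the labels
class_true = np.array([0, 1, 1, 2, 2])
class_pred = np.array([0, 1, 2, 2, 2])
accuracy = float(np.mean(class_true == class_pred))  # 4 of 5 correct

# Regression: average absolute error, in the units of the target
y_true = np.array([200.0, 150.0, 300.0])
y_pred = np.array([210.0, 140.0, 330.0])
mae = float(np.mean(np.abs(y_true - y_pred)))        # (10 + 10 + 30) / 3

print(accuracy, mae)
```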

## Optimizers 

In general, it is safe to start with RMSprop (`rmsprop`), whatever the problem. 

## Activation Functions 

### Softmax 

Since softmax functions output a probability distribution over many categories, you need to be sure to format the last layer correctly. The example below shows the right format for a 10-class classification with a softmax output. The model will output a 10-dim vector of probabilities whose sum is one. 

```python 
model.add(layers.Dense(10, activation='softmax'))
```
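
To see why the outputs sum to one, here is a standalone NumPy sketch of the softmax function itself. It is written for illustration (random toy logits, a hand-rolled `softmax` helper) and is not the Keras layer.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax (illustrative helper)."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the row max does not change the result but avoids overflow
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))  # 4 samples, 10 classes (toy values)
probs = softmax(logits)

print(probs.shape)        # one 10-dim vector per sample
print(probs.sum(axis=1))  # each row sums to (approximately) one
```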

## Why is my Neural Net not Working? 

Essential Checklist: 
####  0. You're not documenting your process! C'mon! Be a scientist! 

####  1. Does the last layer of the network have (N nodes == N classes)? 

####  2. Are you using the right loss function? (check above)
- are you one-hot encoding? (if so don't use sparse categorical cross entropy) 
- MSE for regression 

####  3. Are you using the right activation on the last layer? (check above)
- no activation for regression 
- softmax for probabilities 
- sigmoid [0, 1]
- ReLU [0, inf]
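
The output ranges listed above can be checked with a couple of hand-written NumPy helpers (a sketch for illustration, not the Keras activations):

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Clips negative values to zero, leaves positive values unchanged."""
    return np.maximum(0.0, z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])  # toy pre-activation values
sig = sigmoid(z)
rel = relu(z)

print(sig)
print(rel)
```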

#### 4. Your network is too big for your data 
- scale down the number of hidden layers and nodes 

#### 5. You did not shuffle your data 

#### 6. You did not normalize your data 
- data should be between [0, 1]
- your data could be heterogeneous, where features have different scales 
    - normalize independently on a per-feature basis 
    - mean = 0, std = 1
    - `x -= x.mean(axis=0)`
    - `x /= x.std(axis=0)`
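
A slightly fuller sketch of per-feature normalization on made-up data. One detail added here that the list above does not spell out: the mean and standard deviation are computed on the training data only and then reused for the test data, so no test information leaks into preprocessing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two features on very different scales
train = np.column_stack([rng.normal(1000.0, 200.0, size=100),
                         rng.normal(0.5, 0.1, size=100)])
test = np.column_stack([rng.normal(1000.0, 200.0, size=20),
                        rng.normal(0.5, 0.1, size=20)])

# Per-feature statistics (axis=0), computed on the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_norm = (train - mean) / std
test_norm = (test - mean) / std  # reuse the training statistics

print(train_norm.mean(axis=0), train_norm.std(axis=0))
```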

#### 7. Can you reduce the dimensionality of your data? 
- feature engineering! 

#### 8. You're scaling up wrong 
- You are applying regularization and increasing network size at the wrong time 
- The ideal workflow is iterative, like tightening a car wheel: 
        **start with a small, basic network**
        --> train, work out kinks, gradient check 
        while Tuning: 
            --> overfit this model (add layers, nodes, epochs, gridsearch)
            --> add regularization (L1/L2, dropout, remove layers)
            --> work out kinks
 


## Code Snippets 

In [None]:
### Plotting the Training and Validation Loss ### 
## assumes: ##
# import matplotlib.pyplot as plt 
# history = model.fit(...)

metric = 'loss'  # or 'acc'

history_dict = history.history
loss_values = history_dict[metric]
val_loss_values = history_dict['val_{}'.format(metric)]
epochs = range(1, len(loss_values) + 1)

plt.clf()
plt.figure(figsize=(8, 3))
plt.plot(epochs, loss_values, 'bo', label='Training {}'.format(metric))
plt.plot(epochs, val_loss_values, 'b', label='Validation {}'.format(metric))
plt.title('Training and Validation {}'.format(metric))
plt.xlabel('epochs')
plt.ylabel('{}'.format(metric))
plt.legend()
plt.show()

### K-Fold Cross Validation 

When training data is scarce, you can try this: apply K-fold multiple times, shuffling the data each time before splitting it K ways. The final score is the average of all the scores obtained at each run of the K-fold. So you train and evaluate the model (P * K) times, where P is the number of K-fold repetitions! Yes, it's expensive... Pseudocode below:

```python 
P = 5
K = 5
scores = []
for p in range(P): 
    np.random.shuffle(data) #reshuffle before each k-fold pass 
    for k in range(K): 
        #...k-fold data split 
        model.train(train_data) 
        scores.append(model.evaluate(val_data))
np.average(scores)
```
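
For something that actually runs, here is a small sketch of the same idea. To keep it self-contained it swaps the neural network for a least-squares linear fit on synthetic data; `fit_linear`, `mse`, and `repeated_k_fold` are hypothetical helpers written for this example, and the validation score is MSE.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit; stands in for building and training a network."""
    x_b = np.column_stack([x, np.ones(len(x))])  # append a bias column
    weights, *_ = np.linalg.lstsq(x_b, y, rcond=None)
    return weights

def mse(weights, x, y):
    """Validation score: mean squared error of the fitted model."""
    x_b = np.column_stack([x, np.ones(len(x))])
    return float(np.mean((x_b @ weights - y) ** 2))

def repeated_k_fold(x, y, p=5, k=5, seed=0):
    """Run k-fold p times, reshuffling before each pass; return mean and all scores."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(p):
        indices = rng.permutation(len(x))   # new shuffle for this pass
        folds = np.array_split(indices, k)  # k roughly equal index groups
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            weights = fit_linear(x[train_idx], y[train_idx])
            scores.append(mse(weights, x[val_idx], y[val_idx]))
    return float(np.mean(scores)), scores

# Synthetic regression data: linear signal plus a little noise
rng = np.random.default_rng(42)
x = rng.normal(size=(60, 3))
y = x @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=60)

mean_score, all_scores = repeated_k_fold(x, y, p=5, k=5)
print(len(all_scores), mean_score)  # P * K scores, then their average
```

Shuffling indices instead of the data itself keeps `x` and `y` aligned, which matters as soon as features and targets live in separate arrays.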

An implementation of just the k-fold can be found below. 


In [None]:
### Normal K-Fold Cross Validation ###
## assumes ##
# build_model() #function to build model
# import numpy as np 
# withhold test data

k = 4 
num_validation_samples = len(data) // k
np.random.shuffle(data) #inplace 

val_scores = []
for fold in range(k): 
    start_split = num_validation_samples * fold
    end_split = num_validation_samples * (fold + 1)
    
    #grab the data 
    val_data = data[start_split:end_split] #validation split
    train_data = np.concatenate([data[:start_split], data[end_split:]]) #train split (everything outside the validation fold)
    
    #train & validate new instance of the model 
    model = build_model() 
    history = model.train(train_data)
    val_score = model.evaluate(val_data)
    val_scores.append(val_score)

val_score = np.mean(val_scores)

model = build_model() #fresh model, trained on all non-test data
model.train(data) 
test_score = model.evaluate(test_data)