## Fundamentals of Machine Learning

### 4 branches of machine learning

* supervised learning
    - binary and multiclass classification
    - scalar regression
    - sequence generation: given a picture, predict a caption describing it.
    - syntax tree prediction: given a sentence, predict its decomposition into a syntax tree.
    - object detection: given a picture, draw a bounding box around certain objects inside the picture.
    - image segmentation: given a picture, draw a pixel-level mask on a specific object.
    
* unsupervised learning
    - finding interesting transformations of the input data without the help of any targets, for data visualization, data compression, data denoising, or to better understand the correlations present in the data.
    - dimensionality reduction
    - clustering
    
* self-supervised learning
    - supervised learning without human-annotated labels.
    - autoencoders: a well-known instance where the generated targets are the input.
    - temporally supervised learning: predict the next frame given past frames, or the next word in a text given the previous words.
    
* reinforcement learning
    - an agent receives information about its environment and learns to choose actions that will maximize some reward.
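
For the temporally supervised case listed under self-supervised learning, the targets come from the data itself rather than from human annotation. A minimal sketch, assuming a plain list of tokens; the helper name `next_token_pairs` and the `context_size` parameter are illustrative, not part of any library:

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs where the target is the next token."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tokens[i - context_size:i]  # the previous `context_size` tokens
        pairs.append((context, tokens[i]))    # the label is generated from the input
    return pairs


words = "the cat sat on the mat".split()
pairs = next_token_pairs(words, context_size=2)
for context, target in pairs:
    print(context, "->", target)
```

No labels were written by hand: each target is just the word that follows its context in the original text.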
    

### Evaluating ML models

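- split the data into training, validation and test sets.
- tune hyperparameters on the validation set
- information leaks: every time a hyperparameter is tuned based on validation performance, some information about the validation data leaks into the model.
- simple hold-out validation
    - train, validation, test
    - not good with small data: the validation and test sets may contain too few samples to be representative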

```python
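import numpy as np

# `data`, `test_data` and `get_model()` are assumed to be defined elsewhere
num_validation_samples = 10000
np.random.shuffle(data)  # shuffle before splitting
validation_data = data[:num_validation_samples]  # hold out the validation set
training_data = data[num_validation_samples:]    # the rest is training data

model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# Tune the model, retrain it, evaluate it, tune it again...

# Once the hyperparameters are settled, train the final model from scratch
# on all non-test data and evaluate it once on the test set
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)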
```

- k-fold validation
```python
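import numpy as np

# `data`, `test_data` and `get_model()` are assumed to be defined elsewhere
k = 4
num_validation_samples = len(data) // k
np.random.shuffle(data)
validation_scores = []
for fold in range(k):
    # Select the validation partition for this fold
    start = num_validation_samples * fold
    stop = num_validation_samples * (fold + 1)
    validation_data = data[start:stop]
    # Use the remaining data for training; concatenate, since `+` on
    # NumPy arrays adds element-wise instead of joining
    training_data = np.concatenate([data[:start], data[stop:]])
    model = get_model()  # fresh, untrained model for each fold
    model.train(training_data)
    validation_scores.append(model.evaluate(validation_data))
validation_score = np.average(validation_scores)  # average over the k folds
# Train the final model on all non-test data
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)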
```
    

- iterated k-fold validation with shuffling
    - for situations in which you have relatively little data available and you need to evaluate the model as precisely as possible
    - applying k-fold validation multiple times, shuffling the data every time before splitting it k ways. 
    - final score is the average of the scores obtained at each run of k-fold validation. 
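
The iterated variant described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `iterated_k_fold` and `mean_baseline_mse` are hypothetical helpers, and the "model" is just a baseline that predicts the training mean, standing in for whatever `get_model()` would return.

```python
import numpy as np


def iterated_k_fold(data, fit_and_score, k=4, iterations=3, seed=0):
    """Run k-fold validation `iterations` times, reshuffling before each run."""
    rng = np.random.default_rng(seed)
    run_scores = []
    for _ in range(iterations):
        shuffled = rng.permutation(data)      # fresh shuffle before each k-way split
        folds = np.array_split(shuffled, k)
        fold_scores = []
        for i in range(k):
            validation = folds[i]
            training = np.concatenate([folds[j] for j in range(k) if j != i])
            fold_scores.append(fit_and_score(training, validation))
        run_scores.append(float(np.mean(fold_scores)))
    # Final score: the average of the per-run k-fold scores
    return float(np.mean(run_scores)), run_scores


def mean_baseline_mse(training, validation):
    """Stand-in model: predict the training mean, score with mean squared error."""
    prediction = training.mean()
    return float(np.mean((validation - prediction) ** 2))


data = np.arange(20, dtype=float)
score, run_scores = iterated_k_fold(data, mean_baseline_mse, k=4, iterations=3)
print(score, run_scores)
```

Note the cost: this trains and evaluates `iterations * k` models, which is why it is mainly used when data is scarce.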
    
Things to keep in mind:
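- data representativeness: both the training and test sets should be representative of the data, so shuffle randomly before splitting
- arrow of time: do not shuffle randomly when trying to predict the future given the past; make sure all test data comes after the training data
- redundancy in data: e.g. if some data points appear twice, make sure the training and validation sets are disjoint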
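
The last two points can be sketched with NumPy. This is an illustration under stated assumptions: `temporal_split` and `drop_duplicates` are hypothetical helpers, and the timestamps are made-up values, one per sample.

```python
import numpy as np


def drop_duplicates(samples):
    """Remove repeated rows so the same point cannot land in both splits."""
    return np.unique(samples, axis=0)


def temporal_split(samples, timestamps, validation_fraction=0.2):
    """Split without shuffling: the most recent samples become validation data."""
    order = np.argsort(timestamps, kind="stable")
    cut = int(len(samples) * (1 - validation_fraction))
    return samples[order[:cut]], samples[order[cut:]]


samples = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
unique_samples = drop_duplicates(samples)   # the repeated [1, 1] row is removed
timestamps = np.array([40, 10, 30, 20])     # hypothetical, one per unique row
train, val = temporal_split(unique_samples, timestamps, validation_fraction=0.25)
print(train)
print(val)
```

Deduplicating before splitting is the important ordering here; doing it afterwards would not guarantee disjoint sets.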

### Data preprocessing, feature engineering and feature learning

Data preprocessing:
- vectorization: inputs and targets need to be tensors of floating-point data
- normalization: 
    - take small values: typically 0-1
    - homogeneous: all features take values in roughly the same range
    - normalize each feature independently to have a mean of 0 and a standard deviation of 1

```python
x -= x.mean(axis=0)
x /= x.std(axis=0)
```
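
When there is a separate test set, the normalization statistics are normally computed on the training data only and then reused for the test data, so that nothing about the test set leaks into preprocessing. A minimal sketch with synthetic data; the array names and shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test_x = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Statistics come from the training data only
mean = train_x.mean(axis=0)
std = train_x.std(axis=0)

train_x = (train_x - mean) / std
test_x = (test_x - mean) / std  # reuse the training mean and std

print(train_x.mean(axis=0), train_x.std(axis=0))
```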

- handling missing values
    - with neural networks, it's safe to input missing values as 0, provided that 0 is not already a meaningful value. The network will learn from exposure that 0 means missing.
    - if the test data has missing values but the training data does not, the model will not have learned to ignore missing values. One should artificially generate training samples with missing entries.
- feature extraction
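
Generating training samples with missing entries, as suggested in the missing-values notes above, could look like the sketch below. `add_missing_entries` and `missing_rate` are hypothetical names, and the approach assumes 0 is not otherwise a meaningful value in the data.

```python
import numpy as np


def add_missing_entries(x, missing_rate=0.1, seed=0):
    """Return a copy of `x` with a random fraction of entries set to 0."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < missing_rate  # True where a value is dropped
    corrupted = x.copy()                       # leave the original array untouched
    corrupted[mask] = 0.0                      # 0 stands for "missing"
    return corrupted, mask


x_train = np.ones((1000, 5))
x_augmented, mask = add_missing_entries(x_train, missing_rate=0.1)
print(mask.mean())  # fraction of entries that were zeroed
```

The corrupted copies can be appended to the original training samples so the network sees both complete and incomplete inputs.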

### Overfitting and Underfitting

regularization:
- reducing the network size 
    - number of layers
    - number of units per layer
    - learnable parameters -> capacity
    - start with few layers and parameters, and increase the size until you see diminishing returns with respect to validation loss

```python
from keras import models
from keras import layers

# original model
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

```python
# smaller model
model = models.Sequential()
model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

```python
# high capacity model
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
# the bigger model starts overfitting earlier and its validation loss is noisier
```
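
To make "capacity" concrete, the number of learnable parameters of the three networks above can be counted by hand: a `Dense` layer with bias has `inputs * units` weights plus `units` biases. The helper `dense_param_count` below is a hypothetical name for illustration and does not call Keras.

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for units in layer_units:
        total += previous * units + units  # weight matrix plus bias vector
        previous = units
    return total


original = dense_param_count(10000, [16, 16, 1])
smaller = dense_param_count(10000, [4, 4, 1])
bigger = dense_param_count(10000, [512, 512, 1])
print(original, smaller, bigger)  # 160305 40029 5383681
```

Almost all parameters sit in the first layer because of the 10,000-dimensional input, so changing the width of that layer dominates the capacity.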

- adding weight regularization
    - Occam's razor: given two explanations for something, the one most likely to be correct is the simpler one, the one with fewer assumptions.
    - simpler models are less likely to overfit than complex ones
    - a simple model is one where the distribution of parameter values has less entropy (or one with fewer parameters)
    - put constraints on the complexity of a network by forcing its weights to take only small values; this is called weight regularization
    - done by adding to the loss function a cost associated with having large weights
    - L1: the added cost is proportional to the absolute value of the weight coefficients
    - L2: the added cost is proportional to the square of the weight coefficients (also called weight decay)
    - pass weight regularizer instances to layers as keyword arguments

```python
from keras import regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
```

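`l2(0.001)` means every coefficient in the layer's weight matrix will add `0.001 * weight_coefficient_value ** 2` to the total loss of the network. The penalty is only added at training time, so the training loss will be higher than the test loss.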
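
The size of these penalties can be checked by hand with NumPy. This is a sketch of the arithmetic only, assuming the usual definitions (L2: factor times the sum of squared weights, L1: factor times the sum of absolute weights); `l2_penalty` and `l1_penalty` are illustrative helpers, not Keras functions.

```python
import numpy as np


def l2_penalty(weights, factor=0.001):
    """Cost added to the loss by an L2 regularizer with the given factor."""
    return factor * np.sum(np.square(weights))


def l1_penalty(weights, factor=0.001):
    """Cost added to the loss by an L1 regularizer with the given factor."""
    return factor * np.sum(np.abs(weights))


weights = np.array([[0.5, -2.0], [1.0, 0.0]])
print(l2_penalty(weights))  # 0.001 * (0.25 + 4.0 + 1.0 + 0.0)
print(l1_penalty(weights))  # 0.001 * (0.5 + 2.0 + 1.0 + 0.0)
```

The single large weight (`-2.0`) accounts for most of the L2 cost, which is why L2 pushes hardest against large coefficients.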

```python
from keras import regularizers

regularizers.l1(0.001)                  # L1 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)  # simultaneous L1 and L2 regularization
```

- adding dropout
    - dropout, applied to a layer, consists of randomly dropping out a number of output features of the layer during training.
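    - the dropout rate is the fraction of features that are zeroed out, usually between 0.2 and 0.5
    - at test time no units are dropped out; instead, the layer's output values are scaled by the keep probability (1 minus the dropout rate, so halved for a rate of 0.5) to balance for the fact that more units are active than at training time. In practice the scaling is usually done at training time instead: outputs are scaled up by dividing by (1 minus the dropout rate) and left unchanged at test time.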
    - introduce dropout via `Dropout` layer
    
    

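```python
import numpy as np

# At training time, zero out about 50% of the units in the output, then scale
# up to compensate (assumes `layer_output` is an existing NumPy array):
# layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
# layer_output /= 0.5
```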
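
A runnable version of the same idea for an arbitrary dropout rate is sketched below. It uses the scale-up-at-training-time variant; `dropout_train` is a hypothetical helper written for illustration, not how Keras implements its `Dropout` layer internally.

```python
import numpy as np


def dropout_train(layer_output, rate=0.5, seed=0):
    """Zero out a random fraction `rate` of the units and rescale the rest."""
    rng = np.random.default_rng(seed)
    keep_mask = rng.random(layer_output.shape) >= rate  # True for units that are kept
    # Dividing by the keep probability keeps the expected output unchanged,
    # so nothing needs to be rescaled at test time
    return layer_output * keep_mask / (1.0 - rate)


layer_output = np.ones((4, 1000))
dropped = dropout_train(layer_output, rate=0.5)
print(np.unique(dropped))  # surviving units are scaled from 1.0 to 2.0
print(dropped.mean())      # stays close to the original mean of 1.0
```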

```python
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
```

To recap, the most common ways to prevent overfitting in neural networks:
- get more training data
- reduce the capacity of the network
- add weight regularization
- add dropout

### Universal workflow of machine learning

- defining the problem and assembling the datasets
    - what is input data
    - what is target
    - what type of model
- choosing a measure of success
    - metric of success will guide the choice of loss function
    - for balanced-classification problems: accuracy and area under the ROC curve (ROC AUC)
    - for ranking problems or multilabel classification: mean average precision
- deciding on an evaluation protocol
    - maintain a hold-out validation set (for most of the cases)
    - doing k-fold cross-validation
    - doing iterated k-fold validation
- preparing the data
    - once you know what you are training on, what you are optimizing for and how to evaluate your approach
    - format data in a way that can be fed into a machine-learning model
        - formatted as tensors
        - values be scaled to small values in $[-1, 1]$ or $[0, 1]$ range
        - normalize if different feature takes different ranges
        - feature engineering especially for small-data problems
- developing a first model that does better than a baseline: a model with statistical power
    - hypotheses:
        - output can be predicted given inputs
        - available data is sufficiently informative to learn the relationship between input and output
    - key choices:
        - last-layer activation: to establish useful constraints on the network's output
        - loss function: should match the type of problem you are trying to solve
        - optimization configuration: which optimizer and what learning rate; in most cases it is safe to use `rmsprop` with its default learning rate
        
Note: the loss function
    - needs to be computable given only a mini-batch of data (ideally a single data point)
    - must be differentiable
<img src = 'loss_activate.png'>
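
As a text companion to the figure, here is a small lookup of commonly used last-layer activation and loss pairings in Keras. It is a sketch that may not match the figure entry for entry; the dictionary `LAST_LAYER_CHOICES` and the helper `suggest` are illustrative names, while the activation and loss strings are standard Keras identifiers.

```python
# Commonly used pairings: problem type -> (last-layer activation, loss)
LAST_LAYER_CHOICES = {
    "binary classification": ("sigmoid", "binary_crossentropy"),
    "multiclass, single-label classification": ("softmax", "categorical_crossentropy"),
    "multiclass, multilabel classification": ("sigmoid", "binary_crossentropy"),
    "regression to arbitrary values": (None, "mse"),
    "regression to values between 0 and 1": ("sigmoid", "mse"),
}


def suggest(problem_type):
    """Look up a reasonable starting activation and loss for a problem type."""
    activation, loss = LAST_LAYER_CHOICES[problem_type]
    return {"activation": activation, "loss": loss}


print(suggest("binary classification"))
print(suggest("regression to arbitrary values"))  # no activation: unconstrained output
```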

- scaling up: developing a model that overfits
    - to figure out how big a model you will need, first develop one that overfits
        - add layers
        - make the layers bigger
        - train for more epochs
    - monitor the train and validation loss and evaluation metrics
- regularizing your model and tuning your hyperparameters
    - add dropout
    - try different architectures: add or remove layers
    - add L1 and/or L2 regularization
    - try different hyperparameters
        - number of units per layer
        - learning rate
    - iterate on feature engineering
        - add new features
        - remove features that don't seem to be informative
    - every time feedback from the validation process is used to tune the model, information about the validation set leaks into the model
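
Monitoring the training and validation losses during the "scaling up" step comes down to spotting the epoch where validation loss stops improving while training loss keeps falling. A minimal sketch on made-up loss histories; `best_epoch` and `is_overfitting` are hypothetical helpers, and the numbers are not real results.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


def is_overfitting(train_losses, val_losses):
    """True if validation loss bottomed out early while training loss kept falling."""
    best = best_epoch(val_losses)
    return best < len(val_losses) and train_losses[-1] < train_losses[best - 1]


# Hypothetical per-epoch losses, for illustration only
train_losses = [0.9, 0.6, 0.4, 0.3, 0.2, 0.1]
val_losses = [0.8, 0.6, 0.5, 0.55, 0.6, 0.7]

print(best_epoch(val_losses))                    # 3
print(is_overfitting(train_losses, val_losses))  # True
```

Once a model reaches this regime, the regularization and tuning steps listed above are applied to push the validation minimum lower.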