# Deep Learning 4

## 4.1 Machine learning

## 4.2 Supervised Learning

- Supervised learning
  - map input data to known targets/annotations
  - Categories
    - classification
    - regression
    - Sequence generation
      - Given a picture, predict a caption describing it. Sequence generation can sometimes be reformulated as a series of classification problems (such as repeatedly predicting a word or token in a sequence).
    - Syntax tree prediction
      - Given a sentence, predict its decomposition into a syntax tree
    - Object detection
      - Given a picture, draw a bounding box around certain objects inside the picture. This can also be expressed as a classification problem (given many candidate bounding boxes, classify the contents of each one) or as a joint classification and regression problem, where the bounding-box coordinates are predicted via vector regression.
    - Image segmentation
      - Given a picture, draw a pixel-level mask on a specific object.

## 4.3 Unsupervised Learning

- Unsupervised learning
  - finding interesting transformations of the input data without the help of any targets
  - purposes 
    - data visualization
    - data compression
    - data denoising
    - to better understand the correlations present in the data at hand.
  - Categories
    - Dimensionality reduction
    - clustering

## 4.4 Self-Supervised Learning

- Self-supervised learning
  - supervised learning without human-annotated labels
  - labels are generated from the input data, typically by a heuristic algorithm
  - temporally supervised learning is the special case where the supervision comes from future input data
  - Examples
    - autoencoders
      - generated targets are the input, unmodified
    - predict the next frame in a video, given past frames
    - predict the next word in a text, given previous words
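
The idea that labels are generated from the input itself can be sketched in a few lines. This is a minimal illustration, not a training pipeline; the helper names `next_word_pairs` and `autoencoder_pairs` and the `context_size` parameter are assumptions made for this example.

```python
def next_word_pairs(text, context_size=3):
    """Build (context, target) pairs where the target is the word that follows."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        target = words[i]  # the label comes from the text itself, not from a human
        pairs.append((context, target))
    return pairs


def autoencoder_pairs(samples):
    """For an autoencoder, the generated target is the unmodified input."""
    return [(x, x) for x in samples]


for context, target in next_word_pairs("the cat sat on the mat", context_size=2):
    print(context, "->", target)
```

No annotation step is involved in either helper: the targets are derived mechanically from the raw data.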

## 4.5 Reinforcement Learning
- an agent receives information about its environment and learns to choose actions that will maximize some reward

## 4.6 Glossary: Classification & Regression

- Classification and regression glossary
  - Sample or input
    - One data point that goes into your model
  - Prediction or output
    - What comes out of your model.
  - Target
    - The truth. What your model should ideally have predicted, according to an external source of data.
  - Prediction error or loss value
    - A measure of the distance between your model’s prediction and the target.
  - Classes
    - A set of possible labels to choose from in a classification problem.
  - Label
    - A specific instance of a class annotation in a classification problem.
  - Ground-truth or annotations
    - All targets for a dataset, typically collected by humans.
  - Binary classification
    - A classification task where each input sample should be categorized into one of two exclusive categories.
  - Multiclass classification
    - A classification task where each input sample should be categorized into one of more than two categories.
  - Multilabel classification
    - A classification task where each input sample can be assigned multiple labels.
  - Scalar regression
    - A task where the target is a continuous scalar value.
  - Vector regression
    - A task where the target is a set of continuous values
  - Mini-batch or batch
    - A small set of samples that are processed simultaneously by the model.
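
A few of the glossary terms are easier to tell apart once they are encoded. The sketch below assumes a common encoding (one-hot targets for single-label classification, multi-hot targets for multilabel classification); the helper names and the three example classes are hypothetical.

```python
def one_hot(label, classes):
    """Multiclass, single-label target: exactly one class is active."""
    return [1.0 if c == label else 0.0 for c in classes]


def multi_hot(labels, classes):
    """Multilabel target: any number of classes can be active at once."""
    label_set = set(labels)
    return [1.0 if c in label_set else 0.0 for c in classes]


def mini_batches(samples, batch_size):
    """Yield consecutive small groups of samples; the last one may be shorter."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


classes = ["cat", "dog", "bird"]
print(one_hot("dog", classes))
print(multi_hot(["cat", "bird"], classes))
print(list(mini_batches(list(range(7)), 3)))
```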


## 4.7 Underfitting and Overfitting

![](./snaps/5.1.PNG)

- Training cycle
  - beginning
    - optimization and generalization are correlated
      - the lower the loss on training data, the lower the loss on test data.
    - the model is still underfitting: there is progress left to be made
  - after some iterations
    - generalization stops improving
    - validation metrics stall and then begin to degrade 
    - the model is starting to overfit
      - beginning to learn patterns that are specific to the training data
- Overfitting is particularly likely when the data
  - is noisy
  - involves uncertainty
  - includes rare features

- if little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand
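
The training cycle described above can be read off a pair of loss curves. The sketch below uses made-up loss histories (the numbers are illustrative, not results) and two hypothetical helpers to locate the epoch where validation loss bottoms out and to measure the gap between training and validation loss.

```python
def best_epoch(val_losses):
    """Return the 0-based epoch index with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


def generalization_gap(train_losses, val_losses):
    """Per-epoch difference between validation and training loss."""
    return [v - t for t, v in zip(train_losses, val_losses)]


# Hypothetical histories: training loss keeps falling, validation loss turns around.
train_loss = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_loss = [0.95, 0.70, 0.55, 0.50, 0.58, 0.70]

epoch = best_epoch(val_loss)
print("validation loss is lowest at epoch", epoch)
print("gap per epoch:", generalization_gap(train_loss, val_loss))
```

Epochs before the minimum correspond to the underfitting phase; a widening gap after it is the usual sign that the model has started to overfit.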

## 4.8 Hold-Out and K-Fold Validation

![](./snaps/5.7.PNG)

#### Iterated K-Fold Validation with Shuffling
- apply K-fold validation multiple times, shuffling the data every time before splitting it K ways. 
- The final score is the average of the scores obtained at each run of K-fold validation.
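
K-fold validation and its iterated variant can be sketched as follows. This is a simplified illustration under stated assumptions: `train_and_score` is a hypothetical callback that trains a fresh model on the training part and returns a score on the validation part, and the last fold absorbs any leftover samples.

```python
import random


def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    fold_size = num_samples // k
    indices = list(range(num_samples))
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover samples.
        end = num_samples if fold == k - 1 else start + fold_size
        yield indices[:start] + indices[end:], indices[start:end]


def k_fold_score(data, k, train_and_score):
    """Average the validation score over the k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        train = [data[i] for i in train_idx]
        val = [data[i] for i in val_idx]
        scores.append(train_and_score(train, val))
    return sum(scores) / len(scores)


def iterated_k_fold_score(data, k, iterations, train_and_score, seed=0):
    """Repeat K-fold validation, shuffling the data before every run."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(iterations):
        shuffled = list(data)
        rng.shuffle(shuffled)
        run_scores.append(k_fold_score(shuffled, k, train_and_score))
    return sum(run_scores) / len(run_scores)


def mean_baseline_error(train, val):
    """Toy stand-in for a model: predict the training mean, report mean absolute error."""
    prediction = sum(train) / len(train)
    return sum(abs(v - prediction) for v in val) / len(val)


data = [float(x) for x in range(20)]
print(iterated_k_fold_score(data, k=4, iterations=3, train_and_score=mean_baseline_error))
```

Note the cost: iterated K-fold trains and evaluates `iterations * k` models, which is why it is usually reserved for situations where data is scarce.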

## 4.9 Splitting Techniques
- Data representativeness
    - randomly shuffle the data before splitting it, so that both sets are representative of the data at hand
- The arrow of time
    - when predicting the future from the past, do not shuffle before splitting; every sample in the test set should come after the samples in the training set
- Redundancy in your data
    - training and testing data should be disjoint
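
The three concerns above map onto three small helpers. This is a sketch under assumptions: the function names, the `test_fraction` and `seed` parameters, and the `(timestamp, value)` record layout are all hypothetical, and `deduplicate` requires hashable samples.

```python
import random


def shuffled_split(samples, test_fraction=0.2, seed=0):
    """Data representativeness: shuffle, then hold out the tail as the test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    split = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:split], shuffled[split:]


def chronological_split(samples, timestamp, test_fraction=0.2):
    """Arrow of time: no test sample is earlier than any training sample."""
    ordered = sorted(samples, key=timestamp)
    split = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:split], ordered[split:]


def deduplicate(samples):
    """Redundancy: drop repeated samples so none can land in both sets."""
    seen = set()
    unique = []
    for sample in samples:
        if sample not in seen:
            seen.add(sample)
            unique.append(sample)
    return unique


records = [(3, "c"), (1, "a"), (5, "e"), (2, "b"), (4, "d")]
train, test = chronological_split(records, timestamp=lambda r: r[0], test_fraction=0.4)
print(train)
print(test)
```

Deduplicating before splitting is the simplest way to keep the two sets disjoint; shuffling a time-ordered dataset, by contrast, would leak future information into training.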

## 4.10 Data Preprocessing
- vectorization
  - turning data into tensors of floating-point values
- normalization
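  - problem
    - large values or heterogeneous features (for example, one feature in the 0–1 range and another in the 100–200 range)
      - can trigger large gradient updates
        - which prevent the network from converging
  - to make learning easier, the data should
    - Take small values
      - Typically, most values should be in the 0–1 range.
    - Be homogeneous
      - all features should take values in roughly the same range.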
  - solution
    - Normalize each feature independently to have a mean of 0.
    - Normalize each feature independently to have a standard deviation of 1. 
- handling missing values
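  - input missing values as 0, provided 0 is not already a meaningful value; the network can learn that 0 means missing data
  - if missing values are expected in the test data but absent from the training data, artificially generate training samples with missing entries
    - copy some training samples several times, and drop some of the features that you expect are likely to be missing in the test data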
- feature engineering
    - process of using your own knowledge about the data and about the machine learning algorithm at hand to make the algorithm work better by applying hardcoded transformations to the data before it goes into the model
    - modern deep learning removes the need for most feature engineering, but good features still matter for two reasons
        - Good features still allow you to solve problems more elegantly while using fewer resources
        - Good features let you solve a problem with far less data. The ability of deep learning models to learn features on their own relies on having lots of training data available
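
The normalization and missing-value steps above can be sketched in plain Python. This is a minimal illustration under assumptions: missing entries are represented as `None`, the helper names are hypothetical, and the mean and standard deviation are computed on the training data only and then reused for the test data (a common practice that these notes do not state explicitly).

```python
import math


def fill_missing(rows, value=0.0):
    """Replace missing entries (None here, by assumption) with a fixed value."""
    return [[value if x is None else x for x in row] for row in rows]


def column_stats(rows):
    """Per-feature mean and standard deviation."""
    means, stds = [], []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / len(column)
        variance = sum((x - mean) ** 2 for x in column) / len(column)
        means.append(mean)
        stds.append(math.sqrt(variance))
    return means, stds


def standardize(rows, means, stds, eps=1e-8):
    """Shift each feature to mean 0 and scale it to standard deviation 1."""
    return [[(x - m) / (s + eps) for x, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
test = [[2.0, 400.0]]

means, stds = column_stats(train)           # statistics from the training data only
train_norm = standardize(train, means, stds)
test_norm = standardize(test, means, stds)  # reuse the training statistics
print(train_norm)
print(test_norm)
```

The small `eps` term only guards against division by zero for constant features; it is an implementation detail of this sketch.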

## 4.11 Mitigating Overfitting

## 4.12 Reduce Model Size
- a model that is too small will not overfit, but it will underfit
- the simplest way to mitigate overfitting is to reduce the size of the model
- a model with limited memorization resources cannot simply memorize its training data
- a smaller model starts overfitting later than the reference model, and its performance degrades more slowly once it starts overfitting
- a bigger model starts overfitting almost immediately and overfits much more severely
- The more capacity the model has, the more quickly it can model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss).
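
One rough proxy for capacity is the number of trainable parameters. The sketch below counts weights and biases for stacks of fully connected layers; the input size and the three layer configurations are hypothetical, chosen only to contrast a smaller, a reference, and a bigger model.

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for units in layer_units:
        total += previous * units + units  # weight matrix plus bias vector
        previous = units
    return total


input_dim = 10000  # hypothetical input size
configs = [
    ("smaller", [4, 4, 1]),
    ("reference", [16, 16, 1]),
    ("bigger", [512, 512, 1]),
]
for name, units in configs:
    print(name, dense_param_count(input_dim, units))
```

Parameter count is only an approximation of memorization capacity, but it makes the trade-off concrete: the bigger configuration has over thirty times as many parameters as the reference one.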


## 4.13 Weight Regularization
- Simpler models are less likely to overfit than complex ones.
- A simple model 
  - a model where the distribution of parameter values has less entropy
- weight regularization
  - put constraints on the complexity of a model which makes the distribution of weight values more regular.
  - it’s done by adding to the loss function of the model a cost associated with having large weights
- Cost
  - L1 regularization
    - The cost added is proportional to the absolute value of the weight coefficients
  - L2 regularization
    - The cost added is proportional to the square of the value of the weight coefficients
    - L2 regularization is also called weight decay in the context of neural networks.
- When to use it
  - small deep learning models: weight regularization
  - large deep learning models: dropout is usually preferred
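
The two penalty terms can be written out directly. This sketch treats the weights as a flat list and uses hypothetical scaling factors (`l1`, `l2`); real frameworks apply the same idea per layer.

```python
def l1_penalty(weights, factor):
    """Cost proportional to the absolute value of the weights."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """Cost proportional to the square of the weights (weight decay)."""
    return factor * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Total loss: the original loss plus the weight penalties."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)


weights = [0.5, -2.0, 1.5]  # illustrative values
print(regularized_loss(0.30, weights, l2=0.001))
print(regularized_loss(0.30, weights, l1=0.01))
```

Because the penalty grows with the magnitude of the weights, minimizing the total loss pushes the optimizer toward smaller weight values.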

## 4.14 Dropout
- randomly dropping out (setting to zero) a number of output features of the layer during training
- The dropout rate is the fraction of the features that are zeroed out
- common ways to maximize generalization and prevent overfitting
  - Get more training data, or better training data.
  - Develop better features.
  - Reduce the capacity of the model.
  - Add weight regularization (for smaller models).
  - Add dropout.
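
Dropout itself is a small operation. The sketch below assumes the "inverted dropout" variant, in which the surviving features are scaled up by `1 / (1 - rate)` during training so that nothing needs to change at test time; the scaling step and the function signature are assumptions of this example, not something these notes specify.

```python
import random


def dropout(features, rate, training=True, rng=random):
    """Zero each feature with probability `rate` during training (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(features)  # at test time, nothing is dropped
    keep_prob = 1.0 - rate
    # Kept features are scaled up so the expected output matches the input.
    return [x / keep_prob if rng.random() >= rate else 0.0 for x in features]


layer_output = [0.2, 0.5, 1.3, 0.8, 1.1]
print(dropout(layer_output, rate=0.5, rng=random.Random(0)))
print(dropout(layer_output, rate=0.5, training=False))
```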

## 4.15 Universal Workflow

- The universal workflow of machine learning
  - Universal blueprint that we can use to attack and solve any machine learning problem
    - problem definition
    - evaluation
    - feature engineering
    - fighting overfitting
- Defining the problem and assembling a dataset
  - What will our input data be?
  - What are we trying to predict?
  - What type of problem are we facing?
    - Binary classification?
    - Multiclass classification?
    - Scalar regression?
    - Vector regression?
    - Multiclass, multilabel classification?
    - Something else, such as clustering, generation, or reinforcement learning?
- Choosing a measure of success
  - To control something, we need to be able to observe it. To achieve success, we must define what we mean by success: accuracy? Precision and recall? Customer-retention rate? Our metric for success will guide the choice of a loss function, that is, what our model will optimize. It should directly align with our higher-level goals, such as the success of our business.
- Deciding on an evaluation protocol
  - Once we know what we are aiming for, we must establish how we will measure our current progress. We have previously reviewed three common evaluation protocols:
    - Maintaining a hold-out validation set: the way to go when we have plenty of data
    - Doing K-fold cross-validation: the right choice when we have too few samples for hold-out validation to be reliable
    - Doing iterated K-fold validation: for performing highly accurate model evaluation when little data is available
  - For evaluation, just pick one of these; in most cases the first will work well enough
- Preparing our data
  - We should format our data in a way that can be fed into a machine learning model. Here, we assume a deep neural network
    - As we saw previously, our data should be formatted as tensors
    - The values taken by these tensors should usually be scaled to small values
    - If different features take values in different ranges, then the data should be normalized
    - We may want to do some feature engineering, especially for small-data problems
  - Once our tensors of input data and target data are ready, we can begin to train models
- Developing a model that does better than baseline

  - Before defining and training the model, we must define a baseline as our success criterion. If we cannot build a model that beats the baseline after many successive tries, the hypotheses we made when defining the problem may be false, and we need to go back to gathering input data that can actually predict the output
  - The following table can help us choose a last-layer activation and a loss function for a few common problem types

  | Problem Type                            | Last layer activation | Loss function              |
  | --------------------------------------- | --------------------- | -------------------------- |
  | Binary classification                   | sigmoid               | binary_crossentropy        |
  | Multiclass, single-label classification | softmax               | categorical_crossentropy   |
  | Multiclass, multi-label classification  | sigmoid               | binary_crossentropy        |
  | Regression to arbitrary values          | None                  | mse                        |
  | Regression to values between 0 and 1    | sigmoid               | mse or binary_crossentropy |
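
To make the table entries concrete, the activations and loss functions it names can be written out in plain Python. This is a sketch for a single sample, not the implementation any framework uses; the `eps` clipping value is an assumption added to keep the logarithms finite.

```python
import math


def sigmoid(x):
    """Squash a score into (0, 1); written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(target, prob, eps=1e-7):
    """Loss for one binary target (0 or 1) and one predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def categorical_crossentropy(one_hot_target, probs, eps=1e-7):
    """Loss for a one-hot target and a predicted probability distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))


def mse(targets, predictions):
    """Mean squared error over a list of values."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


probs = softmax([1.0, 2.0, 3.0])
print(probs)
print(categorical_crossentropy([0.0, 0.0, 1.0], probs))
print(binary_crossentropy(1.0, sigmoid(0.0)))
```

For multilabel classification, the table's pairing of sigmoid with binary_crossentropy amounts to applying `sigmoid` and `binary_crossentropy` to each label independently.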


- Scaling up: developing a model that overfits
    - Once we have built a model that satisfies the baseline criterion, we need to make it powerful enough to optimize accuracy without compromising generalization. Remember that the ideal model lies at the border between underfitting and overfitting. The easiest way to find that border is to cross it: make the model overfit, then apply what we know about fighting overfitting
    - To make model overfit
        - Add layers
        - Make the layers bigger
        - Train for more epochs

- Regularizing model and tuning hyperparameters
    - This step will take the most time: we repeatedly modify our model, train it, evaluate it on our validation data, modify it again, and repeat until the model is as good as it can get. These are some things to try:
        - Add dropout
        - Try different architectures: add or remove layers
        - Add L1 and/or L2 regularization
        - Try different hyperparameters to find the optimal configuration
        - Optionally, iterate on feature engineering: add new features, or remove features that don't seem to be informative

## 4.16 Summary