# Deep Learning

## 1. Introduction

To define deep learning and understand how it differs from other machine learning approaches, we first need some idea of what machine learning algorithms do. Machine learning requires three things:

- Input data points
- Examples of expected output
- Measure of Performance

A machine learning model transforms its input data into meaningful outputs, a process that is learned from exposure to known examples of inputs and outputs. Therefore, the central problem in machine learning and deep learning is to meaningfully transform data: in other words, to learn useful representations of the input data at hand. But what is a representation? At its core, it's a different way to look at data - to represent or encode data. For instance, a color image can be encoded in the RGB format or in the HSV format: these are two representations of the same data. Machine learning models are all about finding appropriate representations for their input data - transformations of the data that make it more amenable to the task at hand, such as a classification task.
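As a small illustration of two representations of the same data, the sketch below converts a single pixel between RGB and HSV using Python's standard `colorsys` module. The pixel value is an arbitrary example chosen here, not something taken from the text above.

```python
import colorsys

# A pure-red pixel, with each channel scaled to the range [0, 1].
rgb = (1.0, 0.0, 0.0)

# The same pixel re-encoded as (hue, saturation, value).
hsv = colorsys.rgb_to_hsv(*rgb)
print(hsv)  # (0.0, 1.0, 1.0)

# Nothing is lost: converting back recovers the original encoding.
rgb_again = colorsys.hsv_to_rgb(*hsv)
print(rgb_again)  # (1.0, 0.0, 0.0)
```

Which encoding is more useful depends on the task: making the pixel less saturated, for example, means changing a single number in HSV.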

*Machine learning is technically: searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal.*

**1.1 Deep Learning**

Deep learning is a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. The number of layers contributing to the model defines the *depth* of the model. The layered representations are learned via models called neural networks, structured in literal layers stacked on top of each other.

*Deep Learning is technically: a multistage way to learn data representations.*

**1.1.1 Understanding how Deep Learning works**

The specification of what a layer does to its input data is stored in the layer's *weights*, which in essence are a bunch of numbers. In technical terms, we'd say that the transformation implemented by a layer is *parameterized* by its weights. In this context, learning means finding a set of values for the weights of all layers in a network, such that the network will map example inputs to their associated targets.

To control the output of a neural network, you need to be able to measure how far that output is from what you expected. This is the job of the *loss function*, also called the *objective function*. The loss function takes the predictions of the network and the true targets and computes a distance score.

The fundamental trick in deep learning is to use this score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score for the current example. This adjustment is done by an *optimizer*, which implements what's called the *backpropagation* (backprop) algorithm.
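The feedback loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not how a deep learning framework implements it: the model has a single weight, the data and learning rate are made up, plain gradient descent stands in for the optimizer, and the gradient is worked out by hand rather than by backpropagation. All names are hypothetical.

```python
# Toy data generated by the rule y = 2 * x, so the ideal weight is 2.0.
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [2.0, 4.0, 6.0, 8.0]


def predict(weight, x):
    """A one-weight 'layer': its transformation is parameterized by `weight`."""
    return weight * x


def loss_fn(weight):
    """Mean squared error between predictions and targets (the loss score)."""
    errors = [(predict(weight, x) - y) ** 2 for x, y in zip(inputs, targets)]
    return sum(errors) / len(errors)


def loss_gradient(weight):
    """Derivative of the loss with respect to the weight, derived by hand."""
    grads = [2 * (predict(weight, x) - y) * x for x, y in zip(inputs, targets)]
    return sum(grads) / len(grads)


weight = 0.0          # arbitrary starting value
learning_rate = 0.05  # assumed step size
losses = []
for step in range(50):
    losses.append(loss_fn(weight))
    # Move the weight a little in the direction that lowers the loss.
    weight -= learning_rate * loss_gradient(weight)

print(round(weight, 4))  # 2.0
```

Each pass through the loop measures the loss and nudges the weight against the gradient, which is the same cycle a real network repeats over all of its weights.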

## 2. Fundamentals of Machine Learning

**2.1 Training, validation, and test sets**

**2.1.1 Simple Hold-Out Validation** Set apart some fraction of your data as your test set, train on the remaining data, and evaluate on the test set. If you tune the model based on evaluation results, also reserve a validation set so that the test set stays untouched. This is the simplest evaluation protocol, and it suffers from one flaw: if little data is available, then your validation and test sets may contain too few samples to be representative of the data.

**2.1.2 K-Fold Validation** With this approach, you split your data into K partitions of equal size. For each partition i, train a model on the remaining K - 1 partitions, and evaluate it on partition i. Your final score is then the average of the K scores obtained.

**2.1.3 Iterated K-Fold Validation with Shuffling** This one is for situations in which you have relatively little data available and you need to evaluate your model as precisely as possible. It consists of applying K-fold validation multiple times, shuffling the data every time before splitting it K ways. 
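K-fold validation and its iterated variant can be sketched in plain Python. The function names are hypothetical, and `mean_baseline_score` is only a stand-in for "train a model on the training partition, then evaluate it on the validation partition"; any real training routine would take its place.

```python
import random


def k_fold_splits(samples, k):
    """Yield (training, validation) lists for each of the k folds.

    If len(samples) is not divisible by k, the leftover samples at the end
    are always used for training and never for validation.
    """
    fold_size = len(samples) // k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        validation = samples[start:stop]
        training = samples[:start] + samples[stop:]
        yield training, validation


def iterated_k_fold(samples, k, iterations, score_fn, seed=0):
    """Run K-fold validation several times, reshuffling before each split."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        for training, validation in k_fold_splits(shuffled, k):
            scores.append(score_fn(training, validation))
    return sum(scores) / len(scores)


def mean_baseline_score(training, validation):
    """Hypothetical scorer: predict the training mean, return mean abs error."""
    prediction = sum(training) / len(training)
    return sum(abs(v - prediction) for v in validation) / len(validation)


data = [float(i) for i in range(10)]
folds = list(k_fold_splits(data, k=5))
final_score = iterated_k_fold(data, k=5, iterations=3,
                              score_fn=mean_baseline_score)
```

Iterating multiplies the cost: with `iterations=3` and `k=5`, the scorer runs 15 times.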

**2.2 Data Preprocessing, Feature Engineering, and Feature Learning**

**2.2.1 Data Preprocessing for Neural Networks** Data preprocessing aims at making the raw data at hand more amenable to neural networks. This includes vectorization, normalization, handling missing values, and feature extraction.

**Vectorization** All inputs and targets in a neural network must be tensors of floating-point data. Whatever data you need to process - sound, images, text - you must first turn it into tensors, a step called data vectorization.

**Value Normalization** To make learning easier for your network, your data should have the following characteristics:
- Take small values: typically, most values should be in the range 0 to 1
- Be homogeneous: all features should take values in roughly the same range

A stricter, common practice is to normalize each feature independently to have a mean of 0 and a standard deviation of 1.
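Per-feature normalization to a mean of 0 and a standard deviation of 1 can be sketched as follows. The data and function names are made up for illustration, and the sketch assumes no feature is constant (a standard deviation of 0 would cause a division by zero). The statistics are computed on the training data only and then reused for the test data.

```python
import statistics


def fit_standardizer(rows):
    """Compute the mean and standard deviation of each feature (column)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Subtract each feature's mean and divide by its standard deviation."""
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Two features on very different scales (hypothetical values).
train = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
test = [[2.0, 500.0]]

means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
# The test data is scaled with the training statistics, not its own.
test_scaled = standardize(test, means, stds)
```

After scaling, both training features have mean 0 and standard deviation 1, so neither dominates simply because of its units.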

**2.2.2 Feature Engineering** is the process of using your own knowledge about the data and about the machine learning algorithm at hand to make the algorithm work better by applying hardcoded (nonlearned) transformations to the data before it goes into the model.

**2.3 Overfitting and Underfitting**

The fundamental issue in machine learning is the tension between optimization and generalization. Optimization refers to the process of adjusting a model to get the best performance possible on the training data whereas generalization refers to how well the trained model performs on data it has never seen before. 

At the beginning of training, optimization and generalization are correlated: the lower the loss on training data, the lower the loss on test data. While this is happening, the model is said to underfit: there is still progress to be made; the network hasn't yet modeled all relevant patterns in the training data. But after a certain number of iterations on the training data, generalization stops improving, and validation metrics stall and then begin to degrade: the model is starting to overfit. It's beginning to learn patterns that are specific to the training data but are misleading or irrelevant when it comes to new data.

To prevent a model from learning misleading or irrelevant patterns found in the training data, the best solution is to get more training data. When that isn't possible, the next-best solution is to modulate the quantity of information that your model is allowed to store or to add constraints on what information it's allowed to store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well. The process of fighting overfitting this way is called regularization.

**2.3.1 Reducing the network's size** This is the simplest way to reduce overfitting. In deep learning, the number of learnable parameters in a model is often referred to as the model's capacity. A model with more parameters has more memorization capacity and therefore can easily learn a perfect dictionary-like mapping between training samples and their targets - a mapping without any generalization power. If the network has limited memorization resources, it won't be able to learn this mapping as easily; thus, in order to minimize its loss, it will have to resort to learning compressed representations that have predictive power regarding the targets - precisely the type of representations we are interested in.
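To make "capacity" concrete, the sketch below counts the learnable parameters of a stack of densely connected layers, where each layer has one weight per input-output pair plus one bias per output. The layer sizes are hypothetical, chosen only to compare a larger and a smaller network on the same input.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases for dense layers of the given sizes.

    `layer_sizes` lists the input size first, then each layer's output size.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Same input (10,000 features) and output (1 unit), different hidden sizes.
large = dense_parameter_count([10000, 16, 16, 1])
small = dense_parameter_count([10000, 4, 4, 1])

print(large)  # 160305
print(small)  # 40029
```

Shrinking the hidden layers from 16 to 4 units cuts the parameter count to roughly a quarter, which limits how much the network can memorize.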

**2.3.2 Adding Weight Regularization** The principle of Occam's razor states that given two explanations for something, the explanation most likely to be correct is the simplest one - the one that makes fewer assumptions. The idea also applies to neural networks: simpler models are less likely to overfit than complex ones. A simple model in this context is a model where the distribution of parameter values has less entropy. Thus, a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it's done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:
- L1 regularization: The cost added is proportional to the absolute value of the weight coefficient
- L2 regularization: The cost added is proportional to the square of the value of the weight coefficient. L2 is also called weight decay in the context of neural networks. 
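The two penalties can be written out directly. In this sketch the weights, the data loss, and the regularization rate of 0.01 are hypothetical values; the point is only to show how the extra cost is added to the loss.

```python
def l1_penalty(weights, rate):
    """Cost proportional to the absolute value of each weight."""
    return rate * sum(abs(w) for w in weights)


def l2_penalty(weights, rate):
    """Cost proportional to the square of each weight."""
    return rate * sum(w * w for w in weights)


weights = [0.5, -2.0, 1.5]  # hypothetical weight coefficients
data_loss = 0.30            # hypothetical loss before regularization

# The regularized loss is the data loss plus the weight penalty.
total_with_l1 = data_loss + l1_penalty(weights, rate=0.01)  # about 0.34
total_with_l2 = data_loss + l2_penalty(weights, rate=0.01)  # about 0.365
```

Because the L2 penalty squares each weight, the single large weight (-2.0) accounts for most of it, whereas L1 charges every weight in proportion to its magnitude.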

**2.3.3 Adding Dropout** Dropout, applied to a layer, consists of randomly dropping out a number of output features of the layer during training. The dropout rate is the fraction of features that are zeroed out; it's usually set between 0.2 and 0.5. 
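A minimal sketch of dropout on a list of output features follows. One detail is an assumption not stated above: the surviving features are scaled up by `1 / (1 - rate)` during training (often called inverted dropout) so that the expected magnitude of the output is unchanged, and nothing is dropped at inference time. The function name and feature values are hypothetical.

```python
import random


def dropout(features, rate, training, rng):
    """Zero out roughly a fraction `rate` of the features during training."""
    if not training:
        # At inference time every feature is kept unchanged.
        return list(features)
    keep_scale = 1.0 / (1.0 - rate)
    return [
        0.0 if rng.random() < rate else x * keep_scale
        for x in features
    ]


rng = random.Random(0)  # seeded only to make the sketch repeatable
features = [0.2, 0.5, 1.3, 0.8, 1.1, 0.4]

train_out = dropout(features, rate=0.5, training=True, rng=rng)
test_out = dropout(features, rate=0.5, training=False, rng=rng)
```

With a rate of 0.5, each value in `train_out` is either 0.0 or twice the corresponding input, while `test_out` equals the input.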

**2.4 Universal Workflow of Machine Learning**

**2.4.1 Defining the problem and assembling a dataset** First, you must define the problem at hand:
- What will your input data be? What are you trying to predict? You can only learn to predict something if you have available training data.
- What type of problem are you facing? Is it binary classification? Is it multiclass classification? Identifying the problem type will guide your choice of model architecture, loss function, and so on.

You can't move to the next stage until you know what your inputs and outputs are, and what data you'll use. Be aware of the hypotheses you make at this stage:
- You hypothesize that your outputs can be predicted given your inputs
- You hypothesize that your available data is sufficiently informative to learn the relationship between inputs and outputs. 

**2.4.2 Choosing a measure of success** Your metric for success will guide the choice of a loss function: what your model will optimize. For balanced classification problems, where every class is equally likely, accuracy and area under the receiver operating characteristic curve (ROC AUC) are common metrics. For class-imbalanced problems, you can use precision and recall. For ranking problems or multilabel classification, you can use mean average precision. 
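The sketch below shows why accuracy alone can mislead on class-imbalanced problems. The labels and predictions are invented for illustration, and `classification_metrics` is a hypothetical helper that treats label 1 as the positive class and reports 0.0 when a ratio is undefined.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Imbalanced data: only 2 of 10 samples are positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
mixed = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(classification_metrics(y_true, always_negative))
# {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0}
print(classification_metrics(y_true, mixed))
# {'accuracy': 0.8, 'precision': 0.5, 'recall': 0.5}
```

Both sets of predictions score 80% accuracy, yet the first never finds a positive sample; precision and recall expose the difference.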

**2.4.3 Deciding on an evaluation protocol** Once you know what you're aiming for, you must establish how you'll measure your current progress. The three common evaluation protocols are: 
- Maintaining a hold-out validation set
- Doing K-fold cross validation
- Doing iterated K-fold validation

**2.4.4 Preparing your data** 
- Your data should be formatted as tensors
- The values taken by these tensors should usually be scaled to small values
- Data should be normalized
- Feature engineering

**2.4.5 Developing a model that is better than a baseline** You need to make three key choices to build your first working model:
- Last layer activation 
- Loss Function
- Optimization configuration

**2.4.6 Developing a model that overfits** 
- Add layers
- Make the layers bigger
- Train for more epochs

**2.4.7 Regularizing your model and tuning your hyperparameters**
- Add dropout
- Try different architectures: add or remove layers
- Add L1 and/or L2 regularization
- Try different hyperparameters to find optimal configuration
- Optionally, iterate on feature engineering: add new features, or remove features that don't seem informative







## 3. Deep Learning for Computer Vision

**3.1 Convnets**

A convnet takes as input tensors of shape (image_height, image_width, image_channels), not including the batch dimension.

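```python
from keras.datasets import mnist
from keras.utils import to_categorical
from keras import layers
from keras import models

# Load MNIST: 60,000 training and 10,000 test images of 28x28 grayscale digits.
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Add a channel axis and scale pixel values from [0, 255] to [0, 1].
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype("float32") / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype("float32") / 255

# One-hot encode the digit labels (10 classes).
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Convolutional base: alternating Conv2D and MaxPooling2D layers.
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation="relu"))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation="relu"))

# Classifier on top: flatten the feature maps and feed them to dense layers.
model.add(layers.Flatten())
model.add(layers.Dense(64, activation="relu"))
# softmax turns the 10 outputs into a probability distribution over the digits,
# which is what categorical_crossentropy expects.
model.add(layers.Dense(10, activation="softmax"))

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit(train_images, train_labels, epochs=5, batch_size=64)

# Evaluate on the held-out test set and report accuracy as a percentage.
test_loss, test_acc = model.evaluate(test_images, test_labels)
print(test_acc * 100)
```

Trained with `relu` instead of `softmax` as the activation of the final layer, this same network reached a test accuracy of only 9.8% (9.799999743700027), which is no better than always guessing the same digit. `categorical_crossentropy` expects the final layer to output a probability distribution over the 10 classes, and `softmax` is what provides it.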

**3.2 The Convolution Operation**

The fundamental difference between a densely connected layer and a convolution layer is this: dense layers learn global patterns in their input feature space, whereas convolution layers learn local patterns. In the case of images, these are patterns found in small 2D windows of the inputs.

This key characteristic gives convnets two interesting properties:
- The patterns they learn are translation invariant: after learning a certain pattern in the lower-right corner of a picture, a convnet can recognize it anywhere, for example in the upper-left corner.
- They can learn spatial hierarchies of patterns: a first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layer, and so on.

Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channel axis). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map.
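The patch-by-patch transformation can be sketched for the simplest case: a single-channel input and a single kernel, with stride 1 and no padding. These restrictions, the function name, and the example values are assumptions made for brevity; a real convolution layer takes patches that span all input channels, learns one kernel per output channel, and adds a bias and an activation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and return the map of responses."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Extract the patch at (i, j) and apply the same kernel to it.
            response = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(response)
        output.append(row)
    return output


# A 4x4 image that is dark (0) on the left and bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The response is non-zero only where the window straddles the edge, and it is the same in every row because the same kernel is applied to every patch.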

Convolutions are defined by two key parameters:
- Size of the patches extracted from the inputs
- Depth of the output feature map
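These two parameters determine the shape of each output feature map. The sketch below traces the shapes through the layer sizes of the MNIST model in section 3.1, assuming stride 1 and no padding for convolutions and a pooling stride equal to the window size (the Keras defaults). The helper names are hypothetical.

```python
def conv_output_shape(input_shape, patch_size, output_depth):
    """Shape after a convolution with stride 1 and no padding."""
    height, width, _ = input_shape
    patch_h, patch_w = patch_size
    return (height - patch_h + 1, width - patch_w + 1, output_depth)


def pool_output_shape(input_shape, window_size):
    """Shape after pooling with a stride equal to the window size."""
    height, width, depth = input_shape
    window_h, window_w = window_size
    return (height // window_h, width // window_w, depth)


shape = (28, 28, 1)  # one grayscale MNIST image
shapes = []
shape = conv_output_shape(shape, (3, 3), 32)
shapes.append(shape)
shape = pool_output_shape(shape, (2, 2))
shapes.append(shape)
shape = conv_output_shape(shape, (3, 3), 64)
shapes.append(shape)
shape = pool_output_shape(shape, (2, 2))
shapes.append(shape)
shape = conv_output_shape(shape, (3, 3), 64)
shapes.append(shape)

print(shapes)
# [(26, 26, 32), (13, 13, 32), (11, 11, 64), (5, 5, 64), (3, 3, 64)]

flattened_size = shape[0] * shape[1] * shape[2]
print(flattened_size)  # 576
```

The spatial size shrinks at every step while the depth grows, so the dense classifier receives 576 values per image.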

**Max Pooling** consists of extracting windows from the input feature maps and outputting the max value of each channel. Pooling is used to reduce the number of feature-map coefficients to process, as well as to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows.
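Max pooling on a single channel can be sketched as follows, assuming 2x2 windows with a stride of 2 so that the windows do not overlap. The function name and input values are made up for illustration.

```python
def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            window = (
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4x4 map is reduced to 2x2, a quarter of the coefficients, while the strongest response in each window is kept.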