# **Deep Learning with Python**

- Machine learning algorithms come up with rules for processing data by learning from inputs and their corresponding outputs.
- Machine learning consists of automatically finding appropriate transformations that turn the input data into useful representations, ones that get us closer to the expected output.
- *Learning*, in the context of machine learning, describes an automatic search process for better representations.

## **Fundamentals of Deep Learning**

- Deep learning is a subset of machine learning: a mathematical framework for learning representations from data.
- It emphasizes a different take on learning representations from data: learning successive layers of increasingly meaningful representations. (Extracting useful representations is like feature engineering.)
- The neural network transforms the input data into representations that are increasingly different from the original data and increasingly informative about the final result.
- You can think of a deep network as a multistage information-distillation operation, where information goes through successive filters and comes out increasingly purified (that is, useful with regard to some task).

### How it works

- The layers in a neural network transform data into useful representations that bring the model closer to the output.
- Transformations applied to the input data are parameterized by the layer's weights.
- The initial weights assigned to a layer are essentially random, so the initial transformations are random too and result in a high loss score.
- To control its output, the neural network makes predictions from the data, compares them with the true targets, and calculates a loss score through a loss function (also called the objective function).
- The loss function calculates the distance between the predictions and the true targets.
- The loss score is then used as a feedback signal to adjust the weights toward values that lower the loss score (predictions closer to the true targets). This is the job of the optimizer, which relies on backpropagation to compute how each weight should change. Repeating this adjustment over many batches is the training loop.
- A network with a minimal loss is one for which the outputs are as close as they can be to the targets: a trained network.

- Deep learning is more effective than previous machine learning approaches for two reasons: increasingly complex representations are developed incrementally, layer by layer, and these intermediate representations are learned jointly, with each layer updated to follow both the representational needs of the layer above and the needs of the layer below.
- Neural networks automate feature engineering. The model learns all features jointly, so when one feature is adjusted, every feature that depends on it adjusts automatically, all toward the same goal of improving model performance.

### Building the network

- The core building block of a neural network is the layer.
- Layers extract representations out of the data fed into them.
- Most of deep learning consists of chaining together simple layers that will implement a form of progressive data distillation. A deep-learning model is like a sieve for data processing, made of a succession of increasingly refined data filters—the layers.

#### Data Representations for Neural Networks

- *Tensors* are containers for data, usually numerical data. They are like NumPy arrays with several axes.
> ~ Vector data— 2D tensors of shape (samples, features)<br>
  ~ Timeseries data or sequence data— 3D tensors of shape (samples, timesteps, features)<br>
  ~ Images— 4D tensors of shape (samples, height, width, channels) or (samples, channels, height, width)<br>
  ~ Video— 5D tensors of shape (samples, frames, height, width, channels) or (samples, frames, channels, height, width)<br>
- A *scalar* is a tensor that contains only one number (also called a scalar tensor, 0-dimensional tensor, or 0D tensor).
- A *vector* is a one-dimensional array of numbers (a 1D tensor), like a one-axis NumPy array.
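
The tensor ranks listed above can be checked directly in NumPy. This is a small sketch; the concrete sizes (100 samples, 28×28 images, and so on) are arbitrary values chosen for illustration.

```python
import numpy as np

# 0D tensor (scalar): a single number, no axes.
scalar = np.array(12)

# 1D tensor (vector): one axis.
vector = np.array([12, 3, 6, 14])

# 2D tensor of vector data: (samples, features).
vector_data = np.zeros((100, 8))

# 3D tensor of timeseries data: (samples, timesteps, features).
timeseries = np.zeros((100, 20, 8))

# 4D tensor of images, channels-last: (samples, height, width, channels).
images = np.zeros((100, 28, 28, 3))

# 5D tensor of video: (samples, frames, height, width, channels).
video = np.zeros((4, 10, 28, 28, 3))

tensors = {
    "scalar": scalar,
    "vector": vector,
    "vector_data": vector_data,
    "timeseries": timeseries,
    "images": images,
    "video": video,
}
for name, tensor in tensors.items():
    # ndim is the number of axes; shape gives the size along each axis.
    print(f"{name}: ndim={tensor.ndim}, shape={tensor.shape}")
```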

#### Gradient Based Optimization

- Weights in a deep learning model are first assigned random numbers, which is called *random initialization*, then they are gradually adjusted based on the feedback signal.
- Below is the deep learning training loop:
> Draw a batch of training samples x and corresponding targets y.<br>
Run the network on x (a step called the forward pass) to obtain predictions y_pred.<br>
Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.<br>
Update all weights of the network in a way that slightly reduces the loss on this batch.
- Gradient-based optimization uses the derivative (slope) of the loss with respect to each weight: its sign tells you whether to increase or decrease the weight, and its magnitude indicates how strongly the loss responds to that change.

##### Mini-batch stochastic gradient descent

- Draw a batch of training samples x and corresponding targets y.
- Run the network on x to obtain predictions y_pred.
- Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
- Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
- Move the parameters a little in the opposite direction from the gradient (for example, W -= step * gradient), thus reducing the loss on the batch a bit.
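
The five steps above can be sketched in NumPy for the simplest possible "network", a single linear unit. Everything specific here is an assumption for illustration: the toy data (y = 3x + 2 plus noise), the mean-squared-error loss, the step size and the batch size. The gradients are written out by hand rather than computed by backpropagation through several layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed for illustration): y = 3x + 2 plus small noise.
x = rng.uniform(-1.0, 1.0, size=(256, 1))
y = 3.0 * x + 2.0 + 0.01 * rng.normal(size=(256, 1))

# Random initialization of the weight; the bias starts at zero.
W = rng.normal(size=(1, 1))
b = np.zeros((1,))

step = 0.1        # learning rate
batch_size = 32


def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return float(np.mean((pred - target) ** 2))


initial_loss = mse(x @ W + b, y)

for epoch in range(50):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        x_batch, y_batch = x[idx], y[idx]               # 1. draw a batch
        y_pred = x_batch @ W + b                        # 2. forward pass
        error = y_pred - y_batch                        # 3. loss is mean(error ** 2)
        grad_W = 2.0 * x_batch.T @ error / len(idx)     # 4. gradient of the loss w.r.t. W
        grad_b = 2.0 * error.mean(axis=0)               #    and w.r.t. b
        W -= step * grad_W                              # 5. move against the gradient
        b -= step * grad_b

final_loss = mse(x @ W + b, y)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

In a real deep network the gradient step is the same; only step 4 changes, because the gradients are obtained by backpropagation through all layers.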

#### Model Layers

- Vector data (2D: samples, features) is usually processed by densely connected layers (Dense).
- Sequence data (3D: samples, timesteps, features) is processed by recurrent layers (e.g. LSTM).
- Image data (4D) is processed by 2D convolution layers (Conv2D).
- Consecutive layers have to be compatible: the input shape a layer expects should match the output shape of the previous layer.
- For the first layer, you specify the number of output units and then the input shape.
- You don't specify the input shape of subsequent layers, because each one infers it from the output shape of the previous layer.
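
The shape-compatibility rule can be illustrated without any deep learning library. The sketch below uses a hypothetical helper, `build_dense_stack`, that is not part of Keras; it only mimics how each layer's input size is taken from the previous layer's output size. The layer sizes (784, 32, 10) are assumed values.

```python
import numpy as np


def build_dense_stack(input_dim, units_per_layer, seed=0):
    """Create one (W, b) pair per layer.

    Only the first layer needs an explicit input size; every later layer
    takes its input size from the previous layer's number of units.
    """
    rng = np.random.default_rng(seed)
    params = []
    in_dim = input_dim
    for units in units_per_layer:
        W = rng.normal(scale=0.1, size=(in_dim, units))
        b = np.zeros(units)
        params.append((W, b))
        in_dim = units  # the next layer's input shape is this layer's output shape
    return params


params = build_dense_stack(input_dim=784, units_per_layer=[32, 10])
weight_shapes = [W.shape for W, _ in params]

# Push a batch of 5 samples through the stack to confirm the shapes line up.
x = np.zeros((5, 784))
for W, b in params:
    x = np.maximum(x @ W + b, 0.0)  # dense layer followed by relu
output_shape = x.shape

print(weight_shapes, output_shape)
```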

#### Loss functions and Optimizers

- Choosing the right loss (objective) function for a problem is important: the network minimizes whatever you give it, so the loss has to correlate with success on the task at hand for the model to serve the purpose you're creating it for.
- For instance, you’ll use binary crossentropy for a two-class classification problem, categorical crossentropy for a many-class classification problem, mean-squared error for a regression problem, connectionist temporal classification (CTC) for a sequence-learning problem, and so on.
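
To make three of these losses concrete, here is a minimal NumPy sketch of their textbook definitions. The function names mirror the loss names used above, but these are illustrative reimplementations, not the library's own code; the `eps` clipping value is an assumption added to avoid `log(0)`.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Two-class loss: y_true holds 0/1 labels, y_pred holds probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))


def categorical_crossentropy(y_true_one_hot, y_pred, eps=1e-7):
    """Many-class loss: one-hot targets against per-class probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true_one_hot * np.log(y_pred), axis=1)))


def mean_squared_error(y_true, y_pred):
    """Regression loss: average squared difference."""
    return float(np.mean((y_true - y_pred) ** 2))


targets = np.array([1.0, 0.0])
good = binary_crossentropy(targets, np.array([0.9, 0.1]))  # confident and right
bad = binary_crossentropy(targets, np.array([0.1, 0.9]))   # confident and wrong

multi = categorical_crossentropy(
    np.array([[0.0, 1.0, 0.0]]), np.array([[0.2, 0.7, 0.1]])
)
regression = mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0]))

print(good, bad, multi, regression)
```

Predictions that are confidently wrong are penalized far more heavily than predictions that are confidently right, which is what lets the loss act as a useful feedback signal.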

#### Building your network

Having 16 hidden units means the weight matrix W will have shape (input_dimension,16) : the dot product with W will project the input data onto a 16-dimensional representation space (and then you’ll add the bias vector b and apply the relu operation). You can intuitively understand the dimensionality of your representation space as “how much freedom you’re allowing the network to have when learning internal representations.” Having more hidden units (a higher-dimensional representation space) allows your network to learn more-complex representations, but it makes the network more computationally expensive and may lead to learning unwanted patterns (patterns that will improve performance on the training data but not on the test data).
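
A layer with 16 hidden units can be sketched in NumPy as follows. The input dimension (100) and batch size (8) are assumed values for illustration; the computation itself is the one described above, a dot product with W, plus the bias b, followed by relu.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dimension = 100  # assumed feature count, for illustration only
hidden_units = 16

# Weight matrix of shape (input_dimension, 16) and a bias vector of length 16.
W = rng.normal(scale=0.1, size=(input_dimension, hidden_units))
b = np.zeros(hidden_units)


def relu(z):
    """Zero out negative values, keep positive values unchanged."""
    return np.maximum(z, 0.0)


def dense(x):
    """Project x onto the 16-dimensional representation space."""
    return relu(np.dot(x, W) + b)


batch = rng.normal(size=(8, input_dimension))  # 8 samples
hidden = dense(batch)
print(hidden.shape)
```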

#### Why are activation functions important?

- Without an activation function like relu (also called a non-linearity), the Dense layer would consist of two linear operations—a dot product and an addition:
output = dot(W, input) + b
- So the layer could only learn linear transformations (affine transformations) of the input data: the hypothesis space of the layer would be the set of all possible linear transformations of the input data into a 16-dimensional space. Such a hypothesis space is too restricted and wouldn’t benefit from multiple layers of representations, because a deep stack of linear layers would still implement a linear operation: adding more layers wouldn’t extend the hypothesis space.
- In order to get access to a much richer hypothesis space that would benefit from deep representations, you need a non-linearity, or activation function. 
- relu is the most popular activation function in deep learning, but there are many other candidates, which all come with similarly strange names: prelu, elu, and so on.
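
The claim that a stack of linear layers is still a single linear operation can be verified numerically. In this sketch the layer sizes and random weights are arbitrary; the point is that two layers without an activation collapse into one affine map, while inserting relu between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))  # 4 samples, 5 features

W1, b1 = rng.normal(size=(5, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 3)), rng.normal(size=3)

# Two stacked layers with no activation function.
two_linear = (x @ W1 + b1) @ W2 + b2

# The same mapping written as one affine layer.
W_single = W1 @ W2
b_single = b1 @ W2 + b2
one_linear = x @ W_single + b_single

collapses = bool(np.allclose(two_linear, one_linear))

# With relu between the layers, the result is no longer that affine map.
with_relu = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
differs = not np.allclose(with_relu, one_linear)

print(collapses, differs)
```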

- In a binary classification problem (two output classes), your network should end with a Dense layer with one unit and a sigmoid activation: the output of your network should be a scalar between 0 and 1, encoding a probability.
- With such a scalar sigmoid output on a binary classification problem, the loss function you should use is binary_crossentropy.
- The rmsprop optimizer is generally a good enough choice, whatever your problem. That’s one less thing for you to worry about.
- As they get better on their training data, neural networks eventually start overfitting and end up obtaining increasingly worse results on data they’ve never seen before. Be sure to always monitor performance on data that is outside of the training set.

### Multiclass Classification loss function

- In a multiclass classification problem, the loss function categorical_crossentropy expects labels in a categorical (one-hot encoded) format, which you can produce with the to_categorical() function.
- When you keep your labels as integers instead (for example, by casting them to an integer array with np.asarray()), use sparse_categorical_crossentropy.
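
The two label formats lead to the same loss value; only the interface differs. The sketch below uses a hypothetical `one_hot` helper as a stand-in for to_categorical(), and hand-written loss expressions rather than the library's implementations. The labels and predicted probabilities are made up for illustration.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Hypothetical stand-in for to_categorical(): integer labels to one-hot rows."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


labels = np.asarray([0, 2, 1, 2])            # integer labels
one_hot_labels = one_hot(labels, num_classes=3)

# Made-up predicted class probabilities, one row per sample.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Categorical crossentropy: one-hot targets select the log-probability of the true class.
categorical_loss = -np.mean(np.sum(one_hot_labels * np.log(probs), axis=1))

# Sparse categorical crossentropy: integer targets index the same entries directly.
sparse_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(categorical_loss, sparse_loss)
```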

#### Building your network

- In general, the less training data you have, the worse overfitting will be, and using a small network is one way to mitigate overfitting.
- In scalar regression (a regression where you’re trying to predict a single continuous value), applying an activation function to the last layer would constrain the range the output can take; for instance, with a sigmoid activation on the last layer, the network could only learn to predict values between 0 and 1. The last layer is therefore left without an activation.
- When you have a small dataset, instead of splitting off a single validation set you can use K-fold cross-validation.

### Regression Models

In regression models:
- Regression is done using different loss functions than what we used for classification. Mean squared error (MSE) is a loss function commonly used for regression.
- Similarly, evaluation metrics to be used for regression differ from those used for classification; naturally, the concept of accuracy doesn’t apply for regression. A common regression metric is mean absolute error (MAE).
- When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.
- When there is little data available, using K-fold validation is a great way to reliably evaluate a model.
- When little training data is available, it’s preferable to use a small network with few hidden layers (typically only one or two), in order to avoid severe overfitting.
- If your data is divided into many categories, you may cause information bottlenecks if you make the intermediate layers too small.
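
Independent feature scaling, mentioned in the list above, is commonly done by subtracting each feature's mean and dividing by its standard deviation. The sketch below assumes that approach and uses a tiny made-up dataset; note that the statistics are computed on the training data only and then reused for the test data.

```python
import numpy as np

# Made-up data: two features with very different ranges.
train = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])
test = np.array([[2.0, 200.0]])

# Per-feature statistics (axis=0 reduces over samples), from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse training statistics; never recompute on test data

print(train_scaled)
print(test_scaled)
```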

## **Fundamentals of Machine Learning**

- Overfitting is the main obstacle to creating machine learning models that generalize well on never-before-seen data.

### Evaluating Machine Learning Models

- In model evaluation, it is important to split your data into three parts: the training set, the validation set and the test set.
- Train on the training set, tune hyperparameters against the validation set, and use the test set only once at the end to measure model performance. Repeated tuning against the validation set leaks information about it into the model, which can result in a model that performs artificially well on validation data and poorly on new data.

#### Classic Evaluation Techniques

##### *Simple hold-out validation*

- Involves simply setting apart a fraction of your data to be used as the test set. Train on the rest of the data then evaluate on the test set.
- Tuning your model on the test set will lead to information leaks, and you should therefore also set aside a validation set.

- Flaw: if little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand. This is easy to recognize: if different random shuffling rounds of the data before splitting end up yielding very different measures of model performance, then you’re having this issue. K-fold validation and iterated K-fold validation are two ways to address this.
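
A hold-out split into training, validation and test sets might look like the sketch below. The helper name `hold_out_split` and the 60/20/20 proportions are assumptions chosen for illustration, and the shuffle is only appropriate when the data is not ordered in time.

```python
import numpy as np


def hold_out_split(data, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the data, then carve off test and validation sets; the rest is for training."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    n_test = int(len(data) * test_fraction)
    n_val = int(len(data) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


data = np.arange(100)  # stand-in for 100 samples
train, validation, test = hold_out_split(data)
print(len(train), len(validation), len(test))
```

Running the split with several different `seed` values and comparing the resulting scores is one way to spot the flaw described above.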

##### *K-Fold Validation*

- Split your data into K partitions of equal size. 
- For each partition i, train a model on the remaining K − 1 partitions, and evaluate it on partition i.
- Your final score is then the average of the K scores obtained.
- This method is helpful when the performance of your model shows significant variance based on your train-test split.
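
The K-fold procedure can be sketched as follows. Both helpers are hypothetical: `k_fold_indices` produces the index splits, and `evaluate_fold` is a placeholder for "train a model, then score it" (here it simply predicts the training mean and reports the validation MAE). The toy targets are made up.

```python
import numpy as np


def k_fold_indices(num_samples, k):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    folds = np.array_split(np.arange(num_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


def evaluate_fold(train_idx, val_idx, y):
    """Placeholder model: predict the training mean, return validation MAE."""
    prediction = y[train_idx].mean()
    return float(np.mean(np.abs(y[val_idx] - prediction)))


y = 2.0 * np.arange(20, dtype=float)  # made-up regression targets

scores = [evaluate_fold(tr, va, y) for tr, va in k_fold_indices(len(y), k=4)]
final_score = float(np.mean(scores))  # average of the K scores
print(scores, final_score)
```

`np.array_split` is used so that the sketch still works when the number of samples is not an exact multiple of K; the folds then differ in size by at most one sample.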

##### *Iterated K-Fold validation with shuffling*

- Consists of applying K-fold validation multiple times, shuffling the data every time before splitting it K ways. The final score is the average of the scores obtained at each run of K-fold validation.
- Note that you end up training and evaluating P × K models (where P is the number of iterations you use), which can be very expensive.
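
A minimal sketch of the iterated variant, assuming a hypothetical generator `iterated_k_fold` and the same kind of placeholder "model" (predict the training mean, score with MAE). With P = 2 iterations and K = 3 folds it produces P × K = 6 scores.

```python
import numpy as np


def iterated_k_fold(num_samples, k, iterations, seed=0):
    """Yield (train_idx, val_idx) for every fold of every shuffled repetition."""
    rng = np.random.default_rng(seed)
    for _ in range(iterations):
        shuffled = rng.permutation(num_samples)  # reshuffle before each K-way split
        folds = np.array_split(shuffled, k)
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate(folds[:i] + folds[i + 1:])
            yield train_idx, val_idx


y = np.arange(12, dtype=float)  # made-up targets

scores = []
for train_idx, val_idx in iterated_k_fold(len(y), k=3, iterations=2):
    prediction = y[train_idx].mean()  # placeholder for training a real model
    scores.append(float(np.mean(np.abs(y[val_idx] - prediction))))

final_score = float(np.mean(scores))  # average over all P x K evaluations
print(len(scores), final_score)
```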

Things to consider when choosing an evaluation method:
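- Data representativeness (randomly shuffle data before splitting, unless it's time series data)
- Arrow of time (when predicting the future from the past, do not shuffle before splitting, and make sure all test data comes after the training data)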
- Redundancy in data (ensure data points don't appear in both the train and evaluation data)


### Data preprocessing, feature engineering and feature learning

#### Data preprocessing for neural networks