# 1 - What is Deep Learning?



In the past few years, **artificial intelligence** (AI) has been a subject of intense media hype. **Machine learning**, **deep learning**, and **AI** come up in countless articles, often outside of technology-minded publications. We’re promised a future of intelligent chatbots, self-driving cars, and virtual assistants—a future sometimes painted in a grim light and other times as utopian, where human jobs will be scarce and most economic activity will be handled by robots or AI agents.

So let’s tackle these questions: 
- What has deep learning achieved so far? 
- How significant is it? 
- Where are we headed next? 
- Should you believe the hype?


<img width="300" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1r_LkSAgQOBAJMecKg-B9g7DnMR6e-otN">


## 1.1 Artificial intelligence, machine learning, and deep learning



### 1.1.1 Artificial intelligence

A concise definition of the field would be as follows: **the effort to automate intellectual tasks normally performed by humans**. As such, AI is a general field that encompasses machine learning and deep learning, but that also includes many more approaches that don’t involve any learning. Early chess programs, for instance, only involved hardcoded rules crafted by programmers, and didn’t qualify as machine learning. For a fairly long time, many experts believed that human-level artificial intelligence could be achieved by having programmers handcraft a sufficiently large set of explicit rules for manipulating knowledge. This approach is known as **symbolic AI**, and it was the dominant paradigm in AI from the 1950s to the late 1980s. It reached its peak popularity during the **expert systems boom** of the 1980s.

Although symbolic AI proved suitable to solve well-defined, logical problems, such as playing chess, it turned out to be intractable to figure out explicit rules for solving more complex, fuzzy problems, such as image classification, speech recognition, and language translation. A new approach arose to take symbolic AI’s place: **machine learning**.

### 1.1.2 Machine Learning

Machine learning arises from this question: **could a computer go beyond** “what we know how to order it to perform” and learn on its own how to perform a specified task? **Could a computer surprise us**? Rather than programmers crafting data-processing rules by hand, **could a computer automatically learn these rules by looking at data**?

This question opens the door to a new programming paradigm. In classical programming, the paradigm of symbolic AI, humans input rules (a program) and data to be processed according to these rules, and out come answers (see figure 1.2). With machine learning, humans input data as well as the answers expected from the data, and out come the rules. These rules can then be applied to new data to produce original answers.


<img width="400" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1ZQ0vk8lzgPcIrYPS8iKL1zuiTpLV9yjz">

**A machine-learning system is trained rather than explicitly programmed.**
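
The contrast between the two paradigms can be sketched in a few lines of plain Python. This is only an illustration under made-up assumptions: the temperature data, the `handwritten_rule` and `learn_threshold` names, and the idea of reducing a "rule" to a single threshold are all hypothetical and do not come from any library.

```python
# Classical programming: a human writes the rule.
def handwritten_rule(temperature_c):
    """Hypothetical hand-coded rule: anything above 30 degrees Celsius is 'hot'."""
    return temperature_c > 30.0


# Machine learning: the rule (here, a single threshold) is derived from
# data plus the answers expected for that data.
def learn_threshold(values, answers):
    """Return the threshold that classifies the most examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(values):
        correct = sum((v >= candidate) == a for v, a in zip(values, answers))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Made-up training data: temperatures and whether a person called them hot.
temperatures = [12.0, 18.5, 22.0, 27.0, 29.5, 31.0, 35.0]
is_hot = [False, False, False, True, True, True, True]

threshold = learn_threshold(temperatures, is_hot)
print("learned threshold:", threshold)

# The learned rule can now be applied to new data.
print("33.0 is hot:", 33.0 >= threshold)
```

Note that the learned threshold differs from the hand-written one: it reflects what the data says, not what the programmer assumed.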

Although machine learning only started to flourish in the 1990s, it has quickly become the most popular and most successful subfield of AI, **a trend driven by the availability of faster hardware and larger datasets**. Machine learning is tightly related to mathematical statistics, but it differs from statistics in several important ways. Unlike statistics, machine learning tends to deal with large, complex datasets (such as a dataset of millions of images, each consisting of tens of thousands of pixels) for which classical statistical analysis such as Bayesian analysis would be impractical. As a result, machine learning, and especially deep learning, exhibits comparatively little mathematical theory—maybe too little—and is engineering oriented. It’s a hands-on discipline in which ideas are proven empirically more often than theoretically.

So that’s what machine learning is, technically: **searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal**. This simple idea allows for solving a remarkably broad range of intellectual tasks, from speech recognition to autonomous car driving.

### 1.1.3 The “deep” in deep learning

Deep learning is a specific subfield of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. The deep in deep learning isn’t a reference to any kind of deeper understanding achieved by the approach; rather, it stands for this idea of **successive layers of representations**.

Meanwhile, other approaches to machine learning tend to focus on learning only one or two layers of representations of the data; hence, they’re sometimes called **shallow learning**.

**What do the representations learned by a deep-learning algorithm look like**? Let’s examine how a network several layers deep transforms an image of a digit in order to recognize what digit it is.

As you can see in figure 1.6, the network transforms the digit image into representations that are increasingly different from the original image and increasingly informative about the final result. You can think of a deep network as a multistage information-distillation operation, where information goes through successive filters and comes out increasingly purified (that is, useful with regard to some task).


<img width="600" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1C2r5nhH0c4cl_6rSvT0OVhBlOVwh89dt">

So that’s what deep learning is, technically: **a multistage way to learn data representations.** 

### 1.1.4 Understanding how deep learning works


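What a layer does to its input data is stored in the layer’s **weights**, which are in essence a bunch of numbers; learning means finding weight values that map inputs to the right targets. To control the output of a neural network, you need to be able to measure how far that output is from what you expected. This is the job of the network’s **loss function**, which takes the predictions of the network and the true targets and computes a distance score. The fundamental trick in deep learning is to use this score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score for the current example (see figure 1.9). This adjustment is the job of the optimizer, which implements what’s called the **backpropagation algorithm**: the central algorithm in deep learning. The next chapter explains in more detail how backpropagation works.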


<img width="400" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1aHb4jnSup4p1jQtHEERNpQo1cz33LBQj">

Initially, the weights of the network are assigned random values, so the network merely implements a series of random transformations. Naturally, its output is far from what it should ideally be, and the loss score is accordingly very high. But with every example the network processes, the weights are adjusted a little in the correct direction, and the loss score decreases. This is the training loop, which, repeated a sufficient number of times (typically tens of iterations over thousands of examples), yields weight values that minimize the loss function. **A network with a minimal loss is one for which the outputs are as close as they can be to the targets: a trained network.**

## 1.2. Before deep learning: a brief history of machine learning

### 1.2.1 Probabilistic modeling

Probabilistic modeling is the **application of the principles of statistics** to data analysis. It was one of the earliest forms of machine learning, and it’s still widely used to this day. One of the best-known algorithms in this category is the **Naive Bayes algorithm**.

A closely related model is the **logistic regression** (logreg for short), which is sometimes considered to be the **“hello world” of modern machine learning**. Don’t be misled by its name—logreg is a classification algorithm rather than a regression algorithm. 

Much like **Naive Bayes**, **logreg** predates computing by a long time, yet it’s still useful to this day, thanks to its simple and versatile nature. It’s often the first thing a data scientist will try on a dataset to get a feel for the classification task at hand.

### 1.2.2. Early neural networks
Although the core ideas of neural networks were investigated in toy forms as early as the **1950s**, the approach took decades to get started. For a long time, **the missing piece was an efficient way to train large neural networks**. 

This changed in the **mid-1980s**, when multiple people independently rediscovered the **Backpropagation algorithm** -- a way to train chains of parametric operations using gradient-descent optimization -- and started applying it to neural networks.

**The first successful practical application of neural nets came in 1989 from Bell Labs**, when [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) combined the earlier ideas of convolutional neural networks and backpropagation, and applied them to the problem of classifying handwritten digits. The resulting network, dubbed LeNet, was used by the United States Postal Service in the 1990s to automate the reading of ZIP codes on mail envelopes.

### 1.2.3. Kernel methods

As neural networks started to gain some respect among researchers in the 1990s, thanks to this first success, a new approach to machine learning rose to fame and quickly sent neural nets back to oblivion: **kernel methods**. Kernel methods are a **group of classification algorithms**, the best known of which is the **support vector machine (SVM)**. The modern formulation of an SVM was developed by Vladimir Vapnik and Corinna Cortes in the early **1990s** at Bell Labs and published in **1995**, although an older linear formulation was published by Vapnik and Alexey Chervonenkis as early as **1963**.

The technique of mapping data to a high-dimensional representation where a classification problem becomes simpler may look good on paper, but in practice it’s often computationally intractable.

At the time they were developed, SVMs exhibited state-of-the-art performance on simple classification problems and were one of the few machine-learning methods backed by extensive theory and amenable to serious mathematical analysis, making them well understood and easily interpretable. Because of these useful properties, **SVMs became extremely popular in the field for a long time**.

**But SVMs proved hard to scale to large datasets and didn’t provide good results for perceptual problems such as image classification**. Because an SVM is a shallow method, applying an SVM to perceptual problems requires first extracting useful representations manually (a step called **feature engineering**), which is difficult and brittle.

### 1.2.4. Decision trees, random forests, and gradient boosting machines

**Decision trees** are flowchart-like structures that let you classify input data points or predict output values given inputs. They’re **easy to visualize and interpret**. Decision trees learned from data began to receive significant research interest in the 2000s, and by 2010 they were often preferred to kernel methods.


In particular, the **Random Forest** algorithm introduced a robust, practical take on decision-tree learning that involves building a large number of specialized decision trees and then **ensembling** their outputs. Random forests are applicable to a wide range of problems—you could say that **they’re almost always the second-best algorithm for any shallow machine-learning task**. 

When the popular machine-learning competition website [Kaggle](http://kaggle.com) got started in **2010**, **random forests quickly became a favorite on the platform—until 2014**, when gradient boosting machines took over. 

**A gradient boosting machine**, much like a random forest, is a machine-learning technique based on **ensembling weak prediction models**, generally decision trees. It uses gradient boosting, a way to improve any machine-learning model by iteratively training new models that specialize in addressing the weak points of the previous models. Applied to decision trees, the use of the gradient boosting technique results in models that **strictly outperform random forests most of the time, while having similar properties**. **It may be one of the best, if not the best, algorithm for dealing with nonperceptual data today**. Alongside deep learning, it’s one of the most commonly used techniques in Kaggle competitions.
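
To make the idea of "iteratively training new models that specialize in the weak points of the previous models" more concrete, here is a minimal NumPy sketch of gradient boosting for a regression task with squared error. It is an illustration only, not how XGBoost or any production library is implemented: the weak models are one-split trees ("stumps"), and the function names, the synthetic sine data, the number of rounds, and the learning rate are all assumptions chosen for the example.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a one-split regression tree (a "stump") to the residuals."""
    best = None
    for split in np.unique(x)[:-1]:  # dropping the last value keeps both sides non-empty
        left, right = residuals[x <= split], residuals[x > split]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, split, left.mean(), right.mean())
    return best[1:]  # (split, left_value, right_value)


def predict_stump(stump, x):
    split, left_value, right_value = stump
    return np.where(x <= split, left_value, right_value)


def fit_boosted_stumps(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as weak models."""
    base = y.mean()
    prediction = np.full(y.shape, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residual is (up to a constant factor)
        # the negative gradient of the loss with respect to the prediction.
        residuals = y - prediction
        stump = fit_stump(x, residuals)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction


# Synthetic data, made up for illustration.
x = np.linspace(0.0, 6.0, 40)
y = np.sin(x)

base, stumps, fitted = fit_boosted_stumps(x, y)
initial_mse = float(np.mean((y - base) ** 2))
final_mse = float(np.mean((y - fitted) ** 2))
print("MSE with the mean only:", initial_mse)
print("MSE after boosting:    ", final_mse)
```

Each new stump is fitted to what the current ensemble still gets wrong, which is the sense in which later models address the weak points of earlier ones.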

### 1.2.5. Back to neural networks

**Around 2010**, although neural networks were almost completely shunned by the scientific community at large, a number of people still working on neural networks started to make important breakthroughs: the groups of [Geoffrey Hinton](https://en.wikipedia.org/wiki/Geoffrey_Hinton) at the University of Toronto, [Yoshua Bengio](https://en.wikipedia.org/wiki/Yoshua_Bengio) at the University of Montreal, [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) at New York University, and IDSIA in Switzerland.


- **In 2011**, Dan Ciresan from IDSIA began to win academic image-classification competitions with **GPU-trained** deep neural networks.
- **In 2012**, a team led by Alex Krizhevsky and advised by Geoffrey Hinton achieved a top-five accuracy of **83.6%** on the ImageNet large-scale image-classification challenge, a significant breakthrough.
- **By 2015**, the winner reached an accuracy of **96.4%**, and the classification task on ImageNet was considered to be a completely solved problem.
- Since 2012, **deep convolutional neural networks (convnets)** have become the go-to algorithm for all computer vision tasks; more generally, they work on all perceptual tasks.
- **At major computer vision conferences in 2015 and 2016**, it was nearly impossible to find presentations that didn’t involve convnets in some form.


### 1.2.6. The modern machine-learning landscape


A great way to get a sense of the current landscape of machine-learning algorithms and tools is to look at machine-learning competitions on [Kaggle](http://kaggle.com). 

**In 2016 and 2017**, Kaggle was dominated by two approaches: **gradient boosting machines** and **deep learning**. Specifically, **gradient boosting is used for problems where structured data is available**, whereas **deep learning is used for perceptual problems such as image classification**. 

These are the two techniques you should be the most familiar with in order to be successful in applied machine learning today: gradient boosting machines, for shallow-learning problems; and deep learning, for perceptual problems. In technical terms, this means you’ll need to be familiar with **XGBoost** and **Keras** —the two libraries that currently dominate Kaggle competitions.

## 1.3. Why deep learning? Why now?

The two key ideas of **deep learning** for computer vision — **convolutional neural networks** and **backpropagation** — were already well understood in **1989**. The **Long Short-Term Memory (LSTM)** algorithm, which is fundamental to deep learning for **timeseries**, was developed in **1997** and has barely changed since. So why did deep learning only take off after 2012? What changed in these two decades?

In general, three technical forces are driving advances in machine learning:

- Hardware
- Datasets and benchmarks
- Algorithmic advances

Because the field is guided by experimental findings rather than by theory, algorithmic advances only become possible when appropriate data and hardware are available to try new ideas (or scale up old ideas, as is often the case). **Machine learning** isn’t mathematics or physics, where major advances can be done with a pen and a piece of paper. **It’s an engineering science.**

The real **bottlenecks throughout the 1990s and 2000s** were **data** and **hardware**. But here’s what happened during that time: the internet took off, and **high-performance graphics chips were developed** for the needs of the gaming market.


### 1.3.1. A new wave of investment

**In 2011**, right before deep learning took the spotlight, the total venture capital investment in AI was around **\$19 million**, which went almost entirely to practical applications of shallow machine-learning approaches. **By 2014**, it had risen to a staggering **\$394 million**. Dozens of startups launched in these three years, trying to capitalize on the deep-learning hype.

Meanwhile, large tech companies such as **Google**, **Facebook**, **Baidu**, and **Microsoft** have invested in internal research departments in amounts that would most likely dwarf the flow of venture-capital money. Only a few numbers have surfaced: **In 2014**, **Google acquired** the deep-learning startup **DeepMind** for a reported **\$500 million**, the largest acquisition of an AI company in history. **That same year**, **Baidu** started a deep-learning research center in Silicon Valley, **investing \$300 million** in the project. The deep-learning hardware **startup Nervana Systems** was acquired by **Intel** in **2016** for over **\$400 million.**

Machine learning—in particular, deep learning—has become central to the product strategy of these tech giants. **In late 2015**, **Google CEO Sundar Pichai stated**, “Machine learning is a core, transformative way by which we’re rethinking how we’re doing everything. **We’re thoughtfully applying it across all our products**, be it search, ads, YouTube, or Play. And we’re in early days, but you’ll see us—in a systematic way—apply machine learning in all these areas.”

# 2 - Mathematical building blocks of neural networks 

## 2.1 - Introduction

This section covers
- A first example of a neural network
- Tensors and tensor operations
- How neural networks learn via backpropagation and gradient descent

**Understanding deep learning** requires familiarity with many simple mathematical
concepts: **tensors**, **tensor operations**, **differentiation**, **gradient descent**, and so on.
Our goal in this section will be to build your intuition about these notions without
getting overly technical. In particular, we’ll steer away from mathematical notation,
which can be off-putting for those without any mathematics background and isn’t
strictly necessary to explain things well.

## 2.2 - A first look at a neural network

Let’s look at a concrete example of a neural network that uses the [Python library Keras](https://keras.io/) to learn to **classify handwritten digits**. Unless you already have experience with Keras or similar libraries, you won’t understand everything about this first example right away.

You probably haven’t even installed Keras yet (you can install it on your machine even if it doesn’t have a GPU):

```bash
conda install -c conda-forge tensorflow keras
```

In [0]:
import keras
from keras import backend as K

print('Using Keras version:', keras.__version__, 
      '\nbackend:', K.backend())

If the backend is Tensorflow, we can display some further information:

In [0]:
if K.backend() == "tensorflow":
    import tensorflow as tf
    device_name = tf.test.gpu_device_name()
    if device_name == '':
        device_name = "None"
    print('Using TensorFlow version:', tf.__version__,
          ', GPU:', device_name)

In [0]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

In [0]:
import psutil
import humanize
import os
import GPUtil as GPU

# get all GPUs on the machine
GPUs = GPU.getGPUs()

# Colab gives us only one GPU
gpu = GPUs[0]

def printm():
  process = psutil.Process(os.getpid())
  print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ),
        " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
  print("GPU RAM Free: {0:.0f}MB  | Used: {1:.0f}MB | Util {2:3.0f}% |\
  Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, 
                                gpu.memoryUtil*100, gpu.memoryTotal))

printm()

In [0]:
# RSS - resident set size 
!ps -aux --sort -rss

The problem we’re trying to solve here is to **classify grayscale images of handwritten digits (28 × 28 pixels)** into their **10 categories (0 through 9)**. We’ll use the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), a classic in the machine-learning community, which has been around almost as long as the field itself and has been intensively studied. 

It’s a **set of 60,000 training images**, plus **10,000 test images**, assembled by the National Institute of Standards and Technology (the NIST in MNIST) in the 1980s. You can think of **“solving” MNIST as the “Hello World” of deep learning** — it’s what you do to verify that your algorithms are working as expected. As you become a machine-learning practitioner, you’ll see MNIST come up over and over again, in scientific papers, blog posts, and so on.


<img width="350" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1vlzxpafdf5r8mkcxeJ5tqgW6BhuLZIIZ">


### 2.2.1 -  Loading the data

In [0]:
# Loading the MNIST dataset in Keras
from keras.datasets import mnist

# The images are encoded as Numpy arrays, 
# and the labels are an array of digits, ranging
# from 0 to 9. The images and labels have a one-to-one correspondence.

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

In [0]:
print("Type of train_images {}\nType of train_labels {}".format(type(train_images),type(train_labels)))
print("\nShape of train data: images {}, labels {}".format(train_images.shape,train_labels.shape))
print("Shape of test data: images {}, labels {}".format(test_images.shape,test_labels.shape))

In [0]:
# first ten train labels
train_labels[:10]

### 2.2.2 - The network architecture


In [0]:
from keras import models
from keras import layers

# The network architecture
network = models.Sequential()

# Hidden layer
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))

# Output layer
network.add(layers.Dense(10, activation='softmax'))

To make the network ready for training, we need to pick three more things, as part of the compilation step:
- **A loss function**: How the network will be able to measure its performance on the training data, and thus how it will be able to steer itself in the right direction.
- **An optimizer**: The mechanism through which the network will update itself based on the data it sees and its loss function.
- **Metrics to monitor during training and testing**: Here, we’ll only care about accuracy (the fraction of the images that were correctly classified).
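
To give a feel for what the loss and the metric actually compute, here is a small NumPy sketch of categorical cross-entropy and accuracy on three made-up predictions. It is an illustrative approximation, not Keras's implementation: the function names, the clipping constant `eps`, and the example arrays are assumptions.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))


def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the target."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))


# Three samples, three classes (made-up values).
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1],   # confident and correct
                   [0.1, 0.8, 0.1],   # confident and correct
                   [0.3, 0.4, 0.3]])  # wrong: class 1 predicted, class 2 expected

loss_value = categorical_crossentropy(y_true, y_pred)
accuracy_value = accuracy(y_true, y_pred)
print("loss:", loss_value)
print("accuracy:", accuracy_value)
```

The loss reacts to how confident each prediction is, whereas accuracy only counts hits and misses; that is why the loss is the quantity the optimizer works on while accuracy is only monitored.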

### 2.2.3 - The compilation step

In [0]:
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

### 2.2.4 - Preparing data

Before training, we’ll preprocess the data by **reshaping** it into the shape the network expects and scaling it so that **all values are in the [0, 1] interval**. Previously, our training images, for instance, were stored in an array of shape **(60000, 28, 28)** of type uint8 with values in the [0, 255] interval. We transform it into a float32 array of shape **(60000, 28 * 28)** with values between 0 and 1.

In [0]:
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
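
The same two steps can be tried on a small made-up array to see what changes and what does not. This is a sketch under assumptions: `fake_images` is random data standing in for MNIST, and `-1` is used in `reshape` so the number of samples does not need to be hardcoded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five random 28 x 28 "images" with uint8 values in [0, 255] (stand-in data).
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each image into a vector of 784 values; -1 infers the sample count.
flat = fake_images.reshape((-1, 28 * 28))

# Convert to float32 and rescale to the [0, 1] interval.
scaled = flat.astype("float32") / 255

print(fake_images.shape, fake_images.dtype)
print(scaled.shape, scaled.dtype, scaled.min(), scaled.max())
```

Reshaping only changes how the values are arranged; the scaling step is what changes the values themselves.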

### 2.2.5 - Preparing the labels

In [0]:
from keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

In [0]:
# show the first ten train labels transformed to categorical
train_labels[:10]
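
Categorical encoding (also called one-hot encoding) turns each integer label into a vector that is all zeros except for a 1 at the index of the label. A rough NumPy equivalent is sketched below; the `one_hot` helper and the example labels are hypothetical and only meant to show the idea behind `to_categorical`.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Encode integer labels as rows with a single 1 at the label index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


example_labels = np.array([5, 0, 4])
encoded = one_hot(example_labels, num_classes=10)
print(encoded)

# argmax recovers the original integer labels.
print(np.argmax(encoded, axis=1))
```

This matches the 10-way softmax output of the network: each target row has the same shape as one row of predictions.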

### 2.2.6 - Train the network

In [0]:
network.fit(train_images, train_labels, epochs=5, batch_size=128)

Two quantities are displayed during training: the **loss of the network** over the training data, and the **accuracy of the network** over the training data. We quickly reach an accuracy of 0.9885 (98.9%) on the training data. Before checking that the model performs well on the test set, too, let’s inspect its structure:

In [0]:
network.summary()

In [0]:
# plot_model needs the pydot package and Graphviz to be installed
from keras.utils import plot_model
from IPython.display import Image

plot_model(network, to_file="model.png", show_shapes=True, show_layer_names=True)
Image("model.png")

### 2.2.7 - Evaluate the network

The test-set accuracy turns out to be **97.8%** — that’s quite a bit lower than the training set accuracy. This gap between training accuracy and test accuracy is an example of **overfitting**: the fact that machine-learning models tend to perform worse on new data than on their training data.

In [0]:
test_loss, test_acc = network.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

### 2.2.8 - Summarization

This concludes our first example: you just saw how you can build and train a neural network to classify handwritten digits in less than 20 lines of Python code. The following steps were performed:

- Loading the data
- Creating the network architecture
- Compiling the network
- Preparing the data and the labels
- Training the network
- Evaluating the network

## 2.3 - Data representations for neural networks

In the previous example, we started from data stored in **multidimensional Numpy arrays**, also called **tensors**. 

So what’s a tensor?
- At its core, a **tensor is a container for data** — almost always numerical data. So, it’s a
container for numbers.
- You may be already familiar with **matrices, which are 2D tensors**.
- Tensors are a generalization of matrices to an arbitrary number of dimensions (note that in the context of tensors, a **dimension is often called an axis**).

In [0]:
import numpy as np

# Scalars (0D tensors) - a tensor that contains only one number is called a scalar
scalar = np.array(12)
print("0D tensor (dimension): {}".format(scalar.ndim))

# Vectors (1D tensors) - an array of numbers is called a vector, or 1D tensor. A 1D tensor is said to have exactly one axis.
tensor1D = np.array([12, 3, 6, 14])
print("1D tensor (dimension): {}".format(tensor1D.ndim))

# Matrices (2D tensors) - an array of vectors is a matrix, or 2D tensor. A matrix has two axes (rows and columns).
tensor2D = np.array([[5, 78, 2, 34, 0],[6, 79, 3, 35, 1],[7, 80, 4, 36, 2]])
print("2D tensor (dimension): {}".format(tensor2D.ndim))

# 3D tensors and higher-dimensional tensors
# If you pack such matrices in a new array, you obtain a 3D tensor, which you can visually interpret as a cube of numbers.
tensor3D = np.array([[[5, 78, 2, 34, 0],[6, 79, 3, 35, 1],[7, 80, 4, 36, 2]],
                      [[5, 78, 2, 34, 0],[6, 79, 3, 35, 1],[7, 80, 4, 36, 2]],
                      [[5, 78, 2, 34, 0],[6, 79, 3, 35, 1],[7, 80, 4, 36, 2]]])
print("3D tensor (dimension): {}".format(tensor3D.ndim))

### 2.3.1 - Real-world examples of data tensors

Let’s make data tensors more concrete with a few examples similar to what you’ll encounter later. The data you’ll manipulate will almost always fall into one of the following categories:

- **Vector data**: 2D tensors of shape **(samples, features)**
- **Timeseries data or sequence data**: 3D tensors of shape **(samples, timesteps, features)**
- **Images**: 4D tensors of shape **(samples, height, width, channels)** or **(samples, channels, height, width)**
- **Video**: 5D tensors of shape **(samples, frames, height, width, channels)** or **(samples, frames, channels, height, width)**
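
The shapes above can be made tangible with empty NumPy arrays. All sizes below are made up for illustration (for example, 100 people described by 3 features each); only the number and order of the axes matter.

```python
import numpy as np

# Vector data: 100 samples, 3 features each.
vector_data = np.zeros((100, 3))

# Timeseries data: 250 samples, 390 timesteps, 3 features per timestep.
timeseries_data = np.zeros((250, 390, 3))

# Images, channels-last: 8 grayscale images of 28 x 28 pixels.
images_channels_last = np.zeros((8, 28, 28, 1))

# The same images in channels-first layout: move the channel axis forward.
images_channels_first = np.transpose(images_channels_last, (0, 3, 1, 2))

# Video: 2 clips of 10 frames, each frame a 28 x 28 RGB image.
video_data = np.zeros((2, 10, 28, 28, 3))

for name, tensor in [("vector", vector_data),
                     ("timeseries", timeseries_data),
                     ("images (channels last)", images_channels_last),
                     ("images (channels first)", images_channels_first),
                     ("video", video_data)]:
    print(name, tensor.ndim, tensor.shape)
```

In every case the first axis is the samples axis, which is the one a training loop slices to build batches.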

## 2.4 - The gears of neural networks: tensor operations



All transformations learned by deep neural networks can be reduced to a handful of tensor operations applied to
tensors of numeric data. For instance, it’s possible to **add tensors, multiply tensors**, **reshape tensors**, and so on.

```python
# this layer takes as input a 2D tensor and returns another 2D tensor
keras.layers.Dense(512, activation='relu')
```

- Specifically, the function is as follows (where W is a 2D tensor and b is a vector, both attributes of the layer):

```python
output = relu(dot(W, input) + b)
```

- We have three tensor operations here: 
    - a **dot product** (dot) between the input tensor and a tensor named W; 
        - Tensor product operation
    - an **addition (+)** between the resulting 2D tensor and a vector b (1D); 
        - Broadcast operation
    - a **relu** operation. relu(x) is max(x, 0).
        - Element-wise operation
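
The three operations can be combined into a naive dense layer written with NumPy. This is a sketch, not the Keras implementation. It assumes a batch-first convention, with inputs of shape `(batch, in_features)` and `W` of shape `(in_features, units)`, so the product is written `np.dot(inputs, W)`; the argument order in the formula above depends on which shape convention is chosen. The names `naive_relu` and `naive_dense` and the example values are hypothetical.

```python
import numpy as np


def naive_relu(x):
    """Element-wise operation: relu(x) is max(x, 0)."""
    return np.maximum(x, 0.0)


def naive_dense(inputs, W, b):
    """Dense layer: tensor product, broadcast addition, then relu.

    inputs: (batch, in_features), W: (in_features, units), b: (units,)
    """
    return naive_relu(np.dot(inputs, W) + b)


inputs = np.array([[1.0, -2.0],
                   [0.5, 3.0]])          # 2 samples, 2 features
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, 1.0]])          # 2 features -> 3 units
b = np.array([0.0, 0.5, -1.0])           # one bias per unit

output = naive_dense(inputs, W, b)
print(output)
```

The bias `b` has one entry per unit and is added to every row of the product, which is the broadcast step; relu then zeroes out the negative entries.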

In [0]:
''' Tensor product (similar to a matrix multiplication)
(a, b, c, d) . (d,) -> (a, b, c)
(a, b, c, d) . (d, e) -> (a, b, c, e)
'''

tensor2D = np.array([[1,-1],
                     [0,2]])
tensor1D = np.array([1, 0])

# Tensor product
print(np.dot(tensor2D,tensor1D))
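
The shape rules quoted in the cell above can be checked directly with `np.dot` on arrays of ones. The sizes used here (2, 3, 4, 5, 6) are arbitrary choices for the example.

```python
import numpy as np

a = np.ones((2, 3, 4, 5))
v = np.ones((5,))
m = np.ones((5, 6))

# (a, b, c, d) . (d,) -> (a, b, c)
shape_with_vector = np.dot(a, v).shape
print(shape_with_vector)

# (a, b, c, d) . (d, e) -> (a, b, c, e)
shape_with_matrix = np.dot(a, m).shape
print(shape_with_matrix)

# The contracted axes must have the same size, otherwise NumPy raises an error.
try:
    np.dot(a, np.ones((4,)))
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
print("compatible:", shapes_compatible)
```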


In [0]:
tensor2D = np.array([[1,-1],
                     [0,2]])
tensor1D = np.array([1, 0])

# Element-wise operation
print(tensor2D * tensor1D)

In [0]:
''' Broadcast operation
What happens with addition when the shapes 
of the two tensors being added differ?
'''

tensor2D = np.array([[1,-1],[0,2]])
tensor1D = np.array([1, 0])

tensor2D + tensor1D
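
What broadcasting does here can be spelled out by hand: the smaller tensor is conceptually repeated along a new axis until it matches the shape of the larger one, and the addition is then element-wise. The sketch below uses `np.broadcast_to` to make that virtual repetition visible; the variable names are chosen for the example.

```python
import numpy as np

matrix = np.array([[1, -1],
                   [0, 2]])
vector = np.array([1, 0])

# Step 1: view the vector as if it were repeated for every row of the matrix.
expanded = np.broadcast_to(vector, matrix.shape)
print(expanded)

# Step 2: add element-wise, now that both shapes match.
manual_sum = matrix + expanded

# NumPy performs both steps implicitly when the shapes are compatible.
implicit_sum = matrix + vector
print(np.array_equal(manual_sum, implicit_sum))
```

No copy of the vector is actually made; the repetition exists only at the level of the algorithm.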


In [0]:
''' Element-wise operation 
tensor2D =  |1,-1|   ->   relu(tensor2D)   ->  |1,0|     
            |0, 2|                             |0,2|
'''

tensor2D = np.array([[1,-1],[0,2]])
# before relu()
print(tensor2D)

# after relu()
print(np.maximum(tensor2D,0))

## 2.5 - The engine of neural networks: gradient-based optimization



As you saw in the previous section, each neural layer from our first network example transforms its input data as follows:

```python
output = relu(dot(W, input) + b)
```

In this expression, **W** and **b** are tensors that are attributes of the layer. They’re called the **weights or trainable parameters** of the layer (**the kernel and bias attributes**, respectively). 

- These weights contain the **information learned** by the network from **exposure to training data**.
- Initially, these **weight** matrices are filled with small random values (a step called **random initialization**). 
- These weights are then **gradually adjusted** based on a feedback signal, a process called **training**. Training happens within a training loop:
    1. Draw a **batch** of training samples x and corresponding targets y.
    2. Run the network on x (a step called the **forward pass**) to **obtain predictions y_pred**.
    3. **Compute the loss** of the network on the batch, a measure of the mismatch between y_pred and y.
    4. **Update all weights** of the network in a way that slightly reduces the loss on this batch.
    
**Step 1** sounds easy enough—just I/O code. **Steps 2 and 3** are merely the application of a handful of tensor operations, so you could implement these steps purely from what you learned in the previous section. The difficult part is **step 4**: updating the network’s weights.
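
The four steps can be sketched on a model small enough that step 4 can be done by hand. The example below is an assumption-laden illustration, not the way Keras trains a network: it fits a single linear layer to synthetic data with mean squared error, and the gradient used in step 4 is derived manually for this specific model. The data, the learning rate, and the batch size are all made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (made up for illustration): y = 3x - 1 plus small noise.
x_train = rng.uniform(-1.0, 1.0, size=(256, 1))
y_train = 3.0 * x_train - 1.0 + 0.05 * rng.normal(size=(256, 1))

# Random initialization of the weights.
W = rng.normal(scale=0.1, size=(1, 1))
b = np.zeros((1,))
learning_rate = 0.1
batch_size = 32


def mse(y_pred, y):
    return float(np.mean((y_pred - y) ** 2))


initial_loss = mse(np.dot(x_train, W) + b, y_train)

for epoch in range(50):
    order = rng.permutation(len(x_train))
    for start in range(0, len(x_train), batch_size):
        idx = order[start:start + batch_size]
        x, y = x_train[idx], y_train[idx]        # 1. draw a batch
        y_pred = np.dot(x, W) + b                # 2. forward pass
        loss = mse(y_pred, y)                    # 3. compute the loss
        grad_pred = 2.0 * (y_pred - y) / len(x)  # 4. gradient of the loss...
        W -= learning_rate * np.dot(x.T, grad_pred)  # ...used to update W
        b -= learning_rate * grad_pred.sum(axis=0)   # ...and b

final_loss = mse(np.dot(x_train, W) + b, y_train)
print("loss before training:", initial_loss)
print("loss after training: ", final_loss)
print("learned W and b:", W.ravel(), b)
```

For a deep network the gradient in step 4 cannot reasonably be written out by hand like this, which is exactly the problem the rest of this section addresses.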

### 2.5.1 - Question!!!
Given an individual weight coefficient in the network, how can you compute whether the coefficient should be increased or decreased, and by how much?

**One naive solution** would be to freeze all weights in the network except the one scalar coefficient being considered, and try different values for this coefficient. 
- Let’s say the initial value of the **coefficient is 0.3**. 
- After the forward pass on a batch of data, **the loss** of the network on the batch is **0.5**. 
- If you change the **coefficient’s value to 0.35** and rerun the forward pass, the **loss increases to 0.6**. 
- But if you lower the **coefficient to 0.25**, the **loss falls to 0.4**.
- In this case, it seems that updating the coefficient by -0.05 would contribute to minimizing the loss. 
- This would have to be repeated for all coefficients in the network.

But such an approach would be horribly inefficient, because you’d need to compute two forward passes (which are expensive) for every individual coefficient (of which there are many, usually thousands and sometimes up to millions). 
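
The cost of the naive approach is easy to see in code. The sketch below uses a hypothetical three-coefficient "network" whose loss is a simple quadratic (`loss_fn` stands in for a full forward pass), and counts how many forward passes a single round of updates needs. All names and values are assumptions for illustration.

```python
def loss_fn(weights):
    """Stand-in for a forward pass: a made-up loss, smallest at [0.2, -0.4, 0.1]."""
    targets = [0.2, -0.4, 0.1]
    return sum((w - t) ** 2 for w, t in zip(weights, targets))


def naive_update(weights, step=0.05):
    """Try w + step and w - step for each coefficient, keeping the better one."""
    weights = list(weights)
    current_loss = loss_fn(weights)
    forward_passes = 1
    for i in range(len(weights)):
        original = weights[i]
        candidates = []
        for delta in (+step, -step):
            weights[i] = original + delta            # all other weights stay frozen
            candidates.append((loss_fn(weights), original + delta))
            forward_passes += 1
        best_loss, best_value = min(candidates)
        if best_loss < current_loss:
            weights[i], current_loss = best_value, best_loss
        else:
            weights[i] = original                    # neither direction helped
    return weights, current_loss, forward_passes


start_weights = [0.3, 0.0, 0.0]
start_loss = loss_fn(start_weights)
new_weights, new_loss, passes = naive_update(start_weights)
print("loss:", start_loss, "->", new_loss)
print("forward passes for", len(start_weights), "coefficients:", passes)
```

Two forward passes per coefficient is harmless with three coefficients and prohibitive with millions, and each round only moves every coefficient by one fixed step.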

**A much better approach** is to take advantage of the fact that **all operations** used in the network are **differentiable**, and compute the gradient of the loss with regard to the network’s coefficients. You can then **move the coefficients** in the **opposite direction from the gradient**, thus **decreasing the loss**.

### 2.5.2 - What’s a derivative?

Consider a continuous, smooth function $f(x)$, mapping a real number x to a new real number y. 

$$f(x) = y$$

Because the function is continuous, a small change in $x$ can only result in a small change in $y$; that’s the intuition behind continuity. Let’s say you increase $x$ by a small factor $epsilon\_x$: this results in a small $epsilon\_y$ change to $y$:

$$f(x + epsilon\_x) = y + epsilon\_y$$

In addition, because the function is *smooth* (its curve doesn’t have any abrupt angles), **when $epsilon\_x$ is small enough**, around a certain point $p$, it’s possible to approximate $f$ as a linear function of slope $a$, so that $epsilon\_y$ becomes $a * epsilon\_x$:

$$f(x + epsilon\_x) = y + a * epsilon\_x$$

**The slope a** is called the **derivative** of $f$ in $p$. 

- If $a$ is negative, it means a small change of $x$ around $p$ will result in a decrease of $f(x)$
- If $a$ is positive, a small change in $x$ will result in an increase of $f(x)$. 
- Further, the absolute value of $a$ (the magnitude of the derivative) tells you how quickly this increase or decrease will happen.

If you’re trying to update $x$ by a factor $epsilon\_x$ in order to minimize $f(x)$, and you know the derivative of $f$, then your job is done: the derivative completely describes how $f(x)$ evolves as you change $x$. **If you want to reduce the value of $f(x)$, you just need to move $x$ a little in the opposite direction from the derivative.**
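As a rough illustration of the idea above, the following sketch estimates a derivative numerically and then takes one small step against it. The function `f`, the point `p`, and the values of `epsilon_x` and `step` are arbitrary choices made for this example.

```python
def f(x):
    # A smooth example function whose minimum is at x = 3
    return (x - 3.0) ** 2

def numerical_derivative(func, p, epsilon_x=1e-6):
    # Slope a of the local linear approximation of func around p
    return (func(p + epsilon_x) - func(p)) / epsilon_x

p = 1.0
a = numerical_derivative(f, p)  # close to the exact derivative 2 * (p - 3) = -4

step = 0.1
new_p = p - step * a            # move x a little in the opposite direction from the derivative

print(a)
print(f(p), f(new_p))           # the second value is smaller than the first
```

Because the slope at `p` is negative, moving against it increases `x`, which brings it closer to the minimum.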

### 2.5.3 Derivative of a tensor operation: the gradient

A **gradient** is the **derivative of a tensor operation**. 

Consider:
- input vector $x$
- matrix $W$
- target $y$
- loss function. 

You can use $W$ to compute a target candidate $y\_pred$, and compute the $loss$, or mismatch, between the target candidate $y\_pred$ and the target $y$:

$$
y\_pred = dot(W, x)
$$

$$
loss\_value = loss(y\_pred, y)
$$

If the data inputs $x$ and $y$ are frozen, then this can be interpreted as a function mapping values of $W$ to loss values:

$$loss\_value = f(W)$$

Let’s say the current value of $W$ is $W0$. 

- The derivative of $f$ in the point $W0$ is a tensor $gradient(f)(W0)$ 
- Each coefficient $gradient(f)(W0)[i, j]$ indicates the direction and magnitude of the change in $loss\_value$ when modifying $W0[i, j]$


Thus, in much the same way that, for a function $f(x)$, you can reduce the value of $f(x)$ by moving $x$ a little in the opposite direction from the derivative, with a function $f(W)$ of a tensor, you can reduce $f(W)$ by moving $W$ in the opposite direction from the gradient: 

$$
W1 = W0 - step * gradient(f)(W0)
$$

where step is a small scaling factor. 
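The update rule can be tried out in NumPy. This is only a sketch under assumptions: the loss is taken to be the mean squared error (the text does not fix a particular loss), the gradient is written out by hand for that loss, and the shapes, the random data, and the `step` value are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))      # input vector
y = rng.normal(size=(3,))      # target
W0 = rng.normal(size=(3, 4))   # current value of W

def f(W):
    # loss_value = loss(y_pred, y), with mean squared error as the assumed loss
    y_pred = np.dot(W, x)
    return np.mean((y_pred - y) ** 2)

def gradient_f(W):
    # Hand-derived gradient of the mean squared error with regard to W
    y_pred = np.dot(W, x)
    return (2.0 / y.size) * np.outer(y_pred - y, x)

step = 0.01
W1 = W0 - step * gradient_f(W0)

print(f(W0), f(W1))  # the loss at W1 is lower than at W0
```

The gradient has the same shape as `W0`, so each entry tells you how to nudge the matching weight.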

### 2.5.4 - Stochastic gradient descent

1. Draw a batch of training samples $x$ and corresponding targets $y$.
2. Run the network on $x$ to obtain predictions $y\_pred$.
3. Compute the loss of the network on the batch, a measure of the mismatch between $y\_pred$ and $y$.
4. Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
5. Move the parameters a little in the opposite direction from the gradient 
    - $W -= step * gradient$
    
What I just described is called mini-batch stochastic gradient descent.
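The five steps can be sketched as a short NumPy loop. Everything here is an assumption made for illustration: the synthetic linear data, the mean squared error loss, the hand-written gradient, and the values of `step` and `batch_size`. A real network would get its gradients from the framework's backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 256 samples, 5 features, targets produced by a known linear rule
inputs = rng.normal(size=(256, 5))
true_W = rng.normal(size=(5,))
targets = inputs @ true_W

W = np.zeros(5)   # parameters to learn
step = 0.1
batch_size = 32

def mse(W):
    return np.mean((inputs @ W - targets) ** 2)

initial_loss = mse(W)

for _ in range(200):
    # 1. Draw a batch of training samples x and corresponding targets y
    idx = rng.choice(len(inputs), size=batch_size, replace=False)
    x, y = inputs[idx], targets[idx]
    # 2. Run the model on x to obtain predictions y_pred
    y_pred = x @ W
    # 3. Compute the loss on the batch
    loss = np.mean((y_pred - y) ** 2)
    # 4. Compute the gradient of the loss with regard to the parameters
    gradient = 2.0 * x.T @ (y_pred - y) / batch_size
    # 5. Move the parameters a little in the opposite direction from the gradient
    W -= step * gradient

final_loss = mse(W)
print(initial_loss, final_loss)
```

Each pass through the loop sees a different random batch, which is where the word "stochastic" comes from.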

Additionally, there exist multiple variants of SGD that differ by taking into account previous weight updates when computing the next weight update, rather than just looking at the current value of the gradients. There is, for instance, **SGD with momentum**, as well as **Adagrad**, **RMSProp**, and several others. **Such variants are known as optimization methods or optimizers.**
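To make the idea of "taking previous updates into account" concrete, here is a minimal sketch of one common formulation of momentum. The helper name `minimize_with_momentum`, the example function, and the hyperparameter values are hypothetical; library optimizers differ in their details.

```python
def minimize_with_momentum(gradient_fn, w, step=0.1, momentum=0.9, iterations=300):
    velocity = 0.0
    for _ in range(iterations):
        gradient = gradient_fn(w)
        # The new update blends the previous update (velocity) with the current gradient
        velocity = momentum * velocity - step * gradient
        w = w + velocity
    return w

# Example: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = minimize_with_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # close to 3, the minimum of f
```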

## 2.6 - Looking back at our first example



You’ve reached the end of this chapter, and you should now have a general understanding of what’s going on behind the scenes in a neural network. Let’s go back to the first example and review each piece of it in the light of what you’ve learned in the previous three sections.

This was the input data:

>```python
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255
```

Now you understand that the input images are stored in NumPy tensors, which are here formatted as float32 tensors of shape **(60000, 784) (training data)** and **(10000, 784) (test data)**, respectively.

This was our network:

>```python
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))
```

Now you understand that this network consists of a **chain of two Dense layers**, that each layer applies a few simple tensor operations to the input data, and that these operations involve weight tensors. Weight tensors, which are attributes of the layers, are where the knowledge of the network persists.

This was the network-compilation step:

>```python
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])
```

Now you understand that **categorical_crossentropy** is the loss function that’s used as a feedback signal for learning the weight tensors, and which the training phase will attempt to minimize. You also know that this **reduction of the loss happens via minibatch stochastic gradient descent**. The exact rules governing a specific use of gradient descent are defined by the **rmsprop optimizer** passed as the first argument.

Finally, this was the training loop:

```python
network.fit(train_images, 
            train_labels, epochs=5, 
            batch_size=128)
```

Now you understand what happens when you call fit: the network will start to iterate on the training data in mini-batches of **128 samples, 5 times over** (each iteration over all the training data is called an epoch). At each iteration, the network will compute the gradients of the loss on the batch with regard to the weights, and update the weights accordingly. **After these 5 epochs, the network will have performed 2,345 gradient updates (469 per epoch - 60k/128)**, and the loss of the network will be sufficiently low that the network will be capable of classifying handwritten digits with high accuracy.
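The update count follows from simple arithmetic, shown here as a quick check. The rounding up reflects the assumption that the final, smaller batch of each epoch still triggers one update.

```python
import math

num_samples = 60000
batch_size = 128
epochs = 5

# 60000 / 128 = 468.75, so the last batch of each epoch is smaller than 128
updates_per_epoch = math.ceil(num_samples / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)  # 469 2345
```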

**At this point, you already know most of what there is to know about neural networks.**

# 3 - Getting started with neural networks

## 3.1 - Introduction



This section covers
- **Core** components of **neural networks**
- An introduction to **Keras**
- Using neural networks to solve basic classification and regression problems
    - Classifying movie reviews as positive or negative (**binary classification**)
    - Classifying news wires by topic (**multiclass classification**)
    - Estimating the price of a house, given real-estate data (**regression**)

## 3.2 - Anatomy of a neural network



As you saw in the previous chapters, training a neural network revolves around the following objects:
 - **Layers**, which are combined into a network (or model)
 - The **input data** and corresponding **targets**
 - The **loss function**, which defines the **feedback signal used for learning**
 - The **optimizer**, which determines **how learning proceeds**
 
 <img width="400" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1CFWZALw9MnJvStvh8VkCAZFiVLs7E4lW">

### 3.2.1 - Layers: the building blocks of deep learning

A layer is a data-processing module that takes as input one or more tensors and that outputs one or more tensors. Different layers are appropriate for different tensor formats and different types of data processing.

- **fully connected or dense layers**
    - simple vector data, stored in 2D tensors of shape (samples, features)
- **recurrent layers such as an LSTM layer**
    - sequence data, stored in 3D tensors of shape (samples, timesteps, features)
- **2D convolution layers (Conv2D)**
    - image data, stored in 4D tensors
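The tensor ranks in the list above can be pictured with plain NumPy arrays. The sizes are arbitrary, and the axis names in the comments follow the usual Keras conventions (with channels last for images), which is an assumption here rather than something the list states.

```python
import numpy as np

vector_data = np.zeros((32, 10))         # (samples, features)
sequence_data = np.zeros((32, 20, 10))   # (samples, timesteps, features)
image_data = np.zeros((32, 28, 28, 3))   # (samples, height, width, channels)

print(vector_data.ndim, sequence_data.ndim, image_data.ndim)  # 2 3 4
```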

### 3.2.2 -  Loss functions and optimizers: keys to configuring the learning process



Once the network architecture is defined, you still have to choose two more things:
- **Loss function** (objective function) — The quantity that will be minimized during training. It represents a measure of success for the task at hand.
- **Optimizer** — Determines how the network will be updated based on the loss function. It implements a specific variant of stochastic gradient descent (SGD).

**A neural network that has multiple outputs may have multiple loss functions (one per output)**. But the gradient-descent process must be based on a single scalar loss value; so, for multiloss networks, all losses are combined (via averaging) into a single scalar quantity.

Choosing the right objective function for the right problem is extremely important.

- **binary crossentropy**
    - two-class classification
- **categorical crossentropy**
    - many-class classification problem
- **mean squared error** 
    - regression problem
- **connectionist temporal classification (CTC)**
    - sequence-learning problem
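To give a feel for what two of these objectives compute, here is a minimal NumPy sketch of binary crossentropy and mean squared error. These are textbook formulas written for illustration, not the Keras implementations; the `eps` clipping value and the sample arrays are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions so that log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 0.0, 1.0])
good_pred = np.array([0.9, 0.1, 0.8])   # mostly agrees with y_true
bad_pred = np.array([0.2, 0.7, 0.4])    # mostly disagrees with y_true

print(binary_crossentropy(y_true, good_pred), binary_crossentropy(y_true, bad_pred))
print(mean_squared_error(np.array([1.0, 2.0]), np.array([1.0, 4.0])))  # 2.0
```

In both cases a lower value means the predictions are closer to the targets, which is what makes these quantities usable as a feedback signal.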

## 3.3 - Introduction to Keras



[Keras](https://keras.io/) is a **deep-learning framework for Python** that provides a convenient way to define and train almost any kind of deep-learning model.

Keras has the following key features:
- It allows the same code to run seamlessly on CPU or GPU.
- It has a user-friendly API that makes it easy to quickly prototype deep-learning models.
- It has built-in support for convolutional networks (for computer vision), recurrent networks (for sequence processing), and any combination of both.
- It supports arbitrary network architectures: multi-input or multi-output models, layer sharing, model sharing, and so on. 


Keras is appropriate for building essentially any **deep-learning model**, from a **generative adversarial network** to a **neural Turing machine**.

Keras is distributed under the permissive MIT license, which means **it can be freely used in commercial projects**. It’s compatible with any version of Python from 2.7 to 3.6 (as of December 2018).


**Keras is a model-level library**, providing high-level building blocks for developing deep-learning models. **It doesn’t handle low-level operations** such as tensor manipulation and differentiation. Instead, **it relies on a specialized, well-optimized tensor library** to do so, serving as the backend engine of Keras.

<img width="300" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1XeOgdVuT35U9ahTWwO5vfW6EiUlklrql">

## 3.4 - Classifying movie reviews: a binary classification example



**Two-class classification**, or **binary classification**, may be the most widely applied kind of machine-learning problem. In this example, you’ll learn to classify movie reviews as positive or negative, based on the text content of the reviews.




### 3.4.1 - IMDB dataset



You’ll work with the IMDB dataset: a set of 50,000 highly polarized reviews from the Internet Movie Database. They’re split into:

- 25,000 reviews for training
- 25,000 reviews for testing

each set consisting of 50% negative and 50% positive reviews.

In [0]:
# loading the IMDB dataset

from keras.datasets import imdb

# num_words=10000 means you’ll only keep the 
# top 10,000 most frequently occurring words
# in the training data.
(train_data, train_labels),(test_data, test_labels) = imdb.load_data(num_words=10000)

In [0]:
# The variables train_data and test_data are lists of reviews; 
# each review is a list of word indices (encoding a sequence of words)
print(train_data[0][:20])
print(len(train_data[0]))

In [0]:
# train_labels and test_labels are lists of 0s and 1s, 
# where 0 stands for negative and 1 stands for positive
train_labels[0]

In [0]:
# Because you’re restricting yourself to the top 10,000
# most frequent words, no word index will exceed 10,000
max([max(sequence) for sequence in train_data])

In [0]:
import textwrap

# how to decode one sequence back to English

# get all word indexes ('word': id)
word_index = imdb.get_word_index()

# get a new dict (id: 'word')
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# Note that the indices are offset by 3 because 0, 1, and 2 are reserved 
# indices for “padding,” “start of sequence,” and “unknown.”
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') 
                           for i in train_data[1]])


print("\n".join(textwrap.wrap(decoded_review,90)))

### 3.4.2 Preparing the data



**You can’t feed lists of integers into a neural network**. You have to turn your lists into tensors. There are two ways to do that:
- Pad your lists so that they all have the same length, turn them into an integer tensor of shape (samples, word_indices)
- One-hot encode your lists to turn them into vectors of 0s and 1s. This would mean, for instance, turning the sequence [3, 5] into a 10,000-dimensional vector that would be all 0s except for indices 3 and 5, which would be 1s. 
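The first option, padding, can be sketched in a few lines of NumPy. The helper `pad_sequences_sketch` is hypothetical and deliberately simple: it pads and truncates at the end of each list, which may differ from the defaults of library utilities.

```python
import numpy as np

def pad_sequences_sketch(sequences, maxlen, value=0):
    # Hypothetical helper: build an integer tensor of shape (samples, maxlen)
    padded = np.full((len(sequences), maxlen), value, dtype="int64")
    for i, sequence in enumerate(sequences):
        trimmed = sequence[:maxlen]            # cut sequences that are too long
        padded[i, :len(trimmed)] = trimmed     # the remaining slots keep the padding value
    return padded

padded = pad_sequences_sketch([[3, 5], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded.tolist())  # [[3, 5, 0, 0], [1, 2, 3, 4]]
```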

In [0]:
# Encoding the integer sequences into a binary matrix (25000,10000)
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

In [0]:
print(x_train.shape)
print(x_train[0])

In [0]:
# You should also vectorize your labels, which is straightforward

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

### 3.4.3 - Building your network



- The input data is vectors
- The labels are scalars (1s and 0s)

A type of network that performs well on such a problem is **a simple stack of fully connected (Dense) layers** with relu activations 

```python
Dense(16,activation='relu')
```

The argument being passed to each Dense layer (16) is the number of hidden units of the layer. **A hidden unit is a dimension in the representation space of the layer**. Remember that each such Dense layer with a relu activation implements the following chain of tensor operations:

```python
output = relu(dot(W, input) + b)
```

Having 16 hidden units means the weight matrix W will have shape (input_dimension,16): the dot product with W will project the input data onto a 16-dimensional representation space (and then you’ll add the bias vector b and apply the relu operation). 
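The shapes involved can be checked with a small NumPy sketch of a single Dense layer. The random weights, the batch of two samples, and the helper `relu` are assumptions for illustration. Note that with W stored as (input_dimension, 16), NumPy needs the input on the left-hand side of the dot product.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
input_dimension = 10000

batch = rng.random((2, input_dimension))            # two input samples
W = rng.normal(size=(input_dimension, 16)) * 0.01   # weight matrix
b = np.zeros(16)                                    # bias vector

# Project each sample onto a 16-dimensional space, add the bias, apply relu
output = relu(np.dot(batch, W) + b)

print(output.shape)  # (2, 16)
```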

You can intuitively understand the dimensionality of your representation space as **“how much freedom you’re allowing the network to have when learning internal representations.”** 

- Having more hidden units (a higher-dimensional representation space) allows your network to learn more-complex representations
- But having more hidden units makes the network more computationally expensive and may lead to **learning unwanted patterns (patterns that will improve performance on the training data but not on the test data).**


There are two **key architecture decisions** to be made about such a stack of Dense layers:
- How many layers to use
- How many hidden units to choose for each layer

In the next chapter, you’ll learn formal principles to guide you in making these choices. For the time being, you’ll have to trust me with the following architecture choice:

- Two intermediate layers with 16 hidden units each
- A third layer that will output the scalar prediction regarding the sentiment of the current review

The **intermediate layers will use relu** as their activation function, and the **final layer will use a sigmoid** activation so as to output a probability (a score between 0 and 1).
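For reference, the sigmoid squashes any real number into the interval between 0 and 1, which is why its output can be read as a probability. A minimal sketch of the standard formula:

```python
import math

def sigmoid(x):
    # Maps any real number to a value strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))                # 0.5
print(sigmoid(-5.0), sigmoid(5.0)) # near 0 and near 1
```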

<table>
<tr>
    <td> <img src="https://drive.google.com/uc?export=view&id=1tGaZcNRpAmDaSEAOGCKGhH9QtPHlbDUi" width="150"> </td>
    <td> <img src="https://drive.google.com/uc?export=view&id=1ku-YktP--VeodXH1oaUPPtkvgfSVclm7" width="300"> </td>
    <td> <img src="https://drive.google.com/uc?export=view&id=1LoV52worh3m-S9uL4YvyDm2NVOcVDvaJ" width="300"> </td>
</tr>
</table>



In [0]:
from keras import models
from keras import layers

# the model definition
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

### 3.4.4 - Validating your approach



In order to monitor during training the accuracy of the model on data it has never seen before, you’ll create a validation set by setting apart 10,000 samples from the original training data.

In [0]:
# Setting aside a validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

In [0]:
import time

start = time.time()

# Training your model
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

end = time.time()

print("Duration: ", end-start)

In [0]:
# Plotting the training and validation loss

import matplotlib.pyplot as plt
%matplotlib inline

history_dict = history.history

# training loss vs. validation loss
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)

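# training accuracy vs. validation accuracy
# (the keys are 'acc'/'val_acc' in older Keras releases and
# 'accuracy'/'val_accuracy' in newer ones)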
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

fig = plt.figure(figsize=(14, 6))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

ax1.plot(epochs, loss_values, 'bo', label='Training loss')
ax1.plot(epochs, val_loss_values, 'b', label='Validation loss')
ax1.set_title('Training and validation loss')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.set_xticks(epochs)

ax2.plot(epochs, acc_values, 'bo', label='Training acc')
ax2.plot(epochs, val_acc_values, 'b', label='Validation acc')
ax2.set_title('Training and validation accuracy')
ax2.set_xlabel('Epochs')
ax2.set_xticks(epochs)
ax2.set_ylabel('Accuracy')
ax2.legend()

plt.show()

- the training loss decreases with every epoch
- the training accuracy increases with every epoch

That’s what you would expect when running gradient descent optimization — the quantity you’re trying to minimize should be less with every iteration. But that isn’t the case for the validation loss and accuracy: they seem to peak at the fourth epoch. This is an example of what we warned against earlier: a model that performs better on the training data isn’t necessarily a model that will do better on data it has never seen before. In precise terms, what you’re seeing is **overfitting**: after the second epoch, you’re overoptimizing on the training data, and you end up learning representations that are specific to the training data and don’t generalize to data outside of the training set.
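One simple way to act on this observation is to look for the epoch where the validation loss is lowest and stop training there. The sketch below uses made-up loss values purely for illustration; they are not results from this notebook.

```python
# Made-up validation losses, one per epoch (not actual results)
val_loss_values = [0.40, 0.31, 0.29, 0.28, 0.30, 0.33, 0.37]

# Epochs are numbered from 1, list indices from 0
best_epoch = val_loss_values.index(min(val_loss_values)) + 1

print(best_epoch)  # 4
```

With real training runs you would apply the same lookup to `history.history['val_loss']`.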

In [0]:
# this approach achieves an accuracy of 85%. 
# With state-of-the-art approaches, 
# you should be able to get close to 95%
results = model.evaluate(x_test, y_test)
results

### 3.4.5 - Using a trained network to generate predictions on new data



After having trained a network, you’ll want to use it in a practical setting. You can generate the likelihood of reviews being positive by using the predict method:

In [0]:
model.predict(x_test)

In [0]:
# Threshold the predicted probabilities at 0.5 to get class labels (0 or 1)
(model.predict(x_test) > 0.5).astype('int32')

### 3.4.6 - Further experiments


**Exercise Start**

<img width="100" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


The following experiments will help convince you that the architecture choices you’ve made are all fairly reasonable, although there’s still room for improvement:
1. You used two hidden layers. **Try using one or three hidden layers**, and see how doing so affects validation and test accuracy.
2. Try using layers with more **hidden units** or fewer hidden units: **32 units, 64 units**, and so on.
3. Try using the **mse loss** function instead of binary_crossentropy.
4. Try using the **tanh activation** (an activation that was popular in the early days of neural networks) instead of relu.

In [0]:
# parameters to be evaluated
hidden_units = [16,32,64]
activations_funct = ['relu','tanh']
loss_funct = ['binary_crossentropy', 'mean_squared_error']
training = []
results = []

# put your code here

#### 3.4.6.1 - Model 1

In [0]:
# parameters to be evaluated

hidden_units = [16,32,64]
activations_funct = ['relu','tanh']
loss_funct = ['binary_crossentropy', 
              'mean_squared_error']
training = []
results = []

In [0]:
import time

start = time.time()

# the model definition
# layer_1,layer_2

for hidden in hidden_units:
    for activations in activations_funct:
        for losses in loss_funct:
            model = models.Sequential()
            model.add(layers.Dense(hidden, activation=activations, 
                                   input_shape=(10000,)))
            model.add(layers.Dense(1, activation='sigmoid'))

            # compile the model
            model.compile(optimizer='rmsprop',loss=losses,metrics=['accuracy'])
            
            # Training your model
            history = model.fit(partial_x_train,
                                partial_y_train,
                                epochs=20,
                                batch_size=512,
                                validation_data=(x_val, y_val))
            training.append(history)
            results.append(model.evaluate(x_test, y_test))
            
            
end = time.time()

print("Duration: ", end-start)

| Model | Hidden units | Loss function | Evaluation (accuracy) |
|-----------------|-------------|---------------|-----------------------|
| **relu,sigmoid(1)** | **16** | **binary_cross** | **0.8574** |
| relu,sigmoid(1) | 16 | mse | 0.8551 |
| tanh,sigmoid(1) | 16 | binary_cross | 0.8513 |
| tanh,sigmoid(1) | 16 | mse | 0.8515 |
| relu,sigmoid(1) | 32 | binary_cross | 0.8500 |
| relu,sigmoid(1) | 32 | mse | 0.8538 |
| tanh,sigmoid(1) | 32 | binary_cross | 0.8460 |
| tanh,sigmoid(1) | 32 | mse | 0.8469 |
| relu,sigmoid(1) | 64 | binary_cross | 0.8470 |
| relu,sigmoid(1) | 64 | mse | 0.8502 |
| tanh,sigmoid(1) | 64 | binary_cross | 0.8464 |
| tanh,sigmoid(1) | 64 | mse | 0.7906 |

The best accuracy was 0.8574. 

- For all results the activation function relu was better than tanh
- 16 hidden units was a better choice when compared to 32 and 64

### 3.4.7 -  Wrapping up



Here’s what you should take away from this example:
- You usually need to do quite a bit of **preprocessing on your raw data** in order to be able to feed it—as tensors—into a neural network. 
- Stacks of Dense layers with **relu activations can solve a wide range of problems** and you’ll likely use them frequently.
- In a **binary classification** problem (two output classes), your network should **end with a Dense layer with one unit and a sigmoid activation**: the output of your network should be a scalar between 0 and 1, encoding a probability.
- The **rmsprop optimizer** is generally a good enough choice, whatever your problem. That’s one less thing for you to worry about.
- As they get better on their training data, neural networks eventually start **overfitting** and end up obtaining increasingly worse results on data they’ve never seen before. Be sure to always monitor performance on data that is outside of the training set.

## 3.5 - Classifying newswires: a multiclass classification example



In the previous section, you saw how to classify vector inputs into two mutually exclusive classes using a densely connected neural network. **But what happens when you have more than two classes?**

In this section, you’ll build a network to **classify Reuters newswires into 46 mutually exclusive topics**. Because you have many classes, this problem is an instance of multiclass classification; and because each data point should be classified into only one category, the problem is more specifically an instance of **single-label, multiclass classification**. If each data point could belong to multiple categories (in this case, topics), you’d be facing a **multilabel, multiclass classification problem**.

In [0]:
# Loading the Reuters dataset

from keras.datasets import reuters

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

In [0]:
# 8982 train data samples
print(len(train_data))

# 2246 test data samples
print(len(test_data))

In [0]:
# print a train data sample
print(train_data[3][:10])

In [0]:
import textwrap

# Decoding newswires back to text
word_index = reuters.get_word_index()

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') 
                             for i in train_data[3]])

print("\n".join(textwrap.wrap(decoded_newswire,90)))

In [0]:
# Encoding the data

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

In [0]:
from keras.utils import to_categorical

one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

In [0]:
# Model definition

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', 
                       input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

There are two other things you should note about this architecture:

- **You end the network with a Dense layer of size 46**. This means for each input sample, the network will output a 46-dimensional vector. Each entry in this vector (each dimension) will encode a different output class.
- The **last layer uses a softmax activation**. You saw this pattern in the MNIST example. It means the network will output a probability distribution over the 46 different output classes: for every input sample, the network will produce a 46-dimensional output vector, where output[i] is the probability that the sample belongs to class i. **The 46 scores will sum to 1.**
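The claim that the scores sum to 1 follows directly from how softmax is defined. Here is a minimal NumPy sketch of the standard formula; the raw `scores` are arbitrary values chosen for this example.

```python
import numpy as np

def softmax(scores):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.arange(46, dtype="float64") / 10.0   # arbitrary raw scores for 46 classes
probs = softmax(scores)

print(probs.shape)       # (46,)
print(np.sum(probs))     # 1.0, up to floating-point rounding
print(np.argmax(probs))  # 45, the class with the largest raw score
```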

In [0]:
# Compiling the model
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

**The best loss function to use in this case is categorical_crossentropy**. It measures the distance between two probability distributions: here, between the probability distribution output by the network and the true distribution of the labels. By minimizing the distance between these two distributions, you train the network to output something as close as possible to the true labels.

In [0]:
# Setting aside a validation set

x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

In [0]:
# Training the model

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

In [0]:
# Plotting the training and validation loss

import matplotlib.pyplot as plt
%matplotlib inline

history_dict = history.history

# training loss vs. validation loss
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)

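# training accuracy vs. validation accuracy
# (the keys are 'acc'/'val_acc' in older Keras releases and
# 'accuracy'/'val_accuracy' in newer ones)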
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

fig = plt.figure(figsize=(14, 6))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

ax1.plot(epochs, loss_values, 'bo', label='Training loss')
ax1.plot(epochs, val_loss_values, 'b', label='Validation loss')
ax1.set_title('Training and validation loss')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.set_xticks(epochs)

ax2.plot(epochs, acc_values, 'bo', label='Training acc')
ax2.plot(epochs, val_acc_values, 'b', label='Validation acc')
ax2.set_title('Training and validation accuracy')
ax2.set_xlabel('Epochs')
ax2.set_xticks(epochs)
ax2.set_ylabel('Accuracy')
ax2.legend()

plt.show()

In [0]:
# evaluating the model
model.evaluate(x_test,one_hot_test_labels)

In [0]:
# Generating predictions for new data
predictions = model.predict(x_test)

# Each entry in predictions is a vector of length 46
print(predictions[0].shape)

# The coefficients in this vector sum to 1
print(np.sum(predictions[0]))

# The largest entry is the predicted class—the class with the highest probability
print(np.argmax(predictions[0]))

### 3.5.1 - A different way to handle the labels and the loss



We mentioned earlier that another way to encode the labels would be to cast them as an integer tensor, like this:

>```python
y_train = np.array(train_labels)
y_test = np.array(test_labels)
```

The only thing this approach would change is the choice of the loss function. The loss function categorical_crossentropy expects the labels to follow a categorical encoding. With integer labels, you should use sparse_categorical_crossentropy.

>```python
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
```

This new loss function is still mathematically the same as categorical_crossentropy; it just has a different interface.
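The equivalence can be checked by hand with a small NumPy sketch. The two functions below are simplified textbook versions written for this example, not the Keras implementations, and the probabilities and labels are made up.

```python
import numpy as np

def categorical_crossentropy(one_hot_labels, probs):
    # Labels are one-hot rows: only the true class contributes to the sum
    return -np.mean(np.sum(one_hot_labels * np.log(probs), axis=1))

def sparse_categorical_crossentropy(int_labels, probs):
    # Labels are integers: pick the probability of the true class directly
    picked = probs[np.arange(len(int_labels)), int_labels]
    return -np.mean(np.log(picked))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])       # predicted distributions for two samples
int_labels = np.array([0, 2])             # integer encoding
one_hot_labels = np.eye(3)[int_labels]    # categorical (one-hot) encoding

loss_a = categorical_crossentropy(one_hot_labels, probs)
loss_b = sparse_categorical_crossentropy(int_labels, probs)
print(loss_a, loss_b)  # the two values match
```

The integer form avoids building the one-hot matrix, which saves memory when there are many classes.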

###  3.5.2 -  The importance of having sufficiently large intermediate layers



We mentioned earlier that because the final outputs are 46-dimensional, you should avoid intermediate layers with many fewer than 46 hidden units. Now let’s see what happens when you introduce an information bottleneck by having intermediate layers that are significantly less than 46-dimensional: for example, 4-dimensional.

In [0]:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(partial_x_train,
          partial_y_train,
          epochs=20,
          batch_size=512,
          validation_data=(x_val, y_val))

In [0]:
# evaluating the model
model.evaluate(x_test,one_hot_test_labels)

**The network now peaks at ~70% validation accuracy, an 8% absolute drop**. This drop is mostly due to the fact that you’re trying to compress a lot of information (enough information to recover the separation hyperplanes of 46 classes) into an intermediate space that is too low-dimensional.

### 3.5.3 - Further experiments




**Exercise Start**

<img width="100" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

- Try using larger or smaller layers: 32 units, 128 units, 256, and so on.
- You used two hidden layers. Now try using a single hidden layer, three hidden layers or four hidden layers.

In [0]:
# parameters to be evaluated

hidden_units = [32,64,128,256]
activations_funct = ['relu']
loss_funct = ['categorical_crossentropy']
training = []
results = []


# put your code here

### 3.5.4 - Wrapping up



Here’s what you should take away from this example:

1. If you’re trying to **classify data points among N classes**, your network should **end with a Dense layer of size N**.
2. In a **single-label, multiclass classification** problem, your network should **end with a softmax** activation so that it will output a probability distribution over the N output classes.
3. **Categorical crossentropy is almost always the loss function you should use for such problems**. It minimizes the distance between the probability distributions output by the network and the true distribution of the targets.
4. There are two ways to handle labels in multiclass classification:
    - Encoding the labels via **categorical encoding** (also known as one-hot encoding) and using **categorical_crossentropy** as a loss function
    - **Encoding the labels as integers** and using the **sparse_categorical_crossentropy** loss function
5. If you need to classify data into a large number of categories, you should avoid creating information bottlenecks in your network due to intermediate layers that are too small (**fewer than 46 units in this example**).
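
Points 2 to 4 can be checked by hand on a toy example. The NumPy sketch below follows the mathematical definitions of softmax and crossentropy; it is not Keras's internal implementation, and the helper names and toy values are made up for illustration. It shows that one-hot labels with categorical crossentropy and integer labels with sparse categorical crossentropy give the same loss.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(one_hot_targets, probs):
    # Mean over samples of -sum_k target_k * log(prob_k).
    return -np.mean(np.sum(one_hot_targets * np.log(probs), axis=1))

def sparse_categorical_crossentropy(int_targets, probs):
    # Same quantity, but picks the true-class probability by index.
    true_class_probs = probs[np.arange(len(int_targets)), int_targets]
    return -np.mean(np.log(true_class_probs))

# Three samples, four classes (toy values, not from the Reuters data).
logits = np.array([[2.0, 1.0, 0.1, -1.0],
                   [0.5, 2.5, 0.3, 0.0],
                   [1.0, 1.0, 1.0, 1.0]])
int_labels = np.array([0, 1, 3])
one_hot_labels = np.eye(4)[int_labels]

probs = softmax(logits)
loss_one_hot = categorical_crossentropy(one_hot_labels, probs)
loss_sparse = sparse_categorical_crossentropy(int_labels, probs)

print(probs.sum(axis=1))          # each row is a probability distribution
print(loss_one_hot, loss_sparse)  # the two losses agree
```

The only difference between the two label encodings is the interface: the integer form avoids materializing a one-hot matrix, which matters when there are many classes.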

## 3.6 -  Predicting house prices: a regression example



The two previous examples were classification problems, where the goal was to predict a single discrete label of an input data point. Another common type of machine-learning problem is **regression**, which consists of predicting a continuous value instead of a discrete label: for instance, predicting the temperature tomorrow, given meteorological data; or predicting the time that a software project will take to complete, given its specifications.

> Don’t confuse regression with the logistic regression algorithm. Despite its name, **logistic regression isn’t a regression algorithm: it’s a classification algorithm**.



### 3.6.1 -  The Boston Housing Price dataset



- You’ll attempt to predict the median price of homes in a given **Boston suburb in the mid-1970s** given data points about the suburb at the time, such as the **crime rate, the local property tax rate**, and so on. 
- The dataset has relatively **few data points**: only 506, split between 404 training samples and 102 test samples. 
- Each **feature** in the input data (for example, the crime rate) has a **different scale**. 
- For instance, some values are proportions, which take values between 0 and 1; others take values between 1 and 12, others between 0 and 100, and so on.

In [0]:
# Loading the Boston housing dataset

from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

# Training sample size
print(train_data.shape)

# Test sample size
print(test_data.shape)

In [0]:
import numpy as np
# Train targets are in thousands of US dollars; values lie between 5.0 and 50.0
print(np.min(train_targets))
print(np.max(train_targets))

### 3.6.2 - Preparing the data



**It would be problematic to feed into a neural network** values that all take wildly different ranges. The network might be able to automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal with such data is to do **feature-wise normalization**: for each feature in the input data (a column in the input data matrix), you subtract the mean of the feature and divide by the standard deviation, so that the feature is centered around 0 and has a unit standard deviation. This is easily done in [Scikit-learn](http://scikit-learn.org/stable/modules/preprocessing.html).


The [**preprocessing module**](http://scikit-learn.org/stable/modules/preprocessing.html) further provides a utility **class StandardScaler** that implements the Transformer API to compute the **mean** and **standard deviation** on a **training set** so as to be able to **later reapply the same transformation on the testing set**. 
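
For intuition, the same feature-wise normalization can be written by hand in NumPy. This is a minimal sketch on randomly generated stand-in arrays (`toy_train` and `toy_test` are hypothetical, not the Boston data), assuming the population standard deviation that `np.std` computes by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for train_data and test_data: 3 features on different scales.
scales = np.array([1.0, 12.0, 100.0])
toy_train = rng.random((404, 3)) * scales
toy_test = rng.random((102, 3)) * scales

# Statistics come from the training set only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

toy_train_scaled = (toy_train - mean) / std
# The test set reuses the training mean and standard deviation.
toy_test_scaled = (toy_test - mean) / std

print(toy_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(toy_train_scaled.std(axis=0))   # approximately 1 for every feature
print(toy_test_scaled.mean(axis=0))   # close to 0, but not exactly
```

The scaled test set is only approximately centered, which is expected: its statistics were never used.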

In [0]:
# Normalizing the data

from sklearn import preprocessing

# create a scaler to fit train_data
scaler = preprocessing.StandardScaler().fit(train_data)

# feature-wise normalization over train_data
train_data_scaled = scaler.transform(train_data)

# Scaled train data has zero mean and unit variance
print(train_data_scaled.mean(axis=0))
print(train_data_scaled.std(axis=0))

# Note that the quantities used for normalizing the 
# test data are computed using the training data. 
# You should never use in your workflow any quantity 
# computed on the test data, 
# even for something as simple as data normalization.
test_data_scaled = scaler.transform(test_data)

###  3.6.3 -  Building your network



Because so few samples are available, you’ll use a very small network with two hidden layers, each with 64 units. **In general, the less training data you have, the worse overfitting will be**, and using a small network is one way to mitigate overfitting.
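
To get a feel for how "small" this network is relative to the data, you can count its trainable parameters by hand. The sketch below assumes 13 input features (the feature count of the Boston housing data) and a single output unit, as is usual for scalar regression; `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    # One weight per input-unit pair, plus one bias per unit.
    return n_inputs * n_units + n_units

n_features = 13            # input features in the Boston housing data
layer_sizes = [64, 64, 1]  # two hidden layers and the scalar output

total = 0
n_inputs = n_features
for n_units in layer_sizes:
    total += dense_param_count(n_inputs, n_units)
    n_inputs = n_units

print(total)  # 5121
```

Even this small network has roughly twelve times more parameters than the 404 training samples, which is why overfitting is a concern here.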

In [0]:
# Model definition

from keras import models
from keras import layers

def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data_scaled.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    
    # The network ends with a single unit and no activation (it will be a linear layer). 
    # This is a typical setup for scalar regression (a regression where you’re trying 
    # to predict a single continuous value).
    model.add(layers.Dense(1))
    
    # Compile the network with the mse loss function: mean squared error,
    # the mean of the squared differences between the predictions and the targets.
    # This is a widely used loss function for regression problems.
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    
    # You’re also monitoring a new metric during training: mean absolute error (MAE). 
    # It’s the mean of the absolute differences between the predictions and the targets.
    return model
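
The two quantities named in the comments above are easy to compute by hand. This is a small worked example with made-up predictions and targets, in the same units as the dataset (thousands of dollars).

```python
import numpy as np

# Toy predictions and targets, in thousands of dollars (made-up values).
predictions = np.array([3.0, 5.0, 2.5])
targets = np.array([2.5, 5.0, 4.0])

errors = predictions - targets   # [0.5, 0.0, -1.5]
mse = np.mean(errors ** 2)       # (0.25 + 0.0 + 2.25) / 3
mae = np.mean(np.abs(errors))    # (0.5 + 0.0 + 1.5) / 3

print(round(mse, 4), round(mae, 4))  # 0.8333 0.6667
```

MAE stays in the units of the target, so an MAE of 0.5 here would mean being off by 500 dollars on average. MSE penalizes the single large error more heavily, which is one reason it is used as the loss while MAE is reported as the metric.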

### 3.6.4 -  Validating your approach using K-fold validation

- To evaluate your network while you keep adjusting its hyperparameters (such as the number of epochs used for training), you could **split the data into a training set and a validation set**.
- Because you have so **few data points**, the validation set would end up being very small, so the validation score could change a lot depending on which points happen to land in it.
- The best practice in such situations is to use **K-fold cross-validation**:
    - Split the available data into K partitions (typically K = 4 or 5).
    - Instantiate K identical models, and train each one on K – 1 partitions while evaluating on the remaining partition.
    - The **validation score** for the model used is then the **average of the K validation scores obtained**.
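
The partitioning step can be sketched in plain NumPy. This is an illustration of the idea, not scikit-learn's implementation: `kfold_indices` is a hypothetical helper that cuts the sample indices into K contiguous folds without shuffling, and the fold scores at the end are made up.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        # Every fold except the i-th one is used for training.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

# 404 matches the training-set size quoted earlier; K = 4 gives folds of 101.
splits = list(kfold_indices(404, 4))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(fold, len(train_idx), len(val_idx))

# With one validation score per fold, the reported score is their mean.
fold_scores = [2.0, 3.0, 2.5, 2.5]  # made-up scores for illustration
print(np.mean(fold_scores))         # 2.5
```

Every sample is used for validation exactly once and for training K – 1 times, which is what makes the averaged score more stable than a single small hold-out set.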
    

In [0]:
from sklearn.model_selection import KFold

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

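# Define the K-fold cross-validation splitter. Recent scikit-learn versions
# raise an error if random_state is set without shuffle=True.
kf = KFold(n_splits=4, shuffle=True, random_state=seed)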

all_mae_histories = []

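for train_idx, val_idx in kf.split(train_data_scaled, train_targets):
    # Build a fresh, untrained model for each fold
    model = build_model()
    history = model.fit(train_data_scaled[train_idx], 
              train_targets[train_idx], 
              epochs=100, 
              batch_size=1, 
              verbose=1,
              validation_data=(train_data_scaled[val_idx], 
                               train_targets[val_idx]))
    # Older Keras logs this metric as 'val_mean_absolute_error';
    # Keras 2.3 and later log it as 'val_mae'.
    mae_key = 'val_mae' if 'val_mae' in history.history else 'val_mean_absolute_error'
    mae_history = history.history[mae_key]
    all_mae_histories.append(mae_history)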

In [0]:
# Validation MAE for each fold, averaged over all epochs
np.mean(all_mae_histories, axis=1)

The different runs do indeed show rather different validation scores, from 2.1 to 2.7. The average (2.5) is a much more reliable metric than any single score; that’s the entire point of K-fold cross-validation. In this case, you’re off by USD 2,500 on average, which is significant considering that the prices range from USD 5,000 to USD 50,000.

In [0]:
np.mean(all_mae_histories)

In [0]:
# Building the history of successive mean K-fold validation scores
num_epochs = 100
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) 
                       for i in range(num_epochs)]
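
The difference between averaging per fold and averaging per epoch comes down to the axis. The toy example below uses made-up histories (2 folds by 3 epochs) to show that the list comprehension in the cell above is equivalent to `np.mean(..., axis=0)`, while the earlier per-fold summary corresponds to `axis=1`.

```python
import numpy as np

# Made-up histories: 2 folds (rows) by 3 epochs (columns).
toy_histories = [[4.0, 3.0, 2.0],
                 [6.0, 3.0, 4.0]]

# Same list comprehension as in the cell above.
per_epoch_loop = [np.mean([x[i] for x in toy_histories]) for i in range(3)]

per_epoch = np.mean(toy_histories, axis=0)  # across folds: one value per epoch
per_fold = np.mean(toy_histories, axis=1)   # across epochs: one value per fold

print(per_epoch)  # one entry per epoch: 5.0, 3.0, 3.0
print(per_fold)   # one entry per fold: 3.0 and about 4.33
```

The per-epoch curve is the one to plot when choosing the number of epochs; the per-fold values show how much the score depends on the split.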

In [0]:
import matplotlib.pyplot as plt

plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

In [0]:
epochs_min = np.argmin(average_mae_history)
# argmin is zero-based; add 1 to report the epoch number used on the plot's x-axis
print('Minimum MAE: {:.4f}\nEpoch: {:d}'.format(average_mae_history[epochs_min], epochs_min + 1))

In [0]:
# Training the final model

model = build_model()
model.fit(train_data_scaled, 
          train_targets,
          epochs=76, 
          batch_size=1, 
          verbose=0)

test_mse_score, test_mae_score = model.evaluate(test_data_scaled, test_targets)

In [0]:
test_mae_score

### 3.6.5 - Wrapping up

Here’s what you should take away from this example:

- **Mean squared error (MSE)** is a **loss function** commonly used for **regression**.
- A common **regression metric** is **mean absolute error (MAE)**.
- When **features** in the input data have values in **different ranges**, each feature should be **scaled independently** as a preprocessing step.
- When there is **little data available, using K-fold validation** is a great way to reliably evaluate a model.
- When **little training data** is available, it’s preferable to use a **small network with few hidden layers** (typically only one or two), in order to avoid severe overfitting.
