* Understanding convolutional neural networks (convnets)
* Using data augmentation to mitigate overfitting
* Using a pretrained convnet to do feature extraction
* Fine-tuning a pretrained convnet
* Visualizing what convnets learn and how they make classification decisions

#### Instantiating a small convnet

In [1]:
from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

A convnet takes as input tensors of shape (image_height, image_width, image_channels) (not including the batch dimension). In this case, we’ll configure the convnet to process inputs of size (28, 28, 1), which is the format of MNIST images. We’ll do this by passing the argument input_shape=(28, 28, 1) to the first layer.

In [None]:
model.summary()

The output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels). The width and height dimensions tend to shrink as you go deeper in the network. The number of channels is controlled by the first argument passed to the Conv2D layers (32 or 64).
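The shrinking described above can be checked by hand. The following is a minimal sketch, assuming unpadded 3 × 3 convolutions with stride 1 and 2 × 2 max pooling with stride 2 (the settings used in the cell above); the helper names `conv_out` and `pool_out` are made up for this illustration and are not part of Keras.

```python
def conv_out(size, window=3):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - window + 1


def pool_out(size, window=2):
    """Spatial size after max pooling whose stride equals its window size."""
    return size // window


size = 28  # MNIST images are 28 x 28
mnist_sizes = [size]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    size = conv_out(size) if layer == "conv" else pool_out(size)
    mnist_sizes.append(size)

print(mnist_sizes)  # [28, 26, 13, 11, 5, 3]
```

The last value, 3, matches the (3, 3, 64) output that is fed to the classifier.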

The next step is to feed the last output tensor (of shape (3, 3, 64)) into a densely connected classifier network like those you’re already familiar with: a stack of Dense layers. These classifiers process vectors, which are 1D, whereas the current output is a 3D tensor. First we have to flatten the 3D outputs to 1D, and then add a few Dense layers on top.

#### Adding a classifier on top of the convnet

In [None]:
model.add(layers.Flatten()) #the (3, 3, 64) outputs are flattened into vectors of shape (576,) before going through two Dense layers.
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

We’ll do 10-way classification, using a final layer with 10 outputs and a softmax activation. Here’s what the network looks like now:

In [None]:
model.summary()
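The parameter counts that a summary like this reports can be reproduced with simple arithmetic. This is a sketch under the assumption that every layer has a bias term; the helper functions and the layer labels in the dictionary are hypothetical names chosen for the example.

```python
def conv2d_params(window_h, window_w, input_depth, output_depth):
    """Weights of one kernel per output filter, plus one bias per filter."""
    return window_h * window_w * input_depth * output_depth + output_depth


def dense_params(inputs, units):
    """One weight per (input, unit) pair, plus one bias per unit."""
    return inputs * units + units


param_counts = {
    "conv2d_1": conv2d_params(3, 3, 1, 32),
    "conv2d_2": conv2d_params(3, 3, 32, 64),
    "conv2d_3": conv2d_params(3, 3, 64, 64),
    "dense_1": dense_params(3 * 3 * 64, 64),  # Flatten yields 576 values
    "dense_2": dense_params(64, 10),
}
total_params = sum(param_counts.values())

for name, count in param_counts.items():
    print(name, count)
print("total", total_params)
```

Pooling and Flatten layers contribute no parameters, so they are left out of the tally.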

#### Training the convnet on MNIST images

In [None]:
from keras.datasets import mnist
from keras.utils import to_categorical
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=5, batch_size=64)

#### Evaluating the model on the test data

In [None]:
test_loss, test_acc = model.evaluate(test_images, test_labels)

In [None]:
test_acc

Whereas the densely connected network from chapter 2 had a test accuracy of 97.8%, the basic convnet has a test accuracy of 99.3%: we decreased the error rate by 68% (relative). Not bad!
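The 68% figure is a relative reduction in error rate, not a difference in accuracy. A short check, using only the two accuracies quoted above:

```python
dense_acc = 0.978    # densely connected network
convnet_acc = 0.993  # basic convnet

dense_error = 1 - dense_acc      # about 2.2%
convnet_error = 1 - convnet_acc  # about 0.7%
relative_reduction = (dense_error - convnet_error) / dense_error

print(f"{relative_reduction:.0%}")  # 68%
```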

### The convolution operation

*Dense* layers learn global patterns in their input feature space (for example, for an MNIST digit, patterns involving all pixels).

*Convolution layers* learn local patterns: in the case of images, patterns found in small 2D windows of the inputs. In the previous example, these windows were all 3 × 3.

![convnet.PNG](attachment:convnet.PNG)

This key characteristic gives convnets two interesting properties:

*The patterns they learn are translation invariant:*

After learning a certain pattern in the lower-right corner of a picture, a convnet can recognize it anywhere: for example, in the upper-left corner.
A densely connected network would have to learn the pattern anew if it appeared at a new location. This makes convnets data efficient when processing images (because the visual world is fundamentally translation invariant): they need fewer training samples to learn representations that have generalization power.

*They can learn spatial hierarchies of patterns:* 

A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layers, and so on. This allows convnets to efficiently learn increasingly complex and abstract visual concepts (because the visual world is fundamentally spatially hierarchical).

![heir.PNG](attachment:heir.PNG)

Convolutions operate over 3D tensors, called *feature maps*, with two spatial axes (height
and width) as well as a *depth* axis (also called the *channels* axis). 

For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. 

For a black-and-white picture, like the MNIST digits, the depth is 1 (levels of gray). 

The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. 

This output feature map is still a 3D tensor: it has a width and a height. Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB input; rather, **they stand for filters**. Filters encode specific aspects of the input data: at a high level, a single filter could encode the concept “presence of a face in the input,” for instance.

In the MNIST example, the first convolution layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32): it computes 32 filters over its input. 

Each of these 32 output channels contains a 26 × 26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input. 
That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.

![conv.PNG](attachment:conv.PNG)

Convolutions are defined by two key parameters:
* Size of the patches extracted from the inputs: These are typically 3 × 3 or 5 × 5. 
* Depth of the output feature map: The number of filters computed by the convolution. The example started with a depth of 32 and ended with a depth of 64.

In Keras Conv2D layers, these parameters are the first arguments passed to the layer:

```Conv2D(output_depth, (window_height, window_width))```

A convolution works by sliding these windows of size 3 × 3 or 5 × 5 over the 3D input
feature map, stopping at every possible location, and extracting the 3D patch of surrounding features (shape (window_height, window_width, input_depth)). 

Each such 3D patch is then transformed (via a tensor product with the same learned weight
matrix, called the convolution kernel) into a 1D vector of shape (output_depth,). 

All of these vectors are then spatially reassembled into a 3D output map of shape (height, width, output_depth). 

Every spatial location in the output feature map corresponds to the same location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input). 

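For instance, with 3 × 3 windows, the vector output[i, j, :] comes from the 3D patch
input[i-1:i+2, j-1:j+2, :] (Python slice notation, where the end index is exclusive): the 3 × 3 window centered on position (i, j), ignoring the index offset introduced by border effects.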
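The sliding-window procedure can be sketched in plain Python. This is a naive illustration, not how Keras computes convolutions: it assumes no padding and stride 1, uses nested lists instead of tensors, and the function name `conv2d_valid` and the two hand-written filters are invented for the example.

```python
def conv2d_valid(feature_map, kernel):
    """Naive convolution without padding, stride 1.

    feature_map: nested lists, shape (height, width, input_depth)
    kernel: nested lists, shape (window_h, window_w, input_depth, output_depth)
    """
    height, width = len(feature_map), len(feature_map[0])
    input_depth = len(feature_map[0][0])
    win_h, win_w = len(kernel), len(kernel[0])
    output_depth = len(kernel[0][0][0])

    output = []
    for i in range(height - win_h + 1):      # every vertical window position
        row = []
        for j in range(width - win_w + 1):   # every horizontal window position
            vector = [0.0] * output_depth    # one value per filter
            for di in range(win_h):
                for dj in range(win_w):
                    for c in range(input_depth):
                        value = feature_map[i + di][j + dj][c]
                        for f in range(output_depth):
                            vector[f] += value * kernel[di][dj][c][f]
            row.append(vector)
        output.append(row)
    return output


# 5 x 5 single-channel input whose pixel value equals its column index
image = [[[float(col)] for col in range(5)] for _ in range(5)]

# Two 3 x 3 filters: filter 0 sums the window,
# filter 1 subtracts the left column from the right column
kernel = [[[[1.0, float(dj - 1)]] for dj in range(3)] for di in range(3)]

output = conv2d_valid(image, kernel)

# Equivalent of output[:, :, 0]: the response map of filter 0
response_map_0 = [[vector[0] for vector in row] for row in output]
print(len(output), len(output[0]), len(output[0][0]))  # 3 3 2
print(response_map_0[0])  # [9.0, 18.0, 27.0]
```

The same kernel is applied at every location, which is why a pattern learned in one corner is recognized everywhere else.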

![convwork.PNG](attachment:convwork.PNG)

#### Understanding border effects and padding

Consider a 5 × 5 input feature map (25 tiles total). There are only 9 tiles around which you can center a 3 × 3 window (filter, kernel), forming a 3 × 3 grid (the output feature map). Hence, the output feature map will be 3 × 3. It shrinks a little: by exactly two tiles along each dimension, in this case. You can see this border effect in action in the earlier example: you start with 28 × 28 inputs, which become 26 × 26 after the first convolution layer.

![valid.PNG](attachment:valid.PNG)

If you want to get an output feature map with the same spatial dimensions as the input, you can use **padding**. Padding consists of adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit centered convolution windows around every input tile. For a 3 × 3 window, you add one column on the right, one column on the left, one row at the top, and one row at the bottom. For a 5 × 5 window, you add two rows and two columns on each side.

![padd.PNG](attachment:padd.PNG)
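The effect of padding on the number of window positions can be verified with a small sketch. It assumes the added rows and columns are filled with zeros; `zero_pad` and `count_window_positions` are hypothetical helpers written for this example.

```python
def zero_pad(grid, pad):
    """Return a copy of a 2D grid with `pad` rows/columns of zeros on each side."""
    width = len(grid[0])
    blank = [0] * (width + 2 * pad)
    padded = [list(blank) for _ in range(pad)]
    for row in grid:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend(list(blank) for _ in range(pad))
    return padded


def count_window_positions(grid, window):
    """Number of (rows, columns) where a window x window patch fits, stride 1."""
    return len(grid) - window + 1, len(grid[0]) - window + 1


grid = [[1] * 5 for _ in range(5)]  # 5 x 5 input feature map

unpadded_shape = count_window_positions(grid, 3)               # (3, 3)
padded_shape_3x3 = count_window_positions(zero_pad(grid, 1), 3)  # (5, 5)
padded_shape_5x5 = count_window_positions(zero_pad(grid, 2), 5)  # (5, 5)
print(unpadded_shape, padded_shape_3x3, padded_shape_5x5)
```

In both padded cases the output regains the 5 × 5 size of the input.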

In *Conv2D* layers, padding is configurable via the padding argument, which takes two values:

* "valid": no padding (only valid window locations will be used).
* "same": “pad in such a way as to have an output with the same width and height as the input.”

The padding argument defaults to "valid".

#### Understanding convolution strides

The other factor that can influence output size is the notion of strides. 

The distance between two successive windows is a parameter of the convolution, called its stride, which defaults to 1. 

It’s possible to have strided convolutions: convolutions with a stride higher than 1. In the figure below, you can see the patches extracted by a 3 × 3 convolution with stride 2 over a 5 × 5 input (without padding).

![stride.PNG](attachment:stride.PNG)

Using stride 2 means the width and height of the feature map are downsampled by a
factor of 2.
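A quick way to see which patches a strided convolution extracts is to list the top-left corner of every window that fits. This is a sketch without padding; `window_origins` is a made-up helper name.

```python
def window_origins(size, window, stride):
    """Top-left offsets of every window that fits along one axis."""
    return list(range(0, size - window + 1, stride))


origins_stride_1 = window_origins(5, 3, 1)  # [0, 1, 2]
origins_stride_2 = window_origins(5, 3, 2)  # [0, 2]

# All patches extracted by a 3 x 3 convolution with stride 2 over a 5 x 5 input
patches = [(row, col) for row in origins_stride_2 for col in origins_stride_2]
print(patches)  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

With stride 2 the 5 × 5 input yields a 2 × 2 output instead of 3 × 3: roughly half as many positions along each axis, on top of the shrinkage caused by border effects.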

Strided convolutions are rarely used in practice, although they can come in handy for some types of models; it’s good to be familiar with the concept.

To downsample feature maps, instead of strides, we tend to use the **max-pooling** operation.

#### The max-pooling operation

The role of max pooling is to aggressively downsample feature maps, much like strided convolutions.

Max pooling consists of extracting windows from the input feature maps and outputting the max value of each channel. 

It’s conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation (the convolution kernel), they’re transformed via a hardcoded **max** tensor operation. 

Max pooling is usually done with 2 × 2 windows and stride 2, in order to downsample the feature maps by a factor of 2. 

Convolution is typically done with 3 × 3 windows and no stride (stride 1).
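Max pooling over a single channel can be sketched in a few lines. The example assumes 2 × 2 windows with stride 2 and no padding; the function name and the sample values are invented for illustration.

```python
def max_pool_2x2(channel):
    """2 x 2 max pooling with stride 2 over one channel (a list of rows)."""
    pooled = []
    for i in range(0, len(channel) - 1, 2):
        row = []
        for j in range(0, len(channel[0]) - 1, 2):
            window = (channel[i][j], channel[i][j + 1],
                      channel[i + 1][j], channel[i + 1][j + 1])
            row.append(max(window))  # hardcoded max instead of learned weights
        pooled.append(row)
    return pooled


channel = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(channel)
print(pooled)  # [[4, 2], [2, 8]]
```

Each channel is pooled independently, so the depth of the feature map is unchanged while its width and height are halved.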

Why downsample feature maps this way? Why not remove the max-pooling layers and keep fairly large feature maps all the way up? Let’s look at this option. The convolutional base of the model would then look like this:

In [None]:
model_no_max_pool = models.Sequential()
model_no_max_pool.add(layers.Conv2D(32, (3, 3), activation='relu',
                                    input_shape=(28, 28, 1)))
model_no_max_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))
model_no_max_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))

In [None]:
model_no_max_pool.summary()

What’s wrong with this setup? Two things:

* It isn’t conducive to learning a spatial hierarchy of features. The 3 × 3 windows in the third layer will only contain information coming from 7 × 7 windows in the initial input. The high-level patterns learned by the convnet will still be very small with regard to the initial input, which may not be enough to learn to classify digits (try recognizing a digit by only looking at it through windows that are 7 × 7 pixels!). We need the features from the last convolution layer to contain information about the totality of the input.

* The final feature map has 22 × 22 × 64 = 30,976 total coefficients per sample. This is huge. If you were to flatten it to stick a Dense layer of size 512 on top, that layer would have 15.8 million parameters. This is far too large for such a small model and would result in intense overfitting.

In short, the reason to use downsampling is to reduce the number of feature-map coefficients to process, as well as to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows.
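Both numbers in the argument above (the 7 × 7 window and the 15.8 million parameters) can be reproduced with a short calculation. This is a sketch using the standard receptive-field recurrence; `receptive_field` is a hypothetical helper, and each layer is described only by its window size and stride.

```python
def receptive_field(layers):
    """Size of the input window seen by one output value.

    layers: list of (window, stride) pairs, ordered from input to output.
    """
    field, jump = 1, 1
    for window, stride in layers:
        field += (window - 1) * jump  # each layer widens the view
        jump *= stride                # strides multiply the step between positions
    return field


conv, pool = (3, 1), (2, 2)
field_without_pooling = receptive_field([conv, conv, conv])
field_with_pooling = receptive_field([conv, pool, conv, pool, conv])
print(field_without_pooling, field_with_pooling)  # 7 18

# Size of the final feature map without pooling, and a Dense(512) on top of it
flattened = 22 * 22 * 64
dense_512_params = flattened * 512 + 512
print(flattened, dense_512_params)  # 30976 15860224
```

With the two pooling layers in place, each value in the last feature map draws on an 18 × 18 region of the 28 × 28 input rather than a 7 × 7 one.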

## Training a convnet from scratch on a small dataset

We’ll focus on classifying images as dogs or cats, in a dataset containing 4,000 pictures of cats and dogs (2,000 cats, 2,000 dogs). We’ll use 2,000 pictures for training, 1,000 for validation, and 1,000 for testing.

One basic strategy to tackle this problem is to train a new model from scratch using what little data you have.

Start by naively training a small convnet on the 2,000 training samples, without any regularization, to set a baseline for what can be achieved. This will get you to a classification accuracy of 71%. 

At that point, the main issue will be overfitting. Then we’ll introduce **data augmentation**, a powerful technique for mitigating overfitting in computer vision. By using data augmentation, you’ll improve the network to reach an accuracy of 82%.

We’ll review two more essential techniques for applying deep learning to small datasets: 

* Feature extraction with a pretrained network (which will get you to an accuracy of 90% to 96%)
* Fine-tuning a pretrained network (this will get you to a final accuracy of 97%)

Together, these three strategies—training a small model from scratch, doing feature extraction using a pretrained model, and fine-tuning a pretrained model—will constitute your future toolbox for tackling the problem of performing image classification with small datasets.

#### The relevance of deep learning for small-data problems

One fundamental characteristic of deep learning is that it can find interesting features in the training data on its own, without any need for manual feature engineering, and this can only be achieved when lots of training examples are available. 

This is especially true for problems where the input samples are very high dimensional, like images.

But what constitutes lots of samples is relative—relative to the size and depth of the network you’re trying to train, for starters. 

It isn’t possible to train a convnet to solve a complex problem with just a few tens of samples, but a few hundred can potentially suffice *if the model is small and well regularized and the task is simple*. Because convnets learn local, translation-invariant features, they’re highly data efficient on perceptual problems. 

Training a convnet from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any custom feature engineering.

What’s more, deep-learning models are by nature highly repurposable: 

You can take, say, an image-classification or speech-to-text model trained on a large-scale dataset and reuse it on a significantly different problem with only minor changes.

Specifically, in the case of computer vision, many pretrained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to bootstrap powerful vision models out of very little data.

#### Downloading the data

The Dogs vs. Cats dataset that you’ll use isn’t packaged with Keras. You can download the original dataset from www.kaggle.com/c/dogs-vs-cats/data

The dogs-versus-cats Kaggle competition in 2013 was won by entrants who used convnets. 

The best entries achieved up to 95% accuracy. 

You’ll get fairly close to this accuracy, even though you’ll train your models on less than 10% of the data that was available to the competitors.

This dataset contains 25,000 images of dogs and cats (12,500 from each class).

After downloading and uncompressing it, you’ll create a new dataset containing three subsets:

* a training set with 1,000 samples of each class

* a validation set with 500 samples of each class

* a test set with 500 samples of each class.

##### Copying images to training, validation, and test directories

In [None]:
import os, shutil

# Path to the directory where the original dataset was uncompressed
original_dataset_dir = '/Users/fchollet/Downloads/kaggle_original_data' 

# Directory where you’ll store your smaller dataset
base_dir = '/Users/fchollet/Downloads/cats_and_dogs_small'
os.mkdir(base_dir)

# Directories for the training, validation, and test splits
train_dir = os.path.join(base_dir, 'train')
os.mkdir(train_dir)
validation_dir = os.path.join(base_dir, 'validation')
os.mkdir(validation_dir)
test_dir = os.path.join(base_dir, 'test')
os.mkdir(test_dir)


# Directory with training cat pictures
train_cats_dir = os.path.join(train_dir, 'cats')
os.mkdir(train_cats_dir)

# Directory with training dog pictures
train_dogs_dir = os.path.join(train_dir, 'dogs')
os.mkdir(train_dogs_dir)



# Directory with validation cat pictures
validation_cats_dir = os.path.join(validation_dir, 'cats')
os.mkdir(validation_cats_dir)

# Directory with validation dog pictures
validation_dogs_dir = os.path.join(validation_dir, 'dogs')
os.mkdir(validation_dogs_dir)




# Directory with test cat pictures
test_cats_dir = os.path.join(test_dir, 'cats')
os.mkdir(test_cats_dir)

# Directory with test dog pictures
test_dogs_dir = os.path.join(test_dir, 'dogs')
os.mkdir(test_dogs_dir)



# Copies the first 1,000 cat images to train_cats_dir
fnames = ['cat.{}.jpg'.format(i) for i in range(1000)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(train_cats_dir, fname)
    shutil.copyfile(src, dst)

# Copies the next 500 cat images to validation_cats_dir
fnames = ['cat.{}.jpg'.format(i) for i in range(1000, 1500)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(validation_cats_dir, fname)
    shutil.copyfile(src, dst)

# Copies the next 500 cat images to test_cats_dir
fnames = ['cat.{}.jpg'.format(i) for i in range(1500, 2000)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(test_cats_dir, fname)
    shutil.copyfile(src, dst)

# Copies the first 1,000 dog images to train_dogs_dir
fnames = ['dog.{}.jpg'.format(i) for i in range(1000)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(train_dogs_dir, fname)
    shutil.copyfile(src, dst)

    
# Copies the next 500 dog images to validation_dogs_dir
fnames = ['dog.{}.jpg'.format(i) for i in range(1000, 1500)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(validation_dogs_dir, fname)
    shutil.copyfile(src, dst)


# Copies the next 500 dog images to test_dogs_dir
fnames = ['dog.{}.jpg'.format(i) for i in range(1500, 2000)]
for fname in fnames:
    src = os.path.join(original_dataset_dir, fname)
    dst = os.path.join(test_dogs_dir, fname)
    shutil.copyfile(src, dst)    
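The six copy loops above differ only in class name, index range, and destination. One possible way to fold them into a single helper is sketched here; `copy_range` is a hypothetical function, and the demo runs on a throwaway temporary directory filled with empty placeholder files rather than on the real dataset.

```python
import os
import shutil
import tempfile


def copy_range(src_dir, dst_dir, prefix, start, stop):
    """Copy '<prefix>.<i>.jpg' for i in [start, stop) from src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)  # unlike os.mkdir, safe to rerun
    for i in range(start, stop):
        fname = '{}.{}.jpg'.format(prefix, i)
        shutil.copyfile(os.path.join(src_dir, fname),
                        os.path.join(dst_dir, fname))


# Demo setup: 8 empty placeholder files per class in a temporary directory
demo_root = tempfile.mkdtemp()
demo_src = os.path.join(demo_root, 'original')
os.makedirs(demo_src)
for prefix in ('cat', 'dog'):
    for i in range(8):
        open(os.path.join(demo_src, '{}.{}.jpg'.format(prefix, i)), 'w').close()

# For the real dataset the ranges would be (0, 1000), (1000, 1500), (1500, 2000)
demo_splits = {'train': (0, 4), 'validation': (4, 6), 'test': (6, 8)}
for split, (start, stop) in demo_splits.items():
    for prefix in ('cat', 'dog'):
        dst = os.path.join(demo_root, 'small', split, prefix + 's')
        copy_range(demo_src, dst, prefix, start, stop)

demo_counts = {
    split: len(os.listdir(os.path.join(demo_root, 'small', split, 'cats')))
    for split in demo_splits
}
print(demo_counts)  # {'train': 4, 'validation': 2, 'test': 2}
shutil.rmtree(demo_root)  # clean up the throwaway directories
```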
    

Let’s count how many pictures are in each split (train/validation/test):

In [2]:
print('total training cat images:', len(os.listdir(train_cats_dir)))
print('total training dog images:', len(os.listdir(train_dogs_dir)))
print('total validation cat images:', len(os.listdir(validation_cats_dir)))
print('total validation dog images:', len(os.listdir(validation_dogs_dir)))
print('total test cat images:', len(os.listdir(test_cats_dir)))
print('total test dog images:', len(os.listdir(test_dogs_dir)))

So you do indeed have 2,000 training images, 1,000 validation images, and 1,000 test images. 
Each split contains the same number of samples from each class: this is a **balanced binary-classification** problem, which means classification accuracy will be an appropriate measure of success.

### Building your network

The convnet will be a stack of alternating *Conv2D* (with relu activation) and *MaxPooling2D* layers.

Because you’re dealing with bigger images and a more complex problem, you’ll make your network larger, accordingly: 
it will have one more Conv2D + MaxPooling2D stage. This serves both to augment the capacity of the network and to further reduce the size of the feature maps so they aren’t overly large when you reach the Flatten layer.

Because you start from inputs of size 150 × 150 (a somewhat arbitrary choice), you end up with feature maps of size 7 × 7 just before the Flatten layer.

**The depth of the feature maps progressively increases in the network (from 32 to 128), whereas the size of the feature maps decreases (from 148 × 148 to 7 × 7). This is a pattern you’ll see in almost all convnets.**
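The pattern can be traced layer by layer. The sketch below assumes four stages of an unpadded 3 × 3 convolution followed by 2 × 2 max pooling, with 32, 64, 128, and 128 filters, starting from 150 × 150 RGB inputs.

```python
height = width = 150
depth = 3
shapes = [(height, width, depth)]

for filters in (32, 64, 128, 128):
    # 3 x 3 convolution without padding: lose 2 pixels per axis, depth = filters
    height, width, depth = height - 2, width - 2, filters
    shapes.append((height, width, depth))
    # 2 x 2 max pooling with stride 2: halve each axis, depth unchanged
    height, width = height // 2, width // 2
    shapes.append((height, width, depth))

flattened_size = height * width * depth

for shape in shapes:
    print(shape)
print("flattened:", flattened_size)  # flattened: 6272
```

The first convolution outputs (148, 148, 32) and the last pooling layer outputs (7, 7, 128), which is the range quoted above.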

Because you’re attacking a binary-classification problem, you’ll end the network with a single unit (a Dense layer of size 1) and a **sigmoid** activation. This unit will encode the probability that the network is looking at one class or the other.

### Instantiating a small convnet for dogs vs. cats classification

In [3]:
from keras import layers
from keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

Let’s look at how the dimensions of the feature maps change with every successive layer:

In [4]:
model.summary()

For the compilation step, you’ll go with the **RMSprop** optimizer, as usual. Because you ended the network with a single sigmoid unit, you’ll use binary crossentropy as the loss:

#### Configuring the model for training

In [5]:
from keras import optimizers
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])
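To make the pairing of a sigmoid output with binary crossentropy concrete, here is a minimal sketch of both formulas for a single sample. It is plain Python, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def sigmoid(x):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample; y_true is 0 or 1, y_pred is a probability."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


p = sigmoid(2.0)                             # predicted probability of class 1
loss_if_correct = binary_crossentropy(1, p)  # true label is 1: small loss
loss_if_wrong = binary_crossentropy(0, p)    # true label is 0: large loss
print(round(p, 3), round(loss_if_correct, 3), round(loss_if_wrong, 3))
```

A confident prediction is penalized lightly when it is right and heavily when it is wrong, which is the behavior you want from a loss on probabilities.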

#### Data preprocessing

Currently, the data sits on a drive as JPEG files, so the steps for getting it into the network are roughly as follows:

1. Read the picture files.
2. Decode the JPEG content to RGB grids of pixels.
3. Convert these into floating-point tensors.
4. Rescale the pixel values (between 0 and 255) to the [0, 1] interval (as you know, neural networks prefer to deal with small input values).
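Steps 3 and 4 amount to a type conversion and a division. The sketch below illustrates them on a tiny hand-written 2 × 2 RGB grid, which stands in for an already decoded image; reading and decoding the JPEG (steps 1 and 2) are assumed to have happened elsewhere.

```python
# Hypothetical decoded image: 2 x 2 pixels, 8-bit integer RGB values
decoded = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (128, 128, 128)],
]

# Convert to floats and rescale from [0, 255] to [0, 1]
rescaled = [
    [tuple(channel / 255.0 for channel in pixel) for pixel in row]
    for row in decoded
]

print(rescaled[0][0])  # (1.0, 0.0, 0.0)
```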

*Keras* has a module with image-processing helper tools, located at **keras.preprocessing.image**. 

In particular, it contains the class **ImageDataGenerator**, which lets you quickly set up Python generators that can automatically turn image files on disk into batches of preprocessed tensors.

#### Using ImageDataGenerator to read images from directories

In [6]:
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir, # Target directory
    target_size=(150, 150), # Resizes all images to 150 × 150
    batch_size=20,
    class_mode='binary') # Because you use binary_crossentropy loss, you need binary labels.

validation_generator = test_datagen.flow_from_directory(
    validation_dir, target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

Let’s look at the output of one of these generators: 
it yields batches of 150 × 150 RGB images (shape (20, 150, 150, 3)) and binary labels (shape (20,)). 
There are 20 samples in each batch (the batch size). 

**Note that the generator yields these batches *indefinitely*: it loops endlessly over the images in the target folder. For this reason, you need to break the iteration loop at some point:**

In [8]:
for data_batch, labels_batch in train_generator:
    print('data batch shape:', data_batch.shape)
    print('labels batch shape:', labels_batch.shape)
    break
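The endless looping can be mimicked with an ordinary Python generator. This is a simplified stand-in for illustration, not the Keras iterator: `endless_batches` is a made-up name, and the data is a list of integers instead of images.

```python
import itertools


def endless_batches(samples, batch_size):
    """Yield batches forever, wrapping around to the start of the data."""
    while True:
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]


samples = list(range(10))

# islice plays the role of `break`: it stops after a fixed number of batches
first_seven = list(itertools.islice(endless_batches(samples, 4), 7))
for batch in first_seven:
    print(batch)
```

The fourth batch starts again at 0, which shows that the generator never signals the end of the data on its own.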

Let’s fit the model to the data using the generator. 
You do so using the *fit_generator* method, the equivalent of fit for data generators like this one. 
It expects as its first argument a Python generator that will yield batches of inputs and targets **indefinitely**, like this one does. 

Because the data is being generated endlessly, the Keras model needs to know **how many samples to draw from the generator before declaring an epoch over**. 

This is the role of the **steps_per_epoch** argument: after having drawn steps_per_epoch batches from the generator (that is, after having run for steps_per_epoch gradient descent steps), the fitting process will go to the next epoch. In this case, batches are 20 samples, so it will take 100 batches until you see your target of 2,000 samples.

When using *fit_generator*, you can pass a *validation_data* argument, much as with the fit method. It’s important to note that this argument is allowed to be a data generator, but it could also be a tuple of Numpy arrays. If you pass a generator as validation_data, then this generator is expected to yield batches of validation data endlessly; thus you should also specify the validation_steps argument, which tells the process how many batches to draw from the validation generator for evaluation.
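The step counts follow directly from the dataset sizes and the batch size. A small sketch, where `steps_for` is a hypothetical helper that rounds up so a final partial batch is still counted:

```python
import math


def steps_for(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


steps_per_epoch = steps_for(2000, 20)   # 2,000 training images
validation_steps = steps_for(1000, 20)  # 1,000 validation images
print(steps_per_epoch, validation_steps)  # 100 50

# With a different batch size the step count changes accordingly
print(steps_for(2000, 32))  # 63
```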

### Fitting the model using a batch generator

In [10]:
history = model.fit_generator(
    train_generator,     
    steps_per_epoch=100,  # Batches per epoch: 100 batches × 20 samples = 2,000 samples
    epochs=30,  # number of epochs
    validation_data=validation_generator,
    validation_steps=50)

### Saving the model

It’s good practice to always save your models after training.

In [11]:
model.save('cats_and_dogs_small_1.h5')

##### Displaying curves of loss and accuracy during training

In [12]:
import matplotlib.pyplot as plt
%matplotlib inline
acc = history.history['acc']
val_acc = history.history['val_acc']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

These plots are characteristic of **overfitting**: 

The training accuracy increases linearly over time, until it reaches nearly 100%, whereas the validation accuracy stalls at 70–72%.

The validation loss reaches its minimum after only five epochs and then stalls, whereas the training loss keeps decreasing linearly until it reaches nearly 0.
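Besides reading it off the plot, the epoch with the lowest validation loss can be found programmatically. The sketch below uses made-up loss values purely to illustrate the shape described above; they are not results from this model. With a real run, the list would come from `history.history['val_loss']`.

```python
# Made-up values for illustration only: the loss falls, bottoms out, then rises
val_loss = [0.69, 0.64, 0.60, 0.57, 0.55, 0.56, 0.58, 0.61, 0.66, 0.72]

# Epochs are numbered from 1, list indices from 0
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
epochs_without_improvement = len(val_loss) - best_epoch

print(best_epoch, epochs_without_improvement)  # 5 5
```

A long run of epochs after the minimum, while the training loss keeps falling, is the signature of overfitting.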

*Because you have relatively few training samples (2,000), overfitting will be your number-one concern.* 

You already know about a number of techniques that can help mitigate overfitting, such as *dropout and weight decay (L2 regularization)*. 

We’re now going to work with a new one, specific to computer vision and used almost universally when processing images with deep-learning models: **data augmentation**.

#### Using data augmentation

Overfitting is caused by having too few samples to learn from, rendering you unable
to train a model that can generalize to new data. 

Given infinite data, your model would be exposed to **every possible aspect of the data distribution at hand**: you would never overfit. 

**Data augmentation** takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images. The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better.

In Keras, this can be done by configuring a number of random transformations to be performed on the images read by the ImageDataGenerator instance. 

##### Setting up a data augmentation configuration via ImageDataGenerator

In [13]:
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

* **rotation_range**: is a value in degrees (0–180), a range within which to randomly rotate pictures.
* **width_shift_range** and **height_shift_range**: are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically.
* **shear_range**: is for randomly applying shearing transformations.
* **zoom_range**: is for randomly zooming inside pictures.
* **horizontal_flip**: is for randomly flipping half the images horizontally—relevant when there are no assumptions of horizontal asymmetry (for example, real-world pictures).
* **fill_mode**: is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.
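To give a feel for what two of these transformations do to pixel values, here is a toy sketch of a horizontal flip and a width shift with nearest-pixel filling on a tiny grid of numbers. It is a simplified stand-in, not the ImageDataGenerator implementation; all function names are invented, and rotation, shear, and zoom are left out.

```python
import random


def horizontal_flip(image):
    """Mirror every row left to right."""
    return [list(reversed(row)) for row in image]


def shift_columns(image, dx):
    """Shift right by dx pixels (left if negative), filling with the nearest edge pixel."""
    width = len(image[0])
    return [
        [row[min(max(col - dx, 0), width - 1)] for col in range(width)]
        for row in image
    ]


def augment(image, rng, max_shift_fraction=0.2):
    """Apply a random width shift, then flip with probability 0.5."""
    max_shift = int(max_shift_fraction * len(image[0]))
    shifted = shift_columns(image, rng.randint(-max_shift, max_shift))
    return horizontal_flip(shifted) if rng.random() < 0.5 else shifted


image = [[1, 2, 3, 4, 5],
         [6, 7, 8, 9, 10]]

print(horizontal_flip(image)[0])   # [5, 4, 3, 2, 1]
print(shift_columns(image, 1)[0])  # [1, 1, 2, 3, 4]

rng = random.Random(0)  # seeded so the run is repeatable
variants = [augment(image, rng) for _ in range(4)]
```

Every variant keeps the shape of the original image, so the model input size is unaffected.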

##### Displaying some randomly augmented training images

In [19]:
from keras.preprocessing import image

fnames = [os.path.join(train_cats_dir, fname) for fname in os.listdir(train_cats_dir)]

img_path = fnames[3]  # Chooses one image to augment
img = image.load_img(img_path, target_size=(150, 150))  # Reads the image and resizes it
x = image.img_to_array(img)  # Converts it to a Numpy array with shape (150, 150, 3)
x = x.reshape((1,) + x.shape)  # Reshapes it to (1, 150, 150, 3)


# Generates batches of randomly transformed images. Loops indefinitely, so you need to break the loop at some point!

i = 0
for batch in datagen.flow(x, batch_size=1):
    plt.figure(i)
    imgplot = plt.imshow(image.array_to_img(batch[0]))
    i += 1
    if i % 4 == 0:
        break
plt.show()

If you train a new network using this data-augmentation configuration, the network will never see the same input twice. But the inputs it sees are still heavily intercorrelated, because they come from a small number of original images—you can’t produce new information, you can only remix existing information. 

This may not be enough to completely get rid of overfitting. To further fight overfitting, you’ll also add a Dropout layer to your model, *right before the densely connected classifier*.
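As a reminder of what a dropout layer does to its inputs, here is a minimal sketch assuming the common "inverted dropout" formulation, in which surviving values are scaled up during training so that nothing needs to change at inference time. The function is illustrative only and is not the Keras layer.

```python
import random


def dropout(values, rate, rng, training=True):
    """Zero out roughly `rate` of the values and rescale the survivors."""
    if not training:
        return list(values)  # dropout is inactive at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(42)  # seeded so the run is repeatable
features = [1.0] * 8     # stand-in for the flattened feature vector

train_out = dropout(features, 0.5, rng)
test_out = dropout(features, 0.5, rng, training=False)
print(train_out)
print(test_out)
```

With a rate of 0.5, each training-time value is either 0.0 or doubled, so the expected sum of the vector stays the same.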

##### Defining a new convnet that includes dropout

In [20]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), 
          activation='relu',
          input_shape=(150, 150, 3)))

model.add(layers.MaxPooling2D((2, 2)))

model.add(layers.Conv2D(64, (3, 3), activation='relu'))

model.add(layers.MaxPooling2D((2, 2)))

model.add(layers.Conv2D(128, (3, 3), activation='relu'))

model.add(layers.MaxPooling2D((2, 2)))

model.add(layers.Conv2D(128, (3, 3), activation='relu'))

model.add(layers.MaxPooling2D((2, 2)))

model.add(layers.Flatten())

model.add(layers.Dropout(0.5))

model.add(layers.Dense(512, activation='relu'))

model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-4),
              metrics=['acc'])

#### Training the convnet using data-augmentation generators

In [21]:
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')

history = model.fit_generator(train_generator, steps_per_epoch=100, 
                              epochs=100, validation_data=validation_generator, validation_steps=50)

#### Saving the model

In [22]:
model.save('cats_and_dogs_small_2.h5')

##### Displaying curves of loss and accuracy during training

In [23]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

By using regularization techniques even further, and by tuning the network’s parameters (such as the number of filters per convolution layer, or the number of layers in the network), you may be able to get an even better accuracy, likely up to 86% or 87%. 

But it would prove difficult to go any higher just by training your own convnet from scratch, because you have so little data to work with. As a next step to improve your accuracy on this problem, you’ll have to use a **pretrained model**, which is the focus of the next two sections.

#### Using a pretrained convnet

A *pretrained network* is a saved network that was previously trained on a large dataset, typically on a large-scale image-classification task.

If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task.

For instance, you might train a network on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for something as remote as identifying furniture items in images. Such portability of learned features across different problems is a key advantage of deep learning compared to many older, shallow-learning approaches, and it makes deep learning very effective for small-data problems.

Let’s consider a large convnet trained on the ImageNet dataset (1.4 million labeled images and 1,000 different classes). 

ImageNet contains many animal classes, including different species of cats and dogs, and you can thus expect to perform well on the dogs-versus-cats classification problem.

You’ll use the **VGG16 architecture**, developed by Karen Simonyan and Andrew Zisserman in 2014; it’s a simple and widely used convnet architecture for ImageNet.

There are two ways to use a pretrained network: 
* **feature extraction** 
* **fine-tuning**

## 1. Feature extraction

Feature extraction consists of using the *representations learned* by a previous network
to extract interesting features from new samples. 

These features are then run through a new classifier, which is trained from scratch.

Convnets used for image classification comprise two parts: 

* They start with a series of pooling and convolution layers.
* They end with a densely connected classifier. 

The first part is called the **convolutional base** of the model.

**Feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on top of the output.**

![featureextraction.PNG](attachment:featureextraction.PNG)

Why only reuse the convolutional base? Could you reuse the densely connected classifier as well? In general, doing so should be avoided. 

The reason is that the representations learned by the convolutional base are likely to be more generic and therefore more reusable: 

**the feature maps of a convnet are presence maps of generic concepts over a picture, which are likely to be useful regardless of the computer-vision problem at hand.**

But the representations learned by the classifier will necessarily **be specific to the set of classes on which the model was trained: they will only contain information about the presence probability of this or that class in the entire picture.**

Additionally, representations found in densely connected layers no longer contain any information about where objects are located in the input image: these layers get rid of the notion of space, whereas the object location is still described by convolutional feature maps.


For problems where object location matters, densely connected features are largely useless.

Note that the level of generality (and therefore reusability) of the representations extracted by specific convolution layers depends on the depth of the layer in the model. **Layers that come earlier in the model extract local, highly generic feature maps (such as visual edges, colors, and textures), whereas layers that are higher up extract more-abstract concepts (such as “cat ear” or “dog eye”).**

So **if your new dataset differs a lot from the dataset on which the original model was trained, you may be better off using only the first few layers of the model to do feature extraction, rather than using the entire convolutional base.**

Because the ImageNet class set contains multiple dog and cat classes, it’s likely to be beneficial to reuse the information contained in the densely connected layers of the original model. But we’ll choose not to, in order to cover the more general case where the class set of the new problem doesn’t overlap the class set of the original model. 

Let’s put this in practice by **using the convolutional base of the VGG16 network, trained on ImageNet,** to extract interesting features from cat and dog images, and then train a dogs-versus-cats classifier on top of these features.

The following image-classification models (all pretrained on the ImageNet dataset) are available as part of keras.applications:
* Xception
* Inception V3
* ResNet50
* VGG16
* VGG19
* MobileNet

##### Instantiating the VGG16 convolutional base

In [24]:
from keras.applications import VGG16

conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(150, 150, 3))

You pass three arguments to the constructor:
* weights: specifies the weight checkpoint from which to initialize the model.
* include_top: refers to including (or not) the densely connected classifier on top of the network. By default, this densely connected classifier corresponds to the 1,000 classes from ImageNet. Because you intend to use your own densely connected classifier (with only two classes: cat and dog), you don’t need to include it.
* input_shape: is the shape of the image tensors that you’ll feed to the network. This argument is purely optional: if you don’t pass it, the network will be able to process inputs of any size.

In [1]:
conv_base.summary()

The final feature map has shape (4, 4, 512). 
That’s the feature on top of which you’ll stick a densely connected classifier.
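The 4 × 4 spatial size can be checked by hand. The sketch below assumes the standard VGG16 layout: five blocks whose convolutions preserve the spatial size, each ending in a 2 × 2 max-pooling step that halves it and rounds down. `vgg16_spatial_size` is a helper name invented for this illustration.

```python
def vgg16_spatial_size(input_size, n_blocks=5):
    """Side length of the final feature map for a square input."""
    size = input_size
    for _ in range(n_blocks):
        # Convolutions in a block keep the size; the 2x2 max pooling
        # at the end of the block halves it, rounding down.
        size //= 2
    return size

print(vgg16_spatial_size(150))  # 4  (150 -> 75 -> 37 -> 18 -> 9 -> 4)
print(vgg16_spatial_size(224))  # 7
```

With 512 filters in the last block, a 150 × 150 input therefore yields a (4, 4, 512) feature map.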

At this point, there are two ways you could proceed:
* **Running the convolutional base over your dataset**, **recording its output to a Numpy array on disk**, and then using this data as input to a standalone, densely connected classifier similar to those you saw in the previous part. Schematically:

  Inputs -> trained convolutional base -> recorded output for each input

  Recorded outputs -> Dense layers

  This solution is fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. But for the same reason, this technique won’t allow you to use data augmentation.

* Extending the model you have (conv_base) by adding Dense layers on top, and running the whole thing end to end on the input data. This will allow you to use data augmentation, because every input image goes through the convolutional base every time it’s seen by the model. But for the same reason, this technique is far more expensive than the first.
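To get a feel for the cost gap between the two options, here’s a rough back-of-the-envelope count of how many images pass through the convolutional base in each case. It uses the figures from the cells in this section (2,000 training, 1,000 validation, and 1,000 test images; 30 epochs of 100 training steps and 50 validation steps with batches of 20). It counts forward passes only and ignores the final test evaluation, so treat it as an estimate rather than a benchmark.

```python
train_images, val_images, test_images = 2000, 1000, 1000
epochs = 30
steps_per_epoch, validation_steps, batch_size = 100, 50, 20

# Option 1: every image goes through the convolutional base exactly once.
passes_cached = train_images + val_images + test_images

# Option 2: every batch drawn during training and validation
# goes through the convolutional base again, epoch after epoch.
passes_end_to_end = epochs * (steps_per_epoch + validation_steps) * batch_size

print(passes_cached)      # 4000
print(passes_end_to_end)  # 90000
```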

Let’s walk through the code required to set up the first one: recording the output of conv_base on your data and using these outputs as inputs to a new model.

##### FAST FEATURE EXTRACTION WITHOUT DATA AUGMENTATION

You’ll start by running instances of the previously introduced *ImageDataGenerator* to extract images as Numpy arrays as well as their labels. You’ll extract features from these images by calling the *predict* method of the conv_base model.

##### Extracting features using the pretrained convolutional base

In [2]:
import os
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

base_dir = '/Users/fchollet/Downloads/cats_and_dogs_small'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')

datagen = ImageDataGenerator(rescale=1./255)
batch_size = 20

def extract_features(directory, sample_count):
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))
    generator = datagen.flow_from_directory(
        directory,
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='binary')

    i = 0
    for inputs_batch, labels_batch in generator:
        features_batch = conv_base.predict(inputs_batch)
        features[i * batch_size : (i + 1) * batch_size] = features_batch
        labels[i * batch_size : (i + 1) * batch_size] = labels_batch
        i += 1
        if i * batch_size >= sample_count:
            # Because generators yield data indefinitely in a loop,
            # you must break after every image has been seen once.
            break
    return features, labels

train_features, train_labels = extract_features(train_dir, 2000)
validation_features, validation_labels = extract_features(validation_dir, 1000)
test_features, test_labels = extract_features(test_dir, 1000)

The extracted features are currently of shape (samples, 4, 4, 512). You’ll feed them to a densely connected classifier, so first you must flatten them to (samples, 8192):

In [3]:
train_features = np.reshape(train_features, (2000, 4 * 4 * 512))
validation_features = np.reshape(validation_features, (1000, 4 * 4 * 512))
test_features = np.reshape(test_features, (1000, 4 * 4 * 512))

At this point, you can define your densely connected classifier (note the use of dropout for regularization) and train it on the data and labels that you just recorded:

##### Defining and training the densely connected classifier

In [4]:
from keras import models
from keras import layers
from keras import optimizers

model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_dim=4 * 4 * 512))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizers.RMSprop(lr=2e-5),
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(train_features, train_labels,
                    epochs=30,
                    batch_size=20,
                    validation_data=(validation_features, validation_labels))

Training is very fast, because you only have to deal with two Dense layers—an epoch takes less than one second even on CPU.

#### Plotting the results

In [5]:
import matplotlib.pyplot as plt
%matplotlib inline

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

The plots indicate that you’re overfitting almost from the start, despite using dropout with a fairly large rate. That’s because this technique doesn’t use data augmentation, which is essential for preventing overfitting with small image datasets.

##### FEATURE EXTRACTION WITH DATA AUGMENTATION

Let’s review the second technique I mentioned for doing feature extraction, which is much slower and more expensive, but which allows you to use data augmentation during training: **extending the *conv_base* model and running it end to end on the inputs.**

**NOTE This technique is so expensive that you should only attempt it if you have access to a GPU—it’s absolutely intractable on CPU. If you can’t run your code on GPU, then the previous technique is the way to go.**

Because models behave just like layers, you can add a model (like conv_base) to a Sequential model just like you would add a layer.

### Adding a densely connected classifier on top of the convolutional base

In [6]:
from keras import models
from keras import layers
model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [7]:
model.summary()

As you can see, the convolutional base of VGG16 has 14,714,688 parameters, which is very large. The classifier you’re adding on top has 2 million parameters.
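Both figures can be reproduced with a little arithmetic. The sketch below assumes the standard VGG16 block layout (3 × 3 convolutions with 64, 128, 256, 512, and 512 filters, repeated 2, 2, 3, 3, and 3 times) and the classifier defined above (a 256-unit Dense layer on the flattened (4, 4, 512) feature map, followed by a 1-unit Dense layer). The helper names are made up for the illustration.

```python
def conv2d_params(in_channels, out_channels, kernel=3):
    """Weights of a kernel x kernel convolution plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(in_units, out_units):
    """Weight matrix plus one bias per output unit."""
    return in_units * out_units + out_units

# (filters, number of convolution layers) for each VGG16 block
vgg16_blocks = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

conv_base_params = 0
channels = 3  # RGB input
for filters, n_layers in vgg16_blocks:
    for _ in range(n_layers):
        conv_base_params += conv2d_params(channels, filters)
        channels = filters

flat_size = 4 * 4 * 512
classifier_params = dense_params(flat_size, 256) + dense_params(256, 1)

print(conv_base_params)   # 14714688
print(classifier_params)  # 2097665
```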

Before you compile and train the model, it’s very important to **freeze** the convolutional base. 

Freezing a layer or set of layers means preventing their weights from being updated during training. 

If you don’t do this, then the representations that were previously learned by the convolutional base will be modified during training. Because the Dense layers on top are randomly initialized, very large weight updates would be propagated through the network, effectively destroying the representations previously learned.

In Keras, you freeze a network by setting its **trainable** attribute to **False**:

In [1]:
print('This is the number of trainable weights before freezing the conv base:', len(model.trainable_weights))
conv_base.trainable = False
print('This is the number of trainable weights after freezing the conv base:', len(model.trainable_weights))

With this setup, only the weights from the two Dense layers that you added will be trained. 

That’s a total of four weight tensors: two per layer (the main weight matrix and the bias vector). Note that in order for these changes to take effect, you must first
compile the model. 
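The effect of the trainable flag on the weight count can be mimicked without Keras. In this toy sketch, `ToyLayer` and `trainable_weights` are stand-ins invented for the illustration. The only facts borrowed from the real model are that the VGG16 convolutional base has 13 convolution layers and that each convolution or Dense layer carries two weight tensors (a kernel and a bias).

```python
class ToyLayer:
    """Stand-in for a layer: a name, its weight tensors, and a trainable flag."""
    def __init__(self, name, weights):
        self.name = name
        self.weights = list(weights)
        self.trainable = True

def trainable_weights(layers):
    """Collect the weights of every layer whose trainable flag is set."""
    return [w for layer in layers if layer.trainable for w in layer.weights]

conv_layers = [ToyLayer(f"conv_{i}", ["kernel", "bias"]) for i in range(13)]
dense_layers = [ToyLayer("dense_1", ["kernel", "bias"]),
                ToyLayer("dense_2", ["kernel", "bias"])]
all_layers = conv_layers + dense_layers

before = len(trainable_weights(all_layers))
for layer in conv_layers:       # freeze the whole convolutional base
    layer.trainable = False
after = len(trainable_weights(all_layers))

print(before, after)  # 30 4
```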

If you ever modify weight trainability after compilation, **you should then recompile the model**, or these changes will be ignored.

Now you can start training your model, with the same data-augmentation configuration that you used in the previous example.

##### Training the model end to end with a frozen convolutional base

In [2]:
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

# Note that the validation data shouldn’t be augmented!
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=2e-5),
              metrics=['acc'])

history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=30,
    validation_data=validation_generator,
    validation_steps=50)

#### Plotting the results

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

## 2. Fine-tuning

Fine-tuning consists of *unfreezing a few of the top layers of a frozen model base used for feature extraction*, and jointly training both the **newly added part of the model (in this case, the fully connected classifier)** and **these top layers**. 

Fine-tuning slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand.

![finetune.PNG](attachment:finetune.PNG)

### Important
I stated earlier that it’s necessary to freeze the convolution base of VGG16 in order to be able to train a randomly initialized classifier on top. 

**It’s only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained.**

In other words, before you can tune the top convolutional layers, the final classifier must already have been trained.

If the classifier isn’t already trained, then the error signal propagating through the network during training will be too large, and the representations previously learned by the layers being fine-tuned will be destroyed. Thus the steps for fine-tuning a network are as follows:

1. Add your custom network on top of an already-trained base network.
2. Freeze the base network.
3. Train the part you added.
4. Unfreeze some layers in the base network.
5. Jointly train both these layers and the part you added.

You already completed the first three steps when doing feature extraction. Let’s proceed with step 4: you’ll unfreeze your conv_base and then freeze individual layers
inside it.


As a reminder, this is what your convolutional base looks like:

In [4]:
conv_base.summary()

We’ll fine-tune the **last three convolutional layers**, which means all layers up to block4_pool should be frozen, and the layers block5_conv1, block5_conv2, and block5_conv3 should be trainable.

Why not fine-tune more layers? Why not fine-tune the entire convolutional base?
You could. But you need to consider the following:

* Earlier layers in the convolutional base encode more-generic, reusable features, whereas layers higher up encode more-specialized features. It’s more useful to fine-tune the more specialized features, because these are the ones that need to be repurposed on your new problem. There would be fast-decreasing returns in fine-tuning lower layers.

* The more parameters you’re training, the more you’re at risk of overfitting. The convolutional base has 15 million parameters, so it would be risky to attempt to train it on your small dataset.
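To put numbers on the second point, here’s a small sketch comparing how many parameters would be trained if you fine-tune only the three block5 convolution layers versus the whole convolutional base. It assumes each block5 layer is a 3 × 3 convolution from 512 to 512 channels, reuses the 14,714,688 figure reported for the convolutional base, and assumes the classifier added earlier (Dense 256 on the flattened (4, 4, 512) features, then Dense 1).

```python
def conv2d_params(in_channels, out_channels, kernel=3):
    """Weights of a kernel x kernel convolution plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels

conv_base_total = 14_714_688                          # whole convolutional base
classifier = (4 * 4 * 512 * 256 + 256) + (256 + 1)    # Dense(256) + Dense(1)
block5 = 3 * conv2d_params(512, 512)                  # block5_conv1..block5_conv3

trainable_block5_only = block5 + classifier
trainable_everything = conv_base_total + classifier

print(block5)                 # 7079424
print(trainable_block5_only)  # 9177089
print(trainable_everything)   # 16812353
```

Even the restricted setup trains several million parameters on 2,000 images, which is one reason to stop at the top block.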

Thus, in this situation, it’s a good strategy to fine-tune only the top two or three layers
in the convolutional base. 

Let’s set this up, starting from where you left off in the previous example.

##### Freezing all layers up to a specific one

In [5]:
conv_base.trainable = True
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False

Now you can begin fine-tuning the network. 

You’ll do this with the RMSProp optimizer, using **a very low learning rate**. 
The reason for using a low learning rate is that you want to limit the magnitude of the modifications you make to the representations of the three layers you’re fine-tuning. 
Updates that are too large may harm these representations.

##### Fine-tuning the model

In [6]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-5),
              metrics=['acc'])

history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=100,
    validation_data=validation_generator,
    validation_steps=50)

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

These curves look noisy. To make them more readable, you can smooth them by replacing every loss and accuracy with exponential moving averages of these quantities. Here’s a trivial utility function to do this:

In [8]:
def smooth_curve(points, factor=0.8):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            # Blend the previous smoothed value with the new raw point
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            # The first point has no history, so it is kept as is
            smoothed_points.append(point)
    return smoothed_points

plt.plot(epochs, smooth_curve(acc), 'bo', label='Smoothed training acc')
plt.plot(epochs, smooth_curve(val_acc), 'b', label='Smoothed validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

plt.plot(epochs, smooth_curve(loss), 'bo', label='Smoothed training loss')
plt.plot(epochs, smooth_curve(val_loss), 'b', label='Smoothed validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

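Written out, the smoothing performed by `smooth_curve` is an exponential moving average. The symbols are notation chosen here, not the author’s: $x_t$ is the raw value at epoch $t$, $s_t$ is the smoothed value, and $\alpha$ is the `factor` argument (0.8 by default).

$$
s_1 = x_1, \qquad s_t = \alpha \, s_{t-1} + (1 - \alpha) \, x_t \quad \text{for } t > 1
$$

A larger $\alpha$ gives more weight to the history and therefore a smoother curve.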

The validation accuracy curve looks much cleaner.

You’re seeing a nice 1% absolute improvement in accuracy, from about 96% to above 97%.

Note that the loss curve doesn’t show any real improvement (in fact, it’s deteriorating). You may wonder, how could accuracy stay stable or improve if the loss isn’t decreasing? 

The answer is simple: what you display is an average of pointwise loss values; but **what matters for accuracy is the distribution of the loss values, not their average, because accuracy is the result of a binary thresholding of the class probability
predicted by the model.** 
The model may still be improving even if this isn’t reflected in the average loss.
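A tiny made-up example shows how this can happen. The four predictions below are invented for the illustration (all four samples have label 1). Between the "early" and "later" snapshots one more sample crosses the 0.5 threshold, so accuracy goes up, while one confidently wrong prediction drags the average loss up as well.

```python
import math

def binary_crossentropy(y_true, p):
    """Pointwise binary cross-entropy for a label in {0, 1} and a probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def evaluate(labels, probs, threshold=0.5):
    """Return (average loss, accuracy) for a list of predicted probabilities."""
    losses = [binary_crossentropy(y, p) for y, p in zip(labels, probs)]
    correct = [(p > threshold) == bool(y) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses), sum(correct) / len(correct)

labels = [1, 1, 1, 1]
early = [0.90, 0.90, 0.45, 0.45]   # two samples just below the threshold
later = [0.55, 0.55, 0.55, 0.01]   # three correct, one confidently wrong

early_loss, early_acc = evaluate(labels, early)
later_loss, later_acc = evaluate(labels, later)

print(early_acc, later_acc)                        # 0.5 0.75
print(round(early_loss, 2), round(later_loss, 2))  # 0.45 1.6
```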


You can now finally evaluate this model on the test data:

In [9]:
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

test_loss, test_acc = model.evaluate_generator(test_generator, steps=50)
print('test acc:', test_acc)

Here you get a test accuracy of 97%. In the original Kaggle competition around this dataset, this would have been one of the top results. But using modern deep-learning techniques, you managed to reach this result using only a small fraction of the training data available (about 10%). 

There is a huge difference between being able to train on 20,000 samples compared to 2,000 samples!

## Wrapping up

* Convnets are the best type of machine-learning models for computer-vision tasks. It’s possible to train one from scratch even on a very small dataset, with decent results.

* On a small dataset, overfitting will be the main issue. Data augmentation is a powerful way to fight overfitting when you’re working with image data.

* It’s easy to reuse an existing convnet on a new dataset via feature extraction. This is a valuable technique for working with small image datasets.

* As a complement to feature extraction, you can use fine-tuning, which adapts to a new problem some of the representations previously learned by an existing model. This pushes performance a bit further.

## Visualizing what convnets learn

The representations learned by convnets are highly amenable to visualization, in large part because they’re representations of visual concepts. 

Since 2013, a wide array of techniques have been developed for visualizing and interpreting these representations. We won’t survey all of them, but we’ll cover three of the most accessible and useful ones:

* **Visualizing intermediate convnet outputs (intermediate activations)**: Useful for understanding how successive convnet layers transform their input, and for getting a first idea of the meaning of individual convnet filters.

* **Visualizing convnet filters**: Useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to.

* **Visualizing heatmaps of class activation in an image**: Useful for understanding which parts of an image were identified as belonging to a given class, thus allowing you to localize objects in images.

For the first method—activation visualization—you’ll use the small convnet that you trained from scratch on the dogs-versus-cats classification problem.

For the next two methods, you’ll use the VGG16 model.

#### 1. Visualizing intermediate activations

Visualizing intermediate activations consists of displaying the feature maps that are
output by various convolution and pooling layers in a network, given a certain input
(the output of a layer is often called its **activation**, the output of the activation function). 


This gives a view into how an input is decomposed into the different filters learned by the network. 

You want to visualize feature maps with three dimensions:
*width, height, and depth (channels)*. 

Each channel encodes relatively independent features, so the proper way to visualize these feature maps is by independently plotting the contents of **every channel as a 2D image**.

Let’s start by loading the model that you saved:

In [5]:
from keras.models import load_model
model = load_model('cats_and_dogs_small_2.h5')
model.summary()

Next, you’ll get an input image—a picture of a cat, not part of the images the network was trained on.

### Preprocessing a single image

In [6]:
from keras.preprocessing import image
import numpy as np

img_path = '/Users/fchollet/Downloads/cats_and_dogs_small/test/cats/cat.1700.jpg'

# Preprocesses the image into a 4D tensor
img = image.load_img(img_path, target_size=(150, 150))
img_tensor = image.img_to_array(img)
img_tensor = np.expand_dims(img_tensor, axis=0)
img_tensor /= 255.  # Remember that the model was trained on inputs that were preprocessed this way.

print(img_tensor.shape)  # Its shape is (1, 150, 150, 3)

Let’s display the picture:

##### Displaying the test picture

In [7]:
import matplotlib.pyplot as plt
plt.imshow(img_tensor[0])
plt.show()

In order to extract the feature maps you want to look at, you’ll create a Keras model that takes batches of images as input and outputs the activations of all convolution and pooling layers.

To do this, you’ll use the Keras class **Model**. A model is instantiated using two arguments:

* an input tensor (or list of input tensors)
* an output tensor (or list of output tensors)

The resulting class is a Keras model, just like the **Sequential** models you’re familiar with, mapping the specified inputs to the specified outputs.

What sets the Model class apart is that it allows for models with multiple outputs, unlike
Sequential. For more information about the Model class, see section 7.1.

##### Instantiating a model from an input tensor and a list of output tensors

In [9]:
from keras import models

# Extracts the outputs of the first eight layers
layer_outputs = [layer.output for layer in model.layers[:8]]
# Creates a model that will return these outputs, given the model input
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)

When fed an image input, this model returns the values of the layer activations in the original model. 

This is the first **multi-output model** you’ve encountered in this book: until now, the models you’ve seen have had exactly **one input** and **one output**.

In the general case, a model can have any number of inputs and outputs. 
This one has **one input** and **eight outputs**: one output per layer activation.
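The shapes of those eight activations can be worked out by hand. The sketch below assumes the loaded model is the dropout convnet defined earlier: unpadded 3 × 3 convolutions with 32, 64, 128, and 128 filters, each followed by 2 × 2 max pooling, on a 150 × 150 RGB input. The `layers_spec` list and helper names are made up for the illustration.

```python
def conv_valid(size, kernel=3):
    """An unpadded convolution trims kernel - 1 pixels from each dimension."""
    return size - kernel + 1

def max_pool(size, window=2):
    """Non-overlapping max pooling divides the size by the window, rounding down."""
    return size // window

layers_spec = [("conv", 32), ("pool", None),
               ("conv", 64), ("pool", None),
               ("conv", 128), ("pool", None),
               ("conv", 128), ("pool", None)]

size, channels = 150, 3
shapes = []
for kind, filters in layers_spec:
    if kind == "conv":
        size, channels = conv_valid(size), filters
    else:
        size = max_pool(size)
    shapes.append((size, size, channels))

for shape in shapes:
    print(shape)
# (148, 148, 32), (74, 74, 32), (72, 72, 64), (36, 36, 64),
# (34, 34, 128), (17, 17, 128), (15, 15, 128), (7, 7, 128)
```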

##### Running the model in predict mode

In [10]:
activations = activation_model.predict(img_tensor) # Returns a list of eight Numpy arrays: one array per layer activation

For instance, this is the activation of the first convolution layer for the cat image input:

In [11]:
first_layer_activation = activations[0]
print(first_layer_activation.shape)

It’s a 148 × 148 feature map with 32 channels. Let’s try plotting the channel at index 4 of the activation of the first layer of the original model:

##### Visualizing the channel at index 4

In [12]:
import matplotlib.pyplot as plt
plt.matshow(first_layer_activation[0, :, :, 4], cmap='viridis')

This channel appears to encode a *diagonal edge detector*. Let’s try the channel at index 7, but note that your own channels may vary, because the specific filters learned by convolution layers aren’t deterministic.

##### Visualizing the channel at index 7

In [13]:
plt.matshow(first_layer_activation[0, :, :, 7], cmap='viridis')

##### Visualizing every channel in every intermediate activation

In [14]:
layer_names = []
for layer in model.layers[:8]:
    layer_names.append(layer.name)

images_per_row = 16

for layer_name, layer_activation in zip(layer_names, activations):
    n_features = layer_activation.shape[-1]  # Number of features in the feature map
    size = layer_activation.shape[1]  # The feature map has shape (1, size, size, n_features)
    n_cols = n_features // images_per_row  # Tiles the activation channels in this matrix
    display_grid = np.zeros((size * n_cols, images_per_row * size))

    for col in range(n_cols):  # Tiles each filter into a big horizontal grid
        for row in range(images_per_row):
            channel_image = layer_activation[0, :, :, col * images_per_row + row]
            # Post-processes the feature to make it visually palatable
            channel_image -= channel_image.mean()
            channel_image /= channel_image.std()
            channel_image *= 64
            channel_image += 128
            channel_image = np.clip(channel_image, 0, 255).astype('uint8')
            display_grid[col * size : (col + 1) * size,
                         row * size : (row + 1) * size] = channel_image

    # Draws one figure per layer, once the whole grid has been filled
    scale = 1. / size
    plt.figure(figsize=(scale * display_grid.shape[1],
                        scale * display_grid.shape[0]))
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect='auto', cmap='viridis')

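The tiling arithmetic in the cell above can be checked on its own. This sketch recomputes the grid dimensions for a layer with `size` × `size` feature maps and `n_features` channels. `grid_layout` is a helper name made up for the illustration, and the two example calls assume the first and eighth activations of the convnet have shapes (148, 148, 32) and (7, 7, 128).

```python
def grid_layout(size, n_features, images_per_row=16):
    """Return (tile rows, grid shape in pixels, figure size in inches)."""
    n_rows = n_features // images_per_row        # called n_cols in the cell above
    grid_shape = (size * n_rows, size * images_per_row)
    figsize = (grid_shape[1] / size, grid_shape[0] / size)
    return n_rows, grid_shape, figsize

print(grid_layout(148, 32))  # (2, (296, 2368), (16.0, 2.0))
print(grid_layout(7, 128))   # (8, (56, 112), (16.0, 8.0))
```

Because of the integer division, a layer whose channel count isn’t a multiple of 16 would have its last few channels left out of the grid.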