<a href="https://colab.research.google.com/github/DataScienceUB/DeepLearningMaster2019/blob/master/10.%20Convolutional%20Neural%20Networks%20II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 10. Convolutional Neural Networks II
## Large Convolutional Networks

Several Convolutional Network architectures are well known by name. The most common are:

+ **LeNet**, 1990’s. 
<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/conv2.png?raw=1" alt="" style="width: 600px;"/> 
</center>



+ **AlexNet**. 2012.
<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/alexnet.png?raw=1" alt="" style="width: 700px;"/>
(Source: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
</center>

> AlexNet has about 60 million parameters!


+ **ZF Net**. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.
<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/zfnet.png?raw=1" alt="" style="width: 700px;"/> 
(Source: https://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf)
</center>


+ **VGGNet**. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the VGGNet. Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an **extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end**. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters. Most of these parameters are in the first fully connected layer, and it was since found that these FC layers can be removed with no performance downgrade, significantly reducing the number of necessary parameters.


<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/vgg16.png?raw=1" alt="" style="width: 600px;"/> 
(Source: https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/)
</center>
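
To see why the first fully connected layer dominates the parameter budget, a quick count helps. The sketch below assumes the standard VGG16 shapes (a 7x7x512 volume feeding a 4096-unit layer) and the total of 138,357,544 parameters; `dense_params` is a hypothetical helper name, not part of Keras.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit of a fully connected layer."""
    return inputs * units + units

total_vgg16 = 138357544                     # total parameter count of VGG16
first_fc = dense_params(7 * 7 * 512, 4096)  # first FC layer after the last pooling
share = first_fc / total_vgg16

print(first_fc)         # 102764544
print(round(share, 2))  # 0.74
```

Under these assumptions, roughly three quarters of all parameters sit in that single layer.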

In [0]:
# Small VGG-like convnet in Keras

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD

# Generate dummy data

def to_categorical(y, num_classes=None):
    """
    Converts a class vector (integers) to binary class matrix.
    """
    y = np.array(y, dtype='int').ravel()
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    categorical = np.zeros((n, num_classes))
    categorical[np.arange(n), y] = 1
    return categorical

x_train = np.random.random((100, 100, 100, 3))
y_train = to_categorical(np.random.randint(10, size=(100, 1)), num_classes=10)
x_test = np.random.random((20, 100, 100, 3))
y_test = to_categorical(np.random.randint(10, size=(20, 1)), num_classes=10)

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100, 100, 3)))
model.add(Conv2D(32, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(32, (3, 3), activation="relu"))
model.add(Conv2D(32, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
model.summary()  # summary() prints the table itself and returns None

model.fit(x_train, y_train, batch_size=32, epochs=10)
score = model.evaluate(x_test, y_test, batch_size=32)

Using TensorFlow backend.


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 98, 98, 32)        896       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 96, 96, 32)        9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 48, 48, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 48, 48, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 46, 46, 32)        9248      
_________________________________________________________________
conv2d_4 (Conv2D)    

In [0]:
# how to compute the number of trainable and non-trainable weights in a model

from keras import backend as K
import numpy

trainable_count = int(numpy.sum([K.count_params(p) for p in set(model.trainable_weights)]))

non_trainable_count = int(numpy.sum([K.count_params(p) for p in set(model.non_trainable_weights)]))

print(trainable_count,non_trainable_count)

3996394 0


In [0]:
# how to compute the memory allocated by the activations of a model

batch = 1
shapes_count = int(numpy.sum([numpy.prod(numpy.array([s if isinstance(s, int) 
                                                      else 1 for s in l.output_shape])) 
                              for l in model.layers]))
memory = shapes_count * 4 * batch

print(memory)

3643436


**Exercise**

+ Why do we have 896 parameters in the ``conv2d_1`` layer of the previous example?

+ Compute the number of parameters of the original VGG16 (all CONV layers are 3x3).
> The VGG16 architecture is: INPUT: [224x224x3] $\rightarrow$ CONV3-64: [224x224x64] $\rightarrow$ CONV3-64: [224x224x64] $\rightarrow$ POOL2: [112x112x64] $\rightarrow$ CONV3-128: [112x112x128]  $\rightarrow$ CONV3-128: [112x112x128]  $\rightarrow$ POOL2: [56x56x128] $\rightarrow$ CONV3-256: [56x56x256] $\rightarrow$ CONV3-256: [56x56x256]  $\rightarrow$ CONV3-256: [56x56x256]  $\rightarrow$ POOL2: [28x28x256]  $\rightarrow$ CONV3-512: [28x28x512]  $\rightarrow$ CONV3-512: [28x28x512]  $\rightarrow$ CONV3-512: [28x28x512]  $\rightarrow$ POOL2: [14x14x512] $\rightarrow$ CONV3-512: [14x14x512]  $\rightarrow$ CONV3-512: [14x14x512]  $\rightarrow$ CONV3-512: [14x14x512]  $\rightarrow$ POOL2: [7x7x512]  $\rightarrow$ FC: [1x1x4096]  $\rightarrow$ FC: [1x1x4096]  $\rightarrow$ FC: [1x1x1000].

+ The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. What is the necessary memory size (supposing that we need 4 bytes for each element) to store intermediate data?



In [0]:
# your code here


## More Large Convolutional Networks


+ **GoogLeNet**. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an **Inception Module** that dramatically reduced the number of parameters in the network (4M, compared to VGG with 138,357,544). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. There are also several followup versions to the GoogLeNet, most recently Inception-v4.


<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/googlenet.png?raw=1" alt="" style="width: 400px;"/> 
GoogLeNet Architecture. Source: https://arxiv.org/pdf/1409.4842v1.pdf
</center>

Blue Box: Convolution | Red Box: Pooling | Yellow Box: Softmax | Green Box: Normalization

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/inception.png?raw=1" alt="" style="width: 400px;"/> 
Inception Layer. Source: https://arxiv.org/pdf/1409.4842v1.pdf
</center>

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/googlenet2.png?raw=1" alt="" style="width: 600px;"/> 
GoogLeNet parameters and ops. Source: https://arxiv.org/pdf/1409.4842v1.pdf
</center>

> What is the role of 1x1 convolutions?
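
One way to approach this question is to count weights. The sketch below is a rough illustration, not the paper's exact accounting: it assumes an input with 192 channels, a 1x1 reduction to 16 channels, and 32 filters of size 5x5, and ignores biases. `conv_weights` is a hypothetical helper.

```python
def conv_weights(kernel, in_channels, filters):
    """Number of weights (biases ignored) in a square 2D convolution."""
    return kernel * kernel * in_channels * filters

in_channels = 192   # assumed input depth
reduced = 16        # assumed depth after the 1x1 convolution
out_filters = 32    # assumed number of 5x5 filters

# 5x5 convolution applied directly to the full-depth input
direct = conv_weights(5, in_channels, out_filters)
# 1x1 convolution first, then the 5x5 convolution on the reduced volume
bottleneck = (conv_weights(1, in_channels, reduced)
              + conv_weights(5, reduced, out_filters))

print(direct, bottleneck)  # 153600 15872
```

With these numbers the 1x1 convolution cuts the weight count by about a factor of ten while keeping the same output depth.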

+ **ResNet**. The Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features special **skip connections** and a heavy use of batch normalization. A Residual Network, or ResNet, is a neural network architecture that tackles the problem of vanishing gradients in a very simple way: if there is trouble sending the gradient signal backwards, why not provide the network with shortcuts that skip over layers to make things happen more smoothly? The architecture also has no large fully connected layers at the end of the network: global average pooling feeds a single classification layer.

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/res1.png?raw=1" alt="" style="width: 400px;"/> 
(Source: https://arxiv.org/pdf/1512.03385.pdf)
</center>

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/resnet.png?raw=1" alt="" style="width: 400px;"/> 
    
(Source: https://arxiv.org/pdf/1512.03385.pdf)
</center>
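
The skip connection can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: it uses dense transforms instead of convolutions, omits batch normalization, and the names `residual_block`, `W1` and `W2` are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), where F is a two-layer transform of the same width."""
    f = relu(x @ W1) @ W2   # residual branch F(x)
    return relu(f + x)      # the shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1 = np.zeros((8, 8))
W2 = np.zeros((8, 8))

y = residual_block(x, W1, W2)
# With zero weights F(x) = 0, so the block only passes its input through
print(np.allclose(y, relu(x)))  # True
```

Even when the residual branch contributes nothing, the input still reaches the output, which is what keeps a path open for the gradient.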

##  Deeper is better?

When it comes to neural network design, the trend in the past few years has pointed in one direction: deeper. 

Whereas the state of the art only a few years ago consisted of networks which were roughly twelve layers deep, it is now not surprising to come across networks which are hundreds of layers deep. 

This move hasn’t just consisted of greater depth for depth’s sake. For many applications, the most prominent of which is object classification, the deeper the neural network, the better the performance.

So the problem is to design a network in which the gradient can more easily reach all the layers of a network which might be dozens, or even hundreds, of layers deep. This is the goal behind some state-of-the-art architectures: ResNets, HighwayNets, and DenseNets.

**HighwayNets** use the same kind of shortcuts as the ResNet, but augment them with learnable parameters that determine to what extent each layer acts as a skip connection or as a nonlinear connection. Layers in a Highway Network are defined as follows:

 $$ y = H(x, W_H) \cdot T(x,W_T) + x \cdot C(x, W_C) $$
 
In this equation we can see an outline of the two kinds of layers discussed so far: with $T = 1$ and $C = 0$ it reduces to $y = H(x,W_H)$, the traditional layer, and with $T = 1$ and $C = 1$ it reduces to $y = H(x,W_H) + x$, our residual unit. 

The traditional layer can be implemented as:

```python
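import tensorflow as tf  # TensorFlow 1.x API

def dense(x, input_size, output_size, activation):
  # Weight matrix and bias of a standard fully connected layer
  W = tf.Variable(tf.truncated_normal([input_size, output_size], stddev=0.1), name="weight")
  b = tf.Variable(tf.constant(0.1, shape=[output_size]), name="bias")
  # Affine transform followed by the nonlinearity
  y = activation(tf.matmul(x, W) + b)
  return y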
```

What is new is $T(x,W_T)$, the transform gate function, and $C(x,W_C) = 1 - T(x,W_T)$, the carry gate function. When the transform gate is 1, we pass through our activation ($H$) and suppress the carry gate (since it will be 0). When the carry gate is 1, we pass through the unmodified input ($x$), while the activation is suppressed.

```python
import tensorflow as tf  # TensorFlow 1.x API

def highway(x, size, activation, carry_bias=-1.0):
  # Transform gate parameters; a negative bias starts the layer close to "carry"
  W_T = tf.Variable(tf.truncated_normal([size, size], stddev=0.1), name="weight_transform")
  b_T = tf.Variable(tf.constant(carry_bias, shape=[size]), name="bias_transform")

  # Parameters of the ordinary nonlinear transform H
  W = tf.Variable(tf.truncated_normal([size, size], stddev=0.1), name="weight")
  b = tf.Variable(tf.constant(0.1, shape=[size]), name="bias")

  T = tf.sigmoid(tf.matmul(x, W_T) + b_T, name="transform_gate")
  H = activation(tf.matmul(x, W) + b, name="activation")
  # tf.sub and tf.mul were renamed to tf.subtract and tf.multiply in TensorFlow 1.0
  C = tf.subtract(1.0, T, name="carry_gate")

  y = tf.add(tf.multiply(H, T), tf.multiply(x, C), name="y")
  return y
```

With this kind of network you can train models with hundreds of layers.
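
The behaviour of the two gates can be checked numerically. The NumPy sketch below is an illustration only: it assumes `tanh` as the activation and forces the gate open or closed through an extreme bias, and `highway_layer` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W, b, W_T, b_T):
    H = np.tanh(x @ W + b)        # nonlinear transform
    T = sigmoid(x @ W_T + b_T)    # transform gate
    return H * T + x * (1.0 - T)  # carry gate is C = 1 - T

rng = np.random.default_rng(0)
size = 5
x = rng.normal(size=(3, size))
W = rng.normal(scale=0.1, size=(size, size))
b = np.zeros(size)
W_T = np.zeros((size, size))

# Very negative gate bias: T is about 0, so the input is carried through
y_carry = highway_layer(x, W, b, W_T, np.full(size, -50.0))
# Very positive gate bias: T is about 1, so only the transform remains
y_transform = highway_layer(x, W, b, W_T, np.full(size, 50.0))

print(np.allclose(y_carry, x))                       # True
print(np.allclose(y_transform, np.tanh(x @ W + b)))  # True
```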

**DenseNet** takes the insights of the skip connection to the extreme. The idea here is that if connecting a skip connection from the previous layer improves performance, why not connect every layer to every other layer? That way there is always a direct route for the information backwards through the network.

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/densenet.png?raw=1" alt="" style="width: 700px;"/> 
(Source: https://arxiv.org/abs/1608.06993)
</center>

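Instead of adding the shortcut to the layer output, however, the DenseNet concatenates feature maps: each layer receives the outputs of all preceding layers as its input. Mathematically this looks like:

$$ x_\ell = f([x_0, x_1, \dots, x_{\ell-1}]) $$

where $[\cdot]$ denotes concatenation and $x_0$ is the input of the block.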

This architecture makes intuitive sense in both the forward and backward settings. In the feed-forward setting, a task may benefit from being able to get low-level feature activations in addition to high-level feature activations. In classifying objects, for example, a lower layer of the network may determine edges in an image, whereas a higher layer would determine larger-scale features such as the presence of faces. There may be cases where being able to use information about edges can help in determining the correct object in a complex scene. In the backward case, having all the layers connected allows us to send gradients quickly to their respective places in the network.
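
The concatenation pattern can be sketched with NumPy. This is a toy version under stated assumptions: dense layers with ReLU stand in for the convolutional layers, and `dense_block`, `growth` and the sizes are hypothetical choices.

```python
import numpy as np

def dense_block(x, weights):
    """Each layer sees the concatenation of the input and all earlier outputs."""
    features = [x]
    for W in weights:
        inputs = np.concatenate(features, axis=-1)     # stack everything so far
        features.append(np.maximum(inputs @ W, 0.0))   # new features from this layer
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
in_dim, growth, n_layers = 8, 4, 3
# Layer i receives in_dim + i * growth input features and emits `growth` new ones
weights = [rng.normal(scale=0.1, size=(in_dim + i * growth, growth))
           for i in range(n_layers)]

x = rng.normal(size=(2, in_dim))
out = dense_block(x, weights)
print(out.shape)  # (2, 20)
```

The output keeps the original input as its first 8 features, so later layers can still read the low-level information directly.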



## Fully Convolutional Networks

(Source: http://cs231n.github.io/convolutional-networks/#convert)

The only difference between **Fully Connected (FC)** and **Convolutional (CONV)** layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. 

However, the neurons in both layers still compute dot products, so their functional form is identical.

Then, it is easy to see that for any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certain blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing).

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/t10.png?raw=1" alt="" style="width: 800px;"/> 
</center>


Conversely, any FC layer can be converted to a CONV layer. 

Let $F$ be the receptive field size of the CONV layer neurons and $K$ the depth (number of bands) of the CONV layer.

For example, an FC layer with $K=4096$ that is looking at some input volume of size $7 \times 7 \times 512$ (its weights form a matrix of size $(7 \cdot 7 \cdot 512) \times 4096$) can be equivalently expressed as a CONV layer with $F=7$ and $K=4096$ (that is, $4096$ filters of size $7 \times 7 \times 512$). 

This can be very useful, because now we can apply the network to arbitrarily large images!
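
The equivalence can be verified numerically. The NumPy sketch below uses small stand-in sizes (16 channels and 10 outputs instead of 512 and 4096) to keep it fast; `conv_valid` is a hypothetical, deliberately naive convolution without padding.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, K = 7, 7, 16, 10   # small stand-ins for 7x7x512 and 4096
volume = rng.normal(size=(H, W, C))
fc_weights = rng.normal(size=(H * W * C, K))

# Fully connected view: flatten the volume, then one matrix product
fc_out = volume.reshape(-1) @ fc_weights

# Convolutional view: the same weights seen as K filters of shape (H, W, C)
filters = fc_weights.reshape(H, W, C, K)
conv_out = np.einsum('hwc,hwck->k', volume, filters)
print(np.allclose(fc_out, conv_out))  # True

def conv_valid(image, filters):
    """Slide the filters over the image with stride 1 and no padding."""
    fh, fw, _, k = filters.shape
    oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
    out = np.empty((oh, ow, k))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.einsum('hwc,hwck->k', image[i:i + fh, j:j + fw], filters)
    return out

# The converted layer now accepts a larger input and returns a map of outputs
bigger = rng.normal(size=(9, 9, C))
print(conv_valid(bigger, filters).shape)  # (3, 3, 10)
```

On the 9x9 input the converted layer produces one output vector per 7x7 window, which is the same as running the original FC layer at each of the 3x3 positions.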

## Object Detection and Segmentation
(Source: https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4)

In classification, there’s generally an image with a single object as the focus and the task is to say what that image is. But when we look at the world around us, we see complicated sights with multiple overlapping objects and different backgrounds, and we not only classify these different objects but also identify their boundaries, differences, and relations to one another!

To what extent do CNNs generalize to object detection? Object detection is the task of finding the different objects in an image and classifying them.

### R-CNN

A team comprising Ross Girshick (a name we’ll see again), Jeff Donahue, and Trevor Darrell found that this problem can be solved with AlexNet, testing on the PASCAL VOC Challenge, a popular object detection challenge akin to ImageNet.

The goal of R-CNN is to take in an image and correctly identify where the main objects in the image are (via bounding boxes).

>Inputs: Image

>Outputs: Bounding boxes + labels for each object in the image.

But how do we find out where these bounding boxes are? R-CNN proposes a bunch of boxes in the image and sees if any of them actually corresponds to an object.

R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search (see http://www.cs.cornell.edu/courses/cs7670/2014sp/slides/VisionSeminar14.pdf). 

At a high level, Selective Search looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects.

Once the proposals are created, R-CNN warps the region to a standard square size and passes it through to a modified version of AlexNet.

On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether this is an object, and if so what object. 

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/rcnn.png?raw=1" alt="" style="width: 800px;"/> 
</center>

Now, having found the object in the box, can we tighten the box to fit the true dimensions of the object? We can, and this is the final step of R-CNN. R-CNN runs a simple linear regression on the region proposal to generate tighter bounding box coordinates to get our final result. Here are the inputs and outputs of this regression model:

> Inputs: sub-regions of the image corresponding to objects.

> Outputs: New bounding box coordinates for the object in the sub-region.
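
As a rough illustration of this last step, the sketch below applies predicted offsets to a proposal box. It assumes boxes given as center coordinates plus width and height, with offsets scaled by the box size and log-scale size changes, in the spirit of the R-CNN paper; `apply_deltas` is a hypothetical name and the offsets are made-up numbers rather than regressor outputs.

```python
import math

def apply_deltas(box, deltas):
    """box = (cx, cy, w, h); deltas = (dx, dy, dw, dh) as predicted by a regressor."""
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    return (cx + w * dx,        # shift the center by a fraction of the width
            cy + h * dy,        # shift the center by a fraction of the height
            w * math.exp(dw),   # rescale the width
            h * math.exp(dh))   # rescale the height

proposal = (50.0, 40.0, 20.0, 10.0)
print(apply_deltas(proposal, (0.0, 0.0, 0.0, 0.0)))    # (50.0, 40.0, 20.0, 10.0)
print(apply_deltas(proposal, (0.5, -0.25, 0.0, 0.0)))  # (60.0, 37.5, 20.0, 10.0)
```

Zero offsets leave the proposal unchanged, so the regressor only has to learn small corrections.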



### Fast R-CNN

R-CNN works really well, but is really quite slow for a few simple reasons:
+ It requires a forward pass of the CNN (AlexNet) for every single region proposal for every single image (that’s around 2000 forward passes per image!).
+ It has to train three different models separately - the CNN to generate image features, the classifier that predicts the class, and the regression model to tighten the bounding boxes. This makes the pipeline extremely hard to train.

In 2015, Ross Girshick, the first author of R-CNN, solved both these problems, leading to Fast R-CNN. 

For the forward pass of the CNN, Girshick realized that for each image, a lot of proposed regions for the image invariably overlapped causing us to run the same CNN computation again and again (~2000 times!). His insight was simple — Why not run the CNN just once per image and then find a way to share that computation across the ~2000 proposals?

This is exactly what Fast R-CNN does using a technique known as **RoIPool** (Region of Interest Pooling). At its core, RoIPool shares the forward pass of a CNN for an image across its subregions. In the image below, notice how the CNN features for each region are obtained by selecting a corresponding region from the CNN’s feature map. Then, the features in each region are pooled (usually using max pooling). So all it takes us is one pass of the original image as opposed to ~2000!

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/fastrcnn.png?raw=1" alt="" style="width: 600px;"/> 
(Source: Stanford’s CS231N slides by Fei-Fei Li, Andrej Karpathy, and Justin Johnson)
</center>
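
A minimal NumPy sketch of the pooling step, under simplifying assumptions: regions are given directly in feature-map coordinates as `(y0, x0, y1, x1)`, each region has at least `output_size` rows and columns, and the bins are split as evenly as possible. `roi_max_pool` is a hypothetical helper, not the implementation from the paper.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool roi = (y0, x0, y1, x1) of an (H, W, C) map into a fixed grid."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    # Split the rows and columns into output_size nearly equal bins
    row_bins = np.array_split(np.arange(region.shape[0]), output_size)
    col_bins = np.array_split(np.arange(region.shape[1]), output_size)
    pooled = np.empty((output_size, output_size, feature_map.shape[2]))
    for i, rows in enumerate(row_bins):
        for j, cols in enumerate(col_bins):
            cell = region[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled

# One shared feature map (computed once per image), two regions of different size
feature_map = np.arange(36, dtype=float).reshape(6, 6, 1)
pooled_a = roi_max_pool(feature_map, (0, 0, 4, 4))
pooled_b = roi_max_pool(feature_map, (1, 2, 6, 6))

print(pooled_a[..., 0])                # rows: [7, 9] and [19, 21]
print(pooled_a.shape, pooled_b.shape)  # (2, 2, 1) (2, 2, 1)
```

Both regions come out with the same fixed shape, so they can be fed to the same layers that follow, and the feature map itself is never recomputed.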


The second insight of Fast R-CNN is to jointly train the CNN, classifier, and bounding box regressor in a single model. Where earlier we had different models to extract image features (CNN), classify (SVM), and tighten bounding boxes (regressor), Fast R-CNN instead used a single network to compute all three.

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/fastrcnn2.png?raw=1" alt="" style="width: 400px;"/> 
(Source: https://www.slideshare.net/simplyinsimple/detection-52781995)
</center>

### Faster R-CNN

Even with all these advancements, there was still one remaining bottleneck in the Fast R-CNN process — the region proposer. As we saw, the very first step to detecting the locations of objects is generating a bunch of potential bounding boxes or regions of interest to test. In Fast R-CNN, these proposals were created using Selective Search, a fairly slow process that was found to be the bottleneck of the overall process.

In mid-2015, a team at Microsoft Research composed of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun found a way to make the region proposal step almost cost free through an architecture they (creatively) named Faster R-CNN.

The insight of Faster R-CNN was that region proposals depended on features of the image that were already calculated with the forward pass of the CNN (first step of classification). So why not reuse those same CNN results for region proposals instead of running a separate selective search algorithm?

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/fasterrcnn.png?raw=1" alt="" style="width: 400px;"/> 
(Source:  https://arxiv.org/abs/1506.01497)
</center>

Here are the inputs and outputs of their model:

> Inputs: Images (Notice how region proposals are not needed).

> Outputs: Classifications and bounding box coordinates of objects in the images.

### Mask R-CNN

So far, we’ve seen how we’ve been able to use CNN features in many interesting ways to effectively locate different objects in an image with bounding boxes.

Can we extend such techniques to go one step further and locate exact pixels of each object instead of just bounding boxes? This problem, known as image segmentation, is what Kaiming He and a team of researchers, including Girshick, explored at Facebook AI using an architecture known as Mask R-CNN.

Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel level segmentation? 

Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object. The branch (in white in the image below), as before, is just a Fully Convolutional Network on top of a CNN based feature map. Here are its inputs and outputs:

> Inputs: CNN Feature Map.

> Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/pixelrcnn.png?raw=1" alt="" style="width: 700px;"/> 
(Source:  https://arxiv.org/abs/1703.06870)
</center>
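
The binary mask described above can be illustrated with a toy example. The sketch assumes the mask branch emits one probability per pixel and that a threshold of 0.5 decides membership; the values in `probs` are made up.

```python
import numpy as np

# Hypothetical per-pixel object probabilities for a 3x3 region
probs = np.array([[0.1, 0.8, 0.9],
                  [0.2, 0.7, 0.4],
                  [0.0, 0.3, 0.6]])

# 1 where the pixel is considered part of the object, 0 elsewhere
mask = (probs >= 0.5).astype(np.uint8)

print(mask)
print(mask.sum())  # 4
```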

## 1D-Conv for text classification

**IMDB Movie reviews sentiment classification**: Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

The seminal research paper on this subject was published by Yoon Kim in 2014. In this paper, Yoon Kim laid the foundations for how to model and process text with convolutional neural networks for the purpose of sentiment analysis. He showed that with simple one-dimensional convolutional networks, one can develop very simple neural networks that reach 90% accuracy very quickly.

Here is the text of an example review from our dataset:

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/review1.png?raw=1" alt="" style="width: 600px;"/> 
</center>

In [2]:
!pip install numpy==1.16.2

<module 'numpy.version' from '/usr/local/lib/python3.6/dist-packages/numpy/version.py'>


In [4]:
import numpy as np
print(np.__version__)

1.16.2


In [5]:
'''
This example demonstrates the use of Convolution1D for text classification.
'''

from __future__ import print_function
import numpy as np
import tensorflow as tf
np.random.seed(1337)  # for reproducibility

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.datasets import imdb


# set parameters:
max_features = 5000
maxlen = 100
batch_size = 32
embedding_dims = 100
nb_filter = 250
filter_length = 3
hidden_dims = 250
nb_epoch = 10

print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), ' train sequences \n')
print(len(X_test), ' test sequences \n')

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features, embedding_dims, input_length=maxlen))
model.add(Dropout(0.25))

# we add a Convolution1D, which will learn nb_filter
# word group filters of size filter_length:
model.add(Conv1D(padding="valid", 
                 kernel_size=filter_length, 
                 filters=nb_filter, 
                 strides=1, 
                 activation="relu"))
# we use standard max pooling (halving the output of the previous layer):
model.add(MaxPooling1D(pool_size=2))

model.add(Conv1D(padding="valid", 
                 kernel_size=filter_length, 
                 filters=nb_filter, 
                 strides=1, 
                 activation="relu"))
model.add(MaxPooling1D(pool_size=2))


# We flatten the output of the conv layer,
# so that we can add a vanilla dense layer:
model.add(Flatten())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.25))
model.add(Activation('relu'))

# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=nb_epoch,
          validation_data=(X_test, y_test))

Loading data...
25000  train sequences 

25000  test sequences 

Pad sequences (samples x time)
X_train shape: (25000, 100)
X_test shape: (25000, 100)
Build model...
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f1cd7eacc18>
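
The sequence lengths flowing through the model above can be checked by hand. The sketch below assumes the same settings as the example (sequences of length 100, kernels of size 3 with "valid" padding, pooling of size 2, 250 filters); the helper names are hypothetical.

```python
def conv1d_valid_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1

def pool1d_len(length, pool_size):
    """Output length of non-overlapping 1D pooling."""
    return length // pool_size

length = 100                          # maxlen
length = conv1d_valid_len(length, 3)  # 98
length = pool1d_len(length, 2)        # 49
length = conv1d_valid_len(length, 3)  # 47
length = pool1d_len(length, 2)        # 23
flat = length * 250                   # size of the flattened vector

print(length, flat)  # 23 5750
```

So the hidden dense layer receives a vector of 5750 features per review.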