# Deep learning for text and sequences

This chapter explores deep-learning models that can process text (understood as sequences of words or sequences of characters), timeseries, and sequence data in general. The two fundamental deep-learning algorithms for sequence processing are recurrent neural networks and 1D convnets, the one-dimensional version of the 2D
convnets that we covered in the previous chapters. We’ll discuss both of these
approaches in this chapter.

Applications of these algorithms include the following:
- Document classification and timeseries classification, such as identifying the topic of an article or the author of a book
- Timeseries comparisons, such as estimating how closely related two documents or two stock tickers are
- Sequence-to-sequence learning, such as decoding an English sentence into French
- Sentiment analysis, such as classifying the sentiment of tweets or movie reviews as positive or negative
- Timeseries forecasting, such as predicting the future weather at a certain location, given recent weather data

This chapter’s examples focus on two narrow tasks: sentiment analysis on the IMDB
dataset, a task we approached earlier in the book, and temperature forecasting. But
the techniques demonstrated for these two tasks are relevant to all the applications
just listed, and many more.

## Working with text data
Text is one of the most widespread forms of sequence data. It can be understood as
either a sequence of characters or a sequence of words, but it’s most common to work
at the level of words. The deep-learning sequence-processing models introduced in
the following sections can use text to produce a basic form of natural-language understanding,
sufficient for applications including document classification, sentiment
analysis, author identification, and even question-answering (QA) (in a constrained
context). Of course, keep in mind throughout this chapter that none of these deep-learning
models truly understand text in a human sense; rather, these models can
map the statistical structure of written language, which is sufficient to solve many simple
textual tasks. Deep learning for natural-language processing is pattern recognition
applied to words, sentences, and paragraphs, in much the same way that computer
vision is pattern recognition applied to pixels.

Like all other neural networks, deep-learning models don’t take as input raw text:
they only work with numeric tensors. Vectorizing text is the process of transforming text
into numeric tensors. This can be done in multiple ways:
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector.

N-grams are overlapping groups of multiple consecutive words or characters.
Collectively, the different units into which you can break down text (words, characters,
or n-grams) are called tokens, and breaking text into such tokens is called tokenization.
All text-vectorization processes consist of applying some tokenization scheme and
then associating numeric vectors with the generated tokens. These vectors, packed
into sequence tensors, are fed into deep neural networks. There are multiple ways to
associate a vector with a token. In this section, I’ll present two major ones: one-hot
encoding of tokens, and token embedding (typically used exclusively for words, and called
word embedding). The remainder of this section explains these techniques and shows
how to use them to go from raw text to a Numpy tensor that you can send to a Keras
network.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx1.png" alt="Drawing" style="width:600px;"/></th>
  </tr>
</table>

### Understanding n-grams and bag-of-words
Word n-grams are groups of N (or fewer) consecutive words that you can extract from
a sentence. The same concept may also be applied to characters instead of words.

Here’s a simple example. Consider the sentence “The cat sat on the mat.” It may be
decomposed into the following set of 2-grams:

    {"The", "The cat", "cat", "cat sat", "sat",
    "sat on", "on", "on the", "the", "the mat", "mat"}

It may also be decomposed into the following set of 3-grams:

    {"The", "The cat", "cat", "cat sat", "The cat sat",
    "sat", "sat on", "on", "cat sat on", "on the", "the",
    "sat on the", "the mat", "mat", "on the mat"}

Such a set is called a bag-of-2-grams or bag-of-3-grams, respectively. The term bag
here refers to the fact that you’re dealing with a set of tokens rather than a list or
sequence: the tokens have no specific order. This family of tokenization methods is
called bag-of-words.
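To make the two sets above concrete, here is a minimal sketch that builds a bag of word n-grams in plain Python. The function name `bag_of_ngrams` and the naive whitespace tokenization are assumptions made for illustration, not part of any library.

```python
def bag_of_ngrams(sentence, n):
    """Return the set of all word n-grams of length 1 to n in `sentence`."""
    # Naive tokenization (an assumption): drop the final period, split on spaces.
    words = sentence.rstrip(".").split()
    grams = set()
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            grams.add(" ".join(words[start:start + size]))
    return grams


bag_2 = bag_of_ngrams("The cat sat on the mat.", 2)
bag_3 = bag_of_ngrams("The cat sat on the mat.", 3)
print(sorted(bag_2))
print(len(bag_3))
```

Because the result is a Python `set`, the order of the words in the sentence is not preserved, which is exactly the property the term bag refers to.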

Because bag-of-words isn’t an order-preserving tokenization method (the tokens generated
are understood as a set, not a sequence, and the general structure of the sentences
is lost), it tends to be used in shallow language-processing models rather than
in deep-learning models. Extracting n-grams is a form of feature engineering, and
deep learning does away with this kind of rigid, brittle approach, replacing it with hierarchical
feature learning. One-dimensional convnets and recurrent neural networks,
introduced later in this chapter, are capable of learning representations for groups of
words and characters without being explicitly told about the existence of such groups,
by looking at continuous word or character sequences. For this reason, we won’t
cover n-grams any further in this book. But do keep in mind that they’re a powerful,
unavoidable feature-engineering tool when using lightweight, shallow text-processing
models such as logistic regression and random forests.

## One-hot encoding of words and characters
One-hot encoding is the most common, most basic way to turn a token into a vector.
You saw it in action in the initial IMDB and Reuters examples in chapter 3 (done with
words, in that case). It consists of associating a unique integer index with every word
and then turning this integer index i into a binary vector of size N (the size of the
vocabulary); the vector is all zeros except for the ith entry, which is 1.
Of course, one-hot encoding can be done at the character level, as well. To unambiguously
drive home what one-hot encoding is and how to implement it, listings 6.1
and 6.2 show two toy examples: one for words, the other for characters.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx2.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx3.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>
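For readers who want a text version of the word-level idea, the following is a minimal NumPy sketch. The two sample sentences, the `max_length` cutoff, and the choice to leave index 0 unassigned are assumptions made for illustration.

```python
import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Assign a unique integer index to every word; index 0 is left unassigned.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 10  # only the first 10 words of each sample are encoded

# One row per sample, one column per word position, one slot per word index.
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, token_index[word]] = 1.0

print(results.shape)
```

Each word position holds a vector that is all zeros except for a single 1 at the index of that word; positions past the end of a sentence stay all zeros.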

Note that Keras has built-in utilities for doing one-hot encoding of text at the word level
or character level, starting from raw text data. You should use these utilities, because
they take care of a number of important features such as stripping special characters
from strings and only taking into account the N most common words in your dataset (a
common restriction, to avoid dealing with very large input vector spaces).

<table style="width:100%">
  <tr>
    <th><img src="photos/tx4.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx5.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>
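To see what restricting the vocabulary to the N most common words means in practice, here is a pure-Python sketch. It is not the Keras implementation; the helper names `fit_word_index` and `encode_texts`, the regular expression used to strip special characters, and the convention of reserving index 0 are all assumptions.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")  # crude stand-in for special-character stripping


def fit_word_index(texts, num_words):
    """Rank words by frequency and keep the top num_words - 1 (index 0 is reserved)."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    ranked = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: rank + 1 for rank, word in enumerate(ranked)}


def encode_texts(texts, word_index):
    """Turn each text into a list of integer indices, dropping out-of-vocabulary words."""
    return [
        [word_index[w] for w in TOKEN_PATTERN.findall(text.lower()) if w in word_index]
        for text in texts
    ]


samples = ["The cat sat on the mat.", "The dog ate my homework."]
word_index = fit_word_index(samples, num_words=4)
sequences = encode_texts(samples, word_index)
print(word_index)
print(sequences)
```

With such a small `num_words`, most words are dropped from the encoded sequences, which is the price paid for keeping the input vector space small.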

A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use
when the number of unique tokens in your vocabulary is too large to handle explicitly.
Instead of explicitly assigning an index to each word and keeping a reference of these
indices in a dictionary, you can hash words into vectors of fixed size. This is typically
done with a very lightweight hashing function. The main advantage of this method is
that it does away with maintaining an explicit word index, which saves memory and
allows online encoding of the data (you can generate token vectors right away, before
you’ve seen all of the available data). The one drawback of this approach is that it’s
susceptible to hash collisions: two different words may end up with the same hash, and
subsequently any machine-learning model looking at these hashes won’t be able to tell
the difference between these words. The likelihood of hash collisions decreases when
the dimensionality of the hashing space is much larger than the total number of
unique tokens being hashed.
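The hashing trick can be sketched as follows. This version uses `zlib.crc32` as the lightweight hash because it gives the same value on every run, unlike Python's built-in `hash()` for strings; the dimensionality of 1,000 and the helper name `hashed_index` are assumptions for illustration.

```python
import zlib

import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework."]

dimensionality = 1000  # size of the hashing space (assumed value)
max_length = 10


def hashed_index(word, dim):
    """Map a word to a slot in [0, dim) without storing any word index."""
    return zlib.crc32(word.encode("utf-8")) % dim


results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, hashed_index(word, dimensionality)] = 1.0

print(results.shape)
```

No dictionary is kept, so a new word can be encoded as soon as it is seen. If the vocabulary approaches or exceeds `dimensionality`, distinct words will start sharing slots.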

## Using word embeddings
Another popular and powerful way to associate a vector with a word is the use of dense
word vectors, also called word embeddings. Whereas the vectors obtained through one-hot
encoding are binary, sparse (mostly made of zeros), and very high-dimensional (same
dimensionality as the number of words in the vocabulary), word embeddings are low-dimensional
floating-point vectors (that is, dense vectors, as opposed to sparse vectors);
see figure 6.2. Unlike the word vectors obtained via one-hot encoding, word
embeddings are learned from data. It’s common to see word embeddings that are
256-dimensional, 512-dimensional, or 1,024-dimensional when dealing with very large
vocabularies. On the other hand, one-hot encoding words generally leads to vectors
that are 20,000-dimensional or greater (capturing a vocabulary of 20,000 tokens, in this case). So, word embeddings pack more information into far fewer dimensions.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx6.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

There are two ways to obtain word embeddings:
- Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn word vectors in the same way you learn the weights of a neural network.
- Load into your model word embeddings that were precomputed using a different machine-learning task than the one you’re trying to solve. These are called pretrained word embeddings.

Let’s look at both.

### LEARNING WORD EMBEDDINGS WITH THE EMBEDDING LAYER
The simplest way to associate a dense vector with a word is to choose the vector at
random. The problem with this approach is that the resulting embedding space has
no structure: for instance, the words accurate and exact may end up with completely
different embeddings, even though they’re interchangeable in most sentences. It’s
difficult for a deep neural network to make sense of such a noisy, unstructured
embedding space.

To get a bit more abstract, the geometric relationships between word vectors
should reflect the semantic relationships between these words. Word embeddings are
meant to map human language into a geometric space. For instance, in a reasonable
embedding space, you would expect synonyms to be embedded into similar word vectors;
and in general, you would expect the geometric distance (such as L2 distance)
between any two word vectors to relate to the semantic distance between the associated
words (words meaning different things are embedded at points far away from
each other, whereas related words are closer). In addition to distance, you may want
specific directions in the embedding space to be meaningful. To make this clearer, let’s
look at a concrete example.

In figure 6.3, four words are embedded on a 2D plane:
cat, dog, wolf, and tiger. With the vector representations we
chose here, some semantic relationships between these
words can be encoded as geometric transformations. For
instance, the same vector allows us to go from cat to tiger
and from dog to wolf: this vector could be interpreted as the
“from pet to wild animal” vector. Similarly, another vector
lets us go from dog to cat and from wolf to tiger, which could
be interpreted as a “from canine to feline” vector.

In real-world word-embedding spaces, common examples
of meaningful geometric transformations are “gender”
vectors and “plural” vectors. For instance, by adding a “female” vector to the vector
“king,” we obtain the vector “queen.” By adding a “plural” vector, we obtain “kings.”
Word-embedding spaces typically feature thousands of such interpretable and potentially
useful vectors.
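The cat, dog, wolf, and tiger example can be reproduced with a few lines of NumPy. The 2D coordinates below are invented for illustration; they are not taken from the figure or from any trained embedding.

```python
import numpy as np

# Hypothetical 2D coordinates, chosen so the relationships are easy to read.
vectors = {
    "cat": np.array([1.0, 1.0]),
    "dog": np.array([0.0, 1.0]),
    "tiger": np.array([1.0, 2.0]),
    "wolf": np.array([0.0, 2.0]),
}

# The same displacement takes a pet to its wild counterpart...
pet_to_wild = vectors["tiger"] - vectors["cat"]
# ...and another displacement takes a canine to a feline.
canine_to_feline = vectors["cat"] - vectors["dog"]

print(np.allclose(vectors["dog"] + pet_to_wild, vectors["wolf"]))
print(np.allclose(vectors["wolf"] + canine_to_feline, vectors["tiger"]))
```

In a learned embedding space such relationships hold only approximately, so in practice you would look for the nearest word vector to the result rather than test for equality.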

Is there some ideal word-embedding space that would perfectly map human language
and could be used for any natural-language-processing task? Possibly, but we
have yet to compute anything of the sort. Also, there is no such thing as human language:
there are many different languages, and they aren’t isomorphic, because a language
is the reflection of a specific culture and a specific context. But more
pragmatically, what makes a good word-embedding space depends heavily on your task:
the perfect word-embedding space for an English-language movie-review sentiment-analysis
model may look different from the perfect embedding space for an English-language
legal-document-classification model, because the importance of certain
semantic relationships varies from task to task.

It’s thus reasonable to learn a new embedding space with every new task. Fortunately,
backpropagation makes this easy, and Keras makes it even easier. It’s about
learning the weights of a layer: the Embedding layer.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx7.png" alt="Drawing" style="width:400px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx8.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx9.png" alt="Drawing" style="width:800px;"/></th>
  </tr>
</table>

The Embedding layer is best understood as a dictionary that maps integer indices
(which stand for specific words) to dense vectors. It takes integers as input, it looks up
these integers in an internal dictionary, and it returns the associated vectors. It’s effectively
a dictionary lookup (see figure 6.4).

The Embedding layer takes as input a 2D tensor of integers, of shape (samples,
sequence_length), where each entry is a sequence of integers. It can embed
sequences of variable lengths: for instance, you could feed into the Embedding layer in
the previous example batches with shapes (32, 10) (batch of 32 sequences of length 10) or (64, 15) (batch of 64 sequences of length 15). All sequences in a batch must
have the same length, though (because you need to pack them into a single tensor),
so sequences that are shorter than others should be padded with zeros, and sequences
that are longer should be truncated.
This layer returns a 3D floating-point tensor of shape (samples, sequence_length,
embedding_dimensionality). Such a 3D tensor can then be processed by
an RNN layer or a 1D convolution layer (both will be introduced in the following
sections).
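The lookup behavior and the shapes involved can be imitated in NumPy. This is a sketch of the idea, not the Keras layer itself: the table size of 1,000 tokens by 64 dimensions, the helper name `pad_or_truncate`, and the choice to pad and truncate at the end of each sequence are assumptions.

```python
import numpy as np


def pad_or_truncate(sequences, maxlen):
    """Pack variable-length integer sequences into one (samples, maxlen) tensor."""
    batch = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]             # truncate sequences that are too long
        batch[row, :len(kept)] = kept   # shorter ones keep trailing zeros as padding
    return batch


rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64
# Randomly initialized table: one row of weights per token index.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

sequences = [[4, 25, 7], [12, 3, 99, 512, 8, 41]]
batch = pad_or_truncate(sequences, maxlen=5)  # shape (samples, sequence_length)
embedded = embedding_table[batch]             # shape (samples, sequence_length, embedding_dim)

print(batch)
print(embedded.shape)
```

Indexing the table with an integer tensor returns one row per integer, which is all a dictionary lookup from indices to dense vectors amounts to.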

When you instantiate an Embedding layer, its weights (its internal dictionary of
token vectors) are initially random, just as with any other layer. During training, these
word vectors are gradually adjusted via backpropagation, structuring the space into
something the downstream model can exploit. Once fully trained, the embedding
space will show a lot of structure—a kind of structure specialized for the specific problem
for which you’re training your model.
Let’s apply this idea to the IMDB movie-review sentiment-prediction task that
you’re already familiar with. First, you’ll quickly prepare the data. You’ll restrict the
movie reviews to the top 10,000 most common words (as you did the first time you
worked with this dataset) and cut off the reviews after only 20 words. The network will
learn 8-dimensional embeddings for each of the 10,000 words, turn the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor), flatten the
tensor to 2D, and train a single Dense layer on top for classification.
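Before training anything, it can help to trace the tensor shapes through this architecture. The sketch below runs a forward pass in NumPy with random weights; the batch size of 32, the weight scale, and the sigmoid output are assumptions, and no learning takes place.

```python
import numpy as np

max_features, maxlen, embedding_dim = 10000, 20, 8
rng = np.random.default_rng(0)

# Randomly initialized weights standing in for the Embedding and Dense layers.
embedding_table = rng.normal(scale=0.05, size=(max_features, embedding_dim))
dense_w = rng.normal(scale=0.05, size=(maxlen * embedding_dim, 1))
dense_b = np.zeros(1)

x = rng.integers(0, max_features, size=(32, maxlen))  # fake batch of integer sequences
embedded = embedding_table[x]                         # 3D float tensor
flat = embedded.reshape(len(x), -1)                   # flattened to 2D
probs = 1.0 / (1.0 + np.exp(-(flat @ dense_w + dense_b)))  # sigmoid output (assumed)

n_params = embedding_table.size + dense_w.size + dense_b.size
print(embedded.shape, flat.shape, probs.shape)
print(n_params)
```

Almost all of the weights sit in the embedding table (10,000 × 8 = 80,000), while the classifier on top adds only 161.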



<table style="width:100%">
  <tr>
    <th><img src="photos/tx10.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx11.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

You get to a validation accuracy of ~76%, which is pretty good considering that you’re
only looking at the first 20 words in every review. But note that merely flattening the
embedded sequences and training a single Dense layer on top leads to a model that
treats each word in the input sequence separately, without considering inter-word
relationships and sentence structure (for example, this model would likely treat both
“this movie is a bomb” and “this movie is the bomb” as being negative reviews). It’s
much better to add recurrent layers or 1D convolutional layers on top of the embedded
sequences to learn features that take into account each sequence as a whole.
That’s what we’ll focus on in the next few sections.

### USING PRETRAINED WORD EMBEDDINGS
Sometimes, you have so little training data available that you can’t use your data
alone to learn an appropriate task-specific embedding of your vocabulary. What do
you do then?

Instead of learning word embeddings jointly with the problem you want to solve,
you can load embedding vectors from a precomputed embedding space that you
know is highly structured and exhibits useful properties—that captures generic
aspects of language structure. The rationale behind using pretrained word embeddings
in natural-language processing is much the same as for using pretrained convnets
in image classification: you don’t have enough data available to learn truly
powerful features on your own, but you expect the features that you need to be fairly
generic—that is, common visual features or semantic features. In this case, it makes
sense to reuse features learned on a different problem.

Such word embeddings are generally computed using word-occurrence statistics
(observations about what words co-occur in sentences or documents), using a variety of
techniques, some involving neural networks, others not. The idea of a dense, low-dimensional
embedding space for words, computed in an unsupervised way, was initially
explored by Bengio et al. in the early 2000s, but it only started to take off in
research and industry applications after the release of one of the most famous and successful
word-embedding schemes: the Word2vec algorithm (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2vec
dimensions capture specific semantic properties, such as gender.
There are various precomputed databases of word embeddings that you can download
and use in a Keras Embedding layer. Word2vec is one of them. Another popular
one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This
embedding technique is based on factorizing a matrix of word co-occurrence statistics.
Its developers have made available precomputed embeddings for millions of
English tokens, obtained from Wikipedia data and Common Crawl data.
Let’s look at how you can get started using GloVe embeddings in a Keras model.
The same method is valid for Word2vec embeddings or any other word-embedding
database. You’ll also use this example to refresh the text-tokenization techniques
introduced a few paragraphs ago: you’ll start from raw text and work your way up.

## Putting it all together: from raw text to word embeddings
You’ll use a model similar to the one we just went over: embedding sentences in
sequences of vectors, flattening them, and training a Dense layer on top. But you’ll do
so using pretrained word embeddings; and instead of using the pretokenized IMDB
data packaged in Keras, you’ll start from scratch by downloading the original text data.

### DOWNLOADING THE IMDB DATA AS RAW TEXT
First, head to http://mng.bz/0tIo and download the raw IMDB dataset. Uncompress it.
Now, let’s collect the individual training reviews into a list of strings, one string per
review. You’ll also collect the review labels (positive/negative) into a labels list.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx12.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

### TOKENIZING THE DATA
Let’s vectorize the text and prepare a training and validation split, using the concepts
introduced earlier in this section. Because pretrained word embeddings are meant to
be particularly useful on problems where little training data is available (otherwise,
task-specific embeddings are likely to outperform them), we’ll add the following twist:
restricting the training data to the first 200 samples. So you’ll learn to classify movie
reviews after looking at just 200 examples.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx13.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx14.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>
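One way to draw a small training set at random and still take "the first 200 samples" is to shuffle the data before slicing it. The sketch below does this on fake data; the array sizes, the random seed, and the use of all remaining samples for validation are assumptions.

```python
import numpy as np

training_samples = 200

rng = np.random.default_rng(42)
# Fake stand-ins for the padded integer sequences and their binary labels.
data = rng.integers(1, 10000, size=(1000, 100))
labels = rng.integers(0, 2, size=1000)

# Shuffle samples and labels with the same permutation so they stay aligned.
indices = rng.permutation(len(data))
data, labels = data[indices], labels[indices]

x_train, y_train = data[:training_samples], labels[:training_samples]
x_val, y_val = data[training_samples:], labels[training_samples:]

print(x_train.shape, x_val.shape)
```

Shuffling matters when the raw samples are ordered by class, because otherwise the first 200 samples could all carry the same label.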

### DOWNLOADING THE GLOVE WORD EMBEDDINGS
Go to https://nlp.stanford.edu/projects/glove, and download the precomputed
embeddings from 2014 English Wikipedia. It’s an 822 MB zip file called glove.6B.zip,
containing 100-dimensional embedding vectors for 400,000 words (or nonword
tokens). Unzip it.

### PREPROCESSING THE EMBEDDINGS
Let’s parse the unzipped file (a .txt file) to build an index that maps words (as strings)
to their vector representation (as number vectors).

<table style="width:100%">
  <tr>
    <th><img src="photos/tx15.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>
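The parsing step can be sketched on a tiny in-memory file. GloVe text files store one token per line followed by its space-separated vector components; the two sample lines and the helper name `load_embeddings` below are invented for illustration.

```python
import io

import numpy as np


def load_embeddings(lines):
    """Build a dict mapping each word to its vector from 'word v1 v2 ...' lines."""
    embeddings_index = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype="float32")
    return embeddings_index


# Hypothetical 3-dimensional vectors; the real file has 100 components per word.
fake_file = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
embeddings_index = load_embeddings(fake_file)

print(len(embeddings_index))
print(embeddings_index["cat"])
```

To read the real data, you would pass an open file handle for the unzipped .txt file in place of `fake_file`.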

Next, you’ll build an embedding matrix that you can load into an Embedding layer. It
must be a matrix of shape (max_words, embedding_dim), where each entry i contains
the embedding_dim-dimensional vector for the word of index i in the reference word
index (built during tokenization). Note that index 0 isn’t supposed to stand for any
word or token—it’s a placeholder.



<table style="width:100%">
  <tr>
    <th><img src="photos/tx17.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>
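Here is a small sketch of how such a matrix can be assembled. The five-word vocabulary, the 3-dimensional vectors, and the word "zyzzyva" standing in for a token missing from the pretrained index are all invented for illustration.

```python
import numpy as np

max_words, embedding_dim = 5, 3

# Hypothetical word index from tokenization (indices start at 1).
word_index = {"the": 1, "cat": 2, "sat": 3, "zyzzyva": 4, "mat": 5}
# Hypothetical pretrained vectors; "zyzzyva" and "mat" are deliberately absent.
embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([-0.5, 0.0, 0.25], dtype="float32"),
    "sat": np.array([0.7, -0.1, 0.4], dtype="float32"),
}

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:  # indices at or beyond max_words have no row in the matrix
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector  # words without a pretrained vector stay all zeros

print(embedding_matrix)
```

Row 0 stays at zero because it is the placeholder index, and row 4 stays at zero because its word has no pretrained vector.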

### DEFINING A MODEL
You’ll use the same model architecture as before.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx18.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

### LOADING THE GLOVE EMBEDDINGS IN THE MODEL
The Embedding layer has a single weight matrix: a 2D float matrix where each entry i is
the word vector meant to be associated with index i. Simple enough. Load the GloVe
matrix you prepared into the Embedding layer, the first layer in the model.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx19.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

Additionally, you’ll freeze the Embedding layer (set its trainable attribute to False),
following the same rationale you’re already familiar with in the context of pretrained
convnet features: when parts of a model are pretrained (like your Embedding layer)
and parts are randomly initialized (like your classifier), the pretrained parts shouldn’t
be updated during training, to avoid forgetting what they already know. The large gradient
updates triggered by the randomly initialized layers would be disruptive to the
already-learned features.

### TRAINING AND EVALUATING THE MODEL
Compile and train the model.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx20.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx21.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

The model quickly starts overfitting, which is unsurprising given the small number of
training samples. Validation accuracy has high variance for the same reason, but it
seems to reach the high 50s.

Note that your mileage may vary: because you have so few training samples, performance
is heavily dependent on exactly which 200 samples you choose—and you’re
choosing them at random. If this works poorly for you, try choosing a different random
set of 200 samples, for the sake of the exercise (in real life, you don’t get to
choose your training data).

You can also train the same model without loading the pretrained word embeddings
and without freezing the embedding layer. In that case, you’ll learn a task-specific
embedding of the input tokens, which is generally more powerful than
pretrained word embeddings when lots of data is available. But in this case, you have
only 200 training samples. Let’s try it (see figures 6.7 and 6.8).

<table style="width:100%">
  <tr>
    <th><img src="photos/tx22.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
    <tr>
    <th><img src="photos/tx23.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

Validation accuracy stalls in the low 50s. So in this case, pretrained word embeddings
outperform jointly learned embeddings. If you increase the number of training samples,
this will quickly stop being the case—try it as an exercise.
Finally, let’s evaluate the model on the test data. First, you need to tokenize the test
data.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx24.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx25.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

Next, load and evaluate the first model.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx26.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

## Wrapping up
Now you’re able to do the following:
- Turn raw text into something a neural network can process
- Use the Embedding layer in a Keras model to learn task-specific token embeddings
- Use pretrained word embeddings to get an extra boost on small natural language processing problems

# Sequence processing with convnets
In chapter 5, you learned about convolutional neural networks (convnets) and how
they perform particularly well on computer vision problems, due to their ability to
operate convolutionally, extracting features from local input patches and allowing for
representation modularity and data efficiency. The same properties that make convnets
excel at computer vision also make them highly relevant to sequence processing.
Time can be treated as a spatial dimension, like the height or width of a 2D image.
Such 1D convnets can be competitive with RNNs on certain sequence-processing
problems, usually at a considerably cheaper computational cost. Recently, 1D convnets,
typically used with dilated kernels, have been used with great success for audio
generation and machine translation. In addition to these specific successes, it has long
been known that small 1D convnets can offer a fast alternative to RNNs for simple tasks
such as text classification and timeseries forecasting.

## Understanding 1D convolution for sequence data
The convolution layers introduced previously were 2D convolutions, extracting 2D
patches from image tensors and applying an identical transformation to every patch.
In the same way, you can use 1D convolutions, extracting local 1D patches (subsequences)
from sequences (see figure 6.26).

<table style="width:100%">
  <tr>
    <th><img src="photos/tx27.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

Such 1D convolution layers can recognize local patterns in a sequence. Because the
same input transformation is performed on every patch, a pattern learned at a certain
position in a sentence can later be recognized at a different position, making 1D convnets
translation invariant (for temporal translations). For instance, a 1D convnet processing
sequences of characters using convolution windows of size 5 should be able to
learn words or word fragments of length 5 or less, and it should be able to recognize these words in any context in an input sequence. A character-level 1D convnet is thus
able to learn about word morphology.
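The following NumPy sketch shows why applying the same transformation to every patch lets a pattern be recognized wherever it occurs. It uses a single filter with no bias or activation, and, like deep-learning frameworks, it slides the kernel without flipping it; the pattern values and sequence length are invented for illustration.

```python
import numpy as np


def conv1d_valid(sequence, kernel):
    """Slide one kernel over a (time, features) sequence; returns (time - window + 1,)."""
    window = kernel.shape[0]
    steps = sequence.shape[0] - window + 1
    # The same weights are applied to every local patch of the sequence.
    return np.array([np.sum(sequence[t:t + window] * kernel) for t in range(steps)])


pattern = np.array([[1.0], [-1.0], [1.0]])  # a length-3 motif with one feature
kernel = pattern.copy()                     # a filter that has "learned" this motif

early = np.zeros((10, 1))
early[1:4] = pattern                        # motif near the start
late = np.zeros((10, 1))
late[6:9] = pattern                         # same motif near the end

resp_early = conv1d_valid(early, kernel)
resp_late = conv1d_valid(late, kernel)
print(resp_early.argmax(), resp_early.max())
print(resp_late.argmax(), resp_late.max())
```

The peak response has the same value in both sequences; only its position in the output changes, following the position of the motif in the input.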

### 1D pooling for sequence data
You’re already familiar with 2D pooling operations, such as 2D average pooling and
max pooling, used in convnets to spatially downsample image tensors. The 2D pooling
operation has a 1D equivalent: extracting 1D patches (subsequences) from an input
and outputting the maximum value (max pooling) or average value (average pooling).
Just as with 2D convnets, this is used for reducing the length of 1D inputs (subsampling).
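A minimal NumPy sketch of 1D pooling follows. It assumes non-overlapping windows and drops any leftover timesteps at the end; the helper name `pool1d` and the input values are invented for illustration.

```python
import numpy as np


def pool1d(sequence, pool_size, mode="max"):
    """Downsample a (time, features) sequence over non-overlapping windows."""
    steps = sequence.shape[0] // pool_size  # leftover timesteps are dropped
    windows = sequence[:steps * pool_size].reshape(steps, pool_size, -1)
    if mode == "max":
        return windows.max(axis=1)
    return windows.mean(axis=1)


x = np.array([[1.0], [3.0], [2.0], [8.0], [5.0], [4.0], [7.0]])
max_pooled = pool1d(x, 2)
avg_pooled = pool1d(x, 2, mode="avg")
print(max_pooled.ravel())
print(avg_pooled.ravel())
```

Seven timesteps become three with a pool size of 2, so each pooling stage roughly halves the sequence length here.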

### Implementing a 1D convnet
In Keras, you use a 1D convnet via the Conv1D layer, which has an interface similar to
Conv2D. It takes as input 3D tensors with shape (samples, time, features) and
returns similarly shaped 3D tensors. The convolution window is a 1D window on the
temporal axis: axis 1 in the input tensor.
Let’s build a simple two-layer 1D convnet and apply it to the IMDB sentiment classification
task you’re already familiar with. As a reminder, this is the code for
obtaining and preprocessing the data.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx28.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

1D convnets are structured in the same way as their 2D counterparts, which you used
in chapter 5: they consist of a stack of Conv1D and MaxPooling1D layers, ending in
either a global pooling layer or a Flatten layer, which turns the 3D outputs into 2D outputs,
allowing you to add one or more Dense layers to the model for classification or
regression.

One difference, though, is the fact that you can afford to use larger convolution
windows with 1D convnets. With a 2D convolution layer, a 3 × 3 convolution window
contains 3 × 3 = 9 feature vectors; but with a 1D convolution layer, a convolution window
of size 3 contains only 3 feature vectors. You can thus easily afford 1D convolution
windows of size 7 or 9.
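The cost argument can be checked with simple arithmetic on weight counts. The sketch below assumes the usual layout of one weight per window position, input feature, and filter, plus one bias per filter; the sizes of 128 input features and 32 filters are arbitrary.

```python
def conv_weights(window_positions, in_features, filters):
    """Weights in a convolution layer: kernel entries plus one bias per filter."""
    return window_positions * in_features * filters + filters


in_features, filters = 128, 32  # hypothetical layer sizes

for label, positions in [
    ("2D, 3 x 3 window", 3 * 3),
    ("1D, window of 3", 3),
    ("1D, window of 7", 7),
    ("1D, window of 9", 9),
]:
    print(label, conv_weights(positions, in_features, filters))
```

With these sizes, a 1D window of 9 costs exactly as many weights as a 3 × 3 window in 2D, and a window of 7 costs less.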

This is the example 1D convnet for the IMDB dataset.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx29.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

Figures 6.27 and 6.28 show the training and validation results. Validation accuracy is
somewhat less than that of the LSTM, but runtime is faster on both CPU and GPU (the
exact increase in speed will vary greatly depending on your exact configuration). At this
point, you could retrain this model for the right number of epochs (eight) and run it
on the test set. This is a convincing demonstration that a 1D convnet can offer a fast,
cheap alternative to a recurrent network on a word-level sentiment-classification task.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx30.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx31.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

## Combining CNNs and RNNs to process long sequences
Because 1D convnets process input patches independently, they aren’t sensitive to the
order of the timesteps (beyond a local scale, the size of the convolution windows),
unlike RNNs. Of course, to recognize longer-term patterns, you can stack many convolution
layers and pooling layers, resulting in upper layers that will see long chunks of
the original inputs—but that’s still a fairly weak way to induce order sensitivity. One
way to evidence this weakness is to try 1D convnets on the temperature-forecasting
problem, where order-sensitivity is key to producing good predictions. The following
example reuses the following variables defined previously: float_data, train_gen,
val_gen, and val_steps.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx32.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
    <tr>
    <th><img src="photos/tx33.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

The validation MAE stays in the 0.40s: you can’t even beat the common-sense baseline
using the small convnet. Again, this is because the convnet looks for patterns anywhere
in the input timeseries and has no knowledge of the temporal position of a pattern
it sees (toward the beginning, toward the end, and so on). Because more recent
data points should be interpreted differently from older data points in the case of this
specific forecasting problem, the convnet fails at producing meaningful results. This
limitation of convnets isn’t an issue with the IMDB data, because patterns of keywords
associated with a positive or negative sentiment are informative independently of
where they’re found in the input sentences.

One strategy to combine the speed and lightness of convnets with the order-sensitivity
of RNNs is to use a 1D convnet as a preprocessing step before an RNN (see figure 6.30).
This is especially beneficial when you’re dealing
with sequences that are so long they can’t
realistically be processed with RNNs, such as
sequences with thousands of steps. The convnet
will turn the long input sequence into
much shorter (downsampled) sequences of
higher-level features. This sequence of
extracted features then becomes the input to
the RNN part of the network.

This technique isn’t seen often in
research papers and practical applications,
possibly because it isn’t well known. It’s effective
and ought to be more common. Let’s try
it on the temperature-forecasting dataset.
Because this strategy allows you to manipulate
much longer sequences, you can either look at data from longer ago (by increasing the lookback parameter of the data generator)
or look at high-resolution timeseries (by decreasing the step parameter of the
generator). Here, somewhat arbitrarily, you’ll use a step that’s half as large, resulting
in a timeseries twice as long, where the temperature data is sampled at a rate of
1 point per 30 minutes. The example reuses the generator function defined earlier.
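To get a feel for how much a convolutional front end shortens the sequence handed to the RNN, you can track the sequence length layer by layer. The numbers below are assumptions: a lookback of 720 raw timesteps, a step of 3, convolution windows of size 5 without padding, and a pooling size of 3.

```python
def conv_len(length, window):
    """Output length of an unpadded convolution with stride 1."""
    return length - window + 1


def pool_len(length, pool_size):
    """Output length of non-overlapping pooling."""
    return length // pool_size


lookback, step = 720, 3        # assumed generator settings
timesteps = lookback // step   # length of each input sequence

after_conv1 = conv_len(timesteps, 5)
after_pool = pool_len(after_conv1, 3)
after_conv2 = conv_len(after_pool, 5)

print(timesteps, after_conv1, after_pool, after_conv2)
```

Under these assumptions the recurrent layer processes 74 steps instead of 240, which is where most of the speedup comes from.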

<table style="width:100%">
  <tr>
    <th><img src="photos/tx34.png" alt="Drawing" style="width:400px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx35.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

This is the model, starting with two Conv1D layers and following up with a GRU layer.
Figure 6.31 shows the results.

<table style="width:100%">
  <tr>
    <th><img src="photos/tx36.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
  <tr>
    <th><img src="photos/tx37.png" alt="Drawing" style="width:1000px;"/></th>
  </tr>
</table>

Judging from the validation loss, this setup isn’t as good as the regularized GRU alone,
but it’s significantly faster. It looks at twice as much data, which in this case doesn’t
appear to be hugely helpful but may be important for other datasets.

## Wrapping up
Here’s what you should take away from this section:
- In the same way that 2D convnets perform well for processing visual patterns in 2D space, 1D convnets perform well for processing temporal patterns. They offer a faster alternative to RNNs on some problems, in particular natural-language-processing tasks.
- Typically, 1D convnets are structured much like their 2D equivalents from the world of computer vision: they consist of stacks of Conv1D layers and MaxPooling1D layers, ending in a global pooling operation or flattening operation.
- Because RNNs are extremely expensive for processing very long sequences, but 1D convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step before an RNN, shortening the sequence and extracting useful representations for the RNN to process.

