# How to Handle Missing Timesteps in Sequence Prediction Problems with Python
It is common to have missing observations from sequence data.

Data may be corrupt or unavailable, but it is also possible that your data has variable length sequences by definition. Those sequences with fewer timesteps may be considered to have missing values.

In this tutorial, you will discover how you can handle data with missing values for sequence prediction problems in Python with the Keras deep learning library.

After completing this tutorial, you will know:

* How to remove rows that contain a missing timestep.
* How to mark missing timesteps and force the network to learn their meaning.
* How to mask missing timesteps and exclude them from calculations in the model.

Let’s get started.

## Overview
This tutorial is divided into 3 parts; they are:

1. Echo Sequence Prediction Problem
2. Handling Missing Sequence Data
3. Learning With Missing Sequence Values

### Environment
This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras (v2.0.4+) installed with either the TensorFlow (v1.1.0+) or Theano (v0.9+) backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

## Echo Sequence Prediction Problem
The echo problem is a contrived sequence prediction problem where the objective is to remember and predict an observation at a fixed prior timestep, called a lag observation.

The simplest case is to predict the observation from the previous timestep, that is, to echo it back. For example:

In [None]:
Time 1: Input 45
Time 2: Input 23, Output 45
Time 3: Input 73, Output 23
...

The question is, what do we do about timestep 1?
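
To make the question concrete, here is a minimal sketch that builds the input/output pairs from the example above. The helper name `echo_pairs` and its `lag` parameter are hypothetical and are not part of the tutorial code; `None` is used here only to flag a timestep that has no prior observation.

```python
# Hypothetical helper for illustration; not part of the tutorial code.
def echo_pairs(sequence, lag=1):
    """Pair each input with the observation `lag` steps earlier."""
    pairs = []
    for t, value in enumerate(sequence):
        # None marks a timestep that has no prior observation to echo
        target = sequence[t - lag] if t >= lag else None
        pairs.append((value, target))
    return pairs

pairs = echo_pairs([45, 23, 73])
for t, (x, y) in enumerate(pairs, start=1):
    print('Time %d: Input %s, Output %s' % (t, x, y))
```

The first pair has no output to echo, which is exactly the missing-value case the rest of the tutorial deals with.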

We can implement the echo sequence prediction problem in Python.

This involves two steps: the generation of random sequences and the transformation of random sequences into a supervised learning problem.

### Generate Random Sequence
We can generate sequences of random values between 0 and 1 using the [random()](https://docs.python.org/3/library/random.html) function in the random module.

We can put this in a function called generate_sequence() that will generate a sequence of random floating point values for the desired number of timesteps.

This function is listed below.

In [None]:
# generate a sequence of random values
def generate_sequence(n_timesteps):
	return [random() for _ in range(n_timesteps)]
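
As a quick check, the sketch below calls the function twice with the same seed. Seeding with `seed()` is an addition for reproducibility only and is not used in the tutorial listings, which draw a fresh sequence on every run.

```python
from random import random, seed

# generate a sequence of random values
def generate_sequence(n_timesteps):
    return [random() for _ in range(n_timesteps)]

# seeding is optional; it is used here only so both calls match
seed(1)
first = generate_sequence(5)
seed(1)
second = generate_sequence(5)
print(first)
print(first == second)
```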

### Frame as Supervised Learning
Sequences must be framed as a supervised learning problem when using neural networks.

That means the sequence needs to be divided into input and output pairs.

The problem can be framed as making a prediction based on a function of the current and previous timesteps.

Or more formally:

In [None]:
y(t) = f(X(t), X(t-1))

Where y(t) is the desired output for the current timestep, f() is the function we are seeking to approximate with our neural network, and X(t) and X(t-1) are the observations for the current and previous timesteps.

The output could be equal to the previous observation, for example, y(t) = X(t-1), but it could as easily be y(t) = X(t). The model that we train on this problem does not know the true formulation and must learn this relationship.

This mimics real sequence prediction problems where we specify the model as a function of some fixed set of sequenced timesteps, but we don’t know the actual functional relationship from past observations to the desired output value.

We can implement this framing of an echo problem as a supervised learning problem in Python.

The [Pandas shift()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html) function can be used to create a shifted version of the sequence that can be used to represent the observations at the prior timestep. This can be concatenated with the raw sequence to provide the X(t-1) and X(t) input values.

In [None]:
df = DataFrame(sequence)
df = concat([df.shift(1), df], axis=1)

We can then take the values from the Pandas DataFrame as the input sequence (X) and use the first column as the output sequence (y).

In [None]:
# take the underlying NumPy array from the DataFrame
values = df.values
# specify input and output data
X, y = values, values[:, 0]
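
The following sketch runs the same shift-and-concatenate steps on a short fixed sequence so the result is easy to inspect. The values 0.1 to 0.4 are made up for illustration.

```python
from pandas import DataFrame, concat

# a short fixed sequence (illustrative values, not from the tutorial)
sequence = [0.1, 0.2, 0.3, 0.4]
df = DataFrame(sequence)
# column 0 is X(t-1), column 1 is X(t)
df = concat([df.shift(1), df], axis=1)
values = df.values
# inputs are both columns; output is the lagged column
X, y = values, values[:, 0]
print(X)
print(y)
```

The first row has NaN in the lagged column because shift(1) has nothing to pull down into position 0.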

Putting this all together, we can define a function called generate_data() that takes the number of timesteps as an argument and returns X,y data for sequence learning.

In [None]:
# generate data for the lstm
def generate_data(n_timesteps):
	# generate sequence
	sequence = generate_sequence(n_timesteps)
	sequence = array(sequence)
	# create lag
	df = DataFrame(sequence)
	df = concat([df.shift(1), df], axis=1)
	values = df.values
	# specify input and output data
	X, y = values, values[:, 0]
	return X, y

### Sequence Problem Demonstration
We can tie the generate_sequence() and generate_data() code together into a worked example.

The complete example is listed below.

In [1]:
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame
 
# generate a sequence of random values
def generate_sequence(n_timesteps):
	return [random() for _ in range(n_timesteps)]
 
# generate data for the lstm
def generate_data(n_timesteps):
	# generate sequence
	sequence = generate_sequence(n_timesteps)
	sequence = array(sequence)
	# create lag
	df = DataFrame(sequence)
	df = concat([df.shift(1), df], axis=1)
	values = df.values
	# specify input and output data
	X, y = values, values[:, 0]
	return X, y
 
# generate sequence
n_timesteps = 10
X, y = generate_data(n_timesteps)
# print sequence
for i in range(n_timesteps):
	print(X[i], '=>', y[i])

[      nan 0.6137393] => nan
[0.6137393  0.68540607] => 0.6137392972071636
[0.68540607 0.37637228] => 0.6854060732497181
[0.37637228 0.0901033 ] => 0.37637228124488553
[0.0901033  0.40802453] => 0.0901033000189041
[0.40802453 0.60029364] => 0.4080245288521507
[0.60029364 0.34499509] => 0.600293644843831
[0.34499509 0.44402892] => 0.34499509463998534
[0.44402892 0.87593653] => 0.44402891960269364
[0.87593653 0.57281337] => 0.8759365282435257


Running this example generates a sequence, converts it to a supervised representation, and prints each X,y pair.

We can see that we have NaN values on the first row.

This is because we do not have a prior observation for the first value in the sequence. We have to fill that space with something.

But we cannot fit a model with NaN inputs.
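
A small sketch of why NaN inputs are a problem: any arithmetic involving NaN yields NaN, so a single missing input contaminates the output of a unit (and, during training, the loss and weight updates). The `neuron` function and its weights below are hypothetical, chosen only to illustrate the propagation.

```python
from math import isnan, tanh

# Hypothetical single unit with made-up weights, for illustration only.
def neuron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return tanh(total)

ok = neuron([0.61, 0.68], [0.5, -0.3], 0.1)
bad = neuron([float('nan'), 0.61], [0.5, -0.3], 0.1)
print(ok, bad)
print(isnan(bad))
```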

## Handling Missing Sequence Data
There are two main ways to handle missing sequence data.

They are to remove rows with missing data and to fill the missing timesteps with another value.

The best approach for handling missing sequence data will depend on your problem and your chosen network configuration. I would recommend exploring each method to see what works best.

### Remove Missing Sequence Data
In the case where we are echoing the observation in the previous timestep, the first row of data does not contain any useful information.

That is, given a first row with an input such as:

In [None]:
[        nan  0.18961404]

and the output:

In [None]:
nan

There is nothing meaningful that can be learned or predicted.

The best option here is to delete this row.

We can do this during the formulation of the sequence as a supervised learning problem by removing all rows that contain a NaN value. Specifically, the [dropna() function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) can be called prior to splitting the data into X and y components.

The complete example is listed below:

In [2]:
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame
 
# generate a sequence of random values
def generate_sequence(n_timesteps):
	return [random() for _ in range(n_timesteps)]
 
# generate data for the lstm
def generate_data(n_timesteps):
	# generate sequence
	sequence = generate_sequence(n_timesteps)
	sequence = array(sequence)
	# create lag
	df = DataFrame(sequence)
	df = concat([df.shift(1), df], axis=1)
	# remove rows with missing values
	df.dropna(inplace=True)
	values = df.values
	# specify input and output data
	X, y = values, values[:, 0]
	return X, y
 
# generate sequence
n_timesteps = 10
X, y = generate_data(n_timesteps)
# print sequence
for i in range(len(X)):
	print(X[i], '=>', y[i])

[0.35052025 0.68190401] => 0.350520247828497
[0.68190401 0.15423214] => 0.6819040087491088
[0.15423214 0.70453253] => 0.1542321378187086
[0.70453253 0.32751965] => 0.7045325349998864
[0.32751965 0.59273306] => 0.32751965121504834
[0.59273306 0.48429259] => 0.5927330613071813
[0.48429259 0.87226429] => 0.48429258539812303
[0.87226429 0.55126978] => 0.8722642899949663
[0.55126978 0.18220127] => 0.5512697765712204


Running the example results in 9 X,y pairs instead of 10, with the first row removed.
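
Before dropping rows, it can be useful to count how many will be lost. The sketch below does this on a short fixed sequence (illustrative values). It uses the non-inplace form of dropna(), which returns a new DataFrame and leaves the original untouched; the tutorial listing uses inplace=True instead.

```python
from pandas import DataFrame, concat

# a short fixed sequence (illustrative values, not from the tutorial)
sequence = [0.1, 0.2, 0.3, 0.4]
df = DataFrame(sequence)
df = concat([df.shift(1), df], axis=1)
# count rows holding at least one NaN before removing them
n_missing = int(df.isna().any(axis=1).sum())
# dropna() without inplace returns a cleaned copy
cleaned = df.dropna()
print('rows before:', len(df))
print('rows with NaN:', n_missing)
print('rows after:', len(cleaned))
```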

### Replace Missing Sequence Data
In the case when the echo problem is configured to echo the observation at the current timestep, then the first row will contain meaningful information.

We can replace all NaN values with a specific value that does not appear naturally in the input, such as -1. To do this, we can use the [fillna() Pandas function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html).

The complete example is listed below:

In [3]:
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame
 
# generate a sequence of random values
def generate_sequence(n_timesteps):
	return [random() for _ in range(n_timesteps)]
 
# generate data for the lstm
def generate_data(n_timesteps):
	# generate sequence
	sequence = generate_sequence(n_timesteps)
	sequence = array(sequence)
	# create lag
	df = DataFrame(sequence)
	df = concat([df.shift(1), df], axis=1)
	# replace missing values with -1
	df.fillna(-1, inplace=True)
	values = df.values
	# specify input and output data
	X, y = values, values[:, 1]
	return X, y
 
# generate sequence
n_timesteps = 10
X, y = generate_data(n_timesteps)
# print sequence
for i in range(len(X)):
	print(X[i], '=>', y[i])

[-1.          0.36248605] => 0.3624860490052806
[0.36248605 0.38425938] => 0.38425938317697184
[0.38425938 0.6159787 ] => 0.6159787016562683
[0.6159787  0.95462465] => 0.9546246513863305
[0.95462465 0.47350256] => 0.4735025625008039
[0.47350256 0.49168172] => 0.49168172348867245
[0.49168172 0.19306156] => 0.1930615572852208
[0.19306156 0.03430595] => 0.03430595388105395
[0.03430595 0.73127246] => 0.731272460984132
[0.73127246 0.77012933] => 0.7701293327697278


Running the example, we can see that the NaN value in the first column of the first row was replaced with a -1 value.
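
The marker only works if it never occurs as a real observation. A minimal sketch of such a check is shown below; `check_marker` is a hypothetical helper, and the array values are illustrative. It should be run on the data before fillna() is applied.

```python
from numpy import array, isnan

# Hypothetical helper: True if `marker` is safe to use as a missing flag.
def check_marker(values, marker=-1.0):
    # look only at real observations, ignoring existing NaN entries
    observed = values[~isnan(values)]
    return not (observed == marker).any()

values = array([[float('nan'), 0.36], [0.36, 0.38], [0.38, 0.62]])
print(check_marker(values, marker=-1.0))
print(check_marker(values, marker=0.36))
```

Because random() only returns values in the range [0, 1), -1 is a safe choice for the data in this tutorial.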

## Learning with Missing Sequence Values
There are two main options when learning a sequence prediction problem with marked missing values.

The problem can be modeled as-is and we can encourage the model to learn that a specific value means “missing.” Alternately, the special missing values can be masked and explicitly excluded from the prediction calculations.

We will take a look at both cases for the contrived “echo the current observation” problem with two inputs.

### Learning Missing Values
We can develop an LSTM for the prediction problem.

The input is defined by 2 timesteps with 1 feature. We define a small LSTM with 5 memory units in the first hidden layer, followed by a single-unit output layer with a linear activation function.
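
The LSTM layer expects three-dimensional input of the form [samples, timesteps, features], so the two-column array must be reshaped before fitting. The sketch below shows the reshape step used in the complete listing, applied to a small illustrative array.

```python
from numpy import array

# three rows of [X(t-1), X(t)] with the first lag marked as -1 (illustrative)
values = array([[-1.0, 0.36], [0.36, 0.38], [0.38, 0.62]])
X, y = values, values[:, 1]
# [samples, timesteps, features]: each row becomes 2 timesteps of 1 feature
X = X.reshape(len(X), 2, 1)
# one output value per sample
y = y.reshape(len(y), 1)
print(X.shape, y.shape)
```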

The network will be fit using the mean squared error loss function and the efficient Adam optimization algorithm with its default configuration.

In [None]:
# define model
model = Sequential()
model.add(LSTM(5, input_shape=(2, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

To ensure that the model learns a generalized solution to the problem, that is, to always return the input as output (y(t) == X(t)), we will generate a new random sequence every epoch. The network will be fit for 500 epochs and updates will be performed after each sample in each sequence (batch_size=1).

In [None]:
# fit model
for i in range(500):
	X, y = generate_data(n_timesteps)
	model.fit(X, y, epochs=1, batch_size=1, verbose=2)

Once fit, another random sequence will be generated and the predictions from the model will be compared to the expected values. This will provide a concrete idea of the skill of the model.

In [None]:
# evaluate model on new data
X, y = generate_data(n_timesteps)
yhat = model.predict(X)
for i in range(len(X)):
	print('Expected', y[i,0], 'Predicted', yhat[i,0])

Tying all of this together, the complete code listing is provided below.

In [4]:
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
 
# generate a sequence of random values
def generate_sequence(n_timesteps):
	return [random() for _ in range(n_timesteps)]
 
# generate data for the lstm
def generate_data(n_timesteps):
	# generate sequence
	sequence = generate_sequence(n_timesteps)
	sequence = array(sequence)
	# create lag
	df = DataFrame(sequence)
	df = concat([df.shift(1), df], axis=1)
	# replace missing values with -1
	df.fillna(-1, inplace=True)
	values = df.values
	# specify input and output data
	X, y = values, values[:, 1]
	# reshape
	X = X.reshape(len(X), 2, 1)
	y = y.reshape(len(y), 1)
	return X, y
 
n_timesteps = 10
# define model
model = Sequential()
model.add(LSTM(5, input_shape=(2, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit model
for i in range(500):
	X, y = generate_data(n_timesteps)
	model.fit(X, y, epochs=1, batch_size=1, verbose=2)
# evaluate model on new data
X, y = generate_data(n_timesteps)
yhat = model.predict(X)
for i in range(len(X)):
	print('Expected', y[i,0], 'Predicted', yhat[i,0])

Using TensorFlow backend.


Epoch 1/1
 - 2s - loss: 0.1293
Epoch 1/1
 - 0s - loss: 0.2622
Epoch 1/1
 - 0s - loss: 0.3545
Epoch 1/1
 - 0s - loss: 0.1600
Epoch 1/1
 - 0s - loss: 0.2687
Epoch 1/1
 - 0s - loss: 0.1757
Epoch 1/1
 - 0s - loss: 0.2420
Epoch 1/1
 - 0s - loss: 0.2335
Epoch 1/1
 - 0s - loss: 0.1757
Epoch 1/1
 - 0s - loss: 0.1948
Epoch 1/1
 - 0s - loss: 0.1191
Epoch 1/1
 - 0s - loss: 0.0661
Epoch 1/1
 - 0s - loss: 0.1376
Epoch 1/1
 - 0s - loss: 0.0810
Epoch 1/1
 - 0s - loss: 0.1399
Epoch 1/1
 - 0s - loss: 0.0984
Epoch 1/1
 - 0s - loss: 0.1024
Epoch 1/1
 - 0s - loss: 0.1081
Epoch 1/1
 - 0s - loss: 0.1098
Epoch 1/1
 - 0s - loss: 0.1363
Epoch 1/1
 - 0s - loss: 0.0937
Epoch 1/1
 - 0s - loss: 0.0715
Epoch 1/1
 - 0s - loss: 0.0771
Epoch 1/1
 - 0s - loss: 0.0514
Epoch 1/1
 - 0s - loss: 0.0716
Epoch 1/1
 - 0s - loss: 0.0492
Epoch 1/1
 - 0s - loss: 0.0805
Epoch 1/1
 - 0s - loss: 0.0391
Epoch 1/1
 - 0s - loss: 0.0692
Epoch 1/1
 - 0s - loss: 0.0583
Epoch 1/1
 - 0s - loss: 0.0436
Epoch 1/1
 - 0s - loss: 0.0500
...
 - 0s - loss: 0.0034
Epoch 1/1
 - 0s - loss: 0.0016
Epoch 1/1
 - 0s - loss: 0.0023
Epoch 1/1
 - 0s - loss: 0.0020
Epoch 1/1
 - 0s - loss: 5.9142e-04
Epoch 1/1
 - 0s - loss: 0.0019
Epoch 1/1
 - 0s - loss: 5.2425e-04
Epoch 1/1
 - 0s - loss: 0.0015
Epoch 1/1
 - 0s - loss: 4.6755e-04
Epoch 1/1
 - 0s - loss: 5.1827e-04
Epoch 1/1
 - 0s - loss: 5.2550e-04
Epoch 1/1
 - 0s - loss: 0.0011
Epoch 1/1
 - 0s - loss: 4.5286e-04
Epoch 1/1
 - 0s - loss: 0.0016
Epoch 1/1
 - 0s - loss: 6.2762e-04
Epoch 1/1
 - 0s - loss: 3.1602e-04
Epoch 1/1
 - 0s - loss: 7.5916e-04
Epoch 1/1
 - 0s - loss: 4.4320e-04
Epoch 1/1
 - 0s - loss: 6.3890e-04
Epoch 1/1
 - 0s - loss: 3.8376e-04
Epoch 1/1
 - 0s - loss: 3.9046e-04
Epoch 1/1
 - 0s - loss: 0.0013
Epoch 1/1
 - 0s - loss: 8.9811e-04
Epoch 1/1
 - 0s - loss: 2.4954e-04
Epoch 1/1
 - 0s - loss: 4.1408e-04
Epoch 1/1
 - 0s - loss: 4.1901e-04
Epoch 1/1
 - 0s - loss: 3.1960e-04
Epoch 1/1
 - 0s - loss: 3.4712e-04
Epoch 1/1
 - 0s - loss: 4.6818e-04
...
Epoch 1/1
 - 0s - loss: 6.0519e-05
Expected 0.8685285196762417 Predicted 0.8194692
Expected 0.1924922354886558 Predicted 0.20446332
Expected 0.845564165498323 Predicted 0.8630018
Expected 0.3894580702942654 Predicted 0.38488454
Expected 0.37308305596938485 Predicted 0.3582573
Expected 0.23948442298084527 Predicted 0.22121733
Expected 0.8159167980950672 Predicted 0.8342422
Expected 0.6328072753645948 Predicted 0.64054024
Expected 0.049355718309846086 Predicted 0.06680457
Expected 0.9487940714467635 Predicted 0.9472947


Running the example prints the loss each epoch and compares the expected vs. the predicted output at the end of a run for one sequence.

Reviewing the final predictions, we can see that the network learned the problem and predicted “good enough” outputs, even in the presence of missing values.
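
One way to put a number on "good enough" is the root mean squared error (RMSE) between the expected and predicted values. The sketch below applies it to the ten pairs printed above; the `rmse` helper is an addition for illustration and is not part of the tutorial listing.

```python
from math import sqrt

# expected and predicted values copied from the output printed above
expected = [0.8685285196762417, 0.1924922354886558, 0.845564165498323,
            0.3894580702942654, 0.37308305596938485, 0.23948442298084527,
            0.8159167980950672, 0.6328072753645948, 0.049355718309846086,
            0.9487940714467635]
predicted = [0.8194692, 0.20446332, 0.8630018, 0.38488454, 0.3582573,
             0.22121733, 0.8342422, 0.64054024, 0.06680457, 0.9472947]

# Hypothetical helper: root mean squared error between two lists.
def rmse(y_true, y_pred):
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))

score = rmse(expected, predicted)
print('RMSE: %.4f' % score)
```

A single score from one sequence is still only a rough indication; averaging over many fresh sequences would give a more reliable estimate.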

You could experiment further with this example and mark 50% of the t-1 observations for a given sequence as -1 and see how that affects the skill of the model over time.
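
A minimal sketch of that experiment's data step is shown below. The `mark_missing` helper, its parameters, and the fixed seed are assumptions made for illustration. It operates on the two-column X array before the reshape to three dimensions, and only touches the t-1 column.

```python
from numpy import array
from numpy.random import RandomState

# Hypothetical helper: mark a fraction of the t-1 observations as missing.
def mark_missing(X, fraction=0.5, marker=-1.0, seed=1):
    X = X.copy()  # leave the caller's array unchanged
    rng = RandomState(seed)
    n_marked = int(len(X) * fraction)
    # choose distinct rows at random
    rows = rng.choice(len(X), size=n_marked, replace=False)
    # column 0 holds X(t-1); column 1, X(t), is left as-is
    X[rows, 0] = marker
    return X

X = array([[0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.4, 0.5]])
X_marked = mark_missing(X)
print(X_marked)
```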

### Masking Missing Values
The marked missing input values can be masked from all calculations in the network.

We can do this by using a [Masking layer](https://keras.io/layers/core/#masking) as the first layer to the network.

When defining the layer, we can specify which value in the input to mask. If all features for a timestep contain the masked value, then the whole timestep will be excluded from calculations.
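
The rule can be sketched in NumPy as follows. This mirrors the behavior described above (a timestep is skipped only when every feature equals the mask value); it is an illustration of the rule, not the Keras implementation itself, and `compute_mask` here is a stand-alone hypothetical function.

```python
from numpy import array

# Hypothetical stand-alone function illustrating the masking rule.
def compute_mask(X, mask_value=-1.0):
    # True = keep the timestep, False = skip it;
    # a timestep is kept if any of its features differs from mask_value
    return (X != mask_value).any(axis=-1)

# 2 samples, 2 timesteps, 1 feature: only the first timestep is masked
X = array([[[-1.0], [0.36]], [[0.36], [0.38]]])
mask = compute_mask(X)
print(mask)

# 1 sample, 2 timesteps, 2 features: a partially marked timestep is kept
partial = compute_mask(array([[[-1.0, 0.5], [-1.0, -1.0]]]))
print(partial)
```

With a single feature per timestep, as in this tutorial, every -1 input corresponds to a fully masked timestep.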

This provides a middle ground between excluding the row completely and forcing the network to learn the impact of marked missing values.

Because the Masking layer is the first in the network, it must specify the expected shape of the input, as follows:

In [None]:
model.add(Masking(mask_value=-1, input_shape=(2, 1)))

We can tie all of this together and re-run the example. The complete code listing is provided below.

In [5]:
from random import random
from numpy import array
from pandas import concat
from pandas import DataFrame
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Masking
 
# generate a sequence of random values
def generate_sequence(n_timesteps):
	return [random() for _ in range(n_timesteps)]
 
# generate data for the lstm
def generate_data(n_timesteps):
	# generate sequence
	sequence = generate_sequence(n_timesteps)
	sequence = array(sequence)
	# create lag
	df = DataFrame(sequence)
	df = concat([df.shift(1), df], axis=1)
	# replace missing values with -1
	df.fillna(-1, inplace=True)
	values = df.values
	# specify input and output data
	X, y = values, values[:, 1]
	# reshape
	X = X.reshape(len(X), 2, 1)
	y = y.reshape(len(y), 1)
	return X, y
 
n_timesteps = 10
# define model
model = Sequential()
model.add(Masking(mask_value=-1, input_shape=(2, 1)))
model.add(LSTM(5))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit model
for i in range(500):
	X, y = generate_data(n_timesteps)
	model.fit(X, y, epochs=1, batch_size=1, verbose=2)
# evaluate model on new data
X, y = generate_data(n_timesteps)
yhat = model.predict(X)
for i in range(len(X)):
	print('Expected', y[i,0], 'Predicted', yhat[i,0])

Epoch 1/1
 - 2s - loss: 0.4598
Epoch 1/1
 - 0s - loss: 0.6151
Epoch 1/1
 - 0s - loss: 0.3594
Epoch 1/1
 - 0s - loss: 0.4661
Epoch 1/1
 - 0s - loss: 0.3130
Epoch 1/1
 - 0s - loss: 0.1972
Epoch 1/1
 - 0s - loss: 0.4282
Epoch 1/1
 - 0s - loss: 0.4372
Epoch 1/1
 - 0s - loss: 0.2935
Epoch 1/1
 - 0s - loss: 0.3822
Epoch 1/1
 - 0s - loss: 0.1684
Epoch 1/1
 - 0s - loss: 0.2768
Epoch 1/1
 - 0s - loss: 0.3674
Epoch 1/1
 - 0s - loss: 0.2610
Epoch 1/1
 - 0s - loss: 0.1088
Epoch 1/1
 - 0s - loss: 0.1837
Epoch 1/1
 - 0s - loss: 0.0548
Epoch 1/1
 - 0s - loss: 0.2131
Epoch 1/1
 - 0s - loss: 0.3371
Epoch 1/1
 - 0s - loss: 0.0605
Epoch 1/1
 - 0s - loss: 0.1199
Epoch 1/1
 - 0s - loss: 0.1565
Epoch 1/1
 - 0s - loss: 0.1590
Epoch 1/1
 - 0s - loss: 0.1138
Epoch 1/1
 - 0s - loss: 0.1550
Epoch 1/1
 - 0s - loss: 0.0573
Epoch 1/1
 - 0s - loss: 0.0958
Epoch 1/1
 - 0s - loss: 0.1229
Epoch 1/1
 - 0s - loss: 0.1011
Epoch 1/1
 - 0s - loss: 0.1095
Epoch 1/1
 - 0s - loss: 0.0436
Epoch 1/1
 - 0s - loss: 0.0769
...
 - 0s - loss: 0.0162
Epoch 1/1
 - 0s - loss: 0.0138
Epoch 1/1
 - 0s - loss: 0.0153
Epoch 1/1
 - 0s - loss: 0.0183
Epoch 1/1
 - 0s - loss: 0.0097
Epoch 1/1
 - 0s - loss: 0.0193
Epoch 1/1
 - 0s - loss: 0.0104
Epoch 1/1
 - 0s - loss: 0.0136
Epoch 1/1
 - 0s - loss: 0.0254
Epoch 1/1
 - 0s - loss: 0.0158
Epoch 1/1
 - 0s - loss: 0.0162
Epoch 1/1
 - 0s - loss: 0.0140
Epoch 1/1
 - 0s - loss: 0.0100
Epoch 1/1
 - 0s - loss: 0.0208
Epoch 1/1
 - 0s - loss: 0.0133
Epoch 1/1
 - 0s - loss: 0.0151
Epoch 1/1
 - 0s - loss: 0.0051
Epoch 1/1
 - 0s - loss: 0.0076
Epoch 1/1
 - 0s - loss: 0.0218
Epoch 1/1
 - 0s - loss: 0.0080
Epoch 1/1
 - 0s - loss: 0.0123
Epoch 1/1
 - 0s - loss: 0.0090
Epoch 1/1
 - 0s - loss: 0.0175
Epoch 1/1
 - 0s - loss: 0.0061
Epoch 1/1
 - 0s - loss: 0.0172
Epoch 1/1
 - 0s - loss: 0.0116
Epoch 1/1
 - 0s - loss: 0.0070
Epoch 1/1
 - 0s - loss: 0.0115
Epoch 1/1
 - 0s - loss: 0.0076
Epoch 1/1
 - 0s - loss: 0.0116
Epoch 1/1
 - 0s - loss: 0.0094
Epoch 1/1
 - 0s - loss: 0.0136
...

Again, the loss is printed each epoch and the predictions are compared to expected values for a final sequence.

Again, the predictions appear good enough to a few decimal places.

### Which Method to Choose?
These one-off experiments are not sufficient to evaluate what would work best on the simple echo sequence prediction problem.

They do provide templates that you can use on your own problems.

I would encourage you to explore the 3 different ways of handling missing values in your sequence prediction problems. They are:

* Remove rows with missing values.
* Mark missing values and let the model learn their meaning.
* Mask missing values and learn without them.

Try each approach on your sequence prediction problem and double down on what appears to work best.

## Summary
It is common to have missing values in sequence prediction problems if your sequences have variable lengths.

In this tutorial, you discovered how to handle missing data in sequence prediction problems in Python with Keras.

Specifically, you learned:

* How to remove rows that contain a missing value.
* How to mark missing values and force the model to learn their meaning.
* How to mask missing values to exclude them from calculations in the model.