# Trajectory Prediction with LSTM
### CS230 - Deep Learning -  Final Submission. 
#### Mitchell Dawson, Benjamin Goeing, Tyler Hughes.  

## Project Introduction
In this project, we train an RNN to predict the trajectories of objects (pedestrians, bikers, cars, etc.) as they move through a scene and interact with one another. Our model could be used to predict the movements of crowds of people and vehicles given an overhead image of a scene. This may have applications in making public spaces less susceptible to crowding or accidents, improving control of autonomous vehicles, or video surveillance.  We provide several sources of additional input to the RNN to improve its performance: information about the tracked subject's location and trajectory, the scene layout, and the presence of other people.


## Dataset

We use the Stanford Drone Dataset (http://cvgl.stanford.edu/projects/uav_data/), which contains a large number of overhead images of crowded spaces on the Stanford campus. Here are some examples of drone footage from this dataset:

<img src="images/bookstore.jpg?raw=true" width="250"/>  <img src="images/deathCircle.jpg?raw=true" width="250"/>   

Each scene in the dataset also comes with CSV files whose rows have the form (f,o,x,y).  Here 'f' is the frame number, 'o' is the unique object identifier (which person is being tracked), and x,y are the unnormalized x and y coordinates.

## Module Imports
Our LSTM model is built with PyTorch.

We import helper modules for processing the dataset and a class implementing the LSTM trajectory predictor.


In [1]:
# Module import (add here as necessary)
import sys
import os
import numpy as np
import torch
import torch.utils.data

from simple_processing import load_simple_array
from lstm import TrajectoryPredictor
from linear_error import compute_linear_error

## Define Constants
We have to specify the number of frames to observe before predicting.
For training the LSTM, we must also specify the batch size and the number of epochs.

In [2]:
# constants
Nf = 10         # number of frames to observe before making prediction
batch_size = 4  # TrajectoryPredictor training batch size
num_epochs = 10 # number of training epochs

## Loading and Preprocessing Training Data
We will load in and process the drone dataset.

For now, we will just load (x,y) pairs of positions at a series of frames for each person in the scene.  We will use both the scene information and the presence of other people in the frame later.  We normalize the (x,y) positions to lie between -1 and 1.
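The normalization itself happens inside `load_simple_array`, which is not shown in this notebook. As a rough sketch of one way to map pixel coordinates into the range -1 to 1, assuming the scene width and height in pixels are known (the function name `normalize_xy` and the example image size are hypothetical and are not taken from `simple_processing.py`):

```python
import numpy as np


def normalize_xy(xy, width, height):
    """Map pixel coordinates in [0, width] x [0, height] to [-1, 1] x [-1, 1].

    Hypothetical helper for illustration only.
    """
    xy = np.asarray(xy, dtype=np.float64)
    scale = np.array([width, height], dtype=np.float64)
    return 2.0 * xy / scale - 1.0


# Example with an assumed 1424 x 1088 pixel scene: corners and center.
pixels = np.array([[0.0, 0.0], [712.0, 544.0], [1424.0, 1088.0]])
normalized = normalize_xy(pixels, width=1424, height=1088)
```

With this mapping the image center lands at (0, 0) and opposite corners land at (-1, -1) and (1, 1).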

The data is loaded into a PyTorch dataset for feeding into the LSTM.

In [3]:
def load_split(annotation_dir):
    # Collect the trajectories from every .txt annotation file in a directory.
    trajectories = []
    for filename in os.listdir(annotation_dir):
        if not filename.endswith('.txt'):
            continue
        trajectories += load_simple_array(annotation_dir + filename)
    return trajectories


def load_data():
    # Load the train, dev and test splits from their annotation directories.
    train_trajectories = load_split('train/stanford/annotations/')
    dev_trajectories = load_split('dev/stanford/annotations/')
    test_trajectories = load_split('test/stanford/annotations/')

    return train_trajectories, dev_trajectories, test_trajectories


train_trajectories, dev_trajectories, test_trajectories = load_data()

np_train_trajectories = np.stack(train_trajectories)

np_train_data = np_train_trajectories[:,:Nf,:]
np_train_target = np_train_trajectories[:,Nf:,:]

train_data_tensor = torch.Tensor(np_train_data)
train_target_tensor = torch.Tensor(np_train_target)

train_dataset = torch.utils.data.TensorDataset(train_data_tensor, train_target_tensor)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size = batch_size)

For subsequent sections, we have preloaded and processed the dataset to include the extra information we wish to supply to the LSTM. We do not include that code in this notebook, but it is all in the file 'simple_processing.py'.

Here we print out the trajectory of the first person in the training set. The first $N_f$ ($x$,$y$) pairs are for observation. The second $N_f$ ($x$,$y$) pairs are those we want to predict.

In [4]:
print(train_dataset[0])

(
 0.0120  0.8285
 0.0120  0.8285
 0.0134  0.8382
 0.0219  0.8382
 0.0318  0.8405
 0.0403  0.8424
 0.0502  0.8442
 0.0598  0.8465
 0.0697  0.8507
 0.0792  0.8526
[torch.FloatTensor of size 10x2]
, 
 0.0905  0.8526
 0.1011  0.8526
 0.1124  0.8549
 0.1234  0.8563
 0.1347  0.8581
 0.1457  0.8600
 0.1563  0.8600
 0.1676  0.8600
 0.1772  0.8563
 0.1885  0.8535
[torch.FloatTensor of size 10x2]
)


## Training the LSTM
Now that we have our data loaded into a PyTorch dataset and data loader, we are ready to train the RNN.  We have opted for an LSTM model, initially with a hidden state size of 2, although we will expand on this model later.

<img src="images/LSTM_xy.png?raw=true" width="500">   

We have implemented a TrajectoryPredictor() class that trains an LSTM on the PyTorch data loader.
TrajectoryPredictor takes as initial parameters the input dimension (2 here because we have x,y inputs), the output dimension (2 here because we predict x,y outputs), and the batch size.

We commented out the code to initialize a trajectory predictor and train it.  Instead, we will load a pre-trained model that we trained on FloydHub.

In [None]:
# p = TrajectoryPredictor(2, 2, batch_size)
# p.train(train_loader, num_epochs)

# LOAD ONLY XY MODEL HERE!!!
# p_xy = ...

epoch 1. mean loss: 1.82455535136
epoch 2. mean loss: 0.546421991492
epoch 3. mean loss: 0.386153386321
epoch 4. mean loss: 0.323843483411
epoch 5. mean loss: 0.289107988643
epoch 6. mean loss: 0.266656400561
epoch 7. mean loss: 0.250179112599
epoch 8. mean loss: 0.236991802132


## Model Prediction
With our model trained, we predict on some validation trajectories to get a sense of how well our LSTM performs.  

<img src="images/test_predict.png?raw=true" width="150">   

We use a loss function of

$$\mathcal{L} = \frac{1}{MN}\sum_{i=1}^M \sum_{j=1}^N \sqrt{\big(x^{(i)}_j - \bar{x}^{(i)}_j\big)^2 + \big(y^{(i)}_j - \bar{y}^{(i)}_j\big)^2}  $$

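where $i$ is the trajectory number and $j$ is the frame number, $x$ and $y$ are the predicted coordinates, and $\bar{x}$ and $\bar{y}$ are the true coordinates.  $M$ is the number of trajectories and $N$ is the number of predicted frames per trajectory.

This loss function gives us the average displacement error: the Euclidean distance between the predicted and true positions, averaged over every predicted frame of every trajectory.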
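As a minimal sketch of this loss in NumPy (the function name `average_displacement_error` is hypothetical; the loss used inside `TrajectoryPredictor` is not shown in this notebook and may be implemented differently):

```python
import numpy as np


def average_displacement_error(pred, true):
    """Mean Euclidean distance between predicted and true positions.

    pred, true: arrays of shape (M, N, 2) holding (x, y) for M trajectories
    and N predicted frames each.
    """
    pred = np.asarray(pred, dtype=np.float64)
    true = np.asarray(true, dtype=np.float64)
    distances = np.linalg.norm(pred - true, axis=-1)  # shape (M, N)
    return distances.mean()  # average over all M * N frames


# Every prediction is offset from the truth by (3, 4), a distance of 5.
true = np.zeros((2, 3, 2))
pred = np.zeros((2, 3, 2))
pred[..., 0] = 3.0
pred[..., 1] = 4.0
ade = average_displacement_error(pred, true)
```

Because the positions are normalized, the resulting error is in normalized units rather than pixels.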

In [6]:
np_dev_trajectories = np.stack(dev_trajectories)

np_dev_data = np_dev_trajectories[:,:Nf,:]
np_dev_target = np_dev_trajectories[:,Nf:,:]

dev_data_tensor = torch.Tensor(np_dev_data)
dev_target_tensor = torch.Tensor(np_dev_target)

dev_dataset = torch.utils.data.TensorDataset(dev_data_tensor, dev_target_tensor)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size = batch_size)

In [7]:
p_xy.test(dev_loader)

mean loss: 0.534271395503


We compare with the predictions of a linear model on the same trajectory, where the (x,y) coordinates of a future time frame are estimated by extrapolating from the person's velocity at the final observation time frame.

In [4]:
compute_linear_error(dev_trajectories, Nf)

mean loss: 0.128018440883
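The `compute_linear_error` helper lives in `linear_error.py` and is not shown here. A minimal sketch of the constant-velocity extrapolation described above might look like the following; the function name `linear_extrapolate` is hypothetical, and we assume the velocity is taken from the last two observed frames:

```python
import numpy as np


def linear_extrapolate(observed, num_future):
    """Continue each trajectory in a straight line.

    observed: array of shape (M, Nf, 2) with the observed (x, y) positions.
    Returns an array of shape (M, num_future, 2).
    """
    observed = np.asarray(observed, dtype=np.float64)
    last = observed[:, -1, :]                       # final observed position
    step = observed[:, -1, :] - observed[:, -2, :]  # displacement per frame
    k = np.arange(1, num_future + 1).reshape(1, -1, 1)
    return last[:, None, :] + k * step[:, None, :]


# A subject moving in a straight line is extrapolated exactly.
t = np.arange(6, dtype=np.float64)
track = np.stack([0.1 * t, 0.05 * t], axis=-1)[None]  # shape (1, 6, 2)
future = linear_extrapolate(track[:, :3], num_future=3)
```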


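The LSTM's mean loss on the dev set (0.534) is roughly four times that of the linear baseline (0.128), so clearly we have to include more information in order to improve this model.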

## Velocity
We suspect that with just the (x,y) data, once the model starts producing errors in the trajectory prediction, these errors begin to compound because the predicted subject walks off from the correct path.  In order to bias our model against this, we decided to add in the subject's velocity at the final observed time step (it is not updated after prediction begins).  This should have the effect of producing predictions that continue more linearly in the direction of the observed trajectory.

We compute the final velocity with a simple finite difference formula, where $N$ is the final observed frame and $\Delta t$ is the time interval between frames (the inverse of the frame rate):

$$v_x = \frac{x_N - x_{N-1}}{\Delta t}$$


$$v_y = \frac{y_N - y_{N-1}}{\Delta t}$$
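A minimal sketch of this finite difference in NumPy follows. The function names and the default `dt=1.0` are assumptions, as is the choice to repeat the same final velocity at every observed frame to form a 4-dimensional input; the actual preprocessing in `simple_processing.py` is not shown in this notebook.

```python
import numpy as np


def final_velocity(observed, dt=1.0):
    """Backward difference over the last two observed frames.

    observed: array of shape (M, Nf, 2). Returns (v_x, v_y), shape (M, 2).
    """
    observed = np.asarray(observed, dtype=np.float64)
    return (observed[:, -1, :] - observed[:, -2, :]) / dt


def append_final_velocity(observed, dt=1.0):
    """Return (M, Nf, 4) features: x, y, v_x, v_y.

    Assumption: the final velocity is held fixed and copied to every frame.
    """
    observed = np.asarray(observed, dtype=np.float64)
    v = final_velocity(observed, dt)
    v_tiled = np.repeat(v[:, None, :], observed.shape[1], axis=1)
    return np.concatenate([observed, v_tiled], axis=-1)


observed = np.array([[[0.0, 0.0], [1.0, 2.0], [3.0, 5.0]]])
velocity = final_velocity(observed, dt=0.5)
features = append_final_velocity(observed, dt=0.5)
```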

Let us load our pretrained model and predict on the dev set.

In [None]:
# LOAD VELOCITY DATA IN:
# p_xyv = load_v
# PREDICT ON DEV SET

While this model performed better than the one using just $x$ and $y$, we still have some way to go.

## Dense Layer
In order to add dimensionality to our LSTM and also allow for the model to learn an embedding for the input vectors, we added a dense layer between our $x$,$y$,$\vec{v}$ data and our LSTM input:

<img src="images/LSTM_xyv_dense.png?raw=true" width="500">   

This dense layer takes the 4-dimensional input and transforms it to an $LSTM_{size}$-dimensional input.  We chose $LSTM_{size}$ to be 20 after trying several values.
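As a rough PyTorch sketch of this arrangement (not the actual `TrajectoryPredictor` code, which lives in `lstm.py`): the class name `DenseLSTM`, the ReLU activation, the LSTM hidden size, and the linear readout back to ($x$,$y$) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class DenseLSTM(nn.Module):
    """Dense embedding of (x, y, v_x, v_y) followed by an LSTM (hypothetical)."""

    def __init__(self, input_dim=4, lstm_size=20, output_dim=2):
        super().__init__()
        # Dense layer: 4-dimensional input -> lstm_size-dimensional embedding.
        self.embed = nn.Linear(input_dim, lstm_size)
        # Assumed hidden size equal to lstm_size.
        self.lstm = nn.LSTM(lstm_size, lstm_size, batch_first=True)
        # Assumed readout from the hidden state to an (x, y) prediction.
        self.readout = nn.Linear(lstm_size, output_dim)

    def forward(self, x):
        embedded = torch.relu(self.embed(x))  # (batch, frames, lstm_size)
        hidden, _ = self.lstm(embedded)       # (batch, frames, lstm_size)
        return self.readout(hidden)           # (batch, frames, output_dim)


model = DenseLSTM()
batch = torch.zeros(4, 10, 4)  # batch_size = 4, Nf = 10, 4 input features
out = model(batch)
```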

In [None]:
# LOAD VELOCITY+DENSE DATA IN:
# p_xyv_dense = load_v
# PREDICT ON DEV SET

Adding the dense layer slightly improved performance of the model, but we still only have information from a single person's trajectory. Therefore, our model is not yet able to incorporate important scene or crowd information.  We add this in the next sections.

## Occupancy Grid
The input training data has a set of ($x$,$y$) pairs at different frames, but also contains information about all of the subjects in the scene at a given frame.  

Therefore, we did some preprocessing of the initial dataset to give, for each trajectory, not only ($x$,$y$) data, but also an occupancy grid representing all of the other people's locations in the scene at a given time.

The form of this occupancy grid was an $N_o\times N_o$ numpy array with a $0$ where there is no person and $1$ where there is a person in the scene.  We fed this numpy array directly into the LSTM along with the $x$, $y$ and velocity information.
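A minimal sketch of how such a grid could be built from normalized positions follows. The function name `occupancy_grid`, the row/column convention, and the clipping at the scene boundary are assumptions; the actual preprocessing is in `simple_processing.py`.

```python
import numpy as np


def occupancy_grid(other_xy, grid_size):
    """Binary grid marking the cells occupied by other people.

    other_xy: array of shape (K, 2) with positions normalized to [-1, 1].
    Returns a (grid_size, grid_size) array; rows index y, columns index x.
    """
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    other_xy = np.asarray(other_xy, dtype=np.float64).reshape(-1, 2)
    if other_xy.shape[0] == 0:
        return grid  # nobody else in the frame
    # Map [-1, 1] to cell indices [0, grid_size - 1].
    cells = np.floor((other_xy + 1.0) / 2.0 * grid_size).astype(int)
    cells = np.clip(cells, 0, grid_size - 1)
    grid[cells[:, 1], cells[:, 0]] = 1.0
    return grid


others = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
grid = occupancy_grid(others, grid_size=4)
```

Note that a grid with $N_o^2$ cells typically contains only a handful of ones, so the input is very sparse for large $N_o$.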

<img src="images/LSTM_others.png?raw=true" width="500">   


In [None]:
# LOAD OTHERS DATA IN:
# p_others = load_v
# PREDICT ON DEV SET

Inputting the data in this form seemed to have only a marginal effect on the performance of the model.  It's possible that the numpy array was too large and the LSTMs had trouble inferring any relevant information from this sparse input data.  We will revisit this issue after discussing incorporating scene information.

## Image Segmentation / Scene Information

Our dataset contains several images of the scene from overhead.  We figure that the characteristics of the scene will have a dramatic effect on how people walk around and interact with the scene.  Therefore, we wanted to feed this information to our LSTM in a compact representation.  

To make things easier for our model to learn, we first needed to abstract away a lot of the details in the images.  We used MIT's LabelMe tool (http://labelme.csail.mit.edu/Release3.0/) to manually segment each image and assign each pixel to one of four classes:
- a) road
- b) sidewalk
- c) grass
- d) inaccessible (describing objects such as building walls, trees etc.) 

We then wrote a MATLAB script to make sure there were no overlaps or unlabelled pixels.

Here is an example of the scene labeling and segmentation process:

<img src="images/imageSeg.png?raw=true" width="250">

## Convolutional Neural Network Approach

In the previous section, we had issues feeding the occupancy grid directly into our LSTM.  This was likely because of its large size compared to the relevant information contained in the array.  

Therefore, we decided that the best approach would be to feed any further image/scene data into the LSTM after being processed through a CNN.  

We used a pretrained network, AlexNet (https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), as our CNN, since we decided that it would be too computationally intensive to train the CNN kernels ourselves.

In addition to the segmented background image and the occupancy grid, we also fed the CNN a numpy array representing where the tracked person is in the scene.  This was to ensure that the spatial relationship between the tracked person and this other information was encoded in the same way by the CNN.
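The notebook does not state how these three arrays are combined before they reach the CNN. Since AlexNet takes 3-channel images, one plausible arrangement is to place each array in its own channel. The sketch below illustrates that assumption only; the function name `stack_cnn_input`, the shared resolution of the arrays, and the one-pixel marker for the tracked person are all hypothetical.

```python
import numpy as np


def stack_cnn_input(segmentation, occupancy, tracked_xy):
    """Stack three same-sized 2D arrays into a (3, H, W) array.

    segmentation: (H, W) array of class indices.
    occupancy:    (H, W) binary array of other people's locations.
    tracked_xy:   (x, y) of the tracked person, normalized to [-1, 1].
    """
    segmentation = np.asarray(segmentation, dtype=np.float32)
    occupancy = np.asarray(occupancy, dtype=np.float32)
    if segmentation.shape != occupancy.shape:
        raise ValueError("segmentation and occupancy must share a shape")
    height, width = segmentation.shape
    # Mark the tracked person's cell with a single 1.
    col = min(max(int((tracked_xy[0] + 1.0) / 2.0 * width), 0), width - 1)
    row = min(max(int((tracked_xy[1] + 1.0) / 2.0 * height), 0), height - 1)
    tracked = np.zeros((height, width), dtype=np.float32)
    tracked[row, col] = 1.0
    return np.stack([segmentation, occupancy, tracked], axis=0)


segmentation = np.zeros((4, 6))
occupancy = np.zeros((4, 6))
cnn_input = stack_cnn_input(segmentation, occupancy, tracked_xy=(0.0, -1.0))
```

Putting all three arrays on the same pixel grid is what lets the CNN see the tracked person's position relative to the scene and to other people.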

<img src="images/LSTM_full.png?raw=true" width="500">

### more details here

We load this pre-trained model from FloydHub and test it on the dev set:

In [None]:
# LOAD FULL DATA IN:
# p_full = load_v
# PREDICT ON DEV SET

As we can see, this model performs
### How does it perform?

## Discussion

"The notebook should at least contain a thorough explanation of your model, your dataset, your training/validating/testing process, your challenges, an explanation about the hyperparameters, optimization, regularization you choose, the performance of your algorithm, error analysis, some thoughts on future works,"
 
#### Challenges
What are some challenges

#### Error Analysis
Compare our errors with the Stanford vision lab's results.

#### Future Works
We think it might be useful for future work to explore different CNN architectures for encoding the scene and occupancy data.  We used a pre-trained model because our previous sections were already very computationally intensive, but perhaps training the CNN in conjunction with the LSTM would give better results.  

Also, since our segmented image array elements are just indices corresponding to the scene 'class' of each pixel, there could be several approaches to better use this data.  For example, we could construct separate binary arrays for each of the classes (for example, a separate 'inaccessible' array to feed to the CNN).  Unfortunately, since our training and setup took longer than expected, we did not have enough time to work through these ideas, but we believe that some tweaking could lead to much better results.
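As a sketch of the per-class idea, the index array could be split into one binary mask per class as follows. This is untested on our data; the function name `split_into_class_masks` and the mapping from index to class name are assumptions.

```python
import numpy as np

# Assumed index order; the real label encoding may differ.
CLASS_NAMES = ("road", "sidewalk", "grass", "inaccessible")


def split_into_class_masks(label_map, num_classes=len(CLASS_NAMES)):
    """Turn an (H, W) array of class indices into (num_classes, H, W) masks."""
    label_map = np.asarray(label_map)
    masks = [(label_map == c) for c in range(num_classes)]
    return np.stack(masks).astype(np.float32)


labels = np.array([[0, 1],
                   [2, 3]])
masks = split_into_class_masks(labels)
inaccessible_mask = masks[CLASS_NAMES.index("inaccessible")]
```

Unlike raw indices, the binary masks do not impose an artificial ordering on the classes.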

## Conclusion

In conclusion, we have shown that an RNN model can perform reasonably well on trajectory prediction tasks.  We have learned a great deal about how important it is to choose wisely when designing the inputs for the LSTM.  If the input is too simple, the model will not have enough information to predict trajectories well.  On the other hand, throwing too much data at the LSTM will dramatically increase training time and computational cost, and is not even guaranteed to improve results.  We spent a good amount of time on this project trying to find the middle ground between these two extremes and had some limited success.  Although we did not end up beating the state of the art on these tasks, it was a fruitful learning experience.