<a href="https://colab.research.google.com/github/lakshmiprasanna1999/Traffic_prediction/blob/master/Traffic_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

In [None]:
import pickle

In [None]:
!git clone https://github.com/wshuyi/demo_traffic_jam_prediction.git

In [None]:
from pathlib import Path
data_dir = Path('demo_traffic_jam_prediction')

In [None]:
with open(data_dir / 'data.pickle', 'rb') as f:
    [event_dict, df] = pickle.load(f)

In [None]:
event_dict

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
max_len_event_id = df.events.apply(len).idxmax()
max_len_event_id

In [None]:
max_len_event = df.loc[max_len_event_id]  # idxmax() returns an index label, so look it up with .loc
max_len_event.events

In [None]:
maxlen = len(max_len_event.events)
maxlen

In [None]:
reversed_dict = {}
for k, v in event_dict.items():
  reversed_dict[v] = k

In [None]:
reversed_dict

In [None]:
def map_event_list_to_idxs(event_list):
  list_idxs = []
  for event in (event_list):
    idx = reversed_dict[event]
    list_idxs.append(idx)
  return list_idxs

In [None]:
map_event_list_to_idxs(max_len_event.events)
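As a small illustration of the lookup above, the sketch below inverts a made-up three-entry dictionary and maps event names back to indices. The names `toy_event_dict`, `toy_reversed`, `to_idxs` and the event labels are hypothetical and not taken from the dataset. The only assumption carried over from the notebook is that the dictionary maps integer indices to event names.

```python
# Hypothetical toy dictionary: integer index -> event name.
toy_event_dict = {1: "road closed", 2: "accident", 3: "heavy traffic"}

# Invert it so that event names can be looked up directly.
toy_reversed = {name: idx for idx, name in toy_event_dict.items()}

def to_idxs(event_list, lookup):
    """Translate a list of event names into their integer indices."""
    return [lookup[event] for event in event_list]

toy_idxs = to_idxs(["accident", "accident", "road closed"], toy_reversed)
print(toy_idxs)  # [2, 2, 1]
```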


In [None]:
import numpy as np
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

In [None]:
len(event_dict)

In [None]:
df.events.apply(map_event_list_to_idxs)

In [None]:
sequences = df.events.apply(map_event_list_to_idxs).tolist()
sequences[:5]

The first row is much longer than the following ones.
However, to apply a sequence model to the data, all input sequences must share the same length. Hence, we use the length of the longest sequence as the maximum length and pad the shorter sequences with 0s at the beginning.

In [None]:
data = pad_sequences(sequences, maxlen=maxlen)
data
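To make the effect of padding at the beginning concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper `left_pad` and the toy sequences are hypothetical. This is an illustration of the idea, not the Keras implementation, and it assumes `maxlen` is a positive integer.

```python
def left_pad(sequences, maxlen, value=0):
    """Pad each sequence with `value` on the left until it has `maxlen` items.

    Sequences longer than `maxlen` keep only their last `maxlen` items,
    mirroring the default 'pre' truncation of pad_sequences.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[5, 3, 8, 1], [2], [7, 7]]
toy_padded = left_pad(toy_sequences, maxlen=4)
print(toy_padded)  # [[5, 3, 8, 1], [0, 0, 0, 2], [0, 0, 7, 7]]
```

Padding on the left keeps the most recent events at the end of every row, which is where a recurrent layer reads last.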

In [None]:
labels = np.array(df.label)

In [None]:
np.random.seed(12)

We shuffle the sequences along with their corresponding labels, applying the same random permutation of indices to both so that each sequence stays paired with its label.

In [None]:
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
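The reason for shuffling an index array rather than the data itself can be checked on toy arrays. The sketch below uses NumPy's `default_rng` generator instead of the global seed used in the notebook; that swap, the seed value and all `toy_` names are choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(12)

# Toy data: row i is filled with the value i, and its label is also i,
# so a broken pairing would be easy to spot.
toy_data = np.repeat(np.arange(6)[:, None], 3, axis=1)
toy_labels = np.arange(6)

# One permutation, applied to both arrays.
perm = rng.permutation(len(toy_data))
shuffled_data = toy_data[perm]
shuffled_labels = toy_labels[perm]

# Every row still sits next to its own label.
print(np.array_equal(shuffled_data[:, 0], shuffled_labels))  # True
```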

The training set will contain 80% of the data, while the other 20% goes into the validation set.

In [None]:
training_samples = int(len(indices) * .8)
validation_samples = len(indices) - training_samples

The following code divides the data into training and validation sets, along with the labels.

In [None]:
X_train = data[:training_samples]
y_train = labels[:training_samples]
X_valid = data[training_samples: training_samples + validation_samples]
y_valid = labels[training_samples: training_samples + validation_samples]
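The same slicing logic can be tried on a plain list. The helper `split_train_valid` and the ten-item list below are hypothetical and only meant to show how the 80/20 boundary is computed.

```python
def split_train_valid(items, train_fraction=0.8):
    """Return (train, valid), where train holds the first part of `items`."""
    n_train = int(len(items) * train_fraction)
    return items[:n_train], items[n_train:]

toy_items = list(range(10))
toy_train, toy_valid = split_train_valid(toy_items)
print(len(toy_train), len(toy_valid))  # 8 2
```

Because the validation set is simply everything after the training rows, an open-ended slice such as `items[n_train:]` selects the same rows as a slice with an explicit end index.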

The content of the training data:

In [None]:
X_train

Because we filled the sequences with 0 as the padding value, we now have 33 distinct values instead of 32 event types.

So the number of event types is set to 33, which is why we add 1 below.

In [None]:
num_events = len(event_dict) + 1

If we simply feed the numbers into a classification model, it will treat each number as a continuous value. However, they are not continuous; they are category indices. So we pass the numbers through an Embedding layer, which converts each number (representing a certain type of event) into a vector. Each vector contains 20 scalars.

In [None]:
embedding_dim = 20

In [None]:
#The initial embedding matrix will be generated randomly.
embedding_matrix = np.random.rand(num_events, embedding_dim)
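Conceptually, an embedding lookup is just row selection from this matrix. The sketch below shows that with NumPy integer indexing on a deliberately tiny matrix. The sizes and `toy_` names are hypothetical, and the code assumes index 0 is reserved for padding, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_num_events = 5      # hypothetical: 4 event types plus the padding index 0
toy_embedding_dim = 3   # hypothetical, smaller than the notebook's 20

# One row per index; row 0 belongs to the padding value.
toy_matrix = rng.random((toy_num_events, toy_embedding_dim))

toy_sequence = np.array([0, 0, 4, 1])   # a left-padded sequence of event indices
toy_vectors = toy_matrix[toy_sequence]  # integer indexing picks one row per event

print(toy_vectors.shape)  # (4, 3)
```

The matrix needs one more row than there are event types because the padding value 0 also has to map to a vector.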

Finally, we can build the model.
We use the Sequential model in Keras and stack the layers one by one, as if playing with Lego bricks.
The first layer is an Embedding layer, followed by an LSTM layer; the last layer is a dense one with a sigmoid activation function, for binary classification.

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, LSTM

units = 32

model = Sequential()
model.add(Embedding(num_events, embedding_dim))
model.add(LSTM(units))
model.add(Dense(1, activation='sigmoid'))
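To get a feel for the size of this model, the parameter counts can be worked out by hand. The sketch below assumes 33 event indices (as stated in the text), an embedding dimension of 20, 32 LSTM units, and layers that keep their default bias terms. The helper functions are hypothetical and simply encode the standard counting formulas.

```python
def embedding_params(num_events, embedding_dim):
    # One embedding_dim-sized vector per index.
    return num_events * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, outputs):
    # A weight per input plus one bias for each output.
    return outputs * (input_dim + 1)

counts = {
    "embedding": embedding_params(33, 20),
    "lstm": lstm_params(20, 32),
    "dense": dense_params(32, 1),
}
total = sum(counts.values())
print(counts, total)  # {'embedding': 660, 'lstm': 6784, 'dense': 33} 7477
```

Under these assumptions, most of the parameters sit in the LSTM layer rather than in the embedding matrix.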

The next step is to handle the parameters in the Embedding layer. For now, we just load the randomly generated initial embedding matrix and do not let the training process change the weights of the Embedding layer.

In [None]:
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

In [None]:
#Then we train the model and save it to an h5 file.
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_data=(X_valid, y_valid))
model.save("mymodel_embedding_untrainable.h5")

In [None]:
#After the model is trained, let us visualize the curves of accuracy and loss with matplotlib.
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

As you can see, the accuracy is not bad. If we used a dummy model that predicts label 0 for everything (or 1 for everything), the accuracy would stay at 0.50. So our model has apparently captured some pattern and outperformed the dummy one.
However, it is very unstable.
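The 0.50 figure for the dummy model holds when the two labels are equally frequent. A quick way to check such a baseline is sketched below. The helper `majority_baseline_accuracy` and both label lists are made up for illustration; with the notebook's data one would pass the actual labels instead.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

balanced = [0, 1, 0, 1, 1, 0]   # made-up, perfectly balanced labels
skewed = [0, 0, 0, 1]           # made-up, imbalanced labels

print(majority_baseline_accuracy(balanced))  # 0.5
print(majority_baseline_accuracy(skewed))    # 0.75
```

If the classes were imbalanced, the baseline to beat would be higher than 0.50, so it is worth computing before judging the model.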

Loss:
As you may notice, the loss is not good. While the training loss went down, the loss on the validation set fluctuated, and there is no significant trend of convergence.
It is important to find out the reason.
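One simple way to put numbers on this behaviour is to locate the epoch where the validation loss was lowest and to measure how far the two losses have drifted apart by the end. The helpers and the loss values below are made up for illustration. With the notebook's results, the lists would come from `history.history['loss']` and `history.history['val_loss']`.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

def divergence_gap(train_losses, val_losses):
    """Validation loss minus training loss at the final epoch."""
    return val_losses[-1] - train_losses[-1]

# Made-up loss values, for illustration only.
toy_train_loss = [0.69, 0.60, 0.52, 0.45, 0.40]
toy_val_loss = [0.70, 0.63, 0.61, 0.66, 0.72]

print(best_epoch(toy_val_loss))  # 3
print(divergence_gap(toy_train_loss, toy_val_loss) > 0)  # True
```

A best epoch far from the last one, combined with a growing gap, is a common sign that the model has started to over-fit.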



Note that we used a randomly initialized embedding matrix that stayed static during the training phase. This may be what leads us into trouble.

So as a next step, we can run an experiment in which the Embedding layer is allowed to be trained and adjusted.

Here we simply set trainable to True.

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, LSTM

units = 32

model = Sequential()
model.add(Embedding(num_events, embedding_dim))
model.add(LSTM(units))
model.add(Dense(1, activation='sigmoid'))

In [None]:
#The only difference in the code is that the parameter trainable is set to True.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = True

In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_data=(X_valid, y_valid))
model.save("mymodel_embedding_trainable.h5")

In [None]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

As you can see, it got better. The validation accuracy curve fluctuates less, and the validation accuracy rose above 0.75.
This model is, to some extent, more valuable.

In [None]:
#We add two dropout-related parameters: dropout=0.2 and recurrent_dropout=0.2 are passed when defining the LSTM layer.
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, LSTM

units = 32

model = Sequential()
model.add(Embedding(num_events, embedding_dim))
model.add(LSTM(units, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
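For intuition about what a dropout rate of 0.2 means, here is a minimal NumPy sketch of the commonly used "inverted dropout" scheme: each value is zeroed with probability 0.2 and the survivors are scaled up so the expected value is unchanged. The function `inverted_dropout` is hypothetical and only illustrates the basic idea. It does not reproduce how the LSTM layer applies `dropout` to its inputs and `recurrent_dropout` to its recurrent state.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each element with probability `rate` and rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
toy_activations = np.ones(10_000)
dropped = inverted_dropout(toy_activations, rate=0.2, rng=rng)

print(np.unique(dropped))        # only the values 0.0 and 1.25 remain
print(round(dropped.mean(), 2))  # close to 1.0, the original mean
```

Dropout of this kind is only active during training. At prediction time all units are kept.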

In [None]:
#We keep the trainable parameter of the Embedding layer set to True.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = True

In [None]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_data=(X_valid, y_valid))
model.save("mymodel_embedding_trainable_with_dropout.h5")

In [None]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

When you look at the loss curves, you will see a significant improvement.

The validation loss curve is smoother and much closer to the trend of the training loss.
Over-fitting has been taken care of, and the model is now more stable and generalizes better to unseen data.
The Traffic Administration can then use the model to predict the occurrence of severe traffic volume from the Waze open data of incident reports. The expected model accuracy is about 75%.