In [3]:
# Using ! to invoke the following shell command inside our notebook
!git clone https://bitbucket.org/jadslim/german-traffic-signs

Cloning into 'german-traffic-signs'...
Unpacking objects: 100% (6/6), done.


In [5]:
# The cell above cloned the data from the Bitbucket server and created a folder
# called "german-traffic-signs". Now we list the contents of this folder.
!ls german-traffic-signs/

signnames.csv  test.p  train.p	valid.p


As can be seen, there are four files. The first is a CSV file of sign names, and the other three are pickle files containing the test, training, and validation datasets, respectively.

In Python, an object can be pickled before it is saved to disk. That is, it is serialized before being written to a file: serialization converts the object into a byte stream.

A pickled file contains serialized data that can be unpickled when needed.
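
As a minimal sketch of the round trip described above, the snippet below pickles a small object into a byte stream and unpickles it again. The `record` dictionary and its contents are made up for illustration and are not part of the traffic-sign data.

```python
import pickle

# A small made-up object; any picklable Python object would work here.
record = {"name": "example", "values": [1, 2, 3]}

# Serialize the object into a byte stream (this is what gets written to disk).
serialized = pickle.dumps(record)
print(type(serialized))  # <class 'bytes'>

# Deserialize the byte stream back into an equal, but separate, object.
restored = pickle.loads(serialized)
print(restored == record)  # True
print(restored is record)  # False
```

Note that unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.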

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.layers import Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
import pickle

In [7]:
np.random.seed(0)

In [8]:
# The "with" statement pairs a setup step with a cleanup step around a block of
# code. Here it opens a file, lets us read from it inside the block, and then
# closes the file automatically when the block ends.
with open('german-traffic-signs/train.p', 'rb') as f:
    train_data = pickle.load(f)
with open('german-traffic-signs/valid.p', 'rb') as f:
    val_data = pickle.load(f)
with open('german-traffic-signs/test.p', 'rb') as f:
    test_data = pickle.load(f)
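
To see the automatic closing in action, here is a small self-contained sketch. It writes a made-up dictionary to a temporary file and reads it back; the file name `example.p` and its contents are hypothetical and unrelated to the dataset files.

```python
import os
import pickle
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.p")

# Write a small made-up object in binary mode ("wb").
with open(path, "wb") as f:
    pickle.dump({"labels": [0, 1, 2]}, f)

# Read it back in binary mode ("rb"), as the cell above does.
with open(path, "rb") as f:
    loaded = pickle.load(f)
    print(f.closed)  # False: the file is still open inside the block

print(f.closed)  # True: "with" closed the file when the block ended
print(loaded)
```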

If we print the type of our loaded datasets using
```
print(type(train_data))
```
we can see they are of type
```
<class 'dict'>
```
Of the key-value pairs in these dictionaries, two are of interest to us: *features* and *labels*.

The *features* key maps to the images in pixel representation, whereas the *labels* key maps to an array of labels that assigns each image to a class.
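
The structure described above can be sketched with a toy dictionary. The array sizes, the `uint8` dtype, and the label values below are assumptions chosen for illustration; the real files may also contain keys other than these two.

```python
import numpy as np

# Toy stand-in for one of the loaded dictionaries (values are made up).
toy_data = {
    "features": np.zeros((5, 32, 32, 3), dtype=np.uint8),  # 5 RGB images
    "labels": np.array([0, 1, 2, 1, 0]),                   # one class id per image
}

# Inspect each key and the shape of the array stored under it.
for key, value in toy_data.items():
    print(key, type(value).__name__, value.shape)

# Pull out the two entries of interest.
features, labels = toy_data["features"], toy_data["labels"]
```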

In [9]:
X_train, y_train = train_data['features'], train_data['labels']
X_val, y_val = val_data['features'], val_data['labels']
X_test, y_test = test_data['features'], test_data['labels']

In [11]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(34799, 32, 32, 3)
(4410, 32, 32, 3)
(12630, 32, 32, 3)


As can be seen, we have roughly 35,000 training images of 32 by 32 pixels with a depth of 3. Unlike the MNIST dataset, our traffic signs are in RGB format, so each image has one channel for each of the three colors: red, green, and blue.

Similarly, there are 4,410 validation images and 12,630 test images.

We print the shapes here to make sure that they are consistent and match our expectations.

Since we imported this data from a repository, it is good practice to verify that the dataset was imported correctly every time the program runs.

In [15]:
# Assert that the number of images equals the number of labels.
assert X_train.shape[0] == y_train.shape[0], "The number of images is not equal to the number of labels"
assert X_val.shape[0] == y_val.shape[0], "The number of images is not equal to the number of labels"
assert X_test.shape[0] == y_test.shape[0], "The number of images is not equal to the number of labels"

# Assert that every image is 32 x 32 pixels with 3 color channels.
assert X_train.shape[1:] == (32, 32, 3), "The dimensions of the images are not 32 x 32 x 3"
assert X_val.shape[1:] == (32, 32, 3), "The dimensions of the images are not 32 x 32 x 3"
assert X_test.shape[1:] == (32, 32, 3), "The dimensions of the images are not 32 x 32 x 3"
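
The same checks could also be wrapped in a small helper so that each split is verified with one call. This is only a sketch: `check_split` is a hypothetical function name, not part of any library, and the toy arrays below stand in for the real data.

```python
import numpy as np


def check_split(images, labels, name, image_shape=(32, 32, 3)):
    """Raise AssertionError if a split's images and labels are inconsistent."""
    # One label per image.
    assert images.shape[0] == labels.shape[0], (
        f"{name}: {images.shape[0]} images but {labels.shape[0]} labels"
    )
    # Every image has the expected height, width, and channel count.
    assert images.shape[1:] == image_shape, (
        f"{name}: image shape is {images.shape[1:]}, expected {image_shape}"
    )


# Toy data (made up) that satisfies both checks.
toy_images = np.zeros((4, 32, 32, 3), dtype=np.uint8)
toy_labels = np.array([0, 1, 2, 3])
check_split(toy_images, toy_labels, "toy")
```

With the notebook's variables, this would be called as `check_split(X_train, y_train, "train")`, and likewise for the validation and test splits. Including the split name in the message makes it easier to tell which dataset failed.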