<a href="https://colab.research.google.com/github/lblogan14/deep_learning_for_computer_vision/blob/master/ch4_object_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [37]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My' 'Drive/Colab' 'Notebooks/Deep_Learning_for_Computer_Vision/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision


#Detect Objects in an Image
**Object localization** finds the position of an object in the image in addition to labeling it.

**Object detection** finds multiple objects in the image and returns rectangular coordinates (a bounding box) for each of them.

In detection, the number of objects varies from image to image. This seemingly
small difference has a big effect on the design of the deep learning
architecture, depending on whether it targets localization or detection.

#Explore the Datasets

##ImageNet Dataset
ImageNet has data for evaluating classification, localization, and detection tasks.

##PASCAL VOC Challenge
There are 20 classes
in the dataset. The dataset has 11,530 images for training and validation with
27,450 annotations for regions of interest. The following are the twenty classes
present in the dataset:
* Person: Person
* Animal: Bird, cat, cow, dog, horse, sheep
* Vehicle: Airplane, bicycle, boat, bus, car, motorbike, train
* Indoor: Bottle, chair, dining table, potted plant, sofa, tv/monitor

##COCO Object Detection Challenge
The **Common Objects in Context (COCO)** dataset has 200,000 images with
more than 500,000 object annotations in 80 categories.

#Evaluate Datasets Using Metrics
A box annotated by a human is called the **ground truth**.
The ground truth need not be the absolute truth.

Moreover, the boxes can differ by a few pixels from one annotator to another,
so it is hard for an algorithm to reproduce the exact bounding box drawn by a
human. **Intersection over Union (IoU)** is used to evaluate the localization
task. **Mean Average Precision (mAP)** is used to evaluate the detection task.

##Intersection over Union
The IoU is the ratio of the area of overlap between the **ground truth** box and the predicted box
to the area of their union.
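
As a minimal sketch of this ratio for a single pair of boxes, assuming each box is given by its corner coordinates `(x_min, y_min, x_max, y_max)`. The helper name `box_iou` is hypothetical and not part of the notebook:

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes give no overlap
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Here the two 2x2 boxes overlap in a 1x1 square, so the intersection is 1, the union is 4 + 4 - 1 = 7, and the IoU is 1/7.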


In [0]:
import tensorflow as tf

The following example computes the IoU given the ground truth and predicted bounding boxes:

In [0]:
def calculate_iou(gt_bb, pred_bb):
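  '''
  Compute the element-wise IoU of two 5-D tensors of bounding boxes.

  :param gt_bb: ground truth bounding boxes; the last axis holds
      (center_x, center_y, width, height)
  :param pred_bb: predicted bounding boxes in the same layout
  :return: IoU values in [0, 1], one per pair of boxes
  '''

  # Convert (center, size) boxes to (min corner, max corner) boxes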
  gt_bb = tf.stack([
      gt_bb[:, :, :, :, 0] - gt_bb[:, :, :, :, 2] / 2.0,
      gt_bb[:, :, :, :, 1] - gt_bb[:, :, :, :, 3] / 2.0,
      gt_bb[:, :, :, :, 0] + gt_bb[:, :, :, :, 2] / 2.0,
      gt_bb[:, :, :, :, 1] + gt_bb[:, :, :, :, 3] / 2.0
  ])
  gt_bb = tf.transpose(gt_bb, [1,2,3,4,0])
  
  pred_bb = tf.stack([
      pred_bb[:, :, :, :, 0] - pred_bb[:, :, :, :, 2] / 2.0,
      pred_bb[:, :, :, :, 1] - pred_bb[:, :, :, :, 3] / 2.0,
      pred_bb[:, :, :, :, 0] + pred_bb[:, :, :, :, 2] / 2.0,
      pred_bb[:, :, :, :, 1] + pred_bb[:, :, :, :, 3] / 2.0
  ])
  pred_bb = tf.transpose(pred_bb, [1,2,3,4,0])
  
  area = tf.maximum(0.0,
                    tf.minimum(gt_bb[:, :, :, :, 2:], pred_bb[:, :, :, :, 2:]) -
                    tf.maximum(gt_bb[:, :, :, :, :2], pred_bb[:, :, :, :, :2])
                   )
  intersection_area = area[:, :, :, :, 0] * area[:, :, :, :, 1]
  
  gt_bb_area = (gt_bb[:, :, :, :, 2] - gt_bb[:, :, :, :, 0]) * \
               (gt_bb[:, :, :, :, 3] - gt_bb[:, :, :, :, 1])
  pred_bb_area = (pred_bb[:, :, :, :, 2] - pred_bb[:, :, :, :, 0]) * \
                 (pred_bb[:, :, :, :, 3] - pred_bb[:, :, :, :, 1])
  union_area = tf.maximum(gt_bb_area + pred_bb_area - intersection_area, 1e-10)
  iou = tf.clip_by_value(intersection_area / union_area, 0.0, 1.0)
  return iou

The ground truth and predicted bounding boxes are first converted from
center/size form to corner form with `tf.stack`. The width and height of the
overlap are then computed and clamped at zero, which handles the case of
negative area.

A negative area can occur when the two boxes do not overlap or when the bounding box coordinates are incorrect: the right-side
coordinate of a box may lie to the left of its left-side coordinate. Since
nothing forces a prediction to preserve the structure of a bounding box, negative
areas are bound to occur. Finally, the intersection and union areas are
computed, and the IoU is the ratio of the two.

The IoU calculation can be coupled with training algorithms for
localization problems.

##Mean Average Precision (mAP)
The mAP is used for evaluating detection algorithms. The metric summarizes the
precision and recall of the detected bounding boxes. The mAP value
ranges from 0 to 100. The higher the number, the better it is. The mAP is
computed by calculating the **average precision (AP)** separately for each class and then
averaging over the classes. A detection is considered a true positive only if
its IoU with a ground truth box is above 0.5. All detections from the test images are combined to
draw a precision/recall curve for each class. The area under the
curve is the AP of that class and can be used to compare algorithms. The mAP is a good measure
of the sensitivity of the network while not raising many false alarms.
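
The AP of one class can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, every detection is assumed to have already been marked as a true or false positive by the IoU test, `num_ground_truth` is assumed to be positive, and the curve is integrated with all-point interpolation, which is one of several conventions in use.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision/recall curve of one class."""
    # Rank detections from most to least confident
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = false_pos = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            true_pos += 1
        else:
            false_pos += 1
        recalls.append(true_pos / num_ground_truth)
        precisions.append(true_pos / (true_pos + false_pos))
    # Interpolation: precision at a recall level is the best precision
    # reached at that recall or any higher recall
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangles under the interpolated curve
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """Average the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Three detections of one class, two ground truth objects
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], 2)
print(ap, mean_average_precision([ap, 1.0]))
```

Multiplying the result by 100 gives the 0 to 100 scale used when reporting mAP.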

#Localizing Algorithms
Localization algorithms are an extension of image classification and image retrieval.

In image classification, an
image is passed through several layers of a CNN (convolutional neural network).
The final layer of the CNN outputs a probability for each of the
labels. This can be extended to localize the objects.

#Localize Objects Using Sliding Windows
An intuitive way of localization is to run a classifier on several cropped
portions of the image. The crops are produced by moving a window that is
smaller than the image across it and cropping the image to the window at each
position. This method is called a **sliding window**, and making a prediction
for every cropped window is called sliding window object detection.

The prediction can be made by a deep learning model trained for image
classification on *closely-cropped images*. *Close cropping* means that
only one object is found in the whole image. The window has to move
uniformly across the image. Each portion of the image is passed through
the model to find its class.

There are two problems with this approach:

1. It can only find objects that are the same size as the window. The sliding
window will miss an object if the object is bigger than the window.
To overcome this, we will use the concept of **scale space**.
2. Moving the window by more than one pixel at a time may lead to
missing a few objects, while moving the window over every pixel results in a
lot of extra computation and slows down the system. To avoid this,
we will incorporate a trick in the convolutional layers.
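
The window positions visited by a sliding window can be enumerated with a short generator. This is a sketch with hypothetical names (`sliding_windows`, `window_size`, `stride`); a real detector would crop the image at each position and pass the crop to the classifier.

```python
def sliding_windows(image_height, image_width, window_size, stride):
    """Yield (top, left, bottom, right) for every window position."""
    for top in range(0, image_height - window_size + 1, stride):
        for left in range(0, image_width - window_size + 1, stride):
            yield top, left, top + window_size, left + window_size


# A 6 x 8 image scanned with a 4 x 4 window
coarse = list(sliding_windows(6, 8, window_size=4, stride=2))
dense = list(sliding_windows(6, 8, window_size=4, stride=1))
print(len(coarse), len(dense))
```

With a stride of 2 the window visits 6 positions, while a stride of 1 visits 15. This is the trade-off between missed objects and extra computation described in the second point.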

##Scale-Space Concept
The scale-space concept is to use images of various sizes.

An
image is reduced to a smaller size so that bigger objects can be detected with the
same-sized window. The image can be resized to a series of decreasing
sizes. Resizing by removing alternate pixels or by interpolation may
leave some artefacts, so the image is smoothed and resized iteratively. The
images obtained by smoothing and resizing form the scale space.

The window is slid over every scale to localize objects.
Running on multiple scales is equivalent to running on the image with a bigger
window. The computational complexity of running on multiple scales is high.
Localization can be sped up by moving the window in larger steps, at the cost of accuracy. This
complexity makes the solution unusable in production. The sliding
window idea can be made efficient with a fully convolutional implementation of
sliding windows.
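
A scale space can be sketched without any library by repeatedly smoothing and halving an image stored as a nested list. The names `smooth_and_halve` and `scale_space` are hypothetical, and averaging 2x2 blocks is a crude stand-in (an assumption made here) for the Gaussian smoothing that is more commonly used.

```python
def smooth_and_halve(image):
    """Average each 2x2 block: a crude blur followed by 2x downsampling."""
    height, width = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(width)
        ]
        for r in range(height)
    ]


def scale_space(image, min_size):
    """Return the image at successively halved resolutions."""
    levels = [image]
    # Stop before the smaller side drops below min_size
    while min(len(levels[-1]), len(levels[-1][0])) // 2 >= min_size:
        levels.append(smooth_and_halve(levels[-1]))
    return levels


image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = scale_space(image, min_size=2)
print([(len(level), len(level[0])) for level in pyramid])
```

The same fixed-size window covers a larger part of the scene on each smaller level, which is how bigger objects come within its reach.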

In [0]:
from tensorflow.examples.tutorials.mnist import input_data
mnist_data = input_data.read_data_sets('MNIST_data', one_hot=True)

input_size = 784
no_classes = 10
batch_size = 100
total_batches = 300

In [0]:
tf.reset_default_graph()

In [0]:
x_input = tf.placeholder(tf.float32, shape=[None, input_size])
y_input = tf.placeholder(tf.float32, shape=[None, no_classes])


def add_variable_summary(tf_variable, summary_name):
  with tf.name_scope(summary_name + '_summary'):
    mean = tf.reduce_mean(tf_variable)
    tf.summary.scalar('Mean', mean)
    with tf.name_scope('standard_deviation'):
        standard_deviation = tf.sqrt(tf.reduce_mean(
            tf.square(tf_variable - mean)))
    tf.summary.scalar('StandardDeviation', standard_deviation)
    tf.summary.scalar('Maximum', tf.reduce_max(tf_variable))
    tf.summary.scalar('Minimum', tf.reduce_min(tf_variable))
    tf.summary.histogram('Histogram', tf_variable)


x_input_reshape = tf.reshape(x_input, [-1, 28, 28, 1],
                             name='input_reshape')


def convolution_layer(input_layer, filters, kernel_size=[3, 3],
                      activation=tf.nn.relu):
    layer = tf.layers.conv2d(
        inputs=input_layer,
        filters=filters,
        kernel_size=kernel_size,
        activation=activation
    )
    add_variable_summary(layer, 'convolution')
    return layer


def pooling_layer(input_layer, pool_size=[2, 2], strides=2):
    layer = tf.layers.max_pooling2d(
        inputs=input_layer,
        pool_size=pool_size,
        strides=strides
    )
    add_variable_summary(layer, 'pooling')
    return layer

In [0]:
# OverFeat model
convolution_layer_1 = convolution_layer(x_input_reshape, 64)
pooling_layer_1 = pooling_layer(convolution_layer_1)
convolution_layer_2 = convolution_layer(pooling_layer_1, 128)
pooling_layer_2 = pooling_layer(convolution_layer_2)
dense_layer_bottleneck = convolution_layer(pooling_layer_2, 1024, [5, 5])
# No activation on the logits: the loss below applies the softmax itself
logits = convolution_layer(dense_layer_bottleneck, no_classes, [1, 1],
                           activation=None)
logits = tf.reshape(logits, [-1, 10])

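The dense layers are expressed as convolution layers: the 5x5 convolution covers the whole 5x5 feature map left after the second pooling layer, and the 1x1 convolution acts as the output layer.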
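
The equivalence can be checked on a toy example. The sketch below uses hypothetical helper names and plain nested lists: a dense unit applied to the flattened features gives the same number as a single-filter convolution whose kernel is as large as the feature map.

```python
def dense(flat_inputs, weights, bias):
    """Fully connected unit: weighted sum over the flattened features."""
    return sum(x * w for x, w in zip(flat_inputs, weights)) + bias


def conv_valid(feature_map, kernel, bias):
    """Single-filter convolution without padding over an H x W x C map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(feature_map) - k_h + 1
    out_w = len(feature_map[0]) - k_w + 1
    return [
        [
            sum(
                feature_map[r + i][c + j][ch] * kernel[i][j][ch]
                for i in range(k_h)
                for j in range(k_w)
                for ch in range(len(kernel[i][j]))
            ) + bias
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 2 x 2 feature map with 2 channels and a kernel of the same size
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
kernel = [[[0.5, -1.0], [0.25, 0.0]], [[1.0, 1.0], [-0.5, 2.0]]]

flat_features = [v for row in features for pixel in row for v in pixel]
flat_weights = [v for row in kernel for pixel in row for v in pixel]

dense_out = dense(flat_features, flat_weights, bias=0.1)
conv_out = conv_valid(features, kernel, bias=0.1)
print(dense_out, conv_out)

# On a larger input the same kernel yields one output per window position
larger = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]
larger_out = conv_valid(larger, kernel, bias=0.1)
print(len(larger_out), len(larger_out[0]))
```

The last two lines show why this matters for localization: on a 3x3 input the same weights produce a 2x2 grid of outputs, one per window position, without any explicit sliding.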

In [0]:
with tf.name_scope('loss'):
    softmax_cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
        labels=y_input, logits=logits)
    print(softmax_cross_entropy)
    loss_operation = tf.reduce_mean(softmax_cross_entropy, name='loss')
    print(loss_operation)
    tf.summary.scalar('loss', loss_operation)

with tf.name_scope('optimiser'):
    optimiser = tf.train.AdamOptimizer().minimize(loss_operation)


with tf.name_scope('accuracy'):
    with tf.name_scope('correct_prediction'):
        predictions = tf.argmax(logits, 1)
        correct_predictions = tf.equal(predictions, tf.argmax(y_input, 1))
    with tf.name_scope('accuracy'):
        accuracy_operation = tf.reduce_mean(
            tf.cast(correct_predictions, tf.float32))
tf.summary.scalar('accuracy', accuracy_operation)

In [0]:
session = tf.Session()
session.run(tf.global_variables_initializer())

merged_summary_operation = tf.summary.merge_all()
train_summary_writer = tf.summary.FileWriter('/tmp/train', session.graph)
test_summary_writer = tf.summary.FileWriter('/tmp/test')

In [0]:
test_images, test_labels = mnist_data.test.images, mnist_data.test.labels

In [0]:
for batch_no in range(total_batches):
    mnist_batch = mnist_data.train.next_batch(batch_size)
    train_images, train_labels = mnist_batch[0], mnist_batch[1]
    _, merged_summary = session.run([optimiser, merged_summary_operation],
                                    feed_dict={
        x_input: train_images,
        y_input: train_labels,
    })
    train_summary_writer.add_summary(merged_summary, batch_no)
    if batch_no % 10 == 0:
        merged_summary, _ = session.run([merged_summary_operation,
                                         accuracy_operation], feed_dict={
            x_input: test_images,
            y_input: test_labels,
        })
        test_summary_writer.add_summary(merged_summary, batch_no)

##Convolution Implementation of Sliding Window
Instead of sliding a window, the final output of the network is made into a volume whose depth holds the required targets and whose spatial positions correspond to the windows (boxes).

#Think about Localization as a Regression Problem
One fundamental way to think about localization is to model it as a
regression problem. The bounding box is four numbers and hence can be
predicted directly in a regression setting. We also need to
predict the label, which is a classification problem.

There are different parameterizations available to define a bounding box,
usually with four numbers. One representation
is the coordinates of the center together with the height and width of the box. A
pre-trained model can be used by removing the fully connected layer and
replacing it with a regression encoder. The regression is trained with
the L2 loss, which performs poorly in the presence of outliers.

Swapping the L2 loss for a smoothed version (such as the smooth L1 loss) is better. Fine-tuning
the model gives good accuracy, whereas training the whole network
gives only a marginal performance improvement. It's a trade-off between
training time and accuracy.
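
The effect of an outlier on the two kinds of loss can be sketched as follows, assuming the smoothed loss meant here is the smooth L1 loss used by detectors such as Fast R-CNN. The function names and the `beta` parameter are illustrative, not taken from the notebook.

```python
def l2_loss(error):
    """Squared-error loss: grows quadratically with the error."""
    return 0.5 * error ** 2


def smooth_l1_loss(error, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it."""
    abs_error = abs(error)
    if abs_error < beta:
        return 0.5 * abs_error ** 2 / beta
    return abs_error - 0.5 * beta


for error in (0.5, 10.0):
    print(error, l2_loss(error), smooth_l1_loss(error))
```

For a small error of 0.5 both losses are 0.125, but for an outlier with an error of 10 the L2 loss is 50.0 while the smooth L1 loss is only 9.5, so a single badly annotated box dominates the gradient far less.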

#Combine Regression with the Sliding Window
The classification score is computed for every window in the sliding window
approach or the fully convolutional approach to know what object is present in
that window. Instead of predicting only the classification score for every window,
a bounding box can also be predicted for each window along with its classification score.
Combining all the ideas, such as sliding window, scale space, full convolution,
and regression, gives better results than any individual approach.

#Detect Objects

##Region-based Convolutional Neural Network (R-CNN)
R-CNN proposes a few boxes and checks whether any
of the boxes correspond to the ground truth. **Selective search** was used for these
region proposals. Selective search proposes regions by grouping the color/texture of windows of various sizes. It looks for blob-like structures: it starts with a pixel and produces a blob at a higher scale. It produces around 2,000 region proposals, far fewer than
the number of all possible sliding windows.

The proposals are resized and passed through a standard CNN architecture such
as AlexNet/VGG/Inception/ResNet. The features from the last layer of the CNN are used to train an
SVM that identifies the object, with an additional no-object class. The boxes are further
improved by tightening them around the objects: a linear regression model
is trained on the object region proposals to predict a closer bounding box.

The architecture of R-CNN,
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/rcnn.JPG?raw=true)

The encoder can be a pre-trained standard deep learning model. The
features are computed for all the regions from the training data. The features are
stored and then the SVM is trained. Next, the bounding box regressor is trained with
the normalized coordinates.

Disadvantages:
* Selective search forms many proposals, and an inference has to be
computed for each of them, usually around 2,000 per image
* Three separate models have to be trained (the CNN, the SVM, and the
bounding box regressor), which increases the number of parameters
* There is no end-to-end training

##Fast R-CNN
The Fast R-CNN method runs CNN inference only once per image and
hence reduces computation. The output of the CNN is shared by all the proposed
regions and is used to classify each region and select its bounding box. It introduced a technique called **Region
of Interest pooling**. The Region of Interest pooling takes the CNN features and
pools them together according to the regions.
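
Region of Interest pooling can be sketched for a single-channel feature map as below. The function name and the way the bin edges are rounded are assumptions made for this illustration; actual implementations differ in such details.

```python
def roi_max_pool(feature_map, roi, output_size):
    """Max-pool the region roi = (top, left, bottom, right) to a fixed grid.

    bottom and right are exclusive; feature_map is a 2-D nested list.
    """
    top, left, bottom, right = roi
    out_h, out_w = output_size
    roi_h, roi_w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Integer bin edges (floor for the start, ceiling for the end)
        r0 = top + (i * roi_h) // out_h
        r1 = max(top + ((i + 1) * roi_h + out_h - 1) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = left + (j * roi_w) // out_w
            c1 = max(left + ((j + 1) * roi_w + out_w - 1) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 4 x 6 feature map whose values increase row by row
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]
print(roi_max_pool(feature_map, (0, 0, 4, 4), (2, 2)))
print(roi_max_pool(feature_map, (1, 2, 3, 6), (2, 2)))
```

Regions of different sizes (4x4 and 2x4 here) both come out as 2x2 grids, so the layers that follow can have a fixed input size while the CNN features are computed only once.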

The features obtained from CNN inference are pooled and the regions are selected:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/fast_rcnn.JPG?raw=true)

With Region of Interest pooling, training is performed end to end, avoiding multiple classifiers.
Note that the SVM is replaced by a softmax layer and the linear box regressor is
replaced by bounding box regressors inside the network.

The disadvantage that still remains is the
selective search, which takes some time.

##Faster R-CNN
The
difference between Faster R-CNN and Fast R-CNN is that Faster
R-CNN uses the CNN features of an architecture such as VGG or Inception for
proposals instead of selective search. The CNN features are further passed
through the region proposal network. A sliding window is passed over the
features and, for a few intuitive aspect ratios at each position, the model
outputs potential bounding boxes and scores:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/faster_rcnn.JPG?raw=true)

Faster R-CNN is faster than Fast R-CNN because the CNN features, computed only
once, are reused for both region proposal and detection.
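
The candidate boxes with a few aspect ratios at one sliding-window position can be sketched as follows. The function name, the base size of 16, and the three ratios are illustrative assumptions, not values taken from the notebook.

```python
import math


def boxes_at_position(center_x, center_y, base_size, aspect_ratios):
    """Return (x_min, y_min, x_max, y_max) boxes of equal area.

    Each aspect ratio is a width / height value.
    """
    boxes = []
    for ratio in aspect_ratios:
        width = base_size * math.sqrt(ratio)
        height = base_size / math.sqrt(ratio)
        boxes.append((center_x - width / 2.0, center_y - height / 2.0,
                      center_x + width / 2.0, center_y + height / 2.0))
    return boxes


for box in boxes_at_position(50.0, 50.0, 16.0, (0.25, 1.0, 4.0)):
    print(box)
```

The region proposal network then outputs a score and a box refinement for each of these candidate boxes at every position of the feature map.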

##Single Shot Multi-Box Detector
SSD (Single Shot Multi-Box Detector) is the fastest of the methods described here. This
method simultaneously predicts the object and finds the bounding box. During
training, there can be a lot of negatives, hence hard-negative mining is used to handle the
class imbalance. The CNN outputs feature maps of various sizes. These are
passed to a 3x3 convolutional filter to predict the bounding boxes.
This step predicts the object and bounding box:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ssd.JPG?raw=true)
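
Hard-negative mining can be sketched as keeping all positive boxes and only the negatives with the highest loss. The function name is hypothetical, and the ratio of three negatives per positive is an illustrative assumption rather than a value stated in this notebook.

```python
def hard_negative_mining(losses, is_positive, negatives_per_positive=3):
    """Return the indices of the boxes that contribute to the loss."""
    positives = [i for i, flag in enumerate(is_positive) if flag]
    negatives = [i for i, flag in enumerate(is_positive) if not flag]
    # The hardest negatives are the ones the model gets most wrong
    negatives.sort(key=lambda i: losses[i], reverse=True)
    kept_negatives = negatives[:negatives_per_positive * len(positives)]
    return sorted(positives + kept_negatives)


losses = [0.2, 0.9, 0.1, 0.8, 0.05, 0.7, 0.3]
is_positive = [True, False, False, False, False, False, False]
print(hard_negative_mining(losses, is_positive))
```

With one positive box, only the three negatives with the highest loss (indices 1, 3, and 5) are kept, so the easy negatives no longer swamp the training signal.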

#Object Detection API
Google released pre-trained models with various algorithms trained on the COCO
dataset for public use. The API is built on top of TensorFlow and intended for
constructing, training, and deploying object detection models. The APIs support
both object detection and localization tasks.

##Installation and Setup
Install the Protocol Buffers (protobuf) compiler with the following commands.
Create a directory for protobuf and download the library directly:

In [38]:
%cd ./data/ch4
%ls

/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4
models/  protoc_3.3/


In [10]:
%mkdir protoc_3.3
%cd protoc_3.3
!wget https://github.com/google/protobuf/releases/download/v3.3.0/protoc-3.3.0-linux-x86_64.zip

/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4/protoc_3.3
--2018-12-02 23:43:38--  https://github.com/google/protobuf/releases/download/v3.3.0/protoc-3.3.0-linux-x86_64.zip
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/protocolbuffers/protobuf/releases/download/v3.3.0/protoc-3.3.0-linux-x86_64.zip [following]
--2018-12-02 23:43:38--  https://github.com/protocolbuffers/protobuf/releases/download/v3.3.0/protoc-3.3.0-linux-x86_64.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/23357588/45727070-2c66-11e7-99e8-4246c50ca001?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20181202%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20

In [11]:
%ls

protoc-3.3.0-linux-x86_64.zip


Change the permissions of the downloaded archive and extract its contents:

In [12]:
!chmod 775 protoc-3.3.0-linux-x86_64.zip
!unzip protoc-3.3.0-linux-x86_64.zip

Archive:  protoc-3.3.0-linux-x86_64.zip
   creating: include/
   creating: include/google/
   creating: include/google/protobuf/
  inflating: include/google/protobuf/any.proto  
  inflating: include/google/protobuf/api.proto  
   creating: include/google/protobuf/compiler/
  inflating: include/google/protobuf/compiler/plugin.proto  
  inflating: include/google/protobuf/descriptor.proto  
  inflating: include/google/protobuf/duration.proto  
  inflating: include/google/protobuf/empty.proto  
  inflating: include/google/protobuf/field_mask.proto  
  inflating: include/google/protobuf/source_context.proto  
  inflating: include/google/protobuf/struct.proto  
  inflating: include/google/protobuf/timestamp.proto  
  inflating: include/google/protobuf/type.proto  
  inflating: include/google/protobuf/wrappers.proto  
   creating: bin/
  inflating: bin/protoc              
  inflating: readme.txt              


Protocol Buffers (protobuf) is Google's language-neutral, platform-neutral,
extensible mechanism for serializing structured data. It serves the same purpose as XML
but is much simpler and faster. TensorFlow models are usually exported to this
format. The data structure is defined once and can then be read or written in
a variety of languages.

In [41]:
%pwd

'/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4/protoc_3.3'

In [42]:
%cd ./../

/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4


Move back to the working folder and clone the `tensorflow/models` repository:

In [16]:
!git clone https://github.com/tensorflow/models.git

Cloning into 'models'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 23133 (delta 21), reused 24 (delta 10), pack-reused 23094
Receiving objects: 100% (23133/23133), 562.90 MiB | 12.08 MiB/s, done.
Resolving deltas: 100% (13510/13510), done.
Checking out files: 100% (2883/2883), done.


Move into the `research` folder of the cloned repository:

In [43]:
%cd models/research/

/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4/models/research


The TensorFlow object detection API uses protobufs for exporting model
weights and the training parameters.

In [0]:
!./../../protoc_3.3/bin/protoc object_detection/protos/*.proto --python_out=.

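The `research` and `slim` directories of the TensorFlow `models` repository
should be appended to `PYTHONPATH`. A `!export` command runs in a temporary
subshell and its effect is lost immediately, so the `%env` magic is used to set
the variable for the notebook and for the commands it launches:

In [0]:
%env PYTHONPATH=.:./slim/

The variable set by the preceding command lasts only for the current session. After a restart, this command has to be run again.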

The installation can be tested by
running the following code:

In [49]:
%pwd

'/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4/models/research'

In [51]:
%ls

a3c_blogpost/                      lstm_object_detection/
adversarial_crypto/                marco/
adversarial_logit_pairing/         maskgan/
adversarial_text/                  minigo/
adv_imagenet_models/               morph_net/
astronet/                          namignizer/
attention_ocr/                     neural_gpu/
audioset/                          neural_programmer/
autoaugment/                       next_frame_prediction/
autoencoder/                       nst_blogpost/
brain_coder/                       object_detection/
cognitive_mapping_and_planning/    pcl_rl/
cognitive_planning/                ptn/
compression/                       qa_kg

In [0]:
!python ./object_detection/builders/model_builder_test.py

##Pre-Trained Models
Model Name | Speed | COCO mAP
---|---|---
`ssd_mobilenet_v1_coco` | fast | 21
`ssd_inception_v2_coco` | fast | 24
`rfcn_resnet101_coco` | medium | 30
`faster_rcnn_resnet101_coco` | medium | 32
`faster_rcnn_inception_resnet_v2_atrous_coco` | slow | 37

Download the SSD model with the MobileNet backbone and extract it to the working directory:

In [55]:
%cd /content/drive/My' 'Drive/Colab' 'Notebooks/Deep_Learning_for_Computer_Vision/data/ch4

/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4


In [56]:
!wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017.tar.gz
!tar -xzvf ssd_mobilenet_v1_coco_11_06_2017.tar.gz

--2018-12-03 00:27:24--  http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 74.125.202.128, 2607:f8b0:4001:c06::80
Connecting to download.tensorflow.org (download.tensorflow.org)|74.125.202.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 128048406 (122M) [application/x-tar]
Saving to: ‘ssd_mobilenet_v1_coco_11_06_2017.tar.gz’


2018-12-03 00:27:27 (43.2 MB/s) - ‘ssd_mobilenet_v1_coco_11_06_2017.tar.gz’ saved [128048406/128048406]

ssd_mobilenet_v1_coco_11_06_2017/
ssd_mobilenet_v1_coco_11_06_2017/model.ckpt.index
ssd_mobilenet_v1_coco_11_06_2017/model.ckpt.meta
ssd_mobilenet_v1_coco_11_06_2017/frozen_inference_graph.pb
ssd_mobilenet_v1_coco_11_06_2017/model.ckpt.data-00000-of-00001
ssd_mobilenet_v1_coco_11_06_2017/graph.pbtxt


* `graph.pbtxt` is the proto definition of the graph
* `frozen_inference_graph.pb` is the graph with its weights frozen, which can be used for inference
* Checkpoint files:
 * `model.ckpt.data-00000-of-00001`
 * `model.ckpt.meta`
 * `model.ckpt.index`

##Re-Train Object Detection Models
The same API lets us retrain a model for our custom dataset. Training on custom
data involves
1. preparing a dataset,
2. selecting the algorithm, and
3. performing fine-tuning.

The whole pipeline can be passed as a parameter to the
training script. The training data has to be converted to TensorFlow records
(TFRecord), a file format provided by Google that makes reading the
data faster than reading regular files.

###Data Preparation for the Pet Dataset
Download the images and annotations:

In [57]:
%pwd

'/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4'

In [58]:
!wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
!wget http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz

--2018-12-03 00:42:42--  http://www.robots.ox.ac.uk/~vgg/data/pets/data/images.tar.gz
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 791918971 (755M) [application/x-gzip]
Saving to: ‘images.tar.gz’


2018-12-03 00:43:16 (22.6 MB/s) - ‘images.tar.gz’ saved [791918971/791918971]

--2018-12-03 00:43:17--  http://www.robots.ox.ac.uk/~vgg/data/pets/data/annotations.tar.gz
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19173078 (18M) [application/x-gzip]
Saving to: ‘annotations.tar.gz’


2018-12-03 00:43:19 (9.36 MB/s) - ‘annotations.tar.gz’ saved [19173078/19173078]



Extract the images and annotations,

In [59]:
!tar -xvf images.tar.gz
!tar -xvf annotations.tar.gz

images/
images/boxer_16.jpg
images/chihuahua_165.jpg
images/pug_183.jpg
images/english_setter_1.jpg
images/chihuahua_170.jpg
images/english_cocker_spaniel_17.jpg
images/samoyed_39.jpg
images/Egyptian_Mau_62.jpg
images/samoyed_36.jpg
images/german_shorthaired_3.jpg
images/Ragdoll_183.jpg
images/British_Shorthair_64.jpg
images/american_pit_bull_terrier_57.jpg
images/beagle_120.jpg
images/american_bulldog_174.jpg
images/chihuahua_101.jpg
images/shiba_inu_136.jpg
images/Abyssinian_136.jpg
images/Siamese_201.jpg
images/Abyssinian_85.jpg
images/saint_bernard_145.jpg
images/Siamese_63.jpg
images/leonberger_164.jpg
images/Maine_Coon_126.jpg
images/samoyed_51.jpg
images/Birman_15.jpg
images/english_cocker_spaniel_181.jpg
images/english_cocker_spaniel_128.jpg
images/leonberger_133.jpg
images/english_cocker_spaniel_6.jpg
images/miniature_pinscher_119.jpg
images/american_pit_bull_terrier_27.jpg
images/Abyssinian_37.jpg
images/Bombay_91.jpg
images/Egyptian_Mau_6.jpg
images/Maine_Coon_173.jpg
images

Create the TensorFlow record files for the dataset with the `create_pet_tf_record.py` script, as they are the required input for the object detection trainer.

The `label_map` for the `Pet` dataset can be found at `object_detection/data/pet_label_map.pbtxt`. Now, move to the `research` folder:

In [60]:
%pwd

'/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4'

In [61]:
%cd ./models/research

/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision/data/ch4/models/research


In [68]:
!pip3 install contextlib2
!pip3 install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/03/a4/9eea8035fc7c7670e5eab97f34ff2ef0ddd78a491bf96df5accedb0e63f5/lxml-4.2.5-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
    100% |████████████████████████████████| 5.8MB 4.2MB/s 
Installing collected packages: lxml
Successfully installed lxml-4.2.5


In [0]:
!python ./object_detection/dataset_tools/create_pet_tf_record.py \
    --label_map_path=./object_detection/data/pet_label_map.pbtxt \
    --data_dir=./../../. \
    --output_dir=./../../.

#Self-Driving Car
Dataset for training a pedestrian detector:

http://pascal.inrialpes.fr/data/human/

Pedestrian Detection:

https://github.com/diegocavalca/machine-learning/blob/master/supervisioned/object.detection_tensorflow/simple.detection.ipynb

Dataset for training a sign detector:

http://www.vision.ee.ethz.ch/~timofter/traffic_signs/ and http://btsd.ethz.ch/shareddata/
