# Training a model in Google Cloud

The container in which this notebook runs has the Google Cloud SDK installed.
To train a model on a cluster in Google Cloud you must first log in. To do so, run the following command from a console on the host:

```
docker exec -it <container name> gcloud init
```

## Config variables

Name of the bucket in Cloud Storage where we save the files

In [26]:
GCS_BUCKET='es_kiff'

Config file name

In [27]:
#CONFIG_FILE = 'rfcn_resnet101GCP.config'
#CONFIG_FILE = 'ssd_mobilenetGCP.config'
CONFIG_FILE = 'faster_rcnn_inception_resnetGCP.config'

Pretrained model name

In [28]:
#PRETRAINING_MODEL = 'faster_rcnn_resnet101_coco_11_06_2017'
#PRETRAINING_MODEL = 'ssd_mobilenet_v1_coco_2018_01_28'
PRETRAINING_MODEL = 'faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28'

Paths to the training, test and label map files

In [29]:
trainRecordPath = '/u01/notebooks/TFM/DatasetCreator/out/train.record'
testRecordPath = '/u01/notebooks/TFM/DatasetCreator/out/test.record'
labelsPath = '/u01/notebooks/TFM/DatasetCreator/out/label_map.pbtxt'
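
Before uploading, it can help to confirm that the three local files exist and to see where each one will land in the bucket. The following is a minimal sketch, not part of the original notebook: `plan_uploads` is a hypothetical helper, and the `data` prefix is assumed from the `gs://$GCS_BUCKET/data/` destination used in the upload cell.

```python
from pathlib import Path


def plan_uploads(local_paths, bucket, prefix="data"):
    """Split local paths into missing files and an upload plan.

    Returns (missing, plan), where `missing` lists paths that are not
    regular files and `plan` maps each existing path to its gs:// URI.
    Nothing is uploaded here; this only computes destinations.
    """
    missing, plan = [], {}
    for path in map(Path, local_paths):
        if path.is_file():
            plan[str(path)] = f"gs://{bucket}/{prefix}/{path.name}"
        else:
            missing.append(str(path))
    return missing, plan


# Example use in the notebook (variables defined in the cells above):
# missing, plan = plan_uploads(
#     [trainRecordPath, testRecordPath, labelsPath], GCS_BUCKET)
# assert not missing, f"Files not found: {missing}"
```

If `missing` is non-empty, the dataset creation step probably needs to be re-run before `gsutil cp` is attempted.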

## Data and configuration preparation

Upload files to Cloud Storage

In [30]:
!gsutil cp $trainRecordPath gs://$GCS_BUCKET/data/
!gsutil cp $testRecordPath gs://$GCS_BUCKET/data/
!gsutil cp $labelsPath gs://$GCS_BUCKET/data/label_map.pbtxt

Copying file:///u01/notebooks/TFM/DatasetCreator/out/train.record [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

- [1 files][246.7 MiB/246.7 MiB]    4.3 MiB/s                                   
Operation completed over 1 objects/246.7 MiB.                                    
Copying file:///u01/notebooks/TFM/DatasetC

Download the pretrained model and upload its checkpoint files to Cloud Storage

In [17]:
#!wget http://storage.googleapis.com/download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_11_06_2017.tar.gz
#!tar -xvf faster_rcnn_resnet101_coco_11_06_2017.tar.gz
#!gsutil cp faster_rcnn_resnet101_coco_11_06_2017/model.ckpt.* gs://$GCS_BUCKET/data/
      
!wget http://download.tensorflow.org/models/object_detection/{PRETRAINING_MODEL}.tar.gz
!tar -xvf {PRETRAINING_MODEL}.tar.gz
!gsutil cp {PRETRAINING_MODEL}/model.ckpt.* gs://$GCS_BUCKET/data/

--2019-05-25 21:18:52--  http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.168.176, 2a00:1450:4003:80a::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.168.176|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 672221478 (641M) [application/x-tar]
Saving to: ‘faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28.tar.gz’


2019-05-25 21:19:27 (18.6 MB/s) - ‘faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28.tar.gz’ saved [672221478/672221478]

faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/
faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/model.ckpt.index
faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/checkpoint
faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/pipeline.config
faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28/model.ckpt.data-00000-of-00001
faster_r
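
The cell above builds the download URL from `PRETRAINING_MODEL` and then copies only the files matching `model.ckpt.*`. As an illustration of which archive members that pattern selects, here is a small sketch; `model_archive_url` and `checkpoint_members` are hypothetical helpers, and the member list in the usage comment is taken from the (truncated) `tar` output shown above.

```python
from fnmatch import fnmatch

# Base URL taken from the wget command in the cell above.
BASE_URL = "http://download.tensorflow.org/models/object_detection"


def model_archive_url(model_name):
    """Return the download URL of the .tar.gz archive for a model name."""
    return f"{BASE_URL}/{model_name}.tar.gz"


def checkpoint_members(member_names, pattern="model.ckpt.*"):
    """Keep archive members whose file name matches the glob pattern.

    Directory entries (names ending in '/') have an empty file name and
    therefore never match.
    """
    return [
        name for name in member_names
        if fnmatch(name.rsplit("/", 1)[-1], pattern)
    ]


# Example with members listed in the tar output above:
# checkpoint_members([
#     "m/", "m/model.ckpt.index", "m/checkpoint", "m/pipeline.config",
#     "m/model.ckpt.data-00000-of-00001",
# ])
# keeps only the two model.ckpt.* entries.
```

Note that the `checkpoint` and `pipeline.config` files are not matched by the pattern, so they are not uploaded.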

Replace paths in config file and upload to Cloud Storage

In [31]:
!sed -i "s|PATH_TO_BE_CONFIGURED|gs://$GCS_BUCKET/data|g" /u01/notebooks/TFM/Configs/$CONFIG_FILE
!gsutil cp /u01/notebooks/TFM/Configs/$CONFIG_FILE gs://$GCS_BUCKET/data/$CONFIG_FILE

Copying file:///u01/notebooks/TFM/Configs/faster_rcnn_inception_resnetGCP.config [Content-Type=application/octet-stream]...
- [1 files][  3.1 KiB/  3.1 KiB]                                                
Operation completed over 1 objects/3.1 KiB.                                      
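
Because `sed -i` edits the config file in place, the `PATH_TO_BE_CONFIGURED` placeholder is gone after the first run, and re-running the cell with a different `GCS_BUCKET` changes nothing. One way around this is to keep the template untouched and write the filled-in config to a separate file. The sketch below assumes that approach; `fill_placeholder` and `render_config` are hypothetical names, not part of the original notebook.

```python
from pathlib import Path

PLACEHOLDER = "PATH_TO_BE_CONFIGURED"


def fill_placeholder(config_text, bucket):
    """Replace every placeholder with the bucket's data path."""
    return config_text.replace(PLACEHOLDER, f"gs://{bucket}/data")


def render_config(template_path, output_path, bucket):
    """Write a filled-in copy of the template, leaving the template intact.

    Raises ValueError if the template no longer contains the placeholder,
    which usually means it was already edited in place.
    """
    text = Path(template_path).read_text()
    if PLACEHOLDER not in text:
        raise ValueError(f"{template_path} contains no {PLACEHOLDER}")
    Path(output_path).write_text(fill_placeholder(text, bucket))


# Illustrative line of the kind found in object detection configs:
# fill_placeholder('input_path: "PATH_TO_BE_CONFIGURED/train.record"', "es_kiff")
```

The rendered copy, rather than the template, would then be the file passed to `gsutil cp`.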


Change to the TensorFlow research folder

In [32]:
cd /u01/notebooks/models/research/

/u01/notebooks/models/research


Package the code to run in Cloud ML

In [33]:
!bash object_detection/dataset_tools/create_pycocotools_package.sh /tmp/pycocotools
!python setup.py sdist
!(cd slim && python setup.py sdist)

Cloning into 'cocoapi'...
remote: Enumerating objects: 953, done.
remote: Total 953 (delta 0), reused 0 (delta 0), pack-reused 953
Receiving objects: 100% (953/953), 11.70 MiB | 11.13 MiB/s, done.
Resolving deltas: 100% (565/565), done.
running sdist
running egg_info
writing object_detection.egg-info/PKG-INFO
writing dependency_links to object_detection.egg-info/dependency_links.txt
writing requirements to object_detection.egg-info/requires.txt
writing top-level names to object_detection.egg-info/top_level.txt
reading manifest file 'object_detection.egg-info/SOURCES.txt'
writing manifest file 'object_detection.egg-info/SOURCES.txt'
running check


creating object_detection-0.1
creating object_detection-0.1/object_detection
creating object_detection-0.1/object_detection.egg-info
creating object_detection-0.1/object_detection/anchor_generators
creating object_detection-0.1/object_detection/box_coders
creating object_detection-0.1/object_detection/builders
creating object_detection-

Run training and validation in Cloud ML

In [34]:
!gcloud ai-platform jobs submit training `whoami`_object_detection_diagrams_`date +%m_%d_%Y_%H_%M_%S` \
    --runtime-version 1.12 \
    --job-dir=gs://$GCS_BUCKET/model_dir \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config /u01/notebooks/TFM/Configs/cloud.yml \
    -- \
    --model_dir=gs://$GCS_BUCKET/model_dir \
    --pipeline_config_path=gs://$GCS_BUCKET/data/$CONFIG_FILE

Job [root_object_detection_diagrams_05_25_2019_22_17_02] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe root_object_detection_diagrams_05_25_2019_22_17_02

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs root_object_detection_diagrams_05_25_2019_22_17_02
jobId: root_object_detection_diagrams_05_25_2019_22_17_02
state: QUEUED
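
The job name in the command above is assembled from `whoami` and `date` by the shell. The same name, and the full argument list, can be built in Python, which makes it easier to inspect before submitting. This is a sketch only: `make_job_name` and `build_submit_command` are hypothetical helpers, all flags are copied from the cell above, and the name check assumes job IDs must start with a letter and contain only letters, digits and underscores (worth verifying against the current AI Platform documentation).

```python
import re
from datetime import datetime

# Packages listed in the --packages flag of the cell above.
PACKAGES = ",".join([
    "dist/object_detection-0.1.tar.gz",
    "slim/dist/slim-0.1.tar.gz",
    "/tmp/pycocotools/pycocotools-2.0.tar.gz",
])


def make_job_name(user, now=None, tag="object_detection_diagrams"):
    """Mirror `whoami`_<tag>_`date +%m_%d_%Y_%H_%M_%S` from the shell command."""
    now = now or datetime.now()
    name = f"{user}_{tag}_{now.strftime('%m_%d_%Y_%H_%M_%S')}"
    # Assumed naming rule; a user name with '-' or '.' would fail here.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid job name: {name!r}")
    return name


def build_submit_command(job_name, bucket, config_file, yaml_path):
    """Return the gcloud invocation as an argument list (nothing is run)."""
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--runtime-version", "1.12",
        f"--job-dir=gs://{bucket}/model_dir",
        "--packages", PACKAGES,
        "--module-name", "object_detection.model_main",
        "--region", "us-central1",
        "--config", yaml_path,
        "--",  # everything after this is passed to model_main
        f"--model_dir=gs://{bucket}/model_dir",
        f"--pipeline_config_path=gs://{bucket}/data/{config_file}",
    ]
```

The returned list could be passed to `subprocess.run(cmd, check=True)` from the research folder, since the package paths are relative to it.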


Add credentials so that TensorBoard can connect to Cloud Storage

In [1]:
!echo '' > /u01/notebooks/TFM/Configs/key.json
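
As written, the cell above writes an empty string to `key.json`, so the file must contain a real service account key before it is of any use. A quick sanity check could look like the following sketch; `load_service_account_key` is a hypothetical helper, and the check relies only on the fact that service account key files are JSON objects with `"type": "service_account"`.

```python
import json


def load_service_account_key(path):
    """Parse a service account key file, raising ValueError if unusable."""
    with open(path) as fh:
        raw = fh.read().strip()
    if not raw:
        raise ValueError(f"{path} is empty")
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{path} is not valid JSON") from exc
    if not isinstance(key, dict) or key.get("type") != "service_account":
        raise ValueError(f"{path} is not a service account key")
    return key


# Example use in the notebook:
# import os
# key_path = "/u01/notebooks/TFM/Configs/key.json"
# load_service_account_key(key_path)
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
```

Setting the variable through `os.environ` in the kernel means later `!` commands inherit it, so the `export ... &&` prefix would no longer be needed in each cell.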

Launch TensorBoard to view the progress of the training and eval jobs on Google Cloud.

If credentials are not yet configured, `gcloud auth application-default login` sets up Application Default Credentials.

In [1]:
!export GOOGLE_APPLICATION_CREDENTIALS="/u01/notebooks/TFM/Configs/key.json" && tensorboard --logdir=gs://es_kiff/model_dir

TensorBoard 1.13.1 at http://134aafc9ad33:6006 (Press CTRL+C to quit)
2019-05-23 03:00:42.844908: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.623136 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-23 03:00:43.638326: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.545661 seconds (attempt 2 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-23 03:00:44.272542: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.38535 seconds (attempt 3 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 

Now we can open the TensorBoard page at [http://localhost:6006](http://localhost:6006)

Export the inference graph from the trained checkpoint to Cloud Storage

In [5]:
!export GOOGLE_APPLICATION_CREDENTIALS="/u01/notebooks/TFM/Configs/key.json" && python object_detection/export_inference_graph.py \
    --input_type=image_tensor  \
    --pipeline_config_path=gs://$GCS_BUCKET/data/$CONFIG_FILE  \
    --trained_checkpoint_prefix=gs://$GCS_BUCKET/model_dir/model.ckpt-100000  \
    --output_directory=gs://$GCS_BUCKET/outputmodel

2019-05-12 23:49:05.872138: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.982213 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-12 23:49:06.950225: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.109334 seconds (attempt 2 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-05-12 23:49:07.124271: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.8935 seconds (attempt 3 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve h
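
The export command above hard-codes `model.ckpt-100000` as the checkpoint prefix. If training stops at a different step, the prefix has to be adjusted. A minimal sketch for picking the highest-numbered checkpoint from a list of object names follows; `latest_checkpoint_prefix` is a hypothetical helper, and the listing is assumed to come from something like `gsutil ls gs://$GCS_BUCKET/model_dir`.

```python
import re

# Matches names such as .../model.ckpt-100000.index and captures the
# prefix (group 1) and the step number (group 2).
_CKPT = re.compile(r"(model\.ckpt-(\d+))\.(?:index|meta|data-\d+-of-\d+)$")


def latest_checkpoint_prefix(object_names):
    """Return the checkpoint prefix with the highest step, or None."""
    best = None
    for name in object_names:
        match = _CKPT.search(name)
        if match is None:
            continue
        step = int(match.group(2))
        if best is None or step > best[0]:
            # Keep everything up to the end of "model.ckpt-<step>".
            best = (step, name[:match.end(1)])
    return None if best is None else best[1]


# Example use in the notebook:
# names = !gsutil ls gs://$GCS_BUCKET/model_dir
# prefix = latest_checkpoint_prefix(names)
```

The result can then be passed as `--trained_checkpoint_prefix` instead of the fixed step number.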