<a href="https://colab.research.google.com/github/hailusong/colab-god-idclass/blob/master/god_idclass_gcs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GCS Setup: Anything Other Than Detection Model and Pipeline Config
For detection model setup, use **god_idclass_gcs_model.ipynb**.

Set up the environment variables.<br>
The **TensorFlow runtime version list** is available [here](https://cloud.google.com/ml-engine/docs/tensorflow/runtime-version-list).

In [0]:
DEFAULT_HOME='/content'
TF_RT_VERSION='1.13'
PYTHON_VERSION='3.5'

YOUR_GCS_BUCKET='id-norm'
YOUR_PROJECT='orbital-purpose-130316'

Select the right model from [this official list](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md):

| model | dataset | datetime | notes |
| - |  - | - | - |
| ssd_inception_v2 | coco | 2018_01_28 | |
| ~~ssd_inception_v3~~ | ~~pets~~ | ~~11_06_2017~~ | |
| ssd_mobilenet_v2 | coco | 2018_03_29 | |
| faster_rcnn_resnet101 | coco | 11_06_2017 | |

In [0]:
MODEL_NAME = 'ssd_mobilenet_v2'
PRETRAINED_DATASET = 'coco'
PRETRAINED_TS = '2018_03_29'
PRETRAINED_MODEL_NAME = f'{MODEL_NAME}_{PRETRAINED_DATASET}_{PRETRAINED_TS}'
PIPELINE_CONFIG_NAME = f'pipeline_{MODEL_NAME}'
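
As a quick check of the naming scheme, the sketch below rebuilds the same name from the rows of the table above. The `MODEL_ZOO_ROWS` dictionary and the `pretrained_model_name` helper are illustrative names introduced here; they are not part of the notebook or of the Object Detection API.

```python
# Rows copied from the table above: model -> (dataset, timestamp).
# The struck-out ssd_inception_v3 row is intentionally left out.
MODEL_ZOO_ROWS = {
    'ssd_inception_v2': ('coco', '2018_01_28'),
    'ssd_mobilenet_v2': ('coco', '2018_03_29'),
    'faster_rcnn_resnet101': ('coco', '11_06_2017'),
}


def pretrained_model_name(model_name):
    """Compose '<model>_<dataset>_<timestamp>' for a model listed in the table."""
    if model_name not in MODEL_ZOO_ROWS:
        raise ValueError(f'unknown model: {model_name!r}')
    dataset, timestamp = MODEL_ZOO_ROWS[model_name]
    return f'{model_name}_{dataset}_{timestamp}'


print(pretrained_model_name('ssd_mobilenet_v2'))  # ssd_mobilenet_v2_coco_2018_03_29
```

Rejecting unknown names early avoids composing a name that has no matching entry in the model zoo.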

## Session and Environment Verification (Destination - Local)

Establish an authenticated session with Google Cloud.

In [0]:
from google.colab import auth
auth.authenticate_user()


################# RE-RUN THE CELLS ABOVE AFTER RESTARTING THE RUNTIME #################

Verify versions: TensorFlow, Python, IPython and prompt_toolkit (these two must have compatible versions), and protoc.

In [0]:
import tensorflow as tf
print(tf.__version__)
assert(tf.__version__.startswith(TF_RT_VERSION + '.')), f'tf.__version__ {tf.__version__} not matching with specified TF runtime version env variable {TF_RT_VERSION}'

1.13.1
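
The assertion above only compares the major.minor prefix of the installed version. A minimal sketch of the same comparison as a reusable helper follows; `matches_runtime` is a hypothetical name, and it assumes plain dotted version strings such as `1.13.1`.

```python
def matches_runtime(installed, required):
    """Return True when the major.minor part of `installed` equals `required`."""
    # Compare whole components, so '1.130.0' does not match '1.13'.
    return installed.split('.')[:2] == required.split('.')[:2]


print(matches_runtime('1.13.1', '1.13'))   # True
print(matches_runtime('1.130.0', '1.13'))  # False
print(matches_runtime('1.12.0', '1.13'))   # False
```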


In [0]:
!python -V
!ipython --version
!pip show prompt_toolkit
!protoc --version

Python 3.6.7
5.5.0
Name: prompt-toolkit
Version: 1.0.15
Summary: Library for building powerful interactive command lines in Python
Home-page: https://github.com/jonathanslenders/python-prompt-toolkit
Author: Jonathan Slenders
Author-email: UNKNOWN
License: UNKNOWN
Location: /usr/local/lib/python3.6/dist-packages
Requires: six, wcwidth
Required-by: jupyter-console, ipython
libprotoc 3.0.0


## Install Google Object Detection API in Colab
Reference is https://colab.research.google.com/drive/1kHEQK2uk35xXZ_bzMUgLkoysJIWwznYr


### Downgrade prompt-toolkit to 1.0.15 (Destination - Local)
Run this **ONLY** if the installation is not working.

In [0]:
# !pip install 'prompt-toolkit==1.0.15'

### Google Object Detection API Installation (Destination - Local)

In [0]:
!apt-get install -y -qq protobuf-compiler python-pil python-lxml
![ ! -e {DEFAULT_HOME}/models ] && git clone --depth=1 --quiet https://github.com/tensorflow/models.git {DEFAULT_HOME}/models
!ls {DEFAULT_HOME}/models

AUTHORS     CONTRIBUTING.md    LICENSE	 README.md  samples    WORKSPACE
CODEOWNERS  ISSUE_TEMPLATE.md  official  research   tutorials


In [0]:
import os
os.chdir(f'{DEFAULT_HOME}/models/research')
!pwd

/content/models/research


*From Wikipedia ...*: 

**protocol buffers** are a language-neutral, platform-neutral extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. 

You define how you want your data to be structured once, then you can **use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages**.

Remember: **.proto defines the structured data** and **protoc generates the source code** to serialize and deserialize it.

In [0]:
!protoc object_detection/protos/*.proto --python_out=.
# !ls object_detection/protos/*.proto
# !cat object_detection/protos/anchor_generator.proto
!ls {DEFAULT_HOME}/models/research/object_detection/builders/anchor*

/content/models/research/object_detection/builders/anchor_generator_builder.py
/content/models/research/object_detection/builders/anchor_generator_builder_test.py
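
The `protoc` call above relies on the shell to expand `*.proto` and on the current directory being `models/research`. A sketch of building the same command from Python, independent of the working directory, is shown below. `build_protoc_command` is a hypothetical helper; `--proto_path` and `--python_out` are standard `protoc` options.

```python
import glob
import os


def build_protoc_command(research_dir):
    """Build the protoc argument list for every .proto file in object_detection/protos."""
    pattern = os.path.join(research_dir, 'object_detection', 'protos', '*.proto')
    proto_files = sorted(glob.glob(pattern))
    if not proto_files:
        raise FileNotFoundError(f'no .proto files match {pattern}')
    # --proto_path is the root used to resolve imports between .proto files;
    # --python_out is where the generated *_pb2.py modules are written.
    return ['protoc',
            f'--proto_path={research_dir}',
            f'--python_out={research_dir}'] + proto_files
```

The returned list can be passed to `subprocess.run(command, check=True)`, which raises if `protoc` exits with an error.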


#### Add Google Object Detection API into System Path

In [0]:
import sys
sys.path.append(f'{DEFAULT_HOME}/models/research')
sys.path.append(f'{DEFAULT_HOME}/models/research/slim')

Note that ! calls out to a shell (in a **NEW** process), while % affects the **SAME** process associated with the notebook.

Since we append paths to sys.path, we **HAVE TO** use the % command to run the Python scripts; a script started with ! would not see those paths.
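
The difference can be demonstrated outside the notebook as well. In this sketch a path appended to `sys.path` is visible in the current interpreter but not in a freshly started one, which is what a `!python ...` call amounts to. The path is a made-up value used only for the demonstration, and the result assumes `PYTHONPATH` does not already contain it.

```python
import subprocess
import sys

# Hypothetical path, used only for this demonstration.
extra_path = '/tmp/hypothetical_object_detection_path'
sys.path.append(extra_path)

# The current interpreter sees the new entry.
in_parent = extra_path in sys.path

# A new interpreter builds its own sys.path and does not inherit the change.
check = f'import sys; print({extra_path!r} in sys.path)'
result = subprocess.run([sys.executable, '-c', check],
                        stdout=subprocess.PIPE,
                        universal_newlines=True,
                        check=True)
in_child = result.stdout.strip() == 'True'

print(in_parent, in_child)
```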

It is also **IMPORTANT** to run **%matplotlib inline** first; otherwise %run model_builder_test.py **fails with a function attribute error** when matplotlib.pyplot attributes are accessed through **IPython's run_line_magic**.

In [0]:
# !find . -name 'inception*' -print
%matplotlib inline

In [0]:
# If you see the error "'function' object has no attribute 'called'", re-run the %matplotlib cell and then this cell
%run object_detection/builders/model_builder_test.py

import os
os.chdir(f'{DEFAULT_HOME}')

............s...
----------------------------------------------------------------------
Ran 16 tests in 0.154s

OK (skipped=1)


## Prepare Our Own Data: Download, Convert and Upload (Destination - GCS)

Use the Google Cloud SDK tool gsutil to download the data file **generated.tar.gz**.<br>
Note that **generated.tar.gz** must first be uploaded to the GCS bucket, prepared as follows:<br>
* Run the BB project idaug to generate the images, bbox CSV and key-points CSV files in the folder **generated**
* Tar/gzip the whole **generated** folder into **generated.tar.gz**
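
The tar/gzip step can be sketched with Python's standard `tarfile` module. `make_archive` is a hypothetical helper, and it assumes the archive should contain a single top-level `generated` folder, which is what the later extraction step expects.

```python
import os
import tarfile


def make_archive(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tar with its base name as the top-level folder."""
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, 'w:gz') as tar:
        # arcname drops the absolute prefix so the archive starts at 'generated/'.
        tar.add(source_dir, arcname=top_level)
    return archive_path
```

For example, `make_archive('generated', 'generated.tar.gz')` would produce the file that is then uploaded to the bucket.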

In [0]:
# Download the file.
!gsutil cp gs://{YOUR_GCS_BUCKET}/generated.tar.gz /tmp/generated.tar.gz
!ls /tmp/*gz

Copying gs://id-norm/generated.tar.gz...
\ [1 files][131.2 MiB/131.2 MiB]                                                
Operation completed over 1 objects/131.2 MiB.                                    
/tmp/generated.tar.gz


Prepare the data file (unzip, untar)

In [0]:
import os
os.chdir(f'{DEFAULT_HOME}')

![[ ! -f /tmp/generated.tar && -f /tmp/generated.tar.gz ]] && gunzip /tmp/generated.tar.gz
![[ ! -e ./generated && -f /tmp/generated.tar ]] && tar xf /tmp/generated.tar
!pwd
!ls {DEFAULT_HOME}/generated

/content
bbox-train-non-id1.csv	bbox-valid-on-dl.csv	pnts-valid-non-id2.csv
bbox-train-non-id2.csv	bbox-valid-on-hc.csv	pnts-valid-non-id3.csv
bbox-train-non-id3.csv	pnts-train-non-id1.csv	pnts-valid-on-dl.csv
bbox-train-on-dl.csv	pnts-train-non-id2.csv	pnts-valid-on-hc.csv
bbox-train-on-hc.csv	pnts-train-non-id3.csv	Train
bbox-valid-non-id1.csv	pnts-train-on-dl.csv	Valid
bbox-valid-non-id2.csv	pnts-train-on-hc.csv
bbox-valid-non-id3.csv	pnts-valid-non-id1.csv
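
The shell cell above skips work that has already been done. A sketch of the same idempotent extraction in Python follows; `extract_once` is a hypothetical helper, and unlike the shell version it decompresses and unpacks in a single step instead of running `gunzip` and `tar` separately.

```python
import os
import tarfile


def extract_once(archive_path, dest_dir, expected_folder='generated'):
    """Extract the archive into dest_dir unless the expected folder already exists."""
    target = os.path.join(dest_dir, expected_folder)
    if os.path.exists(target):
        return False  # already extracted, nothing to do
    # 'r:*' lets tarfile detect the compression (gzip here) on its own.
    with tarfile.open(archive_path, 'r:*') as tar:
        tar.extractall(dest_dir)
    return True
```

The return value tells the caller whether anything was extracted, so repeated runs of the cell stay cheap.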


In [0]:
# Copy the extracted generated folder back to the bucket
!gsutil cp -R {DEFAULT_HOME}/generated gs://{YOUR_GCS_BUCKET}

Concatenate all training CSV files, keep only one header, and name the first column (it has no name in the input because the BB project idaug treats it as the index column).<br>
Apply the same processing to the validation data.

In [0]:
!head -1 {DEFAULT_HOME}/generated/bbox-train-on-dl.csv | sed 's/^,/filename,/' > {DEFAULT_HOME}/train-merged.csv
!head -1 {DEFAULT_HOME}/generated/bbox-valid-on-dl.csv | sed 's/^,/filename,/' > {DEFAULT_HOME}/valid-merged.csv
!tail -q --lines=+2 {DEFAULT_HOME}/generated/bbox-train-*.csv | sed 's/\\/\//g' >> {DEFAULT_HOME}/train-merged.csv
!tail -q --lines=+2 {DEFAULT_HOME}/generated/bbox-valid-*.csv | sed 's/\\/\//g' >> {DEFAULT_HOME}/valid-merged.csv
!ls {DEFAULT_HOME}/generated
!head {DEFAULT_HOME}/train-merged.csv {DEFAULT_HOME}/valid-merged.csv

bbox-train-non-id1.csv	bbox-valid-on-dl.csv	pnts-valid-non-id2.csv
bbox-train-non-id2.csv	bbox-valid-on-hc.csv	pnts-valid-non-id3.csv
bbox-train-non-id3.csv	pnts-train-non-id1.csv	pnts-valid-on-dl.csv
bbox-train-on-dl.csv	pnts-train-non-id2.csv	pnts-valid-on-hc.csv
bbox-train-on-hc.csv	pnts-train-non-id3.csv	Train
bbox-valid-non-id1.csv	pnts-train-on-dl.csv	Valid
bbox-valid-non-id2.csv	pnts-train-on-hc.csv
bbox-valid-non-id3.csv	pnts-valid-non-id1.csv
==> /content/train-merged.csv <==
filename,bbox1_x1,bbox1_y1,bbox1_x2,bbox1_y2,label
generated/Train/non-id1/0.png,10,5,143,93,UNKNOWN
generated/Train/non-id1/1.png,15,0,126,74,UNKNOWN
generated/Train/non-id1/2.png,40,23,119,76,UNKNOWN
generated/Train/non-id1/3.png,20,51,246,202,UNKNOWN
generated/Train/non-id1/4.png,15,33,129,109,UNKNOWN
generated/Train/non-id1/5.png,38,43,114,94,UNKNOWN
generated/Train/non-id1/6.png,51,10,223,125,UNKNOWN
generated/Train/non-id1/7.png,38,48,198,155,UNKNOWN
generated/Train/non-id1/8.png,38,33,255,178,UNKNO
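
The merge done by the shell commands above can also be sketched in Python with the standard `csv` module. `merge_bbox_csvs` is a hypothetical helper. It assumes every input file has the same header, and it rewrites backslashes only in the first column, whereas the `sed` call rewrites them on the whole line.

```python
import csv
import glob


def merge_bbox_csvs(pattern, output_path, first_column='filename'):
    """Merge the CSV files matching pattern into one file with a single header row."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f'no files match {pattern}')
    with open(output_path, 'w', newline='') as out:
        writer = csv.writer(out, lineterminator='\n')
        for index, path in enumerate(paths):
            with open(path, newline='') as src:
                reader = csv.reader(src)
                header = next(reader)
                if index == 0:
                    # The index column has no name in the input; give it one.
                    header[0] = header[0] or first_column
                    writer.writerow(header)
                for row in reader:
                    if not row:
                        continue  # skip blank lines
                    # Normalize Windows-style separators in the file name.
                    row[0] = row[0].replace('\\', '/')
                    writer.writerow(row)
    return output_path
```

A call such as `merge_bbox_csvs('generated/bbox-train-*.csv', 'train-merged.csv')` would correspond to the training half of the cell above.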

Upload the merged CSV files to the GCS bucket. The command below copies the files sequentially; `gsutil -m cp` would run the copy in parallel mode.

In [0]:
!gsutil cp {DEFAULT_HOME}/train-merged.csv {DEFAULT_HOME}/valid-merged.csv gs://{YOUR_GCS_BUCKET}

Copying file:///content/train-merged.csv [Content-Type=text/csv]...
Copying file:///content/valid-merged.csv [Content-Type=text/csv]...
/ [2 files][ 60.0 KiB/ 60.0 KiB]                                                
Operation completed over 2 objects/60.0 KiB.                                     
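
For larger uploads, the parallel variant of the copy can be assembled as follows. `gsutil_cp_command` is a hypothetical helper that only builds the argument list. The one gsutil detail it relies on is that `-m` is a top-level option and therefore has to come before `cp`.

```python
def gsutil_cp_command(sources, destination, parallel=True, recursive=False):
    """Build a gsutil cp argument list; -m must precede the cp sub-command."""
    command = ['gsutil']
    if parallel:
        command.append('-m')  # top-level option: copy with multiple threads/processes
    command.append('cp')
    if recursive:
        command.append('-r')  # needed when a source is a directory
    command.extend(sources)
    command.append(destination)
    return command


print(gsutil_cp_command(['train-merged.csv', 'valid-merged.csv'], 'gs://id-norm'))
```

The resulting list can be handed to `subprocess.run(command, check=True)` in an environment where gsutil is installed and authenticated.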


### Convert Our Label CSV Data to TF Record
Source code is based on https://github.com/datitran/raccoon_dataset/blob/master/generate_tfrecord.py

In [0]:
%pdb

Automatic pdb calling has been turned ON


In [0]:
import os
os.chdir(f'{DEFAULT_HOME}')

!head {DEFAULT_HOME}/train-merged.csv
!mkdir -p {DEFAULT_HOME}/coversion
!git -C {DEFAULT_HOME}/colab-god-idclass pull

# Train records first
%run {DEFAULT_HOME}/colab-god-idclass/src/generate_tfrecord.py --csv_input={DEFAULT_HOME}/train-merged.csv --output_path={DEFAULT_HOME}/coversion/train.record

In [0]:
# Validation records second
!head {DEFAULT_HOME}/valid-merged.csv
%run {DEFAULT_HOME}/colab-god-idclass/src/generate_tfrecord.py --csv_input={DEFAULT_HOME}/valid-merged.csv --output_path={DEFAULT_HOME}/coversion/test.record
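
The conversion script itself is not shown in this notebook. As a rough illustration of what such a conversion typically has to do before writing records, the sketch below groups the CSV rows by image and scales the pixel corners to the [0, 1] range that the Object Detection API expects for box coordinates. The column names come from the merged CSV header above; the helper names and the image size are assumptions made for the example, not taken from `generate_tfrecord.py`.

```python
from collections import defaultdict


def group_rows_by_image(rows):
    """Collect all bounding-box rows that belong to the same image file."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row['filename']].append(row)
    return dict(grouped)


def normalize_box(row, image_width, image_height):
    """Convert pixel corners to fractions of the image width and height."""
    return {
        'xmin': int(row['bbox1_x1']) / image_width,
        'ymin': int(row['bbox1_y1']) / image_height,
        'xmax': int(row['bbox1_x2']) / image_width,
        'ymax': int(row['bbox1_y2']) / image_height,
    }


rows = [
    {'filename': 'generated/Train/non-id1/0.png', 'bbox1_x1': '10', 'bbox1_y1': '5',
     'bbox1_x2': '143', 'bbox1_y2': '93', 'label': 'UNKNOWN'},
]

# The 256x256 size is an assumed value; a real script would read it from the image.
for filename, image_rows in group_rows_by_image(rows).items():
    boxes = [normalize_box(r, image_width=256, image_height=256) for r in image_rows]
    print(filename, boxes)
```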

In [0]:
!gsutil cp {DEFAULT_HOME}/coversion/train.record {DEFAULT_HOME}/coversion/test.record gs://{YOUR_GCS_BUCKET}/data_{MODEL_NAME}

Copying file:///content/coversion/train.record [Content-Type=application/octet-stream]...
Copying file:///content/coversion/test.record [Content-Type=application/octet-stream]...
|
Operation completed over 2 objects/131.8 MiB.                                    
