# Using Google Cloud Dataflow with TFRecorder

This notebook demonstrates how to use TFRecorder with Google Cloud Dataflow so that the same conversion scales to datasets of any size.
    
## Notebook Setup

1. Install TFRecorder by running `python setup.py install` from the repository root.

2. Create a new GCS bucket with the command `gsutil mb gs://<BUCKET_NAME>` and set the `BUCKET` constant below to that name.

3. Copy the test images from the TFRecorder repo to the new GCS bucket with the command `gsutil cp -r ./tfrecorder/test_data/images gs://<BUCKET_NAME>/images`.


In [40]:
import pandas as pd
import tfrecorder
import os

In [39]:
!pip download tfrecorder --no-deps

Collecting tfrecorder
  Downloading tfrecorder-0.1.1-py3-none-any.whl (30 kB)
  Saved ./tfrecorder-0.1.1-py3-none-any.whl
Successfully downloaded tfrecorder
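
The downloaded wheel's filename includes the version number, so a hard-coded path stops working as soon as a newer release is downloaded. As a minimal sketch, the wheel could be located by pattern instead. `find_wheel` is a hypothetical helper written for this notebook, not part of TFRecorder, and it assumes the wheel sits in the directory you pass in.

```python
from pathlib import Path


def find_wheel(directory=".", package="tfrecorder"):
    """Return the absolute path of the most recently modified matching wheel."""
    # Match any version of the package, e.g. tfrecorder-0.1.1-py3-none-any.whl.
    wheels = sorted(
        Path(directory).glob(f"{package}-*.whl"),
        key=lambda p: p.stat().st_mtime,
    )
    if not wheels:
        raise FileNotFoundError(f"no {package} wheel found in {directory!r}")
    # The last entry is the newest file by modification time.
    return str(wheels[-1].resolve())


# Possible usage in the constants cell:
# TFRECORDER_WHEEL = find_wheel()
```

Raising `FileNotFoundError` early is deliberate: a missing wheel is easier to diagnose here than after a job has been submitted.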


In [47]:
BUCKET = "mikebernico-sandbox"  # ADD YOUR BUCKET HERE
PROJECT = "mikebernico-sandbox"  # ADD YOUR PROJECT NAME HERE
REGION = "us-central1"  # ADD A COMPUTE REGION HERE
OUTPUT_PATH = "/results/"  # path inside the bucket where the TFRecords are written
TFRECORDER_WHEEL = os.path.join(os.getcwd(), "tfrecorder-0.1.1-py3-none-any.whl")
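
Because the bucket name and the output path are stored separately, they have to be combined into a full `gs://` URI before they are handed to Dataflow. The sketch below shows one way to do that without producing a double slash or a missing scheme. `gcs_uri` is a hypothetical helper, and `example-bucket` is a placeholder name.

```python
def gcs_uri(bucket, path=""):
    """Join a bucket name and an object path into a gs:// URI."""
    # Accept the bucket with or without the gs:// scheme.
    name = bucket[len("gs://"):] if bucket.startswith("gs://") else bucket
    name = name.strip("/")
    if not name:
        raise ValueError("bucket name must not be empty")
    # Drop leading slashes from the path so the result has exactly one separator.
    return f"gs://{name}/{path.lstrip('/')}"


print(gcs_uri("example-bucket", "/results/"))
```

With the constants above, `gcs_uri(BUCKET, OUTPUT_PATH)` would give the output directory for the job.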

In [49]:
df = pd.read_csv("data.csv")

In [50]:
df

Unnamed: 0,split,image_uri,label
0,TRAIN,../tfrecorder/test_data/images/cat/cat-640x853...,cat
1,VALIDATION,../tfrecorder/test_data/images/cat/cat-800x600...,cat
2,TEST,../tfrecorder/test_data/images/cat/cat-800x600...,cat
3,TRAIN,../tfrecorder/test_data/images/goat/goat-640x6...,goat
4,VALIDATION,../tfrecorder/test_data/images/goat/goat-320x3...,goat
5,TEST,../tfrecorder/test_data/images/goat/goat-640x4...,goat
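
The table above has three data columns (`split`, `image_uri`, `label`) and three split values (`TRAIN`, `VALIDATION`, `TEST`). A quick sanity check before conversion can catch a misnamed column or a misspelled split. The sketch below assumes that exactly these names are what the pipeline expects, based only on the sample shown here. `check_layout` is a hypothetical helper, not a TFRecorder function.

```python
import pandas as pd

# Assumed from the sample table in this notebook, not from the TFRecorder docs.
EXPECTED_COLUMNS = {"split", "image_uri", "label"}
EXPECTED_SPLITS = {"TRAIN", "VALIDATION", "TEST"}


def check_layout(frame):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    missing = EXPECTED_COLUMNS - set(frame.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "split" in frame.columns:
        unexpected = set(frame["split"].unique()) - EXPECTED_SPLITS
        if unexpected:
            # Convert to str so mixed or missing values still sort cleanly.
            problems.append(f"unexpected split values: {sorted(map(str, unexpected))}")
    return problems


sample = pd.DataFrame({
    "split": ["TRAIN", "VALIDATION", "TEST"],
    "image_uri": ["a.jpg", "b.jpg", "c.jpg"],
    "label": ["cat", "cat", "goat"],
})
print(check_layout(sample))
```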


## Update image_uri 

The `image_uri` column currently points to the local path of each test image. Dataflow workers cannot read files from the notebook's local disk, so we rewrite each path to its new GCS location below.

In [51]:
df['image_uri'] = df.image_uri.str.replace("../tfrecorder/", f"gs://{BUCKET}/", regex=False)
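
One detail about this kind of rewrite: when `str.replace` treats the pattern as a regular expression (the default in pandas versions before 2.0), each `.` in `../tfrecorder/` matches any character, and the match can occur anywhere in the string. The sketch below is an alternative that replaces the prefix only when the path actually starts with it. The bucket name `example-bucket` and the helper `to_gcs` are placeholders for illustration.

```python
import pandas as pd

LOCAL_PREFIX = "../tfrecorder/"
GCS_PREFIX = "gs://example-bucket/"  # placeholder bucket name


def to_gcs(uri):
    """Swap the local prefix for the GCS prefix; leave other URIs untouched."""
    if uri.startswith(LOCAL_PREFIX):
        return GCS_PREFIX + uri[len(LOCAL_PREFIX):]
    return uri


sample = pd.DataFrame({
    "image_uri": [
        "../tfrecorder/test_data/images/cat/cat-640x853-1.jpg",
        "gs://example-bucket/test_data/images/goat/goat-640x640-1.jpg",
    ]
})
sample["image_uri"] = sample["image_uri"].map(to_gcs)
print(sample["image_uri"].tolist())
```

Rows that already point at GCS pass through unchanged, so the cell is safe to run more than once.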

In [18]:
!gsutil cp -r ../tfrecorder/test_data/* gs://mikebernico-sandbox/test_data

Copying file://../tfrecorder/test_data/data.csv [Content-Type=text/csv]...
Copying file://../tfrecorder/test_data/images/goat/goat-320x320-2.jpg [Content-Type=image/jpeg]...
Copying file://../tfrecorder/test_data/images/goat/goat-640x640-1.jpg [Content-Type=image/jpeg]...
Copying file://../tfrecorder/test_data/images/goat/goat-640x427-3.jpg [Content-Type=image/jpeg]...
/ [4 files][165.2 KiB/165.2 KiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying file://../tfrecorder/test_data/images/cat/cat-800x600-3.jpg [Content-Type=image/jpeg]...
Copying file://../tfrecorder/test_data/images/cat/cat-800x600-2.jpg [Content-Type=image/jpeg]...
Copying file://../tfrecorder/test_data/images/cat/cat-640x853-1.jpg [Content-Type=image/jpeg]...
Copy

In [52]:
df

Unnamed: 0,split,image_uri,label
0,TRAIN,gs://mikebernico-sandbox/test_data/images/cat/...,cat
1,VALIDATION,gs://mikebernico-sandbox/test_data/images/cat/...,cat
2,TEST,gs://mikebernico-sandbox/test_data/images/cat/...,cat
3,TRAIN,gs://mikebernico-sandbox/test_data/images/goat...,goat
4,VALIDATION,gs://mikebernico-sandbox/test_data/images/goat...,goat
5,TEST,gs://mikebernico-sandbox/test_data/images/goat...,goat
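
Before submitting the job, it can be worth confirming that no local paths remain, since a remote worker would not be able to open them. This is a minimal sketch of such a check. `non_gcs_rows` is a hypothetical helper, and the sample frame uses placeholder paths.

```python
import pandas as pd


def non_gcs_rows(frame, column="image_uri"):
    """Return the rows whose URI does not start with gs://."""
    # astype(str) keeps missing values from breaking the boolean mask.
    is_gcs = frame[column].astype(str).str.startswith("gs://")
    return frame[~is_gcs]


sample = pd.DataFrame({
    "image_uri": [
        "gs://example-bucket/test_data/images/cat/cat-1.jpg",
        "../tfrecorder/test_data/images/goat/goat-1.jpg",
    ],
    "label": ["cat", "goat"],
})
leftover = non_gcs_rows(sample)
print(leftover)
```

An empty result means every row points at GCS.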


In [55]:
# output_dir must be a full gs:// URI so that Dataflow workers can write to it.
df.tensorflow.to_tfr(output_dir=f"gs://{BUCKET}{OUTPUT_PATH}",
                     runner="DataflowRunner",
                     project=PROJECT,
                     region=REGION,
                     tfrecorder_wheel=TFRECORDER_WHEEL)

TypeError: '>=' not supported between instances of 'NoneType' and 'int'

## That's it!

Once the Dataflow job completes, TFRecorder will have transformed the supplied CSV into TFRecords in the output directory, ready for consumption, along with the transform function.