# Using Google Cloud Dataflow with TFRecorder

This notebook demonstrates how to use TFRecorder with Google Cloud Dataflow to scale processing to datasets of any size.
    
## Notebook Setup

1. Install TFRecorder by running `python setup.py install` from the repository root.

2. Create a new GCS bucket with the command `gsutil mb gs://<BUCKET_NAME>/` and set the `BUCKET` constant below to that bucket's URI.

3. Copy the test images from the TFRecorder repo to the new GCS bucket with the command `gsutil cp -r ./tfrecorder/test_data/images gs://<BUCKET_NAME>/test_data/images`


In [21]:
import pandas as pd
import tfrecorder
import os

In [22]:
!pip download tfrecorder --no-deps
!cp tfrecorder* /tmp

Collecting tfrecorder
  File was already downloaded /home/jupyter/tensorflow-recorder/samples/tfrecorder-2.0-py3-none-any.whl
Successfully downloaded tfrecorder


In [23]:
BUCKET="gs://tfrecorder-output/" # ADD YOUR BUCKET HERE, INCLUDING THE TRAILING SLASH, E.G. "gs://mybucket/"
PROJECT="jared-playground" # ADD YOUR PROJECT NAME HERE
REGION="us-central1" # ADD A COMPUTE REGION HERE
OUTPUT_PATH = "results/"
TFRECORDER_WHEEL = "/home/jupyter/tensorflow-recorder/samples/tfrecorder-2.0-py3-none-any.whl" #UPDATE VERSION AS NEEDED

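The output directory used later is built by plain string concatenation (`BUCKET + OUTPUT_PATH`), so a missing trailing slash or `gs://` scheme in `BUCKET` would silently produce a wrong path. The following is a minimal sketch of one way to guard against that. `normalize_bucket` and `join_gcs` are hypothetical helpers introduced here for illustration; they are not part of TFRecorder.

```python
def normalize_bucket(bucket: str) -> str:
    """Return the bucket as 'gs://name/' (hypothetical helper, not a TFRecorder API)."""
    name = bucket.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.strip("/")
    if not name:
        raise ValueError("bucket name is empty")
    return f"gs://{name}/"


def join_gcs(bucket: str, *parts: str) -> str:
    """Join path parts under a bucket without doubling or dropping slashes."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return normalize_bucket(bucket) + "/".join(cleaned)


# The bucket may be given with or without the scheme and trailing slash.
print(join_gcs("gs://tfrecorder-output", "results/"))
```

With helpers like these, `join_gcs(BUCKET, OUTPUT_PATH)` could stand in for `BUCKET + OUTPUT_PATH`, whichever way `BUCKET` was typed.
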
In [33]:
df = pd.read_csv("/home/jupyter/tensorflow-recorder/tfrecorder/test_data/data.csv")

In [34]:
df['image_uri'][0]

'tfrecorder/test_data/images/TEST/cat/cat-800x600-3.jpg'

## Update image_uri 

The `image_uri` column currently points to the local file location of each test image. The cell below rewrites these paths to point to the new GCS location.

In [35]:
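# Replace the local "tfrecorder/" prefix with the bucket URI (literal match, not a regex).
df['image_uri'] = df.image_uri.str.replace("tfrecorder/", BUCKET, regex=False)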

In [36]:
df['image_uri'][0]

'gs://tfrecorder-output/test_data/images/TEST/cat/cat-800x600-3.jpg'

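Note that `str.replace` substitutes the substring wherever it appears in a path, not only at the start. A stricter variant, sketched below, swaps only a leading prefix and rejects paths it does not recognize. `to_gcs_uri` is a hypothetical helper introduced for illustration; the sample paths are the two values shown in the cells above.

```python
def to_gcs_uri(path: str, local_prefix: str, bucket: str) -> str:
    """Swap a leading local prefix for a bucket URI (hypothetical helper)."""
    if path.startswith("gs://"):
        return path  # already remote, leave untouched
    if not path.startswith(local_prefix):
        raise ValueError(f"unexpected path: {path!r}")
    return bucket + path[len(local_prefix):]


paths = [
    "tfrecorder/test_data/images/TEST/cat/cat-800x600-3.jpg",
    "gs://tfrecorder-output/test_data/images/TEST/cat/cat-800x600-3.jpg",
]
remote_paths = [to_gcs_uri(p, "tfrecorder/", "gs://tfrecorder-output/") for p in paths]

# Every path should now be a GCS URI before the job is submitted.
assert all(p.startswith("gs://") for p in remote_paths)
print(remote_paths[0])
```

On the notebook's DataFrame this could be applied with `df['image_uri'].map(lambda p: to_gcs_uri(p, "tfrecorder/", BUCKET))`. Because the function leaves existing `gs://` paths alone, rerunning the cell is harmless.
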
In [38]:
df.tensorflow.to_tfr(output_dir=BUCKET + OUTPUT_PATH,
                     runner="DataflowRunner",
                     project=PROJECT,
                     region=REGION,
                     tfrecorder_wheel=TFRECORDER_WHEEL)


## That's it!

TFRecorder takes the supplied CSV and transforms it into TFRecords under `BUCKET + OUTPUT_PATH`, ready for consumption, along with the transform function.