# Using Google Cloud Dataflow with TFRecorder

This notebook demonstrates how to use TFRecorder with Google Cloud Dataflow so that processing scales to datasets of any size.
    
## Notebook Setup

1. Install TFRecorder by running `python setup.py install` from the repository root.

2. Create a new GCS bucket with the command `gsutil mb gs://<BUCKET_NAME>/` and set the `BUCKET` constant to that name.

3. Copy the test images from the TFRecorder repo to the new GCS bucket with the command `gsutil cp -r ./tfrecorder/test_data/images gs://<BUCKET_NAME>/images`


In [None]:
import pandas as pd
import tfrecorder
import os

In [None]:
!pip download tfrecorder --no-deps
!cp tfrecorder* /tmp
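
The downloaded wheel's file name includes the version number, so a hardcoded path can go stale when a new release appears. The following is a minimal sketch of locating the wheel by pattern instead. `find_wheel` is a hypothetical helper written for this illustration, not part of TFRecorder, and it assumes the wheel was copied to a single directory such as `/tmp`.

```python
import glob
import os


def find_wheel(directory, package="tfrecorder"):
    """Return the path of the newest wheel for `package` in `directory`.

    Hypothetical helper for illustration; it is not part of TFRecorder.
    """
    pattern = os.path.join(directory, f"{package}-*.whl")
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"No wheel matching {pattern!r}")
    # Prefer the most recently modified file when several versions are present.
    return max(matches, key=os.path.getmtime)
```

With this helper, `find_wheel("/tmp")` would return whichever version `pip download` fetched, which could then be used as the wheel path passed to the Dataflow job.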

In [None]:
BUCKET="" # ADD YOUR BUCKET HERE, E.G. "gs://mybucket/" (LOWERCASE, WITH TRAILING SLASH)
PROJECT="" # ADD YOUR PROJECT NAME HERE
REGION="" # ADD A COMPUTE REGION HERE
OUTPUT_PATH = "results/"
TFRECORDER_WHEEL = "/tmp/tfrecorder-0.1.1-py3-none-any.whl" #UPDATE VERSION AS NEEDED
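
The output location is built later by concatenating `BUCKET` and `OUTPUT_PATH`, so the bucket value needs the `gs://` scheme and a trailing slash. As a sketch under that assumption, a small check like the one below could catch a malformed value before a Dataflow job is launched. `normalize_bucket` is a hypothetical helper, not a TFRecorder function.

```python
def normalize_bucket(bucket):
    """Return `bucket` in the form 'gs://name/', or raise ValueError.

    Hypothetical helper for illustration; it is not part of TFRecorder.
    """
    prefix = "gs://"
    bucket = bucket.strip()
    if not bucket.startswith(prefix):
        raise ValueError(f"Expected a gs:// URI, got {bucket!r}")
    # Drop surrounding slashes so exactly one trailing slash is added back.
    name = bucket[len(prefix):].strip("/")
    if not name:
        raise ValueError("Bucket name is empty")
    return f"{prefix}{name}/"


print(normalize_bucket("gs://my-bucket"))  # gs://my-bucket/
```

Adding the trailing slash here means that `BUCKET + "results/"` yields a valid object prefix rather than a bucket name fused with the output directory.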

In [None]:
df = pd.read_csv("data.csv")

## Update image_uri 

The `image_uri` column currently points to the local file location of each test image. The cell below rewrites these paths to point to the new GCS location.

In [None]:
# regex=False treats "../tfrecorder/" as a literal string, so "." is not a wildcard.
df['image_uri'] = df.image_uri.str.replace("../tfrecorder/", BUCKET, regex=False)

In [None]:
df
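
Before launching a job, it can help to confirm that every rewritten URI actually points at the bucket. The sketch below applies a prefix rewrite to plain strings and flags anything left unrewritten. The sample paths, `EXAMPLE_BUCKET`, and the `rewrite_uri` helper are made up for illustration, and unlike the `str.replace` call in the notebook, this helper only rewrites a leading prefix. With the DataFrame itself, an equivalent check would be `df["image_uri"].str.startswith("gs://").all()`.

```python
LOCAL_PREFIX = "../tfrecorder/"      # prefix replaced in the notebook cell
EXAMPLE_BUCKET = "gs://my-bucket/"   # placeholder bucket for this example


def rewrite_uri(uri, local_prefix=LOCAL_PREFIX, bucket=EXAMPLE_BUCKET):
    """Swap a leading local prefix for the bucket; other URIs pass through."""
    if uri.startswith(local_prefix):
        return bucket + uri[len(local_prefix):]
    return uri


# Made-up sample paths; the real values come from data.csv.
uris = [
    "../tfrecorder/test_data/images/a.jpg",
    "/some/other/location/b.jpg",
]
rewritten = [rewrite_uri(u) for u in uris]

# Anything that does not start with gs:// would be unreadable by Dataflow workers.
not_in_gcs = [u for u in rewritten if not u.startswith("gs://")]

print(rewritten[0])  # gs://my-bucket/test_data/images/a.jpg
print(not_in_gcs)    # ['/some/other/location/b.jpg']
```

A non-empty `not_in_gcs` list indicates rows whose images still reference local paths and would need to be uploaded or corrected first.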

In [None]:
df.tensorflow.to_tfr(output_dir=BUCKET + OUTPUT_PATH,
                     runner="DataflowRunner",
                     project=PROJECT,
                     region=REGION,
                     tfrecorder_wheel=TFRECORDER_WHEEL)

# That's it!

TFRecorder has taken the supplied CSV and converted it into TFRecords that are ready for consumption, along with the transform function.