# Using Google Cloud Dataflow with TFRUtil

This notebook demonstrates how to use TFRUtil with Google Cloud Dataflow to process datasets that are too large for a single machine.
    
## Notebook Setup

1. Install TFRUtil by running `python setup.py install` from the repository root.

2. Create a new GCS bucket with the command `gsutil mb gs://<BUCKET_NAME>` and set the `BUCKET` constant to that bucket's URI. Include the `gs://` prefix, since later cells prepend `BUCKET` directly to paths.

3. Copy the test images from the TFRUtil repo to the new GCS bucket with the command `gsutil cp -r ./tfrutil/test_data/images gs://<BUCKET_NAME>/images`.


In [None]:
import pandas as pd
import tfrutil

In [None]:
BUCKET="" # ADD YOUR BUCKET HERE
PROJECT="" # ADD YOUR PROJECT NAME HERE
REGION="" # ADD A COMPUTE REGION HERE
TFRUTIL_PATH = "" # ADD THE LOCAL PATH TO YOUR CLONE OF THE TFRUTIL REPO HERE
OUTPUT_PATH = "/results/"
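
The cells in this notebook build GCS paths by concatenating `BUCKET` with other strings, so an empty or malformed constant may only surface after a job has been submitted. The sketch below shows one way to check the settings first. It is not part of TFRUtil: the function name `check_settings` and the specific rules are assumptions made for illustration.

```python
def check_settings(bucket, project, region, output_path):
    """Return a list of problems found; an empty list means the settings look usable.

    Hypothetical helper, not part of TFRUtil. The rules below are assumptions
    based on how this notebook concatenates BUCKET and OUTPUT_PATH.
    """
    problems = []
    if not bucket.startswith("gs://"):
        problems.append("BUCKET should start with 'gs://'")
    if bucket.endswith("/"):
        problems.append("BUCKET should not end with '/'")
    if not project:
        problems.append("PROJECT is empty")
    if not region:
        problems.append("REGION is empty")
    if not output_path.startswith("/"):
        problems.append("OUTPUT_PATH should start with '/'")
    return problems
```

Calling `check_settings(BUCKET, PROJECT, REGION, OUTPUT_PATH)` and printing the result before running the remaining cells would list anything that still needs to be filled in.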

In [None]:
df = pd.read_csv("data.csv")

In [None]:
df

## Update image_uri 

The `image_uri` column currently points to the local file location of each test image. The cell below changes these paths to the new GCS location.

In [None]:
# Drop the first 20 characters (the local path prefix) and prepend the bucket.
df['image_uri'] = BUCKET + df.image_uri.str.slice(start=20)
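
The `slice(start=20)` call only works when the local prefix is exactly 20 characters long. A more explicit alternative is to strip a named prefix and fail loudly when a path does not match. The sketch below uses plain Python; the function name `to_gcs_uri` and the example paths are hypothetical and do not come from the TFRUtil repo.

```python
def to_gcs_uri(local_path, local_prefix, bucket):
    """Replace a known local prefix with a bucket URI.

    Hypothetical helper, not part of TFRUtil. Raises ValueError when the
    path does not start with the expected prefix.
    """
    if not local_path.startswith(local_prefix):
        raise ValueError(f"{local_path!r} does not start with {local_prefix!r}")
    remainder = local_path[len(local_prefix):]
    # Join with exactly one slash, whatever the inputs end or start with.
    return bucket.rstrip("/") + "/" + remainder.lstrip("/")


# Example values are made up for illustration.
example = to_gcs_uri("/home/user/data/images/cat.jpg", "/home/user/data", "gs://my-bucket")
```

With pandas this could be applied as `df['image_uri'] = df.image_uri.map(lambda p: to_gcs_uri(p, LOCAL_PREFIX, BUCKET))`, where `LOCAL_PREFIX` is a constant you would define yourself.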

In [None]:
df

In [None]:
df.tensorflow.to_tfr(output_dir=BUCKET + OUTPUT_PATH,
                     runner="DataFlowRunner",
                     project=PROJECT,
                     region=REGION,
                     tfrutil_path=TFRUTIL_PATH)

# That's it!

As you can see, TFRUtil has transformed the supplied CSV into TFRecords, ready for consumption, along with the transform function.