## To Google Cloud Platform

As we collect more data, data validation becomes a more time-consuming step in our machine learning workflow. One way to reduce the time it takes is to use available cloud solutions. With a cloud provider, we aren’t limited to the computation power of our laptop or on-premises computing resources.

As an example, we’ll show how to run TFDV on Google Cloud’s Dataflow product. TFDV runs on Apache Beam, which makes switching to GCP Dataflow straightforward. Dataflow lets us accelerate our data validation tasks by parallelizing and distributing them across the nodes allocated to our data-processing task. Although Dataflow charges for the number of CPUs and the gigabytes of memory allocated, it can speed up our pipeline step.

**Dataflow**

Dataflow is a managed service for executing a wide variety of data processing patterns. You can use it to deploy both batch and streaming data processing pipelines.

Google has extensive experience processing big data, and it now makes its data processing software available to its customers. It has also open-sourced the software as Apache Beam.

Cloud Dataflow is a serverless data processing service that runs jobs written using the Apache Beam libraries. When you run a job on Cloud Dataflow, it spins up a cluster of virtual machines (VMs), distributes the tasks in your job to the VMs, and dynamically scales the cluster based on how the job is performing. It may even change the order of operations in your processing pipeline to optimize your job. You don't need to create or manage VMs yourself for huge processing jobs.


**Apache Beam**

The Apache Beam SDK is an open source programming model that enables you to develop both batch and streaming pipelines. You create your pipelines with an Apache Beam program and then run them on the Dataflow service. The [Apache Beam documentation](https://beam.apache.org/) provides in-depth conceptual information and reference material for the Apache Beam programming model, SDKs, and other runners.

In [None]:
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    SetupOptions,
    StandardOptions,
)

PROJECT_ID = '<YOUR_GCP_PROJECT_ID>'
JOB_NAME = '<YOUR_JOB_NAME>'
GCS_STAGING_LOCATION = 'gs://<YOUR_GCP_BUCKET>/staging'
GCS_TMP_LOCATION = 'gs://<YOUR_GCP_BUCKET>/tmp'
GCS_DATA_LOCATION = 'gs://<YOUR_GCP_BUCKET>/<FILE_NAME>.tfrecord'
GCS_STATS_OUTPUT_PATH = 'gs://<YOUR_GCP_BUCKET>/output'

PATH_TO_WHL_FILE = '<PATH_TO_YOUR_WHEEL_FILE>'
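
A malformed bucket path, such as a mistyped scheme or a placeholder that was never replaced, usually only surfaces after the job has been submitted. As a minimal sketch, the hypothetical helper `check_gcs_uri` below (it is not part of TFDV or Apache Beam) rejects strings that do not look like `gs://<bucket>/<object>` before they reach the pipeline options. The pattern is a simplifying assumption: it is deliberately loose and does not enforce the full Cloud Storage naming rules. The bucket and file names in the example are made up.

```python
import re

# Loose, illustrative pattern: the gs:// scheme, a bucket name, and an
# optional object path. This is an assumption made for this sketch, not the
# full set of Cloud Storage naming rules.
_GCS_URI = re.compile(r"^gs://[a-z0-9][a-z0-9._-]*[a-z0-9](/[^<>]*)?$")


def check_gcs_uri(uri: str) -> str:
    """Return `uri` unchanged if it looks like a GCS path, else raise ValueError."""
    if not _GCS_URI.match(uri):
        raise ValueError(f"Not a well-formed gs:// path: {uri!r}")
    return uri


# Example with made-up bucket and file names.
print(check_gcs_uri("gs://my-bucket/data/train.tfrecord"))
```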

In [None]:
# Create and set your PipelineOptions.
options = PipelineOptions()

# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
options.view_as(StandardOptions).runner = 'DataflowRunner'
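
Job names are restricted. To our knowledge, Beam's option validation expects only lowercase letters, digits, and hyphens, starting with a letter and ending with a letter or digit, and Dataflow does not accept a name that is already used by an active job. The sketch below shows one way to derive a suitable `JOB_NAME` from a free-form prefix by appending a UTC timestamp. The helper `make_job_name` and the regular expression are illustrative assumptions, not part of the Beam API, so verify the rule against the Beam version you use.

```python
import re
from datetime import datetime, timezone

# Assumed naming rule (verify against your Beam version): lowercase letters,
# digits and hyphens; must start with a letter and end with a letter or digit.
_JOB_NAME = re.compile(r"^[a-z]([-a-z0-9]*[a-z0-9])?$")


def make_job_name(prefix: str) -> str:
    """Build a lowercase, hyphenated job name with a UTC timestamp suffix."""
    # Replace every run of disallowed characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", prefix.lower()).strip("-")
    # Names must start with a letter, so prepend a fixed prefix if needed.
    if not slug or not slug[0].isalpha():
        slug = f"job-{slug}".rstrip("-")
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    name = f"{slug}-{stamp}"
    if not _JOB_NAME.match(name):
        raise ValueError(f"Could not build a valid job name from {prefix!r}")
    return name


# Example with a made-up prefix.
print(make_job_name("TFDV stats_run"))
```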

Once we have configured the Google Cloud options, we need to configure the setup for the Dataflow workers. All tasks are executed on workers that need to be provisioned with the necessary packages to run their tasks. In our case, we need to install TFDV by specifying it as an additional package.

To do this, download the latest TFDV package (the binary .whl file) from [PyPI](https://pypi.org/project/tensorflow-data-validation/#files) to your local system. Choose a version that can be executed on a Linux system (e.g., tensorflow_data_validation-0.22.0-cp37-cp37m-manylinux2010_x86_64.whl).

To configure the worker setup options, specify the path to the downloaded package in the setup_options.extra_packages list, as shown in the following cell.

In [None]:
setup_options = options.view_as(SetupOptions)
setup_options.extra_packages = [PATH_TO_WHL_FILE]
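
Because the package has to run on Linux, it can help to check the wheel's platform tag before submitting the job. The sketch below relies on the standard wheel file-name layout, in which the last dash-separated field is the platform tag. The helpers `wheel_platform_tags` and `runs_on_linux` are hypothetical names introduced for illustration and are not part of TFDV or Beam. The macOS file name in the example is made up.

```python
from pathlib import Path
from typing import List


def wheel_platform_tags(wheel_path: str) -> List[str]:
    """Return the platform tags encoded in a wheel file name."""
    name = Path(wheel_path).name
    if not name.endswith(".whl"):
        raise ValueError(f"Not a wheel file: {name!r}")
    # Wheel names end in ...-{python tag}-{abi tag}-{platform tag}.whl;
    # the platform field may hold several tags separated by dots.
    parts = name[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError(f"Unexpected wheel file name: {name!r}")
    return parts[-1].split(".")


def runs_on_linux(wheel_path: str) -> bool:
    """True if any platform tag targets Linux or is platform-independent."""
    tags = wheel_platform_tags(wheel_path)
    return any("linux" in tag or tag == "any" for tag in tags)


linux_whl = "tensorflow_data_validation-0.22.0-cp37-cp37m-manylinux2010_x86_64.whl"
mac_whl = "example_pkg-1.0.0-cp37-cp37m-macosx_10_9_x86_64.whl"  # made-up name
print(runs_on_linux(linux_whl), runs_on_linux(mac_whl))
```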

In [None]:
tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION,
                                       output_path=GCS_STATS_OUTPUT_PATH,
                                       pipeline_options=options)

After you have started the data validation with Dataflow, you can switch back to the Google Cloud console. Your newly started job should be listed there.