# Analyzing data with TensorFlow Data Validation

This notebook demonstrates how TensorFlow Data Validation (TFDV) can be used to analyze and validate your data: generating descriptive statistics, inferring and fine-tuning a schema, checking for and fixing anomalies, and detecting drift and skew. It's important to understand your dataset's characteristics, including how they might change over time in your production pipeline. It's also important to look for anomalies in your data, and to compare your training, evaluation, and serving datasets to make sure that they're consistent. TFDV is designed for these tasks.

You are going to use a variant of the Cover Type dataset. For more information about the dataset, refer to [the dataset summary page](../../datasets/covertype/README.md).

## Lab setup
### Import packages and check the versions

In [1]:
import tempfile
import tensorflow as tf
import tensorflow_data_validation as tfdv
import time

from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, DebugOptions, WorkerOptions
from tensorflow_metadata.proto.v0 import schema_pb2

print('TensorFlow version: {}'.format(tf.__version__))
print('TensorFlow Data Validation version: {}'.format(tfdv.__version__))

TensorFlow version: 1.15.0
TensorFlow Data Validation version: 0.15.0


### Set the locations of the dataset files

In [2]:
TRAINING_DATASET='gs://workshop-datasets/covertype/training/covertype_training.csv'
TRAINING_DATASET_WITH_MISSING_VALUES='gs://workshop-datasets/covertype/training_missing/covertype_training_missing.csv'
EVALUATION_DATASET='gs://workshop-datasets/covertype/evaluation/covertype_evaluation.csv'
EVALUATION_DATASET_WITH_ANOMALIES='gs://workshop-datasets/covertype/evaluation_anomalies/covertype_evaluation.csv'
SERVING_DATASET='gs://workshop-datasets/covertype/serving/covertype_serving.csv'

### Configure GCP settings

In [3]:
PROJECT_ID = 'jk-mlops-demo'
REGION = 'us-central1'
STAGING_BUCKET = 'gs://{}-lab11'.format(PROJECT_ID)

### Create a GCP staging bucket

In [4]:
!gsutil mb -p $PROJECT_ID $STAGING_BUCKET 

Creating gs://jk-mlops-demo-lab11/...
ServiceException: 409 Bucket jk-mlops-demo-lab11 already exists.


In [8]:
PATH_TO_WHL_FILE = '/home/tmp/tensorflow_data_validation-0.15.0-cp36-cp36m-manylinux2010_x86_64.whl'

## Computing and visualizing descriptive statistics

First, you use `tfdv.generate_statistics_from_csv` to compute statistics for the training data split. 

TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions.

Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets. For applications that wish to integrate deeper with TFDV (e.g., attach statistics generation at the end of a data-generation pipeline), the API also exposes a Beam PTransform for statistics generation.

Notice that although your dataset is in Google Cloud Storage, you will run your computation locally on the notebook's host, using the Beam DirectRunner. Later in the lab, you will use Cloud Dataflow to calculate statistics on a remote distributed cluster.
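As a rough illustration of what "descriptive statistics" means for a single numeric feature, the plain-Python sketch below computes a handful of the numbers that a statistics pass reports. It is not how TFDV computes them, and the sample rows are made up for the example rather than taken from the Cover Type files.

```python
import csv
import io

# Made-up rows in the spirit of the Cover Type data (assumption, not real records).
SAMPLE_CSV = """Elevation,Horizontal_Distance_To_Hydrology,Wilderness_Area
2596,258,Rawah
2590,,Commanche
2804,268,Rawah
2785,242,Cache
"""


def numeric_summary(rows, column):
    """Return count, missing count, min, max and mean for one numeric column."""
    raw = [row[column] for row in rows]
    # An empty string in the CSV is treated as a missing value.
    values = [float(v) for v in raw if v != ""]
    return {
        "count": len(values),
        "missing": len(raw) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = numeric_summary(rows, "Horizontal_Distance_To_Hydrology")
print(summary)
```

TFDV produces this kind of summary for every feature at once, together with histograms and quantiles, and distributes the work through Beam.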

In [4]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAINING_DATASET_WITH_MISSING_VALUES
)



Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`




You can now use `tfdv.visualize_statistics` to create a visualization of your data. `tfdv.visualize_statistics` uses [Facets](https://pair-code.github.io/facets/), which provides succinct, interactive visualizations to aid in understanding and analyzing machine learning datasets.

In [5]:
tfdv.visualize_statistics(train_stats)

The interactive widget you see is **Facets Overview**. 
- Numeric features and categorical features are visualized separately, including charts showing the distributions for each feature.
- Features with missing or zero values display a percentage in red as a visual indicator that there may be issues with examples in those features. The percentage is the percentage of examples that have missing or zero values for that feature.
- Try clicking "expand" above the charts to change the display
- Try hovering over bars in the charts to display bucket ranges and counts
- Try switching between the log and linear scales
- Try selecting "quantiles" from the "Chart to show" menu, and hover over the markers to show the quantile percentages
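The percentages highlighted in the widget can be reproduced by hand for a single feature. The sketch below is a minimal plain-Python version that assumes a missing value is represented as `None` or an empty string; the function name and the sample values are hypothetical and are not part of Facets or TFDV.

```python
def missing_and_zero_percentages(values):
    """Return (percent missing, percent zero) for a list of feature values."""
    total = len(values)
    if total == 0:
        return 0.0, 0.0
    # Assumption: missing entries show up as None or as an empty string.
    missing = sum(1 for v in values if v is None or v == "")
    zeros = sum(1 for v in values if v == 0)
    return 100.0 * missing / total, 100.0 * zeros / total


# Hypothetical feature values: one missing entry and one zero among five examples.
hydrology = [258, None, 0, 268, 242]
missing_pct, zero_pct = missing_and_zero_percentages(hydrology)
print("missing: {:.1f}%, zeros: {:.1f}%".format(missing_pct, zero_pct))
```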

## Inferring schema
Now let's use `tfdv.infer_schema` to create a schema for the data. A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it's numerical or categorical, or the frequency of its presence in the data. For categorical features the schema also defines the domain - the list of acceptable values. Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.
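To make the idea of schema inference more concrete, here is a small plain-Python sketch that derives a type, a presence flag, and (for string columns) a value domain from a few rows. It only imitates the kind of result an inferred schema contains; the function names, the dictionary layout, and the sample rows are assumptions for illustration and do not reflect TFDV's internal logic.

```python
def _parses_as(text, caster):
    """Return True if caster(text) succeeds, e.g. int("12") or float("1.5")."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def infer_simple_schema(rows):
    """Infer type, presence and (for strings) a domain for every column.

    rows is a list of dicts mapping column name to a string; "" means missing.
    """
    schema = {}
    for column in rows[0]:
        present = [row[column] for row in rows if row[column] != ""]
        if all(_parses_as(v, int) for v in present):
            col_type = "INT"
        elif all(_parses_as(v, float) for v in present):
            col_type = "FLOAT"
        else:
            col_type = "STRING"
        entry = {
            "type": col_type,
            # A column is "required" only if no row is missing a value.
            "presence": "required" if len(present) == len(rows) else "optional",
        }
        if col_type == "STRING":
            entry["domain"] = sorted(set(present))
        schema[column] = entry
    return schema


# Made-up rows (assumption); the second row has a missing distance value.
sample_rows = [
    {"Elevation": "2596", "Horizontal_Distance_To_Hydrology": "258.0", "Wilderness_Area": "Rawah"},
    {"Elevation": "2590", "Horizontal_Distance_To_Hydrology": "", "Wilderness_Area": "Commanche"},
    {"Elevation": "2804", "Horizontal_Distance_To_Hydrology": "268.0", "Wilderness_Area": "Rawah"},
]
sketch_schema = infer_simple_schema(sample_rows)
for name, entry in sketch_schema.items():
    print(name, entry)
```

In this sketch a single missing value is enough to mark a column as optional.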

In [6]:
schema = tfdv.infer_schema(train_stats)
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
| --- | --- | --- | --- | --- |
| 'Wilderness_Area' | STRING | required | | 'Wilderness_Area' |
| 'Aspect' | INT | required | | - |
| 'Cover_Type' | INT | required | | - |
| 'Elevation' | INT | required | | - |
| 'Hillshade_3pm' | INT | required | | - |
| 'Hillshade_9am' | INT | required | | - |
| 'Hillshade_Noon' | INT | required | | - |
| 'Horizontal_Distance_To_Fire_Points' | INT | required | | - |
| 'Horizontal_Distance_To_Hydrology' | FLOAT | optional | single | - |
| 'Horizontal_Distance_To_Roadways' | INT | required | | - |


| Domain | Values |
| --- | --- |
| 'Wilderness_Area' | 'Cache', 'Commanche', 'Neota', 'Rawah' |


Notice that `tfdv.infer_schema` did not infer all features properly. Although both `Soil_Type` and `Cover_Type` are of type `INT`, they should be interpreted as categorical rather than numeric. You can use `tfdv` functions to manually fine-tune the schema.

In [7]:
soil_type_domain = [
"2702", "2703", "2704", "2705", "2706", "2717", "3501", "3502", "4201", "4703", "4704", "4744", "4758", "5101", 
"5151", "6101", "6102", "6731", "7101", "7102", "7103", "7201", "7202", "7700", "7701", "7702", "7709", "7710", 
"7745", "7746", "7755", "7756", "7757", "7790", "8703", "8707", "8708", "8771", "8772", "8776",
]

tfdv.get_feature(schema, 'Soil_Type').type = schema_pb2.FeatureType.BYTES
tfdv.set_domain(schema, 'Soil_Type', schema_pb2.StringDomain(name='Soil_Type', value=soil_type_domain))

tfdv.set_domain(schema, 'Cover_Type', schema_pb2.IntDomain(name='Cover_Type', min=1, max=7, is_categorical=True))

In [8]:
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
| --- | --- | --- | --- | --- |
| 'Wilderness_Area' | STRING | required | | 'Wilderness_Area' |
| 'Aspect' | INT | required | | - |
| 'Cover_Type' | INT | required | | [1,7] |
| 'Elevation' | INT | required | | - |
| 'Hillshade_3pm' | INT | required | | - |
| 'Hillshade_9am' | INT | required | | - |
| 'Hillshade_Noon' | INT | required | | - |
| 'Horizontal_Distance_To_Fire_Points' | INT | required | | - |
| 'Horizontal_Distance_To_Hydrology' | FLOAT | optional | single | - |
| 'Horizontal_Distance_To_Roadways' | INT | required | | - |


| Domain | Values |
| --- | --- |
| 'Wilderness_Area' | 'Cache', 'Commanche', 'Neota', 'Rawah' |
| 'Soil_Type' | '2702', '2703', '2704', '2705', '2706', '2717', '3501', '3502', '4201', '4703', '4704', '4744', '4758', '5101', '5151', '6101', '6102', '6731', '7101', '7102', '7103', '7201', '7202', '7700', '7701', '7702', '7709', '7710', '7745', '7746', '7755', '7756', '7757', '7790', '8703', '8707', '8708', '8771', '8772', '8776' |
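Once a feature has a domain, values can be checked against it. The following plain-Python sketch shows the idea for a numeric range such as the `[1,7]` domain of `Cover_Type` and for a list of allowed strings such as the `Soil_Type` codes. The helper name and the sample values are hypothetical; this is not the check that TFDV runs internally.

```python
def find_domain_violations(values, allowed=None, min_value=None, max_value=None):
    """Return the values that fall outside an allowed set or a numeric range."""
    violations = []
    for value in values:
        if allowed is not None and value not in allowed:
            violations.append(value)
        elif min_value is not None and value < min_value:
            violations.append(value)
        elif max_value is not None and value > max_value:
            violations.append(value)
    return violations


# Cover_Type is constrained to the integer range [1, 7].
cover_type_violations = find_domain_violations([1, 7, 3, 9, 0], min_value=1, max_value=7)

# Soil_Type is constrained to a list of string codes; only a few are repeated here.
soil_subset = {"2702", "2703", "2704"}
soil_type_violations = find_domain_violations(["2702", "9999", "2704"], allowed=soil_subset)

print(cover_type_violations)
print(soil_type_violations)
```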


## Creating statistics using Cloud Dataflow

In the previous step, you created descriptive statistics using local compute. This may work for smaller datasets, but for large datasets you may not have enough local compute power. The `tfdv.generate_statistics_*` functions can use `DataflowRunner` to run Beam processing on a distributed Dataflow cluster.

To run TFDV on Google Cloud Dataflow, the TFDV wheel file must be provided to the Dataflow workers and the Dataflow configuration must be specified through the `pipeline_options` parameter. 

You will also configure `tfdv.generate_statistics_from_csv` to use the pre-defined schema.

### Configure Dataflow settings

In [None]:
MACHINE_TYPE = 'n1-standard-4'
REQUIREMENTS_FILE_PATH = 'requirements.txt'
PATH_TO_WHL_FILE = '/home/tmp/tensorflow_data_validation-0.15.0-cp36-cp36m-manylinux2010_x86_64.whl'

In [None]:
#gcr.io/cloud-dataflow/v1beta3/python36:2.16.0

#FROM gcr.io/cloud-dataflow/v1beta3/python-fnapi:2.8.0

In [29]:
dockerfile_path="Dockerfile"

In [30]:
%%writefile {dockerfile_path}
FROM gcr.io/cloud-dataflow/v1beta3/python36:2.16.0
RUN pip install tensorflow-data-validation

Writing Dockerfile


In [31]:
IMAGE_NAME="lab_11_dataflow"
IMAGE_URI="gcr.io/{}/{}:latest".format(PROJECT_ID, IMAGE_NAME)
!gcloud builds submit --timeout 15m --tag $IMAGE_URI

Creating temporary tarball archive of 3 file(s) totalling 105.2 KiB before compression.
Uploading tarball of [.] to [gs://jk-mlops-demo_cloudbuild/source/1575502316.43-74ab6cd4e7e7434badaebfcac8ab95bf.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/jk-mlops-demo/builds/47404416-d0ff-4454-8d73-7ee8f9bccb1a].
Logs are available at [https://console.cloud.google.com/gcr/builds/47404416-d0ff-4454-8d73-7ee8f9bccb1a?project=538245408688].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "47404416-d0ff-4454-8d73-7ee8f9bccb1a"

FETCHSOURCE
Fetching storage object: gs://jk-mlops-demo_cloudbuild/source/1575502316.43-74ab6cd4e7e7434badaebfcac8ab95bf.tgz#1575502316912464
Copying gs://jk-mlops-demo_cloudbuild/source/1575502316.43-74ab6cd4e7e7434badaebfcac8ab95bf.tgz#1575502316912464...
/ [1 files][ 21.5 KiB/ 21.5 KiB]                                                
Operation completed over 1 objects/21.5 KiB.                                 

In [9]:
%%writefile setup.py

from setuptools import setup

setup(
    name='custom-predictor-2',
    description='Custom prediction routine.',
    version='0.1',
    install_requires=[
 #     'httplib2==0.12.0',
 #     'tensorflow==1.15',
 #     'tensorflow-serving-api==1.15',
      'tensorflow_data_validation==0.15.0'
    ]
)

Overwriting setup.py


In [10]:
options = PipelineOptions()

options.view_as(GoogleCloudOptions).project = PROJECT_ID
options.view_as(GoogleCloudOptions).region = REGION
options.view_as(GoogleCloudOptions).job_name = "tfdv-{}".format(time.strftime("%Y%m%d-%H%M%S"))
options.view_as(GoogleCloudOptions).staging_location = STAGING_BUCKET + '/staging/'
options.view_as(GoogleCloudOptions).temp_location = STAGING_BUCKET + '/tmp/'
options.view_as(StandardOptions).runner = 'DataflowRunner'
#options.view_as(SetupOptions).save_main_session = True
#options.view_as(SetupOptions).extra_packages = [PATH_TO_WHL_FILE]
options.view_as(DebugOptions).experiments = ['shuffle_mode=auto', 'use_fastavro']
#options.view_as(WorkerOptions).worker_harness_container_image = IMAGE_URI
options.view_as(SetupOptions).setup_file = '/home/mlops-labs/Lab-11-TFDV/01-Covertype-Dataset/setup.py'
#options.view_as(SetupOptions).requirements_file = 'requirements.txt'
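The same settings can also be written as command-line style flags, which is sometimes easier to read and to log. The sketch below only builds the list of flag strings, so it runs without Apache Beam installed. The helper name and the placeholder project, bucket, and setup file values are assumptions; the flag names are the standard Beam pipeline options that correspond to the attributes set above.

```python
import time


def dataflow_flags(project, region, staging_bucket, setup_file):
    """Build Beam command-line style flags for a Dataflow run."""
    job_name = "tfdv-{}".format(time.strftime("%Y%m%d-%H%M%S"))
    return [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--region={}".format(region),
        "--job_name={}".format(job_name),
        "--staging_location={}/staging/".format(staging_bucket),
        "--temp_location={}/tmp/".format(staging_bucket),
        "--setup_file={}".format(setup_file),
    ]


flags = dataflow_flags(
    project="my-project",                    # placeholder project ID
    region="us-central1",
    staging_bucket="gs://my-project-lab11",  # placeholder bucket
    setup_file="./setup.py",                 # placeholder path
)
print(flags)

# With Apache Beam installed, the list can be passed to the constructor:
#   from apache_beam.options.pipeline_options import PipelineOptions
#   options = PipelineOptions(flags)
```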

### Regenerate statistics
Regenerate the statistics on Cloud Dataflow, this time passing the fine-tuned schema through `tfdv.StatsOptions`.

In [16]:
stats_options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)

train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAINING_DATASET_WITH_MISSING_VALUES,
    stats_options=stats_options,
    pipeline_options=options,
    output_path=STAGING_BUCKET + '/output/'
)



In [13]:
tfdv.visualize_statistics(train_stats)

In [None]:
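# Compute statistics for the evaluation split, reusing the fine-tuned schema.
stats_options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)

# Use a separate variable so the training statistics are not overwritten.
eval_stats = tfdv.generate_statistics_from_csv(
    data_location=EVALUATION_DATASET,
    stats_options=stats_options
)

tfdv.visualize_statistics(eval_stats)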

In [None]:
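# Compute statistics for the serving split, reusing the fine-tuned schema.
stats_options = tfdv.StatsOptions(schema=schema, infer_type_from_schema=True)

# Use a separate variable so the training statistics are not overwritten.
serving_stats = tfdv.generate_statistics_from_csv(
    data_location=SERVING_DATASET,
    stats_options=stats_options
)

tfdv.visualize_statistics(serving_stats)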
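Statistics for the training, evaluation, and serving splits are usually computed so that the splits can be compared. One common way to quantify the difference between two categorical distributions is the L-infinity distance, which TFDV also uses for categorical features when checking skew and drift. The sketch below is a plain-Python version of that distance; the category counts are made up, and the function is an illustration rather than TFDV's implementation.

```python
def l_infinity_distance(counts_a, counts_b):
    """Largest absolute difference between two normalized categorical distributions."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    categories = set(counts_a) | set(counts_b)
    return max(
        abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b)
        for c in categories
    )


# Made-up Wilderness_Area counts for two splits (assumption, not real data).
training_counts = {"Rawah": 50, "Commanche": 30, "Cache": 10, "Neota": 10}
serving_counts = {"Rawah": 30, "Commanche": 50, "Cache": 10, "Neota": 10}

distance = l_infinity_distance(training_counts, serving_counts)
print("L-infinity distance: {:.2f}".format(distance))
```

A distance of 0 means the two splits have identical category frequencies. A larger value points to a category whose share differs noticeably between the splits.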

In [None]:
#schema = schema_pb2.Schema()

#schema.feature.add(name='Elevation', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Aspect', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Slope', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Horizontal_Distance_To_Hydrology', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Vertical_Distance_To_Hydrology', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Horizontal_Distance_To_Roadways', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Hillshade_9am', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Hillshade_Noon', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Hillshade_3pm', type=schema_pb2.FeatureType.FLOAT)
#schema.feature.add(name='Horizontal_Distance_To_Fire_Points', type=schema_pb2.FeatureType.FLOAT)

#schema.feature.add(name='Wilderness_Area', type=schema_pb2.FeatureType.BYTES)
#schema.feature.add(name='Soil_Type', type=schema_pb2.FeatureType.BYTES)

#schema.feature.add(name='Cover_Type', type=schema_pb2.FeatureType.INT)
#tfdv.set_domain(schema, 'Cover_Type', schema_pb2.IntDomain(min=1, max=7, is_categorical=True))
