# Data Validation with TensorFlow Data Validation (TFDV)
Data validation is the process of examining data to ensure that it is fit for use before training (training data) or inference (serving data).
One way to gain confidence in the validity of data is to check that it conforms to a given formal specification.

[TensorFlow Data Validation](https://github.com/tensorflow/data-validation) (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX). Below is a simple example of data validation with TFDV.

NB: This notebook's runtime type is Python 2.

## Set-up

In [1]:
!apt-get install python-dev python-snappy

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-dev is already the newest version (2.7.15~rc1-1).
The following NEW packages will be installed:
  python-snappy
0 upgraded, 1 newly installed, 0 to remove and 28 not upgraded.
Need to get 10.8 kB of archives.
After this operation, 39.9 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-snappy amd64 0.5-1.1build2 [10.8 kB]
Fetched 10.8 kB in 0s (257 kB/s)
Selecting previously unselected package python-snappy.
(Reading database ... 132681 files and directories currently installed.)
Preparing to unpack .../python-snappy_0.5-1.1build2_amd64.deb ...
Unpacking python-snappy (0.5-1.1build2) ...
Setting up python-snappy (0.5-1.1build2) ...


In [2]:
!pip install -q tensorflow_data_validation


In [3]:
# __future__ imports must come before any other statement.
from __future__ import print_function

import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv

  'You are using Apache Beam with Python 2. '


## Basic Dataset Description

In [5]:
dataset = pd.read_csv("pollution_small.csv")
dataset.shape

(2188, 5)

In [6]:
trainingdset = dataset[:1600]
trainingdset.describe()

| | pm10 | no2 | so2 | soot |
|---|---|---|---|---|
| count | 1600.0 | 1600.0 | 1600.0 | 1600.0 |
| mean | 49.656494 | 30.980519 | 16.229981 | 21.551956 |
| std | 35.211906 | 12.400788 | 10.621896 | 12.127354 |
| min | 6.38 | 9.74 | 4.01 | 6.0 |
| 25% | 28.345 | 22.5675 | 9.7775 | 14.4 |
| 50% | 38.835 | 28.715 | 13.275 | 18.63 |
| 75% | 58.05 | 36.37 | 19.2825 | 24.0725 |
| max | 277.25 | 138.01 | 123.13 | 107.65 |


In [7]:
testdset = dataset[1600:]
testdset.describe()

| | pm10 | no2 | so2 | soot |
|---|---|---|---|---|
| count | 588.0 | 588.0 | 588.0 | 588.0 |
| mean | 44.648248 | 37.296922 | 13.60517 | 18.44131 |
| std | 28.992087 | 10.94005 | 5.098944 | 6.596459 |
| min | 11.9 | 15.07 | 4.99 | 8.0 |
| 25% | 28.3375 | 29.2175 | 10.1225 | 14.41 |
| 50% | 35.555 | 35.815 | 12.345 | 17.09 |
| 75% | 50.8125 | 43.8725 | 15.855 | 20.9625 |
| max | 273.77 | 106.03 | 38.03 | 87.21 |


## Basic Data Analysis and Validation
[Get started with TFDV](https://www.tensorflow.org/tfx/data_validation/get_started)

### Generate Statistics and Visualize

In [8]:
traindset_stats = tfdv.generate_statistics_from_dataframe(dataframe=trainingdset)
tfdv.visualize_statistics(traindset_stats)
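
To give a feel for the kind of per-column summary that a statistics step produces, here is a minimal sketch in plain Python. It is a conceptual illustration only and not TFDV's implementation: the helper `summarize_column` and the sample values are hypothetical.

```python
from __future__ import print_function

# Conceptual sketch only: this is NOT how TFDV computes statistics.
# summarize_column and the sample values below are hypothetical.


def summarize_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": sum(present) / float(len(present)) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Made-up pm10 readings with one missing entry.
pm10 = [28.0, None, 38.0, 60.0]
summary = summarize_column(pm10)
print(summary)
```

Counting missing values separately from present ones matters because the presence of a column is one of the properties a schema can later constrain.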

### Inferring the schema

In [9]:
schema = tfdv.infer_schema(statistics=traindset_stats)
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'soot' | FLOAT | required | | - |
| 'no2' | FLOAT | required | | - |
| 'pm10' | FLOAT | required | | - |
| 'Date' | BYTES | required | | - |
| 'so2' | FLOAT | required | | - |
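
The idea behind schema inference can be sketched in a few lines of plain Python: look at the observed values of each column, pick a type, and mark the column as required if it is never missing. This is a conceptual illustration under simplifying assumptions, not TFDV's algorithm. The function `infer_toy_schema`, its output format, and the sample rows are all hypothetical.

```python
from __future__ import print_function

# Conceptual sketch only: this is NOT how TFDV infers a schema.
# infer_toy_schema and its dictionary output are hypothetical.


def infer_toy_schema(rows):
    """Infer a type and a presence flag for every column seen in `rows`."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    schema = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        if present and all(isinstance(v, float) for v in present):
            kind = "FLOAT"
        elif present and all(
            isinstance(v, int) and not isinstance(v, bool) for v in present
        ):
            kind = "INT"
        else:
            # Anything else (for example text) falls back to BYTES.
            kind = "BYTES"
        presence = "required" if len(present) == len(rows) else "optional"
        schema[name] = {"type": kind, "presence": presence}
    return schema


# Made-up rows that reuse the column names of the pollution dataset.
rows = [
    {"Date": "day-1", "pm10": 38.8, "no2": 28.7, "so2": 13.3, "soot": 18.6},
    {"Date": "day-2", "pm10": 58.1, "no2": 36.4, "so2": 19.3, "soot": None},
]
toy_schema = infer_toy_schema(rows)
for name, spec in toy_schema.items():
    print(name, spec["type"], spec["presence"])
```

In this toy data `soot` has a missing value, so the sketch marks it as optional, whereas in the schema above every feature is required.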


## Test Dataset Validation
Does the test dataset conform to the Schema?

### Check for Anomalies
The original test dataset conforms to the schema, so validating it should report no anomalies.

In [10]:
testdset_stats = tfdv.generate_statistics_from_dataframe(dataframe=testdset)
anomalies = tfdv.validate_statistics(statistics=testdset_stats, schema=schema)
tfdv.display_anomalies(anomalies)

However, a "corrupted" version of the original test dataset will not comply with the specification.

In [14]:
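testdset_copy = testdset.copy()
# Attempt to corrupt pm10 by swapping a numeric value for a string. Note that 6.38 is
# the training-set minimum; the test-set minimum is 11.9, so nothing is replaced here.
testdset_copy["pm10"].replace(6.38,"six point thirty eight",inplace=True)
# Drop a column that the schema marks as required.
testdset_copy.drop("soot", axis=1, inplace=True)
testdset_copy_stats = tfdv.generate_statistics_from_dataframe(dataframe=testdset_copy)
anomalies_testdset_copy = tfdv.validate_statistics(statistics=testdset_copy_stats, schema=schema)
tfdv.display_anomalies(anomalies_testdset_copy)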

| Feature name | Anomaly short description | Anomaly long description |
|---|---|---|
| 'soot' | Column dropped | Column is completely missing |
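
Conceptually, validation compares what the new data contains against what the schema expects. The sketch below shows that comparison in plain Python for two kinds of problems: a missing column and a value of the wrong type. It is an illustration only, not TFDV's implementation. The function `find_toy_anomalies`, the dictionary-based schema, and the sample rows are hypothetical.

```python
from __future__ import print_function

# Conceptual sketch only: this is NOT how TFDV validates statistics.
# find_toy_anomalies and the dictionary-based schema are hypothetical.


def find_toy_anomalies(schema, rows):
    """Return {column: description} for columns that violate `schema`."""
    anomalies = {}
    for name, expected_type in schema.items():
        values = [row[name] for row in rows if name in row]
        if not values:
            anomalies[name] = "Column dropped"
        elif not all(isinstance(v, expected_type) for v in values):
            anomalies[name] = "Unexpected type"
    return anomalies


toy_schema = {"pm10": float, "no2": float, "so2": float, "soot": float}

# Made-up rows: "soot" is absent and one "pm10" value is text.
toy_rows = [
    {"pm10": 35.6, "no2": 35.8, "so2": 12.3},
    {"pm10": "six point thirty eight", "no2": 43.9, "so2": 15.9},
]
toy_anomalies = find_toy_anomalies(toy_schema, toy_rows)
print(toy_anomalies)
```

Here both corruptions are caught because the toy rows really do contain a text value in `pm10`; a corruption that never touches the data cannot be reported.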


## Dealing with Schemas in Production
Implementing machine learning (and in particular deep learning) applications at production level requires awareness of the end-to-end pipeline.

With that in mind, it is advisable to define a single schema covering all the data that could ever flow through the pipeline. It is equally advisable to define subsets of that schema, referred to as environments, that describe the data expected at particular stages of the pipeline. In the running example, `soot` is expected during training but not at serving time.

In [0]:
schema.default_environment.append("TRAINING")
schema.default_environment.append("SERVING")

tfdv.get_feature(schema, "soot").not_in_environment.append("SERVING")

### Checking for anomalies between the SERVING environment and new test set
Data expected in the inference phase of the ML application should be checked for anomalies relative to the SERVING environment. In our running example, the omission of column `soot` should not raise an alert.

In [16]:
anomalies_env_serving = tfdv.validate_statistics(testdset_copy_stats, schema, environment="SERVING")
tfdv.display_anomalies(anomalies_env_serving)
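
The effect of environments can be sketched in plain Python: each feature lists the environments it is excluded from, and a check only looks for the features expected in the environment being validated. This is a conceptual illustration, not TFDV's implementation. The helpers `expected_columns` and `missing_columns` and the dictionary layout are hypothetical.

```python
from __future__ import print_function

# Conceptual sketch only: this is NOT how TFDV handles environments.
# expected_columns, missing_columns and the schema layout are hypothetical.

toy_schema = {
    "Date": {"not_in_environment": []},
    "pm10": {"not_in_environment": []},
    "no2": {"not_in_environment": []},
    "so2": {"not_in_environment": []},
    "soot": {"not_in_environment": ["SERVING"]},
}


def expected_columns(schema, environment):
    """Columns that the schema expects to see in `environment`."""
    return [
        name
        for name, spec in schema.items()
        if environment not in spec["not_in_environment"]
    ]


def missing_columns(schema, columns, environment):
    """Expected columns that are absent from the observed `columns`."""
    return [
        name
        for name in expected_columns(schema, environment)
        if name not in columns
    ]


# Columns of a serving-style dataset without "soot".
serving_columns = ["Date", "pm10", "no2", "so2"]
missing_for_training = missing_columns(toy_schema, serving_columns, "TRAINING")
missing_for_serving = missing_columns(toy_schema, serving_columns, "SERVING")
print("TRAINING:", missing_for_training)
print("SERVING:", missing_for_serving)
```

The same data is flagged under TRAINING and accepted under SERVING, which is the behaviour the environment mechanism is meant to provide.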