# TensorFlow Data Validation
This notebook walks through the main features of standalone TensorFlow Data Validation (TFDV). The topic was covered in the 2021 Q2 chapter conference. [Here is the link to the video](https://drive.google.com/file/d/12YUv_k-ORXjcJ_pqGEXHCIJvf8lsw7Sy) if you want to find out more about data drift and skew, and divergence metrics.

# Perform the necessary installations

In [None]:
!pip install tensorflow_data_validation

Restart runtime before continuing.

In [None]:
import os
import urllib.request  # import the submodule explicitly; `import urllib` alone does not guarantee it is loaded
import zipfile

import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

Download the data.

In [None]:
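# Download the archive to a temporary file (avoid shadowing the built-in `zip`)
zip_path, _ = urllib.request.urlretrieve('https://storage.googleapis.com/artifacts.tfx-oss-public.appspot.com/datasets/chicago_data.zip')
# The context manager closes the archive once extraction is done
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall()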

train_data = pd.read_csv(os.path.join('data', 'train', 'data.csv'))
test_data = pd.read_csv(os.path.join('data', 'eval', 'data.csv'))
serving_data = pd.read_csv(os.path.join('data', 'serving', 'data.csv'))

# Generating and visualizing statistics
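You can generate statistics from a pandas DataFrame (`tfdv.generate_statistics_from_dataframe`), a TFRecord file (`tfdv.generate_statistics_from_tfrecord`), or a CSV file (`tfdv.generate_statistics_from_csv`).

Here we generate and visualize the statistics of the training DataFrame.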

In [None]:
train_stats = tfdv.generate_statistics_from_dataframe(train_data)
tfdv.visualize_statistics(train_stats)

Some useful features of this chart are the log checkbox and the option to view quantiles (by clicking the dropdown at the top right).

Some of the possible issues that are flagged:
* `pickup_census_tract` is always null
* `dropoff_census_tract` and `company` are missing quite often
* Lots of zero values for `trip_miles`
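
To make these flags concrete, the missing-value and zero-value fractions shown in the visualization can be approximated by hand. The sketch below uses plain Python on a tiny made-up sample; the rows and the helper names `missing_fraction` and `zero_fraction` are illustrative assumptions, not part of the TFDV API, and `None` stands in for a missing value.

```python
# Illustrative sketch only: hand-rolled versions of two numbers shown in the
# statistics view. The sample rows are made up, not taken from the dataset.

def missing_fraction(rows, feature):
    """Fraction of rows in which `feature` is absent or None."""
    missing = sum(1 for row in rows if row.get(feature) is None)
    return missing / len(rows)

def zero_fraction(rows, feature):
    """Fraction of the present values of `feature` that equal zero."""
    present = [row[feature] for row in rows if row.get(feature) is not None]
    if not present:
        return 0.0
    return sum(1 for value in present if value == 0) / len(present)

sample_rows = [
    {'trip_miles': 0.0, 'company': None},
    {'trip_miles': 1.2, 'company': 'Company A'},
    {'trip_miles': 0.0, 'company': None},
    {'trip_miles': 3.4, 'company': None},
]

print(missing_fraction(sample_rows, 'company'))   # 0.75
print(zero_fraction(sample_rows, 'trip_miles'))   # 0.5
```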

# Schema inference
From the generated statistics, TFDV can infer a data schema, which we display below.

In [None]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

The `presence` column indicates if a feature can be missing. The `valency` column indicates the number of values that are required for that feature per training sample (for categorical features, single implies there must be exactly one category per sample).

This automatically inferred schema can be a good starting point for a custom schema. You can access the features of a schema using `tfdv.get_feature(schema, 'feature_name')`. You can access the domain of a feature using `tfdv.get_domain(schema, 'feature_name')`, or set it using `tfdv.set_domain(schema, 'feature_name', domain)`.

Suppose we want to set the `trip_miles` feature to be between 0 and 500, based on the statistics of the training data we saw before.

In [None]:
tfdv.set_domain(schema, 'trip_miles', schema_pb2.FloatDomain(min=0, max=500))

tfdv.display_schema(schema=schema)
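
Conceptually, a float domain is a range check on the observed values. The following is a minimal sketch in plain Python with a hypothetical helper `out_of_range` and made-up values. TFDV itself validates against the computed statistics rather than going through the rows one by one, so this only illustrates the idea.

```python
def out_of_range(values, minimum, maximum):
    """Return the values that fall outside the closed range [minimum, maximum]."""
    return [v for v in values if v < minimum or v > maximum]

trip_miles_sample = [0.0, 2.5, 17.3, 1710.0]  # made-up values
violations = out_of_range(trip_miles_sample, minimum=0, maximum=500)
print(violations)  # [1710.0]
```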

# Comparing two datasets
Now that we have our schema, let's check out the test set. We can visualize the two datasets next to each other by passing left-hand side and right-hand side statistics to the `visualize_statistics` function.

In [None]:
test_stats = tfdv.generate_statistics_from_dataframe(test_data)
tfdv.visualize_statistics(lhs_statistics=test_stats, rhs_statistics=train_stats, lhs_name='test dataset', rhs_name='train dataset')

Some issues can be spotted by looking at the differences in minimum and maximum values. Also, note the percentages checkbox that appears when comparing two datasets. It makes the distributions easier to compare when the datasets differ in size.

We can check for schema anomalies using the `validate_statistics` function. It checks whether the input statistics conform to the schema defined earlier.

In [None]:
anomalies = tfdv.validate_statistics(statistics=test_stats, schema=schema)
tfdv.display_anomalies(anomalies)

The categorical feature `company` has unexpected values, but each of them occurs less than 1% of the time. `payment_type` also has an unexpected value, and `trip_miles` does not conform to the domain we set earlier.

You can access a feature of the schema using `tfdv.get_feature(schema, 'feature_name')` and the domain of a feature using `tfdv.get_domain(schema, 'feature_name')`.

In [None]:
# Relax requirement on unexpected string values for company categories
company = tfdv.get_feature(schema, 'company')
company.distribution_constraints.min_domain_mass = 0.9

# Add a new category to the payment type domain
payment_type_domain = tfdv.get_domain(schema, 'payment_type')
payment_type_domain.value.append('Prcard')

# Increase domain range of trip_miles
trip_miles_domain = tfdv.get_domain(schema, 'trip_miles')
trip_miles_domain.max = 2000.0

updated_anomalies = tfdv.validate_statistics(test_stats, schema)
tfdv.display_anomalies(updated_anomalies)
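
The `min_domain_mass` setting can be read as the minimum fraction of values that must fall inside the feature's domain. A rough sketch of that computation follows, using made-up category counts and a hypothetical helper `domain_mass` that is not part of TFDV.

```python
from collections import Counter

def domain_mass(values, domain):
    """Fraction of values that belong to the allowed set `domain`."""
    counts = Counter(values)
    in_domain = sum(count for value, count in counts.items() if value in domain)
    return in_domain / len(values)

domain = {'Company A', 'Company B'}
# Made-up observations: 5% of the values are outside the domain
observed = ['Company A'] * 60 + ['Company B'] * 35 + ['Company C'] * 5

mass = domain_mass(observed, domain)
print(mass)         # 0.95
print(mass >= 0.9)  # True: acceptable when the minimum domain mass is 0.9
```

With the default requirement that every value lies in the domain, the same sample would be flagged; lowering the threshold to 0.9 tolerates rare unseen categories.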

Other schema changes you can make include:
* Changing the type of the feature: `feature.type = 1` (`0` for unknown, `1` for bytes, which is used for strings, `2` for int, `3` for float, `4` for struct)
* Setting the minimum required presence of a feature: `feature.presence.min_fraction=0.9`
* Changing the valency of a feature: `feature.value_count.min = min` or `feature.value_count.max = max`

# Environment-based schema

The schema can differ per environment. For example, validating the serving data against the current schema gives:

In [None]:
serving_stats = tfdv.generate_statistics_from_dataframe(serving_data)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

Here we see that the `tips` feature is missing from the serving data. This is expected: `tips` is our label, which is not available at serving time.

This is where environments come in. You can create extra environments for your schema using `schema.default_environment.append('env_name')`, and include or exclude a feature from an environment using `feature.in_environment.append('env_name')` or `feature.not_in_environment.append('env_name')`.

In [None]:
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

tips_feature = tfdv.get_feature(schema, 'tips')
tips_feature.not_in_environment.append('SERVING')

serving_anomalies = tfdv.validate_statistics(serving_stats, schema, environment='SERVING')
tfdv.display_anomalies(serving_anomalies)
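
One way to think about how these three fields combine is sketched below. This is a simplified model written in plain Python, not TFDV code, and the helper `feature_environments` is hypothetical: it assumes a feature is expected in the default environments plus its `in_environment` entries, minus its `not_in_environment` entries.

```python
def feature_environments(default_environments, in_environment=(), not_in_environment=()):
    """Environments in which a feature is expected (simplified model)."""
    return (set(default_environments) | set(in_environment)) - set(not_in_environment)

defaults = ['TRAINING', 'SERVING']

# `tips` is excluded from SERVING, as in the cell above
tips_envs = feature_environments(defaults, not_in_environment=['SERVING'])
# A feature without overrides is expected in every default environment
other_envs = feature_environments(defaults)

print(sorted(tips_envs))   # ['TRAINING']
print(sorted(other_envs))  # ['SERVING', 'TRAINING']
```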

# Distribution Drift & Skew

Using the skew and drift comparators, you can configure your schema to detect skew and drift. Both comparators are configured in the same way. The skew comparator is meant to detect changes between training and serving statistics, and the drift comparator is meant to detect changes between two datasets collected at different points in time. In `validate_statistics`, skew is checked against `serving_statistics` and drift against `previous_statistics`.
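
For intuition about the two metrics that the thresholds refer to, here is a minimal plain-Python sketch of the L-infinity distance and the Jensen-Shannon divergence between two categorical distributions. The helper names and the data are made up for illustration, and the divergence uses a base-2 logarithm as an assumption; TFDV's own computation (for example its binning of numeric features) may differ in the details.

```python
import math
from collections import Counter

def normalize(values):
    """Turn a list of categorical values into a {category: probability} dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

def l_infinity_distance(p, q):
    """Largest absolute difference in probability over all categories."""
    categories = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithm (ranges from 0 to 1)."""
    divergence = 0.0
    for c in set(p) | set(q):
        pc, qc = p.get(c, 0.0), q.get(c, 0.0)
        mc = 0.5 * (pc + qc)  # probability under the mixture distribution
        if pc > 0:
            divergence += 0.5 * pc * math.log2(pc / mc)
        if qc > 0:
            divergence += 0.5 * qc * math.log2(qc / mc)
    return divergence

# Made-up samples standing in for training and serving data
train = normalize(['Cash'] * 60 + ['Credit Card'] * 40)
serving = normalize(['Cash'] * 50 + ['Credit Card'] * 50)

print(l_infinity_distance(train, serving))        # approximately 0.1
print(jensen_shannon_divergence(train, serving))  # small positive number
```

With a threshold of 0.001, the L-infinity distance of roughly 0.1 in this toy sample would be reported as an anomaly; with a threshold of 0.2 it would not.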

In [None]:
pickup_community_area_feature = tfdv.get_feature(schema, 'pickup_community_area')
pickup_community_area_feature.drift_comparator.jensen_shannon_divergence.threshold = 0.2
pickup_community_area_feature.skew_comparator.jensen_shannon_divergence.threshold = 0.2

company_feature = tfdv.get_feature(schema, 'company')
company_feature.drift_comparator.infinity_norm.threshold = 0.001
company_feature.skew_comparator.infinity_norm.threshold = 0.2

payment_type_feature = tfdv.get_feature(schema, 'payment_type')
payment_type_feature.drift_comparator.infinity_norm.threshold = 0.001
payment_type_feature.skew_comparator.infinity_norm.threshold = 0.001

skew_anomalies = tfdv.validate_statistics(statistics=train_stats, schema=schema,
                                          serving_statistics=serving_stats,
                                          previous_statistics=test_stats)
tfdv.display_anomalies(skew_anomalies)

# Saving your schema

Once we are happy with our custom schema, we can write it to a file so that it can be loaded again later.

In [None]:
tfdv.write_schema_text(schema, 'schema.txt')
loaded_schema = tfdv.load_schema_text('schema.txt')
tfdv.display_schema(loaded_schema)

We can also choose to write out the anomalies and statistics to a file if we want to.

In [None]:
tfdv.write_anomalies_text(skew_anomalies, 'anomalies.txt')
tfdv.write_stats_text(train_stats, 'train_stats.txt')

# Slicing your data

Another feature of TFDV is the ability to slice your data. This can be useful when you want to analyze the distribution of the examples that share a certain categorical value.

In the example below, we slice the data to look at the distribution of occurrences where no payment was charged.


In [None]:
from tensorflow_data_validation.utils import slicing_util
from tensorflow_metadata.proto.v0 import statistics_pb2

# Slice on the examples whose payment_type is 'No Charge'
slice_no_charge_function = slicing_util.get_feature_value_slicer(features={'payment_type': [b'No Charge']})
# Note: depending on the TFDV version, this argument may be named `experimental_slice_functions`
slice_options = tfdv.StatsOptions(slice_functions=[slice_no_charge_function])
# This functionality does not seem to work for dataframes, so we read the CSV file directly
slice_stats = tfdv.generate_statistics_from_csv('data/train/data.csv', stats_options=slice_options)

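def get_sliced_stats(stats, slice_key):
    """Return a statistics list that contains only the slice named `slice_key`."""
    for sliced_stats in stats.datasets:
        if sliced_stats.name == slice_key:
            result = statistics_pb2.DatasetFeatureStatisticsList()
            result.datasets.add().CopyFrom(sliced_stats)
            return result
    # Only reached when no slice matched the requested key
    raise ValueError(f'Invalid slice key: {slice_key}')

def display_slice_keys(stats):
    """Print the names of all slices present in `stats`."""
    print([dataset.name for dataset in stats.datasets])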

display_slice_keys(slice_stats)

In [None]:
tfdv.visualize_statistics(get_sliced_stats(slice_stats, 'All Examples'), get_sliced_stats(slice_stats, 'payment_type_No Charge'))
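
To illustrate what slicing does conceptually, the sketch below groups a few made-up rows into an overall slice plus one slice per requested feature value, then computes a simple statistic per slice. It is plain Python with hypothetical helpers (`slice_by_feature_value`, `mean`), not TFDV code; the slice keys merely mimic the `feature_value` naming seen in the cell above.

```python
from collections import defaultdict

def slice_by_feature_value(rows, feature, values):
    """Group rows into an 'All Examples' slice plus one slice per requested value."""
    slices = defaultdict(list)
    for row in rows:
        slices['All Examples'].append(row)
        if row.get(feature) in values:
            slices[f'{feature}_{row[feature]}'].append(row)
    return dict(slices)

def mean(rows, feature):
    """Arithmetic mean of `feature` over the given rows."""
    return sum(row[feature] for row in rows) / len(rows)

# Made-up rows for illustration
rows = [
    {'payment_type': 'Cash', 'fare': 10.0},
    {'payment_type': 'No Charge', 'fare': 0.0},
    {'payment_type': 'Credit Card', 'fare': 20.0},
    {'payment_type': 'No Charge', 'fare': 6.0},
]

slices = slice_by_feature_value(rows, 'payment_type', {'No Charge'})
for key, members in slices.items():
    print(key, len(members), mean(members, 'fare'))
# All Examples 4 9.0
# payment_type_No Charge 2 3.0
```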