# Data Validation

In this notebook we focus on the problem of validating the input data fed to ML pipelines. The importance of this problem is hard to overstate, especially for production pipelines. Irrespective of the ML algorithms used, data errors can adversely affect the quality of the generated model. Therefore, it is imperative to catch data errors early. The importance of error-free data also applies to the task of model understanding, since any attempt to debug and understand the output of the model must be grounded on the assumption that the data is adequately clean. All these observations point to the fact that we need to elevate data to a first-class citizen in ML pipelines, on par with algorithms and infrastructure, with corresponding tooling to continuously monitor and validate data throughout the various stages of the pipeline.

There are many reasons to analyze and transform your data:

* To find problems in your data. Common problems include:
    * Missing data, such as features with empty values.
    * Labels treated as features, so that your model gets to peek at the right answer during training.
    * Features with values outside the range you expect.
    * Missing or unexpected features.
    * Features present in too small a proportion of the examples.
    * Features with an unexpected type.
* To engineer more effective feature sets. For example, you can identify:
    * Especially informative features.
    * Redundant features.
    * Features that vary so widely in scale that they may slow learning.
    * Features with little or no unique predictive information.


# TFDV

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TensorFlow Extended (TFX).

**TensorFlow Data Validation (TFDV)** can analyze training and serving data and includes:

- Computing descriptive statistics
    - Scalable calculation of summary statistics of training and test data.
    - Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of features (Facets)

- Inferring a schema
    - Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies
    - A schema viewer to help you inspect the schema.

- Detecting data anomalies
    - Perform validity checks by comparing data statistics against a schema that codifies expectations of the user.
    - Detect training-serving skew by comparing examples in training and serving data.
    - Detect data drift by looking at a series of data.
    - An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.


# Installing Libraries & Dependencies

In [None]:
!cat requirements.txt

In [None]:
#!pip3 install -r requirements.txt

# Importing Libraries

In [None]:
from datetime import datetime

import pkg_resources
import json
import sys
import os
import re

import tensorflow as tf
import tensorflow_data_validation as tfdv
from tensorflow_data_validation import StatsOptions

from tensorflow_metadata.proto.v0 import schema_pb2
from tensorflow.python.lib.io import file_io
from apache_beam.options.pipeline_options import (
    PipelineOptions,
    GoogleCloudOptions,
    StandardOptions,
    SetupOptions,
    WorkerOptions
)

from data_validation_utils import *

print('INFO: TF version -- {}'.format(tf.__version__))
print('INFO: TFDV version -- {}'.format(pkg_resources.get_distribution("tensorflow_data_validation").version))
print('INFO: Beam version -- {}'.format(pkg_resources.get_distribution("apache_beam").version))
print('INFO: Pyarrow version -- {}'.format(pkg_resources.get_distribution("pyarrow").version))
print('INFO: TFX-BSL version -- {}'.format(pkg_resources.get_distribution("tfx-bsl").version))

tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

# Input Arguments

Example of input arguments for the data validation component

In [None]:
PROJECT = "irn-70656-dev-1307100302"
REGION = "europe-west1"
RAW_DATA_PATH = "gs://bike-sharing-data/"
BUCKET = "bike-sharing-pipeline-metadata"
PIPELINE_VERSION = "v0_1"
DATA_VERSION = datetime.now().strftime("%y%m%d_%H%M%S")
RUNNER = "DirectRunner" # DirectRunner or DataflowRunner

In [None]:
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['PIPELINE_VERSION'] = PIPELINE_VERSION
os.environ['DATA_VERSION'] = DATA_VERSION

# Setting Paths 

Set up some globals for the GCS file paths.

In [None]:
# Set up some globals for the GCS file paths
HANDLER = 'gs://' # ../ for local data, gs:// for cloud data
    
BASE_DIR = os.path.join(HANDLER, BUCKET, PIPELINE_VERSION)
RUN_DIR = os.path.join(BASE_DIR, 'run', DATA_VERSION)

STAGING_DIR = os.path.join(RUN_DIR, 'staging')
OUTPUT_DIR = os.path.join(RUN_DIR, 'data_validation')

FROZEN_STATS_PATH = os.path.join(BASE_DIR,'freeze', 'frozen_stats.txt')
FROZEN_SCHEMA_PATH = os.path.join(BASE_DIR, 'freeze', 'frozen_schema.txt')
DATA_STATS_PATH = os.path.join(OUTPUT_DIR, 'stats', 'data_stats.txt')
DATA_SCHEMA_PATH = os.path.join(OUTPUT_DIR, 'schema', 'data_schema.txt')
DATA_ANOMALIES_PATH = os.path.join(OUTPUT_DIR, 'anomalies', 'data_anomalies.txt')
STATIC_HTML_PATH = os.path.join(OUTPUT_DIR, 'index.html')
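
To make the resulting layout concrete, here is a minimal sketch of how these paths resolve. It uses `posixpath.join` so that the result is the same on every operating system, and the bucket name and data version are hypothetical values chosen for illustration; the real values come from the cells above.

```python
import posixpath

# Hypothetical values, for illustration only.
handler = "gs://"
bucket = "example-bucket"
pipeline_version = "v0_1"
data_version = "200315_101500"

# Same joining logic as in the cell above.
base_dir = posixpath.join(handler, bucket, pipeline_version)
run_dir = posixpath.join(base_dir, "run", data_version)
output_dir = posixpath.join(run_dir, "data_validation")

# Frozen artifacts are shared by all runs of a pipeline version;
# data artifacts are written under the directory of the current run.
frozen_schema_path = posixpath.join(base_dir, "freeze", "frozen_schema.txt")
data_schema_path = posixpath.join(output_dir, "schema", "data_schema.txt")

print(frozen_schema_path)
print(data_schema_path)
```

The frozen schema and statistics sit under the pipeline version directory, while every run writes its own statistics, schema and anomalies under `run/<DATA_VERSION>/data_validation`.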

# Running on Google Cloud

Set up project and compute region.

In [None]:
!pip show ipython

In [None]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Create the GCS bucket if it does not already exist.

In [None]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

Internally, TFDV uses Apache Beam's data-parallel processing framework to scale the computation of statistics over large datasets. 
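
Statistics of this kind scale well because they can be expressed as partial aggregates that are computed independently on chunks of the data and then merged. The sketch below illustrates that idea in plain Python. It is a conceptual illustration only, not TFDV's or Beam's actual implementation, and the names `Partial`, `summarize` and `merge` are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Partial:
    """Mergeable summary of one chunk of numeric values."""
    count: int
    total: float
    minimum: float
    maximum: float


def summarize(chunk):
    """Compute the partial summary of a single non-empty chunk."""
    return Partial(len(chunk), sum(chunk), min(chunk), max(chunk))


def merge(a, b):
    """Combine two partial summaries into one."""
    return Partial(
        a.count + b.count,
        a.total + b.total,
        min(a.minimum, b.minimum),
        max(a.maximum, b.maximum),
    )


# Made-up temperature values split into three chunks, as if they were
# processed by three independent workers.
values = [9.84, 9.02, 9.02, 9.84, 13.12, 15.58, 14.76, 17.22]
chunks = [values[:3], values[3:6], values[6:]]

merged = reduce(merge, (summarize(chunk) for chunk in chunks))
mean = merged.total / merged.count
print(merged.count, merged.minimum, merged.maximum, round(mean, 2))
```

Because `merge` is associative, the chunks can be summarized in any order and on any number of workers without changing the result.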

To run TFDV on Google Cloud, the TFDV wheel file must be downloaded and provided to the Dataflow workers. We can download the wheel file to the current directory as follows:

In [None]:
!pip download tensorflow_data_validation \
  --no-deps \
  --platform manylinux1_x86_64 \
  --only-binary=:all:   

In [None]:
PATH_TO_WHL_FILE = [filename for filename in os.listdir('.') if filename.startswith('tensorflow_data_validation')]

The following snippet shows an example usage of TFDV on Google Cloud:

In [None]:
job_name = 'datavalidation-' + re.sub("_", "-", PIPELINE_VERSION) + \
    '-' + re.sub("_", "-", DATA_VERSION)

# Create and set your PipelineOptions.
options = PipelineOptions()

# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT
google_cloud_options.job_name = job_name
google_cloud_options.region = REGION
google_cloud_options.staging_location = STAGING_DIR
google_cloud_options.temp_location = STAGING_DIR
options.view_as(WorkerOptions).subnetwork = 'regions/{}/subnetworks/default'.format(REGION)
setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = PATH_TO_WHL_FILE
options.view_as(StandardOptions).runner = RUNNER
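
Dataflow job names may only contain lowercase letters, digits and hyphens, which is why the underscores in the version strings are replaced above. A minimal sketch of that sanitization with an explicit check follows; the helper `make_job_name` and the version strings are hypothetical.

```python
import re

# Dataflow job names: lowercase letters, digits and hyphens, starting with
# a letter and ending with a letter or digit.
JOB_NAME_PATTERN = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")


def make_job_name(prefix, pipeline_version, data_version):
    """Build a job name and fail early if it is not a valid Dataflow name."""
    name = "-".join([prefix, pipeline_version, data_version])
    name = name.replace("_", "-").lower()
    if not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError("Invalid Dataflow job name: {}".format(name))
    return name


# Hypothetical version strings in the same format as the notebook's.
job_name = make_job_name("datavalidation", "v0_1", "200315_101500")
print(job_name)
```

Validating the name locally surfaces a malformed name before the job is submitted rather than as a submission error.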

# 1- Computing descriptive data statistics

## 1.1- Computing descriptive statistics for current raw data

TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions. Tools such as Facets Overview can provide a succinct visualization of these statistics for easy browsing.
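
To give a feel for what such statistics contain, the sketch below computes a few of them for a tiny in-memory CSV using only the standard library. The column names are borrowed from this notebook's dataset, but the values are made up, and the function `describe` is a hypothetical helper; TFDV computes a much richer set of statistics.

```python
import csv
import io
import statistics

# Made-up sample with two missing values (empty cells).
CSV_TEXT = """season,temp,humidity
1,9.84,81
1,9.02,80
2,,75
2,13.12,
"""


def describe(csv_text):
    """Return count, missing, min, max and mean for each numeric column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for name in rows[0].keys():
        raw = [row[name] for row in rows]
        present = [float(value) for value in raw if value != ""]
        summary[name] = {
            "count": len(present),
            "missing": len(raw) - len(present),
            "min": min(present),
            "max": max(present),
            "mean": statistics.mean(present),
        }
    return summary


summary = describe(CSV_TEXT)
for name, stats in summary.items():
    print(name, stats)
```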

In [None]:
stats_options = StatsOptions()
stats_options.feature_whitelist = ["datetime","season","weather","daytype","temp",
                                   "atemp","humidity","windspeed","casual","registered",
                                   "count"]

In [None]:
# Generating data statistics for initial dataset
print('INFO: Generating and exporting data statistics to {}'.format(DATA_STATS_PATH))
data_stats = tfdv.generate_statistics_from_csv(
    data_location=os.path.join(RAW_DATA_PATH, 'train.csv'),
    output_path=DATA_STATS_PATH,
    pipeline_options=options,
    stats_options=stats_options)

## 1.2- Using Visualizations to Check Your Data

TensorFlow Data Validation provides tools for visualizing the distribution of feature values. By examining these distributions in a Jupyter notebook using Facets, you can catch common problems with data. **Visualizing statistics** with TFDV Facets helps us to identify suspicious distributions, missing data, large differences in scale between features, and invalid labels.

### 1.2.1- Identifying Suspicious Distributions

#### Unbalanced Data

An unbalanced feature is a feature for which one value predominates. Unbalanced features can occur naturally, but if a feature always has the same value you may have a data bug. To detect unbalanced features in a Facets Overview, choose "Non-uniformity" from the "Sort by" dropdown.

In our case, there is no problem of unbalanced data.

#### Uniformly Distributed Data

A uniformly distributed feature is one for which all possible values appear with close to the same frequency. As with unbalanced data, this distribution can occur naturally, but can also be produced by data bugs.


To detect uniformly distributed features in a Facets Overview, choose "Non-uniformity" from the "Sort by" dropdown and check the "Reverse order" checkbox.
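
Both checks come down to asking how concentrated a feature's values are. The sketch below computes two simple indicators in plain Python: the share of the most frequent value, and the entropy normalized to the range 0 to 1. These are illustrative proxies chosen for this sketch, not necessarily the metric Facets uses for its "Non-uniformity" ordering, and the sample values are made up.

```python
import math
from collections import Counter


def top_value_share(values):
    """Fraction of examples that take the most frequent value."""
    counts = Counter(values)
    return max(counts.values()) / len(values)


def normalized_entropy(values):
    """Entropy divided by its maximum: 0 = single value, 1 = uniform."""
    counts = Counter(values)
    if len(counts) <= 1:
        return 0.0
    total = len(values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Made-up samples: one value dominates vs. all values equally frequent.
unbalanced = ["clear"] * 9 + ["rain"]
uniform = ["spring", "summer", "autumn", "winter"] * 5

print(top_value_share(unbalanced), round(normalized_entropy(unbalanced), 3))
print(top_value_share(uniform), round(normalized_entropy(uniform), 3))
```

A top value share close to 1 points to an unbalanced feature, while a normalized entropy close to 1 points to a uniformly distributed one.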

### 1.2.2- Missing Data

To check whether a feature is missing values entirely:

- Choose "Amount missing/zero" from the "Sort by" drop-down.
- Check the "Reverse order" checkbox.
- Look at the "missing" column to see the percentage of instances with missing values for a feature.

A data bug can also cause incomplete feature values. For example, you may expect a feature's value list to always have three elements and discover that sometimes it only has one. To check for incomplete values or other cases where feature value lists don't have the expected number of elements:

- Choose "Value list length" from the "Chart to show" drop-down menu on the right.
- Look at the chart to the right of each feature row. The chart shows the range of value list lengths for the feature. For example, a range that starts at zero indicates a feature that has some zero-length value lists.
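
The value list length check can be sketched in a few lines of plain Python. The feature name `tags`, the sample examples and the helper `value_list_lengths` are hypothetical and only serve to illustrate the idea.

```python
from collections import Counter

# Made-up examples: "tags" is expected to always hold three values.
examples = [
    {"tags": ["a", "b", "c"]},
    {"tags": ["a"]},
    {"tags": []},
    {"tags": ["a", "b", "c"]},
]


def value_list_lengths(examples, feature):
    """Count how often each value list length occurs for a feature."""
    return Counter(len(example.get(feature, [])) for example in examples)


lengths = value_list_lengths(examples, "tags")
# Keep only the lengths that differ from the expected one.
unexpected = {n: count for n, count in lengths.items() if n != 3}
print(dict(lengths))
print(unexpected)
```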


### 1.2.3- Large Differences in Scale Between Features

If your features vary widely in scale, then the model may have difficulties learning. For example, if some features vary from 0 to 1 and others vary from 0 to 1,000,000,000, you have a big difference in scale. Compare the "max" and "min" columns across features to find widely varying scales.

Consider normalizing feature values to reduce these wide variations.
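
Two common ways to normalize a numeric feature are min-max scaling and standardization (z-scores). The sketch below shows both in plain Python; the function names and the sample values are hypothetical.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up humidity readings.
humidity = [40.0, 60.0, 80.0, 100.0]
scaled = min_max_scale(humidity)
z_scores = standardize(humidity)
print(scaled)
print(z_scores)
```

In a real pipeline the scaling parameters (minimum, maximum, mean, standard deviation) should be computed on the training data and reused unchanged at serving time.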

### 1.2.4- Invalid Labels

TensorFlow's Estimators have restrictions on the type of data they accept as labels. For example, binary classifiers typically only work with {0, 1} labels.



In [None]:
tfdv.visualize_statistics(data_stats)

# 2- Inferring a schema over the data

The schema describes the expected properties of the data. Some of these properties are:

- which features are expected to be present
- their type
- the number of values for a feature in each example
- the presence of each feature across all examples
- the expected domains of features.

In short, the schema describes the expectations for "correct" data and can thus be used to detect errors in the data. 

Note that the schema is expected to be fairly static, e.g., several datasets can conform to the same schema, whereas statistics (described above) can vary per dataset.
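
The idea of a schema as a set of expectations can be illustrated with a small plain-Python sketch. The dictionary layout and the `validate` function below are hypothetical and deliberately simplified: TFDV stores the schema as a protocol buffer and validates dataset statistics rather than individual examples.

```python
# Hypothetical, simplified schema: expected type, presence and domain.
SCHEMA = {
    "season": {"type": int, "required": True, "domain": {1, 2, 3, 4}},
    "temp": {"type": float, "required": True},
    "count": {"type": int, "required": True},
}


def validate(example, schema):
    """Return a list of human-readable anomalies for one example."""
    anomalies = []
    for name, spec in schema.items():
        if name not in example:
            if spec["required"]:
                anomalies.append("{}: missing required feature".format(name))
            continue
        value = example[name]
        if not isinstance(value, spec["type"]):
            anomalies.append("{}: expected {}, got {}".format(
                name, spec["type"].__name__, type(value).__name__))
        elif "domain" in spec and value not in spec["domain"]:
            anomalies.append("{}: value {!r} outside expected domain".format(name, value))
    for name in example:
        if name not in schema:
            anomalies.append("{}: unexpected feature".format(name))
    return anomalies


good = {"season": 1, "temp": 9.84, "count": 16}
bad = {"season": 7, "temp": "warm", "wind": 3.2}

print(validate(good, SCHEMA))
for anomaly in validate(bad, SCHEMA):
    print(anomaly)
```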

## 2.1- Inferring schema from data set

TFDV includes `infer_schema()` to generate a schema automatically from the computed statistics.

In [None]:
# Generate data set schema
data_schema = tfdv.infer_schema(data_stats)

In [None]:
tfdv.display_schema(data_schema)

## 2.2- Customizing schema

In general, TFDV uses conservative heuristics to infer stable data properties from the statistics in order to avoid overfitting the schema to the specific dataset. It is strongly advised to review the inferred schema and refine it as needed, to capture any domain knowledge about the data that TFDV's heuristics might have missed.

The schema itself is stored as a Schema protocol buffer and can thus be updated/edited using the standard protocol-buffer API. TFDV also provides a few utility methods to make these updates easier.

In [None]:
#for feature in stats_options.feature_whitelist:
#    tfdv.get_feature(data_schema, feature).value_count.min=1
#    tfdv.get_feature(data_schema, feature).value_count.max=1


## 2.3- Schema Environments

By default, validations assume that all datasets in a pipeline adhere to a single schema. In some cases, slight schema variations are necessary. For instance, features used as labels are required during training (and should be validated), but are missing during serving.

Environments can be used to express such requirements. In particular, features in the schema can be associated with a set of environments using `default_environment`, `in_environment` and `not_in_environment`.

In our case, the features 'casual', 'registered' and 'count' are required for training, but are expected to be missing from serving. This can be expressed by:
- Defining two distinct environments in the schema, ["SERVING", "TRAINING"], and associating 'casual', 'registered' and 'count' only with environment "TRAINING".
- Associating the training data with environment "TRAINING" and the serving data with environment "SERVING".
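
The effect of environments can be sketched in plain Python. The dictionary below mimics `default_environment` and `not_in_environment` in a simplified form (it ignores `in_environment`), and the helper `expected_features` is hypothetical; it is not how TFDV represents or evaluates a schema internally.

```python
# Simplified stand-in for a schema with environments.
schema = {
    "default_environment": ["TRAINING", "SERVING"],
    "features": {
        "temp": {"not_in_environment": []},
        "humidity": {"not_in_environment": []},
        # The label is only expected while training.
        "count": {"not_in_environment": ["SERVING"]},
    },
}


def expected_features(schema, environment):
    """List the features that should be present in the given environment."""
    return sorted(
        name
        for name, spec in schema["features"].items()
        if environment in schema["default_environment"]
        and environment not in spec["not_in_environment"]
    )


print(expected_features(schema, "TRAINING"))
print(expected_features(schema, "SERVING"))
```

With this setup, a serving dataset without `count` raises no anomaly, while a training dataset without it does.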

In [None]:
# casual, registered, count should be required during training, optional while serving

# All features are by default in both TRAINING and SERVING environments.
# Specify that the 'casual', 'registered' and 'count' features are not in the SERVING environment.
data_schema.default_environment.append('TRAINING')
data_schema.default_environment.append('SERVING')
tfdv.get_feature(data_schema, 'casual').not_in_environment.append('SERVING')
tfdv.get_feature(data_schema, 'registered').not_in_environment.append('SERVING')
tfdv.get_feature(data_schema, 'count').not_in_environment.append('SERVING')

## 2.4- Saving data schema

In [None]:
def create_dir(path):
    '''
    Create the parent directory of the provided file path, including any
    missing intermediate directories.
    (Not needed when saving results to GCS.)
    '''
    path_dir = os.path.dirname(path)
    try:
        # exist_ok avoids a spurious error when the directory already exists
        os.makedirs(path_dir, exist_ok=True)
    except OSError:
        print("ERROR: Creation of the directory %s failed" % path_dir)
    else:
        print("INFO: Directory %s is ready" % path_dir)

In [None]:
if HANDLER != "gs://":
    create_dir(DATA_SCHEMA_PATH) # for local use only
    
tfdv.write_schema_text(data_schema, DATA_SCHEMA_PATH)
print('INFO: The data set schema was written to {}'.format(DATA_SCHEMA_PATH))

## 2.5- Loading frozen schema/stats

In [None]:
# Check if frozen data schema exists otherwise create it from current data set
try:
    frozen_schema = tfdv.load_schema_text(input_path=FROZEN_SCHEMA_PATH)
    print('INFO: Pipeline frozen data schema was loaded from {}'.format(FROZEN_SCHEMA_PATH))
except Exception:
    # First pipeline run, create new schema
    print('INFO: Frozen schema not found! First pipeline run! Saving current schema as frozen schema')
    frozen_schema = data_schema
    if HANDLER != "gs://":
        create_dir(FROZEN_SCHEMA_PATH)
    tfdv.write_schema_text(frozen_schema, FROZEN_SCHEMA_PATH)
    print('INFO: A new pipeline data schema was written to {}'.format(FROZEN_SCHEMA_PATH))
    
# Check if frozen data statistics exist otherwise create them from current data set
try:
    frozen_stats = tfdv.load_statistics(FROZEN_STATS_PATH)
    print('INFO: Pipeline frozen data statistics were loaded from {}'.format(FROZEN_STATS_PATH))
except Exception:
    # First pipeline run, create new statistics
    print('INFO: No data statistics found at {}'.format(FROZEN_STATS_PATH))
    print('INFO: Frozen data statistics not found! First pipeline run! Saving current data statistics as frozen data statistics')
    frozen_stats = data_stats
    # Save new pipeline data stats
    tf.io.gfile.copy(
        DATA_STATS_PATH,
        FROZEN_STATS_PATH)
    print('INFO: New pipeline data statistics were written to {}'.format(FROZEN_STATS_PATH))

In [None]:
tfdv.display_schema(frozen_schema)

# 3- Checking the data for errors

## Matching the statistics of the dataset against a schema

Given a schema, it is possible to check whether a dataset conforms to the expectations set in the schema. 

## Checking data skew and drift

In addition to checking whether a dataset conforms to the expectations set in the schema, TFDV also provides functionalities to detect:

- skew between training and serving data
- drift between different days of data

TFDV performs this check by comparing the statistics of different datasets based on the drift/skew comparators specified in the schema.

As with checking whether a dataset conforms to the expectations set in the schema, the result is an instance of the **Anomalies protocol buffer**.

In the next section we will:

- Match data set statistics against frozen_schema
- Check data drift between data statistics and previous frozen statistics

Drift detection is supported for categorical features and between consecutive spans of data (i.e., between span N and span N+1), such as between different days of data. We express drift in terms of L-infinity distance, and you can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.
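
For a categorical feature, the L-infinity distance between two spans is the largest absolute difference between the relative frequencies of any single value. The sketch below computes it in plain Python for made-up `season` values; the helper name `l_infinity_distance` is hypothetical, and the threshold of 0.01 matches the one used in this notebook.

```python
from collections import Counter


def l_infinity_distance(previous, current):
    """Largest absolute difference in relative frequency over all values."""
    prev_counts, curr_counts = Counter(previous), Counter(current)
    values = set(prev_counts) | set(curr_counts)
    return max(
        abs(prev_counts[v] / len(previous) - curr_counts[v] / len(current))
        for v in values
    )


# Made-up spans: season 1 grows from 25% to 40% of the examples.
previous_season = [1] * 25 + [2] * 25 + [3] * 25 + [4] * 25
current_season = [1] * 40 + [2] * 20 + [3] * 20 + [4] * 20

distance = l_infinity_distance(previous_season, current_season)
threshold = 0.01
drift_detected = distance > threshold
print("distance={:.2f} drift_detected={}".format(distance, drift_detected))
```

Here the largest shift is the 15 percentage point increase of season 1, which is well above the threshold, so the feature would be reported as drifting.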

In [None]:
# Add a drift comparator to schema for categorical features and set the threshold to 0.01
tfdv.get_feature(frozen_schema, 'season').drift_comparator.infinity_norm.threshold = 0.01
tfdv.get_feature(frozen_schema, 'weather').drift_comparator.infinity_norm.threshold = 0.01
tfdv.get_feature(frozen_schema, 'daytype').drift_comparator.infinity_norm.threshold = 0.01


In [None]:
# Detect schema anomalies and drift on new data set
print('INFO: Check for schema anomalies and drift on new data set.')
data_anomalies = tfdv.validate_statistics(
    statistics=data_stats,
    schema=frozen_schema,
    environment='TRAINING',
    previous_statistics=frozen_stats)

if HANDLER != "gs://":
    create_dir(DATA_ANOMALIES_PATH) # for local use only
tfdv.write_anomalies_text(data_anomalies, DATA_ANOMALIES_PATH)
print('INFO: Data anomalies were written to {}'.format(DATA_ANOMALIES_PATH))

In [None]:
tfdv.display_anomalies(data_anomalies)

# 4- Saving results for Kubeflow Artifacts

The display functions of TFDV, like `tfdv.display_schema` or `tfdv.visualize_statistics`, allow us to visualize the schema, the statistics of our datasets, and the anomalies in a notebook. It would be useful to visualize these results in the Kubeflow Pipelines UI as well.

The Kubeflow Pipelines UI offers built-in support for several types of visualizations, which we can use for this purpose. An output artifact is an output emitted by a pipeline component, which the Kubeflow Pipelines UI understands and can render as rich visualizations. 

It’s useful for pipeline components to include artifacts because they support performance evaluation, quick decision making for a run, and comparison across different runs. Artifacts also make it possible to understand how the pipeline’s various components work. An artifact can range from a plain textual view of the data to rich interactive visualizations.

To make use of this programmable UI, our pipeline component must write a JSON file to the component’s local filesystem. We can do this at any point during the pipeline execution.

Available output viewers:
* Confusion matrix 
* Markdown 
* ROC curve
* Table
* TensorBoard
* Web app 

The web-app viewer provides flexibility for **rendering our custom tfdv output**. We can specify an HTML file that our component creates, and the Kubeflow Pipelines UI renders that HTML in the output page. 
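
As a rough sketch of what such a metadata file can contain, the snippet below builds a dictionary with a `web-app` output and, as an additional assumption, an inline `markdown` output for a short summary. The helper `build_ui_metadata`, the GCS path and the summary text are hypothetical, and the exact fields accepted by each viewer should be checked against the Kubeflow Pipelines version in use.

```python
import json


def build_ui_metadata(html_uri, summary_markdown):
    """Describe two output viewers for the Kubeflow Pipelines UI."""
    return {
        "outputs": [
            # HTML page stored on GCS and rendered by the web-app viewer.
            {"type": "web-app", "storage": "gcs", "source": html_uri},
            # Short summary embedded directly in the metadata file.
            {"type": "markdown", "storage": "inline", "source": summary_markdown},
        ]
    }


# Hypothetical location and summary text.
metadata = build_ui_metadata(
    "gs://example-bucket/v0_1/run/200315_101500/data_validation/index.html",
    "# Data validation\nSee the web-app output for details.",
)
serialized = json.dumps(metadata, indent=2)
print(serialized)
```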

We need a way to render the output of the TFDV functions as HTML.

The functions in the file `data_validation_utils` have been copied from the [GitHub repo](https://github.com/tensorflow/data-validation/blob/v0.21.2/tensorflow_data_validation/utils/display_util.py) of TensorFlow's open-source `data-validation` project and modified to produce our desired output.

* `get_schema_html` does the same as `tfdv.display_schema`, but renders the output as HTML tables instead of dataframes.
* `get_statistics_html` and `get_anomalies_html` are already used by TFDV as intermediary functions but aren't directly exposed by the library, so we keep a copy of this particular version of the functions.
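
Rendering tabular results as an HTML table can be sketched with the standard library alone. The helper `rows_to_html_table` below is hypothetical and is not the code in `data_validation_utils`; the CSS class name matches the stylesheet defined later in this notebook, but whether `get_schema_html` emits that class is an assumption.

```python
import html


def rows_to_html_table(header, rows, css_class="paleBlueRows"):
    """Render a header and rows as an HTML table, escaping every cell."""
    head = "".join("<th>{}</th>".format(html.escape(str(h))) for h in header)
    body = "".join(
        "<tr>" + "".join("<td>{}</td>".format(html.escape(str(c))) for c in row) + "</tr>"
        for row in rows
    )
    return '<table class="{}"><thead><tr>{}</tr></thead><tbody>{}</tbody></table>'.format(
        css_class, head, body)


# Made-up schema rows, for illustration only.
table = rows_to_html_table(
    ["Feature name", "Type", "Presence"],
    [["season", "INT", "required"], ["temp", "FLOAT", "required"]],
)
print(table)
```

Escaping each cell keeps feature names or domain values that contain characters such as `<` or `&` from breaking the page.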

In [None]:
print('INFO: Rendering HTML artifacts.')
features_html, domains_html = get_schema_html(data_schema)
data_stats_drift_html = get_statistics_html(data_stats, frozen_stats, lhs_name="NEW_DATA", rhs_name="PREV_PREV")
data_anomalies_html = get_anomalies_html(data_anomalies)

We can add some style to our HTML page.

In [None]:
style="""
<style>
h1 {
    color:#0B6FA4;
}
h2 {
    color:#0B6FA4;
}
table.paleBlueRows {
    font-family: Arial, Helvetica, sans-serif;
    border: 1px solid #FFFFFF;
    text-align: left;
    border-collapse: collapse;
}
table.paleBlueRows td, table.paleBlueRows th {
    border: 1px solid #FFFFFF;
    padding: 3px 2px;
}
table.paleBlueRows tbody td {
    font-size: 13px;
}
table.paleBlueRows tr:nth-child(even) {
    background: #D0E4F5;
}
table.paleBlueRows thead {
    background: #0B6FA4;
    background: -moz-linear-gradient(top, #4893bb 0%, #237dad 66%, #0B6FA4 100%);
    background: -webkit-linear-gradient(top, #4893bb 0%, #237dad 66%, #0B6FA4 100%);
    background: linear-gradient(to bottom, #4893bb 0%, #237dad 66%, #0B6FA4 100%);
    border-bottom: 5px solid #FFFFFF;
}
table.paleBlueRows thead th {
    font-size: 15px;
    font-weight: bold;
    color: #FFFFFF;
    text-align: left;
    border-left: 2px solid #FFFFFF;
}
table.paleBlueRows thead th:first-child {
    border-left: none;
}

table.paleBlueRows tfoot td {
    font-size: 14px;
}
</style>
"""

Add the different HTML outputs to one HTML page:

In [None]:
html = (
    style
    + '<h1>Schema</h1><h2>Features</h2>' + features_html
    + '<br><h2>Domains</h2>' + domains_html
    + '<br><h1>Dataset Statistics</h1>' + data_stats_drift_html
    + '<br><h1>Dataset Anomalies</h1>' + data_anomalies_html
)

Write an HTML file to the component’s local filesystem and upload it to GCS.

In [None]:
# Save and upload HTML file to GCS
OUTPUT_FILE_PATH = './index.html'
    
with open(OUTPUT_FILE_PATH, "wb") as f:
    f.write(html.encode('utf-8'))
    
tf.io.gfile.copy(
    OUTPUT_FILE_PATH,
    STATIC_HTML_PATH,
    overwrite=True
)

Write the `mlpipeline-ui-metadata.json` file to the component’s local filesystem. It tells the Kubeflow Pipelines UI where to find the HTML artifact.

In [None]:
metadata = {
'outputs' : [{
  'type': 'web-app',
  'storage': 'gcs',
  'source': STATIC_HTML_PATH,
}]
}

# Write output files for next steps in pipeline
with file_io.FileIO('./mlpipeline-ui-metadata.json', 'w') as f:
    json.dump(metadata, f)

Write `DATA_VERSION` to a text output file so that it can be used as an input by the next steps in the pipeline.

In [None]:
with file_io.FileIO('./data_version.txt', 'w') as f:
    f.write(DATA_VERSION)

In [None]:
print(DATA_VERSION)

Let's view our final HTML file:

In [None]:
HTML(html)

# 5- Freeze the new schema

This step is done in the deploy component, and only if the model is deployed.