# What is TFDV (TensorFlow Data Validation)

TFDV is a data validation tool in the same category as AWS Deequ and Apache Griffin.

* [TensorFlow Data Validation (TFDV)](https://www.tensorflow.org/tfx/data_validation/get_started)
* [TensorFlow Data Validation: Checking and analyzing your data](https://www.tensorflow.org/tfx/guide/tfdv)

> TensorFlow Data Validation identifies anomalies in training and serving data, and can automatically create a schema by examining the data. The component can be configured to detect different classes of anomalies in the data. It can  
> * Perform validity checks by comparing data statistics against a schema that codifies expectations of the user.
> * Detect training-serving skew by comparing examples in training and serving data.
> * Detect data drift by looking at a series of data.

* [Get started with Tensorflow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started)

> Computing descriptive data statistics  
> TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and the shapes of their value distributions. Tools such as Facets Overview can provide a succinct visualization of these statistics for easy browsing.
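
To make "descriptive statistics" concrete, here is a minimal pure-Python sketch of the kind of per-feature summary the quote describes (counts, missing values, range, most frequent values). It is not TFDV's implementation. The column values are toy data made up for illustration, and the helper names `numeric_summary` and `categorical_summary` are hypothetical.

```python
from collections import Counter
from statistics import mean

# Toy column values for illustration only (None marks a missing value).
ages = [22.0, 38.0, None, 35.0, 35.0]
ports = ["S", "C", "S", None, "S"]


def numeric_summary(values):
    """Summarize a numeric column, ignoring missing entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "min": min(present),
        "max": max(present),
    }


def categorical_summary(values, num_top_values=2):
    """Summarize a categorical column, ignoring missing entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "top_values": Counter(present).most_common(num_top_values),
    }


age_summary = numeric_summary(ages)
port_summary = categorical_summary(ports)
print(age_summary)
print(port_summary)
```

TFDV computes summaries of this kind for every feature and stores them in a `DatasetFeatureStatisticsList` proto, which is the type that `tfdv.visualize_statistics` takes as input.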

* [TensorFlow Data Validation](https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic)

> This example colab notebook illustrates how TensorFlow Data Validation (TFDV) can be used to investigate and visualize your dataset. That includes looking at descriptive statistics, inferring a schema, checking for and fixing anomalies, and checking for drift and skew in our dataset. It's important to understand your dataset's characteristics, including how it might change over time in your production pipeline. It's also important to look for anomalies in your data, and to compare your training, evaluation, and serving datasets to make sure that they're consistent.

* [Data validation using TFX Pipeline and TensorFlow Data Validation](https://www.tensorflow.org/tfx/tutorials/tfx/penguin_tfdv)

> In this notebook-based tutorial, we will create and run TFX pipelines to validate input data and create an ML model. This notebook is based on the TFX pipeline we built in Simple TFX Pipeline Tutorial. If you have not read that tutorial yet, you should read it before proceeding with this notebook.

* [Hands on Tensorflow Data Validation](https://towardsdatascience.com/hands-on-tensorflow-data-validation-61e552f123d7)


# Data

In [1]:
%%bash
export KAGGLE_USERNAME=$(cat ~/.kaggle/kaggle.json | jq -r '.username')
export KAGGLE_KEY=$(cat ~/.kaggle/kaggle.json | jq -r '.key')

rm -f titanic.zip gender_submission.csv test.csv train.csv
kaggle competitions download -c titanic -q
unzip titanic.zip

Archive:  titanic.zip
  inflating: gender_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [2]:
import pandas as pd


df_train = pd.read_csv("./train.csv")
df_train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


---
# TFDV

* [tfdv.visualize_statistics](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics)

```
tfdv.visualize_statistics(
    lhs_statistics: statistics_pb2.DatasetFeatureStatisticsList,
    rhs_statistics: Optional[statistics_pb2.DatasetFeatureStatisticsList] = None,
    lhs_name: Text = 'lhs_statistics',
    rhs_name: Text = 'rhs_statistics',
    allowlist_features: Optional[List[types.FeaturePath]] = None,
    denylist_features: Optional[List[types.FeaturePath]] = None
) -> None
```

In [3]:
import tensorflow as tf
import tensorflow_data_validation as tfdv
import matplotlib.pyplot as plt

## Training data

In [4]:
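# Compute statistics over the feature columns only (identifiers and the label are dropped).
statistics_train = tfdv.generate_statistics_from_dataframe(
    dataframe=df_train.loc[:, ~df_train.columns.isin(['PassengerId', 'Name', 'Survived'])],
    stats_options=tfdv.StatsOptions(
        label_feature=None,  # the label column 'Survived' is excluded above
        weight_feature=None,
        sample_rate=1,
        num_top_values=50
    )
)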

###  Statistics (training)

In [5]:
tfdv.visualize_statistics(statistics_train)

###  Schema (training)

In [6]:
schema_train = tfdv.infer_schema(statistics=statistics_train)
tfdv.display_schema(schema=schema_train)

Feature name,Type,Presence,Valency,Domain
'Pclass',INT,required,,-
'Sex',STRING,required,,'Sex'
'Age',FLOAT,optional,single,-
'SibSp',INT,required,,-
'Parch',INT,required,,-
'Ticket',BYTES,required,,-
'Fare',FLOAT,required,,-
'Cabin',BYTES,optional,single,-
'Embarked',STRING,optional,single,'Embarked'


Domain,Values
'Sex',"'female', 'male'"
'Embarked',"'C', 'Q', 'S'"


## Test data

In [7]:
df_test = pd.read_csv("./test.csv")
# Compute statistics over the feature columns only (identifiers are dropped).
statistics_test = tfdv.generate_statistics_from_dataframe(
    dataframe=df_test.loc[:, ~df_test.columns.isin(['PassengerId', 'Name'])],
    stats_options=tfdv.StatsOptions(
        label_feature=None,  # test.csv has no label column
        weight_feature=None,
        sample_rate=1,
        num_top_values=50
    )
)

###  Statistics (test)

In [8]:
tfdv.visualize_statistics(statistics_test)

###  Schema (test)

In [9]:
schema_test = tfdv.infer_schema(statistics=statistics_test)
tfdv.display_schema(schema=schema_test)

Feature name,Type,Presence,Valency,Domain
'Pclass',INT,required,,-
'Sex',STRING,required,,'Sex'
'Age',FLOAT,optional,single,-
'SibSp',INT,required,,-
'Parch',INT,required,,-
'Ticket',BYTES,required,,-
'Fare',FLOAT,optional,single,-
'Cabin',STRING,optional,single,'Cabin'
'Embarked',STRING,required,,'Embarked'


Domain,Values
'Sex',"'female', 'male'"
'Cabin',"'A11', 'A18', 'A21', 'A29', 'A34', 'A9', 'B10', 'B11', 'B24', 'B26', 'B36', 'B41', 'B45', 'B51 B53 B55', 'B52 B54 B56', 'B57 B59 B63 B66', 'B58 B60', 'B61', 'B69', 'B71', 'B78', 'C101', 'C105', 'C106', 'C116', 'C130', 'C132', 'C22 C26', 'C23 C25 C27', 'C28', 'C31', 'C32', 'C39', 'C46', 'C51', 'C53', 'C54', 'C55 C57', 'C6', 'C62 C64', 'C7', 'C78', 'C80', 'C85', 'C86', 'C89', 'C97', 'D', 'D10 D12', 'D15', 'D19', 'D21', 'D22', 'D28', 'D30', 'D34', 'D37', 'D38', 'D40', 'D43', 'E31', 'E34', 'E39 E41', 'E45', 'E46', 'E50', 'E52', 'E60', 'F', 'F E46', 'F E57', 'F G63', 'F2', 'F33', 'F4', 'G6'"
'Embarked',"'C', 'Q', 'S'"
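
In the two schemas, `Cabin` has no domain (type `BYTES`) for the training data but has a `'Cabin'` domain (type `STRING`) for the test data. A likely explanation, assuming `tfdv.infer_schema` is run with its default `max_string_domain_size` of 100, is that a string domain is only inferred when the number of distinct values stays within that limit. The sketch below illustrates that rule in plain Python. It is not TFDV's code, the helper `infer_string_domain` is hypothetical, and the cabin values are toy data.

```python
def infer_string_domain(values, max_domain_size=100):
    """Return the sorted distinct values as a domain, or None if there are too many."""
    unique = sorted({v for v in values if v is not None})
    return unique if len(unique) <= max_domain_size else None


# Toy values for illustration only; a small limit stands in for the default of 100.
few_cabins = ["C85", "C123", None, "C85"]
many_cabins = ["A1", "A2", "A3", "A4", "A5"]

small_domain = infer_string_domain(few_cabins, max_domain_size=3)
no_domain = infer_string_domain(many_cabins, max_domain_size=3)
print(small_domain)
print(no_domain)
```

Under this rule, a high-cardinality column such as `Cabin` can end up with a closed list of allowed values in one schema and no list at all in another, depending on how many distinct values each dataset contains.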


## Comparison

It is important that **the numerical and categorical features fall in roughly the same range** in the test/validation data as in the training data. Otherwise, the resulting distribution skew may reduce the accuracy of your model.
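
As a rough illustration of what "the same range" means, the following pure-Python sketch flags numeric test values outside the range seen in training and categorical test values never seen in training. The rows are toy data made up for this example (not taken from `train.csv` or `test.csv`), and the helper names are hypothetical.

```python
# Toy rows for illustration only; these are NOT taken from train.csv or test.csv.
train_rows = [
    {"Fare": 7.25, "Embarked": "S"},
    {"Fare": 71.28, "Embarked": "C"},
    {"Fare": 8.05, "Embarked": "Q"},
]
test_rows = [
    {"Fare": 7.83, "Embarked": "Q"},
    {"Fare": 512.33, "Embarked": "S"},
    {"Fare": 9.69, "Embarked": "X"},
]


def numeric_range(rows, column):
    """Return (min, max) of a numeric column, ignoring missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return min(values), max(values)


def out_of_range(train, test, column):
    """Test values that fall outside the range observed in training."""
    low, high = numeric_range(train, column)
    return [
        row[column]
        for row in test
        if row[column] is not None and not low <= row[column] <= high
    ]


def unseen_categories(train, test, column):
    """Categories present in test but never seen in training."""
    seen = {row[column] for row in train}
    return sorted({row[column] for row in test} - seen)


fare_outliers = out_of_range(train_rows, test_rows, "Fare")
new_ports = unseen_categories(train_rows, test_rows, "Embarked")
print(fare_outliers)  # [512.33]
print(new_ports)      # ['X']
```

The side-by-side view from `tfdv.visualize_statistics` supports the same kind of check visually, by overlaying the two distributions for each feature.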

In [10]:
tfdv.visualize_statistics(
    lhs_statistics=statistics_train,
    lhs_name="train",
    rhs_statistics=statistics_test,
    rhs_name="test"
)

## Validation

* [tfdv.validate_statistics](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics)

> This method validates the statistics against the schema. If drift- or skew-detection is conducted, then the raw skew/drift measurements for each feature that is compared will be recorded in the drift_skew_info field in the returned Anomalies proto.
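
For categorical features, TFDV's skew and drift comparators can be configured with an L-infinity (`infinity_norm`) threshold: the largest absolute difference in relative frequency of any single value between two datasets. The following is a minimal pure-Python sketch of that distance, not TFDV's implementation. The samples and the threshold are hypothetical values chosen for illustration.

```python
from collections import Counter


def value_frequencies(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(reference, current):
    """Largest absolute difference in relative frequency over all values."""
    ref_freq = value_frequencies(reference)
    cur_freq = value_frequencies(current)
    all_values = set(ref_freq) | set(cur_freq)
    return max(abs(ref_freq.get(v, 0.0) - cur_freq.get(v, 0.0)) for v in all_values)


# Hypothetical samples of a categorical feature (for example, a port code).
reference_sample = ["S"] * 7 + ["C"] * 2 + ["Q"] * 1
current_sample = ["S"] * 5 + ["C"] * 4 + ["Q"] * 1

THRESHOLD = 0.1  # hypothetical threshold, chosen only for this example

distance = l_infinity_distance(reference_sample, current_sample)
drifted = distance > THRESHOLD
print(round(distance, 3), drifted)
```

Here the share of `"S"` drops from 0.7 to 0.5 and the share of `"C"` rises from 0.2 to 0.4, so the distance is 0.2 and the feature would be flagged under the assumed threshold.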

```
tfdv.validate_statistics(
    statistics: statistics_pb2.DatasetFeatureStatisticsList,
    schema: schema_pb2.Schema,
    environment: Optional[Text] = None,
    previous_statistics: Optional[statistics_pb2.DatasetFeatureStatisticsList] = None,
    serving_statistics: Optional[statistics_pb2.DatasetFeatureStatisticsList] = None
) -> anomalies_pb2.Anomalies
```

* [tfdv.display_anomalies](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_anomalies)

> Displays the input anomalies (for use in a Jupyter notebook).

```
tfdv.display_anomalies(
    anomalies: anomalies_pb2.Anomalies
) -> None
```

In [11]:
# Validate the training statistics against the schema inferred from the test data.
anomalies = tfdv.validate_statistics(
    statistics=statistics_train,
    schema=schema_test
)
tfdv.display_anomalies(anomalies)

Feature name,Anomaly short description,Anomaly long description
'Cabin',Unexpected string values,"Examples contain values missing from the schema: A10 (<1%), A14 (<1%), A16 (<1%), A19 (<1%), A20 (<1%), A23 (<1%), A24 (<1%), A26 (<1%), A31 (<1%), A32 (<1%), A36 (<1%), A5 (<1%), A6 (<1%), A7 (<1%), B101 (<1%), B102 (<1%), B18 (<1%), B19 (<1%), B20 (<1%), B22 (<1%), B28 (<1%), B3 (<1%), B30 (<1%), B35 (<1%), B37 (<1%), B38 (<1%), B39 (<1%), B4 (<1%), B42 (<1%), B49 (<1%), B5 (<1%), B50 (<1%), B73 (<1%), B77 (<1%), B79 (<1%), B80 (<1%), B82 B84 (<1%), B86 (<1%), B94 (<1%), B96 B98 (~1%), C103 (<1%), C104 (<1%), C110 (<1%), C111 (<1%), C118 (<1%), C123 (<1%), C124 (<1%), C125 (<1%), C126 (<1%), C128 (<1%), C148 (<1%), C2 (<1%), C30 (<1%), C45 (<1%), C47 (<1%), C49 (<1%), C50 (<1%), C52 (<1%), C65 (<1%), C68 (<1%), C70 (<1%), C82 (<1%), C83 (<1%), C87 (<1%), C90 (<1%), C91 (<1%), C92 (<1%), C93 (<1%), C95 (<1%), C99 (<1%), D11 (<1%), D17 (<1%), D20 (<1%), D26 (<1%), D33 (<1%), D35 (<1%), D36 (<1%), D45 (<1%), D46 (<1%), D47 (<1%), D48 (<1%), D49 (<1%), D50 (<1%), D56 (<1%), D6 (<1%), D7 (<1%), D9 (<1%), E10 (<1%), E101 (~1%), E12 (<1%), E121 (<1%), E17 (<1%), E24 (<1%), E25 (<1%), E33 (<1%), E36 (<1%), E38 (<1%), E40 (<1%), E44 (<1%), E49 (<1%), E58 (<1%), E63 (<1%), E67 (<1%), E68 (<1%), E77 (<1%), E8 (<1%), F E69 (<1%), F G73 (<1%), F38 (<1%), T (<1%)."
'Embarked',Multiple errors,"The feature has a shape, but it's not always present (if the feature is nested, then it should always be present at each nested level) or its value lengths vary. The feature was present in fewer examples than expected: minimum fraction = 1.000000, actual = 0.997755"
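
The `Embarked` anomaly follows from the schemas above: the schema inferred from the test data marks `Embarked` as required (minimum fraction 1.0), while the training data has a few rows without it. The arithmetic below reproduces the reported fraction, assuming the standard Kaggle Titanic `train.csv` with 891 rows, 2 of which have no `Embarked` value. Those counts are an assumption and are not printed anywhere in this notebook.

```python
# Assumed counts for the standard Kaggle Titanic train.csv (not shown in this notebook).
total_rows = 891
missing_embarked = 2

# Fraction of examples in which the feature is present.
actual_fraction = (total_rows - missing_embarked) / total_rows

# The schema inferred from the test data requires the feature in every example.
min_fraction = 1.0
violates_presence = actual_fraction < min_fraction

print(f"{actual_fraction:.6f}")  # 0.997755
print(violates_presence)         # True
```

If a small number of missing values is acceptable, one option is to lower the feature's `presence.min_fraction` in the schema before validating again. The right threshold depends on the use case.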


In [12]:
!rm -f titanic.zip gender_submission.csv test.csv train.csv