# Use TensorFlow Data Validation

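Let's use TensorFlow Data Validation (TFDV) to profile and validate a dataset that has been split into training, evaluation, and serving sets.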

In [1]:
import tensorflow as tf
import tensorflow_data_validation as tfdv
import os
import tempfile, urllib, zipfile
import pandas as pd

print('TF version:', tf.__version__)
print('TFDV version:', tfdv.version.__version__)

TF version: 2.8.0
TFDV version: 1.6.0


In [19]:
# Read train data
train_data_path="../data/adult_train.csv"
train_data=pd.read_csv(train_data_path, sep=",", na_values=["?"])
train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,37,Private,185744,HS-grad,9,Married-civ-spouse,Other-service,Wife,White,Female,0,0,20,United-States,<=50K
1,35,Private,189240,Some-college,10,Divorced,Other-service,Unmarried,Black,Female,0,0,40,United-States,<=50K
2,36,Local-gov,137314,Some-college,10,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,60,United-States,>50K
3,57,Private,205708,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,Poland,<=50K
4,44,Private,172032,Assoc-voc,11,Married-civ-spouse,Adm-clerical,Husband,White,Male,7298,0,51,United-States,>50K


In [20]:
# Read eval data

eval_data_path="../data/adult_eval.csv"

eval_data=pd.read_csv(eval_data_path, sep=",", na_values=["?"])
eval_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,33,Private,220860,HS-grad,9,Divorced,Transport-moving,Not-in-family,White,Male,0,0,45,United-States,<=50K
1,66,Self-emp-inc,249043,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,5556,0,26,United-States,>50K
2,35,Private,77820,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,60,United-States,<=50K
3,67,Private,172756,1st-4th,2,Widowed,Machine-op-inspct,Not-in-family,White,Female,2062,0,34,Ecuador,<=50K
4,28,Private,212563,Some-college,10,Divorced,Machine-op-inspct,Unmarried,Black,Female,0,0,25,United-States,<=50K


In [32]:
# read serving data

serving_data_path="../data/adult_serving.csv"
serving_data=pd.read_csv(serving_data_path, sep=",", na_values=["?"])
serving_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,3,public,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,5,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,3,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,5,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,2,public,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


## Step 1: Data profiling

First we'll use [tfdv.generate_statistics_from_dataframe](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) to understand our training data. 


In [22]:
train_stats = tfdv.generate_statistics_from_dataframe(dataframe=train_data)

Now let's explore the computed statistics with [tfdv.visualize_statistics](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics).

In [23]:
tfdv.visualize_statistics(train_stats)
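
To get a feel for the kind of per-feature summaries a profiling step produces, here is a minimal standard-library sketch. It is not TFDV's implementation; the helper `profile_column` and the toy rows are assumptions made for illustration only.

```python
import statistics


def profile_column(values):
    """Summarize one column; None marks a missing value (hypothetical helper)."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    # Numeric summaries only make sense when every present value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
        )
    return summary


# Toy rows in the spirit of the training data shown earlier (not the real file).
rows = [
    {"age": 37, "workclass": "Private"},
    {"age": 35, "workclass": "Private"},
    {"age": 36, "workclass": "Local-gov"},
    {"age": 57, "workclass": None},
]

profile = {
    name: profile_column([row[name] for row in rows])
    for name in rows[0]
}
for name, summary in profile.items():
    print(name, summary)
```

TFDV computes much richer statistics (histograms, quantiles, top values), but the idea is the same: one summary per feature, computed once and reused by the later steps.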

## Step 2: Infer a schema

In TFDV, a **schema** means more than it does for a database or a dataframe. Besides column names and data types, it also describes characteristics of the data, such as whether a feature must be present, how many values it may hold, and the expected range or set of values.

In general, TFDV uses **conservative heuristics** to infer stable data properties from the statistics in order to avoid overfitting the schema to the specific dataset. **It is strongly advised to review the inferred schema and refine it as needed**, to capture any domain knowledge about the data that TFDV’s heuristics might have missed.

For comparison with tools such as Great Expectations or TDDA: the statistics in TFDV correspond to the profiler output in TDDA, and the schema in TFDV corresponds to the validation rules in TDDA.
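
As a rough illustration of what inferring a schema from data involves, the sketch below derives a type, a presence flag, and a string domain from toy columns. The function `infer_feature` and the `max_domain_size` threshold are hypothetical; TFDV's actual heuristics are more conservative and more detailed than this.

```python
def infer_feature(values, max_domain_size=50):
    """Infer type, presence and (for strings) a domain from one column.

    Hypothetical helper for illustration; None marks a missing value.
    """
    present = [v for v in values if v is not None]
    kinds = {type(v) for v in present}
    if kinds == {int}:
        ftype = "INT"
    elif kinds and kinds <= {int, float}:
        ftype = "FLOAT"
    else:
        ftype = "STRING"
    feature = {
        "type": ftype,
        "presence": "required" if len(present) == len(values) else "optional",
    }
    # Only a small set of distinct strings is treated as a closed domain.
    if ftype == "STRING":
        domain = sorted({str(v) for v in present})
        if len(domain) <= max_domain_size:
            feature["domain"] = domain
    return feature


# Toy columns, not the real training data.
columns = {
    "age": [37, 35, 36],
    "workclass": ["Private", None, "Local-gov"],
    "sex": ["Female", "Female", "Male"],
}

toy_schema = {name: infer_feature(values) for name, values in columns.items()}
for name, feature in toy_schema.items():
    print(name, feature)
```

A domain inferred this way only contains values that happened to appear in the sample, which is exactly why the inferred schema should be reviewed by someone who knows the data.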

In [24]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'age',INT,required,,-
'workclass',STRING,optional,single,'workclass'
'fnlwgt',INT,required,,-
'education',STRING,required,,'education'
'education-num',INT,required,,-
'marital-status',STRING,required,,'marital-status'
'occupation',STRING,optional,single,'occupation'
'relationship',STRING,required,,'relationship'
'race',STRING,required,,'race'
'sex',STRING,required,,'sex'


Domain,Values
'workclass',"'Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'"
'education',"'10th', '11th', '12th', '1st-4th', '5th-6th', '7th-8th', '9th', 'Assoc-acdm', 'Assoc-voc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters', 'Preschool', 'Prof-school', 'Some-college'"
'marital-status',"'Divorced', 'Married-AF-spouse', 'Married-civ-spouse', 'Married-spouse-absent', 'Never-married', 'Separated', 'Widowed'"
'occupation',"'Adm-clerical', 'Armed-Forces', 'Craft-repair', 'Exec-managerial', 'Farming-fishing', 'Handlers-cleaners', 'Machine-op-inspct', 'Other-service', 'Priv-house-serv', 'Prof-specialty', 'Protective-serv', 'Sales', 'Tech-support', 'Transport-moving'"
'relationship',"'Husband', 'Not-in-family', 'Other-relative', 'Own-child', 'Unmarried', 'Wife'"
'race',"'Amer-Indian-Eskimo', 'Asian-Pac-Islander', 'Black', 'Other', 'White'"
'sex',"'Female', 'Male'"
'native-country',"'Cambodia', 'Canada', 'China', 'Columbia', 'Cuba', 'Dominican-Republic', 'Ecuador', 'El-Salvador', 'England', 'France', 'Germany', 'Greece', 'Guatemala', 'Haiti', 'Honduras', 'Hong', 'Hungary', 'India', 'Iran', 'Ireland', 'Italy', 'Jamaica', 'Japan', 'Laos', 'Mexico', 'Nicaragua', 'Outlying-US(Guam-USVI-etc)', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Puerto-Rico', 'Scotland', 'South', 'Taiwan', 'Thailand', 'Trinadad&Tobago', 'United-States', 'Vietnam', 'Yugoslavia'"
'income',"'<=50K', '>50K'"


## Step 3: Check evaluation data for errors

So far we've only been looking at the training data.  It's important that our evaluation data is consistent with our training data, including that it uses the same schema.  It's also important that the evaluation data includes examples of roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training.

In [25]:
# Compute stats for evaluation data
eval_stats = tfdv.generate_statistics_from_dataframe(dataframe=eval_data)

In [26]:
# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')
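
The range comparison described above can also be sketched without the visualizer. The snippet below uses only the first five `age` and `hours-per-week` values from the `head()` outputs shown earlier, so the ranges are far narrower than in the full datasets; `out_of_range_report` is a hypothetical helper, not a TFDV function.

```python
def numeric_range(values):
    """Return (min, max) over the non-missing values of a column."""
    present = [v for v in values if v is not None]
    return min(present), max(present)


def out_of_range_report(train_cols, eval_cols):
    """For each numeric feature, list eval values outside the training range."""
    report = {}
    for name, train_values in train_cols.items():
        low, high = numeric_range(train_values)
        outside = [
            v for v in eval_cols.get(name, [])
            if v is not None and not low <= v <= high
        ]
        if outside:
            report[name] = {"train_range": (low, high), "outside": outside}
    return report


# First five rows of each split, copied from the head() outputs above.
train_cols = {
    "age": [37, 35, 36, 57, 44],
    "hours-per-week": [20, 40, 60, 40, 51],
}
eval_cols = {
    "age": [33, 66, 35, 67, 28],
    "hours-per-week": [45, 26, 60, 34, 25],
}

report = out_of_range_report(train_cols, eval_cols)
print(report)
```

On five rows this flags most of the evaluation ages, which says more about the sample size than about the data. On the full splits, a long list of out-of-range values would be a signal worth investigating.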

## Step 4: Check for evaluation anomalies

Does our evaluation dataset match the schema from our training dataset?  This is especially important for categorical features, where we want to identify the range of acceptable values.

Key Point: What would happen if we tried to evaluate using data with categorical feature values that were not in our training dataset?  What about numeric features that are outside the ranges in our training dataset?

In [27]:
# Check eval data for errors by validating the eval data stats using the previously inferred schema.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)

Feature name,Anomaly short description,Anomaly long description
'native-country',Unexpected string values,Examples contain values missing from the schema: Holand-Netherlands (<1%).
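
Conceptually, this kind of anomaly comes from comparing the observed categorical values against the domain stored in the schema. A minimal sketch of that check, assuming a toy domain and toy evaluation values (the helper `unexpected_values` is hypothetical and the fraction is exaggerated compared with the `<1%` reported above):

```python
from collections import Counter


def unexpected_values(values, domain):
    """Map each value outside the domain to its fraction of non-missing values."""
    present = [v for v in values if v is not None]
    counts = Counter(v for v in present if v not in domain)
    total = len(present)
    return {value: count / total for value, count in counts.items()}


# Toy subset of the 'native-country' domain, for illustration only.
toy_domain = {"United-States", "Poland", "Ecuador"}

# Ten toy evaluation values, one of which is not in the domain.
eval_values = ["United-States"] * 8 + ["Ecuador", "Holand-Netherlands"]

unexpected = unexpected_values(eval_values, toy_domain)
print(unexpected)
```

Whether such a value is an error or simply a legitimate category that the training sample missed is a judgment call that requires domain knowledge.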


## Fix evaluation anomalies in the schema

In the eval data, the `native-country` column contains the value `Holand-Netherlands`, which does not appear in the training data. If this value is legitimate, we can fix the anomaly by adding it to the feature's domain in the schema.

In [35]:
# Add new value to the domain of feature native-country.
nativecountry_type_domain = tfdv.get_domain(schema, 'native-country')
nativecountry_type_domain.value.append('Holand-Netherlands')


# Validate eval stats after updating the schema 
updated_anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(updated_anomalies)

We also split off a 'serving' dataset for this example, so we should check that too.  By default all datasets in a pipeline should use the same schema, but there are often exceptions. For example, in supervised learning we need to include labels in our dataset, but when we serve the model for inference the labels will not be included. In some cases introducing slight schema variations is necessary.

**Environments** can be used to express such requirements. In particular, features in the schema can be associated with a set of environments using `default_environment`, `in_environment` and `not_in_environment`.

For example, in this dataset the `income` feature is included as the label for training, but it is missing from the serving data. If no environment is specified, it shows up as an anomaly.

In [36]:
serving_stats = tfdv.generate_statistics_from_dataframe(dataframe=serving_data)
serving_anomalies = tfdv.validate_statistics(serving_stats, schema)

tfdv.display_anomalies(serving_anomalies)

Feature name,Anomaly short description,Anomaly long description
'income',Column dropped,Column is completely missing
'workclass',Unexpected string values,Examples contain values missing from the schema: public (<1%).


In [37]:
# Add new value to the domain of feature workclass.
workclass_type_domain = tfdv.get_domain(schema, 'workclass')
workclass_type_domain.value.append('public')

In [38]:

serving_anomalies = tfdv.validate_statistics(serving_stats, schema)
tfdv.display_anomalies(serving_anomalies)

Feature name,Anomaly short description,Anomaly long description
'income',Column dropped,Column is completely missing


Now we just have the `income` feature (which is our label) showing up as an anomaly ('Column dropped').  Of course we don't expect to have labels in our serving data, so let's tell TFDV to ignore that.

In [39]:
# All features are by default in both TRAINING and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# Specify that 'income' feature is not in SERVING environment.
tfdv.get_feature(schema, 'income').not_in_environment.append('SERVING')

serving_anomalies_with_env = tfdv.validate_statistics(serving_stats, schema, environment='SERVING')

tfdv.display_anomalies(serving_anomalies_with_env)
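
One way to think about environments is as a per-feature set computed from the three fields mentioned earlier: start from `default_environment`, add `in_environment`, and remove `not_in_environment`. The sketch below models that reading with plain dictionaries. It is an assumption about the semantics made for illustration, not TFDV code, and `expected_features` is a hypothetical helper.

```python
def expected_features(features, default_environments, environment):
    """Return the names of features expected in the given environment."""
    expected = []
    for name, spec in features.items():
        envs = set(default_environments)
        envs |= set(spec.get("in_environment", []))
        envs -= set(spec.get("not_in_environment", []))
        if environment in envs:
            expected.append(name)
    return expected


# Toy schema: only 'income' is excluded from the SERVING environment.
toy_features = {
    "age": {},
    "workclass": {},
    "income": {"not_in_environment": ["SERVING"]},
}
defaults = ["TRAINING", "SERVING"]

# Columns actually present in the (toy) serving data.
serving_columns = {"age", "workclass"}

missing_for_serving = [
    name for name in expected_features(toy_features, defaults, "SERVING")
    if name not in serving_columns
]
missing_for_training = [
    name for name in expected_features(toy_features, defaults, "TRAINING")
    if name not in serving_columns
]
print(missing_for_serving)
print(missing_for_training)
```

Validated against the SERVING environment, the toy serving data has no missing columns. Validated as if it were training data, the absent `income` label is reported, which mirrors the behaviour seen in the cells above.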

## Freeze the schema

Now that the schema has been reviewed and curated, we will store it in a file to reflect its "frozen" state. TFDV represents the schema as a **protocol buffer (protobuf)** message, a format widely used to serialize static artifacts (data structures, transformation schemes, frozen models…), and writes it here in protobuf text format.

In [40]:
# Write the curated schema to a text-format protobuf file.
# `os` and `tfdv` were imported in the first cell.
output_dir = "../data"
schema_file = os.path.join(output_dir, 'schema.pbtxt')
tfdv.write_schema_text(schema, schema_file)

!cat {schema_file}

feature {
  name: "age"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "workclass"
  value_count {
    min: 1
    max: 1
  }
  type: BYTES
  domain: "workclass"
  presence {
    min_count: 1
  }
}
feature {
  name: "fnlwgt"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "education"
  type: BYTES
  domain: "education"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "education-num"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "marital-status"
  type: BYTES
  domain: "marital-status"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "occupation"
  value_count {
    min: 1
    max: 1
  }
  type: BY