## Data Validation

TensorFlow Data Validation (TFDV) can analyze training and serving data to:

- [compute descriptive statistics](#Generating-Statistics)

- [infer a schema](#Generating-schema)

- [detect data anomalies](#Anomalies-detection)

- [detect data skew and drift](#Data-Skew-and-Drift)

In [20]:
import warnings
warnings.filterwarnings('ignore', 'absl')

In [21]:
import os
import tensorflow_data_validation as tfdv

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

import pandas as pd
from collections import defaultdict

In [22]:
root_dir = os.path.split(os.getcwd())[0]
data_dir = os.path.join(root_dir, 'data', 'dataset1')

data_file = os.listdir(data_dir)[0]
data_dir = os.path.join(data_dir,data_file)

### Generating Statistics

TFDV can compute descriptive statistics that provide a quick overview of the data in terms of the features that are present and their value distributions. It also provides an interactive visualization of those statistics using the [Facets](https://pair-code.github.io/facets/) tool.

The `generate_statistics_from_csv` method calculates statistics from a CSV data file in local or cloud storage.

In [23]:
stats = tfdv.generate_statistics_from_csv(data_location = data_dir,
                                         delimiter=',')




In [24]:
tfdv.visualize_statistics(stats)

In the visualization above you can find many statistics that help us understand the distribution and characteristics of the features.

For numerical features, TFDV computes for every feature:
- The overall count of data records
- The number of missing data records
- The mean and standard deviation of the feature across the data records
- The minimum and maximum value of the feature across the data records
- The percentage of zero values of the feature across the data records
In addition, it generates a histogram of the values for each feature.

For categorical features, TFDV provides:
- The overall count of data records
- The percentage of missing data records
- The number of unique records
- The average string length of all records of a feature
- For each category, TFDV determines the sample count for each label and its rank

### Generating schema

The schema describes the expected properties of the data which is used to detect errors during training or serving time. Some of these properties are:

- which features are expected to be present
- their type
- the number of values for a feature in each example
- the presence of each feature across all examples
- the expected domains of features.

The schema is expected to be fairly static: several datasets can conform to the same schema, whereas statistics (described above) can vary per dataset.

TFDV uses conservative heuristics to infer stable data properties from the statistics **in order to avoid overfitting the schema to the specific dataset**. It is strongly advised to review the inferred schema and refine it as needed, to capture any domain knowledge about the data that TFDV's heuristics might have missed.

>Note: This text is taken from the official [TensorFlow website](https://www.tensorflow.org/tfx/data_validation/get_started).

In [76]:
schema = tfdv.infer_schema(stats)

In [77]:
tfdv.display_schema(schema)

| Feature name | Type | Presence | Valency | Domain |
| -- | -- | -- | -- | -- |
| 'product' | STRING | required | | 'product' |
| 'sub_product' | STRING | optional | single | 'sub_product' |
| 'issue' | STRING | required | | 'issue' |
| 'sub_issue' | STRING | optional | single | 'sub_issue' |
| 'consumer_complaint_narrative' | BYTES | required | | - |
| 'company' | BYTES | required | | - |
| 'state' | STRING | optional | single | 'state' |
| 'zip_code' | BYTES | optional | single | - |
| 'company_response' | STRING | required | | 'company_response' |
| 'timely_response' | STRING | required | | 'timely_response' |


In this visualization, Presence indicates whether the feature must be present in 100% of
data examples (required) or not (optional). Valency means the number of values
required per training example. In the case of categorical features, single means
each training example must have exactly one category for the feature.

The schema that has been generated here may not be exactly what we need, because it
assumes that the current dataset is an exact representation of future data as well. If a
feature is present in all training examples in this dataset, it will be marked as
required, but in reality it may be optional.
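
The effect described above can be sketched in plain Python. The helpers `presence_fraction` and `value_counts_per_example` and the toy examples are hypothetical. The rule "present in every example means required" is a deliberate simplification used only for illustration and is not TFDV's actual inference heuristic.

```python
def presence_fraction(examples, feature):
    """Fraction of examples in which the feature has at least one value."""
    present = sum(1 for ex in examples if ex.get(feature))
    return present / len(examples)

def value_counts_per_example(examples, feature):
    """Distinct numbers of values observed for the feature when it is present."""
    return {len(ex[feature]) for ex in examples if ex.get(feature)}

# Hypothetical examples: each feature maps to a list of values.
examples = [
    {"product": ["Mortgage"], "state": ["CA"]},
    {"product": ["Credit card"], "state": []},
    {"product": ["Student loan"]},
]

for feature in ("product", "state"):
    fraction = presence_fraction(examples, feature)
    # Simplified rule for illustration only.
    label = "required" if fraction == 1.0 else "optional"
    print(feature, label, round(fraction, 2),
          value_counts_per_example(examples, feature))
```

A feature such as `product` looks required only because this small sample happens to contain it everywhere, which is why the inferred schema should be reviewed.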

So how can we update the schema based on domain knowledge?

This is shown in the section [Updating Schema](#Updating-schema).

## Exploring the Data

In [27]:
base_dir = os.path.join(root_dir, 'temp_')
file = [i for i in os.listdir(base_dir) if 'sqlite' in i]
config = os.path.join(base_dir, file[0])

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = config

store = metadata_store.MetadataStore(connection_config)

In [28]:
def display_properties(input):
    data = defaultdict(list)
    for artifact in input:
        properties = artifact.properties
        custom_properties = artifact.custom_properties
        for key, value in properties.items():
            data['artifact id'].append(artifact.id)
            data['type_id'].append(artifact.type_id)
            data['name'].append(key)
            data['is_customproperty'].append(0)
            data['value'].append(value.string_value)

            
        for key, value in custom_properties.items():
            data['artifact id'].append(artifact.id)
            data['type_id'].append(artifact.type_id)
            data['name'].append(key)
            data['is_customproperty'].append(1)
            data['value'].append(value.string_value)
    return pd.DataFrame(data)


def display_types(types):
    table = {'id': [], 'name': []}
    for a_type in types:
        table['id'].append(a_type.id)
        table['name'].append(a_type.name.split('.')[-1])
    return pd.DataFrame(data=table)

def display_artifacts(store, artifacts):
    table = defaultdict(list)
    for a in artifacts:
        table['artifact id'].append(a.id)
        artifact_type = store.get_artifact_types_by_id([a.type_id])[0]
        table['type'].append(artifact_type.name)
        table['uri'].append(a.uri)
        table['create_time_since_epoch'].append(a.create_time_since_epoch)
        table['last_update_time_since_epoch'].append(a.last_update_time_since_epoch)
    return pd.DataFrame(data=table)
    
def display_context(store, artifacts):
    table = defaultdict(list)
    for a in artifacts:
        table['artifact id'].append(a.id)
        artifact_type = store.get_context_types_by_id([a.type_id])[0]
        table['type'].append(artifact_type.name)
        table['name'].append(a.name)
        table['create_time_since_epoch'].append(a.create_time_since_epoch)
        table['last_update_time_since_epoch'].append(a.last_update_time_since_epoch)
    return pd.DataFrame(data=table)

def display_executions(store, artifacts):
    table = defaultdict(list)
    for a in artifacts:
        table['artifact id'].append(a.id)
        artifact_type = store.get_execution_types_by_id([a.type_id])[0]
        table['type'].append(artifact_type.name.split('.')[-1])
        e_state = a.last_known_state
        if e_state == 2:
            table['last_known_state'].append('Running')
        elif e_state == 3:
            table['last_known_state'].append('Success')
        else:
            table['last_known_state'].append(e_state)
        table['create_time_since_epoch'].append(a.create_time_since_epoch)
        table['last_update_time_since_epoch'].append(a.last_update_time_since_epoch)
    return pd.DataFrame(data=table)


In [29]:
artifacts = display_artifacts(store, store.get_artifacts())
uri = artifacts.uri[0]

In [30]:
artifacts_prop = display_properties(store.get_artifacts())

In [31]:
split_names = artifacts_prop.loc[(artifacts_prop.name == 'split_names') & 
                   (artifacts_prop['artifact id'] == artifacts['artifact id'][0])].value

In [32]:
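import ast

temp_store = {}
# split_names holds a string containing a list literal; parse it without eval()
for split in ast.literal_eval(split_names.iloc[0]):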
    file = os.path.join(uri, split)
    file = os.path.join(file, os.listdir(file)[0])
    temp_store[split] = tfdv.generate_statistics_from_tfrecord(
                        data_location = file)

In [33]:
train_stats = temp_store['train']
val_stats = temp_store['eval']

### Comparing schema

Let’s say we have two datasets: a training set and a validation set. Before training our
machine learning model, we would like to determine how representative the validation
set is with regard to the training set. Does the validation data follow our training
data schema? TFDV can help answer this.

In [34]:
tfdv.visualize_statistics(lhs_statistics=val_stats, rhs_statistics=train_stats,
lhs_name='VAL_DATASET', rhs_name='TRAIN_DATASET')

We can use TFDV to check for selection bias using the statistics visualizations. For example, if our dataset contains Gender as a categorical feature, we can check that it is not biased toward the male category. In our dataset, we have State as a categorical feature. Ideally, the distribution of example counts across the different US states would reflect the relative population in each state (e.g., Texas, in third place, has a larger population than Florida in second place). If we find this type of bias in our data and we believe this bias may harm our model’s performance, we can go back and collect more data or over/undersample our data to get the correct distribution.

| value | validation_data | training_data | 
| -- | -- | -- |
| CA | 3408 | 6573 |
| FL | 1946 | 4010 |
| TX | 1906 | 3794 |

To see these counts, click `show raw data` in the left corner of the chart for the state feature.
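
As a quick check on the counts in the table above, the following sketch compares each state's share in the validation and training splits. The shares are computed only among the three listed states, because the totals for the full dataset are not shown here, so they are not the true per-state proportions. The `shares` helper is hypothetical.

```python
# Counts copied from the table above.
counts = {
    "CA": {"validation": 3408, "training": 6573},
    "FL": {"validation": 1946, "training": 4010},
    "TX": {"validation": 1906, "training": 3794},
}

def shares(split):
    """Share of each state among the three listed states for one split."""
    total = sum(row[split] for row in counts.values())
    return {state: row[split] / total for state, row in counts.items()}

val_shares = shares("validation")
train_shares = shares("training")
for state in counts:
    diff = abs(val_shares[state] - train_shares[state])
    print(f"{state}: validation={val_shares[state]:.3f} "
          f"training={train_shares[state]:.3f} |diff|={diff:.3f}")
```

Small differences between the two columns suggest that, for these three states at least, the validation split follows the training split closely.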

### Anomalies detection

TFDV can itself detect some sorts of anomalies in the data by using the statistics and the schema.

In [35]:
anomalies = tfdv.validate_statistics(statistics=val_stats, schema=schema)

In [36]:
tfdv.display_anomalies(anomalies)



| Feature name | Anomaly short description | Anomaly long description |
| -- | -- | -- |
| 'sub_issue' | Missing values | Some examples have fewer values than expected. |
| 'state' | Missing values | Some examples have fewer values than expected. |
| 'sub_product' | Missing values | Some examples have fewer values than expected. |
| 'zip_code' | Missing values | Some examples have fewer values than expected. |


#### Updating schema

The anomaly report above shows us how to detect variations from the schema
that is autogenerated from our dataset. But another use case for TFDV is manually
setting the schema according to our domain knowledge of the data.

For example, TFDV infers that the sub_issue feature will be available in 80% of our examples. If we decide that this feature must be present in at least 90% of our training examples, we can update the schema to reflect this:

In [37]:
sub_issue_feature = tfdv.get_feature(schema, 'sub_issue')
sub_issue_feature.presence.min_fraction = 0.9

In [38]:
state_domain = tfdv.get_domain(schema, 'state')
state_domain.value.remove('AK')

Removing AK from the state domain above is meant to show whether TFDV detects that AK is missing from the state domain list.

In [39]:
updated_anomalies = tfdv.validate_statistics(val_stats, schema)
tfdv.display_anomalies(updated_anomalies)



| Feature name | Anomaly short description | Anomaly long description |
| -- | -- | -- |
| 'sub_issue' | Missing values | Some examples have fewer values than expected. |
| 'state' | Multiple errors | Some examples have fewer values than expected. Examples contain values missing from the schema: AK (<1%). |
| 'zip_code' | Missing values | Some examples have fewer values than expected. |
| 'sub_product' | Missing values | Some examples have fewer values than expected. |


We can see that state is marked with multiple errors, and the report also notes that AK is missing from the schema although it is present in the dataset.

We can discuss this with our domain experts and add it back manually.

```
state_domain = tfdv.get_domain(schema, 'state')
state_domain.value.append('AK')
```
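
The kind of check behind the "values missing from the schema" message can be sketched without TFDV. This is only an illustration of the idea, not TFDV's implementation. The `values_outside_domain` helper, the toy values and `toy_domain` are all hypothetical.

```python
from collections import Counter

def values_outside_domain(values, domain):
    """Return {value: fraction of examples} for values not listed in the domain."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    allowed = set(domain)
    return {v: n / total for v, n in counts.items() if v not in allowed}

# Hypothetical toy data; the real 'state' column has many more values.
state_values = ["CA"] * 60 + ["TX"] * 39 + ["AK"]
toy_domain = ["CA", "TX"]  # 'AK' deliberately left out

unexpected = values_outside_domain(state_values, toy_domain)
print(unexpected)
```

Once `'AK'` is appended to the domain, the same check returns an empty dictionary.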

The schema can also be written to and read from disk for future data validation.

In [40]:
schema_location = os.path.join(os.pardir, 'temp_', 'schema.pbtxt')

tfdv.write_schema_text(schema, schema_location)

In [41]:
%%bash 

tree ../temp_ -I *ExampleGen

../temp_
├── ExampleValidator
│   └── anomalies
│       └── 9
│           ├── eval
│           │   └── anomalies.pbtxt
│           └── train
│               └── anomalies.pbtxt
├── metadata.sqlite
├── SchemaGen
│   └── schema
│       └── 8
│           └── schema.pbtxt
├── schema.pbtxt
└── StatisticsGen
    └── statistics
        └── 7
            ├── eval
            │   └── stats_tfrecord
            └── train
                └── stats_tfrecord

13 directories, 7 files


To read the schema file back:

```
schema = tfdv.load_schema_text(schema_location)
```

### Data Skew and Drift

**Data Skew:**

TFDV provides a built-in “skew comparator” that detects large differences between
the statistics of two datasets. This isn’t the statistical definition of skew (a dataset that
is asymmetrically distributed around its mean). It is defined in TFDV as the L-infinity norm of the difference between the serving_statistics of two datasets. If the difference between the two datasets exceeds the threshold of the L-infinity norm for a given feature, TFDV highlights it as an anomaly using the anomaly detection.


**Types of drift**

- **Concept drift** or change in P(Y|X) is a shift in the actual relationship between the model inputs and the output.
- **Label drift** or change in P(Y Ground Truth) is a shift in the model’s output or label distribution.
- **Data drift** or change in P(X) is a shift in the model’s input data distribution. Data drift is one of the reasons model accuracy degrades over time. It means that the underlying statistical properties of the predictors change; when the input variables change, model performance is affected.

The best way to address this issue is to continuously monitor the models. Based on past experience, an estimate can be made of when drift starts to creep into the model, and the model can then be proactively redeveloped to avoid the risks associated with drift.

Causes of data drift include:

- Upstream process changes, such as a sensor being replaced that changes the units of measurement from inches to centimeters.
- Data quality issues, such as a broken sensor always reading 0.
- Natural drift in the data, such as mean temperature changing with the seasons.
- Change in relation between features, or covariate shift.



> **L-infinity norm**
> The L-infinity norm is an expression used to define the difference
> between two vectors (in our case, the serving statistics). The L-infinity
> norm is defined as the maximum absolute value of the vector’s entries.
> For example, if the two vectors are the statistics of two different distributions, a small L-infinity norm of their difference indicates that the two distributions are similar.
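
A small worked example may help. The sketch below normalizes the value counts of a categorical feature in two datasets and takes the largest absolute difference between the resulting frequencies. The helper names and the counts are hypothetical. Treating the normalized value counts as the compared vectors is an assumption made for illustration.

```python
def normalize(counts):
    """Turn raw value counts into relative frequencies."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def l_infinity_distance(counts_a, counts_b):
    """Largest absolute difference in relative frequency over all values."""
    p, q = normalize(counts_a), normalize(counts_b)
    diffs = {v: abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q)}
    worst = max(diffs, key=diffs.get)
    return diffs[worst], worst

# Hypothetical counts for a categorical feature in two datasets.
training = {"Bank A": 50, "Bank B": 30, "Bank C": 20}
serving = {"Bank A": 40, "Bank B": 35, "Bank C": 25}

distance, value = l_infinity_distance(training, serving)
print(distance, value)
```

Here the frequencies differ by about 0.10 for "Bank A" and by about 0.05 for the other two values, so the distance is about 0.10 and any threshold below that would flag the feature.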

We can also use other measures, such as KL divergence or Jensen-Shannon divergence.

In TFDV we can use either the L-infinity norm or the Jensen-Shannon divergence, based on our preference.

> **Note:** To detect skew for numeric features, specify a `jensen_shannon_divergence` threshold instead of an `infinity_norm` threshold in the `skew_comparator`.
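
For reference, the Jensen-Shannon divergence between two discrete distributions can be sketched in a few lines of plain Python. The histograms below are hypothetical. How TFDV buckets numeric features before computing the divergence is not covered here, and the base-2 logarithm is a choice made for this sketch.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric, bounded divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical histograms over the same four buckets of a numeric feature.
train_hist = [0.10, 0.40, 0.40, 0.10]
serve_hist = [0.25, 0.25, 0.25, 0.25]

print(jensen_shannon_divergence(train_hist, train_hist))
print(jensen_shannon_divergence(train_hist, serve_hist))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1, which makes a threshold easier to interpret than with the unbounded KL divergence.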

To show how it works, I set the threshold to 0.0001 (which is not a reasonable threshold).
You can set this value based on your business problem.

In [42]:
tfdv.get_feature(schema,'company').skew_comparator.infinity_norm.threshold = 0.0001

skew_anomalies = tfdv.validate_statistics(statistics=train_stats,
                                        schema=schema,
                                        serving_statistics=val_stats)

In [43]:
tfdv.display_anomalies(skew_anomalies)



| Feature name | Anomaly short description | Anomaly long description |
| -- | -- | -- |
| 'state' | Multiple errors | Some examples have fewer values than expected. Examples contain values missing from the schema: AK (<1%). |
| 'company' | High Linfty distance between training and serving | The Linfty distance between training and serving is 0.00303146 (up to six significant digits), above the threshold 0.0001. The feature value with maximum difference is: Bank of America |
| 'sub_product' | Missing values | Some examples have fewer values than expected. |
| 'zip_code' | Missing values | Some examples have fewer values than expected. |
| 'sub_issue' | Missing values | Some examples have fewer values than expected. |


Now you can see that the company feature is marked with the anomaly ***'High Linfty distance between training and serving'***.

Similar to this skew example, you should define your drift_comparator for the features
you would like to watch and compare. You can then call validate_statistics
with the two dataset statistics as arguments, one for your baseline (e.g., yesterday’s
dataset) and one for a comparison (e.g., today’s dataset):

```
tfdv.get_feature(schema,'company').drift_comparator.infinity_norm.threshold = 0.01

drift_anomalies = tfdv.validate_statistics(statistics=train_stats_today,
                                            schema=schema,
                                            previous_statistics=train_stats_yesterday)
```

### Slicing Dataset

TFDV can be used to slice datasets on features of our choice to check whether they are biased.
A subtle way for bias to enter data is through missing values. If data is not missing at random, it may be missing more frequently for one group of people within the dataset than for others. This can mean that when the final model is trained, its performance is worse for these groups.
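
The idea of data missing more often for one group can be illustrated with a small pure-Python sketch that computes the missing rate of a feature per group. The `missing_rate_by_group` helper and the rows are hypothetical and only show the kind of comparison that slicing makes possible.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_key, feature):
    """Fraction of rows per group where the feature is missing (None or '')."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(feature) in (None, ""):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical rows, not drawn from the complaints dataset.
rows = [
    {"state": "CA", "zip_code": "94105"},
    {"state": "CA", "zip_code": "90001"},
    {"state": "TX", "zip_code": None},
    {"state": "TX", "zip_code": "73301"},
    {"state": "TX", "zip_code": ""},
    {"state": "FL", "zip_code": "33101"},
]

rates = missing_rate_by_group(rows, "state", "zip_code")
print(rates)
```

A much higher missing rate in one group than in the others is a hint that the data is not missing at random.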

In this example, we’ll look at data from different US states. We can slice the data so
that we only get statistics from California using the following code

In [44]:
from tensorflow_data_validation.utils import slicing_util

slice_fn1 = slicing_util.get_feature_value_slicer(
            features={'state': [b'CA']})

slice_options = tfdv.StatsOptions(slice_functions=[slice_fn1])
slice_stats = tfdv.generate_statistics_from_csv(
                            data_location=data_dir,
                            stats_options=slice_options)

In [45]:
from tensorflow_metadata.proto.v0 import statistics_pb2

def display_slice_keys(stats):
    # Print the name of every slice contained in the statistics proto
    print([dataset.name for dataset in stats.datasets])

def get_sliced_stats(stats, slice_key):
    # Return the statistics of a single slice wrapped in its own list proto
    for sliced_stats in stats.datasets:
        if sliced_stats.name == slice_key:
            result = statistics_pb2.DatasetFeatureStatisticsList()
            result.datasets.add().CopyFrom(sliced_stats)
            return result
    # Reached only when no slice matches the requested key
    print('Invalid Slice key')

def compare_slices(stats, slice_key1, slice_key2):
    lhs_stats = get_sliced_stats(stats, slice_key1)
    rhs_stats = get_sliced_stats(stats, slice_key2)
    tfdv.visualize_statistics(lhs_stats, rhs_stats)

In [46]:
compare_slices(slice_stats, 'state_CA', 'All Examples')


### Schema Environments

By default, validations assume that all datasets in a pipeline adhere to a single schema. In some cases introducing slight schema variations is necessary; for instance, features used as labels are required during training (and should be validated), but are missing during serving.

Environments can be used to express such requirements. In particular, features in the schema can be associated with a set of environments using default_environment, in_environment and not_in_environment.

For example, suppose the company feature is used as the label in training but is missing from the serving data. Without an environment specified, it will show up as an anomaly.
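
The semantics of environments can be sketched with a plain dictionary standing in for the schema. This is a loose model for illustration only and not TFDV's implementation. The `expected_features` and `missing_columns` helpers and the `toy_schema` layout are hypothetical; only the field names default_environment, in_environment and not_in_environment come from the text above.

```python
def expected_features(schema, environment=None):
    """Features that must appear when validating in the given environment."""
    expected = []
    for name, spec in schema["features"].items():
        envs = set(schema["default_environment"])
        envs |= set(spec.get("in_environment", []))
        envs -= set(spec.get("not_in_environment", []))
        # With no environment given, every feature is expected.
        if environment is None or environment in envs:
            expected.append(name)
    return expected

def missing_columns(schema, columns, environment=None):
    """Expected features that are absent from the observed columns."""
    return [f for f in expected_features(schema, environment) if f not in columns]

# Hypothetical plain-dict stand-in for the schema proto.
toy_schema = {
    "default_environment": ["TRAINING", "SERVING"],
    "features": {
        "product": {},
        "company": {"not_in_environment": ["SERVING"]},
    },
}
serving_columns = {"product"}

print(missing_columns(toy_schema, serving_columns))
print(missing_columns(toy_schema, serving_columns, environment="SERVING"))
```

Without an environment, `company` is reported as missing. With `environment="SERVING"`, it is no longer expected, so nothing is reported.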

In [63]:
%%bash

mkdir -p ../data/serving_dataset

In [70]:
import pandas as pd

serving_data = pd.read_csv(data_dir, nrows=20)

In [71]:
serving_datapath = os.path.join(os.pardir, 'data', 'serving_dataset', 'serving_data.csv')

serving_data.drop('company', axis = 1, inplace = True)
serving_data.to_csv(serving_datapath, index = False)

In [72]:
serving_data_stat = tfdv.generate_statistics_from_csv(data_location = serving_datapath)

In [85]:
serving_anomalies = tfdv.validate_statistics(
            serving_data_stat, schema)

tfdv.display_anomalies(serving_anomalies)

| Feature name | Anomaly short description | Anomaly long description |
| -- | -- | -- |
| 'company' | Column dropped | Column is completely missing |


Here we get a 'Column dropped' anomaly for 'company'. We have to indicate that this column won't be available in the serving environment. To do so, we define environments for the schema and mark the company column as not expected in the serving environment.

In [79]:
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

In [82]:
tfdv.get_feature(schema, 'company').not_in_environment.append('SERVING')

serving_anomalies_with_env = tfdv.validate_statistics(
    serving_data_stat, schema, environment='SERVING')

In [83]:
tfdv.display_anomalies(serving_anomalies_with_env)



## Speeding up the validation process

As we collect more data, data validation becomes a more time-consuming step in our machine learning workflow. One way of reducing the time to perform the validation is by taking advantage of available cloud solutions. By using a cloud provider, we aren’t limited to the computation power of our laptop or on-premises computing resources.

This is not shown in this notebook. To learn how to take advantage of Google Cloud Dataflow,
[click here](https://www.tensorflow.org/tfx/data_validation/get_started#running_on_google_cloud).

## Integrating TFDV into Your Machine Learning Pipeline

So far, all the methods we have discussed can be used in a standalone setup. This can be
helpful for investigating datasets outside of the pipeline setup.
TFX provides the pipeline components StatisticsGen and SchemaGen, which accept the output of the previous ExampleGen component as input and then generate the statistics and the schema.

### Loading artifact from metadata store

In the previous notebook (Data Ingestion) we ran ExampleGen with several configurations. StatisticsGen and SchemaGen require the artifacts of that previous run (ExampleGen) as input.

This section shows how to load the artifacts of the previous run from the metadata store.

In [None]:
from ml_metadata import errors

from tfx.types import artifact_utils
from tfx.types import standard_artifacts
from tfx.types import channel_utils

from tfx.orchestration.experimental.interactive import visualizations

# Contexts, executions and artifacts are connected to each other, which lets us
# trace back the execution events that happened

# This function is used to filter the latest execution of the given pipeline name
def get_latest_executions(store, pipeline_name, component_id = None):
    if component_id is None:
        run_contexts = [
            c for c in store.get_contexts_by_type('run')
            if c.properties['pipeline_name'].string_value == pipeline_name
        ]
    else:
        run_contexts = [
            c for c in store.get_contexts_by_type('component_run')
            if c.properties['pipeline_name'].string_value == pipeline_name and
               c.properties['component_id'].string_value == component_id
        ]
    if not run_contexts:
        return []

    latest_context = max(run_contexts,
                       key=lambda c: c.last_update_time_since_epoch)
    return store.get_executions_by_context(latest_context.id)


# This method is used to get the output artifact created by the latest execution which has
# been filtered using get_latest_executions method
def get_latest_artifacts(store, pipeline_name, component_id = None):
    executions = get_latest_executions(store, pipeline_name, component_id)

    execution_ids = [e.id for e in executions]
    events = store.get_events_by_execution_ids(execution_ids)
    artifact_ids = [
      event.artifact_id for event in events
      if event.type == metadata_store_pb2.Event.OUTPUT
    ]
    return store.get_artifacts_by_id(artifact_ids)

# This function returns the artifacts of the given artifact type that were
# created in the last run
def find_latest_artifacts_by_type(store, artifacts, artifact_type):
    try:
        artifact_type = store.get_artifact_type(artifact_type)
    except errors.NotFoundError:
        return []
    filtered_artifacts = [artifact for artifact in artifacts
                          if artifact.type_id == artifact_type.id]
    return [artifact_utils.deserialize_artifact(artifact_type, artifact)
            for artifact in filtered_artifacts]

# Display every artifact for which a visualization is registered
def visualize_artifacts(artifacts):
    for artifact in artifacts:
        visualization = visualizations.get_registry().get_visualization(
            artifact.type_name)
        if visualization:
            visualization.display(artifact)

In [None]:
artifacts = get_latest_artifacts(store, 'pipline_interactive')
example = find_latest_artifacts_by_type(store, artifacts,
                                        standard_artifacts.Examples.TYPE_NAME)

In [None]:
visualize_artifacts(example)

In [None]:
# The list of artifacts has to be converted into a channel before passing it to the components
example_gen = channel_utils.as_channel(example)

### TFDV as Components

In [None]:
from tfx.components import StatisticsGen, SchemaGen, ExampleValidator
from tfx.orchestration.experimental.interactive.interactive_context \
        import InteractiveContext

pipeline_name = f'pipline_interactive'
base_root = os.path.split(os.getcwd())[0]
pipeline_root = os.path.join(base_root, f'temp_')
beam_args = [
    '--runner=DirectRunner'
]

if not os.path.exists(pipeline_root):
    raise Exception('Run Data Ingestion Notebook before running this')

context = InteractiveContext(pipeline_name = pipeline_name,
                            pipeline_root = pipeline_root,
                            beam_pipeline_args = beam_args)

statistics_gen = StatisticsGen(
    examples=example_gen)

context.run(statistics_gen)

In [None]:
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)

context.run(schema_gen)

If the ExampleValidator component detects a misalignment in the dataset statistics
or schema between the new and the previous dataset, it will set the status to failed in
the metadata store, and the pipeline ultimately stops. Otherwise, the pipeline moves
on to the next step, the data preprocessing.

In [None]:
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])

context.run(example_validator)

In [None]:
display_artifacts(store, store.get_artifacts())

In [None]:
visualize_artifacts(statistics_gen.outputs['statistics'].get())

In [None]:
visualize_artifacts(schema_gen.outputs['schema'].get())

In [None]:
visualize_artifacts(example_validator.outputs['anomalies'].get())

**Note:**
>The ExampleValidator can automatically detect the anomalies
against the schema by using the skew and drift comparators we
described previously. However, this may not cover all the potential
anomalies in your data. If you need to detect some other specific
anomalies, you will need to write your own custom component.