# Data Analysis and Schema Generation with TFDV

In this lab, we use [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) (TFDV) to perform the following:

1. **Generate statistics** from the training data.
2. **Visualise and analyse** the generated statistics.
3. **Infer** a **schema** from the generated statistics.
4. **Update** the schema with domain knowledge.
5. **Validate** the evaluation data against the schema.
6. **Save** the schema for later use.

<br/>
<img valign="middle" src="imgs/tfdv.png" width="800">


In [None]:
#!pip install -q -U tensorflow_data_validation

In [None]:
import os
import tensorflow_data_validation as tfdv
print('TFDV version: {}'.format(tfdv.__version__))

In [None]:
WORKSPACE = 'workspace'  # Local directory; this can also be a GCS location.
DATA_DIR = os.path.join(WORKSPACE, 'raw_data')
TRAIN_DATA_FILE = os.path.join(DATA_DIR, 'train.csv')
EVAL_DATA_FILE = os.path.join(DATA_DIR, 'eval.csv')

## 1. Generate statistics

In [None]:
HEADER = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
          'marital_status', 'occupation', 'relationship', 'race', 'gender',
          'capital_gain', 'capital_loss', 'hours_per_week',
          'native_country', 'income_bracket']

TARGET_FEATURE_NAME = 'income_bracket'
TARGET_LABELS = [' <=50K', ' >50K']
WEIGHT_COLUMN_NAME = 'fnlwgt'

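By default, the statistics are computed locally. To run the computation on Dataflow instead, pass a Beam `PipelineOptions` object through the `pipeline_options` parameter of `tfdv.generate_statistics_from_csv`.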

In [None]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=TRAIN_DATA_FILE, 
    column_names=HEADER,
    stats_options=tfdv.StatsOptions(
        weight_feature=WEIGHT_COLUMN_NAME,
        sample_rate=1.0
    )
)
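
As a rough sketch of the Dataflow option mentioned above, the cell below builds Beam flags that select the Dataflow runner and passes them to `tfdv.generate_statistics_from_csv`. The helper names (`dataflow_flags`, `generate_statistics_on_dataflow`) and the project, region, bucket, and job name are placeholders introduced here for illustration, not part of this lab. Depending on your TFDV and Beam versions, you may also need to make TFDV available to the Dataflow workers.

```python
def dataflow_flags(project, region, temp_location, job_name):
    """Build Beam command-line flags that select the Dataflow runner."""
    return [
        '--runner=DataflowRunner',
        f'--project={project}',
        f'--region={region}',
        f'--temp_location={temp_location}',
        f'--job_name={job_name}',
    ]


def generate_statistics_on_dataflow(data_location, column_names, output_path,
                                    flags, stats_options=None):
    """Hypothetical wrapper: run TFDV statistics generation with the given Beam flags."""
    # Imported lazily so that the flag helper works without Beam or TFDV installed.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    return tfdv.generate_statistics_from_csv(
        data_location=data_location,   # should be a gs:// path when running on Dataflow
        column_names=column_names,
        output_path=output_path,       # where the statistics proto is written
        stats_options=stats_options or tfdv.StatsOptions(),
        pipeline_options=PipelineOptions(flags),
    )


# Placeholder values; replace them with your own project, region, and bucket.
flags = dataflow_flags(
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    job_name='tfdv-train-stats',
)
```

With real values in place, calling `generate_statistics_on_dataflow(TRAIN_DATA_FILE, HEADER, output_path, flags)` would submit the job, assuming `TRAIN_DATA_FILE` and `output_path` point to GCS locations.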

## 2. Visualise generated statistics

In [None]:
tfdv.visualize_statistics(train_stats)

## 3. Infer schema from statistics

In [None]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

## 4. Update the schema with your domain knowledge

In [None]:
# Relax the minimum fraction of values that must come from the domain of the occupation feature.
occupation = tfdv.get_feature(schema, 'occupation')
occupation.distribution_constraints.min_domain_mass = 0.9

# Add new value to the domain of feature native_country.
native_country_domain = tfdv.get_domain(schema, 'native_country')
native_country_domain.value.append('Egypt')

# By default, all features belong to the TRAINING, EVALUATION, and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('EVALUATION')
schema.default_environment.append('SERVING')

# The target feature (the label) is not available at serving time, so exclude it from the SERVING environment.
tfdv.get_feature(schema, TARGET_FEATURE_NAME).not_in_environment.append('SERVING')
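
To make the `min_domain_mass` setting above more concrete, here is a simplified, unweighted illustration in plain Python. It is not TFDV's implementation: the `domain_mass` helper and the example values are made up for this sketch. The idea is that the constraint compares the fraction of observed values that fall inside the schema's domain with the configured threshold.

```python
def domain_mass(values, domain):
    """Return the fraction of values that belong to the domain."""
    if not values:
        raise ValueError('values must not be empty')
    in_domain = sum(1 for value in values if value in domain)
    return in_domain / len(values)


# Hypothetical domain and observations: 9 of the 10 values are in the domain.
domain = {'Sales', 'Tech-support', 'Craft-repair'}
values = ['Sales'] * 6 + ['Tech-support'] * 3 + ['Astronaut']

mass = domain_mass(values, domain)

# With the relaxed threshold of 0.9 the feature passes;
# a threshold of 1.0 would require every value to be in the domain.
passes_relaxed = mass >= 0.9
passes_strict = mass >= 1.0
```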

In [None]:
tfdv.display_schema(schema=schema)

In [None]:
tfdv.get_feature(schema, TARGET_FEATURE_NAME)
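
The environment settings above can be read as follows: a feature is expected in an environment if the environment is one of the schema defaults and the feature has not opted out of it. The plain-Python model below is only a sketch of that rule. It ignores the per-feature `in_environment` field, and the `expected_features` helper and the shortened feature list are assumptions made for illustration.

```python
def expected_features(feature_names, default_environments,
                      not_in_environment, environment):
    """Return the features expected in `environment` under the simplified rule."""
    if environment not in default_environments:
        return []
    return [
        name for name in feature_names
        if environment not in not_in_environment.get(name, set())
    ]


# A shortened, illustrative feature list mirroring the schema updates above.
feature_names = ['age', 'occupation', 'income_bracket']
default_environments = ['TRAINING', 'EVALUATION', 'SERVING']
not_in_environment = {'income_bracket': {'SERVING'}}

evaluation_features = expected_features(
    feature_names, default_environments, not_in_environment, 'EVALUATION')
serving_features = expected_features(
    feature_names, default_environments, not_in_environment, 'SERVING')
```

Under this model, validating with `environment='SERVING'` would not report the missing label as an anomaly, whereas validating with `environment='EVALUATION'` still expects it.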

## 5. Validate evaluation data
We validate the evaluation data against the updated schema and display any anomalies found.

In [None]:
eval_stats = tfdv.generate_statistics_from_csv(
    data_location=EVAL_DATA_FILE,
    column_names=HEADER,
    stats_options=tfdv.StatsOptions(
        weight_feature=WEIGHT_COLUMN_NAME
    )
)

eval_anomalies = tfdv.validate_statistics(eval_stats, schema, environment='EVALUATION')
tfdv.display_anomalies(eval_anomalies)
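
Besides displaying anomalies, it can be useful to inspect them in code, for example to stop a pipeline when validation fails. The sketch below reads the `anomaly_info` map of the anomalies result. The `summarize_anomalies` helper is hypothetical, and the stand-in object with its descriptions is fabricated purely so that the example runs without TFDV; in the notebook you would pass `eval_anomalies` instead.

```python
from types import SimpleNamespace


def summarize_anomalies(anomalies):
    """Return sorted (feature, short_description, description) tuples."""
    return sorted(
        (name, info.short_description, info.description)
        for name, info in anomalies.anomaly_info.items()
    )


# Stand-in for a TFDV anomalies result; the content is illustrative only.
example_anomalies = SimpleNamespace(anomaly_info={
    'native_country': SimpleNamespace(
        short_description='Example short description',
        description='Example long description of the anomaly.',
    ),
})

rows = summarize_anomalies(example_anomalies)
has_anomalies = bool(rows)

for feature, short_description, description in rows:
    print(f'{feature}: {short_description} ({description})')
```

A pipeline step could then raise an error when `has_anomalies` is true, rather than relying on someone to read the rendered table.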

## 6. Save the schema
We save the schema so that the subsequent ML steps can reuse it.

In [None]:
RAW_SCHEMA_LOCATION = os.path.join(WORKSPACE, 'raw_schema.pbtxt')

In [None]:
tfdv.write_schema_text(schema, RAW_SCHEMA_LOCATION)
print("Schema stored.")

In [None]:
# Reload the saved schema to confirm that it was written correctly.
tfdv.load_schema_text(RAW_SCHEMA_LOCATION)