# Data Analysis and Schema Generation with TFDV

In this lab, we use [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) (TFDV) to perform the following:

1. **Generate statistics** from the training data.
2. **Visualise and analyse** the generated statistics.
3. **Infer** a **schema** from the generated statistics.
4. **Update** the schema with domain knowledge.
5. **Validate** the evaluation data against the schema.
6. **Save** the schema for later use.


<br/>
<img valign="middle" src="imgs/tfdv.png" width="800">

## Dataset

The dataset used in these labs is the **UCI Adult Dataset**: https://archive.ics.uci.edu/ml/datasets/adult.

It is a classification dataset, where the task is to predict whether income exceeds 50K USD per year based on census data. It is also known as the "Census Income" dataset.

In [None]:
import os
from tensorflow.io import gfile

WORKSPACE = 'workspace' # you can set to a GCS location
DATA_DIR = os.path.join(WORKSPACE, 'data')
RAW_SCHEMA_DIR = os.path.join(WORKSPACE, 'raw_schema')

### 1. Download data

In [None]:
if gfile.exists(WORKSPACE):
    print("Removing previous workspace...")
    gfile.rmtree(WORKSPACE)

print("Creating new workspace...")
gfile.mkdir(WORKSPACE)
print("Creating data directory...")
gfile.mkdir(DATA_DIR)

TRAIN_DATA_FILE = os.path.join(DATA_DIR,'train.csv')
EVAL_DATA_FILE = os.path.join(DATA_DIR,'eval.csv')

print("Downloading raw data...")
gfile.copy(src='gs://cloud-samples-data/ml-engine/census/data/adult.data.csv', dst=os.path.join(DATA_DIR,'file1.csv'))
gfile.copy(src='gs://cloud-samples-data/ml-engine/census/data/adult.test.csv', dst=os.path.join(DATA_DIR,'file2.csv'))
print("Data downloaded.")

### 2. Add headers to the CSV files

The CsvExampleGen component expects a header row, so we add the column names to each file.

In [None]:
import pandas as pd

HEADER = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'gender',
    'capital_gain', 'capital_loss', 'hours_per_week',
    'native_country', 'income_bracket',
]

pd.read_csv(DATA_DIR +"/file1.csv", names=HEADER).to_csv(DATA_DIR +"/train-01.csv", index=False)
pd.read_csv(DATA_DIR +"/file2.csv", names=HEADER).to_csv(DATA_DIR +"/train-02.csv", index=False)
gfile.remove(DATA_DIR +"/file1.csv")
gfile.remove(DATA_DIR +"/file2.csv")
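As an alternative to pandas, the header step can be sketched with the standard library alone, which streams the file row by row instead of loading it into memory. This is a minimal illustration for local paths only; the `add_header` helper is hypothetical and not part of the lab.

```python
import csv


def add_header(src_path, dst_path, header):
    """Copy a header-less CSV to dst_path with `header` as its first row.

    Hypothetical helper for illustration; assumes local file paths.
    """
    with open(src_path, newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        for row in csv.reader(src):
            if row:  # skip blank lines
                writer.writerow(row)
```

For example, `add_header(DATA_DIR + '/file1.csv', DATA_DIR + '/train-01.csv', HEADER)` would mirror the first pandas call above when `DATA_DIR` is a local directory.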

In [None]:
!wc -l $DATA_DIR/train-01.csv
!head $DATA_DIR/train-01.csv

### 3. Upload to Cloud Storage

In [None]:
GCS_BUCKET = 'ksalama-ocado-gcs'  # replace with the name of your own bucket
!gsutil -m cp $DATA_DIR/*.csv gs://$GCS_BUCKET/data/census/

In [None]:
!ls $DATA_DIR/

## TensorFlow Data Validation for Schema Generation

In [None]:
import tensorflow_data_validation as tfdv

TARGET_FEATURE_NAME = 'income_bracket'
WEIGHT_FEATURE_NAME = 'fnlwgt'

## 1. Compute Statistics

In [None]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=DATA_DIR+'/*.csv',
    column_names=None, # the CSV files already include a header row
    stats_options=tfdv.StatsOptions(
        weight_feature=WEIGHT_FEATURE_NAME,
        sample_rate=1.0
    )
)
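Setting `weight_feature` asks TFDV to also compute weighted statistics, where each example counts in proportion to its weight (here `fnlwgt`). The idea can be illustrated with a plain-Python weighted mean. This is only a conceptual sketch with made-up numbers, not how TFDV implements the computation.

```python
def weighted_mean(values, weights):
    """Mean of `values` where each value counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError('weights must not sum to zero')
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up ages and census weights, for illustration only.
ages = [25, 40, 60]
fnlwgt = [100, 300, 100]

unweighted = sum(ages) / len(ages)     # every row counts equally
weighted = weighted_mean(ages, fnlwgt)  # the middle row counts three times as much
print(unweighted, weighted)
```

The weighted mean is pulled towards the age of the heavily weighted row, which is the effect a weight feature has on the weighted statistics in general.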

In [None]:
tfdv.visualize_statistics(train_stats)

## 2. Infer Schema

In [None]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

## 3. Alter the Schema

In [None]:
# Relax the minimum fraction of values that must come from the domain for feature occupation.
occupation = tfdv.get_feature(schema, 'occupation')
occupation.distribution_constraints.min_domain_mass = 0.9

# Add new value to the domain of feature native_country.
native_country_domain = tfdv.get_domain(schema, 'native_country')
native_country_domain.value.append('Egypt')

# By default, all features are expected in the TRAINING, EVALUATION, and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('EVALUATION')
schema.default_environment.append('SERVING')

# Specify that the target feature (the label) is not expected in the SERVING environment.
tfdv.get_feature(schema, TARGET_FEATURE_NAME).not_in_environment.append('SERVING')
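To make the `min_domain_mass` setting concrete: it is the minimum fraction of a feature's values that must belong to its domain, so `0.9` tolerates up to 10% unseen values. The check can be sketched in plain Python as follows. The helper names and sample values are hypothetical, and this only illustrates the rule rather than reproducing TFDV's implementation.

```python
def domain_mass(values, domain):
    """Fraction of `values` that belong to `domain`."""
    if not values:
        raise ValueError('values must not be empty')
    in_domain = sum(1 for v in values if v in domain)
    return in_domain / len(values)


def satisfies_min_domain_mass(values, domain, min_domain_mass=1.0):
    """True if at least `min_domain_mass` of the values come from the domain."""
    return domain_mass(values, domain) >= min_domain_mass


# Made-up sample: nine known occupations and one value outside the domain.
occupation_domain = {'Sales', 'Tech-support'}
occupations = ['Sales'] * 5 + ['Tech-support'] * 4 + ['Astronaut']

strict = satisfies_min_domain_mass(occupations, occupation_domain)        # threshold 1.0
relaxed = satisfies_min_domain_mass(occupations, occupation_domain, 0.9)  # threshold 0.9
print(strict, relaxed)
```

With the strict threshold, a single unexpected value fails the check; relaxing it to `0.9` lets the same data pass.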

## 4. Save the Schema

In [None]:
# Use gfile so that this also works when WORKSPACE is a GCS location.
if gfile.exists(RAW_SCHEMA_DIR):
    gfile.rmtree(RAW_SCHEMA_DIR)

gfile.mkdir(RAW_SCHEMA_DIR)

raw_schema_location = os.path.join(RAW_SCHEMA_DIR, 'schema.pbtxt')
tfdv.write_schema_text(schema, raw_schema_location)

### Test loading the saved schema

In [None]:
tfdv.load_schema_text(raw_schema_location)
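The step list at the top of this lab mentions validating evaluation data against the schema, which the cells above do not show. A minimal sketch of that step, assuming a separate evaluation CSV with a header row exists, could look like the following. The `validate_eval_data` helper and its arguments are hypothetical; the TFDV calls used are `generate_statistics_from_csv` and `validate_statistics`.

```python
def validate_eval_data(eval_data_location, schema, weight_feature='fnlwgt'):
    """Compute statistics for an evaluation CSV and check them against `schema`.

    Hypothetical helper for illustration. TFDV is imported lazily so that the
    function can be defined even where the library is not installed.
    """
    import tensorflow_data_validation as tfdv

    eval_stats = tfdv.generate_statistics_from_csv(
        data_location=eval_data_location,
        stats_options=tfdv.StatsOptions(weight_feature=weight_feature),
    )
    # Validate in the EVALUATION environment declared in the schema.
    anomalies = tfdv.validate_statistics(
        statistics=eval_stats,
        schema=schema,
        environment='EVALUATION',
    )
    return eval_stats, anomalies
```

In a notebook, the returned anomalies could then be inspected with `tfdv.display_anomalies(anomalies)`. Note that this lab computes its statistics over every CSV file in `DATA_DIR`, so a dedicated evaluation file would need to be prepared separately.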