# Data Analysis and Schema Generation with TFDV

In this lab, we use [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) (TFDV) to perform the following:

1. **Generate statistics** from the training data.
2. **Visualise and analyse** the generated statistics.
3. **Infer** a **schema** from the generated statistics.
4. **Update** the schema with domain knowledge.
5. **Validate** the evaluation data against the schema.
6. **Save** the schema for later use.


<br/>
<img valign="middle" src="imgs/tfdv.png" width="800">

## Dataset

The dataset used in these labs is the **UCI Adult Dataset**: https://archive.ics.uci.edu/ml/datasets/adult.

It is a classification dataset, where the task is to predict whether income exceeds 50K USD per year based on census data. It is also known as the "Census Income" dataset.

In [1]:
import os
from tensorflow.io import gfile

WORKSPACE = 'workspace' # you can set to a GCS location
DATA_DIR = os.path.join(WORKSPACE, 'data')
RAW_SCHEMA_DIR = os.path.join(WORKSPACE, 'raw_schema')

### 1. Download data

In [2]:
if gfile.exists(WORKSPACE):
    print("Removing previous workspace...")
    gfile.rmtree(WORKSPACE)

print("Creating new workspace...")
gfile.mkdir(WORKSPACE)
print("Creating data directory...")
gfile.mkdir(DATA_DIR)

TRAIN_DATA_FILE = os.path.join(DATA_DIR,'train.csv')
EVAL_DATA_FILE = os.path.join(DATA_DIR,'eval.csv')

print("Downloading raw data...")
gfile.copy(src='gs://cloud-samples-data/ml-engine/census/data/adult.data.csv', dst=os.path.join(DATA_DIR,'file1.csv'))
gfile.copy(src='gs://cloud-samples-data/ml-engine/census/data/adult.test.csv', dst=os.path.join(DATA_DIR,'file2.csv'))
print("Data downloaded.")

Removing previous workspace...
Creating new workspace...
Creating data directory...
Downloading raw data...
Data downloaded.


### 2. Add headers to the CSV files (the CsvExampleGen component expects them)

In [3]:
import pandas as pd

HEADER = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
          'marital_status', 'occupation', 'relationship', 'race', 'gender',
          'capital_gain', 'capital_loss', 'hours_per_week',
          'native_country', 'income_bracket']

pd.read_csv(DATA_DIR +"/file1.csv", names=HEADER).to_csv(DATA_DIR +"/train-01.csv", index=False)
pd.read_csv(DATA_DIR +"/file2.csv", names=HEADER).to_csv(DATA_DIR +"/train-02.csv", index=False)
gfile.remove(DATA_DIR +"/file1.csv")
gfile.remove(DATA_DIR +"/file2.csv")
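
For files too large to load with pandas, the same header step can be sketched with the standard library alone. The helper name `add_header` is hypothetical, the three-column header and sample rows are shortened for the demo, and stripping the space after each comma via `skipinitialspace=True` is an optional extra that the cell above does not perform.

```python
import csv
import io

# Shortened, illustrative header; the lab uses the full 15-column HEADER.
DEMO_HEADER = ['age', 'workclass', 'fnlwgt']


def add_header(src, dst, header, strip_spaces=False):
    """Copy headerless CSV rows from src to dst, writing `header` first.

    src and dst are text file objects. With strip_spaces=True, whitespace
    that follows each delimiter (as in ", State-gov") is dropped on read.
    Returns the number of data rows copied.
    """
    reader = csv.reader(src, skipinitialspace=strip_spaces)
    writer = csv.writer(dst, lineterminator='\n')
    writer.writerow(header)
    rows = 0
    for row in reader:
        if not row:  # skip blank lines
            continue
        writer.writerow(row)
        rows += 1
    return rows


# In-memory stand-ins for the raw and output files.
raw = io.StringIO("39, State-gov,77516\n50, Self-emp-not-inc,83311\n")
out = io.StringIO()
n_rows = add_header(raw, out, DEMO_HEADER, strip_spaces=True)
result = out.getvalue()
print(result)
```

The rest of this lab keeps the leading spaces, which is why the domain values inferred later look like `' Private'` rather than `'Private'`.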

In [4]:
!wc -l $DATA_DIR/train-01.csv
!head $DATA_DIR/train-01.csv

32562 workspace/data/train-01.csv
age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
39, State-gov,77516, Bachelors,13, Never-married, Adm-clerical, Not-in-family, White, Male,2174,0,40, United-States, <=50K
50, Self-emp-not-inc,83311, Bachelors,13, Married-civ-spouse, Exec-managerial, Husband, White, Male,0,0,13, United-States, <=50K
38, Private,215646, HS-grad,9, Divorced, Handlers-cleaners, Not-in-family, White, Male,0,0,40, United-States, <=50K
53, Private,234721, 11th,7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male,0,0,40, United-States, <=50K
28, Private,338409, Bachelors,13, Married-civ-spouse, Prof-specialty, Wife, Black, Female,0,0,40, Cuba, <=50K
37, Private,284582, Masters,14, Married-civ-spouse, Exec-managerial, Wife, White, Female,0,0,40, United-States, <=50K
49, Private,160187, 9th,5, Married-spouse-absent, Other-service, Not-in-family, Black,

## TensorFlow Data Validation for Schema Generation

In [5]:
import tensorflow_data_validation as tfdv

TARGET_FEATURE_NAME = 'income_bracket'
WEIGHT_FEATURE_NAME = 'fnlwgt'

## 1. Compute Statistics

In [6]:
train_stats = tfdv.generate_statistics_from_csv(
    data_location=DATA_DIR+'/*.csv',
    column_names=None, # the CSV data files include a header row
    stats_options=tfdv.StatsOptions(
        weight_feature=WEIGHT_FEATURE_NAME,
        sample_rate=1.0
    )
)



Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


In [7]:
tfdv.visualize_statistics(train_stats)
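
Setting `weight_feature` in `StatsOptions` asks TFDV to treat each row's `fnlwgt` value as its weight when computing weighted statistics. The plain-Python sketch below only illustrates the idea of a weighted mean; it is not TFDV's implementation, `weighted_mean` is a hypothetical helper, and the weights are made up for illustration.

```python
def weighted_mean(values, weights):
    """Return the mean of `values`, with each value counted by its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# hours_per_week for three rows, with made-up example weights.
hours = [40, 13, 40]
weights = [1.0, 3.0, 1.0]

unweighted = sum(hours) / len(hours)       # every row counts once
weighted = weighted_mean(hours, weights)   # the second row counts three times

print(unweighted, weighted)
```

Because the second row carries the largest weight, the weighted mean is pulled towards its value of 13.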

## 2. Infer Schema

In [8]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

| Feature name | Type | Presence | Valency | Domain |
|---|---|---|---|---|
| 'race' | STRING | required | | 'race' |
| 'capital_gain' | INT | required | | - |
| 'hours_per_week' | INT | required | | - |
| 'capital_loss' | INT | required | | - |
| 'gender' | STRING | required | | 'gender' |
| 'occupation' | STRING | required | | 'occupation' |
| 'education_num' | INT | required | | - |
| 'native_country' | STRING | required | | 'native_country' |
| 'income_bracket' | STRING | required | | 'income_bracket' |
| 'workclass' | STRING | required | | 'workclass' |


| Domain | Values |
|---|---|
| 'race' | ' Amer-Indian-Eskimo', ' Asian-Pac-Islander', ' Black', ' Other', ' White' |
| 'gender' | ' Female', ' Male' |
| 'occupation' | ' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair', ' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners', ' Machine-op-inspct', ' Other-service', ' Priv-house-serv', ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support', ' Transport-moving' |
| 'native_country' | ' ?', ' Cambodia', ' Canada', ' China', ' Columbia', ' Cuba', ' Dominican-Republic', ' Ecuador', ' El-Salvador', ' England', ' France', ' Germany', ' Greece', ' Guatemala', ' Haiti', ' Holand-Netherlands', ' Honduras', ' Hong', ' Hungary', ' India', ' Iran', ' Ireland', ' Italy', ' Jamaica', ' Japan', ' Laos', ' Mexico', ' Nicaragua', ' Outlying-US(Guam-USVI-etc)', ' Peru', ' Philippines', ' Poland', ' Portugal', ' Puerto-Rico', ' Scotland', ' South', ' Taiwan', ' Thailand', ' Trinadad&Tobago', ' United-States', ' Vietnam', ' Yugoslavia' |
| 'income_bracket' | ' <=50K', ' >50K' |
| 'workclass' | ' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private', ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay' |
| 'marital_status' | ' Divorced', ' Married-AF-spouse', ' Married-civ-spouse', ' Married-spouse-absent', ' Never-married', ' Separated', ' Widowed' |
| 'education' | ' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate', ' HS-grad', ' Masters', ' Preschool', ' Prof-school', ' Some-college' |
| 'relationship' | ' Husband', ' Not-in-family', ' Other-relative', ' Own-child', ' Unmarried', ' Wife' |
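
Each STRING feature is tied to a domain, and validation checks how much of the data falls inside it. As a rough illustration only (not TFDV's code), the fraction of in-domain values can be computed as below. `domain_mass` is a hypothetical helper, the observed values are made up, and the thresholds play the role of the schema's `min_domain_mass` setting.

```python
def domain_mass(values, domain):
    """Return the fraction of `values` that belong to `domain`."""
    if not values:
        raise ValueError("values must not be empty")
    allowed = set(domain)
    in_domain = sum(1 for v in values if v in allowed)
    return in_domain / len(values)


# Domain as inferred above (note the leading spaces).
GENDER_DOMAIN = [' Female', ' Male']

# Made-up observations; the last one lacks the leading space.
observed = [' Male', ' Female', ' Male', 'Male']

mass = domain_mass(observed, GENDER_DOMAIN)
unexpected = sorted(set(observed) - set(GENDER_DOMAIN))

passes_strict = mass >= 1.0    # every value must be in the domain
passes_relaxed = mass >= 0.7   # a relaxed threshold tolerates some strays

print(mass, unexpected, passes_strict, passes_relaxed)
```

A strict threshold flags even a single out-of-domain value, while a relaxed one tolerates a small share of them.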


## 3. Alter the Schema

In [9]:
# Relax the minimum fraction of values that must come from the domain for feature occupation.
occupation = tfdv.get_feature(schema, 'occupation')
occupation.distribution_constraints.min_domain_mass = 0.9

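# Add a new value to the domain of feature native_country.
# Note: the inferred values keep the CSV's leading space (e.g. ' Cuba'), so
# unstripped raw data would need ' Egypt' here to match.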
native_country_domain = tfdv.get_domain(schema, 'native_country')
native_country_domain.value.append('Egypt')

# By default, all features are in the TRAINING, EVALUATION, and SERVING environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('EVALUATION')
schema.default_environment.append('SERVING')

# Specify that the class (target) feature is not in the SERVING environment.
tfdv.get_feature(schema, TARGET_FEATURE_NAME).not_in_environment.append('SERVING')
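
The environment settings can be read as a simple membership rule: a feature is expected in an environment when that environment is listed in the schema defaults (or in the feature's own `in_environment` list) and is not listed in the feature's `not_in_environment`. The sketch below models that rule with plain dictionaries. It is an illustration under that reading of the schema fields; `expected_features` is a hypothetical helper, not a TFDV function, and only two features are modelled.

```python
# Mirrors schema.default_environment after the cell above.
DEFAULT_ENVIRONMENTS = ['TRAINING', 'EVALUATION', 'SERVING']

# Per-feature overrides; 'in' and 'not_in' stand in for the
# in_environment and not_in_environment schema fields.
FEATURES = {
    'age': {'in': [], 'not_in': []},
    'income_bracket': {'in': [], 'not_in': ['SERVING']},
}


def expected_features(features, default_envs, env):
    """Return the names of features expected to be present in `env`."""
    expected = []
    for name, spec in features.items():
        included = env in default_envs or env in spec['in']
        excluded = env in spec['not_in']
        if included and not excluded:
            expected.append(name)
    return expected


training = expected_features(FEATURES, DEFAULT_ENVIRONMENTS, 'TRAINING')
serving = expected_features(FEATURES, DEFAULT_ENVIRONMENTS, 'SERVING')

print(training)
print(serving)
```

In TFDV itself, the environment is selected at validation time, for example by passing `environment='SERVING'` to `tfdv.validate_statistics`, so that a missing label column in serving data is not reported as an anomaly.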

## 4. Save the Schema

In [10]:
# Use gfile (imported earlier) so this also works when WORKSPACE is a GCS location.
if gfile.exists(RAW_SCHEMA_DIR):
    gfile.rmtree(RAW_SCHEMA_DIR)

gfile.mkdir(RAW_SCHEMA_DIR)

raw_schema_location = os.path.join(RAW_SCHEMA_DIR, 'schema.pbtxt')
tfdv.write_schema_text(schema, raw_schema_location)

### Test loading the saved schema

In [11]:
tfdv.load_schema_text(raw_schema_location)

feature {
  name: "race"
  type: BYTES
  domain: "race"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "capital_gain"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "hours_per_week"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "capital_loss"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "gender"
  type: BYTES
  domain: "gender"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "occupation"
  type: BYTES
  domain: "occupation"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  distribution_constraints {
    min_domain_mass: 0.9
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {