<a href="https://colab.research.google.com/github/srivatsan88/Tensorflow_Extended_Notebook/blob/master/TFX_Visualize_Distribution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook accompanies my LinkedIn article: https://www.linkedin.com/pulse/tensorflow-extended-tfx-data-analysis-validation-drift-srinivasan/


In [0]:
from __future__ import print_function
import sys,tempfile, urllib, os

TFX components currently run on Python 2.x. Make sure you have the right Python environment.

In [2]:
sys.version_info.major

2

Install the TensorFlow Data Validation library.

In [0]:
!pip install -q tensorflow_data_validation
import tensorflow_data_validation as tfdv

print('TFDV version: {}'.format(tfdv.version.__version__))

Create a file on the local system to store the downloaded dataset.

In [0]:
BASE_DIR = '/tmp'
OUTPUT_FILE = os.path.join(BASE_DIR, 'churn_data.csv')

Download the Watson Telecom dataset and store it on local disk.

In [0]:
churn_data=urllib.urlretrieve('https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv', OUTPUT_FILE)


Do not worry about the dataset details for now. Let us quickly run the TensorFlow Data Validation (TFDV) component to generate statistics on the file. Later in the notebook we will show an alternate way to load the data and also try to understand it, since this is the dataset we will be working with in the remaining part of the series.

In [0]:
train_stats = tfdv.generate_statistics_from_csv(data_location=OUTPUT_FILE)

Visualize the generated statistics. TensorFlow Data Validation provides tools for visualizing the distribution of feature values, and by examining these distributions you can catch common problems with the data. One quick observation is that the SeniorCitizen column below has around 84% zeros. Play around with the different charts below, and search for any particular feature you want to inspect.

In [7]:
tfdv.visualize_statistics(train_stats)
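The share of zeros reported in the visualization can be cross-checked by hand. The sketch below is only an illustration on a small made-up list, not the real SeniorCitizen column, and `zero_fraction` is a hypothetical helper rather than part of TFDV.

```python
def zero_fraction(values):
    """Return the fraction of entries equal to zero (0.0 for empty input)."""
    values = list(values)
    if not values:
        return 0.0
    zeros = sum(1 for v in values if v == 0)
    return float(zeros) / len(values)


# Made-up sample standing in for a binary column such as SeniorCitizen.
sample = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(zero_fraction(sample))
```

On the real data, the same idea could be applied to the SeniorCitizen column once the file is loaded into a dataframe later in the notebook.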

Let us now create a schema for our data using the infer_schema method. A schema defines constraints for the data that are relevant for ML. Example constraints include the data type of each feature, whether it is numerical or categorical, and how frequently it is present in the data. For categorical features the schema also defines the domain, that is, the list of acceptable values. Since writing a schema can be a tedious task, especially for datasets with lots of features, TFDV provides a method to generate an initial version of the schema based on the descriptive statistics.

Getting the schema right is important because the rest of our production pipeline will rely on the schema that TFDV generates being correct. The schema also serves as documentation for the data.

In [8]:
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)

Feature name,Type,Presence,Valency,Domain
'OnlineSecurity',STRING,required,,'OnlineSecurity'
'SeniorCitizen',INT,required,,-
'Partner',STRING,required,,'Partner'
'DeviceProtection',STRING,required,,'DeviceProtection'
'MonthlyCharges',FLOAT,required,,-
'OnlineBackup',STRING,required,,'OnlineBackup'
'MultipleLines',STRING,required,,'MultipleLines'
'gender',STRING,required,,'gender'
'StreamingTV',STRING,required,,'StreamingTV'
'Contract',STRING,required,,'Contract'


Domain,Values
'OnlineSecurity',"'No', 'No internet service', 'Yes'"
'Partner',"'No', 'Yes'"
'DeviceProtection',"'No', 'No internet service', 'Yes'"
'OnlineBackup',"'No', 'No internet service', 'Yes'"
'MultipleLines',"'No', 'No phone service', 'Yes'"
'gender',"'Female', 'Male'"
'StreamingTV',"'No', 'No internet service', 'Yes'"
'Contract',"'Month-to-month', 'One year', 'Two year'"
'StreamingMovies',"'No', 'No internet service', 'Yes'"
'InternetService',"'DSL', 'Fiber optic', 'No'"
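To make the idea of a domain concrete, here is a minimal pure-Python sketch of how a list of acceptable values could be collected for string columns. It is not how TFDV infers its schema; `infer_domains` and its `max_domain_size` cutoff are hypothetical names chosen for this illustration, and the three rows reuse only a few columns from the dataset.

```python
def infer_domains(rows, max_domain_size=10):
    """Collect the distinct values of every string-valued column.

    rows is a list of dicts, one per record. Columns with more than
    max_domain_size distinct strings are treated as free text and skipped.
    The cutoff is an arbitrary choice for this sketch.
    """
    seen = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, str):
                seen.setdefault(name, set()).add(value)
    return {name: sorted(vals) for name, vals in seen.items()
            if len(vals) <= max_domain_size}


rows = [
    {'gender': 'Female', 'Partner': 'Yes', 'MonthlyCharges': 29.85},
    {'gender': 'Male', 'Partner': 'No', 'MonthlyCharges': 56.95},
    {'gender': 'Male', 'Partner': 'No', 'MonthlyCharges': 53.85},
]
domains = infer_domains(rows)
print(domains['gender'])
print(domains['Partner'])
```

Numeric columns such as MonthlyCharges get no domain in this sketch, which mirrors the `-` shown in the Domain column of the schema table.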


Let us now load the file downloaded earlier into a pandas dataframe and split the dataset so we can compare the distributions and schema of the two parts against each other.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow_data_validation.statistics import stats_options as options

In [0]:
churn_df = pd.read_csv(OUTPUT_FILE)

Quickly scroll through to understand the columns and the dataset. We will be using this dataset for the rest of this series.

The data set includes information about:

Customers who left within the last month – the column is called Churn

Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

Demographic info about customers – gender, age range, and if they have partners and dependents

In [11]:
churn_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


Split the dataset into train and eval sets. We will use these to check for distribution differences between the train and eval datasets.

In [0]:
churn_train, churn_eval = train_test_split(churn_df, test_size=0.2)

In [13]:
churn_eval.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
6628,9979-RGMZT,Female,0,No,No,7,Yes,No,Fiber optic,No,...,No,No,Yes,Yes,One year,Yes,Mailed check,94.05,633.45,No
6435,0298-XACET,Male,0,Yes,Yes,52,No,No phone service,DSL,No,...,Yes,Yes,Yes,No,Two year,No,Mailed check,50.2,2554.0,No
6286,2717-HVIZY,Female,0,No,Yes,8,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,20.05,163.6,No
936,5709-LVOEQ,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,...,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
135,7799-LGRDP,Female,0,No,No,43,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,Yes,Credit card (automatic),25.7,1188.2,No


To show a difference between the train and eval datasets, I am manually changing the MultipleLines categorical column to introduce a new category in the eval dataset. This simulates skew between the training and evaluation data.

In [0]:
churn_eval.iloc[0, churn_eval.columns.get_loc('MultipleLines')] ='All phone Service'

Generate and visualize statistics for the train dataframe. This is similar to what we did above.

In [0]:
train_df_stats=tfdv.generate_statistics_from_dataframe(churn_train,stats_options=options.StatsOptions(),n_jobs=3)

In [16]:
tfdv.visualize_statistics(train_df_stats)

Generate statistics for the eval dataset here.

In [0]:
eval_df_stats=tfdv.generate_statistics_from_dataframe(churn_eval,stats_options=options.StatsOptions(),n_jobs=3)

So far we've visualized each dataset individually. It's important that our evaluation data is consistent with our training data, including that it uses the same schema. It's also important that the evaluation data includes examples of roughly the same ranges of values for our numerical features as our training data, so that our coverage of the loss surface during evaluation is roughly the same as during training. The same is true for categorical features. Otherwise, we may have training issues that are not identified during evaluation, because we didn't evaluate part of our loss surface.

Notice in the graph below that each feature now includes statistics for both the training and evaluation datasets.
Notice that the charts now have both the training and evaluation datasets overlaid, making it easy to compare them.
Notice that the charts now include a percentages view, which can be combined with log or the default linear scales.


In [18]:
tfdv.visualize_statistics(lhs_statistics=eval_df_stats, rhs_statistics=train_df_stats,lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')
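The range-coverage concern for numerical features can also be checked numerically rather than by eye. The following is a rough sketch under simple assumptions: `uncovered_range` is a hypothetical helper, it only compares minimum and maximum values, and the tenure-like numbers are made up for illustration.

```python
def uncovered_range(train_values, eval_values):
    """Return the parts of the training range that eval data never reaches.

    The result is a list of (low, high) gaps. An empty list means the eval
    values span the whole training range. Both inputs must be non-empty.
    """
    t_lo, t_hi = min(train_values), max(train_values)
    e_lo, e_hi = min(eval_values), max(eval_values)
    gaps = []
    if e_lo > t_lo:
        gaps.append((t_lo, min(e_lo, t_hi)))
    if e_hi < t_hi:
        gaps.append((max(e_hi, t_lo), t_hi))
    return gaps


# Made-up values standing in for a numeric feature such as tenure.
train_tenure = [1, 2, 34, 45, 72]
eval_tenure = [7, 8, 43, 52]
print(uncovered_range(train_tenure, eval_tenure))
```

A min/max comparison says nothing about gaps in the middle of the range, which is where the overlaid histograms above remain more informative.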

The validate_statistics method can be used to check for schema anomalies or data drift. Below we validate the eval dataframe statistics against the schema inferred earlier. If you examine the output, the report highlights an additional category in the MultipleLines feature of the eval dataframe that is not present in the schema.

In [19]:
eval_anomalies = tfdv.validate_statistics(statistics=eval_df_stats, schema=schema)
tfdv.display_anomalies(eval_anomalies)

Feature name,Anomaly short description,Anomaly long description
'MultipleLines',Unexpected string values,Examples contain values missing from the schema: All phone Service (<1%).
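Conceptually, the "Unexpected string values" anomaly amounts to finding values that fall outside a feature's domain and reporting how often they occur. The sketch below illustrates that idea in plain Python. It is not TFDV's implementation, `unexpected_values` is a hypothetical helper, and the ten eval values are made up.

```python
from collections import Counter


def unexpected_values(values, domain):
    """Map each value outside the domain to its share of all values."""
    values = list(values)
    counts = Counter(v for v in values if v not in domain)
    total = float(len(values))
    return {v: n / total for v, n in counts.items()}


# Domain of MultipleLines as listed in the inferred schema.
domain = {'No', 'No phone service', 'Yes'}
# Made-up eval column with one injected category.
eval_values = ['No'] * 6 + ['Yes'] * 3 + ['All phone Service']
print(unexpected_values(eval_values, domain))
```

In the notebook only one row of the eval split was changed, which is why the real report shows the new value at under 1%.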


We can update the schema manually in case it evolves over time or additional metadata needs to be fed into it.

In [0]:
# Add the new value to the domain of the MultipleLines feature.
multiple_lines_domain = tfdv.get_domain(schema, 'MultipleLines')
multiple_lines_domain.value.append('All phone Service')

In [21]:
eval_anomalies = tfdv.validate_statistics(statistics=eval_df_stats, schema=schema)
tfdv.display_anomalies(eval_anomalies)

In addition to comparing against the schema, validate_statistics can also be used with a skew comparator and a drift comparator. The display_anomalies function will indicate features that are skewed or have drifted above the set threshold.

In [22]:
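# Threshold for training/serving skew on MultipleLines.
multiple_lines_skew = tfdv.get_feature(schema, 'MultipleLines')
multiple_lines_skew.skew_comparator.infinity_norm.threshold = 0.001

# Threshold for drift on TotalCharges.
totalcharges_comp = tfdv.get_feature(schema, 'TotalCharges')
totalcharges_comp.drift_comparator.infinity_norm.threshold = 0.001

# Only previous_statistics is passed here, so only the drift comparator is
# evaluated; the skew comparator is checked against serving_statistics.
skew_anomalies = tfdv.validate_statistics(eval_df_stats, schema,
                                          previous_statistics=train_df_stats)
tfdv.display_anomalies(skew_anomalies)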

Feature name,Anomaly short description,Anomaly long description
'TotalCharges',High Linfty distance between current and previous,"The Linfty distance between current and previous is 0.00817236 (up to six significant digits), above the threshold 0.001. The feature value with maximum difference is: 20.2"
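The L-infinity distance in this report is the largest absolute difference between the relative frequencies of any single value in the two datasets. A minimal sketch of that calculation follows. It is an illustration rather than TFDV's code, `linf_distance` is a hypothetical helper, both inputs are assumed to be non-empty, and the two small lists are made up.

```python
from collections import Counter


def linf_distance(current, previous):
    """Return (largest frequency gap, value where it occurs)."""
    cur, prev = Counter(current), Counter(previous)
    n_cur, n_prev = float(len(current)), float(len(previous))
    best_value, best_diff = None, 0.0
    for value in set(cur) | set(prev):
        # Counter returns 0 for values missing from one side.
        diff = abs(cur[value] / n_cur - prev[value] / n_prev)
        if diff > best_diff:
            best_value, best_diff = value, diff
    return best_diff, best_value


previous = ['a'] * 5 + ['b'] * 5          # 50% a, 50% b
current = ['a'] * 7 + ['b'] * 2 + ['c']   # 70% a, 20% b, 10% c
distance, value = linf_distance(current, previous)
threshold = 0.001  # same threshold as set on the comparators above
print(distance > threshold, value)
```

With a threshold as low as 0.001, ordinary sampling differences between a random train and eval split are enough to trigger the anomaly, which is consistent with the TotalCharges report above.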
