# Analyzing the data splits

**Achievement:** Illustrate how to validate the data splits.

## Introduction

How you split your data has major implications for the design of your ML workflow. If the data distributions of your splits differ, the evaluation of your machine learning model can be misleading. At a minimum, split your dataset into two sets:

 - Training: The sample of data used to train the model. It is the largest of the samples used when creating a machine learning model.
 - Testing: A second sample, held out from training, used to check whether the model predicts or classifies correctly on data it has not seen before. When only two splits are used, this set is sometimes also called the validation set.

In practice, three splits are recommended: training, validation, and testing. For this workshop, we keep only the two splits described above.
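
The notebook loads splits that were created beforehand, so the splitting step itself is not shown. As a minimal sketch of one way to produce a split whose label distribution matches the full dataset (a stratified split), the following uses plain pandas on a toy DataFrame. The `label` column, the 80/20 ratio, and the `stratified_split` helper are illustrative assumptions, not the workshop's actual procedure.

```python
import pandas as pd


def stratified_split(df, label_col, test_frac=0.2, seed=42):
    """Return (train, test) with roughly the same label proportions in each."""
    # Sample the same fraction of rows from every label group for the test set.
    test = df.groupby(label_col).sample(frac=test_frac, random_state=seed)
    # Every row that was not sampled goes to the training set.
    train = df.drop(test.index)
    return train, test


# Toy data (hypothetical): 70 negative and 30 positive examples.
toy = pd.DataFrame({"feature": range(100), "label": [0] * 70 + [1] * 30})
train, test = stratified_split(toy, "label")

# Both splits should show the same share of each label.
print(train["label"].value_counts(normalize=True).sort_index())
print(test["label"].value_counts(normalize=True).sort_index())
```

Sampling per label group keeps the class balance of both splits close to that of the full dataset, which is the property the rest of this notebook checks visually.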

We will use [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started) (TFDV) to visualize and compare the distributions of the train and test sets.


# Reproducibility and code formatting

In [1]:
# To watermark the environment
%load_ext watermark

# For automatic code formatting in jupyter lab.
%load_ext lab_black

# For automatic code formatting in jupyter notebook
%load_ext nb_black

# For better logging
%load_ext rich

# Analysis

We proceed to the analysis part.

In [2]:
# Imports
# -------

# System
import sys

# Logging
import logging

# Rich logging in jupyter
from rich.logging import RichHandler

FORMAT = "%(message)s"
logging.basicConfig(
    level="WARN", format=FORMAT, datefmt="[%X]", handlers=[RichHandler()]
)

log = logging.getLogger("rich")

# Packages
import pandas as pd
import tensorflow_data_validation as tfdv
from sklearn.model_selection import train_test_split
import workshop.transform as wt

2022-09-18 21:51:47.294524: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
# Load the data:
train_path = "../data/train/diabetes_binary_train.csv.zip"
test_path = "../data/test/diabetes_binary_test.csv.zip"
data_train = pd.read_csv(train_path, compression="zip")
data_test = pd.read_csv(test_path, compression="zip")
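
Before transforming anything, a quick sanity check on the loaded splits can catch obvious problems such as mismatched columns or an unexpected split ratio. The following is a minimal sketch on toy DataFrames so that it runs on its own; the `check_splits` helper is hypothetical and not part of the workshop code.

```python
import pandas as pd


def check_splits(train, test):
    """Check that both splits share the same columns and return the test fraction."""
    if list(train.columns) != list(test.columns):
        raise ValueError("train and test splits have different columns")
    return len(test) / (len(train) + len(test))


# Toy splits (hypothetical values) standing in for data_train and data_test.
toy_train = pd.DataFrame({"Age": [1, 2, 3, 4], "Sex": [0, 1, 0, 1]})
toy_test = pd.DataFrame({"Age": [5], "Sex": [1]})
print(f"test fraction: {check_splits(toy_train, toy_test):.2f}")
```

With the notebook's data, the same call would be `check_splits(data_train, data_test)`.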

In [4]:
# TFDV should treat these columns as categorical, so we convert them to str.
# We first cast the floats to int and then to str, so values read "1" rather than "1.0".

cols_to_int = [
    "Diabetes_binary",
    "HighBP",
    "HighChol",
    "CholCheck",
    "Smoker",
    "Stroke",
    "HeartDiseaseorAttack",
    "PhysActivity",
    "Fruits",
    "Veggies",
    "HvyAlcoholConsump",
    "AnyHealthcare",
    "NoDocbcCost",
    "GenHlth",
    "MentHlth",
    "PhysHlth",
    "DiffWalk",
    "Sex",
    "Age",
    "Education",
    "Income",
]

# The same columns are converted to str afterwards, so reuse the list above.
cols_to_str = list(cols_to_int)

In [5]:
# Transform both splits by chaining the steps with DataFrame.pipe.
# This illustrates how to build a pipeline of transformations.
# Note that the original DataFrames are not modified.

data_train_transformed = (
    data_train.pipe(wt.start_pipeline)
    .pipe(wt.cols_to_int, cols=cols_to_int)
    .pipe(wt.cols_to_str, cols=cols_to_str)
)

data_test_transformed = (
    data_test.pipe(wt.start_pipeline)
    .pipe(wt.cols_to_int, cols=cols_to_int)
    .pipe(wt.cols_to_str, cols=cols_to_str)
)

Executed step start_pipeline shape=(56553, 22) took 0:00:00.006054s
Executed step cols_to_int shape=(56553, 22) took 0:00:00.398762s
Executed step cols_to_str shape=(56553, 22) took 0:00:00.221889s
Executed step start_pipeline shape=(14139, 22) took 0:00:00.001456s
Executed step cols_to_int shape=(14139, 22) took 0:00:00.097318s
Executed step cols_to_str shape=(14139, 22) took 0:00:00.059389s
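
The `workshop.transform` module is not shown in this notebook. As a rough sketch of how such pipeline steps could be written, the following defines a logging decorator and three steps with the same names. The implementations are assumptions modeled on the log lines above, not the workshop's actual code.

```python
import datetime as dt
from functools import wraps

import pandas as pd


def log_step(func):
    """Print the step name, the resulting shape and the elapsed time."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
        elapsed = dt.datetime.now() - tic
        print(f"Executed step {func.__name__} shape={result.shape} took {elapsed}s")
        return result

    return wrapper


@log_step
def start_pipeline(df):
    # Work on a copy so the caller's DataFrame is left untouched.
    return df.copy()


@log_step
def cols_to_int(df, cols):
    # astype with a mapping returns a new DataFrame with only these columns cast.
    return df.astype({col: int for col in cols})


@log_step
def cols_to_str(df, cols):
    return df.astype({col: str for col in cols})


# Toy data (hypothetical values) with one categorical and one numeric column.
raw = pd.DataFrame({"HighBP": [1.0, 0.0], "BMI": [27.5, 31.0]})
out = (
    raw.pipe(start_pipeline)
    .pipe(cols_to_int, cols=["HighBP"])
    .pipe(cols_to_str, cols=["HighBP"])
)
```

Starting the chain with a step that copies the DataFrame is what keeps the original data unchanged, while the decorator gives a one-line trace per step.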


In [6]:
# We generate the statistics for each split
train_stats = tfdv.generate_statistics_from_dataframe(data_train_transformed)
test_stats = tfdv.generate_statistics_from_dataframe(data_test_transformed)

# Then we visualize the splits
tfdv.visualize_statistics(
    lhs_statistics=train_stats,
    lhs_name="train_split",
    rhs_statistics=test_stats,
    rhs_name="test_split",
)

See [https://www.tensorflow.org/tfx/guide/tfdv](https://www.tensorflow.org/tfx/guide/tfdv) for a detailed explanation of what this tool can do.

In the example above, try the following:
1. Select a feature, and in the options to the right, click on `expand` and `percentage`. This will help you compare the distribution percentages between the splits.
2. Compare every feature using this procedure. Can you find features where the percentages differ significantly?
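
The visual comparison can be complemented with a numeric one. As a minimal sketch, the following computes, for each categorical column, the largest absolute difference in category share between two splits. The `compare_distributions` helper and the toy values are hypothetical, and what counts as a significant difference is left to your judgment.

```python
import pandas as pd


def compare_distributions(train, test, columns):
    """Return, per column, the largest absolute gap in category share."""
    rows = []
    for col in columns:
        train_pct = train[col].value_counts(normalize=True)
        test_pct = test[col].value_counts(normalize=True)
        # Align on the union of categories; a category missing from one split counts as 0.
        gap = train_pct.sub(test_pct, fill_value=0).abs()
        rows.append({"feature": col, "max_abs_diff": gap.max()})
    # Largest gaps first, so the most suspicious features are at the top.
    return pd.DataFrame(rows).sort_values("max_abs_diff", ascending=False)


# Toy splits (hypothetical values) standing in for the transformed DataFrames.
toy_train = pd.DataFrame({"Smoker": ["0", "0", "1", "1"], "Sex": ["0", "1", "0", "1"]})
toy_test = pd.DataFrame({"Smoker": ["0", "1", "1", "1"], "Sex": ["0", "1", "1", "0"]})

report = compare_distributions(toy_train, toy_test, ["Smoker", "Sex"])
print(report)
```

With the notebook's data, the equivalent call would be `compare_distributions(data_train_transformed, data_test_transformed, cols_to_str)`.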

# Watermark

This should be the last section of your notebook, because it records your entire environment.

Before committing this notebook, restart the kernel, rerun the whole notebook, and run this cell last so that the watermark reflects the environment actually used.

In [7]:
%watermark -gb -iv -m -v

Python implementation: CPython
Python version       : 3.8.13
IPython version      : 8.5.0

Compiler    : Clang 12.0.1 
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 12
Architecture: 64bit

Git hash: 3e6b63f1cb3184ba20c78cccd2a54c81a3180ad2

Git branch: splits_analysis

sys                       : 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:06:49) 
[Clang 12.0.1 ]
workshop                  : 0.0.1
pandas                    : 1.4.3
logging                   : 0.5.1.2
tensorflow_data_validation: 1.10.0

