# Statistical analysis of the raw data
In this notebook, we:

* Generate and visualize statistics for each dataset split (training, validation and test)
* Infer a dataset schema based on the training dataset
* Check for possible anomalies in validation and test datasets based on the schema from the training dataset

In [25]:
from datasets import load_from_disk
from PIL import Image
import tensorflow_data_validation as tfdv
import pandas
import os

### Load the full dataset and remove irrelevant columns

In [26]:
# Load dataset
dataset = load_from_disk("../data/raw/DocLayNet-small")
dataset.set_format("pandas")

In [27]:
# Remove columns that are irrelevant for this analysis (from all splits)
dataset = dataset.remove_columns([
    'bboxes_line', 'page_hash', 'original_filename', 'page_no', 'coco_width',
    'coco_height', 'collection', 'original_width', 'original_height', 'num_pages',
])
ds_train = dataset['train'][:]  # training split as a pandas DataFrame
# Cast category ids to strings so TFDV profiles them as categorical values
ds_train['categories'] = ds_train['categories'].apply(lambda x: [str(i) for i in x])
columns = list(dataset['train'].features.keys())

### Generate and visualize statistics of training dataset

In [28]:
# Generate statistics
stats_options = tfdv.StatsOptions(feature_allowlist=columns)
train_stats = tfdv.generate_statistics_from_dataframe(ds_train.drop(columns='image'), stats_options)

# Visualize Statistics
tfdv.visualize_statistics(train_stats)

### Infer the data schema from the training dataset statistics

In [29]:
# Infer the data schema by using the training statistics
schema = tfdv.infer_schema(train_stats)

# Display the data schema
tfdv.display_schema(schema)

# Check number of features
print(f"Number of features in schema: {len(schema.feature)}")

| Feature name | Type | Presence | Valency | Domain |
| --- | --- | --- | --- | --- |
| 'id' | BYTES | required | | - |
| 'texts' | BYTES | required | | - |
| 'bboxes_block' | INT | required | | - |
| 'categories' | BYTES | required | | - |
| 'doc_category' | STRING | required | | 'doc_category' |

| Domain | Values |
| --- | --- |
| 'doc_category' | 'financial_reports', 'government_tenders', 'laws_and_regulations', 'manuals', 'patents', 'scientific_articles' |

Number of features in schema: 5
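
To make the "Presence" and "Domain" columns of the schema more concrete, the following is a minimal sketch of how such properties could be derived from a single string column. It is a hand-written illustration in plain Python, not TFDV's implementation; the function name `infer_feature_summary` and the toy values are hypothetical.

```python
def infer_feature_summary(values):
    """Summarise one string column: is it always present, and which values occur?"""
    # Treat None as a missing value (an assumption of this sketch)
    present = [v for v in values if v is not None]
    return {
        # 'required' here means: no missing values were observed
        "presence": "required" if len(present) == len(values) else "optional",
        # The domain is the set of distinct observed values, sorted for display
        "domain": sorted(set(present)),
    }


# Hypothetical toy column, not taken from DocLayNet
train_doc_category = ["patents", "manuals", "patents", "financial_reports"]
summary = infer_feature_summary(train_doc_category)
print(summary)
```

Because the domain is built only from values seen in training, any new value in another split falls outside it, which is what the anomaly checks later in this notebook rely on.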


## Compare Train and Validation statistics

We observe that both splits contain the same features and that the feature "doc_category" takes the same categorical values in both.
The 'bboxes_block' feature, the 'categories' feature (the bounding-box categories) and the 'doc_category' feature have very similar distributions in train and validation, which is the desired outcome.
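
One way to put a number on "very similar distributions" for a categorical feature is the L-infinity distance: the largest absolute difference in relative frequency over all values. The sketch below computes it in plain Python on hypothetical toy samples; the helper names are illustrative and this is not a TFDV call (TFDV's schema comparators expose an `infinity_norm` threshold for a comparable purpose).

```python
from collections import Counter


def normalized_counts(values):
    """Return each distinct value's share of the total as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def linf_distance(values_a, values_b):
    """Largest absolute difference in relative frequency over all values."""
    dist_a = normalized_counts(values_a)
    dist_b = normalized_counts(values_b)
    all_values = set(dist_a) | set(dist_b)
    # default=0.0 covers the case where both inputs are empty
    return max(
        (abs(dist_a.get(v, 0.0) - dist_b.get(v, 0.0)) for v in all_values),
        default=0.0,
    )


# Hypothetical toy samples, not taken from DocLayNet
train_sample = ["manuals", "patents", "patents", "manuals"]
val_sample = ["manuals", "patents", "patents", "patents"]
distance = linf_distance(train_sample, val_sample)
print(f"L-infinity distance: {distance:.2f}")
```

A distance of 0 means identical relative frequencies, and 1 means the two samples share no values. What counts as "similar enough" is a project-specific threshold.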

In [30]:
ds_val = dataset['validation'][:]
ds_val['categories'] = ds_val['categories'].apply(lambda x: [str(i) for i in x])

# Generate validation statistics
val_stats = tfdv.generate_statistics_from_dataframe(ds_val.drop(columns='image'), stats_options=stats_options)

# Compare validation data with training data
tfdv.visualize_statistics(lhs_statistics=val_stats, rhs_statistics=train_stats,
                          lhs_name='VAL_DATASET', rhs_name='TRAIN_DATASET')

### Anomaly detection in validation set

In [31]:
# Check the validation statistics against the schema inferred from the training data
anomalies = tfdv.validate_statistics(val_stats, schema)
tfdv.display_anomalies(anomalies)
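
As an illustration of one kind of anomaly such a check can surface, the sketch below looks for string values that fall outside a feature's domain. The domain is copied from the inferred schema shown earlier; the function name `find_unexpected_values` and the sample column (including the value `invoices`) are hypothetical, and the logic is a simplified stand-in rather than TFDV's own validation.

```python
# Domain of 'doc_category' as listed in the inferred schema
DOC_CATEGORY_DOMAIN = {
    "financial_reports", "government_tenders", "laws_and_regulations",
    "manuals", "patents", "scientific_articles",
}


def find_unexpected_values(values, domain):
    """Return out-of-domain values in order of first appearance, without duplicates."""
    seen = set()
    unexpected = []
    for value in values:
        if value not in domain and value not in seen:
            seen.add(value)
            unexpected.append(value)
    return unexpected


# Hypothetical validation column containing one out-of-domain value
val_doc_category = ["patents", "manuals", "invoices", "invoices"]
unexpected = find_unexpected_values(val_doc_category, DOC_CATEGORY_DOMAIN)
print(unexpected)
```

An empty result corresponds to the "no anomalies" case for this particular check. Other anomaly types, such as missing features or type mismatches, are not covered by this sketch.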

## Compare Train and Test statistics

We observe that both splits contain the same features and that the feature "doc_category" takes the same categorical values in both.
The 'bboxes_block' feature, the 'categories' feature (the bounding-box categories) and the 'doc_category' feature have very similar distributions in train and test, which is the desired outcome.

In [32]:
# Reuse the inferred schema so that feature types in the test statistics follow it
options = tfdv.StatsOptions(schema=schema,
                            infer_type_from_schema=True,
                            feature_allowlist=columns)

In [33]:
ds_test = dataset['test'][:]
ds_test['categories'] = ds_test['categories'].apply(lambda x: [str(i) for i in x])

test_stats = tfdv.generate_statistics_from_dataframe(ds_test.drop(columns='image'), stats_options=options)

tfdv.visualize_statistics(lhs_statistics=test_stats, rhs_statistics=train_stats,
                          lhs_name='TEST_DATASET', rhs_name='TRAIN_DATASET')

# Calculate and display anomalies in the test statistics
anomalies = tfdv.validate_statistics(test_stats, schema)
tfdv.display_anomalies(anomalies)

## Freeze Schema

In [34]:
# Store the schema in protocol buffer text format (pbtxt)
OUTPUT_DIR = "../models"
os.makedirs(OUTPUT_DIR, exist_ok=True)  # make sure the output directory exists
schema_file = os.path.join(OUTPUT_DIR, 'raw_schema.pbtxt')
tfdv.write_schema_text(schema, schema_file)