# Why Use TFRecord Instead of HDF5 for Neural Network Training?

When preparing a dataset for neural network training, the **TFRecord** format is often preferred over **HDF5 (h5)** for the following reasons:

---

## 1. Optimization for TensorFlow
- **TFRecord** is specifically designed for TensorFlow.
- It is natively supported by TensorFlow and optimized for input pipelines (`tf.data.Dataset`), enabling more efficient data loading and preprocessing.

---

## 2. Sequential Read Performance
- **TFRecord** is a sequential binary format optimized for linear reads from disk.
- This improves performance when loading large amounts of data during training.

---

## 3. Data Streaming
- With **TFRecord**, data can be read and consumed as a continuous stream (**streaming**).
- This is useful for training on very large datasets that cannot fit entirely in memory.
- **HDF5**, by contrast, is designed around random access into a single file. Feeding it to `tf.data` typically goes through a Python-level reader such as `h5py`, which is less suited to continuous streaming.
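
The streaming idea can be illustrated without TensorFlow. The sketch below uses a simplified length-prefixed framing (an 8-byte length followed by the payload). The real TFRecord format also stores checksums for the length and the payload, which are omitted here, and `write_records` and `iter_records` are hypothetical helper names.

```python
import io
import struct

def write_records(stream, payloads):
    """Append each payload as: 8-byte little-endian length + raw bytes."""
    for payload in payloads:
        stream.write(struct.pack("<Q", len(payload)))
        stream.write(payload)

def iter_records(stream):
    """Yield payloads one at a time, reading the stream strictly front to back."""
    while True:
        header = stream.read(8)
        if not header:  # clean end of stream
            return
        if len(header) < 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<Q", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload

# Usage: the reader never needs more than one record in memory at a time.
buffer = io.BytesIO()
write_records(buffer, [b"event-0", b"event-1", b"event-2"])
buffer.seek(0)
records = list(iter_records(buffer))
```

Because every record is self-delimiting, a consumer can start training on the first records while the rest of the file is still being read.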

---

## 4. Support for Parallelism
- **TFRecord** integrates with TensorFlow pipelines to support:
  - **Prefetching**: Loading future batches in parallel with training.
  - **Shuffling**: Efficiently shuffling data to avoid correlations between consecutive batches.
  - **Caching**: Storing preprocessed data in memory (or in a local file) so later epochs skip the preprocessing.
- **HDF5**, while supporting parallel reads with some libraries, is not as optimized for parallelism with TensorFlow.
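
A minimal sketch of how these pieces chain together in a `tf.data` pipeline. To stay self-contained it writes two tiny shards to a temporary directory with a hypothetical single feature named `x`; the buffer size, batch size, and number of parallel reads are arbitrary choices for illustration, not tuned values.

```python
import os
import tempfile

import tensorflow as tf

# Write two tiny shards so the example is self-contained (illustrative data only).
tmp_dir = tempfile.mkdtemp()
shard_paths = []
for shard in range(2):
    path = os.path.join(tmp_dir, f"toy.tfrecord_part{shard}")
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(4):
            value = float(shard * 4 + i)
            feature = {"x": tf.train.Feature(float_list=tf.train.FloatList(value=[value]))}
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())
    shard_paths.append(path)

def parse(serialized):
    # Hypothetical schema: a single float feature named "x" with one value.
    spec = {"x": tf.io.FixedLenFeature([1], tf.float32)}
    return tf.io.parse_single_example(serialized, spec)["x"]

dataset = (
    tf.data.TFRecordDataset(shard_paths, num_parallel_reads=2)  # read both shards concurrently
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)            # decode records in parallel
    .cache()                     # keep decoded records in memory after the first pass
    .shuffle(buffer_size=8)      # shuffle within a buffer of 8 records
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)  # prepare the next batch while the current one is consumed
)

# Collect every value once to confirm that nothing was lost or duplicated.
values = sorted(float(v) for batch in dataset for v in batch.numpy().reshape(-1))
```

Placing `cache()` before `shuffle()` means the decoded records are cached once while the shuffle order still changes from epoch to epoch.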

---

## Conclusion

- **TFRecord** is the ideal format for training with TensorFlow due to:
  - Efficiency
  - Scalability
  - Native integration
- **HDF5** remains a reasonable choice when the priority is:
  - Random access to the data
  - Compatibility with other tools
  - In-depth exploration of the dataset


In [1]:
%%bash
DATA_DIR=/tmp/lhcf-cnn

if [ ! -d $DATA_DIR ]; then
  mkdir -p $DATA_DIR
fi

if [ ! -f $DATA_DIR/combined_data.h5 ]; then
  wget https://minio.131.154.99.37.myip.cloud.infn.it/hackathon-data/lhcf-cnn/combined_data.h5 -O $DATA_DIR/combined_data.h5 &> .log
fi

ls -lrth $DATA_DIR/combined_data.h5

-rw-r--r--. 1 root root 5.6G Nov 14 11:37 /tmp/lhcf-cnn/combined_data.h5


In [2]:
import os
import h5py
import shutil
import numpy as np
import tensorflow as tf

from sklearn.model_selection import train_test_split
from multiprocessing import Pool, Manager
from tqdm import tqdm

2024-11-16 15:58:20.179497: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-16 15:58:20.179721: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-16 15:58:20.179765: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-16 15:58:20.189277: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
PATH = "/tmp/lhcf-cnn"
src_fname = f"{PATH}/combined_data.h5"

In [4]:
# Open the HDF5 file
with h5py.File(src_fname, 'r') as file:
    # List the datasets present in the file
    print(list(file.keys()))

    # Display the contents of each dataset
    for name in file.keys():
        data = file[name]
        print(f"\n{name}:")
        print("Shape:", data.shape)
        print("Data type:", data.dtype)

['ID', 'dE', 'posdE_01xy', 'posdE_23x', 'posdE_23y']

ID:
Shape: (10000,)
Data type: bool

dE:
Shape: (10000, 16, 1)
Data type: float16

posdE_01xy:
Shape: (10000, 384, 384, 2)
Data type: float16

posdE_23x:
Shape: (10000, 384, 2)
Data type: float16

posdE_23y:
Shape: (10000, 384, 2)
Data type: float16


### **1. Creating a TFRecord Example**

The `create_tfrecord_example` function generates a TFRecord example from raw data. Each example consists of a series of **features** mapped to the following fields:

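- **`posdE_01xy`**, **`posdE_23x`**, **`posdE_23y`**:
  - Float arrays holding the main data, with per-sample shapes `(384, 384, 2)`, `(384, 2)`, and `(384, 2)`.
  - Each array is flattened with `.reshape(-1)`, because a `FloatList` holds a flat sequence of values.

- **`dE`**:
  - A one-dimensional float array of 16 values derived from the original `(16, 1)` matrix.
  - The extra dimension is removed with `[i, :, 0]` when the sample is read from the HDF5 file.

- **`label`**:
  - An integer label representing the class associated with the data, stored as an `Int64List` after casting the boolean `ID` value to `int`.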

These data elements are combined into a `tf.train.Example` object, which represents a single serializable example for the TFRecord format.
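
Reading such an example back requires a matching feature specification. The sketch below assumes the per-sample shapes shown in the HDF5 listing; `FEATURE_SPEC` and `parse_example` are hypothetical names, and the round trip uses zero-filled arrays rather than real data.

```python
import numpy as np
import tensorflow as tf

# Shapes follow the HDF5 listing; the parser reshapes each flat FloatList for us.
FEATURE_SPEC = {
    "posdE_01xy": tf.io.FixedLenFeature([384, 384, 2], tf.float32),
    "posdE_23x": tf.io.FixedLenFeature([384, 2], tf.float32),
    "posdE_23y": tf.io.FixedLenFeature([384, 2], tf.float32),
    "dE": tf.io.FixedLenFeature([16], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    """Decode one serialized tf.train.Example into (inputs, label)."""
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    label = parsed.pop("label")
    return parsed, label

def _floats(array):
    # Helper for the synthetic example below: flatten an array into a FloatList feature.
    return tf.train.Feature(float_list=tf.train.FloatList(value=array.reshape(-1)))

# Round-trip check on a synthetic, zero-filled example.
example = tf.train.Example(features=tf.train.Features(feature={
    "posdE_01xy": _floats(np.zeros((384, 384, 2), dtype=np.float32)),
    "posdE_23x": _floats(np.zeros((384, 2), dtype=np.float32)),
    "posdE_23y": _floats(np.zeros((384, 2), dtype=np.float32)),
    "dE": _floats(np.zeros((16,), dtype=np.float32)),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))
inputs, label = parse_example(example.SerializeToString())
```

The same function can be passed to `dataset.map(...)` on a `tf.data.TFRecordDataset`, since it only uses TensorFlow ops.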


In [5]:
# Function to create a TFRecord example
def create_tfrecord_example(posdE_01xy, posdE_23x, posdE_23y, dE, label):
    features = {
        "posdE_01xy": tf.train.Feature(float_list=tf.train.FloatList(value=posdE_01xy.reshape(-1))),
        "posdE_23x": tf.train.Feature(float_list=tf.train.FloatList(value=posdE_23x.reshape(-1))),
        "posdE_23y": tf.train.Feature(float_list=tf.train.FloatList(value=posdE_23y.reshape(-1))),
        "dE": tf.train.Feature(float_list=tf.train.FloatList(value=dE.reshape(-1))),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
    }
    return tf.train.Example(features=tf.train.Features(feature=features))

# Function to Write a Batch of Examples to a TFRecord File

## Introduction
This function reads a set of data from an HDF5 (`.h5`) file and writes it to a TFRecord file, a format optimized for processing data in TensorFlow.
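
One practical consequence of this conversion is file size: the HDF5 datasets are stored as `float16`, while a `FloatList` stores 32-bit floats. The back-of-the-envelope estimate below uses the shapes from the HDF5 listing and ignores protobuf and record framing overhead, so the real files will be slightly larger.

```python
from math import prod

# Per-sample shapes of the float16 datasets listed in the HDF5 file.
shapes = {
    "posdE_01xy": (384, 384, 2),
    "posdE_23x": (384, 2),
    "posdE_23y": (384, 2),
    "dE": (16, 1),
}
n_samples = 10_000

values_per_sample = sum(prod(shape) for shape in shapes.values())
hdf5_bytes = values_per_sample * 2 * n_samples      # float16: 2 bytes per value
tfrecord_bytes = values_per_sample * 4 * n_samples  # FloatList: 4 bytes per value

print(values_per_sample)                 # 296464
print(round(hdf5_bytes / 2**30, 2))      # 5.52 (GiB)
print(round(tfrecord_bytes / 2**30, 2))  # 11.04 (GiB)
```

The first figure is close to the 5.6G reported by `ls` for `combined_data.h5`; the converted TFRecord files should be expected to need roughly twice that.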

---


In [6]:
# Function to write a batch of examples to a TFRecord file with progress bar updates
def write_tfrecord_batch(h5_file_path, indices, tfrecord_file_path, progress_queue):
    """
    Writes a batch of examples from an HDF5 file to a TFRecord file.

    Args:
        h5_file_path (str): Path to the source HDF5 file.
        indices (list): List of indices of the data to write.
        tfrecord_file_path (str): Path to the destination TFRecord file.
        progress_queue (Queue): Queue to update the progress bar.

    """
    with h5py.File(h5_file_path, "r") as f, tf.io.TFRecordWriter(tfrecord_file_path) as writer:
        for i in indices:
            posdE_01xy = f["posdE_01xy"][i].astype("float32")
            posdE_23x = f["posdE_23x"][i].astype("float32")
            posdE_23y = f["posdE_23y"][i].astype("float32")
            dE = f["dE"][i, :, 0].astype("float32")  # Removes the extra dimension
            label = int(f["ID"][i])
            
            example = create_tfrecord_example(posdE_01xy, posdE_23x, posdE_23y, dE, label)
            writer.write(example.SerializeToString())
            progress_queue.put(1)  # Updates the progress bar

# Main Function to Split the Dataset and Write in Parallel with a Progress Bar

## Introduction
The `split_and_parallel_write` function is designed to:
1. **Split an HDF5 dataset** into two sets: one for training and one for validation.
2. Write the split data into TFRecord files in **parallel** using multiple processes.
3. Update a **progress bar** to monitor the status of the conversion.
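
The index bookkeeping behind steps 1 and 2 can be sketched without the HDF5 file. This is a standard-library stand-in, not the notebook's implementation: `split_indices` and `chunk` are hypothetical helpers that play the roles of `train_test_split` and `np.array_split`, and the shuffled order will differ from scikit-learn's.

```python
import random

def split_indices(n_samples, train_ratio=0.8, seed=42):
    """Shuffle 0..n_samples-1 and cut the result into train and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_ratio)
    return indices[:n_train], indices[n_train:]

def chunk(items, n_chunks):
    """Split items into n_chunks nearly equal parts (earlier parts get the extras)."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

train_idx, val_idx = split_indices(10_000)
train_batches = chunk(train_idx, 8)  # one list of indices per worker process
val_batches = chunk(val_idx, 8)

print(len(train_idx), len(val_idx))  # 8000 2000
```

Each worker then receives one list of indices and writes its own part file, so no two processes ever write to the same TFRecord.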

---


In [7]:
# Function to display progress
def progress_listener(total_samples, progress_queue):
    with tqdm(total=total_samples, desc="Converting to TFRecord") as pbar:
        for _ in range(total_samples):
            progress_queue.get()
            pbar.update(1)

# Main function to split the dataset and write in parallel with a progress bar
def split_and_parallel_write(h5_file_path, train_tfrecord_path, val_tfrecord_path, train_ratio=0.8, num_processes=4):
    """
    Splits the dataset and writes the data into TFRecord files in parallel.

    Args:
        h5_file_path (str): Path to the source HDF5 file.
        train_tfrecord_path (str): Base path for training TFRecord files.
        val_tfrecord_path (str): Base path for validation TFRecord files.
        train_ratio (float): Proportion of the data to use for training (default 0.8).
        num_processes (int): Number of processes for parallelization (default 4).
    """
    
    with h5py.File(h5_file_path, "r") as f:
        n_samples = f["ID"].shape[0]
        indices = np.arange(n_samples)
        train_indices, val_indices = train_test_split(indices, test_size=1 - train_ratio, random_state=42, shuffle=True)

    # Split the indices into batches for parallel processing
    train_batches = np.array_split(train_indices, num_processes)
    val_batches = np.array_split(val_indices, num_processes)

    # Manager for the progress bar using multiprocessing
    manager = Manager()
    progress_queue = manager.Queue()

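    # Start the progress bar in its own process. The listener runs as the
    # initializer of a one-worker pool and returns once it has received
    # one message per sample.
    total_samples = len(train_indices) + len(val_indices)
    progress_process = Pool(1, progress_listener, (total_samples, progress_queue))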

    # Parallel writing of TFRecord files
    with Pool(num_processes) as pool:
        pool.starmap(write_tfrecord_batch, [(h5_file_path, batch, f"{train_tfrecord_path}_part{i}", progress_queue) for i, batch in enumerate(train_batches)])
        pool.starmap(write_tfrecord_batch, [(h5_file_path, batch, f"{val_tfrecord_path}_part{i}", progress_queue) for i, batch in enumerate(val_batches)])

    # Close the progress bar process
    progress_process.close()
    progress_process.join()

In [8]:
exp_dirname = f"{PATH}/Train_and_Validation"

# Start from an empty output directory
if os.path.exists(exp_dirname):
    shutil.rmtree(exp_dirname)
os.makedirs(exp_dirname)

# Run conversion using multiple processes with progress bar
split_and_parallel_write(src_fname, f"{exp_dirname}/train.tfrecord", f"{exp_dirname}/validation.tfrecord", num_processes=8)

Converting to TFRecord: 100%|██████████| 10000/10000 [01:47<00:00, 92.62it/s]


In [9]:
# Function to concatenate TFRecord files with a progress bar
def concatenate_tfrecords(input_files, output_file):
    total_records = sum(1 for input_file in input_files for _ in tf.data.TFRecordDataset(input_file))
    
    with tf.io.TFRecordWriter(output_file) as writer:
        with tqdm(total=total_records, desc=f"Concatenating {output_file}") as pbar:
            for input_file in input_files:
                for record in tf.data.TFRecordDataset(input_file):
                    writer.write(record.numpy())
                    pbar.update(1)
    print(f"Concatenated TFRecord file saved: {output_file}")

# Get the list of train and validation TFRecord files
train_files = sorted([os.path.join(exp_dirname, f) for f in os.listdir(exp_dirname) if f.startswith("train.tfrecord_part")])
validation_files = sorted([os.path.join(exp_dirname, f) for f in os.listdir(exp_dirname) if f.startswith("validation.tfrecord_part")])

# Concatenate files into a single TFRecord for train and validation with a progress bar
concatenate_tfrecords(train_files, f"{PATH}/train.tfrecord")
concatenate_tfrecords(validation_files, f"{PATH}/validation.tfrecord")

2024-11-16 16:00:12.658829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3234 MB memory:  -> device: 0, name: NVIDIA A100-PCIE-40GB MIG 1g.5gb, pci bus id: 0000:e1:00.0, compute capability: 8.0
Concatenating /tmp/lhcf-cnn/train.tfrecord: 100%|██████████| 8000/8000 [00:14<00:00, 566.22it/s]


Concatenated TFRecord file saved: /tmp/lhcf-cnn/train.tfrecord


Concatenating /tmp/lhcf-cnn/validation.tfrecord: 100%|██████████| 2000/2000 [00:03<00:00, 517.27it/s]

Concatenated TFRecord file saved: /tmp/lhcf-cnn/validation.tfrecord



