# Classification of Structured Data with Keras preprocessing layers

**Author:** [Mike Fournigault](https://www.linkedin.com/in/mike-fournigault-57312071/)<br>


## 1. Environment setup

Set up Weights & Biases (W&B) to monitor model training and evaluation.


In [1]:
!pip install wandb -Uq

In [2]:
import wandb

wandb.login()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: mike-fournigault1 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

Cloning the repository and installing the requirements.

In [3]:
from google.colab import userdata
github_token = userdata.get("github_token")

In [4]:
from google.colab import drive

# mounting my google drive
drive.mount("/content/gdrive", force_remount=True)

# Clone the repo "astro_iqa" from my github
! git clone https://mfournigault:$github_token@github.com/mfournigault/astro_iqa.git

Mounted at /content/gdrive
Cloning into 'astro_iqa'...
remote: Enumerating objects: 437, done.
remote: Counting objects: 100% (144/144), done.
remote: Compressing objects: 100% (108/108), done.
remote: Total 437 (delta 89), reused 80 (delta 35), pack-reused 293 (from 2)
Receiving objects: 100% (437/437), 244.27 MiB | 24.34 MiB/s, done.
Resolving deltas: 100% (171/171), done.


In [5]:
import os
os.chdir("/content/astro_iqa")
#! conda env update -n base -f environment_tf2.15_gpu_wsl.yml

In [6]:
#!pip install numpy --upgrade
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
print(tf.test.is_gpu_available(cuda_only=True))
print(tf.test.is_built_with_cuda())


Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


Num GPUs Available:  1
True
True


In [7]:
import os
import sys

# Only the TensorFlow backend supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"

import numpy as np
import pandas as pd
import tensorflow as tf
import keras
from keras import layers

sys.path.append(os.path.abspath("/content/astro_iqa/src/"))

Adding the instrumentation of the TF code with debugger V2

## 2. Loading and preparing the datasets

Reading and merging catalog and mapping files

In [None]:
columns = ["OBJECT_ID", "FITS_ID", "CCD_ID", "ISO0", "BACKGROUND", "ELLIPTICITY", "ELONGATION", "CLASS_STAR", "FLAGS", "EXPTIME"]
data_path = "/content/gdrive/MyDrive/Astronomie/astro_iqa/data/"
proc_path = os.path.join(data_path, "processed")
fm_path = os.path.join(data_path, "for_modeling")


In [9]:
os.chdir("/content/astro_iqa/src/data_acquisition_understanding")

In [None]:
!python /content/astro_iqa/src/data_acquisition_understanding/dnn_datasets_preparation.py --data_path $data_path --train_fraction 0.7 --validation_fraction 0.5

2025-04-14 07:17:46.376425: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744615066.396124    4105 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744615066.402051    4105 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Reading and concatening catalogs ...
CADC catalog size:  (1873000, 12)
Cleaning and splitting catalog ...
Class weights:
{'GOOD': 0.5989620465839595, 'RBT': 0.19793362733891753, 'BT': 0.14426418975966473, 'B_SEEING': 0.058840136317458235}
-----------------
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pyd

In [None]:
from data_acquisition_understanding.dnn_datasets_preparation import datasets_loader, custom_reader_func

In [None]:
# Load the tensorflow Datasets
os.chdir("/content/astro_iqa/")
print("current directory: ", os.getcwd())
print("Content of the directory: ", os.listdir(fm_path))

print("Reading the datasets ...")
batch_size = 4096
shuffling_size = 100000
training_dataset, validation_dataset, testing_dataset = datasets_loader(
        data_path=fm_path,
        shuffling_size=shuffling_size,
        batch_size=batch_size,
        label_name="gt_label1",
        custom_reader_func=custom_reader_func
    )


current directory:  /content/astro_iqa
Content of the directory:  ['modelling.md', '.gitattributes', 'map_images_labels_ngc7000.json', 'validation_dataset', 'map_images_labels_ngc0869.json', 'test_dataset', 'objects_catalog_cadc_bronze.parquet.gz', 'objects_catalog_ngc0869_bronze.parquet.gz', 'objects_catalog_ngc0896_bronze.parquet.gz', 'map_images_labels_ngc0896.json', 'map_images_labels.json', 'map_images_labels_cadc2.json', 'objects_catalog_ngc7000_bronze.parquet.gz', 'training_dataset', 'map_images_labels_cadc.json']
Reading the datasets ...
Label vocabulary:  ['[UNK]', np.str_('GOOD'), np.str_('RBT'), np.str_('BT'), np.str_('B_SEEING')]


In [14]:
print("Number of batches in training: ", training_dataset.cardinality().numpy())
print("Number of batches in validation:", validation_dataset.cardinality().numpy())
print("Number of batches in testing:", testing_dataset.cardinality().numpy())

Number of batches in training:  323
Number of batches in validation: 70
Number of batches in testing: 70
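
The batch counts above are consistent with a 70/15/15 split. As a rough check, the sketch below assumes that `--validation_fraction 0.5` is applied to the rows left over after the training split. That is an assumption about `dnn_datasets_preparation.py`, whose code is not shown here, and `split_fractions` is a hypothetical helper written only for this illustration.

```python
def split_fractions(train_fraction, validation_fraction):
    """Return (train, validation, test) shares of the full catalog.

    Assumption: validation_fraction is taken from the remainder after the
    training split, and the test set gets whatever is left.
    """
    remainder = 1.0 - train_fraction
    validation = remainder * validation_fraction
    test = remainder - validation
    return train_fraction, validation, test


train, validation, test = split_fractions(0.7, 0.5)
print(f"train={train:.2f} validation={validation:.2f} test={test:.2f}")

# Shares implied by the batch counts printed above (323, 70, 70).
batches = {"training": 323, "validation": 70, "testing": 70}
total = sum(batches.values())
for name, count in batches.items():
    print(f"{name}: {count / total:.3f}")
```

The last batch of each dataset may be partially filled, so the shares derived from batch counts are only approximate.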


## 3. Creating model inputs and preprocessing layers

### 3.1 Defining the preprocessing layers

### 3.2 Encoding input features with preprocessing layers

Categorical features are encoded with `layers.StringLookup` or `layers.IntegerLookup`.
Each lookup layer learns its vocabulary from a dataset (here, the training dataset), and a `layers.CategoryEncoding` layer then encodes the resulting indices.
Numerical features go through a `RobustNormalization` layer, so that outliers (some of them very large) do not dominate the normalization.

***Input features are encoded in the same order as they are defined in the dataset.***
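
To make the two encodings concrete, here is a plain-Python sketch of what they compute. It does not use Keras and is not the code in `modeling.preprocessing`: `build_vocabulary`, `one_hot`, and `robust_scale` are hypothetical helpers, and the CCD identifiers are made up. Reserving index 0 for unknown values mirrors the `[UNK]` entry of the label vocabulary printed earlier. The median and interquartile-range formula is an assumption about what `RobustNormalization` does, since that layer's implementation is not shown here.

```python
import statistics


def build_vocabulary(values):
    """Map each distinct value to an index, reserving 0 for unknown values."""
    vocabulary = {}
    for value in values:
        if value not in vocabulary:
            vocabulary[value] = len(vocabulary) + 1
    return vocabulary


def one_hot(value, vocabulary):
    """One-hot encode a value; values not in the vocabulary fall into slot 0."""
    encoded = [0.0] * (len(vocabulary) + 1)
    encoded[vocabulary.get(value, 0)] = 1.0
    return encoded


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = (q3 - q1) or 1.0  # guard against a zero IQR
    return [(v - median) / iqr for v in values]


# Categorical feature: vocabulary learned from toy training values.
ccd_vocabulary = build_vocabulary(["ccd00", "ccd01", "ccd00", "ccd02"])
print(one_hot("ccd01", ccd_vocabulary))
print(one_hot("ccd99", ccd_vocabulary))  # not in the vocabulary

# Numeric feature with one huge outlier: the median and IQR are computed
# from the bulk of the data, so the regular values keep a usable scale.
background = [float(v) for v in range(10, 21)] + [10_000.0]
print([round(v, 2) for v in robust_scale(background)])
```

Robust scaling does not remove the outlier, which still maps to a large value. It only keeps that outlier from compressing all the other values toward a single point, as a mean and standard deviation would.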

In [None]:
from modeling.preprocessing import encode_inputs
from modeling.preprocessing import FEATURE_NAMES, NUMERIC_FEATURE_NAMES, CATEGORICAL_FEATURE_NAMES

In [None]:
import pickle

SAVE_PATH = "/content/gdrive/MyDrive/"+"encoded_features.pkl"

# Assume that "training_dataset" is already defined and loaded
# if os.path.exists(SAVE_PATH):
#     with open(SAVE_PATH, "rb") as f:
#         all_inputs, encoded_features = pickle.load(f)
#     print("Loading feature encoding from saved file.")
# else:
all_inputs, encoded_features = encode_inputs(training_dataset, FEATURE_NAMES, NUMERIC_FEATURE_NAMES, CATEGORICAL_FEATURE_NAMES)
with open(SAVE_PATH, "wb") as f:
    pickle.dump((all_inputs, encoded_features), f)
print("Feature encoding computed and saved.")


## 4. Creating, training the model and monitoring with W&B

### 4.1 Defining the model and training procedure

In [None]:
# class weights are:
# {'GOOD': 0.5989620465839595, 'RBT': 0.19793362733891753, 'BT': 0.14426418975966473, 'B_SEEING': 0.058840136317458235}
class_weights = {
  1: 0.5989620465839595,  # GOOD
  2: 0.19793362733891753,  # RBT
  3: 0.14426418975966473,  # BT
  4: 0.058840136317458235,  # B_SEEING
  # 5: 0.004915   # BGP
}
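
The four weights above sum to 1 and the most frequent class (`GOOD`) has the largest value, which suggests they are class proportions rather than inverse-frequency weights. This reading is an inference from the numbers, not something the preparation script states. Keras multiplies each sample's loss by the weight of its class, so proportions give the majority class the most emphasis. If the goal is to compensate for class imbalance, one common alternative is the "balanced" heuristic, `1 / (n_classes * proportion)`. The sketch below computes it; `balanced_weights` is a hypothetical helper, not part of the repository.

```python
# Values printed under "Class weights:" by dnn_datasets_preparation.py,
# read here as class proportions (an assumption).
proportions = {
    "GOOD": 0.5989620465839595,
    "RBT": 0.19793362733891753,
    "BT": 0.14426418975966473,
    "B_SEEING": 0.058840136317458235,
}


def balanced_weights(proportions):
    """Inverse-frequency weights: 1 / (n_classes * proportion)."""
    n_classes = len(proportions)
    return {name: 1.0 / (n_classes * p) for name, p in proportions.items()}


weights = balanced_weights(proportions)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")

# Keras expects integer class indices as keys. Index 0 is "[UNK]" in the
# label vocabulary, so the real classes start at 1.
alternative_class_weights = {
    index: weights[name] for index, name in enumerate(proportions, start=1)
}
```

With this heuristic the rarest class (`B_SEEING`) receives the largest weight and `GOOD` the smallest, the opposite ordering of the dictionary above.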

In [None]:
from wandb.integration.keras import WandbMetricsLogger
from wandb.integration.keras import WandbModelCheckpoint
from wandb.integration.keras import WandbCallback

In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau
from modeling.nn_modeling import create_dnn_model

In [None]:
def training_evaluation(config):
    """
    Train and evaluate the model with the given hyperparameters.
    """
    global all_inputs, encoded_features

    print("Reading and preparing the datasets ...")
    training_dataset, validation_dataset, testing_dataset = datasets_loader(
            data_path=fm_path,
            shuffling_size=config["shuffling_size"],
            batch_size=config["batch_size"],
            label_name="gt_label1",
            custom_reader_func=custom_reader_func
        )

    model = create_dnn_model(all_inputs=all_inputs,
                         encoded_inputs=encoded_features,
                         num_hidden_layers=config["num_hidden_layers"],
                         units_per_layer=config["num_units"],
                         dropout_rate=config["dropout"],
                         l2=config["l2"])

    # Learning-rate control: a PolynomialDecay schedule is kept below for
    # reference (disabled); ReduceLROnPlateau is used instead.
    # lr_scheduler = tf.keras.optimizers.schedules.PolynomialDecay(initial_learning_rate=config["initial_lr"],
    #                                                             decay_steps=config["decay_steps"],
    #                                                             end_learning_rate=config["end_lr"],
    #                                                             power=1)
    reduce_lr = ReduceLROnPlateau(monitor='val_sparse_categorical_accuracy', factor=0.2, patience=3, min_lr=1e-6)

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=config["initial_lr"]),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=[keras.metrics.SparseCategoricalAccuracy()])

    # Checkpoint callbacks: periodic weight checkpoints, plus the best full model
    # according to the validation accuracy
    checkpoint_cb = WandbModelCheckpoint(filepath="checkpoint.weights.h5", save_weights_only=True)
    best_model_cb = WandbModelCheckpoint(filepath="iqa_best_dnn.keras",
                                         save_best_only=True,
                                         save_weights_only=False,
                                         monitor="val_sparse_categorical_accuracy",
                                         mode="max")

    # Train the model
    print("Start training the model...")
    history = model.fit(training_dataset,
                        epochs=config["num_epochs"],
                        validation_data=validation_dataset,
                        callbacks=[reduce_lr,
                                   WandbMetricsLogger(log_freq="batch"),
                                   checkpoint_cb,
                                   best_model_cb
                                   ],
                        class_weight=class_weights)

    print("Model training finished.")

    print("Evaluating model performance...")
    loss, accuracy = model.evaluate(testing_dataset)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history
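
To see how the `ReduceLROnPlateau` settings in `training_evaluation` behave, the rule can be simulated in plain Python. This is a simplified sketch, not the Keras implementation: it ignores the `min_delta` and `cooldown` options, and `simulate_reduce_on_plateau` is a hypothetical helper. It is applied to the first five validation accuracies reported in the training log of section 4.3, with the same `factor`, `patience`, and `min_lr` as in the function above.

```python
def simulate_reduce_on_plateau(scores, initial_lr, factor=0.2, patience=3, min_lr=1e-6):
    """Return the learning rate in effect after each epoch (higher score is better)."""
    lr = initial_lr
    best = float("-inf")
    wait = 0  # epochs since the monitored score last improved
    history = []
    for score in scores:
        if score > best:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau detected: shrink the rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


val_accuracy = [0.5534, 0.5327, 0.4888, 0.5207, 0.5288]
for epoch, lr in enumerate(simulate_reduce_on_plateau(val_accuracy, 1e-3), start=1):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

With these inputs the rate drops once, at the end of the fourth epoch, from 1e-3 to 2e-4, because the validation accuracy has not beaten its first-epoch value for three consecutive epochs.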

### 4.2 Defining the monitoring configuration and experiment

In [30]:
import os

os.environ['WANDB_AGENT_MAX_INITIAL_FAILURES'] = '1'

In [None]:
notes = f"With shuffling_size={shuffling_size} and batch_size={batch_size}.\n"

In [None]:
config = dict(
    # Hyper params
    num_hidden_layers = 2,
    num_units = 64,
    dropout = 0.3,
    l2 = 0.008,
    num_classes = 5,
    shuffling_size = shuffling_size,
    batch_size = batch_size,
    initial_lr = 1e-3,
    end_lr = 1e-4,
    decay_steps = 1000,
    num_epochs = 30,
)

# Enable resuming the run
run = wandb.init(project="astro_iqa", 
                 entity="mike-fournigault1", 
                 config=config, save_code=True, 
                 resume="allow",
                 job_type="train",
                 tags=["dnn"],
                 notes=notes)

### 4.3 Running the training/evaluation experiment

In [None]:
training_history = training_evaluation(config)

Reading and preparing the datasets ...
Label vocabulary:  ['[UNK]', np.str_('GOOD'), np.str_('RBT'), np.str_('BT'), np.str_('B_SEEING')]
Start training the model...
Epoch 1/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 74s 216ms/step - accuracy: 0.4429 - loss: 1.7478 - val_accuracy: 0.5534 - val_loss: 1.2452
Epoch 2/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 89s 243ms/step - accuracy: 0.7491 - loss: 0.3538 - val_accuracy: 0.5327 - val_loss: 1.2226
Epoch 3/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 79s 244ms/step - accuracy: 0.7506 - loss: 0.2711 - val_accuracy: 0.4888 - val_loss: 1.3421
Epoch 4/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 81s 242ms/step - accuracy: 0.7514 - loss: 0.2355 - val_accuracy: 0.5207 - val_loss: 2.0978
Epoch 5/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 81s 240ms/step - accuracy: 0.7515 - loss: 0.2327 - val_accuracy: 0.5288 - val_loss: 2.3569
Epoch 6/30

In [33]:
# Terminate the W&B run
run.finish()

Run history:
batch/accuracy,▁▄█▃▁▃█▂▁▄▃▁▂▁▄▄▂▅▄▂▃▂▂▁▂▄▄▂▂██▂█▂▄▄▁▃▄▄
batch/batch_step,▁▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▄▄▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇█████
batch/learning_rate,█▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
batch/loss,██▆▅▂▄▄▃▄▄▂▄▁▁▂▆▅▄▃▄▃▃▅▂▁▁▄▃▃▅▁▃▄▅▃▅▃▄▄▄
epoch/accuracy,▁█████████████████████████████
epoch/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███
epoch/learning_rate,█▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
epoch/loss,█▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
epoch/val_accuracy,▃▂▁▂▂▂▃▂▃▅▃▄▄▆▆▄▆▅▇▇▇▆▇▆▇███▇█
epoch/val_loss,▂▂▃▇█▆▅▃▅▃▄▃▄▂▁▃▂▂▂▂▂▁▂▂▂▁▁▂▃▃

Run summary:
batch/accuracy,0.6014
batch/batch_step,9890.0
batch/learning_rate,0.0001
batch/loss,0.26265
epoch/accuracy,0.59974
epoch/epoch,29.0
epoch/learning_rate,0.0001
epoch/loss,0.26268
epoch/val_accuracy,0.75732
epoch/val_loss,1.36343
