# Classification of Structured Data with Deep Cross Networks and Keras preprocessing layers

**Author:** [Mike Fournigault](https://www.linkedin.com/in/mike-fournigault-57312071/)<br>

Based on the Deep Cross Network V2 by Google: [DCN V2: Improved Deep & Cross Network and Practical Lessons
for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535)<br>


## 1. Environment setup

Set up Weights & Biases (W&B) to monitor model training and evaluation.


In [1]:
!pip install wandb -Uq

In [2]:
import wandb

wandb.login()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: mike-fournigault1 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

Install TensorFlow Recommenders, which provides the DCN V2 cross layer.
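For orientation, a DCN V2 cross layer computes `x_{l+1} = x_0 * (W x_l + b) + x_l`, where `x_0` is the embedded input vector and `*` is the element-wise product. The block below is a minimal, framework-free sketch of that formula. The function name `cross_layer` and the toy values are illustrative only; the trainable version is the `tfrs.layers.dcn.Cross` layer from TensorFlow Recommenders, which is presumably what this project's model builder relies on.

```python
def cross_layer(x0, xl, weights, bias):
    """One DCN V2 cross layer: x0 * (W @ xl + b) + xl, element-wise."""
    # Dense projection of the current layer input: W @ xl + b.
    projected = [
        sum(w * v for w, v in zip(row, xl)) + b
        for row, b in zip(weights, bias)
    ]
    # Element-wise product with the original input, plus a residual connection.
    return [a * p + c for a, p, c in zip(x0, projected, xl)]


# Toy values: identity weights and zero bias, chosen so the result is easy to check.
x0 = [1.0, 2.0, 3.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]

x1 = cross_layer(x0, x0, identity, zero_bias)  # first layer: xl is x0 itself
x2 = cross_layer(x0, x1, identity, zero_bias)  # second layer still multiplies by x0
print(x1, x2)
```

Because every layer multiplies by `x_0` again, each additional cross layer raises the maximum degree of the explicit feature interactions by one.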

In [3]:
!pip install -q tensorflow-recommenders

Cloning the repository and installing the requirements

In [5]:
from google.colab import userdata
github_token = userdata.get("github_token")

In [6]:
from google.colab import drive

# mounting my google drive
drive.mount("/content/gdrive", force_remount=True)

# Clone the repo "astro_iqa" from my github
! git clone https://mfournigault:$github_token@github.com/mfournigault/astro_iqa.git

Mounted at /content/gdrive
Cloning into 'astro_iqa'...
remote: Enumerating objects: 518, done.
remote: Counting objects: 100% (225/225), done.
remote: Compressing objects: 100% (163/163), done.
remote: Total 518 (delta 151), reused 125 (delta 61), pack-reused 293 (from 2)
Receiving objects: 100% (518/518), 245.90 MiB | 31.89 MiB/s, done.
Resolving deltas: 100% (233/233), done.
Updating files: 100% (81/81), done.


In [7]:
import os
os.chdir("/content/astro_iqa")
#! conda env update -n base -f environment_tf2.15_gpu_wsl.yml

In [8]:
import os
import sys

# Only the TensorFlow backend supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"

import numpy as np
import pandas as pd
import tensorflow as tf
import keras


sys.path.append(os.path.abspath("/content/astro_iqa/src/"))

## 2. Loading and preparing the datasets

Reading and merging catalog and mapping files

In [9]:
columns = ["OBJECT_ID", "FITS_ID", "CCD_ID", "ISO0", "BACKGROUND", "ELLIPTICITY", "ELONGATION", "CLASS_STAR", "FLAGS", "EXPTIME"]
data_path = "/content/gdrive/MyDrive/Astronomie/astro_iqa/data/"
proc_path = os.path.join(data_path, "processed")
fm_path = os.path.join(data_path, "for_modeling")


In [10]:
os.chdir("/content/astro_iqa/src/data_acquisition_understanding")

In [12]:
!python /content/astro_iqa/src/data_acquisition_understanding/dnn_datasets_preparation.py --data_path $data_path --train_fraction 0.7 --validation_fraction 0.5

2025-04-23 09:47:17.216882: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745401637.237458    3099 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1745401637.243617    3099 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Reading and concatening catalogs ...
CADC catalog size:  (1873000, 12)
Cleaning and splitting catalog ...
Class weights:
{'GOOD': 0.5989620465839595, 'RBT': 0.19793362733891753, 'BT': 0.14426418975966473, 'B_SEEING': 0.058840136317458235}
-----------------
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pyd

In [13]:
from data_acquisition_understanding.dnn_datasets_preparation import datasets_loader, custom_reader_func

In [None]:
# Load the tensorflow Datasets
os.chdir("/content/astro_iqa/")
print("current directory: ", os.getcwd())
print("Content of the directory: ", os.listdir(fm_path))

print("Reading the datasets ...")
batch_size = 4096
shuffling_size = 100000
training_dataset, validation_dataset, testing_dataset, label_voc = datasets_loader(
        data_path=fm_path,
        shuffling_size=shuffling_size,
        batch_size=batch_size,
        label_name="gt_label1",
        custom_reader_func=custom_reader_func
    )


current directory:  /content/astro_iqa
Content of the directory:  ['objects_catalog_cadc_bronze.parquet.gz', 'objects_catalog_ngc0869_bronze.parquet.gz', 'objects_catalog_ngc0896_bronze.parquet.gz', 'objects_catalog_ngc7000_bronze.parquet.gz', 'training_dataset', 'validation_dataset', 'test_dataset']
Reading the datasets ...
Label vocabulary:  ['[UNK]', np.str_('GOOD'), np.str_('RBT'), np.str_('BT'), np.str_('B_SEEING')]


In [15]:
print("Number of batches in training: ", training_dataset.cardinality().numpy())
print("Number of batches in validation:", validation_dataset.cardinality().numpy())
print("Number of batches in testing:", testing_dataset.cardinality().numpy())

Number of batches in training:  323
Number of batches in validation: 70
Number of batches in testing: 70
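
These batch counts are roughly consistent with a 70/15/15 split, that is, with `--validation_fraction 0.5` being applied to the 30% of rows left after the training split. That reading is an inference from the printed numbers, not something the script output confirms. Under that assumption, and with a made-up row count, the number of batches per split can be estimated as in this sketch (`batches_per_split` is a hypothetical helper, not part of the repository):

```python
import math


def batches_per_split(n_rows, train_fraction, validation_fraction, batch_size):
    """Estimate batch counts, assuming validation_fraction applies to the non-training rows."""
    n_train = round(n_rows * train_fraction)
    n_val = round((n_rows - n_train) * validation_fraction)
    n_test = n_rows - n_train - n_val
    # Ceiling division: assumes the last, partially filled batch is kept.
    return tuple(math.ceil(n / batch_size) for n in (n_train, n_val, n_test))


# n_rows is an invented figure for illustration, not the size of the cleaned catalog.
split_batches = batches_per_split(1_000_000, 0.7, 0.5, 4096)
print(split_batches)
```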


## 3. Creating model inputs and preprocessing layers

### 3.1 Defining the preprocessing layers

### 3.2 Encoding input features with preprocessing layers

Categorical features are encoded with `layers.StringLookup` or `layers.IntegerLookup`.
The layer vocabularies are learned from the dataset (e.g. the training dataset), and a `layers.CategoryEncoding` layer then encodes the inputs using that vocabulary.
Numerical features go through a `RobustNormalization` layer, so that outliers (possibly very large ones) do not dominate the normalization.

***Input features are encoded in the same order as they are defined in the dataset.***
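
To make the encoding concrete, here is a framework-free sketch of what such layers compute. It is an illustration only: `build_vocabulary`, `one_hot` and `robust_scale` are hypothetical helpers, and scaling by the median and interquartile range is one common definition of robust normalization that may differ from the project's `RobustNormalization` layer. Index 0 is reserved for out-of-vocabulary values, mirroring the `[UNK]` entry that Keras lookup layers place first by default.

```python
from collections import Counter
from statistics import median, quantiles


def build_vocabulary(values, oov_token="[UNK]"):
    """Most frequent values first, with index 0 reserved for unknown values."""
    counts = Counter(values)
    # Ties are broken by the string form of the value (an arbitrary but stable choice).
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return [oov_token] + ordered


def one_hot(value, vocabulary):
    """One-hot vector over the vocabulary; unseen values map to index 0."""
    index = vocabulary.index(value) if value in vocabulary else 0
    return [1.0 if i == index else 0.0 for i in range(len(vocabulary))]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (assumes q3 > q1)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    centre = median(values)
    return [(v - centre) / (q3 - q1) for v in values]


vocab = build_vocabulary([3, 0, 0, 2, 0, 3])  # toy integer codes
print(vocab)
print(one_hot(2, vocab), one_hot(16, vocab))  # 16 was never seen: falls back to index 0
scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 1000.0])
print(scaled)
```

In the last call the outlier stays large after scaling, but it does not distort the scale of the other values, which is the motivation for using robust statistics instead of the mean and standard deviation.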

In [16]:
from modeling.preprocessing import encode_inputs
from modeling.preprocessing import FEATURE_NAMES, NUMERIC_FEATURE_NAMES, CATEGORICAL_FEATURE_NAMES

In [35]:
import pickle

SAVE_PATH = os.path.join(fm_path, "encoded_features.pkl")

# Assume that "training_dataset" is already defined and loaded
# if os.path.exists(SAVE_PATH):
#     with open(SAVE_PATH, "rb") as f:
#         all_inputs, encoded_features = pickle.load(f)
#     print("Loading feature encoding from saved file.")
# else:
all_inputs, encoded_features = encode_inputs(training_dataset, FEATURE_NAMES, NUMERIC_FEATURE_NAMES, CATEGORICAL_FEATURE_NAMES)
with open(SAVE_PATH, "wb") as f:
    pickle.dump((all_inputs, encoded_features), f)
print("Feature encoding computed and saved.")

Processing numerical feature:  ISO0
Processing categorical feature:  FITS_ID
 ... StringLookup
Processing categorical feature:  FLAGS
 ... IntegerLookup
Processing numerical feature:  ELLIPTICITY
Processing categorical feature:  CCD_ID
 ... IntegerLookup
Processing numerical feature:  CLASS_STAR
Processing numerical feature:  ELONGATION
Processing numerical feature:  EXPTIME
Processing numerical feature:  BACKGROUND
Feature encoding computed and saved.


## 4. Creating and training the model, monitoring with W&B

### 4.1 Defining the model and training procedure

In [36]:
# class weights are:
# {'GOOD': 0.5989620465839595, 'RBT': 0.19793362733891753, 'BT': 0.14426418975966473, 'B_SEEING': 0.058840136317458235}
class_weights = {
  1: 0.5989620465839595,  # GOOD
  2: 0.19793362733891753,  # RBT
  3: 0.14426418975966473,  # BT
  4: 0.058840136317458235,  # B_SEEING
  # 5: 0.004915   # BGP
}
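
Keras applies `class_weight` by multiplying each sample's loss by the weight of its true class. The block below is a minimal sketch of that arithmetic for a single sample. The helper `weighted_cross_entropy` and the probabilities are made up for illustration, and the batch reduction that Keras applies afterwards is not reproduced.

```python
import math

# Same values as the class_weights dictionary defined in the notebook.
class_weights = {
    1: 0.5989620465839595,    # GOOD
    2: 0.19793362733891753,   # RBT
    3: 0.14426418975966473,   # BT
    4: 0.058840136317458235,  # B_SEEING
}


def weighted_cross_entropy(probabilities, true_class, weights):
    """Sparse categorical cross-entropy of one sample, scaled by its class weight."""
    return -weights[true_class] * math.log(probabilities[true_class])


# Hypothetical softmax output over the 5 classes (index 0 is [UNK]).
probs = [0.0, 0.25, 0.25, 0.25, 0.25]
loss_good = weighted_cross_entropy(probs, 1, class_weights)
loss_b_seeing = weighted_cross_entropy(probs, 4, class_weights)
print(loss_good, loss_b_seeing)
```

With these weights, an equally confident prediction costs about ten times more on a `GOOD` sample than on a `B_SEEING` sample.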

In [37]:
from tensorflow.keras.callbacks import ReduceLROnPlateau
from modeling.nn_modeling import create_dcn_model

In [None]:
def training_evaluation(config):
    """
    Train and evaluate the model with the given hyperparameters.
    """
    global all_inputs, encoded_features

    print("Reading and preparing the datasets ...")
    # datasets_loader also returns the label vocabulary, which is not needed here
    training_dataset, validation_dataset, testing_dataset, _ = datasets_loader(
            data_path=fm_path,
            shuffling_size=config["shuffling_size"],
            batch_size=config["batch_size"],
            label_name="gt_label1",
            custom_reader_func=custom_reader_func
        )

    model = create_dcn_model(all_inputs=all_inputs,
                         encoded_inputs=encoded_features,
                         num_hidden_layers=config["num_hidden_layers"],
                         units_per_layer=config["num_units"],
                         num_cross_layers=config["num_cross_layers"],
                         dcn_dnn=config["dcn_dnn"],
                         dropout_rate=config["dropout"],
                         l2=config["l2"])

    # Create a LearningRateScheduler callback
    # lr_scheduler = tf.keras.optimizers.schedules.PolynomialDecay(initial_learning_rate=config["initial_lr"],
    #                                                             decay_steps=config["decay_steps"],
    #                                                             end_learning_rate=config["end_lr"],
    #                                                             power=1)
    reduce_lr = ReduceLROnPlateau(monitor='val_sparse_categorical_accuracy', factor=0.2, patience=3, min_lr=1e-6)


    model.compile(optimizer=keras.optimizers.Adam(learning_rate=config["initial_lr"]),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=[keras.metrics.SparseCategoricalAccuracy()])

    # tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs/",
    #                                                       histogram_freq=1,
    #                                                       update_freq="batch")
    # Train the model
    checkpoint_cb = WandbModelCheckpoint(filepath="checkpoint.weights.h5", save_weights_only=True)
    cf_dcn = config["dcn_dnn"]
    best_model_cb = WandbModelCheckpoint(filepath=f"iqa_best_dcn_{cf_dcn}.keras",
                                         save_best_only=True,
                                         save_weights_only=False,
                                         monitor="val_sparse_categorical_accuracy",
                                         mode="max")
    print("Start training the model...")
    history = model.fit(training_dataset,
                        epochs=config["num_epochs"],
                        validation_data=validation_dataset,
                        callbacks=[reduce_lr,
                                  #  tensorboard_callback,
                                   WandbMetricsLogger(log_freq="batch"),
                                   checkpoint_cb,
                                   best_model_cb
                                   ],
                        class_weight=class_weights)

    print("Model training finished.")

    print("Evaluating model performance...")
    loss, accuracy = model.evaluate(testing_dataset)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return model, history
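
The `ReduceLROnPlateau` callback used in `training_evaluation` multiplies the learning rate by `factor` once the monitored metric has failed to improve for `patience` consecutive epochs, and never goes below `min_lr`. A simplified simulation of that rule is sketched below. It ignores the callback's `min_delta` and `cooldown` options, `simulate_reduce_lr` is a hypothetical helper, and the accuracy values are invented for illustration.

```python
def simulate_reduce_lr(metric_values, lr, factor=0.2, patience=3, min_lr=1e-6):
    """Return the learning rate set at the end of each epoch (higher metric is better)."""
    best, wait, schedule = float("-inf"), 0, []
    for value in metric_values:
        if value > best:
            # Improvement: remember it and reset the patience counter.
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau detected: shrink the learning rate, bounded below by min_lr.
                lr, wait = max(lr * factor, min_lr), 0
        schedule.append(lr)
    return schedule


# Invented validation accuracies: the best value is reached at epoch 2.
val_accuracy = [0.45, 0.86, 0.72, 0.80, 0.85, 0.84, 0.83, 0.82]
lr_schedule = simulate_reduce_lr(val_accuracy, lr=1e-3)
print(lr_schedule)
```

In this toy run the rate drops from 1e-3 to 2e-4 after epoch 5 and to 4e-5 after epoch 8, because no epoch after the second one beats the best accuracy.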

### 4.2 Defining the monitoring configuration and experiment

In [39]:
from wandb.integration.keras import WandbMetricsLogger
from wandb.integration.keras import WandbModelCheckpoint
from wandb.integration.keras import WandbCallback

In [None]:
# Pre-configure training_evaluation with all_inputs and encoded_features
# so that the function passed to wandb.agent does not take any arguments.
# from functools import partial

# agent_function = partial(training_evaluation, all_inputs, encoded_features)

In [None]:
wandb.teardown()

In [40]:
import os

os.environ['WANDB_AGENT_MAX_INITIAL_FAILURES'] = '1'

In [41]:
notes = f"With shuffling_size={shuffling_size} and batch_size={batch_size}.\n"

In [None]:
config = dict(
    # Hyper params
    num_hidden_layers = 1,
    num_units = 64,
    num_cross_layers = 1,
    dcn_dnn = "stack", # "stack" or "concatenate"
    dropout = 0.3,
    l2 = 0.008,
    num_classes = 5,
    shuffling_size = shuffling_size,
    batch_size = batch_size,
    initial_lr = 1e-3,
    end_lr = 1e-4,
    decay_steps = 1000,
    num_epochs = 30,
)

# Enable resuming the run
run = wandb.init(project="astro_iqa",
                 entity="mike-fournigault1",
                 config=config, save_code=True,
                 resume="allow",
                 job_type="train",
                 tags=["dcn"] + [config["dcn_dnn"]],
                 notes=notes)

### 4.3 Running the training/evaluation experiment

In [None]:
model, training_history = training_evaluation(config)

Reading and preparing the datasets ...




Label vocabulary:  ['[UNK]', np.str_('GOOD'), np.str_('RBT'), np.str_('BT'), np.str_('B_SEEING')]
Start training the model...
Epoch 1/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 91s 257ms/step - loss: 1.1746 - sparse_categorical_accuracy: 0.6351 - val_loss: 2.2679 - val_sparse_categorical_accuracy: 0.4507 - learning_rate: 0.0010
Epoch 2/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 151s 289ms/step - loss: 0.1820 - sparse_categorical_accuracy: 0.8804 - val_loss: 0.4543 - val_sparse_categorical_accuracy: 0.8648 - learning_rate: 0.0010
Epoch 3/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 132s 264ms/step - loss: 0.1523 - sparse_categorical_accuracy: 0.8838 - val_loss: 0.8991 - val_sparse_categorical_accuracy: 0.7209 - learning_rate: 0.0010
Epoch 4/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 95s 284ms/step - loss: 0.1383 - sparse_categorical_accuracy: 0.8709 - val_loss: 0.4020 - val_sparse_categorical_accur

In [46]:
# Terminate the W&B run
run.finish()

Calculating a confusion matrix and classification report for the model predictions on the test set.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Predict the labels on the test dataset
predictions = model.predict(testing_dataset)
predicted_labels = tf.argmax(predictions, axis=1)

# Extract the true labels from the test dataset
true_labels = []
for features, label in testing_dataset:
    true_labels.extend(label.numpy())
true_labels = tf.constant(true_labels)

# Compute the confusion matrix
conf_matrix = tf.math.confusion_matrix(true_labels, predicted_labels)

# Display the confusion matrix with seaborn
label_names = label_voc[1:]
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, 
            xticklabels=label_names,
            yticklabels=label_names,
            fmt="d", cmap="Blues")
plt.xlabel("Predictions")
plt.ylabel("True Labels")
plt.title("Confusion Matrix on Test Dataset")
plt.show()
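
The cell above plots the confusion matrix only. Per-class scores can be derived from the same matrix; the following is a minimal sketch using a hypothetical helper `per_class_report` and an invented 3x3 matrix, with rows as true labels and columns as predictions, as in `tf.math.confusion_matrix`.

```python
def per_class_report(matrix, names):
    """Precision, recall and F1 per class from a square confusion matrix."""
    report = {}
    for k, name in enumerate(names):
        true_positive = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum: predicted as class k
        actual = sum(matrix[k])                    # row sum: truly class k
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Invented counts, for illustration only (not results from this model).
toy_matrix = [[8, 2, 0],
              [1, 6, 3],
              [0, 2, 8]]
toy_report = per_class_report(toy_matrix, ["A", "B", "C"])
for name, scores in toy_report.items():
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

To apply this to the notebook's result, `conf_matrix.numpy().tolist()` yields the nested list the helper expects. `sklearn.metrics.classification_report` is an alternative if scikit-learn is available.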