# Classification of Structured Data with Deep Cross Networks and Keras preprocessing layers

**Author:** [Mike Fournigault](https://www.linkedin.com/in/mike-fournigault-57312071/)<br>

Based on the Deep Cross Network V2 by Google: [DCN V2: Improved Deep & Cross Network and Practical Lessons
for Web-scale Learning to Rank Systems](https://arxiv.org/pdf/2008.13535)<br>


## 1. Environment setup

Set up Weights & Biases (W&B) to monitor model training and evaluation.


In [None]:
!pip install wandb -Uq

In [None]:
import wandb

wandb.login()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: mike-fournigault1 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

Install TensorFlow Recommenders, which provides the DCN V2 cross layer.

In [None]:
!pip install -q tensorflow-recommenders

Cloning the repository and installing the requirements

In [None]:
from google.colab import userdata
github_token = userdata.get("github_token")

In [None]:
from google.colab import drive

# mounting my google drive
drive.mount("/content/gdrive", force_remount=True)

# Clone the repo "astro_iqa" from my github
! git clone https://mfournigault:$github_token@github.com/mfournigault/astro_iqa.git

Mounted at /content/gdrive
Cloning into 'astro_iqa'...
remote: Enumerating objects: 457, done.
remote: Counting objects: 100% (164/164), done.
remote: Compressing objects: 100% (123/123), done.
remote: Total 457 (delta 104), reused 90 (delta 40), pack-reused 293 (from 2)
Receiving objects: 100% (457/457), 244.29 MiB | 30.19 MiB/s, done.
Resolving deltas: 100% (186/186), done.
Updating files: 100% (76/76), done.


In [None]:
import os
os.chdir("/content/astro_iqa")
#! conda env update -n base -f environment_tf2.15_gpu_wsl.yml

In [None]:
import os
import sys

# Only the TensorFlow backend supports string inputs.
os.environ["KERAS_BACKEND"] = "tensorflow"

import numpy as np
import pandas as pd
import tensorflow as tf
import keras


sys.path.append(os.path.abspath("/content/astro_iqa/src/"))

## 2. Loading and preparing the datasets

Reading and merging catalog and mapping files

In [None]:
columns = ["OBJECT_ID", "FITS_ID", "CCD_ID", "ISO0", "BACKGROUND", "ELLIPTICITY", "ELONGATION", "CLASS_STAR", "FLAGS", "EXPTIME"]
data_path = "/content/astro_iqa/data/"
proc_path = os.path.join(data_path, "processed")
fm_path = os.path.join(data_path, "for_modeling")


In [None]:
os.chdir("/content/astro_iqa/src/data_acquisition_understanding")

In [10]:
!python /content/astro_iqa/src/data_acquisition_understanding/dnn_datasets_preparation.py --train_fraction 0.7 --validation_fraction 0.5

2025-04-17 10:12:26.395456: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744884746.429545    2405 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744884746.439590    2405 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Reading and concatening catalogs ...
CADC catalog size:  (1873000, 12)
Cleaning and splitting catalog ...
Class weights:
{'GOOD': 0.5989620465839595, 'RBT': 0.19793362733891753, 'BT': 0.14426418975966473, 'B_SEEING': 0.058840136317458235}
-----------------
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pyd

In [None]:
from data_acquisition_understanding.dnn_datasets_preparation import datasets_loader, custom_reader_func

In [None]:
# Load the tensorflow Datasets
os.chdir("/content/astro_iqa/")
print("current directory: ", os.getcwd())
print("Content of the directory: ", os.listdir(fm_path))

print("Reading the datasets ...")
batch_size = 4096
shuffling_size = 100000
training_dataset, validation_dataset, testing_dataset = datasets_loader(
        data_path=fm_path,
        shuffling_size=shuffling_size,
        batch_size=batch_size,
        label_name="gt_label1",
        custom_reader_func=custom_reader_func
    )


current directory:  /content/astro_iqa
Content of the directory:  ['map_images_labels_ngc0869.json', 'objects_catalog_ngc0896_bronze.parquet.gz', 'training_dataset', 'map_images_labels_ngc7000.json', 'objects_catalog_ngc7000_bronze.parquet.gz', 'test_dataset', 'map_images_labels_cadc.json', 'modelling.md', 'validation_dataset', 'objects_catalog_cadc_bronze.parquet.gz', 'map_images_labels_ngc0896.json', 'map_images_labels_cadc2.json', 'objects_catalog_ngc0869_bronze.parquet.gz', 'map_images_labels.json', '.gitattributes']
Reading the datasets ...
Label vocabulary:  ['[UNK]', np.str_('GOOD'), np.str_('RBT'), np.str_('BT'), np.str_('B_SEEING')]


In [14]:
print("Number of batches in training: ", training_dataset.cardinality().numpy())
print("Number of batches in validation:", validation_dataset.cardinality().numpy())
print("Number of batches in testing:", testing_dataset.cardinality().numpy())

Number of batches in training:  323
Number of batches in validation: 70
Number of batches in testing: 70
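
The batch counts above can be turned back into approximate split fractions. The sketch below assumes (this is not confirmed by the script output) that `--validation_fraction 0.5` is applied to the 30% of rows left over after the training split, which would give roughly a 70/15/15 partition. Because the last batch of each dataset may be partial, the fractions are only approximate.

```python
# Batch counts printed by the cell above.
batches = {"training": 323, "validation": 70, "testing": 70}

total_batches = sum(batches.values())

# Share of each split, measured in batches (approximate: the last batch of
# each dataset may contain fewer rows than batch_size).
fractions = {name: count / total_batches for name, count in batches.items()}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
```
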


## 3. Creating model inputs and preprocessing layers

### 3.1 Defining the preprocessing layers

### 3.2 Encoding input features with preprocessing layers

Categorical features are encoded with `layers.StringLookup` or `layers.IntegerLookup`.
The layer vocabularies are learned from a dataset (here, the training dataset), and a `layers.CategoryEncoding` layer then encodes the inputs using that vocabulary.
Numerical features go through a `RobustNormalization` layer, which accounts for outliers (possibly very large ones) during normalization.

***Input features are encoded in the same order as they are defined in the dataset.***
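
To make the two encoding ideas concrete, here is a conceptual sketch in plain Python. It is not the repository's `encode_inputs` nor its `RobustNormalization` layer: the helper names are hypothetical, the vocabulary is built in order of first appearance (a simplification), and the robust scaling assumes a median/interquartile-range formulation, which is one common way of limiting the influence of outliers.

```python
import statistics


def build_vocabulary(values):
    """Map each distinct value to an index; index 0 is reserved for unknowns."""
    vocabulary = {}
    for value in values:
        if value not in vocabulary:
            vocabulary[value] = len(vocabulary) + 1
    return vocabulary


def one_hot(value, vocabulary):
    """One-hot encode a value; out-of-vocabulary values fall into slot 0."""
    encoded = [0.0] * (len(vocabulary) + 1)
    encoded[vocabulary.get(value, 0)] = 1.0
    return encoded


def robust_scale(values):
    """Center on the median and divide by the interquartile range (assumed formulation)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid division by zero for constant features
    return [(v - median) / iqr for v in values]


# Illustrative values only, not taken from the catalog.
flags_vocabulary = build_vocabulary([0, 2, 0, 16])
known = one_hot(2, flags_vocabulary)
unknown = one_hot(4, flags_vocabulary)
scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 100.0])
```

With this scaling, the outlier `100.0` does not compress the other values toward zero, as it would with a mean/standard-deviation normalization.
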

In [None]:
from modeling.preprocessing import encode_inputs
from modeling.preprocessing import FEATURE_NAMES, NUMERIC_FEATURE_NAMES, CATEGORICAL_FEATURE_NAMES

In [23]:
# Encode the input features once and for all
all_inputs, encoded_features = encode_inputs(training_dataset, FEATURE_NAMES, NUMERIC_FEATURE_NAMES, CATEGORICAL_FEATURE_NAMES)

Processing numerical feature:  ISO0
Processing categorical feature:  FITS_ID
 ... StringLookup
Processing categorical feature:  FLAGS
 ... IntegerLookup
Processing numerical feature:  ELLIPTICITY
Processing categorical feature:  CCD_ID
 ... IntegerLookup
Processing numerical feature:  CLASS_STAR
Processing numerical feature:  ELONGATION
Processing numerical feature:  EXPTIME
Processing numerical feature:  BACKGROUND


## 4. Creating and training the model, with W&B monitoring

### 4.1 Defining the model and training procedure

In [24]:
# class weights are:
# {'GOOD': 0.5989620465839595, 'RBT': 0.19793362733891753, 'BT': 0.14426418975966473, 'B_SEEING': 0.058840136317458235}
class_weights = {
  1: 0.5989620465839595,  # GOOD
  2: 0.19793362733891753,  # RBT
  3: 0.14426418975966473,  # BT
  4: 0.058840136317458235,  # B_SEEING
  # 5: 0.004915  # BGP
}
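
The keys of `class_weights` are label indices, and index 0 is left out because it corresponds to the `[UNK]` token of the label vocabulary printed earlier. As a sketch, the same dictionary could be derived from the vocabulary and the per-class weights reported by the preparation script, rather than typed by hand. The variable names below are illustrative.

```python
# Label vocabulary as printed by datasets_loader (index 0 is the unknown token).
label_vocabulary = ["[UNK]", "GOOD", "RBT", "BT", "B_SEEING"]

# Per-class weights as printed by dnn_datasets_preparation.py.
weights_by_name = {
    "GOOD": 0.5989620465839595,
    "RBT": 0.19793362733891753,
    "BT": 0.14426418975966473,
    "B_SEEING": 0.058840136317458235,
}

# Map each label index to its weight, skipping labels without a weight ([UNK]).
class_weights_from_vocab = {
    index: weights_by_name[name]
    for index, name in enumerate(label_vocabulary)
    if name in weights_by_name
}
```
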

In [None]:
from tensorflow.keras.callbacks import ReduceLROnPlateau
from modeling.nn_modeling import create_dcn_model
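
`create_dcn_model` is defined in the repository and is not shown here. As a reminder of what a DCN V2 cross layer computes, the paper defines one cross step as x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l, where ⊙ is the element-wise product. The sketch below implements that single step in plain Python with made-up weights. It illustrates the formula only and is not the layer used by `create_dcn_model`.

```python
def cross_layer(x0, xl, weights, bias):
    """One DCN V2 cross step: x0 * (W @ xl + b) + xl (element-wise product)."""
    # Linear projection of the current layer input: W @ xl + b
    projected = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, xl)) + b_i
        for row, b_i in zip(weights, bias)
    ]
    # Explicit feature crossing with the original input, plus a residual term.
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, projected, xl)]


# Illustrative values only.
x0 = [1.0, 2.0]
W = [[0.5, 0.0], [0.0, -1.0]]
b = [0.0, 1.0]

# For the first cross layer, the layer input is x0 itself.
x1 = cross_layer(x0, x0, W, b)
```

Stacking several such steps raises the polynomial degree of the feature interactions by one per layer, which is what `num_cross_layers` presumably controls in the configuration used later.
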

In [None]:
def training_evaluation(config):
    """
    Train and evaluate the model with the given hyperparameters.
    """
    global all_inputs, encoded_features

    print("Reading and preparing the datasets ...")
    training_dataset, validation_dataset, testing_dataset = datasets_loader(
            data_path=fm_path,
            shuffling_size=config["shuffling_size"],
            batch_size=config["batch_size"],
            label_name="gt_label1",
            custom_reader_func=custom_reader_func
        )

    model = create_dcn_model(all_inputs=all_inputs,
                         encoded_inputs=encoded_features,
                         num_hidden_layers=config["num_hidden_layers"],
                         units_per_layer=config["num_units"],
                         num_cross_layers=config["num_cross_layers"],
                         dcn_dnn=config["dcn_dnn"],
                         dropout_rate=config["dropout"],
                         l2=config["l2"])

    # Disabled alternative: polynomial learning-rate decay schedule
    # lr_scheduler = tf.keras.optimizers.schedules.PolynomialDecay(initial_learning_rate=config["initial_lr"],
    #                                                             decay_steps=config["decay_steps"],
    #                                                             end_learning_rate=config["end_lr"],
    #                                                             power=1)
    # Reduce the learning rate when the validation accuracy stops improving
    reduce_lr = ReduceLROnPlateau(monitor='val_sparse_categorical_accuracy', factor=0.2, patience=3, min_lr=1e-6)


    model.compile(optimizer=keras.optimizers.Adam(learning_rate=config["initial_lr"]),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=[keras.metrics.SparseCategoricalAccuracy()])

    # tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs/",
    #                                                       histogram_freq=1,
    #                                                       update_freq="batch")
    # Train the model
    print("Start training the model...")
    history = model.fit(training_dataset,
                        epochs=config["num_epochs"],
                        validation_data=validation_dataset,
                        callbacks=[reduce_lr,
                                  #  tensorboard_callback,
                                   WandbMetricsLogger(log_freq="batch"),
                                   WandbModelCheckpoint(filepath="checkpoint.weights.h5", save_weights_only=True)
                                   ],
                        class_weight=class_weights)

    print("Model training finished.")

    print("Evaluating model performance...")
    loss, accuracy = model.evaluate(testing_dataset)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history
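
The `ReduceLROnPlateau` callback above is configured with `factor=0.2`, `patience=3` and `min_lr=1e-6`. The plain-Python simulation below is a simplified sketch of that rule for a metric that should increase: it ignores the callback's `min_delta` and `cooldown` options, and the metric values are made up for illustration.

```python
def simulate_reduce_on_plateau(metric_values, initial_lr=1e-3, factor=0.2,
                               patience=3, min_lr=1e-6):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("-inf")
    wait = 0  # number of consecutive epochs without improvement
    history = []
    for value in metric_values:
        if value > best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the learning rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Illustrative validation accuracies: three epochs without improvement
# after the second epoch trigger one reduction.
lr_history = simulate_reduce_on_plateau([0.90, 0.95, 0.94, 0.94, 0.93, 0.96])
```
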

### 4.2 Defining the monitoring configuration and experiment

In [22]:
from wandb.integration.keras import WandbMetricsLogger
from wandb.integration.keras import WandbModelCheckpoint
from wandb.integration.keras import WandbCallback

In [None]:
# Pre-configure training_evaluation with all_inputs and encoded_features
# so that the function passed to wandb.agent does not take any arguments.
# from functools import partial

# agent_function = partial(training_evaluation, all_inputs, encoded_features)
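
The commented-out idea relies on `functools.partial` to bind arguments ahead of time, so that the resulting callable takes no arguments. Since `training_evaluation` as defined above takes a single `config` argument, the bound argument would be the configuration. The sketch below shows the mechanism with a hypothetical stand-in function, so it runs without training anything.

```python
from functools import partial


def training_evaluation_stub(config):
    """Hypothetical stand-in for training_evaluation; returns what it was given."""
    return {"epochs_run": config["num_epochs"]}


# Bind the configuration now; the resulting callable takes no arguments.
agent_function = partial(training_evaluation_stub, {"num_epochs": 30})

result = agent_function()
```
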

In [None]:
wandb.teardown()

In [27]:
import os

os.environ['WANDB_AGENT_MAX_INITIAL_FAILURES'] = '1'

In [None]:
notes = f"With shuffling_size={shuffling_size} and batch_size={batch_size}.\n"

In [None]:
config = dict(
    # Hyper params
    num_hidden_layers = 1,
    num_units = 64,
    num_cross_layers = 1,
    dcn_dnn = "stack", # "stack" or "concatenate"
    dropout = 0.3,
    l2 = 0.008,
    num_classes = 5,
    shuffling_size = shuffling_size,
    batch_size = batch_size,
    initial_lr = 1e-3,
    end_lr = 1e-4,
    decay_steps = 1000,
    num_epochs = 30,
)

# Enable resuming the run
run = wandb.init(project="astro_iqa", 
                 entity="mike-fournigault1", 
                 config=config, save_code=True, 
                 resume="allow",
                 job_type="train",
                 tags=["dcn"],
                 notes=notes)

### 4.3 Running the training/evaluation experiment

In [29]:
training_history = training_evaluation(config)

Reading and preparing the datasets ...
Label vocabulary:  ['[UNK]', np.str_('GOOD'), np.str_('RBT'), np.str_('BT'), np.str_('B_SEEING')]
Start training the model...
Epoch 1/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 104s 293ms/step - loss: 1.1772 - sparse_categorical_accuracy: 0.6855 - val_loss: 0.4858 - val_sparse_categorical_accuracy: 0.9610 - learning_rate: 0.0010
Epoch 2/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 137s 285ms/step - loss: 0.1650 - sparse_categorical_accuracy: 0.9447 - val_loss: 0.0609 - val_sparse_categorical_accuracy: 0.9960 - learning_rate: 0.0010
Epoch 3/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 92s 272ms/step - loss: 0.0918 - sparse_categorical_accuracy: 0.9679 - val_loss: 0.0672 - val_sparse_categorical_accuracy: 0.9949 - learning_rate: 0.0010
Epoch 4/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 147s 288ms/step - loss: 0.0623 - sparse_categorical_accuracy: 0.9850 - val_loss: 0.0972 - val_sparse_categorical_accuracy: 0.9956 - learning_rate: 0.0010
Epoch 5/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 94s 278ms/step - loss: 0.0423 - sparse_categorical_accuracy: 0.9899 - val_loss: 0.1532 - val_sparse_categorical_accuracy: 0.9853 - learning_rate: 0.0010
Epoch 6/30
323/323 ━━━━━━━━━━━━━━━━━━━━ 91s 267ms/step - loss: 0.03

In [30]:
# Terminate the W&B run
run.finish()

Run history:

| Metric | History |
|---|---|
| batch/batch_step | ▁▂▂▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇███ |
| batch/learning_rate | ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| batch/loss | █▇▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| batch/sparse_categorical_accuracy | ▁▂▄▄▅████████████████████▇██████████████ |
| epoch/epoch | ▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇███ |
| epoch/learning_rate | ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| epoch/loss | █▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| epoch/sparse_categorical_accuracy | ▁▆▆▇███▇▇███████▇▇███████████▇ |
| epoch/val_loss | ▂▁▁▁▁▁▁▂▁▁▁▁▁▂▁▆▄▄▄▃▃▂▂▁▁▁▁▁█▁ |
| epoch/val_sparse_categorical_accuracy | ▇██████▆▇█▇█▇▆▇▂▂▃▃▃▃▇▅▆▇█▇▇▁█ |

Run summary:

| Metric | Value |
|---|---|
| batch/batch_step | 9689.0 |
| batch/learning_rate | 0.001 |
| batch/loss | 0.03569 |
| batch/sparse_categorical_accuracy | 0.97824 |
| epoch/epoch | 29.0 |
| epoch/learning_rate | 0.001 |
| epoch/loss | 0.03569 |
| epoch/sparse_categorical_accuracy | 0.97824 |
| epoch/val_loss | 0.03603 |
| epoch/val_sparse_categorical_accuracy | 0.99897 |
