# Using Siamese Networks for "Classification"

Here we adopt a Siamese Neural Network (SNNs) algorithm to perform "classification." SNNs are a powerful tool for performing similarity analysis, and are very powerful when you have severely unbalanced data. With this technique, a single training example is not a single record, instead you feed in a pair-wise input of two different records, and ask how similar they are. In this notebook, we will take all records and generate every combination of these records (`X_train`) and whether or not they belong to the same class (`y_train`) .

In [1]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

2023-10-17 23:32:12.241435: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-17 23:32:12.241486: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-17 23:32:12.241508: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-17 23:32:12.247600: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Load Your Data

Replace the following with your actual data. In this example, we are looking at a binary classification problem, however we could (AND SHOULD) instead change this a multi-classification problem across all traits - moreon this later.

In [2]:
# NOTE: Don't load all of your data! You will run out of memory!
features = np.array([
    [0,0,1,1,1,0,0,0,1,1,1,0,None,1.0,1,0,0,1.0,1,0,0,3.0,2.0,1.0,0.0,1,0,0,0,0,0,0,0],
    [0,0,0,2,0,2,0,2,2,2,0,1,0.0,1.0,1,0,0,1.0,0,1,0,1.0,2.0,1.0,0.0,1,0,0,0,0,0,0,0],
    [0,2,0,1,1,2,0,1,2,1,0,1,0.0,1.0,1,0,0,0.0,1,0,0,0.0,1.0,0.0,1.0,1,0,0,0,0,0,0,0],
    [0,0,0,1,1,2,0,2,0,1,0,1,3.0,1.0,0,1,0,1.0,1,0,0,3.0,2.0,0.0,1.0,0,1,0,0,0,0,0,0],
    [0,0,0,0,2,1,0,2,2,1,0,1,1.0,1.0,1,0,0,1.0,1,0,0,3.0,2.0,1.0,0.0,1,0,0,1,0,0,0,0],
    [0,1,1,1,1,0,0,2,1,2,0,1,4.0,0.0,1,0,0,None,1,0,0,None,None,None,None,0,1,0,1,0,0,0,0],
    [1,0,0,0,1,2,1,1,1,0,0,1,1.0,0.0,1,0,0,1.0,1,0,0,0.0,2.0,0.0,0.0,1,0,0,0,0,0,0,0],
    [0,2,0,0,2,1,0,2,1,0,1,0,None,0.0,1,0,0,0.0,0,1,0,0.0,2.0,1.0,0.0,1,0,0,0,0,0,0,0],
    [1,0,1,1,1,1,0,1,1,0,1,0,None,0.0,1,0,0,None,0,1,0,None,None,None,None,1,0,0,0,1,0,0,0],
    [0,1,1,0,2,1,2,1,0,1,0,1,2.0,1.0,1,0,0,None,0,1,0,None,None,None,None,1,0,0,0,0,0,0,0]
])

# HACK: For now, map `None` to -1. However, some thought may be required on appropriate representation.
# As an example, instead of representing answers to an optional question as YES/NO/BLANK (1/0/None) you can instead adopt (1,-1,0)
features[features==None] = -1

labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

## Data Preprocessing

Next step is to construct the actual training data, by generating **all combinations** of our samples (`pair_data`) and if they belong to the same class (`pair_labels`).

You can see how this easily extends beyond binary classification/similarity.

In [3]:
pair_data = []
pair_labels = []

# Generate pairs
for i in range(len(features)):
    for j in range(i+1, len(features)):
        pair_data.append([features[i], features[j]])
        pair_labels.append(labels[i] == labels[j])

# Convert to NumPy arrays
pair_data = np.array(pair_data)
pair_data = pair_data.astype(int)
pair_labels = np.array(pair_labels, dtype=int)

# Split your dataset into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(pair_data, pair_labels, test_size=0.2)

## Define Neural Network Architecture

Here is an initial Siamese Neural Network with a very simple architecture. As with any other neural network project, you should spend considerable time tuning the hyperparameters and infrastructure.

For example, you borrow lessons learend from architectures likes ResNet or ImageNet.

In [4]:
input_shape = X_train[0][0].shape

# Define Siamese network architecture
def create_siamese_network():
    input_layer = tf.keras.layers.Input(shape=input_shape)

    x = tf.keras.layers.Dense(64, activation='relu')(input_layer)  # CHANGE ME
    x = tf.keras.layers.Dense(128, activation='relu')(x)  # CHANGE ME
    x = tf.keras.layers.Dense(256, activation='relu')(x)  # CHANGE ME
    output_layer = tf.keras.layers.Dense(128)(x)

    return tf.keras.Model(inputs=input_layer, outputs=output_layer)

# Create two input layers for pairs of data
input_left = tf.keras.layers.Input(shape=input_shape)
input_right = tf.keras.layers.Input(shape=input_shape)

# Build individual Siamese networks
siamese_network = create_siamese_network()
embedding_left = siamese_network(input_left)
embedding_right = siamese_network(input_right)

# Define the contrastive loss function
# NOTE: there are other loss functions you can use
def contrastive_loss(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)  # Cast y_true to float32
    margin = 1.0
    return tf.reduce_mean(y_true * tf.square(y_pred) + (1 - y_true) * tf.square(tf.maximum(margin - y_pred, 0)))

# Build the Siamese model
model = tf.keras.models.Model(inputs=[input_left, input_right], outputs=[embedding_left, embedding_right])

# Compile the model
model.compile(optimizer='adam', loss=contrastive_loss)  # CHANGE ME

2023-10-17 23:32:14.595741: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-17 23:32:14.599699: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-17 23:32:14.599733: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-17 23:32:14.601533: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-17 23:32:14.601569: I tensorflow/compile

## Model Training

Crude model training. Again, you should tune hyperparamters.

In [5]:
# Train the Siamese network
batch_size = 32  # CHANGE ME
num_epochs = 10  # CHANGE ME

model.fit([X_train[:, 0], X_train[:, 1]], y_train, batch_size=batch_size, epochs=num_epochs)

Epoch 1/10


2023-10-17 23:32:18.605475: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fbb88268210 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-17 23:32:18.605514: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
2023-10-17 23:32:18.611281: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-10-17 23:32:18.748009: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8600
2023-10-17 23:32:18.831441: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7fbc3ecedb50>

In [6]:
# Evaluate the model
test_loss = model.evaluate([X_test[:, 0], X_test[:, 1]], y_test)
print("Test loss:", test_loss)

Test loss: [0.5983409285545349, 0.28031328320503235, 0.31802764534950256]


## Inference Example

Here we grab a random record from our data to serve as a theoretical `new_data_sample`. Similarly, we grab 2 random case records to serve as `cases_reference_samples`.

We then compare the `new_data_sample` to each of the `cases_reference_samples` and ask if the average score is above a threshold (default `0.5`).

In [7]:
# Select representative reference samples from each class
num_reference_samples = 2  # CHANGE ME: Adjust the number of reference samples as needed

# Randomly select reference samples from the "cases" class
X_cases = features[labels==1]
cases_reference_samples = X_cases[np.random.choice(len(X_cases), num_reference_samples, replace=False)].astype(float)

# Grab a random sample to serve as "new data"
new_data_sample = features[np.random.choice(len(features),  1)].astype(float).reshape(input_shape)

# Define a function to compute the similarity score between a new data sample and reference samples
def compute_similarity_scores(new_data, reference_samples):
    # Initialize an array to store similarity scores
    similarity_scores = []

    for reference_sample in reference_samples:
        # Ensure the data has the shape (1, input_dim) for prediction
        new_data = np.reshape(new_data, (1, input_shape[0]))
        reference_sample = np.reshape(reference_sample, (1, input_shape[0]))
        
        # Compute the similarity or dissimilarity score with the Siamese model
        similarity_score = model.predict([new_data, reference_sample])

        # Append the score to the list
        similarity_scores.append(similarity_score)

    return similarity_scores

# Compute similarity scores with reference samples for both classes
cases_similarity_scores = compute_similarity_scores(new_data_sample, cases_reference_samples)

# Calculate average similarity scores for each class
average_cases_similarity = np.mean(cases_similarity_scores)

# Set a threshold (adjust as needed)
threshold = 0.5  # CHANGE ME

# Make a binary classification decision based on the threshold
if average_cases_similarity > threshold:
    predicted_class = "cases"
else:
    predicted_class = "controls"

print(f"Predicted Class: {predicted_class} ({average_cases_similarity})")

Predicted Class: cases (0.5557858943939209)


## Suggested Improvements

- Use a proper data loader utility to dynamically pull/transform your pair wise data. Every possible combination of 2 in your pool of 400,000 is a VERY big number.
    - You also don't have to use EVERY possible combination, you can generate random pairs if the data is too big
- Extend data beyond binary classification
    - Better to train a single big model properly, then a bunch of improperly/not tuned small models.
- Pick an appropriate numerical encoding for your data to differentiate between `None` and `0`.
- Tune neural network architecture
    - Look at literature and borrow other architectures.
    - Some, like ResNet, are automatically included with Tensorflow Hub and can be pulled with/without pre-trained weights
- Tune hyperparameters
- Chose more than 2 reference samples
    - Ideally you pick a set of "gold standard" references
- Tune inference threshold by inspecting distributions of scores
- Refactor training using Ray for distributed computing
    - Execute on cloud infrastructure w/ Ray Cluster: https://docs.ray.io/en/latest/cluster/getting-started.html
    - Ray Train integration with Tensorflow: https://docs.ray.io/en/latest/train/distributed-tensorflow-keras.html
    - Ray Tune Hyperparameter tuning integration with Optuna: https://docs.ray.io/en/latest/tune/index.html
    - Ray Data for scalable data processing: https://docs.ray.io/en/latest/data/data.html