# Model extraction

This notebook simulates a model extraction (model theft) attack: an attacker queries a black-box machine learning model and uses its predictions to train a replica that approximates the model's behavior, without access to the original training data.

We first import the required libraries, load the Iris dataset, and prepare the data for training.

In [1]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Load the data used to train the SECRET "black-box" target model
iris = load_iris()
X, y = iris.data, iris.target

# Convert to binary classification (Setosa vs Non-Setosa)
y = (y == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

We create and train a simple model, then evaluate it on clean test data (X_test). The printed accuracy serves as a baseline for comparison with the stolen model.

In [2]:
# Train the target (victim) model
target_model = Sequential([
    Dense(16, activation='relu', input_shape=(X.shape[1],)),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')  # Binary classification
])
target_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
target_model.fit(X_train, y_train, epochs=20, batch_size=8, verbose=0)

# Evaluate the black-box model
y_pred_test = (target_model.predict(X_test) > 0.5).astype(int)
print(f"✅ Target Model Test Accuracy: {np.mean(y_pred_test.flatten() == y_test):.2f}")


3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step
✅ Target Model Test Accuracy: 1.00


The attacker has no access to the original training data but can query the target model with new inputs (X_attack). The black-box model returns confidence scores, which the attacker thresholds at 0.5 to obtain hard labels. These stolen labels, paired with X_attack, form the attacker's training set. In this simulation X_attack is simply a copy of X_test.
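The black-box setting can be made explicit with a small wrapper that hides everything except the returned labels. The following is a minimal sketch, not part of the original notebook: `BlackBoxAPI`, `toy_victim`, and the `_demo` arrays are hypothetical names, and `toy_victim` is a stand-in scoring function used only so the example runs without Keras.

```python
import numpy as np


class BlackBoxAPI:
    """Hypothetical wrapper that exposes only hard labels and counts queries."""

    def __init__(self, predict_proba):
        # predict_proba: any callable mapping an (n, d) array to n probabilities
        self._predict_proba = predict_proba
        self.query_count = 0

    def query(self, X):
        X = np.asarray(X, dtype=float)
        self.query_count += len(X)
        probs = np.asarray(self._predict_proba(X)).reshape(-1)
        return (probs > 0.5).astype(int)  # the attacker sees labels only


def toy_victim(X):
    """Stand-in scoring function (an assumption, not the Keras model)."""
    # Sigmoid of a linear score: a small third feature yields a high score.
    return 1.0 / (1.0 + np.exp(4.0 * (X[:, 2] - 2.5)))


api = BlackBoxAPI(toy_victim)
X_attack_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.7, 3.1, 4.7, 1.5],
    [5.0, 3.4, 1.5, 0.2],
])
stolen_labels_demo = api.query(X_attack_demo)
print(stolen_labels_demo, api.query_count)
```

With the notebook's model, `BlackBoxAPI(target_model.predict)` should behave the same way, since `predict` returns an `(n, 1)` array that the wrapper flattens. Tracking `query_count` is useful because the number of queries is typically the attacker's main cost.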

In [3]:
# Simulate an Attacker Querying the Black-Box Model
X_attack = X_test.copy()  # Attacker queries with test samples
stolen_labels = (target_model.predict(X_attack) > 0.5).astype(int)  # Extract outputs

3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step


The attacker trains a replica model. For simplicity it uses the same architecture as the victim model, although a real attacker would not necessarily know it. The training data consists only of the queried inputs (X_attack) and the stolen labels (stolen_labels), so the replica learns to approximate the decision boundary of the black-box model.
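The replica does not have to share the victim's architecture. As a hedged illustration, the sketch below fits a plain logistic-regression surrogate with NumPy to labels returned by a hypothetical victim. `black_box_labels`, `fit_logistic_surrogate`, and `surrogate_predict` are illustrative names, and the victim here is an assumed linear rule on synthetic 2-D data rather than the Keras model.

```python
import numpy as np

rng = np.random.default_rng(0)


def black_box_labels(X):
    """Hypothetical victim: returns hard labels only (stand-in for the Keras model)."""
    return (X[:, 0] + X[:, 1] < 0).astype(int)


def fit_logistic_surrogate(X, y, lr=0.5, steps=500):
    """Fit a logistic-regression surrogate by batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of BCE w.r.t. the logits
        w -= lr * (X.T @ grad) / len(X)
        b -= lr * grad.mean()
    return w, b


def surrogate_predict(X, w, b):
    """Hard labels from the surrogate (logit > 0 is probability > 0.5)."""
    return (X @ w + b > 0).astype(int)


X_query = rng.normal(size=(200, 2))   # attacker-chosen query points
y_stolen = black_box_labels(X_query)  # labels obtained from the victim
w, b = fit_logistic_surrogate(X_query, y_stolen)

X_fresh = rng.normal(size=(200, 2))   # points not used to fit the surrogate
agreement = np.mean(surrogate_predict(X_fresh, w, b) == black_box_labels(X_fresh))
print(f"Agreement on fresh points: {agreement:.2f}")
```

A simpler surrogate works here because the assumed victim is linear; for a victim with a more complex boundary, the attacker would need a more expressive replica or more queries.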

In [4]:
# Train a Copy (Replica) Model Using the Stolen Data
replica_model = Sequential([
    Dense(16, activation='relu', input_shape=(X.shape[1],)),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')  # Binary classification
])
replica_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
replica_model.fit(X_attack, stolen_labels, epochs=20, batch_size=8, verbose=0)

<keras.src.callbacks.history.History at 0x7ce0081eb210>

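We now evaluate the replica model on X_test and compare its accuracy with the victim model's. If the two are similar, the extraction succeeded. Note that X_attack was a copy of X_test, so the replica is scored here on the same inputs it was trained on; a held-out set would give a stricter estimate.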
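Accuracy against the true labels is one way to judge the theft; another is fidelity, the fraction of inputs on which the replica and the victim agree. The sketch below shows both measures on made-up predictions. The `accuracy` and `fidelity` helpers and the `_demo` arrays are illustrative and are not outputs of the models in this notebook.

```python
import numpy as np


def accuracy(y_pred, y_true):
    """Fraction of predictions that match the reference labels."""
    y_pred = np.asarray(y_pred).reshape(-1)
    y_true = np.asarray(y_true).reshape(-1)
    return float(np.mean(y_pred == y_true))


def fidelity(y_replica, y_target):
    """Fraction of inputs on which replica and target predict the same label."""
    return accuracy(y_replica, y_target)


# Made-up predictions for illustration only
y_true_demo = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_target_demo = np.array([1, 0, 0, 1, 0, 0, 0, 0])
y_replica_demo = np.array([[1], [0], [0], [1], [0], [0], [1], [0]])  # Keras-style (n, 1)

print(accuracy(y_target_demo, y_true_demo))      # target vs. ground truth
print(accuracy(y_replica_demo, y_true_demo))     # replica vs. ground truth
print(fidelity(y_replica_demo, y_target_demo))   # replica vs. target
```

In this notebook the equivalent call would be `fidelity(y_pred_replica, y_pred_test)`. A replica can have high fidelity and still reproduce the victim's mistakes, which is why the two numbers are worth reporting separately.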

In [None]:
# Evaluate the replica model
y_pred_replica = (replica_model.predict(X_test) > 0.5).astype(int)
replica_accuracy = np.mean(y_pred_replica.flatten() == y_test)

print(f"💀 Model Exfiltration Success: Stolen Model Accuracy = {replica_accuracy:.2f}")

# Compare model performance
target_accuracy = np.mean(y_pred_test.flatten() == y_test)
plt.bar(["Target Model", "Stolen Model"], [target_accuracy, replica_accuracy], color=['blue', 'red'])
plt.ylim(0, 1)
plt.ylabel("Accuracy")
plt.title("Model Exfiltration: Comparing Stolen vs. Target Model")
plt.show()


3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
