<a href="https://colab.research.google.com/github/madelineapeters/MIDAS-UQ-carpentry/blob/main/iris_dataset/Madeline/CP_iris.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Conformal classification for synthetic iris dataset
## Madeline A.E. Peters

We'll start by importing the necessary modules, setting a seaborn theme for plotting and setting random seeds.

In [43]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, QuantileTransformer, MinMaxScaler
import warnings
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.layers as layers

# Set seaborn theme
sns.set_theme()

# Ignore warnings
warnings.filterwarnings("ignore")

# Set random seed for reproducibility
tf.random.set_seed(0)
tf.keras.utils.set_random_seed(0)

We'll also need to install puncc and import some functions from it:

In [44]:
!pip install puncc

Collecting puncc
  Downloading puncc-0.8.0-py3-none-any.whl.metadata (13 kB)
Downloading puncc-0.8.0-py3-none-any.whl (70 kB)
Installing collected packages: puncc
Successfully installed puncc-0.8.0


In [46]:
from deel.puncc.api.prediction import BasePredictor
from deel.puncc.classification import APS

Next, we'll load the synthetic iris dataset and print the total number of samples:

In [5]:
iris_url = "https://raw.githubusercontent.com/madelineapeters/MIDAS-UQ-carpentry/refs/heads/main/iris_dataset/Madeline/iris_synthetic_data.csv"
data = pd.read_csv(iris_url)
print(str(data.shape[0])+" rows in dataframe.")

3000 rows in dataframe.


Now we'll inspect the first few rows of the dataframe:

In [6]:
data.head()

| | sepal length | sepal width | petal length | petal width | label |
|---|---|---|---|---|---|
| 0 | 5.2 | 3.8 | 1.5 | 0.3 | Iris-setosa |
| 1 | 5.3 | 4.1 | 1.5 | 0.1 | Iris-setosa |
| 2 | 4.8 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.2 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 4 | 4.9 | 3.0 | 1.5 | 0.3 | Iris-setosa |


Next we'll print some summary information:

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal length  3000 non-null   float64
 1   sepal width   3000 non-null   float64
 2   petal length  3000 non-null   float64
 3   petal width   3000 non-null   float64
 4   label         3000 non-null   object 
dtypes: float64(4), object(1)
memory usage: 117.3+ KB


Now we'll create a list of unique labels in case we need it later:

In [8]:
labels = data["label"].unique()

Let's one-hot encode our label feature:

In [12]:
cat_encoder = OneHotEncoder()
array_1hot = cat_encoder.fit_transform(data[["label"]]).toarray()
df_1hot = pd.DataFrame(array_1hot,columns=[f"label_{i}" for i in range(len(labels) )])
data_encoded = pd.concat([data[["sepal length","sepal width","petal length","petal width"]],df_1hot],axis=1)
data_encoded.head()

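| | sepal length | sepal width | petal length | petal width | label_0 | label_1 | label_2 |
|---|---|---|---|---|---|---|---|
| 0 | 5.2 | 3.8 | 1.5 | 0.3 | 1.0 | 0.0 | 0.0 |
| 1 | 5.3 | 4.1 | 1.5 | 0.1 | 1.0 | 0.0 | 0.0 |
| 2 | 4.8 | 3.1 | 1.5 | 0.2 | 1.0 | 0.0 | 0.0 |
| 3 | 5.2 | 3.7 | 1.5 | 0.2 | 1.0 | 0.0 | 0.0 |
| 4 | 4.9 | 3.0 | 1.5 | 0.3 | 1.0 | 0.0 | 0.0 |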
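
One detail worth checking is which species each `label_i` column corresponds to. By default `OneHotEncoder` orders categories in sorted order (exposed as `cat_encoder.categories_`), whereas `unique()` returns labels in order of appearance, so the two orderings need not agree. The NumPy-only sketch below mimics the sorted ordering on a small made-up array; the sample values and variable names are illustrative and are not taken from the dataset.

```python
import numpy as np

# Made-up labels in a deliberately unsorted order (illustrative only)
sample_labels = np.array(
    ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
)

# np.unique returns the categories sorted, plus an integer code per sample
categories, codes = np.unique(sample_labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(categories), dtype=int)[codes]

for i, name in enumerate(categories):
    print(f"label_{i} -> {name}")
```

In the notebook itself, inspecting `cat_encoder.categories_` after fitting should give the same column-to-species mapping directly.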


Conformal prediction is a distribution-free method: its coverage guarantee makes no assumptions about the underlying data distribution (including normality), so we don't need to check for normality or transform the features. It only requires the data to be exchangeable, meaning that the joint distribution of the data points does not depend on their order (independent draws from the same population satisfy this). Scaling the features (e.g., so they lie between 0 and 1) is likewise not required for the guarantee to hold, although the underlying classifier may still train better on scaled inputs.
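
For reference, the guarantee that split conformal prediction provides under exchangeability can be written as follows. Here $\alpha$ is a user-chosen miscoverage level and $\hat{C}$ is the prediction set built from the calibration data; both symbols are introduced for illustration and do not appear elsewhere in this notebook.

$$
\mathbb{P}\left(Y_{n+1} \in \hat{C}(X_{n+1})\right) \ge 1 - \alpha
$$

The probability is taken jointly over the calibration points and the new test point, so this is a marginal guarantee rather than a per-sample one.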

Let's convert our data to an array and then split it into training and test sets. After that, we'll split the training data into "fitting" and "calibration" sets, and separate the x features from the y labels.

**Note**: We'll want to convert our y features to integers.

In [51]:
array_encoded = data_encoded.to_numpy()

train_array, test_array = train_test_split(array_encoded,test_size=0.125,shuffle=True,random_state=1234)
fit_array, calib_array = train_test_split(train_array,test_size=0.3,shuffle=True,random_state=1234)

test_x_array = test_array[:,0:(test_array.shape[1]-len(labels))]
test_y_array = test_array[:,(test_array.shape[1]-len(labels)):].astype(int)

fit_x_array = fit_array[:,0:(fit_array.shape[1]-len(labels))]
fit_y_array = fit_array[:,(fit_array.shape[1]-len(labels)):].astype(int)

calib_x_array = calib_array[:,0:(calib_array.shape[1]-len(labels))]
calib_y_array = calib_array[:,(calib_array.shape[1]-len(labels)):].astype(int)
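
As a quick sanity check on the two splits above, the expected set sizes can be worked out by hand. The sketch below assumes that `train_test_split` rounds a fractional test size up to the next whole sample, which is how scikit-learn is understood to behave here; the variable names are illustrative.

```python
import math

n_total = 3000  # rows reported by data.shape[0]

# Assumption: a float test_size is rounded up to a whole number of samples
n_test = math.ceil(0.125 * n_total)
n_train = n_total - n_test

# Second split: 30% of the training rows go to calibration
n_calib = math.ceil(0.3 * n_train)
n_fit = n_train - n_calib

print(n_test, n_fit, n_calib)  # prints: 375 1837 788
```

Comparing these numbers with `test_x_array.shape[0]`, `fit_x_array.shape[0]` and `calib_x_array.shape[0]` is a cheap way to confirm the slicing did what was intended. With Keras' default prediction batch size of 32, 788 calibration rows would also correspond to 25 batches.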

Now we'll set up the fully connected neural network we'll use for classification. This comes before the "conformal" portion:

In [56]:
model = tf.keras.Sequential([
    layers.Input(shape=(fit_x_array.shape[1],)),  # 4 input features (e.g., petal length, petal width, etc.)
    layers.Dense(16, activation="relu"),  # Hidden layer with 16 neurons
    layers.Dense(32, activation="relu"),  # Hidden layer with 32 neurons
    layers.Dense(fit_y_array.shape[1], activation="softmax")  # Output layer (3 classes for flowers)
])

# Compile the model
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # Use categorical_crossentropy if one-hot encoded labels
              metrics=["accuracy"])

# Print model summary
model.summary()

# Train the model
history = model.fit(
    fit_x_array,
    fit_y_array,
    epochs=25,
    batch_size=100,
    validation_split=0.2,
    verbose=1,
)


Epoch 1/25
15/15 ━━━━━━━━━━━━━━━━━━━━ 3s 23ms/step - accuracy: 0.3332 - loss: 1.2692 - val_accuracy: 0.6821 - val_loss: 0.9188
Epoch 2/25
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.6692 - loss: 0.8868 - val_accuracy: 0.6929 - val_loss: 0.7560
Epoch 3/25
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - accuracy: 0.7787 - loss: 0.7335 - val_accuracy: 0.9457 - val_loss: 0.6413
Epoch 4/25
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9513 - loss: 0.6219 - val_accuracy: 0.9375 - val_loss: 0.5454
Epoch 5/25
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.9398 - loss: 0.5336 - val_accuracy: 0.9620 - val_loss: 0.4742
Epoch 6/25
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.9644 - loss: 0.4669 - val_accuracy: 0.9864 - val_loss: 0.4169
Epoch 7/25
15/15 ━━━━━━━

Now that our model is trained, we can use conformal prediction to attach reliable uncertainty estimates to the predictions of this pre-trained neural network classifier.

First we'll wrap the model using BasePredictor() and then instantiate the APS wrapper around the trained neural network predictor.

In [57]:
predictor = BasePredictor(model,is_trained=True)
nnet_cp = APS(predictor, train=False)
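
To make the idea behind APS (adaptive prediction sets) more concrete, here is a NumPy-only sketch of a non-randomized variant: the score of a sample is the total probability of all classes ranked at or above the true class, and a class enters the prediction set when its score does not exceed a calibrated threshold. This is an illustration under stated assumptions, not puncc's implementation, which may differ (for example by randomizing ties). All function names and the toy values are hypothetical.

```python
import numpy as np

def aps_scores(probs, y_true):
    """Non-randomized APS score: total probability of the classes ranked
    at or above the true class (y_true holds integer class indices)."""
    order = np.argsort(-probs, axis=1)  # classes, most likely first
    cum_mass = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    true_rank = np.argmax(order == y_true[:, None], axis=1)  # rank of true class
    return cum_mass[np.arange(len(y_true)), true_rank]

def conformal_quantile(scores, alpha):
    """Finite-sample quantile: the ceil((n + 1)(1 - alpha))-th smallest score,
    or infinity when that rank exceeds the number of calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def aps_sets(probs, qhat):
    """Boolean mask (samples x classes): a class is kept when its score <= qhat."""
    order = np.argsort(-probs, axis=1)
    cum_mass = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    mask = np.zeros(probs.shape, dtype=bool)
    np.put_along_axis(mask, order, cum_mass <= qhat, axis=1)
    return mask

# Toy softmax outputs and integer labels (illustrative values only)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.3, 0.5]])
y_true = np.array([0, 2, 0])

scores = aps_scores(probs, y_true)
sets = aps_sets(probs, qhat=0.95)
print(scores)
print(sets)
```

Note that the sketch takes integer class indices for the true labels, not one-hot vectors. In practice `qhat` would come from `conformal_quantile` applied to the calibration scores rather than being fixed by hand as it is here.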

Next we'll calibrate:

In [58]:
nnet_cp.fit(X_calib=calib_x_array, y_calib=calib_y_array)

25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step


IndexError: index 0 is out of bounds for axis 0 with size 0
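
A plausible explanation for this `IndexError` is that the calibration step expects `y_calib` to hold integer class indices, whereas `calib_y_array` is one-hot encoded. This is an assumption based on how APS scores are usually computed; it has not been confirmed against puncc's source in this notebook. A minimal sketch of the conversion, using a made-up one-hot array in place of `calib_y_array`:

```python
import numpy as np

# Stand-in for calib_y_array: one row per sample, one column per class
one_hot = np.array([[1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0]])

# The column holding the 1 is the integer class index
class_idx = one_hot.argmax(axis=1)
print(class_idx)

# Round trip back to one-hot to confirm no information was lost
round_trip = np.eye(one_hot.shape[1], dtype=int)[class_idx]
print(np.array_equal(round_trip, one_hot))

# Hypothetical fix in the notebook (untested here):
# nnet_cp.fit(X_calib=calib_x_array, y_calib=calib_y_array.argmax(axis=1))
```

If that assumption holds, the network can still be trained on the one-hot labels; only the labels passed to the conformal wrapper would change.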

In [66]:
print(calib_y_array[:,2])

[0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1
 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 1 1
 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 1 0 1 0
 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1
 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0
 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 1 0
 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 1 0 0 1 1 0 1
 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 1 0 0 1 0 0 0 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 1 1
 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 1 0 0 1 1 1 0 0 1 1 0 0 0
 0 1 0 1 0 0 0 0 0 0 0 0 