# BreakHis Image Classification with 🤗 Vision Transformers and `TensorFlow`

### Quick intro: Vision Transformer (ViT) by Google Brain
The Vision Transformer (ViT) is essentially a BERT-style Transformer encoder applied to images. It attains excellent results compared to state-of-the-art convolutional networks when pre-trained on large amounts of data. To feed an image to the model, the image is split into a sequence of fixed-size patches (typically 16x16 or 32x32 pixels), which are linearly embedded. A [CLS] token is prepended to the sequence and used for classification. Absolute position embeddings are then added, and the resulting sequence is passed to the Transformer encoder.

* [Original paper](https://arxiv.org/abs/2010.11929)
* [Official repo (in JAX)](https://github.com/google-research/vision_transformer)
* [🤗 Vision Transformer](https://huggingface.co/docs/transformers/model_doc/vit)
* [Pre-trained model](https://huggingface.co/google/vit-base-patch16-224-in21k)
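
The patch splitting described above can be sketched with plain NumPy. This is only an illustration of the idea, not the code used inside the 🤗 model; the function name `image_to_patches` is made up for this example, and the 224x224 input with 16x16 patches is taken from the checkpoint name.

```python
import numpy as np


def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    height, width, channels = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flat vector per patch; this is what the linear embedding would consume
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image)
sequence_length = patches.shape[0] + 1  # +1 for the [CLS] token
print(patches.shape, sequence_length)
```

For a 224x224 RGB image this gives 14 x 14 = 196 patches of 768 values each, so the encoder sees a sequence of 197 tokens including [CLS].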

## Installation

In [1]:
# !pip install transformers datasets "tensorflow==2.6.0" tensorflow-addons --upgrade

## Setup & Configuration

In this step, we define the global configuration and parameters used across the whole end-to-end fine-tuning process, e.g. the image processor (formerly called `feature extractor`) and the `model` we will use.

In this example we are going to fine-tune [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k), a Vision Transformer (ViT) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224.
There are also [large](https://huggingface.co/google/vit-large-patch16-224-in21k) and [huge](https://huggingface.co/google/vit-huge-patch14-224-in21k) flavors of the original ViT.

In [2]:
model_id = "google/vit-base-patch16-224-in21k"
zoom = 400


In [3]:
from datasets import load_dataset
from datetime import datetime
import json
from keras.utils import to_categorical
from keras.callbacks import CSVLogger, EarlyStopping
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from pathlib import Path
from PIL import Image
import shutil

import tensorflow as tf
import tensorflow_addons as tfa
from transformers import create_optimizer, DefaultDataCollator, ViTImageProcessor, TFViTForImageClassification



## Dataset & Pre-processing

- **Data Source:** https://www.kaggle.com/code/nasrulhakim86/breast-cancer-histopathology-images-classification/data
- The Breast Cancer Histopathological Image Classification (BreakHis) dataset is composed of 7,909 microscopic images of breast tumor tissue collected from 82 patients.
- The images were collected using different magnification factors (40X, 100X, 200X, and 400X).
- To date, it contains 2,480 benign and 5,429 malignant samples (700x460 pixels, 3-channel RGB, 8-bit depth in each channel, PNG format).
- This database has been built in collaboration with the P&D Laboratory – Pathological Anatomy and Cytopathology, Parana, Brazil (http://www.prevencaoediagnose.com.br). 
- Each image filename stores information about the image itself: biopsy procedure, tumor class, tumor type, patient identification, and magnification factor.
- For example, `SOB_B_TA-14-4659-40-001.png` is image 1, at magnification factor 40X, of a benign tumor of type tubular adenoma, originating from slide 14-4659, which was collected by procedure SOB.
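
The naming convention above can be turned into a small parser. This is a minimal sketch that assumes the underscore-separated form `SOB_B_TA-14-4659-40-001.png`, which matches the patient IDs printed later in this notebook. The regular expression and the field names are choices made for this illustration, not part of the dataset's tooling.

```python
import re

# <procedure>_<class>_<type>-<year>-<slide>-<magnification>-<sequence>.png
BREAKHIS_NAME = re.compile(
    r"^(?P<procedure>[A-Z]+)_(?P<tumor_class>[BM])_(?P<tumor_type>[A-Z]+)"
    r"-(?P<slide>\d+-[0-9A-Za-z]+)"
    r"-(?P<magnification>\d+)-(?P<sequence>\d+)\.png$"
)


def parse_breakhis_name(filename):
    """Return the fields encoded in a BreakHis image filename as a dict."""
    match = BREAKHIS_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unexpected BreakHis filename: {filename!r}")
    fields = match.groupdict()
    fields["magnification"] = int(fields["magnification"])
    fields["sequence"] = int(fields["sequence"])
    return fields


info = parse_breakhis_name("SOB_B_TA-14-4659-40-001.png")
print(info)
```

The `tumor_class` field (`B` or `M`) is what a binary benign/malignant label would be derived from, and the prefix up to the slide number identifies the patient.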

`BreakHis` is not yet available as a dataset in the `datasets` library. To create a `Dataset` instance, we need to write a small helper function that loads our `Dataset` from the filesystem and creates the instance used later for training.

This notebook assumes that the dataset is located in a directory named `breakhis_400x` next to this file.

In [4]:
now = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
cwd = Path().absolute()
input_path = cwd / f'breakhis_{zoom}x'
output_path = cwd / 'results' / f'{zoom}x_{now}'

output_path

PosixPath('/home/miki/Documents/studia/praca_dyplomowa/vcs/results/400x_2023_05_10-21_50_58')

In [5]:
shutil.rmtree(output_path, ignore_errors=True)
os.makedirs(output_path)

#### Count number of samples per patient

In [6]:
# data = pd.read_csv(train_val_csv)
# group_counts = data.groupby('patient_id').size().reset_index(name='count')

# group_counts
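
The per-patient count matters because the folds are later split by patient rather than by image. A minimal sketch of the same count without pandas is shown below. The IDs follow the format printed later in this notebook, but the list itself is made up for illustration.

```python
from collections import Counter

# Hypothetical patient_id column: one entry per image
patient_ids = [
    "SOB_B_TA-14-13200",
    "SOB_B_TA-14-13200",
    "SOB_M_DC-14-2523",
    "SOB_M_DC-14-2523",
    "SOB_M_DC-14-2523",
]

counts = Counter(patient_ids)
for patient_id, count in counts.most_common():
    print(patient_id, count)
```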

## Fine-tuning the model using `Keras`

Once our `dataset` is processed, we can download the pretrained model and fine-tune it. Before we can do this, we need to convert our Hugging Face `datasets` Dataset into a `tf.data.Dataset`. For this, we will use the `.to_tf_dataset` method and a data collator (an object that forms a batch from a list of dataset elements).
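
To make the role of a data collator concrete, here is a minimal NumPy sketch of the batching step. It is an illustration of the idea only, not the implementation of `DefaultDataCollator`; the function name `collate` and the example values are assumptions.

```python
import numpy as np


def collate(examples):
    """Stack a list of per-example dicts into one dict of batched arrays."""
    keys = examples[0].keys()
    return {key: np.stack([np.asarray(ex[key]) for ex in examples]) for key in keys}


# Three fake examples shaped like the processed dataset rows
examples = [
    {
        "pixel_values": np.zeros((3, 224, 224), dtype=np.float32),
        "label": np.array([1.0, 0.0], dtype=np.float32),
    }
    for _ in range(3)
]

batch = collate(examples)
print(batch["pixel_values"].shape, batch["label"].shape)
```

Each key gains a leading batch dimension, which is the layout the model expects from the `tf.data.Dataset`.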




## Hyperparameters

In [7]:
id2label = {"0": "benign", "1": "malignant"}
label2id = {v: k for k, v in id2label.items()}

num_train_epochs = 10
train_batch_size = 3
eval_batch_size = 3
learning_rate = 3e-5
weight_decay_rate = 0.01
num_warmup_steps = 0
output_dir = model_id.split("/")[1]
fp16 = True

# Train in mixed-precision float16
# Set fp16 = False above if you're using a GPU that will not benefit from this
if fp16:
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
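
These values determine how many optimizer steps each fold runs: `train_model` later computes `len(train) * num_train_epochs`, i.e. batches per epoch times epochs. The sketch below reproduces that arithmetic and a simplified linear decay of the learning rate. The sample count of 1,134 is hypothetical, chosen so that the result matches the `num_train_steps = 3780` printed for the first fold, and `linear_decay_lr` is a stand-in written for this example; the exact schedule built by `create_optimizer` should be checked in its documentation.

```python
import math


def count_train_steps(num_samples, batch_size, num_epochs):
    """Batches per epoch (last partial batch included) times number of epochs."""
    return math.ceil(num_samples / batch_size) * num_epochs


def linear_decay_lr(step, init_lr, num_train_steps, num_warmup_steps=0):
    """Linear warmup followed by linear decay to zero (simplified)."""
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / max(num_train_steps - num_warmup_steps, 1)


steps = count_train_steps(num_samples=1134, batch_size=3, num_epochs=10)
print(steps)  # 3780
for step in (0, steps // 2, steps):
    print(step, linear_decay_lr(step, init_lr=3e-5, num_train_steps=steps))
```

Because early stopping usually ends training before the last epoch, the schedule is typically not consumed to the end, so the learning rate never actually reaches zero.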


INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce RTX 3070 Laptop GPU, compute capability 8.6




### Download the pretrained transformer model and fine-tune it

In [8]:
tf.debugging.disable_traceback_filtering()

image_processor = ViTImageProcessor.from_pretrained(model_id)

train_val_csv = str(input_path / "train_val.csv")

dataset = load_dataset('csv', data_files={'train': train_val_csv})


def load_images(file_locs):
    return [Image.open(file_loc).convert("RGB") for file_loc in file_locs]


images = load_images(dataset['train']['file_loc'])


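def process_example(image):
    # Resize and normalize one PIL image; the result keeps a leading batch dimension of 1
    inputs = image_processor(image, return_tensors='tf')
    return inputs['pixel_values']


def process_dataset(example, idx):
    # Look up the preloaded image by row index and attach its pixel values
    example['pixel_values'] = process_example(images[idx])

    # One-hot encode the integer label (0 = benign, 1 = malignant)
    example['label'] = to_categorical(example['label'], num_classes=2)
    return example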


dataset = dataset.map(process_dataset, with_indices=True, num_proc=1)

print(dataset)



DatasetDict({
    train: Dataset({
        features: ['file_loc', 'label', 'patient_id', 'pixel_values'],
        num_rows: 1427
    })
})
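
The `label` column now holds one-hot vectors produced by `to_categorical`, and the cross-validation code later recovers class indices with `np.argmax(labels, axis=1)`. The round trip can be sketched with NumPy alone; `one_hot` below is a stand-in written for this example, not the Keras function.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class indices into one-hot rows."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]


labels = [0, 1, 1, 0]  # 0 = benign, 1 = malignant
encoded = one_hot(labels, num_classes=2)
decoded = np.argmax(encoded, axis=1)

print(encoded)
print(decoded)
```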


In [9]:
def get_loss():
    return tf.keras.losses.BinaryCrossentropy(from_logits=True)


def get_metrics():
    return [
        tf.keras.metrics.BinaryAccuracy(name="accuracy"),
        tf.keras.metrics.AUC(name='auc', from_logits=True),
        tf.keras.metrics.AUC(name='auc_multi', from_logits=True,
                             num_labels=2, multi_label=True),
        tf.keras.metrics.Recall(name='recall'),
        tf.keras.metrics.Precision(name='precision'),
        tfa.metrics.F1Score(name='f1_score', num_classes=2, threshold=0.5),
    ]


def get_callbacks(idx):
    return [
        EarlyStopping(monitor="val_loss", patience=3),
        CSVLogger(output_path / f'train_metrics_{idx}.csv')
    ]


def get_optimizer(learning_rate, weight_decay_rate, num_warmup_steps, num_train_steps):
    optimizer, _ = create_optimizer(
        init_lr=learning_rate,
        num_train_steps=num_train_steps,
        weight_decay_rate=weight_decay_rate,
        num_warmup_steps=num_warmup_steps,
    )

    return optimizer


num_train_steps_list = []
def train_model(idx, train, val):
    num_train_steps = len(train) * num_train_epochs
    num_train_steps_list.append(num_train_steps)
    print(f"num_train_steps = {num_train_steps}")
    optimizer = get_optimizer(
        learning_rate, weight_decay_rate, num_warmup_steps, num_train_steps)

    # load pre-trained ViT model
    model = TFViTForImageClassification.from_pretrained(
        model_id,
        num_labels=2,
        id2label=id2label,
        label2id=label2id,
    )

    # compile model
    model.compile(optimizer=optimizer, loss=get_loss(), metrics=get_metrics())

    history = model.fit(
        train,
        validation_data=val,
        callbacks=get_callbacks(idx),
        epochs=num_train_epochs,
    )

    return {
        'model': model,
        'history': history
    }
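
One detail worth checking in `get_metrics`: the loss and both `AUC` metrics are told that predictions are logits, while the accuracy, recall, precision and F1 metrics are created with default or 0.5 thresholds. If those metrics receive raw logits (an assumption about how the model outputs reach them), a 0.5 cut on logits is stricter than a 0.5 cut on probabilities, because sigmoid(z) > 0.5 exactly when z > 0. The sketch below shows the difference on made-up logits.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical raw model outputs for five samples
logits = np.array([-2.0, -0.1, 0.3, 0.4, 3.0])

from_probabilities = sigmoid(logits) > 0.5  # threshold applied after sigmoid
from_logits = logits > 0.0                  # equivalent threshold on raw logits
wrong_cut = logits > 0.5                    # 0.5 applied directly to logits

print(from_probabilities)
print(from_logits)
print(wrong_cut)
```

In this toy case the samples with logits 0.3 and 0.4 are positive under the first two rules but negative under the third, which would lower the reported recall.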


In [10]:
data_collator = DefaultDataCollator(return_tensors="tf")

train_val_data = dataset['train']
files = np.array(train_val_data['file_loc'])
labels = np.array(train_val_data['label'])
patient_ids = np.array(train_val_data['patient_id'])


#### Custom `StratifiedGroupKFold` implementation

In [11]:
import numpy as np
from collections import Counter
from sklearn.utils import shuffle

def map_nested_indices(nested_indices, original_indices):
    return original_indices[nested_indices]

class StratifiedGroupKFold:
    def __init__(self, n_splits=5, random_state=None):
        self.n_splits = n_splits
        self.random_state = random_state
        self.used_group_ids = []

    def _fill_bucket(self, bucket, class_counts, group_ids, y):
        for group_id, label in zip(group_ids, y):
            if group_id in self.used_group_ids:
                continue
            if class_counts[label] > 0:
                group_indices = np.where(group_ids == group_id)[0]
                bucket[label].extend(group_indices)
                class_counts[label] -= len(group_indices)
                self.used_group_ids.append(group_id)

    def _create_buckets(self, group_ids, y, class_ratios):
        total_samples = len(group_ids)
        samples_per_split = total_samples // self.n_splits

        buckets = []
        for _ in range(self.n_splits):
            bucket = {label: [] for label in class_ratios.keys()}
            class_counts = {label: int(samples_per_split * ratio)
                            for label, ratio in class_ratios.items()}
            self._fill_bucket(bucket, class_counts, group_ids, y)
            buckets.append(bucket)

        return buckets

    def _rotate_buckets(self, buckets):
        return buckets[-1:] + buckets[:-1]

    def _get_indices(self, bucket, group_ids, y):
        indices = []
        for label, groups in bucket.items():
            for group in groups:
                group_indices = np.where(group_ids == group)[0]
                label_indices = np.where(y == label)[0]
                indices.extend(np.intersect1d(group_indices, label_indices))
        return np.array(indices)

    def split(self, X, y, group_ids):
        # Reset per-call state so that calling split() again starts from scratch
        self.used_group_ids = []
        index_map = np.arange(len(y))
        group_ids_s, y_s, index_map = shuffle(
            group_ids, y, index_map, random_state=self.random_state)

        class_ratios = {label: count / len(y)
                        for label, count in Counter(y).items()}
        buckets = self._create_buckets(group_ids_s, y_s, class_ratios)

        for _ in range(self.n_splits):
            train_buckets = buckets[1:]
            test_bucket = buckets[0]

            train_indices = np.concatenate(
                [np.array(bucket[label]) for bucket in train_buckets for label in bucket])
            test_indices = np.concatenate(
                [np.array(test_bucket[label]) for label in test_bucket])

            # Map shuffled indices to original ones
            train_indices = index_map[train_indices]
            test_indices = index_map[test_indices]
            
            assert len(np.intersect1d(np.unique(group_ids[train_indices]), np.unique(group_ids[test_indices]))) == 0


            yield train_indices, test_indices
            buckets = self._rotate_buckets(buckets)
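
The loop in `split` always uses the first bucket for validation and then rotates the list, so every bucket serves as the validation fold exactly once. A minimal sketch of that rotation, using placeholder bucket names instead of index lists:

```python
def rotate(buckets):
    """Move the last bucket to the front, as _rotate_buckets does."""
    return buckets[-1:] + buckets[:-1]


def rotation_schedule(buckets):
    """Return (validation_bucket, training_buckets) for each fold."""
    schedule = []
    for _ in range(len(buckets)):
        schedule.append((buckets[0], buckets[1:]))
        buckets = rotate(buckets)
    return schedule


schedule = rotation_schedule(["b0", "b1", "b2"])
for validation, training in schedule:
    print("val:", validation, "train:", training)
```

With three buckets the validation order is `b0`, `b2`, `b1`, and the two remaining buckets form the training set each time.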


#### `StratifiedGroupKFold` usage example

In [12]:
# np.random.seed(42)

# # Generate data
# n_samples = 1800
# n_features = 2

# X = np.random.rand(n_samples, n_features)
# y = np.random.randint(0, 2, n_samples)
# group_ids = np.repeat(np.arange(n_samples // 4), 4)

# # Shuffle data, labels and groups
# X, y, group_ids = shuffle(X, y, group_ids, random_state=42)

# # Create a StratifiedGroupKFold instance
# skf = StratifiedGroupKFold(n_splits=3, random_state=42)

# # Iterate over the folds
# for train_index, test_index in skf.split(X, y, group_ids):
#     print("TRAIN:", len(train_index), "TEST:", len(test_index))
#     # X_train, X_test = X[train_index], X[test_index]
#     y_train, y_test = y[train_index], y[test_index]
#     train_groups, test_groups = group_ids[train_index], group_ids[test_index]
#     train_class_ratio = {label: count / len(y_train) for label, count in Counter(y_train).items()}
#     print(f"Train class ratio: {train_class_ratio}")
#     test_class_ratio = {label: count / len(y_test) for label, count in Counter(y_test).items()}
#     print(f"test class ratio: {test_class_ratio}")
#     assert len(np.intersect1d(np.unique(train_groups), np.unique(test_groups))) == 0
#     print("-----")


In [13]:
def filter_train_val_indices(idx, indices):
    return idx in indices

def remove_extra_dim(example):
    example['pixel_values'] = np.squeeze(example['pixel_values'], axis=0)
    return example
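
`remove_extra_dim` exists because each image was processed on its own, so its stored `pixel_values` carry a leading batch dimension of 1. Without the squeeze, batching would stack that extra axis into every batch. A shape-only sketch with NumPy (the zero arrays are placeholders for real pixel values):

```python
import numpy as np

# One processed image as stored in the dataset: (1, channels, height, width)
single = np.zeros((1, 3, 224, 224), dtype=np.float32)

squeezed = np.squeeze(single, axis=0)       # drop the per-image batch axis
batch = np.stack([squeezed] * 3)            # what a batch of 3 should look like
unsqueezed_batch = np.stack([single] * 3)   # what it would look like without the squeeze

print(squeezed.shape, batch.shape, unsqueezed_batch.shape)
```

The squeezed version yields `(batch, 3, 224, 224)`, which matches the `element_spec` printed for the training dataset later on.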


In [14]:
n_splits = 5
sgfk = StratifiedGroupKFold(n_splits=n_splits, random_state=42)

folds = sgfk.split(files, np.argmax(labels, axis=1), patient_ids)


In [15]:
models_with_histories = []


def intersection(lst1, lst2):
    return list(set(lst1) & set(lst2))


def run_fold(idx):
    (train_index, val_index) = next(folds)

    # train_index = map_nested_indices(train_index, train_val_index)
    # val_index = map_nested_indices(val_index, train_val_index)

    # Check indices uniqueness
    train_index_unique, train_counts = np.unique(
        train_index, return_counts=True)
    val_index_unique, val_counts = np.unique(val_index, return_counts=True)
    are_all_values_unique = np.all(
        train_counts == 1) and np.all(val_counts == 1)

    print('Are all indices', are_all_values_unique)
    print(
        f'Indices shared between train & val splits (should be empty): {intersection(train_index_unique, val_index_unique)}')

    # Check patient ids uniqueness
    train_data_filtered = train_val_data.filter(lambda _, idx: filter_train_val_indices(
        idx, train_index), with_indices=True).map(remove_extra_dim)
    val_data_filtered = train_val_data.filter(lambda _, idx: filter_train_val_indices(
        idx, val_index), with_indices=True).map(remove_extra_dim)

    train_ids_unique = np.unique(train_data_filtered['patient_id'])
    val_ids_unique = np.unique(val_data_filtered['patient_id'])

    print(f'Train patient IDs: {len(train_ids_unique)}')
    print(f'Val patient IDs: {len(val_ids_unique)}')
    print(
        f'Train + Val patient IDs: {len(train_ids_unique) + len(val_ids_unique)}')
    print(
        f'Patient IDs shared between train & val splits (should be empty): {intersection(train_ids_unique, val_ids_unique)}')

    print(f'Train patient ids: {train_ids_unique}')
    print(f'Val patient ids: {val_ids_unique}')
    assert len(np.intersect1d(train_ids_unique, val_ids_unique)) == 0

    # Create datasets and train model
    train_dataset = train_data_filtered.to_tf_dataset(
        columns=['pixel_values'],
        label_cols=['label'],
        shuffle=True,
        batch_size=train_batch_size,
        collate_fn=data_collator
    )

    val_dataset = val_data_filtered.to_tf_dataset(
        columns=['pixel_values'],
        label_cols=['label'],
        shuffle=True,
        batch_size=eval_batch_size,
        collate_fn=data_collator
    )
    print(train_dataset)
    print(val_dataset)

    # model_with_history = {}
    model_with_history = train_model(idx, train_dataset, val_dataset)
    model_with_history['patient_ids'] = {
        'train': list(train_ids_unique), 'val': list(val_ids_unique)}

    models_with_histories.append(model_with_history)

    print(f'Fold {idx + 1}/{n_splits} finished')


In [16]:
for idx in range(n_splits):
    run_fold(idx)

Loading cached processed dataset at /home/miki/.cache/huggingface/datasets/csv/default-0010e7a0edeaefdb/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-bd94e691d43e79b1.arrow
Loading cached processed dataset at /home/miki/.cache/huggingface/datasets/csv/default-0010e7a0edeaefdb/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-a54dfef17c92cdf9.arrow
Loading cached processed dataset at /home/miki/.cache/huggingface/datasets/csv/default-0010e7a0edeaefdb/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-7db0b6ffe2457e9d.arrow
Loading cached processed dataset at /home/miki/.cache/huggingface/datasets/csv/default-0010e7a0edeaefdb/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-2637fd494c64f9d0.arrow


Are all indices True
Indices shared between train & val splits (should be empty): []
Train patient IDs: 54
Val patient IDs: 13
Train + Val patient IDs: 67
Patient IDs shared between train & val splits (should be empty): []
Train patient ids: ['SOB_B_A-14-22549G' 'SOB_B_A-14-29960CD' 'SOB_B_F-14-14134'
 'SOB_B_F-14-14134E' 'SOB_B_F-14-21998CD' 'SOB_B_F-14-23060CD'
 'SOB_B_F-14-23222AB' 'SOB_B_F-14-25197' 'SOB_B_PT-14-22704'
 'SOB_B_PT-14-29315EF' 'SOB_B_TA-14-13200' 'SOB_B_TA-14-15275'
 'SOB_B_TA-14-16184' 'SOB_B_TA-14-19854C' 'SOB_B_TA-14-21978AB'
 'SOB_B_TA-14-3411F' 'SOB_M_DC-14-10926' 'SOB_M_DC-14-11031'
 'SOB_M_DC-14-11951' 'SOB_M_DC-14-13412' 'SOB_M_DC-14-14015'
 'SOB_M_DC-14-14926' 'SOB_M_DC-14-14946' 'SOB_M_DC-14-15572'
 'SOB_M_DC-14-15696' 'SOB_M_DC-14-16188' 'SOB_M_DC-14-16336'
 'SOB_M_DC-14-16716' 'SOB_M_DC-14-17614' 'SOB_M_DC-14-17915'
 'SOB_M_DC-14-18650' 'SOB_M_DC-14-2523' 'SOB_M_DC-14-2773'
 'SOB_M_DC-14-2980' 'SOB_M_DC-14-3909' 'SOB_M_DC-14-4364'
 'SOB_M_DC-14-4372' 'SOB


<PrefetchDataset element_spec=(TensorSpec(shape=(None, 3, 224, 224), dtype=tf.float32, name=None), TensorSpec(shape=(None, 2), dtype=tf.float32, name=None))>
<PrefetchDataset element_spec=(TensorSpec(shape=(None, 3, 224, 224), dtype=tf.float32, name=None), TensorSpec(shape=(None, 2), dtype=tf.float32, name=None))>
num_train_steps = 3780



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10




Fold 1/5 finished
Are all indices True
Indices shared between train & val splits (should be empty): []
Train patient IDs: 53
Val patient IDs: 14
Train + Val patient IDs: 67
Patient IDs shared between train & val splits (should be empty): []
Train patient ids: ['SOB_B_A-14-22549AB' 'SOB_B_A-14-22549G' 'SOB_B_A-14-29960CD'
 'SOB_B_F-14-14134' 'SOB_B_F-14-21998CD' 'SOB_B_F-14-23060AB'
 'SOB_B_F-14-23060CD' 'SOB_B_F-14-23222AB' 'SOB_B_F-14-25197'
 'SOB_B_F-14-29960AB' 'SOB_B_F-14-9133' 'SOB_B_PT-14-22704'
 'SOB_B_TA-14-13200' 'SOB_B_TA-14-15275' 'SOB_B_TA-14-16184'
 'SOB_B_TA-14-19854C' 'SOB_B_TA-14-3411F' 'SOB_M_DC-14-12312'
 'SOB_M_DC-14-13412' 'SOB_M_DC-14-14015' 'SOB_M_DC-14-14926'
 'SOB_M_DC-14-14946' 'SOB_M_DC-14-15792' 'SOB_M_DC-14-16188'
 'SOB_M_DC-14-16716' 'SOB_M_DC-14-16875' 'SOB_M_DC-14-17614'
 'SOB_M_DC-14-17915' 'SOB_M_DC-14-18650' 'SOB_M_DC-14-20629'
 'SOB_M_DC-14-20636' 'SOB_M_DC-14-2523' 'SOB_M_DC-14-2773'
 'SOB_M_DC-14-2980' 'SOB_M_DC-14-3909' 'SOB_M_DC-14-4372'
 'SOB_M_D

Some layers from the model checkpoint at google/vit-base-patch16-224-in21k were not used when initializing TFViTForImageClassification: ['vit/pooler/dense/kernel:0', 'vit/pooler/dense/bias:0']
- This IS expected if you are initializing TFViTForImageClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFViTForImageClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10




Fold 2/5 finished
Are all indices True
Indices shared between train & val splits (should be empty): []
Train patient IDs: 51
Val patient IDs: 16
Train + Val patient IDs: 67
Patient IDs shared between train & val splits (should be empty): []
Train patient ids: ['SOB_B_A-14-22549AB' 'SOB_B_A-14-22549G' 'SOB_B_A-14-29960CD'
 'SOB_B_F-14-14134' 'SOB_B_F-14-14134E' 'SOB_B_F-14-23060AB'
 'SOB_B_F-14-25197' 'SOB_B_F-14-29960AB' 'SOB_B_F-14-9133'
 'SOB_B_PT-14-22704' 'SOB_B_PT-14-29315EF' 'SOB_B_TA-14-13200'
 'SOB_B_TA-14-15275' 'SOB_B_TA-14-21978AB' 'SOB_B_TA-14-3411F'
 'SOB_M_DC-14-10926' 'SOB_M_DC-14-11031' 'SOB_M_DC-14-11951'
 'SOB_M_DC-14-12312' 'SOB_M_DC-14-14946' 'SOB_M_DC-14-15572'
 'SOB_M_DC-14-15696' 'SOB_M_DC-14-15792' 'SOB_M_DC-14-16188'
 'SOB_M_DC-14-16336' 'SOB_M_DC-14-16716' 'SOB_M_DC-14-16875'
 'SOB_M_DC-14-17614' 'SOB_M_DC-14-20629' 'SOB_M_DC-14-20636'
 'SOB_M_DC-14-2523' 'SOB_M_DC-14-2773' 'SOB_M_DC-14-4364'
 'SOB_M_DC-14-4372' 'SOB_M_DC-14-5695' 'SOB_M_DC-14-6241'
 'SOB_M_DC



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10




Fold 3/5 finished
Are all indices True
Indices shared between train & val splits (should be empty): []
Train patient IDs: 55
Val patient IDs: 12
Train + Val patient IDs: 67
Patient IDs shared between train & val splits (should be empty): []
Train patient ids: ['SOB_B_A-14-22549AB' 'SOB_B_A-14-22549G' 'SOB_B_A-14-29960CD'
 'SOB_B_F-14-14134' 'SOB_B_F-14-14134E' 'SOB_B_F-14-21998CD'
 'SOB_B_F-14-23060AB' 'SOB_B_F-14-23060CD' 'SOB_B_F-14-23222AB'
 'SOB_B_F-14-29960AB' 'SOB_B_F-14-9133' 'SOB_B_PT-14-29315EF'
 'SOB_B_TA-14-15275' 'SOB_B_TA-14-16184' 'SOB_B_TA-14-19854C'
 'SOB_B_TA-14-21978AB' 'SOB_M_DC-14-10926' 'SOB_M_DC-14-11031'
 'SOB_M_DC-14-11951' 'SOB_M_DC-14-12312' 'SOB_M_DC-14-13412'
 'SOB_M_DC-14-14015' 'SOB_M_DC-14-14926' 'SOB_M_DC-14-15572'
 'SOB_M_DC-14-15696' 'SOB_M_DC-14-15792' 'SOB_M_DC-14-16188'
 'SOB_M_DC-14-16336' 'SOB_M_DC-14-16875' 'SOB_M_DC-14-17915'
 'SOB_M_DC-14-18650' 'SOB_M_DC-14-20629' 'SOB_M_DC-14-20636'
 'SOB_M_DC-14-2523' 'SOB_M_DC-14-2773' 'SOB_M_DC-14-2980'
 '



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10




Fold 4/5 finished
Are all indices True
Indices shared between train & val splits (should be empty): []
Train patient IDs: 55
Val patient IDs: 12
Train + Val patient IDs: 67
Patient IDs shared between train & val splits (should be empty): []
Train patient ids: ['SOB_B_A-14-22549AB' 'SOB_B_F-14-14134E' 'SOB_B_F-14-21998CD'
 'SOB_B_F-14-23060AB' 'SOB_B_F-14-23060CD' 'SOB_B_F-14-23222AB'
 'SOB_B_F-14-25197' 'SOB_B_F-14-29960AB' 'SOB_B_F-14-9133'
 'SOB_B_PT-14-22704' 'SOB_B_PT-14-29315EF' 'SOB_B_TA-14-13200'
 'SOB_B_TA-14-16184' 'SOB_B_TA-14-19854C' 'SOB_B_TA-14-21978AB'
 'SOB_B_TA-14-3411F' 'SOB_M_DC-14-10926' 'SOB_M_DC-14-11031'
 'SOB_M_DC-14-11951' 'SOB_M_DC-14-12312' 'SOB_M_DC-14-13412'
 'SOB_M_DC-14-14015' 'SOB_M_DC-14-14926' 'SOB_M_DC-14-14946'
 'SOB_M_DC-14-15572' 'SOB_M_DC-14-15696' 'SOB_M_DC-14-15792'
 'SOB_M_DC-14-16336' 'SOB_M_DC-14-16716' 'SOB_M_DC-14-16875'
 'SOB_M_DC-14-17614' 'SOB_M_DC-14-17915' 'SOB_M_DC-14-18650'
 'SOB_M_DC-14-20629' 'SOB_M_DC-14-20636' 'SOB_M_DC-14-2980'
 



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Fold 5/5 finished


#### Dump histories

In [17]:
# for idx, model_with_history in enumerate(models_with_histories):
#     history = model_with_history.get('history', None)
#     np.save(output_path / f'train_history_{idx}.npy', history.history)

# To load:
# history = np.load(output_path / f'train_history_{idx}.npy', allow_pickle=True).item()
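
The commented-out cell above stores each `history.history` dict with `np.save`, which pickles the dict and therefore needs `allow_pickle` enabled when loading. As an alternative, here is a minimal sketch that writes the same kind of dict to JSON using only the standard library. The helper names (`history_to_jsonable`, `save_history`, `load_history`) and the example values are hypothetical and not part of the notebook; the sketch assumes every metric is a flat list of numbers.

```python
import json
import tempfile
from pathlib import Path


def history_to_jsonable(history_dict):
    """Convert metric values (possibly NumPy scalars) to plain Python floats."""
    return {name: [float(v) for v in values] for name, values in history_dict.items()}


def save_history(history_dict, path):
    """Write one fold's training history to a JSON file."""
    with open(path, "w") as f:
        json.dump(history_to_jsonable(history_dict), f, indent=4)


def load_history(path):
    """Read a training history back as a dict of lists."""
    with open(path) as f:
        return json.load(f)


# Illustrative values only; in the notebook this would be `history.history`.
example_history = {"loss": [0.5, 0.25], "val_accuracy": [0.75, 0.875]}

with tempfile.TemporaryDirectory() as tmp_dir:
    history_path = Path(tmp_dir) / "train_history_0.json"
    save_history(example_history, history_path)
    restored = load_history(history_path)

print(restored == example_history)
```

JSON keeps the files readable in any editor and avoids unpickling on load, at the cost of supporting only plain numbers, strings, lists, and dicts.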


#### Save the best model

In [21]:
import pandas as pd

csv_files = [output_path / f'train_metrics_{idx}.csv' for idx in range(n_splits)]
dataframes = [pd.read_csv(file) for file in csv_files]


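# Pick the fold whose final logged epoch has the highest validation accuracy
best_model_index = None
best_val_accuracy = 0.0

for i, df in enumerate(dataframes):
    # The last CSV row holds the metrics of the last epoch that was trained
    val_accuracy = df.iloc[-1]['val_accuracy']
    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        best_model_index = i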

print(f"Best model index: {best_model_index}, val_accuracy: {best_val_accuracy}")


best_model = models_with_histories[best_model_index].get('model', None)
best_model.save_pretrained(output_path / 'best_model')

best_model_info = {"idx": best_model_index,
                   "model_id": model_id,
                   "zoom": zoom,
                   "n_splits": n_splits,
                   "num_train_epochs": num_train_epochs,
                   "train_batch_size": train_batch_size,
                   "eval_batch_size": eval_batch_size,
                   "learning_rate": learning_rate,
                   "weight_decay_rate": weight_decay_rate,
                   "num_warmup_steps": num_warmup_steps,
                   "num_train_steps": num_train_steps_list[best_model_index]}

with open(output_path / 'best_model_info.json', 'w') as f:
    json.dump(best_model_info, f, indent=4)

print(json.dumps(best_model_info, indent=4))

Best model index: 0, val_accuracy: 0.9455782175064088
{
    "idx": 0,
    "model_id": "google/vit-base-patch16-224-in21k",
    "zoom": 400,
    "n_splits": 5,
    "num_train_epochs": 10,
    "train_batch_size": 3,
    "eval_batch_size": 3,
    "learning_rate": 3e-05,
    "weight_decay_rate": 0.01,
    "num_warmup_steps": 0,
    "num_train_steps": 3780
}
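
The selection rule used above compares the `val_accuracy` logged for the last epoch of each fold and keeps the first fold with the highest value. A minimal standard-library sketch of that rule follows. The function names and the two tiny CSV logs are made up for illustration and are not results from this notebook; the sketch assumes each log has a `val_accuracy` column and at least one row.

```python
import csv
import io


def last_val_accuracy(csv_text):
    """Return val_accuracy from the final logged epoch of one fold's CSV log."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return float(rows[-1]["val_accuracy"])


def pick_best_fold(fold_logs):
    """Return (index, val_accuracy) of the fold with the highest final val_accuracy."""
    scores = [last_val_accuracy(text) for text in fold_logs]
    # max() returns the first maximal index, matching a strict `>` comparison loop
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Made-up logs for two folds (illustrative values only)
fold_logs = [
    "epoch,val_accuracy\n0,0.80\n1,0.90\n",
    "epoch,val_accuracy\n0,0.85\n1,0.88\n",
]

best_fold = pick_best_fold(fold_logs)
print(best_fold)
```

Using the last row keeps the reported score aligned with the weights held in memory at the end of training. That holds only if training does not restore an earlier epoch's weights (for example via `EarlyStopping(restore_best_weights=True)`); whether it applies here depends on the callback configuration defined earlier in the notebook.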


In [19]:
numeric_columns = ['accuracy', 'auc', 'auc_multi', 'loss', 'precision', 'recall',
                   'val_accuracy', 'val_auc', 'val_auc_multi', 'val_loss', 'val_precision', 'val_recall']

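# Final-epoch metrics of each fold
last_rows_numeric = [df[numeric_columns].iloc[-1] for df in dataframes]
# One column per fold, one row per metric
fold_metrics = pd.concat(last_rows_numeric, axis=1)
mean_metrics = fold_metrics.mean(axis=1)
std_metrics = fold_metrics.std(axis=1)  # sample standard deviation (pandas default, ddof=1)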

metrics_dict = {
    metric_name: {
        "mean": mean_metrics[metric_name],
        "std": std_metrics[metric_name],
    }
    for metric_name in mean_metrics.index
}

with open(output_path / 'train_metrics_mean_with_std.json', 'w') as f:
    json.dump(metrics_dict, f, indent=4)

print(json.dumps(metrics_dict, indent=4))


{
    "accuracy": {
        "mean": 0.9964607238769532,
        "std": 0.006334462037977405
    },
    "auc": {
        "mean": 0.9995569825172425,
        "std": 0.0009748233614117515
    },
    "auc_multi": {
        "mean": 0.9994521260261535,
        "std": 0.0012057299075043642
    },
    "loss": {
        "mean": 0.03335511106997724,
        "std": 0.02062681463849152
    },
    "precision": {
        "mean": 0.9974926948547364,
        "std": 0.004672305560359958
    },
    "recall": {
        "mean": 0.9954108953475952,
        "std": 0.008055130940363905
    },
    "val_accuracy": {
        "mean": 0.8717103123664856,
        "std": 0.04753376068092139
    },
    "val_auc": {
        "mean": 0.9222186923027038,
        "std": 0.048153249907271344
    },
    "val_auc_multi": {
        "mean": 0.8959536790847779,
        "std": 0.06169958395199498
    },
    "val_loss": {
        "mean": 0.44469166100025176,
        "std": 0.17439586349469957
    },
    "val_precision": {
      