# Context

## Final Run

- In this notebook, we perform a final evaluation run.
- We have added a test set to evaluate the aggregation of all the different models.
- The code structure is the same as in the previous notebooks, which were organized by phylum. However, for each phylum, the selected model and hyperparameters may differ slightly.
- We still made some updates, because the code ran for more epochs and accuracy varied slightly. Some code may therefore differ slightly from what is presented in the `best_model_phylumname` notebooks.
- For **Echinodermata**, we built the model on the spot to fit the way we aggregate the models at the end. It is very simple, as there is only one class; it exists mostly to fit our pipeline.
- At the end of the notebook you can find the final model (which aggregates all the models built and run here) and the final score, which takes every phylum into account.

(All the notebooks were run on Colab, which is why the file paths are Colab-specific.)

# Imports

In [1]:
from google.colab import drive
import zipfile
drive.mount('/content/drive')

zip_path = '/content/drive/MyDrive/rare_species 1.zip'
extract_path = '/content/rare_species 1'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

Mounted at /content/drive


In [2]:
import os
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow import data as tf_data
from tensorflow.keras import layers
from tensorflow.keras.applications import MobileNetV2, ResNet50
from tensorflow.keras.layers import Rescaling, RandAugment
from sklearn.model_selection import train_test_split
from PIL import Image
from sklearn.metrics import classification_report

In [3]:


# With colab
folder_path = '/content/rare_species 1/rare_species 1'
meta = pd.read_csv('/content/rare_species 1/rare_species 1/metadata.csv')


# With vscode
# folder_path = '../data/rare_species 1'
# meta = pd.read_csv('../data/rare_species 1/metadata.csv')

# Test Set Creation and Data Splitting Strategy

To properly test whether our overall model behaves as expected, we needed to create a dedicated test set, which does not follow the classic Keras approach. The process is as follows:

- Use `train_test_split` from **scikit-learn** to create a separate test set.
- Use a custom cell to reorganize the remaining train and validation data into one folder per phylum, allowing us to use the `validation_split` option of Keras' `image_dataset_from_directory`.
- Create a separate split for each phylum to train models specifically designed for them.  
  (You can see the detailed process for each phylum in the corresponding notebooks named `best_model_PhylumName`.)
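
To illustrate what stratifying on `family` does in the first step, here is a minimal pure-Python sketch. It is a simplified stand-in for `train_test_split(..., stratify=...)`, not scikit-learn's implementation; the function name `stratified_holdout` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict

def stratified_holdout(rows, key, test_fraction=0.1, seed=42):
    """Split rows into (train_val, test), holding out about test_fraction of each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    train_val, test = [], []
    for members in groups.values():
        members = members[:]          # copy so the caller's list is left untouched
        rng.shuffle(members)
        # Simplification: always hold out at least one row per group.
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train_val.extend(members[n_test:])
    return train_val, test

# Hypothetical metadata: three families with 20 images each.
rows = [
    {"file_path": f"fam_{f}/img_{i}.jpg", "family": f"fam_{f}"}
    for f in "abc"
    for i in range(20)
]
train_val, test = stratified_holdout(rows, key="family")
```

Because every family contributes to the held-out rows and no file appears on both sides, the later evaluation sees every class without reusing training images.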


In [None]:
meta_train_val, meta_test = train_test_split(meta, test_size=0.1, random_state=42, stratify=meta['family'])

In [None]:
# Save meta_test as a CSV so the final accuracy is computed on held-out images only (no data leakage)
test_csv_path = 'test_metadata.csv'
meta_test.to_csv(test_csv_path, index=False)

from google.colab import files
files.download('test_metadata.csv')

In [None]:
# With colab
current_locations = '/content/rare_species 1/rare_species 1'

# with vscode
# current_locations = '../data/rare_species 1'

for _, row in meta_train_val.iterrows():

    phylum = row['phylum']
    file_path = row['file_path']


    file_location = os.path.join(current_locations, file_path)

    # Create a destination folder, keeping the subfolder structure

    # With colab
    target_folder = os.path.join(phylum, os.path.dirname(file_path))

    # With vscode
    # target_folder = os.path.join("../data", phylum, os.path.dirname(file_path))

    os.makedirs(target_folder, exist_ok=True)  # Make sure the folder exists

    # Final destination path
    destination = os.path.join(target_folder, os.path.basename(file_path))

    # Copy the file if it exists
    if os.path.exists(file_location):
        shutil.copy2(file_location, destination)
    else:
        print(f"Couldn't find the file: {file_location}")
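
The destination logic of the loop above can be isolated and tried on a throwaway directory. The sketch below is a hedged illustration: `copy_into_phylum_folder`, the `target_root` argument, and the fake file `mollusca_conidae/img_001.jpg` are hypothetical names introduced for the demonstration.

```python
import os
import shutil
import tempfile

def copy_into_phylum_folder(source_root, target_root, phylum, file_path):
    """Copy source_root/file_path to target_root/phylum/file_path.

    Returns the destination path, or None when the source file is missing.
    """
    source = os.path.join(source_root, file_path)
    if not os.path.exists(source):
        return None
    target_folder = os.path.join(target_root, phylum, os.path.dirname(file_path))
    os.makedirs(target_folder, exist_ok=True)
    destination = os.path.join(target_folder, os.path.basename(file_path))
    shutil.copy2(source, destination)
    return destination

# Demonstration on a temporary directory containing one fake image file.
workdir = tempfile.mkdtemp()
source_root = os.path.join(workdir, "rare_species")
os.makedirs(os.path.join(source_root, "mollusca_conidae"))
with open(os.path.join(source_root, "mollusca_conidae", "img_001.jpg"), "wb") as handle:
    handle.write(b"not a real image")

copied = copy_into_phylum_folder(source_root, workdir, "mollusca", "mollusca_conidae/img_001.jpg")
missing = copy_into_phylum_folder(source_root, workdir, "mollusca", "mollusca_conidae/absent.jpg")
relative = os.path.relpath(copied, workdir)
```

The resulting layout, one folder per phylum containing one subfolder per class, is the directory structure that `image_dataset_from_directory` reads labels from.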

## Final Train, Val, Test, Split

In [None]:
# with colab
path_phylum_athropoda = "/content/arthropoda"
path_phylum_chordata = "/content/chordata"
path_phylum_cnidaria = "/content/cnidaria"
path_phylum_mollusca = "/content/mollusca"
path_phylum_echinodermata = "/content/echinodermata"

# with vscode
# path_phylum_athropoda = "../data/arthropoda"
# path_phylum_chordata = "../data/chordata"
# path_phylum_cnidaria = "../data/cnidaria"
# path_phylum_mollusca = "../data/mollusca"
# path_phylum_echinodermata = "../data/echinodermata"

image_size = (224, 224)
seed = 42
batch_size = 32

train_ds_arthropoda, val_arthropoda= keras.utils.image_dataset_from_directory(
    path_phylum_athropoda,
    validation_split=0.2,
    subset= "both",
    seed= seed,
    image_size= image_size,
    batch_size= batch_size
)

train_ds_chordata, val_chordata= keras.utils.image_dataset_from_directory(
    path_phylum_chordata,
    validation_split=0.2,
    subset= "both",
    seed= seed,
    image_size= image_size,
    batch_size= batch_size
)

train_ds_cnidaria, val_cnidaria= keras.utils.image_dataset_from_directory(
    path_phylum_cnidaria,
    validation_split=0.2,
    subset= "both",
    seed= seed,
    image_size= image_size,
    batch_size= batch_size
)

train_ds_mollusca, val_mollusca= keras.utils.image_dataset_from_directory(
    path_phylum_mollusca,
    validation_split=0.2,
    subset= "both",
    seed= seed,
    image_size= image_size,
    batch_size= batch_size
)

train_ds_echinodermata, val_echinodermata = keras.utils.image_dataset_from_directory(
    path_phylum_echinodermata,
    validation_split=0.2,
    subset= "both",
    seed= seed,
    image_size= image_size,
    batch_size= batch_size
)



Found 856 files belonging to 17 classes.
Using 685 files for training.
Using 171 files for validation.
Found 8956 files belonging to 166 classes.
Using 7165 files for training.
Using 1791 files for validation.
Found 729 files belonging to 13 classes.
Using 584 files for training.
Using 145 files for validation.
Found 189 files belonging to 5 classes.
Using 152 files for training.
Using 37 files for validation.
Found 54 files belonging to 1 classes.
Using 44 files for training.
Using 10 files for validation.
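
The counts printed above can be cross-checked with a little arithmetic. This sketch assumes that the validation size is the truncated product `int(validation_split * n_files)` and that scikit-learn rounds the test size up; both are assumptions about library behaviour that happen to reproduce the numbers reported in this notebook.

```python
import math

validation_split = 0.2
files_per_phylum = {
    "arthropoda": 856,
    "chordata": 8956,
    "cnidaria": 729,
    "mollusca": 189,
    "echinodermata": 54,
}

# (training files, validation files) per phylum
split_sizes = {}
for phylum, n_files in files_per_phylum.items():
    n_val = int(validation_split * n_files)  # assumed: fractional part is truncated
    split_sizes[phylum] = (n_files - n_val, n_val)

n_train_val = sum(files_per_phylum.values())
n_total = n_train_val + 1199                  # 1199 test rows, as reported at the end of the notebook
n_test_expected = math.ceil(0.1 * n_total)    # assumed: test size is rounded up
```

The smallest splits stand out immediately: Mollusca is validated on 37 images and Echinodermata on 10, so their validation accuracies are much noisier than the Chordata one.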


# Data Augmentation

In this section we use the augmentation layers that were built individually for each phylum.

In [None]:
# Note: this definition is overridden by the one at the end of this cell
data_augmentation_arthropoda = keras.Sequential([
    layers.RandAugment(value_range=(0, 255))
])

data_augmentation_chordata= keras.Sequential([
    layers.RandAugment(value_range=(0, 255), num_ops=2),
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.2), # rotate by up to ±20% of a full turn (±72°)
    layers.RandomZoom(0.2), # zoom in or out by up to 20%
    layers.RandomContrast(0.2, value_range=(0, 255)), # change contrast by up to 20%
    layers.RandomBrightness(0.2, (0, 255)), # change brightness by up to 20%
    layers.GaussianNoise(0.1),

])

data_augmentation_cnidaria = keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),   # Rotate images randomly up to 20%
    layers.RandomZoom(0.2),        # Zoom in/out randomly up to 20%
    layers.RandomContrast(0.2)     # Change contrast randomly up to 20%
])



data_augmentation_mollusca = keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),   # Rotate images randomly up to 20%
    layers.RandomZoom(0.2),        # Zoom in/out randomly up to 20%
    layers.RandomContrast(0.2)     # Change contrast randomly up to 20%
])


data_augmentation_arthropoda = keras.Sequential([
    layers.RandAugment(value_range=(0, 255), num_ops=2)
])


# Models

(For more detail on the individual models, refer to the corresponding `best_model_PhylumName` notebooks.)

## Build the models

**Arthropoda**


In [None]:
def make_model_athropoda(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    x = data_augmentation_arthropoda(inputs)
    x = Rescaling(1./255)(x)

    base_model = MobileNetV2(include_top=False, input_tensor=x, weights="imagenet")
    base_model.trainable = False  # Freeze for transfer learning

    x = base_model.output
    x = layers.Flatten()(x)
    x = layers.Dropout(0.1)(x)

    outputs = layers.Dense(num_classes, activation="softmax")(x)

    return keras.Model(inputs, outputs)


**Chordata**


In [None]:
def make_model_chordata(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    x = data_augmentation_chordata(inputs)
    x = Rescaling(1./255)(x)

    base_model = MobileNetV2(include_top=False, input_tensor=x, weights="imagenet")
    base_model.trainable = False  # Freeze for transfer learning

    x = base_model.output
    x = layers.GlobalAveragePooling2D()(x)  # pool instead of flattening, to limit overfitting
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)

    outputs = layers.Dense(num_classes, activation="softmax", kernel_regularizer=keras.regularizers.l2(0.001))(x)  # L2 penalty, to limit overfitting

    model = keras.Model(inputs, outputs)
    model.base_model = base_model  # keep a reference to the base model so it can be unfrozen when fine-tuning

    return model
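
To give a sense of why pooling helps against overfitting, the sketch below compares the size of the final `Dense` layer for a `Flatten` head and a `GlobalAveragePooling2D` head. It counts the `Dense` layer only (the `BatchNormalization` parameters are ignored) and relies on MobileNetV2 without its top producing a 7 x 7 x 1280 feature map for 224 x 224 inputs; `dense_params` is a hypothetical helper.

```python
# MobileNetV2 without its top, on 224x224 inputs, outputs a 7 x 7 x 1280 feature map.
height, width, channels = 7, 7, 1280

def dense_params(n_inputs, n_outputs):
    """Number of weights plus biases in a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

flatten_features = height * width * channels   # Flatten keeps every spatial position
pooled_features = channels                      # global average pooling keeps one value per channel

arthropoda_head = dense_params(flatten_features, 17)   # Flatten head, 17 classes
chordata_head = dense_params(pooled_features, 166)     # pooled head, 166 classes
```

Even with almost ten times more classes, the pooled Chordata head ends up with roughly a fifth of the parameters of the flattened Arthropoda head.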

**Cnidaria**


In [None]:
def make_model_cnidaria(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    x = data_augmentation_cnidaria(inputs)
    x = Rescaling(1./255)(x)

    base_model = MobileNetV2(include_top=False, input_tensor=x, weights="imagenet")
    base_model.trainable = False  # Freeze for transfer learning

    x = base_model.output
    x = layers.Flatten()(x)
    x = layers.Dropout(0.1)(x)

    outputs = layers.Dense(num_classes, activation="softmax")(x)

    return keras.Model(inputs, outputs)


**Mollusca**

In [None]:
def make_model_mollusca(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)

    x = data_augmentation_mollusca(inputs)
    x = Rescaling(1./255)(x)

    base_model = MobileNetV2(include_top=False, input_tensor=x, weights="imagenet")
    base_model.trainable = False # Freeze for transfer learning

    x = base_model.output
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)

    outputs = layers.Dense(num_classes, activation="softmax")(x)

    return keras.Model(inputs, outputs)

**Echinodermata**

In [None]:
def make_model_echinodermata(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)
    x = layers.Flatten()(inputs)
    outputs = layers.Dense(num_classes, activation="sigmoid")(x)

    return keras.Model(inputs, outputs)


## Models Run

We evaluated each model using the accuracy printed during training. As stated in some of the `best_model_phylumname` notebooks, we acknowledge that some models were overfitting, but we preferred to keep the ones with the best validation accuracy.
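
The "keep the best validation accuracy" rule used by the checkpoints in the cells below can be sketched in plain Python. This is a simplified illustration of a save-best-only rule with `mode="max"`, not the Keras implementation; `epochs_that_save` is a hypothetical helper, and the accuracy values are copied from the training logs in this notebook.

```python
def epochs_that_save(val_acc_history, best=float("-inf")):
    """Return (epochs that trigger a save, best value so far) for a 'save best only' rule."""
    saved = []
    for epoch, val_acc in enumerate(val_acc_history, start=1):
        if val_acc > best:          # strictly better than anything seen so far
            best = val_acc
            saved.append(epoch)
    return saved, best

# First three validation accuracies of the Cnidaria run.
saved_epochs, best_val_acc = epochs_that_save([0.51034, 0.49660, 0.57931])

# First three fine-tuning epochs of the Chordata run: the same callback is reused,
# so it still remembers the best value of the first run (0.46510).
fine_tune_saved, fine_tune_best = epochs_that_save([0.4506, 0.4461, 0.4645], best=0.46510)
```

The second call shows why the fine-tuning log reports "did not improve" from its very first epoch: the threshold carried over from the first run has to be beaten before a new checkpoint is written.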

### Chordata

#### First run

In [None]:
model_chordata = make_model_chordata(input_shape=image_size + (3,), num_classes=166)
epochs = 20

callbacks = [
    # Save the best model of the run, using the maximum val_acc as the criterion
    keras.callbacks.ModelCheckpoint(
        "best_model_chordata.keras",
        save_best_only=True,
        monitor="val_acc",
        mode="max",
        verbose=1)
    ]
model_chordata.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    # The Sparse variants replace the Categorical ones because the datasets yield integer class labels
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

model_chordata.fit(
    train_ds_chordata,
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_chordata,
)



Epoch 1/20
224/224 ━━━━━━━━━━━━━━━━━━━━ 0s 113ms/step - acc: 0.0806 - loss: 5.3462
Epoch 1: val_acc improved from -inf to 0.33780, saving model to best_model_chordata.keras
224/224 ━━━━━━━━━━━━━━━━━━━━ 40s 151ms/step - acc: 0.0809 - loss: 5.3434 - val_acc: 0.3378 - val_loss: 3.1999
Epoch 2/20
224/224 ━━━━━━━━━━━━━━━━━━━━ 0s 114ms/step - acc: 0.2694 - loss: 3.5944
Epoch 2: val_acc improved from 0.33780 to 0.40480, saving model to best_model_chordata.keras
224/224 ━━━━━━━━━━━━━━━━━━━━ 32s 143ms/step - acc: 0.2695 - loss: 3.5941 - val_acc: 0.4048 - val_loss: 2.8203
Epoch 3/20
224/224 ━━━━━━━━━━━━━━━━━━━━ 0s 115ms/step - acc: 0.3295 - loss: 3.1813
Epoch 3: val_acc improved from 0.40480 to 0.42769, saving model to best_model_chordata.keras
224/224 ━━━━━━━━━━━━━━━━━━━━ 33s 145ms/step - acc:

<keras.src.callbacks.history.History at 0x7e05b4428b90>

#### Fine-tuning

In [None]:
fine_tune_epochs = 30

# We rebuild the model, but this time we allow some layers of the base model to be updated
# We load the weights of the best result from the first training run
fine_tune_model = make_model_chordata(input_shape=image_size + (3,), num_classes=166)
fine_tune_model.load_weights("best_model_chordata.keras")

# Unfreeze only the last 40 layers of the pretrained base model
fine_tune_model.base_model.trainable = True
for layer in fine_tune_model.base_model.layers[:-40]:
    layer.trainable = False


fine_tune_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4), # lower learning rate; we tried even lower values, but accuracy did not improve over the first run
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")]
)

fine_tune_model.fit(
    train_ds_chordata,
    epochs=fine_tune_epochs,
    validation_data=val_chordata,
    callbacks=callbacks
)



Epoch 1/30
224/224 ━━━━━━━━━━━━━━━━━━━━ 0s 116ms/step - acc: 0.5086 - loss: 2.3321
Epoch 1: val_acc did not improve from 0.46510
224/224 ━━━━━━━━━━━━━━━━━━━━ 46s 151ms/step - acc: 0.5088 - loss: 2.3315 - val_acc: 0.4506 - val_loss: 2.7706
Epoch 2/30
224/224 ━━━━━━━━━━━━━━━━━━━━ 0s 116ms/step - acc: 0.6024 - loss: 1.8457
Epoch 2: val_acc did not improve from 0.46510
224/224 ━━━━━━━━━━━━━━━━━━━━ 32s 143ms/step - acc: 0.6025 - loss: 1.8456 - val_acc: 0.4461 - val_loss: 2.8828
Epoch 3/30
224/224 ━━━━━━━━━━━━━━━━━━━━ 0s 117ms/step - acc: 0.6487 - loss: 1.6577
Epoch 3: val_acc did not improve from 0.46510
224/224 ━━━━━━━━━━━━━━━━━━━━ 32s 144ms/step - acc: 0.6487 - loss: 1.6576 - val_acc: 0.4645 - val_loss: 2.7636
Epoch 4/30
224/224 ━━━━━━━━━━━━━━━━━━━━ 0s 11

<keras.src.callbacks.history.History at 0x7e05953ee410>
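
The partial unfreezing used for fine-tuning relies on negative slicing, which can be checked without loading any model. In this sketch `DummyLayer` is a hypothetical stand-in for a Keras layer, and the count of 100 layers is arbitrary rather than the real depth of MobileNetV2.

```python
class DummyLayer:
    """Hypothetical stand-in for a Keras layer: just a name and a trainable flag."""
    def __init__(self, name):
        self.name = name
        self.trainable = False

def unfreeze_last(layers, n_last):
    """Make only the last n_last layers trainable, mirroring the layers[:-n] pattern."""
    for layer in layers:
        layer.trainable = True
    for layer in layers[:-n_last]:   # everything except the last n_last layers
        layer.trainable = False
    return [layer.name for layer in layers if layer.trainable]

base_layers = [DummyLayer(f"layer_{i}") for i in range(100)]  # arbitrary layer count
trainable_names = unfreeze_last(base_layers, 40)
```

One caveat of this pattern: with `n_last = 0` the slice `layers[:-0]` is empty, so nothing is frozen and every layer stays trainable.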

### Cnidaria

In [None]:
model_cnidaria = make_model_cnidaria(input_shape=image_size + (3,), num_classes=13)
epochs = 100

callbacks = [
    # Save the best model of the run, using the maximum val_acc as the criterion
    keras.callbacks.ModelCheckpoint(
        "best_model_cnidaria.keras",
        save_best_only=True,
        monitor="val_acc",
        mode="max",
        verbose=1)
    ]

# The change from the Keras example is the loss function, as we deal with many classes.
# The Sparse variants replace the Categorical ones because the datasets yield integer class labels.
model_cnidaria.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

model_cnidaria.fit(
    train_ds_cnidaria,
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_cnidaria,
)



Epoch 1/100
18/19 ━━━━━━━━━━━━━━━━━━━━ 0s 141ms/step - acc: 0.2593 - loss: 10.1744
Epoch 1: val_acc improved from -inf to 0.51034, saving model to best_model_cnidaria.keras
19/19 ━━━━━━━━━━━━━━━━━━━━ 10s 288ms/step - acc: 0.2681 - loss: 10.0494 - val_acc: 0.5103 - val_loss: 5.5579
Epoch 2/100
18/19 ━━━━━━━━━━━━━━━━━━━━ 0s 132ms/step - acc: 0.6439 - loss: 3.7588
Epoch 2: val_acc did not improve from 0.51034
19/19 ━━━━━━━━━━━━━━━━━━━━ 3s 173ms/step - acc: 0.6420 - loss: 3.7713 - val_acc: 0.4966 - val_loss: 5.3951
Epoch 3/100
18/19 ━━━━━━━━━━━━━━━━━━━━ 0s 128ms/step - acc: 0.6474 - loss: 3.1135
Epoch 3: val_acc improved from 0.51034 to 0.57931, saving model to best_model_cnidaria.keras
19/19 ━━━━━━━━━━━━━━━━━━━━ 4s 193ms/step - acc: 0.6491 - loss: 3.1001 - val_acc: 0.5793 - val_loss: 4.82

<keras.src.callbacks.history.History at 0x7e054c4db790>

### Mollusca

In [None]:
class_names = train_ds_mollusca.class_names
print("Mollusca families (classes):", class_names)

Mollusca families (classes): ['mollusca_cardiidae', 'mollusca_conidae', 'mollusca_haliotidae', 'mollusca_unionidae', 'mollusca_zonitidae']


In [None]:
model_mollusca = make_model_mollusca(input_shape=image_size + (3,), num_classes=len(train_ds_mollusca.class_names))
epochs = 100

# Callback to save the best model based on validation accuracy
callbacks = [
    keras.callbacks.ModelCheckpoint(
        "best_model_mollusca.keras",  # Updated file name
        save_best_only=True,
        monitor="val_acc",            # Metric matches the one in compile
        mode="max",
        verbose=1
    )
]

# Compile the model
model_mollusca.compile(
    optimizer=keras.optimizers.Adam(learning_rate=3e-4),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")]
)

# Train the model
history = model_mollusca.fit(
    train_ds_mollusca,
    validation_data=val_mollusca,
    epochs=epochs,
    callbacks=callbacks
)



Epoch 1/100
4/5 ━━━━━━━━━━━━━━━━━━━━ 0s 65ms/step - acc: 0.1595 - loss: 2.3208
Epoch 1: val_acc improved from -inf to 0.10811, saving model to best_model_mollusca.keras
5/5 ━━━━━━━━━━━━━━━━━━━━ 8s 652ms/step - acc: 0.1612 - loss: 2.3120 - val_acc: 0.1081 - val_loss: 2.1482
Epoch 2/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 144ms/step - acc: 0.1692 - loss: 2.0972
Epoch 2: val_acc improved from 0.10811 to 0.18919, saving model to best_model_mollusca.keras
5/5 ━━━━━━━━━━━━━━━━━━━━ 2s 361ms/step - acc: 0.1761 - loss: 2.0857 - val_acc: 0.1892 - val_loss: 1.9907
Epoch 3/100
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 177ms/step - acc: 0.2484 - loss: 1.8519
Epoch 3: val_acc did not improve from 0.18919
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 299ms/step - acc: 0.2465 - loss: 1.8605 - val_acc: 0.1892 - val_loss: 1.8485
Epoch 4/100


### Arthropoda

In [None]:
len(train_ds_arthropoda.class_names)

17

In [None]:
model_arthropoda = make_model_athropoda(input_shape=image_size + (3,), num_classes=len(train_ds_arthropoda.class_names))
epochs = 100


callbacks = [
    # Save the best model of the run, using the maximum val_acc as the criterion
    keras.callbacks.ModelCheckpoint(
        "best_model_arthropoda.keras",
        save_best_only=True,
        monitor="val_acc",
        mode="max",
        verbose=1)
    ]

# The change from the Keras example is the loss function, as we deal with many classes.
# The Sparse variants replace the Categorical ones because the datasets yield integer class labels.
model_arthropoda.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

model_arthropoda.fit(
    train_ds_arthropoda,
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_arthropoda,
)





Epoch 1/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 66ms/step - acc: 0.3579 - loss: 12.3118
Epoch 1: val_acc improved from -inf to 0.69591, saving model to best_model_arthropoda.keras
22/22 ━━━━━━━━━━━━━━━━━━━━ 9s 181ms/step - acc: 0.3635 - loss: 12.1926 - val_acc: 0.6959 - val_loss: 4.0399
Epoch 2/100
21/22 ━━━━━━━━━━━━━━━━━━━━ 0s 75ms/step - acc: 0.7543 - loss: 3.2345
Epoch 2: val_acc improved from 0.69591 to 0.74854, saving model to best_model_arthropoda.keras
22/22 ━━━━━━━━━━━━━━━━━━━━ 3s 119ms/step - acc: 0.7541 - loss: 3.2256 - val_acc: 0.7485 - val_loss: 3.7018
Epoch 3/100
21/22 ━━━━━━━━━━━━━━━━━━━━ 0s 73ms/step - acc: 0.7704 - loss: 3.1010
Epoch 3: val_acc did not improve from 0.74854
22/22 ━━━━━━━━━━━━━━━━━━━━ 2s 98ms/step - acc: 0.7698 - loss: 3.1098 - val_acc: 0.7310 - val_loss: 3.225

<keras.src.callbacks.history.History at 0x7e057c380950>

### Echinodermata

In [None]:
model_echinodermata = make_model_echinodermata(input_shape=image_size + (3,), num_classes= 1)
epochs = 2


callbacks = [
    # Save the best model of the run, using the maximum val_acc as the criterion
    keras.callbacks.ModelCheckpoint(
        "best_model_echinodermata.keras",
        save_best_only=True,
        monitor="val_acc",
        mode="max",
        verbose=1)
    ]

# Same loss and metric as for the other phyla, to keep the pipeline uniform
# (there is only one class here, so the loss is trivially zero)
model_echinodermata.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.1),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

model_echinodermata.fit(
    train_ds_echinodermata,
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_echinodermata,
)



Epoch 1/2
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 269ms/step - acc: 1.0000 - loss: 0.0000e+00
Epoch 1: val_acc improved from -inf to 1.00000, saving model to best_model_echinodermata.keras
2/2 ━━━━━━━━━━━━━━━━━━━━ 2s 617ms/step - acc: 1.0000 - loss: 0.0000e+00 - val_acc: 1.0000 - val_loss: 0.0000e+00
Epoch 2/2
1/2 ━━━━━━━━━━━━━━━━━━━━ 0s 251ms/step - acc: 1.0000 - loss: 0.0000e+00
Epoch 2: val_acc did not improve from 1.00000
2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 92ms/step - acc: 1.0000 - loss: 0.0000e+00 - val_acc: 1.0000 - val_loss: 0.0000e+00


<keras.src.callbacks.history.History at 0x7e0596b17e50>

# Model combination
In the code below, we aggregate the performance of the per-phylum models trained above, loaded from their saved `best_model_<phylum>.keras` checkpoints.

1. Preprocessing Function:  
   We first define a preprocessing function to slightly modify the input image before prediction. This includes operations such as resizing to match the input shape expected by the models.

2. Model Execution:  
   The code then iterates over each row of the test dataset:
   - It checks the **phylum** of the current row and loads the corresponding model (only if it hasn’t already been loaded to avoid redundant operations).
   - It retrieves and sorts the list of **families** belonging to the current phylum to correctly map the predicted index to the actual family name.

3. Prediction and Evaluation:  
   - The model makes a prediction, which returns an index.
   - This index is mapped back to a family name.
   - If the predicted family matches the actual family, the number of correct predictions is incremented by one.
   - Regardless of correctness, the total prediction count is incremented.
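
The index-to-family mapping and the accuracy count described above can be sketched without any trained model. In this hedged example the rows, the family names, and the stub predictor (which always answers index 0) are hypothetical; `index_to_family` and `accuracy` are illustrative helpers, not functions from the notebook.

```python
def index_to_family(families_in_phylum, predicted_index):
    """Map a predicted class index to a family name, assuming alphabetically sorted classes."""
    return sorted(families_in_phylum)[predicted_index]

def accuracy(rows, predict_index):
    """rows: dicts with 'phylum', 'family', 'file_path'; predict_index: callable returning an index."""
    # Collect the families observed for each phylum.
    families = {}
    for row in rows:
        families.setdefault(row["phylum"], set()).add(row["family"])

    correct = 0
    for row in rows:
        predicted = index_to_family(families[row["phylum"]], predict_index(row))
        correct += predicted == row["family"]
    return correct / len(rows)

# Hypothetical test rows and a stub "model" that always predicts index 0.
rows = [
    {"phylum": "mollusca", "family": "conidae", "file_path": "a.jpg"},
    {"phylum": "mollusca", "family": "cardiidae", "file_path": "b.jpg"},
    {"phylum": "cnidaria", "family": "acroporidae", "file_path": "c.jpg"},
    {"phylum": "cnidaria", "family": "agariciidae", "file_path": "d.jpg"},
]
score = accuracy(rows, predict_index=lambda row: 0)
```

This mapping is only correct when every family seen during training also appears in the test rows; a family missing from the test set would shift the sorted indices and silently mislabel predictions.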






In [None]:
test_data = pd.read_csv("/content/test_metadata.csv")
rare_species_folder = '/content/rare_species 1/rare_species 1'

In [15]:
def preprocess_image(image_path):
    img = Image.open(image_path)
    img = img.resize((224, 224))
    img_array = np.array(img)

    # Handle grayscale or RGBA images
    if len(img_array.shape) == 2:
        img_array = np.stack((img_array,) * 3, axis=-1)
    if len(img_array.shape) > 2 and img_array.shape[2] == 4:
        img_array = img_array[:, :, :3]

    return img_array

In [16]:
correct = 0
total = 0

# Path of the saved model for each phylum
model_paths = {
    'arthropoda': '/content/best_model_arthropoda.keras',
    'chordata': '/content/best_model_chordata.keras',
    'mollusca': '/content/best_model_mollusca.keras',
    'cnidaria': '/content/best_model_cnidaria.keras',
    'echinodermata': '/content/best_model_echinodermata.keras',
}
loaded_models = {}       # phylum -> model, so each model is loaded only once
families_by_phylum = {}  # phylum -> sorted list of family names

for _, row in test_data.iterrows():
    phylum = row['phylum']
    family = row['family']

    if phylum not in loaded_models:
        loaded_models[phylum] = keras.models.load_model(model_paths[phylum])
        # Take all possible families of the phylum, sorted to match the class order of the model.
        # This works because we stratified on family, so every family is expected to appear in the test set.
        families_by_phylum[phylum] = sorted(test_data[test_data['phylum'] == phylum]['family'].unique())

    model = loaded_models[phylum]
    families = families_by_phylum[phylum]

    # Process the image and predict
    image_path = os.path.join(rare_species_folder, row['file_path'])
    img_array = preprocess_image(image_path)
    img_batch = np.expand_dims(img_array, axis=0)

    prediction = model.predict(img_batch, verbose=0)
    predicted_idx = np.argmax(prediction[0])
    predicted_family = families[predicted_idx]  # map the index to the corresponding family in the test metadata

    if predicted_family == family:
        correct += 1
    total += 1


# Calculate final accuracy
accuracy = correct / total
print(f"Overall accuracy: {accuracy}")
print(f"Correct: {correct}")
print(f"Total: {total}")

Overall accuracy: 0.6346955796497081
Correct: 761
Total: 1199
