<a href="https://colab.research.google.com/github/ilsilfverskiold/smaller-models-docs/blob/main/computer-vision/cook/image-classification/fine-tune/ConvNeXT_torch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Image classification with ConvNeXt using a Hugging Face dataset.**

---



The model is set to facebook/convnext-large-224. The much smaller facebook/convnext-tiny-224 is a drop-in alternative that trains faster and may be accurate enough, depending on your data.

The batch size is 4 and the model trains for 5 epochs.

**Make sure you change the dataset to what you need.** The dataset used here has both a training and a validation split, so adjust the code accordingly if yours has no validation split.

In [None]:
!pip install -q transformers datasets


In [None]:
dataset_url = "ilsilfverskiold/traffic-camera-norway-images" # public dataset (possible to import private too)
model = "facebook/convnext-large-224" # decide on your model
learning_rate = 5e-5
epochs = 5

Import the dataset from Hugging Face below.

In [None]:
from datasets import load_dataset

dataset = load_dataset(dataset_url)

dataset

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 6103
    })
    validation: Dataset({
        features: ['image', 'label'],
        num_rows: 679
    })
})

Check the features and get the labels. Make sure the images are in PIL format.

In [None]:
dataset["train"].features

In [None]:
dataset["train"][0]

In [None]:
labels = dataset["train"].features["label"].names
print(labels)

['high-traffic', 'low-traffic', 'medium-traffic', 'no-traffic']


In [None]:
id2label = {k:v for k,v in enumerate(labels)}
label2id = {v:k for k,v in enumerate(labels)}
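
As a quick sanity check, the two mappings should be exact inverses of each other. A minimal standalone sketch, with the label list copied from the output above:

```python
# Label names as printed above for this dataset.
labels = ['high-traffic', 'low-traffic', 'medium-traffic', 'no-traffic']

# Same mappings as in the cell above, built from the list.
id2label = dict(enumerate(labels))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[0])
print(label2id['no-traffic'])

# Round trip: id -> label -> id must give back the same id.
round_trip_ok = all(label2id[id2label[i]] == i for i in id2label)
print(round_trip_ok)
```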

To preprocess the dataset for fine-tuning a ViT, ConvNeXt or Swin Transformer model, we use an image processor. It holds the settings the pre-trained model expects (input image size and the mean and standard deviation used to normalize pixel values), so every input image can be brought into that form.

In [None]:
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(model)

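The code below defines the image transformations applied to the data: a random resized crop and a random horizontal flip augment the dataset to improve model robustness, followed by conversion to a tensor and normalization. Because with_transform is called on the whole DatasetDict further down, the same random augmentations are also applied to the validation split.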

In [None]:
from torchvision.transforms import (
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    ToTensor,
)

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)

transform = Compose(
    [
     RandomResizedCrop(image_processor.size["shortest_edge"]),
     RandomHorizontalFlip(),
     ToTensor(),
     normalize
    ]
)

def train_transforms(examples):
  examples["pixel_values"] = [transform(image.convert("RGB")) for image in examples["image"]]

  return examples

In [None]:
processed_dataset = dataset.with_transform(train_transforms)

The purpose of the collate_fn function below is to control how a list of samples (gathered from the dataset) is merged into a single batch. This function is crucial for ensuring that batches are structured properly before being fed into a model during training or evaluation.

In [None]:
import torch  # needed by collate_fn below
from torch.utils.data import DataLoader

def collate_fn(examples):
  # Stack the per-image tensors into one (batch, channels, height, width) tensor
  pixel_values = torch.stack([example["pixel_values"] for example in examples])
  labels = torch.tensor([example["label"] for example in examples])

  return {"pixel_values": pixel_values, "labels": labels}

dataloader = DataLoader(processed_dataset["train"], collate_fn=collate_fn, batch_size=4, shuffle=True)
# Shuffling is not needed for evaluation; the metrics do not depend on the order
dataloader_validation = DataLoader(processed_dataset["validation"], collate_fn=collate_fn, batch_size=4, shuffle=False)

In [None]:
import torch

batch = next(iter(dataloader))
for k,v in batch.items():
  print(k,v.shape)

pixel_values torch.Size([4, 3, 224, 224])
labels torch.Size([4])


When loading the pre-trained model below, we pass in the label mappings we built from the dataset. We also tell it to ignore the size mismatch with the classification head it was originally trained with (1000 classes), so that a new head for our 4 labels is initialized instead.

In [None]:
from transformers import AutoModelForImageClassification

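# Note: this rebinds `model` from the checkpoint name (a string) to the loaded model object
model = AutoModelForImageClassification.from_pretrained(model,
                                                        id2label=id2label,
                                                        label2id=label2id,
                                                        ignore_mismatched_sizes=True) # replace the pre-trained 1000-class head with a new one for our labels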



Some weights of ConvNextForImageClassification were not initialized from the model checkpoint at facebook/convnext-large-224 and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([1000, 1536]) in the checkpoint and torch.Size([4, 1536]) in the model instantiated
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([4]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Train the model and keep an eye on the training loss and accuracy: the loss should trend down while accuracy trends up. Note that the loss printed for each epoch is the loss of the last batch only, so it is noisy and can jump between epochs. If the metrics fluctuate, check how the model performs on the validation set. I also track precision, recall and F1, but accuracy is usually the one I look at first.

In [None]:
from tqdm.notebook import tqdm
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

for epoch in range(epochs):
    print("Epoch:", epoch)
    model.train()
    training_predictions = []
    training_labels = []

    for batch in tqdm(dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()

        outputs = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
        loss, logits = outputs.loss, outputs.logits
        loss.backward()
        optimizer.step()

        preds = logits.argmax(-1)
        training_predictions.extend(preds.cpu().numpy())
        training_labels.extend(batch["labels"].cpu().numpy())

    # Calculate training metrics (you can remove some of these if you only need accuracy or so)
    train_accuracy = accuracy_score(training_labels, training_predictions)
    train_precision, train_recall, train_f1, _ = precision_recall_fscore_support(training_labels, training_predictions, average='weighted')

    print(f"\nTraining Loss: {loss.item()}")
    print(f"Training Accuracy: {train_accuracy}")
    print(f"Training Precision: {train_precision}")
    print(f"Training Recall: {train_recall}")
    print(f"Training F1 Score: {train_f1}")

    # Evaluate on validation set
    model.eval()
    with torch.no_grad():
        validation_predictions = []
        validation_labels = []

        for batch in tqdm(dataloader_validation):
            batch = {k: v.to(device) for k, v in batch.items()}

            outputs = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
            logits = outputs.logits

            preds = logits.argmax(-1)
            validation_predictions.extend(preds.cpu().numpy())
            validation_labels.extend(batch["labels"].cpu().numpy())

        val_accuracy = accuracy_score(validation_labels, validation_predictions)
        val_precision, val_recall, val_f1, _ = precision_recall_fscore_support(validation_labels, validation_predictions, average='weighted')

        print(f"\nValidation Accuracy: {val_accuracy}")
        print(f"Validation Precision: {val_precision}")
        print(f"Validation Recall: {val_recall}")
        print(f"Validation F1 Score: {val_f1}\n")

Epoch: 0



Training Loss: 0.6161817908287048
Training Accuracy: 0.7240701294445355
Training Precision: 0.7082519923524156
Training Recall: 0.7240701294445355
Training F1 Score: 0.6983478754413465



Validation Accuracy: 0.7628865979381443
Validation Precision: 0.7479281129220486
Validation Recall: 0.7628865979381443
Validation F1 Score: 0.7367062752046891

Epoch: 1



Training Loss: 0.05508580803871155
Training Accuracy: 0.7799442896935933
Training Precision: 0.7700425151141358
Training Recall: 0.7799442896935933
Training F1 Score: 0.7709631054156653



Validation Accuracy: 0.7731958762886598
Validation Precision: 0.7751424805183921
Validation Recall: 0.7731958762886598
Validation F1 Score: 0.7692141021279978

Epoch: 2



Training Loss: 0.12858520448207855
Training Accuracy: 0.8104211043748976
Training Precision: 0.8036363070424184
Training Recall: 0.8104211043748976
Training F1 Score: 0.8041516926284794



Validation Accuracy: 0.7893961708394698
Validation Precision: 0.7948342911047581
Validation Recall: 0.7893961708394698
Validation F1 Score: 0.7817927698191313

Epoch: 3



Training Loss: 0.09071839600801468
Training Accuracy: 0.8254956578731771
Training Precision: 0.8204496223923569
Training Recall: 0.8254956578731771
Training F1 Score: 0.8206154331301334



Validation Accuracy: 0.7908689248895434
Validation Precision: 0.8088365811778401
Validation Recall: 0.7908689248895434
Validation F1 Score: 0.7906677239312013

Epoch: 4



Training Loss: 0.0933392122387886
Training Accuracy: 0.844174995903654
Training Precision: 0.8402213715977359
Training Recall: 0.844174995903654
Training F1 Score: 0.8398491970210088



Validation Accuracy: 0.812960235640648
Validation Precision: 0.8056204796606733
Validation Recall: 0.812960235640648
Validation F1 Score: 0.8052701066513138
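
The training loss printed above comes from the last batch of each epoch, which is why it jumps around. One way to get a steadier number is to average the loss over all batches, weighted by batch size. The helper below is a hypothetical sketch (`RunningAverage` is not part of the notebook or of any library), shown with made-up loss values:

```python
class RunningAverage:
    """Tracks a sample-weighted mean, e.g. of per-batch losses."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # value is the mean loss of one batch, n the number of samples in it
        self.total += value * n
        self.count += n

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0


# Made-up (loss, batch_size) pairs standing in for loss.item() and len(batch["labels"]).
epoch_loss = RunningAverage()
for batch_loss, batch_size in [(0.6, 4), (0.2, 4), (0.1, 2)]:
    epoch_loss.update(batch_loss, batch_size)

print(f"Average training loss: {epoch_loss.average:.4f}")
```

In the training loop you would create one instance per epoch, call `update(loss.item(), len(batch["labels"]))` after each step, and print `average` at the end of the epoch.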



Save the model below so we can run inference with it. Good metrics do not guarantee that the model performs well on new data, so check both.

In [None]:
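model.save_pretrained("trained_model")  # saves the weights together with the config
image_processor.save_pretrained("trained_model")  # keep the preprocessing settings next to the model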

In [None]:
from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained("trained_model")

In [None]:
from transformers import pipeline

pipe = pipeline("image-classification", model=model, image_processor=image_processor)

I usually mount my Google Drive to use new images to test with. This is not necessary if you want to test it in Hugging Face after you've deployed it.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from PIL import Image

image_path = '/content/drive/MyDrive/traffic-levels/kamera-4.jpg'

image = Image.open(image_path)
image

results = pipe(image)
results

[{'label': 'high-traffic', 'score': 0.833804190158844},
 {'label': 'medium-traffic', 'score': 0.15363579988479614},
 {'label': 'low-traffic', 'score': 0.012290147133171558},
 {'label': 'no-traffic', 'score': 0.00026984119904227555}]

I also loop through the first 100 images of the validation set to see how the model does on them.

In [None]:
from PIL import Image

for i in range(100):
    image_data = dataset['validation'][i]['image']
    label_index = dataset['validation'][i]['label']

    if not isinstance(image_data, Image.Image):
        image = Image.open(image_data)
    else:
        image = image_data

    results = pipe(image)

    print(f"Results for image {i+1}:")
    print(results)
    print("Actual label:", id2label[label_index])
    print("----------------------------------")
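
Reading 100 printed results by eye gets tedious, so it can help to tally them. The sketch below uses hypothetical helpers (`top_label`, `tally`) and made-up results in the same list-of-dicts format the pipeline prints above. It counts (actual, predicted) pairs and computes the share of correct top predictions.

```python
from collections import Counter

def top_label(results):
    """Return the label with the highest score from one pipeline result list."""
    return max(results, key=lambda r: r["score"])["label"]

def tally(pairs):
    """Count (actual, predicted) pairs and compute the overall accuracy."""
    counts = Counter(pairs)
    correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
    total = sum(counts.values())
    return counts, (correct / total if total else 0.0)

# Made-up pipeline outputs for three images, plus their true labels.
fake_results = [
    [{"label": "high-traffic", "score": 0.83}, {"label": "medium-traffic", "score": 0.15}],
    [{"label": "low-traffic", "score": 0.40}, {"label": "no-traffic", "score": 0.55}],
    [{"label": "medium-traffic", "score": 0.70}, {"label": "high-traffic", "score": 0.30}],
]
actual = ["high-traffic", "no-traffic", "high-traffic"]

pairs = [(a, top_label(r)) for a, r in zip(actual, fake_results)]
counts, accuracy = tally(pairs)

for (a, p), n in sorted(counts.items()):
    print(f"actual={a:15s} predicted={p:15s} count={n}")
print(f"Accuracy: {accuracy:.2f}")
```

In the loop above you would collect `(id2label[label_index], top_label(results))` for each image instead of using the made-up lists.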

If you're happy with the results, push the model to Hugging Face. You'll need a token with write access, which you can create under Settings in your Hugging Face account.

In [None]:
!huggingface-cli login

In [None]:
repo_name = ""  # set this to the name of your repository on the Hub

model.push_to_hub(repo_name)
image_processor.push_to_hub(repo_name)