# Use `CSV` for Custom Datasets

## Installing Anomalib

The easiest way to install anomalib is with pip. Run the following cell, or run `pip install anomalib` from the command line:

In [None]:
%pip install anomalib

## Setting up the Dataset Directory

This cell sets the path to the dataset root directory, which the rest of the notebook reads from.

In [1]:
from pathlib import Path

# NOTE: Provide the path to the dataset root directory.
#   If the dataset is not downloaded, it will be downloaded
#   to this directory.
dataset_root = Path.cwd().parent.parent / "datasets" / "hazelnut_toy"

## Use CSV Dataset (for Custom Datasets) via API

Here we show how to utilize custom datasets to train anomalib models using the CSV datamodule. The CSV datamodule allows for more flexible dataset organization, where image paths and labels are specified in a CSV file.

In [2]:
import numpy as np
import pandas as pd
from PIL import Image
from torchvision.transforms.v2 import Resize
from torchvision.transforms.v2.functional import to_pil_image

from anomalib import TaskType
from anomalib.data import CSV, CSVDataset

### Creating a CSV file for the dataset

First, let's create a CSV file that contains the image paths and labels for our hazelnut dataset.

In [None]:
def create_csv_file(dataset_path: Path, csv_filename: str = "hazelnut_dataset.csv") -> Path:
    """Create a CSV file from the hazelnut dataset.

    Args:
        dataset_path (Path): Path to the hazelnut dataset.
        csv_filename (str, optional): Name of the CSV file.
            Defaults to "hazelnut_dataset.csv".

    Returns:
        Path: Path to the created CSV file.
    """
    data = []
    for category in ["good", "crack"]:
        image_dir = dataset_path / category
        # Sort the paths so the row order is the same on every run
        for image_path in sorted(image_dir.glob("*.jpg")):
            mask_path = dataset_path / "mask" / category / image_path.name if category != "good" else ""
            data.append(
                {
                    "image_path": str(image_path),
                    "label": "normal" if category == "good" else "abnormal",
                    "mask_path": str(mask_path) if mask_path else "",
                },
            )

    data_frame = pd.DataFrame(data)
    csv_path = Path(csv_filename)
    data_frame.to_csv(csv_path, index=False)
    return csv_path


csv_file_path = create_csv_file(dataset_root)
print(f"CSV file created at: {csv_file_path}")

# Display the first few rows of the CSV file
pd.read_csv(csv_file_path).head()
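
Before passing the CSV file to the datamodule, it can help to confirm that it has the expected columns and that every referenced file exists. The sketch below is one way to do that. `validate_csv` is a hypothetical helper, not part of anomalib, and it assumes the three columns written by `create_csv_file` above.

```python
from pathlib import Path

import pandas as pd

# Columns written by create_csv_file above (assumed to be what the CSV datamodule reads).
REQUIRED_COLUMNS = ("image_path", "label", "mask_path")


def validate_csv(csv_path: Path) -> pd.DataFrame:
    """Return the rows of the CSV whose image or mask file is missing.

    Hypothetical helper for illustration; it is not part of anomalib.
    """
    # keep_default_na=False keeps empty mask_path cells as "" instead of NaN
    frame = pd.read_csv(csv_path, keep_default_na=False)

    missing_columns = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing_columns:
        msg = f"CSV is missing columns: {missing_columns}"
        raise ValueError(msg)

    image_missing = ~frame["image_path"].map(lambda p: Path(p).is_file()).astype(bool)
    # An empty mask_path is valid: normal images have no mask
    mask_missing = frame["mask_path"].map(lambda p: p != "" and not Path(p).is_file()).astype(bool)
    return frame[image_missing | mask_missing]
```

Calling `validate_csv(csv_file_path)` on the file created above should return an empty DataFrame when every image and mask is in place.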

### DataModule

Now that we have created a CSV file for our dataset, let's create an Anomalib datamodule using the CSV class.

In [None]:
csv_datamodule = CSV(
    name="hazelnut_toy",
    csv_path=csv_file_path,
    task=TaskType.SEGMENTATION,
)
csv_datamodule.setup()

In [None]:
# Test images
data = next(iter(csv_datamodule.test_dataloader()))
print("Test data:", data.image.shape, data.gt_mask.shape)

### Torch Dataset

We can also create a standalone PyTorch dataset instance using the CSVDataset class.

In [None]:
CSVDataset??

Let's add a transform that resizes the input image to 256x256 pixels.

In [19]:
image_size = (256, 256)
transform = Resize(image_size, antialias=True)

#### Classification Task

In [None]:
csv_dataset_classification = CSVDataset(
    name="hazelnut_toy",
    csv_path=csv_file_path,
    transform=transform,
    task="classification",  # or TaskType.CLASSIFICATION,
)
csv_dataset_classification.samples.head()

In [None]:
data = csv_dataset_classification[0]
print(data.image_path, data.image.shape, data.gt_label)

#### Segmentation Task

In [None]:
csv_dataset_segmentation = CSVDataset(
    name="hazelnut_toy",
    csv_path=csv_file_path,
    transform=transform,
    task="segmentation",  # or TaskType.SEGMENTATION,
)
csv_dataset_segmentation.samples.head(10)

In [None]:
# Index 0 may be a normal sample; to see a non-empty mask, choose the index of an abnormal row from `samples`
data = csv_dataset_segmentation[0]
print(data.image_path, data.mask_path, data.image.shape, data.gt_mask.shape)

It is also possible to create a CSV dataset with train, validation, and test splits. For more details, see the [CSV data documentation](https://anomalib.readthedocs.io/en/v1.0.1/markdown/guides/reference/data/image/csv.html).

Let's visualize the image and the mask.

In [None]:
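img = to_pil_image(data.image.clone())
# to_pil_image already scales float tensors by 255, so pass the mask with values in [0, 1]
# (this assumes gt_mask holds 0/1 or boolean values)
msk = to_pil_image(data.gt_mask.float()).convert("RGB")

Image.fromarray(np.hstack((np.array(img), np.array(msk))))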
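
The `.convert("RGB")` call matters because `np.hstack` needs both arrays to have the same height and the same number of channels. The sketch below shows this with a synthetic image and mask (random pixels and a square "defect", not data from the hazelnut dataset).

```python
import numpy as np
from PIL import Image

height, width = 64, 64

# Synthetic stand-ins: a random RGB image and a binary mask with a square "defect".
rng = np.random.default_rng(seed=0)
image = Image.fromarray(rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8))
mask_array = np.zeros((height, width), dtype=np.uint8)
mask_array[16:48, 16:48] = 255
mask = Image.fromarray(mask_array)  # single-channel ("L") image

# A grayscale mask has shape (H, W) while the image has shape (H, W, 3),
# so the mask is converted to RGB before the two are stacked horizontally.
side_by_side = np.hstack((np.array(image), np.array(mask.convert("RGB"))))
combined = Image.fromarray(side_by_side)

print(np.array(mask).shape, side_by_side.shape)
```

The combined array is twice as wide as either input, with the image on the left and the mask on the right.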