# Anomalib DataModules
This notebook demonstrates how anomalib data modules work, focusing on benchmark datasets such as MVTec AD and BTech, and on custom datasets via the Folder module. Anomalib data modules are structured as follows: each data collection implements a Torch Dataset and a PyTorch Lightning DataModule.

The Torch Dataset inherits from `torch.utils.data.Dataset` and implements the `__len__` and `__getitem__` methods. Because it follows the standard PyTorch interface, it can be used outside anomalib as well.
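
As a minimal sketch of that contract (not anomalib's own code), the hypothetical `ToyImageDataset` below keeps its items in memory and implements the two required methods. The class name, the fake data, and the fallback base class are assumptions made purely for illustration.

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    # Assumption for illustration: fall back to a plain base class so the
    # sketch still runs where PyTorch is not installed.
    Dataset = object


class ToyImageDataset(Dataset):
    """Hypothetical map-style dataset that holds its items in memory."""

    def __init__(self, images):
        self.images = list(images)

    def __len__(self):
        # Number of samples available in the dataset.
        return len(self.images)

    def __getitem__(self, index):
        # Return one sample as a dict with an "image" entry.
        return {"image": self.images[index]}


# Three fake "images", each a flat list of four floats.
dataset = ToyImageDataset(images=[[0.0] * 4 for _ in range(3)])
print(len(dataset), list(dataset[0].keys()))
```

Any object with these two methods can be indexed and measured with `len()`, which is all a map-style data loader needs from it.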

The DataModule implementation inherits from the PyTorch Lightning `LightningDataModule` class. The advantage of this class is that it organizes every data-handling step, from downloading the data to creating the Torch dataloaders.

Overall, a data implementation has the following structure:

```
anomalib
├── __init__.py
└── data
    ├── __init__.py
    ├── btech.py
    │   ├── BTechDataset
    │   └── BTech
    ├── folder.py
    │   ├── FolderDataset
    │   └── Folder
    ├── inference.py
    │   └── InferenceDataset
    └── mvtec.py
        ├── MVTecDataset
        └── MVTec
```

Let's take a closer look at each dataset supported in anomalib and check its functionality.

## MVTec AD Dataset

In [2]:
from anomalib.data.mvtec import MVTecDataset
MVTecDataset??


Init signature: MVTecDataset(*args, **kwds)
Source:
class MVTecDataset(VisionDataset):
    """MVTec AD PyTorch Dataset."""

    def __init__(
        self,
        root: Union[Path, str],
        category: str,
        pre_process: PreProcessor,
        split: str,
        task: str = "segmentation"

To create an `MVTecDataset`, we first need a `PreProcessor` instance, passed as the `pre_process` argument, which applies transforms to the input image.

In [3]:
from anomalib.pre_processing import PreProcessor
PreProcessor??

Init signature:
PreProcessor(
    config: Union[str, albumentations.core.composition.Compose, NoneType] = None,
    image_size: Union[int, Tuple, NoneType] = None,
    to_tensor: bool = True,
) -> None
Source:
class PreProcessor:
    """Applies pre-processing and data augmentations to the input and returns the transformed output.

    Output could 

In [4]:
pre_process = PreProcessor(image_size=256, to_tensor=True)

### Classification Task

In [14]:
# MVTec Classification Train Set
mvtec_dataset_classification_train = MVTecDataset(root="../../datasets/MVTec", category="bottle", pre_process=pre_process, split="train", task="classification")
mvtec_dataset_classification_train.samples.head()

| | path | split | label | image_path | mask_path | label_index |
|---|---|---|---|---|---|---|
| 0 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/116.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
| 1 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/136.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
| 2 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/097.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
| 3 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/039.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
| 4 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/037.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
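
The columns above suggest how such a samples table could be assembled from a `<category>/<split>/<label>/<image>.png` folder layout. The sketch below is not anomalib's implementation: `make_samples`, the defect folder name `broken`, and the rule that every label other than `good` maps to index 1 are assumptions for illustration. The `mask_path` column is left out because its full form is not shown in the output.

```python
import tempfile
from pathlib import Path


def make_samples(category_root):
    """Hypothetical helper: list images under <root>/<split>/<label>/*.png."""
    samples = []
    for image_path in sorted(Path(category_root).glob("*/*/*.png")):
        split, label = image_path.parts[-3], image_path.parts[-2]
        if split not in ("train", "test"):
            continue  # skip other folders such as ground_truth
        samples.append(
            {
                "path": str(category_root),
                "split": split,
                "label": label,
                "image_path": str(image_path),
                # Assumption: "good" maps to 0 and any other label to 1.
                "label_index": 0 if label == "good" else 1,
            }
        )
    return samples


# Build a tiny fake layout so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "bottle"
    for rel in ("train/good/000.png", "test/good/000.png", "test/broken/000.png"):
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).touch()
    samples = make_samples(root)
    train_samples = [s for s in samples if s["split"] == "train"]

print(len(samples), len(train_samples))
```

Filtering the rows by `split` afterwards, as done for `train_samples`, is one simple way to obtain per-split views of the same table.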


In [15]:
sample = mvtec_dataset_classification_train[0]
sample.keys(), sample["image"].shape

(dict_keys(['image']), torch.Size([3, 256, 256]))

As can be seen above, when we choose the `classification` task and the `train` split, the dataset returns only `image`. This is mainly because training requires only normal images and no labels. Now let's try the `test` split for the `classification` task.

In [17]:
# MVTec Classification Test Set
mvtec_dataset_classification_test = MVTecDataset(root="../../datasets/MVTec", category="bottle", pre_process=pre_process, split="test", task="classification")
sample = mvtec_dataset_classification_test[0]
sample.keys(), sample["image"].shape, sample["image_path"], sample["label"]

(dict_keys(['image', 'image_path', 'label']),
 torch.Size([3, 256, 256]),
 '../../datasets/MVTec/bottle/test/good/007.png',
 0)

### Segmentation Task
It is also possible to configure the MVTec dataset for the segmentation task, in which case the dataset object returns both the image and its ground-truth mask.

In [18]:
# MVTec Segmentation Train Set
mvtec_dataset_segmentation_train = MVTecDataset(root="../../datasets/MVTec", category="bottle", pre_process=pre_process, split="train", task="segmentation")
mvtec_dataset_segmentation_train.samples.head()

| | path | split | label | image_path | mask_path | label_index |
|---|---|---|---|---|---|---|
| 0 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/116.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
| 1 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/136.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
| 2 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/097.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
| 3 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/039.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |
| 4 | ../../datasets/MVTec/bottle | train | good | ../../datasets/MVTec/bottle/train/good/037.png | ../../datasets/MVTec/bottle/ground_truth/good/... | 0 |


In [21]:
# MVTec Segmentation Test Set
mvtec_dataset_segmentation_test = MVTecDataset(root="../../datasets/MVTec", category="bottle", pre_process=pre_process, split="test", task="segmentation")
sample = mvtec_dataset_segmentation_test[0]
sample.keys(), sample["image"].shape, sample["mask"].shape

(dict_keys(['image', 'image_path', 'label', 'mask_path', 'mask']),
 torch.Size([3, 256, 256]),
 torch.Size([256, 256]))
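
The outputs in this section follow a simple pattern: the returned keys depend on the split and the task. The hypothetical helper `expected_keys` below only summarizes the keys observed in the cells above; it is not part of anomalib, and the behaviour for the segmentation train split is an assumption, since no sample from that split is printed in this notebook.

```python
def expected_keys(split, task):
    """Hypothetical helper summarizing the item keys observed in this notebook."""
    if split == "train":
        # Observed for the classification task; assumed to also hold
        # for the segmentation task.
        return ["image"]
    keys = ["image", "image_path", "label"]
    if task == "segmentation":
        # Test items for segmentation additionally carry the mask.
        keys += ["mask_path", "mask"]
    return keys


for split, task in [
    ("train", "classification"),
    ("test", "classification"),
    ("test", "segmentation"),
]:
    print(split, task, expected_keys(split, task))
```

A check like `set(sample.keys()) == set(expected_keys("test", "segmentation"))` could then serve as a quick sanity test on a loaded sample.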