# Structure ImageNet (Kaggle Download) into ImageFolder structure.
This script reorganizes ImageNet images and annotations into the file structure expected by the PyTorch ImageFolder dataset.

Note: This script is for images downloaded from [Kaggle](https://www.kaggle.com/c/imagenet-object-localization-challenge/overview/description). For ImageNet downloaded from the [ImageNet website](https://www.image-net.org/index.php) see [`imagefolder_structure_from_imagenet.ipynb`](https://github.com/mitvis/shared-interest/blob/main/imagenet_download_util/imagefolder_structure_from_imagenet.ipynb).

Shared Interest is built on PyTorch. PyTorch provides Dataset classes that handle loading data from disk. To load images and their bounding boxes, Shared Interest extends the torchvision [ImageFolder Dataset](https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html). ImageFolder expects one subdirectory per class, so this script reorganizes the images and annotations into a file structure that is compatible with ImageFolder.
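To make the layout requirement concrete, here is a minimal, standard-library-only sketch of the folder-per-class convention: each subdirectory of the root is treated as a class, and the sorted class names are mapped to integer indices. The helper `find_classes_sketch` and the temporary toy directory are illustrative assumptions, not part of Shared Interest or torchvision.

```python
import os
import tempfile


def find_classes_sketch(root):
    """Map each subdirectory of root to an integer index, in sorted order.

    Hypothetical helper that mimics the folder-per-class convention;
    it is not the torchvision implementation.
    """
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}


# Build a toy root with three class folders named like the ones in this notebook.
toy_root = tempfile.mkdtemp()
for class_name in ['0002', '0000', '0001']:
    os.makedirs(os.path.join(toy_root, class_name))

class_to_idx = find_classes_sketch(toy_root)
print(class_to_idx)
```

Because the folder names are zero-padded to four digits, sorting them as strings gives the same order as sorting the labels as integers.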

After running this script, your `imagenet` directory will be formatted as follows:
```
imagenet
|
|---val/
|   |---images/
|       |---0000/
|           |---ILSVRC2012_val_<imageid0>.JPEG
|           |---ILSVRC2012_val_<imageid1>.JPEG
|           |...
|       |...
|       |---0999/
|   |---annotations/
|       |---0000/
|           |---ILSVRC2012_val_<imageid0>.xml
|           |---ILSVRC2012_val_<imageid1>.xml
|           |...
|       |...
|       |---0999/

```


In [1]:
import numpy as np
import os
from tqdm import tqdm

## Step 0: Download the ImageNet validation set
ImageNet is a large image classification dataset that contains natural images (i.e., photos) labeled as one of 1000 classes. Every image in the validation set contains object-level bounding boxes highlighting the labeled object. Shared Interest uses the bounding boxes as human annotations, so we will focus on the validation set.

Download ImageNet from [Kaggle](https://www.kaggle.com/competitions/imagenet-object-localization-challenge/data). 

Set `kaggle_imagenet_directory` below to the path of your ImageNet data.

In [12]:
kaggle_imagenet_directory = '/nobackup/users/aboggust/data/imagenet_kaggle' #TODO: Add path to your base directory
kaggle_image_directory = os.path.join(kaggle_imagenet_directory, 'ILSVRC', 'Data', 'CLS-LOC')
kaggle_annotation_directory = os.path.join(kaggle_imagenet_directory, 'ILSVRC', 'Annotations', 'CLS-LOC')

This script also uses the [validation labels](https://github.com/mitvis/shared-interest/blob/main/imagenet_download_util/validation_labels.txt) available in the Shared Interest repo. Download the file and place it in `kaggle_imagenet_directory`, where the next cell looks for it. Note: the ImageNet labels in the development kit refer to the Caffe labels and are not the same as the ImageNet labeling system.

In [3]:
labels_filename = os.path.join(kaggle_imagenet_directory, 'validation_labels.txt')
assert os.path.isfile(labels_filename), "Missing validation labels. Download from Shared Interest repo."

Shared Interest uses the ImageNet validation set because all 50K images have corresponding object-level bounding box annotations, so this script processes only the validation split.

In [4]:
split = 'val'

## Step 1: Build the file structure

In [5]:
def make_directory(directory):
    """Makes directory if it does not already exist."""
    # exist_ok=True makes repeated runs of this cell safe.
    os.makedirs(directory, exist_ok=True)

def make_subdirectories(directory, subdirectory_names):
    """Makes a folder directory/name/ for every name in subdirectory_names."""
    for name in subdirectory_names:
        make_directory(os.path.join(directory, name))

In [6]:
# Make a folder within the kaggle_imagenet_directory named split
split_directory = os.path.join(kaggle_imagenet_directory, split)
make_directory(split_directory)

In [7]:
# Create an image folder and annotation folder within the split_directory
image_directory = os.path.join(split_directory, 'images')
make_directory(image_directory)

annotation_directory = os.path.join(split_directory, 'annotations')
make_directory(annotation_directory)

image_extension = 'JPEG'
annotation_extension = 'xml'

## Step 2: Make a directory for each of the ImageNet classes

In [8]:
# Load the image labels from validation_labels.txt.
# image_labels is a list of zero-padded strings, one label per image.
# image_labels[i] is the label for validation image number i + 1.
with open(labels_filename, 'r') as f:
    image_labels = ['%04d' %int(line.strip()) for line in f]
print('Found %i labels and %i images.' %(len(np.unique(image_labels)), len(image_labels)))
print('Example labels:', np.unique(image_labels)[0:5])
print('First label is: %s' %(image_labels[0]))

Found 1000 labels and 50000 images.
Example labels: ['0000' '0001' '0002' '0003' '0004']
First label is: 0065
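The label parsing above can be tried on a few toy lines before pointing it at the real file. This is a minimal sketch: `parse_labels` is a hypothetical helper, the toy label values are made up, and skipping blank lines is an added assumption that the original cell does not make.

```python
from collections import Counter


def parse_labels(lines):
    """Convert raw label lines such as '65\n' into zero-padded strings such as '0065'.

    Blank lines are skipped (an assumption; the notebook cell does not do this).
    """
    return ['%04d' % int(line.strip()) for line in lines if line.strip()]


# Toy input standing in for the contents of validation_labels.txt.
toy_lines = ['65\n', '970\n', '65\n', '\n']
toy_labels = parse_labels(toy_lines)

# Count how many images fall into each class folder.
counts = Counter(toy_labels)
print(toy_labels)
print(counts.most_common())
```

Running the same `Counter` on the real `image_labels` is a quick sanity check: the ImageNet validation set is expected to contain 50 images for each of the 1000 classes.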


In [9]:
# Make a directory for each class within `images` and `annotations`
make_subdirectories(image_directory, np.unique(image_labels))
make_subdirectories(annotation_directory, np.unique(image_labels))

## Step 3: Move the images and annotations into their directories
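Because the moves in this step modify the download in place, it can help to do a dry run first. The sketch below builds the same source and destination paths as the cells in this step, but only checks whether each source file exists. `plan_moves` is a hypothetical helper, and the temporary directories and labels are toy stand-ins for the real data.

```python
import os
import tempfile


def plan_moves(labels, source_dir, destination_dir, split='val', extension='JPEG'):
    """Return (source, destination) path pairs without touching the disk.

    Label i (0-based) corresponds to file number i + 1.
    """
    moves = []
    for index, label in enumerate(labels):
        name = 'ILSVRC2012_%s_%08d.%s' % (split, index + 1, extension)
        moves.append((os.path.join(source_dir, name),
                      os.path.join(destination_dir, label, name)))
    return moves


# Toy demonstration: two made-up labels, but only the first source file exists.
demo_source = tempfile.mkdtemp()
demo_destination = tempfile.mkdtemp()
open(os.path.join(demo_source, 'ILSVRC2012_val_00000001.JPEG'), 'w').close()

demo_moves = plan_moves(['0065', '0970'], demo_source, demo_destination)
missing = [source for source, _ in demo_moves if not os.path.isfile(source)]
print('%i of %i source files are missing.' % (len(missing), len(demo_moves)))
```

On the real data, the equivalent call would pass `image_labels`, `os.path.join(kaggle_image_directory, split)`, and `image_directory`. A non-empty `missing` list suggests the download is incomplete or the files were already moved.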

In [10]:
# Move the images into their directories.
# File numbers are 1-based, so the label at position index belongs to file number index + 1.
for index, label in enumerate(tqdm(image_labels)):
    name = 'ILSVRC2012_%s_%08d.%s' %(split, index+1, image_extension)
    source = os.path.join(kaggle_image_directory, split, name)
    destination = os.path.join(image_directory, label, name)
    os.rename(source, destination)

100%|██████████| 50000/50000 [01:32<00:00, 542.83it/s]


In [13]:
# Move the annotations into their directories.
# Annotation files share the image numbering, so the same index + 1 mapping applies.
for index, label in enumerate(tqdm(image_labels)):
    name = 'ILSVRC2012_%s_%08d.%s' %(split, index+1, annotation_extension)
    source = os.path.join(kaggle_annotation_directory, split, name)
    destination = os.path.join(annotation_directory, label, name)
    os.rename(source, destination)

100%|██████████| 50000/50000 [00:32<00:00, 1561.96it/s]
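After both moves finish, one way to confirm the result is to check that every class folder holds the same file stems under `images` and `annotations`. The following is a minimal sketch on a toy tree; `stems_by_class` and `find_mismatched_classes` are hypothetical helpers, not part of Shared Interest.

```python
import os
import tempfile


def stems_by_class(root):
    """Return {class_name: set of file stems} for a folder-per-class tree."""
    stems = {}
    for entry in os.scandir(root):
        if entry.is_dir():
            stems[entry.name] = {os.path.splitext(name)[0]
                                 for name in os.listdir(entry.path)}
    return stems


def find_mismatched_classes(image_root, annotation_root):
    """List class names whose image stems differ from their annotation stems."""
    images = stems_by_class(image_root)
    annotations = stems_by_class(annotation_root)
    return sorted(name for name in set(images) | set(annotations)
                  if images.get(name) != annotations.get(name))


# Toy tree: class 0000 is complete, class 0001 has an image but no annotation.
toy_root = tempfile.mkdtemp()
toy_files = [
    ('images', '0000', 'ILSVRC2012_val_00000001.JPEG'),
    ('annotations', '0000', 'ILSVRC2012_val_00000001.xml'),
    ('images', '0001', 'ILSVRC2012_val_00000002.JPEG'),
]
for kind, label, name in toy_files:
    os.makedirs(os.path.join(toy_root, kind, label), exist_ok=True)
    open(os.path.join(toy_root, kind, label, name), 'w').close()

mismatched = find_mismatched_classes(os.path.join(toy_root, 'images'),
                                     os.path.join(toy_root, 'annotations'))
print('Classes with mismatched files:', mismatched)
```

On the real data, pass `image_directory` and `annotation_directory`. An empty list means every image has an annotation in the same class folder.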
