# ImageNet bounding box files to one DataFrame

![img](http://image-net.org/bbox_fig/kit_fox.JPG)

The [official bounding boxes](http://image-net.org/download-bboxes) for the ImageNet dataset are stored as one XML file per image.

That layout is inconvenient to work with, so we collect all of them in a single `pd.DataFrame`.

In [1]:
import tarfile
import xmltodict
import urllib.request
import pandas as pd

from tqdm import tqdm
from pathlib import Path
from pandas import json_normalize

Create a subfolder for the downloaded and generated data

In [15]:
save_path = Path("data/")
save_path.mkdir(parents=True, exist_ok=True)

### Download bounding boxes for ImageNet and unpack them to a folder named `bboxes`

In [54]:
bbox_path = save_path / "bboxes"
bbox_path.mkdir(parents=True, exist_ok=True)

#### **Note**: If you are using a GNU/Linux-based operating system (e.g. `Google Colab`), use this faster variant:

In [50]:
! wget -c http://image-net.org/Annotation/Annotation.tar.gz -P data/

--2020-06-20 14:10:42--  http://image-net.org/Annotation/Annotation.tar.gz
Resolving image-net.org (image-net.org)... 171.64.68.16
Connecting to image-net.org (image-net.org)|171.64.68.16|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44310351 (42M) [application/x-gzip]
Saving to: 'data/Annotation.tar.gz'

2020-06-20 14:14:47 (181 KB/s) - 'data/Annotation.tar.gz' saved [44310351/44310351]



In [55]:
! tar -zxf data/Annotation.tar.gz -C data/bboxes/

Delete the archive afterwards

In [57]:
! rm data/Annotation.tar.gz

#### **Note**: If you are using Windows, use the following code. Unpacking takes a while...

In [35]:
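bboxes_url = "http://image-net.org/Annotation/Annotation.tar.gz"

# Stream the HTTP download straight into tarfile; mode "r|gz" reads the
# archive sequentially, so it never has to be stored on disk first
with urllib.request.urlopen(bboxes_url) as response:
    with tarfile.open(fileobj=response, mode="r|gz") as archive:
        archive.extractall(path=bbox_path)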

### First, let us look at how they are structured

In [38]:
# Parse every XML file in one archive; `doc` ends up holding the last one
with tarfile.open(bbox_path / 'n00007846.tar.gz', 'r') as tar:
    for compressed_file in tar.getmembers():
        if compressed_file.name.endswith(".xml"):
            xml_file = tar.extractfile(compressed_file)
            doc = xmltodict.parse(xml_file.read())

We can discard the root element `annotation`

In [39]:
doc["annotation"]

OrderedDict([('folder', 'n00007846'),
             ('filename', 'n00007846_128922'),
             ('source', OrderedDict([('database', 'ImageNet database')])),
             ('size',
              OrderedDict([('width', '333'),
                           ('height', '500'),
                           ('depth', '3')])),
             ('segmented', '0'),
             ('object',
              OrderedDict([('name', 'n00007846'),
                           ('pose', 'Unspecified'),
                           ('truncated', '0'),
                           ('difficult', '0'),
                           ('bndbox',
                            OrderedDict([('xmin', '60'),
                                         ('ymin', '11'),
                                         ('xmax', '330'),
                                         ('ymax', '450')]))]))])

Pandas cannot handle nested (unnormalized) input very well: every nested key becomes a row of its own, and the bounding box stays an unexpanded dict.

In [40]:
pd.DataFrame(doc["annotation"])

| | folder | filename | source | size | segmented | object |
|---|---|---|---|---|---|---|
| database | n00007846 | n00007846_128922 | ImageNet database | | 0 | |
| width | n00007846 | n00007846_128922 | | 333.0 | 0 | |
| height | n00007846 | n00007846_128922 | | 500.0 | 0 | |
| depth | n00007846 | n00007846_128922 | | 3.0 | 0 | |
| name | n00007846 | n00007846_128922 | | | 0 | n00007846 |
| pose | n00007846 | n00007846_128922 | | | 0 | Unspecified |
| truncated | n00007846 | n00007846_128922 | | | 0 | 0 |
| difficult | n00007846 | n00007846_128922 | | | 0 | 0 |
| bndbox | n00007846 | n00007846_128922 | | | 0 | {'xmin': '60', 'ymin': '11', 'xmax': '330', 'y... |


If we flatten the input with `json_normalize()`, however, every nested key becomes its own column and the result looks much better

In [41]:
xml_norm = json_normalize(doc['annotation'])
xml_norm

| | folder | filename | segmented | source.database | size.width | size.height | size.depth | object.name | object.pose | object.truncated | object.difficult | object.bndbox.xmin | object.bndbox.ymin | object.bndbox.xmax | object.bndbox.ymax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | n00007846 | n00007846_128922 | 0 | ImageNet database | 333 | 500 | 3 | n00007846 | Unspecified | 0 | 0 | 60 | 11 | 330 | 450 |


The flattened result is already a `DataFrame`, so wrapping it in `pd.DataFrame()` leaves it unchanged

In [42]:
pd.DataFrame(xml_norm)

| | folder | filename | segmented | source.database | size.width | size.height | size.depth | object.name | object.pose | object.truncated | object.difficult | object.bndbox.xmin | object.bndbox.ymin | object.bndbox.xmax | object.bndbox.ymax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | n00007846 | n00007846_128922 | 0 | ImageNet database | 333 | 500 | 3 | n00007846 | Unspecified | 0 | 0 | 60 | 11 | 330 | 450 |
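To make the flattening step easier to try in isolation, here is a minimal sketch that runs `json_normalize()` on a small hand-written dict. The dict is a hypothetical stand-in that mirrors the structure shown earlier, not data read from the archive, and the `sep="_"` variant is only an optional illustration:

```python
import pandas as pd

# Hypothetical stand-in for doc["annotation"], trimmed to a few fields
annotation = {
    "folder": "n00007846",
    "filename": "n00007846_128922",
    "size": {"width": "333", "height": "500"},
    "object": {
        "name": "n00007846",
        "bndbox": {"xmin": "60", "ymin": "11", "xmax": "330", "ymax": "450"},
    },
}

# Nested keys are joined with "." by default, e.g. "object.bndbox.xmin"
flat = pd.json_normalize(annotation)
print(list(flat.columns))

# A different separator can be passed if dotted column names are awkward
flat_underscore = pd.json_normalize(annotation, sep="_")
print(list(flat_underscore.columns))
```

Note that the values stay strings, exactly as they came out of the XML parser; converting them to numbers is a separate step.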


____

## Create the dataframe

To process each `*.tar.gz` file, we first need a list of all of them

In [44]:
tars = list(bbox_path.rglob('*.tar.gz'))
print(tars[:10])

[PosixPath('data/bboxes/n07683490.tar.gz'), PosixPath('data/bboxes/n03116767.tar.gz'), PosixPath('data/bboxes/n12317296.tar.gz'), PosixPath('data/bboxes/n04593077.tar.gz'), PosixPath('data/bboxes/n07881205.tar.gz'), PosixPath('data/bboxes/n02256656.tar.gz'), PosixPath('data/bboxes/n07690431.tar.gz'), PosixPath('data/bboxes/n01606978.tar.gz'), PosixPath('data/bboxes/n03670208.tar.gz'), PosixPath('data/bboxes/n01661091.tar.gz')]


This function opens a `*.tar.gz` file in read mode (`'r'`), iterates over its members, and builds a pandas `DataFrame` from every XML file it finds.

Each member is read straight from the archive into memory, so nothing is unpacked to disk and hardly any additional space is needed. This approach is, however, slower than decompressing everything up front.
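The idea of reading members without unpacking them can be tried on a tiny archive built in memory. The following is only a sketch using the standard library; the helper names `make_demo_archive` and `read_xml_members`, the member path, and the XML payload are all made up for illustration:

```python
import io
import tarfile


def make_demo_archive():
    """Build a tiny in-memory .tar.gz holding one hypothetical XML file."""
    payload = b"<annotation><folder>demo</folder></annotation>"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="Annotation/demo/demo_1.xml")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_xml_members(fileobj):
    """Yield (member name, raw bytes) for each .xml member, never touching disk."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:  # walks the archive one member at a time
            if member.isfile() and member.name.endswith(".xml"):
                with tar.extractfile(member) as handle:
                    yield member.name, handle.read()


members = list(read_xml_members(make_demo_archive()))
print(members)
```

The bytes returned for each member could then be handed to an XML parser in the same way the function below does.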

In [45]:
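def untarXMLtodict(file):
    """Return one DataFrame with a row per XML annotation in `file`."""
    frames = []

    with tarfile.open(file, 'r') as tar:
        for compressed_file in tar.getmembers():
            if compressed_file.name.endswith(".xml"):
                xml_file = tar.extractfile(compressed_file)
                xml_dict = xmltodict.parse(xml_file.read())
                frames.append(json_normalize(xml_dict['annotation']))

    # Concatenate once at the end; growing a DataFrame inside the loop
    # copies all previously collected rows on every iteration
    return pd.concat(frames) if frames else pd.DataFrame()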

We can now simply iterate through all the `*.tar.gz` files we found earlier and run `untarXMLtodict()` on each, concatenate the results and finally get our single `DataFrame` with all the information.

In [46]:
# Collect one DataFrame per archive, then concatenate them all in one go
frames = [untarXMLtodict(file) for file in tqdm(tars)]
complete_data = pd.concat(frames)

100%|██████████| 3627/3627 [1:13:36<00:00,  1.22s/it]


Looks good!

In [47]:
complete_data[:5]

| | folder | filename | segmented | source.database | size.width | size.height | size.depth | object.name | object.pose | object.truncated | object.difficult | object.bndbox.xmin | object.bndbox.ymin | object.bndbox.xmax | object.bndbox.ymax | object |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | n07683490 | n07683490_6960 | 0 | ImageNet database | 500 | 375 | 3 | n07683490 | Unspecified | 0 | 0 | 78 | 91 | 290 | 307 | |
| 0 | n07683490 | n07683490_5743 | 0 | ImageNet database | 480 | 480 | 3 | n07683490 | Unspecified | 0 | 0 | 273 | 56 | 474 | 399 | |
| 0 | n07683490 | n07683490_7315 | 0 | ImageNet database | 500 | 374 | 3 | n07683490 | Unspecified | 0 | 0 | 141 | 0 | 483 | 301 | |
| 0 | n07683490 | n07683490_6158 | 0 | ImageNet database | 500 | 374 | 3 | n07683490 | Unspecified | 0 | 0 | 93 | 77 | 365 | 311 | |
| 0 | n07683490 | n07683490_7703 | 0 | ImageNet database | 500 | 375 | 3 | n07683490 | Unspecified | 0 | 0 | 134 | 80 | 355 | 299 | |
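The trailing `object` column in this preview is most likely a sign that some annotations contain more than one `<object>` element: xmltodict returns a list for repeated elements, and `json_normalize()` leaves lists unexpanded. A minimal sketch of one way to get one row per object, assuming that explanation holds. The helper `annotation_to_rows` and the sample dict are hypothetical and hand-written, not taken from the dataset:

```python
import pandas as pd


def annotation_to_rows(annotation):
    """Return a DataFrame with one row per object in a parsed annotation."""
    objects = annotation.get("object", [])
    if isinstance(objects, dict):
        # A single <object> arrives as a dict; wrap it so both cases match
        objects = [objects]
    # Repeat the image-level fields once per object, then flatten
    records = [{**annotation, "object": obj} for obj in objects]
    return pd.json_normalize(records)


# Hypothetical annotation with two objects (values made up for illustration)
sample = {
    "folder": "n00000000",
    "filename": "n00000000_1",
    "object": [
        {"name": "n00000000",
         "bndbox": {"xmin": "60", "ymin": "11", "xmax": "330", "ymax": "450"}},
        {"name": "n00000000",
         "bndbox": {"xmin": "10", "ymin": "20", "xmax": "30", "ymax": "40"}},
    ],
}

rows = annotation_to_rows(sample)
print(rows[["filename", "object.bndbox.xmin", "object.bndbox.xmax"]])
```

Calling such a helper in place of the plain `json_normalize(xml_dict['annotation'])` call would be one option; whether it is needed depends on how many multi-object annotations the archives actually contain.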


#### Save data to disk

Write the `DataFrame` to a `.csv` file

In [48]:
complete_data.to_csv(save_path / "imagenet_boundingboxes.csv", encoding='utf-8', index=False)

Write the `DataFrame` to a `.pkl` file

In [49]:
complete_data.to_pickle(save_path / "imagenet_boundingboxes.pkl")

We could also use the `feather` format...
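A rough sketch of what the Feather variant might look like, shown on a hypothetical two-row stand-in for `complete_data`. It assumes the optional `pyarrow` package is installed, and the helper name `save_feather` is made up for illustration. Because every row above carries the index value `0`, the index is reset first; some pandas versions refuse to write a non-default index to Feather:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature of `complete_data`; note the repeated index value 0
demo = pd.DataFrame(
    {"filename": ["a_1", "a_2"], "object.bndbox.xmin": ["78", "273"]},
    index=[0, 0],
)


def save_feather(frame, path):
    """Write `frame` to Feather after replacing the index with 0..n-1."""
    frame.reset_index(drop=True).to_feather(path)


restored = None
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.feather"
    try:
        save_feather(demo, target)
        restored = pd.read_feather(target)
    except ImportError:
        # pandas raises ImportError when pyarrow is missing
        print("pyarrow is not installed; skipping the Feather demo")

print(restored)
```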