# ImageNette bounding boxes data

Since ImageNet is quite big with over 150GB of image data, we use a much smaller variant called [imagenette](https://github.com/fastai/imagenette) that was created by Jeremy Howard from [fast.ai](https://docs.fast.ai/).

Imagenette is 
> a subset of 10 easily classified classes from Imagenet: 
>
> (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute).

For this notebook we use the full-size version of the images, available from [here](https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz).

This is the newer version from **Dec 6 2019**, which uses a 70/30 train/validation split.

_______

## Create dataset

First let's read in the `pd.DataFrame` we created with our `1_imagenet_boundingboxes.ipynb` notebook

In [3]:
import tarfile
import pandas as pd
import urllib.request

from pathlib import Path

We can directly use the `.pkl` file we created earlier

In [4]:
save_path = Path("data/")

In [2]:
complete_data = pd.read_pickle(save_path / "imagenet_boundingboxes.pkl")

Let's have a look

In [3]:
complete_data[:5]

Unnamed: 0,filename,folder,object,object.bndbox.xmax,object.bndbox.xmin,object.bndbox.ymax,object.bndbox.ymin,object.difficult,object.name,object.pose,object.truncated,segmented,size.depth,size.height,size.width,source.database
0,n07683490_6960,n07683490,,290.0,78,307,91,0,n07683490,Unspecified,0,0,3,375,500,ImageNet database
0,n07683490_5743,n07683490,,474.0,273,399,56,0,n07683490,Unspecified,0,0,3,480,480,ImageNet database
0,n07683490_7315,n07683490,,483.0,141,301,0,0,n07683490,Unspecified,0,0,3,374,500,ImageNet database
0,n07683490_6158,n07683490,,365.0,93,311,77,0,n07683490,Unspecified,0,0,3,374,500,ImageNet database
0,n07683490_7703,n07683490,,355.0,134,299,80,0,n07683490,Unspecified,0,0,3,375,500,ImageNet database
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,n03594734_38401,n03594734,,460.0,41,347,0,0,n03594734,Unspecified,0,0,3,375,500,ImageNet database
0,n03594734_37947,n03594734,,373.0,0,447,67,0,n03594734,Unspecified,0,0,3,500,375,ImageNet database
0,n03594734_29003,n03594734,,316.0,34,226,16,0,n03594734,Unspecified,0,0,3,227,329,ImageNet database
0,n03594734_36264,n03594734,,326.0,51,481,8,0,n03594734,Unspecified,0,0,3,500,375,ImageNet database


The index looks odd. Since we simply merged the dataset, every row has the index `0`. Let's fix that

In [4]:
complete_data.reset_index(drop=True, inplace=True)
complete_data[:5]

Unnamed: 0,filename,folder,object,object.bndbox.xmax,object.bndbox.xmin,object.bndbox.ymax,object.bndbox.ymin,object.difficult,object.name,object.pose,object.truncated,segmented,size.depth,size.height,size.width,source.database
0,n07683490_6960,n07683490,,290.0,78,307,91,0,n07683490,Unspecified,0,0,3,375,500,ImageNet database
1,n07683490_5743,n07683490,,474.0,273,399,56,0,n07683490,Unspecified,0,0,3,480,480,ImageNet database
2,n07683490_7315,n07683490,,483.0,141,301,0,0,n07683490,Unspecified,0,0,3,374,500,ImageNet database
3,n07683490_6158,n07683490,,365.0,93,311,77,0,n07683490,Unspecified,0,0,3,374,500,ImageNet database
4,n07683490_7703,n07683490,,355.0,134,299,80,0,n07683490,Unspecified,0,0,3,375,500,ImageNet database
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1073734,n03594734_38401,n03594734,,460.0,41,347,0,0,n03594734,Unspecified,0,0,3,375,500,ImageNet database
1073735,n03594734_37947,n03594734,,373.0,0,447,67,0,n03594734,Unspecified,0,0,3,500,375,ImageNet database
1073736,n03594734_29003,n03594734,,316.0,34,226,16,0,n03594734,Unspecified,0,0,3,227,329,ImageNet database
1073737,n03594734_36264,n03594734,,326.0,51,481,8,0,n03594734,Unspecified,0,0,3,500,375,ImageNet database
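To see where the repeated `0` comes from, here is a minimal sketch with made-up data. It assumes the per-annotation frames were combined with `pd.concat`, which keeps each frame's original index; the two single-row frames and their values are hypothetical and not taken from the real annotations.

```python
import pandas as pd

# Two hypothetical single-row frames, standing in for per-annotation records
a = pd.DataFrame({"filename": ["img_a"], "xmin": [10]})
b = pd.DataFrame({"filename": ["img_b"], "xmin": [20]})

# Concatenating keeps each frame's own index, so both rows are labelled 0
merged = pd.concat([a, b])
print(merged.index.tolist())  # [0, 0]

# reset_index(drop=True) replaces the labels with a fresh 0..n-1 range
fixed = merged.reset_index(drop=True)
print(fixed.index.tolist())  # [0, 1]
```

Passing `ignore_index=True` to `pd.concat` would avoid the duplicate labels at merge time.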


A closer look at the data with some basic statistics

**Note**: May take a while

In [None]:
complete_data.describe(include='all')

### Get subset of classes used in imagenette
For the smaller imagenette we don't have to rely on all the images present in ImageNet

**Note**: We do not follow the imagenette train/test split here, but the ImageNet partitioning 

#### **Note**: If you are using a GNU/Linux based operating system (e.g. `Google Colab`) use this faster variant:

In [20]:
! wget -c https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz -P data/

--2020-06-20 14:04:48--  https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.97.221
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.97.221|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1556914727 (1.4G) [application/x-tar]
Saving to: 'data/imagenette2.tgz'


2020-06-20 14:17:49 (1.90 MB/s) - 'data/imagenette2.tgz' saved [1556914727/1556914727]



In [22]:
! tar -zxf data/imagenette2.tgz -C data/

#### **Note**: If you are using Windows, use the following code. Unpacking takes a while...

In [None]:
imagenette_url = "https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz"
# Stream the archive over HTTPS and unpack it on the fly, without saving the .tgz first
with urllib.request.urlopen(imagenette_url) as response:
    with tarfile.open(fileobj=response, mode="r|gz") as archive:
        archive.extractall(path=save_path)
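The streaming extraction above can also be wrapped in a small helper that extracts member by member and skips any entry that would land outside the target folder. This is only a sketch: the function name `extract_tgz_stream` is hypothetical, and the demo builds a tiny archive in memory rather than downloading the real one.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_tgz_stream(fileobj, dest):
    """Extract a gzip-compressed tar stream into dest, member by member."""
    dest = Path(dest).resolve()
    extracted = []
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            target = (dest / member.name).resolve()
            # Skip anything that would be written outside dest
            if dest != target and dest not in target.parents:
                continue
            archive.extract(member, path=dest)
            extracted.append(member.name)
    return extracted


# Self-contained demo: build a tiny archive in memory instead of downloading
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    payload = b"not a real image"
    info = tarfile.TarInfo(name="imagenette2/train/demo.JPEG")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    names = extract_tgz_stream(buffer, tmp)
    demo_exists = (Path(tmp) / "imagenette2/train/demo.JPEG").is_file()

print(names, demo_exists)
```

For the real download, the same helper could be fed the response object returned by `urllib.request.urlopen`, with `save_path` as the destination.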

Create imagenette `Path` object

In [24]:
imagenette_path = save_path / "imagenette2"

In [25]:
# for now we filter out those that would be in the original ImageNet validation set
imgn_pict = [filename for filename in imagenette_path.rglob('*.JPEG') if '_val_' not in filename.name]

print(imgn_pict[:10])
print(f'\nTotal images: {len(imgn_pict)}')

[PosixPath('data/imagenette2/train/n03394916/n03394916_23108.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_59626.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_5529.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_36154.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_32997.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_61934.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_59745.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_21289.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_29164.JPEG'), PosixPath('data/imagenette2/train/n03394916/n03394916_7547.JPEG')]

Total images: 12894


For now we only need the filename without the `.JPEG` ending to check against our database

In [26]:
imgn_pict_str = [filename.name.split(".")[0] for filename in imgn_pict]
imgn_pict_str[:10]

['n03394916_23108',
 'n03394916_59626',
 'n03394916_5529',
 'n03394916_36154',
 'n03394916_32997',
 'n03394916_61934',
 'n03394916_59745',
 'n03394916_21289',
 'n03394916_29164',
 'n03394916_7547']

In [27]:
print(len(imgn_pict_str))

12894
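As a side note, `pathlib` offers `Path.stem` for dropping the file extension, which can be a little more readable than splitting on `"."`. The sketch below uses two hypothetical paths that mimic the imagenette2 layout and applies the same `_val_` filter as above.

```python
from pathlib import Path

# Hypothetical paths that mimic the imagenette2 folder layout
paths = [
    Path("data/imagenette2/train/n03394916/n03394916_23108.JPEG"),
    Path("data/imagenette2/val/n03394916/ILSVRC2012_val_00000123.JPEG"),
]

# Same filter as above, then take the name without its final suffix
train_only = [p for p in paths if "_val_" not in p.name]
stems = [p.stem for p in train_only]
print(stems)  # ['n03394916_23108']
```

Note that `.stem` strips only the last suffix, so the two approaches differ for names that contain more than one dot.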


We only need those that are present in imagenette

In [29]:
complete_data = pd.read_pickle(save_path / "imagenet_boundingboxes.pkl")

In [30]:
imgn_data = complete_data.loc[complete_data['filename'].isin(imgn_pict_str)]
imgn_data[:5]

Unnamed: 0,folder,filename,segmented,source.database,size.width,size.height,size.depth,object.name,object.pose,object.truncated,object.difficult,object.bndbox.xmin,object.bndbox.ymin,object.bndbox.xmax,object.bndbox.ymax,object
0,n03445777,n03445777_5901,0,ImageNet database,500,334,3,n03445777,Unspecified,0,0,249,112,455,303,
0,n03445777,n03445777_8145,0,ImageNet database,500,375,3,n03445777,Unspecified,0,0,79,127,294,339,
0,n03445777,n03445777_3928,0,ImageNet database,500,333,3,n03445777,Unspecified,0,0,131,148,238,256,
0,n03445777,n03445777_10304,0,ImageNet database,500,375,3,n03445777,Unspecified,0,0,64,0,341,265,
0,n03445777,n03445777_9971,0,ImageNet database,500,375,3,n03445777,Unspecified,0,0,95,41,379,330,


Also, let us filter out rows with `NaN`s in the bounding box and image size columns

In [31]:
# Columns that must be present for a row to be usable
required_cols = ['object.bndbox.xmax', 'object.bndbox.xmin', 'object.bndbox.ymin', 'size.height', 'size.width']
imgn_data_flt = imgn_data[imgn_data[required_cols].notnull().all(axis=1)]
len(imgn_data_flt)

4312
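The boolean-mask filter can equivalently be written with `DataFrame.dropna(subset=...)`. The following sketch compares both forms on a hypothetical three-row miniature of the annotation table; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the annotation table
df = pd.DataFrame({
    "filename": ["a", "b", "c"],
    "object.bndbox.xmin": [10, 20, 30],
    "object.bndbox.xmax": [50.0, np.nan, 90.0],
})
box_cols = ["object.bndbox.xmin", "object.bndbox.xmax"]

# Boolean-mask form: keep rows where every listed column is non-null
masked = df[df[box_cols].notnull().all(axis=1)]

# Equivalent dropna form: drop rows with a NaN in any listed column
dropped = df.dropna(subset=box_cols)

print(masked.equals(dropped))  # True
print(len(dropped))  # 2
```

Both variants keep the original row labels, which is why the index needs to be reset afterwards.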

Fix the index again, since we dropped rows

In [32]:
# Reassign instead of modifying in place, since imgn_data_flt is a filtered slice of imgn_data
imgn_data_flt = imgn_data_flt.reset_index(drop=True)
imgn_data_flt[:5]

Unnamed: 0,folder,filename,segmented,source.database,size.width,size.height,size.depth,object.name,object.pose,object.truncated,object.difficult,object.bndbox.xmin,object.bndbox.ymin,object.bndbox.xmax,object.bndbox.ymax,object
0,n03445777,n03445777_5901,0,ImageNet database,500,334,3,n03445777,Unspecified,0,0,249,112,455,303,
1,n03445777,n03445777_8145,0,ImageNet database,500,375,3,n03445777,Unspecified,0,0,79,127,294,339,
2,n03445777,n03445777_3928,0,ImageNet database,500,333,3,n03445777,Unspecified,0,0,131,148,238,256,
3,n03445777,n03445777_10304,0,ImageNet database,500,375,3,n03445777,Unspecified,0,0,64,0,341,265,
4,n03445777,n03445777_9971,0,ImageNet database,500,375,3,n03445777,Unspecified,0,0,95,41,379,330,


#### Save data to disk

Write the `DataFrame` to a `.csv` file

In [33]:
imgn_data_flt.to_csv(save_path / "imagenette_boundingboxes.csv", encoding='utf-8', index=False)

Write the `DataFrame` to a `.pkl` file

In [34]:
imgn_data_flt.to_pickle(save_path / "imagenette_boundingboxes.pkl")
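A quick way to check that the files were written correctly is to load them back and compare them with the original. The sketch below does this in a temporary folder with a hypothetical two-row stand-in for `imgn_data_flt`; the file names `demo.csv` and `demo.pkl` are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for imgn_data_flt
df = pd.DataFrame({
    "filename": ["n03445777_5901", "n03445777_8145"],
    "object.bndbox.xmin": [249, 79],
    "object.bndbox.xmax": [455.0, 294.0],
})

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    df.to_csv(tmp / "demo.csv", encoding="utf-8", index=False)
    df.to_pickle(tmp / "demo.pkl")
    from_csv = pd.read_csv(tmp / "demo.csv")
    from_pkl = pd.read_pickle(tmp / "demo.pkl")

# Pickle preserves dtypes exactly; CSV re-infers them on load
print(from_pkl.equals(df))
print(from_csv.shape == df.shape)
```

The pickle is the safer choice when exact dtypes matter, while the CSV is easier to inspect and share.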

We could also use the `feather` format...