Downloading and using the ImageNet images today (04/18/2020) is surprisingly difficult. The ImageNet website requires you to request access, but I never got approval. They also publish a list of image URLs, but some of them are no longer valid. The remaining option is `academictorrents.com`; I have not been able to find an alternative source for these images. This notebook documents how I downloaded and formatted the data.

NOTE: Just use the 2017 torrent. It contains the same data as the 2012 set, it's already organized, and it includes additional annotations.

# Import

In [67]:
from pathlib import Path
import shutil
import os
import pandas as pd

# Config

In [7]:
dir_data = Path('/data')

# LSVRC 2012

This appears to be the set released in 2012. I used the following two links:

* training - [https://academictorrents.com/details/a306397ccf9c2ead27155983c254227c0fd938e2](https://academictorrents.com/details/a306397ccf9c2ead27155983c254227c0fd938e2)
* validation - [https://academictorrents.com/details/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5](https://academictorrents.com/details/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5)

Even though these are labeled "object detection", they contain just the images, with the class labels based on the file names.

In [19]:
dir_imagenet_2012 = dir_data/'IMAGENET'/'2012'
dir_imagenet_2012.mkdir(parents=True, exist_ok=True)

### Download and format the training data

In [20]:
dir_trn = dir_imagenet_2012/'TRN'
dir_trn.mkdir(exist_ok=True)
file_trn_torrent = dir_trn/'trn.torrent'
file_trn_tar = dir_trn/'trn.tar'

Download the torrent file:

In [None]:
!wget https://academictorrents.com/download/a306397ccf9c2ead27155983c254227c0fd938e2.torrent -O {file_trn_torrent}

Use `ctorrent` to download the tar file:

In [None]:
!ctorrent {file_trn_torrent} -s {file_trn_tar}

Untar file:

In [None]:
!tar -xvf {file_trn_tar} -C {dir_trn}
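
If the `tar` binary is not available, the same step can be sketched with Python's standard `tarfile` module. This is a minimal sketch, not what the notebook actually ran; the helper name `extract_tar` is my own, and it assumes the archive comes from a source you trust, since `extractall` writes whatever paths the archive contains.

```python
import tarfile
from pathlib import Path


def extract_tar(file_tar: Path, dir_out: Path) -> int:
    """Extract every member of file_tar into dir_out; return the member count."""
    dir_out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(file_tar) as tar:
        members = tar.getmembers()
        tar.extractall(dir_out)
    return len(members)
```

With the variables defined above, the call would be `extract_tar(file_trn_tar, dir_trn)`.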

Remove tar and torrent file:

In [25]:
file_trn_torrent.unlink()
file_trn_tar.unlink()

In [38]:
files_class_tar = list(dir_trn.glob('*.tar'))
len(files_class_tar)

1000

There are 1000 classes, so each tar appears to correspond to one class.

In [41]:
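def _process_class_tar(file_class_tar):
    # Each class archive (e.g. n01440764.tar) gets its own folder named after the synset
    dir_class = dir_trn/file_class_tar.stem
    dir_class.mkdir(exist_ok=True)
    # unpack_archive raises on failure, so the tar is only deleted after a successful extract
    # (os.system would ignore a failed tar call and the archive would be deleted anyway)
    shutil.unpack_archive(file_class_tar.as_posix(), dir_class.as_posix(), format='tar')
    file_class_tar.unlink()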

In [42]:
for file_class_tar in files_class_tar:
    _process_class_tar(file_class_tar)
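
After the loop finishes, it is worth checking that every class folder actually received images. The following is a minimal sketch of such a check; the helper name `count_images_per_class` is hypothetical, and the `*.JPEG` pattern is an assumption based on the file names that appear in `trn.txt`.

```python
from pathlib import Path


def count_images_per_class(dir_root: Path, pattern: str = '*.JPEG') -> dict:
    """Map each class folder name under dir_root to its number of image files."""
    return {
        dir_class.name: sum(1 for _ in dir_class.glob(pattern))
        for dir_class in sorted(dir_root.iterdir())
        if dir_class.is_dir()  # skip loose files such as label lists
    }
```

Calling `count_images_per_class(dir_trn)` should return 1000 entries, one per class tar found above, and none of the counts should be zero.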

NOTE: Do not rename the class folders to the class names. For example, there are two "crane" classes, one for the animal and the other for the machine, so the names are not unique.
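
To see which names collide, one can group the synsets by their first listed name. This is a minimal sketch that assumes lines in the `synset words` format of `synset_words.txt` (used later in this notebook). The helper name `find_duplicate_names` is hypothetical, and the two `crane` synset IDs in the sample are quoted from memory, so verify them against your own copy of the file.

```python
from collections import defaultdict


def find_duplicate_names(lines):
    """Return {first_name: [synsets]} for names shared by more than one synset."""
    by_name = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first space: the words part contains spaces itself
        synset, words = line.split(' ', 1)
        first_name = words.split(',')[0].strip()
        by_name[first_name].append(synset)
    return {name: synsets for name, synsets in by_name.items() if len(synsets) > 1}


sample = [
    'n01440764 tench, Tinca tinca',
    'n02012849 crane',  # assumed ID for the bird
    'n03126707 crane',  # assumed ID for the machine
]
print(find_duplicate_names(sample))  # {'crane': ['n02012849', 'n03126707']}
```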

Copy class labels to train directory

In [48]:
shutil.copy('train.txt', dir_trn/'trn.txt')

PosixPath('/data/IMAGENET/2012/TRN/trn.txt')

### Download and format the validation data

In [49]:
dir_val = dir_imagenet_2012/'VAL'
dir_val.mkdir(exist_ok=True)
file_val_torrent = dir_val/'val.torrent'
file_val_tar = dir_val/'val.tar'

Download the torrent file:

In [None]:
!wget https://academictorrents.com/download/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5.torrent -O {file_val_torrent}

Use `ctorrent` to download the tar file:

In [None]:
!ctorrent {file_val_torrent} -s {file_val_tar}

Untar file:

In [None]:
!tar -xvf {file_val_tar} -C {dir_val}

Remove tar and torrent file:

In [56]:
file_val_torrent.unlink()
file_val_tar.unlink()

Copy class labels to validation directory

In [57]:
shutil.copy('val.txt', dir_val/'val.txt')

PosixPath('/data/IMAGENET/2012/VAL/val.txt')
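
A quick way to sanity-check the label file is to tally how many images carry each class code. This sketch assumes `val.txt` uses the same `path code` line format that `trn.txt` shows later in this notebook, which I have not verified here; the helper name `count_labels` is hypothetical.

```python
from collections import Counter


def count_labels(lines):
    """Tally how many files carry each integer class code in 'path code' lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The code is the last space-separated field on the line
        _path, code = line.rsplit(' ', 1)
        counts[int(code)] += 1
    return counts
```

For example, `count_labels((dir_val/'val.txt').read_text().splitlines())` should give 1000 distinct codes if the file covers every class. The 2012 validation set is usually described as 50,000 images, 50 per class, so the counts should be uniform.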

### Format the test data

After some googling, I was able to find a test set hosted here:

* test - [http://169.44.201.108:7002/imagenet/test/ILSVRC2012_img_test.tar](http://169.44.201.108:7002/imagenet/test/ILSVRC2012_img_test.tar)

It also hosts the training and validation sets, but I'm not sure how long it will stay online, so I'll leave the academic torrent instructions above in place.
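
The notebook does not show how the test tar was fetched. One possible way, sketched here with only the standard library, is to stream the file to disk in chunks. The helper name `download_file` and the chunk size are my own choices, and the host above may no longer respond.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, file_out: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream url to file_out in chunks so the tar never sits fully in memory."""
    file_out.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(file_out, 'wb') as f:
        shutil.copyfileobj(response, f, chunk_size)
    return file_out
```

The target here would be `dir_imagenet_2012/'TST'/'ILSVRC2012_img_test.tar'`, which is the path the next cell expects.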

In [59]:
dir_tst = dir_imagenet_2012/'TST'
file_tst_tar = dir_tst/'ILSVRC2012_img_test.tar'

In [None]:
!tar -xvf {file_tst_tar} -C {dir_tst}

In [61]:
file_tst_tar.unlink()

In [62]:
shutil.copy('test.txt', dir_tst/'tst.txt')

PosixPath('/data/IMAGENET/2012/TST/tst.txt')

### Match codes to synsets and words

In [64]:
shutil.copy('synset_words.txt', dir_imagenet_2012)

'/data/IMAGENET/2012/synset_words.txt'

Read synset to words table

In [65]:
file_synset_words = dir_imagenet_2012/'synset_words.txt'

In [72]:
df_synset_words = pd.read_table(file_synset_words, header=None)

In [81]:
# Split on the first space only; `n` must be passed by keyword in pandas 2.0 and later
df_synset_words[['synset','words']] = df_synset_words[0].str.split(' ', n=1, expand=True)
df_synset_words.drop(columns=[0], inplace=True)
df_synset_words.head()

|   | synset | words |
|---|---|---|
| 0 | n01440764 | tench, Tinca tinca |
| 1 | n01443537 | goldfish, Carassius auratus |
| 2 | n01484850 | great white shark, white shark, man-eater, man... |
| 3 | n01491361 | tiger shark, Galeocerdo cuvieri |
| 4 | n01494475 | hammerhead, hammerhead shark |


Get codes dataframes; start by reading the train text file

In [97]:
df_synset_codes = pd.read_table(dir_imagenet_2012/'TRN'/'trn.txt', header=None)

In [98]:
# Split on the first space only; `n` must be passed by keyword in pandas 2.0 and later
df_synset_codes[['path','code']] = df_synset_codes[0].str.split(' ', n=1, expand=True)
df_synset_codes.drop(columns=[0], inplace=True)
df_synset_codes.head()

|   | path | code |
|---|---|---|
| 0 | n01440764/n01440764_10026.JPEG | 0 |
| 1 | n01440764/n01440764_10027.JPEG | 0 |
| 2 | n01440764/n01440764_10029.JPEG | 0 |
| 3 | n01440764/n01440764_10040.JPEG | 0 |
| 4 | n01440764/n01440764_10042.JPEG | 0 |


In [99]:
# The synset is the folder part of the path, i.e. everything before the first '/'
df_synset_codes['synset'] = df_synset_codes['path'].str.split('/', n=1, expand=True)[0]
df_synset_codes.drop(columns=['path'], inplace=True)
df_synset_codes.head()

|   | code | synset |
|---|---|---|
| 0 | 0 | n01440764 |
| 1 | 0 | n01440764 |
| 2 | 0 | n01440764 |
| 3 | 0 | n01440764 |
| 4 | 0 | n01440764 |


In [101]:
df_synset_codes = df_synset_codes.drop_duplicates()
df_synset_codes.head()

|   | code | synset |
|---|---|---|
| 0 | 0 | n01440764 |
| 1300 | 1 | n01443537 |
| 2600 | 2 | n01484850 |
| 3900 | 3 | n01491361 |
| 5200 | 4 | n01494475 |
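
Note that `drop_duplicates` would silently keep two rows if a synset were ever paired with two different codes. A plain-Python cross-check that raises in that case could look like the sketch below. The helper name `synset_to_code` is hypothetical, and the last sample line uses a made-up file name.

```python
def synset_to_code(lines):
    """Build {synset: code} from 'synset/file code' lines, checking consistency."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, code = line.rsplit(' ', 1)
        synset = path.split('/', 1)[0]
        code = int(code)
        # setdefault returns the stored code, so a mismatch means a conflict
        if mapping.setdefault(synset, code) != code:
            raise ValueError(f'{synset} maps to both {mapping[synset]} and {code}')
    return mapping


sample = [
    'n01440764/n01440764_10026.JPEG 0',
    'n01440764/n01440764_10027.JPEG 0',
    'n01443537/n01443537_00001.JPEG 1',  # hypothetical file name
]
print(synset_to_code(sample))  # {'n01440764': 0, 'n01443537': 1}
```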


In [107]:
df_synset_words_and_codes = df_synset_words.merge(df_synset_codes, on='synset')
df_synset_words_and_codes

|   | synset | words | code |
|---|---|---|---|
| 0 | n01440764 | tench, Tinca tinca | 0 |
| 1 | n01443537 | goldfish, Carassius auratus | 1 |
| 2 | n01484850 | great white shark, white shark, man-eater, man... | 2 |
| 3 | n01491361 | tiger shark, Galeocerdo cuvieri | 3 |
| 4 | n01494475 | hammerhead, hammerhead shark | 4 |
| ... | ... | ... | ... |
| 995 | n13044778 | earthstar | 995 |
| 996 | n13052670 | hen-of-the-woods, hen of the woods, Polyporus ... | 996 |
| 997 | n13054560 | bolete | 997 |
| 998 | n13133613 | ear, spike, capitulum | 998 |


In [108]:
df_synset_words_and_codes.to_csv(dir_imagenet_2012/'synset_words_and_codes.csv', index=False)

In [109]:
shutil.copy(dir_imagenet_2012/'synset_words_and_codes.csv', 'synset_words_and_codes.csv')

'synset_words_and_codes.csv'
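
Downstream code can turn the saved CSV into a lookup from a predicted class code to its synset and words. This is a minimal sketch using only the standard library. It assumes the `synset`, `words`, and `code` column names written above, and the helper name `load_code_lookup` is hypothetical.

```python
import csv
from pathlib import Path


def load_code_lookup(file_csv: Path) -> dict:
    """Read synset_words_and_codes.csv into {code: (synset, words)}."""
    with open(file_csv, newline='') as f:
        return {
            int(row['code']): (row['synset'], row['words'])
            for row in csv.DictReader(f)
        }
```

For example, `load_code_lookup(Path('synset_words_and_codes.csv'))[0]` should return the `n01440764` entry shown in the table above.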

# LSVRC 2017

After doing all of the above, it turns out that the following torrent contains the same data as the 2012 set:

* [https://academictorrents.com/details/943977d8c96892d24237638335e481f3ccd54cfb](https://academictorrents.com/details/943977d8c96892d24237638335e481f3ccd54cfb)

Further, the data are already organized and include annotations, so I'd go this route.