# Get the subset of ImageNet from Kaggle
#### Note: to extract the subset we need, this notebook first downloads
#### the entire dataset from [Kaggle](https://www.kaggle.com/c/imagenet-object-localization-challenge/overview/description)
#### and then copies out only the required images.

## 0. Import necessary modules

In [None]:
import os
import zipfile
import shutil
import pandas as pd
from PIL import Image
import random
import warnings

# Set the seed
random.seed(42)
# Filter warnings
warnings.filterwarnings("ignore")


## 1. Install Kaggle

In [None]:
!pip install kaggle


## 2. Obtain the Kaggle API token (manual download)

#### * [Log in](https://www.kaggle.com/account/login) to the Kaggle website.
#### * Click on your profile picture at the top right and navigate to **Settings**.
#### * Scroll down to the **API** section and click on **Create New Token** (if asked "Are you sure?", choose to continue).
#### * This downloads a kaggle.json file to your computer.

## 3. Move the "kaggle.json" file to "~/.kaggle" directory

Create the .kaggle directory in your home folder

In [3]:
!mkdir -p ~/.kaggle

Move the kaggle.json file. Replace '/Users/your_user/Downloads/kaggle.json' with the actual path to the downloaded file.

In [4]:
!mv /Users/your_user/Downloads/kaggle.json ~/.kaggle/

Set the file permissions

In [5]:
!chmod 600 ~/.kaggle/kaggle.json
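Before starting the large download, it can help to confirm that the token is where the steps above put it and that it is readable only by its owner. The following is a minimal sketch, assuming a POSIX system (as the `mkdir`, `mv` and `chmod` commands above do); the helper name `check_kaggle_token` is hypothetical and not part of the Kaggle package.

```python
import stat
from pathlib import Path


def check_kaggle_token(token_path=None):
    """Return 'missing', 'too permissive' or 'ok' for the Kaggle token file."""
    # Default location used by the steps above
    path = Path(token_path) if token_path else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return "missing"
    # Keep only the permission bits (for example 0o600)
    mode = stat.S_IMODE(path.stat().st_mode)
    # Any bit set for group or others means other users can access the token
    if mode & 0o077:
        return "too permissive"
    return "ok"


print(check_kaggle_token())
```

A result of `ok` corresponds to the state left by `chmod 600`; anything else suggests repeating step 3.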

## 4. Download the full dataset

In [None]:
!kaggle competitions download -c imagenet-object-localization-challenge


## 5. Extract (unzip) the downloaded file

In [3]:
# Path to your zip file
zip_file_path = 'imagenet-object-localization-challenge.zip'
# Directory to extract to
extract_to_dir = '.'
os.makedirs(extract_to_dir, exist_ok=True)

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to_dir)

print(f"Files extracted to {extract_to_dir}")


Files extracted to .


## 6. Copy a subset to the "resources/datasets" directory (and resize the images to 300x300)

Install Pillow (the `PIL` package imported in step 0), which is used to resize the images

In [None]:
!pip install Pillow


Load the list of required images

In [7]:
images_df = pd.read_csv('subset_of_imagenet_images_list.csv')
images_df.head()


Unnamed: 0,image_name_full,image_folder,animal,phase_source,phase_destination
0,n02124075_8866.JPEG,n02124075,cat,train,test
1,n02124075_13357.JPEG,n02124075,cat,train,test
2,n02124075_8133.JPEG,n02124075,cat,train,test
3,n02123045_1932.JPEG,n02123045,cat,train,test
4,n02123597_5678.JPEG,n02123597,cat,train,test
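Since the CSV names every required file, an alternative to unpacking the whole archive in step 5 is to extract only the listed members. The sketch below assumes that the member names inside the zip mirror the extracted layout used later in this notebook (`ILSVRC/Data/CLS-LOC/<phase_source>/<image_folder>/<image_name_full>`); this has not been checked against the actual archive, and the helper name `extract_listed_images` is hypothetical.

```python
import zipfile


def extract_listed_images(zip_path, member_names, extract_to_dir="."):
    """Extract only the given archive members; return the names not found."""
    wanted = set(member_names)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        available = set(zip_ref.namelist())
        # Extract only the members that really exist in the archive
        found = sorted(wanted & available)
        zip_ref.extractall(extract_to_dir, members=found)
    # Names that were requested but are not in the archive
    return sorted(wanted - available)


# Possible usage with the DataFrame loaded above (member layout is an assumption):
# members = [
#     f"ILSVRC/Data/CLS-LOC/{row.phase_source}/{row.image_folder}/{row.image_name_full}"
#     for row in images_df.itertuples()
# ]
# missing = extract_listed_images("imagenet-object-localization-challenge.zip", members)
# print(f"{len(missing)} listed images were not found in the archive")
```

A non-empty `missing` list would indicate that the assumed layout does not match the archive, in which case the full extraction from step 5 remains the safe route.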


Resize and copy the required images to the destination directory

In [9]:
datasets_main_dir = '../resources/datasets'

# Path to the directory where the unzipped files are stored
dataset_dir = extract_to_dir + '/ILSVRC/Data/CLS-LOC'
# Path to the directory where you want to store the subset
subset_dir = datasets_main_dir + '/subset_of_imagenet'
os.makedirs(subset_dir, exist_ok=True)

# Iterate over the rows; counting from 1 keeps the progress message accurate
for i, img_row in enumerate(images_df.itertuples(index=False), start=1):
    # Build the source path
    source_path = os.path.join(dataset_dir, img_row.phase_source, img_row.image_folder, img_row.image_name_full)

    # Build the destination path (same file name, with a .png extension)
    destination_dir = os.path.join(subset_dir, img_row.phase_destination, img_row.animal)
    destination_name = os.path.splitext(img_row.image_name_full)[0] + '.png'
    destination_path = os.path.join(destination_dir, destination_name)
    os.makedirs(destination_dir, exist_ok=True)

    # Resize and save the image; the context manager closes the source file
    with Image.open(source_path) as img:
        resized_img = img.resize((300, 300), Image.Resampling.LANCZOS)
        resized_img.save(destination_path)
    if i % 500 == 0:
        print(f'{i} images already copied')

print(f'finished! check {subset_dir} folder!')
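After the copy finishes, a quick way to check the result is to count the images in each phase/animal folder. This is a small sketch, assuming the `<phase>/<animal>/*.png` layout produced by the cell above; the helper name `count_images` is hypothetical.

```python
import os
from collections import Counter


def count_images(subset_dir, extension=".png"):
    """Count image files per sub-folder, keyed by the path relative to subset_dir."""
    counts = Counter()
    for dir_path, _dir_names, files in os.walk(subset_dir):
        # Count only files with the expected extension (this skips .DS_Store etc.)
        n = sum(1 for f in files if f.lower().endswith(extension))
        if n:
            counts[os.path.relpath(dir_path, subset_dir)] += n
    return dict(counts)


# Possible usage with the directory created above:
# for folder, n in sorted(count_images(subset_dir).items()):
#     print(f"{folder}: {n} images")
```

Comparing these counts with `images_df.groupby(['phase_destination', 'animal']).size()` shows whether every listed image was written.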


## ------------------ Sample a Small (Toy) Dataset ------------------
#### If you wish to work with only a few samples, run the following code as well.

In [10]:
# how many samples to take from each type of animal
n_samples_train = 49
n_samples_validation = 7
n_samples_test = 14
n_samples_train_full = n_samples_train + n_samples_validation


subset_sample_dir = datasets_main_dir + '/subset_of_imagenet_sample'
os.makedirs(subset_sample_dir, exist_ok=True)

def copy_images(source_dir, destination_dir, image_files):
    for file_name in image_files:
        source_file = os.path.join(source_dir, file_name)
        destination_file = os.path.join(destination_dir, file_name)
        shutil.copy(source_file, destination_file)

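for dir_path, dir_names, files in os.walk(subset_dir):
    # Ignore hidden files such as .DS_Store instead of skipping the whole folder
    files = [f for f in files if not f.startswith('.')]
    if files:
        # The last two path components are the phase and the animal
        phase = os.path.basename(os.path.dirname(dir_path))
        animal = os.path.basename(dir_path)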
        
        print(f'phase: {phase}, animal: {animal}')
        
        if phase == 'train_full':
            n = n_samples_train_full
        elif phase == 'train':
            n = n_samples_train
        elif phase == 'val':
            n = n_samples_validation
        else:
            n = n_samples_test
            
        # sample image file names (never more than the folder contains)
        image_files = random.sample(files, min(n, len(files)))
        # copy images to the sample folder
        destination_dir = os.path.join(subset_sample_dir, phase, animal)
        os.makedirs(destination_dir, exist_ok=True)
        copy_images(dir_path, destination_dir, image_files)

print(f'finished! check {subset_sample_dir} folder!')

phase: test, animal: cat
phase: test, animal: butterfly
phase: test, animal: dog
phase: test, animal: spider
phase: test, animal: chicken
phase: test, animal: squirrel
phase: test, animal: elephant
phase: train_full, animal: cat
phase: train_full, animal: butterfly
phase: train_full, animal: dog
phase: train_full, animal: spider
phase: train_full, animal: chicken
phase: train_full, animal: squirrel
phase: train_full, animal: elephant
phase: train, animal: cat
phase: train, animal: butterfly
phase: train, animal: dog
phase: train, animal: spider
phase: train, animal: chicken
phase: train, animal: squirrel
phase: train, animal: elephant
phase: val, animal: cat
phase: val, animal: butterfly
phase: val, animal: dog
phase: val, animal: spider
phase: val, animal: chicken
phase: val, animal: squirrel
phase: val, animal: elephant
finished! check ../resources/datasets/subset_of_imagenet_sample folder!
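The order in which `os.walk` lists files depends on the file system, so the same seed can still select different images on another machine. A minimal sketch of a more reproducible variant follows, assuming that sorting the file names first is acceptable; the helper name `sample_file_names` is hypothetical.

```python
import random


def sample_file_names(files, n, seed=42):
    """Pick up to n file names in a way that does not depend on listing order."""
    # Drop hidden files such as .DS_Store and sort for a stable starting order
    candidates = sorted(f for f in files if not f.startswith("."))
    # A local generator keeps the result independent of the global random state
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))


# Possible usage inside the loop above:
# image_files = sample_file_names(files, n)
```

With this variant, each folder's sample depends only on its file names, `n` and `seed`.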


## 7. Remove the full dataset, since all the images we need have been copied

In [13]:
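import shutil

# Point this at the extracted ILSVRC folder. With extract_to_dir = '.' from step 5
# that is './ILSVRC'; the run recorded below used '../resources/ILSVRC'.
shutil.rmtree('../resources/ILSVRC')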

OSError: [Errno 66] Directory not empty: '../resources/ILSVRC/Annotations/CLS-LOC/train'
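The cause of the error above is not recorded in this notebook. One possibility (an assumption, not confirmed here) is that another process wrote into the folder while it was being deleted, in which case simply retrying may succeed. The sketch below retries the removal a few times; the helper name `remove_tree` and its parameters are hypothetical.

```python
import os
import shutil
import time


def remove_tree(path, retries=3, delay_seconds=1.0):
    """Delete a directory tree, retrying on OSError. Return True once the path is gone."""
    for attempt in range(retries):
        if not os.path.exists(path):
            return True
        try:
            shutil.rmtree(path)
        except OSError as err:
            # Report the failure and wait briefly before the next attempt
            print(f"attempt {attempt + 1} failed: {err}")
            time.sleep(delay_seconds)
    return not os.path.exists(path)


# Possible usage (adjust the path to where the archive was extracted):
# remove_tree('../resources/ILSVRC')
```

If the function still returns `False`, the remaining files can be inspected manually before deleting the folder by hand.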