------------------
#### Building our dataset

- Dogs vs. Cats dataset (Kaggle): https://www.kaggle.com/competitions/dogs-vs-cats/data

---------------------------

In [2]:
import glob
import numpy as np
import os
import shutil
from utils import log_progress

In [3]:
location_train = r'D:\AI-DATASETS\02-MISC-large\keras\datasets\cats-dogs-data-LARGE\train'

In [4]:
# List every entry in the training folder
files = glob.glob(os.path.join(location_train, '*'))

In [5]:
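# Match on the file name only, so 'cat.' or 'dog.' elsewhere in the path cannot cause a false match
cat_files = [fn for fn in files if os.path.basename(fn).startswith('cat.')]
dog_files = [fn for fn in files if os.path.basename(fn).startswith('dog.')]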

len(cat_files), len(dog_files)

(12500, 12500)

We can verify with the preceding output that we have 12,500 images for each category.
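
As a cross-check, here is a minimal sketch that counts files per class in one pass, keyed on the file-name prefix. It assumes the Kaggle naming scheme `cat.<n>.jpg` / `dog.<n>.jpg`; `count_by_label` and the example paths are hypothetical and not part of the notebook.

```python
import os
from collections import Counter

def count_by_label(paths):
    """Count files per label, where the label is the text before the first dot in the file name."""
    return Counter(os.path.basename(p).split('.')[0] for p in paths)

# Made-up paths, for illustration only
example_paths = ['data/cat.0.jpg', 'data/cat.1.jpg', 'data/dog.0.jpg']
example_counts = count_by_label(example_paths)
print(example_counts)
```

On the real folder, `count_by_label(files)` should report the same 12,500 per class as the list comprehensions.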

Let's now build a smaller dataset so that we have
- 1,500 images for training,
- 200 images for validation, and
- 100 images for our test dataset

(with equal representation for the two animal categories):

**Get training samples** 
- 750 each of cat and dog

In [6]:
cat_train = np.random.choice(cat_files, size=750, replace=False)
dog_train = np.random.choice(dog_files, size=750, replace=False)

In [7]:
cat_files = list(set(cat_files) - set(cat_train))
dog_files = list(set(dog_files) - set(dog_train))

In [8]:
len(cat_files), len(dog_files)

(11750, 11750)

**Get validation samples**
- 100 each of cat and dog

In [9]:
cat_val = np.random.choice(cat_files, size=100, replace=False)
dog_val = np.random.choice(dog_files, size=100, replace=False)

In [10]:
cat_files = list(set(cat_files) - set(cat_val))
dog_files = list(set(dog_files) - set(dog_val))

In [11]:
len(cat_files), len(dog_files)

(11650, 11650)

**Get test samples**
- 50 each of cat and dog

In [12]:
cat_test = np.random.choice(cat_files, size=50, replace=False)
dog_test = np.random.choice(dog_files, size=50, replace=False)

In [13]:
print('Cat datasets:', cat_train.shape, cat_val.shape, cat_test.shape)
print('Dog datasets:', dog_train.shape, dog_val.shape, dog_test.shape)

Cat datasets: (750,) (100,) (50,)
Dog datasets: (750,) (100,) (50,)
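
The sample-then-remove steps above draw a different split on every run, because no random seed is set. One possible alternative, sketched here under assumptions, is to shuffle each class once with a seeded generator and slice it into three non-overlapping parts. `split_disjoint`, its `seed` default, and the example names are hypothetical and not part of the notebook.

```python
import numpy as np

def split_disjoint(items, n_train, n_val, n_test, seed=42):
    """Shuffle once with a seeded generator, then slice into three non-overlapping parts."""
    items = np.asarray(items)
    if n_train + n_val + n_test > len(items):
        raise ValueError('requested more samples than are available')
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(items)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Made-up file names, for illustration only
names = [f'cat.{i}.jpg' for i in range(1000)]
train, val, test = split_disjoint(names, 750, 100, 50)
print(len(train), len(val), len(test))
```

Because the three slices come from a single permutation, they cannot overlap, and the same seed reproduces the same split.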


Let's write them out to disk in separate folders, so that we can come back to them at any time without depending on what is held in main memory.

**Directory names for the train/val/test sets**

In [14]:
train_dir = os.path.join(location_train, 'training_data')
val_dir   = os.path.join(location_train, 'validation_data')
test_dir  = os.path.join(location_train, 'test_data')

**Delete the train/val/test folders if they already exist**

In [15]:
for d in (train_dir, val_dir, test_dir):
    if os.path.isdir(d):
        shutil.rmtree(d)

**Create the train/val/test folders if they do not exist yet**

In [16]:
for d in (train_dir, val_dir, test_dir):
    os.makedirs(d, exist_ok=True)
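
The delete-then-create pattern above can be wrapped in one helper. This is a minimal sketch that runs against a temporary directory so it does not touch the real dataset; `reset_dir` is a hypothetical name, not something the notebook defines.

```python
import shutil
import tempfile
from pathlib import Path

def reset_dir(path):
    """Delete `path` if it exists, then recreate it as an empty directory."""
    path = Path(path)
    if path.is_dir():
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return path

# Demonstration in a throwaway location
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'training_data'
    target.mkdir()
    (target / 'stale.txt').write_text('left over from an earlier run')
    reset_dir(target)
    remaining = list(target.iterdir())  # contents after the reset

print(remaining)
```

Starting from an empty folder on every run keeps files from an earlier random split from mixing with the new one.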

In [17]:
train_files    = np.concatenate([cat_train, dog_train])
validate_files = np.concatenate([cat_val,   dog_val])
test_files     = np.concatenate([cat_test,  dog_test])

In [18]:
for fn in log_progress(train_files,    name='Training Images'):
    shutil.copy(fn, train_dir)
    
for fn in log_progress(validate_files, name='Validation Images'):
    shutil.copy(fn, val_dir)
    
for fn in log_progress(test_files,     name='Test Images'):
    shutil.copy(fn, test_dir)

VBox(children=(HTML(value=''), IntProgress(value=0, max=1500)))

VBox(children=(HTML(value=''), IntProgress(value=0, max=200)))

VBox(children=(HTML(value=''), IntProgress(value=0)))
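
After copying, it can be worth confirming that each folder holds the expected number of files. The sketch below shows the idea on two placeholder files in a temporary directory; `copy_files` is a hypothetical helper and the file contents are dummies, not real images.

```python
import os
import shutil
import tempfile

def copy_files(paths, dest_dir):
    """Copy each file into dest_dir and return how many entries dest_dir then contains."""
    os.makedirs(dest_dir, exist_ok=True)
    for p in paths:
        shutil.copy(p, dest_dir)
    return len(os.listdir(dest_dir))

with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, 'src')
    os.makedirs(src_dir)
    paths = []
    for name in ('cat.0.jpg', 'dog.0.jpg'):  # placeholder files, not real images
        p = os.path.join(src_dir, name)
        with open(p, 'w') as f:
            f.write('placeholder')
        paths.append(p)
    copied = copy_files(paths, os.path.join(tmp, 'training_data'))

print(copied)
```

For the real data, the equivalent check would compare `len(os.listdir(train_dir))` with `len(train_files)`, and likewise for the validation and test folders.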