# Flowers image dataset tool notebook

This notebook rearranges the original flowers image dataset into a new one: a copy of the source with differently sized train, validation, and test sets.

The original PyTorch Scholarship Challenge dataset is available [here](https://s3.amazonaws.com/content.udacity-data.com/courses/nd188/flower_data.zip).


## Two options are available:
- **OPTION 1** Create a new dataset with specific set sizes (as percentages of the total source dataset) for train, validation, and test. Optionally, the new dataset can be class balanced, with the same number of items for each class.

- **OPTION 2** Create a copy of the source dataset that preserves the original train, validation, and test contents but has the same number of items for each class.



**Feel free to change this notebook according to your needs. The code is not optimized, but it works! :)**

In [1]:
# imports
import os.path
import pandas as pd
import random

from shutil import copyfile

In [2]:
# Source root folder where "train", "valid" and "test" subfolders are located. 
data_folder = 'flower_original'
# Destination root folder where new sets of images will be copied.
new_folder  = 'new_flowers'

# PLEASE SELECT HOW TO CREATE THE NEW DATASET - UNCOMMENT ONE OF THE FOLLOWING
option = 1
#option = 2

######## Option 1 only:
# Create a new dataset with specific set sizes (as percentages of the total source dataset) for train, validation and test.
# Optionally, the new dataset can be class balanced, with the same number of items for each class.

# Percentage of each set
train_size = 0.8    # 80%
valid_size = 0.1    # 10%
test_size  = 0.1    # 10%

# Set to True if the new dataset needs to be class balanced.
class_balanced = False
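
The three percentages above are meant to describe a complete partition of the source dataset. A minimal sketch of a sanity check follows, assuming a hypothetical helper `check_split_sizes` that is not part of the original notebook:

```python
import math

def check_split_sizes(train_size, valid_size, test_size):
    """Raise ValueError unless the three fractions are non-negative and sum to 1.

    Hypothetical helper, not part of the original notebook.
    """
    sizes = (train_size, valid_size, test_size)
    if any(s < 0 for s in sizes):
        raise ValueError("Set sizes must be non-negative: {}".format(sizes))
    # isclose tolerates floating-point rounding in sums such as 0.8 + 0.1 + 0.1
    if not math.isclose(sum(sizes), 1.0):
        raise ValueError("Set sizes must sum to 1, got {}".format(sum(sizes)))
    return True

check_split_sizes(0.8, 0.1, 0.1)
```

In `get_slice` the test set receives whatever remains after train and valid, so a mismatch would not raise an error there; it would only make the actual test share differ from `test_size`.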

In [3]:
# Which method?
if 'option' not in locals():
    raise Exception("PLEASE SELECT HOW TO CREATE THE NEW DATASET!")
if option not in [1, 2]:
    raise Exception("Method '{}' unknown!".format(option))

# Check folders! The destination folder must not exist (this avoids overwriting data).
if not os.path.isdir(data_folder):
    raise Exception("Source folder not found!")
if os.path.isdir(new_folder):
    raise Exception("Destination folder exists!")

### Methods implementation. Nothing to be set after this point.

In [4]:
# These functions load the contents of the source folder and
# save them to a DataFrame.
# The image count for each set is printed.
def get_contents(path, data):
    for folder in sorted(os.listdir(path)):
        if (os.path.isdir(os.path.join(path, folder))):
            get_contents(os.path.join(path, folder), data)
        else:
            data.append((path, folder))
    
def get_set(path):
    setname = ''
    if path.find('valid') > -1:
        setname = 'valid'
    elif path.find('test') > -1:
        setname = 'test'
    elif path.find('train') > -1:
        setname = 'train'
    return setname   

def get_data(folder):
    data = []
    get_contents(folder, data)
    df = pd.DataFrame(data, columns=['Folder', 'File'])
    df['Class'] = df.apply(lambda row: os.path.basename(os.path.normpath(row['Folder'])), axis=1) 
    df['Set'] = df.apply(lambda row: get_set(row['Folder']), axis=1)

    for s in ['train','valid','test']:
        print('** {} Set **'.format(s.title()))
        ds = df[df['Set'] == s]
        print('Items: {} - Classes: {}'.format(ds.shape[0], ds['Class'].unique().size))
        # Count files per class; min()/max() return scalars, so no positional [0] lookup is needed
        da = ds.groupby('Class')['File'].count()
        print('Min items per class: {}'.format(da.min()))
        print('Max items per class: {}'.format(da.max()))
        print('')

    return df
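
`get_set` looks for the set name anywhere in the folder path, so a root folder whose name happens to contain `valid`, `test`, or `train` (for example `latest_flowers`, which contains `test`) would be assigned to the wrong set. A possible alternative, sketched here under the assumption that set names always appear as whole path components, is to match components instead of substrings. `get_set_from_parts` is a hypothetical name, not part of the notebook:

```python
import os

SET_NAMES = ('train', 'valid', 'test')

def get_set_from_parts(path):
    """Return the first path component that is exactly a set name, or ''.

    Hypothetical alternative to get_set; assumes set names are whole folder names.
    """
    parts = os.path.normpath(path).split(os.sep)
    for part in parts:
        if part in SET_NAMES:
            return part
    return ''

# 'latest_flowers' contains the substring 'test', but no component equals 'test'
path = os.path.join('latest_flowers', 'train', '1')
print(get_set_from_parts(path))  # prints "train"
```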

In [5]:
# Load contents
df = get_data(data_folder)
df.head()

** Train Set **
Items: 0 - Classes: 0
Min items per class: nan
Max items per class: nan

** Valid Set **
Items: 0 - Classes: 0
Min items per class: nan
Max items per class: nan

** Test Set **
Items: 0 - Classes: 0
Min items per class: nan
Max items per class: nan



|   | Folder | File | Class | Set |
|---|--------|------|-------|-----|
| 0 | flower_original | image_00001.jpg | flower_original | |
| 1 | flower_original | image_00002.jpg | flower_original | |
| 2 | flower_original | image_00003.jpg | flower_original | |
| 3 | flower_original | image_00004.jpg | flower_original | |
| 4 | flower_original | image_00005.jpg | flower_original | |


In [6]:
# Option 1 specific methods.
# get_slice splits filelist into three sets according to the given percentages.
# items_count (if set) caps the number of files used, to create a balanced dataset.
def get_slice(filelist, train = 0.8, valid = 0.1, test = 0.1, items_count = 0):
    if items_count > 0:
        filecount = int(items_count)
    else:
        filecount = len(filelist)

    train = int(filecount * train)
    valid = int(filecount * valid)
    # The test set takes whatever remains after train and valid,
    # so the `test` argument is not used directly.

    # Draw all needed files once, without replacement, then cut the draw
    # into three disjoint slices. The caller's list is left unmodified.
    selected = random.sample(filelist, filecount)
    train_list = selected[:train]
    valid_list = selected[train:train + valid]
    test_list = selected[train + valid:]
    return train_list, valid_list, test_list

# Main Option 1 method. Rearranges the original images into three new sets (train, validation, test),
# optionally class balanced.
def rearrange(df, root_path, train_size = 0.8, valid_size = 0.1, test_size = 0.1, balanced = False):
    if balanced:
        min_class_items = df.groupby('Class')['File'].count().min()
    else:
        min_class_items = 0
    df2 = df.set_index('File')
    df2['NewFolder'], df2['NewSet'] = '',''

    for c in df['Class'].unique():
        train, valid, test = get_slice(df[df['Class'] == c]['File'].tolist(), 
                                       train = train_size, valid = valid_size, test = test_size, 
                                       items_count = min_class_items)
        for f in train:
            df2.loc[f,['NewFolder','NewSet']] = [os.path.join(root_path, 'train' , c), 'train']
        for f in valid:
            df2.loc[f,['NewFolder','NewSet']] = [os.path.join(root_path, 'valid', c), 'valid']
        for f in test:
            df2.loc[f,['NewFolder','NewSet']] = [os.path.join(root_path, 'test', c), 'test']            
        
    return df2.reset_index()    
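
To make the arithmetic in `get_slice` concrete: the train and valid counts are rounded down, and the test set receives the remainder. The sketch below repeats that arithmetic in a hypothetical helper, `split_counts`, for a class of 8189 files, which is the total number of items in the results reported at the end of this notebook:

```python
def split_counts(total, train=0.8, valid=0.1):
    """Return (train, valid, test) counts using the same rounding as get_slice.

    Hypothetical helper for illustration only.
    """
    n_train = int(total * train)   # rounded down
    n_valid = int(total * valid)   # rounded down
    n_test = total - n_train - n_valid  # remainder, so nothing is lost to rounding
    return n_train, n_valid, n_test

print(split_counts(8189))  # prints (6551, 818, 820)
```

Because of the rounding, the test set can end up slightly larger than the validation set even when both are set to 10%.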


In [7]:
# Option 2 specific method.
# For each original set, find the class with the lowest file count and sample that many images from every class.
def set_balance(df, root_path):
    df2 = df.set_index('File')
    df2['NewFolder'], df2['NewSet'] = '',''

    for s in ['train', 'valid', 'test']:
        r = df[df['Set'] == s]
        min_class_items = r.groupby('Class')['File'].count().min()
        for c in r['Class'].unique():
            filelist = r[r['Class'] == c]['File'].tolist()
            # int() converts the pandas/NumPy integer to a plain Python int for random.sample
            selected = random.sample(filelist, int(min_class_items))
            for f in selected:
                df2.loc[f,['NewFolder','NewSet']] = [os.path.join(root_path, s , c), s]
            
    return df2.reset_index() 
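
The balancing idea in `set_balance` can be seen in isolation on plain Python lists: find the size of the smallest class, then draw that many files at random from every class. This is a simplified sketch, not the notebook's implementation; `balance_classes` and the toy file names are made up for illustration:

```python
import random

def balance_classes(files_by_class, seed=None):
    """Downsample every class to the size of the smallest one.

    Hypothetical helper; files_by_class maps a class name to a list of file names.
    """
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    min_count = min(len(files) for files in files_by_class.values())
    return {c: rng.sample(files, min_count) for c, files in files_by_class.items()}

toy = {
    '1': ['a.jpg', 'b.jpg', 'c.jpg'],
    '2': ['d.jpg', 'e.jpg'],
    '3': ['f.jpg', 'g.jpg', 'h.jpg', 'i.jpg'],
}
balanced = balance_classes(toy, seed=42)
print({c: len(files) for c, files in balanced.items()})  # every class has 2 files
```

Files left out of the sample are not copied, so a balanced dataset is smaller than its source.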

In [8]:
# Copy files. The DataFrame provides each source filename and its new destination folder.
def copy_files(df):
    for index, row in df[df['NewFolder'] != ''].iterrows():
        # exist_ok=True makes the call a no-op when the folder already exists
        os.makedirs(row['NewFolder'], exist_ok=True)
        copyfile(os.path.join(row['Folder'], row['File']), 
                 os.path.join(row['NewFolder'], row['File']))           
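
Before copying a large dataset, it can help to try the same copy pattern on a throwaway folder. The sketch below uses temporary directories and a hypothetical `copy_to_folders` helper that takes `(source_path, destination_folder)` pairs instead of a DataFrame:

```python
import os
import tempfile
from shutil import copyfile

def copy_to_folders(pairs):
    """Copy each source file into its destination folder, creating folders as needed.

    Hypothetical helper; returns the list of destination paths.
    """
    copied = []
    for src, dst_folder in pairs:
        os.makedirs(dst_folder, exist_ok=True)
        dst = os.path.join(dst_folder, os.path.basename(src))
        copyfile(src, dst)
        copied.append(dst)
    return copied

with tempfile.TemporaryDirectory() as tmp:
    # Create a small placeholder file standing in for an image
    src = os.path.join(tmp, 'source', 'image_00001.jpg')
    os.makedirs(os.path.dirname(src))
    with open(src, 'wb') as fh:
        fh.write(b'not a real image')

    result = copy_to_folders([(src, os.path.join(tmp, 'new', 'train', '1'))])
    copied_ok = os.path.isfile(result[0])

print(copied_ok)  # prints True
```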

In [9]:
# Option 1 or 2? 
if option == 1:
    df = rearrange(df, new_folder, 
                   train_size = train_size, valid_size = valid_size, test_size = test_size, 
                   balanced = class_balanced)
else:
    df = set_balance(df, new_folder)

In [10]:
# Copy files!
copy_files(df)

### RESULTS: NEW DATASET CONTENTS

In [11]:
# New dataset items count
df2 = get_data(new_folder)

** Train Set **
Items: 6551 - Classes: 1
Min items per class: 6551
Max items per class: 6551

** Valid Set **
Items: 818 - Classes: 1
Min items per class: 818
Max items per class: 818

** Test Set **
Items: 820 - Classes: 1
Min items per class: 820
Max items per class: 820
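
The counts above do not show whether a file ended up in more than one set. A minimal sketch of such a check follows, assuming the `<root>/<set>/<class>/<file>` layout produced by this notebook. `find_overlaps` and `files_in_set` are hypothetical helpers, and they compare file names only, not image contents:

```python
import os
import tempfile

def files_in_set(root, setname):
    """Collect the names of all files found below <root>/<setname>."""
    names = set()
    for _, _, files in os.walk(os.path.join(root, setname)):
        names.update(files)
    return names

def find_overlaps(root, sets=('train', 'valid', 'test')):
    """Return {(set_a, set_b): shared file names} for every pair that overlaps."""
    contents = {s: files_in_set(root, s) for s in sets}
    overlaps = {}
    for i, a in enumerate(sets):
        for b in sets[i + 1:]:
            common = contents[a] & contents[b]
            if common:
                overlaps[(a, b)] = common
    return overlaps

# Toy layout in a temporary folder: image_1.jpg is deliberately placed in two sets
with tempfile.TemporaryDirectory() as root:
    for setname, fname in [('train', 'image_1.jpg'),
                           ('valid', 'image_2.jpg'),
                           ('test', 'image_1.jpg')]:
        folder = os.path.join(root, setname, '1')
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, fname), 'w').close()
    demo_overlaps = find_overlaps(root)

print(demo_overlaps)  # prints {('train', 'test'): {'image_1.jpg'}}
```

On a real run this could be called as `find_overlaps(new_folder)`; an empty dictionary means no file name is shared between sets.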

