# Fruits-360 preprocessor
This notebook prepares the Fruits-360 dataset for upload to the Peltarion platform.

**Note**: This notebook requires the Sidekick package. For installation instructions and more information, see https://github.com/Peltarion/sidekick

In [1]:
import functools
import os
from glob import glob

import pandas as pd
from PIL import Image
import sidekick
from tqdm import tqdm

## Setup

### Paths

In [2]:
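# Folder with the raw dataset, one subfolder per class.
# Keep this in sync with the glob patterns in "Get list of image paths".
input_path = './fruits-360/Training'
# Destination of the zipped dataset bundle
output_path = './data.zip'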

### Progress bar for Pandas

In [3]:
tqdm.pandas()

### Get list of image paths

In [4]:
images_rel_path = glob('fruits-360/Training/*/*.jpg') + glob('fruits-360/Training/*/*.png')
print("Images found: ", len(images_rel_path))

Images found:  53177
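
`glob` returns matches in an arbitrary, filesystem-dependent order and matches file extensions case-sensitively on most systems. The sketch below is one possible alternative, not the notebook's method: it uses `pathlib` to collect images one folder below a root, ignores extension case, and sorts the result so that later seeded steps see the same input order on every machine. The helper name `list_images` and the suffix set are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: these are the only image extensions of interest.
IMAGE_SUFFIXES = {'.jpg', '.jpeg', '.png'}


def list_images(root):
    """Return sorted paths of image files found exactly one folder below root."""
    root = Path(root)
    return sorted(
        p.as_posix()
        for p in root.glob('*/*')
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling `list_images('fruits-360/Training')` would yield paths of the same shape as the `glob` call above, in a stable order.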


## Create Dataframe
The class column values are derived from the names of the subfolders in the `input_path`.

The image column contains the relative path to the images in the subfolders.

In [5]:
df = pd.DataFrame({'image': images_rel_path})
# Class name = name of the folder that contains the image.
# Use .apply instead of .progress_apply to run without a progress bar.
df['class'] = df['image'].progress_apply(lambda path: os.path.basename(os.path.dirname(path)))
df.head()

100%|██████████| 53177/53177 [00:00<00:00, 319375.26it/s]


| | image | class |
|---|---|---|
| 0 | fruits-360/Training/Tomato 4/r_236_100.jpg | Tomato 4 |
| 1 | fruits-360/Training/Tomato 4/247_100.jpg | Tomato 4 |
| 2 | fruits-360/Training/Tomato 4/257_100.jpg | Tomato 4 |
| 3 | fruits-360/Training/Tomato 4/r_78_100.jpg | Tomato 4 |
| 4 | fruits-360/Training/Tomato 4/r_68_100.jpg | Tomato 4 |
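
To make the class derivation concrete, here is the same folder-name logic as a standalone function applied to one of the paths shown above. The function name `class_from_path` is introduced only for this illustration.

```python
import os


def class_from_path(path):
    """Return the name of the folder that directly contains the file."""
    return os.path.basename(os.path.dirname(path))


print(class_from_path('fruits-360/Training/Tomato 4/r_236_100.jpg'))  # Tomato 4
```

`os.path.dirname` strips the file name and `os.path.basename` then keeps only the last folder, so any deeper nesting above the class folder does not affect the label.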


### Check that all images have the same mode, e.g., RGB

In [6]:
def get_mode(path):
    # Image.open reads only the header; the context manager closes the file
    with Image.open(path) as im:
        return im.mode

df['image_mode'] = df['image'].progress_apply(get_mode)
print(df['image_mode'].value_counts())
df = df.drop(['image_mode'], axis=1)

100%|██████████| 53177/53177 [00:19<00:00, 2778.36it/s]

RGB    53177
Name: image_mode, dtype: int64
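
Here every image is RGB, so nothing needs fixing. If the check ever reports more than one mode, the next step is usually to find the files that deviate. The following is a minimal sketch of that lookup in plain Python, assuming a hypothetical mapping from path to mode; `find_minority_modes` is not part of Pillow or Sidekick.

```python
from collections import Counter


def find_minority_modes(modes_by_path):
    """Return {path: mode} for files whose mode is not the most common one."""
    if not modes_by_path:
        return {}
    counts = Counter(modes_by_path.values())
    dominant_mode, _ = counts.most_common(1)[0]
    return {
        path: mode
        for path, mode in modes_by_path.items()
        if mode != dominant_mode
    }
```

With the dataframe from this notebook, the mapping could be built with `dict(zip(df['image'], df['image_mode']))` while the `image_mode` column still exists.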





## View number of rows per class

In [7]:
pd.set_option('display.max_rows', 150)
df['class'].value_counts()

Grape Blue             984
Plum 3                 900
Strawberry Wedge       738
Peach 2                738
Cherry 2               738
Cherry Rainier         738
Melon Piel de Sapo     738
Tomato 3               738
Tomato 1               738
Walnut                 735
Apple Red Yellow 2     672
Tomato 2               672
Pear Red               666
Pepper Yellow          666
Pepper Red             666
Pineapple Mini         493
Nectarine              492
Grape White 3          492
Strawberry             492
Lemon                  492
Physalis with Husk     492
Peach Flat             492
Apple Granny Smith     492
Cherry 1               492
Apple Golden 2         492
Cantaloupe 2           492
Pomegranate            492
Rambutan               492
Peach                  492
Tomato Cherry Red      492
Pear                   492
Grapefruit White       492
Redcurrant             492
Cherry Wax Red         492
Apple Braeburn         492
Mulberry               492
Apple Red 1            492
A
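
The counts show that the classes are not equally large. One simple way to quantify this is the ratio between the largest and the smallest class. The sketch below uses only four of the counts visible above; because the listing is cut off, the true minimum over all classes may be lower, so the printed value is a lower bound for the full dataset.

```python
def imbalance_ratio(counts):
    """Return the size of the largest class divided by the smallest."""
    return max(counts.values()) / min(counts.values())


# A few counts copied from the (truncated) output above.
sample_counts = {
    'Grape Blue': 984,
    'Plum 3': 900,
    'Walnut': 735,
    'Apple Red 1': 492,
}

print(imbalance_ratio(sample_counts))  # 2.0
```

On the full dataframe, `df['class'].value_counts().to_dict()` would supply the complete mapping.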

## Shuffle the rows

When you save a new version of a dataset on the platform, the rows in the dataset are shuffled automatically. Shuffling the rows before upload additionally ensures that the Datasets preview shows samples from different classes rather than from a single one.

In [8]:
df = df.sample(frac=1.0, random_state=1)
len(df)

53177
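
Passing a fixed `random_state` makes the shuffle repeatable, and `frac=1.0` keeps every row, which is why the length is unchanged. The plain-Python analogue below illustrates those two properties without pandas. It is a stand-in for the idea, not a reimplementation of `DataFrame.sample`, and it will not reproduce the same row order.

```python
import random


def shuffled(rows, seed=1):
    """Return a new list with the same rows in a seeded random order."""
    rng = random.Random(seed)
    # Sampling len(rows) items without replacement yields a permutation.
    return rng.sample(rows, k=len(rows))


rows = list(range(10))
first = shuffled(rows)
second = shuffled(rows)
print(first == second)        # True: same seed, same order
print(sorted(first) == rows)  # True: no row lost or duplicated
```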

## Create dataset bundle

In [12]:
# Available modes:
# - crop_and_resize
# - center_crop_or_pad
# - resize_image
image_processor = functools.partial(
    sidekick.process_image,
    mode='crop_and_resize',
    size=(100, 100),
    file_format='jpeg',
)
sidekick.create_dataset(
    output_path,
    df,
    path_columns=['image'],
    preprocess={
        'image': image_processor
    }
)
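
Before uploading, it can be worth confirming that the archive was written completely. The sketch below uses only the standard `zipfile` module. It makes no assumption about the layout that `sidekick.create_dataset` produces: it runs the built-in integrity check and reports what is inside. The helper name `summarize_bundle` is hypothetical.

```python
import zipfile


def summarize_bundle(zip_path):
    """Return (number of entries, sorted top-level names) for a zip archive."""
    with zipfile.ZipFile(zip_path) as bundle:
        # testzip() returns the first corrupt member name, or None if all are fine.
        bad_member = bundle.testzip()
        if bad_member is not None:
            raise ValueError(f'Corrupt member in archive: {bad_member}')
        names = bundle.namelist()
    top_level = sorted({name.split('/')[0] for name in names})
    return len(names), top_level
```

For this notebook the call would be `summarize_bundle(output_path)`. Comparing the entry count with `len(df)` gives a quick plausibility check.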