<a href="https://colab.research.google.com/github/gustavmaskowitz/jupyter-notebooks/blob/master/Kaggle-fruits360-DL-to-Peltarion-dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fruit-360 preprocessor
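This notebook prepares the Fruits 360 image dataset for the Peltarion platform: it indexes the images, assigns each one to a training or validation subset, and bundles everything into a single zip file.

Note: This notebook requires Sidekick. For more information about the package, see: https://github.com/Peltarion/sidekick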


In [0]:
# Install Sidekick directly from its GitHub repository
!pip install git+https://github.com/Peltarion/sidekick#egg=sidekick


In [0]:
import os
import sidekick
import functools
import pandas as pd
from glob import glob
from PIL import Image
from sklearn.model_selection import train_test_split

In [0]:
!mkdir out

In [4]:
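# Path to the raw dataset
input_path = 'Fruit-Images-Dataset/Training'
# Path to the zip output. Resolve it to an absolute path before changing
# directory, so it still points at the `out` folder created above.
output_path = os.path.abspath('out/out-data.zip')
# Work from inside the dataset folder so that image paths are relative to it
os.chdir(input_path)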

images_rel_path = glob(os.path.join('*', '*.jpg')) + glob(os.path.join('*', '*.png'))
print("Images found: ", len(images_rel_path))



Images found:  53177


## Create Dataframe
The class column values are derived from the names of the subfolders in the `input_path`.

The image column contains the relative path to the images in the subfolders.


In [5]:
df = pd.DataFrame({'image': images_rel_path})
df['class'] = df['image'].apply(lambda path: os.path.basename(os.path.dirname(path)))
df.head()

| | image | class |
|---|---|---|
| 0 | Cherry 2/307_100.jpg | Cherry 2 |
| 1 | Cherry 2/r_232_100.jpg | Cherry 2 |
| 2 | Cherry 2/r2_197_100.jpg | Cherry 2 |
| 3 | Cherry 2/161_100.jpg | Cherry 2 |
| 4 | Cherry 2/r_151_100.jpg | Cherry 2 |


In [6]:
# Check that all images have the same mode, e.g., RGB

def get_mode(path):
   # The context manager closes the file as soon as the mode has been read
   with Image.open(path) as im:
      return im.mode

df['image_mode'] = df['image'].apply(get_mode)
print(df['image_mode'].value_counts())
# The mode column was only needed for the check, so drop it again
df = df.drop(columns=['image_mode'])

RGB    53177
Name: image_mode, dtype: int64
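Every image here is already RGB, so no conversion is needed. If the check had reported other modes (for example grayscale `L` or `RGBA`), one possible approach would be to convert them with Pillow before bundling. The following is a minimal sketch under that assumption; `to_rgb` is a hypothetical helper that is not part of the notebook, and the in-memory images stand in for files on disk.

```python
from PIL import Image

def to_rgb(im):
    """Return im unchanged if it is already RGB, otherwise an RGB copy."""
    return im if im.mode == 'RGB' else im.convert('RGB')

# Hypothetical in-memory images standing in for files read from disk
samples = [
    Image.new('L', (4, 4)),      # grayscale
    Image.new('RGBA', (4, 4)),   # RGB with alpha channel
    Image.new('RGB', (4, 4)),    # already in the target mode
]

converted_modes = [to_rgb(im).mode for im in samples]
print(converted_modes)
```

After conversion, every entry in `converted_modes` is `'RGB'`.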


In [7]:
# Create subsets for training and validation

def create_subsets(df, col='class', validation_size=0.20):
   # Stratify on the class column so both subsets keep the same class proportions
   train_data, validate_data = train_test_split(df, test_size=validation_size, random_state=42, stratify=df[col])
   # 'T' marks training rows and 'V' marks validation rows
   train_data = train_data.assign(subset='T')
   validate_data = validate_data.assign(subset='V')
   # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
   return pd.concat([train_data, validate_data], ignore_index=True)
df = create_subsets(df)
df['subset'].value_counts()
df.head()


| | image | class | subset |
|---|---|---|---|
| 0 | Cocos/148_100.jpg | Cocos | T |
| 1 | Lemon/r_226_100.jpg | Lemon | T |
| 2 | Pineapple/133_100.jpg | Pineapple | T |
| 3 | Physalis/234_100.jpg | Physalis | T |
| 4 | Physalis/r_192_100.jpg | Physalis | T |
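Because the split is stratified on the class column, each class should contribute roughly the same share of its images to the validation subset. One way to confirm this is a per-class cross-tabulation. The sketch below uses a small made-up frame (`toy`, three classes with ten images each) rather than the real dataset, so all names and values in it are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the real dataframe: 3 classes x 10 images
toy = pd.DataFrame({
    'image': [f'{c}/{i}_100.jpg'
              for c in ('Cocos', 'Lemon', 'Physalis')
              for i in range(10)],
})
toy['class'] = toy['image'].str.split('/').str[0]

# Same split settings as in the notebook: 20% validation, stratified by class
train, val = train_test_split(
    toy, test_size=0.20, random_state=42, stratify=toy['class'])
toy_split = pd.concat(
    [train.assign(subset='T'), val.assign(subset='V')], ignore_index=True)

# Row-normalised crosstab: the share of each class that landed in T and in V
shares = pd.crosstab(toy_split['class'], toy_split['subset'], normalize='index')
print(shares)
```

With ten images per class and a 20% validation size, every class ends up with 0.8 in `T` and 0.2 in `V`. On the real dataset the shares will only be approximately equal, since class sizes are not all multiples of five.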


In [0]:
# Create dataset bundle
# Available modes:
# - crop_and_resize
# - center_crop_or_pad
# - resize_image
image_processor = functools.partial(sidekick.process_image, mode='crop_and_resize', size=(100, 100), file_format='jpeg')
sidekick.create_dataset(
   output_path,
   df,
   path_columns=['image'],
   preprocess={
       'image': image_processor
   }
)
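Before uploading, it can be worth checking that the bundle contains roughly what is expected. The sketch below counts the entries in a zip archive by file extension using only the standard library. `summarize_zip` is a hypothetical helper, and the demo archive it is run on is made up for illustration; its layout is an assumption and may not match what Sidekick actually writes.

```python
import os
import tempfile
import zipfile
from collections import Counter

def summarize_zip(zip_path):
    """Count the entries in a zip archive, grouped by lower-cased file extension."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    return Counter(os.path.splitext(name)[1].lower() for name in names)

# Build a tiny made-up archive so the example runs on its own
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('index.csv', 'image,class,subset\n')
        zf.writestr('image/0.jpeg', b'')
        zf.writestr('image/1.jpeg', b'')
    summary = summarize_zip(demo_zip)

print(dict(summary))
```

In the notebook, the same helper could be called as `summarize_zip(output_path)` and the image count compared against `len(df)`.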

In [9]:
# Adding Drive folders to colab notebook
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive
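With Drive mounted, the bundle can be copied there so it survives the end of the Colab session. The following is a minimal sketch using `shutil`; `copy_to_folder` is a hypothetical helper, and the demo runs against temporary directories so that it works anywhere. In Colab the destination would instead be a folder under `/content/drive`, whose exact name depends on the Drive setup.

```python
import os
import shutil
import tempfile

def copy_to_folder(src_file, dest_dir):
    """Copy src_file into dest_dir, creating dest_dir if needed; return the new path."""
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy(src_file, dest_dir)

# Demo with temporary directories standing in for the notebook's paths
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'out-data.zip')
    with open(src, 'wb') as f:
        f.write(b'placeholder bytes, not a real zip')
    dest = copy_to_folder(src, os.path.join(tmp, 'drive', 'peltarion'))
    # Compare file sizes as a cheap sanity check that the copy is complete
    copied_ok = os.path.exists(dest) and os.path.getsize(dest) == os.path.getsize(src)

print(copied_ok)
```

In the notebook, `src` would be `output_path`.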

