# Data Preparation

The purpose of this notebook is to organize the Food-101 dataset. It assumes that the dataset has already been downloaded (https://www.vision.ee.ethz.ch/datasets_extra/food-101/) and extracted into a folder named 'food-101', and that this folder sits in the same directory as this notebook (run `!pwd` in a cell below to confirm).
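
Before running the cells below, it can help to confirm that the extracted folder contains the pieces this notebook reads. The following is a minimal sketch, not part of the original notebook: the helper name `missing_food101_paths` is hypothetical, and the three expected paths are inferred from the paths used in the cells that follow.

```python
import os

def missing_food101_paths(root='food-101'):
    """Return the expected sub-paths that are absent under `root`."""
    # These three paths are the ones the cells in this notebook rely on.
    expected = [
        os.path.join(root, 'images'),
        os.path.join(root, 'meta', 'train.json'),
        os.path.join(root, 'meta', 'test.json'),
    ]
    return [path for path in expected if not os.path.exists(path)]

# Check relative to the current working directory, as the notebook assumes.
missing = missing_food101_paths()
if missing:
    print('Missing under', os.getcwd(), ':', missing)
else:
    print('food-101 layout looks complete')
```

An empty list means all three paths were found. Anything else points to an incomplete extraction or a notebook running from a different directory.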

In [None]:
# Notebook dependencies
import json
import os
import pandas as pd
import shutil

# load json blob which tells us which images to include in the training set
# (a with-block closes the file handle once the JSON has been read)
with open('food-101/meta/train.json') as f:
    train_dict = json.load(f)

# the path to the extracted food-101 images (this will be different across computers/environments)
original_dataset_dir = '/home/jupyter/food-101/images'

# the path to where the organized data will be stored
base_dir = '/home/jupyter/data'

In [10]:
# Create class directories in the train folder
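for key in train_dict.keys():
    class_dir = os.path.join(base_dir, 'train', key)
    # makedirs also creates base_dir and base_dir/train if they are missing,
    # and exist_ok=True lets this cell be re-run without raising an error
    os.makedirs(class_dir, exist_ok=True)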

In [11]:
# Copy files from original_dataset_dir to appropriate train directory
for key in train_dict.keys():
    for file in train_dict[key]:
        src = os.path.join(original_dataset_dir, file + '.jpg')
        dst = os.path.join(base_dir, 'train', file + '.jpg')
        shutil.copyfile(src, dst)
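
The copy loop above only places each image in its class folder if every entry in the JSON already carries its class name as a prefix, which is what the `os.path.join` calls imply. The sketch below illustrates that path construction under this assumption. The dictionary, the image IDs, the relative directories, and the helper `build_copy_pairs` are all hypothetical stand-ins, and nothing is copied.

```python
import os

# Hypothetical stand-in for train_dict; the image IDs are made up.
example_dict = {'apple_pie': ['apple_pie/0000001', 'apple_pie/0000002']}

def build_copy_pairs(split_dict, images_dir, split_dir):
    """Return (src, dst) path pairs without touching the file system."""
    pairs = []
    for entries in split_dict.values():
        for entry in entries:
            # entry already contains '<class>/<image id>', so joining it
            # onto split_dir lands the file inside its class sub-directory
            src = os.path.join(images_dir, entry + '.jpg')
            dst = os.path.join(split_dir, entry + '.jpg')
            pairs.append((src, dst))
    return pairs

pairs = build_copy_pairs(example_dict, 'food-101/images', 'data/train')
for src, dst in pairs:
    print(src, '->', dst)
```

Printing a few pairs like this before a long copy is a cheap way to catch a wrong `base_dir` or `original_dataset_dir`.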

In [12]:
# load json blob which tells us which images to include in the test set
# (a with-block closes the file handle once the JSON has been read)
with open('food-101/meta/test.json') as f:
    test_dict = json.load(f)

In [13]:
# Create class directories in the test folder
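for key in test_dict.keys():
    class_dir = os.path.join(base_dir, 'test', key)
    # makedirs also creates base_dir and base_dir/test if they are missing,
    # and exist_ok=True lets this cell be re-run without raising an error
    os.makedirs(class_dir, exist_ok=True)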

In [14]:
# Copy files from original_dataset_dir to appropriate test directory
for key in test_dict.keys():
    for file in test_dict[key]:
        src = os.path.join(original_dataset_dir, file + '.jpg')
        dst = os.path.join(base_dir, 'test', file + '.jpg')
        shutil.copyfile(src, dst)

At this point, we have split the Food-101 dataset into train and test sets, following the split provided by the original publishers. Within each set, every class has its own sub-directory. This directory structure is a common way to organize image datasets in preparation for training image classifiers. In the next notebook, we will begin to explore the Food-101 dataset.
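
As a quick sanity check on the resulting structure, the number of images in each class folder can be counted and compared against the lengths of the lists in `train_dict` and `test_dict`. This is a minimal sketch rather than part of the original workflow: the helper `count_images_per_class` is hypothetical, and the path below assumes the same `base_dir` as the setup cell.

```python
import os

def count_images_per_class(split_dir):
    """Map each class sub-directory of `split_dir` to its number of .jpg files."""
    counts = {}
    for name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                1 for f in os.listdir(class_dir) if f.lower().endswith('.jpg')
            )
    return counts

# Assumed location, matching base_dir in the setup cell; adjust as needed.
split_dir = '/home/jupyter/data/train'
if os.path.isdir(split_dir):
    counts = count_images_per_class(split_dir)
    print(len(counts), 'classes,', sum(counts.values()), 'images')
```

The same function can be pointed at the test folder. A class whose count differs from `len(train_dict[name])` or `len(test_dict[name])` suggests an interrupted copy.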