### Process the images in Indian_Food_Images so they are ready for computer vision training

**Structure**\
After cleaning up the folder structure, we have a number of folders under data/Indian_Food_Images. Each folder is named after a dish (its label) and contains images of that dish.

In [3]:
import pandas as pd
import os

#from sklearn.model_selection import train_test_split
from image_prep import get_img_df, train_val_test_split

##### Step 1: Associate image paths with label name in a dataframe.

In [4]:
ifi_path = "data/Indian_Food_Images/"
fc_path = "data/Food_Classification/"

df_ifi = get_img_df(ifi_path)
df_fc = get_img_df(fc_path)
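
The `image_prep` module is not shown in this notebook, so the following is only a minimal sketch of how a helper like `get_img_df` could build the label/path dataframe from the folder layout described above. The name `build_image_df` and the `IMAGE_SUFFIXES` set are hypothetical and are not taken from `image_prep`.

```python
from pathlib import Path

import pandas as pd

# Assumed set of image extensions; adjust to whatever the dataset contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_image_df(root):
    """Hypothetical stand-in for get_img_df: one row per image file.

    Each sub-folder of `root` is treated as a label, and every image
    file inside it becomes a (label, path) row.
    """
    rows = []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files sitting next to the class folders
        for img in sorted(class_dir.iterdir()):
            if img.suffix.lower() in IMAGE_SUFFIXES:
                rows.append({"label": class_dir.name, "path": str(img)})
    return pd.DataFrame(rows, columns=["label", "path"])
```

Called as `build_image_df("data/Indian_Food_Images/")`, this would return a dataframe with `label` and `path` columns in the same shape as the output inspected below.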


In [17]:
# inspect
print(df_ifi.head(2))
print(df_fc.head(2))


       label                                              path
0  adhirasam  data/Indian_Food_Images/adhirasam/02d09e872d.jpg
1  adhirasam  data/Indian_Food_Images/adhirasam/02f2e49039.jpg
           label                                          path
413  butter_naan  data/Food_Classification/butter_naan/001.jpg
414  butter_naan  data/Food_Classification/butter_naan/002.jpg


Drop unwanted food items from df_fc

In [11]:
unwanted_food_list = ['burger','pizza']

df_fc = df_fc[~df_fc['label'].isin(unwanted_food_list)]
df_fc.label.unique()

array(['butter_naan', 'chai', 'chapati', 'chole_bhature', 'dal_makhani',
       'dhokla', 'fried_rice', 'idli', 'jalebi', 'kaathi_rolls',
       'kadai_paneer', 'kulfi', 'masala_dosa', 'momos', 'paani_puri',
       'pakode', 'pav_bhaji', 'samosa'], dtype=object)

In [18]:
# ensure every path in the dataframe points to an existing file
def assert_df_valid_path(df):
    assert df['path'].map(os.path.isfile).all(), "dataframe contains invalid paths"

assert_df_valid_path(df_ifi)
assert_df_valid_path(df_fc)

##### Step 2: Train/Validation/Test Split

In [22]:
# first note how many of each class we have
#df.groupby('label').count().head(5)

df_ifi.value_counts('label')

label
adhirasam               50
aloo_gobi               50
naan                    50
mysore_pak              50
modak                   50
                        ..
daal_puri               50
daal_baati_churma       50
chikki                  50
chicken_tikka_masala    50
unni_appam              50
Name: count, Length: 80, dtype: int64

In [23]:
df_fc.value_counts('label')

label
chapati          413
kadai_paneer     412
chole_bhature    411
chai             381
fried_rice       355
pav_bhaji        353
butter_naan      329
dal_makhani      321
momos            319
masala_dosa      311
idli             310
jalebi           297
kaathi_rolls     293
dhokla           289
pakode           278
samosa           262
kulfi            237
paani_puri       130
Name: count, dtype: int64

Neither set has much data to work with. We might need to adjust the percentages, but we probably don't want fewer than 10 images per class in either the test or the validation set. For the Indian_Food_Images set, which has 50 images per class, this means a 60/20/20 split. We can use the same proportions for both image sets.
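
To make the arithmetic behind the 60/20/20 choice concrete, here is a small sketch that computes per-class counts for a given class size. The helper `split_counts` is hypothetical, and rounding to the nearest integer is an assumption: the real split function may round differently, so classes whose size is not an exact multiple can be off by one.

```python
def split_counts(n_images, test_size=0.2, val_size=0.2):
    """Return (train, validate, test) image counts for one class.

    Assumption: test and validation counts are rounded to the nearest
    integer and the training set takes whatever is left.
    """
    n_test = round(n_images * test_size)
    n_val = round(n_images * val_size)
    return n_images - n_test - n_val, n_val, n_test


# 50 images per class in Indian_Food_Images; paani_puri is the
# smallest Food_Classification class with 130 images.
print(split_counts(50))   # (30, 10, 10)
print(split_counts(130))  # (78, 26, 26)
```

With 50 images per class, 20% is exactly the 10-image minimum for the test and validation sets.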

In [24]:
train_ifi, test_ifi, validate_ifi = train_val_test_split(df_ifi, test_size=0.2, val_size=0.2)
train_fc, test_fc, validate_fc = train_val_test_split(df_fc, test_size=0.2, val_size=0.2)
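
Since `train_val_test_split` lives in `image_prep` and is not shown here, the following is a rough sketch of one way to do a per-label (stratified) three-way split with pandas alone. The function name `stratified_three_way_split`, its `seed` parameter, and the rounding rule are assumptions for illustration; it returns the splits in the same train, test, validate order used in the cell above.

```python
import pandas as pd


def stratified_three_way_split(df, test_size=0.2, val_size=0.2,
                               label_col="label", seed=0):
    """Hypothetical stand-in for train_val_test_split.

    Splits every label separately so each split keeps the class
    proportions. Returns (train, test, validate).
    """
    # Shuffle once so the position of a row within its class is random.
    # Note: this discards the original index.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    # Position of each row within its class (0, 1, 2, ...) and class size.
    rank = shuffled.groupby(label_col).cumcount()
    size = shuffled[label_col].map(shuffled[label_col].value_counts())

    # Assumed rounding rule: nearest integer per class.
    n_test = (size * test_size).round().astype(int)
    n_val = (size * val_size).round().astype(int)

    test = shuffled[rank < n_test]
    validate = shuffled[(rank >= n_test) & (rank < n_test + n_val)]
    train = shuffled[rank >= n_test + n_val]
    return train, test, validate
```

Because each row lands in exactly one of the three rank ranges, the splits are disjoint and together cover every row of the input.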


In [26]:
train_fc.value_counts('label')

label
chapati          248
kadai_paneer     247
chole_bhature    246
chai             229
fried_rice       213
pav_bhaji        212
butter_naan      197
dal_makhani      193
momos            191
masala_dosa      187
idli             186
jalebi           178
kaathi_rolls     176
dhokla           173
pakode           167
samosa           157
kulfi            142
paani_puri        78
Name: count, dtype: int64

In [14]:
test_ifi.groupby('label').count().head(5)

                   path
label
adhirasam            10
aloo_gobi            10
aloo_matar           10
aloo_methi           10
aloo_shimla_mirch    10


In [15]:
validate_ifi.groupby('label').count().head(5)

                   path
label
adhirasam            10
aloo_gobi            10
aloo_matar           10
aloo_methi           10
aloo_shimla_mirch    10


##### Step 3: Process images so they can be used in computer vision

In [12]:
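# TBD: image processing is not implemented yet.
# Scratch arithmetic: validation's share of the combined test + validation pool.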
test = 0.2
val = 0.3

val / (test+val)

0.6
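
One plausible reading of the ratio above (an assumption, since the notebook does not say) is that it is the fraction needed for a two-stage split: first hold out `test + val` of the data, then divide that held-out pool between validation and test. The sketch below checks that arithmetic; `relative_val_fraction` is a hypothetical helper name.

```python
from math import isclose


def relative_val_fraction(test_size, val_size):
    """Share of the held-out pool (test + validation) that goes to validation."""
    return val_size / (test_size + val_size)


# Stage 1 holds out test + val of all rows; stage 2 splits that pool.
holdout = 0.2 + 0.3
frac = relative_val_fraction(0.2, 0.3)

# Taking `frac` of the held-out pool recovers the requested validation size.
assert isclose(holdout * frac, 0.3)

# With the 0.2 / 0.2 sizes used for the splits above, the pool is halved.
assert isclose(relative_val_fraction(0.2, 0.2), 0.5)
```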