### Prepare the Indian_Food_Images data for training a computer vision model

**Structure**\
After cleaning up the folder structure, we have a number of folders under data/Indian_Food_Images. Each folder is named after a dish (the name doubles as the label) and contains images of that dish.

In [34]:
import pandas as pd
import os

from sklearn.model_selection import train_test_split

##### Step 1: Associate each image path with its label in a DataFrame

In [28]:
ifi_path = "data/Indian_Food_Images/"

# each folder name is also the label name; skip any stray non-directory entries
labels = sorted(d for d in os.listdir(ifi_path) if os.path.isdir(os.path.join(ifi_path, d)))

# associate the image paths with each label
image_dct = {}
for label in labels:
    label_path = os.path.join(ifi_path, label)
    # list the images under each folder and prepend the folder path to each image name
    image_dct[label] = [os.path.join(label_path, img) for img in sorted(os.listdir(label_path))]

# create the df and melt it so each row pairs a label (the former column name) with an image path
# note: pd.DataFrame(image_dct) requires every folder to hold the same number of images
df = pd.DataFrame(image_dct).melt().rename(columns={'variable': 'label', 'value': 'path'})
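
The melt-based construction above relies on every folder holding the same number of images, because `pd.DataFrame` raises a `ValueError` when given a dict of lists with unequal lengths. A minimal sketch of an alternative that builds one record per image, so uneven folders are not a problem; the helper name `list_labeled_images` is hypothetical and not part of the notebook:

```python
import os


def list_labeled_images(root):
    """Return one {'label', 'path'} record per file found in each subfolder of root."""
    records = []
    for label in sorted(os.listdir(root)):
        label_path = os.path.join(root, label)
        if not os.path.isdir(label_path):
            continue  # ignore stray files sitting next to the label folders
        for img in sorted(os.listdir(label_path)):
            records.append({"label": label, "path": os.path.join(label_path, img)})
    return records
```

Passing the result to `pd.DataFrame(list_labeled_images(ifi_path))` should give the same `label` and `path` columns as `df`.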


In [29]:
# inspect
df.head(2)

| | label | path |
| --- | --- | --- |
| 0 | adhirasam | data/Indian_Food_Images/adhirasam/02d09e872d.jpg |
| 1 | adhirasam | data/Indian_Food_Images/adhirasam/02f2e49039.jpg |


In [30]:
# check that the first path points to an existing file
fname = df.iloc[0]['path']
os.path.isfile(fname)


True
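
The check above covers only the first row. A small sketch for validating every path, assuming `paths` is any iterable of strings such as `df['path']`; `missing_files` is a hypothetical helper name:

```python
import os


def missing_files(paths):
    """Return the paths that do not point to an existing regular file."""
    return [p for p in paths if not os.path.isfile(p)]
```

An empty list from `missing_files(df['path'])` would mean every row points to a real file.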

##### Step 2: Train/Test/Validation Split

In [39]:
# first note how many of each class we have
df.groupby('label').count().head(5)

| label | path |
| --- | --- |
| adhirasam | 50 |
| aloo_gobi | 50 |
| aloo_matar | 50 |
| aloo_methi | 50 |
| aloo_shimla_mirch | 50 |


Fifty images per class isn't a lot of data to work with. We might need to adjust the split percentages, but we probably don't want fewer than 10 images per class in each of the test and validation sets.

We will stratify on the label so that every split keeps the same class proportions and we don't introduce a class imbalance.
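
To make the trade-off concrete, the arithmetic below estimates how many images per class land in each split for a given pair of fractions. It is a rough sketch: `split_counts` is a hypothetical helper, the fractions in the example (hold out 40%, then divide the held-out part in half) are one choice that meets the 10-image floor, and the rounding only approximates what a stratified splitter does.

```python
def split_counts(n_per_class, holdout_frac, validate_frac_of_holdout):
    """Approximate per-class (train, test, validate) counts for a two-stage split."""
    holdout = round(n_per_class * holdout_frac)
    validate = round(holdout * validate_frac_of_holdout)
    test = holdout - validate
    train = n_per_class - holdout
    return train, test, validate


print(split_counts(50, 0.4, 0.5))  # (30, 10, 10)
print(split_counts(50, 0.2, 0.5))  # (40, 5, 5): below the 10-image floor
```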

In [45]:
# first split off 60% of the rows for training; the remaining 40% is held out
train, test_validate = train_test_split(df, test_size=0.4, random_state=0, stratify=df['label'])
# then split the held-out rows evenly into test and validation (20% of the data each)
test, validate = train_test_split(test_validate, test_size=0.5, random_state=0, stratify=test_validate['label'])

In [46]:
train.groupby('label').count().head(5)

| label | path |
| --- | --- |
| adhirasam | 30 |
| aloo_gobi | 30 |
| aloo_matar | 30 |
| aloo_methi | 30 |
| aloo_shimla_mirch | 30 |


In [47]:
test.groupby('label').count().head(5)

| label | path |
| --- | --- |
| adhirasam | 10 |
| aloo_gobi | 10 |
| aloo_matar | 10 |
| aloo_methi | 10 |
| aloo_shimla_mirch | 10 |


In [48]:
validate.groupby('label').count().head(5)

| label | path |
| --- | --- |
| adhirasam | 10 |
| aloo_gobi | 10 |
| aloo_matar | 10 |
| aloo_methi | 10 |
| aloo_shimla_mirch | 10 |
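
Balanced per-class counts do not by themselves show that the three splits are disjoint and together cover the whole dataset. A minimal sketch of that check on row identifiers; `is_partition` is a hypothetical helper, and the identifiers are assumed to be unique and hashable, as the default DataFrame index is:

```python
def is_partition(all_ids, *parts):
    """True if the parts are pairwise disjoint and together contain exactly all_ids."""
    seen = set()
    for part in parts:
        part_ids = set(part)
        if seen & part_ids:
            return False  # some row appears in more than one split
        seen |= part_ids
    return seen == set(all_ids)
```

With the frames from this step, `is_partition(df.index, train.index, test.index, validate.index)` would be expected to return `True`, since `train_test_split` keeps the original index labels on each row.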


##### Step 3: Process images so they can be used in computer vision

In [49]:
# TBD

| | label | path |
| --- | --- | --- |
| 2615 | makki_di_roti_sarson_da_saag | data/Indian_Food_Images/makki_di_roti_sarson_d... |
| 3151 | pithe | data/Indian_Food_Images/pithe/06df13ef65.jpg |
| 1326 | dal_makhani | data/Indian_Food_Images/dal_makhani/4d4ea9dfc4... |
| 738 | butter_chicken | data/Indian_Food_Images/butter_chicken/6cc4b33... |
| 445 | bandar_laddu | data/Indian_Food_Images/bandar_laddu/8a516ab20... |
| ... | ... | ... |
| 3018 | palak_paneer | data/Indian_Food_Images/palak_paneer/24a4b9772... |
| 2039 | kadhi_pakoda | data/Indian_Food_Images/kadhi_pakoda/7e6b52c15... |
| 3660 | sheera | data/Indian_Food_Images/sheera/14d717de86.jpg |
| 814 | cham_cham | data/Indian_Food_Images/cham_cham/1e3fc07649.jpg |
