## Data Loader
1. Takes the output of FlickrPicCollector as input: images in directories named after the search terms (the keys of the plants dictionary).
2. Moves the images into class directories (the values of the plants dictionary) and creates a dataframe of the images with columns image_name, label, species, path.
3. Drops dataframe rows whose image name is duplicated.
4. Makes a random, stratified split of the dataframe: 80% train, 20% test.
5. Moves the test images to test/class_{x} and the train images to train/class_{x}, updating the path in each dataframe.
6. Saves the train and test dataframes as csv.



In [1]:
# Standard D-sci and data wrangling packages
import numpy as np
import pandas as pd

# I/O handling packages
import os
import pickle

# For splitting the files into train and test directories
# based on labels.csv
from sklearn.model_selection import train_test_split

In [2]:
# Use the same dictionary as FlickrPicCollector
plants = {'Heracleum mantegazzianum': 'class_0', 'giant hogweed': 'class_0', 
          'Echium vulgare': 'class_1','blueweed': 'class_1', 
          'Ulex europaeus':'class_2', 'gorse': 'class_2'}

In [3]:
# Get the directory names 
path = 'data/final_BC_images/'
dir_names = os.listdir(path=path)

In [4]:
# Check for success
print(dir_names)

['Heracleum mantegazzianum', 'Echium vulgare', 'Ulex europaeus', 'blueweed', 'gorse', 'giant hogweed']


### Create class directories in final_BC_images
These directories collect the images for each class. Once the images are in place, I'll build a dataframe of them so that I can use train_test_split to divide the data into train and test sets.

In [5]:
# Each class has two search terms listed next to each other in plants,
# so stepping through every 2nd value yields each class_x name exactly once.
for i in range(0, len(plants), 2):
    # path already ends with '/', so no extra separator is needed
    os.mkdir(f'{path}{list(plants.values())[i]}')
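
Stepping through every second value works only while each class has exactly two search terms listed together. A minimal sketch of an alternative, assuming the same `plants` dictionary, derives the unique class names instead. `make_class_dirs` is a hypothetical helper, and a temporary directory stands in for `data/final_BC_images/`.

```python
import os
import tempfile

plants = {'Heracleum mantegazzianum': 'class_0', 'giant hogweed': 'class_0',
          'Echium vulgare': 'class_1', 'blueweed': 'class_1',
          'Ulex europaeus': 'class_2', 'gorse': 'class_2'}

def make_class_dirs(root, mapping):
    """Create one directory per unique class label under root (hypothetical helper)."""
    class_names = sorted(set(mapping.values()))
    for name in class_names:
        # exist_ok=True makes the call safe to re-run
        os.makedirs(os.path.join(root, name), exist_ok=True)
    return class_names

# Temporary directory used as a stand-in for the real image root
demo_root = tempfile.mkdtemp()
created = make_class_dirs(demo_root, plants)
```

This version does not depend on the order of the dictionary or on the number of search terms per class.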

### Move images to their respective class directories

In [6]:
# Row count for the new df index.
row_count = 0
# Create a df frame to hold the image information.
labels_df = pd.DataFrame(columns = ['image_name', 'label', 'species', 'path'])

# Loop over the species (search term) directories
for species_dir in dir_names:
    # List the .jpg images in this directory
    list_images = [file for file in os.listdir(f'data/final_BC_images/{species_dir}') if file.endswith('.jpg')]

    # Move every image into its class directory
    for each_image in list_images:

        # Build the current path and the new path in the class directory
        origin_path_name = f'data/final_BC_images/{species_dir}/{each_image}'
        destination_path_name = f'data/final_BC_images/{plants[species_dir]}/{each_image}'
        # Move the image (on POSIX systems os.rename silently replaces
        # an existing file with the same name)
        os.rename(origin_path_name, destination_path_name)

        # Add a dataframe row for this image.
        labels_df.at[row_count, 'image_name'] = each_image
        labels_df.at[row_count, 'label'] = plants[species_dir]
        labels_df.at[row_count, 'species'] = species_dir
        labels_df.at[row_count, 'path'] = destination_path_name
        row_count += 1

    # Sanity check
    print(f'{species_dir} completed.')

Heracleum mantegazzianum completed.
Echium vulgare completed.
Ulex europaeus completed.
blueweed completed.
gorse completed.
giant hogweed completed.


In [7]:
# Check for success
labels_df.shape

(8840, 4)
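
One thing to be aware of in the move loop: on POSIX systems `os.rename` silently replaces a destination file that already exists (on Windows it raises `FileExistsError` instead). If two species directories hold a file with the same name, the second move overwrites the first. A hedged sketch of a move that skips such collisions and records them follows. `move_without_overwrite` is a hypothetical helper, and the tiny directory tree is made up for the demonstration.

```python
import os
import tempfile

def move_without_overwrite(src, dst):
    """Move src to dst unless dst already exists; return True if moved (hypothetical helper)."""
    if os.path.exists(dst):
        return False
    os.rename(src, dst)
    return True

# Demo: two species directories that both contain 'a.jpg'
root = tempfile.mkdtemp()
for species in ('Echium vulgare', 'blueweed'):
    os.makedirs(os.path.join(root, species))
    with open(os.path.join(root, species, 'a.jpg'), 'w') as f:
        f.write(species)
os.makedirs(os.path.join(root, 'class_1'))

skipped = []
for species in ('Echium vulgare', 'blueweed'):
    src = os.path.join(root, species, 'a.jpg')
    dst = os.path.join(root, 'class_1', 'a.jpg')
    if not move_without_overwrite(src, dst):
        # Keep a record of files that were left in place
        skipped.append(src)
```

The `skipped` list then names the files that share a name with an image already moved into the class directory.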

### Duplicate Images
There's a good chance that some images were downloaded more than once under the same filename, because the search gathered each plant by both its common and its scientific name (flickr tags).
I expect some flickr users add both the common and the scientific name as tags. Let's confirm our suspicions.

In [8]:
# Search for images with the image_name duplicated.
labels_df.loc[labels_df['image_name'].duplicated()]

| | image_name | label | species | path |
|---|---|---|---|---|
| 4724 | 50125846317_a87d6c2c55_q.jpg | class_1 | blueweed | data/final_BC_images/class_1/50125846317_a87d6... |
| 4729 | 7542790412_1368eec454_q.jpg | class_1 | blueweed | data/final_BC_images/class_1/7542790412_1368ee... |
| 4732 | 42082099824_142228cd56_q.jpg | class_1 | blueweed | data/final_BC_images/class_1/42082099824_14222... |
| 4733 | 7575657314_4186f04733_q.jpg | class_1 | blueweed | data/final_BC_images/class_1/7575657314_4186f0... |
| 4735 | 19979541003_f723927d11_q.jpg | class_1 | blueweed | data/final_BC_images/class_1/19979541003_f7239... |
| ... | ... | ... | ... | ... |
| 8806 | 41921360414_6dd79209aa_q.jpg | class_0 | giant hogweed | data/final_BC_images/class_0/41921360414_6dd79... |
| 8812 | 3750927481_4b17b723a2_q.jpg | class_0 | giant hogweed | data/final_BC_images/class_0/3750927481_4b17b7... |
| 8817 | 49868056146_2ce22732fa_q.jpg | class_0 | giant hogweed | data/final_BC_images/class_0/49868056146_2ce22... |
| 8818 | 23540839382_f3e269016d_q.jpg | class_0 | giant hogweed | data/final_BC_images/class_0/23540839382_f3e26... |


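There are 662 rows with a duplicated image name. I'll drop all of them now, keeping the first occurrence of each name. Calling drop_duplicates with its default arguments didn't seem to work, because it compares whole rows and these rows match only in the image_name column, so I'll select the duplicated rows by image_name and drop them by index instead.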

In [9]:
# Create a list of index values of the duplicate image name rows.
drop_index = labels_df.loc[labels_df['image_name'].duplicated()].index

In [10]:
# Drop the duplicated images
labels_df.drop(drop_index, inplace = True)

In [11]:
# Check for success.
labels_df.shape

(8178, 4)
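
For reference, `DataFrame.drop_duplicates` compares every column by default, so rows that share an `image_name` but differ in `species` are not treated as duplicates. Its `subset` parameter restricts the comparison to chosen columns. The sketch below uses made-up file names to show the difference; it is an illustration, not the cell that produced the counts above.

```python
import pandas as pd

# Made-up rows: 'a.jpg' was collected under both search terms for class_1
demo_df = pd.DataFrame({
    'image_name': ['a.jpg', 'b.jpg', 'a.jpg'],
    'label': ['class_1', 'class_1', 'class_1'],
    'species': ['Echium vulgare', 'Echium vulgare', 'blueweed'],
})

# Whole-row comparison finds no duplicates because 'species' differs
whole_row = demo_df.drop_duplicates()

# Comparing on image_name only keeps the first occurrence of each name
by_name = demo_df.drop_duplicates(subset='image_name', keep='first')
```

With `keep='first'` the result matches dropping the index values returned by `duplicated()`, which is what the cells above do.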

In [12]:
# use labels_df to split the data between test and train sets. 
train, test = train_test_split(labels_df, stratify = labels_df['label'], test_size = 0.2)

In [13]:
# Check train shape
train.shape

(6542, 4)

In [14]:
# Check test shape
test.shape

(1636, 4)
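
The split above is made without a fixed seed, so re-running the cell gives a different train/test assignment each time; passing an integer `random_state` to `train_test_split` would make it repeatable. To illustrate what stratifying on the label does, here is a standard-library sketch. It is not scikit-learn's implementation, `stratified_split` is a hypothetical helper, and the label counts are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out test_size of each class (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Hold out the same fraction of every class
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up, imbalanced labels
labels = ['class_0'] * 10 + ['class_1'] * 20 + ['class_2'] * 30
train_idx, test_idx = stratified_split(labels)
```

Each class contributes the same fraction of its images to the test set, so the class proportions in train and test mirror those of the full dataset.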

In [15]:
# Make a copy of test
test = test.copy()

In [16]:
# Sort rows by their original index, then renumber them from 0
test.sort_index(inplace=True, ignore_index=True)

In [17]:
# Make a copy of train
train = train.copy()

In [18]:
# Sort rows by their original index, then renumber them from 0
train.sort_index(inplace=True, ignore_index=True)

In [19]:
# Create a list of class categories to loop over.
categories = ['class_0', 'class_1', 'class_2']

In [20]:
# This function creates the directory at path (including any missing parent directories) if it doesn't already exist.
def create_folder(path):
    os.makedirs(path, exist_ok=True)

In [21]:
# Creates test and train directories in final_BC_images and makes class_0, class_1 and class_2 directories in test and train.
create_folder('data/final_BC_images/test')
create_folder('data/final_BC_images/train')
for cat in categories:
    create_folder(f'data/final_BC_images/test/{cat}')
    create_folder(f'data/final_BC_images/train/{cat}')

### Move our test images to correct class directory in test

In [22]:
# Move the test images for each class into test/class_x
for each_category in categories: 
    # Image names in the test set that belong to this class
    images_list = test.loc[test['label'] == each_category, 'image_name']
    
    for each_image in images_list:
        # Build the current path and the new path in the test set
        origin_path_name = (f'data/final_BC_images/{each_category}/{each_image}')
        destination_path_name = (f'data/final_BC_images/test/{each_category}/{each_image}')
        os.rename(origin_path_name, destination_path_name)
        # Update the path in the row for this image (image names are unique after dropping duplicates).
        index = test.loc[test['image_name'] == each_image].index[0]
        test.at[index, 'path'] = destination_path_name
    
    # Sanity check
    print(f'{each_category} completed.')

class_0 completed.
class_1 completed.
class_2 completed.


In [28]:
# Check path is updated correctly, look at some in class_0 and class_1
test[test['label']== 'class_0']

| | image_name | label | species | path |
|---|---|---|---|---|
| 0 | 28240243393_0a6232635d_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/test/class_0/28240243393_... |
| 1 | 3627783867_10cf8b628f_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/test/class_0/3627783867_1... |
| 2 | 4730020126_b94cda9ca7_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/test/class_0/4730020126_b... |
| 3 | 19080833602_5a74c4bcf1_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/test/class_0/19080833602_... |
| 4 | 28170589951_fa2a2cdf50_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/test/class_0/28170589951_... |
| ... | ... | ... | ... | ... |
| 1631 | 2467173809_2f43305682_q.jpg | class_0 | giant hogweed | data/final_BC_images/test/class_0/2467173809_2... |
| 1632 | 15873789434_fe91de360e_q.jpg | class_0 | giant hogweed | data/final_BC_images/test/class_0/15873789434_... |
| 1633 | 19717382755_fe146816f8_q.jpg | class_0 | giant hogweed | data/final_BC_images/test/class_0/19717382755_... |
| 1634 | 15749970639_986c31a13b_q.jpg | class_0 | giant hogweed | data/final_BC_images/test/class_0/15749970639_... |


In [27]:
# Ensure class 1 images are in test/class_1
test[test['label']=='class_1']

| | image_name | label | species | path |
|---|---|---|---|---|
| 213 | 5940669507_eb79ffd1c5_q.jpg | class_1 | Echium vulgare | data/final_BC_images/test/class_1/5940669507_e... |
| 214 | 13942609625_e59fc45ba1_q.jpg | class_1 | Echium vulgare | data/final_BC_images/test/class_1/13942609625_... |
| 215 | 50148381833_1ac0f3d4c3_q.jpg | class_1 | Echium vulgare | data/final_BC_images/test/class_1/50148381833_... |
| 216 | 3748304901_282646b82e_q.jpg | class_1 | Echium vulgare | data/final_BC_images/test/class_1/3748304901_2... |
| 217 | 530729429_8899e26085_q.jpg | class_1 | Echium vulgare | data/final_BC_images/test/class_1/530729429_88... |
| ... | ... | ... | ... | ... |
| 1039 | 7924958704_0aa57e73a4_q.jpg | class_1 | blueweed | data/final_BC_images/test/class_1/7924958704_0... |
| 1040 | 50108030358_482b281fef_q.jpg | class_1 | blueweed | data/final_BC_images/test/class_1/50108030358_... |
| 1041 | 3752976661_516f22e060_q.jpg | class_1 | blueweed | data/final_BC_images/test/class_1/3752976661_5... |
| 1042 | 35128706121_6e804642c8_q.jpg | class_1 | blueweed | data/final_BC_images/test/class_1/35128706121_... |


### Move our train images to correct class directory in train

In [29]:
# Move the train images for each class into train/class_x
for each_category in categories: 
    # Image names in the train set that belong to this class
    images_list = train.loc[train['label'] == each_category, 'image_name']
    
    for each_image in images_list:
        # Build the current path and the new path in the train set
        origin_path_name = (f'data/final_BC_images/{each_category}/{each_image}')
        destination_path_name = (f'data/final_BC_images/train/{each_category}/{each_image}')
        os.rename(origin_path_name, destination_path_name)
        # Update the path in the row for this image (image names are unique after dropping duplicates).
        index = train.loc[train['image_name'] == each_image].index[0]
        train.at[index, 'path'] = destination_path_name
    
    # Sanity check
    print(f'{each_category} completed.')

class_0 completed.
class_1 completed.
class_2 completed.


In [30]:
# Check path is updated correctly, look at some in class_0 and class_1
train[train['label']== 'class_0']

| | image_name | label | species | path |
|---|---|---|---|---|
| 0 | 49756928156_cba2d1afa8_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/train/class_0/49756928156... |
| 1 | 18775248288_034eb329c7_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/train/class_0/18775248288... |
| 2 | 8784314352_eca9f75ca4_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/train/class_0/8784314352_... |
| 3 | 19561290949_f1ddc5f760_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/train/class_0/19561290949... |
| 4 | 50054511647_b04b6727a0_q.jpg | class_0 | Heracleum mantegazzianum | data/final_BC_images/train/class_0/50054511647... |
| ... | ... | ... | ... | ... |
| 6537 | 3772390207_4a8948f59f_q.jpg | class_0 | giant hogweed | data/final_BC_images/train/class_0/3772390207_... |
| 6538 | 19668664122_4248faf9e2_q.jpg | class_0 | giant hogweed | data/final_BC_images/train/class_0/19668664122... |
| 6539 | 5529662465_4503189515_q.jpg | class_0 | giant hogweed | data/final_BC_images/train/class_0/5529662465_... |
| 6540 | 50034462801_515decf1f2_q.jpg | class_0 | giant hogweed | data/final_BC_images/train/class_0/50034462801... |


In [31]:
train[train['label']== 'class_1']

| | image_name | label | species | path |
|---|---|---|---|---|
| 803 | 9281721651_af6407f500_q.jpg | class_1 | Echium vulgare | data/final_BC_images/train/class_1/9281721651_... |
| 804 | 50125846317_a87d6c2c55_q.jpg | class_1 | Echium vulgare | data/final_BC_images/train/class_1/50125846317... |
| 805 | 3671233921_c4cffaa618_q.jpg | class_1 | Echium vulgare | data/final_BC_images/train/class_1/3671233921_... |
| 806 | 32623442823_1f56ec6d5e_q.jpg | class_1 | Echium vulgare | data/final_BC_images/train/class_1/32623442823... |
| 807 | 2498144310_3edb431bf9_q.jpg | class_1 | Echium vulgare | data/final_BC_images/train/class_1/2498144310_... |
| ... | ... | ... | ... | ... |
| 4011 | 18608384218_315b1d3808_q.jpg | class_1 | blueweed | data/final_BC_images/train/class_1/18608384218... |
| 4012 | 5882309302_aeac6ca078_q.jpg | class_1 | blueweed | data/final_BC_images/train/class_1/5882309302_... |
| 4013 | 49787383171_fe64b53126_q.jpg | class_1 | blueweed | data/final_BC_images/train/class_1/49787383171... |
| 4014 | 14762065069_e5ce4cef18_q.jpg | class_1 | blueweed | data/final_BC_images/train/class_1/14762065069... |


In [35]:
train[train['label']== 'class_2']

| | image_name | label | species | path |
|---|---|---|---|---|
| 2623 | 49731019722_d90a41850e_q.jpg | class_2 | Ulex europaeus | data/final_BC_images/train/class_2/49731019722... |
| 2624 | 3040888929_10f3d861a5_q.jpg | class_2 | Ulex europaeus | data/final_BC_images/train/class_2/3040888929_... |
| 2625 | 26091835421_bb60374a58_q.jpg | class_2 | Ulex europaeus | data/final_BC_images/train/class_2/26091835421... |
| 2626 | 41016471964_c051389df7_q.jpg | class_2 | Ulex europaeus | data/final_BC_images/train/class_2/41016471964... |
| 2627 | 6990430645_74f29d3ae4_q.jpg | class_2 | Ulex europaeus | data/final_BC_images/train/class_2/6990430645_... |
| ... | ... | ... | ... | ... |
| 5315 | 8507668458_8880a27841_q.jpg | class_2 | gorse | data/final_BC_images/train/class_2/8507668458_... |
| 5316 | 4598400008_2786970dec_q.jpg | class_2 | gorse | data/final_BC_images/train/class_2/4598400008_... |
| 5317 | 10959255525_595fb1f116_q.jpg | class_2 | gorse | data/final_BC_images/train/class_2/10959255525... |
| 5318 | 4888603061_c9a6d1055a_q.jpg | class_2 | gorse | data/final_BC_images/train/class_2/4888603061_... |
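
Eyeballing the first and last rows confirms the path format, but not that every file actually landed where the dataframe says. A small sketch of a programmatic check, using only the standard library, is below. `missing_paths` is a hypothetical helper, and the two demo files are made up.

```python
import os
import tempfile

def missing_paths(paths):
    """Return the paths that do not point to an existing file (hypothetical helper)."""
    return [p for p in paths if not os.path.isfile(p)]

# Demo: one file that exists and one path that was never created
demo_dir = tempfile.mkdtemp()
real_file = os.path.join(demo_dir, 'present.jpg')
open(real_file, 'w').close()
ghost_file = os.path.join(demo_dir, 'absent.jpg')

not_found = missing_paths([real_file, ghost_file])
```

In this notebook, `missing_paths(train['path'])` and `missing_paths(test['path'])` should both come back empty if every move succeeded.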


### Class Imbalance in Training Set
There are:
* 2,025 images in class_0,
* 2,090 images in class_1,
* 2,427 images in class_2.
<br>
<br>
class_0 and class_1 are nearly balanced (class_1 has about 3% more images), while class_2 has about 20% more images than class_0.
The fastest and simplest way to address this imbalance is to use the GUI to select and remove 400 images from class_2.
While I'm at it, I'll also remove approximately 65 images from class_1.
In future steps, I'll use the dataframe to manage this process, as it's good to keep it in sync with my data.
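
The dataframe-driven approach mentioned above could look something like the following sketch, which randomly downsamples every class to the size of the smallest one and reports which rows were left out. `downsample_to_smallest` is a hypothetical helper, the file names are made up, and the fixed seed is an assumption added for repeatability.

```python
import pandas as pd

def downsample_to_smallest(df, label_col='label', seed=42):
    """Randomly keep as many rows per class as the smallest class has (hypothetical helper)."""
    n_min = df[label_col].value_counts().min()
    parts = [group.sample(n=n_min, random_state=seed)
             for _, group in df.groupby(label_col)]
    balanced = pd.concat(parts).sort_index()
    # Rows that were not kept; their files are the ones to delete or set aside
    removed = df.drop(balanced.index)
    return balanced, removed

# Made-up, imbalanced training frame
demo_train = pd.DataFrame({
    'image_name': [f'img_{i}.jpg' for i in range(9)],
    'label': ['class_0'] * 2 + ['class_1'] * 3 + ['class_2'] * 4,
})
balanced, removed = downsample_to_smallest(demo_train)
```

The `removed` frame lists exactly the images to take out of the class directories, so the csv and the files on disk stay in step. The helper assumes the index is unique, as it is after the `sort_index(ignore_index=True)` calls earlier.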

### Saving our test and train dataframes
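I'll use df.to_csv to write our dataframes to 'data/final_BC_images/test.csv' and 'data/final_BC_images/train.csv'.
<br>
These will be saved for future reference, though train.csv will be out of step with the files on disk: it still lists the 400 class_2 images and the roughly 65 class_1 images that were removed manually.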

In [33]:
test.to_csv('data/final_BC_images/test.csv', header=True)

In [34]:
train.to_csv('data/final_BC_images/train.csv', header=True)

### Success, now to next steps
That worked great. Now the final_BC_images directory needs a tidy-up.
<br>
I'll handle that via the command line.
<br>
I want to remove the empty species directories. The class directories directly under final_BC_images can also be removed.
<br>
I'll also zip up the cleaned directory and transfer it to S3 so that I can access it from AWS SageMaker Notebooks.
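
The tidy-up could also be done from Python rather than the command line. The sketch below removes empty directories directly under a root and zips what remains, using only the standard library. `remove_empty_dirs` is a hypothetical helper, the demo tree is made up, and the S3 transfer is not shown.

```python
import os
import shutil
import tempfile

def remove_empty_dirs(root):
    """Delete empty directories directly under root and return their names (hypothetical helper)."""
    removed = []
    for name in sorted(os.listdir(root)):
        full = os.path.join(root, name)
        if os.path.isdir(full) and not os.listdir(full):
            os.rmdir(full)  # os.rmdir only succeeds on empty directories
            removed.append(name)
    return removed

# Demo tree standing in for data/final_BC_images/
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'gorse'))  # empty species directory
os.makedirs(os.path.join(root, 'train', 'class_2'))
open(os.path.join(root, 'train', 'class_2', 'x.jpg'), 'w').close()

removed = remove_empty_dirs(root)

# Write the archive outside the tree that is being archived
archive_base = os.path.join(tempfile.mkdtemp(), 'final_BC_images')
archive_path = shutil.make_archive(archive_base, 'zip', root_dir=root)
```

Because `os.rmdir` refuses to delete a directory that still has content, a directory with a leftover file is kept rather than silently lost.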