## Copy Images and Split Them into Train and Test Folders for YOLOv5

YOLOv5 expects a different folder layout than the Keras models we used. The Keras models needed one subfolder per class so that ImageDataGenerator and flow_from_directory could infer the labels. YOLOv5 instead needs the images split into a single train folder and a single test folder, with no class subfolders. Each image is matched to its class through a label saved in a txt file. For information on how we created the txt files, please look at the Yolov5 txt files notebook in this repository. We used this site as a reference for the YOLOv5 structure: https://towardsai.net/p/computer-vision/yolo-v5-object-detection-on-a-custom-dataset. More about the structure can be found online and in the README.md.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import os
import numpy as np
from collections import defaultdict
import shutil
import random
import pandas as pd

In [None]:
#file paths
images_file_path = '/content/gdrive/MyDrive/yolo_data/AML_Cytomorphology/images' #these are the original images from TCIA. We copied the original data into a separate yolo folder.
labels_file_path = '/content/gdrive/MyDrive/yolo_data/AML_Cytomorphology/labels' 
train_images_path = '/content/gdrive/MyDrive/haley_yolo_split_data/images/train' #path for yolo train images
test_images_path = '/content/gdrive/MyDrive/haley_yolo_split_data/images/test' #path for yolo test images
train_labels_path = '/content/gdrive/MyDrive/haley_yolo_split_data/labels/train' #path for yolo train txt files
test_labels_path = '/content/gdrive/MyDrive/haley_yolo_split_data/labels/test' #path for yolo test txt files
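
The copy steps later in this notebook rely on the four destination folders already existing: when the destination passed to `shutil.copy` is not an existing directory, it is treated as a file name and written to, rather than receiving the file. The following is a minimal sketch of how the folders could be created up front. The helper name `ensure_yolo_dirs` and its `root` argument are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical helper (not from the original notebook): creates the
# images/{train,test} and labels/{train,test} folders under `root`.
def ensure_yolo_dirs(root):
    """Create the four YOLOv5 split folders and return their paths."""
    created = {}
    for kind in ("images", "labels"):
        for split in ("train", "test"):
            path = os.path.join(root, kind, split)
            os.makedirs(path, exist_ok=True)  # no error if the folder already exists
            created[(kind, split)] = path
    return created
```

Called with `'/content/gdrive/MyDrive/haley_yolo_split_data'` as `root`, this would produce the same four paths that are defined above.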

In [None]:
#cell classes - initialize an empty dictionary whose values default to lists
classes = defaultdict(list)

In [None]:
# Create filepath names for all images
file_paths = [os.path.join(images_file_path,file) for file in os.listdir(images_file_path)] #join file name with full path for copying purposes later
len(file_paths) #check to see we have correct number of images to begin with

18365

In [None]:
#Put images in the correct cell class lists in classes dictionary
for file_path in file_paths:
    cell_name = file_path.split("/")[-1][:3] #get the cell name
    classes[cell_name].append(file_path) #add the cell image path to the appropriate cell name key in the cell classes dictionary 

In [None]:
classes.keys() #print cell classes dictionary keys 

dict_keys(['LYT', 'LYA', 'KSC', 'EOS', 'EBO', 'BAS', 'MON', 'MOB', 'MMZ', 'MYO', 'MYB', 'NGS', 'NGB', 'PMO', 'PMB'])
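
The grouping step relies on the first three characters of each file name being the cell class code. As a sketch of the same idea in a reusable form, the function below uses `os.path.basename` instead of splitting on `"/"`. The function name, the `prefix_len` parameter, and the example file names are all hypothetical and chosen only for illustration.

```python
import os
from collections import defaultdict

def group_by_prefix(paths, prefix_len=3):
    """Map each class prefix to the list of paths whose file name starts with it."""
    groups = defaultdict(list)
    for path in paths:
        name = os.path.basename(path)  # file name without the directory part
        groups[name[:prefix_len]].append(path)
    return groups

# Hypothetical file names, made up for illustration only.
example = group_by_prefix([
    "/data/images/LYT_0001.tiff",
    "/data/images/LYT_0002.tiff",
    "/data/images/EOS_0001.tiff",
])
```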

In [None]:
#check number of images in dictionary to make sure there are 18,365. There are. 
total_dic_images = 0
for key in classes.keys():
    total_dic_images += len(classes[key])
total_dic_images

18365

In [None]:
#Copy 90% of each class into the train folder and 10% into the test folder, unless the class has fewer than 40 images, in which case we keep 4 images for the test set and use the rest for training.
for cell_class in classes.keys():
    class_sample = classes[cell_class] #all of the images within a cell class
    random.shuffle(class_sample) #randomly shuffle the image paths (no seed is set, so the split differs between runs)
    len_cell_class = len(class_sample) #number of images for the cell class
    if len_cell_class >= 40:
        num_train = int(np.round(0.9 * len_cell_class)) #num train images (90% of the class)
    else:
        num_train = len_cell_class - 4 #num train images (4 images held out for test)
    train_paths = class_sample[0:num_train] #train images
    test_paths = class_sample[num_train:] #test images
    for file in train_paths:
        shutil.copy(file, train_images_path) #copy train images to train images folder
    for file in test_paths:
        shutil.copy(file, test_images_path) #copy test images to test images folder
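
The split rule can also be written as a small function that does no file copying, which makes it easy to check and to reproduce. This is only a sketch under a few assumptions that are not in the original notebook: the function name `split_class`, a fixed `seed`, and a guard that keeps the train count from going negative for classes with fewer than 4 images.

```python
import random

def split_class(paths, train_frac=0.9, min_size=40, holdout=4, seed=0):
    """Return (train, test) lists for one class, shuffled reproducibly."""
    rng = random.Random(seed)   # local generator, so the global random state is untouched
    shuffled = list(paths)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    n = len(shuffled)
    if n >= min_size:
        num_train = int(round(train_frac * n))  # 90% of the class goes to train
    else:
        # Assumption: clamp at 0 so tiny classes do not produce a negative slice index.
        num_train = max(n - holdout, 0)
    return shuffled[:num_train], shuffled[num_train:]
```

For a class of 3937 images this gives 3543 train and 394 test paths, and for a class of 11 images it gives 7 and 4, which matches the counts reported further down for LYT and LYA.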
    

In [None]:
#check to see the counts in training and test folders
print(len(os.listdir(train_images_path)))
print(len(os.listdir(test_images_path)))

16517
1848
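
The cells shown here copy only the image files. YOLOv5 looks for each image's label in a parallel `labels` folder, under the same file name with a `.txt` extension. If the label files still need to be placed next to the split images, one way to do it is sketched below. The function name is hypothetical, and the sketch assumes each label file shares its image's file name stem.

```python
import os
import shutil

def copy_matching_labels(images_dir, labels_src, labels_dst):
    """Copy <stem>.txt from labels_src for every image found in images_dir.

    Returns the image file names for which no label file was found.
    """
    os.makedirs(labels_dst, exist_ok=True)
    missing = []
    for image_name in sorted(os.listdir(images_dir)):
        stem, _ext = os.path.splitext(image_name)
        label_file = os.path.join(labels_src, stem + ".txt")
        if os.path.isfile(label_file):
            shutil.copy(label_file, labels_dst)  # copy the label next to its split
        else:
            missing.append(image_name)           # report it rather than fail
    return missing
```

With the paths defined earlier, `copy_matching_labels(train_images_path, labels_file_path, train_labels_path)` would fill the train labels folder, and the same call with the test paths would fill the test one. An empty return value means every image had a label.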


In [None]:
#Look at breakdown of images in training set and test set by cell type 
train_dic = defaultdict(int)
test_dic = defaultdict(int)
for file in os.listdir(train_images_path):
    train_dic[file[:3]] += 1 
for file in os.listdir(test_images_path):
    test_dic[file[:3]] +=1 

In [None]:
print(train_dic)
print(test_dic)

defaultdict(<class 'int'>, {'LYT': 3543, 'LYA': 7, 'KSC': 11, 'EOS': 382, 'EBO': 70, 'BAS': 71, 'MON': 1610, 'MOB': 22, 'MMZ': 11, 'MYO': 2941, 'MYB': 38, 'NGS': 7636, 'NGB': 98, 'PMO': 63, 'PMB': 14})
defaultdict(<class 'int'>, {'LYT': 394, 'LYA': 4, 'KSC': 4, 'EOS': 42, 'EBO': 8, 'BAS': 8, 'MON': 179, 'MOB': 4, 'MMZ': 4, 'MYO': 327, 'MYB': 4, 'NGS': 848, 'NGB': 11, 'PMO': 7, 'PMB': 4})


In [None]:
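#look up each cell type in both dictionaries so the train and test counts stay aligned
cell_types = list(train_dic.keys())
data = {'Cell Type': cell_types,
        'Train': [train_dic[cell] for cell in cell_types],
        'Test': [test_dic[cell] for cell in cell_types]}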

In [None]:
df = pd.DataFrame(data)
df.sort_values('Train', inplace = True, ascending = False, ignore_index = True)
df

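| | Cell Type | Train | Test |
|---|---|---|---|
| 0 | NGS | 7636 | 848 |
| 1 | LYT | 3543 | 394 |
| 2 | MYO | 2941 | 327 |
| 3 | MON | 1610 | 179 |
| 4 | EOS | 382 | 42 |
| 5 | NGB | 98 | 11 |
| 6 | BAS | 71 | 8 |
| 7 | EBO | 70 | 8 |
| 8 | PMO | 63 | 7 |
| 9 | MYB | 38 | 4 |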


In [None]:
df['Total'] = df['Train'] + df['Test']
df

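| | Cell Type | Train | Test | Total |
|---|---|---|---|---|
| 0 | NGS | 7636 | 848 | 8484 |
| 1 | LYT | 3543 | 394 | 3937 |
| 2 | MYO | 2941 | 327 | 3268 |
| 3 | MON | 1610 | 179 | 1789 |
| 4 | EOS | 382 | 42 | 424 |
| 5 | NGB | 98 | 11 | 109 |
| 6 | BAS | 71 | 8 | 79 |
| 7 | EBO | 70 | 8 | 78 |
| 8 | PMO | 63 | 7 | 70 |
| 9 | MYB | 38 | 4 | 42 |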


In [None]:
total = sum(df['Total'])
total

18365