# Image Sorting

The dataset from Kaggle is organized as follows: each patient has their own folder, which contains subfolders labeled 0 and 1 that denote whether or not the images inside contain cancer. While this is a nice way to organize things, I would like to use the split-folders package, which requires that all images be sorted into just two folders, 0 and 1, for our target.

You can read more about split-folders <a href="https://pypi.org/project/split-folders/">here.</a>

Note: I have copied the images in 2 separate folders.

In [1]:
import os 
import shutil
from glob import glob
import fnmatch
import split_folders
import random

We need to sort all of our images into testing, training, and validation sets.

First let's use glob to grab all of our images.

In [2]:
# recursive=True lets '**' match any number of nested folders under each patient folder
images = glob('data/cancer/***/**/*.png', recursive=True)
len(images)

277524

Now we need to sort our data into positive and negative for our target.

All images denote their class at the end of their filename so this will be an easy task.

In [3]:
# Patterns that separate negative (class0) from positive (class1) images.
filtzero = '*class0.png'
filtone = '*class1.png'
# fnmatch.filter keeps only the paths in our list that match each pattern.
zero = fnmatch.filter(images, filtzero)
one = fnmatch.filter(images, filtone)
# Confirm that the two classes together account for every image.
print(len(zero))
print(len(one))
print(len(zero)+len(one))

198738
78786
277524
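
As a complement to the wildcard filters above, the class can also be read directly from a filename. This is a minimal sketch that assumes every filename ends in class0.png or class1.png, as described earlier; the helper name `label_from_path` and the example path are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumes filenames end in "class0.png" or "class1.png".
CLASS_SUFFIX = re.compile(r"class([01])\.png$")

def label_from_path(path):
    """Return 0 or 1 from the filename suffix, or None if it does not match."""
    match = CLASS_SUFFIX.search(PurePosixPath(path).name)
    return int(match.group(1)) if match else None

# Hypothetical path, for illustration only.
example = "data/cancer/patient_a/1/patch_class1.png"
print(label_from_path(example))
```

Returning None for unmatched names makes it easy to spot any file that follows a different naming scheme.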


Now we will move the images to positive and negative folders so that we can use the split-folders package to further sort into training, testing and validation sets.

We created a new folder prior to this step called "sorted" that contains subfolders named "0" and "1".
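
If the destination folders do not exist yet, they could also be created from Python rather than by hand. The sketch below is one possible approach; `make_class_dirs` is a hypothetical helper, and it is demonstrated on a temporary directory so it does not touch the real data.

```python
import os
import tempfile

def make_class_dirs(root, classes=("0", "1")):
    """Create root/<class> for each class label; existing folders are left alone."""
    created = []
    for name in classes:
        path = os.path.join(root, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created

# Demonstrated on a temporary directory; in this notebook the root would be 'data/sorted'.
with tempfile.TemporaryDirectory() as tmp:
    print(make_class_dirs(os.path.join(tmp, "sorted")))
```

Because `exist_ok=True` is set, running the cell a second time is harmless.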

In [4]:
#move the zeroes
dest0 = 'data/sorted/0/'

# Loop over the list of class-0 images and move each one into the 0 folder.
for f in zero:
    shutil.move(f, dest0)

In [6]:
#move the ones
dest1 = 'data/sorted/1/'

for f in one:
    shutil.move(f, dest1)
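
Flattening per-patient folders into a single folder only works if no two images share a filename, since `shutil.move` raises an error when a file with the same name already exists in the destination directory. Before running moves like the ones above, a quick check along these lines may be useful. The helper `duplicate_basenames` and the sample paths are hypothetical.

```python
import os
from collections import Counter

def duplicate_basenames(paths):
    """Return the filenames that appear more than once across different folders."""
    counts = Counter(os.path.basename(p) for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical paths, for illustration only.
sample = [
    "data/cancer/a/0/img_class0.png",
    "data/cancer/b/0/img_class0.png",
    "data/cancer/b/1/other_class1.png",
]
print(duplicate_basenames(sample))
```

An empty list means every filename is unique and the images can be moved into one folder safely.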

Now that the images have been sorted, we use split-folders to divide them into training, testing, and validation sets: 70% training, 15% testing, and 15% validation.

In [7]:
split_folders.ratio('data/sorted', output='data/split', ratio=(.7, .15, .15))

Copying files: 277524 files [08:58, 515.24 files/s]
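
To make the idea of a ratio split concrete, here is a small standard-library sketch of a seeded 70/15/15 split over a list of filenames. It is an independent illustration, not the code split-folders uses internally; the function name `ratio_split`, the seed value, and the generated filenames are all assumptions.

```python
import random

def ratio_split(items, ratio=(0.7, 0.15, 0.15), seed=0):
    """Shuffle items with a fixed seed and cut them into three consecutive parts."""
    items = sorted(items)  # fixed starting order so the same seed gives the same split
    random.Random(seed).shuffle(items)
    n_first = int(len(items) * ratio[0])
    n_second = int(len(items) * ratio[1])
    first = items[:n_first]
    second = items[n_first:n_first + n_second]
    third = items[n_first + n_second:]  # whatever remains goes to the last part
    return first, second, third

# Hypothetical filenames, for illustration only.
parts = ratio_split([f"img_{i}.png" for i in range(100)])
print([len(p) for p in parts])
```

Applying a function like this to each class folder separately would keep the share of positive and negative images roughly the same in every set.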


All files have now been copied into data/split and are ready for preprocessing.