# Balancing Training Data
The data directory generated by our Gen_Data script is highly biased towards negative images. Let's fix that.

### Basic Imports

In [1]:
%pylab inline
import pandas as pd
import cv2
from cv2 import imread
from matplotlib.pyplot import imshow
import os
from tqdm import tqdm
import shutil

Populating the interactive namespace from numpy and matplotlib


### Run this in terminal:
>cd ~/sara_m/data<br/>
>mkdir ct_balanced<br/>
>cp -R ct_data/. ct_balanced<br/>
>cd ct_balanced<br/>
>mv -v ./valid/1/* ./train/1<br/>
>mv -v ./valid/0/* ./train/0<br/>

Run in order, these commands create a new directory with the same contents as our original data (working on a copy prevents accidental destruction of the originals) and then move all of the image data into the train/ folder.
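For anyone who prefers to stay inside the notebook, the same copy-then-merge step can be sketched in Python. This is only an illustration: `copy_and_merge` and its arguments are hypothetical names, not part of the original pipeline, and it assumes file names are unique across the train/ and valid/ folders.

```python
import os
import shutil

def copy_and_merge(src, dst, labels=("0", "1")):
    """Copy src to dst, then move dst/valid/<label>/* into dst/train/<label>.

    Hypothetical helper; assumes file names do not collide between splits.
    """
    # copytree raises if dst already exists, so an earlier copy is never overwritten
    shutil.copytree(src, dst)
    for label in labels:
        valid_dir = os.path.join(dst, "valid", label)
        train_dir = os.path.join(dst, "train", label)
        for name in os.listdir(valid_dir):
            shutil.move(os.path.join(valid_dir, name),
                        os.path.join(train_dir, name))
```

A call such as `copy_and_merge('./data/ct_data', './data/ct_balanced')` would leave the original directory untouched and the copy with empty valid/ folders.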

In [5]:
data_path = './data/ct_balanced/'
files_net = []
for path, dirs, files in os.walk(data_path):
    if len(dirs)==0:
        print(path)
        print(len(files))

./data/ct_data/valid/1
98
./data/ct_data/valid/0
12301
./data/ct_data/train/1
122
./data/ct_data/train/0
16732


### Addressing Unbalanced Data
Right now there are roughly 130x more negative images than positive (29,033 vs. 220). This makes sense: we generated our dataset from slices of 3D volumetric patient CT scans, and tumors (even in a relatively sick patient) should only appear in a small subset of those slices.
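The size of the imbalance can be checked directly from the counts printed by the listing cell. The snippet below just redoes that arithmetic; the `counts` dictionary is typed in by hand from that output.

```python
# Counts copied by hand from the directory listing output above.
counts = {
    ("valid", 1): 98,
    ("valid", 0): 12301,
    ("train", 1): 122,
    ("train", 0): 16732,
}

n_pos = sum(n for (_, label), n in counts.items() if label == 1)
n_neg = sum(n for (_, label), n in counts.items() if label == 0)
imbalance = n_neg / n_pos  # negatives per positive

print(f"{n_pos} positive, {n_neg} negative, about {imbalance:.0f}x")
# prints: 220 positive, 29033 negative, about 132x
```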

We could build a pipeline to augment only our positive images in an attempt to shrink this discrepancy; however, expanding them by a factor of roughly 100x is asking a lot of augmentation, even at the state of the art. It is better to randomly downsample our negative datapoints and augment the entire image dataset at training time. The next two cells do this:

In [51]:
# First: randomly remove files from train/0 to bring the positive:negative ratio up to 1:3
pos_path = '/home/ubuntu/sara_m/data/ct_balanced/train/1'
neg_path = '/home/ubuntu/sara_m/data/ct_balanced/train/0'
pos_files = os.listdir(pos_path)
n_pos = len(pos_files)
neg_files = os.listdir(neg_path)
n_neg = len(neg_files)

target_ratio = 1/3  # positives per negative
n_keep = int((1/target_ratio)*n_pos)  # negatives to keep: 3 per positive
# Sample the negatives to keep (without replacement); every other negative is deleted
chosen = set(np.random.choice(neg_files, n_keep, replace=False))
to_delete = [os.path.join(neg_path, f) for f in neg_files if f not in chosen]

Current ratio: 220:660 or 0.3333333333333333
We want a ratio of around 1:3, or .33


In [44]:
for f in tqdm(to_delete):
    os.remove(f)

100%|██████████| 25847/25847 [00:01<00:00, 24498.95it/s]
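Because the deletion above cannot be undone, it may be worth computing the selection in a pure function first and inspecting it before anything is removed. The sketch below is one way to do that. `pick_negatives`, `neg_per_pos`, and `seed` are hypothetical names, and it swaps in the standard library's `random.Random` for `np.random.choice` so that the selection is reproducible from a seed.

```python
import random

def pick_negatives(neg_names, n_pos, neg_per_pos=3, seed=0):
    """Return (keep, drop) lists with len(keep) == neg_per_pos * n_pos.

    Hypothetical helper: nothing is deleted here, it only decides what to keep.
    """
    # os.listdir order is arbitrary, so sort to make the seeded draw reproducible
    names = sorted(neg_names)
    n_keep = min(len(names), neg_per_pos * n_pos)
    keep = set(random.Random(seed).sample(names, n_keep))
    drop = [name for name in names if name not in keep]
    return sorted(keep), drop
```

The `drop` list could then be joined with the negative directory path and passed to the same `os.remove` loop once the counts look right.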


### Sanity Check
Let's view the distribution of our data now:

In [69]:
#list data again:
files_net = []
for path, dirs, files in os.walk(data_path):
    if len(dirs)==0:
        print(path)
        print(len(files))

./data/ct_balanced/valid/1
0
./data/ct_balanced/valid/0
0
./data/ct_balanced/train/1
220
./data/ct_balanced/train/0
660


Seems good. This leaves us with a fairly small dataset (880 images); however, augmentation should address that somewhat, and we still have the full dataset in our other directory if we want to train the model further later on.

### Moving Things Back
Now let's move about 25% of the data in each class back to the valid/ directory. We'll use this held-out set to assess the performance of our network.

In [70]:
for i in range(2):
    d = f'./data/ct_balanced/train/{i}'
    dest = f'./data/ct_balanced/valid/{i}'
    
    files = os.listdir(d) 
    n_files = len(files)
    n_move = int(.25*n_files)
    print(f'{d} has {n_files}; move {n_move}')
    to_move = [os.path.join(d,f) for f in np.random.choice(files, n_move, replace=False)]
    print('Moving!')
    for f in tqdm(to_move):
        shutil.move(f, dest)

100%|██████████| 165/165 [00:00<00:00, 28933.49it/s]
100%|██████████| 55/55 [00:00<00:00, 32628.96it/s]

./data/ct_balanced/train/0 has 660; move 165
Moving!
./data/ct_balanced/train/1 has 220; move 55
Moving!
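The split above draws from NumPy's global random state, so rerunning it gives a different validation set each time. A seeded variant might look like the sketch below. `move_fraction` is a hypothetical helper, not part of the original notebook, and it uses the standard library's `random.Random` rather than `np.random.choice`.

```python
import os
import random
import shutil

def move_fraction(src_dir, dst_dir, frac=0.25, seed=0):
    """Move a random `frac` of the files in src_dir to dst_dir; return their names.

    Hypothetical helper; creates dst_dir if it does not exist yet.
    """
    os.makedirs(dst_dir, exist_ok=True)
    names = sorted(os.listdir(src_dir))  # sort so the seeded draw is reproducible
    n_move = int(frac * len(names))
    chosen = random.Random(seed).sample(names, n_move)
    for name in chosen:
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen
```

Calling it once per class folder (train/0 to valid/0, train/1 to valid/1) keeps the 1:3 ratio the same in both splits, since the same fraction is taken from each class.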





### Sanity Check
Let's view the distribution of data one last time:

In [72]:
#list data one last time:
files_net = []
for path, dirs, files in os.walk(data_path):
    if len(dirs)==0:
        print(path)
        print(len(files))

./data/ct_balanced/valid/1
55
./data/ct_balanced/valid/0
165
./data/ct_balanced/train/1
165
./data/ct_balanced/train/0
495
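Rather than reading the ratio off the printout, the same check can be turned into assertions. This is a minimal sketch: `class_counts` and `neg_per_pos` are hypothetical helpers, and the layout `<root>/<split>/<label>` with labels `0` and `1` is assumed from the listings above.

```python
import os

def class_counts(root):
    """Map each leaf directory under root (as a relative path) to its file count."""
    counts = {}
    for path, dirs, files in os.walk(root):
        if not dirs:  # leaf directory, same test as the listing cells
            counts[os.path.relpath(path, root)] = len(files)
    return counts

def neg_per_pos(counts, split):
    """Negatives per positive in one split, e.g. 3.0 for a 1:3 ratio."""
    return counts[os.path.join(split, "0")] / counts[os.path.join(split, "1")]
```

With the counts shown above, `neg_per_pos(class_counts(data_path), 'train')` would be 495 / 165 and the `'valid'` value would be 165 / 55, both equal to 3.0.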


## Done!
Let's move on to training.