This notebook performs a simple train/cv/test split for the very first classification model.

In [2]:
from pathlib import Path
import shutil

In [3]:
BASE_PATH = Path("~/data/labeled").expanduser()
DEST_BASE_PATH = BASE_PATH.parent / "split"
DEST_BASE_PATH.mkdir(exist_ok=True)

In [4]:
label_pos = "BirdHome"
label_neg = "BirdRoaming"

In [5]:
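def glob_print(label):
    path = BASE_PATH / label
    # Path.glob returns files in arbitrary order, so sort for a deterministic,
    # ordered list. Assumption: filenames sort chronologically (e.g. timestamps),
    # which the time series split below relies on.
    files = sorted(path.glob("*.jpeg"))
    print(f"{path.name}: {len(files)} images")
    return files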

In [6]:
files_pos = glob_print(label_pos)
files_neg = glob_print(label_neg)

BirdHome: 4221 images
BirdRoaming: 1444 images


OK, it's not balanced. No worries: I still want to use all of the data rather than subsample. I just have to be a bit careful when interpreting accuracy.
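
To put a number on "careful", here is a quick sketch of the accuracy a trivial model would get by always predicting the majority class. The counts are copied from the output above; the variable names are only for illustration.

```python
# Image counts per label, copied from the cell output above.
counts = {"BirdHome": 4221, "BirdRoaming": 1444}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that ignores the image and always predicts the majority label.
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting {majority_label}: {baseline_accuracy:.1%} accuracy")
```

Any real model should be judged against this baseline (roughly 74.5%) rather than against 50%.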

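The split is a simple time series split: for each class, the first 60% of the images go to train, the next 20% to cv and the last 20% to test. Random sampling would put near-identical consecutive images in both train and test, inflating the result.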
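
A toy illustration of that leakage argument, not run on the real images: integer frame indices stand in for capture time, and `adjacent_leaks` is a hypothetical helper that counts test frames whose direct neighbour in time ended up in train.

```python
import random

# Frame index stands in for capture time; neighbouring indices are "similar" images.
frames = list(range(100))

def adjacent_leaks(train, test):
    """Count test frames whose immediate neighbour in time is in the training set."""
    train_set = set(train)
    return sum(1 for f in test if f - 1 in train_set or f + 1 in train_set)

# Chronological split: first 80 frames for training, last 20 for testing.
chrono_train, chrono_test = frames[:80], frames[80:]

# Random split with the same sizes (fixed seed for repeatability).
shuffled = frames[:]
random.Random(0).shuffle(shuffled)
rand_train, rand_test = shuffled[:80], shuffled[80:]

print("chronological:", adjacent_leaks(chrono_train, chrono_test))
print("random:       ", adjacent_leaks(rand_train, rand_test))
```

With the chronological split only the single frame at the train/test boundary has a neighbour in train. With the random split, most test frames typically do.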

In [7]:
SPLIT = [.6, .2, .2]
SPLIT_NAMES = ["train", "cv", "test"]

In [8]:
for files in [files_neg, files_pos]:
    label = files[0].parent.name

    n = len(files)
    current_idx = 0
    for split_frac, split_name in zip(SPLIT, SPLIT_NAMES):
        dest_path = DEST_BASE_PATH / split_name / label
        dest_path.mkdir(exist_ok=True, parents=True)

        # Take the next contiguous block of files, preserving their order.
        num_to_select = round(n * split_frac)
        sel_files = files[current_idx:current_idx + num_to_select]
        print(f"Selected {len(sel_files)} jpegs to copy to {dest_path}")

        for jpeg in sel_files:
            shutil.copy(jpeg, dest_path)

        current_idx += num_to_select

Selected 866 jpegs to copy to /home/jvlier/data/split/train/BirdRoaming
Selected 289 jpegs to copy to /home/jvlier/data/split/cv/BirdRoaming
Selected 289 jpegs to copy to /home/jvlier/data/split/test/BirdRoaming
Selected 2533 jpegs to copy to /home/jvlier/data/split/train/BirdHome
Selected 844 jpegs to copy to /home/jvlier/data/split/cv/BirdHome
Selected 844 jpegs to copy to /home/jvlier/data/split/test/BirdHome
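
A quick sanity check on the rounding, as a sketch: `round(n * frac)` per split is not guaranteed to add up to `n`. The helper `split_sizes` is hypothetical and only mirrors the size computation in the loop above; `n=7` is a made-up count to show the failure case.

```python
def split_sizes(n, fractions):
    """Per-split sizes as computed by round(n * fraction)."""
    return [round(n * frac) for frac in fractions]

SPLIT = [.6, .2, .2]

# 1444 and 4221 are the class counts from this notebook; 7 is an illustrative edge case.
for n in (1444, 4221, 7):
    sizes = split_sizes(n, SPLIT)
    print(f"n={n}: sizes={sizes}, total={sum(sizes)}, unassigned={n - sum(sizes)}")
```

For the two class counts here the sizes add up exactly, so every image landed in one of the splits. For other counts the total can fall short by a file or two (as with `n=7`), in which case letting the last split take whatever remains would be a safer choice.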
