This notebook performs a simple train/valid split for the very first classification model.

In [1]:
from pathlib import Path
import shutil

In [2]:
BASE_PATH = Path("~/data").expanduser()
LABELED_PATHS = [
    BASE_PATH / "labeled-first-batch",
    BASE_PATH / "labeled-second-batch",
]
DEST_PATH = BASE_PATH / "split-v2-random"
DEST_PATH.mkdir(exist_ok=True)

In [3]:
label_birdhome = "BirdHome"
label_birdroam = "BirdRoaming"

In [4]:
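def glob_print(label):
    """Collect all JPEGs for `label` across LABELED_PATHS and print the count."""
    result = []
    for labeled_path in LABELED_PATHS:
        path = labeled_path / label
        result.extend(path.glob("*.jpeg"))

    num = len(result)
    # Print the label itself rather than relying on the last loop variable.
    print(f"{label}: {num} images")
    return result, num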

In [5]:
files_birdhome, num_birdhome = glob_print(label_birdhome) 
files_birdroam, num_birdroam = glob_print(label_birdroam)

BirdHome: 9050 images
BirdRoaming: 1700 images


Ok, it's not balanced. No worries. I still want to use all of the data rather than subsample. Just gotta be a bit careful when interpreting accuracy.

In [6]:
total = num_birdhome + num_birdroam
baseline_accuracy = max(num_birdhome, num_birdroam) / total
print(f"Baseline accuracy (always predict majority) is: {baseline_accuracy:.3f}")

Baseline accuracy (always predict majority) is: 0.842
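To make the "careful interpreting accuracy" point concrete, here is a minimal sketch that compares plain accuracy with balanced accuracy (the mean of the per-class recalls) for a predictor that always outputs the majority class. The helpers `per_class_recall` and `balanced_accuracy` are hypothetical names introduced for illustration; they are not part of this notebook. The label lists are synthetic and only reuse the class counts printed above.

```python
def per_class_recall(true_labels, predicted_labels):
    """Return {label: recall} for every label that occurs in true_labels."""
    totals, hits = {}, {}
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == pred:
            hits[truth] = hits.get(truth, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


def balanced_accuracy(true_labels, predicted_labels):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = per_class_recall(true_labels, predicted_labels)
    return sum(recalls.values()) / len(recalls)


# Synthetic labels that reuse the class counts reported above.
y_true = ["BirdHome"] * 9050 + ["BirdRoaming"] * 1700
# A predictor that always answers with the majority class.
y_pred = ["BirdHome"] * len(y_true)

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy:          {plain_accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.3f}")
```

The majority-only predictor reaches the 0.842 baseline on plain accuracy but only 0.5 on balanced accuracy, which is why a model should be judged against both numbers.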


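Note that the split cell below samples at random within each class (np.random.choice, hence the "split-v2-random" destination), not by time. A time series split is the safer option: random sampling puts many near-identical images in both train and valid, which inflates the validation result.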
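A time-ordered split of the kind mentioned above could be sketched as follows. This is only an illustration: `chronological_split` is a hypothetical helper, and it assumes that sorting by file name reflects capture order (for example because the names start with a timestamp), which this notebook does not confirm.

```python
from pathlib import Path


def chronological_split(files, train_frac=0.8, key=None):
    """Split files into (train, valid) by order rather than at random.

    Assumption: sorting by `key` (default: the file name) reflects capture
    order, e.g. because the names start with a timestamp.
    """
    ordered = sorted(files, key=key or (lambda f: Path(f).name))
    cut = round(len(ordered) * train_frac)
    # Everything before the cut is older and goes to train; the rest to valid.
    return ordered[:cut], ordered[cut:]


# Hypothetical file names, for illustration only.
example = [Path(f"img_{i:04d}.jpeg") for i in (3, 0, 4, 1, 2)]
train, valid = chronological_split(example)
print([f.name for f in train])
print([f.name for f in valid])
```

If the names carry no ordering, passing `key=lambda f: f.stat().st_mtime` would order by modification time instead, provided the timestamps survived earlier copies.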

Not including a test set this time around, because I'm not going to do any excessive model tuning and will be validating and constantly improving the model (active learning) once it's in production. I'm aware of the risk of overfitting to the valid set.

In [7]:
SPLIT = [.8, .2]
SPLIT_NAMES = ["train", "valid"]

In [8]:
import numpy as np

In [11]:
for files in [files_birdhome, files_birdroam]:
    label = files[0].parent.name

    n = len(files)  # class size, fixed before any files are removed below
    for split_frac, split_name in zip(SPLIT, SPLIT_NAMES):
        dest_path = DEST_PATH / split_name / label
        dest_path.mkdir(exist_ok=True, parents=True)

        # Draw this split's share at random, without replacement (unseeded).
        num_to_select = round(n * split_frac)
        sel_files = np.random.choice(files, size=num_to_select, replace=False)
        sel_files_set = set(sel_files)

        # Remove the drawn files so the next split cannot pick them again.
        files = [f for f in files if f not in sel_files_set]

        print(f"Selected {len(sel_files)} jpegs to copy to {dest_path}")

        for jpeg in sel_files:
            shutil.copy(jpeg, dest_path)
        

Selected 7240 jpegs to copy to /home/jvlier/data/split-v2-random/train/BirdHome
Selected 1810 jpegs to copy to /home/jvlier/data/split-v2-random/valid/BirdHome
Selected 1360 jpegs to copy to /home/jvlier/data/split-v2-random/train/BirdRoaming
Selected 340 jpegs to copy to /home/jvlier/data/split-v2-random/valid/BirdRoaming
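The loop above is unseeded, so rerunning it produces a different split each time. A reproducible variant could look like the sketch below. It is an alternative, not the code that produced the output above: `random_split` is a hypothetical helper, it uses a seeded `np.random.default_rng` instead of `np.random.choice`, and the file names are made up.

```python
import numpy as np


def random_split(files, fractions, seed=0):
    """Shuffle files once with a seeded generator and cut them into consecutive chunks."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(files))
    shuffled = [files[i] for i in order]

    chunks, start = [], 0
    for frac in fractions:
        stop = start + round(len(files) * frac)
        chunks.append(shuffled[start:stop])
        start = stop
    return chunks


# Hypothetical file names standing in for the globbed paths.
names = [f"img_{i:05d}.jpeg" for i in range(1700)]
train, valid = random_split(names, [0.8, 0.2], seed=42)

print(len(train), len(valid))
# Sanity check: no image ends up in both splits.
assert set(train).isdisjoint(valid)
```

Because the chunks are consecutive slices of one shuffled list, they are disjoint by construction, and the same seed always yields the same split.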
