In [1]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
from stability.data import random_split

First, we list all the tiles. We also read the `y.csv` files, because they include a column identifying the subject each tile came from. We could recover this by string manipulation on the paths, but an index file is more reliable.

In [2]:
data_dir = Path(os.environ["DATA_DIR"]) / "tnbc"

# glob() returns lazy generators, so each one can be iterated only once.
paths = {
    "tile": data_dir.glob("**/*.npy"),
    "y": data_dir.glob("**/y*.csv")
}

# Read every y*.csv index file into a list of DataFrames.
y = [pd.read_csv(p) for p in paths["y"]]
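
The string-manipulation alternative mentioned above could look roughly like the sketch below. It assumes tile file names follow the `{subject}_{w}-{h}.npy` pattern used for `rel_path` later in this notebook; `parse_tile_name` and `TILE_PATTERN` are hypothetical names for illustration, not part of `stability.data`.

```python
import re
from pathlib import Path

# Assumed naming convention: <subject>_<w>-<h>.npy, e.g. p9_128-0.npy
TILE_PATTERN = re.compile(r"^(?P<i>.+)_(?P<w>\d+)-(?P<h>\d+)\.npy$")


def parse_tile_name(path):
    """Return (subject, w, h) parsed from a tile path, or raise ValueError."""
    match = TILE_PATTERN.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected tile name: {path}")
    return match["i"], int(match["w"]), int(match["h"])


example = parse_tile_name("tnbc/tiles/p9_128-0.npy")
```

Any file that does not match the assumed pattern raises an error instead of being silently mislabeled, which is the kind of fragility the index file avoids.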

Now we can generate the split at the subject level and link the assignments back to each subject's tiles. Splitting by subject rather than by tile keeps all tiles from one subject in the same split. The resulting assignments are saved to `split.csv` in the `tnbc` subdirectory.

In [3]:
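# Stack the per-file index tables; ignore_index avoids duplicate row labels.
y_df = pd.concat(y, ignore_index=True)
subjects = y_df["i"].unique()

# Assign each subject (not each tile) to a split.
split = random_split(subjects, [0.7, 0.1, 0.2])
split.columns = ["i", "split"]
# Attach the subject-level assignment to every tile of that subject.
split = pd.merge(y_df[["i", "w", "h", "y"]], split, on="i")
split["rel_path"] = split.apply(lambda r: f"tnbc/tiles/{r.i}_{r.w}-{r.h}.npy", axis=1)
split.to_csv(data_dir / "split.csv")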
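
The source of `random_split` is not shown in this notebook, so the following is only a sketch of what a subject-level split could look like, not that function's actual implementation. The helper name `split_subjects`, the `"train"` and `"dev"` labels, and the seed are assumptions for illustration (only `"test"` appears in the output below).

```python
import numpy as np
import pandas as pd


def split_subjects(subjects, fractions, labels=("train", "dev", "test"), seed=0):
    """Randomly assign each subject to exactly one split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(subjects))
    # Cut points from cumulative fractions; rounding guards against
    # floating-point sums such as 0.7 + 0.1 falling just below 0.8.
    cuts = np.round(np.cumsum(fractions)[:-1] * len(shuffled)).astype(int)
    groups = np.split(shuffled, cuts)
    return pd.DataFrame(
        [(s, label) for label, group in zip(labels, groups) for s in group],
        columns=["i", "split"],
    )


demo = split_subjects([f"p{k}" for k in range(1, 11)], [0.7, 0.1, 0.2])
# Each subject appears once, so no subject can straddle two splits.
assert demo["i"].is_unique
```

After merging, the same property can be checked on the tile-level table with `split.groupby("i")["split"].nunique().max() == 1`.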

For reference, the first three rows of this file are:

In [4]:
split[:3]

    i    w  h         y  split                 rel_path
0  p9    0  0  5.392317   test    tnbc/tiles/p9_0-0.npy
1  p9  128  0  2.784271   test  tnbc/tiles/p9_128-0.npy
2  p9  256  0  1.538420   test  tnbc/tiles/p9_256-0.npy
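
Because `to_csv` writes the DataFrame index as an unnamed first column by default, reading the file back is simplest with `index_col=0`. The sketch below uses a temporary directory and the three preview rows as a stand-in for the real file under `DATA_DIR`; the variable names are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in rows copied from the preview; the real file lives under DATA_DIR/tnbc.
rows = pd.DataFrame({
    "i": ["p9", "p9", "p9"],
    "w": [0, 128, 256],
    "h": [0, 0, 0],
    "y": [5.392317, 2.784271, 1.53842],
    "split": ["test", "test", "test"],
})
rows["rel_path"] = [
    f"tnbc/tiles/{i}_{w}-{h}.npy"
    for i, w, h in zip(rows["i"], rows["w"], rows["h"])
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "split.csv"
    rows.to_csv(csv_path)                        # index written as an unnamed first column
    loaded = pd.read_csv(csv_path, index_col=0)  # read that column back as the index

# Select the relative paths of all tiles assigned to one split.
test_tiles = loaded.loc[loaded["split"] == "test", "rel_path"].tolist()
```

Filtering on the `split` column this way gives the list of tile paths to load for a given split.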
