# Data Splits Truncated

Here, we split in the same way as in `data_splits.ipynb`, but then we restrict the number of points in the training data.
There are two ways we could do this:
1. Take the original splits and drop random examples. Problem: as the training set shrinks, we may inadvertently remove all examples of a compound from the training data while that compound still appears in validation/test.
2. Repeat (procedurally) the same splits as before, but with different train/val/test ratios.

Option 2 is the safer choice because it preserves the dimensionality of the split.
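
To illustrate the problem with option 1, here is a minimal sketch on toy data. The compound names and the helper `unseen_compounds` are hypothetical and not part of the notebook; the sketch only shows how random subsampling of the training data can leave evaluation compounds without any training example.

```python
import random


def unseen_compounds(train_groups, eval_groups):
    """Return compounds that occur in the evaluation data but not in the training data."""
    return set(eval_groups) - set(train_groups)


# Toy data (assumption): 5 hypothetical compounds with 4 training examples each
train_groups = [f"cpd{i}" for i in range(5) for _ in range(4)]
eval_groups = [f"cpd{i}" for i in range(5)]

rng = random.Random(0)
for keep in (20, 10, 5, 2):
    # Randomly keep only `keep` training examples
    subsample = rng.sample(train_groups, keep)
    print(keep, sorted(unseen_compounds(subsample, eval_groups)))
```

With only 2 training examples left, at most 2 of the 5 compounds can still be represented in training, so a random 0D split silently turns into a partly out-of-distribution one.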

### 0D Split
For the 0D split, we use a random train-test split.
Standard: We use an 80/10/10 split into train, val, and test sets.

Truncated (train/val/test):
- 40/30/30
- 20/40/40
- 10/45/45
- 5/50/45
- 2.5/50/47.5
- 1.25/50/48.75
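
Each ratio above is realized as two nested splits: an outer split that removes the test set and an inner split that removes the validation set from the remainder. The sketch below shows how a train/val/test ratio translates into the two `test_size` values; the function name `nested_test_sizes` is an illustrative assumption, not something defined in this repository.

```python
def nested_test_sizes(train, val, test):
    """Convert a train/val/test ratio into the test_size values of a nested split.

    Returns (outer_test_size, inner_test_size). The outer split takes the test
    set from all samples; the inner split takes the validation set from the
    remaining train+val samples.
    """
    total = train + val + test
    outer = test / total
    inner = val / (train + val)
    return outer, inner


ratios = [(40, 30, 30), (20, 40, 40), (10, 45, 45), (5, 50, 45), (2.5, 50, 47.5), (1.25, 50, 48.75)]
for ratio in ratios:
    outer, inner = nested_test_sizes(*ratio)
    print(ratio, round(outer, 4), round(inner, 4))
```

For 40/30/30, for example, this gives an outer `test_size` of 0.3 and an inner `test_size` of 0.3/0.7.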

### 1D Split
For the 1D split, we use a (1D) GroupShuffleSplit.
Each individual split will be 80/10/10 train/val/test (of groups, not samples!).
As groups, we use either initiator, monomer, or terminator.
{later}
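
To make the "groups, not samples" point concrete, here is a plain-Python stand-in for a group-wise shuffle split. It is a simplified sketch, not scikit-learn's `GroupShuffleSplit` implementation; the function `group_shuffle_split` and the toy initiator names are assumptions for illustration.

```python
import random


def group_shuffle_split(groups, test_fraction, seed=0):
    """Split sample indices so that each group lands entirely in train or in test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # The fraction applies to the number of groups, not the number of samples
    n_test = max(1, round(test_fraction * len(unique)))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy data (assumption): 10 hypothetical initiators with 1 to 10 samples each
groups = [f"I{k}" for k in range(10) for _ in range(k + 1)]
train_idx, test_idx = group_shuffle_split(groups, test_fraction=0.1)
print(f"test groups: 1 of 10, test samples: {len(test_idx)} of {len(groups)}")
```

Because groups differ in size, holding out 10% of the groups does not generally hold out 10% of the samples, which is why the sample counts of the group splits vary from fold to fold.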

### 2D Split
For the 2D split, we use a (2D) GroupShuffleSplit.
Each individual split will use 10% of groups as test set and 12.5% of remaining groups as validation set. 
Due to the dimensionality, this means we expect about 0.1 * 0.1 = 1% of samples in each of the test and validation sets and 0.900^2 * 0.875^2 = 62.0% of samples in the training set.
The remaining samples are not used to prevent leakage.
As groups, we use either \[initiator, monomer], \[monomer, terminator] or \[initiator, terminator].

### 3D Split
For the 3D split, we use a (3D) GroupShuffleSplit.
Each individual split will use 20% of groups as test set, 25% of remaining groups as validation set, and the remaining groups as training set.
Due to the dimensionality, this means we expect 0.2^3 = 0.8% of samples in each of the test and validation sets and 0.800^3 * 0.750^3 = 21.6% of samples in the training set.
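
The expected fractions quoted for the 2D and 3D splits can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions that the real data only approximately satisfies: every combination of groups is equally populated, and the same group fractions apply independently along each dimension. The function `expected_fractions` is hypothetical.

```python
def expected_fractions(test_size, val_size, n_dims):
    """Expected sample fractions for an n-dimensional group split.

    test_size: fraction of groups held out as test set (per dimension)
    val_size:  fraction of the remaining groups held out as validation set
    """
    test = test_size ** n_dims
    val = ((1 - test_size) * val_size) ** n_dims
    train = ((1 - test_size) * (1 - val_size)) ** n_dims
    # Samples that mix groups from different sets are discarded
    unused = 1 - train - val - test
    return {"train": train, "val": val, "test": test, "unused": unused}


for n_dims, test_size, val_size in [(2, 0.1, 0.125), (3, 0.2, 0.25)]:
    fractions = expected_fractions(test_size, val_size, n_dims)
    print(f"{n_dims}D", {name: f"{value:.1%}" for name, value in fractions.items()})
```

Under these assumptions the 2D validation share comes out at about 1.3%, slightly above the 1% test share, while in 3D both are 0.8%.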

In [1]:
import pathlib
import sys

sys.path.append(str(pathlib.Path().resolve().parents[1]))

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, ShuffleSplit

from src.definitions import DATA_DIR
from src.util.train_test_split import GroupShuffleSplitND

In [2]:
# Load data
data_filename = "synferm_dataset_2023-09-05_40018records.csv"
data_name = data_filename.rsplit("_", maxsplit=1)[0]
df = pd.read_csv(DATA_DIR / "curated_data" / data_filename)
df.shape

(40018, 27)

In [3]:
df.head()

Unnamed: 0,I_long,M_long,T_long,product_A_smiles,I_smiles,M_smiles,T_smiles,reaction_smiles,reaction_smiles_atom_mapped,experiment_id,...,binary_H,scaled_A,scaled_B,scaled_C,scaled_D,scaled_E,scaled_F,scaled_G,scaled_H,major_A-C
0,2-Pyr003,Fused002,TerABT004,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1ccc(F)cc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56113,...,0,0.036021,0.003427,0.0,0.020975,0.002958,0.941981,0.914281,0.0,A
1,2-Pyr003,Fused002,TerABT007,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1cc(Br)ccc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56114,...,0,0.0,0.0,0.0,0.006159,0.364398,0.928851,1.106548,0.0,no_product
2,2-Pyr003,Fused002,TerABT013,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1cc(C(F)(F)F)ccc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56106,...,1,0.0,0.0,0.0,0.014212,2.16642,1.013596,0.537785,0.05686,no_product
3,2-Pyr003,Fused002,TerABT014,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1ccc(Cl)cc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56112,...,0,0.028915,0.005039,0.0,0.015578,0.504057,0.992614,0.890646,0.0,A
4,2-Pyr003,Fused002,TerTH001,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,[Cl-].[NH3+]NC(=S)c1ccccc1,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:13][cH:15][cH:17][c...,56109,...,0,0.350061,0.643219,0.0,0.031689,0.613596,0.109309,0.439018,0.0,B


In [4]:
# Define a write function that we can reuse for every split
def write_indices_and_stats(indices, sizes, pos_class_A, split_dimension, train_size, save_indices=True):
    """
    Write split indices and per-fold and summary statistics to disk.
    Args:
        indices: list of 3-tuples (train, val, test) of index arrays, one tuple per fold
        sizes: list of 3-tuples of sample counts, length equal to indices
        pos_class_A: list of 3-tuples with the number of samples where binary_A is 1, length equal to indices
        split_dimension: int, e.g. 0, 1, 2, 3 (compared numerically below, so a str will not work)
        train_size: share of all samples used for training, given as a percentage (e.g. 40). Used in the directory name
        save_indices: bool, whether to save the indices. Set to False if we only need to regenerate statistics. Default: True
    """
    n_folds = len(indices)
    save_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split_{train_size}"
    save_dir.mkdir(parents=True, exist_ok=True)
    if save_indices:
        for i, (idx_train, idx_val, idx_test) in enumerate(indices):
            # one file per fold and subset; `j` avoids shadowing the fold counter `i`
            with open(save_dir / f"fold{i}_train.csv", "w") as f:
                f.write("index\n")
                f.write("\n".join(str(j) for j in idx_train))

            with open(save_dir / f"fold{i}_val.csv", "w") as f:
                f.write("index\n")
                f.write("\n".join(str(j) for j in idx_val))

            with open(save_dir / f"fold{i}_test.csv", "w") as f:
                f.write("index\n")
                f.write("\n".join(str(j) for j in idx_test))
    
    for i, (size, count_A) in enumerate(zip(sizes, pos_class_A)):
        with open(save_dir / f"fold{i}_statistics.txt", "w") as f:
            f.write(f"Train samples: {size[0]} ({size[0]/len(df):.1%})\n")
            f.write(f"Val samples: {size[1]} ({size[1]/len(df):.1%})\n")
            f.write(f"Test samples: {size[2]} ({size[2]/len(df):.1%})\n")
            if split_dimension > 1:
                f.write(f"Not used: {len(df) - np.sum(size):.0f} ({(len(df) - np.sum(size)) / len(df):.1%})\n")
            f.write(f"Train samples binary_A has label 1: {count_A[0]} ({count_A[0]/size[0]:.1%})\n")
            f.write(f"Val samples binary_A has label 1: {count_A[1]} ({count_A[1]/size[1]:.1%})\n")
            f.write(f"Test samples binary_A has label 1: {count_A[2]} ({count_A[2]/size[2]:.1%})\n")
            
    # summary statistics
    sum_pos_class_A = np.sum(pos_class_A, axis=0)
    sum_sizes = np.sum(sizes, axis=0)
    with open(save_dir / "summary_statistics.txt", "w") as f:
        f.write(f"Mean Train samples: {sum_sizes[0] / n_folds:.0f} ({sum_sizes[0] / n_folds / len(df):.1%})\n")
        f.write(f"Mean Val samples: {sum_sizes[1] / n_folds:.0f} ({sum_sizes[1] / n_folds / len(df):.1%})\n")
        f.write(f"Mean Test samples: {sum_sizes[2] / n_folds:.0f} ({sum_sizes[2] / n_folds / len(df):.1%})\n")
        if split_dimension > 1:
            f.write(f"Not used: {len(df) - np.sum(sum_sizes) / n_folds:.0f} ({(len(df) - np.sum(sum_sizes) / n_folds) / len(df):.1%})\n")
        f.write(f"Mean Train samples binary_A has label 1: {sum_pos_class_A[0] / n_folds:.0f} ({sum_pos_class_A[0]/sum_sizes[0]:.1%})\n")
        f.write(f"Mean Val samples binary_A has label 1: {sum_pos_class_A[1] / n_folds:.0f} ({sum_pos_class_A[1]/sum_sizes[1]:.1%})\n")
        f.write(f"Mean Test samples binary_A has label 1: {sum_pos_class_A[2] / n_folds:.0f} ({sum_pos_class_A[2]/sum_sizes[2]:.1%})\n")

## 0D split

In [5]:
splitter = ShuffleSplit(n_splits=9, test_size=0.3, random_state=42)
inner_splitter = ShuffleSplit(n_splits=1, test_size=0.3/0.7, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df):
    # inner split
    train, val = next(inner_splitter.split(idx_train_val))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    # add to list
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)
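
The "indices to index indices" step in the loop above can be seen on a tiny example. This plain-Python sketch uses made-up index values; the notebook does the same thing with NumPy array indexing (`idx_train_val[train]`).

```python
# Outer split result (assumption: made-up values): positions of the
# train+val samples in the original dataframe
idx_train_val = [10, 11, 14, 17, 19]

# Inner split result: positions *within* idx_train_val, not within the dataframe
train, val = [0, 2, 4], [1, 3]

# Map the inner positions back to positions in the original dataframe
idx_train = [idx_train_val[j] for j in train]
idx_val = [idx_train_val[j] for j in val]

print(idx_train, idx_val)  # [10, 14, 19] [11, 17]
```

Without this mapping, the saved train and validation indices would refer to rows of the intermediate train+val subset rather than to rows of `df`.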

In [7]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=0, save_indices=True, train_size=40)

In [8]:
splitter = ShuffleSplit(n_splits=9, test_size=0.4, random_state=42)
inner_splitter = ShuffleSplit(n_splits=1, test_size=0.4/0.6, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df):
    # inner split
    train, val = next(inner_splitter.split(idx_train_val))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    # add to list
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(8003, 16007, 16008), (8003, 16007, 16008), (8003, 16007, 16008), (8003, 16007, 16008), (8003, 16007, 16008), (8003, 16007, 16008), (8003, 16007, 16008), (8003, 16007, 16008), (8003, 16007, 16008)]
[(6550, 12944, 13038), (6466, 13069, 12997), (6551, 12973, 13008), (6501, 13031, 13000), (6498, 13019, 13015), (6529, 12949, 13054), (6487, 12956, 13089), (6511, 12983, 13038), (6470, 12997, 13065)]


In [9]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=0, save_indices=True, train_size=20)

In [10]:
splitter = ShuffleSplit(n_splits=9, test_size=0.45, random_state=42)
inner_splitter = ShuffleSplit(n_splits=1, test_size=0.45/0.55, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df):
    # inner split
    train, val = next(inner_splitter.split(idx_train_val))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    # add to list
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(4001, 18008, 18009), (4001, 18008, 18009), (4001, 18008, 18009), (4001, 18008, 18009), (4001, 18008, 18009), (4001, 18008, 18009), (4001, 18008, 18009), (4001, 18008, 18009), (4001, 18008, 18009)]
[(3258, 14585, 14689), (3258, 14636, 14638), (3262, 14628, 14642), (3285, 14622, 14625), (3272, 14611, 14649), (3248, 14584, 14700), (3208, 14597, 14727), (3266, 14622, 14644), (3264, 14572, 14696)]


In [11]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=0, save_indices=True, train_size=10)

In [12]:
splitter = ShuffleSplit(n_splits=9, test_size=0.45, random_state=42)
inner_splitter = ShuffleSplit(n_splits=1, test_size=0.5/0.55, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df):
    # inner split
    train, val = next(inner_splitter.split(idx_train_val))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    # add to list
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(2000, 20009, 18009), (2000, 20009, 18009), (2000, 20009, 18009), (2000, 20009, 18009), (2000, 20009, 18009), (2000, 20009, 18009), (2000, 20009, 18009), (2000, 20009, 18009), (2000, 20009, 18009)]
[(1639, 16204, 14689), (1626, 16268, 14638), (1644, 16246, 14642), (1639, 16268, 14625), (1642, 16241, 14649), (1632, 16200, 14700), (1600, 16205, 14727), (1615, 16273, 14644), (1630, 16206, 14696)]


In [13]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=0, save_indices=True, train_size=5)

In [14]:
splitter = ShuffleSplit(n_splits=9, test_size=0.475, random_state=42)
inner_splitter = ShuffleSplit(n_splits=1, test_size=0.5/0.525, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df):
    # inner split
    train, val = next(inner_splitter.split(idx_train_val))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    # add to list
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(1000, 20009, 19009), (1000, 20009, 19009), (1000, 20009, 19009), (1000, 20009, 19009), (1000, 20009, 19009), (1000, 20009, 19009), (1000, 20009, 19009), (1000, 20009, 19009), (1000, 20009, 19009)]
[(830, 16213, 15489), (810, 16280, 15442), (803, 16283, 15446), (819, 16298, 15415), (813, 16271, 15448), (828, 16174, 15530), (825, 16155, 15552), (814, 16268, 15450), (838, 16198, 15496)]


In [15]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=0, save_indices=True, train_size=2.5)

In [16]:
splitter = ShuffleSplit(n_splits=9, test_size=0.4875, random_state=42)
inner_splitter = ShuffleSplit(n_splits=1, test_size=0.5/0.5125, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df):
    # inner split
    train, val = next(inner_splitter.split(idx_train_val))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    # add to list
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(500, 20009, 19509), (500, 20009, 19509), (500, 20009, 19509), (500, 20009, 19509), (500, 20009, 19509), (500, 20009, 19509), (500, 20009, 19509), (500, 20009, 19509), (500, 20009, 19509)]
[(425, 16221, 15886), (409, 16283, 15840), (399, 16286, 15847), (421, 16315, 15796), (416, 16251, 15865), (408, 16193, 15931), (394, 16192, 15946), (401, 16269, 15862), (408, 16223, 15901)]


In [17]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=0, save_indices=True, train_size=1.25)

## 1D split

In [8]:
splitter = GroupShuffleSplit(n_splits=3, test_size=0.1, random_state=np.random.RandomState(42))  # here, we reuse the outer splitter as well, so we use RandomState
inner_splitter = GroupShuffleSplit(n_splits=1, test_size=0.1/0.9, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [9]:
indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df["I_long"]):
    train, val = next(inner_splitter.split(idx_train_val, groups=df["I_long"][idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df["M_long"]):
    train, val = next(inner_splitter.split(idx_train_val, groups=df["M_long"][idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))
    
for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df["T_long"]):
    train, val = next(inner_splitter.split(idx_train_val, groups=df["T_long"][idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(30974, 4318, 4726), (32233, 3546, 4239), (32488, 4020, 3510), (30560, 4729, 4729), (31529, 3911, 4578), (31828, 4387, 3803), (31225, 3725, 5068), (31084, 4497, 4437), (32258, 3504, 4256)]
[(25183, 3332, 4017), (26439, 2886, 3207), (26725, 3018, 2789), (24641, 3543, 4348), (25616, 3082, 3834), (25166, 4038, 3328), (25197, 3248, 4087), (25648, 3906, 2978), (26131, 3024, 3377)]


In [10]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=1, save_indices=True)

## 2D split

In [11]:
splitter = GroupShuffleSplitND(n_splits=3, test_size=0.1, random_state=np.random.RandomState(42))  # here, we reuse the outer splitter as well, so we use RandomState
inner_splitter = GroupShuffleSplitND(n_splits=1, test_size=0.1/0.9, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [12]:
indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df, groups=df[["I_long", "M_long"]]):
    train, val = next(inner_splitter.split(df.iloc[idx_train_val], groups=df[["I_long", "M_long"]].iloc[idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df[["M_long", "T_long"]]):
    train, val = next(inner_splitter.split(idx_train_val, groups=df[["M_long", "T_long"]].iloc[idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))
    
for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df[["I_long", "T_long"]]):
    train, val = next(inner_splitter.split(idx_train_val, groups=df[["I_long", "T_long"]].iloc[idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(23868, 608, 565), (23681, 510, 520), (24526, 439, 528), (25765, 338, 602), (23780, 395, 559), (24445, 469, 518), (23951, 395, 687), (24688, 413, 460), (23946, 536, 599)]
[(19656, 452, 464), (18672, 440, 498), (20053, 335, 406), (21101, 291, 429), (18549, 378, 481), (19487, 413, 406), (19115, 304, 630), (20482, 325, 337), (19709, 412, 490)]


In [13]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=2, save_indices=True)

## 3D split

In [14]:
splitter = GroupShuffleSplitND(n_splits=9, test_size=0.2, random_state=np.random.RandomState(42))  # here, we reuse the outer splitter as well, so we use RandomState
inner_splitter = GroupShuffleSplitND(n_splits=1, test_size=0.2/0.8, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [15]:
indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df, groups=df[["I_long", "M_long", "T_long"]]):
    train, val = next(inner_splitter.split(df.iloc[idx_train_val], groups=df[["I_long", "M_long", "T_long"]].iloc[idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(8180, 355, 367), (7995, 305, 439), (7184, 424, 455), (8145, 272, 528), (8042, 338, 416), (7640, 428, 358), (8029, 398, 304), (7631, 423, 359), (8682, 261, 368)]
[(6663, 325, 239), (6600, 273, 310), (5690, 344, 394), (6498, 234, 437), (6365, 248, 371), (6038, 333, 317), (6296, 368, 248), (6194, 301, 274), (7278, 235, 265)]


In [16]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=3, save_indices=True)

## Control: Show statistics for previously prepared splits

In [17]:
# Load data
data_filename = "synferm_dataset_2023-09-05_40018records.csv"
data_name = data_filename.rsplit("_", maxsplit=1)[0]
df = pd.read_csv(DATA_DIR / "curated_data" / data_filename)
df.shape

(40018, 27)

### 0D split

In [18]:
split_dimension = 0
split_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    
for fold_idx in range(9):
    with open(split_dir / f"fold{fold_idx}_statistics_multiclass.txt", "w") as f:
        # import indices
        train_idx = pd.read_csv(split_dir / f"fold{fold_idx}_train.csv")["index"].to_numpy()
        val_idx = pd.read_csv(split_dir / f"fold{fold_idx}_val.csv")["index"].to_numpy()
        test_idx = pd.read_csv(split_dir / f"fold{fold_idx}_test.csv")["index"].to_numpy()
        
        # check mutually exclusive
        assert len(np.intersect1d(train_idx, val_idx)) == 0
        assert len(np.intersect1d(train_idx, test_idx)) == 0
        assert len(np.intersect1d(val_idx, test_idx)) == 0
        
        # determine multiclass distribution
        train_dist = df["major_A-C"].iloc[train_idx].value_counts().sort_index()
        val_dist = df["major_A-C"].iloc[val_idx].value_counts().sort_index()
        test_dist = df["major_A-C"].iloc[test_idx].value_counts().sort_index()

        for i, item in (train_dist / len(train_idx)).items():
            f.write(f"Training set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (val_dist / len(val_idx)).items():
            f.write(f"Validation set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (test_dist / len(test_idx)).items():
            f.write(f"Test set 'major_A-C' class {i}: {item:.1%}\n")


### 1D split

In [19]:
split_dimension = 1
split_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    
for fold_idx in range(9):
    with open(split_dir / f"fold{fold_idx}_statistics_multiclass.txt", "w") as f:
        # import indices
        train_idx = pd.read_csv(split_dir / f"fold{fold_idx}_train.csv")["index"].to_numpy()
        val_idx = pd.read_csv(split_dir / f"fold{fold_idx}_val.csv")["index"].to_numpy()
        test_idx = pd.read_csv(split_dir / f"fold{fold_idx}_test.csv")["index"].to_numpy()
        
        # check mutually exclusive
        assert len(np.intersect1d(train_idx, val_idx)) == 0
        assert len(np.intersect1d(train_idx, test_idx)) == 0
        assert len(np.intersect1d(val_idx, test_idx)) == 0
        
        # check 1D groups are mutually exclusive
        if fold_idx < 3: # first three are split on initiator
            assert len(np.intersect1d(df["I_long"].iloc[train_idx], df["I_long"].iloc[val_idx])) == 0
            assert len(np.intersect1d(df["I_long"].iloc[train_idx], df["I_long"].iloc[test_idx])) == 0
            assert len(np.intersect1d(df["I_long"].iloc[val_idx], df["I_long"].iloc[test_idx])) == 0
        elif fold_idx < 6:  # next three are split on monomer
            assert len(np.intersect1d(df["M_long"].iloc[train_idx], df["M_long"].iloc[val_idx])) == 0
            assert len(np.intersect1d(df["M_long"].iloc[train_idx], df["M_long"].iloc[test_idx])) == 0
            assert len(np.intersect1d(df["M_long"].iloc[val_idx], df["M_long"].iloc[test_idx])) == 0
        else:  # last three are split on terminator
            assert len(np.intersect1d(df["T_long"].iloc[train_idx], df["T_long"].iloc[val_idx])) == 0
            assert len(np.intersect1d(df["T_long"].iloc[train_idx], df["T_long"].iloc[test_idx])) == 0
            assert len(np.intersect1d(df["T_long"].iloc[val_idx], df["T_long"].iloc[test_idx])) == 0
        
        # determine multiclass distribution
        train_dist = df["major_A-C"].iloc[train_idx].value_counts().sort_index()
        val_dist = df["major_A-C"].iloc[val_idx].value_counts().sort_index()
        test_dist = df["major_A-C"].iloc[test_idx].value_counts().sort_index()

        for i, item in (train_dist / len(train_idx)).items():
            f.write(f"Training set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (val_dist / len(val_idx)).items():
            f.write(f"Validation set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (test_dist / len(test_idx)).items():
            f.write(f"Test set 'major_A-C' class {i}: {item:.1%}\n")


### 2D split

In [20]:
split_dimension = 2
split_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    
for fold_idx in range(9):
    with open(split_dir / f"fold{fold_idx}_statistics_multiclass.txt", "w") as f:
        # import indices
        train_idx = pd.read_csv(split_dir / f"fold{fold_idx}_train.csv")["index"].to_numpy()
        val_idx = pd.read_csv(split_dir / f"fold{fold_idx}_val.csv")["index"].to_numpy()
        test_idx = pd.read_csv(split_dir / f"fold{fold_idx}_test.csv")["index"].to_numpy()
        
        # check mutually exclusive
        assert len(np.intersect1d(train_idx, val_idx)) == 0
        assert len(np.intersect1d(train_idx, test_idx)) == 0
        assert len(np.intersect1d(val_idx, test_idx)) == 0
        
        # check 2D groups are mutually exclusive
        if fold_idx < 3: # first three are split on initiator and monomer
            assert len(np.intersect1d(np.unique(df[["I_long", "M_long"]].iloc[train_idx]), np.unique(df[["I_long", "M_long"]].iloc[val_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["I_long", "M_long"]].iloc[train_idx]), np.unique(df[["I_long", "M_long"]].iloc[test_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["I_long", "M_long"]].iloc[val_idx]), np.unique(df[["I_long", "M_long"]].iloc[test_idx]))) == 0
        elif fold_idx < 6:  # next three are split on monomer and terminator
            assert len(np.intersect1d(np.unique(df[["M_long", "T_long"]].iloc[train_idx]), np.unique(df[["M_long", "T_long"]].iloc[val_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["M_long", "T_long"]].iloc[train_idx]), np.unique(df[["M_long", "T_long"]].iloc[test_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["M_long", "T_long"]].iloc[val_idx]), np.unique(df[["M_long", "T_long"]].iloc[test_idx]))) == 0
        else:  # last three are split on initiator and terminator
            assert len(np.intersect1d(np.unique(df[["I_long", "T_long"]].iloc[train_idx]), np.unique(df[["I_long", "T_long"]].iloc[val_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["I_long", "T_long"]].iloc[train_idx]), np.unique(df[["I_long", "T_long"]].iloc[test_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["I_long", "T_long"]].iloc[val_idx]), np.unique(df[["I_long", "T_long"]].iloc[test_idx]))) == 0
            
        # determine multiclass distribution
        train_dist = df["major_A-C"].iloc[train_idx].value_counts().sort_index()
        val_dist = df["major_A-C"].iloc[val_idx].value_counts().sort_index()
        test_dist = df["major_A-C"].iloc[test_idx].value_counts().sort_index()

        for i, item in (train_dist / len(train_idx)).items():
            f.write(f"Training set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (val_dist / len(val_idx)).items():
            f.write(f"Validation set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (test_dist / len(test_idx)).items():
            f.write(f"Test set 'major_A-C' class {i}: {item:.1%}\n")
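
A note on what the 2D group assertions above verify: `np.unique` flattens the two selected columns into one array of names, so the check is that no single building block is shared between two sets, which is stricter than requiring the pairs to be distinct. The sketch below mirrors that logic with plain Python sets; the helper names and toy pairs are assumptions.

```python
def flat_names(rows):
    """Collect every name from a list of (initiator, monomer) pairs into one set."""
    return {name for row in rows for name in row}


def groups_disjoint(rows_a, rows_b):
    """True if no building block occurs in both lists of pairs."""
    return flat_names(rows_a).isdisjoint(flat_names(rows_b))


train = [("I1", "M1"), ("I2", "M2")]
val_ok = [("I3", "M3")]
val_leaky = [("I3", "M1")]  # a new pair, but monomer M1 also occurs in train

print(groups_disjoint(train, val_ok))     # True
print(groups_disjoint(train, val_leaky))  # False
```

Flattening only works as a check because initiator, monomer, and terminator names come from distinct naming schemes, so a name cannot collide across columns.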


### 3D split

In [21]:
split_dimension = 3
split_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    
for fold_idx in range(9):
    with open(split_dir / f"fold{fold_idx}_statistics_multiclass.txt", "w") as f:
        # import indices
        train_idx = pd.read_csv(split_dir / f"fold{fold_idx}_train.csv")["index"].to_numpy()
        val_idx = pd.read_csv(split_dir / f"fold{fold_idx}_val.csv")["index"].to_numpy()
        test_idx = pd.read_csv(split_dir / f"fold{fold_idx}_test.csv")["index"].to_numpy()
        
        # check mutually exclusive
        assert len(np.intersect1d(train_idx, val_idx)) == 0
        assert len(np.intersect1d(train_idx, test_idx)) == 0
        assert len(np.intersect1d(val_idx, test_idx)) == 0
        
        # check 3D groups are mutually exclusive
        assert len(np.intersect1d(np.unique(df[["I_long", "M_long", "T_long"]].iloc[train_idx]), np.unique(df[["I_long", "M_long", "T_long"]].iloc[val_idx]))) == 0
        assert len(np.intersect1d(np.unique(df[["I_long", "M_long", "T_long"]].iloc[train_idx]), np.unique(df[["I_long", "M_long", "T_long"]].iloc[test_idx]))) == 0
        assert len(np.intersect1d(np.unique(df[["I_long", "M_long", "T_long"]].iloc[val_idx]), np.unique(df[["I_long", "M_long", "T_long"]].iloc[test_idx]))) == 0
            
        # determine multiclass distribution
        train_dist = df["major_A-C"].iloc[train_idx].value_counts().sort_index()
        val_dist = df["major_A-C"].iloc[val_idx].value_counts().sort_index()
        test_dist = df["major_A-C"].iloc[test_idx].value_counts().sort_index()

        for i, item in (train_dist / len(train_idx)).items():
            f.write(f"Training set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (val_dist / len(val_idx)).items():
            f.write(f"Validation set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (test_dist / len(test_idx)).items():
            f.write(f"Test set 'major_A-C' class {i}: {item:.1%}\n")