# Data Splits

We want to split the SynFerm data set into train, validation and test data.
For all splits, we will do 9 random repetitions.
For the 1D and 2D splits, which can each be defined on 3 different groupings, the 9 repetitions consist of 3 random repetitions for each of the 3 groupings.

### 0D Split
For the 0D split, we use a purely random split of samples.
We use an 80/10/10 split into train, validation, and test sets.
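
As a quick check of the arithmetic, the sketch below reproduces the split sizes on plain index arrays. It assumes the nested scheme used in the code cells of this notebook: an outer `ShuffleSplit` with `test_size=0.1`, followed by an inner one with `test_size=0.1/0.9` on the remaining 90%. The seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 40018  # number of records in the SynFerm data set used here

outer = ShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
inner = ShuffleSplit(n_splits=1, test_size=0.1 / 0.9, random_state=0)

# outer split: 90% train+val, 10% test
idx_train_val, idx_test = next(outer.split(np.arange(n_samples)))
# inner split: 1/9 of the remaining 90% is 10% of the total
train, val = next(inner.split(idx_train_val))
idx_train, idx_val = idx_train_val[train], idx_train_val[val]

split_sizes = (len(idx_train), len(idx_val), len(idx_test))
print(split_sizes)
```

The sizes depend only on the number of samples, not on the seed, so this should give 32014 / 4002 / 4002, the same sizes that are printed for the 0D split in this notebook.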

### 1D Split
For the 1D split, we use a (1D) GroupShuffleSplit.
Each individual split will be 80/10/10 train/val/test (of groups, not samples!).
As groups, we use either initiator, monomer, or terminator.
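
To illustrate what "of groups, not samples" means, here is a minimal sketch on made-up data. The group names and sample counts are hypothetical; only `GroupShuffleSplit` itself is the same class used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 10 hypothetical initiators with very different sample counts
rng = np.random.default_rng(0)
counts = rng.integers(5, 50, size=10)
groups = np.repeat([f"I{i}" for i in range(10)], counts)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
idx_train_val, idx_test = next(splitter.split(np.arange(len(groups)), groups=groups))

test_groups = set(groups[idx_test])
train_val_groups = set(groups[idx_train_val])

# exactly 10% of the groups end up in the test set ...
print(len(test_groups), "of", len(set(groups)), "groups in test")
# ... but the share of samples depends on how large the chosen group is
print(f"{len(idx_test) / len(groups):.1%} of samples in test")
```

This is why the 1D test sets in this notebook range from 3510 to 5068 samples (about 9% to 13% of 40018) even though the group fraction is fixed.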

### 2D Split
For the 2D split, we use a (2D) GroupShuffleSplit.
Each individual split will use 10% of groups as test set and 1/9 (11.1%) of the remaining groups as validation set, matching the `test_size=0.1/0.9` passed to the inner splitter below.
Due to the dimensionality, this means we expect 0.1 * 0.1 = 1% of samples in each of the test and validation sets and 0.900^2 * 0.889^2 = 64.0% of samples in the training set.
The remaining samples are discarded to prevent leakage.
As groups, we use either \[initiator, monomer], \[monomer, terminator] or \[initiator, terminator].
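
`GroupShuffleSplitND` is a project-specific class whose implementation is not shown in this notebook. The following is only a hypothetical sketch of the idea described above, not the project's code: the function name `group_split_nd`, its signature, and the toy data are all made up for illustration.

```python
import numpy as np
import pandas as pd


def group_split_nd(groups: pd.DataFrame, test_size: float, seed: int):
    """Hypothetical N-D group split returning positional (train, test) indices.

    A row is a test row only if its group is a test group in *every* column,
    and a training row only if its group is a training group in every column.
    All other rows mix seen and unseen groups and are dropped.
    """
    rng = np.random.default_rng(seed)
    in_test = np.ones(len(groups), dtype=bool)
    in_train = np.ones(len(groups), dtype=bool)
    for col in groups.columns:
        unique = groups[col].drop_duplicates().tolist()
        n_test = int(np.ceil(test_size * len(unique)))
        picked = rng.choice(len(unique), size=n_test, replace=False)
        test_groups = [unique[i] for i in picked]
        is_test_group = groups[col].isin(test_groups).to_numpy()
        in_test &= is_test_group
        in_train &= ~is_test_group
    return np.flatnonzero(in_train), np.flatnonzero(in_test)


# Toy full-factorial data: 10 initiators x 10 monomers (made-up names)
toy = pd.DataFrame(
    [(f"I{i}", f"M{m}") for i in range(10) for m in range(10)],
    columns=["I_long", "M_long"],
)
idx_train, idx_test = group_split_nd(toy, test_size=0.1, seed=0)
n_unused = len(toy) - len(idx_train) - len(idx_test)
print(len(idx_train), len(idx_test), n_unused)
```

On this toy grid one initiator and one monomer are held out, so 81 of 100 rows are available for training, 1 row is a test row, and the 18 rows that combine a held-out building block with a seen one are dropped.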

### 3D Split
For the 3D split, we use a (3D) GroupShuffleSplit.
Each individual split will use 20% of groups as test set, 25% of remaining groups as validation set, and the remaining groups as training set.
Due to the dimensionality, this means we expect 0.2^3 = 0.8% of samples in each of the test and validation sets and 0.800^3 * 0.750^3 = 21.6% of samples in the training set.
The remaining samples are again discarded.
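
The expected fractions can be computed for any dimensionality with a small helper. This is a sketch under simplifying assumptions (equally populated groups and a full factorial data set), which hold only approximately for real data; the function name is made up.

```python
def expected_fractions(n_dims: int, test_frac: float, val_frac_of_rest: float) -> dict:
    """Expected sample fractions if every group axis is split independently."""
    # fraction of groups per axis that ends up in validation / training
    val_frac = (1 - test_frac) * val_frac_of_rest
    train_frac = (1 - test_frac) * (1 - val_frac_of_rest)
    fractions = {
        "train": train_frac ** n_dims,
        "val": val_frac ** n_dims,
        "test": test_frac ** n_dims,
    }
    # samples that mix groups from different sets are dropped
    fractions["unused"] = 1 - sum(fractions.values())
    return fractions


# 3D split: 20% test groups, 25% of the remaining groups for validation
frac_3d = expected_fractions(n_dims=3, test_frac=0.2, val_frac_of_rest=0.25)
print({name: f"{value:.1%}" for name, value in frac_3d.items()})
```

For the 3D settings this gives 21.6% train, 0.8% validation, 0.8% test, and 76.8% unused. The training sets actually produced in this notebook contain 7184 to 8682 of 40018 samples (roughly 18% to 22%), which is in line with the estimate.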

In [1]:
import pathlib
import sys

sys.path.append(str(pathlib.Path().resolve().parents[1]))

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, ShuffleSplit

from src.definitions import DATA_DIR
from src.util.train_test_split import GroupShuffleSplitND
from util import write_indices_and_stats

In [2]:
# Load data
data_filename = "synferm_dataset_2023-09-05_40018records.csv"
data_name = data_filename.rsplit("_", maxsplit=1)[0]
df = pd.read_csv(DATA_DIR / "curated_data" / data_filename)
df.shape

(40018, 27)

In [3]:
df.head()

Unnamed: 0,I_long,M_long,T_long,product_A_smiles,I_smiles,M_smiles,T_smiles,reaction_smiles,reaction_smiles_atom_mapped,experiment_id,...,binary_H,scaled_A,scaled_B,scaled_C,scaled_D,scaled_E,scaled_F,scaled_G,scaled_H,major_A-C
0,2-Pyr003,Fused002,TerABT004,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1ccc(F)cc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56113,...,0,0.036021,0.003427,0.0,0.020975,0.002958,0.941981,0.914281,0.0,A
1,2-Pyr003,Fused002,TerABT007,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1cc(Br)ccc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56114,...,0,0.0,0.0,0.0,0.006159,0.364398,0.928851,1.106548,0.0,no_product
2,2-Pyr003,Fused002,TerABT013,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1cc(C(F)(F)F)ccc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56106,...,1,0.0,0.0,0.0,0.014212,2.16642,1.013596,0.537785,0.05686,no_product
3,2-Pyr003,Fused002,TerABT014,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1ccc(Cl)cc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56112,...,0,0.028915,0.005039,0.0,0.015578,0.504057,0.992614,0.890646,0.0,A
4,2-Pyr003,Fused002,TerTH001,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,[Cl-].[NH3+]NC(=S)c1ccccc1,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:13][cH:15][cH:17][c...,56109,...,0,0.350061,0.643219,0.0,0.031689,0.613596,0.109309,0.439018,0.0,B


## 0D split

In [4]:
splitter = ShuffleSplit(n_splits=9, test_size=0.1, random_state=42)
inner_splitter = ShuffleSplit(n_splits=1, test_size=0.1/0.9, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times
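
The difference between an integer seed and a `RandomState` instance matters whenever `split()` is called more than once on the same splitter. A small sketch on toy data (array size and `test_size` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20)

# Integer seed: every call to split() restarts from the same state
fixed = ShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
first = next(fixed.split(X))[1]
second = next(fixed.split(X))[1]
int_seed_repeats = np.array_equal(first, second)

# RandomState instance: its state advances, so successive calls give new splits
stateful = ShuffleSplit(n_splits=1, test_size=0.25, random_state=np.random.RandomState(42))
draws = [tuple(next(stateful.split(X))[1]) for _ in range(5)]
instance_varies = len(set(draws)) > 1

print(int_seed_repeats, instance_varies)
```

With an integer seed the inner splitter would pick the same positions within `idx_train_val` in every repetition; passing a `RandomState` instance avoids that while keeping the whole sequence reproducible.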

In [5]:
indices = []
sizes = []
pos_class = []
for idx_train_val, idx_test in splitter.split(df):
    # inner split
    train, val = next(inner_splitter.split(idx_train_val))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    # add to list
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class.append(
        (np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_train]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_val]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_test]).to_numpy(),
        )
    )

print(sizes)
print(pos_class)

[(32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002)]
[(array([26042, 18286,  9105]), array([3224, 2288, 1154]), array([3266, 2262, 1054])), (array([26049, 18293,  9079]), array([3262, 2296, 1156]), array([3221, 2247, 1078])), (array([26007, 18255,  9060]), array([3263, 2253, 1147]), array([3262, 2328, 1106])), (array([26046, 18292,  9019]), array([3242, 2279, 1190]), array([3244, 2265, 1104])), (array([26031, 18185,  9066]), array([3229, 2311, 1115]), array([3272, 2340, 1132])), (array([26015, 18231,  9061]), array([3255, 2276, 1143]), array([3262, 2329, 1109])), (array([25995, 18269,  9026]), array([3265, 2260, 1123]), array([3272, 2307, 1164])), (array([26053, 18372,  9047]), array([3231, 2211, 1125]), array([3248, 2253, 1141])), (array([26020, 18236,  9137]), array([3227, 2300, 1093]), array([3285, 2300, 1083]))]
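
The "indices of indices" step in the loop above is easy to get wrong, so here is a toy illustration with hand-picked numbers. The inner splitter only sees `idx_train_val`, so what it returns are positions within that array, which have to be mapped back to positions in the full data set.

```python
import numpy as np

idx_train_val = np.array([7, 2, 9, 4, 0, 5])  # positions in the full data set
train = np.array([0, 3, 5])  # positions *within* idx_train_val
val = np.array([1, 4])       # positions *within* idx_train_val

# map back to positions in the full data set
idx_train = idx_train_val[train]
idx_val = idx_train_val[val]
print(idx_train, idx_val)  # [7 4 5] [2 0]
```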


In [6]:
write_indices_and_stats(
    indices, 
    sizes, 
    pos_class,
    total_size=len(df),
    data_name=data_name,
    split_dimension=0, 
    save_indices=True, 
    train_size=""
)

## 1D split

In [7]:
splitter = GroupShuffleSplit(n_splits=3, test_size=0.1, random_state=np.random.RandomState(42))  # here, we reuse the outer splitter as well, so we use RandomState
inner_splitter = GroupShuffleSplit(n_splits=1, test_size=0.1/0.9, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [8]:
indices = []
sizes = []
pos_class = []
label_cols = ["binary_A", "binary_B", "binary_C"]
# 3 repetitions per group column: initiator, monomer, terminator (9 splits in total)
for group_col in ["I_long", "M_long", "T_long"]:
    for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df[group_col]):
        train, val = next(inner_splitter.split(idx_train_val, groups=df[group_col].iloc[idx_train_val]))
        # train/val are positions within idx_train_val; map them back to positions in the original dataframe
        idx_train = idx_train_val[train]
        idx_val = idx_train_val[val]
        indices.append((idx_train, idx_val, idx_test))
        sizes.append((len(idx_train), len(idx_val), len(idx_test)))
        # number of positive samples per binary target in each set
        pos_class.append(
            tuple(df[label_cols].iloc[idx].sum().to_numpy() for idx in (idx_train, idx_val, idx_test))
        )
    
print(sizes)
print(pos_class)

[(30974, 4318, 4726), (32233, 3546, 4239), (32488, 4020, 3510), (30560, 4729, 4729), (31529, 3911, 4578), (31828, 4387, 3803), (31225, 3725, 5068), (31084, 4497, 4437), (32258, 3504, 4256)]
[(array([25183, 17684,  8977]), array([3332, 2430, 1036]), array([4017, 2722, 1300])), (array([26439, 18610,  9216]), array([2886, 2070, 1084]), array([3207, 2156, 1013])), (array([26725, 18911,  9629]), array([3018, 1969,  909]), array([2789, 1956,  775])), (array([24641, 17383,  8770]), array([3543, 2306, 1095]), array([4348, 3147, 1448])), (array([25616, 18183,  9011]), array([3082, 2052, 1095]), array([3834, 2601, 1207])), (array([25166, 17794,  8655]), array([4038, 2840, 1580]), array([3328, 2202, 1078])), (array([25197, 18791,  8836]), array([3248, 1201,  848]), array([4087, 2844, 1629])), (array([25648, 18653,  8982]), array([3906, 2964, 1581]), array([2978, 1219,  750])), (array([26131, 18399,  9419]), array([3024, 1372,  514]), array([3377, 3065, 1380]))]


In [9]:
write_indices_and_stats(
    indices, 
    sizes, 
    pos_class,
    total_size=len(df),
    data_name=data_name,
    split_dimension=1, 
    save_indices=True, 
    train_size=""
)

## 2D split

In [11]:
splitter = GroupShuffleSplitND(n_splits=3, test_size=0.1, random_state=np.random.RandomState(42))  # here, we reuse the outer splitter as well, so we use RandomState
inner_splitter = GroupShuffleSplitND(n_splits=1, test_size=0.1/0.9, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [12]:
indices = []
sizes = []
pos_class = []
for idx_train_val, idx_test in splitter.split(df, groups=df[["I_long", "M_long"]]):
    train, val = next(inner_splitter.split(df.iloc[idx_train_val], groups=df[["I_long", "M_long"]].iloc[idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class.append(
        (np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_train]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_val]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_test]).to_numpy(),
        )
    )
    
for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df[["M_long", "T_long"]]):
    train, val = next(inner_splitter.split(idx_train_val, groups=df[["M_long", "T_long"]].iloc[idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class.append(
        (np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_train]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_val]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_test]).to_numpy(),
        )
    )
    
for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df[["I_long", "T_long"]]):
    train, val = next(inner_splitter.split(idx_train_val, groups=df[["I_long", "T_long"]].iloc[idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class.append(
        (np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_train]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_val]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_test]).to_numpy(),
        )
    )
        
print(sizes)
print(pos_class)

[(23868, 608, 565), (23681, 510, 520), (24526, 439, 528), (25765, 338, 602), (23780, 395, 559), (24445, 469, 518), (23951, 395, 687), (24688, 413, 460), (23946, 536, 599)]
[(array([19656, 13999,  7056]), array([452, 330, 130]), array([464, 294, 162])), (array([18672, 13236,  6491]), array([440, 305, 210]), array([498, 341, 139])), (array([20053, 14354,  6903]), array([335, 208,  99]), array([406, 263, 172])), (array([21101, 15292,  8275]), array([291, 164,  34]), array([429, 317, 102])), (array([18549, 12239,  5922]), array([378, 342, 206]), array([481, 392, 173])), (array([19487, 13727,  6433]), array([413, 300, 203]), array([406, 283, 148])), (array([19115, 13953,  6335]), array([304, 186, 155]), array([630, 391, 201])), (array([20482, 16388,  8424]), array([325, 128,  32]), array([337, 163,  79])), (array([19709, 12557,  6305]), array([412, 407, 209]), array([490, 453, 246]))]


In [13]:
write_indices_and_stats(
    indices, 
    sizes, 
    pos_class,
    total_size=len(df),
    data_name=data_name,
    split_dimension=2, 
    save_indices=True, 
    train_size=""
)

## 3D split

In [14]:
splitter = GroupShuffleSplitND(n_splits=9, test_size=0.2, random_state=np.random.RandomState(42))  # the outer splitter is only used once here, so an int seed would also work; the RandomState instance is harmless
inner_splitter = GroupShuffleSplitND(n_splits=1, test_size=0.2/0.8, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [15]:
indices = []
sizes = []
pos_class = []
for idx_train_val, idx_test in splitter.split(df, groups=df[["I_long", "M_long", "T_long"]]):
    train, val = next(inner_splitter.split(df.iloc[idx_train_val], groups=df[["I_long", "M_long", "T_long"]].iloc[idx_train_val]))
    # use indices to index indices :P (we need to obtain indices referring to the original dataframe)
    idx_train = idx_train_val[train]
    idx_val = idx_train_val[val]
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class.append(
        (np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_train]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_val]).to_numpy(), 
         np.sum(df[['binary_A', 'binary_B', 'binary_C']].loc[idx_test]).to_numpy(),
        )
    )
    
print(sizes)
print(pos_class)

[(8180, 355, 367), (7995, 305, 439), (7184, 424, 455), (8145, 272, 528), (8042, 338, 416), (7640, 428, 358), (8029, 398, 304), (7631, 423, 359), (8682, 261, 368)]
[(array([6663, 4844, 2386]), array([325, 243,  79]), array([239, 133,  76])), (array([6600, 4614, 2500]), array([273, 244, 104]), array([310, 203,  77])), (array([5690, 3625, 1874]), array([344, 298, 137]), array([394, 264, 142])), (array([6498, 4966, 2019]), array([234, 112,  67]), array([437, 321, 222])), (array([6365, 4554, 2256]), array([248, 215, 118]), array([371, 220,  87])), (array([6038, 4239, 2172]), array([333, 219,  80]), array([317, 235, 125])), (array([6296, 4447, 2500]), array([368, 305, 104]), array([248, 143,  75])), (array([6194, 4239, 1992]), array([301, 272,  97]), array([274, 177, 118])), (array([7278, 5059, 2877]), array([235, 166,  32]), array([265, 185,  99]))]


In [16]:
write_indices_and_stats(
    indices, 
    sizes, 
    pos_class,
    total_size=len(df),
    data_name=data_name,
    split_dimension=3, 
    save_indices=True, 
    train_size=""
)

## Control: Show statistics for previously prepared splits

In [17]:
# Load data
data_filename = "synferm_dataset_2023-09-05_40018records.csv"
data_name = data_filename.rsplit("_", maxsplit=1)[0]
df = pd.read_csv(DATA_DIR / "curated_data" / data_filename)
df.shape

(40018, 27)

### 0D split

In [18]:
split_dimension = 0
split_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    
for fold_idx in range(9):
    with open(split_dir / f"fold{fold_idx}_statistics_multiclass.txt", "w") as f:
        # import indices
        train_idx = pd.read_csv(split_dir / f"fold{fold_idx}_train.csv")["index"].to_numpy()
        val_idx = pd.read_csv(split_dir / f"fold{fold_idx}_val.csv")["index"].to_numpy()
        test_idx = pd.read_csv(split_dir / f"fold{fold_idx}_test.csv")["index"].to_numpy()
        
        # check mutually exclusive
        assert len(np.intersect1d(train_idx, val_idx)) == 0
        assert len(np.intersect1d(train_idx, test_idx)) == 0
        assert len(np.intersect1d(val_idx, test_idx)) == 0
        
        # determine multiclass distribution
        train_dist = df["major_A-C"].iloc[train_idx].value_counts().sort_index()
        val_dist = df["major_A-C"].iloc[val_idx].value_counts().sort_index()
        test_dist = df["major_A-C"].iloc[test_idx].value_counts().sort_index()

        for i, item in (train_dist / len(train_idx)).items():
            f.write(f"Training set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (val_dist / len(val_idx)).items():
            f.write(f"Validation set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (test_dist / len(test_idx)).items():
            f.write(f"Test set 'major_A-C' class {i}: {item:.1%}\n")
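
For reference, this is what the class-distribution lines look like on a handful of made-up labels (the label values mirror those seen in the `major_A-C` column above; the counts are invented):

```python
import pandas as pd

labels = pd.Series(["A", "B", "A", "no_product", "A"])

# relative frequency of each class, sorted by class name
dist = labels.value_counts().sort_index() / len(labels)
lines = [f"Training set 'major_A-C' class {name}: {frac:.1%}" for name, frac in dist.items()]
print("\n".join(lines))
```

This prints 60.0% for class A and 20.0% each for B and no_product.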


### 1D split

In [19]:
split_dimension = 1
split_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    
for fold_idx in range(9):
    with open(split_dir / f"fold{fold_idx}_statistics_multiclass.txt", "w") as f:
        # import indices
        train_idx = pd.read_csv(split_dir / f"fold{fold_idx}_train.csv")["index"].to_numpy()
        val_idx = pd.read_csv(split_dir / f"fold{fold_idx}_val.csv")["index"].to_numpy()
        test_idx = pd.read_csv(split_dir / f"fold{fold_idx}_test.csv")["index"].to_numpy()
        
        # check mutually exclusive
        assert len(np.intersect1d(train_idx, val_idx)) == 0
        assert len(np.intersect1d(train_idx, test_idx)) == 0
        assert len(np.intersect1d(val_idx, test_idx)) == 0
        
        # check 1D groups are mutually exclusive
        if fold_idx < 3: # first three are split on initiator
            assert len(np.intersect1d(df["I_long"].iloc[train_idx], df["I_long"].iloc[val_idx])) == 0
            assert len(np.intersect1d(df["I_long"].iloc[train_idx], df["I_long"].iloc[test_idx])) == 0
            assert len(np.intersect1d(df["I_long"].iloc[val_idx], df["I_long"].iloc[test_idx])) == 0
        elif fold_idx < 6:  # next three are split on monomer
            assert len(np.intersect1d(df["M_long"].iloc[train_idx], df["M_long"].iloc[val_idx])) == 0
            assert len(np.intersect1d(df["M_long"].iloc[train_idx], df["M_long"].iloc[test_idx])) == 0
            assert len(np.intersect1d(df["M_long"].iloc[val_idx], df["M_long"].iloc[test_idx])) == 0
        else:  # last three are split on terminator
            assert len(np.intersect1d(df["T_long"].iloc[train_idx], df["T_long"].iloc[val_idx])) == 0
            assert len(np.intersect1d(df["T_long"].iloc[train_idx], df["T_long"].iloc[test_idx])) == 0
            assert len(np.intersect1d(df["T_long"].iloc[val_idx], df["T_long"].iloc[test_idx])) == 0
        
        # determine multiclass distribution
        train_dist = df["major_A-C"].iloc[train_idx].value_counts().sort_index()
        val_dist = df["major_A-C"].iloc[val_idx].value_counts().sort_index()
        test_dist = df["major_A-C"].iloc[test_idx].value_counts().sort_index()

        for i, item in (train_dist / len(train_idx)).items():
            f.write(f"Training set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (val_dist / len(val_idx)).items():
            f.write(f"Validation set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (test_dist / len(test_idx)).items():
            f.write(f"Test set 'major_A-C' class {i}: {item:.1%}\n")


### 2D split

In [20]:
split_dimension = 2
split_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    
for fold_idx in range(9):
    with open(split_dir / f"fold{fold_idx}_statistics_multiclass.txt", "w") as f:
        # import indices
        train_idx = pd.read_csv(split_dir / f"fold{fold_idx}_train.csv")["index"].to_numpy()
        val_idx = pd.read_csv(split_dir / f"fold{fold_idx}_val.csv")["index"].to_numpy()
        test_idx = pd.read_csv(split_dir / f"fold{fold_idx}_test.csv")["index"].to_numpy()
        
        # check mutually exclusive
        assert len(np.intersect1d(train_idx, val_idx)) == 0
        assert len(np.intersect1d(train_idx, test_idx)) == 0
        assert len(np.intersect1d(val_idx, test_idx)) == 0
        
        # check 2D groups are mutually exclusive (np.unique flattens both columns, so no building block of either type may be shared)
        if fold_idx < 3: # first three are split on initiator and monomer
            assert len(np.intersect1d(np.unique(df[["I_long", "M_long"]].iloc[train_idx]), np.unique(df[["I_long", "M_long"]].iloc[val_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["I_long", "M_long"]].iloc[train_idx]), np.unique(df[["I_long", "M_long"]].iloc[test_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["I_long", "M_long"]].iloc[val_idx]), np.unique(df[["I_long", "M_long"]].iloc[test_idx]))) == 0
        elif fold_idx < 6:  # next three are split on monomer and terminator
            assert len(np.intersect1d(np.unique(df[["M_long", "T_long"]].iloc[train_idx]), np.unique(df[["M_long", "T_long"]].iloc[val_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["M_long", "T_long"]].iloc[train_idx]), np.unique(df[["M_long", "T_long"]].iloc[test_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["M_long", "T_long"]].iloc[val_idx]), np.unique(df[["M_long", "T_long"]].iloc[test_idx]))) == 0
        else:  # last three are split on initiator and terminator
            assert len(np.intersect1d(np.unique(df[["I_long", "T_long"]].iloc[train_idx]), np.unique(df[["I_long", "T_long"]].iloc[val_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["I_long", "T_long"]].iloc[train_idx]), np.unique(df[["I_long", "T_long"]].iloc[test_idx]))) == 0
            assert len(np.intersect1d(np.unique(df[["I_long", "T_long"]].iloc[val_idx]), np.unique(df[["I_long", "T_long"]].iloc[test_idx]))) == 0
            
        # determine multiclass distribution
        train_dist = df["major_A-C"].iloc[train_idx].value_counts().sort_index()
        val_dist = df["major_A-C"].iloc[val_idx].value_counts().sort_index()
        test_dist = df["major_A-C"].iloc[test_idx].value_counts().sort_index()

        for i, item in (train_dist / len(train_idx)).items():
            f.write(f"Training set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (val_dist / len(val_idx)).items():
            f.write(f"Validation set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (test_dist / len(test_idx)).items():
            f.write(f"Test set 'major_A-C' class {i}: {item:.1%}\n")
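
The group checks in the cell above pass a two-column selection to `np.unique`, which flattens it into a single array of building block names. The toy sketch below shows the effect; the data and the helper name `shared_groups` are made up, and the approach assumes that names are unique across columns (an initiator never has the same name as a monomer).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "I_long": ["I1", "I1", "I2", "I2"],
    "M_long": ["M1", "M2", "M1", "M2"],
})
train_idx = np.array([0])  # (I1, M1)
test_idx = np.array([3])   # (I2, M2): shares nothing with train
leaky_idx = np.array([1])  # (I1, M2): shares initiator I1 with train


def shared_groups(a_idx, b_idx, cols):
    # np.unique flattens all selected columns into one sorted array of names
    a = np.unique(toy[cols].iloc[a_idx].to_numpy())
    b = np.unique(toy[cols].iloc[b_idx].to_numpy())
    return np.intersect1d(a, b)


clean = shared_groups(train_idx, test_idx, ["I_long", "M_long"])
leaky = shared_groups(train_idx, leaky_idx, ["I_long", "M_long"])
print(list(clean), list(leaky))
```

The first intersection is empty, while the second contains `I1`, so a split containing such a row pair would fail the assertion.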


### 3D split

In [21]:
split_dimension = 3
split_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    
for fold_idx in range(9):
    with open(split_dir / f"fold{fold_idx}_statistics_multiclass.txt", "w") as f:
        # import indices
        train_idx = pd.read_csv(split_dir / f"fold{fold_idx}_train.csv")["index"].to_numpy()
        val_idx = pd.read_csv(split_dir / f"fold{fold_idx}_val.csv")["index"].to_numpy()
        test_idx = pd.read_csv(split_dir / f"fold{fold_idx}_test.csv")["index"].to_numpy()
        
        # check mutually exclusive
        assert len(np.intersect1d(train_idx, val_idx)) == 0
        assert len(np.intersect1d(train_idx, test_idx)) == 0
        assert len(np.intersect1d(val_idx, test_idx)) == 0
        
        # check 3D groups are mutually exclusive (np.unique flattens all three columns, so no building block of any type may be shared)
        assert len(np.intersect1d(np.unique(df[["I_long", "M_long", "T_long"]].iloc[train_idx]), np.unique(df[["I_long", "M_long", "T_long"]].iloc[val_idx]))) == 0
        assert len(np.intersect1d(np.unique(df[["I_long", "M_long", "T_long"]].iloc[train_idx]), np.unique(df[["I_long", "M_long", "T_long"]].iloc[test_idx]))) == 0
        assert len(np.intersect1d(np.unique(df[["I_long", "M_long", "T_long"]].iloc[val_idx]), np.unique(df[["I_long", "M_long", "T_long"]].iloc[test_idx]))) == 0
            
        # determine multiclass distribution
        train_dist = df["major_A-C"].iloc[train_idx].value_counts().sort_index()
        val_dist = df["major_A-C"].iloc[val_idx].value_counts().sort_index()
        test_dist = df["major_A-C"].iloc[test_idx].value_counts().sort_index()

        for i, item in (train_dist / len(train_idx)).items():
            f.write(f"Training set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (val_dist / len(val_idx)).items():
            f.write(f"Validation set 'major_A-C' class {i}: {item:.1%}\n")
        for i, item in (test_dist / len(test_idx)).items():
            f.write(f"Test set 'major_A-C' class {i}: {item:.1%}\n")