# Data Splits

We want to split the SynFerm data set into train, validation and test data.
For all splits, we will do 9 random repetitions.
For 1D and 2D split, which both use 3 different groups to split on, these will divide into 3 random repetitions for each of the 3 groups. 

### 0D Split
For the 0D split, we use a random train-test split.
We use a 80/10/10 split into train, val, and test set.

### 1D Split
For the 1D split, we use a (1D) GroupShuffleSplit.
Each individual split will be 80/10/10 train/test (of groups not samples!).
As groups, we use either initiator, monomer, or terminator.

### 2D Split
For the 2D split, we use a (2D) GroupShuffleSplit.
Each individual split will use 10% of groups as test set and 12.5% of remaining groups as validation set. 
Due to the dimensionality, this means we expect 0.1 * 0.1 = 1% of samples in the test and validation set and 0.900^2 * 0.875^2 = 62.0% of sample in the training set.
The remaining samples are not used to prevent leakage.
As groups, we use either \[initiator, monomer], \[monomer, terminator] or \[initiator, terminator].

### 3D Split
For the 3D split, we use a (3D) GroupShuffleSplit.
Each individual split will use 20% of groups as test set, 25% of remaining groups as validation set, and the remaining groups as training set.
Due to the dimensionality, this means we expect 0.2^3 = 0.8% of samples in the test and validation set and 0.800^3 * 0.750^3 = 21.6% of sample in the training set.

In [1]:
import pathlib
import sys

sys.path.append(str(pathlib.Path().resolve().parents[1]))

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, ShuffleSplit

from src.definitions import DATA_DIR
from src.util.train_test_split import GroupShuffleSplitND

In [2]:
# Load data
data_filename = "synferm_dataset_2023-09-05_40018records.csv"
data_name = data_filename.rsplit("_", maxsplit=1)[0]
df = pd.read_csv(DATA_DIR / "curated_data" / data_filename)
df.shape

(40018, 27)

In [3]:
df.head()

Unnamed: 0,I_long,M_long,T_long,product_A_smiles,I_smiles,M_smiles,T_smiles,reaction_smiles,reaction_smiles_atom_mapped,experiment_id,...,binary_H,scaled_A,scaled_B,scaled_C,scaled_D,scaled_E,scaled_F,scaled_G,scaled_H,major_A-C
0,2-Pyr003,Fused002,TerABT004,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1ccc(F)cc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56113,...,0,0.036021,0.003427,0.0,0.020975,0.002958,0.941981,0.914281,0.0,A
1,2-Pyr003,Fused002,TerABT007,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1cc(Br)ccc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56114,...,0,0.0,0.0,0.0,0.006159,0.364398,0.928851,1.106548,0.0,no_product
2,2-Pyr003,Fused002,TerABT013,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1cc(C(F)(F)F)ccc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56106,...,1,0.0,0.0,0.0,0.014212,2.16642,1.013596,0.537785,0.05686,no_product
3,2-Pyr003,Fused002,TerABT014,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,Nc1ccc(Cl)cc1S,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:16][cH:18][cH:20][c...,56112,...,0,0.028915,0.005039,0.0,0.015578,0.504057,0.992614,0.890646,0.0,A
4,2-Pyr003,Fused002,TerTH001,COc1ccc(CCOC(=O)N2C[C@H](NC(=O)c3cccc(Cl)n3)[C...,O=C(c1cccc(Cl)n1)[B-](F)(F)F.[K+],COc1ccc(CCOC(=O)N2C[C@@H]3NO[C@]4(OC5(CCCCC5)O...,[Cl-].[NH3+]NC(=S)c1ccccc1,O=C(c1cccc(Cl)n1)[B-](F)(F)F.COc1ccc(CCOC(=O)N...,F[B-](F)(F)[C:2]([c:1]1[cH:13][cH:15][cH:17][c...,56109,...,0,0.350061,0.643219,0.0,0.031689,0.613596,0.109309,0.439018,0.0,B


## 0D split

In [105]:
splitter = ShuffleSplit(n_splits=9, test_size=0.1, random_state=42)
inner_splitter = ShuffleSplit(n_splits=1, test_size=0.1/0.9, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [106]:
indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(list(range(len(df)))):
    idx_train, idx_val = next(inner_splitter.split(idx_train_val))
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002), (32014, 4002, 4002)]
[(25899, 3257, 3266), (25887, 3269, 3221), (25911, 3245, 3262), (25932, 3224, 3244), (25953, 3203, 3272), (25931, 3225, 3262), (25896, 3260, 3272), (25892, 3264, 3248), (25902, 3254, 3285)]


In [10]:
#define a write function that we can reuse later
def write_indices_and_stats(indices, sizes, pos_class_A, split_dimension, save_indices=True):
    """
    Write function that can be reused for the other splits
    Args:
        indices: list of 3-tuples
        sizes: list of 3-tuples, length equal to indices
        pos_class_A: list of 3-tuples, length equal to indices
        split_dimension: str or int, e.g. "0", "1", "2", "3"
        save_indices: bool, whether to save the indices. Useful if we need to regenerate statistics. Default: True
    """
    n_folds = len(indices)
    save_dir = DATA_DIR / "curated_data" / "splits" / f"{data_name}_{split_dimension}D_split"
    save_dir.mkdir(parents=True, exist_ok=True)
    if save_indices:
        for i, (idx_train, idx_val, idx_test) in enumerate(indices):
            with open(save_dir / f"fold{i}_train.csv", "w") as f:
                f.write("index\n")
                f.write("\n".join([str(i) for i in idx_train]))
                
            with open(save_dir / f"fold{i}_val.csv", "w") as f:
                f.write("index\n")
                f.write("\n".join([str(i) for i in idx_val]))
                
            with open(save_dir / f"fold{i}_test.csv", "w") as f:
                f.write("index\n")
                f.write("\n".join([str(i) for i in idx_test]))
    
    for i, (size, count_A) in enumerate(zip(sizes, pos_class_A)):
        with open(save_dir / f"fold{i}_statistics.txt", "w") as f:
            f.write(f"Train samples: {size[0]} ({size[0]/len(df):.1%})\n")
            f.write(f"Val samples: {size[1]} ({size[1]/len(df):.1%})\n")
            f.write(f"Test samples: {size[2]} ({size[2]/len(df):.1%})\n")
            if split_dimension > 1:
                f.write(f"Not used: {len(df) - np.sum(size):.0f} ({(len(df) - np.sum(size)) / len(df):.1%})\n")
            f.write(f"Train samples binary_A has label 1: {count_A[0]} ({count_A[0]/size[0]:.1%})\n")
            f.write(f"Val samples binary_A has label 1: {count_A[1]} ({count_A[1]/size[1]:.1%})\n")
            f.write(f"Test samples binary_A has label 1: {count_A[2]} ({count_A[2]/size[2]:.1%})\n")
            
    # summary statistics
    sum_pos_class_A = np.sum(pos_class_A, axis=0)
    sum_sizes = np.sum(sizes, axis=0)
    with open(save_dir / "summary_statistics.txt", "w") as f:
        f.write(f"Mean Train samples: {sum_sizes[0] / n_folds:.0f} ({sum_sizes[0] / n_folds / len(df):.1%})\n")
        f.write(f"Mean Val samples: {sum_sizes[1] / n_folds:.0f} ({sum_sizes[1] / n_folds / len(df):.1%})\n")
        f.write(f"Mean Test samples: {sum_sizes[2] / n_folds:.0f} ({sum_sizes[2] / n_folds / len(df):.1%})\n")
        if split_dimension > 1:
            f.write(f"Not used: {len(df) - np.sum(sum_sizes) / n_folds:.0f} ({(len(df) - np.sum(sum_sizes) / n_folds) / len(df):.1%})\n")
        f.write(f"Mean Train samples binary_A has label 1: {sum_pos_class_A[0] / n_folds:.0f} ({sum_pos_class_A[0]/sum_sizes[0]:.1%})\n")
        f.write(f"Mean Val samples binary_A has label 1: {sum_pos_class_A[1] / n_folds:.0f} ({sum_pos_class_A[1]/sum_sizes[1]:.1%})\n")
        f.write(f"Mean Test samples binary_A has label 1: {sum_pos_class_A[2] / n_folds:.0f} ({sum_pos_class_A[2]/sum_sizes[2]:.1%})\n")

In [108]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=0, save_indices=True)

## 1D split

In [109]:
splitter = GroupShuffleSplit(n_splits=3, test_size=0.1, random_state=np.random.RandomState(42))  # here, we reuse the outer splitter as well, so we use RandomState
inner_splitter = GroupShuffleSplit(n_splits=1, test_size=0.1/0.9, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [110]:
indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df["I_long"]):
    idx_train, idx_val = next(inner_splitter.split(idx_train_val, groups=df["I_long"][idx_train_val]))
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df["M_long"]):
    idx_train, idx_val = next(inner_splitter.split(idx_train_val, groups=df["M_long"][idx_train_val]))
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))
    
for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df["T_long"]):
    idx_train, idx_val = next(inner_splitter.split(idx_train_val, groups=df["T_long"][idx_train_val]))
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(30974, 4318, 4726), (32233, 3546, 4239), (32488, 4020, 3510), (30560, 4729, 4729), (31529, 3911, 4578), (31828, 4387, 3803), (31225, 3725, 5068), (31084, 4497, 4437), (32258, 3504, 4256)]
[(25110, 3433, 4017), (26053, 2880, 3207), (26311, 3263, 2789), (24626, 3916, 4348), (25534, 3134, 3834), (25799, 3545, 3328), (25222, 3034, 4087), (25138, 3621, 2978), (26069, 2847, 3377)]


In [111]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=1, save_indices=True)

## 2D split

In [5]:
splitter = GroupShuffleSplitND(n_splits=3, test_size=0.1, random_state=np.random.RandomState(42))  # here, we reuse the outer splitter as well, so we use RandomState
inner_splitter = GroupShuffleSplitND(n_splits=1, test_size=0.1/0.9, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [6]:
indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df, groups=df[["I_long", "M_long"]]):
    idx_train, idx_val = next(inner_splitter.split(df.iloc[idx_train_val], groups=df[["I_long", "M_long"]].iloc[idx_train_val]))
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df[["M_long", "T_long"]]):
    idx_train, idx_val = next(inner_splitter.split(idx_train_val, groups=df[["M_long", "T_long"]].iloc[idx_train_val]))
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))
    
for idx_train_val, idx_test in splitter.split(list(range(len(df))), groups=df[["I_long", "T_long"]]):
    idx_train, idx_val = next(inner_splitter.split(idx_train_val, groups=df[["I_long", "T_long"]].iloc[idx_train_val]))
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(23868, 608, 565), (23681, 510, 520), (24526, 439, 528), (25765, 338, 602), (23780, 395, 559), (24445, 469, 518), (23951, 395, 687), (24688, 413, 460), (23946, 536, 599)]
[(19233, 504, 464), (19138, 410, 498), (19818, 368, 406), (20796, 271, 429), (19269, 316, 481), (19738, 386, 406), (19397, 333, 630), (19965, 340, 337), (19601, 378, 490)]


In [11]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=2, save_indices=True)

## 3D split

In [16]:
splitter = GroupShuffleSplitND(n_splits=9, test_size=0.2, random_state=np.random.RandomState(42))  # here, we reuse the outer splitter as well, so we use RandomState
inner_splitter = GroupShuffleSplitND(n_splits=1, test_size=0.2/0.8, random_state=np.random.RandomState(42))  # we use a RandomState instance, not an int, because we will reuse this splitter several times

In [17]:
indices = []
sizes = []
pos_class_A = []
for idx_train_val, idx_test in splitter.split(df, groups=df[["I_long", "M_long", "T_long"]]):
    idx_train, idx_val = next(inner_splitter.split(df.iloc[idx_train_val], groups=df[["I_long", "M_long", "T_long"]].iloc[idx_train_val]))
    indices.append((idx_train, idx_val, idx_test))
    sizes.append((len(idx_train), len(idx_val), len(idx_test)))
    pos_class_A.append((np.sum(df['binary_A'][idx_train]), np.sum(df['binary_A'][idx_val]), np.sum(df['binary_A'][idx_test])))

print(sizes)
print(pos_class_A)

[(8180, 355, 367), (7995, 305, 439), (7184, 424, 455), (8145, 272, 528), (8042, 338, 416), (7640, 428, 358), (8029, 398, 304), (7631, 423, 359), (8682, 261, 368)]
[(6719, 270, 239), (6678, 246, 310), (5996, 344, 394), (6801, 223, 437), (6681, 275, 371), (6349, 360, 317), (6671, 308, 248), (6334, 347, 274), (7186, 208, 265)]


In [18]:
write_indices_and_stats(indices, sizes, pos_class_A, split_dimension=3, save_indices=True)

## Control: Show statistics for previously prepared splits

In [64]:
# 0D split
lines = []
train_idx = pd.read_csv(DATA_DIR / "curated_data" / "splits" / "0D_split" / "train_idx.csv")["index"].values
test_idx = pd.read_csv(DATA_DIR / "curated_data" / "splits" / "0D_split" / "test_idx.csv")["index"].values
lines.append(f"0D split: {len(train_idx)} train, {len(test_idx)} ({len(test_idx)/(len(test_idx)+len(train_idx)):.1%}) test")
for i, item in (df["major_A-C"].iloc[train_idx].value_counts().sort_index() / len(train_idx)).items():
    lines.append(f"\tTraining set 'major_A-C' class {i}: {item:.1%}")
for i, item in (df["major_A-C"].iloc[test_idx].value_counts().sort_index() / len(test_idx)).items():
    lines.append(f"\tTest set 'major_A-C' class {i}: {item:.1%}")
# save stats to file
with open(DATA_DIR / "curated_data" / "splits" / "0D_split" / "split_statistics.txt", "w") as f:
    f.write("\n".join(lines))
print("\n".join(lines))

0D split: 36389 train, 4044 (10.0%) test
	Training set 'major_A-C' class A: 53.2%
	Training set 'major_A-C' class B: 21.9%
	Training set 'major_A-C' class C: 8.6%
	Training set 'major_A-C' class no_product: 16.2%
	Test set 'major_A-C' class A: 53.9%
	Test set 'major_A-C' class B: 21.5%
	Test set 'major_A-C' class C: 8.9%
	Test set 'major_A-C' class no_product: 15.7%


In [63]:
# 1D split
lines = []
for i in range(9):
    train_idx = pd.read_csv(DATA_DIR / "curated_data" / "splits" / "1D_split" / f"fold_{i}_train_idx.csv")["index"].values
    test_idx = pd.read_csv(DATA_DIR / "curated_data" / "splits" / "1D_split" / f"fold_{i}_test_idx.csv")["index"].values
    lines.append(f"Fold {i}: {len(train_idx)} train, {len(test_idx)} ({len(test_idx)/(len(test_idx)+len(train_idx)):.1%}) test")
    for i, item in (df["major_A-C"].iloc[train_idx].value_counts().sort_index() / len(train_idx)).items():
        lines.append(f"\tTraining set 'major_A-C' class {i}: {item:.1%}")
    for i, item in (df["major_A-C"].iloc[test_idx].value_counts().sort_index() / len(test_idx)).items():
        lines.append(f"\tTest set 'major_A-C' class {i}: {item:.1%}")

# save stats to file
with open(DATA_DIR / "curated_data" / "splits" / "1D_split" / "split_statistics.txt", "w") as f:
    f.write("\n".join(lines))
print("\n".join(lines))

Fold 0: 35657 train, 4776 (11.8%) test
	Training set 'major_A-C' class A: 52.3%
	Training set 'major_A-C' class B: 21.8%
	Training set 'major_A-C' class C: 9.3%
	Training set 'major_A-C' class no_product: 16.6%
	Test set 'major_A-C' class A: 60.5%
	Test set 'major_A-C' class B: 22.2%
	Test set 'major_A-C' class C: 4.0%
	Test set 'major_A-C' class no_product: 13.3%
Fold 1: 36166 train, 4267 (10.6%) test
	Training set 'major_A-C' class A: 53.6%
	Training set 'major_A-C' class B: 21.8%
	Training set 'major_A-C' class C: 8.8%
	Training set 'major_A-C' class no_product: 15.7%
	Test set 'major_A-C' class A: 50.1%
	Test set 'major_A-C' class B: 22.2%
	Test set 'major_A-C' class C: 7.5%
	Test set 'major_A-C' class no_product: 20.3%
Fold 2: 36895 train, 3538 (8.8%) test
	Training set 'major_A-C' class A: 53.1%
	Training set 'major_A-C' class B: 21.8%
	Training set 'major_A-C' class C: 9.2%
	Training set 'major_A-C' class no_product: 16.0%
	Test set 'major_A-C' class A: 54.9%
	Test set 'major_A-