## 1D data split on aldehydes

(adapted from canonical_data_split.ipynb)

We want a set of splits following this recipe:

- First, set aside the last 96 records as an external validation set (this is the last plate, ice-12-103).
- From all remaining data, take away a 1D test set (split on aldehydes). We use a 10-fold GroupShuffleSplit here, with 10% test data. Because group splits assign whole groups and the groups are uneven in size, the realized test fraction will not be exactly 10%.
- From the remaining data A, take away a 1D validation set (10% of A, also split on aldehydes). The rest is the training set.
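The recipe above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual data: the group sizes, the number of held-out records (`n_external = 50`), and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in for the real data (assumption): 20 groups of uneven size.
rng = np.random.default_rng(0)
group_sizes = np.arange(3, 23)  # 3, 4, ..., 22 records per group (250 in total)
groups = rng.permutation(np.repeat(np.arange(20), group_sizes))
X = np.zeros((len(groups), 1))  # feature values do not affect the split

# Step 1: hold out the last records as an external validation set.
n_external = 50
X_dev, groups_dev = X[:-n_external], groups[:-n_external]

# Step 2: the outer 10-fold split takes away the test set.
outer = GroupShuffleSplit(n_splits=10, test_size=0.1, random_state=42)
fold_sizes = []
for rest, test in outer.split(X_dev, groups=groups_dev):
    # Step 3: the inner split takes a validation set from the remainder.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=123)
    train_pos, val_pos = next(inner.split(rest, groups=groups_dev[rest]))
    train, val = rest[train_pos], rest[val_pos]  # map positions back to record indices

    # No group may appear in more than one subset.
    g_train, g_val, g_test = (set(groups_dev[idx]) for idx in (train, val, test))
    assert not (g_train & g_val or g_train & g_test or g_val & g_test)

    fold_sizes.append((len(train), len(val), len(test)))

# Realized fractions vary from fold to fold because whole groups are assigned.
for i, (n_train, n_val, n_test) in enumerate(fold_sizes):
    total = n_train + n_val + n_test
    print(f"fold {i}: train {n_train/total:.0%}, val {n_val/total:.0%}, test {n_test/total:.0%}")
```

The printed fractions show how far each fold deviates from the nominal 10% when groups are uneven.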


In [17]:
import pathlib
import sys
import os
sys.path.append(os.path.abspath("../"))

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

from src.data import SLAPData

In [18]:
# load data

data_path = os.path.abspath("../data/Data S3.csv")

data = SLAPData(data_path)

data.load_data_from_file()
data.split_reaction_smiles()

In [19]:
print(data.groups)

[[ 7 47]
 [ 7  0]
 [ 7 28]
 ...
 [34 10]
 [34 35]
 [34 11]]
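Since the split sizes depend on how many records each group contains, it can help to look at the group sizes per column. The sketch below uses a small made-up array in the same two-column layout as `data.groups` (column 0 for the SLAP reagent, column 1 for the aldehyde, as implied by the split code in this notebook); the values are illustrative only.

```python
import numpy as np

# Hypothetical group array in the same layout as data.groups (values are made up):
# column 0 = SLAP reagent id, column 1 = aldehyde id.
groups = np.array([[7, 47], [7, 0], [7, 28], [34, 10], [34, 35], [34, 11], [34, 47]])

for column, name in [(0, "SLAP reagent"), (1, "aldehyde")]:
    # np.unique with return_counts=True gives each group id and its record count.
    ids, counts = np.unique(groups[:, column], return_counts=True)
    print(f"{name}: {len(ids)} groups, sizes from {counts.min()} to {counts.max()}")
```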


In [20]:
len(data.all_X)

859

In [21]:
splitter = GroupShuffleSplit(n_splits=10, test_size=0.1, random_state=42)

In [22]:
# We use only the first 763 records, because the external validation plate starts after that
# (this can be checked in generate_ml_datasets.ipynb).
# Note that this applies only to the LCMS data set, not to the isolated yields, which have fewer entries.

train_counter, val_counter, test_counter = 0, 0, 0
train_pos_class, val_pos_class, test_pos_class = 0, 0, 0

for i, (data_subset_A, test_1D) in enumerate(splitter.split(data.all_X[:763], groups=data.groups[:763, 1])):
    
    # we take a (1D) validation set. Rest is training set
    inner_splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=123)
    train_idx, val_1D_idx = next(inner_splitter.split(data_subset_A, groups=data.groups[data_subset_A, 1]))
    # The inner splitter returns positions within data_subset_A, so we index data_subset_A
    # with them to recover indices into the full data set.
    train = data_subset_A[train_idx]
    val_1D = data_subset_A[val_1D_idx]
    
    # update counters
    train_counter += len(train)
    val_counter += len(val_1D)
    test_counter += len(test_1D)
    train_pos_class += np.sum(data.all_y[train])
    val_pos_class += np.sum(data.all_y[val_1D])
    test_pos_class += np.sum(data.all_y[test_1D])
    
    print(f"Statistics for fold {i}:")
    print(f"ID \t\t num \t|\t %positive")
    print(f"Train: \t\t {len(train)} \t|\t {np.mean(data.all_y[train]):.0%}")
    print(f"Val_1D: \t {len(val_1D)} \t|\t {np.mean(data.all_y[val_1D]):.0%}")
    print(f"Test_1D: \t {len(test_1D)} \t|\t {np.mean(data.all_y[test_1D]):.0%}")
    assert np.intersect1d(val_1D, test_1D).size == 0
    assert np.intersect1d(train, test_1D).size == 0
    assert np.intersect1d(train, val_1D).size == 0
    
    # save the indices
    save = False
    if save:
        save_path = pathlib.Path("../data/dataset_splits/LCMS_split_763records_1Dsplit_aldehydes_10fold_v2")
        save_path.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(train).to_csv(save_path / f"fold{i}_train.csv", index=False, header=None)
        pd.DataFrame(val_1D).to_csv(save_path / f"fold{i}_val.csv", index=False, header=None)
        pd.DataFrame(test_1D).to_csv(save_path / f"fold{i}_test_1D.csv", index=False, header=None)
        
# summary statistics
n = train_counter + val_counter + test_counter
print("\nSummary statistics:")
print(f"Split sizes: {train_counter/n:.0%} train, {val_counter/n:.0%} val, {test_counter/n:.0%} test")
print(f"Class balance (positive class ratio): {train_pos_class/train_counter:.0%} train, {val_pos_class/val_counter:.0%} val, {test_pos_class/test_counter:.0%} test")


Statistics for fold 0:
ID 		 num 	|	 %positive
Train: 		 575 	|	 55%
Val_1D: 	 53 	|	 26%
Test_1D: 	 135 	|	 47%
Statistics for fold 1:
ID 		 num 	|	 %positive
Train: 		 601 	|	 54%
Val_1D: 	 104 	|	 36%
Test_1D: 	 58 	|	 50%
Statistics for fold 2:
ID 		 num 	|	 %positive
Train: 		 579 	|	 52%
Val_1D: 	 96 	|	 47%
Test_1D: 	 88 	|	 56%
Statistics for fold 3:
ID 		 num 	|	 %positive
Train: 		 588 	|	 53%
Val_1D: 	 96 	|	 34%
Test_1D: 	 79 	|	 58%
Statistics for fold 4:
ID 		 num 	|	 %positive
Train: 		 577 	|	 53%
Val_1D: 	 104 	|	 36%
Test_1D: 	 82 	|	 61%
Statistics for fold 5:
ID 		 num 	|	 %positive
Train: 		 601 	|	 58%
Val_1D: 	 104 	|	 33%
Test_1D: 	 58 	|	 17%
Statistics for fold 6:
ID 		 num 	|	 %positive
Train: 		 628 	|	 52%
Val_1D: 	 79 	|	 41%
Test_1D: 	 56 	|	 59%
Statistics for fold 7:
ID 		 num 	|	 %positive
Train: 		 612 	|	 53%
Val_1D: 	 71 	|	 44%
Test_1D: 	 80 	|	 50%
Statistics for fold 8:
ID 		 num 	|	 %positive
Train: 		 608 	|	 51%
Val_1D: 	 66 	|	 50%
Test_1D: 	

## For SLAP reagents
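The same procedure, but split on SLAP reagents (column 0 of data.groups) instead of aldehydes (column 1). Note that the inner validation split uses test_size=0.12 here, rather than 0.1.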
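To make concrete what a 1D split does and does not guarantee, here is a small sketch on a made-up full grid of reagent pairs. The grid size, the ids, and the helper `split_1d` are hypothetical. Splitting on one column keeps that column's ids disjoint between train and test, while ids in the other column still appear on both sides.

```python
import itertools
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical full grid: 8 SLAP reagents x 10 aldehydes (made-up ids).
# Column 0 = SLAP reagent, column 1 = aldehyde, as in data.groups.
pairs = np.array(list(itertools.product(range(8), range(10))))

def split_1d(column, seed=42):
    """Return (train, test) indices for a group split on one column of `pairs`."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=seed)
    return next(splitter.split(pairs, groups=pairs[:, column]))

for column, name in [(0, "SLAP reagent"), (1, "aldehyde")]:
    train, test = split_1d(column)
    other = 1 - column
    # Ids shared between train and test, in the split column and in the other column.
    shared_split = set(pairs[test, column].tolist()) & set(pairs[train, column].tolist())
    shared_other = set(pairs[test, other].tolist()) & set(pairs[train, other].tolist())
    print(f"split on {name}: {len(shared_split)} shared ids in the split column, "
          f"{len(shared_other)} shared ids in the other column")
```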

In [23]:
splitter = GroupShuffleSplit(n_splits=10, test_size=0.1, random_state=42)

In [24]:
# We use only the first 763 records, because the external validation plate starts after that
# (this can be checked in generate_ml_datasets.ipynb).
# Note that this applies only to the LCMS data set, not to the isolated yields, which have fewer entries.

train_counter, val_counter, test_counter = 0, 0, 0
train_pos_class, val_pos_class, test_pos_class = 0, 0, 0

for i, (data_subset_A, test_1D) in enumerate(splitter.split(data.all_X[:763], groups=data.groups[:763, 0])):
    
    # we take a (1D) validation set. Rest is training set
    inner_splitter = GroupShuffleSplit(n_splits=1, test_size=0.12, random_state=123)
    train_idx, val_1D_idx = next(inner_splitter.split(data_subset_A, groups=data.groups[data_subset_A, 0]))
    train = data_subset_A[train_idx]
    val_1D = data_subset_A[val_1D_idx]
    
    # update counters
    train_counter += len(train)
    val_counter += len(val_1D)
    test_counter += len(test_1D)
    train_pos_class += np.sum(data.all_y[train])
    val_pos_class += np.sum(data.all_y[val_1D])
    test_pos_class += np.sum(data.all_y[test_1D])
    
    print(f"Statistics for fold {i}:")
    print(f"ID \t\t num \t|\t %positive")
    print(f"Train: \t\t {len(train)} \t|\t {np.mean(data.all_y[train]):.0%}")
    print(f"Val: \t\t {len(val_1D)} \t|\t {np.mean(data.all_y[val_1D]):.0%}")
    print(f"Test_1D: \t {len(test_1D)} \t|\t {np.mean(data.all_y[test_1D]):.0%}")
    assert np.intersect1d(val_1D, test_1D).size == 0
    assert np.intersect1d(train, test_1D).size == 0
    assert np.intersect1d(train, val_1D).size == 0
    
    # save the indices
    save = False
    if save:
        save_path = pathlib.Path("../data/dataset_splits/LCMS_split_763records_1Dsplit_SLAP_10fold_v2")
        save_path.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(train).to_csv(save_path / f"fold{i}_train.csv", index=False, header=None)
        pd.DataFrame(val_1D).to_csv(save_path / f"fold{i}_val.csv", index=False, header=None)
        pd.DataFrame(test_1D).to_csv(save_path / f"fold{i}_test_1D.csv", index=False, header=None)

# summary statistics
n = train_counter + val_counter + test_counter
print("\nSummary statistics:")
print(f"Split sizes: {train_counter/n:.0%} train, {val_counter/n:.0%} val, {test_counter/n:.0%} test")
print(f"Class balance (positive class ratio): {train_pos_class/train_counter:.0%} train, {val_pos_class/val_counter:.0%} val, {test_pos_class/test_counter:.0%} test")


Statistics for fold 0:
ID 		 num 	|	 %positive
Train: 		 609 	|	 52%
Val: 		 72 	|	 40%
Test_1D: 	 82 	|	 60%
Statistics for fold 1:
ID 		 num 	|	 %positive
Train: 		 602 	|	 51%
Val: 		 72 	|	 60%
Test_1D: 	 89 	|	 49%
Statistics for fold 2:
ID 		 num 	|	 %positive
Train: 		 602 	|	 49%
Val: 		 72 	|	 56%
Test_1D: 	 89 	|	 63%
Statistics for fold 3:
ID 		 num 	|	 %positive
Train: 		 619 	|	 52%
Val: 		 72 	|	 51%
Test_1D: 	 72 	|	 46%
Statistics for fold 4:
ID 		 num 	|	 %positive
Train: 		 575 	|	 47%
Val: 		 82 	|	 55%
Test_1D: 	 106 	|	 72%
Statistics for fold 5:
ID 		 num 	|	 %positive
Train: 		 619 	|	 52%
Val: 		 72 	|	 44%
Test_1D: 	 72 	|	 54%
Statistics for fold 6:
ID 		 num 	|	 %positive
Train: 		 619 	|	 54%
Val: 		 72 	|	 50%
Test_1D: 	 72 	|	 29%
Statistics for fold 7:
ID 		 num 	|	 %positive
Train: 		 619 	|	 52%
Val: 		 72 	|	 49%
Test_1D: 	 72 	|	 49%
Statistics for fold 8:
ID 		 num 	|	 %positive
Train: 		 609 	|	 52%
Val: 		 72 	|	 61%
Test_1D: 	 82 	|	 37%
Statistic