# Dataloader
Hello dear NMRcrafter! In this notebook I would like to quickly run you through the usage of the nmrcraft dataloader!

### Initiate Dataloader
To initialize the dataloader, run either the first or the second cell. The first cell uses the default parameters, whereas the second cell sets the parameters explicitly. If you want to change the parameters of the dataloader, change them in the second cell!

Cell 1 has the minimum args required to initialize the `DataLoader`.

In [1]:
from nmrcraft.data.dataset import DataLoader

feature_columns = [
    "M_sigma11_ppm",
    "M_sigma22_ppm",
    "M_sigma33_ppm",
    "E_sigma11_ppm",
    "E_sigma22_ppm",
    "E_sigma33_ppm",
]

data_loader = DataLoader(
    feature_columns=feature_columns,
    target_columns="metal",
)

Seed set to 42.


### DataLoader Args
But of course there are a lot more args to choose from. The following ones are meant to be set by you:
- `target_columns`: string that defines which targets the model should predict, with the target names joined by underscores (e.g. `"metal"` or `"metal_X4_E"`).
- `test_size`: defines the train/test split (`0.2` in the examples here, i.e. a fraction held out for testing).
- `random_state`: sets the random state in the dataloader. It doesn't really need to be touched, but if you for some reason want multiple, differently seeded dataloaders, go ahead.
- `dataset_size`: how much of the full dataset is used (given as a fraction in the examples here). The subsampling is done directly at loading, so the train/test split happens after it.
- `target_type`: defines whether the dataloader runs in `"one-hot"` or `"categorical"` mode.
- `complex_geometry`: defines which complexes get loaded into the dataset. You can choose between `"oct"`, `"spy"`, `"tbp"` or `"all"`.
- `include_structural_features`: defines whether structural features are added to X or whether X contains only NMR data.
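
To make the ordering of `dataset_size` and `test_size` a bit more concrete, here is a minimal sketch on synthetic data. It is not the nmrcraft implementation: it assumes both values are fractions (as the values used in this notebook suggest) and uses pandas and sklearn as stand-ins for whatever the dataloader does internally.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (1000 rows, made-up values).
rng = np.random.default_rng(42)
df = pd.DataFrame(
    {
        "M_sigma11_ppm": rng.normal(size=1000),
        "metal": rng.choice(["Mo", "W"], size=1000),
    }
)

dataset_size = 0.1  # assumed: fraction of the full dataset to keep
test_size = 0.2     # assumed: fraction of the kept rows used for testing
random_state = 42

# Step 1: subsample the full dataset first ...
subset = df.sample(frac=dataset_size, random_state=random_state)

# Step 2: ... then split only the kept rows into train and test.
train, test = train_test_split(
    subset, test_size=test_size, random_state=random_state
)

print(len(subset), len(train), len(test))  # 100 80 20
```

So with a small `dataset_size` the test set can get very small very quickly, which is why the examples in this notebook only show a handful of rows.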

In [2]:
from nmrcraft.data.dataset import DataLoader

feature_columns = [
    "M_sigma11_ppm",
    "M_sigma22_ppm",
    "M_sigma33_ppm",
    "E_sigma11_ppm",
    "E_sigma22_ppm",
    "E_sigma33_ppm",
]

data_loader = DataLoader(
    feature_columns=feature_columns,
    target_columns="metal_X4_E",
    test_size=0.2,
    random_state=42,
    dataset_size=0.001,
    target_type="categorical",  # can be "categorical" or "one-hot"
    complex_geometry="all", # can be oct, spy, tbp or all
    include_structural_features=True, 
)

### Load data
To load the dataset, just call the `load_data()` method. It returns the train and test splits of X and y, and also the y_labels.

In [3]:
X_train, X_test, y_train, y_test, y_labels = data_loader.load_data()

### Categorical Features (X array for categorical mode)
In the example below, the first six numbers in each row are the scaled NMR features, i.e. the shielding tensor components passed to the dataloader as `feature_columns` (scaled without data leakage 🥳).
The integers that follow are the structural features that are not targets. Since the targets in this case are metal_X4_E, the integers correspond to X1, X2, X3 and L (not sure about the ordering, but that shouldn't matter to the ML model).

In [4]:
print(X_test)

[[-0.93845859 -1.99117285 -1.68097488  2.11218044 -2.18200124 -2.16674603
   0.          0.          1.          0.        ]
 [-0.7868327  -0.1993694  -0.7927561  -0.10603898  0.5430169   0.59474844
   0.          1.          0.          3.        ]
 [ 0.72493972  0.97727417  1.43524621 -0.38569623  0.59346515  0.53410624
   0.          1.          4.          3.        ]
 [-1.20090888 -0.07551744  0.12365021 -1.08172709 -0.03574517 -0.10621104
   0.          1.          4.          2.        ]
 [ 0.86643341  0.32483778  0.29589483 -0.0912498   0.48605053  0.5570423
   0.          0.          1.          0.        ]
 [ 1.33482703  0.96394775  0.61893974 -0.44746834  0.59521383  0.5870601
   0.          9.          1.          2.        ]]
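
In case you wonder what "without data leakage" means for the scaled columns: the usual pattern is to fit the scaler on the training split only and reuse its statistics for the test split. Below is a minimal sketch of that pattern with sklearn's `StandardScaler` on made-up numbers. Whether nmrcraft uses exactly this scaler is an assumption on my side, and the integer-coded structural columns are simply appended unscaled, as the output above suggests.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up NMR features: 80 train rows and 20 test rows with 6 columns each.
rng = np.random.default_rng(0)
X_train_raw = rng.normal(loc=100.0, scale=20.0, size=(80, 6))
X_test_raw = rng.normal(loc=100.0, scale=20.0, size=(20, 6))

scaler = StandardScaler()
# Mean and standard deviation are computed from the training split only ...
X_train_scaled = scaler.fit_transform(X_train_raw)
# ... and the test split is transformed with those same statistics.
X_test_scaled = scaler.transform(X_test_raw)

# Hypothetical integer-coded structural features, appended without scaling.
struct_test = rng.integers(0, 5, size=(20, 4))
X_test_full = np.hstack([X_test_scaled, struct_test])

print(X_test_full.shape)  # (20, 10)
```

Fitting the scaler on train and test together would let information about the test set leak into training, which is what this pattern avoids.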


### Target Features (y array for categorical mode)
Here we have the columns representing metal, X4 and E. The integers come from the LabelEncoder, and each integer corresponds to one metal or ligand. For example, the first row 0, 10, 13 is 'Mo', 'triphenylsiloxy', 'selenido' (as can be seen in the decoder output below).

In [5]:
print(y_test)

[[ 0 10 13]
 [ 1  1  9]
 [ 1  6 11]
 [ 0  3 12]
 [ 1  4  8]
 [ 1  6  7]]
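
If you have not used sklearn's `LabelEncoder` before, here is a small sketch of how string targets turn into integer columns like the ones above and back again. It assumes one encoder per target column, which may or may not be how nmrcraft does it internally, and the three example rows are made up from names that appear in this notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy targets with the columns metal, X4 and E.
targets = np.array(
    [
        ["Mo", "phenoxy", "oxo"],
        ["W", "methyl", "selenido"],
        ["W", "phenoxy", "oxo"],
    ]
)

# Assumption: one LabelEncoder per target column.
encoders = []
encoded_columns = []
for column in targets.T:
    encoder = LabelEncoder().fit(column)
    encoders.append(encoder)
    encoded_columns.append(encoder.transform(column))

y = np.column_stack(encoded_columns)
print(y)
# [[0 1 0]
#  [1 0 1]
#  [1 1 0]]

# Decoding needs exactly the encoders that produced the integers.
decoded = np.column_stack(
    [encoder.inverse_transform(y[:, i]) for i, encoder in enumerate(encoders)]
)
print(decoded[0])  # ['Mo' 'phenoxy' 'oxo']
```

The integers are assigned in sorted order of the class names within each column, so the same integer can mean different things in different columns.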


### Decode target (y) arrays
To decode the targets, use either the `categorical_target_decoder()` or the `binarized_target_decoder()` method of the dataloader, depending on the `target_type` you chose. It returns an array in which each position holds the string that the encoded value stands for. As far as I know, these decoded labels can be passed into the confusion matrix, possibly after flattening them.

__Keep in mind that decoding is dataloader specific: you need the same encoder to decode your data, so always decode predicted targets with the dataloader that produced the X they were predicted from.__

Basically: `dataloader1 = DataLoader(...)` makes X1 and y1. The model is trained on X1 and y1 from dataloader1. The predictions of that model must be decoded with dataloader1; you can't make another `dataloader2 = DataLoader(...)` and try to decode y1_pred with dataloader2.

In [6]:
data_loader.categorical_target_decoder(y_test)


array([['Mo', 'triphenylsiloxy', 'selenido'],
       ['W', 'methyl', 'imido4nitrophenyl'],
       ['W', 'tertbutoxy3F', 'imidotrityl'],
       ['Mo', 'pentafluorophenoxy', 'oxo'],
       ['W', 'phenoxy', 'imido4methylphenyl'],
       ['W', 'tertbutoxy3F', 'imido4hydroxyphenyl']], dtype='<U19')
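
Why decoding with a different dataloader goes wrong can be shown with two plain sklearn `LabelEncoder` objects. This is only an illustration of the principle, under the assumption that each dataloader fits its own encoder on the data it happened to load. `enc1` and `enc2` are hypothetical stand-ins for the encoders of two separate dataloaders.

```python
from sklearn.preprocessing import LabelEncoder

# Two encoders fitted on different subsets of ligands, as could happen
# when two dataloaders sample different rows.
enc1 = LabelEncoder().fit(["methyl", "phenoxy", "tertbutoxy3F"])
enc2 = LabelEncoder().fit(["neopentyl", "pentafluorophenoxy", "phenoxy"])

# Integer "predictions" in the encoding of enc1.
y_pred = enc1.transform(["methyl", "tertbutoxy3F"])
print(y_pred)  # [0 2]

# Decoding with the matching encoder gives back the right labels ...
right = enc1.inverse_transform(y_pred)
print(list(right))  # ['methyl', 'tertbutoxy3F']

# ... decoding with the other encoder silently gives wrong labels.
wrong = enc2.inverse_transform(y_pred)
print(list(wrong))  # ['neopentyl', 'phenoxy']
```

Note that the second decoding does not raise an error, which is exactly why this mistake is easy to miss.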

In [7]:
data_loader.confusion_matrix_data_adapter_categorical(y_test)

['Mo',
 'triphenylsiloxy',
 'selenido',
 'W',
 'methyl',
 'imido4nitrophenyl',
 'W',
 'tertbutoxy3F',
 'imidotrityl',
 'Mo',
 'pentafluorophenoxy',
 'oxo',
 'W',
 'phenoxy',
 'imido4methylphenyl',
 'W',
 'tertbutoxy3F',
 'imido4hydroxyphenyl']
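
Judging from the two outputs above, the adapter returns the decoded array flattened row by row. A rough equivalent with plain numpy could look like the sketch below. This is my guess at the behaviour, not the actual nmrcraft code, and it only uses the first two decoded rows from above.

```python
import numpy as np

# First two rows of the decoded targets shown above.
decoded = np.array(
    [
        ["Mo", "triphenylsiloxy", "selenido"],
        ["W", "methyl", "imido4nitrophenyl"],
    ]
)

# Flatten row by row into one long list of labels.
flat = decoded.ravel().tolist()
print(flat)
# ['Mo', 'triphenylsiloxy', 'selenido', 'W', 'methyl', 'imido4nitrophenyl']
```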

### Confusion matrix with the dataloader
Here is just an example of how nicely the sklearn functions seem to work with labeled data.

In [8]:
from sklearn.metrics import confusion_matrix

# Demonstration only: the true labels double as "predictions", so the matrix is diagonal.
y_true_cm = y_pred_cm = data_loader.confusion_matrix_data_adapter_categorical(y_test)
confusion_matrix(y_true_cm, y_pred_cm)

array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
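
With real predictions the matrix is of course no longer purely diagonal. Here is a tiny sketch with made-up metal labels, just to show how to read it. The `labels` argument of `confusion_matrix` fixes the order of rows and columns; without it sklearn uses the sorted labels that appear in either list.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted metal labels.
y_true = ["Mo", "W", "W", "Mo", "W"]
y_pred = ["Mo", "W", "Mo", "Mo", "W"]

labels = ["Mo", "W"]  # row/column order of the matrix
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[2 0]
#  [1 2]]
```

Rows are the true labels and columns the predicted ones, so the 1 in the second row says that one true W was predicted as Mo.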

### One-hot
Here is the same workflow in one-hot mode as an example, so you can see what the arrays look like. Note that the structural features are one-hot encoded, so each 0/1 column corresponds to one category of a structural feature. [Here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) you can read up on the one-hot encoder.

In [9]:
from nmrcraft.data.dataset import DataLoader

feature_columns = [
    "M_sigma11_ppm",
    "M_sigma22_ppm",
    "M_sigma33_ppm",
    "E_sigma11_ppm",
    "E_sigma22_ppm",
    "E_sigma33_ppm",
]

data_loader = DataLoader(
    feature_columns=feature_columns,
    target_columns="metal_X4",
    test_size=0.2,
    random_state=42,
    dataset_size=0.0003,
    target_type="one-hot",  # can be "categorical" or "one-hot"
    complex_geometry="all", # can be oct, spy, tbp or all
    include_structural_features=True, 
)

X_train, X_test, y_train, y_test, y_labels = data_loader.load_data()

print(X_test)

[[ 1.  1.  1. -1.  1. -1.  1.  1.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
   0.  0.  1.  0.  0.  0.  1.  0.  0.  0.  0.]
 [-1. -1. -1.  1. -1.  1.  1.  0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.
   1.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]]
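
To see where the 0/1 columns after the six NMR features come from, here is a small sketch with sklearn's `OneHotEncoder` on two made-up structural columns. The column contents are only for illustration, and how nmrcraft configures its encoder is not shown in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two made-up structural feature columns with two categories each.
structural = np.array(
    [
        ["methyl", "oxo"],
        ["phenoxy", "selenido"],
        ["methyl", "selenido"],
    ]
)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it.
one_hot = encoder.fit_transform(structural).toarray()
print(one_hot)
# [[1. 0. 1. 0.]
#  [0. 1. 0. 1.]
#  [1. 0. 0. 1.]]

# One block of columns per input column, categories in sorted order.
print([list(c) for c in encoder.categories_])
# [['methyl', 'phenoxy'], ['oxo', 'selenido']]
```

Every input column expands into as many 0/1 columns as it has categories, which is why X gets much wider in one-hot mode than in categorical mode.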


### Label Binarized target (y-array one-hot mode)
Here you can see that, for example, the 6th column of the array corresponds to tertbutoxy3F and is set to 1 (True) in the first row, while the 4th column corresponds to neopentyl and is set to 1 in the second row.
The metal column is special: since there are only two metals, it is already binary, so the metals get a single column that is either True or False. When the metal column is False it's molybdenum (Mo), and True means tungsten (W), at least in this particular case. This is determined automatically by the [LabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html#sklearn.preprocessing.LabelBinarizer).

In [10]:
print(y_test)

[[1 0 0 0 0 1 0 0 0]
 [1 0 0 1 0 0 0 0 0]]
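
The single metal column is standard `LabelBinarizer` behaviour for two classes, and you can reproduce it on toy data. The sketch below assumes one binarizer per target, which is my assumption about the internals, and the ligand list is shortened to three made-up entries, so the column positions differ from the output above.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Two classes -> a single 0/1 column.
metal_binarizer = LabelBinarizer().fit(["Mo", "W", "W"])
# Three classes -> one column per class.
ligand_binarizer = LabelBinarizer().fit(["methyl", "neopentyl", "tertbutoxy3F"])

metal_cols = metal_binarizer.transform(["W", "Mo"])
ligand_cols = ligand_binarizer.transform(["tertbutoxy3F", "neopentyl"])

y = np.hstack([metal_cols, ligand_cols])
print(y)
# [[1 0 0 1]
#  [0 0 1 0]]

# Classes are sorted, so 'Mo' maps to 0 and 'W' maps to 1.
print(list(metal_binarizer.inverse_transform(metal_cols)))  # ['W', 'Mo']
```

Which metal ends up as 0 and which as 1 follows from the sorted class names, which matches the Mo = False, W = True assignment described above.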


In [11]:
data_loader.binarized_target_decoder(y_test)

[['W', 'tertbutoxy3F'], ['W', 'neopentyl']]

In [12]:
data_loader.confusion_matrix_data_adapter_one_hot(y_test)

['W', 'tertbutoxy3F', 'W', 'neopentyl']