# Dataloader
Hello dear NMRcrafter! In this notebook I would like to quickly walk you through how to use the nmrcraft dataloader.

### Initialize Dataloader
To initialize the dataloader you can run either the first cell or the second cell. The first cell initializes the dataloader with the default parameters, whereas the second cell initializes it with parameters of your choice. If you want to change the dataloader parameters, change them in the second cell.

Cell 1 has the minimum args required to initialize the DataLoader.

In [1]:
from nmrcraft.data.dataset import DataLoader

feature_columns = [
    "M_sigma11_ppm",
    "M_sigma22_ppm",
    "M_sigma33_ppm",
    "E_sigma11_ppm",
    "E_sigma22_ppm",
    "E_sigma33_ppm",
]

data_loader = DataLoader(
    feature_columns=feature_columns,
    target_columns="metal",
)



Seed set to 42.


### DataLoader Args
Of course there are a lot more args to choose from, including the following ones that are meant to be set by you:
- `target_columns`: a string that defines which targets the model should predict, e.g. `"metal"` or `"metal_X4_E"`.
- `test_size`: defines the train/test split.
- `random_state`: sets the random state in the dataloader. It doesn't really need to be touched, but if you want multiple dataloaders for some reason, go ahead.
- `dataset_size`: how much of the total dataset is used. This is applied directly at loading, so the train/test split etc. happens after it.
- `target_type`: defines whether the dataloader works in `"one-hot"` or `"categorical"` mode.
- `complex_geometry`: defines which kinds of complexes get loaded into the dataset. You can choose between `oct`, `spy`, `tbp` or `all`.
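
To make the two `target_type` modes a bit more concrete, here is a minimal sketch using scikit-learn's `LabelEncoder` and `LabelBinarizer` on a toy list of metals. It only illustrates what the two encodings look like; it is an assumption that the dataloader behaves similarly, not a description of its actual implementation.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Toy target column (illustrative values only, not read from the dataset)
metals = np.array(["W", "Mo", "W"])

# "categorical" style: one integer class index per sample
categorical = LabelEncoder().fit_transform(metals)

# "one-hot" style: binary indicator columns per sample
# (with only two classes, LabelBinarizer returns a single 0/1 column)
one_hot = LabelBinarizer().fit_transform(metals)

print(categorical)
print(one_hot.ravel())
```

Classes are sorted alphabetically here, so `Mo` maps to 0 and `W` maps to 1.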

In [2]:
from nmrcraft.data.dataset import DataLoader

feature_columns = [
    "M_sigma11_ppm",
    "M_sigma22_ppm",
    "M_sigma33_ppm",
    "E_sigma11_ppm",
    "E_sigma22_ppm",
    "E_sigma33_ppm",
]

data_loader = DataLoader(
    feature_columns=feature_columns,
    target_columns="metal_X4_E",
    test_size=0.2,
    random_state=42,
    dataset_size=0.001,
    target_type="categorical",  # can be "categorical" or "one-hot"
    complex_geometry="all",  # can be "oct", "spy", "tbp" or "all"
)

### Load data
To load the dataset you can just call the `load_data()` method. It returns the train and test splits of X and y, plus the y_labels.

In [3]:
X_train, X_test, y_train, y_test, y_labels = data_loader.load_data()
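
The order of operations described for `dataset_size` and `test_size` (subsample first, split afterwards) can be sketched on toy data. This assumes `dataset_size` is a fraction of all rows and uses `train_test_split` from scikit-learn as a stand-in; the variable names and random data below are made up for illustration and are not taken from nmrcraft.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_all = rng.normal(size=(1000, 6))       # toy feature matrix with 6 features
y_all = rng.integers(0, 2, size=1000)    # toy binary target

dataset_size = 0.1   # assumed to mean "fraction of all rows"
test_size = 0.2

# Step 1: subsample the full dataset
n_keep = round(len(X_all) * dataset_size)
keep = rng.choice(len(X_all), size=n_keep, replace=False)
X_sub, y_sub = X_all[keep], y_all[keep]

# Step 2: split only the subsample into train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sub, y_sub, test_size=test_size, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With these toy numbers, 100 rows survive the subsampling and 20 of them end up in the test set.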


In [4]:
print(X_test)

[[-0.2722084  -2.21144124 -1.43206187  3.74585072 -5.16278678 -3.93518974
   0.          0.          1.          0.        ]
 [-0.18789275 -0.07571338 -0.56527957  0.58809239  0.38235406  0.57606602
   0.          1.          0.          3.        ]
 [ 0.65276891  1.32677885  1.6089516   0.18998478  0.48501124  0.4769992
   0.          1.          4.          3.        ]
 [-0.41815093  0.07191111  0.32900997 -0.80085373 -0.79536918 -0.56904129
   0.          1.          4.          2.        ]
 [ 0.73145027  0.54911177  0.49709759  0.6091456   0.26643317  0.51446819
   0.          0.          1.          0.        ]
 [ 0.99191313  1.31089452  0.81234606  0.10204876  0.48856963  0.56350613
   0.          9.          1.          2.        ]]


In [5]:
print(y_test)

[[ 0 10 13]
 [ 1  1  9]
 [ 1  6 11]
 [ 0  3 12]
 [ 1  4  8]
 [ 1  6  7]]


### Decode target (y) arrays
To decode the targets you can use either the `categorical_target_decoder()` or the `binarized_target_decoder()` method of the dataloader, depending on the `target_type`. Both return a labeled list of lists. As far as I know, the decoded labels can easily be passed into a confusion matrix, possibly after flattening them.

In [6]:
data_loader.categorical_target_decoder(y_test)


[['Mo', 'triphenylsiloxy', 'selenido'],
 ['W', 'methyl', 'imido4nitrophenyl'],
 ['W', 'tertbutoxy3F', 'imidotrityl'],
 ['Mo', 'pentafluorophenoxy', 'oxo'],
 ['W', 'phenoxy', 'imido4methylphenyl'],
 ['W', 'tertbutoxy3F', 'imido4hydroxyphenyl']]
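
For intuition, decoding a categorical target array can be sketched with one `LabelEncoder` per target column. This is a hedged illustration on toy classes, assuming each column holds class indices of its own encoder; it is not necessarily how `categorical_target_decoder()` works internally.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# One encoder per target column, fitted on toy classes (illustrative only)
encoders = [
    LabelEncoder().fit(["Mo", "W"]),
    LabelEncoder().fit(["oxo", "selenido"]),
]

# Toy encoded targets: one row per sample, one column per target
y_encoded = np.array([[0, 1], [1, 0]])

# Decode column by column, then stitch the columns back into rows
columns = [
    enc.inverse_transform(y_encoded[:, i]) for i, enc in enumerate(encoders)
]
decoded = np.column_stack(columns).tolist()

print(decoded)
```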

In [7]:
data_loader.confusion_matrix_data_adapter_categorical(y_test)

['Mo',
 'triphenylsiloxy',
 'selenido',
 'W',
 'methyl',
 'imido4nitrophenyl',
 'W',
 'tertbutoxy3F',
 'imidotrityl',
 'Mo',
 'pentafluorophenoxy',
 'oxo',
 'W',
 'phenoxy',
 'imido4methylphenyl',
 'W',
 'tertbutoxy3F',
 'imido4hydroxyphenyl']
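
The flat list above has the same content as the decoded list of lists, just with the rows chained together. A minimal sketch of that flattening in plain Python, using the first two decoded rows from above (this only mirrors the observed output and is not the adapter's actual code):

```python
from itertools import chain

# First two decoded rows from the output above
decoded_rows = [
    ["Mo", "triphenylsiloxy", "selenido"],
    ["W", "methyl", "imido4nitrophenyl"],
]

# Chain the rows together into one flat list of labels
flat = list(chain.from_iterable(decoded_rows))

print(flat)
```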

### Confusion matrix with the dataloader
Here is just an example of how nicely the sklearn functions work with labeled data. Since `y_true_cm` and `y_pred_cm` are the same list here, all counts land on the diagonal.

In [8]:
from sklearn.metrics import confusion_matrix
y_true_cm = y_pred_cm = data_loader.confusion_matrix_data_adapter_categorical(y_test)
confusion_matrix(y_true_cm, y_pred_cm)

array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
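
The matrix alone does not say which row belongs to which label. When no `labels` argument is given, `confusion_matrix` uses the sorted unique labels, so passing `labels` explicitly makes that order visible. A small sketch using the flattened labels printed above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Flattened labels as printed by the adapter above
y_flat = [
    "Mo", "triphenylsiloxy", "selenido",
    "W", "methyl", "imido4nitrophenyl",
    "W", "tertbutoxy3F", "imidotrityl",
    "Mo", "pentafluorophenoxy", "oxo",
    "W", "phenoxy", "imido4methylphenyl",
    "W", "tertbutoxy3F", "imido4hydroxyphenyl",
]

# Fix the row/column order explicitly so it can be printed next to the counts
labels = sorted(set(y_flat))
cm = confusion_matrix(y_flat, y_flat, labels=labels)

# Print each label next to its diagonal count
for label, count in zip(labels, np.diag(cm)):
    print(f"{label}: {count}")
```

Note that the metal and ligand labels share one matrix here because the targets were flattened into a single list.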

### One-hot
Here is the same workflow with one-hot targets as an example, so you can see what the arrays look like.

In [18]:
from nmrcraft.data.dataset import DataLoader

feature_columns = [
    "M_sigma11_ppm",
    "M_sigma22_ppm",
    "M_sigma33_ppm",
    "E_sigma11_ppm",
    "E_sigma22_ppm",
    "E_sigma33_ppm",
]

data_loader = DataLoader(
    feature_columns=feature_columns,
    target_columns="metal_X3",
    test_size=0.2,
    random_state=42,
    dataset_size=0.0003,
    target_type="one-hot",  # can be "categorical" or "one-hot"
    complex_geometry="all",  # can be "oct", "spy", "tbp" or "all"
)

X_train, X_test, y_train, y_test, y_labels = data_loader.load_data()

print(X_test)

[[ 1.0608492  -1.43010797 -1.19944602  1.1432843  -2.5987814  -2.53579661
   1.          0.          1.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          1.          1.          0.          0.
   0.          0.          0.          0.          0.          0.
   1.        ]
 [-0.63463686 -0.40617919  0.1220005   0.26723443 -0.40590405  0.26531819
   1.          0.          1.          0.          0.          0.
   0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          1.          1.
   0.          0.          0.          0.          0.          0.
   0.        ]]


In [19]:
print(y_test)

[[1 1 0 0 0 0 0]
 [0 1 0 0 0 0 0]]


In [20]:
data_loader.binarized_target_decoder(y_test)

[['W', 'Br'], ['Mo', 'Br']]
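
One way to read such a one-hot row is as several binarized blocks placed side by side, one block per target. The sketch below decodes a toy array of that shape with scikit-learn's `LabelBinarizer`. The ligand classes `"Cl"` and `"F"` and the block widths are hypothetical and chosen only for illustration; the real class list and the internals of `binarized_target_decoder()` may differ.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Two classes -> LabelBinarizer uses a single 0/1 column
metal_bin = LabelBinarizer().fit(["Mo", "W"])
# Hypothetical ligand classes -> three indicator columns
ligand_bin = LabelBinarizer().fit(["Br", "Cl", "F"])

# Toy one-hot targets: [metal | ligand block]
y_one_hot = np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 0],
])

# Slice each block out and decode it with its own binarizer
metal_part = y_one_hot[:, :1]
ligand_part = y_one_hot[:, 1:]

decoded = np.column_stack([
    metal_bin.inverse_transform(metal_part),
    ligand_bin.inverse_transform(ligand_part),
]).tolist()

print(decoded)
```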

In [21]:
data_loader.confusion_matrix_data_adapter_one_hot(y_test)

['W', 'Br', 'Mo', 'Br']