In [2]:
from lavaset.lavaset import LAVASET
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

### Load the HNMR metabolomics dataset

Here we load the publicly available MTBLS1 dataset, set the feature matrix `X` and the target vector `y` for classification, and split the data into training and test sets.

In [3]:
mtbls1 = pd.read_csv('example_data/MTBLS1.csv')
X = np.array(mtbls1.iloc[:, 1:])
y = np.array(mtbls1.iloc[:, 0], dtype=np.double)

# Remap the labels to {0, 1} when they are not already zero-based (e.g. {1, 2}).
if 0 not in np.unique(y):
    y = np.where(y == 1, 0, 1).astype(np.double)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=180)
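
The label handling above can also be written so that it works for any pair of class values. The following is a minimal sketch on made-up labels, not part of the LAVASET package; `to_binary_labels` is a hypothetical helper name.

```python
import numpy as np

def to_binary_labels(labels):
    """Map any two-class label vector onto {0.0, 1.0} (hypothetical helper)."""
    # `classes` is the sorted array of distinct labels; `encoded` gives, for
    # each sample, the position of its label inside `classes`.
    classes, encoded = np.unique(labels, return_inverse=True)
    if classes.size != 2:
        raise ValueError(f"expected exactly 2 classes, found {classes.size}")
    return encoded.astype(np.double)

# Made-up labels coded as 1/2: the smaller value becomes 0, the larger 1.
y_demo = to_binary_labels(np.array([1, 2, 2, 1, 2]))
```

With this approach, labels that are already coded as 0/1 are left unchanged.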

### Run LAVASET

Here we instantiate the LAVASET model and call its `knn_calculation` function, which determines the nearest neighbors of each feature, either from a number of neighbors set by the user or from a distance matrix.

In [4]:
model = LAVASET(ntrees=100, n_neigh=10, distance=False, nvartosample='sqrt', nsamtosample=0.5, oobe=True) 
knn = model.knn_calculation(mtbls1.columns[1:], data_type='1D')

- ntrees: number of trees (or estimators) in the ensemble (int)
- n_neigh: number of neighbors used to calculate the latent variable; this excludes the feature selected for the split, so the latent variable is calculated from a total of n_neigh + 1 features (int)
- distance: indicates whether the input for the neighbor calculation is a distance matrix, default is False; if True, then n_neigh should be 0 (boolean)
- nvartosample: the number of features considered at each split; 'sqrt' uses the square root of the total number of features, an int uses that exact number of features (string or int)
- nsamtosample: the number of samples considered for each tree; a float (like 0.5) uses `float * total number of samples`, an int uses that exact number of samples (float or int)
- oobe: whether to calculate the out-of-bag score, default=True (boolean)
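
To make the `nvartosample` and `nsamtosample` options concrete, here is a small sketch of how such settings could translate into counts. `resolve_sizes` is a hypothetical helper, and rounding down is an assumption; LAVASET's internal rounding may differ.

```python
import math

def resolve_sizes(n_features, n_samples, nvartosample="sqrt", nsamtosample=0.5):
    """Turn the sampling options into concrete counts (hypothetical helper)."""
    # 'sqrt' -> integer square root of the feature count (rounding down is an assumption)
    if nvartosample == "sqrt":
        n_vars = max(1, math.isqrt(n_features))
    else:
        n_vars = int(nvartosample)

    # A float is read as a fraction of the samples, an int as an absolute count.
    if isinstance(nsamtosample, float):
        n_sams = max(1, int(nsamtosample * n_samples))
    else:
        n_sams = int(nsamtosample)
    return n_vars, n_sams

# Made-up dataset dimensions: 200 features and 100 samples.
n_vars, n_sams = resolve_sizes(n_features=200, n_samples=100)
```

For these made-up dimensions, 14 features would be considered at each split and 50 samples drawn for each tree.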

If the input to the `knn_calculation` function is a distance matrix then:
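```python
# `distance_matrix` stands for a precomputed array of pairwise feature distances
# (not the scipy function of the same name imported above).
model = LAVASET(ntrees=100, n_neigh=0, distance=True, nvartosample='sqrt', nsamtosample=0.5, oobe=True)

knn = model.knn_calculation(distance_matrix, data_type='distance_matrix')
```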
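
As an illustration of what a distance matrix between features looks like, the sketch below builds one from made-up ppm positions with NumPy broadcasting. The ppm values and the variable names are assumptions for demonstration; whether LAVASET expects exactly this layout is not stated here.

```python
import numpy as np

# Made-up ppm positions for five spectral features.
ppm = np.array([0.50, 0.51, 0.53, 1.20, 1.24])

# Pairwise absolute differences: entry (i, j) is the distance between
# feature i and feature j along the ppm axis.
feature_distances = np.abs(ppm[:, None] - ppm[None, :])
```

The result is a symmetric `(n_features, n_features)` array with zeros on the diagonal.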
If the neighbors need to be calculated from the 1D spectrum ppm values of an HNMR dataset, then the input is the 1D array of ppm values. In this case the model parameters should be set to `distance=False` and `n_neigh=k`, and the `data_type` parameter of `knn_calculation` should be set to `1D`, as shown in the MTBLS1 example above. The available options are:
- 'distance_matrix' is used for distance matrix input
- '1D' is used for 1D data like signals or spectra
- 'VCG' is used for VCG data
- 'other' is used for any other type of data, where the nearest neighbors are calculated from the 2D data input
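
The idea behind the '1D' option can be sketched as follows: for each feature, pick the `n_neigh` features closest to it on the ppm axis. This is an illustrative approximation with a hypothetical helper name and made-up ppm values, not the code LAVASET runs internally.

```python
import numpy as np

def nearest_feature_indices(ppm, n_neigh):
    """Indices of the n_neigh closest features for every feature (hypothetical helper)."""
    ppm = np.asarray(ppm, dtype=float)
    dist = np.abs(ppm[:, None] - ppm[None, :])
    # A feature is not its own neighbor, so push the diagonal to infinity.
    np.fill_diagonal(dist, np.inf)
    # Sort each row by distance and keep the first n_neigh column indices.
    order = np.argsort(dist, axis=1, kind="stable")
    return order[:, :n_neigh]

# Made-up ppm positions for five spectral features.
ppm_demo = np.array([0.50, 0.51, 0.53, 1.20, 1.24])
neighbors_demo = nearest_feature_indices(ppm_demo, n_neigh=2)
```

Each row of `neighbors_demo` lists the neighbors of one feature, so the feature itself plus its row gives the n_neigh + 1 features described for the latent variable.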

Here we fit LAVASET to the training data by calling the LAVASET-specific fitting function, `fit_lavaset`.

In [5]:
lavaset = model.fit_lavaset(X_train, y_train, knn, random_state=5)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.


building tree 3
building tree 0
building tree 2
building tree 9
building tree 4
building tree 5
building tree 11
building tree 6
building tree 7
building tree 13
building tree 8
building tree 12
building tree 1
building tree 10
building tree 14
building tree 15
building tree 16
building tree 17
building tree 18
building tree 19
building tree 20
building tree 21
building tree 22
building tree 23
building tree 24
building tree 25


[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    5.6s


building tree 26
building tree 27
building tree 28
building tree 29
building tree 30
building tree 31
building tree 32
building tree 33
building tree 34


[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    8.0s


building tree 35
building tree 36
building tree 37
building tree 38
building tree 39
building tree 40
building tree 41
building tree 42
building tree 43
building tree 44
building tree 45


[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:    9.3s


building tree 46
building tree 47
building tree 48
building tree 49
building tree 50
building tree 51
building tree 52
building tree 53
building tree 54
building tree 55


[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:   11.9s


building tree 56
building tree 57
building tree 58
building tree 59
building tree 60
building tree 61
building tree 62
building tree 63
building tree 64
building tree 65
building tree 66
building tree 67
building tree 68


[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   14.3s


building tree 69
building tree 70
building tree 71
building tree 72
building tree 73
building tree 74
building tree 75
building tree 76
building tree 77
building tree 78
building tree 79
building tree 80


[Parallel(n_jobs=-1)]: Done  66 tasks      | elapsed:   16.9s


building tree 81
building tree 82
building tree 83
building tree 84
building tree 85
building tree 86
building tree 87
building tree 88
building tree 89
building tree 90
building tree 91
building tree 92
building tree 93
building tree 94


[Parallel(n_jobs=-1)]: Done  80 out of 100 | elapsed:   19.7s remaining:    4.9s


building tree 95
building tree 96
building tree 97
building tree 98
building tree 99


[Parallel(n_jobs=-1)]: Done  91 out of 100 | elapsed:   21.3s remaining:    2.1s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   22.4s finished


Here we predict the test data with the LAVASET-specific predict function. The output of `predict_lavaset` consists of three items: the y predictions, the votes from each tree, and the out-of-bag (oobe) score.

In [6]:
y_preds, votes, oobe = model.predict_lavaset(X_test, lavaset)
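
To illustrate how per-tree votes relate to a final prediction in a tree ensemble, here is a minimal majority-vote sketch. The `(n_trees, n_samples)` layout and the 0/1 coding of `votes_demo` are assumptions made for this example; the actual structure of `votes` returned by `predict_lavaset` may differ.

```python
import numpy as np

# Made-up votes from 3 trees (rows) for 3 samples (columns), coded 0/1.
votes_demo = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])

# Fraction of trees voting for class 1, per sample.
vote_share = votes_demo.mean(axis=0)

# A sample is assigned class 1 when more than half of the trees vote for it.
majority = (vote_share > 0.5).astype(int)
```

The vote share can also serve as a rough confidence score for each prediction.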

Below is an example of different metrics that can be called via the `sklearn` package for testing the performance of LAVASET.

In [7]:
accuracy = accuracy_score(y_test, np.array(y_preds, dtype=int))
precision = precision_score(y_test, np.array(y_preds, dtype=int))
recall = recall_score(y_test, np.array(y_preds, dtype=int))
f1 = f1_score(y_test, np.array(y_preds, dtype=int))
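
For reference, the four metrics can be worked out by hand from the confusion counts. The sketch below uses made-up labels and predictions, with class 1 as the positive class, which is the default for binary targets in `sklearn`.

```python
import numpy as np

# Made-up ground truth and predictions for 8 samples.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # true positives
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))  # true negatives
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # false positives
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # false negatives

accuracy_demo = (tp + tn) / (tp + tn + fp + fn)
precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
```

Precision and recall are undefined when their denominators are zero, a case this sketch does not handle.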

Finally, to get the feature importance calculations from LAVASET, call the `feature_evaluation` function with the data on which the model has been trained, here `X_train`, and the fitted model, here `lavaset` as defined above. The output is a numpy array consisting of 3 vectors: the first holds the number of times each feature (at the given index) is considered for a split, the second holds the number of times it is selected for a split, and the third holds the gini value.

In [10]:
feature_importance_df = pd.DataFrame(model.feature_evaluation(X_train, lavaset), columns=['Times Considered', 'Times Selected', 'Gini'])
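
One way to read this output is to rank features by their gini value and to compare how often each feature was selected against how often it was considered. The sketch below uses a made-up array with one row per feature and the same three columns; the numbers are illustrative only.

```python
import numpy as np

# Made-up importance table: columns are
# [times considered, times selected, gini], one row per feature.
importance_demo = np.array([
    [12.0, 3.0, 0.10],
    [15.0, 9.0, 0.42],
    [10.0, 0.0, 0.00],
    [14.0, 6.0, 0.27],
])

# Indices of the two features with the largest gini value, best first.
top_features = np.argsort(importance_demo[:, 2])[::-1][:2]

# Share of the considered splits in which each feature was actually selected.
selection_rate = importance_demo[:, 1] / importance_demo[:, 0]
```

With the DataFrame built above, `feature_importance_df.sort_values('Gini', ascending=False)` gives the same ranking.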