In [None]:
from lavaset.lavaset import LAVASET
from lavaset.lavaset_clifi import LAVASET_CLIFI
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

### Load the HNMR metabolomics dataset

Here we load the publicly available MTBLS1 dataset, set the feature matrix `X` and the class labels `y` for classification, and split the data into training and test sets.

In [None]:
mtbls1 = pd.read_csv('example_data/MTBLS1.csv')
X = np.array(mtbls1.iloc[:, 1:])
y = np.array(mtbls1.iloc[:, 0], dtype=np.double)

# Recode the labels to 0/1 only if they are not already zero-based (e.g. coded as 1/2).
if 0 not in np.unique(y):
    y = np.where(y == 1, 0, 1).astype(np.double)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=180)
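
The recoding step above handles a two-class problem whose labels do not start at 0. As a more general sketch, shown on a toy label vector rather than on MTBLS1, `np.unique(..., return_inverse=True)` maps any set of labels to consecutive integers starting at 0. The variable names `y_raw`, `classes` and `y_encoded` are illustrative only.

```python
import numpy as np

# Toy labels coded as 1/2 (illustrative values, not taken from MTBLS1).
y_raw = np.array([1, 2, 2, 1, 2], dtype=np.double)

# `classes` holds the sorted unique labels; `y_encoded` holds, for each sample,
# the position of its label in `classes`, i.e. an integer in 0..n_classes-1.
classes, y_encoded = np.unique(y_raw, return_inverse=True)
y_encoded = y_encoded.astype(np.double)  # same dtype the notebook uses for y

print(classes)    # [1. 2.]
print(y_encoded)  # [0. 1. 1. 0. 1.]
```

Keeping `classes` around makes it possible to translate predictions back to the original labels with `classes[predictions.astype(int)]`.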

### Run LAVASET or LAVASET_CLIFI

Here we instantiate the LAVASET model and call its `knn_calculation` function, which calculates the nearest neighbors of each feature, based either on a number of neighbors set by the user or on a distance matrix.

In [None]:

model = LAVASET(ntrees=100, n_neigh=10, distance=False, nvartosample='sqrt', nsamtosample=0.5, oobe=True) 
model_clifi = LAVASET_CLIFI(ntrees=100, n_neigh=10, distance=False, nvartosample='sqrt', nsamtosample=0.5, oobe=True)
knn = model.knn_calculation(mtbls1.columns[1:], data_type='1D')

- ntrees: number of trees (or estimators) in the ensemble (int)
- n_neigh: number of neighbors used to calculate the latent variable; this excludes the feature selected for the split, so the latent variable is calculated from a total of n_neigh + 1 features (int)
- distance: indicates whether the input for the neighbor calculation is a distance matrix, default is False; if True, then n_neigh should be 0 (boolean)
- nvartosample: the number of features picked for each split; 'sqrt' indicates the square root of the total number of features, an int sets that specific number of features (string or int)
- nsamtosample: the number of samples to consider for each tree; a float (like 0.5) means `float * total number of samples`, an int sets that specific number of samples (float or int)
- oobe: whether to calculate the out-of-bag score, default is True (boolean)
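
To make the size parameters concrete, the sketch below turns `nvartosample`, `nsamtosample` and `n_neigh` into actual counts for an illustrative dataset. The helpers `features_per_split` and `samples_per_tree` are hypothetical and not part of LAVASET, and rounding down is an assumption, since the rounding rule the library applies is not stated here.

```python
import math

def features_per_split(n_features, nvartosample):
    """Hypothetical helper: number of candidate features per split."""
    if nvartosample == 'sqrt':
        return math.isqrt(n_features)  # assumption: square root rounded down
    return int(nvartosample)

def samples_per_tree(n_samples, nsamtosample):
    """Hypothetical helper: number of samples drawn for each tree."""
    if isinstance(nsamtosample, float):
        return int(nsamtosample * n_samples)  # assumption: fraction truncated
    return int(nsamtosample)

# Illustrative sizes, not the dimensions of MTBLS1.
n_samples, n_features, n_neigh = 200, 400, 10

print(features_per_split(n_features, 'sqrt'))  # 20
print(samples_per_tree(n_samples, 0.5))        # 100
print(n_neigh + 1)                             # 11 features per latent variable
```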

If the input to the `knn_calculation` function is a distance matrix then:
```python
model = LAVASET(ntrees=100, n_neigh=0, distance=True, nvartosample='sqrt', nsamtosample=0.5, oobe=True) 

knn = model.knn_calculation(distance_matrix, data_type='distance_matrix')
```
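
As a minimal sketch of where such a matrix could come from, the example below builds pairwise Euclidean distances with `scipy.spatial.distance_matrix`, the function imported at the top of this notebook. The coordinates are toy values, and treating the matrix as feature-by-feature is an assumption about what `knn_calculation` expects.

```python
import numpy as np
from scipy.spatial import distance_matrix

# Toy 2D coordinates for four features (illustrative values).
coords = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])

# Pairwise Euclidean distances between all rows of `coords`. The result is
# stored under a different name so that the imported `distance_matrix`
# function is not overwritten.
dist = distance_matrix(coords, coords)

print(dist.shape)  # (4, 4)
print(dist[0, 1])  # 5.0
```

Note that passing the bare name `distance_matrix` to `knn_calculation` would pass the SciPy function itself if that import is still in scope, so the computed array should live under its own variable name.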
If the neighbors need to be calculated from the 1D spectrum ppm values of an HNMR dataset, then the input is the 1D array of ppm values. In this case the model parameters should be set to `distance=False` and `n_neigh=k`, and the `data_type` parameter of `knn_calculation` should be set to `1D`, as in the MTBLS1 example above. The available `data_type` options are:
- 'distance_matrix' is used for distance matrix input
- '1D' is used for 1D data like signals or spectra
- 'VCG' is used for VCG data
- 'other' is used for any other type of data, where the nearest neighbors are calculated from the 2D data input
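
To illustrate what nearest neighbors mean for 1D data, the sketch below finds the `k` closest positions to each point on a toy axis using NumPy. It only illustrates the concept: `nearest_neighbors_1d` is a hypothetical helper, and this is not claimed to be how `knn_calculation` is implemented.

```python
import numpy as np

def nearest_neighbors_1d(positions, k):
    """Hypothetical helper: indices of the k closest positions to each point."""
    positions = np.asarray(positions, dtype=float)
    # Absolute pairwise distances along the 1D axis.
    dist = np.abs(positions[:, None] - positions[None, :])
    # A point is never its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Sort each row by distance and keep the first k column indices.
    return np.argsort(dist, axis=1, kind='stable')[:, :k]

# Toy positions standing in for ppm values (illustrative, unevenly spaced).
positions = [0.0, 1.0, 3.0, 7.0, 15.0]
neighbors = nearest_neighbors_1d(positions, k=2)
print(neighbors.tolist())  # [[1, 2], [0, 2], [1, 0], [2, 1], [3, 2]]
```

Each row lists the neighbors of one feature, closest first, and never contains the feature itself, which matches the description of `n_neigh` as excluding the feature selected for the split.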

Next, we fit LAVASET to the training data with the LAVASET-specific fitting function, `fit_lavaset`.

In [None]:
lavaset = model.fit_lavaset(X_train, y_train, knn, random_state=5)
lavaset_clifi = model_clifi.fit_lavaset(X_train, y_train, knn, random_state=5)

Here we predict the test data with the LAVASET-specific predict function. The output of `predict_lavaset` consists of three items: the y predictions, the votes for each tree, and the out-of-bag (oobe) score.

In [None]:
y_preds, votes, oobe = model.predict_lavaset(X_test, lavaset)
y_preds_clifi, votes_clifi, oobe_clifi = model_clifi.predict_lavaset(X_test, lavaset_clifi)
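
The sketch below shows how per-tree votes are typically combined into a prediction by majority vote in a tree ensemble. It uses a toy array and assumes a binary problem with votes laid out as samples by trees; the actual layout of `votes` returned by `predict_lavaset` is not specified here and may differ.

```python
import numpy as np

# Toy votes: 4 samples (rows) by 3 trees (columns), each entry a predicted
# class label. This layout is an assumption made for illustration only.
votes_toy = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
])

# Fraction of trees voting for class 1, per sample.
fraction_class_1 = votes_toy.mean(axis=1)

# Majority vote: predict class 1 when more than half of the trees agree.
y_majority = (fraction_class_1 > 0.5).astype(int)
print(y_majority)  # [1 0 1 0]
```

The vote fraction can also serve as a rough confidence score for each prediction.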

Below is an example of performance metrics from the `sklearn` package that can be used to evaluate LAVASET. The same metrics can be used for LAVASET_CLIFI.

In [None]:
accuracy = accuracy_score(y_test, np.array(y_preds, dtype=int))
precision = precision_score(y_test, np.array(y_preds, dtype=int))
recall = recall_score(y_test, np.array(y_preds, dtype=int))
f1 = f1_score(y_test, np.array(y_preds, dtype=int))

print('accuracy:', accuracy, 'precision:', precision, 'recall:', recall, 'f1-score:', f1)
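
For reference, the four metrics above can be written out from the confusion-matrix counts. The sketch below does this on toy labels (not MTBLS1 results), treating label 1 as the positive class, which is also the default (`pos_label=1`) in the `sklearn` binary metrics.

```python
import numpy as np

# Toy ground truth and predictions (illustrative values only).
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

# Confusion-matrix counts with label 1 as the positive class.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)         # 3 1 1 1
print(precision, recall, f1)  # 0.75 0.75 0.75
```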

Finally, to get the feature importance calculations from LAVASET, call the `feature_evaluation` function with the data on which the model was trained, here `X_train`, and the fitted model, here `lavaset` as defined above. The output is a numpy array consisting of 3 vectors: the first gives the number of times each feature (at that index) was considered for a split, the second the number of times it was selected for a split, and the third its gini value.

In [None]:
feature_importance_df = pd.DataFrame(model.feature_evaluation(X_train, lavaset), columns=['Times Considered', 'Times Selected', 'Gini'])
feature_importance_df
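
A table like this is easier to read once it is labeled with feature names and sorted. The sketch below does that on a small synthetic array with the same three columns; the numbers and feature names are made up, and it assumes that the rows follow the column order of the training data, so that in the notebook `mtbls1.columns[1:]` could serve as the index.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the output of `feature_evaluation` (made-up values).
feature_names = ['9.98', '9.94', '9.90', '9.86']  # hypothetical ppm labels
importance = np.array([
    [12, 3, 0.10],
    [15, 7, 0.42],
    [9, 1, 0.02],
    [14, 5, 0.31],
])

toy_importance_df = pd.DataFrame(
    importance,
    columns=['Times Considered', 'Times Selected', 'Gini'],
    index=feature_names,
)

# How often a feature was selected when it was a candidate for a split.
toy_importance_df['Selection Rate'] = (
    toy_importance_df['Times Selected'] / toy_importance_df['Times Considered']
)

# Rank features from the highest to the lowest gini value.
ranked = toy_importance_df.sort_values('Gini', ascending=False)
print(ranked.index.tolist())  # ['9.94', '9.86', '9.98', '9.90']
```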

To get the feature importance calculations from LAVASET_CLIFI, call the `class_feature_evaluation` function with the data on which the model was trained, here `X_train`, the fitted model, here `lavaset_clifi` as defined above, and the number of classes.

In [None]:
n_classes = len(np.unique(y))  # number of classes in the dataset
features_importances = model_clifi.class_feature_evaluation(X_train, lavaset_clifi, n_classes)

features_importances = model_clifi.class_feature_distribution_evaluation(X_train, lavaset_clifi, n_classes)
features_importances = pd.DataFrame(features_importances)
