# Bambu: Google Colab Tutorial

In this tutorial we explore the main features of the Bambu QSAR command-line tool: dependency installation, data download, feature computation, model training and validation, and inference.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/omixlab/bambu-v2/blob/main/notebooks/Bambu%20Google%20Colab%20Tutorial.ipynb)

## Installing RDKit and Mordred

Molecular descriptors can be computed with Mordred, which is built on the RDKit library, so both tools must be installed. In Google Colab, they can be installed with the following commands.

In [None]:
# run this cell only if you are using this notebook in Google Colab
!pip install kora -q
import kora.install.rdkit

In [None]:
!pip install mordred

## Installing mol2vec

Bambu can also use the mol2vec algorithm to compute molecular vectors, which can be used as features for machine learning tasks.

In [None]:
!pip install git+https://github.com/samoturk/mol2vec

## Installing Bambu

After installing the dependencies, Bambu can be installed from PyPI using `pip`.

In [None]:
!pip install bambu-qsar==0.0.12

## Downloading data from a PubChem BioAssay

Datasets from the PubChem BioAssay database can be downloaded with the `bambu-download` command.

In [None]:
!bambu-download \
  --pubchem-assay-id 29 \
  --output 29_raw.csv
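
Before computing features, it can help to look at what was downloaded. The sketch below uses only the Python standard library and makes no assumption about the column names that `bambu-download` writes. `preview_csv` is a hypothetical helper for illustration, not part of Bambu.

```python
import csv
from itertools import islice


def preview_csv(path, n_rows=5):
    """Return the header and the first `n_rows` data rows of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no lines
        rows = list(islice(reader, n_rows))
    return header, rows


# Usage in the notebook (assumes the download cell above has been run):
# header, rows = preview_csv("29_raw.csv")
# print(header)
# for row in rows:
#     print(row)
```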

## Computing vectors for the downloaded molecules

To compute vectors with the `mol2vec` method we need a pre-trained model, which can be generated with [the library developed by samoturk](https://github.com/samoturk/mol2vec/). Here we download the pre-trained 300-dimensional model from that repository.

In [None]:
!curl -L -o mol2vec.pickle "https://github.com/samoturk/mol2vec/blob/master/examples/models/model_300dim.pkl?raw=true"

In [None]:
!bambu-preprocess --input 29_raw.csv \
    --train-test-split-percent 0.75 \
    --feature-type mol2vec \
    --undersample \
    --mol2vec-model-path mol2vec.pickle \
    --output 29_preprocessed.csv \
    --output-preprocessor 29_descriptor_preprocessor.pickle
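
The `--undersample` and `--train-test-split-percent 0.75` flags suggest class balancing and a 75/25 train/test split. The sketch below illustrates those two ideas on toy labels using plain Python. It is not Bambu's implementation: the function names, the order of the two steps, and the toy data are all assumptions made for illustration.

```python
import random


def undersample(labels, seed=0):
    """Return sorted row indices with every class reduced to the minority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    smallest = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, smallest))
    return sorted(kept)


def train_test_split_indices(indices, train_fraction=0.75, seed=0):
    """Shuffle `indices` and split them into a train part and a test part."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy labels (hypothetical): 2 actives among 10 molecules.
labels = ["active", "inactive", "inactive", "inactive", "active",
          "inactive", "inactive", "inactive", "inactive", "inactive"]
balanced = undersample(labels)
train, test = train_test_split_indices(balanced, train_fraction=0.75)
print(len(balanced), len(train), len(test))
```

Undersampling discards majority-class rows, so the balanced set can be much smaller than the raw download when actives are rare.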

## Training a predictive model 

Now let's train an Extra Trees classifier on the computed features, using the training split written by the preprocessing step.

In [None]:
!bambu-train \
	--input-train 29_preprocessed_train.csv \
	--output 29_model.pickle \
	--model-history \
	--time-budget 3600 \
	--estimators extra_tree

## Model Validation

The trained model can be validated with the `bambu-validate` command, which uses a y-randomization method to compute classification performance scores and their statistical significance.

In [None]:
!bambu-validate \
	--input-train 29_preprocessed_train.csv \
	--input-test 29_preprocessed_test.csv \
	--model 29_model.pickle \
	--output validation.json \
	--randomizations 100
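
To make the idea of y-randomization concrete, here is a minimal sketch of one common formulation: retrain the model on shuffled training labels many times and count how often the shuffled run scores at least as well as the run with the true labels. It is an illustration under stated assumptions, not the code behind `bambu-validate`. The function names, the 1-nearest-neighbour toy classifier, and the toy data are hypothetical.

```python
import random


def y_randomization_pvalue(fit_and_score, y_train, n_randomizations=100, seed=0):
    """Permutation p-value for the score obtained with the true training labels.

    `fit_and_score(labels)` must train a fresh model on `labels` and return a
    score on the untouched test set (higher is better).
    """
    rng = random.Random(seed)
    observed = fit_and_score(list(y_train))
    at_least_as_good = 0
    for _ in range(n_randomizations):
        shuffled = list(y_train)
        rng.shuffle(shuffled)  # break the link between features and labels
        if fit_and_score(shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed run itself, so the p-value is never 0.
    p_value = (at_least_as_good + 1) / (n_randomizations + 1)
    return observed, p_value


# Toy data (hypothetical): one feature that separates the two classes.
x_train = [0.1, 0.2, 0.3, 0.4, 0.5, 1.5, 1.6, 1.7, 1.8, 1.9]
y_train = [0] * 5 + [1] * 5
x_test = [0.12, 0.22, 0.32, 0.42, 0.52, 1.48, 1.58, 1.68, 1.78, 1.88]
y_test = [0] * 5 + [1] * 5


def fit_and_score(labels):
    """1-nearest-neighbour classifier on one feature; returns test accuracy."""
    correct = 0
    for x, truth in zip(x_test, y_test):
        nearest = min(range(len(x_train)), key=lambda i: abs(x - x_train[i]))
        correct += labels[nearest] == truth
    return correct / len(y_test)


observed, p_value = y_randomization_pvalue(fit_and_score, y_train, n_randomizations=100)
print(observed, p_value)
```

A small p-value means that models trained on shuffled labels rarely match the real model, which suggests the real model has learned a genuine relationship between features and activity.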

In [None]:
import json

import pandas as pd

# Load the scores and p-values written by bambu-validate
with open("validation.json") as reader:
    validation_results = json.load(reader)

# Collect one row per metric; DataFrame.append was removed in pandas 2.0
metrics = ["accuracy", "recall", "precision", "f1", "roc_auc"]
rows = [
    {
        "metric": metric,
        "value": validation_results["raw_scores"][metric][0],
        "p-value": validation_results["pvalues"][metric][0],
    }
    for metric in metrics
]

df = pd.DataFrame(rows, columns=["metric", "value", "p-value"])
df

## Using the model to analyze new molecules

To use the trained model, we can pass a file (`.sdf`, `.mol2` or `.smiles`) containing multiple molecules to be analyzed.

In [None]:
!wget -O pubchem_sample.sdf.gz \
  https://github.com/omixlab/bambu-v2/raw/main/tests/pubchem_sample.sdf.gz

!gzip -d -f pubchem_sample.sdf.gz

!bambu-predict \
        --input pubchem_sample.sdf \
        --preprocessor 29_descriptor_preprocessor.pickle \
        --model 29_model.pickle \
        --output 29_predictions.csv
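
As a quick sanity check on the input, the number of molecules in the SDF file can be counted directly: in the SDF format, each record ends with a line containing `$$$$`. The helper below is a hypothetical illustration in plain Python and is not part of Bambu.

```python
def count_sdf_records(path):
    """Count molecules in an SDF file by counting `$$$$` record terminators."""
    count = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip() == "$$$$":
                count += 1
    return count


# Usage in the notebook (assumes the download cell above has been run):
# print(count_sdf_records("pubchem_sample.sdf"))
```

Comparing this count with the number of rows in `29_predictions.csv` can reveal molecules that were skipped, assuming the output holds one row per molecule.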