<a href="https://colab.research.google.com/github/pszemraj/ml4hc-s22-project01/blob/main/notebooks/colab/tabular_classification_LF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch Lightning-Flash: Test Various Tabular Classification models

_heavily modified / adapted from the Titanic classification tutorial in the Lightning-Flash docs_

---

  - [LF Github](https://www.github.com/PytorchLightning/pytorch-lightning/)
  - Check out [Flash documentation](https://lightning-flash.readthedocs.io/en/latest/)
  - Check out [Lightning documentation](https://pytorch-lightning.readthedocs.io/en/latest/)

---

In [1]:
#@title Print GPU info
#@markdown This shows the Colab-allocated GPU. If the command below fails, no
#@markdown GPU is attached: open the "Runtime" menu at the top of Colab and change the runtime type to GPU.


!nvidia-smi

Sat Mar 26 00:09:54 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# setup

In [2]:
#@markdown add auto-Colab formatting with `IPython.display`
from IPython.display import HTML, display
# colab formatting
def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )

get_ipython().events.register("pre_run_cell", set_css)

In [3]:
#@title mount drive, define root folder
from google.colab import drive
from pathlib import Path
drive_base_str = '/content/gdrive'
drive.mount(drive_base_str)


Mounted at /content/gdrive


In [5]:
drive_head_dir = Path(drive_base_str)

root_dir = "/content/gdrive/MyDrive/ETHZ-2022-S/ML-healthcare-projects/project1/lightning-flash-models" #@param {type:"string"}
root_dir = Path(root_dir)
if not root_dir.exists():
    print(f"{root_dir.resolve()} does not exist, creating generic folder in drive root")
    root_dir = drive_head_dir / "lf-tabular-classifier"
    root_dir.mkdir(exist_ok=True)

In [8]:
#@title define key nn parameters for training
import torch
NUM_EPOCHS = 40 #@param {type:"integer"}
BATCH_SIZE = 32 #@param {type:"integer"}
VAL_SPLIT = 0.25 #@param {type:"number"}
TRAIN_FP16 = True #@param {type:"boolean"}

if not torch.cuda.is_available():
    print("cuda not available, setting var TRAIN_FP16 to False.")
    TRAIN_FP16=False

## install

In [6]:
# %%capture
! pip install 'git+https://github.com/PyTorchLightning/lightning-flash.git#egg=lightning-flash[tabular]' -q

In [7]:
#@title define source data parameters

#@markdown - these can also be loaded from gdrive, but I am lazy and `wget` does not require login

mitbih_train_url = "https://www.dropbox.com/s/2ks8s82tm7jvhse/torchfmt_mitbih_train.csv?dl=1" #@param {type:"string"}
mitbih_train_filename = "mitbih_train.csv" #@param {type:"string"}
mitbih_test_url = "https://www.dropbox.com/s/nbaxenoehvqmqnm/torchfmt_mitbih_test.csv?dl=1" #@param {type:"string"}
mitbih_test_filename = "mitbih_test.csv" #@param {type:"string"}

In [9]:
from torchmetrics.classification import Accuracy, Precision, Recall

import flash
from flash.core.data.utils import download_data
from flash.tabular import TabularClassifier, TabularClassificationData


###  1. Download the data
The data are downloaded from the URLs defined above and saved in the current working directory under the filenames defined above.

In [10]:

!wget $mitbih_train_url -O $mitbih_train_filename
!wget $mitbih_test_url -O $mitbih_test_filename

--2022-03-26 00:17:06--  https://www.dropbox.com/s/2ks8s82tm7jvhse/torchfmt_mitbih_train.csv?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:6022:18::a27d:4212
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/2ks8s82tm7jvhse/torchfmt_mitbih_train.csv [following]
--2022-03-26 00:17:07--  https://www.dropbox.com/s/dl/2ks8s82tm7jvhse/torchfmt_mitbih_train.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucd7d293ab31cd8067dab0c7799b.dl.dropboxusercontent.com/cd/0/get/BiLMl2tu-UTxPoSyk8efnMTJ7-mu0D0Tlx0pccOOVS0x7ZdQ_OBRs83A2OSClnt_EaEuC81xuYDvA10X21GxJPQ197knCQZ-XW47hWhJthyPKDHxQx-qj2EqdhPe58ZBagA65SXTwjC8KkRq8spB0-QwEeCvXC3mQ8-jXU0KGp_xzg/file?dl=1# [following]
--2022-03-26 00:17:07--  https://ucd7d293ab31cd8067dab0c7799b.dl.dropboxusercontent.com/cd/0/get/BiLMl2tu-UTxPoSyk8efnMTJ
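
For runs outside Colab, where `wget` may not be available, the same download step can be sketched with the standard library. The helper name `fetch_if_missing` is hypothetical and is not part of this notebook or of Flash; it skips the download when a non-empty target file already exists.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_if_missing(url: str, filename: str) -> Path:
    """Download `url` to `filename` unless a non-empty file is already there."""
    target = Path(filename)
    if target.exists() and target.stat().st_size > 0:
        # reuse the cached copy instead of downloading again
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))  # blocking download to the local path
    return target
```

In this notebook it would presumably be called as `fetch_if_missing(mitbih_train_url, mitbih_train_filename)`, and likewise for the test file.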

###  2. Load the data
Flash Tasks have built-in DataModules that you can use to organize your data. Pass in the train, validation, and test files and Flash will take care of the rest.

`TabularClassificationData` relies on [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). The signature of its `from_csv` constructor is:

```
def from_csv(
    cls,
    categorical_fields: Optional[Union[str, List[str]]] = None,
    numerical_fields: Optional[Union[str, List[str]]] = None,
    target_fields: Optional[Union[str, List[str]]] = None,
    parameters: Optional[Dict[str, Any]] = None,
    train_file: Optional[str] = None,
    val_file: Optional[str] = None,
    test_file: Optional[str] = None,
    predict_file: Optional[str] = None,
    train_transform: INPUT_TRANSFORM_TYPE = InputTransform,
    val_transform: INPUT_TRANSFORM_TYPE = InputTransform,
    test_transform: INPUT_TRANSFORM_TYPE = InputTransform,
    predict_transform: INPUT_TRANSFORM_TYPE = InputTransform,
    target_formatter: Optional[TargetFormatter] = None,
    input_cls: Type[Input] = TabularClassificationCSVInput,
    transform_kwargs: Optional[Dict] = None,
    **data_module_kwargs: Any,
) -> "TabularClassificationData":
```

In [11]:
import pandas as pd
example_df = pd.read_csv(mitbih_train_filename)
data_cols = list(example_df.columns)
_target = data_cols[-1]
data_cols.pop()
_predictors = data_cols # all other columns are numerical predictors

print(f"the target colname is {_target} and\nthe predictor colnames 5 of {len(_predictors)} are {_predictors[:5]}")

the target colname is class_label and
the predictor colnames 5 of 187 are ['feat_0', 'feat_1', 'feat_2', 'feat_3', 'feat_4']


In [12]:
datamodule = TabularClassificationData.from_csv(
    numerical_fields=_predictors,
    target_fields=_target,
    train_file=mitbih_train_filename,
    test_file=mitbih_test_filename,
    val_split=VAL_SPLIT,
    batch_size=BATCH_SIZE,
)
print(f"found {datamodule.num_classes} classes in predict col")

found 5 classes in predict col
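
As a rough illustration of what `val_split=0.25` means, the sketch below partitions row indices into a training and a validation subset. It is a stand-in written for illustration, not Flash's actual splitting code; the function name `split_indices`, the seed, and the row count are assumptions.

```python
import random
from typing import List, Tuple


def split_indices(n_rows: int, val_split: float, seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and hold out a `val_split` fraction for validation."""
    if not 0.0 < val_split < 1.0:
        raise ValueError("val_split must be strictly between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, leaves global state untouched
    n_val = int(n_rows * val_split)
    return indices[n_val:], indices[:n_val]


# hypothetical dataset of 1000 rows
train_idx, val_idx = split_indices(n_rows=1000, val_split=0.25)
print(len(train_idx), len(val_idx))  # 750 250
```

Every row lands in exactly one of the two subsets, so validation scores are computed on rows the model never trained on.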




import metric objects

In [13]:
# metrics
import torchmetrics
metric_f1 = torchmetrics.F1(datamodule.num_classes)
metric_CK = torchmetrics.CohenKappa(datamodule.num_classes)
metric_matthewscorr = torchmetrics.MatthewsCorrcoef(datamodule.num_classes)
metric_rocAUC = torchmetrics.AUROC(num_classes=datamodule.num_classes)
my_metrics = [
    Accuracy(),
    metric_f1,
    metric_matthewscorr,
    metric_CK,
    metric_rocAUC,
]  # accuracy is ~useless in imbalanced class problem
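
To see why accuracy alone is a poor guide on imbalanced classes, consider the plain-Python sketch below. The labels are made up, and the two helper functions are simplified stand-ins for illustration, not the torchmetrics implementations.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[int], y_pred: List[int], num_classes: int) -> float:
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# made-up labels: 90 samples of class 0, 10 of class 1
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a model that always predicts the majority class

print(accuracy(y_true, y_pred))                           # 0.9
print(round(macro_f1(y_true, y_pred, num_classes=2), 3))  # 0.474
```

The majority-class predictor reaches 90% accuracy while never detecting the minority class, which macro-averaged F1 exposes. Micro-averaged F1 equals accuracy for single-label multiclass problems, so it is worth checking the `average` argument of the F1 metric in your installed torchmetrics version.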



setup logging 

In [14]:
from pytorch_lightning.loggers import CSVLogger  # noqa: E402

log_dir = root_dir / "logs"
log_dir.mkdir(exist_ok=True)
logger = CSVLogger(save_dir=str(log_dir.resolve()))

###  3. Build the model

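Note: Categorical columns are mapped to an embedding space, that is, a set of trainable tensors, one associated with each categorical column. All predictors in this dataset are passed as numerical fields, so this only matters if you add categorical columns.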
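
To make the note concrete, here is a toy sketch of an embedding lookup in plain Python. It is illustrative only: the column values, the embedding size, and the helper `build_embedding_table` are made up, and real backbones learn these vectors during training rather than leaving them at their random initial values.

```python
import random
from typing import Dict, List


def build_embedding_table(categories: List[str], dim: int, seed: int = 0) -> Dict[str, List[float]]:
    """Assign each distinct category a small random vector (its initial embedding)."""
    rng = random.Random(seed)
    return {cat: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for cat in sorted(set(categories))}


# hypothetical categorical column
column = ["red", "green", "red", "blue"]
table = build_embedding_table(column, dim=3)

# each row's category is replaced by its vector; equal categories share one vector
embedded = [table[value] for value in column]
print(len(table), len(embedded), embedded[0] == embedded[2])  # 3 4 True
```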

In [15]:
import pprint as pp
backbones = TabularClassifier.available_backbones()
print("available model backbones for tabular as follows:\n")
pp.pprint(backbones)

available model backbones for tabular as follows:

['autoint',
 'category_embedding',
 'fttransformer',
 'node',
 'tabnet',
 'tabtransformer']


In [19]:
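m_name = 'tabtransformer'
# note: `my_metrics` defined above is not passed here, so only the default
# accuracy is logged; adding `metrics=my_metrics` should enable the others
model = TabularClassifier.from_data(
    datamodule,
    backbone=m_name,
)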

Using 'tabtransformer' provided by manujosephv/PyTorch Tabular (https://github.com/manujosephv/pytorch_tabular).


# Training

###  4. Create the trainer

- uses key training params defined above

In [20]:
trainer = flash.Trainer(
    max_epochs=NUM_EPOCHS,
    gpus=torch.cuda.device_count(),
    auto_lr_find=True,
    precision=16 if TRAIN_FP16 else 32,
    logger=logger,
)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.


###  5. Train the model

In [21]:
trainer.fit(model, datamodule=datamodule)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                  | Params
--------------------------------------------------------
0 | train_metrics | ModuleDict            | 0     
1 | val_metrics   | ModuleDict            | 0     
2 | test_metrics  | ModuleDict            | 0     
3 | adapter       | PytorchTabularAdapter | 305 K 
--------------------------------------------------------
305 K     Trainable params
0         Non-trainable params
305 K     Total params
1.223     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

validation

In [22]:
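# validate results using the best checkpoint
# (stored under a new name so the `my_metrics` list defined earlier is not overwritten)
val_results = trainer.validate(
    ckpt_path="best",
    datamodule=datamodule,
    verbose=True,
)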




LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Validating: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 VALIDATE RESULTS
{'valid_accuracy': 0.9803088307380676, 'valid_loss': 0.11970166116952896}
--------------------------------------------------------------------------------


###  6. Test model

In [23]:
trainer.test(model, datamodule=datamodule)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'valid_accuracy': 0.9789877533912659, 'valid_loss': 0.13964278995990753}
--------------------------------------------------------------------------------


[{'valid_accuracy': 0.9789877533912659, 'valid_loss': 0.13964278995990753}]

###  7. Save it!

In [24]:
download_chkpt = False #@param {type:"boolean"}


In [25]:
from datetime import datetime
def get_timestamp():
    return datetime.now().strftime("%b-%d-%Y_t-%H")

In [26]:
_chk_name = f"tabular_classification_model_MIT_{get_timestamp()}.pt"
out_dir = root_dir / "model-checkpoints"
out_dir.mkdir(parents=True, exist_ok=True)  # make sure the target folder exists
model_out_path = out_dir / _chk_name
trainer.save_checkpoint(str(model_out_path.resolve()))
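
The timestamp format used for the checkpoint name resolves only to the hour, so two checkpoints saved within the same hour get the same filename and the later one replaces the earlier. The sketch below uses a fixed, made-up datetime to show the resulting string; the helper `format_timestamp` and its minutes-and-seconds variant are suggestions, not something this notebook uses.

```python
from datetime import datetime


def format_timestamp(moment: datetime, with_minutes: bool = False) -> str:
    """Format a datetime like `get_timestamp`, optionally adding minutes and seconds."""
    fmt = "%b-%d-%Y_t-%H" + ("-%M-%S" if with_minutes else "")
    return moment.strftime(fmt)


moment = datetime(2022, 3, 26, 0, 9, 54)  # arbitrary example time
# month abbreviation shown for the default C locale
print(format_timestamp(moment))                     # Mar-26-2022_t-00
print(format_timestamp(moment, with_minutes=True))  # Mar-26-2022_t-00-09-54
```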

In [27]:
from google.colab import files

if download_chkpt:
    files.download(str(model_out_path))  # files.download expects a filename string

# Predicting

###  8. Load the model from a checkpoint

`TabularClassifier.load_from_checkpoint` supports both a URL and a local path to a checkpoint. If given a URL, the checkpoint is first downloaded and then loaded to re-create the model.

In [28]:
# model = TabularClassifier.load_from_checkpoint(
#     "https://flash-weights.s3.amazonaws.com/0.7.0/tabular_classification_model.pt")

###  9. Generate predictions from a CSV file

`TabularClassifier.predict` supports both a DataFrame and a path to a `.csv` file.

In [29]:
# datamodule = TabularClassificationData.from_csv(
#     predict_file=mitbih_test_filename,  # any CSV with the same predictor columns
#     parameters=datamodule.parameters,
#     batch_size=8,
# )
# predictions = trainer.predict(model, datamodule=datamodule)

In [30]:
# print(predictions)