# Bastion AI Real World Example
## Fine-tuning DistilBERT for binary classification on the SMS Spam Collection

Data preparation and training are largely based on https://towardsdatascience.com/fine-tuning-bert-for-text-classification-54e7df642894.

### Installing Bastion AI

### From source

To use this notebook, you'll need a working Bastion AI installation.
First clone our repo:
```
$ git clone git@github.com:mithril-security/bastionai.git
```
Then install the client library:
```
$ cd ./bastionai/client
$ make install
```

### Via pip

Just run the following cell:

In [None]:
%pip install bastionai

### Installing and importing additional packages

Let's first install and import all the packages needed for the entire notebook.
The makefile for the client has already set up a virtualenv with the client dependencies for us.
We just need to install the additional packages we'll use:

In [None]:
%pip install transformers pandas scikit-learn ipykernel ipywidgets

We can now import necessary packages and objects:

In [None]:
import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from torch.utils.data import DataLoader

from bastionai.client import Connection, Private
from bastionai.optimizer_config import Adam
from bastionai.utils import MultipleOutputWrapper, TensorDataset

### Preparing the dataset

The dataset can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip.
Unzip the archive to obtain the dataset file:

```
$ unzip smsspamcollection.zip
```

Each row represents one sample: the label comes first, followed by a tab and the raw text:
```
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
```
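
To make this layout concrete, here is a minimal sketch that parses a couple of in-memory lines of the same shape. The helper name `parse_line` and the shortened sample texts are illustrative assumptions, not part of the dataset tooling:

```python
# Sample lines in the same "label<TAB>text" layout as the dataset file.
sample_lines = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts\n",
]

def parse_line(line):
    """Split one raw line into (label_id, text); spam maps to 1, ham to 0."""
    # partition splits on the first tab only, so tabs inside the text survive.
    label, _, text = line.rstrip("\n").partition("\t")
    return (1 if label == "spam" else 0), text

parsed = [parse_line(line) for line in sample_lines]
print(parsed)
```

Stripping the trailing newline before splitting keeps it out of the text that is later passed to the tokenizer.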

We first load the data from the file into a pandas dataframe:

In [None]:
file_path = "./data/SMSSpamCollection"

labels = []
texts = []
with open(file_path) as f:
    for line in f:
        # Each line is "<label>\t<text>"; drop the trailing newline before splitting.
        label, text = line.rstrip("\n").split("\t", 1)
        labels.append(1 if label == "spam" else 0)
        texts.append(text)
df = pd.DataFrame({ "label": labels, "text": texts })
print(len(df))
df.head()

We then preprocess the data using DistilBERT's tokenizer and we obtain tensors ready to be fed to the model:

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

token_id = []
attention_masks = []
for sample in df.text.values:
    encoding_dict = tokenizer.encode_plus(
        sample,
        add_special_tokens=True,
        max_length=32,
        truncation=True,
        padding="max_length",
        return_attention_mask=True,
        return_tensors='pt'
    )
    token_id.append(encoding_dict['input_ids']) 
    attention_masks.append(encoding_dict['attention_mask'])

token_id = torch.cat(token_id, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(df.label.values)
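
To illustrate what `max_length`, `truncation` and `padding="max_length"` do to each sample, here is a simplified, tokenizer-free sketch. The function `encode_fixed_length` and the content IDs are hypothetical, and the special-token IDs are only illustrative; the real tokenizer handles all of this internally:

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # illustrative special-token IDs

def encode_fixed_length(content_ids, max_length):
    """Add special tokens, truncate, pad, and build the attention mask."""
    kept = content_ids[: max_length - 2]   # leave room for the two special tokens
    ids = [CLS_ID] + kept + [SEP_ID]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [PAD_ID] * n_pad
    return input_ids, attention_mask

# A short sample gets padded, a long one gets truncated.
short_ids, short_mask = encode_fixed_length([7, 8, 9], max_length=8)
long_ids, long_mask = encode_fixed_length(list(range(1000, 1020)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sample ends up with exactly `max_length` positions, which is what allows the per-sample tensors to be concatenated into a single batch tensor.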

It's now time to split the data into train and test sets and to wrap them in Dataset and DataLoader objects for ease of use:

In [None]:
val_ratio = 0.2

train_idx, test_idx = train_test_split(
    np.arange(len(labels)),
    test_size=val_ratio,
    shuffle=True,
    stratify=labels
)

train_set = TensorDataset([
    token_id[train_idx], 
    attention_masks[train_idx]
], labels[train_idx])

test_set = TensorDataset([
    token_id[test_idx], 
    attention_masks[test_idx]
], labels[test_idx])

train_dataloader = DataLoader(train_set, batch_size=4)
test_dataloader = DataLoader(test_set, batch_size=4)
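
The `stratify=labels` argument matters here because spam is the minority class. The following sketch shows the idea behind a stratified split on toy labels. It is a simplified illustration with hypothetical names (`stratified_split`, `toy_labels`), not scikit-learn's actual implementation:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio, seed=0):
    """Split indices so that each class keeps the same share in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [1] * 10 + [0] * 40  # hypothetical data: 20% spam, 80% ham
toy_train_idx, toy_test_idx = stratified_split(toy_labels, test_ratio=0.2)
spam_share_test = sum(toy_labels[i] for i in toy_test_idx) / len(toy_test_idx)
print(len(toy_train_idx), len(toy_test_idx), spam_share_test)
```

Because the quota is computed per class, the test set keeps the same spam share as the full dataset instead of depending on the luck of the shuffle.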

### Preparing the model for use with DP-SGD and Bastion AI

We now turn to preparing the DistilBERT language model. As Hugging Face's models typically have several outputs (logits, loss, etc.), we use Bastion AI's utility wrapper for models with multiple outputs to select the single output that corresponds to the logits. Indeed, Bastion AI's server supports models with an arbitrary number of inputs but only supports models with a single output.

In [None]:
# Do not display warnings about layer not initialized
# with pretrained weights (classification layers, this is fine)
from transformers import logging
logging.set_verbosity_error()

model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2,
    output_attentions=False,
    output_hidden_states=False,
    torchscript=True
)
model = MultipleOutputWrapper(model, 0)
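
Conceptually, such a wrapper only forwards its inputs and keeps one element of the returned tuple. The framework-free sketch below shows that idea; the class `SelectOutput` and `toy_model` are hypothetical stand-ins, not Bastion AI's implementation (a real version would presumably subclass `torch.nn.Module`):

```python
class SelectOutput:
    """Wrap a callable that returns a tuple and expose only one of its outputs."""

    def __init__(self, inner, index):
        self.inner = inner
        self.index = index

    def __call__(self, *args, **kwargs):
        outputs = self.inner(*args, **kwargs)
        return outputs[self.index]  # keep a single output, drop the rest

def toy_model(x):
    # Stand-in for a model returning (logits, extra_output).
    return ([2 * v for v in x], "extra")

wrapped = SelectOutput(toy_model, 0)
logits = wrapped([1, 2, 3])
print(logits)
```

The wrapped object still accepts any number of inputs, which matches the constraint stated above: many inputs, one output.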

### Sending dataset and model and training on the server

Before proceeding, we need to start a local Bastion AI server, which can be achieved with the following commands,
assuming you have a working Rust toolchain (https://www.rust-lang.org/tools/install):

```
$ cd ../server/bastionai_app
$ cargo run
```

Now that the server code has been compiled and the server has started, it's time to send the dataset and the model to the server.

We first use the `RemoteDataLoader` function to send our train and test dataloaders to the server (what we really send are the dataset and the dataloading parameters, not the dataloaders per se) and we provide a name and description to better identify them.

Second, we use the `RemoteLearner` function to send the model to the server and to set all the necessary training configuration.
As training will be executed remotely, we need to script the model prior to sending it (i.e. compile it to TorchScript), which is done automatically in the `RemoteLearner` constructor. If the model is not suited for scripting, which is generally the case with Hugging Face's models, the constructor automatically falls back to tracing: the model is run locally on a small but representative input, and the Torch JIT compiler records all the operations that are executed and compiles them on the fly. This approach, although more error-prone (in certain cases the input may not activate some needed computation paths), is less picky than scripting and accepts nearly all models.
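
The caveat about computation paths that the example input does not activate can be reproduced with plain PyTorch. In this minimal sketch, `shift` is a hypothetical function with a data-dependent branch; `torch.jit.trace` records only the branch taken by the example input:

```python
import warnings
import torch

def shift(x):
    # Data-dependent branch: which line runs depends on the input values.
    if x.sum() > 0:
        return x + 10
    return x - 10

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the tracer's warning about the branch
    # The all-ones example input only exercises the "x + 10" path.
    traced_shift = torch.jit.trace(shift, torch.ones(3))

negative = -torch.ones(3)
eager_out = shift(negative)          # eager mode takes the "x - 10" branch
traced_out = traced_shift(negative)  # the trace replays the recorded "x + 10" path
print(eager_out, traced_out)
```

The two results differ, which is why the input used for tracing should be representative of the data the model will see during training.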

In addition, as we'll use the DP-SGD algorithm for training, the constructor will also make the model compatible with Bastion AI's DP-SGD implementation. Unlike Opacus, which uses backprop hooks to compute per-sample gradients, Bastion AI relies on normal autograd and modified layers that internally store expanded gradients: weight tensors have the same size in memory but are manipulated through expanded views that repeat them as many times as there are samples in a batch, so that the gradients of these views are per-sample gradients. Per-sample gradient computation is key to DP-SGD and is one of the ingredients that make DP usable with deep learning models.
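
To show why per-sample gradients are needed, here is a minimal sketch of the textbook DP-SGD aggregation step: each sample's gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise is added, and the result is averaged. The function name and parameters (`dp_sgd_aggregate`, `max_norm`, `noise_multiplier`) are hypothetical and this is not Bastion AI's server code:

```python
import math
import random

def dp_sgd_aggregate(per_sample_grads, max_norm, noise_multiplier, seed=0):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    rng = random.Random(seed)
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only the gradients whose norm exceeds max_norm.
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        for j, g in enumerate(grad):
            total[j] += g * scale
    sigma = noise_multiplier * max_norm  # noise calibrated to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy]

grads = [[3.0, 4.0], [0.3, 0.4]]  # toy gradients with norms 5.0 and 0.5
no_noise = dp_sgd_aggregate(grads, max_norm=1.0, noise_multiplier=0.0)
with_noise = dp_sgd_aggregate(grads, max_norm=1.0, noise_multiplier=1.0)
print(no_noise, with_noise)
```

Clipping has to happen before the gradients are summed, so a single batch-level gradient is not enough: the contribution of each individual sample must be available, which is what the expanded views described above provide.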

To start training, we just call the `fit` method on the `RemoteLearner` object with the appropriate number of epochs and DP budget. We can optionally override some of the learner's settings, such as the learning rate or the clipping factor.

We may finally retrieve a local copy of the trained model once the training is complete with the `get_model` method and test the model directly on the server with the `test` method.

In [None]:
import warnings
warnings.filterwarnings("ignore")

with Connection("localhost", 50051) as client:
    remote_dataloader = client.RemoteDataLoader(train_dataloader, name="SMSSpamCollection", privacy_limit=Private(302.1))
    
    remote_learner = client.RemoteLearner(
        model,
        remote_dataloader,
        metric="cross_entropy",
        optimizer=Adam(lr=5e-5),
        model_name="DistilBERT",
    )

    remote_learner.fit(nb_epochs=2, eps=Private(22.0), metric_eps=Private(140.0))
    remote_learner.test(metric="accuracy", metric_eps=Private(140.0))
    
    trained_model = remote_learner.get_model()