<a target="_blank" href="https://colab.research.google.com/github/masood/2024-pets-privacy-labels-policies/blob/main/fine_tune_privbert.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Fine-tuning PrivBERT

In this notebook, we provide code to help download and fine-tune PrivBERT to recognize different attributes from the OPP-115 dataset.

💡 This demo needs a GPU runtime. In the notebook's top menu, choose `Runtime > Change runtime type` and select `T4 GPU`.

In [1]:
# Uncomment exactly one attribute to train. The 'Main' attribute identifies the high-level data practice that a text segment addresses.
# current_attribute = 'Action First-Party'
# current_attribute = 'Action Third-Party'
# current_attribute = 'Audience Type'
# current_attribute = 'Does or Does Not'
# current_attribute = 'Identifiability'
# current_attribute = 'Personal Information Type'
# current_attribute = 'Purpose'
current_attribute = 'Main'

# Install Requirements and Import Relevant Libraries

In [2]:
! pip install transformers[torch] datasets huggingface-hub


In [3]:
# Standard-library and third-party utility imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from collections import OrderedDict
import pickle
from os.path import join, isfile
from os import listdir
from pathlib import Path
from tqdm import tqdm

# To Download Training Data
from huggingface_hub import hf_hub_download

# Scikit-Learn to evaluate the fine-tuned model
from sklearn.metrics import classification_report

# Pre-trained Model and Tokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [4]:
# Make use of GPUs when available
from torch import cuda
cuda.empty_cache()
device = 'cuda' if cuda.is_available() else 'cpu'
cuda.is_available()

True

# Utility Functions for Data Processing

In [5]:
def attr_value_labels(attribute):

    if attribute == 'Audience Type':
        labels = OrderedDict([('Children', 0),
             ('Californians', 1),
             ('Citizens from other countries', 2),
             ('Europeans', 3)])
    elif attribute == 'Does or Does Not':
        labels = OrderedDict([('Does', 0),
             ('Does Not', 1)])
    elif attribute == 'Action First-Party':
        labels = OrderedDict([('Collect in mobile app', 0),
                               ('Collect on website', 1)])
    elif attribute == 'Action Third-Party':
        labels = OrderedDict([('Collect on first party website/app', 0),
                             ('See', 1)])
    elif attribute == 'Identifiability':
        labels = OrderedDict([('Aggregated or anonymized', 0),
             ('Identifiable', 1),
             ('Unspecified', 2)])
    elif attribute == 'Personal Information Type':
        labels = OrderedDict([('Computer information', 0),
             ('Contact', 1),
             ('Cookies and tracking elements', 2),
             ('Demographic', 3),
             ('Financial', 4),
             ('Generic personal information', 5),
             ('Health', 6),
             ('IP address and device IDs', 7),
             ('Location', 8),
             ('Personal identifier', 9),
             ('Social media data', 10),
             ('Survey data', 11),
             ('User online activities', 12),
             ('User profile', 13),
             ('Unspecified', 14)])
    elif attribute == 'Purpose':
        labels = OrderedDict([('Additional service/feature', 0),
             ('Advertising', 1),
             ('Analytics/Research', 2),
             ('Basic service/feature', 3),
             ('Legal requirement', 4),
             ('Marketing', 5),
             ('Merger/Acquisition', 6),
             ('Personalization/Customization', 7),
             ('Service operation and security', 8),
             ('Unspecified', 9)])
    elif attribute == 'Main':
        labels = OrderedDict([('First Party Collection/Use', 0),
             ('Third Party Sharing/Collection', 1),
             ('User Access, Edit and Deletion', 2),
             ('Data Retention', 3),
             ('Data Security', 4),
             ('International and Specific Audiences', 5),
             ('Do Not Track', 6),
             ('Policy Change', 7),
             ('User Choice/Control', 8),
             ('Introductory/Generic', 9),
             ('Practice not covered', 10),
             ('Privacy contact information', 11)])

    file = 'labels_' + attribute + '.pkl'
    with open(file, "wb") as f:
        pickle.dump(labels, f)

    if isfile(file):
        # Read the mapping back with a context manager so the file is always closed
        with open(file, "rb") as labels_file:
            labels = pickle.load(labels_file)

    return labels
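To illustrate how such a name-to-index mapping can be used, the sketch below turns a list of label names into a multi-hot target vector. It is a minimal illustration rather than code from the notebook: the helper name `to_multi_hot` is hypothetical, and the three-entry mapping is copied from the 'Identifiability' branch above.

```python
from collections import OrderedDict

# Mapping copied from the 'Identifiability' branch of attr_value_labels.
labels = OrderedDict([('Aggregated or anonymized', 0),
                      ('Identifiable', 1),
                      ('Unspecified', 2)])

def to_multi_hot(active_labels, label_map):
    """Hypothetical helper: 1 at the index of every active label, 0 elsewhere."""
    vector = [0] * len(label_map)
    for name in active_labels:
        vector[label_map[name]] = 1
    return vector

print(to_multi_hot(['Identifiable', 'Unspecified'], labels))  # [0, 1, 1]
```

Each position in the vector corresponds to one output logit of the classifier, which is why the model is later created with `num_labels` equal to the size of this mapping.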

In [6]:
def get_hyperparameters(attribute):
  if attribute == 'Main':
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 2
    VALID_BATCH_SIZE = 2
    EPOCHS = 3
    LEARNING_RATE = 2.5E-05
    HIDDEN_DROPOUT = 0.15
    ATTENTION_DROPOUT = 0.15
  elif attribute == 'Identifiability':
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 10
    VALID_BATCH_SIZE = 10
    EPOCHS = 1
    LEARNING_RATE = 2.5E-05
    HIDDEN_DROPOUT = 0
    ATTENTION_DROPOUT = 0
  elif attribute == 'Does or Does Not':
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 2
    VALID_BATCH_SIZE = 2
    EPOCHS = 2
    LEARNING_RATE = 5E-06
    HIDDEN_DROPOUT = 0
    ATTENTION_DROPOUT = 0
  elif attribute == 'Purpose':
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 2
    VALID_BATCH_SIZE = 2
    EPOCHS = 3
    LEARNING_RATE = 2.5E-05
    HIDDEN_DROPOUT = 0.05
    ATTENTION_DROPOUT = 0.05
  elif attribute == 'Personal Information Type':
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 16
    VALID_BATCH_SIZE = 16
    EPOCHS = 4
    LEARNING_RATE = 2.5E-05
    HIDDEN_DROPOUT = 0
    ATTENTION_DROPOUT = 0
  elif attribute == 'Audience Type':
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 2
    VALID_BATCH_SIZE = 2
    EPOCHS = 1
    LEARNING_RATE = 2.5E-05
    HIDDEN_DROPOUT = 0
    ATTENTION_DROPOUT = 0
  elif attribute == 'Action First-Party':
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 2
    VALID_BATCH_SIZE = 2
    EPOCHS = 1
    LEARNING_RATE = 5E-06
    HIDDEN_DROPOUT = 0
    ATTENTION_DROPOUT = 0
  elif attribute == 'Action Third-Party':
    MAX_LEN = 512
    TRAIN_BATCH_SIZE = 2
    VALID_BATCH_SIZE = 2
    EPOCHS = 2
    LEARNING_RATE = 1.5E-05
    HIDDEN_DROPOUT = 0.05
    ATTENTION_DROPOUT = 0.05

  return MAX_LEN, TRAIN_BATCH_SIZE, VALID_BATCH_SIZE, EPOCHS, LEARNING_RATE, HIDDEN_DROPOUT, ATTENTION_DROPOUT
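As a possible alternative to the `if`/`elif` chain, the same settings could be stored in a lookup table. The sketch below is only an illustration under that assumption: the names `Hyperparameters`, `HYPERPARAMETERS`, and `lookup_hyperparameters` are hypothetical, and only two attributes are shown, with values copied from the function above.

```python
from typing import NamedTuple

class Hyperparameters(NamedTuple):
    max_len: int
    train_batch_size: int
    valid_batch_size: int
    epochs: int
    learning_rate: float
    hidden_dropout: float
    attention_dropout: float

# Only two attributes are listed here; values are copied from get_hyperparameters.
HYPERPARAMETERS = {
    'Main': Hyperparameters(512, 2, 2, 3, 2.5e-05, 0.15, 0.15),
    'Identifiability': Hyperparameters(512, 10, 10, 1, 2.5e-05, 0.0, 0.0),
}

def lookup_hyperparameters(attribute):
    """Return the settings for an attribute, failing loudly on unknown names."""
    try:
        return HYPERPARAMETERS[attribute]
    except KeyError:
        raise ValueError(f"Unknown attribute: {attribute!r}") from None

# A NamedTuple unpacks in the same order as the original return statement.
max_len, train_bs, valid_bs, epochs, lr, hidden_do, attn_do = lookup_hyperparameters('Main')
print(epochs, lr)  # 3 2.5e-05
```

One benefit of this layout is that a misspelled attribute raises a descriptive error instead of failing on an unassigned variable.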

# Setup Attribute and Labels

In [7]:
labels = attr_value_labels(current_attribute)
current_num_levels = len(labels)
target_names = []
label_indices = []

for label, index in labels.items():
    target_names.append(label)
    label_indices.append(index)
    print(str(index) + '. ' + label)

0. First Party Collection/Use
1. Third Party Sharing/Collection
2. User Access, Edit and Deletion
3. Data Retention
4. Data Security
5. International and Specific Audiences
6. Do Not Track
7. Policy Change
8. User Choice/Control
9. Introductory/Generic
10. Practice not covered
11. Privacy contact information


In [8]:
# Hyperparameters for the dataset and the model
MAX_LEN, TRAIN_BATCH_SIZE, VALID_BATCH_SIZE, EPOCHS, LEARNING_RATE, HIDDEN_DROPOUT, ATTENTION_DROPOUT = get_hyperparameters(current_attribute)

In [9]:
tokenizer = AutoTokenizer.from_pretrained("mukund/privbert")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Prepare Dataset and Dataloader

In [10]:
# Access Training Data From HuggingFace
REPO_ID = "masoodali/apple-app-store-labels-policies"

def get_training_dataframe(FILENAME):
    return pd.read_csv(
        hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
        ,converters={"label": lambda x: x.strip("[]").split(" ")}
    )
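The converter above can be tried in isolation. The sketch below applies the same logic to a single cell; the bracketed, space-separated cell format is an assumption, since the CSV contents are not shown in this notebook, and `parse_label` is a hypothetical name.

```python
def parse_label(cell):
    """Same logic as the converter above: drop the brackets, split on spaces."""
    return cell.strip("[]").split(" ")

# Assumed cell format: a space-separated 0/1 vector wrapped in brackets.
parsed = parse_label("[1 0 0 1]")
print(parsed)                    # ['1', '0', '0', '1']

# The values are still strings; they become numbers only when converted later.
print([int(v) for v in parsed])  # [1, 0, 0, 1]
```

Note that splitting on a single space would produce empty strings if a cell contained runs of several spaces, so this parsing relies on the file using exactly one space between values.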

In [11]:
class MultiLabelDataset(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.text
        self.targets = self.data.labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

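        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,       # truncate explicitly so every item has max_len tokens
            return_token_type_ids=True
        )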
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }
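Conceptually, the tokenizer call in `__getitem__` produces fixed-length `input_ids` and a matching `attention_mask`. The pure-Python sketch below mimics only that length handling; `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and a real tokenizer additionally preserves special tokens when truncating, which this sketch ignores.

```python
PAD_ID = 1  # matches padding_idx=1 in the RoBERTa embedding layer printed later

def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Hypothetical helper: return (ids, attention_mask), both of length max_len."""
    ids = token_ids[:max_len]            # cut sequences that are too long
    mask = [1] * len(ids)                # 1 marks real tokens
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

ids, mask = pad_or_truncate([0, 713, 16, 2], max_len=6)
print(ids)   # [0, 713, 16, 2, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Because every item has the same length, the default `DataLoader` collation can stack the items of a batch into a single tensor.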

In [12]:
training_dataframe = get_training_dataframe(f"privacy_policy/training_data/agg_data/agg_data_{current_attribute}.csv")


In [13]:
num_records = len(training_dataframe)

print('Num of unique segments: {}'.format(num_records))

num_labels = len(training_dataframe["label"].iloc[0])

print('Num of labels: {}'.format(num_labels))

sentence_matrices = np.zeros(num_records, dtype = 'object')

label_matrices = np.zeros((num_records, num_labels))

Num of unique segments: 3788
Num of labels: 12


In [14]:
for index, row in training_dataframe.iterrows():

    sentence_matrices[index] = row["segment"]

    label_matrices[index] = np.array(row["label"])

In [15]:
df = pd.DataFrame()
df['text'] = sentence_matrices
df['labels'] = list(label_matrices.astype(int))

In [16]:
# Creating the dataset and dataloader for the neural network

train_size = 0.8
train_data=df.sample(frac=train_size,random_state=200)
test_data=df.drop(train_data.index).reset_index(drop=True)
train_data = train_data.reset_index(drop=True)


print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_data.shape))
print("TEST Dataset: {}".format(test_data.shape))

training_set = MultiLabelDataset(train_data, tokenizer, MAX_LEN)
testing_set = MultiLabelDataset(test_data, tokenizer, MAX_LEN)

FULL Dataset: (3788, 2)
TRAIN Dataset: (3030, 2)
TEST Dataset: (758, 2)
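The split sizes printed above can be checked with a small standard-library sketch. It illustrates the same idea (sample 80% of the row indices, keep the rest for testing) but will not select the same rows as pandas, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows, train_frac, seed):
    """Hypothetical helper: seeded random split of row indices into train and test."""
    rng = random.Random(seed)
    n_train = round(n_rows * train_frac)
    train = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train)
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test

train_idx, test_idx = split_indices(3788, 0.8, seed=200)
print(len(train_idx), len(test_idx))  # 3030 758
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models trained for different attributes.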


In [17]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)
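With the split sizes above and a batch size of 2 for the 'Main' attribute, the number of batches per epoch follows directly. The sketch below assumes the loaders keep the final partial batch, which is the `DataLoader` default; `batches_per_epoch` is a hypothetical helper.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of batches when the last, possibly smaller, batch is kept."""
    return math.ceil(n_samples / batch_size)

print(batches_per_epoch(3030, 2))  # 1515 training batches
print(batches_per_epoch(758, 2))   # 379 test batches
```

These counts match the progress-bar totals shown in the training and evaluation outputs further down (`1515it` and `379it`).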

# Download Pre-trained PrivBERT Model

In [18]:
model = AutoModelForSequenceClassification.from_pretrained("mukund/privbert", num_labels=current_num_levels, hidden_dropout_prob=HIDDEN_DROPOUT, attention_probs_dropout_prob=ATTENTION_DROPOUT)
model.to(device)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at mukund/privbert and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.15, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.15, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
           

In [19]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)

optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)
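For intuition about what `BCEWithLogitsLoss` computes, the sketch below evaluates the mean binary cross-entropy over a flat list of logits using a numerically stable form, `max(x, 0) - x*y + log(1 + exp(-|x|))`. It is a plain-Python illustration, not the PyTorch implementation, and `bce_with_logits` is a hypothetical name.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, written in a stable form."""
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# With all logits at zero the model is maximally uncertain: the loss is ln(2).
print(bce_with_logits([0.0, 0.0], [1.0, 0.0]))
print(math.log(2))
```

A freshly initialized classification head tends to produce logits near zero, which is one plausible reason the first logged training loss (about 0.72) is close to ln(2) ≈ 0.69.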

# Fine-tune PrivBERT for the Current Attribute

In [20]:
def train(epoch):
    model.train()
    for step, data in tqdm(enumerate(training_loader, 0)):
        # Move the batch to the selected device
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)

        outputs = model(ids, mask, token_type_ids)

        optimizer.zero_grad()
        loss = loss_fn(outputs.logits, targets)
        # Log the loss on the first batch and on every 5000th batch after it
        if step % 5000 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        loss.backward()
        optimizer.step()

In [21]:
for epoch in range(EPOCHS):
    train(epoch)

0it [00:00, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch: 0, Loss:  0.7215884923934937


1515it [05:31,  4.57it/s]
2it [00:00,  4.87it/s]

Epoch: 1, Loss:  0.24130740761756897


1515it [05:35,  4.51it/s]
2it [00:00,  4.74it/s]

Epoch: 2, Loss:  0.052059948444366455


1515it [05:35,  4.51it/s]


In [22]:
# Saving the model's files

Path(f"./models/{current_attribute}").mkdir(parents=True, exist_ok=True)

output_model_file = f'./models/{current_attribute}/pytorch-privbert.bin'
output_tokenizer_file = f'./models/{current_attribute}/'

torch.save(model, output_model_file)
tokenizer.save_pretrained(output_tokenizer_file)

print('Saved')

Saved


# Test Fine-tuned Model and Gather Metrics

In [23]:
def validation(testing_loader):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in tqdm(enumerate(testing_loader, 0)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            outputs = model(ids, mask, token_type_ids)
            outputs = outputs.logits
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

In [24]:
outputs, targets = validation(testing_loader)

final_outputs = np.array(outputs) >=0.5

379it [00:25, 15.13it/s]
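Thresholding the sigmoid output at 0.5 is equivalent to thresholding the raw logit at 0, because the sigmoid is increasing and maps 0 to exactly 0.5. The sketch below checks this on a few made-up logits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [-2.0, -0.1, 0.0, 0.3, 4.0]  # made-up values for illustration

by_probability = [sigmoid(x) >= 0.5 for x in logits]
by_logit = [x >= 0.0 for x in logits]

print(by_probability)  # [False, False, True, True, True]
print(by_probability == by_logit)  # True
```

Each label is thresholded independently, so a segment can be assigned several labels or none at all.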


### Presence

We first note how well the model detects the presence of an attribute in the text segment.

In [25]:
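# Note: scikit-learn expects (y_true, y_pred); predictions are passed first here, so the precision and recall columns are swapped and support counts predicted labels (F1 is unaffected).
print(classification_report(final_outputs, targets, labels=label_indices, target_names=target_names, zero_division='warn'))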

                                      precision    recall  f1-score   support

          First Party Collection/Use       0.81      0.95      0.87       273
      Third Party Sharing/Collection       0.89      0.88      0.89       251
      User Access, Edit and Deletion       0.63      0.82      0.71        33
                      Data Retention       0.52      0.59      0.55        29
                       Data Security       0.62      0.83      0.71        60
International and Specific Audiences       0.90      0.94      0.92        64
                        Do Not Track       1.00      1.00      1.00         3
                       Policy Change       0.74      0.71      0.73        28
                 User Choice/Control       0.37      0.85      0.51        54
                Introductory/Generic       0.58      0.83      0.68       123
                Practice not covered       0.52      0.59      0.55       108
         Privacy contact information       0.57      0.89      



### Absence

Next, we note how well the model detects the absence of an attribute in the text segment.

In [26]:
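# As above, predictions are passed first (scikit-learn expects y_true, y_pred), so precision and recall are swapped.
print(classification_report(np.array(outputs) < 0.5, np.array(targets) < 0.5, labels=label_indices, target_names=target_names, zero_division='warn'))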

                                      precision    recall  f1-score   support

          First Party Collection/Use       0.97      0.88      0.92       485
      Third Party Sharing/Collection       0.94      0.94      0.94       507
      User Access, Edit and Deletion       0.99      0.98      0.98       725
                      Data Retention       0.98      0.98      0.98       729
                       Data Security       0.99      0.96      0.97       698
International and Specific Audiences       0.99      0.99      0.99       694
                        Do Not Track       1.00      1.00      1.00       755
                       Policy Change       0.99      0.99      0.99       730
                 User Choice/Control       0.99      0.89      0.93       704
                Introductory/Generic       0.96      0.88      0.92       635
                Practice not covered       0.93      0.91      0.92       650
         Privacy contact information       0.99      0.97      
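To make the presence and absence views concrete, the sketch below computes precision, recall, and F1 for a single label from made-up predictions, first treating 1 (present) and then 0 (absent) as the positive class. The helper `prf` is hypothetical and is not how scikit-learn is implemented.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for one label, with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up ground truth and predictions for a single label.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]

presence = prf(y_true, y_pred, positive=1)
absence = prf(y_true, y_pred, positive=0)

# Passing the arguments in the opposite order swaps precision and recall
# but leaves F1 unchanged.
swapped = prf(y_pred, y_true, positive=1)
print(presence, absence, swapped)
```

Because F1 is symmetric in precision and recall, it is the most robust of the three columns to compare across reports.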

### Average for Each Attribute

Now that we have evaluated the model's ability to detect both the presence and the absence of each attribute, report the average of the two scores together with the support values. This practice is in line with [Harkous et al.](https://www.usenix.org/conference/usenixsecurity18/presentation/harkous)

For example, if the F1-score for *Third Party Sharing/Collection* is 0.89 for presence and 0.94 for absence, the average is (0.89 + 0.94) / 2 = 0.915, which is reported as 0.92 after rounding. Make sure to report the support as well.
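The averaging step can be sketched as follows. The example uses `Decimal` so that the half-way value 0.915 rounds up deterministically; the helper name `average_f1` is hypothetical, and the unweighted mean is an assumption based on the worked example above.

```python
from decimal import Decimal, ROUND_HALF_UP

def average_f1(presence_f1, absence_f1):
    """Unweighted mean of the presence and absence F1-scores, to two decimals."""
    mean = (Decimal(presence_f1) + Decimal(absence_f1)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Third Party Sharing/Collection: 0.89 (presence) and 0.94 (absence).
print(average_f1("0.89", "0.94"))  # 0.92
```

Passing the scores as strings avoids binary floating-point representation issues at the rounding boundary.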
