# An information-theoretic probing of wordpiece tokens in BERT

This is a methods exploration of an approach proposed in Brunner et al. (2020), "On Identifiability in Transformers." Neural networks are often described as black boxes because, compared to classic statistical methods such as regressions, they can be difficult to interpret. This problem is 

## Lexical chunking

Bybee

## Transformers

Transformers are a type of neural network characterized by multi-headed self-attention. They are typically extremely large, having millions or even hundreds of billions of parameters. They were first proposed for the purpose of neural machine translation. Since that time, they have become state-of-the-art for natural language and image processing.

We will take a quick look at multi-headed self-attention, but it is worth noting that most of the parameters in a transformer model are devoted to run-of-the-mill fully connected feed forward layers.
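As a rough illustration of that claim, the parameter counts of a single encoder block can be worked out by hand. The sketch below assumes the standard BERT-base sizes (hidden size 768, feed-forward size 3072) and ignores LayerNorm parameters; `linear_params` is a helper name made up for this example.

```python
# Back-of-the-envelope parameter counts for one BERT-base encoder block.
# Assumed sizes: the standard bert-base configuration. Biases are included,
# LayerNorm parameters are ignored for simplicity.
HIDDEN = 768         # embedding size
INTERMEDIATE = 3072  # inner size of the feed-forward sublayer


def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Self-attention: query, key, value, and output projections.
attention = 4 * linear_params(HIDDEN, HIDDEN)

# Feed-forward: expand to INTERMEDIATE, then project back to HIDDEN.
feed_forward = (linear_params(HIDDEN, INTERMEDIATE)
                + linear_params(INTERMEDIATE, HIDDEN))

print(f"attention per block:    {attention:,}")
print(f"feed-forward per block: {feed_forward:,}")
print(f"feed-forward / attention: {feed_forward / attention:.2f}")
```

Under these assumptions, the feed-forward sublayer holds roughly twice as many parameters as the attention sublayer within each block.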

### Multi-headed self-attention

## Question
Do more word-y words retain more token-specific information through the layers of the transformer blocks? Do compositional words like "got to" retain less token-specific information as they move through the transformer blocks?

Does the predictability of a token affect how much of its initial token embedding is retained?

## Method

### Data

#### BookCorpus

#### BERT

### Model

Train a multilayer perceptron (MLP) to reverse engineer token embeddings from transformer layers.

### Statistical Analysis

Nearest-neighbor lookup
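A minimal sketch of what a nearest-neighbor lookup could look like here, assuming the probe outputs one reconstructed vector per token and that a table of candidate token embeddings is available. The random `vocab_embeddings` table and the helper `nearest_token_ids` are stand-ins invented for this example, not part of the notebook's pipeline.

```python
import torch


def nearest_token_ids(predicted, vocab_embeddings):
    """For each predicted vector, return the index of the closest candidate (L2 distance)."""
    distances = torch.cdist(predicted, vocab_embeddings)  # [n_predictions, n_candidates]
    return distances.argmin(dim=1)


torch.manual_seed(0)
vocab_embeddings = torch.randn(50, 768)   # stand-in for a real table of token embeddings
true_ids = torch.tensor([3, 17, 42])

# Simulate probe outputs as slightly noisy copies of the true embeddings.
predicted = vocab_embeddings[true_ids] + 0.01 * torch.randn(3, 768)

recovered = nearest_token_ids(predicted, vocab_embeddings)
accuracy = (recovered == true_ids).float().mean().item()
print(recovered.tolist(), accuracy)
```

A token counts as identified when the nearest candidate is the token that was actually at that position, so the mean over tokens gives an identification rate.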

## Results

## Limitations/ TO DO

In [None]:
from tqdm import tqdm # tqdm.auto is still buggy here, so we use the CLI version.

# Huggingface
from transformers import logging, AutoModel, AutoTokenizer
logging.set_verbosity_error()
MODEL_NAME = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModel.from_pretrained(MODEL_NAME)

# PyTorch
import torch
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(DEVICE)

cpu


In [None]:
s = 'This is an example.'
s = tokenizer(s, return_tensors='pt')

In [None]:
with torch.no_grad():
    hidden_states = model(**s, output_hidden_states=True)['hidden_states']

## Indexing the Model State Array with PyTorch

hidden_states is a tuple of tensors.

There is one tensor for each layer of the network (the input embedding plus each encoder block):
hidden_states[i] == hidden states at i<sup>th</sup> layer of network.

Each of these tensors has the following shape:
hidden_states[i].shape == [num_examples, sequence_length, embedding_size]  

To get the 13 different embeddings for a single token, we loop over the layers of hidden_states. We select an example sentence with isent, a token within that sentence with itok, and all 768 scalar values of the embedding vector with the colon indexer.

For this example, we have:
 - 1 sentence
 - 7 is the sequence length
 - 13 layers, which is one input embedding *x*<sub>i</sub> + 12 encoder blocks
 - 768 is the embedding size for BERT embeddings
 
The result is a tensor of shape [1 x 7 x 13 x 768]

In [None]:
isent = 0
itok = 0

for layer in hidden_states:
    assert len(layer[isent, itok, : ]) == 768

In [None]:
torch.stack(hidden_states, dim=2).shape

torch.Size([1, 7, 13, 768])
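Once the layers are stacked, the per-layer embeddings of one token can be read out with a single index instead of a loop. The sketch below uses random tensors as a stand-in for the real BERT output, so only the shapes are meaningful.

```python
import torch

# Stand-in for the tuple returned by BERT: 13 tensors of shape [1, 7, 768].
hidden_states = tuple(torch.randn(1, 7, 768) for _ in range(13))

stacked = torch.stack(hidden_states, dim=2)   # [1, 7, 13, 768]

isent, itok = 0, 0
token_by_layer = stacked[isent, itok]         # [13, 768]: one row per layer

# Row i is the same vector the layer-by-layer loop would have picked out.
for i, layer in enumerate(hidden_states):
    assert torch.equal(token_by_layer[i], layer[isent, itok, :])

print(token_by_layer.shape)
```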

## Get Data

> A large corpus of books


In [None]:
from datasets import load_dataset, logging as datasets_logging # alias avoids shadowing the transformers logging module
datasets_logging.set_verbosity_error()

ds = load_dataset('bookcorpus', split='train[:5000]')
ds

Dataset({
    features: ['text'],
    num_rows: 5000
})

In [None]:
# What is the max sequence length?
max([len(tokenizer(s)['input_ids']) for s in ds['text']])

66

## Process data

Construct an embed_array of shape [num_sentences x 66 x 13 x 768]

Where Y is the lookup embed stored at embed_array[ : , : , 0 , : ]  
And X is any of the 0 < i <= 12 intermediate embeds at embed_array[ : , : , i , : ]

In [None]:
import numpy as np

embed_array = np.zeros((5000, 66, 13, 768))
embed_array.shape

(5000, 66, 13, 768)

In [None]:
embed_array = torch.zeros((5000, 66, 13, 768))
embed_array.shape

torch.Size([5000, 66, 13, 768])

In [None]:
for i, sample in enumerate(tqdm(ds['text'])):
    inputs = tokenizer(sample, return_tensors='pt')
    with torch.no_grad():
        hidden_states = model(**inputs,
                              output_hidden_states=True)['hidden_states']
        seq_length = hidden_states[0].shape[1]
        embed_array[i:i+1, :seq_length, : , :] = torch.stack(hidden_states, dim=2)
        

100%|██████████| 5000/5000 [01:31<00:00, 54.54it/s]


In [None]:
torch.save(embed_array, '../data/bookcorpus_embeddings_0_5000.pt')

## Data Splits

We will need an array of example embeddings paired with target lookup embeddings. We can ignore most of the data structure.

In [None]:
embed_array = torch.load('../data/bookcorpus_embeddings_0_5000.pt')

In [None]:
def get_X_y(array, layer):
    """Pair each token's embedding at `layer` (X) with its lookup embedding at layer 0 (y)."""
    assert isinstance(layer, int)
    X = torch.flatten(array[:, :, layer, :], end_dim=1) # generated embeddings in specified layer
    y = torch.flatten(array[:, :, 0, :], end_dim=1) # lookup embedding in first layer
    print(X.shape, y.shape)
    return X, y

In [None]:
X, y = get_X_y(embed_array, 12)

torch.Size([330000, 768]) torch.Size([330000, 768])
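Because embed_array is zero-initialized and only the first seq_length positions of each sentence are filled, the 330000 flattened rows include all-zero rows for padding positions. Whether to keep them is a design decision. If they should be dropped, one possible approach is sketched below on a toy array, assuming padded rows are exactly zero in the lookup layer; `drop_padding` is a hypothetical helper, not part of the notebook.

```python
import torch


def drop_padding(X, y):
    """Keep only rows whose lookup embedding (y) is not all zeros."""
    keep = y.abs().sum(dim=1) > 0
    return X[keep], y[keep]


# Toy stand-in for embed_array: 2 sentences, max length 4, 13 layers, size 8.
toy = torch.zeros(2, 4, 13, 8)
toy[0, :3] = torch.randn(3, 13, 8)   # sentence 0 has 3 real tokens
toy[1, :2] = torch.randn(2, 13, 8)   # sentence 1 has 2 real tokens

X = torch.flatten(toy[:, :, 12, :], end_dim=1)   # [8, 8], padding included
y = torch.flatten(toy[:, :, 0, :], end_dim=1)
X_real, y_real = drop_padding(X, y)
print(X.shape, X_real.shape)
```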


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
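Note that splitting the flattened rows assigns individual tokens, not sentences, to train and test, so tokens from the same sentence can land on both sides. A sketch of a sentence-level 70/15/15 split is given below, under the assumption that the first axis of the array indexes sentences; `split_by_sentence` and its arguments are hypothetical names for this example.

```python
import torch


def split_by_sentence(array, layer, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split sentences (axis 0) into train/val/test, then flatten tokens per split."""
    n = array.shape[0]
    generator = torch.Generator().manual_seed(seed)
    order = torch.randperm(n, generator=generator)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    parts = {
        'train': order[:n_train],
        'val': order[n_train:n_train + n_val],
        'test': order[n_train + n_val:],
    }
    splits = {}
    for name, idx in parts.items():
        subset = array[idx]                                   # [n_split, seq, layers, dim]
        X = torch.flatten(subset[:, :, layer, :], end_dim=1)  # embeddings at `layer`
        y = torch.flatten(subset[:, :, 0, :], end_dim=1)      # lookup embeddings
        splits[name] = (X, y)
    return splits


toy = torch.randn(20, 4, 13, 8)   # 20 toy sentences of length 4
splits = split_by_sentence(toy, layer=12)
for name, (X, y) in splits.items():
    print(name, tuple(X.shape), tuple(y.shape))
```

With the real array this would be called as `split_by_sentence(embed_array, layer=12)`, giving 3500, 750, and 750 sentences for 5000 inputs.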

## Classifier

Multilayer Perceptron

> The linear perceptron and MLP are both trained by either minimizing the L2 or cosine distance loss using the ADAM optimizer (Kingma & Ba, 2015) with a learning rate of α = 0.0001, β1 = 0.9 and β2 = 0.999. We use a batch size of 256. We monitor performance on the validation set and stop training if there is no improvement for 20 epochs. The input and output dimension of the models is d = 768; the dimension of the contextual word embeddings. For both models we performed a learning rate search over the values α ∈ [0.003, 0.001, 0.0003, 0.0001, 0.00003, 0.00001, 0.000003]. The weights are initialized with the Glorot Uniform initializer (Glorot & Bengio, 2010). The MLP has one hidden layer with 1000 neurons and uses the gelu activation function (Hendrycks & Gimpel, 2016), following the feed-forward layers in BERT and GPT. We chose a hidden layer size of 1000 in order to avoid a bottleneck. We experimented with using a larger hidden layer of size 3072 and adding dropout to more closely match the feed-forward layers in BERT. This only resulted in increased training times and we hence deferred from further architecture search. We split the data by sentences into train/validation/test according to a 70/15/15 split. This way of splitting the data ensures that the models have never seen the test sentences (i.e., contexts) during training. In order to get a more robust estimate of performance we perform the experiments in Figure 2a using 10-fold cross validation. The variance, due to the random assignment of sentences to train/validation/test sets, is small, and hence not shown.  
> -- <cite>Brunner et al. 2020</cite>

In [None]:
random_seed = 42

brunner_param_dict = {
    'hidden_layer_sizes': (1000,), # Brunner et al.
    'activation': 'relu', # Brunner et al. used 'gelu,' but that is not implemented in sklearn
    'solver': 'adam', # Brunner et al.
    'alpha': 0.0001, # L2 regularization term, default setting.
    'batch_size': 256, # Brunner et al. (must be an int, not a string)
    'learning_rate': 'constant', # Brunner et al.
    'learning_rate_init': 0.0001, # Brunner et al.
    'max_iter': 999, # we want to stop after n_iter_no_change, not max_iter, so this is just a high value.
    'shuffle': True,
    'random_state': random_seed,
    'tol': 0.0001, # amount by which performance must improve to reset the n_iter_no_change counter, so this is just a small value.
    'verbose': False,
    'warm_start': False,
    'nesterovs_momentum': True,
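    'early_stopping': False, # with False, stopping is judged on the training loss; True would monitor a held-out validation score instead
    'validation_fraction': 0.1, # only used when early_stopping is True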
    'beta_1': 0.9, # Brunner et al.
    'beta_2': 0.999, # Brunner et al.
    'epsilon': 1e-08, # Default for Adam optimizer
    'n_iter_no_change': 20, # Brunner et al.
}

In [54]:
class TokenIdentifier(torch.nn.Module):
    """MLP that maps a contextual embedding back to a token embedding."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(self.input_size, self.hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(self.hidden_size, self.hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(self.hidden_size, self.hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(self.hidden_size, self.input_size),
        )

    def forward(self, x):
        return self.layers(x)

In [None]:
probe = TokenIdentifier(input_size=768, hidden_size=1000) # sizes from Brunner et al.: d = 768, hidden layer of 1000
optimizer = torch.optim.Adam(probe.parameters(), lr=0.001) # optimize the probe, not the BERT model
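A training loop along the lines of the quoted setup might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: it uses an L2 (MSE) loss, Adam with α = 0.0001, batch size 256, Glorot uniform initialization, and a patience of 20 epochs as in the quote, but trains a small one-hidden-layer probe on synthetic data so that it runs quickly. The sizes `DIM` and `HIDDEN`, the cap `MAX_EPOCHS`, and the synthetic targets are all made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Synthetic stand-ins for the real (X, y) pairs; the real dimension would be 768.
DIM, HIDDEN = 16, 32
true_map = torch.randn(DIM, DIM)
X_train = torch.randn(512, DIM)
y_train = X_train @ true_map
X_val = torch.randn(128, DIM)
y_val = X_val @ true_map

# One hidden layer with GELU, as described in the quoted passage.
probe = torch.nn.Sequential(
    torch.nn.Linear(DIM, HIDDEN),
    torch.nn.GELU(),
    torch.nn.Linear(HIDDEN, DIM),
)
for module in probe:
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(module.weight)  # Glorot uniform

optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4, betas=(0.9, 0.999))
loss_fn = torch.nn.MSELoss()
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=256, shuffle=True)

PATIENCE, MAX_EPOCHS = 20, 100
best_val, epochs_without_improvement = float('inf'), 0
history = []  # validation loss after each epoch

for epoch in range(MAX_EPOCHS):
    probe.train()
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(probe(xb), yb).backward()
        optimizer.step()

    probe.eval()
    with torch.no_grad():
        val_loss = loss_fn(probe(X_val), y_val).item()
    history.append(val_loss)

    # Stop once the validation loss has not improved for PATIENCE epochs.
    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= PATIENCE:
            break

print(f"epochs run: {len(history)}, best validation loss: {best_val:.4f}")
```

Swapping in TokenIdentifier and the real X and y tensors would follow the same pattern. Moving the probe and each batch to DEVICE is left out here to keep the sketch short.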


## Results

In [None]:
import seaborn as sns
sns.set_theme(font='Liberation Serif',
              rc={'figure.figsize': (7.5,3.75),
                  'font.size': 11,
                 })

import pandas as pd
