A lot of us used BERT embeddings in this competition. I think I'm not wrong when I say that, for most of us, it's a bit mysterious precisely what information about the text gets encoded in BERT's layers. So I decided to make some visualizations of the embeddings, using a dimensionality reduction algorithm called [t-SNE](https://lvdmaaten.github.io/tsne/). Here's a sketch of the procedure:
1. For BERT-large, each token (a word or a piece of a word) is represented as a vector in a 1024-dimensional space;
2. Using a good old PCA, we reduce this to a 50-dimensional space, hopefully without losing too much information;
3. Using PCA to further reduce from 50 to 2 dimensions would probably kill a lot of useful information, so we want a more refined method. What t-SNE does, roughly, is create vectors in a 2-dimensional space, such that if two vectors have small distance in the 50-dimensional space, they also have small distance in the 2-dimensional space. We get a 2-dimensional plot, which offers a little insight into how BERT embeddings for various words are distributed.
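
The three steps above can be sketched on synthetic data. This is only an illustration under assumptions: the random matrix stands in for real BERT-large embeddings, and the sample count (120) and seeds are arbitrary choices, not values from the experiments in this kernel.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for BERT-large token embeddings: 120 random vectors of size 1024.
# Real embeddings would come from the model; these are synthetic.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(120, 1024))

# Step 2: PCA from 1024 down to 50 dimensions.
reduced_50 = PCA(n_components=50, random_state=0).fit_transform(embeddings)

# Step 3: t-SNE from 50 down to 2 dimensions.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
points_2d = tsne.fit_transform(reduced_50)

print(embeddings.shape, reduced_50.shape, points_2d.shape)
```

Note that PCA with 50 components needs at least 50 samples, so very short texts would need a smaller `n_components`.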

I have a concrete question that I'd like to answer using these plots. I think a lot of people noticed that you can concatenate different layers of BERT, not necessarily the last ones. For my team, what worked best was concatenating layers -4, -5, -6. We will talk more about our solution elsewhere. But just to give you an idea, here are some experiments which I did with the model from [my previous kernel](https://www.kaggle.com/mateiionita/taming-the-bert-a-baseline). After replacing BERT-base with BERT-large, and concatenating embeddings coming from two layers only, I get the following results:

With layers -5, -6:
CV mean score: 0.4666, std: 0.0278.
Test score: 0.41730251922932554

With layers -3, -4:
CV mean score: 0.4929, std: 0.0267.
Test score: 0.45579418221937407

With layers -1, -2:
CV mean score: 0.5311, std: 0.0205.
Test score: 0.49026846792574713

Lower scores are better here, so it's pretty clear that layers -5, -6 are much better suited for this problem than the last 4. So in the graphs below, I took the first 10 examples from gap-development, and I'm plotting the result of t-SNE for layer -1, and separately for layer -5. Hopefully staring long enough at plots like these can reveal something about the different ways in which BERT layers encode information.
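
To put a number on the gap, here is a small sketch that ranks the three layer pairs by the test scores listed above. It assumes the score is a loss, so smaller is better; the dictionary layout and variable names are just for illustration.

```python
# Scores reported above, keyed by the pair of concatenated layers.
results = {
    (-5, -6): {"cv_mean": 0.4666, "cv_std": 0.0278, "test": 0.41730251922932554},
    (-3, -4): {"cv_mean": 0.4929, "cv_std": 0.0267, "test": 0.45579418221937407},
    (-1, -2): {"cv_mean": 0.5311, "cv_std": 0.0205, "test": 0.49026846792574713},
}

# Assumption: the score is a loss, so the smallest value wins.
ranked = sorted(results, key=lambda layers: results[layers]["test"])
best, worst = ranked[0], ranked[-1]

# Relative reduction in test score when going from the worst to the best pair.
relative_gain = 1 - results[best]["test"] / results[worst]["test"]
print(best, worst, f"{relative_gain:.1%}")
```

Under that assumption, layers -5, -6 cut the test score by roughly 15% relative to layers -1, -2.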

In [1]:
import numpy as np, pandas as pd 
import os
import zipfile

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy import stats

import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

Reading just 10 examples from the gap-development file.

In [2]:
!wget https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-development.tsv
nrows = 10
data = pd.read_csv("gap-development.tsv", sep = '\t', nrows = nrows)

--2022-08-04 11:12:44--  https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-development.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1080993 (1.0M) [text/plain]
Saving to: ‘gap-development.tsv’


2022-08-04 11:12:45 (7.39 MB/s) - ‘gap-development.tsv’ saved [1080993/1080993]



In [3]:
# Download the weights and configuration file for BERT-large (uncased)
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
with zipfile.ZipFile("uncased_L-24_H-1024_A-16.zip","r") as zip_ref:
    zip_ref.extractall()
!rm "uncased_L-24_H-1024_A-16.zip"

--2022-08-04 11:12:45--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.210.176, 216.58.209.176, 216.58.209.208, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.210.176|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1247797031 (1.2G) [application/zip]
Saving to: ‘uncased_L-24_H-1024_A-16.zip’


2022-08-04 11:13:13 (42.8 MB/s) - ‘uncased_L-24_H-1024_A-16.zip’ saved [1247797031/1247797031]



In [7]:
# Download the BERT helper scripts; extract_features.py is also run as a script further down
!wget https://raw.githubusercontent.com/google-research/bert/master/modeling.py 
!wget https://raw.githubusercontent.com/google-research/bert/master/extract_features.py 
!wget https://raw.githubusercontent.com/google-research/bert/master/tokenization.py

# These scripts import TensorFlow themselves, so it has to be installed first
import modeling
import extract_features
import tokenization
# import tensorflow as tf

ModuleNotFoundError: No module named 'tensorflow'

In [None]:
def compute_offset_no_spaces(text, offset):
    # Number of non-space characters that come before position `offset` in text
    count = 0
    for pos in range(offset):
        if text[pos] != " ": count += 1
    return count

def count_chars_no_special(text):
    # Length of text, not counting "#" (WordPiece continuation tokens start with "##")
    count = 0
    special_char_list = ["#"]
    for pos in range(len(text)):
        if text[pos] not in special_char_list: count += 1
    return count

def count_length_no_special(text):
    # Length of text, not counting "#" or spaces
    count = 0
    special_char_list = ["#", " "]
    for pos in range(len(text)):
        if text[pos] not in special_char_list: count += 1
    return count
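
To see how these helpers line up character offsets with BERT tokens, here is a small worked example. The two functions are compact re-implementations of the ones above so that the snippet runs on its own, and the token list is a hypothetical WordPiece split written by hand, not the output of the real tokenizer.

```python
def compute_offset_no_spaces(text, offset):
    # Non-space characters before position `offset`
    return sum(1 for ch in text[:offset] if ch != " ")

def count_length_no_special(text):
    # Characters other than "#" and " "
    return sum(1 for ch in text if ch not in ("#", " "))

text = "Kathleen met her"
pronoun_offset = text.index("her")  # character offset, spaces included

# Hypothetical WordPiece split (hand-written for illustration)
tokens = ["kat", "##hle", "##en", "met", "her"]

# Offset of the pronoun once spaces are ignored
target = compute_offset_no_spaces(text, pronoun_offset)

# Walk the tokens, tracking how many non-special characters precede each one
count_chars = 0
pronoun_token = None
for index, token in enumerate(tokens):
    if count_chars == target:
        pronoun_token = index
    count_chars += count_length_no_special(token)

print(target, pronoun_token, tokens[pronoun_token])
```

Ignoring spaces and `#` on both sides is what lets the offsets from the GAP file, which count every character, be compared with a running count over tokens.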

Passing the 10 GAP examples through BERT, and saving layers -1, -5

In [None]:
text = data["Text"]
text.to_csv("input.txt", index = False, header = False)

os.system("python3 extract_features.py \
  --input_file=input.txt \
  --output_file=output.jsonl \
  --vocab_file=uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --layers=-1,-5 \
  --max_seq_length=256 \
  --batch_size=8")

bert_output = pd.read_json("output.jsonl", lines = True)
os.system("rm output.jsonl")
os.system("rm input.txt")
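
For orientation, this is roughly what one line of `output.jsonl` looks like and how the per-layer vectors can be pulled out of it. The record below is hand-written: the field names follow what `extract_features.py` appears to write, the second token is arbitrary, and the vectors are cut down to 3 numbers instead of 1024.

```python
import json

# Hand-written stand-in for one line of output.jsonl (vectors truncated to 3 values)
line = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2, 0.3]},
                                      {"index": -5, "values": [0.4, 0.5, 0.6]}]},
        {"token": "she", "layers": [{"index": -1, "values": [0.7, 0.8, 0.9]},
                                    {"index": -5, "values": [1.0, 1.1, 1.2]}]},
    ],
})

record = json.loads(line)

# Collect the tokens, and one list of vectors per requested layer
tokens = [feature["token"] for feature in record["features"]]
by_layer = {}
for feature in record["features"]:
    for layer in feature["layers"]:
        by_layer.setdefault(layer["index"], []).append(layer["values"])

print(tokens)
print(sorted(by_layer))
```

The layers appear in the order given to `--layers`, which is why position 0 corresponds to layer -1 and position 1 to layer -5 in the code that follows.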

In [None]:
emb_2d = {}
for row in range(nrows):
    P = data.loc[row,"Pronoun"].lower()
    A = data.loc[row,"A"].lower()
    B = data.loc[row,"B"].lower()
    P_offset = compute_offset_no_spaces(data.loc[row,"Text"], data.loc[row,"Pronoun-offset"])
    A_offset = compute_offset_no_spaces(data.loc[row,"Text"], data.loc[row,"A-offset"])
    B_offset = compute_offset_no_spaces(data.loc[row,"Text"], data.loc[row,"B-offset"])
    # Figure out the length of A, B, not counting spaces or special characters
    A_length = count_length_no_special(A)
    B_length = count_length_no_special(B)
    
    # Get the BERT embeddings for the current line in the data file
    features = pd.DataFrame(bert_output.loc[row,"features"]) 
    
    span = range(2,len(features)-2)
    emb1, emb5 = {}, {}
    count_chars = 0
    
    # Make a list with the text of each token, to be used in the plots
    texts = []

    for j in span:
        token = features.loc[j,'token']
        texts.append(token)
        emb1[j] = np.array(features.loc[j,'layers'][0]['values'])
        emb5[j] = np.array(features.loc[j,'layers'][1]['values'])
        if count_chars == P_offset:
            texts.pop()
            texts.append("@P" + token)
        if count_chars in range(A_offset, A_offset + A_length): 
            texts.pop()
            if data.loc[row,"A-coref"]:
                texts.append("@G" + token)
            else:
                texts.append("@R" + token)
        if count_chars in range(B_offset, B_offset + B_length): 
            texts.pop()
            if data.loc[row,"B-coref"]:
                texts.append("@G" + token)
            else:
                texts.append("@R" + token)
        count_chars += count_length_no_special(token)
    
    X1 = np.array(list(emb1.values()))
    X5 = np.array(list(emb5.values()))
    if row == 0: print("Shape of embedding matrix: ", X1.shape)

    # Use PCA to reduce dimensions to a number that's manageable for t-SNE
    pca = PCA(n_components = 50, random_state = 7)
    X1 = pca.fit_transform(X1)
    X5 = pca.fit_transform(X5)
    if row == 0: print("Shape after PCA: ", X1.shape)

    # Reduce dimensionality to 2 with t-SNE.
    # Perplexity is roughly the number of close neighbors you expect a
    # point to have. Our data is sparse, so we chose a small value, 10.
    # The KL divergence objective is non-convex, so the result is different
    # depending on the seed used.
    # Note: scikit-learn 1.5 and later rename n_iter to max_iter.
    tsne = TSNE(n_components = 2, perplexity = 10, random_state = 6, 
                learning_rate = 1000, n_iter = 1500)
    X1 = tsne.fit_transform(X1)
    X5 = tsne.fit_transform(X5)
    if row == 0: print("Shape after t-SNE: ", X1.shape)
    
    # Recording the position of the tokens, to be used in the plot
    position = np.array(list(span)) 
    position = position.reshape(-1,1)
    
    X = pd.DataFrame(np.concatenate([X1, X5, position, np.array(texts).reshape(-1,1)], axis = 1), 
                     columns = ["x1", "y1", "x5", "y5", "position", "texts"])
    X = X.astype({"x1": float, "y1": float, "x5": float, "y5": float, "position": float, "texts": object})

    # Remove a few outliers based on zscore
    X = X[(np.abs(stats.zscore(X[["x1", "y1", "x5", "y5"]])) < 3).all(axis=1)]
    emb_2d[row] = X
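
The outlier filter at the end of the loop keeps a point only if all of its coordinates are within 3 standard deviations of the column mean. Here is a minimal standard-library sketch of the same idea on made-up 2-D points. It uses the population standard deviation, which matches the default of `stats.zscore`; the point values are invented for the example.

```python
from statistics import fmean, pstdev

def zscores(values):
    # Standardize a column: subtract the mean, divide by the population std
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

# 19 made-up points near the origin plus one far away on the x axis
points = [(float(i % 5), float(i % 3)) for i in range(19)] + [(100.0, 1.0)]

zx = zscores([p[0] for p in points])
zy = zscores([p[1] for p in points])

# Keep a point only if every coordinate has |z| < 3
kept = [p for p, a, b in zip(points, zx, zy) if abs(a) < 3 and abs(b) < 3]

print(len(points), len(kept))
```

One caveat: with very few points no z-score can reach 3, so the filter only has an effect on reasonably long texts.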

Finally, plot the 2-dimensional representations output by t-SNE. I labeled each datapoint by the token it represents, using blue text for the pronoun, green text for the correct coreferent, and red text for incorrect coreferents. The color of the points represents the position of the token in the sentence: blue is towards the beginning, red towards the end.

In [None]:
for row in range(nrows):
    X = emb_2d[row]
    
    # Plot for layer -1
    plt.figure(figsize = (20,15))
    p1 = sns.scatterplot(x = X["x1"], y = X["y1"], hue = X["position"], palette = "coolwarm")
    p1.set_title("development-"+str(row+1)+", layer -1")
    
    # Label each datapoint with the word it corresponds to
    for line in X.index:
        text = X.loc[line,"texts"]
        if "@P" in text:
            p1.text(X.loc[line,"x1"]+0.2, X.loc[line,"y1"], text[2:], horizontalalignment='left', 
                    size='medium', color='blue', weight='semibold')
        elif "@G" in text:
            p1.text(X.loc[line,"x1"]+0.2, X.loc[line,"y1"], text[2:], horizontalalignment='left', 
                    size='medium', color='green', weight='semibold')
        elif "@R" in text:
            p1.text(X.loc[line,"x1"]+0.2, X.loc[line,"y1"], text[2:], horizontalalignment='left', 
                    size='medium', color='red', weight='semibold')
        else:
            p1.text(X.loc[line,"x1"]+0.2, X.loc[line,"y1"], text, horizontalalignment='left', 
                    size='medium', color='black', weight='semibold')
    
    # Plot for layer -5
    plt.figure(figsize = (20,15))
    p1 = sns.scatterplot(x = X["x5"], y = X["y5"], hue = X["position"], palette = "coolwarm")
    p1.set_title("development-"+str(row+1)+", layer -5")
    
    for line in X.index:
        text = X.loc[line,"texts"]
        if "@P" in text:
            p1.text(X.loc[line,"x5"]+0.2, X.loc[line,"y5"], text[2:], horizontalalignment='left', 
                    size='medium', color='blue', weight='semibold')
        elif "@G" in text:
            p1.text(X.loc[line,"x5"]+0.2, X.loc[line,"y5"], text[2:], horizontalalignment='left', 
                    size='medium', color='green', weight='semibold')
        elif "@R" in text:
            p1.text(X.loc[line,"x5"]+0.2, X.loc[line,"y5"], text[2:], horizontalalignment='left', 
                    size='medium', color='red', weight='semibold')
        else:
            p1.text(X.loc[line,"x5"]+0.2, X.loc[line,"y5"], text, horizontalalignment='left', 
                    size='medium', color='black', weight='semibold') 

There's some useful information in these plots. Notice two reasons why points are close:
1. They represent the same word, or similar words, independently of context, such as "girlfriend" and "boyfriend" in development-1.
2. They represent tokens which have close positions in the sentence, such as "episode" and "final" in development-1.

In some cases, you can see directly from these plots that BERT has learned some information that's very useful for coreference resolution. For example, in development-5, "she" and "rivera" are very close together.
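
Closeness in a plot can also be checked numerically by ranking the other tokens by Euclidean distance in the 2-D space. The coordinates below are made up for illustration and are not the actual t-SNE output for development-5; `nearest` is a hypothetical helper, not part of the kernel.

```python
import math

# Made-up 2-D coordinates standing in for one example's t-SNE output
coords = {
    "she": (1.0, 2.0),
    "rivera": (1.5, 2.5),
    "episode": (-8.0, 4.0),
    "final": (-7.0, 4.5),
}

def nearest(token, coords):
    # Other tokens sorted by Euclidean distance to `token`, closest first
    origin = coords[token]
    others = [(math.dist(origin, xy), name)
              for name, xy in coords.items() if name != token]
    return sorted(others)

for distance, name in nearest("she", coords):
    print(f"{name}: {distance:.2f}")
```

Distances in a t-SNE plot are only loosely meaningful, so a ranking like this is a hint rather than a measurement.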

I intend to play with these tools more, and update the kernel if I have any new insights. But let me end with a disclaimer: here's a great [explanation](https://distill.pub/2016/misread-tsne/) of some of the pitfalls of t-SNE visualizations.