<a href="https://colab.research.google.com/github/Taaniya/exploring-gpt2-language-model/blob/main/Visualizing_gpt2_token_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook visualizes the word token embeddings of GPT-2 in the TensorBoard Embedding Projector.

In [1]:
#! pip install transformers

In [2]:
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
import tensorflow as tf
from tensorboard.plugins import projector

import re
import os
from tqdm import tqdm

In [5]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

word_embeddings = model.transformer.wte.weight      # token embeddings, one row per vocabulary entry
position_embeddings = model.transformer.wpe.weight  # position embeddings, one row per position

In [6]:
print(word_embeddings.shape)

print(position_embeddings.shape)

torch.Size([50257, 768])
torch.Size([1024, 768])
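
The shapes show one 768-dimensional row per vocabulary token (50257 rows) and one per position (1024 rows). In GPT-2, the vector fed to the first transformer block is the sum of the token's row and the row for its position. Below is a minimal sketch of that lookup with toy sizes and plain Python lists. The names `toy_wte`, `toy_wpe` and `embed` are hypothetical and the values are invented for illustration.

```python
# Toy stand-ins for wte (vocab_size x dim) and wpe (max_positions x dim).
# The numbers are made up; GPT-2's real shapes are 50257 x 768 and 1024 x 768.
toy_wte = [
    [0.1, 0.2, 0.3],   # token id 0
    [0.4, 0.5, 0.6],   # token id 1
    [0.7, 0.8, 0.9],   # token id 2
]
toy_wpe = [
    [1.0, 0.0, 0.0],   # position 0
    [0.0, 1.0, 0.0],   # position 1
]


def embed(token_ids, wte, wpe):
    """Return one vector per input token: wte[token_id] + wpe[position]."""
    return [
        [t + p for t, p in zip(wte[tok], wpe[pos])]
        for pos, tok in enumerate(token_ids)
    ]


# Token 2 at position 0, then token 0 at position 1.
inputs = embed([2, 0], toy_wte, toy_wpe)
print(inputs)
```

This notebook only visualizes the token matrix, so the position rows play no part in the projector view.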


In [7]:
# Create the logging directory that TensorBoard will read from.
log_dir = './logs/vocab/'

os.makedirs(log_dir, exist_ok=True)

In [8]:
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

In [9]:
tokenizer.pretrained_vocab_files_map

{'vocab_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/vocab.json',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/vocab.json',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/vocab.json',
  'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/vocab.json',
  'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/vocab.json'},
 'merges_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/merges.txt',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/merges.txt',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/merges.txt',
  'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/merges.txt',
  'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/merges.txt'},
 'tokenizer_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/tokenizer.json',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/ma

In [10]:
tokenizer.vocab

{'Ð°': 16142,
 'Ġlocation': 4067,
 'Ġembrace': 12553,
 'izarre': 12474,
 'Ġcannabinoids': 46830,
 'xxxx': 12343,
 'Ġ4090': 48908,
 'Ġphysician': 14325,
 'agonists': 36764,
 'fac': 38942,
 'Ġpremier': 18256,
 'ĠFormer': 14466,
 'vironments': 12103,
 'isively': 42042,
 'R': 49,
 'Ġflexible': 12846,
 'Ġencoded': 30240,
 'component': 42895,
 'ĠWorlds': 19946,
 'Ġstaffing': 36700,
 'alk': 971,
 'Loop': 39516,
 'Tel': 33317,
 'uese': 20506,
 'Ġbloated': 45709,
 'ches': 2052,
 'Ġintroducing': 16118,
 'Ġequations': 27490,
 'Station': 12367,
 'Ġinstrument': 8875,
 'ĠDeer': 34022,
 'single': 29762,
 '696': 38205,
 'Ġgrit': 34954,
 'ĠArrows': 43946,
 'Mot': 47733,
 'Ġportal': 17898,
 'Ġhyster': 24258,
 'Ġrevenues': 13089,
 'ible': 856,
 'Ġtrim': 15797,
 'Ġlandsc': 25227,
 'ĠPens': 38740,
 'ĠUk': 5065,
 'Ġbeloved': 14142,
 'dt': 28664,
 'ĠHR': 15172,
 'ĠSupreme': 5617,
 'ĠKel': 15150,
 'Value': 11395,
 'Ġembedded': 14553,
 'ĠBattle': 5838,
 'Ġreinforced': 23738,
 'Ġflung': 45111,
 'DP': 6322,
 'Ġi

**Create a list of the vocabulary tokens sorted by their index**

In [11]:
vocab_list = sorted(tokenizer.vocab.items(), key=lambda x:x[1])

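**Verify that the resulting list is sorted by token index.** Iterating over `tokenizer.vocab` directly returns the tokens in arbitrary order, whereas `vocab_list` should start at index 0 and increase.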

In [12]:
for k,v in tokenizer.vocab.items():
    if v < 10:
        print(k, v)

% 4
& 5
* 9
( 7
' 6
" 1
) 8
! 0
$ 3
# 2


In [13]:
vocab_list[:10]

[('!', 0),
 ('"', 1),
 ('#', 2),
 ('$', 3),
 ('%', 4),
 ('&', 5),
 ("'", 6),
 ('(', 7),
 (')', 8),
 ('*', 9)]
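
Looking at the first ten entries is a spot check. The projector matches metadata line `i` to embedding row `i`, so it helps to confirm that every index equals its position in the list. Below is a minimal sketch of such a check on a toy vocabulary. The helper name `is_row_aligned` and the toy data are hypothetical.

```python
def is_row_aligned(sorted_items):
    """True when the i-th (token, index) pair has index i, for every i."""
    return all(idx == row for row, (_, idx) in enumerate(sorted_items))


# Toy vocabulary standing in for tokenizer.vocab (a dict in arbitrary order).
toy_vocab = {"#": 2, "!": 0, '"': 1}
toy_sorted = sorted(toy_vocab.items(), key=lambda item: item[1])

print(is_row_aligned(toy_sorted))               # list sorted by index
print(is_row_aligned(list(toy_vocab.items())))  # original dict order
```

In this notebook the same check would read `assert is_row_aligned(vocab_list)`.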

**Save the sorted token labels as the metadata file**

In [15]:
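# Write one label per line, in index order, so that line i labels embedding row i.
with open(os.path.join(log_dir, 'metadata.tsv'), "w") as f:
    for word, idx in tqdm(vocab_list):
        # str() of the Latin-1 bytes gives a repr such as "b'abc'";
        # characters outside Latin-1 (for example 'Ġ') become '?'.
        line = str(word.encode(encoding='iso-8859-1', errors='replace'))
        # Strip the leading b' or b" and the trailing quote of the repr.
        line = re.sub("^b'", "", line)
        line = re.sub('^b"', "", line)
        line = re.sub("'$", "", line)
        line = re.sub('"$', '', line)
        f.write("{}\n".format(line))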

100%|██████████████████████████████████| 50257/50257 [00:00<00:00, 60062.06it/s]
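
The cell above encodes each token as Latin-1 with `errors='replace'`, so the `Ġ` marker that GPT-2's byte-level BPE uses for a leading space is written as `?`. An alternative, sketched here and not what this notebook does, is to map each token back to its raw bytes and decode those as UTF-8. The table is rebuilt following the published GPT-2 byte-to-unicode scheme. The helper `readable_label` is hypothetical, and it assumes every character of a token appears in that table.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable character table of GPT-2's byte-level BPE."""
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Bytes without a printable form are shifted to code points >= 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def readable_label(token):
    """Turn a vocabulary token such as 'Ġlocation' into text safe for one TSV line."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    text = raw.decode("utf-8", errors="replace")
    # Escape characters that would break a one-label-per-line file.
    return (
        text.replace("\\", "\\\\")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


print(repr(readable_label("Ġlocation")))
```

Tokens that hold only part of a multi-byte character decode to the Unicode replacement character, which is why `errors="replace"` is kept here.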


**Save the word embeddings**

In [None]:
embeddings = tf.Variable(model.transformer.wte.weight.detach().numpy())
checkpoint = tf.train.Checkpoint(embedding=embeddings)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

'./logs/vocab/embedding.ckpt-1'
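
The checkpoint is one way to hand the vectors to the projector. The projector configuration also has a `tensor_path` field that can point at a tab-separated file of vectors, which avoids the TensorFlow checkpoint. Below is a minimal sketch of writing such a file with toy values. The helper `write_vectors_tsv`, the file name `tensors.tsv` and the temporary directory are assumptions for illustration, and this notebook itself uses the checkpoint route.

```python
import os
import tempfile


def write_vectors_tsv(rows, path):
    """Write one embedding per line, with components separated by tabs."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(str(float(x)) for x in row) + "\n")


# Toy rows standing in for the 50257 x 768 embedding matrix.
toy_rows = [[0.1, -0.2, 0.3], [1.0, 2.0, 3.0]]
out_dir = tempfile.mkdtemp()
tsv_path = os.path.join(out_dir, "tensors.tsv")
write_vectors_tsv(toy_rows, tsv_path)
print(tsv_path)
```

With the real matrix, `rows` could be `word_embeddings.detach().numpy()`, and the row order must still match the lines of `metadata.tsv`.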

Finally, set up the TensorBoard projector configuration. `projector.visualize_embeddings` writes it to a `projector_config.pbtxt` file in the logging directory.

In [None]:
# Set up config.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()

# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`.
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)

Run TensorBoard to visualize the embeddings. Use UMAP for faster and cleaner visualizations. Search for a few keywords and inspect their nearest neighbours, both in the 3D space and in the drop-down list.

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs/vocab/
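
The nearest-neighbour list in the projector ranks the other points by their distance to the selected one, and cosine distance is one of the options it offers. Below is a minimal sketch of that ranking in plain Python. The labels and 3-dimensional vectors are invented for illustration, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbours(query, labels, vectors, k=3):
    """Return the k labels whose vectors are closest to the query's vector."""
    q = vectors[labels.index(query)]
    scored = [
        (cosine_distance(q, vec), label)
        for label, vec in zip(labels, vectors)
        if label != query
    ]
    return [label for _, label in sorted(scored)[:k]]


# Toy data standing in for the metadata labels and the embedding rows.
toy_labels = ["king", "queen", "apple", "pear"]
toy_vectors = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.1, 0.2, 0.9],
    [0.2, 0.1, 0.9],
]

print(nearest_neighbours("king", toy_labels, toy_vectors, k=2))
```

On the real 50257 x 768 matrix a vectorized implementation would be preferable; the loop form is kept here for readability.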

#### References

1. [Tensorboard embedding projector](https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin)

2. https://towardsdatascience.com/how-to-visualize-text-embeddings-with-tensorboard-47e07e3a12fb

3. https://github.com/huggingface/transformers/issues/1458