In [1]:
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt -O /tmp/bert-base-uncased-vocab.txt

--2020-12-04 09:25:00--  https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.104.253
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.104.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231508 (226K) [text/plain]
Saving to: ‘/tmp/bert-base-uncased-vocab.txt’


2020-12-04 09:25:00 (3.87 MB/s) - ‘/tmp/bert-base-uncased-vocab.txt’ saved [231508/231508]



In [2]:
from tokenizers import BertWordPieceTokenizer
from tokenizers.tools import EncodingVisualizer


In [3]:
EncodingVisualizer.unk_token_regex.search("aaa[udsnk]aaa")

In [4]:
text = """Mathias Bynens 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞': Whenever you’re working on a piece of JavaScript code that deals with strings or regular expressions in some way, just add a unit test that contains a pile of poo (💩) in a string, 💩💩💩💩💩💩💩💩💩💩💩💩 and see if anything breaks. It’s a quick, fun, and easy way to see if your code supports astral symbols. Once you’ve found a Unicode-related bug in your code, all you need to do is apply the techniques discussed in this post to fix it."""

In [5]:
tokenizer = BertWordPieceTokenizer("/tmp/bert-base-uncased-vocab.txt", lowercase=True)
visualizer = EncodingVisualizer(tokenizer=tokenizer)

## Visualizing Tokens With No Annotations

In [6]:
visualizer(text)

## Visualizing Tokens With Aligned Annotations
First, we create some annotations with the `Annotation` class. Each one takes a `start` offset, an `end` offset, and a `label`.

In [7]:
from tokenizers.tools import Annotation

In [8]:
anno1 = Annotation(start=0, end=2, label="foo")
anno2 = Annotation(start=2, end=4, label="bar")
anno3 = Annotation(start=6, end=8, label="poo")
anno4 = Annotation(start=9, end=12, label="shoe")
annotations = [
    anno1,
    anno2,
    anno3,
    anno4,
    Annotation(start=23, end=30, label="random tandem bandem sandem landem fandom"),
    Annotation(start=63, end=70, label="foo"),
    Annotation(start=80, end=95, label="bar"),
    Annotation(start=120, end=128, label="bar"),
    Annotation(start=152, end=155, label="poo"),
]



In [9]:
visualizer(text, annotations=annotations)
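To sanity-check annotation offsets before visualizing them, it can help to print the text each span covers. The sketch below assumes the offsets are character indices with an exclusive `end`, as in Python slicing. `Span` and `covered_text` are hypothetical stand-ins introduced here so the example runs without the `tokenizers` package, and `sample` is just the first two words of `text`.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in with the same fields as Annotation."""
    start: int
    end: int
    label: str


def covered_text(text: str, span: Span) -> str:
    # Assumption: offsets are character indices and `end` is exclusive.
    return text[span.start:span.end]


sample = "Mathias Bynens"
spans = [
    Span(0, 2, "foo"),
    Span(2, 4, "bar"),
    Span(6, 8, "poo"),
    Span(9, 12, "shoe"),
]

# Pair each label with the slice of text it covers.
covered = [(span.label, covered_text(sample, span)) for span in spans]
for label, piece in covered:
    print(f"{label!r} covers {piece!r}")
```

If the assumption holds, a span that starts or ends in the middle of a word will cut across token boundaries, which is exactly the case the visualizer makes visible.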

## Using A Custom Annotation Format
Every system has its own representation of annotations, so the `EncodingVisualizer` can be instantiated with a conversion function that maps each custom annotation to an `Annotation`.

In [10]:
funnyAnnotations = [dict(startPlace=i,endPlace=i+3,theTag=str(i)) for i in range(0,20,4)]
funnyAnnotations

[{'startPlace': 0, 'endPlace': 3, 'theTag': '0'},
 {'startPlace': 4, 'endPlace': 7, 'theTag': '4'},
 {'startPlace': 8, 'endPlace': 11, 'theTag': '8'},
 {'startPlace': 12, 'endPlace': 15, 'theTag': '12'},
 {'startPlace': 16, 'endPlace': 19, 'theTag': '16'}]

In [11]:
# Map one custom annotation dict to the Annotation type the visualizer expects.
def converter(funny):
    return Annotation(start=funny['startPlace'], end=funny['endPlace'], label=funny['theTag'])
visualizer = EncodingVisualizer(tokenizer=tokenizer, default_to_notebook=True, annotation_converter=converter)

In [12]:
visualizer(text, annotations=funnyAnnotations)
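A converter is also a convenient place to validate incoming records before they reach the visualizer. The following is a minimal sketch of that idea, not part of the library: `Span` is a hypothetical stand-in for `Annotation` so the example runs on its own, and `convert` with its validation rule (non-negative `start`, strictly smaller than `end`) is an assumption chosen for illustration.

```python
from typing import Any, Mapping, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in with the same fields as Annotation."""
    start: int
    end: int
    label: str


def convert(record: Mapping[str, Any]) -> Span:
    """Convert one custom record, rejecting empty or negative ranges."""
    start = int(record["startPlace"])
    end = int(record["endPlace"])
    if not 0 <= start < end:
        raise ValueError(f"invalid span: start={start}, end={end}")
    return Span(start=start, end=end, label=str(record["theTag"]))


# Same shape as the custom annotations built earlier in the notebook.
records = [dict(startPlace=i, endPlace=i + 3, theTag=str(i)) for i in range(0, 20, 4)]
converted = [convert(record) for record in records]
print(converted[0])
```

In the notebook itself, the same function body could return an `Annotation` instead of `Span` and be passed as `annotation_converter`.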

## Trying with RoBERTa


In [13]:
!wget "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json" -O /tmp/roberta-base-vocab.json
!wget "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt" -O /tmp/roberta-base-merges.txt


--2020-12-04 09:25:00--  https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.226.19
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.226.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 898823 (878K) [application/json]
Saving to: ‘/tmp/roberta-base-vocab.json’


2020-12-04 09:25:00 (4.35 MB/s) - ‘/tmp/roberta-base-vocab.json’ saved [898823/898823]

--2020-12-04 09:25:00--  https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.104.253
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.104.253|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [text/plain]
Saving to: ‘/tmp/roberta-base-merges.txt’


2020-12-04 09:25:01 (4.04 MB/s) - ‘/tmp/

In [14]:
from tokenizers import ByteLevelBPETokenizer
roberta_tokenizer = ByteLevelBPETokenizer.from_file('/tmp/roberta-base-vocab.json', '/tmp/roberta-base-merges.txt')
roberta_visualizer = EncodingVisualizer(tokenizer=roberta_tokenizer, default_to_notebook=True)
roberta_visualizer(text, annotations=annotations)