Commit 5daf1ee

Python - Replace last BPETokenizer occurrences

n1t0 committed Feb 18, 2020
1 parent f263d76 commit 5daf1ee
Showing 4 changed files with 8 additions and 8 deletions.
README.md: 2 changes (1 addition & 1 deletion)
@@ -42,7 +42,7 @@ Start using in a matter of seconds:
```python
# Tokenizers provides ultra-fast implementations of most current tokenizers:
>>> from tokenizers import (ByteLevelBPETokenizer,
-                            BPETokenizer,
+                            CharBPETokenizer,
SentencePieceBPETokenizer,
BertWordPieceTokenizer)
# Ultra-fast => they can encode 1GB of text in ~20sec on a standard server's CPU
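
After this change, the README's import list matches what the package exports. The updated stub below suggests the old name was removed outright rather than kept as an alias; under that assumption, a minimal sketch of the effect at import time is:

```python
# Illustration only: the class was renamed, so the old top-level name
# should no longer be importable. New code should just import
# CharBPETokenizer directly.
try:
    from tokenizers import BPETokenizer  # old name, presumably removed
except ImportError:
    from tokenizers import CharBPETokenizer  # current name
```
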
bindings/node/README.md: 2 changes (1 addition & 1 deletion)
@@ -55,7 +55,7 @@ console.log(wpEncoded.getTypeIds());

## Provided Tokenizers

-- `BPETokenizer`: The original BPE
+- `CharBPETokenizer`: The original BPE
- `ByteLevelBPETokenizer`: The byte level version of the BPE
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
bindings/python/README.md: 10 changes (5 additions & 5 deletions)
@@ -73,12 +73,12 @@ python setup.py install
Using a pre-trained tokenizer is really simple:

```python
-from tokenizers import BPETokenizer
+from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
-tokenizer = BPETokenizer(vocab, merges)
+tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
@@ -89,10 +89,10 @@ print(encoded.tokens)
And you can train yours just as simply:

```python
-from tokenizers import BPETokenizer
+from tokenizers import CharBPETokenizer

# Initialize a tokenizer
-tokenizer = BPETokenizer()
+tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
@@ -106,7 +106,7 @@ tokenizer.save("./path/to/directory", "my-bpe")

### Provided Tokenizers

-- `BPETokenizer`: The original BPE
+- `CharBPETokenizer`: The original BPE
- `ByteLevelBPETokenizer`: The byte level version of the BPE
- `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece
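
Taken together, the two updated README snippets make a full round trip. A combined sketch using only the calls shown in the hunks above (the file paths are the README's own placeholders):

```python
from tokenizers import CharBPETokenizer

# Train a character-level BPE tokenizer from scratch on plain-text files
# (placeholder paths taken from the README).
tokenizer = CharBPETokenizer()
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Use it right away...
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)

# ...and persist the learned vocab/merges so the tokenizer can later be
# reloaded via CharBPETokenizer(vocab, merges).
tokenizer.save("./path/to/directory", "my-bpe")
```
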
bindings/python/tokenizers/__init__.pyi: 2 changes (1 addition & 1 deletion)
@@ -7,7 +7,7 @@ from .trainers import *

from .implementations import (
ByteLevelBPETokenizer as ByteLevelBPETokenizer,
-    BPETokenizer as BPETokenizer,
+    CharBPETokenizer as CharBPETokenizer,
SentencePieceBPETokenizer as SentencePieceBPETokenizer,
BertWordPieceTokenizer as BertWordPieceTokenizer,
)
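
The stub change mirrors the runtime re-export: the implementation class is reachable both from the package root and from the `tokenizers.implementations` submodule. A quick sanity check, assuming an installed version from after this commit:

```python
import tokenizers
from tokenizers.implementations import CharBPETokenizer

# The top-level name is a re-export of the implementation class,
# so both import paths resolve to the same object.
assert tokenizers.CharBPETokenizer is CharBPETokenizer
```
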
