
Vocabulary list of GPT-4o (o200k_base) and GPT-4/GPT-3.5 (cl100k_base) tokenizers. Special tokens are excluded.


kaisugi/gpt4_vocab_list


GPT-4 Vocab List

o200k_base

import base64
import requests


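# Download the public o200k_base vocabulary file.
res = requests.get("https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken")
res.raise_for_status()
contents = res.content

# Each non-empty line is "<base64-encoded token> <rank>".
for token, rank in (line.split() for line in contents.splitlines() if line):
    decoded_token = base64.b64decode(token)

    try:
        # Print tokens that are valid UTF-8 as quoted strings.
        print(repr(decoded_token.decode('utf-8')))
    except UnicodeDecodeError:
        # Some tokens are partial UTF-8 sequences; print those as raw bytes.
        print(decoded_token)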

cl100k_base

import base64
import requests


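# Download the public cl100k_base vocabulary file.
res = requests.get("https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken")
res.raise_for_status()
contents = res.content

# Each non-empty line is "<base64-encoded token> <rank>".
for token, rank in (line.split() for line in contents.splitlines() if line):
    decoded_token = base64.b64decode(token)

    try:
        # Print tokens that are valid UTF-8 as quoted strings.
        print(repr(decoded_token.decode('utf-8')))
    except UnicodeDecodeError:
        # Some tokens are partial UTF-8 sequences; print those as raw bytes.
        print(decoded_token)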
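
The two scripts differ only in the URL, so the parsing step can be factored out and tried offline. The following is a minimal sketch, not part of the original repository: `parse_vocab` and `format_token` are hypothetical helper names, and `sample` is a hand-made three-line stand-in that only assumes the same "base64 token, space, rank" layout the scripts above rely on.

```python
import base64


def parse_vocab(contents: bytes) -> dict:
    """Map each token's raw bytes to its integer rank."""
    vocab = {}
    for line in contents.splitlines():
        if not line:
            continue  # skip blank lines
        token_b64, rank = line.split()
        vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab


def format_token(raw: bytes) -> str:
    """Return the repr of the decoded text, or of the raw bytes if not valid UTF-8."""
    try:
        return repr(raw.decode("utf-8"))
    except UnicodeDecodeError:
        return repr(raw)


# Hypothetical sample (not real vocabulary data): "!", " the", and a
# truncated multi-byte UTF-8 sequence that cannot be decoded on its own.
sample = b"IQ== 0\nIHRoZQ== 1\n4pY= 2\n"
vocab = parse_vocab(sample)

# Print the tokens in rank order, one per line.
for raw, rank in sorted(vocab.items(), key=lambda kv: kv[1]):
    print(rank, format_token(raw))
```

To run this against a real vocabulary, pass `res.content` from either script above to `parse_vocab` in place of `sample`.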
