# 01 - Tokens

GPT models process text using *tokens*, which are common sequences of characters found in text. The models understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.

The conversion of a prompt into tokens happens automatically when you submit it, so you don't need to do anything yourself. However, services such as Azure OpenAI use the number of tokens processed as part of their pricing model; Azure OpenAI, for example, charges per 1,000 tokens. Understanding how many tokens your prompts consume is therefore an important part of planning and building any application that uses OpenAI models.
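Because billing is per 1,000 tokens, a rough cost estimate is simple arithmetic. The sketch below is illustrative only: `estimate_cost` and `HYPOTHETICAL_PRICE_PER_1K` are made-up names, and the price is a placeholder rather than a real Azure OpenAI rate.

```python
def estimate_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Return the estimated charge for processing num_tokens tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


# Placeholder price for illustration only; look up the real rate for your model.
HYPOTHETICAL_PRICE_PER_1K = 0.002

# Estimated charge for a workload of 1,500 tokens at the placeholder price.
print(f"{estimate_cost(1500, HYPOTHETICAL_PRICE_PER_1K):.4f}")
```

Prices typically differ by model, so treat the price as an input you look up rather than a constant.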

The prompt **"Hello world, this is fun!"** gets tokenized as follows:

```
Hello
 world
,
 this
 is
 fun
!

(7 tokens)
```

Notice that a leading space is part of the token that follows it, and that punctuation marks such as the comma and the exclamation mark get tokens of their own. A token doesn't necessarily correspond to a single word.

Let's try the prompt **"Example using words like indivisible and emojis"**.

```
Example
 using
 words
 like
 ind
iv
isible
 and
 em
oj
is

(11 tokens)
```
This time, some of the words (**indivisible** and **emojis**) are broken up into smaller chunks.

A helpful rule of thumb is that one token corresponds to about 4 characters of common English text, or roughly ¾ of a word (so 100 tokens ≈ 75 words).
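The rule of thumb can be turned into a quick estimator for when you don't need an exact count. This is a rough sketch using only the two ratios quoted above; the function names are hypothetical, and the result is an approximation, not what a real tokenizer returns.

```python
import math

CHARS_PER_TOKEN = 4      # about 4 characters per token for common English text
WORDS_PER_TOKEN = 0.75   # roughly 3/4 of a word per token


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count from the number of characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate the token count from the number of whitespace-separated words."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


sample = "Hello world, this is fun!"
print(estimate_tokens_from_chars(sample))  # 25 characters / 4, rounded up -> 7
print(estimate_tokens_from_words(sample))  # 5 words / 0.75, rounded up -> 7
```

The estimate drifts for text that is not common English, such as code, rare words, or other languages, so use a real tokenizer when the count matters.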

> You can experiment with this yourself using the *tokenizer* tool available on the OpenAI website at https://platform.openai.com/tokenizer

## Experimenting with tokens in code

OpenAI provides the `tiktoken` package, which you can use to experiment with tokenization in your code.

`tiktoken` supports three encodings used by Azure OpenAI Service models:

| Encoding name | Azure OpenAI Service models |
| ------------- | -------------- |
| gpt2 (or r50k_base) | Most GPT-3 models |
| p50k_base | Code models, text-davinci-002, text-davinci-003 |
| cl100k_base | text-embedding-ada-002 |

You can use `tiktoken` as follows to tokenize a string and see what the output looks like.

```python
import tiktoken

encoding = tiktoken.get_encoding("p50k_base")
encoding.encode("Hello world, this is fun!")
```

```
[15496, 995, 11, 428, 318, 1257, 0]
```

Was the output of the above code what you were expecting?

If you were expecting text broken up like the examples at the top of this page, you are probably wondering why you got back a list of seemingly random numbers. This is because the models don't work on words directly. Instead, the text is converted into numeric token IDs using a method called *BPE* (Byte Pair Encoding).

One of the features of BPE is that it's reversible, so you can convert the tokens back into the original text.
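To make the idea concrete, here is a toy byte-level BPE written from scratch. It is a teaching sketch under simplifying assumptions, not `tiktoken`'s implementation: real encodings use a large pretrained merge table and split the text with a regular expression first. All function names are hypothetical. It starts from raw UTF-8 bytes (IDs 0 to 255), repeatedly merges the most frequent adjacent pair into a new ID, and decodes by expanding each ID back into its bytes.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of IDs, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn an ordered merge list and an ID -> bytes vocabulary from text."""
    ids = list(text.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        merges.append((pair, new_id))
    return merges, vocab


def encode(text, merges):
    """Convert text to token IDs by applying the merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, vocab):
    """Convert token IDs back to text by concatenating their bytes."""
    return b"".join(vocab[i] for i in ids).decode("utf-8")


merges, vocab = train("this is fun, this is fun, this is fun!", num_merges=10)
tokens = encode("this is fun!", merges)
print(tokens)
print(decode(tokens, vocab))  # round-trips to the original text
```

Because every merged ID is defined as the concatenation of the bytes it replaced, decoding always recovers the exact input, which is the reversibility property described above.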

### Challenge - Display the text instead of the tokens

See if you can write code to display the text instead of the tokens.

:bulb: **HINT:** See the following cookbook for some tips on working with `tiktoken`: [How to count tokens with tiktoken](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb)

```python
# Write code to display the text from the tokens below

#FIXME
```

If you're successful, the results should be similar to the following:

`[b'Hello', b' world', b',', b' this', b' is', b' fun', b'!']`

### Challenge - Write a function to return the number of tokens

Using what you've learned so far, complete the following function so that it returns the count of the number of tokens in a text string.

```python
def get_num_tokens_from_string(string: str, encoding_name: str = "p50k_base") -> int:
    #FIXME

get_num_tokens_from_string("Hello World, this is fun!")
```