# Generative Artificial Intelligence (GenAI)
### Tokenization

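Believe it or not, transformers did not get their name from the popular cartoon/comic Transformers. The name was picked "... because Jakob Uszkoreit, one of the paper's authors, liked the sound of that word," although an early design document did include an illustration of characters from the franchise (see sources).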

#### Progression of Sequence Models
`Recurrent Neural Network (RNN) -> Long Short-Term Memory (LSTM) -> Transformer`

*Recurrent Neural Network (RNN):* Sequence model that processes one step at a time, maintaining a memory of previous steps via a hidden state. Struggles to remember information from far back in the sequence.

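*Long Short-Term Memory (LSTM):* An RNN variant built to handle long sequences, where plain RNNs suffer from vanishing gradients. Uses a forget gate, an input gate, and an output gate to control what is kept, added, and exposed at each step. Much better at long-term memory and handles vanishing gradients well.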

*Transformer:* The superhero squad of sequence models. It ditches recurrence entirely and uses attention mechanisms to handle long-range dependencies. This architecture powers GPT, BERT, and pretty much every modern language model.
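To make "uses attention mechanisms" a little more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration only: real Transformers first pass the input through learned query, key, and value projections and use several attention heads, whereas this sketch feeds the same made-up toy vectors in as queries, keys, and values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key, however far apart the positions are.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy 2-dimensional vectors for a 3-token sequence (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(x, x, x)
for row in result:
    print([round(v, 3) for v in row])
```

Every position is compared with every other position in a single step, which is why no recurrence is needed to connect distant tokens.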

This file shows some examples of Generative AI tokenization outputs. Tokens are the units that count against the model's context window (how many tokens it can attend to at once). Each token ID is mapped to an embedding and combined with a positional encoding before it reaches the attention mechanism.
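As a rough sketch of the context window and positional encoding ideas, the snippet below truncates a list of stand-in token IDs to a tiny, hypothetical window and computes the sinusoidal positional encoding described in the Transformer paper (see sources) for each position. `CONTEXT_WINDOW`, `D_MODEL`, and the token IDs are made-up values for illustration; real models use far larger sizes, and some (GPT-2 among them) learn their position embeddings rather than using fixed sinusoids.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    encoding = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

CONTEXT_WINDOW = 8  # hypothetical, tiny on purpose
D_MODEL = 4         # hypothetical embedding width

token_ids = list(range(100, 112))    # 12 stand-in token IDs, not real vocabulary entries
window = token_ids[:CONTEXT_WINDOW]  # simple truncation: tokens past the limit are dropped

for pos, tok in enumerate(window):
    print(pos, tok, [round(v, 3) for v in positional_encoding(pos, D_MODEL)])
```

Cutting off the tail is only one possible truncation strategy; the point is that anything outside the window never reaches the model.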

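#### Tokenizing with OpenAI's GPT-2 Tokenizer
This snippet uses the GPT-2 tokenizer from the Hugging Face `transformers` library to return the tokens for a given string. It runs locally once the tokenizer files are downloaded, with no call to OpenAI's API. Notice that a token is not always a full word.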

In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Hello Moose, you code fast!")
ids = tokenizer.encode("Hello Moose, you code fast!")

print("Tokens:", tokens)
print("Token IDs:", ids)

#### OpenAI's models use a BPE (Byte Pair Encoding) tokenizer, available as the `tiktoken` library.

To install on your machine:

- `pip install tiktoken` or
- `pip3 install tiktoken`

In [None]:
import tiktoken

# Get the encoding for a specific model
encoding = tiktoken.encoding_for_model("gpt-4")

# Encode a string into tokens
tokens = encoding.encode("Hello, world!")

# Decode tokens back into a string
text = encoding.decode(tokens)

print(tokens)  # Outputs a list of integer token IDs (these differ from GPT-2's IDs)
print(text)    # Outputs: Hello, world!
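To show what Byte Pair Encoding does under the hood, here is a toy sketch of the merge-learning loop in plain Python. It is not how `tiktoken` is implemented: production tokenizers work on bytes and ship with a pre-trained merge table, while this sketch learns a few merges over characters from a made-up four-word corpus. The function names `most_frequent_pair` and `merge_pair` are hypothetical helpers.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Made-up corpus: word -> frequency. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "lowest": 3, "newest": 4}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(4):  # the number of merges is arbitrary here
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print("Merges learned:", merges)
print("Segmented words:", list(words))
```

Frequent character sequences end up as single symbols while rarer words stay split into pieces, which is the same reason a real tokenizer returns subwords for uncommon strings.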

#### View the token IDs, which index into the vocabulary fixed when the tokenizer was trained:

In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "Moose codes fast"
token_ids = tokenizer.encode(text)
tokens = tokenizer.tokenize(text)

print("Tokens:", tokens)
print("Token IDs:", token_ids)


#### Special Characters and Emojis

The following snippet shows examples of how the tokenizer handles special characters and punctuation. Punctuation usually lands in its own tokens, separate from the neighboring words, although a run of punctuation marks may be merged into a single token.

Emojis and accented characters are handled at the byte level: the GPT-2 tokenizer works on UTF-8 bytes, so a single emoji is typically split into several tokens that print as unfamiliar symbols. The tokenizer also splits a word into subwords when the whole word is not in the vocabulary.

For example, "Moose" is not in the vocabulary, so it is split into "Mo" and "ose". Note that the GPT-2 tokenizer does not add start-of-sequence or end-of-sequence tokens by default, so `encode` returns IDs only for the text itself.


In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

samples = [
    "Moose codes fast!",
    "unbelievable",
    "🔥🐍🚀",
    "def add(a, b): return a + b",
    "What???!!",
    "Café résumé"
]

for text in samples:
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.encode(text)
    print(f"\nInput: {text}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {ids}")
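One way to see why emojis and accented letters tend to come out as several odd-looking tokens is to look at their UTF-8 bytes, since GPT-2's tokenizer is a byte-level BPE. The sketch below uses only the Python standard library; the three sample characters are chosen here for illustration and written as escapes so the exact code points are unambiguous.

```python
# Sample characters: plain ASCII, an accented letter, and an emoji.
samples = ["A", "\u00e9", "\U0001F525"]  # "A", "é" (precomposed), "🔥"

byte_counts = {}
for ch in samples:
    raw = ch.encode("utf-8")
    byte_counts[ch] = len(raw)
    print(f"{ch!r}: {len(raw)} byte(s) -> {list(raw)}")
```

A character that needs more than one byte can only become a single token if the tokenizer's training merged those bytes together, so less common characters often span multiple tokens.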


### Sources

1. https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
2. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf