## NLP Fundamentals 

Instructor: __Victor Geislinger__, a data scientist and machine learning engineer who works with natural language processing (NLP). This work has included building and deploying NLP classification models as well as working on and with large language models (LLMs).

### What We'll Learn

![image.png](attachment:e0ff01af-e0ba-4119-94da-3f224b8db183.png)

__Intro to NLP :__ An introduction to different NLP tasks and applications and the challenges in working with language data

__Encoding Text Data :__ How to encode text data using tokenization and embeddings

__Text Generation :__ Techniques for the NLP task of text generation using recurrent neural networks

#### Natural Languages vs. Structured Languages
__Natural languages :__ A natural language is a language that evolved naturally through human communication, such as Spanish, Mandarin, or American Sign Language.

__Structured Language :__ A structured language is an invented or constructed language, such as a computer programming language.

__NLP: Natural Language Processing__  NLP makes the structure and meaning of human language accessible to computers, and its importance has grown in the modern age.

![image.png](attachment:78bc7de6-8c0c-4047-825b-cbf02b123977.png)

### Key NLP Applications 

| **Application/Task**               | **Definition**                                                   |
|------------------------------------|------------------------------------------------------------------|
| Speech Recognition                 | Converts spoken language to text                                 |
| Text Classification                | Categorizes pieces of text based on content                      |
| Machine Translation                | Translates text from one language to another                     |
| Text Summarization                 | Creates concise summaries while retaining meaning                |
| Question Answering                 | Answers natural language questions using documents               |
| Chatbots & Conversational Agents   | Converse back and forth in natural language                      |


### Extractive vs. Abstractive Summarization
__Extractive:__ Directly quotes main points from the source

__Abstractive:__ Summarizes with novel words and phrases


### Challenges in NLP with Computers
Language is complex, nuanced, and ambiguous, making it challenging for computers to process language data.

NLP needs (some) clean and labeled text data. Real-world data can contain misspellings, unconventional words, and biases, and it can be difficult and expensive for humans to label.

__NLP stands for "natural language processing" and forms the bridge between human communication and computer logic.__

Typical NLP tasks and applications include:

- Speech recognition
- Text classification
- Machine translation
- Text summarization
- Question answering
- Chatbots

__Some of the major challenges in NLP are the complexity, nuance, and ambiguity of natural language, as well as the difficulty of acquiring clean and labeled text data.__

## Normalization, Pretokenization, Tokenization, and Post-processing in NLP

#### Normalization

Normalization in Natural Language Processing (NLP) is the crucial first step of text preprocessing. It involves transforming text into a consistent format to ensure that it can be processed efficiently and accurately by NLP models. The primary goal of normalization is to reduce the variability in the text data. Common steps in normalization include lowercasing all characters to avoid case sensitivity issues, removing punctuation and special characters to focus on the core content, expanding contractions (e.g., "don't" to "do not") to standardize expressions, and applying stemming or lemmatization to reduce words to their root or base forms (e.g., "running" to "run" or "better" to "good"). These steps help in creating a uniform dataset, making it easier for NLP algorithms to analyze and learn from the text.
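A few of the normalization steps described above can be sketched in plain Python. This is a minimal sketch, not a complete normalizer: `simple_normalize` and the tiny `CONTRACTIONS` table are hypothetical names made up for illustration, and stemming and lemmatization are left out because they usually rely on a dedicated library.

```python
import string

# Hypothetical, deliberately tiny contraction table for illustration only
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}


def simple_normalize(text: str) -> str:
    # Lowercase everything to avoid case-sensitivity issues
    lowered = text.lower()
    # Expand known contractions word by word (exact matches only)
    words = [CONTRACTIONS.get(word, word) for word in lowered.split()]
    expanded = " ".join(words)
    # Remove ASCII punctuation characters
    return expanded.translate(str.maketrans("", "", string.punctuation))


print(simple_normalize("Don't panic, it's FINE!"))  # do not panic it is fine
```

Because contractions are matched as whole whitespace-separated words, a contraction followed directly by punctuation (such as `don't,`) would not be expanded here; a more careful version would separate punctuation first.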

#### Pretokenization

Pretokenization is the process that precedes the actual tokenization step in NLP. It involves breaking down the normalized text into preliminary units, often based on whitespace and punctuation. Pretokenization simplifies the tokenization process by handling common text structures and edge cases upfront. For instance, it can separate punctuation from words (e.g., "Hello, world!" to ["Hello", ",", "world", "!"]) and handle special tokens like hashtags and mentions (e.g., "@user" to ["@", "user"]). Some pipelines also split compound words at this stage (e.g., "unhappiness" to ["un", "happiness"]), although splitting words into subwords is more commonly left to the tokenization step itself. Pretokenization ensures that the text is in a more manageable form for the subsequent tokenization step, improving the efficiency and accuracy of tokenization.
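The punctuation-splitting behavior described above can be approximated with a single regular expression. This is only a sketch of one possible rule (the function name `regex_pretokenize` is made up here); real pretokenizers apply model-specific rules.

```python
import re


def regex_pretokenize(text: str) -> list[str]:
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation or symbol character on its own.
    return re.findall(r"\w+|[^\w\s]", text)


print(regex_pretokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
print(regex_pretokenize("@user"))          # ['@', 'user']
```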

#### Tokenization
![image.png](attachment:a9abe263-5410-4f3d-ae76-10693cd7bd0e.png)

Tokenization is the process of converting the pretokenized text into discrete units called tokens, which can be words, subwords, or even characters, depending on the NLP model's requirements. Tokenization is essential because it transforms the continuous stream of text into structured inputs that models can process. For example, word tokenization would convert a sentence like "ChatGPT helps users" into ["ChatGPT", "helps", "users"]. Subword tokenization, used by models like BERT and GPT, breaks down words into smaller units, allowing the model to handle unknown or rare words effectively. This step is crucial for feature extraction, as it represents the text in a format that can be numerically encoded and fed into machine learning algorithms for further processing.
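To make subword tokenization more concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece, where continuation pieces are prefixed with `##`. The vocabulary `toy_vocab` is invented for this example, and the function is a simplification for illustration rather than the algorithm as implemented in any particular library.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched, so the whole word becomes the unknown token
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Made-up toy vocabulary
toy_vocab = {"mage", "##nta", "shoes", "dark"}
print(wordpiece_tokenize("magenta", toy_vocab))  # ['mage', '##nta']
print(wordpiece_tokenize("shoes", toy_vocab))    # ['shoes']
print(wordpiece_tokenize("purple", toy_vocab))   # ['[UNK]']
```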

#### Post-processing

Post-processing in NLP involves the steps taken after the model has generated its output to refine and format the results for the end user. This step can include several tasks such as detokenization (reassembling tokens back into coherent text), correcting grammatical errors, and applying any necessary transformations to ensure the output is in a human-readable form. For instance, if a model generates a sequence of tokens like ["hello", ",", "world", "!"], post-processing would combine these tokens into the sentence "Hello, world!". Additionally, post-processing might involve filtering out any inappropriate or irrelevant content generated by the model, ensuring that the final output meets the desired standards and requirements. Within a tokenizer's own encoding pipeline, the term also refers to the final step that adds special tokens, such as those that mark the beginning and end of a sequence. This step is crucial for delivering polished and useful results in real-world applications of NLP.
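The detokenization example above can be sketched as a small function. This is a simplified illustration (the name `detokenize` and its rules are assumptions for this example): it glues `##` continuation pieces back together, attaches single punctuation characters to the preceding token, and capitalizes the first letter.

```python
import string


def detokenize(tokens: list[str]) -> str:
    text = ""
    for token in tokens:
        if token.startswith("##"):
            # Glue a subword continuation onto the previous piece
            text += token[2:]
        elif not text or (len(token) == 1 and token in string.punctuation):
            # No space at the very start or before a punctuation character
            text += token
        else:
            text += " " + token
    # Capitalize the first character of the result
    return text[:1].upper() + text[1:]


print(detokenize(["hello", ",", "world", "!"]))  # Hello, world!
```

Attaching every punctuation mark to the previous token is too crude for opening quotes or brackets, so a production detokenizer would need more rules than this.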

Each of these stages—normalization, pretokenization, tokenization, and post-processing—plays a critical role in the NLP pipeline, ensuring that text data is appropriately prepared, processed, and refined for accurate and effective analysis and generation by language models.




| **Action**                                                                            | **Pipeline Step** |
|---------------------------------------------------------------------------------------|-------------------|
| Converts to all lowercase letters                                                     | Normalization     |
| Adds special tags, such as the tokens that mark the beginning and end of a sentence   | Postprocessing    |
| Breaks original text into smaller chunks                                              | Pretokenization   |
| Creates blocks of text that are used as tokens                                        | Tokenization      |
| Replaces or removes accented characters                                               | Normalization     |


### Encoding and Decoding Text with Hugging Face Tokenizers
Hugging Face provides pretrained tokenizers through its flexible API as part of the `transformers` Python library.

In [1]:
from transformers import AutoTokenizer

### Loading Tokenizer

In this notebook, we'll explore Hugging Face's tokenizers by using a pretrained
model. Hugging Face has many tokenizers available that have already been trained
for specific models and tasks!

In [2]:
# Choose a pretrained tokenizer to use
my_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

### Encoding: Text to Tokens
#### Tokens: String Representations


In [3]:
# Simple method getting tokens from text
raw_text = '''Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!'''
tokens = my_tokenizer.tokenize(raw_text)

print(tokens)

['Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!']


In [4]:
# This method also returns special tokens depending on the pretrained tokenizer
detailed_tokens = my_tokenizer(raw_text).tokens()

print(detailed_tokens)

['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']


#### Tokens: Integer ID Representations

In [5]:
# Way to get tokens as integer IDs
print(my_tokenizer.encode(raw_text))

[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]


In [6]:
print(detailed_tokens)

# Tokenizer method to get the IDs if we already have the tokens as strings
detailed_ids = my_tokenizer.convert_tokens_to_ids(detailed_tokens)
print(detailed_ids)

['[CLS]', 'Rory', "'", 's', 'shoes', 'are', 'mage', '##nta', 'and', 'so', 'are', 'Corey', "'", 's', 'but', 'they', 'aren', "'", 't', 'nearly', 'as', 'dark', '!', '[SEP]']
[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]


Another approach, calling the tokenizer object directly, looks a little more complex
but can be useful when working with tokenizers for certain tasks.

In [7]:
# Returns an object that has a few different keys available
my_tokenizer(raw_text)

{'input_ids': [101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [8]:
# focus on `input_ids` which are the IDs associated with the tokens.
print(my_tokenizer(raw_text).input_ids)

[101, 14845, 112, 188, 5743, 1132, 27595, 13130, 1105, 1177, 1132, 19521, 112, 188, 1133, 1152, 4597, 112, 189, 2212, 1112, 1843, 106, 102]


### Decoding: Tokens to Text

We of course can use the tokenizer to go from token IDs to tokens and back to text!

In [9]:
# Integer IDs for tokens
ids = my_tokenizer.encode(raw_text)

# The inverse of the .encode() method: .decode()
my_tokenizer.decode(ids)

"[CLS] Rory's shoes are magenta and so are Corey's but they aren't nearly as dark! [SEP]"

In [10]:
# To ignore special tokens (depending on pretrained tokenizer)
my_tokenizer.decode(ids, skip_special_tokens=True)

"Rory's shoes are magenta and so are Corey's but they aren't nearly as dark!"

In [11]:
# List of tokens as strings instead of one long string
my_tokenizer.convert_ids_to_tokens(ids)

['[CLS]',
 'Rory',
 "'",
 's',
 'shoes',
 'are',
 'mage',
 '##nta',
 'and',
 'so',
 'are',
 'Corey',
 "'",
 's',
 'but',
 'they',
 'aren',
 "'",
 't',
 'nearly',
 'as',
 'dark',
 '!',
 '[SEP]']

> One thing to consider is what happens when a string is outside of the tokenizer's
> vocabulary: it is encoded as an "unknown" token.
> 
> Unknown tokens are typically represented with `[UNK]` or
> some other similar variant.


<!--
If the tokenizer encoded the text so each character was a token (which is
actually not as easy as it sounds), then it would be impossible to have an
"unknown" token. Word-based tokenization will always be in danger of having 
"unknown" tokens since it's virtually impossible to have every possible word (
and "non-word") in its vocabulary!

And so you might think that subword tokenization wouldn't have an issue with
"unknown" tokens. And although there are fewer than word-based tokenization, it
does happen!

--------------------------------------------------------------------------------

Tokenizers are specific so it's important to use a tokenizer that will recognize
most of the text you're working with! For example, a lot of tokenizers might not
consider emoji as tokens but could be really important if emoji are especially
numerous in your data (like a corpus of chat messages)!

If you're seeing a lot of "unknown" tokens with the text you're working with,
you might consider using a different tokenizer appropriate for the task. It's also
possible to fine-tune a pretrained model or train one from scratch!

-->

In [12]:
phrase = '🥱 the dog next door kept barking all night!!'
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))

🥱 the dog next door kept barking all night!!
['[CLS]', '[UNK]', 'the', 'dog', 'next', 'door', 'kept', 'barking', 'all', 'night', '!', '!', '[SEP]']
[CLS] [UNK] the dog next door kept barking all night!! [SEP]


In [13]:
phrase = '''wow my dad thought mcdonalds sold tacos \N{SKULL}'''
ids = my_tokenizer.encode(phrase)
print(phrase)
print(my_tokenizer.convert_ids_to_tokens(ids))
print(my_tokenizer.decode(ids))

wow my dad thought mcdonalds sold tacos 💀
['[CLS]', 'w', '##ow', 'my', 'dad', 'thought', 'm', '##c', '##don', '##ald', '##s', 'sold', 'ta', '##cos', '[UNK]', '[SEP]']
[CLS] wow my dad thought mcdonalds sold tacos [UNK] [SEP]


### Properties of Hugging Face Tokenizers

There are a lot of great features when using tokenizers in Hugging Face that can make it very simple to try out and use different models. Here we'll briefly discuss some properties that can be useful.

We'll load a couple different models:

* `bert-base-cased` ([doc](https://huggingface.co/docs/transformers/model_doc/bert))
* `xlm-roberta-base` ([doc](https://huggingface.co/docs/transformers/model_doc/xlm-roberta))
* `google/pegasus-xsum` ([doc](https://huggingface.co/docs/transformers/model_doc/pegasus))
* `allenai/longformer-base-4096` ([doc](https://huggingface.co/docs/transformers/model_doc/longformer))


__Hugging Face Tokenizer Properties__

The Hugging Face API allows you to use a variety of tokenizers, each with its own properties. In this demo, we compared:

* 'bert-base-cased'
* 'xlm-roberta-base'
* 'google/pegasus-xsum'
* 'allenai/longformer-base-4096'
  
__Maximum Length__

Different tokenizers will handle some text better, such as longer input sequences. The `.model_max_length` property of the tokenizer object will tell you the maximum length the model can handle.

If the length of your data exceeds the maximum length of your tokenizer, you may need to chunk the data before tokenizing it. Or you could consider switching to a different tokenizer that has a longer maximum length.
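One simple way to chunk data is to split a long list of token IDs into windows no longer than the maximum length, optionally overlapping so that context is not cut off abruptly. The sketch below is illustrative only: `chunk_token_ids` and its `overlap` parameter are hypothetical names, and it assumes the IDs were produced without special tokens, so any special tokens added later still need room within the limit.

```python
def chunk_token_ids(ids: list[int], max_length: int, overlap: int = 0) -> list[list[int]]:
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be in the range [0, max_length)")
    step = max_length - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_length])
        # Stop once a chunk has reached the end of the sequence
        if start + max_length >= len(ids):
            break
    return chunks


print(chunk_token_ids(list(range(10)), max_length=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```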

#### Special Tokens
Different tokenizers will have different special tokens defined. They might have tokens representing:

* Unknown token
* Beginning of sequence token
* Separator token
* Token used for padding
* Classifier token
* Token used for masking values
  
Additionally, there may be multiple subtypes of each special token. For example, some tokenizers have multiple different unknown tokens (e.g. `<unk>` and `<unk_2>`).

#### Hugging Face Tokenizers Takeaways

Different tokenizers can create very different tokens for the same piece of text. When choosing a tokenizer, consider what properties are important to you, such as the maximum length and the special tokens.

If none of the available tokenizers perform the way you need them to, you can also fine-tune a tokenizer to adjust it for your use case.


In [16]:
# Import the tokenizer from hugging face

from transformers import AutoTokenizer

In [21]:
model_names = (
    'bert-base-cased',
    'xlm-roberta-base',
    'google/pegasus-xsum',
    'allenai/longformer-base-4096',
)

model_tokenizers = {
    model_name: AutoTokenizer.from_pretrained(model_name)
    for model_name in model_names
}

Many models that tokenizers are associated with can only take in a maximum number of tokens and so the tokenizer might not be equipped to encode a very long sequence. It might not always be relevant, but you can find this length with `.model_max_length`.

In [24]:
print(model_tokenizers.keys())

dict_keys(['bert-base-cased', 'xlm-roberta-base', 'google/pegasus-xsum', 'allenai/longformer-base-4096'])


In [26]:
# Max Length

for model_name, temp_tokenizer in model_tokenizers.items():
    max_length = temp_tokenizer.model_max_length
    print(f"{model_name}\n\tMax Length : {max_length}")
    print("\n")

bert-base-cased
	Max Length : 512


xlm-roberta-base
	Max Length : 512


google/pegasus-xsum
	Max Length : 512


allenai/longformer-base-4096
	Max Length : 1000000000000000019884624838656
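
The Longformer value above is almost certainly not a real limit. It equals `int(1e30)`, which `transformers` appears to use as a placeholder when no maximum length is stored with the tokenizer. The sketch below shows one way to guard against it; `effective_max_length` is a hypothetical helper, and the 4096 fallback is an assumption taken from the model's name rather than read from the tokenizer.

```python
# Assumption: values at or above this placeholder mean "no limit was set"
NO_LIMIT_SENTINEL = int(1e30)


def effective_max_length(reported: int, fallback: int) -> int:
    # Treat implausibly large values as "not set" and use the fallback instead
    return fallback if reported >= NO_LIMIT_SENTINEL else reported


print(NO_LIMIT_SENTINEL)  # 1000000000000000019884624838656
print(effective_max_length(1000000000000000019884624838656, fallback=4096))  # 4096
print(effective_max_length(512, fallback=4096))  # 512
```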




We've already mentioned special tokens like the "unknown" token. Different models mark special tokens in different ways, and not every model defines every special token, since that depends on the task the model was trained for.

In [32]:
for model_name, temp_tokenizer in model_tokenizers.items():
    special_tokens = temp_tokenizer.all_special_tokens
    print(f"{model_name}\n\tspecial tokens:{special_tokens}")
    print(f" ")
    

bert-base-cased
	special tokens:['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
 
xlm-roberta-base
	special tokens:['<s>', '</s>', '<unk>', '<pad>', '<mask>']
 
google/pegasus-xsum
	special tokens:['</s>', '<unk>', '<pad>', '<mask_2>', '<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<unk_60>', '<unk_61>', '<unk_62>', '<unk_63>', '<unk_64>', '<unk

In [27]:
# Special Tokens
model_tokenizers['bert-base-cased'].unk_token

'[UNK]'

In [33]:
for model_name, temp_tokenizer in model_tokenizers.items():
    print(f'{model_name}')
    print(f'\tUnknown: \n\t\t{temp_tokenizer.unk_token=}')
    print(f'\tBeginning of Sequence: \n\t\t{temp_tokenizer.bos_token=}')
    print(f'\tEnd of Sequence: \n\t\t{temp_tokenizer.eos_token=}')
    print(f'\tMask: \n\t\t{temp_tokenizer.mask_token=}')
    print(f'\tSentence Separator: \n\t\t{temp_tokenizer.sep_token=}')
    print(f'\tClass of Input: \n\t\t{temp_tokenizer.cls_token=}')
    print('\n')

bert-base-cased
	Unknown: 
		temp_tokenizer.unk_token='[UNK]'
	Beginning of Sequence: 
		temp_tokenizer.bos_token=None
	End of Sequence: 
		temp_tokenizer.eos_token=None
	Mask: 
		temp_tokenizer.mask_token='[MASK]'
	Sentence Separator: 
		temp_tokenizer.sep_token='[SEP]'
	Class of Input: 
		temp_tokenizer.cls_token='[CLS]'


xlm-roberta-base
	Unknown: 
		temp_tokenizer.unk_token='<unk>'
	Beginning of Sequence: 
		temp_tokenizer.bos_token='<s>'
	End of Sequence: 
		temp_tokenizer.eos_token='</s>'
	Mask: 
		temp_tokenizer.mask_token='<mask>'
	Sentence Separator: 
		temp_tokenizer.sep_token='</s>'
	Class of Input: 
		temp_tokenizer.cls_token='<s>'


google/pegasus-xsum
	Unknown: 
		temp_tokenizer.unk_token='<unk>'
	Beginning of Sequence: 
		temp_tokenizer.bos_token=None
	End of Sequence: 
		temp_tokenizer.eos_token='</s>'
	Mask: 
		temp_tokenizer.mask_token='<mask_2>'
	Sentence Separator: 
		temp_tokenizer.sep_token=None
	Class of Input: 
		temp_tokenizer.cls_token=None


allenai/longform

#### Type Annotations

```python
def normalize_text(text: str) -> str:
    # TODO: Normalize incoming text; can be multiple actions
    normalized_text = ''
    return normalized_text
```

Adding `: str` inside the parentheses tells us that `text` is expected to be a string, and adding `-> str` after the parentheses tells us that the function's return value (`normalized_text`) is expected to be a string.

As of Python 3.12, type annotations are not enforced when your code runs and are completely optional. Their major benefit is documenting which variable types are expected. See the Python `typing` documentation for more details.
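
A quick way to see that annotations are not enforced at runtime is to call an annotated function with the "wrong" type. The function `shout` below is a made-up example.

```python
def shout(text: str) -> str:
    # The annotations say "str", but Python does not check them when called
    return text * 2


print(shout("ha"))            # haha
print(shout(3))               # 6 (no error, despite the annotation)
print(shout.__annotations__)  # the annotations are stored as metadata
```

Static type checkers and editors can use the stored annotations to flag the second call, which is where most of their practical value comes from.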

__I've used type annotations in this and other exercises for this lesson to help make it clear what you, as the learner, should fill in.__ I also think it's useful to become familiar with type annotations even if they are optional because the industry has been moving in the direction of including type annotations. You'll frequently see type annotations in the Python libraries you use.

### Coding a Tokenizer from Scratch

In this exercise, you'll build a tokenizer from scratch to practice and get familiar with how tokenization works. However, you would typically not create a tokenizer from scratch using pure Python. This is partly because tokenizers often have to process large amounts of text data, so the tokenizer needs to be optimized.

Thankfully, there are frameworks that can help with creating and using tokenizers, such as Hugging Face's tokenizers library. But for this exercise, we'll do it from scratch for learning purposes.

#### Encoding Text Data

You will complete the task in the following steps.

- Implement the normalization function for your custom tokenizer
- Implement the pretokenization function in your custom tokenizer
- Implement the tokenization step in your custom tokenizer
- Implement the postprocessing step in your custom tokenizer
- Create an encoder to encode the text to token IDs using your custom tokenizer
- Create a decoder to decode token IDs to text using your custom tokenizer

In [44]:
from __future__ import annotations
import string

In [54]:
# Define Sample Text
sample_text = '''Mr. Louis continued to say, "Penguins are important,
but we mustn't forget the number 1 priority: the READER!"
'''

print(sample_text)


Mr. Louis continued to say, "Penguins are important,
but we mustn't forget the number 1 priority: the READER!"



### Normalization

This step is where you'll normalize your text by converting to lowercase, removing accented characters, etc.

For example, the text: `Did Uncle Max like the jalapeño dip?`

might be normalized to: `did uncle max like the jalapeno dip`


In [55]:
def normalize_text(text: str) -> str:
    # Normalize the incoming text; this can involve multiple actions
    # Only keep ASCII letters, digits, punctuation, and whitespace characters
    # (any other character, including accented letters, is dropped)
    acceptable_characters = (
        string.ascii_letters
        + string.digits
        + string.punctuation
        + string.whitespace
    )
    normalized_text = ''.join(
        filter(lambda letter: letter in acceptable_characters, text)
    )
    # Convert the text to lowercase
    normalized_text = normalized_text.lower()
    return normalized_text

In [56]:
# Test out your normalization
normalize_text(sample_text)

'mr. louis continued to say, "penguins are important,\nbut we mustn\'t forget the number 1 priority: the reader!"\n'
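
Note that the ASCII filter used in `normalize_text` drops accented characters entirely, so `jalapeño` would become `jalapeo` rather than `jalapeno`. If the goal is to replace accented letters with their base letters, one option is Unicode decomposition with the standard library's `unicodedata` module. The helper name `fold_accents` is made up for this sketch.

```python
import unicodedata


def fold_accents(text: str) -> str:
    # NFKD decomposition splits 'ñ' into 'n' plus a combining tilde
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base characters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("Did Uncle Max like the jalapeño dip?"))
# Did Uncle Max like the jalapeno dip?
```

Running this before the ASCII filter would let the filter keep the base letters.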

### Pretokenization

This step will take in the normalized text and pretokenize the text into a list
of smaller pieces.

For example, the text: `Did Uncle Max like the jalapeño dip?`

might be normalized & then pretokenized to:
```
[
    'did',
    'uncle',
    'max',
    'like',
    'the',
    'jalapeno',
    'dip?',
]
```

In [57]:
# Pretokenization
def pretokenize_text(text: str) -> list[str]:
    # Pretokenize the normalized text
    # Split on whitespace
    smaller_pieces = text.split()
    return smaller_pieces

In [58]:
normalized_text = normalize_text(sample_text)
pretokenize_text(normalized_text)

['mr.',
 'louis',
 'continued',
 'to',
 'say,',
 '"penguins',
 'are',
 'important,',
 'but',
 'we',
 "mustn't",
 'forget',
 'the',
 'number',
 '1',
 'priority:',
 'the',
 'reader!"']

### Tokenization

This step will take in the list of pretokenized pieces (after the text has
been normalized) and break them into the tokens that will be used.

For example, the text: `Did Uncle Max like the jalapeño dip?`

might be normalized, pretokenized, and then tokenized to:
```
[
    'did',
    'uncle',
    'max',
    'like',
    'the',
    'jalapeno',
    'dip',
    '?',
]
```
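
One possible way to finish the remaining steps (tokenization, postprocessing, encoding, and decoding) is sketched below. It is not the exercise's reference solution: the rule of splitting every punctuation character into its own token, the `[BOS]`/`[EOS]` special tokens, and all function names are assumptions chosen for illustration.

```python
import string


def tokenize_text(pieces: list[str]) -> list[str]:
    # Split every punctuation character off into its own token
    tokens = []
    for piece in pieces:
        current = ""
        for char in piece:
            if char in string.punctuation:
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(char)
            else:
                current += char
        if current:
            tokens.append(current)
    return tokens


def postprocess_tokens(tokens: list[str]) -> list[str]:
    # Hypothetical special tokens marking the beginning and end of the text
    return ["[BOS]", *tokens, "[EOS]"]


def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Assign integer IDs in order of first appearance
    vocab: dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


pieces = ["did", "uncle", "max", "like", "the", "jalapeno", "dip?"]
tokens = postprocess_tokens(tokenize_text(pieces))
vocab = build_vocab(tokens)
id_to_token = {idx: token for token, idx in vocab.items()}

ids = [vocab[token] for token in tokens]         # encode: tokens -> IDs
decoded = " ".join(id_to_token[i] for i in ids)  # decode: IDs -> text

print(tokens)
print(ids)
print(decoded)
```

Building the vocabulary from the text being encoded keeps the sketch short; a real tokenizer builds its vocabulary from a training corpus and maps unseen tokens to an unknown token.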

__Tokenization__ breaks text into tokens that computers can understand.

Steps for encoding text as tokens depend on the data and task:

- Normalization
- Pretokenization
- Tokenization
- Postprocessing

Tokenization methods vary and have their pros and cons:

* Character tokenization results in a smaller vocabulary but can be harder for downstream tasks.
* Word tokenization results in a larger vocabulary with more out-of-vocabulary tokens.
* Subword tokenization is a balance between small and large tokens where frequent words are not split and rare words are broken down, keeping the vocabulary size manageable.
  
Embeddings represent tokens as vectors whose positions in the vector space capture the meaning and context of the text.
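
As a rough illustration of what "vectors in a space" buys us, tokens used in similar ways should end up with similar vectors. The three-dimensional vectors below are made up purely for this example; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Made-up vectors for illustration only
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"]))  # close to 1
print(cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"]))  # close to 0
```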



### Sequences

- Text can be treated as a sequence of characters, words, tokens, or embeddings.

- Depending on the task, there are different model architectures to handle the sequence.

__Encoder-Decoder Models__
    
For an encoder-decoder model, the encoder encodes the input sequence into a representation of context while the decoder decodes this representation to generate an output sequence.

__For now, we will focus on a relatively simple yet versatile architecture: recurrent neural networks (RNNs).__

<p float="left">
  <img src="attachment:6c86fd89-bb8d-4975-be88-17dba595b8ec.png" width="600" />
  <img src="attachment:d34e33a2-26d2-4909-99e7-761e04550ca6.png" width="600" />
</p>

![image.png](attachment:2de00bb2-6658-417a-9cb5-275f80daa725.png)
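
To give a feel for what "recurrent" means, here is a minimal sketch of one common formulation of a simple RNN step, `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)`, written in plain Python. The weights and inputs are arbitrary numbers chosen for illustration; a real model learns its weights during training and would typically use a tensor library.

```python
import math


def rnn_step(
    x: list[float],
    h_prev: list[float],
    w_xh: list[list[float]],
    w_hh: list[list[float]],
    b_h: list[float],
) -> list[float]:
    # Combine the current input with the previous hidden state
    h_new = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(w * xi for w, xi in zip(w_xh[i], x))
        total += sum(w * hi for w, hi in zip(w_hh[i], h_prev))
        h_new.append(math.tanh(total))
    return h_new


# Arbitrary illustrative weights: 2-dimensional inputs and hidden state
w_xh = [[0.5, -0.2], [0.1, 0.4]]
w_hh = [[0.3, 0.0], [0.0, 0.3]]
b_h = [0.0, 0.0]

hidden = [0.0, 0.0]
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    # The same weights are reused at every position in the sequence
    hidden = rnn_step(x_t, hidden, w_xh, w_hh, b_h)
    print(hidden)
```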

Autoregressive models *tend to repeat* the same tokens in their output sequences, so
*sampling methods* are frequently used for choosing the next token: instead of always taking
the single most likely token, the next token is drawn according to the predicted probabilities.

## Sampling Methods for Tokens

- **Temperature**: adjusts the randomness in choosing the next token
- **Top-k sampling**: samples from only the *k* most likely tokens
- **Nucleus or top-p sampling**: uses a dynamic cutoff, sampling from the smallest set of most likely tokens whose cumulative probability reaches *p*
- **Beam search**: considers the likelihood of strings of multiple tokens instead of just a single next token


#### 1. Temperature
**Temperature** is a way to adjust the randomness when a model is picking the next word. A higher temperature means more randomness, and a lower temperature means less randomness.

**Example**:
- High Temperature (1.0): "The cat sat on the... banana."
- Low Temperature (0.2): "The cat sat on the... mat."

In this example, with high temperature, the model might pick a strange word like "banana". With low temperature, it sticks to more likely words like "mat".
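
Temperature is commonly applied by dividing the model's raw scores (logits) by the temperature before converting them to probabilities with softmax. The sketch below assumes that formulation; the logits are made-up scores standing in for the candidates "mat", "floor", and "banana".

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for "mat", "floor", "banana"
logits = [3.0, 2.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # probabilities are more spread out
print(softmax_with_temperature(logits, 0.2))  # probability concentrates on "mat"
```

With a temperature of 1.0 the probabilities are left as the model produced them; values below 1.0 sharpen the distribution and values above 1.0 flatten it.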

#### 2. Top-k Sampling
**Top-k sampling** means the model only considers the top *k* most likely next words and picks one of them.

**Example**:
- Top-3 Sampling: "The cat sat on the..."
  - The model considers only the top 3 options: "mat", "floor", "couch".
  - It then randomly picks one of these.
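
A minimal sketch of top-k sampling, using made-up next-word probabilities. Here the pick among the top *k* candidates is weighted by their probabilities, which is one common choice; `top_k_sample` is a hypothetical helper name.

```python
import random


def top_k_sample(probs: dict[str, float], k: int, rng: random.Random) -> str:
    # Keep only the k most likely tokens
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [prob for _, prob in top]
    # random.choices treats the weights as relative, so no renormalizing is needed
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up probabilities for the word after "The cat sat on the..."
probs = {"mat": 0.5, "floor": 0.2, "couch": 0.15, "bed": 0.1, "banana": 0.05}
rng = random.Random(0)
print([top_k_sample(probs, k=3, rng=rng) for _ in range(5)])
```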

#### 3. Nucleus or Top-p Sampling
**Nucleus or top-p sampling** is similar to top-k, but instead of a fixed number *k*, it considers the smallest set of words whose combined probability is at least *p* (e.g., 0.9).

**Example**:
- Top-p (0.9) Sampling: "The cat sat on the..."
  - The model considers enough options to make up 90% of the probability: "mat", "floor", "couch", "bed".
  - It then randomly picks one of these.
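
The same idea can be sketched for top-p sampling. The probabilities are made up, and `nucleus` and `top_p_sample` are hypothetical helper names; the candidate set grows until its cumulative probability reaches *p*.

```python
import random


def nucleus(probs: dict[str, float], p: float) -> list[tuple[str, float]]:
    # Smallest set of most likely tokens whose cumulative probability reaches p
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def top_p_sample(probs: dict[str, float], p: float, rng: random.Random) -> str:
    kept = nucleus(probs, p)
    tokens = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up probabilities for the word after "The cat sat on the..."
probs = {"mat": 0.5, "floor": 0.2, "couch": 0.15, "bed": 0.1, "banana": 0.05}
print([token for token, _ in nucleus(probs, 0.9)])  # ['mat', 'floor', 'couch', 'bed']
print(top_p_sample(probs, 0.9, random.Random(0)))
```

Unlike top-k, the number of candidates changes with the shape of the distribution: a confident model yields a small set, an uncertain one a larger set.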

#### 4. Beam Search
**Beam search** is a method that looks at several possible sequences of words to find the best one, rather than just picking the next word.

**Example**:
- Beam Search: "The cat sat on the..."
  - Instead of committing to one word at a time, the model keeps several candidate sequences (beams) at each step, for example continuations that begin with "mat", "floor", and "couch".
  - It chooses the sequence that is most likely: "The cat sat on the mat and purred."
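
Here is a minimal beam search sketch over a toy "model". The table `NEXT_TOKEN_PROBS` is entirely hypothetical: it gives made-up next-token probabilities that depend only on the previous token, which is far simpler than a real language model but enough to show the mechanics.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token
NEXT_TOKEN_PROBS = {
    "the": {"mat": 0.4, "floor": 0.35, "couch": 0.25},
    "mat": {"and": 0.4, "<end>": 0.6},
    "floor": {"and": 0.9, "<end>": 0.1},
    "couch": {"<end>": 1.0},
    "and": {"purred": 1.0},
    "purred": {"<end>": 1.0},
}


def beam_search(start: str, beam_width: int, max_steps: int) -> list[tuple[list[str], float]]:
    beams = [([start], 0.0)]  # each beam is (tokens, log-probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "<end>":
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams


best_tokens, best_score = beam_search("the", beam_width=2, max_steps=4)[0]
print(best_tokens, math.exp(best_score))
```

With these made-up numbers, always taking the most likely next token would start with "mat" and end up with a total probability of 0.24, while the two-beam search finds the sequence starting with "floor", whose overall probability is about 0.315.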

#### Summary
- **Temperature**: Adjusts how random the word choices are.
- **Top-k Sampling**: Picks from the top *k* most likely words.
- **Nucleus or Top-p Sampling**: Picks from a set of words making up at least *p* probability.
- **Beam Search**: Looks at sequences of words to find the best continuation.

These methods help models generate more natural and varied text by controlling how they pick the next word.
