### Examples of Unigram, Bigram, and Trigram Tokens

Consider the following sentence as an example:
"The quick brown fox jumps over the lazy dog."

#### Unigram Tokens
Unigrams are single words. Ignoring the final period, the sentence tokenized into unigrams is:
- "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"

#### Bigram Tokens
Bigrams are pairs of consecutive words. The sentence tokenized into bigrams would be:
- "The quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog"

#### Trigram Tokens
Trigrams are triples of consecutive words. The sentence tokenized into trigrams would be:
- "The quick brown", "quick brown fox", "brown fox jumps", "fox jumps over", "jumps over the", "over the lazy", "the lazy dog"
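
The three lists above all come from sliding a window of size n over the word sequence. A minimal sketch of that idea, assuming whitespace splitting and dropping the final period (the helper name `ngrams` is illustrative, not from any library):

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog."
# Assumption: strip the trailing period, then split on whitespace.
tokens = sentence.rstrip(".").split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(len(unigrams), len(bigrams), len(trigrams))
print(trigrams[-1])
```

A sentence of 9 words yields 9 unigrams, 8 bigrams, and 7 trigrams; in general a sequence of k tokens has k - n + 1 n-grams.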

### What is a Corpus?

A corpus (plural: corpora) is a large and structured set of texts (documents). In the context of natural language processing (NLP) and text mining, a corpus is used to train and evaluate models. It can be a collection of written texts, transcriptions of spoken words, or any other form of linguistic data.

Example of a corpus:
- Document 1: "The quick brown fox jumps over the lazy dog."
- Document 2: "The dog barks loudly."
- Document 3: "Foxes are clever animals."

### What is a Vocabulary in the Above Context?

The vocabulary of a corpus is the set of all unique words or tokens that appear in it, that is, the lexicon used across its documents.

For the example corpus above, with punctuation removed and case preserved (so "The" and "the" count as separate entries, as do "fox" and "Foxes"), the vocabulary is:

- "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "barks", "loudly", "Foxes", "are", "clever", "animals"
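
As a rough sketch, the vocabulary can be built by tokenizing each document and collecting the tokens in a set. The `tokenize` helper and the choice to treat a token as a run of word characters are assumptions made for this illustration:

```python
import re

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog barks loudly.",
    "Foxes are clever animals.",
]

def tokenize(text):
    # Assumption: a token is a maximal run of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

# Case-sensitive vocabulary: "The" and "the" are distinct entries.
vocabulary = set()
for doc in corpus:
    vocabulary.update(tokenize(doc))

# Case-folded variant: "The" and "the" collapse into one entry.
vocabulary_lower = {word.lower() for word in vocabulary}

print(sorted(vocabulary))
print(len(vocabulary), len(vocabulary_lower))
```

The case-sensitive set has 15 entries and the lowercased one has 14, which shows how a normalization choice changes vocabulary size.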



# Tokenization Techniques in NLP

Tokens in natural language processing (NLP) are the individual elements that make up a text, such as words, subwords, or characters. There are various ways to define and extract tokens depending on the application and the complexity of the language model. Here are several common methods:

### 1. Word Tokenization

**Definition**: Splitting text into individual words.

**Example**:
- Sentence: "The quick brown fox jumps over the lazy dog."
- Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

**Applications**: General-purpose text processing, information retrieval, and traditional NLP tasks.
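
A minimal sketch of word tokenization, assuming words are separated by whitespace and that punctuation attached to a word should be stripped. The function name `simple_word_tokenize` is hypothetical and unrelated to any library tokenizer:

```python
import string

def simple_word_tokenize(text):
    """Split on whitespace, then strip leading/trailing punctuation from each piece."""
    pieces = (piece.strip(string.punctuation) for piece in text.split())
    # Drop pieces that were pure punctuation and are now empty.
    return [piece for piece in pieces if piece]

tokens = simple_word_tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
```

This approach is deliberately naive: it does not handle contractions, hyphenated words, or languages written without spaces.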

### 2. Subword Tokenization

**Definition**: Splitting words into smaller subword units, which can include prefixes, suffixes, or common word parts. This is useful for handling out-of-vocabulary words.

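**Example**:
- Word: "unhappiness"
- Tokens: ["un", "happi", "ness"] (illustrative; the actual split depends on the subword vocabulary learned from the training data)

**Applications**: Neural machine translation and language modeling. Byte Pair Encoding (BPE) and WordPiece, described below, are two widely used subword algorithms.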

### 3. Character Tokenization

**Definition**: Splitting text into individual characters.

**Example**:
- Word: "cat"
- Tokens: ["c", "a", "t"]

**Applications**: Languages with complex morphology, tasks requiring fine-grained text analysis, and character-level language models.
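
In Python, character tokenization is a one-liner; the sketch below also maps each character to an integer id, as a character-level model would typically require. The id assignment (sorted order of the distinct characters) is an arbitrary choice for illustration:

```python
text = "cat"

# list() splits a string into its Unicode code points.
chars = list(text)

# Assumption: ids are assigned by sorting the distinct characters.
char_to_id = {ch: i for i, ch in enumerate(sorted(set(chars)))}
ids = [char_to_id[ch] for ch in chars]

print(chars)
print(ids)
```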

### 4. N-gram Tokenization

**Definition**: Extracting contiguous sequences of n items (words, subwords, or characters).

**Example**:
- Sentence: "The quick brown fox"
- Unigrams: ["The", "quick", "brown", "fox"]
- Bigrams: ["The quick", "quick brown", "brown fox"]
- Trigrams: ["The quick brown", "quick brown fox"]

**Applications**: Text classification, information retrieval, and feature extraction for machine learning models.

### 5. Sentence Tokenization

**Definition**: Splitting text into individual sentences.

**Example**:
- Text: "Hello world! How are you?"
- Tokens: ["Hello world!", "How are you?"]

**Applications**: Document summarization, sentiment analysis, and text segmentation.
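
A rough sketch of rule-based sentence splitting, assuming a sentence ends at `.`, `!`, or `?` followed by whitespace. The helper name `split_sentences` is illustrative:

```python
import re

def split_sentences(text):
    """Split wherever whitespace follows a sentence-ending punctuation mark."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

sentences = split_sentences("Hello world! How are you?")
print(sentences)
```

The rule is too simple for real text: it would also split after abbreviations such as "Dr." in "Dr. Smith arrived.", which is why practical sentence tokenizers rely on abbreviation lists or trained models.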

### 6. Byte-Pair Encoding (BPE)

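**Definition**: An algorithm that starts from individual bytes or characters and iteratively merges the most frequent pair of adjacent symbols in the training text, producing subword units.

**Example**:
- Sentence: "low lower lowest"
- Tokens: ["low", "low", "er", "low", "est"] (illustrative; the actual split depends on which merges were learned from the training corpus)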

**Applications**: Machine translation, language modeling, and reducing the vocabulary size.
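
The merge loop can be sketched in a few lines. This is a toy version for illustration, assuming character-level starting symbols, no end-of-word marker, and ties broken by whichever pair is seen first; production implementations differ in these details. The name `learn_bpe` is hypothetical:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges merges; return the merges and the segmented words."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, list(vocab)

merges, segmented = learn_bpe("low lower lowest".split(), num_merges=2)
print(merges)
print(segmented)
```

With two merges on this three-word input, the sketch first fuses `l`+`o` and then `lo`+`w`, leaving `low`, `low e r`, and `low e s t`. Units such as `er` and `est` only emerge when the training corpus makes those pairs frequent enough.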

### 7. WordPiece Tokenization

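**Definition**: Similar to BPE, but instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data under the current vocabulary.

**Example**:
- Word: "unhappiness"
- Tokens: ["un", "##happiness"] (illustrative; "##" marks a piece that continues the preceding token rather than starting a new word)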

**Applications**: Google's BERT and other Transformer-based models.
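
Once a WordPiece vocabulary exists, a word is usually segmented by greedy longest-match-first lookup. The sketch below covers only that segmentation step, not vocabulary training, and uses a made-up toy vocabulary; `wordpiece_tokenize` and `toy_vocab` are illustrative names:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation with '##' continuation pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Assumption: a hand-picked toy vocabulary, not one learned from data.
toy_vocab = {"un", "##happiness", "##happi", "##ness"}
print(wordpiece_tokenize("unhappiness", toy_vocab))
print(wordpiece_tokenize("unhappiness", toy_vocab - {"##happiness"}))
```

The first call returns `["un", "##happiness"]`; removing `##happiness` from the vocabulary makes the same word fall back to `["un", "##happi", "##ness"]`, which shows that the split is determined entirely by the vocabulary.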

### 8. Regular Expression Tokenization

**Definition**: Using regular expressions to define patterns that either match the tokens themselves or mark the places where the text should be split.

**Example**:
- Sentence: "Email: user@example.com"
- Pattern: `r'\w+|\S'` (each match is one token: a run of word characters, or any single non-whitespace character)
- Tokens: ["Email", ":", "user", "@", "example", ".", "com"]

**Applications**: Custom text processing tasks, specific tokenization rules, and handling special text formats.
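
The example pattern can be applied with Python's `re.findall`. The second pattern below is an assumption added for illustration: a deliberately simplified e-mail alternative placed first so that addresses stay in one piece. It is not a complete e-mail matcher:

```python
import re

text = "Email: user@example.com"

# Pattern from the example: word runs, or any single non-whitespace character.
basic = re.compile(r"\w+|\S")
tokens = basic.findall(text)

# Alternatives are tried left to right, so the more specific one goes first.
email_aware = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\w+|\S")
email_tokens = email_aware.findall(text)

print(tokens)
print(email_tokens)
```

The first pattern breaks the address into five tokens, while the second keeps `user@example.com` whole, which is the kind of format-specific rule regular expression tokenizers are typically used for.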

### 9. Morpheme-based Tokenization

**Definition**: Splitting text into morphemes, the smallest meaningful units in a language.

**Example**:
- Word: "unhappiness"
- Tokens: ["un-", "happy", "-ness"]

**Applications**: Linguistic research, languages with complex morphology, and text analysis.

### 10. Semantic Tokenization

**Definition**: Splitting text into units based on semantic meaning, such as named entities or phrases.

**Example**:
- Sentence: "Barack Obama was born in Hawaii."
- Tokens: ["Barack Obama", "was born in", "Hawaii"]

**Applications**: Named entity recognition (NER), information extraction, and semantic analysis.
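
Real systems usually find such units with a trained named entity recognizer or chunker. As a stand-in, the sketch below merges words using a small hand-written phrase lexicon; the lexicon, the function name `merge_phrases`, and the greedy longest-match rule are all assumptions for illustration:

```python
def merge_phrases(tokens, phrases):
    """Greedily merge the longest known multi-word phrase at each position."""
    max_len = max((len(phrase) for phrase in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest window first; a single word always matches.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(tokens[i:i + n])
            if n == 1 or window in phrases:
                out.append(" ".join(window))
                i += n
                break
    return out

# Assumption: a hand-written lexicon standing in for an NER model.
phrases = {("Barack", "Obama"), ("was", "born", "in")}
words = "Barack Obama was born in Hawaii".split()
units = merge_phrases(words, phrases)
print(units)
```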

### Choosing the Right Tokenization Method

The choice of tokenization method depends on the specific task and the characteristics of the text data. For instance:

- **Word Tokenization** is simple and effective for many general-purpose NLP tasks.
- **Subword and Character Tokenization** are better suited for languages with rich morphology or for models dealing with out-of-vocabulary words.
- **N-gram Tokenization** captures local word order and short-range context that single-word tokens miss.
- **Sentence Tokenization** is useful for tasks involving sentence-level analysis.
- **BPE and WordPiece Tokenization** are commonly used in modern Transformer models: BERT uses WordPiece, and the GPT family uses BPE (byte-level from GPT-2 onward).
- **Regular Expression Tokenization** allows for custom tokenization based on specific patterns.
- **Morpheme-based and Semantic Tokenization** need extra linguistic resources, such as a morphological analyzer or a named entity recognizer, and are used for specific linguistic or semantic tasks.

Each method has its advantages and trade-offs, and the right choice often depends on the language, the corpus, and the specific application.
