<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 2: Working with Text Data (F# Edition)

This notebook is an F# port of the Python chapter 2 notebook from "Build a Large Language Model From Scratch".
We use TorchSharp for tensor operations and .NET standard libraries for text processing.

- This chapter covers data preparation and sampling to get input data "ready" for the LLM

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/01.webp?timestamp=1" width="500px">

## Install NuGet packages

In [81]:
#r "nuget: TorchSharp-cpu, 0.103.0"
#r "nuget: Tiktoken, 2.1.1"

open System
open System.IO
open System.Net.Http
open System.Text.RegularExpressions
open TorchSharp

printfn "TorchSharp version: %s" (typeof<torch.Tensor>.Assembly.GetName().Version.ToString())
printfn "Setup complete."

TorchSharp version: 0.103.0.0
Setup complete.


## 2.1 Understanding word embeddings

- No code in this section

- There are many forms of embeddings; we focus on text embeddings in this book

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/02.webp" width="500px">

- LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions)
- Since we can't visualize such high-dimensional spaces (we humans think in 1, 2, or 3 dimensions), the figure below illustrates a 2-dimensional embedding space

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/03.webp" width="300px">

## 2.2 Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/04.webp" width="300px">

- Load raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story

In [82]:
let filePath = "the-verdict.txt"

if not (File.Exists(filePath)) then
    let url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
    use client = new HttpClient()
    let content = client.GetStringAsync(url).Result
    File.WriteAllText(filePath, content)

printfn "File downloaded/verified."

File downloaded/verified.


In [83]:
let rawText = File.ReadAllText("the-verdict.txt")

printfn "Total number of characters: %d" rawText.Length
printfn "%s" (rawText.Substring(0, 99))

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above
- The following regular expression splits on whitespace (the capture group keeps the whitespace characters in the result)

In [84]:
let text1 = "Hello, world. This, is a test."
let result1 = Regex.Split(text1, @"(\s)")

printfn "%A" result1

[|"Hello,"; " "; "world."; " "; "This,"; " "; "is"; " "; "a"; " "; "test."|]


- We want to split not only on whitespace but also on commas and periods, so let's modify the regular expression accordingly

In [85]:
let result2 = Regex.Split(text1, @"([,.]|\s)")

printfn "%A" result2

[|"Hello"; ","; ""; " "; "world"; "."; ""; " "; "This"; ","; ""; " "; "is"; " ";
  "a"; " "; "test"; "."; ""|]


- As we can see, this creates empty strings; let's remove them

In [86]:
// Strip whitespace from each item and then filter out any empty strings.
let result3 = result2 |> Array.map (fun s -> s.Trim()) |> Array.filter (fun s -> s <> "")
printfn "%A" result3

[|"Hello"; ","; "world"; "."; "This"; ","; "is"; "a"; "test"; "."|]


- This looks pretty good, but let's also handle other types of punctuation, such as colons, question marks, quotation marks, and double dashes

In [87]:
let text2 = "Hello, world. Is this-- a test?"

let result4 =
    Regex.Split(text2, @"([,.:;?_!""()'\\]|--|\s)")
    |> Array.map (fun s -> s.Trim())
    |> Array.filter (fun s -> s <> "")

printfn "%A" result4

[|"Hello"; ","; "world"; "."; "Is"; "this"; "--"; "a"; "test"; "?"|]


- This is pretty good, and we are now ready to apply this tokenization to the raw text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/05.webp" width="350px">

In [88]:
let preprocessed =
    Regex.Split(rawText, @"([,.:;?_!""()'\\]|--|\s)")
    |> List.ofArray
    |> List.map (fun s -> s.Trim())
    |> List.filter (fun s -> s <> "")

printfn "%A" (preprocessed |> List.take 30)

["I"; "HAD"; "always"; "thought"; "Jack"; "Gisburn"; "rather"; "a"; "cheap";
 "genius"; "--"; "though"; "a"; "good"; "fellow"; "enough"; "--"; "so"; "it";
 "was"; "no"; "great"; "surprise"; "to"; "me"; "to"; "hear"; "that"; ","; "in"]


- Let's calculate the total number of tokens

In [89]:
printfn "%d" preprocessed.Length

4690
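
- The split, trim, and filter steps above can be wrapped in one reusable function. The following is a minimal sketch, assuming the same regular expression as in the cells above; the name `tokenize` is hypothetical and does not appear in the original notebook:

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper (not part of the original notebook):
// split on punctuation, "--", and whitespace, then trim and drop empty strings.
let tokenize (text: string) : string list =
    Regex.Split(text, @"([,.:;?_!""()'\\]|--|\s)")
    |> Array.map (fun s -> s.Trim())
    |> Array.filter (fun s -> s <> "")
    |> List.ofArray

let sampleTokens = tokenize "Hello, world. Is this-- a test?"
printfn "%A" sampleTokens
// ["Hello"; ","; "world"; "."; "Is"; "this"; "--"; "a"; "test"; "?"]
```

- Because .NET's `Regex.Split` includes captured groups in its result, the punctuation characters survive as separate tokens, while pure-whitespace matches are trimmed to empty strings and filtered out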


## 2.3 Converting tokens into token IDs

- Next, we convert the text tokens into token IDs that we can process via embedding layers later

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/06.webp" width="500px">

- From these tokens, we can now build a vocabulary that consists of all the unique tokens

In [90]:
let allWords = preprocessed |> List.distinct |> List.sort
let vocabSize = allWords.Length

printfn "%d" vocabSize

1130


In [91]:
let vocab =
    allWords
    |> List.mapi (fun i token -> (token, i))
    |> Map.ofList

- Below are the first 51 entries in this vocabulary (IDs 0 through 50):

In [92]:
vocab
|> Map.toSeq
|> Seq.take 51
|> Seq.iter (fun (token, id) -> printfn "(%s, %d)" token id)

(!, 0)
(", 1)
(', 2)
((, 3)
(), 4)
(,, 5)
(--, 6)
(., 7)
(:, 8)
(;, 9)
(?, 10)
(A, 11)
(Ah, 12)
(Among, 13)
(And, 14)
(Are, 15)
(Arrt, 16)
(As, 17)
(At, 18)
(Be, 19)
(Begin, 20)
(Burlington, 21)
(But, 22)
(By, 23)
(Carlo, 24)
(Chicago, 25)
(Claude, 26)
(Come, 27)
(Croft, 28)
(Destroyed, 29)
(Devonshire, 30)
(Don, 31)
(Dubarry, 32)
(Emperors, 33)
(Florence, 34)
(For, 35)
(Gallery, 36)
(Gideon, 37)
(Gisburn, 38)
(Gisburns, 39)
(Grafton, 40)
(Greek, 41)
(Grindle, 42)
(Grindles, 43)
(HAD, 44)
(Had, 45)
(Hang, 46)
(Has, 47)
(He, 48)
(Her, 49)
(Hermia, 50)


- Below, we illustrate the tokenization of a short sample text using a small vocabulary:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/07.webp?123" width="500px">

- Putting it all together into a tokenizer class

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/08.webp?123" width="500px">

In [93]:
type SimpleTokenizerV1(vocab: Map<string, int>) =
    let strToInt = vocab
    let intToStr = vocab |> Map.toSeq |> Seq.map (fun (s, i) -> (i, s)) |> Map.ofSeq

    member _.Encode(text: string) =
        let preprocessed =
            Regex.Split(text, @"([,.:;?_!""()'\\]|--|\s)")
            |> List.ofArray
            |> List.map (fun s -> s.Trim())
            |> List.filter (fun s -> s <> "")
        preprocessed |> List.map (fun s -> strToInt.[s])

    member _.Decode(ids: int list) =
        let text =
            ids
            |> List.map (fun i -> intToStr.[i])
            |> String.concat " "
        // Replace spaces before the specified punctuations
        Regex.Replace(text, @"\s+([,.?!""()'\\])", "$1")

- The `Encode` function turns text into token IDs
- The `Decode` function turns token IDs back into text
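
- `Decode` works in two steps: it joins the tokens with single spaces and then removes the whitespace in front of punctuation. The sketch below isolates those two steps on a hand-written token list (an illustrative assumption, not output from the notebook) and uses the same pattern as `SimpleTokenizerV1.Decode`:

```fsharp
open System.Text.RegularExpressions

// Hand-written tokens for the text: "It's the last, you know."
let tokens = [ "\""; "It"; "'"; "s"; "the"; "last"; ","; "you"; "know"; "."; "\"" ]

// Step 1: join all tokens with single spaces.
let joined = String.concat " " tokens

// Step 2: remove whitespace that precedes punctuation.
let cleaned = Regex.Replace(joined, @"\s+([,.?!""()'\\])", "$1")

printfn "%s" joined   // " It ' s the last , you know . "
printfn "%s" cleaned  // " It' s the last, you know."
```

- Only the whitespace before a punctuation character is removed, so the space after an opening quote or an apostrophe stays; decoding is therefore not an exact inverse of encoding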

- We can use the tokenizer to encode (that is, tokenize) texts into integers
- These integers can later be embedded and used as input for the LLM

In [94]:
let tokenizer1 = SimpleTokenizerV1(vocab)

let sampleText = "\"It's the last he painted, you know,\" \n           Mrs. Gisburn said with pardonable pride."
let ids1 = tokenizer1.Encode(sampleText)
printfn "%A" ids1

[1; 56; 2; 850; 988; 602; 533; 746; 5; 1126; 596; 5; 1; 67; 7; 38; 851; 1108;
 754; 793; 7]


- We can decode the integers back into text

In [95]:
printfn "%s" (tokenizer1.Decode(ids1))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [96]:
printfn "%s" (tokenizer1.Decode(tokenizer1.Encode(sampleText)))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


## 2.4 Adding special context tokens

- It's useful to add some "special" tokens for unknown words and to denote the end of a text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/09.webp?123" width="500px">

- Some tokenizers use special tokens to help the LLM with additional context
- Some of these special tokens are:
  - `[BOS]` (beginning of sequence) marks the beginning of text
  - `[EOS]` (end of sequence) marks where the text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books)
  - `[PAD]` (padding) fills up shorter texts when we train LLMs with a batch size greater than 1 (a batch may include texts of different lengths; the padding token extends the shorter texts to the longest length so that all texts have equal length)
  - `[UNK]` represents words that are not included in the vocabulary

- Note that GPT-2 does not need any of the tokens mentioned above; it only uses an `<|endoftext|>` token to reduce complexity
- The `<|endoftext|>` token is analogous to the `[EOS]` token mentioned above
- GPT-2 also uses `<|endoftext|>` for padding
- GPT-2 does not use an `[UNK]` token for out-of-vocabulary words; instead, GPT-2 uses a byte-pair encoding (BPE) tokenizer, which breaks down words into subword units
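
- Padding as described above can be sketched on plain integer arrays. The sequences and the helper name `padToLongest` are hypothetical; the padding ID 50256 is GPT-2's `<|endoftext|>` ID, reused here because GPT-2 pads with that token:

```fsharp
// Hypothetical token ID sequences of different lengths.
let sequences = [ [| 5; 8; 2 |]; [| 7 |]; [| 4; 9 |] ]

// GPT-2's <|endoftext|> ID, reused as the padding ID.
let padId = 50256

/// Append padId to every sequence until all match the longest one.
/// Assumes a non-empty list (List.max throws on an empty list).
let padToLongest (padId: int) (seqs: int[] list) : int[] list =
    let maxLen = seqs |> List.map Array.length |> List.max
    seqs |> List.map (fun s -> Array.append s (Array.create (maxLen - s.Length) padId))

let padded = padToLongest padId sequences
padded |> List.iter (printfn "%A")
// [|5; 8; 2|]
// [|7; 50256; 50256|]
// [|4; 9; 50256|]
```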

- We use the `<|endoftext|>` tokens between two independent sources of text:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/10.webp" width="500px">

- Let's see what happens if we tokenize the following text (it will fail because "Hello" is not in the vocabulary):

In [97]:
let tokenizer1b = SimpleTokenizerV1(vocab)

let textToFail = "Hello, do you like tea. Is this-- a test?"

try
    tokenizer1b.Encode(textToFail) |> ignore
with
| :? System.Collections.Generic.KeyNotFoundException as ex ->
    printfn "KeyError: %s" ex.Message
    printfn "The word 'Hello' is not in the vocabulary!"

KeyError: The given key was not present in the dictionary.
The word 'Hello' is not in the vocabulary!


- The above produces an error because the word "Hello" is not contained in the vocabulary
- To deal with such cases, we can add special tokens like `"<|unk|>"` to the vocabulary to represent unknown words
- Since we are already extending the vocabulary, let's add another token called `"<|endoftext|>"`

In [98]:
let allTokens =
    let sorted = preprocessed |> List.distinct |> List.sort
    List.append sorted [ "<|endoftext|>"; "<|unk|>" ]

let vocab2 =
    allTokens
    |> List.mapi (fun i token -> (token, i))
    |> Map.ofList

In [99]:
printfn "%d" vocab2.Count

1132


In [100]:
vocab2
|> Map.toSeq
|> Seq.sortBy snd
|> Seq.skip (vocab2.Count - 5)
|> Seq.iter (fun (token, id) -> printfn "(%s, %d)" token id)

(younger, 1127)
(your, 1128)
(yourself, 1129)
(<|endoftext|>, 1130)
(<|unk|>, 1131)


- We also need to adjust the tokenizer accordingly so that it knows when and how to use the new `<|unk|>` token

In [101]:
type SimpleTokenizerV2(vocab: Map<string, int>) =
    let strToInt = vocab
    let intToStr = vocab |> Map.toSeq |> Seq.map (fun (s, i) -> (i, s)) |> Map.ofSeq
    let unkToken = "<|unk|>"

    member _.Encode(text: string) =
        let preprocessed =
            Regex.Split(text, @"([,.:;?_!""()'\\]|--|\s)")
            |> Array.map (fun s -> s.Trim())
            |> Array.filter (fun s -> s <> "")
        let preprocessed =
            preprocessed
            |> Array.map (fun item ->
                if strToInt.ContainsKey(item) then item
                else unkToken)
        preprocessed |> Array.map (fun s -> strToInt.[s])

    member _.Decode(ids: int[]) =
        let text =
            ids
            |> Array.map (fun i -> intToStr.[i])
            |> String.concat " "
        // Replace spaces before the specified punctuations
        Regex.Replace(text, @"\s+([,.:;?!""()'\\])", "$1")
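
- The unknown-token fallback in `Encode` can also be expressed as a single lookup with `Map.tryFind`. This is a minimal sketch on a tiny, hypothetical vocabulary (`tinyVocab` and `encodeTokens` are illustrative names, not part of the notebook):

```fsharp
// Tiny hypothetical vocabulary, including the special <|unk|> token.
let tinyVocab =
    Map.ofList [ ("do", 0); ("like", 1); ("tea", 2); ("you", 3); ("<|unk|>", 4) ]

let unkId = tinyVocab.["<|unk|>"]

/// Map each token to its ID, falling back to the <|unk|> ID if it is missing.
let encodeTokens (tokens: string list) : int list =
    tokens
    |> List.map (fun t -> tinyVocab |> Map.tryFind t |> Option.defaultValue unkId)

let ids = encodeTokens [ "Hello"; "do"; "you"; "like"; "tea" ]
printfn "%A" ids
// [4; 0; 3; 1; 2]
```

- Compared with `ContainsKey` followed by indexing, this performs one lookup per token and cannot raise a `KeyNotFoundException`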

- Let's try to tokenize text with the modified tokenizer:

In [102]:
let tokenizer2 = SimpleTokenizerV2(vocab2)

let text1a = "Hello, do you like tea?"
let text2a = "In the sunlit terraces of the palace."

let combinedText = sprintf "%s <|endoftext|> %s" text1a text2a

printfn "%s" combinedText

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [103]:
let encoded2 = tokenizer2.Encode(combinedText)
printfn "%A" encoded2

[|1131; 5; 355; 1126; 628; 975; 10; 1130; 55; 988; 956; 984; 722; 988; 1131; 7|]


In [104]:
printfn "%s" (tokenizer2.Decode(tokenizer2.Encode(combinedText)))

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.


## 2.5 BytePair encoding

- GPT-2 used BytePair encoding (BPE) as its tokenizer
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words
- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges
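- In this notebook, we use the [Tiktoken .NET library](https://github.com/tryAGI/Tiktoken) to work with the BPE tokenizer
- Note that the encoder in this section is requested for `text-embedding-3-large`, which maps to the `cl100k_base` encoding rather than GPT-2's encoding; the token IDs shown in this section (for example, 100257 for `<|endoftext|>`) therefore differ from GPT-2's IDs (where `<|endoftext|>` is 50256)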
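
- The subword idea can be illustrated with a single merge step. The sketch below is a toy, character-level illustration only: it counts adjacent symbol pairs and merges the most frequent one. GPT-2's actual tokenizer operates on bytes and applies a fixed table of pretrained merges; the helper names `countPairs` and `mergePair` are hypothetical:

```fsharp
// Toy illustration of one BPE merge step (not GPT-2's real algorithm).

/// Count how often each adjacent symbol pair occurs.
let countPairs (symbols: string list) =
    symbols |> List.pairwise |> List.countBy id

/// Replace every non-overlapping occurrence of (a, b) with the merged symbol a + b.
let rec mergePair (a: string, b: string) (symbols: string list) : string list =
    match symbols with
    | x :: y :: rest when x = a && y = b -> (a + b) :: mergePair (a, b) rest
    | x :: rest -> x :: mergePair (a, b) rest
    | [] -> []

let symbols = "abababc" |> Seq.map string |> List.ofSeq
let bestPair = countPairs symbols |> List.maxBy snd |> fst
let merged = mergePair bestPair symbols

printfn "%A -> %A" bestPair merged
// ("a", "b") -> ["ab"; "ab"; "ab"; "c"]
```

- Repeating this step grows the vocabulary with progressively longer subwords, which is why an unseen word can still be represented as a sequence of known pieces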

In [105]:
open Tiktoken

let tiktokenEncoder = ModelToEncoder.For("text-embedding-3-large")
printfn "Tiktoken encoder loaded."

Tiktoken encoder loaded.


In [106]:
let bpeText =
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces" +
    "of someunknownPlace."

// Pass <|endoftext|> as an allowed special token
let integers = tiktokenEncoder.EncodeWithAllowedSpecial(bpeText, ["<|endoftext|>"])

printfn "%A" (integers |> Seq.toArray)

[|9906; 11; 656; 499; 1093; 15600; 30; 220; 100257; 33813; 385; 11; 220; 3055;
  499; 220; 4908; 15600; 30; 366; 91; 8862; 728; 428; 91; 29; 763; 259; 71|]


In [107]:
let decoded = tiktokenEncoder.Decode(integers |> Seq.toArray)

printfn "%s" decoded

Hello, do you like tea? <|endoftext|>Hello, do you like tea? <|endoftext|> In th


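- Note that the decoded string above does not reproduce the input text (the first sentence appears twice and the end is cut off), so the special-token handling in this cell should be double-checked before relying on its output
- BPE tokenizers break down unknown words into subwords and individual characters: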

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp" width="300px">

## 2.6 Data sampling with a sliding window

- We train LLMs to generate one word at a time, so we want to prepare the training data accordingly where the next word in a sequence represents the target to predict:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/12.webp" width="400px">

In [108]:
let rawText2 = File.ReadAllText("the-verdict.txt")

let encText = tiktokenEncoder.Encode(rawText2) |> Seq.toArray
printfn "%d" encText.Length

4943


- For each text chunk, we want the inputs and targets
- Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right

In [109]:
let encSample = encText.[50..]

In [110]:
let contextSize = 4

let x = encSample.[..contextSize-1]
let y = encSample.[1..contextSize]

printfn "x: %A" x
printfn "y:      %A" y

x: [|323; 9749; 5678; 304|]
y:      [|9749; 5678; 304; 264|]


- One by one, the predictions would look as follows:

In [111]:
for i in 1 .. contextSize do
    let context = encSample.[..i-1]
    let desired = encSample.[i]
    printfn "%A ----> %d" context desired

[|323|] ----> 9749
[|323; 9749|] ----> 5678
[|323; 9749; 5678|] ----> 304
[|323; 9749; 5678; 304|] ----> 264


In [112]:
for i in 1 .. contextSize do
    let context = encSample.[..i-1]
    let desired = encSample.[i]
    printfn "%s ----> %s"
        (tiktokenEncoder.Decode(context))
        (tiktokenEncoder.Decode([| desired |]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


- We will take care of the next-word prediction in a later chapter after we have covered the attention mechanism
- For now, we implement a simple data loader that iterates over the input dataset and returns the inputs and targets shifted by one

- We use TorchSharp as the .NET equivalent of PyTorch

- We use a sliding window approach, changing the position by +1:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/13.webp?123" width="500px">

- Create a dataset and a dataloader that extract chunks from the input text dataset

In [None]:
/// GPT Dataset: produces input/target pairs from tokenized text using a sliding window
type GPTDatasetV1(txt: string, maxLength: int, stride: int) =
    let tokenizer = ModelToEncoder.For("gpt-2")
    let tokenIds = tokenizer.Encode(txt) |> Seq.toArray

    let inputIds, targetIds =
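        // `assert` is compiled out unless DEBUG is defined, so check explicitly
        if tokenIds.Length <= maxLength then
            invalidArg "txt" "Number of tokenized inputs must be at least maxLength + 1"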
        let inputs = ResizeArray<torch.Tensor>()
        let targets = ResizeArray<torch.Tensor>()
        for i in 0 .. stride .. (tokenIds.Length - maxLength - 1) do
            let inputChunk = tokenIds.[i .. i + maxLength - 1] |> Array.map int64
            let targetChunk = tokenIds.[i + 1 .. i + maxLength] |> Array.map int64
            inputs.Add(torch.tensor(inputChunk))
            targets.Add(torch.tensor(targetChunk))
        inputs.ToArray(), targets.ToArray()

    member _.Length = inputIds.Length
    member _.Item(idx: int) = inputIds.[idx], targetIds.[idx]
    member _.InputIds = inputIds
    member _.TargetIds = targetIds
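
- The windowing logic can be checked without TorchSharp or a tokenizer. The sketch below mirrors the loop in `GPTDatasetV1` on a plain integer array; `slidingWindows` and `toyIds` are hypothetical names, and the IDs are stand-ins rather than real token IDs:

```fsharp
/// Hypothetical helper mirroring the loop in GPTDatasetV1 on plain int arrays.
/// Each target window is the input window shifted one position to the right.
let slidingWindows (tokenIds: int[]) (maxLength: int) (stride: int) =
    [ for i in 0 .. stride .. tokenIds.Length - maxLength - 1 do
        yield (tokenIds.[i .. i + maxLength - 1], tokenIds.[i + 1 .. i + maxLength]) ]

let toyIds = [| 10 .. 19 |]   // stand-in for real token IDs
let pairs = slidingWindows toyIds 4 4

pairs |> List.iter (fun (x, y) -> printfn "%A -> %A" x y)
// [|10; 11; 12; 13|] -> [|11; 12; 13; 14|]
// [|14; 15; 16; 17|] -> [|15; 16; 17; 18|]
```

- With a stride equal to the window length, consecutive input windows do not overlap; a stride of 1 would instead produce six heavily overlapping windows for the same ten IDs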

In [None]:
/// Creates batches from the dataset
/// Returns a sequence of (inputBatch, targetBatch) tensors
let createDataloaderV1 (txt: string) (batchSize: int) (maxLength: int) (stride: int) (shuffle: bool) (dropLast: bool) =
    let dataset = GPTDatasetV1(txt, maxLength, stride)
    let indices =
        if shuffle then
            let rng = Random(42)
            [| 0 .. dataset.Length - 1 |] |> Array.sortBy (fun _ -> rng.Next())
        else
            [| 0 .. dataset.Length - 1 |]

    let numBatches =
        if dropLast then dataset.Length / batchSize
        else (dataset.Length + batchSize - 1) / batchSize

    seq {
        for b in 0 .. numBatches - 1 do
            let startIdx = b * batchSize
            let endIdx = min (startIdx + batchSize) dataset.Length
            if (not dropLast) || (endIdx - startIdx = batchSize) then
                let batchIndices = indices.[startIdx .. endIdx - 1]
                let inputTensors = batchIndices |> Array.map (fun i -> fst (dataset.Item(i)))
                let targetTensors = batchIndices |> Array.map (fun i -> snd (dataset.Item(i)))
                let inputBatch = torch.stack(inputTensors |> Array.toSeq |> ResizeArray)
                let targetBatch = torch.stack(targetTensors |> Array.toSeq |> ResizeArray)
                yield (inputBatch, targetBatch)
    }
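
- The effect of `dropLast` on the number of batches follows directly from the integer arithmetic used in `createDataloaderV1`. A minimal sketch with hypothetical sizes:

```fsharp
/// Number of batches the loader yields (same arithmetic as in createDataloaderV1).
let numBatches (datasetLength: int) (batchSize: int) (dropLast: bool) =
    if dropLast then datasetLength / batchSize          // floor: incomplete batch is dropped
    else (datasetLength + batchSize - 1) / batchSize    // ceiling: incomplete batch is kept

printfn "%d" (numBatches 10 4 true)   // 2 (the remaining 2 samples are dropped)
printfn "%d" (numBatches 10 4 false)  // 3 (the last batch holds only 2 samples)
```

- Dropping the last, smaller batch keeps every batch the same shape, which avoids irregular final updates during training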

- Let's test the dataloader with a batch size of 1 for an LLM with a context size of 4:

In [None]:
let rawText3 = File.ReadAllText("the-verdict.txt")

In [None]:
let dataloader1 = createDataloaderV1 rawText3 1 4 1 false true

let batches1 = dataloader1 |> Seq.toArray
let (firstInput, firstTarget) = batches1.[0]
printfn "[%O, %O]" firstInput firstTarget

In [None]:
let (secondInput, secondTarget) = batches1.[1]
printfn "[%O, %O]" secondInput secondTarget

- An example using a stride equal to the context length (here: 4) is shown below:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/14.webp" width="500px">

- We can also create batched outputs
- Note that we increase the stride here so that we don't have overlaps between the batches, since more overlap could lead to increased overfitting

In [None]:
let dataloader2 = createDataloaderV1 rawText3 8 4 4 false true

let (inputs2, targets2) = dataloader2 |> Seq.head
printfn "Inputs:\n %O" inputs2
printfn "\nTargets:\n %O" targets2

## 2.7 Creating token embeddings

- The data is almost ready for an LLM
- As a last step, let us embed the tokens in a continuous vector representation using an embedding layer
- Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/15.webp" width="400px">

- Suppose we have the following four input tokens with IDs 2, 3, 5, and 1 (after tokenization):

In [None]:
let inputIds = torch.tensor([| 2L; 3L; 5L; 1L |])

- For the sake of simplicity, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size 3:

In [None]:
let vocabSizeSmall = 6L
let outputDimSmall = 3L

torch.manual_seed(123) |> ignore
let embeddingLayer = torch.nn.Embedding(vocabSizeSmall, outputDimSmall)

- This would result in a 6x3 weight matrix:

In [None]:
printfn "%O" (embeddingLayer.weight)

- For those who are familiar with one-hot encoding, the embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully connected layer
- Because the embedding layer is equivalent to the one-hot encoding and matrix-multiplication approach, it can be seen as a neural network layer that can be optimized via backpropagation
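
- The equivalence can be verified by hand with plain arrays. In this sketch the 6x3 weight matrix holds made-up values chosen for readability (an assumption; TorchSharp initializes the weights randomly), and `lookup` and `oneHotMatmul` are hypothetical helper names:

```fsharp
// Hypothetical 6x3 weight matrix; entry (row, col) is row * 10 + col.
let weights = Array2D.init 6 3 (fun row col -> float (row * 10 + col))

/// Embedding lookup: return the row of the weight matrix selected by tokenId.
let lookup (tokenId: int) =
    Array.init 3 (fun col -> weights.[tokenId, col])

/// One-hot encode tokenId, then multiply the 1x6 vector by the 6x3 matrix.
let oneHotMatmul (tokenId: int) =
    let oneHot = Array.init 6 (fun i -> if i = tokenId then 1.0 else 0.0)
    Array.init 3 (fun col ->
        Seq.sum (seq { for row in 0 .. 5 -> oneHot.[row] * weights.[row, col] }))

printfn "%A" (lookup 3)        // [|30.0; 31.0; 32.0|]
printfn "%A" (oneHotMatmul 3)  // [|30.0; 31.0; 32.0|]
```

- Both paths return the same row, but the lookup skips the multiplications by zero, which is where the efficiency gain comes from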

- To convert a token with id 3 into a 3-dimensional vector, we do the following:

In [None]:
printfn "%O" (embeddingLayer.forward(torch.tensor([| 3L |])))

- Note that the above is the 4th row in the `embeddingLayer` weight matrix (IDs are zero-based, so ID 3 selects the 4th row)
- To embed all four `inputIds` values above, we do:

In [None]:
printfn "%O" (embeddingLayer.forward(inputIds))

- An embedding layer is essentially a look-up operation:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/16.webp?123" width="500px">

## 2.8 Encoding word positions

- Embedding layers convert IDs into identical vector representations regardless of where they are located in the input sequence:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/17.webp" width="400px">

- Positional embeddings are combined with the token embedding vector to form the input embeddings for a large language model:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/18.webp" width="500px">

- The GPT-2 BytePair encoder has a vocabulary size of 50,257
- Suppose we want to encode the input tokens into a 256-dimensional vector representation:

In [None]:
let vocabSizeBPE = 50257L
let outputDim = 256L

let tokenEmbeddingLayer = torch.nn.Embedding(vocabSizeBPE, outputDim)

- If we sample data from the dataloader, we embed the tokens in each batch into a 256-dimensional vector
- If we have a batch size of 8 with 4 tokens each, this results in an 8 x 4 x 256 tensor:

In [None]:
let maxLength = 4
let dataloader3 = createDataloaderV1 rawText3 8 maxLength maxLength false true
let (inputs3, targets3) = dataloader3 |> Seq.head

In [None]:
printfn "Token IDs:\n %O" inputs3
printfn "\nInputs shape:\n %O" (inputs3.shape)

In [None]:
let tokenEmbeddings = tokenEmbeddingLayer.forward(inputs3)
printfn "%O" (tokenEmbeddings.shape)

- GPT-2 uses absolute position embeddings, so we just create another embedding layer:

In [None]:
let contextLength = int64 maxLength
let posEmbeddingLayer = torch.nn.Embedding(contextLength, outputDim)

In [None]:
let posEmbeddings = posEmbeddingLayer.forward(torch.arange(int64 maxLength))
printfn "%O" (posEmbeddings.shape)

- To create the input embeddings used in an LLM, we simply add the token and the positional embeddings:

In [None]:
let inputEmbeddings = tokenEmbeddings + posEmbeddings
printfn "%O" (inputEmbeddings.shape)
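
- The addition above relies on broadcasting: the positional embeddings have shape 4 x 256, and the same positional rows are added to each of the 8 sequences in the batch. The sketch below performs that broadcast by hand on toy-sized plain arrays (batch of 2, 3 positions, dimension 2; all values are made up for illustration):

```fsharp
// Toy sizes: 2 sequences per batch, 3 positions each, embedding dimension 2.
let tokenEmb = Array3D.init 2 3 2 (fun b p d -> float (100 * b + 10 * p + d))
let posEmb = Array2D.init 3 2 (fun p d -> 0.5 * float (p + 1))

// Broadcasting by hand: posEmb.[p, d] is added to every sequence b in the batch.
let inputEmb =
    Array3D.init 2 3 2 (fun b p d -> tokenEmb.[b, p, d] + posEmb.[p, d])

printfn "%.1f" inputEmb.[0, 1, 0]   // 11.0  (10.0 + 1.0)
printfn "%.1f" inputEmb.[1, 1, 0]   // 111.0 (110.0 + 1.0)
```

- The positional term depends only on the position index, so two identical tokens at different positions end up with different input embeddings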

- In the initial phase of the input processing workflow, the input text is segmented into separate tokens
- Following this segmentation, these tokens are transformed into token IDs based on a predefined vocabulary:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/19.webp" width="400px">

# Summary and takeaways

This F# notebook replicated the key concepts from Chapter 2:

1. **Tokenization** - Breaking text into tokens using regex
2. **Vocabulary building** - Mapping tokens to integer IDs
3. **Simple tokenizer** - V1 (no unknown handling) and V2 (with `<|unk|>` and `<|endoftext|>`)
4. **BPE tokenization** - Using the Tiktoken .NET library for GPT-2 compatible tokenization
5. **Sliding window data sampling** - Creating input/target pairs for next-word prediction
6. **Token embeddings** - Using TorchSharp embedding layers
7. **Positional embeddings** - Adding position information to token embeddings