### GPT-2: A Mathematical Overview

**Introduction**:
GPT-2 (Generative Pre-trained Transformer 2) is an autoregressive language model that generates text one token at a time, each token conditioned on the tokens before it. It is a decoder-only variant of the Transformer architecture: a stack of blocks that combine self-attention with feed-forward neural networks.

**1. Architectural Framework**:
At its core, GPT-2 employs the Transformer architecture, which consists of several key components:

- **Layers**: The model consists of a stack of multiple transformer blocks, each containing a multi-head self-attention mechanism and a feed-forward neural network.

- **Multi-Head Self-Attention**: This mechanism lets the model weigh the importance of the tokens in a sequence with respect to one another. For an input embedding matrix \( X \) (one row per token), multi-head attention is expressed as:
  
  $$
  \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
  $$

  where each head is computed as:
  
  $$
  \text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)
  $$
  
  Here, \( W_i^Q, W_i^K, W_i^V \) are learnable projection matrices for queries, keys, and values. The attention function, applied to each head, is defined as:

  $$
  \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V
  $$

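  where \( d_k \) is the dimensionality of the keys. In GPT-2 the attention is causal: the scores for positions \( j > i \) are set to \( -\infty \) before the softmax, so token \( i \) attends only to itself and to earlier tokens.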

- **Positional Encoding**: Since self-attention does not inherently capture token order, position information is added to the input embeddings. GPT-2 itself learns a position embedding matrix with one trainable vector per position; the fixed sinusoidal encoding of the original Transformer, shown here for reference, is:

  $$
  PE_{(pos, 2i)} = \sin\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)
  $$
  
  $$
  PE_{(pos, 2i+1)} = \cos\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)
  $$
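
The formulas in the list above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the NNTile implementation: the function names, the per-head weight layout `(h, d_model, d_k)`, and the random toy weights are all hypothetical, and `sinusoidal_encoding` assumes an even `d_model`.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Mask the strict upper triangle: position i sees only positions j <= i.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def multi_head(X, W_q, W_k, W_v, W_o, causal=True):
    """Concat(head_1, ..., head_h) W^O with per-head projection matrices."""
    heads = [
        attention(X @ wq, X @ wk, X @ wv, causal)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o

def sinusoidal_encoding(n_positions, d_model):
    """Fixed sin/cos encoding of the original Transformer (d_model must be even)."""
    pos = np.arange(n_positions)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / 10000 ** (two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model)) + sinusoidal_encoding(n, d_model)
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 8)
```

With `causal=True`, changing the last token leaves the outputs of all earlier positions unchanged, which is the property the next-token training objective relies on.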

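**2. Multilayer Perceptron**:
Each transformer block includes an MLP block, which applies two linear transformations with a non-linear activation function between them. The original Transformer uses ReLU; GPT-2 uses GELU:

$$
\text{MLP}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2
$$

where \( W_1, W_2 \) are weight matrices and \( b_1, b_2 \) are biases. In GPT-2 the hidden layer is four times wider than the model dimension, so \( W_1 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}} \) and \( W_2 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}} \).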
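
A minimal NumPy sketch of the MLP block follows. The activation is passed as a parameter: `relu` corresponds to \( \max(0, \cdot) \), and `gelu_tanh` is the tanh approximation of GELU, the activation used in GPT-2's released implementation. The function names, toy sizes, and random weights are assumptions made for illustration.

```python
import numpy as np

def relu(x):
    """max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(inner))

def mlp(x, W1, b1, W2, b2, activation=gelu_tanh):
    """Two linear maps with a non-linearity in between."""
    return activation(x @ W1 + b1) @ W2 + b2

# Toy sizes; the hidden width is 4x the model width, as in GPT-2.
rng = np.random.default_rng(0)
d_model = 8
d_hidden = 4 * d_model
x = rng.normal(size=(5, d_model))
W1 = rng.normal(size=(d_model, d_hidden)) * 0.02
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model)) * 0.02
b2 = np.zeros(d_model)

y = mlp(x, W1, b1, W2, b2)
print(y.shape)  # (5, 8)
```

The MLP acts on each row of `x` independently, so it mixes features within a token but never across tokens; mixing across tokens is left to the attention layer.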

**3. Loss Function and Training**:
GPT-2 is trained as a causal language model: the objective is to predict the next token in a sequence given the preceding context. Training maximizes the likelihood of the data, which is equivalent to minimizing the cross-entropy loss:

$$
\mathcal{L} = -\sum_{t=1}^{n} \log P(w_t | w_1, w_2, \ldots, w_{t-1})
$$

where \( P(w_t | w_1, w_2, \ldots, w_{t-1}) \) denotes the probability the model assigns to the token \( w_t \) conditioned on the previous tokens in the sequence.
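
As a worked example of this loss, the sketch below computes it from per-step logits in plain Python. The helper names and the toy logits are hypothetical; in a real model the logits at step \( t \) come from the network's output for the prefix \( w_1, \ldots, w_{t-1} \).

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw scores, using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def causal_lm_loss(logits_per_step, targets):
    """Sum over t of -log P(w_t | w_1, ..., w_{t-1})."""
    return -sum(
        log_softmax(logits)[target]
        for logits, target in zip(logits_per_step, targets)
    )

vocab_size = 4
targets = [2, 0, 1]

# A model with no preference assigns probability 1/vocab_size to every token.
uniform_logits = [[0.0] * vocab_size for _ in targets]
uniform_loss = causal_lm_loss(uniform_logits, targets)

# A model that puts a high score on the correct token at every step.
confident_logits = [
    [5.0 if k == t else 0.0 for k in range(vocab_size)] for t in targets
]
confident_loss = causal_lm_loss(confident_logits, targets)

print(uniform_loss, len(targets) * math.log(vocab_size))
print(confident_loss)
```

The uniform case equals \( n \log |V| \), a useful reference value: a training loss per token near \( \log |V| \) means the model has learned nothing yet.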

**4. Pre-training and Fine-tuning**:
GPT-2 is pre-trained on a large corpus of web text using the next-token objective above, which requires no manual labels. Pre-training lets the model learn general language patterns. The model can subsequently be fine-tuned on a specific task or dataset to adapt it to a particular application, such as text generation, summarization, or dialogue generation.

### Conclusion
In summary, GPT-2 represents a significant advancement in the field of natural language processing, combining sophisticated mathematical constructs with deep learning techniques. Its architecture, characterized by self-attention mechanisms and feed-forward networks, allows it to generate human-like text based on contextual cues, making it a powerful tool for a variety of applications in language understanding and generation.

In [None]:
# Preliminary setup of execution environment
import os
from pathlib import Path
import subprocess

nntile_dir = Path.cwd() / ".."

# Set environment variables
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Limit CUDA visibility
os.environ["OMP_NUM_THREADS"] = "1" # Disable BLAS parallelism
os.environ["PYTHONPATH"] = str(nntile_dir / "build" / "wrappers" / "python") # Path to a binary dir of NNTile Python wrappers

# All StarPU environment variables are available at https://files.inria.fr/starpu/doc/html/ExecutionConfigurationThroughEnvironmentVariables.html
os.environ["STARPU_NCPU"] = "1" # Use only 1 CPU core
os.environ["STARPU_NCUDA"] = "1" # Use only 1 CUDA device
os.environ["STARPU_SILENT"] = "1" # Do not show lots of StarPU outputs
os.environ["STARPU_SCHED"] = "dmdasd" # Name StarPU scheduler to be used
os.environ["STARPU_FXT_TRACE"] = "0" # Do not generate FXT traces
os.environ["STARPU_WORKERS_NOBIND"] = "1" # Do not bind workers (it helps if several instances of StarPU run in parallel)
os.environ["STARPU_PROFILING"] = "1" # This enables logging performance of workers and bandwidth of memory nodes
os.environ["STARPU_HOME"] = str(Path.cwd() / "starpu") # Main directory in which StarPU stores its configuration files
os.environ["STARPU_PERF_MODEL_DIR"] = str(Path(os.environ["STARPU_HOME"]) / "sampling") # Main directory in which StarPU stores its performance model files
os.environ["STARPU_PERF_MODEL_HOMOGENEOUS_CPU"] = "1" # Assume all CPU cores are equal
os.environ["STARPU_PERF_MODEL_HOMOGENEOUS_CUDA"] = "1" # Assume all CUDA devices are equal
os.environ["STARPU_HOSTNAME"] = "GPT2_example" # Force the hostname to be used when managing performance model files
os.environ["STARPU_FXT_PREFIX"] = str(Path(os.environ["STARPU_HOME"]) / "fxt") # Directory to store FXT traces if enabled

# NNTile-related
os.environ["NNTILE_LOGGER"] = "1" # Enable logger
os.environ["NNTILE_LOGGER_SERVER_ADDR"] = "127.0.0.1" # Logger will be running on the localhost
os.environ["NNTILE_LOGGER_SERVER_PORT"] = "5001" # Port for logger server
os.environ["NNTILE_LOGGER_CLIENT_PORT"] = "6006" # Port for client web interface of the logger
os.environ["NNTILE_LOGGER_SERVER_DIR"] = str(Path.cwd() / "logs") # Directory to store logs on the logger server
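
The settings above take effect because values written to `os.environ` are inherited by processes started afterwards, including the `!python ...` commands later in this notebook. Note that setting `PYTHONPATH` this way does not change `sys.path` of the interpreter that is already running. A small sketch of the inheritance, using a hypothetical variable name `DEMO_NNTILE_VAR` that exists only for this illustration:

```python
import os
import subprocess
import sys

# Hypothetical variable, used only to demonstrate inheritance.
os.environ["DEMO_NNTILE_VAR"] = "visible"

# A child process started afterwards sees the value.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('DEMO_NNTILE_VAR'))"],
    capture_output=True,
    text=True,
    check=True,
)
child_value = child.stdout.strip()
print(child_value)  # visible

# Clean up so the demo variable does not leak into later cells.
os.environ.pop("DEMO_NNTILE_VAR", None)
```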

In [None]:
# Launch logger if needed
if os.getenv("NNTILE_LOGGER", "0") != "0":
    logger_env = os.environ.copy()
    logger_env.update({
        "LOG_DIR": os.getenv("NNTILE_LOGGER_SERVER_DIR"),
        "SPLIT_HOURS": "720",
        "CLEAR_LOGS": "0",
        "SERVER_PORT": os.getenv("NNTILE_LOGGER_SERVER_PORT")
    })
    logger_proc = subprocess.Popen(["python", nntile_dir / "logger" / "server.py"], env=logger_env)
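
A process started with `subprocess.Popen` keeps running after the cell finishes, so it is worth stopping it once training is done. The sketch below shows one way to do that; `stop_process` is a hypothetical helper, and the sleeping child stands in for the logger server. In this notebook the call would be `stop_process(logger_proc)`.

```python
import subprocess
import sys

def stop_process(proc, timeout=5.0):
    """Terminate a child process, escalating to kill if it does not exit in time."""
    if proc.poll() is not None:
        return proc.returncode  # already finished
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()

# Stand-in for a long-running server process.
demo_proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(demo_proc)
print(demo_proc.poll() is not None)  # True
```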

In [None]:
# Prepare TinyStories dataset into train.bin file
# Describe parameters and arguments
!python ../wrappers/python/examples/causal_lm_data_preparation.py --seq-len=1024 --batch-size=1024 --dataset-select=25000

The training script `gpt2_lmhead_training.py`, launched in the last cell of this notebook, accepts the following arguments:

remote_model_name, (`str`, default=`"openai-community/gpt2"`): This parameter specifies the name of the GPT-2 based model that resides within the HuggingFace infrastructure and will be utilized to initialize the configuration and the initial state of the NNTile model.

pretrained, (choices=`["local", "remote"]`, default=`"local"`): This flag indicates the location of the pretrained model, with the `local` option requiring a configuration path (`config-path`) to start training from a randomly initialized state unless the checkpoint (`checkpoint-path`) is provided, in which case training continues from the last saved checkpoint state.

checkpoint-path, (`str`, default=`""`): This refers to the file path where a saved checkpoint can be found, allowing for the resumption of training from a specific point if available.

config-path, (`str`, default=`""`): This denotes the path to the configuration .json file that must be provided in the current version if the `pretrained` parameter is set to `"local"`.

save-checkpoint-path, (`str`, default=`".model"`): This parameter specifies the directory path where intermediate checkpoints will be saved during the training process for future reference.

optimizer, (choices=`["sgd", "adam", "adamw"]`, default=`"adam"`): This defines the type of optimizer that will be employed during the training process; the current version of NNTile supports three distinct optimization methods.

model-path, (`str`, default=`".model"`): This indicates the directory path where previously loaded remote models are stored, facilitating easy access for further use.

seq-len, (`int`, default=`1024`): Sequence length in tokens.

seq-len-tile, (`int`, default=`-1`): Tile size along the sequence dimension.

batch-size, (`int`, default=`1`): Batch size for training using NNTile.

minibatch-size, (`int`, default=`-1`): Minibatch size for training using NNTile; the default value `-1` means it equals `batch-size`.

dtype, (choices=`["fp32", "fp64", "tf32", "bf16", "fp32_fast_fp16", "fp32_fast_bf16"]`, default=`"fp32"`): This parameter outlines the various data types supported by NNTile, allowing users the flexibility to choose based on their model requirements.

restrict, (choices=`["cpu", "cuda", None]`, default=`None`): This option restricts the computational resources used during training; `"cpu"` restricts training to CPU cores, `"cuda"` restricts it to GPU devices, and `None` allows training across all available resources.
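
For illustration, a subset of the arguments listed above can be expressed with `argparse`. This is a sketch reconstructed from the list only, not the actual parser of `gpt2_lmhead_training.py`; `build_parser` and `resolve_defaults` are hypothetical names.

```python
import argparse

def build_parser():
    """Illustrative parser covering a subset of the documented arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the documented arguments")
    parser.add_argument("--pretrained", choices=["local", "remote"], default="local")
    parser.add_argument("--config-path", type=str, default="")
    parser.add_argument("--optimizer", choices=["sgd", "adam", "adamw"], default="adam")
    parser.add_argument("--seq-len", type=int, default=1024)
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--minibatch-size", type=int, default=-1)
    parser.add_argument(
        "--dtype",
        choices=["fp32", "fp64", "tf32", "bf16", "fp32_fast_fp16", "fp32_fast_bf16"],
        default="fp32",
    )
    parser.add_argument("--restrict", choices=["cpu", "cuda"], default=None)
    return parser

def resolve_defaults(args):
    """Apply the documented rule: minibatch-size of -1 means batch-size."""
    if args.minibatch_size == -1:
        args.minibatch_size = args.batch_size
    return args

parser = build_parser()
defaults = resolve_defaults(parser.parse_args([]))
custom = resolve_defaults(
    parser.parse_args(["--batch-size=1024", "--minibatch-size=8", "--dtype=bf16", "--restrict=cuda"])
)
print(defaults.minibatch_size, custom.minibatch_size)  # 1 8
```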


In [None]:
# Launch an external python process to finetune a pretrained gpt2_lmhead model on TinyStories
# If logger server is launched, then TensorBoard results can be accessed at localhost:6006
!python ../wrappers/python/examples/gpt2_lmhead_training.py \
    --restrict="cuda" --pretrained=local --config-path="../wrappers/python/examples/gpt2_default_config.json" \
    --optimizer="adam" --lr=1e-4 --dtype=bf16 --nepochs=1 --batch-size=1024 --minibatch-size=8 \
    --dataset-file="tinystories/train.bin" --logger --logger-server-addr=127.0.0.1
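
The sizes in the command above imply the following token counts. This is plain arithmetic on the flag values, assuming the default `seq-len` of 1024 (the command does not override it) and assuming each batch is processed as a sequence of equally sized minibatches.

```python
seq_len = 1024        # default value of --seq-len
batch_size = 1024     # --batch-size
minibatch_size = 8    # --minibatch-size

# The batch must split evenly into minibatches for this arithmetic to hold.
assert batch_size % minibatch_size == 0

minibatches_per_batch = batch_size // minibatch_size
tokens_per_minibatch = minibatch_size * seq_len
tokens_per_batch = batch_size * seq_len

print(minibatches_per_batch)  # 128
print(tokens_per_minibatch)   # 8192
print(tokens_per_batch)       # 1048576
```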