> NLP and LLM Terms

In [None]:

# Corpus
    # A collection of text documents used to train a language model. The corpus can be a collection of books, articles, or any text data.
    
    
# Vocabulary
    # The set of unique tokens (words, sub-words, or characters) that a model can understand. The vocabulary is typically derived from
    # the training corpus and includes common words and special tokens like [PAD], [UNK], [CLS], and [SEP].
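
    # A minimal sketch of mapping tokens to ids through a vocabulary. The vocabulary, its ids, and the `encode` helper are 
    # hypothetical and chosen only for illustration; real tokenizers differ in detail.

```python
# Hypothetical toy vocabulary: tokens and ids are made up for illustration.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "the": 4, "cat": 5, "sat": 6}


def encode(tokens, vocab):
    """Map tokens to ids, using [UNK] for anything outside the vocabulary."""
    unk_id = vocab["[UNK]"]
    body = [vocab.get(token, unk_id) for token in tokens]
    # Wrap the sequence in [CLS] ... [SEP], as BERT-style models do.
    return [vocab["[CLS]"]] + body + [vocab["[SEP]"]]


ids = encode(["the", "cat", "meowed"], vocab)
print(ids)  # "meowed" is not in the vocabulary, so it maps to the [UNK] id
```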
    
    
# Attention Mechanism
    # A mechanism that lets the model weigh every token in a sequence by its relevance to the token being processed, enabling 
    # the model to capture context and handle long-range dependencies.
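
    # A rough sketch of one common form, scaled dot-product attention as used in Transformers, written in plain Python. 
    # The function names and the toy vectors are assumptions for illustration, not a production implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Self-attention on three hypothetical 2-dimensional token vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
print(out)
```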


# Tokens
    # The smallest unit of input or output the model processes. Tokens can represent words, sub-words, or even characters, depending 
    # on the tokenization strategy.
    
    
# Tokenization
    # The process of converting raw text into tokens (usually words, sub-words, or characters) that the model can process. 
    # Tokenizers split text using a vocabulary learned by an algorithm such as Byte Pair Encoding or WordPiece.
    # Byte Pair Encoding (BPE):
        # A tokenization algorithm that starts from individual characters and iteratively merges the most frequent pair of 
        # adjacent symbols in a corpus, building a vocabulary of variable-length tokens. BPE is widely used in NLP tasks, 
        # including machine translation and text generation.
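
    # A small sketch of the BPE merge loop. The toy corpus, its word frequencies, and the helper names are hypothetical; 
    # production tokenizers add details such as end-of-word markers and byte-level fallbacks.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent pair; ties go to the first seen
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```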


# Embeddings
    # Dense vector representations of tokens (words/sub-words) that capture semantic meaning. Used in LLMs to map input tokens into 
    # a continuous vector space where similar meanings are close together. Instead of treating each word as a unique, isolated token, 
    # embeddings allow words with similar meanings to be represented by vectors (arrays of numbers) that are close together 
    # in a multi-dimensional space.
    
    # Examples of embedding methods: Word2Vec, GloVe, FastText, BERT embeddings.
        # Toy illustration (made-up 4-dimensional values) for the words "dog", "cat", "apple", "banana":
        # "dog" -> [0.1, 0.2, 0.3, 0.4], 
        # "cat" -> [0.2, 0.3, 0.4, 0.5], 
        # "apple" -> [0.3, 0.4, 0.5, 0.6], 
        # "banana" -> [0.4, 0.5, 0.6, 0.7]
        # The vectors for "dog" and "cat" are closer together than those for "dog" and "apple" because "dog" and "cat" are 
        # semantically similar (both animals), while "dog" and "apple" belong to different categories.
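
        # The distances in the toy example can be checked with a few lines of Python. Euclidean distance is used here because 
        # it matches the "closer together" wording; cosine similarity is the more common choice for real embeddings. The 
        # `euclidean` helper is an assumption for illustration.

```python
import math

# Toy vectors copied from the example above (made-up values).
embeddings = {
    "dog": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.2, 0.3, 0.4, 0.5],
    "apple": [0.3, 0.4, 0.5, 0.6],
    "banana": [0.4, 0.5, 0.6, 0.7],
}


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


dog_cat = euclidean(embeddings["dog"], embeddings["cat"])
dog_apple = euclidean(embeddings["dog"], embeddings["apple"])
print(f"dog-cat: {dog_cat:.2f}, dog-apple: {dog_apple:.2f}")
```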
        
        
# Part-of-Speech (POS) Tagging
    # Assigning each word in a sentence a grammatical category (e.g., noun, verb, adjective). Helps the model understand the 
    # structure of sentences and the role of each word, which is useful for tasks like parsing, translation, and question answering.


# Named Entity Recognition (NER)
    # Identifying and categorizing entities (names, dates, locations, organizations, etc.) in text.


# Stemming and Lemmatization
    # Reducing words to their base or root form. Stemming is a rule-based process that strips prefixes or suffixes and may produce 
    # a non-word (e.g., "studies" -> "studi"), while lemmatization uses a vocabulary and morphological analysis to return the 
    # dictionary form of a word (e.g., "studies" -> "study").
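
    # A toy contrast between the two ideas. The suffix list and the lemma table are hypothetical and far smaller than what a 
    # real stemmer or lemmatizer uses; the point is only rule-based stripping versus dictionary lookup.

```python
# Hypothetical suffix rules, checked in order.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical lemma dictionary standing in for vocabulary plus morphological analysis.
LEMMAS = {"running": "run", "ran": "run", "better": "good", "studies": "study"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["running", "studies", "ran"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```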
    

# Sampling Techniques
    # Decoding methods used to generate outputs from a model's predicted token probabilities, such as greedy search (selecting the 
    # most likely next token), beam search (tracking several candidate sequences in parallel), and temperature sampling (rescaling 
    # the probabilities to control randomness before sampling).


# Beam Search
    # A search strategy used during text generation that keeps the k most probable partial sequences (the beam width) at each step 
    # and returns the highest-scoring one. It often finds higher-probability outputs than greedy search, at a higher compute cost.
    
    
# Greedy Search
    # A simpler search method where the model always selects the most probable next token. It is fast but may lead to less coherent 
    # or repetitive outputs.
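
    # A sketch comparing greedy search and beam search on a tiny hand-made model. The `NEXT` table of next-token probabilities 
    # is hypothetical, built so that the locally best first token leads to a lower-probability sequence overall.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token.
NEXT = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"z": 0.9, "w": 0.1},
}


def greedy(start, steps):
    """Always take the single most probable next token."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        token, p = max(NEXT[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(p)
    return seq, logp


def beam_search(start, steps, beam_width):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + [tok], logp + math.log(p))
            for seq, logp in beams
            for tok, p in NEXT[seq[-1]].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]


greedy_seq, greedy_logp = greedy("<s>", steps=2)
beam_seq, beam_logp = beam_search("<s>", steps=2, beam_width=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))
print(beam_seq, round(math.exp(beam_logp), 2))
```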


# Autoregressive Models
    # LLMs like GPT, which generate text one token at a time, predicting the next token based on previously generated tokens. 
    # This type of model is suitable for tasks like text generation.
    
    
# Masked Language Models (MLM)
    # Models like BERT that learn by predicting masked-out tokens in a sentence, using the surrounding context. These models are 
    # bidirectional, meaning they consider context from both directions.
    
    
# Zero-Shot Learning
    # The model’s ability to perform a task without being shown any examples of it, either in task-specific training data or in the 
    # prompt. For example, a zero-shot LLM can classify text from an instruction alone, without labeled examples for that task.
    
    
# Few-Shot Learning
    # The model’s ability to generalize from only a few examples supplied in the prompt at inference time, without updating its 
    # weights. For instance, given a few sample questions and answers, it can handle similar questions effectively.
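
    # A minimal sketch of assembling a few-shot prompt. The "Q:/A:" layout, the sample pairs, and the `build_few_shot_prompt` 
    # helper are assumptions for illustration; passing an empty example list gives the zero-shot version of the same prompt.

```python
def build_few_shot_prompt(examples, query):
    """Join example question/answer pairs and the new question into one prompt."""
    lines = []
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # left open for the model to complete
    return "\n".join(lines)


# Hypothetical examples shown to the model at inference time.
examples = [("2 + 2?", "4"), ("3 + 5?", "8")]
few_shot_prompt = build_few_shot_prompt(examples, "7 + 1?")
zero_shot_prompt = build_few_shot_prompt([], "7 + 1?")
print(few_shot_prompt)
```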
    
    
# Fine-Tuning vs. Transfer Learning
    # Fine-Tuning: The process of adapting a pretrained LLM to a specific task (e.g., classification, question answering) by training 
        # it further on task-specific labeled data.
    # Transfer Learning: Leveraging knowledge from a pretrained model and applying it to a new but related task, without needing to 
        # retrain from scratch.


# Temperature Sampling
    # A technique used during text generation to control the randomness of the output by dividing the logits by a temperature T 
    # before the softmax. T = 1.0 leaves the distribution unchanged, higher values flatten it for more diverse outputs, and lower 
    # values (e.g., 0.2) sharpen it, making the model more deterministic.
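
    # A short sketch of how temperature reshapes a next-token distribution. The logits are made-up values and the helper names 
    # are assumptions for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # sharper, close to greedy
warm = softmax_with_temperature(logits, 1.0)  # the unmodified distribution

rng = random.Random(0)
token = sample(warm, rng)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```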





> LLM Training

In [None]:
# 1. PEFT + LoRA (Parameter Efficient Fine-tuning + Low-Rank Adaptation)
    # Description: Freezes the pre-trained weights and trains only small low-rank matrices added alongside selected weight matrices, 
        # conserving memory and improving efficiency.
    # Use Case: Makes fine-tuning large models feasible on limited hardware, since only a small fraction of parameters is updated.
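
    # A back-of-the-envelope sketch of the LoRA idea in plain Python: the frozen weight W (d x k) is left untouched and a 
    # low-rank product B (d x r) times A (r x k), scaled by alpha / r, is added to its output. The layer sizes, rank, and 
    # helper names are assumptions for illustration.

```python
def full_params(d, k):
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for LoRA: B is d x r and A is r x k."""
    return r * (d + k)


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


# Hypothetical 4096 x 4096 projection with rank 8.
print(full_params(4096, 4096), lora_params(4096, 4096, 8))

# With B initialised to zeros the adapted layer starts out identical to the base layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]    # r x k with r = 1
B = [[0.0], [0.0]]   # d x r, zero at initialisation
y = lora_forward(W, A, B, [1.0, 1.0], alpha=2, r=1)
print(y)
```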


# 2. Quantization-Aware Training (QAT)
    # Description: Simulates low-precision arithmetic (e.g., INT8 instead of FP32) during training or fine-tuning so the model learns 
        # weights that stay accurate after quantization.
    # Benefits: Yields a smaller, faster model at inference time, usually with less accuracy loss than quantizing after training.
    # Challenges: Adds training overhead and requires careful monitoring to ensure model quality isn’t degraded.


# 3. Gradient Checkpointing
    # Description: Saves memory by storing only selected activations during the forward pass and recomputing the rest during 
        # backpropagation.
    # Use Case: Reduces memory usage but slows down training because of the extra recomputation.
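
    # A toy sketch of the memory/compute trade-off, without any autograd: only some activations are kept during the forward 
    # pass, and a missing one is rebuilt from the nearest earlier checkpoint when needed. The layers and helper names are 
    # hypothetical; deep learning frameworks provide their own checkpointing utilities.

```python
def forward_with_checkpoints(layers, x, every):
    """Run all layers, keeping the input activation of every `every`-th layer."""
    checkpoints = {0: x}
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % every == 0 and (i + 1) < len(layers):
            checkpoints[i + 1] = x
    return x, checkpoints


def recompute_activation(layers, checkpoints, index):
    """Rebuild the input to layer `index` from the nearest earlier checkpoint."""
    start = max(i for i in checkpoints if i <= index)
    x = checkpoints[start]
    for layer in layers[start:index]:
        x = layer(x)
    return x


# Eight hypothetical "layers", each a simple affine function of its input.
layers = [lambda v, k=k: v * 2 + k for k in range(8)]

output, checkpoints = forward_with_checkpoints(layers, 1, every=4)
rebuilt = recompute_activation(layers, checkpoints, 6)
print(sorted(checkpoints), rebuilt)
```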


# 4. Distributed Training
    # Description: Splits the model, the data, or both across multiple devices or nodes to train faster or to fit models that 
        # exceed a single device's memory.
    # Key Techniques:
        # FSDP (Fully Sharded Data Parallel): Shards model parameters, gradients, and optimizer states across devices.
        # DeepSpeed Zero Redundancy Optimizer (ZeRO): Partitions optimizer states, gradients, and parameters across devices (in 
        # progressive stages) to remove redundant copies and save memory.


> LLM Inference

In [None]:
# 1. Post-Training Quantization (PTQ)
    # Description: Quantizes a model’s weights and activations after training to reduce memory usage.
    # Use Case: Reduces memory footprint for serving models at lower precision (e.g., FP32 → INT8).
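
    # A minimal sketch of the arithmetic behind weight quantization, using symmetric per-tensor INT8 with a single scale. 
    # The weights and helper names are made up for illustration; real toolkits also calibrate activations and often use 
    # per-channel scales.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.031, 1.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```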


# 2. Distributed Inference
    # Description: Partitions model weights across multiple devices to serve models too large for a single device.
    # Techniques:
        # Model Partitioning: Divides a large model across multiple GPUs or nodes, for example by layer or within layers.
        # In-flight Batching: Starts processing new requests while others are still being computed, improving GPU utilization.


# 3. Dynamic Batching & Continuous Batching
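    # Description: Dynamic batching groups requests that arrive close together into one batch. Continuous batching goes further and 
        # schedules at the level of individual generation steps, so finished sequences leave the batch and waiting requests join 
        # without delay. Both aim to maximize GPU utilization.
    # Benefits: Higher throughput and lower queueing delay, especially when requests vary in input and output length.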
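
    # A toy simulation of why refilling batch slots at every generation step helps. Each request is reduced to the number of 
    # decode steps it needs; the request lengths, the fixed-batch baseline, and the function names are assumptions for 
    # illustration and ignore prefill cost and memory limits.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Refill a slot as soon as its request finishes (step-level scheduling)."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request generates one token; finished ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [8, 2, 2, 2]  # hypothetical decode steps needed per request
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```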

> Optimization Techniques

In [None]:
# 1. TensorRT-LLM
    # Description: Optimizes models with kernel fusion and memory techniques like KV caching, PagedAttention, and FlashAttention.
    # Benefits: Improves performance but typically requires compiling the model into a TensorRT engine before use.


# 2. vLLM
    # Description: An inference engine that uses PagedAttention, which stores the KV cache in fixed-size blocks to reduce memory 
        # fragmentation and waste, improving throughput.
    # Benefits: Fits more concurrent requests into the same GPU memory than contiguous KV-cache allocation.


# 3. DeepSpeed-FastGen
    # Description: An inference serving system that combines DeepSpeed-MII and DeepSpeed-Inference for fast, efficient model serving.
    # Key Features: Supports Dynamic SplitFuse batching, improving latency and throughput for large models.
    

# Key Considerations
    # Memory Constraints: LLM training and inference are memory-intensive processes. Techniques like PEFT, quantization, and gradient 
        # checkpointing can help mitigate memory limitations.
    # Model Size: Models with billions of parameters may require distributed training or inference strategies to handle the memory demands.
    # Efficiency: Methods like mixed precision, distributed training, and dynamic batching are key to improving efficiency in 
        # training and inference.
    # Latency: Techniques like dynamic batching and continuous batching can help reduce inference latency, especially for real-time 
        # applications.
    # Throughput: Distributed inference and model partitioning can improve throughput by leveraging multiple devices for 
        # parallel processing.
    # Resource Optimization: Tools like TensorRT-LLM and vLLM optimize memory usage and improve performance for large models.
    # Scalability: Distributed training and inference methods enable scaling LLMs to handle larger models and datasets efficiently.
    # Model Serving: Systems like DeepSpeed-FastGen provide solutions for serving large language models efficiently.
    # Performance Trade-offs: Quantization and distributed strategies may impact model accuracy, so careful monitoring and tuning 
        # are essential to maintain performance.

