Imagine you have a bag full of possible words, each with a certain chance of being picked. **Sampling** is the process by which a language model chooses its next word (or token) from this bag. The model assigns a weight (logit) to each possible token in its vocabulary, and these logits are then converted into probabilities. By selectively choosing from these probabilities, we can control how “creative” or “focused” a model’s responses will be.

- **Logits and Probabilities**  
    A language model produces a *logit vector*, where each element corresponds to how strongly the model “wants” to pick a particular token. These logits get converted to probabilities via the **softmax** function, turning raw scores into a probability distribution.


- **Sampling**  
    Once we have a probability distribution over all possible tokens, we randomly select the next token according to those probabilities. This randomness can lead to more varied and interesting outputs than if we always picked the single most likely token.

- **Log Probabilities (Logprobs)**  
    Rather than storing probabilities directly, language models often return their log probabilities (logprobs). This is useful because probabilities can become extremely small, especially when the probabilities of many tokens are multiplied together to score a sequence, leading to the **underflow** problem where very small numbers get rounded down to zero in standard floating-point representations.
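
The underflow problem can be seen in a few lines. The sketch below assumes a made-up sequence of 400 tokens, each with probability 0.01; the numbers are illustrative and do not come from any particular model.

```python
import math

# Hypothetical per-token probabilities for a 400-token sequence.
token_probs = [0.01] * 400

# Multiplying raw probabilities underflows to exactly 0.0 in float64.
product = 1.0
for p in token_probs:
    product *= p

# Summing log probabilities carries the same information in a safe range.
total_logprob = sum(math.log(p) for p in token_probs)

print(product)        # 0.0
print(total_logprob)  # roughly -1842.07
```

The product is mathematically $10^{-800}$, far below the smallest positive float64 value, while the summed logprob is an ordinary number.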

### Introduction of Key Concepts

1. **Softmax Operation**  
   At each step, the model generates a logit $z_i$ for every token $i$. To turn these into probabilities $p_i$, we apply the softmax function:

   $$
   p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}},
   $$

   where $p_i$ is the probability of choosing token $i$. Logits can be any real number: a larger logit yields a higher probability, and a negative logit still yields a small positive probability.

2. **Temperature**  
   A parameter called *temperature* ($T$) can scale the logits before applying the softmax. Formally:

   $$
   p_i = \frac{\exp\left(\frac{z_i}{T}\right)}{\sum_{j} \exp\left(\frac{z_j}{T}\right)}.
   $$

   - If $T < 1$, the distribution becomes “sharper” (less random).  
   - If $T > 1$, the distribution becomes “flatter” (more random).

3. **Top-k Sampling**  
   Instead of considering every token in the vocabulary, *top-k* sampling selects only the $k$ tokens with the highest logits. We then apply softmax *only* to those top $k$ tokens. For instance, if $k=50$, the sampler only draws from the 50 most likely tokens. This prunes the long tail of very unlikely tokens; any savings in compute are minor, since the model still scores the full vocabulary.

4. **Top-p (Nucleus) Sampling**  
   In *top-p* sampling, also known as **nucleus sampling**, we order tokens by their probability and include the smallest number of tokens whose cumulative probability reaches $p$ (e.g., 0.9). The kept probabilities are then renormalized before sampling. This method dynamically adapts to different contexts: if the model is very confident about a small set of next words, it restricts itself to those; otherwise, it considers a broader set.
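
The four concepts above can be sketched together in plain Python. This is a minimal illustration, not any library's implementation: the function names (`softmax`, `top_k_indices`, `top_p_indices`, `sample`) and the four-token toy vocabulary are assumptions made for the example, and real frameworks may apply the filters in a different order.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

def top_p_indices(probs, p):
    """Smallest set of tokens, by descending probability, whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def sample(logits, temperature=1.0, k=None, p=None, rng=random):
    """Sample one token index after optional top-k and top-p filtering."""
    candidates = list(range(len(logits)))
    if k is not None:
        candidates = top_k_indices(logits, k)
    probs = softmax([logits[i] for i in candidates], temperature)
    if p is not None:
        keep = top_p_indices(probs, p)
        candidates = [candidates[i] for i in keep]
        # Re-applying softmax to the kept logits renormalizes their mass to 1.
        probs = softmax([logits[i] for i in candidates], temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy vocabulary of four tokens
print(softmax(logits))                       # baseline distribution
print(softmax(logits, temperature=0.5))      # sharper: more mass on token 0
print(top_k_indices(logits, 2))              # [0, 1]
print(top_p_indices(softmax(logits), 0.9))   # [0, 1, 2]
print(sample(logits, temperature=0.7, k=3, p=0.9))
```

With these logits the baseline probabilities are roughly 0.61, 0.22, 0.14 and 0.03, so the first two tokens cover about 83% of the mass and a third is needed to reach $p = 0.9$.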

---

## Mathematical Formulation (For Binary Classification)  

Large language models (LLMs) are fundamentally **token prediction engines** - they calculate probabilities for every possible next word/token. In creative tasks, we want diversity, but in classification, we need **focused decision-making**. Top-k and top-p sampling act as "probability filters" to constrain the model's choices to only the most relevant candidates.

### Real-World Analogy:  
Imagine sorting through a bag of Scrabble tiles:  
- **Without constraints**: You might accidentally pick irrelevant letters (e.g., "Q" when you need simple English words)  
- **With top-k**: Only consider the 5 most common English letters  
- **With top-p**: Keep adding letters, most common first, until the chosen set covers 90% of the letters you expect to need  

### Foundation:  
Given prompt $p(x)$, the model produces logits $z_i$ for all tokens $i \in V$. For binary classification, we care about two special tokens:  
$$ V_{\text{class}} = \{0, 1\} $$

**Key Insight**:  
We want to restrict sampling *only* to $V_{\text{class}}$ while preserving relative probabilities:  

$$
P(y|x) = \begin{cases}  
\frac{\exp(z_y/T)}{\sum_{j \in S} \exp(z_j/T)} & \text{if } y \in S \\  
0 & \text{otherwise}  
\end{cases}
$$

where $S$ is our constrained token set determined by:  
- **Top-k**: $S =$ top $k$ tokens by $z_i$  
- **Top-p**: $S =$ minimal set where $\sum_{j \in S} P(j|x) \geq p$  
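
The restricted distribution can also be computed directly when the raw logits are available. The sketch below sets $S = V_{\text{class}}$ by hand rather than deriving it from top-k or top-p; the function name `restricted_distribution` and the logit values are hypothetical, and access to per-token logits is itself an assumption that holds for locally run models but not for every hosted API.

```python
import math

def restricted_distribution(logits, allowed, temperature=1.0):
    """P(y|x) renormalized over the allowed set S; tokens outside S get 0."""
    scaled = {tok: z / temperature for tok, z in logits.items() if tok in allowed}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: exps.get(tok, 0.0) / total for tok in logits}

# Hypothetical logits at the answer position.
logits = {"0": 3.0, "1": 2.0, "Maybe": 1.5, "The": 0.5}
probs = restricted_distribution(logits, allowed={"0", "1"})
print(probs)
```

Here "0" ends up with about 0.73 and "1" with about 0.27: the ratio between them is the same $e^{1}$ as in the unrestricted softmax, while "Maybe" and "The" receive zero.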


## Case Study: Business Record Linkage  

### Task Setup:  
- **Input**: Two business names ("Starbucks Coffee", "Starbucks Corp")  
- **Output**: 0 (same entity) or 1 (different entities)  
- **Prompt**:  

```
Determine if these businesses are the same entity. Answer only 0 or 1.
Business 1: Starbucks Coffee
Business 2: Starbucks Corp
Answer: ___
```


### Sampling Dynamics:  

| Scenario | Top-k (k=2) | Top-p (p=0.95) |  
|----------|-------------|----------------|  
| **Confident Match** (Model thinks 0: 95%, 1: 5%) | Forces selection between {0,1} | Keeps only {0} (cumulative 95% ≥ p) |  
| **Uncertain Case** (0: 52%, 1: 48%) | Still samples from {0,1} | Includes both tokens (52%+48%=100% ≥ p) |  
| **Noisy Edge Case** (0: 40%, 1: 35%, "Maybe": 25%) | Excludes "Maybe" | Keeps all three: 0+1 = 75% < p, so "Maybe" is needed to reach 95% |  
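
The candidate sets for these three scenarios can be checked with a short script. It uses the $\geq p$ rule from the formulation above; the helper names `top_k_set` and `top_p_set` are made up for this example, and the small tolerance is only there to absorb floating-point rounding.

```python
def top_k_set(probs, k):
    """The k most probable tokens, highest first."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def top_p_set(probs, p):
    """Smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, cumulative = [], 0.0
    for tok in ranked:
        kept.append(tok)
        cumulative += probs[tok]
        if cumulative >= p - 1e-12:  # tolerance for float rounding
            break
    return kept

scenarios = {
    "confident": {"0": 0.95, "1": 0.05},
    "uncertain": {"0": 0.52, "1": 0.48},
    "noisy": {"0": 0.40, "1": 0.35, "Maybe": 0.25},
}
for name, probs in scenarios.items():
    print(name, top_k_set(probs, 2), top_p_set(probs, 0.95))
# confident ['0', '1'] ['0']
# uncertain ['0', '1'] ['0', '1']
# noisy ['0', '1'] ['0', '1', 'Maybe']
```

In the noisy case only top-k removes "Maybe". Top-p has to keep it, because the two class tokens alone hold 75% of the mass.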

## Why This Matters for Classification  

### Critical Advantages:  
1. **Stochastic Certainty Control**:  
   - Low temperature + top-p creates "sharpened" distributions:  
   $$ T=0.3 \Rightarrow P(0) = \frac{\exp(z_0/0.3)}{\exp(z_0/0.3)+\exp(z_1/0.3)} $$  
   Makes confident decisions more decisive  

2. **Out-of-Class Rejection**:  
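   Top-p drops low-probability tokens outside the class set, which helps when:  
   - The model gets distracted (e.g., starts generating explanations)  
   - There are spelling variations ("zero" vs "0")  

   This is not a guarantee: an off-class token that holds substantial probability mass can still fall inside the nucleus.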

3. **Calibration Preservation**:  
   Unlike argmax (which loses probability information), sampling preserves relative likelihoods between 0 and 1 while filtering irrelevant options.
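
As a worked example of the sharpening effect, assume a hypothetical logit gap of $z_0 - z_1 = 1$ between the two class tokens. Dividing numerator and denominator by $\exp(z_0/T)$ shows that the two-class probability depends only on that gap:

$$
P(0) = \frac{\exp(z_0/T)}{\exp(z_0/T)+\exp(z_1/T)} = \frac{1}{1+\exp\left(-\frac{z_0 - z_1}{T}\right)}
$$

$$
T = 1 \Rightarrow P(0) = \frac{1}{1+e^{-1}} \approx 0.731, \qquad T = 0.3 \Rightarrow P(0) = \frac{1}{1+e^{-1/0.3}} \approx 0.966
$$

The ordering of the two labels is unchanged, but the gap between them widens. Probabilities read off at $T < 1$ are therefore sharper than the model's own estimate at $T = 1$.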

## Implementation Strategy  

### Optimal Configuration:  
For binary classification:  
```python
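# Key names vary by library (for example max_tokens vs max_new_tokens).
generation_config = {
    "temperature": 0.3,  # Sharpens the distribution toward the likelier label
    "top_k": 2,          # Keeps the two highest-scoring tokens (ideally "0" and "1")
    "top_p": 0.99,       # Trims the tail further when one token already dominates
    "max_tokens": 1,     # Strict single-token output
}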
```

### Advanced Sampling Strategies for Robust LLM Classification  
*Focused on Binary Entity Matching via {0,1} Token Sampling*
 

**Majority Voting (Plurality Sampling)**: Generate multiple samples and take the most frequent answer. This reduces the variance and errors of individual samples, at the cost of $n$ times the compute. Because the output is 0/1, a majority vote over several samples smooths out uncertain cases.
- **Mechanism**:  
  Generate $n$ independent samples from $P(y|p(x))$ and take the most frequent answer:  
  $$ \hat{y} = \text{mode}\{y^{(1)}, y^{(2)}, ..., y^{(n)}\} $$  
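
A minimal sketch of the voting step follows. The `fake_model_call` function is a hypothetical stand-in for one sampled model call and simply returns "0" about 70% of the time; in practice it would be replaced by whatever function issues the classification prompt.

```python
import random
from collections import Counter

def majority_vote(sample_fn, n):
    """Draw n samples and return (most frequent label, its vote share)."""
    votes = Counter(sample_fn() for _ in range(n))
    label, count = votes.most_common(1)[0]
    return label, count / n

# Hypothetical stand-in for a single sampled model call.
rng = random.Random(0)
def fake_model_call():
    return "0" if rng.random() < 0.7 else "1"

label, share = majority_vote(fake_model_call, n=11)
print(label, share)
```

Choosing an odd $n$ avoids ties as long as every sample is one of the two labels, and the vote share gives a rough confidence signal alongside the label.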


**Nucleus (Top-p) Sampling with Temperature Scaling**: Lowering the temperature sharpens the distribution, and combining it with top-p focuses sampling on the relevant tokens. A possible extension is to adjust the temperature dynamically based on the model's confidence.


**Log Probability Thresholding**: Use the model's confidence (log probabilities) to accept or reject predictions. If the log probability of the chosen label falls below a threshold, reject the sample or flag it for review. This helps in uncertain cases.
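
The thresholding rule can be sketched as follows, assuming the logprobs of the two class tokens are available as a dictionary. The function name `classify_with_threshold` and the 0.9 cutoff are illustrative choices, not values taken from any source.

```python
import math

def classify_with_threshold(logprobs, min_prob=0.9):
    """Return the top label, or None when its probability is below min_prob."""
    label = max(logprobs, key=logprobs.get)
    if logprobs[label] < math.log(min_prob):
        return None  # caller can flag this pair for manual review
    return label

confident = {"0": math.log(0.97), "1": math.log(0.03)}
uncertain = {"0": math.log(0.55), "1": math.log(0.45)}
print(classify_with_threshold(confident))  # 0
print(classify_with_threshold(uncertain))  # None
```

Comparing in log space avoids exponentiating the logprobs; the threshold is simply `math.log(min_prob)`.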


**Monte Carlo Dropout**: Keep dropout active at inference time and sample multiple times to estimate uncertainty. This requires control over the model's forward pass, so it applies to self-hosted models.

## References

1. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). “[The Curious Case of Neural Text Degeneration](https://openreview.net/forum?id=rygGQyrFvH).” *International Conference on Learning Representations (ICLR)*.  
2. Radford, A., Wu, J., Child, R., et al. (2019). “[Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).” *OpenAI*.

---

## **When to Use LLMs vs Traditional Classifiers**  
### **LLM Sweet Spot**:  
- **Zero/Few-Shot Complexity**: Tasks requiring world knowledge (e.g., "Is 'Apollo Therapeutics' a biotech startup?")  
- **Multi-Modal Context**: Classifying product listings using both images and text descriptions  
- **Dynamic Taxonomies**: Frequently changing categories (e.g., news topic classification with emerging trends)  

### **Traditional Classifiers Preferred When**:  
- Stable label schema with abundant training data  
- Latency-sensitive applications (LLM inference can be 10-100x slower, depending on model size and hardware)  
- Strict regulatory constraints (LLMs are harder to explain)
