The term prompt is used in many different ways. We define a prompt as the input text to an LLM, denoted by $x$. The LLM generates a text $y$ by maximizing the probability $\Pr(y|x)$. In this generation process, the prompt acts as the condition on which we make predictions, and it can contain any information that helps describe and solve the problem.

A prompt can be obtained using a prompt template [Liu et al., 2023a]. A template is a piece of text containing placeholders or variables, where each placeholder can be filled with specific information.
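As a minimal sketch of the idea, a template can be represented as a string with named placeholders that are filled at run time. The template text and the placeholder names `labels` and `text` below are illustrative assumptions, not taken from the cited work.

```python
# Hypothetical template: the wording and placeholder names are illustrative only.
TEMPLATE = (
    "Classify the polarity of the following text as {labels}.\n"
    "Text: {text}\n"
    "Polarity:"
)


def fill_template(template: str, **slots: str) -> str:
    """Replace each named placeholder in the template with the given value."""
    return template.format(**slots)


prompt = fill_template(
    TEMPLATE,
    labels="positive, negative, or neutral",
    text="I love this movie.",
)
print(prompt)
```

The same template can be reused for any input text by changing only the slot values.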

### In-context Learning

Learning can occur during inference. In-context learning is one such method, where prompts involve demonstrations of problem-solving, and LLMs can learn from these demonstrations how to solve new problems. Since we do not update the model parameters in this process, in-context learning can be viewed as a way to efficiently activate and reorganize the knowledge learned in pre-training without additional training or fine-tuning. This enables quick adaptation of LLMs to new problems, pushing the boundaries of what pre-trained LLMs can achieve without task-specific adjustments.

In-context learning can be illustrated by comparing three methods: zero-shot learning, one-shot learning, and few-shot learning. Zero-shot learning, as the name implies, does not involve a traditional “learning process”. Instead, it directly applies LLMs to new problems that were not observed during training. In practice, we can repeatedly adjust the prompts to guide LLMs in generating better responses, without demonstrating problem-solving steps or providing examples. Consider the following examples.
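The three settings differ only in how many demonstrations the prompt contains. The sketch below makes this concrete under some assumptions: the `Input:`/`Output:` layout, the helper name `build_prompt`, and the toy English-to-French demonstrations are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def build_prompt(instruction: str, demos: List[Tuple[str, str]], query: str) -> str:
    """Concatenate an instruction, k demonstrations, and the new input.

    k = 0 gives a zero-shot prompt, k = 1 a one-shot prompt,
    and k > 1 a few-shot prompt.
    """
    parts = [instruction]
    for source, target in demos:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The final block is left open so that the LLM completes the output.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical demonstrations for a toy translation task.
instruction = "Translate the word from English to French."
demos = [("cheese", "fromage"), ("bread", "pain")]

zero_shot = build_prompt(instruction, [], "water")
one_shot = build_prompt(instruction, demos[:1], "water")
few_shot = build_prompt(instruction, demos, "water")
print(few_shot)
```

No model parameters change between the three prompts; only the conditioning text does.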

In many practical applications, the effectiveness of in-context learning relies heavily on the quality of the prompts and the fundamental abilities of pre-trained LLMs. On one hand, we need a significant prompt engineering effort to develop appropriate prompts that help LLMs learn more effectively from demonstrations. On the other hand, stronger LLMs can make better use of in-context learning for performing new tasks. For example, suppose we wish to use an LLM to translate words from Inuktitut to English. If the LLM lacks pre-training on Inuktitut data, its understanding of Inuktitut will be weak, and it will be difficult for the model to perform well in translation regardless of how we prompt it. In this case, we need to continue training the LLM with more Inuktitut data, rather than trying to find better prompts.


It might be interesting to explore how in-context learning emerges during pre-training and why it works during inference. One simple understanding is that LLMs have gained some knowledge of problem-solving, but there are many possible predictions, which are hard to distinguish when the models confront new problems. Providing demonstrations can guide the LLMs to follow the “correct” paths. Furthermore, some researchers have tried to interpret in-context learning from several different perspectives, including Bayesian inference [Xie et al., 2022], gradient descent [Dai et al., 2023; Von Oswald et al., 2023], linear regression [Akyürek et al., 2023], meta-learning [Garg et al., 2022], and so on.



### Text Classification

Text classification is perhaps one of the most common problems in NLP. Many tasks can be broadly categorized as assigning pre-defined labels to a given text. Here we consider the polarity classification problem in sentiment analysis. In a general setup of polarity classification, we are required to categorize a given text into one of three categories: negative, positive, or neutral.

> The polarity of the text can be classified as positive.

Although the answer is correct, the LLM gives this answer not as a label but as text describing the result. The problem is that LLMs are designed to generate text rather than to assign labels, so they treat classification problems as text generation problems. As a result, we need another system to map the LLM's output to the label space (**label mapping**). That is, we extract “positive” from “The polarity of the text can be classified as positive”. This is trivial in most cases because we can identify label words via simple heuristics. But occasionally, LLMs may not express the classification results using these label words. In this case, the problem becomes more complicated, as we need some way to map the generated text or words to predefined label words.
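One simple heuristic for label mapping is to search the generated text for the label words and return the first one found. The sketch below is only an illustration under that assumption: the function name `map_to_label` is hypothetical, and the heuristic ignores negation (for example, "not positive" would still map to "positive").

```python
import re
from typing import Optional

LABELS = ("positive", "negative", "neutral")


def map_to_label(generated: str) -> Optional[str]:
    """Return the label word that appears earliest in the text, or None."""
    text = generated.lower()
    first_pos, first_label = None, None
    for label in LABELS:
        # Match the label as a whole word only.
        match = re.search(rf"\b{re.escape(label)}\b", text)
        if match and (first_pos is None or match.start() < first_pos):
            first_pos, first_label = match.start(), label
    return first_label


output = "The polarity of the text can be classified as positive."
print(map_to_label(output))
```

When the function returns `None`, the LLM has not used any label word, which is the harder case described above and needs a different mapping strategy.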


One method to induce output labels from LLMs is to reframe the problem as a cloze task. For example, the following shows a cloze-like prompt for polarity classification. We can use LLMs to complete the text and fill the blank with the most appropriate word. Ideally we wish the filled word would be positive, negative, or neutral. However, LLMs are not guaranteed to generate these label words. One method to address this problem is to constrain the prediction to the set of label words and select the one with the highest probability. Then, the output label is given by

$$
\text{label} = \arg\max_{y \in Y} P(y|x)
$$
where $y$ denotes the word filled in the blank and $Y$ denotes the set of label words $\{\text{positive}, \text{negative}, \text{neutral}\}$.
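The constrained prediction can be sketched as follows, assuming we already have the LLM's log-probabilities for candidate words in the blank. The dictionary `word_logprobs` and its values are made up for illustration; in practice they would come from the model's output distribution.

```python
import math
from typing import Dict, Iterable


def constrained_label(word_logprobs: Dict[str, float], label_words: Iterable[str]) -> str:
    """Return the arg max of P(y | x) over y in the label set, ignoring all other words."""
    return max(label_words, key=lambda y: word_logprobs.get(y, -math.inf))


# Hypothetical log-probabilities for the word that fills the blank.
word_logprobs = {
    "great": math.log(0.40),     # most likely word overall, but not a label word
    "positive": math.log(0.30),
    "neutral": math.log(0.10),
    "negative": math.log(0.05),
}

label = constrained_label(word_logprobs, ["positive", "negative", "neutral"])
print(label)  # positive
```

Note that the unconstrained best word here is "great"; restricting the search to $Y$ is what guarantees that the output is a valid label.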

Another method of using LLMs to generate labels is to constrain the output with prompts. For example, we can prompt LLMs to predict within a controlled set of words. Here is an example.
> Analyze the polarity of the following text and classify it as positive, negative, or neutral

Sentiment analysis is a common NLP problem that has probably been well understood by LLMs through pre-training or fine-tuning. Thus we can prompt LLMs using simple instructions to perform the task. However, for new classification problems, it may be necessary to provide additional details about the task, such as the classification standards, so that the LLMs can perform correctly. To do this, we can add a more detailed description of the task and/or demonstrate classification examples in the prompts. To illustrate, consider the following example.


While it seems straightforward to use LLMs for classification problems, there are still issues that have not been well addressed. For example, when dealing with a large number of categories, it remains challenging to effectively prompt LLMs. Note that if we face a very difficult classification problem and have a certain amount of labeled data, fine-tuning LLMs or adopting “BERT + classifier”-like architectures is also desirable.


### Information Extraction
Many NLP problems can be regarded as information extraction problems, involving the identification or extraction of specific pieces of information from unstructured text.  This information can include named entities, relationships, events, and other relevant data points. The goal of information extraction is to transform raw data into a format that can be easily analyzed and used in various downstream applications.

As information extraction covers a wide range of problems, we cannot discuss them all here. Instead, we start with the task of named entity recognition. This is a task that has long been a concern in NLP. Named entity recognition is a process that detects and classifies key information in text into specific groups. These key pieces of information, known as named entities, typically include proper names and are categorized into distinct classes such as people, locations, organizations, dates, monetary values, and percentages.


## Advanced Prompting Methods

### Chain of Thought
Chain-of-thought (CoT) methods provide a simple way to prompt LLMs to generate step-by-step reasoning for complex problems, thereby approaching tasks in a more human-like manner. Rather than coming to a conclusion directly, CoT methods instruct LLMs to generate reasoning steps or to learn from demonstrations of detailed reasoning processes provided in the prompts.


We can consider it as the question and prompt an LLM to answer it.

It seems difficult for the LLM to directly give a correct answer. A simple improvement is to add demonstrations of similar problems in the prompt, and thus the LLM can learn from these demonstrations.

The problem here is that, although we have shown a similar question-answer pair, it remains difficult for the LLM to reason out the correct answer.
In CoT, not only can LLMs learn from the correspondence between questions and answers, but they may gain more from the detailed problem-solving steps that are used to derive the answers. To do this, we can incorporate some reasoning steps into the prompt to obtain a CoT prompt.
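A few-shot CoT prompt can be assembled by placing the reasoning steps between each demonstration question and its answer. The sketch below is a hypothetical illustration: the `Demo` structure, the `Q:`/`A:` layout, and the arithmetic demonstration are assumptions chosen for this example, not the examples used in the original text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    question: str
    reasoning: str  # intermediate steps shown to the model
    answer: str


def build_cot_prompt(demos: List[Demo], question: str) -> str:
    """Build a prompt whose demonstrations include reasoning steps, not just answers."""
    blocks = [
        f"Q: {d.question}\nA: {d.reasoning} The answer is {d.answer}."
        for d in demos
    ]
    # The new question is left open so that the LLM produces its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical demonstration with explicit reasoning steps.
demo = Demo(
    question="What is the mean of 2, 4, and 9?",
    reasoning="The sum is 2 + 4 + 9 = 15. There are 3 numbers, so the mean is 15 / 3 = 5.",
    answer="5",
)
prompt = build_cot_prompt([demo], "What is the mean of 1, 3, 5, and 7?")
print(prompt)
```

Dropping the `reasoning` field from the demonstration would reduce this to an ordinary one-shot prompt with a bare question-answer pair.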
There are several benefits of using CoT prompting. 

- First, CoT allows LLMs to decompose complex problems into smaller, sequential reasoning steps. This somewhat mirrors human problem-solving behaviors, making it particularly effective for tasks requiring detailed, multi-step reasoning.
- Second, CoT makes the reasoning process more transparent and interpretable. Since all reasoning steps are visible, we can understand and interpret how a conclusion was reached.
- Third, if users can follow the logic behind the reasoning process, they will be more likely to trust the predictions of an LLM. This is particularly important when applying LLMs in fields like medicine, education, and finance.
- Fourth, CoT is an in-context learning approach, and thus, it is applicable to most well-trained, off-the-shelf LLMs. Moreover, CoT provides efficient ways to adapt LLMs to different types of problems.

These methods can be applied to a variety of different problems. Typical problem-solving scenarios for CoT include mathematical reasoning, logical reasoning, commonsense reasoning, symbolic reasoning, code generation, and so on. 

Although we have focused on the basic idea of CoT in this section, it can be improved in several ways. For example, we can consider the reasoning process as a problem of searching through many possible paths, each of which may consist of multiple intermediate states (i.e., reasoning steps). In general, we wish the search space to be well-defined and sufficiently large, so that we are more likely to find the optimal result. For this reason, an area of current LLM research is aimed at designing better structures for representing reasoning processes, allowing LLMs to tackle more complex reasoning challenges. These structures include tree-based structures [Yao et al., 2024], graph-based structures [Besta et al., 2024], and so on. By using these compact representations of reasoning paths, LLMs can explore a wider range of decision-making paths, analogous to System 2 thinking.

**We should consider its practical limitations.** One of them is the need for detailed, multi-step reasoning demonstrations in few-shot CoT scenarios, which may be difficult to obtain, either automatically or manually. Also, there is no standard method for breaking down complex problems into simpler problem-solving steps. This often heavily depends on the user’s experience. In addition, errors in intermediate steps can also affect the accuracy of the final conclusion. For further discussion on the pros and cons of CoT, the interested reader can refer to recent surveys on this topic [Chu et al., 2023; Yu et al., 2023; Zhang et al., 2023a].

## Ensembling

Model ensembling for text generation has been extensively discussed in the NLP literature. The idea is to combine the predictions of two or more models to generate a better prediction. This technique is directly applicable to LLMs. For example, we can collect a set of LLMs and run each of them on the same input. The final output is a combined prediction from these models.

**For LLM prompting, it is also possible to improve performance by combining predictions based on different prompts.** Suppose we have an LLM and a collection of prompts that address the same task. We can run this LLM with each of the prompts and then combine the predictions.

Each of these prompts will lead to a different prediction, and we can consider all three predictions to generate the final one.

Formally, let $\{x_1, \cdots, x_K \}$ be $K$ prompts for performing the same task. Given an LLM $P(\cdot|\cdot)$, we can find the best prediction for each $x_i$ using $\hat{y}_i = \arg\max_{y_i}P(y_i|x_i)$. These predictions can be combined to form a new prediction:

$$
\hat{y} = \text{Combine}(\hat{y}_1, \cdots, \hat{y}_K)
$$
Here $\text{Combine}(\cdot)$ is the combination model which can be designed in several ways. For example, we can select the best prediction by voting or by identifying the one that overlaps the most with others.
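Both choices of $\text{Combine}(\cdot)$ mentioned here can be sketched in a few lines. This is one possible reading, not a reference implementation: the function names are hypothetical, and word-level Jaccard similarity is an assumed stand-in for "overlap", which the text does not define precisely.

```python
from collections import Counter
from typing import List


def combine_by_voting(predictions: List[str]) -> str:
    """Return the most frequent prediction (ties go to the one seen first)."""
    return Counter(predictions).most_common(1)[0][0]


def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two predictions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / max(len(words_a | words_b), 1)


def combine_by_overlap(predictions: List[str]) -> str:
    """Return the prediction with the highest total overlap with all the others."""
    scores = [
        sum(word_overlap(p, q) for j, q in enumerate(predictions) if j != i)
        for i, p in enumerate(predictions)
    ]
    return predictions[scores.index(max(scores))]


labels = ["positive", "negative", "positive"]
print(combine_by_voting(labels))  # positive

texts = ["the cat sat on the mat", "a cat sat on the mat", "dogs bark loudly"]
print(combine_by_overlap(texts))  # the cat sat on the mat
```

Voting suits short outputs such as labels, where exact matches are common. Overlap-based selection is more useful for free-form text, where predictions rarely agree exactly.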

Another method for model combination is to perform model averaging during token prediction. Let $\hat{y}_j$ be the predicted token at the $j$-th step for model combination. It is given by
$$
\hat{y}_j = \arg\max_{y_j} \sum_{k=1}^{K} \log P(y_j|x_k, \hat{y}_1, \cdots, \hat{y}_{j-1})
$$

In ensembling for LLM prompting, it is generally advantageous to use diverse prompts so that the combination can capture a broader range of potential responses. This practice is common in ensemble learning, as diversity helps average out biases and errors that may be specific to any single model or configuration.

From a Bayesian viewpoint, we can treat the prompt $x$ as a latent variable, given the problem of interest $p$. This allows the predictive distribution of $y$ given $p$ to be written as the distribution $P(y|x)$ marginalized over all possible prompts:

$$
P(y|p) = \int P(y|x)P(x|p)dx
$$
The integral computes the total probability of $y$ by considering all possible values of $x$, weighted by their likelihoods given $p$. Here $P(y|x)$ is given by the LLM, and $P(x|p)$ is the prior distribution of prompts for the problem.

--- 

### Marginalizing Over Prompts: A Bayesian Perspective

In the Bayesian framework, we can view the prompt $x$ as a latent variable that helps generate the response $y$ for a given problem $p$. This idea is captured by the equation:

$$
P(y \mid p) = \int P(y \mid x) \, P(x \mid p) \, dx.
$$

This equation tells us that the probability of generating $y$ for problem $p$ is obtained by “averaging” the contributions from all possible prompts $x$. Each prompt’s influence is weighted by $P(x \mid p)$, the prior probability of that prompt given the problem.

### Understanding the Components

- **$P(y \mid x)$:**  
  This is the probability of generating $y$ given a specific prompt $x$. In practice, this comes directly from the language model (LLM).

- **$P(x \mid p)$:**  
  This term represents how likely a particular prompt $x$ is, given the problem $p$. Conceptually, it tells us which prompts are more “appropriate” or relevant to the problem at hand.

- **The Integral:**  
  The integral over $x$ sums (or marginalizes) the contributions from all possible prompts, producing the overall probability $P(y \mid p)$.

### Practical Considerations

In an ideal Bayesian world, this formulation is rigorous. However, there are a couple of practical challenges:

1. **Defining $P(x \mid p)$ Precisely:**  
   The space of all possible prompts $x$ is vast, and we rarely have an explicit, tractable model for $P(x \mid p)$. In other words, determining the exact probability of every conceivable prompt given $p$ is generally infeasible.

2. **Computational Feasibility:**  
   The integral
   $$
   \int P(y \mid x) \, P(x \mid p) \, dx
   $$
   is often computationally intractable because it requires summing over an enormous (or even continuous) set of possible prompts.

### How This Connects to Practice

While the equation is mathematically sound and conceptually illuminates how uncertainty in the choice of prompt can be incorporated, it often serves as a guiding principle rather than a directly computable formula. In practice, the approach is usually approximated by:

- **Sampling a Finite Set of Prompts:**  
  Instead of integrating over an infinite prompt space, we select a diverse, manageable number of prompts that are considered plausible for the problem $p$. This is analogous to approximating the integral with a sum:
  $$
  P(y \mid p) \approx \sum_{i=1}^{N} P(y \mid x_i) \, P(x_i \mid p).
  $$

- **Ensemble Techniques:**  
  Commonly, we assume that the chosen prompts are equally good (or weight them based on heuristics), and then combine the outputs from the language model. This leads to techniques like averaging log probabilities during token prediction, which effectively aggregates multiple perspectives.
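The finite-sum approximation can be sketched as a weighted mixture of per-prompt output distributions. Everything concrete below is assumed for illustration: the function name, the toy distributions, and the choice to normalize the weights so that they play the role of $P(x_i \mid p)$ over the sampled prompts. Uniform weights correspond to treating all prompts as equally good.

```python
from typing import Dict, List, Optional


def marginalize_over_prompts(
    per_prompt: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> Dict[str, float]:
    """Approximate P(y | p) by sum_i P(y | x_i) * w_i, with w_i normalized to sum to 1."""
    n = len(per_prompt)
    if weights is None:
        weights = [1.0] * n  # assume all prompts are equally plausible
    total = sum(weights)
    combined: Dict[str, float] = {}
    for dist, w in zip(per_prompt, weights):
        for y, prob in dist.items():
            combined[y] = combined.get(y, 0.0) + (w / total) * prob
    return combined


# Hypothetical output distributions P(y | x_i) from three different prompts.
per_prompt = [
    {"positive": 0.7, "negative": 0.2, "neutral": 0.1},
    {"positive": 0.5, "negative": 0.1, "neutral": 0.4},
    {"positive": 0.3, "negative": 0.6, "neutral": 0.1},
]

p_y = marginalize_over_prompts(per_prompt)
best = max(p_y, key=p_y.get)
print(best)  # positive
```

Passing explicit `weights` lets heuristic prompt quality scores stand in for the prior over prompts.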

### Summary

- The Bayesian equation
  $$
  P(y \mid p) = \int P(y \mid x) \, P(x \mid p) \, dx
  $$
  is a conceptual tool showing how to account for the uncertainty in prompt selection.
  
- $P(x \mid p)$ is a theoretical construct that tells us the likelihood of a prompt $x$ given the problem $p$. In practice, we rarely have an explicit model for this distribution.

- Due to the intractability of the integral over all possible prompts, we approximate it by sampling a finite set of diverse prompts and combining the results (often via averaging in log-space).

Thus, while the equation is elegant and helps justify the use of diverse prompts, it is not typically used in its raw form for computation. Instead, it motivates practical ensemble methods that approximate this marginalization.

### Model Averaging During Token Prediction

In language generation, a model predicts the next token one step at a time. At each step, given a prompt and the context of previously generated tokens, the model produces a probability distribution over the next token. For example, if we denote the prompt by $x$ and the previously generated tokens by $\hat{y}_{<j}$, the model assigns a probability 
$$
P(y \mid x, \hat{y}_{<j})
$$ 
to each possible next token $y$.

When using ensemble methods—such as combining the outputs from multiple prompts or models—we obtain several such distributions. Suppose we have $K$ different prompts $\{x_1, x_2, \dots, x_K\}$ for the same task. Each prompt provides its own probability distribution:
$$
P(y \mid x_k, \hat{y}_{<j}), \quad k = 1, 2, \dots, K.
$$

A common approach to combine these predictions is to assume that, given the context, the predictions are conditionally independent. This allows us to multiply the probabilities from each prompt to form a combined probability for the token $y$:
$$
P_{\text{combined}}(y) = \prod_{k=1}^{K} P(y \mid x_k, \hat{y}_{<j}).
$$

Multiplying several small numbers, however, can lead to numerical underflow. To address this, we take the logarithm of the probabilities, which converts the product into a sum:
$$
\log P_{\text{combined}}(y) = \sum_{k=1}^{K} \log P(y \mid x_k, \hat{y}_{<j}).
$$
Because the logarithm is a monotonic function, maximizing the combined probability is equivalent to maximizing the sum of the log probabilities.

Thus, the next token is selected by choosing the token that maximizes this summed score:
$$
\hat{y}_j = \arg\max_{y} \sum_{k=1}^{K} \log P(y \mid x_k, \hat{y}_{<j}).
$$

This approach—averaging in log-space—not only mitigates numerical issues but also leverages the diverse strengths of multiple prompts. By aggregating the different probability distributions, the final prediction becomes more robust, effectively averaging out any biases or errors that might be present in any single prompt's output.

In summary, model averaging during token prediction involves:
- Combining the probability estimates from multiple sources by multiplying them.
- Using logarithms to convert the multiplication into a stable sum.
- Selecting the token with the highest combined log probability.

This method helps ensure that the final output captures a broader range of perspectives and is less sensitive to the idiosyncrasies of any single prompt.
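A single decoding step of this procedure can be sketched as follows. The per-prompt next-token distributions are made-up numbers, and the function names are hypothetical. In a real system each distribution would come from running the LLM on prompt $x_k$ with the shared prefix $\hat{y}_{<j}$.

```python
import math
from typing import Dict, List


def log_score(distributions: List[Dict[str, float]], token: str) -> float:
    """Sum of log P(token | x_k, prefix) over all prompts k."""
    total = 0.0
    for dist in distributions:
        prob = dist.get(token, 0.0)
        if prob <= 0.0:
            return -math.inf  # a token that any prompt rules out is never chosen
        total += math.log(prob)
    return total


def combine_next_token(distributions: List[Dict[str, float]]) -> str:
    """Select the token that maximizes the summed log-probability."""
    vocab = sorted(set().union(*distributions))
    return max(vocab, key=lambda token: log_score(distributions, token))


# Hypothetical next-token distributions from K = 3 prompts at the same step.
distributions = [
    {"cat": 0.6, "dog": 0.3, "bird": 0.1},
    {"cat": 0.2, "dog": 0.7, "bird": 0.1},
    {"cat": 0.5, "dog": 0.4, "bird": 0.1},
]

print(combine_next_token(distributions))  # dog
```

In this toy case two of the three prompts prefer "cat" on their own, yet the combined score selects "dog", because the second prompt assigns "cat" a low probability. Summing log-probabilities is therefore not the same as voting over each prompt's top token.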

### Self-consistency
Here we consider the self-consistency method, which outputs not the prediction with the highest probability, but rather the one that best aligns with other predictions [Wang et al., 2022a; 2023b]. First, an LLM is prompted with CoT as usual and generates multiple reasoning paths by sampling. Self-consistency then provides a criterion for determining the best prediction in this pool of candidates. It can be seen as an instance of output ensembling methods, also known as hypothesis selection methods, which have long been explored in NLP, particularly for text generation problems [Xiao et al., 2013]. In these methods, multiple outputs are generated by varying model architectures or parameters. Each output is then assigned a score by some criterion, and the outputs are re-ranked based on these scores. There are various ways to define the scoring function, such as measuring the agreement between an output and others, and using a stronger model to rescore each output.

An interpretation of self-consistency is to view it as a minimum Bayes risk search process: it searches for the best output by minimizing the Bayes risk. More specifically, a risk function $R(y, y_r)$ is defined on each pair of outputs, representing the cost of replacing $y$ with $y_r$. Given a set of outputs $\Omega$, the risk of an output $y \in \Omega$ is given by
$$
\text{Risk}(y) = E_{y_r \sim P(y_r|x)}R(y, y_r)
= \sum_{y_r \in \Omega} R(y, y_r) \cdot P(y_r|x)
$$
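The risk computation can be sketched as follows, under two assumptions made here for illustration: $P(y_r|x)$ is estimated by the relative frequency of each answer among the sampled outputs, and $R(y, y_r)$ is a 0/1 risk that is zero only when the two answers match. With these choices, selecting the minimum-risk output coincides with majority voting over the sampled answers.

```python
from collections import Counter
from typing import Callable, Dict, List


def zero_one_risk(y: str, y_r: str) -> float:
    """Assumed risk function: no cost if the two outputs agree, unit cost otherwise."""
    return 0.0 if y == y_r else 1.0


def estimate_probs(samples: List[str]) -> Dict[str, float]:
    """Estimate P(y_r | x) by relative frequency among the sampled outputs."""
    counts = Counter(samples)
    return {y: c / len(samples) for y, c in counts.items()}


def risk_of(y: str, probs: Dict[str, float], risk: Callable[[str, str], float]) -> float:
    """Risk(y) = sum over y_r in Omega of R(y, y_r) * P(y_r | x)."""
    return sum(risk(y, y_r) * p for y_r, p in probs.items())


def mbr_select(probs: Dict[str, float], risk: Callable[[str, str], float] = zero_one_risk) -> str:
    """Return the output in Omega with the smallest risk."""
    return min(probs, key=lambda y: risk_of(y, probs, risk))


# Hypothetical final answers extracted from five sampled reasoning paths.
samples = ["18", "20", "18", "18", "24"]
probs = estimate_probs(samples)
best = mbr_select(probs)
print(best)  # 18
```

Replacing `zero_one_risk` with a graded dissimilarity measure would extend the same selection rule to free-form outputs that rarely match exactly.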

