# Neural Sequence Models and Pre-training

Transformer architectures [1] and advances in self-supervised learning have fundamentally changed natural language processing. By pre-training neural networks on vast unlabeled datasets with self-supervised objectives, we obtain foundation models that adapt to diverse tasks through fine-tuning or prompting. This often removes the need to train a separate supervised model for each task; a single pre-trained model is adapted instead.

$$
\text{Foundation Model} = \text{Architecture} + \text{Self-Supervised Pre-training}
$$



The significance of large language models lies in their emergent properties: while trained solely on next-token prediction ($P(x_t|x_{<t})$), they demonstrate sophisticated linguistic understanding and generalization across diverse NLP tasks, often surpassing specialized supervised systems. Recent research [2] suggests these emergent capabilities may extend beyond purely linguistic domains.
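
For reference, the next-token objective mentioned above corresponds to the standard autoregressive factorization of a sequence's probability. The sequence length $T$ and the parameters $\theta$ are notation introduced here for illustration and do not appear elsewhere in the text:

$$
P_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} P_\theta(x_t|x_{<t}), \qquad \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t|x_{<t})
$$

Minimizing $\mathcal{L}(\theta)$ over a large corpus is what training "solely on next-token prediction" amounts to under this formulation.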

# Prompting of Pre-trained Models

Large language models apply next-token prediction ($P(x_t|x_{<t})$) at scale to develop general language understanding. With enough training, this seemingly simple objective lets us recast complex NLP tasks as text generation problems via prompting: traditional tasks such as classification become generation tasks, with no task-specific architecture required.
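
To make "text generation from next-token prediction" concrete, here is a minimal sketch of greedy decoding. The lookup table `TOY_NEXT_TOKEN_PROBS` and the function `greedy_complete` are hypothetical stand-ins: the table conditions only on the previous token, whereas a real language model conditions on the entire prefix $x_{<t}$ with a neural network.

```python
# Hypothetical toy table: previous token -> distribution over the next token.
# A real LLM would compute P(x_t | x_{<t}) from the whole prefix instead.
TOY_NEXT_TOKEN_PROBS = {
    "the": {"sky": 0.6, "sea": 0.4},
    "sky": {"is": 0.9, "was": 0.1},
    "is": {"blue": 0.8, "grey": 0.2},
    "blue": {"<eos>": 1.0},
}


def greedy_complete(prompt_tokens, max_new_tokens=5):
    """Repeatedly append the most probable next token until <eos> or a dead end."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        dist = TOY_NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:  # the toy table has no entry for this context
            break
        next_token = max(dist, key=dist.get)  # greedy choice: argmax over tokens
        if next_token == "<eos>":
            break
        tokens.append(next_token)
        generated.append(next_token)
    return generated


completion = greedy_complete(["the"])
print(completion)
```

Greedy decoding is only one possible strategy; sampling-based alternatives exist, and the text above does not say which one a given model uses.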


## Prompting for Text Classification

Consider how humans classify text: we read a passage and naturally understand its sentiment or category. Modern language models can perform a similar task, but instead of outputting a class label directly, they complete a carefully crafted prompt. This turns the traditional classification problem into a natural language generation task.

Let's look at a concrete example:

```
Assume that the polarity of a text is a label chosen from {positive, negative, neutral}. Identify the polarity of the input.
Input: I love the food here. It’s amazing!
Polarity: ________
```
When presented with this prompt, the model completes the blank with a classification label, much like how a human would naturally respond.


To formalize this intuition, we can express the classification process mathematically. For any input text $x$, we create a prompt template $p(x)$ and predict the most likely completion:

$$
\text{class}(x) = f\left(\operatorname*{arg\,max}_{y \in V} P(y|p(x))\right)
$$

where:
- $V$ is the model's vocabulary (all possible tokens it might generate)
- $y$ is the predicted completion token
- $P(y|p(x))$ is the probability of generating token $y$ given the prompt
- $f$ is a mapping function that converts completions to formal class labels
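
The formula can be sketched in a few lines of Python. Everything here is illustrative: `toy_probs` is a made-up distribution standing in for $P(y|p(x))$ over a four-token vocabulary, and `classify` and `label_map` are hypothetical names for the argmax step and the mapping $f$.

```python
def classify(completion_probs, label_map, fallback="unknown"):
    """Pick the most probable completion token, then map it to a class label."""
    # argmax over the (toy) vocabulary V
    best_token = max(completion_probs, key=completion_probs.get)
    # f: normalize the completion and convert it to a formal label
    return label_map.get(best_token.strip().lower(), fallback)


# Made-up probabilities for illustration only; a real model would supply these.
toy_probs = {"positive": 0.81, "neutral": 0.09, "the": 0.06, "negative": 0.04}
label_map = {"positive": "POSITIVE", "negative": "NEGATIVE", "neutral": "NEUTRAL"}

predicted = classify(toy_probs, label_map)
print(predicted)
```

Because the argmax ranges over the whole vocabulary, the top token may not be a valid label at all; the `fallback` value is one simple way for $f$ to handle that case.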



An instruction-based prompt consists of three key components:

1. **Task Instructions**
   ```
   Assume that the polarity of a text is a label chosen from {positive, negative, neutral}.
   Identify the polarity of the input.
   ```
   This defines the task parameters and expected output format.

2. **Format Markers**
   ```
   Input: [text goes here]
   Polarity: [model completion here]
   ```
   These markers provide structural cues that:
   - Clearly separate different components of the prompt
   - Guide the model's attention to relevant sections
   - Structure the expected response format

3. **Input Text**
   ```
   I love the food here. It's amazing!
   ```

The complete prompt combines these elements:

$$
\begin{aligned}
p(x) &= \text{instructions} \\
&+ \text{"Input: "} + x \\
&+ \text{"Polarity: "}
\end{aligned}
$$
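
As a minimal sketch, the same concatenation can be written as a small Python function. The names `INSTRUCTIONS` and `build_prompt` are hypothetical, and the newline separators are an assumption based on how the example prompt is laid out; the formula itself does not specify them.

```python
# Task instructions copied from the example prompt.
INSTRUCTIONS = (
    "Assume that the polarity of a text is a label chosen from "
    "{positive, negative, neutral}. Identify the polarity of the input."
)


def build_prompt(x, instructions=INSTRUCTIONS):
    """Assemble p(x): instructions, then the input marker, then the output marker."""
    # Newlines between components are an assumption, not part of the formula.
    return instructions + "\n" + "Input: " + x + "\n" + "Polarity: "


prompt = build_prompt("I love the food here. It's amazing!")
print(prompt)
```

The prompt deliberately ends at the `Polarity: ` marker so that the model's next tokens form the label.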

## References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008.

[2] Bubeck, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

[3] Brown, T. B., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

[4] Wei, J., et al. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

[5] Kojima, T., et al. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.