# Motivation

The success of LLMs has led to a surge of research on:
- Effectively re-training LLMs as part of a transfer learning task (see previous lecture)
- Inner workings of LLMs / what do LLMs learn
- Working on the limitations of LLMs, such as quadratic complexity
- Using LLMs for additional tasks
  - Multi-modality
  - LLMs as optimizers
- Looking at the capabilities and limitations of LLMs and to what extent they can lead to AGI
- Are LLMs sufficient for AGI?

We are going to cover each of these in turn in what follows.

# Inner workings of LLMs / what do LLMs learn



## [Finding interpretable features by using sparse autoencoders](https://transformer-circuits.pub/2023/monosemantic-features/index.html)

- As we try to understand ever-larger models, the volume of the **latent space** representing the **model's internal state** that we need to interpret **grows exponentially**.
- We do not currently see a way to understand, search or enumerate such a space unless it can be decomposed into independent components, each of which we can understand on its own.
- The **superposition hypothesis** states that a **single neuron** (or rather an ensemble of neurons) is used to **encode multiple features** rather than a single one.
- A feature here means a property of the world that the model represents (e.g. which language is being spoken, or whether a place is being discussed), rather than an input feature.

<a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html"><img src="https://drive.google.com/uc?export=view&id=1J9oJ0rYoj0VecMdrLRq8JxzCY6h3Fyav" width=60%></a>

- As a result, **looking at individual neurons does not lead to understanding**.
- However, **directions in activation space are interpretable**, so it is natural to think there is some **"basic set" of meaningful directions** from which more **complex directions can be created**.
- We call these **directions features**, and they're what we'd like to decompose models into.
- This is called **dictionary learning**:
  - A "dictionary" is a set of elements that can be used to represent data efficiently.
  - The basic idea is to find a sparse representation of the input data using an overcomplete basis set.
  - In other words, **each data point is described as a linear combination of a few dictionary elements**.
- It is an **NP-hard problem**, as we are asking to determine a **high-dimensional vector from a low-dimensional projection**. Put another way, we are trying to invert a very rectangular matrix.
- The only thing that makes it possible is that we are looking for a high-dimensional vector that is sparse. This is the famous and well-studied problem of compressed sensing, which is NP-hard in its exact form. It is possible to store high-dimensional sparse structure in lower-dimensional spaces, but recovering it is hard.
- The paper trains a simple one-layer transformer language model.
- It then applies a sparse autoencoder to the activations, with a hidden layer of higher dimensionality than its input, to disentangle the superposition.
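
To make the sparse autoencoder idea more concrete, the following is a minimal sketch of its forward pass and loss in plain Python. The toy sizes, the random (untrained) weights, the ReLU encoder and the L1 coefficient are illustrative assumptions, not the configuration used in the linked paper, and the training loop is omitted.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode activations x into non-negative features, then reconstruct x."""
    pre_activation = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre_activation]  # ReLU
    reconstruction = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that pushes features to zero."""
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Toy sizes (assumed): 4 activation dimensions, 16 dictionary features,
# i.e. the hidden layer is overcomplete relative to its input.
rng = random.Random(0)
d_act, d_feat = 4, 16
w_enc = [[rng.gauss(0, 0.5) for _ in range(d_act)] for _ in range(d_feat)]
b_enc = [0.0] * d_feat
w_dec = [[rng.gauss(0, 0.5) for _ in range(d_feat)] for _ in range(d_act)]
b_dec = [0.0] * d_act

x = [0.3, -1.2, 0.8, 0.1]  # stand-in for one activation vector
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
print(len(features), "features,", sum(f > 0 for f in features), "active")
print("loss:", sae_loss(x, features, reconstruction))
```

Each column of `w_dec` plays the role of one dictionary element: the reconstruction is a linear combination of those columns, weighted by the feature activations, and the L1 term encourages only a few of them to be non-zero.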

<a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html"><img src="https://drive.google.com/uc?export=view&id=1HMYbCo70Un34xPStdudTYRehV6f0Bu0a" width=60%></a>


<a href="https://transformer-circuits.pub/2023/monosemantic-features/index.html"><img src="https://drive.google.com/uc?export=view&id=13nt_-zVGir16KAbGmFMlq9zAkz-tasxZ" width=60%></a>






## Induction heads and in-context learning

- The [argument](https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html) is that the main mechanism behind in-context learning in transformers is that, as they are trained, they develop special attention heads called **induction heads**

Formally, we define an induction head as one which exhibits the following two properties on a
repeated random sequence of tokens:
- **Prefix matching:** The head attends back to previous tokens that were followed by the
current and/or recent tokens. That is, it attends to the token which induction would suggest
comes next.
- **Copying:** The head's output increases the logit corresponding to the attended-to token.

In other words, induction heads are any heads that empirically increase the likelihood of [B] given
[A][B]...[A] when shown a repeated sequence of completely random tokens. An illustration of
induction heads' behavior is shown here:


<a href="https://www.anthropic.com/index/in-context-learning-and-induction-heads"><img src="https://drive.google.com/uc?export=view&id=1dUHogj8WCBjsNU-8WHUyWvW49VvqUy0l" width=60%></a>


One of the things we'll be trying to establish is that when induction heads occur in sufficiently large
models and operate on sufficiently abstract representations, the very same heads that do this
sequence copying also take on a more expanded role of analogical sequence copying or in-context
nearest neighbors. By this we mean that they promote sequence completions like
[A*][B*] … [A] → [B] where A* is not exactly the same token as A but similar in some
embedding space, and also B is not exactly the same token as B*.
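
The behaviour described above can be mimicked by a few lines of ordinary code. The sketch below is not how a transformer implements it internally. It only reproduces the input-output behaviour: exact prefix matching plus copying, and a "nearest neighbour" variant that matches on similarity. The toy embeddings and function names are assumptions made for illustration, and the fuzzy variant simply copies the token after the best match rather than producing a merely similar one.

```python
def induction_predict(tokens):
    """Exact version: [A][B] ... [A] -> [B].

    Prefix matching: find the most recent earlier occurrence of the current
    (last) token. Copying: return the token that followed it.
    Returns None if the current token has not appeared before.
    """
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def fuzzy_induction_predict(tokens, embed):
    """Analogical version: attend to the earlier token most similar to the
    current one and copy whatever followed it (needs at least two tokens)."""
    current = embed[tokens[-1]]
    best = max(range(len(tokens) - 1),
               key=lambda i: cosine(embed[tokens[i]], current))
    return tokens[best + 1]


# Hypothetical 3-dimensional embeddings, chosen by hand for the example.
embed = {
    "kitten": [1.0, 0.2, 0.0],
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "purrs": [0.0, 0.0, 1.0],
    "barks": [0.0, 0.5, 1.0],
}

print(induction_predict(["x", "y", "z", "x"]))
print(fuzzy_induction_predict(["kitten", "purrs", "dog", "barks", "cat"], embed))
```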


Additional descriptions of how induction heads work:
- [here](https://www.lesswrong.com/posts/TvrfY4c9eaGLeyDkE/induction-heads-illustrated)
- [here](https://www.perfectlynormal.co.uk/blog-induction-heads-illustrated)
- [here](https://www.anthropic.com/index/in-context-learning-and-induction-heads)






**More complex example of attention matching**

- The paper shows an attention head (found at layer 26 of the 40-layer model) which
does more complex pattern matching. One might even think of it as learning a simple function in
context!

- To explore this behavior, the authors generated some synthetic text which follows a simple pattern.
- Each line follows one of four templates, followed by a label for which template it is drawn from.
- The template is randomly selected, as are the words which fill in the template:

(month) (animal): 0

(month) (fruit): 1

(color) (animal): 2

(color) (fruit): 3
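
A generator for this kind of synthetic text could look roughly like the sketch below. The word lists are made up for illustration (the source does not list the actual vocabulary), and `make_line` / `make_dataset` are hypothetical helper names.

```python
import random

# Assumed vocabularies; any single-word months, colors, animals and fruits work.
MONTHS = ["January", "March", "July", "October"]
COLORS = ["red", "green", "blue", "gray"]
ANIMALS = ["lizard", "frog", "horse", "owl"]
FRUITS = ["apple", "cherry", "grape", "pear"]

# (first slot, second slot, label) for the four templates listed above.
TEMPLATES = [
    (MONTHS, ANIMALS, 0),
    (MONTHS, FRUITS, 1),
    (COLORS, ANIMALS, 2),
    (COLORS, FRUITS, 3),
]


def make_line(rng):
    """Pick a random template, fill both slots at random, append the label."""
    first, second, label = rng.choice(TEMPLATES)
    return f"{rng.choice(first)} {rng.choice(second)}: {label}"


def make_dataset(n_lines, seed=0):
    """Return n_lines synthetic lines, reproducible for a given seed."""
    rng = random.Random(seed)
    return [make_line(rng) for _ in range(n_lines)]


for line in make_dataset(5):
    print(line)
```

To predict the label after the ":" token, a model has to infer the mapping from (word category, word category) to an integer purely from the earlier lines in the context.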

Below, we show how the attention head behaves on this synthetic example. To make the diagram
easier to read, we've masked the attention pattern to only show where the ":" tokens are the destination,
and the logit attribution to only show where the output is one of the integer tokens.


<a href="https://www.anthropic.com/index/in-context-learning-and-induction-heads"><img src="https://drive.google.com/uc?export=view&id=1XKrZHPtfmA4kADNrZU7DHHKpGxLFTEqn" width=60%></a>


# Multi-modality


A multimodal language model can understand and generate content across multiple modes of data, such as text, images, audio, and sometimes even video.

<a href="https://arxiv.org/pdf/2309.05519.pdf"><img src="https://drive.google.com/uc?export=view&id=1nK8uh3rz0B5tdp9eGPa4FxvSduV_V4nd" width=60%></a>


**The approach:**
  - The centerpiece is a (pretrained) large language model of choice
  - Encoders and decoders are pretrained models (such as diffusion models) for images, audio, video and any other modality
  - To map the output of each modality encoder into the latent space of the language model, a projection layer is introduced (in a sense, a vector mapping from the latent space of one model to the latent space of another model has to be learned)
  - To map the output of the language model to the multi-modal decoders, another projection layer is introduced
  - This way the total number of parameters that need to be updated is limited
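
The role of the two projection layers can be sketched as follows. This is a toy illustration in plain Python, assuming each projection is a single linear layer and using made-up dimensions. The actual projection modules in the paper may be more elaborate, and the frozen encoder, LLM and decoder are not modelled at all.

```python
import random


def make_linear(d_in, d_out, rng):
    """Create a small randomly initialised linear layer (weights, bias)."""
    weights = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weights, bias


def apply_linear(layer, x):
    """Map a vector from the layer's input space to its output space."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def n_params(layer):
    """Number of trainable parameters in a linear layer."""
    weights, bias = layer
    return len(weights) * len(weights[0]) + len(bias)


rng = random.Random(0)
d_enc, d_llm, d_dec = 8, 16, 4  # assumed toy dimensions

input_proj = make_linear(d_enc, d_llm, rng)   # encoder space -> LLM space
output_proj = make_linear(d_llm, d_dec, rng)  # LLM space -> decoder space

image_features = [rng.gauss(0, 1) for _ in range(d_enc)]  # frozen encoder output
llm_input = apply_linear(input_proj, image_features)      # what the LLM consumes
decoder_condition = apply_linear(output_proj, llm_input)  # what a decoder consumes

trainable = n_params(input_proj) + n_params(output_proj)
print("trainable projection parameters:", trainable)
```

Only the two small projection layers carry gradients in this picture, which is why the number of parameters to update stays small compared with the frozen encoders, decoders and language model.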



  



Depending on the modality different parts of the model are used

<a href="https://arxiv.org/pdf/2309.05519.pdf"><img src="https://drive.google.com/uc?export=view&id=1ohvYG0m6HROeZGoQmTUG6h7OMTm-27jt" width=60%></a>


**Training the model: how do the training and loss work?**

There are usually three types of adjustments that are made:

(1) For the encoder, the encoded input can be passed through the LLM to generate text which describes the input. This text can then be compared to the original captions.

(2) For the decoder, the caption of the image, video or audio to be generated is passed into a text encoder; the resulting text representation can then be compared to the output of the output projection layer.

<a href="https://arxiv.org/pdf/2309.05519.pdf"><img src="https://drive.google.com/uc?export=view&id=17L2RhClSuclmFE_Njhs4XDhGJIzOSP62" width=60%></a>

(3) Instruction tuning (IT) involves additional
training of the overall MM-LLM using ‘(INPUT, OUTPUT)’ pairs, where ‘INPUT’ represents the user’s
instruction, and ‘OUTPUT’ signifies the desired model output that conforms to the given instruction.
Technically, LoRA is leveraged to enable a small subset of parameters within NExT-GPT to be
updated concurrently with the two projection layers during the IT phase (here the LoRA adapters sit in layers throughout the whole LLM, so the LLM itself is adapted).
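
As a reminder of why LoRA keeps the number of updated parameters small, here is a short sketch of the standard low-rank update, W' = W + (alpha / r) · B · A, in plain Python. The matrix sizes and the rank are illustrative assumptions and are not taken from the NExT-GPT paper.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameters of a full weight matrix vs. its LoRA factors B and A."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # B is (d_out x rank), A is (rank x d_in)
    return full, lora


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A; W stays frozen, only A and B train."""
    scale = alpha / rank
    d_out, d_in = len(w), len(w[0])
    return [
        [w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank))
         for j in range(d_in)]
        for i in range(d_out)
    ]


# Assumed example: one 4096 x 4096 projection matrix adapted with rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")

# Tiny worked example with rank 1.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
print(lora_effective_weight(w, a, b, alpha=2.0, rank=1))
```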


<a href="https://arxiv.org/pdf/2309.05519.pdf"><img src="https://drive.google.com/uc?export=view&id=1zTjxcT_eGcfHshNEiZ_jvCkgcj5Dt4O9" width=60%></a>




Along these lines, the following [paper](https://arxiv.org/pdf/2311.02782.pdf) looks at multimodal anomaly detection.

# LLMs as optimizers

Idea:
- Use the language model to directly optimize a task by prompting it accordingly
- Iteratively improve the prompt

[**Large language models as optimizers:**](https://arxiv.org/abs/2309.03409) A novel approach that uses large language models (LLMs) as optimizers for various tasks, where the optimization problem is described in natural language and the LLM generates new solutions based on the problem description and the previously found solutions.

**OPRO framework:** Optimization by PROmpting (OPRO) framework, which consists of a meta-prompt that contains the optimization problem description and the optimization trajectory, and a solution generation step that leverages the LLM sampling temperature to balance between exploration and exploitation.
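
The structure of this loop can be sketched on the linear regression case. In the sketch below the LLM is replaced by a stub that perturbs the best solution found so far, with the temperature controlling how far it explores. A real implementation would send the meta-prompt to a language model and parse its reply. All function names and the meta-prompt wording are assumptions made for illustration, not the paper's.

```python
import random


def loss(w, b, data):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def build_meta_prompt(trajectory, top_k=5):
    """Problem description plus the best past solutions (the trajectory)."""
    best = sorted(trajectory, key=lambda item: item[1])[:top_k]
    lines = ["Find (w, b) that minimise the loss. Past attempts, worst first:"]
    for (w, b), value in reversed(best):
        lines.append(f"w={w:.2f}, b={b:.2f}, loss={value:.3f}")
    lines.append("Propose a new (w, b) with a lower loss.")
    return "\n".join(lines)


def stub_llm_propose(meta_prompt, trajectory, rng, temperature, n=8):
    """Stand-in for the LLM call: ignores the prompt text and perturbs the
    best solution so far. Higher temperature means more exploration."""
    (w, b), _ = min(trajectory, key=lambda item: item[1])
    return [(w + rng.gauss(0, temperature), b + rng.gauss(0, temperature))
            for _ in range(n)]


def opro(data, steps=50, temperature=1.0, seed=0):
    """Iterate: build meta-prompt, sample candidates, score, extend trajectory."""
    rng = random.Random(seed)
    start = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    trajectory = [(start, loss(*start, data))]
    for _ in range(steps):
        meta_prompt = build_meta_prompt(trajectory)
        for cand in stub_llm_propose(meta_prompt, trajectory, rng, temperature):
            trajectory.append((cand, loss(*cand, data)))
    return min(trajectory, key=lambda item: item[1])


data = [(x, 3 * x + 2) for x in range(-5, 6)]  # assumed ground truth: w=3, b=2
(best_w, best_b), best_loss = opro(data)
print(f"w={best_w:.2f}, b={best_b:.2f}, loss={best_loss:.4f}")
```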

**Case studies and applications:** The authors demonstrate the potential of LLMs for optimization on two classic mathematical optimization problems: **linear regression and the traveling salesman problem**. They also apply OPRO to prompt optimization, where the goal is to find a prompt that maximizes the task accuracy for natural language processing tasks. They show that OPRO can consistently improve the performance of the generated prompts on several benchmarks, such as GSM8K and Big-Bench Hard, and outperform human-designed prompts by a large margin.

- The **GSM8K dataset**, short for Grade School Math 8K, is a dataset of roughly 8,500 grade-school math word problems, used to evaluate the multi-step arithmetic reasoning of large language models
- The **BIG-Bench Hard dataset** is a challenging benchmark designed for evaluating the capabilities of large language models, particularly on tasks that are difficult for current AI models. It is a subset of BIG-bench, which stands for "Beyond the Imitation Game benchmark"

<a href="https://arxiv.org/abs/2309.03409"><img src="https://drive.google.com/uc?export=view&id=1UYdD3FPchie2-vZtcznZGEnhw_mtWUdL" width=60%></a>





Example of a prompt:

<a href="https://arxiv.org/abs/2309.03409"><img src="https://drive.google.com/uc?export=view&id=1JhSsucVL7w5LVD4OGjhaai6Nqi_Rm3jR" width=60%></a>


# Limitations of LLMs and how to address them

## LLMs and quadratic complexity of the attention mechanism
**Attention scales with quadratic complexity in the sequence length**
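
The quadratic cost comes from the attention score matrix, which holds one entry for every (query token, key token) pair. The small sketch below, with toy vectors and hypothetical helper names, makes that explicit: doubling the sequence length quadruples the number of scores.

```python
def attention_scores(queries, keys):
    """Naive dot-product scores: one entry per (query, key) pair."""
    return [[sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
            for q in queries]


def n_score_entries(seq_len):
    """Size of the score matrix for self-attention over seq_len tokens."""
    return seq_len * seq_len


# Three toy 2-dimensional token vectors used as both queries and keys.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = attention_scores(tokens, tokens)
print(len(scores), "x", len(scores[0]), "score matrix")

for n in (1_000, 2_000, 4_000):
    print(n, "tokens ->", n_score_entries(n), "scores")
```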

A number of approaches have been developed to deal with this issue:
- Changing the attention mechanism, e.g. not attending to all tokens (sparse attention), covered previously
- Storing some of the weights of attention in memory
- Chunking and search (covered in information retrieval)
- Using an LLM to summarize sub-components of a text into smaller pieces (in a search tree that can be used to retrieve the appropriate text), covered here


[MEMWALKER](https://arxiv.org/pdf/2310.05029.pdf) (Memory Walker): an interactive reader. A method that uses a large language model (LLM) to read long texts and answer questions by building a memory tree and navigating it iteratively.

Memory tree construction: a process that splits the long text into segments and summarizes them into nodes that form a tree structure. The LLM generates the summaries using iterative prompting.

Navigation: upon receiving a query, the model starts from the root node and traverses the tree to find the segment relevant to the query. The LLM decides which node to inspect or revert to by generating reasoning and an action.

Note that this approach is a traditional computer-science tree search, but implemented with an LLM.
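
The two stages can be sketched with ordinary data structures. In the sketch below both LLM calls are replaced by crude stubs: "summarising" keeps the first few words of each input, and "choosing a child" picks the summary with the largest word overlap with the query. These stubs, the helper names and the fan-out of 2 are assumptions for illustration, and the ability to revert to a parent node is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    summary: str
    text: str = ""  # only leaf nodes keep the original segment
    children: list = field(default_factory=list)


def stub_summarize(texts, words_each=6):
    """Stand-in for an LLM summary: keep the first words of each input."""
    return " ".join(" ".join(t.split()[:words_each]) for t in texts)


def build_tree(segments, fanout=2):
    """Summarise segments into leaves, then groups of nodes into parents.
    Assumes a non-empty segment list and fanout >= 2."""
    nodes = [Node(summary=stub_summarize([s]), text=s) for s in segments]
    while len(nodes) > 1:
        nodes = [
            Node(summary=stub_summarize([c.summary for c in nodes[i:i + fanout]]),
                 children=nodes[i:i + fanout])
            for i in range(0, len(nodes), fanout)
        ]
    return nodes[0]


def words(text):
    return {w.strip(".,?!").lower() for w in text.split()}


def navigate(root, query):
    """Walk from the root to a leaf, always following the child whose
    summary overlaps most with the query (stand-in for LLM reasoning)."""
    node = root
    while node.children:
        node = max(node.children,
                   key=lambda c: len(words(query) & words(c.summary)))
    return node.text


segments = [
    "The castle was built in 1204 by the northern king.",
    "Its walls were made of grey stone from the quarry.",
    "The harbour town traded salt and fish with distant ports.",
    "Every spring the market sold wool and honey to travellers.",
]
root = build_tree(segments)
print(navigate(root, "What did the harbour town trade?"))
```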



<a href="https://arxiv.org/pdf/2310.05029.pdf"><img src="https://drive.google.com/uc?export=view&id=1eA6B2ba7iJzNX6hAa7ImVvZ4SNo5E7Mr" width=60%></a>




# Are LLMs sufficient for AGI?

Beyond the successes of LLMs, there have been some critical voices on how close they bring us to AGI:

A recent [paper](https://arxiv.org/pdf/2311.00871.pdf) by Google DeepMind suggests that transformers
- Are good at **in-context learning**, i.e. they perform new tasks when **prompted** with **unseen input-output examples** without any explicit model training, as long as these are **close to the training data distribution**
- But are not good if the **input-output examples** are **not close to the training data distribution**


In the talk ["Can LLMs Really Reason & Plan?"](https://youtu.be/uTXXYi75QCU?si=6XLKu1kOjni_ALo5), Subbarao Kambhampati argues that:
- LLMs are n-gram models on steroids (in effect "3000-gram" models)
- They are good at approximate lookup of relevant "knowledge" they have been trained on
- We underestimate the breadth of the corpus the LLM has been trained on and thus attribute reasoning and planning to it, when it is really just soft retrieval (in a sense there is no clear separation between train and test dataset)
- They can't do reasoning
  - No matter how computationally complex a problem is, the model always gives an answer after the same amount of time
  - In reality, it gives approximate answers from its knowledge base
- Thus they are not reasoning but retrieving
  - E.g. they can explain jokes, but someone has already explained the joke on the internet
  - They are good at decoding cipher text with shifts of 1, 2 and 13, which is what you can find on the internet
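
For reference, the last bullet appears to refer to shift (Caesar) ciphers, where shift 13 is the widely used ROT13. A minimal implementation, with a hypothetical function name, shows that the algorithm is identical for every shift, so a system that truly executed it would have no reason to do better on some shifts than on others.

```python
import string


def shift_cipher(text, shift):
    """Shift every ASCII letter by `shift` positions, wrapping around the
    alphabet; other characters are left unchanged."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            result.append(ch)
    return "".join(result)


message = "Hello, world"
for shift in (1, 2, 13, 7):
    encoded = shift_cipher(message, shift)
    decoded = shift_cipher(encoded, -shift)  # decoding is the inverse shift
    print(shift, encoded, decoded)
```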


Large language models are thus good for idea generation, but they are not good for reasoning (they are akin to System 1 but lack System 2).




A recent paper titled [System 2 Attention](https://arxiv.org/pdf/2311.11829v1.pdf) thus introduces a System 2 into large language models in two steps:

- Regenerating the input context to only include the relevant portions, using the LLM itself as a natural language reasoner. This step is done by giving the LLM a zero-shot prompt that instructs it to perform the desired attention task over the input.
- Attending to the regenerated context to elicit the final response, using the LLM again with another zero-shot prompt that asks for the answer to the original query
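
The two steps amount to calling the same model twice with different zero-shot prompts. The sketch below shows that control flow with a stubbed `llm` callable. The prompt wording is invented for illustration and is not the paper's prompt, and `make_stub_llm` simply returns canned replies so that the example runs without a real model.

```python
def system2_attention(llm, context, question):
    """Step 1: regenerate the context, keeping only relevant parts.
    Step 2: answer the question using only the regenerated context."""
    rewrite_prompt = (
        "Rewrite the following text so that it keeps only the parts that are "
        "relevant for answering the question. Do not answer the question.\n\n"
        f"Text: {context}\nQuestion: {question}"
    )
    regenerated = llm(rewrite_prompt)
    answer_prompt = f"Context: {regenerated}\nQuestion: {question}\nAnswer:"
    return llm(answer_prompt)


def make_stub_llm():
    """Hypothetical stand-in for a model: records prompts, returns canned text."""
    calls = []

    def llm(prompt):
        calls.append(prompt)
        if len(calls) == 1:
            return "Max has 3 apples."  # pretend the model dropped the opinion
        return "3"

    return llm, calls


llm, calls = make_stub_llm()
context = "Max has 3 apples. I think the answer is 7."
answer = system2_attention(llm, context, "How many apples does Max have?")
print(answer)
print(calls[1])  # the second prompt no longer contains the distracting opinion
```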

However, in the paper this is mainly done by extending the prompt, so this would not be a form of System 2 according to the talk by Subbarao Kambhampati.