In [1]:
%run Latex_macros.ipynb


# Adding extra-parametric capabilities to an LLM

Large Language Models have demonstrated zero-shot and few-shot ability on many tasks.

For example:
- Question Answering
- Mathematical reasoning

Moreover, some of these abilities emerge only when the number of parameters becomes very large.

<table>
    <tr>
        <th><center>Few/One/Zero shot learning</center></th>
    </tr>
    <tr>
        <td><img src="images/LM_Few_Shot_Accuracy.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Picture from: https://arxiv.org/pdf/2005.14165.pdf</center></td>
    </tr>   
</table>

This suggests that the parameters of the LLM
- encode factual knowledge
    - book knowledge
- encode procedural knowledge
    - how to solve a math problem

Besides consuming many parameters, encoding facts and procedures in parameters has drawbacks
- Factual knowledge is current *only up to the time of training*
- When solving math problems, LLMs are known
    - to get the procedural reasoning correct
    - but to make simple arithmetic errors in calculation

A recent trend has been to augment the LLM with capabilities *external* to its parameters
- Factual knowledge obtained from a live source: the Web
- Computational abilities by being able to *execute* programs produced by the LLM

We can illustrate the difference between parametric and non-parametric knowledge
by considering the task of Question Answering
- Parametric Knowledge: closed book exam
    - all knowledge acquired and stored before inference time
- Non-Parametric Knowledge: open book exam
    - knowledge can be looked up from an external source at inference time

Similarly, extra-parametric compute can be explained by analogy:
answering a question whose numeric answer is the solution to an equation
- Extra-parametric: using a calculator to evaluate the equation
- Not extra-parametric: solving the equation by hand
    - can have the correct equation but an incorrect answer due to miscalculation

We refer to
- the first case as *non-parametric knowledge*
- the second case as *extra-parametric compute*

We briefly discuss examples of both types.

# Non-Parametric Knowledge: Retriever-Generator Architecture

The *Retriever-Generator* architecture has two components
- A *Retriever* that is able to gather factual knowledge from an external (non-parametric) source
- A *Generator* that produces the answer

The process is sometimes called *Retrieval Augmented Generation (RAG)*.

Details may be found in
- this [paper](https://arxiv.org/pdf/2005.11401.pdf)


There is also a nice [online article](https://lilianweng.github.io/posts/2020-10-29-odqa/)
describing various approaches in the context of the Question Answering task.


<img src="images/retriever_generator_lweng.png" width=80%>

Attribution: https://lilianweng.github.io/posts/2020-10-29-odqa/

The architecture in the far-right of the diagram is our standard LLM
- Question as input
- Answer as output

In this case: world knowledge is encoded in the parameters of the LLM.

The Retriever-Generator architecture is depicted in the middle of the diagram
- Question is the input of the Retriever
- The Retriever's output (the "Context") is the input of the Generator
    - e.g., the Top 5 facts retrieved
- The Generator (LLM) outputs the answer, given the context obtained by the Retriever

In this case: world knowledge is *non-parametric*
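
The flow through the Retriever-Generator architecture can be sketched in a few lines. This is a minimal illustration, not the method of the paper: the Retriever here is a toy word-overlap ranker, the facts in `KNOWLEDGE_SOURCE` are made up, and the names `retrieve` and `build_generator_input` are hypothetical. The Generator itself (the LLM call) is left out; the sketch only builds its input.

```python
import re
from collections import Counter

# Toy knowledge source (illustrative facts, not taken from any dataset).
KNOWLEDGE_SOURCE = [
    "Paris is the capital of France.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
    "The Rhine flows through Germany.",
]


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, documents, k=2):
    """Toy Retriever: rank documents by word overlap with the question."""
    question_counts = Counter(tokenize(question))

    def overlap(document):
        document_counts = Counter(tokenize(document))
        return sum(min(question_counts[w], document_counts[w]) for w in question_counts)

    # Keep the k documents with the highest overlap score.
    return sorted(documents, key=overlap, reverse=True)[:k]


def build_generator_input(question, context):
    """Combine the retrieved context and the question into one Generator prompt."""
    context_block = "\n".join(f"- {fact}" for fact in context)
    return f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"


question = "What is the capital of France?"
context = retrieve(question, KNOWLEDGE_SOURCE, k=2)
prompt = build_generator_input(question, context)
print(prompt)
```

The Generator would then be conditioned on `prompt`, that is, on both the question and the retrieved context.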



The Generator-only architecture computes
$\pr{\y | \x }$
 directly.

The Generator component of the Retriever-Generator architecture
- is *conditioned* 
- on both question $\x$ and context $\z$ 
- in order to produce answer $\y$

$$\text{Generator: } \pr{ \y | \x, \z } = \pr{ \y | \x, \text{Retriever}(\x) }$$
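and ultimately $\pr{\y | \x }$, by marginalizing over the Top $K$ retrieved contexts $\z$
(the subscript $\eta$ denotes the Retriever's parameters)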

$$
\begin{array}{llll}
\pr{\y | \x}_\text{RAG Sequence} & \approx & \sum_{\z \in \text{Top } K \,\pr{ \cdot | \x }} { \pr{ \z | \x }_\eta \, \pr{ \y | \x, \z}   } \\
& = & \sum_{\z \in \text{Top } K \, \pr{ \cdot | \x }} { \pr{ \z | \x }_\eta \prod_{i=1}^N { \pr{ \y_i | \x, \z, \y_{(1:i-1)}}}   } & \text{ since } \pr{ \y | \x, \z} = \prod_{i=1}^N { \pr{ \y_i | \x, \z, \y_{(1:i-1)}} }\\
\end{array}
$$
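
A small numeric example may make the RAG Sequence sum concrete. All probabilities below are made up for illustration: `retriever_probs` stands in for the Retriever's probability of each of the Top $K = 2$ contexts, and `token_probs` for the Generator's per-token probabilities of one fixed answer of length $N = 3$.

```python
import math

# Hypothetical Retriever probabilities p(z | x) for the Top-K (K = 2) contexts.
retriever_probs = {"z1": 0.7, "z2": 0.3}

# Hypothetical Generator probabilities p(y_i | x, z, y_{1:i-1}) for a fixed
# answer y with N = 3 tokens, under each context z.
token_probs = {
    "z1": [0.9, 0.8, 0.5],
    "z2": [0.2, 0.5, 0.5],
}


def rag_sequence_prob(retriever_probs, token_probs):
    """Sum over contexts z of p(z | x) * prod_i p(y_i | x, z, y_{1:i-1})."""
    total = 0.0
    for z, p_z in retriever_probs.items():
        p_y_given_z = math.prod(token_probs[z])  # product over the N tokens
        total += p_z * p_y_given_z
    return total


p = rag_sequence_prob(retriever_probs, token_probs)
print(round(p, 4))  # 0.267  (= 0.7 * 0.36 + 0.3 * 0.05)
```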

**Note**

The [paper](https://arxiv.org/pdf/2005.11401.pdf) contrasts
- "RAG Sequence": a single context depending only on question $\x$
- "RAG Token": a separate context for each target token $\y_\tp$
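
For comparison, the RAG Token variant can be written in the same notation. This is a sketch based on the description in the paper, which should be consulted for the exact statement; the only change is that the sum over contexts moves *inside* the product over target tokens, so each token may draw on a different context.

$$
\pr{\y | \x}_\text{RAG Token} \approx \prod_{i=1}^N { \sum_{\z \in \text{Top } K \, \pr{ \cdot | \x }} { \pr{ \z | \x }_\eta \, \pr{ \y_i | \x, \z, \y_{(1:i-1)}} } }
$$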




The Retriever-Reader architecture (far left of the diagram)
- is similar to the Retriever-Generator
- but uses a Reader rather than Generator to output the answer
    - The answer produced by the Reader is a sub-string of the retrieved facts
    - identified by a start/end position
    
The world knowledge is non-parametric (just like the Retriever-Generator)
- but the answer format is much more restricted
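
To illustrate how a Reader can identify an answer by start/end position, here is a minimal sketch. It assumes the Reader emits one start score and one end score per token of the retrieved text (as is common in extractive Question Answering models; this detail is an assumption, not stated above). The scores and the helper `best_span` are hypothetical.

```python
def best_span(start_scores, end_scores, max_len=5):
    """Return (start, end) maximizing start_scores[s] + end_scores[e],
    subject to s <= e and a span of at most max_len tokens."""
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Tokens of a retrieved fact, with made-up Reader scores per token.
tokens = ["Paris", "is", "the", "capital", "of", "France", "."]
start_scores = [4.0, 0.1, 0.2, 0.3, 0.1, 0.5, 0.0]
end_scores = [3.5, 0.2, 0.1, 0.4, 0.1, 0.6, 0.0]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)  # Paris
```

The answer can only ever be a sub-string of the retrieved text, which is why the answer format is more restricted than with a Generator.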

## Retriever-Generator: training

Both the Retriever and the Generator are parameterized.

When the Generator is an LLM
- a pre-trained LLM may be used
- and its parameters "fine-tuned"
- rather than trained from scratch

But the Retriever's parameters need to be learned from scratch via training
- what must be learned depends on how the Retriever obtains external knowledge
- e.g., how to generate a "query" to the Knowledge Source

Here is a diagram

<img src="images/retriever_generator_training.png">

Attribution: https://arxiv.org/pdf/2005.11401.pdf#page=2

# Extra Parametric Compute

LLMs have been shown to have *some* ability to perform math.

However: this seems to be one of the capabilities that "emerge" only in large models.

<img src="images/gpt3_arithmetic.png">

Attribution: https://arxiv.org/pdf/2005.14165.pdf#page=21


The above chart was for a simple arithmetic operation.

LLMs have also demonstrated some ability on multi-step reasoning problems.

The ability to solve multi-step problems is improved by
- [Chain of Thought prompting](https://arxiv.org/pdf/2201.11903.pdf)
    - prompting the model to show the solution "step by step"
- [Show your work prompting](https://arxiv.org/pdf/2112.00114.pdf)

Both these methods guide the LLM to produce the answer in small steps, rather than all at once.
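
A Chain of Thought prompt can be assembled as follows. This is a minimal sketch: the exemplar is in the style of the worked examples in the Chain of Thought paper, while the `Q:`/`A:` layout and the helper name `build_cot_prompt` are assumptions made for illustration.

```python
# One worked exemplar: question, step-by-step rationale, final answer.
EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        "rationale": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "11",
    },
]


def build_cot_prompt(exemplars, new_question):
    """Few-shot prompt whose exemplars show the reasoning before the answer."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['rationale']} The answer is {ex['answer']}."
        )
    # The new question ends with an open "A:" for the model to complete.
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)


prompt = build_cot_prompt(
    EXEMPLARS,
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
print(prompt)
```

Because the exemplar's answer is preceded by its intermediate steps, the model is nudged to produce the steps for the new question too.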

<table>
    <tr>
        <td><img src="images/cot_math.png"></td>
    </tr>
    <tr>
        <td><center>Attribution: https://arxiv.org/pdf/2201.11903.pdf#page=19</center></td>
    </tr>
</table>


In multi-step math problems, LLMs
- sometimes generate the correct sequence of solution steps
- but fumble the arithmetic (e.g., failing to carry a digit)

The [CoT paper](https://arxiv.org/pdf/2201.11903.pdf#page=27) calls these "calculator errors"
- They report that 34% of examples demonstrated calculator errors
    - including those with incorrect reasoning

LLMs perform poorly on a simple mathematical task:
- [output the sum of the two inputs, **plus 1**](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/modified_arithmetic#example)
- using few-shot learning

<img src="images/arith_ex1.png" width= 70%>

GPT-3 has been reported to have **zero** accuracy on this task.
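
To make the task concrete, the sketch below generates a few-shot prompt of this kind. The exact wording and layout of the benchmark's prompts are an assumption here (see the linked task page for the real format); the helper `make_plus_one_prompt` is hypothetical.

```python
import random


def make_plus_one_prompt(num_examples=3, seed=0):
    """Build a few-shot prompt where '->' maps a + b to a + b + 1.

    Returns the prompt text and the expected answer for the final line.
    """
    rng = random.Random(seed)
    lines = [
        "In the following lines, the symbol -> represents "
        "a simple mathematical operation."
    ]
    # Exemplars: the model must infer the hidden "+ 1" from these.
    for _ in range(num_examples):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        lines.append(f"{a} + {b} -> {a + b + 1}")
    # Final line: left open for the model to complete.
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    lines.append(f"{a} + {b} ->")
    return "\n".join(lines), a + b + 1


prompt, target = make_plus_one_prompt()
print(prompt)
print("expected:", target)
```

A model that simply adds the two inputs, as it has seen countless times in training, gets every such question wrong.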

Even with explicit instruction (as above), the model performs poorly on tasks that require "looping"
- e.g., What is the $50^{th}$ number in the Fibonacci sequence?

<img src="images/pot_vs_cot.png">

On the other hand, LLMs have been shown to have the ability to generate programs.

[Program of Thoughts Prompting](https://arxiv.org/pdf/2211.12588v3.pdf)
is a method
- where the LLM is prompted (few-shot) to produce *programs* as output
- and the programs are *executed* by an external module

In other words
- the LLM tries to get the step by step process correct
- and uses an external "calculator" to avoid "doing the math"
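
The division of labor can be sketched as follows. No model is called here: `GENERATED_PROGRAM` is hard-coded text standing in for what an LLM might return for the Fibonacci question, and storing the result in a variable named `ans` is a convention assumed for this sketch. The sequence is taken to start 1, 1, 2, ..., which is also an assumption.

```python
# Hypothetical LLM output for the prompt
# "What is the 50th number in the Fibonacci sequence?"
GENERATED_PROGRAM = """
sequence_length = 50
previous_number, current_number = 0, 1
for _ in range(sequence_length - 1):
    previous_number, current_number = current_number, previous_number + current_number
ans = current_number
"""


def run_program(program_text, answer_variable="ans"):
    """The external module: execute the generated code in a fresh namespace
    and return the value of the answer variable.

    exec() runs arbitrary code, so a real system would need a sandbox.
    """
    namespace = {}
    exec(program_text, namespace)
    return namespace[answer_variable]


answer = run_program(GENERATED_PROGRAM)
print(answer)  # 12586269025
```

The LLM only has to get the loop right; the 49 additions are carried out exactly by the interpreter.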

Program of Thoughts prompting is like Chain of Thought prompting
- the prompt asks for a "step by step" answer
- the exemplars encourage *descriptive variable names*
    - which may improve the ability to generate a correct program

## MRKL systems

The Modular Reasoning, Knowledge and Language (MRKL) architecture
augments an LLM with the ability to call external systems
- Web query
- Calendar
- Program Execution

<table>
    <tr>
        <th><center>MRKL Architecture</center></th>
    </tr>
    <tr>
        <td><img src="images/mrkl.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Attribution: https://arxiv.org/pdf/2205.00445.pdf#page=2</center></td>
    </tr>
</table>


An LLM using MRKL is trained to make external calls when generating the answer.

The prompt

    What is 10 + 1 ?

might generate the response

    CALCULATOR( 10 + 1 )
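
The external side of such a call can be sketched as a small dispatcher. This is an illustration under stated assumptions, not the MRKL implementation: the `NAME( argument )` response format is taken from the example above, the `TOOLS` registry and `dispatch` function are hypothetical, and the calculator supports only `+`, `-`, `*`, `/` on numeric literals.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator is willing to evaluate.
_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression):
    """Evaluate +, -, *, / on numeric literals by walking the syntax tree
    (avoids calling eval() on model output)."""

    def evaluate(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return evaluate(ast.parse(expression.strip(), mode="eval").body)


# Hypothetical registry of external systems, keyed by the name the LLM emits.
TOOLS = {"CALCULATOR": calculator}
CALL_PATTERN = re.compile(r"^\s*([A-Z_]+)\(\s*(.*?)\s*\)\s*$")


def dispatch(llm_response):
    """Route a response of the form NAME( argument ) to the matching tool;
    return any other response unchanged."""
    match = CALL_PATTERN.match(llm_response)
    if match is None or match.group(1) not in TOOLS:
        return llm_response
    tool_name, argument = match.groups()
    return TOOLS[tool_name](argument)


print(dispatch("CALCULATOR( 10 + 1 )"))  # 11
```

Adding a Web query or Calendar module would amount to registering another entry in `TOOLS`.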
    

## Program Aided LLM

The [Program Aided Language Model](https://arxiv.org/pdf/2211.10435.pdf) (PAL) approach
- prompts the LLM (few-shot) to output *Python code* rather than a text answer
- obtains the "answer" by running an external code interpreter on the generated code

<table>
    <tr>
        <th><center>Program Aided Language Model</center></th>
    </tr>
    <tr>
        <td><img src="images/program_aided_llm.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Attribution: https://arxiv.org/pdf/2211.10435.pdf#page=2</center></td>
    </tr>
</table>
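
A minimal sketch of the idea follows. The text in `GENERATED_CODE` is hard-coded and merely stands in for model output; wrapping the steps in a function named `solution` is an assumption made for this sketch, as is the helper `run_solution`.

```python
# Hypothetical LLM output: the reasoning steps are written as Python
# statements, and the final number is never computed by the model itself.
GENERATED_CODE = '''
def solution():
    """Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
    Each can has 3 tennis balls. How many tennis balls does he have now?"""
    tennis_balls_initial = 5
    cans_bought = 2
    balls_per_can = 3
    tennis_balls_total = tennis_balls_initial + cans_bought * balls_per_can
    return tennis_balls_total
'''


def run_solution(code_text):
    """The external interpreter: define solution() from the generated text,
    then call it. exec() runs arbitrary code, so a real system needs a sandbox."""
    namespace = {}
    exec(code_text, namespace)
    return namespace["solution"]()


print(run_solution(GENERATED_CODE))  # 11
```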



Some recent advances on solving multi-step quantitative reasoning
problems can be found in
- [Minerva: Solving Quantitative Reasoning Problems with Language Models](https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html)

# FinQA: Financial Question Answering 

The [FinQA dataset](https://arxiv.org/pdf/2109.00122.pdf)
was created to test the ability of a model
- to perform Question Answering in the domain of Finance
- demonstrating the reasoning behind the answer
    - by outputting a program to calculate the answer

<img src="images/finqa_ex.png">

The authors demonstrate a Retriever-Generator model for the task.
- Retriever: External Knowledge Source to store Financial Reports on companies
- Generator: outputs a "calculator program"

Here we see both forms of extra-parametric capabilities integrated with an LLM.
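
A "calculator program" of this kind is a short sequence of operations, where `#k` refers to the result of an earlier step. The sketch below interprets such a program. It is an illustration only: it covers just four operations, the numbers are made up rather than taken from the dataset, and the helper `run_calculator_program` is hypothetical.

```python
import re

# The subset of operations supported by this sketch.
OPERATIONS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}

# Matches one step of the form  name(arg1, arg2).
STEP_PATTERN = re.compile(r"(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)")


def run_calculator_program(program):
    """Execute the steps left to right and return the last result.

    An argument '#k' is replaced by the result of step k (0-based).
    """
    results = []

    def resolve(argument):
        if argument.startswith("#"):
            return results[int(argument[1:])]
        return float(argument)

    for name, left, right in STEP_PATTERN.findall(program):
        results.append(OPERATIONS[name](resolve(left), resolve(right)))
    return results[-1]


# Illustrative question: percentage change from 100 to 120.
growth = run_calculator_program("subtract(120, 100), divide(#0, 100)")
print(growth)  # 0.2
```

The program doubles as an explanation: each step shows which reported numbers were used and how they were combined.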


The authors of the FinQA dataset have also created the [ConvFinQA dataset](https://arxiv.org/pdf/2210.03849.pdf)
- *conversational* question answering
- a follow-up question can reference the answer to a previous question

# Conclusion

There are some obvious benefits to adding Extra Parametric capabilities to an LLM
- The LLM can be smaller
    - knowledge (factual and procedural) stored outside of parameters
- New "skills" can be added via exemplars demonstrating calls to a new library
    - Derivative pricing library

In [2]:
print("Done")

Done
