## Imports

In [1]:
%%capture
# System installs
!sudo apt install texlive-extra-utils
!sudo apt install tralics

# Python installs
%mkdir papers
%pip install python-magic
%pip install latex2mathml
%pip install --upgrade pip
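# Pin to the pre-1.0 SDK: openai.Embedding / openai.ChatCompletion (used below) were removed in 1.0
%pip install "openai<1.0"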

# Forked version, works without GROBID
!git clone https://github.com/irhum/s2orc-doc2json.git

## Preprocessing

In [2]:
import openai
openai.api_key = 'YOUR KEY HERE' #@param {type:"string"}
paper_idx = "2210.05359" #@param {type:"string"}

### Download paper's LaTeX source

In [3]:
import sys
import json
import functools
import tenacity
import numpy as np
import re

sys.path.append("s2orc-doc2json")
from doc2json.tex2json import process_tex

In [4]:
%%capture
idx_cleaned = paper_idx.replace(".", "_")
!curl https://arxiv.org/e-print/{paper_idx} -o {idx_cleaned}.gz
process_tex.process_tex_file(f"{idx_cleaned}.gz", output_dir="papers")

### LaTeX to processed text

In [5]:
def process_passages(raw_passages):
    passages = []

    for raw_passage in raw_passages:
        # These are strings with "reference IDs" for equations/tables/figs
        passage = raw_passage['text']

        # indicator if passage is a display mode latex passage
        is_display = False

        # replace equation ref ids with actual latex
        for span in raw_passage['eq_spans']:
            passage = passage.replace(span['ref_id'], f"${span['latex']}$")
            if 'DISPLAY' in span['ref_id']:
                is_display = True

        # if either displaymode latex passage, or a short passage, add to preceding
        # (only when a preceding passage exists; otherwise fall through and append)
        if passages and (is_display or len(passage) < 150):
            passages[-1] = passages[-1] + passage

        # if passage is longer than 1200 chars, break into chunks of 750 chars max.
        elif len(passage) > 1200:
            MAX_CHUNK = 750
            passage_chunks = [passage[i:i + MAX_CHUNK] for i in range(0, len(passage), MAX_CHUNK)]
            passages.extend(passage_chunks)
        else:
            passages.append(passage)
    
    return passages

# Load processed JSON
with open(f"papers/{idx_cleaned}.json") as f:
    data = json.load(f)

# extract abstract, and list of passages
abstract = "".join([texts['text'] for texts in data['latex_parse']['abstract']])
passages = process_passages(data['latex_parse']['body_text'])

### Embed passages

In [6]:
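# Back off exponentially, but stop after 6 attempts so a bad API key fails instead of retrying forever
@tenacity.retry(wait=tenacity.wait_random_exponential(min=1, max=60),
                stop=tenacity.stop_after_attempt(6))
def get_embed(text, model="text-embedding-ada-002"):
    response = openai.Embedding.create(input=[text], model=model)
    return response['data'][0]['embedding']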

embeds_resp = [get_embed(passage) for passage in passages]
embeds = np.array(embeds_resp)

## Q/A

In [7]:
question = "How is the physics sim integrated with the language model?" #@param {type:"string"}

### Step 1: Generate "hypothetical passages" to answer the question.

This is inspired by the [HyDE](https://arxiv.org/abs/2212.10496) paper, which makes the following observation: the embedding of a real answer passage tends to be closer to the embedding of a "hypothetical answer" than to the embedding of the question itself.
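A minimal sketch of how that claim could be checked for a single passage, assuming you already have three embedding vectors (for example from `get_embed`). The helper names `cosine_sim` and `closer_query` are hypothetical and not part of the notebook, and the vectors in the usage lines are made-up toy values, not real embeddings.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closer_query(real_emb, question_emb, hypothetical_emb):
    """Report which query embedding is more similar to the real passage embedding."""
    sims = {
        "question": cosine_sim(real_emb, question_emb),
        "hypothetical": cosine_sim(real_emb, hypothetical_emb),
    }
    return max(sims, key=sims.get), sims

# Toy 2-D vectors purely for illustration (NOT real embeddings).
best, sims = closer_query(
    real_emb=[1.0, 0.0],
    question_emb=[0.0, 1.0],
    hypothetical_emb=[1.0, 1.0],
)
print(best, sims)
```

Running this over several (question, real passage) pairs from a paper would give a rough sense of whether the hypothetical passage is actually the better query for that paper.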

In [8]:
messages_gen = [{"role": "system", "content": r"""You are an interactive academic paper writing assistant. You are well-versed in authoritative, academic writing, and produce perfect LaTeX where needed to support your idea. The user interacts with you as follows:
* The user provides you an abstract.
* Once you finish analyzing, you prompt the user for a question.
* The user asks you a question. In response, you produce a passage (Passage 1).
    * This should NOT directly answer the question; it is a passage from the final paper you two are writing.
* You also produce an alternate response (Passage 2), which is about some small technical detail in Passage 1"""},

{"role": "user", "content": "Here's the abstract for the paper we're looking at: Likelihood, although useful as a training loss, is a poor search objective for guiding open-ended generation from language models (LMs). Existing generation algorithms must avoid both unlikely strings, which are incoherent, and highly likely ones, which are short and repetitive. We propose contrastive decoding (CD), a more reliable search objective that returns the difference between likelihood under a large LM (called the expert, e.g. OPT-13b) and a small LM (called the amateur, e.g. OPT-125m). CD is inspired by the fact that the failures of larger LMs are even more prevalent in smaller LMs, and that this difference signals exactly which texts should be preferred. CD requires zero training, and produces higher quality text than decoding from the larger LM alone. It also generalizes across model types (OPT and GPT2) and significantly outperforms four strong decoding algorithms in automatic and human evaluations. FOOTREF1"},

{"role": "assistant", "content": "I've finished analyzing this abstract. How can I help?"},

{"role": "user", "content": "How does this technique choose the next best token?"},

{"role": "assistant", "content": r"""Passage 1 - [[In order to contrast the expert and amateur LM, contrastive decoding searches for text that maximizes the contrastive objective $\mathcal {L}_{\text{CD}} = \log p_\textsc {exp}(\textsf {x$ cont $}\mid \textsf {x$ pre $}) - \log p_\textsc {ama}(\textsf {x$ cont $}\mid \textsf {x$ pre $})$ , subject to constraints that $\textsf {x$ cont $}$ should be plausible (i.e., achieve sufficiently high probability under the expert LM)]].
Passage 2 - [[As shown in Equation EQREF5 , we first filter tokens based on plausibility constraints $\mathcal {V}_\text{head}(x_{<i})$ , eliminating tokens that fail to achieve sufficiently high probabilities under the expert LM. Then we score the remaining tokens based on the amount of contrast they demonstrate, according to $ \log p_\textsc {exp}( x_i \mid x_{<i}) - \log p_\textsc {ama}( x_i \mid x_{<i})$ . As a result, we end up selecting plausible tokens under the expert LM that least resemble the amateur LM.]]"""},

{"role": "user", "content": f"Great work! Let's move on to a different topic. Here's the abstract for a new paper we're looking at: {abstract}"},
 
{"role": "assistant", "content": "How can I help?"},
 
{"role": "user", "content": f"{question}"}]

In [9]:
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',  
    temperature=0.8,            
    messages=messages_gen, 
)

gen_passages = response['choices'][0]['message']['content']
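# Non-greedy match so two [[...]] passages on the same line are not merged;
# DOTALL lets a passage span multiple lines.
gen_passages = re.findall(r'\[\[(.*?)\]\]', gen_passages, flags=re.DOTALL)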

### Step 2: Retrieve best "real" passage

We embed the "hypothetical" passage, and use maximum inner product search (implemented as a literal inner product, since we're only searching one query against ~100 vectors) to find the "real" passages that are most similar.

Note that we generate two "hypothetical" passages but only use the second one (`gen_passages_emb[1]`). This isn't strictly necessary; with the current prompt, the second passage tends to have more "technical content", so it is more likely to match the actual desired passage.
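The brute-force search described above can be sketched as a small standalone function. The name `top_k_by_inner_product` is hypothetical, and the toy arrays stand in for the notebook's `embeds` and query embedding. The sketch assumes the embeddings are roughly unit-length, in which case inner product and cosine similarity produce the same ranking.

```python
import numpy as np

def top_k_by_inner_product(passage_embeds, query_embed, k=4):
    """Return indices and scores of the k passages with the largest inner product."""
    passage_embeds = np.asarray(passage_embeds, dtype=float)  # shape (n_passages, dim)
    query_embed = np.asarray(query_embed, dtype=float)        # shape (dim,)
    scores = passage_embeds @ query_embed                     # one score per passage
    order = np.argsort(-scores)[:k]                           # highest score first
    return order, scores[order]

# Toy unit-length vectors for illustration only.
toy_embeds = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
order, scores = top_k_by_inner_product(toy_embeds, [1.0, 0.0], k=2)
print(order, scores)
```

With only about 100 passages per paper, a full matrix-vector product like this is cheap, so an approximate nearest-neighbour index is probably unnecessary at this scale.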

In [10]:
gen_passages_emb = np.array([get_embed(gen_passage) for gen_passage in gen_passages])
retr_passages = [passages[i] for i in np.argsort(-embeds @ gen_passages_emb[1])][:4]
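# Number from 1 so the labels match the "Passage 1 / Passage 2" format of the few-shot example
retr_passages = "\n".join([f'Passage {i} - <<{passage}>>' for (i, passage) in enumerate(retr_passages, start=1)])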

### Step 3: Answer generation

We retrieve the four "real" passages closest to the "hypothetical" passage, inject them into the prompt as if the assistant had retrieved them itself, and then simply ask for the answer.
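As a rough sketch, the injection step amounts to building the last three chat turns from the question and the retrieved passages. The function name `build_answer_turns` is hypothetical; the message wording follows the prompt used in this notebook, and passages are numbered from 1 here on the assumption that this should match the few-shot example.

```python
def build_answer_turns(question, passages):
    """Build the final user/assistant/user turns that carry the retrieved passages."""
    # Format each passage the same way as the few-shot example: "Passage N - <<...>>"
    formatted = "\n".join(
        f"Passage {i} - <<{passage}>>" for i, passage in enumerate(passages, start=1)
    )
    return [
        {"role": "user",
         "content": f"Retrieve up to 4 high quality passages to answer the following: {question}"},
        # The retrieved passages are presented as if the assistant had produced them.
        {"role": "assistant", "content": formatted},
        {"role": "user",
         "content": "Analyzing and extracting information from these passages, "
                    "answer the question. DO NOT make unsupported claims."},
    ]

turns = build_answer_turns("What is being simulated?", ["first passage", "second passage"])
print(turns[1]["content"])
```

These turns would be appended after the system prompt and the worked few-shot example before calling the chat model.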

In [11]:
messages_ans = [{"role": "system", "content": """You are a research assistant with access to a powerful search engine for scientific contexts. When your user asks you a question, you:
1. Retrieve up to 4 relevant passages. Although these passages may have grammatical/syntax errors, they are factually correct.
2. Analyze the relevance of the passages to the question being asked.
3. Produce a response that is as concise as possible while answering the question."""},

{"role": "user", "content": "Here's the abstract for the paper we're looking at: Likelihood, although useful as a training loss, is a poor search objective for guiding open-ended generation from language models (LMs). Existing generation algorithms must avoid both unlikely strings, which are incoherent, and highly likely ones, which are short and repetitive. We propose contrastive decoding (CD), a more reliable search objective that returns the difference between likelihood under a large LM (called the expert, e.g. OPT-13b) and a small LM (called the amateur, e.g. OPT-125m). CD is inspired by the fact that the failures of larger LMs are even more prevalent in smaller LMs, and that this difference signals exactly which texts should be preferred. CD requires zero training, and produces higher quality text than decoding from the larger LM alone. It also generalizes across model types (OPT and GPT2) and significantly outperforms four strong decoding algorithms in automatic and human evaluations. FOOTREF1"},

{"role": "assistant", "content": "I've finished analyzing this abstract. How can I help?"},

{"role": "user", "content": "Retrieve up to 4 high quality passages to answer the following: How does this technique choose the next best token?"},

{"role": "assistant", "content": r"""Passage 1 - <<In order to contrast the expert and amateur LM, contrastive decoding searches for text that maximizes the contrastive objective $\mathcal {L}_{\text{CD}} = \log p_\textsc {exp}(\textsf {x$ cont $}\mid \textsf {x$ pre $}) - \log p_\textsc {ama}(\textsf {x$ cont $}\mid \textsf {x$ pre $})$ , subject to constraints that $\textsf {x$ cont $}$ should be plausible (i.e., achieve sufficiently high probability under the expert LM).>>
Passage 2 - <<As shown in Equation EQREF5 , we first filter tokens based on plausibility constraints $\mathcal {V}_\text{head}(x_{<i})$ , eliminating tokens that fail to achieve sufficiently high probabilities under the expert LM. Then we score the remaining tokens based on the amount of contrast they demonstrate, according to $ \log p_\textsc {exp}( x_i \mid x_{<i}) - \log p_\textsc {ama}( x_i \mid x_{<i})$ . As a result, we end up selecting plausible tokens under the expert LM that least resemble the amateur LM.>>"""},

{"role": "user", "content": "Using these passages (directly copying LaTeX as needed), answer the question."},

{"role": "assistant", "content": r"First, we filter out all tokens that the expert LM doesn't assign high probabilities to. Then, each token is scored based on the *difference* in log-likelihood between the expert and amateur LMs ($ \log p_\textsc {exp}( x_i \mid x_{<i}) - \log p_\textsc {ama}( x_i \mid x_{<i})$). In doing so, we take advantage of the fact that the amateur LM provides a signal of which tokens to *avoid*."},

{"role": "user", "content": f"Let's move on to a new topic. Here's the abstract for the new paper we're looking at: {abstract}"},
 
{"role": "assistant", "content": "Okay, what's your question?"},
 
{"role": "user", "content": f"Retrieve up to 4 high quality passages to answer the following: {question}"},

{"role": "assistant", "content": f"{retr_passages}"},

{"role": "user", "content": "Analyzing and extracting information from these passages, answer the question. DO NOT make unsupported claims."}]

In [12]:
completions = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',  
    temperature=0.5,
    messages=messages_ans
)

### Get answer

In [13]:
completions['choices'][0]['message']['content']

"The physics simulation engine (MuJoCo) is integrated with the language model (LM) in Mind's Eye by appending the simulation results to the input prompts of LMs during inference. The simulation engine returns the most likely outcome based on its encoded world knowledge, which is then used as part of the input for the LM. This allows the LM to perform grounded reasoning in the physical world. The implementation of Mind's Eye comprises three main components: a text-to-code LM as the front-end, a physics simulation engine as the back-end, and a foundation model for general reasoning. The proposed method can be used as a plug-and-play framework that works with any LM and requires neither handcrafted prompts nor costly fine-tuning."

## Analysis

### Tech
#### Information Retrieval (IR)
* Just embedding the question as the query vector does *not* work well.
* Generating "hypothetical" answers and embedding them works...if the answers are relevant. Need better prompts to actually encourage the model to hallucinate.
* ^This generation process feels sledge-hammery though. There are only about 100 vectors per paper, so can't we just use BM25?
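For reference, here is a rough sketch of what a BM25 baseline over the passages might look like. None of this is in the notebook: `tokenize` and `bm25_scores` are hypothetical names, the tokenizer is deliberately naive (lowercased alphanumeric runs, which will mangle LaTeX), `k1=1.5` and `b=0.75` are just commonly used defaults, and the IDF uses the non-negative `log(1 + ...)` variant.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / max(n_docs, 1)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation: long passages are penalised relative to the average
            denom = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy passages for illustration only.
toy_docs = [
    "the physics simulator returns an outcome",
    "we fine-tune the language model",
    "results are appended to the prompt",
]
scores = bm25_scores("physics simulator", toy_docs)
print(max(range(len(scores)), key=scores.__getitem__))
```

In the notebook's setting, `docs` would be the `passages` list and `query` the raw question, which would skip the generation step entirely. Whether lexical matching holds up when the question and the passage use different wording is exactly what would need testing.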

#### Synthesis
* In early tests, if I provide "gold standard" retrieved passages, ChatGPT seems quite good at condensing them into an answer. The problem then is in the IR, not the LM itself.

### Usefulness
* This is not a useful tool as designed. If I want to know if there's a scaling law plot in the paper...I can just scan the paper, no need to wait 20 seconds for this whole pipeline to execute.
* More interesting is stuff *across* papers. What if I'm reading through this *application* of the [GBP algorithm to robots](https://arxiv.org/abs/2203.11618), and behind the scenes the system also scrapes [this](https://arxiv.org/abs/2107.02308) so I can ask questions about GBP itself?