# Dense-X-Retrieval Pack

This notebook walks through using the `DenseXRetrievalPack`, which parses documents into nodes and then generates propositions from each node to assist with retrieval.

This follows the idea from the paper [Dense X Retrieval: What Retrieval Granularity Should We Use?](https://arxiv.org/abs/2312.06648).

From the paper, a proposition is described as:

```
Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format.
```

We use the provided OpenAI prompt from their paper to generate propositions, which are then embedded and used to retrieve their parent node chunks.
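
To make the proposition-to-parent lookup concrete, here is a minimal toy sketch of the idea. It is not the pack's implementation: the chunks and propositions are hand-written placeholders (in the pack an LLM generates the propositions), the bag-of-words `embed` function stands in for a real embedding model, and all names (`chunks`, `propositions`, `retrieve_parent_chunks`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical parent chunks, keyed by an illustrative node id.
chunks = {
    "chunk-a": "The library opens at nine. It closes at five.",
    "chunk-b": "The cafe sells coffee. It also sells tea.",
}

# Hand-written propositions, each linked to the chunk it was derived from.
# (Assumption: in the real pack these come from an LLM prompt.)
propositions = [
    ("The library opens at nine.", "chunk-a"),
    ("The library closes at five.", "chunk-a"),
    ("The cafe sells coffee.", "chunk-b"),
    ("The cafe sells tea.", "chunk-b"),
]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_parent_chunks(query, top_k=2):
    """Rank propositions against the query, then return their parent chunks."""
    query_vec = embed(query)
    ranked = sorted(
        propositions,
        key=lambda item: cosine(query_vec, embed(item[0])),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:top_k]:
        if parent_id not in parent_ids:  # de-duplicate shared parents
            parent_ids.append(parent_id)
    return [chunks[parent_id] for parent_id in parent_ids]


library_hits = retrieve_parent_chunks("When does the library close?", top_k=1)
cafe_hits = retrieve_parent_chunks("Does the cafe sell tea?", top_k=2)
print(library_hits)
print(cafe_hits)
```

The search runs over the small, self-contained propositions, but what gets returned is the larger parent chunk, so the answering LLM still sees the surrounding context.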

## Setup

In [1]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

import nest_asyncio

nest_asyncio.apply()

In [1]:
!mkdir -p 'data/'
!curl 'https://arxiv.org/pdf/2307.09288.pdf' -o 'data/llama2.pdf'

In [2]:
from llama_hub.file.unstructured import UnstructuredReader

documents = UnstructuredReader().load_data("data/llama2.pdf")


## Run the DenseXRetrievalPack

The `DenseXRetrievalPack` creates both a retriever and a query engine.

In [None]:
from llama_index.llama_pack import download_llama_pack

DenseXRetrievalPack = download_llama_pack("DenseXRetrievalPack", "./dense_pack")

In [4]:
from llama_index.llms import OpenAI
from llama_index.text_splitter import SentenceSplitter

dense_pack = DenseXRetrievalPack(
    documents,
    proposition_llm=OpenAI(model="gpt-3.5-turbo", max_tokens=750),
    query_llm=OpenAI(model="gpt-3.5-turbo", max_tokens=256),
    text_splitter=SentenceSplitter(chunk_size=1024),
)
dense_query_engine = dense_pack.query_engine

100%|██████████| 90/90 [02:02<00:00,  1.36s/it]
Generating embeddings: 100%|██████████| 2210/2210 [00:13<00:00, 159.12it/s]


In [5]:
from llama_index import VectorStoreIndex

base_index = VectorStoreIndex.from_documents(documents)
base_query_engine = base_index.as_query_engine()

## Test Queries

### How was Llama2 pretrained?

In [6]:
from IPython.display import Markdown, display

response = dense_query_engine.query("How was Llama2 pretrained?")
display(Markdown(str(response)))

Llama 2 was pretrained using an optimized auto-regressive transformer. Several changes were made to improve performance, including more robust data cleaning, updated data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models.

In [7]:
response = base_query_engine.query("How was Llama2 pretrained?")
display(Markdown(str(response)))

Llama 2 was pretrained using an optimized auto-regressive transformer. The pretraining approach involved robust data cleaning, updated data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. The training corpus included a new mix of data from publicly available sources, excluding data from Meta's products or services. The pretraining methodology and training details are described in more detail in the provided context.

### What baselines are used to compare performance and accuracy?

In [8]:
response = dense_query_engine.query(
    "What baselines are used to compare performance and accuracy?"
)
display(Markdown(str(response)))

The baselines used to compare performance and accuracy are Llama 1, Falcon, and MPT. These models are compared with Llama 2 to evaluate its advancements in terms of safety and other important aspects.

In [9]:
response = base_query_engine.query(
    "What baselines are used to compare performance and accuracy?"
)
display(Markdown(str(response)))

The baselines used to compare performance and accuracy are the MPT and Falcon models. These models are compared against the evaluation framework and any publicly reported results to determine the best score.

### What datasets were used for measuring performance and accuracy?

In [10]:
response = dense_query_engine.query(
    "What datasets were used for measuring performance and accuracy?"
)
display(Markdown(str(response)))

The datasets used for measuring performance and accuracy include HumanEval, MBPP, PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, CommonsenseQA, NaturalQuestions, TriviaQA, SQuAD, GSM8K, and MATH.

In [11]:
response = base_query_engine.query(
    "What datasets were used for measuring performance and accuracy?"
)
display(Markdown(str(response)))

The datasets used for measuring performance and accuracy include HumanEval, MBPP, PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA, CommonsenseQA, NaturalQuestions, TriviaQA, SQuAD, QuAC, BoolQ, GSM8K, and MATH.
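
The two answers above are nearly identical, so it helps to diff them programmatically. The sketch below copies the dataset names from the two responses shown in this run (LLM output can differ between runs) and reports what each engine listed that the other did not. The variable names are illustrative.

```python
# Dataset names copied from the dense (proposition-based) response above.
dense_datasets = [
    "HumanEval", "MBPP", "PIQA", "SIQA", "HellaSwag", "WinoGrande",
    "ARC easy and challenge", "OpenBookQA", "CommonsenseQA",
    "NaturalQuestions", "TriviaQA", "SQuAD", "GSM8K", "MATH",
]

# Dataset names copied from the baseline response above.
base_datasets = [
    "HumanEval", "MBPP", "PIQA", "SIQA", "HellaSwag", "WinoGrande",
    "ARC easy and challenge", "OpenBookQA", "CommonsenseQA",
    "NaturalQuestions", "TriviaQA", "SQuAD", "QuAC", "BoolQ", "GSM8K", "MATH",
]

# Order-preserving differences in each direction.
only_in_base = [name for name in base_datasets if name not in dense_datasets]
only_in_dense = [name for name in dense_datasets if name not in base_datasets]

print("Only in baseline answer:", only_in_base)  # ['QuAC', 'BoolQ']
print("Only in dense answer:", only_in_dense)  # []
```

For this query and this run, the baseline answer lists two datasets (QuAC and BoolQ) that the dense answer omits. A single query is not enough to draw conclusions about either approach.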