# Fuzzy Citation Query Engine

This notebook walks through the `FuzzyCitationEnginePack`, which wraps any existing query engine and post-processes its response object to attach direct sentence citations, identified using fuzzy matching.

## Setup

In [1]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

In [None]:
!mkdir -p 'data/'
!curl 'https://arxiv.org/pdf/2307.09288.pdf' -o 'data/llama2.pdf'

In [None]:
!pip install "unstructured[pdf]"

In [2]:
from llama_index import VectorStoreIndex

In [3]:
from llama_hub.file.unstructured import UnstructuredReader

documents = UnstructuredReader().load_data("data/llama2.pdf")

In [4]:
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

## Run the FuzzyCitationEnginePack

The `FuzzyCitationEnginePack` can wrap any existing query engine. Here we download the pack and wrap the `query_engine` built above.

In [12]:
from llama_index.llama_pack import download_llama_pack

FuzzyCitationEnginePack = download_llama_pack("FuzzyCitationEnginePack", "./fuzzy_pack")

In [13]:
fuzzy_engine_pack = FuzzyCitationEnginePack(query_engine, threshold=50)
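
To build some intuition for the `threshold` argument, here is a minimal sketch of threshold-based fuzzy sentence matching. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in scorer, so the pack's actual scoring function and sentence splitting may differ. `similarity` and `fuzzy_matches` are hypothetical helpers, not part of the pack, and the 0-100 score scale is an assumption made to mirror `threshold=50`.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (a stand-in for the pack's scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_matches(response_sentences, source_sentences, threshold=50):
    """Pair each response sentence with source sentences scoring above threshold."""
    matches = []
    for resp in response_sentences:
        for src in source_sentences:
            score = similarity(resp, src)
            if score > threshold:
                matches.append((resp, src, score))
    return matches


# Sentences taken from the notebook output, plus one unrelated sentence.
response_sentences = [
    "Llama 2 was pretrained using an optimized auto-regressive transformer."
]
source_sentences = [
    "(2023), using an optimized auto-regressive transformer, but made several changes to improve performance.",
    "San Francisco is a city in California.",
]

for resp, src, score in fuzzy_matches(response_sentences, source_sentences):
    print(f"{score:5.1f}  {src}")
```

Under this sketch, raising the threshold keeps only closer matches, while lowering it admits looser (and possibly spurious) ones.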

In [6]:
response = fuzzy_engine_pack.run("How was Llama2 pretrained?")

In [7]:
print(str(response))

Llama 2 was pretrained using an optimized auto-regressive transformer. The pretraining approach involved robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. The training corpus included a new mix of data from publicly available sources, excluding data from Meta's products or services. The pretraining methodology and training details are described in more detail in the provided context.


### Compare response to citation sentences

In [9]:
for response_sentence, node_chunk in response.metadata.keys():
    print("Response Sentence:\n", response_sentence)
    print("\nRelevant Node Chunk:\n", node_chunk)
    print("----------------")

Response Sentence:
 Llama 2 was pretrained using an optimized auto-regressive transformer. 

Relevant Node Chunk:
 Llama 2-Chat, a fine-tuned version of Llama 2 that is optimized for dialogue use cases. 
----------------
Response Sentence:
 Llama 2 was pretrained using an optimized auto-regressive transformer. 

Relevant Node Chunk:
 (2023), using an optimized auto-regressive transformer, but made several changes to improve performance. 
----------------
Response Sentence:
 The pretraining approach involved robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. 

Relevant Node Chunk:
 We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). 
----------------
Response Sentence:
 The pretraining approach involved robust data cleaning, upda

Compare the original LLM output:

```
Llama 2 was pretrained using an optimized auto-regressive transformer. The pretraining approach involved robust data cleaning, updating the data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention (GQA) to improve inference scalability for larger models. The training corpus included a new mix of data from publicly available sources, excluding data from Meta's products or services. The pretraining methodology and training details are described in more detail in the provided context.
```

with the fuzzy matches printed above, and we can see which source sentences support each sentence of the response.
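
The pairs printed above can also be grouped so that each response sentence lists all of its supporting source sentences. The sketch below assumes only what the loop above shows: that the metadata keys are `(response_sentence, source_sentence)` tuples. `group_citations` is a hypothetical helper, and `sample_metadata` is hand-built from the output above rather than produced by the pack.

```python
from collections import defaultdict


def group_citations(metadata):
    """Map each response sentence to the list of source sentences cited for it."""
    grouped = defaultdict(list)
    for response_sentence, source_sentence in metadata.keys():
        grouped[response_sentence.strip()].append(source_sentence.strip())
    return dict(grouped)


# Hand-built stand-in for response.metadata (values left empty for brevity).
sample_metadata = {
    (
        "Llama 2 was pretrained using an optimized auto-regressive transformer. ",
        "Llama 2-Chat, a fine-tuned version of Llama 2 that is optimized for dialogue use cases. ",
    ): {},
    (
        "Llama 2 was pretrained using an optimized auto-regressive transformer. ",
        "(2023), using an optimized auto-regressive transformer, but made several changes to improve performance. ",
    ): {},
}

for sentence, sources in group_citations(sample_metadata).items():
    print(sentence)
    for source in sources:
        print("  -", source)
```

With the real `response.metadata`, the same helper could be called as `group_citations(response.metadata)`.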

### [Advanced] Inspect citation metadata

Using the citation metadata, we can get the exact character location of each cited sentence in the original document!

In [10]:
for chunk_info in response.metadata.values():
    start_char_idx = chunk_info["start_char_idx"]
    end_char_idx = chunk_info["end_char_idx"]

    node = chunk_info["node"]
    node_start_char_idx = node.start_char_idx
    node_end_char_idx = node.end_char_idx

    # The citation indices are relative to the node, so we offset them by
    # the node's start index to locate the citation in the original document.
    document_start_char_idx = start_char_idx + node_start_char_idx
    document_end_char_idx = document_start_char_idx + (end_char_idx - start_char_idx)
    text = documents[0].text[document_start_char_idx:document_end_char_idx]

    print(text)
    print(node.metadata)
    print("----------------")

Llama 2-Chat, a fine-tuned version of Llama 2 that is optimized for dialogue use cases. 
{'filename': 'data/llama2.pdf'}
----------------
(2023), using an optimized auto-regressive transformer, but made several changes to improve performance. 
{'filename': 'data/llama2.pdf'}
----------------
We also increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). 
{'filename': 'data/llama2.pdf'}
----------------
Specifically, we performed more robust data cleaning, updated our data mixes, trained on 40% more total tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability for our larger models. 
{'filename': 'data/llama2.pdf'}
----------------
2.1 Pretraining Data

Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. 
{'filename': 'data/llama2.pdf'}
-------------
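
The offset arithmetic above can be checked on a toy example. The sketch below uses plain strings instead of real nodes, so every name in it (`document_text`, `node_start`, `sentence`) is illustrative. It assumes only that a node's text is a contiguous, unmodified slice of the document and that the citation indices are relative to the node.

```python
document_text = (
    "Intro paragraph. Llama 2 is a family of models. "
    "It was trained on public data."
)

# A "node" is a contiguous slice of the document, recorded by its start offset.
node_start = document_text.index("Llama 2")
node_text = document_text[node_start:]

# A citation is located relative to the node's own text.
sentence = "It was trained on public data."
start_char_idx = node_text.index(sentence)
end_char_idx = start_char_idx + len(sentence)

# Shift the node-relative span by the node's start to get document offsets.
document_start = node_start + start_char_idx
document_end = document_start + (end_char_idx - start_char_idx)

print(document_text[document_start:document_end])
```

The final slice recovers the cited sentence exactly. If the node text was altered during parsing or chunking, the offsets may no longer line up with the document.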

## Try a random question

If we ask a question unrelated to the data in the index, we should not get any matching citations (in most cases).

In [11]:
response = fuzzy_engine_pack.run("Where is San Francisco located?")

print(len(response.metadata.keys()))

0
