# RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

In [1]:
# NOTE: An OpenAI API key must be set here for application initialization, even if not in use.
# If you're not utilizing OpenAI models, assign a placeholder string (e.g., "not_used").
import os
os.environ["OPENAI_API_KEY"] = "your_key"

In [2]:
# Load the financial summary text from finQA/output_summary_text.txt
with open('finQA/output_summary_text.txt', 'r') as file:
    text = file.read()

print(text[:100])

The financial dataset highlights the company's management of interest rates and foreign currency exp


1) **Building**: RAPTOR recursively embeds, clusters, and summarizes chunks of text to construct a tree with varying levels of summarization from the bottom up. You can create a tree from the loaded text using `RA.add_documents(text)`.

2) **Querying**: At inference time, the RAPTOR model retrieves information from this tree, integrating data across lengthy documents at different abstraction levels. You can perform queries on the tree with `RA.answer_question`.
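The bottom-up construction described above can be illustrated with a small toy sketch. This is not the library's implementation: `Node`, `toy_cluster`, `toy_summarize`, and `build_tree` are hypothetical names, the clustering step is replaced by fixed-size grouping of neighbours, and the summarizer simply truncates text instead of calling a language model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    text: str
    layer: int
    children: List["Node"] = field(default_factory=list)


def toy_summarize(texts: List[str], max_words: int = 8) -> str:
    # Stand-in for an LLM summarizer (assumption): keep the first few words.
    words = " ".join(texts).split()
    return " ".join(words[:max_words])


def toy_cluster(nodes: List[Node], size: int = 2) -> List[List[Node]]:
    # Stand-in for embedding-based clustering (assumption): fixed-size groups.
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def build_tree(chunks: List[str], num_layers: int = 5) -> List[List[Node]]:
    # Layer 0 holds the raw chunks; each higher layer summarizes clusters
    # of the layer below until one node remains or num_layers is reached.
    layers = [[Node(text=chunk, layer=0) for chunk in chunks]]
    for depth in range(1, num_layers + 1):
        current = layers[-1]
        if len(current) <= 1:
            break
        parents = [
            Node(
                text=toy_summarize([node.text for node in group]),
                layer=depth,
                children=group,
            )
            for group in toy_cluster(current)
        ]
        layers.append(parents)
    return layers


layers = build_tree(["a b", "c d", "e f", "g h"])
print([len(layer) for layer in layers])
```

Each layer is smaller than the one below it, so the top of the tree holds the most abstract summaries and the bottom holds the original chunks.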

### Building the tree

In [3]:
from raptor import RetrievalAugmentation, RetrievalAugmentationConfig

2024-05-21 16:37:09,804 - Loading faiss.
2024-05-21 16:37:09,828 - Successfully loaded faiss.


In [4]:
from raptor.QAModels import GPT4oQAModel
from raptor.SummarizationModels import GPT4oSummarizationModel

custom_qa = GPT4oQAModel()
custom_summarizer = GPT4oSummarizationModel()

custom_config = RetrievalAugmentationConfig(
    summarization_model=custom_summarizer,
    qa_model=custom_qa
)

In [5]:
RA = RetrievalAugmentation(config=custom_config)

# construct the tree
RA.add_documents(text)

2024-05-21 11:24:04,624 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT4oSummarizationModel object at 0x7ff1be4786d0>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x7ff1c0c267f0>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-05-21 11:24:04,626 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Sel

### Querying from the tree

```python
question = "your question here"  # any question string
RA.answer_question(question)
```
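The query logs in this notebook report `Using collapsed_tree`. As a rough illustration of that idea, the sketch below flattens nodes from every layer into one pool, ranks them by cosine similarity to the query, and keeps the best ones until a token budget is exhausted. The function names, the whitespace-based token count, and the budget value are assumptions made for the example, not the library's code.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def collapsed_tree_retrieve(
    query_vec: Sequence[float],
    nodes: List[Tuple[str, Sequence[float]]],
    max_tokens: int = 50,  # hypothetical budget for this sketch
) -> List[str]:
    # `nodes` mixes leaf chunks and summaries from all layers of the tree.
    ranked = sorted(
        nodes,
        key=lambda node: cosine_similarity(query_vec, node[1]),
        reverse=True,
    )
    selected, used = [], 0
    for text, _ in ranked:
        cost = len(text.split())  # crude token count (assumption)
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return selected


nodes = [
    ("leaf about tax", [1.0, 0.0]),
    ("summary about growth", [0.0, 1.0]),
    ("leaf about growth returns", [0.1, 0.9]),
]
print(collapsed_tree_retrieve([0.0, 1.0], nodes, max_tokens=7))
```

Because summaries and leaves compete in the same ranking, a broad question can be answered from a high-level summary while a specific one can still pull in a detailed leaf chunk.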

In [6]:
question = "What is Apple?"
answer = RA.answer_question(question=question)

print("Answer: ", answer)

2024-05-21 13:01:30,478 - Using collapsed_tree
2024-05-21 13:01:30,732 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-21 13:01:39,490 - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Answer:  Apple Inc. is a multinational technology company known for its strong market presence and superior return on investments. The company has demonstrated sustained growth and significantly outperformed major market indices over various five-year periods. For instance, from September 30, 2006, to September 30, 2011, an initial $100 investment in Apple grew to $495, vastly outpacing the S&P 500, S&P Computer Hardware Index, and Dow Jones U.S. Technology Index. Similarly, from 2008 to 2013, a $100 investment in Apple grew to $431, compared to the S&P 500 Index at $161, the S&P Computer Hardware Index at $197, and the Dow Jones U.S. Technology Supersector Index at $175.

Apple's financial data as of September 24, 2011, revealed gross unrecognized tax benefits totaling $1.4 billion, with $563 million potentially impacting the effective tax rate if acknowledged. By September 28, 2013, Apple had significant financial obligations, including $4.7 billion in future minimum payments for fac

In [5]:
# Save the tree by calling RA.save("path/to/save")
SAVE_PATH = "finQA/GPT4o_full"
# RA.save(SAVE_PATH)

In [32]:
# load back the tree by passing it into RetrievalAugmentation

RA = RetrievalAugmentation(tree=SAVE_PATH)

answer = RA.answer_question(question=question)
print("Answer: ", answer)

2024-05-21 15:21:14,539 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT3TurboSummarizationModel object at 0x7ff04ff0eee0>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x7ff04ff0ee50>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
      

Answer:  In 2009, Apple's R&D expenses increased by 20% compared to the previous year. The company spent a total of $1.8 billion on research and development activities. This rise was primarily driven by increased headcount and software development costs. Despite the increase in R&D spending, the expenses decreased as a percentage of net sales due to a 52% rise in overall sales.

In 2010, Apple's R&D expenses continued to grow, with a 34% increase compared to the previous year. The company invested $1.8 billion in research and development, mainly to support expanded activities and increased headcount. Similar to the previous year, the R&D costs decreased as a percentage of net sales, this time due to a 52% rise in net sales.

Therefore, from 2009 to 2010, Apple's R&D expenses increased by 34% in dollar amount, reaching $1.8 billion, while the expenses as a percentage of net sales decreased due to the significant growth in sales.


In [6]:
import json
import pickle

# Function to load JSON file
def load_json(filepath):
    with open(filepath, 'r') as file:
        return json.load(file)

# Function to save JSON file
def save_json(data, filepath):
    with open(filepath, 'w') as file:
        json.dump(data, file, indent=2)

# Function to answer questions using RA model
def generate_answers(data, ra_model):
    new_data = []
    for entry in data:
        question = entry["question"]
        model_answer = ra_model.answer_question(question=question)
        new_entry = {
            "question": question,
            "answer": model_answer
        }
        new_data.append(new_entry)
    return new_data

# Function to answer questions using a single tree layer (start_layer, num_layers=1)
def generate_rag_answers(data, ra_model, start_layer):
    new_data = []
    for entry in data:
        question = entry["question"]
        model_answer = ra_model.answer_question(question=question,
                                                start_layer=start_layer, 
                                                num_layers=1)
        new_entry = {
            "question": question,
            "answer": model_answer
        }
        new_data.append(new_entry)
    return new_data
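The two helpers above differ only in the keyword arguments they forward to `answer_question`. One possible way to fold them into a single function is sketched below. `answer_all` and `EchoModel` are hypothetical names, and `EchoModel` is only a stand-in so the helper can be exercised without API calls.

```python
from typing import Any, Dict, List


def answer_all(data: List[Dict[str, Any]], ra_model: Any, **retrieval_kwargs: Any) -> List[Dict[str, Any]]:
    # Forward any extra retrieval options (e.g. start_layer, num_layers)
    # unchanged to the model's answer_question method.
    return [
        {
            "question": entry["question"],
            "answer": ra_model.answer_question(
                question=entry["question"], **retrieval_kwargs
            ),
        }
        for entry in data
    ]


class EchoModel:
    """Hypothetical stand-in that echoes the question and the options it received."""

    def answer_question(self, question: str, **kwargs: Any) -> str:
        return f"{question} | {sorted(kwargs.items())}"


print(answer_all([{"question": "q1"}], EchoModel(), start_layer=4, num_layers=1))
```

With this shape, `answer_all(data, RA)` would play the role of `generate_answers`, and `answer_all(data, RA, start_layer=4, num_layers=1)` the role of `generate_rag_answers`.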

In [7]:
# Path to your original JSON file
json_file_path = 'evaluations/ground_truth_1.json'
# Path to save the new JSON file
new_json_file_path = 'evaluations/raptor_output_1.json'

# Load the JSON data
data = load_json(json_file_path)

RA = RetrievalAugmentation(tree=SAVE_PATH, config=custom_config)

new_data = generate_answers(data, RA)

# Save the new questions and answers to a new JSON file
save_json(new_data, new_json_file_path)

print(f"New JSON with model answers saved to {new_json_file_path}")

2024-05-21 16:37:28,659 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT4oSummarizationModel object at 0x7f80864f5850>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x7f80a32197c0>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-05-21 16:37:28,660 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Sel

New JSON with model answers saved to evaluations/raptor_output_1.json


In [9]:
# Path to your original JSON file
json_file_path = 'evaluations/ground_truth_1.json'
# Path to save the new JSON file
new_json_file_path = 'evaluations/rag4_output_1.json'

# Load the JSON data
data = load_json(json_file_path)

# Restrict retrieval to a single layer (start_layer=4, num_layers=1)
RA = RetrievalAugmentation(tree=SAVE_PATH, config=custom_config)

new_data = generate_rag_answers(data, RA, start_layer=4)

# Save the new questions and answers to a new JSON file
save_json(new_data, new_json_file_path)

print(f"New JSON with model answers saved to {new_json_file_path}")

2024-05-21 16:42:49,722 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT4oSummarizationModel object at 0x7f80864f5850>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x7f80a32197c0>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-05-21 16:42:49,723 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Sel

New JSON with model answers saved to evaluations/rag4_output_1.json


## Using other Open Source Models for Summarization/QA/Embeddings

To use other models such as Llama or Mistral, define your own model classes by extending RAPTOR's base classes and pass them to `RetrievalAugmentationConfig`.

In [None]:
import torch
from raptor import BaseSummarizationModel, BaseQAModel, BaseEmbeddingModel, RetrievalAugmentationConfig
from transformers import AutoTokenizer, pipeline

In [None]:
# To use Gemma, you need to authenticate with Hugging Face. Skip this step if the model is already downloaded.
from huggingface_hub import login
login()

In [None]:
from transformers import AutoTokenizer, pipeline
import torch

# You can define your own Summarization model by extending the base Summarization Class. 
class GEMMASummarizationModel(BaseSummarizationModel):
    def __init__(self, model_name="google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the GEMMA model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.summarization_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),  # Use "cpu" if CUDA is not available
        )

    def summarize(self, context, max_tokens=150):
        # Format the prompt for summarization
        messages=[
            {"role": "user", "content": f"Write a summary of the following, including as many key details as possible: {context}:"}
        ]
        
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        
        # Generate the summary using the pipeline
        outputs = self.summarization_pipeline(
            prompt,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )
        
        # The pipeline output starts with the prompt, so drop it and return only the summary
        summary = outputs[0]["generated_text"][len(prompt):].strip()
        return summary


In [None]:
class GEMMAQAModel(BaseQAModel):
    def __init__(self, model_name="google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.qa_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
        )

    def answer_question(self, context, question):
        # Apply the chat template for the context and question
        messages=[
              {"role": "user", "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}"}
        ]
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        
        # Generate the answer using the pipeline
        outputs = self.qa_pipeline(
            prompt,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )
        
        # Extracting and returning the generated answer
        answer = outputs[0]["generated_text"][len(prompt):]
        return answer

In [None]:
from sentence_transformers import SentenceTransformer
class SBertEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        return self.model.encode(text)


In [None]:
RAC = RetrievalAugmentationConfig(summarization_model=GEMMASummarizationModel(), qa_model=GEMMAQAModel(), embedding_model=SBertEmbeddingModel())

In [None]:
RA = RetrievalAugmentation(config=RAC)

In [None]:
with open('demo/sample.txt', 'r') as file:
    text = file.read()
    
RA.add_documents(text)

In [None]:
question = "How did Cinderella reach her happy ending?"

answer = RA.answer_question(question=question)

print("Answer: ", answer)