# RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

In [2]:
# NOTE: An OpenAI API key must be set here for application initialization, even if not in use.
# If you're not utilizing OpenAI models, assign a placeholder string (e.g., "not_used").
import os
os.environ["OPENAI_API_KEY"] = "your_key"

In [5]:
# Load the financial summary text used to build the tree
with open('finQA/output_summary_text2.txt', 'r') as file:
    text = file.read()

print(text[:100])

Analog Devices, a company in the Tech & Electronics sector, provided financial information for the y


1) **Building**: RAPTOR recursively embeds, clusters, and summarizes chunks of text to construct a tree with varying levels of summarization from the bottom up. You can create a tree from the text loaded above using `RA.add_documents(text)`.

2) **Querying**: At inference time, the RAPTOR model retrieves information from this tree, integrating data across lengthy documents at different abstraction levels. You can perform queries on the tree with `RA.answer_question`.
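
To make the building step concrete, here is a toy sketch of the embed, cluster, summarize loop. It is not RAPTOR's implementation: the bag-of-words `embed`, the greedy `cluster`, and the truncating `summarize` are hypothetical stand-ins for the embedding model, the clustering algorithm, and the LLM summarizer, chosen only so the example runs without any API key.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts (stand-in for a real embedding model).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(nodes, size=2):
    # Toy clustering: group each seed node with its most similar unassigned neighbors.
    remaining = list(nodes)
    groups = []
    while remaining:
        seed = remaining.pop(0)
        remaining.sort(key=lambda n: cosine(embed(seed), embed(n)), reverse=True)
        groups.append([seed] + remaining[: size - 1])
        remaining = remaining[size - 1:]
    return groups

def summarize(texts, max_words=12):
    # Toy summarizer: keep the first words of the concatenation (stand-in for an LLM call).
    return " ".join(" ".join(texts).split()[:max_words])

def build_tree(chunks, max_layers=5):
    # Layer 0 holds the leaf chunks; each new layer summarizes clusters of the previous one.
    layers = [list(chunks)]
    while len(layers[-1]) > 1 and len(layers) <= max_layers:
        groups = cluster(layers[-1])
        layers.append([summarize(group) for group in groups])
    return layers

chunks = [
    "revenue grew in 2012",
    "revenue grew again in 2013",
    "the company hired engineers",
    "the company opened offices",
]
for depth, nodes in enumerate(build_tree(chunks)):
    print(depth, nodes)
```

Each pass roughly halves the number of nodes, so the loop stops once a single root summary remains or the layer limit is reached.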

### Building the tree

In [6]:
from raptor import RetrievalAugmentation, RetrievalAugmentationConfig

2024-05-22 10:56:07,693 - Loading faiss.
2024-05-22 10:56:07,717 - Successfully loaded faiss.


In [7]:
from raptor.QAModels import GPT4oQAModel
from raptor.SummarizationModels import GPT4oSummarizationModel

custom_qa = GPT4oQAModel()
custom_summarizer = GPT4oSummarizationModel()

custom_config = RetrievalAugmentationConfig(
    summarization_model=custom_summarizer,
    qa_model=custom_qa
)

In [8]:
RA = RetrievalAugmentation(config=custom_config)

# construct the tree
RA.add_documents(text)

2024-05-22 10:56:11,724 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT4oSummarizationModel object at 0x7fb87d1696d0>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x7fb8430d37f0>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-05-22 10:56:11,725 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Sel

### Querying from the tree

```python
question = "Your question here"  # any question
RA.answer_question(question)
```
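
The query logs in this notebook report `Using collapsed_tree`, meaning nodes from all layers are considered together rather than layer by layer. The following is a toy sketch of that idea, not RAPTOR's code: `embed`, `cosine`, and `retrieve_collapsed` are hypothetical helpers, word counts stand in for token counts, and the tree is represented as a plain list of layers.

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts (stand-in for a real embedding model).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_collapsed(layers, query, max_words=20):
    # Flatten every layer (leaves and summaries) into one candidate pool.
    pool = [node for layer in layers for node in layer]
    query_vec = embed(query)
    ranked = sorted(pool, key=lambda n: cosine(query_vec, embed(n)), reverse=True)
    # Take the best-matching nodes until the word budget would be exceeded.
    selected, used = [], 0
    for node in ranked:
        cost = len(node.split())
        if used + cost > max_words:
            break
        selected.append(node)
        used += cost
    return selected

layers = [
    ["apple revenue was high in 2012", "cinderella lost a glass slipper"],
    ["summary about apple revenue and a fairy tale"],
]
print(retrieve_collapsed(layers, "apple revenue 2012", max_words=14))
```

Because leaves and summaries compete in the same ranking, a query can pull in a detailed leaf and a higher-level summary at the same time.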

In [9]:
question = "What is Apple?"
answer = RA.answer_question(question=question)

print("Answer: ", answer)

2024-05-22 13:28:53,317 - Using collapsed_tree
2024-05-22 13:28:53,771 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-22 13:29:12,217 - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Answer:  Apple Inc. is a leading multinational technology company headquartered in Cupertino, California, known for its innovative products and services in the Tech & Electronics sector. Founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976, Apple has grown to become one of the most valuable and influential companies in the world.

### Key Aspects of Apple Inc.:

1. **Product Portfolio**:
   - **Hardware**: Apple designs, manufactures, and markets a range of consumer electronics, including the iPhone (smartphones), iPad (tablets), Mac (personal computers), Apple Watch (smartwatches), and Apple TV (digital media players).
   - **Software**: The company develops its own operating systems, such as iOS for mobile devices, macOS for computers, watchOS for smartwatches, and tvOS for Apple TV. It also offers a suite of productivity and creativity software, including iWork and iLife.
   - **Services**: Apple provides various services, including the App Store, Apple Music, iCloud, Appl

In [10]:
# Save the tree by calling RA.save("path/to/save")
SAVE_PATH = "finQA/GPT4o_full_2"
RA.save(SAVE_PATH)

2024-05-22 13:29:24,488 - Tree successfully saved to finQA/GPT4o_full_2


In [11]:
# load back the tree by passing it into RetrievalAugmentation

RA = RetrievalAugmentation(tree=SAVE_PATH)

question = 'what is apples net revenue in 2012?'

answer = RA.answer_question(question=question)
print("Answer: ", answer)

2024-05-22 13:29:49,333 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT3TurboSummarizationModel object at 0x7fb7defee490>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x7fb7defee9d0>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-05-22 13:29:49,337 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
           

Answer:  Apple's net revenue in 2012 was $156.5 billion.


In [12]:
import json

# Function to load JSON file
def load_json(filepath):
    with open(filepath, 'r') as file:
        return json.load(file)

# Function to save JSON file
def save_json(data, filepath):
    with open(filepath, 'w') as file:
        json.dump(data, file, indent=2)

# Function to answer questions using RA model
def generate_answers(data, ra_model):
    new_data = []
    for entry in data:
        question = entry["question"]
        model_answer = ra_model.answer_question(question=question)
        new_entry = {
            "question": question,
            "answer": model_answer
        }
        new_data.append(new_entry)
    return new_data

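# Function to answer questions while restricting retrieval to a single tree layer
def generate_rag_answers(data, ra_model, start_layer):
    new_data = []
    for entry in data:
        question = entry["question"]
        # collapse_tree defaults to True in RAPTOR, in which case start_layer and
        # num_layers are not used for retrieval; disable it so the restriction applies.
        model_answer = ra_model.answer_question(question=question,
                                                start_layer=start_layer,
                                                num_layers=1,
                                                collapse_tree=False)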
        new_entry = {
            "question": question,
            "answer": model_answer
        }
        new_data.append(new_entry)
    return new_data

In [13]:
# Path to your original JSON file
json_file_path = 'evaluations/ground_truth_1.json'
# Path to save the new JSON file
new_json_file_path = 'evaluations/raptor_output_2.json'

# Load the JSON data
data = load_json(json_file_path)

RA = RetrievalAugmentation(tree=SAVE_PATH, config=custom_config)

new_data = generate_answers(data, RA)

# Save the new questions and answers to a new JSON file
save_json(new_data, new_json_file_path)

print(f"New JSON with model answers saved to {new_json_file_path}")

2024-05-22 13:30:44,898 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT4oSummarizationModel object at 0x7fb87d1696d0>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x7fb8430d37f0>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-05-22 13:30:44,900 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Sel

New JSON with model answers saved to evaluations/raptor_output_2.json
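
Once the answers are saved, they can be compared against the ground truth. The sketch below is one simple way to do that, not the evaluation used for this notebook. It assumes each ground-truth entry has an `"answer"` field (only `"question"` is shown above) and counts a prediction as correct when the normalized reference string appears inside it. `normalize` and `containment_accuracy` are hypothetical helper names, and the sample data is made up.

```python
import re

def normalize(text):
    # Lowercase and collapse everything except letters, digits, and dots into single spaces.
    return re.sub(r"[^a-z0-9.]+", " ", text.lower()).strip()

def containment_accuracy(ground_truth, predictions):
    # Pair predictions with references by question text.
    reference = {entry["question"]: entry["answer"] for entry in ground_truth}
    hits = 0
    for entry in predictions:
        gold = reference.get(entry["question"])
        if gold is not None and normalize(str(gold)) in normalize(entry["answer"]):
            hits += 1
    return hits / len(predictions) if predictions else 0.0

# Made-up entries for illustration only.
ground_truth = [
    {"question": "q1", "answer": "$156.5 billion"},
    {"question": "q2", "answer": "42"},
]
predictions = [
    {"question": "q1", "answer": "Net revenue was $156.5 billion."},
    {"question": "q2", "answer": "unknown"},
]
print(containment_accuracy(ground_truth, predictions))
```

A containment check is lenient toward long free-form answers, so it is best read as a rough signal rather than a strict accuracy measure.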


In [15]:
# Path to your original JSON file
json_file_path = 'evaluations/ground_truth_1.json'
# Path to save the new JSON file
new_json_file_path = 'evaluations/rag4_output_2.json'

# Load the JSON data
data = load_json(json_file_path)

# Reload the tree; the answers below request a single layer (start_layer=4, num_layers=1)
RA = RetrievalAugmentation(tree=SAVE_PATH, config=custom_config)

new_data = generate_rag_answers(data, RA, start_layer=4)

# Save the new questions and answers to a new JSON file
save_json(new_data, new_json_file_path)

print(f"New JSON with model answers saved to {new_json_file_path}")

2024-05-22 13:40:55,733 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.GPT4oSummarizationModel object at 0x7fb87d1696d0>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x7fb8430d37f0>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: RAPTOR_Clustering
        Clustering Parameters: {}
        
2024-05-22 13:40:55,737 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: <Encoding 'cl100k_base'>
            Max Tokens: 100
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Sel

New JSON with model answers saved to evaluations/rag4_output_2.json


## Using other Open Source Models for Summarization/QA/Embeddings

To use other models such as Llama or Mistral, define your own model classes by extending RAPTOR's base classes and pass them in through the config.

In [None]:
import torch
from raptor import BaseSummarizationModel, BaseQAModel, BaseEmbeddingModel, RetrievalAugmentationConfig
from transformers import AutoTokenizer, pipeline

In [None]:
# To use Gemma, you need to authenticate with Hugging Face. Skip this step if the model is already downloaded.
from huggingface_hub import login
login()

In [None]:
from transformers import AutoTokenizer, pipeline
import torch

# You can define your own Summarization model by extending the base Summarization Class. 
class GEMMASummarizationModel(BaseSummarizationModel):
    def __init__(self, model_name="google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the GEMMA model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.summarization_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),  # Use "cpu" if CUDA is not available
        )

    def summarize(self, context, max_tokens=150):
        # Format the prompt for summarization
        messages=[
            {"role": "user", "content": f"Write a summary of the following, including as many key details as possible: {context}:"}
        ]
        
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        
        # Generate the summary using the pipeline
        outputs = self.summarization_pipeline(
            prompt,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )
        
        # Extract the generated summary; the text-generation pipeline echoes the prompt by default, so drop it
        summary = outputs[0]["generated_text"][len(prompt):].strip()
        return summary


In [None]:
class GEMMAQAModel(BaseQAModel):
    def __init__(self, model_name="google/gemma-2b-it"):
        # Initialize the tokenizer and the pipeline for the model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.qa_pipeline = pipeline(
            "text-generation",
            model=model_name,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
        )

    def answer_question(self, context, question):
        # Apply the chat template for the context and question
        messages=[
              {"role": "user", "content": f"Given Context: {context} Give the best full answer amongst the option to question {question}"}
        ]
        prompt = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        
        # Generate the answer using the pipeline
        outputs = self.qa_pipeline(
            prompt,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.95
        )
        
        # Extracting and returning the generated answer
        answer = outputs[0]["generated_text"][len(prompt):]
        return answer

In [None]:
from sentence_transformers import SentenceTransformer
class SBertEmbeddingModel(BaseEmbeddingModel):
    def __init__(self, model_name="sentence-transformers/multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        return self.model.encode(text)


In [None]:
RAC = RetrievalAugmentationConfig(
    summarization_model=GEMMASummarizationModel(),
    qa_model=GEMMAQAModel(),
    embedding_model=SBertEmbeddingModel(),
)

In [None]:
RA = RetrievalAugmentation(config=RAC)

In [None]:
with open('demo/sample.txt', 'r') as file:
    text = file.read()
    
RA.add_documents(text)

In [None]:
question = "How did Cinderella reach her happy ending?"

answer = RA.answer_question(question=question)

print("Answer: ", answer)