In [3]:
from autogen import retrieve_utils

paper = retrieve_utils.extract_text_from_pdf("data/paper.pdf")
print(paper)

GraphReader: Building Graph-based Agent to Enhance
Long-Context Abilities of Large Language Models
Shilong Li∗1, Yancheng He∗1, Hangyu Guo∗1, Xingyuan Bu∗†‡1, Ge Bai1, Jie Liu2,3,
Jiaheng Liu1, Xingwei Qu4, Yangguang Li3, Wanli Ouyang2,3, Wenbo Su1, Bo Zheng1
1Alibaba Group2The Chinese University of Hong Kong
3Shanghai AI Laboratory4University of Manchester
{zhuli.lsl, buxingyuan.bxy}@taobao.com
Abstract
Long-context capabilities are essential for large
language models (LLMs) to tackle complex
and long-input tasks. Despite numerous efforts
made to optimize LLMs for long contexts, chal-
lenges persist in robustly processing long in-
puts. In this paper, we introduce GraphReader,
a graph-based agent system designed to han-
dle long texts by structuring them into a graph
and employing an agent to explore this graph
autonomously. Upon receiving a question, the
agent first undertakes a step-by-step analysis
and devises a rational plan. It then invokes a
set of predefined functions to read n

In [4]:
import os
from autogen import ConversableAgent

agent = ConversableAgent(
    "chatbot",
    llm_config={"config_list": [{"model": "gpt-4o-mini", "api_key": os.environ.get("OPENAI_API_KEY")}]},
    code_execution_config=False,  # Disable code execution (already the default).
    function_map=None,  # Register no functions (already the default).
    human_input_mode="NEVER",  # Never ask for human input.
)



In [5]:
key_atom_prompt = """
You are now an intelligent assistant tasked with meticulously extracting both key elements and atomic facts from a long text.
1. Key Elements: The essential nouns (e.g., characters, times, events, places, numbers), verbs (e.g., actions), and adjectives (e.g., states, feelings) that are pivotal to the text’s narrative.
2. Atomic Facts: The smallest, indivisible facts, presented as concise sentences. These include propositions, theories, existences, concepts, and implicit elements like logic, causality, event sequences, interpersonal relationships, timelines, etc.

Requirements:
#####
1. Ensure that all identified key elements are reflected within the corresponding atomic facts.
2. You should extract key elements and atomic facts comprehensively, especially those that are important and potentially query-worthy and do not leave out details.
3. Whenever applicable, replace pronouns with their specific noun counterparts (e.g., change I, He, She to actual names).
4. Ensure that the key elements and atomic facts you extract are presented in the same language as the original text (e.g., English or Chinese).
5. You should output a total of key elements and atomic facts that do not exceed 1024 tokens.
6. Your answer format for each line should be: [Serial Number], [Atomic Facts], [List of Key Elements, separated with ‘|’]
#####

Example:
#####
User:
One day, a father and his little son ......

A:
1. One day, a father and his little son were going home. | father | little son | going home
2. ......
#####

Please strictly follow the above format. Let’s begin.
"""


In [6]:
reply = agent.generate_reply(messages=[{"content": paper, "role": "system"},{"content": "Find flaws in this paper's methods.", "role": "user"}])
print(reply)

Here are several potential flaws in the methods of the paper "GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models":

1. **Dependence on Graph Construction**: The effectiveness of GraphReader relies heavily on the quality and completeness of the graph constructed from the text. If key elements or atomic facts are missed during the extraction, the subsequent exploration and reasoning may lead to incomplete or incorrect answers. The method describes normalization and linking processes, but these may not always guarantee that the most relevant information is represented in the graph.

2. **Chunking Limitations**: The authors divide long texts into chunks while preserving paragraph structures. However, this approach may not always capture the necessary context for complex multi-hop questions. Key information might be divided across adjacent chunks, leading to incomplete understanding or loss of critical relationships between facts. Additionally

In [7]:
chunks = retrieve_utils.split_text_to_chunks(paper, 300, "multi_lines")

chunk_dict = {i: chunk for i, chunk in enumerate(chunks)}


max_tokens is too small to fit a single line of text. Breaking this line:
	GraphReader: Building Graph-based Agent to Enhance ...
Failed to split docs with must_break_at_empty_line being True, set to False.
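
The warning above is raised because a 300-token budget cannot always fit whole lines. As a rough illustration of line-preserving chunking, here is a minimal sketch that greedily packs consecutive lines into chunks. It is not autogen's implementation: the name `pack_lines` is hypothetical, and whitespace-separated words stand in for tokens.

```python
def pack_lines(text, max_words):
    """Greedily pack consecutive lines into chunks of at most max_words words.

    A single line longer than max_words becomes its own chunk (it is not split).
    """
    chunks, current, count = [], [], 0
    for line in text.splitlines():
        n = len(line.split())
        # Start a new chunk when adding this line would exceed the budget
        if current and count + n > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "one two three\nfour five\nsix seven eight nine\nten"
for i, chunk in enumerate(pack_lines(sample, max_words=5)):
    print(i, repr(chunk))
```

With a budget of 5 words the sample splits into two chunks, each ending on a line boundary.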


In [8]:
reply = agent.generate_reply(
    messages=[
        {"content": key_atom_prompt, "role": "system"},
        {"content": chunks[0], "role": "user"},
    ]
)
print(reply)

1. Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks. | Long-context capabilities | large language models | complex | long-input tasks  
2. Numerous efforts have been made to optimize LLMs for long contexts, yet challenges persist in robustly processing long inputs. | Numerous efforts | optimize | LLMs | long contexts | challenges | robustly processing | long inputs  
3. This paper introduces GraphReader, a graph-based agent system designed to handle long texts. | GraphReader | graph-based agent system | handle | long texts  
4. GraphReader structures long texts into a graph and employs an agent to explore this graph autonomously. | GraphReader | structures | long texts | graph | agent | explore | autonomously  
5. Upon receiving a question, the agent first undertakes a step-by-step analysis and devises a rational plan. | agent | receiving | question | undertakes | step-by-step analysis | devises | rational plan  
6. The age

In [88]:
def convert_to_dict(text):
    """Map each key element to the atomic facts that mention it.

    Expects lines of the form "<n>. <atomic fact> | <key 1> | <key 2> ...".
    """
    lines = text.strip().split("\n")
    result = {}

    for line in lines:
        try:
            parts = line.split("|")
            # Drop the leading serial number ("1. ") from the atomic fact
            sentence = parts[0].split(".", 1)[1].strip()
            keys = [key.strip() for key in parts[1:]]

            for key in keys:
                result.setdefault(key, []).append(sentence)
        except IndexError:
            # No "." before the first "|": malformed line, skip it
            continue

    return result
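
The parser above relies on `split(".", 1)` to drop the serial number, so an unnumbered line that happens to contain a period and a `|` would be parsed with part of its text cut off. A stricter variant could be sketched as follows. The name `parse_fact_lines` and the regular expression are assumptions for illustration, not part of the original notebook; this version only accepts lines that start with a serial number and skips duplicate facts per key.

```python
import re
from collections import defaultdict

# A line must start with a serial number such as "12. "
LINE_RE = re.compile(r"^\s*\d+\.\s*(.+)$")


def parse_fact_lines(text):
    """Return {key element: [atomic facts]} from "<n>. fact | key | key" lines."""
    index = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # ignore anything that is not a numbered line
        sentence, *keys = [part.strip() for part in match.group(1).split("|")]
        for key in keys:
            if key and sentence not in index[key]:
                index[key].append(sentence)
    return dict(index)


sample = """1. A father and his little son were going home. | father | little son
not a numbered line
2. The father carried a bag. | father | bag"""
print(parse_fact_lines(sample))
```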

In [12]:
converted_dict = convert_to_dict(reply)
converted_dict

{'Long-context capabilities': ['Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'large language models': ['Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'complex': ['Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'long-input tasks': ['Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'Numerous efforts': ['Numerous efforts have been made to optimize LLMs for long contexts, yet challenges persist in robustly processing long inputs.'],
 'optimize': ['Numerous efforts have been made to optimize LLMs for long contexts, yet challenges persist in robustly processing long inputs.'],
 'LLMs': ['Numerous efforts have been made to optimize LLMs for long contexts, yet challenges persist in robustly processing long inputs.'],


In [13]:
import nltk
from nltk.stem import PorterStemmer
from sklearn.cluster import DBSCAN
from sentence_transformers import SentenceTransformer
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
import re
from collections import Counter

nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\12700K\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.




In [85]:
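def frequency_filtering(tag_dict, alpha):
    """Keep tags that are attached to at least alpha distinct atomic facts."""
    # Dictionary keys are unique, so counting keys would always give 1;
    # use the number of distinct facts per tag as its frequency instead.
    return {
        tag: list(set(values))
        for tag, values in tag_dict.items()
        if len(set(values)) >= alpha
    }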


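def rule_aggregation(tag_dict):
    """Merge tags that coincide after lowercasing, punctuation removal, and stemming.

    The whole tag is stemmed as one string, so only the end of a multi-word
    tag changes (e.g. "machine-learning" becomes "machine learn").
    """
    ps = PorterStemmer()
    new_dict = {}
    for tag, values in tag_dict.items():
        # Replace every non-alphanumeric character with a space, then stem
        new_tag = ps.stem(re.sub(r"[^a-zA-Z0-9\s]", " ", tag.lower()))
        if new_tag in new_dict:
            new_dict[new_tag].extend(values)
            new_dict[new_tag] = list(set(new_dict[new_tag]))
        else:
            new_dict[new_tag] = values.copy()
    return new_dict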


def semantic_aggregation(tag_dict, threshold):
    """Cluster tags based on semantic similarity using DBSCAN."""
    model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
    tags = list(tag_dict.keys())
    embeddings = model.encode(tags)
    clustering = DBSCAN(eps=threshold, min_samples=1).fit(embeddings)
    new_dict = {}
    for i, label in enumerate(clustering.labels_):
        if label == -1:
            new_dict[tags[i]] = tag_dict[tags[i]]
        else:
            cluster_tags = [
                tags[j] for j, l in enumerate(clustering.labels_) if l == label
            ]
            new_key = min(cluster_tags, key=len)
            if new_key not in new_dict:
                new_dict[new_key] = []
            for tag in cluster_tags:
                new_dict[new_key].extend(tag_dict[tag])
            new_dict[new_key] = list(set(new_dict[new_key]))

    return new_dict


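def association_aggregation(tag_dict, min_support):
    """Merge associated tags using the FP-Growth algorithm.

    All tags form a single transaction, so every itemset has support 1.0.
    With min_support above 1 (aggregate_tags defaults to 1.2) no itemset
    qualifies and the tags are returned unmerged.
    """
    remaining = dict(tag_dict)  # work on a copy so the caller's dict is not mutated
    tags = [list(remaining.keys())]
    # get_dummies names each column "<position>_<tag>"
    te = pd.get_dummies(pd.DataFrame(tags)).astype(bool)
    frequent_itemsets = fpgrowth(te, min_support=min_support, use_colnames=True)

    new_dict = {}
    for _, row in frequent_itemsets.iterrows():
        if len(row["itemsets"]) > 1:
            # Strip the "<position>_" prefix to recover the original tags
            merged = sorted(col.split("_", 1)[1] for col in row["itemsets"])
            new_key = " ".join(merged)
            new_dict[new_key] = []
            for tag in merged:
                new_dict[new_key].extend(remaining.pop(tag, []))
            new_dict[new_key] = list(set(new_dict[new_key]))
    new_dict.update(remaining)  # Add remaining unmerged tags
    return new_dict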


def aggregate_tags(
    tag_dict, alpha=1, semantic_threshold=5, association_min_support=1.2
):
    """Aggregate tags using all methods."""
    filtered_dict = frequency_filtering(tag_dict, alpha)
    aggregated_dict = rule_aggregation(filtered_dict)
    semantically_aggregated_dict = semantic_aggregation(
        aggregated_dict, semantic_threshold
    )
    final_dict = association_aggregation(
        semantically_aggregated_dict, association_min_support
    )
    return final_dict


# Example usage
original_dict = {
    "Python": ["Python is a programming language"],
    "python": ["pythons are cool snakes"],
    "Machine Learning": ["ML is a subset of AI"],
    "machine-learning": ["machine-learning algorithms learn from data"],
    "Data Science": ["Data Science involves analyzing data"],
    "data_science": ["data_science uses statistical methods"],
    "AI": ["AI stands for Artificial Intelligence"],
    "Artificial Intelligence": ["Artificial Intelligence mimics human intelligence"],
}
result = aggregate_tags(original_dict)
for key, value in result.items():
    print(f"{key}: {value}")

python: ['Python is a programming language', 'pythons are cool snakes']
machine learn: ['ML is a subset of AI', 'machine-learning algorithms learn from data']
data sci: ['data_science uses statistical methods', 'Data Science involves analyzing data']
ai: ['AI stands for Artificial Intelligence']
artificial intellig: ['Artificial Intelligence mimics human intelligence']
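
Because `semantic_aggregation` calls DBSCAN with `min_samples=1`, every tag counts as a core point. No tag is ever labeled noise (`-1`), and the clusters are the connected components of the graph that links tags whose embeddings lie within `eps` of each other. The sketch below reproduces that behavior with a small union-find. The name `connected_components` and the 2-D toy points are assumptions standing in for the sentence embeddings.

```python
import math


def connected_components(points, eps):
    """Label points so that any two points within eps share a label (transitively)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(j)] = find(i)

    # Renumber roots as 0, 1, 2, ... in order of first appearance
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


toy = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
print(connected_components(toy, eps=0.6))  # [0, 0, 0, 1]
```

The first and third points are 1.0 apart, farther than `eps`, yet they share a label because the second point chains them together. The same chaining can merge loosely related tags when the threshold is large.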




In [82]:
aggregate_tags(converted_dict)



{'long context cap': ['Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'large language model': ['Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'complex': ['Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'long input task': ['Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'numerous effort': ['Numerous efforts have been made to optimize LLMs for long contexts, yet challenges persist in robustly processing long inputs.'],
 'optim': ['Numerous efforts have been made to optimize LLMs for long contexts, yet challenges persist in robustly processing long inputs.'],
 'llm': ['Numerous efforts have been made to optimize LLMs for long contexts, yet challenges persist in robustly processing long inputs.'],
 'long context':

In [86]:
def extend_and_merge_dict(original_dict, new_dict):
    """
    Extends the original dictionary with a new dictionary,
    merging and deduplicating value lists for common keys.

    :param original_dict: The dictionary to be extended
    :param new_dict: The dictionary to extend with
    :return: The extended and merged dictionary
    """
    for key, new_values in new_dict.items():
        if key in original_dict:
            # Merge lists and remove duplicates
            original_dict[key] = list(set(original_dict[key] + new_values))
        else:
            # Add new key-value pair
            original_dict[key] = new_values

    return original_dict


# Example usage
dict1 = {
    "Python": ["is a programming language", "is used for data science"],
    "AI": ["stands for Artificial Intelligence"],
}

dict2 = {
    "Python": ["is popular", "is used for data science"],
    "Machine Learning": ["is a subset of AI"],
}

result = extend_and_merge_dict(dict1, dict2)
for key, value in result.items():
    print(f"{key}: {value}")

Python: ['is popular', 'is a programming language', 'is used for data science']
AI: ['stands for Artificial Intelligence']
Machine Learning: ['is a subset of AI']
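
Note that `extend_and_merge_dict` modifies `original_dict` in place, and deduplicating through `set` does not preserve the order of the facts. If a stable order matters, a non-mutating variant could look like this sketch. The name `merged_copy` is hypothetical; `dict.fromkeys` is used to drop duplicates while keeping first-seen order.

```python
def merged_copy(base, extra):
    """Return a new dict merging extra into base, deduplicating in first-seen order."""
    merged = {key: list(values) for key, values in base.items()}
    for key, values in extra.items():
        combined = merged.get(key, []) + list(values)
        merged[key] = list(dict.fromkeys(combined))  # order-preserving dedup
    return merged


a = {"Python": ["is a programming language", "is used for data science"]}
b = {
    "Python": ["is popular", "is used for data science"],
    "AI": ["stands for Artificial Intelligence"],
}
print(merged_copy(a, b))
```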


In [89]:
greader_dict = {}
for chunk in chunks:
    reply = agent.generate_reply(
        messages=[
            {"content": key_atom_prompt, "role": "system"},
            {"content": chunk, "role": "user"},
        ]
    )
    greader_dict = extend_and_merge_dict(greader_dict, convert_to_dict(reply))

  return datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S.%f")

In [91]:
len(greader_dict)

1774
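
With 1,774 key elements, it helps to see how many atomic facts each one carries before choosing a frequency threshold such as `alpha`. A minimal sketch, assuming a dictionary shaped like `greader_dict`; the helper names and the toy data are hypothetical.

```python
from collections import Counter


def facts_per_tag_histogram(tag_dict):
    """Count how many tags have 1, 2, 3, ... distinct atomic facts."""
    return Counter(len(set(facts)) for facts in tag_dict.values())


def top_tags(tag_dict, n):
    """Return the n tags with the most distinct facts (ties broken alphabetically)."""
    return sorted(tag_dict, key=lambda t: (-len(set(tag_dict[t])), t))[:n]


toy_dict = {
    "llm": ["fact 1", "fact 2", "fact 3"],
    "graph": ["fact 1"],
    "agent": ["fact 2", "fact 2"],
}
print(facts_per_tag_histogram(toy_dict))
print(top_tags(toy_dict, 2))
```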

In [94]:
agg_greader = aggregate_tags(greader_dict)
print(len(agg_greader))
agg_greader



998


{'long context cap': ['GraphReader establishes a scalable long-context capability based on a 4k context window.',
  'This paper introduces GraphReader, a graph-based agent designed to enhance the long-context capabilities of large language models.',
  'The main contributions are threefold: the introduction of GraphReader, the establishment of a scalable long-context capability, and the demonstration of performance.',
  'Recent efforts by Chen et al., Ding et al., and Peng have focused on positional interpolation to enhance long-context capabilities.',
  'Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks.'],
 'language model': ['Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia published a paper in 2023 titled "Longlora: Efficient fine-tuning of long-context large language models".',
  'Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian published a paper in 2023 titled "Extendi