MAIN NOTEBOOK: super_agent.ipynb - Integrated language graph agent with a range of tools to extract semantic and structured insights from a knowledge graph.
- prompts/ - Contains prompts for building individual graph and vector agent executors, the combined language graph system prompt, code examples, and graph schema.
- .env - Store your OpenAI, ArangoDB, Pinecone, and LangSmith API keys.
- graph_agent.ipynb - Graph Agent to extract structured relationships from knowledge graphs.
- vector_agent.ipynb - Vector Agent to extract semantic insights from embedded data.
- ingest.py - Script to extract rich text descriptions from graph nodes and upload to Pinecone.
- requirements.txt - Python dependencies for the project.
- visualization.html - Sample graph visualization generated by the agent.
Synthea is an open-source synthetic patient dataset that models health records for fictional individuals, covering demographics, clinical history, and social factors. Generated using real-world healthcare patterns, it supports research, development, and testing of health IT systems while ensuring patient privacy.
Source: Synthea
Size: 145,514 nodes, 311,701 edges
I chose Synthea because of its:
- Rich text descriptions that enhance semantic retrieval capabilities.
- Diverse entity types that mimic real-world healthcare settings, making it an ideal testbed for graph-based analysis.
- Complex relational structures, demonstrating how NetworkX can uncover patterns that might be difficult for humans to discern.
The project is built on the following stack:
- OpenAI: Provides the LLM backend for intelligent natural language understanding and reasoning.
- ArangoDB: Stores the graph database, enabling scalable and efficient traversal of healthcare data.
- Pinecone: Supports vector-based semantic search for retrieving similar medical records.
- LangGraph: Implements the agent’s execution flow and decision-making logic.
- LangSmith: Facilitates debugging and monitoring of the agent's performance.
The agent's prompt is assembled from the following components:
- System Prompt: Defines the agent's core behavior, available tools, and constraints.
- Graph Schema: Provides details about node and edge collections in ArangoDB.
- Example-based Few-Shot Prompting: Uses a selector that retrieves relevant NetworkX examples based on semantic similarity.
- Dynamic Prompt Construction: Integrates the system prompt, schema, and relevant examples to generate the final query-specific prompt.
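The prompt assembly described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `EXAMPLES` store, the `select_examples` and `build_prompt` helpers, and the bag-of-words cosine similarity are all assumptions, with the toy similarity standing in for an embedding-based example selector.

```python
import math
import re
from collections import Counter

# Hypothetical few-shot store: (question, NetworkX snippet) pairs.
EXAMPLES = [
    ("Which patients are most connected to providers?",
     "nx.degree_centrality(G)"),
    ("What is the shortest path between two patients?",
     "nx.shortest_path(G, source, target)"),
    ("Find connected groups of related conditions",
     "nx.connected_components(G)"),
]


def _vectorize(text):
    """Toy bag-of-words vector; a real selector would use embeddings."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, k=1):
    """Return the k examples whose questions are most similar to the query."""
    q = _vectorize(query)
    ranked = sorted(EXAMPLES,
                    key=lambda ex: _cosine(q, _vectorize(ex[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(system_prompt, schema, query, k=1):
    """Combine system prompt, schema, and selected examples into one prompt."""
    shots = "\n".join(f"Q: {q}\nCode: {code}"
                      for q, code in select_examples(query, k))
    return (f"{system_prompt}\n\nSchema:\n{schema}\n\n"
            f"Examples:\n{shots}\n\nUser query: {query}")
```

Calling `build_prompt(system_prompt, schema, "shortest path between patient A and patient B")` would place the shortest-path example in the prompt, since its question shares the most terms with the query.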
The agent employs multiple tools to query and analyze the dataset:
- Vector Search Tool: Uses Pinecone to perform semantic similarity searches.
- Graph Traversal Tool: Executes AQL queries in ArangoDB for structured graph traversal.
- NetworkX Analysis Tool: Runs graph algorithms (e.g., centrality, shortest path) on a subgraph extracted from ArangoDB.
- Graph Visualization Tool: Uses PyVis to create interactive visualizations of the subgraph structure.
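As an illustration of the kind of work the NetworkX Analysis Tool does, the sketch below runs degree centrality and a shortest-path query on a tiny hand-built subgraph. The node IDs and edges are hypothetical; in the project the subgraph would be extracted from ArangoDB rather than hard-coded.

```python
import networkx as nx

# Hypothetical subgraph; node IDs are made up for illustration.
edges = [
    ("patient/1", "encounter/1"),
    ("patient/1", "encounter/2"),
    ("encounter/1", "provider/1"),
    ("encounter/2", "provider/1"),
    ("patient/2", "encounter/3"),
    ("encounter/3", "provider/1"),
]
G = nx.Graph(edges)

# Degree centrality: fraction of the other nodes each node is connected to.
centrality = nx.degree_centrality(G)
most_central = max(centrality, key=centrality.get)

# Shortest path between two patients (unweighted, so fewest hops).
path = nx.shortest_path(G, "patient/1", "patient/2")

print(most_central)
print(path)
```

Here the provider node is the most central because every encounter links to it, and the path between the two patients necessarily passes through it.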
The agent follows a ReAct (Reasoning + Acting) design using LangGraph:
- User Query Processing: The agent receives a natural language query.
- Tool Selection: The agent determines which tools to invoke based on the query context.
- Execution & Iteration: The selected tool runs, and the agent refines the results if needed.
- Final Response: The agent returns a structured answer, potentially with visualizations or data insights.
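The four steps above can be sketched as a plain Python loop. This is not the LangGraph implementation: the tool stubs, the `choose_tool` policy (a keyword rule standing in for the LLM's reasoning step), and `run_agent` are all hypothetical names used only to show the reason, act, observe cycle.

```python
from typing import Callable, Dict, List, Optional, Tuple


# Hypothetical tool stubs; the real tools would call Pinecone, ArangoDB, etc.
def vector_search(query: str) -> str:
    return f"[vector results for: {query}]"


def graph_traversal(query: str) -> str:
    return f"[AQL results for: {query}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "vector_search": vector_search,
    "graph_traversal": graph_traversal,
}


def choose_tool(query: str, observations: List[str]) -> Optional[str]:
    """Stand-in for the LLM's reasoning step: pick a tool, or None to finish."""
    if observations:  # this toy policy stops after one observation
        return None
    if "similar" in query.lower():
        return "vector_search"
    return "graph_traversal"


def run_agent(query: str, max_steps: int = 5) -> Tuple[str, List[str]]:
    """Reason, act, and observe until the policy stops or max_steps is hit."""
    observations: List[str] = []
    trace: List[str] = []
    for _ in range(max_steps):
        tool_name = choose_tool(query, observations)  # tool selection
        if tool_name is None:
            break
        observations.append(TOOLS[tool_name](query))  # execution
        trace.append(tool_name)
    answer = "Answer based on: " + "; ".join(observations)  # final response
    return answer, trace
```

The `max_steps` cap is a design choice worth keeping in any real version, since it bounds the number of tool calls if the model never decides it is done.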
The workflow includes a stateful memory system, allowing conversations to persist and enabling iterative refinements over multiple turns. Additionally, constraints are in place to ensure scalability when working with large datasets, such as sampling nodes from subgraphs for NetworkX analyses.
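The node-sampling constraint mentioned above might look something like the following sketch. The `sample_nodes` helper, the default cap of 1000 nodes, and the `seed` parameter are assumptions for illustration; the source does not state the actual limit or sampling strategy.

```python
import random


def sample_nodes(node_ids, max_nodes=1000, seed=None):
    """Return at most max_nodes node IDs, sampled uniformly without replacement.

    max_nodes=1000 is an assumed cap, not a value taken from the project.
    """
    node_ids = list(node_ids)
    if len(node_ids) <= max_nodes:
        return node_ids  # small enough: keep every node
    rng = random.Random(seed)  # local RNG so a seed gives repeatable samples
    return rng.sample(node_ids, max_nodes)
```

The sampled IDs could then be passed to NetworkX's `G.subgraph(...)` so that the expensive algorithms only see the reduced graph. Uniform sampling is the simplest option; it can distort structural measures such as centrality, so results on the sample should be read as approximate.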
Possible future improvements:
- Consider alternative agentic designs like hierarchical agents or planner-executor models for better modularity and reasoning.
- Leverage the graph state to store intermediate results, reducing how often the LLM must re-parse earlier output into tool calls.
- Scale NetworkX analyses to large graphs by implementing parallel processing, and provide more robust examples.