# Enterprise-Scale Bitcoin Data Knowledge Graph with LlamaIndex

This notebook demonstrates the following:
1. Ingesting raw Bitcoin blocks, economic indicators, and on-chain metrics
2. Building a Knowledge Graph in LlamaIndex with a Neo4j Graph Store
3. Intelligent querying using LlamaIndex Agents

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import logging
from llamaindex_utils import (
    ingest_raw_block_data, 
    ingest_onchain_metrics, 
    ingest_economic_indicators,
    get_raw_block_data,
    get_onchain_metrics,
    get_economic_indicators)
from datetime import timedelta, datetime
from utils.triplets import TripletGenerator
from dotenv import load_dotenv
import os
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llamaindex_utils import get_neo4j_graph_store
from llama_index.core import PropertyGraphIndex
from llamaindex_utils import LlamaAgents
import nest_asyncio

# to run async code in notebook
nest_asyncio.apply()

# Configure logging
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
logging.basicConfig(level=logging.INFO, format='%(levelname)s - %(message)s')
logger = logging.getLogger(__name__) 

## Setup
To begin, we configure two LlamaIndex-specific settings:
1. LLM - OpenAI gpt-4.1-mini
2. Embedding Model - all-MiniLM-L6-v2

You can swap in different models as needed.

In [3]:
load_dotenv("devops/env/default.env")
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
llm = OpenAI(model="gpt-4.1-mini", 
             temperature=0,
             api_key=OPENAI_API_KEY)
Settings.llm = llm
Settings.embed_model = embed_model

INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
INFO - 2 prompts are loaded, with the keys: ['query', 'text']


## Data Ingestion
We start by ingesting the following data:
1. Raw Bitcoin Blocks from [Public Node](https://bitcoin-rpc.publicnode.com)
2. Economic Indicators such as the S&P 500, CPI, and Federal Funds Rate from [FRED](https://fred.stlouisfed.org/docs/api/fred/)
3. On-Chain Metrics such as Transaction Volume and Hash Rate from [Blockchain.info](https://api.blockchain.info/charts)

**Ingesting this much data while respecting rate limits takes a while!**

In [None]:
td = timedelta(days=10)  # Ingestion window, measured from the current time

ingest_raw_block_data(td)
ingest_economic_indicators(td)
ingest_onchain_metrics(td)
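
The rate-limit handling lives inside the `ingest_*` helpers, which this notebook does not show. As a rough sketch of the general idea only, a fixed-interval throttle could look like the following. `Throttle`, `fetch_all`, `min_interval`, and the injectable `clock`/`sleep` arguments are hypothetical names chosen for illustration and are not part of `llamaindex_utils`.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_all(items, fetch, throttle):
    """Call `fetch` on each item, pausing between calls as the throttle requires."""
    results = []
    for item in items:
        throttle.wait()
        results.append(fetch(item))
    return results
```

Making the clock and sleep functions injectable keeps the throttle testable without real waiting; a real ingestion loop would likely also need retries and backoff for failed requests.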

## Triplets Generation
A triplet is a sequence of three elements that encodes a statement about the data as a subject–predicate–object expression. <br>
For example, if Mary is an engineer and Mary and John are friends, the triplets would be: <br>
Mary -> IS -> Engineer <br>
Mary -> FRIENDS -> John <br>

For our use case, triplets take the form: <br>
Block: 8914 -> FOLLOWS -> Block: 8913 <br>
Block: 1234 -> CONTAINS -> Transaction: s4d56f7g8hu9j <br>
Sp500 -> HAS_VALUE_ON -> 2025-03-26 <br>
and so on.

These triplets are generated in code by breaking down the structured data that we ingested. LLMs can also be used to generate triplets, but because our data is structured rather than free-form text, rule-based generation works and saves a large number of LLM API calls.

Triplets are made up of entities (Blocks, Transactions, CPI, etc.) that are joined by relationships (HAS_VALUE, CONTAINS, etc.).
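
To make the rule-based approach concrete, here is a minimal sketch that turns one structured block record into triplets. The record layout (`height`, `prev_height`, `tx_ids`) and the function `block_to_triplets` are assumptions made for this illustration; the actual `TripletGenerator` may structure its input and output differently.

```python
def block_to_triplets(block):
    """Turn one block record into (subject, predicate, object) tuples."""
    subject = f"Block: {block['height']}"
    triplets = []
    # A block follows its predecessor, if one is known
    if block.get("prev_height") is not None:
        triplets.append((subject, "FOLLOWS", f"Block: {block['prev_height']}"))
    # A block contains each of its transactions
    for tx_id in block.get("tx_ids", []):
        triplets.append((subject, "CONTAINS", f"Transaction: {tx_id}"))
    return triplets


# Hypothetical record, reusing the identifiers from the examples above
example_block = {"height": 8914, "prev_height": 8913, "tx_ids": ["s4d56f7g8hu9j"]}
for triplet in block_to_triplets(example_block):
    print(" -> ".join(triplet))
```

Because every triplet comes from a fixed rule over known fields, no LLM call is needed at this stage.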

In [None]:
blocks_data = get_raw_block_data()
economic_data = get_economic_indicators()
onchain_data = get_onchain_metrics()

triplet_generator = TripletGenerator()
nodes, relations, text_nodes = triplet_generator.load_and_process_data(blocks_data, economic_data, onchain_data)
logger.info(f"Generated {len(nodes)} nodes")
logger.info(f"Generated {len(relations)} relations")

### Batch Embeddings
We have now generated the core components of a Knowledge Graph from the triplets. <br>
We can also embed these components to enable vector-based similarity search. To speed things up, we embed them in batches.

Since the embedding model runs on your local system, generation times will vary.

In [None]:
# Build the text to embed for each node from its properties, modeled on BaseNode embedding texts
node_texts = []
for node in nodes:
    node_texts.append("\n".join(f"{key}: {value}" for key, value in node.properties.items()))

node_embeddings = embed_model.get_text_embedding_batch(node_texts)
text_embeddings = embed_model.get_text_embedding_batch([text_node.text for text_node in text_nodes])
for node, embedding in zip(nodes, node_embeddings):
    node.embedding = embedding
for text_node, embedding in zip(text_nodes, text_embeddings):
    text_node.embedding = embedding

logger.info(f"Embedded {len(nodes)} nodes")
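
The cell above leaves the batching itself to `get_text_embedding_batch`. As a simplified illustration of what batching means, the sketch below splits a list of texts into fixed-size chunks and embeds one chunk per call. `batched`, `embed_in_batches`, and `toy_embed_batch` are hypothetical names, and the toy embedder is only a stand-in for a real model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=10):
    """Embed `texts` batch by batch and return the vectors in input order."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


def toy_embed_batch(batch):
    """Stand-in for a real model call: maps each text to a one-element vector."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(
    ["height: 1", "height: 22", "height: 333"], toy_embed_batch, batch_size=2
)
```

Embedding several texts per model call amortizes per-call overhead, which is where most of the speed-up comes from.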

## Building the Knowledge Graph
We are finally ready to build a LlamaIndex Knowledge Graph. This involves two components:
1. **Graph Store** <br>
The Graph Store is the underlying database for the Knowledge Graph. LlamaIndex supports a variety of Graph Stores, including an in-memory one. Here we use the Neo4j Graph Store, which takes over storage persistence and lets us run similarity search over the embeddings we generated previously.
2. **Graph Index** <br>
The Graph Index is the data structure that lets us quickly retrieve relevant context for a user query. LlamaIndex offers KnowledgeGraphIndex (deprecated) and PropertyGraphIndex; we use PropertyGraphIndex. It provides a high-level interface for running complex queries and for letting AI Agents interact with our data.
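
The similarity search mentioned for the Graph Store can be pictured with a small in-memory sketch: rank stored vectors by cosine similarity to a query vector and keep the top `k`. This is a toy illustration with made-up three-dimensional vectors, not how Neo4j implements vector search; `cosine_similarity`, `top_k`, and `stored_vectors` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """Return the names of the `k` stored vectors most similar to `query_vec`."""
    ranked = sorted(
        stored,
        key=lambda name: cosine_similarity(query_vec, stored[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings for illustration only
stored_vectors = {
    "Block: 894214": [0.9, 0.1, 0.0],
    "Sp500": [0.0, 0.2, 0.9],
    "Transaction: s4d56f7g8hu9j": [0.7, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], stored_vectors, k=2))
```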

In [None]:
# You may pass your username/password/url here
graph_store = get_neo4j_graph_store()

# Add Nodes, Relations and TextNodes to the Graph Store
graph_store.upsert_nodes(nodes)
graph_store.upsert_relations(relations)
graph_store.upsert_llama_nodes(text_nodes)

# Initialize Graph Index with the Graph Store
kg_index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    llm=llm
)

logger.info(f"PropertyGraphIndex created with schema: \n{str(kg_index.property_graph_store.structured_schema)[:1000]}...")

INFO - PropertyGraphIndex created with schema: 
{'node_props': {'Block': [{'property': 'hash', 'type': 'TEXT', 'values': ['00000000000000000000ab2d5af936a7c0b91f4216ac3a3e7f0b'], 'distinct_count': 1}, {'property': 'height', 'type': 'INTEGER', 'min': '894214', 'max': '894214', 'distinct_count': 1}, {'property': 'timestamp', 'type': 'INTEGER', 'min': '1745784304', 'max': '1745784304', 'distinct_count': 1}, {'property': 'name', 'type': 'STRING', 'values': ['894214'], 'distinct_count': 1}, {'property': 'month', 'type': 'INTEGER', 'min': '4', 'max': '4', 'distinct_count': 1}, {'property': 'hour', 'type': 'INTEGER', 'min': '20', 'max': '20', 'distinct_count': 1}, {'property': 'size', 'type': 'INTEGER', 'min': '1034417', 'max': '1034417', 'distinct_count': 1}, {'property': 'day', 'type': 'INTEGER', 'min': '27', 'max': '27', 'distinct_count': 1}, {'property': 'difficulty', 'type': 'FLOAT', 'min': '1.232343879770509E14', 'max': '1.232343879770509E14', 'distinct_count': 1}, {'property': 'datetim

## Querying the Knowledge Graph
We can query the Knowledge Graph directly by creating a Query Engine. Internally, it uses our Graph Store to retrieve data relevant to a natural-language query and an LLM to turn that data into human-readable output.

In [9]:
query_engine = kg_index.as_query_engine(
    similarity_top_k=5,
    embedding_mode="hybrid",
    response_mode="tree_summarize",  # options: tree_summarize, no_text
    include_text=True,
)
# Modify this query as per need
query_engine.query("Tell me about the Block with height 894214").response

INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


'The block with height 894214 contains multiple transactions, including those with the following identifiers: 675a45c28fb1b1e5b458328478a711eccbec20f8a155a7ec7eada7dcb65b6f80, ed9590df5777b94de6b271ba4d1dfa28c7a1c5c518164a32b25b9ead03704251, 081a66b7cd3e13ddb9a97faaa3ddf2f170daf44fb4466b3434ece9d4866d0d80, 956abb004f3114c32b8edd4d78f8acdb8fa01de5e3ffb3b516041481e7df7164, and 33ee38bbcd2e498287de880a94e8cfbb31a9fec22574afaccc6766f0e3b136c3. Several of these transactions send funds to the address bc1pq3qht4k45ccx89paqpd3axfwf4cmpyyxgwp79n6lxl8qv3hx8wuslyufll, while one transaction sends to addresses bc1pnx3x47zagukpjychkzk3tmzclfyu3rcd72f9nllz0cukve3xj7tqwj2ra5 and bc1pf39ct659za9j6azccfh3jpptk9s6us8lnjll9y9wuemq443ah6qqtu45dz.'

## LlamaIndex Agents for Knowledge Graph



LlamaIndex has many default retrievers that can be used directly to perform a variety of queries on our Knowledge Graph. <br>
For more complex and highly specific queries, we will leverage LlamaIndex Agents.<br>
- Many of the Agents we use here work much like **CypherTemplateRetriever**, where the LLM fills in a Cypher query template to retrieve the relevant results from the Knowledge Graph.
- We can't have Cypher query templates for all possible cases, so we also have Agents that use
    - **VectorContextRetriever** to perform vector similarity search
    - **TextToCypherRetriever**, where the LLM creates a Cypher query from scratch
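
The template idea can be sketched without any LLM: the Cypher text is fixed, and only a parameter value is filled in and validated. The `Block` label and its `height`, `hash`, and `timestamp` properties come from the schema output shown earlier; `BLOCK_BY_HEIGHT_TEMPLATE` and `build_block_query` are hypothetical names, and the Agents in `LlamaAgents` may use different templates.

```python
# Fixed query text; `$height` is a Cypher parameter placeholder
BLOCK_BY_HEIGHT_TEMPLATE = (
    "MATCH (b:Block) "
    "WHERE b.height = $height "
    "RETURN b.hash AS hash, b.timestamp AS timestamp"
)


def build_block_query(llm_supplied_height):
    """Validate the value an LLM extracted and pair it with the fixed template."""
    height = int(llm_supplied_height)  # raises ValueError for non-numeric input
    if height < 0:
        raise ValueError("block height must be non-negative")
    return BLOCK_BY_HEIGHT_TEMPLATE, {"height": height}


query, params = build_block_query("894214")
```

Passing the value as a query parameter, rather than formatting it into the Cypher string, keeps the query shape fixed no matter what the LLM returns.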

In [69]:
llama_agents = LlamaAgents(kg_index=kg_index)

#### Example: Find specific block details

In [65]:
await llama_agents.query("When was block with 894214 created?")

'Block with height 894214 was created on April 27, 2025, at 20:05:04 (UTC). \n\nIf you need more details about this block or its transactions, feel free to ask!'
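
The agent's answer can be cross-checked against the raw data. The schema output shown earlier lists `timestamp` 1745784304 for block 894214, and a minimal sketch using only the standard library converts that Unix timestamp to UTC:

```python
from datetime import datetime, timezone

# `timestamp` value for block 894214, taken from the schema output above
block_timestamp = 1745784304
created_at = datetime.fromtimestamp(block_timestamp, tz=timezone.utc)
print(created_at.isoformat())  # 2025-04-27T20:05:04+00:00
```

This matches the date and time the agent reported.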

#### Example: Trace Funds sent to an Address

In [68]:
await llama_agents.query("Trace the funds sent to this btc address bc1pq3qht4k45ccx89paqpd3axfwf4cmpyyxgwp79n6lxl8qv3hx8wuslyufll")

'The Bitcoin address bc1pq3qht4k45ccx89paqpd3axfwf4cmpyyxgwp79n6lxl8qv3hx8wuslyufll has received multiple small-value transactions. Here is a summary of the received funds:\n\n- The transactions are all recorded at block height 894214.\n- The timestamps for these transactions are around 2025-04-27 20:05:04.\n- The received values range from very small amounts like 0.0000033 BTC to about 0.0000488 BTC.\n- There are many individual transactions, each sending a small fraction of a Bitcoin to this address.\n\nNo sent transactions were found in the latest 20 transactions for this address, indicating these are incoming funds.\n\nIf you want, I can trace where these received funds were sent afterward or provide details on specific transactions. Let me know how you would like to proceed.'