# Crypto Whitepaper LLM

Imports:

In [1]:
from qdrant_client import QdrantClient
from src.corpus import *
from src.rag import *
from src.pipeline import *
from src.imaging import *

available_projects = ["bitcoin", "ethereum", "uniswap_v2", "chainlink_v1", "aave_v1"]

### Step 1: Reading all Whitepapers (PDF files) from `data/raw_pdfs/`

Loading all PDF files:

In [2]:
docs = load_corpus("data")
print(f"Loaded {len(docs)} documents")
if docs:
    print(docs[0].keys()) 

Loaded 6 documents
dict_keys(['document_class', 'project_id', 'text', 'source_path'])


Confirming that loading succeeded and printing the basic metadata of each document:

In [3]:
for doc in docs:
    print("-" * 50)
    for key, value in doc.items():
        if key != "text":
            print(f"{key}: {value}")
        else:
            # Show only the first 100 characters of the full text
            print(f"{key}:\n{value[:100]}...")

--------------------------------------------------
document_class: raw_pdfs
project_id: aave
text:
Protocol Whitepaper
V1.0
[EMAIL]
January 2020
Abstract
This document describes the definitions and t...
source_path: data\raw_pdfs\aave.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: bitcoin
text:
Bitcoin: A Peer-to-Peer Electronic Cash System
Satoshi Nakamoto
[EMAIL]
www.bitcoin.org
Abstract. A ...
source_path: data\raw_pdfs\bitcoin.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: chainlink
text:
Chainlink 2.0: Next Steps in the Evolution of
Decentralized Oracle Networks
Lorenz Breidenbach1
Chri...
source_path: data\raw_pdfs\chainlink.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: ethereum
text:
When Satoshi Nakamoto first set the Bitcoin blockchain into motion in January 2009, he was
simultane...
source_path: data\raw_pdfs\ethereum.pdf
--------------------

Check text length of all docs:

In [4]:
print("Text lengths:")
for doc in docs:
    print(doc["project_id"] + ":", len(doc["text"]))  

Text lengths:
aave: 27774
bitcoin: 21145
chainlink: 361960
ethereum: 84978
ethereum_eip_150: 121380
solana: 45804


### Step 2: Setting up the RAG

Chunking all whitepapers, with overlap between adjacent chunks to preserve context across chunk boundaries:

In [5]:
chunk_objects = create_chunk_objects(docs=docs)
chunk_objects

[{'id': 'aave_0',
  'project_id': 'aave',
  'source': 'data\\raw_pdfs\\aave.pdf',
  'chunk_index': 0,
  'text': 'Protocol Whitepaper V1.0 [EMAIL] January 2020 Abstract This document describes the definitions and theory behind the Aave Protocol explaining the different aspects of the implementation. Contents Introduction 1.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Formal Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Protocol Architecture 2.1 Lending Pool Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Lending Pool Data Provider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Lending Pool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Lending Pool Configurator . . . . . . . . . . . . . . . . . . . . . . . . . . 
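The internals of `create_chunk_objects` are not shown in this notebook. As a minimal sketch of overlapping chunking, assuming word-based windows and hypothetical `chunk_size` and `overlap` parameters (the real function may split differently), each chunk could repeat the last few words of the previous one:

```python
def chunk_with_overlap(text, chunk_size=200, overlap=40):
    """Split text into word windows; consecutive windows share `overlap` words.

    Hypothetical helper for illustration; not the notebook's create_chunk_objects.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_chunk_objects(project_id, source, text, **kwargs):
    """Wrap chunks in dicts shaped like the output above (id, project_id, ...)."""
    return [
        {"id": f"{project_id}_{i}", "project_id": project_id,
         "source": source, "chunk_index": i, "text": chunk}
        for i, chunk in enumerate(chunk_with_overlap(text, **kwargs))
    ]
```

A larger overlap makes it less likely that a sentence is cut off from its context, at the cost of more chunks to embed and store.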

Creating embeddings for all chunks:

In [6]:
embeddings = embed_chunks(chunk_objects)
embeddings

array([[-5.01484983e-02, -3.35568301e-02, -5.43546788e-02, ...,
         1.10461907e-02,  3.74264978e-02,  2.32786201e-02],
       [-4.88143452e-02, -6.07051738e-02, -7.75696710e-02, ...,
        -1.07340422e-02,  6.35056663e-03,  9.09574423e-03],
       [-1.42356027e-02, -3.49780694e-02, -6.34338930e-02, ...,
         4.72240262e-02, -3.76252392e-05,  5.34903780e-02],
       ...,
       [-6.57549798e-02,  2.52599455e-02, -6.67224359e-03, ...,
         8.54725689e-02,  5.60708530e-02,  1.34296678e-02],
       [-8.73252079e-02, -2.24423539e-02, -4.96248938e-02, ...,
        -5.14123542e-03,  2.34074332e-02, -1.09556150e-02],
       [-1.05926186e-01, -8.31515156e-03, -9.39507782e-02, ...,
         7.61780515e-02,  4.14076708e-02,  3.81451249e-02]],
      shape=(219, 384), dtype=float32)

Setting up the Qdrant client with the default collection name "crypto_whitepapers":

In [7]:
qdrant_client, COLLECTION = init_qdrant_collection(embeddings)

Uploading all chunks to the Qdrant collection:

In [8]:
upload_to_qdrant(qdrant_client, chunk_objects, embeddings, COLLECTION)

Uploaded 219 chunks.


### Step 3: LLM Pipeline (with optional finetuning)

Defining an example question:

In [9]:
example_question = "Can you explain Bitcoin to me and can you show me a chart of the Bitcoin supply over time?"

Analyzing the question:
- Which projects are involved?
- What type of question is it?
- Does it need an image?

In [10]:
question_analysis = analyze_question(example_question)
question_analysis

{'projects': ['bitcoin'], 'type': 'tokenomics', 'needs_image': True}
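How `analyze_question` reaches this result is not shown here; it may well rely on an LLM call. A minimal keyword-based sketch that returns a dict of the same shape, where the keyword tables and the `analyze_question_sketch` name are assumptions made purely for illustration:

```python
import re

# Hypothetical keyword tables; the real analyzer may work very differently.
PROJECT_KEYWORDS = {
    "bitcoin": ["bitcoin", "btc"],
    "ethereum": ["ethereum", "eth"],
    "aave": ["aave"],
    "chainlink": ["chainlink"],
}
TYPE_KEYWORDS = {
    "tokenomics": ["supply", "issuance", "token", "inflation"],
    "technical": ["consensus", "protocol", "mining", "contract"],
}
IMAGE_KEYWORDS = ["chart", "plot", "graph", "image", "diagram"]


def analyze_question_sketch(question):
    """Classify a question by matching its words against the keyword tables."""
    words = set(re.findall(r"[a-z0-9_]+", question.lower()))
    projects = [p for p, kws in PROJECT_KEYWORDS.items() if words & set(kws)]
    # First matching type wins; fall back to a generic label.
    q_type = next(
        (t for t, kws in TYPE_KEYWORDS.items() if words & set(kws)), "general"
    )
    return {
        "projects": projects,
        "type": q_type,
        "needs_image": bool(words & set(IMAGE_KEYWORDS)),
    }
```

Keyword matching is cheap and deterministic but brittle; it only serves to make the three output fields concrete.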

Using the RAG system to retrieve the 5 chunks most likely to contain the answer:

In [11]:
possible_answers = retrieve_rag(example_question, qdrant_client, COLLECTION, top_k=5)
possible_answers

[{'text': 'include a transaction giving themselves 25 BTC out of nowhere. Additionally, if any transaction has a higher total denomination in its inputs than in its outputs, the difference also goes to the miner as a "transaction fee". Incidentally, this is also the only mechanism by which BTC are issued; the genesis state contained no coins at all. ethereum.org In order to better understand the purpose of mining, let us examine what happens in the event of a malicious attacker. Since Bitcoin\'s underlying cryptography is known to be secure, the attacker will target the one part of the Bitcoin system that is not protected by cryptography directly: the order of transactions. The attacker\'s strategy is simple: 1. Send 100 BTC to a merchant in exchange for some product (preferably a rapid-delivery digital good) 2. Wait for the delivery of the product 3. Produce another transaction sending the same 100 BTC to himself 4. Try to convince the network that his transaction to himself was the o
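`retrieve_rag` delegates the search to Qdrant. Conceptually, a vector search ranks the stored chunk vectors by their similarity to the embedded question. A dependency-free sketch of that ranking, assuming cosine similarity as the metric and an already embedded question (the `top_k_cosine` helper and the toy vectors are made up for illustration):

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_cosine(query_vec, chunk_vecs, chunk_objects, top_k=5):
    """Return the top_k chunks ranked by similarity to the query vector."""
    scored = [
        (cosine(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunk_objects)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"score": score, **chunk} for score, chunk in scored[:top_k]]


# Toy data: three 2-dimensional "embeddings" instead of 384-dimensional ones.
chunks = [
    {"id": "a_0", "text": "mining"},
    {"id": "b_0", "text": "lending"},
    {"id": "c_0", "text": "oracles"},
]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = top_k_cosine([1.0, 0.1], vectors, chunks, top_k=2)
print([hit["id"] for hit in hits])
```

A vector database performs the same ranking with an index, so it does not have to compare the query against every stored vector.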

Generate an answer with a baseline model: `meta-llama-3-8b-instruct`

$\Rightarrow$ More information about this model: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
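The generated answers cite their sources as "Chunk N", which suggests that the retrieved chunks are numbered inside the prompt. The prompt actually used by `generate_answer` is not shown; the following sketches one way to assemble such a prompt, with a made-up template and a hypothetical `build_prompt` helper:

```python
def build_prompt(question, retrieved_chunks, max_chars_per_chunk=1500):
    """Number each retrieved chunk so the model can cite it as 'Chunk N'.

    Hypothetical template for illustration; not the notebook's actual prompt.
    """
    context_blocks = []
    for number, chunk in enumerate(retrieved_chunks, start=1):
        # Truncate long chunks to keep the prompt within the context window.
        text = chunk["text"][:max_chars_per_chunk]
        context_blocks.append(f"Chunk {number}:\n{text}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the chunks you rely on as (see Chunk N).\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks makes it possible to trace each claim in the answer back to a retrieved passage.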

In [12]:
answer_base = generate_answer(example_question, possible_answers, use_finetuned_model=False)
print(answer_base["answer_text"])

I'd be happy to explain Bitcoin and provide a chart of its supply over time!

Bitcoin is a decentralized peer-to-peer online currency that maintains value without any backing, intrinsic value, or central issuer (see Chunk 5). It was created by Satoshi Nakamoto in January 2009. The core idea behind Bitcoin is the concept of mining, which involves solving complex mathematical problems to validate transactions and create new blocks in the blockchain.

In terms of supply, Bitcoin's total supply is capped at 21 million BTC, with a small portion already mined (see Chunk 1). New Bitcoins are issued through a process called "mining," where miners solve complex mathematical problems to validate transactions and create new blocks. The miner who solves the problem first gets to add a new block of transactions to the blockchain and is rewarded with newly minted Bitcoins, as well as transaction fees from the transactions included in that block.

Here's a rough chart of Bitcoin's supply over time:



Generate an answer with a finetuned model: `theia-llama-3.1-8b-v1`

$\Rightarrow$ More information about this model: https://huggingface.co/QuantFactory/Theia-Llama-3.1-8B-v1-GGUF

In [13]:
answer_finetuned = generate_answer(example_question, possible_answers, use_finetuned_model=True)
print(answer_finetuned["answer_text"])

Bitcoin is a decentralized peer-to-peer online currency that operates without any backing, intrinsic value, or central issuer. It maintains a value that is determined by market forces rather than any physical commodity or government intervention. 

The supply of Bitcoin is controlled through a process where new coins are created only when transactions are completed and the associated fees are paid. The total supply of Bitcoin is capped at 21 million coins, with a portion already mined and distributed over time. 

In terms of the chart showing Bitcoin supply over time, it illustrates how the amount of Bitcoin in circulation increases gradually, reaching the maximum supply of 21.00 million by the year 2140 through a process known as halving, where the mining rewards for generating new blocks are reduced every four years.

See Chunk 5 for more insights into the concept of Bitcoin and its significance in the context of digital currencies.

Disclaimer: This is not financial advice.


Review the answer produced by the **finetuned** model:

In [14]:
# Optional review step (second LLM call)
answer_reviewed = review_answer(
    example_question,
    possible_answers,
    draft_answer=answer_finetuned,
    use_llm=True,
    use_finetuned_model=True,
)
print(answer_reviewed["answer_text"])

The improved answer is as follows:

Bitcoin is a decentralized peer-to-peer online currency that operates without any backing, intrinsic value, or central issuer. It maintains a value that is determined by market forces rather than any physical commodity or government intervention. 

The supply of Bitcoin is controlled through a process where new coins are created only when transactions are completed and the associated fees are paid. The total supply of Bitcoin is capped at 21 million coins, with a portion already mined and distributed over time. 

In terms of the chart showing Bitcoin supply over time, it illustrates how the amount of Bitcoin in circulation increases gradually, reflecting the gradual mining process and the halving events that occur every four years, ultimately reaching the maximum supply of 21.00 million by the year 2140.

**Disclaimer:** This is not financial advice. 

I have kept the essential information from the draft while ensuring that any unsupported claims are
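
The supply figures in these answers (a 21 million cap, approached through periodic halvings) can be checked with a short calculation. The sketch below is not part of the pipeline; it assumes Bitcoin's well-known issuance parameters, an initial block subsidy of 50 BTC that halves every 210,000 blocks (roughly every four years), and sums the subsidy over all halving eras in integer satoshis:

```python
SATOSHIS_PER_BTC = 100_000_000
INITIAL_SUBSIDY = 50 * SATOSHIS_PER_BTC  # satoshis per block in the first era
BLOCKS_PER_ERA = 210_000                 # blocks between two halvings


def supply_schedule():
    """Yield (era, cumulative supply in BTC) at the end of each halving era."""
    total = 0
    era = 0
    subsidy = INITIAL_SUBSIDY
    while subsidy > 0:
        total += subsidy * BLOCKS_PER_ERA
        yield era, total / SATOSHIS_PER_BTC
        era += 1
        subsidy = INITIAL_SUBSIDY >> era  # integer halving, rounds down


schedule = list(supply_schedule())
for era, supply in schedule[:4]:
    print(f"after era {era}: {supply:,.0f} BTC")
print(f"final supply: {schedule[-1][1]:,.4f} BTC")
```

The total approaches but never reaches 21 million BTC, because the subsidy eventually rounds down to zero. The resulting (era, supply) pairs could also feed a plotting library when an exact chart is needed.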

### Step 4: Image generation

Check whether an image should be generated. If so:

1. Print the URL of the generated image
2. Display the image inline (for example in VS Code)

In [15]:
# Explicit import, in case src.imaging does not re-export these names
from IPython.display import Image, display

image_request = build_image_request(example_question, question_analysis, answer_reviewed)
if image_request.get("should_generate") is True:
    image_url = generate_tokenomics_image(image_request.get("prompt_suggestion"))
    print("Image URL:", image_url)
    display(Image(url=image_url))

Image URL: https://oaidalleapiprodscus.blob.core.windows.net/private/org-E3LNA03R1XJIPS20pdl0y5ce/user-wlzKhmKn9GQ4CBXJwxjqhqB4/img-PYNJNOqbOwMXB7NfowXLTtpO.png?st=2025-12-17T04%3A05%3A07Z&se=2025-12-17T06%3A05%3A07Z&sp=r&sv=2024-08-04&sr=b&rscd=inline&rsct=image/png&skoid=38e27a3b-6174-4d3e-90ac-d7d9ad49543f&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2025-12-17T04%3A46%3A56Z&ske=2025-12-18T04%3A46%3A56Z&sks=b&skv=2024-08-04&sig=uG5b9JYAPUuKEY9Q6x%2Bhpnx8Ei6ZAN2In8V6lwON4/U%3D
