# Crypto Whitepaper Analysis

**Imports and basic configuration**

The first code cell sets up everything needed for the demo notebook:
- core dependency: QdrantClient is imported to create and interact with the vector database used for retrieval (RAG)
- project modules: all helper functions from the src/ modules are imported:
    - corpus: loading PDFs and cleaning extracted text
    - rag: chunking, embedding, and retrieval logic
    - pipeline: question analysis, answer generation (base vs. "fine-tuned"), optional review step
    - imaging: building image prompts and generating an image (if requested)
- project scope: available_projects defines which whitepaper projects are supported in this notebook (the question analysis step uses this list to detect which project(s) the user is asking about)

In [1]:
from qdrant_client import QdrantClient
from src.corpus import *
from src.rag import *
from src.pipeline import *
from src.imaging import *

available_projects = ["bitcoin", "ethereum", "uniswap_v2", "chainlink_v1", "aave_v1"]

### Step 1: Reading all Whitepapers (PDF files) from `data/raw_pdfs/`

**Loading the whitepaper corpus from PDF files**

This step scans the data/ directory recursively and loads every PDF (each PDF represents one project's whitepaper). For each file, the following is done:
- the raw text is extracted from the PDF
- the text is cleaned/normalized: artifacts like headers/footers, broken Unicode, extra whitespace are removed
- the text is stored in a structured document dictionary with metadata (project ID, source path, and the cleaned text)

The cell then prints how many documents were loaded and shows the available keys for one example document (as the keys are the same for each document):

In [2]:
docs = load_corpus("data")
print(f"Loaded {len(docs)} documents")
if docs:
    print(docs[0].keys()) 

Loaded 6 documents
dict_keys(['document_class', 'project_id', 'text', 'source_path'])
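
The cleaning logic itself lives in `src/corpus` and is not shown in this notebook. As a rough illustration of the kind of normalization described above, here is a minimal sketch; the helper name `clean_text` and its rules are assumptions and may differ from the project's real implementation:

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Hypothetical cleaner; not the project's actual implementation."""
    # Normalize Unicode so ligatures and non-breaking spaces become plain characters.
    text = unicodedata.normalize("NFKC", raw)
    cleaned_lines = []
    for line in text.splitlines():
        # Skip lines that contain only a page number (a common header/footer artifact).
        if re.fullmatch(r"\s*\d+\s*", line):
            continue
        # Collapse runs of spaces/tabs inside the line and trim the ends.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


sample = "de\ufb01ned\u00a0term\n 3 \nnext   line "
print(clean_text(sample))
```

Line breaks are kept on purpose here, since the document previews in this notebook still contain them.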


**Checking loaded documents and metadata**

After loading, let's quickly verify that the pipeline worked as expected:
- each document should have the required metadata fields
- the text extraction should have produced readable content (not empty / not garbled)
- the mapping between project_id and source_path should look correct

To avoid flooding the output, this cell prints the full metadata but only the first 100 characters of each document’s text:

In [3]:
for doc in docs:
    print("-" * 50)
    for key, value in doc.items():
        if key != "text":
            print(f"{key}:", value)
        else:
            # Show only the first 100 characters of the text
            print(f"{key}:\n{value[:100]}...")

--------------------------------------------------
document_class: raw_pdfs
project_id: aave
text:
Protocol Whitepaper
V1.0
[EMAIL]
January 2020
Abstract
This document describes the definitions and t...
source_path: data\raw_pdfs\aave.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: bitcoin
text:
Bitcoin: A Peer-to-Peer Electronic Cash System
Satoshi Nakamoto
[EMAIL]
www.bitcoin.org
Abstract. A ...
source_path: data\raw_pdfs\bitcoin.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: chainlink
text:
Chainlink 2.0: Next Steps in the Evolution of
Decentralized Oracle Networks
Lorenz Breidenbach1
Chri...
source_path: data\raw_pdfs\chainlink.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: ethereum
text:
When Satoshi Nakamoto first set the Bitcoin blockchain into motion in January 2009, he was
simultane...
source_path: data\raw_pdfs\ethereum.pdf
--------------------

**Checking corpus size (text length per document)**

Before chunking and embedding, it's helpful to understand how large each document is. The following cell prints the character count of the cleaned text for each project. Large differences here can indicate:
- unusually short/empty extractions
- very large documents that may produce many chunks and take longer to embed

In [4]:
print("Text lengths:")
for doc in docs:
    print(doc["project_id"] + ":", len(doc["text"]))  

Text lengths:
aave: 27774
bitcoin: 21145
chainlink: 361960
ethereum: 84978
ethereum_eip_150: 121380
solana: 45804


### Step 2: Setting up the RAG

**Chunking all whitepapers into overlapping text segments (chunks)**

To prepare the corpus for Retrieval-Augmented Generation (RAG), each document is split into smaller pieces. Each chunk overlaps with the next one so that important context (e.g. definitions or sentences crossing a boundary) isn’t lost. The following cell creates the chunk objects from docs, prints how many chunks were created in total, and shows the available fields/metadata keys for one example chunk:

In [5]:
chunk_objects = create_chunk_objects(docs=docs)
print(f"Loaded {len(chunk_objects)} chunk objects")
if chunk_objects:
    print(chunk_objects[0].keys()) 

Loaded 219 chunk objects
dict_keys(['id', 'project_id', 'source', 'chunk_index', 'text'])
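
How `create_chunk_objects` splits the text is not visible here. The sketch below shows one common way to build overlapping chunks, assuming word-based windows; the function names, the window size, and the overlap are illustrative assumptions, and only the output keys are taken from the cell output above:

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def make_chunk_objects(doc: dict, **kwargs) -> list[dict]:
    """Attach the metadata keys shown in the output above to each chunk (hypothetical helper)."""
    return [
        {
            "id": f"{doc['project_id']}_{i}",
            "project_id": doc["project_id"],
            "source": doc["source_path"],
            "chunk_index": i,
            "text": chunk,
        }
        for i, chunk in enumerate(chunk_words(doc["text"], **kwargs))
    ]


demo_doc = {"project_id": "demo", "source_path": "demo.pdf", "text": "a b c d e f g h i j"}
for obj in make_chunk_objects(demo_doc, chunk_size=4, overlap=1):
    print(obj["id"], "->", obj["text"])
```

With a window of 4 words and an overlap of 1, the last word of each chunk reappears as the first word of the next one, which is the property that keeps boundary-crossing sentences intact.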


**Inspecting a few chunks and their metadata**

Before embedding and storing the chunks in a vector database, it’s useful to check the output of chunking. This cell prints the first ten chunks, showing the metadata fields (id, project_id, source, chunk_index, ...) and the actual chunk text. This helps confirm that chunk boundaries look reasonable and that metadata is correctly attached for later citation/debugging:

In [9]:
for chunk in chunk_objects[:10]:
    print("-" * 50)
    for key, value in chunk.items():
        if key != "text":
            print(f"{key}:", value)
        else:
            print(f"{key}:\n{value}")

--------------------------------------------------
id: aave_0
project_id: aave
source: data\raw_pdfs\aave.pdf
chunk_index: 0
text:
Protocol Whitepaper V1.0 [EMAIL] January 2020 Abstract This document describes the definitions and theory behind the Aave Protocol explaining the different aspects of the implementation. Contents Introduction 1.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Formal Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Protocol Architecture 2.1 Lending Pool Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Lending Pool Data Provider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Lending Pool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Lending Pool Configurator . . . . . . . . . . . . . . . . . 

**Creating embeddings for all chunks**

Here each chunk’s text is converted into a dense vector representation (an embedding) using a SentenceTransformer model. These vectors capture semantic meaning, so similar chunks end up close together in embedding space. The resulting embeddings array contains one vector per chunk and will be uploaded to Qdrant for similarity search during retrieval:

In [6]:
embeddings = embed_chunks(chunk_objects)
embeddings

array([[-5.01484983e-02, -3.35568301e-02, -5.43546788e-02, ...,
         1.10461907e-02,  3.74264978e-02,  2.32786201e-02],
       [-4.88143452e-02, -6.07051738e-02, -7.75696710e-02, ...,
        -1.07340422e-02,  6.35056663e-03,  9.09574423e-03],
       [-1.42356027e-02, -3.49780694e-02, -6.34338930e-02, ...,
         4.72240262e-02, -3.76252392e-05,  5.34903780e-02],
       ...,
       [-6.57549798e-02,  2.52599455e-02, -6.67224359e-03, ...,
         8.54725689e-02,  5.60708530e-02,  1.34296678e-02],
       [-8.73252079e-02, -2.24423539e-02, -4.96248938e-02, ...,
        -5.14123542e-03,  2.34074332e-02, -1.09556150e-02],
       [-1.05926186e-01, -8.31515156e-03, -9.39507782e-02, ...,
         7.61780515e-02,  4.14076708e-02,  3.81451249e-02]],
      shape=(219, 384), dtype=float32)
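
To make "close together in embedding space" concrete, the sketch below computes cosine similarity for three toy vectors. Cosine similarity is a common choice for sentence embeddings, but which metric this project's collection uses is not shown here, and the vectors are made up for illustration:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (invented values; the real ones have 384 dimensions).
mining = [0.9, 0.1, 0.0]
proof_of_work = [0.8, 0.2, 0.1]
lending_pool = [0.0, 0.2, 0.9]

print(cosine_similarity(mining, proof_of_work))  # high: related topics
print(cosine_similarity(mining, lending_pool))   # low: unrelated topics
```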

**Initializing the Qdrant vector database (in-memory)**

The Qdrant client is set up and a collection with the name `crypto_whitepapers` is created (or recreated). The collection configuration (especially the vector size) must match the embedding dimensionality, so we pass embeddings to determine the correct vector length. This prepares Qdrant to store and search our chunk vectors efficiently:

In [7]:
qdrant_client, COLLECTION = init_qdrant_collection(embeddings)

**Uploading chunks and embeddings to Qdrant**

Now the data is stored in the vector database. Each chunk becomes a Qdrant point with:
- a vector (the embedding) used for similarity search
- a payload (metadata + chunk text) used for filtering, tracing sources, and displaying results

After this step, the collection is ready for retrieval: given a query embedding, Qdrant can return the most semantically similar chunks:

In [8]:
upload_to_qdrant(qdrant_client, chunk_objects, embeddings, COLLECTION)

Uploaded 219 chunks.
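
The point structure described above (vector plus payload) can be illustrated without a database. The following is a plain-Python stand-in, not Qdrant's API: the `Point` class and `search` function are hypothetical, and the vectors are assumed to be normalized so that a dot product can serve as the similarity score:

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Point:
    """Stand-in for a vector-database point: an id, a vector, and a payload."""
    id: int
    vector: list[float]
    payload: dict = field(default_factory=dict)


def search(points, query, top_k=2, project_id=None):
    """Return the top_k (score, point) pairs by dot product, optionally filtered by project."""
    candidates = [
        p for p in points
        if project_id is None or p.payload.get("project_id") == project_id
    ]
    scored = ((sum(q * v for q, v in zip(query, p.vector)), p) for p in candidates)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


points = [
    Point(0, [1.0, 0.0], {"project_id": "bitcoin", "text": "proof-of-work"}),
    Point(1, [0.8, 0.6], {"project_id": "ethereum", "text": "mining and state"}),
    Point(2, [0.0, 1.0], {"project_id": "aave", "text": "lending pool"}),
]

for score, point in search(points, query=[1.0, 0.0], top_k=2):
    print(round(score, 2), point.payload["project_id"], point.payload["text"])
```

The payload does double duty in this sketch: it narrows the candidate set before scoring and carries the text that is shown to the user afterwards.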


### Step 3: LLM Pipeline (with optional fine-tuning)

**Defining an example user question**

This cell sets up a simple but realistic example question that combines two intents:
- a textual explanation ("...explain Bitcoin...")
- a visual request ("...show me a chart of the Bitcoin supply over time?")

In [9]:
example_question = "Can you explain Bitcoin to me and can you show me a chart of the Bitcoin supply over time?"

**Analyzing the question (intent and scope)**

Before retrieval, a lightweight analysis is done to extract the high-level structure from the question:
- Which project(s) are mentioned or implied (e.g., Bitcoin)
- Question type / intent (explanation, comparison, definition, etc.)
- Whether a visual (diagram/chart) would likely improve the response

The output (question_analysis) guides later steps like prompting and optional image creation:

In [10]:
question_analysis = analyze_question(example_question)
question_analysis

{'projects': ['bitcoin'], 'type': 'tokenomics', 'needs_image': True}
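
How `analyze_question` arrives at this result is not shown (it may well use an LLM). As a simplified illustration of the three fields, here is a keyword-based sketch; the function name, the keyword lists, and the matching rules are all assumptions:

```python
AVAILABLE_PROJECTS = ["bitcoin", "ethereum", "uniswap_v2", "chainlink_v1", "aave_v1"]

# Hypothetical keyword lists; the rules are illustrative only.
IMAGE_WORDS = ("chart", "diagram", "plot", "graph", "image", "picture")
TYPE_KEYWORDS = {
    "comparison": ("compare", "difference", "versus"),
    "tokenomics": ("supply", "issuance", "tokenomics"),
    "definition": ("what is", "define"),
}


def analyze_question_by_keywords(question: str) -> dict:
    q = question.lower()
    # Match on the base name so that "uniswap" finds "uniswap_v2".
    projects = [p for p in AVAILABLE_PROJECTS if p.split("_")[0] in q]
    # The first matching category wins; fall back to a generic explanation.
    q_type = next(
        (name for name, words in TYPE_KEYWORDS.items() if any(w in q for w in words)),
        "explanation",
    )
    needs_image = any(w in q for w in IMAGE_WORDS)
    return {"projects": projects, "type": q_type, "needs_image": needs_image}


question = "Can you explain Bitcoin to me and can you show me a chart of the Bitcoin supply over time?"
print(analyze_question_by_keywords(question))
```

Substring matching like this is brittle (for example, "graph" also matches "paragraph"), which is one reason a real pipeline might prefer a model-based classifier.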

**Retrieving the most relevant context chunks (RAG)**

Now the vector database is queried to fetch the top 5 most semantically similar chunks for the question. These chunks act as "base context" for the LLM, so the model can answer using information from the corpus rather than relying purely on its general knowledge:

In [11]:
possible_answers = retrieve_rag(example_question, qdrant_client, COLLECTION, top_k=5)
possible_answers

[{'text': 'include a transaction giving themselves 25 BTC out of nowhere. Additionally, if any transaction has a higher total denomination in its inputs than in its outputs, the difference also goes to the miner as a "transaction fee". Incidentally, this is also the only mechanism by which BTC are issued; the genesis state contained no coins at all. ethereum.org In order to better understand the purpose of mining, let us examine what happens in the event of a malicious attacker. Since Bitcoin\'s underlying cryptography is known to be secure, the attacker will target the one part of the Bitcoin system that is not protected by cryptography directly: the order of transactions. The attacker\'s strategy is simple: 1. Send 100 BTC to a merchant in exchange for some product (preferably a rapid-delivery digital good) 2. Wait for the delivery of the product 3. Produce another transaction sending the same 100 BTC to himself 4. Try to convince the network that his transaction to himself was the o
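
To show how retrieved chunks can become "base context", the sketch below assembles a prompt with numbered chunks so that the model can cite them. The function `build_prompt`, the instruction wording, and the sample chunk texts are assumptions; the actual prompt used by `generate_answer` is not part of this notebook:

```python
def build_prompt(question: str, chunks: list[dict], max_chars: int = 1200) -> str:
    """Assemble a grounded prompt; chunks are numbered so the model can cite them."""
    context = "\n\n".join(
        f"[Chunk {i}] {chunk['text'][:max_chars]}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the chunks you rely on as 'Chunk N'.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up sample chunks; only the 'text' key mirrors the retrieval output above.
retrieved = [
    {"text": "New coins enter circulation as a reward for creating a block."},
    {"text": "Transaction fees are also collected by the block creator."},
]
print(build_prompt("How are new coins issued?", retrieved))
```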

**Generating an answer with the base model (meta-llama-3-8b-instruct: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)**

Using the retrieved chunks as context, the baseline model generates an answer.
This gives a reference output that can be compared against the "fine-tuned" model version in the following step. Note that the code prints only the final answer text:

In [12]:
answer_base = generate_answer(example_question, possible_answers, use_finetuned_model=False)
print(answer_base["answer_text"])

I'd be happy to explain Bitcoin and provide a chart of its supply over time!

Bitcoin is a decentralized peer-to-peer online currency that maintains value without any backing, intrinsic value, or central issuer (see Chunk 5). It was created by Satoshi Nakamoto in January 2009. The core idea behind Bitcoin is the concept of mining, which involves solving complex mathematical problems to validate transactions and create new blocks in the blockchain.

In terms of supply, Bitcoin's total supply is capped at 21 million BTC, with a small portion already mined (see Chunk 1). New Bitcoins are issued through a process called "mining," where miners solve complex mathematical problems to validate transactions and create new blocks. The miner who solves the problem first gets to add a new block of transactions to the blockchain and is rewarded with newly minted Bitcoins, as well as transaction fees from the transactions included in that block.

Here's a rough chart of Bitcoin's supply over time:



**Generating an answer with the "fine-tuned" model (theia-llama-3.1-8b-v1: https://huggingface.co/QuantFactory/Theia-Llama-3.1-8B-v1-GGUF)**

Here the same answer is generated again, but this time using an alternative model that is treated as the "fine-tuned" option in this demo setup.
Comparing this output to the base model helps illustrate how model choice can affect:
- clarity and structure
- technical depth
- style and tone

while still using the same retrieved context:

In [13]:
answer_finetuned = generate_answer(example_question, possible_answers, use_finetuned_model=True)
print(answer_finetuned["answer_text"])

Bitcoin is a decentralized peer-to-peer online currency that operates without any backing, intrinsic value, or central issuer. It maintains a value that is determined by market forces rather than any physical commodity or government intervention. 

The supply of Bitcoin is controlled through a process where new coins are created only when transactions are completed and the associated fees are paid. The total supply of Bitcoin is capped at 21 million coins, with a portion already mined and distributed over time. 

In terms of the chart showing Bitcoin supply over time, it illustrates how the amount of Bitcoin in circulation increases gradually, reaching the maximum supply of 21.00 million by the year 2140 through a process known as halving, where the mining rewards for generating new blocks are reduced every four years.

See Chunk 5 for more insights into the concept of Bitcoin and its significance in the context of digital currencies.

Disclaimer: This is not financial advice.
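
The halving schedule mentioned in the answer can be checked with a few lines of arithmetic. The sketch below uses Bitcoin protocol constants that are not taken from this notebook (an initial block subsidy of 50 BTC, a halving every 210,000 blocks, and integer halving in satoshi), so treat it as an independent sanity check rather than part of the pipeline:

```python
SATOSHI_PER_BTC = 100_000_000
INITIAL_SUBSIDY = 50 * SATOSHI_PER_BTC  # block reward at launch, in satoshi
HALVING_INTERVAL = 210_000              # blocks between halvings (roughly four years)


def supply_schedule() -> list[tuple[int, int]]:
    """Return (halving_era, cumulative_supply_in_satoshi) until the subsidy reaches zero."""
    schedule, total, era = [], 0, 0
    subsidy = INITIAL_SUBSIDY
    while subsidy > 0:
        total += subsidy * HALVING_INTERVAL
        schedule.append((era, total))
        era += 1
        subsidy = INITIAL_SUBSIDY >> era  # integer halving, rounding down
    return schedule


schedule = supply_schedule()
for era, total in schedule[:4]:
    print(f"after era {era}: {total / SATOSHI_PER_BTC:,.0f} BTC")
print(f"final supply: {schedule[-1][1] / SATOSHI_PER_BTC:,.4f} BTC")
```

Because each halving rounds down to whole satoshi, the cumulative supply ends slightly below 21 million BTC. The `schedule` list is also the data one would plot for a supply-over-time chart.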


**Optional: reviewing/improving the answer with a second LLM call**

This step runs an additional "review pass" over the draft answer. Generally, the goal is to:
- fix missing details or weak explanations
- improve readability and structure
- reduce contradictions or repetition

In other words: retrieval supplies the content, and this review step helps fix issues and improve the presentation:

In [None]:
answer_reviewed = review_answer(example_question, possible_answers, draft_answer=answer_finetuned, use_llm=True, use_finetuned_model=True)
print(answer_reviewed["answer_text"])

The improved answer is as follows:

Bitcoin is a decentralized peer-to-peer online currency that operates without any backing, intrinsic value, or central issuer. It maintains a value that is determined by market forces rather than any physical commodity or government intervention. 

The supply of Bitcoin is controlled through a process where new coins are created only when transactions are completed and the associated fees are paid. The total supply of Bitcoin is capped at 21 million coins, with a portion already mined and distributed over time. 

In terms of the chart showing Bitcoin supply over time, it illustrates how the amount of Bitcoin in circulation increases gradually, reflecting the gradual mining process and the halving events that occur every four years, ultimately reaching the maximum supply of 21.00 million by the year 2140.

**Disclaimer:** This is not financial advice. 

I have kept the essential information from the draft while ensuring that any unsupported claims are
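
A review pass can also include cheap, deterministic checks next to the LLM call. As one example, the sketch below verifies that every "Chunk N" citation in a draft points to a chunk that was actually retrieved. This is a hypothetical helper, not what `review_answer` does, and it assumes that chunks are numbered starting at 1:

```python
import re


def invalid_chunk_citations(answer_text: str, num_chunks: int) -> list[int]:
    """Return cited chunk numbers that fall outside 1..num_chunks."""
    cited = {int(n) for n in re.findall(r"\bChunk (\d+)\b", answer_text)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


# Made-up draft: with five retrieved chunks, "Chunk 7" cannot be a valid citation.
draft = "Bitcoin has no central issuer (see Chunk 5). Supply is capped (see Chunk 7)."
print(invalid_chunk_citations(draft, num_chunks=5))
```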

### Step 4: Image generation

**Deciding whether to generate an image, then displaying it**

Based on the earlier question_analysis, this cell checks whether an image should be generated (e.g. a diagram or chart-style illustration). If so, the following happens:
- a prompt suggestion for the image model is built
- the image is generated and the URL is printed
- the image is displayed in the notebook (via IPython.display)

Note: The returned image URL is a time-limited signed link (its `se=` query parameter carries the expiry timestamp), so it stops working after a while. If the image is not visible in this notebook, please check the PNG `Crypto_Demo_Image.png`:

In [15]:
from IPython.display import Image, display

image_request = build_image_request(example_question, question_analysis, answer_reviewed)
if image_request.get("should_generate") is True:
    image_url = generate_tokenomics_image(image_request.get("prompt_suggestion"))
    print("Image URL:", image_url)
    display(Image(url=image_url))

Image URL: https://oaidalleapiprodscus.blob.core.windows.net/private/org-E3LNA03R1XJIPS20pdl0y5ce/user-wlzKhmKn9GQ4CBXJwxjqhqB4/img-PYNJNOqbOwMXB7NfowXLTtpO.png?st=2025-12-17T04%3A05%3A07Z&se=2025-12-17T06%3A05%3A07Z&sp=r&sv=2024-08-04&sr=b&rscd=inline&rsct=image/png&skoid=38e27a3b-6174-4d3e-90ac-d7d9ad49543f&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2025-12-17T04%3A46%3A56Z&ske=2025-12-18T04%3A46%3A56Z&sks=b&skv=2024-08-04&sig=uG5b9JYAPUuKEY9Q6x%2Bhpnx8Ei6ZAN2In8V6lwON4/U%3D
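
The URL printed above carries `st` and `se` query parameters whose timestamps lie two hours apart, which is consistent with a time-limited signed link. Assuming `se` marks the expiry time, the following sketch reads it from a URL so that the notebook could warn before displaying a dead link; the helper name and the shortened example URL are illustrative:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url: str):
    """Return the expiry time from the `se` query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("se")
    if not values:
        return None
    # parse_qs already percent-decodes, so the value looks like 2025-12-17T06:05:07Z.
    return datetime.strptime(values[0], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Shortened example with the same st/se values as the URL above; host and path are placeholders.
example_url = "https://example.com/img.png?st=2025-12-17T04%3A05%3A07Z&se=2025-12-17T06%3A05%3A07Z"
expiry = signed_url_expiry(example_url)
print("Link expires at:", expiry.isoformat())
print("Already expired:", datetime.now(timezone.utc) > expiry)
```

Saving the image to a local file right after generation avoids the problem altogether.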
