# Crypto Whitepaper Analysis

**Imports and basic configuration**

The first code cell sets up everything needed for the demo notebook:
- core dependency: QdrantClient is imported to create and interact with the vector database used for retrieval (RAG)
- project modules: all helper functions from the src/ modules are imported:
    - corpus: loading PDFs and cleaning extracted text
    - rag: chunking, embedding, and retrieval logic
    - pipeline: question analysis, answer generation (base vs. "fine-tuned"), optional review step
    - imaging: building image prompts and generating an image (if requested)
- project scope: available_projects defines which whitepaper projects are supported in this notebook (the question analysis step uses this list to detect which project(s) the user is asking about)

In [1]:
from qdrant_client import QdrantClient
from src.corpus import *
from src.rag import *
from src.pipeline import *
from src.imaging import *

available_projects = ["bitcoin", "ethereum", "uniswap_v2", "chainlink_v1", "aave_v1"]



### Step 1: Reading all Whitepapers (PDF files) from `data/raw_pdfs/`

**Loading the whitepaper corpus from PDF files**

This step scans the data/ directory recursively and loads every PDF (each PDF represents one project's whitepaper). For each file, the following is done:
- the raw text is extracted from the PDF
- the text is cleaned/normalized: artifacts like headers/footers, broken Unicode, extra whitespace are removed
- the text is stored in a structured document dictionary with metadata (project ID, source path, and the cleaned text)

The cell then prints how many documents were loaded and shows the available keys for one example document (as the keys are the same for each document):

In [3]:
docs = load_corpus("data")
print(f"Loaded {len(docs)} documents")
if docs:
    print(docs[0].keys()) 

Loaded 7 documents
dict_keys(['document_class', 'project_id', 'text', 'source_path'])
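
The cleaning step described above could look roughly like the following. This is a minimal sketch under assumptions: `clean_extracted_text` is a hypothetical name, not the actual helper in `src/corpus`, and it collapses all whitespace (including line breaks), whereas the real cleaner seems to keep line breaks in the document text. Header/footer removal is not covered here.

```python
import re
import unicodedata


def clean_extracted_text(raw: str) -> str:
    """Hypothetical cleaner for text extracted from a PDF (illustration only)."""
    # Normalize compatibility characters, e.g. the ligature "ﬁ" becomes "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Drop control/format characters (form feeds, soft hyphens), keep newlines and tabs.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Re-join words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


sample = "Peer-to-peer elec-\ntronic   cash\x0c ﬁnance"
cleaned = clean_extracted_text(sample)
print(cleaned)
```

Regular hyphens inside a line (as in "Peer-to-peer") are left alone; only a hyphen directly followed by a line break is treated as a broken word.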


**Checking loaded documents and metadata**

After loading, let's quickly verify that the pipeline worked as expected:
- each document should have the required metadata fields
- the text extraction should have produced readable content (not empty / not garbled)
- the mapping between project_id and source_path should look correct

To avoid flooding the output, this cell prints the full metadata but only the first 100 characters of each document’s text:

In [4]:
for doc in docs:
    print("-" * 50)
    for key, value in doc.items():
        if key != "text":
            print(f"{key}:", value)
        else:
            # Show only a 100-character preview of the document text.
            print(f"{key}:\n{value[:100]}...")

--------------------------------------------------
document_class: raw_pdfs
project_id: bitcoin
text:
Bitcoin: A Peer-to-Peer Electronic Cash System
Satoshi Nakamoto
[EMAIL]
www.bitcoin.org
Abstract. A ...
source_path: data/raw_pdfs/bitcoin.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: ethereum_eip_150
text:
ETHEREUM: A SECURE DECENTRALISED GENERALISED TRANSACTION LEDGER
EIP-150 REVISION
DR. GAVIN WOOD
FOUN...
source_path: data/raw_pdfs/ethereum_eip_150.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: ripple
text:
The Ripple Protocol Consensus Algorithm
David Schwartz
[EMAIL]
Noah Youngs
[EMAIL]
Arthur Britto
[EM...
source_path: data/raw_pdfs/ripple.pdf
--------------------------------------------------
document_class: raw_pdfs
project_id: ethereum
text:
When Satoshi Nakamoto first set the Bitcoin blockchain into motion in January 2009, he was
simultane...
source_path: data/raw_pdfs/ethereum.pdf
--
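
The three checks listed above can also be automated instead of inspected by eye. The sketch below is an assumption about how that might be done: `validate_doc`, the 500-character minimum, and the 95% printable-character threshold are all hypothetical choices, and the file-name check relies on the pattern visible in the output (for example `project_id: bitcoin` with `data/raw_pdfs/bitcoin.pdf`).

```python
from pathlib import Path

REQUIRED_KEYS = {"document_class", "project_id", "text", "source_path"}


def validate_doc(doc: dict, min_chars: int = 500, min_printable: float = 0.95) -> list:
    """Return a list of problems; an empty list means the document looks fine."""
    missing = REQUIRED_KEYS - set(doc)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    text = doc["text"]
    if len(text) < min_chars:
        problems.append(f"text too short ({len(text)} chars)")
    if text:
        # Share of characters that are printable or whitespace (rough garbling check).
        printable = sum(ch.isprintable() or ch.isspace() for ch in text) / len(text)
        if printable < min_printable:
            problems.append(f"only {printable:.0%} printable characters")
    if Path(doc["source_path"]).stem != doc["project_id"]:
        problems.append("project_id does not match the PDF file name")
    return problems


good_doc = {
    "document_class": "raw_pdfs",
    "project_id": "bitcoin",
    "text": "Bitcoin: A Peer-to-Peer Electronic Cash System " * 20,
    "source_path": "data/raw_pdfs/bitcoin.pdf",
}
bad_doc = {**good_doc, "project_id": "ripple", "text": ""}

print(validate_doc(good_doc))
print(validate_doc(bad_doc))
```

Running `validate_doc` over every entry in `docs` would give a compact report without printing any document text.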

**Checking corpus size (text length per document)**

Before chunking and embedding, it's helpful to understand how large each document is. The following cell prints the character count of the cleaned text for each project. Large differences here can indicate:
- unusually short/empty extractions
- very large documents that may produce many chunks and take longer to embed

In [5]:
print("Text lengths:")
for doc in docs:
    print(doc["project_id"] + ":", len(doc["text"]))  

Text lengths:
bitcoin: 21145
ethereum_eip_150: 121380
ripple: 29796
ethereum: 84978
solana: 45804
aave: 27774
chainlink: 361960
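
To turn the two warning signs above into an automatic check, the lengths can be compared against the median. This is a sketch only: `flag_length_outliers` and its thresholds (`min_chars`, `max_ratio`) are hypothetical, while the character counts are copied from the output above.

```python
from statistics import median

# Character counts copied from the cell output above.
text_lengths = {
    "bitcoin": 21145,
    "ethereum_eip_150": 121380,
    "ripple": 29796,
    "ethereum": 84978,
    "solana": 45804,
    "aave": 27774,
    "chainlink": 361960,
}


def flag_length_outliers(lengths, min_chars=1_000, max_ratio=3.0):
    """Split project IDs into suspiciously short and unusually long documents."""
    typical = median(lengths.values())
    too_short = [pid for pid, n in lengths.items() if n < min_chars]
    too_long = [pid for pid, n in lengths.items() if n > max_ratio * typical]
    return too_short, too_long


too_short, too_long = flag_length_outliers(text_lengths)
print("median length:", median(text_lengths.values()))
print("too short:", too_short)
print("too long:", too_long)
```

With these thresholds, only the `chainlink` document is flagged as unusually long, and no extraction looks empty.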


### Step 2: Setting up the RAG

**Chunking all whitepapers into overlapping text segments (chunks)**

To prepare the corpus for Retrieval-Augmented Generation (RAG), each document is split into smaller pieces. Each chunk overlaps with the next one so that important context (e.g. definitions or sentences crossing a boundary) isn’t lost. The following cell creates the chunk objects from docs, prints how many chunks were created in total, and shows the available fields/metadata keys for one example chunk:

In [6]:
chunk_objects = create_chunk_objects(docs=docs)
print(f"Loaded {len(chunk_objects)} chunk objects")
if chunk_objects:
    print(chunk_objects[0].keys()) 

Loaded 229 chunk objects
dict_keys(['id', 'project_id', 'source', 'chunk_index', 'text'])
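
For illustration, overlapping chunking can be sketched as a sliding window over words. The code below is not the project's `create_chunk_objects`: `chunk_words`, `make_chunk_objects`, the word-based splitting, and the default sizes are assumptions. Only the output keys and the `bitcoin_0`-style IDs follow what the notebook prints.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def make_chunk_objects(doc, chunk_size=200, overlap=40):
    """Attach the metadata fields shown above to every chunk of one document."""
    return [
        {
            "id": f"{doc['project_id']}_{index}",
            "project_id": doc["project_id"],
            "source": doc["source_path"],
            "chunk_index": index,
            "text": piece,
        }
        for index, piece in enumerate(chunk_words(doc["text"], chunk_size, overlap))
    ]


demo_doc = {
    "project_id": "bitcoin",
    "source_path": "data/raw_pdfs/bitcoin.pdf",
    "text": " ".join(f"w{i}" for i in range(10)),
}
demo_chunks = make_chunk_objects(demo_doc, chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print(chunk["id"], "->", chunk["text"])
```

In the toy run, each chunk starts with the last word of the previous one, which is the overlap that keeps boundary-crossing sentences intact.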


**Inspecting a few chunks and their metadata**

Before embedding and storing the chunks in a vector database, it’s useful to check the output of chunking. This cell prints the first eleven chunks (`chunk_objects[:11]`), showing the metadata fields (id, project_id, source, chunk_index, ...) and the actual chunk text. This helps confirm that chunk boundaries look reasonable and that metadata is correctly attached for later citation/debugging:

In [7]:
for chunk in chunk_objects[:11]:
    print("-" * 50)
    for key, value in chunk.items():
        if key != "text":
            print(f"{key}:", value)
        else:
            print(f"{key}:\n{value}")

--------------------------------------------------
id: bitcoin_0
project_id: bitcoin
source: data/raw_pdfs/bitcoin.pdf
chunk_index: 0
text:
Bitcoin: A Peer-to-Peer Electronic Cash System Satoshi Nakamoto [EMAIL] www.bitcoin.org Abstract. A purely peer-to-peer version of electronic cash would allow online payments to be sent directly from one party to another without going through a financial institution. Digital signatures provide part of the solution, but the main benefits are lost if a trusted third party is still required to prevent double-spending. We propose a solution to the double-spending problem using a peer-to-peer network. The network timestamps transactions by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work. The longest chain not only serves as proof of the sequence of events witnessed, but proof that it came from the largest pool of CPU power. As long as a majority of CPU power is con

**Creating embeddings for all chunks**

Here each chunk’s text is converted into a dense vector representation (an embedding) using a SentenceTransformer model. These vectors capture semantic meaning, so similar chunks end up close together in embedding space. The resulting embeddings array contains one vector per chunk and will be uploaded to Qdrant for similarity search during retrieval:

In [8]:
embeddings = embed_chunks(chunk_objects)
embeddings

Batches: 100%|██████████| 8/8 [00:04<00:00,  1.76it/s]


array([[-0.10673528, -0.05583877, -0.10713287, ...,  0.02472721,
         0.07349703,  0.00038791],
       [-0.06282514,  0.01452515, -0.04318814, ...,  0.07220383,
         0.05585287,  0.00982241],
       [-0.11983434, -0.03005962, -0.05769655, ...,  0.04040922,
         0.04605623,  0.07250062],
       ...,
       [-0.07357167, -0.01645934, -0.09005164, ...,  0.06037074,
         0.01693998,  0.01973926],
       [-0.05211445, -0.03525576, -0.06252136, ...,  0.03355342,
         0.04933025, -0.01571748],
       [-0.06606366, -0.01918201, -0.0625135 , ...,  0.05776502,
         0.01472961, -0.00019222]], shape=(229, 384), dtype=float32)
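
"Close together in embedding space" is usually measured with cosine similarity. The following pure-Python sketch shows the idea on three toy 3-dimensional vectors instead of the real 384-dimensional ones. The function names are hypothetical, and whether the project's Qdrant collection actually uses cosine distance is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, top_k=2):
    """Return (index, score) pairs for the top_k vectors closest to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
ranking = most_similar([1.0, 0.0, 0.0], toy_vectors)
print(ranking)
```

The query ranks the identical vector first and the nearly parallel one second, while the orthogonal vector is left out.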

**Initializing the Qdrant vector database (in-memory)**

The Qdrant client is set up and a collection with the name `crypto_whitepapers` is created (or recreated). The collection configuration (especially the vector size) must match the embedding dimensionality, so we pass embeddings to determine the correct vector length. This prepares Qdrant to store and search our chunk vectors efficiently:

In [9]:
qdrant_client, COLLECTION = init_qdrant_collection(embeddings)

**Uploading chunks and embeddings to Qdrant**

Now the data is stored in the vector database. Each chunk becomes a Qdrant point with:
- a vector (the embedding) used for similarity search
- a payload (metadata + chunk text) used for filtering, tracing sources, and displaying results

After this step, the collection is ready for retrieval: given a query embedding, Qdrant can return the most semantically similar chunks:

In [10]:
upload_to_qdrant(qdrant_client, chunk_objects, embeddings, COLLECTION)

Uploaded 229 chunks.
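
The vector/payload structure of a point, and the requirement that every vector matches the collection's vector size, can be sketched without a running database. `build_points` below is a hypothetical helper, not the project's `upload_to_qdrant`. One detail it accounts for: Qdrant point IDs must be unsigned integers or UUIDs, so a string ID such as `bitcoin_0` is kept in the payload and the running index is used as the point ID (how the project handles this is not shown in the notebook).

```python
def build_points(chunks, vectors, expected_dim):
    """Pair each chunk with its embedding as a point-like dict (plain Python)."""
    if len(chunks) != len(vectors):
        raise ValueError("need exactly one vector per chunk")
    points = []
    for point_id, (chunk, vector) in enumerate(zip(chunks, vectors)):
        if len(vector) != expected_dim:
            raise ValueError(
                f"chunk {chunk['id']}: vector has {len(vector)} dims, expected {expected_dim}"
            )
        points.append(
            {
                "id": point_id,            # integer ID for the vector database
                "vector": list(vector),    # used for similarity search
                "payload": dict(chunk),    # metadata + text, incl. the string chunk id
            }
        )
    return points


toy_chunks = [
    {"id": "bitcoin_0", "project_id": "bitcoin", "chunk_index": 0, "text": "first chunk"},
    {"id": "bitcoin_1", "project_id": "bitcoin", "chunk_index": 1, "text": "second chunk"},
]
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
points = build_points(toy_chunks, toy_vectors, expected_dim=3)
print(points[0])
```

Validating dimensions before uploading surfaces a model/collection mismatch early instead of as a database error.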


### Step 3: LLM Pipeline (with optional finetuning)

**Defining an example user question**

The example question is simple but realistic and combines two intents:
- a textual explanation ("...explain Bitcoin...")
- a visual request ("...show me a chart of the Bitcoin supply over time?")

In [11]:
example_question = "Can you explain Bitcoin to me and can you show me a chart of the Bitcoin supply over time?"

**Analyzing the question (intent and scope)**

Before retrieval, a lightweight analysis is done to extract the high-level structure from the question:
- Which project(s) are mentioned or implied (e.g., Bitcoin)
- Question type / intent (explanation, comparison, definition, etc.)
- Whether a visual (diagram/chart) would likely improve the response

The output (question_analysis) guides later steps like prompting and optional image creation:

In [12]:
question_analysis = analyze_question(example_question)
question_analysis

{'projects': ['bitcoin'], 'type': 'tokenomics', 'needs_image': True}
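
How `analyze_question` works internally is not shown in this notebook. As a rough illustration, a purely rule-based version could look like the sketch below. The function name, the keyword lists, and the fallback type `"explanation"` are all assumptions; only the project list and the shape of the result dictionary come from the notebook.

```python
import re

AVAILABLE_PROJECTS = ["bitcoin", "ethereum", "uniswap_v2", "chainlink_v1", "aave_v1"]
IMAGE_HINTS = ("chart", "diagram", "image", "picture", "plot", "visual")
# Hypothetical keyword lists; the first matching type wins.
TYPE_KEYWORDS = {
    "comparison": ("compare", "versus", "difference"),
    "tokenomics": ("supply", "issuance", "token"),
}


def analyze_question_sketch(question, projects=AVAILABLE_PROJECTS):
    """Keyword-based stand-in for the question analysis step."""
    lowered = question.lower()
    words = set(re.findall(r"[a-z0-9]+", lowered))
    # Match on the project name without its version suffix, e.g. "aave" for "aave_v1".
    found = [p for p in projects if p.split("_")[0] in words]
    question_type = next(
        (label for label, keys in TYPE_KEYWORDS.items() if any(k in lowered for k in keys)),
        "explanation",
    )
    return {
        "projects": found,
        "type": question_type,
        "needs_image": any(hint in words for hint in IMAGE_HINTS),
    }


example = "Can you explain Bitcoin to me and can you show me a chart of the Bitcoin supply over time?"
result = analyze_question_sketch(example)
print(result)
```

For the example question, this sketch yields the same dictionary as the output above, but a keyword approach will miss implied projects and paraphrased intents.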

**Retrieving the most relevant context chunks (RAG)**

Now the vector database is queried to fetch the top 5 most semantically similar chunks for the question. These chunks act as "base context" for the LLM, so the model can answer using information from the corpus rather than relying purely on its general knowledge:

In [13]:
possible_answers = retrieve_rag(example_question, qdrant_client, COLLECTION, top_k=5)
possible_answers

[{'text': 'include a transaction giving themselves 25 BTC out of nowhere. Additionally, if any transaction has a higher total denomination in its inputs than in its outputs, the difference also goes to the miner as a "transaction fee". Incidentally, this is also the only mechanism by which BTC are issued; the genesis state contained no coins at all. ethereum.org In order to better understand the purpose of mining, let us examine what happens in the event of a malicious attacker. Since Bitcoin\'s underlying cryptography is known to be secure, the attacker will target the one part of the Bitcoin system that is not protected by cryptography directly: the order of transactions. The attacker\'s strategy is simple: 1. Send 100 BTC to a merchant in exchange for some product (preferably a rapid-delivery digital good) 2. Wait for the delivery of the product 3. Produce another transaction sending the same 100 BTC to himself 4. Try to convince the network that his transaction to himself was the o
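
To show how retrieved chunks can serve as context, the sketch below numbers them and places them in front of the question. `build_rag_prompt`, the instruction wording, and the per-chunk character limit are assumptions. The `[Chunk N]` labels mirror the citation style seen in the generated answers, and each retrieved item is assumed to expose its text under the `text` key as in the output above.

```python
def build_rag_prompt(question, chunks, max_chars_per_chunk=1500):
    """Assemble a prompt that asks the model to answer from numbered chunks only."""
    context_blocks = []
    for number, chunk in enumerate(chunks, start=1):
        snippet = chunk["text"][:max_chars_per_chunk]  # keep the prompt bounded
        context_blocks.append(f"[Chunk {number}] {snippet}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the chunks you used as [Chunk N].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


toy_chunks = [
    {"text": "The network timestamps transactions by hashing them into a chain."},
    {"text": "New coins are issued to miners as a block reward."},
]
prompt = build_rag_prompt("How are new coins issued?", toy_chunks)
print(prompt)
```

Numbering the chunks makes it possible to trace each claim in the answer back to a retrieved passage.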

**Generating an answer with the base model (meta-llama-3-8b-instruct: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)**

Using the retrieved chunks as context, an answer is generated with the baseline model.
This gives a reference output that can be compared against the "fine-tuned" model version in the following step. Note that the code only prints the final answer text:

In [14]:
answer_base = generate_answer(example_question, possible_answers, use_finetuned_model=False)
print(answer_base["answer_text"])

Okay, let’s break down Bitcoin and look at its supply over time.

**What is Bitcoin?**

Based on the provided chunks ([Chunk 1], [Chunk 5]), Bitcoin was created by Satoshi Nakamoto as a decentralized, peer-to-peer online currency. It doesn't rely on a central bank or any single issuer to maintain its value. A key part of Bitcoin’s design is a “proof-of-work” blockchain – this means that transactions are grouped into blocks and secured using complex calculations (mining). Miners compete to solve these calculations, and the winner gets rewarded with newly created Bitcoins and transaction fees.  It's essentially a first-to-file system for transaction ordering.

**Bitcoin Supply Over Time**

Unfortunately, the provided chunks do not contain a chart of Bitcoin’s supply over time. [Chunk 1] mentions that the genesis block contained no coins at all, and that new coins are issued through transaction fees and mining rewards. However, it doesn't provide specific data on how much has been created

**Generating an answer with the "fine-tuned" model (theia-llama-3.1-8b-v1: https://huggingface.co/QuantFactory/Theia-Llama-3.1-8B-v1-GGUF)**

Here the same answer is generated again, but this time using an alternative model that is treated as the "fine-tuned" option in this demo setup.
Comparing this output to the base model helps illustrate how model choice can affect:
- clarity and structure
- technical depth
- style and tone

while still using the same retrieved context:

In [15]:
answer_finetuned = generate_answer(example_question, possible_answers, use_finetuned_model=True)
print(answer_finetuned["answer_text"])

Okay, let’s break down Bitcoin and look at its supply over time.

**What is Bitcoin?**

Based on the provided chunks ([Chunk 1], [Chunk 5]), Bitcoin was created by Satoshi Nakamoto as a decentralized, peer-to-peer online currency. It doesn't rely on a central bank or any single issuer to maintain its value. A key part of Bitcoin’s design is a “proof-of-work” blockchain – this means that transactions are grouped into blocks and secured using complex calculations (mining). Miners compete to solve these calculations, and the winner gets rewarded with newly created Bitcoins and transaction fees.  It's essentially a first-to-file system for transaction ordering.

**Bitcoin Supply Over Time**

Unfortunately, the provided chunks do not contain a chart of Bitcoin’s supply over time. [Chunk 1] mentions that the genesis block contained no coins at all, and that new coins are issued through transaction fees and block rewards. However, it doesn't provide specific data on how the supply has changed
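
When two answers are as similar as the ones above, a word-level diff makes the differences easier to spot than reading both side by side. This is a small sketch using Python's standard `difflib` module. `changed_words` is a hypothetical helper, and the two sentences are shortened excerpts from the outputs above.

```python
import difflib


def changed_words(a: str, b: str) -> list:
    """Return words only in `a` (prefixed '- ') or only in `b` (prefixed '+ ')."""
    diff = difflib.ndiff(a.split(), b.split())
    # Keep additions and removals; skip unchanged words and '? ' hint lines.
    return [entry for entry in diff if entry.startswith(("- ", "+ "))]


base_excerpt = "new coins are issued through transaction fees and mining rewards."
tuned_excerpt = "new coins are issued through transaction fees and block rewards."
differences = changed_words(base_excerpt, tuned_excerpt)
print(differences)
```

Applied to `answer_base["answer_text"]` and `answer_finetuned["answer_text"]`, the same helper would list every word that changed between the two model outputs.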

**Optional: reviewing/improving the answer with a second LLM call**

This step runs an additional "review pass" over the draft answer. Generally, the goal is to:
- fix missing details or weak explanations
- improve readability and structure
- reduce contradictions or repetition

In other words: retrieval provides the content, and this review step helps fix issues and improve the presentation:

In [16]:
answer_reviewed = review_answer(example_question, possible_answers, draft_answer=answer_finetuned, use_llm=True, use_finetuned_model=True)
print(answer_reviewed["answer_text"])

Okay, let’s break down Bitcoin and explore its supply.

**What is Bitcoin?**

Based on the provided chunks ([Chunk 1], [Chunk 5]), Bitcoin was created by Satoshi Nakamoto as a decentralized, peer-to-peer online currency. It doesn't rely on a central bank or any single issuer to maintain its value. A key part of Bitcoin’s design is a “proof-of-work” blockchain – this means that transactions are grouped into blocks and secured using complex calculations (mining). Miners compete to solve these calculations, and the winner receives newly created Bitcoins and transaction fees. It's essentially a system for ordering transactions where the first to file gets priority.

**Bitcoin Supply Over Time**

Unfortunately, the provided chunks do not contain a chart of Bitcoin’s supply over time. [Chunk 1] states that the genesis block contained no coins at all, and that new coins are issued through transaction fees and block rewards. It doesn't provide specific data on how the supply has changed. Howev

### Step 4: Image generation

**Deciding whether to generate an image, then displaying it**

Based on the earlier question_analysis, the cell checks whether an image should be generated (e.g. a diagram or chart-style illustration). If so:
- a prompt suggestion for the image model is built
- the image is generated and the URL is printed
- the image is displayed in the notebook (via IPython.display)

**Note:** To run the following cell successfully, you'll need your own OpenAI API key. The notebook/function expects an OpenAI_API.txt file containing your API key on the first line. The repo only includes a placeholder (see OpenAI_API.txt); to run image generation locally, replace this placeholder with your own key. Also note that the returned image URL is a time-limited signed link: the `se=` parameter in its query string is the expiry timestamp, after which the link stops working. If the image is not visible in this notebook, please check the PNG `Crypto_Demo_Image.png`:

In [17]:
from IPython.display import Image, display

image_request = build_image_request(example_question, question_analysis, answer_reviewed)
if image_request.get("should_generate") is True:
    image_url = generate_tokenomics_image(image_request.get("prompt_suggestion"))
    print("Image URL:", image_url)
    display(Image(url=image_url))

Image URL: https://oaidalleapiprodscus.blob.core.windows.net/private/org-E3LNA03R1XJIPS20pdl0y5ce/user-wlzKhmKn9GQ4CBXJwxjqhqB4/img-gRte3w1NkkESBOzkkBWIVmTH.png?st=2026-01-09T09%3A28%3A05Z&se=2026-01-09T11%3A28%3A05Z&sp=r&sv=2024-08-04&sr=b&rscd=inline&rsct=image/png&skoid=0e2a3d55-e963-40c9-9c89-2a1aa28cb3ac&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2026-01-09T09%3A00%3A24Z&ske=2026-01-10T09%3A00%3A24Z&sks=b&skv=2024-08-04&sig=AC5FuCSPEIMCAlRnfhGq/InZjEfFWxcAmw1MlREZ/Xc%3D
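
Reading the API key from the first line of `OpenAI_API.txt`, as described in the note above, could be done along these lines. This is a sketch under assumptions: `load_api_key` is a hypothetical helper, and the placeholder string is invented for illustration because the repo's actual placeholder text is not shown here.

```python
import tempfile
from pathlib import Path

PLACEHOLDER = "YOUR_API_KEY_HERE"  # hypothetical; the real placeholder may differ


def load_api_key(path="OpenAI_API.txt"):
    """Return the key from the first line of the file, or raise if it is unusable."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    key = lines[0].strip() if lines else ""
    if not key or key == PLACEHOLDER:
        raise ValueError(f"{path} does not contain an API key on its first line")
    return key


# Demonstration with a temporary file instead of a real key file.
with tempfile.TemporaryDirectory() as tmp_dir:
    key_file = Path(tmp_dir) / "OpenAI_API.txt"
    key_file.write_text("sk-example-not-a-real-key\nignored second line\n", encoding="utf-8")
    loaded_key = load_api_key(key_file)

print(loaded_key)
```

Failing early with a clear message is friendlier than letting the image request fail later with an authentication error.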
