# Project Plan – LLM-Powered Crypto Whitepaper & Tokenomics Assistant

## 1. Project Idea 

We build an **LLM-based assistant** that helps users understand and compare **crypto projects and their tokenomics** by reading their **whitepapers and official documentation**.

The assistant will:
- Answer questions like *“How does staking work in Project X?”*, *“Compare the tokenomics of Project A and B”*, or *“What are the main risks mentioned in the docs?”*.
- Use **retrieval-augmented generation (RAG)** to ground answers in the actual documents and **cite sources**.
- Optionally **generate explanatory images/infographics** (e.g., token distribution diagrams) via a function-calling interface to an AI image generator.

**User interface (concept):**
- The user interacts with the system through a **conversational interface** in the notebook (text prompts).
- The system returns:
  - A **text answer** (structured explanation, comparisons, pros/cons, risks).
  - **References to specific document passages** used as evidence.
  - Optionally, a **generated image** that visualizes tokenomics or concepts.

---

## 2. Mapping to Course Requirements

We implement the following components:

1. **Retrieval Augmentation / RAG**
   - We create a **small document corpus** (e.g., 3–5 whitepapers / official docs of selected projects).
   - We build an **index** (e.g., using text chunks + embeddings).
   - For each user query, we:
     - Retrieve relevant document passages.
     - Pass them as context to the LLM.
     - Ask the LLM to **cite which passages** support the answer.

2. **Multi-step LLM Pipeline**
   - We design a pipeline with **multiple coordinated LLM calls**, for example:
     1. **Question analysis / router:**  
        Detect the project(s) and type of question (overview, tokenomics, risk, comparison).
     2. **Retrieval:**  
        Use the router’s decision to form search queries and retrieve chunks from the index.
     3. **Answer generation:**  
        Generate an answer that:
        - Uses the retrieved context,
        - Has a clear structure (e.g., overview / details / risks),
        - Includes citations.
     4. **Answer review (optional second LLM call):**  
        Check that:
        - The answer is grounded in the retrieved text,
        - Citations are present,
        - Style guidelines are followed (e.g., disclaimers, no financial advice).

3. **Function-calling Interface to an AI Image Generator**
   - We design a **tool/function** (e.g., `generate_tokenomics_image(prompt: str)`).
   - The LLM can decide to call this tool when:
     - The user explicitly asks for a visualization, or
     - The question is about tokenomics/flows where a diagram is helpful.
   - The tool calls an **external image generation API** (e.g., DALL·E / Stable Diffusion-like service).
   - The result (e.g., an image URL or path) is returned to the user.
   - The function-calling interface is implemented as part of the LLM tool set.

**Optional goals (if time allows):**
- Simple **evaluation** comparing:
  - Plain LLM (no documents) vs.
  - RAG-based system (with documents).
- Additional features like:
  - Query rewriting or re-ranking in RAG.
  - A tiny **synthetic fine-tuning** (e.g., style adaptation) on crypto Q&A.

---

## 3. System Architecture (High-Level)

### 3.1 Data / Knowledge Base

- **Corpus:**  
  A curated set of ~3–5 crypto projects. For each project, we collect:
  - Whitepaper or litepaper (converted to plain text).
  - Optionally short official documentation pages (e.g., tokenomics sections).
- **Preprocessing:**
  - Split documents into manageable **text chunks** (e.g., by sections/headings).
  - Compute **embeddings** for each chunk.
  - Store chunks + metadata in a simple **index** (e.g., in memory or a local file).
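
The heading-based chunking mentioned above could be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `chunk_by_headings`, the `section` field, the heading heuristic, and the `max_chars` limit are all assumptions made for the example.

```python
import re
from typing import Dict, List

# Heuristic (assumption): a heading is a Markdown-style "# Title" line or a
# numbered line such as "3.2 Token Distribution". Real whitepapers may need
# a different rule.
HEADING_RE = re.compile(r"^(#{1,6}\s+\S.*|\d+(?:\.\d+)*\.?\s+[A-Z].*)$")


def chunk_by_headings(text: str, project: str, doc_id: str,
                      max_chars: int = 1500) -> List[Dict[str, str]]:
    """Split plain text into chunks at heading lines, keeping metadata."""
    sections = []  # list of (heading, body) pairs
    heading, lines = "Preamble", []
    for line in text.splitlines():
        if HEADING_RE.match(line.strip()):
            # Close the previous section if it has any non-blank content.
            if any(l.strip() for l in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((heading, "\n".join(lines).strip()))

    chunks: List[Dict[str, str]] = []
    for heading, body in sections:
        # Over-long sections are cut into fixed-size character windows.
        for start in range(0, len(body), max_chars):
            chunks.append({
                "project": project,
                "doc_id": doc_id,
                "chunk_id": f"{doc_id}-{len(chunks)}",
                "section": heading,
                "text": body[start:start + max_chars],
            })
    return chunks


sample = "Intro text.\n# 1 Overview\nA token.\n# 2 Tokenomics\nSupply is fixed."
for chunk in chunk_by_headings(sample, "demo", "demo_wp"):
    print(chunk["chunk_id"], "|", chunk["section"], "|", chunk["text"])
```

Keeping the section heading in each chunk's metadata makes it possible to produce citations of the form "Project X whitepaper, section 3.2" later in the pipeline.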

### 3.2 Tools / Functions

We will implement Python functions (tools) that the LLM can call:

- `retrieve_documents(query, project_filter=None)`  
  → Returns top-k relevant text chunks + metadata.

- `list_supported_projects()`  
  → Returns the set of projects covered by the corpus.

- `generate_tokenomics_image(prompt: str)`  
  → Calls the external image generator and returns an image handle/URL.

These tools will be exposed to the LLM via function-calling.
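
One possible way to describe these tools to the LLM and to route its tool calls back to Python is sketched below. The JSON-Schema-style layout is a common convention, but the exact envelope depends on the LLM API that is chosen, so `TOOL_SPECS` and `dispatch_tool_call` are hypothetical names and the format is an assumption.

```python
import json
from typing import Any, Callable, Dict

# Tool descriptions in a JSON-Schema-like layout (assumed format; adapt to
# whatever the selected LLM API expects).
TOOL_SPECS = [
    {
        "name": "retrieve_documents",
        "description": "Return the top-k text chunks relevant to a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "project_filter": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["query"],
        },
    },
    {
        "name": "list_supported_projects",
        "description": "Return the projects covered by the corpus.",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "generate_tokenomics_image",
        "description": "Generate a tokenomics diagram and return its URL or path.",
        "parameters": {
            "type": "object",
            "properties": {"prompt": {"type": "string"}},
            "required": ["prompt"],
        },
    },
]


def dispatch_tool_call(name: str, arguments_json: str,
                       registry: Dict[str, Callable[..., Any]]) -> str:
    """Run the Python function behind a tool call and return a JSON string."""
    if name not in registry:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json or "{}")
        result = registry[name](**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names are reported back to the LLM.
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


# Usage with a stand-in implementation:
registry = {"list_supported_projects": lambda: ["bitcoin", "ethereum"]}
print(dispatch_tool_call("list_supported_projects", "{}", registry))
```

Returning errors as JSON instead of raising lets the model see what went wrong and retry with corrected arguments.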

### 3.3 LLM Pipeline Steps

1. **Question Analysis (LLM Call 1)**
   - Input: user question + list of known projects + tool descriptions.
   - Output:
     - Detected project(s),
     - Question type/category,
     - Decision whether an image might be useful.

2. **Retrieval (Python + possibly LLM assistance)**
   - Use `retrieve_documents(...)` with:
     - The user question,
     - Optional project filter from step 1.
   - Produce a set of relevant chunks.

3. **Answer Generation (LLM Call 2)**
   - Input: user question + retrieved chunks.
   - Output: answer containing:
     - Structured explanation,
     - References to specific chunks (e.g., “Source: Project X whitepaper, section 3.2”).

4. **Answer Review (LLM Call 3, optional)**
   - Input: question + retrieved chunks + draft answer.
   - Output:
     - Refined answer,
     - Confirmation that citations and a short disclaimer are present.

5. **Image Generation (Tool Call from LLM)**
   - When applicable, the LLM:
     - Constructs a **visual description** of the tokenomics/flow.
     - Calls `generate_tokenomics_image(prompt=...)`.
   - The notebook shows the resulting image after the call.
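
For the answer-generation step, one way to make citations checkable is to number the retrieved chunks in the prompt and map the `[n]` markers in the model output back to chunk metadata. The sketch below assumes that citation convention; `build_answer_prompt` and `extract_citations` are hypothetical helper names, and the prompt wording is only an example.

```python
import re
from typing import Dict, List


def build_answer_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Build a prompt that lists numbered sources and asks for [n] citations."""
    sources = "\n\n".join(
        f"[{i}] ({c['project']} / {c['doc_id']} / {c['chunk_id']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources inline as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\n"
    )


def extract_citations(answer_text: str,
                      chunks: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the chunks referenced by [n] markers, ignoring invalid numbers."""
    cited: List[Dict[str, str]] = []
    for number in re.findall(r"\[(\d+)\]", answer_text):
        idx = int(number) - 1
        if 0 <= idx < len(chunks) and chunks[idx] not in cited:
            cited.append(chunks[idx])
    return cited


chunks = [
    {"project": "demo", "doc_id": "demo_wp", "chunk_id": "demo_wp-0", "text": "Supply is fixed."},
    {"project": "demo", "doc_id": "demo_wp", "chunk_id": "demo_wp-1", "text": "Stakers earn fees."},
]
print(build_answer_prompt("How does staking work?", chunks))
print(extract_citations("Stakers earn fees [2].", chunks))
```

Markers that point outside the retrieved set are dropped here; the review step could instead treat them as a sign of an ungrounded answer.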

---

## 4. Implementation Plan 

We plan to implement the project in the following phases:

1. **Phase 0 – Environment Setup**
   - Set up Python environment and required libraries in VS Code.
   - Create the initial Jupyter notebook structure.

2. **Phase 1 – Data Collection & Preprocessing**
   - Select 3–5 crypto projects.
   - Download and convert whitepapers/docs to text.
   - Implement chunking and embedding computation.
   - Build a simple index (e.g., list or vector store).

3. **Phase 2 – Basic RAG Prototype**
   - Implement `retrieve_documents(query, project_filter)`.
   - Build a simple “RAG answer” function:
     - Single LLM call with user question + top-k chunks.
   - Test with a few example queries.

4. **Phase 3 – Multi-step LLM Pipeline**
   - Implement the **question analysis** step.
   - Integrate it with retrieval and answer generation.
   - Add the optional **reviewer** step.
   - Ensure that answers include **citations**.

5. **Phase 4 – Image Generation Tool**
   - Implement `generate_tokenomics_image(prompt)`.
   - Integrate it as a function/tool the LLM can call.
   - Create prompts for typical tokenomics diagrams.
   - Test with queries that request visualizations.

6. **Phase 5 – Evaluation, Documentation & Presentation Prep**
   - Create a small set of test questions.
   - Compare:
     - Plain LLM vs. RAG-enhanced pipeline (qualitative evaluation).
   - Document:
     - Architecture,
     - Implementation details,
     - Limitations (e.g., small corpus, no financial advice),
     - Potential improvements.
   - Prepare example interactions for the final presentation.

---

## 5. Limitations

- **Small, curated corpus:**  
  We only support the selected projects; the assistant cannot answer questions about projects outside the corpus.
- **No financial advice:**  
  The system is meant for **educational explanations** of documentation, not investment recommendations.
- **Approximate understanding:**  
  LLMs can still misinterpret nuanced technical details, so we:
  - Ground answers in retrieved text,
  - Add disclaimers,
  - Emphasize that answers are based on the documents provided.

---

# Structure

# Project Overview & Work Split

We build an **LLM-based assistant** that explains and compares crypto projects using their whitepapers/docs, with **RAG**, a **multi-step pipeline**, an **image-generation tool**, and an **optional fine-tuning step**.

---

### Team & Main Components

- **Person 1 – Corpus & Preprocessing** (`src/corpus.py`)
- **Person 2 – Index & Retrieval (RAG core)** (`src/indexer.py`)
- **Person 3 – LLM Pipeline (+ Optional Fine-Tuning)** (`src/pipeline.py`, `src/fine_tuning.py`)
- **Person 4 – Image Tool & Notebook Integration** (`src/imaging.py`, `notebook/main.ipynb`)

---

### End-to-End Flow (High-Level)

User question  
→ **1. Question analysis (LLM / optionally fine-tuned classifier)**  
→ **2. Retrieve relevant document chunks (RAG)**  
→ **3. Generate answer with citations (LLM)**  
→ **4. Optional review/refinement (LLM)**  
→ **5. Optional image generation tool call**  
→ **Final response: text answer + sources + (optional) image**

---

# Component Breakdown & Responsibilities

### 1. Corpus & Preprocessing  
**Owner:** Person 1  
**File:** `src/corpus.py`

**Goal:** Turn PDFs/docs for **3–5 crypto projects** into clean text documents with metadata.

**Steps / Tasks:**
- Select projects (e.g. Bitcoin, Ethereum, Uniswap, Chainlink, Aave, …)
- Download **official whitepapers/docs** and save to `data/raw_pdfs/`
- Implement functions to:
  - Load PDFs / HTML and extract text
  - Clean text (remove headers/footers, duplicated content, weird spacing)
  - Attach metadata (for each document), e.g.:  
    - `project` (e.g. `"bitcoin"`)  
    - `doc_id` (e.g. `"btc_whitepaper"`)  
    - `source_path`
- Return a list of document dicts, e.g.:  
  ```python
  [
      {"project": "bitcoin", "doc_id": "btc_whitepaper", "text": "..."},
      {"project": "ethereum", "doc_id": "eth_whitepaper", "text": "..."},
      ...
  ]
  ```

**Deliverables:**
- Function `load_corpus()` returning a list of documents with metadata
- (Optional but helpful) Saved cleaned text files in `data/texts/`
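
A minimal sketch of `load_corpus()` is shown below. It assumes the documents have already been converted to plain text and stored as `data/texts/<project>/<doc_id>.txt`; that directory layout, the `text_dir` parameter, and the `clean_text` helper are assumptions for illustration, and PDF/HTML extraction is left out.

```python
import re
from pathlib import Path
from typing import Dict, List


def clean_text(raw: str) -> str:
    """Normalise whitespace and undo line-break hyphenation."""
    text = raw.replace("\r\n", "\n")
    # Re-join words split across lines ("decen-\ntralized"). Caveat: this also
    # joins genuine compounds that happen to break at the hyphen.
    text = re.sub(r"-\n(?=[a-z])", "", text)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()


def load_corpus(text_dir: str = "data/texts") -> List[Dict[str, str]]:
    """Load every <project>/<doc_id>.txt file as a document dict."""
    documents = []
    for path in sorted(Path(text_dir).glob("*/*.txt")):
        documents.append({
            "project": path.parent.name,   # folder name, e.g. "bitcoin"
            "doc_id": path.stem,           # file name, e.g. "btc_whitepaper"
            "source_path": str(path),
            "text": clean_text(path.read_text(encoding="utf-8")),
        })
    return documents


if __name__ == "__main__":
    for doc in load_corpus():
        print(doc["project"], doc["doc_id"], len(doc["text"]))
```

Deriving `project` and `doc_id` from the path keeps the metadata in one place, so adding a project only requires adding a folder.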

---

### 2. Index & Retrieval (RAG core)  
**Owner:** Person 2  
**File:** `src/indexer.py`

**Goal:** Build a searchable **vector index** over all document chunks.

**Steps / Tasks:**
- Take the documents from `corpus.py` (`load_corpus()`)
- Implement **chunking**:
  - Split each full document (`text`) into smaller chunks (e.g. 300–800 tokens or by headings)
  - For each chunk, keep metadata: `project`, `doc_id`, `chunk_id`, and the chunk `text`
- Compute **embeddings** for each chunk using an embedding model
- Store chunks + embeddings in a simple Python index object (e.g. a dict)
- Implement:
  - `build_index(documents: List[Dict[str, Any]]) -> Dict[str, Any]`
    (build and return an index object or structure)
  - `retrieve(question: str, index: Dict[str, Any], project_filter: Optional[List[str]] = None, k: int = 5) -> List[Dict[str, Any]]`
    (returns top-k chunks, each with metadata like `project`, `doc_id`, `chunk_id`, `text`, `score`)
- Create small tests / demo code to:
  - Run a sample query
  - Print the top retrieved chunks

**Deliverables:**
- `build_index()` and `retrieve()` functions
- Short description / docstring of how many projects & chunks the index contains
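
The two functions could look roughly like the sketch below. To stay self-contained it replaces the real embedding model with a toy hashed bag-of-words vector, and it chunks by a fixed word window; `embed`, `chunk_document`, `DIM`, and `max_words` are assumptions for illustration, and only the `build_index` / `retrieve` signatures follow the plan.

```python
import math
import re
import zlib
from typing import Any, Dict, List, Optional

DIM = 512  # size of the toy embedding vector (assumption)


def embed(text: str) -> List[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk_document(doc: Dict[str, Any], max_words: int = 200) -> List[Dict[str, Any]]:
    """Split one document into fixed-size word windows with metadata."""
    words = doc["text"].split()
    return [
        {"project": doc["project"], "doc_id": doc["doc_id"],
         "chunk_id": f"{doc['doc_id']}-{i // max_words}",
         "text": " ".join(words[i:i + max_words])}
        for i in range(0, len(words), max_words)
    ]


def build_index(documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Chunk all documents and store one vector per chunk."""
    chunks = [c for d in documents for c in chunk_document(d)]
    return {"chunks": chunks, "vectors": [embed(c["text"]) for c in chunks]}


def retrieve(question: str, index: Dict[str, Any],
             project_filter: Optional[List[str]] = None,
             k: int = 5) -> List[Dict[str, Any]]:
    """Return the k chunks with the highest cosine similarity to the question."""
    query = embed(question)
    scored = []
    for chunk, vec in zip(index["chunks"], index["vectors"]):
        if project_filter and chunk["project"] not in project_filter:
            continue
        # Vectors are unit length, so the dot product is the cosine similarity.
        score = sum(a * b for a, b in zip(query, vec))
        scored.append({**chunk, "score": score})
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:k]


docs = [
    {"project": "bitcoin", "doc_id": "btc", "text": "Miners receive block rewards for proof of work."},
    {"project": "uniswap", "doc_id": "uni", "text": "Liquidity providers earn swap fees from pools."},
]
index = build_index(docs)
for hit in retrieve("Who earns swap fees?", index, k=2):
    print(hit["project"], hit["chunk_id"], round(hit["score"], 3))
```

Swapping `embed` for a call to a real embedding model leaves `build_index` and `retrieve` unchanged, which keeps the interface to `pipeline.py` stable.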

---

### 3. LLM Pipeline (Required)
**Owner:** Person 3  
**File:** `src/pipeline.py`

**Goal:** Orchestrate multiple LLM calls + retrieval into a coherent answer pipeline.

**Core Steps / Tasks:**

1. **Question Analysis**
   - Implement `analyze_question(question: str, available_projects: list[str]) -> dict`
   - Output example:
     - `{"projects": ["bitcoin", "ethereum"], "type": "comparison", "needs_image": False}`
   - Tasks:
     - Detect which project(s) are mentioned or implied
     - Classify question type (overview / tokenomics / risk / comparison / other)
     - Decide if an image could be helpful (`needs_image` flag)

2. **Retrieval**
   - Implement `retrieve_for_question(question: str, analysis: dict, index) -> list[dict]`
   - Tasks:
     - Use `analysis["projects"]` as a filter (if not empty)
     - Call `indexer.retrieve(...)` with appropriate query and `k`
     - Return a list of relevant chunks

3. **Answer Generation**
   - Implement `generate_answer(question: str, retrieved_chunks: list[dict]) -> dict`
   - Output structure, for example:
     - `"answer_text"`: final answer as string  
     - `"citations"`: list of references (e.g. `[(project, doc_id, chunk_id), ...]`)  
     - `"raw_model_output"`: (optional) raw LLM response
   - Tasks:
     - Provide the question + retrieved chunks to the LLM.
     - Instruct the LLM to:
       - Use only the given context for claims about projects
       - Produce a clear structure (e.g. Summary / Details / Risks / Sources)
       - Insert citations that map back to chunk metadata

4. **Optional Answer Review**
   - Implement `review_answer(question: str, retrieved_chunks: list[dict], draft_answer: dict) -> dict`
   - Tasks:
     - Second LLM call that:
       - Checks grounding (is the answer consistent with the chunks?)
       - Improves clarity / structure
       - Adds or enforces disclaimer (e.g. “no financial advice”)
   - Return a refined answer object (same structure as `generate_answer`)

5. **Pipeline Wrapper**
   - Implement `pipeline_answer(question: str, index, available_projects: list[str]) -> dict`
   - Tasks:
     - Call `analyze_question(...)`
     - Call `retrieve_for_question(...)`
     - Call `generate_answer(...)`
     - Optionally call `review_answer(...)`
     - Return a result dict used directly by the notebook, including:
       - Final answer text
       - Citations
       - Any flags / metadata (e.g. whether an image is suggested)

**Deliverables:**
- Working `pipeline_answer(...)` function
- Docstrings explaining each step and how they connect
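
The steps above can be wired together roughly as in the sketch below. To keep it runnable without an API key, the LLM and the retriever are passed in as callables; the extra `llm`, `retrieve`, and `review` parameters, the prompt wording, and the `DISCLAIMER` text are assumptions for illustration and go beyond the signature given in the plan.

```python
import json
from typing import Any, Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

DISCLAIMER = "This is an educational summary of the documentation, not financial advice."


def analyze_question(question: str, available_projects: List[str], llm: LLM) -> Dict[str, Any]:
    """Ask the LLM for a JSON analysis; fall back to safe defaults on bad output."""
    fallback = {"projects": [], "type": "other", "needs_image": False}
    prompt = (
        "Return JSON with keys projects (list), type, needs_image (bool).\n"
        f"Known projects: {available_projects}\nQuestion: {question}"
    )
    try:
        parsed = json.loads(llm(prompt))
    except json.JSONDecodeError:
        return fallback
    if not isinstance(parsed, dict):
        return fallback
    analysis = {**fallback, **parsed}
    projects = analysis["projects"] if isinstance(analysis["projects"], list) else []
    # Drop project names that are not in the corpus.
    analysis["projects"] = [p for p in projects if p in available_projects]
    return analysis


def pipeline_answer(question: str, index: Any, available_projects: List[str],
                    llm: LLM, retrieve: Callable[..., List[Dict[str, Any]]],
                    review: bool = False) -> Dict[str, Any]:
    """Run analysis, retrieval, generation and (optionally) review."""
    analysis = analyze_question(question, available_projects, llm)
    chunks = retrieve(question, index, project_filter=analysis["projects"] or None, k=5)
    context = "\n\n".join(f"[{c['chunk_id']}] {c['text']}" for c in chunks)
    answer = llm(
        "Answer using only this context and cite chunk ids in brackets.\n"
        f"{context}\n\nQuestion: {question}"
    )
    if review:
        answer = llm(
            "Check this draft against the context and fix unsupported claims.\n"
            f"{context}\n\nDraft: {answer}"
        )
    if DISCLAIMER not in answer:
        answer = f"{answer}\n\n{DISCLAIMER}"
    return {
        "answer_text": answer,
        # Only chunks whose id actually appears in the answer count as citations.
        "citations": [(c["project"], c["doc_id"], c["chunk_id"])
                      for c in chunks if f"[{c['chunk_id']}]" in answer],
        "analysis": analysis,
    }
```

Passing the LLM in as a plain callable makes each step easy to test with canned responses before connecting a real model.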

---

### 3b. Optional Fine-Tuning (if time allows)  
**Owner:** Person 3  
**File:** `src/fine_tuning.py` (plus small changes in `src/pipeline.py`)

**Goal (Option A):** Fine-tune a **question classifier** to make Step 1 (question analysis) more consistent and robust.

**Steps / Tasks:**

1. **Create Training Data (Synthetic or Semi-Manual)**
   - Decide on labels:
     - `project` label: e.g. `"bitcoin"`, `"ethereum"`, `"uniswap"`, `"multi"`, `"unknown"`
     - `type` label: `"overview"`, `"tokenomics"`, `"risk"`, `"comparison"`, `"other"`
   - Build a small dataset of examples:
     - User-style questions mapped to `(project_label, type_label)`
   - Save dataset as JSONL/CSV for fine-tuning

2. **Run Fine-Tuning**
   - Implement `run_fine_tuning_job(training_data_path: str) -> str`
     - Starts a fine-tuning job via API/client
     - Returns a `model_id` of the fine-tuned classifier
   - Document:
     - Base model used
     - Basic settings (epochs, etc.)
     - Final `model_id`

3. **Use Fine-Tuned Classifier in Pipeline**
   - Implement `classify_question_with_finetuned_model(question: str, available_projects: list[str]) -> dict`
     - Returns same structure as `analyze_question(...)`
   - Update `pipeline.py`:
     - Add a config flag or parameter to switch between:
       - Base model `analyze_question(...)`, and
       - Fine-tuned classifier `classify_question_with_finetuned_model(...)`
   - In the notebook, show a small comparison:
     - The same questions analyzed by base vs. fine-tuned classifier

**Deliverables:**
- `fine_tuning.py` with training + inference helpers
- Example outputs and short discussion in the notebook:
  - How fine-tuning was done
  - Before/after examples of question analysis

*(This part is optional and should not block the main pipeline.)*
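
Step 1 (creating the training data) could start from a small helper like the one below, which validates labels and writes one JSON object per line. The record layout (`question`, `project`, `type`) and the name `write_training_data` are assumptions; the fine-tuning provider that is eventually chosen will likely require its own JSONL schema, so a conversion step may be needed.

```python
import json
from pathlib import Path
from typing import Iterable, Tuple

# Label sets taken from the examples in the plan; extend as projects are added.
PROJECT_LABELS = {"bitcoin", "ethereum", "uniswap", "multi", "unknown"}
TYPE_LABELS = {"overview", "tokenomics", "risk", "comparison", "other"}


def write_training_data(examples: Iterable[Tuple[str, str, str]], path: str) -> int:
    """Validate (question, project_label, type_label) triples and write JSONL.

    Returns the number of records written; raises ValueError on unknown labels.
    """
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for question, project, qtype in examples:
            if project not in PROJECT_LABELS or qtype not in TYPE_LABELS:
                raise ValueError(f"unknown label in example: {(question, project, qtype)}")
            record = {"question": question, "project": project, "type": qtype}
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    examples = [
        ("How does staking work in Ethereum?", "ethereum", "tokenomics"),
        ("Compare Bitcoin and Ethereum supply models.", "multi", "comparison"),
    ]
    print(write_training_data(examples, "train.jsonl"), "records written")
```

Validating labels at write time catches typos before a fine-tuning job is started.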

---

### 4. Image Generation Tool & Notebook Integration  
**Owner:** Person 4  
**Files:** `src/imaging.py`, `notebook/main.ipynb`

**Goal:** Allow the LLM to call an **image-generation tool** to produce tokenomics/flow diagrams.

**Steps / Tasks:**

1. **Image Tool Implementation**
   - Implement `generate_tokenomics_image(prompt: str) -> str`
     - Calls an external image generation API
     - Returns an **image URL or file path**
   - Handle basic errors (API not available, etc.)
   - Optionally save images in `data/images/`

2. **Tool Integration with LLM**
   - Define `generate_tokenomics_image` as an **LLM tool / function-call**
   - Decide integration logic:
     - If `analysis["needs_image"]` is `True`, or
     - If the user explicitly asks for a visualization (keywords like "draw", "diagram", "visualize")
   - Let the LLM construct a visual prompt using:
     - The user question
     - Key information from retrieved chunks
     - The structure of the final answer

3. **Notebook & Demo Integration**
   - In `notebook/main.ipynb`:
     - Load corpus & index
     - Define a simple interface to:
       - Enter or select a question
       - Call `pipeline_answer(...)`
       - Optionally trigger image generation (automatically or via a flag)
     - Display:
       - Final answer text
       - Citations (e.g. short listing of sources / chunk IDs)
       - Generated image inline (if possible) or as a link

**Deliverables:**
- `imaging.py` with a working image tool function
- A clear demo in the notebook showing:
  - Question → pipeline → answer + sources + (optional) image
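
The trigger logic and the error handling for the image tool might be sketched as follows. The external image API is not fixed in this plan, so it is represented by a `backend` callable; that parameter, the helper `wants_image`, and the choice to return an empty string on failure are assumptions for illustration.

```python
import re
from typing import Any, Callable, Dict, Optional

# Keywords from the plan; the word boundary avoids matching "withdraw".
VISUAL_RE = re.compile(r"\b(draw|diagram|visuali[sz]e)", re.IGNORECASE)


def wants_image(question: str, analysis: Dict[str, Any]) -> bool:
    """Decide whether to call the image tool for this question."""
    if analysis.get("needs_image"):
        return True
    return bool(VISUAL_RE.search(question))


def generate_tokenomics_image(prompt: str,
                              backend: Optional[Callable[[str], str]] = None) -> str:
    """Return an image URL or file path, or "" if no image could be produced.

    `backend` stands in for the real image-generation API call (hypothetical).
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if backend is None:
        return ""  # no image service configured
    try:
        return backend(prompt)
    except Exception as exc:  # API unavailable, quota exceeded, etc.
        print(f"Image generation failed: {exc}")
        return ""


# Usage with a stand-in backend that only pretends to create a file:
question = "Please draw the token distribution of Project X"
if wants_image(question, {"needs_image": False}):
    path = generate_tokenomics_image(
        "Pie chart of token distribution", backend=lambda p: "data/images/demo.png"
    )
    print(path or "no image available")
```

Returning an empty string on failure lets the notebook still show the text answer and sources when the image service is down.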