# DocsGPT API Notebook

This notebook demonstrates the capabilities of [DocsGPT](https://github.com/arc53/DocsGPT), an open-source Retrieval-Augmented Generation (RAG) system designed to query documents using local LLMs (powered by `llama-cpp-python`).

Each section explains:
- What the function does
- Why and when to use it
- How to implement it in a modular, API-style workflow

## Description

- Loads and configures a local LLaMA model using `llama-cpp-python`.
- Ingests a variety of document formats via REST API or CLI (`PDF`, `CSV`, `DOCX`, etc.).
- Allows natural language queries over the ingested documents.
- Demonstrates manual prompt injection with retrieved chunks.
- Provides an interactive Q&A chat loop with memory.
- Includes error handling and response parsing.
- Aligns with modular API tutorial practices (as in LangChain and Causify).


## References

- [DocsGPT GitHub Repository](https://github.com/arc53/DocsGPT)
- [Jupyter Notebook Style Guide – Causify](https://github.com/causify-ai/helpers/blob/master/docs/coding/all.jupyter_notebook.how_to_guide.md)


## Citation

- Arc53, DocsGPT: https://github.com/arc53/DocsGPT


## Notebook Naming Convention

This notebook follows the recommended template and naming style for API tutorials:

**`BitcoinLLMQA.API.ipynb`**


## Autoreload (for development)

In [45]:
%load_ext autoreload
%autoreload 2

## Imports

In [46]:
import requests
from llama_cpp import Llama
import logging

##  Configuration & Logger

In [47]:
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load a Local LLM with `llama-cpp-python`

We start by loading a quantized LLaMA model using the `llama_cpp` Python API. This model will power all downstream document question-answering.

DocsGPT can run locally without cloud dependencies when paired with quantized models, which makes this setup suitable for secure, offline LLM-powered document search.

In [48]:
from llama_cpp import Llama

model_path = "llama-2-7b-chat.Q4_K_M.gguf"
llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_threads=6,
    n_gpu_layers=1
)

llama_model_load_from_file_impl: using device Metal (Apple M3) - 10277 MiB free
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_

## Initializing DocsGPT with a Local Model

The following cells run an initial test to confirm that the local model is loaded and ready to use. This step ensures that the model is properly configured and operational before proceeding with further queries or tasks.


In [None]:
context = """
Bitcoin dataset summary:
- Avg. price: $30,000
- Volatility: High on weekends
- Volume peak: 1.2M on Jan 5, 2024
"""

In [18]:
question = "What was the highest volume recorded and when did it occur?"

# Construct the full prompt (manual simulation of RAG retrieval)
prompt = f"""
You are an assistant trained to answer questions using the context below.

Context:
{context}

Question: {question}
Answer:
"""


### Setting Parameters and Variables

This cell defines several parameters and variables used throughout the notebook:

- `stop`: A list of tokens used to stop the LLM's response generation.
- `temperature`: A float value controlling the randomness of the LLM's output.
- `max_tokens`: The maximum number of tokens the LLM is allowed to generate in response.
- `prompt`: A string that includes both the context and the question. It's the full message sent to the LLM.
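
To make the effect of `stop` concrete, the sketch below shows in plain Python how a stop list cuts a completion at the earliest stop string. The `apply_stop` helper and the sample text are hypothetical and only mimic the observable behaviour; this is not how `llama-cpp-python` implements stopping internally.

```python
def apply_stop(text, stop):
    """Cut `text` at the earliest occurrence of any stop string (illustration only)."""
    cut = len(text)
    for marker in stop:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Made-up raw completion, used only to illustrate the truncation.
raw = " The peak volume was 1.2M on Jan 5, 2024.\nQuestion: What about price?"
print(apply_stop(raw, ["\n"]))
```

With `stop=["\n"]`, an answer that spans several lines is returned only up to its first line break, which is worth keeping in mind when a response looks cut short.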

In [30]:
response = llm(prompt, max_tokens=250, stop=["\n"], temperature=0.7)

Llama.generate: 540 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =   24284.62 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =   71824.71 ms /    19 runs   ( 3780.25 ms per token,     0.26 tokens per second)
llama_perf_context_print:       total time =   71841.71 ms /    20 tokens


### Extracting and Displaying the Answer

The following code extracts the assistant's response from the LLM's output and prints both the question and the corresponding answer:

- `answer`: Extracts the text of the response from the LLM's output and removes any leading or trailing whitespace.
- `print("Q:", question)`: Displays the question asked.
- `print("A:", answer)`: Displays the assistant's answer.


In [None]:
answer = response['choices'][0]['text'].strip()
print("Q:", question)
print("A:", answer)

Q: What was the highest volume recorded and when did it occur?
A: Based on the provided context, the highest volume recorded for Bitcoin was 1.2 million on January 5, 2024.
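
The extraction step can be made a little more defensive. The following is a minimal sketch, assuming the response keeps the `choices[0]["text"]` layout used above; `extract_answer` is a hypothetical helper, and the dictionaries are hand-written stand-ins for real model output.

```python
def extract_answer(response):
    """Return the stripped text of the first choice, or None if it is missing or empty."""
    try:
        text = response["choices"][0]["text"]
    except (KeyError, IndexError, TypeError):
        return None
    text = text.strip() if isinstance(text, str) else ""
    return text or None


# Hand-written dictionaries standing in for real `llm(...)` outputs.
ok = {"choices": [{"text": "  1.2M on Jan 5, 2024.  "}]}
empty = {"choices": []}
print(extract_answer(ok))
print(extract_answer(empty))
```

Returning `None` instead of raising lets the calling cell decide whether to retry, log a warning, or show a fallback message.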


## Manual Prompt Injection for Local LLM

This section demonstrates how to manually compose prompts using document chunks.

In [None]:
context = """The document contains time series data for Bitcoin transactions..."""

question = "What was the highest volume recorded and when did it occur?"

prompt = f"""
Context:
{context}

Question: {question}
"""

response = llm(prompt, max_tokens=500, stop=["\n"], temperature=0.7)
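
When several retrieved chunks are available, the same prompt can be assembled programmatically. The sketch below is one possible approach, not DocsGPT's own retrieval code: `build_prompt` and its `max_chars` budget are hypothetical, and the sample chunks reuse lines from the dataset summary earlier in this notebook.

```python
def build_prompt(chunks, question, max_chars=1500):
    """Join retrieved chunks into a numbered context block and append the question."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stop adding chunks once the character budget is spent
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


chunks = ["Avg. price: $30,000", "Volume peak: 1.2M on Jan 5, 2024"]
prompt = build_prompt(chunks, "When was the volume peak?")
print(prompt)
```

Numbering the chunks makes it easier to ask the model which passage supports its answer, and the character budget is a rough guard against overflowing the context window.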

# Interactive Chat with DocsGPT

The following code creates an interactive chat loop using the local model. Each prompt consists of a fixed instruction followed by the accumulated chat history, so earlier turns provide context for subsequent questions. No document text is injected in this loop, so answers rely on the conversation and on the model's own knowledge.

- **`chat_history`**: Stores the conversation history between the user and the assistant.
- **`user_input`**: Captures the user's input from the console.
- **`prompt`**: Combines the chat history with a predefined instruction to generate a response.
- **`response`**: The assistant's reply generated by the LLM.
- **`stop`**: A list of tokens used to terminate the assistant's response generation.

The loop continues until the user provides an empty input, at which point the chat ends with a goodbye message.

In [47]:
chat_history = ""
while True:
    user_input = input("You: ").strip()
    if not user_input:
        print("\n Exiting chat. Goodbye!")
        break

    chat_history += f"### User: {user_input}\n### Assistant:"
    prompt = (
        "You are a helpful assistant answering questions using uploaded documents.\n"
        f"{chat_history}"
    )
    response = llm(prompt, max_tokens=500, stop=["\n", "###"], temperature=0.7)
    answer = response['choices'][0]['text'].strip()
    print(f"User: {user_input}")
    print(f"Assistant: {answer}\n")
    chat_history += f" {answer}\n"


You: What is the stock market price trend


Llama.generate: 30 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =   17552.35 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    5135.51 ms /    40 runs   (  128.39 ms per token,     7.79 tokens per second)
llama_perf_context_print:       total time =    5144.44 ms /    41 tokens


User: What is the stock market price trend
Assistant: I'm just an AI, I don't have real-time access to current stock market data. However, I can provide you with some general information on stock market trends.

You: you can take reference from this website https://tradingeconomics.com/united-states/stock-market


Llama.generate: 70 prefix-match hit, remaining 36 prompt tokens to eval
llama_perf_context_print:        load time =   17552.35 ms
llama_perf_context_print: prompt eval time =    3897.24 ms /    36 tokens (  108.26 ms per token,     9.24 tokens per second)
llama_perf_context_print:        eval time =    5473.61 ms /    51 runs   (  107.33 ms per token,     9.32 tokens per second)
llama_perf_context_print:       total time =    9381.96 ms /    87 tokens


User: you can take reference from this website https://tradingeconomics.com/united-states/stock-market
Assistant: Thank you for providing the website link. Based on the data available on the site, the stock market price trend in the United States has been generally upward over the past few years, with some fluctuations and corrections along the way.

You: which commodity has is the highest stocks currently


Llama.generate: 157 prefix-match hit, remaining 22 prompt tokens to eval
llama_perf_context_print:        load time =   17552.35 ms
llama_perf_context_print: prompt eval time =    2443.75 ms /    22 tokens (  111.08 ms per token,     9.00 tokens per second)
llama_perf_context_print:        eval time =    8664.69 ms /    82 runs   (  105.67 ms per token,     9.46 tokens per second)
llama_perf_context_print:       total time =   11126.53 ms /   104 tokens


User: which commodity has is the highest stocks currently
Assistant: I apologize, but I don't have access to real-time commodity prices or stock market data. However, I can suggest some reliable sources of information on commodity prices, such as the World Bank or the United States Department of Agriculture. These sources provide historical data on commodity prices, including agricultural commodities like wheat, corn, and soybeans.

You: what is the summary of this book https://sopheaksrey.wordpress.com/wp-content/uploads/2012/04/rich_dad_poor_dad_by_robert_t-_kiyosaki.pdf


Llama.generate: 261 prefix-match hit, remaining 69 prompt tokens to eval
llama_perf_context_print:        load time =   17552.35 ms
llama_perf_context_print: prompt eval time =    6153.62 ms /    69 tokens (   89.18 ms per token,    11.21 tokens per second)
llama_perf_context_print:        eval time =    8654.76 ms /    63 runs   (  137.38 ms per token,     7.28 tokens per second)
llama_perf_context_print:       total time =   14825.15 ms /   132 tokens


User: what is the summary of this book https://sopheaksrey.wordpress.com/wp-content/uploads/2012/04/rich_dad_poor_dad_by_robert_t-_kiyosaki.pdf
Assistant: I'm just an AI and do not have the capability to read or summarize books. However, I can provide you with a brief summary of the book "Rich Dad Poor Dad" by Robert T. Kiyosaki based on the plot and main ideas presented in the book.

You: please do provide


Llama.generate: 393 prefix-match hit, remaining 14 prompt tokens to eval
llama_perf_context_print:        load time =   17552.35 ms
llama_perf_context_print: prompt eval time =    1913.23 ms /    14 tokens (  136.66 ms per token,     7.32 tokens per second)
llama_perf_context_print:        eval time =    8503.61 ms /    76 runs   (  111.89 ms per token,     8.94 tokens per second)
llama_perf_context_print:       total time =   10433.78 ms /    90 tokens


User: please do provide
Assistant: "Rich Dad Poor Dad" is a personal finance book written by Robert T. Kiyosaki that challenges conventional beliefs about money and wealth. The book is based on the author's own experiences and the teachings of his mentor, Rich Dad - a wealthy real estate investor. The main ideas presented in the book are:

You: 

 Exiting chat. Goodbye!
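
Because `chat_history` grows with every turn while the model was loaded with `n_ctx=2048`, a long session can eventually exceed the context window. One possible mitigation, sketched below under the assumption that keeping only the most recent turns is acceptable, is to store turns as pairs and rebuild the prompt each time. `build_chat_prompt` and `max_turns` are hypothetical names.

```python
def build_chat_prompt(turns, user_input, max_turns=3):
    """Build a prompt from the most recent turns plus the new user message.

    `turns` is a list of (user, assistant) string pairs.
    """
    system = "You are a helpful assistant answering questions using uploaded documents.\n"
    recent = turns[-max_turns:] if max_turns > 0 else []
    history = "".join(
        f"### User: {user}\n### Assistant: {assistant}\n" for user, assistant in recent
    )
    return f"{system}{history}### User: {user_input}\n### Assistant:"


# Placeholder turns, used only to show which ones survive trimming.
turns = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
print(build_chat_prompt(turns, "q5", max_turns=2))
```

Inside the loop, the returned string would be passed to `llm(...)` in place of the concatenated `chat_history`, and each new `(user_input, answer)` pair would be appended to `turns`.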


# Showcasing DocsGPT Feature: Fetching a URL and Answering a Question

This section demonstrates how DocsGPT can fetch content from a URL, process it, and answer a question based on the retrieved context. We use the Wikipedia page on Artificial Intelligence as an example.

#### Steps:
1. **Fetch the Web Page**: The content of the page is retrieved using the `requests` library and parsed with `BeautifulSoup`.
2. **Extract Context**: The page text is truncated to its first 1,000 characters so that the prompt fits within the context window of the LLM.
3. **Ask a Question**: A question is posed about the content of the page, and the LLM generates an answer using the provided context.

#### Example:
- **URL**: [Artificial Intelligence - Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)
- **Question**: "What is Artificial Intelligence according to this article?"
- **Answer**: The LLM processes the context and provides a concise response based on the article's content.

In [58]:
# Ingest a single web page
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.get_text())





Artificial intelligence - Wikipedia
Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
Contribute
HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate Create account Log in
Pages for logged out editors learn more
ContributionsTalk
Contents
move to sidebar
hide
(Top)
1 Goals
Toggle Goals subsection
1.1 Reasoning and problem-solving
1.2 Knowledge representation
1.3 Planning and decision-making
1.4 Learning
1.5 Natural language processing
1.6 Perception
1.7 Social intelligence
1.8 General intelligence
2 Techniques
Toggle Techniques subsection


In [59]:
# Ask a question using scraped text as context
page_text = soup.get_text()
context = page_text[:1000]  # Truncate to fit context window
question = "What is Artificial Intelligence according to this article?"
prompt = f"""
You are a helpful assistant. Use the context below to answer the question.

Context:
{context}

Question: {question}
Answer:
"""

response = llm(prompt, max_tokens=200, stop=["\n", "###"], temperature=0.7)
answer = response['choices'][0]['text'].strip()

print("\nQ:", question)
print("A:", answer)


Llama.generate: 1 prefix-match hit, remaining 524 prompt tokens to eval
llama_perf_context_print:        load time =    5569.22 ms
llama_perf_context_print: prompt eval time =   12013.03 ms /   524 tokens (   22.93 ms per token,    43.62 tokens per second)
llama_perf_context_print:        eval time =    3969.02 ms /    41 runs   (   96.81 ms per token,    10.33 tokens per second)
llama_perf_context_print:       total time =   15990.12 ms /   565 tokens



Q: What is Artificial Intelligence according to this article?
A: According to the article, Artificial Intelligence (AI) is the branch of computer science that deals with creating intelligent machines, particularly intelligent computer programs. (Paragraph 1)
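
The first 1,000 characters of the page text printed earlier consist largely of navigation labels rather than article prose, so the context passed to the model may contain little of the article itself. One way to address this, sketched here with the standard library only, is to keep just the text inside `<p>` elements before truncating. `ParagraphExtractor` and `extract_context` are hypothetical helpers, and the sample HTML is hand-written.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements and ignore all other markup."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._pieces = []      # text fragments of the paragraph being read
        self.paragraphs = []   # completed paragraphs, in document order

    def _flush(self):
        # Collapse runs of whitespace and store the paragraph if it is non-empty.
        text = " ".join("".join(self._pieces).split())
        if text:
            self.paragraphs.append(text)
        self._pieces = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._pieces.append(data)


def extract_context(html, max_chars=1000):
    """Return paragraph text only, truncated to `max_chars` characters."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return "\n".join(parser.paragraphs)[:max_chars]


# Hand-written HTML standing in for a downloaded page.
sample_html = (
    "<html><body><nav>Main menu Donate Log in</nav>"
    "<p>First <b>paragraph</b> of the article.</p>"
    "<div>Toggle subsection</div><p>Second paragraph.</p></body></html>"
)
print(extract_context(sample_html))
```

In the notebook, the downloaded HTML from the fetch cell could be passed to `extract_context` instead of slicing `page_text`. With BeautifulSoup, iterating over `soup.find_all("p")` would serve the same purpose.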


# Using DocsGPT to Analyze CSV Files Locally

DocsGPT can process CSV files stored on your device to answer questions based on the data they contain. This feature is particularly useful for analyzing structured datasets without requiring cloud-based processing.

## Steps:
1. **Load the CSV File**: The dataset is loaded into a pandas DataFrame.
2. **Set Context**: The DataFrame is interpolated into the prompt, so the LLM receives its printed text representation as context.
3. **Ask a Question**: A question related to the dataset is constructed and sent to the LLM.
4. **Generate a Response**: The LLM processes the context and provides an answer based on the data.

## Example:
- **Dataset**: `bitcoin_prices.csv`
- **Question**: "What is the average bitcoin price?"
- **Answer**: The LLM returns an average price estimated from the rows visible in the prompt.

## Code Example:

In [60]:
import pandas as pd
df = pd.read_csv('bitcoin_prices.csv')

In [61]:

context = df
question = "What is the average bitcoin price"
prompt = f"""
You are a helpful assistant. Use the context below to answer the question.

Context:
{context}

Question: {question}
Answer:
"""

response = llm(prompt, max_tokens=400, stop=["\n", "###"], temperature=0.7)
answer = response['choices'][0]['text'].strip()

print("\nQ:", question)
print("A:", answer)


Llama.generate: 23 prefix-match hit, remaining 518 prompt tokens to eval
llama_perf_context_print:        load time =    5569.22 ms
llama_perf_context_print: prompt eval time =    9141.84 ms /   518 tokens (   17.65 ms per token,    56.66 tokens per second)
llama_perf_context_print:        eval time =    1145.05 ms /    17 runs   (   67.36 ms per token,    14.85 tokens per second)
llama_perf_context_print:       total time =   10293.76 ms /   535 tokens



Q: What is the average bitcoin price
A: The average bitcoin price is $30835.09.
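
When a DataFrame is placed inside an f-string, the prompt receives its printed form, which pandas abbreviates for larger tables, and language models are not reliable at arithmetic over many rows. A more robust pattern is to compute the aggregate in code and pass a short summary as context. The sketch below uses only the standard library; the `price` column name and the sample values are assumptions, since the actual layout of `bitcoin_prices.csv` is not shown here.

```python
import csv
import io
import statistics


def summarize_prices(csv_text, column="price"):
    """Return a short text summary of one numeric column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in reader if row[column]]
    return (
        f"rows: {len(values)}\n"
        f"mean {column}: {statistics.fmean(values):.2f}\n"
        f"min {column}: {min(values):.2f}\n"
        f"max {column}: {max(values):.2f}"
    )


# Tiny in-memory sample standing in for bitcoin_prices.csv (made-up values).
sample_csv = "date,price\n2024-01-01,100.0\n2024-01-02,200.0\n2024-01-03,300.0\n"
context = summarize_prices(sample_csv)
print(context)
```

The resulting `context` string can be dropped into the same prompt template, so the model only has to report a value that was computed exactly. With pandas, the same mean is available as `df["price"].mean()` under the same column-name assumption.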


##  Feature Comparison: DocsGPT vs LangChain vs OpenAI

| Feature                   | OpenAI GPT API (Cloud-Based) | LangChain (RAG Framework) | **DocsGPT** (Self-Hosted Solution) |
|--------------------------|-------------------------------|---------------------------|-------------------------------------|
| **Cloud Dependency**      | Required                      | Optional                  | Not Required                        |
| **Local Execution**       | Not Supported                 | Supported                 | Fully Supported                     |
| **Source Attribution**    | Not Available                 | Basic                     | Full Citation Support               |
| **REST API Integration**  | Proprietary API               | Custom Implementation     | Native REST API                     |
| **Interface Options**     | SDK Only                      | CLI Focus                 | CLI + Web UI                        |
| **Document Compatibility**| Requires Custom Parsing       | PDF/HTML/TXT              | PDF/HTML/CSV/DOCX/EPUB + 10+ Formats|
