#### Prerequisites
 - Ollama installed and running, with the `llama3.2:1b` model pulled (`ollama pull llama3.2:1b`)

##### Dependencies:
  - "unstructured[pdf]"
  - langchain
  - langchain-community
  - langchain-huggingface
  - langchain-ollama
  - langchain-chroma
  - chromadb
  - sentence-transformers
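
Before running the notebook, it can help to confirm that the packages above are importable. A minimal sketch, assuming the import names below correspond to the listed distributions; `missing_modules` is a hypothetical helper, not part of any library:

```python
from importlib.util import find_spec

# Import names assumed to correspond to the distributions listed above.
REQUIRED_MODULES = [
    "unstructured",
    "langchain",
    "langchain_community",
    "langchain_huggingface",
    "langchain_ollama",
    "langchain_chroma",
    "chromadb",
    "sentence_transformers",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```

Anything reported as missing would need to be installed before the Imports cell can run.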

### Imports

In [None]:
import re
import textwrap
from pathlib import Path

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM
from transformers import AutoTokenizer
from unstructured.documents.elements import Element, ListItem, NarrativeText, Table, Title
from unstructured.partition.pdf import partition_pdf

notebook_path = Path().resolve()

print("Imports ready")

Imports ready


### Load

In [15]:
FILE_PATH = "documents"
FILE_NAME = "OWASP-Top-10-for-LLMs-2025.pdf"
START_MARKER = "LLM01:2025"
END_MARKER = "Appendix 1"

elements = partition_pdf(filename=f"{notebook_path.parent}/{FILE_PATH}/{FILE_NAME}")

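start_index = None
end_index = None
for i, e in enumerate(elements):
    # Only Title elements count as section boundaries.
    if not isinstance(e, Title):
        continue

    text = e.text.strip()
    if START_MARKER in text:
        start_index = i  # a later match overrides an earlier one

    if END_MARKER in text:
        end_index = i
        break

# Fail early instead of silently slicing the whole document.
if start_index is None or end_index is None:
    raise ValueError("Start or end marker not found among Title elements")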

content_elements = [
    e for e in elements[start_index:end_index]
    if isinstance(e, (Title, NarrativeText, ListItem, Table))
]

print(len(content_elements))

Data-loss while decompressing corrupted data

592


### Chunk

In [16]:
SKIP_FILLER = "OWASP Top 10 for LLM Applications v2.0"
JUNK_SECTION_TITLES = {
    "Reference Links",
    "Related Frameworks and Taxonomies",
}

chunks = []
current_chunk_text = []
section_context = ""
current_heading = ""
is_skipping_section = False

for e in content_elements:
    text = e.text.strip()
    if not text or text == SKIP_FILLER:
        continue

    if isinstance(e, Title):
        if section_context and text != section_context:
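            # Flush only when the chunk has body text. A heading-only chunk is
            # carried forward, so the following body text stays under that heading.
            if len(current_chunk_text) > 1:
                metadata = {"section": section_context, "heading": current_heading, "source": FILE_NAME}
                chunks.append({"page_content": "\n".join(current_chunk_text), "metadata": metadata})
                current_chunk_text = []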

        if text in JUNK_SECTION_TITLES:
            is_skipping_section = True
            continue
        
        is_skipping_section = False
        
        if re.match(r"^LLM\d{2}:2025", text):
            section_context = text
        
        if not current_chunk_text:
            current_heading = text
            current_chunk_text.append(text)

    elif not is_skipping_section:
        current_chunk_text.append(text)

if current_chunk_text and not is_skipping_section:
    metadata = {"section": section_context, "heading": current_heading, "source": f"{FILE_NAME}"}
    chunks.append({"page_content": "\n".join(current_chunk_text), "metadata": metadata})


print(f"✓ Created {len(chunks)} chunks with hierarchical metadata.")

✓ Created 109 chunks with hierarchical metadata.


### Tokenize

In [17]:
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")

def split_by_tokens(text, max_tokens=512, overlap=50):
    """
    Splits a single string of text into smaller strings based on token count.
    """
    tokens = tokenizer.encode(text, add_special_tokens=False)
    
    if len(tokens) <= max_tokens:
        return [text]
        
    result = []
    start = 0
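    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunk_text = tokenizer.decode(chunk_tokens)
        result.append(chunk_text)
        if end == len(tokens):
            # The final window reached the end; stop so no redundant tail chunk is emitted.
            break
        start += max_tokens - overlap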

    return result

rag_ready_chunks = []

for ch in chunks:
    original_text = ch['page_content']
    metadata = ch['metadata']
    
    sub_texts = split_by_tokens(original_text, max_tokens=512, overlap=50)
    
    for sub_text in sub_texts:
        rag_ready_chunks.append({
            "page_content": sub_text,
            "metadata": metadata
        })

print(f"Original chunks: {len(chunks)}")
print(f"RAG-ready sub-chunks: {len(rag_ready_chunks)}")

if rag_ready_chunks:
    max_len = max(len(tokenizer.encode(c['page_content'], add_special_tokens=False)) for c in rag_ready_chunks)
    print(f"Max tokens in any RAG-ready chunk: {max_len}")
else:
    print("No RAG-ready chunks were created.")

Token indices sequence length is longer than the specified maximum sequence length for this model (579 > 512). Running this sequence through the model will result in indexing errors


Original chunks: 109
RAG-ready sub-chunks: 111
Max tokens in any RAG-ready chunk: 512
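
The window arithmetic in `split_by_tokens` can be checked without loading a tokenizer. The sketch below uses a hypothetical helper, `window_spans`, that returns only the `(start, end)` token offsets for a sequence of a given length. It stops as soon as a window reaches the end of the sequence, which is an assumption about the intended behavior rather than something the notebook states.

```python
def window_spans(n_tokens, max_tokens=512, overlap=50):
    """Return (start, end) offsets of overlapping windows covering n_tokens tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    if n_tokens <= max_tokens:
        return [(0, n_tokens)]

    spans = []
    start = 0
    step = max_tokens - overlap  # each window starts `step` tokens after the previous one
    while True:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # last window reached the end; avoid a redundant tail
            break
        start += step
    return spans


# 579 is the sequence length reported in the tokenizer warning above.
print(window_spans(579))  # → [(0, 512), (462, 579)]
```

With the default settings, a 579-token chunk yields two windows that share 50 tokens, which is consistent with the small increase from 109 chunks to 111 sub-chunks.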


### Embed & Store

In [18]:
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

documents = [
    Document(page_content=chunk['page_content'], metadata=chunk['metadata'])
    for chunk in rag_ready_chunks
]

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embedding_model,
    collection_name="owasp_db_v1",
    # Optional: set a directory to persist the store to disk
    persist_directory="./chroma_db_v1"
)
print("✓ Vector store ready!")

✓ Vector store ready!
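
Because `persist_directory` is set, re-running this cell against an existing directory may add the same documents a second time. One possible guard, sketched below, is to derive a deterministic ID per chunk and pass the list as `ids=` to `Chroma.from_documents`. The helper `stable_chunk_id` is hypothetical, and how the store treats an ID it has already seen should be verified against the installed Chroma version.

```python
import hashlib


def stable_chunk_id(page_content, metadata):
    """Derive a deterministic ID from a chunk's text and its metadata fields."""
    key = "\x1f".join([
        metadata.get("source", ""),
        metadata.get("section", ""),
        metadata.get("heading", ""),
        page_content,
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


# Made-up sample chunk in the same shape as the entries of rag_ready_chunks.
sample = {
    "page_content": "Example chunk text.",
    "metadata": {"source": "example.pdf", "section": "S1", "heading": "H1"},
}
print(stable_chunk_id(sample["page_content"], sample["metadata"]))

# In the notebook this could be wired in as follows (not executed here):
# ids = [stable_chunk_id(c["page_content"], c["metadata"]) for c in rag_ready_chunks]
# and then ids=ids passed alongside documents= and embedding= to Chroma.from_documents.
```

Two chunks with identical text and metadata would produce the same ID, so this approach assumes chunk contents are distinct.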


### Model

In [19]:
llm = OllamaLLM(
    model="llama3.2:1b",
    temperature=0.1,
)

print("✓ Model ready: llama3.2:1b (via Ollama)")
test_response = llm.invoke("Say 'ready' if you're working")
print(f"✓ Model test: {test_response}")

✓ Model ready: llama3.2:1b (via Ollama)
✓ Model test: I'm ready to work.


### QA Setup

In [20]:
custom_prompt = PromptTemplate(
    template="""You are an expert assistant for answering questions about the OWASP Top 10 for LLM Applications document.
    Use only the following retrieved context to answer the question.
    If you don't know the answer from the context provided, just say that you do not have enough information to answer.
    Be concise and directly answer the question.
    
    CONTEXT: {context}
    QUESTION: {question}
    
    Provide a concise security-focused answer:""",
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": custom_prompt}
)
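
With `chain_type="stuff"`, the retrieved documents are concatenated into a single string that fills `{context}` in the prompt. The sketch below imitates that step in plain Python so the shape of the final prompt can be inspected without calling the model. The separator (a blank line), the shortened template, and the sample texts are assumptions for illustration, not values read from the chain.

```python
TEMPLATE = (
    "Use only the following retrieved context to answer the question.\n\n"
    "CONTEXT: {context}\n"
    "QUESTION: {question}\n"
)


def build_stuffed_prompt(template, contexts, question, separator="\n\n"):
    """Join retrieved texts into one context string and fill the template."""
    return template.format(context=separator.join(contexts), question=question)


# Made-up retrieved snippets standing in for source.page_content values.
retrieved = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_stuffed_prompt(TEMPLATE, retrieved, "What is being asked?")
print(prompt)
```

Since every retrieved chunk lands in one prompt, `k` multiplied by the chunk size needs to fit in the model's context window. With `k=3` and chunks capped at 512 embedding-model tokens, the context stays small.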

In [21]:
def ask(question: str, qa_chain):
    """
    Asks a question using the QA chain and prints the answer along with the source documents.
    """
    print(f"-> QUESTION: {question}\n")
    
    response = qa_chain.invoke({"query": question})
    
    print("ANSWER:")
    print("============================================================")
    # Use textwrap to make the answer readable
    print(textwrap.fill(response['result'], width=80))
    print("============================================================\n")
    
    print("SOURCES:")
    print("============================================================")
    print(f"Number of sources: {len(response['source_documents'])}")
    
    for i, source in enumerate(response['source_documents']):
        print(f"\n--- Source {i+1} ---")
        
        metadata = source.metadata
        print(f"  SECTION: {metadata.get('section', 'N/A')}")
        print(f"  HEADING: {metadata.get('heading', 'N/A')}")
        
        print("\n  CONTENT:")
        print(textwrap.fill(source.page_content, width=78, initial_indent="  ", subsequent_indent="  "))
        print("--------------------")
        
    return response

### Ask questions

In [22]:
res = ask("What risks are associated with insecure output handling?", qa_chain)
# res = ask("What is data poisoning?", qa_chain)
# res = ask("How to mitigate supply chain vulnerabilities", qa_chain)

-> QUESTION: What risks are associated with insecure output handling?

ANSWER:
Insecure output handling in large language models can lead to several security
risks, including:  1. **XSS (Cross-Site Scripting)** and **CSRF (Cross-Site
Request Forgery)** 2. **SSRF (Server-Side Request Forgery)** 3. **Privilege
Escalation**: LLM-generated content can be used to gain unauthorized access to
sensitive data or systems. 4. **Remote Code Execution**: Malicious inputs can be
injected into the LLM, allowing attackers to execute arbitrary code on backend
systems.  These risks are exacerbated by conditions such as:  - Granting LLM
privileges beyond intended use - Indirect prompt injection attacks - Lack of
proper output encoding and monitoring - Absence of rate limiting or anomaly
detection for LLM usage.

SOURCES:
Number of sources: 3

--- Source 1 ---
  SECTION: LLM05:2025 Improper Output Handling
  HEADING: LLM05:2025 Improper Output Handling

  CONTENT:
  LLM05:2025 Improper Output Handling Imp