### Trying out RAG with ollama and chromadb
Ollama is installed in the Python environment venvcrc5, from which this notebook is started.

For local experimentation, Ollama is often recommended over Hugging Face.

Its command-line interface uses a Docker-like syntax (pull, run, ps, rm and so on), as the help output below shows.

In [1]:
!ollama
!ollama --version
!ollama list

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.
ollama version is 0.5.7
NAME                       ID              SIZE      MODIFIED     
deepseek-r1:1.5b           e0979632db5a    1.1 GB    6 days ago      
nomic-embed-text:latest    0a109f422b47    274 MB    6 days ago      
deepseek-r1:7b             755ced02ce7b    4.7 GB    7 days ago      
deepseek-r1:14b            c333b7232bdb    9.0 GB    3 weeks ago   

In [3]:
!ollama help pull

Pull a model from a registry

Usage:
  ollama pull MODEL [flags]

Flags:
  -h, --help       help for pull
      --insecure   Use an insecure registry

Environment Variables:
      OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
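
The `OLLAMA_HOST` variable in the help text above also tells a client where to reach the server. The following is a minimal sketch using only the standard library: `ollama_base_url` and `list_models` are hypothetical helper names, not part of ollama, and prefixing `http://` when no scheme is given is an assumption. `list_models` queries the server's `/api/tags` REST endpoint and therefore needs a running server.

```python
import json
import os
import urllib.request

DEFAULT_HOST = "127.0.0.1:11434"  # default shown by `ollama help pull`

def ollama_base_url(env=None):
    """Build a base URL from OLLAMA_HOST (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip() or DEFAULT_HOST
    # Assumption: a value without a scheme is reached over plain http
    if "://" not in host:
        host = "http://" + host
    return host.rstrip("/")

def list_models(base_url=None):
    """Return the names of the local models (requires a running server)."""
    base_url = base_url or ollama_base_url()
    with urllib.request.urlopen(base_url + "/api/tags") as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]
```

With the server started, `list_models()` should return the same names that `ollama list` prints.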


#### pypdf's PdfReader extracts the text of a PDF document so that it can be passed to the model

In [2]:
from pypdf import PdfReader
# my textbook, 5th ed
reader = PdfReader("/home/mort/LaTeX/new projects/CRC5/main.pdf")
# extract the text of every page and concatenate it
all_text = "".join(page.extract_text() for page in reader.pages)
# write the result to a text file (closed automatically by the with block)
with open("/home/mort/crc5pdfreader/main.txt", "w", encoding="utf-8") as f:
    f.write(all_text)
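
Text extracted from a PDF often contains words hyphenated across line ends and irregular whitespace. The notebook writes the raw text as it is. If some cleanup is wanted before chunking, a heuristic sketch could look like this (`clean_extracted_text` is a hypothetical helper, and the three rules are assumptions about what is worth tidying):

```python
import re

def clean_extracted_text(text):
    """Tidy raw PDF text before chunking (a heuristic sketch)."""
    # Rejoin words split by a hyphen at a line end, e.g. "classi-\nfication"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Reduce three or more newlines to a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Note that the first rule also joins genuine compounds that happen to break at a line end, so it trades a few wrong joins for many repaired words.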

#### Here the original LaTeX files are collected instead

In [1]:
import glob
# Find the chapter `.tex` files in the LaTeX directory and all its subdirectories
tex_files = glob.glob('/home/mort/LaTeX/new projects/CRC5/**/chapter[1-9].tex', recursive=True)
tex_files.sort()
# Concatenate the chapters into a single text file
with open("/home/mort/crc5latex/main.txt", "w") as f:
    for file in tex_files:
        with open(file, "r") as g:
            f.write(g.read())
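
To see what the pattern `**/chapter[1-9].tex` does and does not match, here is a small self-contained sketch that runs on a throwaway directory tree. The helper name `find_chapters` and all file names are made up for the demonstration.

```python
import glob
import os
import tempfile

def find_chapters(root):
    """Return chapter1.tex ... chapter9.tex below root, sorted by path."""
    pattern = os.path.join(root, "**", "chapter[1-9].tex")
    # recursive=True lets "**" match root itself and any depth of subdirectories
    return sorted(glob.glob(pattern, recursive=True))

# Demonstration on a temporary directory tree (all names are invented)
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "part1"))
    for rel in ("chapter2.tex", "part1/chapter1.tex",
                "part1/chapter10.tex", "part1/appendixA.tex"):
        with open(os.path.join(root, rel), "w", encoding="utf-8") as fh:
            fh.write("% " + rel + "\n")
    found = [os.path.relpath(p, root) for p in find_chapters(root)]

print(found)
```

Two things are worth noting: `[1-9]` matches exactly one digit, so `chapter10.tex` is skipped, and the sort is by full path, so files in different subdirectories are ordered by directory name before chapter number.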
    

#### Using crawl4ai to scrape web pages

In [9]:
!python3 scripts/testcrawler.py "https://mortcanty.github.io/src/"

[INIT].... → Crawl4AI 0.7.2
Crawled 64 pages in total
URL: https://mortcanty.github.io/src/
MarkDown: # Mort Canty's Home Page
![](https://mortcanty.github.io/src/mortandandrea.jpg) |    
  
  
Dr. Mort Canty   
Juelich, Germany   
[contact](https://mortcanty.github.io/src/contact.html)  
---|---  
## [Software](https://mortcanty.github.io/src/software.html) |  ![](https://mortcanty.github.io/src/bookcover4.jpg)  
---|---  
## [ Blog](http://fwenvi-idl.blogspot.com/)
## [ Publications](https://mortcanty.github.io/src/publications.html)
## Teaching
[HU Berlin](https://mortcanty.github.io/src/hu_berlin.html) ENVI/IDL Course, June 14-17, 2010   
[ZFL Bonn](https://mortcanty.github.io/src/zfl_2012.html) Radiometric Normalization Course, March/April 2012   
[ZFL Bonn](https://mortcanty.github.io/src/zfl_2013.html), ENVI/IDL Course, November, 2013.  
[FZJ Juelich](https://mortcanty.github.io/src/stats20

#### The LlamaParse version of the textbook PDF (parsed via the web API) is stored in /home/mort/crc5llamaparse

#### Code for preprocessing the RAG supplementary text

In [11]:
import os
import re
import ollama
from langchain.text_splitter import RecursiveCharacterTextSplitter

def readtextfiles(path):
    """Read every .txt file in the directory path into a dict {filename: content}."""
    text_contents = {}

    for filename in os.listdir(path):
        if filename.endswith(".txt"):
            file_path = os.path.join(path, filename)

            with open(file_path, "r", encoding="utf-8") as file:
                text_contents[filename] = file.read()

    # return only after all files have been read
    return text_contents

def chunksplitter(text, chunk_size=512, chunk_overlap=128):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # Maximum chunk size in characters (the default length function is len)
        chunk_overlap=chunk_overlap,  # Overlap between consecutive chunks, in characters
        separators=["\n\n", "\n", " "]  # Split at paragraphs first, then lines, then words
    )
    return splitter.split_text(text)

# use the nomic-embed-text model to calculate vector embeddings for all text chunks
def getembedding(chunks):
    embeds = ollama.embed(model="nomic-embed-text", input=chunks)
    return embeds.get('embeddings', [])
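
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler splitter. The sketch below cuts fixed-size character windows that overlap and ignores separators entirely, so it is not what `RecursiveCharacterTextSplitter` does internally (that class prefers to break at the listed separators). `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=512, chunk_overlap=128):
    """Cut text into fixed-size character windows that overlap (simplified)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

For example, `sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)` advances two characters at a time, so every chunk repeats the last two characters of its predecessor. The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.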

#### Add the supplementary text to a new database collection

In [9]:
import chromadb
#chromaclient = chromadb.PersistentClient(path="/home/mort/crc5imagery/crc5rag")
chromaclient = chromadb.PersistentClient(path="/home/mort/crc5rag")
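# start from an empty collection; delete_collection raises if "crc5rag" does not exist yet
# (the exception type differs between chromadb versions, hence the broad except)
try:
    chromaclient.delete_collection("crc5rag")
except Exception:
    pass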
collection = chromaclient.create_collection(name="crc5rag", metadata={"hnsw:space": "cosine"}  )

# the RAG supplementary data, here using the llamaparse version of the main pdf
textdocspath = "/home/mort/crc5llamaparse"
text_data = readtextfiles(textdocspath)

# read, break into chunks, embed and add to the chroma vector database 
for filename, text in text_data.items():
    # override chunksplitter's defaults (chunk size 512, overlap 128)
    chunks = chunksplitter(text, chunk_size=256, chunk_overlap=32)
    embeds = getembedding(chunks)
    # one id and one metadata record per chunk; ids are the filename plus the chunk index
    ids = [filename + str(index) for index in range(len(chunks))]
    metadatas = [{"source": filename} for _ in chunks]
    collection.add(ids=ids, documents=chunks, embeddings=embeds, metadatas=metadatas)
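
The setting `"hnsw:space": "cosine"` makes the collection rank chunks by cosine distance, that is, one minus the cosine similarity of two embedding vectors. As an illustration only, the sketch below computes that distance and ranks vectors by brute force. Chroma itself uses an approximate HNSW index rather than this exhaustive search, and the names `cosine_distance` and `nearest` are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, vectors, n_results=2):
    """Brute-force ranking of vector indices by distance to the query."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:n_results]
```

Because only the angle between vectors matters, two chunks whose embeddings differ in length but point the same way have distance 0.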


#### Execute a query with llama3.1 or deepseek-r1 and the supplementary text (RAG)

In [2]:
%%capture
!ollama pull nomic-embed-text
#!ollama pull llama3.1
!ollama pull deepseek-r1:1.5b

In [3]:
import gradio as gr
import html
import ollama
import chromadb

# Initialize ChromaDB client and collection
#chromaclient = chromadb.PersistentClient(path="/home/mort/crc5imagery/crc5rag")
chromaclient = chromadb.PersistentClient(path="/home/mort/crc5rag")
collection = chromaclient.get_collection(name="crc5rag")

def ragask(query):
    # Embed the query
    queryembed = ollama.embed(model="nomic-embed-text", input=query)['embeddings']
    # Retrieve related documents
    relateddocs = '\n\n'.join(collection.query(query_embeddings=queryembed, n_results=16)['documents'][0])
    # Generate a response
    prompt = f"Answer the question: {query} referring to the following text as a resource: {relateddocs}"
    response = ollama.generate(model="deepseek-r1:14b", prompt=prompt, stream=False)['response']   
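    # Escape HTML special characters so that any tags in the response (e.g. the <think> block emitted by deepseek-r1) are shown literally instead of being interpreted by the Markdown renderer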
    return html.escape(response)

# Launch Gradio Interface (ChatInterface not appropriate for RAG application!)
gr.Interface(fn=ragask,inputs=gr.Textbox(lines=2, placeholder="Enter your question here..."),
             outputs="markdown",
             description="Ask questions about the book contents",
             title="Image Analysis, Classification and Change Detection in Remote Sensing").launch()

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
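
The indexing `['documents'][0]` in `ragask` is easy to misread: the fields of a Chroma query result hold one list per query embedding, and `[0]` selects the hits for the first (here the only) query. The sketch below isolates the prompt assembly so that it can be tried without a database or a model. `build_rag_prompt` and its `max_docs` parameter are hypothetical, and `fake_result` is a hand-made stand-in with invented texts.

```python
def build_rag_prompt(query, query_result, max_docs=None):
    """Assemble the prompt from a Chroma-style query result (sketch)."""
    # Each result field holds one list per query embedding;
    # [0] selects the hits belonging to the first query
    docs = query_result["documents"][0]
    if max_docs is not None:
        docs = docs[:max_docs]
    context = "\n\n".join(docs)
    return (f"Answer the question: {query} "
            f"referring to the following text as a resource: {context}")

# Hand-made stand-in for the result of collection.query(...)
fake_result = {
    "ids": [["main.txt0", "main.txt1"]],
    "documents": [["First chunk.", "Second chunk."]],
}
prompt = build_rag_prompt("What is a chunk?", fake_result)
print(prompt)
```

Separating this step from retrieval and generation also makes it simple to experiment with the prompt wording or with the number of chunks passed to the model.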


