### Trying out RAG with ollama and chromadb
ollama is installed in the Python environment venvcrc5, from which this notebook is started.

ollama is recommended over Hugging Face for local experimentation.

Its command-line interface uses a Docker-like syntax (`ollama pull`, `ollama list`, `ollama run`).

In [1]:
!ollama --version
!ollama list

ollama version is 0.5.7
NAME                       ID              SIZE      MODIFIED     
deepseek-r1:latest         0a8c26691023    4.7 GB    22 hours ago    
nomic-embed-text:latest    0a109f422b47    274 MB    22 hours ago    
llama3.1:latest            46e0c10c039e    4.9 GB    9 days ago      


#### `PdfReader` from pypdf extracts the text of a PDF document so that it can be passed to the model

In [16]:
from pypdf import PdfReader
# my textbook, 5th ed
reader = PdfReader("/home/mort/LaTeX/new projects/CRC5/main.pdf")
# concatenate the extracted text of every page
all_text = "".join(page.extract_text() for page in reader.pages)
# write as UTF-8, the encoding that readtextfiles() below expects
with open("/home/mort/temp/main.txt", "w", encoding="utf-8") as f:
    f.write(all_text)

#### Here the original LaTeX files are collected instead

In [30]:
import glob
# Find the chapter `.tex` files in the LaTeX directory and all of its subdirectories
tex_files = sorted(glob.glob('/home/mort/LaTeX/new projects/CRC5/**/chapter[1-9].tex', recursive=True))
# concatenate them into one text file (this overwrites the PDF-derived main.txt)
with open("/home/mort/temp/main.txt", "w", encoding="utf-8") as f:
    for file in tex_files:
        with open(file, "r", encoding="utf-8") as g:
            f.write(g.read())
    

#### Code for preprocessing the RAG supplementary text

In [2]:
import os
import re
import ollama
from langchain.text_splitter import RecursiveCharacterTextSplitter

def readtextfiles(path):
    # return a dict mapping each .txt filename in path to its contents
    text_contents = {}

    for filename in sorted(os.listdir(path)):
        if filename.endswith(".txt"):
            file_path = os.path.join(path, filename)

            with open(file_path, "r", encoding="utf-8") as file:
                text_contents[filename] = file.read()

    # return only after the loop, so that every file is read
    return text_contents

def chunksplitter(text, chunk_size=256, chunk_overlap=50):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,  # Maximum chunk size in characters (the default length function is len)
        chunk_overlap=chunk_overlap,  # Maximum overlap in characters between consecutive chunks
        separators=["\n\n", "\n", " "]  # Split at paragraph breaks first, then line breaks, then spaces
    )
    return splitter.split_text(text)

# use the nomic-embed-text model to calculate vector embeddings for all text chunks
def getembedding(chunks):
    embeds = ollama.embed(model="nomic-embed-text", input=chunks)
    return embeds.get('embeddings', [])
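
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break at the listed separators). `sliding_chunks` is a hypothetical helper introduced only for illustration.

```python
def sliding_chunks(text, chunk_size=256, chunk_overlap=50):
    """Split text into fixed-width character windows sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# 600 characters of repeating digits stand in for a document
sample = "".join(str(i % 10) for i in range(600))
demo_chunks = sliding_chunks(sample)
print(len(demo_chunks), [len(c) for c in demo_chunks])
```

With the defaults each chunk starts 206 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.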

#### Add the supplementary text to a new database collection

In [8]:
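import chromadb
chromaclient = chromadb.PersistentClient(path="/home/mort/crc5imagery/crc5rag")
# start from an empty collection; delete_collection raises if "crc5rag" does not exist yet
try:
    chromaclient.delete_collection("crc5rag")
except Exception:
    pass
collection = chromaclient.create_collection(name="crc5rag", metadata={"hnsw:space": "cosine"})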

# the RAG supplementary data
textdocspath = "/home/mort/temp"
text_data = readtextfiles(textdocspath)

# read, break into chunks, embed and add to the chroma vector database 
for filename, text in text_data.items():
    # chunk size set to 256, overlap to 50 (defaults)
    chunks = chunksplitter(text)
    embeds = getembedding(chunks)
    chunknumber = list(range(len(chunks)))
    ids = [filename + str(index) for index in chunknumber]
    metadatas = [{"source": filename} for index in chunknumber]
    collection.add(ids=ids, documents=chunks, embeddings=embeds, metadatas=metadatas)

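
The collection above is created with `"hnsw:space": "cosine"`, so chroma ranks chunks by cosine distance, that is, one minus the cosine similarity between the query embedding and each chunk embedding. The following plain-Python sketch shows that quantity on toy vectors. `cosine_distance` is an illustrative helper, not part of the chromadb API.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # antiparallel vectors
print(same_direction, orthogonal, opposite)
```

Because only the angle matters, the length of an embedding vector does not affect the ranking; smaller distances mean more closely related chunks.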

#### Execute a query with llama3.1 or deepseek-r1 and the supplementary text (RAG)

In [6]:
#%%capture
!ollama pull nomic-embed-text
#!ollama pull llama3.1
!ollama pull deepseek-r1

pulling manifest
pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB                         
pulling c71d239df917... 100% ▕████████████████▏  11 KB                         
pulling ce4a164fc046... 100% ▕████████████████▏   17 B                         
pulling 31df23ea7daa... 100% ▕████████████████▏  420 B                         
verifying sha256 digest 
writing manifest 
success
pulling manifest

In [9]:
import gradio as gr
import ollama
import chromadb

chromaclient = chromadb.PersistentClient(path="/home/mort/crc5imagery/crc5rag")
collection = chromaclient.get_collection(name="crc5rag")

def ragask(query):
    # embed the current query
    queryembed = ollama.embed(model="nomic-embed-text", input=query)['embeddings']
    # use the embedded current query to retrieve the n_results most relevant document chunks
    relateddocs = '\n\n'.join(collection.query(query_embeddings=queryembed, n_results=10)['documents'][0])
    # generate an answer
    prompt = f"Answer the question: {query}, referring to the following text as a resource: {relateddocs}"
    return ollama.generate(model="deepseek-r1", prompt=prompt, stream=False)['response']

gr.Interface(fn=ragask, inputs="text", outputs="text").launch()

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
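
The `['documents'][0]` indexing in `ragask` reflects the shape of a chroma query result: each key holds one inner list per query embedding, so `[0]` selects the chunks retrieved for the single query. The sketch below reproduces the prompt assembly on a hand-written stand-in for such a result. The ids, texts and distances in `toy_result` are made up, and `build_prompt` is a hypothetical helper.

```python
# Hand-written stand-in for the dictionary returned by collection.query()
# for a single query; all values here are invented for illustration.
toy_result = {
    "ids": [["main.txt12", "main.txt40"]],
    "documents": [["first retrieved chunk", "second retrieved chunk"]],
    "distances": [[0.21, 0.35]],
}

def build_prompt(query, result):
    # one inner list per query embedding, so [0] picks the chunks for our only query
    relateddocs = "\n\n".join(result["documents"][0])
    return f"Answer the question: {query}, referring to the following text as a resource: {relateddocs}"

toy_prompt = build_prompt("What is iMAD?", toy_result)
print(toy_prompt)
```

Keeping the prompt assembly in its own function makes it easy to inspect exactly what text is sent to the model for a given query.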




**Iterated Multivariate Alteration Detection (iMAD): A Step-by-Step Explanation**

The iterated multivariate alteration detection (iMAD) algorithm is a statistical method used for change detection between two multispectral images taken at different times. It helps identify changes in land cover, such as deforestation or urban expansion, by analyzing variations in spectral data.

**Key Components of iMAD:**

1. **Input Images**: The process requires two N-band optical/infrared images capturing the same scene at two distinct time points.

2. **Algorithm Purpose**: The goal is to detect changes (alterations) in ground reflectance across multiple image bands, accounting for both radiometric and spectral variations over iterations.

3. **Steps in iMAD**:
   - **Initialization**: Start with the original images.
   - **Iterative Process**: Repeatedly apply a weighted version of Maximum Autocorrelation Factor (MAF) analysis to compute variates that highlight change patterns.
   - **Weighting**: In each iteration, certain bands are down-weighted if their contribution to change is deemed significant in prior iterations. This step ensures robust detection by focusing on consistent changes across all bands.

4. **Output Variates**: The result includes a set of MAD/MAD-iMAD variates that represent the change patterns between the images.

5. **Change Detection**: Variates are thresholded at a specified significance level (e.g., 0.0001). Significant values indicate areas where changes have occurred, while insignificant values show stable regions like built-up or forested areas.

**Implementation and Usage:**

- **Python Script**: The provided Python script (`iMad.py`) implements the iMAD algorithm using a function that iterates through a specified number of times. Each iteration refines the weighting to enhance change detection accuracy.
  
- **Example Workflow**: 
  - Run `scripts/iMad` with input and temporal image paths.
  - Use `scripts/iMadmap` for significance thresholding, generating a change map.

**Significance:**

iMAD is particularly useful in environmental monitoring as it provides detailed insights into change patterns across extensive areas efficiently. By iteratively refining the analysis, iMAD ensures more reliable detection of subtle or persistent changes compared to single-step methods.
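
deepseek-r1 is a reasoning model, and its raw output typically opens with the model's reasoning wrapped in `<think>...</think>` tags before the answer proper. Assuming that format, a small helper can remove the reasoning before the text is returned to the interface. `strip_think` is a hypothetical name, and the `raw` string is an invented example rather than actual model output.

```python
import re

def strip_think(response):
    """Remove <think>...</think> blocks and surrounding whitespace from a response string."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

# invented example of a response with a reasoning block in front of the answer
raw = "<think>\nThe user asks about iMAD.\n</think>\n\n**Answer** text"
cleaned = strip_think(raw)
print(cleaned)
```

In `ragask` this would amount to wrapping the returned `['response']` string in `strip_think(...)`; responses without a reasoning block pass through unchanged.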