# Obsidian Retrieval-Augmented Generation
*This Jupyter notebook is used only for prototyping.*

Steps:
1) Read each document and detect its language and format
2) Split the text into context-aware blocks so that both the passage and its source document can be retrieved
3) Produce language-aware embeddings for each block
4) Given the query and the retrieved documents, generate a Markdown (.md) response so that tables and formulas render correctly
5) Create a frontend :)
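
Step 4 is not implemented in this notebook yet. As a minimal sketch of how the prompt for it might be assembled, assuming a hypothetical helper `build_prompt`, an illustrative instruction wording, and retrieved documents passed as plain dicts with the assumed keys `name` and `text`:

```python
def build_prompt(query, documents):
    """Assemble a prompt asking for a Markdown answer grounded in the retrieved notes.

    `documents` is a list of dicts with the hypothetical keys "name" and "text".
    """
    context_blocks = []
    for index, doc in enumerate(documents, start=1):
        # Number each source so the answer can point back to it
        context_blocks.append(f"[{index}] {doc['name']}\n{doc['text']}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below.\n"
        "Reply in Markdown; write tables as Markdown tables and formulas in LaTeX.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
    )


if __name__ == "__main__":
    docs = [{"name": "Strutture dati avanzate", "text": "## Alberi\nUn albero è una struttura dati gerarchica."}]
    print(build_prompt("Alberi binari", docs))
```

Asking explicitly for Markdown tables and LaTeX formulas in the instructions is one way to steer the model toward output that renders correctly in Obsidian; the exact wording would need tuning against the chosen model.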

Internal steps
1) Read each document and detect its language and format
    1) Explore every folder recursively
    2) Read each file and extract its text
    3) Store the file names to avoid processing the same file twice [TO IMPLEMENT]
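
The step marked [TO IMPLEMENT] could be approached with a small manifest file. The following is only a sketch under assumptions: the helper names, the JSON manifest, and the use of modification times to detect changes are illustrative choices, not something this notebook already does.

```python
import json
from pathlib import Path


def load_manifest(manifest_path):
    """Return the {file path: modification time} mapping saved by a previous run."""
    manifest_path = Path(manifest_path)
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def select_new_files(paths, manifest):
    """Keep only files that are missing from the manifest or changed since it was written."""
    new_files = []
    for path in paths:
        path = Path(path)
        if manifest.get(str(path)) != path.stat().st_mtime:
            new_files.append(path)
    return new_files


def save_manifest(paths, manifest, manifest_path):
    """Record the processed files so that the next run can skip them."""
    for path in paths:
        path = Path(path)
        manifest[str(path)] = path.stat().st_mtime
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Storing the modification time next to the file name means an edited note is picked up again, whereas a manifest of names alone would skip it.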

In [90]:
# 1) Read the documents
from pathlib import Path
from pypdf import PdfReader
from langchain_core.documents import Document
# LangDetectException must be imported for the except clause in detect_lang
from langdetect import detect, LangDetectException

BASE_DOC_DIR = Path('/Users/matteo/projects/obsidian_rag/sample_data')

def detect_lang(text: str) -> str:
    text = text.strip()
    if not text:
        return "unknown"
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

# Explore every folder under root and map a short key to each file path
def maps_file(root: Path):
    maps = {}
    for path in root.rglob("*"):
        # Skip hidden files and folders, checking only the parts below root
        if any(part.startswith(".") for part in path.relative_to(root).parts):
            continue
        # Keep only regular files
        if path.is_file():
            # Use "<parent folder>/<file stem>" as the key to reduce the chance of duplicate keys
            sliced_path = list(path.parts[-2:])
            sliced_path[-1] = Path(sliced_path[-1]).stem
            name = "/".join(sliced_path)
            maps[name] = path
    return maps

# Read a Markdown file (explicit UTF-8 so the result does not depend on the platform default)
def read_md(file_path: Path):
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

def read_pdf(file_path: Path):
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        content = page.extract_text()
        if content:
            text += content + "\n"
    return text

def extract_resources(root: Path):
    maps = maps_file(root)
    # Map each supported suffix to the function that reads it
    readers = {".md": read_md, ".pdf": read_pdf}
    resources = {}
    not_supported = []
    for name, path in maps.items():
        reader = readers.get(path.suffix)
        if reader is None:
            not_supported.append(name)
            continue
        text = reader(path)
        metadata = {
            # path.stem keeps dots inside the file name (e.g. "notes.v2")
            "name": path.stem,
            "source": str(path),
            "language": detect_lang(text),
            "format": path.suffix.lstrip("."),
        }
        resources[name] = Document(page_content=text, metadata=metadata)
    return resources, not_supported

resources, not_supported = extract_resources(BASE_DOC_DIR)
print("This resources are not compatible: " + str(not_supported))


This resources are not compatible: ['Photo/maximal_margin_classifier', 'Photo/Pasted image 20250318142424', 'Photo/LDA_QDA', 'Photo/Screenshot 2025-04-08 at 13.51.52', 'Photo/AB940178-4B36-4A4A-B219-A6C277A1685F_1_102_a', 'Photo/ChatGPT Image May 2025 from Matteo', 'Photo/Test_and_train_error', 'Photo/Screenshot 2025-07-06 at 10.33.40', 'Photo/Screenshot 2025-03-03 at 19.13.01', 'Photo/Screenshot 2025-02-27 at 14.22.39', 'Photo/Untitled', 'Photo/Pasted image 20250318142233', 'Photo/augmented_space', 'Photo/Hyperplanes', 'Photo/Screenshot 2025-06-30 at 17.29.40', 'Photo/Bayes error rate', 'Photo/U-shape cross validation', 'Photo/Screenshot 2025-03-03 at 19.07.39', 'Photo/Logistic_function_ML', 'Photo/Screenshot 2025-04-23 at 13.37.12', 'Photo/Rplot02', 'Photo/ChatGPT Image Apr 23 2025 from Ridisegnare grafico alberi', 'Photo/Screenshot 2025-04-23 at 14.38.27', 'Photo/Screenshot 2025-02-28 at 09.11.10', 'Photo/Screenshot 2025-04-23 at 11.05.22', 'Photo/Pasted image 20250405130303', 'Phot

The next step is chunking:
1) Create a summary of the text
2) Embedding in the text

In [66]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, MarkdownTextSplitter


def chunking(resources):
    pass


chunking(resources)



[['# Esercizi esame\n\nConsiderare le seguenti funzioni che implementano gli algoritmi di ordinamento "bubble sort" e "bubble sort con check di terminazione". Modificarle in modo che ognuna restituisca il numero di CONFRONTI effettuati e il numero di SCAMBI effettuati (si consiglia di utilizzare la funzione\xa0_ "return x, y_").  \n\nSuccessivamente implementare la funzione\xa0_ **bubble_sort_comparison**_ _ _ **(input_list)** __\xa0che riceve in input una lista di interi\xa0**_ input_list_**, la ordina attraverso i due algoritmi e infine\xa0stampa\xa0(usando la funzione\xa0**_print()_**) la somma dei confronti e degli scambi effettuati dall\'algoritmo "bubble sort" se il numero dei confronti effettuati da "bubble sort" è minore rispetto al "bubble sort con check di terminazione",\xa0**altrimenti** stampa\xa0la somma dei confronti e degli scambi effettuati dall\'algoritmo "bubble sort con check di terminazione".\n\n(indicazione: se ad esempio x è il numero di confronti effettuati e y è
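
The `chunking` stub above is still empty. As a dependency-free sketch of one way heading-based chunking could work (a simplified stand-in, not the behaviour of `MarkdownTextSplitter`; the function name `split_markdown` and the `max_chars` default are assumptions):

```python
import re


def split_markdown(text, max_chars=1000):
    """Split Markdown into chunks that start at headings.

    Sections longer than max_chars are broken on blank lines; a single
    paragraph longer than max_chars is kept whole. Heading-like lines inside
    code blocks are not treated specially in this simplified version.
    """
    # Split right before every heading line ("# ", "## ", ...), keeping the heading with its section
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        current = ""
        for paragraph in section.split("\n\n"):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) > max_chars and current:
                # The section grew too large: close the current chunk and start a new one
                chunks.append(current)
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Applied to each `Document.page_content`, this would return a list of strings per note, which matches the list-of-lists shape of the output shown above.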

Embedding using Mistral

In [92]:
# Embed a list of texts (the input must be an array)
def create_embeddings(texts, emb_model):
    embeddings = emb_model.embed_documents(texts)
    return embeddings

# Code starts here
from langchain.embeddings import init_embeddings
import chromadb
from langchain_chroma import Chroma
from tqdm import tqdm

# Init embedding model
emb = init_embeddings("ollama:mistral", base_url="http://localhost:11434")

# Init Chroma client and collection
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
vector_store = Chroma(
    client=chroma_client,
    collection_name="Obsidian",
    embedding_function=emb,
)

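def first_embeddings(resources, batch_size=1):
    # Add the documents to the vector store in batches, showing progress
    docs = list(resources.values())
    for i in tqdm(range(0, len(docs), batch_size)):
        batch = docs[i:i + batch_size]
        vector_store.add_documents(batch)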

first_embeddings(resources)


100%|██████████| 32/32 [06:59<00:00, 13.10s/it]
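
Since each run takes several minutes, re-running the cell should ideally not store the same notes twice. One possible approach, sketched here under assumptions, is to derive a deterministic ID from each document's `source` path. The helper `stable_doc_id` is hypothetical, and passing these IDs to `add_documents` through its `ids` argument assumes the store overwrites entries whose ID already exists.

```python
import uuid


def stable_doc_id(source):
    """Derive the same UUID every time for the same source path."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, source))


# Hypothetical usage with the vector store defined earlier (an assumption, not tested here):
# ids = [stable_doc_id(doc.metadata["source"]) for doc in batch]
# vector_store.add_documents(batch, ids=ids)
```

A name-based UUID (version 5) is used instead of a random one so that the ID depends only on the path and stays identical across runs.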


In [98]:
# Test the database by retrieving the documents most similar to a query
query = "Alberi binari"
results = vector_store.similarity_search(query, k=3)

for i, doc in enumerate(results, 1):
    print(f"\nResults: {i}")
    print(doc.metadata)
    print(doc.page_content)



Results: 1
{'format': 'md', 'name': 'Strutture dati avanzate', 'language': 'it', 'source': '/Users/matteo/projects/obsidian_rag/sample_data/Algoritmi/Strutture dati avanzate.md'}
# Strutture dati avanzate
## Alberi
Un albero è una struttura dati gerarchica che consiste in un insieme di nodi collegati tra loro da archi direzionati. 
Un nodo è un contenitore di dati, che ha relazioni di parentela con gli altri nodi:
- **nodo radice**: Il nodo in cima all'albero
- **nodi genitori e nodi figli**: 
- **nodi ascendenti e nodi discendenti**

### Definizioni 
|                                                                                                                     |                                                                                      |
| ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| **Cammino**: una sequenza