# Definition

- Data: 2025
- Chunking used: chunk size 512, overlap 64
- Embedding used: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`

In [1]:
model_name = "paraphrase-multilingual-MiniLM-L12-v2"

# Load data

In [2]:
import json

with open('../files/2025/output_2025_cleaned.json', 'r', encoding='utf-8') as f:
    data = json.load(f) 

data[0]

{'metadata': {'document_id': 'F2025001',
  'domain': '',
  'year': 2025,
  'journal_number': '001',
  'hijri_date': 'Mardi 7 Rajab 1446',
  'gregorian_date': '7 janvier 2025',
  'document_link': 'https://www.joradp.dz/FTP/JO-FRANCAIS/2025/F2025001.pdf'},
 'content': "decrets décret présidentiel n° 24-432 du 29 joumada ethania 1446 correspondant au 31 décembre 2024 portant transfert de crédits au titre du budget de l'etat. le président de la république, sur le rapport conjoint du ministre des finances, du ministre de l'agriculture, du développement rural et de la pêche et du ministre des travaux publics et des infrastructures de base, vu la constitution, notamment ses articles 91-7° et 141 alinéa 1er vu la loi organique n° 18-15 du 22 dhou el idja 1439 correspondant au 2 septembre 2018, modifiée et complétée, relative aux lois de finances vu la loi n° 23-22 du 11 joumada ethania 1445 correspondant au 24 décembre 2023 portant loi de finances pour 2024 vu le décret exécutif n° 24-10 du 24

# Chunking

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configuration for LEGAL TEXT
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # Maximum chunk length, in characters (see length_function)
    chunk_overlap=120,  # Characters shared by consecutive chunks to preserve context
    length_function=len
)


chunks_with_metadata = []
for doc in data:
    try:
        text_chunks = text_splitter.split_text(doc['content'])
        
        for i, chunk in enumerate(text_chunks):
            chunk_meta = {
                **doc['metadata'],  
                "chunk_id": f"{doc['metadata'].get('document_id', 'unknown')}_chunk{i}",
                "total_chunks": len(text_chunks),
                "chunk_number": i+1
            }
            chunks_with_metadata.append({
                "text": chunk,
                "metadata": chunk_meta
            })
            
            print(f"✅ Created chunk {i+1}/{len(text_chunks)}")
            print(f"Metadata: {chunk_meta}")
            print(f"Content Preview: {chunk}")
            
    except KeyError as e:
        print(f"⚠️ Error processing document: Missing key {e}")
        continue

print(f"\nTotal chunks created: {len(chunks_with_metadata)}")

✅ Created chunk 1/161
Metadata: {'document_id': 'F2025001', 'domain': '', 'year': 2025, 'journal_number': '001', 'hijri_date': 'Mardi 7 Rajab 1446', 'gregorian_date': '7 janvier 2025', 'document_link': 'https://www.joradp.dz/FTP/JO-FRANCAIS/2025/F2025001.pdf', 'chunk_id': 'F2025001_chunk0', 'total_chunks': 161, 'chunk_number': 1}
Content Preview: decrets décret présidentiel n° 24-432 du 29 joumada ethania 1446 correspondant au 31 décembre 2024 portant transfert de crédits au titre du budget de l'etat. le président de la république, sur le rapport conjoint du ministre des finances, du ministre de l'agriculture, du développement rural et de la pêche et du ministre des travaux publics et des infrastructures de base, vu la constitution, notamment ses articles 91-7° et 141 alinéa 1er vu la loi organique n° 18-15 du 22 dhou el idja 1439 correspondant au 2
✅ Created chunk 2/161
Metadata: {'document_id': 'F2025001', 'domain': '', 'year': 2025, 'journal_number': '001', 'hijri_date': 'Mardi 7 Ra
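
Because the per-chunk log above is long, a compact summary can be easier to read. The following is a minimal sketch, assuming each entry has the same `text` / `metadata` shape as `chunks_with_metadata`; the helper name `summarize_chunks` and the sample data are hypothetical, not part of the notebook.

```python
from collections import Counter

def summarize_chunks(chunks):
    """Return (chunk count per document_id, longest chunk length in characters)."""
    per_doc = Counter(
        c["metadata"].get("document_id", "unknown") for c in chunks
    )
    longest = max((len(c["text"]) for c in chunks), default=0)
    return per_doc, longest

# Hypothetical sample in the same shape as chunks_with_metadata
sample = [
    {"text": "a" * 500, "metadata": {"document_id": "F2025001"}},
    {"text": "b" * 512, "metadata": {"document_id": "F2025001"}},
    {"text": "c" * 100, "metadata": {"document_id": "F2025002"}},
]

per_doc, longest = summarize_chunks(sample)
print(dict(per_doc), longest)
```

In the notebook this would be called as `summarize_chunks(chunks_with_metadata)`; the longest length should not exceed the configured `chunk_size`.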

# Embedding

In [4]:
!pip install chromadb sentence-transformers

Collecting chromadb
  Using cached chromadb-1.0.4-cp39-abi3-win_amd64.whl.metadata (7.0 kB)
Collecting build>=1.0.3 (from chromadb)
  Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Using cached chroma_hnswlib-0.7.6.tar.gz (32 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting fastapi==0.115.9 (from chromadb)
  Using cached fastapi-0.115.9-py3-none-any.whl.metadata (27 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Using cached posthog-3.24.1-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Using cached opentelemetry_api-1.32.0-py3-none-any.whl.metadata (1.6 kB)
Collecting opentelemetr

  error: subprocess-exited-with-error
  
  Building wheel for chroma-hnswlib (pyproject.toml) did not run successfully.
  exit code: 1
  
  [5 lines of output]
  running bdist_wheel
  running build
  running build_ext
  building 'hnswlib' extension
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for chroma-hnswlib
ERROR: Failed to build installable wheels for some pyproject.toml based projects (chroma-hnswlib)
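
Since the install above ends with a failed wheel build, it can help to check which packages are actually importable before running the later cells. This is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names in `names` that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Module names as they are imported later in this notebook
missing = missing_packages(["chromadb", "sentence_transformers"])
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All required packages are importable.")
```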


In [5]:
from sentence_transformers import SentenceTransformer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print("🧠 Using device:", device)

# Load model directly
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", device=device)



🧠 Using device: cpu


In [None]:
texts = [chunk["text"] for chunk in chunks_with_metadata]
embeddings = model.encode(texts, batch_size=16, show_progress_bar=True)

Batches:   0%|          | 0/208 [00:00<?, ?it/s]
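
The progress bar reports 208 batches at `batch_size=16`. As a small worked example, assuming batches are formed by simple ceiling division, that bounds the number of encoded texts as follows (the helper names are hypothetical):

```python
import math

def batch_count(n_texts, batch_size):
    """Number of batches needed to cover n_texts items."""
    return math.ceil(n_texts / batch_size)

def text_count_range(n_batches, batch_size):
    """Smallest and largest item counts that produce exactly n_batches."""
    return (n_batches - 1) * batch_size + 1, n_batches * batch_size

low, high = text_count_range(208, 16)
print(low, high)  # 3313 3328
```

So `len(texts)` should fall between 3313 and 3328 for this run.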

# Vector store

In [9]:
!pip install chromadb


In [11]:
import chromadb

# Start ChromaDB persistent client
chroma_client = chromadb.PersistentClient(path="./legal_rag_db/")

# Create or get the collection without an embedding_function:
# embeddings are computed separately and passed in explicitly
collection = chroma_client.get_or_create_collection(
    name=f"legal_documents_{model_name}",
    metadata={"hnsw:space": "cosine"}
)

ModuleNotFoundError: No module named 'chromadb'
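
The collection above is configured with `"hnsw:space": "cosine"`, so query distances are cosine distances, that is, one minus the cosine similarity. The pure-Python sketch below illustrates that quantity on toy vectors; it is an illustration of the metric under that assumption, not ChromaDB's own implementation.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Smaller values mean the chunk is closer to the query; the range is 0 to 2.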

In [11]:
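# Add each chunk together with its precomputed embedding.
# A failure on one chunk is reported and the loop continues.
for idx, (chunk_data, embedding) in enumerate(zip(chunks_with_metadata, embeddings)):
    try:
        collection.add(
            documents=[chunk_data["text"]],
            metadatas=[chunk_data["metadata"]],
            embeddings=[embedding.tolist()],
            ids=[f"chunk_{idx}"]
        )
    except Exception as e:
        # Report the same id that was used in the failed call
        print(f"Error adding chunk_{idx}: {e}")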
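
The loop above makes one `collection.add` call per chunk. Since `add` takes lists for each argument, the same data could be sent in groups instead. The sketch below shows only the slicing logic; the helper name `batched` and the batch size of 500 are assumptions, and the `collection.add` usage is left as a comment because it needs the objects created earlier in the notebook.

```python
def batched(items, size):
    """Yield (start_index, slice) pairs covering `items` in groups of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield start, items[start:start + size]

# Hypothetical usage with the notebook's objects (batch size 500 is an assumption):
# for start, group in batched(chunks_with_metadata, 500):
#     collection.add(
#         documents=[c["text"] for c in group],
#         metadatas=[c["metadata"] for c in group],
#         embeddings=[e.tolist() for e in embeddings[start:start + len(group)]],
#         ids=[f"chunk_{start + i}" for i in range(len(group))],
#     )

print(list(batched(list(range(5)), 2)))
```

The ids keep the same `chunk_{idx}` numbering as the per-chunk loop, so both approaches address the same entries.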

# Test

In [18]:
query = "Quels sont les organes de gouvernance mentionnés dans l'école militaire, et quels pourraient être leurs rôles respectifs ?"
query_embedding = model.encode([query])[0]

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=10,
    include=["documents", "metadatas", "distances"]
)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [19]:
# Print the retrieved documents, separated by blank lines
print("\n\n".join(results['documents'][0]))

de formation paramédicale de santé militaire, désignées ci-après l’ école . art. 2. l’école est un établissement public à caractère administratif, doté de la personnalité morale et de l’autonomie financière. art. 3. l’école est placée sous la tutelle du ministre de la défense nationale. a ce titre, elle est assujettie à toutes les dispositions statutaires et réglementaires applicables aux établissements militaires de formation. les pouvoirs de tutelle sur l’école sont exercés, par délégation, par le

leur profil de formation spécialisée, des fonctions de commandement, de direction, d’encadrement et d’expertise dans le domaine de l’aéronautique et de l’aviation militaire, au sein des structures des forces aériennes et/ou des autres composantes de l’armée nationale populaire. sous-section 3 le corps spécifique des forces navales art. 15. le corps spécifique des forces navales est le regroupement d'officiers de carrière appartenant à des corps sui generis des forces navales, qui exercent,
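
The printout above shows only the text of each hit. Because the query also requested `metadatas` and `distances`, the three lists can be paired to see which document each chunk came from and how close it was. This is a minimal sketch on a hypothetical `fake_results` dictionary with the same nested-list layout that `results['documents'][0]` implies; the helper name `rank_results` is an assumption.

```python
def rank_results(results):
    """Pair documents, metadata and distances for the first query, closest first."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    rows = [
        {
            "distance": dist,
            "document_id": meta.get("document_id"),
            "chunk_number": meta.get("chunk_number"),
            "text": text,
        }
        for text, meta, dist in zip(docs, metas, dists)
    ]
    return sorted(rows, key=lambda row: row["distance"])

# Hypothetical data in the same layout as a single-query result
fake_results = {
    "documents": [["second hit", "first hit"]],
    "metadatas": [[
        {"document_id": "F2025002", "chunk_number": 7},
        {"document_id": "F2025001", "chunk_number": 3},
    ]],
    "distances": [[0.42, 0.17]],
}

for row in rank_results(fake_results):
    print(f"{row['distance']:.2f}  {row['document_id']}  #{row['chunk_number']}  {row['text']}")
```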