In [1]:
from langchain.docstore.document import Document
from langchain_openai import ChatOpenAI
from langchain_community import embeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pypdf import PdfReader


In [13]:
document_path = "/Users/mariaborca/Documents/AI_2023-2024/Semestrul 4/Machine Learning/Know-Your-Rights/data/codul-muncii.pdf"

reader = PdfReader(document_path)
def extract_text_from_pdf(reader):
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text.strip()
    
text = extract_text_from_pdf(reader)
print(f"Extracted text length: {len(text)} characters")

Extracted text length: 220676 characters


## 1. Character Text Splitting

**Example 1:** Bad Setup
This chunking isn't good. Because we used an empty string ('') as the separator and no overlap, some words get split between chunks, and the meaning is hard to follow. 

In [25]:
from langchain_text_splitters import CharacterTextSplitter

char_splitter_small_chunks = CharacterTextSplitter(
    separator='', 
    chunk_size=100, 
    chunk_overlap=0
)

small_chunks = char_splitter_small_chunks.split_text(text)

small_char_documents = [Document(page_content=chunk) for chunk in small_chunks]

print("=== Example 1: Bad Character Splitter ===")
print(f"Total chunks: {len(small_char_documents)}\n")

print(f"--- Chunk 1 (Length: {len(small_char_documents[0].page_content)} characters) ---")
print(small_char_documents[0].page_content)
print("\n")

print(f"--- Chunk 2 (Length: {len(small_char_documents[1].page_content)} characters) ---")
print(small_char_documents[1].page_content)


=== Example 1: Bad Character Splitter ===
Total chunks: 2207

--- Chunk 1 (Length: 100 characters) ---
Codul muncii 
Legea 53 din 2003 
Publicata in Monitorul Oficial, nr. 75 din 5 februarie 2003 
Republ


--- Chunk 2 (Length: 100 characters) ---
icata in Monitorul Oficial, nr. 345 din 18 mai 2011 
Actualizata in 17 octombrie 2022 prin Legea 283


**Example 2: Character Splitter with Newline Separator**

In this test, I used `\n` as the separator, a `chunk_size` of 100, and an overlap of 10. This produced **better results** than the previous one because it avoids breaking words in the middle — it tries to split at line breaks instead.

However, the chunks are **still a bit messy** because legal texts often have long lines or paragraphs. That’s why many chunks ended up **slightly larger than 100 characters**.

> ⚠️ **Why those warning messages appear:**
> When a line or paragraph is **longer than `chunk_size`** and can't be split further (due to the separator), the splitter **includes the whole segment** anyway. That’s why you see:

```
Created a chunk of size 112, which is longer than the specified 100
```

So while this setup avoids broken words, it's still not ideal for structured documents like laws. A more advanced splitter (like `RecursiveCharacterTextSplitter`) will do a better job.


In [37]:
char_splitter = CharacterTextSplitter(
    separator='\n', 
    chunk_size=100, 
    chunk_overlap=10
)

chunks = char_splitter.split_text(text)

char_documents = [Document(page_content=chunk) for chunk in chunks]

print("=== Example 1: Character Splitter ===")
print(f"Total chunks: {len(char_documents)}\n")

print(f"--- Chunk 1 (Length: {len(char_documents[0].page_content)} characters) ---")
print(char_documents[0].page_content)
print("\n")

print(f"--- Chunk 2 (Length: {len(char_documents[1].page_content)} characters) ---")
print(char_documents[1].page_content)

Created a chunk of size 112, which is longer than the specified 100
Created a chunk of size 111, which is longer than the specified 100
Created a chunk of size 124, which is longer than the specified 100
Created a chunk of size 107, which is longer than the specified 100
Created a chunk of size 117, which is longer than the specified 100
Created a chunk of size 120, which is longer than the specified 100
Created a chunk of size 113, which is longer than the specified 100
Created a chunk of size 118, which is longer than the specified 100
Created a chunk of size 118, which is longer than the specified 100
Created a chunk of size 110, which is longer than the specified 100
Created a chunk of size 116, which is longer than the specified 100
Created a chunk of size 117, which is longer than the specified 100
Created a chunk of size 111, which is longer than the specified 100
Created a chunk of size 103, which is longer than the specified 100
Created a chunk of size 109, which is longer tha

=== Example 1: Character Splitter ===
Total chunks: 2416

--- Chunk 1 (Length: 92 characters) ---
Codul muncii 
Legea 53 din 2003 
Publicata in Monitorul Oficial, nr. 75 din 5 februarie 2003


--- Chunk 2 (Length: 57 characters) ---
Republicata in Monitorul Oficial, nr. 345 din 18 mai 2011


### **Example 3: Recursive Character Splitter**

In this test, I used `RecursiveCharacterTextSplitter` with a `chunk_size` of 1000 and `chunk_overlap` of 20. This splitter tries to break the text in a smart order: first at paragraph breaks (`\n\n`), then lines, then spaces, and only breaks words if needed.

The result is **much better** than simple character-based chunking. Most of the chunks keep the context and avoid cutting sentences in awkward places.

> ⚠️ **Still, some sentences are cut mid-way**
> For example, one chunk ends right in the middle of a legal clause, and the rest continues in the next chunk. This happens because the chunk size is a bit too small to fit the full idea or paragraph.

📌 **Possible improvement:**
Increasing `chunk_size` (e.g., to 1200 or 1500) might help preserve full thoughts, especially in legal texts where paragraphs are long.


In [57]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1400, 
    chunk_overlap=100, 
    separators=["\n\n", "\n"]
)
chunks = text_splitter.split_text(text)
char_documents = [Document(page_content=chunk) for chunk in chunks]
print("=== Example 2: Recursive Character Splitter ===")
print(f"Total chunks: {len(char_documents)}\n")
print(f"--- Chunk 1 (Length: {len(char_documents[0].page_content)} characters) ---")
print(char_documents[0].page_content)
print("\n")
print(f"--- Chunk 2 (Length: {len(char_documents[1].page_content)} characters) ---")
print(char_documents[1].page_content)
print("\n")
print(f"--- Chunk 3 (Length: {len(char_documents[2].page_content)} characters) ---")
print(char_documents[2].page_content)


=== Example 2: Recursive Character Splitter ===
Total chunks: 168

--- Chunk 1 (Length: 1321 characters) ---
Codul muncii 
Legea 53 din 2003 
Publicata in Monitorul Oficial, nr. 75 din 5 februarie 2003 
Republicata in Monitorul Oficial, nr. 345 din 18 mai 2011 
Actualizata in 17 octombrie 2022 prin Legea 283 din 2022 
 
 
TITLUL I 
Dispozitii generale 
 
CAPITOLUL I 
Domeniul de aplicare 
 
 
Articolul 1 
(1) Prezentul cod reglementeaza domeniul raporturilor de munca, modul in care se efectueaza controlul aplicarii 
reglementarilor din domeniul raporturilor de munca, precum si jurisdictia muncii. 
(2) Prezentul c od se aplica si raporturilor de munca reglementate prin legi speciale, numai in masura in care 
acestea nu contin dispozitii specifice derogatorii. 
(3) De la prevederile art. 39 alin. (1) lit. m^1), art. 40 alin. (2) lit. j), art. 194 alin. (2) pot fi prevazute prin legi 
speciale dispozitii specifice derogatorii numai pentru raporturile de munca sau de serviciu desfasurate d

### 🧪 Example 3: Semantic Chunking (OpenAI)

This method uses semantic similarity to split the text, aiming to group together sentences with similar meaning.

**Observations:**

* The chunk sizes vary **significantly**. For instance, Chunk 1 is \~680 characters, Chunk 4 is over **11,000 characters**, while Chunks 2 and 3 are only **8 and 46 characters**, respectively.
* Chunks 2 and 3 are clearly **broken mid-sentence**, likely because the model misinterpreted legal references like "art. 39 alin. (1)" as semantic boundaries. These are **typical artifacts in legal texts**, which often use structured references that don't signal real content boundaries.
* Despite these issues, the **overall quality** of the larger chunks is good. Most of the longer chunks preserve **semantic coherence**, keeping related ideas together better than naive character splitting.
* However, the method is still **not ideal for juridical or structured legal documents**, which rely heavily on **formatting, articles, and numbered clauses**. It may require preprocessing or tuning to better handle such content.

**Recommendation:** For legal texts, semantic chunking can help retain meaning, but it should be combined with structure-aware methods to avoid fragmenting important sections.


In [None]:
# Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(
    OpenAIEmbeddings(api_key="add here the key"
), breakpoint_threshold_type="percentile"
)

documents = text_splitter.create_documents([text])
print("=== Example 3: Semantic Chunk Splitter ===")
print(f"Total chunks: {len(documents)}\n")
print(f"--- Chunk 1 (Length: {len(documents[0].page_content)} characters) ---")
print(documents[0].page_content)
print("\n")
print(f"--- Chunk 2 (Length: {len(documents[1].page_content)} characters) ---")
print(documents[1].page_content)


=== Example 3: Semantic Chunk Splitter ===
Total chunks: 65

--- Chunk 1 (Length: 685 characters) ---
Codul muncii 
Legea 53 din 2003 
Publicata in Monitorul Oficial, nr. 75 din 5 februarie 2003 
Republicata in Monitorul Oficial, nr. 345 din 18 mai 2011 
Actualizata in 17 octombrie 2022 prin Legea 283 din 2022 
 
 
TITLUL I 
Dispozitii generale 
 
CAPITOLUL I 
Domeniul de aplicare 
 
 
Articolul 1 
(1) Prezentul cod reglementeaza domeniul raporturilor de munca, modul in care se efectueaza controlul aplicarii 
reglementarilor din domeniul raporturilor de munca, precum si jurisdictia muncii. (2) Prezentul c od se aplica si raporturilor de munca reglementate prin legi speciale, numai in masura in care 
acestea nu contin dispozitii specifice derogatorii. (3) De la prevederile art.


--- Chunk 2 (Length: 8 characters) ---
39 alin.


In [45]:
print("=== Example 3: Semantic Chunk Splitter ===")
print(f"Total chunks: {len(documents)}\n")
print(f"--- Chunk 1 (Length: {len(documents[0].page_content)} characters) ---")
print(documents[0].page_content)
print("\n")
print(f"--- Chunk 2 (Length: {len(documents[1].page_content)} characters) ---")
print(documents[1].page_content)
print("\n")
print(f"--- Chunk 3 (Length: {len(documents[2].page_content)} characters) ---")
print(documents[2].page_content)
print("\n")
print(f"--- Chunk 4 (Length: {len(documents[3].page_content)} characters) ---")
print(documents[3].page_content)

=== Example 3: Semantic Chunk Splitter ===
Total chunks: 65

--- Chunk 1 (Length: 685 characters) ---
Codul muncii 
Legea 53 din 2003 
Publicata in Monitorul Oficial, nr. 75 din 5 februarie 2003 
Republicata in Monitorul Oficial, nr. 345 din 18 mai 2011 
Actualizata in 17 octombrie 2022 prin Legea 283 din 2022 
 
 
TITLUL I 
Dispozitii generale 
 
CAPITOLUL I 
Domeniul de aplicare 
 
 
Articolul 1 
(1) Prezentul cod reglementeaza domeniul raporturilor de munca, modul in care se efectueaza controlul aplicarii 
reglementarilor din domeniul raporturilor de munca, precum si jurisdictia muncii. (2) Prezentul c od se aplica si raporturilor de munca reglementate prin legi speciale, numai in masura in care 
acestea nu contin dispozitii specifice derogatorii. (3) De la prevederile art.


--- Chunk 2 (Length: 8 characters) ---
39 alin.


--- Chunk 3 (Length: 46 characters) ---
(1) lit. m^1), art. 40 alin. (2) lit. j), art.


--- Chunk 4 (Length: 11450 characters) ---
194 alin. (2) pot fi prevazu

In [47]:
from pydantic import BaseModel
from langchain import hub
from langchain.chains import create_extraction_chain_pydantic

prompt_template = hub.pull("wfh/proposal-indexing")
llm = ChatOpenAI(model="gpt-4o-mini", api_key=OPEN_AI_KEY, temperature=0.0)
runnable = prompt_template | llm

class Sentences(BaseModel):
    sentences: list[str]

extraction_chain = create_extraction_chain_pydantic(pydantic_schema = Sentences, llm=llm)
def get_propositions(text):
    runnable_output = runnable.invoke({"input": text}).content
    propositions = extraction_chain.invoke(runnable_output)["text"][0].sentences
    return propositions

paragraphs = text.split("\n\n")
text_propositions = []
for i, par in enumerate(paragraphs):
   propositions = get_propositions(par)
   text_propositions.extend(propositions)
   print(f"Done with paragraph {i+1} / {len(paragraphs)}")

print("=== Example 4: Propositions Extraction ===")
print(f"Total propositions: {len(text_propositions)}\n")
print(f"--- Proposition 1 (Length: {len(text_propositions[0])} characters) ---")
print(text_propositions[0])
print("\n")
print(f"--- Proposition 2 (Length: {len(text_propositions[1])} characters) ---")
print(text_propositions[1])
print("\n")
print(f"--- Proposition 3 (Length: {len(text_propositions[2])} characters) ---")
print(text_propositions[2])
print("\n")
print(f"--- Proposition 4 (Length: {len(text_propositions[3])} characters) ---")
print(text_propositions[3])



Done with paragraph 1 / 1
=== Example 4: Propositions Extraction ===
Total propositions: 1

--- Proposition 1 (Length: 53 characters) ---
Codul muncii este reglementat prin Legea 53 din 2003.




IndexError: list index out of range

In [54]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from llama_index.core import SimpleDirectoryReader
import os

# Step 1: Load text from a PDF
documents = SimpleDirectoryReader(input_dir="./data").load_data()
raw_text = "\n\n".join(doc.text for doc in documents)

# Step 2: Initial character-based chunking (simple + fast)
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
initial_chunks = splitter.split_text(raw_text)

# Step 3: Setup the LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0, api_key=OPEN_AI_KEY)

# Step 4: Function to refine chunks using GPT
def refine_chunk(chunk_text):
    prompt = f"""
You are an expert in Romanian legal documents.

The following text is part of a legal code (e.g. Codul Fiscal). Your task is to:

1. Split it into coherent legal sections (e.g., one article or paragraph group).
2. Each section must be complete and self-contained.
3. Prefix each section with its heading or number (like "Articolul 12").

Text:
\"\"\"
{chunk_text}
\"\"\"

Return the result as a list of clearly separated sections, numbered or titled.
    """
    response = llm.invoke(prompt)
    return response.content.strip()

# Step 5: Process each initial chunk
refined_chunks = []
for i, chunk in enumerate(initial_chunks):
    print(f"Processing chunk {i+1}/{len(initial_chunks)}...")
    refined = refine_chunk(chunk)
    refined_chunks.append(refined)

# Step 6: Save or print results
with open("refined_legal_chunks.txt", "w", encoding="utf-8") as f:
    for i, refined in enumerate(refined_chunks):
        f.write(f"\n\n--- Refined Chunk {i+1} ---\n{refined}")
        print(f"\n--- Refined Chunk {i+1} ---\n{refined[:1000]}")


Processing chunk 1/3879...
Processing chunk 2/3879...
Processing chunk 3/3879...
Processing chunk 4/3879...
Processing chunk 5/3879...
Processing chunk 6/3879...
Processing chunk 7/3879...
Processing chunk 8/3879...
Processing chunk 9/3879...
Processing chunk 10/3879...
Processing chunk 11/3879...
Processing chunk 12/3879...
Processing chunk 13/3879...
Processing chunk 14/3879...
Processing chunk 15/3879...
Processing chunk 16/3879...
Processing chunk 17/3879...
Processing chunk 18/3879...
Processing chunk 19/3879...
Processing chunk 20/3879...
Processing chunk 21/3879...
Processing chunk 22/3879...
Processing chunk 23/3879...
Processing chunk 24/3879...
Processing chunk 25/3879...
Processing chunk 26/3879...
Processing chunk 27/3879...
Processing chunk 28/3879...
Processing chunk 29/3879...
Processing chunk 30/3879...
Processing chunk 31/3879...
Processing chunk 32/3879...
Processing chunk 33/3879...
Processing chunk 34/3879...
Processing chunk 35/3879...
Processing chunk 36/3879...
P

KeyboardInterrupt: 