# RAG Mini Project
## Milestone #1: Create and Store Chunks
This notebook shows how to create text chunks from MS Word documents.  
I used five Word documents on the topic of Agentic AI.

- Chunks all the Word documents in a directory
- Uses python-docx to extract paragraph text for chunking
- Merges paragraphs up to a configurable maximum chunk size
- Document cleaning is recommended for best results:
  - remove diagrams and unnecessary text
  - merge paragraphs that are semantically similar
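
The cleaning step is not implemented in this notebook. As a rough sketch of what it could look like, the helper below normalizes whitespace and drops paragraphs that look like figure captions or stray labels. The name `clean_paragraphs`, the caption pattern, and the `min_chars` cutoff are all assumptions for illustration and would need tuning for real documents.

```python
import re

# Hypothetical pattern for caption-like lines such as "Figure 1: ..." or "Table 2. ..."
CAPTION_RE = re.compile(r"^(figure|fig\.|table|diagram)\s*\d+\s*[:.]", re.IGNORECASE)


def clean_paragraphs(paragraphs, min_chars=20):
    """Normalize whitespace and drop caption-like or very short paragraphs."""
    cleaned = []
    for text in paragraphs:
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
        if len(text) < min_chars:
            continue  # likely a stray heading, page number, or label
        if CAPTION_RE.match(text):
            continue  # likely a figure or table caption
        cleaned.append(text)
    return cleaned
```

It could be applied to the list of paragraph strings extracted from each document before chunking.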

## Deliverables:
- Selection of multiple documents for your RAG project
- Capture chunks in a pickle file for next step (Embeddings)

In [4]:
import sys
print(sys.executable)


/usr/local/bin/python3


In [5]:
import sys
!{sys.executable} -m pip install python-docx


Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Collecting lxml>=3.1.0
  Downloading lxml-5.3.1-cp311-cp311-macosx_10_9_universal2.whl (8.2 MB)
Installing collected packages: lxml, python-docx
Successfully installed lxml-5.3.1 python-docx-1.1.2


In [3]:
from docx import Document

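def extract_text(file_path):
    """Returns the non-empty paragraphs of a .docx file, stripped of surrounding whitespace."""
    doc = Document(file_path)
    # doc.paragraphs covers body paragraphs only; text inside tables is not included
    stripped = (p.text.strip() for p in doc.paragraphs)
    return [text for text in stripped if text]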

# Example usage
file_path = "Agentic_AI_Introduction.docx"  # Replace with your actual file path
paragraphs = extract_text(file_path)
print(paragraphs[:5])  # Print first 5 paragraphs


['Agentic AI refers to artificial intelligence systems that exhibit autonomous decision-making, adaptability, and goal-directed behavior. Unlike traditional AI, which primarily follows predefined rules or relies on statistical pattern recognition, agentic AI is characterized by its ability to plan, reason, and take initiative in dynamic environments. This type of AI is particularly relevant for applications that require independent problem-solving, such as robotics, autonomous agents, and strategic decision-making systems.']


In [1]:
import sys
!{sys.executable} -m pip install sentence-transformers


Collecting sentence-transformers
  Downloading sentence_transformers-3.4.1-py3-none-any.whl (275 kB)
Collecting transformers<5.0.0,>=4.41.0
  Downloading transformers-4.49.0-py3-none-any.whl (10.0 MB)
Collecting tqdm
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Collecting torch>=1.11.0
  Downloading torch-2.6.0-cp311-none-macosx_11_0_arm64.whl (66.5 MB)
Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp311-cp311-macosx_1

In [4]:
from sentence_transformers import SentenceTransformer, util

# Load pre-trained model for semantic similarity
model = SentenceTransformer('all-MiniLM-L6-v2')

def chunk_text_fixed(paragraphs, chunk_size=500):
    """Merges consecutive paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole as its own chunk.
    """
    chunks = []
    current_chunk = ""

    for paragraph in paragraphs:
        # The +1 accounts for the space inserted when joining paragraphs
        if current_chunk and len(current_chunk) + 1 + len(paragraph) > chunk_size:
            chunks.append(current_chunk.strip())
            current_chunk = paragraph
        elif current_chunk:
            current_chunk += " " + paragraph
        else:
            current_chunk = paragraph

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def chunk_text_semantic(paragraphs, similarity_threshold=0.8):
    """Merges each paragraph into the running chunk while cosine similarity stays at or above the threshold."""
    if not paragraphs:
        return []  # nothing to chunk; avoids an IndexError on paragraphs[0]

    chunks = []
    current_chunk = paragraphs[0]

    for paragraph in paragraphs[1:]:
        # Compare the accumulated chunk (re-encoded on every step) with the next paragraph
        sim_score = util.cos_sim(model.encode(current_chunk), model.encode(paragraph)).item()

        if sim_score >= similarity_threshold:
            current_chunk += " " + paragraph
        else:
            chunks.append(current_chunk.strip())
            current_chunk = paragraph

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

# Test with one document
paragraphs = extract_text("Agentic_AI_Introduction.docx")  # Replace with your actual file
fixed_chunks = chunk_text_fixed(paragraphs, chunk_size=500)
semantic_chunks = chunk_text_semantic(paragraphs, similarity_threshold=0.8)

print("Fixed-size chunks:", fixed_chunks[:3])  # Print first 3 chunks
print("Semantic chunks:", semantic_chunks[:3])  # Print first 3 chunks


Fixed-size chunks: ['Agentic AI refers to artificial intelligence systems that exhibit autonomous decision-making, adaptability, and goal-directed behavior. Unlike traditional AI, which primarily follows predefined rules or relies on statistical pattern recognition, agentic AI is characterized by its ability to plan, reason, and take initiative in dynamic environments. This type of AI is particularly relevant for applications that require independent problem-solving, such as robotics, autonomous agents, and strategic decision-making systems.']
Semantic chunks: ['Agentic AI refers to artificial intelligence systems that exhibit autonomous decision-making, adaptability, and goal-directed behavior. Unlike traditional AI, which primarily follows predefined rules or relies on statistical pattern recognition, agentic AI is characterized by its ability to plan, reason, and take initiative in dynamic environments. This type of AI is particularly relevant for applications that require independe
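
As written in the cell above, `chunk_text_fixed` never splits a paragraph, so a single paragraph longer than `chunk_size` comes out as one oversized chunk. A minimal sketch of one way to handle that is to split long paragraphs at sentence boundaries first. The helper name `split_long_paragraph` and the naive sentence regex are assumptions, not part of the original notebook.

```python
import re


def split_long_paragraph(paragraph, chunk_size=500):
    """Split one paragraph into pieces of at most chunk_size characters,
    breaking at sentence boundaries where possible."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    pieces, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            pieces.append(current)
        # A single sentence longer than chunk_size is hard-cut as a last resort
        while len(sentence) > chunk_size:
            pieces.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        current = sentence
    if current:
        pieces.append(current)
    return pieces
```

One option is to run every extracted paragraph through this helper and pass the flattened result to `chunk_text_fixed`, so that no chunk exceeds the limit.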

In [6]:
import os

# List all .docx files in the working directory, skipping Word's temporary "~$" lock files
docx_files = [f for f in os.listdir() if f.endswith('.docx') and not f.startswith('~$')]

all_chunks_fixed = []
all_chunks_semantic = []

for file in docx_files:
    print(f"Processing: {file}")
    paragraphs = extract_text(file)

    # Fixed-size chunking
    fixed_chunks = chunk_text_fixed(paragraphs, chunk_size=500)
    all_chunks_fixed.extend(fixed_chunks)

    # Semantic chunking
    semantic_chunks = chunk_text_semantic(paragraphs, similarity_threshold=0.8)
    all_chunks_semantic.extend(semantic_chunks)

print(f"Total Fixed-size Chunks: {len(all_chunks_fixed)}")
print(f"Total Semantic Chunks: {len(all_chunks_semantic)}")


Processing: Agentic_AI_Applications.docx
Processing: Agentic_AI_Technical_Aspects.docx
Processing: Agentic_AI_Future_Trends.docx
Processing: Agentic_AI_Challenges.docx
Processing: Agentic_AI_Introduction.docx
Total Fixed-size Chunks: 5
Total Semantic Chunks: 5
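
The loop above pools chunks from all files into flat lists, so the source document of each chunk is lost. If a later retrieval step needs to cite the originating file, one possible approach is to wrap each chunk in a small record. This is only a sketch: `tag_chunks` and the record fields are hypothetical, and the file names below are stand-in data rather than the notebook's documents.

```python
def tag_chunks(chunks, source, strategy):
    """Wrap each chunk string in a dict that records where it came from."""
    return [
        {"text": chunk, "source": source, "strategy": strategy, "index": i}
        for i, chunk in enumerate(chunks)
    ]


# Stand-in data; in the notebook these would be the per-file chunk lists
example_chunks = {
    "doc_a.docx": ["first chunk", "second chunk"],
    "doc_b.docx": ["only chunk"],
}

records = []
for source, chunks in example_chunks.items():
    records.extend(tag_chunks(chunks, source, strategy="fixed"))

print(len(records))
```

Keeping `index` per file makes it possible to reconstruct the original order of chunks within a document.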


In [7]:
import pickle

# Save chunks into a pickle file
with open("chunks.pkl", "wb") as f:
    pickle.dump({"fixed_chunks": all_chunks_fixed, "semantic_chunks": all_chunks_semantic}, f)

print("Chunks saved successfully to chunks.pkl")


Chunks saved successfully to chunks.pkl


In [8]:
import pickle

# Load the pickle file
with open("chunks.pkl", "rb") as f:
    data = pickle.load(f)

print("Loaded pickle file successfully!")
print("Fixed chunks sample:", data["fixed_chunks"][:2])
print("Semantic chunks sample:", data["semantic_chunks"][:2])


Loaded pickle file successfully!
Fixed chunks sample: ['Agentic AI has diverse real-world applications across various industries. In healthcare, it can optimize treatment plans based on patient data. In finance, it enhances algorithmic trading by making independent market predictions. In cybersecurity, agentic AI can autonomously detect and mitigate threats in real time. Furthermore, its integration into robotics enables self-driving cars and industrial automation systems to function with minimal human intervention.', 'The technical foundation of Agentic AI includes reinforcement learning, neuro-symbolic AI, and multi-agent systems. Reinforcement learning enables AI to optimize decision-making through trial and error, while neuro-symbolic approaches integrate logical reasoning with neural networks to enhance adaptability. Multi-agent systems, where multiple AI entities collaborate or compete, further enhance the robustness and scalability of agentic AI. These elements collectively allo
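
Since the next step (Embeddings) depends on this file, it may help to validate its structure at load time instead of only printing samples. The loader below is a sketch under that assumption: `load_chunks` and `EXPECTED_KEYS` are hypothetical names, and the checks mirror the dictionary layout saved earlier in this notebook. Note that `pickle.load` should only be used on files you created yourself, because unpickling untrusted data can execute arbitrary code.

```python
import pickle

# Keys written by the save cell earlier in this notebook
EXPECTED_KEYS = {"fixed_chunks", "semantic_chunks"}


def load_chunks(path="chunks.pkl"):
    """Load the chunk dictionary and check it has the expected shape."""
    with open(path, "rb") as f:
        data = pickle.load(f)

    if not isinstance(data, dict):
        raise ValueError(f"expected a dict, got {type(data).__name__}")

    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    for key in EXPECTED_KEYS:
        # Every chunk should be a non-empty string
        if not all(isinstance(chunk, str) and chunk for chunk in data[key]):
            raise ValueError(f"{key} contains empty or non-string chunks")

    return data
```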