<a href="https://colab.research.google.com/github/keduog/LLM/blob/main/text_chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
# STEP 1: Install required libraries
!apt-get install poppler-utils -y
!apt-get install tesseract-ocr -y
!apt-get install tesseract-ocr-eng -y
!pip install pdf2image pytesseract pandas

# STEP 2: Import libraries
from pdf2image import convert_from_path
import pytesseract
import pandas as pd
import re

# STEP 3: Upload your PDF file
from google.colab import files
uploaded = files.upload()

# Use the name of the first uploaded file
pdf_path = list(uploaded.keys())[0]

# STEP 4: Convert PDF to images
images = convert_from_path(pdf_path)
print(f"Total pages converted: {len(images)}")

# STEP 5: Extract text using OCR from each page
all_text = ""
for i, img in enumerate(images):
    text = pytesseract.image_to_string(img, lang='eng')
    all_text += text + "\n"

print("\n--- Preview of Extracted Text ---\n")
print(all_text[:1000])

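# STEP 6: Chunk text into roughly 200–300 words, breaking on sentence boundaries
def chunk_text(text, min_words=200, max_words=300):
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r'(?<=[.?!])\s+', text)
    chunks, current = [], []

    for sentence in sentences:
        words = sentence.split()
        # Close the current chunk once it is long enough and the next
        # sentence would push it past max_words
        if len(current) >= min_words and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
        # OCR text can contain very long "sentences"; hard-split them so
        # that no chunk exceeds max_words
        while len(current) > max_words:
            chunks.append(" ".join(current[:max_words]))
            current = current[max_words:]

    # Keep whatever is left, even if it is shorter than min_words
    if current:
        chunks.append(" ".join(current))
    return chunks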

chunks = chunk_text(all_text)
print(f"Total chunks created: {len(chunks)}")

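# STEP 7: Add chapter and section info (heuristically)
# The most recent match is carried forward, so a chunk without its own
# heading inherits the chapter/section of the chunk before it.
structured_chunks = []
chapter, section = None, None

for i, chunk in enumerate(chunks):
    chapter_match = re.search(r'(Chapter\s+\d+)', chunk, re.IGNORECASE)
    # Note: this matches any decimal number (e.g. "37.5"), not only
    # section numbers such as "1.2"
    section_match = re.search(r'(\d+\.\d+)', chunk)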

    if chapter_match:
        chapter = chapter_match.group(1)
    if section_match:
        section = section_match.group(1)

    structured_chunks.append({
        "chapter": chapter or f"Unknown-{i//5}",
        "section": section or f"Unknown-{i}",
        "text": chunk
    })

# STEP 8: Create and save CSV
df = pd.DataFrame(structured_chunks)
df.to_csv("chunked_output.csv", index=False)

# STEP 9: Download the CSV
files.download("chunked_output.csv")


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.8).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1.1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


Saving biologygrade9.pdf to biologygrade9 (4).pdf
Total pages converted: 171

--- Preview of Extracted Text ---

Biology

Student Textbook

Grade 9

1%)
=
c
fer
)
=]
a
—
s
A:
fey
fe]
fe)
x

6 epein

Federal Democratic Republic of Ethiopia ISBN 37 Bes n-U-OURch Federal Democratic Republic of Ethiopia
Ministry of Education Hl & Ministry of Education

9 6">

 

[il

Pokal K

This textbook is the property of your school.
Take good care not to damage or lose it.

Here are 10 ideas to help take care of the book:

Cover the book with protective material, such as plastic,
old newspapers or magazines.

Always keep the book in a clear dry place.

Be sure your hands are clearn when you use the book.
Do not write on the cover or inside pages.

Use a piece of paper or cardboard as a bookmark.
Never tear or cut out any picture or pages.

Repair any torn pages with paste or tape.

Pack the book carefully when you pleace it in your
school bag.

Handle the book with care when passing it to another
per

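The sentence-based chunking in STEP 6 can be checked on a small synthetic string before running OCR on a full PDF. The sketch below is one possible variant, not the notebook's exact function: the name `chunk_by_sentences` and the tiny `min_words`/`max_words` defaults are illustrative assumptions, chosen so the behaviour is easy to see. It also hard-splits any run of words that is still longer than `max_words`, which may happen when OCR output lacks punctuation.

```python
import re


def chunk_by_sentences(text, min_words=5, max_words=8):
    """Group sentences into chunks of at most max_words words.

    Hypothetical helper for illustration; the limits are deliberately tiny.
    """
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())
    chunks, current = [], []
    for sentence in sentences:
        words = sentence.split()
        # Flush when the chunk is long enough and the next sentence would overflow it
        if len(current) >= min_words and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
        # Hard-split anything that still exceeds the limit
        while len(current) > max_words:
            chunks.append(" ".join(current[:max_words]))
            current = current[max_words:]
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Cells are the basic unit of life. "
    "Plants make food by photosynthesis. "
    "Enzymes speed up reactions in the body."
)
demo_chunks = chunk_by_sentences(sample)
for n, c in enumerate(demo_chunks):
    print(n, len(c.split()), c)
```

With real textbook text the limits would go back to something like 200 and 300 words. The last chunk can still be shorter than `min_words`, since leftover words are kept rather than dropped.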

In [14]:
# STEP 1: Install required libraries
!apt-get install poppler-utils -y
!apt-get install tesseract-ocr -y
!apt-get install tesseract-ocr-eng -y
!pip install pdf2image pytesseract pandas

# STEP 2: Import libraries
from pdf2image import convert_from_path
import pytesseract
import pandas as pd
import re

# STEP 3: Upload PDF file
from google.colab import files
uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]

# STEP 4: Convert PDF to images
images = convert_from_path(pdf_path)
print(f"Total pages converted: {len(images)}")

# STEP 5: Extract text from each image
all_text = ""
for img in images:
    text = pytesseract.image_to_string(img, lang='eng')
    all_text += text + "\n"

print("\n--- Preview of Extracted Text ---\n")
print(all_text[:1000])

# STEP 6: Identify units
# The capturing group makes re.split keep each "Unit N" heading in its result
unit_pattern = r'(Unit\s+\d+)'  # e.g., "Unit 1"

# Split by units
unit_blocks = re.split(unit_pattern, all_text)
structured_data = []

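# STEP 7: Iterate through units and extract sections
# unit_blocks alternates [front matter, "Unit 1", text, "Unit 2", text, ...],
# so headings sit at odd indices and unit_blocks[0] is skipped.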
for i in range(1, len(unit_blocks), 2):
    unit_title = unit_blocks[i].strip()
    unit_text = unit_blocks[i + 1]

    # Split sections inside this unit. Heuristic: a section header is a
    # standalone line of 6-51 characters that starts with a capital letter.
    section_splits = re.split(r'(?=\n[A-Z][^\n]{5,50}\n)', unit_text)

    for section in section_splits:
        section = section.strip()
        if not section:
            continue
        lines = section.split('\n')
        section_title = lines[0].strip()
        section_body = " ".join(lines[1:]).strip()

        # Chunk the section body into pieces of at most 300 words
        words = section_body.split()
        chunks = [' '.join(words[start:start + 300]) for start in range(0, len(words), 300)]

        for chunk in chunks:
            structured_data.append({
                "unit": unit_title,
                "section": section_title,
                "text": chunk
            })

# STEP 8: Create DataFrame and save CSV
df = pd.DataFrame(structured_data)
df.to_csv("structured_chunks.csv", index=False)

# STEP 9: Download CSV
files.download("structured_chunks.csv")




Saving biology.pdf to biology.pdf
Total pages converted: 171

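The unit split relies on `re.split` keeping captured headings in its result. An unanchored pattern such as `(Unit\s+\d+)` also matches in-sentence mentions like "See Unit 1", so the sketch below tries a line-anchored variant on a small synthetic string. The names `UNIT_HEADING` and `split_units` are hypothetical, and the assumption that each unit heading sits alone on its own line may not hold for every OCR result.

```python
import re

# Assumption: unit headings sit on their own line, e.g. "Unit 1"
UNIT_HEADING = re.compile(r'^(Unit\s+\d+)\s*$', re.MULTILINE)


def split_units(text):
    """Return {unit heading: unit text}; text before the first heading is dropped."""
    parts = UNIT_HEADING.split(text)
    # parts = [front matter, heading 1, body 1, heading 2, body 2, ...]
    return {parts[i].strip(): parts[i + 1].strip() for i in range(1, len(parts), 2)}


sample = (
    "Contents\n"
    "See Unit 1 for cell biology.\n"
    "Unit 1\n"
    "Cells are the basic unit of life.\n"
    "Unit 2\n"
    "Plants make food by photosynthesis.\n"
)
units = split_units(sample)
print(units)
```

Because repeated headings would overwrite each other in a dictionary, a list of `(heading, text)` pairs may be a better fit if the same "Unit N" line appears more than once, for example in a table of contents.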