# Kleio Walkthrough

## About

Author: Jared Neumann

This package takes a PDF document (with or without a text layer), an image, or raw text and returns a complete, corrected version of that text. Where necessary, the text is first extracted with common OCR tools. It is then split into chunks and passed to an LLM, which corrects each chunk. Additional functions are available, such as:
- Layout analysis and annotation
- Revised collation (e.g., removing headers and footers)
- Translation

## Import Statements

In [1]:
# we'll have to set up a duplicate logger in the notebook
import logging
import os

# Get the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create a StreamHandler for the notebook
stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
stream_handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(stream_handler)

# intended to prevent double logging; note that the root logger has no parent,
# so setting propagate on it has no effect
logger.propagate = False

from kleio.ocr import *
from kleio.image_utils import *
from kleio.correction import *
from kleio.collation import *

import matplotlib.pyplot as plt

## Getting Raw Text

Four input types are supported: PDFs with a text layer, PDFs without a text layer, images, and plain text. The type is inferred automatically, and the extracted text is returned. Several OCR options are available.
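To make the idea of type inference concrete, here is a minimal sketch that classifies an input path by its extension. The function name `guess_input_kind` and the extension sets are hypothetical and are not part of Kleio; the library's own rules may differ, and telling a text-layer PDF from a scanned one requires opening the file.

```python
from pathlib import Path

# Hypothetical extension sets, for illustration only; not Kleio's actual rules.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
TEXT_EXTENSIONS = {".txt"}


def guess_input_kind(filepath: str) -> str:
    """Classify a path as 'pdf', 'image', 'text', or 'unknown' by its extension."""
    suffix = Path(filepath).suffix.lower()
    if suffix == ".pdf":
        # Whether the PDF has a text layer can only be checked by opening it.
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in TEXT_EXTENSIONS:
        return "text"
    return "unknown"


for path in ["../tests/test_input/test_0.pdf", "../tests/test_input/test_1.jpg"]:
    print(path, "->", guess_input_kind(path))
```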

In [2]:
example_filepath_0 = "../tests/test_input/test_0.pdf"
example_filepath_1 = "../tests/test_input/test_1.jpg"
example_filepath_2 = "../tests/test_input/test_2.pdf"

IMAGE_CONFIG = {
    "grayscale": True,
    "resize": False,
    "threshold": True,
    "deskew": False,
    "dilate_and_erode": False,
    "blur": False
}

text_0 = retrieve_text(example_filepath_0, image_kwargs=IMAGE_CONFIG)
text_1 = retrieve_text(example_filepath_1, image_kwargs=IMAGE_CONFIG)
text_2 = retrieve_text(example_filepath_2, image_kwargs=IMAGE_CONFIG)

2024-01-14 02:52:48,419 - kleio.ocr - INFO - Retrieving text from ../tests/test_input/test_0.pdf
2024-01-14 02:52:48,420 - kleio.ocr - INFO - File provided
2024-01-14 02:52:48,481 - kleio.ocr - INFO - Getting text from PDF file ../tests/test_input/test_0.pdf
2024-01-14 02:52:49,719 - kleio.ocr - INFO - Retrieving text from ../tests/test_input/test_1.jpg
2024-01-14 02:52:49,720 - kleio.ocr - INFO - File provided
2024-01-14 02:52:49,720 - kleio.ocr - INFO - Getting page text from image file ../tests/test_input/test_1.jpg
2024-01-14 02:52:49,735 - kleio.ocr - INFO - Preprocessing image
2024-01-14 02:52:50,943 - kleio.ocr - INFO - Retrieving text from ../tests/test_input/test_2.pdf
2024-01-14 02:52:50,944 - kleio.ocr - INFO - File provided
2024-01-14 02:52:50,945 - kleio.ocr - INFO - Getting text from PDF file ../tests/test_input/test_2.pdf
2024-01-14 02:52:50,945 - kleio.ocr - INFO - Converting PDF file ../tests/test_input/test_2.pdf to image
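The `grayscale` and `threshold` flags in `IMAGE_CONFIG` name two common preprocessing steps. The sketch below shows what such steps typically do, using plain Python on a tiny hand-made image. It is an illustration under assumptions, not Kleio's implementation: the helper names, the luminance weights, and the cutoff of 128 are all choices made for this example.

```python
def to_grayscale(rgb_pixels):
    """Convert rows of (R, G, B) tuples into rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_pixels
    ]


def binarize(gray_pixels, cutoff=128):
    """Map each gray value to 255 (paper) or 0 (ink) using a fixed cutoff."""
    return [[255 if value >= cutoff else 0 for value in row] for row in gray_pixels]


# A 2x2 "scan": light paper in the left column, dark ink in the right column.
tiny_image = [
    [(250, 250, 245), (40, 35, 30)],
    [(200, 190, 180), (90, 80, 70)],
]

binary = binarize(to_grayscale(tiny_image))
print(binary)  # [[255, 0], [255, 0]]
```

A fixed cutoff is the simplest choice; unevenly lit scans usually benefit from an adaptive or automatically chosen threshold instead.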


In [3]:
# let's inspect one of the outputs
print(text_1)
print(len(text_1["pages"]))
for page in text_1["pages"]:
    print(page)

{'filename': 'test_1.jpg', 'extension': 'JPG', 'pages': ['PREFACE.\n\nHavine concluded our Firft Volume, we\nwould be deficient in gratitude did we not return\nthinks to the Public, in general, for the favourable\nreception our labours have experienced; and to\nthofe Scientific Gentlemen, in particular, who have\naflifted us with Communications, as well as Hints\nrefpecting the future condudting of the Work,\n\nAs the grand Object of it is to diffufe Philofo-\nphical Knowledge among every Clafs of Society,\nand to give the Public as early an Account as pof-\nfible of every thing new or curious in the fcientific\nWorld, both at Home and on the Continent, we\nflatter ourfelves with the hope that the fame liberal\nPatronage we have hitherto experienced will be\ncontinued; and that Scientific Men will afford us\nthat Support and Affiftance which they may think\nour Attempt entitled to. Whatever may be our\nfuture Succefs, no Exertions fhall be wanting on our\npart to render the Work ufeful
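As the output shows, the result is a dictionary with `filename`, `extension`, and `pages` keys, where `pages` is a list of strings. A small helper such as the hypothetical `summarize_pages` below (not part of Kleio) can give a quick overview of a document without printing every page. The sample dictionary is made up for the example and only mirrors that structure.

```python
def summarize_pages(document):
    """Return one (page_number, line_count, character_count) tuple per page."""
    return [
        (number, len(page.splitlines()), len(page))
        for number, page in enumerate(document["pages"], start=1)
    ]


# Made-up sample with the same keys as the dictionary printed above.
sample = {
    "filename": "sample.jpg",
    "extension": "JPG",
    "pages": ["PREFACE.\n\nFirst line\nSecond line", "Another page"],
}

for number, lines, chars in summarize_pages(sample):
    print(f"page {number}: {lines} lines, {chars} characters")
```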

## Correcting the Raw Text

In [4]:

CORRECTION_KWARGS = {
    "filename": text_0["filename"],
    "extension": text_0["extension"],
    "filetype": "graphic book",
    "ocr_software": "pytesseract",
    "image_preprocessing_software": "opencv",
    "date": "modern",
    "language": "English",
    "comments": "This is the Dungeons and Dragons 5e Player's Handbook",
}

In [6]:
correction = get_correction(
    text=text_0,
    api_key=os.getenv("OPENAI_API_KEY"),
    llm_provider="openai",
    model_name="gpt-3.5-turbo-16k",
    temperature=0,
    more_info=CORRECTION_KWARGS,
    chunk_size=4096,
)

2024-01-14 02:53:13,283 - kleio.correction - INFO - Getting correction from LLM
2024-01-14 02:53:13,284 - kleio.llm_utils - INFO - Creating OpenAI prompt for OCR correction task
2024-01-14 02:53:13,285 - kleio.llm_utils - INFO - Creating OpenAI LLM
  0%|          | 0/293 [00:00<?, ?it/s]2024-01-14 02:53:22,629 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  1%|          | 2/293 [00:09<22:10,  4.57s/it]2024-01-14 02:55:08,290 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  1%|          | 3/293 [01:54<3:45:41, 46.70s/it]2024-01-14 02:55:21,602 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  1%|▏         | 4/293 [02:08<2:44:54, 34.24s/it]2024-01-14 02:55:37,985 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  2%|▏         | 5/293 [02:24<2:14:32, 28.03s/it]2024-01-14 02:55:51,325 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
  2%|▏         | 6/293 [02:37<1:50:44, 23.15s/it]2024-01-14 02:56:06,436 - httpx - INFO - H
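The `chunk_size` argument suggests that the text is split into pieces before being sent to the LLM. How Kleio does this, and whether the size is counted in characters or tokens, is not shown in this walkthrough. As a rough sketch under the assumption of a character budget, the hypothetical `split_into_chunks` below packs whole paragraphs into chunks and hard-splits any paragraph that is too long on its own.

```python
def split_into_chunks(text, chunk_size=4096):
    """Greedily pack paragraphs into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Hard-split a paragraph that would not fit in a chunk by itself.
        pieces = [
            paragraph[i:i + chunk_size]
            for i in range(0, len(paragraph), chunk_size)
        ] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                # The piece does not fit; close the current chunk and start anew.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("aaaa\n\nbbbb\n\ncccc", chunk_size=10))
```

Splitting on paragraph boundaries keeps sentences intact where possible, which tends to give the model more usable context than cutting at arbitrary offsets.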

In [None]:
for page in correction:
    print(page)

PREFACE.

Having concluded our First Volume, we
would be deficient in gratitude did we not return
thanks to the Public, in general, for the favourable
reception our labours have experienced; and to
those Scientific Gentlemen, in particular, who have
assisted us with Communications, as well as Hints
respecting the future conducting of the Work.

As the grand Object of it is to diffuse Philoso-
phical Knowledge among every Class of Society,
and to give the Public as early an Account as pos-
sible of everything new or curious in the scientific
World, both at Home and on the Continent, we
flatter ourselves with the hope that the same liberal
Patronage we have hitherto experienced will be
continued; and that Scientific Men will afford us
that Support and Assistance which they may think
our Attempt entitled to. Whatever may be our
future Success, no Exertions shall be wanting on our
part to render the Work useful to Society, and espe-
cially to the Arts and Manufactures of Great Britain
whic
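The corrected page still contains words hyphenated across line breaks, such as "Philoso-" followed by "phical". Kleio handles this during collation with an LLM; as a simpler, non-LLM approximation, a regular expression can rejoin such words. The function name `join_word_breaks` is hypothetical and used only for this sketch.

```python
import re


def join_word_breaks(text):
    """Rejoin words that were hyphenated across a line break."""
    # A word character, a hyphen, a newline, then another word character.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


page = "As the grand Object of it is to diffuse Philoso-\nphical Knowledge"
print(join_word_breaks(page))
```

This heuristic also removes the hyphen from genuine compounds that happen to break at the end of a line (e.g., "well-" followed by "known"), which is one reason a context-aware approach can do better.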

## Collating the Corrected Pages

In [None]:
# The corrected pages still have some issues we may want to resolve, e.g.:
# - words that are split across lines,
# - words duplicated across adjacent pages (catchwords, common in older texts),
# - headers and footers that are not part of the main text.

# Collating the pages addresses these issues
COLLATION_KWARGS = {
    "remove_headers_and_footers": True,
    "remove_page_numbers": True,
    "remove_excess_space": True,
    "remove_empty_lines": False,
    "remove_line_breaks": False,
    "remove_word_breaks": True,
    "add_section_tags": True,
    "keep_page_breaks": True,
}

collated_text = collate(
    pages=correction,
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="gpt-3.5-turbo-16k",
    temperature=0,
    llm_provider="openai",
    chunk_size=4096,
    more_info=COLLATION_KWARGS,
)

2024-01-14 02:47:57,207 - kleio.llm_utils - INFO - Creating OpenAI prompt for OCR correction task
2024-01-14 02:47:57,208 - kleio.llm_utils - INFO - Creating OpenAI LLM
2024-01-14 02:48:01,073 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [None]:
print(collated_text)

[SECTION_HEADER]PREFACE.[/SECTION_HEADER]

Having concluded our First Volume, we would be deficient in gratitude did we not return thanks to the Public, in general, for the favourable reception our labours have experienced; and to those Scientific Gentlemen, in particular, who have assisted us with Communications, as well as Hints respecting the future conducting of the Work.

As the grand Object of it is to diffuse Philosophical Knowledge among every Class of Society, and to give the Public as early an Account as possible of everything new or curious in the scientific World, both at Home and on the Continent, we flatter ourselves with the hope that the same liberal Patronage we have hitherto experienced will be continued; and that Scientific Men will afford us that Support and Assistance which they may think our Attempt entitled to. Whatever may be our future Success, no Exertions shall be wanting on our part to render the Work useful to Society, and especially to the Arts and Manufac
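With `add_section_tags` enabled, the collated text marks headings with `[SECTION_HEADER]...[/SECTION_HEADER]`, as seen in the output above. The following sketch shows one way to split such text into (header, body) pairs for downstream use. The helper `split_sections` is hypothetical, it assumes the tag appears exactly in the form shown, and the sample text (including the second header) is made up for the example.

```python
import re

# Matches the tag form seen in the collated output above.
SECTION_HEADER = re.compile(r"\[SECTION_HEADER\](.*?)\[/SECTION_HEADER\]", re.DOTALL)


def split_sections(collated):
    """Return (header, body) pairs; text before the first header gets header ''."""
    # With one capturing group, split() yields:
    # [text before first header, header 1, body 1, header 2, body 2, ...]
    parts = SECTION_HEADER.split(collated)
    sections = []
    if parts[0].strip():
        sections.append(("", parts[0].strip()))
    for header, body in zip(parts[1::2], parts[2::2]):
        sections.append((header.strip(), body.strip()))
    return sections


sample = (
    "[SECTION_HEADER]PREFACE.[/SECTION_HEADER]\n\nHaving concluded.\n\n"
    "[SECTION_HEADER]CONTENTS[/SECTION_HEADER]\n\nList."
)
for header, body in split_sections(sample):
    print(repr(header), "->", repr(body))
```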