# Kleio Walkthrough

## About

Author: Jared Neumann

Kleio takes a PDF document (with or without a text layer) or raw text and returns a complete, corrected version of that text. If necessary, the text is first extracted with common OCR tools. It is then passed to an LLM in chunks, and the LLM corrects each chunk. Additional functions can be called, such as:
- Layout analysis and annotation
- Revised collation (e.g., to eliminate headers and footers)
- Translation

## Import Statements

In [1]:
# we'll have to set up a duplicate logger in the notebook
import logging
import os

# Get the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create a StreamHandler for the notebook
stream_handler = logging.StreamHandler()
stream_handler.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
stream_handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(stream_handler)

# Intended to prevent double logging; note that `propagate` has no effect
# on the root logger, which has no parent to propagate to
logger.propagate = False

from kleio.ocr import *
from kleio.image_utils import *
from kleio.correction import *
from kleio.collation import *

import matplotlib.pyplot as plt
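
A note on double logging: in Python's `logging` module, a record is written once for every handler it reaches. If a library logger has its own handler and also propagates to a parent that has a handler, each message is printed twice. Whether Kleio's loggers are set up this way is an assumption, not something this walkthrough confirms, and the logger names `demo` and `demo.ocr` below are stand-ins. The sketch shows the effect, and that turning off `propagate` on the child logger removes the second copy.

```python
import io
import logging


def count_emissions(propagate: bool) -> int:
    """Log one message through a child logger and count how often it is written."""
    buffer = io.StringIO()

    # Parent logger with its own handler (plays the role of the notebook handler).
    parent = logging.getLogger("demo")
    parent.handlers.clear()
    parent.setLevel(logging.INFO)
    parent.propagate = False  # keep the demo isolated from the real root logger
    parent.addHandler(logging.StreamHandler(buffer))

    # Child logger that also has a handler (hypothetical library setup).
    child = logging.getLogger("demo.ocr")
    child.handlers.clear()
    child.setLevel(logging.INFO)
    child.addHandler(logging.StreamHandler(buffer))
    child.propagate = propagate

    child.info("Retrieving text")
    return buffer.getvalue().count("Retrieving text")


print(count_emissions(propagate=True))   # child handler and parent handler both write
print(count_emissions(propagate=False))  # only the child handler writes
```

If a library does behave like this, the usual options are to skip the extra notebook handler or to disable propagation on the library's logger rather than on the root logger.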

## Getting Raw Text

Several input types are accepted: PDFs with a text layer, PDFs without one, images, and plain text. The type is inferred automatically, and the extracted text is returned. Image preprocessing options for OCR can be set through `image_kwargs`.
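
As a rough illustration of what type inference could look like, the sketch below classifies an input by its file extension and decides whether OCR is needed. The helpers `infer_input_type` and `needs_ocr` and the list of image extensions are hypothetical; they are not Kleio's implementation, which may inspect the file contents instead.

```python
from pathlib import Path

# Assumed set of image extensions, for illustration only.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def infer_input_type(source: str) -> str:
    """Classify an input as 'pdf', 'image', or 'text' (hypothetical helper)."""
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    # Anything else is treated as plain text in this sketch.
    return "text"


def needs_ocr(input_type: str, extracted_text: str = "") -> bool:
    """Images always need OCR; a PDF needs it when no text layer was found."""
    if input_type == "image":
        return True
    if input_type == "pdf":
        return not extracted_text.strip()
    return False


print(infer_input_type("../tests/test_input/test_1.jpg"))
print(needs_ocr("pdf", extracted_text=""))
```

The second function reflects the distinction between PDFs with and without a text layer: the text layer is tried first, and OCR is the fallback when nothing usable comes back.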

In [2]:
example_filepath_0 = "../tests/test_input/test_0.pdf"
example_filepath_1 = "../tests/test_input/test_1.jpg"
example_filepath_2 = "../tests/test_input/test_2.pdf"

IMAGE_CONFIG = {
    "grayscale": True,
    "resize": False,
    "threshold": True,
    "deskew": False,
    "dilate_and_erode": False,
    "blur": False
}

text_0 = retrieve_text(example_filepath_0, image_kwargs=IMAGE_CONFIG)
text_1 = retrieve_text(example_filepath_1, image_kwargs=IMAGE_CONFIG)
text_2 = retrieve_text(example_filepath_2, image_kwargs=IMAGE_CONFIG)

2024-01-14 04:15:56,747 - kleio.ocr - INFO - Retrieving text from ../tests/test_input/test_0.pdf
2024-01-14 04:15:56,748 - kleio.ocr - INFO - File provided
2024-01-14 04:15:56,765 - kleio.ocr - INFO - Getting text from PDF file ../tests/test_input/test_0.pdf
2024-01-14 04:15:56,776 - kleio.ocr - INFO - Retrieving text from ../tests/test_input/test_1.jpg
2024-01-14 04:15:56,777 - kleio.ocr - INFO - File provided
2024-01-14 04:15:56,777 - kleio.ocr - INFO - Getting page text from image file ../tests/test_input/test_1.jpg
2024-01-14 04:15:56,792 - kleio.ocr - INFO - Preprocessing image
2024-01-14 04:15:58,003 - kleio.ocr - INFO - Retrieving text from ../tests/test_input/test_2.pdf
2024-01-14 04:15:58,004 - kleio.ocr - INFO - File provided
2024-01-14 04:15:58,005 - kleio.ocr - INFO - Getting text from PDF file ../tests/test_input/test_2.pdf
2024-01-14 04:15:58,006 - kleio.ocr - INFO - Converting PDF file ../tests/test_input/test_2.pdf to image


In [3]:
# let's inspect one of the outputs
print(text_0)
print(len(text_0["pages"]))
for page in text_0["pages"]:
    print(page)

{'filename': 'test_0.pdf', 'extension': 'PDF', 'pages': ['[ 2l6 ]\nReceived June 8, 17 dp.\nX X X . A  Letter from  Mr. Lane, Apothe\xad\ncary, \nin Alderfgate-ftreet, to the H\nnourable Henry Cavendifh, \nR. S. on \nthe Solubility o f Iron in fimple \n, by \nthe Intervention of fixed A ir.\nAlderfgate-ftreet, June 5, 1769*\nSi r ,\nRead Nov. 23, F\' g " 1H  E various impregnations of mi- \n7 9 \nJL \nneral waters have always been very \ndifficult to explain : and whoever has read the divers, \nand often contradi&ory reafonings upon the fubjedt, \nmud clearly perceive, that there is ffill room for dif- \ncoveries in this part of natural hiftory*\nYou, Sir, by your accounts of fixed air, and of \nRathbone-place water, related in the laft volume of \nPhilofophical Tranfadtions, have obliged the public \nwith many additional lights on this branch of know* \nledge^ and, from your known accuracy, and diligent \npurfuits in moft philofophical inquiries, the learned \nWorld has great reafon t
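
The `IMAGE_CONFIG` flags passed to `retrieve_text` in this section switch individual preprocessing steps on or off before OCR. The sketch below illustrates the two flags that are enabled there, `grayscale` and `threshold`, on a tiny image stored as nested lists. It is a simplified stand-in: the function names, the luma weights, and the fixed cutoff of 128 are assumptions, and the package itself reportedly relies on opencv rather than plain Python lists.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
GrayImage = List[List[int]]    # rows of intensities, each 0-255


def to_grayscale(image: List[List[Pixel]]) -> GrayImage:
    """Convert RGB pixels to intensities using common luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def threshold(image: GrayImage, cutoff: int = 128) -> GrayImage:
    """Binarize: pixels at or above the cutoff become white, the rest black."""
    return [[255 if value >= cutoff else 0 for value in row] for row in image]


def preprocess(image, config: Dict[str, bool]):
    """Apply only the steps whose flags are enabled; other flags are ignored here."""
    if config.get("grayscale"):
        image = to_grayscale(image)
    if config.get("threshold"):
        # Assumes the image is already grayscale, as when both flags are on.
        image = threshold(image)
    return image


tiny_image = [
    [(255, 255, 255), (0, 0, 0)],
    [(200, 200, 200), (30, 30, 30)],
]
print(preprocess(tiny_image, {"grayscale": True, "threshold": True}))
```

Binarizing a scan in this way tends to help OCR on faded or unevenly lit pages, which is presumably why those two steps are enabled for these eighteenth-century test files.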

## Correcting the Raw Text

In [4]:

CORRECTION_KWARGS = {
    "filename": text_0["filename"],
    "extension": text_0["extension"],
    "filetype": "academic article",
    "ocr_software": "pytesseract",
    "image_preprocessing_software": "opencv",
    "date": "1770-1800",
    "language": "British English",
    "comments": "This is an article from the Philosophical Transactions of the Royal Society",
}
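
Metadata such as `CORRECTION_KWARGS` is useful because it gives the LLM context about the document's period, language, and likely OCR errors. How Kleio turns this dictionary into a prompt is not shown in this walkthrough; the sketch below is one plausible approach, and the helper `format_context` is hypothetical.

```python
from typing import Dict


def format_context(more_info: Dict[str, str]) -> str:
    """Render metadata as 'Key: value' lines for inclusion in a prompt (hypothetical)."""
    lines = []
    for key, value in more_info.items():
        label = key.replace("_", " ").capitalize()
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


context = format_context({
    "filetype": "academic article",
    "ocr_software": "pytesseract",
    "date": "1770-1800",
    "language": "British English",
})
print(context)
```

A date range and language variety like these can, for example, signal to the model that the long s of eighteenth-century printing is likely to have been misread as "f".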

In [5]:
correction = get_correction(
    text=text_0,
    api_key=os.getenv("OPENAI_API_KEY"),
    llm_provider="openai",
    model_name="gpt-3.5-turbo-16k",
    temperature=0,
    more_info=CORRECTION_KWARGS,
    chunk_size=4096,
)

2024-01-14 04:15:59,415 - kleio.correction - INFO - Getting correction from LLM
2024-01-14 04:15:59,416 - kleio.llm_utils - INFO - Creating OpenAI prompt for OCR correction task
2024-01-14 04:15:59,417 - kleio.llm_utils - INFO - Creating OpenAI LLM
  0%|          | 0/3 [00:00<?, ?it/s]2024-01-14 04:16:35,819 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 33%|███▎      | 1/3 [00:36<01:12, 36.33s/it]2024-01-14 04:16:42,283 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
 67%|██████▋   | 2/3 [00:42<00:18, 18.76s/it]2024-01-14 04:16:47,349 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
100%|██████████| 3/3 [00:47<00:00, 15.95s/it]
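
The progress bar counts three steps, which suggests the text was split into three chunks, each sent to the LLM in its own request. The sketch below shows one simple way to chunk text under a size limit while breaking at line ends. It assumes `chunk_size` is measured in characters, which this walkthrough does not state (it could equally be tokens), and the function `chunk_text` is a hypothetical stand-in rather than Kleio's own splitter.

```python
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters, breaking at line ends."""
    chunks: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit.
        if current and len(current) + len(line) > chunk_size:
            chunks.append(current)
            current = ""
        # A single over-long line is cut into fixed-size pieces.
        while len(line) > chunk_size:
            chunks.append(line[:chunk_size])
            line = line[chunk_size:]
        current += line
    if current:
        chunks.append(current)
    return chunks


sample = "line one\nline two\nline three\n"
for chunk in chunk_text(sample, chunk_size=12):
    print(repr(chunk))
```

Breaking at line ends keeps words intact within a chunk, so the model never sees a word cut in half by the chunking itself.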


In [6]:
for page in correction:
    print(page)

[216]
Received June 8, 17 dp.
X X X. A Letter from Mr. Lane, Apothe­
cary,
in Alderfgate-ftreet, to the Honourable Henry Cavendifh,
R. S. on
the Solubility of Iron in simple
, by
the Intervention of fixed Air.
Alderfgate-ftreet, June 5, 1769*
Sir,
Read Nov. 23, F'g " 1H E various impregnations of mi-
79
neral waters have always been very
difficult to explain: and whoever has read the diverse,
and often contradictory reasonings upon the subject,
must clearly perceive, that there is still room for dif-
coveries in this part of natural history.
You, Sir, by your accounts of fixed air, and of
Rathbone-place water, related in the last volume of
Philosophical Transactions, have obliged the public
with many additional lights on this branch of know-
ledge, and, from your known accuracy, and diligent
pursuits in most philosophical inquiries, the learned
World has great reason to hope for many other new
and useful improvements.
To your judgment there-
fore, I submit the following experiments; wh
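
Because an LLM can also change text that was already correct, it can be worth reviewing what the correction step altered. The sketch below uses the standard library's `difflib` to list the word runs that differ between one raw OCR line and its corrected counterpart, both taken from the outputs above. The helper `changed_words` is illustrative and not part of Kleio.

```python
import difflib
from typing import List, Tuple


def changed_words(raw_line: str, corrected_line: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for word runs that differ between two lines."""
    raw_words = raw_line.split()
    corrected_words = corrected_line.split()
    matcher = difflib.SequenceMatcher(a=raw_words, b=corrected_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            before = " ".join(raw_words[i1:i2])
            after = " ".join(corrected_words[j1:j2])
            changes.append((before, after))
    return changes


raw = "and often contradi&ory reafonings upon the fubjedt,"
corrected = "and often contradictory reasonings upon the subject,"
for before, after in changed_words(raw, corrected):
    print(f"{before!r} -> {after!r}")
```

Comparing at the word level keeps the report short and makes unexpected rewrites, such as modernized spellings, easy to spot.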

## Collating the Corrected Pages

In [13]:
# As you can see, there are some issues we might still want to resolve
# E.g., words that are split across lines,
# duplicate words on adjacent pages (old texts do this a lot),
# headers and footers that are not part of the main text, etc.

# So, we need to collate the pages
COLLATION_KWARGS = {
    "remove_headers_and_footers": True,
    "remove_page_numbers": True,
    "remove_excess_space": True,
    "remove_empty_lines": False,
    "remove_line_breaks": True,
    "remove_word_breaks": True,
    "add_section_tags": False,
    "keep_page_breaks": False,
}

collated_text = collate(
    pages=correction,
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="gpt-4",
    temperature=0,
    llm_provider="openai",
    chunk_size=2048,
    more_info=COLLATION_KWARGS,
)

2024-01-14 04:23:59,203 - kleio.llm_utils - INFO - Creating OpenAI prompt for OCR correction task
2024-01-14 04:23:59,204 - kleio.llm_utils - INFO - Creating OpenAI LLM
  0%|          | 0/1 [00:00<?, ?it/s]2024-01-14 04:24:31,845 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
100%|██████████| 1/1 [00:32<00:00, 32.63s/it]
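
The `collate` call above delegates the cleanup to an LLM. For intuition about what the `remove_word_breaks` and `remove_line_breaks` options ask for, here is a deterministic approximation using regular expressions. It is a sketch under simple assumptions (a word break is a hyphen at a line end followed by a lowercase letter), and `join_broken_lines` is a hypothetical helper, not Kleio code.

```python
import re


def join_broken_lines(text: str) -> str:
    """Rejoin words hyphenated at line ends, then turn other line breaks into spaces."""
    # "know-\nledge" -> "knowledge": hyphen, newline, then a lowercase letter.
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(?=[a-z])", r"\1", text)
    # Any other single line break becomes a space; blank lines are kept
    # so that paragraph breaks survive.
    text = re.sub(r"(?<!\n)[ \t]*\n[ \t]*(?!\n)", " ", text)
    return text


sample = "branch of know-\nledge, and diligent\npursuits"
print(join_broken_lines(sample))
```

A rule like this cannot tell a line-end hyphen from a genuine compound such as "Rathbone-place" that happens to break across lines, which is one reason a language model can be a better fit for this step.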


In [14]:
print(collated_text)

Received June 8, 17 dp. XXX. A Letter from Mr. Lane, Apothecary, in Aldersgate-street, to the Honourable Henry Cavendish, R. S. on the Solubility of Iron in simple, by the Intervention of fixed Air. Aldersgate-street, June 5, 1769. Sir, Read Nov. 23, F'g " 1H E various impregnations of mineral waters have always been very difficult to explain: and whoever has read the diverse, and often contradictory reasonings upon the subject, must clearly perceive, that there is still room for discoveries in this part of natural history.

You, Sir, by your accounts of fixed air, and of Rathbone-place water, related in the last volume of Philosophical Transactions, have obliged the public with many additional lights on this branch of knowledge, and, from your known accuracy, and diligent pursuits in most philosophical inquiries, the learned World has great reason to hope for many other new and useful improvements. To your judgment therefore, I submit the following experiments; which are intended to s
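
Older printed texts often repeat the first word of the next page at the foot of the current one (a catchword), which is the "duplicate words on adjacent pages" issue noted earlier. The sketch below shows a simple rule-based way to drop such duplicates before joining pages. The helper `drop_catchwords` and the two sample pages are made up for illustration; Kleio's collation step relies on an LLM instead.

```python
from typing import List


def drop_catchwords(pages: List[str]) -> List[str]:
    """Remove a page's final word when the next page begins with the same word."""
    cleaned = list(pages)
    for i in range(len(cleaned) - 1):
        current_words = cleaned[i].split()
        next_words = cleaned[i + 1].split()
        if current_words and next_words and current_words[-1] == next_words[0]:
            # Strip trailing whitespace, then cut the duplicated last word.
            stripped = cleaned[i].rstrip()
            cut = len(stripped) - len(current_words[-1])
            cleaned[i] = stripped[:cut].rstrip()
    return cleaned


# Invented sample pages, not taken from the test documents.
pages = ["the learned World has great reason\nto", "to hope for many other new"]
print(drop_catchwords(pages))
```

The rule would also remove a legitimately repeated word at a page boundary, so it trades a small risk of over-deletion for simplicity.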