## Installation

Dependencies to run the notebook

 * Install [ipykernel](https://pypi.org/project/ipykernel/).
 
OCR and related 

 * Install [tesseract](https://tesseract-ocr.github.io/tessdoc/Installation.html). Tested with tesseract-ocr-w64-setup-5.3.3.20231005.exe binary.
 * Install [pytesseract](https://pypi.org/project/pytesseract/).
 * Install [pdf2image](https://pypi.org/project/pdf2image/). Notice it has a dependency on [poppler](https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler) follow the steps detailed and validate with the following command everything is set.

```bash
pdftoppm -h
```

Other dependencies

 * Install [openai](https://pypi.org/project/openai/).

### Troubleshooting

```
PDFInfoNotInstalledError: Unable to get page count.
Is poppler installed and in PATH?
```

Make sure the PATH contains the full path to the poppler installation folder, for example:

```
C:\temp\utils\poppler\Library\bin\
```






The following commands needs to be executed only once or when you update dependencies

In [None]:
!pip install ipykernel


In [None]:
!pip install pytesseract

In [None]:
!pip install pdf2image

In [None]:
!pip install openai

## Code

Snippet of what is needed to do.

### Initialization

In [1]:
from openai import OpenAI

In [2]:
import pytesseract
from pytesseract import image_to_string
from tqdm import tqdm
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

In [3]:
pytesseract.get_languages()

['afr',
 'amh',
 'ara',
 'asm',
 'aze',
 'aze_cyrl',
 'bel',
 'ben',
 'bod',
 'bos',
 'bre',
 'bul',
 'cat',
 'ceb',
 'ces',
 'chi_sim',
 'chi_sim_vert',
 'chi_tra',
 'chi_tra_vert',
 'chr',
 'cos',
 'cym',
 'dan',
 'deu',
 'div',
 'dzo',
 'ell',
 'eng',
 'enm',
 'epo',
 'equ',
 'est',
 'eus',
 'fao',
 'fas',
 'fil',
 'fin',
 'fra',
 'frk',
 'frm',
 'fry',
 'gla',
 'gle',
 'glg',
 'grc',
 'guj',
 'hat',
 'heb',
 'hin',
 'hrv',
 'hun',
 'hye',
 'iku',
 'ind',
 'isl',
 'ita',
 'ita_old',
 'jav',
 'jpn',
 'jpn_vert',
 'kan',
 'kat',
 'kat_old',
 'kaz',
 'khm',
 'kir',
 'kmr',
 'kor',
 'lao',
 'lat',
 'lav',
 'lit',
 'ltz',
 'mal',
 'mar',
 'mkd',
 'mlt',
 'mon',
 'mri',
 'msa',
 'mya',
 'nep',
 'nld',
 'nor',
 'oci',
 'ori',
 'osd',
 'pan',
 'pol',
 'por',
 'pus',
 'que',
 'ron',
 'rus',
 'san',
 'sin',
 'slk',
 'slv',
 'snd',
 'spa',
 'spa_old',
 'sqi',
 'srp',
 'srp_latn',
 'sun',
 'swa',
 'swe',
 'syr',
 'tam',
 'tat',
 'tel',
 'tgk',
 'tha',
 'tir',
 'ton',
 'tur',
 'uig',
 'ukr',
 'u

In [4]:
import os

api_key = os.environ['OPENAI_API_KEY']

client = OpenAI(
  api_key=api_key
)



### Helper functions

In [6]:
def write_strings_to_file(strings, filename):
    # Combine the strings into a single text
    combined_text = ' '.join(strings)

    # Write the text to a file
    with open(filename, 'w') as file:
        file.write(combined_text)
    return True

In [7]:
from pdf2image import convert_from_path
from PIL import Image

def get_text_from_pdf(pdf_path, language='eng'):
    # Warning: other languages need to install other tesseract components
    pages = convert_from_path(pdf_path)
    ocr_pages = []
    for i,page in enumerate(pages):
        text = image_to_string(page, lang=language)
        ocr_pages.append([text])
        #page_footer = f"---end of page {i+1}---"
        #print(page_footer)
    print(f"pages {i}")
    return ocr_pages

In [8]:
def ungarble(text, model="gpt-4"):
    
    text = text.replace("\n"," ")
    
    prompt = f"""
    ---- objective ----
    fix the spacing and general layout of the following text and add needed words for it to make sense, only when they are incomplete, inintelligible or garbled
    
    ---- input format ----
    text with incomplete words, wrong spacing, wrongly placed new lines and inconsistent formatting
    
    ---- output format ----
    only fixed input, format, new lines and consistency in order to make it as readable as possible, no extra text or explanations

    ---- input ----
    {text}
   
    ---- output ----
    
    """
    
    response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=4000,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
    )
    
    return  response.choices[0].message.content.strip('"')

In [9]:

def handle_text_from_pdf(pdf_path, language='eng', use_completion=False, model="gpt-4"):
    ocr_pages = get_text_from_pdf(pdf_path, language)

    clean = []
    for page in tqdm(ocr_pages):
        text = ungarble(page[0], model) if use_completion else page[0]
        clean.append(text)

    output_path = f"{pdf_path}.txt"
    result = write_strings_to_file(clean, output_path)
    if result:
        print(f"Output saved to {output_path}")


In [None]:
use_completion = False # 30s
#use_completion = True

model = "gpt-4" # 4:30m
model = "gpt-4-turbo-preview"
model = "gpt-3.5-turbo-16k"
model = "gpt-3.5-turbo" # 1:30m up to 4096 completion tokens
model = "gpt-4o"

pdf_path = 'C:/tmp/test.pdf'
language = 'spa'

handle_text_from_pdf(pdf_path, language, use_completion, model)



pages 5


100%|██████████| 6/6 [00:00<?, ?it/s]

Output saved to C:/temp/saia/guyer/caso01/pregunta02/rch-EST-NCT-100000-17978787.pdf.txt



