# Parsing PDF Documents

When extracting text from a PDF document, there are generally two main approaches:

1. Extraction without OCR:

  This method extracts text directly from the PDF without relying on OCR. It assumes the PDF contains embedded text, which can be recovered by parsing the PDF structure and extracting its text elements. It does not work well when the PDF consists of scanned images or other non-searchable text.
  
2. Extraction with OCR:
   
  OCR is used when the PDF contains scanned images or non-searchable text. It recognizes the text within these images and converts it into machine-readable text. It also works on machine-generated documents once their pages are rendered as images.

In this notebook, we'll test both approaches with different methods, and compare results.
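
One way to choose between the two approaches is to try direct extraction first and fall back to OCR only when the pages yield almost no text. The sketch below illustrates that idea and is not part of the benchmark. The `needs_ocr` helper, the `min_chars_per_page` threshold, and the "more than half of the pages" rule are hypothetical choices to tune on your own documents.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True when most pages carry too little embedded text.

    page_texts: one string per page, as returned by a direct
    (non-OCR) extractor. None is treated as an empty page.
    min_chars_per_page: hypothetical threshold, tune it on your data.
    """
    if not page_texts:
        return True
    sparse_pages = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    # Fall back to OCR when more than half of the pages look empty.
    return sparse_pages > len(page_texts) / 2


scanned = ["", "  ", None]
digital = [
    "Citi Global Wealth Investments FX Snapshot",
    "Source: Bloomberg L.P. (K = Thousand, M = Million)",
]
print(needs_ocr(scanned))
print(needs_ocr(digital))
```

In practice, `page_texts` would hold the per-page output of one of the direct extractors tested in this notebook.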





### Installing required packages

In [None]:
%%capture
!pip install PyPDF2 pycryptodome
!pip install pdfplumber
!pip install gdown
!sudo add-apt-repository --yes ppa:alex-p/tesseract-ocr5
!sudo apt update
!sudo apt install -y tesseract-ocr
!sudo apt-get install -y poppler-utils
!sudo apt-get install parallel -y
!pip install pdf2image pytesseract==0.3.10
!pip install pymupdf
!pip install beautifulsoup4
!pip install easyocr
!pip install Pillow==9.5.0


Restart the runtime before running the cells below.

In [None]:
from PyPDF2 import PdfReader
import pandas as pd
import random
import os
import time
import pdfplumber
import shutil
import gdown
import fitz
from bs4 import BeautifulSoup
import numpy as np
import subprocess
import glob
from pdf2image import convert_from_path
from concurrent.futures import ProcessPoolExecutor
import easyocr

### Getting the documents

In [None]:
# Download from Gdrive
url = 'https://drive.google.com/uc?id=1ghTn9RMykwfrWaSjyw9PFGqvMKI4uBov'
output = '/content/docs.zip'
gdown.download(url, output, quiet=False)

# Unzip data (reuse `output` so the path always matches the download location)
docs_directory = 'documents'
shutil.unpack_archive(output, docs_directory)

# Get the paths of all documents for parsing
document_paths = []
for root, directories, files in os.walk(docs_directory):
    for filename in files:
        document_paths.append(os.path.join(root, filename))

print("Number of documents:", len(document_paths))

Downloading...
From: https://drive.google.com/uc?id=1ghTn9RMykwfrWaSjyw9PFGqvMKI4uBov
To: /content/docs.zip
100%|██████████| 41.0M/41.0M [00:00<00:00, 45.5MB/s]


Number of documents: 20


In [None]:
# We'll store the results of all approaches within a pandas dataframe

columns = {
    'Report ID': int,
    'Report Name': str,
    'Bank Name': str,
    'Report Date': str,
    'Page ID': int,
    'Page Text': str,
    'Parsing Method': str
}

### Approach 1: Extraction without OCR

#### Using PyPDF2

We'll now use PyPDF2 to test this approach with multiple PDF documents, and keep track of the extracted text and the execution time to provide an overview later.
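
The profiling used throughout this notebook amounts to timing one parse call per document and recording a few counts. The sketch below shows that pattern in isolation. `timed` and `fake_parse` are hypothetical names, and `time.perf_counter` is used here in place of the `time.time` calls in the notebook cells because it is designed for measuring short intervals.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_parse(path):
    # Hypothetical stand-in for a real parser call: returns one string per page.
    return [f"text of page {i} in {path}" for i in range(3)]


pages, seconds = timed(fake_parse, "report.pdf")
profiling_row = {
    "Report Name": "report",
    "Length": len(" ".join(pages)),
    "Total Pages": len(pages),
    "Parsing Time": seconds,
}
print(profiling_row)
```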

In [None]:
class Document:
  def __init__(self, filepath, id, method="pypdf"):
    self.filepath = filepath
    self.method = method
    self.report_id = id
    self.report_name = os.path.splitext(os.path.basename(filepath))[0]
    self.pages = self.read()

  def __str__(self):
    return " ".join([page_text for page_text in self.pages['Page Text'].values])

  def __len__(self):
    return len(self.pages)

  def read(self):
      common_attrs = {
          'Report ID': self.report_id,
          'Report Name': self.report_name,
          'Bank Name': 'Unknown',
          'Report Date': 'Unknown',
          'Parsing Method': self.method
      }

      # Map each method to a reader factory and a page-level extractor.
      # The factories are lambdas so that only the selected library opens the file.
      method_attrs = {
          'pypdf': {
              'reader': lambda: PdfReader(self.filepath),
              'extract_text': lambda page: page.extract_text()
          },
          'pdfplumber': {
              'reader': lambda: pdfplumber.open(self.filepath),
              'extract_text': lambda page: page.extract_text(x_tolerance=5, y_tolerance=5, layout=False)
          },
          'pymupdf':{
              'reader': lambda: fitz.open(self.filepath),
              'extract_text': lambda page: BeautifulSoup(page.get_textpage().extractXHTML(), "lxml").get_text(separator=" ")
          }
      }

      attrs = method_attrs.get(self.method)
      if attrs:
          reader = attrs['reader']()

          num_pages = len(reader) if self.method == "pymupdf" else len(reader.pages)

          pages = []
          for page_num in range(num_pages):
              page = reader[page_num] if self.method == "pymupdf" else reader.pages[page_num]
              page_row = {
                  **common_attrs,
                  'Page ID': page_num,
                  'Page Text': attrs['extract_text'](page)
              }
              pages.append(page_row)
          df_documents = pd.DataFrame(pages, columns=columns.keys()).astype(columns)
          return df_documents

      else:
          raise ValueError(f"Invalid method: {self.method}")


In [None]:
def crawl_parse(method="pypdf"):
  df_documents = pd.DataFrame(columns=columns.keys()).astype(columns)
  profiling = []
  for id, path in enumerate(document_paths):
    start = time.time()
    doc = Document(path, id=id,method=method)
    end = time.time()
    df_documents = pd.concat([df_documents, doc.pages], ignore_index=True)
    profiling_row = {"Report Name": doc.report_name,
                     "Length": len(str(doc)),
                     "Total Pages": len(doc),
                     "Parsing Time": end - start,
                     "Parsing Method": doc.method
                     }
    profiling.append(profiling_row)
  df_profiling = pd.DataFrame(profiling)
  return df_documents, df_profiling

In [None]:
def print_random_pages(df):
  for _, row in df.sample(n=5).iterrows():
      print(f"Page ID: {row['Page ID']}")
      print(f"Report Name: {row['Report Name']}")
      print(f"Page Text: {row['Page Text']}")
      print("*"*10)

In [None]:
%%time
documents_1, profiling_1 = crawl_parse(method="pypdf")

CPU times: user 32.4 s, sys: 206 ms, total: 32.6 s
Wall time: 33.2 s


In [None]:
documents_1.head()

Unnamed: 0,Report ID,Report Name,Bank Name,Report Date,Page ID,Page Text,Parsing Method
0,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,0,Citi Global Wealth Investments \nFX Snapshot\n...,pypdf
1,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,1,"Source: Bloomberg L.P.\n(K = Thousand, M = Mil...",pypdf
2,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,2,Citi FX interest rate Forecast %\nSource: Citi...,pypdf
3,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,3,Important Disclosure\n“Citi analysts” refers t...,pypdf
4,1,exemple_analyse_macro_economique_goldman_sachs,Unknown,Unknown,0,Fixed Income\nMUSIN GS\nFIXED INCOME Goldman S...,pypdf


In [None]:
profiling_1.describe(include='all')

Unnamed: 0,Report Name,Length,Total Pages,Parsing Time,Parsing Method
count,20,20.0,20.0,20.0,20
unique,20,,,,1
top,fx_insight_e_16_janvier_2023,,,,pypdf
freq,1,,,,20
mean,,53853.2,23.15,1.659652,
std,,78665.360947,30.135265,2.224184,
min,,10630.0,4.0,0.181192,
25%,,19520.75,7.0,0.296401,
50%,,21096.5,8.0,0.468302,
75%,,59660.5,26.0,1.836791,


In [None]:
print_random_pages(documents_1)

Page ID: 30
Report Name: bnp_parisbas_global_view_2023
Page Text: INVESTMENT OUTLOOK FOR 2023  - 31 -
ASSET CLASS OVERVIEW
Global government bonds (H)
Global corporate bonds (H) Global corporate high-yield (H) Commodities (H) Developed equities (UH) Global real estate (UH) 
H: hedged; UH: unhedged 
Source: Bloomberg, Quant Research Group, BNP Paribas Asset ManagementIndices used: global real estate (RNGL), developed equities (MSDEWIN), global government bonds (SBWGEC), global corporate bonds (LGCPTREH), global corporate high-yield (LG30TRUH), commodities (BCOMHET); Bloomberg ticker in bracketsPast performance or achievement is not indicative of current or future performance.Performance+
Performance-2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
21.2% 32% 11.5% 15.6% 8.4% 0.1% 30% 6.7% 36.9% 14.1%
6.5% 19.5% 10.4% 10.7% 7.5% -0.2% 25.3% 6.3% 31.1% -8.0%
0% 8.4% 1% 10.1% 3.7% -2.7% 13.3% 5.7% 25.6% -14.4%
-0.1% 7.5% -0.5% 8.1% 0.3% -3.8% 9.2% 4.8% 2.5% -14.7%
-0.1% 2.6% -0.7% 4.6% -0.

In [None]:
documents_1.to_csv("parsed-docs-pypdf.csv", index=False)

#### Using pdfplumber

We'll now use pdfplumber to test this approach with multiple PDF documents, and keep track of the extracted text and the execution time to provide an overview later.

In [None]:
%%time
documents_2, profiling_2 = crawl_parse(method="pdfplumber")

CPU times: user 1min 27s, sys: 617 ms, total: 1min 27s
Wall time: 1min 28s


In [None]:
documents_2.head()

Unnamed: 0,Report ID,Report Name,Bank Name,Report Date,Page ID,Page Text,Parsing Method
0,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,0,"Citi Global Wealth Investments\nJanuary 16, 20...",pdfplumber
1,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,1,Citi Global Wealth Investments\nFX Snapshot\nS...,pdfplumber
2,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,2,Citi Global Wealth Investments\nFX Snapshot\nC...,pdfplumber
3,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,3,Citi Global Wealth Investments\nFX Snapshot\nI...,pdfplumber
4,1,exemple_analyse_macro_economique_goldman_sachs,Unknown,Unknown,0,"Fixed Income\nJuly 29, 2022\nMUSINGS\nPOLICY P...",pdfplumber


In [None]:
profiling_2.describe(include='all')

Unnamed: 0,Report Name,Length,Total Pages,Parsing Time,Parsing Method
count,20,20.0,20.0,20.0,20
unique,20,,,,1
top,fx_insight_e_16_janvier_2023,,,,pdfplumber
freq,1,,,,20
mean,,54437.25,23.15,4.437592,
std,,80134.711991,30.135265,5.882953,
min,,9780.0,4.0,0.555941,
25%,,19444.75,7.0,0.943884,
50%,,20989.5,8.0,1.348397,
75%,,59341.25,26.0,4.773229,


In [None]:
print_random_pages(documents_2)

Page ID: 59
Report Name: kkr_global_view_2023
Page Text: in Exhibit 106. Moreover, with the region so out of favor, Exhibit 107
there are now a select group of high quality public-to-private
opportunities around these themes that were previously not Public Indices in Indonesia Aren’t Fully Capturing
Demographic Tailwinds and Rising GDP-per Capita via
available, in our view.
Technology
Exhibit 106 MSCI Indonesia Sector Weights, %
Don’t Judge European Investment Opportunities by the
State of the Public Markets 12.7%
Private Equity Returns Over Public Markets, %
(Real Cumulative Outperformance Since 2010)
400%
350% US Europe 0%
300% Communication Services Technology
250%
Note: Telekom Indonesia is 10.95% of Communications Services. Data as at November
200% 2022. Source: MSCI.
150%
100%
Exhibit 108
50%
0% Gaining Exposure to EM and European Tech or China
(50%) Consumer Upgrades Through Public Equities Is
9002 0102 1102 2102 3102 4102 5102 6102 7102 8102 9102 0202 Challenging
Sector Concent

In [None]:
documents_2.to_csv("parsed-docs-pdfplumber.csv", index=False)

#### Using PyMuPDF


This method combines PyMuPDF (imported as `fitz`) with BeautifulSoup. PyMuPDF opens the PDF and converts each page to XHTML, which keeps the structure of the original page. BeautifulSoup then parses that XHTML and extracts its text. The approach is particularly useful for PDFs with complex layouts and structured elements, since it gives access to their content in a structured form without relying on OCR.
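
The second half of that pipeline, turning XHTML into plain text, can be illustrated without any PDF at hand. The sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it is not the call made in this notebook. `TextCollector`, `xhtml_to_text`, and the sample markup are hypothetical, and unlike `get_text(separator=" ")` this version strips each text node.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an (X)HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def xhtml_to_text(xhtml, separator=" "):
    parser = TextCollector()
    parser.feed(xhtml)
    parser.close()
    return separator.join(parser.chunks)


# Hypothetical markup, loosely shaped like a page converted to XHTML.
sample_xhtml = (
    '<div id="page0">'
    "<h1>FX Snapshot</h1>"
    "<p>Source: <b>Bloomberg</b> L.P.</p>"
    "</div>"
)
print(xhtml_to_text(sample_xhtml))
```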

In [None]:
%%time
documents_3, profiling_3 = crawl_parse(method="pymupdf")

CPU times: user 3.32 s, sys: 73 ms, total: 3.39 s
Wall time: 3.39 s


In [None]:
documents_3.head()

Unnamed: 0,Report ID,Report Name,Bank Name,Report Date,Page ID,Page Text,Parsing Method
0,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,0,\n Citi Global Wealth Investments \n FX Snaps...,pymupdf
1,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,1,"\n Source: Bloomberg L.P. \n (K = Thousand, M ...",pymupdf
2,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,2,\n Citi FX interest rate Forecast % \n Source:...,pymupdf
3,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,3,\n Important Disclosure \n “Citi analysts” ref...,pymupdf
4,1,exemple_analyse_macro_economique_goldman_sachs,Unknown,Unknown,0,\n Fixed Income \n MUSINGS \n FIXED INCOME Go...,pymupdf


In [None]:
profiling_3.describe(include='all')

Unnamed: 0,Report Name,Length,Total Pages,Parsing Time,Parsing Method
count,20,20.0,20.0,20.0,20
unique,20,,,,1
top,fx_insight_e_16_janvier_2023,,,,pymupdf
freq,1,,,,20
mean,,55427.85,23.15,0.168646,
std,,80615.641196,30.135265,0.179018,
min,,11207.0,4.0,0.051143,
25%,,19923.25,7.0,0.076668,
50%,,21527.0,8.0,0.0917,
75%,,61486.5,26.0,0.163624,


In [None]:
print_random_pages(documents_3)

Page ID: 33
Report Name: jpmorgan_private_banking_global_view_2023
Page Text: 
 Source: J.P. Morgan Asset Management. Note: The left bars represent 2022, and right bars 2023 estimates. 
 2023 LTCMAS PRESENT IMPRESSIVE EXPECTED RETURNS 
 Annualized expected return, % 
   
   
 STRONGER MARKETS  
 Here’s what the  bear  market of   2022 has   delivered:   
 As noted, a dramatic reset in valuations—higher yields, lower stock multiples—has, in our view, created the most attractive entry point for a traditional portfolio of stocks and bonds in over a decade. In fact, our long-term outlook for returns across asset classes are materially higher than they were just last year.  
 LONG-TERM CAPITAL MARKET ASSUMPTIONS  
 PRESENT FAVORABLE EXPECTED RETURNS  
 Forecasted annual return over the next 10-15 years, %  
 2022 LTCMAs Equities  Fixed Income  
 9.8 10.1 2023 LTCMAs 7.9 6.9 6.8 6.5  
 4.6 4.1 3.7  3.9 2.6 2.1  
 U.S. Large Cap EAFE Equity EM Equity Muni Bonds U.S. Agg Bonds U.S. HY Bonds  


In [None]:
documents_3.to_csv("parsed-docs-pymupdf.csv", index=False)

### Approach 2: Extraction with OCR

#### Using Tesseract OCR

We'll now use Tesseract to test this approach with multiple PDF documents, and keep track of the extracted text and the execution time to provide an overview later.
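
The Tesseract pipeline in this notebook writes one text file per page, so the files must be read back in page order. Sorting file names as plain strings can put `page-10` before `page-2` when the numbers are not zero-padded, which is why a numeric sort key is needed. The sketch below is a slightly more defensive variant of that key. `page_number` and the sample paths are hypothetical.

```python
import os
import re


def page_number(file_path):
    """Extract the trailing page number from a name like 'page-07.txt'."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    match = re.search(r"(\d+)$", stem)
    if match is None:
        raise ValueError(f"No page number in {file_path!r}")
    return int(match.group(1))


outputs = [
    "temp_output/page-10.txt",
    "temp_output/page-2.txt",
    "temp_output/page-1.txt",
]
ordered = sorted(outputs, key=page_number)
print(ordered)
```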

In [None]:
class DocumentOCR:
  def __init__(self, filepath, id, method="tesseract", reader=None):
    self.filepath = filepath
    self.method = method
    self.report_id = id
    self.report_name = os.path.splitext(os.path.basename(filepath))[0]
    # Only the EasyOCR method uses the reader; it stays None for Tesseract.
    self.reader = reader
    self.pages = self.read()


  def __str__(self):
    return " ".join([page_text for page_text in self.pages['Page Text'].values])

  def __len__(self):
    return len(self.pages)

  def tess_ocr(self):
      shell_commands = [
          "mkdir -p temp_images",
          f"pdftoppm -tiff '{self.filepath}' temp_images/page",
          "mkdir -p temp_output",
          "ls temp_images | parallel 'tesseract \"temp_images/{}\" \"temp_output/{.}\" --oem 3 -l eng'",
          "rm -rf temp_images"
      ]

      for command in shell_commands:
          # check=True raises CalledProcessError if a step fails instead of continuing silently
          subprocess.run(command, shell=True, check=True)

      outputs = glob.glob(os.path.join("temp_output", "*.txt"))
      sorted_outputs = sorted(outputs, key=lambda file_path: int(file_path.split('-')[-1].split('.')[0]))

      return sorted_outputs

  def easy_ocr(self):
    images = convert_from_path(
        pdf_path=self.filepath,
        output_folder=None,
        dpi=300,
        grayscale=False,
        paths_only=False,
    )
    ocr_results = ['' for _ in range(len(images))]
    for page_num, image in enumerate(images):
        result = self.reader.readtext(np.array(image), paragraph=True, detail=0, batch_size=32, workers=2)
        ocr_results[page_num] = ' '.join(result)
    return ocr_results

  def read(self):
        pages = []
        if self.method == "tesseract":
            ocr_outputs = self.tess_ocr()

            for page_num, output_file in enumerate(ocr_outputs):
                with open(output_file, "r", encoding="utf-8") as file:
                    page_text = file.read()

                page_row = {
                    'Report ID': self.report_id,
                    'Report Name': self.report_name,
                    'Bank Name': 'Unknown',
                    'Report Date': 'Unknown',
                    'Page ID': page_num,
                    'Page Text': page_text,
                    'Parsing Method': self.method
                }
                pages.append(page_row)

            shutil.rmtree("temp_output")
        elif self.method == "easyocr":
            ocr_outputs = self.easy_ocr()
            for page_num, page_text in enumerate(ocr_outputs):
                page_row = {
                    'Report ID': self.report_id,
                    'Report Name': self.report_name,
                    'Bank Name': 'Unknown',
                    'Report Date': 'Unknown',
                    'Page ID': page_num,
                    'Page Text': page_text,
                    'Parsing Method': self.method
                }
                pages.append(page_row)

        return pd.DataFrame(pages,columns=columns.keys()).astype(columns)


In [None]:
def crawl_parse_ocr(method="tesseract"):
  df_documents = pd.DataFrame(columns=columns.keys()).astype(columns)
  profiling = []
  # Load the EasyOCR model once rather than once per document
  reader = easyocr.Reader(['en']) if method == "easyocr" else None
  for id, path in enumerate(document_paths):
    start = time.time()
    doc = DocumentOCR(path, id=id, method=method, reader=reader)
    end = time.time()
    df_documents = pd.concat([df_documents, doc.pages], ignore_index=True)
    profiling_row = {"Report Name": doc.report_name,
                     "Length": len(str(doc)),
                     "Total Pages": len(doc),
                     "Parsing Time": end - start,
                     "Parsing Method": doc.method
                     }
    profiling.append(profiling_row)
  df_profiling = pd.DataFrame(profiling)
  return df_documents, df_profiling

In [None]:
%%time
documents_4, profiling_4 = crawl_parse_ocr(method="tesseract")

CPU times: user 6.63 s, sys: 892 ms, total: 7.52 s
Wall time: 34min 3s


In [None]:
documents_4.head()

Unnamed: 0,Report ID,Report Name,Bank Name,Report Date,Page ID,Page Text,Parsing Method
0,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,0,GO\nPlease note and carefully read the Import ...,tesseract
1,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,1,Please note and carefully read the Import Disc...,tesseract
2,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,2,GO\nPlease note and carefully read the Import ...,tesseract
3,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,3,=~\nciti Citi Global Wealth Investments\nFX Sn...,tesseract
4,1,exemple_analyse_macro_economique_goldman_sachs,Unknown,Unknown,0,Asset\n\nGoldman |\nManagement\n\nSachs\n\nFix...,tesseract


In [None]:
profiling_4.describe(include='all')

Unnamed: 0,Report Name,Length,Total Pages,Parsing Time,Parsing Method
count,20,20.0,20.0,20.0,20
unique,20,,,,1
top,fx_insight_e_16_janvier_2023,,,,tesseract
freq,1,,,,20
mean,,55775.5,23.15,102.166835,
std,,77681.62692,30.135265,136.500964,
min,,11787.0,4.0,19.494749,
25%,,20296.0,7.0,36.938411,
50%,,21954.0,8.0,39.933532,
75%,,58420.75,26.0,117.933431,


In [None]:
print_random_pages(documents_4)

Page ID: 0
Report Name: fx_insight_e_9_9_2023
Page Text: GO
Please note and carefully read the Import Disclosure on the last part

F te
citi Citi Global Wealth Investments
January 9, 2023 FX Snapshot

Major Currencies Performance

Year-To-
Date
Change

USD 103.88 0.3% 105.58 | 103.52 | -1.6% | 113.32 | 103.52 | -7.5% | 114.78 | 94.63 0.3%
EUR/USD | 1.0644 | -0.6% | 1.0705 | 1.0467 1.7% 1.0705 | 0.9702 8.7% 1.1495 | 0.9536 | -0.6%
USD/JPY | 132.08 0.7% 137.78 | 130.80 | -3.6% | 150.15 | 130.80 | -9.0% | 151.95 | 113.47 | 0.7%
GBP/USD | 1.2093 0.1% 1.2426 | 1.1908 | -0.3% | 1.2426 | 1.0968 8.3% 1.3749 | 1.0350 | 0.1%
USD/CAD | 1.3444 | -0.8% | 1.3699 | 1.3444 | -1.5% | 1.3885 | 1.3275 | -2.2% | 1.3977 | 1.2403 | -0.8%
AUD/USD | 0.6877 0.9% | 0.6877 | 0.6670 | 2.8% 0.6877 | 0.6199 74% | 0.7661 | 0.6170 | 0.9%
NZD/USD | 0.6347 | 0.0% | 0.6464 | 0.6234 | 0.4% 0.6464 | 0.5562 | 12.1% | 0.7034 | 0.5512 | 0.0%
USD/CHF | 0.9279 | 0.4% | 0.9420 | 0.9232 | -1.5% | 1.0133 | 0.9232 | -6.3% | 1.0148

In [None]:
documents_4.to_csv("parsed-docs-tesseract.csv", index=False)

#### Using EasyOCR

In [None]:
%%time
documents_5, profiling_5 = crawl_parse_ocr(method="easyocr")

CPU times: user 31min 13s, sys: 3min 35s, total: 34min 49s
Wall time: 39min 14s


In [None]:
documents_5.head()

Unnamed: 0,Report ID,Report Name,Bank Name,Report Date,Page ID,Page Text,Parsing Method
0,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,0,Please note and carefully read the Import Disc...,easyocr
1,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,1,Please note and carefully read the Import Disc...,easyocr
2,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,2,Please note and carefully read the Import Disc...,easyocr
3,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,3,citi Citi Global Wealth Investments FX Snapsho...,easyocr
4,1,exemple_analyse_macro_economique_goldman_sachs,Unknown,Unknown,0,"Goldman Sachs Fixed Income July 29, 2022 Asset...",easyocr


In [None]:
profiling_5.describe(include='all')

Unnamed: 0,Report Name,Length,Total Pages,Parsing Time,Parsing Method
count,20,20.0,20.0,20.0,20
unique,20,,,,1
top,fx_insight_e_16_janvier_2023,,,,easyocr
freq,1,,,,20
mean,,56245.35,23.15,117.715604,
std,,78057.626514,30.135265,150.607534,
min,,13382.0,4.0,25.7955,
25%,,20294.5,7.0,42.568501,
50%,,22465.0,8.0,48.450626,
75%,,59186.0,26.0,121.687645,


In [None]:
print_random_pages(documents_5)

Page ID: 8
Report Name: kkr_global_view_2023
Page Text: Real Assets: We Still Favor Collateral-Based Cash Flows Exhibit 5 and Continue to Pound the Table on this Theme. Our proprietary survey work suggests that because too many We See a Structural Labor Shortage That Emerged During COVID in Key Markets Such as the United States investors are still underweight Real Assets in their portfolios, there remains a high degree of latent demand for this asset Pre-COVID Trend and KKR GMAA Forecast of U.S. Labor Force Size, OOOs class across insurance companies, family offices, and endowments and foundations. Moreover, the fundamentals 185,000 Pre-COVID Trend are compelling, especially on the Energy, Asset-Based Base Case Finance, Real Estate Credit, and Infrastructure sides of the 180,000 00008 High Case business. Also, as we detail below, we think that Real Assets, 088808 Low Case 175,000 and Energy in particular, could be a really important hedge if the dollar is not as strong in 2023. 170,000

In [None]:
documents_5.to_csv("parsed-docs-easyocr.csv", index=False)

## Results

This notebook compared two main approaches for extracting text from financial PDF documents.

*   The first approach doesn't rely on OCR and explores three methods: PyPDF2, pdfplumber, and PyMuPDF

  => PyPDF2 is simple to use but does not preserve whitespace, so it struggles with complex layouts. pdfplumber improves accuracy but was the slowest of the three in this run (about 1 min 28 s for the 20 documents, against 33 s for PyPDF2 and 3.4 s for PyMuPDF). Its output also contains spurious whitespace and incomplete or incorrect words, and it does not preserve layout or table structure. PyMuPDF was both accurate and by far the fastest, even on complex document layouts, although it sometimes fails to parse complex structures.

*   The second approach explores two methods: Tesseract OCR and EasyOCR

  => Tesseract OCR ran faster when executed directly through shell commands on the OS than through its Python wrapper pytesseract. It extracted paragraphs and clearly visible text blocks reasonably well. It preserved table structures to some extent, but the output needs further cleaning to remove line breaks and excess whitespace.

  EasyOCR, on the other hand, was more accurate but did not preserve layout or spacing. Table structures were lost and extracted as plain text, ignoring their original formatting. Most of the text was nevertheless captured.
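
The cleaning step mentioned for the Tesseract output can be as simple as collapsing whitespace while keeping paragraph breaks. The sketch below shows one possible version of it. It was not applied to the results in this notebook, and `clean_ocr_text` and the sample string are hypothetical.

```python
import re


def clean_ocr_text(text):
    """Collapse runs of whitespace while keeping paragraph breaks."""
    # A blank line (two or more newlines) marks a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, single line breaks and repeated spaces become one space.
    cleaned = [re.sub(r"\s+", " ", paragraph).strip() for paragraph in paragraphs]
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)


raw = "Major   Currencies\nPerformance\n\n\nUSD 103.88   0.3%\n"
print(clean_ocr_text(raw))
```

Note that collapsing whitespace also flattens any table alignment that Tesseract preserved, so tables that matter should be handled separately before this step.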

Based on this comparison, we conclude that PyMuPDF is the best choice here for extracting text from non-scanned PDF documents. It handles complex layouts, is reasonably accurate, and is remarkably fast, averaging about 0.17 s per document in this run. In contrast, the OCR methods were far more resource-intensive (34 to 39 minutes for the same 20 documents) and gave less certain results.
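
To put the speed gap in perspective, the wall times reported by the `%%time` cells above can be converted to pages per second. The sketch below only redoes that arithmetic. The page total is derived from the profiling tables (20 reports with a mean of 23.15 pages each). The figures come from a single run on one machine, so treat them as indicative.

```python
# Wall times in seconds, copied from the %%time outputs in this notebook.
wall_time_s = {
    "pypdf": 33.2,
    "pdfplumber": 88.0,        # 1min 28s
    "pymupdf": 3.39,
    "tesseract": 34 * 60 + 3,  # 34min 3s
    "easyocr": 39 * 60 + 14,   # 39min 14s
}
total_pages = round(20 * 23.15)  # 20 reports x mean of 23.15 pages

fastest = min(wall_time_s, key=wall_time_s.get)
for method, seconds in sorted(wall_time_s.items(), key=lambda item: item[1]):
    pages_per_s = total_pages / seconds
    slowdown = seconds / wall_time_s[fastest]
    print(f"{method:<11} {pages_per_s:8.2f} pages/s  {slowdown:7.1f}x the time of {fastest}")
```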

In the next step, we'll use the parsed documents for further text analysis.