# PDF Extraction

This notebook shows examples of text extraction from PDF files using several packages, then compares the results with an embedding similarity search.

**Table of contents**<a id='toc0_'></a>    
- 1. [Non OCR extraction methods](#toc1_)    
  - 1.1. [Pypdf2](#toc1_1_)    
  - 1.2. [Pymupdf](#toc1_2_)    
  - 1.3. [Fitz](#toc1_3_)    
  - 1.4. [Unstructured](#toc1_4_)    
    - 1.4.1. [Unstructured local pdf loader](#toc1_4_1_)    
    - 1.4.2. [Unstructured api loader](#toc1_4_2_)    
- 2. [OCR and table extraction methods](#toc2_)    
  - 2.1. [Unstructured Pytesseract loader](#toc2_1_)    
  - 2.2. [Paddle OCR loader](#toc2_2_)    
- 3. [Evaluate loaded docs by embedding similarity](#toc3_)    
  - 3.1. [Embedding & Storage](#toc3_1_)    
  - 3.2. [Similarity search](#toc3_2_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=2
	maxLevel=4
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [5]:
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

import glob
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter

## 1. <a id='toc1_'></a>[Non OCR extraction methods](#toc0_)

In [6]:
# Location of the sample PDF files
folder_loc = os.path.join(kit_dir, 'data/sample_data/sample_pdfs')
pdf_files = glob.glob(os.path.join(folder_loc, '*.pdf'))
# Use the first PDF found as the running example
sample_pdf = pdf_files[0]

##### Load text splitter

In [3]:
text_splitter = RecursiveCharacterTextSplitter(
        # Set a small chunk size, just to make splitting evident.
        chunk_size = 500,
        chunk_overlap  = 100,
        length_function = len,
        separators = ["\n\n\n","\n\n", "\n", "."]
    )
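
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal stand-alone sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on the listed separators). The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: the alphabet, with a window of 10 and an overlap of 4
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The last four characters of each chunk reappear at the start of the next one, which is the same kind of repetition visible between consecutive chunks in the loader outputs in this notebook.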

### 1.1. <a id='toc1_1_'></a>[Pypdf2](https://pypdf2.readthedocs.io/en/3.0.0/)          [&#8593;](#toc0_)

In [6]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(sample_pdf)
docs_pypdf2 = loader.load_and_split(text_splitter = text_splitter)

for doc in docs_pypdf2:
    print(f'{doc.page_content}\n---')


Accelerate Data-Driven 
Decision-Making with 
State-of-the-Art AI
Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These 
challenges include:
The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like 
Google or Facebook, you’ll be hard-pressed to attract top talent, resulting in a subpar AI solution. It’s time you demand a
---
solution to this challenge and it’s time that companies meet this need.
  01 Copyright © 2021 SambaNova Systems, Inc. All rights reserved.ROADBLOCKS TO INNOVATION
UNLOCKING THE FUTURE
SambaNova enables organizations to build and deploy AI solutions for natural language processing, high-resolution 
computer vision, and recommendation. With state-of-the-art accuracy, unmatched scalability, and ease of use,
---
SambaNova delivers AI capabilities at a fraction of the time and expense it takes to develop complex in-house 
infrastructur

### 1.2. <a id='toc1_2_'></a>[Pymupdf](https://pymupdf.readthedocs.io/en/latest/)          [&#8593;](#toc0_)

In [5]:
from langchain.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(sample_pdf)
docs_pymupdf = loader.load_and_split(text_splitter = text_splitter)

for doc in docs_pymupdf:
    print(f'{doc.page_content}\n---')


Accelerate Data-Driven 
Decision-Making with 
State-of-the-Art AI
Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These 
challenges include:
The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like 
Google or Facebook, you’ll be hard-pressed to attract top talent, resulting in a subpar AI solution. It’s time you demand a
---
solution to this challenge and it’s time that companies meet this need.
  01
Copyright © 2021 SambaNova Systems, Inc. All rights reserved.
ROADBLOCKS TO INNOVATION
UNLOCKING THE FUTURE
SambaNova enables organizations to build and deploy AI solutions for natural language processing, high-resolution 
computer vision, and recommendation. With state-of-the-art accuracy, unmatched scalability, and ease of use,
---
SambaNova delivers AI capabilities at a fraction of the time and expense it takes to develop complex in-house 
infrastructu

### 1.3. <a id='toc1_3_'></a>[Fitz](https://pymupdf.readthedocs.io/en/latest/module.html)          [&#8593;](#toc0_)

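The PyMuPDF loader uses the fitz package under the hood, but calling fitz directly provides more flexibility, for example restricting extraction to detected column boxes so that multi-column pages are read in order.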

In [6]:
import fitz
from src.multi_column import column_boxes
from langchain.schema import Document

docs = fitz.open(sample_pdf)

docs_fitz = []
for page in docs:
    full_text = ''
    # Bounding boxes that wrap each text column on the page
    bboxes = column_boxes(page, footer_margin=100, no_image_text=True)
    for rect in bboxes:
        # Extract only the text inside the current column box
        full_text += page.get_text(clip=rect, sort=True)
    metadata = {"source": sample_pdf}
    doc = Document(page_content=full_text, metadata=metadata)
    docs_fitz.append(doc)
# Split the per-page documents into chunks
docs_fitz = text_splitter.split_documents(docs_fitz)
for doc in docs_fitz:
    print(f'{doc.page_content}\n---')
    

Accelerate Data-Driven 
Decision-Making with 
State-of-the-Art AI
AI is increasingly being adopted across commercial and public sector industries and is as disruptive 
today as the advent of the internet a few decades ago. And like the internet—AI promises decisive 
competitive and operational advantages to organizations that can leverage it for innovation sooner 
rather than later.
ROADBLOCKS TO INNOVATION
---
rather than later.
ROADBLOCKS TO INNOVATION
Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These 
challenges include:
• Critical skill gaps due to machine learning talent scarcity
• Lack of expertise in computing architectures
• Difculty in keeping on top of latest models and techniques
• Investment justiﬁcation without proof of prior impact
UNLOCKING THE FUTURE
---
• Investment justiﬁcation without proof of prior impact
UNLOCKING THE FUTURE
The AI talent shortage continues to burden companies across all indus
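
The Fitz output above keeps the bullet list and section order that the two previous loaders lost, because text is extracted column by column. The sketch below illustrates the idea on toy data. It is not the implementation of `column_boxes`: it assumes exactly two columns split at mid-page, and the `Block` type and `column_reading_order` helper are hypothetical.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x0: float  # left edge of the text block
    y0: float  # top edge of the text block
    text: str


def column_reading_order(blocks: list[Block], page_width: float) -> list[str]:
    """Read the left column top to bottom, then the right column.

    Assumes a two-column layout split at the middle of the page.
    """
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x0 < mid), key=lambda b: b.y0)
    right = sorted((b for b in blocks if b.x0 >= mid), key=lambda b: b.y0)
    return [b.text for b in left + right]


blocks = [
    Block(50, 100, "left 1"),
    Block(320, 100, "right 1"),
    Block(50, 200, "left 2"),
    Block(320, 200, "right 2"),
]

# Naive order: top to bottom across the full page width, mixing the columns
naive = [b.text for b in sorted(blocks, key=lambda b: (b.y0, b.x0))]
# Column-aware order: finish one column before starting the next
by_column = column_reading_order(blocks, page_width=600)

print(naive)
print(by_column)
```

Sorting by vertical position alone interleaves the two columns, while the column-aware order keeps each column's text contiguous.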

### 1.4. <a id='toc1_4_'></a>[Unstructured](https://unstructured.io/)          [&#8593;](#toc0_)

The Unstructured package extracts text from many document types and integrates with [LangChain](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file).


#### 1.4.1. <a id='toc1_4_1_'></a>[Unstructured local pdf loader](#toc0_)

In [8]:
from langchain.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(sample_pdf)
docs_unstructured = loader.load_and_split(text_splitter = text_splitter)

for doc in docs_unstructured:
    print(f"{doc.page_content}\n---")

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION
---
ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE
---
UNLOCKING THE FUTURE

The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like Google or Facebook,

#### 1.4.2. <a id='toc1_4_2_'></a>[Unstructured api loader](#toc0_)

In [4]:
from langchain.document_loaders import UnstructuredAPIFileLoader
# Register at Unstructured.io to get a free API key
load_dotenv(os.path.join(repo_dir, '.env'))
loader = UnstructuredAPIFileLoader(sample_pdf,
                                   api_key=os.environ.get('UNSTRUCTURED_API_KEY'),
                                   url=os.environ.get('UNSTRUCTURED_URL'))
docs_unstructured_api = loader.load_and_split(text_splitter = text_splitter)
for doc in docs_unstructured_api:
    print(f"{doc.page_content}\n---")

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION
---
ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE
---
UNLOCKING THE FUTURE

The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like Google or Facebook,

## 2. <a id='toc2_'></a>[OCR and table extraction methods](#toc0_)

### 2.1. <a id='toc2_1_'></a>[Unstructured Pytesseract loader](#toc0_)

To run this loader, install the pytesseract and poppler-utils packages on your machine, or run this notebook in the data_extraction Docker container.

Behind the scenes, this loader uses Unstructured and the pytesseract module to perform layout detection, then transcribes the text and renders tables as HTML tables.

In [None]:
from data_extraction.src.pdf_table_text_extraction import UnstructuredPdfPytesseractLoader

loader = UnstructuredPdfPytesseractLoader(sample_pdf)
docs=loader.load()

for doc in docs:
    print(f'{doc.page_content}\n---\n')

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


commercial and public sector industries and is as disruptive decades ago. And like the internet—Al promises decisive to organizations that can leverage it for innovation sooner

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE

The AI talent shorta
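
The sample output above contains no table, so the HTML table form is not visible in it. As a rough illustration of what rendering a table as HTML means, here is a minimal sketch that turns already-recognized cell text into an HTML table string. The `rows_to_html_table` helper is hypothetical, and the exact markup the loader emits is an assumption, not taken from its source.

```python
from html import escape


def rows_to_html_table(rows: list[list[str]]) -> str:
    """Render rows of cell text as an HTML table; the first row is the header."""
    if not rows:
        return "<table></table>"
    header, *body = rows
    parts = ["<table>"]
    # Header cells use <th>; special characters are escaped for valid HTML
    parts.append("<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>")
    for row in body:
        parts.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    parts.append("</table>")
    return "".join(parts)


# Toy table: one header row and one data row
table = [["Model", "Params"], ["A & B", "1.5B"]]
html_table = rows_to_html_table(table)
print(html_table)
```

Keeping the row and column structure in the extracted text preserves which value belongs to which header, which plain text extraction tends to lose.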

### 2.2. <a id='toc2_2_'></a>[Paddle OCR loader](#toc0_)

To run this loader, run this notebook in the paddle-ocr environment or in the data_extraction_paddle Docker container.

Behind the scenes, this loader uses the Paddle OCR and Paddle Structure modules to perform layout detection, mask images and equations, and then transcribe the text and render tables as HTML tables.

In [None]:
from data_extraction.src.multi_column_ocr import PaddleOCRLoader

loader = PaddleOCRLoader(sample_pdf, output_folder=os.path.join(kit_dir,'data/extraction'))
docs=loader.load()

for doc in docs:
    print(f'{doc.page_content}\n---\n')

[2024/02/06 15:47:02] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, shape_info_filename=None, precision='fp32', gpu_mem=500, image_dir=None, det_algorithm='DB', det_model_dir='/Users/jorgep/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer', det_limit_side_len=960, det_limit_type='max', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_sast_polygon=False, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_box_type='quad', det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, det_fce_box_type='poly', rec_algorithm='SVTR_LCNet', rec_model_dir='/Users/jorgep/.paddleocr/whl/rec/en/en_PP-OCRv3_rec_infer', rec_image_shape='3, 48, 320', rec_batch_num=6, max_te

## 3. <a id='toc3_'></a>[Evaluate loaded docs by embedding similarity](#toc0_)

### 3.1. <a id='toc3_1_'></a>[Embedding & Storage](#toc0_)

In [11]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

encode_kwargs = {'normalize_embeddings': True}
embd_model = HuggingFaceInstructEmbeddings( model_name='intfloat/e5-large-v2',
                                            embed_instruction="", # no instructions needed for candidate passages
                                            query_instruction="Represent this sentence for searching relevant passages: ",
                                            encode_kwargs=encode_kwargs)
vectorstore_pypdf2 = FAISS.from_documents(documents=docs_pypdf2, embedding=embd_model)
vectorstore_pymupdf = FAISS.from_documents(documents=docs_pymupdf, embedding=embd_model)
vectorstore_fitz = FAISS.from_documents(documents=docs_fitz, embedding=embd_model)
vectorstore_unstructured_local = FAISS.from_documents(documents=docs_unstructured, embedding=embd_model)
vectorstore_unstructured_api = FAISS.from_documents(documents=docs_unstructured_api, embedding=embd_model)

load INSTRUCTOR_Transformer
max_seq_length  512
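
The `normalize_embeddings` option above requests unit-length embedding vectors. A small stand-alone sketch of why that is convenient, using toy 3-dimensional vectors rather than real embeddings: for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 - 2 * cosine, so ranking by either measure gives the same order. The helper names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for a query embedding and two passage embeddings
query_vec = normalize([1.0, 2.0, 2.0])
passages = {
    "close": normalize([1.0, 2.0, 1.5]),
    "far": normalize([-2.0, 0.5, 1.0]),
}

for name, vec in passages.items():
    cosine = dot(query_vec, vec)
    dist = squared_l2(query_vec, vec)
    # For unit vectors, dist and 2 - 2 * cosine agree up to rounding
    print(name, round(cosine, 4), round(dist, 4), round(2 - 2 * cosine, 4))
```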


### 3.2. <a id='toc3_2_'></a>[Similarity search](#toc0_)

In [13]:
query = "what is the value offer of sambanova?"

ans = vectorstore_pypdf2.similarity_search(query)
print("-------PyPDF2 Loader----------\n")
print(ans[0].page_content)


ans_2 = vectorstore_pymupdf.similarity_search(query)
print("--------PyMuPDF Loader------------\n")
print(ans_2[0].page_content)


ans_3 = vectorstore_fitz.similarity_search(query)
print("--------Fitz loader------------\n")
print(ans_3[0].page_content)

ans_4 = vectorstore_unstructured_local.similarity_search(query)
print("-------Unstructured local Loader----------\n")
print(ans_4[0].page_content)


ans_5 = vectorstore_unstructured_api.similarity_search(query)
print("--------Unstructured api loader------------\n")
print(ans_5[0].page_content)


-------PyPDF2 Loader----------

met that criteria.”
SambaNova Systems is an AI innovation company that empowers organizations to deploy best-in-class solutions for computer vision, 
natural language processing, recommendation systems, and AI for science with conﬁdence. SambaNova’s ﬂagship oﬀering, 
Dataﬂow-as-a-Service, helps organizations rapidly deploy AI in days, unlocking new revenue and boosting operational eﬃciency.
--------PyMuPDF Loader------------

met that criteria.”
SambaNova Systems is an AI innovation company that empowers organizations to deploy best-in-class solutions for computer vision, 
natural language processing, recommendation systems, and AI for science with conﬁdence. SambaNova’s ﬂagship ofering, 
Dataﬂow-as-a-Service, helps organizations rapidly deploy AI in days, unlocking new revenue and boosting operational efciency.
--------Fitz loader------------

HARNESS THE POWER OF NEXT-GENERATION AI
SambaNova enables organizations to build and deploy AI solutions for na
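
The per-loader query blocks in this section repeat the same three lines, so they could be folded into a loop. Below is a sketch under stated assumptions: `FakeStore` is a hypothetical stand-in so the example runs without the real indexes, and `top_hit_per_loader` is an illustrative helper. In the notebook, the dictionary would hold the `vectorstore_*` objects built in section 3.1.

```python
from types import SimpleNamespace


class FakeStore:
    """Hypothetical stand-in for a vector store; returns canned documents."""

    def __init__(self, texts):
        self._texts = texts

    def similarity_search(self, query, k=4):
        # Mimic the return shape: a list of objects with a page_content attribute
        return [SimpleNamespace(page_content=t) for t in self._texts[:k]]


def top_hit_per_loader(stores, query):
    """Return {loader name: text of the best-matching chunk, or None if no hit}."""
    results = {}
    for name, store in stores.items():
        hits = store.similarity_search(query, k=1)
        results[name] = hits[0].page_content if hits else None
    return results


stores = {
    "PyPDF2": FakeStore(["chunk from pypdf2", "another chunk"]),
    "Fitz": FakeStore(["chunk from fitz"]),
}
top_hits = top_hit_per_loader(stores, "what is the value offer of sambanova?")
for name, text in top_hits.items():
    print(f"------- {name} loader -------\n{text}\n")
```

Collecting the top hit per loader in one dictionary also makes it easier to compare the retrieved chunks side by side.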