<a href="https://colab.research.google.com/github/luc4t/LLM/blob/main/Unstructured_PDFtoJSON.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Quick Tour**

The following examples show how to get started with the `unstructured` library. See
our [documentation page](https://unstructured-io.github.io/unstructured) for a full description
of the features in the library.

Another way to try out the `unstructured` library is to run it in a Docker container, which works on both Intel/AMD and Apple Silicon machines. Check out the [instructions for using the docker image](https://github.com/Unstructured-IO/unstructured#dizzy-instructions-for-using-the-docker-image).

In [14]:
# Install Requirements
!apt-get -qq install poppler-utils tesseract-ocr
# Upgrade Pillow to latest version
%pip install -q --user --upgrade pillow
# Install Python Packages
%pip install -q unstructured["all-docs"]==0.12.6
# NOTE: you may also upgrade to the latest version with the command below,
#       though a more recent version of unstructured will not have been tested with this notebook
# %pip install -q --upgrade unstructured

In [35]:
!apt-get -qq install tesseract-ocr-deu

In [36]:
!apt-get -qq install tesseract-ocr-eng

In [2]:
# Install NLTK Data
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

### PDF Parsing

This section compares two strategies for parsing PDF documents: "fast" and "hi_res". A third strategy, "ocr_only", is also available, and if no strategy is passed the default is "auto".

If your main objective is extracting text from a "clean" PDF (i.e., one that does not include text in images that requires OCR), go with the "fast" option.

Otherwise, if your PDF may contain images with text to extract, or you prefer better-structured Elements that characterize the text items within the document more precisely, go with the "hi_res" option.

Naturally, "fast" is faster than "hi_res", typically by about an order of magnitude.
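
The rule of thumb above can be written down as a small helper. This is only a sketch: `choose_strategy` and its two flags are hypothetical names introduced here for illustration and are not part of `unstructured`; only the returned strategy strings come from the library.

```python
def choose_strategy(has_text_in_images: bool, wants_layout_elements: bool = False) -> str:
    """Pick a partition_pdf strategy following the rule of thumb above.

    Hypothetical helper for illustration, not part of the unstructured library.
    """
    # Text inside images (OCR) or layout-aware elements need the model-based path.
    if has_text_in_images or wants_layout_elements:
        return "hi_res"
    # Clean, text-only PDFs can use the much quicker text-extraction path.
    return "fast"


# A scanned report needs OCR; a digitally generated PDF usually does not.
print(choose_strategy(has_text_in_images=True))
print(choose_strategy(has_text_in_images=False))
```

The returned string can then be passed as the `strategy` argument of `partition_pdf`.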

In [28]:
from unstructured.partition.pdf import partition_pdf

In [33]:
# Define parameters for Unstructured's library

## include_page_breaks
# include page breaks (default is False)
include_page_breaks = True

## strategy
# The strategy to use for partitioning the PDF. Valid strategies are "hi_res", "ocr_only", and "fast".
# When using the "hi_res" strategy, the function uses a layout detection model to identify document elements.
# "hi_res" is used for analyzing PDFs and extracting table structure (default is "auto")
strategy = "hi_res"

## infer_table_structure
# Only applicable if `strategy=hi_res`.
# If True, any Table elements that are extracted will also have a metadata field named "text_as_html" where the table's text content is rendered into an html string.
# I.e., rows and cells are preserved.
# Whether True or False, the "text" field is always present in any Table element and is the text content of the table (no structure).

infer_table_structure = strategy == "hi_res"

## extract_element_types
# Get images of tables
extract_element_types = ["Table"] if infer_table_structure else None

## max_characters
# The maximum number of characters to include in a partition (document element)
# If None is passed, no maximum is applied.
# Only applies to the "ocr_only" strategy (default is 1500)
# Always define the variable so the partition_pdf call below works for every strategy
max_characters = 1500 if strategy == "ocr_only" else None

## languages
# The languages to use for the Tesseract agent.
# To use a language, you'll first need to install the appropriate Tesseract language pack.
languages = ["eng", "deu"]  # one Tesseract language code per list entry, e.g. ["eng", "por"] (default is "eng")

## model_name
# @requires_dependencies("unstructured_inference")
# yolox: best model for table extraction. Other options are yolox_quantized, detectron2_onnx and chipper depending on file layout
# source: https://unstructured-io.github.io/unstructured/best_practices/models.html
model_name = "yolox"

In [31]:
# Partition with the layout-model settings defined above
elements = partition_pdf(
    filename="Apple 10k.pdf",
    include_page_breaks=include_page_breaks,
    strategy=strategy,
    infer_table_structure=infer_table_structure,
    extract_element_types=extract_element_types,
    max_characters=max_characters,
    languages=languages,
    model_name=model_name,
)

# Partition the same file with the "fast" strategy for comparison
elements_fast = partition_pdf("Apple 10k.pdf", strategy="fast")

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
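
With `infer_table_structure=True`, each Table element should carry its HTML rendering in the `text_as_html` metadata field described earlier. The sketch below collects those strings. `tables_as_html` is a hypothetical helper name, and the demo uses stand-in objects (assumed to mirror the `category` and `metadata.text_as_html` attributes of real elements) so that it runs without the library installed.

```python
from types import SimpleNamespace


def tables_as_html(elements):
    """Return the HTML rendering of every Table element that has one."""
    html_tables = []
    for el in elements:
        # Skip everything that is not categorized as a table.
        if getattr(el, "category", None) != "Table":
            continue
        html = getattr(getattr(el, "metadata", None), "text_as_html", None)
        if html:
            html_tables.append(html)
    return html_tables


# Stand-in objects for illustration only; with real data, pass the
# `elements` list returned by partition_pdf instead.
demo = [
    SimpleNamespace(category="Title", metadata=SimpleNamespace(text_as_html=None)),
    SimpleNamespace(
        category="Table",
        metadata=SimpleNamespace(text_as_html="<table><tr><td>1</td></tr></table>"),
    ),
]
print(tables_as_html(demo))
```

In the notebook, `tables_as_html(elements)` would be the corresponding call on the "hi_res" output.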


Let's examine the types of elements returned for both the "hi_res" and "fast" strategies:
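
One way to make that comparison is to count how many elements of each category a strategy produced. This is a minimal sketch: `count_categories` is a hypothetical helper, and it assumes each element exposes a `category` attribute (falling back to the class name otherwise). The demo data are stand-ins so the block runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(elements):
    """Count elements per category, falling back to the class name."""
    return Counter(getattr(el, "category", type(el).__name__) for el in elements)


# Stand-in elements for illustration only.
demo_elements = [
    SimpleNamespace(category="Title"),
    SimpleNamespace(category="NarrativeText"),
    SimpleNamespace(category="NarrativeText"),
]
print(count_categories(demo_elements).most_common())
```

With the notebook's variables, `count_categories(elements)` and `count_categories(elements_fast)` give the two tallies to compare.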

In [32]:
# Export the "hi_res" elements as JSON
from unstructured.staging.base import elements_to_json

filename = "Apple 10k 3.pdf"
elements_to_json(elements, filename=f"{filename}.json")
# The file can take a while to show up in the Google Colab file browser
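
Once the file has appeared, a quick sanity check is to load it with the standard `json` module. The sketch below assumes the export is a JSON array of objects with `type` and `text` keys; `preview_elements_json` is a hypothetical helper, and the demo writes a tiny hand-made file rather than reading the real export.

```python
import json
import os
import tempfile


def preview_elements_json(path, limit=5):
    """Return (type, text snippet) pairs for the first `limit` records."""
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    return [(r.get("type"), (r.get("text") or "")[:80]) for r in records[:limit]]


# Demo on a hand-written file; in the notebook, pass f"{filename}.json" instead.
sample = [{"type": "Title", "text": "Form 10-K", "metadata": {}}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh)
    preview = preview_elements_json(demo_path)
print(preview)
```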

Let's display the type and text of some of the elements in the document: