In [16]:
!pip install indoxMiner python-dotenv pdfminer.six pi_heif unstructured_inference unstructured_pytesseract tesseract pytesseract

Collecting tesseract
  Downloading tesseract-0.1.3.tar.gz (45.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.6/45.6 MB 9.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Building wheels for collected packages: tesseract
  Building wheel for tesseract (setup.py) ... done
  Created wheel for tesseract: filename=tesseract-0.1.3-py3-none-any.whl size=45562552 sha256=6c2fecc6f10d2afd25f70e04c059009de4a7ba0b4c7dba88f082f6d7884493c0
  Stored in directory: /root/.cache/pip/wheels/71/c9/aa/698c579693e83fdda9ad6d6f0d8f61ed986e27925ef576f109
Successfully built tesseract
Installing collected packages: tesseract, pytesseract
Successfully installed pytesseract-0.3.13 tesseract-0.1.3


In [18]:
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.5).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.

In [1]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Import necessary classes from the indoxMiner library

The import statement below brings in the `indoxMiner` components used to define schemas, configure and run extraction, process documents, and specify output formats. Each class has a specific role:

- **ExtractorSchema**: Defines the schema to guide data extraction from documents.
- **Field**: Represents an individual data field to be extracted, used in schema definition.
- **FieldType**: Specifies data types (e.g., text, number) for each field in a schema.
- **ValidationRule**: Allows the application of validation checks (e.g., formatting, regex) on extracted data.
- **OutputFormat**: Determines the format (e.g., JSON, CSV) in which extracted data is output. It is not used in this notebook, so the import cell omits it.
- **Extractor**: The primary class responsible for executing the data extraction process based on the schema.
- **DocumentProcessor**: Manages the document processing steps (e.g., reading and pre-processing) before extraction.
- **ProcessingConfig**: Provides configuration options for customizing document processing and extraction.
- **Schema**: Represents the full schema containing multiple fields for a complete data extraction structure.
- **OpenAi**: Provides an interface to OpenAI's language model, which can enhance document processing and extraction.


In [2]:
from indoxMiner import (
    ExtractorSchema,
    Field,
    FieldType,
    ValidationRule,
    Extractor,
    DocumentProcessor,
    ProcessingConfig,
    Schema,
    OpenAi
)


In [3]:
from google.colab import files
uploaded = files.upload()

In [5]:
# OPENAI_API_KEY is loaded from the environment in cell In [4], which must run first
openai_extractor = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini")

# Create the Extractor with a Specified Schema

In this cell, an instance of `Extractor` is created using `openai_extractor` as the language model and `Schema.Invoice` as the extraction schema.

`Schema` bundles several predefined schemas, each tailored to one document type. Choose the one that matches the documents being processed. The schemas available in this example are:

### Available Schemas

- **Passport**: Extracts details from a passport, including information like passport number, name, nationality, and date of birth.
- **Invoice**: Extracts key data from an invoice, such as invoice number, company details, itemized charges, and tax amounts.
- **Flight Ticket**: Extracts flight information, including ticket number, passenger name, flight details, and class of travel.
- **Bank Statement**: Extracts information from a bank statement, including account details, transaction history, and balances.
- **Medical Record**: Extracts medical details like patient information, diagnoses, medications, and physician details.
- **Driver License**: Extracts license details, including license number, holder information, and validity dates.

### Customization with Manual Schema Example

Users can define their own schemas by specifying `ExtractorSchema` with `Field` definitions that suit their data extraction needs. Here’s an example of a custom, manually defined schema that captures line-item details from an invoice:

```python
schema = ExtractorSchema(
    fields=[
        Field(
            name="amount",
            description="The quantity or hours of service/product (e.g., 2.25, 40.3)",
            field_type=FieldType.FLOAT,
        ),
        Field(
            name="description",
            description="Description of the service or product provided",
            field_type=FieldType.STRING,
        ),
        Field(
            name="price_per_unit",
            description="Price per unit in euro (e.g., 135.00)",
            field_type=FieldType.FLOAT,
        ),
        Field(
            name="total_price",
            description="Total price for this line item in euro (amount * price_per_unit)",
            field_type=FieldType.FLOAT,
        ),
        Field(
            name="invoice_id",
            description="ID of the invoice",
            field_type=FieldType.INTEGER,
        ),
    ],
)
```
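
The description of `total_price` states a relationship (`amount * price_per_unit`) that can be checked once rows have been extracted. The following is a minimal sketch of such a check in plain Python. `LineItem` and `total_is_consistent` are hypothetical names introduced here for illustration and are not part of `indoxMiner`; the sample values come from the first invoice shown in this notebook's output.

```python
from dataclasses import dataclass
from math import isclose


@dataclass
class LineItem:
    """Plain-Python stand-in (hypothetical) for one row of the custom schema."""
    amount: float          # quantity or hours
    description: str
    price_per_unit: float
    total_price: float
    invoice_id: int


def total_is_consistent(item: LineItem, tolerance: float = 0.01) -> bool:
    """Return True when total_price equals amount * price_per_unit within a cent."""
    expected = item.amount * item.price_per_unit
    return isclose(item.total_price, expected, abs_tol=tolerance)


# Values taken from the sample invoice #46681 in this notebook's output.
stool = LineItem(
    amount=4,
    description="Novimex Swivel Stool, Set of Two",
    price_per_unit=666.84,
    total_price=2667.36,
    invoice_id=46681,
)
print(total_is_consistent(stool))  # True
```

A check like this is useful because the values come from OCR text and a language model, so an occasional misread digit is plausible.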

In [6]:
# Create the extractor
extractor = Extractor(llm=openai_extractor, schema=Schema.Invoice)


In [7]:
import os

folder_path = "/content/invoice_pdf_dataset"
invoices = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.pdf')]

# Use the first 10 PDFs as a small test set
test_dataset = invoices[:10]

In [4]:
# Set your OpenAI API key
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
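
`os.environ['OPENAI_API_KEY']` raises a bare `KeyError` when the variable is missing, which can be confusing in a fresh Colab session. A small helper with a clearer message is one option. This is a sketch only: `require_env` and the `DEMO_API_KEY` variable are hypothetical names used for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to a .env file or export it before running."
        )
    return value


# Demonstration with a throwaway variable (hypothetical, not a real key).
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In this notebook the call would be `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`, placed after `load_dotenv()`.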

# Document Processing with `DocumentProcessor`

In this cell, an instance of `DocumentProcessor` is created to handle document processing tasks using a specified dataset (`test_dataset`). This process prepares the document data for extraction by performing any necessary pre-processing steps.

### Steps Explained:

1. **Initialize the Document Processor**: `doc_processor = DocumentProcessor(test_dataset)`  
   - `DocumentProcessor` is initialized with a dataset (`test_dataset`) that contains the documents to be processed.
   - This dataset serves as the input source for the data extraction process.

2. **Process the Document**: `data = doc_processor.process()`  
   - The `process` method is called on the `DocumentProcessor` instance.
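   - This method processes each document in `test_dataset` and returns a dictionary that maps each filename to a list of `Document` objects. Each `Document` carries the recognized text (`page_content`) and `metadata` such as filename, file type, page number, and source path.
   - The processed `data` can then be passed to an `Extractor` instance for schema-driven extraction.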

3. **Inspect the Processed Data**: `data`  
   - The `data` variable holds the processed document information, ready for extraction or further steps in the pipeline.

This step turns the raw PDF files into text that the schema-driven extractor can consume.
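
Judging from the cell output in this notebook, `process()` returns a dictionary keyed by filename whose values are lists of `Document` objects with `page_content` and `metadata` attributes. The sketch below shows one way to inspect such a structure before extraction. The `Document` dataclass here is a hypothetical stand-in that mirrors only those two attributes, and `summarize` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in mirroring the two attributes visible in the output."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(processed: dict) -> list:
    """Return (filename, document count, character count) for each processed file."""
    rows = []
    for filename, docs in processed.items():
        chars = sum(len(doc.page_content) for doc in docs)
        rows.append((filename, len(docs), chars))
    return rows


# Toy input shaped like the real output; the filename is made up.
sample = {
    "invoice_a.pdf": [Document("Superstore INVOICE # 46681", {"page_number": 1})],
}
for filename, n_docs, n_chars in summarize(sample):
    print(f"{filename}: {n_docs} document(s), {n_chars} characters")
```

A very low character count for a file is a quick hint that OCR failed on it, which is worth catching before spending API calls on extraction.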


In [9]:
doc_processor = DocumentProcessor(test_dataset)

# Process the document
data = doc_processor.process()
data

{'invoice_Lindsay Williams_46681.pdf': [Document(page_content='Superstore INVOICE # 46681 Date: Jan 03 2012 Bill To: Ship To: me vue pe . Ship Mode: Standard Class Lindsay Williams Mosul, Ninawa, Iraq Balance Due: $2,748.62 heya Quantity latclic) Amount Novimex Swivel Stool, Set of Two 4 $666.84 $2,667.36 Chairs, Furniture, FUR-CH-5414 Subtotal: $2,667.36 Shipping: $81.26 Total: $2,748.62 Notes: Thanks for your business! Terms: Order ID : IZ-2012-LW699061-4091 1', metadata={'filename': 'invoice_Lindsay Williams_46681.pdf', 'filetype': 'application/pdf', 'page_number': 1, 'source': '/content/invoice_pdf_dataset/invoice_Lindsay Williams_46681.pdf'})],
 'invoice_Corey Catlett_6832.pdf': [Document(page_content='Superstore INVOICE # 6832 Date: Nov 25 2012 To: Ship To: . . Ship Mode: Standard Class Corey Catlett Chapeco, Santa Catarina, Brazil Balance Due: $3,476.60 Bill heya Quantity latclic) Amount Hoover Stove, White 3 $1,133.22 $3,399.66 Appliances, Office Supplies, OFF-AP-4745 Subtotal:

# Extracting and Structuring Data into a DataFrame

In this cell, the `Extractor` instance processes the previously prepared document data (`data`) to extract specific fields as defined by the schema. The extracted data is then converted into a DataFrame format for easier analysis and manipulation.

### Steps Explained:

1. **Data Extraction**: `extracted_data = extractor.extract(data)`  
   - The `extract` method is called on the `extractor` instance, passing in `data` (the processed document data).
   - This method applies the specified schema to `data`, extracting fields according to the schema definitions (e.g., fields from `Schema.Invoice`).
   - The resulting `extracted_data` contains the structured, schema-aligned information.

2. **Conversion to DataFrame**: `extracted_data_df = extractor.to_dataframe(extracted_data)`  
   - The `to_dataframe` method converts `extracted_data` into a DataFrame format.
   - This structured format makes it easy to view, analyze, and manipulate the data using DataFrame operations.

3. **Display the DataFrame**: `extracted_data_df`  
   - Displaying `extracted_data_df` shows the extracted data in a tabular form.
   - This final output allows for easy inspection of the extracted fields across multiple documents, facilitating analysis or further processing.

By converting to a DataFrame, the extracted information is organized in a format that is convenient for downstream tasks such as visualization, reporting, or exporting to other formats (e.g., CSV, Excel).
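
As an illustration of the export step, the sketch below writes a few of the extracted rows to CSV and sums the totals. It uses only the standard library so that it runs without `indoxMiner` or an API key; assuming `extracted_data_df` is a pandas DataFrame, `extracted_data_df.to_csv("invoices.csv", index=False)` would be the direct equivalent. `rows_to_csv` is a hypothetical helper, and the rows are copied from the output table in this notebook with the `Item List` column omitted.

```python
import csv
import io

# Three rows copied from the notebook's output table (Item List omitted).
rows = [
    {"Invoice Number": 46681, "Date": "2012-01-03",
     "Customer Name": "Lindsay Williams", "Total Amount": 2748.62},
    {"Invoice Number": 6832, "Date": "2012-11-25",
     "Customer Name": "Corey Catlett", "Total Amount": 3476.60},
    {"Invoice Number": 21272, "Date": "2012-07-19",
     "Customer Name": "Tracy Hopkins", "Total Amount": 580.92},
]


def rows_to_csv(records: list) -> str:
    """Serialize a list of dictionaries to CSV text, using the first row's keys as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
grand_total = round(sum(r["Total Amount"] for r in rows), 2)

print(csv_text)
print("Sum of totals:", grand_total)
```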


In [11]:
extracted_data = extractor.extract(data)
extracted_data_df = extractor.to_dataframe(extracted_data)
extracted_data_df

|   | Invoice Number | Date | Customer Name | Total Amount | Item List |
|---|---|---|---|---|---|
| 0 | 46681 | 2012-01-03 | Lindsay Williams | 2748.62 | [Novimex Swivel Stool, Set of Two, Chairs, Fur... |
| 1 | 6832 | 2012-11-25 | Corey Catlett | 3476.6 | [Hoover Stove, White, Appliances, Office Suppl... |
| 2 | 21272 | 2012-07-19 | Tracy Hopkins | 580.92 | [Hewlett Personal Copier, Color, Copiers, Tech... |
| 3 | 50544 | 2012-02-14 | Muhammed Ufa | 5403.47 | [Canon Fax Machine, Color, Copiers, Technology... |
| 4 | 23250 | 2012-11-10 | Theone Pippenger | 6084.97 | [Tenex File Cart, Single Width 8, Storage, Off... |
| 5 | 9240 | 2012-12-07 | Steven Ward | 9184.5 | [Nokia Signal Booster, VoIP, Phones, Technolog... |
| 6 | 22216 | 2012-07-29 | Heather Jas | 938.18 | [KitchenAid Microwave, White, Appliances, Offi... |
| 7 | 46524 | 2012-12-23 | Jennifer Ferguson | 17531.14 | [Barricks Computer Table, with Bottom Storage,... |
| 8 | 39139 | 2012-04-18 | Allen Goldenen | 182.15 | [GBC Durable Plastic Covers, Binders, Office S... |
| 9 | 14228 | 2012-09-14 | Doug Jacobs | 5328.95 | [Tenex Frame, Durable] |
