# Passport Data Extraction Using Indox and OCR

In this notebook, we will demonstrate how to extract structured data from passport images using the Indox API and Optical Character Recognition (OCR) techniques. The primary focus will be on processing the images and extracting relevant data fields from them, allowing us to automate the extraction of critical information typically found in passports.

## Overview of IndoxMiner

**IndoxMiner** is a powerful document processing and data extraction tool designed to streamline the process of converting unstructured document data into structured, usable information. Key features of IndoxMiner include:

- **Schema-Based Extraction**: Allows users to define custom extraction schemas for various document types (e.g., invoices, passports, flight tickets), enabling tailored data extraction that meets specific requirements.

- **Validation Rules**: Supports the application of validation rules on extracted data to ensure data integrity and accuracy. Users can define patterns and constraints that extracted fields must adhere to.

- **Multi-Model Support**: Offers integration with multiple OCR models, such as Tesseract, EasyOCR, and PaddleOCR, allowing users to select the most suitable model for their specific needs.

- **OpenAI Integration**: Leverages OpenAI's language models to enhance document understanding and improve extraction quality, particularly for complex or poorly structured documents.

- **Output Formats**: Provides flexibility in output formats, enabling extracted data to be saved in various formats such as JSON, CSV, or directly into DataFrames for further analysis.

- **Efficient Document Processing**: The DocumentProcessor class manages the lifecycle of document processing, including reading, pre-processing, and applying OCR, ensuring an efficient workflow from raw documents to structured data.

- **User-Friendly API**: Simplifies interaction with complex machine learning models and OCR tools through a user-friendly API, making it accessible for users with varying levels of technical expertise.

## Setup

Before we begin, we need to install the libraries used in this demonstration. If you are using Google Colab, run the commands below; if the packages are already installed in your environment, you can skip this cell.



In [None]:
!pip install indoxMiner
!pip install opencv-python
!pip install paddlepaddle paddleocr  # or easyocr, tesseract depending on your choice


Collecting paddlepaddle
  Using cached paddlepaddle-2.6.2-cp310-cp310-manylinux1_x86_64.whl.metadata (8.6 kB)
Collecting paddleocr
  Using cached paddleocr-2.9.1-py3-none-any.whl.metadata (8.5 kB)
Collecting albumentations==1.4.10 (from paddleocr)
  Using cached albumentations-1.4.10-py3-none-any.whl.metadata (38 kB)
Using cached paddlepaddle-2.6.2-cp310-cp310-manylinux1_x86_64.whl (126.0 MB)
Using cached paddleocr-2.9.1-py3-none-any.whl (544 kB)
Using cached albumentations-1.4.10-py3-none-any.whl (161 kB)
Installing collected packages: paddlepaddle, albumentations, paddleocr
  Attempting uninstall: albumentations
    Found existing installation: albumentations 1.4.20
    Uninstalling albumentations-1.4.20:
      Successfully uninstalled albumentations-1.4.20
Successfully installed albumentations-1.4.10 paddleocr-2.9.1 paddlepaddle-2.6.2


## Import Necessary Libraries
The workflow relies on the following libraries. The import cell at the end of this section loads `os`, `pandas`, and the IndoxMiner classes; `cv2` and `gc` are optional helpers that this notebook does not import.

*   **os**: Interacts with the operating system; used here to list directories and build file paths.
*   **cv2**: The OpenCV module (installed as `opencv-python`), useful for image processing tasks such as reading, resizing, and filtering images.
*   **gc**: Python's garbage collection module, which can free memory by collecting unreferenced objects between large batches.
*   **pandas**: A data manipulation library whose DataFrames make structured data easier to handle and analyze.

## Core Components of IndoxMiner

### 1. NerdToken API
   - **Description**: NerdToken API offers a model designed to work with real-time data extraction in specialized document formats.
   - **Best For**: Real-time or streaming tasks where data extraction needs to be as efficient as possible.
   - **Features**:
      - Configurable **temperature, frequency penalty, and presence penalty** settings for tuning responses.
      - Optimized for quick setup and results.
   - **Usage**: Suited for specialized applications requiring low latency and immediate data retrieval.


### 2. DocumentProcessor
The **DocumentProcessor** class is responsible for managing the workflow of document processing. This includes reading the documents, applying Optical Character Recognition (OCR), and preparing the extracted text for further processing.

- **Key Features**:
  - **Document Handling**: Capable of processing various document formats, including images and PDFs, ensuring versatility in handling input sources.
  - **Pre-processing Capabilities**: Includes methods to pre-process documents (e.g., resizing, filtering), which can improve OCR accuracy and the quality of extracted data.
  - **Batch Processing**: Supports processing multiple documents in a single operation, optimizing efficiency when dealing with large datasets.

### 3. ProcessingConfig
The **ProcessingConfig** class provides configuration options for customizing the document processing and extraction pipeline. This allows users to specify parameters relevant to the OCR process and data extraction.

- **Key Features**:
  - **OCR Model Selection**: Users can choose from different OCR models (e.g., PaddleOCR, Tesseract, EasyOCR) based on their specific requirements.
  - **Image Preprocessing Options**: Enables configurations related to image handling, such as resolution adjustments, image formats, and processing methods.
  - **Fine-tuning for Accuracy**: Users can tweak settings to optimize the OCR process for specific document types or to accommodate varying quality levels in the source documents.

### 4. Schema
The **Schema** class defines the structure of the data to be extracted from documents. It specifies the fields to be extracted, along with their types and any associated validation rules.

- **Key Features**:
  - **Predefined Schemas**: Offers several built-in schemas tailored for common document types (e.g., passports, invoices, flight tickets), streamlining the extraction process.
  - **Custom Schema Definition**: Users can create their own schemas by specifying field names, descriptions, and types, allowing for flexibility to accommodate unique document formats.
  - **Field Validation**: Integrates validation rules to ensure that the extracted data meets specific criteria (e.g., regex patterns), enhancing data quality.
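
A regex-based field check of this kind can be sketched in plain Python. The pattern below is the one quoted in the validation errors of the extraction output later in this notebook; the helper names are hypothetical and are not part of IndoxMiner, and whether stripping hyphens is appropriate depends on the issuing country's numbering format.

```python
import re

# Pattern quoted in the validation errors of the extraction output.
PASSPORT_NUMBER_PATTERN = re.compile(r"^[A-Z0-9]{6,9}$")


def normalize_passport_number(raw):
    """Upper-case the value and drop spaces and hyphens."""
    return re.sub(r"[\s-]", "", raw.upper())


def is_valid_passport_number(value):
    """Return True if `value` matches the passport-number pattern."""
    return PASSPORT_NUMBER_PATTERN.fullmatch(value) is not None
```

With the value from the sample output, `is_valid_passport_number("3-1053698-2")` is `False` because of the hyphens, while the normalized form `"310536982"` passes.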

### 5. Extractor
The **Extractor** class is the primary component responsible for executing the data extraction process based on the defined schema. It takes the processed document data and applies the extraction logic to retrieve structured information.

- **Key Features**:
  - **Schema-Driven Extraction**: Uses the provided schema to guide the extraction process, ensuring that the right fields are captured from the processed documents.
  - **Data Structuring**: Converts the raw extracted text into structured data formats, making it easier to analyze and manipulate.
  - **Output Conversion**: Provides methods to convert extracted data into various formats, such as Pandas DataFrames, JSON, or CSV, facilitating downstream data processing tasks.


In [1]:
# Import necessary libraries
import os
import pandas as pd
from indoxMiner import NerdTokenApi, DocumentProcessor, ProcessingConfig, Schema, Extractor


## Initialize Indox API

We will start by initializing the Indox LLM API with our API key. Replace `'YOUR_API_KEY'` with your actual Indox API key.


In [3]:
# Initialize Indox LLM
indox_api = NerdTokenApi(api_key='YOUR_API_KEY', model='')  # Replace with your actual API key
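
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from the environment instead; the variable name `INDOX_API_KEY` and the helper `load_api_key` are assumptions made for this example, not part of IndoxMiner:

```python
import os


def load_api_key(env=os.environ, name="INDOX_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    `INDOX_API_KEY` is a hypothetical variable name chosen for this sketch.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key
```

The returned value can then be passed as `api_key=load_api_key()` when constructing `NerdTokenApi`.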

## Define the Passport Images from Local Directory

Next, we will define the directory containing the passport images. We will read all images from this directory.


In [4]:
# Define the directory containing passport images
image_directory = 'data/passport_dataset_jpg/'

# List all image files in the specified directory
passport_images = [os.path.join(image_directory, f) for f in os.listdir(image_directory) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]


## Check Image Paths

Before processing, we need to ensure that the provided image paths are valid and that the files exist.


In [5]:
# Check if the paths exist
for image_path in passport_images:
    if not os.path.exists(image_path):
        raise FileNotFoundError(f"The image file {image_path} does not exist.")
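
`os.listdir` returns entries in arbitrary order, so `passport_images[0]` may point at a different file on another machine. The listing and the check can be combined into one helper that sorts the result; this is a sketch using `pathlib`, and the name `list_images` is hypothetical:

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_images(directory):
    """Return image paths under `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Image directory not found: {root}")
    return sorted(
        str(p) for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling `list_images(image_directory)` would then replace the list comprehension above while keeping the same suffix filter.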


## Initialize OCR Processor
Now, we will set up the OCR processor configuration. Depending on your requirements, you can choose between different OCR models (e.g., PaddleOCR or Tesseract) for text extraction from images.

## Installing OCR Dependencies
If you decide to use Tesseract, you will need to install it along with the Python wrapper. Alternatively, you can use EasyOCR or PaddleOCR, which are easier to set up in Colab.



In [None]:
# For Tesseract OCR
!pip install pytesseract
!sudo apt install -y tesseract-ocr  # -y answers the confirmation prompt, which would otherwise block the cell

# For EasyOCR
!pip install easyocr

# For PaddleOCR
!pip install paddlepaddle paddleocr

## Initialize OCR Configuration
Now, we will initialize the processor with the OCR configuration, specifying which OCR model to use.

In [6]:
# Initialize processor with config for OCR
config = ProcessingConfig(
    ocr_for_images=True,
    ocr_model='easyocr'  # or 'tesseract' or 'paddle'
)
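
If you are unsure which OCR backend is available in your environment, the choice can be made at runtime. The sketch below maps the three model names used in this notebook to the Python modules they are normally imported as and returns the first one that is importable; `pick_ocr_model` is a hypothetical helper, not an IndoxMiner function.

```python
from importlib.util import find_spec

# OCR model names used in the config cell, mapped to the Python module
# each backend is normally imported as.
OCR_BACKENDS = {
    "paddle": "paddleocr",
    "easyocr": "easyocr",
    "tesseract": "pytesseract",
}


def pick_ocr_model(preferred=("paddle", "easyocr", "tesseract"),
                   is_installed=lambda module: find_spec(module) is not None):
    """Return the first OCR model name whose Python package is importable."""
    for name in preferred:
        if is_installed(OCR_BACKENDS[name]):
            return name
    raise RuntimeError("No supported OCR backend is installed.")
```

The result could be passed as `ocr_model=pick_ocr_model()`. Note that an importable `pytesseract` does not guarantee that the Tesseract binary itself is installed.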

## Predefined Schemas in IndoxMiner

IndoxMiner offers several predefined schemas designed for specific document types. Each schema contains fields relevant to the type of document being processed. Here are some of the available predefined schemas:

### Schema.Passport
Extracts details from passports, including:
- **Passport Number**
- **Given Names**
- **Surname**
- **Date of Birth**
- **Place of Birth**
- **Nationality**
- **Gender**
- **Date of Issue**
- **Date of Expiry**
- **Place of Issue**
- **Machine Readable Zone (MRZ)**
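
Because OCR often misreads MRZ characters, the check digits embedded in the MRZ are a useful sanity test on the extracted value. The sketch below implements the check-digit rule from ICAO Doc 9303 (digits keep their value, `A`-`Z` map to 10-35, `<` counts as 0, weights repeat 7, 3, 1); it is an independent illustration, not something the predefined schema is known to run.

```python
def mrz_char_value(ch):
    """Map an MRZ character to its numeric value (ICAO Doc 9303)."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"Invalid MRZ character: {ch!r}")


def mrz_check_digit(field):
    """Compute the check digit of an MRZ field using weights 7, 3, 1."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10
```

In the MRZ of the sample extraction output later in this notebook, the birth date `700302` is followed by `2` and the expiry date `060904` by `5`; both agree with `mrz_check_digit`.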

### Schema.Invoice
Extracts key data from invoices, such as:
- **Invoice Number**
- **Company Name**
- **Itemized Charges**
- **Total Amount**
- **Tax Amount**

### Schema.FlightTicket
Extracts flight information including:
- **Ticket Number**
- **Passenger Name**
- **Flight Details**
- **Class of Travel**

### Schema.BankStatement
Extracts details from bank statements, including:
- **Account Number**
- **Transaction History**
- **Balances**

### Schema.MedicalRecord
Extracts medical details like:
- **Patient Information**
- **Diagnoses**
- **Medications**

### Schema.DriverLicense
Extracts license details including:
- **License Number**
- **Holder Information**
- **Validity Dates**

## Custom Schema Definition

Users can also define their own schemas with the `ExtractorSchema` class, specifying each field's name, description, and data type to tailor extraction to a specific document.

### Example of a Custom Schema
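```python
# Assumption: these names are exported from the package root,
# like Schema and Extractor in the import cell above.
from indoxMiner import ExtractorSchema, Field, FieldType

# Each Field declares a name, a description, and the expected type.
custom_schema = ExtractorSchema(
    fields=[
        Field(name="amount", description="The quantity or hours of service/product", field_type=FieldType.FLOAT),
        Field(name="description", description="Description of the service or product", field_type=FieldType.STRING),
        Field(name="invoice_id", description="ID of the invoice", field_type=FieldType.INTEGER),
    ]
)
```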


## Initialize the Data Extractor

Next, we will initialize the extractor, which will use the Indox LLM and the defined passport schema to extract structured data.


In [7]:
# Initialize the Extractor with the Indox LLM and the passport schema
extractor = Extractor(llm=indox_api, schema=Schema.Passport)



## Document Processing with DocumentProcessor
In this section, we will create an instance of DocumentProcessor to handle document processing tasks using the specified dataset. This process prepares the document data for extraction by performing any necessary pre-processing steps.

### Steps Explained
#### Initialize the Document Processor:
```python
processor = DocumentProcessor(passport_images)
```
The DocumentProcessor is initialized with a dataset (`passport_images`) that contains the images to be processed. This dataset serves as the input source for the data extraction process.

#### Process the Document:
```python
results = processor.process(config)
```
The `process` method is called on the DocumentProcessor instance with the OCR configuration. It runs OCR on each document in `passport_images` and returns the recognized text, which the extractor consumes in the next step.

#### Inspect the Processed Data:
```python
print(results)
```
The `results` variable holds the processed document information, ready for extraction or further steps in the pipeline. This setup allows for the seamless transition from raw documents to structured data, preparing it for schema-driven extraction.

## Extract Data from Passports

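To keep the demo short, the cell below processes only the first passport image: OCR extracts the text, and the extractor then turns it into structured data. To process every image, pass the full `passport_images` list to `DocumentProcessor` instead.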


In [8]:
# Choose the first image from the list
image_to_process = passport_images[0]  # Or choose any other index if needed

# Initialize the processor for the selected image
processor = DocumentProcessor([image_to_process])  # Pass only one image

# Process the document to extract text using OCR
results = processor.process(config)

# Extract structured data using the extractor
extraction_results = extractor.extract(results)

# For example, to see the result:
print(extraction_results)


Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


ExtractionResults(data=[{'items': [{'Passport Number': '3-1053698-2', 'Given Names': 'DANA', 'Surname': 'TALMON', 'Date of Birth': '1970-03-02', 'Place of Birth': 'JERUSALEM ISRAEL', 'Nationality': 'ISRAELI', 'Gender': 'F', 'Date of Issue': '2001-09-05', 'Date of Expiry': '2006-09-04', 'Place of Issue': 'TALMON', 'MRZ': 'P<ISRTAL<TALMON<<DANA<<<<<<<<<<<<<<<<<<<<<<<0000000<<0ISR7003022F06090453<1053698<2<<<34'}]}], raw_responses=['{\n    "Passport Number": "3-1053698-2",\n    "Given Names": "DANA",\n    "Surname": "TALMON",\n    "Date of Birth": "1970-03-02",\n    "Place of Birth": "JERUSALEM ISRAEL",\n    "Nationality": "ISRAELI",\n    "Gender": "F",\n    "Date of Issue": "2001-09-05",\n    "Date of Expiry": "2006-09-04",\n    "Place of Issue": "TALMON",\n    "MRZ": "P<ISRTAL<TALMON<<DANA<<<<<<<<<<<<<<<<<<<<<<<0000000<<0ISR7003022F06090453<1053698<2<<<34"\n}'], validation_errors={0: ['Item 1: Passport Number does not match pattern ^[A-Z0-9]{6,9}$', 'Item 1: Passport Number exceeds maxi


## Extracting and Structuring Data into a DataFrame

In this section, the `Extractor` instance processes the previously prepared document data to extract specific fields as defined by the schema. The extracted data is then converted into a DataFrame format for easier analysis and manipulation.

### Steps Explained

1. **Data Extraction**:
   ```python
   extracted_data = extractor.extract(results)
   ```
   - The `extract` method is called on the `extractor` instance, passing in the processed document data.
   - This method applies the specified schema to the data, extracting fields according to the schema definitions.

2. **Conversion to DataFrame**:
   ```python
   extracted_data_df = extractor.to_dataframe(extracted_data)
   ```
   - The `to_dataframe` method converts `extracted_data` into a DataFrame format.
   - This structured format makes it easy to view, analyze, and manipulate the data using DataFrame operations.

3. **Display the DataFrame**:
   ```python
   print(extracted_data_df)
   ```
   - Displaying `extracted_data_df` shows the extracted data in a tabular form.
   - This final output allows for easy inspection of the extracted fields across multiple documents, facilitating analysis or further processing.
   - By converting to a DataFrame, the extracted information is organized in a format that is convenient for downstream tasks such as visualization, reporting, or exporting to other formats (e.g., CSV, Excel).
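
For a CSV export that does not go through pandas, the nested result can be flattened by hand. The sketch below assumes the `data` shape printed in the extraction output earlier (`data=[{'items': [...]}]`); `flatten_items` and `rows_to_csv` are hypothetical helpers written for this example.

```python
import csv
import io


def flatten_items(data):
    """Turn [{'items': [row, ...]}, ...] into a flat list of row dicts.

    Each row gains a 'document_index' key recording which document it came from.
    """
    rows = []
    for doc_index, document in enumerate(data):
        for item in document.get("items", []):
            rows.append({"document_index": doc_index, **item})
    return rows


def rows_to_csv(rows):
    """Serialize row dicts to CSV text, using the union of keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills in an empty cell when a row lacks one of the columns.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Fields missing from a document end up as empty cells, which keeps incomplete extractions visible in the exported file.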


In [None]:
# Convert extraction results to DataFrame
dataframes = [extractor.to_dataframe(result) for result in extraction_results]

final_df = pd.concat(dataframes, ignore_index=True)
final_df


|   | Passport Number | Given Names | Surname | Date of Birth | Place of Birth | Nationality | Gender | Date of Issue | Date of Expiry | Place of Issue | MRZ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000000 | DANA | TALMON | 1970-03-02 | ISRAEL | ISRAELI | F | 2001-09-05 | 2006-09-04 | JERUSALEM | `P<ISRTAL<TALMON<DANA<<<<<<<<<<<<<<0000000<<01S...` |
| 1 | AA0000O0 | JOAO JOSE | SILVA ONNAMO | 1970-09-06 |  | BRA | M | 2000-12-15 | 2050-12-14 |  | `AA000000<0BRA7009068F5012147<<<00` |
| 2 | SA0009816 | ELLINE MARYSE | GOSSINI | 1979-10-31 | BRAZZAVILLE | CONGOLAISE |  | 2012-08-29 | 2017-08-28 | BRAZZAVILLE | `PSCOGGOSSINIELLINEMARYSE<<<<SA00098160C0G79103...` |
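
Some cells in this output are empty, so a quick completeness check helps before using the data. The sketch below rebuilds two of the rows above by hand (a subset of columns, with empty cells as `None`), parses the expiry date, and counts missing fields per row; the `missing_fields` column name is an assumption made for this example.

```python
import pandas as pd

# Two rows copied from the output above; empty cells are represented as None.
records = [
    {"Passport Number": "AA0000O0", "Surname": "SILVA ONNAMO", "Gender": "M",
     "Place of Birth": None, "Date of Expiry": "2050-12-14"},
    {"Passport Number": "SA0009816", "Surname": "GOSSINI", "Gender": None,
     "Place of Birth": "BRAZZAVILLE", "Date of Expiry": "2017-08-28"},
]
df = pd.DataFrame(records)

# Parse the expiry date; unparseable values become NaT instead of raising.
df["Date of Expiry"] = pd.to_datetime(df["Date of Expiry"], errors="coerce")

# Count missing fields per row so incomplete extractions can be reviewed.
df["missing_fields"] = df.isna().sum(axis=1)
```

Rows with a non-zero `missing_fields` value are candidates for manual review or for a second pass with a different OCR model.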



## Conclusion

In this demo, we successfully extracted structured data from passport images using the Indox API and OCR techniques. This process automates the retrieval of critical information typically found in passports, such as passport number, name, date of birth, and more.

### Future Work

Consider enhancing this demo by adding error handling for OCR failures, integrating logging for better traceability, or extending the extraction schema for additional fields. You can also explore using a different OCR model to compare performance and accuracy.
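
As a starting point for the error handling and logging mentioned above, a batch loop can record failures per image instead of stopping at the first one. This is a minimal sketch; `process_each` and the logger name are hypothetical, and `process_one` stands for whatever per-image function you supply.

```python
import logging

logger = logging.getLogger("passport_demo")


def process_each(paths, process_one):
    """Apply `process_one` to every path, collecting failures instead of stopping.

    `process_one` is any callable taking a single path, for example a wrapper
    around the OCR and extraction steps shown earlier in this notebook.
    """
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = process_one(path)
        except Exception as exc:  # keep going so one bad scan does not abort the batch
            logger.warning("Failed to process %s: %s", path, exc)
            failures[path] = str(exc)
    return results, failures
```

For this notebook, `process_one` could be `lambda p: extractor.extract(DocumentProcessor([p]).process(config))`, which reuses the calls from the extraction cell.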

### Additional Features of IndoxMiner

- **Dynamic Schema Adaptation**: Users can define and adapt schemas dynamically, allowing for easy adjustments to the data extraction process as document formats change.

- **Comprehensive Documentation**: IndoxMiner comes with thorough documentation that helps users understand how to implement features, troubleshoot issues, and optimize their extraction processes.

- **Community Support**: As an open-source project, IndoxMiner benefits from community contributions and support, enabling continuous improvement and feature enhancement.

