## 📋 Table of Contents

This notebook covers the following sections:

1. **Image Classification with LLM (Images + Prompt)**: Leverage GPT-4 Omni for classification tasks through prompt engineering, directly processing images (document scans).
2. **Image Classification with MaaS (SLM) (Images + Prompt)**: Use Phi-3 Vision, a small language model (SLM) served through Models as a Service (MaaS), for classification, directly processing images (document scans).

In [1]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbb-ai-smart-document-processing"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

from utils.ml_logging import get_logger

logger = get_logger()

Directory changed to C:\Users\pablosal\Desktop\gbb-ai-smart-document-processing
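The hardcoded path above only works on one machine. As a hedged alternative, the sketch below walks up from a starting directory until it finds a folder containing a marker directory. The helper name `find_project_root` and the choice of `utils` as the marker are assumptions for illustration, not part of the repository.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "utils") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    `marker` is a directory name assumed to exist only at the project root.
    Returns None when no ancestor contains it.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None
```

In the notebook this might be used as `root = find_project_root(Path.cwd())`, followed by `os.chdir(root)` when `root` is not `None`.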


## Image Classification with LLM (Images + Prompt)

Leverage GPT-4 Omni for classification tasks through prompt engineering, directly processing images (document scans).


In [2]:
import pandas as pd

df_test = pd.read_csv(r"utils\data\scanned\image_data.csv")
df_test = df_test[df_test['set'] == 'test']
filtered_df = df_test.groupby('label').head(10)
filtered_df.head()

| Unnamed: 0 | location | label | set |
|---|---|---|---|
| 70 | utils\data\scanned\test\scientific report\scie... | scientific report | test |
| 71 | utils\data\scanned\test\scientific report\scie... | scientific report | test |
| 72 | utils\data\scanned\test\scientific report\scie... | scientific report | test |
| 73 | utils\data\scanned\test\scientific report\scie... | scientific report | test |
| 74 | utils\data\scanned\test\scientific report\scie... | scientific report | test |
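Note that `groupby('label').head(10)` keeps the first 10 rows of each label in their original order; it is not a random sample. A plain-Python sketch of the same selection, using a hypothetical helper `first_n_per_label` and made-up rows, may make the behavior easier to see:

```python
from collections import defaultdict


def first_n_per_label(rows, n):
    """Keep the first `n` rows for each label, preserving the input order."""
    seen = defaultdict(int)
    kept = []
    for row in rows:
        if seen[row["label"]] < n:
            kept.append(row)
            seen[row["label"]] += 1
    return kept


# Illustrative rows only; the file names are placeholders.
rows = [
    {"location": "a.png", "label": "memo"},
    {"location": "b.png", "label": "form"},
    {"location": "c.png", "label": "memo"},
    {"location": "d.png", "label": "memo"},
]
sample = first_n_per_label(rows, n=2)
```

With `n=2`, the third `memo` row is dropped while the single `form` row is kept.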


In [3]:
CLASSIFICATION_PROMPT = """
### Task: Document Classification

#### Inputs:
- **IMAGE**: An image of the document to be classified.

#### Instructions:
You are required to classify the document based on the provided scanned text. The document should be categorized into the most appropriate category from the list below. If the document does not fit neatly into one category, choose the category that best matches the majority of its characteristics.

1. **letter**:
    - **Description**: A written or printed message addressed to a specific person or organization.
    - **Common Characteristics**:
        - Contains salutations and closings.
        - Often addressed to a specific person or entity.
        - Includes date, sender's and recipient's addresses.

2. **form**:
    - **Description**: A document with blank fields for the user to fill out with specific information.
    - **Common Characteristics**:
        - Contains predefined fields and labels.
        - Includes spaces for user input.
        - Often used for data collection, applications, and surveys.

3. **email**:
    - **Description**: An electronic message exchanged between people using electronic devices.
    - **Common Characteristics**:
        - Contains email headers such as "From", "To", "Subject", and "Date".
        - Includes conversational text.
        - May include attachments or references to other documents.

4. **handwritten**:
    - **Description**: Any document written manually by hand.
    - **Common Characteristics**:
        - Contains handwritten text.
        - May include various writing styles and penmanship.
        - Often lacks formal structure.

5. **advertisement**:
    - **Description**: A public notice promoting a product, service, or event.
    - **Common Characteristics**:
        - Contains promotional language and visuals.
        - Includes information about a product, service, or event.
        - Often designed to attract attention.

6. **scientific report**:
    - **Description**: A detailed account of a scientific study or experiment.
    - **Common Characteristics**:
        - Includes sections such as introduction, methods, results, and conclusion.
        - Contains data, graphs, and references.
        - Written in a formal and structured manner.

7. **scientific publication**:
    - **Description**: An article published in a scientific journal.
    - **Common Characteristics**:
        - Includes abstract, introduction, methodology, results, and discussion.
        - Contains citations and references.
        - Peer-reviewed and follows academic standards.

8. **specification**:
    - **Description**: A detailed description of the requirements, design, or performance of a product or system.
    - **Common Characteristics**:
        - Includes technical details and standards.
        - Structured format with headings and subheadings.
        - Often used in engineering and manufacturing.

9. **file folder**:
    - **Description**: A document that serves as a cover or holder for other documents.
    - **Common Characteristics**:
        - Contains a label or title indicating the contents.
        - Often used for organizing and storing multiple related documents.
        - May include a table of contents or summary.

10. **news article**:
    - **Description**: A written piece reporting on current events or topics of interest.
    - **Common Characteristics**:
        - Contains headlines and bylines.
        - Includes factual reporting and quotes.
        - Published in newspapers, magazines, or online platforms.

11. **budget**:
    - **Description**: A financial plan outlining expected income and expenses.
    - **Common Characteristics**:
        - Includes numerical data and tables.
        - Details various categories of income and expenditure.
        - Often used for financial planning and analysis.

12. **invoice**:
    - **Description**: A commercial document issued by a seller to a buyer, indicating the products, quantities, and agreed prices for products or services.
    - **Common Characteristics**:
        - Contains the term "Invoice".
        - Includes seller and buyer information, invoice number, date, and payment terms.
        - Lists products or services provided, quantities, unit prices, and total amount due.
        - May include tax details and payment instructions.

13. **presentation**:
    - **Description**: A document used to communicate information visually, often as slides.
    - **Common Characteristics**:
        - Includes slides with text, images, and graphs.
        - Organized in a structured format.
        - Often used in meetings and lectures.

14. **questionnaire**:
    - **Description**: A set of questions designed to gather information from respondents.
    - **Common Characteristics**:
        - Contains multiple questions.
        - May include multiple-choice, open-ended, or scale-based questions.
        - Used for surveys and research.

15. **resume**:
    - **Description**: A document summarizing an individual's education, work experience, and skills.
    - **Common Characteristics**:
        - Includes sections such as contact information, work experience, education, and skills.
        - Written in a concise and structured format.
        - Used for job applications.

16. **memo**:
    - **Description**: A brief written message used for internal communication within an organization.
    - **Common Characteristics**:
        - Contains a header with the recipient, sender, date, and subject.
        - Includes concise information or instructions.
        - Often used for announcements and updates.

#### Steps to Classify the Document:
1. Carefully examine the provided image of the scanned document.
2. Identify key elements and details within the document that match the descriptions of the categories listed above.
3. Select the category that best fits the document based on its content and purpose. If the document seems to fit into more than one category, choose the one that matches the majority of its characteristics.
4. Provide the exact name of the category (e.g., letter, form, email) based on your analysis of the document content.

#### Output:
- Return only the chosen category as a plain string without any additional characters. Remember the categories are: "letter", "form", "email", "handwritten", "advertisement", "scientific report", "scientific publication", "specification", "file folder", "news article", "budget", "invoice", "presentation", "questionnaire", "resume", "memo". Do not add any additional characters."""
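Even with an explicit output instruction, a model may return extra quotes, casing differences, or a label outside the list. The following is a minimal sketch of a post-processing step, assuming the raw output is a string; `normalize_prediction` and the `"hallucination"` fallback value are illustrative names, not functions from this repository.

```python
from typing import Optional

CATEGORIES = (
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
)


def normalize_prediction(raw: Optional[str]) -> str:
    """Map raw model output to a known category, or to 'hallucination'."""
    if raw is None:
        return "hallucination"
    # Drop surrounding whitespace, quotes, backticks and trailing periods.
    cleaned = raw.strip().strip("\"'`.").strip().lower()
    return cleaned if cleaned in CATEGORIES else "hallucination"
```

Applying such a step before evaluation would count a response like `"Memo".` as `memo` instead of as an invalid label.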

In [4]:
from src.aoai.azure_openai import AzureOpenAIManager

az_manager = AzureOpenAIManager(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    chat_model_name=os.getenv("DEPLOYMENT_ID"),
    api_key=os.getenv("OPENAI_API_KEY"),
    api_version=os.getenv("DEPLOYMENT_VERSION"),
)
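`os.getenv` returns `None` for unset variables, so a missing setting would only surface later as a failed request. A small pre-check, sketched below, can report the problem up front. The variable names are the ones used in the cell above; the helper `missing_env_vars` is hypothetical.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "DEPLOYMENT_ID",
    "OPENAI_API_KEY",
    "DEPLOYMENT_VERSION",
)


def missing_env_vars(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in `env`, in input order."""
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars(REQUIRED_VARS)` before constructing the manager and raising if the list is non-empty is one way to fail early with a clear message.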

In [5]:
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
from tabulate import tabulate
import time

y_true_list = []
y_pred_list = []
list_lock = threading.Lock()
error_count = 0
successful_times = []


def process_row(index, row):
    """
    Process a single row to generate chat response and return the true and predicted labels.

    Parameters:
    index (int): The index of the row.
    row (pd.Series): The row data.

    Returns:
    tuple: A tuple containing the true label and predicted label.
    """
    global error_count
    start_time = time.time()
    try:
        full_path = row["location"]
        document_type = row["label"]

        logger.info(
            f"Processing row {index}: Full Path: {full_path}, Document Type: {document_type}"
        )

        result_4o = az_manager.generate_chat_response(
            system_message_content="""You are an AI assistant specializing in the classification of documents derived from images.
                                      Your task is to analyze the details within the images and apply deep reasoning to accurately categorize the documents.""",
            query=CLASSIFICATION_PROMPT,
            image_paths=[full_path],
            conversation_history=[],
            temperature=0,
        )

        if result_4o is None or result_4o[0] is None:
            logger.warning(
                f"Result is None for index {index}. File location: {full_path}"
            )
            # Guard the shared counter: worker threads update it concurrently.
            with list_lock:
                error_count += 1
            return (None, None)
        else:
            end_time = time.time()
            with list_lock:
                successful_times.append(end_time - start_time)
            logger.info(
                f"Successfully processed row {index} in {end_time - start_time:.4f} seconds."
            )
            return (document_type, result_4o[0])
    except Exception as e:
        logger.error(f"Error processing row {index}: {e}")
        with list_lock:
            error_count += 1
        return (None, None)


with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {
        executor.submit(process_row, index, row): index
        for index, row in filtered_df.iterrows()
    }

    for future in as_completed(futures):
        true_label, pred_label = future.result()
        if true_label is not None and pred_label is not None:
            with list_lock:
                y_true_list.append(true_label)
                y_pred_list.append(pred_label)

# Convert the lists to numpy arrays
y_true = np.array(y_true_list)
y_pred = np.array(y_pred_list)

2024-08-14 10:35:45,535 - micro - MainProcess - INFO     Processing row 70: Full Path: utils\data\scanned\test\scientific report\scientific report_0.png, Document Type: scientific report (3751155626.py:process_row:31)
2024-08-14 10:35:45,551 - micro - MainProcess - INFO     Processing row 71: Full Path: utils\data\scanned\test\scientific report\scientific report_1.png, Document Type: scientific report (3751155626.py:process_row:31)
2024-08-14 10:35:45,555 - micro - MainProcess - INFO     Processing row 72: Full Path: utils\data\scanned\test\scientific report\scientific report_2.png, Document Type: scientific report (3751155626.py:process_row:31)
2024-08-14 10:35:45,559 - micro - MainProcess - INFO     Processing row 73: Full Path: utils\data\scanned\test\scientific report\scientific report_3.png, Document Type: scientific report (3751155626.py:process_row:31)
2024-08-14 10:35:45,584 - micro - MainProcess - INFO     Processing row 74: Full Path: utils\data\scanned\test\scientific report
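The cell above shares a counter and a list across worker threads through module-level globals. One alternative, sketched here under the assumption that the per-row work can be wrapped in a single callable, is to keep the shared state in a small object that owns its lock. `RunStats` and `run_all` are hypothetical names, not part of the repository.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RunStats:
    """Collect per-call durations and an error count from worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.durations = []
        self.errors = 0

    def record_success(self, seconds):
        with self._lock:
            self.durations.append(seconds)

    def record_error(self):
        with self._lock:
            self.errors += 1


def run_all(task, items, stats, max_workers=5):
    """Apply `task` to each item in a thread pool; failed calls yield None."""

    def guarded(item):
        start = time.perf_counter()
        try:
            result = task(item)
        except Exception:
            stats.record_error()
            return None
        stats.record_success(time.perf_counter() - start)
        return result

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, unlike as_completed.
        return list(executor.map(guarded, items))
```

Because `executor.map` preserves input order, the returned list lines up with the input rows, which makes pairing true and predicted labels straightforward.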

In [6]:
from utils.time import calculate_statistics

calculate_statistics(successful_times, error_count)

+----------------------------------------+---------+
| Statistic                              |   Value |
+========================================+=========+
| Average time per run (seconds)         |  2.2506 |
+----------------------------------------+---------+
| Median time per run (seconds)          |  2.0036 |
+----------------------------------------+---------+
| Minimum time per run (seconds)         |  1.2941 |
+----------------------------------------+---------+
| Maximum time per run (seconds)         |  9.5673 |
+----------------------------------------+---------+
| 95th percentile time per run (seconds) |  3.4515 |
+----------------------------------------+---------+
| 99th percentile time per run (seconds) |  5.2066 |
+----------------------------------------+---------+
| Number of errors                       |  0      |
+----------------------------------------+---------+
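The implementation of `calculate_statistics` is not shown in this notebook. As a rough sketch of how the same figures could be derived from a list of durations, the code below uses only the standard library. `summarize_times` and `percentile` are hypothetical helpers, and the percentile uses linear interpolation between neighboring ranks, which may differ from the method used in `utils.time`.

```python
import statistics


def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize_times(times):
    """Summarize a non-empty list of durations in seconds."""
    return {
        "mean": statistics.fmean(times),
        "median": statistics.median(times),
        "min": min(times),
        "max": max(times),
        "p95": percentile(times, 95),
        "p99": percentile(times, 99),
    }
```

In the table above, the maximum (9.5673 s) sits well above the 99th percentile (5.2066 s), which suggests a single slow outlier rather than generally high latency.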


In [7]:
from src.evaluations import ml
import numpy as np

labels = np.array(
    [
        "letter",
        "form",
        "email",
        "handwritten",
        "advertisement",
        "scientific report",
        "scientific publication",
        "specification",
        "file folder",
        "news article",
        "budget",
        "invoice",
        "presentation",
        "questionnaire",
        "resume",
        "memo",
    ]
)

# Evaluate the GPT-4 Omni predictions against the true labels
metrics, conf_matrix, class_report = ml.evaluate_model(
    y_true, y_pred, labels, show_visualization=True
)

2024-08-14 10:37:03,778 - micro - MainProcess - INFO     Evaluating model performance... (ml.py:evaluate_model:34)
2024-08-14 10:37:03,785 - micro - MainProcess - INFO     Invalid predictions detected and marked as 'hallucination':
  Hallucination  Count
0     inventory      1
1           fax      1 (ml.py:evaluate_model:49)
2024-08-14 10:37:03,793 - micro - MainProcess - INFO     True labels corresponding to hallucinations:
  True Label  Count
0       form      2 (ml.py:evaluate_model:60)
2024-08-14 10:37:03,796 - micro - MainProcess - INFO     Length of y_true_filtered: 158 (ml.py:evaluate_model:66)
2024-08-14 10:37:03,802 - micro - MainProcess - INFO     Length of y_pred_filtered: 158 (ml.py:evaluate_model:67)
2024-08-14 10:37:03,878 - micro - MainProcess - INFO     Accuracy: 0.7215 (ml.py:evaluate_model:90)
2024-08-14 10:37:03,881 - micro - MainProcess - INFO     Precision: 0.7656 (ml.py:evaluate_model:91)
2024-08-14 10:37:03,882 - micro - MainProcess - INFO     Recall: 0.7215 (ml.
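The log reports two invalid predictions (`inventory` and `fax`) and 158 remaining pairs, which is consistent with 16 categories × 10 samples = 160 rows; an accuracy of 0.7215 over 158 pairs corresponds to 114 correct predictions. The internals of `ml.evaluate_model` are not shown here, so the following is only a sketch of the filtering step the log appears to describe, with `filter_and_score` as a hypothetical helper:

```python
from collections import Counter


def filter_and_score(y_true, y_pred, labels):
    """Drop pairs whose prediction is not a known label, then score the rest.

    Returns (kept_pairs, invalid_prediction_counts, accuracy_on_kept_pairs).
    """
    allowed = set(labels)
    kept = [(t, p) for t, p in zip(y_true, y_pred) if p in allowed]
    invalid = Counter(p for p in y_pred if p not in allowed)
    accuracy = sum(t == p for t, p in kept) / len(kept) if kept else 0.0
    return kept, invalid, accuracy
```

Note that excluding invalid predictions from the denominator makes accuracy slightly more favorable than counting them as errors; with the figures above, 114 / 160 would be 0.7125.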

## Image Classification with MaaS (SLM) (Images + Prompt)

Use Phi-3 Vision, a small language model (SLM) served through Models as a Service (MaaS), for classification, directly processing images (document scans).

In [8]:
from src.paas.phi3 import phi_3_vision_inference

In [9]:
CLASSIFICATION_PROMPT = """
### Task: Document Classification

#### Inputs:
- **IMAGE**: An image of the document to be classified.

#### Instructions:
You are required to classify the document based on the provided scanned text. The document should be categorized into the most appropriate category from the list below. If the document does not fit neatly into one category, choose the category that best matches the majority of its characteristics.

1. **letter**:
    - **Description**: A written or printed message addressed to a specific person or organization.
    - **Common Characteristics**:
        - Contains salutations and closings.
        - Often addressed to a specific person or entity.
        - Includes date, sender's and recipient's addresses.

2. **form**:
    - **Description**: A document with blank fields for the user to fill out with specific information.
    - **Common Characteristics**:
        - Contains predefined fields and labels.
        - Includes spaces for user input.
        - Often used for data collection, applications, and surveys.

3. **email**:
    - **Description**: An electronic message exchanged between people using electronic devices.
    - **Common Characteristics**:
        - Contains email headers such as "From", "To", "Subject", and "Date".
        - Includes conversational text.
        - May include attachments or references to other documents.

4. **handwritten**:
    - **Description**: Any document written manually by hand.
    - **Common Characteristics**:
        - Contains handwritten text.
        - May include various writing styles and penmanship.
        - Often lacks formal structure.

5. **advertisement**:
    - **Description**: A public notice promoting a product, service, or event.
    - **Common Characteristics**:
        - Contains promotional language and visuals.
        - Includes information about a product, service, or event.
        - Often designed to attract attention.

6. **scientific report**:
    - **Description**: A detailed account of a scientific study or experiment.
    - **Common Characteristics**:
        - Includes sections such as introduction, methods, results, and conclusion.
        - Contains data, graphs, and references.
        - Written in a formal and structured manner.

7. **scientific publication**:
    - **Description**: An article published in a scientific journal.
    - **Common Characteristics**:
        - Includes abstract, introduction, methodology, results, and discussion.
        - Contains citations and references.
        - Peer-reviewed and follows academic standards.

8. **specification**:
    - **Description**: A detailed description of the requirements, design, or performance of a product or system.
    - **Common Characteristics**:
        - Includes technical details and standards.
        - Structured format with headings and subheadings.
        - Often used in engineering and manufacturing.

9. **file folder**:
    - **Description**: A document that serves as a cover or holder for other documents.
    - **Common Characteristics**:
        - Contains a label or title indicating the contents.
        - Often used for organizing and storing multiple related documents.
        - May include a table of contents or summary.

10. **news article**:
    - **Description**: A written piece reporting on current events or topics of interest.
    - **Common Characteristics**:
        - Contains headlines and bylines.
        - Includes factual reporting and quotes.
        - Published in newspapers, magazines, or online platforms.

11. **budget**:
    - **Description**: A financial plan outlining expected income and expenses.
    - **Common Characteristics**:
        - Includes numerical data and tables.
        - Details various categories of income and expenditure.
        - Often used for financial planning and analysis.

12. **invoice**:
    - **Description**: A commercial document issued by a seller to a buyer, indicating the products, quantities, and agreed prices for products or services.
    - **Common Characteristics**:
        - Contains the term "Invoice".
        - Includes seller and buyer information, invoice number, date, and payment terms.
        - Lists products or services provided, quantities, unit prices, and total amount due.
        - May include tax details and payment instructions.

13. **presentation**:
    - **Description**: A document used to communicate information visually, often as slides.
    - **Common Characteristics**:
        - Includes slides with text, images, and graphs.
        - Organized in a structured format.
        - Often used in meetings and lectures.

14. **questionnaire**:
    - **Description**: A set of questions designed to gather information from respondents.
    - **Common Characteristics**:
        - Contains multiple questions.
        - May include multiple-choice, open-ended, or scale-based questions.
        - Used for surveys and research.

15. **resume**:
    - **Description**: A document summarizing an individual's education, work experience, and skills.
    - **Common Characteristics**:
        - Includes sections such as contact information, work experience, education, and skills.
        - Written in a concise and structured format.
        - Used for job applications.

16. **memo**:
    - **Description**: A brief written message used for internal communication within an organization.
    - **Common Characteristics**:
        - Contains a header with the recipient, sender, date, and subject.
        - Includes concise information or instructions.
        - Often used for announcements and updates.

#### Steps to Classify the Document:
1. Carefully examine the provided image of the scanned document.
2. Identify key elements and details within the document that match the descriptions of the categories listed above.
3. Select the category that best fits the document based on its content and purpose. If the document seems to fit into more than one category, choose the one that matches the majority of its characteristics.
4. Provide the exact name of the category (e.g., letter, form, email) based on your analysis of the document content.

#### Output:
- Return only the chosen category as a plain string without any additional characters. Remember the categories are: "letter", "form", "email", "handwritten", "advertisement", "scientific report", "scientific publication", "specification", "file folder", "news article", "budget", "invoice", "presentation", "questionnaire", "resume", "memo". Do not add any additional characters.
"""

In [10]:
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
from tabulate import tabulate
import time

y_true_list = []
y_pred_list = []
list_lock = threading.Lock()
error_count = 0
successful_times = []


def process_row(index, row):
    """
    Process a single row to generate chat response and return the true and predicted labels.

    Parameters:
    index (int): The index of the row.
    row (pd.Series): The row data.

    Returns:
    tuple: A tuple containing the true label and predicted label.
    """
    global error_count
    start_time = time.time()
    try:
        full_path = row["location"]
        document_type = row["label"]

        logger.info(
            f"Processing row {index}: Full Path: {full_path}, Document Type: {document_type}"
        )

        result_phi = phi_3_vision_inference(
            prompt=CLASSIFICATION_PROMPT, image_path=full_path
        )

        if result_phi is None:
            logger.warning(
                f"Result is None for index {index}. File location: {full_path}"
            )
            # Guard the shared counter: worker threads update it concurrently.
            with list_lock:
                error_count += 1
            return (None, None)
        else:
            cleaned_result_phi = result_phi["output"]
            end_time = time.time()
            with list_lock:
                successful_times.append(end_time - start_time)
            logger.info(
                f"Successfully processed row {index} in {end_time - start_time:.4f} seconds."
            )
            return (document_type, cleaned_result_phi)
    except Exception as e:
        logger.error(f"Error processing row {index}: {e}")
        with list_lock:
            error_count += 1
        return (None, None)


with ThreadPoolExecutor(max_workers=2) as executor:
    futures = {
        executor.submit(process_row, index, row): index
        for index, row in filtered_df.iterrows()  # same 10-per-label sample as the GPT-4 Omni run
    }

    for future in as_completed(futures):
        true_label, pred_label = future.result()
        if true_label is not None and pred_label is not None:
            with list_lock:
                y_true_list.append(true_label)
                y_pred_list.append(pred_label)

# Convert the lists to numpy arrays
y_true = np.array(y_true_list)
y_pred = np.array(y_pred_list)

2024-08-14 10:37:08,574 - micro - MainProcess - INFO     Processing row 70: Full Path: utils\data\scanned\test\scientific report\scientific report_0.png, Document Type: scientific report (3590940605.py:process_row:31)


2024-08-14 10:37:08,582 - micro - MainProcess - INFO     Processing row 71: Full Path: utils\data\scanned\test\scientific report\scientific report_1.png, Document Type: scientific report (3590940605.py:process_row:31)
2024-08-14 10:37:08,790 - micro - MainProcess - ERROR    Error processing row 70: <urlopen error [Errno 11001] getaddrinfo failed> (3590940605.py:process_row:54)
2024-08-14 10:37:08,791 - micro - MainProcess - ERROR    Error processing row 71: <urlopen error [Errno 11001] getaddrinfo failed> (3590940605.py:process_row:54)
2024-08-14 10:37:08,793 - micro - MainProcess - INFO     Processing row 72: Full Path: utils\data\scanned\test\scientific report\scientific report_2.png, Document Type: scientific report (3590940605.py:process_row:31)
2024-08-14 10:37:08,798 - micro - MainProcess - INFO     Processing row 73: Full Path: utils\data\scanned\test\scientific report\scientific report_3.png, Document Type: scientific report (3590940605.py:process_row:31)
2024-08-14 10:37:08,80
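The `[Errno 11001] getaddrinfo failed` errors mean the endpoint hostname could not be resolved, which usually points to a misconfigured endpoint URL or a network/DNS problem rather than to the model itself. A quick pre-flight check, sketched below, can catch this before submitting every row. `endpoint_resolves` is a hypothetical helper, and the `resolver` parameter exists only so the check can be exercised without network access.

```python
import socket
from urllib.parse import urlparse


def endpoint_resolves(url, resolver=socket.getaddrinfo):
    """Return True if the hostname in `url` can be resolved, else False."""
    host = urlparse(url).hostname
    if host is None:
        # No scheme/host could be parsed from the URL.
        return False
    try:
        resolver(host, None)
    except socket.gaierror:
        return False
    return True
```

Running this once against the configured Phi-3 Vision endpoint URL, and stopping early when it returns `False`, would avoid logging one identical error per row.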

In [11]:
from utils.time import calculate_statistics

calculate_statistics(successful_times, error_count)

2024-08-14 10:37:14,469 - micro - MainProcess - ERROR    No successful execution times recorded. (time.py:calculate_statistics:84)


In [12]:
y_true

array([], dtype=float64)

In [13]:
y_pred

array([], dtype=float64)
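Both arrays are empty because every request failed (`np.array([])` on an empty list defaults to `float64`), so there is nothing to evaluate. A small guard, sketched here with the hypothetical helper `evaluation_blocker`, could make that explicit before calling the evaluation function:

```python
def evaluation_blocker(y_true, y_pred):
    """Return a reason string if evaluation should be skipped, else None."""
    if len(y_true) == 0 or len(y_pred) == 0:
        return "no successful predictions to evaluate"
    if len(y_true) != len(y_pred):
        return f"length mismatch: {len(y_true)} true vs {len(y_pred)} predicted"
    return None
```

For example, `reason = evaluation_blocker(y_true, y_pred)` followed by logging `reason` and skipping the evaluation when it is not `None`.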

In [14]:
from src.evaluations import ml

# Evaluate the Phi-3 Vision predictions against the true labels
metrics, conf_matrix, class_report = ml.evaluate_model(
    y_true, y_pred, labels, show_visualization=True
)

## Conclusion

Leveraging GPT-4 Omni's multimodal input provides better out-of-the-box scores across all classes and lower latency than the previous method, which ran Document Intelligence OCR and passed the extracted text to GPT-4 Omni. Some calculations could be refined with additional layers and performance engineering. So far, the multimodal method with GPT-4 Omni appears to perform better in both quality and latency. In contrast, Phi-3 Vision requires fine-tuning to achieve similar performance, especially with the calculations. We will be addressing the fine-tuning here.