# Problem Statement
Develop a data structurizer tool capable of extracting specific, relevant information from a variety of unstructured documents, including invoices, receipts, bills, medical prescriptions, and general documents.

The extracted data should be organized into a structured DataFrame format, providing a clean and usable dataset for further analysis or applications.

## Expected Output

1. A working notebook, on any platform, written in Python.
2. The final output should be a well-structured DataFrame containing the extracted data.
3. Each DataFrame should represent a single document, and each column should correspond to a specific data element (e.g., invoice number, date, total amount, item descriptions).
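
To make the target shape concrete, here is a minimal stdlib-only sketch of how one extracted record could be flattened into a single row of named columns. The record contents and the `flatten` helper are illustrative assumptions, not part of the assignment; the dotted column names mimic the default separator that `pandas.json_normalize` uses for nested dictionaries.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the parent key as a prefix
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat

# Hypothetical extraction result for one invoice
invoice = {
    "invoice_number": "INV-001",
    "date": "2024-01-15",
    "total_amount": 250.0,
    "vendor": {"name": "Acme Ltd", "city": "Pune"},
}

row = flatten(invoice)
print(list(row))
```

Passing `[row]` to `pd.DataFrame` would then give a one-row DataFrame with one column per data element.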

## Additional Considerations
1. Document Format: Consider the variety of document formats you'll be dealing with (e.g., PDF, Word, images) and choose appropriate tools or techniques for each.
2. Prompt Engineering: Craft effective prompts to guide the LLM in extracting the desired information.
3. Model Selection: Choose an LLM that is suitable for the task and the available resources.


## Evaluation Criteria
1. Data Quality and Accuracy:
> * Completeness: Were all relevant data elements extracted from the documents?
> * Accuracy: Is the extracted data correct and consistent with the original documents?
> * Consistency: Are there any inconsistencies or contradictions within the extracted data?

2. Data Structurization:
> * Data Organization: Is the data organized in a clear and understandable manner within the DataFrame?
> * Data Quality: Are there any missing values, inconsistencies, or errors in the structured data?

3. LLM Usage:
> * Prompt Engineering: Were the prompts used to guide the LLM effective in extracting the desired information?
> * Model Selection: Was the chosen LLM appropriate for the task?
> * Fine-Tuning: Was the LLM effectively fine-tuned on a relevant dataset?

4. Code Quality:
> * Readability: Is the code well-structured, commented, and easy to understand?
> * Efficiency: Is the code efficient in terms of computational resources and execution time?
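
The completeness and missing-value criteria above can be made measurable. A minimal sketch, assuming a hypothetical list of expected fields per document type (`EXPECTED_FIELDS` and `completeness` are illustrative names, not part of the task statement):

```python
# Hypothetical fields we expect for an invoice; adjust per document type
EXPECTED_FIELDS = ["invoice_number", "date", "total_amount", "vendor_name"]

def completeness(record, expected=EXPECTED_FIELDS):
    """Return (score, missing): the share of expected fields that are filled in."""
    missing = [
        field for field in expected
        if record.get(field) in (None, "", [], {})
    ]
    score = (len(expected) - len(missing)) / len(expected)
    return score, missing

# Hypothetical extraction with one empty and one absent field
extracted = {"invoice_number": "INV-001", "date": "", "total_amount": 250.0}
score, missing = completeness(extracted)
print(f"completeness={score:.2f}, missing={missing}")
```

Accuracy, unlike completeness, cannot be checked this way; it needs hand-labelled reference values to compare against.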

# Solution Overview

## Installing Required Libraries

In [None]:
!sudo apt-get install -q tesseract-ocr

Reading package lists...
Building dependency tree...
Reading state information...
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [None]:
!sudo apt-get install -q tesseract-ocr-eng

Reading package lists...
Building dependency tree...
Reading state information...
tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1.1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [None]:
!pip install -qU langchain langchain-community PyMuPDF unstructured[all-docs] pytesseract nltk langchain-openai docx2txt

## Importing Required Libraries

In [None]:
# Import required modules
import os
import fitz  # PyMuPDF
import nltk
import pandas as pd
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders.image import UnstructuredImageLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
model_name = "chatgpt-4o-latest"  # alternative: "gpt-4o-mini-2024-07-18"
openai_api_key = os.environ.get("OPENAI_API_KEY", "")  # read the key from the environment, or paste it here
input_folder_path = '/content/input/'
csv_path = '/content/csv/'

## Prompt

In [None]:
# PROMPT TEMPLATE - INPUT PARAMETER: data
PROMPT_TEMPLATE = """
**Context:**
You are tasked with extracting important information from a given text and creating a JSON file where each column corresponds to a specific data element along with metadata.

**Roles:**
Assume the role of a data extraction specialist who is adept at identifying key pieces of information and organizing them systematically.

**Instructions:**
1. Read the given text thoroughly to understand the context and the key pieces of information.
2. Identify and extract all the information that should be included in the JSON file.
3. Organize the extracted information into specific data elements, ensuring that each element has a corresponding key.
4. Include metadata that captures additional contextual information or attributes relevant to the extracted data.
5. Format the extracted information and metadata into a JSON structure such as {{\'key1\':\'value1\',\'key2\':\'value2\',.........,\'metadata\':\'actual document information\'}}.
6. Ensure that the JSON file is well-structured and follows the specified format.

**Constraints:**
- Ensure all extracted data is accurate and relevant.
- Follow the JSON format strictly without any deviations.
- Include all necessary metadata to provide context for the extracted information.

**Examples:**
- Given text: "John Doe, a software engineer, joined XYZ Corp in 2020. His email is john.doe@xyz.com."
  - Extracted JSON: {{\'name\':\'John Doe\',\'occupation\':\'software engineer\',\'company\':\'XYZ Corp\',\'year_of_joining\':\'2020\',\'email\':\'john.doe@xyz.com\',\'metadata\':\'Extracted from a personnel record\'}}

**Output Format:**
- The output should be a JSON file formatted as follows:
  ```
  {{
    \'key1\':\'value1\',
    \'key2\':\'value2\',
    ...
    \'metadata\':\'actual document information\'
  }}
  ```

**Evaluation Criteria:**
- Completeness: All key pieces of information are extracted and included in the JSON file.
- Accuracy: Extracted data should be accurate and free from errors.
- Structure: The JSON file should be well-organized and follow the specified format.
- Relevance: Only the important and relevant information should be included.

**Actual_problem:**
{data}"""
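
The doubled braces in the template are intentional. With the default f-string style template format, single braces mark input variables, so literal JSON braces have to be written as `{{` and `}}`. A minimal sketch using plain `str.format`, which applies the same escaping rule (the short `demo_template` is illustrative, not the prompt above):

```python
# Illustrative mini-template: {{ }} are literal braces, {data} is a variable
demo_template = "Return JSON like {{'key1': 'value1'}}.\nText: {data}"

# Formatting fills in {data} and collapses each doubled brace to a single one
rendered = demo_template.format(data="Invoice INV-001, total 250.00")
print(rendered)
```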

In [None]:
# Build a simple LangChain pipeline: prompt -> chat model -> JSON parser
model = ChatOpenAI(model=model_name, temperature=0, openai_api_key=openai_api_key)
parser = JsonOutputParser()
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["data"],
    # NOTE: PROMPT_TEMPLATE has no {format_instructions} placeholder,
    # so this partial variable is not rendered into the prompt.
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
chain = prompt | model | parser
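
The last stage of the chain turns the model's text reply into a Python dictionary. As a rough, stdlib-only illustration of what that step involves (this is not LangChain's implementation; `parse_reply` is a hypothetical helper), a reply that may be wrapped in a Markdown code fence can be parsed like this:

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to avoid nesting fences

def parse_reply(reply):
    """Parse a JSON object from an LLM reply, with or without a code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, reply, flags=re.DOTALL)
    # Use the fenced payload when present, otherwise the whole reply
    payload = match.group(1) if match else reply
    return json.loads(payload.strip())

# Hypothetical model reply wrapped in a json code fence
reply = FENCE + 'json\n{"invoice_number": "INV-001", "total_amount": 250.0}\n' + FENCE
parsed = parse_reply(reply)
print(parsed["invoice_number"])
```

Note that `json.loads` requires double-quoted keys and strings, so in this sketch a reply that copies the single-quoted style of the prompt examples would raise `json.JSONDecodeError`.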

## Data Extraction

In [None]:
# Define a dictionary to map file extensions to loaders
LOADERS = {
    ('.pdf', '.PDF'): PyMuPDFLoader,
    ('.png', '.jpg', '.jpeg'): UnstructuredImageLoader,
    ('.doc', '.docx'): Docx2txtLoader
}

folder_path = input_folder_path

# Iterate over files in the directory
for filename in os.listdir(folder_path):
    file_extension = os.path.splitext(filename)[1].lower()
    loader_class = None

    # Determine the appropriate loader
    for ext_tuple, loader in LOADERS.items():
        if file_extension in ext_tuple:
            loader_class = loader
            break

    if loader_class is None:
        # Handle files without a known loader
        loader_class = UnstructuredImageLoader

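    try:
        file_path = os.path.join(folder_path, filename)
        loader = loader_class(file_path)
        docs = loader.load()

        # Send only the extracted text (not the Document reprs) to the LLM
        text = "\n\n".join(doc.page_content for doc in docs)
        json_data = chain.invoke({"data": text})
        df = pd.json_normalize(json_data)

        # Name the CSV after the input file and make sure the output folder exists
        original_filename = os.path.basename(file_path)
        base_filename = os.path.splitext(original_filename)[0]
        os.makedirs(csv_path, exist_ok=True)
        df.to_csv(os.path.join(csv_path, f'{base_filename}.csv'), index=False)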

        print(f"Data is extracted from {original_filename} and written to {base_filename}.csv")
        print("...."*30)
        display(df)
        print(" ")
        print("===="*30)

    except Exception as e:
        print(f"An error occurred with file {filename}: {e}")
        continue
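
The loader lookup above can also be written as a flat dictionary keyed by a single lowercase extension, which removes the inner loop. A minimal sketch under that assumption; the string values stand in for the loader classes so the snippet runs without LangChain installed, and `pick_loader` is a hypothetical helper. Unlike the loop above, it returns `None` for unknown extensions instead of falling back to the image loader.

```python
import os

# Placeholder strings stand in for the loader classes used above
LOADER_BY_EXT = {
    ".pdf": "PyMuPDFLoader",
    ".png": "UnstructuredImageLoader",
    ".jpg": "UnstructuredImageLoader",
    ".jpeg": "UnstructuredImageLoader",
    ".docx": "Docx2txtLoader",
}

def pick_loader(filename, loaders=LOADER_BY_EXT):
    """Return the loader registered for the file's extension, or None."""
    extension = os.path.splitext(filename)[1].lower()
    return loaders.get(extension)

for name in ["bill.PDF", "receipt.jpeg", "notes.txt"]:
    print(name, "->", pick_loader(name))
```

Returning `None` lets the caller decide whether to skip an unsupported file or try a generic loader, rather than always attempting OCR on it.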