# LayoutLMv3-Invoice Extract: Fine-Tuning for Invoice Understanding

**Project Idea:**  
This project focuses on fine-tuning the **LayoutLMv3** model on the SERIO dataset to improve invoice understanding and entity extraction. The primary goal is to enhance the model's ability to accurately interpret complex invoice layouts, identifying key information such as invoice numbers, dates, total amounts, and line items. By leveraging the LayoutLMv3 model's visual and textual learning capabilities, the project aims to achieve more efficient and accurate processing of invoices for real-world applications in financial management and automated data extraction.

**Objectives:**
- Fine-tune LayoutLMv3 on the SERIO dataset for better invoice understanding.
- Improve the model's ability to recognize key entities such as invoice numbers, dates, total amounts, and line items.
- Apply the model in real-world scenarios to automate invoice data extraction and financial management.

**Conclusion:**  
By enhancing the LayoutLMv3 model's ability to accurately interpret invoices, this project seeks to advance automated solutions for financial document processing, leading to more streamlined workflows in financial and administrative tasks.


## Required Libraries and Imports

The following libraries and modules are essential for data processing, model training, evaluation, and visualization. They include utilities for handling datasets, image transformations, tokenization, and metrics calculation. The LayoutLMv3 model and Trainer from Hugging Face’s `transformers` library are also imported to facilitate token classification tasks on structured documents like invoices.

- **os, glob, shutil**: For file handling and directory management.
- **PIL (Python Imaging Library)**: For image processing and rendering.
- **cv2 (OpenCV)**: For advanced image manipulation and visualization.
- **torch, torchvision**: For building and training deep learning models.
- **transformers**: To use the LayoutLMv3 model, tokenizers, and the Trainer class.
- **sklearn**: To calculate metrics such as accuracy, precision, recall, and F1 score.
- **matplotlib**: For visualizing images and bounding boxes.
- **tqdm**: To display progress bars during training.

In [None]:
import os
import glob
import json 
import random
from pathlib import Path
from difflib import SequenceMatcher
import shutil
from PIL import Image, ImageDraw, ImageFont
import cv2
import pandas as pd
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from IPython.display import display
import matplotlib
from matplotlib import pyplot, patches
from time import perf_counter
import random
import torch
from torch.utils.data import Dataset
from PIL import Image
from torchvision import transforms
from transformers import LayoutLMv3Tokenizer
from tqdm import tqdm
import pandas as pd
import matplotlib.pyplot as plt
from transformers import TrainingArguments, Trainer
from transformers import LayoutLMv3ForTokenClassification, AutoConfig
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np



### let's see one invoice to gain some insights

In [None]:
sroie_folder_path = Path('/kaggle/input/sroie-datasetv2/SROIE2019')
example_file = Path('X51005365187.txt')

In [None]:
image = Image.open("/kaggle/input/sroie-datasetv2/SROIE2019/train/img/X00016469612.jpg")
image = image.convert("RGB")
new_image = image.resize((300, 600))
new_image


## Data Prepocessing

## Reading Bounding Boxes and Words from Text Files

This function `read_bbox_and_words` reads bounding box coordinates and text from a given file and processes the data into a structured format. The text file is expected to contain comma-separated values where the first eight values represent the bounding box coordinates (in terms of four points: x0, y0, x1, y1, x2, y2, x3, y3) and the remaining values correspond to the associated text. The function splits these values, stores them in a list, and finally converts them into a Pandas DataFrame.

### Steps:
1. **File Reading and Parsing**: The function opens the specified file, processes each line, and extracts bounding box coordinates along with the corresponding text.
2. **Data Storage**: The parsed data is stored in a list and then converted into a DataFrame for easier handling and manipulation.
3. **Bounding Box Conversion**: The bounding box coordinates are explicitly converted into integers for future processing.
4. **Dropping Unnecessary Columns**: Some of the bounding box columns are dropped to simplify the data (e.g., `x1`, `y1`, `x3`, and `y3` are removed).
5. **Preview**: The function returns the processed DataFrame, and we display the first few rows of the file and the DataFrame to verify the result.

The `head()` command is used to show the first five lines from the input file and the first few rows of the resulting DataFrame.


In [None]:
def read_bbox_and_words(path: Path):
  bbox_and_words_list = []

  with open(path, 'r', errors='ignore') as f:
    for line in f.read().splitlines():
      if len(line) == 0:
        continue
        
      split_lines = line.split(",")

      bbox = np.array(split_lines[0:8], dtype=np.int32)
      text = ",".join(split_lines[8:])

      # From the splited line we save (filename, [bounding box points], text line).
      # The filename will be useful in the future
      bbox_and_words_list.append([path.stem, *bbox, text])
    
  dataframe = pd.DataFrame(bbox_and_words_list, columns=['filename', 'x0', 'y0',
                                                         'x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'line'])

  # Explicitly convert only the bounding box columns to integers
  bbox_columns = ['x0', 'y0','x1', 'y1', 'x2', 'y2', 'x3', 'y3']  # Adjust based on your actual columns
  dataframe[bbox_columns] = dataframe[bbox_columns].astype(np.int16)
    
  dataframe = dataframe.drop(columns=['x1', 'y1', 'x3', 'y3'])

  return dataframe
bbox_file_path = sroie_folder_path / "test/box" / example_file
print("== File content ==")
!head -n 5 "{bbox_file_path}"

bbox = read_bbox_and_words(path=bbox_file_path)
print("\n== Dataframe ==")
bbox.head(5)

### Read Entities from JSON: 
This function reads invoice entities from a JSON file and returns them as a Pandas DataFrame.


In [None]:
def read_entities(path: Path):
  with open(path, 'r') as f:
    data = json.load(f)

  dataframe = pd.DataFrame([data])
  return dataframe


# Example usage
entities_file_path = sroie_folder_path /  "test/entities" / example_file
print("== File content ==")
!head "{entities_file_path}"

entities = read_entities(path=entities_file_path)
print("\n\n== Dataframe ==")
entities

### Assign Line Label:
This function assigns a label to a line of text based on its similarity to entity names from a DataFrame, returning the matching entity type or "O" for no match.


In [None]:
def assign_line_label(line: str, entities: pd.DataFrame):
    line_set = line.replace(",", "").strip().split()
    for i, column in enumerate(entities):
        entity_set =  entities.iloc[0, i].replace(",", "").strip().split()
        
        
        matches_count = 0
        for l in line_set:
            if any(SequenceMatcher(a=l, b=b).ratio() > 0.8 for b in entity_set):
                matches_count += 1
            
            if (column.upper() == 'ADDRESS' and (matches_count / len(line_set)) >= 0.5) or \
               matches_count == len(entity_set):
                return column.upper()

    return "O"


line = bbox.loc[1,"line"]
label = assign_line_label(line, entities)
print("Line:", line)
print("Assigned label:", label)

### Assign Labels:
This function assigns labels to words based on their bounding box dimensions and the presence of entities, ensuring unique assignments for critical fields like TOTAL and DATE while preventing conflicts.


In [None]:
def assign_labels(words: pd.DataFrame, entities: pd.DataFrame):
    max_area = {"TOTAL": (0, -1), "DATE": (0, -1)}  # Value, index
    already_labeled = {"TOTAL": False,
                       "DATE": False,
                       "ADDRESS": False,
                       "COMPANY": False,
                       "O": False
    }

    # Go through every line in $words and assign it a label
    labels = []
    for i, line in enumerate(words['line']):
        label = assign_line_label(line, entities)

        already_labeled[label] = True
        if (label == "ADDRESS" and already_labeled["TOTAL"]) or \
           (label == "COMPANY" and (already_labeled["DATE"] or already_labeled["TOTAL"])):
            label = "O"
         # Assign to the largest bounding box
        if label in ["TOTAL", "DATE"]:
            x0_loc = words.columns.get_loc("x0")
            bbox = words.iloc[i, x0_loc:x0_loc+4].to_list()
            area = (bbox[2] - bbox[0]) + (bbox[3] - bbox[1])

            if max_area[label][0] < area:
                max_area[label] = (area, i)

            label = "O"

        labels.append(label)

    labels[max_area["DATE"][1]] = "DATE"
    labels[max_area["TOTAL"][1]] = "TOTAL"

    words["label"] = labels
    return words


# Example usage
bbox_labeled = assign_labels(bbox, entities)
bbox_labeled.head(15)

In [None]:
bbox_labeled.columns

### Split Line:
This function splits a line into individual words while maintaining the same bounding box coordinates for each word, as the research indicates that they share the same context.


In [None]:
def split_line(line: pd.Series) -> list:
    """
    Splits a line into words and updates bounding box coordinates for each word.
    
    Parameters:
        line (pd.Series): A pandas Series with 'x0', 'x2', and 'line' columns.
    
    Returns:
        list: A list of lists where each sublist contains updated values for the line.
    """
    # Ensure the line has the necessary columns
    if not {'x0', 'x2', 'line'}.issubset(line.index):
        raise ValueError("The line must contain 'x0', 'x2', and 'line' columns.")

    # Extract current bounding box information
    x0 = line['x0']
    x2 = line['x2']
    bbox_width = line['x2'] - line['x0']  # Example width, adjust as needed
    line_str = line['line']

    words = line_str.split()
    new_lines = []

    # Iterate through each word and calculate new bounding box coordinates
    for index, word in enumerate(words):

        # Create a new Series for the updated line
        line_copy = line.copy()
        line_copy['x0'] = x0
        line_copy['x2'] = x2
        line_copy['line'] = word
        
        # Append the updated line to the new_lines list
        new_lines.append(line_copy.to_list())

        # Update x0 for the next word
    return new_lines



# Example usage
new_lines = split_line(bbox_labeled.loc[1])
print("Original row:")
display(bbox_labeled.loc[1:1,:])

print("Splitted row:")
pd.DataFrame(new_lines, columns=bbox_labeled.columns)

In [None]:
def dataset_creator(folder: Path):
  bbox_folder = folder / 'box'
  entities_folder = folder / 'entities'
  img_folder = folder / 'img'

  # Sort by filename so that when zipping them together
  # we don't get some other file (just in case)
  entities_files = sorted(entities_folder.glob("*.txt"))
  bbox_files = sorted(bbox_folder.glob("*.txt"))
  img_files = sorted(img_folder.glob("*.jpg"))

  data = []

  print("Reading dataset:")
  for bbox_file, entities_file, img_file in tqdm(zip(bbox_files, entities_files, img_files), total=len(bbox_files)):            
    # Read the files
    bbox = read_bbox_and_words(bbox_file)
    entities = read_entities(entities_file)
    image = Image.open(img_file)

    # Assign labels to lines in bbox using entities
    bbox_labeled = assign_labels(bbox, entities)
    del bbox

    # Split lines into separate tokens
    new_bbox_l = []
    for index, row in bbox_labeled.iterrows():
      new_bbox_l += split_line(row)
    new_bbox = pd.DataFrame(new_bbox_l, columns=bbox_labeled.columns)
    new_bbox[['x0', 'y0', 'x2', 'y2']] = new_bbox[['x0', 'y0', 'x2', 'y2']].astype(np.int16)

    del bbox_labeled


    # Do another label assignment to keep the labeling more precise 
    for index, row in new_bbox.iterrows():
      label = row['label']

      if label != "O":
        entity_values = entities.iloc[0, entities.columns.get_loc(label.lower())]
        entity_set = entity_values.split()
        
        if any(SequenceMatcher(a=row['line'], b=b).ratio() > 0.7 for b in entity_set):
            label = "S-" + label
        else:
            label = "O"
      
      new_bbox.at[index, 'label'] = label

    width, height = image.size
  
    data.append([new_bbox, width, height])

  return data

In [None]:
dataset_train = dataset_creator(sroie_folder_path / 'train')
dataset_test = dataset_creator(sroie_folder_path / 'test')

### train-test split

In [None]:
random.seed(42)
random.shuffle(dataset_test)
dataset_val = dataset_test[174:]
dataset_test = dataset_test[:174]
print(len(dataset_val))
print (len(dataset_test))

In [None]:
dataset_train[0][0]["x0"]

In [None]:
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image
from transformers import LayoutLMv3Tokenizer
import torch

class InvoiceDataset(Dataset):
    def __init__(self, invoice_list, tokenizer, image_folder_path):
        self.invoice_list = invoice_list
        self.tokenizer = tokenizer
        self.image_folder_path = image_folder_path
        self.label_map = {
            "S-COMPANY": 0,
            "S-ADDRESS": 1,
            "S-DATE": 2,
            "S-TOTAL": 3,
            "O": 4,  # For 'Other'
        }
        self.max_length = 512
        
        # Modify the transform to resize to 224x224, which aligns with model expectations
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),  # Resize to 224x224
            transforms.ToTensor(),           
            transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  
        ])

    def __len__(self):
        return len(self.invoice_list)

    def __getitem__(self, idx):
        invoice_data = self.invoice_list[idx]
        word_df = invoice_data[0]  
        image_width = invoice_data[1]  
        image_height = invoice_data[2]  
        image_path = word_df["filename"].iloc[0]  

        words = []
        bboxes = []
        labels = []

        # Load and preprocess the image
        image_path = f"{self.image_folder_path}/{image_path}.jpg"
        try:
            image = Image.open(image_path).convert("RGB")
        except Exception as e:
            print(f"Error loading image {image_path}: {e}")
            return None  
        
        image = self.transform(image)  # Shape: [3, 224, 224]

        # Add a batch dimension to the image tensor
        image = image.unsqueeze(0)  # Now shape: [1, 3, 224, 224]

        for _, word_data in word_df.iterrows():
            word = word_data['line'] 
            label = word_data['label']
            bbox = [
                word_data['x0'],
                word_data['y0'],
                word_data['x2'],
                word_data['y2']
            ] 

            # Normalize the bounding boxes
            normalized_bbox = [
                bbox[0] * 1000 / image_width,
                bbox[1] * 1000 / image_height,
                bbox[2] * 1000 / image_width,
                bbox[3] * 1000 / image_height
            ]

            words.append(word)
            bboxes.append(normalized_bbox)  
            labels.append(label)

        # Tokenize the words with bounding boxes
        tokens = self.tokenizer(
            words,
            boxes=bboxes,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            is_split_into_words=True,
            return_tensors="pt"
        )

        # Convert labels to numerical format
        labels = [self.label_map.get(label, self.label_map["O"]) for label in word_df['label'].tolist()]

        # Pad labels to max length with -100
        padded_labels = labels + [-100] * (self.max_length - len(labels))  # Use -100 for padding
        labels_tensor = torch.tensor(padded_labels, dtype=torch.long)

        # Ensure the bbox tensor is correctly padded
        bbox_tensor = tokens['bbox'].squeeze(0)  
        if bbox_tensor.size(0) < self.max_length:
            padding = torch.zeros((self.max_length - bbox_tensor.size(0), 4), dtype=torch.float32)  # Pad with zeros
            bbox_tensor = torch.cat([bbox_tensor, padding], dim=0)

        # Convert everything to long 
        input_ids_tensor = tokens['input_ids'].squeeze(0).long()
        attention_mask_tensor = tokens['attention_mask'].squeeze(0).long()

        return  {
            'input_ids': input_ids_tensor,
            'attention_mask': attention_mask_tensor,
            'bbox': bbox_tensor.to(torch.long),
            'labels': labels_tensor,
            'pixel_values': image.squeeze(0)  # Remove batch dimension for the final output shape [3, 224, 224]
        }

# Initialize tokenizer
tokenizer = LayoutLMv3Tokenizer.from_pretrained("mp-02/layoutlmv3-large-cord2")

# Create your dataset
image_folder_path = '/kaggle/input/sroie-datasetv2/SROIE2019/train/img'
dataset = InvoiceDataset(dataset_train, tokenizer=tokenizer, image_folder_path=image_folder_path)
val_set = InvoiceDataset(dataset_val, tokenizer=tokenizer, image_folder_path="/kaggle/input/sroie-datasetv2/SROIE2019/test/img")


### Ensure Tensor Shapes
It is crucial to verify that the shapes of all tensors are correct before proceeding with model training. This includes ensuring that input tensors match the expected dimensions of the model and that target tensors (labels) align with the input tensor shapes. Proper shape management helps prevent runtime errors and ensures the model learns effectively.


In [None]:
print(dataset[0]["input_ids"].shape)
print(dataset[0]["attention_mask"].shape)
print(dataset[0]["bbox"].shape)
print(dataset[0]["labels"].shape)
print(dataset[0]["pixel_values"].shape)

In [None]:
dataset[0]["bbox"]

In [None]:
import warnings
from transformers import logging as transformers_logging

warnings.filterwarnings("ignore")
transformers_logging.set_verbosity_error()


In [None]:
import wandb
wandb.login(key="43683e6439b3f848199c0e333e5ffdc8c1695604")

## Training

In this section, we define the training parameters and initialize the `Trainer` for fine-tuning the LayoutLMv3 model on the SERIO dataset. The training arguments include evaluation strategies, logging settings, and learning rate specifications. The model is configured to freeze the first few layers to retain pre-trained weights while allowing the rest to be trainable. The `compute_metrics` function is utilized to evaluate the model's performance during training.

The following code initializes the model and the training process.


In [None]:
import numpy as np
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import LayoutLMv3ForTokenClassification, Trainer, TrainingArguments

def compute_metrics(pred):
    logits = pred.predictions
    labels = pred.label_ids 
    
    predictions = np.argmax(logits, axis=-1) 

    mask = labels != -100  
    labels = labels[mask]
    predictions = predictions[mask]

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# Define class weights
class_weights = torch.tensor([5.0, 5.0, 5.0, 5.0, 1.0])  # Adjusted weights

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "mp-02/layoutlmv3-large-cord2",
    num_labels=5, 
    hidden_dropout_prob=0.2 
)

for idx, param in enumerate(model.parameters()):
    param.requires_grad = idx >= 8

def custom_loss_func(logits, labels):
    loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)  # Use class weights here
    return loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))

model.loss_fct = custom_loss_func

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_dir='./logs',
    logging_steps=100,
    num_train_epochs=40,
    learning_rate=1e-5,
    report_to='wandb',
    run_name='layoutlmv3-training',  
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model='f1', 
    greater_is_better=True
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,  # Use the entire training dataset
    eval_dataset=val_set,    # Use the entire validation dataset
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

## Best Model Selection

The best model is the one with the highest F1 score on the validation set. This metric provides a balance between precision and recall, making it a suitable choice for evaluating model performance in tasks where class distribution may be imbalanced.


In [None]:
best_model = LayoutLMv3ForTokenClassification.from_pretrained('/kaggle/working/results/checkpoint-6573')

In [None]:
trainer = Trainer(
    model=best_model,
    args=training_args,
    compute_metrics=compute_metrics
)

training_eval = trainer.evaluate(eval_dataset=dataset)
print("Training Evaluation:", training_eval)

val_eval = trainer.evaluate(eval_dataset=val_set)
print("Validation Evaluation:", val_eval)


## Evaluating Model performance on the Test Set

In this section, we will evaluate the performance of our trained model on the test set.

In [None]:
test_set = InvoiceDataset(dataset_test, tokenizer=tokenizer, image_folder_path="/kaggle/input/sroie-datasetv2/SROIE2019/test/img")
test_evaluation = trainer.evaluate(eval_dataset= test_set)
print(test_evaluation)

### Great Achievement!
We have achieved an impressive F1 score of **95.8%**! 🎉

## Output Production Phase

In this phase of the pipeline, we aim to produce the output in JSON file format. This is crucial for integrating the model's predictions with other applications or systems that require structured data.

### Steps to Produce JSON Output:

1. **Extract Predictions**: After evaluating the model on the validation/test dataset, extract the relevant predictions (e.g., bounding boxes, labels, company name, date, address, and total).
  
2. **Structure the Data**: Organize the extracted data into a dictionary format. Each entry should correspond to a specific field that we want to include in the JSON output.

3. **Convert to JSON**: Use Python's built-in `json` module to convert the structured data into JSON format.

4. **Save the JSON File**: Write the JSON data to a file for further use or analysis.

This phase is typically referred to as the **Output Generation Phase** in the machine learning pipeline, where we focus on converting model predictions into a consumable format.


### I will use on observation from test set to extract the output as json file a

In [208]:
def generate_labels(sample)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    observation = {k: v for k, v in sample.items() if k != 'labels'}

    input_ids = observation['input_ids'].unsqueeze(0)
    attention_mask = observation['attention_mask'].unsqueeze(0)
    bbox = observation['bbox'].unsqueeze(0)
    pixel_values = observation['pixel_values'].unsqueeze(0)

    best_model = best_model.to(device)

    with torch.no_grad():
        outputs = best_model(input_ids=input_ids, 
                        attention_mask=attention_mask, 
                        bbox=bbox, 
                        pixel_values=pixel_values)
    return outputs

outputs = generate_labels(test_set[0])

SyntaxError: invalid syntax (563779616.py, line 1)

In [None]:
outputs

In [None]:
sample = dataset_test[0][0].drop(["label"], axis= 1)

In [None]:
dataset_test[0][0]['label']

In [None]:
import pandas as pd
import torch

def logits_to_labels(outputs, label_map: dict) -> pd.Series:
    """
    Convert logits from TokenClassifierOutput to labels based on the label mapping.

    Parameters:
        outputs (TokenClassifierOutput): The output from the model containing logits.
        label_map (dict): A mapping from label names to indices.

    Returns:
        pd.Series: A Pandas Series containing the predicted labels.
    """
    # Extract logits from the outputs
    logits = outputs.logits  # Access the logits attribute

    # Get the predicted indices from the logits
    predicted_indices = torch.argmax(logits, dim=-1)

    # Create a reverse mapping from indices to labels
    index_to_label = {v: k for k, v in label_map.items()}

    # Map predicted indices to labels
    # Use squeeze to remove unnecessary dimensions (batch size = 1 assumed)
    predicted_labels = [index_to_label[idx.item()] for idx in predicted_indices.squeeze()]

    # Create a Pandas Series from the predicted labels
    labels_series = pd.Series(predicted_labels)

    return labels_series

# Example label mapping
label_map = {
    "S-COMPANY": 0,
    "S-ADDRESS": 1,
    "S-DATE": 2,
    "S-TOTAL": 3,
    "O": 4,  # For 'Other'
}

# Assuming 'outputs' is your TokenClassifierOutput object
# Convert logits to labels
labels_series = logits_to_labels(outputs, label_map)
print(labels_series)


In [None]:
len(sample)

In [None]:
sample_output =pd.concat([sample, labels_series[:len(sample)]], axis= 1)

In [None]:
sample_output.columns = ['filename', 'x0', 'y0', 'x2', 'y2', 'line', 'label']  # Rename columns as needed


In [None]:
sample_output

In [None]:
import pandas as pd
from collections import Counter  # Ensure to import Counter

def reverse_words_and_vote(df: pd.DataFrame) -> pd.DataFrame:
    # Group by bounding box coordinates
    grouped = df.groupby(['x0', 'y0', 'x2', 'y2'])

    final_labels = []
    
    for (x0, y0, x2, y2), group in grouped:
        # Reverse the words in the same bounding box
        reversed_words = ' '.join(reversed(group['line'].tolist()))
        
        # Count the occurrences of each label
        label_counts = Counter(group['label'])
        
        # Get the most common label
        most_common_label, count = label_counts.most_common(1)[0]
        
        # Add the reversed words along with bounding box coordinates and label to the final output
        final_labels.append({
            'filename': group['filename'].iloc[0],
            'x0': x0,
            'y0': y0,
            'x2': x2,
            'y2': y2,
            'line': reversed_words,
            'label': most_common_label,
            'count': count
        })

    # Create a new DataFrame with the final results
    final_df = pd.DataFrame(final_labels)

    return final_df

# Example usage
# Assuming `sample_output` is your existing DataFrame
# sample_output = pd.DataFrame(...)  # Your DataFrame goes here
new_sample_output = reverse_words_and_vote(sample_output)

# Display the new DataFrame
print(new_sample_output.drop("count", axis=1))


In [None]:
new_sample_output.drop(["count","filename"], axis= 1).to_json(f"/kaggle/working/sample_output.json", orient='records', lines=True)

In [None]:
import json

# Path to your JSON file
json_file_path = '/kaggle/working/sample_output.json'  # Update with your actual JSON file path

# Initialize an empty list to store the loaded JSON data
data = []

try:
    # Open and read the JSON file
    with open(json_file_path, 'r') as file:
        # Check if the file is line-delimited JSON (each line is a separate JSON object)
        for line in file:
            try:
                # Load each line as a separate JSON object
                data.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line.strip()} - {e}")

    # If the file is not line-delimited, you can load it as a whole
    # Uncomment this section if you expect the entire file to be a single JSON object or array
    # data = json.load(file)

except FileNotFoundError:
    print(f"File not found: {json_file_path}")
except json.JSONDecodeError as e:
    print(f"Error decoding JSON: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

# Print the loaded data in a readable JSON format
try:
    print(json.dumps(data, indent=4))  # Pretty print the JSON data
except Exception as e:
    print(f"Error printing JSON data: {e}")


In [None]:
import pandas as pd
import json
from PIL import Image, ImageDraw
import matplotlib.pyplot as plt

label_colors = {
    "S-COMPANY": "blue",
    "S-ADDRESS": "red",
    "S-DATE": "green",
    "S-TOTAL": "orange",
    "O": "black"
}

def draw_bounding_boxes(image_path, df):
    try:
        image = Image.open(image_path)
        
    except FileNotFoundError:
        print(f"Image file not found: {image_path}")
        return None

    draw = ImageDraw.Draw(image)
    
    for index, row in df.iterrows():
        x0, y0, x2, y2, label = row['x0'], row['y0'], row['x2'], row['y2'], row['label']
        
        color = label_colors.get(label, "black")
        
        draw.rectangle([x0, y0, x2, y2], outline=color, width=2)
        
    return image

image_filename = sample_output['filename'].iloc[0]
image_path = f"/kaggle/input/sroie-datasetv2/SROIE2019/test/img/{image_filename}.jpg"

output_image = draw_bounding_boxes(image_path, new_sample_output)

if output_image is not None:
    plt.figure(figsize=(10, 10))
    plt.imshow(output_image)
    plt.axis('off')
    plt.show()
else:
    print("No image to display.")


In [199]:
!pip install huggingface_hub




### Push Our Model to Hugging Face

In this section, we will upload our fine-tuned model to the Hugging Face Model Hub. This allows others to easily access and use our model.

In [205]:
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [206]:
# Import necessary libraries
from transformers import AutoModelForTokenClassification, AutoTokenizer
from huggingface_hub import HfApi, HfFolder

# Step 1: Load Your Model Checkpoint
checkpoint_path = '/kaggle/working/results/checkpoint-6573'  # Update with your checkpoint path
model = AutoModelForTokenClassification.from_pretrained(checkpoint_path)



model_name = "MohmaedElnamir/fine-tuned-layoutlmv3-sroie"  # Update with your Hugging Face username and desired model name


model.push_to_hub(model_name)


print(f"Model uploaded to Hugging Face: {model_name}")


model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

In [207]:
tokenizer_name = "mp-02/layoutlmv3-large-cord2"  # Replace with the original model name you used
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
model.push_to_hub(model_name)


README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/MohmaedElnamir/fine-tuned-layoutlmv3-sroie/commit/a0f4d957af28c85155a0fc5a5979e7739a029115', commit_message='Upload LayoutLMv3ForTokenClassification', commit_description='', oid='a0f4d957af28c85155a0fc5a5979e7739a029115', pr_url=None, repo_url=RepoUrl('https://huggingface.co/MohmaedElnamir/fine-tuned-layoutlmv3-sroie', endpoint='https://huggingface.co', repo_type='model', repo_id='MohmaedElnamir/fine-tuned-layoutlmv3-sroie'), pr_revision=None, pr_num=None)