# PDF Processing

This notebook extracts structured data from LEGO instruction manuals in PDF format. Using computer vision, OCR, and image processing, it identifies and catalogs which LEGO pieces are used in each construction step of a set by analyzing the manual's PDF content.

The primary goal is to build an automated tool that parses LEGO instruction PDFs and generates a comprehensive inventory of pieces organized by construction step. This information is then stored in a database for later use.

## Imports and definitions

In [28]:
# Autoreload the modules when the source code changes
%load_ext autoreload
%autoreload 2

from pathlib import Path
import sys
import os
import csv

# Code profiling
from cProfile import Profile
from pstats import SortKey, Stats
import time
import fitz

# Get the absolute path of the current directory
current_dir = Path().resolve()
# If we're in the notebooks or src directory, move up one level
# to the project root directory
project_root = current_dir.parent if current_dir.name in ['notebooks', 'src'] else current_dir
# Add the project root directory to Python's path
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
# Add /src to Python's path
src_dir = project_root / 'src'
if str(src_dir) not in sys.path:
    sys.path.append(str(src_dir))

# Change the current working directory
os.chdir(project_root)

from src.pdf_processor import PDFProcessor
from src.step_detector import StepDetector
from src.utils.logger import setup_logging
from src.piece_matcher import PieceMatcher
from src.data_loader import LegoDataLoader
from src.utils.pdf_parts_list_extractor import extract_parts_list_from_pdf
from src.utils.export_db import export_database_to_csv
import logging

logger = logging.getLogger(__name__)

setup_logging(log_to_console=True, log_level=logging.DEBUG)

data_dir = Path('data')
images_path = data_dir / 'training' / 'images'
manuals_path = data_dir / 'training' / 'manuals'
manuals_to_sets = data_dir / 'BrickMapper DB' / 'manuals.csv'
inventories_csv = data_dir / 'inventories.csv'
processed_booklets_dir = data_dir / 'processed_booklets'
annotations_file_path = data_dir / 'training' / 'annotations' / 'merged' / 'annotations.json'
db_url = os.getenv('DATABASE_URL')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Training the YOLO Model

YOLO is used to identify step data in the LEGO manuals.
About 500 pages from 4 different manuals were annotated using [CVAT](https://www.cvat.ai/), split into training, validation, and test sets (60%, 20%, and 20% respectively), and used to train a YOLO11n model. YOLO11 was the latest YOLO version at the time of writing. The 'n' (nano) model was chosen for its small size and has proven more than adequate for the task.
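
The 60/20/20 split described above can be sketched as follows. This is only an illustration: the project's actual splitting happens inside `StepDetector.train_model`, which isn't shown in this notebook, so the helper name `split_dataset`, the fixed seed, and the page file names are all assumptions.

```python
import random


def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle items and split them into train/validation/test lists.

    Hypothetical helper: whatever remains after the train and
    validation portions (here 20%) becomes the test set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Made-up page names standing in for the roughly 500 annotated pages
pages = [f"page_{i:03d}.png" for i in range(500)]
train, val, test = split_dataset(pages)
print(len(train), len(val), len(test))
```

Shuffling before splitting keeps pages from a single manual from ending up entirely in one subset, which may matter when only four manuals are annotated.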

Here's an example of an annotated page:
![image.png](attachment:image.png)

There are two types of annotations:
* Yellow boxes mark step boxes, which list all the parts used in that step.
* Blue boxes mark the step number. Step numbers usually increase throughout the build, but they sometimes reset when switching to a new manual, or even within the same manual.

To make sure location pointers are unique, we'll use the set number, booklet number, page number, and step number when saving a piece's usage.
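
One way to picture such a location pointer is as a composite key. The sketch below is hypothetical: the class name `StepLocation`, its field types, and the set number are made up for illustration, and the real database schema may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepLocation:
    """Hypothetical composite key identifying one step across all manuals."""
    set_number: str
    booklet_number: int
    page_number: int
    step_number: int


# Two steps numbered 1 in different booklets of the same (made-up) set
first = StepLocation("12345-1", booklet_number=1, page_number=6, step_number=1)
second = StepLocation("12345-1", booklet_number=2, page_number=4, step_number=1)

# frozen=True makes instances hashable, so they can serve as dictionary keys
usage = {first: ["element A"], second: ["element B"]}
print(len(usage))
```

Because all four fields take part in equality and hashing, a step number that resets in a later booklet still maps to a distinct key.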

In [2]:
# If the model has already been trained, we don't need to train it again
model_path = Path("runs/train/exp/weights/best.pt")

if not model_path.exists():
    StepDetector.train_model(
        images_dir=images_path,
        annotations_path=annotations_file_path,
        model_save_path=model_path,
    )

## Processing manuals

Now that the model has been trained we can use it to process manuals it hasn't seen yet.

To process a manual:

* Put a PDF of the manual in `manuals_path`, named `<manual ID>.pdf`.
* Add the manual to the `manuals_to_sets` CSV file (format: `manual_ID,set_number,booklet_number`).
* Add the manual ID to the `manuals` list below:

In [29]:
#manuals = ['6532380', '6519906', '6285327', '6372678']
manuals = ['6285327']
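
The `manual_ID,set_number,booklet_number` format mentioned above can be parsed along these lines. This is a self-contained sketch using in-memory sample data: the header names and set numbers are placeholders, and `load_manuals` is a hypothetical helper rather than part of the project.

```python
import csv
import io

# Made-up file content; the header names and set numbers are placeholders
sample_csv = """manual_ID,set_number,booklet_number
6285327,00000-1,1
6532380,00000-1,2
"""


def load_manuals(csvfile):
    """Map each manual ID to its (set_number, booklet_number) pair."""
    mapping = {}
    reader = csv.reader(csvfile)
    next(reader, None)  # skip the header row
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) < 3:
            raise ValueError(f"Line {line_no}: expected 3 columns, got {row!r}")
        manual_id, set_number, booklet_number = (field.strip() for field in row[:3])
        mapping[manual_id] = (set_number, booklet_number)
    return mapping


manuals_by_id = load_manuals(io.StringIO(sample_csv))
print(manuals_by_id["6285327"])
```

Validating the column count per row gives an error message that points at the offending line, which can be easier to act on than an `IndexError` raised later.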

In [32]:
# For debugging: use a local SQLite database
db_path = "sqlite:///test_db.db"

# Initialize processors
pdf_processor = PDFProcessor()

manuals_dict = {}
with open(manuals_to_sets, "r", newline="") as csvfile:
    reader = csv.reader(csvfile)
    # Skip the header row
    next(reader, None)
    for row in reader:
        try:
            # Map manual ID -> [set number, booklet number]
            manuals_dict[row[0]] = [row[1], row[2]]
        except IndexError:
            logger.error("Error reading manuals to sets CSV file.")
            logger.error(f"Row: {row}")
            sys.exit(1)

pieces_by_steps = {}
all_sets_set_pieces = {}
matchers = {}
documents = {}
total_steps = {}

st = time.time()
        
with Profile() as profile:
    # Process manuals
    # Start with finding the parts list in each manual
    for manual in manuals:
        if manual not in manuals_dict:
            logger.warning(f"Manual ID {manual} not found in manuals dictionary.")
            continue

        # Set up directories
        pdf_path = manuals_path / f"{manual}.pdf"
        documents[manual] = fitz.open(pdf_path)

        # Only set these directories if you want to save the intermediate images
        booklet_images_dir = processed_booklets_dir / manual
        identified_set_pieces_dir = booklet_images_dir / "identified set pieces"
        piece_list_images_dir = booklet_images_dir / "piece list images"
        rejects_dir = booklet_images_dir / "rejected inventory images"

        set_num = manuals_dict[manual][0]
        booklet_num = manuals_dict[manual][1]
        logger.info(f"Looking for the parts list in manual {manual}...")
        parts_list_pages = pdf_processor.find_parts_list_pages(documents[manual])
        if not parts_list_pages:
            # Couldn't find a parts list in the manual.
            # We'll check later if other manuals from the same set have one
            continue

        if set_num not in all_sets_set_pieces:
            all_sets_set_pieces[set_num] = {}

        all_sets_set_pieces[set_num][booklet_num], _ = extract_parts_list_from_pdf(
            documents[manual],
            parts_list_pages,
            identified_set_pieces_dir=identified_set_pieces_dir,
            parts_list_images_dir=piece_list_images_dir,
            rejected_images_dir=rejects_dir,
        )
    
    # Match each manual with corresponding parts list
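    for manual in manuals:
        if manual not in documents:
            # The manual was skipped in the first pass (not listed in the manuals CSV)
            continue

        # Set up directories

        # Only set these directories if you want to save the intermediate images.
        # Recompute the booklet directory here so each manual writes to its own
        # folder instead of reusing the value left over from the previous loop.
        booklet_images_dir = processed_booklets_dir / manual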
        step_pieces_dir = booklet_images_dir / "step pieces"
        steps_dir = booklet_images_dir / "steps"
        page_images_dir = booklet_images_dir / "pages"

        set_num = manuals_dict[manual][0]
        booklet_num = manuals_dict[manual][1]

        if set_num in all_sets_set_pieces:
            if booklet_num in all_sets_set_pieces[set_num]:
                set_pieces = all_sets_set_pieces[set_num][booklet_num]
            else:
                # The parts list wasn't found for this manual.
                # Try to use the parts list from another manual in the same set
                _, set_pieces = next(
                    iter(all_sets_set_pieces[set_num].items())
                )
        else:
            logger.warning(
                f"No parts list matched for manual {manual} (set # {set_num})"
            )
            continue

        # Extract steps
        logger.info(f"Processing manual {manual}...")
        try:
            all_step_pieces, total_steps[manual] = pdf_processor.process_manual(
                documents[manual],
                steps_path=steps_dir,
                step_pieces_path=step_pieces_dir,
                page_images_path=page_images_dir,
            )
            logger.info(f"Found {total_steps[manual]} steps in manual {manual}.")

        except Exception as e:
            logger.error(f"Error processing manual {manual}: {e}")
            continue
        
        matcher = PieceMatcher()

        matcher.add_step_pieces(all_step_pieces)
        matcher.add_set_pieces(set_pieces)

        matcher.match_pieces(n_workers=16)

        matchers[manual] = matcher

        # Store the findings in the database
        # loader = LegoDataLoader(
        # db_path, manuals_to_sets, inventories_csv, int(manual)
        # )
        # loader.process_and_load_manual_data(
        #     matchers[manual].matched
        # )
    
    logger.info("Finished processing manuals. Summary:")
    for manual, matcher in matchers.items():
        logger.info(f"Manual {manual}:")
        logger.info(f"Steps found: {total_steps[manual]}")
        logger.info(f"Matched pieces: {len(matcher.matched_step_pieces)}")
        logger.info(f"Unmatched pieces: {len(matcher.unmatched_step_pieces)}")
        logger.info(f"Total pieces: {len(matcher.matched_step_pieces) + len(matcher.unmatched_step_pieces)}")
    logger.info(f"Total steps found: {sum(total_steps.values())}")
    logger.info(f"Total pieces matched: {sum(len(matcher.matched_step_pieces) for matcher in matchers.values())}")
    logger.info(f"Total pieces not matched: {sum(len(matcher.unmatched_step_pieces) for matcher in matchers.values())}")
    logger.info(f"Total pieces found: {sum(len(matcher.matched_step_pieces) + len(matcher.unmatched_step_pieces) for matcher in matchers.values())}")

    for doc in documents.values():
        doc.close()

    with open("stats.txt", "w") as file:
        stats = (
            Stats(profile, stream=file)
            .strip_dirs()
            .sort_stats(SortKey.CUMULATIVE)
        )
        stats.print_stats()

    et = time.time()
    elapsed_time = et - st
    print("Execution time:", elapsed_time, "seconds")

2025-04-22 20:04:19 - __main__ - INFO - Looking for the parts list in manual 6285327...
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Analyzing page 84/84...
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Element IDs: 1
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Number of columns: 0
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Is parts list: False
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Analyzing page 83/84...
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Element IDs: 60
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Number of columns: 5
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Is parts list: True
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Analyzing page 82/84...
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Element IDs: 69
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Number of columns: 5
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Is parts list: True
2025-04-22 20:04:19 - src.pdf_processor - DEBUG - Analyzing page 81/8

## Show matched pieces

To check our matching algorithm, let's look at the piece images extracted from the step boxes and see which inventory piece each one was matched to.

In [33]:
# Create the comparison DataFrame for each processed manual
for manual, matcher in matchers.items():
    matcher.display_set_pieces_summary()
    matched_df = matcher.create_comparison_dataframe()
    unmatched_df = matcher.create_comparison_dataframe(compare_unmatched=True)
    PieceMatcher.display_dataframe_statistics({"matched": matched_df, "unmatched": unmatched_df})
    print("Matched step pieces for manual ID", manual)
    matcher.display_comparison_dataframe(matched_df, "matched")
    print("Unmatched step pieces for manual ID", manual)
    matcher.display_comparison_dataframe(unmatched_df, "unmatched")
    

| | Element ID | Set Piece Image | Color Histogram | Match Status | Best Similarity |
|---|---|---|---|---|---|
| 78 | 4256828 | | | Matched | 0.9 |
| 118 | 278173 | | | Matched | 0.864 |
| 41 | 4565433 | | | Matched | 0.859 |
| 77 | 447726 | | | Matched | 0.85 |
| 26 | 4211481 | | | Matched | 0.828 |
| 73 | 617926 | | | Matched | 0.811 |
| 39 | 4211473 | | | Matched | 0.803 |
| 75 | 4100378 | | | Matched | 0.792 |
| 114 | 4619657 | | | Matched | 0.788 |
| 71 | 366626 | | | Matched | 0.77 |



Summary Statistics:
Total Set Pieces: 128
Matched Pieces: 61 (47.7%)
Unmatched Pieces: 67 (52.3%)


| | matched | unmatched |
|---|---|---|
| Total Pieces | 139.0 | 92.0 |
| Min Similarity | 0.602 | 0.0 |
| Max Similarity | 0.9 | 0.6 |
| Mean Similarity | 0.692 | 0.452 |
| Min Histogram Similarity | 0.0 | 0.0 |
| Max Histogram Similarity | 1.0 | 0.825 |
| Mean Histogram Similarity | 0.921 | 0.086 |


Matched step pieces for manual ID 6285327


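| | Element ID | Page | Step | Step Piece Image | Set Piece Image | Similarity Score | Step Piece Histogram | Set Piece Histogram | Histogram Similarity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4541984 | 6 | 1 | | | 0.743 | | | 0.998 |
| 1 | 4541984 | 10 | 8 | | | 0.743 | | | 0.998 |
| 2 | 243126 | 7 | 5 | | | 0.753 | | | 0.997 |
| 3 | 243126 | 11 | 10 | | | 0.753 | | | 0.997 |
| 4 | 243126 | 42 | 60 | | | 0.753 | | | 0.997 |
| 5 | 6185995 | 7 | 4 | | | 0.705 | | | 0.998 |
| 6 | 6185995 | 15 | 17 | | | 0.705 | | | 0.998 |
| 7 | 6185995 | 37 | 55 | | | 0.705 | | | 0.998 |
| 8 | 6185995 | 41 | 59 | | | 0.705 | | | 0.998 |
| 9 | 6185995 | 43 | 61 | | | 0.705 | | | 0.998 |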


Unmatched step pieces for manual ID 6285327


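| | Element ID | Page | Step | Step Piece Image | Set Piece Image | Similarity Score | Step Piece Histogram | Set Piece Histogram | Histogram Similarity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 287726 | 6 | 2 | | | 0.543 | | | 0.0 |
| 1 | 407026 | 6 | 3 | | | 0.552 | | | 0.0 |
| 2 | 407026 | 39 | 57 | | | 0.552 | | | 0.0 |
| 3 | 6258607 | 6 | 3 | | | 0.466 | | | 0.0 |
| 4 | 6258607 | 38 | 56 | | | 0.466 | | | 0.0 |
| 5 | 6258607 | 41 | 59 | | | 0.466 | | | 0.0 |
| 6 | 241226 | 7 | 5 | | | 0.511 | | | 0.0 |
| 7 | 241226 | 50 | 68 | | | 0.511 | | | 0.0 |
| 8 | 3000023 | 8 | 6 | | | 0.584 | | | 0.0 |
| 9 | 614126 | 11 | 9 | | | 0.527 | | | 0.0 |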
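
In the statistics above, matched pieces have similarity scores of 0.602 and higher while unmatched pieces top out at 0.6, which is consistent with a cutoff of about 0.6. The sketch below illustrates that kind of decision rule together with one common way to compare color histograms (histogram intersection). It is not `PieceMatcher`'s actual algorithm, which isn't shown in this notebook; `histogram_intersection`, `is_match`, and the 0.6 threshold are assumptions for illustration.

```python
def histogram_intersection(hist_a, hist_b):
    """Return the overlap of two histograms as a value in [0, 1].

    Each histogram is normalized to sum to 1 first, so identical color
    distributions score 1.0 and disjoint ones score 0.0.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("Histograms must have the same number of bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        return 0.0
    return sum(min(a / total_a, b / total_b) for a, b in zip(hist_a, hist_b))


def is_match(similarity, threshold=0.6):
    """Hypothetical decision rule: accept only scores above the threshold."""
    return similarity > threshold


# Toy 4-bin histograms (pixel counts per color bin)
red_brick = [90, 10, 0, 0]
red_plate = [45, 5, 0, 0]    # same color distribution, fewer pixels
blue_brick = [0, 0, 20, 80]  # no overlap with the red pieces

print(histogram_intersection(red_brick, red_plate))
print(histogram_intersection(red_brick, blue_brick))
print(is_match(0.602), is_match(0.6))
```

Normalizing before comparing makes the score independent of image size, so a small step-box crop and a larger parts-list crop of the same color can still score highly.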


## Store the steps data in the database

In [None]:
# Debug
db_path = "sqlite:///test_db.db"

for manual in manuals:
    if manual not in matchers:
        logger.warning(f"No matching data for manual ID {manual}")
        continue

    loader = LegoDataLoader(
        db_path, manuals_to_sets, inventories_csv, int(manual)
    )
    loader.process_and_load_manual_data(
        matchers[manual].matched
    )

## Export the database to CSV files

In [None]:
db_path = "sqlite:///test_db.db"

export_database_to_csv(db_path)