# PDF Transcriber Library - Usage Examples

This notebook demonstrates how to use the PDF Transcriber library to convert PDF documents to Markdown, with tables rendered as HTML, using the Google Gemini API and parallel page processing.

## Import the Library

First, import the necessary components from the library:

In [1]:
import logging
import os
import sys

# Add the parent directory to sys.path so we can import the pdf_parser module
sys.path.append("..")


from pdf_parser import PDFParser

# Set up logging to see what's happening
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger("pdf-transcriber-example")




## Set Up API Key

You need a Google Gemini API key. Either set it as an environment variable or pass it directly to the PDFParser:

In [2]:
# Set your API key here
# os.environ["GOOGLE_API_KEY"] = "your-api-key-here"  # Replace with your actual API key
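
Because the assignment above is commented out, it is easy to run the rest of the notebook without a key and only find out when the first API call fails. The sketch below checks the environment up front. The helper name `require_api_key` is hypothetical and not part of the PDF Transcriber library; it assumes the key lives in the `GOOGLE_API_KEY` environment variable shown above.

```python
import os


def require_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error if it is unset.

    Hypothetical helper for this notebook; not part of the pdf_parser module.
    """
    # Treat a missing variable and a blank value the same way.
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell or assign "
            f"os.environ[{env_var!r}] before creating the parser."
        )
    return api_key
```

Calling `require_api_key()` in the cell before the parser is created turns a missing key into an immediate, readable error instead of a failure partway through transcription.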

## Example 1: Basic Usage - Transcribe a Single PDF

In [3]:
# Initialize the parser
parser = PDFParser(
    dpi=150,
    max_workers=8,  # Number of parallel workers
)

# Path to your PDF file
pdf_path = "data/raw/rca.pdf"  # Replace with your PDF file path

# Transcribe and save to a file
output_path = "data/processed/rca.md"
parser.transcribe_pdf(pdf_path, output_path)

print(f"Transcription saved to {output_path}")


2025-02-27 23:41:26,297 - pdf_parser - INFO - Starting parallel transcription of data/raw/rca.pdf
Converting PDF to images: 100%|██████████| 160/160 [00:02<00:00, 61.40page/s]
2025-02-27 23:41:28,974 - pdf_parser - INFO - Converted PDF to 160 images
2025-02-27 23:41:28,974 - pdf_parser - INFO - Selected 3 sample pages for prompt generation
Generating custom prompt: 100%|██████████| 1/1 [00:14<00:00, 14.02s/it]
2025-02-27 23:41:42,992 - pdf_parser - INFO - Generated custom prompt for transcription
Transcribing pages in parallel:   7%|▋         | 11/160 [00:17<03:28,  1.40s/it]
2025-02-27 23:51:49,689 - pdf_parser - ERROR - Error transcribing page: 504 Deadline Exceeded
Transcribing pages in parallel:   8%|▊         | 12/160 [10:06<4:53:26, 118.96s/it]
2025-02-27 23:52:00,679 - pdf_parser - ERROR - Error transcribing page: 504 Deadline Exceeded
Transcribing pages in parallel: 100%|██████████| 160/160 [10:17<00:00,  3.86s/it]  
2025-02-27 23:52:00,690 - pdf_parser - INFO - Transcription com

Transcription saved to data/processed/rca.md
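
The log above shows two pages failing with `504 Deadline Exceeded` while the run still completes, so those pages may be missing or incomplete in the output. Transient timeouts like this are commonly handled by retrying the failed call with exponential backoff. The sketch below is a generic version of that pattern, not the library's implementation: `retry_with_backoff` is a hypothetical helper, and whether PDFParser retries internally or exposes a per-page call is not shown in this notebook.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` until it succeeds or `max_attempts` is reached.

    Hypothetical helper; waits base_delay, 2*base_delay, 4*base_delay, ...
    (plus up to 10% random jitter) between attempts.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))

    # Unreachable: the loop either returns or re-raises.
    raise AssertionError("unreachable")
```

This only helps around a call that raises on failure. In the run above the errors were logged and the progress bar still reached 100%, which suggests `transcribe_pdf` does not raise for a single failed page, so a wrapper like this would most likely need to sit around the individual page request.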
