# PDFMiner Example Notebook

This notebook demonstrates how to use the `pdfminer.six` library to extract text from PDF files. The examples include basic text extraction as well as advanced text extraction with layout preservation.

## Step 1: Install `pdfminer.six`

Run the cell below to install the `pdfminer.six` library.

In [1]:
!pip install pdfminer.six


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Step 2: Import Necessary Modules

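Next, we import what the following steps use: `extract_text` and `extract_text_to_fp` from `pdfminer.high_level`, `LAParams` from `pdfminer.layout`, and `StringIO` from the standard library, which serves as an in-memory text buffer.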

In [2]:
from io import StringIO

from pdfminer.high_level import extract_text, extract_text_to_fp
from pdfminer.layout import LAParams

## Step 3: Extract Text from PDF

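Here, we define a function to extract text from a PDF file. It wraps `extract_text`, which returns the text of the whole document as a single string. When no `laparams` argument is given, `extract_text` runs layout analysis with the default `LAParams` settings.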

In [3]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Parameters:
    pdf_path (str): The path to the PDF file.

    Returns:
    str: The extracted text.
    """
    return extract_text(pdf_path)
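
When only part of a document is needed, `extract_text` also accepts a `page_numbers` argument containing zero-based page indices. The sketch below is one possible way to wrap that; the helper names `page_range_to_indices` and `extract_text_from_pages` are hypothetical and not part of `pdfminer.six`.

```python
def page_range_to_indices(first_page, last_page):
    """Convert an inclusive, 1-based page range to zero-based page indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return list(range(first_page - 1, last_page))


def extract_text_from_pages(pdf_path, first_page, last_page):
    """Extract text from an inclusive, 1-based range of pages (hypothetical helper)."""
    indices = page_range_to_indices(first_page, last_page)
    # Imported here so the pure helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path, page_numbers=indices)
```

For example, `extract_text_from_pages(pdf_path, 1, 2)` would request only the first two pages of the file.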

### Example: Extract Text from PDF

In [4]:
pdf_path = 'sample-docs/mobile-home-manual.pdf'
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)

MOBILEHOME MANUAL 

RULE 9 - 55 AND RETIRED DISCOUNT 

OHIO 
RULES 

If  the  following  criteria  are  met,  reduce  the  otherwise  applicable  Standard  or  Deluxe  Policy 
Package Premium by 10%. 

1. 

2. 

One of the Named Insureds must be age 55 or older. 

Both the Named Insured and Spouse, if any, are not presently gainfully employed full-
time or actively seeking full-time gainful employment. 

3. 

The Insured Residence must be the Principal Residence of the Applicant. 

RULE 10 - CLASSIFICATION 

Mobile Homes are classified either Class 1 or Class 2. 

1. 

Class 1 rates and premiums apply to owner-occupied one-family Mobile Home which 
meet the following requirements: 

a. 

b. 

Principal residence of occupant 

Used exclusively for residential purposes 

2. 

All other mobile homes are Class 2.  Premiums are determined by applying the factor 
shown on the Supplementary Rate Page. 

RULE 11 - PREMIUM DETERMINATION 

The premium calculations should be done in the following

## Step 4: Advanced Extraction with Layout Preservation

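This function calls the lower-level `extract_text_to_fp`, which writes its result to a file-like object and performs layout analysis only when a `laparams` argument is passed. Passing an explicit `LAParams()` makes the extractor group characters into lines and text boxes, which is useful for PDFs where the structure of the text matters, such as multi-column papers. With the default parameters, the result should match what `extract_text` returns; the benefit of this form is that the `LAParams` object can be customized.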

In [5]:
def extract_text_with_layout(pdf_path):
    """
    Extracts text from a PDF file while preserving the layout.

    Parameters:
    pdf_path (str): The path to the PDF file.

    Returns:
    str: The extracted text with layout preservation.
    """
    output_string = StringIO()
    with open(pdf_path, 'rb') as f:
        extract_text_to_fp(f, output_string, laparams=LAParams())
    return output_string.getvalue()
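
How the layout is reconstructed depends on the values passed to `LAParams`. The sketch below shows one way to experiment with them. The preset names and the non-default values are illustrative assumptions rather than recommendations, and `extract_text_with_preset` is a hypothetical helper; `char_margin`, `line_margin`, `word_margin`, and `boxes_flow` are existing `LAParams` arguments.

```python
# Hypothetical presets; names and non-default values are for illustration only.
LAYOUT_PRESETS = {
    # The pdfminer.six defaults, written out for comparison.
    "default": {
        "char_margin": 2.0,
        "line_margin": 0.5,
        "word_margin": 0.1,
        "boxes_flow": 0.5,
    },
    # A larger line_margin lets lines that are farther apart join one text box.
    "loose_paragraphs": {"line_margin": 1.0},
    # boxes_flow ranges from -1.0 (only horizontal position matters)
    # to 1.0 (only vertical position matters) when ordering text boxes.
    "vertical_order": {"boxes_flow": 1.0},
}


def extract_text_with_preset(pdf_path, preset="default"):
    """Extract text using one of the illustrative presets above."""
    settings = LAYOUT_PRESETS[preset]  # raises KeyError for an unknown preset
    # Imported here so the presets can be inspected without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams
    return extract_text(pdf_path, laparams=LAParams(**settings))
```

Comparing the output of two presets on the same file is a quick way to see which parameter affects a given document.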

### Example: Advanced Text Extraction

In [6]:
pdf_path = './sample-docs/USENIX-Example-Paper.pdf'
extracted_text_layout = extract_text_with_layout(pdf_path)
print(extracted_text_layout)

USENIX Example Paper

Pekka Nikander
Aalto University

Jane-Ellen Long
USENIX Association

Abstract
This is an example for a USENIX paper, in the form
of an HTML/CSS template. Being heavily self-ref-
erential, this template illustrates the features in-
cluded in this template. It is expected that the
prospective authors using HTML/CSS would create
a new document based on this template, remove
the content, and start writing their paper.

Note that in this template, you may have a mul-
ti-paragraph abstract. However, that it is not nec-
essarily a good practice. Try to keep your abstract
in one paragraph, and remember that the optimal
length for an abstract is 200-300 words.

1 Introduction
For the purposes of USENIX conference publica-
tions, the authors, not the USENIX staff, are solely
responsible for the content and formatting of their
paper. The purpose of this template is to help
those authors that want to use HTML/CSS to write
their papers. This template has been prepared by
Håkon