Ruby bindings for Google's PDFium project

This gem allows Ruby to efficiently extract information from PDF files.

It currently has only very rudimentary PDF editing capabilities.

API documentation is also available, and the test directory has usage examples.

Installing

The gem requires both the PDFium and FreeImage libraries.

An Ubuntu PPA is available for PDFium.

FreeImage should be installable via system packages.

In-memory rendering and extraction

# Assuming AWS::S3 is already authorized elsewhere
bucket = AWS::S3.new.buckets['my-pdfs']

pdf = PDFium::Document.from_memory bucket.objects['secrets.pdf'].read
pdf.pages.each do | page |

  # render the complete page as a PNG with the height locked to 1000 pixels
  # The width will be calculated to maintain the proper aspect ratio
  path = "secrets/page-#{page.number}.png"
  bucket.objects[path].write page.as_image(height: 1000).data(:png)

  # extract and save each embedded image as a PNG
  page.images.each do | image |
    path = "secrets/page-#{page.number}-image-#{image.index}.png"
    bucket.objects[path].write image.data(:png)
  end

  # Extract text from page.  Will be encoded as UTF-16LE by default
  path = "secrets/page-#{page.number}-text.txt"
  bucket.objects[path].write page.text

end
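
The width calculation mentioned in the comments above is plain proportional scaling. The following is a minimal sketch of that arithmetic only; `scaled_width` is a hypothetical helper, not part of the gem, and the gem's own rounding may differ.

```ruby
# Hypothetical helper: the width that keeps the page's aspect ratio
# when the rendered height is fixed.
def scaled_width(page_width, page_height, target_height)
  (page_width.to_f * target_height / page_height).round
end

# A US Letter page is 612 x 792 points; render it 1000 pixels tall.
puts scaled_width(612, 792, 1000) # prints 773
```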

Opening and saving

pdf = PDFium::Document.new("test.pdf")
pdf.save

Document information

Page count:

pdf.page_count

PDF Metadata:

pdf.metadata

Returns a hash with the keys :title, :author, :subject, :keywords, :creator, :producer, :creation_date, :mod_date
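
As a small illustration of working with that hash, the sketch below prints only the fields that are present. The `sample` hash is a hypothetical stand-in for the result of `pdf.metadata`, and `format_metadata` is a helper name invented here, not part of the gem; values are assumed to be strings or nil.

```ruby
# Keys documented for the hash returned by pdf.metadata.
METADATA_KEYS = %i[title author subject keywords creator producer creation_date mod_date].freeze

# Hypothetical helper: build one "key: value" line per non-empty field.
def format_metadata(meta)
  lines = METADATA_KEYS.map do |key|
    value = meta[key].to_s
    value.empty? ? nil : "#{key}: #{value}"
  end
  lines.compact
end

# Hypothetical stand-in for pdf.metadata.
sample = { title: "Annual Report", author: "Jane Doe", subject: "", keywords: nil }
puts format_metadata(sample)
```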

Bookmarks

# Recursively print bookmark titles, indenting children under their parent
def print_bookmarks(list, indent=0)
    list.bookmarks.each do | bm |
        print ' ' * indent
        puts bm.title
        print_bookmarks( bm.children, indent + 2 )
    end
end
print_bookmarks( pdf.bookmarks )

Render page as an image

pdf.each_page do | page |
    page.as_image(width: 800).save("test-#{page.number}.png")
end

Extract embedded images from page

doc = PDFium::Document.new("test.pdf")
page = doc.page_at(0)
page.images.each do |image|
    image.save("page-0-image-#{image.index}.png")
end

Text access

Text is returned as a UTF-16LE encoded string. Future versions may also return position information.

pdf.page_at(0).text.encode!("ASCII-8BIT")
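
Most Ruby code works with UTF-8, so it may be more convenient to transcode the extracted text with `String#encode`. The sketch below uses a hand-built UTF-16LE string as a stand-in for `page.text`, an assumption made so that it runs without PDFium installed.

```ruby
# Stand-in for the string returned by page.text (assumed to be tagged UTF-16LE).
raw = "Café, page 1".encode("UTF-16LE")

# Transcode to UTF-8, replacing any byte sequences that cannot be converted
# instead of raising an exception.
utf8 = raw.encode("UTF-8", invalid: :replace, undef: :replace)

puts utf8
```

If the string comes back tagged with a different encoding than its bytes actually use, calling `force_encoding("UTF-16LE")` before `encode` relabels it without changing the bytes.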
