
Ruby bindings for Google's PDFium project

This allows Ruby to efficiently extract information from PDF files.

It currently has only very rudimentary PDF editing capabilities.

API documentation is also available, and the test directory has examples of usage.

Installing

The gem requires both the PDFium and FreeImage libraries.

An Ubuntu PPA is available for PDFium.

FreeImage should be installable via system packages.

In memory render and extraction

# Assuming AWS::S3 is already authorized elsewhere
bucket = AWS::S3.new.buckets['my-pdfs']

pdf = PDFium::Document.from_memory bucket.objects['secrets.pdf'].read
pdf.pages.each do | page |

  # render the complete page as a PNG with the height locked to 1000 pixels
  # The width will be calculated to maintain the proper aspect ratio
  path = "secrets/page-#{page.number}.png"
  bucket.objects[path].write page.as_image(height: 1000).data(:png)

  # extract and save each embedded image as a PNG
  page.images.each do | image |
    path = "secrets/page-#{page.number}-image-#{image.index}.png"
    bucket.objects[path].write image.data(:png)
  end

  # Extract text from page.  Will be encoded as UTF-16LE by default
  path = "secrets/page-#{page.number}-text.txt"
  bucket.objects[path].write page.text

end
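The comments above note that when only the height is given, the width is calculated to keep the aspect ratio. The arithmetic behind that can be sketched as follows. This is an illustration only: `scaled_width` is a hypothetical helper, not part of the gem, and the rounding the library actually applies may differ.

```ruby
# Hypothetical helper (not part of the PDFium gem): given a page's natural
# dimensions and a target height, compute the width that keeps the ratio.
def scaled_width(page_width, page_height, target_height)
  (page_width.to_f * target_height / page_height).round
end

# A US Letter page measures 612 x 792 points.
puts scaled_width(612, 792, 1000) # prints 773
```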

Opening and saving

pdf = PDFium::Document.new("test.pdf")
pdf.save

Document information

Page count:

pdf.page_count

PDF Metadata:

pdf.metadata

Returns a hash with the keys :title, :author, :subject, :keywords, :creator, :producer, :creation_date, :mod_date
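As a rough sketch of working with such a hash, the snippet below keeps only the fields that carry a value. The `metadata` hash is a stand-in with made-up values, `present_fields` is a hypothetical helper, and the assumption that missing fields come back as nil or empty strings has not been checked against the gem.

```ruby
# Stand-in for the hash returned by pdf.metadata; all values are made up.
metadata = {
  title: "Annual Report", author: "Jane Doe", subject: "", keywords: "",
  creator: "", producer: "", creation_date: "", mod_date: ""
}

# Hypothetical helper: drop fields that are nil or empty.
def present_fields(metadata)
  metadata.reject { |_key, value| value.nil? || value.to_s.empty? }
end

present_fields(metadata).each { |key, value| puts "#{key}: #{value}" }
```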

Bookmarks

def print_bookmarks(list, indent=0)
    list.bookmarks.each do | bm |
        print ' ' * indent
        puts bm.title
        print_bookmarks( bm.children, indent + 2 )
    end
end
print_bookmarks( pdf.bookmarks )
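The same depth-first walk can be tried without a PDF at hand. The sketch below uses a hypothetical `BookmarkNode` struct in place of the gem's bookmark objects, so it only illustrates the recursion and indentation, not the gem's API.

```ruby
# Hypothetical stand-in for a bookmark: a title plus nested children.
BookmarkNode = Struct.new(:title, :children)

# Collect one indented line per bookmark, visiting children depth first.
def outline_lines(nodes, indent = 0)
  nodes.flat_map do |node|
    ["#{' ' * indent}#{node.title}"] + outline_lines(node.children, indent + 2)
  end
end

tree = [
  BookmarkNode.new("Chapter 1", [BookmarkNode.new("Section 1.1", [])]),
  BookmarkNode.new("Chapter 2", [])
]
puts outline_lines(tree)
```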

Render page as an image

pdf.each_page do | page |
    page.as_image(width: 800).save("test-#{page.number}.png")
end

Extract embedded images from page

doc = PDFium::Document.new("test.pdf")
page = doc.page_at(0)
page.images.each do |image|
    image.save("page-0-image-#{image.index}.png")
end

Text access

Text is returned as a UTF-16LE encoded string. Future versions may return position information as well.

pdf.page_at(0).text.encode!("ASCII-8BIT")
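If the text is headed for a file or terminal, converting it to UTF-8 is often more convenient. A minimal sketch using only Ruby's built-in `String#encode` follows; the `raw` string is a stand-in for what `page.text` returns, built here by hand so the snippet runs without the gem.

```ruby
# Stand-in for page.text: a UTF-16LE string containing a non-ASCII character.
raw = "H\u00E9llo PDF".encode("UTF-16LE")

# Transcode to UTF-8, substituting a replacement character for any
# invalid or unconvertible sequences instead of raising.
utf8 = raw.encode("UTF-8", invalid: :replace, undef: :replace)

puts utf8
puts utf8.encoding
```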