PDF pages > concatenate > resize > jpg (but faster) #3774

Closed

simoami opened this issue Aug 23, 2023 · 7 comments
simoami commented Aug 23, 2023

Hello, I have the following workflow to convert a PDF file to a JPEG image for web viewing. This workflow becomes slow with large PDFs (50+ pages), and I'd like to find out whether there are avenues to improve processing speed:

const allPagesWidth = 1280
const allPagesHeight = 77000 // height of all pages combined

const imageBuffer = await sharp({
  create: {
    width: allPagesWidth,
    height: allPagesHeight,
    channels: 3,
    background: { r: 255, g: 255, b: 255 },
  },
  limitInputPixels: false,
})
  .composite(images) // PDF pages to be concatenated into a single large image
  .png() // tried raw() instead to reduce conversion time, but got an "Input buffer contains unsupported image format" error
  .toBuffer()
  .then((data) => {
  
    // TIMING(composite + png + toBuffer) takes  20.072s
    
    if (
      allPagesWidth > MAX_JPEG_DIMENSION ||
      allPagesHeight > MAX_JPEG_DIMENSION
    ) {
      // if concatenated image exceeds jpeg max dimensions, apply a resize to constrain within safe size limit
      return sharp(data)
        .resize({
          width: MAX_JPEG_DIMENSION,
          height: MAX_JPEG_DIMENSION,
          fit: sharp.fit.inside,
        })
        .toBuffer();
    }
    return data;
  });
  
// TIMING(resize + toBuffer) takes 13.714s

const resultImage = await sharp(imageBuffer)
  .jpeg()
  .toBuffer({ resolveWithObject: true });

// TIMING(jpeg+toBuffer) takes 698.289ms

I added some timing checkpoints to show the timing breakdown:

composite + png + toBuffer: 20.072s
resize + toBuffer: 13.714s
jpeg + toBuffer: 698.289ms

Is there a chance some of the intermediary steps can be omitted? e.g. toBuffer() is called three times, and png() is only used to make the chained commands work even though that format isn't needed. Any tips to improve performance are welcome.

lovell commented Aug 24, 2023

The timings here suggest it's the relatively simple PNG decode to JPG encode roundtrip that is the slowest part.
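
For example, the PNG roundtrip can be skipped by keeping the intermediate image as raw pixels and describing them to the next pipeline; a minimal sketch, reusing the names from your snippet and assuming the composited image fits in memory:

const { data, info } = await sharp({
  create: {
    width: allPagesWidth,
    height: allPagesHeight,
    channels: 3,
    background: { r: 255, g: 255, b: 255 },
  },
  limitInputPixels: false,
})
  .composite(images)
  .raw() // no PNG encode; `info` resolves with width/height/channels
  .toBuffer({ resolveWithObject: true });

// The `raw` input option tells the second pipeline how to interpret the buffer,
// avoiding the "Input buffer contains unsupported image format" error
const resultImage = await sharp(data, {
  raw: { width: info.width, height: info.height, channels: info.channels },
  limitInputPixels: false,
})
  .jpeg()
  .toBuffer({ resolveWithObject: true });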

Given you're using PDF input, I presume you're also using a globally-installed libvips compiled with support for a PDF library. The choice of PNG and zlib libraries will have an impact on PNG decode time; which alternatives have you tried?

As with any performance question, a standalone repo with all code, dependencies and images that allows someone else to reproduce would be useful.

simoami commented Aug 30, 2023

@lovell Thanks for the quick reply. I'm new to sharp and just learning how to optimize it as I take over some existing customer code. We have an image processing service that spins up child processes to handle queue jobs. The service keeps restarting when processing large files (30+ page documents) with the composite chain I shared above: the cluster detects the service becoming unresponsive and spins up a new one in its place. I have yet to monitor and conclude whether it's a CPU or memory spike; I'm leaning towards the latter. Also, the service's Dockerfile doesn't have libvips explicitly installed. This runs on generic EC2 instances in AWS via EKS, if that's of interest. I also see examples of Dockerfiles with SHARP_IGNORE_GLOBAL_LIBVIPS=1, but I don't know where to read more about it and whether it's advised.

lovell commented Sep 27, 2023

@simoami Were you able to make any progress with this?

simoami commented Sep 28, 2023

Hi @lovell, yes, and sorry that my last message lacked focus. The PDF-to-image conversion I wish to implement needs to scale up to hundreds of concurrent uses. To that end, I started working on a process that limits memory usage by falling back to the file system when the number of pages is above a predefined threshold.

Conversion workflow:
[diagram: pdf to image conversion]

I will post the code and profiling charts shortly.

simoami commented Sep 28, 2023

@lovell Below is the code I wrote to implement the workflow from my previous comment, along with the corresponding CPU/memory profiles (run on a Mac M2 Max, 32 GB). The memory-based process takes ~14s, while the file-system-based process takes ~19s.
The first steps of the process (items 1, 2, 5 in the workflow diagram) are handled by another library, pdfjs-dist, as I couldn't find any PDF-specific APIs in sharp.

Timing breakdown for memory based processing:

pdfToPages: 8.768s
_getMaxWidth: 0.036ms
_resizeImages: 2.392s
_getTotalHeight: 0.034ms
sharp.composite.raw.toBuffer: 2.128s
sharp.resize.jpeg.toFile: 2.266ms

Timing breakdown for the file-based processing:

pdfToPages: 9.215s
_getMaxWidth: 0.046ms
_resizeImages: 3.154s
_getTotalHeight: 0.123ms
sharp.composite.raw.toBuffer: 4.651s
sharp.resize.jpeg.toFile: 1.494ms

Follow-up questions:

  1. Does sharp have the ability to resize a collection of images to a specific width and then concatenate them by stacking them on top of each other, all in one chained execution?
  2. Could it potentially also limit the max width and height to a specified value to allow for JPEG exports?
  3. Is it possible for sharp to stream operations in order to reduce memory usage and disk I/O?

Let me know if you see any possibility of improving this process, especially the file-based one.

Profiling

Memory-based PDF conversion (50 page PDF doc)
[chart: memory-based conversion]

File-system-based PDF conversion (50 page PDF doc)
[chart: file-system-based conversion]

Source (Partial)

/**
 * Maximum dimension allowed for JPEG images.
 */
const MAX_JPEG_DIMENSION = 65_500;

const PAGE_CONVERSION_CONCURRENCY = 3;
/**
 * threshold for saving to disk if the PDF page count exceeds this value
 */
const MIN_PAGES_FOR_DISK_SAVE = 10;

/**
 * Main method
 */
async function pdfToImage(
  pdfFileOrBuffer: string | ArrayBufferLike,
  props?: PdfToImageOptions,
): Promise<sharp.OutputInfo> {
  const pages = await _pdfToPages(pdfFileOrBuffer, props);

  const shouldSaveToDisk = pages.length >= MIN_PAGES_FOR_DISK_SAVE;
  const scale =
    props?.viewportScale !== undefined
      ? props.viewportScale
      : (PDF_CONVERSION_OPTIONS_DEFAULTS.viewportScale as number);

  // find the page with the max width and return its value. We're going to use it as the reference width for all pages.
  const maxWidth = _getMaxWidth(pages, scale);

  const imagesToConcat: sharp.OverlayOptions[] = await _resizeImages(pages, maxWidth, shouldSaveToDisk);

  // total height represents the vertical space occupied by all the pdf pages stacked up on top of each other.
  // This is after stretching pages to have the same width and keeping their aspect ratio
  const totalHeight = _getTotalHeight(imagesToConcat);

  log.info(`generate a single page from ${imagesToConcat.length} page(s) with size ${maxWidth} x ${totalHeight}`);

  const fullImage = await sharp({
    create: {
      width: maxWidth,
      height: totalHeight,
      channels: 3,
      background: { r: 255, g: 255, b: 255 },
    },
    limitInputPixels: false,
  })
    .composite(imagesToConcat)
    .raw()
    .toBuffer({ resolveWithObject: true });

  log.info(`output image produced with size ${fullImage.info.width} x ${fullImage.info.height}`);

  // If the resulting height exceeds the JPEG size limit,
  // resize the output image accordingly.
  const imageToSave: sharp.Sharp = sharp(fullImage.data, {
    raw: fullImage.info,
    // prevents error: Input image exceeds pixel limit
    limitInputPixels: false,
  });

  if (fullImage.info.height > MAX_JPEG_DIMENSION) {
    imageToSave.resize(null, MAX_JPEG_DIMENSION);

    log.info(
      `output image resized to safe jpg limits with size ${Math.round(
        (fullImage.info.width * MAX_JPEG_DIMENSION) / fullImage.info.height,
      )} x ${MAX_JPEG_DIMENSION}`,
    );
  }

  const outputFile = path.resolve(props?.outputFolder || __dirname, 'composite.jpg');

  log.info(`save image to ${outputFile}`);
  const output = imageToSave
    // Enhance text clarity for OCR
    // .normalise()
    .jpeg({ quality: 80 })
    .toFile(outputFile);
  return output;
}

async function _pdfToPages(
  pdfFileOrBuffer: string | ArrayBufferLike,
  props?: PdfToImageOptions,
): Promise<PageOutput[]> {
  try {
    const isBuffer: boolean = Buffer.isBuffer(pdfFileOrBuffer);

    const pdfFileBuffer: ArrayBuffer = isBuffer
      ? (pdfFileOrBuffer as ArrayBuffer)
      : await readFile(pdfFileOrBuffer as string);

    const canvasFactory = new NodeCanvasFactory();
    const docInitParams = _getPDFDocInitParams(props);
    docInitParams.data = new Uint8Array(pdfFileBuffer);
    docInitParams.canvasFactory = canvasFactory;

    const pdfDocument: pdfApiTypes.PDFDocumentProxy = await pdfjsLib.getDocument(docInitParams).promise;
    const pageNumbers: number[] = Array.from({ length: pdfDocument.numPages }, (_, index) => index + 1);
    const shouldSaveToDisk = pdfDocument.numPages >= MIN_PAGES_FOR_DISK_SAVE;
    let pageName = PDF_CONVERSION_OPTIONS_DEFAULTS.outputFileMask;
    
    if (props?.outputFileMask) {
      pageName = props.outputFileMask;
    }
    if (!pageName && !isBuffer) {
      pageName = path.parse(pdfFileOrBuffer as string).name;
    }

    const pageOutputs: PageOutput[] = await Bluebird.map(
      pageNumbers,
      (pageNumber) => _renderSinglePage(pdfDocument, pageNumber, pageName, canvasFactory, shouldSaveToDisk, props),
      // no concurrency if saving to disk to reduce memory usage
      { concurrency: shouldSaveToDisk ? 1 : PAGE_CONVERSION_CONCURRENCY },
    );
    await pdfDocument.cleanup();
    return pageOutputs;
  } catch (err) {
    log.error(err as Error);
    throw err;
  }
}

function _getMaxWidth(pages: PageOutput[], scale: number) {
  return Math.min(
    Math.floor(pages.reduce((previous, page) => Math.max(page.width, previous), 0) * scale),
    MAX_JPEG_DIMENSION,
  );
}

function _getTotalHeight(images: sharp.OverlayOptions[]) {
  return images.reduce((total, current) => total + (current.raw?.height ?? 0), 0);
}

async function _resizeImages(pages: PageOutput[], targetWidth: number, shouldSaveToDisk: boolean) {
  const imagesToConcat: sharp.OverlayOptions[] = [];
  let totalHeight = 0;

  for (let i = 0; i < pages.length; i++) {
    const { content, path: filePath } = pages[i];

    if (shouldSaveToDisk) {
      // resize and save image to disk
      const parsedPath = path.parse(filePath);
      const newFilePath = path.join(parsedPath.dir, `${parsedPath.name}_resized${parsedPath.ext}`);

      const resizedImage = await sharp(filePath).resize(targetWidth).jpeg({ quality: 80 }).toFile(newFilePath);
      const roundedWidth = Math.floor(resizedImage.width);
      const roundedHeight = Math.floor(resizedImage.height);

      log.info(`resized page ${pages[i].pageNumber} to ${roundedWidth} x ${roundedHeight}`);

      imagesToConcat.push({
        input: newFilePath, // composite the resized file, not the original
        // raw is redundant for file input; kept so _getTotalHeight can read the height
        raw: { width: roundedWidth, height: roundedHeight, channels: resizedImage.channels },
        left: 0,
        top: totalHeight,
        limitInputPixels: false,
      });
      totalHeight += roundedHeight;
    } else {
      // retain image as buffer
      const resizedImage = await sharp(content).resize(targetWidth).raw().toBuffer({ resolveWithObject: true });
      const roundedWidth = Math.floor(resizedImage.info.width);
      const roundedHeight = Math.floor(resizedImage.info.height);

      log.info(`resized page ${pages[i].pageNumber} to ${roundedWidth} x ${roundedHeight}`);

      imagesToConcat.push({
        input: resizedImage.data,
        raw: { width: roundedWidth, height: roundedHeight, channels: resizedImage.info.channels },
        left: 0,
        top: totalHeight,
        limitInputPixels: false,
      });
      totalHeight += roundedHeight;
    }
  }
  return imagesToConcat;
}
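
For reference, a hypothetical invocation of the entry point above (the PdfToImageOptions fields are inferred from their usage in the source):

const info = await pdfToImage('./input.pdf', {
  viewportScale: 2,         // hypothetical scale, forwarded to the pdfjs-dist viewport
  outputFolder: './output', // composite.jpg is written here
});
log.info(`composite written: ${info.width} x ${info.height}`);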

lovell commented Sep 29, 2023

Does sharp have the ability to resize a collection of images to a specific width and then concatenate them by stacking them on top of each other all in one chained execution?

Please see #1580

potentially also limiting the max width and height to a specified value to allow for jpg exports?

If I understand correctly, you could add a second pipeline to resize the concatenated image to within JPEG limits.
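
A minimal sketch, reusing the raw composite output (data, info) and your MAX_JPEG_DIMENSION constant; withoutEnlargement leaves images already within limits untouched:

const output = await sharp(data, {
  raw: { width: info.width, height: info.height, channels: info.channels },
  limitInputPixels: false,
})
  .resize({
    width: MAX_JPEG_DIMENSION,
    height: MAX_JPEG_DIMENSION,
    fit: sharp.fit.inside,    // constrain both dimensions, preserving aspect ratio
    withoutEnlargement: true, // never upscale smaller images
  })
  .jpeg({ quality: 80 })
  .toBuffer({ resolveWithObject: true });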

is it possible for sharp to stream operations in order to reduce memory usage and disk I/O?

You could experiment with random access input:

sharp(input, { sequentialRead: false })...

...which reduces disk I/O at the cost of increased memory usage. However, this question is probably about the compositing step, so #1580 is relevant again, plus #179 might also be of interest.
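
For example, reading one of the per-page files written by the _resizeImages step (an illustrative sketch; the file name is hypothetical):

// sequentialRead defaults to true in recent sharp versions; false requests
// random access, keeping more decoded data in memory but re-reading less from disk
const pageBuffer = await sharp('./pages/page_1_resized.jpg', { sequentialRead: false })
  .resize({ width: 1280 })
  .jpeg({ quality: 80 })
  .toBuffer();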

lovell commented Nov 4, 2023

I hope this information helped. Please feel free to re-open with more details if further assistance is required.

@lovell lovell closed this as completed Nov 4, 2023