PDF pages > concatenate > resize > jpg (but faster) #3774

Closed

simoami opened this issue Aug 23, 2023 · 7 comments
simoami commented Aug 23, 2023

Hello, I have the following workflow to convert a PDF file to a JPEG image for web viewing. This workflow becomes slow with large PDFs (50+ pages), and I'd like to find out whether there are avenues to improve processing speed:

const allPagesWidth = 1280
const allPagesHeight = 77000 // height of all pages combined

const imageBuffer = await sharp({
  create: {
    width: allPagesWidth,
    height: allPagesHeight,
    channels: 3,
    background: { r: 255, g: 255, b: 255 },
  },
  limitInputPixels: false,
})
  .composite(images) // PDF pages to be concatenated into a single large image
  .png() // tried raw() instead to reduce conversion time, but got an "Input buffer contains unsupported image format" error
  .toBuffer()
  .then((data) => {
  
    // TIMING(composite + png + toBuffer) takes  20.072s
    
    if (
      allPagesWidth > MAX_JPEG_DIMENSION ||
      allPagesHeight > MAX_JPEG_DIMENSION
    ) {
      // if concatenated image exceeds jpeg max dimensions, apply a resize to constrain within safe size limit
      return sharp(data)
        .resize({
          width: MAX_JPEG_DIMENSION,
          height: MAX_JPEG_DIMENSION,
          fit: sharp.fit.inside,
        })
        .toBuffer();
    }
    return data;
  });
  
// TIMING(resize + toBuffer) takes 13.714s

const resultImage = await sharp(imageBuffer)
  .jpeg()
  .toBuffer({ resolveWithObject: true });

// TIMING(jpeg+toBuffer) takes 698.289ms

I added some timing checkpoints to show the timing breakdown:

composite + png + toBuffer: 20.072s
resize + toBuffer: 13.714s
jpeg + toBuffer: 698.289ms

Is there a chance some of the intermediary steps can be omitted? e.g. toBuffer() is called three times, and png() is only used to make the chained commands work even though that format isn't needed. Any tips to improve performance are welcome.

lovell commented Aug 24, 2023

The timings here suggest it's the relatively simple PNG decode to JPG encode roundtrip that is the slowest part.
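
For example, the PNG roundtrip can be skipped by keeping the intermediate image as raw pixels and describing them to the next pipeline; a minimal sketch, reusing the names from your snippet and assuming the composited image fits in memory:

const { data, info } = await sharp({
  create: {
    width: allPagesWidth,
    height: allPagesHeight,
    channels: 3,
    background: { r: 255, g: 255, b: 255 },
  },
  limitInputPixels: false,
})
  .composite(images)
  .raw() // no PNG encode; `info` resolves with width/height/channels
  .toBuffer({ resolveWithObject: true });

// The `raw` input option tells the second pipeline how to interpret the buffer,
// avoiding the "Input buffer contains unsupported image format" error
const resultImage = await sharp(data, {
  raw: { width: info.width, height: info.height, channels: info.channels },
  limitInputPixels: false,
})
  .jpeg()
  .toBuffer({ resolveWithObject: true });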

Given you're using PDF input, I presume you're also using a globally-installed libvips compiled with support for a PDF library. The choice of PNG and zlib libraries will have an impact on PNG decode time; which alternatives have you tried?

As with any performance question, a standalone repo with all code, dependencies and images that allows someone else to reproduce would be useful.

simoami commented Aug 30, 2023

@lovell Thanks for the quick reply. I'm new to sharp and just learning how to optimize it as I take over some existing customer code. We have an image processing service that spins up child processes to handle queue jobs. The service keeps restarting when processing large files (30+ page documents) with the composite chain I shared above: the cluster detects the service becoming unresponsive and spins up a new one in its place. I have yet to monitor and conclude whether it's a CPU or memory spike; I'm leaning towards the latter. Also, the service's Dockerfile doesn't have libvips explicitly installed. This runs on generic EC2 instances in AWS via EKS, if that's of interest. I also see examples of Dockerfiles with SHARP_IGNORE_GLOBAL_LIBVIPS=1, but I don't know where to read more about it and whether it's advised.

lovell commented Sep 27, 2023

@simoami Were you able to make any progress with this?

simoami commented Sep 28, 2023

Hi @lovell, yes, and sorry that my last message lacked focus. The PDF-to-image conversion I wish to implement needs to scale up to hundreds of concurrent uses. To that end, I started working on a process that limits memory usage by falling back to the file system when the number of pages is above a predefined threshold.

Conversion workflow:
[diagram: pdf to image conversion]

I will post the code and profiling charts shortly.

simoami commented Sep 28, 2023

@lovell Below is the code I wrote to implement the workflow from my previous comment, along with the corresponding CPU/memory profiles (run on a Mac M2 Max, 32 GB). The memory-based process takes ~14s, while the file-system-based process takes ~19s.
The first steps of the process (items 1, 2, 5 in the workflow diagram) are handled by another library, pdfjs-dist, as I couldn't find any PDF-specific APIs in sharp.

Timing breakdown for memory based processing:

pdfToPages: 8.768s
_getMaxWidth: 0.036ms
_resizeImages: 2.392s
_getTotalHeight: 0.034ms
sharp.composite.raw.toBuffer: 2.128s
sharp.resize.jpeg.toFile: 2.266ms

Timing breakdown for the file-based processing:

pdfToPages: 9.215s
_getMaxWidth: 0.046ms
_resizeImages: 3.154s
_getTotalHeight: 0.123ms
sharp.composite.raw.toBuffer: 4.651s
sharp.resize.jpeg.toFile: 1.494ms

Follow-up questions:

  1. Does sharp have the ability to resize a collection of images to a specific width and then concatenate them by stacking them on top of each other, all in one chained execution?
  2. Could it potentially also limit the max width and height to a specified value to allow for JPEG exports?
  3. Is it possible for sharp to stream operations in order to reduce memory usage and disk I/O?

Let me know if you see any possibility of improving this process, especially the file-based one.

Profiling

Memory-based PDF conversion (50 page PDF doc)
[chart: memory-based conversion]

File-system-based PDF conversion (50 page PDF doc)
[chart: file-system-based conversion]

Source (Partial)

/**
 * Maximum dimension allowed for JPEG images.
 */
const MAX_JPEG_DIMENSION = 65_500;

const PAGE_CONVERSION_CONCURRENCY = 3;
/**
 * threshold for saving to disk if the PDF page count exceeds this value
 */
const MIN_PAGES_FOR_DISK_SAVE = 10;

/**
 * Main method
 */
async function pdfToImage(
  pdfFileOrBuffer: string | ArrayBufferLike,
  props?: PdfToImageOptions,
): Promise<sharp.OutputInfo> {
  const pages = await _pdfToPages(pdfFileOrBuffer, props);

  const shouldSaveToDisk = pages.length >= MIN_PAGES_FOR_DISK_SAVE;
  const scale =
    props?.viewportScale !== undefined
      ? props.viewportScale
      : (PDF_CONVERSION_OPTIONS_DEFAULTS.viewportScale as number);

  // find the page with the max width and return its value. We're going to use it as the reference width for all pages.
  const maxWidth = _getMaxWidth(pages, scale);

  const imagesToConcat: sharp.OverlayOptions[] = await _resizeImages(pages, maxWidth, shouldSaveToDisk);

  // total height represents the vertical space occupied by all the pdf pages stacked up on top of each other.
  // This is after stretching pages to have the same width and keeping their aspect ratio
  const totalHeight = _getTotalHeight(imagesToConcat);

  log.info(`generate a single page from ${imagesToConcat.length} page(s) with size ${maxWidth} x ${totalHeight}`);

  const fullImage = await sharp({
    create: {
      width: maxWidth,
      height: totalHeight,
      channels: 3,
      background: { r: 255, g: 255, b: 255 },
    },
    limitInputPixels: false,
  })
    .composite(imagesToConcat)
    .raw()
    .toBuffer({ resolveWithObject: true });

  log.info(`output image produced with size ${fullImage.info.width} x ${fullImage.info.height}`);

  // If the resulting height exceeds the JPEG size limit,
  // resize the output image accordingly.
  const imageToSave: sharp.Sharp = sharp(fullImage.data, {
    raw: fullImage.info,
    // prevents error: Input image exceeds pixel limit
    limitInputPixels: false,
  });

  if (fullImage.info.height > MAX_JPEG_DIMENSION) {
    imageToSave.resize(null, MAX_JPEG_DIMENSION);

    log.info(
      `output image resized to safe jpg limits with size ${Math.round(
        (fullImage.info.width * MAX_JPEG_DIMENSION) / fullImage.info.height,
      )} x ${MAX_JPEG_DIMENSION}`,
    );
  }

  const outputFile = path.resolve(props?.outputFolder || __dirname, 'composite.jpg');

  log.info(`save image to ${outputFile}`);
  const output = imageToSave
    // Enhance text clarity for OCR
    // .normalise()
    .jpeg({ quality: 80 })
    .toFile(outputFile);
  return output;
}

async function _pdfToPages(
  pdfFileOrBuffer: string | ArrayBufferLike,
  props?: PdfToImageOptions,
): Promise<PageOutput[]> {
  try {
    const isBuffer: boolean = Buffer.isBuffer(pdfFileOrBuffer);

    const pdfFileBuffer: ArrayBuffer = isBuffer
      ? (pdfFileOrBuffer as ArrayBuffer)
      : await readFile(pdfFileOrBuffer as string);

    const canvasFactory = new NodeCanvasFactory();
    const docInitParams = _getPDFDocInitParams(props);
    docInitParams.data = new Uint8Array(pdfFileBuffer);
    docInitParams.canvasFactory = canvasFactory;

    const pdfDocument: pdfApiTypes.PDFDocumentProxy = await pdfjsLib.getDocument(docInitParams).promise;
    const pageNumbers: number[] = Array.from({ length: pdfDocument.numPages }, (_, index) => index + 1);
    const shouldSaveToDisk = pdfDocument.numPages >= MIN_PAGES_FOR_DISK_SAVE;
    let pageName = PDF_CONVERSION_OPTIONS_DEFAULTS.outputFileMask;
    
    if (props?.outputFileMask) {
      pageName = props.outputFileMask;
    }
    if (!pageName && !isBuffer) {
      pageName = path.parse(pdfFileOrBuffer as string).name;
    }

    const pageOutputs: PageOutput[] = await Bluebird.map(
      pageNumbers,
      (pageNumber) => _renderSinglePage(pdfDocument, pageNumber, pageName, canvasFactory, shouldSaveToDisk, props),
      // no concurrency if saving to disk to reduce memory usage
      { concurrency: shouldSaveToDisk ? 1 : PAGE_CONVERSION_CONCURRENCY },
    );
    await pdfDocument.cleanup();
    return pageOutputs;
  } catch (err) {
    log.error(err as Error);
    throw err;
  }
}

function _getMaxWidth(pages: PageOutput[], scale: number) {
  return Math.min(
    Math.floor(pages.reduce((previous, page) => Math.max(page.width, previous), 0) * scale),
    MAX_JPEG_DIMENSION,
  );
}

function _getTotalHeight(images: sharp.OverlayOptions[]) {
  return images.reduce((total, current) => total + (current.raw?.height ?? 0), 0);
}

async function _resizeImages(pages: PageOutput[], targetWidth: number, shouldSaveToDisk: boolean) {
  const imagesToConcat: sharp.OverlayOptions[] = [];
  let totalHeight = 0;

  for (let i = 0; i < pages.length; i++) {
    const { content, path: filePath } = pages[i];

    if (shouldSaveToDisk) {
      // resize and save image to disk
      const parsedPath = path.parse(filePath);
      const newFilePath = path.join(parsedPath.dir, `${parsedPath.name}_resized${parsedPath.ext}`);

      const resizedImage = await sharp(filePath).resize(targetWidth).jpeg({ quality: 80 }).toFile(newFilePath);
      const roundedWidth = Math.floor(resizedImage.width);
      const roundedHeight = Math.floor(resizedImage.height);

      log.info(`resized page ${pages[i].pageNumber} to ${roundedWidth} x ${roundedHeight}`);

      imagesToConcat.push({
        input: newFilePath, // composite the resized file, not the original
        // raw is redundant for file input; kept so _getTotalHeight can read the height
        raw: { width: roundedWidth, height: roundedHeight, channels: resizedImage.channels },
        left: 0,
        top: totalHeight,
        limitInputPixels: false,
      });
      totalHeight += roundedHeight;
    } else {
      // retain image as buffer
      const resizedImage = await sharp(content).resize(targetWidth).raw().toBuffer({ resolveWithObject: true });
      const roundedWidth = Math.floor(resizedImage.info.width);
      const roundedHeight = Math.floor(resizedImage.info.height);

      log.info(`resized page ${pages[i].pageNumber} to ${roundedWidth} x ${roundedHeight}`);

      imagesToConcat.push({
        input: resizedImage.data,
        raw: { width: roundedWidth, height: roundedHeight, channels: resizedImage.info.channels },
        left: 0,
        top: totalHeight,
        limitInputPixels: false,
      });
      totalHeight += roundedHeight;
    }
  }
  return imagesToConcat;
}
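
For reference, a hypothetical invocation of the entry point above (the PdfToImageOptions fields are inferred from their usage in the source):

const info = await pdfToImage('./input.pdf', {
  viewportScale: 2,         // hypothetical scale, forwarded to the pdfjs-dist viewport
  outputFolder: './output', // composite.jpg is written here
});
log.info(`composite written: ${info.width} x ${info.height}`);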

lovell commented Sep 29, 2023

Does sharp have the ability to resize a collection of images to a specific width and then concatenate them by stacking them on top of each other all in one chained execution?

Please see #1580

potentially also limiting the max width and height to a specified value to allow for jpg exports?

If I understand correctly, you could add a second pipeline to resize the concatenated image to within JPEG limits.
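
A minimal sketch, reusing the raw composite output (data, info) and your MAX_JPEG_DIMENSION constant; withoutEnlargement leaves images already within limits untouched:

const output = await sharp(data, {
  raw: { width: info.width, height: info.height, channels: info.channels },
  limitInputPixels: false,
})
  .resize({
    width: MAX_JPEG_DIMENSION,
    height: MAX_JPEG_DIMENSION,
    fit: sharp.fit.inside,    // constrain both dimensions, preserving aspect ratio
    withoutEnlargement: true, // never upscale smaller images
  })
  .jpeg({ quality: 80 })
  .toBuffer({ resolveWithObject: true });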

is it possible for sharp to stream operations in order to reduce memory usage and disk I/O?

You could experiment with random access input:

sharp(input, { sequentialRead: false })...

...which reduces disk I/O at the cost of increased memory usage. However, this question is probably about the compositing step, so #1580 is relevant again, plus #179 might also be of interest.
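
For example, reading one of the per-page files written by the _resizeImages step (an illustrative sketch; the file name is hypothetical):

// sequentialRead defaults to true in recent sharp versions; false requests
// random access, keeping more decoded data in memory but re-reading less from disk
const pageBuffer = await sharp('./pages/page_1_resized.jpg', { sequentialRead: false })
  .resize({ width: 1280 })
  .jpeg({ quality: 80 })
  .toBuffer();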

lovell commented Nov 4, 2023

I hope this information helped. Please feel free to re-open with more details if further assistance is required.

@lovell lovell closed this as completed Nov 4, 2023