browserless

A puppeteer-like Node.js library for running headless browsers in production scenarios.

Why

Although puppeteer alone may seem to be enough, a set of use cases is worth building on top of it, and they are necessary to support robust production scenarios, such as:

  • Sensible defaults, aborting unnecessary requests based on what you are doing (e.g., aborting image requests if you just want to get the .html content).
  • Privacy by default, blocking tracker requests.
  • Easily create a pool of instances (via @browserless/pool).
  • Built-in AdBlocker (soon).

Install

browserless is built on top of puppeteer, so you need to install it as well.

$ npm install puppeteer browserless --save

You can use browserless together with puppeteer, puppeteer-core or puppeteer-firefox.

Internally, the library is divided into different packages based on functionality.

Usage

The browserless API is like puppeteer's, but it does more things under the hood (not too much, I promise).

For example, if you want to take a screenshot, just do:

const browserless = require('browserless')()

browserless
  .screenshot('http://example.com', { device: 'iPhone 6' })
  .then(buffer => {
    console.log(`your screenshot is here!`)
  })

You can see more common recipes at @browserless/examples.

API

All methods follow the same interface:

  • url (required): The target URL
  • options: Specific settings for the method (optional).

Each method returns a Promise, or invokes a Node.js-style callback if you pass an additional function as the last parameter.
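If the dual Promise/callback interface is unfamiliar, the following sketch shows one common way to build it. It is only an illustration of the pattern, not the code browserless uses: `withCallback` and `fakeHtml` are hypothetical names, and `fakeHtml` does no real navigation.

```javascript
// Hypothetical helper (not part of browserless): wraps an async function so it
// can be consumed either as a Promise or with a Node.js-style callback.
const withCallback = fn => (...args) => {
  const last = args[args.length - 1]
  // No trailing function: behave like a regular Promise-returning method.
  if (typeof last !== 'function') return fn(...args)
  const callback = args.pop()
  fn(...args).then(
    value => callback(null, value),
    error => callback(error)
  )
}

// Stand-in for a method such as .html(url, options).
const fakeHtml = withCallback(
  async (url, options = {}) => `<html data-url="${url}"></html>`
)

// Promise style
fakeHtml('https://example.com').then(html => console.log(html))

// Callback style
fakeHtml('https://example.com', (error, html) => {
  if (error) throw error
  console.log(html)
})
```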

.constructor(options)

It creates the browser instance, using the puppeteer.launch method.

// Creating a simple instance
const browserless = require('browserless')()

or pass specific launcher options:

// Creating an instance for running it at AWS Lambda
const browserless = require('browserless')({
  ignoreHTTPSErrors: true,
  args: [
    '--disable-gpu',
    '--single-process',
    '--no-zygote',
    '--no-sandbox',
    '--hide-scrollbars'
  ]
})

options

See puppeteer.launch#options.

By default, the library passes a well-known list of flags, so you probably don't need any additional setup.

timeout

type: number
default: 30000

This setting will change the default maximum navigation time.

puppeteer

type: Puppeteer
default: puppeteer|puppeteer-core|puppeteer-firefox

It's automatically detected from your dependencies; puppeteer, puppeteer-core and puppeteer-firefox are supported.

Alternatively, you can pass the module explicitly.

incognito

type: boolean
default: false

When enabled, every new page is created as an incognito page.

An incognito page will not share cookies/cache with other browser pages.
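As a rough illustration, the settings above can be combined in one options object. The values are arbitrary examples, and `createInstance` is a hypothetical wrapper that only exists so the snippet can be loaded without launching a browser.

```javascript
// Example values only; tune them for your own workload.
const options = {
  timeout: 60000, // raise the maximum navigation time from the default 30000 ms
  incognito: true // new pages don't share cookies/cache with other pages
}

// Hypothetical wrapper: the browser is only launched when this is called.
const createInstance = () =>
  require('browserless')({
    ...options,
    // pass the module explicitly instead of relying on auto-detection
    puppeteer: require('puppeteer')
  })
```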

.html(url, options)

It returns the full HTML content from the target url.

const browserless = require('browserless')()

;(async () => {
  const url = 'https://example.com'
  const html = await browserless.html(url)
  console.log(html)
})()

options

See page.goto.

Additionally, you can set up:

waitFor

type: string|function|number
default: 0

Waits for an amount of time, a selector or a function, using page.waitFor.

waitUntil

type: array
default: ['networkidle0']

Specifies the list of events to wait for before navigation is considered successful, using page.waitForNavigation.

userAgent

It sets a custom user agent, using the page.setUserAgent method.

viewport

It sets a custom viewport, using the page.setViewport method.

abortTypes

type: array
default: ['image', 'media', 'stylesheet', 'font', 'xhr']

A list of resourceType requests that can be aborted in order to make the process faster.

abortTrackers

type: boolean
default: true

It aborts requests coming from tracking domains.
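To make abortTypes more concrete, here is a minimal sketch of resource-type based aborting built on puppeteer's request interception. It is an illustration under assumptions, not the code browserless actually runs; `shouldAbort` and `abortByType` are hypothetical names.

```javascript
// Hypothetical helper: true when the request's resource type is in the abort list.
const shouldAbort = (resourceType, abortTypes = []) =>
  abortTypes.includes(resourceType)

// Hypothetical helper: wires the decision above into a puppeteer page.
const abortByType = async (page, abortTypes = []) => {
  await page.setRequestInterception(true)
  page.on('request', req => {
    if (shouldAbort(req.resourceType(), abortTypes)) return req.abort()
    return req.continue()
  })
}

const abortTypes = ['image', 'media', 'stylesheet', 'font', 'xhr']
console.log(shouldAbort('image', abortTypes)) // true
console.log(shouldAbort('document', abortTypes)) // false
```

Keeping the decision in a pure function makes it easy to check which requests would be dropped before attaching it to a real page.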

.text(url, options)

It returns the full text content from the target url.

const browserless = require('browserless')()

;(async () => {
  const url = 'https://example.com'
  const text = await browserless.text(url)
  console.log(text)
})()

options

They are the same as for the .html method.

.pdf(url, options)

It generates a PDF version of the website behind a URL.

const browserless = require('browserless')()

;(async () => {
  const url = 'https://example.com'
  const buffer = await browserless.pdf(url)
  console.log(`PDF generated!`)
})()

options

See page.pdf.

Additionally, you can set up:

media

Changes the CSS media type of the page using page.emulateMedia.

device

It generates the PDF using the settings of the named device descriptor, such as userAgent and viewport.

userAgent

It sets a custom user agent, using the page.setUserAgent method.

viewport

It sets a custom viewport, using the page.setViewport method.

.screenshot(url, options)

It takes a screenshot from the target url.

const browserless = require('browserless')()

;(async () => {
  const url = 'https://example.com'
  const buffer = await browserless.screenshot(url)
  console.log(`Screenshot taken!`)
})()

options

See page.screenshot.

Additionally, you can set up:

device

It takes the screenshot using the settings of the named device descriptor, such as userAgent and viewport.

userAgent

It sets a custom user agent, using the page.setUserAgent method.

viewport

It sets a custom viewport, using the page.setViewport method.
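Since .screenshot resolves to a buffer, a typical next step is writing it to disk. The sketch below assumes that options not handled by browserless (here `fullPage`) are forwarded to page.screenshot, as the options note above suggests; `saveScreenshot` is a hypothetical helper name.

```javascript
const fs = require('fs')

// Hypothetical helper: takes an already created browserless instance,
// captures the page and writes the resulting buffer to disk.
const saveScreenshot = async (browserless, url, filepath) => {
  const buffer = await browserless.screenshot(url, {
    device: 'iPhone 6', // device descriptor name, as in the Usage example
    fullPage: true // assumed to be forwarded to page.screenshot
  })
  await fs.promises.writeFile(filepath, buffer)
  return filepath
}
```

You would call it as `saveScreenshot(require('browserless')(), 'https://example.com', 'example.png')`.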

.devices

List of all available devices preconfigured with deviceName, viewport and userAgent settings.

These devices are used for emulation purposes.

.getDevice(deviceName)

Gets the settings of a specific device descriptor by its name.

const browserless = require('browserless')

browserless.getDevice('Macbook Pro 15')

// {
//   name: 'Macbook Pro 15',
//   userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X …',
//   viewport: {
//     width: 1440,
//     height: 900,
//     deviceScaleFactor: 1,
//     isMobile: false,
//     hasTouch: false,
//     isLandscape: false
//   }
// }
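If you work with a page directly (see the Advanced section), a descriptor like the one above can be applied by hand. This is a minimal sketch that relies only on the `userAgent` and `viewport` fields shown in the output; `applyDevice` is a hypothetical helper, not part of browserless.

```javascript
// Hypothetical helper: applies a device descriptor to a puppeteer page
// using the same page methods the userAgent and viewport options rely on.
const applyDevice = async (page, { userAgent, viewport }) => {
  await page.setUserAgent(userAgent)
  await page.setViewport(viewport)
  return page
}
```

For example, inside an async function: `await applyDevice(page, browserless.getDevice('Macbook Pro 15'))`.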

Advanced

The following methods are exposed for scenarios where you need more granular control and less magic.

.browser

It returns the internal browser instance, which is used as a singleton.

const browserless = require('browserless')()

;(async () => {
  const browserInstance = await browserless.browser
})()

.evaluate(page, response)

It exposes an interface for creating your own evaluate function, passing you the page and response.

const browserless = require('browserless')()

const getUrlInfo = browserless.evaluate((page, response) => ({
  statusCode: response.status(),
  url: response.url(),
  redirectUrls: response.request().redirectChain()
}))

;(async () => {
  const url = 'https://example.com'
  const info = await getUrlInfo(url)

  console.log(info)
  // {
  //   "statusCode": 200,
  //   "url": "https://example.com/",
  //   "redirectUrls": []
  // }
})()

Note that you don't need to close the page; that is done under the hood.

Internally the method performs a .goto.
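The behavior described above (open a page, run .goto, call your function, always close the page) can be sketched roughly as follows. This is not the library's implementation: `createEvaluate` is a hypothetical name, and it assumes .goto resolves to the response, as page.goto does.

```javascript
// Hypothetical factory: `instance` is anything exposing
// .page() and .goto(page, options), such as a browserless instance.
const createEvaluate = instance => fn => async (url, options = {}) => {
  const page = await instance.page()
  try {
    // assumption: .goto resolves to the navigation response
    const response = await instance.goto(page, { url, ...options })
    return await fn(page, response)
  } finally {
    // always close the page, even if navigation or fn throws
    await page.close()
  }
}
```

The `try`/`finally` block is what lets callers forget about closing the page.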

.goto(page, options)

It performs a smart page.goto, blocking ad tracker requests and other requests based on resourceType.

const browserless = require('browserless')()

;(async () => {
  const page = await browserless.page()
  await browserless.goto(page, {
    url: 'http://savevideo.me',
    abortTypes: ['image', 'media', 'stylesheet', 'font']
  })
})()

options

url

type: string

The target URL

abortTypes

type: array
default: []

A list of req.resourceType() values to be blocked.

abortTrackers

type: boolean
default: true

It aborts requests coming from tracking domains.

waitFor

type: string|function|number
default: 0

Waits for an amount of time, a selector or a function, using page.waitFor.

waitUntil

type: array
default: ['networkidle2', 'load', 'domcontentloaded']

Specifies the list of events to wait for before navigation is considered successful, using page.waitForNavigation.

userAgent

It sets a custom user agent, using the page.setUserAgent method.

viewport

It sets a custom viewport, using the page.setViewport method.

args

type: object

The settings to be passed to page.goto.

.page()

It returns a new standalone browser page.

const browserless = require('browserless')()

;(async () => {
  const page = await browserless.page()
})()

Pool of Instances

browserless internally uses a singleton browser instance.

You can use a pool of instances via the @browserless/pool package.

const createBrowserless = require('@browserless/pool')
const browserlessPool = createBrowserless({
  poolOpts: {
    max: 15,
    min: 2
  }
})

The API is the same as browserless, except that the constructor accepts an extra option called poolOpts.

This setting is used for initializing the pool properly. You can see what you can specify there at node-pool#opts.

Also, you can interact with a standalone browserless instance of your pool.

const createBrowserless = require('browserless')
const browserlessPool = createBrowserless.pool()

// the target URL (example value)
const url = 'https://example.com'

// get a browserless instance from the pool
browserlessPool(async browserless => {
  // get a page from the browser instance
  const page = await browserless.page()
  await browserless.goto(page, { url })
  const html = await page.content()
  console.log(html)
  process.exit()
})

You don't need to think about the acquire/release step: It's done automagically ✨.
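For the curious, the acquire/release step usually boils down to a `try`/`finally` around your function. The sketch below assumes a pool object with the `acquire()` and `release(resource)` methods that node-pool exposes; `withResource` is a hypothetical name and this is not the code @browserless/pool ships.

```javascript
// Hypothetical helper: `pool` is any object with acquire() and release(resource).
const withResource = pool => async fn => {
  const resource = await pool.acquire()
  try {
    return await fn(resource)
  } finally {
    // runs on success and on failure, so the instance always goes back to the pool
    await pool.release(resource)
  }
}
```

Releasing in `finally` means a failing task cannot leak an instance and starve the pool.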

Packages

browserless is internally divided into multiple packages, so that you only use the minimum amount of code necessary for your use case.

  • browserless
  • @browserless/pool
  • @browserless/devices
  • @browserless/goto
  • @browserless/benchmark
  • @browserless/examples

Benchmark

For testing different approaches, we include a tiny benchmark tool called @browserless/benchmark.

FAQ

Q: Why use browserless over Puppeteer?

browserless does not replace puppeteer; it complements it. It's just a layer of syntactic sugar over the official Headless Chrome, oriented toward production scenarios.

Q: Why do you block ad scripts by default?

Headless navigation is expensive compared with just fetching the content from a website.

In order to speed up the process, we block ad scripts by default because they are so bloated.

Q: My output is different from the expected

Probably browserless was too smart and it blocked a request that you need.

You can activate debug mode using the DEBUG=browserless environment variable in order to see what is happening behind the scenes:

DEBUG=browserless node index.js

Consider opening an issue with the debug trace.

Q: Can I use browserless with my AWS Lambda-like project?

Yes, check chrome-aws-lambda to set up AWS Lambda with a compatible binary.

License

browserless © Kiko Beats, Released under the MIT License.
Authored and maintained by Kiko Beats with help from contributors.

logo designed by xinh studio.

kikobeats.com · GitHub Kiko Beats · Twitter @kikobeats
