Dumps web page output, including JavaScript-generated content.
Visit here. If the server is sleeping, it takes several seconds to wake up.
Send a GET or POST request to the `/www/` path of the app address, passing the options as query parameters.
e.g.
http(s)://{app address}/www/?url=https%3A%2F%2Fexample.org
Only the `url` parameter is required. The rest are optional.
For boolean values, use `1` or `0` instead of `true` or `false`.
A URL-encoded URL to fetch.
Note: Always pass a URL-encoded value, especially when the URL to fetch has its own query parameters. Otherwise, those parameters get mixed up with the parameters of this request.
e.g.
http(s)://{app address}/www/?url=https%3A%2F%2Fgithub.com%2F
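The encoding step can be sketched in Python as follows. This is only an illustration of building the request URL on the client side: `APP_ADDRESS` and `build_request_url` are hypothetical names, not part of the app.

```python
from urllib.parse import quote, urlencode

# Hypothetical app address; replace it with your own deployment.
APP_ADDRESS = "https://dumper.example.com"


def build_request_url(target_url, **params):
    """Return a /www/ request URL with every value percent-encoded."""
    query = {"url": target_url, **params}
    # quote with safe="" encodes "?", "&", "=", ":" and "/" as well, so the
    # query string of the target URL cannot leak into the outer query string.
    return f"{APP_ADDRESS}/www/?" + urlencode(query, quote_via=quote, safe="")


request_url = build_request_url(
    "https://example.org/search?q=a&page=2", output="json"
)
print(request_url)
```

Here the `q` and `page` parameters stay inside the encoded `url` value, so only `url` and `output` are seen as parameters of the request itself.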
The output type. Accepts the following values:
- `json` (default) - outputs the site source code, the HTTP headers, the HTTP status code, and the content type as JSON with the following root keys:
  - `url` - (string) the requested URL.
  - `query` - (array) the HTTP request query key-value pairs.
  - `resourceType` - (string) the request source type.
  - `contentType` - (string) the HTTP response content type, same as the `Content-Type` HTTP header entry.
  - `status` - (integer) the HTTP status code as a number, such as `200` and `404`.
  - `headers` - (array) the HTTP headers.
  - `body` - (string) the HTTP body, usually an HTML document.
- `text`, `txt` - outputs the site source as a text document. Use this for non-HTML documents such as XML and JSON.
- `html`, `htm` - outputs the site source as `html` or `htm`. The HTTP header will be omitted.
- `mhtml` - outputs the site source as `mhtml`.
- `png`, `jpg`, `jpeg` - outputs a screenshot image of the site.
- `pdf`
Which elements to omit when the `json` output is specified. Pass a non-empty value such as `1`.
e.g. This omits the `query` and `body` elements from the response.
http(s)://{app address}/www/?url=https%3A%2F%2Fwww.google.com&output=json&omit[query]=1&omit[body]=1
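Bracketed keys such as `omit[query]` can be produced from a nested dictionary. The following Python sketch shows one way to do that on the client side. The `flatten` and `to_query` helpers are hypothetical, and leaving the brackets unescaped is a choice made only to match the examples in this document.

```python
from urllib.parse import quote


def flatten(params, prefix=""):
    """Flatten nested dicts into (key, value) pairs with bracketed keys."""
    pairs = []
    for key, value in params.items():
        name = f"{prefix}[{key}]" if prefix else str(key)
        if isinstance(value, dict):
            pairs.extend(flatten(value, name))
        else:
            pairs.append((name, str(value)))
    return pairs


def to_query(params):
    """Join the flattened pairs into a query string."""
    # Brackets in keys are kept as-is; everything else is percent-encoded.
    return "&".join(
        f"{quote(key, safe='[]')}={quote(value, safe='')}"
        for key, value in flatten(params)
    )


query = to_query({
    "url": "https://www.google.com",
    "output": "json",
    "omit": {"query": 1, "body": 1},
})
print(query)
```

The printed string is the query part of the example request above, ready to be appended after `/www/?`.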
Sets the viewport of the internal browser.
e.g.
http(s)://{app address}/www/?url=https%3A%2F%2Fwww.google.com&output=jpg&set_viewport=1&viewport[width]=800&viewport[height]=1200&viewport[deviceScaleFactor]=5
Accepts the following arguments, same as the arguments of Puppeteer's `page.setViewport()` method.
- `width` (number) page width in pixels.
- `height` (number) page height in pixels.
- `deviceScaleFactor` (number) Specify device scale factor (can be thought of as dpr). Defaults to `1`.
- `isMobile` (boolean) Whether the `meta viewport` tag is taken into account. Defaults to `false`.
- `isLandscape` (boolean) Specifies if viewport is in landscape mode. Defaults to `false`.
Does not accept the following arguments.
- `hasTouch`
Sets screenshot options. This takes effect when the `output` parameter is one of `jpg`, `jpeg`, `png`, or `gif`.
e.g.
http(s)://{app address}/www/?url=https%3A%2F%2Fgithub.com%2F&output=jpg&screenshot[quality]=10&screenshot[omitBackground]=1
http(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com%2F&output=png&screenshot[clip][x]=50&screenshot[clip][y]=80&screenshot[clip][width]=700&screenshot[clip][height]=200
Accepts the following arguments, same as the arguments of Puppeteer's `page.screenshot()` method.
- `quality` (number) The quality of the image, between 0-100. Not applicable to `png` images.
- `clip` (object) An object which specifies clipping region of the page. Should have the following fields:
  - `x` (number) x-coordinate of top-left corner of clip area
  - `y` (number) y-coordinate of top-left corner of clip area
  - `width` (number) width of clipping area
  - `height` (number) height of clipping area
- `omitBackground` (boolean) Hides default white background and allows capturing screenshots with transparency. Defaults to `false`.
Does not accept the following arguments.
- `path`
- `encoding`
- `type`
- `fullPage` - when the `clip` argument is not set, the full page screenshot will be taken.
Specifies whether to reload the page in the internal browser. This is useful for cookie-dependent web pages.
Accepts `0`, `1`, or `2`.
- `0`: does not reload the page.
- `1`: reloads only when the HTTP status is greater than or equal to `400`, such as `404` and `500`.
- `2`: reloads regardless of the HTTP status.
If a value not listed above is passed and it evaluates to true, `2` is applied.
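The rule above can be sketched as a small decision function. This is an illustration, not the app's code: `normalize_reload` and `should_reload` are hypothetical names, and treating any non-empty unlisted value as true is an assumption about how the check works.

```python
def normalize_reload(value):
    """Map a raw parameter value to one of the modes 0, 1, or 2."""
    text = str(value).strip()
    if text in ("0", "1", "2"):
        return int(text)
    # Assumption: an unlisted value counts as true when it is non-empty.
    return 2 if text else 0


def should_reload(mode, status):
    """Decide whether to reload for a given mode and HTTP status code."""
    if mode == 2:
        return True
    if mode == 1:
        return status >= 400
    return False


print(should_reload(normalize_reload("1"), 404))
print(should_reload(normalize_reload("1"), 200))
print(should_reload(normalize_reload("yes"), 200))
```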
Decides whether to use browser caches.
Accepts `1` or `0`.
The browser connection timeout in milliseconds.
If the `WPD_TIMEOUT` environment variable is set and its value is shorter than this value, the `WPD_TIMEOUT` value will be used.
Default: `29000`.
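How the requested timeout and `WPD_TIMEOUT` interact can be sketched as below. The function name `effective_timeout` is hypothetical, and the sketch only mirrors the rule stated above, not the app's actual implementation.

```python
import os

DEFAULT_TIMEOUT_MS = 29000  # the documented default


def effective_timeout(requested_ms=None, env=None):
    """Return the timeout to use, capped by WPD_TIMEOUT when it is shorter."""
    env = os.environ if env is None else env
    timeout = DEFAULT_TIMEOUT_MS if requested_ms is None else int(requested_ms)
    cap = env.get("WPD_TIMEOUT")
    if cap and int(cap) < timeout:
        return int(cap)
    return timeout


# Explicit dicts stand in for the environment to keep the example repeatable.
print(effective_timeout(60000, {"WPD_TIMEOUT": "29000"}))
print(effective_timeout(10000, {"WPD_TIMEOUT": "29000"}))
print(effective_timeout(None, {}))
```

In the first call the environment value wins because it is shorter. In the second, the requested value is already below the cap and is kept.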
Specifies a user agent.
For a site that requires basic authentication, set a user name with this parameter.
For a site that requires basic authentication, set a password with this parameter.
When the output type is set to `pdf`, the following sub-arguments of the `pdf` parameter are accepted.
The arguments are the same as Puppeteer's PDF options, except for some unsupported ones. For more details, please see Puppeteer's PDF options.
e.g.
http(s)://{app address}/www/?url=https%3A%2F%2Fgithub.com&output=pdf&pdf[scale]=0.5&pdf[printBackground]=1&pdf[pageRanges]=1-3&pdf[format]=Legal
- `scale` (number) Scale of the webpage rendering. Defaults to `1`. Scale amount must be between 0.1 and 2.
- `displayHeaderFooter` (boolean) Display header and footer. Defaults to `false`.
- `headerTemplate` (string) HTML template for the print header. Should be valid HTML markup with the following classes used to inject printing values into them:
  - `date` formatted print date
  - `title` document title
  - `url` document location
  - `pageNumber` current page number
  - `totalPages` total pages in the document
- `footerTemplate` (string) HTML template for the print footer. Should use the same format as the `headerTemplate`.
- `printBackground` (boolean) Print background graphics. Defaults to `false`.
- `landscape` (boolean) Paper orientation. Defaults to `false`.
- `pageRanges` (string) Paper ranges to print, e.g., '1-5, 8, 11-13'. Defaults to the empty string, which means print all pages.
- `format` (string) Paper format. If set, takes priority over `width` or `height` options. Defaults to 'Letter'. Accepts the following values.
  - `Letter`: 8.5in x 11in
  - `Legal`: 8.5in x 14in
  - `Tabloid`: 11in x 17in
  - `Ledger`: 17in x 11in
  - `A0`: 33.1in x 46.8in
  - `A1`: 23.4in x 33.1in
  - `A2`: 16.54in x 23.4in
  - `A3`: 11.7in x 16.54in
  - `A4`: 8.27in x 11.7in
  - `A5`: 5.83in x 8.27in
  - `A6`: 4.13in x 5.83in
- `width` (string|number) Paper width, accepts values labeled with units.
- `height` (string|number) Paper height, accepts values labeled with units.
- `margin` (object) Paper margins, defaults to none.
  - `top` (string|number) Top margin, accepts values labeled with units.
  - `right` (string|number) Right margin, accepts values labeled with units.
  - `bottom` (string|number) Bottom margin, accepts values labeled with units.
  - `left` (string|number) Left margin, accepts values labeled with units.
- `preferCSSPageSize` (boolean) Give any CSS `@page` size declared in the page priority over what is declared in `width` and `height` or `format` options. Defaults to `false`, which will scale the content to fit the paper size.
The `width`, `height`, and `margin` options accept values labeled with units. Unlabeled values are treated as pixels. All possible units are:
- `px` - pixel
- `in` - inch
- `cm` - centimeter
- `mm` - millimeter
path
(string)
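Before sending a request, the constraints listed for the `pdf` sub-arguments can be checked on the client side. The sketch below is a hypothetical helper (`check_pdf_options` is not part of the app), and it assumes that format names are compared exactly as they are listed above.

```python
import re

PAPER_FORMATS = {
    "Letter", "Legal", "Tabloid", "Ledger",
    "A0", "A1", "A2", "A3", "A4", "A5", "A6",
}
# A number, optionally followed by one of the documented units.
LENGTH = re.compile(r"^\d+(\.\d+)?(px|in|cm|mm)?$")


def check_pdf_options(options):
    """Return a list of problems found in a dict of pdf sub-arguments."""
    problems = []
    scale = options.get("scale")
    if scale is not None and not 0.1 <= float(scale) <= 2:
        problems.append("scale must be between 0.1 and 2")
    paper = options.get("format")
    if paper is not None and paper not in PAPER_FORMATS:
        problems.append(f"unknown format: {paper}")
    for key in ("width", "height"):
        value = options.get(key)
        if value is not None and not LENGTH.match(str(value)):
            problems.append(f"{key} has an unsupported unit: {value}")
    return problems


print(check_pdf_options({"scale": 0.5, "format": "Legal"}))
print(check_pdf_options({"scale": 3, "width": "10pt"}))
```

An empty list means that nothing in the checked options contradicts the documented limits. It does not guarantee that the server accepts the request.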
Additional HTTP headers sent to the page.
http(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com%2F&output=jpg&headers[Accept-Language]=en&headers[dnt]=1
Cookies to set.
Accepts a linear array holding objects with the following key-value pairs.
- `name` required
- `value` required
- `domain`
- `url`
- `path`
- `expires` Unix time in seconds.
- `httpOnly`
- `secure`
- `sameSite` <"Strict"|"Lax">
If the `domain` argument is missing, the `url` argument will be automatically set to the requested URL.
http(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com%2F&output=jpg&cookies[0][name]=foo&cookies[0][value]=bar
The `args` argument for the `puppeteer.launch()` method. For accepted arguments, please see here.
e.g.
http(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com%2F&output=jpg&args[]=--lang=en-GB
Format: scheme://username:password@ipaddress:port
For example, to set `socks4://127.0.0.1:1080`:
http(s)://{app address}/www/?url=https%3A%2F%2Fwww.google.com&output=png&proxy=socks4%3A%2F%2F127.0.0.1:1080
Blocks specified resources. This has the following sub argument keys.
- types
- urls
Specifies the types to block.
Accepted values:
- `image`
- `stylesheet`
- `font`
- `script`
When the output type is `html` or `json` and no `block` value is passed, `image`, `stylesheet`, and `font` are added by default.
http(s)://{app address}/www/?url=https%3A%2F%2Fwww.amazon.com%2Fgp%2Fgoldbox&output=png&block[types][]=script
Specifies parts of URLs to block. Use an asterisk (`*`) to match any characters.
Such as:
- `*.optimizely.com`
- `googleadservices.com`
http(s)://{app address}/www/?url=https%3A%2F%2Fwww.amazon.com%2Fgp%2Fgoldbox&output=png&block[urls][]=googleadservices.com
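To get a feel for which requests a pattern would catch, the wildcard rule can be approximated as follows. This is a sketch under the assumption that a pattern may match anywhere in the URL and that `*` stands for any run of characters. The app's actual matching may differ, and the helper names are hypothetical.

```python
import re


def compile_block_pattern(pattern):
    """Turn a pattern with "*" wildcards into a compiled regular expression."""
    # Escape the literal parts so that "." only matches a dot.
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts))


def is_blocked(url, patterns):
    """Return True when any pattern matches a part of the URL."""
    return any(compile_block_pattern(p).search(url) for p in patterns)


patterns = ["*.optimizely.com", "googleadservices.com"]
print(is_blocked("https://cdn.optimizely.com/js/1.js", patterns))
print(is_blocked("https://www.googleadservices.com/pagead/x", patterns))
print(is_blocked("https://example.org/", patterns))
```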
Determines when Puppeteer considers the page fully loaded. This is the same as the `waitUntil` parameter of the `goto()` page method. Accepted values are `load`, `domcontentloaded`, `networkidle0`, and `networkidle2`.
Default: `load`.
- `load` - consider navigation to be finished when the load event is fired.
- `domcontentloaded` - consider navigation to be finished when the DOMContentLoaded event is fired.
- `networkidle0` - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
- `networkidle2` - consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.
Performs certain actions on the loaded web page such as click, remove, type, wait for something and so on.
The action parameter must be a numeric linear array holding key-value pairs of action type and action value.
For example, the following request will perform a search on DuckDuckGo.
http(s)://{app address}/www/?url=https%3A%2F%2Fduckduckgo.com&output=png&action[0][select]=%23search_form_input_homepage&action[1][type]=Web%20Page%20Dumper&action[2][click]=%23search_button_homepage&action[3][waitForNavigation]=
Note that actions are performed sequentially. The above example is interpreted as:
[
  { select: '#search_form_input_homepage' },
  { type: 'Web Page Dumper' },
  { click: '#search_button_homepage' },
  { waitForNavigation: '' },
]
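The numbered `action[...]` keys of the request above can be generated from an ordered list, which keeps the sequence explicit. The following Python sketch assumes one action type per list entry, and `action_query` is a hypothetical helper name.

```python
from urllib.parse import quote


def action_query(actions):
    """Encode a list of {action type: value} dicts as action[i][type]=value."""
    pairs = []
    for index, action in enumerate(actions):
        for action_type, value in action.items():
            # The index preserves the order in which actions are performed.
            encoded = quote(str(value), safe="")
            pairs.append(f"action[{index}][{action_type}]={encoded}")
    return "&".join(pairs)


actions = [
    {"select": "#search_form_input_homepage"},
    {"type": "Web Page Dumper"},
    {"click": "#search_button_homepage"},
    {"waitForNavigation": ""},
]
print(action_query(actions))
```

The printed string reproduces the `action` part of the DuckDuckGo example and can be appended to the request after the `url` and `output` parameters.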
The available action types are as follows.
Selects elements specified with a selector. Use this before an action that does not have a selector parameter.
Accepts a value of selector. The selector can be XPath.
Clicks the first element found with the given selector.
Accepts a value of selector. The selector can be XPath.
This clicks on the top-right icon that expands the app panel on Google home page.
http(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com&output=jpg&action[][click]=a.gb_C
Removes elements specified with a selector.
Accepts a value of selector. The selector can be XPath.
This removes the top banner element on the Google search page. Notice that `.k1zIA` is the class selector of the banner container.
http(s)://{app address}/www/?url=https%3A%2F%2Fgoogle.com&output=png&action[][remove]=.k1zIA
Types the given characters.
This action does not accept a selector. Use the `select` action before this to specify an element for typing.
Selects an item from a `<select>` tag.
Accepts a value of selector. Note that this action does not support XPath.
Extracts elements.
This replaces the inner HTML of the HTML body tag with the extracted elements. Use this to lighten the resulting source code.
Accepts a value of selector.
Extracts elements.
Similar to the `extract` action, except that this also removes all elements in the head tag, so the styles will be lost.
Accepts a value of selector.
Waits for an element to appear.
Accepts a value of selector. The selector can be XPath.
Waits for the next page to load, used with clicking a link or submitting a form.
Waits for the given number of milliseconds.
Accepts a positive number.
On Heroku, each HTTP request must be answered within 30 seconds to avoid recurring 503 errors.
To set the timeout, use the `WPD_TIMEOUT` environment variable. It accepts a value in milliseconds, such as `29000`.
There are mainly two options:
- a) Create a file named `.env` with the following entry in the project root directory (the same location as app.js).
WPD_TIMEOUT=29000
- b) On Heroku, go to Dashboard -> (Choose your App) -> Settings -> Config Vars and add `WPD_TIMEOUT` with a value such as `29000`.
To enable access to the app's logs, set the `WPD_LOG_ROUTE` environment variable to a value that serves as the route name (part of the URL path).
There are mainly two options:
- a) Create a file named `.env` with the following entry in the project root directory (the same location as app.js).
#LOGGING
WPD_LOG_ROUTE=log
- b) On Heroku, go to Dashboard -> (Choose your App) -> Settings -> Config Vars and add `WPD_LOG_ROUTE` with a value such as `log`.
In the above examples, `log` is used as the route name. You can set any name you like.
There are four log types available: `request`, `browser`, `debug`, and `error`. If the route name is `log`, the following pages will be available.
Logs HTTP requests.
Format:
http(s)://{app address}/{log route}/request/{YYYY-MM-DD}
Example:
https://web-page-dumper.herokuapp.com/log/request/2021-06-27
Logs browser activities.
Format:
http(s)://{app address}/{log route}/browser/{YYYY-MM-DD}
Example:
https://web-page-dumper.herokuapp.com/log/browser/2021-06-27
Logs debug information.
Format:
http(s)://{app address}/{log route}/debug/{YYYY-MM-DD}
Example:
https://web-page-dumper.herokuapp.com/log/debug/2021-06-27
Logs errors.
Format:
http(s)://{app address}/{log route}/error/{YYYY-MM-DD}
Example:
https://web-page-dumper.herokuapp.com/log/error/2021-06-27
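Since all four log pages share one URL format, the address for a given day can be assembled with a small helper. This is a sketch only: `log_url` is a hypothetical function, and the app address and route are taken from the examples above.

```python
from datetime import date

LOG_TYPES = ("request", "browser", "debug", "error")


def log_url(app_address, route, log_type, day):
    """Build {app address}/{log route}/{log type}/{YYYY-MM-DD}."""
    if log_type not in LOG_TYPES:
        raise ValueError(f"unknown log type: {log_type}")
    # date.isoformat() yields the YYYY-MM-DD form used by the log pages.
    return f"{app_address}/{route}/{log_type}/{day.isoformat()}"


for log_type in LOG_TYPES:
    print(log_url("https://web-page-dumper.herokuapp.com", "log",
                  log_type, date(2021, 6, 27)))
```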
This web application is meant to run on Heroku.
- Log in to Heroku. If you don't have an account, create a Heroku account first.
- Click
- On the following page, enter your desired app name and press the `Deploy App` button, which starts the deployment.
- After the deployment finishes, click on `Manage App`.
- On the following page, click on `Open App`.
If you get the following error,
error while loading shared libraries: libnss3.so: cannot open shared object file: No such file or directory
You need to manually add the following buildpack through the Heroku UI (Dashboard -> {Your App} -> Settings -> Buildpacks).
MIT