- Module agnostic. You can use any package like node-fetch, axios or got for requests.
- Rich customization options.
- Handle different pagination types:
  - None: Single page. No pagination required.
  - TotalPages: The response contains information about the last page.
  - HasMore: The total number of pages cannot be pre-calculated.
  - Cursor: A cursor from the previous page is needed to fetch the next page.
  - List: Predefined list of links or other request options.
- Fully typed.
- Zero dependencies.
Note that this package is ESM-only and requires at least Node.js 16.0.0.
npm install scraperflow
yarn add scraperflow
A quick example of using this package to find all npm packages containing "pagination api".
import got from 'got';
import { ScraperFlow, PaginationType } from 'scraperflow';

const QUERY = 'pagination api';
const PER_PAGE = 20;

const scraper = ScraperFlow.create({
  paginationType: PaginationType.TotalPages,
  paginationStart: 0,
  concurrency: 3,
  // Initiating a global "this" context
  initThisContext: () => {
    return {
      got: got.extend({
        timeout: {
          request: 15000,
        },
        headers: {
          'X-Requested-With': 'XMLHttpRequest',
          'X-Spiferack': '1',
        },
      }),
      packages: [],
    };
  },
  // The function where we make the request
  fetchHandler: async function (_ctx, page) {
    const response = await this.got({
      url: 'https://www.npmjs.com/search',
      searchParams: {
        q: QUERY,
        page: page,
        perPage: PER_PAGE,
      },
      responseType: 'json',
      resolveBodyOnly: true,
    });
    return response;
  },
  // Here we resolve the total number of pages
  resolveTotalPages: (_ctx, response) => {
    return Math.ceil(response.total / PER_PAGE);
  },
  // And finally, here we extract the payload
  responseHandler: function (_ctx, response) {
    for (const obj of response.objects) {
      this.packages.push(obj.package);
    }
  },
});

scraper.startOnce().then(() => {
  console.log('packages:', scraper.globalContext.packages);
});
Control the scraping process.
Create a new ScraperFlow instance.
- options (ScraperFlowOptions)
Validated options.
The current running status.
Reference to the global "this" context initiated by initThisContext.
Reference to the list of flow contexts initiated by initFlowContext.
- Returns: true if started successfully, false if already running.
Start the scraper.
- Returns: Promise<void>, a promise that resolves when the cycle ends.
Start the scraper just for one cycle.
- forced: If true, the scraper stops immediately; otherwise it waits for the current cycle to end.
- Returns: Promise<void>, a promise that resolves when the scraper has finally stopped.
Stop the scraper.
Emitted when the scraper starts.
Emitted when the scraper stops.
- summary (CycleSummary)
Emitted at the end of each cycle.
- key: The option key
- msg: A warning message
Emitted when validation of optional parameters fails. In this case, the default value will be used.
- err: An error object
Emitted when a general error occurs, most likely because of a bad configuration.
In some cases, execution may be stopped due to this type of error. This can be checked by accessing the isRunning property.
- err: An error object
- args: List of arguments passed to fetchHandler
Emitted when an error occurs inside fetchHandler.
- err: An error object
Emitted when an error occurs inside resolveTotalPages, resolveHasMore, resolveCursor or resolveList.
- err: An error object
Emitted when an error occurs inside responseHandler.
- err: An error object
Emitted when an error occurs inside summaryHandler.
- None: Single page. No pagination required.
- TotalPages: The response contains information about the last page.
- HasMore: The total number of pages cannot be pre-calculated.
- Cursor: A cursor from the previous page is needed to fetch the next page.
- List: Predefined list of links or other request options.
For PaginationType.None
extends CommonOptions
- Type: PaginationType.None
Single page. No pagination required.
- Type: (ctx: TFlowContext) => TResponse | Promise<TResponse>
Scraping function.
For PaginationType.TotalPages
extends CommonOptions, ConcurrencyOptions
- Type: PaginationType.TotalPages
The response contains information about the last page.
- Type: (ctx: TFlowContext, page: number) => TResponse | Promise<TResponse>
Scraping function.
- Type: (ctx: TFlowContext, response: TResponse) => number | Promise<number>
Determine the total number of pages.
- Type: number
- Default: 1
The page number at which pagination begins.
- Type: boolean
- Default: false
Start fetching pages ahead of time, before the total number of pages is known.
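To make the relationship between the starting page number and resolveTotalPages concrete, the sketch below lists the page numbers that one cycle would request. It assumes that resolveTotalPages returns a page count (as in the quick example) and that pages are numbered consecutively from paginationStart; pageNumbers is a hypothetical helper for illustration, not part of the package.

```javascript
const PER_PAGE = 20;

// Same shape as the resolveTotalPages handler from the quick example.
const resolveTotalPages = (_ctx, response) => Math.ceil(response.total / PER_PAGE);

// Hypothetical helper: enumerate the page numbers of one cycle,
// assuming consecutive numbering that begins at paginationStart.
function pageNumbers(paginationStart, totalPages) {
  return Array.from({ length: totalPages }, (_, i) => paginationStart + i);
}

// 45 items at 20 per page need 3 pages.
const totalPages = resolveTotalPages({}, { total: 45 });
const zeroBased = pageNumbers(0, totalPages); // API that counts pages from 0
const oneBased = pageNumbers(1, totalPages); // API that counts pages from 1
console.log(zeroBased, oneBased);
```

The quick example sets paginationStart to 0 because the npm search endpoint it queries counts pages from zero.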
For PaginationType.HasMore
extends CommonOptions, ConcurrencyOptions
- Type: PaginationType.HasMore
The total number of pages cannot be pre-calculated.
- Type: (ctx: TFlowContext, page: number) => TResponse | Promise<TResponse>
Scraping function.
- Type: (ctx: TFlowContext, response: TResponse) => boolean | Promise<boolean>
Determine if there is at least one more page.
- Type: number
- Default: 1
The page number at which pagination begins.
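The handler contract for this pagination type can be sketched with an in-memory data source. The driver loop (collectAll) is a hypothetical stand-in for what the package does internally, and the response shape (items, next) is invented for the example; only the handler signatures come from the types above.

```javascript
// In-memory stand-in for a remote API: three pages of items.
const PAGES = [['a', 'b'], ['c', 'd'], ['e']];

// fetchHandler receives the flow context and the page number.
const fetchHandler = (_ctx, page) => ({
  items: PAGES[page - 1] ?? [],
  next: page < PAGES.length,
});

// resolveHasMore reports whether at least one more page exists.
const resolveHasMore = (_ctx, response) => response.next;

// Hypothetical sequential driver; the package schedules the real requests.
function collectAll(paginationStart = 1) {
  const items = [];
  let page = paginationStart;
  let hasMore = true;
  while (hasMore) {
    const response = fetchHandler({}, page);
    items.push(...response.items);
    hasMore = resolveHasMore({}, response);
    page += 1;
  }
  return items;
}

const allItems = collectAll();
console.log(allItems);
```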
For PaginationType.Cursor
extends CommonOptions
- Type: PaginationType.Cursor
A cursor from the previous page is needed to fetch the next page.
Because of that, concurrency options are not available for this method.
- Type: (ctx: TFlowContext, cursor?: TCursor) => TResponse | Promise<TResponse>
Scraping function.
- Type: (ctx: TFlowContext, response: TResponse) => TCursor | Promise<TCursor | null | undefined> | null | undefined
Resolve cursor for the next page.
A nullish cursor should be returned for the last page.
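A minimal sketch of the cursor contract follows, using an in-memory data source. The walk function is a hypothetical driver written for illustration, and the response shape (items, nextCursor) is an assumption; what it demonstrates is that the first call receives no cursor and that a nullish cursor ends the walk.

```javascript
// In-memory stand-in for a cursor-based API.
const DATA = {
  start: { items: [1, 2], nextCursor: 'c1' },
  c1: { items: [3, 4], nextCursor: 'c2' },
  c2: { items: [5], nextCursor: null },
};

// The cursor argument is undefined on the first call.
const fetchHandler = (_ctx, cursor) => DATA[cursor ?? 'start'];

// Returning null (or undefined) marks the last page.
const resolveCursor = (_ctx, response) => response.nextCursor;

// Hypothetical driver: each request depends on the previous response,
// which is why this pagination type cannot run pages in parallel.
function walk() {
  const items = [];
  const cursorsSeen = [];
  let cursor;
  do {
    const response = fetchHandler({}, cursor);
    items.push(...response.items);
    cursor = resolveCursor({}, response);
    if (cursor != null) cursorsSeen.push(cursor);
  } while (cursor != null);
  return { items, cursorsSeen };
}

const result = walk();
console.log(result);
```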
For PaginationType.List
extends CommonOptions, ConcurrencyOptions
- Type: PaginationType.List
Predefined list of links or other request options.
- Type: (ctx: TFlowContext, item: TPageListItem) => TResponse | Promise<TResponse>
Scraping function.
- Type: () => TPageListItem[] | Promise<TPageListItem[]>
Determine the list of links or other request options.
Will be called before each cycle.
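As a small illustration, the list items can be any value the fetch handler knows how to turn into a request. In the sketch below the item shape (url, label) and the example.com addresses are assumptions, and the handler only parses the URL instead of making a network call so that the snippet stays self-contained.

```javascript
// resolveList returns the items to visit in the coming cycle.
const resolveList = () => [
  { url: 'https://example.com/catalog/a', label: 'a' },
  { url: 'https://example.com/catalog/b', label: 'b' },
];

// fetchHandler receives one list item per call.
// A real handler would perform the HTTP request for item.url here.
const fetchHandler = (_ctx, item) => ({
  label: item.label,
  path: new URL(item.url).pathname,
});

// Hypothetical driver: one handler call per list item.
const responses = resolveList().map((item) => fetchHandler({}, item));
console.log(responses);
```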
- Type: number | [number, number] | ((ctx?: TFlowContext) => number)
- Default: [1000, 2000]
Interval between flow executions.
Can be defined as a static number, a [min, max] range or a function.
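The three accepted forms can be reduced to a concrete delay as sketched below. resolveInterval is a hypothetical helper, and picking a uniformly random value from the range is an assumption about how a [min, max] pair is interpreted; values are taken to be milliseconds, as in the interval strategy description.

```javascript
// Hypothetical helper: turn an interval option into a delay in milliseconds.
function resolveInterval(interval, ctx) {
  if (typeof interval === 'function') {
    return interval(ctx); // computed per call, optionally from the flow context
  }
  if (Array.isArray(interval)) {
    const [min, max] = interval;
    return min + Math.random() * (max - min); // assumption: uniform random pick
  }
  return interval; // static number
}

console.log(resolveInterval(500));
console.log(resolveInterval([1000, 2000]));
console.log(resolveInterval((ctx) => (ctx.slow ? 5000 : 1000), { slow: true }));
```

A range is useful when requests should not arrive at a perfectly regular rhythm, while the function form lets each flow context choose its own pace.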
- Type: 'dynamic' | 'fixed'
- Default: 'dynamic'
Interval strategy for flows.
dynamic:
Interval is counted from the start of the previous execution.
If execution takes longer than the interval, the next execution will be delayed until the previous one has finished.
For example, if the interval is 1000ms and execution takes 750ms, the next execution will start after 250ms.
fixed:
Interval is counted from the end of the previous execution.
In this case the interval is fixed and execution time does not affect it.
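The difference between the two strategies comes down to how long the scraper waits after an execution finishes. The sketch below expresses that as a hypothetical helper, nextDelay, which follows the description above and is not the package's own code.

```javascript
// Hypothetical helper: time to wait after an execution ends, in milliseconds.
function nextDelay(strategy, interval, executionTime) {
  if (strategy === 'fixed') {
    // Counted from the end of the previous execution.
    return interval;
  }
  // 'dynamic': counted from the start of the previous execution,
  // never negative when the execution overran the interval.
  return Math.max(0, interval - executionTime);
}

console.log(nextDelay('dynamic', 1000, 750)); // 250
console.log(nextDelay('dynamic', 1000, 1200)); // 0
console.log(nextDelay('fixed', 1000, 750)); // 1000
```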
- Type: number | [number, number] | (() => number)
- Default: undefined
Interval between scraping cycles. Falls back to interval if not defined.
Can be defined as a static number, a [min, max] range or a function.
- Type: 'dynamic' | 'fixed'
- Default: 'fixed'
Interval strategy for cycles.
Same as intervalStrategy, but for cycles.
- Type: (oldCtx?: TThisContext) => TThisContext
- Default: undefined
Initialize global "this" context object.
If the function throws, execution will be stopped!
- Type: boolean
- Default: false
Reset global "this" context at the start of each cycle.
- Type: (oldCtx?: TFlowContext) => TFlowContext | TFlowContext[]
- Default: undefined
Initialize flow context object.
If an array is returned, the concurrency will be equal to the length of the array, ignoring the concurrency option.
For pagination types where the concurrency option is not available, only one context at a time will be used.
If the function throws or returns an empty array, execution will be stopped!
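One way to read the array rule is that each returned context becomes one parallel flow. The sketch below gives every flow its own proxy; the proxy field, the example addresses and the effectiveConcurrency helper are all hypothetical and only mirror the rule described above.

```javascript
// One context per proxy: each flow would send its requests through its own proxy.
const initFlowContext = () => [
  { proxy: 'http://proxy-1.example:8080' },
  { proxy: 'http://proxy-2.example:8080' },
  { proxy: 'http://proxy-3.example:8080' },
];

// Hypothetical helper mirroring the documented rule.
function effectiveConcurrency(initResult, concurrencyOption) {
  if (!Array.isArray(initResult)) {
    return concurrencyOption; // single context: the concurrency option applies
  }
  if (initResult.length === 0) {
    throw new Error('initFlowContext returned an empty array');
  }
  return initResult.length; // array: its length wins over the option
}

console.log(effectiveConcurrency(initFlowContext(), 1)); // 3
```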
- Type: boolean
- Default: false
Reset flow contexts at the start of each cycle.
- Type: (ctx: TFlowContext, response: TResponse) => void | Promise<void>
- Default: undefined
Response handling.
Can and should be used to handle responses, without worrying about errors that might abort the cycle.
- Type: (summary: CycleSummary) => void | Promise<void>
- Default: undefined
Cycle summary. Called at the end of each cycle.
Error Handling Policy.
- Type: number
- Default: 2
Extra attempts to fetch the page.
This does not include the first attempt. Set to 0 to disable retries.
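Because the count excludes the first attempt, the default of 2 allows up to three calls per page. The sketch below illustrates that arithmetic with a hypothetical synchronous helper, fetchWithRetries; it is not how the package implements retries, and the parameter name extraAttempts is invented for the example.

```javascript
// Hypothetical helper: try once, then up to `extraAttempts` more times.
function fetchWithRetries(fetchPage, extraAttempts) {
  let lastError;
  for (let attempt = 0; attempt <= extraAttempts; attempt += 1) {
    try {
      return { value: fetchPage(), attempts: attempt + 1 };
    } catch (err) {
      lastError = err; // remember the failure and try again if attempts remain
    }
  }
  throw lastError; // every attempt failed
}

// A page that fails twice before succeeding.
let calls = 0;
const flaky = () => {
  calls += 1;
  if (calls < 3) throw new Error('temporary failure');
  return 'ok';
};

const outcome = fetchWithRetries(flaky, 2);
console.log(outcome); // { value: 'ok', attempts: 3 }
```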
- Type: boolean
- Default: true
Try to fetch the failed page through different contexts.
- Type: boolean
- Default: false
Skip the failed page if possible, after reaching the retry limit.
If set to true, it is recommended to use the maxTotalPageFails and/or maxConsecutivePageFails options.
This option is not valid for the Cursor pagination type.
- Type: number
- Default: Infinity
Maximum total number of failed pages before the cycle is ended.
A value equal to or less than 0 will end the cycle immediately after the first failed page.
- Type: number
- Default: Infinity
Maximum number of consecutive failed pages before the cycle is ended.
A value equal to or less than 0 will end the cycle immediately after the first failed page.
- Type: boolean | ('validationWarning' | 'generalError' | 'fetchError' | 'resolveError' | 'responseHandleError' | 'summaryHandleError')[]
- Default: ['validationWarning', 'generalError']
Which types of logs are printed to the console.
Set to true to enable all logs, or false to disable them.
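The boolean and array forms can be read as a simple filter over log types. shouldLog below is a hypothetical helper that mirrors that rule; the log type names are taken from the type definition above.

```javascript
// Hypothetical helper: decide whether a given log type is printed.
function shouldLog(option, type) {
  if (typeof option === 'boolean') {
    return option; // true enables every type, false disables all of them
  }
  return option.includes(type); // array: only the listed types
}

const defaults = ['validationWarning', 'generalError'];
console.log(shouldLog(defaults, 'generalError')); // true
console.log(shouldLog(defaults, 'fetchError')); // false
console.log(shouldLog(true, 'fetchError')); // true
```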
- Type: number
- Default: 1
Number of parallel flows for scraping.
Can be changed later to dynamically regulate the concurrency.
- Type: boolean
- Default: true
Remove the contexts of redundant flows at the start of each cycle.
Used to regulate the flow context lifecycle in case of dynamic concurrency.
- completed (boolean): Whether the cycle was completed or aborted.
- stats: Cycle stats.
  - totalPageCount (number): Total number of handled pages.
  - failedPageList (number[]): List of failed pages.
  - totalErrorCount (number): Total number of errors thrown by fetch or resolve handlers.
- timings: Information about the execution time.
  - startedAt (number): Unix timestamp for the start of this cycle, in ms.
  - total (number): Total execution time, in ms.
  - avg: Information about the average page handling time.
    - all (number): Average handling time for both successful and failed pages, in ms.
    - successful (number): Average handling time for successful pages, in ms.
    - failed (number): Average handling time for failed pages, in ms.
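A summary handler might condense these fields into a single log line, as in the sketch below. It assumes that stats and timings are top-level fields of the summary with avg nested under timings; formatSummary and the sample values are invented for illustration.

```javascript
// Hypothetical formatter for a cycle summary object.
function formatSummary(summary) {
  const { completed, stats, timings } = summary;
  const failed = stats.failedPageList;
  return [
    completed ? 'completed' : 'aborted',
    `${stats.totalPageCount} pages handled`,
    `${failed.length} failed${failed.length ? ` (${failed.join(', ')})` : ''}`,
    `${stats.totalErrorCount} errors`,
    `${(timings.total / 1000).toFixed(1)}s total`,
    `${Math.round(timings.avg.all)}ms avg`,
  ].join(', ');
}

// A handler with the summaryHandler signature that logs the line.
const summaryHandler = (summary) => {
  console.log(formatSummary(summary));
};

// Sample data, not real results.
const sample = {
  completed: true,
  stats: { totalPageCount: 12, failedPageList: [4, 9], totalErrorCount: 5 },
  timings: {
    startedAt: 1700000000000,
    total: 18400,
    avg: { all: 1533.3, successful: 1400, failed: 2200 },
  },
};

summaryHandler(sample);
```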