crawlee-one

Table of contents

Interfaces

Type Aliases

Variables

Functions

Type Aliases

AllActorInputs

Ƭ AllActorInputs: InputActorInput & CrawlerConfigActorInput & PerfActorInput & StartUrlsActorInput & LoggingActorInput & ProxyActorInput & PrivacyActorInput & RequestActorInput & OutputActorInput & MetamorphActorInput

Defined in

src/lib/input.ts:17


ApifyCrawleeOneIO

Ƭ ApifyCrawleeOneIO: CrawleeOneIO<ApifyEnv, ApifyErrorReport, ApifyEntryMetadata>

Integration between CrawleeOne and Apify.

This is the default integration.

Defined in

src/lib/integrations/apify.ts:39


ArrVal

Ƭ ArrVal<T>: T[number]

Unwrap Array to its item(s)

Type parameters

Name Type
T extends any[] | readonly any[]

Defined in

src/utils/types.ts:9
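
Example

A minimal sketch of how ArrVal turns a readonly tuple into a union of its items. The alias is re-declared locally so the snippet has no dependencies; LOG_LEVELS and isLevel are illustrative names and not part of crawlee-one.

```typescript
// Local copy of the alias as documented above.
type ArrVal<T extends any[] | readonly any[]> = T[number];

// Same values as the LOG_LEVEL constant documented in this file.
const LOG_LEVELS = ["debug", "info", "warn", "error", "off"] as const;

// Resolves to the union "debug" | "info" | "warn" | "error" | "off".
type Level = ArrVal<typeof LOG_LEVELS>;

// Hypothetical type guard that narrows an arbitrary string to Level.
function isLevel(value: string): value is Level {
  return (LOG_LEVELS as readonly string[]).indexOf(value) !== -1;
}

const picked: Level = "warn";
console.log(isLevel(picked), isLevel("verbose")); // prints: true false
```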


CaptureError

Ƭ CaptureError: (input: CaptureErrorInput) => MaybePromise<void>

Type declaration

▸ (input): MaybePromise<void>

Parameters
Name Type
input CaptureErrorInput
Returns

MaybePromise<void>

Defined in

src/lib/error/errorHandler.ts:24


CaptureErrorInput

Ƭ CaptureErrorInput: PickRequired<Partial<CrawleeOneErrorHandlerInput>, "error">

Defined in

src/lib/error/errorHandler.ts:23


CrawleeOneActorDefWithInput

Ƭ CrawleeOneActorDefWithInput<T>: Omit<CrawleeOneActorDef<T>, "input"> & { input: T["input"] | null ; state: Record<string, unknown> }

CrawleeOneActorDef object where the input is already resolved

Type parameters

Name Type
T extends CrawleeOneCtx

Defined in

src/lib/actor/types.ts:280


CrawleeOneActorRouterCtx

Ƭ CrawleeOneActorRouterCtx<T>: Object

Context passed from actor to route handlers

Type parameters

Name Type
T extends CrawleeOneCtx

Type declaration

Name Type Description
actor CrawleeOneActorInst<T> -
metamorph Metamorph Trigger actor metamorph, using actor's inputs as defaults.
pushData <T>(oneOrManyItems: T | T[], options: PushDataOptions<T>) => Promise<any[]> Actor.pushData with extra optional features:
  • Limit the number of entries pushed to the Dataset based on the Actor input.
  • Transform and filter entries via Actor input.
  • Add metadata to entries before they are pushed to the Dataset.
  • Set which (nested) properties are personal data and optionally redact them for privacy compliance.
pushRequests <T>(oneOrManyItems: T | T[], options?: PushRequestsOptions<T>) => Promise<any[]> Similar to Actor.openRequestQueue().addRequests, but with extra features:
  • Limit the max size of the RequestQueue. No requests are added when the RequestQueue is at or above the limit.
  • Transform and filter requests. Requests that do not pass the filter are not added to the RequestQueue.

Defined in

src/lib/actor/types.ts:75
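
Example

The pushRequests behaviour described above (size limit, transform, filter) can be sketched with an in-memory array standing in for the RequestQueue. This is not the library's implementation: SketchRequest, SketchPushOptions and pushRequestsSketch are hypothetical names, and the order of transform and filter as well as the trimming of a batch to the remaining capacity are assumptions.

```typescript
// Hypothetical, simplified request shape used only for this sketch.
interface SketchRequest {
  url: string;
  label?: string;
}

interface SketchPushOptions {
  maxSize?: number; // stand-in for the queue size limit
  transform?: (req: SketchRequest) => SketchRequest;
  filter?: (req: SketchRequest) => boolean;
}

// In-memory stand-in for a RequestQueue. Returns the requests that were added.
function pushRequestsSketch(
  queue: SketchRequest[],
  oneOrMany: SketchRequest | SketchRequest[],
  options: SketchPushOptions = {},
): SketchRequest[] {
  const maxSize = options.maxSize ?? Infinity;
  const transform = options.transform ?? ((req: SketchRequest) => req);
  const filter = options.filter ?? (() => true);

  // No requests are added when the queue is at or above the limit.
  if (queue.length >= maxSize) return [];

  const incoming = Array.isArray(oneOrMany) ? oneOrMany : [oneOrMany];
  // Assumption: transform runs before filter, and the batch is cut to the free capacity.
  const accepted = incoming
    .map(transform)
    .filter(filter)
    .slice(0, maxSize - queue.length);
  queue.push(...accepted);
  return accepted;
}

const sketchQueue: SketchRequest[] = [{ url: "https://example.com/a" }];
const added = pushRequestsSketch(
  sketchQueue,
  [{ url: "https://example.com/b" }, { url: "https://example.com/skip" }],
  { maxSize: 3, filter: (req) => !req.url.endsWith("/skip") },
);
console.log(added.length, sketchQueue.length); // prints: 1 2
```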


CrawleeOneHookCtx

Ƭ CrawleeOneHookCtx<T>: Pick<CrawleeOneActorInst<T>, "input" | "state"> & { io: T["io"] ; itemCacheKey: typeof itemCacheKey ; sendRequest: typeof gotScraping }

Context passed to user-defined functions that are supplied via the actor input

Type parameters

Name Type
T extends CrawleeOneCtx

Defined in

src/lib/actor/types.ts:104


CrawleeOneHookFn

Ƭ CrawleeOneHookFn<TArgs, TReturn, T>: (...args: [...TArgs, CrawleeOneHookCtx<T>]) => MaybePromise<TReturn>

Type parameters

Name Type
TArgs extends any[] = []
TReturn void
T extends CrawleeOneCtx = CrawleeOneCtx

Type declaration

▸ (...args): MaybePromise<TReturn>

Parameters
Name Type
...args [...TArgs, CrawleeOneHookCtx<T>]
Returns

MaybePromise<TReturn>

Defined in

src/lib/actor/types.ts:129
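
Example

The signature above appends the hook context as the last argument after the hook's own arguments. A minimal sketch of that pattern, assuming a trimmed-down context: SketchHookCtx, SketchHookFn, keepCheapItems and countState are illustrative names and not part of crawlee-one.

```typescript
type MaybePromise<T> = T | Promise<T>;

// Hypothetical, trimmed-down stand-in for CrawleeOneHookCtx.
interface SketchHookCtx {
  input: { outputMaxEntries?: number } | null;
  state: Record<string, unknown>;
}

// Same shape as the documented alias, with the context type fixed to the stand-in.
type SketchHookFn<TArgs extends any[] = [], TReturn = void> = (
  ...args: [...TArgs, SketchHookCtx]
) => MaybePromise<TReturn>;

// A filter-style hook: receives one entry, then the context as the last argument.
const keepCheapItems: SketchHookFn<[{ price: number }], boolean> = (item, ctx) => {
  ctx.state.seen = ((ctx.state.seen as number | undefined) ?? 0) + 1;
  return item.price < 100;
};

// A hook without leading arguments receives only the context.
const countState: SketchHookFn<[], number> = (ctx) => Object.keys(ctx.state).length;
```

Because the context always comes last, the same context type can be shared by hooks that take different leading arguments.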


CrawleeOneRouteCtx

Ƭ CrawleeOneRouteCtx<T, RouterCtx>: Parameters<Parameters<CrawlerRouter<T["context"] & RouterCtx>["addHandler"]>[1]>[0]

Context object provided in CrawlerRouter

Type parameters

Name Type
T extends CrawleeOneCtx
RouterCtx extends Record<string, any> = {}

Defined in

src/lib/router/types.ts:7


CrawleeOneRouteHandler

Ƭ CrawleeOneRouteHandler<T, RouterCtx>: Parameters<CrawlerRouter<T["context"] & RouterCtx>["addHandler"]>[1]

Function that's passed to router.addHandler(label, handler)

Type parameters

Name Type
T extends CrawleeOneCtx
RouterCtx extends Record<string, any> = CrawleeOneRouteCtx<T>

Defined in

src/lib/router/types.ts:13


CrawleeOneRouteMatcher

Ƭ CrawleeOneRouteMatcher<T, RouterCtx>: MaybeArray<RegExp | CrawleeOneRouteMatcherFn<T, RouterCtx>>

Function or RegExp that checks if the CrawleeOneRoute this Matcher belongs to should handle the given request.

If the Matcher returns a truthy value, the request is passed to the action function of the same CrawleeOneRoute.

The Matcher can be:

  • Regular expression
  • Function
  • Array of <RegExp | Function>

Type parameters

Name Type
T extends CrawleeOneCtx
RouterCtx extends Record<string, any> = CrawleeOneRouteCtx<T>

Defined in

src/lib/router/types.ts:56
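
Example

A minimal sketch of how a Matcher in any of the three forms listed above could be evaluated against a URL. It is simplified: the real matcher function also receives ctx, route and routes, and treating an array as "any of" is an assumption. SketchMatcher and matchesUrl are hypothetical names.

```typescript
type MaybeArray<T> = T | T[];

// Hypothetical simplification of the matcher function: URL only.
type SketchMatcherFn = (url: string) => unknown;
type SketchMatcher = MaybeArray<RegExp | SketchMatcherFn>;

// Returns true if any matcher accepts the URL.
function matchesUrl(matcher: SketchMatcher, url: string): boolean {
  const matchers = Array.isArray(matcher) ? matcher : [matcher];
  return matchers.some((m) => (m instanceof RegExp ? m.test(url) : Boolean(m(url))));
}

const detailMatcher: SketchMatcher = [
  /\/product\/\d+$/,
  (url) => url.includes("?sku="),
];

console.log(matchesUrl(detailMatcher, "https://example.com/product/42")); // prints: true
console.log(matchesUrl(detailMatcher, "https://example.com/about")); // prints: false
```

Regular expressions created with the g flag keep state between test calls, so matchers like these are usually written without it.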


CrawleeOneRouteMatcherFn

Ƭ CrawleeOneRouteMatcherFn<T, RouterCtx>: (url: string, ctx: CrawleeOneRouteCtx<T, RouterCtx>, route: CrawleeOneRoute<T, RouterCtx>, routes: Record<T["labels"], CrawleeOneRoute<T, RouterCtx>>) => unknown

Type parameters

Name Type
T extends CrawleeOneCtx
RouterCtx extends Record<string, any> = CrawleeOneRouteCtx<T>

Type declaration

▸ (url, ctx, route, routes): unknown

Function variant of Matcher. It checks whether the CrawleeOneRoute this Matcher belongs to should handle the given request.

If the Matcher returns a truthy value, the request is passed to the action function of the same CrawleeOneRoute.

Parameters
Name Type
url string
ctx CrawleeOneRouteCtx<T, RouterCtx>
route CrawleeOneRoute<T, RouterCtx>
routes Record<T["labels"], CrawleeOneRoute<T, RouterCtx>>
Returns

unknown

Defined in

src/lib/router/types.ts:68


CrawleeOneRouteWrapper

Ƭ CrawleeOneRouteWrapper<T, RouterCtx>: (handler: (ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> | Awaitable<void>) => MaybePromise<(ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> | Awaitable<void>>

Type parameters

Name Type
T extends CrawleeOneCtx
RouterCtx extends Record<string, any> = CrawleeOneRouteCtx<T>

Type declaration

▸ (handler): MaybePromise<(ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> | Awaitable<void>>

Wrapper that modifies behavior of CrawleeOneRouteHandler

Parameters
Name Type
handler (ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> | Awaitable<void>
Returns

MaybePromise<(ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> | Awaitable<void>>

Defined in

src/lib/router/types.ts:19
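
Example

A wrapper takes a route handler and returns a new handler with modified behaviour. The sketch below records each visited URL before delegating to the original handler. It assumes a trimmed-down context and a synchronous wrapper (the documented type also allows the wrapper to return a Promise); SketchRouteCtx, SketchHandler, SketchWrapper and recordVisits are hypothetical names.

```typescript
// Hypothetical, trimmed-down route context; the real one is the Crawlee crawling context.
interface SketchRouteCtx {
  url: string;
}
type SketchHandler = (ctx: SketchRouteCtx) => Promise<void> | void;
type SketchWrapper = (handler: SketchHandler) => SketchHandler;

const visited: string[] = [];

// A wrapper that records each URL before delegating to the original handler.
const recordVisits: SketchWrapper = (handler) => (ctx) => {
  visited.push(ctx.url);
  return handler(ctx);
};

const wrappedHandler = recordVisits((ctx) => {
  console.log(`handling ${ctx.url}`);
});
wrappedHandler({ url: "https://example.com" });
```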


CrawlerConfigActorInput

Ƭ CrawlerConfigActorInput: Pick<CheerioCrawlerOptions, "navigationTimeoutSecs" | "ignoreSslErrors" | "additionalMimeTypes" | "suggestResponseEncoding" | "forceResponseEncoding" | "requestHandlerTimeoutSecs" | "maxRequestRetries" | "maxRequestsPerCrawl" | "maxRequestsPerMinute" | "minConcurrency" | "maxConcurrency" | "keepAlive">

Crawler config fields that can be overridden from the actor input

Defined in

src/lib/input.ts:29


CrawlerType

Ƭ CrawlerType: ArrVal<typeof CRAWLER_TYPE>

Defined in

src/types/index.ts:35


CrawlerUrl

Ƭ CrawlerUrl: NonNullable<Parameters<OrigRunCrawler<any>>[0]>[0]

URL string or object passed to Crawler.run

Defined in

src/types/index.ts:80


ExtractErrorHandlerOptionsReport

Ƭ ExtractErrorHandlerOptionsReport<T>: T extends CrawleeOneErrorHandlerOptions<infer U> ? ExtractIOReport<U> : never

Type parameters

Name Type
T extends CrawleeOneErrorHandlerOptions<any>

Defined in

src/lib/integrations/types.ts:322


ExtractIOReport

Ƭ ExtractIOReport<T>: T extends CrawleeOneIO<object, infer U> ? U : never

Type parameters

Name Type
T extends CrawleeOneIO<object, object>

Defined in

src/lib/integrations/types.ts:325
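
Example

ExtractIOReport uses a conditional type with infer to pull the report type out of an IO integration. A minimal sketch of the same pattern, using a hypothetical two-parameter stand-in (SketchIO) instead of the real CrawleeOneIO interface; TimeoutReport and MyIO are illustrative names.

```typescript
// Hypothetical stand-in for CrawleeOneIO with only the type parameters that matter here.
interface SketchIO<TEnv = object, TReport = object> {
  env: TEnv;
  reports: TReport[];
}

// Same conditional-type pattern as the documented alias.
type SketchExtractReport<T extends SketchIO<object, object>> =
  T extends SketchIO<object, infer U> ? U : never;

interface TimeoutReport {
  errorName: string;
  url: string;
}
type MyIO = SketchIO<{ runId: string }, TimeoutReport>;

// Resolves to TimeoutReport.
type MyReport = SketchExtractReport<MyIO>;

const sampleReport: MyReport = { errorName: "TimeoutError", url: "https://example.com" };
console.log(sampleReport.errorName);
```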


GenRedactedValue

Ƭ GenRedactedValue<V, K, O>: (val: V, key: K, obj: O) => MaybePromise<any>

Type parameters

Name
V
K
O

Type declaration

▸ (val, key, obj): MaybePromise<any>

Function that generates a "redacted" version of a value.

If you pass it a Promise, it will be resolved.

Parameters
Name Type
val V
key K
obj O
Returns

MaybePromise<any>

Defined in

src/lib/io/pushData.ts:15


LogLevel

Ƭ LogLevel: ArrVal<typeof LOG_LEVEL>

Defined in

src/lib/log.ts:8


MaybeArray

Ƭ MaybeArray<T>: T | T[]

Value or an array thereof

Type parameters

Name
T

Defined in

src/utils/types.ts:4
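
Example

Functions that accept a MaybeArray usually normalize it to an array first. A minimal sketch, where toArray is a hypothetical helper and not part of crawlee-one:

```typescript
type MaybeArray<T> = T | T[];

// Hypothetical helper that normalizes a single value or an array to an array.
function toArray<T>(value: MaybeArray<T>): T[] {
  return Array.isArray(value) ? value : [value];
}

console.log(toArray("a")); // prints: [ 'a' ]
console.log(toArray(["a", "b"])); // prints: [ 'a', 'b' ]
```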


MaybeAsyncFn

Ƭ MaybeAsyncFn<R, Args>: R | (...args: Args) => MaybePromise<R>

Value, or a sync or async function that returns it

Type parameters

Name Type
R R
Args extends any[]

Defined in

src/utils/types.ts:6


MaybePromise

Ƭ MaybePromise<T>: T | Promise<T>

Value or a promise thereof

Type parameters

Name
T

Defined in

src/utils/types.ts:2


Metamorph

Ƭ Metamorph: (overrides?: MetamorphActorInput) => Promise<void>

Type declaration

▸ (overrides?): Promise<void>

Trigger actor metamorph, using actor's inputs as defaults.

Parameters
Name Type
overrides? MetamorphActorInput
Returns

Promise<void>

Defined in

src/lib/actor/types.ts:38
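
Example

"Using actor's inputs as defaults" can be read as: values passed in overrides win, and the actor input fills in the rest. The sketch below shows only that merge and does not trigger a metamorph. The field names come from the metamorph input fields documented in this file; SketchMetamorphInput and resolveMetamorphInput are hypothetical names.

```typescript
// Field names are taken from the metamorph input fields documented in this file.
interface SketchMetamorphInput {
  metamorphActorId?: string;
  metamorphActorBuild?: string;
  metamorphActorInput?: Record<string, unknown>;
}

// Hypothetical helper: overrides win, the actor input fills in whatever is not overridden.
function resolveMetamorphInput(
  actorInput: SketchMetamorphInput | null,
  overrides: SketchMetamorphInput = {},
): SketchMetamorphInput {
  return { ...(actorInput ?? {}), ...overrides };
}

const resolved = resolveMetamorphInput(
  { metamorphActorId: "user/uploader", metamorphActorBuild: "latest" },
  { metamorphActorBuild: "beta" },
);
console.log(resolved.metamorphActorId, resolved.metamorphActorBuild); // prints: user/uploader beta
```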


OnBatchAddRequests

Ƭ OnBatchAddRequests: (...args: OnBatchAddRequestsArgs) => MaybePromise<void>

Type declaration

▸ (...args): MaybePromise<void>

Parameters
Name Type
...args OnBatchAddRequestsArgs
Returns

MaybePromise<void>

Defined in

src/lib/test/mockApifyClient.ts:31


OnBatchAddRequestsArgs

Ƭ OnBatchAddRequestsArgs: [requests: Omit<RequestQueueClientRequestSchema, "id">[], options?: RequestQueueClientBatchAddRequestWithRetriesOptions]

Defined in

src/lib/test/mockApifyClient.ts:27


PickPartial

Ƭ PickPartial<T, Keys>: Omit<T, Keys> & Partial<Pick<T, Keys>>

Pick properties that should be optional

Type parameters

Name Type
T extends object
Keys extends keyof T

Defined in

src/utils/types.ts:18


PickRequired

Ƭ PickRequired<T, Keys>: Omit<T, Keys> & Required<Pick<T, Keys>>

Pick properties that should be required

Type parameters

Name Type
T extends object
Keys extends keyof T

Defined in

src/utils/types.ts:21
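
Example

A minimal sketch of both aliases, re-declared locally. SketchCaptureInput mirrors the way CaptureErrorInput is defined earlier in this file (everything optional except error); SketchErrorReport and ReportWithoutLabel are illustrative types.

```typescript
type PickPartial<T extends object, Keys extends keyof T> = Omit<T, Keys> & Partial<Pick<T, Keys>>;
type PickRequired<T extends object, Keys extends keyof T> = Omit<T, Keys> & Required<Pick<T, Keys>>;

interface SketchErrorReport {
  error: Error;
  url: string;
  label: string;
}

// Every field is optional except `error`.
type SketchCaptureInput = PickRequired<Partial<SketchErrorReport>, "error">;

// `label` becomes optional, the other fields stay required.
type ReportWithoutLabel = PickPartial<SketchErrorReport, "label">;

const minimalInput: SketchCaptureInput = { error: new Error("boom") };
const noLabel: ReportWithoutLabel = { error: new Error("boom"), url: "https://example.com" };
console.log(minimalInput.error.message, noLabel.url);
```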


PrivacyFilter

Ƭ PrivacyFilter<V, K, O>: boolean | (val: V, key: K, obj: O, options?: { setCustomRedactedValue: (val: V) => any }) => any

Determine if the property is considered private (and hence may be hidden for privacy reasons).

PrivacyFilter may be either a boolean or a function that returns a truthy or falsy value.

The property is private if the filter is true or if the function returns a truthy value.

The function receives the property value, its key, and the parent object.

By default, when a property is redacted, its value is replaced with a string that informs about the redaction. If you want different text or value to be used instead, supply it to setCustomRedactedValue.

If the function returns a Promise, it will be awaited.

Type parameters

Name
V
K
O

Defined in

src/lib/io/pushData.ts:32


PrivacyMask

Ƭ PrivacyMask<T>: { [Key in keyof T]?: T[Key] extends Date | any[] ? PrivacyFilter<T[Key], Key, T> : T[Key] extends object ? PrivacyMask<T[Key]> : PrivacyFilter<T[Key], Key, T> }

PrivacyMask determines which (potentially nested) properties of an object are considered private.

PrivacyMask copies the structure of another object, but each non-object property on PrivacyMask is a PrivacyFilter: a boolean, or a function that determines if the property is considered private.

The property is private if the filter is true or if the function returns a truthy value.

If the function returns a Promise, it will be awaited.

Type parameters

Name Type
T extends object

Defined in

src/lib/io/pushData.ts:55
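
Example

A minimal, synchronous sketch of how a mask with this structure could be applied to an object. It is not the library's implementation: Promises and setCustomRedactedValue are left out, and SketchFilter, SketchMask, applyMask and the REDACTED placeholder text are assumptions made for illustration.

```typescript
// Simplified stand-ins for PrivacyFilter and PrivacyMask (no Promises, no options argument).
type SketchFilter = boolean | ((val: unknown, key: string, obj: object) => unknown);
interface SketchMask {
  [key: string]: SketchFilter | SketchMask;
}
type AnyObject = { [key: string]: any };

const REDACTED = "<Redacted property>"; // assumption: the real placeholder text may differ

// Returns a copy of `obj` where every property flagged by the mask is replaced.
function applyMask(obj: AnyObject, mask: SketchMask): AnyObject {
  const result: AnyObject = { ...obj };
  for (const key of Object.keys(mask)) {
    if (!(key in obj)) continue;
    const rule = mask[key];
    const val = obj[key];
    if (typeof rule === "object") {
      // Nested mask: recurse into plain objects only (arrays and Dates are left untouched).
      if (val && typeof val === "object" && !Array.isArray(val) && !(val instanceof Date)) {
        result[key] = applyMask(val, rule);
      }
    } else {
      const isPrivate = typeof rule === "function" ? Boolean(rule(val, key, obj)) : rule;
      if (isPrivate) result[key] = REDACTED;
    }
  }
  return result;
}

const masked = applyMask(
  { name: "Jane", age: 41, contact: { email: "jane@example.com", city: "Oslo" } },
  { name: true, age: (val: unknown) => (val as number) > 100, contact: { email: true, city: false } },
);
console.log(masked);
```

Here name and contact.email are replaced with the placeholder, while age and contact.city keep their values because their filters evaluate to a falsy value.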


RunCrawler

Ƭ RunCrawler<Ctx>: (requests?: CrawlerUrl[], options?: Parameters<OrigRunCrawler<Ctx>>[1]) => ReturnType<OrigRunCrawler<Ctx>>

Type parameters

Name Type
Ctx extends CrawlingContext = CrawlingContext<BasicCrawler>

Type declaration

▸ (requests?, options?): ReturnType<OrigRunCrawler<Ctx>>

Extended type of crawler.run() function

Parameters
Name Type
requests? CrawlerUrl[]
options? Parameters<OrigRunCrawler<Ctx>>[1]
Returns

ReturnType<OrigRunCrawler<Ctx>>

Defined in

src/lib/actor/types.ts:32

Variables

LOG_LEVEL

Const LOG_LEVEL: readonly ["debug", "info", "warn", "error", "off"]

Defined in

src/lib/log.ts:7


allActorInputValidationFields

Const allActorInputValidationFields: Object

Type declaration

Name Type
additionalMimeTypes ArraySchema<any[]>
errorReportingDatasetId StringSchema<string>
errorTelemetry BooleanSchema<boolean>
forceResponseEncoding StringSchema<string>
ignoreSslErrors BooleanSchema<boolean>
includePersonalData BooleanSchema<boolean>
inputExtendFromFunction StringSchema<string>
inputExtendUrl StringSchema<string>
keepAlive BooleanSchema<boolean>
logLevel StringSchema<string>
maxConcurrency NumberSchema<number>
maxRequestRetries NumberSchema<number>
maxRequestsPerCrawl NumberSchema<number>
maxRequestsPerMinute NumberSchema<number>
metamorphActorBuild StringSchema<string>
metamorphActorId StringSchema<string>
metamorphActorInput ObjectSchema<any>
minConcurrency NumberSchema<number>
navigationTimeoutSecs NumberSchema<number>
outputCacheActionOnResult StringSchema<string>
outputCachePrimaryKeys ArraySchema<any[]>
outputCacheStoreId StringSchema<string>
outputDatasetId StringSchema<string>
outputFilter StringSchema<string>
outputFilterAfter StringSchema<string>
outputFilterBefore StringSchema<string>
outputMaxEntries NumberSchema<number>
outputPickFields ArraySchema<any[]>
outputRenameFields ObjectSchema<any>
outputTransform StringSchema<string>
outputTransformAfter StringSchema<string>
outputTransformBefore StringSchema<string>
perfBatchSize NumberSchema<number>
perfBatchWaitSecs NumberSchema<number>
proxy ObjectSchema<any>
requestFilter StringSchema<string>
requestFilterAfter StringSchema<string>
requestFilterBefore StringSchema<string>
requestHandlerTimeoutSecs NumberSchema<number>
requestMaxEntries NumberSchema<number>
requestQueueId StringSchema<string>
requestTransform StringSchema<string>
requestTransformAfter StringSchema<string>
requestTransformBefore StringSchema<string>
startUrls ArraySchema<any[]>
startUrlsFromDataset StringSchema<string>
startUrlsFromFunction StringSchema<string>
suggestResponseEncoding StringSchema<string>

Defined in

src/lib/input.ts:1129


allActorInputs

Const allActorInputs: Object

Type declaration

Name Type
additionalMimeTypes ArrayField<any[]>
errorReportingDatasetId StringField<string, string>
errorTelemetry BooleanField<boolean>
forceResponseEncoding StringField<string, string>
ignoreSslErrors BooleanField<boolean>
includePersonalData BooleanField<boolean>
inputExtendFromFunction StringField<string, string>
inputExtendUrl StringField<string, string>
keepAlive BooleanField<boolean>
logLevel StringField<"error" | "off" | "info" | "debug" | "warn", string>
maxConcurrency IntegerField<number, string>
maxRequestRetries IntegerField<number, string>
maxRequestsPerCrawl IntegerField<number, string>
maxRequestsPerMinute IntegerField<number, string>
metamorphActorBuild StringField<string, string>
metamorphActorId StringField<string, string>
metamorphActorInput ObjectField<{ uploadDatasetToGDrive: boolean = true }>
minConcurrency IntegerField<number, string>
navigationTimeoutSecs IntegerField<number, string>
outputCacheActionOnResult StringField<NonNullable<undefined | null | "add" | "remove" | "overwrite">, string>
outputCachePrimaryKeys ArrayField<string[]>
outputCacheStoreId StringField<string, string>
outputDatasetId StringField<string, string>
outputFilter StringField<string, string>
outputFilterAfter StringField<string, string>
outputFilterBefore StringField<string, string>
outputMaxEntries IntegerField<number, string>
outputPickFields ArrayField<string[]>
outputRenameFields ObjectField<{ oldFieldName: string = 'newFieldName' }>
outputTransform StringField<string, string>
outputTransformAfter StringField<string, string>
outputTransformBefore StringField<string, string>
perfBatchSize IntegerField<number, string>
perfBatchWaitSecs IntegerField<number, string>
proxy ObjectField<object>
requestFilter StringField<string, string>
requestFilterAfter StringField<string, string>
requestFilterBefore StringField<string, string>
requestHandlerTimeoutSecs IntegerField<number, string>
requestMaxEntries IntegerField<number, string>
requestQueueId StringField<string, string>
requestTransform StringField<string, string>
requestTransformAfter StringField<string, string>
requestTransformBefore StringField<string, string>
startUrls ArrayField<any[]>
startUrlsFromDataset StringField<string, string>
startUrlsFromFunction StringField<string, string>
suggestResponseEncoding StringField<string, string>

Defined in

src/lib/input.ts:1031


apifyIO

Const apifyIO: ApifyCrawleeOneIO

Integration between CrawleeOne and Apify.

This is the default integration.

Defined in

src/lib/integrations/apify.ts:117


crawlerInput

Const crawlerInput: Object

Common input fields related to crawler setup

Type declaration

Name Type
additionalMimeTypes ArrayField<any[]>
forceResponseEncoding StringField<string, string>
ignoreSslErrors BooleanField<boolean>
keepAlive BooleanField<boolean>
maxConcurrency IntegerField<number, string>
maxRequestRetries IntegerField<number, string>
maxRequestsPerCrawl IntegerField<number, string>
maxRequestsPerMinute IntegerField<number, string>
minConcurrency IntegerField<number, string>
navigationTimeoutSecs IntegerField<number, string>
requestHandlerTimeoutSecs IntegerField<number, string>
suggestResponseEncoding StringField<string, string>

Defined in

src/lib/input.ts:523


crawlerInputValidationFields

Const crawlerInputValidationFields: Object

Type declaration

Name Type
additionalMimeTypes ArraySchema<any[]>
forceResponseEncoding StringSchema<string>
ignoreSslErrors BooleanSchema<boolean>
keepAlive BooleanSchema<boolean>
maxConcurrency NumberSchema<number>
maxRequestRetries NumberSchema<number>
maxRequestsPerCrawl NumberSchema<number>
maxRequestsPerMinute NumberSchema<number>
minConcurrency NumberSchema<number>
navigationTimeoutSecs NumberSchema<number>
requestHandlerTimeoutSecs NumberSchema<number>
suggestResponseEncoding StringSchema<string>

Defined in

src/lib/input.ts:1044


inputInput

Const inputInput: Object

Common input fields related to actor input

Type declaration

Name Type
inputExtendFromFunction StringField<string, string>
inputExtendUrl StringField<string, string>

Defined in

src/lib/input.ts:491


inputInputValidationFields

Const inputInputValidationFields: Object

Type declaration

Name Type
inputExtendFromFunction StringSchema<string>
inputExtendUrl StringSchema<string>

Defined in

src/lib/input.ts:1064


logLevelToCrawlee

Const logLevelToCrawlee: Record<LogLevel, CrawleeLogLevel>

Map log levels of crawlee-one to log levels of crawlee

Defined in

src/lib/log.ts:11
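
Example

A sketch of what a Record<LogLevel, CrawleeLogLevel> map looks like. To keep the snippet free of dependencies, SketchCrawleeLogLevel is a local stand-in for Crawlee's log level enum; its members and numeric values, and the one-to-one mapping itself, are assumptions rather than the library's actual values.

```typescript
const LOG_LEVEL = ["debug", "info", "warn", "error", "off"] as const;
type LogLevel = (typeof LOG_LEVEL)[number];

// Stand-in for Crawlee's log level enum (illustrative members and values).
enum SketchCrawleeLogLevel {
  OFF,
  ERROR,
  WARNING,
  INFO,
  DEBUG,
}

// Record<LogLevel, ...> makes the compiler reject a map with a missing or misspelled level.
const sketchLogLevelMap: Record<LogLevel, SketchCrawleeLogLevel> = {
  off: SketchCrawleeLogLevel.OFF,
  error: SketchCrawleeLogLevel.ERROR,
  warn: SketchCrawleeLogLevel.WARNING,
  info: SketchCrawleeLogLevel.INFO,
  debug: SketchCrawleeLogLevel.DEBUG,
};

console.log(SketchCrawleeLogLevel[sketchLogLevelMap.warn]); // prints: WARNING
```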


loggingInput

Const loggingInput: Object

Common input fields related to logging setup

Type declaration

Name Type
errorReportingDatasetId StringField<string, string>
errorTelemetry BooleanField<boolean>
logLevel StringField<"error" | "off" | "info" | "debug" | "warn", string>

Defined in

src/lib/input.ts:688


loggingInputValidationFields

Const loggingInputValidationFields: Object

Type declaration

Name Type
errorReportingDatasetId StringSchema<string>
errorTelemetry BooleanSchema<boolean>
logLevel StringSchema<string>

Defined in

src/lib/input.ts:1075


metamorphInput

Const metamorphInput: Object

Common input fields related to actor metamorphing

Type declaration

Name Type
metamorphActorBuild StringField<string, string>
metamorphActorId StringField<string, string>
metamorphActorInput ObjectField<{ uploadDatasetToGDrive: boolean = true }>

Defined in

src/lib/input.ts:1002


metamorphInputValidationFields

Const metamorphInputValidationFields: Object

Type declaration

Name Type
metamorphActorBuild StringSchema<string>
metamorphActorId StringSchema<string>
metamorphActorInput ObjectSchema<any>

Defined in

src/lib/input.ts:1123


outputInput

Const outputInput: Object

Common input fields related to actor output

Type declaration

Name Type
outputCacheActionOnResult StringField<NonNullable<undefined | null | "add" | "remove" | "overwrite">, string>
outputCachePrimaryKeys ArrayField<string[]>
outputCacheStoreId StringField<string, string>
outputDatasetId StringField<string, string>
outputFilter StringField<string, string>
outputFilterAfter StringField<string, string>
outputFilterBefore StringField<string, string>
outputMaxEntries IntegerField<number, string>
outputPickFields ArrayField<string[]>
outputRenameFields ObjectField<{ oldFieldName: string = 'newFieldName' }>
outputTransform StringField<string, string>
outputTransformAfter StringField<string, string>
outputTransformBefore StringField<string, string>

Defined in

src/lib/input.ts:851


outputInputValidationFields

Const outputInputValidationFields: Object

Type declaration

Name Type
outputCacheActionOnResult StringSchema<string>
outputCachePrimaryKeys ArraySchema<any[]>
outputCacheStoreId StringSchema<string>
outputDatasetId StringSchema<string>
outputFilter StringSchema<string>
outputFilterAfter StringSchema<string>
outputFilterBefore StringSchema<string>
outputMaxEntries NumberSchema<number>
outputPickFields ArraySchema<any[]>
outputRenameFields ObjectSchema<any>
outputTransform StringSchema<string>
outputTransformAfter StringSchema<string>
outputTransformBefore StringSchema<string>

Defined in

src/lib/input.ts:1102


perfInput

Const perfInput: Object

Common input fields related to performance which are not part of the CrawlerConfig

Type declaration

Name Type
perfBatchSize IntegerField<number, string>
perfBatchWaitSecs IntegerField<number, string>

Defined in

src/lib/input.ts:631


perfInputValidationFields

Const perfInputValidationFields: Object

Type declaration

Name Type
perfBatchSize NumberSchema<number>
perfBatchWaitSecs NumberSchema<number>

Defined in

src/lib/input.ts:1059


privacyInput

Const privacyInput: Object

Common input fields related to privacy setup

Type declaration

Name Type
includePersonalData BooleanField<boolean>

Defined in

src/lib/input.ts:749


privacyInputValidationFields

Const privacyInputValidationFields: Object

Type declaration

Name Type
includePersonalData BooleanSchema<boolean>

Defined in

src/lib/input.ts:1085


proxyInput

Const proxyInput: Object

Common input fields related to proxy setup

Type declaration

Name Type
proxy ObjectField<object>

Defined in

src/lib/input.ts:737


proxyInputValidationFields

Const proxyInputValidationFields: Object

Type declaration

Name Type
proxy ObjectSchema<any>

Defined in

src/lib/input.ts:1081


requestInput

Const requestInput: Object

Common input fields related to actor request

Type declaration

Name Type
requestFilter StringField<string, string>
requestFilterAfter StringField<string, string>
requestFilterBefore StringField<string, string>
requestMaxEntries IntegerField<number, string>
requestQueueId StringField<string, string>
requestTransform StringField<string, string>
requestTransformAfter StringField<string, string>
requestTransformBefore StringField<string, string>

Defined in

src/lib/input.ts:763


requestInputValidationFields

Const requestInputValidationFields: Object

Type declaration

Name Type
requestFilter StringSchema<string>
requestFilterAfter StringSchema<string>
requestFilterBefore StringSchema<string>
requestMaxEntries NumberSchema<number>
requestQueueId StringSchema<string>
requestTransform StringSchema<string>
requestTransformAfter StringSchema<string>
requestTransformBefore StringSchema<string>

Defined in

src/lib/input.ts:1089


startUrlsInput

Const startUrlsInput: Object

Common input fields for defining URLs to scrape

Type declaration

Name Type
startUrls ArrayField<any[]>
startUrlsFromDataset StringField<string, string>
startUrlsFromFunction StringField<string, string>

Defined in

src/lib/input.ts:657


startUrlsInputValidationFields

Const startUrlsInputValidationFields: Object

Type declaration

Name Type
startUrls ArraySchema<any[]>
startUrlsFromDataset StringSchema<string>
startUrlsFromFunction StringSchema<string>

Defined in

src/lib/input.ts:1069

Functions

basicCaptureErrorRouteHandler

basicCaptureErrorRouteHandler<T>(...args): CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Type parameters

Name Type
T extends CrawleeOneCtx<BasicCrawlingContext<Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
...args [handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]

Returns

CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Defined in

src/lib/error/errorHandler.ts:133


captureError

captureError<TIO>(input, options): Promise<never>

Error handling for CrawleeOne crawlers.

By default, error reports are saved to Apify Dataset.

See https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors#error-reporting

Type parameters

Name Type
TIO extends CrawleeOneIO<object, object, object, TIO> = CrawleeOneIO<object, object, object>

Parameters

Name Type
input CaptureErrorInput
options CrawleeOneErrorHandlerOptions<TIO>

Returns

Promise<never>

Defined in

src/lib/error/errorHandler.ts:33


captureErrorRouteHandler

captureErrorRouteHandler<T>(handler, options): CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Drop-in replacement for a regular Crawlee route handler callback that automatically tracks errors.

By default, error reports are saved to Apify Dataset.

Type parameters

Name Type
T extends CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> | PuppeteerCrawler | PlaywrightCrawler | JSDOMCrawler | CheerioCrawler | HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
handler (ctx: Omit<T["context"] & {}, "request"> & { request: Request<Dictionary> } & { captureError: CaptureError }) => MaybePromise<void>
options CrawleeOneErrorHandlerOptions<T["io"]>

Returns

CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Example

router.addDefaultHandler(
  captureErrorRouteHandler(async (ctx) => {
    const { page, crawler } = ctx;
    const url = page.url();
    // ...
  })
);

Defined in

src/lib/error/errorHandler.ts:110


captureErrorWrapper

captureErrorWrapper<TIO>(fn, options): Promise<void>

Error handling for Crawlers as a function wrapper

By default, error reports are saved to Apify Dataset.

Type parameters

Name Type
TIO extends CrawleeOneIO<object, object, object, TIO> = CrawleeOneIO<object, object, object>

Parameters

Name Type
fn (input: { captureError: CaptureError }) => MaybePromise<void>
options CrawleeOneErrorHandlerOptions<TIO>

Returns

Promise<void>

Defined in

src/lib/error/errorHandler.ts:77
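
Example

A minimal sketch of the function-wrapper pattern: the wrapped function receives a captureError callback, and any error that escapes it is reported as well. This is not the library's implementation. An in-memory array stands in for the Dataset, the rethrow is inferred from the documented Promise<never> return type of captureError, and the de-duplication of already reported errors is an assumption. All names ending in Sketch are hypothetical.

```typescript
type MaybePromise<T> = T | Promise<T>;

// Hypothetical, trimmed-down stand-ins for CaptureErrorInput and CaptureError.
interface SketchCaptureInput {
  error: Error;
  url?: string;
}
type SketchCaptureError = (input: SketchCaptureInput) => MaybePromise<void>;

// In-memory stand-in for the Dataset that error reports are saved to by default.
const errorReports: { message: string; url: string | null }[] = [];
const alreadyReported = new WeakSet<Error>();

// Saves a report, then rethrows the error.
const captureErrorSketch = async (input: SketchCaptureInput): Promise<never> => {
  errorReports.push({ message: input.error.message, url: input.url ?? null });
  alreadyReported.add(input.error);
  throw input.error;
};

// Runs fn with a captureError callback and reports any error that fn lets escape.
async function captureErrorWrapperSketch(
  fn: (input: { captureError: SketchCaptureError }) => MaybePromise<void>,
): Promise<void> {
  try {
    await fn({ captureError: captureErrorSketch });
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    // Assumption: an error that fn already reported is not reported a second time.
    if (alreadyReported.has(error)) throw error;
    await captureErrorSketch({ error });
  }
}

captureErrorWrapperSketch(async () => {
  throw new Error("page layout changed");
}).catch((err: Error) => {
  console.log(`saved ${errorReports.length} report(s), last error: ${err.message}`);
});
```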


cheerioCaptureErrorRouteHandler

cheerioCaptureErrorRouteHandler<T>(...args): CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Type parameters

Name Type
T extends CrawleeOneCtx<CheerioCrawlingContext<any, any>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
...args [handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]

Returns

CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Defined in

src/lib/error/errorHandler.ts:136


crawleeOne

crawleeOne<TType, T>(args): Promise<void>

Type parameters

Name Type
TType extends "basic" | "http" | "cheerio" | "jsdom" | "playwright" | "puppeteer"
T extends CrawleeOneCtx<CrawlerMeta<TType>["context"], string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T> = CrawleeOneCtx<CrawlerMeta<TType>["context"], string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>>

Parameters

Name Type
args CrawleeOneArgs<TType, T>

Returns

Promise<void>

Defined in

src/api.ts:124


createErrorHandler

createErrorHandler<T>(options): ErrorHandler<T["context"]>

Create an ErrorHandler function that can be assigned to failedRequestHandler option of BasicCrawlerOptions.

The function saves the error to a Dataset, and optionally forwards it to Sentry.

By default, error reports are saved to Apify Dataset.

Type parameters

Name Type
T extends CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> | PuppeteerCrawler | PlaywrightCrawler | JSDOMCrawler | CheerioCrawler | HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
options CrawleeOneErrorHandlerOptions<T["io"]> & { onSendErrorToTelemetry?: T["telemetry"]["onSendErrorToTelemetry"] ; sendToTelemetry?: boolean }

Returns

ErrorHandler<T["context"]>

Defined in

src/lib/error/errorHandler.ts:148


createHttpCrawlerOptions

createHttpCrawlerOptions<T, TOpts>(«destructured»): Partial<TOpts> & Dictionary<TOpts["requestHandler"] | TOpts["handleRequestFunction"] | TOpts["requestList"] | TOpts["requestQueue"] | TOpts["requestHandlerTimeoutSecs"] | TOpts["handleRequestTimeoutSecs"] | TOpts["errorHandler"] | TOpts["failedRequestHandler"] | TOpts["handleFailedRequestFunction"] | TOpts["maxRequestRetries"] | TOpts["maxRequestsPerCrawl"] | TOpts["autoscaledPoolOptions"] | TOpts["minConcurrency"] | TOpts["maxConcurrency"] | TOpts["maxRequestsPerMinute"] | TOpts["keepAlive"] | TOpts["useSessionPool"] | TOpts["sessionPoolOptions"] | TOpts["loggingInterval"] | TOpts["log"]>

Given the actor input, create common crawler options.

Type parameters

Name Type
T extends CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> | PuppeteerCrawler | PlaywrightCrawler | JSDOMCrawler | CheerioCrawler | HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>
TOpts extends BasicCrawlerOptions<T["context"], TOpts>

Parameters

Name Type Description
«destructured» Object -
› defaults? TOpts Default config options set by us. These may be overridden by values from actor input (set by user).
› input null | T["input"] Actor input
› overrides? TOpts These config options will overwrite both the default and user options. This is useful for hard-setting values e.g. in tests.

Returns

Partial<TOpts> & Dictionary<TOpts["requestHandler"] | TOpts["handleRequestFunction"] | TOpts["requestList"] | TOpts["requestQueue"] | TOpts["requestHandlerTimeoutSecs"] | TOpts["handleRequestTimeoutSecs"] | TOpts["errorHandler"] | TOpts["failedRequestHandler"] | TOpts["handleFailedRequestFunction"] | TOpts["maxRequestRetries"] | TOpts["maxRequestsPerCrawl"] | TOpts["autoscaledPoolOptions"] | TOpts["minConcurrency"] | TOpts["maxConcurrency"] | TOpts["maxRequestsPerMinute"] | TOpts["keepAlive"] | TOpts["useSessionPool"] | TOpts["sessionPoolOptions"] | TOpts["loggingInterval"] | TOpts["log"]>

Defined in

src/lib/actor/actor.ts:584
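
Example

The parameter descriptions above imply a precedence of defaults, then actor input, then overrides. A minimal sketch of that merge, assuming only three of the crawler config fields and assuming that unset input fields do not overwrite defaults. SketchCrawlerOptions, SketchActorInput and createOptionsSketch are hypothetical names; the real function works with the full crawler options type.

```typescript
// Subset of the crawler config fields listed under CrawlerConfigActorInput.
interface SketchCrawlerOptions {
  maxRequestRetries?: number;
  maxConcurrency?: number;
  requestHandlerTimeoutSecs?: number;
}
// Actor input carries the crawler fields next to unrelated fields such as startUrls.
interface SketchActorInput extends SketchCrawlerOptions {
  startUrls?: string[];
}
const CONFIG_KEYS = ["maxRequestRetries", "maxConcurrency", "requestHandlerTimeoutSecs"] as const;

function createOptionsSketch(args: {
  input: SketchActorInput | null;
  defaults?: SketchCrawlerOptions;
  overrides?: SketchCrawlerOptions;
}): SketchCrawlerOptions {
  // Pick only the crawler-related fields from the actor input.
  const fromInput: SketchCrawlerOptions = {};
  for (const key of CONFIG_KEYS) {
    const value = args.input?.[key];
    if (value !== undefined) fromInput[key] = value;
  }
  // Later spreads win: defaults < actor input < overrides.
  return { ...args.defaults, ...fromInput, ...args.overrides };
}

const sketchOptions = createOptionsSketch({
  defaults: { maxRequestRetries: 3, maxConcurrency: 5 },
  input: { maxConcurrency: 10, startUrls: ["https://example.com"] },
  overrides: { requestHandlerTimeoutSecs: 1 },
});
console.log(sketchOptions);
// prints: { maxRequestRetries: 3, maxConcurrency: 10, requestHandlerTimeoutSecs: 1 }
```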


createLocalMigrationState

createLocalMigrationState(«destructured»): Object

Parameters

Name Type
«destructured» Object
› stateDir string

Returns

Object

Name Type
loadState (migrationFilename: string) => Promise<Actor>
saveState (migrationFilename: string, actor: ActorClient) => Promise<void>

Defined in

src/lib/migrate/localState.ts:5


createLocalMigrator

createLocalMigrator(«destructured»): Object

Parameters

Name Type Description
«destructured» Object -
› delimeter string Delimiter between the version and the rest of the file name
› extension string Extension glob
› migrationsDir string -

Returns

Object

Name Type
migrate (version: string) => Promise<void>
unmigrate (version: string) => Promise<void>

Defined in

src/lib/migrate/localMigrator.ts:8
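
Example

The delimeter option separates the version from the rest of a migration file name. The sketch below shows one way such a file name could be split. parseMigrationFilename and the sample file name are hypothetical; how createLocalMigrator actually resolves files is not described in this file.

```typescript
// Hypothetical helper: splits "<version><delimeter><rest>" into its two parts.
function parseMigrationFilename(
  filename: string,
  delimeter: string,
): { version: string; rest: string } | null {
  const index = filename.indexOf(delimeter);
  // No delimeter found, or nothing in front of it.
  if (index <= 0) return null;
  return {
    version: filename.slice(0, index),
    rest: filename.slice(index + delimeter.length),
  };
}

console.log(parseMigrationFilename("v1_initial-setup.js", "_"));
// prints: { version: 'v1', rest: 'initial-setup.js' }
```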


createMockClientDataset

createMockClientDataset(overrides?): Dataset

Parameters

Name Type
overrides? Dataset

Returns

Dataset

Defined in

src/lib/test/mockApifyClient.ts:33


createMockClientRequestQueue

createMockClientRequestQueue(overrides?): RequestQueue

Parameters

Name Type
overrides? RequestQueue

Returns

RequestQueue

Defined in

src/lib/test/mockApifyClient.ts:50


createMockDatasetCollectionClient

createMockDatasetCollectionClient(«destructured»?): DatasetCollectionClient

Parameters

Name Type
«destructured» Object
› log? (args: any) => void

Returns

DatasetCollectionClient

Defined in

src/lib/test/mockApifyClient.ts:195


createMockKeyValueStoreClient

createMockKeyValueStoreClient(«destructured»?): KeyValueStoreClient

Parameters

Name Type
«destructured» Object
› log? (args: any) => void

Returns

KeyValueStoreClient

Defined in

src/lib/test/mockApifyClient.ts:71


createMockRequestQueueClient

createMockRequestQueueClient(«destructured»?): RequestQueueClient

Parameters

Name Type
«destructured» Object
› log? (args: any) => void
› onBatchAddRequests? OnBatchAddRequests

Returns

RequestQueueClient

Defined in

src/lib/test/mockApifyClient.ts:98


createMockStorageClient

createMockStorageClient(«destructured»?): StorageClient

Parameters

Name Type
«destructured» Object
› log? (args: any) => void
› onBatchAddRequests? OnBatchAddRequests

Returns

StorageClient

Defined in

src/lib/test/mockApifyClient.ts:227


createMockStorageDataset

createMockStorageDataset(...args): Promise<Dataset<any>>

Parameters

Name Type
...args [datasetId?: null | string, options?: OpenStorageOptions, custom?: Object]

Returns

Promise<Dataset<any>>

Defined in

src/lib/test/mockApifyClient.ts:252


createSentryTelemetry

createSentryTelemetry<T>(sentryOptions?): T

Type parameters

Name Type
T extends CrawleeOneTelemetry<CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> | PuppeteerCrawler | PlaywrightCrawler | JSDOMCrawler | CheerioCrawler | HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>>, CrawleeOneErrorHandlerOptions<CrawleeOneIO<object, object, object>>, T>

Parameters

Name Type
sentryOptions? NodeOptions

Returns

T

Defined in

src/lib/telemetry/sentry.ts:24


datasetSizeMonitor

datasetSizeMonitor(maxSize, options?): Object

Semi-automatic monitoring of Dataset size. This is used for limiting the total number of entries scraped per run / Dataset:

  • When the Dataset reaches maxSize, all remaining Requests in the RequestQueue are removed.
  • Pass an array of items to shortenToSize to shorten the array to the size that still fits the Dataset.

By default uses Apify Dataset.
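
The shortenToSize behaviour described above can be pictured with a small, self-contained sketch. This is not the library's implementation: `remainingCapacity` and `shortenToFit` are hypothetical helpers, and the current Dataset size is passed in as a plain number instead of being read from storage.

```typescript
// Hypothetical helpers for illustration only; not part of crawlee-one.

/** How many more entries fit before the size limit is reached. */
const remainingCapacity = (maxSize: number, currentSize: number): number =>
  Math.max(0, maxSize - currentSize);

/** Keep only as many items as still fit under the limit. */
const shortenToFit = <T>(items: T[], maxSize: number, currentSize: number): T[] =>
  items.slice(0, remainingCapacity(maxSize, currentSize));

// A Dataset capped at 5 entries that already holds 3 has room for 2 more.
const acceptedItems = shortenToFit(['a', 'b', 'c', 'd'], 5, 3);
console.log(acceptedItems); // [ 'a', 'b' ]

// At or above the limit nothing is accepted.
const rejectedItems = shortenToFit(['a', 'b'], 5, 5);
console.log(rejectedItems); // []
```

The real monitor is asynchronous because it has to fetch the current size first, which is why shortenToSize returns a Promise in the table below.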

Parameters

Name Type
maxSize number
options? DatasetSizeMonitorOptions

Returns

Object

Name Type
isFull () => Promise<boolean>
isStale () => boolean
onValue (callback: ValueCallback<number>) => () => void
refresh () => Promise<number>
shortenToSize <T>(arr: T[]) => Promise<T[]>
value () => null | number | Promise<number>

Defined in

src/lib/io/dataset.ts:94


generateTypes

generateTypes(outfile, configOrPath?): Promise<void>

Generate types for CrawleeOne given a config.

The config can be passed directly, or as the path to the config file. If the config is omitted, it is searched for automatically using cosmiconfig.

Parameters

Name Type
outfile string
configOrPath? string | CrawleeOneConfig

Returns

Promise<void>

Defined in

src/cli/commands/codegen.ts:251


getColumnFromDataset

getColumnFromDataset<T>(datasetId, field, options?): Promise<T[]>

Given a Dataset ID and the name of a field, get the values of that field across the Dataset entries (columnar data).

By default uses Apify Dataset.

Example:

// Given dataset
// [
//   { id: 1, field: 'abc' },
//   { id: 2, field: 'def' }
// ]
const results = await getColumnFromDataset('datasetId123', 'field');
console.log(results)
// ['abc', 'def']

Type parameters

Name
T

Parameters

Name Type
datasetId string
field string
options? Object
options.dataOptions? Pick<DatasetDataOptions, "offset" | "desc" | "limit">
options.io? CrawleeOneIO<object, object, object>

Returns

Promise<T[]>

Defined in

src/lib/io/dataset.ts:48


getDatasetCount

getDatasetCount(datasetNameOrId?, options?): Promise<null | number>

Given a Dataset ID, get the number of entries already in the Dataset.

By default uses Apify Dataset.

Parameters

Name Type
datasetNameOrId? string
options? Object
options.io? CrawleeOneIO<object, object, object>
options.log? Log

Returns

Promise<null | number>

Defined in

src/lib/io/dataset.ts:12


httpCaptureErrorRouteHandler

httpCaptureErrorRouteHandler<T>(...args): CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Type parameters

Name Type
T extends CrawleeOneCtx<HttpCrawlingContext<any, any>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
...args [handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]

Returns

CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Defined in

src/lib/error/errorHandler.ts:134


itemCacheKey

itemCacheKey(item, primaryKeys?): string

Serialize a dataset item to a fixed-length hash.

NOTE: Apify (around which this lib is designed) allows the key-value store key to be at most 256 characters long. https://docs.apify.com/sdk/js/reference/class/KeyValueStore#setValue
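
To illustrate why a fixed-length hash keeps keys within that limit regardless of item size, here is a minimal sketch. It is an assumption-laden illustration, not crawlee-one's implementation: the hash function (32-bit FNV-1a), the key sorting, and the names `fnv1aHex` and `sketchItemCacheKey` are all hypothetical choices.

```typescript
// Hypothetical sketch; the library may use a different hash and serialization.

/** 32-bit FNV-1a hash of a string, rendered as 8 hex characters. */
const fnv1aHex = (text: string): string => {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep as uint32
  }
  return hash.toString(16).padStart(8, '0');
};

/** Build a cache key from the primary-key fields (or from all fields if none are given). */
const sketchItemCacheKey = (item: Record<string, unknown>, primaryKeys?: string[]): string => {
  // Sort the field names so that property order does not change the key.
  const keys = (primaryKeys ?? Object.keys(item)).slice().sort();
  const picked = keys.map((key) => [key, item[key]]);
  return fnv1aHex(JSON.stringify(picked));
};

const keyA = sketchItemCacheKey({ id: 1, title: 'abc', scrapedAt: '2024-01-01' }, ['id']);
const keyB = sketchItemCacheKey({ id: 1, title: 'changed', scrapedAt: '2024-02-02' }, ['id']);
console.log(keyA === keyB); // true: only the primary key `id` contributes
console.log(keyA.length); // 8
```

Restricting the hash to primary keys means two scrapes of the same entity map to the same store key even when other fields differ.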

Parameters

Name Type
item any
primaryKeys? string[]

Returns

string

Defined in

src/lib/io/pushData.ts:245


jsdomCaptureErrorRouteHandler

jsdomCaptureErrorRouteHandler<T>(...args): CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Type parameters

Name Type
T extends CrawleeOneCtx<JSDOMCrawlingContext<any, any>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
...args [handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]

Returns

CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Defined in

src/lib/error/errorHandler.ts:135


loadConfig

loadConfig(configFilePath?): Promise<null | CrawleeOneConfig>

Load the CrawleeOne config file. The config is searched for using cosmiconfig.

Optionally, you can supply the path to the config file.

Learn more: https://github.com/cosmiconfig/cosmiconfig

Parameters

Name Type
configFilePath? string

Returns

Promise<null | CrawleeOneConfig>

Defined in

src/cli/commands/config.ts:51


logLevelHandlerWrapper

logLevelHandlerWrapper<T, RouterCtx>(logLevel): CrawleeOneRouteWrapper<T, RouterCtx>

Wrapper for Crawlee route handler that configures log level.

Usage with Crawlee's RouterHandler.addDefaultHandler

const wrappedHandler = logLevelHandlerWrapper('debug')(handler)
await router.addDefaultHandler<Ctx>(wrappedHandler);

Usage with Crawlee's RouterHandler.addHandler

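const wrappedHandler = logLevelHandlerWrapper('error')(handler)
// 'routeLabel' is a placeholder; addHandler expects the route label as its first argument
await router.addHandler<Ctx>('routeLabel', wrappedHandler);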

Usage with createCrawleeOne

const actor = await createCrawleeOne<CheerioCrawlingContext>({
  validateInput,
  router: createCheerioRouter(),
  routes,
  routeHandlers: ({ input }) => createHandlers(input!),
  routeHandlerWrappers: ({ input }) => [
    logLevelHandlerWrapper<CheerioCrawlingContext<any, any>>(input?.logLevel ?? 'info'),
  ],
  createCrawler: ({ router, input }) => createCrawler({ router, input, crawlerConfig }),
});
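
The examples above all rely on the same pattern: a wrapper is a function that takes a handler and returns a new handler. The sketch below shows that pattern in isolation. It is a simplified illustration, not the library's code: `SketchCtx`, `SketchHandler`, `SketchWrapper` and `sketchLogLevelWrapper` are hypothetical, and the context's logger is reduced to a single `level` field.

```typescript
// Hypothetical minimal types; the real ones come from Crawlee and crawlee-one.
type SketchLevel = 'off' | 'error' | 'warn' | 'info' | 'debug';
interface SketchCtx {
  log: { level: SketchLevel };
}
type SketchHandler = (ctx: SketchCtx) => Promise<void>;
type SketchWrapper = (handler: SketchHandler) => SketchHandler;

// Returns a wrapper that sets the log level on the context,
// then delegates to the original handler.
const sketchLogLevelWrapper = (level: SketchLevel): SketchWrapper =>
  (handler) => async (ctx) => {
    ctx.log.level = level;
    await handler(ctx);
  };

// Usage: the inner handler observes the level chosen by the wrapper.
const seenLevels: SketchLevel[] = [];
const innerHandler: SketchHandler = async (ctx) => {
  seenLevels.push(ctx.log.level);
};
const wrappedSketchHandler = sketchLogLevelWrapper('debug')(innerHandler);
const sketchCtx: SketchCtx = { log: { level: 'info' } };
const sketchRun = wrappedSketchHandler(sketchCtx);
```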

Type parameters

Name Type
T extends CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> | PuppeteerCrawler | PlaywrightCrawler | JSDOMCrawler | CheerioCrawler | HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>
RouterCtx extends Record<string, any> = CrawleeOneRouteCtx<T>

Parameters

Name Type
logLevel "error" | "off" | "info" | "debug" | "warn"

Returns

CrawleeOneRouteWrapper<T, RouterCtx>

Defined in

src/lib/log.ts:49


playwrightCaptureErrorRouteHandler

playwrightCaptureErrorRouteHandler<T>(...args): CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Type parameters

Name Type
T extends CrawleeOneCtx<PlaywrightCrawlingContext<Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
...args [handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]

Returns

CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Defined in

src/lib/error/errorHandler.ts:137


puppeteerCaptureErrorRouteHandler

puppeteerCaptureErrorRouteHandler<T>(...args): CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Type parameters

Name Type
T extends CrawleeOneCtx<PuppeteerCrawlingContext<Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
...args [handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]

Returns

CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>

Defined in

src/lib/error/errorHandler.ts:138


pushData

pushData<Ctx, T>(ctx, oneOrManyItems, options): Promise<unknown[]>

Apify's Actor.pushData with extra features:

  • Data can be sent elsewhere, not just to Apify. This is set by the io options. By default data is sent using Apify (cloud/local).
  • Limit the max size of the Dataset. No entries are added when Dataset is at or above the limit.
  • Redact "private" fields
  • Add metadata to entries before they are pushed to dataset.
  • Select and rename (nested) properties
  • Transform and filter entries. Entries that did not pass the filter are not added to the dataset.
  • Add/remove entries to/from KeyValueStore. Entries are saved to the store by hash generated from entry fields set by cachePrimaryKeys.

Type parameters

Name Type
Ctx extends CrawlingContext<unknown, Dictionary, Ctx>
T extends Record<any, any> = Record<any, any>

Parameters

Name Type
ctx Ctx
oneOrManyItems T | T[]
options PushDataOptions<T>

Returns

Promise<unknown[]>

Defined in

src/lib/io/pushData.ts:319


pushRequests

pushRequests<T>(oneOrManyItems, options?): Promise<unknown[]>

Similar to Actor.openRequestQueue().addRequests, but with extra features:

  • Data can be sent elsewhere, not just to Apify. This is set by the io options. By default data is sent using Apify (cloud/local).
  • Limit the max size of the RequestQueue. No requests are added when RequestQueue is at or above the limit.
  • Transform and filter requests. Requests that did not pass the filter are not added to the RequestQueue.

Type parameters

Name Type
T extends RequestOptions<Dictionary> | Request<Dictionary>

Parameters

Name Type
oneOrManyItems T | T[]
options? PushRequestsOptions<T>

Returns

Promise<unknown[]>

Defined in

src/lib/io/pushRequests.ts:78


registerHandlers

registerHandlers<T, RouterCtx>(router, routes, options?): Promise<void>

Register many handlers at once on Crawlee's RouterHandler.

The labels under which the handlers are registered are the respective object keys.

Example:

registerHandlers(router, { labelA: fn1, labelB: fn2 });

Is similar to:

router.addHandler('labelA', fn1);
router.addHandler('labelB', fn2);

You can also specify a list of wrappers to override the behaviour of all handlers at once.

A list of wrappers [a, b, c] is applied to each handler right-to-left, that is, a( b( c( handler ) ) ).

The entries on the routerContext object will be made available to all handlers.
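
The right-to-left wrapper order described above can be sketched with `reduceRight`. This is a minimal illustration with hypothetical names (`DemoHandler`, `DemoWrapper`, `applyWrappers`, `tagWrapper`); the real handlers are asynchronous and receive a crawling context, not a string.

```typescript
// Hypothetical minimal types for illustration; not crawlee-one's own types.
type DemoHandler = (input: string) => string;
type DemoWrapper = (handler: DemoHandler) => DemoHandler;

// Apply wrappers right-to-left: [a, b, c] becomes a(b(c(handler))).
const applyWrappers = (handler: DemoHandler, wrappers: DemoWrapper[]): DemoHandler =>
  wrappers.reduceRight((wrappedSoFar, wrapper) => wrapper(wrappedSoFar), handler);

// Each wrapper tags the result so the nesting order is visible in the output.
const tagWrapper = (name: string): DemoWrapper => (handler) => (input) =>
  `${name}(${handler(input)})`;

const composedHandler = applyWrappers(
  (input) => input,
  [tagWrapper('a'), tagWrapper('b'), tagWrapper('c')],
);
console.log(composedHandler('handler')); // a(b(c(handler)))
```

With this order, the first wrapper in the list is the outermost one, so it runs first on the way in and last on the way out.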

Type parameters

Name Type
T extends CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> | PuppeteerCrawler | PlaywrightCrawler | JSDOMCrawler | CheerioCrawler | HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>
RouterCtx extends Record<string, any> = CrawleeOneRouteCtx<T>

Parameters

Name Type
router RouterHandler<T["context"]>
routes Record<T["labels"], CrawleeOneRoute<T, RouterCtx>>
options? Object
options.handlerWrappers? CrawleeOneRouteWrapper<T, RouterCtx>[]
options.onSetCtx? (ctx: null | Omit<T["context"] & RouterCtx, "request"> & { request: Request<Dictionary> }) => void
options.routerContext? RouterCtx

Returns

Promise<void>

Defined in

src/lib/router/router.ts:89


requestQueueSizeMonitor

requestQueueSizeMonitor(maxSize, options?): Object

Semi-automatic monitoring of RequestQueue size. This is used for limiting the total number of entries scraped per run / RequestQueue:

  • When the RequestQueue reaches maxSize, all remaining Requests are removed.
  • Pass an array of items to shortenToSize to shorten the array to the size that still fits the RequestQueue.

By default uses Apify RequestQueue.

Parameters

Name Type
maxSize number
options? RequestQueueSizeMonitorOptions

Returns

Object

Name Type
isFull () => Promise<boolean>
isStale () => boolean
onValue (callback: ValueCallback<number>) => () => void
refresh () => Promise<number>
shortenToSize <T>(arr: T[]) => Promise<T[]>
value () => null | number | Promise<number>

Defined in

src/lib/io/requestQueue.ts:24


runCrawleeOne

runCrawleeOne<TType, T>(args): Promise<void>

Create an opinionated Crawlee crawler and run it within Apify's Actor.main() context.

The Apify context can be replaced with a custom implementation using the actorConfig.io option.

This function does the following for you:

  1. Full TypeScript coverage - Ensure all components use the same Crawler / CrawlerContext.

  2. Get Actor input from io.getInput(), which by default corresponds to Apify's Actor.getInput().

  3. (Optional) Validate Actor input

  4. Set up the router such that requests that reach the default route are redirected to labelled routes based on which item from "routes" they match.

  5. Register all route handlers for you.

  6. (Optional) Wrap all route handlers in a wrapper. Use this e.g. if you want to add a field to the context object, or handle errors from a single place.

  7. (Optional) Support transformation and filtering of (scraped) entries, configured via Actor input.

  8. (Optional) Support Actor metamorphing, configured via Actor input.

  9. The Apify context (e.g. calling Actor.getInput) can be replaced with a custom implementation using the io option.

Type parameters

Name Type
TType extends "basic" | "http" | "cheerio" | "jsdom" | "playwright" | "puppeteer"
T extends CrawleeOneCtx<CrawlerMeta<TType>["context"], string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>

Parameters

Name Type
args RunCrawleeOneOptions<TType, T>

Returns

Promise<void>

Defined in

src/lib/actor/actor.ts:155


runCrawlerTest

runCrawlerTest<TData, TInput>(«destructured»): Promise<void>

Type parameters

Name Type
TData extends MaybeArray<Dictionary>
TInput TInput

Parameters

Name Type
«destructured» Object
› input TInput
› log? (...args: any[]) => void
› onBatchAddRequests? OnBatchAddRequests
› onDone? (done: () => void) => MaybePromise<void>
› onPushData? (data: any, done: () => void) => MaybePromise<void>
› runCrawler () => MaybePromise<void>
› vi VitestUtils

Returns

Promise<void>

Defined in

src/lib/test/actor.ts:61


scrapeListingEntries

scrapeListingEntries<Ctx, UrlType>(options): Promise<UrlType[]>

Get entries from a listing page (e.g. URLs to profiles that should be scraped later).

Type parameters

Name Type
Ctx extends object
UrlType UrlType

Parameters

Name Type
options ListingPageScraperOptions<Ctx, UrlType>

Returns

Promise<UrlType[]>

Defined in

src/lib/actions/scrapeListing.ts:229


setupDefaultHandlers

setupDefaultHandlers<T, RouterCtx>(«destructured»): Promise<void>

Configures the default router handler to redirect URLs to labelled route handlers based on which route the URL matches first.

NOTE: This means that URLs passed to the default handler are fetched twice, because each URL is requeued for its matching handler. We recommend using this function only when there is a small number of startUrls that nevertheless need different processing depending on their paths or other criteria.
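
The "first matching route wins" rule can be sketched as follows. This is a simplified, self-contained illustration: `SketchRouteMatcher` and `resolveRoute` are hypothetical names, and the JOB_LISTING pattern is an assumption made up for the example. The real matchers are richer and can also carry an `action`.

```typescript
// Hypothetical, simplified route matcher; not crawlee-one's own type.
interface SketchRouteMatcher {
  route: string;
  match: (url: string) => boolean;
}

// Return the label of the first matcher that accepts the URL, or null if none does.
const resolveRoute = (url: string, matchers: SketchRouteMatcher[]): string | null =>
  matchers.find((matcher) => matcher.match(url))?.route ?? null;

const sketchMatchers: SketchRouteMatcher[] = [
  // Main page, e.g. https://www.profesia.sk/?#
  { route: 'MAIN_PAGE', match: (url) => /[\W]profesia\.sk\/?(?:[?#~]|$)/i.test(url) },
  // Assumed pattern for listing pages, for illustration only
  { route: 'JOB_LISTING', match: (url) => /[\W]profesia\.sk\/praca/i.test(url) },
];

console.log(resolveRoute('https://www.profesia.sk/', sketchMatchers)); // MAIN_PAGE
console.log(resolveRoute('https://www.profesia.sk/praca', sketchMatchers)); // JOB_LISTING
console.log(resolveRoute('https://example.com/', sketchMatchers)); // null
```

Because the order of the matchers decides the outcome, more specific routes are best listed before broader ones.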

Type parameters

Name Type
T extends CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> | PuppeteerCrawler | PlaywrightCrawler | JSDOMCrawler | CheerioCrawler | HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>
RouterCtx extends Record<string, any> = CrawleeOneRouteCtx<T>

Parameters

Name Type
«destructured» Object
› input? null | T["input"]
› io T["io"]
› onSetCtx? (ctx: null | Omit<T["context"] & RouterCtx, "request"> & { request: Request<Dictionary> }) => void
› routeHandlerWrappers? CrawleeOneRouteWrapper<T, RouterCtx>[]
› router RouterHandler<T["context"]>
› routerContext? RouterCtx
› routes Record<T["labels"], CrawleeOneRoute<T, RouterCtx>>

Returns

Promise<void>

Example

const routeLabels = {
  MAIN_PAGE: 'MAIN_PAGE',
  JOB_LISTING: 'JOB_LISTING',
  JOB_DETAIL: 'JOB_DETAIL',
  JOB_RELATED_LIST: 'JOB_RELATED_LIST',
  PARTNERS: 'PARTNERS',
} as const;

const router = createPlaywrightRouter();

const routes = createPlaywrightCrawleeOneRouteMatchers<typeof routeLabels>([
 // URLs that match this route are redirected to router.addHandler(routeLabels.MAIN_PAGE)
 {
    route: routeLabels.MAIN_PAGE,
    // Check for main page like https://www.profesia.sk/?#
    match: (url) => url.match(/[\W]profesia\.sk\/?(?:[?#~]|$)/i),
  },

 // Optionally override the logic that assigns the URL to the route by specifying the `action` prop
 {
    route: routeLabels.MAIN_PAGE,
    // Check for main page like https://www.profesia.sk/?#
    match: (url) => url.match(/[\W]profesia\.sk\/?(?:[?#~]|$)/i),
    action: async (ctx) => {
      await ctx.crawler.addRequests([{
        url: 'https://profesia.sk/praca',
        label: routeLabels.JOB_LISTING,
      }]);
    },
  },
]);

// Set up default route to redirect to labelled routes
await setupDefaultHandlers({ router, routes });

// Now set up the labelled routes
await router.addHandler(routeLabels.JOB_LISTING, async (ctx) => { ... });

Defined in

src/lib/router/router.ts:306


setupMockApifyActor

setupMockApifyActor<TInput, TData>(«destructured»): Promise<void>

Type parameters

Name Type
TInput TInput
TData extends MaybeArray<Dictionary> = MaybeArray<Dictionary>

Parameters

Name Type
«destructured» Object
› actorInput? TInput
› log? (...args: any[]) => void
› onBatchAddRequests? OnBatchAddRequests
› onGetInfo? (...args: any[]) => MaybePromise<void>
› onPushData? (data: TData) => MaybePromise<void>
› vi VitestUtils

Returns

Promise<void>

Defined in

src/lib/test/actor.ts:12


validateConfig

validateConfig(config): void

Validate given CrawleeOne config.

Config can be passed directly, or you can specify the path to the config file. For the latter, the config will be loaded using loadConfig.

Parameters

Name Type
config unknown

Returns

void

Defined in

src/cli/commands/config.ts:40