# quickstart

As described in the README, a strategy requires an extractor, a transformer, or both:

  • every run of the extractor's update function should pull a chunk of data from the source, be it a page, a time range, etc.
  • once the extraction process is finished, the transformation part begins. It is executed on a per-line basis (onLine) and should prepare the data to comply with the schema of the final dataset. A sketch of the overall shape follows this list.
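
A minimal sketch of what a strategy might look like, assuming it exposes init, update, and onLine functions (the names come from this guide; the exact export shape depends on the project's strategy interface):

```js
// Hypothetical skeleton of a strategy; the exact export shape may differ.
module.exports = {
  // Return the first message(s) telling the core what to fetch.
  init: () => ({ write: null, messages: [/* first chunk to crawl */] }),

  // Called with the result of each crawl; decides what to store and what to fetch next.
  update: (message) => ({ write: null, messages: [] }),

  // Called once per stored line during transformation; shapes the data for the final dataset.
  onLine: (line) => line,
};
```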

Depending on the data source, different initial configurations may be needed.

## https strategy

A web2 strategy crawls data from an HTTPS source (RSS feeds, APIs, etc.).

This guide will show you the steps to crawl XKCD.

First off, the core expects, both for init and update, a message compliant with the schema definition for https strategies.

The init function message should provide the starting state of the crawling. For XKCD, we would write something like:

```js
{
  write: null,
  messages: [
    {
      type: "https",
      version,
      options: {
        url: "https://xkcd.com/1/info.0.json",
        method: "GET",
        headers: null,
        body: null
      },
    },
  ],
}
```

  • write: the data to be written, in string format; null during init, as there is no data yet
  • type: the protocol; must be https for this type of strategy
  • version: can be anything, it's currently not used
  • options:
    • url: the URL pointing to the first chunk to crawl. The first page on XKCD is 1
    • method: the HTTP method (GET or POST)
    • headers: if any specific header should be sent to the source, it should be specified here (for instance, authorization headers)
    • body: the request body, if any; the value must be a string
  • results: this field will be used by the core to pass the output of the crawling to the update function
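
Wrapped in the strategy's init function, this becomes (a sketch, reusing the hypothetical shape from the skeleton above; version can be any string):

```js
// Sketch: init returns the starting state of the crawl.
const version = "1.0"; // assumption: any value works, it's currently not used

function init() {
  return {
    write: null,
    messages: [
      {
        type: "https",
        version,
        options: {
          url: "https://xkcd.com/1/info.0.json",
          method: "GET",
          headers: null,
          body: null,
        },
      },
    ],
  };
}
```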

Once the init message has been returned, the core will fetch the first chunk of data and pass it to update. The update function can perform operations on the result and store it and/or fetch more data by returning messages in a similar way to init. For each non-exit message returned by update, the function is called again with the results of the crawl.

The function, ideally, should take care of the following:

  • validate the outcome of the crawl
  • validate the data against a schema
  • prepare the data for storing. This step should make sure that any future consumer of the data will find all that's required to process it.
  • define if and what to crawl next

The core-provided message will contain the following data along with the original message:

```
{
  "error": string,
  "results": object
}
```

  • error: null or the crawling error message
  • results: the direct output of the crawler
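
For XKCD, results is the parsed JSON of the requested page. An abbreviated, illustrative example of such a message (field values are illustrative, the real payload contains more fields, and the original message fields are omitted here):

```json
{
  "error": null,
  "results": {
    "num": 1,
    "title": "Barrel - Part 1",
    "img": "https://imgs.xkcd.com/comics/barrel_cropped_(1).jpg",
    "alt": "Don't we all."
  }
}
```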

The strategy will have to:

### Validate the outcome of the crawl

```js
if (message.error) {
  // handle the error
  console.error(message.error);

  // continue the crawling if possible (in this case, we are not able to retrieve the next page)
  return {
    write: null,
    messages: [],
  };
}
```

### Validate the data

```js
const data = message.results;

// validate() returns true when the data matches the schema
if (!validate(data)) {
  console.error(validate.errors);
  return {
    type: "exit",
    version: "1.0"
  };
}
```
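
The guide does not show where validate comes from. One common way to get a validator with this validate(data) / validate.errors interface is Ajv; the schema below is a made-up example covering only the fields this guide relies on, not the project's actual schema:

```js
// Assumption: Ajv is used; any JSON Schema validator exposing the same interface works.
const Ajv = require("ajv");
const ajv = new Ajv();

// Hypothetical schema for the XKCD payload.
const schema = {
  type: "object",
  properties: {
    num: { type: "integer" },
    title: { type: "string" },
    img: { type: "string" },
  },
  required: ["num", "title", "img"],
};

const validate = ajv.compile(schema);
```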

### Prepare the data for storage

```js
// Simply dumping the data into a string is sufficient in this case
const toBeStored = JSON.stringify(data);
```

### Define if and what to crawl next

```js
// Let's assume we want to crawl up to MAX_PAGE
const { num } = message.results;
if (num >= MAX_PAGE) {
  return {
    type: "exit",
    version: "1.0"
  };
}

// Instruct core to crawl next page
const options = {
  url: templateURI(num + 1),
  method: "GET"
};
```
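
templateURI and MAX_PAGE are not defined in this snippet. Based on the URL used during init, they could look like this (a sketch; both names and the page limit are assumptions):

```js
// Hypothetical helpers: build the info URL for a given comic number and cap the crawl.
const MAX_PAGE = 100;
const templateURI = (page) => `https://xkcd.com/${page}/info.0.json`;
```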

### Return the message back to the core

```js
return {
  write: toBeStored,
  messages: [
    {
      type: "https",
      version,
      options,
    },
  ],
};
```
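
Putting the steps together, a complete update function could look like the following sketch (it reuses the hypothetical validate, templateURI, MAX_PAGE, and version from above; the exact return shapes should follow the project's schema):

```js
function update(message) {
  // 1. validate the outcome of the crawl
  if (message.error) {
    console.error(message.error);
    return { write: null, messages: [] };
  }

  // 2. validate the data against the schema
  const data = message.results;
  if (!validate(data)) {
    console.error(validate.errors);
    return { type: "exit", version: "1.0" };
  }

  // 3. prepare the data for storing
  const toBeStored = JSON.stringify(data);

  // 4. define if and what to crawl next
  const { num } = data;
  if (num >= MAX_PAGE) {
    return { type: "exit", version: "1.0" };
  }

  // 5. store this page and instruct the core to crawl the next one
  return {
    write: toBeStored,
    messages: [
      {
        type: "https",
        version,
        options: { url: templateURI(num + 1), method: "GET" },
      },
    ],
  };
}
```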

Check out the code for more details.