Algolia Webcrawler

Simple node worker that crawls sitemaps in order to keep an Algolia index up-to-date.

It uses simple CSS selectors in order to find the actual text content to index.

This app uses Algolia's library.

TL;DR

  1. Usage
  2. Prerequisites
  3. Installation
  4. Running
  5. Configuration file
  6. Configuration options
  7. Stored Object
  8. Indexing
  9. License

Usage

This script should be run via crontab in order to crawl the entire website at regular intervals.

Prerequisites

  1. At least one valid sitemap.xml URL that contains all the URLs you want to be indexed.
  2. The sitemap(s) must contain at least the <loc> node, i.e. urlset/url/loc.
  3. An empty Algolia index.
  4. An Algolia API key that can create objects and set settings on the index, i.e. with the following permissions: search, addObject, settings, browse, deleteObject, editSettings, deleteIndex.
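
To illustrate the urlset/url/loc structure required above, here is a minimal sketch that collects the <loc> values from a sitemap. The helper name extractLocs and the sample URLs are hypothetical, and the regular expression is a simplification for illustration; it is not necessarily how this project parses sitemaps.

```javascript
// Hypothetical helper: collect the text of every <loc> node in a sitemap.
// A regular expression is enough for a sketch; it is not a full XML parser.
const extractLocs = (xml) => {
	const locs = [];
	const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
	let match;
	while ((match = pattern.exec(xml)) !== null) {
		locs.push(match[1]);
	}
	return locs;
};

// Sample sitemap with placeholder URLs.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url><loc>https://www.example.com/</loc></url>
	<url><loc>https://www.example.com/about</loc></url>
</urlset>`;

console.log(extractLocs(sampleSitemap));
```

Each URL found this way is a page that would be crawled and indexed.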

Installation

  1. Get the latest version
    • npm: npm i algolia-webcrawler -g
    • git
      • ssh+git: git clone git@github.com:DeuxHuitHuit/algolia-webcrawler.git
      • https: git clone https://github.com/DeuxHuitHuit/algolia-webcrawler.git
    • https: download the latest tarball
  2. create a config.json file

Running

npm

algolia-webcrawler --config config.json

other

cd to the root of the project and run node app.

Configuration file

Configuration is done via the config.json file.

You can choose a config.json file stored elsewhere using the --config flag.

node app --config my-config.json

Configuration options

At the bare minimum, edit config.json to set values for the following options: 'app', 'cred', 'index.name' and at least one object in 'sitemaps'. If you have multiple sitemaps, please list them all: sub-sitemaps will not be crawled.

All options are required. No defaults are provided.
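
As a rough illustration, a complete configuration could look like the object below. Only the key names come from the options documented in this section; every value (app name, credentials, URLs, selectors, delays) is a placeholder assumption, not a default shipped with the project. It is written as a JavaScript object so that it can be serialized into config.json.

```javascript
// Placeholder values throughout: replace them with your own settings.
const config = {
	app: 'My Website Crawler',
	cred: { appid: 'YOUR_APP_ID', apikey: 'YOUR_API_KEY' },
	delayBetweenRequests: 250, // milliseconds
	oldentries: 86400, // seconds; 0 keeps old entries forever
	index: {
		name: 'my-index',
		settings: {
			attributesToIndex: ['title', 'description', 'text'],
			attributesForFaceting: ['lang']
		}
	},
	sitemaps: [{ url: 'https://www.example.com/sitemap.xml', lang: 'en' }],
	http: { auth: '' }, // empty string when no auth is needed
	selectors: {
		title: 'h1',
		description: '.summary',
		image: 'article img',
		text: 'article p'
	},
	formatters: { title: ' | My Website' },
	types: {},
	defaults: {},
	plugins: [],
	blacklist: ['/404']
};

// Print the JSON so it can be saved as config.json.
console.log(JSON.stringify(config, null, '\t'));
```

Running this sketch with node and redirecting the output to a file gives a starting point that can then be adjusted option by option.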

app: String

The name of your app.

cred: Object

Algolia credentials object. See 'cred.appid' and 'cred.apikey'.

cred.appid: String

Your Algolia App ID.

cred.apikey: String

Your generated Algolia API key.

delayBetweenRequests: Integer

Delay, in milliseconds, between requests made to the website.

oldentries: Integer

The maximum number of seconds an entry can live without being updated. After each run, the app will search for old entries and delete them. If you do not wish to get rid of old entries, set this value to 0.
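
The rule described above can be sketched as a small age check. The helper name isOldEntry is hypothetical, and comparing against the record's date attribute is an assumption about how the cleanup works, not a description of the actual implementation.

```javascript
// Hypothetical helper: is a record older than `oldentries` seconds?
const isOldEntry = (recordDate, oldentries, now = new Date()) => {
	if (oldentries === 0) {
		return false; // 0 disables the cleanup
	}
	const ageInSeconds = (now.getTime() - new Date(recordDate).getTime()) / 1000;
	return ageInSeconds > oldentries;
};

const referenceTime = new Date('2017-01-02T00:00:00Z');
console.log(isOldEntry('2017-01-01T00:00:00Z', 3600, referenceTime)); // true (one day old)
console.log(isOldEntry('2017-01-01T23:30:00Z', 3600, referenceTime)); // false (30 minutes old)
console.log(isOldEntry('2010-01-01T00:00:00Z', 0, referenceTime)); // false (cleanup disabled)
```

When the crawler runs from crontab, a value larger than the interval between two runs avoids deleting entries that are simply waiting for the next crawl.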

index: Object

An object containing various values related to your index.

index.name: String

Your index name.

index.settings: Object

An object that will be passed as the argument to Algolia's Index#setSettings method.

Please read Algolia's documentation on that subject. Any valid attribute documented for this method can be used.

index.settings.attributesToIndex: Array

An array of strings that defines which attributes are indexable, which means that full text search will be performed against them. For a complete list of possible attributes see the Stored Object section.

index.settings.attributesForFaceting: Array

An array of strings that defines which attributes are filterable, which means that you can use them to exclude some records from being returned. For a complete list of possible attributes see the Stored Object section.

sitemaps: Array

This array should contain a list of sitemap objects.

A sitemap is a simple object with two String properties: url and lang. The 'url' property is the exact URL of the sitemap. The 'lang' property should state the main language used by the URLs found in the sitemap.

http: Object

An object containing different http options.

http.auth: String

The auth string, in node's username:password form. If you do not need auth, you still need to specify an empty String.
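
For reference, a username:password string maps to an HTTP Basic Authorization header as sketched below. The helper name buildRequestOptions is hypothetical; the crawler may simply hand the string to Node's own auth request option, which computes the same header.

```javascript
// Hypothetical helper: turn the http.auth setting into request headers.
const buildRequestOptions = (auth) => {
	const headers = {};
	if (auth) {
		// Basic auth is the base64 encoding of "username:password".
		headers.Authorization = 'Basic ' + Buffer.from(auth).toString('base64');
	}
	return { headers };
};

console.log(buildRequestOptions('user:pass')); // { headers: { Authorization: 'Basic dXNlcjpwYXNz' } }
console.log(buildRequestOptions('')); // { headers: {} }
```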

selectors: Object

An object containing CSS selectors used to find the content in the page's HTML.

selectors.title: String

CSS selector for the title of the page.

selectors.description: String

CSS selector for the description of the page.

selectors.image: String

CSS selector for the image of the page.

selectors.text: String

CSS selector for the text content of the page.

selectors[key]: String

CSS selector for the "key" property. You can add custom keys as you wish.

formatters: Object

An object containing formatter strings. Their values are removed from the original result obtained with the associated CSS selector.

formatters.title: String,Array

The string to remove from the title of the page. Can also be an array of strings.

formatters[key]: String,Array

The string to remove from the specified key. Can also be an array of strings.
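
A minimal sketch of what a formatter does, assuming every occurrence of each string is removed (the source does not say whether only the first occurrence is). The helper name applyFormatter and the sample titles are hypothetical.

```javascript
// Hypothetical helper: remove one string, or each string of an array, from a value.
const applyFormatter = (value, formatter) => {
	const toRemove = Array.isArray(formatter) ? formatter : [formatter];
	// split/join removes every occurrence without regular expression escaping.
	return toRemove.reduce((result, str) => result.split(str).join(''), value);
};

console.log(applyFormatter('Home | My Website', ' | My Website')); // Home
console.log(applyFormatter('News - Blog | Site', [' - Blog', ' | Site'])); // News
```

A typical use is stripping the site name suffix that is repeated in every page title.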

types[key]: String

The parse function used to format the value. Supported types are "integer", "float" and "json".
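
The three supported types could map to standard JavaScript parsing functions as sketched below. The helper name applyType and the exact parsing calls are assumptions; the source only names the types.

```javascript
// Assumed mapping from type name to a standard parsing function.
const parsers = {
	integer: (value) => parseInt(value, 10),
	float: (value) => parseFloat(value),
	json: (value) => JSON.parse(value)
};

// Hypothetical helper: parse a value according to its configured type.
const applyType = (value, type) => {
	const known = Object.prototype.hasOwnProperty.call(parsers, type);
	return known ? parsers[type](value) : value; // unknown types are left untouched
};

console.log(applyType('42', 'integer')); // 42
console.log(applyType('3.5', 'float')); // 3.5
console.log(applyType('{"a":1}', 'json')); // { a: 1 }
```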

defaults[key]: String

The default value inserted for the specified key. Will be set if the value is falsy.
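
The falsy check can be sketched as follows. The helper name applyDefaults and the sample record are hypothetical. Note that with this rule an empty string, 0 or null are all replaced by the default.

```javascript
// Hypothetical helper: fill in defaults for keys whose value is falsy.
const applyDefaults = (record, defaults) => {
	Object.keys(defaults).forEach((key) => {
		if (!record[key]) {
			record[key] = defaults[key];
		}
	});
	return record;
};

const sample = applyDefaults({ title: 'Home', image: '' }, { image: '/default.png', title: 'Untitled' });
console.log(sample); // { title: 'Home', image: '/default.png' }
```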

plugins: Array

A list of JavaScript files containing custom code to run before the record is saved. Each file must export a function with the following signature, where record is the object to be saved and data is the page's HTML.

module.exports = (record, data) => {
	record.value_from_plugin = 'Yay!';
};

blacklist: Array

All URLs are checked against all items in the blacklist. If the complete URL or its path component is in the blacklist, it won't get indexed.
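
The check described above can be sketched with Node's built-in URL class. The helper name isBlacklisted is hypothetical, and exact string matching (no wildcards or prefixes) is an assumption.

```javascript
// Hypothetical helper: is the complete URL or its path component blacklisted?
const isBlacklisted = (url, blacklist) => {
	const path = new URL(url).pathname;
	return blacklist.includes(url) || blacklist.includes(path);
};

const blacklist = ['/private', 'https://www.example.com/old-page'];
console.log(isBlacklisted('https://www.example.com/private', blacklist)); // true (path match)
console.log(isBlacklisted('https://www.example.com/old-page', blacklist)); // true (full URL match)
console.log(isBlacklisted('https://www.example.com/private/page', blacklist)); // false (no exact match)
```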

Stored Object

The object stored on Algolia's servers is as follows:

{
	date: new Date(),
	url: 'http://...',
	objectID: shasum.digest('base64'),
	lang: sitemap.lang,
	title: '',
	description: '',
	image: '',
	text: ['...']
}

Note that text is an array, since we tried to preserve the relationship between each original text node and its actual value. Algolia handles this just fine.
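
As an illustration of how such a record might be assembled, the sketch below derives objectID from the page URL with Node's crypto module. Hashing the URL with SHA-1 is an assumption (the source only shows shasum.digest('base64')), and the helper name createRecord is hypothetical.

```javascript
const crypto = require('crypto');

// Hypothetical helper: build an empty record for a given page.
const createRecord = (url, lang) => {
	// Assumption: the objectID is a SHA-1 digest of the URL.
	const shasum = crypto.createHash('sha1');
	shasum.update(url);
	return {
		date: new Date(),
		url: url,
		objectID: shasum.digest('base64'),
		lang: lang,
		title: '',
		description: '',
		image: '',
		text: []
	};
};

console.log(createRecord('https://www.example.com/about', 'en'));
```

With a deterministic objectID, crawling the same URL again would update the existing record rather than create a duplicate.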

Indexing

Indexing is done automatically, at each run. To tweak how indexing works, please see the index.settings configuration option.

LICENSE

MIT
Made with love in Montréal by Deux Huit Huit
Copyrights (c) 2014-2017
