NYC Space/Time Directory ETL tool

Extract/Transform/Load tool for NYC Space/Time Directory data: it loads separate data modules which perform ETL tasks, such as downloading and transforming data to the NYC Space/Time Directory data model.

For more information about the NYC Space/Time Directory project, as well as datasets produced by spacetime-etl, see http://spacetime.nypl.org.

ETL Modules

Space/Time's ETL modules are separate Node.js modules which need to be installed individually. Each ETL module represents a NYC Space/Time Directory dataset or data transformation, and defines a set of steps; spacetime-etl loads these modules, and executes the steps they define.

Some examples:

ETL Module           Description
etl-mapwarper        Outlines of maps from Map Warper, NYPL's tool for georectifying historical maps
etl-group-maps       Map Warper maps, grouped by decade; used by Maps by Decade
etl-spacetime-graph  Graph of all NYC Space/Time Directory datasets
etl-oldnyc           Locations of 40,000 geotagged photos from OldNYC

For more ETL modules, see GitHub.

Configuration

spacetime-etl is configured in the NYC Space/Time Directory configuration file, under the etl key.

The following configuration options are read from the etl key (modulePrefix is optional and defaults to etl-):

Parameter     Description
moduleDir     Path (absolute, or relative to data tool) where spacetime-etl looks for data modules
modulePrefix  Directory prefix used to identify data modules (e.g. etl-mapwarper); default is etl-
outputDir     Directory to which ETL modules write their data

Example:

etl:
  modulePrefix: "etl-"
  moduleDir: /Users/bertspaan/code/etl-modules
  outputDir: /Users/bertspaan/data/spacetime/etl
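
To illustrate how moduleDir and modulePrefix work together, here is a minimal sketch of module discovery. It is not spacetime-etl's actual code: the function name findModuleIds is hypothetical, and the sketch assumes that a module's ID is its directory name with the prefix removed (so etl-mapwarper becomes mapwarper).

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper (not part of spacetime-etl): list the IDs of all
// modules found in moduleDir. Only directories whose name starts with
// modulePrefix are considered; the prefix is stripped to form the ID.
function findModuleIds (moduleDir, modulePrefix = 'etl-') {
  return fs.readdirSync(path.resolve(moduleDir), { withFileTypes: true })
    .filter((entry) => entry.isDirectory() && entry.name.startsWith(modulePrefix))
    .map((entry) => entry.name.slice(modulePrefix.length))
    .sort()
}
```

With the three modules from the installation example cloned into moduleDir, this sketch would return their IDs without the etl- prefix.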

Individual ETL modules can also be configured in the same configuration file, under etl.modules.<module ID>. See the README of each ETL module for the options it supports. Example:

etl:
  modules:
    geonames:
      types:
        PPL: 'st:Place'
        PPLX: 'st:Neighborhood'
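
Each step function receives its module's section of the configuration as its config argument. The sketch below shows one way a step might consume the types mapping from the example above. The helper typeForFeatureCode is hypothetical and is not taken from the geonames module.

```javascript
// Hypothetical lookup (not the geonames module's real code): map a
// GeoNames feature code to a Space/Time type using the module's config.
// Returns undefined when the code is not configured.
function typeForFeatureCode (config, featureCode) {
  const types = (config && config.types) || {}
  return Object.prototype.hasOwnProperty.call(types, featureCode)
    ? types[featureCode]
    : undefined
}

// The object a step would receive for the configuration example above:
const config = { types: { PPL: 'st:Place', PPLX: 'st:Neighborhood' } }
console.log(typeForFeatureCode(config, 'PPLX'))
```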

Usage & Installation

Installing ETL Modules

To use spacetime-etl to run ETL modules, you first need to install them. Go to the directory specified by the moduleDir configuration option, and clone the ETL modules you need, for example:

git clone https://github.com/nypl-spacetime/etl-nyc-wards.git
git clone https://github.com/nypl-spacetime/etl-mapwarper.git
git clone https://github.com/nypl-spacetime/etl-oldnyc.git

Then, install the dependencies of each module:

cd etl-nyc-wards
npm install
cd ..
cd etl-mapwarper
npm install
cd ..
cd etl-oldnyc
npm install

You can now use spacetime-etl to run the three ETL modules you have just installed: nyc-wards, mapwarper and oldnyc.

Command-line Interface

Installation:

npm install -g nypl-spacetime/spacetime-etl

Run the data tool without command-line arguments to get a list of the available data modules:

spacetime-etl

To run one or more ETL modules, provide their IDs as command-line arguments:

spacetime-etl mapwarper oldnyc ...

Alternatively, you can select the processing steps you want to run:

spacetime-etl mapwarper.download

By default, all steps are run consecutively.
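
The module.step notation can be sketched as follows. This is an illustration only: parseArgument and selectSteps are hypothetical names, and the sketch assumes that a step is identified by the name of its function.

```javascript
// Hypothetical parser for arguments such as "mapwarper" or
// "mapwarper.download". A missing step means "run all steps".
function parseArgument (arg) {
  const dot = arg.indexOf('.')
  if (dot === -1) {
    return { moduleId: arg, step: null }
  }
  return { moduleId: arg.slice(0, dot), step: arg.slice(dot + 1) }
}

// Pick the step functions to run, matching on function name (assumption).
function selectSteps (steps, stepName) {
  if (!stepName) return steps
  const selected = steps.filter((step) => step.name === stepName)
  if (selected.length === 0) {
    throw new Error(`Unknown step: ${stepName}`)
  }
  return selected
}
```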

From a Node.js script

Installation:

npm install nypl-spacetime/spacetime-etl

Usage (to run this example, first install etl-mapwarper, see Installing ETL Modules):

const etl = require('spacetime-etl')

// Fetch all installed ETL modules:
const modules = etl.modules()

// Execute all steps:
etl.execute('mapwarper', (err) => {
  if (err) {
    console.error('Error:')
    console.error(err)
  } else {
    console.log('Done!')
  }
})

// Execute a single step:
etl.execute('mapwarper.download', (err) => {
  if (err) {
    console.error('Error:')
    console.error(err)
  } else {
    console.log('Done!')
  }
})
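
The two calls above do not wait for each other. If one run must finish before the next starts, a small wrapper around the callback-style function can help. The sketch below assumes only the (id, callback) shape shown above; makeRunner is a hypothetical helper, not part of spacetime-etl.

```javascript
const { promisify } = require('util')

// Hypothetical helper: wrap an (id, callback) function such as
// (id, cb) => etl.execute(id, cb) and run a list of IDs strictly in order.
// The returned promise rejects on the first error; later IDs are skipped.
function makeRunner (execute) {
  const executeAsync = promisify(execute)
  return async function runAll (ids) {
    for (const id of ids) {
      await executeAsync(id)
    }
  }
}

// Usage (requires spacetime-etl and the modules to be installed):
// const etl = require('spacetime-etl')
// const runAll = makeRunner((id, cb) => etl.execute(id, cb))
// runAll(['mapwarper', 'oldnyc']).then(() => console.log('Done!'), console.error)
```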

The produced data files are written to a subdirectory of the configured output directory, one per step. For the mapwarper module this is <outputDir>/<step>/mapwarper, where <step> is the name of the step.

Creating an ETL module from scratch

It's easy! Let's say we want to write a scraper which, very illegally, reads photos and their metadata from the NYC Municipal Archives Online Gallery.

First, create a directory in spacetime-etl's moduleDir with the following name:

mkdir etl-nyc-municipal-archives

In this directory, create two files:

First, nyc-municipal-archives.dataset.json, holding the metadata of the ETL module and the resulting dataset:

{
  "id": "nyc-municipal-archives",
  "title": "NYC Municipal Archives Online Gallery",
  "license": "CC0",
  "description": "The NYC Municipal Archives Online Gallery provides research access to over 900,000 items digitized from the Municipal Archives' vast holdings, including photographs, maps, motion-pictures and audio recordings",
  "author": "Bert Spaan",
  "website": "http://nycma.lunaimaging.com/luna/servlet"
}
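
The directory name, the file name and the id field all share the same module ID. A quick consistency check could look like the sketch below. The helper readDataset is hypothetical, and the sketch assumes the default etl- prefix. It does not claim to reflect how spacetime-etl itself validates modules.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical check: the directory "etl-<id>" should contain a file
// "<id>.dataset.json" whose "id" field equals <id>.
function readDataset (moduleDir, modulePrefix = 'etl-') {
  const id = path.basename(moduleDir).slice(modulePrefix.length)
  const file = path.join(moduleDir, `${id}.dataset.json`)
  const dataset = JSON.parse(fs.readFileSync(file, 'utf8'))
  if (dataset.id !== id) {
    throw new Error(`Expected id "${id}" in ${file}, found "${dataset.id}"`)
  }
  return dataset
}
```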

The actual code goes in nyc-municipal-archives.js:

function download (config, dirs, tools, callback) {
  // Download data, write data to output directory;
  //   dirs.current contains the path of the
  //   output directory of the current step

  // config object contains configuration from
  // this module's section (if available)

  callback()
}

function transform (config, dirs, tools, callback) {
  // Read downloaded data from output directory;
  //   dirs.download contains the path of the
  //   output directory of the download step

  // Do data transformations, and write the
  //   resulting Space/Time objects to disk
  //   using tools.writer

  const object = {
    type: 'object',
    obj: {
      id: 1,
      type: 'st:Photo',
      data: {
        title: '',
        collection: ''
      },
      geometry: {
        type: 'Point',
        coordinates: [
          -74.014592,
          40.702211
        ]
      }
    }
  }

  tools.writer.writeObject(object, callback)
}

module.exports.steps = [
  download,
  transform
]

You can now run this ETL module with the following command:

spacetime-etl nyc-municipal-archives
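
To make the step contract concrete, here is a sketch of how a runner might call the exported steps one after another. It is not spacetime-etl's implementation: runSteps and its options argument are hypothetical, and the directory layout is assumed to be <outputDir>/<step>/<module ID>. The sketch only computes paths and does not create the directories.

```javascript
const path = require('path')

// Hypothetical runner: call each step in order. Every step receives a
// `dirs` object with one entry per step name, plus `current` pointing at
// the output directory of the step being run.
function runSteps (moduleId, steps, options, callback) {
  const dirs = {}
  steps.forEach((step) => {
    dirs[step.name] = path.join(options.outputDir, step.name, moduleId)
  })

  let index = 0
  function next (err) {
    // Stop on the first error, or when all steps have run
    if (err || index === steps.length) {
      return callback(err || null)
    }
    const step = steps[index++]
    const stepDirs = Object.assign({}, dirs, { current: dirs[step.name] })
    step(options.config || {}, stepDirs, options.tools || {}, next)
  }
  next()
}
```

In this sketch, the transform step sees dirs.download pointing at the directory that was dirs.current during the download step, matching the comments in the module code above.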

Copyright (C) 2015 Waag Society, 2017 The New York Public Library
