The National Data Catalog, version 2.

National Data Catalog

This is the source code for the National Data Catalog. The current version (v2) is intended to be more social and more collaborative than v1.


Version 1 of the National Data Catalog consisted of multiple applications (a Sinatra API, a Rails 2 web app, and many importers) that worked together.

This version (v2) is instead a single Rails 3 application. That makes it easier to modify and maintain, and easier to customize and install if you want to run your own data catalog.

Changes since Version 1

Users

  • Each User has a dedicated page.
  • Each User page has an activity stream.
  • Users can watch DataSets.
  • Users can follow other Users.

Sites

  • Sites, the external web sites that we import from, are top-level resources.
  • Sites have useful metadata describing their 'goodness' as data catalogs.

Catalogs

  • A Catalog is a grouping of DataSets.
  • The application can contain multiple Catalogs.
  • Each Catalog can be administered by its owners and curators.

Organizations

  • An Organization can group other Organizations when needed. Two examples are the "U.S. District Courts" and the "U.S. Executive Departments".

DataSets

  • DataSets can aggregate other DataSets. (This is a useful way to bundle up very similar DataSets, such as the Toxic Release Inventory data sets on
  • DataSets can be rated along three dimensions: Interestingness, Documentation Quality, and Data Quality.
  • Users can suggest DataSets and provide as much metadata as they like.
  • Users can suggest 'missing' DataSets.
  • Each DataSet has an activity stream.
  • DataSets can belong to more than one Catalog.
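
As a rough illustration of rating along three dimensions, the sketch below averages ratings per dimension in plain Ruby. The dimension names come from the list above; the 1 to 5 scale, the hash layout, and the `average_ratings` helper are assumptions made for this example and are not taken from the application's models.

```ruby
# Hypothetical sketch: only the three dimension names come from the README.
# The rating scale and data layout are assumptions for illustration.
DIMENSIONS = [:interestingness, :documentation_quality, :data_quality]

# Averages a list of rating hashes per dimension, skipping dimensions
# that a given user left unrated. Returns nil for a dimension with no ratings.
def average_ratings(ratings)
  DIMENSIONS.each_with_object({}) do |dimension, averages|
    values = ratings.map { |rating| rating[dimension] }.compact
    averages[dimension] = values.empty? ? nil : values.inject(:+).to_f / values.size
  end
end

ratings = [
  { :interestingness => 5, :documentation_quality => 2, :data_quality => 4 },
  { :interestingness => 3, :data_quality => 4 }
]

p average_ratings(ratings)
```

Keeping the dimensions separate, rather than collapsing them into one score, lets a DataSet be interesting yet poorly documented without one rating hiding the other.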

Locations

  • Locations, geographical locations, are top-level resources.
  • Locations are hierarchical.
  • Locations are a better way to convey the ideas of:
    • jurisdiction of an Organization
    • geographic coverage of a DataSet

Installation / Setup

Install the National Data Catalog like a typical Rails 3 application. The notes below cover the less common parts:

  1. We highly recommend that you use the latest stable version of Ruby. As of this writing we are using 1.9.2p136. While the app may run on Ruby 1.8.7, we will not be developing or testing it for that environment.
  2. Download and run MongoDB.
  3. Configure Mongoid.
    • Adjust config/mongoid.yml as necessary.
  4. Install and run Redis.
    • On Mac OS X, we recommend homebrew:
      • brew install redis
    • On Ubuntu 10.10, we recommend:
      • sudo apt-get install redis-server
  5. Load data into the database:
    • For quicker testing:
      • rake db:reset db:seed db:seed_examples
    • For a full setup:
      • rake db:reset db:seed db:seed_examples import:*
  6. Start the background processing system:
    • QUEUE=* rake resque:work
    • rake resque:scheduler
    • Optionally: resque-web
  7. Start the Rails server:
    • rails s
  8. Optional: if you are customizing or modifying the source code, fire up autotest:
    • autotest
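
Before loading data or starting the workers, it can help to confirm that MongoDB and Redis are actually running. The script below is a small sketch, not part of the application: it assumes both services listen on localhost at their default ports (27017 for MongoDB, 6379 for Redis), and the `port_open?` helper is a name invented for this example. Adjust the host and ports if your config/mongoid.yml or Redis setup differs.

```ruby
require "socket"
require "timeout"

# Hypothetical helper: returns true if something accepts TCP connections
# on host:port within the given number of seconds.
def port_open?(host, port, seconds = 1)
  Timeout.timeout(seconds) do
    TCPSocket.new(host, port).close
    true
  end
rescue SystemCallError, SocketError, Timeout::Error
  false
end

# Assumed default ports; change these if your services are configured differently.
{ "MongoDB" => 27017, "Redis" => 6379 }.each do |service, port|
  status = port_open?("127.0.0.1", port) ? "reachable" : "not reachable"
  puts "#{service} (port #{port}): #{status}"
end
```

A successful connection only shows that a process is listening on the port; it does not verify credentials or database contents.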

Notable Directories

Our directory structure follows the Rails conventions; however, we have a few differences that are worth highlighting:

  1. Sass templates live in app/sass (which is different from the default location of public/stylesheets/sass). (Sass is an extension of CSS3 that adds nested rules, variables, mixins, and selector inheritance.)
  2. Importers (which import DataSets from external Sites) are kept in lib/importers.
  3. Cached gravatars live in public/images/gravatars.
  4. Delayed processing logic (for Resque) lives in lib/resque.
  5. Some Mongoid models take advantage of MongoDB's map/reduce. The map and reduce functions live in JavaScript files located in app/models/{model}/{method}. Keeping the JavaScript functions out of the Ruby models helps your text editor: it allows for accurate syntax highlighting and JSLint checking.
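
One way to picture the map/reduce layout from item 5 is a helper that reads the function sources from disk. The sketch below is an assumption-laden illustration: the `map_reduce_sources` helper, the file names map.js and reduce.js, and the `data_set`/`tag_counts` example are all hypothetical, since only the directory pattern app/models/{model}/{method} is given above. The demo builds a throwaway directory so it runs outside the application.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: reads the map and reduce function sources for a
# model method from app/models/{model}/{method}. The file names map.js and
# reduce.js are assumptions; only the directory layout is documented.
def map_reduce_sources(root, model, method)
  dir = File.join(root, "app", "models", model.to_s, method.to_s)
  {
    :map    => File.read(File.join(dir, "map.js")),
    :reduce => File.read(File.join(dir, "reduce.js"))
  }
end

# Demo against a temporary directory that mimics the layout.
Dir.mktmpdir do |root|
  dir = File.join(root, "app", "models", "data_set", "tag_counts")
  FileUtils.mkdir_p(dir)
  File.write(File.join(dir, "map.js"), "function() { emit(this.tag, 1); }")
  File.write(File.join(dir, "reduce.js"),
             "function(key, values) { return values.length; }")

  sources = map_reduce_sources(root, :data_set, :tag_counts)
  puts sources[:map]
  puts sources[:reduce]
end
```

Inside the Rails application the root would presumably be `Rails.root.to_s`, and the returned strings would be passed to whatever map/reduce call your MongoDB driver exposes.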