data_miner

Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

Real-world usage

We use data_miner for data science at Brighter Planet and in production at

The killer combination for us is:

  1. active_record_inline_schema - define table structure
  2. remote_table - download data and parse it
  3. errata - apply corrections in a transparent way
  4. data_miner (this library!) - import data idempotently
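
As a rough illustration of what "idempotently" means in step 4, here is a plain-Ruby sketch that does not use data_miner or ActiveRecord. The `TABLE` hash, the `upsert` and `import!` helpers, and the sample rows are all hypothetical stand-ins; the point is only that keying every row on a primary key makes a second run a no-op.

```ruby
# Plain-Ruby sketch of an idempotent import (NOT data_miner's internals).
# TABLE is a hypothetical stand-in for a database table, keyed by primary key.
TABLE = {}

# Hypothetical source rows, as they might look after download and parsing.
ROWS = [
  { :code => 'US', :name => 'United States' },
  { :code => 'FR', :name => 'France' }
]

# Hypothetical helper: create the record if the key is new, else update it.
def upsert(table, key, attributes)
  record = (table[key] ||= {})
  record.merge!(attributes)
  record
end

# Import every row by key, so re-running never creates duplicates.
def import!(table, rows)
  rows.each { |row| upsert(table, row[:code], :name => row[:name]) }
  table
end

import!(TABLE, ROWS)
first_pass = Marshal.load(Marshal.dump(TABLE)) # deep copy for comparison
import!(TABLE, ROWS)                           # second run changes nothing

puts TABLE.size          # 2
puts TABLE == first_pass # true
```

Running the import twice leaves the same two records, which is the property that makes it safe to re-run an import whenever the upstream data changes.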

Documentation

Check out the extensive documentation.

Quick start

You define data_miner blocks in your ActiveRecord models. For example, in app/models/country.rb:

class Country < ActiveRecord::Base
  self.primary_key = 'iso_3166_code'

  data_miner do
    import("OpenGeoCode.org's Country Codes to Country Names list",
           :url => 'http://opengeocode.org/download/countrynames.txt',
           :format => :delimited,
           :delimiter => '; ',
           :headers => false,
           :skip => 22) do
      key   :iso_3166_code, :field_number => 0
      store :iso_3166_alpha_3_code, :field_number => 1
      store :iso_3166_numeric_code, :field_number => 2
      store :name, :field_number => 5
    end
  end
end

Now you can run:

>> Country.run_data_miner!
=> nil
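
To make the `:skip`, `:delimiter`, and `:field_number` options more concrete, the following plain-Ruby sketch approximates the mapping they describe. It is an illustration, not how data_miner or remote_table is implemented: the `parse` helper and the `FIELDS` constant are hypothetical, the sample text is invented, and it skips 2 lines where the real declaration skips 22.

```ruby
# Illustration only: plain Ruby approximating what the import options describe.
# The sample text is invented and is not the layout of the real download.
SAMPLE = <<-TEXT
# comment line 1
# comment line 2
US; USA; 840; x; y; United States
FR; FRA; 250; x; y; France
TEXT

# Attribute name => zero-based field number, mirroring the key/store lines.
FIELDS = {
  :iso_3166_code         => 0,
  :iso_3166_alpha_3_code => 1,
  :iso_3166_numeric_code => 2,
  :name                  => 5
}

# Drop the leading lines, split each remaining line on the delimiter,
# and pick out the numbered fields (no header row is used).
def parse(text, skip, delimiter)
  text.lines.drop(skip).map do |line|
    values = line.chomp.split(delimiter)
    FIELDS.each_with_object({}) { |(attr, i), row| row[attr] = values[i] }
  end
end

rows = parse(SAMPLE, 2, '; ')
puts rows.first[:name] # United States
```

Each resulting hash corresponds to one record, with `:iso_3166_code` playing the role of the key that identifies which row to create or update.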

More advanced usage

The earth library has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:

| Model | Highlights | Reference |
| --- | --- | --- |
| Aircraft | parsing Microsoft Frontpage HTML (!) | data_miner.rb |
| Airports | forcing column names and use of :select block (Proc) | data_miner.rb |
| Automobile model variants | super advanced usage of "custom parser" and errata | data_miner.rb |
| Country | parsing CSV and a few other tricks | data_miner.rb |
| EGRID regions | parsing XLS | data_miner.rb |
| Flight segment (stage) | super advanced usage of POSTing form data | data_miner.rb |
| Zip codes | downloading a ZIP file and pulling an XLSX out of it | data_miner.rb |

There are many more; look for the data_miner.rb file that corresponds to each model. Note that you would normally put the data_miner declaration right inside the ActiveRecord model file. In earth it's kept separate so that loading it is optional.
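
For readers who want a feel for the extract, parse, and correct stages before opening the earth examples, here is a minimal sketch using only Ruby's standard library. It is not how earth or data_miner does it: gzip stands in for the ZIP/TAR/BZ2 handling, the payload is built in memory instead of downloaded, and `PAYLOAD`, `CORRECTIONS`, `extract`, `parse`, and `correct` are hypothetical names (the corrections hash plays the role that errata plays in the real stack). `Zlib.gzip` and `Zlib.gunzip` require Ruby 2.4 or later.

```ruby
require 'zlib'

# Stand-in for a downloaded archive: a gzip payload built in memory.
# The data, including the deliberate typo, is invented for illustration.
PAYLOAD = Zlib.gzip("code,name\nUS,Untied States\nFR,France\n")

# Hypothetical corrections table: known-bad value => corrected value.
CORRECTIONS = { 'Untied States' => 'United States' }

# Pull the text out of the archive.
def extract(payload)
  Zlib.gunzip(payload).force_encoding('UTF-8')
end

# Turn comma-separated text with a header row into an array of hashes.
def parse(text)
  header, *lines = text.lines.map(&:chomp)
  keys = header.split(',').map(&:to_sym)
  lines.map { |line| Hash[keys.zip(line.split(','))] }
end

# Apply corrections to the :name column, leaving other values untouched.
def correct(rows)
  rows.map do |row|
    row.merge(:name => CORRECTIONS.fetch(row[:name], row[:name]))
  end
end

rows = correct(parse(extract(PAYLOAD)))
rows.each { |row| puts "#{row[:code]}: #{row[:name]}" }
```

Keeping the corrections in a separate table, instead of editing the source data, means the fix is visible and is reapplied every time the upstream file is fetched again.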

Authors

Copyright

Copyright (c) 2012 Brighter Planet. See LICENSE for details.
