MongoDB Connector for Hadoop

Purpose

The MongoDB Connector for Hadoop is a library that allows MongoDB (or backup files in its data format, BSON) to be used as an input source or output destination for Hadoop MapReduce jobs. It is designed to provide greater flexibility and performance and to make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem.

Current stable release: 1.2.0
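
To make this concrete, here is a minimal word-count sketch showing how the connector plugs into an ordinary MapReduce job. MongoInputFormat, MongoOutputFormat, and MongoConfigUtil are the connector's classes; the database/collection URIs (demo.in, demo.out) and the documents' "text" field are hypothetical placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class WordCount {

    // MongoInputFormat hands each mapper a (_id, document) pair.
    public static class TokenMapper
            extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, BSONObject doc, Context ctx)
                throws IOException, InterruptedException {
            Object text = doc.get("text"); // hypothetical field name
            if (text == null) return;
            for (String token : text.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read documents from demo.in; write results to demo.out.
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/demo.in");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27017/demo.out");

        Job job = new Job(conf, "mongo word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```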

Features

  • Can create data splits to read from standalone, replica set, or sharded MongoDB configurations
  • Source data can be filtered with queries using the MongoDB query language (see the sketch after this list)
  • Supports Hadoop Streaming, allowing job code to be written in any language (Python, Ruby, and Node.js are currently supported)
  • Can read data from MongoDB backup files residing on S3, HDFS, or local filesystems
  • Can write data out in .bson format, which can then be imported into any MongoDB database with mongorestore
  • Works with BSON/MongoDB documents in other Hadoop tools such as Pig and Hive
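
Query filtering, for instance, is purely a matter of job configuration. Here is a minimal sketch, assuming the connector's MongoConfigUtil helper and its mongo.input.query property; the input URI and the query itself ({"year": {"$gte": 2000}}) are hypothetical examples.

```java
import org.apache.hadoop.conf.Configuration;

import com.mongodb.hadoop.util.MongoConfigUtil;

public class QueryFilterExample {
    public static Configuration filteredConf() {
        Configuration conf = new Configuration();
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/demo.in");
        // Only documents matching this query become map input; MongoDB
        // applies the filter server-side as the splits are read.
        MongoConfigUtil.setQuery(conf, "{\"year\": {\"$gte\": 2000}}");
        // Equivalently, set the raw property directly:
        // conf.set("mongo.input.query", "{\"year\": {\"$gte\": 2000}}");
        return conf;
    }
}
```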

Download

See the release page.

Building

The mongo-hadoop connector currently supports the following versions of Hadoop: 0.23, 1.0, 1.1, 2.2, 2.3, and CDH 4. The default build targets the latest Apache Hadoop release (currently 2.3). To build against a specific version of Hadoop, pass -Phadoop_version=<your version>.

Then run ./gradlew jar to build the jars. The jars are placed in each module's build/libs directory; for example, the core module's jar is generated in core/build/libs.

After the build succeeds, copy the jars to the lib directory on each node in your Hadoop cluster. This is usually one of the following locations, depending on which Hadoop release you are using:

  • $HADOOP_HOME/lib/
  • $HADOOP_HOME/share/hadoop/mapreduce/
  • $HADOOP_HOME/share/hadoop/lib/

Supported Distributions of Hadoop

Apache Hadoop 1.0

Does not support Hadoop Streaming.

Build using -Phadoop_version=1.0

Apache Hadoop 1.1

Includes support for Hadoop Streaming.

Build using -Phadoop_version=1.1

Apache Hadoop 0.23

Includes support for Hadoop Streaming.

Build using -Phadoop_version=0.23

Cloudera Distribution for Hadoop Release 4

This is the newest release from Cloudera, which is based on Apache Hadoop 2.0. The newer MR2/YARN APIs are not yet supported, but MR1 is still fully compatible.

Includes support for Hadoop Streaming.

Build using -Phadoop_version=cdh4

Apache Hadoop 2.2

Includes support for Hadoop Streaming.

Build using -Phadoop_version=2.2

Apache Hadoop 2.3

Includes support for Hadoop Streaming.

Build using -Phadoop_version=2.3

Configuration

See CONFIG.md for the full list of configuration options.

Streaming

See the streaming module's documentation for details on Hadoop Streaming support.

Examples

See the examples directory.

Usage with static .bson (mongo backup) files

See BSON_README.md.
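
As a quick illustration, reading backup files only requires swapping the job's input format for the connector's BSONFileInputFormat; this is a minimal sketch, and the input path is a hypothetical example (it could equally be an S3 or local-filesystem path).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import com.mongodb.hadoop.BSONFileInputFormat;

public class BsonInputExample {
    public static Job configureJob() throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "read from .bson backup");
        job.setJarByClass(BsonInputExample.class);
        // Read the raw .bson file directly; mappers receive the decoded
        // documents just as they would from a live MongoDB collection.
        job.setInputFormatClass(BSONFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("hdfs:///backups/demo.bson"));
        return job;
    }
}
```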

Usage with Amazon Elastic MapReduce

Amazon Elastic MapReduce is a managed Hadoop framework that allows you to submit jobs to a cluster of customizable size and configuration without having to provision nodes or install software yourself.

Using EMR with the MongoDB Connector for Hadoop allows you to run MapReduce jobs against MongoDB backup files stored in S3.

Submitting jobs that use the MongoDB Connector for Hadoop to EMR simply requires a bootstrap action that fetches the dependencies (the MongoDB Java driver, the mongo-hadoop-core library, etc.) and places them in the Hadoop distribution's lib folders.

For a full example (running the Enron example on Elastic MapReduce), please see here.

Usage with Pig

Documentation on using Pig with the MongoDB Connector for Hadoop is available in the pig module.

For examples of using Pig with the MongoDB Connector for Hadoop, also refer to the examples section.

Notes for Contributors

If your code introduces new features, please add tests that cover them where possible, and make sure that ./gradlew check still passes. If you're not sure how to write a test for a feature, or you have trouble with a test failure, please post to the mongodb-user Google Group with details and we will try to help.

Maintainers

Justin Lee (justin.lee@mongodb.com)

Contributors

Support

Issue tracking: https://jira.mongodb.org/browse/HADOOP/

Discussion: http://groups.google.com/group/mongodb-user/
