MongoDB Connector for Hadoop

Purpose

The MongoDB Connector for Hadoop is a library that allows MongoDB (or backup files in its data format, BSON) to be used as an input source or output destination for Hadoop MapReduce tasks. It is designed to provide greater flexibility and performance, and to make it easy to integrate data in MongoDB with other parts of the Hadoop ecosystem, such as Hadoop Streaming, Pig, and Hive.

Check out the releases page for the latest stable release.

Features

  • Can create data splits to read from standalone, replica set, or sharded configurations
  • Source data can be filtered with queries using the MongoDB query language
  • Supports Hadoop Streaming, so job code can be written in other languages (Python, Ruby, and Node.js are currently supported)
  • Can read data from MongoDB backup files residing on S3, HDFS, or local filesystems
  • Can write data out in .bson format, which can then be imported to any MongoDB database with mongorestore
  • Works with BSON/MongoDB documents in other Hadoop tools such as Pig and Hive
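
As an illustration of the .bson round trip mentioned in the list above, here is a minimal sketch that imports every .bson file from a job's output directory with mongorestore. It assumes the output has already been copied to a local directory; the function name `restore_bson_dir`, the `MONGORESTORE` override variable, and the database and collection names in the usage line are hypothetical, not part of the connector.

```shell
#!/bin/sh
# Sketch: import each .bson file in a directory into one collection.
# Assumption: the job output was copied to local disk beforehand.
# MONGORESTORE is a hypothetical override, useful for a dry run
# (e.g. MONGORESTORE=echo prints the arguments instead of importing).

restore_bson_dir() {
    dir=$1
    db=$2
    coll=$3
    for f in "$dir"/*.bson; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$f" ] || continue
        "${MONGORESTORE:-mongorestore}" --db "$db" --collection "$coll" "$f" || return 1
    done
}
```

For example, `restore_bson_dir ./job-output mydb results` would import each file in `./job-output` into the `results` collection of `mydb`.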

Download

See the release page.

Building

Run ./gradlew jar to build the jars. Each module's jars are placed in that module's build/libs directory; for example, the jar for the core module is generated in core/build/libs.

After a successful build, you must copy the jars to the lib directory on each node in your Hadoop cluster. This is usually one of the following locations, depending on which Hadoop release you are using:

  • $HADOOP_HOME/lib/
  • $HADOOP_HOME/share/hadoop/mapreduce/
  • $HADOOP_HOME/share/hadoop/lib/
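
The copy step can be sketched as a small shell helper. This is only an illustration: it assumes the first of the three directories above that exists is the right one for your release, and that it is run from the repository root after `./gradlew jar`. The function names `find_hadoop_lib_dir` and `install_connector_jars` are hypothetical.

```shell
#!/bin/sh
# Sketch: locate the Hadoop lib directory and copy the built jars into it.
# Assumption: the first candidate directory that exists is the correct one.

# Print the first existing lib directory under the given Hadoop home.
find_hadoop_lib_dir() {
    hadoop_home=$1
    for candidate in lib share/hadoop/mapreduce share/hadoop/lib; do
        if [ -d "$hadoop_home/$candidate" ]; then
            printf '%s\n' "$hadoop_home/$candidate"
            return 0
        fi
    done
    return 1
}

# Copy the jars found under each top-level module's build/libs directory.
install_connector_jars() {
    target=$(find_hadoop_lib_dir "$1") || {
        echo "no Hadoop lib directory found under $1" >&2
        return 1
    }
    cp ./*/build/libs/*.jar "$target"/
}
```

On each node, this could be used as `./gradlew jar && install_connector_jars "$HADOOP_HOME"`.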

mongo-hadoop should work on any distribution of Hadoop. Should you run into an issue, please file a Jira ticket.

Documentation

For full documentation, please check out the Hadoop Connector Wiki. The documentation includes installation instructions, configuration options, as well as specific instructions and examples for each Hadoop application the connector supports.

Usage with Amazon Elastic MapReduce

Amazon Elastic MapReduce is a managed Hadoop framework that allows you to submit jobs to a cluster of customizable size and configuration, without needing to deal with provisioning nodes and installing software.

Using EMR with the MongoDB Connector for Hadoop allows you to run MapReduce jobs against MongoDB backup files stored in S3.

Submitting jobs that use the MongoDB Connector for Hadoop to EMR simply requires that the bootstrap actions fetch the dependencies (MongoDB Java driver, mongo-hadoop-core libs, etc.) and place them into the Hadoop distribution's lib folders.
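
A bootstrap action of this kind might look roughly like the following sketch. It is not the project's own script: the function name `fetch_jars` is hypothetical, and the target directory and jar URLs you pass in depend on your EMR release and on where you host the dependencies.

```shell
#!/bin/sh
# Sketch of a bootstrap action: download each dependency jar into a lib directory.
# Assumptions: the caller supplies the lib directory and the jar URLs;
# curl is available on the node.

fetch_jars() {
    lib_dir=$1
    shift
    mkdir -p "$lib_dir" || return 1
    for url in "$@"; do
        # -f: fail on HTTP errors, -sS: quiet but still report errors,
        # -L: follow redirects, -o: write to the named file
        curl -fsSL -o "$lib_dir/$(basename "$url")" "$url" || return 1
    done
}
```

It would be called with the lib directory first and one URL per dependency after it, for example `fetch_jars "$HADOOP_HOME/lib" "$DRIVER_JAR_URL" "$CONNECTOR_JAR_URL"`, where both URL variables are placeholders you define.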

For a full example (running the Enron example on Elastic MapReduce), please see here.

Notes for Contributors

If your code introduces new features, add tests that cover them if possible, and make sure that ./gradlew check still passes. For instructions on how to run the tests, see the Running the Tests section in the wiki. If you're not sure how to write a test for a feature, or have trouble with a test failure, please post to the Google group with details and we will try to help.

Note: Until FindBugs updates its dependencies, running ./gradlew check on Java 8 will fail.

Maintainers

Luke Lovett (luke.lovett@mongodb.com)

Contributors

Support

Issue tracking: https://jira.mongodb.org/browse/HADOOP/

Discussion: http://groups.google.com/group/mongodb-user/
