
Configuration Reference

Configuration

This page describes options that can be passed to a job with -D to configure the MongoDB Hadoop Connector. For example:

bin/hadoop jar MyJob.jar com.enterprise.JobMainClass \
    -Dmongo.input.split.read_shard_chunks=true \
    -Dmongo.input.uri=mongodb://...

Jump to MapReduce options, input and output options, split options, or miscellaneous options.

MapReduce Options

Configure types and classes used in MapReduce jobs.

mongo.job.mapper

sets the class to be used as the implementation for Mapper

mongo.job.reducer

sets the class to be used as the implementation for Reducer

mongo.job.combiner

sets the class to be used for a combiner. Must implement Reducer.

mongo.job.partitioner

sets the class to be used as the implementation for Partitioner

mongo.job.mapper.output.key

sets the output class of the key produced by your Mapper. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.

mongo.job.mapper.output.value

sets the output class of the value produced by your Mapper. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.

mongo.job.output.key

sets the output class of the key produced by your Reducer.

mongo.job.output.value

sets the output class of the value produced by your Reducer. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.

mongo.job.sort_comparator

the Comparator to use in the Job. Note that keys emitted by Mappers either need to be WritableComparable, or you need to specify a Comparator capable of comparing the key type emitted by Mappers.
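
These properties can also be set programmatically on the job's Hadoop Configuration before submission, which is equivalent to passing them with -D. A rough sketch, where the com.enterprise.* class names are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Core MapReduce classes; these com.enterprise.* names are placeholders.
conf.set("mongo.job.mapper", "com.enterprise.MyMapper");
conf.set("mongo.job.reducer", "com.enterprise.MyReducer");
conf.set("mongo.job.combiner", "com.enterprise.MyCombiner");
// Key/value types emitted by the Mapper and by the Reducer.
conf.set("mongo.job.mapper.output.key", "org.apache.hadoop.io.Text");
conf.set("mongo.job.mapper.output.value", "org.apache.hadoop.io.IntWritable");
conf.set("mongo.job.output.key", "org.apache.hadoop.io.Text");
conf.set("mongo.job.output.value", "org.apache.hadoop.io.IntWritable");

The sketches further down this page reuse the same conf object.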

Input and Output Options

Configure the way the connector reads from and writes to MongoDB or BSON.

mongo.job.input.format

the InputFormat class to use. The MongoDB Hadoop Connector provides two: MongoInputFormat and BSONFileInputFormat for reading from live MongoDB clusters and BSON dumps, respectively.

mongo.job.output.format

the OutputFormat class to use. The MongoDB Hadoop Connector provides two: MongoOutputFormat and BSONFileOutputFormat for writing to live MongoDB clusters and BSON files, respectively.

mongo.auth.uri

specify an auxiliary MongoDB connection string to authenticate against when constructing splits. If you are using a sharded cluster, your config database requires authentication, and its username/password differ from those of the collection you are running Map/Reduce on, then you may need to use this option so that Hadoop can access collection metadata in order to compute splits. An example URI: mongodb://username:password@cyberdyne:27017/config.

This URI is also used by com.mongodb.hadoop.splitter.MongoSplitterFactory to determine what Splitter implementation to use, if one was not specified in mongo.splitter.class. If using MongoSplitterFactory, the user specified in this URI must also have read access to the input collection.

Note: this option cannot be used to provide credentials for reading from or writing to MongoDB. Please see authentication options in the MongoDB connection string documentation for guidance on providing correct options to mongo.input.uri and mongo.output.uri instead.
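
For example, a sketch of pairing the two URIs on the Configuration when the config database credentials differ from those used for the input collection (the hostnames and credentials below are hypothetical):

// Credentials used to read the input collection itself.
conf.set("mongo.input.uri",
    "mongodb://appUser:appPass@mongos1:27017/analytics.events");
// Separate credentials used only to read split metadata from the config database.
conf.set("mongo.auth.uri",
    "mongodb://adminUser:adminPass@mongos1:27017/config");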

mongo.input.uri

specify one or more input collections (including auth and readPreference options) to use as the input data source for the Map/Reduce job. This takes the form of a MongoDB connection string, and by specifying options you can control how and where the input is read from.

Examples:

  • mongodb://joe:12345@weyland-yutani:27017/analytics.users?readPreference=secondary Authenticate as "joe" with the password "12345" and read only from SECONDARY nodes of the "users" collection in the "analytics" database.
  • mongodb://joe:12345@weyland-yutani:27017/production.customers?readPreferenceTags=dc:tokyo,type:hadoop Authenticate as "joe" with the password "12345" and read the "customers" collection in the "production" database only on nodes tagged with "dc:tokyo" and "type:hadoop".

mongo.output.uri

A MongoDB connection string describing the destination to which to write. Deprecated behavior: on a sharded cluster you could specify a space-delimited list of URIs referring to separate mongos instances, and output records would be written to them in round-robin fashion to improve throughput and avoid overloading a single mongos. This is unnecessary now: the MongoDB Java Driver does this automatically, so just provide all mongos addresses in a single connection string.

mongo.input.query

filter the input collection with a query. This query must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.

Example

'mongo.input.query={"_id":{"$gt":{"$date":1182470400000}}}'

//this is equivalent to {_id : {$gt : ISODate("2007-06-21T20:00:00-0400")}} in the mongo shell

mongo.input.fields

a projection document limiting the fields that appear in each document. This must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.

mongo.input.sort

the cursor sort on each split. This must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.
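
Because the query, projection, and sort options all take extended JSON strings, escaping them on the command line can be awkward; setting them on the Configuration is sometimes easier. A sketch, with field names that are only illustrative:

// Same query as the command-line example above, plus an illustrative projection and sort.
conf.set("mongo.input.query", "{\"_id\": {\"$gt\": {\"$date\": 1182470400000}}}");
conf.set("mongo.input.fields", "{\"name\": 1, \"created_at\": 1}");
conf.set("mongo.input.sort", "{\"created_at\": 1}");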

mongo.input.limit

the cursor limit on each split.

mongo.input.skip

the cursor skip on each split.

mongo.input.notimeout

Set noTimeout on the cursor when reading each split. This can be necessary if very large splits are being truncated because the cursors are killed before all of the data has been read.

mongo.input.lazy_bson

Defaults to false, in which case documents are decoded into BasicBSONObjects before being passed to the Mapper. Setting this option to true decodes documents into LazyBSONObjects instead.

mongo.input.mongos_hosts

specify space-delimited MongoDB connection strings giving the addresses to one or more mongos instances. Mongoses specified in this option will be read from in round-robin fashion. Deprecated: The MongoDB Java Driver does this automatically, so you can just provide a list of mongos addresses in a single connection string.

mongo.input.key

a key that names the field in each MongoDB document whose value should be used as the key passed to each Mapper. This is _id by default. "Dot-separated" keys are supported to access nested fields, or fields contained within arrays. For example, the key location.addresses.0.street yields "Sesame" for the following document:

{
  "location": {
     "addresses": [
       {"street": "Sesame", ...},
       {"street": "Embarcadero", ...}
     ]
  }
}
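
For instance, a minimal sketch of using that nested field as the Mapper key (assuming the document structure above):

// With the example document above, each Mapper receives "Sesame" as its key.
conf.set("mongo.input.key", "location.addresses.0.street");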

mongo.output.batch.size

The number of documents to write in a single batch from the connector. This is 1000 by default.

Split Options

Configure how the connector creates splits.

mongo.splitter.class

The name of the Splitter class to use. If left unset, the MongoDB Hadoop Connector will attempt to make a best guess as to what Splitter to use (a sketch of forcing a specific Splitter follows the list below). The Hadoop connector provides the following Splitters:

  • com.mongodb.hadoop.splitter.StandaloneMongoSplitter

    This is the default Splitter implementation for standalone and replica set configurations. It runs splitVector on the input collection to create ranges to use for InputSplits.

  • com.mongodb.hadoop.splitter.ShardMongoSplitter

    This is the default Splitter implementation for sharded clusters when mongo.input.split.read_from_shards is set to true (it's false by default). This Splitter creates one InputSplit per shard.

  • com.mongodb.hadoop.splitter.ShardChunkMongoSplitter

    This is the default Splitter implementation for sharded clusters with all the default settings. This Splitter creates one InputSplit per shard chunk.

  • com.mongodb.hadoop.splitter.MultiMongoCollectionSplitter

    This Splitter is capable of reading from multiple MongoDB collections simultaneously. It delegates the task of creating InputSplits to child Splitters for each collection.

  • com.mongodb.hadoop.splitter.MongoPaginatingSplitter

    This Splitter builds incremental range queries to cover a query. This Splitter requires a bit more work to calculate split boundaries, but it performs better than the other splitters when a mongo.input.query is given.
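
If the automatic choice is not appropriate, a specific Splitter can be forced. A sketch, assuming a job that filters its input with mongo.input.query:

// Force the paginating splitter, which performs better when a query is given.
conf.set("mongo.splitter.class",
    "com.mongodb.hadoop.splitter.MongoPaginatingSplitter");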

mongo.input.split.use_range_queries

When set to true, data is read from MongoDB by adding $gte/$lt clauses to the query to bound each split, instead of the default behavior of bounding splits with $min/$max. This may be useful when mongo.input.query is used to filter Mapper input but the index forced by $min/$max is less selective than the query itself; using range clauses lets the MongoDB server select the best index available while still constraining each split to its boundaries. Defaults to false.

Limitations

This setting should only be used when:

  • The query specified in mongo.input.query does not already have filters applied against the split field; an error will be logged otherwise.
  • The same data type is used for the split field in all documents; otherwise you may get incomplete results.
  • The split field contains simple scalar values, and is not a compound key (e.g., "foo" is acceptable but {"k":"foo", "v":"bar"} is not).

mongo.input.split.create_input_splits

Defaults to true. When true, the connector attempts to create multiple input splits from the input collection for parallelism. When set to false, the entire collection is treated as a single input split.

mongo.input.split.allow_read_from_secondaries

Deprecated.

Sets slaveOk on all outbound connections when reading data from splits. This is deprecated - if you would like to read from secondaries, just specify the appropriate readPreference as part of your mongo.input.uri setting.

mongo.input.split.read_from_shards

When reading data for splits, query the shards directly rather than going through a mongos. This may produce inaccurate results if there are aborted or incomplete chunk migrations in progress during execution of the job. Make sure to turn off the balancer and clean up orphaned documents before using this option.

mongo.input.split.read_shard_chunks

Use the chunk boundaries on a MongoDB sharded cluster's config server to determine input splits. Chunk data is read through one or more mongos instances, as described in the connection string provided to mongo.input.uri or in mongo.input.mongos_hosts. If both this option and mongo.input.split.read_from_shards are true, then this option takes priority.

mongo.input.split.split_key_pattern

If you want to customize the split key for efficiency reasons (such as a different data distribution) on an un-sharded collection, you may set this to any valid field name. The restrictions on this key are the same as the rules for choosing a shard key when sharding an existing MongoDB collection: you must have an index on the field, and follow the other rules outlined in the MongoDB documentation.

mongo.input.split.split_key_min

Lower-bound for splits created using the index described by mongo.input.split.split_key_pattern. This value must be set to a JSON string that describes a point in the index. This setting must be used in conjunction with mongo.input.split.split_key_pattern and mongo.input.split.split_key_max.

mongo.input.split.split_key_max

Upper-bound for splits created using the index described by mongo.input.split.split_key_pattern. This value must be set to a JSON string that describes a point in the index. This setting must be used in conjunction with mongo.input.split.split_key_pattern and mongo.input.split.split_key_min.
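
These three options are intended to be used together. A sketch, assuming a hypothetical indexed numeric field named ts and that the pattern is given as a JSON index specification:

// "ts" is a hypothetical indexed field; min and max are JSON points in that index.
conf.set("mongo.input.split.split_key_pattern", "{\"ts\": 1}");
conf.set("mongo.input.split.split_key_min", "{\"ts\": 0}");
conf.set("mongo.input.split.split_key_max", "{\"ts\": 1000000}");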

bson.split.read_splits

When set to true, the connector will attempt to read and calculate split points for each BSON file in the input. When set to false, it will create just one split for each input file, consisting of the entire length of the file. Defaults to true.

mongo.input.split_size

The split size when computing splits with StandaloneMongoSplitter. Defaults to 8MB.

mapreduce.input.fileinputformat.split.minsize

Set a lower bound on acceptable size for file splits (in bytes). Defaults to 1.

mapreduce.input.fileinputformat.split.maxsize

Set an upper bound on acceptable size for file splits (in bytes). Defaults to Long.MAX_VALUE.

bson.split.write_splits

Automatically save any split information calculated for input files, by writing to corresponding .splits files. Defaults to true.

bson.output.build_splits

Build a .splits file on the fly when constructing an output .bson file. Defaults to false.

bson.split.splits_path

the location where .splits files should go. If left unset, .splits files will go into the same directory as their BSON file counterpart, or into the current working directory.

Miscellaneous Options

bson.pathfilter.class

Set the class name for a [PathFilter](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PathFilter.html) to filter which files to process when scanning directories for bson input files. Defaults to null (no additional filtering). You can set this value to com.mongodb.hadoop.BSONPathFilter to restrict input to only files ending with the ".bson" extension.
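
A minimal sketch of a custom filter, assuming you only want BSON files whose names start with a particular prefix (the class name and prefix are hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: accept only files named like "events-*.bson".
public class EventsBsonPathFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        String name = path.getName();
        return name.startsWith("events-") && name.endsWith(".bson");
    }
}

The job would then set bson.pathfilter.class to the fully qualified name of this class, which must be available on the job's classpath.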

mongo.job.verbose

When set to true, turns on INFO-level logging for the Job.

mongo.job.background

When set to true, runs the Job in the background. If set to false, the RunningJob blocks until the Job is finished. See http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#waitForCompletion%28%29.
