
Configuration Reference

Configuration

This page describes options that can be passed to a job with -D to configure the MongoDB Hadoop Connector. For example:

bin/hadoop jar MyJob.jar com.enterprise.JobMainClass \
    -Dmongo.input.split.read_shard_chunks=true \
    -Dmongo.input.uri=mongodb://...

Jump to MapReduce options, input and output options, split options, or miscellaneous options.

MapReduce Options

Configure types and classes used in MapReduce jobs.

mongo.job.mapper

sets the class to be used as the implementation for Mapper

mongo.job.reducer

sets the class to be used as the implementation for Reducer

mongo.job.combiner

sets the class to be used for a combiner. Must implement Reducer.

mongo.job.partitioner

sets the class to be used as the implementation for Partitioner

mongo.job.mapper.output.key

sets the output class of the key produced by your Mapper. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.

mongo.job.mapper.output.value

sets the output class of the value produced by your Mapper. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.

mongo.job.output.key

sets the output class of the key produced by your Reducer.

mongo.job.output.value

sets the output class of the value produced by your Reducer. It may be necessary to set this value if the output types of your Mapper and Reducer classes are not the same.

mongo.job.sort_comparator

the Comparator to use in the Job. Note that keys emitted by Mappers either need to be WritableComparable, or you need to specify a Comparator capable of comparing the key type emitted by Mappers.
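
These properties can also be set programmatically on the job's Hadoop Configuration before submission, which is equivalent to passing them with -D. A rough sketch, where the com.enterprise.* class names are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Core MapReduce classes; these com.enterprise.* names are placeholders.
conf.set("mongo.job.mapper", "com.enterprise.MyMapper");
conf.set("mongo.job.reducer", "com.enterprise.MyReducer");
conf.set("mongo.job.combiner", "com.enterprise.MyCombiner");
// Key/value types emitted by the Mapper and by the Reducer.
conf.set("mongo.job.mapper.output.key", "org.apache.hadoop.io.Text");
conf.set("mongo.job.mapper.output.value", "org.apache.hadoop.io.IntWritable");
conf.set("mongo.job.output.key", "org.apache.hadoop.io.Text");
conf.set("mongo.job.output.value", "org.apache.hadoop.io.IntWritable");

The sketches further down this page reuse the same conf object.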

Input and Output Options

Configure the way the connector reads from and writes to MongoDB or BSON.

mongo.job.input.format

the InputFormat class to use. The MongoDB Hadoop Connector provides two: MongoInputFormat and BSONFileInputFormat for reading from live MongoDB clusters and BSON dumps, respectively.

mongo.job.output.format

the OutputFormat class to use. The MongoDB Hadoop Connector provides two: MongoOutputFormat and BSONFileOutputFormat for writing to live MongoDB clusters and BSON files, respectively.

mongo.auth.uri

specify an auxiliary MongoDB connection string to authenticate against when constructing splits. If you are using a sharded cluster, your config database requires authentication, and its username/password differ from those of the collection you are running Map/Reduce on, then you may need to use this option so that Hadoop can access collection metadata in order to compute splits. An example URI: mongodb://username:password@cyberdyne:27017/config.

This URI is also used by com.mongodb.hadoop.splitter.MongoSplitterFactory to determine what Splitter implementation to use, if one was not specified in mongo.splitter.class. If using MongoSplitterFactory, the user specified in this URI must also have read access to the input collection.

Note: this option cannot be used to provide credentials for reading from or writing to MongoDB. Please see authentication options in the MongoDB connection string documentation for guidance on providing correct options to mongo.input.uri and mongo.output.uri instead.
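
For example, a sketch of pairing the two URIs on the Configuration when the config database credentials differ from those used for the input collection (the hostnames and credentials below are hypothetical):

// Credentials used to read the input collection itself.
conf.set("mongo.input.uri",
    "mongodb://appUser:appPass@mongos1:27017/analytics.events");
// Separate credentials used only to read split metadata from the config database.
conf.set("mongo.auth.uri",
    "mongodb://adminUser:adminPass@mongos1:27017/config");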

mongo.input.uri

specify one or more input collections (including auth and readPreference options) to use as the input data source for the Map/Reduce job. This takes the form of a MongoDB connection string, and by specifying options you can control how and where the input is read from.

Examples:

  • mongodb://joe:12345@weyland-yutani:27017/analytics.users?readPreference=secondary Authenticate as "joe" with the password "12345" and read only from SECONDARY nodes of the "users" collection in the "analytics" database.
  • mongodb://joe:12345@weyland-yutani:27017/production.customers?readPreferenceTags=dc:tokyo,type:hadoop Authenticate as "joe" with the password "12345" and read the "customers" collection in the "production" database only on nodes tagged with "dc:tokyo" and "type:hadoop".

mongo.output.uri

A MongoDB connection string describing the destination to which to write. Deprecated behavior: on a sharded cluster you could specify a space-delimited list of URIs referring to separate mongos instances, and output records would be written to them in round-robin fashion to improve throughput and avoid overloading a single mongos. This is unnecessary now: the MongoDB Java Driver does this automatically, so just provide all mongos addresses in a single connection string.

mongo.input.query

filter the input collection with a query. This query must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.

Example

'mongo.input.query={"_id":{"$gt":{"$date":1182470400000}}}'

//this is equivalent to {_id : {$gt : ISODate("2007-06-21T20:00:00-0400")}} in the mongo shell

mongo.input.fields

a projection document limiting the fields that appear in each document. This must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.

mongo.input.sort

the cursor sort on each split. This must be represented in JSON format, and use the MongoDB extended JSON format to represent non-native JSON data types like ObjectIds, Dates, etc.
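
Because the query, projection, and sort options all take extended JSON strings, escaping them on the command line can be awkward; setting them on the Configuration is sometimes easier. A sketch, with field names that are only illustrative:

// Same query as the command-line example above, plus an illustrative projection and sort.
conf.set("mongo.input.query", "{\"_id\": {\"$gt\": {\"$date\": 1182470400000}}}");
conf.set("mongo.input.fields", "{\"name\": 1, \"created_at\": 1}");
conf.set("mongo.input.sort", "{\"created_at\": 1}");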

mongo.input.limit

the cursor limit on each split.

mongo.input.skip

the cursor skip on each split.

mongo.input.notimeout

Set noTimeout on the cursor when reading each split. This can be necessary if very large splits are being truncated because the cursors are killed before all of the data has been read.

mongo.input.lazy_bson

Defaults to false, in which case documents are decoded into BasicBSONObjects before being passed to the Mapper. Setting this option to true decodes documents into LazyBSONObjects instead.

mongo.input.mongos_hosts

specify space-delimited MongoDB connection strings giving the addresses to one or more mongos instances. Mongoses specified in this option will be read from in round-robin fashion. Deprecated: The MongoDB Java Driver does this automatically, so you can just provide a list of mongos addresses in a single connection string.

mongo.input.key

a key that names the field in each MongoDB document whose value should be used as the key passed to each Mapper. This is _id by default. "Dot-separated" keys are supported to access nested fields, or fields contained within arrays. For example, the key location.addresses.0.street yields "Sesame" for the following document:

{
  "location": {
     "addresses": [
       {"street": "Sesame", ...},
       {"street": "Embarcadero", ...}
     ]
  }
}
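
For instance, a minimal sketch of using that nested field as the Mapper key (assuming the document structure above):

// With the example document above, each Mapper receives "Sesame" as its key.
conf.set("mongo.input.key", "location.addresses.0.street");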

mongo.output.batch.size

The number of documents to write in a single batch from the connector. This is 1000 by default.

Split Options

Configure how the connector creates splits.

mongo.splitter.class

The name of the Splitter class to use. If left unset, the MongoDB Hadoop Connector will attempt to make a best guess as to what Splitter to use (a sketch of forcing a specific Splitter follows the list below). The Hadoop connector provides the following Splitters:

  • com.mongodb.hadoop.splitter.StandaloneMongoSplitter

    This is the default Splitter implementation for standalone and replica set configurations. It runs splitVector on the input collection to create ranges to use for InputSplits.

  • com.mongodb.hadoop.splitter.ShardMongoSplitter

    This is the default Splitter implementation for sharded clusters when mongo.input.split.read_from_shards is set to true (it's false by default). This Splitter creates one InputSplit per shard.

  • com.mongodb.hadoop.splitter.ShardChunkMongoSplitter

    This is the default Splitter implementation for sharded clusters with all the default settings. This Splitter creates one InputSplit per shard chunk.

  • com.mongodb.hadoop.splitter.MultiMongoCollectionSplitter

    This Splitter is capable of reading from multiple MongoDB collections simultaneously. It delegates the task of creating InputSplits to child Splitters for each collection.

  • com.mongodb.hadoop.splitter.MongoPaginatingSplitter

    This Splitter builds incremental range queries to cover a query. This Splitter requires a bit more work to calculate split boundaries, but it performs better than the other splitters when a mongo.input.query is given.
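
If the automatic choice is not appropriate, a specific Splitter can be forced. A sketch, assuming a job that filters its input with mongo.input.query:

// Force the paginating splitter, which performs better when a query is given.
conf.set("mongo.splitter.class",
    "com.mongodb.hadoop.splitter.MongoPaginatingSplitter");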

mongo.input.split.use_range_queries

When set to true, data is read from MongoDB by adding $gte/$lt clauses to the query to bound each split, instead of the default behavior of bounding splits with $min/$max. This may be useful when mongo.input.query is used to filter Mapper input but the index forced by $min/$max is less selective than the query itself; using range clauses lets the MongoDB server select the best index available while still constraining each split to its boundaries. Defaults to false.

Limitations

This setting should only be used when:

  • The query specified in mongo.input.query does not already have filters applied against the split field; an error will be logged otherwise.
  • The same data type is used for the split field in all documents; otherwise you may get incomplete results.
  • The split field contains simple scalar values, and is not a compound key (e.g., "foo" is acceptable but {"k":"foo", "v":"bar"} is not).

mongo.input.split.create_input_splits

Defaults to true. When true, the connector attempts to create multiple input splits from the input collection for parallelism. When set to false, the entire collection is treated as a single input split.

mongo.input.split.allow_read_from_secondaries

Deprecated.

Sets slaveOk on all outbound connections when reading data from splits. This is deprecated - if you would like to read from secondaries, just specify the appropriate readPreference as part of your mongo.input.uri setting.

mongo.input.split.read_from_shards

When reading data for splits, query the shards directly rather than going through a mongos. This may produce inaccurate results if there are aborted or incomplete chunk migrations in progress during execution of the job. Make sure to turn off the balancer and clean up orphaned documents before using this option.

mongo.input.split.read_shard_chunks

Use the chunk boundaries on a MongoDB sharded cluster's config server to determine input splits. Chunk data is read through one or more mongos instances, as described in the connection string provided to mongo.input.uri or in mongo.input.mongos_hosts. If both this option and mongo.input.split.read_from_shards are true, then this option takes priority.

mongo.input.split.split_key_pattern

If you want to customize the split key for efficiency reasons (such as a different data distribution) on an un-sharded collection, you may set this to any valid field name. The restrictions on this key are the same as the rules for choosing a shard key when sharding an existing MongoDB collection: you must have an index on the field, and follow the other rules outlined in the MongoDB documentation.

mongo.input.split.split_key_min

Lower-bound for splits created using the index described by mongo.input.split.split_key_pattern. This value must be set to a JSON string that describes a point in the index. This setting must be used in conjunction with mongo.input.split.split_key_pattern and mongo.input.split.split_key_max.

mongo.input.split.split_key_max

Upper-bound for splits created using the index described by mongo.input.split.split_key_pattern. This value must be set to a JSON string that describes a point in the index. This setting must be used in conjunction with mongo.input.split.split_key_pattern and mongo.input.split.split_key_min.
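
These three options are intended to be used together. A sketch, assuming a hypothetical indexed numeric field named ts and that the pattern is given as a JSON index specification:

// "ts" is a hypothetical indexed field; min and max are JSON points in that index.
conf.set("mongo.input.split.split_key_pattern", "{\"ts\": 1}");
conf.set("mongo.input.split.split_key_min", "{\"ts\": 0}");
conf.set("mongo.input.split.split_key_max", "{\"ts\": 1000000}");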

bson.split.read_splits

When set to true, the connector will attempt to read and calculate split points for each BSON file in the input. When set to false, it will create just one split for each input file, consisting of the entire length of the file. Defaults to true.

mongo.input.split_size

The split size when computing splits with StandaloneMongoSplitter. Defaults to 8MB.

mapreduce.input.fileinputformat.split.minsize

Set a lower bound on acceptable size for file splits (in bytes). Defaults to 1.

mapreduce.input.fileinputformat.split.maxsize

Set an upper bound on acceptable size for file splits (in bytes). Defaults to Long.MAX_VALUE.

bson.split.write_splits

Automatically save any split information calculated for input files, by writing to corresponding .splits files. Defaults to true.

bson.output.build_splits

Build a .splits file on the fly when constructing an output .bson file. Defaults to false.

bson.split.splits_path

the location where .splits files should go. If left unset, .splits files will go into the same directory as their BSON file counterpart, or into the current working directory.

Miscellaneous Options

bson.pathfilter.class

Set the class name for a [PathFilter](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PathFilter.html) to filter which files to process when scanning directories for bson input files. Defaults to null (no additional filtering). You can set this value to com.mongodb.hadoop.BSONPathFilter to restrict input to only files ending with the ".bson" extension.
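
A minimal sketch of a custom filter, assuming you only want BSON files whose names start with a particular prefix (the class name and prefix are hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: accept only files named like "events-*.bson".
public class EventsBsonPathFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        String name = path.getName();
        return name.startsWith("events-") && name.endsWith(".bson");
    }
}

The job would then set bson.pathfilter.class to the fully qualified name of this class, which must be available on the job's classpath.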

mongo.job.verbose

When set to true, turns on INFO-level logging for the Job.

mongo.job.background

When set to true, runs the Job in the background. If set to false, the RunningJob blocks until the Job is finished. See http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#waitForCompletion%28%29.
