This repository has been archived by the owner on Jan 29, 2022. It is now read-only.

Using .bson Files

TonyLee edited this page May 23, 2016 · 2 revisions

This page gives an overview of how to use BSON files with the connector.

Using BSON Files

Static .bson files (such as those produced by mongodump) can be used as input or output for Hadoop or Spark jobs. This is often useful when reading directly from MongoDB would place an unpredictable load on the database, whereas creating a BSON snapshot has a one-time, predictable cost.

Splitting/Partitioning BSON Files

Splitting and Compressing BSON Using BSONSplitter

New in mongo-hadoop 1.5. This feature is not yet available in mapred.BSONFileInputFormat.

The com.mongodb.hadoop.splitter.BSONSplitter class can be run as an executable program that splits and compresses BSON files (possibly also writing them to another filesystem like HDFS). This class is contained in mongo-hadoop-core.jar. Running the program with no arguments displays the usage:

hadoop jar mongo-hadoop-core.jar com.mongodb.hadoop.splitter.BSONSplitter
           <fileName> [-c compressionCodec] [-o outputDirectory]

Make sure to use the full path, including scheme, for input and output paths.
  • fileName - The absolute path (including scheme) to a BSON file.
  • compressionCodec (optional) - The CompressionCodec class to use when compressing the file. This class must implement org.apache.hadoop.io.compress.CompressionCodec. The selection of codecs available is likely to vary with Hadoop distribution. If left unspecified, BSONSplitter will use DefaultCodec.
  • outputDirectory (optional) - Where to place the resulting compressed BSON files (absolute path, including scheme). If this is left unspecified, then the output files are placed with the input file.
Example:

The following example reads from the file file:///home/mongodb/backups/really-big-bson-file.bson, computes splits according to the configured split size and writes each split compressed with the BZip2Codec to HDFS under /user/mongodb/input/bson.

hadoop jar mongo-hadoop-core.jar com.mongodb.hadoop.splitter.BSONSplitter \
    file:///home/mongodb/backups/really-big-bson-file.bson \
    -c org.apache.hadoop.io.compress.BZip2Codec \
    -o hdfs://localhost:8020/user/mongodb/input/bson
Automatic BSON Splitting (No Compression)

Because BSON documents carry headers and length information, a .bson file cannot be split at arbitrary byte offsets: doing so would create incomplete document fragments. Instead, it must be split at the boundaries between documents. To facilitate this, the mongo-hadoop adapter refers to a small metadata file that records the offsets of documents within the file. This metadata file is stored in the same directory as the input file, with the same name but starting with a "." and ending with ".splits" (for example, the metadata for data.bson is .data.bson.splits). If the metadata file already exists when the job runs, it is read and used directly to generate the list of splits. If it does not yet exist, it is generated automatically so that it is available for subsequent runs. To disable saving this file, set bson.split.write_splits to false; splits will still be calculated and used. To disable calculating splits entirely, set bson.split.read_splits to false.

The default split size is determined from the default block size on the input file's filesystem, or 64 megabytes if this is not available. You can set lower and upper bounds for the split size by setting values (in bytes) for mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. The .splits file contains BSON objects listing the start positions and lengths of portions of the file, each no larger than the split size, which can then be read directly into a Mapper task.
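These bounds can also be set programmatically on the job configuration. The following is a minimal sketch; the class name and the 16 MB / 128 MB values are illustrative examples, not recommendations, and hadoop-common is assumed to be on the classpath:

```java
import org.apache.hadoop.conf.Configuration;

public class BsonSplitSizeConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Values are in bytes; 16 MB and 128 MB are example bounds only.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 16L * 1024 * 1024);
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024);
        return conf;
    }
}
```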

For optimal performance, it's faster to build this file locally before uploading to S3 or HDFS, if possible. You can do this by running the script tools/bson_splitter.py. The default split size is 64 megabytes; to use a different split size, change the value of SPLIT_SIZE in the script source and re-run it.

Using BSON Files for Input

Setup

To use a BSON file as the input for a Hadoop job, you must set mongo.job.input.format to "com.mongodb.hadoop.BSONFileInputFormat" or use MongoConfigUtil.setInputFormat(com.mongodb.hadoop.BSONFileInputFormat.class). Then set mapred.input.dir or mapreduce.input.fileinputformat.inputdir to the location of the .bson input file(s). The value for this property may be:

  • the path to a single file,
  • the path to a directory (all files inside the directory will be treated as BSON input files),
  • located on the local file system (file://...), on Amazon S3 (s3n://...), on a Hadoop Filesystem (hdfs://...), or any other FS protocol your system supports,
  • a comma delimited sequence of these paths.
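Putting these settings together, a job configuration for BSON input might be set up as in the sketch below. The property names are those given above; the class name and HDFS path are placeholders, and mongo-hadoop plus its dependencies are assumed to be on the classpath:

```java
import org.apache.hadoop.conf.Configuration;

public class BsonInputSetup {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Read .bson files with BSONFileInputFormat.
        conf.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
        // May be a single file, a directory, or a comma-delimited list of paths.
        // The path below is a placeholder.
        conf.set("mapreduce.input.fileinputformat.inputdir",
                 "hdfs://localhost:8020/user/mongodb/input/bson");
        return conf;
    }
}
```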
Code

BSON objects loaded from a .bson file do not necessarily have an _id field, so no key is supplied to the Mapper. Because of this, you should use NullWritable or simply Object as the input key type for the map phase, and ignore the key variable in your code. For example:

public void map(NullWritable key, BSONWritable writable, final Context context)
        throws IOException, InterruptedException {
    // …
}

Producing BSON Files as Output

By using BSONFileOutputFormat you can write the output data of a Hadoop job into .bson files, which can then be fed into a subsequent job or loaded into a MongoDB instance using mongorestore.

Setup

To write the output of a job to .bson files, set mongo.job.output.format to com.mongodb.hadoop.BSONFileOutputFormat or use MongoConfigUtil.setOutputFormat(com.mongodb.hadoop.BSONFileOutputFormat.class).

Then set mapreduce.output.fileoutputformat.outputdir to be the directory where the output .bson files should be written. This may be a path on the local filesystem, HDFS, S3, etc. Every reducer will generate its own output file, just like writing output into plain text files using TextOutputFormat.
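Mirroring the input case, output setup can be sketched as follows. The property names come from this section; the class name and HDFS path are placeholders:

```java
import org.apache.hadoop.conf.Configuration;

public class BsonOutputSetup {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Write job output as .bson files.
        conf.set("mongo.job.output.format", "com.mongodb.hadoop.BSONFileOutputFormat");
        // Each reducer writes its own file under this directory (placeholder path).
        conf.set("mapreduce.output.fileoutputformat.outputdir",
                 "hdfs://localhost:8020/user/mongodb/output/bson");
        return conf;
    }
}
```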

Writing splits during output

If you intend to feed these output .bson files into subsequent jobs, you can generate the .splits files on the fly as they are written by setting bson.output.build_splits to true. This saves time over building the .splits files on demand at the beginning of another job. This setting defaults to false, and no .splits files are written during output.

Settings for BSON input/output

  • bson.split.read_splits - When set to true, will attempt to read + calculate split points for each BSON file in the input. When set to false, will create just one split for each input file, consisting of the entire length of the file. Defaults to true.
  • mapreduce.input.fileinputformat.split.minsize - Set a lower bound on acceptable size for file splits (in bytes). Defaults to 1.
  • mapreduce.input.fileinputformat.split.maxsize - Set an upper bound on acceptable size for file splits (in bytes). Defaults to LONG_MAX_VALUE.
  • bson.split.write_splits - Automatically save any split information calculated for input files, by writing to corresponding .splits files. Defaults to false.
  • bson.output.build_splits - Build a .splits file on the fly when constructing an output .bson file. Defaults to false.
  • bson.pathfilter.class - Set the class name for a [PathFilter](http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PathFilter.html) to filter which files to process when scanning directories for BSON input files. Defaults to null (no additional filtering). You can set this value to com.mongodb.hadoop.BSONPathFilter to restrict input to only files ending with the ".bson" extension.
  • mongo.input.lazy_bson - If true, use LazyBSONObject as the input to the Mapper; if false, use BasicBSONObject.
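As a sketch, the settings above can be combined on a single configuration like this. All values shown are examples, not defaults, and the class name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class BsonTuningConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setBoolean("bson.split.read_splits", true);   // calculate split points
        conf.setBoolean("bson.split.write_splits", true);  // persist .splits files for reuse
        conf.setBoolean("bson.output.build_splits", true); // build .splits while writing output
        // Only process files ending in ".bson" when scanning input directories.
        conf.set("bson.pathfilter.class", "com.mongodb.hadoop.BSONPathFilter");
        conf.setBoolean("mongo.input.lazy_bson", true);    // LazyBSONObject in the Mapper
        return conf;
    }
}
```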