
Update master

1 parent ff31c3d commit a73fb532e0e10af275153ec05b086ca155a330af @bwmcadams bwmcadams committed Jun 25, 2012
Showing with 1 addition and 298 deletions.
  1. +1 −298 README.md
299 README.md
@@ -40,301 +40,4 @@ Issue tracking: https://jira.mongodb.org/browse/HADOOP/
Discussion: http://groups.google.com/group/mongodb-user/
-## Building the Adapter
-
-The Mongo-Hadoop adapter uses the
-[SBT Build Tool](https://github.com/harrah/xsbt) for
-compilation. SBT provides strong support for discrete configurations
-targeting multiple Hadoop versions. The distribution includes a
-self-bootstrapping copy of SBT as `sbt`. Build the jar files using
-the following command:
-
- ./sbt package
-
-The MongoDB Hadoop Adapter supports a number of Hadoop releases. You
-can change the Hadoop version supported by the build by modifying the
-value of `hadoopRelease` in the `build.sbt` file. For instance, setting
-this value to:
-
- hadoopRelease in ThisBuild := "cdh3"
-
-configures a build against Cloudera CDH3u3, while:
-
- hadoopRelease in ThisBuild := "0.21"
-
-configures a build against Hadoop 0.21 from the mainline Apache distribution.
-
-Unfortunately, we are not aware of any Maven repositories that contain
-artifacts for Hadoop 0.21 at present, so you may need to resolve these
-dependencies by hand if you choose to build with this configuration.
-We also publish releases to the central Maven repository, with the
-artifact name suffixed by the targeted Hadoop release. Our "default"
-build has no suffix attached and supports Hadoop 1.0.
-
-After building, you will need to place the "core" jar and the
-"mongo-java-driver" jar in the `lib` directory of each Hadoop server.
-
-The MongoDB-Hadoop Adapter supports the following releases. Valid keys
-for configuration and Maven artifacts appear below each release.
-
-### Cloudera Release 3
-
-This release derives from Apache Hadoop 0.20.2 but includes many
-custom patches, among them binary streaming support, and ships Pig
-0.8.1. This target compiles *ALL* modules, including Streaming.
-
-- cdh3
-- Maven artifact: "org.mongodb" / "mongo-hadoop_cdh3u3"
-
-### Apache Hadoop 0.20.205.0
-
-This includes Pig 0.9.2 and does *NOT* support Hadoop Streaming.
-
-- 0.20
-- 0.20.x
-- Maven artifact: "org.mongodb" / "mongo-hadoop_0.20.205.0"
-
-### Apache Hadoop 1.0.0
-
-This includes Pig 0.9.1 and does *NOT* support Hadoop Streaming.
-
-- 1.0
-- 1.0.x
-- Maven artifact: "org.mongodb" / "mongo-hadoop_1.0.0"
-
-### Apache Hadoop 0.21.0
-
-This includes Pig 0.9.1 and Hadoop Streaming.
-
-- 0.21
-- 0.21.x
-
-This build is **not** published to Maven because its upstream
-dependencies are not available in public repositories.
-
-### Apache Hadoop 0.23
-
-Support is *forthcoming*.
-
-This is an alpha branch with ongoing work by
-[Hortonworks](http://hortonworks.com). Apache Hadoop 0.23 is "newer"
-than Apache Hadoop 1.0.
-
-The MongoDB Hadoop Adapter currently supports the following features.
-
-## Hadoop MapReduce
-
-Provides working *Input* and *Output* adapters for MongoDB. You may
-configure these adapters with XML or programmatically. See the
-WordCount examples for demonstrations of both approaches. You can
-specify a query, fields, and sort specs in the XML config as JSON or
-programmatically as a DBObject.
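-
-As a rough illustration of what the adapters hand to your code, the
-sketch below shows a word-count mapper and reducer that consume MongoDB
-documents as `BSONObject` values. The generic signatures mirror the
-bundled WordCount example, but the class layout and the field name `x`
-are assumptions here; check them against the sources under
-`examples/wordcount`.
-
-    // Sketch of map/reduce classes over MongoDB input; types assumed
-    // from the bundled WordCount example.
-    import java.io.IOException;
-    import org.apache.hadoop.io.IntWritable;
-    import org.apache.hadoop.io.Text;
-    import org.apache.hadoop.mapreduce.Mapper;
-    import org.apache.hadoop.mapreduce.Reducer;
-    import org.bson.BSONObject;
-
-    public class TokenCount {
-        // MongoInputFormat presents each document as a BSONObject value.
-        public static class TokenMapper
-                extends Mapper<Object, BSONObject, Text, IntWritable> {
-            private static final IntWritable ONE = new IntWritable(1);
-
-            @Override
-            protected void map(Object key, BSONObject doc, Context context)
-                    throws IOException, InterruptedException {
-                Object field = doc.get("x"); // "x" is assumed, as in the examples
-                if (field == null) {
-                    return;
-                }
-                for (String token : field.toString().split("\\s+")) {
-                    context.write(new Text(token), ONE);
-                }
-            }
-        }
-
-        // MongoOutputFormat writes each (key, value) pair back to the
-        // configured output collection.
-        public static class SumReducer
-                extends Reducer<Text, IntWritable, Text, IntWritable> {
-            @Override
-            protected void reduce(Text token, Iterable<IntWritable> counts,
-                                  Context context)
-                    throws IOException, InterruptedException {
-                int sum = 0;
-                for (IntWritable c : counts) {
-                    sum += c.get();
-                }
-                context.write(token, new IntWritable(sum));
-            }
-        }
-    }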
-
-### Splitting up MongoDB Source Data for the InputFormat
-
-The MongoDB Hadoop Adapter makes it possible to create multiple
-*InputSplits* on source data originating from MongoDB to
-optimize/parallelize input processing for Mappers.
-
-If '*mongo.input.split.create_input_splits*' is **false** (it defaults
-to **true**) then MongoHadoop will use **no** splits. Hadoop will
-treat the entire collection as a single, giant *Input*. This is
-primarily intended for debugging purposes.
-
-When true (the default), the following behaviors are possible (a
-configuration sketch follows this list):
-
 1. For unsharded source collections, MongoHadoop follows the
- "unsharded split" path. (See below.)
-
- 2. For sharded source collections:
-
 * If '*mongo.input.split.read_shard_chunks*' is **true**
 (it defaults to **true**) then we pull the chunk specs from the
 configuration server and turn each shard chunk into an *Input
 Split*. Basically, this means the MongoDB sharding system does
 99% of the pre-splitting work for us, which is a good thing™.
-
- * If '*mongo.input.split.read_shard_chunks*' is **false** and
- '*mongo.input.split.read_from_shards*' is **true** (it defaults
- to **false**) then we connect to the `mongod` or replica set
- for each shard individually and each shard becomes an input
- split. The entire content of the collection on the shard is one
- split. Only use this configuration in rare situations.
-
- * If '*mongo.input.split.read_shard_chunks*' is **true** and
- '*mongo.input.split.read_from_shards*' is **true** (it defaults
- to **false**) MongoHadoop reads the chunk boundaries from
- the config server but then reads data directly from the shards
- without using the `mongos`. While this may seem like a good
- idea, it can cause erratic behavior if MongoDB balances chunks
- during a Hadoop job. This is not a recommended configuration
- for write-heavy applications but may provide effective
- parallelism in read-heavy apps.
-
 * If both '*mongo.input.split.read_shard_chunks*' and
 '*mongo.input.split.read_from_shards*' are **false** then MongoHadoop
 ignores the sharding and follows the "unsharded split" path described
 below, calculating its own splits and reading the data through
 `mongos`.
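-
-All of these switches are ordinary Hadoop configuration properties, so
-they can be set either in the XML config or on the job's
-`Configuration` object. A minimal sketch using the property keys
-documented above (the values shown simply restate the defaults):
-
-    import org.apache.hadoop.conf.Configuration;
-
-    public class SplitSwitches {
-        public static void main(String[] args) {
-            Configuration conf = new Configuration();
-
-            // Create input splits at all; false treats the whole
-            // collection as one giant split (debugging only).
-            conf.setBoolean("mongo.input.split.create_input_splits", true);
-
-            // Turn each shard chunk into an InputSplit (default true).
-            conf.setBoolean("mongo.input.split.read_shard_chunks", true);
-
-            // Read directly from each shard instead of through mongos
-            // (default false); risky if the balancer moves chunks
-            // mid-job.
-            conf.setBoolean("mongo.input.split.read_from_shards", false);
-        }
-    }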
-
-### "Unsharded Splits"
-
-"Unsharded Splits" refers to the method that MongoHadoop uses to
-calculate new splits. You may use "Unsharded splits" with sharded
-MongoDB options.
-
-This is only used:
-
-- for unsharded collections when
- '*mongo.input.split.create_input_splits*' is **true**.
-
-- for sharded collections when
- '*mongo.input.split.create_input_splits*' is **true** *and*
- '*mongo.input.split.read_shard_chunks*' is **false**.
-
-In these cases, MongoHadoop generates multiple InputSplits. Users
-have control over two factors in this system.
-
-* *mongo.input.split_size* - Controls the maximum size, in megabytes,
 of each split. The current default is 8, based on prior experience
 with Hadoop. MongoDB's default chunk size of 64 megabytes may be a
 bit too large for most deployments.
-
-* *mongo.input.split.split_key_pattern* - A MongoDB key pattern
 that follows [the same rules as shard key selection](http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroduction-ShardKeys).
 This key pattern has some requirements (i.e. it must be indexed,
 unique, and present in all documents). MongoHadoop uses this key to
 determine split points. It defaults to `{ _id: 1 }`, but you may find
 that configuring a different key pattern gives a better distribution
 of work across mappers (see the sketch below).
-
-For all three paths, you may specify a custom query filter for the
-input data. *mongo.input.query* represents a JSON document containing
-a MongoDB query. This will be properly combined with the index
-filtering on input splits, allowing you to MapReduce a subset of your
-data while still getting efficient splitting.
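-
-These knobs are plain configuration properties as well. A minimal
-sketch using the keys documented in this section; the split key
-pattern and the query shown are purely illustrative:
-
-    import org.apache.hadoop.conf.Configuration;
-
-    public class InputTuning {
-        public static void main(String[] args) {
-            Configuration conf = new Configuration();
-
-            // Cap each split at 8 megabytes (the documented default).
-            conf.setInt("mongo.input.split_size", 8);
-
-            // Split on a key pattern other than the default { _id: 1 };
-            // the field "day" is hypothetical.
-            conf.set("mongo.input.split.split_key_pattern", "{ \"day\": 1 }");
-
-            // Restrict the input to a subset of the collection.
-            conf.set("mongo.input.query", "{ \"year\": { \"$gte\": 2000 } }");
-        }
-    }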
-
-### Pig
-
-MongoHadoop includes the MongoStorage and MongoLoader modules for Pig.
-Examples of loading from and storing to MongoDB with Pig can be found
-in `examples/pigtutorial`.
-
-## Examples
-
-### WordCount
-
-There are two example WordCount processes for Hadoop MapReduce in
-`examples/wordcount`. Both read strings from MongoDB and save word
-frequency counts.
-
-These examples read documents from the collection named `in` in the
-`test` database and count the frequency of words found in the field
-`x`.
-
-The examples save results in the `test` database, collection `out`.
-
-`WordCount.java` is a programmatically configured MapReduce job, where
-all of the configuration params are set up in the Java code. You can
-run this with the ant task `wordcount`.
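-
-For a feel of the programmatic style, here is a driver sketch that
-reuses the mapper and reducer sketched earlier. The
-`mongo.input.uri`/`mongo.output.uri` keys and the
-`MongoInputFormat`/`MongoOutputFormat` class names reflect the 1.0-era
-adapter; treat the details as assumptions and compare against the
-bundled `WordCount.java`.
-
-    import org.apache.hadoop.conf.Configuration;
-    import org.apache.hadoop.io.IntWritable;
-    import org.apache.hadoop.io.Text;
-    import org.apache.hadoop.mapreduce.Job;
-    import com.mongodb.hadoop.MongoInputFormat;
-    import com.mongodb.hadoop.MongoOutputFormat;
-
-    public class WordCountDriver {
-        public static void main(String[] args) throws Exception {
-            Configuration conf = new Configuration();
-            // Read from test.in and write to test.out, as in the example.
-            conf.set("mongo.input.uri", "mongodb://localhost:27017/test.in");
-            conf.set("mongo.output.uri", "mongodb://localhost:27017/test.out");
-
-            Job job = new Job(conf, "mongo word count");
-            job.setJarByClass(WordCountDriver.class);
-            job.setMapperClass(TokenCount.TokenMapper.class);
-            job.setReducerClass(TokenCount.SumReducer.class);
-            job.setOutputKeyClass(Text.class);
-            job.setOutputValueClass(IntWritable.class);
-            job.setInputFormatClass(MongoInputFormat.class);
-            job.setOutputFormatClass(MongoOutputFormat.class);
-
-            System.exit(job.waitForCompletion(true) ? 0 : 1);
-        }
-    }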
-
-`WordCountXMLConfig.java` configures the MapReduce job using only XML
-files, with JSON for queries. See
-`examples/wordcount/src/main/resources/mongo-wordcount.xml` for the
-example configuration. You can run this with the ant task
-`wordcountXML`, or with a Hadoop command that resembles the following:
-
- hadoop jar core/target/mongo-hadoop-core-1.0.0-rc0.jar com.mongodb.hadoop.examples.WordCountXMLConfig -conf examples/wordcount/src/main/resources/mongo-wordcount.xml
-
-You will need to copy the `mongo-java-driver.jar` file into your
-Hadoop `lib` directory before this will work.
-
-### Treasury Yield
-
-The treasury yield example demonstrates working with a more complex
-input BSON document and calculating an average.
-
-It uses a database of daily US Treasury Bid Curves from 1990 to
-Sept. 2010 and runs them through a MapReduce job to calculate annual
-averages.
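-
-The reduce step amounts to a running average per year. A sketch of
-that shape appears below; the writable types and the idea that the
-mapper emits one ten-year-yield value per trading day, keyed by year,
-are assumptions, so treat the bundled treasury sources as
-authoritative.
-
-    import java.io.IOException;
-    import org.apache.hadoop.io.DoubleWritable;
-    import org.apache.hadoop.io.IntWritable;
-    import org.apache.hadoop.mapreduce.Reducer;
-
-    // Averages the per-day yields emitted by the mapper, keyed by year.
-    public class YieldAverageReducer
-            extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
-        @Override
-        protected void reduce(IntWritable year, Iterable<DoubleWritable> yields,
-                              Context context)
-                throws IOException, InterruptedException {
-            double sum = 0;
-            int count = 0;
-            for (DoubleWritable y : yields) {
-                sum += y.get();
-                count++;
-            }
-            context.write(year, new DoubleWritable(sum / count));
-        }
-    }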
-
-There is a JSON file, `examples/treasury_yield/src/main/resources/yield_historical_in.json`,
-which you should import into the `yield_historical.in` collection in
-the `demo` database.
-
-You may import the sample data into the `mongos` host by issuing the
-following command:
-
- mongoimport --db demo --collection yield_historical.in --type json --file examples/treasury_yield/src/main/resources/yield_historical_in.json
-
-This command assumes that `mongos` is running on the localhost
-interface on port `27017`. You'll need to place the mongo-hadoop and
-mongo-java-driver jars in your Hadoop installation's `lib`
-directory. After importing the data, run the test with the following
-command on the Hadoop master:
-
- hadoop jar core/target/mongo-hadoop-core-1.0.0-rc0.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml
-
-To confirm the test ran successfully, look at the `demo` database and
-query the `yield_historical.out` collection.
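-
-The `mongo` shell is the quickest way to do that; if you would rather
-verify from Java with the bundled driver, a small sketch (assuming the
-same `localhost:27017` `mongos` as above):
-
-    import com.mongodb.DBCollection;
-    import com.mongodb.DBCursor;
-    import com.mongodb.Mongo;
-
-    public class CheckResults {
-        public static void main(String[] args) throws Exception {
-            // Connect to the same mongos the job wrote through.
-            Mongo mongo = new Mongo("localhost", 27017);
-            DBCollection out =
-                mongo.getDB("demo").getCollection("yield_historical.out");
-
-            System.out.println("documents: " + out.count());
-            DBCursor cursor = out.find();
-            while (cursor.hasNext()) {
-                System.out.println(cursor.next());
-            }
-            mongo.close();
-        }
-    }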
-
-### Pig
-
-The MongoHadoop distribution includes a modified version of the Pig
-Tutorial from the Pig distribution for testing.
-
-This script differs from the Pig tutorial in that it loads the data
-from MongoDB and saves the results back to MongoDB.
-
-The use of Pig assumes you have Hadoop and Pig installed and
-configured on your system.
-
-To populate your MongoDB with the relevant data for the example,
-configure the script `examples/pigtutorial/populateMongo.pig` to
-include the connection URI to your MongoDB, then run:
-
- pig -x local examples/pigtutorial/populateMongo.pig
-
-Next configure the script `examples/pigtutorial/test.pig` to include
-the connection URI to your MongoDB. Make sure you've built
-using `ant jar`, then run:
-
- pig -x local examples/pigtutorial/test.pig
-
-You should find the data and the results in your MongoDB.
-
-NOTE: Make sure the version numbers on the built jars match those
-referenced in the script.
-
-
-## KNOWN ISSUES
-
-### Open Issues
-
-* You cannot configure bare regexes (e.g. /^foo/) in the config XML, as
 they won't parse. Use {"$regex": "^foo", "$options": ""} instead, and
 make sure to omit the slashes (a programmatic equivalent is sketched
 after this list).
-
-* [HADOOP-19 - MongoStorage fails when tuples w/i bags are not named](https://jira.mongodb.org/browse/HADOOP-19)
-
- This is due to an open Apache bug, [PIG-2509](https://issues.apache.org/jira/browse/PIG-2509).
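-
-If you configure the job programmatically instead of through XML, the
-same filter can be built as a `DBObject` and serialized into
-*mongo.input.query*; a small sketch (the field name `name` is
-hypothetical):
-
-    import com.mongodb.BasicDBObject;
-    import com.mongodb.DBObject;
-    import com.mongodb.util.JSON;
-    import org.apache.hadoop.conf.Configuration;
-
-    public class RegexQuery {
-        public static void main(String[] args) {
-            // Equivalent of {"name": {"$regex": "^foo", "$options": ""}},
-            // with no slashes around the pattern.
-            DBObject query = new BasicDBObject("name",
-                    new BasicDBObject("$regex", "^foo").append("$options", ""));
-
-            Configuration conf = new Configuration();
-            conf.set("mongo.input.query", JSON.serialize(query));
-        }
-    }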
-
-### Streaming
-
-Streaming support in MongoHadoop **requires** that the Hadoop
-distribution include the patches for the following issues:
-
-* [HADOOP-1722 - Make streaming to handle non-utf8 byte array](https://issues.apache.org/jira/browse/HADOOP-1722)
-* [HADOOP-5450 - Add support for application-specific typecodes to typed bytes](https://issues.apache.org/jira/browse/HADOOP-5450)
-* [MAPREDUCE-764 - TypedBytesInput's readRaw() does not preserve custom type codes](https://issues.apache.org/jira/browse/MAPREDUCE-764)
-
-The mainline Apache Hadoop distribution merged these patches for the
-0.21.0 release. We have also verified that the
-[Cloudera](http://cloudera.com) distribution (while still based on
-0.20.x) includes these patches as of CDH3 Update 1 (we now build
-against Update 3); anecdotal evidence (which needs confirmation)
-indicates they may have been present since CDH2.
-
-
-By default, the Mongo-Hadoop project builds against Apache Hadoop
-0.20.203, which does *not* include these patches. To build/enable
-Streaming support, you must build against either Cloudera CDH3u1 or
-Hadoop 0.21.0.
-
-Additionally, note that Hadoop 1.0 is based on the 0.20 release and, as
-such, *does not include* the patches necessary for streaming. This is
-frustrating and upsetting but unfortunately out of our hands. We are
-working to get these patches backported into a future release or to
-find another workaround.
+Documentation and Build Details: http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html
