Added a "load-sample-data" task to use for loading samples into mongo…

… for testing/demos
commit 8878ac82d9c5ee08deee8b0eba9749f568ce7cb3 1 parent 99e47ac
@bwmcadams authored
40 docs/Combined+Pages.html
@@ -54,11 +54,11 @@ <h4 class="toctitle">Contents</h4>
</li><li>Priya Manda <priyakanth024@gmail.com> (Test Harness Code)
</li><li>Rushin Shah <rushin10@gmail.com> (Test Harness Code)
</li><li>Sarthak Dudhara <sarthak.83@gmail.com> (BSONWritable comparable interface)
-</li></ul><h1 id="Frequently+Asked+Questions">Frequently Asked Questions</h1><h2 id="Do+the+MongoInputFormat%2FMongoOutputFormats+use+HDFS%3F">Do the MongoInputFormat/MongoOutputFormats use HDFS?</h2><p>No. The <code>Mongo\*Format</code> code is designed to not use HDFS, instead reading and writing data directly between MongoDB + Hadoop.
+</li></ul><h1 id="Frequently+Asked+Questions">Frequently Asked Questions</h1><h3 id="Do+the+MongoInputFormat%2FMongoOutputFormats+use+HDFS%3F">Do the MongoInputFormat/MongoOutputFormats use HDFS?</h3><p>No. The <code>Mongo\*Format</code> code is designed to not use HDFS, instead reading and writing data directly between MongoDB + Hadoop.
</p><p>A forthcoming release will offer a <code>BSONInputFormat</code> and <code>BSONOutputFormat</code> which will allow for working offline with MongoDB backup files (in BSON format) on HDFS and S3.
-</p><h2 id="How+does+the+MongoDB+%2B+Hadoop+Connector+differ+from+Sqoop%3F">How does the MongoDB + Hadoop Connector differ from Sqoop?</h2><p>From the [Sqoop Wiki][https://github.com/cloudera/sqoop/wiki]: <em>“Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database … and automatically generates the necessary classes to import data into the Hadoop Distributed File System (HDFS)“</em>
+</p><h3 id="How+does+the+MongoDB+%2B+Hadoop+Connector+differ+from+Sqoop%3F">How does the MongoDB + Hadoop Connector differ from Sqoop?</h3><p>From the <a href="https://github.com/cloudera/sqoop/wiki">Sqoop Wiki</a>: <em>“Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database … and automatically generates the necessary classes to import data into the Hadoop Distributed File System (HDFS)“</em>
</p><p>The MongoDB + Hadoop Connector does not work with HDFS, instead reading and writing directly between MongoDB and Hadoop for the highest possible performance. This also allows for Hadoop jobs to have the freshest possible view of their input data without an intermediary export process.
-</p><h3 id="Is+integration+possible+between+MongoDB+and+Sqoop%3F">Is integration possible between MongoDB and Sqoop?</h3><p>As MongoDB is neither a relational database nor utilizes JDBC for connectivity, integration with Sqoop does not seem feasible at this time.
+</p><h4 id="Is+integration+possible+between+MongoDB+and+Sqoop%3F">Is integration possible between MongoDB and Sqoop?</h4><p>As MongoDB is neither a relational database nor utilizes JDBC for connectivity, integration with Sqoop does not seem feasible at this time.
</p><p>A forthcoming release of the MongoDB + Hadoop Connector will offer a <code>BSONInputFormat</code> and <code>BSONOutputFormat</code> which will allow for working offline with MongoDB backup files (in BSON format) on HDFS and S3, without a live MongoDB database.
</p><h1 id="Getting+Started">Getting Started</h1><p>To get started with MongoDB + Hadoop, you’ll need a few things:
</p><ul><li>A MongoDB Installation (<em>mongo-hadoop</em> supports all MongoDB configurations including Sharding)
@@ -121,7 +121,7 @@ <h4 class="toctitle">Contents</h4>
artifacts for Hadoop 0.21 at present. You may need to resolve these
dependencies by hand if you chose to build using this
configuration.
-</p><h1 id="Configuration+%26+Behavior">Configuration &amp; Behavior</h1><h2 id="Hadoop+MapReduce">Hadoop MapReduce</h2><p>Provides working <em>Input</em> and <em>Output</em> adapters for MongoDB. You may
+</p><h1 id="Configuration+%26+Behavior">Configuration &amp; Behavior</h1><h2 id="Hadoop+MapReduce">Hadoop MapReduce</h2><p>This package provides working <em>Input</em> and <em>Output</em> adapters for MongoDB. You may
configure these adapters with XML or programatically. See the
WordCount examples for demonstrations of both approaches. You can
specify a query, fields and sort specs in the XML config as JSON or
@@ -137,32 +137,12 @@ <h4 class="toctitle">Contents</h4>
</p><ol><li><p>For unsharded the source collections, MongoHadoop follows the
</p><p> “unsharded split” path. (See below.)
</p></li><li><p>For sharded source collections:
-</p><ul><li>If <code>mongo.input.split.read_shard_chunks</code> is <strong>true</strong>
-(defaults <strong>true</strong>) then we pull the chunk specs from the
-configuration server, and turn each shard chunk into an <em>Input
-Split</em>. Basically, this means the mongodb sharding system does
-</li></ul><p> 99% of the preconfig work for us and is a good thing™
-</p><ul><li>If <code>read_shard_chunks</code> is <strong>false</strong> and
-<code>mongo.input.split.read_from_shards</code> is <strong>true</strong> (it defaults
-to <strong>false</strong>) then we connect to the <code>mongod</code> or replica set
-for each shard individually and each shard becomes an input
-split. The entire content of the collection on the shard is one
-split. Only use this configuration in rare situations.
-</li></ul><ul><li>If <code>read_shard_chunks</code> is <strong>true</strong> and
-<code>mongo.input.split.read_from_shards</code> is <strong>true</strong> (it defaults
-to <strong>false</strong>) MongoHadoop reads the chunk boundaries from
-the config server but then reads data directly from the shards
-without using the <code>mongos</code>. While this may seem like a good
-idea, it can cause erratic behavior if MongoDB balances chunks
-during a Hadoop job. This is not a recommended configuration
-for write heavy applications but may provide effective
-parallelism in read heavy apps.
-</li></ul><ul><li>If both <code>create_input_splits</code> and <code>read_from_shards</code> are
-</li></ul><p> <strong>false</strong> disabled then we pretend there is no sharding and use
- the “unsharded split” path. When <code>read_shard_chunks</code> is
- <strong>false</strong> MongoHadoop reads everything through mongos as a
- single split.
-</p></li></ol><h3 id="%E2%80%9CUnsharded+Splits%E2%80%9C">“Unsharded Splits“</h3><p>“Unsharded Splits” refers to the system that MongoHadoop uses to
+</p><ul><li>If <code>mongo.input.split.read_shard_chunks</code> is <strong>true</strong> (defaults <strong>true</strong>) then we pull the chunk specs from the
+configuration server, and turn each shard chunk into an <em>Input Split</em>. Basically, this means the mongodb sharding system does 99% of the preconfig work for us and is a good thing™
+</li></ul><ul><li>If <code>read_shard_chunks</code> is <strong>false</strong> and <code>mongo.input.split.read_from_shards</code> is <strong>true</strong> (it defaults to <strong>false</strong>) then we connect to the <code>mongod</code> or replica set for each shard individually and each shard becomes an input split. The entire content of the collection on the shard is one split. Only use this configuration in rare situations.
+</li></ul><ul><li>If <code>read_shard_chunks</code> is <strong>true</strong> and <code>mongo.input.split.read_from_shards</code> is <strong>true</strong> (it defaults to <strong>false</strong>) MongoHadoop reads the chunk boundaries from the config server but then reads data directly from the shards without using the <code>mongos</code>. While this may seem like a good idea, it can cause erratic behavior if MongoDB balances chunks during a Hadoop job. This is not a recommended configuration for write heavy applications but may provide effective parallelism in read heavy apps.
+</li></ul><ul><li>If both <code>create_input_splits</code> and <code>read_from_shards</code> are <strong>false</strong> disabled then we pretend there is no sharding and use the “unsharded split” path. When <code>read_shard_chunks</code> is <strong>false</strong> MongoHadoop reads everything through mongos as a single split.
+</li></ul></li></ol><h3 id="%E2%80%9CUnsharded+Splits%E2%80%9C">“Unsharded Splits“</h3><p>“Unsharded Splits” refers to the system that MongoHadoop uses to
calculate new splits. You may use “Unsharded splits” with sharded
MongoDB options.
</p><p>This is only used:
34 docs/Configuration+&+Behavior.html
@@ -34,7 +34,7 @@
</div>
</div>
<div class="span-16 prepend-1 append-1 contents">
- <h1 id="Configuration+%26+Behavior">Configuration &amp; Behavior</h1><h2 id="Hadoop+MapReduce">Hadoop MapReduce</h2><p>Provides working <em>Input</em> and <em>Output</em> adapters for MongoDB. You may
+ <h1 id="Configuration+%26+Behavior">Configuration &amp; Behavior</h1><h2 id="Hadoop+MapReduce">Hadoop MapReduce</h2><p>This package provides working <em>Input</em> and <em>Output</em> adapters for MongoDB. You may
configure these adapters with XML or programatically. See the
WordCount examples for demonstrations of both approaches. You can
specify a query, fields and sort specs in the XML config as JSON or
@@ -50,32 +50,12 @@ <h1 id="Configuration+%26+Behavior">Configuration &amp; Behavior</h1><h2 id="Had
</p><ol><li><p>For unsharded the source collections, MongoHadoop follows the
</p><p> “unsharded split” path. (See below.)
</p></li><li><p>For sharded source collections:
-</p><ul><li>If <code>mongo.input.split.read_shard_chunks</code> is <strong>true</strong>
-(defaults <strong>true</strong>) then we pull the chunk specs from the
-configuration server, and turn each shard chunk into an <em>Input
-Split</em>. Basically, this means the mongodb sharding system does
-</li></ul><p> 99% of the preconfig work for us and is a good thing™
-</p><ul><li>If <code>read_shard_chunks</code> is <strong>false</strong> and
-<code>mongo.input.split.read_from_shards</code> is <strong>true</strong> (it defaults
-to <strong>false</strong>) then we connect to the <code>mongod</code> or replica set
-for each shard individually and each shard becomes an input
-split. The entire content of the collection on the shard is one
-split. Only use this configuration in rare situations.
-</li></ul><ul><li>If <code>read_shard_chunks</code> is <strong>true</strong> and
-<code>mongo.input.split.read_from_shards</code> is <strong>true</strong> (it defaults
-to <strong>false</strong>) MongoHadoop reads the chunk boundaries from
-the config server but then reads data directly from the shards
-without using the <code>mongos</code>. While this may seem like a good
-idea, it can cause erratic behavior if MongoDB balances chunks
-during a Hadoop job. This is not a recommended configuration
-for write heavy applications but may provide effective
-parallelism in read heavy apps.
-</li></ul><ul><li>If both <code>create_input_splits</code> and <code>read_from_shards</code> are
-</li></ul><p> <strong>false</strong> disabled then we pretend there is no sharding and use
- the “unsharded split” path. When <code>read_shard_chunks</code> is
- <strong>false</strong> MongoHadoop reads everything through mongos as a
- single split.
-</p></li></ol><h3 id="%E2%80%9CUnsharded+Splits%E2%80%9C">“Unsharded Splits“</h3><p>“Unsharded Splits” refers to the system that MongoHadoop uses to
+</p><ul><li>If <code>mongo.input.split.read_shard_chunks</code> is <strong>true</strong> (defaults <strong>true</strong>) then we pull the chunk specs from the
+configuration server, and turn each shard chunk into an <em>Input Split</em>. Basically, this means the mongodb sharding system does 99% of the preconfig work for us and is a good thing™
+</li></ul><ul><li>If <code>read_shard_chunks</code> is <strong>false</strong> and <code>mongo.input.split.read_from_shards</code> is <strong>true</strong> (it defaults to <strong>false</strong>) then we connect to the <code>mongod</code> or replica set for each shard individually and each shard becomes an input split. The entire content of the collection on the shard is one split. Only use this configuration in rare situations.
+</li></ul><ul><li>If <code>read_shard_chunks</code> is <strong>true</strong> and <code>mongo.input.split.read_from_shards</code> is <strong>true</strong> (it defaults to <strong>false</strong>) MongoHadoop reads the chunk boundaries from the config server but then reads data directly from the shards without using the <code>mongos</code>. While this may seem like a good idea, it can cause erratic behavior if MongoDB balances chunks during a Hadoop job. This is not a recommended configuration for write heavy applications but may provide effective parallelism in read heavy apps.
+</li></ul><ul><li>If both <code>create_input_splits</code> and <code>read_from_shards</code> are <strong>false</strong> disabled then we pretend there is no sharding and use the “unsharded split” path. When <code>read_shard_chunks</code> is <strong>false</strong> MongoHadoop reads everything through mongos as a single split.
+</li></ul></li></ol><h3 id="%E2%80%9CUnsharded+Splits%E2%80%9C">“Unsharded Splits“</h3><p>“Unsharded Splits” refers to the system that MongoHadoop uses to
calculate new splits. You may use “Unsharded splits” with sharded
MongoDB options.
</p><p>This is only used:
6 docs/Frequently+Asked+Questions.html
@@ -34,11 +34,11 @@
</div>
</div>
<div class="span-16 prepend-1 append-1 contents">
- <h1 id="Frequently+Asked+Questions">Frequently Asked Questions</h1><h2 id="Do+the+MongoInputFormat%2FMongoOutputFormats+use+HDFS%3F">Do the MongoInputFormat/MongoOutputFormats use HDFS?</h2><p>No. The <code>Mongo\*Format</code> code is designed to not use HDFS, instead reading and writing data directly between MongoDB + Hadoop.
+ <h1 id="Frequently+Asked+Questions">Frequently Asked Questions</h1><h3 id="Do+the+MongoInputFormat%2FMongoOutputFormats+use+HDFS%3F">Do the MongoInputFormat/MongoOutputFormats use HDFS?</h3><p>No. The <code>Mongo\*Format</code> code is designed to not use HDFS, instead reading and writing data directly between MongoDB + Hadoop.
</p><p>A forthcoming release will offer a <code>BSONInputFormat</code> and <code>BSONOutputFormat</code> which will allow for working offline with MongoDB backup files (in BSON format) on HDFS and S3.
-</p><h2 id="How+does+the+MongoDB+%2B+Hadoop+Connector+differ+from+Sqoop%3F">How does the MongoDB + Hadoop Connector differ from Sqoop?</h2><p>From the [Sqoop Wiki][https://github.com/cloudera/sqoop/wiki]: <em>“Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database … and automatically generates the necessary classes to import data into the Hadoop Distributed File System (HDFS)“</em>
+</p><h3 id="How+does+the+MongoDB+%2B+Hadoop+Connector+differ+from+Sqoop%3F">How does the MongoDB + Hadoop Connector differ from Sqoop?</h3><p>From the <a href="https://github.com/cloudera/sqoop/wiki">Sqoop Wiki</a>: <em>“Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database … and automatically generates the necessary classes to import data into the Hadoop Distributed File System (HDFS)“</em>
</p><p>The MongoDB + Hadoop Connector does not work with HDFS, instead reading and writing directly between MongoDB and Hadoop for the highest possible performance. This also allows for Hadoop jobs to have the freshest possible view of their input data without an intermediary export process.
-</p><h3 id="Is+integration+possible+between+MongoDB+and+Sqoop%3F">Is integration possible between MongoDB and Sqoop?</h3><p>As MongoDB is neither a relational database nor utilizes JDBC for connectivity, integration with Sqoop does not seem feasible at this time.
+</p><h4 id="Is+integration+possible+between+MongoDB+and+Sqoop%3F">Is integration possible between MongoDB and Sqoop?</h4><p>As MongoDB is neither a relational database nor utilizes JDBC for connectivity, integration with Sqoop does not seem feasible at this time.
</p><p>A forthcoming release of the MongoDB + Hadoop Connector will offer a <code>BSONInputFormat</code> and <code>BSONOutputFormat</code> which will allow for working offline with MongoDB backup files (in BSON format) on HDFS and S3, without a live MongoDB database.
</p><div class="tocwrapper show">
<a class="tochead nav" style="display: none" href="#toc">❦</a>
2  docs/pamflet.manifest
@@ -1,5 +1,5 @@
CACHE MANIFEST
-# Tue Mar 06 13:54:47 EST 2012
+# Tue Mar 06 14:57:51 EST 2012
MongoDB%2BHadoop+Connector.html
Frequently+Asked+Questions.html
Getting+Started.html
8 docs/src/01.markdown
@@ -1,19 +1,19 @@
Frequently Asked Questions
==========================
-## Do the MongoInputFormat/MongoOutputFormats use HDFS?
+### Do the MongoInputFormat/MongoOutputFormats use HDFS?
No. The `Mongo\*Format` code is designed to not use HDFS, instead reading and writing data directly between MongoDB + Hadoop.
A forthcoming release will offer a `BSONInputFormat` and `BSONOutputFormat` which will allow for working offline with MongoDB backup files (in BSON format) on HDFS and S3.
-## How does the MongoDB + Hadoop Connector differ from Sqoop?
+### How does the MongoDB + Hadoop Connector differ from Sqoop?
-From the [Sqoop Wiki][https://github.com/cloudera/sqoop/wiki]: *"Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database ... and automatically generates the necessary classes to import data into the Hadoop Distributed File System (HDFS)"*
+From the [Sqoop Wiki](https://github.com/cloudera/sqoop/wiki): *"Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database ... and automatically generates the necessary classes to import data into the Hadoop Distributed File System (HDFS)"*
The MongoDB + Hadoop Connector does not work with HDFS, instead reading and writing directly between MongoDB and Hadoop for the highest possible performance. This also allows for Hadoop jobs to have the freshest possible view of their input data without an intermediary export process.
-### Is integration possible between MongoDB and Sqoop?
+#### Is integration possible between MongoDB and Sqoop?
As MongoDB is neither a relational database nor utilizes JDBC for connectivity, integration with Sqoop does not seem feasible at this time.
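Because the connector bypasses HDFS and moves data directly between MongoDB and Hadoop (as the answers above note), a job is pointed at MongoDB connection URIs rather than HDFS paths. A minimal sketch, assuming the connector's conventional `mongo.input.uri`/`mongo.output.uri` property names and a local `mongod`; these names are assumptions for illustration, not taken from the diff above:

```scala
// Sketch only: wiring a Hadoop job straight to MongoDB, with no HDFS paths.
// The "mongo.input.uri" / "mongo.output.uri" keys and the collection names are
// assumed for illustration; consult the connector docs for the exact keys.
import org.apache.hadoop.conf.Configuration

object DirectMongoConfigSketch extends App {
  val conf = new Configuration()
  // Input documents come straight from a MongoDB collection...
  conf.set("mongo.input.uri", "mongodb://localhost:27017/mongo_hadoop/yield_historical.in")
  // ...and results are written straight back to another collection.
  conf.set("mongo.output.uri", "mongodb://localhost:27017/mongo_hadoop/yield_historical.out")
}
```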
38 docs/src/02/b.markdown
@@ -4,7 +4,7 @@ Configuration & Behavior
## Hadoop MapReduce
-Provides working *Input* and *Output* adapters for MongoDB. You may
+This package provides working *Input* and *Output* adapters for MongoDB. You may
configure these adapters with XML or programatically. See the
WordCount examples for demonstrations of both approaches. You can
specify a query, fields and sort specs in the XML config as JSON or
@@ -28,34 +28,14 @@ When true, as by default, the following possible behaviors exist:
2. For sharded source collections:
- * If `mongo.input.split.read_shard_chunks` is **true**
- (defaults **true**) then we pull the chunk specs from the
- configuration server, and turn each shard chunk into an *Input
- Split*. Basically, this means the mongodb sharding system does
- 99% of the preconfig work for us and is a good thing™
-
- * If `read_shard_chunks` is **false** and
- `mongo.input.split.read_from_shards` is **true** (it defaults
- to **false**) then we connect to the `mongod` or replica set
- for each shard individually and each shard becomes an input
- split. The entire content of the collection on the shard is one
- split. Only use this configuration in rare situations.
-
- * If `read_shard_chunks` is **true** and
- `mongo.input.split.read_from_shards` is **true** (it defaults
- to **false**) MongoHadoop reads the chunk boundaries from
- the config server but then reads data directly from the shards
- without using the `mongos`. While this may seem like a good
- idea, it can cause erratic behavior if MongoDB balances chunks
- during a Hadoop job. This is not a recommended configuration
- for write heavy applications but may provide effective
- parallelism in read heavy apps.
-
- * If both `create_input_splits` and `read_from_shards` are
- **false** disabled then we pretend there is no sharding and use
- the "unsharded split" path. When `read_shard_chunks` is
- **false** MongoHadoop reads everything through mongos as a
- single split.
+ * If `mongo.input.split.read_shard_chunks` is **true** (defaults **true**) then we pull the chunk specs from the
+ configuration server, and turn each shard chunk into an *Input Split*. Basically, this means the mongodb sharding system does 99% of the preconfig work for us and is a good thing™
+
+ * If `read_shard_chunks` is **false** and `mongo.input.split.read_from_shards` is **true** (it defaults to **false**) then we connect to the `mongod` or replica set for each shard individually and each shard becomes an input split. The entire content of the collection on the shard is one split. Only use this configuration in rare situations.
+
+ * If `read_shard_chunks` is **true** and `mongo.input.split.read_from_shards` is **true** (it defaults to **false**) MongoHadoop reads the chunk boundaries from the config server but then reads data directly from the shards without using the `mongos`. While this may seem like a good idea, it can cause erratic behavior if MongoDB balances chunks during a Hadoop job. This is not a recommended configuration for write heavy applications but may provide effective parallelism in read heavy apps.
+
+ * If both `create_input_splits` and `read_from_shards` are **false** disabled then we pretend there is no sharding and use the "unsharded split" path. When `read_shard_chunks` is **false** MongoHadoop reads everything through mongos as a single split.
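A hedged sketch of driving the shard-split choices in the list above programmatically on a Hadoop `Configuration` follows. Only `mongo.input.split.read_shard_chunks` and `mongo.input.split.read_from_shards` appear by full name in the text; the `create_input_splits` key below assumes the same prefix and is illustrative only:

```scala
// Illustrative only: the boolean switches behind the bullet list above.
// The create_input_splits key is assumed to share the mongo.input.split. prefix.
import org.apache.hadoop.conf.Configuration

object SplitConfigSketch extends App {
  val conf = new Configuration()
  conf.setBoolean("mongo.input.split.create_input_splits", true)  // assumed full key name
  conf.setBoolean("mongo.input.split.read_shard_chunks", true)    // default: true
  conf.setBoolean("mongo.input.split.read_from_shards", false)    // default: false
  // This combination follows the first bullet: one InputSplit per shard chunk,
  // with reads still routed through mongos.
}
```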
### "Unsharded Splits"
12 project/MongoHadoopBuild.scala
@@ -2,6 +2,7 @@ import sbt._
import Keys._
import Reference._
import sbtassembly.Plugin._
+import Process._
import AssemblyKeys._
object MongoHadoopBuild extends Build {
@@ -15,6 +16,7 @@ object MongoHadoopBuild extends Build {
/** The version of Hadoop to build against. */
lazy val hadoopRelease = SettingKey[String]("hadoop-release", "Hadoop Target Release Distro/Version")
+ val loadSampleData = TaskKey[Unit]("load-sample-data", "Loads sample data for example programs")
private val stockPig = "0.9.2"
@@ -43,11 +45,11 @@ object MongoHadoopBuild extends Build {
lazy val root = Project( id = "mongo-hadoop",
base = file("."),
- settings = dependentSettings ) aggregate(core, flume, pig)
+ settings = dependentSettings ++ Seq(loadSampleDataTask)) aggregate(core, flume, pig)
lazy val core = Project( id = "mongo-hadoop-core",
base = file("core"),
- settings = coreSettings )
+ settings = coreSettings )
lazy val pig = Project( id = "mongo-hadoop-pig",
@@ -173,6 +175,12 @@ object MongoHadoopBuild extends Build {
testFrameworks += TestFrameworks.Specs2
)
+ val loadSampleDataTask = loadSampleData := {
+ println("Loading Sample data ...")
+ "mongoimport -d mongo_hadoop -c yield_historical.in --drop examples/treasury_yield/src/main/resources/yield_historical_in.json" !
+
+ "mongoimport -d mongo_hadoop -c ufo_sightings.in --drop examples/ufo_sightings/src/main/resources/ufo_awesome.json" !
+ }
def hadoopDependencies(hadoopVersion: String, useStreaming: Boolean, pigVersion: String, altStreamingVer: Option[String] = None, nextGen: Boolean = false): (Option[() => Seq[ModuleID]], () => Seq[ModuleID], String, () => Seq[ModuleID]) = {
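The new task shells out to `mongoimport` through sbt's process support, so running `sbt load-sample-data` from the repository root should populate the example collections, assuming `mongoimport` is on the PATH and a local `mongod` is running. For reference, a standalone sketch of the same two imports using `scala.sys.process` instead of the sbt `Process` import:

```scala
// Standalone equivalent of the load-sample-data task (sketch, not part of the build).
// Assumes mongoimport is on the PATH, a local mongod is running, and the paths
// below resolve relative to the repository root.
import scala.sys.process._

object LoadSampleData extends App {
  println("Loading sample data ...")
  "mongoimport -d mongo_hadoop -c yield_historical.in --drop examples/treasury_yield/src/main/resources/yield_historical_in.json".!
  "mongoimport -d mongo_hadoop -c ufo_sightings.in --drop examples/ufo_sightings/src/main/resources/ufo_awesome.json".!
}
```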