.. _getting-started-with-hadoop:
===========================
Getting Started with Hadoop
===========================
.. default-domain:: mongodb
.. contents:: On this page
:local:
:backlinks: none
:depth: 1
:class: singlecol
MongoDB and Hadoop are a powerful combination and can be used together
to deliver complex analytics and data processing for data stored in
MongoDB. The following guide shows how you can start working with the
MongoDB Connector for Hadoop. Once you become familiar with the connector, you
can use it to pull your MongoDB data into Hadoop Map-Reduce jobs,
process the data and return results back to a MongoDB collection.
Prerequisites
-------------
Hadoop
~~~~~~
In order to use the following guide, you should already have Hadoop up and
running. This can range from a single-node, pseudo-distributed Hadoop
installation running locally to a deployed cluster with multiple nodes. As long
as you are able to run any of the examples on your Hadoop installation, you
should be all set. The Hadoop connector supports all Apache Hadoop versions 2.X
and up, including distributions based on these versions such as CDH4, CDH5, and
HDP 2.0 and up.
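
For example, one quick way to confirm that your installation works is to run
one of the MapReduce examples that ship with Hadoop. The exact path to the
examples jar depends on your distribution; the sketch below assumes a stock
Apache Hadoop 2.x layout under ``$HADOOP_HOME``:

.. code-block:: sh

   # Print the Hadoop version to confirm the binaries are on your PATH.
   hadoop version

   # Run the bundled "pi" estimator job (the jar path assumes a stock
   # Apache Hadoop 2.x install; adjust it for your distribution).
   hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10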
MongoDB
~~~~~~~
Install and run the latest version of MongoDB. In
addition, the MongoDB commands should be in your system search path
(i.e. ``$PATH``).
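
For example, you can verify that the binaries are on your ``$PATH`` and that a
server is reachable (adjust the host and port for your deployment):

.. code-block:: sh

   # Confirm the MongoDB binaries are on your search path.
   mongod --version
   mongo --version

   # Ping a locally running mongod.
   mongo --eval "db.runCommand({ ping: 1 })"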
If your :program:`mongod` requires authorization [#authrequired]_, the user
account used by the Hadoop connector must be able to run the
:authaction:`splitVector` command on the input database. You can either give the
user the :authrole:`clusterManager` role, or create a custom role for this:
.. code-block:: javascript
db.createRole({
role: "hadoopSplitVector",
privileges: [{
resource: {
db: "myDatabase",
collection: "myCollection"
},
actions: ["splitVector"]
}],
roles:[]
});
db.createUser({
user: "hadoopUser",
pwd: "secret",
roles: ["hadoopSplitVector", "readWrite"]
});
Note that the ``splitVector`` command cannot be run through a ``mongos``, so the
Hadoop connector will automatically connect to the primary shard in order to run
the command. If your input collection is unsharded and the connector reads
through a ``mongos``, make sure that your MongoDB Hadoop user is created on the
primary shard as well.
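
For example, a minimal sketch, assuming the primary shard for ``myDatabase`` is
reachable at the hypothetical address ``shard0.example.net:27018``:

.. code-block:: sh

   # Connect directly to the primary shard (hypothetical host/port) and run
   # the same createRole/createUser commands shown above in that shell.
   mongo shard0.example.net:27018/myDatabase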
.. [#authrequired] :program:`mongod` requires clients to
   authenticate as a user when you specify the
   :setting:`security.authorization` or :setting:`security.keyFile`
   settings or the corresponding :option:`--auth <mongod --auth>` or
   :option:`--keyFile <mongod --keyFile>` command-line options.
Miscellaneous
~~~~~~~~~~~~~
In addition to Hadoop, you should have ``git``, ``python``, and Java 7 or
later installed.
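
You can confirm these are available with, for example:

.. code-block:: sh

   git --version
   python --version
   java -version     # should report version 1.7 or later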
Examples
--------
The MongoDB Connector for Hadoop ships with a few examples of how to use the
connector in your own setup. In this guide, we'll focus on the Treasury Yield example.
You can run this example via Gradle with the following command:
.. code-block:: sh
./gradlew jar testJar historicalYield
You should see a lot of output scroll past. To understand what's going on, let's break
down the steps. The first thing this task does is import :file:`examples/treasury_yield/src/main/resources/yield_historical_in.json`
into MongoDB. You can view that data as shown below:
.. code-block:: sh
$ mongo mongo_hadoop
MongoDB shell version: 3.0.4
connecting to: mongo_hadoop
> show collections
system.indexes
yield_historical.in
yield_historical.out
> db.yield_historical.in.find()
{ "_id" : ISODate("1990-01-02T00:00:00Z"), "dayOfWeek" : "TUESDAY", "bc3Year" : 7.9, "bc5Year" : 7.87, "bc10Year" : 7.94, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.87, "bc3Month" : 7.83, "bc30Year" : 8, "bc1Year" : 7.81, "bc7Year" : 7.98, "bc6Month" : 7.89 }
{ "_id" : ISODate("1990-01-03T00:00:00Z"), "dayOfWeek" : "WEDNESDAY", "bc3Year" : 7.96, "bc5Year" : 7.92, "bc10Year" : 7.99, "bc20Year" : null, "bc1Month" : null, "bc2Year" : 7.94, "bc3Month" : 7.89, "bc30Year" : 8.04, "bc1Year" : 7.85, "bc7Year" : 8.04, "bc6Month" : 7.94 }
...
has more
>
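
Once the job finishes, the results land in the ``yield_historical.out``
collection. A quick way to inspect them (the exact documents depend on your
run, so no output is shown here):

.. code-block:: sh

   $ mongo mongo_hadoop
   > db.yield_historical.out.find().limit(5)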
When you run the examples, Gradle downloads the appropriate Hadoop distribution bundle, extracts it,
and copies over the appropriate dependencies for you. Subsequent runs will not re-download those files,
so it's safe to switch between versions of Hadoop; Gradle manages all of that for you. It is
recommended to run ``./gradlew clean`` when changing versions of Hadoop to
ensure everything is built against the correct versions of the libraries.
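
For example, when switching Hadoop versions you might run:

.. code-block:: sh

   ./gradlew clean
   ./gradlew jar testJar historicalYield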
Where to go from here
~~~~~~~~~~~~~~~~~~~~~
Read the full documentation on the MongoDB Connector for Hadoop
`here <https://github.com/mongodb/mongo-hadoop/wiki>`_.
To modify configuration options, you can put additional lines in the ``historicalYield`` task to set
properties passed to Hadoop at runtime. Read the full comments in the script for details on
using these options to read from and write to BSON files as well as MongoDB collections.
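
As a rough illustration of the kind of properties involved, the connector reads
settings such as ``mongo.input.uri`` and ``mongo.output.uri`` from the Hadoop
configuration. The sketch below assumes a job submitted directly with
``hadoop jar`` whose driver accepts generic options via ``ToolRunner``; the jar
name is a placeholder:

.. code-block:: sh

   # Placeholder jar name; mongo.input.uri / mongo.output.uri are connector
   # configuration properties read from the Hadoop configuration.
   hadoop jar treasury-yield-example.jar \
       -D mongo.input.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.in \
       -D mongo.output.uri=mongodb://localhost:27017/mongo_hadoop.yield_historical.out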