Feature/multiple inputs #39

wants to merge 3 commits into



This is up-to-date code that lets you run a Hadoop job against multiple collections. To do so, provide multiple inputURIs when launching the job.

For example, if you want to run a streaming job, you can invoke hadoop like:

$HADOOP_COMMON_HOME/bin/hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc1-SNAPSHOT.jar -mapper pymapper.py -reducer pyreducer.py -inputURI mongodb:// -inputURI mongodb:// -outputURI mongodb:// -file pymapper.py -file pyreducer.py
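As a rough illustration of what accepting a repeated `-inputURI` flag involves, here is a minimal Python sketch that gathers every occurrence into an ordered list. It is not the code from this pull request: the function name `collect_input_uris`, the configuration keys, and the example URIs in the usage note are all hypothetical.

```python
import argparse

# Hypothetical configuration keys, used only for this illustration.
INPUT_URIS_KEY = "example.input.uris"
OUTPUT_URI_KEY = "example.output.uri"


def collect_input_uris(argv):
    """Return (input_uris, output_uri) parsed from a streaming-style argv."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # action="append" keeps every occurrence of the flag, in order.
    parser.add_argument("-inputURI", action="append", dest="input_uris")
    parser.add_argument("-outputURI", dest="output_uri")
    # Flags this sketch does not know about (-mapper, -file, ...) are ignored.
    known, _unknown = parser.parse_known_args(argv)
    return known.input_uris or [], known.output_uri


def to_config(input_uris, output_uri):
    """Store the parsed values in a plain dict standing in for a job config."""
    return {INPUT_URIS_KEY: list(input_uris), OUTPUT_URI_KEY: output_uri}
```

For example, calling `collect_input_uris` on an argument list containing `-inputURI mongodb://localhost/db.a -inputURI mongodb://localhost/db.b` (placeholder URIs) would return both URIs in the order given.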

mlew added some commits Mar 21, 2012
@mlew mlew Streaming takes multiple inputs from the cmdline
Processes multiple inputs from the command line in streaming mode and sets them in the configuration. Included in this
commit is code that provides access to the multiple input URIs.
@mlew mlew Build input split objects for each mongoURI object a489873
@mlew mlew Cleanup and removal of deprecated code 6c65db6
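The idea of building input splits for each input URI can be sketched roughly as follows: compute splits for each URI independently, tag every split with the URI it came from, and concatenate the results. This is only a conceptual sketch, not the connector's implementation. The `Split` class, the fixed-size chunking, and the document counts are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Split:
    """Hypothetical split: a half-open range of documents from one URI."""
    uri: str
    lower: int
    upper: int


def splits_for_uri(uri: str, doc_count: int, chunk: int) -> List[Split]:
    """Cut one collection into ranges of at most `chunk` documents."""
    return [
        Split(uri, start, min(start + chunk, doc_count))
        for start in range(0, doc_count, chunk)
    ]


def splits_for_all(doc_counts: Dict[str, int], chunk: int) -> List[Split]:
    """Concatenate the per-URI splits so one job covers every input."""
    all_splits: List[Split] = []
    for uri, count in doc_counts.items():
        all_splits.extend(splits_for_uri(uri, count, chunk))
    return all_splits
```

Because each split carries its own URI, a record reader could open the right collection for each split without any global state.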

That's nice!
I'd like to be able to specify a query, fields, and sort for each inputURI.

So in my pull request, I created a new class (MongoRequest, which contains the features above) and replaced MongoURI with it.


I think these are great features, and encourage further development of them. However, I do not believe that the lack of these features should prohibit the adoption of this pull request.


Yes! I believe so too!


Any updates on whether this pull request will be merged into master?


This will be merged into master. If you can coordinate with @muddydixon so that you are both on the same page, that'd be ideal.


What is the status of this?


Just out of curiosity, which pull request are you planning on merging for the use of multiple inputs? The branch muddydixon was referencing uses the DelegateInputFormat interface, which I believe is a more appropriate way of achieving my goal of being able to do a join using a mapper and a MultipleInputs class. He also includes an implementation of MultipleInputs, but it does not extend the mapreduce.lib.input version for some reason.

FYI I am trying to do something similar to this:


which in the new mapreduce.* would use this:


However, MultipleInputs expects a path (HDFS or otherwise), so I can't get it working without this patch.
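For readers unfamiliar with the kind of join described here, the following is a minimal in-memory Python sketch of a reduce-side join: each input gets its own "mapper" that tags records with their source, and a single "reducer" step pairs records that share a key. It does not use Hadoop or the connector, and the function names, source tags, and field names are hypothetical.

```python
from collections import defaultdict


def tag_records(source, records, key_field):
    """Mapper stand-in: emit (join key, (source tag, record)) pairs."""
    for record in records:
        yield record[key_field], (source, record)


def join_by_key(tagged_pairs, left, right):
    """Reducer stand-in: inner-join records from `left` and `right` by key."""
    grouped = defaultdict(lambda: defaultdict(list))
    for key, (source, record) in tagged_pairs:
        grouped[key][source].append(record)
    joined = []
    for key in sorted(grouped):
        # Keys present in only one source produce no output (inner join).
        for left_record in grouped[key][left]:
            for right_record in grouped[key][right]:
                joined.append((key, left_record, right_record))
    return joined
```

In a real job, the per-source tagging is what a separate mapper per input would provide, which is why each input needs to be registered with its own mapper.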


Any news on the release? Seems like the repo has been dead for a while.

I'm currently using a personally modified branch of this to support different queries on multiple inputs, but I'm wondering when the official release will be.


I submitted this 8 months ago, but have since given up hope that this patch will ever be integrated. At this point it's most likely too stale to apply the patch to master without significant work (but I haven't looked at it in close to 6 months). I hope at some point this project becomes active. I think there's a growing community of people who would like to contribute to and use this product, but it hasn't received the love it needs from the owners.


@mlew I've been monitoring the repo for changes for a while, and from what I can see your changes are still valid. The commits that have been made since are really trivial and don't seem to conflict with your patch. I'm using your patch in production right now and having no problems other than the lack of per-collection querying, which I hacked in. (I might have time for a more complete solution later, but until I see more activity on this project I'm not going to make that investment.)


Is this feature going to be merged? It would be very nice to have MultipleInputs for mongo.


Doesn't seem like it, does it?


Is there any reason? I can make another pull request with the merge if necessary.


I believe @mpobrien said we need some tests. He might be able to provide more insight.

mongodb member

Apologies for the delays on this. I've just recently taken over this repo and am working on getting testing up to speed (there's now a suite of unit tests in master in the folder /testing) so we can start handling pull requests smoothly.
I would like to merge this multiple input feature in. However, another outstanding request is to allow jobs to use multiple mongos hosts so that splits do not all need to be read from the same host (which helps with distributing load). I'm going to work on that feature in a branch, merge this into that first, test that they work properly together (and are not too confusing to configure), and then put them together into master.
Again, sorry for the hold-up.


It should already be possible to use multiple inputs with the connector. See mongo.input.uri. It looks like this feature was implemented as part of HADOOP-36 in 9b87197.

Apologies that this information didn't reach this thread sooner. I'm going to close this pull request, since the feature it describes is already implemented. Any improvements or concerns about its current implementation should be made into a new pull request or a new JIRA ticket here: https://jira.mongodb.org/browse/HADOOP

@llvtt llvtt closed this May 4, 2015