lxrave/pymongo_spark
pymongo-spark integrates PyMongo, the Python driver for MongoDB, with PySpark, the Python front-end for Apache Spark. It is designed to be used in tandem with mongo-hadoop-spark.jar, which can be found under spark/build/libs after compiling mongo-hadoop (the jar will be available for download when 1.5 is officially released). For instructions on how to compile mongo-hadoop, please see the main project's README.

Installation

  1. Download mongo-hadoop:

    git clone https://github.com/mongodb/mongo-hadoop.git
    
  2. Go to the pymongo-spark directory of the project and install:

    cd mongo-hadoop/spark/src/main/python
    python setup.py install
    
  3. Install pymongo on each machine in your Spark cluster.

You'll also need to put mongo-hadoop-spark.jar (see above for instructions on how to obtain this) somewhere on Spark's CLASSPATH prior to using this package.

Usage

pymongo-spark works by monkey-patching PySpark's RDD and SparkContext classes. All you need to do is call the activate() function from pymongo_spark:

import pymongo_spark
pymongo_spark.activate()

# You are now ready to use BSON and MongoDB with PySpark.
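To make the monkey-patching step concrete, here is a minimal, self-contained sketch of the general technique activate() relies on: binding new methods onto an existing class at runtime. RDDStandIn and save_to_mongodb_stub below are hypothetical stand-ins for illustration, not part of pymongo-spark or PySpark.

```python
# A dummy class standing in for PySpark's RDD -- not the real thing.
class RDDStandIn:
    def __init__(self, data):
        self.data = data

def save_to_mongodb_stub(self, uri):
    # In pymongo-spark the patched method would write each record to
    # MongoDB; this stub just reports what it would do.
    return 'would save %d records to %s' % (len(self.data), uri)

def activate():
    # Monkey-patching: attach the function to the existing class, so
    # every instance gains a saveToMongoDB() method.
    RDDStandIn.saveToMongoDB = save_to_mongodb_stub

activate()
rdd = RDDStandIn([1, 2, 3])
print(rdd.saveToMongoDB('mongodb://localhost:27017/db.out'))
```

Because the methods are patched onto the class itself, any RDD you obtain afterwards (from transformations, file reads, etc.) carries the MongoDB methods with no further setup.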

Make sure to set the appropriate options to put mongo-hadoop-spark.jar on Spark's CLASSPATH. For example:

bin/pyspark --jars mongo-hadoop-spark.jar \
            --driver-class-path mongo-hadoop-spark.jar

You might also need to add pymongo-spark and/or PyMongo to Spark's PYTHONPATH explicitly:

bin/pyspark --py-files /path/to/pymongo_spark.py,/path/to/pymongo.egg

Examples

Read from MongoDB

>>> mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')
>>> print(mongo_rdd.first())
{u'_id': ObjectId('55cd069c6e32abacca39da2b'),
 u'hello': u'from MongoDB!'}
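The connection string passed to mongoRDD packs host, port, database, and collection into one URI, with the database and collection joined by a dot after the slash. A small stdlib-only sketch of how such a URI breaks down (this parsing is illustrative, not pymongo-spark's own code):

```python
from urllib.parse import urlparse

uri = 'mongodb://localhost:27017/db.collection'
parsed = urlparse(uri)

# The path holds '<database>.<collection>'; split on the first dot.
database, collection = parsed.path.lstrip('/').split('.', 1)
print(parsed.hostname, parsed.port, database, collection)
# localhost 27017 db collection
```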

Write to MongoDB

>>> some_rdd.saveToMongoDB('mongodb://localhost:27017/db.output_collection')

Reading from a BSON File

>>> file_path = 'my_bson_files/dump.bson'
>>> rdd = sc.BSONFileRDD(file_path)
>>> rdd.first()
{u'_id': ObjectId('55cd071e6e32abacca39da2c'),
 u'hello': u'from BSON!'}

Write to a BSON File

>>> some_rdd.saveToBSON('my_bson_files/output')
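A .bson dump file like the ones read and written above is simply a concatenation of BSON documents, each prefixed with its total byte length as a little-endian int32. The stdlib-only sketch below splits such a stream into raw documents without a full BSON parser; it is a conceptual illustration, not pymongo-spark's reader (which delegates to mongo-hadoop).

```python
import io
import struct

def split_bson_documents(stream):
    """Yield raw BSON documents from a concatenated dump stream.

    Each document's length field counts the whole document, including
    the 4 length bytes themselves and the trailing NUL byte.
    """
    while True:
        header = stream.read(4)
        if not header:
            return
        (length,) = struct.unpack('<i', header)
        yield header + stream.read(length - 4)

# Two empty BSON documents: int32 length 5, no elements, trailing NUL.
dump = b'\x05\x00\x00\x00\x00' * 2
docs = list(split_bson_documents(io.BytesIO(dump)))
print(len(docs))  # 2
```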
