pymongo-spark integrates PyMongo, the Python driver for MongoDB, with PySpark, the Python front-end for Apache Spark. It is designed to be used in tandem with mongo-hadoop-spark.jar, which can be found under spark/build/libs after compiling mongo-hadoop (the jar will be available for download when 1.5 is officially released). For instructions on how to compile mongo-hadoop, please see the main project's README.
Download mongo-hadoop:
git clone https://github.com/mongodb/mongo-hadoop.git
Go to the pymongo-spark directory of the project and install:
cd mongo-hadoop/spark/src/main/python
python setup.py install
Install pymongo on each machine in your Spark cluster.
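As an illustrative sanity check (not part of the original steps), both packages should be importable from the Python interpreter that Spark will use on each machine:

>>> import pymongo          # the MongoDB driver installed above
>>> import pymongo_spark    # the package installed from spark/src/main/python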
You'll also need to put mongo-hadoop-spark.jar (see above for instructions on how to obtain this) somewhere on Spark's CLASSPATH prior to using this package.
pymongo-spark works by monkey-patching PySpark's RDD and SparkContext classes. All you need to do is call the activate() function in pymongo_spark:
import pymongo_spark
pymongo_spark.activate()
# You are now ready to use BSON and MongoDB with PySpark.
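Because activate() patches the classes themselves, any SparkContext or RDD created afterwards exposes the methods used in the examples below. An illustrative check (not from the original README):

>>> from pyspark import SparkContext
>>> from pyspark.rdd import RDD
>>> hasattr(SparkContext, 'mongoRDD') and hasattr(SparkContext, 'BSONFileRDD')
True
>>> hasattr(RDD, 'saveToMongoDB') and hasattr(RDD, 'saveToBSON')
True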
Make sure to set the appropriate options to put mongo-hadoop-spark.jar on Spark's CLASSPATH. For example:
bin/pyspark --jars mongo-hadoop-spark.jar \
            --driver-class-path mongo-hadoop-spark.jar
You might also need to add pymongo-spark and/or PyMongo to Spark's PYTHONPATH explicitly:
bin/pyspark --py-files /path/to/pymongo_spark.py,/path/to/pymongo.egg
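The examples below use the interactive pyspark shell, but the same calls work from a standalone script submitted with spark-submit and the same --jars, --driver-class-path, and --py-files options shown above. A minimal sketch (the application name and connection string are illustrative):

from pyspark import SparkConf, SparkContext
import pymongo_spark

# Activate pymongo_spark so SparkContext and RDD gain the Mongo/BSON methods.
pymongo_spark.activate()

conf = SparkConf().setAppName('pymongo-spark-example')
sc = SparkContext(conf=conf)

mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')
print(mongo_rdd.first())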
Read from a MongoDB collection by passing a MongoDB connection string to mongoRDD:

>>> mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/db.collection')
>>> print(mongo_rdd.first())
{u'_id': ObjectId('55cd069c6e32abacca39da2b'), u'hello': u'from MongoDB!'}
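Documents come back as Python dictionaries (BSON types such as ObjectId are decoded by PyMongo), so ordinary RDD operations apply. For illustration, assuming the documents have a 'hello' field as in the sample output above:

>>> mongo_rdd.map(lambda doc: doc['hello']).first()
u'from MongoDB!'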
Save an RDD of documents to a MongoDB collection:

>>> some_rdd.saveToMongoDB('mongodb://localhost:27017/db.output_collection')
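Here some_rdd is assumed to be an RDD whose elements are documents (Python dictionaries). A small illustrative example, using a placeholder collection name:

>>> docs = sc.parallelize([{'hello': 'from PySpark!'}, {'hello': 'again'}])
>>> docs.saveToMongoDB('mongodb://localhost:27017/db.output_collection')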
Read documents from a BSON file:

>>> file_path = 'my_bson_files/dump.bson'
>>> rdd = sc.BSONFileRDD(file_path)
>>> rdd.first()
{u'_id': ObjectId('55cd071e6e32abacca39da2c'), u'hello': u'from BSON!'}
Save an RDD of documents to BSON files:

>>> some_rdd.saveToBSON('my_bson_files/output')
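Putting it together, the read and write helpers compose with ordinary RDD transformations. An illustrative pipeline (database, collection, and output path are placeholders) that filters and annotates documents from one collection, then writes the result both to another collection and to BSON files:

>>> source = sc.mongoRDD('mongodb://localhost:27017/db.collection')
>>> transformed = source.filter(lambda doc: 'hello' in doc) \
...                     .map(lambda doc: dict(doc, processed=True))
>>> transformed.saveToMongoDB('mongodb://localhost:27017/db.output_collection')
>>> transformed.saveToBSON('my_bson_files/output')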