This is a collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using Python.
A good way of using these notebooks is to first clone the repo and then start your own IPython/Jupyter notebook server in pySpark mode. For example, if we have a standalone Spark installation running on localhost with a maximum of 6GB per node assigned to IPython:
MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark
Notice that the path to the pyspark command will depend on your specific installation. So, as a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.
For more Spark options see here. In general, the rule is that an option described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
We will be using datasets from the KDD Cup 1999. The results of this competition can be found here.
The reference book for these and other Spark related topics is:
- Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.
The following notebooks can be examined individually, although they follow a more or less linear 'story' when read in sequence. They all use the same dataset to solve a related set of tasks.
About reading data files and parallelizing local collections to create RDDs.
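For instance, a minimal sketch of both approaches might look as follows, assuming a SparkContext `sc` is already available (as it is in a pyspark notebook) and that the KDD Cup 1999 10% subset has been downloaded to the local path used below:

```python
# Create an RDD from a text file and another from a local Python collection.
# The gzipped file path is an assumption about where the data was downloaded.
data_file = "./kddcup.data_10_percent.gz"

raw_data = sc.textFile(data_file)      # RDD of raw CSV lines
print(raw_data.count())                # number of network interactions

numbers = range(100)
numbers_rdd = sc.parallelize(numbers)  # RDD built from a local collection
print(numbers_rdd.take(5))             # first five elements
```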
A look at map, filter, and collect.
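A minimal sketch of these three operations, reusing the `raw_data` RDD from the sketch above:

```python
# Parse each CSV line and keep only 'normal.' interactions
# (the tag is the last field of each interaction).
csv_data = raw_data.map(lambda line: line.split(","))
normal_data = csv_data.filter(lambda fields: fields[-1] == "normal.")

# collect() materialises the whole RDD on the driver, so use it with care
normal_interactions = normal_data.collect()
print(len(normal_interactions))
```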
RDD sampling methods explained.
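As a rough sketch, the two main sampling calls differ in whether they return an RDD or a local list (again assuming `raw_data` from above):

```python
# sample() is a transformation: it returns a new, smaller RDD
raw_sample = raw_data.sample(False, 0.1, 1234)   # no replacement, ~10%, seed
print(raw_sample.count())

# takeSample() is an action: it returns a local Python list
local_sample = raw_data.takeSample(False, 100, 1234)
print(len(local_sample))
```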
Brief introduction to some of the RDD pseudo-set operations.
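A small sketch of `subtract`, `distinct`, and `cartesian`, assuming `raw_data` from above (in the KDD data, field 1 is the protocol and field 2 the service):

```python
# subtract: everything that is not tagged as a normal interaction
normal_raw = raw_data.filter(lambda line: "normal." in line)
attack_raw = raw_data.subtract(normal_raw)

# distinct + cartesian: all (protocol, service) combinations seen in the data
csv = raw_data.map(lambda line: line.split(","))
protocols = csv.map(lambda fields: fields[1]).distinct()
services = csv.map(lambda fields: fields[2]).distinct()
combinations = protocols.cartesian(services).collect()
print(len(combinations))
```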
RDD actions reduce, fold, and aggregate.
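A brief sketch of the three actions, computed over the connection durations (field 0) of the parsed `csv_data` from above:

```python
durations = csv_data.map(lambda fields: float(fields[0]))

total = durations.reduce(lambda a, b: a + b)          # sum of all durations
total_fold = durations.fold(0.0, lambda a, b: a + b)  # same, with a zero value

# aggregate keeps a (sum, count) pair so the mean comes out in a single pass
sum_count = durations.aggregate(
    (0.0, 0),                                          # initial accumulator
    lambda acc, value: (acc[0] + value, acc[1] + 1),   # fold a value into acc
    lambda a, b: (a[0] + b[0], a[1] + b[1])            # merge two accumulators
)
print(round(sum_count[0] / sum_count[1], 3))
```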
How to deal with key/value pairs in order to aggregate and explore data.
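For example, a minimal sketch of per-tag aggregation with key/value pairs, using `csv_data` from above (the tag is the last field and the duration the first):

```python
# Build (tag, duration) pairs and aggregate them by key
key_value_duration = csv_data.map(
    lambda fields: (fields[-1], float(fields[0]))
)

durations_by_key = key_value_duration.reduceByKey(lambda a, b: a + b)
counts_by_key = key_value_duration.countByKey()   # dict: tag -> interaction count

print(durations_by_key.collect())
print(dict(counts_by_key))
```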
A notebook introducing MLlib local vector types and basic statistics, used for exploratory data analysis and model selection.
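A minimal sketch of column-wise summary statistics with MLlib, keeping just three numeric fields of `csv_data` as dense vectors (the field indices follow the KDD Cup 1999 layout):

```python
import numpy as np
from pyspark.mllib.stat import Statistics

def parse_interaction(fields):
    # duration, src_bytes and dst_bytes as a dense NumPy vector
    return np.array([float(fields[0]), float(fields[4]), float(fields[5])])

vector_data = csv_data.map(parse_interaction)

summary = Statistics.colStats(vector_data)   # column-wise summary statistics
print(summary.mean())
print(summary.variance())
print(summary.numNonzeros())
```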
Labeled points and logistic regression classification of network attacks in MLlib. Application of model selection techniques using a correlation matrix and hypothesis testing.
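As a rough sketch (not the notebook's exact pipeline), labeled points and a logistic regression classifier could be built from the same three numeric fields used above:

```python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

def create_labeled_point(fields):
    # label attacks as 1.0 and normal interactions as 0.0
    label = 0.0 if fields[-1] == "normal." else 1.0
    features = [float(fields[0]), float(fields[4]), float(fields[5])]
    return LabeledPoint(label, features)

training_data = csv_data.map(create_labeled_point)

model = LogisticRegressionWithLBFGS.train(training_data)
labels_and_preds = training_data.map(
    lambda p: (p.label, model.predict(p.features))
)
train_accuracy = (
    labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count()
    / float(training_data.count())
)
print(train_accuracy)
```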
Use of tree-based methods, and how they help with model interpretation and feature selection.
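A short sketch of a decision tree on the same labeled points, assuming `training_data` from the previous sketch; the printed tree makes the splits, and therefore the most informative features, easy to inspect:

```python
from pyspark.mllib.tree import DecisionTree

tree_model = DecisionTree.trainClassifier(
    training_data,
    numClasses=2,
    categoricalFeaturesInfo={},   # all features treated as continuous here
    impurity="gini",
    maxDepth=4,
    maxBins=32
)

print(tree_model.toDebugString())   # human-readable description of the splits
```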
In this notebook a schema is inferred for our network interactions dataset. Based on that, we use Spark SQL's DataFrame abstraction to perform a more structured exploratory data analysis.
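A minimal sketch of that workflow, inferring the schema from `Row` objects built out of `csv_data` (only a few of the KDD fields are kept here):

```python
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

row_data = csv_data.map(lambda f: Row(
    duration=int(f[0]),
    protocol_type=f[1],
    service=f[2],
    flag=f[3],
    src_bytes=int(f[4]),
    dst_bytes=int(f[5])
))

interactions_df = sqlContext.createDataFrame(row_data)  # schema inferred from the Rows
interactions_df.printSchema()
interactions_df.registerTempTable("interactions")

# a structured exploratory query: long tcp connections with no data returned
tcp_interactions = sqlContext.sql("""
    SELECT duration, dst_bytes FROM interactions
    WHERE protocol_type = 'tcp' AND duration > 1000 AND dst_bytes = 0
""")
tcp_interactions.show(5)
```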