
Spark Python Notebooks

This is a collection of IPython/Jupyter notebooks intended to train the reader in different Apache Spark concepts, from basic to advanced, using Python.

Instructions

A good way of using these notebooks is to first clone the repo and then start your own IPython/Jupyter notebook server in pySpark mode. For example, if we have a standalone Spark installation running on localhost with a maximum of 6 GB per node assigned to IPython:

MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark

Notice that the path to the pyspark command will depend on your specific installation. As a requirement, Spark must be installed on the same machine where you start the IPython notebook server.

For more Spark options see here. In general, the rule is that an option described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.

Datasets

We will be using datasets from the KDD Cup 1999. The results of this competition can be found here.
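
As an example, the reduced 10 percent subset of the data can be fetched with Python 2's urllib; the URL and file name below are assumptions about the usual UCI mirror, not something prescribed by this README:

    import urllib

    # Assumed location of the 10% KDD Cup 1999 subset; adjust if your mirror differs
    urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz",
                       "kddcup.data_10_percent.gz")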

References

The reference book for these and other Spark related topics is:

  • Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.

Notebooks

The following notebooks can be examined individually, although they follow a more or less linear 'story' when read in sequence. They all use the same dataset to solve a related set of tasks with it.

About reading files and parallelize.
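
A minimal sketch of both approaches, assuming a pySpark session where sc is the SparkContext and the KDD file from above sits in the working directory:

    # Create an RDD from a text file (gzip files are read transparently)
    raw_data = sc.textFile("./kddcup.data_10_percent.gz")
    print(raw_data.count())

    # Create an RDD from a local Python collection
    list_rdd = sc.parallelize(range(100))
    print(list_rdd.take(5))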

A look at map, filter, and collect.
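
For instance, reusing the raw_data RDD from the previous sketch, one might parse the CSV lines, keep only the normal interactions, and bring a result back to the driver:

    # Parse each CSV line into a list of fields
    csv_data = raw_data.map(lambda line: line.split(","))

    # Keep only interactions tagged as 'normal.' (the tag is the last field)
    normal_data = csv_data.filter(lambda fields: fields[-1] == "normal.")

    # collect() materialises the whole RDD on the driver; use with care on large data
    normal_interactions = normal_data.collect()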

RDD sampling methods explained.
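
The two main entry points are the sample transformation and the takeSample action; a quick sketch on the same RDD:

    # Transformation: roughly 10% of the data, without replacement, fixed seed
    raw_data_sample = raw_data.sample(False, 0.1, 1234)

    # Action: returns a plain Python list of 1000 elements on the driver
    raw_data_local_sample = raw_data.takeSample(False, 1000, 1234)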

Brief introduction to some of the RDD pseudo-set operations.
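
A short sketch of subtract, distinct, and cartesian on the parsed data (field positions follow the KDD Cup 1999 schema):

    # Everything that is not a normal interaction
    normal_raw = raw_data.filter(lambda line: "normal." in line)
    attack_raw = raw_data.subtract(normal_raw)

    # Unique protocols and services, and every (protocol, service) combination
    protocols = csv_data.map(lambda fields: fields[1]).distinct()
    services = csv_data.map(lambda fields: fields[2]).distinct()
    combinations = protocols.cartesian(services).collect()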

RDD actions reduce, fold, and aggregate.
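
For example, the total and mean connection duration (field 0 of each interaction) can be computed with all three actions:

    durations = csv_data.map(lambda fields: float(fields[0]))

    # reduce and fold both sum the values; fold also takes a zero value
    total_duration = durations.reduce(lambda a, b: a + b)
    total_duration_fold = durations.fold(0.0, lambda a, b: a + b)

    # aggregate keeps a (count, sum) pair so the mean comes out in a single pass
    count_and_sum = durations.aggregate(
        (0, 0.0),
        lambda acc, x: (acc[0] + 1, acc[1] + x),   # merge a value into the accumulator
        lambda a, b: (a[0] + b[0], a[1] + b[1]))   # merge accumulators across partitions
    mean_duration = count_and_sum[1] / count_and_sum[0]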

How to deal with key/value pairs in order to aggregate and explore data.
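
A sketch that keys each interaction by its tag (the last field) and aggregates per key:

    # (tag, duration) pairs, then total duration per tag
    durations_by_tag = csv_data.map(lambda fields: (fields[-1], float(fields[0])))
    sum_duration_by_tag = durations_by_tag.reduceByKey(lambda a, b: a + b)

    # countByKey is an action returning a dict of counts per tag on the driver
    counts_by_tag = csv_data.map(lambda fields: (fields[-1], fields)).countByKey()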

A notebook introducing Local Vector types, basic statistics in MLlib for Exploratory Data Analysis and model selection.
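
A minimal sketch using a few numeric columns (duration, src_bytes, dst_bytes, assuming the standard KDD field order) together with MLlib's column statistics:

    import numpy as np
    from pyspark.mllib.stat import Statistics

    # Build dense numeric vectors from three numeric columns
    vector_data = csv_data.map(
        lambda fields: np.array([float(fields[0]), float(fields[4]), float(fields[5])]))

    # Column-wise summary statistics for exploratory data analysis
    summary = Statistics.colStats(vector_data)
    print(summary.mean(), summary.variance(), summary.numNonzeros())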

Labeled points and Logistic Regression classification of network attacks in MLlib. Application of model selection techniques using a correlation matrix and hypothesis testing.
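
A condensed sketch of that pipeline; the choice of LogisticRegressionWithLBFGS and of which categorical fields to drop is an assumption for illustration:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.stat import Statistics

    def parse_interaction(line):
        fields = line.split(",")
        # Binary label: attack vs normal traffic
        label = 0.0 if fields[-1] == "normal." else 1.0
        # Keep numeric columns only, dropping categorical fields (1, 2, 3) and the tag (41)
        features = [float(x) for i, x in enumerate(fields) if i not in (1, 2, 3, 41)]
        return LabeledPoint(label, features)

    training_data = raw_data.map(parse_interaction)
    model = LogisticRegressionWithLBFGS.train(training_data)

    # Correlation matrix over the numeric features, usable for feature selection
    corr_matrix = Statistics.corr(training_data.map(lambda lp: lp.features), method="pearson")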

Use of tree-based methods, and how they help with explaining models and with feature selection.
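
Reusing the LabeledPoint RDD from the previous sketch, a decision tree can be trained and its structure printed, which is what makes the model easy to interpret (the hyper-parameters below are placeholders):

    from pyspark.mllib.tree import DecisionTree

    tree_model = DecisionTree.trainClassifier(
        training_data, numClasses=2, categoricalFeaturesInfo={},
        impurity="gini", maxDepth=4, maxBins=100)

    # The printed tree shows which features drive the splits
    print(tree_model.toDebugString())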

In this notebook a schema is inferred for our network interactions dataset. Based on that, we use Spark SQL's DataFrame abstraction to perform a more structured exploratory data analysis.
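
A small sketch of the idea with a handful of columns (the column subset is an illustration, not the full schema used in the notebook):

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)

    # Build Row objects so the DataFrame schema can be inferred from the field names
    row_data = csv_data.map(lambda f: Row(
        duration=int(f[0]), protocol_type=f[1], service=f[2],
        src_bytes=int(f[4]), dst_bytes=int(f[5])))

    interactions_df = sqlContext.createDataFrame(row_data)
    interactions_df.registerTempTable("interactions")

    # Structured exploration with plain SQL
    sqlContext.sql(
        "SELECT protocol_type, COUNT(*) AS total "
        "FROM interactions GROUP BY protocol_type").show()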

