## Using Spark with IPython

In [2]:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://master:7077")

<pyspark.conf.SparkConf at 0x7f03c0c73f90>

In [3]:
# create the context
sc = pyspark.SparkContext(conf=conf)

In [4]:
# sanity check: sum 0..9999 on the cluster
# (the argument to sumApprox is a timeout in milliseconds)
rdd = sc.parallelize(range(10000))
rdd.sumApprox(3)

49995000.0
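The value above can be cross-checked without a cluster, since the sum of `range(n)` is n(n-1)/2. A minimal sketch in plain Python; the helper name `expected_range_sum` is made up for illustration and is not part of PySpark.

```python
def expected_range_sum(n):
    """Closed-form sum of 0 + 1 + ... + (n - 1)."""
    return n * (n - 1) // 2

# Compare the closed form with a brute-force local sum.
n = 10000
local_sum = sum(range(n))
print(local_sum, expected_range_sum(n))
```

Both numbers should match the figure Spark returned for the same range.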

## Where to find examples

The Spark repository ships a set of Python examples:
https://github.com/apache/spark/tree/master/examples/src/main/python

## Where to find test data

Project Gutenberg lists its most frequently downloaded books, a convenient source of plain text:
http://www.gutenberg.org/browse/scores/top

## How to save data for Spark

The `/data/worker` directory from the `worker` container is also visible inside this container.

In [6]:
! ls /data/worker/books

pg1232.txt



From the Docker host, run `hdfs dfs -put` inside the worker container to copy the file into HDFS:

`$ docker exec -it containers_worker_1 hdfs dfs -put /data/worker/books/pg1232.txt hdfs://master:9000/mybook.txt`
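When several books need loading, the same command can be assembled from Python. This is a rough sketch that assumes the container name and HDFS address used in the command above; `hdfs_put_command` is a hypothetical helper. It only builds the argument list, and it leaves out `-it` because no interactive terminal is involved when a script launches the command.

```python
def hdfs_put_command(local_path, hdfs_name,
                     container="containers_worker_1",
                     namenode="hdfs://master:9000"):
    """Build the `docker exec ... hdfs dfs -put` argument list for one file.

    The default container name and namenode address are assumptions taken
    from the command shown above; adjust them to your setup.
    """
    return ["docker", "exec", container,
            "hdfs", "dfs", "-put", local_path,
            "{}/{}".format(namenode, hdfs_name)]

cmd = hdfs_put_command("/data/worker/books/pg1232.txt", "mybook.txt")
print(" ".join(cmd))
# To actually run it from the Docker host: subprocess.check_call(cmd)
```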



## A book example
<small> source: http://www.mccarroll.net/blog/pyspark/ </small>

In [7]:
lines = sc.textFile('hdfs://master:9000/mybook.txt')

In [8]:
lines

MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:-2

In [9]:
lines_nonempty = lines.filter(lambda x: len(x) > 0)

In [10]:
lines_nonempty.count()

4500

In [11]:
words = lines_nonempty.flatMap(lambda x: x.split())

In [12]:
# count each word, swap to (count, word), then sort by count, descending
wordcounts = (words.map(lambda x: (x, 1))
                   .reduceByKey(lambda x, y: x + y)
                   .map(lambda x: (x[1], x[0]))
                   .sortByKey(False))

In [13]:
wordcounts.take(10)

[(2948, u'the'),
 (2072, u'to'),
 (1770, u'of'),
 (1766, u'and'),
 (940, u'in'),
 (832, u'he'),
 (754, u'a'),
 (685, u'that'),
 (629, u'his'),
 (495, u'by')]
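To sanity-check the pipeline on a few lines without a cluster, the same steps can be mirrored in plain Python. This is a sketch using `collections.Counter` in place of the RDD operations, not the Spark code itself; `top_words` and the sample lines are made up for illustration.

```python
from collections import Counter

def top_words(lines, n=10):
    """Return the n most frequent whitespace-separated words as (count, word) pairs."""
    nonempty = [line for line in lines if len(line) > 0]      # like filter
    words = [w for line in nonempty for w in line.split()]    # like flatMap
    counts = Counter(words)                                   # like map + reduceByKey
    pairs = [(c, w) for w, c in counts.items()]               # swap to (count, word)
    # Sort by count, descending; ties are broken by word so the result is deterministic.
    pairs.sort(key=lambda p: (-p[0], p[1]))
    return pairs[:n]

sample = ["the cat and the dog", "", "the dog barks"]
print(top_words(sample, 3))
```

As in the Spark version, `split()` keeps case and punctuation, so `The` and `the` are counted as different words. The tie-breaking rule here is a local choice; Spark's `sortByKey` makes no promise about the order of words with equal counts.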

*Cheers*