# Using Spark with IPython

This small notebook is useful for quickly testing a Spark configuration.



### Logs
To see what is happening between the master and the worker inside the cluster,
you can follow the Spark cluster logs with two Docker commands:
```
# First shell
docker logs -f containers_master_1
# Second shell
docker logs -f containers_worker_1
```

### Connect
These operations take a few seconds.

In [1]:
# Create a Spark configuration
import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://master:7077")

<pyspark.conf.SparkConf at 0x7f5160229090>

In [2]:
# Create the context
sc = pyspark.SparkContext(conf=conf)

### Test with one operation in parallel

In [3]:
# Run a small distributed job to prove the connection works
rdd = sc.parallelize(range(10000))
rdd.sumApprox(3)

49995000.0
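As far as the PySpark API goes, `sumApprox` takes a timeout in milliseconds and may return a partial estimate if the job has not finished in time, so it is worth cross-checking the value above. A minimal sketch in plain Python (no Spark needed), comparing a direct sum with the closed form n(n-1)/2:

```python
# Plain-Python cross-check of the Spark result; runs without a cluster.
n = 10000
exact_sum = sum(range(n))        # 0 + 1 + ... + 9999
closed_form = n * (n - 1) // 2   # Gauss formula for the same sum

print(exact_sum, closed_form)
```

Both values agree with the `49995000.0` reported by the cluster, which suggests the job completed within the timeout in this run.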

### Examples

There are many examples to test here:

https://github.com/apache/spark/tree/master/examples/src/main/python

### Where to find data to test

An easy-to-use library of texts is available here:

http://www.gutenberg.org/browse/scores/top

### Make use of HDFS with our Spark cluster

The `/data/worker` directory of the `worker` container is also visible inside this container.

In [4]:
! ls /data/worker/books

pg1232.txt


Note: only the worker has access to the `data` directory and also client access to the HDFS server.

So you should use the worker's Docker shell to save data into HDFS:

`$ docker exec -it containers_worker_1 hdfs dfs -put /data/worker/books/pg1232.txt hdfs://master:9000/mybook.txt`
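The target URI in the command above has to match the path later passed to `sc.textFile`. One way to keep the two in sync is to build both from the same helper. The sketch below is only an illustration: `hdfs_uri` and `hdfs_put_command` are hypothetical names, not part of Spark or Hadoop, and the command is assembled but never executed.

```python
HDFS_ROOT = "hdfs://master:9000"  # NameNode address used in this notebook


def hdfs_uri(name):
    """Return the full HDFS URI for a file name under the HDFS root."""
    return HDFS_ROOT + "/" + name.lstrip("/")


def hdfs_put_command(local_path, name, container="containers_worker_1"):
    """Build (but do not run) the docker exec command that uploads a file."""
    return ["docker", "exec", "-it", container,
            "hdfs", "dfs", "-put", local_path, hdfs_uri(name)]


cmd = hdfs_put_command("/data/worker/books/pg1232.txt", "mybook.txt")
print(" ".join(cmd))
```

The same `hdfs_uri("mybook.txt")` string can then be handed to `sc.textFile`, so a renamed file only needs to be changed in one place.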



## RDD on a book saved in HDFS

If you executed the previous command, the book should now be available on HDFS.
Let's run some operations on it.

<small> source: http://www.mccarroll.net/blog/pyspark/ </small>

In [5]:
lines = sc.textFile('hdfs://master:9000/mybook.txt')

In [6]:
lines

MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:-2

In [7]:
lines_nonempty = lines.filter(lambda x: len(x) > 0)

In [8]:
lines_nonempty.count()

4500

In [9]:
words = lines_nonempty.flatMap(lambda x: x.split())
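The choice of `flatMap` over `map` matters here: `map` would give one list of words per line, while `flatMap` flattens those lists into a single sequence of words. A rough plain-Python analogy, using two made-up sample lines rather than the book itself:

```python
from itertools import chain

# Hypothetical sample input standing in for the lines of the book.
sample_lines = ["to be or", "not to be"]

# Analogous to rdd.map(lambda x: x.split()): one list of words per line.
mapped = [line.split() for line in sample_lines]

# Analogous to rdd.flatMap(lambda x: x.split()): one flat sequence of words.
flat_mapped = list(chain.from_iterable(mapped))

print(mapped)
print(flat_mapped)
```

Only the flattened form is suitable for the per-word counting that follows.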

In [10]:
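wordcounts = (
    words.map(lambda x: (x, 1))             # pair each word with a count of 1
         .reduceByKey(lambda x, y: x + y)   # sum the counts per word
         .map(lambda x: (x[1], x[0]))       # swap to (count, word) so the count is the key
         .sortByKey(False)                  # sort by count, descending
)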

In [11]:
wordcounts.take(10)

[(2948, u'the'),
 (2072, u'to'),
 (1770, u'of'),
 (1766, u'and'),
 (940, u'in'),
 (832, u'he'),
 (754, u'a'),
 (685, u'that'),
 (629, u'his'),
 (495, u'by')]
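To sanity-check the pipeline on a small sample without a cluster, the same steps can be reproduced in plain Python. This is a sketch of an equivalent computation, not the Spark implementation: `top_words` is a hypothetical helper, the sample lines are invented, and the order of words with equal counts may differ from what Spark returns.

```python
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent words as (count, word) pairs."""
    nonempty = [line for line in lines if len(line) > 0]       # filter
    words = [w for line in nonempty for w in line.split()]     # flatMap
    counts = Counter(words)                                    # map + reduceByKey
    return [(c, w) for w, c in counts.most_common(n)]          # swap + sort


# Invented sample text, including an empty line that the filter drops.
sample = ["the prince and the people", "", "the art of war"]
result = top_words(sample, 3)
print(result)
```

Like the Spark version, this splits on whitespace only, so punctuation and capitalisation are not normalised: `The` and `the` would be counted as different words.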