# Class 7 Notebook 2: Spark is lazy

Class 7 (6 Dec 2016) of [BS1804-1617 Fundamentals of Database Technologies](https://imperialbusiness.school/category/bs1804-1617/) by [Piotr Migdal](http://p.migdal.pl/)

See also this post: [How can I force Spark to execute code? - Stack Overflow](http://stackoverflow.com/a/31384084/907575)

In [2]:
import pyspark
sc = pyspark.SparkContext('local[*]')
# accumulator - a shared counter that tasks can add to
acc = sc.accumulator(0)

In [3]:
text = ["I like cats", "cats like me", "Cats are fun",
        "CATS are nice", "I LIKE fun"]
rdd = sc.parallelize(text)

In [4]:
# foreach is an action: it runs right away for each element in the RDD,
# without creating a new RDD
rdd.foreach(lambda x: acc.add(1))

In [5]:
# we can access its value
acc.value

5

In [6]:
# if we want to do more things than just count,
# we can wrap them in a function
def function_with_acc(x):
    acc.add(1)

In [7]:
# the same as before, but passed as a function
rdd.foreach(function_with_acc)

In [8]:
# again, we can check its value (the 5 from before plus 5 more)
acc.value

10

In [9]:
# let's create a new accumulator
acc2 = sc.accumulator(0)

In [10]:
# and create a function
# which splits a line into words AND increments the accumulator
def split_and_countspaces(x):
    acc2.add(x.count(" "))
    return x.split()

In [11]:
# we create a flatMap
words = rdd.flatMap(split_and_countspaces)
# and...

## Exercise
(a busy exercise in being lazy)

* What is the current value of `acc2`?
* Take the first 4 words - what is the current value of `acc2`?
* Do it again.
* Write `words.cache()` and take the first 15 words. Check `acc2.value`.
* Take the first 4 words (for the last time, I promise!). Check `acc2.value`.

In [12]:
acc2.value

0
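
The 0 above means that `split_and_countspaces` has not been called yet: `flatMap` only recorded what should be done. As a rough plain-Python analogy (a sketch, not how Spark is implemented), a generator behaves in a similar way. `ToyAccumulator` and `lazy_flat_map` are hypothetical helpers made up for this illustration; they are not part of PySpark.

```python
from itertools import islice

class ToyAccumulator:
    """Hypothetical stand-in for sc.accumulator(0): a counter with add() and value."""
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

text = ["I like cats", "cats like me", "Cats are fun",
        "CATS are nice", "I LIKE fun"]
toy_acc = ToyAccumulator()

def split_and_countspaces(x):
    toy_acc.add(x.count(" "))
    return x.split()

def lazy_flat_map(func, lines):
    # a generator: func is called only when somebody asks for more words
    for line in lines:
        for word in func(line):
            yield word

lazy_words = lazy_flat_map(split_and_countspaces, text)
value_after_definition = toy_acc.value    # 0, nothing has been split yet

first_four = list(islice(lazy_words, 4))  # ['I', 'like', 'cats', 'cats']
value_after_four = toy_acc.value          # 4, only the first two lines were split
```

Only asking for actual words makes the counter move, and only as far as needed to produce them.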

In [13]:
# top(4) returns the 4 largest elements (not the first 4),
# so it has to go through the whole RDD
words.top(4)

['nice', 'me', 'like', 'like']

In [14]:
acc2

Accumulator<id=1, value=10>
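
The value 10 is the total number of spaces in all five lines, which fits the fact that finding the largest elements requires looking at every word. A plain-Python sketch of the same effect, assuming `heapq.nlargest` as a stand-in for `top` and a hypothetical `ToyAccumulator` class in place of the Spark accumulator:

```python
import heapq

class ToyAccumulator:
    """Hypothetical stand-in for sc.accumulator(0): a counter with add() and value."""
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

text = ["I like cats", "cats like me", "Cats are fun",
        "CATS are nice", "I LIKE fun"]
toy_acc = ToyAccumulator()

def split_and_countspaces(x):
    toy_acc.add(x.count(" "))
    return x.split()

# a lazy stream of words: lines are split only as the stream is consumed
lazy_words = (word for line in text for word in split_and_countspaces(line))

# the largest elements can be anywhere, so the whole stream is consumed
largest_four = heapq.nlargest(4, lazy_words)  # ['nice', 'me', 'like', 'like']
value_after_top = toy_acc.value               # 10, every line was split
```

Uppercase letters sort before lowercase ones, which is why `'CATS'` and `'LIKE'` do not make it into the result.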

In [15]:
words.take(4)

['I', 'like', 'cats', 'cats']

In [16]:
acc2.value

14

In [17]:
# cache() only marks the RDD to be kept in memory; nothing is computed yet
words.cache()

PythonRDD[5] at RDD at PythonRDD.scala:43

In [18]:
words.take(4)

['I', 'like', 'cats', 'cats']

In [19]:
acc2.value

18

In [20]:
words.take(4)

['I', 'like', 'cats', 'cats']

In [21]:
acc2.value

18
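
The last cells are consistent with this picture: the first `take(4)` after `cache()` still computes the part of the data it needs (and stores it), while the repeated `take(4)` is served from the stored copy, so `acc2` stays at 18. Below is a toy model in plain Python, a sketch under the assumption that every line is its own partition (the real partitioning under `local[*]` depends on the machine). `ToyAccumulator` and `CachedLines` are hypothetical names, not PySpark classes.

```python
class ToyAccumulator:
    """Hypothetical stand-in for sc.accumulator(0): a counter with add() and value."""
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

text = ["I like cats", "cats like me", "Cats are fun",
        "CATS are nice", "I LIKE fun"]
toy_acc = ToyAccumulator()

def split_and_countspaces(x):
    toy_acc.add(x.count(" "))
    return x.split()

class CachedLines:
    """Toy model of a cached RDD: each line is split at most once."""
    def __init__(self, func, lines):
        self.func = func
        self.lines = lines
        self.stored = {}  # line index -> list of words computed so far

    def take(self, n):
        result = []
        for i, line in enumerate(self.lines):
            if i not in self.stored:
                # computed (and counted) only the first time it is needed
                self.stored[i] = self.func(line)
            result.extend(self.stored[i])
            if len(result) >= n:
                break
        return result[:n]

cached_words = CachedLines(split_and_countspaces, text)

first = cached_words.take(4)        # splits the first two lines
value_after_first = toy_acc.value   # 4

second = cached_words.take(4)       # served from the stored words
value_after_second = toy_acc.value  # still 4

all_words = cached_words.take(15)   # splits the remaining three lines
value_after_all = toy_acc.value     # 10
```

In this model, asking for all 15 words after the cached `take(4)` adds only the spaces of the lines that had not been split before.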