In [29]:
import pyspark
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession


In [1]:
sc = pyspark.SparkContext("local[*]")

Let's prove it works:

In [3]:
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)


[89, 610, 669, 84, 71]

### read a text file into a Spark RDD or DataFrame

In [8]:
textRDD01 = sc.textFile("ReadMe.txt")

ReadMe.txt MapPartitionsRDD[7] at textFile at NativeMethodAccessorImpl.java:0


In [11]:
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [16]:
# get the SparkContext that belongs to the SparkSession
sc2 = spark.sparkContext

# this way of reading a text file returns an RDD of strings, one element per line
textRDD02 = sc2.textFile("ReadMe.txt")

# check the type
type(textRDD02)

pyspark.rdd.RDD

In [17]:

# spark.read.text returns a Spark DataFrame whose schema starts with a string column named "value",
# followed by partition columns if there are any; .rdd then converts it to an RDD of Row objects
textRDD03 = spark.read.text("ReadMe.txt").rdd

# check the type
type(textRDD03)

pyspark.rdd.RDD

In [18]:
textRDD02.count()

572

In [19]:
textRDD03.count()

572

In [24]:
textRDD02.takeSample(False, 10)

['    X-ray binaries and 319 related objects with known or suspected orbital',
 '  Ritter H., Kolb U., 1995, in "X-ray Binaries", Lewin W.H.G,',
 '                                    at mideclipse.',
 '    WZ = WZ Sge star = SU UMa star with an extremely long supercycle',
 '--------------------------------------------------------------------------------',
 '     RS = system shows RS CVn-like chromospheric activity',
 '    NL    (UX,AC)         in normal state',
 ' 109-116  F8.6  d         Orb.Per ? Orbital period, in case of object',
 '     155  I1    ---       SB      [1,2]? Flag specifying the type of',
 ' 158-164  A7    ---       SpType2 Spectral type of the secondary (G6)']

In [32]:
sc.stop()

## initializing Spark

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

For the *setMaster* parameter see:
https://spark.apache.org/docs/2.2.0/submitting-applications.html#master-urls

`local[*]` means run locally with as many worker threads as there are logical cores on the machine.
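
The local master URL forms can be read as thread counts. Below is a minimal sketch in plain Python (no Spark needed). `local_worker_threads` is a hypothetical helper, not part of PySpark. It assumes that `local` means one worker thread, `local[K]` means K threads, and `local[*]` means one thread per logical core, and it deliberately ignores every other master URL form.

```python
import os
import re

def local_worker_threads(master):
    """Hypothetical helper: worker threads implied by a local master URL."""
    if master == "local":
        return 1  # a single worker thread, no parallelism
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError("not a plain local master URL: " + master)
    k = match.group(1)
    # "*" stands for the number of logical cores on this machine
    return (os.cpu_count() or 1) if k == "*" else int(k)

print(local_worker_threads("local"))     # 1
print(local_worker_threads("local[4]"))  # 4
print(local_worker_threads("local[*]"))  # depends on the machine
```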

In [34]:
conf = SparkConf().setAppName("appNameHere").setMaster("local[*]")

# get the context
sc = SparkContext(conf = conf)

### basics of execution in Spark

In [37]:
lines = sc.textFile("ReadMe.txt")

# map is a transformation, not an action, so the code is not executed yet
lineLengths = lines.map(lambda s: len(s))

# if we want to reuse lineLengths later without recomputing it, we need to call:
lineLengths.persist()
# this should be done before the first action takes place

# reduce is an action performed on lineLengths; this is when the code actually gets executed
# the computation runs on multiple threads, and possibly on multiple machines
totalLengths = lineLengths.reduce(lambda a, b: a + b)
print(totalLengths)



30868
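
Why persisting matters can be illustrated without Spark. The sketch below is only a plain-Python analogy: the built-in lazy `map` stands in for a transformation, `reduce` and `max` stand in for actions, and the `calls` list (an illustrative name) records how often the mapped function really runs. One difference to keep in mind: Spark's `persist()` only marks the RDD for caching and fills the cache on the first action, while the `list(...)` used here materializes immediately.

```python
from functools import reduce

lines = ["spark", "is", "lazy"]
calls = []  # records every time the mapped function actually runs

def length(s):
    calls.append(s)
    return len(s)

def line_lengths():
    # stand-in for an unpersisted RDD: a recipe that is re-run for every action
    return map(length, lines)

pipeline = line_lengths()  # building the pipeline computes nothing
print(len(calls))          # 0

total = reduce(lambda a, b: a + b, line_lengths())  # first "action"
longest = max(line_lengths())                       # second "action" recomputes everything
print(total, longest, len(calls))                   # 11 5 6

# stand-in for persist(): compute once, then reuse the stored result
calls.clear()
cached = list(line_lengths())
total = reduce(lambda a, b: a + b, cached)
longest = max(cached)
print(total, longest, len(calls))                   # 11 5 3
```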


### passing functions to Spark

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are three recommended ways to do this:

* Lambda expressions, for simple functions that can be written as an expression. (Lambdas do not support multi-statement functions or statements that do not return a value.)
* Local defs inside the function calling into Spark, for longer code.
* Top-level functions in a module.

For example, to pass a longer function than can be supported using a lambda, consider the code below:

In [38]:
def a_function(p):
    words = p.split(" ")
    return len(words)

fileRDD = sc.textFile("ReadMe.txt")
wordsNumber = fileRDD.map(a_function)

totalWords = wordsNumber.reduce(lambda a,b: a + b)

print(totalWords)

9629


It is possible to pass a method of a class instance to Spark, but because the method references *self*, the whole object has to be serialized and sent to the cluster along with it.

In [39]:
class a_class(object):
    def a_func(self, s):
        return s
    def do_some_stuff(self, rdd):
        return rdd.map(self.a_func)
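
A common way to avoid shipping the whole object is to copy the needed field into a local variable and pass a function that only refers to that copy. The sketch below shows the difference in plain Python, without Spark. `Tagger`, `prefix` and `big_lookup` are hypothetical names used only for illustration.

```python
class Tagger:
    def __init__(self, prefix):
        self.prefix = prefix
        self.big_lookup = list(range(100_000))  # large state the mapper does not need

    def tag(self, s):
        return self.prefix + s

    def mapper_via_method(self):
        # a bound method keeps a reference to the whole instance
        return self.tag

    def mapper_via_local_copy(self):
        prefix = self.prefix  # copy only the field that is needed
        return lambda s: prefix + s

t = Tagger("> ")
bound = t.mapper_via_method()
local = t.mapper_via_local_copy()

print(bound.__self__ is t)                           # True
print([c.cell_contents for c in local.__closure__])  # ['> ']
print(bound("line") == local("line"))                # True
```

Both functions compute the same result, but only the second one can be serialized without dragging `big_lookup` along.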
    

### understanding closures

The scope of variables when using Spark can be a source of confusion, both because of *laziness* and because the code is executed in a distributed, parallel environment.

The following is an example of wrongly incrementing a counter from inside a Spark operation.

In [41]:
counter = 0
fileRDD = sc.textFile("ReadMe.txt")

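# NO!! each executor works on its own copy of counter,
# so the counter in the driver is never updated
def increment_counter(x):
    global counter
    counter += len(x)  # x is a line of text, so add its length

fileRDD.foreach(increment_counter)

print(counter)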

0
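
The reason the driver still sees 0 can be simulated in plain Python. This is only an analogy, assuming that each executor receives its own serialized copy of the variables a function refers to; `run_task` and `driver_vars` are hypothetical names, and `copy.deepcopy` stands in for serialization.

```python
import copy

def run_task(variables, partition):
    """Simulate one executor: it works on its own copy of the driver's variables."""
    env = copy.deepcopy(variables)  # stand-in for shipping a serialized closure
    for x in partition:
        env["counter"] += x
    return env["counter"]

driver_vars = {"counter": 0}
partitions = [[1, 2, 3], [4, 5]]

executor_results = [run_task(driver_vars, p) for p in partitions]
print(executor_results)        # [6, 9]
print(driver_vars["counter"])  # 0
```

Each simulated executor updates its private copy, and nothing flows back to the driver unless it is returned explicitly.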


That's why we have accumulators.

In [48]:
# a correct way to print a few lines from the text file
for line in fileRDD.take(5):
    print(line)

B/cb        Cataclysmic Binaries, LMXBs, and related objects   (Ritter+, 2011)
Catalogue of cataclysmic binaries, low-mass X-ray binaries
and related objects (7th Edition, rev. 7.14, September 2010)
     Ritter H., Kolb U.


In [49]:
lines = sc.textFile("ReadMe.txt")
pairs = lines.map(lambda s: (s, 1))
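counts = pairs.reduceByKey(lambda a, b: a + b)
# reduceByKey is a transformation, so this prints the RDD description, not the counts;
# an action such as counts.collect() would return them
print(counts)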

PythonRDD[32] at RDD at PythonRDD.scala:53
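
What `reduceByKey` computes once an action runs can be sketched in plain Python. `reduce_by_key` below is a hypothetical single-machine stand-in, not the Spark implementation: it merges all values that share a key using the supplied function.

```python
def reduce_by_key(pairs, func):
    """Merge the values of equal keys with func (single-machine sketch)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

sample_lines = ["a", "b", "a", "", "a"]  # made-up data standing in for the file's lines
sample_pairs = [(s, 1) for s in sample_lines]
sample_counts = reduce_by_key(sample_pairs, lambda a, b: a + b)
print(sample_counts)  # {'a': 3, 'b': 1, '': 1}
```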


### understand accumulators

In [28]:
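# this works without an accumulator because collect() brings all lines back to the driver,
# so the loop runs locally; for a large dataset this could exhaust the driver's memory
lines = 0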
for line in textRDD02.collect():
    lines += 1

print(lines)

572


An example of using an accumulator:

In [52]:
accum = sc.accumulator(0)
accum

Accumulator<id=0, value=0>

In [53]:
rdd = sc.parallelize([1,2,3,4,5,6,7,8])

In [54]:
rdd.foreach(lambda x: accum.add(x))

In [56]:
print(accum.value)

36


Another example; this is how not to use accumulators. Because map is lazy, the accumulator is not updated until an action forces the computation:

In [57]:
accum = sc.accumulator(0)
def g(x):
    accum.add(x)
    return x + 1
rdd = sc.parallelize([1,2,3,4,5,6,7,8])
rdd.map(g)
print(accum)

0
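
The accumulator stays at 0 because `map` only records the transformation, and `g` never runs until an action is called on the mapped RDD. The same effect can be reproduced in plain Python, where the built-in `map` is also lazy. `TinyAccumulator` is a hypothetical stand-in for a Spark accumulator, and consuming the iterator with `list` plays the role of the action.

```python
class TinyAccumulator:
    """Hypothetical stand-in for a Spark accumulator."""
    def __init__(self, value=0):
        self.value = value

    def add(self, x):
        self.value += x

tiny = TinyAccumulator()

def add_and_shift(x):
    tiny.add(x)
    return x + 1

mapped = map(add_and_shift, [1, 2, 3, 4, 5, 6, 7, 8])  # lazy: nothing has run yet
print(tiny.value)  # 0

shifted = list(mapped)  # consuming the iterator plays the role of an action
print(tiny.value)  # 36
```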
