In [1]:
sc

# Resilient Distributed Dataset (RDD)

A fault-tolerant collection of elements that can be operated on in parallel.

## Parallelized collections

Parallelized collections are created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

In [2]:
data = [1,2,3,4,5,6]
distData = sc.parallelize(data)
type(distData)

pyspark.rdd.RDD

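Here the data is distributed across two partitions, matching the two cores of the local machine; glom() returns the elements of each partition as a list.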

In [3]:
distData.glom().collect()

[[1, 2, 3], [4, 5, 6]]
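
The way the six elements end up in two partitions can be modelled in plain Python. This is a minimal sketch of contiguous, near-equal slicing, not Spark's actual implementation; the helper name slice_into_partitions is hypothetical and exists only for this illustration.

```python
def slice_into_partitions(items, num_slices):
    """Split a list into num_slices contiguous, near-equal chunks.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]


data = [1, 2, 3, 4, 5, 6]
partitions = slice_into_partitions(data, 2)
print(partitions)  # [[1, 2, 3], [4, 5, 6]]
```

In PySpark itself, the number of partitions can be requested explicitly through the second argument of parallelize, for example sc.parallelize(data, 2).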

## External datasets

Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine or an hdfs://, s3a://, etc. URI) and reads it as a collection of lines.

In [4]:
textFile = sc.textFile('biblia.txt')
type(textFile)

pyspark.rdd.RDD

# Saving and Loading SequenceFiles

In [5]:
rdd1 = sc.parallelize(range(1,10)).map(lambda x: (x, 'a'*x))
rdd1.collect()

[(1, 'a'),
 (2, 'aa'),
 (3, 'aaa'),
 (4, 'aaaa'),
 (5, 'aaaaa'),
 (6, 'aaaaaa'),
 (7, 'aaaaaaa'),
 (8, 'aaaaaaaa'),
 (9, 'aaaaaaaaa')]

In [None]:
rdd1.saveAsSequenceFile('file')

In [6]:
sorted(sc.sequenceFile('file').collect())

[(1, 'a'),
 (2, 'aa'),
 (3, 'aaa'),
 (4, 'aaaa'),
 (5, 'aaaaa'),
 (6, 'aaaaaa'),
 (7, 'aaaaaaa'),
 (8, 'aaaaaaaa'),
 (9, 'aaaaaaaaa')]

# Operations

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.

In [16]:
lines = sc.textFile('imortality.txt')
lineLengths = lines.map(lambda s: len(s.split(' ')))
totalLength = lineLengths.reduce(lambda a,b: a+b)

In [17]:
lineLengths.take(10)

[8, 1, 111, 1, 62, 1, 53, 1, 1, 87]

Total number of words in the document:

In [18]:
totalLength

2378
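
The split between transformations and actions can be imitated with Python built-ins, since the built-in map is also lazy. The sketch below is only an analogy under that assumption, not Spark code: the sample lines and the count_words helper are made up for illustration. It also shows that an empty line contributes 1 to the total, because ''.split(' ') returns [''].

```python
from functools import reduce

lines = ["to be or not to be", "", "that is the question"]

calls = []  # records every line the mapping function actually processes


def count_words(line):
    """Count space-separated tokens, as the lambda in the cell above does."""
    calls.append(line)
    return len(line.split(' '))


# "Transformation": builds a lazy iterator; nothing has been computed yet.
line_lengths = map(count_words, lines)
calls_before_action = len(calls)

# "Action": consuming the iterator triggers the computation.
total_length = reduce(lambda a, b: a + b, line_lengths)

print(calls_before_action)  # 0
print(total_length)         # 11 (6 + 1 + 4; the empty line counts as 1)
```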

If we also wanted to use lineLengths again later, we could add:

In [19]:
lineLengths.persist()

PythonRDD[24] at RDD at PythonRDD.scala:53

before the reduce, which would cause lineLengths to be saved in memory after the first time it is computed.
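
The effect of persist can be pictured with a toy stand-in for an RDD. The class LazyDataset below is hypothetical and far simpler than Spark's machinery; it only illustrates that, without persisting, every action recomputes the data, while a persisted dataset is computed once and then reused.

```python
import functools


class LazyDataset:
    """Hypothetical toy model of an RDD; not a Spark class."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the elements
        self._persist = False
        self._cached = None
        self.evaluations = 0      # how many times compute() really ran

    def persist(self):
        self._persist = True      # only sets a flag; nothing is computed yet
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = list(self._compute())
        if self._persist:
            self._cached = result
        return result

    def reduce(self, f):
        return functools.reduce(f, self._materialize())

    def take(self, n):
        return self._materialize()[:n]


lines = ["a b c", "d e", "f"]


def make():
    return (len(s.split(' ')) for s in lines)


plain = LazyDataset(make)
plain.reduce(lambda a, b: a + b)
plain.take(2)
print(plain.evaluations)  # 2: each action recomputed the data

kept = LazyDataset(make).persist()
total = kept.reduce(lambda a, b: a + b)
first_two = kept.take(2)
print(kept.evaluations)   # 1: the second action reused the stored result
```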

# Basics

Number of lines in the RDD:

In [20]:
textFile.count()

32369

First line in the RDD:

In [21]:
textFile.first()

'BÍBLIA SAGRADA'

Select the lines of the file that contain the word "imortalidade":

In [22]:
specificLines = textFile.filter(lambda line: 'imortalidade' in line)

In [23]:
specificLines.collect()

['53 Porque é necessário que isto que é corruptível se revista da incorruptibilidade e que isto que é mortal se revista da imortalidade.',
 '54 Mas, quando isto que é corruptível se revestir da incorruptibilidade, e isto que é mortal se revestir da imortalidade, então se cumprirá a palavra que está escrito: Tragada foi a morte na vitória.',
 '16 aquele que possui, ele só, a imortalidade, e habita em luz inacessível; a quem nenhum dos homens tem visto nem pode ver; ao qual seja honra e poder sempiterno. Amém.',
 '10 e que agora se manifestou pelo aparecimento de nosso Salvador Cristo Jesus, o qual destruiu a morte, e trouxe à luz a vida e a imortalidade pelo evangelho,']

How many lines contain 'imortalidade'?

In [24]:
specificLines.count()

4

# More on RDD Operations

In [25]:
from pyspark.sql import functions as F  # namespaced import avoids shadowing built-ins such as sum and max

Count the occurrences of each word in the document:

In [26]:
counts = textFile.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

Sort by most frequent words:

In [27]:
counts.sortBy(lambda x: x[1], ascending=False).collect()

[('e', 33027),
 ('o', 25358),
 ('de', 24848),
 ('a', 24671),
 ('que', 20458),
 ('os', 13441),
 ('do', 10179),
 ('para', 8421),
 ('não', 6991),
 ('da', 6692),
 ('se', 6362),
 ('em', 6027),
 ('as', 5649),
 ('com', 4876),
 ('dos', 4696),
 ('ao', 4141),
 ('Senhor', 4006),
 ('E', 3984),
 ('por', 3812),
 ('seu', 3699),
 ('no', 3676),
 ('como', 3635),
 ('um', 3630),
 ('é', 3499),
 ('sua', 3410),
 ('seus', 3099),
 ('na', 3008),
 ('sobre', 2726),
 ('ele', 2704),
 ('filhos', 2389),
 ('eu', 2296),
 ('à', 2226),
 ('te', 2186),
 ('Deus', 2179),
 ('todos', 2123),
 ('me', 2093),
 ('teu', 2027),
 ('porque', 2020),
 ('Senhor,', 2018),
 ('meu', 2013),
 ('mas', 1949),
 ('aos', 1912),
 ('Então', 1889),
 ('uma', 1876),
 ('rei', 1763),
 ('nem', 1722),
 ('filho', 1693),
 ('contra', 1664),
 ('terra', 1644),
 ('suas', 1622),
 ('também', 1613),
 ('vos', 1608),
 ('tua', 1602),
 ('das', 1556),
 ('lhe', 1553),
 ('pois,', 1542),
 ('até', 1441),
 ('minha', 1433),
 ('casa', 1430),
 ('todo', 1362),
 ('', 1334),
 ('nos
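
The counts above treat 'Senhor' and 'Senhor,' as different words, distinguish 'e' from 'E', and include the empty string, because the text is split on single spaces only. One possible way to normalize the tokens is sketched below in plain Python; the tokenize helper and the sample lines are assumptions made for this illustration and are not part of the original notebook.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and return its words without punctuation or empty tokens."""
    # \w is Unicode-aware for str patterns, so accented words such as 'não' stay whole.
    return re.findall(r"\w+", line.lower())


sample = ["E disse o Senhor: Senhor, Senhor", "", "e  o rei"]

# Same approach as the notebook cell: split on a single space.
naive = Counter(word for line in sample for word in line.split(" "))

# Normalized approach: case-folded, punctuation-free tokens.
clean = Counter(word for line in sample for word in tokenize(line))

print(naive[''], naive['Senhor'], naive['Senhor,'])  # 2 1 1
print(clean.most_common(1))                          # [('senhor', 3)]
```

In the Spark pipeline, the same function could be passed to flatMap in place of the split lambda, for example textFile.flatMap(tokenize).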

# Caching

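Spark also supports pulling datasets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset. Calling cache() is equivalent to calling persist() with the default storage level; the data is actually cached the first time an action, such as count() below, computes it.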

In [28]:
textFile.cache()

biblia.txt MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

In [29]:
textFile.count()

32369