# Big Data Fundamentals with PySpark - Part 2

## Programming in PySpark RDDs
The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data type of the engine. This chapter introduces RDDs and shows how to create them and work with them using RDD transformations and actions.

In [None]:
BUCKET = 'driven-actor-210609'

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext



### RDDs from Parallelized collections
A Resilient Distributed Dataset (RDD) is an immutable, distributed collection of objects. Because the RDD is Spark's fundamental data type, it is important to understand how to create one. In this exercise, you'll create your first RDD in PySpark from a collection of words.

Remember, you already have a SparkContext sc available in your workspace.

In [3]:
# Create an RDD from a list of words
text_array = ["Spark", "is", "a", "framework", "for", "Big Data processing"]
text_rdd = sc.parallelize(text_array)

# Print out the type of the created object
print("The type of RDD is", type(text_rdd))

The type of RDD is <class 'pyspark.rdd.RDD'>


### Partitions in your data
SparkContext's textFile() method takes an optional second argument called minPartitions for specifying the minimum number of partitions; parallelize() accepts a similar optional argument, numSlices. In this exercise, you'll create an RDD named text_rdd_part with 5 partitions and compare it with the text_rdd that you created in the previous exercise. The getNumPartitions() method returns the number of partitions in an RDD.

Remember, you already have a SparkContext sc, text_array and text_rdd available in your workspace.

In [5]:
# Check the number of partitions in text_rdd
print("Number of partitions in text_rdd is", text_rdd.getNumPartitions())

# Create text_rdd_part from text_array with 5 partitions
text_rdd_part = sc.parallelize(text_array, numSlices=5)

# Check the number of partitions in text_rdd_part
print("Number of partitions in text_rdd_part is", text_rdd_part.getNumPartitions())

Number of partitions in text_rdd is 2
Number of partitions in text_rdd_part is 5
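
To make the idea of a partition more concrete, here is a plain-Python sketch (no Spark involved) that cuts a list into a requested number of contiguous slices. The helper `split_into_slices` is hypothetical and only illustrates an even split; the way Spark actually assigns elements to partitions may differ.

```python
def split_into_slices(items, num_slices):
    """Cut items into num_slices contiguous, nearly equal slices (illustrative only)."""
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


text_array = ["Spark", "is", "a", "framework", "for", "Big Data processing"]
slices = split_into_slices(text_array, 5)

# Show which elements ended up in which slice
for index, part in enumerate(slices):
    print(index, part)
```

In PySpark itself, `text_rdd_part.glom().collect()` returns the real contents of each partition as a list of lists, which is a handy way to check how the data was distributed.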


### RDD Transformations
- map() transformation applies a function to all elements in the RDD
- filter() transformation returns a new RDD with only the elements that pass the condition
- flatMap() transformation returns multiple values for each element in the original RDD
- union() transformation returns the union of one RDD with another RDD

### RDD Actions
- collect() returns all the elements of the dataset as a list
- take(N) returns a list with the first N elements of the dataset
- first() returns the first element of the RDD
- count() returns the number of elements in the RDD
- reduce() is used for aggregating the elements
- saveAsTextFile() saves an RDD as text, with each partition written to a separate file
- coalesce() is a transformation rather than an action, but it can be combined with saveAsTextFile() to save an RDD as a single text file: rdd.coalesce(1).saveAsTextFile("tempFile")

### Pair RDDs Transformations
- reduceByKey(func): combine values with the same key
- groupByKey(): group values with the same key
- sortByKey(): return an RDD sorted by the key
- join(): join two pair RDDs based on their key

### Pair RDDs Actions
- countByKey() counts the number of elements for each key
- collectAsMap() returns the key-value pairs in the RDD as a dictionary

### Map and Collect
The main method by which you can manipulate data in PySpark is map(). The map() transformation takes in a function and applies it to each element in the RDD. It can be used to do any number of things, from fetching the website associated with each URL in a collection to just squaring numbers. In this simple exercise, you'll create an RDD of numbers named numb_rdd and use the map() transformation to cube each number in it. Next, you'll return all the elements to a variable and finally print the output.

Remember, you already have a SparkContext sc available in your workspace.

In [11]:
numb = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

numb_rdd = sc.parallelize(numb)

print(type(numb_rdd))

<class 'pyspark.rdd.RDD'>


In [12]:
# Create map() transformation to cube numbers
cubed_rdd = numb_rdd.map(lambda x: x ** 3)

print(type(cubed_rdd))

<class 'pyspark.rdd.PipelinedRDD'>


In [13]:
# Collect the results
numbers_all = cubed_rdd.collect()

print(type(numbers_all))

<class 'list'>


In [15]:
# Print the numbers from numbers_all
# (the loop variable is named n so it does not overwrite the numb list)
for n in numbers_all:
    print(n)

1
8
27
64
125
216
343
512
729
1000
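
The types printed above hint at how this works: map() hands back a new RDD object immediately, and the cubes are only computed when an action such as collect() is called. A rough plain-Python analogy, using the built-in lazy `map()` instead of Spark (the `calls` list is just an illustrative way to record when the function runs):

```python
numb = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
calls = []


def cube(x):
    calls.append(x)  # record that the function actually ran
    return x ** 3


lazy_cubes = map(cube, numb)      # nothing has been computed yet
calls_before = len(calls)

numbers_all = list(lazy_cubes)    # forces evaluation, similar in spirit to collect()
calls_after = len(calls)

print(calls_before, calls_after)
print(numbers_all)
```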


### Filter and Count
The RDD transformation filter() returns a new RDD containing only the elements for which a given function returns True. It is useful for filtering large datasets based on a keyword. For this exercise, you'll keep only the elements of text_rdd that contain the letter 'a'. Next, you'll count the matching elements and finally print the first 4 of them.

Remember, you already have a SparkContext sc and text_rdd available in your workspace.

In [18]:
# Filter text_rdd to select the elements containing the letter 'a'
text_rdd_filter = text_rdd.filter(lambda line: 'a' in line)

print(type(text_rdd_filter))

<class 'pyspark.rdd.PipelinedRDD'>


In [19]:
# How many elements of text_rdd contain the letter 'a'?
print("The total number of elements containing 'a' is", text_rdd_filter.count())

The total number of elements containing 'a' is 4


In [20]:
# Print the first four elements of text_rdd_filter
for line in text_rdd_filter.take(4):
    print(line)

Spark
a
framework
Big Data processing


### map vs flatMap

In [21]:
text_rdd.map(lambda x: x.split(" ")).collect()

[['Spark'],
 ['is'],
 ['a'],
 ['framework'],
 ['for'],
 ['Big', 'Data', 'processing']]

In [22]:
text_rdd.flatMap(lambda x: x.split(" ")).collect()

['Spark', 'is', 'a', 'framework', 'for', 'Big', 'Data', 'processing']
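
The difference between the two outputs can be reproduced without Spark. The sketch below is a plain-Python analogy: a list comprehension plays the role of map(), and `itertools.chain.from_iterable` flattens the per-element lists the way flatMap() does.

```python
from itertools import chain

text_array = ["Spark", "is", "a", "framework", "for", "Big Data processing"]

# map-like: exactly one output (a list of words) per input element
mapped = [x.split(" ") for x in text_array]

# flatMap-like: the per-element lists are flattened into one sequence
flat_mapped = list(chain.from_iterable(x.split(" ") for x in text_array))

print(mapped)
print(flat_mapped)
```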

### reduceByKey and Collect
One of the most popular pair RDD transformations is reduceByKey(), which operates on key-value (k, v) pairs and merges the values for each key. In this exercise, you'll first create a pair RDD from a list of tuples, then combine the values with the same key and finally print out the result.

Remember, you already have a SparkContext sc available in your workspace.

In [23]:
# Create PairRDD Rdd with key value pairs
Rdd = sc.parallelize([(1,2),(3,4),(3,6),(4,5),(3,8)])

# Apply reduceByKey() operation on Rdd
Rdd_Reduced = Rdd.reduceByKey(lambda x, y: x + y)

# Iterate over the result and print the output
for num in Rdd_Reduced.collect():
  print("Key {} has {} Counts".format(num[0], num[1]))



Key 4 has 5 Counts
Key 1 has 2 Counts
Key 3 has 18 Counts


### Inside reduceByKey

In [24]:
def f(x, y):
    z = x + y
    print(f"{x} + {y} = {z}")
    return z

In [25]:
Rdd.reduceByKey(f).collect()

[(4, 5), (1, 2), (3, 18)]
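
The notebook output above shows only the final result and none of the trace lines from f, most likely because f runs in Spark's worker processes rather than in the notebook itself. The plain-Python sketch below mimics what reduceByKey() does for each key and records the calls in a list (the `trace` list and the grouping step are illustrative assumptions, not Spark internals):

```python
from collections import defaultdict
from functools import reduce

pairs = [(1, 2), (3, 4), (3, 6), (4, 5), (3, 8)]
trace = []


def f(x, y):
    z = x + y
    trace.append(f"{x} + {y} = {z}")  # remember each merge step
    return z


# Gather the values that share a key
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Fold the values of each key with f; keys with a single value never call f
reduced = {key: reduce(f, values) for key, values in grouped.items()}

print(reduced)
print(trace)
```

In Spark the order in which values are merged depends on how the data is partitioned, which is why the function passed to reduceByKey() should be associative and commutative.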

### SortByKey and Collect
It is often useful to sort a pair RDD by key (for example, in the word count that you'll see later in the chapter). In this exercise, you'll sort the pair RDD Rdd_Reduced that you created in the previous exercise into descending order and print the final output.

Remember, you already have a SparkContext sc and Rdd_Reduced available in your workspace.

In [26]:
# Sort the reduced RDD with the key by descending order
Rdd_Reduced_Sort = Rdd_Reduced.sortByKey(ascending=False)

# Iterate over the result and retrieve all the elements of the RDD
for num in Rdd_Reduced_Sort.collect():
  print("Key {} has {} Counts".format(num[0], num[1]))

Key 4 has 5 Counts
Key 3 has 18 Counts
Key 1 has 2 Counts


### Counting by Keys
For many datasets, it is important to count how many elements there are for each key in a key/value dataset. For example, you might count the number of sales per country or find the most popular baby names. In this simple exercise, you'll use the Rdd that you created earlier and count the number of elements for each key in that pair RDD.

Remember, you already have a SparkContext sc and Rdd available in your workspace.

In [28]:
# Count the number of elements for each key
total = Rdd.countByKey()

# What is the type of total?
print("The type of total is", type(total))

The type of total is <class 'collections.defaultdict'>


In [29]:
# Iterate over the total and print the output
for k, v in total.items():
  print("key", k, "has", v, "counts")

key 1 has 1 counts
key 3 has 3 counts
key 4 has 1 counts
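
Note that countByKey() ignores the values and simply counts how many pairs share each key. A plain-Python equivalent using `collections.Counter`, shown here only as an analogy on the same list of tuples:

```python
from collections import Counter

pairs = [(1, 2), (3, 4), (3, 6), (4, 5), (3, 8)]

# Count occurrences of each key; the values are not used
total = Counter(key for key, _ in pairs)

for k, v in sorted(total.items()):
    print("key", k, "has", v, "counts")
```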


### Create a base RDD and transform it
The volume of unstructured data (log lines, images, binary files) in existence is growing dramatically, and PySpark is an excellent framework for analyzing this type of data through RDDs. In this three-part exercise, you will write code that calculates the most common words in the Complete Works of William Shakespeare.

Here are the brief steps for writing the word counting program:

1. Create a base RDD from the Complete_Shakespeare.txt file.
2. Use an RDD transformation to create a long list of words from each element of the base RDD.
3. Remove stop words from your data.
4. Create a pair RDD where each element is a pair tuple of (w, 1).
5. Group the elements of the pair RDD by key (word) and add up their values.
6. Swap the keys (words) and values (counts) so that the key is the count and the value is the word.
7. Finally, sort the RDD in descending order and print the 10 most frequent words and their frequencies.

In this first exercise, you'll create a base RDD from the Complete_Shakespeare.txt file and transform it to create a long list of words.
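
Before running the pipeline on the full text, it can help to see all of the steps on a tiny input. The sketch below is plain Python, not Spark; the three sample lines and the stop word set are made up for illustration.

```python
from collections import Counter

# Made-up sample data standing in for the text file and the stop word list
lines = ["spark makes big data simple", "big data needs spark", "spark is fast"]
stop_words = {"is", "makes", "needs"}

# Split every line into words (the flatMap step)
words = [w for line in lines for w in line.split()]

# Drop stop words (the filter step)
kept = [w for w in words if w.lower() not in stop_words]

# Build (word, 1) pairs and add up the ones per word (map + reduceByKey)
counts = Counter()
for word, one in [(w, 1) for w in kept]:
    counts[word] += one

# Swap to (count, word) and sort by count, largest first
swapped = [(n, w) for w, n in counts.items()]
top = sorted(swapped, key=lambda pair: pair[0], reverse=True)[:3]

print(top)
```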

Remember, you already have a SparkContext sc available in your workspace. The file_path variable, which holds the path to the Shakespeare text file, is defined in the cell below.

In [34]:
file_path = f'gs://{BUCKET}/pyspark/datasets/shakespeare.txt'
# file_path = '../datasets/books/shakespeare.txt'

# Create a baseRDD from the file path
baseRDD = sc.textFile(file_path)

baseRDD.take(10)

['The Project Gutenberg EBook of The Complete Works of William Shakespeare, by',
 'William Shakespeare',
 '',
 'This eBook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever.  You may copy it, give it away or',
 're-use it under the terms of the Project Gutenberg License included',
 'with this eBook or online at www.gutenberg.org',
 '',
 '** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **',
 '**     Please follow the copyright guidelines in this file.     **']

In [35]:
# Split the lines of baseRDD into words
splitRDD = baseRDD.flatMap(lambda x: x.split())

splitRDD.take(10)

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Complete',
 'Works',
 'of',
 'William']

In [37]:
# Count the total number of words
print("Total number of words in splitRDD:", splitRDD.count())

Total number of words in splitRDD: 904061


### Remove stop words and reduce the dataset
After splitting the lines in the file into a long list of words in the previous exercise, the next step is to remove stop words from your data. Stop words are common words that are often uninteresting, such as "I", "the" and "a". You could remove many obvious stop words with a list of your own, but for this exercise you will remove the stop words in a curated list, stop_words, loaded from a text file.

After removing stop words, you'll create a pair RDD where each element is a pair tuple (k, v), where k is the key and v is the value. In this example, the pair RDD is composed of (w, 1), where w is a word in the RDD and 1 is a number. Finally, you'll combine the values with the same key from the pair RDD.

Remember you already have a SparkContext sc and splitRDD available in your workspace.

In [None]:
# file_path = '../datasets/stop_words.txt'

# with open(file_path) as f:
    # stop_words = f.read().splitlines()

In [44]:
file_path = f'gs://{BUCKET}/pyspark/datasets/stop_words.txt'

stop_words = sc.textFile(file_path).collect()

In [45]:
# Preview stop words
print(stop_words[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']


In [46]:
# Lowercase each word and drop it if it appears in the curated stop_words list
splitRDD_no_stop = splitRDD.filter(lambda x: x.lower() not in stop_words)

splitRDD_no_stop.take(10)

['Project',
 'Gutenberg',
 'EBook',
 'Complete',
 'Works',
 'William',
 'Shakespeare,',
 'William',
 'Shakespeare',
 'eBook']
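
One possible refinement, offered as a suggestion rather than as part of the original exercise: stop_words is a Python list, so each `not in` check scans the list. Converting it to a set gives the same result with faster average lookups, which may matter when filtering hundreds of thousands of words. A plain-Python sketch with a made-up `words` sample:

```python
# First ten stop words from the preview above
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

# A set supports fast membership tests; the list would work but is scanned each time
stop_word_set = set(stop_words)

# Hypothetical sample of words to filter
words = ["My", "Project", "Gutenberg", "you", "Works"]

kept = [w for w in words if w.lower() not in stop_word_set]
print(kept)
```

The same set could be used inside the filter() lambda in place of the list.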

In [47]:
# Create a tuple of the word and 1
splitRDD_no_stop_words = splitRDD_no_stop.map(lambda w: (w, 1))

splitRDD_no_stop_words.take(10)

[('Project', 1),
 ('Gutenberg', 1),
 ('EBook', 1),
 ('Complete', 1),
 ('Works', 1),
 ('William', 1),
 ('Shakespeare,', 1),
 ('William', 1),
 ('Shakespeare', 1),
 ('eBook', 1)]

In [48]:
# Count the number of occurrences of each word
resultRDD = splitRDD_no_stop_words.reduceByKey(lambda x, y: x + y)

resultRDD.take(10)

[('Gutenberg', 26),
 ('EBook', 2),
 ('Complete', 4),
 ('Works', 5),
 ('Shakespeare', 45),
 ('eBook', 9),
 ('use', 266),
 ('anyone', 4),
 ('cost', 34),
 ('almost', 137)]

### Print word frequencies
After combining the values (counts) with the same key (word), in this exercise you'll return the first 10 word frequencies. You could retrieve all the elements at once using collect(), but this is bad practice and not recommended: RDDs can be huge, so you may run out of memory and crash your computer.

What if we want to return the top 10 words? For this, you'll first need to swap the keys (words) and values (counts) so that the key is the count and the value is the word. After you swap the key and value in each tuple, you'll sort the pair RDD based on the key (count). This makes it easy to sort the RDD with the sortByKey operation in PySpark, which sorts by key rather than by value. Finally, you'll return the top 10 words from the sorted RDD.

You already have a SparkContext sc and resultRDD available in your workspace.

In [49]:
# Display the first 10 words and their frequencies from the input RDD
for word in resultRDD.take(10):
    print(word)

('Gutenberg', 26)
('EBook', 2)
('Complete', 4)
('Works', 5)
('Shakespeare', 45)
('eBook', 9)
('use', 266)
('anyone', 4)
('cost', 34)
('almost', 137)


In [50]:
# Swap the keys and values from the input RDD
resultRDD_swap = resultRDD.map(lambda x: (x[1], x[0]))

resultRDD_swap.take(10)

[(26, 'Gutenberg'),
 (2, 'EBook'),
 (4, 'Complete'),
 (5, 'Works'),
 (45, 'Shakespeare'),
 (9, 'eBook'),
 (266, 'use'),
 (4, 'anyone'),
 (34, 'cost'),
 (137, 'almost')]

In [51]:
# Sort the keys in descending order
resultRDD_swap_sort = resultRDD_swap.sortByKey(ascending=False)

resultRDD_swap_sort.take(10)

[(4247, 'thou'),
 (3630, 'thy'),
 (3018, 'shall'),
 (2046, 'good'),
 (1974, 'would'),
 (1926, 'Enter'),
 (1780, 'thee'),
 (1737, "I'll"),
 (1614, 'hath'),
 (1452, 'like')]

In [52]:
# Show the top 10 most frequent words and their frequencies from the sorted RDD
for word in resultRDD_swap_sort.take(10):
    print("{},{}".format(word[1], word[0]))

thou,4247
thy,3630
shall,3018
good,2046
would,1974
Enter,1926
thee,1780
I'll,1737
hath,1614
like,1452
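
Swapping keys and values is one way to rank words by count. Another option is to select the largest items by a key function without swapping; PySpark's RDD API offers takeOrdered(num, key) for this kind of selection. The plain-Python sketch below shows the idea with `heapq.nlargest` on the ten (word, count) pairs previewed earlier, so it ranks only that small sample and not the full dataset.

```python
import heapq

# The ten (word, count) pairs shown by resultRDD.take(10) earlier
sample = [
    ('Gutenberg', 26), ('EBook', 2), ('Complete', 4), ('Works', 5),
    ('Shakespeare', 45), ('eBook', 9), ('use', 266), ('anyone', 4),
    ('cost', 34), ('almost', 137),
]

# Pick the three pairs with the largest counts, without swapping keys and values
top3 = heapq.nlargest(3, sample, key=lambda pair: pair[1])

for word, count in top3:
    print("{},{}".format(word, count))
```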
