In [1]:
# Use the findspark library to locate Spark on the local machine
import findspark
findspark.init(r'C:\spark\spark-3.5.0-bin-hadoop3')
import pyspark # only run this after findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = ["Project Gutenberg’s",
        "Alice’s Adventures in Wonderland",
        "Project Gutenberg’s",
        "Adventures in Wonderland",
        "Project Gutenberg’s"]
rdd = spark.sparkContext.parallelize(data)

for element in rdd.collect():
    print(element)

Project Gutenberg’s
Alice’s Adventures in Wonderland
Project Gutenberg’s
Adventures in Wonderland
Project Gutenberg’s


The next cell applies the flatMap transformation to the RDD (rdd) created above. Step by step:

rdd is the RDD of strings built from data with parallelize.

flatMap is a transformation that takes a function (here, a lambda) and applies it to each element of the RDD. The lambda, lambda x: x.split(" "), splits each string on the space character (" ").

For each element x, x.split(" ") returns a list of words, using a single space as the delimiter. Each string therefore yields one list containing several words.
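The split step can be tried on its own in plain Python, without a Spark session. The sketch below uses one of the sample strings from data; the variable names (line, words, double_spaced) are illustrative and do not appear in the notebook.

```python
# Plain-Python illustration of the split step; no Spark session is needed.
# The variable names here are illustrative, not part of the notebook.
line = "Alice’s Adventures in Wonderland"
words = line.split(" ")
print(words)

# split(" ") keeps an empty string wherever two spaces are adjacent,
# while split() with no argument collapses runs of whitespace.
double_spaced = "Adventures  in"
print(double_spaced.split(" "))  # ['Adventures', '', 'in']
print(double_spaced.split())     # ['Adventures', 'in']
```

The sample data has single spaces only, so split(" ") and split() give the same words here; the difference only matters for input with repeated spaces.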

flatMap then flattens these per-element lists, so the result is a single RDD of words rather than an RDD of lists of words. The map transformation, by contrast, would return an RDD with one list per input string.
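The difference between map and flatMap can be sketched in plain Python. This is only an analogy for the semantics, assuming list comprehensions and itertools.chain as stand-ins; it is not how Spark implements either transformation, and the names mapped and flat_mapped are illustrative.

```python
from itertools import chain

# Same sample strings as in the first cell.
data = ["Project Gutenberg’s",
        "Alice’s Adventures in Wonderland",
        "Project Gutenberg’s",
        "Adventures in Wonderland",
        "Project Gutenberg’s"]

# Analogue of rdd.map(lambda x: x.split(" ")): one list per input string.
mapped = [x.split(" ") for x in data]

# Analogue of rdd.flatMap(lambda x: x.split(" ")): the per-string lists
# are chained together into one flat sequence of words.
flat_mapped = list(chain.from_iterable(x.split(" ") for x in data))

print(len(mapped))       # 5 lists, one per input string
print(len(flat_mapped))  # 13 words in total
print(mapped[1])         # ['Alice’s', 'Adventures', 'in', 'Wonderland']
print(flat_mapped[:3])   # ['Project', 'Gutenberg’s', 'Alice’s']
```

The 13 words in flat_mapped correspond to the 13 lines printed by the flatMap cell.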

In [3]:
# flatMap: split each string on spaces and flatten the results into one RDD of words
rdd2 = rdd.flatMap(lambda x: x.split(" "))
for element in rdd2.collect():
    print(element)

Project
Gutenberg’s
Alice’s
Adventures
in
Wonderland
Project
Gutenberg’s
Adventures
in
Wonderland
Project
Gutenberg’s
