### map()

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.appName("Spark_ln").getOrCreate()
spark

1. map() in PySpark is an RDD transformation that applies a function (often a lambda) to each element of an RDD and returns a new RDD containing the results, one output element per input element.

2. A PySpark DataFrame has no map() transformation. To apply a custom per-row function with map(), first convert the DataFrame to an RDD through its rdd attribute, apply map(), and then convert the result back with toDF() if you need a DataFrame again.


In [0]:
data = ["Project","Gutenberg’s","Alice’s","Adventures",
"in","Wonderland","Project","Gutenberg’s","Adventures",
"in","Wonderland","Project","Gutenberg’s"
]

# Create an RDD from the Python list
rdd = spark.sparkContext.parallelize(data)
rdd.collect()

Out[3]: ['Project',
 'Gutenberg’s',
 'Alice’s',
 'Adventures',
 'in',
 'Wonderland',
 'Project',
 'Gutenberg’s',
 'Adventures',
 'in',
 'Wonderland',
 'Project',
 'Gutenberg’s']

In [0]:
# map() with rdd
rdd2=rdd.map(lambda x: (x,1))
for element in rdd2.collect():
    print(element)

('Project', 1)
('Gutenberg’s', 1)
('Alice’s', 1)
('Adventures', 1)
('in', 1)
('Wonderland', 1)
('Project', 1)
('Gutenberg’s', 1)
('Adventures', 1)
('in', 1)
('Wonderland', 1)
('Project', 1)
('Gutenberg’s', 1)


In [0]:
# zipWithIndex() pairs each element with its zero-based position in the RDD
rdd3 = rdd.zipWithIndex()
for element in rdd3.collect():
    print(element)

('Project', 0)
('Gutenberg’s', 1)
('Alice’s', 2)
('Adventures', 3)
('in', 4)
('Wonderland', 5)
('Project', 6)
('Gutenberg’s', 7)
('Adventures', 8)
('in', 9)
('Wonderland', 10)
('Project', 11)
('Gutenberg’s', 12)
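As a rough mental model of the output above, zipWithIndex() behaves like Python's enumerate() with the pair order swapped, so each element comes first and its index second. The snippet below is only a plain-Python analogy on a shortened, hypothetical word list. It does not use Spark and says nothing about how indices are assigned across partitions.

```python
# Plain-Python analogy for rdd.zipWithIndex(); no Spark session is needed.
# `words` is a shortened, hypothetical sample, not the notebook's full list.
words = ["Project", "Alice", "Adventures"]

# enumerate() yields (index, element); swap the order to mirror zipWithIndex().
indexed = [(word, index) for index, word in enumerate(words)]

for pair in indexed:
    print(pair)
```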


In [0]:
data = [('Raj','Kumar','M',30),
  ('Annie','Dhumi','F',41),
  ('Rahul','kumar','M',62), 
]

columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data = data, schema = columns)
display(df)

firstname,lastname,gender,salary
Raj,Kumar,M,30
Annie,Dhumi,F,41
Rahul,kumar,M,62


In [0]:
# Convert the DataFrame to an RDD and apply map(), accessing Row fields by position
rdd4 = df.rdd.map(lambda x: (x[0]+" "+x[1],x[2],x[3]+5))
df1 = rdd4.toDF(["name","gender","new_salary"])
# display(df1)
df1.show()

+-----------+------+----------+
|       name|gender|new_salary|
+-----------+------+----------+
|  Raj Kumar|     M|        35|
|Annie Dhumi|     F|        46|
|Rahul kumar|     M|        67|
+-----------+------+----------+



In [0]:
# Access Row fields by column name
rdd5 = df.rdd.map(lambda x: (x["firstname"]+" "+x["lastname"],x["gender"],x["salary"]+6))
df2 = rdd5.toDF(["name","gender","new_salary"])
# display(df2)
df2.show()

+-----------+------+----------+
|       name|gender|new_salary|
+-----------+------+----------+
|  Raj Kumar|     M|        36|
|Annie Dhumi|     F|        47|
|Rahul kumar|     M|        68|
+-----------+------+----------+



In [0]:
# Another way: access Row fields as attributes
rdd6 = df.rdd.map(lambda x: (x.firstname+" "+x.lastname, x.gender, x.salary+5))
df3 = rdd6.toDF(["name","gender","new_salary"])
df3.show()

+-----------+------+----------+
|       name|gender|new_salary|
+-----------+------+----------+
|  Raj Kumar|     M|        35|
|Annie Dhumi|     F|        46|
|Rahul kumar|     M|        67|
+-----------+------+----------+



### flatMap()

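- PySpark flatMap() is an RDD transformation that applies a function to every element, where the function returns an iterable for each input, and then flattens those results into a single new RDD. A DataFrame has no flatMap(); to flatten array or map columns of a DataFrame, use explode() from pyspark.sql.functions instead.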
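To make the difference from map() concrete, here is a minimal plain-Python sketch of the two behaviours. It does not use Spark: the list comprehension stands in for map(), and itertools.chain.from_iterable stands in for the flattening step of flatMap(). The `lines` list is a hypothetical two-sentence sample.

```python
from itertools import chain

# Hypothetical sample input; each string plays the role of one RDD element.
lines = ["after applying the function", "on every element"]

# map-like: exactly one output per input, so the result is a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap-like: apply the same function, then flatten the results by one level.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(len(mapped))       # number of input lines
print(len(flat_mapped))  # total number of words
```

With map() the element count is preserved, while with flatMap() it can grow or shrink depending on how many items the function returns for each input.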

In [0]:
flatmap_data = [
    "PySpark flatMap() is a transformation operation",
    "that flattens the RDD/DataFrame (array/map DataFrame columns)",
    "after applying the function",
    "on every element and returns",
    "a new PySpark RDD/DataFrame.",
]

In [0]:
rdd7 = spark.sparkContext.parallelize(flatmap_data)
for x in rdd7.collect():
    print(x)
    

PySpark flatMap() is a transformation operation
that flattens the RDD/DataFrame (array/map DataFrame columns)
after applying the function
on every element and returns
a new PySpark RDD/DataFrame.


In [0]:
flatmap_rdd = rdd7.flatMap(lambda x: x.split(" "))
for y in flatmap_rdd.collect():
    print(y)

PySpark
flatMap()
is
a
transformation
operation
that
flattens
the
RDD/DataFrame
(array/map
DataFrame
columns)
after
applying
the
function
on
every
element
and
returns
a
new
PySpark
RDD/DataFrame.
