## map() Transformation
`map()` is an RDD transformation that applies a function (often a lambda) to every element of an RDD and returns a new RDD.

Use `map()` for per-record operations such as adding a column, updating a column, or otherwise transforming the data. The output of a `map()` transformation always has the same number of records as its input.

* Note1: A PySpark DataFrame doesn't have a `map()` transformation, so you need to convert the DataFrame to an RDD first (via `df.rdd`).
* Note2: If your function needs heavy initialization, use the PySpark `mapPartitions()` transformation instead of `map()`. With `mapPartitions()` the initialization runs once per partition instead of once per record.
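
The difference described in Note2 can be sketched without a Spark cluster. The block below is a plain-Python model, not PySpark itself: each inner list stands in for one partition, and `expensive_setup`, `per_record`, and `per_partition` are hypothetical names introduced only for this illustration.

```python
# Spark-free sketch: each inner list stands in for one RDD partition.
calls = {"setup": 0}

def expensive_setup():
    """Hypothetical stand-in for heavy initialization, e.g. opening a connection."""
    calls["setup"] += 1
    return str.upper  # the "resource" here is just a function

def per_record(word):
    # map()-style function: the setup runs again for every record
    resource = expensive_setup()
    return resource(word)

def per_partition(words):
    # mapPartitions()-style function: receives an iterator over one partition,
    # runs the setup once, then yields one result per record
    resource = expensive_setup()
    for word in words:
        yield resource(word)

partitions = [["Eggs", "are", "forgiving"], ["of", "forgetfulness."]]

calls["setup"] = 0
mapped = [per_record(w) for part in partitions for w in part]
calls_with_map = calls["setup"]

calls["setup"] = 0
mapped_by_partition = [out for part in partitions for out in per_partition(iter(part))]
calls_with_map_partitions = calls["setup"]

print(calls_with_map, calls_with_map_partitions)  # 5 2
```

With a real RDD, the corresponding calls would be `rdd.map(per_record)` and `rdd.mapPartitions(per_partition)`. The counter is only meaningful in this single-process sketch, because on a cluster each executor process would hold its own copy.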

Create an RDD from a list of words:

In [67]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("map").getOrCreate()

data = ["Eggs", "are", "forgiving", "of", "forgetfulness.", "But", "butter,", "not", "so", "much"]

rdd = spark.sparkContext.parallelize(data)

### map() with RDD
We'll pair each element with the string `"checked"`. The result is an RDD of key-value tuples, with the word as the key and `"checked"` as the value.

In [68]:
rdd2=rdd.map(lambda x: (x,"checked"))
for element in rdd2.collect():
    print(element)

('Eggs', 'checked')
('are', 'checked')
('forgiving', 'checked')
('of', 'checked')
('forgetfulness.', 'checked')
('But', 'checked')
('butter,', 'checked')
('not', 'checked')
('so', 'checked')
('much', 'checked')


### map() with DataFrame
A PySpark DataFrame doesn't have a `map()` transformation. To apply a custom per-row function, convert the DataFrame to an RDD with `df.rdd`, apply `map()`, and convert the result back to a DataFrame with `toDF()` if needed.

In [69]:
data = [('James','Smith','M',130),
        ('Anna','Trump','F',141),
        ('Robert','Williams','M',162)]

columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()

+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|   130|
|     Anna|   Trump|     F|   141|
|   Robert|Williams|     M|   162|
+---------+--------+------+------+



Referring to columns by index:

In [70]:
rdd2 = df.rdd.map(lambda x: (x[0]+","+x[1],x[2],x[3]*2))  
df2 = rdd2.toDF(["name","gender","new_salary"])
df2.show()

25/08/09 15:44:57 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.


+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     M|       260|
|     Anna,Trump|     F|       282|
|Robert,Williams|     M|       324|
+---------------+------+----------+



You can also refer to the DataFrame column names while iterating:

In [71]:
rdd3 = df.rdd.map(lambda x: (x["firstname"]+","+x["lastname"],x["gender"],x["salary"]*2)) 
df3 = rdd3.toDF(["name","gender","new_salary"])
df3.show()

+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     M|       260|
|     Anna,Trump|     F|       282|
|Robert,Williams|     M|       324|
+---------------+------+----------+



Columns can also be accessed as attributes of each row:

In [72]:
rdd4 = df.rdd.map(lambda x: (x.firstname + "," + x.lastname, x.gender, x.salary*2)) 
df4 = rdd4.toDF(["name","gender","new_salary"])
df4.show()

+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     M|       260|
|     Anna,Trump|     F|       282|
|Robert,Williams|     M|       324|
+---------------+------+----------+



#### By Calling a function

In [73]:
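def func1(x):
    # x is a Row: combine the names, lowercase the gender, double the salary
    firstName = x.firstname
    lastName = x.lastname
    name = firstName + "," + lastName
    gender = x.gender.lower()
    salary = x.salary*2
    return (name,gender,salary)

# func1 can be passed to map() directly; no lambda wrapper is needed
rdd5 = df.rdd.map(func1)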
df5 = rdd5.toDF(["name","gender","new_salary"])
df5.show()

+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     m|       260|
|     Anna,Trump|     f|       282|
|Robert,Williams|     m|       324|
+---------------+------+----------+
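
Because the function passed to `map()` is ordinary Python, it can be checked without a Spark session. The sketch below assumes the function only uses attribute access on its argument, so a `namedtuple` (the hypothetical `Person` type here) can stand in for a DataFrame row. Python's built-in `map()` plays the role of the RDD transformation.

```python
from collections import namedtuple

# Hypothetical stand-in for a DataFrame row; only attribute access is assumed.
Person = namedtuple("Person", ["firstname", "lastname", "gender", "salary"])

def func1(x):
    # Same logic as the function above, written compactly
    name = x.firstname + "," + x.lastname
    return (name, x.gender.lower(), x.salary * 2)

rows = [
    Person("James", "Smith", "M", 130),
    Person("Anna", "Trump", "F", 141),
    Person("Robert", "Williams", "M", 162),
]

# Built-in map() applies func1 to each element: one output per input.
result = list(map(func1, rows))
for item in result:
    print(item)
```

The result has exactly as many records as the input, which mirrors the one-to-one behavior of the RDD `map()` transformation.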



### Convert DataFrame Columns to MapType (Dict)
You can convert selected or all DataFrame columns to a `MapType` column, which is similar to a Python dictionary (`dict`).

The `create_map()` function converts selected DataFrame columns to `MapType`. It takes the key and value columns as alternating arguments (key1, value1, key2, value2, ...) and returns a `MapType` column.

Let’s create a DataFrame

In [74]:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

data = [ ("36636","Finance",13000,"USA"), 
         ("40288","Finance",15000,"MEX"), 
         ("42114","Sales",13900,"USA"), 
         ("39192","Marketing",12500,"CAN"), 
         ("34534","Sales",16500,"USA") ]

schema = StructType([
     StructField('id', StringType(), True),
     StructField('dept', StringType(), True),
     StructField('salary', IntegerType(), True),
     StructField('location', StringType(), True)
     ])

df6 = spark.createDataFrame(data=data,schema=schema)
df6.printSchema()
df6.show(truncate=False)

root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- location: string (nullable = true)

+-----+---------+------+--------+
|id   |dept     |salary|location|
+-----+---------+------+--------+
|36636|Finance  |13000 |USA     |
|40288|Finance  |15000 |MEX     |
|42114|Sales    |13900 |USA     |
|39192|Marketing|12500 |CAN     |
|34534|Sales    |16500 |USA     |
+-----+---------+------+--------+



#### Convert DataFrame Columns to MapType
Using the `create_map()` SQL function, let's convert the PySpark DataFrame columns `dept` and `location` to a single `MapType` column and drop the originals.

In [75]:
from pyspark.sql.functions import col,lit,create_map

df7 = df6.withColumn(
    "propertiesMap",
    create_map(lit("dept"), col("dept"),
               lit("location"), col("location"))
).drop("dept", "location")

df7.printSchema()
df7.show(truncate=False)

root
 |-- id: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- propertiesMap: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+-----+------+------------------------------------+
|id   |salary|propertiesMap                       |
+-----+------+------------------------------------+
|36636|13000 |{dept -> Finance, location -> USA}  |
|40288|15000 |{dept -> Finance, location -> MEX}  |
|42114|13900 |{dept -> Sales, location -> USA}    |
|39192|12500 |{dept -> Marketing, location -> CAN}|
|34534|16500 |{dept -> Sales, location -> USA}    |
+-----+------+------------------------------------+
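
The alternating key/value argument order used with `create_map()` can be modeled in plain Python. This is only a sketch of the pairing rule, not the Spark function: `create_map_model` is a hypothetical helper, and a `dict` stands in for one DataFrame row.

```python
from itertools import chain

def create_map_model(*args):
    """Plain-Python model: pair arguments as key1, value1, key2, value2, ..."""
    if len(args) % 2 != 0:
        raise ValueError("expected an even number of arguments (key/value pairs)")
    return dict(zip(args[0::2], args[1::2]))

# One row of the example data, represented as a dict for this sketch
row = {"id": "36636", "dept": "Finance", "salary": 13000, "location": "USA"}
columns_to_pack = ["dept", "location"]

# Build the alternating [key, value, key, value, ...] list from column names.
args = list(chain.from_iterable((name, row[name]) for name in columns_to_pack))
properties_map = create_map_model(*args)

print(args)            # ['dept', 'Finance', 'location', 'USA']
print(properties_map)  # {'dept': 'Finance', 'location': 'USA'}
```

In PySpark, the same alternating list would be built from `lit(name)` and `col(name)` pairs and unpacked into `create_map()`, which is handy when the list of columns to pack is only known at run time.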



#### Using foreach() to Loop Through Rows in DataFrame
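Similar to `map()`, `foreach()` is applied to every row of a DataFrame. The difference is that `foreach()` is an action and returns nothing. The example below iterates through the DataFrame with `foreach()`. In local mode the printed lines appear in the console; on a cluster, `print()` output goes to the executor logs rather than to the driver.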

In [76]:
df.foreach(lambda x: print("Data ==> "+x["firstname"]+","+x["lastname"]+","+x["gender"]+","+str(x["salary"]*2)))

Data ==> James,Smith,M,260
Data ==> Anna,Trump,F,282
Data ==> Robert,Williams,M,324


#### Using pandas to Iterate
If you have a small dataset, you can also convert the PySpark DataFrame to pandas and iterate with pandas. Set the `spark.sql.execution.arrow.pyspark.enabled` config to enable Apache Arrow, an in-memory columnar format that Spark uses to transfer data between the JVM and Python. The older name `spark.sql.execution.arrow.enabled` has been deprecated since Spark 3.0.

In [77]:
# Using pandas
import pandas as pd

# Enable Arrow-based transfer (current name of the config key)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandasDF = df.toPandas()

for index, row in pandasDF.iterrows():
    print(row['firstname'], row['gender'])

James M
Anna F
Robert M


#### Collect Data As List and Loop Through
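You can also collect the PySpark DataFrame to the driver and iterate over it in Python. Alternatively, use `toLocalIterator()`, which returns an iterator that fetches rows one partition at a time instead of loading the whole dataset into driver memory at once.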
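
To see how the two approaches differ in when data reaches the driver, here is a Spark-free sketch. It assumes a simplified model in which each partition is a callable that records when it is fetched. `make_partition`, `collect_model`, and `local_iterator_model` are hypothetical names for this illustration only.

```python
fetched = []  # records which partitions have been pulled to the "driver"

def make_partition(index, rows):
    # Each partition is modeled as a callable that logs the fetch
    def fetch():
        fetched.append(index)
        return rows
    return fetch

partitions = [
    make_partition(0, ["James,Smith"]),
    make_partition(1, ["Anna,Trump"]),
    make_partition(2, ["Robert,Williams"]),
]

def collect_model(parts):
    # Models collect(): every partition is fetched before iteration starts
    return [row for fetch in parts for row in fetch()]

def local_iterator_model(parts):
    # Models toLocalIterator(): partitions are fetched lazily, one at a time
    for fetch in parts:
        yield from fetch()

all_rows = collect_model(partitions)
fetched_by_collect = list(fetched)

fetched.clear()
row_iter = local_iterator_model(partitions)
first_row = next(row_iter)
fetched_after_first_row = list(fetched)

print(fetched_by_collect)       # [0, 1, 2]
print(fetched_after_first_row)  # [0]
```

In this model the iterator holds at most one partition at a time, which is why `toLocalIterator()` can be the safer choice when the full result would not fit comfortably in driver memory.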

In [78]:
# Collect the data to Python List
from termcolor import cprint

dataCollect = df.collect()
for row in dataCollect:
    cprint(row['firstname'] + "," +row['lastname'], 'red')

#Using toLocalIterator()
dataCollect=df.rdd.toLocalIterator()
for row in dataCollect:
    cprint(row['firstname'] + "," +row['lastname'], 'blue')

James,Smith
Anna,Trump
Robert,Williams
James,Smith
Anna,Trump
Robert,Williams


