## PySpark Loop/Iterate Through Rows in DataFrame

PySpark provides `map()` and `mapPartitions()` on the RDD API to loop/iterate through the rows of an RDD/DataFrame and apply complex transformations. `map()` returns exactly one record per input record, although the number of columns may differ (after adding or updating fields); `mapPartitions()` usually does the same, but the function you pass is free to return more or fewer records.

PySpark also provides the `foreach()` and `foreachPartition()` actions to loop/iterate through each Row in a DataFrame; both return nothing.

Iteration can be done using:
* `map()`
* `foreach()`
* converting to pandas
* converting the DataFrame to a Python list

In [0]:
dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this command.

#### Load libraries

In [0]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, DateType, StringType, StructType, StructField, ArrayType, MapType, DoubleType
from pyspark.sql.functions import lit, col, expr, when, sum, avg, max, min, mean, count, udf, explode, concat_ws

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('PySpark Loop/Iterate Through Rows in DataFrame').getOrCreate()

#### Create a DataFrame

In [0]:
data = [
  ('John', 'Smith', 'M', 2500.0),
  ('Jane', 'Doe', 'F', 500.0),
  ('Richard', 'Marquette', 'M', 1500.0),
  ('Israel', 'Israeli', 'M', 3000.0),
  ('Edward', 'III', 'M', 5000.0)
]
 
schema = StructType([
  StructField('firstname', StringType(),True),
  StructField('lastname', StringType(),True),
  StructField('gender', StringType(), True),
  StructField('salary', DoubleType(), True)
])

columns = schema.fieldNames()

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()

#### Loop Through Rows in DataFrame 

For simple computations, prefer the DataFrame's `select()` or `withColumn()` combined with PySpark SQL functions over iterating with `map()` or `foreach()`. The built-in functions run inside the JVM and can be optimized by Spark, whereas row-wise Python functions cannot.

In [0]:
df.select(
  concat_ws(',',df.firstname,df.lastname).alias('name'),
  df.gender,
  (df.salary * 1.20).alias('new_salary')  # df.salary * 1.20 is already a Column, so lit() is not needed
).show()

#### Using map() to Loop

A PySpark DataFrame has no `map()` method; `map()` lives on the RDD, so you need to convert the DataFrame to an RDD first (`df.rdd`) and then call `map()`.  
The result is an RDD, which you can convert back to a PySpark DataFrame with `toDF()` if needed.

In [0]:
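# Access fields by position; each Row becomes a (name, gender, new_salary) tuple
rdd = df.rdd.map(lambda x: (x[0]+","+x[1],x[2],x[3]*1.20))

# Other ways to access fields: by attribute or by column name
#rdd2 = df.rdd.map(lambda x: (x.firstname+","+x.lastname,x.gender,x.salary*2))
#rdd2 = df.rdd.map(lambda x: (x["firstname"]+","+x["lastname"],x["gender"],x["salary"]*2))

# Or move the logic into a named function
#def func1(x):
#  firstName=x.firstname
#  lastName=x.lastname  # field names are case-sensitive: lastname, not lastName
#  name=firstName+","+lastName
#  gender=x.gender.lower()
#  salary=x.salary*2
#  return (name,gender,salary)
#rdd2 = df.rdd.map(func1)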

df2 = rdd.toDF(['name','gender','new_salary'])
df2.show()

#### Using foreach() to Loop

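`foreach()` is an action: it runs the given function once per row on the executors and returns nothing. On a cluster, anything the function prints goes to the executor logs rather than to the notebook output.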

In [0]:
def f(x): print(x)
df.foreach(f)

In [0]:
df.foreach(lambda x: print("Data ==>"+x["firstname"]+","+x["lastname"]+","+x["gender"]+","+str(x["salary"]*2)))

#### Using toPandas() to Iterate

In [0]:
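import pandas as pd
# Enable Apache Arrow, an in-memory columnar format that Spark uses to transfer data between the JVM and Python.
# Spark 3.x names this setting spark.sql.execution.arrow.pyspark.enabled;
# the older spark.sql.execution.arrow.enabled key is deprecated.
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')

# toPandas() collects every row to the driver, so use it only on data that fits in driver memory
pandasDF = df.toPandas()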

for index, row in pandasDF.iterrows(): print(row['firstname'], row['gender'])
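
Once the data is in pandas, `iterrows()` is not the only option. The sketch below uses a hand-built pandas DataFrame as a stand-in for the result of `df.toPandas()` (`pandas_df` and `labels` are illustrative names) and shows `itertuples()`, which is generally faster than `iterrows()` because it does not build a Series per row, as well as a vectorised column operation that avoids the loop entirely:

```python
import pandas as pd

# Stand-in for the result of df.toPandas(), with the notebook's column names.
pandas_df = pd.DataFrame(
    [("John", "Smith", "M", 2500.0), ("Jane", "Doe", "F", 500.0)],
    columns=["firstname", "lastname", "gender", "salary"],
)

# itertuples() yields one namedtuple per row; fields are read as attributes.
labels = [f"{row.firstname} {row.gender}" for row in pandas_df.itertuples(index=False)]
print(labels)

# Vectorised alternative: operate on the whole column, with no Python-level loop.
pandas_df["new_salary"] = pandas_df["salary"] * 1.20
print(pandas_df)
```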

#### Collect Data As List and Loop Through

In [0]:
# Collect the data to Python List
dataCollect = df.collect()
for row in dataCollect: print(f"{row['firstname']},{row['lastname']}")

In [0]:
# Using toLocalIterator(): rows are fetched to the driver one partition at a time instead of all at once
dataCollect = df.rdd.toLocalIterator()
for row in dataCollect: print(f"{row['firstname']},{row['lastname']}")

#### The end of the notebook