## PySpark flatMap()/explode() Transformation

`flatMap()` is an RDD transformation that applies a function to every element and flattens the results into a new RDD. `explode()` is its DataFrame counterpart: it turns each element of an array or map column into its own row and returns a new DataFrame.

In [0]:
dbutils.library.restartPython()  # Removes Python state, but some libraries might not work without calling this command.

#### Load libraries

In [0]:
# Import only what this notebook uses; importing sum, max and min from
# pyspark.sql.functions would shadow the Python built-ins of the same name.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

#### Create Spark session

In [0]:
spark = SparkSession.builder.appName('PySpark flatMap()/explode() Transformation').getOrCreate()

#### Create an RDD

In [0]:
data = [
  'Lorem ipsum', 
  'dolor sit amet', 
  'consectetur adipiscing elit', 
  'sed do eiusmod tempor',
  'incididunt ut labore',
  'et dolore magna aliqua' 
]
rdd = spark.sparkContext.parallelize(data)

for element in rdd.collect():
    print(element)

#### Using flatMap()

In [0]:
rdd2 = rdd.flatMap(lambda x: x.split(' '))
for element in rdd2.collect():
    print(element)
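
To see why the output above is a flat list of words rather than a list of lists, the following is a minimal plain-Python sketch of the difference between `map()`-style and `flatMap()`-style behaviour. It does not use Spark at all; `lines`, `mapped` and `flat_mapped` are illustrative names that are not part of the notebook, and the list operations only model what the RDD transformations return.

```python
from itertools import chain

# Two sample lines taken from the notebook's data (illustrative subset)
lines = ['Lorem ipsum', 'dolor sit amet']

# map-like: one output per input element, so the result is a list of lists
mapped = [line.split(' ') for line in lines]

# flatMap-like: the per-element lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(line.split(' ') for line in lines))

print(mapped)       # [['Lorem', 'ipsum'], ['dolor', 'sit', 'amet']]
print(flat_mapped)  # ['Lorem', 'ipsum', 'dolor', 'sit', 'amet']
```

In Spark terms, `rdd.map(lambda x: x.split(' '))` would correspond to the first result and `rdd.flatMap(lambda x: x.split(' '))` to the second.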

#### Using explode()

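A PySpark DataFrame doesn't have a `flatMap()` transformation; however, the `explode()` function from `pyspark.sql.functions` does the equivalent for array and map columns, returning one row per element. Rows whose array is null or empty (such as `Washington` below) produce no output rows.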

In [0]:
data = [
  ('James',['Java','Scala']),
  ('Michael',['Spark','Java',None]),
  ('Robert',['CSharp','']),
  ('Washington',None),
  ('Jefferson',['1','2'])
]
df = spark.createDataFrame(data=data, schema=['name', 'knownLanguages'])

# One output row per array element; the exploded column is named 'lang'
df2 = df.select(df.name, explode(df.knownLanguages).alias('lang'))
df2.printSchema()
df2.show()
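
The row-level effect of `explode()` on this data can be modelled in plain Python. This is a rough sketch that does not use Spark; `explode_rows` is a hypothetical helper written for illustration, not a PySpark function. It mirrors the documented behaviour that a null or empty array yields no rows, while null or empty-string elements inside an array are kept.

```python
# Same data as the DataFrame above
rows = [
    ('James', ['Java', 'Scala']),
    ('Michael', ['Spark', 'Java', None]),
    ('Robert', ['CSharp', '']),
    ('Washington', None),
    ('Jefferson', ['1', '2']),
]

def explode_rows(rows):
    """Plain-Python model of select(name, explode(array_col)); not Spark code."""
    out = []
    for name, langs in rows:
        # A null (None) or empty array contributes no output rows
        for lang in (langs or []):
            out.append((name, lang))
    return out

exploded = explode_rows(rows)
for row in exploded:
    print(row)
```

`Washington` is missing from the result because its array is null. If such rows should be kept with a null `lang`, `pyspark.sql.functions` also provides `explode_outer()`, which can be used in place of `explode()` in the `select` above.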

#### The end of the notebook