## Explode nested array into rows
The PySpark `explode` function can turn an array-of-arrays column (a nested array, `ArrayType(ArrayType(StringType))`) on a DataFrame into rows, producing one row per inner array.

Let's create a DataFrame with a nested array column. In the example below, the `subjects` column is an array of `ArrayType` values that holds the subjects each person has learned.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('explode').getOrCreate()

arrayArrayData = [
  ("James",[["Java","Scala","C++"],["Spark","Java"]]),
  ("Michael",[["Spark","Java","C++"],["Spark","Java"]]),
  ("Robert",[["CSharp","VB"],["Spark","Python"]])
]

df = spark.createDataFrame(data=arrayArrayData, schema = ['name','subjects'])
df.printSchema()
df.show(truncate=False)

root
 |-- name: string (nullable = true)
 |-- subjects: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)


+-------+-----------------------------------+
|name   |subjects                           |
+-------+-----------------------------------+
|James  |[[Java, Scala, C++], [Spark, Java]]|
|Michael|[[Spark, Java, C++], [Spark, Java]]|
|Robert |[[CSharp, VB], [Spark, Python]]    |
+-------+-----------------------------------+



Now, let's explode the `subjects` array column into rows. Because no alias is given, the result lands in a new column with the default name `col`, and each row holds one of the inner arrays.

In [2]:
from pyspark.sql.functions import explode

df.select(df.name,explode(df.subjects)).show(truncate=False)

+-------+------------------+
|name   |col               |
+-------+------------------+
|James  |[Java, Scala, C++]|
|James  |[Spark, Java]     |
|Michael|[Spark, Java, C++]|
|Michael|[Spark, Java]     |
|Robert |[CSharp, VB]      |
|Robert |[Spark, Python]   |
+-------+------------------+



To flatten the nested arrays instead, use the `flatten` function, which merges an array of arrays into a single array for each row of the DataFrame.

In [3]:
from pyspark.sql.functions import flatten

df.select(df.name,flatten(df.subjects)).show(truncate=False)

+-------+-------------------------------+
|name   |flatten(subjects)              |
+-------+-------------------------------+
|James  |[Java, Scala, C++, Spark, Java]|
|Michael|[Spark, Java, C++, Spark, Java]|
|Robert |[CSharp, VB, Spark, Python]    |
+-------+-------------------------------+



### explode vs explode_outer

`explode` creates one row for each element of an array or map column and drops any input row whose array or map is null or empty. `explode_outer` keeps those rows, emitting a single row with a null in place of the missing element.

In [4]:
arrayData = [
  (1, 'Luke',["baseball","Soccer"]),
  (2, "Lucy",None)
]

df = spark.createDataFrame(data=arrayData, schema = ['index','names','likes'])
df.printSchema()
df.show(truncate=False)

root
 |-- index: long (nullable = true)
 |-- names: string (nullable = true)
 |-- likes: array (nullable = true)
 |    |-- element: string (containsNull = true)

+-----+-----+------------------+
|index|names|likes             |
+-----+-----+------------------+
|1    |Luke |[baseball, Soccer]|
|2    |Lucy |NULL              |
+-----+-----+------------------+



In [5]:
from pyspark.sql.functions import explode
df1 = df.select('index','names',explode('likes').alias('likes'))
df1.show()

+-----+-----+--------+
|index|names|   likes|
+-----+-----+--------+
|    1| Luke|baseball|
|    1| Luke|  Soccer|
+-----+-----+--------+



In [6]:
from pyspark.sql.functions import explode_outer

df1 = df.select('index','names',explode_outer('likes').alias('likes'))
df1.show()

+-----+-----+--------+
|index|names|   likes|
+-----+-----+--------+
|    1| Luke|baseball|
|    1| Luke|  Soccer|
|    2| Lucy|    NULL|
+-----+-----+--------+



In [7]:
df1.select('likes').count()

3

In [8]:
df1.select('names').count()

3