- Title: Handling Complex Data Types in Spark DataFrame
- Slug: pyspark-handling-complex-data-types
- Date: 2019-12-18
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, complex data types, StructType, ArrayTypes
- Author: Ben Du
- Modified: 2019-12-18


## Comments

There are multiple ways to represent complex data in a Spark DataFrame column: a vanilla string, a JSON string, a StructType, or an ArrayType.
Notice that a Scala Tuple is converted to a StructType in Spark DataFrames
and a Scala Array is converted to an ArrayType in Spark DataFrames.
ArrayType is more convenient if the elements have the same type,
especially starting from Spark 2.4, which added many built-in array functions (`element_at`, `array_min`, `array_sort`, and so on).

### Vanilla String

- string, substring, regexp_extract, locate, left, concat_ws

### JSON String

- json_tuple
- get_json_object
- from_json

### StructType



### ArrayType

- array
- element_at
- array_min, array_max, array_join, array_intersect, array_except, array_distinct, array_contains, array_position, array_remove, array_repeat, array_sort, array_union, arrays_overlap, arrays_zip


In [1]:
from pathlib import Path
import findspark
findspark.init(str(next(Path("/opt").glob("spark-3*"))))

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName("PySpark_Str_Func") \
    .enableHiveSupport().getOrCreate()

## explode

In [4]:
spark.sql("""
    select
        split("how are you", " ") as words
    """).show()

+---------------+
|          words|
+---------------+
|[how, are, you]|
+---------------+



In [3]:
spark.sql("""
    select
        explode(split("how are you", " ")) as words
    """).show()

+-----+
|words|
+-----+
|  how|
|  are|
|  you|
+-----+



## collect

## Work with StructType

Notice that a Scala Tuple is converted to a StructType in Spark DataFrames, with fields named `_1`, `_2`, and so on.

In [6]:
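// toDF and the $"..." column syntax come from the session implicits
// (spark-shell imports them automatically).
import spark.implicits._

val df = Seq(
    ((1, 2), "how"),
    ((2, 3), "are"),
    ((3, 4), "you")
).toDF("col1", "col2")
df.show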

+------+----+
|  col1|col2|
+------+----+
|[1, 2]| how|
|[2, 3]| are|
|[3, 4]| you|
+------+----+




Split all elements of a StructType into different columns.

In [15]:
df.select("col1.*").show

+---+---+
| _1| _2|
+---+---+
|  1|  2|
|  2|  3|
|  3|  4|
+---+---+



Extract fields from a StructType by name (`_1` and `_2` for a tuple, matching the element positions) and rename the columns.

In [16]:
df.select(
    $"col1._1".alias("v1"),
    $"col1._2".alias("v2")
).show

+---+---+
| v1| v2|
+---+---+
|  1|  2|
|  2|  3|
|  3|  4|
+---+---+



## Work with ArrayType

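Notice that a Scala Array is converted to an ArrayType in Spark DataFrames.
Elements can then be extracted by position with `element_at`, which uses 1-based indices.
Note: `element_at` and most of the `array_*` functions require Spark 2.4.0+; ArrayType itself is available in earlier versions.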

In [17]:
val df = Seq(
    (Array(1, 2), "how"),
    (Array(2, 3), "are"),
    (Array(3, 4), "you")
).toDF("col1", "col2")
df.show

+------+----+
|  col1|col2|
+------+----+
|[1, 2]| how|
|[2, 3]| are|
|[3, 4]| you|
+------+----+




In [22]:
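// element_at is 1-based and lives in org.apache.spark.sql.functions
import org.apache.spark.sql.functions.element_at

df.select(
    element_at($"col1", 1).alias("v1"),
    element_at($"col1", 2).alias("v2")
).show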

+---+---+
| v1| v2|
+---+---+
|  1|  2|
|  2|  3|
|  3|  4|
+---+---+



## References

https://docs.databricks.com/_static/notebooks/transform-complex-data-types-scala.html

https://stackoverflow.com/questions/45789489/how-to-split-a-list-to-multiple-columns-in-pyspark?noredirect=1&lq=1

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html

https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/functions.html

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html