# StructType & StructField
The `StructType` and `StructField` classes are used to programmatically specify the schema of a DataFrame and to create complex columns such as nested struct, array, and map columns. A `StructType` is a collection of `StructField` objects, each of which defines a column name, a column data type, a boolean that specifies whether the field is nullable, and metadata.

## StructType – Defines the structure of the Dataframe
PySpark provides the `StructType` class (`from pyspark.sql.types import StructType`) to define the structure of the DataFrame.

`StructType` is a collection or list of `StructField` objects.

The `printSchema()` method on a DataFrame shows `StructType` columns as `struct`.

## StructField – Defines the metadata of the DataFrame column
PySpark provides the `StructField` class (`from pyspark.sql.types import StructField`) to define a column. Each `StructField` holds the column name (a string), the column type (a `DataType`), a nullable flag (a boolean), and optional metadata (a dictionary).

## Using PySpark StructType & StructField with DataFrame
When creating a PySpark DataFrame, we can specify its structure using the `StructType` and `StructField` classes. As described above, a `StructType` is a collection of `StructField` objects, each defining a column name, a data type, and a nullable flag. A `StructField` can also hold a nested `StructType`, an `ArrayType` for arrays, or a `MapType` for key-value pairs.

The example below shows how to define a schema with `StructType` and `StructField` and apply it to a DataFrame built from sample data.

In [29]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

spark = SparkSession.builder.appName('structtype').getOrCreate()

data = [("James","","Smith","36636","M",3000),
        ("Michael","Rose","","40288","M",4000),
        ("Robert","","Williams","42114","M",4000),
        ("Maria","Anne","Jones","39192","F",4000),
        ("Jen","Mary","Brown","","F",-1)
       ]

# Line continuations are not needed inside brackets
schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True),
        StructField("lastname", StringType(), True),
        StructField("id", StringType(), True),
        StructField("gender", StringType(), True),
        StructField("salary", IntegerType(), True)
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



## Defining a Nested StructType
A nested struct column is defined by passing a `StructType` as the data type of a `StructField`.

In [30]:
structureData = [
                (("James","","Smith"),"36636","M",3100),
                (("Michael","Rose",""),"40288","M",4300),
                (("Robert","","Williams"),"42114","M",1400),
                (("Maria","Anne","Jones"),"39192","F",5500),
                (("Jen","Mary","Brown"),"","F",-1)
                ]
structureSchema = StructType([
                    StructField('name', 
                            StructType([
                            StructField('firstname', StringType(), True),
                            StructField('middlename', StringType(), True),
                            StructField('lastname', StringType(), True)
                                ])),
                    StructField('id', StringType(), True),
                    StructField('gender', StringType(), True),
                    StructField('salary', IntegerType(), True)
                    ])

df2 = spark.createDataFrame(data=structureData,schema=structureSchema)
df2.printSchema()
df2.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|name                |id   |gender|salary|
+--------------------+-----+------+------+
|{James, , Smith}    |36636|M     |3100  |
|{Michael, Rose, }   |40288|M     |4300  |
|{Robert, , Williams}|42114|M     |1400  |
|{Maria, Anne, Jones}|39192|F     |5500  |
|{Jen, Mary, Brown}  |     |F     |-1    |
+--------------------+-----+------+------+



## Adding & Changing struct of the DataFrame
Using the PySpark SQL function `struct()`, we can change the structure of an existing DataFrame and add a new `StructType` column to it. The example below copies existing columns into a new struct column, adds a derived field to it, and drops the original columns. The PySpark `Column` class also provides functions for working with `StructType` columns.

In [31]:
from pyspark.sql.functions import col,struct,when

df2.printSchema()

updatedDF = df2.withColumn("OtherInfo", 
                            struct( col("id").alias("identifier"),
                                    col("gender").alias("gender"),
                                    col("salary").alias("salary"),
                                    when(col("salary").cast(IntegerType()) < 2000,"Low")
                                    .when(col("salary").cast(IntegerType()) < 4000,"Medium")
                                    .otherwise("High").alias("Salary_Grade")
                                    )
                            ).drop("id","gender","salary")

updatedDF.printSchema()
updatedDF.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- OtherInfo: struct (nullable = false)
 |    |-- identifier: string (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- Salary_Grade: string (nullable = false)

+--------------------+------------------------+
|name                |OtherInfo               |
+--------------------+------------------------+
|{James, , Smith}    |{36636, M, 3100, Medium}|
|{Michael, Rose, }   |{40288, M, 4300, High}  |
|{Robert, , Williams}|{42114, M, 1400, Low}   |
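|{Maria, Anne, Jones}|{39192, F, 5500, High}  |
|{Jen, Mary, Brown}  |{, F, -1, Low}          |
+--------------------+------------------------+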

## Using SQL ArrayType and MapType
A `StructType` schema can also use `ArrayType` and `MapType` to define columns that hold arrays and maps, respectively. In the example below, the column `hobbies` is defined as `ArrayType(StringType())` and the column `properties` as `MapType(StringType(), IntegerType())`, meaning the keys are strings and the values are integers.

In [32]:
from pyspark.sql.types import ArrayType, MapType

arrayStructureSchema = StructType([
                                    StructField('name', 
                                            StructType([
                                                StructField('firstname', StringType(), True),
                                                StructField('middlename', StringType(), True),
                                                StructField('lastname', StringType(), True)
                                                        ])
                                                ),
                                    StructField('hobbies', ArrayType(StringType()), True),
                                    StructField('properties', MapType(StringType(),IntegerType()), True)
                                    ])

structureData = [
                (("James","","Smith"),("soccer", "tennis"),{"IT":3100}),
                (("Michael","Rose",""),("football", "chess"),{"Sales":4300}),
                (("Robert","","Williams"),("swimming", "hikimg"),{"Marketing":1400}),
                (("Maria","Anne","Jones"),("soccer", "hiking"),{"CTO":5500}),
                (("Jen","Mary","Brown"),("poker", "gardening"),{"Cleaning":-1})
                ]

df3 = spark.createDataFrame(data=structureData,schema=arrayStructureSchema)
df3.printSchema()
df3.show(truncate=False)                                    

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- hobbies: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

+--------------------+------------------+-------------------+
|name                |hobbies           |properties         |
+--------------------+------------------+-------------------+
|{James, , Smith}    |[soccer, tennis]  |{IT -> 3100}       |
|{Michael, Rose, }   |[football, chess] |{Sales -> 4300}    |
|{Robert, , Williams}|[swimming, hikimg]|{Marketing -> 1400}|
|{Maria, Anne, Jones}|[soccer, hiking]  |{CTO -> 5500}      |
|{Jen, Mary, Brown}  |[poker, gardening]|{Cleaning -> -1}   |
+--------------------+------------------+-------------------+



## Creating a StructType from a JSON File
If you have many columns and the structure of the DataFrame changes now and then, it is good practice to load the `StructType` schema from a JSON file. You can get the schema as a JSON string with `df2.schema.json()`, store it in a file, and later rebuild the schema from that file.

In [33]:
import json

print(json.dumps(json.loads(df2.schema.json()), indent=4))

with open('./resources/json_files/data_schema.json', 'w') as f:
    f.write(df2.schema.json())

{
    "fields": [
        {
            "metadata": {},
            "name": "name",
            "nullable": true,
            "type": {
                "fields": [
                    {
                        "metadata": {},
                        "name": "firstname",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "middlename",
                        "nullable": true,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "lastname",
                        "nullable": true,
                        "type": "string"
                    }
                ],
                "type": "struct"
            }
        },
        {
            "metadata": {},
            "name": "id",
            "nullable": true,
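            "type": "string"
        },
        {
            "metadata": {},
            "name": "gender",
            "nullable": true,
            "type": "string"
        },
        {
            "metadata": {},
            "name": "salary",
            "nullable": true,
            "type": "integer"
        }
    ],
    "type": "struct"
}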

In [34]:
json_file_path = './resources/json_files/data_schema.json'

# Read the saved schema back and rebuild the StructType from it
with open(json_file_path, 'r') as j:
    schemaFromJson = StructType.fromJson(json.load(j))

# Reuse df2's rows, which match the saved schema
# (structureData was redefined earlier with different columns)
df4 = spark.createDataFrame(df2.rdd, schemaFromJson)
df4.printSchema()

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



## Checking if a Column Exists in a DataFrame
To run checks against the metadata of a DataFrame, for example whether a column or field exists or what data type a column has, we can use the methods available on `StructType` and `StructField`.

In [35]:
print("name" in df4.schema.fieldNames())
print(df4.schema)

True
StructType([StructField('name', StructType([StructField('firstname', StringType(), True), StructField('middlename', StringType(), True), StructField('lastname', StringType(), True)]), True), StructField('id', StringType(), True), StructField('gender', StringType(), True), StructField('salary', IntegerType(), True)])


## Convert StructType (struct) to Dictionary/MapType (map)
PySpark provides a `create_map()` function that takes alternating key and value columns as arguments and returns a `MapType` column, so we can use it to convert a DataFrame struct column to a map. A struct column has type `StructType`, while `MapType` stores key-value pairs, much like a Python dictionary.

In [36]:
data = [("36636","Finance",(3000,"USA")), 
        ("40288","Finance",(5000,"IND")), 
        ("42114","Sales",(3900,"USA")), 
        ("39192","Marketing",(2500,"CAN")), 
        ("34534","Sales",(6500,"USA")) ]

schema = StructType([
            StructField('id', StringType(), True),
            StructField('dept', StringType(), True),
            StructField('properties', StructType([
                        StructField('salary', IntegerType(), True),
                        StructField('location', StringType(), True)
                            ])
                        )
     ])

df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- properties: struct (nullable = true)
 |    |-- salary: integer (nullable = true)
 |    |-- location: string (nullable = true)

+-----+---------+-----------+
|id   |dept     |properties |
+-----+---------+-----------+
|36636|Finance  |{3000, USA}|
|40288|Finance  |{5000, IND}|
|42114|Sales    |{3900, USA}|
|39192|Marketing|{2500, CAN}|
|34534|Sales    |{6500, USA}|
+-----+---------+-----------+



### Convert StructType to MapType (map) Column
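The PySpark SQL function `create_map()` builds a `MapType` column from key and value columns. In the example below, the struct fields are first flattened into top-level columns and then combined into a map.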

In [37]:
#Convert struct type to Map
from pyspark.sql.functions import col,lit,create_map

df1 = df.withColumn('location', df.properties.location).withColumn('salary', df.properties.salary).drop("properties")

df1.show()

# All values in a map share one type, so cast the integer salary to string to match location
df2 = df1.withColumn("propertiesMap",
                    create_map( lit("location"),col("location"),
                                lit("salary"),col("salary").cast("string")
                    )
        )

df2.printSchema()

+-----+---------+--------+------+
|   id|     dept|location|salary|
+-----+---------+--------+------+
|36636|  Finance|     USA|  3000|
|40288|  Finance|     IND|  5000|
|42114|    Sales|     USA|  3900|
|39192|Marketing|     CAN|  2500|
|34534|    Sales|     USA|  6500|
+-----+---------+--------+------+

root
 |-- id: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- propertiesMap: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

