# 2.1 Spark Schema

A `Spark schema` describes the structure of a DataFrame or Dataset. We define it with the `StructType` class, which is a collection of `StructField` objects. Each `StructField` defines the **column name (String), column type (DataType), nullable flag (Boolean) and metadata (Metadata)**.

The column type can be any of the data types provided by Spark. For the full list, see the [Spark SQL data types reference](https://spark.apache.org/docs/latest/sql-ref-datatypes.html).

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import json
import os

In [4]:
local=True
if local:
    spark=SparkSession.builder.master("local[4]") \
                  .appName("Dataframe_schema").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("Dataframe_schema") \
                      .config("spark.kubernetes.container.image","inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName",os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.driver.pod.name", os.environ["POD_NAME"]) \
                      .config('spark.jars.packages','org.postgresql:postgresql:42.2.24') \
                      .getOrCreate()


In [11]:
data=[("James",None,"Smith","36636","M",3000),
    ("Michael","Rose",None,"40288","M",4000),
    ("Robert",None,"Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Bn",None,"F",-1)]

## 2.1.1 Define a simple schema

In the example below, we define a simple schema by using `StructType`, which contains a list of `StructField`. In each `StructField`, we define the column name (String), column type (DataType), nullable flag (Boolean) and, optionally, metadata. In Scala, metadata is built with `MetadataBuilder`; in PySpark, `StructField` takes an optional `metadata` argument that is a plain Python dict.

Below is an example of how to define metadata in a `StructField` in Scala:

```scala
import org.apache.spark.sql.types.{LongType, MetadataBuilder, StructField}

// build a Metadata object that holds a single key/value pair
val metadata = new MetadataBuilder()
  .putString("comment", "this is a comment")
  .build()

// attach the metadata to a non-nullable column named "id"
val f = new StructField(name = "id", dataType = LongType, nullable = false, metadata = metadata)
```

In [12]:
schema=StructType([
    StructField("firstname",StringType(),True),
    StructField("middlename",StringType(),True),
    StructField("lastname",StringType(),True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

In [13]:
df=spark.createDataFrame(data=data,schema=schema)
df.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    James|      null|   Smith|36636|     M|  3000|
|  Michael|      Rose|    null|40288|     M|  4000|
|   Robert|      null|Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|      Bn| null|     F|    -1|
+---------+----------+--------+-----+------+------+



In [14]:
# check the schema of a dataframe
df.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



### 2.1.1.2 Nullable column

If we declare a column as non-nullable (`nullable=False`) and the data contains a null value in that column, Spark cannot create the DataFrame.

In the example below, three columns of `data` contain null values: middlename, lastname and id. We set the middlename column as non-nullable, and Spark raises the exception "field middlename: This field is not nullable, but got None".

In [15]:
schema_null=StructType([
    StructField("firstname",StringType(),True),
    StructField("middlename",StringType(),False),
    StructField("lastname",StringType(),True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

In [16]:
df_null=spark.createDataFrame(data=data,schema=schema_null)
df_null.show()

ValueError: field middlename: This field is not nullable, but got None

## 2.1.2 Define a nested structure schema

Spark DataFrames often need to work with nested struct columns. In the example below, we group the firstname, middlename and lastname columns into a single nested column called name that contains three fields.

To define `nested_struct_data`, we group firstname, middlename and lastname into a **nested tuple that becomes the name struct inside each employee row**.

To define `nested_schema`, we first define a `StructType` called `name`, then, in the main schema, we create a `StructField` whose type is `name`.

In [7]:

nested_struct_data=[
    (("James","","Smith"),"36636","M",3100),
    (("Michael","Rose",""),"40288","M",4300),
    (("Robert","","Williams"),"42114","M",1400),
    (("Maria","Anne","Jones"),"39192","F",5500),
    (("Jen","Mary","Bn"),"","F",-1)    
]

In [8]:
# first, we build the nested StructType called name
name=StructType([
    StructField("firstname",StringType(),True),
    StructField("middlename",StringType(),True),
    StructField("lastname",StringType(),True)])

# then, we define the column name with the nested StructType as its type
nested_schema=StructType([
    StructField("name",name,False),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

In [9]:
nested_df=spark.createDataFrame(data=nested_struct_data,schema=nested_schema)
nested_df.show()

+--------------------+-----+------+------+
|                name|   id|gender|salary|
+--------------------+-----+------+------+
|    {James, , Smith}|36636|     M|  3100|
|   {Michael, Rose, }|40288|     M|  4300|
|{Robert, , Williams}|42114|     M|  1400|
|{Maria, Anne, Jones}|39192|     F|  5500|
|     {Jen, Mary, Bn}|     |     F|    -1|
+--------------------+-----+------+------+



In [10]:
nested_df.printSchema()

root
 |-- name: struct (nullable = false)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



## 2.1.3 Write schema to json file

In the examples above, we used `StructType` and `StructField` to define the schema by hand. If the DataFrame has many columns, this is time-consuming. Instead, we can write and read the schema of a DataFrame as JSON.

Let's start with how to export a schema. As seen in the previous sections, we can print the schema with the `printSchema()` method. To get the schema as a `StructType`, we use `DataFrame.schema`.


In [13]:
out_schema=nested_df.schema
print(type(out_schema))

<class 'pyspark.sql.types.StructType'>


Notice that the type of `out_schema` is `pyspark.sql.types.StructType`. We can also get the list of columns (a list of `StructField`) with `DataFrame.schema.fields`.

In [14]:
out_fields=nested_df.schema.fields
print(type(out_fields))

<class 'list'>


In [15]:
for field in out_fields:
    print(field)

StructField(name,StructType(List(StructField(firstname,StringType,true),StructField(middlename,StringType,true),StructField(lastname,StringType,true))),false)
StructField(id,StringType,true)
StructField(gender,StringType,true)
StructField(salary,IntegerType,true)


Now we have all the information we need. The last step is to convert the schema (`StructType`) to JSON, and PySpark already provides a method for this. In the example below, we get the schema as a JSON string, then write it to a JSON file.

In [11]:
json_schema=nested_df.schema.json()
print(json_schema)

{"fields":[{"metadata":{},"name":"name","nullable":false,"type":{"fields":[{"metadata":{},"name":"firstname","nullable":true,"type":"string"},{"metadata":{},"name":"middlename","nullable":true,"type":"string"},{"metadata":{},"name":"lastname","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"id","nullable":true,"type":"string"},{"metadata":{},"name":"gender","nullable":true,"type":"string"},{"metadata":{},"name":"salary","nullable":true,"type":"integer"}],"type":"struct"}


In [12]:
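json_file_path="data/schema.json"
# create the target directory if it does not exist yet
os.makedirs(os.path.dirname(json_file_path), exist_ok=True)
# the with statement closes the file automatically
with open(json_file_path, "w") as f:
    f.write(json_schema)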

## 2.1.4 Read schema from a json file

In the previous section we wrote a schema to a JSON file. Now let's build a schema from that file.

As a first step, we read the JSON string from the file.

In [17]:
with open(json_file_path,"r") as f:
    json_string_schema=f.read()

print(json_string_schema)

{
  "fields": [
    {
      "metadata": {},
      "name": "name",
      "nullable": false,
      "type": {
        "fields": [
          {
            "metadata": {},
            "name": "firstname",
            "nullable": true,
            "type": "string"
          },
          {
            "metadata": {},
            "name": "middlename",
            "nullable": true,
            "type": "string"
          },
          {
            "metadata": {},
            "name": "lastname",
            "nullable": true,
            "type": "string"
          }
        ],
        "type": "struct"
      }
    },
    {
      "metadata": {},
      "name": "id",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "gender",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "salary",
      "nullable": true,
      "type": "integer"
    }
  ],
  "type": "struct"
}


In [25]:
# second step, we parse the json string into a python dict
schema_dict=json.loads(json_string_schema)

In [26]:
# third step, we build a StructType from the dict
loaded_json_schema=StructType.fromJson(schema_dict)

In [36]:
print(loaded_json_schema)

StructType(List(StructField(name,StructType(List(StructField(firstname,StringType,true),StructField(middlename,StringType,true),StructField(lastname,StringType,true))),false),StructField(id,StringType,true),StructField(gender,StringType,true),StructField(salary,IntegerType,true)))


In [28]:
df_with_json_schema=spark.createDataFrame(data=nested_struct_data,schema=loaded_json_schema)
df_with_json_schema.show(5)

+--------------------+-----+------+------+
|                name|   id|gender|salary|
+--------------------+-----+------+------+
|    {James, , Smith}|36636|     M|  3100|
|   {Michael, Rose, }|40288|     M|  4300|
|{Robert, , Williams}|42114|     M|  1400|
|{Maria, Anne, Jones}|39192|     F|  5500|
|     {Jen, Mary, Bn}|     |     F|    -1|
+--------------------+-----+------+------+



In [29]:
df_with_json_schema.printSchema()

root
 |-- name: struct (nullable = false)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)



## 2.1.5 Write schema to DDL String

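Spark also provides an API to export and import a schema as a DDL string. The PySpark version used in this notebook does not implement `StructType.fromDDL`, as the error at the end of this section shows; more recent PySpark releases may provide it, so check the documentation of your version. For the Scala API, see the code example below.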

```scala
import org.apache.spark.sql.types.StructType

// generate a ddl string from a schema
val ddlSchemaStr = nested_df.schema.toDDL
// For a schema with the fields fullName, age and gender, it returns a string like the one below
// "`fullName` STRUCT<`first`: STRING, `last`: STRING, `middle`: STRING>,`age` INT,`gender` STRING"

// generate a schema from a ddl string
val ddlSchema = StructType.fromDDL(ddlSchemaStr)
ddlSchema.printTreeString()
```

In [34]:
ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING, `middle`: STRING>,`age` INT,`gender` STRING"
# in this PySpark version, StructType only provides fromJson(); fromDDL() is not implemented
ddlSchema = StructType.fromDDL(ddlSchemaStr)
ddlSchema.printTreeString()

AttributeError: type object 'StructType' has no attribute 'fromDDL'