# 3.4 Read and write JSON files
The official documentation can be found [here](https://spark.apache.org/docs/latest/sql-data-sources-json.html). Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row] (a DataFrame in PySpark). This conversion can be done using SparkSession.read.json() on either a Dataset[String] or a JSON file.


You may encounter three different situations when you read JSON files:
- A text file containing complete JSON objects, one per line. This is typical when you are loading JSON files to Databricks tables.
- A text file containing various fields (columns) of data, one of which is a JSON object. This is often seen in computer logs, where some plain-text metadata is followed by more detail in a JSON string.
- A variation of the above where the JSON field is an array of objects.

In [1]:
from pyspark.sql import SparkSession
import os

In [2]:
local = True
if local:
    spark = SparkSession.builder.master("local[4]") \
        .appName("ReadWriteJson").getOrCreate()
else:
    spark = SparkSession.builder \
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("ReadWriteJson")\
        .config("spark.kubernetes.container.image", os.environ["IMAGE_NAME"])\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "8g")\
        .config("spark.kubernetes.driver.pod.name", os.environ["POD_NAME"])\
        .config('spark.jars.packages', 'org.postgresql:postgresql:42.2.24')\
        .getOrCreate()

22/02/18 08:56:01 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.184.146 instead (on interface ens33)
22/02/18 08:56:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/02/18 08:56:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


## 3.4.1 Read standard json file

Standard JSON text files look like this:
```text
{ "Text1":"hello", "Text2":"goodbye", "Num1":5, "Array1":[7,8,9] }
{ "Text1":"this", "Text2":"that", "Num1":6.6, "Array1":[77,88,99] }
{ "Text1":"yes", "Text2":"no", "Num1":-0.03, "Array1":[555,444,222] }
```
The file contains a sequence of {} objects, and each {} becomes one row of your dataframe. Each field name becomes a column name, and each field value becomes the row value for that column. Note that a value can be a string, a number, an array, or an object (an embedded struct type).
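To make the "one object per line" idea concrete, here is a minimal sketch that parses the sample above with Python's standard json module instead of Spark. The variable names are only illustrative. It shows that every line is a complete record and that Num1 mixes integer and fractional values, which is why a schema inferring reader has to pick a type wide enough for both.

```python
import json

# The same three records as in the sample above, one JSON object per line.
json_lines = """\
{ "Text1":"hello", "Text2":"goodbye", "Num1":5, "Array1":[7,8,9] }
{ "Text1":"this", "Text2":"that", "Num1":6.6, "Array1":[77,88,99] }
{ "Text1":"yes", "Text2":"no", "Num1":-0.03, "Array1":[555,444,222] }
"""

# Each non-empty line is parsed on its own, independently of the others.
rows = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

# Field names play the role of column names.
columns = sorted(rows[0].keys())
print(columns)

# The Python type of Num1 differs from row to row (int vs float).
num1_types = [type(row["Num1"]).__name__ for row in rows]
print(num1_types)
```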

Below is an example of how to read a standard JSON file. Note that, unlike the CSV reader, where schema inference must be requested with inferSchema, the JSON reader infers the schema by default.

In [None]:
file_path = "data/json/adult.json"

df = spark.read.json(file_path)

In [4]:
df.show(5)

+---+------------+------------+---------+-------------+------+--------------+------+------------------+--------------+-----------------+-----+-------------+------+----------------+
|age|capital-gain|capital-loss|education|education-num|fnlwgt|hours-per-week|income|    marital-status|native-country|       occupation| race| relationship|   sex|       workclass|
+---+------------+------------+---------+-------------+------+--------------+------+------------------+--------------+-----------------+-----+-------------+------+----------------+
| 39|        2174|           0|Bachelors|           13| 77516|            40| <=50K|     Never-married| United-States|     Adm-clerical|White|Not-in-family|  Male|       State-gov|
| 50|           0|           0|Bachelors|           13| 83311|            13| <=50K|Married-civ-spouse| United-States|  Exec-managerial|White|      Husband|  Male|Self-emp-not-inc|
| 38|           0|           0|  HS-grad|            9|215646|            40| <=50K|          D

In [5]:
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- capital-gain: long (nullable = true)
 |-- capital-loss: long (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: long (nullable = true)
 |-- fnlwgt: long (nullable = true)
 |-- hours-per-week: long (nullable = true)
 |-- income: string (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- native-country: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- race: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- workclass: string (nullable = true)



## 3.4.2 Read multi line json file

As mentioned in the previous section, a standard JSON file has the following format:
```text
{record_1}
{record_2}
{record_3}
```
And each record shares the same schema.


A multi-line JSON file has the following form:
```text
[
{record_1},
{record_2},
{record_3}
]
```

The records are elements of a single JSON array, and the array spans several lines.

By default, the PySpark JSON reader accepts only the standard one-record-per-line format. To read JSON records that span multiple lines, we must activate the "multiline" option, which is set to false by default.

Try setting the multiline option to false and see what happens.

In [9]:
multi_line_path = "data/json/zipcode.json"

df1 = spark.read.option("multiline", "true").json(multi_line_path)
df1.show()

+-------------------+------------+-----+-----------+-------+
|               City|RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|           2|   PR|   STANDARD|    704|
|       BDA SAN LUIS|          10|   PR|   STANDARD|    709|
+-------------------+------------+-----+-----------+-------+
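The reason the multiline option matters can be sketched without Spark. The snippet below uses only Python's json module and a small hypothetical document built from two of the zipcode values shown above. Parsing line by line, which is roughly what a one-record-per-line reader does, fails on every line of a pretty-printed array, while parsing the whole text at once succeeds. With multiline set to false, Spark would be expected to flag such lines as corrupt records rather than raise an error.

```python
import json

# Hypothetical multi-line document: one JSON array, pretty-printed.
source_records = [
    {"City": "PASEO COSTA DEL SUR", "Zipcode": 704},
    {"City": "BDA SAN LUIS", "Zipcode": 709},
]
multi_line_doc = json.dumps(source_records, indent=2)


def parse_line_by_line(text):
    """Treat every line as a standalone JSON record, collecting failures."""
    records, bad_lines = [], []
    for line in text.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(line)
    return records, bad_lines


# One-record-per-line view: no line is a complete JSON value.
records, bad_lines = parse_line_by_line(multi_line_doc)
print(len(records), len(bad_lines))

# Whole-document view: the array parses into two records.
whole = json.loads(multi_line_doc)
print(len(whole))
```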



## 3.4.3 Read multiple files

We can also read multiple JSON files at the same time by passing the file paths as a list. Below is an example.

In [10]:
file1 = "data/json/zipcode.json"
file2 = "data/json/zipcode1.json"
file3 = "data/json/zipcode2.json"

In [15]:
df2 = spark.read.option("multiline", "true").json([file1, file2, file3])
df2.show()

+-------------------+------------+-----+-----------+-------+
|               City|RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|           2|   PR|   STANDARD|    704|
|       BDA SAN LUIS|          10|   PR|   STANDARD|    709|
|PASEO COSTA DEL SUR|           2|   PR|   STANDARD|    704|
|       BDA SAN LUIS|          10|   PR|   STANDARD|    709|
|PASEO COSTA DEL SUR|           2|   PR|   STANDARD|    704|
|       BDA SAN LUIS|          10|   PR|   STANDARD|    709|
+-------------------+------------+-----+-----------+-------+



## 3.4.4 Read text file with JSON field

You may encounter a CSV file that contains a JSON column.

In [18]:
message_file_path = "data/json/message.csv"
df3 = spark.read.csv(message_file_path, sep="|", header=True, inferSchema=True)
df3.show(5, truncate=False)

+-----+-------+-----+------------------------+
|Text1|Text2  |size |user                    |
+-----+-------+-----+------------------------+
|hello|goodbye|5.0  |{"name":"john","age":3} |
|this |that   |6.6  |{"name":"betty","age":4}|
|yes  |no     |-0.03|{"name":"bobby","age":5}|
+-----+-------+-----+------------------------+



In [17]:
df3.printSchema()

root
 |-- Text1: string (nullable = true)
 |-- Text2: string (nullable = true)
 |-- size: double (nullable = true)
 |-- user: string (nullable = true)



Note that in this schema the JSON field is read as a string, so we can't access the name and age fields of the user column. We need to convert the JSON field into a struct column, which we can do with the from_json function.

In [20]:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True),

])
df4 = df3.withColumn("struct_user", from_json(col("user"), schema=schema))
df4.show(5, truncate=False)

+-----+-------+-----+------------------------+-----------+
|Text1|Text2  |size |user                    |struct_user|
+-----+-------+-----+------------------------+-----------+
|hello|goodbye|5.0  |{"name":"john","age":3} |{john, 3}  |
|this |that   |6.6  |{"name":"betty","age":4}|{betty, 4} |
|yes  |no     |-0.03|{"name":"bobby","age":5}|{bobby, 5} |
+-----+-------+-----+------------------------+-----------+



In [21]:
df4.printSchema()

root
 |-- Text1: string (nullable = true)
 |-- Text2: string (nullable = true)
 |-- size: double (nullable = true)
 |-- user: string (nullable = true)
 |-- struct_user: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = true)



Now the column struct_user is a struct column, and we can access its fields directly. The query below is an example.

In [22]:
df4.select(col("struct_user.name").alias("user_name"), col("struct_user.age").alias("user_age")).show()

+---------+--------+
|user_name|user_age|
+---------+--------+
|     john|       3|
|    betty|       4|
|    bobby|       5|
+---------+--------+



### 3.4.4.1 Multiple records in JSON field

In the example below, the user column contains a list of JSON records. Unlike read.json(), the from_json() function has no multiline option.

In [42]:
multi_line_field_json = "data/json/message1.csv"
df_multi_line_field = spark.read.csv(multi_line_field_json, header=True, sep="|", inferSchema=True)
df_multi_line_field.show(truncate=False)

+-----+-------+-----+------------------------------------------------------+
|Text1|Text2  |size |user                                                  |
+-----+-------+-----+------------------------------------------------------+
|hello|goodbye|5.0  |[{"name":"stop", "age":3}, {"name":"go", "age":6}]    |
|this |that   |6.6  |[{"name":"eggs", "age":4}, {"name":"bacon", "age":8}] |
|yes  |no     |-0.03|[{"name":"apple", "age":5}, {"name":"pear", "age":10}]|
+-----+-------+-----+------------------------------------------------------+



In [43]:
df_multi_line_field.withColumn("struct_user", from_json(col("user"), schema=schema)).show()

+-----+-------+-----+--------------------+------------+
|Text1|  Text2| size|                user| struct_user|
+-----+-------+-----+--------------------+------------+
|hello|goodbye|  5.0|[{"name":"stop", ...|{null, null}|
| this|   that|  6.6|[{"name":"eggs", ...|{null, null}|
|  yes|     no|-0.03|[{"name":"apple",...|{null, null}|
+-----+-------+-----+--------------------+------------+



Notice that from_json returns only nulls here, because the struct schema defined earlier does not match a JSON array. One way to handle this is to write a small UDF that parses the array with Python's json library.

In [46]:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType
import json

# Schema for the array of JSON objects.
json_array_schema = ArrayType(
    StructType([
        StructField('name', StringType(), nullable=False),
        StructField('age', IntegerType(), nullable=False)
    ])
)


# Parse the JSON array with the standard Python json library.
def parse_json(array_str):
    # Propagate nulls instead of failing on missing values.
    if array_str is None:
        return None
    return [(item['name'], item['age']) for item in json.loads(array_str)]


# Create a UDF whose return type is the array schema defined above.
parse_json_udf = udf(parse_json, json_array_schema)

# Use the UDF to change the JSON string into a true array of structs.
df_success=df_multi_line_field.withColumn("struct_user", parse_json_udf((col("user"))))
df_success.show(truncate=False)

+-----+-------+-----+------------------------------------------------------+------------------------+
|Text1|Text2  |size |user                                                  |struct_user             |
+-----+-------+-----+------------------------------------------------------+------------------------+
|hello|goodbye|5.0  |[{"name":"stop", "age":3}, {"name":"go", "age":6}]    |[{stop, 3}, {go, 6}]    |
|this |that   |6.6  |[{"name":"eggs", "age":4}, {"name":"bacon", "age":8}] |[{eggs, 4}, {bacon, 8}] |
|yes  |no     |-0.03|[{"name":"apple", "age":5}, {"name":"pear", "age":10}]|[{apple, 5}, {pear, 10}]|
+-----+-------+-----+------------------------------------------------------+------------------------+



In [47]:
df_success.printSchema()

root
 |-- Text1: string (nullable = true)
 |-- Text2: string (nullable = true)
 |-- size: double (nullable = true)
 |-- user: string (nullable = true)
 |-- struct_user: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = false)
 |    |    |-- age: integer (nullable = false)



## 3.4.5 Infer json schema

In the above examples, we gave the schema of the JSON column explicitly. We can also ask Spark to infer the schema for us by using the schema_of_json() function on a sample JSON string.

In [34]:
from pyspark.sql.functions import schema_of_json, lit

# get the first row
user_row = df3.select("user").first()
# get the value of column user
json_str = user_row.user
print(json_str)

{"name":"john","age":3}


In [35]:
df_schema = spark.range(1)
df_schema.select(schema_of_json(lit(json_str)).alias("json")).collect()

[Row(json='STRUCT<`age`: BIGINT, `name`: STRING>')]

## 3.4.6 Convert struct column to json string

We have seen how to convert a JSON string to a struct column. We can also convert a struct column back to a JSON string by using the to_json() function.

In [36]:
from pyspark.sql.functions import to_json

df4.withColumn("json_str", to_json("struct_user")).show()

+-----+-------+-----+--------------------+-----------+--------------------+
|Text1|  Text2| size|                user|struct_user|            json_str|
+-----+-------+-----+--------------------+-----------+--------------------+
|hello|goodbye|  5.0|{"name":"john","a...|  {john, 3}|{"name":"john","a...|
| this|   that|  6.6|{"name":"betty","...| {betty, 4}|{"name":"betty","...|
|  yes|     no|-0.03|{"name":"bobby","...| {bobby, 5}|{"name":"bobby","...|
+-----+-------+-----+--------------------+-----------+--------------------+



## 3.4.7 Convert json string column to flat column

We have seen how to convert a JSON string to a struct column. We can also convert a JSON field directly to flat columns. For example, we can create the two columns name and age by using json_tuple(). Note that the generated columns get default names (c0, c1) and need to be renamed, e.g. with toDF().

In [38]:
from pyspark.sql.functions import json_tuple

df_flat_field = df3.select("Text1", "Text2", "size", json_tuple(col("user"), "name", "age"))
df_flat_field.show(truncate=False)

+-----+-------+-----+-----+---+
|Text1|Text2  |size |c0   |c1 |
+-----+-------+-----+-----+---+
|hello|goodbye|5.0  |john |3  |
|this |that   |6.6  |betty|4  |
|yes  |no     |-0.03|bobby|5  |
+-----+-------+-----+-----+---+



You can also use get_json_object(), which extracts a single value by JSON path and returns it as a string.

In [39]:
from pyspark.sql.functions import get_json_object

df_age = df3.select("Text1", "Text2", "size", get_json_object(col("user"), "$.age").alias("user_age"))
df_age.show()

+-----+-------+-----+--------+
|Text1|  Text2| size|user_age|
+-----+-------+-----+--------+
|hello|goodbye|  5.0|       3|
| this|   that|  6.6|       4|
|  yes|     no|-0.03|       5|
+-----+-------+-----+--------+



                                                                                

## 3.4.8 Write json file

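Writing a JSON file is quite simple. Note that Spark writes a directory of part files at the given path, and that the overwrite mode replaces any existing output. Below is an example.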

In [51]:
out_path="/tmp/output_test"
df_success.write.mode("overwrite").json(out_path)