### PySpark JSON Functions
PySpark JSON functions let you query or extract elements from a JSON string stored in a DataFrame column by path, and convert between JSON strings and struct or map types.
* from_json() – Converts a JSON string into a StructType or MapType column.
* to_json() – Converts a MapType or StructType column to a JSON string.
* json_tuple() – Extracts fields from a JSON string and returns them as new columns.
* get_json_object() – Extracts a JSON element from a JSON string based on the specified JSON path.
* schema_of_json() – Creates a schema string from a JSON string.

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

In [2]:
# Build a one-row DataFrame whose "value" column holds a JSON string
jsonString="""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
df=spark.createDataFrame([(1, jsonString)],["id","value"])
df.printSchema()
df.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- value: string (nullable = true)

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+



#### from_json()
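The from_json() function converts a JSON string column into a StructType or MapType column. The first example below parses the string into a map; the second supplies an explicit StructType schema.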

In [3]:
# Convert the JSON string column to a map type
from pyspark.sql.types import MapType,StringType
from pyspark.sql.functions import from_json
df2=df.withColumn("value",from_json(df.value,MapType(StringType(),StringType())))
df2.printSchema()
df2.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------------+
|id |value                                                                      |
+---+---------------------------------------------------------------------------+
|1  |[Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR]|
+---+---------------------------------------------------------------------------+



In [4]:
# Convert the JSON string column to a struct type using an explicit schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("Zipcode", IntegerType(), True),
    StructField("ZipCodeType", StringType(), True),
    StructField("City", StringType(), True),
    StructField("State", StringType(), True)
])

df3=df.withColumn("value", from_json(df.value, schema))
df3.printSchema()
df3.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- Zipcode: integer (nullable = true)
 |    |-- ZipCodeType: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- State: string (nullable = true)

+---+--------------------------------+
|id |value                           |
+---+--------------------------------+
|1  |[704, STANDARD, PARC PARQUE, PR]|
+---+--------------------------------+



#### to_json()
The to_json() function converts a MapType or StructType DataFrame column to a JSON string. The example uses df2 (map) and df3 (struct), both created in the from_json() examples above.

In [5]:
from pyspark.sql.functions import to_json, col
df2.withColumn("value",to_json(col("value"))) \
   .show(truncate=False)
df3.withColumn("value",to_json(col("value"))) \
   .show(truncate=False)

+---+----------------------------------------------------------------------------+
|id |value                                                                       |
+---+----------------------------------------------------------------------------+
|1  |{"Zipcode":"704","ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+----------------------------------------------------------------------------+

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+



#### json_tuple()
The json_tuple() function extracts the named elements from a JSON string column and returns each one as a new column.

In [6]:
from pyspark.sql.functions import json_tuple
df.select(col("id"),json_tuple(col("value"),"Zipcode","ZipCodeType","City", "State")) \
    .toDF("id","Zipcode","ZipCodeType","City", "State") \
    .show(truncate=False)

+---+-------+-----------+-----------+-----+
|id |Zipcode|ZipCodeType|City       |State|
+---+-------+-----------+-----------+-----+
|1  |704    |STANDARD   |PARC PARQUE|PR   |
+---+-------+-----------+-----------+-----+



#### get_json_object()
The get_json_object() function extracts a single JSON element from a JSON string column based on the specified JSON path and returns it as a string.

In [7]:
from pyspark.sql.functions import get_json_object
df.select(col("id"),get_json_object(col("value"),"$.ZipCodeType").alias("ZipCodeType")) \
    .show(truncate=False)

+---+-----------+
|id |ZipCodeType|
+---+-----------+
|1  |STANDARD   |
+---+-----------+



#### schema_of_json()
Use schema_of_json() to infer a schema string from a sample JSON string literal.

In [8]:
from pyspark.sql.functions import schema_of_json,lit
schemaStr=spark.range(1) \
    .select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))) \
    .collect()[0][0]
print(schemaStr)

struct<City:string,State:string,ZipCodeType:string,Zipcode:bigint>


|Category|Functions|
|--------|---------|
|Aggregate Functions|approxCountDistinct, avg, count, countDistinct, first, last, max, mean, min, sum, sumDistinct|
|Collection Functions|array_contains, explode, size, sort_array|
|Date/time Functions|<b>Date/timestamp conversion</b>:<br>unix_timestamp, from_unixtime, to_date, quarter, day, dayofyear, weekofyear, from_utc_timestamp, to_utc_timestamp<br><b>Extracting fields from a date/timestamp value:</b><br>year, month, dayofmonth, hour, minute, second<br><b>Date/timestamp calculation:</b><br>datediff, date_add, date_sub, add_months, last_day, next_day, months_between<br><b>Misc.:</b><br>current_date, current_timestamp, trunc, date_format|
|Math Functions|abs, acos, asin, atan, atan2, bin, cbrt, ceil, conv, cos, cosh, exp, expm1, factorial, floor, hex, hypot, log, log10, log1p, log2, pmod, pow, rint, round, shiftLeft, shiftRight, shiftRightUnsigned, signum, sin, sinh, sqrt, tan, tanh, toDegrees, toRadians, unhex|
|Misc. Functions|array, bitwiseNOT, callUDF, coalesce, crc32, greatest, if, inputFileName, isNaN, isnotnull, isnull, least, lit, md5, monotonicallyIncreasingId, nanvl, negate, not, rand, randn, sha, sha1, sparkPartitionId, struct, when|
|String Functions|ascii, base64, concat, concat_ws, decode, encode, format_number, format_string, get_json_object, initcap, instr, length, levenshtein, locate, lower, lpad, ltrim, printf, regexp_extract, regexp_replace, repeat, reverse, rpad, rtrim, soundex, space, split, substring, substring_index, translate, trim, unbase64, upper|
|Window Functions (in addition to Aggregate Functions)|cumeDist, denseRank, lag, lead, ntile, percentRank, rank, rowNumber|