# Day 8 - Using Complex Types to Analyse Unstructured or JSON Data
My challenge for today is to go beyond processing well-structured data, which conforms to a schema and where all values are clearly separated into typed columns. Today I want to analyse the stock descriptions in the retail data set, which come as unstructured text. This is my use case for investigating Spark's complex data types like arrays and maps. In addition, I want to get familiar with processing semi-structured data like JSON.

In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession\
   .builder\
   .getOrCreate()

retailDF = spark.read\
   .option("header", "true")\
   .option("inferSchema", "true")\
   .format("csv")\
   .load("./data/retail-data/by-day/*.csv")

There are two questions I want to investigate regarding the description data:
* What is the average number of words in the Description per StockCode?
* Which are the most frequently used words?

## Data Preparation
The granularity of my analysis is StockCode, not individual invoice items. To avoid counting the same description once per invoice line, I reduce the data set to a DataFrame of distinct StockCode and Description pairs. A StockCode can still appear more than once if it was recorded with different descriptions.

In [2]:
distinctDF = retailDF.select(
        "StockCode",
        "Description").distinct()

distinctDF.orderBy("StockCode").show(10, truncate=False)

+---------+---------------------------+
|StockCode|Description                |
+---------+---------------------------+
|10002    |null                       |
|10002    |INFLATABLE POLITICAL GLOBE |
|10080    |null                       |
|10080    |check                      |
|10080    |GROOVY CACTUS INFLATABLE   |
|10120    |DOGGY RUBBER               |
|10123C   |HEARTS WRAPPING TAPE       |
|10123C   |null                       |
|10123G   |null                       |
|10124A   |SPOTS ON RED BOOKCOVER TAPE|
+---------+---------------------------+
only showing top 10 rows



Apparently the null value problem I investigated yesterday occurs again. Rows with a null value in any column are useless for my analysis, so I want to remove them.

In [38]:
cleanedDF = distinctDF.dropna(how="any")

cleanedDF.orderBy("StockCode").show(10, truncate=False)

+---------+----------------------------+
|StockCode|Description                 |
+---------+----------------------------+
|10002    |INFLATABLE POLITICAL GLOBE  |
|10080    |check                       |
|10080    |GROOVY CACTUS INFLATABLE    |
|10120    |DOGGY RUBBER                |
|10123C   |HEARTS WRAPPING TAPE        |
|10124A   |SPOTS ON RED BOOKCOVER TAPE |
|10124G   |ARMY CAMO BOOKCOVER TAPE    |
|10125    |MINI FUNKY DESIGN TAPES     |
|10133    |damaged                     |
|10133    |COLOURING PENCILS BROWN TUBE|
+---------+----------------------------+
only showing top 10 rows



## Arrays
The next thing I have to do is split the text strings into arrays of words. The words in the descriptions are separated by blanks, so I use a blank as the split separator. The result looks like a Python list, but in contrast to lists, all array elements must have the same data type.

In [39]:
from pyspark.sql.functions import split

splittedDF = cleanedDF.select(
        "StockCode",
        split("Description", " ").alias("word_list")
)

splittedDF.show(10, truncate=False)

+---------+------------------------------------------+
|StockCode|word_list                                 |
+---------+------------------------------------------+
|21285    |[RETROSPOT, CANDLE, , MEDIUM]             |
|84987    |[SET, OF, 36, TEATIME, PAPER, DOILIES]    |
|22708    |[WRAP, DOLLY, GIRL]                       |
|22690    |[DOORMAT, HOME, SWEET, HOME, BLUE, ]      |
|21249    |[WOODLAND, , HEIGHT, CHART, STICKERS, ]   |
|85015    |[SET, OF, 12, , VINTAGE, POSTCARD, SET]   |
|84279P   |[CHERRY, BLOSSOM, , DECORATIVE, FLASK]    |
|23623    |[SET, 10, CARD, CHRISTMAS, WELCOME, 17112]|
|23432    |[PRETTY, HANGING, QUILTED, HEARTS]        |
|21002    |[ROSE, DU, SUD, DRAWSTRING, BAG]          |
+---------+------------------------------------------+
only showing top 10 rows



Like with normal Python lists I can grab specific elements, i.e. words from my word lists, by referencing their index starting with 0 for the first element. So to get the second word in each description, I need to refer to index 1.

In [40]:
from pyspark.sql.functions import col

splittedDF.select("StockCode", col("word_list")[1]).show(10)

+---------+------------+
|StockCode|word_list[1]|
+---------+------------+
|    21285|      CANDLE|
|    84987|          OF|
|    22708|       DOLLY|
|    22690|        HOME|
|    21249|            |
|    85015|          OF|
|   84279P|     BLOSSOM|
|    23623|          10|
|    23432|     HANGING|
|    21002|          DU|
+---------+------------+
only showing top 10 rows



It is interesting to note that StockCode 21249 seems to have a double blank after the first word. Maybe a typo in a free-text field? Anyway, I want to count words, not blanks, so I will have to remove them later. First, I want to double-check whether this is a general or a single-case issue.
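Why an extra blank shows up as an empty word can be seen with plain Python strings, which, as far as I can tell, behave the same way when splitting on a single blank. This is only an illustration outside of Spark; the exact position of the extra blanks in the sample string is an assumption based on the output above.

```python
# Plain-Python illustration (not Spark): splitting on a single blank
# turns every additional blank into an empty string element.
descriptions = [
    "WOODLAND  HEIGHT CHART STICKERS ",  # assumed: double blank and trailing blank
    "WRAP DOLLY GIRL",                   # single blanks only
]

split_results = [text.split(" ") for text in descriptions]

for words in split_results:
    # The second value tells whether the list contains an empty word.
    print(words, "" in words)
```

So every additional blank, including a trailing one, produces one empty element in the list.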

I can easily check whether or not a word list contains a specific keyword by using the `array_contains()` function. For my analysis, I want to identify rows that have empty words in the list, which I don't want to count.

In [50]:
from pyspark.sql.functions import array_contains

splittedDF.select(
    "StockCode", 
    "word_list", 
    array_contains("word_list", "").alias("empty strings inside")
).show(10, truncate=False)

+---------+------------------------------------------+--------------------+
|StockCode|word_list                                 |empty strings inside|
+---------+------------------------------------------+--------------------+
|21285    |[RETROSPOT, CANDLE, , MEDIUM]             |true                |
|84987    |[SET, OF, 36, TEATIME, PAPER, DOILIES]    |false               |
|22708    |[WRAP, DOLLY, GIRL]                       |false               |
|22690    |[DOORMAT, HOME, SWEET, HOME, BLUE, ]      |true                |
|21249    |[WOODLAND, , HEIGHT, CHART, STICKERS, ]   |true                |
|85015    |[SET, OF, 12, , VINTAGE, POSTCARD, SET]   |true                |
|84279P   |[CHERRY, BLOSSOM, , DECORATIVE, FLASK]    |true                |
|23623    |[SET, 10, CARD, CHRISTMAS, WELCOME, 17112]|false               |
|23432    |[PRETTY, HANGING, QUILTED, HEARTS]        |false               |
|21002    |[ROSE, DU, SUD, DRAWSTRING, BAG]          |false               |
+---------+------------------------------------------+--------------------+

So now let's clean up the word lists and remove any empty words.

In [52]:
from pyspark.sql.functions import array_remove

cleanedWordListDF = splittedDF.select(
    "StockCode", 
    array_remove("word_list", "").alias("word_list")
)

cleanedWordListDF.show(10, truncate=False)   

+---------+------------------------------------------+
|StockCode|word_list                                 |
+---------+------------------------------------------+
|21285    |[RETROSPOT, CANDLE, MEDIUM]               |
|84987    |[SET, OF, 36, TEATIME, PAPER, DOILIES]    |
|22708    |[WRAP, DOLLY, GIRL]                       |
|22690    |[DOORMAT, HOME, SWEET, HOME, BLUE]        |
|21249    |[WOODLAND, HEIGHT, CHART, STICKERS]       |
|85015    |[SET, OF, 12, VINTAGE, POSTCARD, SET]     |
|84279P   |[CHERRY, BLOSSOM, DECORATIVE, FLASK]      |
|23623    |[SET, 10, CARD, CHRISTMAS, WELCOME, 17112]|
|23432    |[PRETTY, HANGING, QUILTED, HEARTS]        |
|21002    |[ROSE, DU, SUD, DRAWSTRING, BAG]          |
+---------+------------------------------------------+
only showing top 10 rows



Did it work?

In [53]:
cleanedWordListDF.select(
    "StockCode", 
    "word_list", 
    array_contains("word_list", "").alias("empty strings inside")
).show(10, truncate=False)

+---------+------------------------------------------+--------------------+
|StockCode|word_list                                 |empty strings inside|
+---------+------------------------------------------+--------------------+
|21285    |[RETROSPOT, CANDLE, MEDIUM]               |false               |
|84987    |[SET, OF, 36, TEATIME, PAPER, DOILIES]    |false               |
|22708    |[WRAP, DOLLY, GIRL]                       |false               |
|22690    |[DOORMAT, HOME, SWEET, HOME, BLUE]        |false               |
|21249    |[WOODLAND, HEIGHT, CHART, STICKERS]       |false               |
|85015    |[SET, OF, 12, VINTAGE, POSTCARD, SET]     |false               |
|84279P   |[CHERRY, BLOSSOM, DECORATIVE, FLASK]      |false               |
|23623    |[SET, 10, CARD, CHRISTMAS, WELCOME, 17112]|false               |
|23432    |[PRETTY, HANGING, QUILTED, HEARTS]        |false               |
|21002    |[ROSE, DU, SUD, DRAWSTRING, BAG]          |false               |
+---------+------------------------------------------+--------------------+

Yes, it did.

Back to my questions. Now that the data is cleaned up, the number of words per stock description is simply the array length, which the `size()` function provides.

In [55]:
from pyspark.sql.functions import size

cleanedWordListDF.select(
    "StockCode", 
    size("word_list").alias("num_of_words")
).show(10)

+---------+------------+
|StockCode|num_of_words|
+---------+------------+
|    21285|           3|
|    84987|           6|
|    22708|           3|
|    22690|           5|
|    21249|           4|
|    85015|           6|
|   84279P|           4|
|    23623|           6|
|    23432|           4|
|    21002|           5|
+---------+------------+
only showing top 10 rows



In [56]:
from pyspark.sql.functions import avg

avgDF = cleanedWordListDF.select(
    avg(
        size("word_list")
    ).alias("avg_num_of_words")
)

avgDF.show(10)

+-----------------+
| avg_num_of_words|
+-----------------+
|4.031510851419032|
+-----------------+



So the answer to my first question is that stock descriptions are quite short, just about four words on average.
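To make the chain of steps easier to follow, here is a plain-Python sketch of the same logic (split, remove empty words, count, average) for three hand-picked descriptions. It is not Spark code; the helper name `word_list` and the exact blanks in the sample strings are my assumptions.

```python
# Plain-Python sketch of split -> array_remove -> size -> avg for a few rows.
from statistics import mean

descriptions = {
    "21285": "RETROSPOT CANDLE  MEDIUM",       # assumed double blank
    "22708": "WRAP DOLLY GIRL",
    "22690": "DOORMAT HOME SWEET HOME BLUE ",  # assumed trailing blank
}

def word_list(text):
    """Split on single blanks and drop the empty strings."""
    return [word for word in text.split(" ") if word != ""]

# Number of words per StockCode, then the average over all of them.
num_of_words = {code: len(word_list(text)) for code, text in descriptions.items()}
avg_num_of_words = mean(list(num_of_words.values()))

print(num_of_words)
print(avg_num_of_words)
```

For these three rows the counts are 3, 3 and 5, which matches the `num_of_words` values Spark reported for the same StockCodes.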

The PySpark module [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) provides further array-related functions, which I just list here for later reference:

* **array()** - creates a new array column from a list of columns or column expressions that have the **same data type**
* **array_distinct(col)** - Collection function: removes duplicate values from the array 
* **array_except(col1, col2)** - Collection function: returns an array of the elements in col1 but not in col2, without duplicates
* **array_intersect(col1, col2)** - Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates 
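* **array_join(col, delimiter)** - Collection function: concatenates the elements of the array into a string using the delimiter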
* **array_max()** - Collection function: returns the maximum value of the array
* **array_min()** - Collection function: returns the minimum value of the array
* **array_position()** - Collection function: Locates the position of the first occurrence of the given value in the given array
* **array_repeat(col, count)** - Collection function: creates an array containing a column repeated count times
* **array_sort(col)** - Collection function: sorts the input array in ascending order
* **array_union(col1, col2)** - Collection function: returns an array of the elements in the union of col1 and col2, without duplicates
* **arrays_overlap(a1, a2)** - Collection function: returns true if the arrays contain any common non-null element
* **arrays_zip()** - Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays

## Explode
To answer my second question, it would be easier to have all words in one column instead of spread across many lists. To turn array elements into rows, I need to apply the `explode()` function. As the name of the function indicates, this can heavily increase the number of rows, and the values of all remaining columns get duplicated.

In [57]:
from pyspark.sql.functions import explode

explodedDF = cleanedWordListDF.select(
    "StockCode",
    explode("word_list").alias("words")
)

explodedDF.orderBy("StockCode").show(20)

+---------+----------+
|StockCode|     words|
+---------+----------+
|    10002|INFLATABLE|
|    10002|     GLOBE|
|    10002| POLITICAL|
|    10080|    CACTUS|
|    10080|     check|
|    10080|    GROOVY|
|    10080|INFLATABLE|
|    10120|    RUBBER|
|    10120|     DOGGY|
|   10123C|      TAPE|
|   10123C|  WRAPPING|
|   10123C|    HEARTS|
|   10124A|        ON|
|   10124A|      TAPE|
|   10124A|     SPOTS|
|   10124A| BOOKCOVER|
|   10124A|       RED|
|   10124G|      CAMO|
|   10124G| BOOKCOVER|
|   10124G|      ARMY|
+---------+----------+
only showing top 20 rows



The answer to my second question is simply a count of rows per word, sorted in descending order.

In [77]:
from pyspark.sql.functions import desc, count, lit

explodedDF\
    .groupBy("words")\
    .agg(count(lit(1)).alias("word_count"))\
    .orderBy(desc("word_count"))\
    .show(10)

+---------+----------+
|    words|word_count|
+---------+----------+
|      SET|       341|
|     PINK|       317|
|       OF|       255|
|    HEART|       242|
|  VINTAGE|       225|
|     BLUE|       221|
|      RED|       205|
|      BAG|       172|
|CHRISTMAS|       158|
|    GLASS|       157|
+---------+----------+
only showing top 10 rows



Pink stock items seem to be quite popular.
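The explode-then-count pattern can also be sketched in plain Python with `collections.Counter`. This is just an analogue for a handful of word lists taken from the output above, not what Spark does internally.

```python
from collections import Counter

word_lists = [
    ["SET", "OF", "36", "TEATIME", "PAPER", "DOILIES"],
    ["SET", "OF", "12", "VINTAGE", "POSTCARD", "SET"],
    ["ROSE", "DU", "SUD", "DRAWSTRING", "BAG"],
]

# "Explode": flatten the lists into one long sequence of words.
exploded = [word for words in word_lists for word in words]

# Group by word and count, then pick the most frequent words.
word_count = Counter(exploded)

print(len(exploded))
print(word_count.most_common(2))
```

Three lists with 6, 6 and 5 words become 17 single words, which shows in miniature how exploding multiplies the number of rows.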

## Maps
For handling data in key:value structure, Spark provides another complex datatype: *maps*.

My test data does not contain key:value structured data. So first, I will transform my existing data into maps, and second, I will investigate how to handle key:value source data as an input to my ETL data processing.

### Creating Maps

In [4]:
dfFlight = spark.read\
   .option("inferSchema", "true")\
   .option("header", "true")\
   .csv("./data/flight-data/2015-summary.csv")

from pyspark.sql.functions import lit, struct, array, col
from pyspark.sql.types import StringType

arrDF = dfFlight.select(
    array(
        lit("destination"),
        lit("origin"),
        lit("count")
    ).alias("key"),
    array(
        "DEST_COUNTRY_NAME",
        "ORIGIN_COUNTRY_NAME",
        col("count").cast(StringType())
    ).alias("value")
)

arrDF.show(10, truncate=False)

+----------------------------+--------------------------------+
|key                         |value                           |
+----------------------------+--------------------------------+
|[destination, origin, count]|[United States, Romania, 15]    |
|[destination, origin, count]|[United States, Croatia, 1]     |
|[destination, origin, count]|[United States, Ireland, 344]   |
|[destination, origin, count]|[Egypt, United States, 15]      |
|[destination, origin, count]|[United States, India, 62]      |
|[destination, origin, count]|[United States, Singapore, 1]   |
|[destination, origin, count]|[United States, Grenada, 62]    |
|[destination, origin, count]|[Costa Rica, United States, 588]|
|[destination, origin, count]|[Senegal, United States, 40]    |
|[destination, origin, count]|[Moldova, United States, 1]     |
+----------------------------+--------------------------------+
only showing top 10 rows



In [93]:
from pyspark.sql.functions import map_from_arrays

mapDF = arrDF.select(
    map_from_arrays("key", "value").alias("data_map")
)

mapDF.show(10, truncate=False)

+------------------------------------------------------------------+
|data_map                                                          |
+------------------------------------------------------------------+
|[destination -> United States, origin -> Romania, count -> 15]    |
|[destination -> United States, origin -> Croatia, count -> 1]     |
|[destination -> United States, origin -> Ireland, count -> 344]   |
|[destination -> Egypt, origin -> United States, count -> 15]      |
|[destination -> United States, origin -> India, count -> 62]      |
|[destination -> United States, origin -> Singapore, count -> 1]   |
|[destination -> United States, origin -> Grenada, count -> 62]    |
|[destination -> Costa Rica, origin -> United States, count -> 588]|
|[destination -> Senegal, origin -> United States, count -> 40]    |
|[destination -> Moldova, origin -> United States, count -> 1]     |
+------------------------------------------------------------------+
only showing top 10 rows



In [94]:
mapDF.select(col("data_map")["destination"]).show(10)

+---------------------+
|data_map[destination]|
+---------------------+
|        United States|
|        United States|
|        United States|
|                Egypt|
|        United States|
|        United States|
|        United States|
|           Costa Rica|
|              Senegal|
|              Moldova|
+---------------------+
only showing top 10 rows



In [95]:
mapDF.select(col("data_map")["origin"]).show(10)

+----------------+
|data_map[origin]|
+----------------+
|         Romania|
|         Croatia|
|         Ireland|
|   United States|
|           India|
|       Singapore|
|         Grenada|
|   United States|
|   United States|
|   United States|
+----------------+
only showing top 10 rows



In [96]:
from pyspark.sql.functions import map_keys

mapDF.select(
        map_keys("data_map")
).show(10, truncate=False)

+----------------------------+
|map_keys(data_map)          |
+----------------------------+
|[destination, origin, count]|
|[destination, origin, count]|
|[destination, origin, count]|
|[destination, origin, count]|
|[destination, origin, count]|
|[destination, origin, count]|
|[destination, origin, count]|
|[destination, origin, count]|
|[destination, origin, count]|
|[destination, origin, count]|
+----------------------------+
only showing top 10 rows



In [97]:
from pyspark.sql.functions import map_values

mapDF.select(
        map_values("data_map")
).show(10, truncate=False)

+--------------------------------+
|map_values(data_map)            |
+--------------------------------+
|[United States, Romania, 15]    |
|[United States, Croatia, 1]     |
|[United States, Ireland, 344]   |
|[Egypt, United States, 15]      |
|[United States, India, 62]      |
|[United States, Singapore, 1]   |
|[United States, Grenada, 62]    |
|[Costa Rica, United States, 588]|
|[Senegal, United States, 40]    |
|[Moldova, United States, 1]     |
+--------------------------------+
only showing top 10 rows



The data I've processed so far still looks at least semi-structured, because the keys and values all appear in identical order. So there is still an implicit schema, because all rows match the same pattern:

destination -> destVal, origin -> origVal, count -> cntVal

What would happen if rows had keys and values in a different order? Because my test data does not provide examples for this, I create a DataFrame manually with synthetic data in multiple orders.

In [98]:
unstructuredDF = spark.createDataFrame(
        [
            (["destination", "origin", "count"], ["United States", "Germany", "10"],), 
            (["count", "origin", "destination"], ["25", "France", "Spain"],),
            (["count", "destination", "origin"], ["75", "Italy", "Spain"],)
        ], 
        ["key", "value"]
)

unstructuredDF.show(truncate=False)

+----------------------------+----------------------------+
|key                         |value                       |
+----------------------------+----------------------------+
|[destination, origin, count]|[United States, Germany, 10]|
|[count, origin, destination]|[25, France, Spain]         |
|[count, destination, origin]|[75, Italy, Spain]          |
+----------------------------+----------------------------+



In [99]:
mapDF2 = unstructuredDF.select(
    map_from_arrays("key", "value").alias("data_map")
)

mapDF2.show(truncate=False)

+--------------------------------------------------------------+
|data_map                                                      |
+--------------------------------------------------------------+
|[destination -> United States, origin -> Germany, count -> 10]|
|[count -> 25, origin -> France, destination -> Spain]         |
|[count -> 75, destination -> Italy, origin -> Spain]          |
+--------------------------------------------------------------+



In [100]:
mapDF2.select(col("data_map")["origin"]).show(10)

+----------------+
|data_map[origin]|
+----------------+
|         Germany|
|          France|
|           Spain|
+----------------+



Luckily the ordering doesn't matter, because I reference the values by key and not by position. Maps are more like dictionaries than lists or arrays.
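The dictionary comparison can be taken literally. A minimal plain-Python sketch with the same synthetic rows, where `dict(zip(...))` plays the role of `map_from_arrays()` (an analogy, not the Spark implementation):

```python
rows = [
    (["destination", "origin", "count"], ["United States", "Germany", "10"]),
    (["count", "origin", "destination"], ["25", "France", "Spain"]),
    (["count", "destination", "origin"], ["75", "Italy", "Spain"]),
]

# Pair each key with the value at the same position in the row.
data_maps = [dict(zip(keys, values)) for keys, values in rows]

# A lookup by key does not depend on where the pair sits in the row.
origins = [data_map["origin"] for data_map in data_maps]

print(origins)
```

The lookup returns Germany, France and Spain, the same values as the Spark query above.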

### Turning Maps into DataFrames

With my self-created map, I can now investigate how to handle such data as input for my ETL process, which will finally write data in tabular form into a file or database table. As an intermediate step, I will have to align more or less ordered *key:value* pairs with the schema of a `DataFrame`.

Can the `explode()` function help again?

In [163]:
mapDF2.select(explode("data_map")).show(10)

+-----------+-------------+
|        key|        value|
+-----------+-------------+
|destination|United States|
|     origin|      Germany|
|      count|           10|
|      count|           25|
|     origin|       France|
|destination|        Spain|
|      count|           75|
|destination|        Italy|
|     origin|        Spain|
+-----------+-------------+



Well, yes and no. Yes, `explode()` accepts both arrays and maps as an argument. No, because now I've lost the information about which three rows belong together. Additionally, my intention was to get three columns, one for each key, and not just two. For maps, referencing by key is always a better approach than referencing by position.

In [164]:
mapDF2.select(
    col("data_map")["destination"].alias("destination"),
    col("data_map")["origin"].alias("origin"),
    col("data_map")["count"].alias("count")
).show(10)

+-------------+-------+-----+
|  destination| origin|count|
+-------------+-------+-----+
|United States|Germany|   10|
|        Spain| France|   25|
|        Italy|  Spain|   75|
+-------------+-------+-----+



So with Spark, handling nearly unstructured data records of key:value pairs in different orders is not a big problem.

The PySpark module [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) provides further map-related functions, which I also just list here for later reference:

* **map_concat()** - Returns the union of all the given maps
* **map_from_entries()** - Collection function: Returns a map created from the given array of entries

### Turning Arrays or Maps into JSON
A nice Spark feature is the `to_json()` function, which converts StructType, ArrayType or MapType data into JSON. This can be relevant for me if I have to call a REST API which expects JSON documents as payload.

In [103]:
from pyspark.sql.functions import to_json

mapDF2.select(to_json("data_map")).show(10, truncate=False)

+---------------------------------------------------------------+
|structstojson(data_map)                                        |
+---------------------------------------------------------------+
|{"destination":"United States","origin":"Germany","count":"10"}|
|{"count":"25","origin":"France","destination":"Spain"}         |
|{"count":"75","destination":"Italy","origin":"Spain"}          |
+---------------------------------------------------------------+



In [114]:
mapDF2.select(to_json("data_map")).printSchema()

root
 |-- structstojson(data_map): string (nullable = true)
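The schema shows that the result is an ordinary string column. In plain-Python terms this corresponds roughly to serialising a dictionary with `json.dumps()`. The sketch below only illustrates that analogy; the compact separators are my choice to reproduce the layout of the output above.

```python
import json

data_map = {"destination": "United States", "origin": "Germany", "count": "10"}

# Compact separators produce the same layout as the to_json() output above.
payload = json.dumps(data_map, separators=(",", ":"))

print(payload)
print(type(payload).__name__)

# Parsing the string again gives back the original key:value pairs.
restored = json.loads(payload)
print(restored == data_map)
```

The payload is plain text, so it can be passed on to anything that expects a JSON document.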



## Processing Semi-structured JSON data
As I've learned on day 3, reading data from a JSON file and transforming it into a DataFrame is quite simple. Just as a reminder:

In [5]:
jsonDF = spark.read\
   .option("inferSchema", "true")\
   .format("json")\
   .load("./data/flight-data/2015-summary.json")

jsonDF.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



In [6]:
jsonDF.show(10)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
|    United States|          Singapore|    1|
|    United States|            Grenada|   62|
|       Costa Rica|      United States|  588|
|          Senegal|      United States|   40|
|          Moldova|      United States|    1|
+-----------------+-------------------+-----+
only showing top 10 rows



But what do I have to do if I have tabular data where only one column contains JSON strings? To check this out, I first create some test data.

In [73]:
df = spark.createDataFrame(
        [
            (123, "DUS", '{"destinations" : ["FRA", "MUC", "TXL"], "airlines" : ["LH", "EW", "RY"]}'), 
            (456, "FRA", '{"destinations" : ["CDG", "MUC", "JFK"], "airlines" : ["AF", "LH", "DL"]}'),
            (789, "MUC", '{"destinations" : ["FRA", "ZUC", "DUS"], "airlines" : ["EW", "LH", "EW"]}')
        ], 
        ["key", "airport", "dest"]
)

In [74]:
df.show(truncate=False)

+---+-------+-------------------------------------------------------------------------+
|key|airport|dest                                                                     |
+---+-------+-------------------------------------------------------------------------+
|123|DUS    |{"destinations" : ["FRA", "MUC", "TXL"], "airlines" : ["LH", "EW", "RY"]}|
|456|FRA    |{"destinations" : ["CDG", "MUC", "JFK"], "airlines" : ["AF", "LH", "DL"]}|
|789|MUC    |{"destinations" : ["FRA", "ZUC", "DUS"], "airlines" : ["EW", "LH", "EW"]}|
+---+-------+-------------------------------------------------------------------------+



### Navigation along JSON Paths
Each row in the "dest" column contains a valid JSON document. Now I can use the `get_json_object()` function to access the values inside the JSON documents by specifying the path from the root element (represented by `$`) down the nesting hierarchy to the specific JSON object I want to extract.

path: `$.key_level_1.key_level_2....key_level_n`

Since the objects "destinations" and "airlines" in my DataFrame have lists as values, I have to specify the list index to get a single value per row.

In [79]:
from pyspark.sql.functions import get_json_object

df.select(
        "key", 
        "airport",
        get_json_object("dest", '$.destinations[2]').alias("destination"),
        get_json_object("dest", '$.airlines[1]').alias("airline"),
).show(truncate=False)

+---+-------+-----------+-------+
|key|airport|destination|airline|
+---+-------+-----------+-------+
|123|DUS    |TXL        |EW     |
|456|FRA    |JFK        |LH     |
|789|MUC    |DUS        |LH     |
+---+-------+-----------+-------+



If I omit the list index, I'll get the entire value list in my result DataFrame.

In [80]:
df.select(
        "key", 
        "airport",
        get_json_object("dest", '$.destinations').alias("destination"),
        get_json_object("dest", '$.airlines').alias("airline"),
).show(truncate=False)

+---+-------+-------------------+----------------+
|key|airport|destination        |airline         |
+---+-------+-------------------+----------------+
|123|DUS    |["FRA","MUC","TXL"]|["LH","EW","RY"]|
|456|FRA    |["CDG","MUC","JFK"]|["AF","LH","DL"]|
|789|MUC    |["FRA","ZUC","DUS"]|["EW","LH","EW"]|
+---+-------+-------------------+----------------+
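How such a path is resolved step by step can be sketched in plain Python. The helper `get_path` below is hypothetical and only understands `.name` and `[n]` steps; it is not how `get_json_object()` is implemented, it returns Python objects instead of strings, and returning `None` for a missing path is my choice for this sketch.

```python
import json
import re

def get_path(json_text, path):
    """Resolve a simplified path such as '$.destinations[2]' (hypothetical helper)."""
    node = json.loads(json_text)
    # Each step is either '.name' (object key) or '[n]' (list index).
    for name, index in re.findall(r"\.(\w+)|\[(\d+)\]", path):
        try:
            node = node[name] if name else node[int(index)]
        except (KeyError, IndexError, TypeError):
            return None  # the path does not exist in this document
    return node

dest = '{"destinations" : ["FRA", "MUC", "TXL"], "airlines" : ["LH", "EW", "RY"]}'

print(get_path(dest, "$.destinations[2]"))  # single value
print(get_path(dest, "$.airlines"))         # entire list
print(get_path(dest, "$.destinations[7]"))  # index out of range
```

With an index the helper walks down to one element, and without it the walk simply stops one level earlier at the whole list.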



There is a similar function `json_tuple()` but I'm not sure if it provides any benefits to me, because:
1. I cannot use it if the JSON document has more than one level of nesting, and
1. I cannot refer to single list elements

In [91]:
from pyspark.sql.functions import json_tuple
df.select("key", 
          "airport",
          json_tuple("dest", "destinations", "airlines").alias("destination", "airline"),
).show(truncate=False)

+---+-------+-------------------+----------------+
|key|airport|destination        |airline         |
+---+-------+-------------------+----------------+
|123|DUS    |["FRA","MUC","TXL"]|["LH","EW","RY"]|
|456|FRA    |["CDG","MUC","JFK"]|["AF","LH","DL"]|
|789|MUC    |["FRA","ZUC","DUS"]|["EW","LH","EW"]|
+---+-------+-------------------+----------------+



### Turning JSON to Map based on Schema
Finally, just as I can read from JSON files using an explicit schema definition, I can also apply `from_json()` with a schema to DataFrame columns containing JSON. Depending on the schema definition, `from_json()` will return StructType, ArrayType or MapType. In fact, I could perform a conversion round trip: StructType, ArrayType or MapType -> `to_json()` -> JSON string -> `from_json()` -> StructType, ArrayType or MapType.

Here I convert the JSON in the "dest" column into a map with string keys and string arrays as values.

In [130]:
from pyspark.sql.types import ArrayType, MapType, StringType
from pyspark.sql.functions import from_json

jsonSchema = MapType(
    StringType(), 
    ArrayType(StringType(), True),
    True
)

mappedDF = df.select("key", 
          "airport",
         from_json("dest", jsonSchema).alias("json_data")
)

mappedDF.show(truncate=False)

+---+-------+-----------------------------------------------------------+
|key|airport|json_data                                                  |
+---+-------+-----------------------------------------------------------+
|123|DUS    |[destinations -> [FRA, MUC, TXL], airlines -> [LH, EW, RY]]|
|456|FRA    |[destinations -> [CDG, MUC, JFK], airlines -> [AF, LH, DL]]|
|789|MUC    |[destinations -> [FRA, ZUC, DUS], airlines -> [EW, LH, EW]]|
+---+-------+-----------------------------------------------------------+



Now I can navigate on the Map structure to extract single values similar to navigating the JSON path using `get_json_object()`, e.g. grabbing the third element of the destinations lists.

In [137]:
mappedDF.select(
    "key", 
    "airport",
    col("json_data")["destinations"][2]
).show()

+---+-------+--------------------------+
|key|airport|json_data[destinations][2]|
+---+-------+--------------------------+
|123|    DUS|                       TXL|
|456|    FRA|                       JFK|
|789|    MUC|                       DUS|
+---+-------+--------------------------+



The question is: what is the benefit of taking this extra effort, defining a schema and converting JSON to a Map? In my opinion this leads to cleaner code and a better design, because:
1. the JSON structure I am expecting is now explicitly documented in the code by the schema instead of being implicitly assumed
1. the Map structure is a unifying abstraction of any key:value data, regardless of the source format, e.g. CSV files, JSON documents or key-value database tables
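The first point can be illustrated outside of Spark as well. The following plain-Python sketch makes the expected structure (a map of string keys to lists of strings) explicit in a type hint and checks it while parsing. The function name `parse_dest` is hypothetical, and the sketch does not claim to mirror how `from_json()` treats non-conforming input.

```python
import json
from typing import Dict, List, Optional

def parse_dest(json_text: str) -> Optional[Dict[str, List[str]]]:
    """Parse a JSON object of string lists; return None if the shape differs."""
    try:
        doc = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    for values in doc.values():
        # Every value must be a list that contains only strings.
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            return None
    return doc

good = parse_dest('{"destinations" : ["FRA", "MUC", "TXL"], "airlines" : ["LH", "EW", "RY"]}')
bad = parse_dest('{"destinations" : "FRA"}')

print(good["destinations"][2])
print(bad)
```

Anyone reading the signature can see which structure the code expects, and a document with a different shape is rejected at one well-defined place instead of failing somewhere downstream.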