# Multidimensional data frames: Using PySpark with JSON data

Thus far, we have used PySpark’s data frame to work with textual and tabular data.
Let's push the abstraction a little further by representing *hierarchical information* within a data frame. Imagine it for a moment: columns within columns, the ultimate flexibility.

## Reading JSON data

This section explains what JSON is, how to use the specialized JSON reader with
PySpark, and how a JSON file is represented within a data frame. For this chapter, we use a JSON dump of information about the TV show Silicon Valley from TV Maze.

```json
{
   "id":143,
   "name":"Silicon Valley",
   "type":"Scripted",
   "language":"English",
   "genres":[
      "Comedy"
   ],
   "network":{
      "id":8,
      "name":"HBO",
      "country":{
         "name":"United States",
         "code":"US",
         "timezone":"America/New_York"
      }
   },
   "_embedded":{
      "episodes":[
         {
            "id":10897,
            "name":"Minimum Viable Product",
            "season":1,
            "number":1
         },
         {
            "id":10898,
            "name":"The Cap Table",
            "season":1,
            "number":2
         }
      ]
   }
}
```

In [18]:
# Reading a simple JSON document as a Python dictionary
import json
sample_json = """{
   "id":143,
   "name":"Silicon Valley",
   "type":"Scripted",
   "language":"English",
   "genres":[
      "Comedy"
   ],
   "network":{
      "id":8,
      "name":"HBO",
      "country":{
         "name":"United States",
         "code":"US",
         "timezone":"America/New_York"
      }
   }
}"""
document = json.loads(sample_json)
print(document)
print(type(document))

{'id': 143, 'name': 'Silicon Valley', 'type': 'Scripted', 'language': 'English', 'genres': ['Comedy'], 'network': {'id': 8, 'name': 'HBO', 'country': {'name': 'United States', 'code': 'US', 'timezone': 'America/New_York'}}}
<class 'dict'>


In [19]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import os

spark = SparkSession.builder.getOrCreate()

DIRECTORY = "data/shows/"

shows = spark.read.json(
    os.path.join(DIRECTORY,'shows-silicon-valley.json')
)

shows.count()

1

In the PySpark world, reading JSON follows this rule: *one JSON document, one line, one record*. This means that if you want to have multiple JSON records in the same file, you need to have one document per line and no newline within a document.
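To make the one-line-per-document layout concrete, here is a minimal sketch in plain Python (no Spark involved). The second record is a hypothetical show made up for illustration; only the `json` module from the standard library is used.

```python
import json

# Two records; the second one is hypothetical and only here for illustration.
records = [
    {"id": 143, "name": "Silicon Valley", "genres": ["Comedy"]},
    {"id": 1, "name": "Hypothetical Show", "genres": ["Drama", "Horror"]},
]

# json.dumps() emits no newline by default, so each document stays on one line.
json_lines = "\n".join(json.dumps(record) for record in records)
print(json_lines)

# Parsing line by line mirrors the default rule: one line, one record.
parsed = [json.loads(line) for line in json_lines.splitlines()]
print(len(parsed))
```

A file with this layout is what the default JSON reader expects: each line parses on its own, independently of the others.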

If you want to ingest multiple documents across multiple files, with each document spanning several lines, you need to set the `multiLine` (careful about the capital L!) parameter to `True`. This will change the JSON reading rule to the following: *one JSON document, one file, one record*.

In [20]:
three_shows = spark.read.json(
    os.path.join(DIRECTORY,"shows-*.json"),
    multiLine=True)
    
three_shows.count()

3

### Complex data types
ARRAY, MAP, STRUCT

Spark uses the term *complex data types* to refer to data types that contain other types. In PySpark, we have the *array*, the *map*, and the *struct*. With these three, you can express a practically unlimited variety of data layouts.

In [21]:
shows.printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: string (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true

In [22]:
# Notice the hierarchy within the schema.
# PySpark took every top-level key (the keys
# from the root object) and parsed them as columns.

print(shows.columns)

['_embedded', '_links', 'externals', 'genres', 'id', 'image', 'language', 'name', 'network', 'officialSite', 'premiered', 'rating', 'runtime', 'schedule', 'status', 'summary', 'type', 'updated', 'url', 'webChannel', 'weight']


In [23]:
array_subset = shows.select("name","genres")

array_subset.show(1,False)

+--------------+--------+
|name          |genres  |
+--------------+--------+
|Silicon Valley|[Comedy]|
+--------------+--------+



Our `genres` column can be thought of as containing a list of elements within each record. To get to the values inside the array, we need to extract them. PySpark provides a very Pythonic way to work with arrays as if they were lists.

PySpark’s array functions, available in the `pyspark.sql.functions` module, are almost all prefixed with the `array_` keyword.

Let's play with these functions.

- We create three literal columns (using `lit()` to create scalar columns, then `array()`) to create an array of possible genres. Depending on your PySpark version, `lit()` may not accept Python lists as an argument, so we go the long route by creating individual scalar columns before combining them into a single array.
- We then use the function `array_repeat()` to create a column repeating the Comedy string we extract below five times. Finally, we compute the size of both columns, de-dupe both arrays, and intersect them, yielding our original [Comedy] array.

In [24]:
array_subset = array_subset.select(
    "name",
    array_subset.genres[0].alias("dot_and_index"),
    F.col("genres")[0].alias("col_and_index"),
    array_subset.genres.getItem(0).alias("dot_and_method"),
    F.col("genres").getItem(0).alias("col_and_method"),
)
array_subset.show()

+--------------+-------------+-------------+--------------+--------------+
|          name|dot_and_index|col_and_index|dot_and_method|col_and_method|
+--------------+-------------+-------------+--------------+--------------+
|Silicon Valley|       Comedy|       Comedy|        Comedy|        Comedy|
+--------------+-------------+-------------+--------------+--------------+



In [25]:
array_subset_repeated = array_subset.select(
    "name",
    F.lit("Comedy").alias("one"),
    F.lit("Horror").alias("two"),
    F.lit("Drama").alias("three"),
    F.col("dot_and_index"),
).select(
    "name",
    F.array("one", "two", "three").alias("Some_Genres"),
    F.array_repeat("dot_and_index", 5).alias("Repeated_Genres"),
)

array_subset_repeated.show(1, False)

+--------------+-----------------------+----------------------------------------+
|name          |Some_Genres            |Repeated_Genres                         |
+--------------+-----------------------+----------------------------------------+
|Silicon Valley|[Comedy, Horror, Drama]|[Comedy, Comedy, Comedy, Comedy, Comedy]|
+--------------+-----------------------+----------------------------------------+



In [26]:
array_subset_repeated.select(
    "name", F.size("Some_Genres"), F.size("Repeated_Genres")
).show()

+--------------+-----------------+---------------------+
|          name|size(Some_Genres)|size(Repeated_Genres)|
+--------------+-----------------+---------------------+
|Silicon Valley|                3|                    5|
+--------------+-----------------+---------------------+



In [27]:
array_subset_repeated.select(
    "name",
    F.array_distinct("Some_Genres"),
    F.array_distinct("Repeated_Genres"),
).show(1, False)

+--------------+---------------------------+-------------------------------+
|name          |array_distinct(Some_Genres)|array_distinct(Repeated_Genres)|
+--------------+---------------------------+-------------------------------+
|Silicon Valley|[Comedy, Horror, Drama]    |[Comedy]                       |
+--------------+---------------------------+-------------------------------+



In [28]:
array_subset_repeated = array_subset_repeated.select(
    "name",
    F.array_intersect("Some_Genres", "Repeated_Genres").alias("Genres"),
)
array_subset_repeated.show()

+--------------+--------+
|          name|  Genres|
+--------------+--------+
|Silicon Valley|[Comedy]|
+--------------+--------+



When you want to know the position of a value in an array, you can use `array_position()`. This function takes two arguments:
- An array column to perform the search
- A value to search for within the array

In [29]:
array_subset_repeated.select(
    "Genres", F.array_position("Genres", "Comedy")
).show()

+--------+------------------------------+
|  Genres|array_position(Genres, Comedy)|
+--------+------------------------------+
|[Comedy]|                             1|
+--------+------------------------------+



#### MAP type: keys and values within a column

A map is conceptually very close to a typed Python dictionary: you have keys and values just like in a dictionary, but as with the array, the keys need to be of the same type, and the values need to be of the same type (the type for the keys can be different than the type for the values). Values can be null, but keys can’t.

One of the easiest ways to create a map is from two columns of type array. We will do so by collecting some information about the `name`, `language`, and `type` columns into an array and using the `map_from_arrays()` function, as shown below.

In [31]:
columns = ["name", "language", "type"]

shows_map = shows.select(
    *[F.lit(column) for column in columns],
    F.array(*columns).alias("values"),
)
shows_map = shows_map.select(F.array(*columns).alias("keys"), "values")
shows_map.show(1,False)

+----------------------+-----------------------------------+
|keys                  |values                             |
+----------------------+-----------------------------------+
|[name, language, type]|[Silicon Valley, English, Scripted]|
+----------------------+-----------------------------------+



In [32]:
shows_map = shows_map.select(
    F.map_from_arrays("keys", "values").alias("mapped")
)

shows_map.printSchema()

root
 |-- mapped: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



In [33]:
shows_map.show(1,False)

+---------------------------------------------------------------+
|mapped                                                         |
+---------------------------------------------------------------+
|{name -> Silicon Valley, language -> English, type -> Scripted}|
+---------------------------------------------------------------+



In [34]:
# To access the mapped values
shows_map.select(
    F.col("mapped.name"),
    F.col("mapped")["name"],
    shows_map.mapped["name"],
).show()

+--------------+--------------+--------------+
|          name|  mapped[name]|  mapped[name]|
+--------------+--------------+--------------+
|Silicon Valley|Silicon Valley|Silicon Valley|
+--------------+--------------+--------------+



Just like with the `array`, PySpark provides a few functions to work with maps under the `pyspark.sql.functions` module. Most of them are prefixed or suffixed with `map`, such as `map_values()` (which creates an array column out of the map values) or `create_map()` (which creates a map from the columns passed as parameters, alternating between keys and values).

### Struct: Nesting columns within columns

The struct is akin to a JSON object, in the sense that the key or name of each pair is a string and that each value can be of a different type.

In [35]:
shows.select("schedule").printSchema()

root
 |-- schedule: struct (nullable = true)
 |    |-- days: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- time: string (nullable = true)



The struct is very different from the `array` and the `map` in that the number of fields and their names are known ahead of time. In our case, the `schedule` struct column is fixed: we know that each record of our data frame will contain that `schedule` struct (or a `null` value, if we want to be pedantic), and within that struct there will be an array of strings, `days`, and a string, `time`. The `array` and the `map` enforce the types of the values, but not their numbers or names. The struct allows for more versatility of types, as long as you name each field and provide the type ahead of time.

We can visualize `schedule` as a data frame of two columns (`days` and `time`) trapped within the column.

![struct data type](images/with_json_struct.png)

#### Navigating structs as if they were nested columns
This section covers how to extract values from nested structs inside a data frame. PySpark provides the same convenience when working with nested columns as it would for regular columns.


In [36]:
shows.select(F.col("_embedded")).printSchema()

root
 |-- _embedded: struct (nullable = true)
 |    |-- episodes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- airdate: string (nullable = true)
 |    |    |    |-- airstamp: string (nullable = true)
 |    |    |    |-- airtime: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- image: struct (nullable = true)
 |    |    |    |    |-- medium: string (nullable = true)
 |    |    |    |    |-- original: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- number: long (nullable = true)
 |    |    |    |-- runtime: long (nullable = true)
 |    |    |    |-- season: long (nullable = true)
 |    |    |    |-- summary: string (nullable = true)
 |    |    |    |-- url: string (nullable = true

In [37]:
shows_clean = shows.withColumn(
    "episodes", F.col("_embedded.episodes")
).drop("_embedded")

shows_clean.printSchema()

root
 |-- _links: struct (nullable = true)
 |    |-- previousepisode: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |    |-- self: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |-- externals: struct (nullable = true)
 |    |-- imdb: string (nullable = true)
 |    |-- thetvdb: long (nullable = true)
 |    |-- tvrage: long (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: long (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- medium: string (nullable = true)
 |    |-- original: string (nullable = true)
 |-- language: string (nullable = true)
 |-- name: string (nullable = true)
 |-- network: struct (nullable = true)
 |    |-- country: struct (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- timezone: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nul

We can refer to individual elements in the array using the index in brackets after the column reference. What about extracting the names of all the episodes, which are within the `episodes` array of structs?

In [38]:
episodes_name = shows_clean.select(F.col("episodes.name"))
episodes_name.printSchema()


root
 |-- name: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [39]:
episodes_name.select(F.explode("name").alias("name")).show(3, False)

+-------------------------+
|name                     |
+-------------------------+
|Minimum Viable Product   |
|The Cap Table            |
|Articles of Incorporation|
+-------------------------+
only showing top 3 rows



### Building and using the data frame schema
Let's see how to define and use a schema with a PySpark data frame. We build the schema for our JSON object programmatically and review the out-of-the-box types PySpark offers. Being able to use Python structures (serialized as JSON) means that we can manipulate our schemas just like any other data structure; we can reuse our data manipulation tool kit for manipulating our data frame’s metadata. By doing this, we also address the potential slowdown from schema inference, as we don’t need Spark to read the data twice (once to infer the schema, once to perform the read).

#### Using Spark types as the base blocks of a schema
Now we will build the schema for our shows data frame from scratch and include some programmatic niceties of the PySpark schema-building capabilities. The data types we use to build a schema are located in the `pyspark.sql.types` module. 



In [40]:
import pyspark.sql.types as T

# The _links field contains a self struct that itself
# contains a single-string field: href.
episode_links_schema = T.StructType(
    [
        T.StructField(
            "self", T.StructType([T.StructField("href", T.StringType())])
        )
    ]
)

episode_image_schema = T.StructType(
    [
        T.StructField("medium", T.StringType()),
        T.StructField("original", T.StringType()),
    ]
)

episode_schema = T.StructType(
    [
        T.StructField("_links", episode_links_schema),  # <- since types are Python
        T.StructField("airdate", T.DateType()),         # objects, we can assign them
        T.StructField("airstamp", T.TimestampType()),   # to variables and use them.
        T.StructField("airtime", T.StringType()),       # Using episode_links_schema
        T.StructField("id", T.StringType()),            # and episode_image_schema
        T.StructField("image", episode_image_schema),   # <- makes our schema for an
        T.StructField("name", T.StringType()),          # episode look much cleaner.
        T.StructField("number", T.LongType()),
        T.StructField("runtime", T.LongType()),
        T.StructField("season", T.LongType()),
        T.StructField("summary", T.StringType()),
        T.StructField("url", T.StringType()),
    ]
)

embedded_schema = T.StructType(
    [
        T.StructField(
            "_embedded",
            T.StructType(
                [
                    T.StructField(
                        "episodes", T.ArrayType(episode_schema)
                    )
                ]
            ),
        )
    ]
)


#### Reading JSON with a strict schema in place

This section covers how to read a JSON document while enforcing a precise schema. This proves extremely useful when you want to improve the robustness of your data pipeline; it’s better to know you’re missing a few columns at ingestion time than to get an error later in the program.

In [41]:
shows_with_schema = spark.read.json(
    os.path.join(DIRECTORY,"shows-silicon-valley.json"),
    schema=embedded_schema,
    mode="FAILFAST",
)

In [43]:
for column in ["airdate", "airstamp"]:
    shows.select(f"_embedded.episodes.{column}").select(
        F.explode(column)
    ).show(5,False)

+----------+
|col       |
+----------+
|2014-04-06|
|2014-04-13|
|2014-04-20|
|2014-04-27|
|2014-05-04|
+----------+
only showing top 5 rows

+-------------------------+
|col                      |
+-------------------------+
|2014-04-07T02:00:00+00:00|
|2014-04-14T02:00:00+00:00|
|2014-04-21T02:00:00+00:00|
|2014-04-28T02:00:00+00:00|
|2014-05-05T02:00:00+00:00|
+-------------------------+
only showing top 5 rows



In [46]:
# Witnessing a JSON document ingestion with an incompatible schema
from py4j.protocol import Py4JJavaError

episode_schema_BAD = T.StructType(
    [
        T.StructField("_links", episode_links_schema),
        T.StructField("airdate", T.DateType()),
        T.StructField("airstamp", T.TimestampType()),
        T.StructField("airtime", T.StringType()),
        T.StructField("id", T.StringType()),
        T.StructField("image", episode_image_schema),
        T.StructField("name", T.StringType()),
        T.StructField("number", T.LongType()),
        T.StructField("runtime", T.LongType()),
        T.StructField("season", T.LongType()),
        T.StructField("summary", T.LongType()),
        T.StructField("url", T.LongType()),
    ]
)

embedded_schema2 = T.StructType(
    [
        T.StructField(
            "_embedded",
            T.StructType(
                [
                    T.StructField(
                        "episodes", T.ArrayType(episode_schema_BAD)
                    )
                ]
            )
        )
    ]
)

shows_with_schema_wrong = spark.read.json(
    "./data/shows/shows-silicon-valley.json",
    schema=embedded_schema2,
    mode="FAILFAST",
)

try:
    shows_with_schema_wrong.show()
except Py4JJavaError:
    print('failed')

failed


#### Going full circle: Specifying the schemas in JSON

The `StructType` object has a handy `fromJson()` method that builds a schema from a JSON-formatted description, passed as a Python dictionary (for example, the result of `json.loads()`). As long as we know how to provide a proper JSON schema, we should be good to go.

To understand the layout and content of a typical PySpark data frame schema, we use our `shows_with_schema` data frame and its `schema` attribute. Unlike `printSchema()`, which prints our schema to standard output, `schema` returns an internal representation of the schema in terms of `StructType`. Fortunately, `StructType` comes with two methods for exporting its content into a JSON-esque format:
- `json()` will output a string containing the JSON-formatted schema.
- `jsonValue()` will return the schema as a dictionary.

In [47]:
import pprint

pprint.pprint(
    shows_with_schema.select(
        F.explode("_embedded.episodes").alias("episode")
    )
    .select("episode.airtime")
    .schema.jsonValue()
)

{'fields': [{'metadata': {},
             'name': 'airtime',
             'nullable': True,
             'type': 'string'}],
 'type': 'struct'}


In [48]:
pprint.pprint(
    T.StructField("array_example", T.ArrayType(T.StringType())).jsonValue()
)

{'metadata': {},
 'name': 'array_example',
 'nullable': True,
 'type': {'containsNull': True, 'elementType': 'string', 'type': 'array'}}


In [49]:
pprint.pprint(
    T.StructField(
    "map_example", T.MapType(T.StringType(), T.LongType())
    ).jsonValue()
)

{'metadata': {},
 'name': 'map_example',
 'nullable': True,
 'type': {'keyType': 'string',
          'type': 'map',
          'valueContainsNull': True,
          'valueType': 'long'}}


In [50]:
pprint.pprint(
    T.StructType(
        [
            T.StructField(
                "map_example", T.MapType(T.StringType(), T.LongType())
            ),
            T.StructField("array_example", T.ArrayType(T.StringType())),
        ]
    ).jsonValue()
)

{'fields': [{'metadata': {},
             'name': 'map_example',
             'nullable': True,
             'type': {'keyType': 'string',
                      'type': 'map',
                      'valueContainsNull': True,
                      'valueType': 'long'}},
            {'metadata': {},
             'name': 'array_example',
             'nullable': True,
             'type': {'containsNull': True,
                      'elementType': 'string',
                      'type': 'array'}}],
 'type': 'struct'}


In [51]:
other_shows_schema = T.StructType.fromJson(
    json.loads(shows_with_schema.schema.json())
)

print(other_shows_schema == shows_with_schema.schema)

True


#### Putting it all together: Reducing duplicate data with complex data types

![Data Frame](images/with_json_final.png)

Since the beginning of the book, all of our data processing has tried to converge on a single table. If we want to avoid data duplication, keep the relationship information, and have a single table, then we can (and should!) use the data frame’s complex column types. In our shows data frame:
- Each record represents a show.
- A show has multiple episodes (array of structs column).
- Each episode has many fields (struct column within the array).
- Each show can have multiple genres (array of string column).
- Each show has a schedule (struct column).
- Each schedule belonging to a show can have multiple days (array) but a single time (string).

### Explode and Collect

This section covers how to use explode and collect operations to go from hierarchical to tabular and back. We cover the methods to break an array or a map into discrete records and how to get the records back into the original structure.


In [52]:
episodes = shows.select(
    "id", F.explode("_embedded.episodes").alias("episodes")
)

episodes.show(5, truncate=70)

+---+----------------------------------------------------------------------+
| id|                                                              episodes|
+---+----------------------------------------------------------------------+
|143|{{{http://api.tvmaze.com/episodes/10897}}, 2014-04-06, 2014-04-07T0...|
|143|{{{http://api.tvmaze.com/episodes/10898}}, 2014-04-13, 2014-04-14T0...|
|143|{{{http://api.tvmaze.com/episodes/10899}}, 2014-04-20, 2014-04-21T0...|
|143|{{{http://api.tvmaze.com/episodes/10900}}, 2014-04-27, 2014-04-28T0...|
|143|{{{http://api.tvmaze.com/episodes/10901}}, 2014-05-04, 2014-05-05T0...|
+---+----------------------------------------------------------------------+
only showing top 5 rows



In [53]:
episodes.count()

53

Explode can also happen with maps: the keys and values will be exploded in two different fields. There is also a second type of explosion: `posexplode()`. The “pos” stands for position: it explodes the column and returns an additional column before the data that contains the position as an integer.

In [54]:
episode_name_id = shows.select(
    F.map_from_arrays(
        F.col("_embedded.episodes.id"), F.col("_embedded.episodes.name")
    ).alias("name_id")
)

episode_name_id = episode_name_id.select(
    F.posexplode("name_id").alias("position", "id", "name")
)

episode_name_id.show(5)

+--------+-----+--------------------+
|position|   id|                name|
+--------+-----+--------------------+
|       0|10897|Minimum Viable Pr...|
|       1|10898|       The Cap Table|
|       2|10899|Articles of Incor...|
|       3|10900|    Fiduciary Duties|
|       4|10901|      Signaling Risk|
+--------+-----+--------------------+
only showing top 5 rows



Both `explode()` and `posexplode()` will skip records whose array or map is null or empty. If you want to keep those records, with null in place of the exploded values, you can use `explode_outer()` or `posexplode_outer()` the same way.

Now that we have exploded data frames, we’ll do the opposite by collecting our
records into a complex column. 

In [55]:
collected = episodes.groupby("id").agg(
    F.collect_list("episodes").alias("episodes")
)

collected.count() 

1

In [56]:
collected.printSchema()

root
 |-- id: long (nullable = true)
 |-- episodes: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |-- airdate: string (nullable = true)
 |    |    |-- airstamp: string (nullable = true)
 |    |    |-- airtime: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- image: struct (nullable = true)
 |    |    |    |-- medium: string (nullable = true)
 |    |    |    |-- original: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- number: long (nullable = true)
 |    |    |-- runtime: long (nullable = true)
 |    |    |-- season: long (nullable = true)
 |    |    |-- summary: string (nullable = true)
 |    |    |-- url: string (nullable = true)



### Building your own hierarchies: Struct as a function

This section shows how you can create structs within a data frame. With this last tool in your toolbox, the structure of a data frame will have no secrets for you.

To create a struct, we use the `struct()` function from the `pyspark.sql.functions` module. This function takes a number of columns as parameters (just like `select()`) and returns a struct column containing the columns passed as parameters as fields.

In [59]:
struct_ex = shows.select(
    F.struct(
        F.col("status"), F.col("weight"), F.lit(True).alias("has_watched")
    ).alias("info")
)

struct_ex.show(1,False)

+-----------------+
|info             |
+-----------------+
|{Ended, 96, true}|
+-----------------+



In [60]:
struct_ex.printSchema()

root
 |-- info: struct (nullable = false)
 |    |-- status: string (nullable = true)
 |    |-- weight: long (nullable = true)
 |    |-- has_watched: boolean (nullable = false)



