## Projecting Struct and Map Columns

In this topic we will see how to project fields from `STRUCT` and `MAP` columns.

* Create a list with appropriate types.
* Create a Data Frame from the list and define a schema with the relevant types.
* Print the schema and preview the data.
* Project the fields in the struct and map columns.

Let us start the Spark session for this notebook so that we can execute the code provided.

In [1]:
import getpass

from pyspark.sql import SparkSession

# Use the OS user name to keep each user's warehouse directory separate.
username = getpass.getuser()

# Setting spark.ui.port to 0 lets Spark pick a free port for its UI.
spark = (
    SparkSession.builder
    .config('spark.ui.port', '0')
    .config('spark.sql.warehouse.dir', f'/user/{username}/warehouse')
    .enableHiveSupport()
    .appName(f'{username} | Python - Special Data Types')
    .master('yarn')
    .getOrCreate()
)

If you prefer to work from a CLI, you can use Spark SQL through one of the following three approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using PySpark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [3]:
employees = [
     (2, "Henry", "Ford", 1250.0, 
      "India", ['henry@ford.com', 'hford@companyx.com'], 
      {"Home": "+91 234 567 8901", "Office": "+91 345 678 9012"}, 
      "456 78 9123", ('111 BCD Cir', 'Some City', 'Some State', 500091)
     ),
     (3, "Nick", "Junior", 750.0, 
      "United Kingdom", ['nick@junior.com', 'njunior@companyx.com'], 
      {"Home": "+44 111 111 1111", "Office": "+44 222 222 2222"}, 
      "222 33 4444", ('222 Giant Cly', 'UK City', 'UK Province', None)
     ),
     (4, "Bill", "Gomes", 1500.0, 
      "Australia", ['bill@gomes.com', 'bgomes@companyx.com'], 
      {"Home": "+61 987 654 3210", "Office": "+61 876 543 2109"}, 
      "789 12 6118", None
     )
]

In [4]:
employees_df = spark.createDataFrame(
    employees,
    schema="""employee_id INT, employee_first_name STRING, employee_last_name STRING,
        employee_salary FLOAT, employee_nationality STRING, employee_email_ids ARRAY<STRING>,
        employee_phone_numbers MAP<STRING, STRING>, employee_ssn STRING,
        employee_address STRUCT<street: STRING, city: STRING, state: STRING, postal_code: INT>
    """
)

In [5]:
employees_df.printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- employee_first_name: string (nullable = true)
 |-- employee_last_name: string (nullable = true)
 |-- employee_salary: float (nullable = true)
 |-- employee_nationality: string (nullable = true)
 |-- employee_email_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- employee_phone_numbers: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- employee_ssn: string (nullable = true)
 |-- employee_address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- postal_code: integer (nullable = true)



In [6]:
employees_df.show(truncate=False)

+-----------+-------------------+------------------+---------------+--------------------+---------------------------------------+------------------------------------------------------+------------+--------------------------------------------+
|employee_id|employee_first_name|employee_last_name|employee_salary|employee_nationality|employee_email_ids                     |employee_phone_numbers                                |employee_ssn|employee_address                            |
+-----------+-------------------+------------------+---------------+--------------------+---------------------------------------+------------------------------------------------------+------------+--------------------------------------------+
|2          |Henry              |Ford              |1250.0         |India               |[henry@ford.com, hford@companyx.com]   |[Office -> +91 345 678 9012, Home -> +91 234 567 8901]|456 78 9123 |[111 BCD Cir, Some City, Some State, 500091]|
|3          |Nick           

In [8]:
employees_df.select('employee_phone_numbers').show()

+----------------------+
|employee_phone_numbers|
+----------------------+
|  [Office -> +91 34...|
|  [Office -> +44 22...|
|  [Office -> +61 87...|
+----------------------+



In [11]:
employees_df.select('employee_phone_numbers.Office', 'employee_phone_numbers.Home').show()

+----------------+----------------+
|          Office|            Home|
+----------------+----------------+
|+91 345 678 9012|+91 234 567 8901|
|+44 222 222 2222|+44 111 111 1111|
|+61 876 543 2109|+61 987 654 3210|
+----------------+----------------+
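
Map values can also be addressed with bracket syntax inside SQL expressions, which makes it easy to alias each projected key. The sketch below only builds the expression strings; `map_key_exprs` is a hypothetical helper introduced for illustration (it is not part of PySpark), and it assumes the keys contain no single quotes.

```python
def map_key_exprs(map_column, keys):
    """Build Spark SQL expressions that project one column per map key.

    Hypothetical helper, not a PySpark API. Assumes the keys contain no
    single quotes, because they are embedded in SQL string literals.
    """
    exprs = []
    for key in keys:
        # Alias each projected value with the lower-cased key name.
        exprs.append(f"{map_column}['{key}'] AS {key.lower()}")
    return exprs


phone_exprs = map_key_exprs('employee_phone_numbers', ['Office', 'Home'])
print(phone_exprs)
```

With the `employees_df` defined earlier, the generated expressions can be passed on as `employees_df.selectExpr('employee_id', *phone_exprs).show()`.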



In [19]:
from pyspark.sql.functions import map_keys, map_values

In [26]:
employees_df.select('employee_id', map_keys('employee_phone_numbers').alias('employee_phone_numbers_keys')).show()

+-----------+---------------------------+
|employee_id|employee_phone_numbers_keys|
+-----------+---------------------------+
|          2|             [Office, Home]|
|          3|             [Office, Home]|
|          4|             [Office, Home]|
+-----------+---------------------------+



In [25]:
employees_df.select('employee_id', map_values('employee_phone_numbers').alias('employee_phone_numbers_values')).show()

+-----------+-----------------------------+
|employee_id|employee_phone_numbers_values|
+-----------+-----------------------------+
|          2|         [+91 345 678 9012...|
|          3|         [+44 222 222 2222...|
|          4|         [+61 876 543 2109...|
+-----------+-----------------------------+



In [22]:
from pyspark.sql.functions import explode

In [24]:
employees_df.select('employee_id', explode(map_values('employee_phone_numbers')).alias('employee_phone_number')).show()

+-----------+---------------------+
|employee_id|employee_phone_number|
+-----------+---------------------+
|          2|     +91 345 678 9012|
|          2|     +91 234 567 8901|
|          3|     +44 222 222 2222|
|          3|     +44 111 111 1111|
|          4|     +61 876 543 2109|
|          4|     +61 987 654 3210|
+-----------+---------------------+
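
To see what `explode` does to each row, the same flattening can be modelled in plain Python. This is only a sketch of the semantics, not Spark code: `explode_phone_numbers` is a hypothetical helper, the sample rows are abbreviated to the id and the phone map, and the order of rows within an employee follows Python's dict order rather than Spark's. Like `explode`, the model produces no rows for a null or empty map.

```python
# Abbreviated copies of the sample rows: (employee_id, phone_numbers map).
employee_phones = [
    (2, {"Home": "+91 234 567 8901", "Office": "+91 345 678 9012"}),
    (3, {"Home": "+44 111 111 1111", "Office": "+44 222 222 2222"}),
    (4, {"Home": "+61 987 654 3210", "Office": "+61 876 543 2109"}),
]


def explode_phone_numbers(rows):
    """Yield one (employee_id, phone_number) pair per map value."""
    for employee_id, phone_numbers in rows:
        # A missing or empty map contributes no output rows.
        for phone_number in (phone_numbers or {}).values():
            yield employee_id, phone_number


exploded = list(explode_phone_numbers(employee_phones))
for row in exploded:
    print(row)
```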



In [12]:
employees_df.select('employee_address').printSchema()

root
 |-- employee_address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- postal_code: integer (nullable = true)



In [13]:
employees_df.select('employee_address.*').printSchema()

root
 |-- street: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- postal_code: integer (nullable = true)



In [14]:
employees_df.select('employee_address.*').show()

+-------------+---------+-----------+-----------+
|       street|     city|      state|postal_code|
+-------------+---------+-----------+-----------+
|  111 BCD Cir|Some City| Some State|     500091|
|222 Giant Cly|  UK City|UK Province|       null|
|         null|     null|       null|       null|
+-------------+---------+-----------+-----------+
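
The last row above is all `null` because employee 4 has no address: expanding a null struct yields a null for every field. A plain-Python model of that behaviour is sketched below. `expand_struct` is a hypothetical helper (not a PySpark function), and the address values are copied from the sample data.

```python
ADDRESS_FIELDS = ("street", "city", "state", "postal_code")

addresses = [
    ("111 BCD Cir", "Some City", "Some State", 500091),
    ("222 Giant Cly", "UK City", "UK Province", None),
    None,  # employee 4 has no address at all
]


def expand_struct(struct_value, field_names):
    """Return a dict with one entry per struct field."""
    if struct_value is None:
        # A null struct expands to null in every field.
        return dict.fromkeys(field_names)
    return dict(zip(field_names, struct_value))


expanded = [expand_struct(address, ADDRESS_FIELDS) for address in addresses]
for row in expanded:
    print(row)
```

In Spark itself, a single field can be projected with the same dot notation used for the map keys, for example `employees_df.select('employee_address.city')`.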

