## Dealing with Arrays
Let us understand how to deal with array type columns in a Data Frame.

Let us start the Spark session for this Notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Special Data Types'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can use Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [2]:
employees = [
     (2, "Henry", "Ford", 1250.0, 
      "India", ['henry@ford.com', 'hford@companyx.com'], 
      {"Home": "+91 234 567 8901", "Office": "+91 345 678 9012"}, 
      "456 78 9123", ('111 BCD Cir', 'Some City', 'Some State', 500091)
     ),
     (3, "Nick", "Junior", 750.0, 
      "United Kingdom", ['nick@junior.com', 'njunior@companyx.com'], 
      {"Home": "+44 111 111 1111", "Office": "+44 222 222 2222"}, 
      "222 33 4444", ('222 Giant Cly', 'UK City', 'UK Province', None)
     ),
     (4, "Bill", "Gomes", 1500.0, 
      "Australia", ['bill@gomes.com', 'bgomes@companyx.com'], 
      {"Home": "+61 987 654 3210", "Office": "+61 876 543 2109"}, 
      "789 12 6118", None
     )
]

In [3]:
employees_df = spark.createDataFrame(
    employees,
    schema="""employee_id INT, employee_first_name STRING, employee_last_name STRING,
        employee_salary FLOAT, employee_nationality STRING, employee_email_ids ARRAY<STRING>,
        employee_phone_numbers MAP<STRING, STRING>, employee_ssn STRING,
        employee_address STRUCT<street: STRING, city: STRING, state: STRING, postal_code: INT>
    """
)

In [4]:
employees_df.printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- employee_first_name: string (nullable = true)
 |-- employee_last_name: string (nullable = true)
 |-- employee_salary: float (nullable = true)
 |-- employee_nationality: string (nullable = true)
 |-- employee_email_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- employee_phone_numbers: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- employee_ssn: string (nullable = true)
 |-- employee_address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- postal_code: integer (nullable = true)



In [5]:
employees_df.show(truncate=False)

+-----------+-------------------+------------------+---------------+--------------------+---------------------------------------+------------------------------------------------------+------------+--------------------------------------------+
|employee_id|employee_first_name|employee_last_name|employee_salary|employee_nationality|employee_email_ids                     |employee_phone_numbers                                |employee_ssn|employee_address                            |
+-----------+-------------------+------------------+---------------+--------------------+---------------------------------------+------------------------------------------------------+------------+--------------------------------------------+
|2          |Henry              |Ford              |1250.0         |India               |[henry@ford.com, hford@companyx.com]   |[Office -> +91 345 678 9012, Home -> +91 234 567 8901]|456 78 9123 |[111 BCD Cir, Some City, Some State, 500091]|
|3          |Nick           

In [6]:
employees_df.select('employee_email_ids').show(truncate=False)

+---------------------------------------+
|employee_email_ids                     |
+---------------------------------------+
|[henry@ford.com, hford@companyx.com]   |
|[nick@junior.com, njunior@companyx.com]|
|[bill@gomes.com, bgomes@companyx.com]  |
+---------------------------------------+



* We can use the `explode` function to explode an array into multiple rows, one row per array element. Let us get the employee id with the email ids exploded into multiple rows.

In [7]:
employees_df.count()

3

In [8]:
from pyspark.sql.functions import explode

In [11]:
employees_df.select('employee_id', explode('employee_email_ids').alias('employee_email_id')).show(truncate=False)

+-----------+--------------------+
|employee_id|employee_email_id   |
+-----------+--------------------+
|2          |henry@ford.com      |
|2          |hford@companyx.com  |
|3          |nick@junior.com     |
|3          |njunior@companyx.com|
|4          |bill@gomes.com      |
|4          |bgomes@companyx.com |
+-----------+--------------------+



In [12]:
employees_df.select('employee_id', explode('employee_email_ids')).count()

6
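
To make the row arithmetic above concrete, here is a plain-Python sketch of what exploding an array column does. It is only a model of the behaviour, not Spark code: the helper `explode_rows` and the output field name `employee_email_id` are hypothetical names chosen for illustration. It also models the fact that `explode` produces no rows for a null or empty array.

```python
def explode_rows(rows, id_field, array_field, out_field):
    """Return one output row per array element, like `explode` on an array column."""
    exploded = []
    for row in rows:
        # A missing (None) or empty array contributes no output rows.
        for element in row.get(array_field) or []:
            exploded.append({id_field: row[id_field], out_field: element})
    return exploded


# Same ids and email ids as the employees data used in this Notebook.
employees = [
    {"employee_id": 2, "employee_email_ids": ["henry@ford.com", "hford@companyx.com"]},
    {"employee_id": 3, "employee_email_ids": ["nick@junior.com", "njunior@companyx.com"]},
    {"employee_id": 4, "employee_email_ids": ["bill@gomes.com", "bgomes@companyx.com"]},
]

exploded = explode_rows(
    employees, "employee_id", "employee_email_ids", "employee_email_id"
)
for row in exploded:
    print(row)

# The exploded row count equals the total number of array elements.
total_elements = sum(len(e["employee_email_ids"]) for e in employees)
print(len(exploded), total_elements)
```

With 3 employees holding 2 email ids each, the model yields 6 rows, which matches the counts shown above.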

* We can use `concat_ws` on the email ids to convert the array into a delimited string.

In [13]:
from pyspark.sql.functions import concat_ws

In [17]:
employees_df. \
    select('employee_id', concat_ws(', ', 'employee_email_ids').alias('employee_email_ids')). \
    show(truncate=False)

+-----------+-------------------------------------+
|employee_id|employee_email_ids                   |
+-----------+-------------------------------------+
|2          |henry@ford.com, hford@companyx.com   |
|3          |nick@junior.com, njunior@companyx.com|
|4          |bill@gomes.com, bgomes@companyx.com  |
+-----------+-------------------------------------+
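
The conversion above can be modelled in plain Python as a join over the array elements. This is a rough sketch rather than Spark code, and `concat_ws_model` is a hypothetical helper name. It reflects that `concat_ws` skips null elements instead of printing them. Treating a missing array as empty is an assumption of this sketch.

```python
def concat_ws_model(separator, values):
    """Join array elements with a separator, skipping None entries."""
    # Assumption of this sketch: a missing array is treated as an empty one.
    return separator.join(v for v in (values or []) if v is not None)


email_ids = ["henry@ford.com", "hford@companyx.com"]
joined = concat_ws_model(", ", email_ids)
print(joined)

# A None element is skipped rather than rendered in the output string.
joined_with_gap = concat_ws_model(", ", ["nick@junior.com", None, "njunior@companyx.com"])
print(joined_with_gap)
```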



* We can convert a delimited string into an array using the `split` function.

In [18]:
employees = [
     (2, "Henry", "Ford", 1250.0, 
      "India", 'henry@ford.com, hford@companyx.com', 
      {"Home": "+91 234 567 8901", "Office": "+91 345 678 9012"}, 
      "456 78 9123", ('111 BCD Cir', 'Some City', 'Some State', 500091)
     ),
     (3, "Nick", "Junior", 750.0, 
      "United Kingdom", 'nick@junior.com, njunior@companyx.com', 
      {"Home": "+44 111 111 1111", "Office": "+44 222 222 2222"}, 
      "222 33 4444", ('222 Giant Cly', 'UK City', 'UK Province', None)
     ),
     (4, "Bill", "Gomes", 1500.0, 
      "Australia", 'bill@gomes.com, bgomes@companyx.com', 
      {"Home": "+61 987 654 3210", "Office": "+61 876 543 2109"}, 
      "789 12 6118", None
     )
]

In [19]:
employees_df = spark.createDataFrame(
    employees,
    schema="""employee_id INT, employee_first_name STRING, employee_last_name STRING,
        employee_salary FLOAT, employee_nationality STRING, employee_email_ids STRING,
        employee_phone_numbers MAP<STRING, STRING>, employee_ssn STRING,
        employee_address STRUCT<street: STRING, city: STRING, state: STRING, postal_code: INT>
    """
)

In [20]:
employees_df. \
    select('employee_id', 'employee_email_ids'). \
    show(truncate=False)

+-----------+-------------------------------------+
|employee_id|employee_email_ids                   |
+-----------+-------------------------------------+
|2          |henry@ford.com, hford@companyx.com   |
|3          |nick@junior.com, njunior@companyx.com|
|4          |bill@gomes.com, bgomes@companyx.com  |
+-----------+-------------------------------------+



In [21]:
from pyspark.sql.functions import split

In [22]:
employees_df. \
    select('employee_id', split('employee_email_ids', ', ').alias('employee_email_ids')). \
    show(truncate=False)

+-----------+---------------------------------------+
|employee_id|employee_email_ids                     |
+-----------+---------------------------------------+
|2          |[henry@ford.com, hford@companyx.com]   |
|3          |[nick@junior.com, njunior@companyx.com]|
|4          |[bill@gomes.com, bgomes@companyx.com]  |
+-----------+---------------------------------------+



In [24]:
employees_df. \
    select('employee_id', split('employee_email_ids', ', ').alias('employee_email_ids')). \
    printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- employee_email_ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
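
One detail worth keeping in mind is that the second argument to `split` is a regular expression, not a literal delimiter. The plain-Python sketch below models this with `re.split`. It is not Spark code, `split_model` is a hypothetical helper name, and Python and Java regular expressions differ in some advanced syntax, although the simple patterns used here behave the same way in both.

```python
import re


def split_model(value, pattern=", "):
    """Split a delimited string into a list using a regular expression pattern."""
    # A missing value stays missing instead of becoming an empty list.
    if value is None:
        return None
    return re.split(pattern, value)


emails = "henry@ford.com, hford@companyx.com"
email_list = split_model(emails)
print(email_list)

# Joining with the same delimiter restores the original string.
print(", ".join(email_list) == emails)

# A pattern such as ',\s*' tolerates inconsistent spacing after the comma.
messy = "nick@junior.com,njunior@companyx.com,  nick@example.com"
print(split_model(messy, r",\s*"))
```

If the delimiter contains a character with special meaning in regular expressions, such as `|` or `.`, it has to be escaped in the pattern.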

