```{figure} ../images/banner.png
---
align: center
name: banner
---
```

# Chapter 15: Aggregate Operations

## Chapter Learning Objectives

- Various aggregate operations on DataFrame columns, with and without grouping.

## Chapter Outline

- [1. Dataframe Aggregation](#1)
    - [1a. agg](#2)
    - [1b. avg](#3)
    - [1c. count](#4)
    - [1d. max](#5)
    - [1e. min](#6)
    - [1f. mean](#7)
    - [1g. sum](#8)
    - [1h. pivot](#9)

In [1]:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
from IPython.display import display_html
import pandas as pd 
import numpy as np
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html(index=False)
        html_str+= "\xa0\xa0\xa0"*10
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)
space = "\xa0" * 10

In [2]:
import panel as pn

css = """
div.special_table + table, th, td {
  border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])

<a id='1'></a>

##  Chapter Outline - Gallery

<div class="special_table"></div>

Click on any of the images below | To return to this image gallery, click "Chapter Outline - Gallery" under Contents in the top-right corner
- | - 
[![alt](img/chapter9/1.png)](#2)| [![alt](img/chapter9/2.png)](#3)
[![alt](img/chapter9/3.png)](#4)| [![alt](img/chapter9/4.png)](#5)
[![alt](img/chapter9/5.png)](#6)| [![alt](img/chapter9/6.png)](#7)
[![alt](img/chapter9/7.png)](#8)|

<a id='2'></a>

## 1a. DataFrame Aggregations

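All examples in this chapter use a small sales DataFrame: each row records the employee (`emp_id`), the `region`, the `sales` amount, and the `customer`.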

```{figure} img/chapter9/1a.png
---
align: center
---
```

In [40]:
df = spark.createDataFrame([(1,"north",100,"walmart"),(2,"south",300,"apple"),(3,"west",200,"google"),
                            (1,"east",200,"google"),(2,"north",100,"walmart"),(3,"west",300,"apple"),
                            (1,"north",200,"walmart"),(2,"east",500,"google"),(3,"west",400,"apple"),],
                          ["emp_id","region","sales","customer"])
                     
df.toPandas()#show(truncate=False)


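| | emp_id | region | sales | customer |
|---|---|---|---|---|
| 0 | 1 | north | 100 | walmart |
| 1 | 2 | south | 300 | apple |
| 2 | 3 | west | 200 | google |
| 3 | 1 | east | 200 | google |
| 4 | 2 | north | 100 | walmart |
| 5 | 3 | west | 300 | apple |
| 6 | 1 | north | 200 | walmart |
| 7 | 2 | east | 500 | google |
| 8 | 3 | west | 400 | apple |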


In [44]:
print(df.sort('customer').toPandas().to_string(index=False))#show()

 emp_id region  sales customer
      2  south    300    apple
      3   west    300    apple
      3   west    400    apple
      3   west    200   google
      1   east    200   google
      2   east    500   google
      1  north    100  walmart
      2  north    100  walmart
      1  north    200  walmart


<a id='3'></a>

## 1b. How to compute aggregates with agg?


```{figure} img/chapter9/1b.png
---
align: center
---
```

<b>Input:  Spark dataframe `df` created in 1a</b>

In [25]:
df.agg({"sales": "sum"}).show()

+----------+
|sum(sales)|
+----------+
|      2300|
+----------+



In [26]:
df.agg({"sales": "min"}).show()

+----------+
|min(sales)|
+----------+
|       100|
+----------+



In [27]:
df.agg({"sales": "max"}).show()

+----------+
|max(sales)|
+----------+
|       500|
+----------+



In [28]:
df.agg({"sales": "count"}).show()

+------------+
|count(sales)|
+------------+
|           9|
+------------+



In [32]:
df.agg({"sales": "mean"}).show()

+------------------+
|        avg(sales)|
+------------------+
|255.55555555555554|
+------------------+



In [36]:
df.agg({"sales": "mean","customer":"count"}).show()

+------------------+---------------+
|        avg(sales)|count(customer)|
+------------------+---------------+
|255.55555555555554|              9|
+------------------+---------------+
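The dictionary passed to `agg` maps a column name to the name of an aggregate function, and the result column is named `function(column)`. Note that `"mean"` is reported as `avg(sales)` in the output above. As a rough mental model, the plain-Python sketch below mimics that behaviour on the same nine rows. The helper `agg_spec` and the `FUNCS` table are hypothetical names used only for illustration; this is not how Spark evaluates the query.

```python
# Plain-Python mental model of df.agg({"column": "function"}).
# agg_spec and FUNCS are illustrative names, not part of PySpark.
rows = [
    (1, "north", 100, "walmart"), (2, "south", 300, "apple"), (3, "west", 200, "google"),
    (1, "east", 200, "google"), (2, "north", 100, "walmart"), (3, "west", 300, "apple"),
    (1, "north", 200, "walmart"), (2, "east", 500, "google"), (3, "west", 400, "apple"),
]
columns = ["emp_id", "region", "sales", "customer"]

# Each entry: (label used in the result column name, reducer over the values).
FUNCS = {
    "sum": ("sum", sum),
    "min": ("min", min),
    "max": ("max", max),
    "count": ("count", len),
    "mean": ("avg", lambda xs: sum(xs) / len(xs)),  # "mean" is shown as avg(...)
    "avg": ("avg", lambda xs: sum(xs) / len(xs)),
}

def agg_spec(rows, columns, spec):
    """Apply {column: function_name} to all rows and return one result row."""
    result = {}
    for col, fn_name in spec.items():
        idx = columns.index(col)
        # Null (None) values are skipped before reducing.
        values = [r[idx] for r in rows if r[idx] is not None]
        label, fn = FUNCS[fn_name]
        result[f"{label}({col})"] = fn(values)
    return result

summary = agg_spec(rows, columns, {"sales": "mean", "customer": "count"})
print(summary)
```

The returned dictionary has the same two column names as the `show()` output above, `avg(sales)` and `count(customer)`.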



In [38]:
# agg() needs Column expressions or a {column: function} dict; a bare column name fails
df.agg("sales")

AssertionError: all exprs should be Column

In [45]:
df.groupby("emp_id").agg({"sales": "sum"}).orderBy('emp_id').toPandas()#show()

| | emp_id | sum(sales) |
|---|---|---|
| 0 | 1 | 500 |
| 1 | 2 | 900 |
| 2 | 3 | 900 |


In [46]:
df.groupby("emp_id").agg({"sales": "max"}).orderBy('emp_id').toPandas()

| | emp_id | max(sales) |
|---|---|---|
| 0 | 1 | 200 |
| 1 | 2 | 500 |
| 2 | 3 | 400 |


In [50]:
df.groupby("emp_id").agg({"sales": "last"}).orderBy('emp_id').toPandas()

| | emp_id | last(sales) |
|---|---|---|
| 0 | 1 | 200 |
| 1 | 2 | 500 |
| 2 | 3 | 400 |


In [22]:
df.groupby("region").agg({"sales": "sum"}).orderBy('region').show()

+------+----------+
|region|sum(sales)|
+------+----------+
|  east|       700|
| north|       400|
| south|       300|
|  west|       900|
+------+----------+



In [23]:
df.groupby("customer").agg({"sales": "sum"}).orderBy('customer').show()

+--------+----------+
|customer|sum(sales)|
+--------+----------+
|   apple|      1000|
|  google|       900|
| walmart|       400|
+--------+----------+
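Conceptually, `groupby(...).agg(...)` first splits the rows into one bucket per distinct key and then reduces each bucket to a single value. The following is a minimal plain-Python sketch of that two-step idea on the same data. `group_sum` is a hypothetical helper, and Spark's distributed execution works differently under the hood.

```python
from collections import defaultdict

# Same nine rows as the sales DataFrame: (emp_id, region, sales, customer).
rows = [
    (1, "north", 100, "walmart"), (2, "south", 300, "apple"), (3, "west", 200, "google"),
    (1, "east", 200, "google"), (2, "north", 100, "walmart"), (3, "west", 300, "apple"),
    (1, "north", 200, "walmart"), (2, "east", 500, "google"), (3, "west", 400, "apple"),
]
columns = ["emp_id", "region", "sales", "customer"]

def group_sum(rows, columns, key, value):
    """Sum `value` per distinct `key`, returned sorted by key (like orderBy)."""
    k, v = columns.index(key), columns.index(value)
    totals = defaultdict(int)
    for row in rows:
        totals[row[k]] += row[v]   # step 1 (bucket by key) and step 2 (reduce) in one pass
    return sorted(totals.items())

by_region = group_sum(rows, columns, "region", "sales")
by_customer = group_sum(rows, columns, "customer", "sales")
print(by_region)    # [('east', 700), ('north', 400), ('south', 300), ('west', 900)]
print(by_customer)  # [('apple', 1000), ('google', 900), ('walmart', 400)]
```

The printed totals agree with the `sum(sales)` columns in the grouped outputs above.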



<b>Output :  Spark dataframe with one column per map key, holding that key's value </b>

In [14]:
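# df1 is not created earlier in this chapter; this one-row map DataFrame matches the input shown in the summary
df1 = spark.createDataFrame([({"a": 1, "b": 2, "c": 3},)], ["data"])
df_map = df1.select(df1.data.a.alias("a"), df1.data.b.alias("b"), df1.data.c.alias("c") )
df_map.show()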

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+



<b> Summary:</b>

In [15]:
print("Input                     ",            "Output")
display_side_by_side(df1.toPandas(),df_map.toPandas())

Input                      Output


| data |
|---|
| {'a': 1, 'b': 2, 'c': 3} |

| a | b | c |
|---|---|---|
| 1 | 2 | 3 |


<a id='4'></a>

## 1c. How to extract the keys from a map column?


```{figure} img/chapter9/1c.png
---
align: center
---
```

Let's first understand the syntax



```{admonition} Syntax
<b>pyspark.sql.functions.map_keys(col)</b>

Returns an unordered array containing the keys of the map.


<b>Parameters</b>:

- col – name of column or expression


```

<b>Input:  Spark data frame consisting of a map column </b>

In [22]:
df2 = spark.createDataFrame([({"a":1,"b":"2","c":3},)],["data"])
df2.show(truncate=False)

+----------------------+
|data                  |
+----------------------+
|[a -> 1, b ->, c -> 3]|
+----------------------+



<b>Output :  Spark data frame consisting of a column of keys </b>

In [23]:
from pyspark.sql.functions import map_keys
df_keys = df2.select(map_keys(df2.data).alias("keys"))
df_keys.show()

+---------+
|     keys|
+---------+
|[a, b, c]|
+---------+



<b> Summary:</b>

In [24]:
print("input                     ",            "output")
display_side_by_side(df2.toPandas(),df_keys.toPandas())

input                      output


| data |
|---|
| {'a': 1, 'b': None, 'c': 3} |

| keys |
|---|
| [a, b, c] |


<a id='5'></a>

## 1d. How to extract the values from a map column?

`map_values` returns the values of a map column as an array, one element per key.

```{figure} img/chapter9/1d.png
---
align: center
---
```

Let's first understand the syntax

```{admonition} Syntax

<b>pyspark.sql.functions.map_values(col)</b>

Returns an unordered array containing the values of the map.

<b>Parameters</b>
- col – name of column or expression
```

<b>Input:  Spark data frame consisting of a map column  </b>

In [30]:
df3 = spark.createDataFrame([({"a":1,"b":"2","c":3},)],["data"])
df3.show(truncate=False)

+----------------------+
|data                  |
+----------------------+
|[a -> 1, b ->, c -> 3]|
+----------------------+



<b>Output :  Spark data frame consisting of a column of values </b>

In [31]:
from pyspark.sql.functions import map_values
df_values = df3.select(map_values(df3.data).alias("values"))
df_values.show()

+-------+
| values|
+-------+
|[1,, 3]|
+-------+
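For intuition, a Spark map value behaves much like a Python `dict`: `map_keys` corresponds to collecting the keys and `map_values` to collecting the values. The sketch below uses the row as it appears after loading, with `b` already null. Treating a map as a `dict` is only an analogy, and Spark documents the returned arrays as unordered.

```python
# Analogy only: a Python dict standing in for one row of the map column.
data = {"a": 1, "b": None, "c": 3}   # "b" is null, as in the output above

keys = list(data.keys())      # counterpart of map_keys(data)
values = list(data.values())  # counterpart of map_values(data)

print(keys)    # ['a', 'b', 'c']
print(values)  # [1, None, 3]
```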



<b> Summary:</b>

In [33]:
print("Input                     ",            "Output")
display_side_by_side(df3.toPandas(),df_values.toPandas())

Input                      Output


| data |
|---|
| {'a': 1, 'b': None, 'c': 3} |

| values |
|---|
| [1, None, 3] |


<a id='6'></a>

## 1e. How to convert a map column into an array column?

`map_entries` turns a map column into an array of key-value pairs.

```{figure} img/chapter9/1e.png
---
align: center
---
```

Let's first understand the syntax

```{admonition} Syntax
<b>pyspark.sql.functions.map_entries(col)</b>

Returns an unordered array of all entries in the given map.

<b>Parameters</b>
- col – name of column or expression
```

<b>Input:  Spark data frame with map column </b>

In [38]:
df4 = spark.createDataFrame([({"a":1,"b": 2,"c":3},)],["data"])
df4.show(truncate=False)

+------------------------+
|data                    |
+------------------------+
|[a -> 1, b -> 2, c -> 3]|
+------------------------+



<b>Output :  Spark dataframe containing an array</b>

In [39]:
from pyspark.sql.functions import map_entries
df_array = df4.select(map_entries(df4.data).alias("array"))
df_array.show(truncate=False)

+------------------------+
|array                   |
+------------------------+
|[[a, 1], [b, 2], [c, 3]]|
+------------------------+
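In plain Python, `map_entries` is comparable to `dict.items()`: each entry becomes a (key, value) pair, and the pairs carry enough information to rebuild the mapping. A small sketch, offered as an analogy rather than Spark code:

```python
# Analogy only: a Python dict standing in for one row of the map column.
data = {"a": 1, "b": 2, "c": 3}

entries = list(data.items())   # counterpart of map_entries(data)
print(entries)                 # [('a', 1), ('b', 2), ('c', 3)]

# The pairs are enough to rebuild the original mapping.
rebuilt = dict(entries)
print(rebuilt == data)         # True
```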



<b> Summary:</b>

In [40]:
print("Input                     ",            "Output")
display_side_by_side(df4.toPandas(),df_array.toPandas())

Input                      Output


| data |
|---|
| {'a': 1, 'b': 2, 'c': 3} |

| array |
|---|
| [(a, 1), (b, 2), (c, 3)] |


<a id='7'></a>

## 1f. How to create a map column from multiple array columns?



```{figure} img/chapter9/1f.png
---
align: center
---
```

Let's first understand the syntax

```{admonition} Syntax
<b>pyspark.sql.functions.map_from_arrays(col1, col2)</b>

Creates a new map from two arrays.

<b>Parameters</b>
- col1 – name of column containing a set of keys. No key may be null
- col2 – name of column containing a set of values

```

<b>Input:  Spark data frame with two array columns (keys and values) </b>

In [43]:
df5 = spark.createDataFrame([([2, 5], ['a', 'b'])], ['k', 'v'])
df5.show()

+------+------+
|     k|     v|
+------+------+
|[2, 5]|[a, b]|
+------+------+



<b>Output :  Spark data frame with a map column built from the two arrays</b>

In [45]:
from pyspark.sql.functions import map_from_arrays
df_map1 = df5.select(map_from_arrays(df5.k, df5.v).alias("map"))
df_map1.show()

+----------------+
|             map|
+----------------+
|[2 -> a, 5 -> b]|
+----------------+
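`map_from_arrays` pairs the i-th key with the i-th value. The plain-Python sketch below mirrors that pairing with `zip` and adds two checks: keys may not be null (as stated in the syntax note), and, as an assumption of this sketch, both arrays must have the same length. `map_from_arrays_py` is a hypothetical helper name, not a PySpark function.

```python
def map_from_arrays_py(keys, values):
    """Pair keys[i] with values[i]; a plain-Python analogy of map_from_arrays."""
    if any(k is None for k in keys):
        raise ValueError("map keys must not be null")
    if len(keys) != len(values):
        # Assumption of this sketch: mismatched lengths are rejected.
        raise ValueError("keys and values must have the same length")
    return dict(zip(keys, values))

result = map_from_arrays_py([2, 5], ["a", "b"])
print(result)  # {2: 'a', 5: 'b'}
```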



<b> Summary:</b>

In [46]:
print("Input                     ",            "Output")
display_side_by_side(df5.toPandas(),df_map1.toPandas())

Input                      Output


| k | v |
|---|---|
| [2, 5] | [a, b] |

| map |
|---|
| {5: 'b', 2: 'a'} |


<a id='8'></a>

## 1g. How to combine multiple map columns into one?

`map_concat` merges two or more map columns into a single map column.

```{figure} img/chapter9/1g.png
---
align: center
---
```

Let's first understand the syntax

```{admonition} Syntax
<b>pyspark.sql.functions.map_concat(*cols)</b>

Returns the union of all the given maps.

<b>Parameters</b>

- cols – column names or Column expressions, one per map to combine

```

<b>Input:  Spark data frame with multiple map columns </b>

In [51]:
df6 = spark.sql("SELECT map(1, 'a', 2, 'b') as map1, map(3, 'c') as map2")
df6.show()

+----------------+--------+
|            map1|    map2|
+----------------+--------+
|[1 -> a, 2 -> b]|[3 -> c]|
+----------------+--------+



<b>Output :  Spark data frame with a single map column combining both maps</b>

In [52]:
from pyspark.sql.functions import map_concat
df_com = df6.select(map_concat("map1", "map2").alias("combined_map"))
df_com.show(truncate=False)

+------------------------+
|combined_map            |
+------------------------+
|[1 -> a, 2 -> b, 3 -> c]|
+------------------------+
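`map_concat` merges the entries of all input maps into one map. A plain-Python analogy is merging dictionaries. The sketch below lets a later map overwrite an earlier one when a key repeats, which is an assumption of this example only (how Spark handles duplicate keys can depend on version and configuration). `map_concat_py` is a hypothetical name.

```python
def map_concat_py(*maps):
    """Merge dicts left to right; later maps win on duplicate keys (assumption)."""
    combined = {}
    for m in maps:
        combined.update(m)
    return combined

map1 = {1: "a", 2: "b"}
map2 = {3: "c"}
combined = map_concat_py(map1, map2)
print(combined)  # {1: 'a', 2: 'b', 3: 'c'}
```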



<b> Summary:</b>

In [53]:
print("Input                     ",            "Output")
display_side_by_side(df6.toPandas(),df_com.toPandas())

Input                      Output


| map1 | map2 |
|---|---|
| {1: 'a', 2: 'b'} | {3: 'c'} |

| combined_map |
|---|
| {1: 'a', 2: 'b', 3: 'c'} |


<a id='9'></a>
