# PySpark DataFrame GroupBy & Aggregate

* [Aggregations with Spark (groupBy, cube, rollup)](https://mungingdata.com/apache-spark/aggregations/)

In [1]:
import os
import sys
import gc

---
# HDFS permission

For a user other than the Spark user to submit a job, that user needs a writable home directory on HDFS. Log in to the HDFS node as the hadoop user and run:

```
hadoop fs -mkdir /user/${USERNAME}
hadoop fs -chown ${USERNAME} /user/${USERNAME}
hadoop fs -chmod g+w /user/${USERNAME}
```

Otherwise, initializing the SparkContext fails with an error like:
```
21/08/15 21:15:28 ERROR SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied: user=${USERNAME}, access=WRITE, inode="/user":hadoop:hadoop:drwxrwxr-x
```

---
# Environment variables

## HADOOP_CONF_DIR

Copy the Hadoop configuration directory from the Hadoop/YARN master node to the local machine, then set the ```HADOOP_CONF_DIR``` environment variable to point to the local copy.

* [Launching Spark on YARN](http://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)

> Ensure that **HADOOP_CONF_DIR** or **YARN_CONF_DIR** points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. 

In [2]:
os.environ['HADOOP_CONF_DIR'] = "/opt/hadoop/hadoop-3.2.2/etc/hadoop"
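Before starting a session it can help to confirm that the directory really holds the client-side configuration. The following is a minimal sketch, not part of the original notebook: the helper name `missing_hadoop_conf_files` and the list of expected files are assumptions chosen for illustration.

```python
import os
from pathlib import Path

# Files a client-side Hadoop/YARN configuration directory normally contains.
# This short list is an assumption for illustration; extend it as needed.
EXPECTED_FILES = ("core-site.xml", "yarn-site.xml")


def missing_hadoop_conf_files(conf_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from conf_dir."""
    base = Path(conf_dir)
    return [name for name in expected if not (base / name).is_file()]


conf_dir = os.environ.get("HADOOP_CONF_DIR", "")
if not conf_dir:
    print("HADOOP_CONF_DIR is not set")
else:
    missing = missing_hadoop_conf_files(conf_dir)
    print("missing files:", missing if missing else "none")
```

An empty result only shows that the files exist; it does not validate their contents.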

## PYTHONPATH

Add the **pyspark** and **py4j** archives under ```$SPARK_HOME/python/lib``` in the Spark installation to the Python module search path.

* [PySpark Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)

> Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:

```
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
```

Alternatively, install **pyspark** locally with pip or conda, which also installs the Spark runtime libraries (standalone only).

* [Can PySpark work without Spark?](https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark)

> As of v2.2, executing pip install pyspark will install Spark. If you're going to use Pyspark it's clearly the simplest way to get started. On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars  
> PySpark has a Spark installation installed. If installed through pip3, you can find it with pip3 show pyspark. Ex. for me it is at ~/.local/lib/python3.8/site-packages/pyspark. This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.

In [3]:
# os.environ['PYTHONPATH'] = "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip:/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
sys.path.extend([
    "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip",
    "/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
])
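The hard-coded paths above break when the Spark or Py4J version changes. A possible alternative, sketched here in Python as a counterpart to the shell one-liner quoted earlier, is to glob the archives under `SPARK_HOME`. The helper names `spark_python_zips` and `add_spark_to_sys_path` are hypothetical, and the fallback path simply mirrors the installation used in this notebook.

```python
import glob
import os
import sys


def spark_python_zips(spark_home):
    """Return the sorted *.zip archives under <spark_home>/python/lib."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def add_spark_to_sys_path(spark_home):
    """Append archives not already on sys.path and return the ones added."""
    added = [z for z in spark_python_zips(spark_home) if z not in sys.path]
    sys.path.extend(added)
    return added


# Fall back to the installation path used in this notebook (an assumption).
spark_home = os.environ.get("SPARK_HOME", "/opt/spark/spark-3.1.2")
print(add_spark_to_sys_path(spark_home))
```

If the directory does not exist, the glob matches nothing and `sys.path` is left unchanged.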

---
# Spark Session


In [4]:
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.debug.maxToStringFields', 100) \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()

21/09/15 16:45:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/09/15 16:45:45 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


# PySpark DataFrame

* [SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html)

> Parameters
> * data: RDD or iterable  
>   an RDD of any kind of SQL data representation (e.g. Row, tuple, int, boolean, etc.), or list, or pandas.DataFrame.
> * schema: pyspark.sql.types.DataType, str or list, optional

In [6]:
df = spark.createDataFrame(
    data=[
        ("messi", 1), 
        ("ronald", 2), 
        ("messi", 3), 
        ("ronald", 4),
        ("messi", 5), 
    ], 
    schema=[
        "name", "goal"
    ]
)
df.show()

                                                                                

+------+----+
|  name|goal|
+------+----+
| messi|   1|
|ronald|   2|
| messi|   3|
|ronald|   4|
| messi|   5|
+------+----+



# GroupBy Aggregation

* [Grouping Data](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart.html#Grouping-Data)

> PySpark DataFrame also provides a way of handling grouped data by using the common approach, split-apply-combine strategy. It groups the data by a certain condition applies a function to each group and then combines them back to the DataFrame.

* [GroupedData.agg(*exprs)](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.agg.html)

> If exprs is a single dict mapping from string to string, then the key is the column to perform aggregation on, and the value is the aggregate function. Alternatively, exprs can also be a list of aggregate Column expressions.
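The split-apply-combine strategy quoted above can be illustrated without Spark. The sketch below is plain Python over the same rows as the DataFrame in this notebook; the function name `split_apply_combine` is made up for illustration and this is not how Spark implements grouping internally.

```python
from collections import defaultdict

# Same rows as the DataFrame created above.
rows = [("messi", 1), ("ronald", 2), ("messi", 3), ("ronald", 4), ("messi", 5)]


def split_apply_combine(rows, func):
    """Group (key, value) rows by key, apply func per group, collect results."""
    # Split: collect the values that belong to each key.
    groups = defaultdict(list)
    for name, goal in rows:
        groups[name].append(goal)
    # Apply and combine: one aggregated value per key.
    return {name: func(goals) for name, goals in groups.items()}


# The built-in sum is fine here because it receives a list of ints.
print(split_apply_combine(rows, sum))  # {'messi': 9, 'ronald': 6}
print(split_apply_combine(rows, max))  # {'messi': 5, 'ronald': 4}
```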


## Builtin functions

In [7]:
from pyspark.sql import functions as F

import re
[f for f in dir(F) if re.search(r'^[a-z].+', f) is not None]

['abs',
 'acos',
 'acosh',
 'add_months',
 'aggregate',
 'approxCountDistinct',
 'approx_count_distinct',
 'array',
 'array_contains',
 'array_distinct',
 'array_except',
 'array_intersect',
 'array_join',
 'array_max',
 'array_min',
 'array_position',
 'array_remove',
 'array_repeat',
 'array_sort',
 'array_union',
 'arrays_overlap',
 'arrays_zip',
 'asc',
 'asc_nulls_first',
 'asc_nulls_last',
 'ascii',
 'asin',
 'asinh',
 'assert_true',
 'atan',
 'atan2',
 'atanh',
 'avg',
 'base64',
 'bin',
 'bitwiseNOT',
 'broadcast',
 'bround',
 'bucket',
 'cbrt',
 'ceil',
 'coalesce',
 'col',
 'collect_list',
 'collect_set',
 'column',
 'concat',
 'concat_ws',
 'conv',
 'corr',
 'cos',
 'cosh',
 'count',
 'countDistinct',
 'covar_pop',
 'covar_samp',
 'crc32',
 'create_map',
 'cume_dist',
 'current_date',
 'current_timestamp',
 'date_add',
 'date_format',
 'date_sub',
 'date_trunc',
 'datediff',
 'dayofmonth',
 'dayofweek',
 'dayofyear',
 'days',
 'decode',
 'degrees',
 'dense_rank',
 'desc',
 '


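## Caution

Do not use Python built-in aggregation functions such as ```sum``` inside ```agg```; use the equivalents in ```pyspark.sql.functions``` (e.g. ```F.sum```). The built-in raises the error shown below.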



In [8]:
df.groupby("name").agg(sum("goal"))  # <--- Using Python built-in sum.

TypeError: unsupported operand type(s) for +: 'int' and 'str'
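The error comes from Python itself, before Spark is involved. Because the functions module was imported under the alias `F`, the bare name `sum` still refers to the built-in, which iterates over the characters of the string `"goal"` and tries to add them to its integer start value `0`. A minimal reproduction in plain Python:

```python
# Built-in sum starts from 0 and adds each element of the iterable.
# Iterating over the string "goal" yields "g", "o", "a", "l",
# so the first step is 0 + "g", which is not defined for int and str.
try:
    sum("goal")
except TypeError as exc:
    message = str(exc)

print(message)
```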

## Sum

In [9]:
df.groupby("name").agg(F.sum("goal")).show()
df.groupby("name").sum("goal").show()

                                                                                

+------+---------+
|  name|sum(goal)|
+------+---------+
|ronald|        6|
| messi|        9|
+------+---------+





+------+---------+
|  name|sum(goal)|
+------+---------+
|ronald|        6|
| messi|        9|
+------+---------+





## Renaming column

* [Column alias after groupBy in pyspark](https://stackoverflow.com/questions/33516490/column-alias-after-groupby-in-pyspark)
* [Spark - DataFrame - Select](https://montan.atlassian.net/wiki/spaces/~masayukionishi/pages/692158548/Spark+-+DataFrame+-+Select)

In [10]:
# Select function & alias
df.groupby("name").sum("goal").select(F.col("sum(goal)").alias("scores")).show()

# Aggregation function & alias
df.groupby("name").agg(
    F.sum("goal").alias("scores"),
).show()

                                                                                

+------+
|scores|
+------+
|     6|
|     9|
+------+





+------+------+
|  name|scores|
+------+------+
|ronald|     6|
| messi|     9|
+------+------+



                                                                                

# Multiple aggregations

Pass several aggregate expressions to the ```agg``` method to compute them together for each group.

In [11]:
df.groupby("name").agg(
    F.sum("goal").alias("scores"),
    F.avg("goal").alias("average"),
    F.stddev("goal").alias("sd"),
    # when() without otherwise() yields null for non-matching rows; count() skips nulls
    F.count(F.when(F.col("goal") >= 3, 1)).alias("hattrick")
).show()

+------+------+-------+------------------+--------+
|  name|scores|average|                sd|hattrick|
+------+------+-------+------------------+--------+
|ronald|     6|    3.0|1.4142135623730951|       1|
| messi|     9|    3.0|               2.0|       2|
+------+------+-------+------------------+--------+
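The figures in the table can be cross-checked in plain Python. This sketch assumes the same per-player goals as the DataFrame above and uses `statistics.stdev`, the sample standard deviation (divisor n - 1), which is also what `F.stddev` computes.

```python
import statistics

# Goals per player, taken from the DataFrame created earlier.
goals = {"ronald": [2, 4], "messi": [1, 3, 5]}

summary = {
    name: {
        "scores": sum(values),
        "average": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "hattrick": sum(1 for v in values if v >= 3),  # rows with goal >= 3
    }
    for name, values in goals.items()
}

for name, stats in summary.items():
    print(name, stats)
```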



# Cleanup

In [12]:
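spark.stop()  # Stop the SparkContext and release the YARN resources
del spark     # Deleting the name alone does not stop the session
gc.collect()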

324