# Aggregation Basics

### groupBy
Use the DataFrame `groupBy` method to create a grouped data object. 

This grouped data object is called `RelationalGroupedDataset` in Scala and `GroupedData` in Python.

### Grouped data methods
Various aggregation methods are available on the <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.html" target="_blank">GroupedData</a> object.


| Method | Description |
| --- | --- |
| agg | Compute aggregates by specifying a series of aggregate columns |
| avg | Compute the mean value of each numeric column for each group |
| count | Count the number of rows for each group |
| max | Compute the max value of each numeric column for each group |
| mean | Compute the average value of each numeric column for each group (alias for avg) |
| min | Compute the min value of each numeric column for each group |
| pivot | Pivots a column of the current DataFrame and performs the specified aggregation |
| sum | Compute the sum of each numeric column for each group |

### Data Source
Generate 50 sample records:

In [0]:
%run ../GenerateSampleData

In [0]:
# Show the data
df.show(10, truncate=False)

### AVG
Compute average amount per category:

In [0]:
df.groupBy("category").avg("amount").show()

### MEAN
`mean` is an alias for `avg`:

In [0]:
df.groupBy("category").mean("amount").show()

### COUNT
Count transactions per product:

In [0]:
df.groupBy("product").count().show()

### MAX
Maximum amount per category:

In [0]:
df.groupBy("category").max("amount").show()

### MIN
Minimum amount per category:

In [0]:
df.groupBy("category").min("amount").show()

### SUM
Total sales amount per product:

In [0]:
df.groupBy("product").sum("amount").withColumnRenamed('sum(amount)', 'total_amount').show()

### PIVOT
Pivot the category column to see totals per date:

In [0]:
pivot_df = df.groupBy("transaction_date").pivot("category").sum("amount")
pivot_df.show()

### AGG
To compute several aggregates over the whole DataFrame without grouping, call `agg` directly on the DataFrame:

In [0]:
from pyspark.sql.functions import count, avg, sum, max, min
df.agg(
    count("*").alias("total_transactions"),
    avg("amount").alias("avg_amount"),
    sum("amount").alias("total_amount"),
    max("amount").alias("max_amount"),
    min("amount").alias("min_amount")
).show()


### Multiple Aggregations with groupBy().agg()
Let’s say you want to:

- Count transactions per category
- Compute total amount
- Compute average amount
- Compute min and max amounts

In [0]:
agg_df = (
    df.groupBy("category")
      .agg(
          count("*").alias("transaction_count"),
          sum("amount").alias("total_amount"),
          avg("amount").alias("average_amount"),
          min("amount").alias("min_amount"),
          max("amount").alias("max_amount")
      )
)

agg_df.show()

### Grouping by More Than One Column

If you want to group by both `transaction_date` and `category`:

In [0]:
agg_df2 = (
    df.groupBy("transaction_date", "category")
      .agg(
          count("*").alias("transaction_count"),
          sum("amount").alias("total_amount"),
          avg("amount").alias("average_amount")
      )
)

agg_df2.show()


### Using Multiple Aggregations With agg() Dictionary Syntax

You can also pass `agg` a dictionary that maps each column name to the name of one aggregate function:

In [0]:
agg_df3 = (
    df.groupBy("product")
      .agg(
          # A dict allows one aggregate per column: repeating the "amount" key
          # would silently keep only the last entry
          {"amount": "sum", "*": "count"}
      )
)

agg_df3.show()
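
The dictionary form takes one aggregate per column because a Python dict holds each key only once. A quick plain-Python check, independent of Spark, shows what happens when a column name is repeated:

```python
# A dict literal with a repeated key keeps only the last value for that key
requested = {"amount": "sum", "amount": "avg", "*": "count"}

# Only two entries survive: the "sum" request for "amount" is gone
surviving_keys = list(requested.keys())
amount_aggregate = requested["amount"]
```

To get both the sum and the average of one column, use the column-expression form of `agg` with `sum("amount")` and `avg("amount")` instead.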

# Aggregation

##### Objectives
1. Group data by specified columns
1. Apply grouped data methods to aggregate data
1. Apply built-in functions to aggregate data

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>: `groupBy`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.html#pyspark.sql.GroupedData" target="_blank">Grouped Data</a>: `agg`, `avg`, `count`, `max`, `sum`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>: `approx_count_distinct`, `avg`, `sum`

### Setup

In [0]:
%run ../DatasetSourcePath

Let's use the BedBricks events dataset.

In [0]:
df = spark.read.parquet(eventsPath)
display(df.limit(10))
print("Total records: {0:,}".format(df.count()))

In [0]:
df.groupBy("event_name")

In [0]:
df.groupBy("geo.state", "geo.city")

In [0]:
eventCountsDF = df.groupBy("event_name").count()
display(eventCountsDF)

Here, we're getting the average purchase revenue for each state.

In [0]:
avgStatePurchasesDF = df.groupBy("geo.state").avg("ecommerce.purchase_revenue_in_usd")
display(avgStatePurchasesDF)

And here we're getting the total item quantity and the total purchase revenue for each combination of state and city.

In [0]:
cityPurchaseQuantitiesDF = (df.groupBy("geo.state", "geo.city")
                            .sum(
                                "ecommerce.total_item_quantity", 
                                "ecommerce.purchase_revenue_in_usd"))
                                
display(cityPurchaseQuantitiesDF)

## Built-In Functions
In addition to DataFrame and Column transformation methods, there are a ton of helpful functions in Spark's built-in <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-builtin.html" target="_blank">SQL functions</a> module.

In Scala, this is <a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html" target="_blank">`org.apache.spark.sql.functions`</a>, and <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions" target="_blank">`pyspark.sql.functions`</a> in Python. Functions from this module must be imported into your code.

### Aggregate Functions

Here are some of the built-in functions available for aggregation.

| Method | Description |
| --- | --- |
| approx_count_distinct | Returns the approximate number of distinct items in a group |
| avg | Returns the average of the values in a group |
| collect_list | Returns a list of objects with duplicates |
| corr | Returns the Pearson Correlation Coefficient for two columns |
| max | Returns the maximum value of the expression in a group |
| mean | Returns the average of the values in a group (alias for avg) |
| stddev_samp | Returns the sample standard deviation of the expression in a group |
| sumDistinct | Returns the sum of distinct values in the expression (deprecated since Spark 3.2 in favor of sum_distinct) |
| var_pop | Returns the population variance of the values in a group |

Use the grouped data method <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.agg.html#pyspark.sql.GroupedData.agg" target="_blank">`agg`</a> to apply built-in aggregate functions.

This allows you to apply other transformations on the resulting columns, such as <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html" target="_blank">`alias`</a>.

In [0]:
from pyspark.sql.functions import sum

statePurchasesDF = df.groupBy("geo.state").agg(
        sum("ecommerce.total_item_quantity").alias("total_purchases")
    )
    
display(statePurchasesDF)

Apply multiple aggregate functions on grouped data:

In [0]:
from pyspark.sql.functions import avg, approx_count_distinct

stateAggregatesDF = (df
                     .groupBy("geo.state")
                     .agg(
                       avg("ecommerce.total_item_quantity").alias("avg_quantity"),
                       approx_count_distinct("user_id").alias("distinct_users")
                     )
                    )

display(stateAggregatesDF)

### Math Functions
Here are some of the built-in functions for math operations.

| Method | Description |
| --- | --- |
| ceil | Computes the ceiling of the given column. |
| cos | Computes the cosine of the given value. |
| log | Computes the natural logarithm of the given value. |
| round | Rounds the column to `scale` decimal places (0 by default) using HALF_UP rounding mode. |
| sqrt | Computes the square root of the specified float value. |

In [0]:
from pyspark.sql.functions import cos, sqrt

display(
    spark.range(10)  # Create a DataFrame with a single column called "id" with a range of integer values
    .withColumn("sqrt", sqrt("id"))
    .withColumn("cos", cos("id"))
)
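
The HALF_UP mode noted for `round` differs from Python's built-in `round`, which rounds ties to the nearest even number. The plain-Python sketch below uses the standard `decimal` module to illustrate the two tie-breaking rules; it does not call Spark, and the helper names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

def round_half_up(value: str) -> int:
    """Round to 0 decimal places, sending ties away from zero (HALF_UP)."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def round_half_even(value: str) -> int:
    """Round to 0 decimal places, sending ties to the nearest even integer."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

ties = ("0.5", "1.5", "2.5")
half_up_results = [round_half_up(v) for v in ties]      # tie-breaking rule named for Spark's round
half_even_results = [round_half_even(v) for v in ties]  # tie-breaking rule of Python's built-in round
```

When HALF_EVEN behavior is wanted inside Spark, `pyspark.sql.functions` provides `bround` for that purpose.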