<img style="float: left" src="images/spark.png" />
<img style="float: right" src="images/surfsara.png" />
<hr style="clear: both" />

## Spark Aggregation problem

Below is an exercise on groupBy and aggregation on DataFrames in pySpark.

The [groupBy](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy) method on a DataFrame does not return another DataFrame but a [GroupedData](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData) object that has several handy aggregation methods, like [avg](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.avg), [count](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.count), [max](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.max), [min](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.min) and many others.

The problem below was taken from Coursera's MOOC [Big Data Analysis with Scala and Spark](https://www.coursera.org/learn/scala-spark-big-data) by the Ecole Polytechnique Federale de Lausanne. We adapted the problem for pySpark.

Let's assume we have a dataset about posts in a discussion forum. The entries of the dataset consist of an authorID, the name of a subforum, the number of likes and a date.<br>

<b>We would like to tally up all the posts for each author and then rank authors with the most likes per subforum.</b>

After creating a SparkSession, we present the data as a list of Python dictionaries that we then transform into a DataFrame.

In [None]:
# Create a SparkSession, the 'DataFrame version' of the SparkContext
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .getOrCreate()
)

In [None]:
from pyspark.sql import Row
from pyspark.sql.functions import count


posts = [{'authorID': 4, 'subforum': 'java', 'likes': 5, 'date': 'sept 5'},
         {'authorID': 1, 'subforum': 'python', 'likes': 3, 'date': 'sept 4'},
         {'authorID': 2, 'subforum': 'python', 'likes': 35, 'date': 'sept 3'},
         {'authorID': 3, 'subforum': 'java', 'likes': 1, 'date': 'sept 5'},
         {'authorID': 4, 'subforum': 'java', 'likes': 14, 'date': 'sept 5'},
         {'authorID': 3, 'subforum': 'python', 'likes': 12, 'date': 'sept 3'},
         {'authorID': 3, 'subforum': 'java', 'likes': 14, 'date': 'sept 5'},
         {'authorID': 3, 'subforum': 'java', 'likes': 10, 'date': 'sept 5'},
         {'authorID': 2, 'subforum': 'python', 'likes': 21, 'date': 'sept 5'}]

# Distribute the list as an RDD, then convert each dictionary to a Row
rdd = spark.sparkContext.parallelize(posts)
dfPosts = spark.createDataFrame(rdd.map(lambda x: Row(**x)))
dfPosts.show()

In [None]:
dfPosts.printSchema()

Please use a [groupBy](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy), an [aggregation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.agg) and an [orderBy](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy) to come up with the desired DataFrame. Note that the result should be sorted in descending order.

In [None]:
<YOUR CODE HERE>