# Ex - GroupBy

### Introduction:

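GroupBy follows the split-apply-combine pattern: split the rows into groups by a key, apply a function to each group, and combine the per-group results into one table.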

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)  
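
As a rough illustration of split-apply-combine, the sketch below walks through the three phases in plain Python rather than Spark. The rows are made-up values, not taken from the drinks dataset, and the variable names are chosen only for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows (continent, beer_servings); not taken from drinks.csv
rows = [("EU", 200), ("EU", 100), ("AS", 20), ("AS", 40), ("AS", 60)]

# Split: bucket the values by their grouping key
groups = defaultdict(list)
for continent, beer in rows:
    groups[continent].append(beer)

# Apply: reduce each bucket to a single number
# Combine: collect the per-group results into one mapping
avg_beer = {continent: mean(values) for continent, values in groups.items()}

print(avg_beer)
```

A `groupBy(...).avg()` call in Spark expresses the same idea declaratively, with the engine deciding how to distribute the split and combine phases.
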
### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.appName("Alcohol_Consumption").getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

### Step 3. Assign it to a variable called drinks.

In [2]:
from pyspark import SparkFiles

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'

# Download the CSV so Spark can read it from a local path
spark.sparkContext.addFile(url)

drinks = spark.read.csv(SparkFiles.get("drinks.csv"), header=True, inferSchema=True)

In [3]:
drinks.printSchema()

root
 |-- country: string (nullable = true)
 |-- beer_servings: integer (nullable = true)
 |-- spirit_servings: integer (nullable = true)
 |-- wine_servings: integer (nullable = true)
 |-- total_litres_of_pure_alcohol: double (nullable = true)
 |-- continent: string (nullable = true)



### Step 4. Which continent drinks the most beer on average?

In [9]:
drinks.groupBy('continent').avg('beer_servings') \
    .orderBy('avg(beer_servings)', ascending=False).show()

+---------+------------------+
|continent|avg(beer_servings)|
+---------+------------------+
|       EU|193.77777777777777|
|       SA|175.08333333333334|
|       NA|145.43478260869566|
|       OC|           89.6875|
|       AF|61.471698113207545|
|       AS| 37.04545454545455|
+---------+------------------+



### Step 5. For each continent print the statistics for wine consumption.

In [14]:
wine_continent = drinks.groupBy('continent').avg('wine_servings') \
    .orderBy('avg(wine_servings)', ascending=False)

In [18]:
wine_continent.show()

+---------+------------------+
|continent|avg(wine_servings)|
+---------+------------------+
|       EU|142.22222222222223|
|       SA|62.416666666666664|
|       OC|            35.625|
|       NA| 24.52173913043478|
|       AF|16.264150943396228|
|       AS| 9.068181818181818|
+---------+------------------+
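
The cell above reports only the mean. A fuller set of statistics would usually add the count, spread, and extremes for each group. The sketch below shows that idea in plain Python on made-up rows (the data and the `summarize` helper are hypothetical); in PySpark the same aggregates are available as `F.count`, `F.mean`, `F.stddev`, `F.min`, and `F.max` inside `agg`.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy rows (continent, wine_servings); not taken from drinks.csv
rows = [("EU", 100), ("EU", 200), ("EU", 300), ("OC", 10), ("OC", 30)]

by_continent = defaultdict(list)
for continent, wine in rows:
    by_continent[continent].append(wine)

def summarize(values):
    """Return count, mean, sample standard deviation, min and max."""
    return {
        "count": len(values),
        "mean": mean(values),
        "stddev": stdev(values),  # sample (n - 1) standard deviation
        "min": min(values),
        "max": max(values),
    }

wine_stats = {c: summarize(v) for c, v in by_continent.items()}

for continent, stats in sorted(wine_stats.items()):
    print(continent, stats)
```
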



### Step 6. Print the mean alcohol consumption per continent for every column

In [19]:
drinks.groupBy('continent').avg().show()

+---------+------------------+--------------------+------------------+---------------------------------+
|continent|avg(beer_servings)|avg(spirit_servings)|avg(wine_servings)|avg(total_litres_of_pure_alcohol)|
+---------+------------------+--------------------+------------------+---------------------------------+
|       NA|145.43478260869566|   165.7391304347826| 24.52173913043478|                5.995652173913044|
|       SA|175.08333333333334|              114.75|62.416666666666664|                6.308333333333334|
|       AS| 37.04545454545455|   60.84090909090909| 9.068181818181818|               2.1704545454545454|
|       OC|           89.6875|             58.4375|            35.625|               3.3812500000000005|
|       EU|193.77777777777777|  132.55555555555554|142.22222222222223|                8.617777777777777|
|       AF|61.471698113207545|  16.339622641509433|16.264150943396228|                 3.00754716981132|
+---------+------------------+--------------------+------------------+---------------------------------+

### Step 7. Print the median alcohol consumption per continent for every column

In [77]:
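# percentile_approx(col, 0.5, accuracy) estimates the median; a larger accuracy lowers the error at the cost of memory
drinks.groupBy('continent').agg(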
    F.percentile_approx("beer_servings", 0.5, 100000).alias('50%_beer'),
    F.percentile_approx("spirit_servings", 0.5, 100000).alias('50%_spirit'),
    F.percentile_approx("wine_servings", 0.5, 100000).alias('50%_wine'),
    F.percentile_approx("total_litres_of_pure_alcohol", 0.5, 100000).alias('50%_total_ltrs')
).show()

+---------+--------+----------+--------+--------------+
|continent|50%_beer|50%_spirit|50%_wine|50%_total_ltrs|
+---------+--------+----------+--------+--------------+
|       NA|     143|       137|      11|           6.3|
|       SA|     162|       100|       8|           6.6|
|       AS|      16|        16|       1|           1.0|
|       OC|      49|        35|       8|           1.5|
|       EU|     219|       122|     128|          10.0|
|       AF|      32|         3|       2|           2.3|
+---------+--------+----------+--------+--------------+
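
Note that `percentile_approx` is documented as returning a value that occurs in the column, so for a group with an even number of rows it does not average the two middle values the way a conventional median does. The plain-Python sketch below illustrates the difference on made-up numbers; `median_low` is used only as an analogy and is not Spark's algorithm.

```python
from statistics import median, median_low

# Hypothetical servings for one group with an even number of rows
servings = [5, 16, 19, 60]

# Conventional median: average of the two middle values
interpolated = median(servings)

# Lower-middle element: always a value that occurs in the data,
# used here only as an analogy for percentile_approx at 0.5
picked = median_low(servings)

print(interpolated, picked)
```

If an interpolated median is needed, Spark SQL also provides an exact `percentile` aggregate, reachable through `F.expr`, which may be a reasonable choice for a dataset this small.
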



### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

In [68]:
drinks.groupBy('continent').agg(
    F.avg('spirit_servings').alias('avg_spirit_servings'),
    F.min('spirit_servings').alias('min_spirit_servings'),
    F.max('spirit_servings').alias('max_spirit_servings'),
).show()

+---------+-------------------+-------------------+-------------------+
|continent|avg_spirit_servings|min_spirit_servings|max_spirit_servings|
+---------+-------------------+-------------------+-------------------+
|       NA|  165.7391304347826|                 68|                438|
|       SA|             114.75|                 25|                302|
|       AS|  60.84090909090909|                  0|                326|
|       OC|            58.4375|                  0|                254|
|       EU| 132.55555555555554|                  0|                373|
|       AF| 16.339622641509433|                  0|                152|
+---------+-------------------+-------------------+-------------------+

