# Introduction to Spark

## Import Libraries

In [1]:
from os.path import abspath
from pyspark.sql import SparkSession, HiveContext
import pyspark.sql.functions as F
import pandas as pd

## Spark Session

A **Spark session** is the unified entry point for Spark applications, introduced in Spark 2.0. Instead of separate Spark, Hive, and SQL contexts, everything is now encapsulated in a single Spark session.

**Resources:**
 * [A tale of Spark Session and Spark Context](https://medium.com/@achilleus/spark-session-10d0d66d1d24)

Create a Spark session with `SparkSession.builder`:
 * `config("spark.sql.warehouse.dir", warehouse_location)` - `warehouse_location` is the default location for managed databases and tables
 * `config('spark.driver.extraJavaOptions','-Dderby.system.home=../data/tmp')` - sets the directory where `metastore_db` and `derby.log` are created

In [2]:
warehouse_location = abspath('../data/spark-warehouse')

In [3]:
spark = SparkSession \
         .builder \
         .config("spark.sql.warehouse.dir", warehouse_location) \
         .config('spark.driver.extraJavaOptions','-Dderby.system.home=../data/tmp') \
         .enableHiveSupport() \
         .getOrCreate()

### Multiple Spark Sessions

Creating multiple Spark sessions can cause issues, so it's best practice to use the `getOrCreate()` method. It returns an existing Spark session if there's already one in the environment, or creates a new one if necessary. Let's test this and create another Spark session:

In [4]:
spark_2 = (SparkSession.builder.enableHiveSupport().getOrCreate())

Now we can verify that both variables refer to the same Spark session object (the printed memory addresses are identical):

In [5]:
print(spark)
print(spark_2)

<pyspark.sql.session.SparkSession object at 0x7f83a1d54a90>
<pyspark.sql.session.SparkSession object at 0x7f83a1d54a90>


Check Spark version:

In [6]:
spark.version

'2.4.1'

Note that the Spark context (and other contexts) is accessible from the Spark session object `spark`:

In [7]:
sc = spark.sparkContext

In [8]:
sc

Another example: access Spark configuration parameters:

In [9]:
# getConf() is the public accessor; it avoids relying on the private `_conf` attribute
spark.sparkContext.getConf().getAll()

[('spark.driver.host', 'host.docker.internal'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.port', '52422'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.id', 'local-1574741565100'),
 ('spark.driver.extraJavaOptions', '-Dderby.system.home=../data/tmp'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.app.name', 'pyspark-shell'),
 ('spark.sql.warehouse.dir',
  '/mnt/d/pakhotin/Personal/Projects/Advanced-Spark/data/spark-warehouse')]

## Data in Spark

### Read from a File to Spark Data Frame 

We can read data into a Spark data frame, for example from a `csv` file.

In [10]:
df_iris = spark.read.csv("../data/raw/iris.csv", header=True, inferSchema=True)

Note, the data above is the famous _Iris_ sample by Fisher:
 * [_Iris_ Data Set at Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris)
 * [_Iris_ flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)
 * [R. A. Fisher (1936). "The use of multiple measurements in taxonomic problems". Annals of Eugenics. 7 (2): 179–188](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1936.tb02137.x)

In [11]:
df_iris

DataFrame[sepal_length_cm: double, sepal_width_cm: double, petal_length_cm: double, petal_width_cm: double, class_iris: string]

In [12]:
df_iris.show(5)

+---------------+--------------+---------------+--------------+-----------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|
+---------------+--------------+---------------+--------------+-----------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|
+---------------+--------------+---------------+--------------+-----------+
only showing top 5 rows



### Save Spark Data Frame to Table

Let's save the data frame `df_iris` to a Hive table:

In [13]:
df_iris.write.saveAsTable("iris_tb")

AnalysisException: 'Table `iris_tb` already exists.;'

We can list the **tables** available in the Spark cluster with the `catalog.listTables()` method:

In [14]:
print(spark.catalog.listTables())

[Table(name='iris_tb', database='default', description=None, tableType='MANAGED', isTemporary=False)]


We can list the **databases** available in the Spark cluster with the `catalog.listDatabases()` method:

In [15]:
print(spark.catalog.listDatabases())

[Database(name='default', description='Default Hive database', locationUri='file:/mnt/d/pakhotin/Personal/Projects/Advanced-Spark/data/spark-warehouse')]


Note that the `default` database is located inside the `spark-warehouse` directory that we set in the Spark configuration above.

We can also register a Spark data frame as a **TEMPORARY** table so that it can be queried by name (for example, from SQL), but only within the Spark session that was used to create the data frame.

There are two methods:
 * `createTempView()` - the lifetime of the temporary table is tied to the `SparkSession` that was used to create the `DataFrame`. It throws `TempTableAlreadyExistsException` if the view name already exists in the catalog.
 * `createOrReplaceTempView()` - similar to the above, but it creates a new temporary table if none exists and replaces the existing one otherwise. This is the recommended method.

In [16]:
df_iris.createOrReplaceTempView("iris_temp")

Let's examine the catalog again and check that the new table `iris_temp` is there and listed as temporary:

In [17]:
print(spark.catalog.listTables())

[Table(name='iris_tb', database='default', description=None, tableType='MANAGED', isTemporary=False), Table(name='iris_temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


### Read from Table to Spark Data Frame 

We can read the **entire** table into a data frame using the `table()` method as follows:

In [18]:
df_iris_2 = spark.table("iris_tb")

In [19]:
df_iris_2.show(10)

+---------------+--------------+---------------+--------------+-----------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|
+---------------+--------------+---------------+--------------+-----------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|
|            5.4|           3.9|            1.7|           0.4|Iris-setosa|
|            4.6|           3.4|            1.4|           0.3|Iris-setosa|
|            5.0|           3.4|            1.5|           0.2|Iris-setosa|
|            4.4|           2.9|            1.4|           0.2|Iris-setosa|
|            4.9|           3.1|            1.5|           0.1|Iris-setosa|
+-----------

Or we can run a **SQL query** on the table and read the results into a data frame. The `sql()` method takes the query as a string:

In [20]:
query = "FROM iris_tb SELECT * WHERE class_iris = 'Iris-versicolor' LIMIT 10"

In [21]:
flowers10 = spark.sql(query)

In [22]:
flowers10.show()

+---------------+--------------+---------------+--------------+---------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|     class_iris|
+---------------+--------------+---------------+--------------+---------------+
|            7.0|           3.2|            4.7|           1.4|Iris-versicolor|
|            6.4|           3.2|            4.5|           1.5|Iris-versicolor|
|            6.9|           3.1|            4.9|           1.5|Iris-versicolor|
|            5.5|           2.3|            4.0|           1.3|Iris-versicolor|
|            6.5|           2.8|            4.6|           1.5|Iris-versicolor|
|            5.7|           2.8|            4.5|           1.3|Iris-versicolor|
|            6.3|           3.3|            4.7|           1.6|Iris-versicolor|
|            4.9|           2.4|            3.3|           1.0|Iris-versicolor|
|            6.6|           2.9|            4.6|           1.3|Iris-versicolor|
|            5.2|           2.7|        

### Convert Spark Data Frame to Pandas Data Frame

If the resulting Spark data frame is small enough to fit in the driver's memory, it can be converted to a **Pandas** data frame with the `toPandas()` method, which collects all rows to the driver:

In [23]:
pdf_flowers10 = flowers10.toPandas()

In [24]:
pdf_flowers10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
sepal_length_cm    10 non-null float64
sepal_width_cm     10 non-null float64
petal_length_cm    10 non-null float64
petal_width_cm     10 non-null float64
class_iris         10 non-null object
dtypes: float64(4), object(1)
memory usage: 480.0+ bytes


In [25]:
pdf_flowers10.describe()

       sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm
count             10.0            10.0             10.0            10.0
mean               6.1            2.87             4.37            1.38
std           0.727247        0.340098         0.487739        0.168655
min                4.9             2.3              3.3             1.0
25%               5.55           2.725            4.125             1.3
50%               6.35            2.85             4.55             1.4
75%              6.575           3.175            4.675             1.5
max                7.0             3.3              4.9             1.6


### Convert Pandas Data Frame to Spark Data Frame

We can convert a Pandas data frame to a Spark data frame by passing it to the `createDataFrame()` method:

In [26]:
flowers10_tmp = spark.createDataFrame(pdf_flowers10)

In [27]:
flowers10_tmp

DataFrame[sepal_length_cm: double, sepal_width_cm: double, petal_length_cm: double, petal_width_cm: double, class_iris: string]

In [28]:
flowers10_tmp.show()

+---------------+--------------+---------------+--------------+---------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|     class_iris|
+---------------+--------------+---------------+--------------+---------------+
|            7.0|           3.2|            4.7|           1.4|Iris-versicolor|
|            6.4|           3.2|            4.5|           1.5|Iris-versicolor|
|            6.9|           3.1|            4.9|           1.5|Iris-versicolor|
|            5.5|           2.3|            4.0|           1.3|Iris-versicolor|
|            6.5|           2.8|            4.6|           1.5|Iris-versicolor|
|            5.7|           2.8|            4.5|           1.3|Iris-versicolor|
|            6.3|           3.3|            4.7|           1.6|Iris-versicolor|
|            4.9|           2.4|            3.3|           1.0|Iris-versicolor|
|            6.6|           2.9|            4.6|           1.3|Iris-versicolor|
|            5.2|           2.7|        

## Manipulating Data in Spark

### Creating a New Column in a Data Frame

The `withColumn()` method performs column-wise operations. It takes two arguments:
 * `colName` - a string containing the name of the new column
 * `col` - a column expression

It returns a new DataFrame with the new column added. Note that data frames in Spark are **immutable**, i.e. they can't be changed in place, but we can assign the resulting data frame back to the original variable:

In [29]:
df_iris = df_iris.withColumn("sepal_area_cm2", df_iris.sepal_length_cm * df_iris.sepal_width_cm)

In [30]:
df_iris.show(10)

+---------------+--------------+---------------+--------------+-----------+------------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|    sepal_area_cm2|
+---------------+--------------+---------------+--------------+-----------+------------------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|17.849999999999998|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|14.700000000000001|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|15.040000000000001|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|             14.26|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|              18.0|
|            5.4|           3.9|            1.7|           0.4|Iris-setosa|21.060000000000002|
|            4.6|           3.4|            1.4|           0.3|Iris-setosa|15.639999999999999|
|            5.0|           3.4|            1.5|  

Similarly, we can create a new column of boolean values based on a condition as follows:

In [31]:
df_iris = df_iris.withColumn("sepal_length_big", df_iris.sepal_length_cm > 6.0)

In [32]:
df_iris.show(10)

+---------------+--------------+---------------+--------------+-----------+------------------+----------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|    sepal_area_cm2|sepal_length_big|
+---------------+--------------+---------------+--------------+-----------+------------------+----------------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|17.849999999999998|           false|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|14.700000000000001|           false|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|15.040000000000001|           false|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|             14.26|           false|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|              18.0|           false|
|            5.4|           3.9|            1.7|           0.4|Iris-setosa|21.060000000000002|          

### Renaming a Column in a Data Frame

A column can be renamed with the `withColumnRenamed()` method:

In [33]:
df_iris = df_iris.withColumnRenamed("sepal_area_cm2", "sepal_area_cm_squared")

In [34]:
df_iris.show(5)

+---------------+--------------+---------------+--------------+-----------+---------------------+----------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|sepal_area_cm_squared|sepal_length_big|
+---------------+--------------+---------------+--------------+-----------+---------------------+----------------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|   17.849999999999998|           false|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|   14.700000000000001|           false|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|   15.040000000000001|           false|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|                14.26|           false|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|                 18.0|           false|
+---------------+--------------+---------------+--------------+-----------+-----

### Filtering Records in a Data Frame

The `filter()` method selects only the **rows** of a Spark data frame that satisfy a given condition. It accepts one argument, the condition, which can be given as:
 * a SQL expression
 * or a Spark column of boolean values.

The following is an example of a SQL expression used for filtering. Note that the expression must be a **string** and must not contain the data frame name (use `"sepal_length_cm > 6.0"` but not `"df_iris.sepal_length_cm > 6.0"`):

In [35]:
df_iris.filter("sepal_length_cm > 6.0").show(10)

+---------------+--------------+---------------+--------------+---------------+---------------------+----------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|     class_iris|sepal_area_cm_squared|sepal_length_big|
+---------------+--------------+---------------+--------------+---------------+---------------------+----------------+
|            7.0|           3.2|            4.7|           1.4|Iris-versicolor|   22.400000000000002|            true|
|            6.4|           3.2|            4.5|           1.5|Iris-versicolor|   20.480000000000004|            true|
|            6.9|           3.1|            4.9|           1.5|Iris-versicolor|                21.39|            true|
|            6.5|           2.8|            4.6|           1.5|Iris-versicolor|                 18.2|            true|
|            6.3|           3.3|            4.7|           1.6|Iris-versicolor|                20.79|            true|
|            6.6|           2.9|            4.6|

This is an example of a Spark column of boolean values used for filtering. Note that it does contain the name of the data frame (use `df_iris.sepal_length_cm > 6.0` but not `sepal_length_cm > 6.0`) and it is **not** a string:

In [36]:
df_iris.filter(df_iris.sepal_length_cm > 6.0).show(10)

+---------------+--------------+---------------+--------------+---------------+---------------------+----------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|     class_iris|sepal_area_cm_squared|sepal_length_big|
+---------------+--------------+---------------+--------------+---------------+---------------------+----------------+
|            7.0|           3.2|            4.7|           1.4|Iris-versicolor|   22.400000000000002|            true|
|            6.4|           3.2|            4.5|           1.5|Iris-versicolor|   20.480000000000004|            true|
|            6.9|           3.1|            4.9|           1.5|Iris-versicolor|                21.39|            true|
|            6.5|           2.8|            4.6|           1.5|Iris-versicolor|                 18.2|            true|
|            6.3|           3.3|            4.7|           1.6|Iris-versicolor|                20.79|            true|
|            6.6|           2.9|            4.6|

Note that the Spark column used in the filter can be defined separately as follows:

In [37]:
filter_long_sepal = df_iris.sepal_length_cm > 6.0

In [38]:
df_iris.filter(filter_long_sepal).show(10)

+---------------+--------------+---------------+--------------+---------------+---------------------+----------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|     class_iris|sepal_area_cm_squared|sepal_length_big|
+---------------+--------------+---------------+--------------+---------------+---------------------+----------------+
|            7.0|           3.2|            4.7|           1.4|Iris-versicolor|   22.400000000000002|            true|
|            6.4|           3.2|            4.5|           1.5|Iris-versicolor|   20.480000000000004|            true|
|            6.9|           3.1|            4.9|           1.5|Iris-versicolor|                21.39|            true|
|            6.5|           2.8|            4.6|           1.5|Iris-versicolor|                 18.2|            true|
|            6.3|           3.3|            4.7|           1.6|Iris-versicolor|                20.79|            true|
|            6.6|           2.9|            4.6|

### Selecting Columns from a Data Frame

The `select()` method selects **columns** from a Spark data frame. It accepts multiple arguments, each identifying a column as:
 * a string name
 * or a column object.

The following is an example of **string** column names used for selection (use `"sepal_length_cm"` but not `"df_iris.sepal_length_cm"`):

In [39]:
df_iris.select("sepal_length_cm", "sepal_width_cm").show(5)

+---------------+--------------+
|sepal_length_cm|sepal_width_cm|
+---------------+--------------+
|            5.1|           3.5|
|            4.9|           3.0|
|            4.7|           3.2|
|            4.6|           3.1|
|            5.0|           3.6|
+---------------+--------------+
only showing top 5 rows



The following is an example of **column objects** used for selection (use `df_iris.sepal_length_cm` but not `sepal_length_cm`):

In [40]:
df_iris.select(df_iris.sepal_length_cm, df_iris.sepal_width_cm).show(5)

+---------------+--------------+
|sepal_length_cm|sepal_width_cm|
+---------------+--------------+
|            5.1|           3.5|
|            4.9|           3.0|
|            4.7|           3.2|
|            4.6|           3.1|
|            5.0|           3.6|
+---------------+--------------+
only showing top 5 rows



We can mix both types of arguments:

In [41]:
df_iris.select("sepal_length_cm", df_iris.sepal_width_cm).show(5)

+---------------+--------------+
|sepal_length_cm|sepal_width_cm|
+---------------+--------------+
|            5.1|           3.5|
|            4.9|           3.0|
|            4.7|           3.2|
|            4.6|           3.1|
|            5.0|           3.6|
+---------------+--------------+
only showing top 5 rows



The same `select()` method can also apply **column-wise operations**. These work **only** on column objects, as shown below. A SQL string such as `select("sepal_length_cm*10")` would **not** work, because `select()` interprets a string argument as a column name:

In [42]:
df_iris.select(df_iris.sepal_length_cm*10, df_iris.sepal_width_cm*10).show(5)

+----------------------+---------------------+
|(sepal_length_cm * 10)|(sepal_width_cm * 10)|
+----------------------+---------------------+
|                  51.0|                 35.0|
|                  49.0|                 30.0|
|                  47.0|                 32.0|
|                  46.0|                 31.0|
|                  50.0|                 36.0|
+----------------------+---------------------+
only showing top 5 rows



Additionally, we can use the `alias()` method to name the derived columns as follows:

In [43]:
df_iris.select( (df_iris.sepal_length_cm*10).alias("sepal_length_mm"), (df_iris.sepal_width_cm*10).alias("sepal_width_mm") ).show(5)

+---------------+--------------+
|sepal_length_mm|sepal_width_mm|
+---------------+--------------+
|           51.0|          35.0|
|           49.0|          30.0|
|           47.0|          32.0|
|           46.0|          31.0|
|           50.0|          36.0|
+---------------+--------------+
only showing top 5 rows



Conveniently, the arguments of the `select()` method can be defined separately and then passed in:

In [44]:
sepal_length_mm = (df_iris.sepal_length_cm*10).alias("sepal_length_mm")
sepal_width_mm = (df_iris.sepal_width_cm*10).alias("sepal_width_mm")
df_iris.select(sepal_length_mm, sepal_width_mm).show(5)

+---------------+--------------+
|sepal_length_mm|sepal_width_mm|
+---------------+--------------+
|           51.0|          35.0|
|           49.0|          30.0|
|           47.0|          32.0|
|           46.0|          31.0|
|           50.0|          36.0|
+---------------+--------------+
only showing top 5 rows



Notice that in the code above we use `df_iris.select(sepal_length_mm, sepal_width_mm)` but not `df_iris.select(df_iris.sepal_length_mm, df_iris.sepal_width_mm)`: `sepal_length_mm` and `sepal_width_mm` are Python variables holding column expressions, not columns of `df_iris`.

As noted above, combining selection **and** operations on columns in `select()` works with column objects only. To use SQL strings instead, we have to use the `selectExpr()` method as follows:

In [45]:
df_iris.selectExpr("sepal_length_cm*10", "sepal_width_cm*10").show(5)

+----------------------+---------------------+
|(sepal_length_cm * 10)|(sepal_width_cm * 10)|
+----------------------+---------------------+
|                  51.0|                 35.0|
|                  49.0|                 30.0|
|                  47.0|                 32.0|
|                  46.0|                 31.0|
|                  50.0|                 36.0|
+----------------------+---------------------+
only showing top 5 rows



Or we can rename the new columns with the SQL operator `AS`, similarly to the `alias()` method above. The `AS` keyword is optional, as the second expression below shows:

In [46]:
df_iris.selectExpr("sepal_length_cm*10 AS sepal_length_mm", "sepal_width_cm*10 sepal_width_mm").show(5)

+---------------+--------------+
|sepal_length_mm|sepal_width_mm|
+---------------+--------------+
|           51.0|          35.0|
|           49.0|          30.0|
|           47.0|          32.0|
|           46.0|          31.0|
|           50.0|          36.0|
+---------------+--------------+
only showing top 5 rows



Again, let's use both methods next to each other to demonstrate that results are the same:

In [47]:
df_iris.select( (df_iris.sepal_length_cm*10).alias("sepal_length_mm"), (df_iris.sepal_width_cm*10).alias("sepal_width_mm") ).show(5)

+---------------+--------------+
|sepal_length_mm|sepal_width_mm|
+---------------+--------------+
|           51.0|          35.0|
|           49.0|          30.0|
|           47.0|          32.0|
|           46.0|          31.0|
|           50.0|          36.0|
+---------------+--------------+
only showing top 5 rows



In [48]:
df_iris.selectExpr("sepal_length_cm*10 AS sepal_length_mm", "sepal_width_cm*10 sepal_width_mm").show(5)

+---------------+--------------+
|sepal_length_mm|sepal_width_mm|
+---------------+--------------+
|           51.0|          35.0|
|           49.0|          30.0|
|           47.0|          32.0|
|           46.0|          31.0|
|           50.0|          36.0|
+---------------+--------------+
only showing top 5 rows



### Difference between `withColumn` and `select`

The `select()` method creates a new data frame with only the columns specified as its arguments.

The `withColumn()` method creates a new data frame with **all** columns of the original data frame plus the new column specified by its two arguments.

In [49]:
df_iris.select( (df_iris.sepal_length_cm*10).alias("sepal_length_mm") ).show(5)

+---------------+
|sepal_length_mm|
+---------------+
|           51.0|
|           49.0|
|           47.0|
|           46.0|
|           50.0|
+---------------+
only showing top 5 rows



In [50]:
df_iris.withColumn("sepal_length_mm", df_iris.sepal_length_cm * 10.0).show(5)

+---------------+--------------+---------------+--------------+-----------+---------------------+----------------+---------------+
|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm| class_iris|sepal_area_cm_squared|sepal_length_big|sepal_length_mm|
+---------------+--------------+---------------+--------------+-----------+---------------------+----------------+---------------+
|            5.1|           3.5|            1.4|           0.2|Iris-setosa|   17.849999999999998|           false|           51.0|
|            4.9|           3.0|            1.4|           0.2|Iris-setosa|   14.700000000000001|           false|           49.0|
|            4.7|           3.2|            1.3|           0.2|Iris-setosa|   15.040000000000001|           false|           47.0|
|            4.6|           3.1|            1.5|           0.2|Iris-setosa|                14.26|           false|           46.0|
|            5.0|           3.6|            1.4|           0.2|Iris-setosa|        

### Aggregating Records in a Data Frame

Aggregation methods follow the `groupBy()` method, which creates a `GroupedData` object from a Spark data frame.

List of some aggregation methods:
 * `min()`
 * `max()`
 * `avg()`
 * `sum()`
 * `count()`

These aggregation methods take **string** column names; column objects don't work (for example, use `min("sepal_length_cm")` but not `min(df_iris.sepal_length_cm)`).

Minimal sepal length:

In [51]:
df_iris.groupBy().min("sepal_length_cm").show()

+--------------------+
|min(sepal_length_cm)|
+--------------------+
|                 4.3|
+--------------------+



Maximal sepal length:

In [52]:
df_iris.groupBy().max("sepal_length_cm").show()

+--------------------+
|max(sepal_length_cm)|
+--------------------+
|                 7.9|
+--------------------+



Average sepal length:

In [53]:
df_iris.groupBy().avg("sepal_length_cm").show()

+--------------------+
|avg(sepal_length_cm)|
+--------------------+
|   5.843333333333335|
+--------------------+



Sum of all sepal lengths in the table:

In [54]:
df_iris.groupBy().sum("sepal_length_cm").show()

+--------------------+
|sum(sepal_length_cm)|
+--------------------+
|   876.5000000000002|
+--------------------+



Counts of records in the table:

In [55]:
df_iris.groupBy().count().show()

+-----+
|count|
+-----+
|  150|
+-----+



The `groupBy()` method can accept the names of one or more columns as arguments. For example, we can group records by *Iris* class (there are 3 of them: *virginica*, *setosa* and *versicolor*) and calculate the average in each class separately:

In [56]:
df_iris.groupBy("class_iris").avg("sepal_length_cm").show()

+---------------+--------------------+
|     class_iris|avg(sepal_length_cm)|
+---------------+--------------------+
| Iris-virginica|   6.587999999999998|
|    Iris-setosa|   5.005999999999999|
|Iris-versicolor|               5.936|
+---------------+--------------------+



Arguments to `groupBy()` can be column name strings (such as `"class_iris"` used above) or column objects such as `df_iris.class_iris`, as in the example below:

In [57]:
df_iris.groupBy(df_iris.class_iris).avg("sepal_length_cm").show()

+---------------+--------------------+
|     class_iris|avg(sepal_length_cm)|
+---------------+--------------------+
| Iris-virginica|   6.587999999999998|
|    Iris-setosa|   5.005999999999999|
|Iris-versicolor|               5.936|
+---------------+--------------------+



Another interesting application of the `groupBy()` method is together with the `count()` method, which returns the number of records for each *Iris* class:

In [58]:
df_iris.groupBy(df_iris.class_iris).count().show()

+---------------+-----+
|     class_iris|count|
+---------------+-----+
| Iris-virginica|   50|
|    Iris-setosa|   50|
|Iris-versicolor|   50|
+---------------+-----+



Finally, any aggregate function from the `pyspark.sql.functions` module can be used with the `groupBy()` and `agg()` methods. For example, let's group by *Iris* class and calculate the sample standard deviation:

In [59]:
df_iris.groupBy("class_iris").agg( F.stddev("sepal_length_cm") ).show()

+---------------+----------------------------+
|     class_iris|stddev_samp(sepal_length_cm)|
+---------------+----------------------------+
| Iris-virginica|           0.635879593274432|
|    Iris-setosa|          0.3524896872134513|
|Iris-versicolor|          0.5161711470638635|
+---------------+----------------------------+



Note, that we import `pyspark.sql.functions` as `F` at the beginning of this notebook.

### Joining Data Frames

Let's create a data frame from the `iris_class.csv` file, which contains *Iris* classes and the corresponding English names:

In [60]:
df_iris_class = spark.read.csv("../data/raw/iris_class.csv", header=True, inferSchema=True)

In [61]:
df_iris_class.show()

+---------------+--------------------+
|     class_iris|           name_iris|
+---------------+--------------------+
|    Iris-setosa|Bristle-pointed iris|
|Iris-versicolor|       Virginia iris|
| Iris-virginica|Northern blue fla...|
+---------------+--------------------+



The `join()` method creates a new data frame that combines information from two data frames using a column as a key. It is called on the first data frame and accepts three arguments:
 * the second data frame
 * `on` - the name of the column to join on (it should have the same name in both data frames; use `withColumnRenamed()` to rename if needed)
 * `how` - the type of join; we use `leftouter` in the example below

In [62]:
df_iris.join(df_iris_class, on="class_iris", how="leftouter").show(5)

+-----------+---------------+--------------+---------------+--------------+---------------------+----------------+--------------------+
| class_iris|sepal_length_cm|sepal_width_cm|petal_length_cm|petal_width_cm|sepal_area_cm_squared|sepal_length_big|           name_iris|
+-----------+---------------+--------------+---------------+--------------+---------------------+----------------+--------------------+
|Iris-setosa|            5.1|           3.5|            1.4|           0.2|   17.849999999999998|           false|Bristle-pointed iris|
|Iris-setosa|            4.9|           3.0|            1.4|           0.2|   14.700000000000001|           false|Bristle-pointed iris|
|Iris-setosa|            4.7|           3.2|            1.3|           0.2|   15.040000000000001|           false|Bristle-pointed iris|
|Iris-setosa|            4.6|           3.1|            1.5|           0.2|                14.26|           false|Bristle-pointed iris|
|Iris-setosa|            5.0|           3.6|    