## Using first and last functions

Let us understand how to use the `first` and `last` window functions.

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10-node state-of-the-art cluster/labs](https://labs.spark.com/plans) to learn Spark SQL using our unique integrated LMS.

```python
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Windowing Functions'). \
    master('yarn'). \
    getOrCreate()
```

If you prefer the command line, you can launch Spark SQL using one of the following three approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

```python
spark.conf.set('spark.sql.shuffle.partitions', '2')
```

* Let us get the highest-paid and the lowest-paid employee within each department, alongside each employee's row, using the employees data set.
* You can also use `max` to get the maximum salary for each department, but it cannot return the other attributes of that row, such as the employee id or name. With `first` or `last`, you can get those details as well.
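
To make the difference between `max` and `first`/`last` concrete, here is a minimal plain-Python sketch of the idea. It is not Spark code: the rows are a handful of `(employee_id, department_id, salary)` values from the employees data set, and the helper name `top_and_bottom_by_department` is hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# A few (employee_id, department_id, salary) rows from the employees data set.
rows = [
    (201, 20, 13000.0),
    (202, 20, 6000.0),
    (114, 30, 11000.0),
    (115, 30, 3100.0),
    (119, 30, 2500.0),
]


def top_and_bottom_by_department(rows):
    """Hypothetical helper: return {department_id: (highest_row, lowest_row)}."""
    by_department = defaultdict(list)
    for row in rows:
        by_department[row[1]].append(row)

    result = {}
    for department_id, members in by_department.items():
        # Sort by salary in descending order, like orderBy(col('salary').desc()).
        ordered = sorted(members, key=lambda r: r[2], reverse=True)
        # The first row is the highest paid and the last row is the lowest paid;
        # both still carry the employee_id.
        result[department_id] = (ordered[0], ordered[-1])
    return result


# max() alone yields only the salary, with no employee attached to it.
max_salary_dept_30 = max(r[2] for r in rows if r[1] == 30)

extremes = top_and_bottom_by_department(rows)
print(max_salary_dept_30)
print(extremes[30])
```

Taking the first or last row of an ordered group keeps the whole record, which is what `first` and `last` over a window give you in Spark.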

```python
employeesPath = '/public/hr_db/employees'
```

```python
employees = spark. \
    read. \
    format('csv'). \
    option('sep', '\t'). \
    schema('''employee_id INT,
              first_name STRING,
              last_name STRING,
              email STRING,
              phone_number STRING,
              hire_date STRING,
              job_id STRING,
              salary FLOAT,
              commission_pct STRING,
              manager_id STRING,
              department_id STRING
            '''). \
    load(employeesPath)
```

```python
from pyspark.sql.functions import col
```

```python
employees. \
    select('employee_id',
           col('department_id').cast('int').alias('department_id'),
           'salary'
          ). \
    orderBy('department_id', 'salary'). \
    show()
```

```
+-----------+-------------+-------+
|employee_id|department_id| salary|
+-----------+-------------+-------+
|        178|         null| 7000.0|
|        200|           10| 4400.0|
|        202|           20| 6000.0|
|        201|           20|13000.0|
|        119|           30| 2500.0|
|        118|           30| 2600.0|
|        117|           30| 2800.0|
|        116|           30| 2900.0|
|        115|           30| 3100.0|
|        114|           30|11000.0|
|        203|           40| 6500.0|
|        132|           50| 2100.0|
|        136|           50| 2200.0|
|        128|           50| 2200.0|
|        127|           50| 2400.0|
|        135|           50| 2400.0|
|        140|           50| 2500.0|
|        144|           50| 2500.0|
|        191|           50| 2500.0|
|        182|           50| 2500.0|
+-----------+-------------+-------+
only showing top 20 rows
```



```python
from pyspark.sql.window import Window
```

```python
spec = Window. \
    partitionBy('department_id'). \
    orderBy(col('salary').desc())
```

```python
from pyspark.sql.functions import first
```

```python
employees. \
    select('employee_id',
           col('department_id').cast('int').alias('department_id'),
           'salary'
          ). \
    withColumn("highest_salary", first('salary').over(spec)). \
    withColumn("highest_employee_id", first('employee_id').over(spec)). \
    orderBy("department_id", col("salary").desc()). \
    show()
```

```
+-----------+-------------+-------+--------------+-------------------+
|employee_id|department_id| salary|highest_salary|highest_employee_id|
+-----------+-------------+-------+--------------+-------------------+
|        178|         null| 7000.0|        7000.0|                178|
|        200|           10| 4400.0|        4400.0|                200|
|        201|           20|13000.0|       13000.0|                201|
|        202|           20| 6000.0|       13000.0|                201|
|        114|           30|11000.0|       11000.0|                114|
|        115|           30| 3100.0|       11000.0|                114|
|        116|           30| 2900.0|       11000.0|                114|
|        117|           30| 2800.0|       11000.0|                114|
|        118|           30| 2600.0|       11000.0|                114|
|        119|           30| 2500.0|       11000.0|                114|
|        203|           40| 6500.0|        6500.0|                203|
|     
```

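* When a window specification has an `orderBy` but no explicit frame, the default frame runs from unbounded preceding to the current row. With that frame, `last` returns the value from the current row (or from its last peer when salaries tie) rather than from the end of the partition. To make `last` look at the whole partition, we need to change the frame to run from unbounded preceding to unbounded following.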
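
The effect of the frame on `last` can be modelled in plain Python. The sketch below is not Spark code: it treats the salaries of department 30, already sorted in descending order, as a list, and uses a hypothetical helper `last_in_frame` to pick the final element of each row's frame. It models a row-based frame and deliberately ignores how ties are handled.

```python
# Salaries of department 30, sorted in descending order as in the window spec.
salaries = [11000.0, 3100.0, 2900.0, 2800.0, 2600.0, 2500.0]


def last_in_frame(values, index, unbounded_following=False):
    """Hypothetical helper: last value of the frame for the row at `index`.

    The frame always starts at the first row of the partition. It ends at the
    current row by default, or at the last row of the partition when
    `unbounded_following` is True.
    """
    end = len(values) - 1 if unbounded_following else index
    frame = values[:end + 1]
    return frame[-1]


# Frame ends at the current row: every row only sees its own salary as "last".
default_frame = [last_in_frame(salaries, i) for i in range(len(salaries))]

# Frame extended to the end of the partition: every row sees the final salary.
full_frame = [last_in_frame(salaries, i, unbounded_following=True)
              for i in range(len(salaries))]

print(default_frame)
print(full_frame)
```

With a descending sort, the full frame makes `last` return the lowest salary for every row; with an ascending sort it would return the highest.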

```python
from pyspark.sql.functions import last
```

```python
spec = Window. \
    partitionBy('department_id'). \
    orderBy(col('salary').desc())
```

```python
# With the default frame, last() only sees rows up to the current one,
# so each row gets its own salary and employee_id back.
employees. \
    select('employee_id',
           col('department_id').cast('int').alias('department_id'),
           'salary'
          ). \
    withColumn("highest_salary", last('salary').over(spec)). \
    withColumn("highest_employee_id", last('employee_id').over(spec)). \
    orderBy("department_id", col("salary").desc()). \
    show()
```

```
+-----------+-------------+-------+--------------+-------------------+
|employee_id|department_id| salary|highest_salary|highest_employee_id|
+-----------+-------------+-------+--------------+-------------------+
|        178|         null| 7000.0|        7000.0|                178|
|        200|           10| 4400.0|        4400.0|                200|
|        201|           20|13000.0|       13000.0|                201|
|        202|           20| 6000.0|        6000.0|                202|
|        114|           30|11000.0|       11000.0|                114|
|        115|           30| 3100.0|        3100.0|                115|
|        116|           30| 2900.0|        2900.0|                116|
|        117|           30| 2800.0|        2800.0|                117|
|        118|           30| 2600.0|        2600.0|                118|
|        119|           30| 2500.0|        2500.0|                119|
|        203|           40| 6500.0|        6500.0|                203|
|     
```

```python
help(spec)
```

```
Help on WindowSpec in module pyspark.sql.window object:

class WindowSpec(builtins.object)
 |  A window specification that defines the partitioning, ordering,
 |  and frame boundaries.
 |  
 |  Use the static methods in :class:`Window` to create a :class:`WindowSpec`.
 |  
 |  .. note:: Experimental
 |  
 |  .. versionadded:: 1.4
 |  
 |  Methods defined here:
 |  
 |  __init__(self, jspec)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  orderBy(self, *cols)
 |      Defines the ordering columns in a :class:`WindowSpec`.
 |      
 |      :param cols: names of columns or expressions
 |      
 |      .. versionadded:: 1.4
 |  
 |  partitionBy(self, *cols)
 |      Defines the partitioning columns in a :class:`WindowSpec`.
 |      
 |      :param cols: names of columns or expressions
 |      
 |      .. versionadded:: 1.4
 |  
 |  rangeBetween(self, start, end)
 |      Defines the frame boundaries, from `start` (inclusive) to `end` (inclusive).
 |      
 |   
```

```python
# Sort salaries in ascending order and extend the frame to the end of the
# partition, so that last() sees every row and returns the highest-paid one.
spec = Window. \
    partitionBy('department_id'). \
    orderBy(col('salary')). \
    rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
```

```python
employees. \
    select('employee_id',
           col('department_id').cast('int').alias('department_id'),
           'salary'
          ). \
    withColumn("highest_salary", last('salary').over(spec)). \
    withColumn("highest_employee_id", last('employee_id').over(spec)). \
    orderBy("department_id", col("salary").desc()). \
    show()
```

```
+-----------+-------------+-------+--------------+-------------------+
|employee_id|department_id| salary|highest_salary|highest_employee_id|
+-----------+-------------+-------+--------------+-------------------+
|        178|         null| 7000.0|        7000.0|                178|
|        200|           10| 4400.0|        4400.0|                200|
|        201|           20|13000.0|       13000.0|                201|
|        202|           20| 6000.0|       13000.0|                201|
|        114|           30|11000.0|       11000.0|                114|
|        115|           30| 3100.0|       11000.0|                114|
|        116|           30| 2900.0|       11000.0|                114|
|        117|           30| 2800.0|       11000.0|                114|
|        118|           30| 2600.0|       11000.0|                114|
|        119|           30| 2500.0|       11000.0|                114|
|        203|           40| 6500.0|        6500.0|                203|
|     
```