# Window functions

General Elections were held in the UK in 2015 and 2017. Every citizen votes in a constituency. The candidate who gains the most votes becomes MP for that constituency.

All these results are recorded in the table ge:

yr	| firstName	| lastName	| constituency	| party	| votes
---:|-----------|-----------|---------------|-------|------:
2015	| Ian	| Murray	| S14000024	| Labour	| 19293
2015	| Neil	| Hay	| S14000024	| Scottish National Party	| 16656
2015	| Miles	| Briggs	| S14000024	| Conservative | 8626
2015	| Phyl	| Meyer	| S14000024	| Green	| 2090
2015	| Pramod	| Subbaraman	| S14000024	| Liberal Democrat	| 1823
2015	| Paul	| Marshall	| S14000024	| UK Independence Party	 | 601
2015	| Colin	| Fox	| S14000024	| Scottish Socialist Party	| 197
2017	| Ian	| MURRAY	| S14000024	| Labour	| 26269
2017	| Jim	| EADIE	| S14000024	| SNP	| 10755
2017	| Stephanie Jane Harley	| SMITH	| S14000024	| Conservative	| 9428
2017	| Alan Christopher	| BEAL	| S14000024	| Liberal Democrats	| 1388

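As a minimal sketch of the rule that the candidate with the most votes becomes MP, the snippet below models a few rows of the sample table in plain Scala, without Spark. The case class name `Result` and the value names are illustrative assumptions, not part of the original notebook.

```scala
// One row of the ge table; `Result` is a hypothetical name used only for this sketch.
final case class Result(yr: Int, firstName: String, lastName: String,
                        constituency: String, party: String, votes: Int)

// A few rows copied from the sample table above.
val sample = Seq(
  Result(2015, "Ian", "Murray", "S14000024", "Labour", 19293),
  Result(2015, "Neil", "Hay", "S14000024", "Scottish National Party", 16656),
  Result(2015, "Miles", "Briggs", "S14000024", "Conservative", 8626),
  Result(2017, "Ian", "MURRAY", "S14000024", "Labour", 26269),
  Result(2017, "Jim", "EADIE", "S14000024", "SNP", 10755),
  Result(2017, "Stephanie Jane Harley", "SMITH", "S14000024", "Conservative", 9428)
)

// Group by (year, constituency) and keep the candidate with the most votes.
val winnersByYear: Map[(Int, String), Result] =
  sample.groupBy(r => (r.yr, r.constituency)).map { case (key, rs) => key -> rs.maxBy(_.votes) }

winnersByYear.toSeq.sortBy(_._1).foreach { case ((yr, c), r) =>
  println(s"$yr $c ${r.lastName} (${r.party}) ${r.votes}")
}
```

The window functions in the rest of this notebook do the same grouping and ordering inside Spark, but keep every row and attach a rank to it instead of discarding the losers.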

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.4.0`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = {
    NotebookSparkSession.builder()
    .progress(false)
    .appName("app09-")
    // .master("spark://192.168.31.31:7077")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", 
            "hdfs://192.168.31.31:9000/user/hive/warehouse") 
    .config("spark.cores.max", "4") 
    .config("spark.executor.instances", "1") 
    .config("spark.executor.cores", "2") 
    .config("spark.executor.memory", "10g") 
    .config("spark.shuffle.service.enabled", "false") 
    .config("spark.dynamicAllocation.enabled", "false") 
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .config("spark.driver.allowMultipleContexts", "true")
    .getOrCreate()
}

Loading spark-stubs, spark-hive
Adding Hive conf dir /opt/hive/conf to classpath
Creating SparkSession


SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.


import $ivy.$
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

spark: SparkSession = org.apache.spark.sql.SparkSession@79fe4a49

In [2]:
import spark.implicits._
def sc = spark.sparkContext
// HiveContext has been deprecated since Spark 2.0; it is used here only to read the Hive table.
val hiveCxt = new org.apache.spark.sql.hive.HiveContext(sc)

import spark.implicits._
defined function sc
hiveCxt: sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@cd0694

In [3]:
// Credit to Aivean
implicit class RichDF(val ds:DataFrame) {
    def showHTML(limit: Int = 50, truncate: Int = 100) = {
        import xml.Utility.escape
        val data = ds.take(limit)
        val header = ds.schema.fieldNames.toSeq        
        val rows: Seq[Seq[String]] = data.map { row =>
          row.toSeq.map {cell =>
            val str = cell match {
              case null => "null"
              case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
              case array: Array[_] => array.mkString("[", ", ", "]")
              case seq: Seq[_] => seq.mkString("[", ", ", "]")
              case _ => cell.toString
            }
            if (truncate > 0 && str.length > truncate) {
              // do not show ellipses for strings shorter than 4 characters.
              if (truncate < 4) str.substring(0, truncate)
              else str.substring(0, truncate - 3) + "..."
            } else {
              str
            }
          }: Seq[String]
        }
        publish.html(s""" <table>
                <tr>
                 ${header.map(h => s"<th>${escape(h)}</th>").mkString}
                </tr>
                ${rows.map {row =>
                  s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
                }.mkString}
            </table>
        """)
    }
}

defined class RichDF

In [4]:
val ge = hiveCxt.table("sqlzoo.ge")

ge: DataFrame = [yr: int, firstname: string ... 4 more fields]

## 1. Warming up

Show the **lastName, party** and **votes** for the **constituency** 'S14000024' in 2017.

In [5]:
(ge.filter((ge("constituency")==="S14000024") &&
           (ge("yr")===2017))
    .select("lastname", "party", "votes")
    .showHTML())

lastname | party | votes
---------|-------|------:
BEAL | Liberal Democrats | 1388
MURRAY | Labour | 26269
EADIE | SNP | 10755
SMITH | Conservative | 9428


## 2. Who won?

You can use the RANK function to see the order of the candidates. If you RANK using (ORDER BY votes DESC) then the candidate with the most votes has rank 1.

**Show the party and RANK for constituency S14000024 in 2017. List the output by party.**

In [6]:
(ge.filter((ge("constituency")==="S14000024") && 
           (ge("yr")===2017))
 .select("party", "votes")
 .withColumn("rank", rank().over(Window.orderBy(col("votes").desc)))
 .orderBy("party")
 .showHTML())

party | votes | rank
------|------:|----:
Conservative | 9428 | 3
Labour | 26269 | 1
Liberal Democrats | 1388 | 4
SNP | 10755 | 2

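RANK leaves gaps after ties, which matters if two candidates ever poll the same number of votes. The sketch below compares `rank`, `dense_rank` and `row_number` on hypothetical vote counts (the parties A to D and their totals are made up for illustration). It assumes Spark SQL is on the classpath; inside this notebook `getOrCreate()` should simply return the session that is already running.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, rank, row_number}

// Reuses an existing session if there is one; otherwise starts a local one.
val session = SparkSession.builder().master("local[*]").appName("rank-sketch").getOrCreate()
import session.implicits._

// Hypothetical vote counts with a tie between B and C.
val tied = Seq(("A", 300), ("B", 200), ("C", 200), ("D", 100)).toDF("party", "votes")

val byVotes = Window.orderBy(col("votes").desc)

val rankedTies = tied
  .withColumn("rank", rank().over(byVotes))             // 1, 2, 2, 4: a gap follows the tie
  .withColumn("dense_rank", dense_rank().over(byVotes)) // 1, 2, 2, 3: no gap
  .withColumn("row_number", row_number().over(byVotes)) // 1 to 4; the order of B and C is arbitrary
  .orderBy("party")

rankedTies.show()
```

With `rank`, filtering on rank 1 would return both candidates in a tied contest; `row_number` always returns exactly one row per partition, at the cost of picking between tied rows arbitrarily.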

## 3. PARTITION BY

The 2015 election is a different PARTITION to the 2017 election. We only care about the order of votes for each year.

**Use PARTITION to show the ranking of each party in S14000021 in each year. Include yr, party, votes and ranking (the party with the most votes is 1).**

In [7]:
(ge.filter(ge("constituency")==="S14000021")
 .withColumn("posn", rank().over(
     Window.partitionBy("yr").orderBy(col("votes").desc)))
 .select("yr", "party", "votes", "posn")
 .orderBy("party", "yr")
 .showHTML())

yr | party | votes | posn
---:|-------|------:|----:
2015 | Conservative | 12465 | 3
2017 | Conservative | 21496 | 1
2019 | Conservative | 19451 | 2
2015 | Labour | 19295 | 2
2017 | Labour | 14346 | 2
2019 | Labour | 6855 | 3
2015 | Liberal Democrats | 1069 | 4
2017 | Liberal Democrats | 1112 | 3
2019 | Liberal Democrats | 4174 | 4
2015 | SNP | 23013 | 1

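To see what the partition changes, the following sketch ranks the same rows twice: once over the whole data set and once within each year. It uses the top three S14000024 rows per year from the sample table at the start of this notebook; the column names `overall` and `within_yr` are illustrative. It assumes Spark SQL is on the classpath and reuses the running session if there is one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

val session = SparkSession.builder().master("local[*]").appName("partition-sketch").getOrCreate()
import session.implicits._

// Top three candidates per year for S14000024, taken from the sample table.
val twoYears = Seq(
  (2015, "Labour", 19293), (2015, "Scottish National Party", 16656), (2015, "Conservative", 8626),
  (2017, "Labour", 26269), (2017, "SNP", 10755), (2017, "Conservative", 9428)
).toDF("yr", "party", "votes")

val votesDesc = col("votes").desc

val compared = twoYears
  // One window over all six rows: the two elections are mixed together.
  .withColumn("overall", rank().over(Window.orderBy(votesDesc)))
  // One window per year: the ranking restarts at 1 for each election.
  .withColumn("within_yr", rank().over(Window.partitionBy("yr").orderBy(votesDesc)))
  .orderBy("yr", "within_yr")

compared.show(truncate = false)
```

Without the partition, Labour's 2015 result ranks second overall, behind Labour's own 2017 result, which says nothing about who won in 2015.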

## 4. Edinburgh Constituency

Edinburgh constituencies are numbered S14000021 to S14000026.

**Use PARTITION BY constituency to show the ranking of each party in Edinburgh in 2017. Order your results so the winners are shown first, then ordered by constituency.**

In [8]:
(ge.filter((ge("constituency").between("S14000021", "S14000026")) &&
       (ge("yr")===2017))
 .withColumn("posn", rank().over(
     Window.partitionBy("constituency").orderBy(col("votes").desc)))
 .select("constituency", "party", "votes", "posn")
 .orderBy("posn", "constituency")
 .showHTML())

constituency | party | votes | posn
-------------|-------|------:|----:
S14000021 | Conservative | 21496 | 1
S14000022 | SNP | 18509 | 1
S14000023 | SNP | 19243 | 1
S14000024 | Labour | 26269 | 1
S14000025 | SNP | 17575 | 1
S14000026 | Liberal Democrats | 18108 | 1
S14000021 | Labour | 14346 | 2
S14000022 | Labour | 15084 | 2
S14000023 | Labour | 17618 | 2
S14000024 | SNP | 10755 | 2

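The `between` filter on a string column is an inclusive comparison of strings, not of numbers. The plain Scala sketch below mirrors that comparison to show why it works for fixed-length constituency codes and where it could mislead. The helper name `inEdinburghRange` and the malformed code `S140000215` are hypothetical, and the sketch assumes Spark's string ordering agrees with Scala's for ASCII codes like these.

```scala
// Inclusive lexicographic range check, analogous to col.between(lower, upper) on strings.
def inEdinburghRange(code: String): Boolean =
  code >= "S14000021" && code <= "S14000026"

val checks = Seq(
  "S14000021",  // lower bound, included
  "S14000024",  // inside the range
  "S14000026",  // upper bound, included
  "S14000027",  // just past the range
  "S140000215"  // hypothetical malformed code with an extra digit
).map(code => code -> inEdinburghRange(code))

checks.foreach { case (code, ok) => println(s"$code -> $ok") }
```

The malformed ten-character code falls inside the range because strings are compared character by character, so the filter is only as reliable as the fixed length of the codes.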

## 5. Winners Only

You can use [SELECT within SELECT](https://sqlzoo.net/wiki/SELECT_within_SELECT_Tutorial) to pick out only the winners in Edinburgh.

**Show the parties that won for each Edinburgh constituency in 2017.**

In [9]:
(ge.filter((ge("constituency").between("S14000021", "S14000026")) &&
           (ge("yr")===2017))
 .withColumn("posn", rank().over(
     Window.partitionBy("constituency").orderBy(col("votes").desc)))
 .filter(col("posn")===1)
 .select("constituency", "party")
 .orderBy("constituency")
 .showHTML())

constituency | party
-------------|------
S14000021 | Conservative
S14000022 | SNP
S14000023 | SNP
S14000024 | Labour
S14000025 | SNP
S14000026 | Liberal Democrats

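The DataFrame pipeline above filters on the rank column directly. For comparison, here is a sketch of the SELECT-within-SELECT form run through Spark SQL on a small in-memory sample. The rows are four of the 2017 Edinburgh results shown in the previous section; the view name `ge_sample` is a hypothetical name chosen for this sketch. It assumes Spark SQL is on the classpath and reuses the running session if there is one.

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().master("local[*]").appName("subquery-sketch").getOrCreate()
import session.implicits._

// Four 2017 rows taken from the output of the previous section.
val edinburghSample = Seq(
  ("S14000021", "Conservative", 21496), ("S14000021", "Labour", 14346),
  ("S14000024", "Labour", 26269), ("S14000024", "SNP", 10755)
).toDF("constituency", "party", "votes")

// Register the sample under a temporary, session-local name.
edinburghSample.createOrReplaceTempView("ge_sample")

// The inner SELECT computes the rank; the outer SELECT keeps only the winners.
val sqlWinners = session.sql("""
  SELECT constituency, party
  FROM (
    SELECT constituency, party,
           RANK() OVER (PARTITION BY constituency ORDER BY votes DESC) AS posn
    FROM ge_sample
  ) ranked
  WHERE posn = 1
  ORDER BY constituency
""")

sqlWinners.show()
```

The subquery is needed in SQL because a window function cannot appear in the WHERE clause of the same SELECT that computes it; the DataFrame version gets the same effect by calling `filter` after `withColumn`.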

## 6. Scottish seats

You can use **COUNT** and **GROUP BY** to see how each party did in Scotland. Scottish constituencies start with 'S'.

**Show how many seats each party won in Scotland in 2017.**

In [10]:
(ge.filter((ge("constituency").startsWith("S")) && 
           (ge("yr")===2017))
 .withColumn("posn", rank().over(
     Window.partitionBy("constituency").orderBy(col("votes").desc)))
 .filter(col("posn")===1)
 .groupBy("party")
 .count()
 .showHTML())

party | count
------|-----:
SNP | 34
Labour | 9
Conservative | 12
Liberal Democrats | 4

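The grouped counts above come back in no particular order. As a small sketch of the counting step on its own, the snippet below takes the six Edinburgh winners from section 5, counts seats per party, and sorts the result by seat count and then by party name. The explicit ordering is an addition for readability, not part of the original exercise. It assumes Spark SQL is on the classpath and reuses the running session if there is one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val session = SparkSession.builder().master("local[*]").appName("seats-sketch").getOrCreate()
import session.implicits._

// Winning party per Edinburgh constituency in 2017, copied from section 5.
val edinburghWinners = Seq(
  ("S14000021", "Conservative"), ("S14000022", "SNP"), ("S14000023", "SNP"),
  ("S14000024", "Labour"), ("S14000025", "SNP"), ("S14000026", "Liberal Democrats")
).toDF("constituency", "party")

// One row per party with its number of seats, largest first; ties are broken by party name.
val seats = edinburghWinners
  .groupBy("party")
  .count()
  .orderBy(col("count").desc, col("party"))

seats.show()
```

Adding the same `orderBy` before `showHTML()` in the cell above would list the Scottish parties from most to fewest seats.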

In [11]:
spark.stop()