# Self join

## Edinburgh Buses
[Details of the database](https://sqlzoo.net/wiki/Edinburgh_Buses.) The database has two tables:

```
stops(id, name)
route(num, company, pos, stop)
```

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.4.0`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = {
    NotebookSparkSession.builder()
    .progress(false)
    .appName("app09")
    // .master("spark://192.168.31.31:7077")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", 
            "hdfs://192.168.31.31:9000/user/hive/warehouse") 
    .config("spark.cores.max", "4") 
    .config("spark.executor.instances", "1") 
    .config("spark.executor.cores", "2") 
    .config("spark.executor.memory", "10g") 
    .config("spark.shuffle.service.enabled", "false") 
    .config("spark.dynamicAllocation.enabled", "false") 
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .config("spark.driver.allowMultipleContexts", "true")
    .getOrCreate()
}

Loading spark-stubs, spark-hive
Adding Hive conf dir /opt/hive/conf to classpath
Creating SparkSession


spark: SparkSession = org.apache.spark.sql.SparkSession@785ebcfa

In [2]:
import spark.implicits._
def sc = spark.sparkContext
val hiveCxt = new org.apache.spark.sql.hive.HiveContext(sc)

import spark.implicits._
defined function sc
hiveCxt: sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@5dbfb5f

In [3]:
// Credit to Aivean
implicit class RichDF(val ds:DataFrame) {
    def showHTML(limit: Int = 50, truncate: Int = 100) = {
        import xml.Utility.escape
        val data = ds.take(limit)
        val header = ds.schema.fieldNames.toSeq        
        val rows: Seq[Seq[String]] = data.map { row =>
          row.toSeq.map {cell =>
            val str = cell match {
              case null => "null"
              case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
              case array: Array[_] => array.mkString("[", ", ", "]")
              case seq: Seq[_] => seq.mkString("[", ", ", "]")
              case _ => cell.toString
            }
            if (truncate > 0 && str.length > truncate) {
              // do not show ellipses for strings shorter than 4 characters.
              if (truncate < 4) str.substring(0, truncate)
              else str.substring(0, truncate - 3) + "..."
            } else {
              str
            }
          }: Seq[String]
        }
        publish.html(s""" <table>
                <tr>
                 ${header.map(h => s"<th>${escape(h)}</th>").mkString}
                </tr>
                ${rows.map {row =>
                  s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
                }.mkString}
            </table>
        """)
    }
}

defined class RichDF

In [4]:
val stops = hiveCxt.table("sqlzoo.stops")
val route = hiveCxt.table("sqlzoo.route")

stops: DataFrame = [id: int, name: string]
route: DataFrame = [num: string, company: string ... 2 more fields]

## 1.
How many **stops** are in the database?

In [5]:
stops.agg(count("id")).showHTML()

count(id)
246


## 2.
Find the **id** value for the stop 'Craiglockhart'.

In [6]:
stops.filter(stops("name")==="Craiglockhart").select("id").showHTML()

id
53


## 3.
Give the **id** and the **name** for the **stops** on the '4' 'LRT' service.

In [7]:
(stops.join(route, stops("id")===route("stop"), joinType="left")
 .filter((col("num")==="4") && (col("company")==="LRT"))
 .select("id", "name")
 .showHTML())

id,name
19,Bingham
53,Craiglockhart
85,Fairmilehead
115,Haymarket
117,Hillend
149,London Road
177,Northfield
179,Oxgangs
194,Princes Street


## 4. Routes and stops

The query shown gives the number of routes that visit either London Road (149) or Craiglockhart (53). Run the query and notice the two services that link these stops have a count of 2. Add a HAVING clause to restrict the output to these two routes.

In [8]:
(route.filter(route("stop").isin(List(149, 53): _*))
    .groupBy("company", "num")
    .agg(count("stop"))
    .filter($"count(stop)"===2)
    .showHTML())

company,num,count(stop)
LRT,45,2
LRT,4,2
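
SQL's `HAVING` filters groups after aggregation, which is why the DataFrame version above applies `filter` after `agg`. The same logic can be sketched on plain Scala collections with a few made-up rows. The `Leg` case class and the `toyRoute` data are hypothetical and only mimic the shape of the `route` table:

```scala
// Hypothetical toy rows standing in for route(num, company, pos, stop).
case class Leg(num: String, company: String, pos: Int, stop: Int)

val toyRoute = Seq(
  Leg("4",  "LRT", 1, 53), Leg("4",  "LRT", 2, 149),
  Leg("45", "LRT", 1, 53), Leg("45", "LRT", 2, 149),
  Leg("10", "LRT", 1, 53), Leg("C5", "SMT", 1, 149)
)

// WHERE stop IN (53, 149) ... GROUP BY company, num, with COUNT(*) per group.
val stopCounts: Map[(String, String), Int] = toyRoute
  .filter(leg => Set(53, 149).contains(leg.stop))
  .groupBy(leg => (leg.company, leg.num))
  .map { case (service, legs) => service -> legs.size }

// HAVING COUNT(*) = 2: keep only the groups whose aggregate passes the test.
val servingBoth: Set[(String, String)] =
  stopCounts.filter { case (_, n) => n == 2 }.keySet
```

In this toy data only the services `("LRT", "4")` and `("LRT", "45")` visit both stops, so they are the only groups that survive the final filter.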


## 5.
Execute the self join shown and observe that b.stop gives all the places you can get to from Craiglockhart, without changing routes. Change the query so that it shows the services from Craiglockhart to London Road.

In [15]:
(route.filter(col("stop")===53)
 .join(route
       .filter(col("stop")===149)
       .withColumnsRenamed(Map("stop" -> "stop2")),
      Seq("company", "num"))
 .select("company", "num", "stop", "stop2")
 .showHTML())

company,num,stop,stop2
LRT,4,53,149
LRT,45,53,149
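
A self join pairs each row of a table with the rows of a second copy of the same table that share the join key, here `(company, num)`. As a rough illustration of what the DataFrame join above computes, here is a sketch on plain Scala collections. The `Visit` case class and the rows in `toyVisits` are made up for the example (stop id 230 is an arbitrary placeholder):

```scala
// Hypothetical toy rows standing in for route(num, company, pos, stop).
case class Visit(num: String, company: String, pos: Int, stop: Int)

val toyVisits = Seq(
  Visit("4",  "LRT", 1, 53), Visit("4",  "LRT", 2, 149),
  Visit("45", "LRT", 1, 53), Visit("45", "LRT", 2, 149),
  Visit("10", "LRT", 1, 53), Visit("10", "LRT", 2, 230)
)

// Self join: pair every row `a` with every row `b` of the same service.
val pairs: Seq[(Visit, Visit)] = for {
  a <- toyVisits
  b <- toyVisits
  if a.company == b.company && a.num == b.num
} yield (a, b)

// Keep the pairs that start at stop 53 and end at stop 149.
val direct: Seq[(String, String)] = pairs.collect {
  case (a, b) if a.stop == 53 && b.stop == 149 => (a.company, a.num)
}
```

Filtering only on `a.stop` would list every stop reachable from 53 without changing buses; adding the condition on `b.stop` narrows the result to the services that link the two stops.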


## 6.
The query shown is similar to the previous one; however, by joining two copies of the **stops** table we can refer to **stops** by **name** rather than by number. Change the query so that the services between 'Craiglockhart' and 'London Road' are shown. If you are tired of these places, try 'Fairmilehead' against 'Tollcross'.

In [16]:
(route.join(route
            .withColumnsRenamed(Map(
                "pos" -> "pos2", "stop" -> "stop2")), 
            Seq("company", "num"))
    .join(stops, col("stop")===stops("id"))
    .join(stops
          .withColumnsRenamed(Map(
              "id" -> "id2", "name" -> "name2")), 
          col("stop2")===col("id2"))
    .filter((col("name")==="Craiglockhart") && 
            (col("name2")==="London Road"))
    .select("company", "num", "name", "name2")
    .showHTML())

company,num,name,name2
LRT,4,Craiglockhart,London Road
LRT,45,Craiglockhart,London Road
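
Joining `stops` twice resolves each of the two stop ids in a pair to its name. The idea in miniature, as a sketch on plain Scala collections: the ids and names below match rows shown earlier in this notebook, but the `idPairs` list is invented for the example and stands in for the result of the `route` self join.

```scala
// Stop ids and names as they appear in the query outputs above.
val stopNames: Map[Int, String] =
  Map(53 -> "Craiglockhart", 115 -> "Haymarket", 149 -> "London Road")

// Hypothetical result of the route self join: (company, num, stop, stop2).
val idPairs = Seq(
  ("LRT", "4", 53, 149),
  ("LRT", "4", 53, 115),
  ("LRT", "45", 53, 149)
)

// Two lookups into the same table, one per stop column,
// mirror the two joins against `stops`.
val named = for {
  (company, num, stop, stop2) <- idPairs
  name  <- stopNames.get(stop)
  name2 <- stopNames.get(stop2)
  if name == "Craiglockhart" && name2 == "London Road"
} yield (company, num, name, name2)
```

Renaming the columns of the second copy (`id2`, `name2` in the DataFrame code) serves the same purpose as the two separate lookups here: it keeps the two roles of the table apart.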


## 7. [Using a self join](https://sqlzoo.net/wiki/Using_a_self_join)

Give a list of all the services which connect stops 115 and 137 ('Haymarket' and 'Leith')

In [17]:
(route.join(route
            .withColumnsRenamed(Map(
                "pos" -> "pos2", "stop" -> "stop2")),
            Seq("company", "num"))
    .filter((col("stop")===115) && (col("stop2")===137))
    .select("company", "num")
    .distinct()
    .showHTML())

company,num
LRT,2A
LRT,2
LRT,25
SMT,C5
LRT,12
LRT,22


## 8.
Give a list of the services which connect the stops 'Craiglockhart' and 'Tollcross'.

In [18]:
(route.join(route
            .withColumnsRenamed(Map(
                "pos" -> "pos2", "stop" -> "stop2")),
            Seq("company", "num"))
    .join(stops, col("stop")===col("id"))
    .join(stops
          .withColumnsRenamed(Map(
              "id" -> "id2", "name" -> "name2")), 
          col("stop2")===col("id2"))
    .filter((col("name")==="Craiglockhart") && 
            (col("name2")==="Tollcross"))
    .select("company", "num")
    .showHTML())

company,num
LRT,10
LRT,27
LRT,45
LRT,47


## 9.
Give a distinct list of the **stops** which may be reached from 'Craiglockhart' by taking one bus, including 'Craiglockhart' itself, offered by the LRT company. Include the company and bus no. of the relevant services.

In [19]:
(route.join(route
            .withColumnsRenamed(Map(
                "pos" -> "pos2", "stop" -> "stop2")),
            Seq("company", "num"))
    .join(stops, col("stop")===col("id"))
    .join(stops
          .withColumnsRenamed(Map(
              "id" -> "id2", "name" -> "name2")), 
          col("stop2")===col("id2"))
    .filter((col("name")==="Craiglockhart") && 
            (col("company")==="LRT"))
    .select("name2", "company", "num")
    .dropDuplicates()
    .showHTML())

name2,company,num
Tollcross,LRT,27
Duddingston,LRT,45
Balerno Church,LRT,47
Craiglockhart,LRT,10
Hillend,LRT,4
Tollcross,LRT,47
Tollcross,LRT,45
Riccarton Campus,LRT,45
Princes Street,LRT,4
Oxgangs,LRT,27


## 10.
Find the routes involving two buses that can go from **Craiglockhart** to **Lochend**.
Show the bus no. and company for the first bus, the name of the stop for the transfer,
and the bus no. and company for the second bus.

> _Hint_    
> Self-join twice to find buses that visit Craiglockhart and Lochend, then join those on matching stops.
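
One way to read the hint: collect every (service, stop) reachable on one bus from the origin, collect every (service, stop) from which one bus reaches the destination, then match the two sides on the transfer stop. A minimal sketch on plain Scala collections with invented rows follows. The `Hop` case class, the helper `sameService`, and the contents of `toyHops` are hypothetical and only mimic the shape of `route` joined with `stops`:

```scala
// Hypothetical, already denormalised rows: one per (service, stop name).
case class Hop(company: String, num: String, stopName: String)

val toyHops = Seq(
  Hop("LRT", "4",  "Craiglockhart"), Hop("LRT", "4",  "Haymarket"),
  Hop("LRT", "65", "Haymarket"),     Hop("LRT", "65", "Lochend"),
  Hop("SMT", "C5", "Tollcross"),     Hop("SMT", "C5", "Lochend")
)

// Self join in disguise: all rows of every service that calls at `stop`.
def sameService(stop: String): Seq[Hop] = {
  val services = toyHops
    .filter(_.stopName == stop)
    .map(h => (h.company, h.num))
    .toSet
  toyHops.filter(h => services.contains((h.company, h.num)))
}

val fromOrigin    = sameService("Craiglockhart") // first bus and its stops
val toDestination = sameService("Lochend")       // second bus and its stops

// Join the two sides on the transfer stop.
val transfers = for {
  first  <- fromOrigin
  second <- toDestination
  if first.stopName == second.stopName
} yield (first.num, first.company, first.stopName, second.num, second.company)
```

In this toy data the only shared stop is Haymarket, so the single transfer found is bus 4 (LRT) to bus 65 (LRT).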

In [20]:
val bus1 = (route.join(route
            .withColumnsRenamed(Map(
                "pos" -> "pos2", "stop" -> "stop2")),
            Seq("company", "num"))
    .join(stops, col("stop")===col("id"))
    .join(stops
          .withColumnsRenamed(Map(
              "id" -> "id2", "name" -> "name2")), 
          col("stop2")===col("id2"))
    .filter(col("name")==="Craiglockhart")
    .select("name2", "company", "num", "stop2")
    .dropDuplicates())
val bus2 = (route.join(route
            .withColumnsRenamed(
                Map("pos" -> "pos2", "stop" -> "stop2")),
            Seq("company", "num"))
    .join(stops, col("stop")===col("id"))
    .join(stops
          .withColumnsRenamed(Map(
              "id" -> "id2", "name" -> "name2")), 
          col("stop2")===col("id2"))
    .filter(col("name2")==="Lochend")
    .select("stop", "company", "num")
    .dropDuplicates())
(bus1.join(bus2
           .withColumnsRenamed(Map(
               "company" -> "company2", "num" -> "num2")), 
           bus1("stop2")===bus2("stop"))
    .select("num", "company", "name2", "num2", "company2")
    .showHTML())

num,company,name2,num2,company2
45,LRT,Riccarton Campus,65,LRT
10,LRT,Leith,C5,SMT
10,LRT,Leith,34,LRT
10,LRT,Leith,87,LRT
10,LRT,Leith,35,LRT
45,LRT,Duddingston,42,LRT
45,LRT,Duddingston,46A,LRT
4,LRT,Princes Street,C5,SMT
4,LRT,Princes Street,65,LRT
4,LRT,Haymarket,65,LRT


bus1: Dataset[Row] = [name2: string, company: string ... 2 more fields]
bus2: Dataset[Row] = [stop: int, company: string ... 1 more field]

In [21]:
spark.stop()