# Self join

## Edinburgh Buses
See the [details of the database](https://sqlzoo.net/wiki/Edinburgh_Buses.) for background. The two tables have the following columns:

```
stops(id, name)
route(num, company, pos, stop)
```
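
To make the schema concrete, here is a minimal sketch that builds the same two tables in an in-memory SQLite database and resolves each route row to a stop name. The stop names, service numbers, and company codes are made up for illustration and are not taken from the Edinburgh data; storing `num` as text is an assumption based on service numbers such as '2A' and 'C5' in the results further down.

```python
import sqlite3

# Hypothetical toy data: none of these rows come from the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stops (id INTEGER, name TEXT);
    CREATE TABLE route (num TEXT, company TEXT, pos INTEGER, stop INTEGER);
""")
conn.executemany("INSERT INTO stops VALUES (?, ?)",
                 [(1, "A Street"), (2, "B Road"), (3, "C Square")])
conn.executemany("INSERT INTO route VALUES (?, ?, ?, ?)",
                 [("1", "X", 1, 1), ("1", "X", 2, 2),
                  ("2", "X", 1, 2), ("2", "X", 2, 3)])

# Each route row is one stop (at position pos) on one service (num, company).
rows = conn.execute("""
    SELECT route.num, route.company, route.pos, stops.name
    FROM route JOIN stops ON route.stop = stops.id
    ORDER BY route.num, route.pos
""").fetchall()
print(rows)
```

A service is identified by the pair (`num`, `company`), which is why the self joins in the later exercises match on both columns.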

In [1]:
import os
import pandas as pd
import findspark

# Point findspark at the local Spark installation before importing pyspark.
os.environ['SPARK_HOME'] = '/opt/spark'
findspark.init()

from pyspark.sql import SparkSession

# Note: `sc` is a SparkSession with Hive support, not a SparkContext.
sc = (SparkSession.builder.appName('app09')
      .config('spark.sql.warehouse.dir', 'hdfs://quickstart.cloudera:8020/user/hive/warehouse')
      .config('hive.metastore.uris', 'thrift://quickstart.cloudera:9083')
      .enableHiveSupport().getOrCreate())

## 1.
How many **stops** are in the database?

In [2]:
stops = sc.read.table('sqlzoo.stops')
route = sc.read.table('sqlzoo.route')

In [3]:
stops.agg({'id': 'count'}).toPandas()

|   | count(id) |
|---|-----------|
| 0 | 246 |


## 2.
Find the **id** value for the stop 'Craiglockhart'.

In [4]:
stops.filter(stops['name']=='Craiglockhart').select('id').toPandas()

|   | id |
|---|----|
| 0 | 53 |


## 3.
Give the **id** and the **name** for the **stops** on the '4' 'LRT' service.

In [5]:
from pyspark.sql.functions import col

# An inner join suffices: the filter on route columns would discard unmatched stops anyway.
(stops.join(route, stops['id']==route['stop'], how='inner')
 .filter((col('num')=='4') & (col('company')=='LRT'))
 .select('id', 'name').toPandas())

|   | id | name |
|---|----|------|
| 0 | 19 | Bingham |
| 1 | 53 | Craiglockhart |
| 2 | 85 | Fairmilehead |
| 3 | 115 | Haymarket |
| 4 | 117 | Hillend |
| 5 | 149 | London Road |
| 6 | 177 | Northfield |
| 7 | 179 | Oxgangs |
| 8 | 194 | Princes Street |


## 4. Routes and stops

The query shown gives the number of routes that visit either London Road (149) or Craiglockhart (53). Run the query and notice the two services that link these stops have a count of 2. Add a HAVING clause to restrict the output to these two routes.

In [6]:
(route.filter(route['stop'].isin([149, 53]))
    .groupBy('company', 'num')
    .agg({'stop': 'count'})
    # Filtering after the aggregation plays the role of SQL's HAVING clause.
    .filter("count(stop)==2")
    .toPandas())

|   | company | num | count(stop) |
|---|---------|-----|-------------|
| 0 | LRT | 45 | 2 |
| 1 | LRT | 4 | 2 |
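
For comparison, the exercise's HAVING formulation can be sketched in plain SQL. The example below runs against an in-memory SQLite table with made-up rows (services 'A', 'B', 'C' and stop ids 1 to 3 are hypothetical), so it illustrates the technique rather than reproducing the result above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE route (num TEXT, company TEXT, pos INTEGER, stop INTEGER)")
# Hypothetical services: 'A' visits stops 1 and 2, 'B' visits 1 and 3, 'C' visits 3 and 2.
conn.executemany("INSERT INTO route VALUES (?, ?, ?, ?)", [
    ("A", "X", 1, 1), ("A", "X", 2, 2),
    ("B", "X", 1, 1), ("B", "X", 2, 3),
    ("C", "X", 1, 3), ("C", "X", 2, 2),
])

# WHERE keeps only rows for the two stops of interest; HAVING then keeps
# only the services that were counted at both of them.
both = conn.execute("""
    SELECT company, num, COUNT(*) AS n
    FROM route
    WHERE stop IN (1, 2)
    GROUP BY company, num
    HAVING COUNT(*) = 2
""").fetchall()
print(both)  # [('X', 'A', 2)]
```

WHERE filters rows before grouping, while HAVING filters groups after aggregation. The `.filter(...)` placed after `.agg(...)` in the DataFrame version corresponds to the latter.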


## 5.
Execute the self join shown and observe that b.stop gives all the places you can get to from Craiglockhart, without changing routes. Change the query so that it shows the services from Craiglockhart to London Road.

In [7]:
(route.join(route
            .withColumnRenamed('pos', 'pos2')
            .withColumnRenamed('stop', 'stop2'), 
            ['company', 'num'], how='inner')
    .filter((col('stop')==53) & (col('stop2')==149))
    .select('company', 'num', 'stop', 'stop2')
    .toPandas())

|   | company | num | stop | stop2 |
|---|---------|-----|------|-------|
| 0 | LRT | 4 | 53 | 149 |
| 1 | LRT | 45 | 53 | 149 |
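
The same self join can be written in SQL by giving the table two aliases. This is a minimal sketch on hypothetical data (two made-up services and stop ids 1 to 4), not the query from the exercise page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE route (num TEXT, company TEXT, pos INTEGER, stop INTEGER)")
# Hypothetical services: 'A' runs 1 -> 2 -> 3, 'B' runs 1 -> 4.
conn.executemany("INSERT INTO route VALUES (?, ?, ?, ?)", [
    ("A", "X", 1, 1), ("A", "X", 2, 2), ("A", "X", 3, 3),
    ("B", "X", 1, 1), ("B", "X", 2, 4),
])

# Aliases a and b are two copies of route; matching on (company, num)
# pairs every stop of a service with every stop of the same service.
services = conn.execute("""
    SELECT a.company, a.num, a.stop, b.stop
    FROM route AS a
    JOIN route AS b ON a.company = b.company AND a.num = b.num
    WHERE a.stop = 1 AND b.stop = 3
""").fetchall()
print(services)  # [('X', 'A', 1, 3)]
```

In the DataFrame version, renaming `pos` and `stop` on the second copy serves the same purpose as the `b` alias: it keeps the two copies' columns distinguishable after the join.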


## 6.
The query shown is similar to the previous one; however, by joining two copies of the **stops** table we can refer to **stops** by **name** rather than by number. Change the query so that the services between 'Craiglockhart' and 'London Road' are shown. If you are tired of these places, try 'Fairmilehead' against 'Tollcross'.

In [8]:
(route.join(route
            .withColumnRenamed('pos', 'pos2')
            .withColumnRenamed('stop', 'stop2'), 
            ['company', 'num'])
    .join(stops, col('stop')==stops['id'])
    .join(stops
          .withColumnRenamed('id', 'id2')
          .withColumnRenamed('name', 'name2'), 
          col('stop2')==col('id2'))
    .filter((col('name')=="Craiglockhart") & (col('name2')=="London Road"))
    .select('company', 'num', 'name', 'name2')
    .toPandas())

|   | company | num | name | name2 |
|---|---------|-----|------|-------|
| 0 | LRT | 4 | Craiglockhart | London Road |
| 1 | LRT | 45 | Craiglockhart | London Road |
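
As a sketch of the same idea in SQL, the query below joins two copies of `route` and two copies of `stops`, one per end of the journey. The stop names 'North', 'Centre', and 'South' and the services are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stops (id INTEGER, name TEXT);
    CREATE TABLE route (num TEXT, company TEXT, pos INTEGER, stop INTEGER);
""")
# Hypothetical data: service 'A' runs North -> Centre -> South, 'B' runs North -> Centre.
conn.executemany("INSERT INTO stops VALUES (?, ?)",
                 [(1, "North"), (2, "Centre"), (3, "South")])
conn.executemany("INSERT INTO route VALUES (?, ?, ?, ?)", [
    ("A", "X", 1, 1), ("A", "X", 2, 2), ("A", "X", 3, 3),
    ("B", "X", 1, 1), ("B", "X", 2, 2),
])

# sa names the boarding stop (from copy a), sb names the destination (from copy b).
named = conn.execute("""
    SELECT a.company, a.num, sa.name, sb.name
    FROM route AS a
    JOIN route AS b ON a.company = b.company AND a.num = b.num
    JOIN stops AS sa ON a.stop = sa.id
    JOIN stops AS sb ON b.stop = sb.id
    WHERE sa.name = 'North' AND sb.name = 'South'
""").fetchall()
print(named)  # [('X', 'A', 'North', 'South')]
```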


## 7. [Using a self join](https://sqlzoo.net/wiki/Using_a_self_join)

Give a list of all the services which connect stops 115 and 137 ('Haymarket' and 'Leith').

In [9]:
(route.join(route
            .withColumnRenamed('pos', 'pos2')
            .withColumnRenamed('stop', 'stop2'),
            ['company', 'num'])
    .filter((col('stop')==115) & (col('stop2')==137))
    .select('company', 'num')
    .dropDuplicates()
    .toPandas())

|   | company | num |
|---|---------|-----|
| 0 | LRT | 2A |
| 1 | LRT | 2 |
| 2 | LRT | 25 |
| 3 | SMT | C5 |
| 4 | LRT | 12 |
| 5 | LRT | 22 |


## 8.
Give a list of the services which connect the stops 'Craiglockhart' and 'Tollcross'.

In [10]:
(route.join(route
            .withColumnRenamed('pos', 'pos2')
            .withColumnRenamed('stop', 'stop2'),
            ['company', 'num'])
    .join(stops, col('stop')==col('id'))
    .join(stops
          .withColumnRenamed('id', 'id2')
          .withColumnRenamed('name', 'name2'), 
          col('stop2')==col('id2'))
    .filter((col('name')=="Craiglockhart") & 
            (col('name2')=="Tollcross"))
    .select('company', 'num')
    .toPandas())

|   | company | num |
|---|---------|-----|
| 0 | LRT | 10 |
| 1 | LRT | 27 |
| 2 | LRT | 45 |
| 3 | LRT | 47 |


## 9.
Give a distinct list of the **stops** which may be reached from 'Craiglockhart' by taking one bus, including 'Craiglockhart' itself, offered by the LRT company. Include the company and bus no. of the relevant services.

In [11]:
(route.join(route
            .withColumnRenamed('pos', 'pos2')
            .withColumnRenamed('stop', 'stop2'),
            ['company', 'num'])
    .join(stops, col('stop')==col('id'))
    .join(stops
          .withColumnRenamed('id', 'id2')
          .withColumnRenamed('name', 'name2'), 
          col('stop2')==col('id2'))
    .filter((col('name')=="Craiglockhart") & (col('company')=="LRT"))
    .select('name2', 'company', 'num')
    .dropDuplicates()
    .toPandas())

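|   | name2 | company | num |
|---|-------|---------|-----|
| 0 | Tollcross | LRT | 27 |
| 1 | Duddingston | LRT | 45 |
| 2 | Balerno Church | LRT | 47 |
| 3 | Craiglockhart | LRT | 10 |
| 4 | Hillend | LRT | 4 |
| 5 | Tollcross | LRT | 47 |
| 6 | Tollcross | LRT | 45 |
| 7 | Riccarton Campus | LRT | 45 |
| 8 | Princes Street | LRT | 4 |
| 9 | Oxgangs | LRT | 27 |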
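
A rough SQL counterpart, assuming a toy dataset with invented stops and services, shows how `SELECT DISTINCT` plays the role of `dropDuplicates()` and why the starting stop appears in its own result: in a self join on (company, num), every row also pairs with itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stops (id INTEGER, name TEXT);
    CREATE TABLE route (num TEXT, company TEXT, pos INTEGER, stop INTEGER);
""")
# Hypothetical data: company 'X' runs services 'A' and 'B'; company 'Y' runs 'C'.
conn.executemany("INSERT INTO stops VALUES (?, ?)",
                 [(1, "North"), (2, "Centre"), (3, "South"), (4, "East")])
conn.executemany("INSERT INTO route VALUES (?, ?, ?, ?)", [
    ("A", "X", 1, 1), ("A", "X", 2, 2), ("A", "X", 3, 3),
    ("B", "X", 1, 1), ("B", "X", 2, 2),
    ("C", "Y", 1, 1), ("C", "Y", 2, 4),
])

# All stops reachable from 'North' on a single bus run by company 'X'.
reachable = conn.execute("""
    SELECT DISTINCT sb.name, a.company, a.num
    FROM route AS a
    JOIN route AS b ON a.company = b.company AND a.num = b.num
    JOIN stops AS sa ON a.stop = sa.id
    JOIN stops AS sb ON b.stop = sb.id
    WHERE sa.name = 'North' AND a.company = 'X'
    ORDER BY a.num, sb.name
""").fetchall()
print(reachable)
```

Here 'East' is excluded because only company 'Y' serves it, while 'North' itself is listed once per service that calls there.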


## 10.
Find the routes involving two buses that can go from **Craiglockhart** to **Lochend**.
Show the bus no. and company for the first bus, the name of the stop for the transfer,
and the bus no. and company for the second bus.

> _Hint_    
> Self-join twice to find buses that visit Craiglockhart and Lochend, then join those on matching stops.

In [12]:
bus1 = (route.join(route
            .withColumnRenamed('pos', 'pos2')
            .withColumnRenamed('stop', 'stop2'),
            ['company', 'num'])
    .join(stops, col('stop')==col('id'))
    .join(stops
          .withColumnRenamed('id', 'id2')
          .withColumnRenamed('name', 'name2'), 
          col('stop2')==col('id2'))
    .filter(col('name')=="Craiglockhart")
    .select('name2', 'company', 'num', 'stop2')
    .dropDuplicates())
bus2 = (route.join(route
            .withColumnRenamed('pos', 'pos2')
            .withColumnRenamed('stop', 'stop2'),
            ['company', 'num'])
    .join(stops, col('stop')==col('id'))
    .join(stops
          .withColumnRenamed('id', 'id2')
          .withColumnRenamed('name', 'name2'), 
          col('stop2')==col('id2'))
    .filter(col('name2')=="Lochend")
    .select('stop', 'company', 'num')
    .dropDuplicates())
(bus1.join(bus2
           .withColumnRenamed('company', 'company2')
           .withColumnRenamed('num', 'num2'), 
           bus1['stop2']==bus2['stop'])
    .select('num', 'company', 'name2', 'num2', 'company2')
    .toPandas())

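|   | num | company | name2 | num2 | company2 |
|---|-----|---------|-------|------|----------|
| 0 | 10 | LRT | Leith | 35 | LRT |
| 1 | 10 | LRT | Leith | 87 | LRT |
| 2 | 10 | LRT | Leith | 34 | LRT |
| 3 | 10 | LRT | Leith | C5 | SMT |
| 4 | 4 | LRT | Haymarket | C5 | SMT |
| 5 | 4 | LRT | Haymarket | 65 | LRT |
| 6 | 27 | LRT | Canonmills | 35 | LRT |
| 7 | 27 | LRT | Canonmills | 34 | LRT |
| 8 | 47 | LRT | Canonmills | 35 | LRT |
| 9 | 47 | LRT | Canonmills | 34 | LRT |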
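
The hint's "self-join twice, then join on matching stops" can be sketched as a single SQL query with four copies of `route`. The data below is hypothetical (stops 'Start', 'Hub', 'End', 'Other' and services 'A', 'B', 'C' are invented), so this illustrates the shape of the query rather than the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stops (id INTEGER, name TEXT);
    CREATE TABLE route (num TEXT, company TEXT, pos INTEGER, stop INTEGER);
""")
# Hypothetical data: 'A' runs Start -> Hub, 'B' runs Hub -> End, 'C' runs Start -> Other.
conn.executemany("INSERT INTO stops VALUES (?, ?)",
                 [(1, "Start"), (2, "Hub"), (3, "End"), (4, "Other")])
conn.executemany("INSERT INTO route VALUES (?, ?, ?, ?)", [
    ("A", "X", 1, 1), ("A", "X", 2, 2),
    ("B", "Y", 1, 2), ("B", "Y", 2, 3),
    ("C", "X", 1, 1), ("C", "X", 2, 4),
])

# a -> b: first bus from 'Start' to the transfer stop.
# c -> d: second bus from the transfer stop to 'End'.
# b.stop = c.stop ties the two legs together at the transfer stop.
transfers = conn.execute("""
    SELECT DISTINCT a.num, a.company, s.name, c.num, c.company
    FROM route AS a
    JOIN route AS b ON a.company = b.company AND a.num = b.num
    JOIN route AS c ON b.stop = c.stop
    JOIN route AS d ON c.company = d.company AND c.num = d.num
    JOIN stops AS s ON s.id = b.stop
    WHERE a.stop = (SELECT id FROM stops WHERE name = 'Start')
      AND d.stop = (SELECT id FROM stops WHERE name = 'End')
""").fetchall()
print(transfers)  # [('A', 'X', 'Hub', 'B', 'Y')]
```

Service 'C' drops out because no service that calls at 'Other' continues to 'End'.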


In [13]:
sc.stop()