# The JOIN operation

![rel](https://sqlzoo.net/w/images/c/ce/FootballERD.png)

## JOIN and UEFA EURO 2012

This tutorial introduces `JOIN`, which lets you use data from two or more tables. The tables contain all matches and goals from the UEFA EURO 2012 Football Championship in Poland and Ukraine.

The data is available (MySQL format) at <http://sqlzoo.net/euro2012.sql>

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.4.0`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = {
    NotebookSparkSession.builder()
    .progress(false)
    .appName("app06")
    // .master("spark://192.168.31.31:7077")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", 
            "hdfs://192.168.31.31:9000/user/hive/warehouse") 
    .config("spark.cores.max", "4") 
    .config("spark.executor.instances", "1") 
    .config("spark.executor.cores", "2") 
    .config("spark.executor.memory", "10g") 
    .config("spark.shuffle.service.enabled", "false") 
    .config("spark.dynamicAllocation.enabled", "false") 
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .config("spark.driver.allowMultipleContexts", "true")
    .getOrCreate()
}

Loading spark-stubs, spark-hive
Adding Hive conf dir /opt/hive/conf to classpath
Creating SparkSession


spark: SparkSession = org.apache.spark.sql.SparkSession@74279fa1

In [2]:
import spark.implicits._
def sc = spark.sparkContext
val hiveCxt = new org.apache.spark.sql.hive.HiveContext(sc)

defined function sc
hiveCxt: sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@c75b704

In [3]:
// Credit to Aivean
implicit class RichDF(val ds:DataFrame) {
    def showHTML(limit: Int = 50, truncate: Int = 100) = {
        import xml.Utility.escape
        val data = ds.take(limit)
        val header = ds.schema.fieldNames.toSeq        
        val rows: Seq[Seq[String]] = data.map { row =>
          row.toSeq.map {cell =>
            val str = cell match {
              case null => "null"
              case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
              case array: Array[_] => array.mkString("[", ", ", "]")
              case seq: Seq[_] => seq.mkString("[", ", ", "]")
              case _ => cell.toString
            }
            if (truncate > 0 && str.length > truncate) {
              // do not show ellipses for strings shorter than 4 characters.
              if (truncate < 4) str.substring(0, truncate)
              else str.substring(0, truncate - 3) + "..."
            } else {
              str
            }
          }: Seq[String]
        }
        publish.html(s""" <table>
                <tr>
                 ${header.map(h => s"<th>${escape(h)}</th>").mkString}
                </tr>
                ${rows.map {row =>
                  s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
                }.mkString}
            </table>
        """)
    }
}

defined class RichDF

In [4]:
val game = hiveCxt.table("sqlzoo.game")
val goal = hiveCxt.table("sqlzoo.goal")
val eteam = hiveCxt.table("sqlzoo.eteam")

game: DataFrame = [id: int, mdate: string ... 3 more fields]
goal: DataFrame = [matchid: int, teamid: string ... 2 more fields]
eteam: DataFrame = [id: string, teamname: string ... 1 more field]

## 1.
The first example shows the goal scored by a player with the last name 'Bender'. The `*` says to list all the columns in the table, which is a shorter way of saying `matchid, teamid, player, gtime`.

**Modify it to show the matchid and player name for all goals scored by Germany. To identify German players, check for: `teamid = 'GER'`**

In [5]:
goal.filter(goal("teamid")==="GER").select("matchid", "player").showHTML()

matchid,player
1008,Mario Gómez
1010,Mario Gómez
1010,Mario Gómez
1012,Lukas Podolski
1012,Lars Bender
1026,Philipp Lahm
1026,Sami Khedira
1026,Miroslav Klose
1026,Marco Reus
1030,Mesut Özil


## 2.

From the previous query you can see that Lars Bender scored a goal in game 1012. Now we want to know what teams were playing in that match.

Notice that the column `matchid` in the `goal` table corresponds to the `id` column in the `game` table. We can look up information about game 1012 by finding that row in the `game` table.

**Show id, stadium, team1, team2 for just game 1012**

In [6]:
game.filter(game("id")===1012).select("id", "stadium", "team1", "team2").showHTML()

id,stadium,team1,team2
1012,Arena Lviv,DEN,GER


## 3.
You can combine the two steps into a single query with a JOIN.

```sql
SELECT *
  FROM game JOIN goal ON (id=matchid)
```

The **FROM** clause says to merge data from the goal table with that from the game table. The **ON** says how to figure out which rows in **game** go with which rows in **goal**: the **matchid** from **goal** must match **id** from **game**. (To be more specific, we could say
`ON (game.id=goal.matchid)`.)

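To make the `ON` condition concrete, here is a minimal sketch in plain Scala collections rather than Spark. The case classes `GameRow` and `GoalRow` are hypothetical stand-ins for a subset of the table columns, and the sample rows are copied from outputs shown elsewhere in this notebook. An inner join keeps every pair of rows for which `id` equals `matchid`:

```scala
// Hypothetical row types covering a subset of the game and goal columns.
case class GameRow(id: Int, mdate: String, stadium: String, team1: String, team2: String)
case class GoalRow(matchid: Int, teamid: String, player: String)

// A few rows copied from the outputs shown in this notebook.
val sampleGames = Seq(
  GameRow(1008, "9 June 2012", "Arena Lviv", "GER", "POR"),
  GameRow(1012, "17 June 2012", "Arena Lviv", "DEN", "GER")
)
val sampleGoals = Seq(
  GoalRow(1008, "GER", "Mario Gómez"),
  GoalRow(1012, "GER", "Lukas Podolski"),
  GoalRow(1012, "GER", "Lars Bender")
)

// Inner join: pair each game with each goal, keep pairs where game.id == goal.matchid.
val joinedRows = for {
  game <- sampleGames
  goal <- sampleGoals
  if game.id == goal.matchid
} yield (goal.player, goal.teamid, game.stadium, game.mdate)

joinedRows.foreach(println)
```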
The code below shows the player (from the goal table) and stadium name (from the game table) for every goal scored.

**Modify it to show the player, teamid, stadium and mdate for every German goal.**

In [7]:
(game.join(goal, game("id")===goal("matchid"))
    .filter(goal("teamid")==="GER")
    .select("player", "teamid", "stadium", "mdate")
    .showHTML())

player,teamid,stadium,mdate
Mario Gómez,GER,Arena Lviv,9 June 2012
Mario Gómez,GER,Metalist Stadium,13 June 2012
Mario Gómez,GER,Metalist Stadium,13 June 2012
Lukas Podolski,GER,Arena Lviv,17 June 2012
Lars Bender,GER,Arena Lviv,17 June 2012
Philipp Lahm,GER,PGE Arena Gdansk,22 June 2012
Sami Khedira,GER,PGE Arena Gdansk,22 June 2012
Miroslav Klose,GER,PGE Arena Gdansk,22 June 2012
Marco Reus,GER,PGE Arena Gdansk,22 June 2012
Mesut Özil,GER,"National Stadium, Warsaw",28 June 2012


## 4.
Use the same `JOIN` as in the previous question.

**Show the team1, team2 and player for every goal scored by a player called Mario (`player LIKE 'Mario%'`).**

In [8]:
(game.join(goal, game("id")===goal("matchid"))
    .filter(goal("player").startsWith("Mario"))
    .select("team1", "team2", "player").showHTML())

team1,team2,player
GER,POR,Mario Gómez
NED,GER,Mario Gómez
NED,GER,Mario Gómez
IRL,CRO,Mario Mandžukic
IRL,CRO,Mario Mandžukic
ITA,CRO,Mario Mandžukic
ITA,IRL,Mario Balotelli
GER,ITA,Mario Balotelli
GER,ITA,Mario Balotelli


## 5.

The table `eteam` gives details of every national team, including the coach. You can `JOIN` `goal` to `eteam` using the phrase `goal JOIN eteam ON teamid=id`.

**Show `player, teamid, coach, gtime` for all goals scored in the first 10 minutes (`gtime<=10`).**

In [9]:
(goal.join(eteam, goal("teamid")===eteam("id"))
    .filter(goal("gtime") <= 10)
    .select("player", "teamid", "coach", "gtime")
    .showHTML())

player,teamid,coach,gtime
Petr Jirácek,CZE,Michal Bílek,3
Václav Pilar,CZE,Michal Bílek,6
Mario Mandžukic,CRO,Slaven Bilic,3
Fernando Torres,ESP,Vicente del Bosque,4


## 6.

To `JOIN` `game` with `eteam` you could use either
`game JOIN eteam ON (team1=eteam.id)` or `game JOIN eteam ON (team2=eteam.id)`

Notice that because `id` is a column name in both `game` and `eteam`, you must specify `eteam.id` instead of just `id`.

**List the dates of the matches and the name of the team in which 'Fernando Santos' was the team1 coach.**

In [10]:
(game.join(eteam, game("team1")===eteam("id"))
 .filter($"coach"==="Fernando Santos")
 .select("mdate", "teamname")
 .showHTML())

mdate,teamname
12 June 2012,Greece
16 June 2012,Greece


## 7.

**List the player for every goal scored in a game where the stadium was 'National Stadium, Warsaw'**

In [11]:
(goal.join(game, goal("matchid")===game("id"))
 .filter($"stadium"==="National Stadium, Warsaw")
 .select("player").showHTML())

player
Robert Lewandowski
Dimitris Salpingidis
Alan Dzagoev
Jakub Blaszczykowski
Giorgos Karagounis
Cristiano Ronaldo
Mario Balotelli
Mario Balotelli
Mesut Özil


## 8. More difficult questions

The example query shows all goals scored in the Germany-Greece quarterfinal.
**Instead show the name of all players who scored a goal against Germany.**

> __HINT__   
> Select goals scored only by non-German players in matches where GER was the id of either **team1** or **team2**.
> You can use `teamid!='GER'` to prevent listing German players.
> You can use `DISTINCT` to stop players being listed twice.

In [12]:
(game.join(goal, game("id")===goal("matchid"))
 .filter((($"team1" === "GER") || ($"team2" === "GER")) && 
         ($"teamid" =!= "GER")) // =!= replaces the deprecated !== operator
 .select("player")
 .distinct()
 .showHTML())

player
Michael Krohn-Dehli
Robin van Persie
Mario Balotelli
Dimitris Salpingidis
Georgios Samaras


## 9.
**Show teamname and the total number of goals scored.**

> __COUNT and GROUP BY__  
> Use `COUNT(*)` in the `SELECT` line and `GROUP BY teamname`.

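As a rough illustration of what `COUNT(*)` with `GROUP BY teamname` computes, the following plain-Scala sketch joins teams to goals and counts the joined rows per team name. `TeamRow` and `ScorerRow` are hypothetical types, and the rows are only a small sample taken from outputs in this notebook, so the counts are not the tournament totals:

```scala
// Hypothetical row types covering a subset of the eteam and goal columns.
case class TeamRow(id: String, teamname: String)
case class ScorerRow(teamid: String, player: String)

// A small sample only; the real tables hold many more rows.
val sampleTeams = Seq(TeamRow("GER", "Germany"), TeamRow("CRO", "Croatia"))
val sampleScorers = Seq(
  ScorerRow("GER", "Mario Gómez"),
  ScorerRow("GER", "Lars Bender"),
  ScorerRow("CRO", "Mario Mandžukic")
)

// JOIN on eteam.id = goal.teamid, then GROUP BY teamname and COUNT(*).
val goalsPerTeam: Map[String, Int] =
  (for {
    team <- sampleTeams
    goal <- sampleScorers
    if team.id == goal.teamid
  } yield team.teamname)
    .groupBy(identity)
    .map { case (name, rows) => name -> rows.size }

goalsPerTeam.foreach(println)
```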
In [13]:
(eteam.join(goal, eteam("id")===goal("teamid"))
 .select("teamname", "player")
 .groupBy("teamname")
 .count()
 .showHTML())

teamname,count
Russia,5
Sweden,5
Germany,10
France,3
Greece,5
Croatia,4
Italy,6
Spain,12
Denmark,4
Ukraine,2


## 10.

**Show the stadium and the number of goals scored in each stadium.**

In [14]:
(game.join(goal, game("id")===goal("matchid"))
 .select("stadium", "player")
 .groupBy("stadium")
 .count()
 .showHTML())

stadium,count
Metalist Stadium,7
Arena Lviv,9
Stadion Miejski (Poznan),8
Donbass Arena,7
Stadion Miejski (Wroclaw),9
Olimpiyskiy National Sports Complex,14
"National Stadium, Warsaw",9
PGE Arena Gdansk,13


## 11.
**For every match involving 'POL', show the matchid, date and the number of goals scored.**

In [15]:
(game.join(goal, game("id")===goal("matchid"))
 .filter($"team1"==="POL" || $"team2"==="POL")
 .select("matchid", "mdate", "player")
 .groupBy("matchid", "mdate")
 .count()
 .showHTML())

matchid,mdate,count
1004,12 June 2012,2
1001,8 June 2012,2
1005,16 June 2012,1


## 12.
**For every match where 'GER' scored, show matchid, match date and the number of goals scored by 'GER'**

In [16]:
(game.join(goal, game("id")===goal("matchid"))
 .filter($"teamid"==="GER")
 .select("matchid", "mdate", "player")
 .groupBy("matchid", "mdate")
 .count()
 .showHTML())

matchid,mdate,count
1012,17 June 2012,2
1010,13 June 2012,2
1030,28 June 2012,1
1026,22 June 2012,4
1008,9 June 2012,1


## 13.
List every match with the goals scored by each team as shown. This will use `CASE WHEN`, which has not been explained in any previous exercises.

mdate        | team1 | score1 | team2 | score2
-------------|-------|--------|-------|-------
1 July 2012  | ESP   | 4      | ITA   | 0
10 June 2012 | ESP   | 1      | ITA   | 1
10 June 2012 | IRL   | 1      | CRO   | 3
...          | ...   | ...    | ...   | ...

Notice that in the query given, every goal is listed. If it was a team1 goal then a 1 appears in score1, otherwise there is a 0. You could SUM this column to get a count of the goals scored by team1. **Sort your result by mdate, matchid, team1 and team2.**

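The `CASE WHEN` idea can be sketched in plain Scala as a conditional 1-or-0 per goal followed by a sum. The rows and the names `FixtureRow` and `TallyRow` below are hypothetical placeholders, not tournament data. In the DataFrame API, the closest counterpart would be `sum(when(...).otherwise(0))` from `org.apache.spark.sql.functions`, which this sketch does not use. Filtering the goals per match behaves like a left join here, so a goalless match still appears with 0 and 0:

```scala
// Hypothetical placeholder rows: team codes and match ids are made up for illustration.
case class FixtureRow(id: Int, mdate: String, team1: String, team2: String)
case class TallyRow(matchid: Int, teamid: String)

val fixtures = Seq(
  FixtureRow(1, "1 June", "AAA", "BBB"),
  FixtureRow(2, "2 June", "CCC", "DDD") // no goals scored in this match
)
val tallies = Seq(TallyRow(1, "AAA"), TallyRow(1, "BBB"), TallyRow(1, "AAA"))

val scoreRows = fixtures.map { m =>
  // Left-join behaviour: the list of goals may be empty for a match.
  val matchGoals = tallies.filter(_.matchid == m.id)
  // CASE WHEN teamid = team1 THEN 1 ELSE 0 END, then SUM.
  val score1 = matchGoals.map(g => if (g.teamid == m.team1) 1 else 0).sum
  // CASE WHEN teamid = team2 THEN 1 ELSE 0 END, then SUM.
  val score2 = matchGoals.map(g => if (g.teamid == m.team2) 1 else 0).sum
  (m.mdate, m.team1, score1, m.team2, score2)
}

scoreRows.foreach(println)
```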
In [17]:
// sc.conf.set("spark.sql.crossJoin.enabled", "true")
val s = (goal.select("matchid", "teamid", "player")
         .groupBy("matchid", "teamid")
         .count())

val a = (game.join(s, (game("id")===s("matchid") && 
                       game("team1")===s("teamid")), joinType="left")
         .drop("teamid", "matchid")
         .withColumnRenamed("count", "score1").alias("a"))
(a.join(s, (a("id")===s("matchid") && a("team2")===s("teamid")), joinType="left")
 .drop("matchid", "teamid")
 .withColumnRenamed("count", "score2")
 // Sort while id (the match id) is still available, as the exercise asks.
 .orderBy("mdate", "id", "team1", "team2")
 .select("mdate", "team1", "score1", "team2", "score2")
 .na.fill(Map("score1" -> 0, "score2" -> 0))
 .showHTML())

mdate,team1,score1,team2,score2
1 July 2012,ESP,4,ITA,0
10 June 2012,ESP,1,ITA,1
10 June 2012,IRL,1,CRO,3
11 June 2012,FRA,1,ENG,1
11 June 2012,UKR,2,SWE,1
12 June 2012,GRE,1,CZE,2
12 June 2012,POL,1,RUS,1
13 June 2012,DEN,2,POR,3
13 June 2012,NED,1,GER,2
14 June 2012,ITA,1,CRO,1


s: DataFrame = [matchid: int, teamid: string ... 1 more field]
a: Dataset[Row] = [id: int, mdate: string ... 4 more fields]

In [18]:
spark.stop()