# Using Null

- teacher

id  | dept | name       | phone | mobile
----|------|------------|-------|---------------
101 | 1    | Shrivell   | 2753  | 07986 555 1234
102 | 1    | Throd      | 2754  | 07122 555 1920
103 | 1    | Splint     | 2293  |
104 |      | Spiregrain | 3287  |
105 | 2    | Cutflower  | 3212  | 07996 555 6574
106 |      | Deadyawn   | 3345  |
... |      |            |       |

- dept

id  | name
----|------------
1   | Computing
2   | Design
3   | Engineering
... |

### Teachers and Departments
The school includes many departments. Most teachers work exclusively for a single department. Some teachers have no department.

See also: [Selecting NULL values](https://sqlzoo.net/wiki/Selecting_NULL_values).

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.4.0`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = {
    NotebookSparkSession.builder()
    .progress(false)
    .appName("app08")
    // .master("spark://192.168.31.31:7077")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", 
            "hdfs://192.168.31.31:9000/user/hive/warehouse") 
    .config("spark.cores.max", "4") 
    .config("spark.executor.instances", "1") 
    .config("spark.executor.cores", "2") 
    .config("spark.executor.memory", "10g") 
    .config("spark.shuffle.service.enabled", "false") 
    .config("spark.dynamicAllocation.enabled", "false") 
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .config("spark.driver.allowMultipleContexts", "true")
    .getOrCreate()
}

Loading spark-stubs, spark-hive
Adding Hive conf dir /opt/hive/conf to classpath
Creating SparkSession


SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.

spark: SparkSession = org.apache.spark.sql.SparkSession@73be95f9

In [2]:
import spark.implicits._
def sc = spark.sparkContext
// HiveContext is deprecated since Spark 2.0; spark.table(...) reads the same Hive tables.
val hiveCxt = new org.apache.spark.sql.hive.HiveContext(sc)

defined function sc
hiveCxt: sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2b4fd56e

In [3]:
// Credit to Aivean
implicit class RichDF(val ds:DataFrame) {
    def showHTML(limit: Int = 50, truncate: Int = 100) = {
        import xml.Utility.escape
        val data = ds.take(limit)
        val header = ds.schema.fieldNames.toSeq        
        val rows: Seq[Seq[String]] = data.map { row =>
          row.toSeq.map {cell =>
            val str = cell match {
              case null => "null"
              case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
              case array: Array[_] => array.mkString("[", ", ", "]")
              case seq: Seq[_] => seq.mkString("[", ", ", "]")
              case _ => cell.toString
            }
            if (truncate > 0 && str.length > truncate) {
              // do not show ellipses for strings shorter than 4 characters.
              if (truncate < 4) str.substring(0, truncate)
              else str.substring(0, truncate - 3) + "..."
            } else {
              str
            }
          }: Seq[String]
        }
        publish.html(s""" <table>
                <tr>
                 ${header.map(h => s"<th>${escape(h)}</th>").mkString}
                </tr>
                ${rows.map {row =>
                  s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
                }.mkString}
            </table>
        """)
    }
}

defined class RichDF

In [4]:
val teacher = hiveCxt.table("sqlzoo.teacher")
val dept = hiveCxt.table("sqlzoo.dept")

teacher: DataFrame = [id: int, dept: int ... 3 more fields]
dept: DataFrame = [id: int, name: string]

## 1. NULL, INNER JOIN, LEFT JOIN, RIGHT JOIN

List the teachers who have NULL for their department.

> _Why we cannot use =_   
> You might think that the phrase `dept = NULL` would work here, but it doesn't: any comparison with NULL evaluates to unknown rather than true, so no row passes the filter. Use the phrase `dept IS NULL` instead.
> 
> _That's not a proper explanation._  
> No, it's not, but you can read a better explanation at Wikipedia:NULL.

In [5]:
(teacher.filter(isnull(teacher("dept")))
 .select("name").showHTML())

name
Spiregrain
Deadyawn
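
The note on why `=` fails can be checked directly. The following is a minimal sketch, not part of the original notebook: it assumes a local Spark session and a small in-memory DataFrame (`toyTeacher`, a hypothetical stand-in for `sqlzoo.teacher`), and compares `=== lit(null)`, `isNull`, and Spark's null-safe equality operator `<=>`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session; inside the notebook, getOrCreate() reuses the existing one.
val session = SparkSession.builder().master("local[*]").appName("null-compare").getOrCreate()

// Hypothetical toy rows standing in for sqlzoo.teacher (name, dept).
val toyTeacher = session.createDataFrame(Seq[(String, Option[Int])](
  ("Shrivell", Some(1)),
  ("Spiregrain", None),
  ("Cutflower", Some(2)),
  ("Deadyawn", None)
)).toDF("name", "dept")

// dept = NULL evaluates to NULL for every row, so the filter keeps nothing.
val viaEquals = toyTeacher.filter(col("dept") === lit(null)).count()
// IS NULL is the dedicated null test.
val viaIsNull = toyTeacher.filter(col("dept").isNull).count()
// <=> is null-safe equality: NULL <=> NULL is true.
val viaNullSafe = toyTeacher.filter(col("dept") <=> lit(null)).count()

println(s"=== null: $viaEquals, isNull: $viaIsNull, <=> null: $viaNullSafe")
```

With these toy rows, the plain equality filter matches no rows, while `isNull` and `<=>` both match the two teachers without a department.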


## 2.
Note that the INNER JOIN omits both the teachers with no department and the departments with no teacher.

In [6]:
(teacher.withColumnRenamed("name", "teacher")
     .join(dept, teacher("dept")===dept("id"))
     .select("teacher", "name")
     .showHTML())

teacher,name
Shrivell,Computing
Throd,Computing
Splint,Computing
Cutflower,Design
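
To see exactly which rows the inner join leaves out, an anti join can list them. This is a sketch under assumptions: `toyTeacher` and `toyDept` are hypothetical in-memory stand-ins for the Hive tables, and `"left_anti"` is the Spark join type that keeps left-side rows with no match on the right.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; inside the notebook, getOrCreate() reuses the existing one.
val session = SparkSession.builder().master("local[*]").appName("anti-join").getOrCreate()

// Hypothetical toy data mirroring the shape of sqlzoo.teacher and sqlzoo.dept.
val toyTeacher = session.createDataFrame(Seq[(String, Option[Int])](
  ("Shrivell", Some(1)),
  ("Spiregrain", None),
  ("Cutflower", Some(2))
)).toDF("teacher", "dept")

val toyDept = session.createDataFrame(Seq(
  (1, "Computing"),
  (2, "Design"),
  (3, "Engineering")
)).toDF("id", "name")

val onDept = toyTeacher("dept") === toyDept("id")

// Teachers dropped by the inner join: a NULL dept never satisfies the join condition.
val unmatchedTeachers = toyTeacher.join(toyDept, onDept, "left_anti")
  .select("teacher").collect().map(_.getString(0)).toSeq

// Departments dropped by the inner join: no teacher row refers to them.
val unmatchedDepts = toyDept.join(toyTeacher, onDept, "left_anti")
  .select("name").collect().map(_.getString(0)).toSeq

println(s"teachers without a department: $unmatchedTeachers")
println(s"departments without a teacher: $unmatchedDepts")
```

On the toy data this yields Spiregrain on the teacher side and Engineering on the department side, the same two kinds of rows that the LEFT and RIGHT joins in the following exercises bring back.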


## 3.
Use a different JOIN so that all teachers are listed.

In [7]:
(teacher.withColumnRenamed("name", "teacher")
    .join(dept, teacher("dept")===dept("id"), joinType="left")
    .select("teacher", "name")
    .showHTML())

teacher,name
Shrivell,Computing
Throd,Computing
Splint,Computing
Spiregrain,
Cutflower,Design
Deadyawn,


## 4.
Use a different JOIN so that all departments are listed.

In [8]:
(teacher.withColumnRenamed("name", "teacher")
    .join(dept, teacher("dept")===dept("id"), joinType="right")
    .select("teacher", "name")
    .showHTML())

teacher,name
Splint,Computing
Throd,Computing
Shrivell,Computing
Cutflower,Design
,Engineering


## 5. Using the [COALESCE](https://sqlzoo.net/wiki/COALESCE) function


Use COALESCE to print the mobile number. Use the number '07986 444 2266' if there is no number given. **Show teacher name and mobile number or '07986 444 2266'**

In [9]:
(teacher.select("name", "mobile")
 .na.fill("07986 444 2266", Array("mobile"))
 .showHTML())

name,mobile
Shrivell,07986 555 1234
Throd,07122 555 1920
Splint,07986 444 2266
Spiregrain,07986 444 2266
Cutflower,07996 555 6574
Deadyawn,07986 444 2266
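
The cell above uses `na.fill`, which gives the same result here. For a version closer to the SQL wording, the sketch below uses the `coalesce` function from `org.apache.spark.sql.functions`. It assumes a local session and a hypothetical two-row DataFrame (`toyTeacher`) in place of the real table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assumption: a local session; inside the notebook, getOrCreate() reuses the existing one.
val session = SparkSession.builder().master("local[*]").appName("coalesce-demo").getOrCreate()

// Hypothetical toy rows standing in for sqlzoo.teacher (name, mobile).
val toyTeacher = session.createDataFrame(Seq[(String, Option[String])](
  ("Shrivell", Some("07986 555 1234")),
  ("Splint", None)
)).toDF("name", "mobile")

// coalesce returns its first non-null argument, evaluated per row.
val withDefault = toyTeacher.select(
  col("name"),
  coalesce(col("mobile"), lit("07986 444 2266")).alias("mobile"))

val mobiles = withDefault.collect().map(r => r.getString(0) -> r.getString(1)).toMap
println(mobiles)
```

Unlike `na.fill`, `coalesce` accepts any column expressions as fallbacks, so it can also fall back to another column (for example `phone`) before the literal.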


## 6.
Use the COALESCE function and a LEFT JOIN to print the teacher name and department name. Use the string 'None' where there is no department.

In [10]:
(teacher.withColumnRenamed("name", "teacher")
    .join(dept, teacher("dept")===dept("id"), joinType="left")
    .select("teacher", "name")
    .na.fill("None", Array("name"))
    .showHTML())

teacher,name
Shrivell,Computing
Throd,Computing
Splint,Computing
Spiregrain,None
Cutflower,Design
Deadyawn,None


## 7.
Use COUNT to show the number of teachers and the number of mobile phones.

In [11]:
teacher.agg(count("name"), count("mobile")).showHTML()

count(name),count(mobile)
6,3
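
The two counts differ because `count` on a column skips NULL values. The sketch below makes that visible by also counting rows with `count(lit(1))`. It assumes a local session and a hypothetical three-row DataFrame (`toyTeacher`), not the real table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit}

// Assumption: a local session; inside the notebook, getOrCreate() reuses the existing one.
val session = SparkSession.builder().master("local[*]").appName("count-nulls").getOrCreate()

// Hypothetical toy rows standing in for sqlzoo.teacher (name, mobile).
val toyTeacher = session.createDataFrame(Seq[(String, Option[String])](
  ("Shrivell", Some("07986 555 1234")),
  ("Splint", None),
  ("Cutflower", Some("07996 555 6574"))
)).toDF("name", "mobile")

val counts = toyTeacher.agg(
  count(lit(1)).alias("rows"),           // counts every row
  count(col("mobile")).alias("mobiles")  // counts only non-null mobiles
).head()

val rowCount = counts.getLong(0)
val mobileCount = counts.getLong(1)
println(s"rows: $rowCount, mobiles: $mobileCount")
```

The same behaviour is what makes `count("teacher")` return 0 for a department with no staff after a RIGHT JOIN: the row exists, but its teacher column is NULL.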


## 8.
Use COUNT and GROUP BY **dept.name** to show each department and the number of staff. Use a RIGHT JOIN to ensure that the Engineering department is listed.

In [12]:
(teacher.withColumnRenamed("name", "teacher")
 .join(dept, teacher("dept")===dept("id"), joinType="right")
 .groupBy("name")
 .agg(count("teacher"))
 .showHTML())

name,count(teacher)
Engineering,0
Computing,3
Design,1


## 9. Using [CASE](https://sqlzoo.net/wiki/CASE)


Use CASE to show the **name** of each teacher followed by 'Sci' if the teacher is in **dept** 1 or 2 and 'Art' otherwise.

In [13]:
(teacher
 // 'Sci' for dept 1 or 2; every other value, including NULL, falls through to 'Art'
 .select($"name", $"dept", when($"dept".isin(List(1, 2): _*), "Sci")
         .otherwise("Art").alias("label"))
 .showHTML())

name,dept,label
Shrivell,1,Sci
Throd,1,Sci
Splint,1,Sci
Spiregrain,,Art
Cutflower,2,Sci
Deadyawn,,Art


## 10.
Use CASE to show the name of each teacher followed by 'Sci' if the teacher is in dept 1 or 2, show 'Art' if the teacher's dept is 3 and 'None' otherwise.

In [14]:
(teacher
 .select($"name", $"dept", when($"dept".isin(List(1, 2): _*), "Sci")
         .when($"dept".isin(List(3): _*), "Art")
         .otherwise("None").alias("label"))
 .showHTML())

name,dept,label
Shrivell,1,Sci
Throd,1,Sci
Splint,1,Sci
Spiregrain,,None
Cutflower,2,Sci
Deadyawn,,None
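
Teachers with a NULL dept end up in the `otherwise` branch because a NULL condition is not treated as true by `when`. The sketch below illustrates this, and also shows that leaving out `otherwise` yields NULL for unmatched rows. The session, the `toyTeacher` DataFrame, and the teacher named `Example` in dept 3 are all hypothetical and not taken from the original data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session; inside the notebook, getOrCreate() reuses the existing one.
val session = SparkSession.builder().master("local[*]").appName("case-nulls").getOrCreate()

// Hypothetical toy rows; "Example" is an invented teacher used to exercise the dept 3 branch.
val toyTeacher = session.createDataFrame(Seq[(String, Option[Int])](
  ("Shrivell", Some(1)),
  ("Spiregrain", None),
  ("Example", Some(3))
)).toDF("name", "dept")

// Full CASE expression with an explicit default.
val label = when(col("dept").isin(1, 2), "Sci")
  .when(col("dept") === 3, "Art")
  .otherwise("None")

// Same first branch without otherwise: unmatched rows become NULL.
val noDefault = when(col("dept").isin(1, 2), "Sci")

val labelled = toyTeacher
  .select(col("name"), label.alias("label"), noDefault.alias("no_default"))
  .collect()
  .map(r => r.getString(0) -> (r.getString(1), Option(r.getString(2))))
  .toMap

println(labelled)
```

Here Spiregrain gets the label `None` from `otherwise`, while the `no_default` column is NULL for both Spiregrain and the dept 3 teacher.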


In [15]:
spark.stop()