# SELECT from WORLD

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.4.0`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = {
    NotebookSparkSession.builder()
    .progress(false)
    .appName("app02")
    // .master("spark://192.168.31.31:7077")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", 
            "hdfs://192.168.31.31:9000/user/hive/warehouse") 
    .config("spark.cores.max", "4") 
    .config("spark.executor.instances", "1") 
    .config("spark.executor.cores", "2") 
    .config("spark.executor.memory", "10g") 
    .config("spark.shuffle.service.enabled", "false") 
    .config("spark.dynamicAllocation.enabled", "false") 
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .config("spark.driver.allowMultipleContexts", "true")
    .getOrCreate()
}

Loading spark-stubs, spark-hive
Adding Hive conf dir /opt/hive/conf to classpath
Creating SparkSession


SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.


import $ivy.$
import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
spark: SparkSession = org.apache.spark.sql.SparkSession@7b280c25

In [2]:
import spark.implicits._
def sc = spark.sparkContext
val hiveCxt = new org.apache.spark.sql.hive.HiveContext(sc)

import spark.implicits._
defined function sc
hiveCxt: sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@387b229

In [3]:
// Credit to Aivean
implicit class RichDF(val ds:DataFrame) {
    def showHTML(limit: Int = 50, truncate: Int = 100) = {
        import xml.Utility.escape
        val data = ds.take(limit)
        val header = ds.schema.fieldNames.toSeq        
        val rows: Seq[Seq[String]] = data.map { row =>
          row.toSeq.map {cell =>
            val str = cell match {
              case null => "null"
              case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
              case array: Array[_] => array.mkString("[", ", ", "]")
              case seq: Seq[_] => seq.mkString("[", ", ", "]")
              case _ => cell.toString
            }
            if (truncate > 0 && str.length > truncate) {
              // do not show ellipses for strings shorter than 4 characters.
              if (truncate < 4) str.substring(0, truncate)
              else str.substring(0, truncate - 3) + "..."
            } else {
              str
            }
          }: Seq[String]
        }
        publish.html(s""" <table>
                <tr>
                 ${header.map(h => s"<th>${escape(h)}</th>").mkString}
                </tr>
                ${rows.map {row =>
                  s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
                }.mkString}
            </table>
        """)
    }
}

defined [32mclass[39m [36mRichDF[39m

In [4]:
val world = hiveCxt.table("sqlzoo.world")

world: DataFrame = [name: string, continent: string ... 6 more fields]

## 1. Introduction

[Read the notes about this table](https://sqlzoo.net/wiki/Read_the_notes_about_this_table.). Observe the result of running this SQL command to show the name, continent and population of all countries.

In [5]:
world.select($"name", $"continent", $"population").showHTML()

name,continent,population
Afghanistan,Asia,32225560.0
Albania,Europe,2845955.0
Algeria,Africa,43000000.0
Andorra,Europe,77543.0
Angola,Africa,31127674.0
Antigua and Barbuda,Caribbean,96453.0
Argentina,South America,44938712.0
Armenia,Eurasia,2957500.0
Australia,Oceania,25690023.0
Austria,Europe,8902600.0


## 2. Large Countries

[How to use WHERE to filter records](https://sqlzoo.net/wiki/WHERE_filters). Show the name for the countries that have a population of at least 200 million. 200 million is 200000000 (eight zeros).
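
As a plain-Scala sanity check of the threshold, independent of Spark, the sketch below applies the same "at least 200 million" test to four population figures that appear in the outputs of this notebook. The names `populations`, `threshold` and `large` are illustrative only.

```scala
// Population figures copied from outputs shown in this notebook.
val populations = Seq(
  "Australia" -> 25690023L,
  "Brazil"    -> 211442625L,
  "Indonesia" -> 266911900L,
  "Russia"    -> 146745098L
)

// "At least" means >=, and 2e8 is the same number as 200000000.
val threshold = 200000000L
val large = populations.collect { case (name, pop) if pop >= threshold => name }
```

Only Brazil and Indonesia pass the test here, which agrees with the filtered result for the full table.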

In [6]:
world.filter($"population">=2e8).select($"name").showHTML()

name
Brazil
China
India
Indonesia
Nigeria
Pakistan
United States


## 3. Per capita GDP

Give the `name` and the **per capita GDP** for those countries with a `population` of at least 200 million.

> _HELP: How to calculate per capita GDP_   
> Per capita GDP is the GDP divided by the population: GDP/population.
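
A quick worked instance of the formula in plain Scala may help. It uses Brazil's population and the GDP figure in billions as it appears (already rounded) in the Rounding exercise output of this notebook, so the result is only approximate; the variable names are illustrative.

```scala
// Brazil: GDP in billions (rounded, as shown in this notebook) and population.
val gdpBillions = 2055.51
val population  = 211442625L

// Per capita GDP = GDP / population (convert billions back to units first).
val perCapita = gdpBillions * 1e9 / population
```

The value lands between 9721 and 9722, close to the figure computed from the unrounded GDP column.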

In [7]:
(world.withColumn("pcgdp", round($"gdp"/$"population", 2))
    .filter($"population" >= 2e8)
    .select($"name", $"pcgdp")
    .showHTML())

name,pcgdp
Brazil,9721.37
China,8724.31
India,1891.78
Indonesia,3804.77
Nigeria,1822.89
Pakistan,1377.04
United States,59121.19


## 4. South America In millions

Show the `name` and `population` in millions for the countries of the `continent` 'South America'. Divide the population by 1000000 to get population in millions.
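
The conversion can be checked by hand in plain Scala. This is a minimal sketch, assuming half-up rounding to two decimal places; `inMillions` is a hypothetical helper, and the two population figures are taken from outputs in this notebook.

```scala
import scala.math.BigDecimal.RoundingMode

// Divide by one million, then keep two decimal places (half-up).
def inMillions(population: Double): BigDecimal =
  BigDecimal(population / 1e6).setScale(2, RoundingMode.HALF_UP)

val argentina = inMillions(44938712.0)   // Argentina's population
val brazil    = inMillions(211442625.0)  // Brazil's population
```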

In [8]:
(world.withColumn("popl", round($"population"/1e6, 2))
    .filter($"continent" === "South America")
    .select($"name", $"popl")
    .showHTML())

name,popl
Argentina,44.94
Bolivia,11.47
Brazil,211.44
Chile,19.11
Colombia,49.4
Ecuador,17.47
Guyana,0.78
Paraguay,7.25
Peru,32.13
Saint Vincent and the Grenadines,0.11


## 5. France, Germany, Italy

Show the `name` and `population` for France, Germany and Italy.

In [9]:
val listVal = Seq("France", "Germany", "Italy")
(world.filter($"name".isin(listVal: _*))
     .select($"name", $"population")
     .showHTML())

name,population
France,67076000.0
Germany,83149300.0
Italy,60238522.0


listVal: Seq[String] = List("France", "Germany", "Italy")

## 6. United

Show the countries which have a `name` that includes the word 'United'.

In [10]:
(world.filter($"name".contains("United"))
 .select($"name").showHTML())

name
United Arab Emirates
United Kingdom
United States


## 7. Two ways to be big

Two ways to be big: A country is **big** if it has an area of more than 3 million sq km or it has a population of more than 250 million.

**Show the countries that are big by area or big by population. Show name, population and area.**

In [11]:
(world.filter(($"area" > 3e6) || ($"population" > 2.5e8))
    .select($"name", $"population", $"area")
    .showHTML())

name,population,area
Australia,25690023.0,7692024.0
Brazil,211442625.0,8515767.0
Canada,38007166.0,9984670.0
China,1402378640.0,9596961.0
India,1361503224.0,3166414.0
Indonesia,266911900.0,1904569.0
Russia,146745098.0,17125242.0
United States,329583916.0,9826675.0


## 8. One or the other (but not both)

**Exclusive OR (XOR). Show the countries that are big by area (more than 3 million) or big by population (more than 250 million) but not both. Show name, population and area.**

- Australia has a big area but a small population, it should be **included**.
- Indonesia has a big population but a small area, it should be **included**.
- China has a big population **and** big area, it should be **excluded**.
- United Kingdom has a small population and a small area, it should be **excluded**.
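
The four cases above can be sketched in plain Scala, where exclusive OR on two Booleans is simply "the two flags differ". The tuples encode each bullet as (name, big by area, big by population); the names `cases` and `included` are illustrative.

```scala
// (name, bigByArea, bigByPopulation), one tuple per bullet case above.
val cases = Seq(
  ("Australia",      true,  false),
  ("Indonesia",      false, true),
  ("China",          true,  true),
  ("United Kingdom", false, false)
)

// XOR: keep a country when exactly one of the two flags is true.
val included = cases.collect {
  case (name, bigByArea, bigByPopulation) if bigByArea != bigByPopulation => name
}
```

Comparing the two conditions for inequality is the same idea a Column-based filter can use: the row is kept when the two Boolean expressions disagree.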

In [12]:
(world.filter(($"area" > 3e6) =!= ($"population" > 2.5e8))  // =!= replaces the deprecated !==
     .select($"name", $"population", $"area")
     .showHTML())

name,population,area
Australia,25690023.0,7692024.0
Brazil,211442625.0,8515767.0
Canada,38007166.0,9984670.0
Indonesia,266911900.0,1904569.0
Russia,146745098.0,17125242.0


## 9. Rounding

Show the `name` and `population` in millions and the GDP in billions for the countries of the `continent` 'South America'. Use the [ROUND](https://sqlzoo.net/wiki/ROUND) function to show the values to two decimal places.

**For South America show population in millions and GDP in billions both to 2 decimal places.**

> _Millions and billions_    
> Divide by 1000000 (6 zeros) for millions. Divide by 1000000000 (9 zeros) for billions.

In [13]:
(world.filter($"continent" === "South America")
    .withColumn("popl", round($"population"/1e6, 2))
    .withColumn("gdp_", round($"gdp"/1e9, 2))
    .select($"name", $"popl", $"gdp_")
    .showHTML())

name,popl,gdp_
Argentina,44.94,637.49
Bolivia,11.47,37.51
Brazil,211.44,2055.51
Chile,19.11,277.08
Colombia,49.4,309.19
Ecuador,17.47,104.3
Guyana,0.78,3.09
Paraguay,7.25,29.44
Peru,32.13,211.4
Saint Vincent and the Grenadines,0.11,0.73


## 10. Trillion dollar economies

Show the `name` and per-capita GDP for those countries with a GDP of at least one trillion (1000000000000; that is 12 zeros). Round this value to the nearest 1000.

**Show per-capita GDP for the trillion dollar countries to the nearest $1000.**
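
Rounding to the nearest 1000 amounts to dividing by 1000, rounding to a whole number, and multiplying back. A minimal plain-Scala sketch, using per-capita values from the Per capita GDP output earlier in this notebook; `toNearestThousand` is a hypothetical helper name.

```scala
// Divide by 1000, round to the nearest whole number, scale back up.
def toNearestThousand(x: Double): Long = math.round(x / 1000.0) * 1000L

// Per-capita GDP values as shown in the Per capita GDP exercise output.
val perCapita = Seq(
  "Brazil"    -> 9721.37,
  "China"     -> 8724.31,
  "India"     -> 1891.78,
  "Indonesia" -> 3804.77
)

val rounded = perCapita.map { case (name, value) => name -> toNearestThousand(value) }
```

In Spark, passing a negative scale to `round` (for example `round(col, -3)`) is, to my understanding, an alternative way to express the same thing.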

In [14]:
(world
 .withColumn("pcgdp", round($"gdp" / (lit(1000) * $"population"), 0) * lit(1000))
 .filter($"gdp" >= 1e12)  // "at least one trillion" means >=
 .select($"name", $"pcgdp")
 .showHTML())

name,pcgdp
Australia,55000.0
Brazil,10000.0
Canada,43000.0
China,9000.0
France,39000.0
Germany,44000.0
India,2000.0
Indonesia,4000.0
Italy,32000.0
Japan,39000.0


## 11. Name and capital have the same length

Greece has capital Athens.

Each of the strings 'Greece', and 'Athens' has 6 characters.

**Show the name and capital where the name and the capital have the same number of characters.**

- You can use the [LENGTH](https://sqlzoo.net/wiki/LENGTH) function to find the number of characters in a string
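
The length comparison itself is easy to try on a few name/capital pairs in plain Scala. The pairs below all appear in this notebook; `pairs` and `sameLength` are illustrative names.

```scala
// (country, capital) pairs mentioned in this notebook.
val pairs = Seq(
  "Greece" -> "Athens",
  "Sweden" -> "Stockholm",
  "Egypt"  -> "Cairo"
)

// Keep the pairs whose two strings have the same number of characters.
val sameLength = pairs.filter { case (name, capital) => name.length == capital.length }
```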

In [15]:
(world.filter(length($"name") === length($"capital"))
    .select($"name", $"capital")
    .showHTML())

name,capital
Algeria,Algiers
Angola,Luanda
Armenia,Yerevan
Botswana,Gaborone
Canada,Ottowa
Djibouti,Djibouti
Egypt,Cairo
Estonia,Tallinn
Fiji,Suva
Gambia,Banjul


## 12. Matching name and capital

The capital of Sweden is Stockholm. Both words start with the letter 'S'.

**Show the name and the capital where the first letters of each match. Don't include countries where the name and the capital are the same word.**

- You can use the function [LEFT](https://sqlzoo.net/wiki/LEFT) to isolate the first character.
- You can use <> as the **NOT EQUALS** operator.
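
Both conditions, matching first letters and differing words, can be sketched together in plain Scala. The pairs are taken from this notebook; `candidates` and `matching` are illustrative names.

```scala
// (country, capital) pairs mentioned in this notebook.
val candidates = Seq(
  "Sweden"   -> "Stockholm",
  "Djibouti" -> "Djibouti",
  "Greece"   -> "Athens"
)

// Same first letter, but the name and the capital must not be the same word.
val matching = candidates.filter { case (name, capital) =>
  name.take(1) == capital.take(1) && name != capital
}
```

Djibouti is dropped by the second condition even though its first letters match.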

In [16]:
(world.filter(substring($"name", 1, 1) === substring($"capital", 1, 1) &&
              $"name" =!= $"capital")  // exclude countries named the same as their capital
    .select($"name", $"capital")
    .showHTML())

name,capital
Algeria,Algiers
Andorra,Andorra la Vella
Barbados,Bridgetown
Belize,Belmopan
Brazil,Brasília
Brunei,Bandar Seri Begawan
Burundi,Bujumbura
Guatemala,Guatemala City
Guyana,Georgetown


## 13. All the vowels

**Equatorial Guinea** and **Dominican Republic** have all of the vowels (a e i o u) in the name. They don't count because they have more than one word in the name.

**Find the country that has all the vowels and no spaces in its name.**

- You can use the phrase name `NOT LIKE '%a%'` to exclude characters from your results.
- The query shown misses countries like Bahamas and Belarus because they contain at least one 'a'.
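
The rule can be checked on the names mentioned above with a small plain-Scala sketch: a name qualifies when every vowel occurs in it (ignoring case) and it contains no whitespace. The names `hasAllVowels`, `singleWord` and `result` are illustrative.

```scala
// Country names mentioned in this exercise and its output.
val names = Seq("Equatorial Guinea", "Dominican Republic", "Mozambique", "Bahamas", "Belarus")

// True when each of a, e, i, o, u occurs somewhere in the name (case-insensitive).
def hasAllVowels(name: String): Boolean = {
  val lower = name.toLowerCase
  "aeiou".forall(vowel => lower.indexOf(vowel.toInt) >= 0)
}

// True when the name has no whitespace, i.e. it is a single word.
def singleWord(name: String): Boolean = !name.exists(_.isWhitespace)

val result = names.filter(n => hasAllVowels(n) && singleWord(n))
```

Equatorial Guinea and Dominican Republic pass the vowel test but fail the single-word test, leaving only Mozambique.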

In [17]:
(world.filter($"name".rlike("[Aa]") &&
              $"name".rlike("[Ee]") &&
              $"name".rlike("[Ii]") &&
              $"name".rlike("[Oo]") &&
              $"name".rlike("[Uu]") &&
              $"name".rlike("^\\S+$"))
    .select($"name")
    .showHTML())

name
Mozambique


In [18]:
spark.stop()