# SELECT names

## Pattern Matching Strings
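This tutorial uses the **LIKE** operator to match patterns in country names. We will be using the SELECT command on the table `world`, written here with the Spark DataFrame API (`like`, `rlike`, `contains`). The cells below first set up a Spark session and load the table.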

In [1]:
import $ivy.`org.apache.spark::spark-sql:3.4.0`

import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = {
    NotebookSparkSession.builder()
    .progress(false)
    .appName("app01")
    // .master("spark://192.168.31.31:7077")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", 
            "hdfs://192.168.31.31:9000/user/hive/warehouse") 
    .config("spark.cores.max", "4") 
    .config("spark.executor.instances", "1") 
    .config("spark.executor.cores", "2") 
    .config("spark.executor.memory", "10g") 
    .config("spark.shuffle.service.enabled", "false") 
    .config("spark.dynamicAllocation.enabled", "false") 
    .config("spark.sql.catalogImplementation", "hive")
    .config("spark.sql.repl.eagerEval.enabled", "true")
    .config("spark.driver.allowMultipleContexts", "true")
    .getOrCreate()
}

Loading spark-stubs, spark-hive
Adding Hive conf dir /opt/hive/conf to classpath
Creating SparkSession


spark: SparkSession = org.apache.spark.sql.SparkSession@656d66e8

In [2]:
import spark.implicits._
def sc = spark.sparkContext
val hiveCxt = new org.apache.spark.sql.hive.HiveContext(sc)

defined function sc
hiveCxt: sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@1f9fdddf

In [3]:
// Credit to Aivean
implicit class RichDF(val ds:DataFrame) {
    def showHTML(limit: Int = 50, truncate: Int = 100) = {
        import xml.Utility.escape
        val data = ds.take(limit)
        val header = ds.schema.fieldNames.toSeq        
        val rows: Seq[Seq[String]] = data.map { row =>
          row.toSeq.map {cell =>
            val str = cell match {
              case null => "null"
              case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
              case array: Array[_] => array.mkString("[", ", ", "]")
              case seq: Seq[_] => seq.mkString("[", ", ", "]")
              case _ => cell.toString
            }
            if (truncate > 0 && str.length > truncate) {
              // do not show ellipses for strings shorter than 4 characters.
              if (truncate < 4) str.substring(0, truncate)
              else str.substring(0, truncate - 3) + "..."
            } else {
              str
            }
          }: Seq[String]
        }
        publish.html(s""" <table>
                <tr>
                 ${header.map(h => s"<th>${escape(h)}</th>").mkString}
                </tr>
                ${rows.map {row =>
                  s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
                }.mkString}
            </table>
        """)
    }
}

defined class RichDF

In [4]:
val world = hiveCxt.table("sqlzoo.world")

world: DataFrame = [name: string, continent: string ... 6 more fields]

## 1.

You can use `WHERE name LIKE 'B%'` to find the countries that start with "B".

The % is a _wildcard_: it can match any sequence of characters, including none.
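To make the wildcard rule concrete, here is a minimal plain-Scala sketch (no Spark involved) that translates a LIKE pattern into an equivalent regular expression. The helper `likeToRegex` and the sample names are hypothetical and only for illustration; the sketch assumes the pattern uses no escape character.

```scala
import java.util.regex.Pattern

// Hypothetical helper (not part of Spark): translate a SQL LIKE pattern
// into an anchored Java regular expression. Escape characters are not handled.
def likeToRegex(like: String): String =
  like.toList.map {
    case '%' => ".*"                       // % matches any run of characters, including none
    case '_' => "."                        // _ matches exactly one character
    case c   => Pattern.quote(c.toString)  // every other character is matched literally
  }.mkString("^", "", "$")

// Sample names for illustration only; they are not read from the world table.
val sampleNames = Seq("Belgium", "Yemen", "Italy")

val startsWithY = sampleNames.filter(_.matches(likeToRegex("Y%")))
// startsWithY == Seq("Yemen")
```

This is why the cells in this notebook can switch freely between `like("%land")` and an anchored `rlike` pattern: both describe the same set of strings.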

**Find the countries that start with Y**

In [5]:
(world.filter($"name".rlike("^[Yy]"))
 .select($"name").showHTML())

name
Yemen


## 2.

**Find the countries that end with y**

In [6]:
(world.filter($"name".rlike("[Yy]$"))
 .select("name").showHTML())

name
Germany
Hungary
Italy
Norway
Paraguay
Turkey
Uruguay
Vatican City


## 3.

Luxembourg has an **x** - so does one other country. List them both.

**Find the countries that contain the letter x**

In [7]:
(world.filter($"name".contains("x"))
 .select($"name").showHTML())

name
Luxembourg
Mexico


## 4.

Iceland and Switzerland end with **land** - but are there others?

**Find the countries that end with land**

In [8]:
(world.filter($"name".like("%land"))
 .select($"name").showHTML())

name
Finland
Iceland
Ireland
New Zealand
Poland
Switzerland
Thailand


## 5.

Colombia starts with a **C** and ends with **ia** - there are two more like this.

**Find the countries that start with C and end with ia**

In [9]:
(world.filter($"name".rlike("^C.*ia$"))
 .select($"name").showHTML())

name
Cambodia
Colombia
Croatia


## 6.
Greece has a double **e** - who has a double **o**?

**Find the country that has oo in the name**

In [10]:
(world.filter($"name".contains("oo"))
 .select($"name").showHTML())

name
Cameroon


## 7.

Bahamas has three **a** - who else?

**Find the countries that have three or more a in the name**
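The regular expression `(a.*){3,}` used for this exercise requires three or more groups that each begin with a lowercase "a". The plain-Scala sketch below (no Spark involved) checks that this agrees with simply counting lowercase "a" characters. The sample names are chosen for illustration and are not read from the `world` table.

```scala
import java.util.regex.Pattern

val threeAs = Pattern.compile("(a.*){3,}")

// Sample names for illustration only.
val candidates = Seq("Bahamas", "Canada", "Afghanistan", "Cuba")

// find() succeeds if the pattern matches any substring of the name.
val byRegex = candidates.filter(n => threeAs.matcher(n).find())

// Equivalent check: count lowercase 'a' characters directly.
val byCount = candidates.filter(n => n.count(_ == 'a') >= 3)
// byRegex == byCount == Seq("Bahamas", "Canada")
```

Both checks are case-sensitive, so "Afghanistan" (one capital "A" and two lowercase "a") is excluded.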

In [11]:
(world.filter($"name".rlike("(a.*){3,}"))
 .select($"name").showHTML())

name
Antigua and Barbuda
Bahamas
Bosnia and Herzegovina
Canada
Equatorial Guinea
Guatemala
Jamaica
Kazakhstan
Madagascar
Malaysia


## 8.

India and Angola have an **n** as the second character. You can use the underscore as a single character wildcard.

```sql
SELECT name FROM world
 WHERE name LIKE '_n%'
ORDER BY name
```
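The cell for this exercise uses `rlike` with the regex `^.{1}t` rather than `LIKE '_t%'`. Spark's `rlike` succeeds when the regex matches anywhere in the string, so it is the `^` anchor that pins the "t" to the second position. A minimal plain-Scala sketch of that difference, using a hypothetical helper `findMatch` and sample names chosen for illustration:

```scala
import java.util.regex.Pattern

// Sample names for illustration only; they are not read from the world table.
val secondCharSample = Seq("Ethiopia", "Italy", "Austria", "Cuba")

// Hypothetical helper: find-style matching, true if the regex matches any substring.
def findMatch(regex: String)(s: String): Boolean =
  Pattern.compile(regex).matcher(s).find()

val anchored   = secondCharSample.filter(findMatch("^.t"))  // "t" must be the 2nd character
val unanchored = secondCharSample.filter(findMatch(".t"))   // "t" anywhere after the 1st character
// anchored   == Seq("Ethiopia", "Italy")
// unanchored == Seq("Ethiopia", "Italy", "Austria")
```

Without the anchor, "Austria" slips through because "st" matches `.t` in the middle of the name.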

**Find the countries that have "t" as the second character.**

In [12]:
(world.filter($"name".rlike("^.{1}t"))
 .select($"name").showHTML())

name
Ethiopia
Italy


## 9.

Lesotho and Moldova both have two o characters separated by two other characters.

**Find the countries that have two "o" characters separated by two others.**

In [13]:
(world.filter($"name".rlike("o.{2}o"))
 .select($"name").showHTML())

name
Congo, Democratic Republic of
Congo, Republic of
Lesotho
Moldova
Mongolia
Morocco
Sao Tomé and Príncipe


## 10.

Cuba and Togo have four-character names.

**Find the countries that have exactly four characters.**

In [14]:
(world.filter(length($"name") === 4)
 .select($"name").showHTML())

name
Chad
Cuba
Fiji
Iran
Iraq
Laos
Mali
Oman
Peru
Togo


## 11.

The capital of **Luxembourg** is **Luxembourg**. Show all the countries where the capital is the same as the name of the country.

**Find the country where the name is the capital city.**

In [15]:
(world.filter($"name" === $"capital")
 .select($"name").showHTML())

name
Djibouti
Luxembourg
San Marino
Singapore


## 12.

The capital of **Mexico** is **Mexico City**. Show all the countries where the capital has the country together with the word "City".

**Find the country where the capital is the country plus "City".**

> _The concat function_    
> The function concat is short for concatenate - you can use it to combine two or more strings.
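As a rough plain-Scala analogue (no Spark involved), string concatenation with `+` plays the role of `concat`. The pairs below are copied from outputs shown in this notebook; the value names are hypothetical.

```scala
// (country, capital) pairs copied from the outputs in this notebook.
val capitals = Seq(
  "Mexico"     -> "Mexico City",
  "Luxembourg" -> "Luxembourg",
  "Monaco"     -> "Monaco-Ville",
  "Panama"     -> "Panama City"
)

// Keep the pairs whose capital is exactly the country name plus " City".
val plusCity = capitals.filter { case (name, capital) => capital == name + " City" }
// plusCity == Seq("Mexico" -> "Mexico City", "Panama" -> "Panama City")
```

Note the leading space in `" City"`: without it, the comparison would look for "MexicoCity" and match nothing.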

In [16]:
(world.filter($"capital" === concat($"name", lit(" City")))
    .select($"name", $"capital")
    .showHTML())

| name | capital |
|---|---|
| Guatemala | Guatemala City |
| Kuwait | Kuwait City |
| Mexico | Mexico City |
| Panama | Panama City |


## 13.

**Find the capital and the name where the capital includes the name of the country.**

In [17]:
(world.filter($"capital".contains($"name"))
    .select($"name", $"capital").showHTML())

| name | capital |
|---|---|
| Andorra | Andorra la Vella |
| Djibouti | Djibouti |
| Guatemala | Guatemala City |
| Kuwait | Kuwait City |
| Luxembourg | Luxembourg |
| Mexico | Mexico City |
| Monaco | Monaco-Ville |
| Panama | Panama City |
| San Marino | San Marino |
| Singapore | Singapore |


## 14.

**Find the capital and the name where the capital is an extension of the name of the country.**

You _should_ include **Mexico City** as it is longer than **Mexico**. You _should not_ include **Luxembourg** as the capital is the same as the country.

In [18]:
(world.filter(($"capital".contains($"name")) &&
              ($"capital" =!= $"name"))  // =!= replaces the deprecated !== operator
    .select($"capital", $"name").showHTML())

| capital | name |
|---|---|
| Andorra la Vella | Andorra |
| Guatemala City | Guatemala |
| Kuwait City | Kuwait |
| Mexico City | Mexico |
| Monaco-Ville | Monaco |
| Panama City | Panama |


## 15.

For **Monaco-Ville** the name is **Monaco** and the extension is **-Ville**.

**Show the name and the extension where the capital is an extension of the name of the country.**

You can use the SQL function [REPLACE](https://sqlzoo.net/wiki/REPLACE).
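The idea can be sketched in plain Scala (no Spark involved): keep the pairs where the capital strictly extends the country name, then remove the name to leave the extension. The helper `extension` is hypothetical, and the pairs are copied from outputs shown in this notebook. Unlike a regex-based replace, `String.replace` treats the name as literal text.

```scala
// (country, capital) pairs copied from the outputs in this notebook.
val countryCapitals = Seq(
  "Andorra"    -> "Andorra la Vella",
  "Luxembourg" -> "Luxembourg",
  "Mexico"     -> "Mexico City",
  "Monaco"     -> "Monaco-Ville"
)

// Hypothetical helper: Some(extension) when the capital contains the name
// and is longer than it, otherwise None.
def extension(name: String, capital: String): Option[String] =
  if (capital.contains(name) && capital != name) Some(capital.replace(name, ""))
  else None

val exts = countryCapitals.flatMap { case (n, c) => extension(n, c).map(ext => (n, ext)) }
// exts == Seq("Andorra" -> " la Vella", "Mexico" -> " City", "Monaco" -> "-Ville")
```

The extension keeps whatever separates it from the name, so "Mexico City" yields " City" with a leading space, while "Monaco-Ville" yields "-Ville".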

In [19]:
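(world.filter(($"capital".contains($"name")) &&
              ($"capital" =!= $"name"))  // =!= replaces the deprecated !== operator
    // regexp_replace treats the country name as a regular expression pattern
    .withColumn("ext", regexp_replace($"capital", $"name", lit("")))
    .select($"name", $"ext")
    .showHTML()
)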

| name | ext |
|---|---|
| Andorra | la Vella |
| Guatemala | City |
| Kuwait | City |
| Mexico | City |
| Monaco | -Ville |
| Panama | City |


In [20]:
spark.stop()