# SELECT names

## Pattern Matching Strings
This tutorial uses the **LIKE** operator to match patterns in country names. All queries run against the table `world`, loaded here through PySpark:

In [1]:
import findspark
import pandas as pd
findspark.init()

SVR = '192.168.31.31'
from pyspark.sql import SparkSession

sc = (SparkSession.builder.appName('app01') 
      .master(f'spark://{SVR}:7077') 
      .config('spark.sql.warehouse.dir', f'hdfs://{SVR}:9000/user/hive/warehouse') 
      .config('spark.cores.max', '4') 
      .config('spark.executor.instances', '1') 
      .config('spark.executor.cores', '2') 
      .config('spark.executor.memory', '10g') 
      .enableHiveSupport().getOrCreate())


In [2]:
world = sc.read.table('sqlzoo.world')

## 1.

You can use `WHERE name LIKE 'B%'` to find the countries that start with "B".

The `%` is a _wildcard_: it matches any sequence of characters, including none.
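
As a rough illustration of how such patterns behave, the sketch below translates a LIKE pattern into a Python regular expression and applies it to a small hand-written list of names. The helpers `like_to_regex` and `like` and the list `sample_names` are hypothetical and not part of the notebook. The sketch is case-sensitive and ignores escape characters, which real SQL engines handle in their own ways.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an equivalent regex (hypothetical helper)."""
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')           # % matches any run of characters, including none
        elif ch == '_':
            parts.append('.')            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is taken literally
    return ''.join(parts)

def like(value: str, pattern: str) -> bool:
    """Return True when the whole value matches the LIKE pattern (case-sensitive)."""
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None

# Illustrative data only, not read from sqlzoo.world
sample_names = ['Belgium', 'Brazil', 'Yemen', 'Italy']

starts_with_b = [n for n in sample_names if like(n, 'B%')]
starts_with_y = [n for n in sample_names if like(n, 'Y%')]
print(starts_with_b)
print(starts_with_y)
```

In PySpark the same kind of filter can also be written with `Column.like`, for example `world['name'].like('Y%')`, instead of `rlike`.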

**Find the country that starts with Y**

In [3]:
world.filter(world['name'].rlike('^[Yy]')).select('name').toPandas()

Unnamed: 0,name
0,Yemen


## 2.

**Find the countries that end with y**

In [4]:
world.filter(world['name'].rlike('[Yy]$')).select('name').toPandas()

Unnamed: 0,name
0,Germany
1,Hungary
2,Italy
3,Norway
4,Paraguay
5,Turkey
6,Uruguay
7,Vatican City


## 3.

Luxembourg has an **x** - so does one other country. List them both.

**Find the countries that contain the letter x**

In [5]:
world.filter(world['name'].contains('x')).select('name').toPandas()

Unnamed: 0,name
0,Luxembourg
1,Mexico


## 4.

Iceland and Switzerland end with **land** - but are there others?

**Find the countries that end with land**

In [6]:
world.filter(world['name'].rlike('land$')).select('name').toPandas()

Unnamed: 0,name
0,Finland
1,Iceland
2,Ireland
3,New Zealand
4,Poland
5,Switzerland
6,Thailand


## 5.

Colombia starts with a **C** and ends with **ia** - there are two more like this.

**Find the countries that start with C and end with ia**

In [7]:
world.filter(world['name'].rlike(r'^C.*ia$')).select('name').toPandas()

Unnamed: 0,name
0,Cambodia
1,Colombia
2,Croatia


## 6.
Greece has a double **e** - who has a double **o**?

**Find the country that has oo in the name**

In [8]:
world.filter(world['name'].contains('oo')).select('name').toPandas()

Unnamed: 0,name
0,Cameroon


## 7.

Bahamas has three **a** - who else?

**Find the countries that have three or more a in the name**

In [9]:
world.filter(world['name'].rlike(r'(a.*){3,}')).select('name').toPandas()

Unnamed: 0,name
0,Antigua and Barbuda
1,Bahamas
2,Bosnia and Herzegovina
3,Canada
4,Equatorial Guinea
5,Guatemala
6,Jamaica
7,Kazakhstan
8,Madagascar
9,Malaysia
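
The regular expression `(a.*){3,}` used above only counts lowercase `a`, so a leading capital `A` does not contribute to the total. A minimal plain-Python sketch of the same check, run on a few illustrative names rather than the real table, shows the regex agreeing with a simple character count. The helper names `has_three_a_regex` and `has_three_a_count` are hypothetical.

```python
import re

def has_three_a_regex(name: str) -> bool:
    # Same pattern as the rlike call: an 'a' followed by anything, three or more times
    return re.search(r'(a.*){3,}', name) is not None

def has_three_a_count(name: str) -> bool:
    # Equivalent formulation: count the lowercase 'a' characters directly
    return name.count('a') >= 3

# Illustrative data only, not read from sqlzoo.world
samples = ['Bahamas', 'Canada', 'Albania', 'Angola', 'Madagascar']

# Both formulations should agree on every sample
for name in samples:
    assert has_three_a_regex(name) == has_three_a_count(name)

matches = [n for n in samples if has_three_a_regex(n)]
print(matches)
```

Here `Albania` is rejected because only two of its three letters are lowercase. If capital letters should count as well, prefixing the pattern with the inline flag `(?i)` is one option that should work with `rlike`, since it relies on Java regular expressions.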


## 8.

India and Angola have an **n** as the second character. You can use the underscore as a single character wildcard.

```sql
SELECT name FROM world
 WHERE name LIKE '_n%'
ORDER BY name
```
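
To make the underscore rule concrete, here is a small plain-Python sketch on made-up sample data. The pattern `_n%` requires exactly one character, then `n`, then anything, which corresponds to a regex that starts with `.n`. The helper `second_char_is` is a hypothetical name used only for this illustration.

```python
import re

def second_char_is(name: str, letter: str) -> bool:
    # re.match anchors at the start; '.' consumes exactly one character
    # (the LIKE underscore), and the given letter must follow immediately
    return re.match('.' + re.escape(letter), name) is not None

# Illustrative data only, not read from sqlzoo.world
samples = ['India', 'Angola', 'Norway', 'Italy', 'Ethiopia']

n_second = sorted(n for n in samples if second_char_is(n, 'n'))
t_second = sorted(n for n in samples if second_char_is(n, 't'))
print(n_second)
print(t_second)
```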

**Find the countries that have "t" as the second character.**

In [10]:
world.filter(world['name'].rlike(r'^.t')).select('name').toPandas()

Unnamed: 0,name
0,Ethiopia
1,Italy


## 9.

Lesotho and Moldova both have two o characters separated by two other characters.

**Find the countries that have two "o" characters separated by two others.**

In [11]:
world.filter(world['name'].rlike('o.{2}o')).select('name').toPandas()

Unnamed: 0,name
0,"Congo, Democratic Republic of"
1,"Congo, Republic of"
2,Lesotho
3,Moldova
4,Mongolia
5,Morocco
6,Sao Tomé and Príncipe


## 10.

Cuba and Togo have four-character names.

**Find the countries that have exactly four characters.**

In [12]:
from pyspark.sql.functions import length
world.filter(length('name')==4).select('name').toPandas()

Unnamed: 0,name
0,Chad
1,Cuba
2,Fiji
3,Iran
4,Iraq
5,Laos
6,Mali
7,Oman
8,Peru
9,Togo


## 11.

The capital of **Luxembourg** is **Luxembourg**. Show all the countries where the capital is the same as the name of the country.

**Find the country where the name is the capital city.**

In [13]:
world.filter(world['name']==world['capital']).select('name').toPandas()

Unnamed: 0,name
0,Djibouti
1,Luxembourg
2,San Marino
3,Singapore


## 12.

The capital of **Mexico** is **Mexico City**. Show all the countries where the capital has the country together with the word "City".

**Find the country where the capital is the country plus "City".**

> _The concat function_    
> The function concat is short for concatenate - you can use it to combine two or more strings.
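
As a minimal sketch of the idea, the plain-Python snippet below mimics `capital = CONCAT(name, ' City')` on a few invented `(name, capital)` pairs. The `rows` list is illustrative only and is not read from the table.

```python
# Illustrative (name, capital) pairs, not read from sqlzoo.world
rows = [
    ('Mexico', 'Mexico City'),
    ('Luxembourg', 'Luxembourg'),
    ('France', 'Paris'),
]

# Plain-Python analogue of: capital = CONCAT(name, ' City')
city_capitals = [name for name, capital in rows if capital == name + ' City']
print(city_capitals)
```

In the PySpark version, the literal `' City'` is wrapped in `lit` because `concat` treats a bare string argument as a column name rather than as a constant.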

In [14]:
from pyspark.sql.functions import concat, lit
(world.filter(world['capital']==concat(world['name'], lit(' City')))
    .select('name', 'capital')
    .toPandas())

Unnamed: 0,name,capital
0,Guatemala,Guatemala City
1,Kuwait,Kuwait City
2,Mexico,Mexico City
3,Panama,Panama City


## 13.

**Find the capital and the name where the capital includes the name of the country.**

In [15]:
# SQL is much easier in this case
sc.sql("""
    SELECT capital, name FROM sqlzoo.world 
    WHERE capital LIKE CONCAT('%', name, '%')
""").toPandas()

Unnamed: 0,capital,name
0,Andorra la Vella,Andorra
1,Djibouti,Djibouti
2,Guatemala City,Guatemala
3,Kuwait City,Kuwait
4,Luxembourg,Luxembourg
5,Mexico City,Mexico
6,Monaco-Ville,Monaco
7,Panama City,Panama
8,San Marino,San Marino
9,Singapore,Singapore


In [16]:
(world.filter(world['capital'].contains(world['name']))
    .select('name', 'capital').toPandas())

Unnamed: 0,name,capital
0,Andorra,Andorra la Vella
1,Djibouti,Djibouti
2,Guatemala,Guatemala City
3,Kuwait,Kuwait City
4,Luxembourg,Luxembourg
5,Mexico,Mexico City
6,Monaco,Monaco-Ville
7,Panama,Panama City
8,San Marino,San Marino
9,Singapore,Singapore


## 14.

**Find the capital and the name where the capital is an extension of the name of the country.**

You _should_ include **Mexico City** as it is longer than **Mexico**. You _should not_ include **Luxembourg** as the capital is the same as the country.

In [17]:
# SQL is much easier in this case
sc.sql("""
    SELECT capital, name FROM sqlzoo.world 
    WHERE capital LIKE CONCAT('%', name, '%') AND capital <> name
""").toPandas()

Unnamed: 0,capital,name
0,Andorra la Vella,Andorra
1,Guatemala City,Guatemala
2,Kuwait City,Kuwait
3,Mexico City,Mexico
4,Monaco-Ville,Monaco
5,Panama City,Panama


In [18]:
(world.filter((world['capital'].contains(world['name']))
              & (world['capital'] != world['name']))
    .select('capital', 'name').toPandas())

Unnamed: 0,capital,name
0,Andorra la Vella,Andorra
1,Guatemala City,Guatemala
2,Kuwait City,Kuwait
3,Mexico City,Mexico
4,Monaco-Ville,Monaco
5,Panama City,Panama


## 15.

For **Monaco-Ville** the name is **Monaco** and the extension is **-Ville**.

**Show the name and the extension where the capital is an extension of the name of the country.**

You can use the SQL function [REPLACE](https://sqlzoo.net/wiki/REPLACE).
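
One caveat worth keeping in mind: `REPLACE` removes every occurrence of the name, not only the leading one. For a capital such as Monaco-Ville this makes no difference, but the plain-Python sketch below shows where replacing and slicing off the prefix diverge. The helper names `extension_by_replace` and `extension_by_prefix` are hypothetical, and the pair `'Foo'` / `'Foo-upon-Foo'` is invented purely for illustration.

```python
def extension_by_replace(name: str, capital: str) -> str:
    # Mirrors REPLACE(capital, name, ''): every occurrence of name is removed
    return capital.replace(name, '')

def extension_by_prefix(name: str, capital: str) -> str:
    # Removes only the leading name; assumes capital starts with name
    return capital[len(name):]

# Name occurs once: both approaches agree
print(extension_by_replace('Monaco', 'Monaco-Ville'))
print(extension_by_prefix('Monaco', 'Monaco-Ville'))

# Invented example where the name occurs twice: the results differ
print(extension_by_replace('Foo', 'Foo-upon-Foo'))
print(extension_by_prefix('Foo', 'Foo-upon-Foo'))
```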

In [19]:
# SQL is much easier in this case
sc.sql("""
    SELECT name, REPLACE(capital, name, '') AS ext FROM sqlzoo.world 
    WHERE capital LIKE CONCAT(name, '%') AND capital <> name
""").toPandas()

Unnamed: 0,name,ext
0,Andorra,la Vella
1,Guatemala,City
2,Kuwait,City
3,Mexico,City
4,Monaco,-Ville
5,Panama,City


In [20]:
# Strip the country name from the capital on the Python side via the RDD API
(world.filter((world['capital'].contains(world['name']))
              & (world['capital'] != world['name']))
    .rdd.map(lambda x: (x.name, x.capital.replace(x.name, '')))
    .toDF(['name', 'ext'])
    .toPandas()
)

Unnamed: 0,name,ext
0,Andorra,la Vella
1,Guatemala,City
2,Kuwait,City
3,Mexico,City
4,Monaco,-Ville
5,Panama,City


In [21]:
sc.stop()