# SELECT names

## Pattern Matching Strings
This tutorial uses the **LIKE** operator to check names. We will be using the SELECT command on the table world:

In [1]:
import getpass
import psycopg2
from sqlalchemy import create_engine
import pandas as pd
pwd = getpass.getpass()
engine = create_engine(
    'postgresql+psycopg2://postgres:%s@localhost/sqlzoo' % (pwd))
pd.set_option('display.max_rows', 20)



In [2]:
world = pd.read_sql_table('world', engine)

## 1.

You can use `WHERE name LIKE 'B%'` to find the countries that start with "B".

The `%` is a _wildcard_: it matches any sequence of characters, including none.
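For readers more at home in Python than SQL, the three most common `%` patterns correspond roughly to plain string methods. This is only a sketch: the `names` list is sample data chosen for illustration and is not read from the `world` table.

```python
# Sample data for illustration only (not read from the world table).
names = ["Belgium", "Brazil", "Italy", "Mexico", "Yemen"]

# LIKE 'B%'  -> starts with "B"
starts_with_b = [n for n in names if n.startswith("B")]

# LIKE '%y'  -> ends with "y"
ends_with_y = [n for n in names if n.endswith("y")]

# LIKE '%x%' -> contains "x" anywhere
contains_x = [n for n in names if "x" in n]

print(starts_with_b)  # ['Belgium', 'Brazil']
print(ends_with_y)    # ['Italy']
print(contains_x)     # ['Mexico']
```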

**Find the country that starts with Y**

In [3]:
world.loc[world['name'].str.contains('^[Yy]'),
         ['name']]

Unnamed: 0,name
192,Yemen


## 2.

**Find the countries that end with y**

In [4]:
world.loc[world['name'].str.contains('[Yy]$'),
         ['name']]

Unnamed: 0,name
63,Germany
73,Hungary
81,Italy
127,Norway
133,Paraguay
178,Turkey
186,Uruguay
189,Vatican City


## 3.

Luxembourg has an **x** - so does one other country. List them both.

**Find the countries that contain the letter x**

In [5]:
world.loc[world['name'].str.contains('x'),
         ['name']]

Unnamed: 0,name
98,Luxembourg
109,Mexico


## 4.

Iceland and Switzerland end with **land** - but are there others?

**Find the countries that end with land**

In [6]:
world.loc[world['name'].str.contains('land$'),
         ['name']]

Unnamed: 0,name
58,Finland
74,Iceland
79,Ireland
122,New Zealand
136,Poland
165,Swaziland
167,Switzerland
172,Thailand


## 5.

Colombia starts with a **C** and ends with **ia** - there are two more like this.

**Find the countries that start with C and end with ia**

In [7]:
world.loc[world['name'].str.contains('^C.*ia$'),
         ['name']]

Unnamed: 0,name
28,Cambodia
36,Colombia
42,Croatia


## 6.
Greece has a double **e** - who has a double **o**?

**Find the country that has oo in the name**

In [8]:
world.loc[world['name'].str.contains('oo'),
         ['name']]

Unnamed: 0,name
29,Cameroon


## 7.

Bahamas has three **a** - who else?

**Find the countries that have three or more a in the name**

In [9]:
# Count the lowercase "a" characters in each name and keep names with three or more
world.loc[world['name'].str.count('a') >= 3,
         ['name']]

Unnamed: 0,name
5,Antigua and Barbuda
11,Bahamas
21,Bosnia and Herzegovina
30,Canada
53,Equatorial Guinea
67,Guatemala
82,Jamaica
85,Kazakhstan
100,Madagascar
102,Malaysia


## 8.

India and Angola have an **n** as the second character. You can use the underscore as a single character wildcard.

```sql
SELECT name FROM world
 WHERE name LIKE '_n%'
ORDER BY name
```
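To make the two wildcards concrete, here is a small sketch that translates a LIKE pattern into a Python regular expression. The helper `like_to_regex` is hypothetical (it is not part of SQL, pandas, or the original notebook), it handles only `%` and `_`, and it ignores LIKE escape clauses and any database-specific case rules.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into a compiled regex (hypothetical helper).

    Only `%` and `_` are handled; LIKE escape clauses are ignored.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any run of characters, including none
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.DOTALL)

# Sample names for illustration only.
names = ["India", "Angola", "Italy", "Peru"]
second_is_n = [n for n in names if like_to_regex("_n%").fullmatch(n)]
print(second_is_n)  # ['India', 'Angola']
```

`fullmatch` is used because a LIKE pattern must match the whole value, not just a prefix.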

**Find the countries that have "t" as the second character.**

In [10]:
world.loc[world['name'].str.contains(r'^.t'),
         ['name']]

Unnamed: 0,name
56,Ethiopia
81,Italy


## 9.

Lesotho and Moldova both have two o characters separated by two other characters.

**Find the countries that have two "o" characters separated by two others.**

In [11]:
world.loc[world['name'].str.contains('o.{2}o'),
         ['name']]

Unnamed: 0,name
38,"Congo, Democratic Republic of"
39,"Congo, Republic of"
93,Lesotho
111,Moldova
113,Mongolia
115,Morocco
147,Sao Tomé and Príncipe


## 10.

Cuba and Togo have four-character names.

**Find the countries that have exactly four characters.**

In [12]:
world.loc[world['name'].str.len()==4,
         ['name']]

Unnamed: 0,name
33,Chad
43,Cuba
57,Fiji
77,Iran
78,Iraq
90,Laos
104,Mali
128,Oman
134,Peru
174,Togo


## 11.

The capital of **Luxembourg** is **Luxembourg**. Show all the countries where the capital is the same as the name of the country.

**Find the country where the name is the capital city.**

In [13]:
world.loc[world['name']==world['capital'],
         ['name']]

Unnamed: 0,name
47,Djibouti
98,Luxembourg
146,San Marino
153,Singapore


## 12.

The capital of **Mexico** is **Mexico City**. Show all the countries where the capital has the country together with the word "City".

**Find the country where the capital is the country plus "City".**

> _The concat function_    
> The function concat is short for concatenate - you can use it to combine two or more strings.
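In Python the `+` operator plays the role of `concat` for strings. The sketch below mirrors a condition such as `capital = concat(name, ' City')`; the `rows` list is a small hand-written sample, not the `world` table.

```python
# Hand-written (name, capital) pairs for illustration only.
rows = [("Mexico", "Mexico City"), ("France", "Paris"), ("Kuwait", "Kuwait City")]

# Keep the countries whose capital equals the name followed by " City".
matches = [name for name, capital in rows if capital == name + " City"]
print(matches)  # ['Mexico', 'Kuwait']
```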

In [14]:
world.loc[world['capital']==world['name']+' City',
         ['name']]

Unnamed: 0,name
67,Guatemala
88,Kuwait
109,Mexico
131,Panama


## 13.

**Find the capital and the name where the capital includes the name of the country.**

In [15]:
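import re
import numpy as np
# re.escape treats the country name as literal text rather than as a regex;
# re.match anchors at the start, so this keeps capitals that begin with the name.
world.loc[world.apply(
    lambda row: bool(re.match(re.escape(row['name']), row['capital']))
        if row['capital'] and row['name'] else False,
    axis=1), ['capital', 'name']]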

Unnamed: 0,capital,name
3,Andorra la Vella,Andorra
47,Djibouti,Djibouti
67,Guatemala City,Guatemala
88,Kuwait City,Kuwait
98,Luxembourg,Luxembourg
109,Mexico City,Mexico
112,Monaco-Ville,Monaco
131,Panama City,Panama
146,San Marino,San Marino
153,Singapore,Singapore


## 14.

**Find the capital and the name where the capital is an extension of the name of the country.**

You _should_ include **Mexico City** as it is longer than **Mexico**. You _should not_ include **Luxembourg** as the capital is the same as the country.
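The rule above can be stated without regular expressions: the capital must start with the country name and be strictly longer than it. A minimal sketch, where `is_extension` is a hypothetical helper name introduced only for this illustration:

```python
def is_extension(capital: str, name: str) -> bool:
    """True when `capital` begins with `name` and has extra characters after it."""
    return capital.startswith(name) and len(capital) > len(name)

print(is_extension("Mexico City", "Mexico"))      # True
print(is_extension("Luxembourg", "Luxembourg"))   # False
```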

In [16]:
# Element-wise re.match: test each string against its own pattern
from typing import List
def str_detect(string: pd.Series, pattern: pd.Series) -> List[bool]:
    if len(string) != len(pattern):
        raise ValueError('string and pattern must have the same length')
    # Missing values (None/NaN) never match
    return [bool(re.match(y, x))
            if isinstance(x, str) and isinstance(y, str) else False
            for x, y in zip(string, pattern)]

world.loc[str_detect(
    world['capital'], world['name']+'.+'), ['capital', 'name']]

Unnamed: 0,capital,name
3,Andorra la Vella,Andorra
67,Guatemala City,Guatemala
88,Kuwait City,Kuwait
109,Mexico City,Mexico
112,Monaco-Ville,Monaco
131,Panama City,Panama


## 15.

For **Monaco-Ville** the name is **Monaco** and the extension is **-Ville**.

**Show the name and the extension where the capital is an extension of the name of the country.**

You can use the SQL function [REPLACE](https://sqlzoo.net/wiki/REPLACE).
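As a rough Python analogue of removing the country name from the capital, the sketch below slices off the name as a prefix. The helper `extension` is hypothetical and exists only for this illustration. Note that `str.replace` removes every occurrence of the name, whereas slicing removes only the leading one.

```python
from typing import Optional

def extension(capital: str, name: str) -> Optional[str]:
    """Return the part of `capital` after `name`, or None if nothing follows it."""
    if capital.startswith(name) and len(capital) > len(name):
        return capital[len(name):]
    return None

print(extension("Monaco-Ville", "Monaco"))     # -Ville
print(extension("Luxembourg", "Luxembourg"))   # None
print("Monaco-Ville".replace("Monaco", ""))    # -Ville
```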

In [17]:
# Element-wise re.search: extract each string's match for its own pattern
def str_extract(string: pd.Series, pattern: pd.Series) -> List[str]:
    if len(string) != len(pattern):
        raise ValueError('string and pattern must have the same length')
    # Missing values (None/NaN) produce no match
    o = [re.search(y, x)
         if isinstance(x, str) and isinstance(y, str) else None
         for x, y in zip(string, pattern)]
    return [x.group() if x else np.nan for x in o]
# The lookbehind keeps only the text that follows the country name
(world.assign(ext=str_extract(world['capital'], '(?<=^'+ world['name']+')(.+)$'))
    .dropna(subset=['ext'])
    .loc[:, ['name', 'ext']]
)

Unnamed: 0,name,ext
3,Andorra,la Vella
67,Guatemala,City
88,Kuwait,City
109,Mexico,City
112,Monaco,-Ville
131,Panama,City
