# SELECT from WORLD

In [1]:
import os
import pandas as pd
import findspark
os.environ['SPARK_HOME'] =  '/opt/spark'
findspark.init()

from pyspark.sql import SparkSession
ss = (SparkSession.builder.appName('app00')
      .config('spark.sql.warehouse.dir', 'hdfs://quickstart.cloudera:8020/user/hive/warehouse')
      .config('hive.metastore.uris', 'thrift://quickstart.cloudera:9083')
      .enableHiveSupport().getOrCreate())

 ····


In [2]:
world = pd.read_sql_table('world', engine)

## 1. Introduction

[Read the notes about this table](https://sqlzoo.net/wiki/Read_the_notes_about_this_table.). Observe the result of running this SQL command to show the name, continent and population of all countries.

In [3]:
world.loc[:, ['name', 'continent', 'population']]

Unnamed: 0,name,continent,population
0,Afghanistan,Asia,25500100.0
1,Albania,Europe,2821977.0
2,Algeria,Africa,38700000.0
3,Andorra,Europe,76098.0
4,Angola,Africa,19183590.0
...,...,...,...
190,Venezuela,South America,28946101.0
191,Vietnam,Asia,89708900.0
192,Yemen,Asia,25235000.0
193,Zambia,Africa,15023315.0


## 2. Large Countries

[How to use WHERE to filter records](https://sqlzoo.net/wiki/WHERE_filters). Show the name for the countries that have a population of at least 200 million. 200 million is 200000000 (eight zeros).

In [4]:
world.loc[world['population']>=2e8, ['name']]

Unnamed: 0,name
23,Brazil
35,China
75,India
76,Indonesia
185,United States

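For comparison with the pandas mask above, here is a minimal sketch of the same filter written as SQL. It assumes a throwaway in-memory SQLite table named `world` with only two columns, which is a stand-in and not the database used in this notebook; the populations are copied from outputs shown in this notebook.

```python
import sqlite3

# Populations copied from the outputs shown in this notebook.
rows = [
    ("Brazil", 202794000),
    ("China", 1365370000),
    ("France", 65906000),
    ("Germany", 80716000),
    ("India", 1246160000),
    ("Indonesia", 252164800),
    ("United States", 318320000),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.execute("CREATE TABLE world (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO world VALUES (?, ?)", rows)

# WHERE keeps only the rows for which the condition is true.
large = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM world WHERE population >= 200000000 ORDER BY name"
    )
]
conn.close()
print(large)
```

The boolean mask `world['population']>=2e8` plays the role of the `WHERE` clause, and the column list passed to `.loc` plays the role of the `SELECT` list.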

## 3. Per capita GDP

Give the `name` and the **per capita GDP** for those countries with a `population` of at least 200 million.

> _HELP: How to calculate per capita GDP_   
> Per capita GDP is the GDP divided by the population: GDP/population.

In [5]:
(world.assign(pcgdp=world['gdp']/world['population'])
      .loc[world['population']>=2e8, ['name', 'pcgdp']])

Unnamed: 0,name,pcgdp
23,Brazil,11115.264751
35,China,6121.710599
75,India,1504.793124
76,Indonesia,3482.020488
185,United States,51032.294546

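The cell above filters the assigned frame with a mask built from the outer `world`, which works because pandas aligns on the index. A possible variant, sketched here on made-up rows (the country names and figures are hypothetical), passes callables to `assign` and `.loc` so that every step refers to the frame it is applied to.

```python
import pandas as pd

# Toy rows; names, GDP and population figures are made up for illustration.
toy = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C"],
    "population": [250_000_000, 40_000_000, 400_000_000],
    "gdp": [500_000_000_000, 1_200_000_000_000, 2_000_000_000_000],
})

result = (
    # per capita GDP = GDP / population, computed on the frame being built
    toy.assign(pcgdp=lambda d: d["gdp"] / d["population"])
       # keep only rows with at least 200 million people
       .loc[lambda d: d["population"] >= 2e8, ["name", "pcgdp"]]
)
print(result)
```

With callables the chain no longer depends on the name `world`, so it can be reused on any frame that has the same columns.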

## 4. South America In millions

Show the `name` and `population` in millions for the countries of the `continent` 'South America'. Divide the population by 1000000 to get population in millions.

In [6]:
(world.assign(popl=world['population']/1e6)
      .loc[world['continent']=='South America', ['name', 'popl']])

Unnamed: 0,name,popl
6,Argentina,42.6695
20,Bolivia,10.027254
23,Brazil,202.794
34,Chile,17.773
36,Colombia,47.662
50,Ecuador,15.7742
70,Guyana,0.784894
133,Paraguay,6.783374
134,Peru,30.475144
144,Saint Vincent and the Grenadines,0.109


## 5. France, Germany, Italy

Show the `name` and `population` for France, Germany and Italy.

In [7]:
world.loc[world['name'].isin(['France', 'Germany', 'Italy']), ['name', 'population']]

Unnamed: 0,name,population
59,France,65906000.0
63,Germany,80716000.0
81,Italy,60782668.0


## 6. United

Show the countries which have a `name` that includes the word 'United'.

In [8]:
world.loc[world['name'].str.contains('United'), ['name']]

Unnamed: 0,name
183,United Arab Emirates
184,United Kingdom
185,United States

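`str.contains` treats its pattern as a regular expression by default and is case sensitive. The sketch below, on a made-up list of names (the entry 'Reunited Isles' and the missing value are hypothetical), shows two optional arguments that may be worth setting: `regex=False` for a plain substring match and `na=False` so that missing names count as non-matches.

```python
import pandas as pd

# Hypothetical sample, including a missing name and a lowercase 'united'.
names = pd.Series(
    ["United Kingdom", "Tunisia", "United States", None, "Reunited Isles"]
)

# regex=False: match the literal text; na=False: missing values become False.
mask = names.str.contains("United", regex=False, na=False)
matches = names[mask].tolist()
print(matches)
```

Without `na=False`, a missing name would leave a missing value in the mask, and boolean indexing with such a mask can raise an error.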

## 7. Two ways to be big

Two ways to be big: A country is **big** if it has an area of more than 3 million sq km or it has a population of more than 250 million.

**Show the countries that are big by area or big by population. Show name, population and area.**

In [9]:
world.loc[(world['area']>3e6) | (world['population']>2.5e8),
          ['name', 'population', 'area']]

Unnamed: 0,name,population,area
8,Australia,23545500.0,7692024.0
23,Brazil,202794000.0,8515767.0
30,Canada,35427520.0,9984670.0
35,China,1365370000.0,9596961.0
75,India,1246160000.0,3166414.0
76,Indonesia,252164800.0,1904569.0
140,Russia,146000000.0,17125242.0
185,United States,318320000.0,9826675.0


## 8. One or the other (but not both)

**Exclusive OR (XOR). Show the countries that are big by area (more than 3 million) or big by population (more than 250 million) but not both. Show name, population and area.**

- Australia has a big area but a small population, it should be **included**.
- Indonesia has a big population but a small area, it should be **included**.
- China has a big population **and** big area, it should be **excluded**.
- United Kingdom has a small population and a small area, it should be **excluded**.

In [10]:
world.loc[(world['area']>3e6)!=(world['population']>2.5e8),
          ['name', 'population', 'area']]

Unnamed: 0,name,population,area
8,Australia,23545500.0,7692024.0
23,Brazil,202794000.0,8515767.0
30,Canada,35427524.0,9984670.0
76,Indonesia,252164800.0,1904569.0
140,Russia,146000000.0,17125242.0

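The `!=` comparison in the cell above acts as exclusive OR because both sides are booleans. A small plain-Python check of the four example countries illustrates this; the Australia, Indonesia and China figures are taken from the outputs above, while the United Kingdom figures are rough, assumed values.

```python
# (name, population, area in sq km). The United Kingdom row uses rough,
# assumed figures; the other rows reuse values from the outputs above.
countries = [
    ("Australia", 23_545_500, 7_692_024),
    ("Indonesia", 252_164_800, 1_904_569),
    ("China", 1_365_370_000, 9_596_961),
    ("United Kingdom", 64_000_000, 242_000),
]

included = []
for name, population, area in countries:
    big_area = area > 3_000_000
    big_pop = population > 250_000_000
    # For two booleans, != and ^ both mean "exactly one is true".
    assert (big_area != big_pop) == (big_area ^ big_pop)
    if big_area != big_pop:
        included.append(name)

print(included)  # ['Australia', 'Indonesia']
```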

## 9. Rounding

Show the `name` and `population` in millions and the GDP in billions for the countries of the `continent` 'South America'. Use the [ROUND](https://sqlzoo.net/wiki/ROUND) function to show the values to two decimal places.

**For South America show population in millions and GDP in billions both to 2 decimal places.**

> _Millions and billions_    
> Divide by 1000000 (6 zeros) for millions. Divide by 1000000000 (9 zeros) for billions.

In [11]:
(world.loc[world['continent']=='South America', ['name', 'population', 'gdp']]
      .assign(popl=lambda d: (d['population']/1e6).round(2),
              gdp_=lambda d: (d['gdp']/1e9).round(2))
      .loc[:, ['name', 'popl', 'gdp_']]
)

Unnamed: 0,name,popl,gdp_
6,Argentina,42.67,477.03
20,Bolivia,10.03,27.04
23,Brazil,202.79,2254.11
34,Chile,17.77,268.31
36,Colombia,47.66,369.81
50,Ecuador,15.77,87.5
70,Guyana,0.78,2.85
133,Paraguay,6.78,25.94
134,Peru,30.48,204.68
144,Saint Vincent and the Grenadines,0.11,0.69

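As a worked example of the scaling and rounding, the sketch below uses Argentina's population (42.6695 million, as shown in section 4) and a hypothetical raw GDP figure chosen only for illustration.

```python
population = 42_669_500   # Argentina: 42.6695 million in the section 4 output
gdp = 477_028_000_000     # hypothetical raw GDP, chosen for illustration

pop_millions = round(population / 1_000_000, 2)   # divide by 6 zeros
gdp_billions = round(gdp / 1_000_000_000, 2)      # divide by 9 zeros
print(pop_millions, gdp_billions)  # 42.67 477.03
```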

## 10. Trillion dollar economies

Show the `name` and per-capita GDP for those countries with a GDP of at least one trillion (1000000000000; that is 12 zeros). Round this value to the nearest 1000.

**Show per-capita GDP for the trillion dollar countries to the nearest $1000.**

In [12]:
(world.assign(pcgdp=round(world['gdp']/(1000*world['population']), 0)*1000)
      # "at least one trillion" includes exactly 1e12, hence >= rather than >
      .loc[world['gdp']>=1e12, ['name', 'pcgdp']]
)

Unnamed: 0,name,pcgdp
8,Australia,66000.0
23,Brazil,11000.0
30,Canada,45000.0
35,China,6000.0
59,France,40000.0
63,Germany,42000.0
75,India,2000.0
81,Italy,33000.0
83,Japan,47000.0
109,Mexico,10000.0

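Rounding to the nearest 1000 can be done in two equivalent ways, sketched here in plain Python on per-capita GDP values copied from the section 3 output: scale down, round, and scale back up (as the cell above does), or pass a negative number of digits to `round`.

```python
# Per-capita GDP values copied from the section 3 output.
pcgdp = {
    "Brazil": 11115.264751,
    "China": 6121.710599,
    "India": 1504.793124,
    "United States": 51032.294546,
}

# Scale down, round to a whole number, scale back up.
scaled = {name: round(value / 1000) * 1000 for name, value in pcgdp.items()}

# A negative ndigits rounds to the left of the decimal point.
direct = {name: round(value, -3) for name, value in pcgdp.items()}

print(scaled)
print(direct)
```

Both dictionaries hold the same numbers; the first contains integers and the second floats.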

## 11. Name and capital have the same length

Greece has capital Athens.

Each of the strings 'Greece', and 'Athens' has 6 characters.

**Show the name and capital where the name and the capital have the same number of characters.**

- You can use the [LENGTH](https://sqlzoo.net/wiki/LENGTH) function to find the number of characters in a string

In [13]:
world.loc[world['name'].str.len()==world['capital'].str.len(),
          ['name', 'capital']]

Unnamed: 0,name,capital
2,Algeria,Algiers
4,Angola,Luanda
7,Armenia,Yerevan
22,Botswana,Gaborone
30,Canada,Ottowa
47,Djibouti,Djibouti
51,Egypt,Cairo
55,Estonia,Tallinn
57,Fiji,Suva
61,Gambia,Banjul

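A minimal sketch of the length comparison on a few hand-made rows follows. 'Nowhere' is a hypothetical entry with a missing capital, included to show that a missing value never compares equal and so drops out of the result.

```python
import pandas as pd

# 'Nowhere' is a made-up row with a missing capital.
pairs = pd.DataFrame({
    "name": ["Greece", "Sweden", "Fiji", "Nowhere"],
    "capital": ["Athens", "Stockholm", "Suva", None],
})

# str.len() is the pandas counterpart of SQL LENGTH.
same_length = pairs.loc[
    pairs["name"].str.len() == pairs["capital"].str.len(),
    ["name", "capital"],
]
print(same_length)
```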

## 12. Matching name and capital

The capital of Sweden is Stockholm. Both words start with the letter 'S'.

**Show the name and the capital where the first letters of each match. Don't include countries where the name and the capital are the same word.**

- You can use the function [LEFT](https://sqlzoo.net/wiki/LEFT) to isolate the first character.
- You can use <> as the **NOT EQUALS** operator.

In [14]:
world.loc[(world['name'].str.slice(0, 1)==world['capital'].str.slice(0, 1)) &
          (world['name']!=world['capital']),
          ['name', 'capital']]

Unnamed: 0,name,capital
2,Algeria,Algiers
3,Andorra,Andorra la Vella
14,Barbados,Bridgetown
17,Belize,Belmopan
23,Brazil,Brasília
24,Brunei,Bandar Seri Begawan
27,Burundi,Bujumbura
67,Guatemala,Guatemala City
70,Guyana,Georgetown

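The two hints (`LEFT` and `<>`) can be sketched in SQL as follows. This assumes a throwaway in-memory SQLite table rather than the database used in this notebook, and because SQLite has no `LEFT` function, `substr(name, 1, 1)` is swapped in to take the first character.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.execute("CREATE TABLE world (name TEXT, capital TEXT)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?)",
    [
        ("Sweden", "Stockholm"),
        ("Djibouti", "Djibouti"),
        ("Greece", "Athens"),
        ("Algeria", "Algiers"),
    ],
)

query = """
    SELECT name, capital
    FROM world
    WHERE substr(name, 1, 1) = substr(capital, 1, 1)  -- same first letter
      AND name <> capital                            -- but not the same word
    ORDER BY name
"""
matches = conn.execute(query).fetchall()
conn.close()
print(matches)
```

Djibouti passes the first-letter test but is removed by the `<>` condition, and Greece fails the first-letter test.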

## 13. All the vowels

**Equatorial Guinea** and **Dominican Republic** have all of the vowels (a e i o u) in the name. They don't count because they have more than one word in the name.

**Find the country that has all the vowels and no spaces in its name.**

- You can use the phrase `name NOT LIKE '%a%'` to exclude characters from your results.
- The query shown misses countries like Bahamas and Belarus because they contain at least one 'a'.

In [15]:
world.loc[world['name'].str.contains('[Aa]') &
          world['name'].str.contains('[Ee]') &
          world['name'].str.contains('[Ii]') &
          world['name'].str.contains('[Oo]') &
          world['name'].str.contains('[Uu]') &
          world['name'].str.match(r'^\S+$'),
         ['name']]

Unnamed: 0,name
116,Mozambique
